Performance Analysis of Trust Region Subproblem Solvers for Limited-Memory Distributed BFGS Optimization Method

The limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization method performs very efficiently for large-scale problems. A trust region search method generally performs more efficiently and robustly than a line search method, especially when the gradient of the objective function cannot be accurately evaluated. The computational cost of an L-BFGS trust region subproblem (TRS) solver depends mainly on the number of unknown variables (n) and the number of variable shift vectors and gradient change vectors (m) used for Hessian updating, with m << n for large-scale problems. In this paper, we analyze the performance of different methods to solve the L-BFGS TRS. The first method is the direct method using the Newton-Raphson (DNR) method and Cholesky factorization of a dense n × n matrix, the second is the direct method based on inverse quadratic (DIQ) interpolation, and the third is a new method that combines the matrix inversion lemma (MIL) with an approach that updates the associated matrices and vectors iteratively. The MIL approach is applied to reduce the dimension of the original problem with n variables to a new problem with m variables. Instead of directly using expensive matrix-matrix and matrix-vector multiplications to solve the L-BFGS TRS, a more efficient approach is employed to update matrices and vectors iteratively. The L-BFGS TRS solver using the MIL method performs more efficiently than those using the DNR or DIQ method. Testing on a representative suite of problems indicates that the new method converges to optimal solutions comparable to those obtained using the DNR or DIQ method. Its computational cost represents only a modest overhead over the well-known L-BFGS line-search method but delivers improved stability in the presence of inaccurate gradients. Compared to the solver using the DNR or DIQ method, the new TRS solver reduces the computational cost by a factor proportional to n^2/m for large-scale problems.


INTRODUCTION
Decision-making tools based on optimization procedures have been successfully applied to solve practical problems in a wide range of areas. An optimization problem is generally defined as minimizing (or maximizing) an objective function f(x) within a user-defined search domain x ∈ Ω, subject to some linear or nonlinear constraints, where x is an n-dimensional vector that contains all controllable variables.
In the oil and gas industry, an optimal business development plan requires robust production optimization because of the considerable uncertainty of subsurface reservoir properties and volatile oil prices. Many papers have been published on the topic of robust optimization and its applications, e.g., to cyclic CO2 flooding through the Gas-Assisted Gravity Drainage process [1], well placement optimization in geologically complex reservoirs [2], and optimal production schedules of smart wells for water flooding [3] and in naturally fractured reservoirs [4], to mention just a few examples.
Some researchers formulated the optimization problem under uncertainty as a single-objective optimization problem, e.g., only maximizing the mean of net present value (NPV) for simplicity, neglecting the associated risk. However, it is recommended to formulate robust optimization as a bi-objective optimization problem for consistency and completeness. A bi-objective optimization problem is generally defined as minimizing (or maximizing) two different objective functions, f_1(x) and f_2(x), within a user-defined search domain x ∈ Ω, subject to some linear or nonlinear constraints. For example, we may maximize the mean value, denoted by f_1(x), and minimize the standard deviation, denoted by f_2(x), of NPV. For a bi-objective optimization problem, the optimal solutions are defined as the Pareto optimal solutions (or Pareto front). It is a very challenging task to find multiple optimal solutions on the Pareto front [5,6].
It is also a very challenging task to properly characterize the uncertainty of reservoir properties (e.g., porosity and permeability in each grid-block) and reliably quantify the ensuing uncertainty of production forecasts (e.g., production rates of oil, gas, and water phases) by conditioning to historical production data and 4D seismic data [7,8], which requires generating multiple conditional realizations by minimizing a properly defined objective function, e.g., using the randomized maximum likelihood (RML) method [9].
When an adjoint-based gradient of the objective function is available [10][11][12], we may apply a gradient-based optimization method, e.g., the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization method [13,14]. However, the adjoint-based gradient is not available for many commercial simulators, nor is it available for integer-type or discrete variables (e.g., well locations defined by grid-block indices). In such cases, we must apply a derivative-free optimization (DFO) method [15][16][17][18]. For problems with smooth objective functions and continuous variables, model-based DFO methods using either radial-basis functions [19] or quadratic models [16,20,21] perform more efficiently than other DFO methods such as direct pattern search methods [22] and stochastic search methods [23].
Traditional optimization methods only locate a single optimal solution, and they are referred to as single-thread optimization methods in this paper. It is unacceptably expensive to use a single-thread optimization method to locate multiple optimal solutions on the Pareto front or generate multiple RML samples [24]. To overcome the limitations of single-thread optimization methods, Gao et al. [17] developed a local-search distributed Gauss-Newton (DGN) DFO method to find multiple best matches concurrently. Later, Chen et al. [25] modified the local-search DGN optimization method and generalized the DGN optimizer for global search. They also integrated the global-search DGN optimization method with the RML method to generate multiple RML samples in parallel.
Because multiple search points are generated in each iteration, multiple optimal solutions can eventually be found in parallel. These distributed optimization methods are referred to as multiple-thread optimization methods. Both the local- and global-search DGN optimization methods are only applicable to a specific class of optimization problems, i.e., history matching or least-squares optimization problems, in which the objective function can be expressed in the least-squares form f(x) = (1/2)αx^T x + (1/2)y^T(x)y(x); they cannot be applied to other, generic optimization problems whose objective function cannot be expressed in least-squares form. Furthermore, the DGN optimization methods may become less efficient for history matching problems when both the number of variables (n) and the number of observed data (m) are large (e.g., on the order of thousands or more).
Recently, Gao et al. [18] developed a well-parallelized distributed quasi-Newton (DQN) DFO method for generic optimization problems by introducing a generalized form of the objective function, F(x, y) = f(x). Here, x represents the vector of controllable variables or model parameters to be optimized (explicit variables) and y(x) denotes the vector of simulated responses (implicit variables) of a reservoir using explicit variables x. Using the generalized form of the objective function, the gradient of the objective function can be evaluated analytically by
∇f(x) = ∇_x F(x, y) + J^T(x) ∇_y F(x, y). (1)
In Eq. 1, J^T(x) = ∇_x y^T(x) is the transpose of the sensitivity matrix. Using the generalized expression of the objective function F(x, y), different objective functions can be evaluated using the same y(x) that is simulated from the same reservoir model, e.g., for multi-objective optimization problems. A specific case of the generalized expression is F(x, y) = y = f(x), for which the transpose of the sensitivity matrix becomes the gradient of the objective function.
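As a concrete illustration of Eq. 1, the following sketch evaluates the gradient of a toy generalized objective and checks it against finite differences. The functions y(x) and F(x, y) here are hypothetical examples chosen for illustration only, not taken from the paper:

```python
import numpy as np

# Hypothetical toy example of the generalized objective:
# y(x) = (x1*x2, x1 + x2^2), F(x, y) = 0.5*||x||^2 + y1 + y2^2 (not from the paper).
def y_of_x(x):
    return np.array([x[0] * x[1], x[0] + x[1]**2])

def F(x, y):
    return 0.5 * (x @ x) + y[0] + y[1]**2

def grad_f(x):
    y = y_of_x(x)
    grad_x_F = x.copy()                           # partial derivatives of F w.r.t. x
    grad_y_F = np.array([1.0, 2.0 * y[1]])        # partial derivatives of F w.r.t. y
    JT = np.array([[x[1], 1.0],                   # J^T = grad_x y^T, an n x m matrix
                   [x[0], 2.0 * x[1]]])
    return grad_x_F + JT @ grad_y_F               # Eq. 1

# finite-difference check of the analytically evaluated gradient
x0 = np.array([0.7, -1.3])
eps = 1e-6
fd = np.array([(F(x0 + eps * e, y_of_x(x0 + eps * e))
                - F(x0 - eps * e, y_of_x(x0 - eps * e))) / (2 * eps)
               for e in np.eye(2)])
assert np.allclose(grad_f(x0), fd, atol=1e-5)
```

The same pattern applies whenever y(x) comes from a simulator: only the simulated responses and the two partial-derivative terms are needed to assemble the gradient.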
The DQN optimization algorithm runs N_e optimization threads in parallel. In the k-th iteration, there are N_LC^(k) ≤ N_e non-converged threads. The DQN optimizer generates a new search point x_T^(i,k) = x^(i,k) + s^(i,k) for each non-converged thread, where the search step s^(i,k) is solved from a trust region subproblem (TRS). The N_LC^(k) simulation cases are submitted to a high-performance computing (HPC) cluster in parallel to generate the corresponding simulated responses y_T^(i,k). Then, we evaluate the objective function f_T^(i,k) = F(x_T^(i,k), y_T^(i,k)) and its associated partial derivatives ∇_x F^(i,k) and ∇_y F^(i,k). If the new search point x_T^(i,k) improves the objective function, i.e., f_T^(i,k) < f^(i,k), it is accepted as the new best solution x^(i,k+1); otherwise, we set x^(i,k+1) = x^(i,k). All simulation results generated by all DQN optimization threads in previous and current iterations are recorded in a training data set, and they are shared among all DQN optimization threads. The sensitivity matrix J^(i,k+1) = J(x^(i,k+1)) is approximated using a modified QR-method proposed by Gao et al. [18] by linear interpolation of the training data points that are closest to x^(i,k+1). Then, we approximate the gradient g^(i,k+1) = g(x^(i,k+1)) using Eq. 1. Finally, the Hessian H^(i,k+1) for each thread is updated using either the BFGS method or the symmetric rank-1 (SR1) method [14]. For simplicity, we will drop the superscript "i" (the optimization thread index) in the following discussions.
Both DGN and DQN optimization methods approximate the objective function by a quadratic model, and they are designed for problems with smooth objective function and continuous variables. Although their convergence is not guaranteed for problems with integer type variables (e.g., well location optimization), if those integers can be treated as truncated continuous variables, our numerical tests indicate that these distributed optimization methods can improve the objective function significantly and locate multiple suboptimal solutions for problems with integer type variables in only a few iterations.
The number of variables for real-world problems may vary from a few to thousands or even more, depending on the problem and the parameterization techniques employed. For example, the number of variables could be on the order of millions if we do not apply any parameter reduction techniques to reduce the dimension of some history matching problems (e.g., to tune permeability and porosity in each grid block). Because reservoir properties are generally correlated with each other with long correlation lengths, we may reduce the number of variables to be tuned to only a few hundred, e.g., using principal component analysis (PCA) or other parameter reduction techniques [26].
In this paper, our focus is on performance analysis of different methods to solve the TRS formulated with the limited-memory BFGS (L-BFGS) Hessian updating method for unconstrained optimization problems. In the future, we will further integrate the new TRS solver with our newly proposed limited-memory distributed BFGS (L-DBFGS) optimization algorithm and then apply the new optimizer to some realistic oil/gas field optimization cases and benchmark its overall performance against other distributed optimization methods such as those only using the gradient [27]. We will also continue our investigation in the future for constrained optimization problems (e.g., variables with lower- and/or upper-bounds and problems with nonlinear constraints). Gao et al. [18] applied the popular Newton-Raphson method [28] to directly solve the TRS using matrix factorization, the DNR method in short. The DNR method is quite expensive when it is applied to solve the N_e TRSs of a distributed optimization method, especially for large-scale problems with thousands or more variables [29]. This paper follows the ideas and concepts presented in the book "Matrix Computations" [30]. Flops are used to quantify the volume of work associated with a computation, a count of floating-point operations of add, subtract, multiply, or divide. The computational cost (flops) of some commonly used algebraic operations and numerical methods is summarized in Table 1 for reference. For completeness, we discuss the compact representation of the L-BFGS Hessian updating formulation [31] and the algorithm to directly update the Hessian in the next section. In the third section, we present three different methods to solve the L-BFGS TRS: the DNR method, the direct method using the inverse quadratic (DIQ) interpolation approach proposed by Gao et al. [32], and the technique using the matrix inversion lemma (MIL) together with an efficient matrix updating algorithm.
Some numerical tests and performance comparisons are discussed in the fourth section. We finally draw some conclusions in the last section.

THE LIMITED-MEMORY HESSIAN UPDATING FORMULATION
Let x^(k) be the best solution obtained in the current (k-th) iteration, and f^(k), g^(k), and H^(k), respectively, the objective function, its gradient, and its Hessian evaluated at x^(k). With the variable shift vector s^(k) = x^(k+1) − x^(k) and the gradient change vector z^(k) = g^(k+1) − g^(k), the Hessian H^(k+1) is updated using the BFGS formulation as follows,
H^(k+1) = H^(k) − [H^(k) s^(k) [s^(k)]^T H^(k)] / [[s^(k)]^T H^(k) s^(k)] + [z^(k) [z^(k)]^T] / [[z^(k)]^T s^(k)]. (2)
The update is applied only when the curvature condition ε^(k) = [z^(k)]^T s^(k) > ε_cr is satisfied, where ε_cr > 0 is a small positive number.
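Two well-known properties of this update can be checked numerically: the secant condition H^(k+1) s^(k) = z^(k) and preservation of positive definiteness when the curvature condition holds. The sketch below uses synthetic vectors (z is built from an SPD matrix so that z^T s > 0 is guaranteed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
H = 2.0 * np.eye(n)                       # current Hessian approximation H^(k)
s = rng.standard_normal(n)                # variable shift vector s^(k)
C0 = rng.standard_normal((n, n))
z = (C0 @ C0.T / n + np.eye(n)) @ s       # gradient change z^(k); z = M s, M SPD

# one BFGS update of the Hessian approximation
Hs = H @ s
H_new = H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(z, z) / (z @ s)

assert z @ s > 0                               # curvature condition eps^(k) > 0
assert np.allclose(H_new @ s, z)               # secant condition H^(k+1) s = z
assert np.all(np.linalg.eigvalsh(H_new) > 0)   # positive definiteness preserved
```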

Compact Representation
To save memory usage and computational cost, the L-BFGS Hessian updating method limits the maximum history size, or the maximum number of pairs (s^(k), z^(k)) used for BFGS Hessian updating, to L_M > 1. The recommended value of L_M ranges from 5 to 20, using a smaller number for problems with more variables. Let 1 ≤ l_k ≤ L_M denote the number of pairs of variable shift vectors and gradient change vectors, (s^(j), z^(j)) for j = 1, 2, ..., l_k, used to update the Hessian using the L-BFGS method in the k-th iteration. Let S^(k) = [s^(1), s^(2), ..., s^(l_k)] be the n × l_k variable shift matrix and Z^(k) = [z^(1), z^(2), ..., z^(l_k)] the n × l_k gradient change matrix. Both matrices S^(k) and Z^(k) are updated iteration by iteration. Let m = 2l_k and V^(k) = [S^(k), Z^(k)], an n × m matrix. Let A^(k) = [S^(k)]^T S^(k) and B^(k) = [S^(k)]^T Z^(k). We decompose B^(k) into three parts: the strictly lower triangular part L^(k), the strictly upper triangular part U^(k), and the diagonal part E^(k) that contains the main diagonal of B^(k), i.e.,
B^(k) = L^(k) + E^(k) + U^(k). (3)
The Hessian can be updated using the compact form of the L-BFGS Hessian updating formulation [31,33-35] as follows,
H^(k+1) = α^(k) I_n − V^(k) W^(k) [V^(k)]^T. (4)
In Eq. 4, I_n is an n × n identity matrix and W^(k) is an m × m symmetric matrix defined as,
W^(k) = [ A^(k)/α^(k)  L^(k)/α^(k) ; [L^(k)]^T/α^(k)  −E^(k) ]^(−1). (5)
In Eq. 4 and Eq. 5, α^(k) = [z^(k)]^T z^(k) / [z^(k)]^T s^(k) is a scaling factor. It is required that α^(k) > α_cr > 0 when using Eq. 4 and Eq. 5 to update the Hessian, where α_cr is a small positive number.
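For a fixed scaling α, the compact form reproduces the matrix obtained by applying the recursive BFGS update l_k times starting from H = αI_n (the compact representation of Byrd et al. [31]). The sketch below checks this equivalence numerically with synthetic curvature pairs; note that it deliberately holds α constant, whereas the L-BFGS method rescales α^(k) every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lk = 8, 3
alpha = 2.0

# synthetic curvature pairs with s_j^T z_j > 0 (z_j = M s_j, M SPD)
S = rng.standard_normal((n, lk))
C0 = rng.standard_normal((n, n))
Z = (C0 @ C0.T / n + np.eye(n)) @ S

# reference: lk recursive BFGS updates starting from H = alpha * I
H = alpha * np.eye(n)
for j in range(lk):
    s, z = S[:, j], Z[:, j]
    Hs = H @ s
    H = H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(z, z) / (z @ s)

# compact representation: H = alpha*I_n - V W V^T with W from the block matrix
A = S.T @ S
B = S.T @ Z
Lm = np.tril(B, -1)                     # strictly lower triangular part of B
E = np.diag(np.diag(B))                 # diagonal part of B
Winv = np.block([[A / alpha, Lm / alpha], [Lm.T / alpha, -E]])
V = np.hstack([S, Z])
H_compact = alpha * np.eye(n) - V @ np.linalg.solve(Winv, V.T)

assert np.allclose(H, H_compact)
```

Only the small m x m matrix [W]^(-1) is ever inverted; the n x n Hessian is never formed in the MIL solver and is built here only to verify the identity.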

The Algorithm to Directly Update the Hessian Using the L-BFGS Compact Representation
Given the pair (s, z) = (s^(k), z^(k)) and the two n × l_{k−1} matrices S^(k−1) and Z^(k−1), we can update S^(k) and Z^(k) and directly compute the Hessian H^(k+1) using the compact representation of the L-BFGS Hessian updating formulation, Eq. 4, as summarized in Algorithm-1.
Algorithm-1: Updating the Hessian directly using the L-BFGS compact representation
1. Compute ||s||, ||z||, ε^(k) = z^T s, and α^(k);
2. If ε^(k) ≤ c_3 ||z|| ||s|| or α^(k) ≤ α_cr, set l_k = l_{k−1}, S^(k) = S^(k−1), Z^(k) = Z^(k−1), and go to step 4;
3. Otherwise,
a. If l_{k−1} = L_M, set l_k = L_M and remove the first column from S^(k−1) and Z^(k−1); otherwise, set l_k = l_{k−1} + 1;
b. Form S^(k) and Z^(k) by appending s and z as the last columns;
4. Set m = 2l_k and V^(k) = [S^(k), Z^(k)];
5. Compute A^(k) = [S^(k)]^T S^(k) and B^(k) = [S^(k)]^T Z^(k);
6. Form the matrix [W^(k)]^−1 using Eq. 5;
7. Compute W^(k) by inverting [W^(k)]^−1;
8. Compute U^(k) = V^(k) W^(k);
9. Compute T^(k) = U^(k) [V^(k)]^T;
10. Update H^(k+1) = α^(k) I_n − T^(k);
Stop.
The computational cost to directly update the Hessian using the L-BFGS method as described in Algorithm-1 depends mainly on the number of variables (n) and the number of vectors used to update the Hessian (m = 2l_k). We should re-emphasize that m is updated iteratively but limited to m ≤ 2L_M. The computational cost is c_1 = 2mn^2 + (7 + 3m^2)n + O(m^3) flops, which includes the following seven operations: 1) computing ||s||, ||z||, and ε^(k) = z^T s in step 1 (6n flops); 2) computing A^(k) and B^(k) in step 5 (m^2 n flops); 3) forming the matrix [W^(k)]^−1 in step 6 (m^2 flops); 4) computing W^(k) in step 7 (O(m^3) flops); 5) computing U^(k) in step 8 (2m^2 n flops); 6) computing T^(k) in step 9 (2mn^2 flops); and 7) updating H^(k+1) in step 10 (n flops).

SOLVING THE L-BFGS TRUST REGION SUBPROBLEM The Trust Region Subproblem
In the neighborhood of the best solution x^(k), the objective function f(x) can be approximated by a quadratic function of s = x − x^(k),
q^(k)(s) = f^(k) + [g^(k)]^T s + (1/2) s^T H^(k) s. (6)
If the Hessian H^(k) is positive definite, then q^(k)(s) has a unique minimum s*^(k), which is the solution of
H^(k) s = −g^(k). (7)
When a line search strategy is applied, we accept the full search step s*^(k) by setting x^(k+1) = x^(k) + s*^(k) if it improves the objective function sufficiently (e.g., satisfying the two Wolfe conditions). Otherwise, we can find a better search step size 0 < γ^(k) ≤ 1 such that x^(k+1) = x^(k) + γ^(k) s*^(k) improves the objective function sufficiently. If the gradient g^(k) can be accurately evaluated, it has been theoretically proved that a line search strategy converges [14]. However, for most real-world optimization problems, the gradient cannot be accurately evaluated, and, as discussed by Gao et al. [16], a trust region search strategy performs more robustly and more efficiently than a line search strategy.
The trust region search step s*^(k) for the next iteration is the global minimum of the quadratic model q^(k)(s) within a ball-shaped trust region with radius Δ^(k) ≥ Δ_min > 0, which is solved from the following trust region subproblem (TRS) [28],
min_s q^(k)(s) subject to ||s||_2 ≤ Δ^(k). (8)
If the solution s*^(k) given by Eq. 7 satisfies ||s*^(k)||_2 ≤ Δ^(k), then s*^(k) is the solution of the TRS defined in Eq. 8. Otherwise, the solution of the TRS must lie on the boundary ||s||_2 = Δ^(k), and we should solve the Lagrange multiplier λ*^(k) together with the search step s*^(k) from the following nonlinear TRS equations,
[H^(k) + λ*^(k) I_n] s*^(k) = −g^(k), (9)
||s*^(k)||_2 = Δ^(k). (10)
For the trivial case with H^(k) = αI_n, we have s(λ) = −g^(k)/(α + λ) and π(λ) = ||s(λ)||^2 = ||g^(k)||^2/(α + λ)^2, so λ*^(k) = 0 if ||g^(k)|| ≤ αΔ^(k) and λ*^(k) = ||g^(k)||/Δ^(k) − α otherwise. We may apply the popular DNR method to solve the TRS when H^(k) is a dense matrix. However, the DNR method is quite expensive for large-scale problems, and we should seek and apply a more efficient method to solve the TRS.
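The closed-form multiplier for the trivial case is easy to verify: with H = αI_n and ||g||/α > Δ, setting λ* = ||g||/Δ − α places the step exactly on the trust region boundary. A minimal check (the numbers are arbitrary illustrative values):

```python
import numpy as np

alpha, delta = 2.0, 0.5
g = np.array([3.0, -1.0, 2.0])            # chosen so that ||g||/alpha > delta

lam = np.linalg.norm(g) / delta - alpha   # lambda* for the trivial case H = alpha*I
s = -g / (alpha + lam)                    # s(lambda*) = -g / (alpha + lambda*)

assert lam > 0
assert np.isclose(np.linalg.norm(s), delta)   # the step lands exactly on the boundary
```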

Solving the L-BFGS TRS Directly Using the Newton-Raphson Method
The DNR method solves the TRS defined in Eq. 8 using the Cholesky decomposition. It applies the Newton-Raphson method to directly solve the nonlinear TRS equation [28],
ϕ(λ) = 1/||s(λ)||_2 − 1/Δ = 0. (11)
Given the trust region size Δ = Δ^(k) > 0, the threshold of convergence δ_cr > 0, the maximum number of iterations allowed N_TRS,max > 0, and the Hessian H = H^(k) and gradient g = g^(k) evaluated at the current best solution x^(k), both the Lagrange multiplier λ* and the trust region search step s* = s(λ*) can be solved from the TRS defined in Eq. 8 using the DNR method, as summarized in Algorithm-2.
Algorithm-2: Solving the L-BFGS TRS Using the DNR Method
1. Initialize l = 0, λ_0 = 0;
2. Compute D = H + λ_0 I_n and its Cholesky decomposition D = LL^T;
3. Solve u* from Lu = −g, s* from L^T s = u*, and v* from Lv = s*;
4. Compute ||s*|| and ||v*||;
5. If ||s*|| ≤ Δ, then accept λ* = λ_0 and go to step 8;
6. Otherwise, set l = 0;
7. Repeat until |(||s*|| − Δ)| ≤ δ_cr Δ or l ≥ N_TRS,max:
a. Update λ_{l+1} = λ_l + (||s*||/||v*||)^2 (||s*|| − Δ)/Δ and set l = l + 1;
b. Compute D = H + λ_{l+1} I_n and its Cholesky decomposition D = LL^T;
c. Solve u* from Lu = −g, s* from L^T s = u*, and v* from Lv = s*;
d. Compute ||s*|| and ||v*||;
8. Accept λ* = λ_l and s* = s(λ*), and stop.
Let N_DNR denote the number of iterations required for the DNR TRS solver to converge. The total computational cost (flops) to solve the TRS using Algorithm-2 is
c_2 = (5n + 3n^2 + n^3/3)(N_DNR + 1),
including the following three operations: (1) computing D = H + λ I_n and its Cholesky decomposition D = LL^T in step 2 and step 7(b) (n + n^3/3 flops); (2) solving u* from Lu = −g, s* from L^T s = u*, and v* from Lv = s* in step 3 and step 7(c) (3n^2 flops); (3) computing ||s*|| and ||v*|| in step 4 and step 7(d) (4n flops).
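A minimal dense sketch of Algorithm-2 follows. The safeguards of the full algorithm (e.g., the bisection fallback discussed below) are omitted, and the tolerance and iteration limit are illustrative defaults, not the paper's settings:

```python
import numpy as np

def dnr_trs(H, g, delta, tol=1e-10, max_iter=50):
    """Sketch of Algorithm-2: Newton-Raphson on phi(lam) = 1/||s(lam)|| - 1/Delta,
    with s(lam) obtained from a Cholesky solve of (H + lam*I) s = -g."""
    n = len(g)
    lam = 0.0
    for _ in range(max_iter + 1):
        L = np.linalg.cholesky(H + lam * np.eye(n))   # D = L L^T
        u = np.linalg.solve(L, -g)                    # L u = -g
        s = np.linalg.solve(L.T, u)                   # L^T s = u, so (H + lam*I) s = -g
        v = np.linalg.solve(L, s)                     # L v = s, so ||v||^2 = s^T D^{-1} s
        ns = np.linalg.norm(s)
        if lam == 0.0 and ns <= delta:                # interior solution, lambda* = 0
            return lam, s
        if abs(ns - delta) <= tol * delta:            # converged on the boundary
            return lam, s
        # Newton-Raphson update for the secular equation (step 7a)
        lam = max(lam + (ns / np.linalg.norm(v))**2 * (ns - delta) / delta, 0.0)
    return lam, s

# usage on a small dense problem whose unconstrained step exceeds Delta
H = np.diag([1.0, 2.0, 3.0])
g = np.ones(3)
lam, s = dnr_trs(H, g, delta=0.5)
assert lam > 0 and abs(np.linalg.norm(s) - 0.5) < 1e-6
```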
The total computational cost (flops) to solve the L-BFGS TRS using the DNR method (Algorithm-2) together with the L-BFGS Hessian updating method (Algorithm-1) is
c_DNR = c_1 + c_2 = 2mn^2 + (7 + 3m^2)n + O(m^3) + (5n + 3n^2 + n^3/3)(N_DNR + 1).
Our numerical results indicate that the DNR TRS solver may fail when tested on the well-known Rosenbrock function, especially for problems with large n. The root cause of the convergence failure of the DNR method is the same as in the GNTRS solver using the traditional Newton-Raphson method, as discussed by Gao et al. [36]: a very small value of ϕ′(λ) may result in a very large search step and thus in failure to converge. Gao et al. [36] proposed integrating the DNR method with a bisection line search to overcome this issue.

The Inverse Quadratic Model Interpolation Method to Directly Solve the L-BFGS TRS
Instead of applying the Newton-Raphson method, which requires evaluating ϕ′(λ), Gao et al. [32] proposed a method to directly solve the TRS using inverse quadratic model interpolation (called the DIQ method), i.e., approximating π(λ) = [s(λ)]^T s(λ) by the following inverse quadratic function,
q_INV(λ) = a/(λ + b)^2. (12)
Both coefficients a and b in Eq. 12 can be determined by interpolating the values of π(λ) evaluated at two different points, π_1 = π(λ_1) and π_2 = π(λ_2), with 0 ≤ λ_1 < λ_2.
In the first iteration, we set λ_1 = λ_min = 0 and accept λ_min = 0 as the solution if π_1 ≤ Δ^2. Otherwise, π_1 > Δ^2 holds and we set λ_2 = ||g^(k)||/Δ − α; π_2 < Δ^2 holds if λ_2 is not accepted as the solution. In the following iterations, either λ_1 or λ_2 is updated accordingly such that the two conditions π_1 > Δ^2 and π_2 < Δ^2 always hold.
Let ρ = √π_1/(√π_1 − √π_2) ≥ 1; the following solution of q_INV(λ) = Δ^2 is used as the trial search point for the next iteration,
λ* = λ_1 + ρ(λ_2 − λ_1)(√π_1 − Δ)√π_2/(Δ√π_1). (13)
We either update λ_1 = λ* if π(λ*) > Δ^2 or update λ_2 = λ* if π(λ*) < Δ^2, iteratively until convergence. We accept λ* as the desired solution and terminate the iterative process either when |√π(λ*) − Δ| ≤ δ_cr Δ or when the number of iterations reaches N_TRS,max. Given the scaling factor α > 0, the trust region size Δ > 0, the threshold of convergence δ_cr > 0, the maximum number of iterations allowed N_TRS,max > 0, the Hessian H, and the gradient g, both the Lagrange multiplier λ* and the trust region search step s* = s(λ*) can be solved from the TRS defined in Eq. 8 using the DIQ method, as summarized in Algorithm-3.
Let N_DIQ denote the number of iterations required for the DIQ TRS solver to converge. The total computational cost to solve the L-BFGS TRS using Algorithm-3 together with Algorithm-1 is
c_DIQ = 2mn^2 + (7 + 3m^2)n + (5n + 2n^2 + n^3/3)(N_DIQ + 1).
Generally, the DIQ method converges faster than the DNR method, and it performs more efficiently and robustly than the DNR method, especially for large-scale problems. However, both the DNR method and the DIQ method become quite expensive for large-scale problems.
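The DIQ iteration can be sketched as follows. The bracket-expansion safeguard is our own simplification, π(λ) is evaluated by a dense solve purely for illustration, and the tolerance and iteration limit are illustrative rather than the paper's settings:

```python
import numpy as np

def diq_trs(H, g, delta, alpha, tol=1e-8, max_iter=50):
    """Sketch of the DIQ idea: keep a bracket pi(l1) > Delta^2 > pi(l2) and
    iterate the inverse-quadratic trial point (Eq. 13) until ||s(lam)|| ~ Delta."""
    n = len(g)

    def step(lam):
        s = np.linalg.solve(H + lam * np.eye(n), -g)  # dense solve, illustration only
        return s, s @ s                               # s(lam) and pi(lam)

    l1 = 0.0
    s, p1 = step(l1)
    if p1 <= delta**2:                                # interior solution
        return l1, s
    l2 = max(np.linalg.norm(g) / delta - alpha, 1e-8) # first trial for lambda_2
    s, p2 = step(l2)
    while p2 >= delta**2:                             # safeguard: expand the bracket
        l2 = 2.0 * l2 + 1.0
        s, p2 = step(l2)
    for _ in range(max_iter):
        r1, r2 = np.sqrt(p1), np.sqrt(p2)
        lam = l1 + (l2 - l1) * r2 * (r1 - delta) / (delta * (r1 - r2))  # Eq. 13
        s, p = step(lam)
        if abs(np.sqrt(p) - delta) <= tol * delta:
            return lam, s
        if p > delta**2:
            l1, p1 = lam, p
        else:
            l2, p2 = lam, p
    return lam, s

# usage on the same small dense problem as before
H = np.diag([1.0, 2.0, 3.0])
g = np.ones(3)
lam, s = diq_trs(H, g, delta=0.5, alpha=1.0)
assert abs(np.linalg.norm(s) - 0.5) < 1e-6
```

Note that no derivative ϕ′(λ) is needed: each iteration requires only one evaluation of π(λ), which is why DIQ avoids the very-small-ϕ′ failure mode of the DNR method.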

Using Matrix Inversion Lemma (MIL) to Solve the L-BFGS TRS
To save both memory usage and computational cost, Gao et al. [32,36,37] proposed an efficient algorithm to solve the Gauss-Newton TRS (GNTRS) for large-scale history matching problems using the matrix inversion lemma (or the Woodbury matrix identity). With appropriate normalization of both parameters and residuals, the Hessian of the objective function for a history matching problem can be approximated by the well-known Gauss-Newton equation as follows,
H^(k+1) = α I_n + V^(k) [V^(k)]^T. (15)
In Eq. 15, V^(k) = [J^(k+1)]^T is an n × m matrix, the transpose of the sensitivity matrix J^(k+1) that is evaluated at the current best solution x^(k+1). Here, m denotes the number of observed data to be matched, and it is not the same m as used in the L-BFGS Hessian updating formulation, Eq. 4. The GNTRS solver proposed by Gao et al. [32,36,37] using the matrix inversion lemma (MIL) has been implemented and integrated with the distributed Gauss-Newton (DGN) optimizer. Because the compact representation of the L-BFGS Hessian updating formula, Eq. 4, is similar to Eq. 15, we follow the idea proposed by Gao et al. [32,36,37] to compute [H^(k+1) + λI_n]^−1 by applying the matrix inversion lemma and then solve for the L-BFGS trust region search step s(λ).
Let P^(k) = [V^(k)]^T V^(k) and u^(k) = [V^(k)]^T g^(k). We first solve v(λ) from
[P^(k) − (α + λ)[W^(k)]^−1] v(λ) = u^(k). (16)
Then, we compute the trust region search step s(λ) and π(λ) = ||s(λ)||^2 as
s(λ) = [V^(k) v(λ) − g^(k)]/(α + λ), (17)
π(λ) = { ||g^(k)||^2 − 2[u^(k)]^T v(λ) + [v(λ)]^T P^(k) v(λ) }/(α + λ)^2. (18)
We first try λ = λ_min = 0. If π(0) ≤ Δ^2 holds, we accept λ* = 0 and s* = s(0) as the solution of the TRS defined in Eq. 8. Otherwise, the solution is on the boundary defined in Eq. 10, which can be solved by the Newton-Raphson (NR) method iteratively, using ϕ(λ) = 1/√π(λ) − 1/Δ. Evaluating ϕ′(λ) requires w(λ) = v′(λ), which is solved from
[P^(k) − (α + λ)[W^(k)]^−1] w(λ) = [W^(k)]^−1 v(λ). (22)
Given the n × l_k variable shift matrix S^(k) and gradient change matrix Z^(k), we may directly compute the three l_k × l_k matrices A^(k) = [S^(k)]^T S^(k), B^(k) = [S^(k)]^T Z^(k), and C^(k) = [Z^(k)]^T Z^(k), and then form the m × m matrix P^(k) = [V^(k)]^T V^(k) as
P^(k) = [ A^(k)  B^(k) ; [B^(k)]^T  C^(k) ]. (24)
Directly computing the three l_k × l_k matrices A^(k), B^(k), and C^(k) requires 6 l_k^2 n = 1.5 m^2 n flops. We also need to compute the vector u^(k) = [V^(k)]^T g^(k) (2mn flops) and ||g^(k)||^2 (2n flops). To further reduce the computational cost, the three matrices A^(k), B^(k), and C^(k) and the vector u^(k) should be updated iteratively instead.
The L-BFGS Hessian updating formulation of Eq. 4 requires ε^(k) = z^T s > c_3 ||z|| ||s|| and α^(k) = z^T z / z^T s > α_cr > 0, where c_3 and α_cr are small positive numbers. If ε^(k) ≤ c_3 ||z|| ||s|| or α^(k) ≤ α_cr, we simply set l_k = l_{k−1}, S^(k) = S^(k−1), Z^(k) = Z^(k−1), and P^(k) = P^(k−1), with no updating. However, we still need to compute u^(k) = [S^(k), Z^(k)]^T g^(k+1) using the new gradient g^(k+1). In the following cases, we assume that both conditions ε^(k) > c_3 ||z|| ||s|| and α^(k) > α_cr are satisfied.

Case-1: Sflag^(k) = "False"
Although the best solution x^(k+1) does not change when Sflag^(k) = "False", the sensitivity matrix J^(k+1), and thus the gradient of the objective function g^(k+1) evaluated at x^(k+1), may differ when evaluated using the simulation results and training data points updated in the (k+1)-th (current) iteration rather than those obtained in the k-th (previous) iteration. To use the right gradient, we should replace the last column in Z^(k) with z.
We can also apply Eq. 33 to update u^(k), replacing the sub-vector composed of the second through the l_{k−1}-th rows of u^(k−1) and the sub-vector composed of the last l_{k−1} − 1 rows of u^(k−1) accordingly.

End
The computational cost to update the matrices and vectors used for solving the L-BFGS TRS as described in Algorithm-4 is at most c_4 = 2n + 4mn flops (taking Case-2 in step 5 as an example), including the following operations: (1) computing ε^(k) = z^T s in step 1 (and, when needed, μ^(k) = z^T z and β^(k) = s^T s, for 2n to 6n flops); (2) computing the matrix-vector products such as [S^(k−1)]^T z and [Z^(k−1)]^T s and the new entries p^(k) of P^(k) and u^(k) (about 4mn flops).
In this paper, we apply the same idea of reducing the dimension from n to m using the matrix inversion lemma as presented in the three papers [32,36,37]. However, the implementation is quite different. For the DGN optimization method, the n × m matrix V^(k) (the transpose of the sensitivity matrix) in Eq. 15 is directly computed. Its dimensions (both the number of variables n and the number of observed data m) are fixed, i.e., they remain the same for all iterations. Therefore, we have to directly compute the m × m matrix P^(k) = [V^(k)]^T V^(k) and the m-dimensional vector u^(k) = [V^(k)]^T g^(k), with a computational cost of 2m^2 n + 2mn flops. Using the algorithm summarized in Algorithm-4, the computational cost can be further reduced to 2n + 4mn flops, i.e., by a factor of l_k = m/2, roughly 5 to 20.
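The iterative updating idea can be illustrated for the case in which a new pair (s, z) is appended (l_k = l_{k−1} + 1): each block of P^(k) gains one row and one column computed with O(mn) matrix-vector products, instead of recomputing A^(k), B^(k), and C^(k) from scratch at O(m^2 n) cost. This is a simplified sketch of the idea, not the full Algorithm-4 with all of its cases:

```python
import numpy as np

rng = np.random.default_rng(2)
n, l0 = 30, 3
S = rng.standard_normal((n, l0))
Z = rng.standard_normal((n, l0))
A, B, C = S.T @ S, S.T @ Z, Z.T @ Z       # blocks of P = [A, B; B^T, C]

# append a new pair (s, z): each block gains one row and one column,
# computed with O(mn) matrix-vector products instead of O(m^2 n) work
s, z = rng.standard_normal(n), rng.standard_normal(n)
Ss, Sz, Zs, Zz = S.T @ s, S.T @ z, Z.T @ s, Z.T @ z
A = np.block([[A, Ss[:, None]], [Ss[None, :], np.array([[s @ s]])]])
B = np.block([[B, Sz[:, None]], [Zs[None, :], np.array([[s @ z]])]])
C = np.block([[C, Zz[:, None]], [Zz[None, :], np.array([[z @ z]])]])
S, Z = np.hstack([S, s[:, None]]), np.hstack([Z, z[:, None]])

# the incrementally updated blocks match a from-scratch recomputation
assert np.allclose(A, S.T @ S)
assert np.allclose(B, S.T @ Z)
assert np.allclose(C, Z.T @ Z)
```

The vector u^(k) is extended in the same fashion by appending the two new entries s^T g and z^T g.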

The Algorithm to Solve the L-BFGS TRS Using the Matrix Inversion Lemma
Given the trust region size Δ = Δ^(k) > 0, the threshold of convergence δ_cr > 0, the maximum number of iterations allowed N_TRS,max > 0, the scaling factor α > α_cr > 0, m ≥ 0, the n × m matrix V = [S^(k), Z^(k)], the m × m matrices P = P^(k) and W_I = [W^(k)]^−1, the m-dimensional vector u = u^(k), and the n-dimensional gradient vector g = g^(k) evaluated at the current best solution x^(k), both the Lagrange multiplier λ* and the trust region search step s* = s(λ*) can be solved from the TRS defined in Eq. 8 using the matrix inversion lemma (MIL), as summarized in Algorithm-5.
The total computational cost (flops) to solve the L-BFGS TRS using Algorithm-5 together with Algorithm-4, denoted c_MIL, is linear in n: it consists of the c_4 = 2n + 4mn flops of Algorithm-4 plus O(mn) + O(m^3) flops per TRS iteration. It is straightforward to compute the normalized computational cost, or the ratio of the computational cost of the DNR method (Algorithm-2) or the DIQ method (Algorithm-3) together with the algorithm that directly updates the Hessian (Algorithm-1) over that of the MIL TRS solver using the matrix inversion lemma (Algorithm-5) and the efficient matrix updating algorithm (Algorithm-4),
β = c_DIQ/c_MIL = [2mn^2 + (7 + 3m^2)n + (5n + 2n^2 + n^3/3)(N_DIQ + 1)]/c_MIL. (35)
Because n ≫ m for large-scale problems, Eq. 35 indicates that β grows roughly in proportion to n^2/m. For some problems the solution s* = s(0) is accepted with N_DIQ = N_MIL = 0. For other problems, it takes roughly N_DIQ ≈ N_MIL = 5 ∼ 15 iterations for a TRS solver to converge. Figures 1A and 1B plot β vs. n^2/m for different m, fixing N_DIQ = N_MIL = 0 and N_DIQ = N_MIL = 10, respectively. As shown in both Figure 1A and Figure 1B, the MIL TRS solver may reduce the computational cost by a factor of β = 10 ∼ 10^5 for n^2/m = 100 ∼ 10^6.

NUMERICAL VALIDATION
We benchmarked the MIL TRS solver against the DNR (or DIQ) TRS solver on two well-known analytic optimization problems using analytical gradients: the Rosenbrock function and the Sphere function, defined as
f_Rosenbrock(x) = Σ_{i=1}^{n−1} [100(x_{i+1} − x_i^2)^2 + (1 − x_i)^2], (37)
f_Sphere(x) = Σ_{i=1}^{n} x_i^2. (38)
The Rosenbrock function defined in Eq. 37 has one local minimum located at x_1 = −1 and x_i = 1 for i = 2, 3, ..., n, with an objective function value of approximately 4, and one global minimum located at x_i = 1 for i = 1, 2, ..., n, with an objective function value of 0. Occasionally, a local-search optimizer may converge to the local minimum. We implemented the different algorithms described in this paper in our in-house C++ optimization library (OptLib). We employ "Armadillo" as our foundational template-based C++ library for linear algebra (http://arma.sourceforge.net/). We ran all test cases reported in Tables 2-5 on a virtual machine equipped with an Intel Xeon Platinum processor at 2.60 GHz and 16.0 GB of RAM. We varied the number of parameters n as powers of two and fixed twice the maximum number of snapshots at M = 2L_M = 10. Tables 2-5 summarize the preliminary numerical results, including the value of the objective function, the gradient norm, the number of iterations, and the elapsed CPU time (s). The last column displays the resulting speedup, namely, the ratio of the CPU time using the DNR (or DIQ) method over that using the MIL method. We prescribe the tolerance to stop the L-BFGS TRS solver for all the numerical experiments in this subsection as δ_cr = 10^−4 and the maximum number of iterations as N_TRS,max = 16. We set the initial trust region size at Δ = 0.5. We noticed that the matrix D may become ill-conditioned for a few cases. Indeed, we could observe that the first entries on the diagonal are nearly zero. The same situation occurs for different optimization steps where m grows from 0, 2, 4, 6, up to 10. The Cholesky decomposition of such a matrix will fail, i.e., Armadillo throws an exception.
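For reference, the Rosenbrock function of Eq. 37 and its analytical gradient can be sketched as follows (our own illustrative implementation, not the OptLib code):

```python
import numpy as np

def rosenbrock(x):
    """Value and analytical gradient of the Rosenbrock function (Eq. 37)."""
    x = np.asarray(x, dtype=float)
    f = np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return f, g

f_glob, g_glob = rosenbrock(np.ones(8))        # global minimum: f = 0, g = 0
assert f_glob == 0.0 and np.allclose(g_glob, 0.0)
f_loc, _ = rosenbrock(np.concatenate(([-1.0], np.ones(7))))
assert np.isclose(f_loc, 4.0)                  # near the local minimum, f is about 4
```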
We advise replacing the Cholesky decomposition with an LU decomposition to handle pivots, namely, entries on the diagonal of the lower triangular matrix, that are nearly zero. The rest of the algorithm remains the same. In such cases, the DNR iteration may also stagnate, i.e., it renders very large λ values that imply tiny shifts. Often, after 2,048 iterations, the objective function is still greater than one. It seems that the DNR method is not robust enough to tackle the Rosenbrock function with more than 48 parameters. It appears that the MIL method slightly outperforms the DNR method for problems with fewer parameters. Table 3 compares the performance of the DIQ method vs. the MIL method. These numerical results confirm that the MIL method outperforms the DIQ method. As expected, the speedup improves as the problem size n increases. Table 4 shows similar results for the Sphere function problem. We could benchmark the DNR method against the MIL method with up to 2,048 parameters. Again, the speedup increases with problem size. Finally, Table 5 benchmarks the DIQ method against the MIL method on the Sphere function. The speedup is smaller than in Table 4 for the same problem sizes, but we still clearly see that the performance of the MIL solver exceeds that of its DIQ counterpart. Results shown in Table 4 and Table 5 also confirm that the DIQ method performs better than the DNR method. We believe that the DIQ TRS solver converges in a smaller number of iterations than the DNR TRS solver (i.e., N_DIQ < N_DNR).
The computational costs (in terms of CPU time) summarized in Tables 2-5 are the overall costs for all iterations, including the cost of computing both the value and the gradient of the objective function in addition to the cost of solving the L-BFGS TRS. In contrast, the theoretical analysis of the computational cost ratio β in Eq. 35 only accounts for the cost of solving the L-BFGS TRS in a single iteration. Therefore, the numerical results listed in Tables 2-5 are not expected to agree quantitatively with the theoretically derived β.

CONCLUSION
We can draw the following conclusions based on theoretical analysis and numerical tests: 1) Using the compact representation of the L-BFGS formulation in Eq. 4, the updated Hessian H^(k+1) is composed of a diagonal matrix α^(k) I_n and a matrix V^(k) W^(k) [V^(k)]^T of low rank when m < n. 2) Using the matrix inversion lemma, we are able to reduce the cost of solving the L-BFGS TRS by transforming the original equation with n variables to a new equation with m variables, taking advantage of the low-rank matrix updating feature. 3) A further reduction in computational cost is achieved by updating the vector u^(k) and the matrices S^(k), Z^(k), A^(k), B^(k), C^(k), P^(k), and [W^(k)]^−1 iteratively, which effectively avoids expensive operations such as matrix-matrix and matrix-vector multiplications. 4) Through seamless integration of the matrix inversion lemma with the iterative matrix and vector updating approach, the newly proposed TRS solver for the limited-memory distributed BFGS optimization method performs much more efficiently than the DNR TRS solver. 5) The MIL TRS solver obtains the right solution with acceptable accuracy, without increasing memory usage while reducing the computational cost significantly, as confirmed by numerical results on the Rosenbrock and Sphere functions.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.