Abstract
The limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization method performs very efficiently for large-scale problems. A trust region search method generally performs more efficiently and robustly than a line search method, especially when the gradient of the objective function cannot be accurately evaluated. The computational cost of an L-BFGS trust region subproblem (TRS) solver depend mainly on the number of unknown variables (n) and the number of variable shift vectors and gradient change vectors (m) used for Hessian updating, with m << n for large-scale problems. In this paper, we analyze the performances of different methods to solve the L-BFGS TRS. The first method is the direct method using the Newton-Raphson (DNR) method and Cholesky factorization of a dense n × n matrix, the second one is the direct method based on an inverse quadratic (DIQ) interpolation, and the third one is a new method that combines the matrix inversion lemma (MIL) with an approach to update associated matrices and vectors. The MIL approach is applied to reduce the dimension of the original problem with n variables to a new problem with m variables. Instead of directly using expensive matrix-matrix and matrix-vector multiplications to solve the L-BFGS TRS, a more efficient approach is employed to update matrices and vectors iteratively. The L-BFGS TRS solver using the MIL method performs more efficiently than using the DNR method or DIQ method. Testing on a representative suite of problems indicates that the new method can converge to optimal solutions comparable to those obtained using the DNR or DIQ method. Its computational cost represents only a modest overhead over the well-known L-BFGS line-search method but delivers improved stability in the presence of inaccurate gradients. When compared to the solver using the DNR or DIQ method, the new TRS solver can reduce computational cost by a factor proportional to for large-scale problems.
Introduction
Decision-making tools based on optimization procedures have been successfully applied to solve practical problems in a wide range of areas. An optimization problem is generally defined as minimizing (or maximizing) an objective function within a user defined search domain , and subject to some linear or nonlinear constraints, where x is an n-dimensional vector that contains all controllable variables.
In the oil and gas industry, an optimal business development plan requires robust production optimization because of the considerable uncertainty of subsurface reservoir properties and volatile oil prices. Many papers have been published on the topic of robust optimization and their applications to cyclic CO2 flooding through the Gas-Assisted Gravity Drainage process [], well placement optimization in geologically complex reservoirs [], optimal production schedules of smart wells for water flooding [] and in naturally fractured reservoirs [], just mentioning a few of them as examples.
Some researchers formulated the optimization problem under uncertainty as a single objective optimization problem, e.g., only maximizing the mean of net present value (NPV) for simplicity by neglecting the associated risk. However, it is recommended to formulate robust optimization as a bi-objective optimization problem for consistency and completeness. A bi-objective optimization problem is generally defined as minimizing (or maximizing) two different objective functions, and , within a user-defined search domain , and subject to some linear or nonlinear constraints. For example, we may maximize the mean value, denoted by , and minimize the standard deviation, denoted by , of NPV. For a bi-objective optimization problem, the optimal solutions are defined as the Pareto optimal solutions (or Pareto front). It is a very challenging task to find multiple optimal solutions on the Pareto front [, ].
It is also a very challenging task to properly characterize the uncertainty of reservoir properties (e.g., porosity and permeability in each grid-block) and reliably quantify the ensuing uncertainty of production forecasts (e.g., production rates of oil, gas, and water phases) by conditioning to historical production data and 4D seismic data [, ], which requires generating multiple conditional realizations by minimizing a properly defined objective function, e.g., using the randomized maximum likelihood (RML) method [].
When an adjoint-based gradient of the objective function is available [–], we may apply a gradient-based optimization method, e.g., the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization method [, ]. However, the adjoint-based gradient is not available for many commercial simulators and not available for integer type or discrete variables (e.g., well locations defined by grid-block indices). In such a case, we must apply a derivative-free optimization (DFO) method [–]. For problems with smooth objective functions and continuous variables, model-based DFO methods using either radial-basis function [] or quadratic model [, , ] performs more efficiently than other DFO methods such as direct pattern search methods [] and stochastic search methods [].
Traditional optimization methods only locate a single optimal solution, and they are referred to as single-thread optimization methods in this paper. It is unacceptably expensive to use a single-thread optimization method to locate multiple optimal solutions on the Pareto front or generate multiple RML samples []. To overcome the limitations of single-thread optimization methods, Gao et al. [] developed a local-search distributed Gauss-Newton (DGN) DFO method to find multiple best matches concurrently. Later, Chen et al. [] modified the local-search DGN optimization method and generalized the DGN optimizer for global search. They also integrated the global-search DGN optimization method with the RML method to generate multiple RML samples in parallel.
Because multiple search points are generated in each iteration, finally, multiple optimal solutions can be found in parallel. These distributed optimization methods are referred to as multiple-thread optimization methods. Both the local- and global-search DGN optimization methods are only applicable to a specific optimization problem, i.e., history matching problem or least-squares optimization problem, of which the objective function can be expressed as a form of least-squares, , but they cannot be applied to other type or generic optimization problems where the objective function cannot be expressed as a form of least-squares. Furthermore, the DGN optimization methods may become less efficient for history matching problems when both the number of variables (n) and the number of observed data (m) are large (e.g., in the order of thousands or more).
Recently, Gao et al. [] developed a well-paralleled distributed quasi-Newton (DQN) DFO method for generic optimization problems by introducing a generalized form of the objective function, . Here, x represents the vector of controllable variables or model parameters to be optimized (explicit variables) and y(x) denotes the vector of simulated responses (implicit variables) of a reservoir using explicit variables x. Using the generalized form of the objective function, the gradient of the objective function can be evaluated analytically by,In Eq. 1, is the transpose of the sensitivity matrix. Using the generalized expression of the objective function F(x,y), different objective functions can be evaluated using the same y(x) that is simulated from the same reservoir model, e.g., for multi-objective optimization problems. A specific case of the generalized expression is , where the transpose of the sensitivity matrix becomes the gradient of the objective function.
The DQN optimization algorithm runs Ne optimization threads in parallel. In the k-th iteration, there are non-converged threads. The DQN optimizer generates a new search points for each non-converged thread, where the search step is solved from a trust region subproblem (TRS). The simulation cases are submitted to a high-performance computing (HPC) cluster in parallel, to generate corresponding simulated responses . Then, we evaluate the objective function and its associated partial derivatives and . If the new search point improves the objective function, i.e., , then we update and .
All simulation results generated by all DQN optimization threads in previous and current iterations are recorded in a training data set, and they are shared among all DQN optimization threads. The sensitivity matrix is approximated using a modified QR-method proposed by Gao, et al. [] by linear interpolation of training data points that are closest to x(i,k+1). Then, we approximate the gradient using Eq. 1. Finally, the Hessian H(i,k+1) for each thread is updated using either the BFGS method or the symmetric rank-1 (SR1) method []. For simplicity, we will drop the superscript “i” (the optimization thread index) in the following discussions.
Both DGN and DQN optimization methods approximate the objective function by a quadratic model, and they are designed for problems with smooth objective function and continuous variables. Although their convergence is not guaranteed for problems with integer type variables (e.g., well location optimization), if those integers can be treated as truncated continuous variables, our numerical tests indicate that these distributed optimization methods can improve the objective function significantly and locate multiple suboptimal solutions for problems with integer type variables in only a few iterations.
The number of variables for real-world problems may vary from a few to thousands or even more, depending on the problem and the parameterization techniques employed. For example, the number of variables could be in the order of millions if we do not apply any parameter reduction techniques to reduce the dimension of some history matching problems (e.g., to tune permeability and porosity in each grid block). Because reservoir properties are generally correlated with each other with long correlation lengths, we may reduce the number of variables to be tuned to only a few hundred, e.g., using the principal component analysis (PCA) or other parameter reduction techniques [].
In this paper, our focus is on performance analysis of different methods to solve the TRS formulated with the limited-memory BFGS (L-BFGS) Hessian updating method for unconstrained optimization problems. In the future, we will further integrate the new TRS solver with our newly proposed limited-memory distributed BFGS (L-DBFGS) optimization algorithm and then apply the new optimizer to some realistic oil/gas field optimization cases and benchmark its overall performance against other distributed optimization methods such as those only using the gradient []. We will also continue our investigation in the future for constrained optimization problems (e.g., variables with lower- and/or upper-bounds and problems with nonlinear constraints). Gao et al. [] applied the popular Newton-Raphson method [] to directly solve the TRS using matrix factorization, the DNR method in short. The DNR method is quite expensive when it is applied to solve TRSs of a distributed optimization method, especially for large-scale problems with thousands or more variables []. This paper follows the ideas and concepts presented in the book “Matrix Computation” []. Flops are used to quantify the volume of work associated with a computation, a count for floating-point operations of add, subtract, multiply, or divide. Computational cost (flops) of some commonly used algebraic operations and numerical methods are summarized in Table 1 for reference.
TABLE 1
| Operation | Dimension | Computational cost (flops) |
|---|---|---|
| a = xTy | x, yϵRn | 2n |
| y = y + axy | aϵR; x, yϵRn | 2n |
| y = y + Ax | AϵRm×n; xϵRn; yϵRm | 2mn |
| A = A + yxT | AϵRm×n; xϵRn; yϵRm | 2mn |
| C = C + AB | AϵRm×r; BϵRr×n; CϵRm×n | 2mnr |
| A = LLT | Symmetric AϵRn×n; LϵRn×n | n3/3 |
| Solve x from Lx=y | Lower triangle LϵRn×n; xϵRn; yϵRn | n2 |
A summary of computational costs of commonly used algebraic operations.
For completeness, we discuss the compact representation of the L-BFGS Hessian updating formulation [] and the algorithm to directly update the Hessian in the next section directly. In the third section, we present three different methods to solve the L-BFGS TRS: the DNR method, the direct method using inverse quadratic (DIQ) interpolation approach proposed by Gao et al. [], and the technique using matrix inversion lemma (MIL) together with an efficient matrix updating algorithm. Some numerical tests and performance comparisons are discussed in the fourth section. We finally draw some conclusions in the last section.
The Limited-Memory Hessian Updating Formulation
Let x(k) be the best solution obtained in the current (k-th) iteration, and f(k), g(k), and H(k), respectively, the objective function, its gradient and Hessian evaluated at x(k). The Hessian H(k+1) is updated using the BFGS formulation as follows,In Eq. 2, and . The Hessian H(k+1) updated using Eq. 2 is guaranteed positive definite if the condition is satisfied where c3>0 is a small positive number.
Compact Representation
To save memory usage and computational cost, the L-BFGS Hessian updating method limits the maximum history size or the maximum number of pairs (s(k), z(k)) used for the BFGS Hessian updating to LM > 1. The recommended value of LM ranges from 5 to 20, using smaller number for problems with more variables.
Let 1 ≤ lk ≤ LM denote the number of pairs of variable shift vectors and gradient change vectors, (s(j), z(j)) for j = 1,2,...lk, used to update the Hessian using the L-BFGS method in the k-th iteration. Let be the n × lk variable shift matrix and the n × lk gradient change matrix. Both matrices S(k) and Z(k) are updated iteration by iteration. Let m=2lk and is an n × m matrix. Let and . We decompose B(k) into three parts: the strictly lower triangular part L(k), the strictly upper triangular part U(k), and the diagonal part E(k) that contains the main diagonals of B(k), i.e.,The Hessian can be updated using the compact form of the L-BFGS Hessian updating formulation [, –] as follows,In Eq. 4, In is an n × n identity matrix and W(k) is an m×m symmetric matrix defined as,In Eq. 4 and Eq. 5, is a scaling factor. It is required that a(k) > acr > 0 using Eq. 4 and Eq. 5 to update the Hessian, where acr is a small positive number.
The Algorithm to Directly Update the Hessian Using the L-BFGS Compact Representation
Given
LM≥ 1,
k≥ 1,
lk−1≥ 1, the thresholds
c3> 0 and
acr> 0, search step
and gradient change
, the two
n×
lk−1matrices,
S(k−1)and
Z(k−1), we can update
S(k),
Z(k)and directly compute the Hessian
H(k+1)using the compact representation of the L-BFGS Hessian updating
Eq. 4, as summarized in Algorithm-1.
Algorithm-1: Updating the Hessian directly using the L-BFGS compact representation
1. Compute ||s||, ||z||, and if ;
2. Ifor, set, , , and goto step 4;
3. Otherwise,
a. Setand remove the first column fromS(k−1)andZ(k−1)if;else set;
b. Updateand;
4. Form;
5. Computeand;
6. Form the matrixusingEq. 5;
7. Computeusing Cholesky factorization;
8. Compute;
9. Compute;
10. Update.
11. Stop.
The computational cost to directly update the Hessian using the L-BFGS method as described in Algorithm-1 mainly depends on the number of variables (n) and the number of vectors used to update the Hessian (m = 2lk). We should reemphasize that m is updated iteratively but limited to m ≤ 2Lk. The computational cost is (flops), which includes the following seven operations: 1) computing ‖s‖, ‖z‖, and in step 1 (6n flops); 2) computing A(k) and B(k) in step 5 (m2n flops); 3) forming the matrix in step 6 (m2 flops); 4) computing W(k) in step 7 (O(m3) flops); 5) compute U(k) in step 8 (2m2n flops); 6) compute T(k) in step 9 ( flops); and 7) updating in step 10 (n flops).
Solving the L-BFGS Trust Region Subproblem
The Trust Region Subproblem
In the neighborhood of the best solution x(k), the objective function f(x) can be approximated by a quadratic function of ,If the Hessian is positive definite, then has a unique minimum s*(k), which is the solution of ,When a line search strategy is applied, we accept the full search step by setting if it improves the objective function sufficiently (e.g., satisfying the two Wolfe conditions). Otherwise, we can find a better search step size such that improves the objective function sufficiently. If the gradient can be accurately evaluated, it has theoretically proved that a line search strategy can converge []. However, for most real-world optimization problems, the gradient cannot be accurately evaluated, and as discussed by Gao et al. [], a trust region search strategy performs more robustly and more efficiently than a line search strategy.
The trust region search step for the next iteration is the global minimum of the quadratic model within a ball-shaped trust region with radius which is solved from the following trust region subproblem (TRS) [],If the solution given by Eq. 7 satisfies , then is the solution of the TRS defined in Eq. 8. Otherwise, the solution of the TRS must lie on the boundary of , and we should solve the Lagrange multiplier together with the search step from the following nonlinear TRS equations,For the trivial case with , we have and . The solution of the TRS defined in Eq. 8 is and if or and otherwise.
We may apply the popular DNR method to solve the TRS when is a dense matrix. However, the DNR method is quite expensive for large-scale problems and we should seek and apply a more efficient method to solve the TRS.
Solving the L-BFGS TRS Directly Using the Newton-Raphson Method
The DNR method solves the TRS defined in Eq. 8 using the Cholesky decomposition. It applies the Newton-Raphson method to directly solve the nonlinear TRS equation [], , iteratively, which requires computing the first order derivative where v is solved from together with the Cholesky factorization of or LU-decomposition.
Given the trust region size , threshold of convergence , maximum number of iterations allowed , Hessian and gradient evaluated at the current best solution , both the Lagrange multiplier and the trust region search step can be solved from the TRS defined in Eq. 8, using the DNR method as summarized in Algorithm-2.
Solving the L-BFGS TRS Using the DNR Method
1. Initialize, ;
2. Computeand Cholesky decomposition;
3. Solvefrom,from, andfrom;
4. Computeand;
5. If, then acceptand go to step 8;
6. Set ;
7. Repeat steps (a) through (f) below, until convergence (or):
a. Update;
b. Computeand Cholesky decomposition;
c. Solvefrom, from, andfrom;
d. Computeand;
e. Update;
f. Set.
8. Stop.
Let denote the number of iterations required for the DNR TRS solver to converge. The total computational cost (flops) to solve the TRS using Algorithm-2 is, , including the following three operations: (1) computing and Cholesky decomposition in step 2 and step 7(b) (flops); (2) solving from , from , and from in step 3 and step 7(c) ( flops); (3) computing and in step 4 and step 7(d) ( flops).
The total computational cost (flops) used to solve the L-BFGS TRS using the DNR method (Algorithm-2) together with the L-BFGS Hessian updating method (Algorithm-1) is,Our numerical results indicate that the DNR TRS solver may fail when tested on the well-known Rosenbrock function, especially for problems with large . The root cause for failure of convergence using the DNR method is the same as in the GNTRS solver using the traditional Newton-Raphson method as discussed by Gao et al. []. Very small value of may result in a very large search step and thus result in failure of converging. Gao et al. [] proposed integrating the DNR method with a bisection line search to overcome this issue.
The Inverse Quadratic Model Interpolation Method to Directly Solve the L-BFGS TRS
Instead of applying the Newton-Raphson method which requires evaluating , Gao et al. [] proposed a method to directly solve the TRS using inverse quadratic model interpolation (called the DIQ method), i.e., approximating by the following inverse quadratic function,Both coefficients of and in Eq. 12 can be determined by interpolating the values of evaluated at two different points and with .
In the first iteration, we set and accept as the solution if . Otherwise, holds and we set and holds if it is not accepted as the solution. In the following iteration, either or will be updated accordingly such that the two conditions and always hold.
Let and the following solution of will be used as the trial search point for the next iteration,We either update if or update if iteratively until convergence. We accept as the desired solution and terminate the iterative process either when , the tolerance for convergence, or when the number of iterations used to solve the TRS () reaches the user specified maximum iteration number (), i.e.,.
Given the scaling factor , trust region size , threshold of convergence , maximum number of iterations allowed , Hessian and gradient , both the Lagrange multiplier and the trust region search step can be solved from the TRS defined in Eq. 8, using the DIQ method as summarized in Algorithm-3.
Algorithm-3: Solving the L-BFGS TRS Using the DIQ Method
(1) Compute;
(2) Calculate;
(3) Initialize, and;
(4) Computeand Cholesky decomposition;
(5) Solvefromandfrom;
(6) Compute;
(7) If, then acceptand go to step 12;
(8) If, then
a. Computeand Cholesky decomposition;
b. Solvefromandfrom;
c. Compute;
(9) If, then, ;
(10) Otherwise, , ;
(11) Repeat steps (a) through (i) below, until convergence ( or ):
a. Calculate;
b. Calculate;
c. Computeand Cholesky decomposition;
d. Solvefromandfrom;
e. Compute ;
f. Update ;
g. If, updateand;
h. Otherwise, updateand;
i.
(12) Stop.
Let denote the number of iterations required for the DIQ TRS solver to converge. The total computational cost to solve the L-BFGS TRS using Algorithm-3 together with Algorithm-1 is,Generally, the DIQ method converges faster than the DNR method and it performs more efficiently and robustly than the DNR method, especially for large scale problems. Both the DNR method and the DIQ method become quite expensive for large-scale problems.
Using Matrix Inversion Lemma (MIL) to Solve the L-BFGS TRS
To save both memory usage and computational cost, Gao, et al. [, , ] proposed an efficient algorithm to solve the Gauss-Newton TRS (GNTRS) for large-scale history matching problems using the matrix inversion lemma (or the Woodbury matrix identity). With appropriate normalization of both parameters and residuals, the Hessian of the objective function for a history matching problem can be approximated by the well-known Gauss-Newton equation as follows,In Eq. 15, is an matrix, the transpose of the sensitivity matrix that is evaluated at the current best solution . Here, denotes the number of observed data to be matched and it is not the same as used in the L-BFGS Hessian updating Eq. 4.
The GNTRS solver proposed by Gao, et al. [, , ] using the matrix inversion lemma (MIL) has been implemented and integrated with the distributed Gauss-Newton (DGN) optimizer. Because the compact representation of the L-BFGS Hessian updating formula Eq. 4 is similar to Eq. 15, we follow the similar idea as proposed by Gao, et al. [, , ] to compute by applying the matrix inversion lemma and then solve the L-BFGS trust region search step .Let and . We first solve from,Then, we compute the trust region search step and as,From Eq. 15, we have and Eq. 19 can be rewritten as,We first try . If holds, we accept and as the solution of the TRS defined in Eq. 8. Otherwise, the solution is on the boundary defined in Eq. 10, which can be solved by the Newton-Raphson (NR) method iteratively.
Let and we have where is computed by,In Eq. 21, is solved from,Because is an symmetric and positive definite matrix, it only requires. flops to solve and ) from Eq. 17 and Eq. 22 using the Cholesky factorization.
Given the variable shift matrix and gradient change matrix , we may directly compute the following three matrices, , and , and then form the matrix as,Directly computing the three matrices, , and , requires flops. We also need to compute the vector ( flops) and ( flops). To further reduce the computational cost, the three matrices , and and the vector should be updated iteratively instead.
Updating Matrices and Vectors Used for Solving the L-BFGS TRS
We first initialize and . If the trial search point does not improve the objective function, i.e., , we set and , recompute the sensitivity matrix and the gradient when needed (e.g., using updated training data points), and shrink the trust region size with . We repeatedly generate trial search point until improves the objective function, i.e., .
If , we set , , evaluate , and compute and . In the following iterations, we update and associated matrices and vectors used for solving the L-BFGS TRS for different cases accordingly. We use to indicate whether the new search point improves the objective function () or not ().
The L-BFGS Hessian updating formulation of Eq. 4 requires and where and are small positive numbers. If or , we simply set , , , , , , , , , and , with no updating. However, we need to compute using the new gradient . In the following cases, we assume that both conditions and are satisfied.
Case-1: Sflag(k) = “False”
Although the best solution does not change when Sflag(k) = “False”, the sensitivity matrix and thus the gradient of the objective function evaluated at using simulation results and training data points updated in the (current) iteration may be different from those evaluated at the same point but using simulation results and training data points obtained in the (previous) iteration. To use the right gradient, we should replace the last column in with .
Let be the submatrix of by removing its last column , the submatrix of by removing its last column, and the submatrix of by removing its last row and last column. We first compute the following two vectors and and two scalars and , with computational cost flops.
We set , , , , and , and Update as,It is straightforward to derive the following equations to update and accordingly,By definition,Thus, we have,
Case-2: Sflag(k) = “True” and lk−1 < LM
We set and and , compute four -dimensional vectors and and and and five scalars, and and and and , with computational cost flops. Both and are updated by,It is straightforward to derive the following equations to update A(k), B(k), and C(k),We first split the -dimensional vector into two -dimensional sub-vectors and , i.e.,where is composed of the first rows in and is composed of the last rows in .
Because and , thus we have,
Case-3: Sflag(k) = “True” and lk−1 = LM
We set . Let and denote the submatrices of and by removing the first column and from and , respectively. Let and and denote the submatrices of A(k), B(k), and C(k) by removing their first row and first column.
apply Eq. 28 through Eq. 32 to update S(k), Z(k), A(k), B(k), and C(k) by simply replacing S(k−1), Z(k−1), A(k−1), B(k−1), and C(k−1) with and and and and , respectively. We can also apply Eq. 33 to update by replacing with that is composed of the second row through the -th rows in and replacing with that is composed of the last rows in .
The Algorithm to Update Matrices and Vectors Used for Solving the L-BFGS TRS
Given LM ≥ 1, k > 1, lk−1 ≥ 1, the thresholds c3 > 0 and acr > 0, -dimensional vectors g(k+1), s and z, lk−1-dimensional vector u(k−1), n × lk−1 matrices S(k−1) and Z(k−1), lk−1 × lk−1 matrices A(k−1), B(k−1) and C(k−1), 2lk−1 × 2lk−1 matrices P(k−1) and , we can update and the -dimensional vector u(k), n × lk matrices S(k) and Z(k), lk × lk matrices A(k), B(k), and C(k), and 2lk × 2lk matrices and , using the algorithm as summarized in Algorithm-4.
Algorithm-4: Updating Matrices and Vectors Used for Solving the L-BFGS TRS
1. Compute, , ;
2. Compute the scaling factorif, or setotherwise;
3. Ifor, then
a. Setand;
b. Updateand;
c. Update, , and;
d. Updateand;
e. Compute;
f. Goto step 10.
4. Else if(for Case-1), then
a. Set;
b. Update;
c. Setto the last column of;
d. Setby removing the last columnfrom;
e. Update;
f. Compute;
g. Computeand;
h. ComputeusingEq. 27;
i. Update;
j. Setby removing the first column from;
k. Setby removing the first row and the first column from;
l.
UpdateandusingEq. 25andEq. 26;
The computational cost to update matrices and vectors used for solving the L-BFGS TRS as described in Algorithm-4 is at most (taking Case-2 in step 5 as an example), including the following 4 operations: (1) computing , , and in step 1 ( flops); (2) computing ,,, and in step 5(b) ( flops),; (3) computing and in step 5(c) ( flops).
In this paper, we apply the same idea of reducing the dimension from to using the matrix inversion lemma as presented in the three papers [, , ]. However, the implementation is quite different. For the DGN optimization method, the matrix (or the transpose of the sensitivity matrix) in Eq. 15 is directly computed. Its dimensions (both the number of variables and the number of observed data ) are fixed, i.e., they remain the same for all iterations. Therefore, we have to directly compute the matrix and the -dimensional vector , with the computational cost flops. Using the algorithm as summarized in Algorithm-4, the computational cost can be further reduced to , by a factor of , 5 to 20 roughly.
The Algorithm to Solve the L-BFGS TRS Using the Matrix Inversion Lemma
Given the trust region size , threshold of convergence , maximum number of iterations allowed , the scaling factor , , matrix , matrices and , -dimensional vector and -dimensional gradient vector evaluated at the current best solution , both the Lagrange multiplier and the trust region search step can be solved from the TRS defined in Eq. 8 using the matrix inversion lemma (MIL) as summarized in Algorithm-5.
Algorithm-5: Solving the L-BFGS TRS Using the Matrix Inversion Lemma
1. Compute ;
2. Initialize, ;
3. If, then
a. Setandif;
b. Setandif;
c. Goto step 11.
4. Computeand Cholesky decomposition;
5. Solvefromandfrom;
6. Compute, , , andusingEq. 20;
7. If, then acceptand goto step 10;
8. Set;
9. Repeat steps (a) through (j) below, until convergence ( or ):
10. Computeand;
11. End
Let denote the number of iterations required for the L-BFGS TRS solver to converge. The total computational cost (flops) to solve the L-BFGS TRS using Algorithm-5 is, , including the following eight operations: (1) computing in step 1 ( flops); (2) computing and Cholesky decomposition in step 4 or step 9(g) ( flops); (3) solving from , and from in step 5 or step 9(h) ( flops); (4) computing , and in step 6 or step 9(i) ( flops); (5) computing in step 9(a) ( flops); (6) Solving from and from in step 9(b) ( flops); and (7) computing , , and in step 9(c) ( flops); and (8) computing and in step 10 ().
The total computational cost (flops) used to solve the L-BFGS TRS using Algorithm-5 together with Algorithm-4 is,It is straightforward to compute the normalized computational cost, or the ratio of computational cost of the DNR method (Algorithm-2) or the DIQ method (Algorithm-3) together with the algorithm that directly updates Hessian (Algorithm-1) over the MIL TRS solver using the matrix inversion lemma (Algorithm-5) and the efficient matrix updating algorithm (Algorithm-4),Because for large-scale problems, Eq. 35 can be further simplified as,For some problems the solution is accepted with . For other problems, it takes roughly iterations for a TRS solver to converge. Figure 1A,B illustrate the plots of vs. for different by fixing and , respectively. As shown in both Figure 1A and Figure 1B, the MIL TRS solver may reduce the computational cost by a factor for .
FIGURE 1
Numerical Validation
We benchmarked the MIL TRS solver against the DNR (or DIQ) TRS solver on two well-known analytic optimization problems using analytical gradients: the Rosenbrock function and the Sphere function defined as follows,The Rosenbrock function defined in Eq. 37 has one local minimum located at and for with the objective function being 4 (approximately), and one global minimum located at for with the objective function being 0. Occasionally, a local-search optimizer may converge to the local minimum.
We implemented different algorithms described in this paper in our in-house C++ optimization library (OptLib). We employ “Armadillo” as our foundational template-based C++ library for linear algebra (http://arma.sourceforge.net/). We ran all test cases reported on Tables 2–5 on a virtual machine computer equipped with an Intel Xeon Platinum Processor at 2.60 MHz and 16.0 GB of RAM. We varied the number of parameters, as powers of two and fixed twice the maximum number of snapshots, . Tables 2–5 summarize the preliminary numerical results, including the value of the objective function, the gradient norm, the number of iterations, and the elapsed CPU time (s). The last column displays the resulting speedup, namely, the ratio of CPU times using the DNR (or DIQ) method over the MIL method. We prescribe the tolerance to stop the L-BFGS TRS solver for all the numerical experiments in this subsection as and the maximum number of iterations as . We set the initial trust region size at ∆ = 0.5.
TABLE 2
| DNR | MIL | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| n | f(k) | g(k) | Iters. | CPU | f(k) | g(k) | Iters. | CPU | Speedup |
| 8 | 4.4E-09 | 2.9E-05 | 41 | 2.9E-02 | 3.4E-09 | 2.5E-05 | 41 | 1.6E-02 | 1.8 |
| 16 | 1.7E-09 | 1.0E-05 | 62 | 3.8E-02 | 1.5E-09 | 1.9E-05 | 61 | 4.2E-02 | 0.9 |
| 32 | 1.3E-09 | 1.1E-05 | 75 | 4.6E-02 | 3.2E-09 | 2.2E-05 | 75 | 4.8E-02 | 1.0 |
| 48 | 1.3E-08 | 2.6E-05 | 72 | 9.3E-02 | 4.4E-09 | 2.0E-05 | 72 | 4.0E-02 | 2.3 |
Performance comparison of DNR vs. MIL methods testing on Rosenbrock function.
TABLE 3
| DIQ | MIL | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| n | f(k) | g(k) | Iters. | CPU | f(k) | g(k) | Iters. | CPU | Speedup |
| 16 | 1.2E-09 | 2.2E-05 | 51 | 2.4E-02 | 1.5E-09 | 1.9E-05 | 61 | 3.2E-02 | 0.8 |
| 32 | 1.0E-09 | 2.4E-05 | 61 | 4.3E-02 | 3.2E-09 | 2.2E-05 | 75 | 3.2E-02 | 1.4 |
| 64 | 8.0E-10 | 2.4E-05 | 75 | 1.3E-01 | 7.8E-09 | 2.9E-05 | 79 | 3.6E-02 | 3.5 |
| 128 | 4.5E-10 | 1.8E-05 | 82 | 3.0E-01 | 4.6E-09 | 2.8E-05 | 73 | 3.8E-02 | 7.9 |
| 256 | 2.2E-10 | 1.3E-05 | 87 | 1.0E+00 | 5.0E-09 | 3.0E-05 | 78 | 7.2E-02 | 13.9 |
| 512 | 4.0E+00 | 1.5E-05 | 84 | 5.5E+00 | 3.1E-10 | 9.6E-06 | 90 | 1.4E-01 | 39.2 |
| 1024 | 1.5E-09 | 2.8E-05 | 98 | 3.9E+01 | 7.5E-09 | 2.7E-05 | 94 | 4.2E-01 | 90.9 |
Performance comparison of DIQ vs. MIL methods testing on Rosenbrock function.
TABLE 4
| DNR | MIL | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| n | f(k) | g(k) | Iters. | CPU | f(k) | g(k) | Iters. | CPU | Speedup |
| 16 | 5.4E-64 | 1.8E-33 | 39 | 2.2E-02 | 3.3E-65 | 3.1E-34 | 56 | 3.8E-02 | 0.6 |
| 32 | 1.1E-63 | 1.5E-33 | 66 | 4.0E-02 | 1.3E-65 | 1.6E-34 | 67 | 3.7E-02 | 1.1 |
| 64 | 4.4E-63 | 2.1E-33 | 92 | 1.4E-01 | 4.0E-67 | 2.2E-35 | 86 | 5.2E-02 | 2.8 |
| 128 | 8.4E-63 | 2.1E-33 | 128 | 7.3E-01 | 9.2E-66 | 6.5E-35 | 135 | 6.3E-02 | 11.6 |
| 256 | 1.4E-62 | 1.8E-33 | 188 | 2.8E+00 | 9.7E-63 | 1.5E-33 | 185 | 7.0E-02 | 39.1 |
| 512 | 9.6E-64 | 3.3E-34 | 265 | 1.9E+01 | 5.7E-64 | 2.7E-34 | 256 | 5.6E-01 | 32.9 |
| 1024 | 3.6E-62 | 1.4E-33 | 383 | 1.4E+02 | 1.8E-64 | 1.0E-34 | 371 | 2.7E+00 | 52.4 |
| 2048 | 2.1E-60 | 7.8E-33 | 525 | 8.4E+02 | 2.2E-66 | 7.8E-36 | 534 | 1.7E+01 | 50.2 |
Performance comparison of DNR vs. MIL methods testing on Sphere function.
TABLE 5
| DIQ | MIL | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| f(k) | g(k) | Iters. | CPU | f(k) | g(k) | Iters. | CPU | Speedup | |
| 16 | 3.6E-67 | 4.6E-35 | 39 | 3.8E-02 | 1.6E-65 | 2.1E-34 | 56 | 4.0E-02 | 1.0 |
| 32 | 2.1E-68 | 6.5E-36 | 66 | 3.8E-02 | 7.4E-64 | 1.2E-33 | 67 | 4.4E-02 | 0.9 |
| 64 | 2.0E-65 | 1.4E-34 | 92 | 1.0E-01 | 5.3E-65 | 2.5E-34 | 86 | 4.8E-02 | 2.1 |
| 128 | 2.0E-65 | 9.9E-35 | 128 | 3.0E-01 | 4.4E-64 | 4.5E-34 | 135 | 5.8E-02 | 5.1 |
| 256 | 2.7E-65 | 8.0E-35 | 188 | 1.6E+00 | 2.6E-64 | 2.5E-34 | 185 | 8.8E-02 | 18.3 |
| 512 | 1.5E-67 | 4.2E-36 | 265 | 1.3E+01 | 7.9E-65 | 9.9E-35 | 256 | 6.5E-01 | 19.7 |
| 1024 | 4.4E-66 | 1.6E-35 | 383 | 6.5E+01 | 4.3E-65 | 5.0E-35 | 371 | 3.1E+00 | 21.1 |
| 2048 | 1.4E-65 | 2.0E-35 | 525 | 4.6E+02 | 1.6E-66 | 6.6E-36 | 534 | 1.7E+01 | 27.4 |
Performance comparison of DIQ vs. MIL methods testing on Sphere function.
We noticed that the matrix may become ill-conditioned for a few cases. Indeed, we could observe that the first entries on the diagonal are nearly zero. The same situation occurs for different optimization steps where grows from 0, 2, 4, 6, up to ten. The Cholesky decomposition of such a matrix will fail, i.e., Armadillo throws an exception. We advise replacing the former with an LU-decomposition to handle the pivots, namely, entries on the diagonal of the lower triangular matrix, to be nearly zero. The rest of the algorithm remains the same.
Table 2 compares performance of the DNR method vs. the MIL methods for the Rosenbrock problem encompassing 8, 16, 32, and 48 parameters, respectively. Beyond those problem sizes, the DNR method could not converge to the solution. Indeed, the DNR method becomes relatively slow, i.e., renders very large lambda values that imply tiny shifts. Often, after 2,048 iterations, the objective function is still greater than one. It seems that the DNR method is not robust enough to tackle the Rosenbrock function with more than 48 parameters. It appears that the MIL method slightly outperforms the DNR method for problems with fewer parameters.
Table 3 compares performance of the DIQ method vs. the MIL method. These numerical results confirm that the MIL method outperforms the DIQ method. As expected, the speedup improves with the increase of the problem size n.
Table 4 shows similar results for the Sphere function problem. We could benchmark the DNR method against the MIL method with ranks up to 2,048 parameters. Again, the speedup increases with problem size. Finally, Table 5 benchmarks the DIQ method against the MIL method testing on the Sphere function. The speedup reduces when compared to Table 4 for the same ranks, but we still clearly see that the performance of the MIL solver exceeds its DIQ counterpart. Results shown in Table 4 and Table 5 also confirm that the DIQ method performs better than the DNR method. We believe that the DIQ TRS solver converges faster with smaller number of iterations than the DNR TRS solver (i.e., ).
The computational costs (in terms of CPU time) summarized in Tables 2–5 are the overall costs for all iterations, including the cost of computing both value and gradient of the objective function in addition to the cost of solving the L-BFGS TRS. In contrast, the theoretical analysis results of computational cost for β in Eq. 35 only count for the cost of solving the L-BFGS TRS in just one iteration. Therefore, numerical results listed in Tables 2–5 are not quantitatively in agreement with the theoretically derived β.
Conclusion
We can draw the following conclusions based on theoretical analysis and numerical tests:
1) Using the compact representation of the L-BFGS formulation in Eq. 4, the updated Hessian is composed of a diagonal matrix and a matrix with low rank when m < n.
2) Using the matrix inversion lemma, we are able to reduce the cost of solving the L-BFGS TRS by transforming the original equation with variables to a new equation with variables, by taking advantage of the low rank matrix updating feature.
3) Further reduction in computational cost is achieved by updating the vector u(k) and matrices such as S(k) and Z(k), A(k), B(k), and C(k), P(k), and iteratively, which effectively avoids expensive operations such as matrix-matrix and matrix-vector multiplications.
4) Through seamless integration of matrix inversion lemma with the iterative matrix and vector updating approach, the newly proposed TRS solver for the limited-memory distributed BFGS optimization method performs much more efficiently than the DNR TRS solver.
5) The MIL TRS solver can obtain the right solution with acceptable accuracy, without increasing memory usage but reducing the computational cost significantly, as confirmed by numerical results on the Rosenbrock function and Sphere function.
Statements
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author contributions
GG, HF, and JV contributed to conception and algorithmic analysis of the study. HF, TW, FS, and CB contributed to design and numerical validations. GG and FH wrote the first draft of the article. All authors contributed to article revision, read, and approved the submitted version.
Conflict of interest
Authors GG, HF, and FS were employed by Shell Global Solutions (United States) Inc. Authors JV, TW and CB were employed by Shell Global Solutions International B.V.
References
1.
Ai-MudhafarWJRaoDNSrinivasanS. Robust Optimization of Cyclic CO2 Flooding through the Gas-Assisted Gravity Drainage Process under Geological Uncertainties. J Pet Sci Eng (2018). 166:490–509. 10.1016/j.petrol.2018.03.044
2.
AlpakFOJinLRamirezBA. Robust Optimization of Well Placement in Geologically Complex Reservoirs. in: Paper SPE-175106-MS Presented at the SPE Annual Technical Conference and Exhibition; September28–30 2015; Houston, TX (2015).
3.
van EssenGMZandvlietMJDen HofPMJBosgraOHJansenJDet alRobust Optimization of Oil Reservoir Flooding. in: Paper Presented at the IEEE International Conference on Control Applications; Oct 4-6 2006; Munich, Germany (2006). p. 4–6.
4.
LiuZForouzanfarF. Ensemble Clustering for Efficient Robust Optimization of Naturally Fractured Reservoirs. Comput Geosci (2018). 22:283–96. 10.1007/s10596-017-9689-1
5.
LiuXReynoldsAC. Augmented Lagrangian Method for Maximizing Expectation and Minimizing Risk for Optimal Well-Control Problems with Nonlinear Constraints. SPE J (2016). 21(5). 10.1007/s10596-017-9689-1
6.
LiuXReynoldsAC. Gradient-based Multi-Objective Optimization with Applications to Waterflooding Optimization. Comput Geosci (2016). 20:677–93. 10.1007/s10596-015-9523-6
7.
OliverDSReynoldsACLiuN. Inverse Theory for Petroleum Reservoir Characterization and History Matching. Cambridge, United Kingdom: Cambridge University Press (2008). 10.1017/cbo9780511535642
8.
TarantolaA. Inverse Problem Theory and Methods for Model Parameter Estimation. Philadelphia, PA: SIAM (2005). 10.1137/1.9780898717921
9.
OliverDS. On Conditional Simulation to Inaccurate Data. Math Geol (1996). 28:811–7. 10.1007/bf02066348
10.
JansenJD. Adjoint-based Optimization of Multi-phase Flow through Porous Media - A Review. Comput Fluids (2011). 46(1):40–51. 10.1016/j.compfluid.2010.09.039
11.
LiRReynoldsACOliverDS. History Matching of Three-phase Flow Production Data. SPE J (2003). 8(4):328–40. 10.2118/87336-pa
12.
SarmaPDurlofskyLAzizK. Implementation of Adjoint Solution for Optimal Control of Smart Wells. in: Paper SPE 92864 Presented at the SPE Reservoir Simulation Symposium Held in the Woodlands; Jan.-2 Feb; Woodlands, TX (2005). p. 31.
13.
GaoGReynoldsAC. An Improved Implementation of the LBFGS Algorithm for Automatic History Matching. SPE J (2006). 11(1):5–17. 10.2118/90058-pa
14.
NocedalJWrightSJ. Numerical Optimization. New York City: Springer (1999).
15.
ChenCet alAssisted History Matching Using Three Derivative-free Optimization Algorithms. in: Paper SPE-154112-MS Presented at the SPE Europec/EAGE Annual Conference Held in Copenhagen; June 4–7 2012; Copenhagen, Denmark (2012). p. 4–7.
16.
GaoGVinkJCChenCAlpakFODuK. A Parallelized and Hybrid Data-Integration Algorithm for History Matching of Geologically Complex Reservoirs. SPE J (2016). 21(6):2155–74. 10.2118/175039-pa
17.
GaoGVinkJCChenCEl KhamraYTarrahiMet alDistributed Gauss-Newton Optimization Method for History Matching Problems with Multiple Best Matches. Comput Geosciences (2017). 21(5-6):1325–42. 10.1007/s10596-017-9657-9
18.
GaoGWangYVinkJWellsTSaafFet alDistributed Quasi-Newton Derivative-free Optimization Method for Optimization Problems with Multiple Local Optima. in: 17th European Conference on the Mathematics of Oil Recovery ECMOR XVII; September 2020; Edinburgh, United Kingdom (2020). p. 14–7.
19.
WildSM. Derivative Free Optimization Algorithms for Computationally Expensive Functions. [PhD thesis]. Ithaca (new york): Cornell University (2009).
20.
PowellMJD. Least Frobenius Norm Updating of Quadratic Models that Satisfy Interpolation Conditions. Math Programming (2004). 100(1):183–215. 10.1007/s10107-003-0490-7
21.
ZhaoHLiGReynoldsACYaoJ. Large-scale History Matching with Quadratic Interpolation Models. Comput Geosci (2012). 17:117–38. 10.1007/s10596-012-9320-4
22.
AudetCDennisJEJr.Mesh Adaptive Direct Search Algorithms for Constrained Optimization. SIAM J Optim (2006). 17:188–217. 10.1137/040603371
23.
SpallJC. Introduction to Stochastic Search and Optimization. Hoboken, NJ: John Wiley & Sons (2003). 10.1002/0471722138
24.
ChenCet alEUR Assessment of Unconventional Assets Using Parallelized History Matching Workflow Together with RML Method. In: Paper Presented the 2016 Unconventional Resources Technology Conference; August 1–3 2016; San Antonio, TX(2016).
25.
ChenCGaoGLiRCaoRChenTVinkJCet alGlobal-Search Distributed-Gauss-Newton Optimization Method and its Integration with the Randomized-Maximum-Likelihood Method for Uncertainty Quantification of Reservoir Performance. SPE J (2018). 23(05):1496–517. 10.2118/182639-pa
26.
ChenCGaoGRamirezBAVinkJCGirardiAM. Assisted History Matching of Channelized Models by Use of Pluri-Principal-Component Analysis. SPE J (2016). 21(05):1793–812. 10.2118/173192-pa
27.
ArmackiAJakoveticDKrejicNKrklec JerinkicNet alDistributed Trust-Region Method with First Order Models. in Paper Presented at the IEEE EUROCON-2019, 18th International Conference on Smart TechnologiesJuly 1–14 2019Novi Sad, Serbia (2019). p. 1–4.
28.
MoréJJSorensenDC. Computing a Trust Region Step. SIAM J Sci Stat Comput (1983). 4(3):553–72. 10.1137/0904038
29.
MohamedAW. Solving Large-Scale Global Optimization Problems Using Enhanced Adaptive Differential Evolution Algorithm. Complex Intell Syst (2017). 3:205–31. 10.1007/s40747-017-0041-0
30.
GolubGHVan LoanCF. Matrix Computations. 4th ed. The Johns Hopkins University Press (2013).
31.
BrustJJLeyfferSPetraCG. Compact Representations of BFGS Matrices.Preprint ANL/MCS-P9279-0120. Lemont, IL: ARGONNE NATIONAL LABORATORY (2020).
32.
GaoGJiangHVinkJCvan HagenPPHWellsTJet alPerformance Enhancement of Gauss-Newton Trust-Region Solver for Distributed Gauss-Newton Optimization Method. Comput Geosci (2020). 24(2):837–52. 10.1007/s10596-019-09830-x
33.
BurdakovOGongLZikrinSYuanY. On Efficiently Combining Limited-Memory and Trust-Region Techniques. Math Prog Comp (2017). 9:101–34. 10.1007/s12532-016-0109-7
34.
BurkeJVWiegmannAXuL. Limited Memory BFGS Updating in a Trust-Region framework Tech. Rep. Seattle, WA: University of Washington (2008).
35.
ByrdRHNocedalJSchnabelRB. Representations of Quasi-Newton Matrices and Their Use in Limited Memory Methods. Math Programming (1994). 63:129–56. 10.1007/bf01582063
36.
GaoGJiangHvan HagenPVinkJCWellsT. A Gauss-Newton Trust Region Solver for Large Scale History Matching Problems. SPE J (2017b). 22(6):1999–2011. 10.2118/182602-pa
37.
GaoGSaafFVinkJKrymskayaMWellsTet alGauss-Newton Trust Region Search Optimization Method for Ill-Conditioned Least Squares Problem. in: ECMOR XVII 17th European Conference on the Mathematics of Oil Recovery; 14 September 2020; Edinburgh, UK (2020). p. 14–17.
Summary
Keywords
limited-memory BFGS method, trust region search optimization method, trust region subproblem, matrix inversion lemma, low rank matrix update
Citation
Gao G, Florez H, Vink JC, Wells TJ, Saaf F and Blom CPA (2021) Performance Analysis of Trust Region Subproblem Solvers for Limited-Memory Distributed BFGS Optimization Method. Front. Appl. Math. Stat. 7:673412. doi: 10.3389/fams.2021.673412
Received
27 February 2021
Accepted
29 April 2021
Published
24 May 2021
Volume
7 - 2021
Edited by
Olwijn Leeuwenburgh, Netherlands Organisation for Applied Scientific Research, Netherlands
Reviewed by
Andreas Størksen Stordal, Norwegian Research Institute (NORCE), Norway
Sarfraz Ahmad, COMSATS University Islamabad, Pakistan
Updates
Copyright
© 2021 Gao, Florez, Vink, Wells, Saaf and Blom.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Guohua Gao, guohua.gao@shell.com
This article was submitted to Optimization, a section of the journal Frontiers in Applied Mathematics and Statistics
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.