Efficient Ensemble-Based Stochastic Gradient Methods for Optimization Under Geological Uncertainty

Ensemble-based stochastic gradient methods, such as the ensemble optimization (EnOpt) method, the simplex gradient (SG) method, and the stochastic simplex approximate gradient (StoSAG) method, approximate the gradient of an objective function using an ensemble of perturbed control vectors. These methods are increasingly used in solving reservoir optimization problems because they are not only easy to parallelize and couple with any simulator but also computationally more efficient than the conventional finite-difference method for gradient calculations. In this work, we show that EnOpt may fail to achieve sufficient improvement of the objective function when the differences between the objective function values of perturbed control variables and their ensemble mean are large. On the basis of the comparison of EnOpt and SG, we propose a hybrid gradient of EnOpt and SG to save on the computational cost of SG. We also suggest practical ways to reduce the computational cost of EnOpt and StoSAG by approximating the objective function values of unperturbed control variables using the values of perturbed ones. We first demonstrate the performance of our improved ensemble schemes using a benchmark problem. Results show that the proposed gradients saved about 30–50% of the computational cost of the same optimization by using EnOpt, SG, and StoSAG. As a real application, we consider pressure management in carbon storage reservoirs, for which brine extraction wells need to be optimally placed to reduce reservoir pressure buildup while maximizing the net present value. Results show that our improved schemes reduce the computational cost significantly.


INTRODUCTION
Since the ensemble Kalman filter was first introduced into the petroleum engineering (Lorentzen et al., 2001;Naevdal et al., 2002;Kim et al., 2018), many ensemble-based history matching methods have gained popularity because they are reduced rank methods (meaning less computational effort) and are relatively easy to implement, parallelize, and couple with any numerical simulator. Chen et al. (2009) first systematically applied the ensemble concept to optimization of well control variables (e.g., well rates and bottom-hole pressures) to maximize the net present value in oil and gas fields. They named their scheme the ensemble optimization (EnOpt) method. Similar to the ensemble-based data assimilation methods, EnOpt can also be easily parallelized and coupled with any simulator.
Another strength of EnOpt is that EnOpt finds an optimal solution under geological uncertainty by maximizing the expectation of the objective function values of multiple models representing model uncertainties, whereas the conventional optimization methods typically require solving each model separately (Chen et al., 2009;van Essen et al., 2009) and optimization under uncertainty is non-trivial (Sun et al., 2013;Zhang et al., 2016). The idea of considering model uncertainties in optimization was also explored by van Essen et al. (2009). van Essen et al. (2009) named their method robust optimization, and the handling of model uncertainties in their method is essentially the same as that in EnOpt. However, EnOpt includes a specific way to compute the gradient, which is needed by all gradient-based optimization algorithms.
The gradient of EnOpt is determined on the basis of the cross covariance between randomly perturbed control variables (or decision variables) and the corresponding objective function values. Because this work is mainly concerned with the gradient approximation in various ensemble methods, hereafter, we will use EnOpt to refer to the gradient approximation in EnOpt where no confusion occurs. Traditional methods for gradient calculation include the finite-difference method (FDM) and adjoint-state method (Sun and Sun, 2015). FDM needs as many objective function evaluations as the product of the number of control variables and the number of ensemble members because FDM perturbs each control variable separately. The adjointstate method typically requires derivation and solution of a dual problem of the original problem in the adjoint state space, which is not straightforward. In comparison, EnOpt only requires as many objective function evaluations as the number of ensemble members, because it computes the search directions by averaging the objective function anomalies resulting from simultaneous random perturbations of the control vector. Previous studies have shown that the EnOpt can efficiently and satisfactorily achieve improvement of the objective function, despite its low computational cost (Chen et al., 2009;Oliver, 2010, 2012). However, Fonseca et al. (2017) indicated that EnOpt may not produce satisfactory results for multiple geological models unless the variance in the ensemble models is sufficiently small. The first objective of our work is to show mathematically and experimentally why EnOpt may fail to produce satisfactory results when the variance of the ensemble models is not small.
EnOpt can be considered a variant of the simultaneous perturbation stochastic approximation (SPSA) method introduced by Spall (1992), and SPSA is appropriate for robust optimization because the computational cost of SPSA is significantly lower than that of FDM for a high-dimensional control vector. Even though accurate gradients are obtained using FDM at a high computational cost, gradient-based optimizations are likely to converge to local optima. Rather than spending considerable computational resources computing the gradients, it is more practical to find global optima by trying many initial solutions using the less accurate but more computationally efficient SPSA. SPSA computes the gradient of an objective function more efficiently than FDM does by perturbing control variables randomly and simultaneously (Spall, 1992(Spall, , 1998. There are several variants of SPSA that can be used to compute the gradient of an objective function stochastically and quickly. Bangerth et al. (2006) introduced the integer SPSA to solve an optimal well placement problem. Li et al. (2013) applied SPSA for joint optimization of well placement and controls under geological uncertainty. Li and Reynolds (2011) proposed a modification of the SPSA, which is called the stochastic Gaussian search direction (SGSD or G-SPSA). The original SPSA samples perturbations from a symmetric Bernoulli distribution, while SGSD and EnOpt generate perturbations from Gaussian distributions (Chen et al., 2009;Li and Reynolds, 2011). Do and Reynolds (2013) used the simplex gradient (SG) that has the perturbation coefficient of 1 in the formulation of SGSD.
However, Bangerth et al. (2006), Li et al. (2013), Reynolds (2011), andDo andReynolds (2013) applied the variants of SPSA to optimization of a single geologic model, which means that geological uncertainty was not considered. Fonseca et al. (2017) proposed an extension of SG, which is named the stochastic simplex approximate gradient (StoSAG), that improves the accuracy of the stochastic gradient by repeating multiple perturbations for each ensemble model. In this study, we propose practical ways to reduce the computational cost of EnOpt, SG, and StoSAG by approximating the objective function values of unperturbed control variables using those obtained for the perturbed ones. The proposed approaches reduce about 10% to 50% of the computational cost compared to EnOpt, SG, and StoSAG in our examples. This paper is organized as follows. In the next section, we explain why EnOpt may fail when the variance of objective function values of the ensemble members is not small, by comparing the gradient approximation schemes in the original EnOpt and SG. Then we propose new hybrid schemes for further reducing computational costs in EnOpt, SG, and StoSAG. Finally, we demonstrate the efficacy of the different schemes using two examples, a test function that is popular for algorithm benchmarking and a well placement optimization problem for pressure management in geologic carbon storage reservoirs.

COMPARISON OF ENOPT AND SG
The steepest ascent or descent algorithm to maximize or minimize an objective function J(u) is given as where u is the column vector of control variables; u 0 is the initial guess; d is the search direction; α is the step size; and k is an iteration index. For convenience, all major notations used in this study are listed in the Nomenclature table attached at the end of the paper. In this problem, the objective function (J) is dependent only on u. van Essen et al. (2009) replaced J(u) with J(m, u) by adding another input (m) to the scalar function, where m is a random vector generated from a known probability density function. In optimization of well placement and controls, for instance, m may represent uncertain rock properties arising from geologic heterogeneity. J(m, u) is dependent on both m and u, but only u is a control vector. Because J(m, u) inherits the uncertainty of m, van Essen et al. (2009) suggested to maximize the approximate expectation of J(m, u) for m i (i = 1, 2, 3, . . . , N e ) sampled from a given probability density function (Fonseca et al., 2017).
where 1/N e N e i=1 J (m i , u) and J(m i , u) are called the objective function and the J-function, respectively, to avoid confusion. The former is marginalized over all ensemble members, while the latter corresponds to a single ensemble member.
Ensemble-based gradients (EnOpt, SG, and StoSAG) generate the objective function anomalies by stochastically sampling perturbations for u from a multivariate Gaussian distribution with mean u k and covariance matrix C u . The perturbed vector of u at the kth iteration is denoted bŷ u k . In the following, we explain situations that EnOpt may underperform. We then proceed to introduce several hybrid schemes that can significantly reduce the computational costs of the ensemble-based optimization in general while mitigating the underperformance of EnOpt. SG is given by Do and Reynolds (2013) as.
Equation (3) presented by Do and Reynolds (2013) was applied to a single model. However, once the objective function is replaced with Equation (2) in the formulation of Do and Reynolds (2013), Equation (3) can be applied to multiple models (or robust optimization). Equation (3) is identical to StoSAG with a single repetitive perturbation. Hereafter, we will use SG to refer to StoSAG with a single repetitive perturbation. The search direction of EnOpt is given by Chen et al. (2009).
Compared to Equation (3), the unperturbed vector of control variables (u k ) and the corresponding J(m i , u k ) are approximated by their corresponding ensemble means (û k and J m,û k ) in Equation (4) that are further defined as in Equations (5) and (6) (Chen et al., 2009;Do and Reynolds, 2013;Fonseca et al., 2017). In Equation (4) Fonseca et al. (2017) pointed out that EnOpt may produce unsatisfactory results because the two assumptions given in Equations (5) and (6) are invalid unless N e is sufficiently large or the variance in the prior model for m is sufficiently small. Equation (5) is usually valid because perturbations for u k should be sufficiently small to approximate ∇ u E m [J(m, u)]. Equation (6) is likely to be invalid if high variations in model parameters such as permeability and porosity cause large variations in Jfunction values. However, even though the variance of m is small, other variables in the J-function such as unit costs may still make large variations in the J-function values. The claim of Fonseca et al. (2017) can be expressed more quantitatively by simply manipulating Equation (4): From Equation (7), it follows that where d k,EnOpt /||d k,EnOpt || ∞ ≈ d k,SG /d k,SG || ∞ is true if ||r k || 2 is sufficiently small compared to d k,SG . However, as ||r k || 2 increases, d k,EnOpt becomes more inaccurate than d k,SG . In r k in Equation.
(8), because perturbations for u k should be small enough to approximate a search direction accurately, ||r k || 2 is significantly dependent on J (m i , u k ) − J m,û k . Thus, ||r k || 2 is closely related to the variance in J(m, u k ), which is σ 2 J(m,u k ) as given in Equation (9): The norm ||r k || 2 is not exactly the same as the variance in J(m, u k ), but ||r k || 2 is expected to increase as σ 2 J(m,u k ) increases because J (m, u k ) ≈ J m,û k . On the basis of the observation that the approximation of EnOpt becomes inaccurate as the variance in J(m i , u k ) increases, we propose a hybrid gradient of EnOpt and SG in the next section that first clusters ensemble members based on the variance of J m i ,û k and then approximate J(m i , u k ) using J m i ,û k within each cluster. Thus, the hybrid gradient is more computationally efficient than SG because the objective function for the unperturbed control vector in some ensemble members does not need to be evaluated. We also propose two additional practical measures that can save the computational cost of EnOpt and StoSAG.

COMPUTATIONAL COST REDUCTION OF ENOPT, SG, AND STOSAG
Here, we introduce three new formulations to save on the computational cost of EnOpt, SG, and StoSAG. The new formulations approximate the objective function values for unperturbed control variables using the objective function values for perturbed ones in the formulations of EnOpt, SG, and StoSAG. Thus, these new formulations require fewer J-function evaluations than that by the original ensemble-based gradients.
In EnOpt, the J-function for unperturbed control variables does not need to be evaluated for the search direction, but it needs to be evaluated for the objective function in Equation (2). However, the means of J(m i , u k ) and J m i ,û k,i are similar because the perturbations on u are small. Thus, E m [J(m, u)] can be approximated using J m i ,û k,i as given in Equation (10): We call this the modified EnOpt (ModEnOpt), which uses Equation ( To save the computational cost of SG, we propose a hybrid gradient of EnOpt and SG, which is named the hybrid simplex gradient (HSG) method, on the basis of the observation that EnOpt provides satisfactory search directions if the variance in J(m, u k ) is small. HSG clusters the ensemble members based on J m i ,û k,i and then uses the cluster mean instead of J(m i , u k ) for clusters that have more than one member as given in the first term of Equation (11), which is close to the search direction of EnOpt given in Equation (4). In the second term of Equation (11), J(m i , u k ) should be evaluated for the clusters that have only a single member, which is close to the search direction of SG given in Equation (3). The search direction of HSG is given by HSG uses a different approximation of E m [J(m, u)] given in Equation (12), instead of Equation (2): In Equation (12), the J-function values for perturbed control variables are used to approximate E m [J(m, u)], but the J-function values for unperturbed ones are used for clusters that have only a single member. In Equation (11), the N e ensemble members (models) are grouped based on [J(m i , û k,i )]. However, determining the optimal number of clusters is still a challenging problem in data clustering (Jain, 2010). Furthermore, even though models are grouped to the optimal number of clusters, some groups might have significantly different J m i ,û k,i values because cluster algorithms group the models into the number of clusters unexceptionally regardless of how similar J m i ,û k,i values are in a group. For this reason, a mean of J m i ,û k,i in a group might not be properly representative of J(m i , u k ) of the group members. Thus, rather than trying to determine the optimal number of clusters, we use a predefined criterion to determine if models have similar J m i ,û k,i . A standard deviation is the most common indicator of how dissimilar data are, but the standard deviation should be normalized in our clustering problem. For example, let us assume that there are two data sets (1, 2, 3) and (198,200,202). The standard deviation of (1, 2, 3) is smaller than that of (198, 200, 202), but (198, 200, 202) has relatively small differences in terms of magnitude compared to (1, 2, 3). In our clustering problem, group members should have relatively similar J-function values within each group. Thus, for the predefined criterion, we use the coefficient of variation shown in Equation (13) instead of standard deviation to normalize the standard deviation. coefficient of variation (CV) = standard deviation mean (13) Thus, the N e models are grouped so that the coefficient of variation in each cluster is smaller than a predefined value (CV HSG ). However, during clustering, a model should be assigned to a cluster such that the coefficient of variation becomes minimal. We introduce an algorithm to find groups that have small coefficient of variations that are lower than a predefined coefficient of variation. First, N e models have random cluster indices where the initial number of groups (N c ) is the same to the number of models (N e ). Then whether the coefficient of variation of a group can be reduced by adding a model to the group is examined where the model makes the coefficient of variation of the group minimum. For example, let us assume that we try to select a cluster for a model between clusters A and B. The coefficients of variation of both A and B are smaller than those of CV HSG . If the coefficients of variation of A and B (after including the model) are 0.001 and 0.002, respectively, then the model should be put in cluster A. This is repeated until the coefficient of variation of the group does not become smaller. Finding models that make the coefficient of variation of other groups smaller is repeated. Other details of the proposed algorithm are described in Algorithm 1. In HSG, we use Algorithm 1 to make the coefficients of variation of clusters smaller than CV HSG and to drop the coefficients of variation of clusters. The procedure of clustering is given as follows: The input of Algorithm 1 includes a predefined coefficient of variation, CV HSG , and the J-function values for perturbed control variables. The number of clusters does not need to be inputted for Algorithm 1. The k-means clustering algorithm (MacQueen, 1967), which is one of the most commonly used clustering algorithms, cannot be used in this case because it requires the number of clusters to be specified and it tends to group members that are relatively close to each other. Determination of the CV HSG value depends on how much computational cost of HSG is expected to be saved compared to SG. For example, if 70% of the computational cost is expected to be saved using HSG compared to SG, then CV HSG is set to a number that makes the number of clusters 70% of the number of ensemble members. CV HSG can be chosen based on the initial J-function values of an ensemble.
HSG is equivalent to ModEnOpt and SG for large and small CV HSG , respectively, where the sum is divided by N e -1 in d k , ModEnOpt , but this is canceled by ||d k , ModEnOpt || ∞ in Equation (1). For large ||r k || 2 , a small CV HSG should be used because the approximate gradient of ModEnOpt is inaccurate. HSG takes N e ∼ 2N e J-function evaluations where N e and 2N e correspond to the number of J-function evaluations of ModEnOpt and SG, respectively. The search direction of StoSAG is given by Fonseca et al. (2017) as StoSAG repeats multiple perturbations (N p ) for each ensemble member, while SG takes a single perturbation for each ensemble member. StoSAG provides more accurate search directions than SG does because StoSAG takes more J-function evaluations to compute the search direction than SG. StoSAG needs N e (N p + 1) J-function evaluations, and this is (N p + 1)/2 times as many J-function evaluations as SG requires. The computational cost of StoSAG can be reduced by approximating the J-function values of unperturbed control variables using the J-function values of perturbed ones as given below: The modified StoSAG (ModStoSAG) uses different approximations of E m [J(m, u)] given in Equation (17) instead of Equation (2): (17) ModStoSAG requires N e N p J-function evaluations, and it can save 1/(N p + 1) of the computational cost of StoSAG needed to compute a search direction. A summary of the ensemblebased stochastic gradient methods discussed thus far is presented in Table 1. ModEnOpt, HSG, and ModStoSAG use different objective functions, but the objective function in Equation (2) is calculated at initial and final steps for all the formulations to compare them in the next numerical examples.

Rosenbrock Function
The performance of the six formulations given in Table 1 is tested using the Rosenbrock (1960) function, which is widely used for benchmarking optimization solvers. The Rosenbrock function is given by where u = u 1 u 2 · · · u N u T and i = 1, 2, . . . , N e . m i is a constant  (19), and 100 J-function evaluations are needed to calculate the objective function. Two values of σ m (0.01 and 1.00) are used to generate two sets of m 1 , m 2 , . . . , m 100 for the 100 ensemble members. A larger σ m makes a larger variance in J(m i , u) and larger ||r k || 2 , which causes inaccurate search directions of EnOpt and ModEnOpt. Perturbations for u are generated using N(0, C u ) where C u is a diagonal matrix and the diagonal elements are 0.001 2 . N p is set to 3 for both StoSAG and ModStoSAG.
CV HSG is set to a number that makes the number of clusters about 70 (=0.7 · 100). CV HSG for σ m = 0.01 and 1.00 are  1E−5 and 5E−5, respectively, which were determined based on the J-function values for the initial solution (u i = 2.0 for i = 1, 2, . . . , N u ). The performance of the stochastic gradients is compared to the gradient obtained using FDM (Sun and Sun, 2015). A search direction of FDM is computed using Equation (20), and N e (N u + 1) function evaluations are needed.
The optimization problem is solved using the steepest descent method given in Equation (1). The optimization is terminated if either of the following two conditions is satisfied: (i) the relative increase of the objective function is <0.0001%, or (ii) the relative change of the norm of u ku k+1 is <0.01%. Figure 1 shows the optimization results for σ m = 0.01. Figures 1A,B show the results of a single run and the average result of 100 runs, respectively. Because perturbations are stochastically generated for the stochastic gradients (EnOpt, ModEnOpt, SG, HSG, StoSAG, and ModStoSAG), the optimization runs using the stochastic gradients show the different numbers of J-function evaluations and relative objective function values for iterations. For this reason, in Figure 1B, the 100 optimization runs are repeated, and the relative objective function values and the total number of J-function evaluations are averaged for each iteration. In Figure 1, because σ m (=0.01) is small, the variance in J(m i , u) is also small, and the small variance in J(m i , u) leads to small ||r k || 2 . Because EnOpt and TABLE 2 | Numbers of function evaluations that are needed to achieve 5% of the relative objective function value in Figure 1B.

Method
Number of J-function evaluations ModEnOpt provide accurate search directions for small ||r k || 2 , EnOpt and ModEnOpt find satisfactory solutions as the other gradient methods do. Table 2 shows the numbers of function evaluations that are needed to achieve 5% of the relative objective function value in Figure 1B. ModEnOpt, HSG, and ModStoSAG saved about 47, 33, and 34% of the function evaluations compared to EnOpt, SG, and StoSAG, respectively. In Table 2, ModEnOpt took the smallest number of function evaluations, and EnOpt needed a smaller number of function evaluations than SG did.
However, EnOpt and ModEnOpt do not achieve satisfactory reduction of the objective function value for σ m = 1.0 as shown in Figure 2A. In Figure 2B, EnOpt and ModEnOpt show the abnormal relation between the number of J-function evaluations and relative objective function values because EnOpt and ModEnOpt do not reduce the objective function value sufficiently for all the 100 optimization runs. The large σ m (=1.0) results in a large variance in J(m i , u) and, consequently, large ||r k || 2 . For this reason, EnOpt and ModEnOpt provide inaccurate search directions and unsatisfactory optimization results. The average angles between the search directions obtained using FDM and the stochastic gradients for σ m = 1.0 are given in Table 3. The  angle between two search direction vectors is calculated using Equation (21): where u and v are vectors corresponding to search directions. The angle calculations are repeated 100 times and averaged. The search direction obtained using FDM is considered to be the most accurate, and a higher angle from the FDM search direction implies less accuracy. It can be seen from Table 3 that SG, HSG, StoSAG, and ModStoSAG provide acceptable search directions, while EnOpt and ModEnOpt provide inaccurate search directions. The number of Jfunction evaluations for the same relative objective value increases from HSG, SG, ModStoSAG, and StoSAG, as shown in Figure 2B.

Problem Formulation
As a realistic example, we now consider the optimization of brine extraction well placement in geological carbon sequestration (GCS) applications. CO 2 injection into saline aquifers or depleted oil and gas reservoirs necessarily leads to pore pressure increases. The pore pressure buildup not only can affect the CO 2 injectivity and storage performance but may also cause caprock damage, fault reactivation, induced seismicity, and leakage of brine and CO 2 , posing severe problems to CO 2 long-term storage permanence and public safety (Birkholzer et al., 2012;Carroll et al., 2014;Cihan et al., 2015). Recently, active reservoir pressure management was proposed as a mitigation measure, which installs one or more brine extraction wells to reduce pressure buildup in the reservoir (Bergmo et al., 2011;Buscheck et al., 2011Buscheck et al., , 2012Birkholzer et al., 2012;Cihan et al., 2015;Arena et al., 2017). The extraction, treatment, and disposal of the extracted brine, however, impose additional expenses to GCS operators (Cihan et al., 2015). Thus, the placement and control of brine extraction wells need to be optimized to improve the economic feasibility of GCS projects. Cihan et al. (2015) optimized well placement and controls of brine extraction wells in different geological models to minimize extracted brine volume and keep pressure buildups under critical thresholds for potentially activating fault leakage and/or fault slippage. In their study, Cihan et al. (2015) adopted a constrained differential evolution algorithm, which is a heuristic stochastic evolution algorithm similar to the genetic algorithm. Here, we demonstrate the use of the more efficient ensemble-based gradient algorithms. Optimal placement of a brine extraction well is sought in multiple geological models (i.e., geologic uncertainty) to maximize the objective function, which is expressed in the form of the net present value. Performance of the six formulations given in Table 1 is compared. The J-function for a single model, m i , is given by Frontiers in Earth Science | www.frontiersin.org − Next k=1 c be · q n be,k − c ce · q n ce,k − N leak l=1 c bl · q n bl,l − c cl · q n cl,l      f ciq q n ci,j = if q n ci,j < q n ciq , c ciq q n ciq − q n ci,j otherwise, 0 where the control vector u is a two-dimensional column vector including I and J indices of brine extraction wells. In words, the objective function in Equation (22) can be described as the tax credit for injected CO 2 minus the penalty for unfulfilled CO 2 injection, the cost of brine extraction wells, and the damage cost related to brine and CO 2 leakage.
FIGURE 3 | Three-dimensional view of log 10 permeability (md). The z-axis is exaggerated by a factor of 20.
The brine extraction wells are vertically perforated in a CO 2 injection zone, and I and J indices are rounded off to integers during iterations. t n is the size (days) of the nth time step in the reservoir simulation, and b is the annual discount rate. N inj , N ext , and N leak are the numbers of CO 2 injection, brine extraction, and leaky wells, respectively. q n ci,j , q n be,k , and q n ce,k represent the average CO 2 injection rate at the jth CO 2 injector, the average brine extraction rate at the kth brine extractor, and the average CO 2 extraction rate at the kth brine extractor for t n , respectively.  Mean of log 10 horizontal permeability (md) 2 Standard deviation of log 10 horizontal permeability (md) 0.6 Correlation coefficient between porosity and log 10 horizontal permeability 0.7 Correlated direction of porosity and log 10 horizontal permeability North-south Correlation lengths (major, minor, vertical) of porosity and log 10 horizontal permeability (m) 500, 300, 10 FIGURE 4 | Top view of log 10 permeability (md) of 3 of 20 geological models, and the locations of an injector (INJ) and two leaky abandoned wells (L1 and L2).
Frontiers in Earth Science | www.frontiersin.org One of the main risks related to GCS is leakage from abandoned wells-if abandoned wells are leaky, then brine and CO 2 can migrate to an overlying formation along the leaky wells (Birkholzer et al., 2012;Sun et al., 2013). Because the risk related to leakage must be minimized for safe long-term CO 2 storage, abandoned wells are assumed to be leaky for conservative estimation. In Equation (20), q n bl,l and q n cl,l are the brine leakage rate at the lth leaky abandoned well and the CO 2 leakage rate at the lth leaky abandoned well for the nth time step. If brine extraction wells are placed near the leaky wells, the brine and CO 2 leakage amount decreases because the installed brine extraction wells reduce the pressure buildup at the leaky wells; on the flip side, the brine extraction wells also need to be shut in early if they are placed near CO 2 injectors because they have early CO 2 breakthroughs. The brine extraction wells should be placed at locations that minimize the leakage costs at the leaky wells while postponing CO 2 breakthrough at the brine extraction wells as late as possible. r ci is the credit for CO 2 storage, which is often provided by the government subsidy or driven by the carbon trading market (Jahangiri and Zhang, 2012;Allen et al., 2017). In Equation (23), q n ciq is the minimum required CO 2 injection rate for t n , and a penalty is imposed when the injection rate cannot be met, q n ci,j < q n ciq , in which case the upstream capturing facility needs to find an alternative means for temporary storage. In Equation (20), c be , c ce , c bl , and c cl are the unit costs of brine treatment, CO 2 reinjection, brine leakage, and CO 2 leakage, respectively.
Overall, the brine extraction well placement optimization is complex, especially when geologic uncertainty is involved. The ensemble gradient methods are well-suited to solve such problems because of their efficiency and the ability to incorporate geologic uncertainty. As a demonstration, we consider the threedimensional reservoir model shown in Figure 3, which is 2.55 km (x-axis) by 2.55 km (y-axis) by 30 m (z-axis), and the dimensions of a grid block are 50 m by 50 m by 10 m. The vertical structure consists of three formations, which are named the above zone, caprock, and injection zone. CO 2 is injected at 30 tons/day at the center of the injection zone for 5 years, and the maximum bottom-hole pressure is 20,000 kPa. Brine is extracted at an extraction well at 60 m 3 /day, and it is shut in if the ratio of produced CO 2 volume to produced brine volume is >100. The caprock blocks the flow between the above and injection zones, but brine and CO 2 can vertically flow up along two leaky abandoned wells where the leaky wells are located 300 m north and east of the injector in Figure 4. The unit cost data needed for calculating the J-function in Equations (22) and (23) are given in Table 4.
Heterogeneous porosity and log 10 permeability fields are generated using the sequential Gaussian simulation and the sequential Gaussian co-simulation modules of SGeMS (Remy et al., 2011). Porosity and log 10 permeability follow normal distributions, and their statistical parameters are given in Table 5. Figure 4 shows three realizations of log10 permeability out of a total of 20 geological models. The porosity and permeability of the caprock zone are assumed deterministic and are assigned values of 0.1 and 0.001 md, respectively. The flow simulation is conducted using a compositional multiphase reservoir simulator, CMG-GEM (CMG, 2015). The two leaky abandoned wells are described using the local grid refinement of CMG-GEM, and the porosity and the vertical permeability of the leaky wells are set to 0.2 and 1,000 md.
The convergence criteria are as follows: (i) the relative increase of the objective function is <0.1%, or (ii) the relative change of the norm of u ku k+1 is <1%.

RESULTS AND DISCUSSION
The objective of this optimization problem is to find the optimal location of a brine extraction well that maximizes the mean of the J-function values given in Equation (22) of the 20 geological models. As described in Equation (22), a solution for a brine extraction well is a two-dimensional vector including I and J indices of the brine extraction well. Thus, in this example, the vector of a search direction has two elements for the I and J indices of a brine extraction well, and the I and J indices of a solution are updated at the same time using the search direction. The optimal solution is sought using the six stochastic gradients, and then the performance of the six stochastic gradients is compared. In HSG, CV HSG is set to a number that makes the number of clusters about 80% of the number of ensemble members. N p = 2 is used for StoSAG and ModStoSAG. Figure 5 shows the brine extraction well locations at the iterations obtained using HSG for four different initial solutions. In general, initial solutions for optimization are sampled in a prior distribution, but the initial solutions located at the four corners in Figure 5 are selected because optimal solutions are expected to be located between CO 2 plume and the boundary. The four different initial solutions are fixed to compare the performance of the stochastic gradients. As described in the previous section, a brine extraction well should be placed as close to the CO 2 injector and the leaky wells as possible to mitigate the reservoir pressure buildup, as well as CO 2 and brine leakage amounts. However, the brine extraction well is shut in early if the brine extraction well is placed in the extent of the CO 2 plume, which changes for different brine extraction well locations because the CO 2 migration is affected by the reservoir pressure drawdown caused by FIGURE 6 | Average CO 2 saturation map of 20 geological models after CO 2 is injected for 5 years when placing a brine extraction well at the optimal location (I = 31 and J = 18). INJ, EXT, and L1 and L2 represent the CO 2 injector, the brine extraction well, and two leaky abandoned wells, respectively. the brine extraction well. Furthermore, the reservoir pressure buildup and drawdown, the CO 2 and brine leakage amounts, and the CO 2 plume extent are significantly affected by the heterogeneity of rock permeability in the 20 geological models. As shown in Figure 6, a brine extraction well should be placed out of the CO 2 plume extent and as close to the CO 2 injector and the leaky wells. (11, 11) and (40, 11) shown in Figures 5C,D are converged to the same solution, which is (31, 18) shown in Figure 6. Figure 7 shows the numbers of simulation runs and objective function values of the six stochastic gradient methods for four initial solutions. In Figure 7, the number of simulation runs is how many times the flow simulator is conducted until an iteration is finished. For example, if the suite of the 20 models is simulated five times until an iteration ends, then the number of simulation runs is 100.
EnOpt and ModEnOpt do not achieve satisfactory objective function values compared to the others. This implies that r k 2 in Equation (8) or the variance in J(m i , u) is too large for EnOpt and ModEnOpt to provide acceptable accuracy of search directions. In Figure 7, for the same number of simulation runs, HSG and ModStoSAG reach higher objective function values than SG and StoSAG do.

CONCLUSION
Ensemble-based optimization algorithms are widely used for reducing the computational costs of optimization, especially when the forward problem requires a significant amount of time to run. In this work, we theoretically and experimentally showed when EnOpt may fail to achieve satisfactory performance. If r k 2 in Equation (8) or the variance in J(m i , u) is large, EnOpt produces unsatisfactory optimization results because the search direction of EnOpt is inaccurate as shown in the Rosenbrock function example. We also introduced hybrid schemes to reduce the computational costs of EnOpt, SG, and StoSAG. For the benchmark example and the geological carbon sequestration example, the computational costs of EnOpt, SG, and StoSAG can be significantly reduced by replacing the J-function values for the unperturbed control variables with those for the perturbed ones. The ensemble-based optimization schemes proposed in this study are generic and can be readily used on other types of problems involving computationally expensive forward simulations or optimization under uncertainty.

DATA AVAILABILITY STATEMENT
The datasets generated for this study will not be made publicly available. It is a confidential data set. Requests to access the datasets should be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
HJ was a main author to make research results and wrote this manuscript. HJ had technical discussions with AS, JJ, BM, and DJ.