Accelerating Physics-Based Simulations Using End-to-End Neural Network Proxies: An Application in Oil Reservoir Modeling

We develop a proxy model based on deep learning methods to accelerate the simulations of oil reservoirs–by three orders of magnitude–compared to industry-strength physics-based PDE solvers. This paper describes a new architectural approach to this task modeling a simulator as an end-to-end black box, accompanied by a thorough experimental evaluation on a publicly available reservoir model. We demonstrate that in a practical setting a speedup of more than 2000X can be achieved with an average sequence error of about 10% relative to the simulator. The task involves varying well locations and varying geological realizations. The end-to-end proxy model is contrasted with several baselines, including upscaling, and is shown to outperform these by two orders of magnitude. We believe the outcomes presented here are extremely promising and offer a valuable benchmark for continuing research in oil field development optimization. Due to its domain-agnostic architecture, the presented approach can be extended to many applications beyond the field of oil and gas exploration.


INTRODUCTION
Reservoir modeling plays an essential role in modern oil and gas exploration.After an acquisition of a reservoir field, energy companies plan the field development based on a capital expense of billions of dollars.A placement of even a single well in a bad spot can mean a significant economical loss for the developer.A reservoir model (RM) is a computerized representation of the field drawing from various data sources, such as geological expert analyses, seismic measurements, well logs, etc., with added properties determining the dynamic reservoir behavior.Its primary purpose is to allow optimization and better prediction of the field's future output using mathematical simulators.Given a set of input actions (well drilling), a simulator operates on an RM by solving large systems of nonlinear PDEs to predict future outcomes over long time horizons (up to tens of years).
RM simulators may require a considerable amount of computation (time) to produce predictions (minutes, hours, or days, depending on the RM size).Current optimization techniques in reservoir engineering are therefore able to simulate only a small number * Author is currently with OriGen.AI, New York, NY of cases.In our context, the problem of field development is formulated as an optimization over a very high number (millions) of candidate sequential well placements.As such, the solving time of a simulation quickly becomes the bottleneck rendering methods such as Monte Carlo Planning and Reinforcement Learning impractical.While the idea of creating an approximating surrogate, or proxy, is not new, previous approaches targeted accelerating steps inside the physics-based simulator (see Section 2).Instead, we pursue a deep learning approach to learn approximating essential output variables of a PDE-solver using training data generated by that solver, and demonstrate its performance and accuracy at predicting production rates on a benchmark RM with complex geological properties, namely the SPE9 [19].

PREVIOUS WORK
Accelerating reservoir simulations has a wide literature.State of the art techniques can be classified in two categories: 1) reducing the complexity of the PDEs while providing an acceptable loss of prediction accuracy (e.g., reduced order modeling [9] can accelerate the simulations by factors of 10 2 and has been found useful in optimization and control [12], and upscaling which achieves acceleration with coarser reservoir models [6]), and (2) simple polynomial interpolation techniques computing the objective function (e.g.Net Present Value, or NPV) or to characterize the uncertainty [23].It is important to note that while the latter techniques are fast their accuracy tends to be poor.
Artificial Neural Networks (ANN) have been proposed to accelerate oil reservoir simulations and aim to achieve both objectives: accuracy and acceleration.A special issue of Journal of Petroleum Science and Engineering in 2014 was devoted to this topic, and a good summary can be found in the editorial [2].As outlined in this summary, ANN approaches can be categorized as (1) physics-based models, and (2) data-driven models.Physics-based models use the PDE structures as features.Recent example from this category is application of Convolutional Neural Networks (CNN) to modeling flow around complex boundaries for the purpose of accelerating animation [21].In our preliminary investigation, CNNs gave acceptable results modeling short time horizons, however, significant error accumulation over longer time spans of a reservoir model made this approach impractical.The challenge to purely data-driven approaches is the extreme nonlinearity of the reservoir dynamics.
It seems important to incorporate features of the reservoir.Examples of basic, fully connected ANNs to predict reservoir production are [26] and [25].In the context of pre-existing literature, we believe our contribution is twofold: (1) our architectural approach is unique as it deals with sequential action input and output of varying span, reservoir uncertainty, and optimized well control, and (2) we present a thorough experimental analysis on a publicly available reservoir model thus creating a valuable reference for future comparison by the community.

RESERVOIR MODEL SIMULATION
The Reservoir Model (RM) simulation considered in this work is based on a so-called Black Oil model [22] that describes the flow of reservoir three fluids through porous rock: oil, water, and gas.

Physics
The Black Oil model equates the time change of mass in a region with the mass flux across the region boundary.The flux is driven by pressure differences caused by well operations.The system of equations is highly nonlinear due to the non-stationary interactions between rock types and fluids.Rock compressibility and relative permeability are altered by changes in fluid pressure and saturation levels.In addition, petroleum fluids will undergo phase changes, between gas and liquid form, as they move through pressure and temperature gradients.An accessible introduction into the model and inner-workings of Black Oil Simulators (BOS) can be found in [15].
A BOS is a basic discrete-time finite-volume simulation.The reservoir is partitioned into cells and the values of the primary variables: oil pressure, water saturation, and gas saturation, which are evaluated over a sequence of time steps out to a fixed time horizon.
At each time step, the BOS solves the mass balance equations: for each cell c and each fluid f , where In a real reservoir model there may be up to 10 6 cells and 10 3 time steps, so the number of nonlinear equations to be solved simultaneously can be on the order of 10 9 , although some practical problems are smaller.Depending on the degree of the nonlinearity and the length of the time step, the set of equations to be solved for each time step can require many iterations of a nonlinear equation solver.A high-fidelity reservoir simulation for industrial applications can take hours or days to complete one simulation.

Reservoir Uncertainty
Each cell has a rock type, for example sandstone or shale, which defines important properties, such as compressibility and permeability.However, due to the depth of reservoirs and the complex geology surrounding their formation, the attribution of a rock type to any given cell is somewhat uncertain.
An important notion is one of a realization, representing a particular spatial distribution of rock type across the grid cells of the RM -a distribution that comes with a natural uncertainty [15].This can be modeled by three-dimensional probability density functions, called variograms, that are calibrated from geological knowledge, seismic data, and cores from test wells.In a typical application, analysts will work with a few hundred samples (referred to as Realizations).To take this uncertainty into account in our experimental setup, we have used 500 realizations generated from the original RM according to [5] with subsequent porosity and permeability calculation described in [16].

Wells
Another important element in BOS are wells.They have two functions: (1) the production of commercially valuable fluids (referred to as Producer wells), and (2) the management of pressure differentials and fluid saturations in the reservoir.Some wells are drilled specifically to inject water or gas (referred to as Injectors).
The simulator predicts the impact of well operations which consist of two types of decisions: (1) the location and time sequence of well completions, and (2) the control of the production or injection rate.The ultimate goal is to maximize the expected NPV.

BOS Implementations
Our work relies on the following BOS packages: (1) Open Porous Media (OPM) -an open source project supported by a consortium of companies and academic institutions [18], and (2) Eclipse [7] -a commercial software.

SPE BENCHMARKS
Our study draws data from one (of several) RM considered reference within the oil and gas industry: the SPE9 model [13,19].The model consists of a 24 × 25 × 15 grid and we used a varying number of injector and producer wells as will be described later.One particular realization of the SPE9 RM is shown in Figure 1 illustrating the shape (and incline), grid structure and property (water pressure in this example) distribution within the RM.

LEARNING TO PREDICT PRODUCTION RATES -THE PROXY APPROACH 5.1 Task
The main targets for the proxy model to predict are a set of variables referred to as production rates, taking into account interactions between wells, fluid, and rock.More specifically, given a particular RM realization and a series of drilling actions, a simulator predicts rates at which the wells produce or inject fluids at future times (referred to as time steps).A major application of such simulations is in Field Development Planning [11], where the total production rates constitute economic metrics, such as the Net Present Value (NPV).Individual rates of each well drilled may also be of importance and we will include these in our investigations as well.
Considering the above, we define the task of proxy modeling as follows: Given: (1) Sequence of actions.An action encapsulates a drilling decision, in our case one of {"Drill a Producer Well (P)", "Drill an Injector Well (I)", "Do Nothing (X)"}, accompanied by applicable coordinates on the reservoir surface.(2) Realization ID.This ID maps to a known distribution of rock properties in the reservoir.Predict: (1) Field rates, i.e., aggregate output of the entire field, typically including oil, and water production as well as water injection over a desired future horizon in (equidistant) time steps.(2) (Optional) Individual well rates, i.e., individual output of each producer well placed so far, typically oil production is of interest An example of an action sequence suitably encoded could be as follows: x-P(0,15)-I(2,5)-x-x-P(23,19)-...-x where P(0,15) refers to drilling a Producer at surface location (0,15), etc. Continuing on this example, an expected output could be a series of rates predicted at increasing time steps, each representing a 30-day interval.In our experiments we used action sequences of length 20, and varying prediction spans (horizon) of 20 to 40 months.

Model
5.2.1 Encoder-Decoder Architecture .Given the above task definition, our method adopts the sequence-to-sequence approach based on recurrent neural networks (RNNs), specifically a version of a gated RNN known as Long-Short Term Memory (LTSM) cell [10], arranged in a encoder-decoder architecture.This approach has gained popularity across various application fields [8,20,24] and excels at modeling temporal sequences of (typically discrete) variables.In our setting, however, the model performs a continuous-variable regression, as will be explained more in detail below.Figure 2 illustrates our solution.In this view, the series of drilling actions, along with additional geological information is the input sequence, X = {x i } 0≤i <K , while the series of production rates is the output sequence, Y = {y t } 0≤t <T , y t ∈ R D , where D is the number of fluid phases.Here, x t represents a union of potentially heterogeneous features, such as discrete and multivariate continuous variables.Later we will discuss how X is transformed into a common space via embeddings.The role of the Encoder is to capture the information about the input X and pass it to the Decoder to generate an accurate prediction Ŷ about Y .Both the encoder and the decoder involve multiple layers of LSTMs in our architecture.[3] to aid modeling the causal nature between the actions and the output.In order to generate a prediction ŷt , the attention mechanism applies a probability distribution (a "mask") over the individual input actions to emphasize actions particularly relevant to produce y t .The mask itself is generated by a layer trained jointly with the rest of the network parameters.In contrast to a typical setup where the attention is applied on the upper-most encoder layer, we have found that concatenating encoder states from all layers outperformed the former.

Decoding Variants.
We investigate the following modes of operation with respect to the decoder: • Ground Truth (GT) Training, in which the decoder is provided the GT, i.e., {y t } 0≤t <T as input at each time step.1: Datasets and their partitioning both training as well as inference (test).This mode is a novel variant with benefits demonstrated in Section 5.5.6.

Experimental Evaluation
5.3.1 Datasets.In order to develop various aspects of the neural network architecture, we have used the two BOS mentioned in Section 3, namely, OPM and Eclipse.While the OPM was used in majority of the tests, Eclipse was used to generate simulations with well control optimization -a feature not yet available in OPM.
Two main datasets were created: (a) 22k simulations via OPM (fixed well control), and (b) 32k simulations via Eclipse (well control optimized).Each such set was partitioned into training (TRAIN), validation (VALID), and test (TEST) sets, as shown in Table 1, maintaining a proportion 80%, 10%, and 10%, respectively.Additionally, a large dataset involving 163k OPM simulations (OPM-163k) was collected to examine effects of varying training set size on the resulting error metrics.We want to emphasize that the OPM-163k only serves exploratory purposes of this paper and, in its size, would unlikely be a practical choice due to the considerable computational burden required to generate the corresponding simulations.In selected experiments (Section 5.5.2), the OPM-22k was further downsampled to investigate dependency on the training size.
Each simulation input consists of a uniform random choice over 500 realizations (see Section 3) and an action sequence of length 20.For each of the 20 actions, a decision to drill was made with 99% probability, with a ratio 5 : 1 in favor of drilling a producer.The well location follows a uniform random distribution with the constraint of not drilling within a 2-cell neighborhood of pre-existing wells.

Features and Preprocessing.
Actions and Realization.The primary information to be encoded in the input X = {x k } 0≤k <K are the drilling decisions (see Section 5.1).Since the SPE9 reservoir has dimensions 24 × 25 × 15 grid cells, initially, a joint action-location encoding was defined on a discrete set of 1201 (2x24x25+1) symbols.An alternative separate encoding for the action (3 symbols), the x-coordinate (24 symbols), and the y-coordinate (25 symbols), however, outperformed the joint encoding in our experiments, as will be shown below.
A realization ID associated with an input sequence is attached to each x k as a discrete variable.
In order to provide the neural network with the above input, the discrete variables are "embedded" in a continuous space of certain dimensionality (a hyperparameter), similar to a method of embedding words in Natural Language Processing [17] Geological Features.Hypothesizing that local geological properties influence the flow through the wells the most, we added features related to the neighborhood of the well locations.More specifically, the following local (per-cell) features were considered: (a) rock type (shale, sandstone), (b) porosity, (c) permeability in horizontal and vertical direction.In our case, 5 cells at the bottom (for injectors) and 5 at the top (for producers) are affected by drilling, resulting in 20 real-valued features concatenated to form the final feature vector (see Table 2).
Standardization.Standardization was applied to the output variables before modeling, via a linear transform y ′ (k ) = (y (k) − µ k )/σ k with y (k ) being the k-th dimension of the output vector and µ k , σ k denoting the mean and standard deviation estimates obtained from the corresponding dimension over the TRAIN partition.The variables y ′ are transformed back to their physical range using the inverse transform.All experimental evaluation is then performed on unnormed output.

Error Metric.
The core metric in our experimentation is the prediction error relative to target (GT) generated by the simulator.We base this metric on an L2 norm.Let Y k = {y kt } 0≤t <T and Ŷkt = { ŷkt } 0≤t <T denote target and predicted values for a simulation k of length T .Let K denote the total number of simulations in the test set.While the reservoir simulator produces rates at each time step, for practical use we are interested in the cumulative output.Hence, the cumulative value z kt = t τ =0 y kτ is calculated forming a vector z k = [z k 0 , ..., z kT −1 ] (and similarly for ẑkt ).The (relative) error of a simulation k is then defined as follows Note that the denominator is calculated as an average over the entire test set.This mitigates issues with near-zero targets that occur in some valid cases.The final error rate used for reporting in this paper is the average of e from Eq. 3 over the entire test set.

5.3.4
Training Procedure.The system was implemented in Tensorflow [1].Each model was trained using the TRAIN partition (see Table 1) to minimize the sequence MSE via the Adam optimizer [14], performed in batches of 100 simulations and an initial learning rate of 0.001.Parameters with the best loss on the VALID partition were then chosen.In case of GT Pre-Training (see Section 5.2.3), the resulting parameters served as starting point for the next training round with a slower learning rate of 0.0002.Trained models were evaluated using the TEST partition.
We experimented with varying model size in terms of the number of LSTM layers (ranging between 1 and 5), and hidden units (ranging between 32 and 2048).In the experimental sections below we report results on three representative configurations by memory footprint ("#units x #layers"): 1024x2 (large), 128x5 (medium), and 64x1 (small).

Baselines
We consider several traditional techniques as references for comparison, e.g., upscaling [6] offers itself as a suitable baseline to achieve acceleration due to reduced refinement and complexity (see below).We also describe two additional simple machine learning baselines as references.
5.4.1 Upscaling.While oil reservoir fluids flow through microscopic pores, practical reservoir models, such as SPE9 in Figure 1, have homogeneous properties over tens of meters.These coarser models are constructed attempting to approximate and analyze reservoir flow performance of multiscale nature with the available computational power.To this end, upscaling consists of a set of procedures to obtain coarser reservoir models for flow performance prediction from geological characterizations which typically contain 10 7 − 10 8 cells [6].Extrapolating from this idea, a reasonable question to ask is how the proxy models based on Neural Networks perform compared to even coarser versions of the base case reservoir model obtained with simple upscaling procedures.
We apply single-phase upscaling based on averaging reservoir properties aiming at a comparison of accuracy and performance with the proxies.Given the base grid of 24 × 25 × 15 grid-blocks we constructed two coarser grids of 12 × 13 × 15 and 8 × 9 × 15 gridblocks, referred to as "UP2" and "UP3," respectively.The coarser grids average four (2 × 2 × 1) and nine (3 × 3 × 1) horizontal neighbor blocks in the "UP2" and "UP3" case, respectively.All the relevant spatial reservoir properties, namely pressure, saturation, absolute permeability and porosity, are averaged on the pore-volume of the union of cells.These averaging computations are performed in negligible computational time when compared to the solution of non-linear equations for simulation.Finally, the coarser reservoir description is simulated with the BOS as usual.
While an increased error relative to the refined base model is anticipated, it should be pointed out that the upscaling procedure does not require any data to train or tune its parameters.In this respect, upscaling offers an advantage over learning proxies in use cases where there is an absolute lack of such training data.

Fixed Predictors. Two simple baselines using fixed predictors are created as follows:
• A predictor that at all times outputs the mean value for each variable as seen in the training data (referred to as "Mean Baseline"); • A predictor outputting the mean vector of all observations at the corresponding time step in the training data (referred to as "Time-Step Mean, " or "TSM").In other words, this predictor generates average flow curves for each component, as observed in the corresponding TRAIN partition.The two baselines above use the same training data as the proxy model and thus offer helpful calibration points.Obviously, the two fixed predictors above incur only negligible latency.

Results on OPM-22k
. Calculated according to Eq. 3, Table 3 summarizes error rates for the essential techniques described in previous sections, using the 1024x2 model.For comparison, we also compute the simple baselines proposed in Section 5.4."Mean Baseline" and "TSM Baseline" have an error of 43.1% and 39.3%, respectively, whereas the upscaling cases "UP2" and "UP3" have an error of 19.1% and 32.3%, respectively.Then, starting with the basic encoder-decoder at 21.5%, pre-training a model in GT mode, followed by Prop training at a slower learning rate results in a significant decrease of error rate -by 6.2% absolute to 15.3%.Replacing jointly encoded action-location input (see Section 5.3.2) by factored action-location information and by adding geological features (see Section 5.3.2) decreases the error rate further to 12.2%.Finally, the attention mechanism attending to all encoder layers achieves an error rate of 10.3%.For comparison we also give results for the more standard attention setup using the top layer which is inferior at 11.2%.All differences are statistically significant with p < 0.001 on a paired-sample t-test.Later on, we refer to the multi-layer attentive, GT-pretrained models using a prefix M, e.g., as M1024x2.
In order to gain insight into how the attentive decoder focuses with respect to the input, Figure 3 shows two example attention masks, i.e., the weighting distributions {α kt } 0≤k <K at each decoder time step t over the input action space.In each panel, the horizontal axis corresponds to decoder time and the vertical axis to well drilling sequence (abbreviated as {I,P}#(x,y)).It is interesting to note that while there is a fairly linear alignment between the decoder focus (light areas) and the input actions in the Prop mode (a), the bulk of attention is dispersed over an initial third of the input sequence when the decoder is given the GT at every time step (b).We hypothesize that the sharper action alignment emerges as the decoder obtains strong signal from the action sequence, while consuming its own, error-prone prediction as input.In contrast, the GT decoding mode supplies the decoder with accurate input making the decoder mostly rely on that, rendering the attention less important.Figure 5 compares predicted oil production rates curves across the various baselines on an example of the first ("sim0") simulation, namely the "Mean," "Time-Slot-Mean (TSM)," upscaled "UP2," and "UP3" baselines as well as the proxy.As can be seen, the proxy tends to approximate the ground-truth (GT) flow most accurately, followed by the UP2 baseline.The UP3, Mean, and TSM baselines tend to be considerably further off the target.

Varying Training Size.
To assess the dependency of the prediction error on training data amount we conducted an experiment with subsampling the TRAIN partition to smaller amounts.Figure 6 shows the results: starting from the main OPM-22k result (see above) with 17.6k simulations, the training amount was halved repeatedly to 1.1k simulations -an amount extremely small considering that there are 500 different reservoir realizations to be covered.On the other extreme, the training set was augmented by the remaining simulations from OPM-163k yielding a TRAIN partition with about 130k simulations.In Figure 6, two trends are evident: (1) models with higher parameter complexity tend to outperform lower-complexity ones for larger training amounts, with trend reversal at the lower end of the x-axis, and (2) there is a roughly linear relationship between the error rate and the log-scaled training amount.The accuracy of the rel.small 64x1 model is remarkably competitive across the conditions.Having established accuracy figures in the experiments above, we now turn to the question of performance (timing).We want to compare the average time needed to generate a simulation with 20 actions and 20 outputs, as above, between the OPM simulator and a selected set of proxy models.For the timing experiments we use OPM (2107.10release) ebos_2p which we found to be the fastest implementation with an executable optimized for speed.Benchmarking was conducted on a server equipped with 4 x E7-4850 v2 CPUs @ 2.30 GHz (48/96 physical/virtual cores), 256 GB RAM, Open MPI implementation of MPI (for mpirun; version: 1.2.10) and standard multiprocessing.Pool objects (from Python, version: 2.7.14).A comparison is somewhat complicated by the fact that (1) the OPM simulator is used as a black box and timing includes its initialization, and (2) the OPM code does not support the use of a GPU thus leaving only CPU comparison.(1) has a relatively minor effect as, judging from logging, the initialization stage of OPM takes a negligible fraction of the total time.Table 4 summarizes the benchmarking results.We distinguish three devices of practical interest: (a) CPU 1-core, (b) CPU 8-core, and (c) single GPU (NVIDIA Tesla K80).All values in Table 4 are durations (in milliseconds) of a simulation averaged over 100 measurements using different realizations.The first row shows the OPM time on a single core as well as an 8-core CPU (employing MPI), resulting in an average runtime of about 10 and 4.3 s/sim, respectively.The next three rows give the timing for the three model sizes benchmarked across the three devices.A clear advantage of the small-footprint M64x1 model emerges, in particular in the 1-Core case achieving a speedup factor of 2343X over the OPM simulator.The GPU seems to provide an advantage only in case of the largest model.
The last block of three rows shows simulation time in a case where 100 input sequences (simulation requests) can be batched up at once, thus allowing for the respective device to better utilize matrix operations.This sort of batching is of practical use in certain applications (e.g., Field Development Optimization using Monte Carlo methods).A significant speedup (0.1 ms/sim) can be seen now on the GPU side taking full advantage of its internal memory and architecture.
In a separate experiment, performance gains due to upscaling are assessed on a single core of an Intel Xeon CPU E5-2680 v2 @ 2.80GHz contrasted with Eclipse 2011.1 [7] (we experienced difficulties getting the desired upscaling setup working using OPM).In this setting, the average running times of the OPM-22k TEST partition are 5125 (ms) when performing no upscaling, 1489 (ms) when performing the UP2, and 1155 (ms) the UP3 upscaling.Thus, the speed-up with upscaling ranges from 3X to 4X.Another important (and advantageous) aspect of the proxy benchmarking results is their consistency.For instance, the M64x1 singlecore result is 4.4 ± 0.1 ms/sim which is a range of ±3% relative to the mean.The corresponding OPM measurement of 10310 ± 4030 ms/sim exhibits a considerably larger range of ±39%, which is typically caused by convergence issues for certain action sequences and realizations.This large variability is also observed when running the upscaled cases.

CPU Device GPU
Overall, considering the error patterns associated with each model size (see Figure 6), the small proxy model M64x1 offers an acceptable error rate while achieving the highest speedups.5 summarizes the results in terms of average error rates, with M128x5-E denoting a model trained on OPM-22k-E and M128x5 one trained on OPM-22k, as before.The first row in Table 5 shows that both models perform comparably on OPM-22k, with M128x5-E having no troubles to predict a shortened horizon of 20 time steps.The more interesting case of M128x5 extrapolating to 40 time steps is shown in the second row, where it achieves an error rate of 13.0% -a moderately elevated error over the matched M128x5-E at 10.6%.While this increase is statistically significant, it seems to be sufficiently limited for us to conclude that the model has a reasonable capability to extrapolate to longer horizons not seen during its training.Furthermore, it is reassuring to observe the error rates of both models tested on the shorter time horizon perform equally well.This suggests that training on data with longer horizon is beneficial, whenever possible.

Modeling Individual Wells.
All experiments so far involved predicting field rates, i.e., total production and injection rates over all wells.In some applications, however, more detail may be needed.In one of our use cases (an optimization of a Field Development Plan) predictions of all producer wells were required.This corresponds to adding 20 new output variables each mapped to wells in the order of drilling (as there may be up to 20 producers).We then train the model using OPM-22k via same steps as before resulting in a model with 23-dimensional output.
A comparison between the best wells model (using suffix "W", i.e., M64x5+W) and the best field-rates-only counterpart (M1024x2) is given in Table 6.Clearly, the task has become considerably more challenging as the wells model must maintain accuracy across 23 variables.Compared on the field rates only, the wells model is at 14%, while the dedicated model is around 10%, and at this error rate it still outperforms all other baselines.The error over the 20 well rates is at 19% for the proxy compared to 16.6% and 23.8% for the UP2 and UP3 baselines, respectively.Figure 7 shows a small sample of individual well predictions along with their ground truth.The output variables map to each producer in the order of drilling which corresponds to the timed onset of production visible in the cumulative curves.We observed that the model tends to overestimate wells that remain non-productive (zero output for entire simulation), albeit by relatively small amounts.

Well Control Optimization.
Optimizing the well controls is an enhancement currently only available in the commercial Eclipse simulator (see Section 3).The simulator runtime is typically increased by a factor of 5-10 due to the optimization step.We want to assess the ability of the proxy model to capture the optimization implicitly from control-optimized simulations.The dataset Ecl-32k (see Table 1) serves this purpose.
We start with a model M128x5+W trained on Ecl-32k.In comparison to levels seen with OPM-22k, the resulting error rates in the first row of Table 7 lie considerably higher, at 27.1%.An inspection of the model output (see Figure 8, dashed curve) reveals a rel.high prediction error at time step 1 propagating further during decoding.An in-depth variance analysis, which is omitted here due to space constraints, also reveals a high variability of the GT at time step 1 when (and only when) the optimization is present.This first step appears to be challenging for the decoder to predict accurately.This observation motivated the idea behind HybridProp described in Section 5.2.3.In HybridProp, the decoder is given 1 or several GT frames as input both at training and test time to generate the rest of the sequence using regular propagation.An experimental assessment of this method is shown in Table 7.With M128x5+W at 27.1% error, giving the same model a single frame of GT during decoding only (referred to here as HybridDec) improves the error  7: Error rates on Ecl-32k using a model without and with additional simulator input at prediction time.rate by 4.3%.However, when the model is retrained with the same modification (HybridProp), the error declines dramatically, to 15.8%.Providing more than one GT frame seems to yield diminishing returns.Despite the added requirement of having a simulator available at test time, this a remarkable and a practical result.Returning to Figure 8, after applying HybridProp1 the original (dashed red) curve moves down (solid red) to a trend comparable to one seen on non-optimized data (OPM-22k).The HybridProp method was also tested on models trained on OPM-22k and observed only negligible improvements further confirming our hypothesis regarding the high variance observed at first time step being unique to the well control optimization process.

CONCLUSION
The series of experimental results presented in this paper demonstrates the effectiveness of the described proxy approach at accelerating reservoir model simulations, including variability due to uncertainty.We have observed a significant acceleration capability of more than 2000X compared to an industry-strength physicsbased simulator OPM.Furthermore, we demonstrated it is possible to approximate the simulations with well control optimization thus offering an additional relative speedup.For practical amounts of training data, the accuracy of the neural network proxies presented here ranges between 10% and 15% error over 20-40 months horizons, relative to the simulator.The model shows a good extrapolation capability.Our proxy model captures non-linear interactions between wells, fluid, and rock, giving it a great advantage over state-of-theart commercial techniques.When compared to simple upscaling procedures, we observe that the proxy models are capable to run about 500X faster and provide higher accuracy.We believe these outcomes, generated on a publicly available reservoir, are extremely promising and represent a valuable benchmark for future research in oil field development optimization.Moreover, we anticipate that, due to its application-agnostic nature, the approach is suitable for solving tasks in related fields of energy and environment modeling.

Figure 1 :
Figure 1: An example realization of the SPE9 RM (water pressure distribution).

Figure 2 :
Figure 2: An architecture overview of the model

Figure 4
Figure4compares target (GT) and predicted production rates in three randomly selected simulations drawn from the TEST partition (sim: 0, 10, and 100).The first column in Figure4shows direct output of the model, i.e., rates (in barrels) for a given time period (30 days), while the second column compares the corresponding cumulative curves.The latter eventually serves the NPV calculation of the field as mentioned in Section 5.1.

Figure 4 :
Figure 4: Production rates: Ground truth ("GT," solid), and prediction ("Pred," dashed) curves obtained from a M1024x2 model with attention on three randomly selected simulations.The model has an overall error rate of 10.3%.

Figure 5 :
Figure 5: Predictions produced by the various baselines and the proxy in comparison with the simulator ground truth ("GT").Shown are the field oil rates of a single simulation ("sim0").

Figure 6 :
Figure 6: Prediction error as a function of training data amount.The dashed vertical line indicates training size of the OPM-22k dataset.

5. 5 . 4
Extrapolating to a Longer Simulation Horizon.An interesting question arises regarding the proxy model's extrapolation capability.Suppose we are given a model trained with 20 input actions and 20 months worth of output.We now want to run the model for an extended period, say, 40 months.How well does such a model generalize in comparison to a model trained on 40 months worth of ground truth?To investigate, we generated simulations identical to those in OPM-22k but with output extended to 40 time steps, i.e. 1200 days, (labeled OPM-22k-E) allowing a direct comparison of such two models.Table

Figure 7 :
Figure 7: A sample of individual well (oil producer) predictions (blue) along with their ground truth (red).

Figure 8 :
Figure 8: Effect of the HybridProp method on time step specific error rates, with the first time step being crucial.

Table 2 :
. The embedding vectors for each discrete variable are initialized to random values and are trained jointly with the rest of the network.Negligible Input features with their type and dimensionality sensitivity was observed due to the dimensionality hyperparameter.Values we used are listed in Table2.

Table 3 :
Sequence error rates of various configurations of the 1024x2 model on the OPM-22k TEST partition.

Table 4 :
Timing results.All values are averages over 100 measurements using same random set of simulations.

Table 6 :
Error rates comparison between the wells model predicting 20 wells + 3 field rates and the best field-rates-only model on OPM-22k.