Physics-Aware Deep-Learning-Based Proxy Reservoir Simulation Model Equipped With State and Well Output Prediction

Data-driven methods have been revolutionizing the way physicists and engineers handle complex and challenging problems even when the physics is not fully understood. However, these models very often lack interpretability. Physics-aware machine learning (ML) techniques have been used to endow proxy models with features closely related to the ones encountered in nature; examples span from material balance to conservation laws. In this study, we proposed a hybrid-based approach that incorporates physical constraints (physics-based) and yet is driven by input/output data (data-driven), leading to fast, reliable, and interpretable reservoir simulation models. To this end, we built on a recently developed deep learning–based reduced-order modeling framework by adding a new step related to information on the input–output behavior (e.g., well rates) of the reservoir and not only the states (e.g., pressure and saturation) matching. A deep-neural network (DNN) architecture is used to predict the state variables evolution after training an autoencoder coupled with a control system approach (Embed to Control—E2C) along with the addition of some physical components (loss functions) to the neural network training procedure. Here, we extend this idea by adding the simulation model output, for example, well bottom-hole pressure and well flow rates, as data to be used in the training procedure. Additionally, we introduce a new architecture to the E2C transition model by adding a new neural network component to handle the connections between state variables and model outputs. By doing this, it is possible to estimate the evolution in time of both the state and output variables simultaneously. Such a non-intrusive data-driven method does not need to have access to the reservoir simulation internal structure, so it can be easily applied to commercial reservoir simulators. The proposed method is applied to an oil–water model with heterogeneous permeability, including four injectors and five producer wells. We used 300 sampled well control sets to train the autoencoder and another set to validate the obtained autoencoder parameters. We show our proxy’s accuracy and robustness by running two different neural network architectures (propositions 2 and 3), and we compare our results with the original E2C framework developed for reservoir simulation.


INTRODUCTION
In this study, we build upon a recent study on embedding physical constraints to machine learning architectures to efficiently and accurately solve large-scale reservoir simulation problems [1,2]. Scientific Machine Learning (SciML) is a rapidly developing area in reservoir simulation. SciML introduces regularizing physics constraints, allowing predicting future performance of complex multiscale, multiphysics systems using sparse, low-fidelity, and heterogeneous data. Here, we address the problem of creating consistent input-output relations for a reservoir run, taking into account physical constraints such as mass conservation, multiphase flow flux matching, and well outputs (rates) derived directly from data.
Simulation of complex problems can usually be performed by recasting the underlying partial differential equations into a (nonlinear) dynamical system state-space representation. This approach has been explored in numerical petroleum reservoir simulation, and several techniques were developed or applied to handle the intrinsic nonlinearities of this problem [3]. The main idea is to extract information from states, which are usually given by phase pressures and saturations-as in multiphase flow in porous media with no coupling with additional phenomena, such as geomechanics. Although state-space models can be easily manipulated, the transformation of the reservoir simulation equations induces systems with a very large number of states. In this case, model reduction methods [4,5] can be used to reduce the complexity of the problem and mitigate the large computation cost associate with the solution under consideration. The area of model order reduction (MOR) for reservoir simulation has been very active in the past decade and goes beyond recent projection-based MOR developments. Methods such as upscaling, proxy and surrogate modeling, and parameterization have always found their ways in reservoir simulation. It is not the intention here to give a comprehensive historical account of MOR, but two methods have emerged as good candidates for projection-based MOR: POD-TPWL [5][6][7] and POD-DEIM [8][9][10]. Each of these methods has its advantages and drawbacks, which have been discussed in many published articles [9,11]. The unifying framework in these methods is projection; that is, we project the original large state-space model into a much smaller space of (almost) non-physical significance. The recovery of physical meaning is usually attained by proper training and storage of the states of the system. Although very efficient algorithms can be used for improving state selection through clustering and adaptation, the lack of physical meaning can hinder the application of complex phenomena simulations. This is to say that the reduced-order model's output, that is, well rates and bottom-hole pressures, can sometimes differ significantly from the original fine-scale simulation.
Data-driven modeling in reservoir engineering is not new [12]. Many data-analytics and statistical multi-variable regression have been applied in reservoir simulation to obtain fast proxy models [13,14]. Recently there has been an explosion of methods associated at large with machine learning algorithms [15][16][17]. It is our view, however, that the machinery developed with physics-based model reduction, especially the strategies derived from state-space identification, can be enhanced with machine learning algorithms. System identification can lead to parametric models where the structure of the model is predefined (see Figure 1, for instance, for the state-space representation), whereas machine learning methods lead to non-parametric models and depend largely on the number of training points [18,19].
A combination of data-driven model reduction strategies and machine learning (deep-neural networks-DNN) will be used here to achieve state and input-output matching simultaneously. In [2], the authors use a DNN architecture to predict the state variables evolution after training an autoencoder coupled with a control system approach (Embed to Control-E2C) and adding some physical components (Loss functions) to the neural network training procedure. The idea was to use the framework of the POD-TPWL as a way to recast the reservoir simulator from a system perspective and obtain reduced-order states out of the encoding blocks. Physical constraints were handled by employing particular loss functions related to mass conservation.
In this study, we extend this idea by adding the simulation model output, for example, well bottom-hole pressure and well flow rates, as data to be used in the training procedure. The contributions here are twofold: first, we have extended the E2C to the E2CO (E2C and Observe) form, which allows a generalization of the method to any type of well model. By doing this, it is possible to estimate the evolution in time of both the state variables as well as the output variables simultaneously. Second, our new formulation provides a fast and reliable proxy for the simulation outputs that can be coupled with other components in reservoir management workflows-for instance, well-control optimization workflows such a nonintrusive method, like data-driven models, does not need to have access to reservoir simulation internal structure so that it can be easily applied to commercial reservoir simulations. We view this as an analogous step to system identification whereby mappings related to state dynamics, inputs (controls), and measurements (output) are obtained.
This study is organized as follows. We start by giving a brief overview of numerical reservoir simulation, focusing on the aspects that will be used to develop the ideas presented here, such as identifying states. Next, we revisit the model order reduction (MOR) framework and the methods used here, such as the trajectory piecewise linearization-TPWL and proper orthogonal decomposition-POD. Then, we describe the machine learning architecture E2C and revisit the methodology used as a basis for our study. This leads to the introduction of our main contribution and new propositions to handle the reservoir outputs. Finally, we give details of our implementation and a description of the training set's data. We show the results comparing the propositions, and we draw some conclusions in the end.

Numerical Petroleum Reservoir Simulation
Widely used in all phases of development of an oil field, reservoir simulations are applied to model flow dynamics and predict reservoir performance from preliminary exploration until the full development stage of petroleum production. A deep understanding of the model, discretization, and solution methods can be found in [20,21].
A reservoir simulator is a combination of rock and geological structure properties, fluid properties, and a mathematical representation of the subsurface flow dynamics. A two-phase (oil and water) two-dimensional reservoir simulator is based on flow equations represented in Eq. 1.
where the subscript j represents the phase oil or water, k is the absolute permeability, k r is the relative permeability, µ is the viscosity, ρ is the fluid density, and ϕ is porosity. The solution we seek here is P j , S j . It is possible to identify three terms on Eq. 1: the first one is the flux term, followed by the source/sink term, and the last one is the accumulation term. Although we only describe a twophase flow system, these equations can easily generalize for multiphase flow simulations. As stated before, the reservoir simulation equations can be recast in a systems framework, where the nonlinear equations can be linearized into a timevarying state-space model as depicted in Figure 1. Note that while the controls u are imposed on the wells, the state x evolution is calculated based on the discretized partial differential equation (Eq. 1), and the output y is observed on the wells.
In general, the wells can be controlled by flow rates or bottom-hole pressure (BHP). Here we will assume that the producer wells are controlled by BHP and injector wells are controlled by injection rate (q inj ). So, we can define a control vector composed of the controls for all wells at a timestep t as: p is the number of producers, and i the number of injectors. The state dynamical evolution will be solved for each timestep t. The state is represented as: where P t [P t The output for the producers is the oil flow rate (q o ) and the water flow rate (q w ) and for the injectors is the bottom hole pressure (BHP). Thus, the output vector can be defined as: These values are calculated using the Peaceman equation [22]. For example, for a producer well, the flow rate of phase j can be calculated as: where WI is the well index and can be calculated as: where α is a unit conversion factor, h is the height of the reservoir gridblock, r o is the grid block dimension equivalent radius, r w is the wellbore radius, and SKIN is the skin factor. Following the assumptions as in [3] and the application of fully implicit discretization it is possible to write the residual form of Eq. 1 as: where we can identify the flux term F(x t+1 ), the accumulation term Acc(x t+1 , x t ), and the source and sink term Q(x t+1 , u t+1 ). This equation can be solved by the Newton's method by defining the Jacobian matrix J zg zx . So, the state evolution can be calculated as: To apply reduced-order modeling techniques to reservoir simulation efficiently, one can linearize the residual Eq. 4 (or Eq. 5). In Ref. [23], the authors presented the linearization for a single-phase 2D reservoir based on the simulator's Jacobian matrix. Van Doren et al. [24] and Heijn et al. [25] developed similar approaches for a two-phase reservoir. These linearization methods were developed aligned to a control system approach, and the reader can find a comprehensive review of the control system applied to reservoir simulation in [3]. Committing with the control system approach, a statespace representation of a linear system can be written as Eq. 6. The authors cited in this paragraph manipulated equations similar to Eq. 4 or Eq. 5 to linearize them and write as a linear system.
where x represents the state, _ x the state time derivative ( dx dt ), and u the inputs (controls) and y the system output. A c is called state matrix, B c input matrix, C c output matrix, and D c is the feedthrough matrix. The subscript c stands for continuous time. It is worth mentioning that on Eq. 6 we are presenting the state evolution equation (first one) and the output equation (second one), but the focus of linearization method is to be applied on the state evolution equation.
In general, it is necessary to choose a linearization point in time and use the Jacobian matrix at this time to calculate previously presented matrices. If the system dynamic strongly changes over time, the choice of a fixed linearization time can lead to large estimation errors. Techniques like TPWL (trajectory piecewise linearization) have been used to overcome these difficulties, where it is possible to approximate matrices A, B, C, and D at each interest time.
In the next section, we will present the linearization method TPWL that, combined with a model order reduction technique called proper orthogonal decomposition (POD), has been successfully applied to reservoir simulation [6].

Trajectory Piecewise Linearization (TPWL)
Trajectory piecewise linearization is a method composed of two main steps. The first step is a training procedure (offline processing) that aims to capture state snapshots of a highfidelity solution for the system to be linearized. These state snapshots are stored along with the Jacobian matrix of each timestep.
The purpose of the second step (online processing) is to predict the state on the next timestep x t+1 based on the current state x t and the control state u t . The idea here is to use the stored states on the training step in order to predict the new state. First, in the group of the stored state snapshots, we identified the snapshot x i closest to x t . Then, we use the stored state and the Jacobian matrix for i and i + 1 to calculate the state x t+1 using the following equation: Previously, we wrote the state evolution equation of the reservoir simulation solution (Eq. 5) as a linearized system (Eq. 6). Using similar approach, we can rewrite the TPWL equation (Eq. 7) as a linearized system: where A, B, C, and D are the time discretized version of the continuous matrices, and the superscript i represents the linearized matrices using information from the snapshot i. To solve Eq. 7 and to compute matrices A i , B i , C i , and D i it is necessary to calculate the inverse of the Jacobian matrix , which is computationally costly. Furthermore, storing a full state and a full Jacobian matrix for each timestep on the training runs requires a huge memory amount. The next section will show how to use a model order reduced method to solve these problems.

Proper Orthogonal Decomposition (POD)
The states stored on the training step of the TPWL can be arranged on: It is possible to take the singular value decomposition (SVD) of X. Based on some criteria, one can choose to keep only the l largest singular values. The POD basis Φ is defined as the l right singular vectors, and the reduced/latent state space z is then defined as a projection of the original space x based on the POD basis: Using the relation x ≈ Φz we can rewrite Eq. 7 as: and multiplying both sides of the above equation by Φ T : Now, defining the reduced Jacobian matrix as J i+1 r Φ T J i+1 Φ and the other reduced terms as F i+1 and Q r Φ T Q, we can rewrite Eq. 11 as the POD-TPWL equation: By truncating the latent space, we deal with smaller dimension matrices when compared with the TPWL equation presented previously.
One can rewrite Eq. 12 as a linear time variant system: where A i r , B i r , C i r , and D i r are the reduced version of the ones presented on Eq. 8.
The complexity of calculating matrices A i r , B i r , C i r , and D i r (or its equivalents on Eqs. 6 and 8) can increase in more realistic cases (black-oil or compositional simulators). Another drawback of these methods is the mandatory access to the reservoir simulator code to extract the Jacobian matrix and other data structures from the simulator, which may turn impractical the use of these methods with commercial reservoir simulators.
In [1], the authors develop a method to calculate matrices A t r and B t r based on deep learning, using state snapshots as training data. Ref. [2] extended this idea by adding a physical loss function Frontiers in Applied Mathematics and Statistics | www.frontiersin.org September 2021 | Volume 7 | Article 651178 related to reservoir simulation. Here we expand on this and demonstrate the feasibility of extracting not only states but outputs.

METHODS
In this section, we built upon a previously developed alternative to predict the state time evolution applied to numerical petroleum reservoir simulation based on deep learning, where a direct semi-empirical relation between the state and the output was used to calculate the well data. We will propose an improvement of well data estimation, first, by considering this data on the training process, and second, by building a specific network to estimate the output from the latent space that does not rely on the direct relation between the full state and the output.

Embed to Control (E2C)
This method was proposed in [1] for model learning and control of a nonlinear dynamical system using raw pixel images as input.
It uses a convolutional autoencoder coupled with a control linear system approach to predict the time evolution of the system state.
In Figure 2, we present a schematic diagram showing the main elements for training this model. The inputs used on the training procedure are the elements shaded in gray (x t , x t+1 , u t , and Δt t ). The variables with a hat ( ·) are the estimated values. The paths presented with continuous arrows are used in both training and prediction steps, and those with dashed arrows are only used on the training procedures.
The autoencoder is composed of an encoder that aims to transform the state in the original dimension space (x t ) to a reduced (latent) space (z t ), and a decoder that works as an inverted encoder to project back the reduced space state to the original space. One can relate the encoder with the POD projection (Eq. 9) since both have the ability to transform the original state x on the latent space state z.
The transition element uses the state in latent space to generate the matrices A t r and B t r . These matrices are used to handle the dynamical system evolution using a linear control system approach, as in Eq. 14. Since this equation operates on the latent space, we can relate A t r and B t r with their reduced version used on Eq. 13.
It should be pointed out that matrices A t r and B t r change for each different input state x t , like in other linearization procedures applied to reduced-order models techniques (e.g., TPWL-trajectory piecewise linearization).
Other important elements on this method are the 3 loss functions (L) that will be used during the training procedure. The reconstruction loss function L rec is introduced to guarantee the autoencoder (encoder and decoder) capability to reconstruct x t as close as possible to x t . x t is calculated using: The reconstruction loss function is defined as: where i represents the sample index.
Minimizing the prediction loss guarantees the accuracy of the time evolution of the dynamical system. In this case, we define the prediction loss function as: where, x t+1 is the prediction of the state variables at t + 1, and then is possible to calculate the prediction loss: The third loss function of the E2C model is the transition loss function, which tries to guarantee the model competence to evolve the reduced state z in time. This mimics the MOR idea of developing a projection framework in which the reduced state can evolve in time based on the reduced Jacobian matrix. First, we need to calculate: and then: and with these two equations it is possible to calculate the transition loss (Eq. 19): Here, we briefly described the main elements of the E2C model. In the next section, we will show an improvement of this model, adding a physical loss function specifically designed to reservoir simulation problems.

Embed to Control with Physical Loss Function for Petroleum Reservoir Numerical Simulation
Inspired by the E2C model, Jin et al. [2] proposed an improvement on this model by adding a physical loss function to it. The model proposed by them focuses on the problem of building a proxy for a petroleum numerical reservoir simulator to be used on a well control optimization process. In Eq. 28 of [2], the authors introduced a physical loss function, reproduced below: This physical loss function is composed of two parts. The first one contains the mismatch between the predicted pressure gradient between adjacent grid blocks and the input pressure gradient multiplied by the permeability. This part of the proposed loss function intends to assure that the fluid flow between adjacent grid blocks in the predicted model is as close as possible to the input/true model. Simply put, this is basically a mass conservative requirement. Here, this first part of the physical loss function will be named the flux loss function. The second part of the physical loss function deals with a mismatch in the producer wells' flow rate. In fact, as described by the authors, it tries to guarantee that the producers' grid block pressure in the predicted and reconstructed states are as close as possible to its equivalent on the input/true model. Instead of adding physics to the neural network, this second part adds more weight to the producers' grid block pressure during the training procedure. In fact the authors in [2] did not use the flow rate on their physical loss function; instead, they used only the producers' well grid block pressure. Since we will propose a new way to handle well data, we will omit this second part of the study's proposed physical loss function. Even though we are omitting this part of the physical loss function, we will have it in our implementation of the method proposed by [2], for comparison purposes. Our implementation of their loss function (Eq. 28 of [2]) is presented here on Eq. 35. Figure 3 a diagram of the autoencoder with the flux loss function terms calculation. Comparing this figure with Figure 2, it is possible to notice the addition of the gradient calculation on both sides of the chart and also the calculation of the flux loss functions both in the reconstruction and prediction parts, which can be expressed as:

We present in
and where k is the permeability.
The flux loss function can be defined as: In addition to the flux loss function, on the top right of Figure 3, it is possible to see the calculation of the output y t+1 . The model outputs are well bottom hole pressure for the injector wells, and water and oil flow rates for the producer wells, in the case presented here. The calculation of these quantities from the state variables (pressures and saturations) is made by applying the Peaceman equation ( [22]). This way to estimate the well data is here denoted by proposition 1. Note, that this is the actual proposed method in [2], implemented here for comparison purposes. In the following sections, we will propose and compare two other ways of estimating well data using E2C like models.

Embed to Control with Well Data Loss Functions for Petroleum Reservoir Numerical Simulation
Since we are building a proxy model for a petroleum reservoir numerical simulator to be applied in a well control optimization framework, we would like to emphasize the well data prediction (BHP and flow rates), not only on the state (pressure and saturation) calculation; or, in other words, we want to reconstruct the output and not only the states. To achieve this, we propose a well data loss function named L well data2 as a reference to proposition 2. We present the diagram for proposition 2 in Figure 4, where it is possible to see the addition of the well data loss function on the right upper corner. This loss function will make the well data predicted by the model y t+1 as close as possible to the true value y t+1 . Note that both y and y are composed by oil and water flow rates and bottom hole pressures, as defined in Eq. 2. The input used to train the model is the well data calculated by a reservoir simulation. Proposition 2 well data loss function is defined as: To get a more reliable state prediction and, as consequence, a better well data estimation, we introduce a two-phase flux loss function (Eq. 25) for a two-phase simulator.
where its terms are defined in Eqs 26, 27: and where k is the permeability, k r,j is the relative permeability to the phase j, and S w is the water saturation. The addition of proposition 2 loss function L well data2 to the training process is a simple step but can improve the model's ability to estimate well data. Although the model's structure is very similar to proposition 1 (Figure 3), we expect to have more reliable well data estimation because we are adding this information to the training procedure. On proposition 1, the well data was not part of the training procedure. It was calculated on the validation/test steps based on the states.

Embed to Control and Observe with Flux Loss Function for Petroleum Reservoir Numerical Simulation
Inspired by the E2C model, we extend its idea to the output data and named it as Embed to Control and Observe (E2CO). On the original E2C model, a transition network receiving as input the state in the reduced space was employed to generate matrix A t r and B t r , which was used to evolve the state in the latent space based on Eq. 14. We introduce another transition network, namely, the transition output. This network will receive information from the state on latent space z t to generate matrix C t r and D t r in order to be able to estimate well data y t+1 as in .
The E2CO diagram is presented in Figure 5, where it is possible to see the new transition output network. We also create a loss functions L well data3 (as reference to proposition 3) that provides information to train the neural network transition output parameters. This loss function has the same structure as Eq. 24.
One of the main advantages of E2CO, when compared to propositions 1 and 2, is that here we do not rely on the state variables in the original space to predict well data. Instead of this, E2CO uses the state on the reduced space to create matrices C t r and D t r and uses a control system approach to estimate the output data. This resembles the case of system identification as in the control system community.

Training Versus Prediction
The diagrams presented in Figures 2-5 are used to train each proposed method's networks. In these diagrams, the loss functions have an important role in the training procedure.
They are defined to provide information about the system dynamics and honor the physics presented in the reservoir simulation process.
When the training procedure is finished, and the neural network parameters (weights, bias, and filters) are optimally found, it is time to use them to predict the system behavior. The prediction diagrams are significantly simpler than the training ones. The prediction diagram for propositions 1 and 2 is presented in Figure 6. In this figure, the time evolution happens in the latent space (z), potentially reducing the computational cost of the prediction step. However, to predict well data using propositions 1 and 2, it is necessary to have the state on the original space (x) at each desired time step to predict the well data at that time. That being said, one can note that it is necessary to pass z t+1 through the decoder to calculate x t+1 and further apply the Peaceman equation to calculate well data y t+1 on each time step. The application of the decoder to predict the well data can lead to a computational cost increase. For proposition 3, there is no need to decode the latent space since the well data is generated on that latent space using a specific network for this purpose (Transition Output). Figure 7 shows that both the state time evolution and the well data estimation happen in latent space. On proposition 3, the decoder's use is limited to the cases where one would like to estimate the state on the original space (x).

Neural Networks Structure
In the following sections, we describe the networks used in each part of the proposed models. The structure of the networks used here is very similar to the ones used in [2]. The three proposed ways of calculating well data use the same elements described in this section. Although the elements are the same, each proposed method's parameters will differ due to the different model/ network structure and different loss functions used in the training procedure.

Encoder
The encoder is built with a series of encoding blocks followed by residual convolutional blocks and a dense (fully connected layer), as depicted in Figure 8 (top).
The main idea of the encoder structure is to transform the state variables on the original space x to the latent space z. The dimension of the latent space l z is smaller than the dimension of the original space l x . To keep things in perspective the numerical model we will present next has l x 60 × 60 × 2 7200 states and l z 50.
Each encoding block comprises a convolutional 2D layer, followed by batch normalization and a ReLU (Rectified Linear Unit) activation function, as presented in Figure 9 (top). Although the structure of each encoding block is the same, the dimension of each block will change to be able to reduce the dimension of the encoder input. This dimension reduction is pictorially presented in Figure 8. The dimensions of each component used in this work will be presented later.
The residual convolutional blocks also have a convolutional 2D layer, followed by batch normalization and a ReLU, adding another convolutional 2D layer and batch normalization, as presented in Figure 9. It is possible to observe a link between the input and the       output of the residual convolutional blocks. This is the characteristic of a residual block. It is built in this way mainly to avoid gradient vanishing during the training procedure.

Decoder
The decoder will work as an inverted encoder, where its main purpose is to transform the state variables in latent space z into the original space x. Its structure is composed of a dense (fully connected) layer, followed by three residual convolutional blocks, four decoding blocks, and a convolutional 2D layer, as presented in Figure 8 (bottom).
The residual convolutional block used in the decoder has the same structure as the used on the encoder. The decoding block ( Figure 9) comprises a transpose convolutional 2D layer, batch normalization, and ReLU. In [2], the decoding block has a sequence of unpooling, padding, and convolution layers; here we replace this with a transpose convolutional layer.

Transition
The transition network is composed of two transformation blocks, followed by two dense layers directly connected to the output of the last transformation block. The output of these dense layers is reshaped to create the matrices A t r and B t r . The transition network is presented in Figure 10 (left).
Although the authors in [2] used three transformation blocks, we will use two as proposed on one example from the original E2C model [1]. The transformation block (Figure 9 (bottom)) is composed of a dense layer with batch normalization followed by a ReLU activation layer.

Transition Output
The transition output network (Figure 10 (right)), used on our proposition 3, has a very similar structure compared with the transition network. The dimension of the dense layer in the transformation blocks will differ between these two networks. This will be detailed in the following section. Still, it is intuitive to think that the transition output network may require layers with smaller dimensions than the transition network. The transition network output is the matrices A t r and B t r , which have sizes defined by the system input/state. On the other hand, the sizes of C t r and D t r depend on the dimensions of system input/output, which usually has lower dimensions when compared to the input/ state.

Neural Networks Implementation
We implemented all the methods proposed here using Python 3.7.4 [26] and the framework TensorFlow 2.3.1 [27] with Keras API [28]. The dimensions of layers, number and size of filters, and output size of each block on the encoder and decoder used are presented in Tables 1 and 2.
The transformation block has dimension of n Trans 200 for the transition network, and the last dense layers before the matrices calculations have dimensions of l 2 z for A t r and l z (n i + n p ) for B t r , where n i and n p are, respectively, the number of injector and producer wells. The output of these dense layers is then reshaped to match each matrix dimension. For the transition output network we are using n Trans WD 20 for the transformation block dimension, and the last dense layers have dimensions of l z (n i + 2n p ) for C t r and (n i + n p )(n i + 2n p ) for D t r . We used the Adam optimization algorithm with a learning rate of 1 × 10 −4 during the training procedure with a batch size of 4. The sample is randomly selected to compose the training batch. We run the training procedure for 100 epochs.
We initially trained our models on CPUs. However, we change to GPUs since the speed up is expressive. We provide in Table 3 a non-rigorous estimation of the training time, where it is possible to see the differences between training on CPU and GPU. More details on the GPUs NVIDIA Tesla used can be found on www.hprc. tamu.edu.

Non-determinism
During the initial implementation of the methods proposed here, we noticed that if we repeat the same training procedure twice in the same machine with the same input and parameters, we will end up with different networks, and this will lead to different predictions and different error estimations. Although this can be understood as a native characteristic of the neural network training procedure, this can increase the complexity of comparing different methods  since each method's results can change based on an uncontrollable variable. Some of the non-determinism sources are layers weight initialization, kernel weights initialization, and the order of models to be part of the training batch. When training on GPU, some other sources can happen when using convolutional, pooling, and upsampling layers. All these sources are discussed in [29], which provides a solution for most of these sources of non-determinism. Here we used this method provided, and we can reproduce the training results for the same input and parameters.

Normalization
We used normalization to treat our input data from the state and the well data and also during the loss functions calculations. For each quantity, normalization parameters (min and max) are defined and used on this quantity over the training and prediction procedures. The normalization used here is: where Q is the quantity to be normalized.
The minimum and maximum values for each quantity were selected from the input state (quantities: pressures and saturations), from the well data (quantities: well bottom hole pressure, the oil flow rate, and the water flow rate), and from the well controls (quantities: the water injection rate and well bottom hole pressure).
We also normalized values for the loss functions calculation. When applying Eqs 18 and 16, we use normalized values of state pressures and saturations. The same is valid for Eq. 24 where we used normalized values for well bottom hole pressure, the oil rate, and the water rate.
Here, we introduce the use of a normalized flux loss function. We first calculate the minimum and maximum flux from the input state and use them to normalize the flux loss function. This is valid for the single-phase loss function (Eqs. 21 and 22) and the two-phase (Eqs. 26 and 27), where we calculate maximum and minimum values for oil and water flux. By doing this, we expect not to need to use weight terms on these loss functions, like the ones used in [2]. This can potentially reduce the number of hyperparameters to be chosen on the training procedures.

DATA GENERATION AND ASSESSMENT
In this section, the simulation model used to generate state and well data is described, followed by the details on the training and validation data sets' generation. We then show how we evaluate each trained model's quality using the validation set by calculating the relative error.

Numerical Simulation
We utilized a commercial numerical reservoir simulator (IMEX from [30]) to generate the data employed on the training and validation steps of our propositions. However, any commercial off-the-shelf reservoir simulator can be used to train our model. A Cartesian grid with 60 × 60 × 1 gridblocks and an oil-water fluid model were used. A fixed permeability field was generated and is presented in Figure 11 (left). The relative permeability curves used in the simulation model are shown in Figure 11 (right).
The model has five producers and four injector wells. The producer wells are controlled by bottom hole pressure (BHP), and the injectors are controlled by the water injection rate. This model runs through 2,000 days, changing controls every 100 days. Some properties used in the simulation model are presented in Table 4.

Data Sets
The objective of constructing a fast proxy model is to eventually connect it to control optimization frameworks whereby multiple calls of the surrogate model are used to determine a control set that optimizes an objective function (e.g., Net Present Value and sweep efficiency). In this case, we expect that our proxy might have the competence to predict well data under different sets of controls, which is usually unknown before the optimization is complete. To mimic the behavior of a varying input setting, we will use 300 simulations to train the proposed models, and each simulation will run through 20 time periods. For the producers, the bottom-hole pressure was sampled over a uniform distribution U(260, 275) kgf/cm 2 . To avoid having the same volume of water injected in all control sets, we used a different approach to create the injectors controls. For each control set i,  we sampled a base injection rate q inj base,i ∼ U(300, 950) m 3 /day, and for each time period t, we sampled a perturbation q inj pert,i,t ∼ U(−80, 90) m 3 /day. The injection flow rate is then calculated as q inj (i, t) q inj base,i + q t inj pert,i . In Figure 12 we present an example of the injectors controls where readers can confirm the injection rate variability. The variability in the producer's controls can be observed in Figure 13.
Using each one of the 300 control sets generated, we run the reservoir simulator and store the state (pressure and water saturation) at each one of the 20 time periods. The training set will be composed by 6,000 samples, each one containing: In the case presented here, t 0, 1, . . . , 19, where t 0 represents the initial state that also needs to be stored as the first snapshot of the state (x 0 ). The validation set contains 100 control sets that will create 2,000 samples. The validation set generation follows a similar approach but with a different random number generator seed. This is to enforce that training and validation are performed using different control inputs.

Model Evaluation
Each proposition (propositions 1 to 3, as constructed before) was trained as described in Section 2.7 using the training data set. We then used the models (networks) trained for each proposition on the prediction architecture, as shown in Figures 6, 7, and evaluated each model's quality on the validation set. We also used this validation set to choose the better hyperparameters for each model. This is a tedious but necessary step in looking for a low prediction error model.

Error Definition
In this section, we present the definitions of relative error used to evaluate our model's accuracy. Most of these definitions are similar to the ones defined in [2]. For a single producer wells (p), the relative error in the water and oil rates is defined as e p w and e p o that in a general way can be written as: The error for all producer wells can be defined as: where j o, w stands for the oil and water phases, respectively, and HFS stands for high-fidelity solution (numerical reservoir simulation run). The upper hat ( ) stands for estimated and is related to the output of the 3 propositions on how to estimate well data.
A similar equation can be built for the bottom hole pressure of an injector well (i), as given below: The error for all injector wells can be calculated as: where n i is the number of injector wells. We defined a relative error for the cumulative flowrates as: where j o, w and Q is the cumulative rate, and for all the producer we define: We also define a relative error in terms of the primary variables (pressures and water saturations) as: where v k denotes the state variable at grid block k (pressure p k or water saturation S k ), and n b is the number of grid blocks in the numerical reservoir simulation grid.

Hyper-Parameters Tuning
The process of obtaining predictive neural networks with architecture like those used on the proposed methods is not straightforward. There are no mysteries in the training procedure itself. However, several hyper-parameters need to be selected before training, for example, the training batch size, number of epochs, and the learning rate. The choice of the best set of hyper-parameters is challenging, and the authors could not find a well-defined procedure on how to do it. To better choose the hyper-parameters, we have to proceed with full training and then using the validation data set, predict the outputs of this trained network to analyze the error of this choice. We should now repeat this process with another set of hyperparameters to find the one that provides a smaller prediction error. Looking for isolating each parameter's influence, we run isolated tuning, where each tuning run will consist of trying some values of a hyper-parameter and fixing the others. For example, trying to find the best learning rate (for ADAM's algorithm) we run the training and prediction for the values 1 × 10 −5 , 1 × 10 −4 , 1 × 10 −3 , 1 × 10 −2 , and 1 × 10 −1 . The lowest error in estimating the well data was obtained with 1 × 10 −4 . Now, this value will be fixed on the other tuning runs. We understand that this might not be the optimal approach. However, using a grid search-like method is not feasible since to span the hyperparameters space would be high computationally costly.
Using a batch size of four is a significantly better choice than using 20 or 100. Larger batch sizes reduce the training process's duration; however, it increases the prediction error in our case. The samples were shuffled to build the batch. This approach worked better than building the batch following the time sequence.
The number of epochs used on the training runs was not considered as a hyper-parameter. We run all our training processes for 100 epochs, saving the network at every 10 epochs. In some cases, we observed that the network that produces the lowest prediction well data error is not obtained after epoch 100, as we expected. This was observed during some hyper-parameters tuning runs. However, when the best hyper-parameter set was found on the specific tuning run, it was always on the last epoch. This might be interpreted as when we are far from the best hyper-parameter, the training procedure presents some instability. It is noteworthy that along the training process, the total loss function keeps decreasing at each epoch. However, the error we are analyzing here is the prediction error defined in a previous section, which is not directly related to the total loss function used on the training procedure.
We tried to find the dimension of the latent space l z that produces the lowest well data prediction error by changing l z 10, 20, 50, and 100, and the smallest error was obtained by using l z 50. The dimension of the dense layer on the transformation blocks was changed between values 50, 100, 200, and 500, and the lowest error was found when the dimension was 200. For the E2CO, the dense layer's dimension on the transformation blocks in the transition output network was varied between 10, 20, 100, and 200, and the best value was 20.
The most challenging tuning was on weights applied to the loss functions. Our training procedure relies on the combination of loss functions, and how these functions are optimally weighted to build the total loss function is not an easy task. Here was mentioned that we used normalization of inputs, outputs, and components of loss functions to avoid the need to tune the loss function weights. We initially tried to train our models with the loss functions not weighted to compose the total loss function, but after some tests, we realize that there was room for improvement in weighing them.
For proposition 1, we can define the total loss function as: where c flux is the weight applied to the flux loss function (Eq. 23) and c p pres , is the weight applied to the loss function of pressure in the producers well gridblocks. This resembles the way the authors in [2] defined the loss function. The values that produces a smaller prediction error are c flux 5 and c p pres 0. We tried several values for c p pres , but the best results where obtained with zero for this weight.  For proposition 2, we need to add the well data loss function (L well data2 ), so the total loss function will become: where c flux2Ph is the weight applied to the two phase flux loss function and c well data2 represents the weight applied to the well data loss function.
For the E2CO (proposition 3) we can define the total loss function as: We tried variations of propositions 2 and 3 total loss functions using different flux loss functions: one-phase, two-phase, and no flux loss. We also tried to give more weight to the state variables (pressure and saturation) on the well grid blocks. This analysis was applied to the injectors and producers' well gridblocks. We tested several combinations of different weights for a loss function that looks at the mismatch between the predicted and the true values of the wells gridblock state variables. These loss functions can be identified as L p pres , L p sat , L i pres , and L i sat , and there respective weights are c p pres , c p sat , c i pres , and c i sat . We expected that this would improve the model's ability to predict well data, but this was not true for all cases.
For proposition 2, the best results were obtained using c flux2Ph 1.0, c well data2 1.0, and c p pres c p sat c i pres c i sat 0 .
For E2CO (proposition 3), the best results were obtained using single phase loss function, c flux1Ph 0.1, c well data3 1.0, c p pres 0.01, and c p sat c i pres c i sat 0.
The choice of best hyperparameters set for each proposition was defined by analyzing the validation set's total prediction error. We used the summation of the median of the errors defined by Eqs 30-34, here defined as total prediction error, to choose the best hyperparameters set.

Training Size Sensibility
We are interested in understanding the performance of the two proposed methods when less data is available for training or the resources available for the training are limited. Using the best set of hyperparameters described in the previous session, 3 training procedures were conducted varying the training set size for each model. The original training set size was 6,000 samples, and we also trained with 4,000 samples and 2,000 samples (each one corresponding to 300, 200, and 100 controls sets). Figure 14 shows the relative error median over the validation set for the oil rate, the water rate, and BHP for each proposed method. A similar trend for all the curves can be observed, and as expected, when the number of training samples is reduced, the estimation error on the validation set increases. Important to note that for the oil rate and BHP, the E2CO method presents lower values for relative error when reducing the training set size. Proposition 1 provides slightly better results only for the water rate with the lowest numbers of training samples.

RESULTS
This section will present a validation set error comparison between the best hyperparameters set for each of the propositions. We finish this section by comparing the methods by showing well data and states outputs of a specific control combination from the validation set.
We defined the best hyperparameters set based on the total error median over the validation set for all the three proposed models on the hyperparameters tuning procedure. We can now analyze the performance of these three models by comparing the error for each item analyzed. In Figure 15 we present the error for well data, where this box plot represents the distribution of the relative error on the validation set. The bottom and top limits of each box represent 25th, and 75th percentile errors, the orange line inside the box is the median of the distribution. Analyzing oil rate relative error in Figure 15, it is possible to conclude that E2CO overcome propositions 1 and 2. For the water rate and bottom hole pressure, we obtained a smaller error for both E2CO and proposition 2 than proposition 1. However, the E2CO estimation errors for the water rate are higher than proposition 2. We impute this pour performance to failure on estimating water rate before water breakthrough. We present and discuss this problem better in Figure 16.
A similar analysis can be built for cumulative flow rates, as in Figure 17. It is possible to observe that proposition 2 and E2CO overcome the cumulative water flow rates prediction ability of proposition 1. The same figure shows that the cumulative oil rate is better predicted by proposition 1. We can also visualize the cumulative errors in a scatter plot of cumulative flow rate plots (Figure 18), where it is possible to confirm the previous figure's analysis.
The error on the state variables is presented in Figure 19. We can observe that both E2CO and proposition 2 overcome the state estimation ability of proposition 1. During the development of both E2CO and proposition 2, we aimed on getting accurate well data estimation; however, these results show that we are also able to predict states. Proposition 2 has a lower estimation error, which can be assigned as a benefit of using the 2 phase flux loss function proposed in this research.
We now present in Figure 16 the predicted well data of one validation sample for all the three models. It is noteworthy that E2CO has some problems with water flowrate estimation, mainly before the water breakthrough. This could be addressed by narrowing its values to only positive values, which is an easy procedure. One can check the saturation on the well gridblock, and if saturation is not higher than the connate water saturation, set its value to zero. This will compel the need to rebuild the full state. Here, we did not implement any of these procedures, and the results presented are exactly the ones generated by the E2CO method.
The pressure and water saturation maps for the last predicted time step using one validation sample in Figures 20, 21. The maps did not show significant differences; however, the error map for pressure shows lower values for proposition 2 and E2CO. For saturation error map, proposition 2 shows higher errors in a southern area.

CONCLUSION
This work presents a new framework to develop proxy models for reservoir simulation. Our method is based on a deep convolutional autoencoder and control system approach. Our development was inspired by the E2C-ROM method (named proposition 1) proposed by [2]. In this study, we leverage its capability of state reconstruction by adding to it a well data (output) loss function on the training procedure (proposition 2). Additionally, we introduce E2CO-Embed to Control and Observe (proposition 3)-as an alternative approach to calculate the model output by having a specific neural network to capture the relation between reduced state, controls, and outputs. The methods developed here can be related to the reduced-order modeling technique POD-TPWL.
We applied our proposed new architecture on a two-phase 2D synthetic reservoir simulation model running on a commercial reservoir simulation. The results of proposition 2 and E2CO applications in our synthetic case presented improvements when compared with E2C-ROM. The well data estimation errors on the validation set are generally more accurate and present less variability using the two proposed methods, especially the estimations of oil rate and cumulative oil and water production. Proposition 2 has also shown slightly better results, but its limitations are the same as the E2C-ROM: the need to reconstruct the full state to estimate the well data. On the other hand, our new E2CO methodology can predict well data using the reduced state directly without being estimated by the Peacemen formula.
Although we have shown the advantages and limitations of our proposed methods using a 2D reservoir, their performance and generalizations need to be tested on more realistic scenarios with larger and more complex models (black-oil or compositional). The explosion on the number of state variables' and, thus, the model dimension may lead to more challenging well data estimation. It may be necessary to increase the dimension of the reduced space l z , and the neural network dimension responsible for generating the linearized system matrix.