An Application of an Embedded Model Estimator to a Synthetic Nonstationary Reservoir Model With Multiple Secondary Variables

A method (Ember) for nonstationary spatial modeling with multiple secondary variables, which combines Geostatistics with Random Forests, is applied to a three-dimensional reservoir model. It extends the Random Forest method to an interpolation algorithm that retains consistency properties similar to those of both Geostatistical algorithms and Random Forests. It allows simpler interpolation algorithms to be embedded into the process and combined through the Random Forest training. The algorithm estimates a conditional distribution at each target location. The family of such distributions is called the model envelope. An algorithm to produce stochastic simulations from the envelope is demonstrated. This algorithm allows the influence of the secondary variables, as well as the variability of the result, to vary by location in the simulation.


A Background
A Decision (or Random) Forest is an ensemble of Decision Trees. Decision Trees have been used for non-parametric regression for many years and their theory is reasonably well understood (Györfi et al., 2002). They work by inducing a partition of space into hyper-rectangles with one or more data in each hyper-rectangle. To be concrete, suppose we have observations (Y_i, Z_i), i = 1, …, n, with each Y_i a vector of predictor variables of dimension p and Z_i the ith training observation of a target variable Z that we wish to estimate. This paper is concerned with spatial data, so that we are looking for estimation or simulation of Z(x) at a location x. The notation Z_i is thus shorthand for Z(x_i), the value of the target variable at the observed location x_i. The basic idea behind decision trees is to construct the tree using the training data (Y_i, Z_i) and to predict at a new location x by passing the vector Y(x) down the tree, reading off the values of Z_i stored in the terminal node hyper-rectangle and using them to form the required type of estimator.
A decision tree is grown by starting with all data in the root node. Nodes are split recursively using rules involving a choice of which of the coordinates/variables Y^j in Y to use, as well as the value s of that variable used for the split, with vectors (Y, Z) going to the left child if Y^j < s, and to the right child if not. When tree growth terminates (another rule, about the size of the terminal node), the geometry of each terminal node is therefore a hyper-rectangle. The associated set of Z_i are stored in that terminal node. In random forests the choices of variables and splits are randomized in ways explained in the next paragraph, but for now let θ be the set of parameters chosen and call T(θ) the tree created with this choice of parameters. Denote by R_λ the hyper-rectangles associated with the leaves of the tree and call R(y, θ) the rectangle representing the terminal node found by dropping Y(x) = y down the tree T(θ). As stated before, estimates will be created from these terminal nodes; for example, the estimate of the mean value of Z(x) given Y(x) = y from the single decision tree T(θ) is μ̂(y, θ) = Σ_i w_i(y, θ) Z_i, where the weight is defined by w_i(y, θ) = 1{Y_i ∈ R(y, θ)} / #{j : Y_j ∈ R(y, θ)}, i.e., the weight is 1 divided by the number of data in the hyper-rectangle when Y_i is in the terminal node and is 0 when it is not.
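As an illustration, the single-tree construction and its terminal-node weights can be sketched as follows. This is a minimal sketch with illustrative names (grow_tree, leaf_of, tree_mean), not the paper's implementation; for brevity it tries only one randomly chosen split variable per node.

```python
import random

def grow_tree(indices, Y, min_size=2):
    """Recursively partition the data; a terminal node stores its sample indices."""
    if len(indices) <= min_size:
        return indices
    j = random.randrange(len(Y[0]))              # split variable (one candidate, for brevity)
    vals = sorted(set(Y[i][j] for i in indices))
    if len(vals) < 2:
        return indices                           # cannot split on a constant variable
    s = random.choice(vals[1:])                  # split value s
    left = [i for i in indices if Y[i][j] < s]
    right = [i for i in indices if Y[i][j] >= s]
    return {"j": j, "s": s,
            "L": grow_tree(left, Y, min_size),
            "R": grow_tree(right, Y, min_size)}

def leaf_of(tree, y):
    """Drop y down the tree; return the indices in the terminal hyper-rectangle R(y, theta)."""
    while isinstance(tree, dict):
        tree = tree["L"] if y[tree["j"]] < tree["s"] else tree["R"]
    return tree

def tree_mean(tree, y, Z):
    """mu_hat(y, theta) = sum_i w_i(y, theta) * Z_i, with w_i = 1/|leaf| inside the leaf."""
    leaf = leaf_of(tree, y)
    return sum(Z[i] for i in leaf) / len(leaf)
```

The weight of each datum is implicit in tree_mean: 1 divided by the number of data in the terminal hyper-rectangle, and 0 for data outside it.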
Decision Forests are just ensembles of decision trees. The Random Forest is a specific type of decision forest introduced by Leo Breiman (1999, 2004) which has been particularly successful in applications, combining the ideas of Bagging (Breiman, 1996) and Random Subspaces (Ho, 1998). In most versions of decision forest, randomization is performed by selecting the data for each tree by subsampling or bootstrapping and by randomizing the splitting. We extend the parameter θ to cover the data selection as well as the splitting. An adaptive split makes use of the Z values to help optimize the split. A popular choice is the CART criterion (Breiman et al., 1984), whereby a node R is split into R_s = {Y ∈ R; Y^j < s} and its complement. The variable Y^j is chosen among a subset of the variables selected randomly for each split, such that the winning split variable and associated split value are the ones that minimize the within-node variance of the offspring. A slightly more radical split is employed in this paper. Instead of testing all possible split values for each variable, a split value is chosen at random for each candidate variable and then, as before, the within-node variances of the children are calculated and the minimum chosen. This additional randomness seems to help reduce visual artefacts when using a random forest spatially.
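The randomized split rule just described (one random split value per candidate variable, with the winner minimizing the within-node variance of the children) might be sketched as below; the name choose_split and the details are illustrative, not taken from the paper's code.

```python
import random

def variance(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

def choose_split(indices, Y, Z, n_candidates=3):
    """Draw one random split value per candidate variable; keep the split that
    minimizes the summed within-node variance of the two children."""
    p = len(Y[0])
    candidates = random.sample(range(p), min(n_candidates, p))
    best = None
    for j in candidates:
        lo = min(Y[i][j] for i in indices)
        hi = max(Y[i][j] for i in indices)
        if lo == hi:
            continue                              # constant variable: no split possible
        s = random.uniform(lo, hi)                # one random split value per variable
        left = [Z[i] for i in indices if Y[i][j] < s]
        right = [Z[i] for i in indices if Y[i][j] >= s]
        if not left or not right:
            continue
        score = len(left) * variance(left) + len(right) * variance(right)
        if best is None or score < best[0]:
            best = (score, j, s)
    return best  # (score, split_variable, split_value), or None if no split found
```

This differs from the CART criterion only in not scanning all split values, which is the extra randomization the paper credits with reducing visual artefacts.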
The weight assigned to the ith training sample by a forest with k trees is w_i(y) = (1/k) Σ_{t=1}^{k} w_i(y, θ_t). The estimator of the conditional distribution, which we need for Ember, is then (Meinshausen, 2006) F̂(z | Y = y) = Σ_i w_i(y) 1{Z_i ≤ z}, and the forest estimate of the conditional mean is Ê(Z | Y = y) = Σ_i w_i(y) Z_i.
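These forest-level formulas translate directly into code. In the sketch below (illustrative names; each tree is represented only by a function returning its terminal-node indices for a given y), the weights w_i(y) sum to 1 by construction.

```python
def forest_weights(tree_leaf_fns, y, n):
    """w_i(y) = (1/k) * sum over the k trees of 1{i in leaf}/|leaf|."""
    k = len(tree_leaf_fns)
    w = [0.0] * n
    for get_leaf in tree_leaf_fns:
        leaf = get_leaf(y)
        for i in leaf:
            w[i] += 1.0 / (k * len(leaf))
    return w

def conditional_cdf(w, Z, z):
    """Meinshausen's estimator: F_hat(z | Y = y) = sum_i w_i(y) * 1{Z_i <= z}."""
    return sum(wi for wi, zi in zip(w, Z) if zi <= z)

def conditional_mean(w, Z):
    """Forest estimate of the conditional mean: E_hat(Z | Y = y) = sum_i w_i(y) * Z_i."""
    return sum(wi * zi for wi, zi in zip(w, Z))
```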
There is a large literature on Random Forests, so we mention only a few works that were helpful to the current one. As well as the foundational papers by Breiman and Meinshausen, Lin and Jeon (2006) describe how forests can be interpreted as k-PNN, a generalization of nearest neighbour algorithms, and give some good intuitive examples to help understand the role of weak and strong variables, splitting strategies and terminal node sizes. Establishing consistency is relatively easy for Random Forests under some reasonable conditions (we look at an example later). However, there has not as yet been a full explanation of how Breiman's forest achieves such good results. Nonetheless, some good progress has been made in extending the basic consistency results. The PhD thesis of Scornet (2015) includes a demonstration of the consistency of the Breiman forest in the special case of additive regression models (that chapter was written in conjunction with Biau and Vert). Other results concern central limit theorems for Random Forests (Athey et al. (2019) and references therein). Mentch and Zhou (2019) investigate the role of the regularization implicit in Random Forests to help explain their success.

A.1 Embedding Models in Decision Forests
If we have one or more models, M_j(x), j = 1, …, l, which are themselves estimators of Z(x), it is possible to embed these for use within a random forest. As described in the main article, the reason we might want to do this is to make use of information that is not directly available in the form of observed data. In spatial modelling, the variogram or covariance function can give information about the spatial continuity of the variable of interest. A model such as kriging (Chilès and Delfiner, 2012), which is based on this continuity, is well known to be a powerful technique for spatial estimation. In cases where there are a number of other predictor variables (often called secondary variables in the Geostatistics literature), the standard method of combining the information is through linear models of coregionalization (Wackernagel, 2003). The linearity of the relationship between variables can be restrictive, and correlations may not be stationary. An alternative is the cloud transform (Kolbjørnsen and Abrahamsen, 2004), which allows a multivariate estimate of a conditional distribution at the target location but does not use any direct interaction with remote target locations until the simulation, which uses a P-field method. Probability aggregation (Mariethoz et al., 2009) combines conditional distributions created from the secondary data and from the spatial target data separately. Barnett et al. (2014) use a projection pursuit approach to provide multivariate Gaussian transforms for the variables.
It is well known that estimators such as kriging have good properties in their own right (e.g. BLUE) and can be more tolerant of departures from stationarity than the associated full multivariate models. These considerations, as well as the possibility of a simplified workflow, have prompted the idea of embedding models. The embedded model then acts as an oracle, which is called to produce estimates at a location, but there is no requirement to concern ourselves with its higher order statistics. So, in this case, rather than trying to fit a probability model to the target variable directly, the Ember model captures most of the power of the embedded model while allowing for better modelling of the relationship between variables and of the local variability of the target variable. The probability model then emerges as the envelope of distributions F̂(z | y, m), which does not admit an interpretation as a stationary RF for Z, as it depends on the current state of knowledge, but does allow for exploration of uncertainty through sampling. The embedding is lazy in two senses: rather than looking for potentially very complex interactions between variables, it uses the recognized power of Random Forests to deal with this aspect of the modelling; and no effort is expended on explicitly interpreting Z as some modification of a stationary RF.
In the sort of applications considered here, the Y variable will usually include the spatial location x, meaning that the forest will explicitly use spatial coordinates as training variables. This means that at each location we can estimate the conditional distribution F̂_x ≝ F̂(Z(x) | Y(x) = y, M(x) = m), with y a vector of size p and m a vector of size l. We call this family of estimates the envelope of Z. Note, this is not the full conditional distribution at x; rather, it will be treated as a marginal distribution at x that we will sample from in section 5.3. To emphasize that this is not a model of a spatial random function, we note that the mean of F̂_x, μ̂(x), is a trend rather than an exact interpolator, although it generally performs well and will resemble kriging closely when the latter is a strong variable (Daly, 2020).
Training a forest with embedded models requires an extra step. In many cases the embedded model is an exact interpolator, so M(x_i) = Z_i. This is the case if the embedded model is kriging. If training were to use this exact value, then the forest would over-train and learn little about how the model behaves away from the well locations, so this is not a useful strategy. A more promising strategy is to use cross-validated estimates. Let M_i^- be the cross-validated value found by applying the spatial model at data location x_i, using all the training data except that observed at x_i. This leads to the training algorithm:
1) Choose the iid parameters θ_i for the ith tree.
2) Select a subset of the data, the in-bag samples, for construction of the ith tree (either by subsampling or bootstrapping).
3) For each model and each in-bag sample, calculate the cross-validated estimate M_i^- using all the other in-bag samples.
4) Construct the ith tree using (Y_i, M_i^-, Z_i).
Since the model has now been 'converted into' a datum at each location x_i, to keep the notation simple we will continue to talk about training with (Y_i, Z_i), but we need to remember that some of the components of Y_i are actually embedded models. The effects of this are that a) each tree is constructed on a different data set (as the embedded models depend on the actual in-bag sample), and b) since the M_i^- are not independent, we can no longer make the iid assumption for (Y_i, Z_i) that is usually made for random forests. Meinshausen (2006) shows that his quantile Random Forest algorithm is consistent, in that the estimate of the conditional distribution converges to the true conditional distribution as the number of samples increases.
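Step 3 of the training algorithm can be sketched as below. Since the paper's kriging implementation is not reproduced here, a simple inverse-distance interpolator stands in for the embedded model, and all names (idw_estimate, ember_training_table) are illustrative.

```python
import math

def idw_estimate(x_target, xs, zs, power=2.0):
    """Stand-in embedded model (inverse-distance weighting, NOT the paper's kriging).
    Like kriging, it is an exact interpolator at the data locations."""
    num = den = 0.0
    for xi, zi in zip(xs, zs):
        d = math.dist(x_target, xi)
        if d == 0.0:
            return zi                       # exact interpolation at a datum
        w = d ** -power
        num += w * zi
        den += w
    return num / den

def ember_training_table(X, Z, in_bag):
    """Step 3: for each in-bag sample i, the cross-validated value M_i^- is the
    embedded model applied at x_i using all *other* in-bag samples."""
    rows = []
    for i in in_bag:
        others = [j for j in in_bag if j != i]
        m_cv = idw_estimate(X[i], [X[j] for j in others], [Z[j] for j in others])
        rows.append((i, m_cv))              # tree is then grown on (Y_i, M_i^-, Z_i)
    return rows
```

Because the cross-validated column is recomputed from each tree's own in-bag sample, every tree sees a slightly different data set, which is effect a) noted above.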

A.2 Consistency of the model
The assumptions needed for Theorem 1 are:
1) Y is uniform on [0,1]^p. In fact, it is only required that the density is bounded above and below by positive constants. We discuss what this means for embedded models later.
2) For node sizes, the proportion of observations in each node vanishes for large n (i.e. is o(n)).
3) When a node is split, each variable is chosen with a probability that is bounded below by a positive constant. The split itself is done to ensure that each sub-node receives at least a minimum proportion of the data of the parent node. This proportion and the minimum probabilities are constant for each node.
4) F(z | Y = y) is Lipschitz continuous.
5) F(z | Y = y) is strictly monotonically increasing in z for each y.
Theorem 1 (Meinshausen). When the assumptions above are fulfilled, and if, in addition, the observations are independent, the Embedded Forest estimate is consistent pointwise: for each y, sup_z |F̂(z | Y = y) − F(z | Y = y)| → 0 in probability as n → ∞.
Theorem 1 remains valid when using embedded variables, provided the hypotheses hold and the samples can be considered iid. The embedded variables are iid when the M(x_i) = E(Z(x_i) | Y(x_i)) are iid. While this is not necessarily the case, it is often quite a good approximation and may justify an embedding approach with general models. The hypothesis is not required when kriging is a strong embedded variable. Theorem 2 (Daly, 2020) is quite simple. It is known that, as data becomes dense around a point to be estimated, kriging converges to the true answer. This just shows that the same is true here when either the embedded variable is kriging and is a strong variable, or when the spatial coordinates x, y, z are strong variables. For the demonstration, we need to strengthen assumption 1 and weaken assumption 4. Let D be a closed bounded subset of ℝ^d.
1a) The embedded data variables in Y are uniform on [0,1]^p. In fact, it is only required that the density is bounded above and below by positive constants. Moreover, the samples occur at spatial locations S_n = {x_i}_{i=1}^n such that, for any δ > 0, there exists N such that for all x ∈ D there exists x_i ∈ S_n with d(x, x_i) < δ, for all n > N. In other words, the samples become uniformly dense.
4a) Z(x), x ∈ D, is a mean square continuous second order random function with standard deviation σ(x) < ∞.
Theorem 2. Let D be a closed bounded subset of ℝ^d. Let Z : D → L²(Ω, F, P) be a mean square continuous second order random function with standard deviation σ(x) < ∞, and suppose that, in addition, assumptions 1a, 2, 3 and 5 hold. Then:
1) If kriging is an embedded model and it is a strong variable for the Random Forest, then the Ember estimate Ê(Z | y) = Σ_i w_i(y) Z_i converges in L² to Z(x).
2) If the random forest also has a mean, μ̂(x), that is continuous in x, is trained on the coordinate vector x, and each component of x is a strong variable, then the Ember estimate Ê(Z | y) = Σ_i w_i(y) Z_i converges in L² to Z(x).

A.3 Simulation
For simulation, we need a full probability model. Defining the sampling variable as U(x) = F_x(Z(x) | Y(x) = y), the required hypothesis is that the sampling variable is a uniform, mean square continuous, stationary, ergodic random function. This means that the class of random functions that we can model are of the type Z(x) = φ_x(U(x)) for some monotonic function φ_x, which depends on x. To model U(x), let us further assume that it is generated as the transform of a standard multigaussian random function, so that U(x) = G(X(x)), with G the cumulative Gaussian distribution function and X(x) a stationary multigaussian RF with correlation function ρ(h). This leads to the relation between Z and X, Z(x) = φ_x(G(X(x))), which can be expanded in Hermite polynomials (Daly, 2020); when the location is considered to be a random variable and the expansion is truncated to first order, this gives equation 3.
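A minimal numerical sketch of this anamorphosis follows, with an illustrative φ_x (a location-dependent shift and scale, standing in for the envelope quantile function F_x^{-1}); the helper names G, G_inv and Z_of_X are assumptions of this sketch, not the paper's code.

```python
import math

def G(x):
    """Cumulative standard Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def G_inv(u, lo=-10.0, hi=10.0):
    """Inverse Gaussian CDF by bisection (a library routine would be used in practice)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if G(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def Z_of_X(X_val, local_mean, local_sd):
    """Z(x) = phi_x(G(X(x))). Here phi_x is an illustrative monotone quantile map
    whose x-dependence enters through local_mean and local_sd."""
    u = G(X_val)                     # U(x) = G(X(x)) is uniform on (0, 1)
    return local_mean + local_sd * G_inv(u)
```

The point of the construction is that a single stationary Gaussian X(x) drives a nonstationary Z(x), because the monotone map applied to U(x) changes from location to location.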
To obtain a conditional simulation, it is sufficient to choose a value u_α = U(x_α) at each data location x_α such that the observed target variable datum is z_α = F_{x_α}^{-1}(u_α), where F_{x_α} is the envelope distribution at x_α. Now, in general there is more than one such possible u_α, particularly when the envelope distribution is estimated using random decision forests. In fact any u_α ∈ [u_low, u_high] will satisfy this condition, where u_low and u_high can be read off the Ember estimate of F_{x_α}. Since in this section we assume that U is created as a transform of a Gaussian RF, a truncated Gaussian distribution can be used for the sampling (e.g. Freulon, 1992). The simulation process becomes:
1) Infer a covariance for the underlying Gaussian X(x) (for example, using equation 3).
2) For each simulation:
a. Use a truncated Gaussian to sample values u_α = G(X(x_α)) satisfying z_α = F_{x_α}^{-1}(u_α).
b. Construct a conditional uniform field U(x) based on X(x) such that U(x_α) = u_α.
c. Sample from F(Z(x) | Y(x) = y) using U(x) for each target location x.
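Steps 2a and 2c above might be sketched as follows (illustrative names; the rejection sampler is a crude stand-in for the more efficient truncated Gaussian sampling of Freulon, 1992):

```python
import math, random

def G(x):
    """Cumulative standard Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_conditioning_gaussian(u_low, u_high):
    """Step 2a: rejection-sample X(x_alpha) ~ N(0,1) constrained to G(X) in
    [u_low, u_high], the interval read off the Ember estimate of F_{x_alpha}."""
    while True:
        x = random.gauss(0.0, 1.0)
        if u_low <= G(x) <= u_high:
            return x

def sample_envelope(u, envelope_values):
    """Step 2c: draw from the envelope distribution at a target location by reading
    the u-quantile off its (sorted) values, i.e. Z(x) = F_x^{-1}(U(x))."""
    vals = sorted(envelope_values)
    return vals[min(int(u * len(vals)), len(vals) - 1)]
```

Step 2b, building the conditional uniform field U(x) from X(x), is a standard conditional Gaussian simulation followed by the transform G, and is omitted from the sketch.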
Note: The simulation is based on resampling the training data. So, if the sampling is close to uniform (i.e. each sample ends up in a hyper-rectangle of approximately the same hypervolume), then the simulation will roughly follow the same distribution as the training data.