Geostatistical Learning: Challenges and Opportunities

Statistical learning theory provides the foundation to applied machine learning, and its various successful applications in computer vision, natural language processing and other scientific domains. The theory, however, does not take into account the unique challenges of performing statistical learning in geospatial settings. For instance, it is well known that model errors cannot be assumed to be independent and identically distributed in geospatial (a.k.a. regionalized) variables due to spatial correlation; and trends caused by geophysical processes lead to covariate shifts between the domain where the model was trained and the domain where it will be applied, which in turn harm the use of classical learning methodologies that rely on random samples of the data. In this work, we introduce the geostatistical (transfer) learning problem, and illustrate the challenges of learning from geospatial data by assessing widely-used methods for estimating generalization error of learning models, under covariate shift and spatial correlation. Experiments with synthetic Gaussian process data as well as with real data from geophysical surveys in New Zealand indicate that none of the methods are adequate for model selection in a geospatial context. We provide general guidelines regarding the choice of these methods in practice while new methods are being actively researched.


Introduction
Classical learning theory [1,2,3] and its applied machine learning methods have been popularized in the geosciences after various technological advances, leading initiatives in open-source software [4,5,6,7], and intense marketing from a diverse portfolio of industries. In spite of its popularity, learning theory cannot be applied straightforwardly to solve problems in the geosciences as the characteristics of these problems violate fundamental assumptions used to derive the theory and related methods (e.g. i.i.d. samples).
Among these methods derived under classical assumptions (more on this later), those for estimating the generalization (or prediction) error of learned models in unseen samples are crucial in practice [2]. In fact, estimates of generalization error are widely used for selecting the best performing model for a problem, out of a collection of available models [8]. If estimates of error are inaccurate because of violated assumptions, then there is great chance that models will be selected inappropriately [9]. The issue is aggravated when models of great expressiveness (i.e. many learning parameters) are considered in the collection since they are quite capable of overfitting the available data [10,11]. In the following paragraphs, we consider statistical learning broadly as minimization of generalization error.
The literature on generalization error estimation methods is vast [8,12], and we do not intend to review it extensively here. Nevertheless, some methods have gained popularity since their introduction in the mid 70s because of their generality, ease of use, and availability in open-source software: Leave-one-out (1974). The leave-one-out method for assessing and selecting learning models was based on the idea that to estimate the prediction error on an unseen sample one only needs to hide a seen sample from a dataset and learn the model. Because the hidden sample has a known label, the method can compare the model prediction with the true label for the sample. By repeating the process over the entire dataset, one gets an estimate of the expected generalization error [13]. Leave-one-out has been investigated in parallel by many statisticians, including Nicholson (1960) and Stone (1974), and is also known as ordinary cross-validation. k-fold cross-validation (1975). The term k-fold cross-validation refers to a family of error estimation methods that split a dataset into non-overlapping "folds" for model evaluation. Similar to leave-one-out, each fold is hidden while the model is learned using the remaining folds. It can be thought of as a generalization of leave-one-out where folds may have more than a single sample [14,15]. Cross-validation is less computationally expensive than leave-one-out depending on the size and number of folds, but can introduce bias in the error estimates if the number of samples in the folds used for learning is much smaller than the original number of samples in the dataset.
Major assumptions are involved in the derivation of the estimation methods listed above. The first of them is the assumption that samples come from independent and identically distributed (i.i.d.) random variables. It is wellknown that spatial samples are not i.i.d., and that spatial correlation needs to be modeled explicitly with geostatistical theory. Even though the sample mean of the empirical error used in those methods is an unbiased estimator of the prediction error regardless of the i.i.d. assumption, the precision of the estimator can be degraded considerably with non-i.i.d. samples.
Motivated by the necessity to leverage non-i.i.d. samples in practical applications, and evidence that model's performance is affected by spatial correlation [16,17], the statistical community devised new error estimation methods using the spatial coordinates of the samples: h-block leave-one-out (1995). Developed for time-series data (i.e. data showing temporal dependency), the h-block leave-one-out method is based on the principle that stationary processes achieve a correlation length (the "h") after which the samples are not correlated. The time-series data is then split such that samples used for error evaluation are at least "h steps" distant from the samples used to learn the model [18]. Burman (1994) showed how the method outperformed traditional leave-one-out in time-series prediction by selecting the hyperparameter "h" as a fraction of the data, and correcting the error estimates accordingly to avoid bias.
Spatial leave-one-out (2014). Spatial leave-one-out is a generalization of hblock leave-one-out from time-series to spatial data [19]. The principle is the same, except that the blocks have multiple dimensions (e.g. norm-balls).
Block cross-validation (2016). Similarly to k-fold cross-validation for nonspatial data, block cross-validation was proposed as a faster alternative to spatial leave-one-out. The method creates folds using blocks of size equal to the spatial correlation length, and separates samples for error evaluation from samples used to learn the model. The method introduces the concept of "dead zones", which are regions near the evaluation block that are discarded to avoid over-optimistic error estimates [20,21].
Unlike the estimation methods proposed in the 70s, which use random splits of the data, these methods split the data based on spatial coordinates and what the authors called "dead zones". This set of heuristics for creating data splits avoids configurations in which the model is evaluated on samples that are too near (< spatial correlation length) other samples used for learning the model. Consequently, these estimation methods tend to produce error estimates that are higher on average than their non-spatial counterparts, which are known to be over-optimistic in the presence of spatial correlation. However, systematic splits of the data introduce bias, which have not been emphasized enough in the literature.
All methods for estimating generalization error in classical learning theory, including the methods listed above, rely on a second major assumption. The assumption that the distribution of unseen samples to which the model will be applied is equal to the distribution of samples over which the model was trained. This assumption is very unrealistic for various applications in geosciences, which involve quite heterogeneous (i.e. variable), heteroscedastic (i.e. with different variability) processes [22].
Very recently, an alternative to classical learning theory has been proposed, known as transfer learning theory, to deal with the more difficult problem of learning under shifts in distribution, and learning tasks [23,24,25]. The theory introduces methods that are more amenable for geoscientific work [26,27,28], yet these same methods were not derived for geospatial data (e.g. climate data, Earth observation data, field measurements).
Of particular interest in this work, the covariate shift problem is a type of transfer learning problem where the samples on which the model is applied have a distribution of covariates that differs from the distribution of covariates over which the model was trained [29]. It is relevant in geoscientific applications in which a list of explanatory features is known to predict a response via a set of physical laws that hold everywhere. Under covariate shift, a generalization error estimation method has been proposed: Importance-weighted cross-validation (2007). Under covariate shift, and assuming that learning models may be misspecified, classical cross-validation is not unbiased. Importance weights can be considered for each sample to recover the unbiasedness property of the method, and this is the core idea of importance-weighted cross-validation [30,31]. The method is unbiased under covariate shift for the two most common supervised learning tasks: regression and classification.
The importance weights used in importance-weighted cross-validation are ratios between the target (or test) probability density and the source (or train) probability density of covariates. Density ratios are useful in a much broader set of applications including two-sample tests, outlier detection, and distribution comparison. For that reason, the problem of density ratio estimation has become a general statistical problem [32]. Various density ratio estimators have been proposed with increasing performance [33,34,35,36], yet an investigation is missing that contemplates importance-weighted cross-validation and other existing error estimation methods in geospatial settings.
In this work, we introduce geostatistical (transfer) learning, and discuss how most prior work in spatial statistics fits in a specific type of learning from geospatial data that we term pointwise learning. In order to illustrate the challenges of learning from geospatial data, we assess existing estimators of generalization error from the literature using synthetic Gaussian process data and real data from geophysical well logs in New Zealand that we made publicly available [37].
The paper is organized as follows. In section 2, we introduce geostatistical (transfer) learning, which contains all the elements involved in learning from geospatial data. We define covariate shift in the geospatial setting and briefly review the concept of spatial correlation. In section 3, we define generalization error in geostatistical learning, discuss how it generalizes the classical definition of error in non-spatial settings, and review estimators of generalization error from the literature devised for pointwise learning. In section 4, we present our experiments with geospatial data, and discuss the results of different error estimation methods. In section 5, we conclude the work and share a few remarks regarding the choice of error estimation methods in practice.

Geostatistical learning
In this section, we define the elements of statistical learning in geospatial settings. We discuss the covariate shift and spatial correlation properties of the problem, and illustrate how they affect the involved feature spaces.
Consider a sample space Ω, a source spatial domain D s ⊂ R ds , and a target spatial domain D t ⊂ R dt on which stochastic processes (i.e. spatial random variables) are defined: For example, (Z sj ) j=1,2,...,ns may represent a collection of processes observed remotely from satellite on a 2D surface D s ⊂ R 2 , whereas (Z tj ) j=1,2,...,nt may represent processes that occur within the 3D subsurface of the Earth D t ⊂ R 3 (see Figure 1). Any process Z in these collections can be viewed in two distinct ways: Geostatistical theory. From the viewpoint of geostatistical theory, samples z(·, ω) of the process Z(u, w) are obtained by fixing ω ∈ Ω. These samples are spatial maps that assign a real number to each location u ∈ D.
Learning theory. From the viewpoint of statistical learning theory, scalar samples z(u, ·) are obtained by fixing u ∈ D. These scalar samples are ordered into a "feature vector" x u = (z 1 , z 2 , . . . , z n ) for a collection of processes (Z j ) j=1,2,...,n , and for a specific location u ∈ D. In this case, X u : Ω → R n denotes the corresponding random vector of features such that x u ∼ X u .
In order to define the geostatistical learning problem, we need to understand the joint probability distribution of features for all locations in a spatial domain Pr({X u } u∈D ). This distribution is very complex in general as feature vectors X u and X v for two different locations u = v are not independent. The closer the locations u, v ∈ D in the spatial domain, the more similar are their features x u , x v ∈ R n in the feature space. Moreover, given that only one realization z obs = z(·, ω) ∼ Z of the process is available at any given time, one must introduce stationarity assumptions inside D to pool together different scalar samples z(u, ·) from different locations u ∈ D in the spatial domain, and be able to estimate the distribution.
Regardless of the stationarity assumptions involved in modeling these processes, we can assume that inside D the probability Pr D (X) = Pr({X u } u∈D ) is well-defined. For example, most prior art in statistical learning with geospatial data assumes that the pointwise probability of features Pr u (X) = Pr(X u ) is not a function of location, that is Pr u (X) = Pr(X), ∀u ∈ D. Under this assumption, samples from everywhere in D are used to estimate Pr(X) = Pr(Z 1 , Z 2 , . . . , Z n ). With the additional assumption that the feature vectors X u and X v are independent, the joint distribution of features for all locations can be written as Pr D (X) = u∈D Pr u (X).
Whereas the pointwise stationarity assumption may be reasonable inside a given spatial domain, the assumption of spatial independence of features is rarely defensible in practice. Additionally, pointwise stationarity often does not transfer from a source domain D s where the model is learned to a target domain D t where the model is applied, and consequently the joint distributions of features differ Pr Ds = Pr Dt . Before we can illustrate these two issues in more detail, we need to complete the definition of geostatistical learning problems by introducing the notion of spatial learning tasks.
We have introduced the notion of spatial domain D, and the notion of joint probability of features Pr D (X) for all locations in the domain. Now we introduce the notion of spatial learning tasks, which are similar to classical learning tasks, but with the main difference that they can leverage properties of the underlying spatial domain. Classically, a learning task describes an action in terms of available features to produce new data. For example, "predict feature Z j0 from features (Z j1 , Z j2 )", or "cluster the samples using features (Z j1 , Z j2 , Z j3 )" are classical learning tasks. Differently, a spatial learning task T involves the spatial domain D besides the features, and is therefore more complex. Practical examples from the industry include: • Mining: The task of segmenting a mineral deposit from borehole samples using a set of features is a spatial learning task. It assumes the segmentation result to be a contiguous volume of rock, which is an additional constraint in terms of spatial coordinates.
• Agriculture: The task of identifying crops from satellite images is a spatial learning task. Locations that have the same crop type appear together in the images despite possible noise in image layers (e.g. presence of clouds, animals).
• Petroleum: The task of segmenting formations from seismic data is a spatial learning task because these formations are large-scale near-horizontal layers of stacked rock.
Many more examples of spatial learning tasks exist, and others are yet to be proposed. Given the concepts introduced above, we are now ready for the main definition of this section: Definition (Geostatistical Learning). Let D s be a source spatial domain, and D t be a target spatial domain. Let Pr Ds (X s ) and Pr Dt (X t ) be the joint distributions of features for all locations in these domains, and let T s and T t be two spatial learning tasks. Geostatistical (transfer) learning consists of learning T t over D t using the knowledge acquired while learning T s over D s , and assuming that the observed spatial data in D s and D t are both a single spatial sample of Pr Ds (X s ) and Pr Dt (X t ), respectively. There are considerable differences between the classical definition of transfer learning [23,24], and the proposed definition above. First, the distribution we have denoted by Pr D (X) is spatial and involves all the locations u ∈ D, whereas the distribution in classical transfer learning is the marginal for any specific location, obtained from the assumption of pointwise stationarity Pr(X u ) = Pr(X). Second, we use the term domain to refer to spatial domains D, whereas the nonspatial literature uses the same term for the pair (X u , Pr(X u )) = (X, Pr(X)). Third, the spatial learning task T we have introduced may be described in terms of properties of the spatial domain, which are not available in generic transfer learning problems.
Having understood the main differences between classical and geostatistical learning, we now focus our attention to a specific type of geostatistical transfer learning problem, and illustrate some of the unique challenges caused by spatial dependence.

Covariate shift
Assume that the two spatial domains are different D s = D t , but that they share a set of processes (Z 1 , Z 2 , . . . , Z n ). Additionally, assume that pointwise stationarity holds. Let Z o = f (Z 1 , Z 2 , . . . , Z n ) be a new process obtained as a function of the shared processes, and assume that it has only been observed in D s via a measuring device and/or manual labeling. That is, z obs o (·, ω) ∼ Z o is a spatial sample of the process Z o over D s . Under these assumptions, we can introduce the shared vector of features X s = X t = X = (Z 1 , Z 2 , . . . , Z n , Z o ), and the supervised learning task T s = T t = T of predicting the process Z o regardless of location u ∈ D s ∪ D t .
Let X = X 1:n be the explanatory features, and Y = X n+1 be the response feature. For any u ∈ D s , we can write Pr u (X , Y) = Pr u (Y|X ) Pr u (X ). Likewise, for any v ∈ D t we can write Pr v (X , Y) = Pr v (Y|X ) Pr v (X ). The covariate shift property is defined as follows: Definition (Covariate Shift). A geostatistical learning problem has the covariate shift property when for any u ∈ D s and for any v ∈ D t the distributions Pr u (X , Y) and The property is based on the idea that the underlying true function f that created the process Y = f (X ) is the same for all u ∈ D s and all v ∈ D t . In this case, the function is approximated by the conditional distribution Pr u (Y|X ) = Pr v (Y|X ) for each and every location (see Figure 2).
Pr v (X ) In the geosciences, it is very common to encounter problems with covariate shift due to the great variability of natural processes. Whenever a model is (1) learned using labels provided by experts on a spatial domain D s , is (2) validated with classical train-validation-test methodologies (meaning that it satisfies some performance threshold), and yet (3) performs poorly on a new spatial domain D t where the labeling function is expected to be the same, we can conclude that there are shifts in distribution. In section 4 we illustrate covariate shifts in real data that we prepared in-house from geophysical surveys in New Zealand.

Spatial correlation
Another important issue with geospatial data that is often ignored is spatial dependence, which we illustrate next. As mentioned earlier, the closer are two locations u, v ∈ D in a spatial domain, the more similar are their features x u , x v ∈ R n in the feature space. Different statistics are available to quantify this spatial dependence in a collection of samples, and a popular choice from geostatistics is the variogram γ(h), which estimates for each spatial lag h = ||u − v|| ∈ R + 0 a correlation σ 2 − γ(h), where σ 2 is the total sill in the samples [22]. Parallel algorithms for efficient variogram estimation exist in the literature [38], and can be useful tools for fast diagnosis of the spatial correlation property: Definition (Spatial Correlation). A geostatistical learning problem has the spatial correlation property when the variogram of any of the stochastic processes (Z sj ) j=1,2,...,ns and (Z tj ) j=1,2,...,nt defined over D s and D t has a nonnegligible positive range (or correlation length).
Besides serving as a tool for diagnosing spatial correlation in geostatistical learning problems, variograms can also be used to simulate spatial processes with theoretical correlation structure. In Figure 3, we illustrate the impact of spatial correlation in the feature space of two independent spatial processes Z 1 and Z 2 simulated with direct (a.k.a. LU) Gaussian simulation [39]. As we increase the variogram range r in a spatial domain D with 100 × 100 pixels, we observe that the distribution of features Pr(X ) = Pr(Z 1 , Z 2 ) is gradually deformed from a standard Gaussian (r = 0) to a "boomerang"-shaped distribution (r = 80). Figure 3: Impact of spatial correlation in feature space. Two Gaussian processes Z 1 and Z 2 are simulated over a domain D with 100 × 100 pixels. As the variogram range r increases from r = 0 to r = 80 pixels, the distribution of features Pr(X ) = Pr(Z 1 , Z 2 ) is gradually deformed away from the standard Gaussian N (0, I). Marker colors in scatter plots represent the pixel number in column-major order.
Similar deformations are observed when the two processes Z 1 and Z 2 are correlated. In Figure 4, we illustrate the impact of spatial correlation for an inter-process correlation of ρ(Z 1 , Z 2 ) = 0.9. Figure 4: Impact of spatial correlation in feature space with correlated processes. Similar to Figure 3 with the difference that the processes Z 1 and Z 2 are strongly correlated with correlation ρ(Z 1 , Z 2 ) = 0.9.
Spatial correlation may have different impact in source and target domains D s and D t , and can certainly affect the generalization error of learning models. In our experiments of section 4, we assume that the variogram range of source and target processes are equal (i.e. r s = r t = r) to facilitate the analysis of the results. In practice, source and target processes may also have different spatial correlation, which is a type of shift that is not considered in classical transfer learning problems.

Generalization error of learning models
Having defined geostatistical learning problems, and their covariate shift and spatial correlation properties, we now turn into a general definition of generalization error of learning models in geospatial settings. We review an importanceweighted approximation of a related generalization error based on pointwise stationarity assumptions, and the use of an efficient importance-weighted crossvalidation method for error estimation.
Consider a geostatistical learning problem P = {(D s , Pr Ds , T s ), (D t , Pr Dt , T t )} with a single supervised spatial learning task T s = T t = T (e.g. regression), and assume that a set of response features Y u are created by a function f , based on a set of explanatory features X u for each and every location u ∈ D s ∪ D t . Our goal is to learn a model {Y u } u∈Dt ≈f {X } u∈Dt over the target domain D t that approximates f in terms of expected risk for some spatial supervised loss function L:f In the expected value of Equation 2, spatial samples of the processes are drawn from Pr Dt and rearranged into feature vectors X u and Y u for every location u ∈ D t . The spatial loss function L compares the spatial map of features from the sample {Y u } u∈Dt with the approximated map from the model The modelf is the model that minimizes the expected loss (or risk) over the target domain D t .
Definition (Generalization Error). The generalization error of a learning modelf in a geostatistical learning problem P is the expected risk attained by the model when spatial samples are drawn from Pr Dt over the target domain D t (see Equation 2).
Unlike the classical definition of generalization error, the definition above for geostatistical learning problems relies on a spatial loss function L, and on spatial samples like those produced via geostatistical simulation [40,41]. For truly spatial learning modelsf that use multiple locations in the spatial domain to make predictions, this generalization error is more appropriate. In this present work, however; we do not target spatial learning models, and only consider pointwise learning: Definition (Pointwise Learning). Given a family of classical (non-spatial) learning models {f u } u∈D , pointwise learning consists of learning the model f {X } u∈D = {f u (X u )} u∈D that assigns for each location u ∈ D the valuê f u (X u ) independently of the explanatory features at other locations.
More specifically, we consider pointwise learning with families that are made of a single learning model {f u } u∈D = {ḟ }. In this case, the modelḟ is often learned based on pointwise stationarity assumptions, for some pointwise lossL: Although pointwise learning with a single model is a very simple type of geostatistical learning, it is by far the most widely used approach in the geospatial literature. We acknowledge this fact, and consider an empirical approximation of the pointwise expected risk in Equation 3 as opposed to the spatial expected risk in Equation 2.
An empirical approximation of the pointwise expected risk of a modelġ can be obtained via discretization of the target spatial domain D t : with |D t | the number of locations in the discretization. The problem with this empirical approximation is that the response features Y u are not available in the target domain where the model will be applied. However, it is easy to show that the pointwise expected risk in Equation 4 can be rewritten with importance weightsẇ(X , Y) when samples from D s are drawn instead [30]: withẇ(X , Y) = Pr u∈Ds (X ,Y) . Under covariate shift, the importance weights only depend on the distribution of explanatory featuresẇ(X ) = Pr v∈D t (X ) Pr u∈Ds (X ) , and we can write a simple importance-weighted empirical approximation: Our goal is to find the pointwise model that minimizes the empirical risk approximationR t (ġ) introduced in Equation 6: Alternatively, our goal is to rank a collection of models {ġ i } i=1,2,...,k based on their empirical risk {R t (ġ i )} i=1,2,...,k in a geostatistical learning problem to aid model selection.
In order to achieve the stated goals, we need to (1) estimate the importance weights in the empirical risk approximation, and (2) remove the dependence of the approximation on a specific dataset. These two issues are addressed in the following sections.

Density ratio estimation
The empirical approximation of the riskR t (ġ), depends on estimates of the weightsẇ(X u ), which are ratios of probabilities in the target and source domains. The following problem can be posed [32]: Definition (Density Ratio Estimation). Given two collections of samples {X u } u∈Ds and {X v } v∈Dt from source and target domains, estimate the density ratio Pr u∈Ds (X ) at any new sample X . In particular, estimate the ratio at all samples {X u } u∈Ds from the source.
Efficient methods for density ratio estimation that perform well with highdimensional features have been proposed in the literature. In this work we consider a fast method named Least Squares Importance Fitting (LSIF) [35,36]. The LSIF method assumes that the weights are a linear combination of basis functionsẇ(X u ) = α ϕ(X u ) with coefficients to be learned α = (α 1 , α 2 , . . . , α b ) and fixed basis ϕ(X u ) = (ϕ 1 (X u ), ϕ 2 (X u ), . . . , ϕ b (X u )). The LSIF estimator is derived by minimizing the squared error with the true density ratio: under the constraint that densities are always positive. By choosing b center features randomly from the target domain {X i } i=1,2,...,b , the method introduces a Gaussian kernel basis ϕ i (X u ) = k(X u , X i ) that simplifies the objective function to a matrix form: with H = ϕ(X u )ϕ(X u ) Pr u∈Ds (X u )dX u and h = ϕ(X v ) Pr v∈Dt (X v )dX v . The regularization parameter λ ≥ 0 on the coefficients α avoids overfitting, and empirical estimates of both H and h are easily obtained with sample averages: This quadratic optimization problem with linear inequality constraints can be solved very efficiently with modern optimization software [42,43]. In the end, the optimal coefficients α are plugged back into the basis expansion for optimal estimates of the weights on new samplesẇ(X ) = α ϕ(X ).

Weighted cross-validation
In order to remove the dependence of the empirical risk approximation on the dataset, we use importance-weighted cross-validation (IWCV) [30,31]. As with the traditional cross-validation procedure, the source domain is split into k folds D s = k j=1 D (j) s , and each fold D (j) s is hidden for error evaluation while the modelġ (j) is learned on the remaining folds: The main difference in the IWCV procedure are the weights that multiply each sample. The regularization exponent l ∈ [0, 1] can be set to zero to recover the traditional estimator, or to a positive value to account for covariate shift. An optimal value for l can be found via hyperparameter search by considering another layer of cross-validation. In this work, we simply set default values for l such as l = 1 or l = 0.5.
In the rest of the paper, we combine IWCV with LSIF into a method for estimating generalization error that we term Density Ratio Validation. Although IWCV is known to outperform classical cross-validation methods in non-spatial settings, little is known about its performance with geospatial data. Moreover, like all prior art, IWCV approximates the pointwise generalization error of Equation 3 as opposed to the geostatistical generalization error of Equation 2, and therefore is limited by design to non-spatial learning models.

Experiments
In this section, we perform experiments to assess estimators of generalization error under varying covariate shifts and spatial correlation lengths. We consider Cross-Validation (CV), Block Cross-Validation (BCV) and Density Ratio Validation (DRV), which all rely on the same cross-validatory mechanism of splitting data into folds.
First, we use synthetic Gaussian process data and simple labeling functions to construct geostatistical learning problems for which learning models have a known (via geostatistical simulation) generalization error. In this case, we assess the estimators in terms of how well they estimate the actual error under various spatial distributions. Second, we demonstrate how the estimators are used for model selection in a real application with well logs from New Zealand, which can be considered to be a dataset of moderate size in this field.

Gaussian processes
Let Z s1 , Z s2 be two Gaussian processes with constant mean µ s and variogram γ s defined over D s , and likewise let Z t1 , Z t2 be two Gaussian processes with constant mean µ t and variogram γ t defined over D t . Denote by r s the variogram range (or correlation length) and by σ 2 s the variogram sill (or total variance) of the processes in the source domain. Likewise, denote by r t and σ 2 t the range and sill of the variogram in the target domain. It is clear that pointwise stationarity holds inside each of these domains. The feature vector X u ∈ R 2 for any location u ∈ D s in the source domain has a bivariate Gaussian distribution N (µ s 1, σ 2 s I), whereas the feature vector X v ∈ R 2 for any location v ∈ D t in the target domain has a bivariate Gaussian distribution N (µ t 1, σ 2 t I). By constraining the variogram ranges to be equal in source and target domains, that is r s = r t = r, and by requiring that both variograms pass through the origin (i.e. no nugget effect), we can investigate two types of covariate shift for various ranges r: Mean shift. Define the shift in the mean as δ = c |µ t − µ s | ∈ [0, ∞) for some normalization constant c > 0. In this experiment, we set the value of this constant to c = 1 3 √ 2σs for convenience so that a δ = 1 becomes equivalent to |µ t − µ s | = 3 √ 2σ s , which in turn is equivalent to two circles (i.e. bivariate Gaussians) of radii 3σ s touching each other along the identity line, see Figure 5.
Variance shift. Define the shift in the variance as τ = σ t /σ s ∈ (0, ∞). Here, τ = 1 means absence of variance shift, τ < 1 means that the variance where the model is applied is smaller than the variance where the model was trained, and τ > 1 means the exact opposite of τ < 1.
Geostatistical learning problems with τ > 1 are very challenging to solve, and usually require additional extrapolation models, beyond the pointwise learning models discussed in this work. Therefore, we only consider cases with τ ≤ The first configuration in Equation 12 refers to the case in which the target distribution N (µ t 1, σ 2 t I) is "inside" the source distribution N (µ s 1, σ 2 s I) meaning that the circle of radius 3σ t centered at µ t 1 is contained in the circle of radius 3σ s centered at µ s 1. Similarly, the second configuration refers to the case in which the target distribution is "outside" the source distribution. Finally, the third configuration refers to a "partial" overlap when the two distributions share a common set of samples but are not entirely one inside of the other. We note, however; that the illustration with circles provided in Figure 5 is only representative in the absence of spatial correlation (i.e. r = 0), see Figure 3.
To efficiently simulate multiple spatial samples of the processes over a regular grid domain with 100 × 100 locations (or pixels), we use spectral Gaussian simulation [44]. We fix the parameters of the source distribution at µ s = 0 inside x u x v For u ∈ D s and v ∈ D t : partial P ru ( X ) Figure 5: Three possible shift configurations. The target distribution is "inside" the source distribution (left), "outside" (right), or they show "partial" overlap (middle). The processes Zs 1 and Zs 2 are observed at any location u ∈ Ds forming a feature vector xu ∈ R 2 . Similarly, the processes Zt 1 and Zt 2 (which are shifted versions of the previous two) are observed at any location v ∈ Dt forming a feature vector xv ∈ R 2 . In this case the features xu ∼ Pru(X ) and xv ∼ Prv(X ) are samples of bivariate Gaussians illustrated with circles of radii 3σs and 3σt centered at µs1 and µt1.
and σ s = 1 without loss of generality, and assume no inter-process correlation (i.e. ρ = 0) like we did in Figure 3. Under these modeling assumptions, we are able to investigate the spatial distribution of features as a function of shift parameters (δ, τ ) ∈ B and variogram ranges r ∈ C = {0, 10, 20}.
To fully specify the geostatistical learning problem, we need to specify a learning task. The task consists of predicting a binary variable Y v at locations v in the target grid D t based on observations y u of the variable at locations u in the source grid D s . These observations (or labels) are synthesized using known labeling functions such as y u = sgn(sin(w x u p )), where · p is the p-norm, w is the angular frequency, and sgn is the modified sign function that assigns +1 to x ≥ 0 and −1 otherwise. The observations produced by these functions form alternating patterns in the feature space, which are not trivial to predict with simple learning models, see Figure 6. In this experiment, we fix p = 1 and w = 4 to save computational time. Other norms and angular frequencies produce similar results.
Having defined the problem, we proceed and specify learning models in order to investigate the different estimators of generalization error. We choose two models that are based on different prediction mechanisms [2]: Decision tree. A pointwise decision tree modelḟ T makes predictions solely based on the features of the sample, without exploiting nearby features in the feature space.
K-nearest neighbors. A pointwise k-nearest neighbors modelḟ N makes predictions based on nearby features, and is sometimes called a "spatial model".
These two modelsḟ T andḟ N are simply representative models from the "non-spatial" and "spatial" families of models. We emphasise, however; that the term "spatial model" can be misleading in the spatial statistics literature.  (w x p)). Labels form alternating patterns in the feature space for different p-norm and angular frequencies w.
It is important to distinguish "spatial models" such as k-nearest neighbors that exploit the notion of proximity of features in the feature space from "geospatial models" that also exploit the proximity of samples in the physical space (or spatial domain as we have been calling it) besides their features.
The experiment proceeds as follows. For each shift (δ, τ ) ∈ B, each correlation length r ∈ C, and each pointwise learning modelḟ ∈ {ḟ T ,ḟ N }, we sample a problem P δ,τ,r and estimate the generalization error of the modelḟ on the problem with CV, BCV and DRV. We set the hyperparameters of the CV and BCV estimators based on the fact that the correlation length never exceeds 20. For instance, we set the block side in BCV to s = 20, and use the equivalent number of folds in CV, i.e. k = (100/20) 2 = 25 for a domain with 100 × 100 pixels. We set the kernel width of the LSIF estimator in DRV to σ = 2 based on the synthetic Gaussian distributions, and use 10 kernels in the basis expansion. Additionally, we approximate the true generalization error of the models with Monte Carlo simulation over the target domain (e.g. 100 spatial samples).
To facilitate the visualization of the results, we introduce shift functions S : B → [0, ∞) that map the shift parameters (δ, τ ) to a single covariate shift value, which can be interpreted loosely as the "difficulty" of the problem: Kullback-Leibler divergence. The divergence or relative entropy between two distributions p and q is defined as S KL (p||q) = p(x) log p(x) q(x) dx, and can be derived analytically for two (2D) Gaussian distributions p = N (µ t 1, σ 2 t I) and q = N (µ s 1, σ 2 s I). We derive a formula in terms of δ and τ by fixing σ s = 1: Jaccard distance. The Jaccard index between two sets A and B is defined as J(A, B) = |A∩B| |A∪B| , and the corresponding distance as S J (A, B) = 1 − J(A, B). For two (2D) Gaussian distributions, we consider A and B to be circles of radii 3σ s and 3σ t centered at µ s 1 and µ t 1. The distance is then expressed in terms of areas, which can be derived analytically in terms of δ and τ by fixing σ s = 1: Novelty factor. We propose a new shift function termed the novelty factor inspired by the geometric view of Jaccard. First, we define the novelty of B with respect to A as N (B/A) = |B−A∩B|−|A∩B| |B| , and notice that it is the fraction of B that is outside of A minus the fraction of B that is inside of A. Second, we restrict the definition to cases with |B| ≤ |A| (e.g. Gaussian case with τ ≤ 1), and notice that the novelty N (B/A) lies in the interval [−1, 1]. Finally, we define the novelty factor S N (A, B) = N (B/A)+1 2 in the interval [0, 1], which can be easily computed for the Gaussian case using the formulas derived in Equation 14.
We plot the true generalization error of the models as a function of the different covariate shifts in Figure 7, and color the points according to their shift configuration (see Equation 12). In this plot, the horizontal dashed line intercepting the vertical axis at 0.5 represents a model that assigns positive and negative labels to samples at random with equal weight (i.e. Bernoulli variable).
Among the three shift functions, the novelty factor is the only function that groups shift configurations along the horizontal axis. In this case, configurations deemed easy (i.e. where the target distribution is inside the source distribution) appear first, then configurations of moderate difficulty (i.e. partial overlap) appear next, and finally difficult configurations (i.e. target is outside the source) appear near the theoretical Bernoulli limit. The Kullback-Leibler divergence and the Jaccard distance fail to summarize the shift parameters into a onedimensional visualization, and are therefore omitted in the next plots.
The two models behave similarly in terms of generalization error for the given dataset size (i.e. 100 × 100 pixels), and therefore can be aggregated into a single scatter to increase the confidence in the observed trends.
We plot the CV, BCV and DRV estimates of generalization error versus covariate shift (i.e. novelty factor) in the top row of Figure 8, and color the points according to their correlation length. We omit a few DRV estimates that suffered from numerical instability in challenging partial or outside configurations, and show the box plot of error estimates in the bottom row of the figure.
First, we emphasize that the CV and BCV estimates remain constant as a function of covariate shift. This is expected given that these estimators do not  Figure 7: Generalization error of learning models versus covariate shift functions. Among all shift functions, the novelty factor is the only function that groups shift configurations along the horizontal axis. Models behave similarly in terms of generalization error for the given dataset size (100 × 100 pixels), and difficult shift configurations lead to errors that approach the theoretical Bernoulli limit of 0.5. make use of the target distribution. The DRV estimates increase with covariate shift as expected, but do not follow the same rate of increase of the true (or actual) generalization error obtained with Monte Carlo simulation. Second, we emphasize in the box plots for the inside configuration that the correlation length affects the estimators differently. The CV estimator becomes more optimistic with increasing correlation length, whereas the BCV estimator becomes less optimistic, a result that is also expected from prior art. Additionally, the interquartile range of the BCV estimator increases with correlation length. It is not clear from the box plots that a trend exists for the DRV estimator. The actual generalization error behaves erratically in the presence of large correlation lengths as indicated by the scatter and box plots.
In order to better visualize the trends in the estimates, we smooth the scatter plots with locally weighted regression per correlation length in the top row of Figure 9, and show in the bottom row of the figure the Q-Q plots of the different estimates against the actual generalization error for the inside configuration where all estimators are supposed to perform well.
From the figure, there exists a gap between the DRV estimates and the actual generalization error of the models for all covariate shifts. This gap is expected given that the target distribution may be very different from the source distribution, particularly in partial or outside shift configurations. On the other hand, the gap also seems to be affected by the correlation length, and is largest with 20 pixels of correlation. Additionally, we emphasize in the Q-Q plots that the BCV estimates are biased due to the systematic selection of folds. The BCV estimates are less optimistic than the CV estimates, which is a desired property in practice, however there is no guarantee that the former estimates will approximate well the actual generalization error of the models.

New Zealand dataset
Unlike the previous experiment with synthetic Gaussian process data and known generalization error, this experiment consists of applying the CV, BCV and DRV estimators to a real dataset of well logs prepared in-house [37]. We quickly describe the dataset, introduce the related geostatistical learning problems, and use error estimates to rank learning models. Finally, we compare these ranks with an ideal rank obtained with additional label information that is not available during the learning process.
The dataset consists of 407 wells in the Taranaki basin, including the main geophysical logs and reported geological formations. The basin comprises an area of about 330.000km 2 , located broadly onshore and offshore the New Zealand west coast (see Figure 10). Well trajectories are georeferenced in UTM coordinates (X and Y) and true vertical depth (Z). We split the wells into onshore and offshore locations in order to introduce a geostatistical learning problem with covariate shift. The problem consists of predicting the rock formation from well logs offshore after learning a model with well logs and reported (i.e. manually labeled) formations onshore. The well logs considered are gamma ray (GR), spontaneous potential (SP), density (DENS), compressional sonic (DTC) and neutron porosity (NEUT). We eliminate locations with missing values for these logs and investigate a balanced dataset with the two most frequent formations-Urenui and Manganui. We normalize the logs and illustrate the covariate shift property by comparing the scatter plots of onshore and offshore locations in Figure 11. Additionally, we define a second geostatistical learning problem without covariate shift. In this case, we join all locations filtered in the previous problem and sample two new sets of locations with sizes respecting the same source-to-target proportion (e.g. 300000 : 50000).
We set the hyperparameters of the error estimators based on variography and according to available computational resources. In particular, we set blocks for the BCV estimators with sides 10000 × 10000 × 500 that are much greater than the vertical and horizontal correlation lengths estimated from empirical variograms. We obtain the corresponding number of folds k = 99 for the CV estimator by partitioning the bounding box of onshore wells into blocks with the given sides. Similarly to the previous experiment with synthetic Gaussian process data, we set the kernel width in DRV to σ = 2 given that the well logs were normalized to have unit variance. Finally, we select a list of learning models to rank including Ridge classification (Ridge), logistic regression (Logistic), knearest neighbors (KNeighbors), naive Bayes (GaussianNB), linear discriminant analysis (BayesianLDA), perceptron (Perceptron), decision tree (DecisionTree), and a dummy model that reproduces the marginal distribution of formations in the source domain (Dummy).
In Table 1, we report the results for the onshore-to-offshore problem. In the upper part of the table we compare side-by-side the error estimates of the different methods. We highlight the closest estimates to the target error in blue color, and the most distant in red color. We emphasize that the target error is the error of the model in one single realization of the process, and is not the generalization error averaged over multiple spatial realizations. In spite of this important distinction, we still think it is valuable to compare it with the estimates of generalization error given by CV, BCV and DRV since these methods were all derived under pointwise learning assumptions, and are therefore smooth averages over multiple points exactly like the error estimated from the single realization of the target. In the bottom part of the table, we report the model ranks derived from the error estimates as well as the ideal rank derived from the target error.
Among the three estimators of generalization error, the CV estimator produces estimates that are the most distant from the target error, with a tendency to underestimate the error. The BCV estimator produces estimates that are higher than the CV estimates, and consequently closer to the target error in Figure 11: Distribution of main geophysical logs onshore (gray) and offshore (purple) centered by the mean and divided by the standard deviation. Visible covariate shift in the scatter and contour plots. this case. The DRV estimator produces the closest estimates for most models, however; like the CV estimator it fails to approximate the error for models like KNeighbors and DecisionTree that are over-fitted to the source distribution. The three estimators fail to rank the models under covariate shift and spatial correlation. Over-fitted models with low generalization ability are incorrectly ranked at the top of the list, and the best models, which are simple "linear" models, appear at the bottom. We compare these results with the results obtained for the problem without covariate shift in Table 2.
From Table 2, the CV estimator produces estimates that are the closest to the target error. The BCV estimator produces estimates that are higher than the CV estimates as before, however this time this means that the BCV estimates are the most distant to the target error. The DRV estimator produces estimates that are not the closest nor the most distant to the target error. The three estimators successfully rank the models from simple linear models at the bottom of the list to more complex learning models at the top. Unlike the previous problem with covariate shift, this time complex models like KNeighbors and DecisionTree show high generalization ability.

Conclusions
In this work, we introduce geostatistical (transfer) learning, and demonstrate how most prior art in statistical learning with geospatial data fits into a category we term pointwise learning. We define geostatistical generalization error and demonstrate how existing estimators from the spatial statistics literature such as block cross-validation are derived for that specific category of learning, and are therefore unable to account for general spatial errors.
We propose experiments with spatial data to compare estimators of generalization error, and illustrate how these estimators fail to rank models under  Table 1: Estimates of generalization error with different estimators for the onshore-to-offshore problem. The CV estimator produces estimates that are the most distant to the actual target error due to covariate shift and spatial correlation. None of the estimators is capable of ranking the models correctly. They all select complex models with low generalization ability.
covariate shift and spatial correlation. Based on the results of these experiments, we share a few remarks related to the choice of estimators in practice: • The apparent quality of the BCV estimator is falsified in the Q-Q plots of Figure 9 and in Table 2. The systematic bias produced by the blocking mechanism only guarantees that the error estimates are higher than the CV estimates. When the CV estimates are good (i.e. no covariate shift), the BCV estimates are unnecessarily pessimistic.
• The CV estimator is not adequate for geostatistical learning problems that show various forms of covariate shift. Situations without covariate shift are rare in geoscientific settings, and since the DRV estimator works reasonably well for both situations (i.e. with and without shift), it is recommended instead.
• Nevertheless, both the CV and DRV estimators suffer from a serious issue with over-fitted models in which case they largely underestimate the generalization error. For risk-averse applications where one needs to be careful about the generalization error of the model, the BCV estimator can provide more conservative results.
• None of the three estimators were capable of ranking models correctly under covariate shift and spatial correlation. This is an indication that  one needs to be skeptical about interpreting similar rankings available in the literature.
Finally, we believe that this work can motivate methodological advances in learning from geospatial data, including research on new estimators of geostatistical generalization error as opposed to pointwise generalization error, and more explicit treatments of spatial coordinates of samples in learning models.

Computer code availability
All concepts and methods developed in this paper are made available in the GeoStats.jl project [45]. The project is hosted on GitHub under the ISC 3 open-source license: https://github.com/JuliaEarth/GeoStats.jl.
Experiments of this specific work can be reproduced with the following scripts: https://github.com/IBM/geostats-gen-error.