Non-local observations and information transfer in data assimilation

Non-local observations are observations that cannot be allocated one specific spatial location. Examples are observations that are spatial averages of linear or non-linear functions of system variables. In conventional data assimilation, such as (ensemble) Kalman Filters and variational methods, information transfer between observations and model variables is governed by covariance matrices that are either preset or determined from the dynamical evolution of the system. In many science fields the covariance structures have limited spatial extent, and this paper discusses what happens when this spatial extent is smaller than the support of the observation operator that maps state space to observation space. It is shown that information is carried beyond the physical information in the prior covariance structures by the non-local observational constraints, building an information bridge (or information channel) that has not been studied before: the posterior covariance can have non-zero covariance structures where the prior has a covariance of zero. It is shown that in standard data-assimilation techniques that enforce a covariance structure and limit information transfer to that structure, the order in which local and non-local observations are assimilated can have a large influence on the analysis. Local observations should be assimilated first. This relates directly to localization used in Ensemble Kalman Filters and Smoothers, but also to variational methods with a prescribed covariance structure where observations are assimilated in batches. This suggests that the emphasis on covariance modeling should shift away from the prior covariance and toward the modeling of the covariances between model and observation space. Furthermore, it is shown that local observations with non-locally correlated observation errors behave in the same way as uncorrelated observations that are non-local. Several theoretical results are illustrated with simple numerical examples.
The significance of the information bridge provided by non-local observations is highlighted further through discussions of temporally non-local observations, and new ideas on targeted observations.


INTRODUCTION
The most general form of data assimilation is given by Bayes Theorem, which describes how the probability density function (pdf) of the state of the system x is updated when observations y become available:

p(x | y) = p(y | x) p(x) / p(y),

in which p(x) is the prior pdf of the state, and p(y | x) the likelihood of the observation given that the state is equal to x. This likelihood is determined by the measurement process. For instance, when the measurement error is additive we can write

y = H(x) + ε.

This equation maps the given state vector x into observation space via the observation operator H(..). Since y is given, this equation also determines ε, and since the pdf of the observation errors is known, we know what the likelihood looks like. It is emphasized that Bayes Theorem is a point-wise equation for every possible state vector x.
A non-local observation is typically defined as an observation that cannot be attributed to one model grid point. The consequence is that model state space and observation space should be treated differently. However, Bayes Theorem is still valid, and general enough to tell us how to assimilate these non-local observations. Matters are different in practical data-assimilation methods for high-dimensional systems that are governed by local dynamics. Examples are Ensemble Kalman Filters, in which a small number, typically O(10-100), of ensemble members is used to mimic a Kalman Filter. Because of the small ensemble size the sample covariance matrix is noisy, and a technique called localization is used to set long-range correlations to zero, as they physically should be, see [1, 2].
Non-local observations can have a support that is larger than the localization area. By support we mean that part of state space that is needed to specify the model equivalent of an observation. When H is linear, it is that part of state space that is not mapped to zero. A larger support is not necessarily a problem as long as non-local observations are allowed to influence those model variables with which they have strong correlations, see e.g., [3]. Assimilating non-local observations as local ones, e.g., by using the grid points where they have most influence, can lead to degradation of the data-assimilation result, as Liu et al. [4] show, and hence it is important to retain their full non-local structure.
There has been an extensive search for efficient covariance localization methods that allow for non-local observations, including using off-line climatological ensembles, groups of ensembles, and augmented ensembles in which the ensemble members are localized by construction, see e.g., [5][6][7][8][9].
All of these methods try to find the best possible localization function based on the prior. The main focus of this paper is not on developing better covariance estimates, but rather on the influence on data-assimilation results of non-local observations for which the support of the observation operator is larger than the dependency (or, for linear relations, correlation) length scale in the prior. This can be due to a misspecification of the prior localization area, or due to a real prior covariance influence area that is smaller than the support of the observation operator. Since the prior is expected to contain the physical dependencies in the system, this means that a non-local observation needs information from model variables that are physically independent. As will be shown, after assimilation new dependencies between the variables involved in the observation operator are generated, on top of the physical dependencies already present. Hence, the non-local observations generate information bridges that are not present in the prior. These bridges can appear both in space and in time.
As an example of the influence of non-local observations on practical data-assimilation systems: since non-local observations generate information bridges, and thus build new covariance structures, the order in which observations are assimilated becomes important in serial assimilation when covariance length scales are imposed, as in standard localization techniques and in variational methods. This is also true for local observations, but the effect in the non-local case is much larger.
In this paper we will discuss the implications of these information bridges, and strategies for assimilating non-local observations. Furthermore, a connection is made to correlated observation errors where the correlations are non-local in the sense defined above. Finally, we discuss ways in which the appearance of these information bridges can be exploited to improve data-assimilation systems.

THE ASSIMILATION OF NON-LOCAL OBSERVATIONS
In the following we will first demonstrate the treatment of non-local observations in the most general way, via Bayes Theorem, and show how non-local observations generate information bridges in the posterior. Then we show that the order in which local and non-local observations are assimilated does not matter when we solve the full data-assimilation problem, so building the bridges earlier or later is not relevant. This conclusion does not necessarily hold when approximations to the full data-assimilation problem are introduced, as we will see in later sections.

Non-local Observations in Bayes Theorem
Let us first study how these information bridges are formed via a simple example. Assume two parts of the state space are independent under the prior, so p(x_1, x_2) = p(x_1) p(x_2), and we have an observation that combines the two, e.g., y = H(x_1, x_2, ε), where the observation operator H(..) can be a non-linear function of its arguments. Bayes Theorem shows:

p(x_1, x_2 | y) = p(y | x_1, x_2) p(x_1) p(x_2) / p(y).

Since y depends on both x_1 and x_2, the likelihood cannot be separated into a function of x_1 only times a function of x_2 only. This means that we also cannot separate the posterior pdf in this way, and hence x_1 and x_2 have become dependent under the posterior. Since Bayes Theorem is the basis of all data-assimilation schemes, the same is true for (Ensemble) Kalman Filters/Smoothers, variational methods, or, e.g., Particle Filters. As an example, Figure 1 shows the joint prior pdf of two independent variables. The pdf is constructed from p(x_1, x_2) = p(x_1) p(x_2), in which p(x_1) is bimodal and p(x_2) a unimodal Gaussian. The likelihood is given in Figure 2, related to an observation y = x_1 + x_2 + ε in which ε is Gaussian distributed with zero mean. Their product is the posterior given in Figure 3. It is clearly visible that the two variables are highly dependent under the posterior, purely due to the non-local observation.
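This construction can be reproduced directly on a grid. The sketch below (the grid, the mixture parameters, and the observed value are illustrative assumptions, not taken from the figures) multiplies a bimodal-times-Gaussian prior with the likelihood of y = x_1 + x_2 + ε and compares the correlation between x_1 and x_2 before and after:

```python
import numpy as np

# Grid over (x1, x2); prior p(x1, x2) = p(x1) p(x2) with bimodal p(x1)
x1g = np.linspace(-4.0, 4.0, 201)
x2g = np.linspace(-4.0, 4.0, 201)
X1, X2 = np.meshgrid(x1g, x2g, indexing="ij")
dA = (x1g[1] - x1g[0]) * (x2g[1] - x2g[0])   # grid-cell area

def gauss(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

prior = (0.5 * gauss(X1, -1.5, 0.3) + 0.5 * gauss(X1, 1.5, 0.3)) * gauss(X2, 0.0, 0.5)
prior /= prior.sum() * dA

# Non-local observation y = x1 + x2 + eps, eps ~ N(0, r)
y_obs, r = 0.5, 0.2
likelihood = gauss(y_obs, X1 + X2, r)

# Bayes: posterior is prior times likelihood, normalized
posterior = prior * likelihood
posterior /= posterior.sum() * dA

def correlation(p):
    m1, m2 = (X1 * p).sum() * dA, (X2 * p).sum() * dA
    c = ((X1 - m1) * (X2 - m2) * p).sum() * dA
    v1 = ((X1 - m1) ** 2 * p).sum() * dA
    v2 = ((X2 - m2) ** 2 * p).sum() * dA
    return c / np.sqrt(v1 * v2)

prior_corr, post_corr = correlation(prior), correlation(posterior)
print(prior_corr, post_corr)   # ~0 under the prior, strongly negative under the posterior
```

The zero prior correlation turns into a strong negative posterior correlation: the non-local observation alone has built the bridge.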
We now analyse the following simple system in more detail to understand the influence of non-local spatial observations in linear and linearized data-assimilation methods. The state is two-dimensional, x = (x_1, x_2)^T, with diagonal prior covariance matrix B with diagonal elements (b_11, b_22), and a non-local observation operator H = (1 1). A scalar non-local observation y = H x_true + ε_true with measurement error variance r is taken. The subscript true reminds us that the observation is from the true system, while x denotes the state of our model of the real world (this can easily be generalized to different parts x_1 and x_2 of a larger state vector and more, or more complicated, non-local observations y; the two-dimensional system is chosen here for ease of presentation).
The Kalman filter update equation for this system reads:

x_1^a = x_1 + b_11/(b_11 + b_22 + r) (y − x_1 − x_2),
x_2^a = x_2 + b_22/(b_11 + b_22 + r) (y − x_1 − x_2),

with posterior covariance matrix:

b_11^a = b_11 − b_11^2/(b_11 + b_22 + r),
b_12^a = − b_11 b_22/(b_11 + b_22 + r),
b_22^a = b_22 − b_22^2/(b_11 + b_22 + r).

This simple example illustrates the two points from the general case above. Firstly, even if the prior variables are uncorrelated, they are correlated in the posterior, because the non-local observation operator mixes the uncorrelated variables of the prior. A second point is that the update of each variable depends on the value of the other, even when the two variables are uncorrelated in the prior. Hence the non-local observation acts as an information bridge between uncorrelated variables, both in terms of mean and covariance.
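These update equations can be checked numerically; the values of b_11, b_22, r, the prior mean, and the observation below are illustrative assumptions:

```python
import numpy as np

b11, b22, r = 1.0, 2.0, 0.5
B = np.diag([b11, b22])            # diagonal prior: x1 and x2 uncorrelated
H = np.array([[1.0, 1.0]])         # non-local observation y = x1 + x2 + eps
x = np.array([0.2, -0.1])          # prior mean
y = np.array([1.0])

S = H @ B @ H.T + r                # innovation variance: b11 + b22 + r
K = B @ H.T / S                    # Kalman gain
xa = x + (K @ (y - H @ x)).ravel() # posterior mean
Ba = B - K @ H @ B                 # posterior covariance

# The off-diagonal element equals -b11*b22/(b11+b22+r): a non-zero
# posterior covariance where the prior covariance was zero
print(Ba)
```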
This conclusion remains valid for variational methods like 3DVar as 3DVar implicitly applies the Kalman Filter equations in an iterative manner.

Order of Observations
The results from the previous section might suggest that the order in which local and non-local observations are assimilated is important: if a non-local observation is assimilated first, the next local observation can influence all variables involved in the non-local observation operator. On the other hand, when the local observation is assimilated first this advantage seems to be lost. However, this is not the case: the order in which we assimilate observations is irrelevant in the full Bayesian setting (this is different when localization is used, as explained in section 4).
The easiest way to see this is via Bayes Theorem. Suppose we have two observations, a non-local observation y_nl and a local observation y_l of only x_2. We assume their measurement errors are independent. Bayes Theorem tells us:

p(x | y_nl, y_l) = p(y_nl | x) p(y_l | x) p(x) / p(y_nl, y_l).

If we assimilate y_l first we get:

p(x | y_nl, y_l) ∝ p(y_nl | x) [ p(y_l | x) p(x) ],

and vice versa:

p(x | y_nl, y_l) ∝ p(y_l | x) [ p(y_nl | x) p(x) ],

but the result is the same, as the order in a multiplication does not matter (this is true in theory; in practice differences may arise due to round-off errors). Since the Kalman Filter/Smoother is a special case of this when all pdfs are Gaussian, the same holds for the Kalman Filter/Smoother, and for variational methods. As we will see in a later section, care has to be taken when observations are assimilated sequentially and localization is enforced.
To complete the intuition for the Kalman Filter: if the local observation y_l is assimilated first, the mean and covariance of variable x_2 are updated, but x_1 remains unchanged. Hence, when the non-local observation is assimilated, both the mean and variance of x_2 have changed, and these changed values are used when assimilating the non-local observation, so that x_1 does feel the influence of the local observation via the updated x_2 and its updated variance. Typically the updated x_2 will be such that x_1 + x_2 is closer to y_nl, and the variance of x_2 before assimilating y_nl will be smaller. The result is that x_1 will be updated more strongly than in the case when x_2 has not seen y_l first, as proven in the next section.
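The order independence of the full (unlocalized) Kalman update can be verified directly; the numbers below are illustrative assumptions:

```python
import numpy as np

def kalman_update(x, B, H, y, R):
    """Standard Kalman update for one observation batch."""
    S = H @ B @ H.T + R
    K = B @ H.T @ np.linalg.inv(S)
    return x + K @ (y - H @ x), B - K @ H @ B

x0 = np.array([0.0, 0.0])
B0 = np.diag([1.0, 2.0])
H_nl = np.array([[1.0, 1.0]])      # non-local: y_nl = x1 + x2 + eps
H_l  = np.array([[0.0, 1.0]])      # local: y_l = x2 + eps
y_nl, R_nl = np.array([1.2]), np.array([[0.5]])
y_l,  R_l  = np.array([0.4]), np.array([[0.3]])

# Order 1: local observation first, then the non-local one
x_a, B_a = kalman_update(x0, B0, H_l, y_l, R_l)
x_a, B_a = kalman_update(x_a, B_a, H_nl, y_nl, R_nl)

# Order 2: non-local observation first, then the local one
x_b, B_b = kalman_update(x0, B0, H_nl, y_nl, R_nl)
x_b, B_b = kalman_update(x_b, B_b, H_l, y_l, R_l)

# Without an enforced covariance structure the order is irrelevant
print(x_a, x_b)
```

In order 2 the local observation reaches x_1 through the cross covariance that the non-local observation created; in order 1 it reaches x_1 through the changed mean and variance of x_2. Both routes give the identical analysis.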

KALMAN FILTER/SMOOTHER WITH TEMPORALLY NON-LOCAL OBSERVATIONS
The discussion above concerned spatially non-local observations. However, we can easily extend it to temporally non-local observations, as we show here with a simple example that illustrates the point. The results below generalize directly to the vector case.
We study a one-dimensional system with states x_n at time n and an observation y = x_m + x_n + ε, with error variance r. This problem can be solved by considering a Kalman Smoother, exploiting the cross covariance of the states at times n and m, and is explored in standard textbooks. The interesting case is when this prior cross covariance between times n and m is zero, or negligible (for many systems this would mean that m ≫ n).
Similarly to the spatially non-local case we define the state vector x = (x_m, x_n)^T. The prior covariance of this vector depends on the model that governs the evolution of the state in absence of observations. As mentioned, we will study the case that the cross covariance between these two times is zero in the prior, so the prior covariance for this state vector x is given by

B = ( b_mm 0 ; 0 b_nn ).

The Kalman Filter update equation reads for this case:

x_m^a = x_m + b_mm/(b_mm + b_nn + r) (y − x_m − x_n),
x_n^a = x_n + b_nn/(b_mm + b_nn + r) (y − x_m − x_n),

with posterior covariance matrix:

b_mm^a = b_mm − b_mm^2/(b_mm + b_nn + r),
b_mn^a = − b_mm b_nn/(b_mm + b_nn + r),
b_nn^a = b_nn − b_nn^2/(b_mm + b_nn + r).

The similarity with the spatially non-local observations is striking, and indeed the cases are completely identical, with time taking the place of space. The same conclusions as for the spatial case hold: even when states at different times are completely uncorrelated under the prior, they can be correlated under the posterior when the observation is related to a function of the state vectors at both times, providing an information bridge between the two times.

CONSEQUENCES FOR SEQUENTIAL UPDATING SCHEMES WITH FIXED COVARIANCE LENGTH SCALES
The results above show that when non-local observations are assimilated, they significantly change the prior covariance length and/or time scales during the data-assimilation process. This has direct consequences for methods that assimilate observations sequentially while at the same time enforcing covariance structures with certain length scales. An example is a Local Ensemble Kalman Filter, in which spurious correlations are suppressed either by Schur-multiplying the prior ensemble covariance with a local correlation matrix or by Schur-multiplying the inverse of the observation error covariance matrix with a distance function. This procedure effectively sets covariances to zero beyond a certain distance between grid points. This localization in combination with sequential observation updating has to be done with care, as shown below. Another example is 3DVar or 4DVar in which observations are assimilated in batches.
Assimilation of non-local observations that span length scales larger than the localization correlation length scale can potentially lead to suboptimal updates if the observations are assimilated in the wrong order. This is illustrated below, first theoretically and then with a simple example.

The Influence of Localization
Let us assume we have two variables x_1 and x_2 that lie outside each other's localization area. A localization area is defined here as the area in which observations are allowed to influence the grid point under consideration. Two observations are made, a non-local observation y_1 = x_1 + x_2 + ε_1 and a local observation y_2 = x_2 + ε_2. We study the result of the data-assimilation process on variable x_1 when we change the order of assimilation.
When we assimilate the non-local observation y_1 first, we find the update:

x̂_1 = x_1 + b_11/(b_11 + b_22 + r_1) (y_1 − x_1 − x_2),

with updated variance b̂_11 = b_11 − b_11^2/(b_11 + b_22 + r_1). As shown in the previous sections, this assimilation generates a cross covariance between x_1 and x_2 as

b̂_12 = − b_11 b_22/(b_11 + b_22 + r_1).

When we now assimilate observation y_2, the fixed covariance length scale, from localization or otherwise, will remove the cross covariance b̂_12 before y_2 is assimilated, and hence x_1 is not updated further, so x_1^a = x̂_1, b_11^a = b̂_11, and b_12^a = 0.

The story is different when we first assimilate y_2 and then y_1. In this case x_1 is not updated by y_2, but x_2 and its variance are. Denote these updated variables by x̂_2 and b̂_22, so

x̂_2 = x_2 + b_22/(b_22 + r_2) (y_2 − x_2), b̂_22 = b_22 r_2/(b_22 + r_2).

We then find for the update of x_1 by the non-local observation y_1:

x_1^a = x_1 + b_11/(b_11 + b̂_22 + r_1) (y_1 − x_1 − x̂_2),
b_11^a = b_11 − b_11^2/(b_11 + b̂_22 + r_1).

We can now substitute the hatted values in these expressions. We start with the posterior variance b_11^a. We find, after some algebra:

b_11^a = b_11 − b_11^2/(b_11 + b_22 + r_1) − d b_11^2/(b_11 + b_22 + r_1),

with

d = b_22^2 / [ (b_22 + r_2)(b_11 + b̂_22 + r_1) ].

The first and second terms in the expression above appear when we would assimilate y_1 first, and the third term, proportional to d, is an extra reduction of the variance of x_1 due to the fact that we first assimilated y_2. That reduction is absent when we first assimilate y_1 and then assimilate y_2, due to the localization procedure as shown above.
This third term can be as large as the second term. We can quantify this with the following example, in which we assume the prior variances of x_1 and x_2 are the same, hence b_22 = b_11 = b. In that case d becomes d̃, defined by:

d̃ = b^2 / (b^2 + b r_1 + 2 b r_2 + r_1 r_2) = 1 / [ 1 + r_1/b + 2 r_2/b + (r_1/b)(r_2/b) ].

This is the extra reduction due to assimilating y_1 after y_2, relative to the reduction due to assimilating y_1 alone. In Figure 4 the size of this term is shown as a function of r_1/b and r_2/b. As expected, the size increases when the observation errors are smaller than the prior variances.
Let us now look at the posterior mean, for which we find:

x_1^a = x_1 + b_11/(b_11 + b_22 + r_1) (y_1 − x_1 − x_2) + d b_11/(b_11 + b_22 + r_1) (y_1 − x_1 − x_2) − b_11 b_22 / [ (b_22 + r_2)(b_11 + b̂_22 + r_1) ] (y_2 − x_2).

The first term is the contribution purely from the non-local observation. It appears when we first assimilate the non-local observation and then the local observation, and is equal to Equation (12). However, first assimilating y_2 and then the non-local observation y_1 leads to the appearance of two extra terms. The term related to y_2 − x_2 is a direct contribution of the innovation of the local observation at x_2 (as can be seen from x̂_2 = x_2 + b_22/(b_22 + r_2)(y_2 − x_2)), and can be traced back to the fact that x_2 has changed due to the assimilation of y_2 first. The other term is related to the change in the variance of x_2 due to the assimilation of y_2. Both extra terms are lost when the non-local observation is assimilated first; this is purely due to the fixed covariance length scales used in the prior. To understand the importance of these two extra terms we again assume b_11 = b_22 = b, to find:

x_1^a = x_1 + b/(2b + r_1) (1 + d̃)(y_1 − x_1 − x_2) − d̃ (y_2 − x_2).

Of course, the value of x_1^a depends on the actual values of the observations and the prior means. To obtain an order-of-magnitude estimate we assume that the innovation y_1 − (x_1 + x_2) is of order √(2b + r_1), and similarly that y_2 − x_2 is of order √(b + r_2). Since the signs of the different contributions depend on the actual signs of y_1 − (x_1 + x_2) and y_2 − x_2, we proceed as follows. The ratio of the first extra term, d̃ b/(2b + r_1) (y_1 − (x_1 + x_2)), to the contribution from y_1 only is d̃, and is given in Figure 4. The ratio of the second extra term, d̃ (y_2 − x_2), to the contribution from y_1 only, using the innovation magnitudes above, is given in Figure 5. The sign of this contribution is unclear, as mentioned above, so we have to either add this figure to, or subtract it from, Figure 4.
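The asymmetry between the two assimilation orders can be demonstrated with a small sketch. The numbers are illustrative assumptions, and localization is mimicked by zeroing the cross covariance after every update, as in the derivation above:

```python
import numpy as np

b11, b22 = 1.0, 1.0
r1, r2 = 0.1, 0.1          # error variances of y1 (non-local) and y2 (local)
x0 = np.array([0.0, 0.0])
y1, y2 = 2.1, 0.9          # y1 observes x1 + x2, y2 observes x2

def assimilate(order):
    x = x0.copy()
    B = np.diag([b11, b22])
    for obs in order:
        if obs == "nl":
            H, y, r = np.array([[1.0, 1.0]]), y1, r1
        else:
            H, y, r = np.array([[0.0, 1.0]]), y2, r2
        S = float(H @ B @ H.T) + r
        K = (B @ H.T / S).ravel()
        x = x + K * float(y - H @ x)
        B = B - np.outer(K, (H @ B).ravel())
        # localization: x1 and x2 lie outside each other's radius,
        # so any cross covariance is removed after each update
        B[0, 1] = B[1, 0] = 0.0
    return x, B

x_nl, B_nl = assimilate(["nl", "l"])   # non-local observation first
x_l,  B_l  = assimilate(["l", "nl"])   # local observation first

print(x_nl[0], x_l[0])                 # analyses of x1 differ
print(B_nl[0, 0], B_l[0, 0])           # posterior variance of x1 differs
```

Assimilating the local observation first yields the extra update and the extra variance reduction of x_1 derived above; assimilating the non-local observation first loses both once the cross covariance is removed.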
The importance of Figures 4, 5 is that they show that when the observation errors are small compared to the prior variance, the update could be more than 100% too small when localization is used and one first assimilates the non-local observation y_1, followed by y_2. Hence non-local observations should be assimilated after local observations. When the update is not sequential, but local and non-local observations are instead assimilated in one go, we obtain the same result as by first assimilating the local observation and then the non-local observation. The reason is simple: assimilating the non-local observation means that all grid points in the domain of the non-local observation are allowed to see all other grid points in that domain, and hence information from local observations is shared too.
It is emphasized again that the above conclusions are not restricted to Ensemble Kalman Filters and Smoothers. Any scheme that assimilates observations sequentially, or in batches, should ensure that non-local observations for which the support of the observation operator is larger than the correlation length scales used in the covariance models are assimilated after the local observations.

An Assimilation Example
To illustrate the effect explained in the previous section the following numerical experiment is conducted. We run a 40-dimensional model, the Lorenz 1996 model, with evolution equation

dx_j/dt = (x_{j+1} − x_{j−2}) x_{j−1} − x_j + F,

with cyclic boundary conditions in j. The forcing F = 8 is a standard value ensuring chaotic dynamics.
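A minimal sketch of such a model run is given below. The time step and the fourth-order Runge-Kutta integration are assumptions, as the text does not specify the integration scheme:

```python
import numpy as np

def lorenz96_tendency(x, F=8.0):
    """dx_j/dt = (x_{j+1} - x_{j-2}) * x_{j-1} - x_j + F, cyclic in j."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt=0.05, F=8.0):
    """One fourth-order Runge-Kutta step."""
    k1 = lorenz96_tendency(x, F)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, F)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, F)
    k4 = lorenz96_tendency(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Spin up a 40-variable state from a small perturbation of the fixed point x_j = F
x = 8.0 * np.ones(40)
x[0] += 0.01
for _ in range(1000):
    x = rk4_step(x)
print(x[:3])   # chaotic state after spin-up
```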
An LETKF is used with a Gaspari-Cohn localization function on R^{-1} with a cut-off radius of 5 grid points, which means that observation error variances are multiplied by a factor greater than 10 beyond 3 grid points, so such observations have little influence compared to observations close to the updated grid point. This localization is kept constant to illustrate the effects; it might be tuned in real situations. The ensemble consists of 10 members, initialized from the true state at time zero with random perturbations drawn from N(0, I). When assimilating the non-local observation the localization is only applied outside the domain of the non-local observation.
We run two sets of experiments, one set in which all the local observations are assimilated first at an analysis time, followed by assimilating the non-local observation, and one in which the non-local observation is assimilated first, followed by all other observations. We looked at the difference between these two assimilation runs for different values of the observation period t_obs and different values of the observation error variances in R. All observation error values for local and non-local observations are the same. Figure 6 shows the posterior RMSE, averaged over all assimilation times, of the state component x_0 as a function of the chosen observation error standard deviation. The different lines correspond to the different observation periods of 10, 20, and 50 time steps, in black, blue, and red, respectively. The solid lines denote the results when the non-local observation is assimilated last, and the dashed lines show results when the non-local observation is assimilated first. All results are averaged over 10 model runs, and the resulting uncertainty is of the order of 0.1. Increasing the average to 100 model runs did not change the numbers beyond these error bounds.
As can be seen from the figure, assimilating the non-local observation last leads to a systematically lower RMSE for all observation periods and for all observation error sizes. These results confirm the theory in the previous section.
We also performed several experiments in which the observation error in the non-local observation was higher or lower than that in the local observations, for the experiment with t_obs = 20 time steps. As an example of the results, increasing the non-local observation error from 0.1 to 0.3 increased the RMSE in x_0 from 0.23 to 0.24 when the non-local observation is assimilated last, but from 0.70 to 0.81 when it is assimilated first. This shows that the impact of a larger non-local observation error is smaller when the non-local observation is assimilated last, because of the benefit of the more accurate state x_5. When the non-local observation is assimilated first this update of x_5 is not noticed by the data-assimilation system. In another example we decreased the observation error of the non-local observation from 0.3 to 0.1. In this case the RMSE of x_0 remained at 0.47 when the non-local observation is assimilated last, and decreased from 1.60 to 1.50 when it is assimilated first. As expected, the influence is much smaller in the former case as the state at x_5 is now less accurate. Hence, these experiments also demonstrate that the theory developed above is useful.
Finally, the results are independent of the dimension of the system; a 1,000-dimensional Lorenz 1996 model yields results that are very similar and with differences smaller than the uncertainty estimate of 0.1. This is because the analysis is local, and the non-local observation spans just part of the state space.

CORRELATED OBSERVATION ERRORS
Although observation errors are typically assumed to be uncorrelated in data-assimilation systems, they are in fact often correlated, and the correlation length scales can even be longer than the correlation length scales in the prior. Correlated observation errors can arise either from the measurement instrument, e.g., via correlated electrical noise in satellite observations, or from the mismatch between what the observations and the model represent. The latter are called representation errors and typically arise when the observations contain smaller length scales than the model can resolve. See, e.g., full explanations of representation errors in Hodyss and Nichols [10] and Van Leeuwen [11], and a recent review by Janjić et al. [12].
Representation errors typically do not lead to non-local correlation structures in the model domain, as the origin of these errors is subgrid scale. The discussion here focusses on correlations between the errors of observations that are farther apart than the localization radius, or than the imposed correlation length scales in variational methods. As we will see, there is a strong connection to non-local observations.

A Simple Example
Let us look at a simple example of two grid points that are farther apart than the imposed localization radius, or than physical correlation length scales. Both are observed, and the observation errors of these two observations are correlated. The observation operator H = I, the identity matrix. The covariance matrices read:

B = ( b_11 0 ; 0 b_22 ), R = ( r_11 r_12 ; r_21 r_22 ).

The inverse in the Kalman gain is a full matrix:

(B + R)^{-1} = (1/D) ( b_22 + r_22 −r_12 ; −r_21 b_11 + r_11 ),

in which D is the absolute value of the determinant, given by D = (b_11 + r_11)(b_22 + r_22) − r_12 r_21. The factor B H^T is diagonal, as H is the identity matrix. Because of the non-zero off-diagonal elements in the resulting Kalman gain, the state component x_1 is updated as:

x_1^a = x_1 + (b_11/D) [ (b_22 + r_22)(y_1 − x_1) − r_12 (y_2 − x_2) ].

To make sense of this equation we rewrite it, after some algebra, as:

x_1^a = x_1 + b_11/(b_11 + r_11) [ (y_1 − x_1) + ρ_1 ρ_2/(1 − ρ_1 ρ_2) (y_1 − x_1) − ρ_2/(1 − ρ_1 ρ_2) (y_2 − x_2) ],

in which ρ_1 = r_21/(b_11 + r_11) and ρ_2 = r_12/(b_22 + r_22). This equation shows us several interesting phenomena. The first term in the brackets denotes the update of x_1 when we would only use observation y_1. The other terms have to do with using observation y_2 while its errors are correlated with those of y_1. Interestingly, the update by y_1 gets enhanced by a term with factor ρ_1 ρ_2/(1 − ρ_1 ρ_2), which is positive. This update is also changed by y_2, with a sign depending on r_12 and y_2 − x_2. To understand where these terms come from we can rewrite the expression above further as:

x_1^a = x_1 + b_11/[ (b_11 + r_11)(1 − ρ_1 ρ_2) ] [ (y_1 − x_1) − ρ_2 (y_2 − x_2) ].

We now see that the influence of the correlated observation errors is to change the denominator of the factor that multiplies the innovation y_1 − x_1. That denominator becomes smaller because of the cross correlations, so the innovation will lead to a larger update.
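The rewritten update for x_1 can be checked against the full Kalman gain; the numbers below are illustrative assumptions:

```python
import numpy as np

b11, b22 = 1.0, 1.5
r11, r22, r12 = 0.2, 0.3, 0.15    # correlated observation errors, r21 = r12
B = np.diag([b11, b22])
R = np.array([[r11, r12], [r12, r22]])
x = np.array([0.1, -0.2])
y = np.array([0.8, 0.5])

# Direct Kalman update with H = I
K = B @ np.linalg.inv(B + R)
xa = x + K @ (y - x)

# Rewritten update for x1 with D = (b11+r11)(b22+r22) - r12*r21
D = (b11 + r11) * (b22 + r22) - r12**2
xa1 = x[0] + (b11 / D) * ((b22 + r22) * (y[0] - x[0]) - r12 * (y[1] - x[1]))

print(xa[0], xa1)   # identical
```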
The influence of the second innovation is not that straightforward. Let us take the situation that the errors in the two observations are positively correlated, so ρ_2 > 0. Now assume that y_1 − x_1 is positive. Then, as expected, the K_11 element of the gain is positive, so the update to x_1 is positive. Indeed, we want to move the state closer to the observation y_1. On the other hand, we know that x_1 is an unbiased estimate of the truth, so this does suggest that the realization of the observation error of y_1 is positive. Let us also assume that y_2 − x_2 is positive. As x_2 is also assumed unbiased, this suggests that the observation error in y_2 is positive too. The filter knows that these two errors are correlated via the specification of R. Because both innovations indicate that the actual observation error is positive, it will incorporate the contribution from y_2 − x_2 with a negative sign to avoid an overly large positive update of x_1. If, on the other hand, y_2 − x_2 had been negative, the filter has no indication whether the actual observation error is positive or negative, so y_2 − x_2 would be allowed to contribute positively to x_1. However, note that this innovation will act negatively on the update of x_2. The Kalman Filter is a clever device, designed to ensure an unbiased posterior for both x_1 and x_2.
A similar story holds for x 2 as the problem is symmetric. This shows that neglecting long-range correlations in observation errors can lead to suboptimal results with analysis errors that are too large. Similar results have been found for locally correlated observations, e.g., [13,14], and recent discussions on the interplay between HBH T and R in Miyoshi et al. [15] and Evensen and Eikrem [16].

The Connection to Non-local Observations
An interesting connection can be made with recent ideas to transform observations such that their errors are uncorrelated. Interest in such a transform stems from the fact that many data-assimilation algorithm implementations either assume uncorrelated observation errors or run much more efficiently when these errors are uncorrelated. Let us assume such a transformation is performed on our two observations. There are infinitely many transformations that do this; let us assume here that the eigenvectors of R are used.
Decomposing R gives R = U Λ U^T, in which the columns of U contain the eigenvectors and Λ is a diagonal matrix with the eigenvalues λ_1, λ_2 on the diagonal. If we transform the observation vector as ŷ = U^T y, this new vector has covariance matrix Λ, and hence the errors in the components of ŷ are uncorrelated. The interesting observation is now that the components of ŷ are non-local observations with uncorrelated errors. Hence, our analysis of the influence of non-local observations applies directly to this case. As a simple example, assume r_11 = r_22 = r and r_12 = r_21 = ρr (this could be worked out for the general case, but the expressions become complicated and serve no specific purpose for this paper). In that case λ_{1,2} = r(1 ± ρ) and the eigenvectors are (1, 1)^T/√2 and (1, −1)^T/√2. This leads to transformed observations ŷ_1, measuring (x_1 + x_2)/√2, and ŷ_2, measuring (x_1 − x_2)/√2. Using the Kalman Filter update equation for variable x_1 we find:

x_1^a = x_1 + b_11/(√2 D) [ (b_22 + λ_2)(ŷ_1 − (x_1 + x_2)/√2) + (b_22 + λ_1)(ŷ_2 − (x_1 − x_2)/√2) ],

in which D = (1/4) [ (b_11 + b_22 + 2λ_1)(b_11 + b_22 + 2λ_2) − (b_11 − b_22)^2 ], which turns out to be the same D as found for the correlated observation errors. This is not surprising as, with y, also H and R have been transformed, and hence H B H^T + R has transformed in the same way. This means we can always extract a similar factor from the denominator of the Kalman update. Hence, we have transformed the problem from local observations with non-locally correlated errors to non-local observations with uncorrelated errors.
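The equivalence of the two formulations can be verified numerically. The numbers are illustrative assumptions, and numpy's `eigh` supplies the decomposition R = U Λ U^T:

```python
import numpy as np

b11, b22, r, rho = 1.0, 1.5, 0.3, 0.6
B = np.diag([b11, b22])
R = r * np.array([[1.0, rho], [rho, 1.0]])
H = np.eye(2)
x = np.array([0.1, -0.2])
y = np.array([0.8, 0.5])

# Original problem: local observations with correlated errors
K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
xa = x + K @ (y - H @ x)

# Transformed problem: R = U Lam U^T, so yhat = U^T y are non-local
# observations of (x1 +/- x2)/sqrt(2) with uncorrelated errors Lam
lam, U = np.linalg.eigh(R)
Lam = np.diag(lam)
yhat, Hhat = U.T @ y, U.T @ H
Kh = B @ Hhat.T @ np.linalg.inv(Hhat @ B @ Hhat.T + Lam)
xah = x + Kh @ (yhat - Hhat @ x)

print(xa, xah)   # identical analyses
```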
To show that this is the same as the original problem, we now rewrite this analysis in terms of the observations with correlated errors, that is, in terms of $y$, and collect the terms $y_1 - x_1$ and $y_2 - x_2$. With $r_{22} = r$ and $r_{12} = \rho r$ we then recover the analysis equation for the correlated observation error case.
Hence we have shown that assimilating correlated observations with correlation length scales larger than the physical length scales is equivalent to assimilating the corresponding non-local observations with uncorrelated errors. (This result resembles, but differs from, that of Nadeem and Potthast [17], who discuss transforming all observations, local and non-local, such that they all become local observations. They then perform the localization and data assimilation in this space, and transform back to physical space. This, of course, leads to correlated errors in the transformed non-local observations.) It will not come as a surprise that if observations are assimilated sequentially and localization, or a fixed covariance length scale, is used, the order of assimilation matters: non-locally correlated observations should be assimilated after local observations. Finally, we note that the situation is similar for non-local observation-error correlations in time.
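The equivalence can be checked numerically. The sketch below (a minimal hand-rolled Kalman update with hypothetical prior and observation values, not the paper's code) assimilates the two local observations with correlated errors, then assimilates the transformed non-local observations with uncorrelated errors, and confirms that the posterior mean and covariance coincide:

```python
import math

def mat(a, b, c, d): return [[a, b], [c, d]]
def mm(A, B): return [[sum(A[i][k] * B[k][j] for k in range(2))
                       for j in range(2)] for i in range(2)]
def T(A): return [[A[j][i] for j in range(2)] for i in range(2)]
def mv(A, v): return [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]
def inv(A):
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]

def kalman(xb, B, H, R, y):
    # Standard Kalman update: K = B H^T (H B H^T + R)^{-1}.
    S = mm(mm(H, B), T(H))
    S = [[S[i][j] + R[i][j] for j in range(2)] for i in range(2)]
    K = mm(mm(B, T(H)), inv(S))
    Hx = mv(H, xb)
    dx = mv(K, [y[i] - Hx[i] for i in range(2)])
    xa = [xb[i] + dx[i] for i in range(2)]
    KH = mm(K, H)
    ImKH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(2)]
            for i in range(2)]
    return xa, mm(ImKH, B)  # posterior mean and covariance

# Hypothetical prior: uncorrelated (distant) variables.
xb, B = [0.0, 0.0], mat(1.0, 0.0, 0.0, 2.0)
r, rho = 0.5, 0.8
y = [1.2, 0.7]

# (a) Local observations (H = I) with correlated errors.
xa1, Pa1 = kalman(xb, B, mat(1, 0, 0, 1), mat(r, rho * r, rho * r, r), y)

# (b) Transformed non-local observations with uncorrelated errors
#     lambda_{1,2} = r(1 +/- rho).
s = 1 / math.sqrt(2)
U = mat(s, s, s, -s)
xa2, Pa2 = kalman(xb, B, T(U),
                  mat(r * (1 + rho), 0, 0, r * (1 - rho)), mv(T(U), y))
```

Because $U$ is invertible, the two updates are algebraically identical, and the numerical results agree to round-off.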

EXPLOITING NON-LOCAL OBSERVATIONS
As mentioned above, to avoid cutting off non-local observation information during sequential assimilation, it is best to assimilate the non-local observation after the local observations. An alternative strategy, however, is to change the localization area, if that is possible. Before the non-local observation is assimilated the localization area should remain as it is. As soon as the non-local observation is assimilated, significant non-zero correlations have been created along the support of the non-local observation operator. This means that this area can then be included in the localization domain.
To illustrate this idea, the experiments from section 4 were repeated applying the procedure outlined above for $t_{obs} = 10$, using the real observation error of the observation at $x_5$ in location $x_{10}$. This means that effectively there is no localization between these two grid points. This is an extreme case, but it does illustrate the point we want to make. The results are depicted in Figure 7.
They show that changing the localization area after assimilating the non-local observation does help. In theory, for two variables, this should give the same result as assimilating the non-local observation after the local observation. That this is not the case here is because grid point 5 is also updated slightly by grid point 10, which reduces the variance at $x_1$ further when the non-local observation is assimilated last; that information is not available when the non-local observation is assimilated first. The conclusion is that changing the localization area after assimilating a non-local observation is beneficial for the data-assimilation result.
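The effect of assimilation order under a fixed localization can be sketched with a minimal two-variable example (Python; hypothetical variances and observation values, with localization crudely modeled as zeroing the cross-covariance after each update, which is the extreme case of a localization area excluding the other grid point):

```python
def update(x, P, h, r, y):
    # Kalman update for a single observation y = h . x + noise(var r).
    hP = [h[0] * P[0][i] + h[1] * P[1][i] for i in range(2)]
    s = hP[0] * h[0] + hP[1] * h[1] + r
    K = [hP[0] / s, hP[1] / s]
    innov = y - (h[0] * x[0] + h[1] * x[1])
    xa = [x[i] + K[i] * innov for i in range(2)]
    Pa = [[P[i][j] - K[i] * hP[j] for j in range(2)] for i in range(2)]
    return xa, Pa

def localize(P):
    # Crude localization: x1 and x2 are taken to be far apart, so their
    # cross-covariance is forced back to zero after each update.
    return [[P[0][0], 0.0], [0.0, P[1][1]]]

# Hypothetical setup: unit prior variances, an accurate local observation
# of x2 (variance 0.1), and a non-local observation of x1 + x2 (variance 0.5).
x0, P0 = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
local_obs = ([0.0, 1.0], 0.1, 0.2)
nonlocal_obs = ([1.0, 1.0], 0.5, 0.3)

def run(order):
    x, P = x0, P0
    for h, r, y in order:
        x, P = update(x, P, h, r, y)
        P = localize(P)
    return P

P_local_first = run([local_obs, nonlocal_obs])
P_nonlocal_first = run([nonlocal_obs, local_obs])
# The posterior variance of x1 is smaller when the local observation
# is assimilated first, as argued in the text.
```

When the non-local observation comes first, the localization step destroys the cross-covariance it created, so the accurate local observation of $x_2$ can no longer reach $x_1$.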
Another way to explore the features of non-local observations in data assimilation is as follows. Assume we have an area of interest that is poorly observed and not easily observed locally. We assume that we do not have, and cannot obtain, accurate local observations in that area, but that a non-local observation is possible. Section 4, and specifically Equation (18), shows that it makes sense to ensure that the support of this non-local observation contains a well-observed area. In this way the area of interest will benefit from the accurate information from the well-observed area via the information bridge. Another way of phrasing what happens is that, by having an accurately observed area in the support of the non-local observation, its information is redistributed more toward the area of interest.

Another possibility along the same lines arises when we already have a non-local observation containing the area of interest in its support. The accuracy in the area of interest can then be enhanced by performing extra local observations in an easy-to-observe area that is in the support of the non-local observation. Hence again we exploit the information bridge, this time by adding a local observation in a well-chosen position. This idea provides a new way to perform targeted observations that has not been explored as yet, as far as this author knows. It could also be exploited in time, or even in space and time.
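A minimal sketch of this information bridge (Python, with hypothetical numbers; two distant variables with zero prior cross-covariance): the non-local observation first creates a cross-covariance between the two variables, after which a targeted local observation of the easy-to-observe variable also reduces the variance of the poorly observed one:

```python
def update(x, P, h, r, y):
    # Kalman update for a single observation y = h . x + noise(var r).
    hP = [h[0] * P[0][i] + h[1] * P[1][i] for i in range(2)]
    s = hP[0] * h[0] + hP[1] * h[1] + r
    K = [hP[0] / s, hP[1] / s]
    innov = y - (h[0] * x[0] + h[1] * x[1])
    xa = [x[i] + K[i] * innov for i in range(2)]
    Pa = [[P[i][j] - K[i] * hP[j] for j in range(2)] for i in range(2)]
    return xa, Pa

# Hypothetical setup: x1 is the poorly observed area of interest, x2 an
# easy-to-observe area; the prior cross-covariance is zero (distant points).
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]

# The non-local observation of x1 + x2 (variance 0.5) creates a posterior
# cross-covariance between x1 and x2: the information bridge.
x, P = update(x, P, [1.0, 1.0], 0.5, 0.3)
var_x1_bridge_only = P[0][0]
cross_cov = P[0][1]

# A targeted local observation of x2 (variance 0.1) now also reduces
# the variance at x1, via the bridge built by the non-local observation.
x, P = update(x, P, [0.0, 1.0], 0.1, 0.2)
var_x1_after_local = P[0][0]
```

Without the non-local observation the local observation of $x_2$ would leave $x_1$ untouched, since the prior cross-covariance is zero.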
Finally, one might think that it is possible to enhance the accuracy of the update in an area of interest by artificially introducing correlated observation errors between observations in that area and observations in another, well-observed area. Since correlated observation errors can be transformed to non-local observations, a similar information bridge could be built that might be beneficial.
To study this in detail we use the example from section 5.1. Assume that we have two observations with uncorrelated errors, and we add a fully correlated random perturbation with zero mean to them, so $y_1 = Hx_1 + \epsilon_1 + \epsilon$ and $y_2 = Hx_2 + \epsilon_2 + \epsilon$, in which $\epsilon_1$ and $\epsilon_2$ are uncorrelated and $\epsilon \sim N(0, r)$, the same value for each observation. Hence this $\epsilon$ term contains the fully correlated part of the observation error that we added to the observations artificially. This leads to a correlated observation-error covariance given by
$$R = \begin{pmatrix} r_{11} + r & r \\ r & r_{22} + r \end{pmatrix}.$$
Using this in the Kalman gain we find for the gain of observation $y_1$:
$$K_1 = \frac{b_{11}}{b_{11} + r_{11}} \; \frac{(b_{11} + r_{11})(b_{22} + r_{22} + r)}{(b_{11} + r_{11} + r)(b_{22} + r_{22} + r) - r^2}.$$
The first ratio is the Kalman gain without the correlated observation error contribution. This is multiplied by the second factor when the correlated observation error is included. We immediately see that this factor is smaller than 1, so it reduces the Kalman gain: the denominator equals the numerator plus the positive term $r(b_{22} + r_{22})$. This shows that adding extra correlated observation errors to observations that are farther apart than the covariance structures in the prior will not lead to better updates: in fact, the updates will deteriorate. (As an aside, this would be different if we had access to $\epsilon_1$ and $\epsilon_2$, in which case $\epsilon$ could be a linear combination of these two, and the Kalman gain could be made larger than the gain for just assimilating $y_1$. Unfortunately, we are given $y_1$ and $y_2$, not their error realizations.)
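This gain reduction can be verified numerically. The sketch below (hypothetical values; it assumes, as in the example above, directly observed variables with zero prior cross-covariance) computes the gain of $y_1$ on $x_1$ with and without the artificially correlated error component of variance $r_c$:

```python
def gain11(b11, b22, r1, r2, rc):
    # Kalman gain of observation y1 on x1 for two uncorrelated state
    # variables observed directly, with an added fully correlated error
    # component of variance rc. K = B S^{-1} with S = B + R.
    S = [[b11 + r1 + rc, rc], [rc, b22 + r2 + rc]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return b11 * S[1][1] / det  # element (1, 1) of B S^{-1}

# Hypothetical prior variances and observation-error variances.
k_plain = gain11(1.0, 2.0, 0.5, 0.5, 0.0)  # no artificial correlation
k_corr = gain11(1.0, 2.0, 0.5, 0.5, 0.3)   # with correlated component
# The artificially correlated error always reduces the gain on x1.
```

For any positive $r_c$ the gain is strictly smaller, in line with the analytical factor above.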

CONCLUSIONS AND DISCUSSION
In this paper we studied the information transfer in data-assimilation systems when non-local observations are assimilated. Non-local observations are defined here as observations with an observation-operator support that is larger than the covariance length scales. This pertains to both spatial and temporal non-locality. It is found that these observations connect parts of the domain that were not connected in the prior, building an information bridge that is longer than the physics and statistics in the prior predict. Hence the notion that information from observations is spread around via the correlation length scales in B is only part of the story, as non-local observations can spread information over larger distances. Indeed, one should look at the full $BH^T$ factor of the gain and realize that the observation operator can change the covariance structures. This suggests that the emphasis on covariance modeling should shift away from the prior covariance and toward the modeling of the covariances between model and observation space.
We showed how non-local observation information is transferred to the posterior in Bayes' Theorem, and hence in fully non-linear as well as in linear and linearized data-assimilation schemes, such as (Ensemble) Kalman Filters and variational methods. We then elaborated on the interaction between localized covariances, as typically used in the geosciences, and the sequential assimilation of observations. It was shown that it is beneficial to assimilate non-local observations after local observations in order to maximize the information flow from observations in the data-assimilation system. This was quantified both analytically and numerically.
Furthermore, it was shown that observations with non-locally correlated observation errors can be transformed into non-local observations with uncorrelated observation errors, demonstrating the equivalence of the two.
In a further exploration of the information flow generated by non-local observations, we showed, both analytically and with a numerical example, that non-local observations do not have to be assimilated after local observations if the localization areas are extended along the support of the non-local observation operator after the non-local observations have been assimilated.
Furthermore, it was shown that targeted non-local observations can be used to bring the information from accurate observations to other parts of the system. It was also shown that adding correlated observation errors to distant observations to set up an information bridge is never beneficial: it always degrades the update.
These initial explorations of non-local observations might guide the development of real data-assimilation applications in which observations are assimilated sequentially and non-local observations and/or non-locally correlated observation errors are used. The main message might be that one should not optimize the covariance structures in the prior, but rather the covariance structures between observations and model variables.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

FUNDING
The author thanks the European Research Council (ERC) for funding the CUNDA project 694509 under the European Union Horizon 2020 research and innovation programme.

ACKNOWLEDGMENTS
I would like to thank the two reviewers for valuable comments that greatly improved the paper.