Surrogate modeling for the climate sciences dynamics with machine learning and data assimilation

The outstanding breakthroughs of deep learning in computer vision and natural language processing have been the horn of plenty for many recent developments in the climate sciences. These methodological advances currently find applications to subgrid-scale parameterization, data-driven model error correction, model discovery, surrogate modeling, and many other uses. In this perspective article, I will review recent advances in the field, specifically in the thriving subtopic defined by the intersection of dynamical systems in geosciences, data assimilation, and machine learning, with striking applications to physical model error correction. I will give my take on where we are in the field and why we are there and discuss the key perspectives. I will describe several technical obstacles to implementing these new techniques in a high-dimensional, possibly operational system. I will also discuss open questions about the combined use of data assimilation and machine learning and the short- vs. longer-term representation of the surrogate (i.e., neural network-based) dynamics, and finally about uncertainty quantification in this context.

1. Introduction

Observations of the Earth system do not offer full coverage of the system, either in space or in time. This challenge led to the development of data assimilation (DA) techniques that aim to optimally merge information from the observations and the numerical models to obtain a more accurate representation of the physical fields and parameters of the models. Both machine learning (ML), to which deep learning belongs, and DA are statistical estimation/inverse problem techniques which can offer predictions of the dynamical system under study and which share methodologies because they are both rooted in statistical and Bayesian principles. However, the acceleration of technical achievements, particularly the deep learning differentiable libraries, and the subsequent hype have made it very attractive to test ML techniques in geosciences. To address model deficiencies, which is critical for better forecasts and hindcasts, the emphasis is then put more on data-driven than on model-driven solutions.
In the following, I will first describe and justify this beneficial infusion of ideas into the geo- and climate sciences from a viewpoint at the intersection of geoscientific dynamical systems, DA, and ML. I will review some of the recent accomplishments of deep learning in the field, and in particular how the power of ML can be embedded in the Bayesian framework of DA, with a view to more accurate forecasts and surrogate modeling. I will give a realistic numerical example of a deep learning-based surrogate model for numerical weather prediction (NWP).
In the second part of this perspective paper, I will discuss aspects of these developments that may play a key role in their future success or represent significant challenges that must be addressed. I will first examine whether building a data-driven surrogate model requires focusing on very short-term dynamics (which amounts to equation discovery) or on longer-term dynamics (which amounts to building a surrogate resolvent of the dynamics). I will then return to the beneficial iterative combination of DA and ML to discover not only dynamics but also DA methods and their solvers. I will then describe the potential of online vs. offline ML+DA schemes. As a third perspective, I will briefly discuss uncertainty quantification, where DA and ML can benefit from each other, before giving my final thoughts.
2. State-of-the-art

The Earth's climate is composed of several compartments with heterogeneous space and time scales: the atmosphere and its trace gases and aerosols, the oceans and their biological and chemical species, the land surface and its biosphere, the cryosphere (Arctic, Antarctic, Greenland, glaciers, and permafrost), etc. Its state and evolution, of critical importance for everyday life as well as the whole biosphere's fate, are estimated using numerical models of the geofluids and of the species' fate, together with Earth observations from constellations of satellites, ground stations, aircraft, ships, buoys, radiosondes, and other remote sensors. Despite this impressive coverage, these observations remain noisy, can be non-local, and can be considered either sparse or infrequent, depending on the instrument's platform and the Earth's compartment. (Non-local observations are typically linear or non-linear functions of spatially extended variables; typical examples in the geosciences are the radiance measurements of space-borne instruments, which often probe full atmospheric columns.) Most geofluids are chaotic, which severely limits the Earth system's predictability. Predictions are computed through the optimal mathematical combination of these observations and the numerical forecasting models, i.e., DA, which has met success even beyond the boundaries of geoscience [5].
Note that the handling of these observations is a Big Data problem. At the same time, these numerical models are often computationally very demanding. They are orders of magnitude slower than the inference of large NNs. Thus, DA for operational numerical weather prediction is considered a high-performance computing challenge [6].
The precision of the forecasts is driven by the number of observations and their instrumental errors, as well as by the fidelity of the numerical models. It is impacted by the relevance and consistency of the employed DA technique [7, 8]. It eventually depends on the accuracy of the mapping established by the observation operator between the model variables and the observations, whose error is the so-called representation error [9].
Data assimilation in geosciences is the counterpart to training in ML and vice versa; they are both parts of estimation theory or inverse problems and combine a numerical model of either physical or statistical origin and possibly huge datasets (the larger, the better) [10]. There are known correspondences between them. For instance, the adjoint models [11] required by the gradients in variational DA draw from the branch of applied mathematics known as optimal control [12] and have been named backpropagation in deep learning [13]. They both ultimately rely on fine numerical optimization techniques. Yet, they diverge in their use of the model. In ML, the models are essentially statistical and fast to infer, while in DA, they are essentially of physical origin and are usually significantly slower.
In contrast to computer vision and natural language processing, which are well-delineated computer science problems, forecasting geophysical systems cannot be so neat a mathematical problem. Indeed, geophysical systems have many sources of significant error mentioned above (model, data, and methods), they are often intrinsically multiscale and heterogeneous, and they are rarely isolated. Then, why should we bother with the recent progress of ML in computer vision and natural language processing if it could be specific to those domains and difficult to transfer to the climate sciences? The deep learning achievements are so spectacular that they cannot be ignored, nor can the reasons behind their success. Deep learning revived the NN techniques of the 1990s by proving their relevance, thanks to the availability of very large training databases and by demonstrating that deeper (i.e., with multiple internal layers) and larger NNs can counter the curse of dimensionality of estimation theory, often met in ML and DA [14]. On a more technical level, this progress was critically accompanied by substantial software developments supported by Google, Facebook, Apache, Nvidia, etc. These companies provide open, easily accessible, and fast-evolving deep learning libraries to efficiently implement the new methods on CPUs, GPUs, TPUs, and mobile/embedded devices. But the key ingredient is the ability of these software packages to generate the adjoint of any of the deep learning models, yielding differentiable models. This capability allows the generation of the gradient of any loss function relying on the NN model. Whether based on graph generation of the code or on eager execution (less numerically efficient but handling conditional branching in the code), they offer partial to almost complete remarkable solutions to the fundamental computational sciences problem of automatic differentiation. Examples of these tools are TensorFlow/Keras [15], PyTorch/Lightning [16], Jax [17], and numpy/autograd [18], which are mainly used on top of Python, or entirely new programming languages such as Julia/Flux [19]. In the meantime, over the past 20 years, progress has been slower in DA applied to the geosciences. Automatic differentiation was recognized as a critical challenge by this research community, especially since the advent and the remarkable success of the 4D-Var method and its implementation at the ECMWF, Météo-France, the UK Met Office, Environment Canada, the Japan Meteorological Agency, and the Korea Meteorological Administration [20]. Researchers got help from computer scientists and experts in automatic differentiation but never got the means of the aforementioned high-tech companies. Besides, the presence of numerous physical thresholds (implemented via if statements in numerical geophysical codes) complicates the task. As a consequence, the generation of adjoints was, and can still be, limited to key components of forecasting models and implemented by hand. This limitation significantly hampered the efforts to optimize key physical or statistical parameters of the DA and forecasting systems. Alternatives to variational-based DA methods exist, such as the ensemble Kalman filter [21-25]. They are excellent for forecasting, but their ability is limited for non-linear, higher-dimensional problems, where non-linear gradient-based optimization is the golden tool. Exceptions are iterative ensemble-variational techniques that mimic gradient-based optimization and, when numerically affordable, particle filters.
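To illustrate why these differentiable libraries matter so much from a DA perspective, here is a minimal sketch (assuming JAX is installed; the two-layer network and the dimensions are made up for illustration) in which the tangent-linear and adjoint operators of a NN-based model step are obtained automatically, with no hand-coded adjoint:

```python
import jax
import jax.numpy as jnp

def nn_step(params, x):
    """A toy neural-network model step x_k -> x_{k+1} (two dense layers)."""
    W1, b1, W2, b2 = params
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

n, hidden = 8, 16
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = (0.1 * jax.random.normal(k1, (hidden, n)), jnp.zeros(hidden),
          0.1 * jax.random.normal(k2, (n, hidden)), jnp.zeros(n))
x = jnp.ones(n)

# The tangent-linear (Jacobian-vector) and adjoint (vector-Jacobian) operators
# are generated automatically, in contrast with hand-coded adjoints in 4D-Var.
step = lambda v: nn_step(params, v)
_, tlm = jax.linearize(step, x)   # tangent-linear model
_, adj = jax.vjp(step, x)         # adjoint model
dx = 1e-3 * jnp.ones(n)
print(tlm(dx))                    # TL model applied to a state perturbation
print(adj(dx)[0])                 # adjoint applied to a sensitivity vector
```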
Because handling model error is key to making progress in geofluid forecasting, parameter estimation is also an important area of geophysical DA (e.g., [26]). Yet, efficient parameter estimation also relies on the ability to generate the adjoint of the models (whichever is present in the numerical chain from the parameters to the observations). Exceptions are some ensemble-variational techniques (e.g., a comparison of adjoint-based and iterative ensemble-based DA methods is given in [27]), but they come with their own problems, such as time-dependent localization [28, 29]. As a consequence, these efforts were also hampered by this computer science issue.
As a result, deep learning techniques were welcomed into the geosciences. Among the first specialists to experiment with these techniques were the climatologists, who already had significant experience, if not advanced expertise, in statistics and more traditional ML techniques. These techniques were applied to past events, reanalyses, and Coupled Model Intercomparison Project datasets, with decade-long forecast lead times, in contrast to the short-term geophysical forecasts where DA is deemed necessary. It was then proposed to apply advanced ML techniques to derive parameterizations of physical processes such as subgrid turbulent physics and convection, which are crucial for both accurate climate and shorter-term NWP models [30-34]. Over the past few years, in the realm of NWP and DA applied to the geosciences, deep learning was also seriously considered, with the research community shifting gears under ML influence. Obviously, it was partly due to the computer vision successes and the legitimate exposure of these fashionable techniques. But my take is that the ML blood flow into NWP and DA is also due to other fundamental reasons. First, deep learning offers efficient non-linear regression tools that the DA community craves. Second, as mentioned earlier, the deep learning libraries offer solutions to the automatic differentiation core problem of DA, at the very least wherever NN architectures from these libraries are used, and with the promise to go beyond with Jax and Julia. Third, the possibilities offered by the deep learning methods and the related software freed our minds from many constraints and limitations of traditional DA-based parameter estimation. We can now attempt ambitious parameter estimation, i.e., with millions of parameters non-linearly related to the observations, based on combined DA and deep learning. Moreover, this endeavor blurred the lines between NWP, DA, and ML experts in the geo- and climate sciences, making the community larger and its impact deeper, with faster communication streams. Even ML/deep learning specialists from the industry, attracted by the challenges of NWP and climate modeling, brilliantly joined the collective efforts (e.g., Google, Nvidia, Huawei, and Microsoft [35-40]). Despite these references and the amazingly accurate surrogate skills of their deep learning weather models, they still crucially rely on the reanalysis data of the ECMWF, patiently produced from complex and large physical models, huge datasets of meteorological observations, and increasingly sophisticated DA methods to combine them [41].
The following is focused on the recent and ongoing developments at the intersection of dynamical systems in geoscience, DA, and ML, which we (R. Arcucci's team at Imperial as well as mine) dubbed MLDADS for ML, DA, and dynamical systems. Despite appearing to be a niche subject, it has reinvigorated long-standing DA and ML open problems and witnessed a significant blooming of published works in geo- and environmental sciences (and beyond).
Let us list a few MLDADS opportunities and challenges aiming to combine DA and ML techniques. Deep learning can help us with subgrid parameterization as a powerful non-linear regression tool (e.g., [42, 43]). Most current parameterizations of geoscientific forecasting models are based on physical models with many tuned physical parameters. They alternatively rely on linear regressions, as for instance used in downscaling and upscaling [44]. Deep learning techniques hence offer more systematic non-linear regression tools, provided that the datasets are large enough. More generally, deep learning could help us with model error correction of forecasting systems [45-47]. It could help us build surrogate models of part of, or the whole, dynamical system, either from pure observations or from data of high-resolution simulations. In the first case, it bypasses the requirement for physical principles and modeling. In the second case, it can offer an acceleration of forecasting, or a way to simplify automatic differentiation of the represented physical model since its NN counterpart can easily be differentiated [48], or a fast means to generate ensembles, etc. It can also be a very appealing, versatile tool to help with traditional DA methods, for instance in the tuning of statistical parameters needed to regularize covariances in ensemble DA methods, or to help carry out non-linear adaptive transformations such as in DA methods based on Gaussian anamorphosis. Creating data-driven surrogate models of low-order models of geofluids, which are meant to capture the key difficulties inherent to high-dimensional numerical models, was one of the first objectives of the research community. There are, by the end of 2022, dozens of ML papers in the literature dealing with the problem of data-driven surrogate dynamics, even if most of them focus on low-order models. The problem can be addressed by typical ML techniques, such as the projection on a regressor frame or basis, random forests, analogs, diffusion maps, reservoir computing, long short-term memory NNs, and other NN approaches (e.g., [49-59]).
However, most ML techniques assume almost noiseless and complete observations, which is a fundamental limitation to extrapolating them to the geosciences. By contrast, most DA methods can handle noisy and sparse observations. That is why surrogate modeling in this context can be tackled using a conjunction of ML and DA techniques to exploit noisy and incomplete observations such as those met in realistic geoscience systems [60-64]. For high-dimensional systems, the relative lack of information can be mitigated by additionally using past trajectories or information on the system, such as an approximate model derived from physical laws [65-67].
In a variational context, the mathematical problem of estimating the dynamical model and the state trajectory can be framed in a rigorous Bayesian formalism [60-63, 68-70]. Generalizing a classical short-window DA rationale [71] yields a cost function for variational DA, which ML tools can then be leveraged upon:

J(p, x_{0:K}) = Σ_{k=0}^{K} ‖y_k − H_k(x_k)‖²_{R_k^{-1}} + Σ_{k=1}^{K} ‖x_k − M(p, x_{k−1})‖²_{Q_k^{-1}},    (1)

where ‖v‖²_A = v^T A v, x_k ↦ H_k(x_k) is the observation operator at time k, x ↦ M(p, x) is the resolvent of the statistical dynamical model, typically a NN with weights and biases stacked in the vector p, and y_k and x_k for k = 0, . . . , K are the observation and state vectors of the physical system over the time interval [0, K], respectively. The matrices R_k and Q_k are the observation and model error covariance matrices, respectively. The background term −2 ln p(p, x_0), which corresponds to the prior density function on the initial condition x_0 and the parameters p, is neglected here for the sake of simplicity. Besides, the sensitivity to x_0 vanishes in the large-K limit for chaotic dynamics, a standpoint at variance with classical DA where K is small; regularization of p is usually addressed via a stopping criterion built on the validation loss and via dropout layers (although a Tikhonov term is also sometimes summoned). Accounting for these approximations, J is given up to terms that do not depend on p and x_k.
This cost function is rigorously obtained from Bayes' rule under Gaussian assumptions on some of the errors of the problem. It resembles a typical weak-constraint 4D-Var cost function [72]. This DA standpoint is remarkable as it allows for noisy and partial observations of the physical system, whereas most ML approaches do not account for such significantly noisy and partial data. Note that the solution of this problem is a trajectory x^⋆_{0:K} for the state over the time interval, a deterministic NN model parameterized by p^⋆, and a stochastic correction over the deterministic surrogate model, whose predefined statistics are Gaussian with model error covariance matrix Q_k. It is possible to partially estimate this stochastic correction through the expectation-maximization algorithm (see [62, 73, 74]). Let us assume that the dynamics to be learned are fully and directly observed, i.e., H_k ≡ I, and that the observation errors vanish, i.e., R_k → 0. Then the observation term in the cost function gets frozen and imposes that x_k ≃ y_k, so that, in this limit, J(x_{0:K}, p) becomes

J(p) ≃ Σ_{k=1}^{K} ‖y_k − M(p, y_{k−1})‖²_{Q_k^{-1}}.    (2)

This cost function coincides with a typical ML loss function provided Q_k ≡ I.
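To make the structure of Equation (1) concrete, here is a minimal sketch (the toy surrogate, the identity observation operator, and the diagonal error covariances are assumptions made for readability, not part of the original formulation) that evaluates J and obtains its gradients with respect to both the parameters p and the state trajectory:

```python
import jax
import jax.numpy as jnp

n, K = 4, 5                       # state dimension and number of time steps (toy sizes)
Rinv = jnp.eye(n)                 # inverse observation-error covariance (assumed identity)
Qinv = jnp.eye(n)                 # inverse model-error covariance (assumed identity)

def H(x):                         # observation operator (here: identity, full observation)
    return x

def model(p, x):                  # toy surrogate resolvent M(p, x) with parameters p
    return jnp.tanh(p["W"] @ x) + p["b"]

def cost_J(p, traj, obs):
    """Weak-constraint cost J(p, x_{0:K}): observation misfits plus surrogate-model misfits."""
    J = 0.0
    for k in range(K + 1):
        d = obs[k] - H(traj[k])
        J += d @ Rinv @ d
    for k in range(1, K + 1):
        d = traj[k] - model(p, traj[k - 1])
        J += d @ Qinv @ d
    return J

p = {"W": 0.1 * jnp.ones((n, n)), "b": jnp.zeros(n)}
traj = jnp.zeros((K + 1, n))      # state trajectory x_0, ..., x_K (a control variable)
obs = jnp.ones((K + 1, n))        # observations y_0, ..., y_K

# Gradients with respect to the surrogate parameters and the trajectory, which a
# coordinate descent (DA step / ML step) would use alternately.
dJ_dp, dJ_dtraj = jax.grad(cost_J, argnums=(0, 1))(p, traj, obs)
```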
Note that it is also possible to accomplish the same goal of joint estimation using ensemble-based DA methods [75, 76]. It then appears as a much more ambitious parameter estimation problem than in traditional DA. It has, however, a potentially higher numerical complexity than the variational approach. In particular, the size of the ensemble may have to match the number of global parameters plus the size of a single copy of the local parameters, in addition to a careful use of localization to regularize ensemble-based covariances.

3.1. Physical and statistical hybrid models and their resolvent vs. tendency corrections
Learning the full dynamics of the atmosphere or the ocean, not to mention a holistic Earth system model, from pure observations to build a surrogate dynamical model could be possible but would not yet compete with state-of-the-art physical models. An intermediate, reasonable step would be to use a hybrid representation of the dynamics. In this framework, the statistical data-driven model is a correction to a well-tuned physical numerical model [65, 66, 77, 78]. For instance, the partial differential equations of the physical model could be corrected, i.e., the tendencies using NWP terminology:

dx/dt = ϕ(x) + n(p, x),

where x ↦ ϕ(x) is the physical tendency and x ↦ n(p, x) is the tendency correction. Alternatively, one could correct the resolvent, which would then be a substitute for the statistical model in the cost function J(p, x_{0:K}):

FIGURE 1. Schematic of the coordinate descent alternating DA and ML, which aims to minimize J(p, x_{0:K}) and is expected to converge to the maximum a posteriori estimator. ϕ represents the physical model, η is the deep learning model that depends on the parameters p, and ϕ ⊕ η(p) is the hybrid model.
M(p, x) = Φ(x) + N(p, x),

where x ↦ Φ(x) is the resolvent obtained from the physical model, whereas x ↦ N(p, x) is the corrective statistical model. When possible, this hybrid formulation has the key advantages of (i) limiting the number of NN parameters, (ii) providing the learning step with the physical model as a first guess, and consequently (iii) accelerating the learning step. Hence, this approach has been favored in several recent and ambitious surrogate models. One potential issue can appear if the adjoint of the physical model is not available. In that case, the tendency correction cannot be learned through gradient-based non-linear optimization; only the resolvent approach would be easily implementable since it would not require the physical model's adjoint. In both cases, the NN correction's adjoint is obtained by automatic differentiation. These options for learning corrections of chaotic models are still largely unexplored. Finally, note that surrogate modeling of geofluids has a lot of connections with, and could gain from, recent surrogate modeling endeavors in computational fluid dynamics and engineering (e.g., [79-82]).
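To fix ideas on the resolvent-correction option, here is a minimal sketch (plain Python/NumPy; the toy tendency, its crude integration, and the NN correction are stand-ins introduced for illustration, not the formulation of any specific model) of a hybrid resolvent in which only the statistical correction would need to be differentiated when training p:

```python
import numpy as np

dt, n_steps = 0.01, 10

def phi(x):
    """Toy physical tendency dx/dt = phi(x) (stand-in for the PDE right-hand side)."""
    return -x + np.roll(x, 1) * np.roll(x, -1)

def Phi(x):
    """Resolvent of the physical model: integrate the tendency over one forecast step."""
    for _ in range(n_steps):
        x = x + dt * phi(x)
    return x

def nn_correction(p, x):
    """Toy statistical correction N(p, x); in practice a trained neural network."""
    W, b = p
    return np.tanh(W @ x) + b

def hybrid_resolvent(p, x):
    """Hybrid model M(p, x) = Phi(x) + N(p, x): physics as a first guess plus a
    learned correction; the physical model's adjoint is never required to train p."""
    return Phi(x) + nn_correction(p, x)

n = 8
p = (0.01 * np.ones((n, n)), np.zeros(n))
x0 = np.random.default_rng(0).normal(size=n)
print(hybrid_resolvent(p, x0))
```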

3.2. MLDADS optimization system
An algorithm proposed to minimize J(p, x_{0:K}), as defined in Equation (1), on both the state trajectory and the statistical NN model parameters consists of iteratively alternating (i) estimating the state trajectory using classical DA methods and (ii) optimizing the NN parameters using ML methods [61]. This is illustrated in Figure 1. A DA smoother (e.g., 4D-Var, ensemble-based smoothers) is typically used in the DA step, while deep learning is used in the ML step. This has been recognized mathematically as a coordinate descent for non-linear optimization, with iterations of the loop until convergence [62]. Moreover, it has been demonstrated to be successful with low-order models [62]. Nonetheless, this could seem numerically costly for higher-dimensional problems, such as NWP, so that one should first focus on the first half-steps of this loop. In this framework, traditional DA can be considered as order-1/2 of the loop, while order-1, i.e., DA followed by ML, has been studied in [78] and coincides with model error correction using analysis increments of DA runs. Note that the notation order-1/2 merely means that DA only represents half of the loop, the first step out of two in the surrogate model estimation. This has the potential to be used in an NWP context, since the analysis increments are natural outputs of operational DA. Order-3/2 consists in additionally using the NN model error correction of the order-1 loop to improve NWP based on DA with an improved forecast model [78]. This in turn means uncovering a state trajectory closer to the true one, which, in realistic conditions, cannot be accessed directly since the observations are both sparse and noisy. It certainly remains to implement these in an NWP framework. While order-2 (and beyond) has been proven to be consistent and to have a lot of potential for low-order models [62], it would be interesting to demonstrate this potential with higher-dimensional models and NWP. This entails a second application of ML, and hence obtaining a more accurate surrogate model, or model error correction in the hybrid case. Nonetheless, this is expected to be a significant numerical challenge.
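Figure 1's loop can be summarized by the following pseudocode-level sketch (plain Python; the DA smoother and the ML training step are deliberately left as toy placeholders rather than any specific implementation):

```python
import numpy as np

def da_smoother(model, p, observations, first_guess):
    """DA half-step: estimate the state trajectory given the current surrogate/hybrid
    model; placeholder for, e.g., 4D-Var or an ensemble smoother analysis."""
    # Toy stand-in: take the observations where available, the first guess elsewhere.
    return np.where(np.isfinite(observations), observations, first_guess)

def ml_training(p, trajectory, model):
    """ML half-step: update the NN parameters p so that model(p, x_k) fits x_{k+1};
    placeholder for a gradient-descent training loop of a deep learning library."""
    return p  # toy stand-in: parameters left unchanged

def coordinate_descent(p, observations, first_guess, model, n_outer=4):
    """Alternate the DA and ML half-steps: order-1/2 is DA only, order-1 is DA then ML,
    order-3/2 repeats DA with the improved model, and so on until convergence."""
    trajectory = first_guess
    for _ in range(n_outer):
        trajectory = da_smoother(model, p, observations, trajectory)  # DA half-step
        p = ml_training(p, trajectory, model)                         # ML half-step
    return p, trajectory

K, n = 10, 4
obs = np.random.default_rng(0).normal(size=(K + 1, n))
obs[::2] = np.nan                          # half of the time steps are unobserved
model = lambda p, x: p["W"] @ x            # toy linear surrogate
p, traj = coordinate_descent({"W": np.zeros((n, n))}, obs, np.zeros((K + 1, n)), model)
```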
One idea would be to skip the estimation of the state trajectory, or to marginalize it out, and hence compute the sensitivity of the observations with respect to the parameters directly. This idea is put forward by the Ensemble Kalman Inversion technique [83] and has been experimented with, but on a small number of parameters [84]. It remains to be investigated whether the method can be applied to larger sets of parameters, or even to deep learning NNs. This approach would likely call for model reduction, explicit or implicit, which could be implemented by deep learning.

Figure 2 illustrates learning the dynamics of the global atmosphere from the ERA5 dataset [41]. We used the setup, data, and ECMWF Integrated Forecasting System model runs provided by the WeatherBench platform [85]. A residual deep NN model of about 1.5 × 10⁶ parameters is learned from 38 years of meteorology at a coarse 5.625° × 5.625° spatial resolution. This surrogate model is independently tested on the years 2017-2018 over many forecast runs, on the 500 hPa geopotential fields, using the root mean square error (RMSE) and the anomaly correlation coefficient (ACC) as functions of the forecast lead time up to 14 days. The surrogate model uses 3 levels of geopotential and 3 levels of temperature. Its forecast skill is compared to global and daily climatologies, to the persistence model, and to the ECMWF model at truncation T42 (about 2.8° grid cells at the equator). We observe that, in spite of its coarser resolution (about T21 truncation), the surrogate model outperforms the ECMWF model at T42 projected on T21.

FIGURE 2. RMSE (top) and ACC (bottom) for the 500 hPa geopotential predicted by the learned neural network model that simulates weather at coarse resolution, and comparison to uniform and daily climatologies, persistence, and the ECMWF model forecasts at T42 (see text for details).

The shades around the skill curves represent the variability of the forecast scores induced by the variability of the ERA5 meteorology over the test years 2017-2018, which is in turn mostly determined by the distribution of initial states on (part of) the current attractor of the chaotic dynamics and by the forcings (radiation, ocean). They were obtained by running 5,000 forecasts in the test period, starting from as many distinct initial conditions, and computed from the dispersion of this ensemble of forecasts. Provided the surrogate model is a good enough representation of the truth, this variability is not expected to depend much on the surrogate model solution, since it is mainly driven by the dynamics. Surprisingly, this uncertainty (or risk of the forecast scores) is usually not provided in the ML surrogate modeling literature, where the focus is only on deterministic forecasts without their uncertainty.
In particular, this variability is not the uncertainty attached to the learning process from the dataset. However, it shows how easy it is to pick initial conditions with flattering forecast scores.
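As an aside, the score variability shown in Figure 2 can be estimated along the following lines (a minimal sketch with synthetic arrays standing in for the forecasts, verifications, and climatology; the RMSE and ACC formulas follow the usual deterministic verification conventions):

```python
import numpy as np

def rmse(forecast, truth):
    return np.sqrt(np.mean((forecast - truth) ** 2))

def acc(forecast, truth, climatology):
    """Anomaly correlation coefficient with respect to a climatology."""
    fa, ta = forecast - climatology, truth - climatology
    return np.sum(fa * ta) / np.sqrt(np.sum(fa ** 2) * np.sum(ta ** 2))

rng = np.random.default_rng(0)
n_forecasts, n_grid = 5000, 2048               # many initial conditions, toy grid size
climatology = np.zeros(n_grid)
truth = rng.normal(size=(n_forecasts, n_grid))                    # synthetic verifications
forecast = truth + 0.5 * rng.normal(size=(n_forecasts, n_grid))   # synthetic forecasts

rmses = np.array([rmse(f, t) for f, t in zip(forecast, truth)])
accs = np.array([acc(f, t, climatology) for f, t in zip(forecast, truth)])

# The spread of the score distribution over initial conditions is what the shaded
# bands represent; it is not the uncertainty of the learning process itself.
print(rmses.mean(), np.quantile(rmses, [0.05, 0.95]))
print(accs.mean(), np.quantile(accs, [0.05, 0.95]))
```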
Among the fundamental open problems in combining DA with ML is the ability of the ML+DA schemes to assimilate fresh observations. Upon receiving them, one would update not only our knowledge of the state and possibly a few physical model parameters (as in classical DA) but also any neural network used within the sequential algorithm, whether it is related to correcting the evolution model, the observation model, or any block of the assimilation scheme itself. In other words, one wishes to obtain from online/sequential ML+DA algorithms what has been achieved in offline variational ML+DA. This challenge has been explored with low-dimensional to intermediate models with ensemble Kalman filters [75, 76] and with variational schemes [67, 78], where the model (or a compartment thereof) and the state trajectory were both updated when new observations are acquired. We expect this topic to thrive in the coming years since (i) it aligns with the conditions of operational forecasting centers, (ii) it proposes algorithms for incremental learning over very long periods of time where data are generated/acquired, and (iii) later observations better capture the current trends of a time-dependent model in the context of climate change.
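One way to picture such an online update, as a hedged sketch rather than any specific published algorithm, is a stochastic ensemble Kalman filter acting on a state vector augmented with the NN parameters, so that fresh observations update both at once (toy dimensions; the forecast step and the NN itself are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_p, n_ens = 6, 20, 40         # state size, NN-parameter size, ensemble size (toy)
n_obs = n_x
H = np.hstack([np.eye(n_obs), np.zeros((n_obs, n_p))])   # observe the state part only
R = 0.1 * np.eye(n_obs)

# Augmented ensemble: each member carries a state and a copy of the NN parameters.
E = rng.normal(size=(n_x + n_p, n_ens))

def enkf_analysis(E, y):
    """Stochastic EnKF analysis on the augmented vector z = (x, p): the Kalman gain
    built from ensemble covariances spreads the observation information to p."""
    A = E - E.mean(axis=1, keepdims=True)                 # ensemble anomalies
    S = H @ A
    C = S @ S.T / (n_ens - 1) + R                         # innovation covariance
    K = (A @ S.T / (n_ens - 1)) @ np.linalg.inv(C)        # Kalman gain
    Y = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, n_ens).T
    return E + K @ (Y - H @ E)                            # update states and parameters

y = rng.normal(size=n_obs)           # a batch of fresh observations
E = enkf_analysis(E, y)              # both the x and p parts of each member are updated
```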
A more systemic approach to applying ML tools to DA is to learn the DA scheme itself. This clearly requires that the building blocks of the DA scheme be differentiable, so that gradients of its outputs, typically the analysis products, can be efficiently computed [86-88]. Such a holistic approach has been coined end-to-end in ML. Learning the output of a local ensemble Kalman filter in conjunction with the intermediate-complexity global circulation model SPEEDY was proposed very early [89, 90]. A radically different approach consists in optimizing the internal cogs of DA schemes by finding an efficient latent-space representation for either the ensemble Kalman filter [91-94] or the variational schemes [95]. One can also try to learn the solvers of, for instance, variational schemes, since these are critical to the DA analysis and its numerical efficiency and cost [96, 97]. Beyond supplementing DA schemes with ML techniques, as was originally proposed, one may further replace key components of the DA scheme with advanced NNs such as transformers and multi-headed attention [98]. The possibilities of combining ML and DA from this end-to-end standpoint seem endless, as hinted at by the previous examples. But the existence of a more powerful, unifying, and enlightening way to do so is still to be investigated.

3.3. Uncertainty quantification
Uncertainty quantification, i.e., the ability to quantify the confidence we have in one of our statistical estimators, is in the genes of DA and NWP. In particular, most DA methods have an underlying Bayesian rationale that helps to formalize, or connect to, uncertainty quantification. Ensemble Kalman filtering and ensemble-variational methods in geoscience are based on an ensemble of states, or state trajectories, because ensembles provide an estimate of the uncertainty of the state trajectory through the ensemble spread and other empirical moments. The confidence in weather forecasts is assessed with ensemble forecasts, which could be initialized with the ensemble from an ensemble Kalman filter [99]. However, uncertainty quantification with variational DA methods is challenging since it requires estimating a Hessian in high dimension, which is a dreadful task. There are nowadays several methods meant to address this, such as an ensemble of DAs for sequential DA [100, 101], approximations of the Hessian of the cost function [102], or randomized singular value decomposition [103, 104]. Finally, model error estimation is one of the most critical issues of DA in geoscience [115, and references therein].
For the same reason, uncertainty quantification in basic ML is far from natural and simple. Besides, it does not have the same rigorous basis as in DA, despite seminal papers that paved the way (e.g., [105, 106]) but were insufficiently followed up on. However, a wealth of solutions have been proposed, most of them unsatisfactory in view of the standards of uncertainty quantification. Some of them underwent a preliminary investigation in geoscience. First of all, learning models from several random initializations of the weights and biases and from distinct randomizations of the data, hence generating a deep ensemble of model variants [107], yields some measure of uncertainty quantification for the statistical model per se and its internal parameters. But this quantification is unlikely to match, or even target, the uncertainty of the surrogate model estimates. Indeed, those perturbations of the weights and biases are not sampled according to their conditional likelihood. Dropout layers are used as a regularization scheme which, through the increased robustness to noise in the NN, mitigates overfitting. Dropout is a priori used in the learning step, while the layers are kept frozen in the inference step (i.e., the forecasts). Yet, again, the technique should in general fail to target the uncertainty of the estimate [108]. Additionally, stochasticity can be re-introduced in the forecasts to build an ensemble [107, 109, 110]. Whether this generated ensemble is a proper quantification of uncertainty is, however, questionable. Alternatively, one can try to augment the NN and predict statistical hyperparameters such as the standard deviation of the variable estimates, or some parameters of their a priori statistics. This is reminiscent of variational auto-encoders, where the hyperparameters nonetheless lie in the bottleneck layer. Yet, such an approach is much more difficult to train [111]. Indeed, those hyperparameters would be at the second level of a Bayesian hierarchy, and this requires more testing and investigation. At the first hierarchy level, the NN could output biases [112]. Another approach consists of estimating probability density functions of some of the output variables (marginals), using a softmax layer for a categorical output of the density or a convolutional layer for a continuous density [110]. More generally, one could account for the uncertainty of the weights by treating them as random variables inferred using Bayes' rule, an approach even closer to the principles of DA; such NNs are Bayesian neural networks [113]. Even though basic ML does not genuinely account for uncertainty, it is hoped that the flexibility of deep learning will allow designing augmented architectures that address uncertainty quantification. Yet, the subject remains in its infancy in ML. It could benefit from the seminal papers of the field [105, 106] and from the expertise developed on Bayesian uncertainty quantification within geophysical, and in particular meteorological, DA. Finally, let us highlight that the subject is complementary to all of the topics mentioned in Section 3.2.
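To fix ideas on the simplest of these options, here is a minimal sketch of a deep-ensemble-like estimate (plain NumPy; the "trained" model variants are toy stand-ins for independent training runs): the spread of the variants' forecasts gives a rough, admittedly imperfect, measure of uncertainty, not a sample of the Bayesian posterior.

```python
import numpy as np

n, n_members, lead_steps = 8, 10, 20

def make_member(seed):
    """One 'trained' model variant; in practice, one training run from a random
    initialization of the weights and a distinct shuffling of the data."""
    r = np.random.default_rng(seed)
    W = 0.9 * np.eye(n) + 0.05 * r.normal(size=(n, n))
    return lambda x: np.tanh(W @ x)

members = [make_member(s) for s in range(n_members)]
x0 = np.random.default_rng(123).normal(size=n)

# Run the same forecast with each member; the ensemble mean and spread provide a
# crude uncertainty estimate for the deterministic surrogate forecast.
forecasts = []
for m in members:
    x = x0.copy()
    for _ in range(lead_steps):
        x = m(x)
    forecasts.append(x)
forecasts = np.array(forecasts)
print(forecasts.mean(axis=0))
print(forecasts.std(axis=0))
```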

4. Final thoughts
In the previous section, a few key MLDADS challenges have been examined. The field is nonetheless evolving very fast. So where does this lead us? I believe supplementing DA with the power of ML tools is already a success and, as a topic, should continue to thrive for many years, boosting DA research, with still many applications to explore. Furthermore, the DA+ML topic may not have clear boundaries yet; the merging of DA and ML is only at its beginning.
On the one hand, model error correction and surrogate modeling, which are at the core of MLDADS, undergo much faster progress than when they were only addressed by pure DA and dynamical systems techniques. On the other hand, they rely either on high-resolution simulations of existing models or on reanalyses, a product of DA techniques, observations, and numerical modeling. As recently proven by the industry's artificial intelligence research teams and others, the potential of surrogate modeling for short-term meteorological forecasting is extraordinary, but it critically relies on the modelers' expertise through the reanalysis products. That is why a long-term challenge in the geoscience domain is to build surrogate models from large databases of observations only, without relying on physical and numerical expertise, which, for now, remains critical. Moreover, the potential of such dynamical surrogate modeling for longer time scales, i.e., for climate modeling, is far from established and the topic is nascent (e.g., [114]). Accounting for slowly evolving trends and for the limited time span of reanalysis data, from a climate standpoint as opposed to numerical weather prediction, is likely to make the surrogate and model error correction challenges quite difficult.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/pangeo-data/WeatherBench.

Author contributions
The author confirms being the sole contributor of this work and has approved it for publication.

Funding
This work was granted access to the HPC resources of IDRIS under the allocations 2020-AD011011184R1 and 2021-AD011011184R2 made by GENCI; this was used for generating Figure 2 of the manuscript.