Reconstructing Global Chlorophyll-a Variations Using a Non-linear Statistical Approach

Monitoring the spatio-temporal variations of surface chlorophyll-a concentration (Chl, a proxy of phytoplankton biomass) greatly benefited from the availability of continuous and global ocean color satellite measurements from 1997 onward. These two decades of satellite observations are however still too short to provide a comprehensive description of Chl variations at decadal to multi-decadal timescales. This paper investigates the ability of a machine learning approach (a non-linear statistical approach based on Support Vector Regression, hereafter SVR) to reconstruct global spatio-temporal Chl variations from selected surface oceanic and atmospheric physical parameters. With a limited training period (13 years), we first demonstrate that Chl variability from a 32-years global physical-biogeochemical simulation can generally be skillfully reproduced with a SVR using the model surface variables as input parameters. We then apply the SVR to reconstruct satellite Chl observations using the physical predictors from the above numerical model and show that the Chl reconstructed by this SVR more accurately reproduces some aspects of observed Chl variability and trends compared to the model simulation. This SVR is able to reproduce the main modes of interannual Chl variations depicted by satellite observations in most regions, including El Niño signature in the tropical Pacific and Indian Oceans. In stark contrast with the trends simulated by the biogeochemical model, it also accurately captures spatial patterns of Chl trends estimated by satellite data, with a Chl increase in most extratropical regions and a Chl decrease in the center of the subtropical gyres, although the amplitude of these trends are underestimated by half. Results from our SVR reconstruction over the entire period (1979–2010) also suggest that the Interdecadal Pacific Oscillation drives a significant part of decadal Chl variations in both the tropical Pacific and Indian Oceans. Overall, this study demonstrates that non-linear statistical reconstructions can be complementary tools to in situ and satellite observations as well as conventional physical-biogeochemical numerical simulations to reconstruct and investigate Chl decadal variability.


INTRODUCTION
Phytoplankton-the microalgae that populate the upper lit layers of the ocean-fuels the oceanic food web and regulates oceanic and atmospheric carbon dioxide levels through photosynthetic carbon fixation. The launch of the "Coastal Zone Color Scanner" (CZCS) onboard the Nimbus-7 spacecraft in October 1978 (Hovis et al., 1980) provided the first synoptic view of near-surface chlorophyll-a concentration (Chl, a proxy of phytoplankton biomass). Although primarily focusing on coastal regions, CZCS also provided global pictures of Chl distribution and a new perspective on phytoplankton biomass seasonal variability (Campbell and Aarup, 1992;Longhurst et al., 1995;Yoo and Son, 1998;Banse and English, 2000).
After the failure of CZCS in 1986, ocean color observations were not available for more than a decade. The launch of the modern radiometric Sea-viewing Wide Field-of-View Sensor (SeaWiFS; McClain et al., 2004) in late 1997 followed later by other satellites allowed monitoring and understanding the spatio-temporal Chl variations at global scale over the past two decades. For instance, it revealed that El Niño events induce a Chl decrease in the central and eastern equatorial Pacific in response to reduced upwelled nutrients to the surface layers (e.g., Chavez et al., 1999;Wilson and Adamec, 2001;McClain et al., 2002;Radenac et al., 2012) but also a Chl signature outside the tropical Pacific through atmospheric teleconnections (Behrenfeld et al., 2001;Yoder and Kennelly, 2003;Dandonneau et al., 2004;Messié and Chavez, 2012). It also allowed identifying the Indian Ocean Dipole (IOD; Saji et al., 1999) as the main climate mode driving Chl interannual variations in the Indian Ocean (e.g., Murtugudde et al., 1999;Wiggert et al., 2009;Currie et al., 2013) and monitoring a Chl increase in the subpolar North Atlantic related to the positive phase of the North Atlantic Oscillation (NAO) (Martinez et al., 2016). Aside from the Chl decrease monitored in the mid-ocean gyres over the first decade of the XXIst century (Polovina et al., 2008;Irwin and Oliver, 2009;Vantrepotte and Mélin, 2009;Signorini and McClain, 2012), the reliability of the long-term trends derived from these satellite data are more questionable and led to conflicting results in the past literature (Behrenfeld et al., 2006;Vantrepotte and Mélin, 2011;Siegel et al., 2013;Gregg and Rousseaux, 2014). These discrepancies suggest that detection of robust global trend may require several decades of continuous observations (Beaulieu et al., 2013).
The production of longer, consistent ocean color time series can partly alleviate this issue. The combination of the global CZCS and SeaWiFS datasets provided an insight on the Chl response to natural decadal climate variations D'Ortenzio et al., 2012), such as the Pacific Decadal Oscillation (PDO; Mantua et al., 1997) and the Atlantic Multidecadal Oscillation (AMO; Enfield et al., 2001). However, blending these two archives or reconstructing them using compatible algorithms also led to contrasting results (Gregg and Conkright, 2002;Antoine et al., 2005).
The time span of the modern radiometric observations (∼20 years), as well as the CZCS-SeaWiFS reprocessed time series, are still too short to investigate Chl decadal variations and longer-term trends. Longer, continuous and consistent records are required. In situ biogeochemical observatories can provide such long and continuous records, but their inhomogeneous spatial distribution and varying record length prevent a confident assessment of Chl long-term changes at the scale of a basin (Henson et al., 2016).
Coupled physical-biogeochemical ocean model simulations can provide additional, valuable information's in areas with limited observational coverage. These models resolve reasonably well the seasonal to interannual biogeochemical variability (Dutkiewicz et al., 2001;Wiggert et al., 2006;Aumont et al., 2015). They can however diverge in capturing Chl variations at a timescale of a decade (Henson et al., 2009a,b;Patara et al., 2011), in particular phytoplankton regime shifts (Henson et al., 2009b). Different biological models are often coupled to different physical models, which renders the attribution of the different modeled responses to their physical or biological components difficult. The decadal or longer variability of the simulated primary producers should then be interpreted cautiously.
In this context, statistical methods reconstructing past Chl variations may be useful alternatives to overcome limitations associated with both observations and numerical models. While statistical reconstructions are now commonly used to extend physical variables back in time (e.g., Smith et al., 2012;Huang et al., 2017;Nidheesh et al., 2017), reconstructions of surface Chl are still in their infancy. Phytoplankton distribution is strongly controlled by physical processes, such as mixing and uplifting, fueling nutrients in the upper-lit layer (i.e., bottom up processes). Thus, relevant physical variables may allow to reconstruct Chl past variations. To our knowledge, a single study allowed the derivation of spatio-temporal surface Chl variations over several decades in the tropical Pacific (Schollaert Uz et al., 2017). This reconstruction used a linear canonical correlation analysis on Sea Surface Temperature (SST) and Sea Surface Height (SSH) to improve the description of the Chl response to the diversity of observed El Niño events and decadal climate variations in the tropical Pacific.
The objective of the present study is to explore the potential of an alternative statistical technique to reconstruct Chl at global scale over a 32-year time-series (i.e., 1979-2010). The considered machine learning technique is based on a Support Vector Regression (SVR) which accounts for non-linearities between predictors and Chl. First, the SVR is trained over 1998-2010 on a self-consistent dataset of physical and Chl variables, all extracted from a forced ocean model simulation that includes a biogeochemical component (i.e., the NEMO-PISCES model). Then, modeled physical variables are used to reconstruct Chl over 1979-2010. The feasibility and robustness of the proposed reconstruction process is assessed through the comparison of modeled vs. reconstructed Chl. In a second step, this framework is applied to satellite ocean color observations.

The NEMO-PISCES Simulation
In this study, we used the "Nucleus for European Modeling of the Ocean" (NEMO) modeling framework (Madec, 2008). The NEMO configuration used displays a coarse resolution with 31 vertical levels and a 2 • horizontal grid with a refined 0.5 • resolution in the equatorial band. The model includes a biogeochemical component, the Pelagic Interaction Scheme for Carbon and Ecosystem Studies (PISCES; Aumont et al., 2015). PISCES is a model of intermediate complexity designed for global ocean applications , which uses 24 prognostic variables and simulates biogeochemical cycles of oxygen, carbon and the main nutrients controlling phytoplankton growth (nitrate, ammonium, phosphate, silicic acid, and iron). It simulates the lower trophic levels of marine ecosystems distinguishing four plankton functional types based on size: two phytoplankton groups (small = nanophytoplankton and large = diatoms) and two zooplankton groups (small = microzooplankton and large = mesozooplankton). Chl from PISCES (hereafter referred to as Chl PISCES ) is defined as the sum of the simulated diatoms and nanophytoplankton Chl content.
The NEMO-PISCES simulation is forced with atmospheric fields from the interannual Drakkar Forcing Set 5 (DFS5.2, Dussin et al., 2014) for wind, air temperature and humidity, precipitation, shortwave and longwave radiations. It is initialized with the World Ocean Atlas 2005 (WOA05) climatology for temperature, salinity, phosphate, nitrate and silicate (Garcia et al., 2006), while iron initial state is similar to the model climatology employed by Aumont and Bopp (2006). The model simulation was spun up using 3 repetitions of the 30 years' DFS5.2 forcing set, and finally ran over 1979-2010. Although successfully used in a variety of biogeochemical studies (e.g., Bopp et al., 2005;Gehlen et al., 2006;Lengaigne et al., 2007;Schneider et al., 2008;Steinacher et al., 2010;Tagliabue et al., 2010;Séférian et al., 2013;Aumont et al., 2015;Keerthi et al., 2017;Parvathi et al., 2017 and references therein), the ability of the PISCES model to reproduce satellite surface Chl is briefly illustrated in section "Evaluation of Chl PISCES at global scale."

Chl Derived From Satellite Radiometric Observations
Satellite surface Chl for Case I waters is provided by the Ocean Color -Climate Change Initiative (OC-CCI, hereafter referred to as Chl OC−CCI ) from the European Space Agency 1 . This 1 http://www.esaoceancolour-cci.org/ product combines multi-sensor, global, ocean-color products while attempting to reduce inter-sensor biases for climate research (Storm et al., 2013). OC-CCI extends the time series beyond that provided by single satellite sensors and is performant in terms of long-term consistency than other products from multi-mission initiatives (Belo Couto et al., 2016).
Only deep oceanic areas (depth > 200 m) are considered to avoid coastal waters where specific non-case-1 waters products are required. The Chl Level-3 product is binned on a regular 1 • grid with a monthly resolution over January 1998-December 2010. This time period does not extend beyond 2010 to be consistent with the NEMO-PISCES simulation. Chl OC−CCI is used to evaluate the PISCES model performances in Section "Evaluation of Chl PISCES at global scale, " and to train the statistical method in Section "Application to satellite radiometric observations."

Predictors and Chl Variables
The variability of phytoplankton biomass is driven in many regions of the world ocean and at many timescales by physical processes (e.g., Wilson and Adamec, 2002;Wilson and Coles, 2005;Feng et al., 2015;Messié and Chavez, 2015). Our statistical architecture relates to 12 predictors and one biological variable (Chl). A sample thus refers to 13 variables. The 12 predictors (7 physical variables from NEMO-DFS5.2, 2 temporal and 3 spatial parameters) are detailed in Table 1, including their influence on Chl variations and the references supporting this influence.
We purposely limited the predictors to surface variables because our objectives are (1) to reconstruct Chl from physical observations, which are mainly available through remotely sensed surface data (oceanic observations below the surface are indeed usually not accessible at large spatial-scales or interannual time-scales); (2) to build a statistical scheme that can complement more complex numerical models (here, NEMO-PISCES) which simulate complex three-dimensional processes and are costly to run.
A first SVR is trained on physical predictors from NEMO and DFS5.2 vs. Chl PISCES . The reconstructed Chl time-series is referred to as Chl Svr−PISCES . A second SVR is trained using the same physical predictors but vs. satellite Chl observations (Chl OC−CCI ). The reconstructed Chl time-series is referred to as Chl Svr−CCI .

Climate Indices
Climate indices are provided by the National Oceanic and Atmospheric Administration (NOAA) website 2 : the AMO, the Multivariate El Niño Southern Oscillation (ENSO) Index (MEI) and the Interdecadal Pacific Oscillation (IPO).

Support Vector Regression
The statistical reconstruction technique is based on a SVR. This method belongs to kernel methods in Statistical Learning Theory and relates to the Support Vector Machine (SVM, Vapnik, 1998). SVM is a kernel-based supervised learning method  Behrenfeld et al., 2006;Polovina et al., 2008;Martinez et al., 2009;Thomas et al., 2012;Lewandowska et al., 2014 Sea level anomaly Thermocline/pycnocline depths Adamec, 2001, 2002;Radenac et al., 2012 Zonal and meridional surface wind components Surface momentum flux forcing and vertical motions driven by Ekman pumping Martinez et al., 2011;Thomas et al., 2012 Zonal and meridional surface current components Horizontal advective processes Messié and Chavez, 2012;Radenac et al., 2013 Short-wave radiations Photosynthetically active radiation Sakamoto et al., 2011 Month (cos and sin) Periodicity of the day of the year (day 1 is very similar to day 365 from a seasonal perspective) Sauzède et al., 2015 Longitude (cos and sin) Latitude (sin) Periodicity (longitude 0 • = longitude 360 • ) Sauzède et al., 2015 (Vapnik, 2000) developed for classification purpose in the early 1990s and then extended for regression by Vapnik (1995). The basic idea behind SVR is to map the variables into a new nonlinear space using the kernel function, so that the regression task becomes linear in this space. The learning step estimates the parameters of the regression model according to a linear quadratic optimization problem, which can be solved efficiently. SVR also uses a robust error norm based on the principle of structural risk minimization, where both the error rates and the model complexity should be minimized simultaneously. Because SVR can efficiently capture complex non-linear relationships, it has been used in a variety of fields, and more specifically for oceanographic, meteorological and climate impact studies (Aguilar- Martinez and Hsieh, 2009;Descloux et al., 2012;Elbisy, 2015;Neetu et al., 2020), as well as in marine bio-optics (Kim et al., 2014;Hu et al., 2018;Tang et al., 2019). Predictors and Chl are normalized by removing their respective average and dividing them by their standard deviations. Two SVR are trained over 1998-2010: one on Chl PISCES and one on Chl OC−CCI (Step A in Figure 1). This time period has been chosen as 1998 is the first complete year of the satellite Chl OC−CCI time-series, and 2010 is the last year available of the modeled Chl PISCES . The two resulting SVR schemes are applied on the NEMO-DFS5.2 physical predictors over 1979-2010. Finally, the annual means and standard deviations initially removed are applied to perform the back transformation and reconstruct either Chl Svr−PISCES or Chl Svr−CCI (Step B in Figure 1).
Considering a Gaussian kernel, SVR only involves the selection of two hyperparameters: the penalty parameter C of the error term and the kernel coefficient gamma, driving the reduction of the cost function. C and gamma values are 1 and 0.1, respectively when the SVR is trained on Chl PISCES , and 2 and 0.3 when trained on Chl OC−CCI (see details in the Supplementary Material and Supplementary Figure 1A). Sensitivity tests to an increasing portion of the sample total number (from 0.2 to 9% of the full dataset) used in the training process are performed (see Supplementary Material and Supplementary Figure 1B). The mean absolute error stabilizes for a sample number higher than 6%, suggesting that the SVR skills don't improve much afterward. This observation combined with computational limitations lead us to present the 9% experiment hereafter.

Empirical Orthogonal Function Analysis
The SVR skills to reconstruct Chl interannual to decadal variations are investigated performing Empirical Orthogonal Function analysis on Chl PISCES , Chl OC−CCI , Chl Svr−PISCES and Chl Svr−CCI . First, Chl data are centered and reduced (i.e., the monthly climatology is removed and the induced anomalies are divided by their standard deviations) to avoid an overly dominant contribution of high values on the analysis (Emery and Thomson, 1997) over the periods of interest (i.e., 1998-2010 or 1979-2010). A 5-month running mean is applied to focus on the interannual/decadal signal. The analysis is separately performed for the Atlantic, Pacific and Indian Oceans north of 40 • S until 60 • N, and for the 40 • S-60 • S region hereafter referred to as the Austral Ocean. Indeed, the large area covered by the Pacific Ocean and its dominant modes in climate variability (i.e., ENSO/IPO), could regionally dampen other modes of variability. Basin-scale spatial maps are then gathered to a global one, referred to as EOF. The associated time-series refer to as the Principal Components (PCs).

SYNTHETIC RECONSTRUCTION FROM A PHYSICAL-BIOGEOCHEMICAL OCEAN MODEL
This section assesses the reliability and robustness of the SVR approach using a complete and coherent dataset extracted from a global simulation performed with a coupled physicalbiogeochemical ocean model. The SVR is first trained over 1998-2010 on Chl PISCES , and Chl Svr−PISCES is reconstructed over 1979-2010. Chl PISCES and Chl Svr−PISCES are then compared over 32 years to evaluate the consistency of the proposed data-driven reconstruction scheme.

Evaluation of Chl PISCES at Global Scale
The ability of the NEMO-PISCES model to reproduce the satellite Chl over 1998-2010 is briefly presented here. Boreal winter and summer climatology from Chl PISCES compare reasonably well with those of Chl OC−CCI (Figure 2A vs. 2B and 2C vs. 2D). The model correctly represents the main spatial patterns with, for instance, higher Chl and a stronger seasonal cycle at high latitudes, despite an overestimated biomass in the Southern Ocean (Launois et al., 2015). The model also captures low Chl in the subtropical gyres, with some underestimation. This discrepancy may be explained by the lack of acclimation dynamics to oligotrophic conditions or by the assumption of constant stoichiometry either in phytoplankton or in organic matter in the model (Ayata et al., 2013;Aumont et al., 2015). The model underestimates Chl values in the equatorial Atlantic and Arabian Sea. In this latter region, mesoscale and submesoscale processes unresolved by the model have been shown to be of critical importance (Hood et al., 2003;Resplandy et al., 2011). Finally, the parameterization of nitrogen-fixing organisms not explicitly modeled in that PISCES version could explain the Chl PISCES underestimation in the western Pacific in austral summer (Dutheil et al., 2018).
High Chl are accurately simulated in the eastern boundary upwelling systems. In two of the three main High Nutrient Low Chlorophyll (HNLC) regions, i.e. the equatorial Pacific and the eastern subarctic Pacific, the model successfully reproduces the moderate Chl OC−CCI . However, the model overestimates Chl OC−CCI east of Japan because of an incorrect representation of the Kuroshio current trajectory. This common bias in coarse resolution models (i.e., Gnanadesikan et al., 2002;Dutkiewicz et al., 2005;Aumont and Bopp, 2006) is potentially related to too deep mixed layer simulated in winter inducing very strong spring blooms . In the Southern Ocean, the third and largest main HNLC region, the model overestimates Chl OC−CCI values, especially during summer. However, the standard satellite algorithms that deduce Chl from reflectance tend to underestimate in situ observations by a factor of about 2-2.5, especially for intermediate concentrations (e.g., Dierssen and Smith, 2000;Kahru and Mitchell, 2010). It is to note that Chl in physical-biogeochemical coupled models is commonly overestimated in the Southern Ocean, and systematically underestimated in the oligotrophic gyres (Séférian et al., 2013).
The 1 st mode of the EOF analysis performed on interannual Chl displays close percent of total variance for Chl OC−CCI and Chl PISCES (16.6% vs. 21.1%, respectively). Their PCs in the Pacific Ocean are well correlated with the MEI (r = 0.71 and 0.89 with p = 0.0015 and p < 0.001, respectively; Figure 3C). PCs show the greatest positive values in January 1998 during the peak of the strong 1997/1998 El Niño event and the greatest negative values during the following La Niña beginning of 1999. The associated EOFs display a Chl horseshoe pattern ( Figures 3A,B), reminiscent of the ENSO pattern on SST (Supplementary Figure 2; Messié and Chavez, 2012). While the tropical Pacific experiences a Chl decrease during El Niño events, the North and South Pacific display a Chl increase, and inversely during La Niña. This typical ENSO pattern is also related to remote Chl anomalies outside the Pacific induced by atmospheric teleconnections, such as a Chl decrease in the tropical North Atlantic and in the South Indian Ocean during El Niño. Although the Atlantic and Indian Ocean's PCs are not correlated with the MEI (0.14 and 0.05, respectively), their EOFs are similar to those obtained from analysis performed at global scale (vs. basin scale here) and which have been largely discussed in the past (e.g., Behrenfeld et al., 2001Behrenfeld et al., , 2006Yoder and Kennelly, 2003;Chavez et al., 2011). Chl PISCES reasonably well captures the first mode of Chl OC−CCI interannual variability over 1998-2010 in the Pacific and Atlantic Oceans, with 0.89 and 0.77 (p < 0.001) correlations between their PCs, respectively, but not in the Indian Ocean, where the PCs correlation is far weaker (0.13) and insignificant (Figures 3C-E).

Statistical Performances
A first evaluation of the SVR applied on the synthetic dataset (i.e., both physical and biogeochemical model outputs) is provided for the dedicated subset (i.e., 20% of 9% of the total data set) over the 1998-2010 training time period. Chl PISCES and Chl Svr−PISCES datasets display a determination coefficient of 0.95 and a root mean square error (RMSE) of 0.22 (see Supplementary Figure 1C), indicating at first glance a very good ability of the SVR to reconstruct Chl PISCES . The SVR reconstruction is very accurate when comparing the full modeled and reconstructed Chl for (i) the 1998-2010 training time period, (ii) the 1979-1997 fully independent dataset, and (iii) the 1979-2010 whole dataset, both at global and basin scales (Table 2 and Figure 4). For each oceanic basin, determination coefficients between both datasets over 1979-1997 exceed 0.84, except in the Austral Ocean where they get down to 0.71. RMSE are lower than 0.14 and associated with a slope ranging from 0.84 in the Austral to 0.97 in the Atlantic (Figure 4). In addition, the quality of the reconstructed Chl Svr−PISCES over the 1979-1997 independent time period is only marginally degraded compared to the 1998-2010 training period or the 1979-2010 full period.

Evaluation of the Reconstructed Chl Spatio-Temporal Variability
The Normalized Root-Mean-Square-Error (NRMSE, i.e., RMSE normalized by the average Chl used to train the SVR) between Chl PISCES and Chl Svr−PISCES filtered with a 5-month running mean (to discard the high frequency signal) shows an error ranging between 10 and 20% over 1998-2010 ( Figure 5A). Their correlation exceeds 0.7 (p < 0.001) over most of the global ocean ( Figure 5B). At mid-latitudes they are generally larger than 0.8, and they range between 0.6 and 0.9 in the equatorial Pacific. This accurate reconstruction demonstrates that a strong relationship exists between physical processes and Chl at global scale. However, the reconstructed Chl field can be regionally less accurate. For instance, the edges of the oligotrophic gyres (delimited by the 0.1 mg.m −3 contour in Figure 5A) exhibit the highest NRMSE and lowest correlations. Large NRMSE are also evident in the Gulf Stream region while the western tropical Atlantic exhibits lower correlations than 0.5.
Those discrepancies could be due first to the zooplankton grazing pressure (top-down control) which is often overestimated in PISCES simulations. It results in an underestimated nanophytoplankton biomass in the oligotrophic gyres, emphasized along their edges (Laufkötter et al., 2015). Because the top-down control is not accounted for by the SVR, Chl variability induced by the overgrazing in these areas might not be captured. Second, in the equatorial Pacific Ocean, a minimum iron threshold value has been imposed   1998-2010, 1979-1997, and the whole 1979-2010 time period. 1998-2010 1979-1997 1979-2010    (0.01 nmol.L −1 ) in the biogeochemical model. Without that threshold Chl is too low on both sides of the equator, resulting in a strong accumulation of macronutrients and a spurious poleward migration of the subtropical gyre boundaries . While the existence of such a threshold suggests that a minor but regionally important source of iron is missing in PISCES, it also suggests the inability of the SVR in reproducing ecosystem dynamics related to such artificial input of micronutrient. Finally, atmospheric input of iron through desert dust deposition is known to be stronger in the Atlantic than in the Pacific Ocean (Jickells et al., 2005). Such signal cannot be accounted for by the SVR with the given predictors, which might (with meso -and sub-mesoscale activities) explain the higher NRMSE in the north western Atlantic than in the north-western Pacific. As expected, areas of high NRMSE and low correlations between Chl PISCES and Chl Svr−PISCES identified over 1998-2010 (Figure 5, left column) extend and strengthen over 1979-1997 ( Figure 5, right column). Indeed, the correlations significantly decrease in the tropical Pacific while they slightly decrease in midlatitudes between the two periods. Correlations remain high and NRMSE low in the North-West Pacific, North and South-West Atlantic, and South Indian Oceans as well as over a large part of the Southern Ocean providing confidence for analyses extended beyond the training period of the SVR.
The analysis is now extended to the 1979-2010 timeperiod to investigate the skills of the SVR in reproducing phytoplankton interannual/decadal cycles. The 1 st EOFs of Chl PISCES vs. Chl Svr−PISCES have the same sign of variability over 72% of the global ocean (Figures 6A,B). Both EOFs are similar in the Pacific and Atlantic Oceans and their PCs are highly correlated over 1979-2010 (Table 3 and Figures 6C,E). In the Pacific, these EOFs strongly resemble the typical horseshoe pattern of IPO with SST anomalies of opposite polarities in the tropical and extra-tropical Pacific regions (Supplementary Figure 3). Correlations between Chl PISCES and Chl Svr−PISCES 1 st PCs and the IPO index are high (0.94 and 0.95 with p < 0.001, respectively; blue and black vs. red lines in Figure 6C). It highlights that the 1 st mode of Chl variability in the Pacific is strongly driven by the IPO. In the Atlantic, both PCs are strongly correlated with the AMO (−0.8 for Chl Svr−PISCES and −0.85 for Chl PISCES with p < 0.001; Figure 6E). The AMO shifts from a cold to a warm phase in the mid-1990's (Supplementary Figure 3), and is associated with a decrease in Chl (Figures 6A,B).
The 1 st two modes explain a similar percent variance for Chl PISCES and Chl Svr−PISCES in the four oceanic basins, with the exception of the 1 st mode in the Atlantic Ocean (see Table 3). In this basin Chl Svr−PISCES percent variance is underestimated by a factor 2 compared to Chl PISCES , while their 1 st EOFs and PCs are well correlated. One explanation might be that the AMO is the climate cycle with the longest period (80 years) when compared to the IPO. Thus, it might be the most difficult signal to reproduce as the SVR is trained over a relatively "short" 12 years' time-period.
The agreement between Chl PISCES and Chl Svr−PISCES 1 st mode is not as good in the Austral and Indian Oceans when compared to the Atlantic and Pacific Oceans (Table 3 and Figures 6A,B,D,F). In the Indian Ocean, the Chl PISCES EOF exhibits a maximum positive variability along the western Arabian Sea, while it is located northeast of Madagascar for Chl Svr−PISCES . In the Austral Ocean, Chl PISCES and Chl Svr−PISCES EOFs roughly follow a zonal distribution.
A strong correspondence between SST and Chl has been previously reported over a large part of the global ocean (Behrenfeld et al., 2006;Martinez et al., 2009;Siegel et al., 2013), demonstrating the close interrelationship between ocean biology and climate variations. Consequently, it is not surprising to observe strong correlations between Chl PISCES or Chl Svr−PISCES and climatic indexes mostly built on SST anomalies (Supplementary Figure 3).
The 2 nd mode of variability of Chl PISCES is also well reproduced by the SVR. The percent variances are close ( Table 3) as well as their spatio-temporal variability in the four oceanic basins (Supplementary Figure 4). The high correlations between the first two modes of Chl PISCES vs. Chl Svr−PISCES highlight the SVR ability to relatively well reproduce the Chl PISCES lowfrequency variability.

APPLICATION TO SATELLITE RADIOMETRIC OBSERVATIONS SVR Statistical Performances and Sensitivity Tests
In this section, the SVR uses the same physical predictors from NEMO-DFS5.2 as in Section "Synthetic reconstruction from a physical-biogeochemical ocean model, " but it is trained on satellite radiometric observations (e.g., Chl OC−CCI ). The same procedure is followed (see Supplementary  Figures 5A,B). A first validation is performed for 20% of 9% of the full data set and over the 1998-2010 training period showing a high determination coefficient of 0.87 and RMSE of 0.37 between Chl OC−CCI and Chl Svr−CCI (Supplementary Figure 5C).
As expected, the regression lines between the whole dataset of Chl OC−CCI vs. Chl Svr−CCI for each oceanic basin and at global scale are farther away from the 1:1 line than for the synthetic study over the training period, but still remain close The correlation (r) between the Chl Svr−PISCES and Chl PISCES PCs is also reported with a significant level of *p < 0.001 and **p < 0.002.
Frontiers in Marine Science | www.frontiersin.org Frontiers in Marine Science | www.frontiersin.org (higher slope than 0.8, except in the Austral Ocean; Figure 7). The SVR trained on NEMO-DFS5.2 predictors vs. satellite Chl is expected to be less efficient than the SVR trained on the coherent NEMO-DFS5.2-PISCES physical-biogeochemical dataset. Some of the biological interactions/processes (such as the diversity of the prey-predator relationships, the complexity of photoacclimation phenomena) are not yet optimally formulated by model equations inducing that Chl derived from numerical modeling is oversimplified compared to the complexity of the real ocean. Not to mention that satellite Chl may itself be partially affected by other components that are not Chl, such as colored dissolved organic matter (CDOM; Morel and Gentili, 2009) and suspended particulate matter (SPM). Phytoplankton can also adjust their intracellular Chl according to light and nutrient availability (e.g., Laws and Bannister, 1980;Behrenfeld et al., 2015). The induced Chl changes are no longer ascribed to changes in biomass. All these signatures on satellite Chl could explain Chl Svr−CCI underestimation. Nevertheless, determination coefficients between Chl Svr−CCI and Chl OC−CCI remain high over the training time period (0.85, 0.89, and 0.86 for the Indian, Pacific and Atlantic Oceans, respectively, Figure 7). The NRMSE between Chl OC−CCI vs. Chl Svr−CCI is lower than 20% over most of the global ocean ( Figure 8A). Correlations higher than 0.9 (p < 0.001) are evident over large subtropical areas in the Atlantic, Indian and Pacific Oceans as well as in the Equatorial Pacific ( Figure 8B). Interestingly, the SVR generally does a better job at reconstructing the satellite Chl than the modeled one (Figures 5A,C vs. Figure 8). NRMSE are higher at high latitudes and along the oligotrophic area boundaries, although to a less extent than for Chl PISCES . Because Chl OC−CCI can only be retrieved under clear sky conditions, gaps in satellite observations (especially during wintertime) likely alters the SVR learning and could explain such a degradation of Chl Svr−CCI as moving toward high latitudes.

Reconstruction of Satellite Chl Interannual to Decadal Variability and Trends
The SVR ability to replicate Chl OC−CCI interannual variability is now investigated over 1998-2010 (Figure 9). In the Pacific Ocean, Chl OC−CCI and Chl Svr−CCI 1 st EOFs are close ( Figure 9A vs. 9B), their PCs are highly correlated (r = 0.89, p < 0.001; Figure 9C), and their percent variance are similar ( Table 4). As presented in Section "Evaluation of Chl PISCES at global scale, " this mode of Chl variability can be attributed to ENSO, given their EOFs pattern as well as their PCs highly correlated with the MEI (r OC−CCI / MEI = 0.71 and r Svr−CCI / MEI = 0.91, with p = 0.0015 and p < 0.001, respectively). Interestingly, Chl Svr−CCI EOFs are closer to Chl OC−CCI than Chl PISCES in several areas such as in the northwestern Pacific, the south-western Atlantic and the Indian Ocean from Madagascar to the western coast of Australia (Figures 9A,B vs. Figure 3B). Consistently, correlations between Chl OC−CCI and Chl Svr−CCI PCs in the three basins and for the 1 st two modes are higher than between Chl OC−CCI and Chl PISCES ( Table 4).
Chl OC−CCI linear trends over 1998-2010 exhibit large areas of increase or decrease (red and blue areas in Figure 10A, respectively). Productive regions at high latitudes and along the equatorial and upwelling areas generally exhibit positive Chl OC−CCI trends, albeit many underlying regional nuances. Contrastingly, trends are generally negative in the center of the gyres. These regional trends are consistent with those extracted from the first 13 years of the SeaWiFS record and discussed by Siegel et al. (2013) (see their Figures 5B, 8B). The negative trends in the oligotrophic gyres were also reported by Signorini et al. (2015) who attributed this behavior to MLD shallowing trends. Surface water density variability induced by changes in temperature and salinity, combined with wind stirring, are effective drivers of vertical mixing, which in turn control the renewal of nutrients from the rich-deep layers toward the euphotic zone. Thus, shallower MLD would decrease nutrient uplift and phytoplankton growth in the oligotrophic areas.
Chl Svr−CCI trends agree qualitatively well with those of Chl OC−CCI at global scale ( Figure 10B vs. 10A, respectively). Indeed, decline of Chl Svr−CCI can be observed in the center of the gyres, while outside Chl Svr−CCI generally increases in a similar way to Chl OC−CCI . Chl Svr−CCI accurately captures the largest Chl OC−CCI increase observed in the Southern Ocean along the Antarctic Circumpolar Current. While Gregg and Casey (2004) reported a substantial negative bias in the SeaWiFS data for this region when compared to in situ observations, which could hamper the reliability of satellite trends discussed in this area (e.g., Siegel et al., 2013), the SVR remains able to reproduce the positive observed trend. Despite qualitative spatial agreements, it is noteworthy that the SVR underestimates by half the magnitude of the satellite trend (see scales in Figure 10A vs. 10B).
Interestingly, trends in Chl PISCES generally differ from Chl OC−CCI (Figure 10C). This is striking for the North Pacific and Atlantic high latitudes, but also in the equatorial Atlantic and Arabian Sea with opposite trends when compared with Chl OC−CCI and Chl Svr−CCI , and in a more mitigated manner in the Austral Ocean.
Chl Svr−CCI is also compared with the only historical consistent dataset built by Antoine et al. (2005) who reanalyzed ocean color time series from CZCS (1979CZCS ( -1983 and SeaWiFS (1998SeaWiFS ( -2002. A 22% global mean increase of Chl between the two era was reported. It was mainly due to large increases in the intertropical areas and to a lesser extent in higher latitudes, while oligotrophic gyres displayed declining concentrations ( Figure 11A). SST from the SODA reanalysis was used as a proxy of ocean stratification and opposite Chl and SST changes over 60% of the ocean between 50 • S and 50 • N was reported (light blue and yellow in Figure 11B, adapted from Martinez et al., 2009). This inverse relationship was used to hypothesized that multidecadal changes in global phytoplankton abundances were related to basin-scale oscillations of the ocean dynamics. Briefly, SST changes were related to a regime shift of the PDO (although the use of the basin-scale IPO would have been more appropriate) from a warm to a cold phase in the Pacific and Indian Oceans leading to an increase of Chl, and inversely in the Atlantic Ocean with a regime shift from a cold to a warm phase of the AMO leading to a Chl decrease.
Observed Chl changes over the last decades are accurately reproduced by Chl Svr−CCI , including a Chl increase in the equatorial Pacific and the southern tropical Indian Oceans, as well as a Chl decline in both the Atlantic and Pacific oligotrophic gyres ( Figure 11C). However, the magnitude of the SVR reconstructed Chl is underestimated (note that the Chl ratio is multiplied by 2 in Figure 11C to allow the comparison with Figure 11A). On average, the inverse relationship between Chl Svr−CCI and SST NEMO ( Figure 11D) occurs over 69.4% of the global ocean between 50 • S and 50 • N in a similar way to that reported by Martinez et al. (2009), especially in the Pacific Ocean (see Figure 11D vs. 11B). In the Indian Ocean, although Chl mainly increases in both studies, it is here associated with a SST decrease. Interestingly, this inverse Chl-SST relationship in the Indian Ocean (yellow area in Figure 11D) was reported in Behrenfeld et al. (2006) over the SeaWiFS era, suggesting that the SST dataset used in Martinez et al. (2009) may have decadal discrepancies for this region.
In their study, Martinez et al. (2009) analyzed two 5-year time periods apart from each other by 15 years. They suggested that averaging observations separately over the two time-periods may have dampen the effect of interannual variability and reveal the decadal one. Most of the changes observed between the time periods covered by the two satellites are here confirmed based on the reconstructed Chl Svr−CCI . However, the continuous  The correlation (r) between the PCs of Chl OC−CCI vs. Chl Svr−CCI and between Chl OC−CCI vs. Chl PISCES is also reported with a significant level of *P < 0.001 and **P < 0.02.
Frontiers in Marine Science | www.frontiersin.org 30-year time series of Chl Svr−CCI provides new insights on the observed regime shifts (Figure 12). In the Pacific Ocean, the 1 st EOF of Chl Svr−CCI ( Figure 12A) is close to the Chl spatial patterns obtained from the CZCS to SeaWiFS era ( Figure 11C) and the PC remains highly correlated with the IPO over 1979-2010 (r = 0.94 with p < 0.001, Figure 12B). The Chl increase in the Indian Ocean, north-east of Madagascar toward the west coast of Australia, between the 1980's and the 2000's also appears on the Chl Svr−CCI EOF. These temporal changes might also be related to the IPO variability (correlation between the IPO index and Chl Svr−CCI PC = 0.6, p < 0.001; Figure 12C).
In the Atlantic Ocean, CZCS-SeaWiFS Chl and Chl Svr−CCI 1 st EOF also share some similarities, including a decrease of Chl in the subtropical gyres and an increase in the equatorial/tropical regions. The associated PC (Figure 12D), exhibits a shift between 1979-1983 and 1998-2002 consistently with Figure 2C of Martinez et al. (2009). In this latter study, this change was attributed to a regime shift of the AMO. However, the AMO index is not correlated with the 1 st Chl Svr−CCI PC (r = 0.03, p = 0.43) but rather with the 2 nd mode (r = 0.43 with p = 0.003, Supplementary Figure 6), likely explaining the spatial discrepancies in Figure 11A vs. 11C. Although the detailed analysis of Chl decadal variability is beyond the scope of the present study, these initial findings underscore the importance of continuous time series at regional/global scales to combine spatial and temporal information's and properly investigate Chl long-term variability.

SUMMARY AND CONCLUSION
In this paper, we assess the efficiency of a machine learning statistical approach based on support vector regression to reconstruct surface Chl from oceanic and atmospheric variables. We first apply this strategy on a self-consistent global dataset gathering physical predictors and Chl data simulated by a coupled physical-biogeochemical model simulation. Our results indicate that this non-linear method accurately hindcasts interannual-to-decadal variations of the phytoplankton biomass simulated at global scale by the model, except at the boundaries of the subtropical gyres where the strong top-down control of zooplankton grazing in the numerical model is not accounted for by the SVR. Likewise, this statistical approach cannot yet reproduce Chl variability induced by nutrient inputs that are not directly related to our selected physical predictors, such as atmospheric iron deposit.
The SVR was then trained on satellite Chl observations. It accurately reproduces observed interannual Chl variations in most regions, including El Niño signature in the tropical Pacific and Indian Oceans as well as the main modes of Atlantic Chl variability. Despite an amplitude underestimation by half, it also accurately captures spatial patterns of Chl trends over the satellite period, with a Chl increase in most extratropical regions and a Chl decrease in the center of the subtropical gyres, as well as their changes between the CZCS and SeaWiFS era. Interestingly, while Chl PISCES magnitude is closer to Chl OC−CCI than Chl Svr−CCI , interannual variability and spatial trends of Chl PISCES are farther than Chl Svr−CCI to Chl OC−CCI . Equations representing the processes that govern the evolution of biogeochemical variables in a biogeochemical model are obviously less complex than the ones at play in the real ocean. We thus anticipated the modeled Chl to be easier to reconstruct than the satellite one. Additional complications were also expected through the reconstruction of satellite Chl from the model oceanic and atmospheric predictors, which may be less realistic than physical parameters derived from satellite measurements. As a consequence, the SVR is indeed slightly less efficient at reproducing the major satellite Chl patterns compared to the model ones but is surprisingly more efficient at capturing observed Chl temporal variations. This results in a NRMSE generally weaker when reconstructing satellite data compared to the model one, although the predictors used are identical.
Machine learning techniques are powerful tools to statistically model non-linear processes. They require a significant amount of data to be trained and are well-suited to analyze remote sensing data. While several attempts have been made over the last decade to retrieve oceanic Chl content (Kwiatkowska and Fargion, 2003;Zhan et al., 2003;Camps-Valls et al., 2009;Jouini et al., 2013;Blix and Eltoft, 2018), the present work is one of the first attempt to use such machine learning techniques to reconstruct past time series of phytoplankton biomass at global scale. To our knowledge only Schollaert Uz et al. (2017) tried to reconstruct the Chl multi-decadal variability in the tropical Pacific using a canonical correlation analysis built only from SST and SSH. Our SVR approach leads to higher correlations between reconstructed and satellite Chl in the tropical Pacific, highlighting the strength of such non-linear machine-learning methods with multiple predictors. These results emphasize deep learning approaches as promising tools to reconstruct multidecadal Chl time series in the global ocean, based on the knowledge of physical conditions. The successful use of surface variables only in reproducing Chl variability which is influenced by 3D-processes is here clearly noteworthy, and investigation of variable importance in the Chl reconstruction will deserve some future insights. An obvious short-term perspective of the current study is to train a wider range of such statistical models with physical predictors from surface satellite observations but also from observations within the water column which could be derived from Argo data (i.e., mixed layer and thermocline depth). Including complementary variables such as satellite particulate backscattering coefficient (as a proxy of the Particulate Organic Carbon) in the training/reconstruction process should also be considered. It would allow to investigate the extent to which the Chl variability reflects changes in phytoplankton biomass vs. cellular changes in response to light (e.g., Siegel et al., 2005;Westberry et al., 2008;Behrenfeld et al., 2015). The use of longitude and latitude as predictors may limit the ability to capture long-term trends in the evolution of the biogeochemical province boundaries, such as the expansion of the oligotrophic areas (Polovina et al., 2008;Irwin and Oliver, 2009;Staten et al., 2018). Thus, exploring deep learning schemes which may not explicitly depend on longitude and latitude, especially convolutional representations (LeCun et al., 2015), are particularly appealing. Further efforts need also to be dedicated to alleviate the issue of the underestimation of the long-term Chl trends. For instance, it would be noteworthy to investigate secular trends such as the 30% Chl decrease reported at global scale over the last century by Boyce et al. (2010), which remains largely debated (Mackas, 2011;McQuatters-Gollop et al., 2011;Rykaczewski and Dunne, 2011).
Whatever the methodology used (i.e., numerical models, satellite or in situ observations), they all have both advantages and drawbacks. In situ observations are considered as ground truth (with some errors/uncertainties depending for instance on the field measurement protocols), but are heterogeneous in time and space. Satellite Chl data provide a spatio-temporal synoptic view but they have their own measurement issues and uncertainties (e.g., radiometric sensors and spectral properties, atmospheric corrections, water constituents and their optical properties) and are limited to 20 years in their record length. Biogeochemical models are useful tools to (i) interpolate or extrapolate in time and space biogeochemical tracers such as Chl and to (ii) investigate complex three-dimensional processes responsible for their variations. However, those models also suffer from biases and are farther from in situ observations than satellite data. They are also not straightforward to run and require large computing resources. Thus, machine learning statistical schemes could be seen as a complementary tool to the "interpolate/extrapolate" use of biogeochemical models in providing a long-term synoptic surface view built from observations (being aware of the uncertainties associated with the variables used in the training schemes). Such methods, applied on observations only, will then provide an independent tool that may either question or enforce conclusions drawn from model simulations. Comparison between both methods and observations will help to improve biogeochemical models with acute quantification of model biases and identification of the most meaningful predictors that may point to missing processes in biogeochemical models. As a conclusion, machine learning is a versatile tool that, associated with biogeochemical models and observations, may greatly enhance our view of global biogeochemistry.

AUTHOR CONTRIBUTIONS
EM led the project, analyzed the results, and wrote the first draft of the manuscript. TG provided the physical model outputs. TG and ML provided support in the analysis and the writing of the manuscript. CF processed the machine learning approach with support from RS. RF provided the feedbacks on the statistical approach. All the authors contributed to the development of the manuscript and provided the feedbacks throughout its many stages of preparation.

FUNDING
This work was supported by CNES under contract n • 160515/00 within the framework of the PhytoDev project.