A Chlorophyll-a Algorithm for Landsat-8 Based on Mixture Density Networks

Retrieval of aquatic biogeochemical variables, such as the near-surface concentration of chlorophyll-a (Chla) in inland and coastal waters via remote observations, has long been regarded as a challenging task. This manuscript applies Mixture Density Networks (MDN) that use the visible spectral bands available by the Operational Land Imager (OLI) aboard Landsat-8 to estimate Chla. We utilize a database of co-located in situ radiometric and Chla measurements (N = 4,354), referred to as Type A data, to train and test an MDN model (MDNA). This algorithm’s performance, having been proven for other satellite missions, is further evaluated against other widely used machine learning models (e.g., support vector machines), as well as other domain-specific solutions (OC3), and shown to offer significant advancements in the field. Our performance assessment using a held-out test data set suggests that a 49% (median) accuracy with near-zero bias can be achieved via the MDNA model, offering improvements of 20 to 100% in retrievals with respect to other models. The sensitivity of the MDNA model and benchmarking methods to uncertainties from atmospheric correction (AC) methods, is further quantified through a semi-global matchup dataset (N = 3,337), referred to as Type B data. To tackle the increased uncertainties, alternative MDN models (MDNB) are developed through various features of the Type B data (e.g., Rayleigh-corrected reflectance spectra ρ s ). Using held-out data, along with spatial and temporal analyses, we demonstrate that these alternative models show promise in enhancing the retrieval accuracy adversely influenced by the AC process. Results lend support for the adoption of MDNB models for regional and potentially global processing of OLI imagery, until a more robust AC method is developed. Index Terms—Chlorophyll-a, coastal water, inland water, Landsat-8, machine learning, ocean color, aquatic remote sensing.


INTRODUCTION
Near-surface concentration of chlorophyll-a (Chla), a proxy for phytoplankton biomass, has been observed and quantified in aquatic ecosystems through optical remote sensing for many years (Clarke et al., 1970;Wezernak et al., 1976;Smith and Baker 1982;Gordon et al., 1983;Bukata et al., 1995). This technique has led to the routine production of Chla distributions for the global oceans for more than two decades. The heritage algorithms have used blue-green band-ratio models to estimate Chla (Gordon et al., 1980;O'Reilly et al., 1998), which are realistic representations of biomass in ecosystems where other constituents, such as detritus and colored dissolved organic matter (CDOM), co-vary with Chla. In optically complex inland and coastal waters however, the color of water is further modulated by the presence of organic and inorganic particles, as well as dissolved matter (Han et al., 1994;Harding et al., 1994) that do not generally co-vary with phytoplankton, rendering retrievals of Chla a far more challenging task (IOCCG 2000). To improve estimates of Chla in these turbid and eutrophic environments, other methods have been developed. For example, spectral bands within the red-edge (RE) region (690-715 nm) (Vos et al., 1986;Mittenzwey et al., 1992), combined with red bands have shown to correlate well with Chla in turbid and/or eutrophic waters (Munday and Zubkoff 1981;Gower et al., 1984;Khorram et al., 1987;Gitelson 1992;Rundquist et al., 1996;Gitelson et al., 2007). The RE observations, however, are not available in the suite of measurements made by heritage missionssuch as Landsat-which have provided the longest record of Earth observation from space (Goward et al., 2017).
The Operational Land Imager (OLI) aboard Landsat-8 was launched in February 2013 to continue Landsat's mission of monitoring Earth systems and capturing changes at relatively high spatial resolution (30 m) (Irons et al., 2012). This mission has offered significant improvements in both data quality and quantity (i.e., both spectral and spatial coverage) over previous heritage instruments (Markham et al., 2014;Pahlevan et al., 2014;Markham et al., 2015). Several methods have been developed to retrieve Chla from the four OLI visible bands (Allan et al., 2015;Watanabe et al., 2015;Concha and Schott 2016;Manuel et al., 2020), yet Chla retrieval methods in inland and coastal waters using traditional approaches are challenged by optical complexity and high dynamic ranges where water types can range anywhere from very clear to highly turbid and eutrophic (Spyrakos et al., 2018). It is, therefore, critical to continue to formulate novel methodologies that enable the production of viable Chla products from Landsat-8 data for global scientific studies and applications (Snyder et al., 2017). Pahlevan et al. (2020) successfully applied Mixture Density Networks (MDNs)a class of neural networks that estimates multimodal Gaussian distributions over a range of solutionsto Sentinel-2 and Sentinel-3 data for mapping Chla. This model has further been extended to the hyperspectral domain to obtain Chla and phytoplankton absorption properties from the images of the Hyperspectral Imager for the Coastal and Ocean (HICO) (Pahlevan et al., 2021).
Our motivation for this study is to test the feasibility of using MDN algorithms extended to the OLI imagery for Chla retrievals. Four different MDN models were trained, evaluated, and compared against current machine learning (ML) algorithms using the visible spectral bands. One model (MDN A ) was developed similar to that of Pahlevan et al. (2020), using paired in situ Chla and remote sensing reflectance (R rs ) (Mobley 1999), whereas three other models (MDN B ) were trained using in situ Chla matchups and atmospherically corrected (or partially corrected) products . These latter models, developed to compensate for uncertainties in the atmospheric correction (AC) (Warren et al., 2019), were trained using input features comprised of: 1) satellite-derived R rs (hereafter referred to as R Δ rs ); 2) R Δ rs in combination with ancillary data; and 3) intermediate Rayleigh-corrected reflectance products (ρ s ) (Wynne et al., 2013) combined with ancillary data. The manuscript follows with sensitivity analyses on: 1) the contribution of different spectral bands to the outputs of the model; 2) the impacts of different AC methods; and 3) the implications for aquatic science and applications.

Chla RETRIEVALS FROM VISIBLE BANDS
For satellite missions like Landsat-8 that do not support measurements in the RE, Chla algorithms tend to rely on either blue-green ratio algorithms (O'Reilly et al., 1998) or neural network (NN) models (Doerffer and Schiller 2007;Kajiyama et al., 2018) that apply all (or a subset of) bands within the visible (VIS) and near infrared (NIR) bands. Algorithms based on band ratios for Chla work well in ocean environments; however, when applied to optically complex waters, such as in coastal or inland areas, performance significantly degrades (Bukata et al., 1981;Le et al., 2013;Freitas and Dierssen 2019). Most research on these environments has focused on instruments like the MEdium Resolution Imaging Spectrometer (MERIS), equipped with RE bands (Gitelson 1992;Gower et al., 2005;Gitelson et al., 2007); however, these algorithms are not applicable to OLI or missions without such measurements (e.g., the Moderate Resolution Imaging Spectroradiometer [MODIS (Esaias et al., 1998), Visible Infrared Imaging Radiometer Suite (VIIRS) (Wang et al., 2014), and Geostationary Ocean Color Image (GOCI) (Ryu et al., 2012)]. Thus, the only widely used Chla estimation algorithms available are those of the band-ratio Ocean Color (OC) family (e.g., OC3), a combination of those (Neil et al., 2019), or ML models. Regional and local algorithms specific to OLI imagery have also been attempted with some success in lakes and reservoirs (Allan et al., 2015;Watanabe et al., 2015).
Over the years, several generic ML methods have been utilized in the OC or aquatic remote sensing domain. Among them, Multilayer Perceptrons (MLP), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGB) have shown promise in retrieving Chla. MLPs are NNs with feed-forward connections arranged in a series of layers, which perform regression by learning a set of weights that are used in dot products through sequential layers (Hinton 1990). This type of model has been employed in past research to obtain various bio-optical parameters as well as Chla (Schiller and Doerffer 1999;Gross et al., 2000;Ioannou et al., 2011;Vilas et al., 2011;Jamet et al., 2012;Chen et al., 2014;Hieronymi et al., 2017). SVMs on the other hand, perform regression by finding a maximal separation hyperplane, fitting the training samples within some margin of tolerance. This margin is tunable, and influences over-and under-fitting (Chang and Lin 2011). This method has previously been utilized for Chla estimations in open ocean environments (Kwiatkowska and Fargion 2003;Zhan et al., 2003). Lastly, XGB is a highly optimized tree-based method which fits a series of models to the training data, incrementally reducing the error through gradient boosting-a specific type of ensembling focusing on the error gradient as the target (Chen and Guestrin 2016). This approach has been proven to improve Chla retrieval from OLI ρ s products in highly turbid or eutrophic lakes in China . The MDNs utilized in this research is a variation of MLPs that learn a probability distribution over the output space to allow for multimodal target distributions (Section Mixture Density Network). This multimodality is a fundamental characteristic of inverse problems, owing to the non-unique relationships between input and output features (Sydor et al., 2004).

DATASETS
Two datasets are utilized in this study: paired in situ Chla-R rs measurements (Type A); and near-simultaneous Chla-R Δ rs satellite matchups (Type B). Using both Type A and Type B datasets provides significant benefits in understanding an algorithm's performance. Co-located R rs -Chla measurement pairs (Type A) provide the theoretically ideal environment, as performance on this dataset quantifies the quality of estimates when applied to sample spectra with minimal noise. Near-simultaneous satellite matchups (Type B), on the other hand, provide a practical demonstration of an algorithm's capability when applied with significant noise. Both sets have Chla ranging from 0.1 to >1,000 mg m −3 (as shown in Figure 1 and Supplementary Appendix A). Continental distributions of the datasets are provided in Table 1.

Type A: In situ Data
Type A data consists of radiometric and biogeochemical parameters that have been collected and assembled from various lakes, bays, estuaries, coast lines, and rivers from around the world (Figure 2), covering a wide range of trophic states and geographic locations . The frequency distribution of Chla, Total Suspended Solids (TSS), and the absorption by CDOM at 443 nm (a CDOM (440)) is shown in Figure 1. Although our in situ measurements are not void of uncertainties, this dataset has proven useful for model development and validation , representing the closest to ideal while still considering instrument and human errors.
The radiometric quantity primarily used for model developments in this study is the remote sensing reflectance (R rs ):  R rs (sr −-1 ) is determined using the water-leaving radiance L w and downwelling irradiance E d in the air, just above the water surface. Hyperspectral R rs spectra were resampled according to the OLI's relative spectral response functions. Furthermore, the data was preprocessed before being used as input into any machine learning models, with R rs data transformed according to a robust median-centering interquartile range (IQR) scaling process (fit to the training data); and the Chla values being log-scaled, and transformed to be within the interval (−1, 1). Type A data were used for training and validation of the first MDN model (MDN A ; Section Mixture Density Network Model Types), and for performance assessment against that of other Chla algorithms (Section Performance Assessment). performance of the MDN model applied to atmospherically corrected products (R Δ rs ) (Section Atmospheric Correction and Matchup Selection); and 2) to explore alternative inputs for MDN models (MDN B types) using R Δ rs and ρ s products, combined with ancillary data (Section Mixture Density Network Model Types). The former analysis enables gauging the performance of MDN models in practice, comparing its efficacy with our previous studies Pahlevan et al., 2021) and quantifying how uncertainties in R Δ rs propagate to Chla products. The motivation behind focusing on R Δ rs and ρ s products (the latter analysis) is to identify alternative approaches [e.g., Cao et al. (2020)] to use the provisional, readily accessible, United States Geological Survey aquatic reflectance products (Franz et al., 2015;Pahlevan et al., 2017b). The locations of all successful OLI matchups are mapped in Figure 3

Mixture Density Network
Radiative transfer theory details a series of equations that are concerned with the forward problem: given a set of parameters which describe the inherent optical properties (IOPs) of the water, concentrations of water constituents (e.g., Chla), and a set of boundary conditions, which limit the environment itself, how the relevant apparent optical properties (AOPs) can be discerned (Mobley, 1994). In the same manner, the standard target of a model in machine learning takes the form of a forward problem: given a set of independent variables, the goal is to find a function which approximates the relationship between these and the dependent variables. In particular, the relationship must be right-unique, which guarantees there is a single set of true outputs (y) for any given set of input (x) variables in a dataset D: Plainly, for any input-output pair in a dataset, any samples with the same input must also maintain the same output (conditioned on noise). Inverse problems reverse the relationship however, i.e. switching x and y, which leads to violations of this core assumption. In natural environments, bio-optically active constituents and illumination conditions cause observed R rs ; the same set of input parameters, with perfect knowledge, should always lead to the same R rs . In the inverse formulation, we attempt to instead determine bio-optical parameters and biogeochemical properties from the R rs observations, and thus have the possibility of a single R rs spectrum leading to multiple sets of valid parametric solutions (and so, multiple valid environments in which the spectrum might have been observed) Pahlevan et al., 2021).
Mixture Density Networks (MDN) (Bishop 1994) are a class of neural networks which attempt to address this one-to-many mapping (Sydor et al., 2004;Defoin-Platel and Chami 2007). Where a standard neural network (e.g. MLP) directly models the R rs > Chla relationship, MDNs model a conditional probability distribution, i.e., p(Chla|R rs ), over the R rs > Chla mapping as a mixture of multiple (c) Gaussian functions: with c mixture components and dimensionality d, μ being the mean vector, and ϕ denoting a Gaussian distribution. A valid Gaussian mixture requires that the mixing coefficients π and the covariance matrix Σ adhere to the constraints explained in detail in Bishop (1994). The final model estimate is then taken to be the maximum likelihood, which represents the area of highest probability mass. In our formulation, Bootstrap Aggregation (bagging) (Breiman, 1996) is also applied to the model to improve the quality and consistency of estimates. The practice of bagging is an ensemble technique which is intended to reduce variance and improve generalization. In short, the idea is to repeatedly resample the available training set into a smaller subset (in practice, 50-75% of the original size) and train the model on this new, randomly sampled training subset. After some number of models is added to the ensemble, the median of all model estimates is taken as the final output.

Atmospheric Correction and Matchup Selection
There are several viable AC methods suitable for OLI data processing; nonetheless we focused only on one processing chain, i.e., the SeaWiFS Data Analysis System (SeaDAS), the heritage ocean color AC processing scheme adopted for OLI (Franz et al., 2015). This processing approach is also adopted by the USGS Earth Resource Operation and Science center (EROS) to produce aquatic reflectance products, which are equivalent to R rs products normalized by π. In this study, OLI images were not only fully processed to R rs but also partially processed to output intermediate ρ s that is corrected for atmospheric gaseous absorption, molecular scattering effects, and air-water interface multiple scattering phenomena (Gordon 1997). To compare the effects of AC schemes on Chla retrievals, a single OLI image was processed by three other methods: Polynomial based algorithm applied to MERIS (POLYMER) (Steinmetz et al., 2011), Atmospheric Correction for OLI lite (ACOLITE) (Vanhellemont and Ruddick 2018), and Case-2 Extreme Waters (C2X) (Brockmann et al., 2016) (Section Impacts of Atmospheric Correction).
To create satellite matchup datasets (Type B), SeaDASprocessed OLI scenes were paired with in situ measurements on same-day overpasses and the matchup criteria proposed in Bailey and Werdell (2006) was followed using strict spatiotemporal filters to remove matchups with questionable Frontiers in Remote Sensing | www.frontiersin.org February 2021 | Volume 1 | Article 623678 5 quality. A 3 × 3-element box centered on the closest geographic coordinates of the in situ measurement was used to select potential satellite observations (Figure 3), and any matchup was discarded if four or more pixels were flagged as invalid. Further, any cruise samples <500 m apart were considered duplicates and removed. Temporal mismatch criteria were further tightened for dynamic aquatic ecosystems (e.g., Chesapeake Bay, riverine systems) to 30 min to minimize the associated uncertainties. The median value of valid pixels within the 3 × 3-element box was then derived for each parameter and preprocessed in the same manner as the Type A data. All relevant reflectance products, including top-of-atmosphere reflectance (ρ t ), ρ s , and R Δ rs , as well as ancillary (anc) parameters (e.g., sensor and solar viewing geometries, water vapor content, etc.), were simultaneously extracted, filtered, and stored.

MDN Model Types
The naming convention used for MDN model developments follows the format listed in Table 2. Each of the MDN models was trained with 50% of the total available samples within the respective dataset, chosen uniformly at random (with the same set of samples used to train all ML models; Section Benchmarking). The remaining, held-out portion of the dataset was then used to test the models. The "bagging" scheme was also applied to all ML models-in order to ensure a fair comparison between algorithms-with 75% of the training data used per bagging estimator (without replacement), and an ensemble size of 10 estimators. All hyperparameters of the benchmark models were chosen via a 5-fold cross validation grid search on the training data. For a detailed discussion on MDN hyperparameters, see Supplementary Appendix C.
In those MDN models which utilize ancillary data, these features are added alongside their respective R Δ rs (or ρ s ) spectra as inputs. Bagging was also applied to these ancillary features when they are included (i.e., keeping only a random 75% of the ancillary features for each estimator in the bagging ensemble); the full VIS R Δ rs (i.e., 443, 482, 561, 665 nm) or ρ s (443, 482, 561, 665, 865, 1609 nm) spectrum for a sample was always included as input. Ancillary data (Supplementary Appendix B) included per pixel imaging geometry, coarse scale wind parameter estimations, and other general atmospheric condition variables which were available from SeaDAS (e.g., NO 2 , O 3 , water vapor). We hypothesize that these additional features help the model to learn the biases and uncertainties specific to the AC method, which uses these features in their derivations of R Δ rs . For instance, wind parameters are known to correlate well with sunglint signal (Wang and Bailey 2001;Kay et al., 2009), and are not utilized by default in SeaDAS. Unaccounted water vapor absorption, which affects the OLI's red and ShortWave InfraRed (SWIR) bands, can also introduce additional uncertainties in R Δ rs . On the other hand, ρ s products contain aerosol scattering and absorption governed by imaging geometry parameters; hence, the sun-sensor geometries must be evaluated when retrieving Chla through MDN ρs,anc B , a point not taken into consideration in other studies which utilize similar features .
In order to help prevent the models from learning spurious relationships due to temporal misalignment, we added one additional feature to all MDN B models, which represents the number of minutes between the satellite overpass and the in situ measurement, i.e., Δt. This number is negative if the in situ measurement was taken prior to the overpass, and positive if after. When applying the model to a scene to generate Chla maps (see Spatial Analysis), this feature is simply set to 0 for all pixels for the exact time of the overpass.

Benchmarking
Given their previous application in the aquatic remote sensing area, MLP, SVM, and XGB were the main ML models identically trained and tested with the MDN A model. Due to its simple implementation and successful application in classification problems, K Nearest Neighbor (KNN) was also added as another benchmark (Altman 1992). In spite of its expected performance loss in waterbodies rich in organic or inorganic material, the OC3 model was also used as another benchmark (Franz et al., 2015). The MDN B models were further benchmarked against another XGB model, hereafter referred to by its name in original publication (BST), developed and tested by Cao et al. (2020).
To quantify performance, we primarily examined three metrics: Median Symmetric Accuracy (Morley et al., 2018), referred to as "Error" in all plots and tables; Symmetric Signed Percentage Bias (Morley et al., 2018), referred to as "Bias" in all plots and tables; and the slope of the least-squares linear regression line on the log-transformed data (Campbell 1995). All three have straightforward interpretations, though to clarify the first two: • Median Symmetric Accuracy ("Error") can be interpreted as a symmetric percentage error, equally penalizing overand under-estimation. Lower values indicate better performance, with perfect accuracy being assigned a value of 0%.
Error 100 × e median (| log ( Chla/Chla )|) − 1 • Symmetric Signed Percentage Bias ("Bias"), as with the former metric, is interpretable as a percentage bias that maintains symmetry between over-and under-estimation. Values closer to zero indicate better performance, with positive values indicating over-estimation and negative indicating under-estimation.
MdLQ median log Chla Chla Bias 100 × sign(MdLQ) × e | MdLQ | − 1 In the equations above (5-7), Chla denotes the in situ value and Chla is the estimated value. Both metrics are designed to address the widely documented drawbacks (Makridakis 1993;Hyndman and Koehler 2006;Tofallis 2015) of other commonly used statistical measures (e.g., MAPE). For a thorough discussion on the advantages of these methods over others commonly found in literature, we direct the reader to (Morley et al., 2018). One should also note that the above metrics are zero-centered and symmetric compared to recently proposed metrics in Seegers et al. (2018).

Performance Assessment
The performance of MDN A , the other ML models, and OC3 and the corresponding statistical metrics (N 2,177) are provided in Figure 4. The MDN model outperforms other ML models with improvements in error ranging from 30 to 60%, and MLP ranking as the second-best performer. In fact, given the coarse hyperparameter grid search performed for other ML methods, and the lack of equivalent optimization for MDN parameters, the improvement in performance is even more significant than shown here (Supplementary Appendix C). The MDN model, as well as other ML models, remarkably outperform OC3, which overestimates Chla in the 1-10 mg/m 3 range and underestimates in the higher range. Yet, compared to Pahlevan et al. (2020) 1 , the performance is worse since R rs simulated for the MultiSpectral Instrument (MSI) and Ocean Land Color Instrument (OLCI) contain the RE bands (Gitelson et al., 2007).
To gauge the level of noise introduced through the AC (SeaDAS), we also produced Chla scatter plots via MDN A and other benchmark algorithms applied to the Type B dataset  (i.e., satellite matchups) ( Figure 5). The model performances degrade considerably when applied to SeaDAS-derived R Δ rs . Error levels exceeding 100% render the utility of OLI-derived Chla somewhat impractical for robust scientific applications (Trinh et al., 2017;Bresciani et al., 2018). Further, all the models tend to overestimate Chla >1 (mg m −3 ), suggesting primarily underestimated R Δ rs . This is in agreement with other studies in which the confounding effects of AC have been highlighted; OLIretrieved R Δ rs over coastal and inland waters are known to carry these major biases and uncertainties due to imperfections in the AC process (Werdell et al., 2009;Zibordi et al., 2009;Pahlevan et al., 2017a;Ilori et al., 2019;Kuhn et al., 2019). The AC processor introduces varying degrees of error in Chla estimates, due to issues in the shape and magnitude of R Δ rs spectra estimated. Regardless, while evaluating Chla retrievals against the satellite matchups is informative, the differences between the Type A and Type B sets are too large to draw any firm conclusions about the underlying models (see Figures 1 and Supplementary Appendix A). In Type B data, error is introduced by spatial differences, temporal differences, adjacency effects, variability in atmospheric conditions, and image artifacts, to name a few. Performance metrics will therefore reflect these errors as much as any error inherent to the models themselves.
In spite of these issues, Landsat-matchup trained MDN model MDN B is to some extent capable of improving the accuracy as compared to that of the original (MDN A ) model ( Figure 6). In essence, the model accounts for some uncertainties inherent to the R Δ rs products-though still exhibiting a fair amount of error due to the limited information contained in the four available R Δ rs bands. This error is reduced further for MDN anc B . By including ancillary data, this MDN model compensates for uncertainties from AC sources in SeaDAS retrievals of R Δ rs . With some of the uncertainties in AC addressed (e.g., imperfect accounting of sunsensor geometry; (Pahlevan et al., 2017b;Gilerson et al., 2018)), the model can, in general, make more accurate estimates. The comparable performance of MDN ρ s , anc B demonstrates viable retrievals in areas with both highly eutrophic and/or turbid waters (Bailey et al., 2010) and increased water-surface signal induced by residual and/or moderate sunglint . The recently published model in Cao et al. (2020) (BST) is also added as another benchmark, which poorly predicts Chla-likely because of the drastic differences in the distribution of training data, i.e., the median Chla in Cao et al. (2020)

Spatial Analysis
The different models examined in Section Methods are retrained using the full datasets, rather than splitting into training and/or testing sets. This allows for the model development to have the widest range of data available. Using the retrained models, two scenarios are demonstrated here.
As a first example, a natural color image of Lake Erie and the derived products during a harmful algal bloom event on Sept. 14th, 2015 was used (Figure 7). The black "x" markers indicate the positions of the three monitoring stations visited by the Great Lakes Environmental Research Laboratory (GLERL) within ±1 hr of Landsat-8 overpass. High concentrations of Chla in the southwestern section of the lake are evident in the natural color image generated from ρ s products. The elevated  backscatter evident in the natural color image of the northern and central Lake St. Claire discharging into Lake Erie is commonly attributed to suspended sediments and resuspension events (Bukata et al., 1988;Hawley and Lesht 1992;Czuba et al., 2011;Avouris and Ortiz, 2019). Table 3 contains the extracted Chla measurements and the Chla estimates from the MDN models. Although there is a slight temporal mismatch between Landsat-8 overpass and in situ measurements, there is a clear pattern in the results which quantitatively supports the accuracy improvement made by the MDN models. One point to note is the apparent underestimation of the MDN Type B models. This can be at least partially attributed to the AC frequently failing in highly eutrophic waters (Wang et al., 2019): since the MDN B models are only trained on samples (Figure 3 and Supplementary Appendix A) for which SeaDAS gives a valid result, there will necessarily be a bias toward lower concentration samples within the training data. This bias does not appear to be present in the MDN A model, as it would not have such a selection bias in its training set. Taking WE13 station as an example, we note that in spite of the reported concentrations >50 mg m −3 , OC3 estimates a maximum of around 10.1 mg m −3 , with the majority of examined areas below 11 mg m −3 ( Table 3). MDNB exhibits similar behavior, possibly due to the previously discussed selection bias in the training data. However, when ancillary features are added to the model inputs (as is the case with MDN anc B ), the estimated Chla concentrations become far more plausible. Furthermore, there is a notable consistency between the maps predicted from MDN anc B and MDN ρ s , anc B , implying ρ s has the potential to be used as a substitute for R Δ rs . The benefit of this substitution is demonstrated in the southern area of the image: for a large proportion of Sandusky Bay, R Δ rs through SeaDAS is unavailable and thus missing within the MDN anc B map. Our second example is comprised of a scene over the San Francisco Bay (SFB), imaged on April 27th, 2017 at 18:43 GMT (Figure 8), for which there were a few near-simultaneous in situ Chla measurements (Table 3) provided by SFB monthly cruises. In contrast to MDN B -derived maps, the map obtained from MDN A shows highly eutrophic areas (>20 mg·m −3 ) in the south bay region. Retrievals from MDN A were similar to in situ measurements (in the lower bay) except for the estimate at station 33-which is far greater than the measured concentrations ( Table 3). This might be caused by any number of factors, not least of which being those biases inherent to the AC process. An interesting feature in Figure 8 is the Chla field outside the bay in the Pacific coasts (or in San Pablo Bay) that has been merely predicted by MDN ρ s , anc B , suggesting the advantage of this model over other models limited by the failure in the AC (e.g., due to haze or over highly turbid/eutrophic waters).

Temporal Analysis
The Time-series of estimated Chla are compared with in situ measurements in Figure 9. In this case, the in situ data (N ∼ 30) were measured via calibrated autonomous fluorometers deployed near-surface in Grizzly Bay, the northern section of the SFB region ( Figure 8). The errors (Eq. 5) for the different models amounted to 54%, 41%, 120%, and 241% for MDN anc B , MDN in the Pacific coastal waters where SeaDAS did not return valid R Δ rs . In situ Chla measurements for the stations are provided in Table 3. The Grizzly Bay station whose time-series Chla data are shown in Figure 9 is highlighted. predictions resemble the in situ measurements without considerable noise, suggesting their potential applicability if adequate training data are supplied. This time-series analysis highlights the primary role of AC in achieving high-quality Chla retrievals and the need for improvement.

DISCUSSION
Based on these results, one infers that the MDN is a promising model in retrieving Chla from OLI, offering improvements in accuracy over other current models. Here, we further address why this model is a likely choice for retrievals and demonstrate its strength in suppressing noise in R rs compared to other ML models. This is followed by a discussion on the impacts of varying AC methods on the performance of MDN A and the implications of this research for studying and monitoring global waterbodies using Landsat-8 and other missions.

Model Validity
Neural network models have long been regarded as black-box models, with their complexity being a double-edge: providing more accurate solutions than have been previously available, at the cost of understanding the rationale in their estimations. This loss of explicability is of great concern for those involved in critical applications, due to the costs incurred in the event of failures. Without a source to identify as the cause-as is often the case in these models-the trust placed in the application is eroded. Recent research has led to a number of methods which allow for better model transparency, however. For instance, with many models it can be helpful to visualize the effect a given input feature has on the output of the model; in this case the effect of a certain band on the Chla estimation.
One such method to do this is called an Accumulated Local Effects (ALE) plot (Apley and Zhu 2016). The interested reader may also examine the literature for the related Partial Dependence (PD) plot-though these have the disadvantage of assuming independence between input features, and so are not the best choice in this case. ALE plots, on the other hand, calculate the effect of a feature conditional upon the other input features. Another way to explain this is, they examine the average change in prediction over a window around an input feature's values, only conditioning upon other features in areas for which values exist in the data set. Figure 10 shows the ALE plots for the OLI bands, generated via the Type A data set. Note that the y-axis values correspond to the accumulated local effect, which can be thought of as "change in estimated Chla." These plots indicate that 561 nm, when observed with a large magnitude, has the greatest (positive) effect on chlorophyll-a estimation. Not surprisingly, 482 nm also appears to significantly impact chlorophyll estimates with an inverse relationship: low magnitude reflectances indicating a higher than average chlorophyll-a value, and high magnitude reflectances indicating a lower than average chlorophyll-s value. On the other hand, since the 655-nm band does not fully cover Chla absorption at 676 nm, it appears to contain limited spectral information pertaining to Chla (see Helder et al. (2018)).

Impacts of Atmospheric Correction
Although the primary processing scheme used here was SeaDAS, here, we underscore the importance and challenges associated with the AC. To that end, C2X, ACOLITE, and POLYMER, in addition to SeaDAS, were implemented to a sample OLI scene over Lake Peipsi, June 14th, 2016, followed by applying the MDN A model to all the derived R rs products. Figure 11 shows the corresponding Chla map products. The inconsistency in the relative distribution and the overall magnitude of products is largely noticeable. For example, C2X appears to predict a large bloom in the center of the lake whereas other schemes provide values closer to the lake-wide average estimate. Moreover, SeaDAS and ACOLITE tend to estimate relatively high Chla in the southern and eastern basins while POLYMER retrieves only slightly higher-than-average estimates. Same-day in situ Chla measurements and the estimated Chla from MDN A and OC3 from the four processors are included in Table 4 ( Alikas et al., 2015). Despite that the in situ dataset does not represent the entirety of the ecosystem, it allows to better comprehend the complexity induced through the AC process and how confusing the output products may be. Given the statistics in Table 4, there is no single processor that distinctly outperforms the rest for this instance of OLI image and/or lake. It is worth noting that while SeaDAS, ACOLITE, and POLYMER statistically yield better Chla estimates via MDN A , retrieved Chla values from C2X through OC3 resembles in situ samples more closely. This observation and the discrepancies in the performances further corroborates the need for an improved AC method for the OLI data processing to achieve the theoretical limit shown in Figure 4.

Implications for Aquatic Studies
Landsat-8 data, when combined with the data from Sentinel-2 and -3, are expected to allow for near-daily global observations of  inland and coastal waters. Irrespective of differences in their observation modalities, creating consistent Chla products is key for successful assessment and monitoring of these ecosystems. Considering the missing RE measurements in the OLI suite of observations however, retrieving Chla as accurately as that with MSI and OLCI appears challenging. Although the number of matchups assessed is different, comparing to our previous results , it can be inferred that R rs , or their equivalent R Δ rs and ρ s , within the RE region contain significant information related to Chla in eutrophic waters. These channels also help constrain the solution space in less eutrophic waters with higher turbidity, which may be mislabeled as having high Chla with OLI. That said, the addition of spectral information within the 865 nm band may prove valuable under such circumstances , which has been the case for our MDN ρ s ,anc B model. In addition to SeaDAS, we also implemented and tested ACOLITE and POLYMER to assess the performance of MDN A models. Our analyses showed that these alternative models yield Chla as inaccurate as that from SeaDAS ( Figure 5). The performance of MDN B and MDN B anc using R Δ rs derived from ACOLITE and POLYMER showed consistent performances however, similar to those illustrated in Figure 6. Therefore, it is surmised that until major improvements in the state-of-the-art AC methods are achieved, an alternative approach to obtaining improved Chla is through MDN B models supplied with R Δ rs , ρ s , or a combination of both. Assuming adequate matchups spanning a wide array of trophic states and aerosol conditions are incorporated in training (beyond what was used in our Type B dataset; Supplementary Apendix A), such a model should provide global retrievals nearly as robust as those obtained by MSI and OLCI. This is likely the path forward, given the extent to which AC degrades the performance of MDN A (rendering it virtually equivalent to OC3). This approach is, in particular, applicable to regional monitoring sites (e.g., western Lake Erie, Lake Taihu) where ample high-quality, historic in situ datasets are available for model training. In other words, a successful compilation of high-quality discrete samples of Chla at global scales is a challenging task and may take several years to achieve.
Spatial and temporal mismatches inherent to satellite matchups introduce further uncertainties in our assessments. Of concern is, in particular, our same-day criteria. We made an attempt to diminish the impact of this noise source by supplying Δt to the MDN B models, in the hopes that the model could adjust for the importance of temporally distant samples during the learning process. Overall, choosing an optimal threshold for the temporal filtering is a trade-off between the accuracy of the model, and the generalization capability conferred by a wider range of samples and environments.
The inclusion of ancillary data, such as the solar angles, sensor viewing angle, wind data, water vapor, and others, enhanced the model performance noticeably when added to model input features, regardless of AC processor. The improvements stem from how SeaDAS utilizes the ancillary information itself (Mobley et al., 2016) while some of the parameters are not often used, e.g., wind speed, wind angle. In some cases, the algorithm may apply simplifications, for example to reduce computational burden, that may preclude a rigorous integration of ancillary data in the process. In particular, we found that our model is very sensitive to sensor azimuth angles, which change sign for the two adjacent focal plane modules (Markham et al., 2014). Our spatial analysis suggested that including these angles yield alternate low-high Chla for the odd and even focal plane modules (Pahlevan et al., 2017b); hence, we decided to discard this information, which led to more spatially uniform maps. Further, it is worth pointing out that most ancillary variables (e.g., water vapor concentration) are coarse-resolution features with little to no per-pixel variability. Therefore, any fine structures or patterns seen in the estimates can only be influenced by the spectral information itself.

Future Work
This work introduces a great number of potential directions for exploration. From the perspective of the aquatic remote sensing field as a whole, it is yet to be determined how the MDN model fares when applied to other missions. In the future, the performance evaluation is expected to be carried out for other ocean color missions, such as MODIS, which do not measure in the RE region but provide relevant spectral content in the vicinity of 750-nm region. Similarly, the MDN developed for OLI's visible bands might be further extended by including the panchromatic band to further constrain the solution space (Castagna et al., 2020). As the atmospheric correction process has been shown to introduce significant errors in downstream products of such missions, and ρ s being a feasible substitute to bypass portions of this process, the question becomes whether it is possible to bypass AC completely and allow for direct retrieval of the relevant biogeochemical properties. Alternatively: whether there are certain AC-specific parameters which might be tuned, in order to provide a more amenable input for learning the product inverse function.
Other directions include those focused more on ML, and the MDN itself. For instance, the MDN model also has the capability to simultaneously estimate multiple products; to what extent do the inclusion of additional variables in the model output (e.g., TSS) affect performance? Intuitively these additions should serve only to improve accuracy overall, given the additional information of target covariances-but which products might be estimated synergistically is yet to be explored.
More theoretically, we might ask if there are non-Gaussian distributions (e.g., Laplace, which may better represent the data); or, whether the learned mixture components might relate to the physical environments of the samples assigned. Further exploration is required in regard to the model hyperparameters, and the mixture components especially. There are very likely advancements in the field of machine learning (e.g., activation functions, batch normalization procedures, convolutional/temporal architectures, etc.) which could also be applied to enhance retrievals-though potentially requiring alternative data formulations, such as incorporating spatial or temporal information.

CONCLUSION
In this work, we have gathered a global dataset of both R rs -Chla and R Δ rs -Chla matchups, compiled from a variety of different sources. These two datasets were used to train several machine learning (ML) models, and in particular, used to train the Mixture Density Network (MDN)-an algorithm which we introduce as a theoretically plausible model for biogeochemical variable retrieval via remotely sensed radiometric data. These ML algorithms were benchmarked against each other, in order to provide empirical justification alongside the theory of MDN superiority on inverse problems; as well as against OC3 to demonstrate accuracy on the described task. Furthermore, we showed that instead of using R Δ rs spectrum as input, it is feasible to instead use ρ s to achieve similar performance in Chla estimation. The benefits of this were briefly touched upon, where ρ s -trained models were shown to seamlessly retrieve Chla where previously unavailable due to AC failure.
The MDN algorithm represents a promising step toward the goal of global simultaneous biophysical and biogeochemical variable retrieval, in the context of aquatic remote sensing. While results are promising, much work is left to be done in both data acquisition and model validation. To truly design a global-scale model, capable of approximating an inverse solution to the radiative transfer equations, significantly more data is required. Simultaneously retrieving all parameters of interest to the community requires the potential dataset to have the necessary information to learn relevant covariances in all atmospheric conditions. Just as important, the various sources of uncertainty and misalignment must also be minimized in order for the model to accurately learn these relationships.
We conclude with broad discussions of other justifications and benefits, analyses on the hyperparameters, implications of the model within the broader community, and potential directions for further experimentation. These discussions are far from exhaustive, but we hope they will provide the seed for future advancements in remote sensing.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because data ownership belongs to partner organizations. The data will be published in the future upon agreement with data providers. Requests to access the datasets should be directed to Nima.pahlevan@nasa.gov. All the developed codes are available through https://github.com/STREAM-RS/STREAM-RS.

FUNDING
We acknowledge the European Union's Horizon 2020 research and innovation program (grant agreement No. 730066,EOMORES) to support in situ data collection in Estonian inland waters. Nima Pahlevan is funded under NASA ROSES contract # 80HQTR19C0015, Remote Sensing of Water Quality element, and the USGS Landsat Science Team Award # 140G0118C0011.