Validation of Sentinel-2 (MSI) and Sentinel-3 (OLCI) Water Quality Products in Turbid Estuaries Using Fixed Monitoring Stations

It is common in estuarine waters to place fixed monitoring stations, with the advantages of easy maintenance and continuous measurements. These two features make fixed monitoring stations indispensable for understanding the optical complexity of estuarine waters and enable an improved quantification of uncertainties in satellite-derived water quality variables. However, comparing the point-scale measurements of stationary monitoring systems to time-snapshots of satellite pixels suffers from additional uncertainties related to temporal/spatial discrepancies. This research presents a method for validating satellite-derived water quality variables with the continuous measurements of a fixed monitoring station in the Ems Dollard estuary on the Dutch-German borders. The method has two steps; first, similar in-situ measurements are grouped. Second, satellite observations are upscaled to match these point measurements in time and spatial scales. The upscaling approach was based on harmonizing the probability distribution functions of satellite observations and in-situ measurements using the first and second moments. The fixed station provided a continuous record of data on suspended particulate matter (SPM) and chlorophyll-a (Chl-a) concentrations at 1 min intervals for 1 year (2016–2017). Satellite observations were provided by Sentinel-2 (MultiSpectral Instrument, S2-MSI) and Sentinel-3 (Ocean and Land Color Instrument, S3-OLCI) sensors for the same location and time of in-situ measurements. Compared to traditional validation procedures, the proposed method has improved the overall fit and produced valuable information on the ranges of goodness-of-fit measures (slope, intercept, correlation coefficient, and normalized root-mean-square deviation). The correlation coefficient between measured and derived SPM concentrations has improved from 0.16 to 0.52 for S2-MSI and 0.14 to 0.84 for S3-OLCI. For the Chl-a matchup, the improvement was from 0.26 to 0.82 and from 0.14 to 0.63 for S2-MSI and S3-OLCI, respectively. The uncertainty in the derived SPM and Chl-a concentrations was reduced by 30 and 23% for S2-SMI and by 28 and 16% for S3-OLCI. The high correlation and reduced uncertainty signify that the matchup pairs are observing the same fluctuations in the measured variable. These new goodness-of-fit measures correspond to the results of the performed sensitivity analysis, previous literature, and reflect the inherent accuracy of the applied derivation model.


INTRODUCTION
With the planned launch of the Pre-Aerosol Clouds and ocean Ecosystem (PACE) in 2022, and the data streaming from the Sentinel missions, Earth Observation (EO) technology is moving toward a long-term sustainable data flow, forming the basis for its operational use (Groom et al., 2019). Setting up satellite-based services for water quality monitoring in turbid estuaries provides an instrument for early warning and mitigation on extended spatiotemporal scales (Anderson et al., 2017). However, the added value of satellite products in estuarine waters is determined, primarily, by their accuracy, which needs to be validated against reference data. In this regard, continuous data measured from a fixed monitoring station are indispensable for understanding estuarine systems (Benway et al., 2019) and quantifying the uncertainties of satellite products of water quality. Validation is the action of checking or proving the validity or accuracy of satellite products. It measures the correctness and meaningfulness of derived values against measured quantities (e.g., Salama et al., 2012). In this respect, validation outputs are metrics measuring the resemblance between satellite-derived products and measured values and a set of logical values addressing whether satellite products are acceptable (Seegers et al., 2018;Werdell et al., 2018). The standard procedure of comparing point-scale measurements to time-snapshots of satellite pixels suffers from additional uncertainties related to the temporal and spatial mismatches (Evers-King et al., 2017). For example, Salama and Su (2011) have shown that the spatial mismatch between 300 and 1000 m pixel resolutions could be up to 0.02 sr −1 for ocean color remote sensing reflectance at 665 nm. From this, atmospheric correction accounts for up to 40% of the total uncertainty, which is partly due to the unavailability of information on atmospheric variables, e.g., aerosols optical thickness (Frouin et al., 2019), at the satellite measured scales. Having fine spatial resolution could provide extra details (Vanhellemont and Ruddick, 2014). However, there is a trade-off between gaining information from reducing the spatial scale and adding additional noise-component resulting from small-scale turbulences of water constituents (Bissett et al., 2004). For example, Groetsch et al. (2014Groetsch et al. ( , 2016 have shown a mismatch between fluorescence measurements from ship-ofopportunity, taken at 5 m below the water surface, and satellite-derived concentrations of chlorophyll-a (Chl-a). The authors attributed this mismatch to the different sample sizes of a satellite pixel vs. a point in-situ measurement and the depthintegrated satellite observation vs. in-situ measurement taken at a specific depth.
The spatial and temporal dynamics of water quality variables are particularly high in estuarine waters (Hommersom et al., 2009;Nechad et al., 2015). The spatial variability is usually addressed through assessing the homogeneity of satellite pixels of the matchup location (e.g., Harding Jr et al., 2005) or using Bayesian inference (e.g., Salama and Su, 2010). Whereas the temporal variability of in-situ measurements is usually addressed by averaging data points within a temporal window of 1 to 4 h of the satellite overpass (IOCCG, 2019). Current guidelines (e.g., Sentinel three Validation Team S3VT) minimize the spatiotemporal variability by sampling homogeneous water within ±1 h of the satellite overpass and under good atmospheric conditions. Applying these approaches to address the spatiotemporal mismatch could, nonetheless, result in variable performances and lead to little improvement in validation statistics (Barnes et al., 2019). This calls for a validation scheme that provides an unbiased assessment of uncertainty at several scales and ensures the consistency of uncertainty estimates of satellite-derived water quality variables. Recently, McKinna et al. (2021) have shown that incorporating model and observation uncertainties have improved the validation metrics. This paper incorporates the variabilities of field measurements and satellite observations to estimate the underlying accuracy of satellite-derived concentrations of suspended particulate matter (SPM) and alga's green pigment, chlorophyll-a (Chl-a) in estuarine waters. To achieve this objective, we use semicontinuous measurements from a fixed location and observations from two satellite sensors: the Multispectral Instrument on Sentinel 2 (S2-MSI) and the Ocean Land Color Instrument on Sentinel 3 (S3-OLCI).
The paper is organized as follows: first, the employed method to answer the research question is presented in Section 2, followed by a description of the study area in Section 3. Used data and preprocessing are then detailed in Section 4, followed by results (Section 5) and discussion (Section 6). The paper is finalized with a summary of findings in the conclusion section.

Validation
Following the ISO standard (Åhlin et al., 2014) and its implementation in the Committee On Earth Observation Satellite, validation is the process of assessing, by independent means, the quality of the data products derived from the system outputs. For such validation to be meaningful, it should be applied to matchups-which is defined (IOCCG, 2019) as pairs of measured (say 10) and satellite-derived (say y) values that have the same locations and times (±1 h) of sampling. The validation protocol (IOCCG, 2019) suggests taking several satellite pixels surrounding the location of the monitoring station. While this could be performed in open waters, it will be a challenge in fine-scale estuarine systems where proximity to land is inevitable. The approach adopted in this paper to construct the matchup is to exclude all pixels that partially cover the ground and select the nearest water pixel to the location.
Validation is carried out on the matchup dataset using type-II linear regression (Laws, 1997;IOCCG, 2006), whereby a regression line-passing through the centroid of the data μ x , μ y -is estimated through adjusting its slope α II and intercept β II to minimize the sum of squares of normal deviates-lines perpendicular to the regression line. Approaches other than minimizing the normal deviates exist, e.g., calculate the geometric mean of the slope from regression of y on x and x on y but they are similar. The slope measures the gain of y, i.e., how close/far the regression line is from unity. Whereas the intercept measures the offset of y, i.e., how close/far the regression line is from passing through the origin.
In addition to the coefficients resulting from type-II regression, validation involves calculating the correlation coefficient r and the root-mean-square deviation, Δ, or the mean absolute differences (Seegers et al., 2018). The correlation coefficient, r, is a measure of linearity, whereas Δ provides the total mean uncertainty of satellite-derived products, w.r.t measured values (hereafter denoted as Δ tot ): where n is the total number of matchups. Normally water authorities or engineering applications require a percentage of uncertainty to be presented from the validation of satellite products. This percentage measure is quantitative and provides an immediate feeling of accuracy. For example, most satellite missions require percentage accuracy in the derived water quality variables. The Δ could be normalized, for x ≠ 0, as (e.g., Hieronymi et al., 2017), ψ x−y x . However, this calculation of percentage accuracy is not symmetric resulting in higher values for y < x and lower values for y > x. For example: To avoid such a confusion, in this paper we calculate the percentage error by dividing Δ by the mean μ x :

In-Situ Measurements vs. Satellite Observations
Deriving water quality variables from satellite observations requires two main steps (Gordon and Morel, 1983;Frouin et al., 2019): atmospheric correction to retrieve the water leaving signal, and deriving water quality variables from the water leaving signal. Eventually, both steps could be combined and simultaneously carried out (e.g., Arabi et al., 2016Arabi et al., , 2018. Water quality variables that can be derived from optical observations are those with the property of changing the visible sunlight through absorption and/or scattering (Mobley et al., 1993), namely phytoplankton pigment (here we focus on the green pigment chlorophyll-a, Chl-a), suspended particulate matter (SPM) and colored dissolved organic matter (CDOM) (Kirk, 1994). This study will focus on the first two variables, namely SPM and Chl-a concentrations. When satellite-derived SPM and Chl-a are compared to in-situ measurements, some discrepancies arise related to differences in sampled times, depths, and water volumes. Much research (see, for example, Salama and Su, 2011;Groetsch et al., 2014Groetsch et al., , 2016Valdés and Lomas, 2017) has shown that due to these differences, in-situ devices record small turbulences (fluctuations on a short time scale) that are undetected or averaged by the satellite's pixel observations. This phenomenon was referred to by Oke and Sakov (2008) as the representation error. This mismatch in representation will supplement additional components to the retrieval's uncertainty, Δ tot , namely uncertainties related to temporal and spatial mismatches (IOCCG, 2019).
In an analogy to the Taylor approximation of the second moments and using the expression of Salama and Stein (2009), Su (2011), and, the total uncertainty could be approximated as the sum of three components: where the subscripts stand for temporal (t), spatial (s), and derivation (d) uncertainties. Eq. 3 separates the total uncertainty into three contributors: derivation, spatial, and temporal mismatch. The derivation uncertainty Δ 2 d represents the uncertainty resulting from all processing steps required for deriving water quality variables from satellite observations, i.e., sensor calibration, atmospheric correction, and model inversion. This definition coincides with the Guide to the Expression of Uncertainty Measurement (GUM, 2008). The other components, Δ 2 s and Δ 2 t will add up to Δ 2 d only when comparing a satellite pixel to a point in-situ measurement.

Representative In-Situ Measurements
Depending on the sampling frequency of the in-situ device, n measurements will be recorded within the 2 h of the satellite overpass. Validation protocol (IOCCG, 2019) commonly advises to average in-situ measurements that fall within a temporal window of 1 to 4 h of the satellite overpass. However, due to the representation error discussed in the previous section, this time window (±1 to ±4 h) depends on water quality dynamics with spatial and temporal dependencies. To reduce representation error resulting from temporal averaging and ensure that these n measurements represent the same dynamics, we group the data based on their variability. This is done by selecting subsets that fall within a range of variabilities and recorded in a sequential (uninterrupted) timespan. This additional grouping produces sets of representative measurements (hereafter called "optimal in-situ matchup") that fall inside satellite pixels and form the primary basis for assessing the accuracy of satellite derived water quality products, and is performed as follows: (i) Analyze the dis/similarities between in-situ measurements; (ii) Determine data points that have inter-variance that falls within a range of variability (will be defined later in this section); Frontiers in Remote Sensing | www.frontiersin.org February 2022 | Volume 2 | Article 808287 (iii) Identify the subset from [ii] that has a sequential recording time.
The (dis-)similarities between consecutive records of in-situ measurements can be quantified by the empirical semivariogram (Bowman and Crujeiras, 2013). For the 2-h span, the empirical semivariogram can be estimated from the data points as: where γ(x, dt) is the temporal semivariogram of x with a time lag dt and n (dt) is the number of bins (data pairs) of width dt. Similar measurements have sequential recording time and fall within the range of data variability, say γ s : The value of γ s Eq. 5 is empirically estimated in this work by employing the principle of insufficient reason (Jaynes, 1957;Salama and Shen, 2010), which states that when there is not enough evidence to separate the uncertainties in Eq. 3, we can assign uniform weights, i.e., γ s ≈ 1/3 Δ 2 tot , so we have: The total uncertainty Δ 2 tot can be written as (see Eq. 3 in Taylor, 2001): With r x,y being the correlation coefficient between x and y. For standardized anomalies the condition Eq. 6 becomes: The application of Eq. 8 requires iteration to estimate the correct value of r x,y . The iteration procedure is detailed in Section 5.2.
The semivariogram Eq. 4 and the similarity condition Eq. 8 can be applied on each data point as well (instead of one data point of the satellite overpass) producing thereby m records of similar measurements, with m being less than the number of data points within ±1 h of satellite overpass (i.e., 120). These m sets of similar measurements provide an ensemble of information to carry out the validation.

Unbiased Validation
The component Δ d in Eq. 3 represents the underlying accuracy of satellite-derived water quality and should therefore be unbiased. A technique to estimate Δ d without the need to calculate the individual components of Eq. 3 is described hereafter and starts with the assumption that variables x and y are normally distributed and often log-normal (Campbell, 1995).
The matchup pairs of measurements (x) and satellitederivations (y) should in principle be linearly related in the form ofŷ α I x + β I . Whereŷ is the predictor of y estimated using type-I regression, and α I and β I are type-I regression coefficients. From this linear relationship it follows: μŷ α I μ x + β I and σŷ α I σ x . Where μ and σ are the mean and standard deviation of the variable in the subscripts. The standardized anomaly of predicted valuesŷ is expressed as (ŷ − μŷ)/σŷ, which can be written in terms of the variable x as: Eq. 9 means that if satellite-derived products are linearly related to measured variables, a requirement to accept derived products, their anomalies should be equal. This finding from Eq. 9 has great utilities, as it facilitates incorporating the ratio σ x /σ y in the validation to better estimate the derivation uncertainty, Δ d . The main assumption here is that, adjusting satellite-derived values, y, such that they have the same mean and standard deviation of measured values will yield new satellite-derived values, y, that are upscaled to match the point measurements of the fixed monitoring station.
We start first by approximating Eq. 9 as: Second, the values of x are substituted with new updated satellite-derived values y such that: Rearranging the sides of Eq. 11 the new updated y can be obtained as (Draper et al., 2009;van der Velde et al., 2019): From Eq. 12, it can be shown easily that μ y μ x and σ y σ x , whereas the correlation coefficient, r x, y , between x and y is equal to that between x and y, i.e., r x, y r x,y . Using the result of Eq. 12, the uncertainty of satellite-derivation that is free of temporal and spatial mismatch, is found to be: Eq. 13 demonstrates that for values of the correlation coefficient r ≥ 0.5 the Δ will always be equal or less than σ x and the uncertainty cannot be larger (r → − 1) than 2σ x . There is one drawback for Eq. 13, namely when r → 1 the value of uncertainty Δ d will vanish. For the presented data, this condition was unlikely with the value of r x,y not exceeding 0.6.
Eq. 13 provides the minimum value of uncertainty if σ y > σ x , the proof is provided in the Appendix section Proof I. In addition, the slope and intercept (α II , β II ) of the type-II regression between x and the updated values y are unity and zero, respectively. The proof to that is provided in the Appendix section Proof II. This leaves Δ to be the only estimator of uncertainties. Due to the equal mean between x and y, the bias will also be zero, i.e., unbiased.

STUDY AREA
The method presented in Section 2 is applied to matchups in the Ems-Dollard estuary, a semi-enclosed body of water stretching from the Ems River's Mouth to the Island of Borkum in the Wadden Sea along the Dutch-German border ( Figure 1). The estuary covers an area of 1071 km 2 , half of which consists of intertidal flats (Compton et al., 2017). Concerning the Amsterdam Reference Level (Normaal Amsterdams Peil, NAP), the tidal channel has a depth ranging from ≈ − 1.5 m to ≈ − 25 m.
Ensemble of physical processes, dominated by large tide gradients and freshwater influx, control the dynamics in the Ems's estuary. The Ems Dollard is a mesotidal estuary with a semi-diurnal tidal cycle that ranges between 2 m in the channel (inner Ems) to 3 m near the mouth of the Ems River (de Jonge, 1983). The primary source of freshwater inflow is the Ems River, with a mean discharge of 100-120 m 3 s −1 , with a high dynamic range from 25 to 390 m 3 s −1 (Ysebaert et al., 1998). Recently, Schulz et al. (2020) showed that the system alternates between a destratification state during the flood and a built-up of stratification throughout the ebb. The authors also observed that the lateral circulation experiences abrupt transitions during a tidal cycle with large spatial variabilities at finescale ( < 200 m). Within a tidal cycle, the concentration of suspended particulate matter (SPM) varies between a few g.m −3 up to 1000 g m −3 with up to five peaks (Schulz et al., 2020). In addition to these natural processes, human interventions (e.g., deepening of tidal channels) have altered the dynamics of physical processes, increasing thereby the concentration of SPM in the estuary (van Maren et al., 2015a(van Maren et al., , 2016. Due to dredging activities in the last 50 years, SPM concentration has increased in the estuary by two-to threefold with the turbidity maximum moving landward extending into the freshwater tidal river (de Jonge et al., 2014). The Ems estuarine system can, therefore, be considered hyper-turbid, with SPM concentrations reaching up to 200 kg m 3 , particularly in the maximum turbidity zone (Papenmeier et al., 2013). On the other hand, Chl-a concentrations could reach high values ≈ 50 mg.m −3 during the spring and summer blooms (Colijn, 1982), with a peak above 150 mg m −3 , in the tidal channel (Staats et al., 2001).
The high turbidity and the fine-scale variability in the Ems Dollard estuary poses a challenge for in-situ measurements and validation of satellite products. In this respect, measurements from fixed stations require careful selection to be representative of the matching satellite pixels. Details on this selection procedure are provided in Section 4.

In-situ Data
For this research, we have collected in-situ and satellite data forming two sets of matchups: radiometric and water quality derivations. The in-situ data were collected from two permanent monitoring stations located at Texel Island and Eemshaven, both in the Dutch part of the Wadden Sea, about 150 km apart. The Texel station provided radiometric data for atmospheric correction, whereas the Eemshaven station provided measurements on water quality variables serving the purpose of validation. Radiometric data, from proximity (i.e., close-range) remote sensing, were obtained from the permanent station at Texel Island in the Wadden Sea, the red star on Figure 1A. The Texel station has three RAMSES TriOS sensors measuring upwelling radiance and sky-radiance and downwelling irradiance covering the visible range between 318 and 950 nm at ≈ 3.3 nm spectral interval. Radiometric data and their preprocessing chain of converting to reflectance and quality check are described with great detail in Arabi et al. (2018Arabi et al. ( , 2016. Proximity remote sensing data are used in this study to identify the atmospheric correction procedure most suited for the Wadden Sea. Three years of Texel data (from 2015 to 2017) that are matching satellite observations (±1 h) were averaged on the 2h period (eight measurements) and used to establish the radiometric matchup set. Radiometric data from proximity remote sensing were transformed to the bands of the matching space-sensor using its spectral response function.

Water Quality Data
The Department of Public Works and Water Management (Rijkswaterstaat, RWS), have a permanent station in the Eemshaven, the red triangle in Figure 1B. This station contains a YSI 6600 V2 sensor at -3.5 m deep (NAP reference), measuring turbidity (in NTU unit) and chlorophyll-a concentration each minute of the day. Fixed monitoring stations of the RWS follow the guidelines of the OSPAR's Joint Assessment and Monitoring Programme (JAMP). RWS carries out periodical quality checks with fluorometric measurements and yearly validation with HPLC. However, as the stratification occurs primarily during the built-up of the ebb, and the maximum turbidity is near the bottom, YSI6600 measurements during the flood cycle were selected (Schulz et al., 2020), see Section 4.3 for more details.
The conversion from NTU to g.m −3 is performed using two gain factors: 1.7 for silt (SPM particle size < 63 μ m) and 2.2 for total suspended matter. These gain factors are estimated by comparing the NTU readings to the water contents of SPM (performed by Rijkswaterstaat, Monitoring Waterstaatkundige Toestand des Lands, MWTL). This data set is used here to validate satellite-derived water quality variables. One year of the Eemshaven data (2017), matching (±1 h) satellite derivations, were used without time averaging to establish the validation matchup set.

Satellite Data
Satellite observations were obtained, for both stations (Texel and Eemshaven), from Sentinel-3 Ocean and Land Color Instrument (S3-OLCI) and Sentinel-2 Multi-Spectral Instrument (S2-MSI). S3-OLCI and S2-MSI images of the Ems Dollard Estuary were collected from the Copernicus Open Access Hub (Copernicus, 2019). Three years (2015-2017) of data were used to establish the radiometric matchup set for the S2-MSI. Whereas 1 year (2017) of data was used to establish the radiometric matchup set for the S3-OLCI. For both sensors 1 year, 2017, of data was used to establish the validation matchup set. Satellite pixels covering the water nearby the Texel and Eemshaven sites were then selected based on three criteria (flags): (i) Flagged as good observations, (ii) Flagged as cloud-free, and (iii) Flagged as water targets.
Ancillary atmospheric data of ozone total column, wind speed, surface pressure, and air temperature were obtained from the global reanalysis data ERA-Interim (ECMWF, 2017). ERA-Interim data are updated monthly with a delay of 2 months. Therefore, the closest data in time is used to perform the atmospheric correction.
The processing chain of satellite data followed these two steps: (1) Different atmospheric correction methods were intercompared using a radiometric matchup set to select the procedure with the least error. For S2-MSI we applied Acolite Ruddick, 2014, 2015), Polymer (Steinmetz et al., 2011), Case-2 Regional/Coast Color (C2RCC), and the extreme Case-2 waters (C2X) (Brockmann et al., 2016). For S3-OLCI we applied C2RCC and an alternative version of it (C2RCC-A, C. Lebreton personal communication). The Sentinel's toolbox (SNAP) was used to process satellite data.
(2) The Nechad-705 model (Nechad et al., 2010) was employed to retrieve SPM concentrations, whereas for Chl-a the Gons model (Gons, 1999) was employed. These models were developed for regions with similar characteristics as the Ems Dollard Estuary and were employed by this study without calibration.
The standard flags of the different atmospheric correction approaches were used without adjustment, which has caused lowering the number of radiometric matchups from 14 to 7, 9, 10 and four spectra for, respectively, Acolite, C2X, C2RCC, and Polymer.

Matchups Data
The radiometric matchup set was used to select the atmospheric correction procedure with the least error and consisted of 14 and 11 pairs of in-situ measured vs. S2-MSI and S3-OLCI reflectance, respectively. On the other hand, the validation matchup set was used to validate satellite derivations of water quality variables as described in Section 2. Each pair in the validation matchup set consisted of satellite-pixels and the matching in-situ record. To guarantee that the in-situ measurements at 3.5 m depth near the Eemshaven are representative to the surface-derived satellite concentrations of SPM and Chl-a, we filtered data using two conditions based on the recent findings of Schulz et al. (2020): (i) Select the nearest water pixel to the Eemshaven station to minimize the cross-sectional fine-scale spatial variability.
(ii) Select measurements during the flood cycle, as the water column is well mixed. Water level data recorded by RWS at the Eemshaven station were used to filter the data.
A summary of the validation matchup set is presented in Table 1.

Verification of Atmospheric Correction
The results from four atmospheric correction methods applied to S2-MSI (Figure 2) show that the C2X approach rendered spectra that fall within the range of measured values. Only C2X produced spectra that are within a similar range to the measured values.
C2RCC, on the other hand, resulted in S2-MSI spectra with magnitudes close to those of C×2, but with a different range and variability between 665-705 nm (i.e., absorption effect of Chl-a). The least fit was obtained from Acolite and Polymer with a noticeable higher magnitude for Polymer. More quantitative goodness-of-fit measures are presented in Table 2. Slope, intercept, and R 2 were estimated from type-II regression, the Δ was normalized to the mean (denoted as Δ ), representing a relative measure of accuracy.
It is noticeable from Table 2 that Acolite and Polymer produced the least accurate results with Δ values ranging from 50% in the blue band to 80% in the red band, with larger error (Δ 99%) in the green band. This is also apparent from Figure 2, i.e., the shapes of Polymer spectra follow the shapes of measured spectra but with a large gain and offset. From a magnitude perspective, C2RCC and C2X provide the best accuracy, with C2X providing better accuracy in the red bands than C2RCC. Because the used water quality models (Nechad et al., 2010;Gons, 1999) employ the red bands in retrieving SPM and Chl-a, the C2X was selected as the atmospheric correction of choice.
For the application to S3-OLCI, only the C2RCC and C2RCCalternative were available. The results (Figure 3) show that C2RCC-alternative has a slightly better fit than its older version. More quantitative goodness-of-fit measures are presented in Table 3. Both approaches for atmospheric correction did not produce a good fit to measured spectra with very low values of correlation coefficient and large Δ .
Nonetheless, the average spectrum resulting from both atmospheric correction methods resemble the averaged spectrum obtained from the field with the alternative version being slightly better than C2RCC. The analysis of atmospheric correction results is limited in this study to select the approach that yields the least error in the satellite water leaving spectra. The above-presented results show that C2X and C2RCC-A were used to atmospherically correct S2-MSI and S3-OLCI, respectively.    than the matchups of SPM ( Figure 4A,C). The semivariograms of in-situ measurements are computed from the standardized anomalies and shown in Figure 5.

Validation of Satellite-Derived Water Quality Variables
To further analyze the data, we applied the extended triple collocation technique (Stoffelen, 1998;McColl et al., 2014) on the standardized anomalies of these data (see Table 4).  The results in Table 4 show that the uncertainty of in-situ measurement within the ± hour of satellite overpass is comparable to the uncertainties in satellite-derived water quality products. This observation is also confirmed when analyzing the semivariograms. The semivariogram shows nonstationary and cyclic processes, particularly for Chl-a measurements, with a cycle of about 10 min. As the sensors of RWS are well maintained and periodically calibrated, this time variability could be explained by the inertial subrange (Kolmogorov et al., 1991), where phytoplankton is affected by fluctuations at a small-scale ranging from seconds to minutes. Blauw et al. (2018) measured Chl-a at 12-and 30-min intervals and found substantial hourly variations, and explained it by vertical mixing and horizontal transport of different water masses. Franks (2005) showed high patchiness (in Figure 4) of one snapshot in time of phytoplankton fluorescence taken with an imaging fluorometer. However, further research is needed to investigate this small temporal variability of Chl-a in the study area. On the other hand, SPM also shows a cyclic pattern but on a larger temporal scale of a few hours, roughly corresponding to the tidal cycle of the Ems Dollard (Blauw et al., 2012). The analysis here will be, however, focused on using the semivariograms to identify the optimal in-situ matchups. For each in-situ matchup of 120 data points, sets of similar measurements are identified by first computing the semivariograms and second applying the similarity condition of Eq. 8 using the following iterative procedure: (i) Starting at the first data point (at t = − 60), compute the semivariogram using Eq. 4; (ii) Initialize r x,y (either −0.995 or use the computed correlation) and compute the condition in Eq. 8; (iii) Subset all data points that satisfy Eq. 8 and compute their median or mean. (iv) Perform steps i to iii on all in-situ matchup and compute r x,y ; (v) Go to Step ii and replace r x,y and iterate until its value does not change significantly. (vi) Exclude the data points that satisfy the similarity condition (say m) and move the cursor to the next data point m + 1; (vii) Repeat Steps i to vi starting from m + 1.
Steps ii to v converge very rapidly with a maximum of eight iterations regardless of the starting value. Only sequential measurements were grouped in this similarity check. In other words, the first m s sequential data points that stratified the similarity condition were grouped. Another condition was required to guarantee that the data recorded is 5 min (i.e., five data points) or more, namely m s ≥ 5. The whole procedure, Steps i to vii, produces data records without overlap, i.e., independent, each representing a similar set of measurements. Therefore, each in-situ matchup has a different number of records. Data records with a correlation coefficient r < 0.15 were excluded from the calculation. Finally, the unbiased validation procedure of Section 2.4 is applied to the values of weighted averages X m .
The proposed approach is compared to the standard procedure of averaging measurements within the 2-h span of satellite overpass and presented in Table 5 for S2-MSI and Table 6 for S3-OLCI.
From Tables 5, 6 and Figure 6 it is apparent that using the similarity conditions of Section 2.3 to group in-situ measurements, with iterative estimation of r x,y , has improved the overall fit and produced useful information on the ranges of goodness-of-fit measures (slope, intercept, correlation coefficient, and Δ ). The largest improvements can be seen in the correlation coefficient and the slope. For example, the correlation coefficient between measured and derived SPM concentrations has improved from 0.16 to 0.52 for S2-MSI and 0.14 to 0.84 for S3-OLCI. For the Chl-a matchup, the improvement was from 0.26 to 0.82 and from 0.14 to 0.63 for S2-MSI and S3-OLCI, respectively. This high correlation signifies that the matchup pairs are observing the same fluctuations in the measured variable. Applying the method of unbiased validation (UV, Section 2.4) will further reduce the uncertainty, Δ , in S2-MSI estimates by 30 and 23% (for SPM and Chl-a, respectively) and by 28 and 16% in S3-OLCI estimates of SPM and Chl-a, respectively. We hypothesize that including the ratio σ x /σ y , Eq. 12, will most likely reduce the effect of discrepancies other than those related to satellite derivation (more details on this statement are provided in Section 6.1). This also means that the used models to derive Chl-a and SPM from satellite observations have inherent uncertainties independent of the used validation data. The Nechad model (Nechad et al., 2010) has uncertainties in the derived SPM values between 27 and 29%, whereas the Gons model (Gons, 1999) has derived Chl-a values with uncertainties of 51 and 46% for S2-MSI and S3-OLCI, respectively. From Table 5 it is apparent that due to the fine spatial resolution of S2-MSI (≈ 10m) the reductions in Δ values were limited with larger uncertainty (standard deviation of Δ ), in particular when compared to S3-OLCI results ( Table 6).

Method and Results
This study has presented a validation procedure for water quality products derived from the Sentinel missions in estuarine waters. Validation of S2-MSI and S3-OLCI products was carried out using continuous records of suspended sediment (SPM) and Chla measured in Ems Dollard Estuary by the Department of Public Works and Water Management (RWS) in the Netherlands. However, proximity remote sensing observations were used from the Texel station to verify the quality of atmospheric correction. From the four tested atmosphere correction approaches, C2X produced S2-MSI spectra close to measured values in magnitude and shape. For S3-OLCI, two models were available, with the augmented version of C2RCC producing slightly better results consistent with recent findings. For example, Mograne et al. (2019) compared five atmospheric correction methods for S3-OLCI over optically complex waters and found that C2RCC-A is better suited for turbid water. Toming et al. (2017) showed that C2RCC is suitable for  Frontiers in Remote Sensing | www.frontiersin.org February 2022 | Volume 2 | Article 808287 atmospheric correction of S3-OLCI but not for deriving water quality variables. Warren et al. (2019) compared six algorithms for retrieving water-leaving reflectance from S2-MSI observations and found that all algorithms have high uncertainties, with Polymer and C2RCC achieving the lowest mean absolute differences of 40-60% in blue/green bands. Verification of atmospheric correction, although indicated radiometric accuracy, were performed far from the Ems Dollard and should, therefore, be taken with caution (Pahlevan et al., 2017;Crout et al., 2019).
In-situ data showed high variability and data points matching the satellite overpass did not correspond with satellite-derived values. This could be explained by the fact that although surface processes are related to underwater processes, this relationship is not always evident (e.g., linear/logarithmic depth profile or stratification). In this case, in-situ and satellite measurements are likely observing different processes. For example, Groetsch et al. (2014) showed that the stratification largely caused the mismatch between insitu and satellite-derived Chl-a values during cyanobacterial blooms. Valdés and Lomas (2017) showed that turbulence of patches of water constituents are in the order of centimeters and minutes on the spatiotemporal scale, which does not correspond with satellite observations (hundred meters and snapshots). The Ems Dollard Estuary has fine-scale lateral spatial variability at fine-scale ( < 200 m) and the destratification state occurs during the flood cycle (Schulz et al., 2020). Therefore, we filtered the data to include the nearest pixels and observations during the flood period to avoid stratification and lateral variability. Moreover, the presented analyses of semivariograms showed a peculiar cyclic pattern of about 10 min in Chl-a. The in-situ data were verified with earlier work (Brinkman et al., 2015) in the Ems Dollard estuary and found to be consistent. One explanation could be that the measurements are affected by the inertial subrange, at which phytoplankton are subjected to high-frequency fluctuations ranging from seconds to minutes Kolmogorov et al. (1991). For example, Blauw et al. (2018) measured Chl-a at 12-min and 30-min intervals and found substantial hourly variations, and explained it by vertical mixing and horizontal transport of different water masses. However, the dynamics of suspended sediments corresponded to the tidal cycle in the estuary (van Maren et al., 2015b). The data analysis presented in this paper calls for more elaborate research on the effect of the tidal cycle, which was considered to be out-of-scope for this study (see Section 6.3 for details).
Each in-situ matchup consisted of 2 h measurements (one per minute), resulting in 120 data points around the satellite overpass. The semivariogram was applied to identify similar measurements: sequential and fall within a range of variability defined by the semivariogram. In applying the semivariogram and similarity condition, a different number of records for each in-situ matchup were produced with each record representing a similar set of measurements and is independent of the other records. The semivariogram was applied without exclusion to investigate the effect of records' overlapping, i.e., staring at t = 5 and moving with an interval of 5, 23 data records for each in-situ matchup were produced. These records have some overlap, i.e., a record i could contain parts i − 1 and i + 1 records. The results, not shown here, are similar to those presented in Tables 5, 6 but with lower variability, which the overlap of the data records could explain.
The semivariogram and the similarity condition were also applied to point t = 0 around the satellite overpass but did not provide additional information as it was comparable to the standard approach. Remembering that in-situ measuring devices record fluctuations that are not necessarily observed by the satellite, and then following the principle of insufficient reason (Jaynes, 1957), any measurement, within the ±1 h of satellite overpass, could be a candidate point for selecting the optimal in-situ matchup. Using satellite estimates as background information, the concept of optimal interpolation, commonly used to grid satellite data, could be employed. We used the Cressman weights (Cressman, 1959) to produce a weighted average X m from these m records of similar measurements: X m 1 m m i 1 w i X i . X i is the average of ith set (from m sets) of similar measurements, and w i are the weights, w i , where d i is the deviation of X i from the background value, and D i is the search radius, D i = max (d i ) + 1. This weighting mechanism will improve the values of correlation coefficients, but without sensible uncertainty reduction after applying the unbiased validation (results are not shown here).
The proposed validation procedure incorporates the variability of in-situ measurements and satellite observations (the ratio σ x /σ y ) to estimate the accuracy of satellite-derived water quality products. Due to this incorporation, the bias will vanish (therefore, the name unbiased validation) and the validation metrics (slope, intercept, correlation, and root-mean-square deviation) will improve. Our results correspond with the recent work of McKinna et al. (2021), in which they have shown that incorporating model and observation uncertainties will improve the validation metrics. In our work, the variability of in-situ measurement σ x is on a point-scale and includes two scales: 2 h (around the satellite overpass) and the timeaveraged over the matchup data points. Whereas the variability of satellite observations, σ y , is on a pixel scale (it includes spatial variability) and contains the time-averaged over the matchup pixels. Taking the ratio σ x /σ y will largely remove time-averaged over the matchup sets and leave the spatial (pixel scale) and temporal (2 h span) variability. Therefore, incorporating the ratio, σ x /σ y , in the validation will largely account for the spatial and temporal mismatch, yielding the underlying derivation-uncertainty.

Derivation Models
The Gons (Gons, 1999) and Nechad (Nechad et al., 2010) models were used in this study to respectively derive Chl-a and SPM from satellite data. Although more advanced models do exist (Pitarch and Vanhellemont, 2021;Werdell et al., 2018;Salama and Verhoef, 2015;Budhiman et al., 2012), they are not as robust as the models used in this study. Gons and Nechad models were validated by recent (Neil et al.,  Gons) and earlier studies (Kari et al., 2017;Gons et al., 2008, for SPM and Chl-a, respectively) and were, therefore, used in this study as off shelf models. For example, Neil et al. (2019) have compared 48 models to retrieve Chl-a from satellite observations. The Gons model (Gons, 1999), with its original parameterizations, provided very good estimates of Chl-a with a score of 11 out of 14 (Neil et al., 2019, Figure 6, the model H). Although the uncertainties of SPM and Chl-a derived from S2-MSI are more significant than those of S3-OLCI (5 vs. 6), the fine resolution of S2-MSI reveals spatial features that are not apparent in S3-OLCI. The unbiased validation approach provides uncertainty measures that better reflect the intrinsic accuracy of the derivation models in the Ems Dollard. For example, the Nechad model has an inherent uncertainty of about 40% in deriving SPM, whereas the Gons model has less than 35% of uncertainties in the derived values of Chl-a. This information, on models' accuracies, corresponds very well with the considered range of concentrations, previous work (e.g., , and the values obtained from the sensitivity analyses of these models. For comparison, we employ the algebraic method (based on derivatives) to compute the model's response to variations in its coefficients. For the Nechad model, a variation of ±15% in water leaving reflectance ρ w and model's coefficients yielded ±20% and ±30% of error in SPM when derived from S3-OLCI and S2-MSI, respectively. On the other hand, for the Gons model, the same variation in ρ w and its coefficients (specific inherent absorption coefficient of Chl-a) resulted in ±40% of errors in Chl-a values when derived from S3-OLCI or S2-MSI.

Limitations
This paper dealt with matchup data points between satellite observations from two sensors and measurements from a fixed station in a mesotidal estuary. Tidal circulation generally dominates the dynamics of SPM and Chl-a concentrations in the Ems Dollard. We used the findings of Schulz et al. (2020) to avoid stratification and filtered the data to include matchups during the flood cycle (high water level). However, the measurements of Schulz et al. (2020) were for 1 day and at locations to the southeast of the Eemshaven. In this respect, their results and findings may not apply to the Eemshaven site. Comparing the measured values (averaged over 120 min around the satellite overpass, not shown here) to water level did not reveal an apparent effect of the tide on the concentrations of SPM and Chl-a. We recognize that analyzing the impact of the tidal cycle on the uncertainties in satellite estimations is of prime importance but deemed out-of-scope for this study. For such an analysis to be appropriate, it should consider more stations and include the effect of vertical mixing in a 3D hydrodynamic setup. In addition, the study showed that performance of standard algorithms (Gons, 1999;Nechad et al., 2010) was not suited to the highly turbid and optically complex water of the Ems Dollard (as shown in the first row of Tables 5, 6). For optically complex waters, additional skills are required. For example, calibrating the model's coefficients with regional data , or classifying the water and for each identified type apply the model with the least uncertainty (Moore et al., 2014).
Nonetheless, the data analysis of this study has revealed the complexity of validating satellite products of water quality in mesotidal estuary. The study raised another concern related to the depth of the inertial subrange in Eemshaven. Although inertial subrange could explain the small-scale cyclic pattern in Chl-a measurements, we did not verify this process. These limitations, however, do not preclude the utility of the suggested technique as it may also remove part of the tidal effects by incorporating the variability of in-situ measurements.

CONCLUSION
The mismatch, in time and space, between satellite-pixel and point field measurements is a significant challenge facing the accuracy assessment of satellite products of water quality. Comparing in-situ data measured on a small footprint with satellite value derived on a pixel-scale is only meaningful when this mismatch is reduced.
The objective of this study was to devise a method to validate satellite derivations of water quality with continuous in-situ measurements in estuarine waters. The main challenges in this validation exercise were: the high variability of in-situ records, 120 values for each matchup, and discrepancies related to uncertainties other than the derivation-uncertainty. The first challenge was addressed by identifying the similarity between in-situ measurements using their semivariogram. The second challenge was addressed by incorporating the variabilities of field measurements and satellite observations in the validation procedure. The proposed approach has produced zero bias and improved the validation metrics (slope, intercept, correlation, and root-mean-square deviation). When compared to traditional validation procedures, the proposed method has improved the overall fit and produced valuable information on the ranges of goodness-of-fit measures (slope, intercept, correlation coefficient, and normalized root-mean-square deviation). The correlation coefficient between measured and derived SPM concentrations has improved from 0.16 to 0.52 for S2-MSI and 0.14 to 0.84 for S3-OLCI. For the Chl-a matchup, the improvement was from 0.26 to 0.82 and from 0.14 to 0.63 for S2-MSI and S3-OLCI, respectively. The uncertainty in the derived SPM and Chl-a concentrations was reduced by 30 and 23% for S2-SMI and by 28 and 16% for S3-OLCI. The high correlation and reduced uncertainty signify that the matchup pairs are observing the same fluctuations in the measured variable. These new goodness-of-fit measures correspond to the results of the performed sensitivity analysis, previous literature, and reflect the inherent accuracy of the applied derivation model.
The presented work is not limited to the realm of ocean color but can be used to validate satellite retrievals on land. For example, the method is applicable to validate active-passive microwave products of soil moisture or thermal-infrared products of land and sea surface temperatures. The proposed technique ensures the consistency of APPENDIX Proof 1 Here we will show that the uncertainty Δ d Eq. 13 resulting from the transformed values y Eq. 12 always provides the lowest uncertainty, Δ tot > Δ d as long as σ y > σ x . The proof of this statement follows from calculating the root-mean-square deviation, Δ 2 , between measured x and satellite-derived values y: Δ 2 tot σ 2 x + σ 2 y − 2σ x σ y r x,y (A1) σ 2 x + σ 2 y − 2σ x σ y r x,y > 2σ 2 x 1 − r x,y Rearranging the inequality in Eq. A1: r x,y < 0.5 + 0.5 σ y σ x (A3) The inequality in Eq. A3, and thus Δ tot > Δ d , is always satisfied as long as σ y > σ x .

Proof 2
The proof that α II = 1 and β II = 0 for y is provided hereafter. Type-II regression minimizes the normal deviates in the major axis regression line, y α II x + β II (York, 1966), we will have the slope estimated as: x 2 + 4σ 2 x σ 2 y r 2 x, y 2σ x σ y r x, y Remembering that σ y σ x yields α II = 1 and μ y μ x yields β II = 0. This proof is valid as long as the slope α II is positive, which is expected to hold for a good matchup set.