Machine-learning based approach to examine ecological processes influencing the diversity of riverine dissolved organic matter composition

Dissolved organic matter (DOM) assemblages in freshwater rivers are formed from mixtures of simple to complex compounds that are highly variable across time and space. These mixtures largely form due to the environmental heterogeneity of river networks and the contribution of diverse allochthonous and autochthonous DOM sources. Most studies are, however, confined to local and regional scales, which precludes an understanding of how these mixtures arise at large, e


Introduction
The movement of water connects not only terrestrial and aquatic life but also fresh- and marine water subsidies, transporting, for example, large amounts of terrestrial carbon (C) in the form of particulate and dissolved organic matter (DOM) along the land-ocean aquatic continuum (Drake et al., 2018). During its journey, DOM provides nutrients and energy to the aquatic food web (Azam et al., 1983), undergoing many biotic and abiotic transformations depending on its intrinsic composition, as well as extrinsic constraints such as microbial community composition and environmental conditions (Berggren et al., 2022; Hu et al., 2022). Thus, the DOM pool represents a complex blend of numerous compounds with varied compositions and quantities (Catalán et al., 2021) arising from diverse sources, transformation processes, and environmental contexts (Cooper et al., 2022). DOM chemistry also reflects a combination of biogeochemical processes (Amon and Benner, 1996; Ward et al., 2017; Ferreira et al., 2020) occurring across terrestrial and aquatic ecosystems. Determining DOM molecular composition and its reactivity within and across watershed compartments is central to disentangling its role in carbon and nutrient cycles and in the flux of gases to the atmosphere in a changing world.
Although recent studies have shown distinct spatial patterns of DOM within and across streams (Riedel et al., 2016; Garayburu-Caruso et al., 2020; Stadler et al., 2023; Freeman et al., 2024), the intrinsic and/or extrinsic attributes driving such variations are not yet fully understood. Research has shown that the composition of DOM varies across different scales, including in-stream compartments, positions in the river network, and latitudinal zones (Jaffé et al., 2012; Roth et al., 2013; Hawkes et al., 2018). For example, distinct patterns of DOM molecules have been observed when comparing surface waters and hyporheic zones (Stegen et al., 2022) and in rivers with different sizes of upstream catchment areas (Danczak et al., 2023). Another study in US rivers showed that molecular richness in river sediment decreased with increasing latitude (Cui et al., 2024). Furthermore, the composition of DOM is shaped by its reactivity to photochemical and microbial transformations, as well as to solid-phase sequestration such as flocculation and adsorption (e.g., Lu et al., 2013; Wen et al., 2022). Currently, the processing rates of organic C degradation vary regionally and globally (Tiegs et al., 2019), likely arising from the differences in biotic (autotrophic production and heterotrophic microbial degradation) and abiotic (e.g., light) degradation of DOM across large spatial scales. Specifically, environmental factors such as temperature, precipitation, and solar irradiation have been identified as important regulators of DOM composition at both regional and continental scales (Du et al., 2022, 2023). The variation in microbial community composition, driven by environmental factors within and across stream ecosystems, also plays a role. The distinct capacities of different microbes for DOM synthesis and degradation can contribute to differences in DOM molecular composition (Amaral et al., 2016; Logue et al., 2016; D'Andrilli et al., 2019; Tanentzap et al., 2019; Wang et al., 2022).
Most of the aforementioned studies have characterized DOM and its association with physicochemical drivers by employing multivariate statistical methods like principal component analysis (PCA) and discriminant analysis (see, for example, Angst et al., 2016; Johnson et al., 2019; and Lynch et al., 2019). These methods, being primarily linear in nature, have limitations in accurately reflecting complex biogeochemical processes (e.g., varying stability of molecules under different biogeochemical conditions), resulting in a potential misinterpretation of meaningful information as noise. Machine learning (ML) approaches, on the other hand, are particularly useful in ecological studies with complex data where non-linear relationships and interactions exist between various environmental factors and parameters of interest such as DOM properties. Random Forest (RF), an ensemble learning-based ML algorithm proposed by Breiman (2001), has been shown to improve accuracy by integrating results from numerous decision trees, giving more weight to significant variables while minimizing the impact of noise. Additionally, RF can assess the importance of different variables that influence model accuracy, which is essential for understanding the distinctions between different samples. This approach offers a more effective way of handling complex molecular DOM data that can further improve interpretation of biotic and abiotic controls in diverse ecosystems. Other studies that showcase the nuances of identifying molecular DOM composition patterns include Spencer et al. (2007) on diurnal variability in riverine DOM composition, He et al. (2016) on the molecular diversity of riverine sediment organic matter, and Cuss et al. (2016) on classifying DOM using ML and fluorescence signatures. Artificial neural networks (ANNs) have also been applied to biogeochemical and ecological studies for their efficiency in revealing patterns and predicting outcomes. For instance, ANNs have been instrumental in identifying the patterns involved in the spatial and temporal variation of the abundance and composition of abiotic and biotic variables (Larsen et al., 2012; Broullón et al., 2020; Danczak et al., 2020).
Here, we use an ML approach to identify patterns and trends in the previously characterized molecular composition of continental-scale river and sediment DOM samples collected under the crowdsourced Worldwide Hydrobiogeochemical Observation Network for Dynamic River Systems (WHONDRS; see, for example, Barnard et al., 2022; Borton et al., 2022; Dwivedi et al., 2022; Goldman et al., 2022). The data set was created using Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FTICR-MS), which generates high-dimensional datasets that largely defy 2D or other linear approaches of data manipulation.
Given the high-dimensional complexity of these data, the use of ML methods may help resolve the drivers of DOM composition, reactivity, and chemical character across systems. We focus on molecular diversity (number of unique molecular formulae per sample, e.g., Danczak et al., 2023) and composition and address two main questions: (1) how do the different classes of DOM and their molecular attributes vary within and between sediment and surface waters? and (2) how do the relative contributions of biotic and abiotic variables (i.e., watershed characteristics, macronutrient ratios, oxidation state, sediment metabolism) drive variation in sediment and surface water DOM?

Materials and methods
To conduct this study, we analyzed previously published data from the WHONDRS Summer 2019 Sampling (S19S) campaign (Stegen et al., 2018) using unsupervised and supervised ML approaches to identify the environmental parameters influencing the diversity of DOM clusters (Figure 1).

Data sources
The samples we analyzed were collected and processed in 2019 as part of the WHONDRS consortium (Stegen et al., 2018), and the data were retrieved from publicly available data packages (Goldman et al., 2020; Toyoda et al., 2020). Full details on sample and metadata collection are provided in Garayburu-Caruso et al. (2020). In brief, during July and August 2019, 97 river corridor systems were sampled for surface water and sediment, along with metadata, climate, vegetation, and geospatial data. Surface water was collected in triplicate, filtered (0.22 μm Sterivex), and stored in clean, pre-acidified amber VOA glass vials. Sediment samples were collected at sediment surface depths (1-3 cm) using a sterilized stainless-steel scoopula. Samples were rapidly shipped to the Pacific Northwest National Laboratory (PNNL, Richland, Washington, United States). Upon arrival, surface water samples were frozen at −20°C, and sediments were sieved (<2 mm), subsampled, and stored at −20°C. A 12 Tesla Bruker SolariX FTICR-MS (mass resolving power of 220,000 at m/z 481.185) was used to collect ultrahigh-resolution mass spectra of DOM in each surface water and sediment sample (Garayburu-Caruso et al., 2020). The FTICR-MS was equipped with an Electrospray Ionization (ESI) source and operated in negative mode at a voltage of −4.2 kV. Data collection varied between surface water (0.05 s ion accumulation) and sediment (0.1 or 0.2 s ion accumulation), covering an m/z range of 100-900 at 4 M. The mass accuracy was less than 1 ppm for singly charged ions in the m/z 100-900 range (Garayburu-Caruso et al., 2020).
Surface water samples were analyzed for dissolved organic carbon (DOC) concentrations, stable water isotopes [oxygen (O) and hydrogen (H)], specific conductivity, total nitrogen (TN) concentrations, and concentrations of chloride (Cl−), sulfate (SO4^2−), nitrate (NO3−), nitrite (NO2−), and fluoride (F−); see Toyoda et al. (2020) for details of these measurements. Sediment samples were assessed for non-purgeable organic carbon (NPOC) as sediment, water extractable organic carbon (WEOC) concentrations, microbial respiration rates, and X-ray fluorescence; details can be found in Goldman et al. (2020). Additional information on WHONDRS and methods used can be found at https://whondrs.pnnl.gov. Some of the metadata for the continental United States sites are from the StreamCat database accessed through https://waterfolk.shinyapps.io/streamcat/ (Hill et al., 2016; Powers et al., 2023).
Figure 1. Overview of the multifaceted data analysis framework employed in this study [Steps (A-D)]. The framework begins with (A) the prevalence data for molecular (chemical) formulae (CF) across all sites, represented in a binary matrix format used to determine observed richness. Next, (B) a diversity index based on the Jaccard similarity coefficient is calculated, yielding an N × N symmetrical matrix that captures the pairwise similarity between sites. Then, (C) unsupervised machine learning techniques are applied, starting with dimensionality reduction via PCA (Principal Component Analysis) and followed by K-means clustering to identify inherent groupings in the data (water and sediment clusters). Lastly, (D) a supervised machine learning approach, Histogram-based Gradient Boosting (HGB), is applied to obtain a deeper understanding of the environmental variables influencing molecular diversity of the dissolved organic matter (DOM) clusters.

Fourier transform ion cyclotron resonance mass spectrometry data processing
The WHONDRS dataset has been discussed in other publications (e.g., Garayburu-Caruso et al., 2020). In the following, we provide a brief overview of the original data processing. Data were pre-processed (Garayburu-Caruso et al., 2020) using the Bruker Daltonik Data Analysis software (version 4.2), which allowed the conversion of raw spectra to a list of m/z values by applying a signal-to-noise ratio (S/N) of 7 and a mass measurement error < 0.5 ppm. Peaks were then aligned, and molecular formulae were assigned using Formularity software (Tolić et al., 2017). The initial assignments were post-processed using the R package ftmsRanalysis (Bramer et al., 2020), removing results outside of a high-confidence m/z range (200-900) and/or with a 13C isotopic signature for further DOM characterization analysis. The ftmsRanalysis package calculates molecular formula properties and chemical classes (Kim et al., 2003; Koch and Dittmar, 2006; LaRowe and Van Cappellen, 2011). Molecular formulae were then classified into amino sugar-like, carbohydrate-like, condensed aromatic-like, lignin-like, lipid-like, protein-like, tannin-like, and unsaturated hydrocarbon-like compounds using the assign_class() function (see Kim et al., 2003 for chemical class descriptions and elemental properties). Such chemical compound classes are determined based on the atomic O/C and H/C ratios from the assigned formulae, which have been shown to be consistent with other analytical techniques (Kim et al., 2003).
Peak intensities were transformed into presence-absence data, with sediment samples from different river segments of the same river treated as replicates (Dorazio et al., 2011). Peaks that were assigned the same molecular formula due to minor mass differences were merged (0.5 ppm threshold). Only peaks with an assigned molecular formula and with an elemental combination of C1-130, H1-200, O1-50, N0-4, S0-2, and P0-1 were retained (Riedel and Dittmar, 2014). The Compound Identification Algorithm in Formularity was used with the following criteria: S/N > 7 and mass measurement error < 0.5 ppm. This algorithm takes into consideration the presence of C, H, O, N, S, and P and excludes other elements. Molecular formulae in the range of 0.3 ≤ H/C ≤ 2.2 and O/C ≤ 1.2 (Hawkes et al., 2020) and with double bond equivalents minus oxygen ≤ 10 were considered reliable based on chemical feasibility (Herzsprung et al., 2014).
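The retention criteria above can be expressed as a simple filter. The following is a minimal sketch, not the pipeline used in the study (which relied on Formularity and ftmsRanalysis): the `Formula` container, the helper names, and the example formulae are hypothetical, and the double bond equivalent is assumed to follow the standard expression DBE = 1 + C − H/2 + N/2 + P/2.

```python
from dataclasses import dataclass

# Element-count ranges quoted in the text (inclusive bounds).
ELEMENT_RANGES = {
    "C": (1, 130), "H": (1, 200), "O": (1, 50),
    "N": (0, 4), "S": (0, 2), "P": (0, 1),
}


@dataclass(frozen=True)
class Formula:
    """Element counts of one assigned molecular formula (hypothetical container)."""
    C: int
    H: int
    O: int
    N: int = 0
    S: int = 0
    P: int = 0


def dbe_minus_o(f):
    """Double bond equivalents minus the number of oxygen atoms."""
    dbe = 1 + f.C - f.H / 2 + f.N / 2 + f.P / 2
    return dbe - f.O


def is_reliable(f):
    """Apply the element ranges, the H/C and O/C windows, and DBE - O <= 10."""
    for element, (low, high) in ELEMENT_RANGES.items():
        if not low <= getattr(f, element) <= high:
            return False
    h_to_c = f.H / f.C
    o_to_c = f.O / f.C
    return 0.3 <= h_to_c <= 2.2 and o_to_c <= 1.2 and dbe_minus_o(f) <= 10


# Illustrative formulae only; none are taken from the WHONDRS data.
candidates = {
    "C6H12O6": Formula(C=6, H=12, O=6),    # passes every criterion
    "C20H4O1": Formula(C=20, H=4, O=1),    # H/C below 0.3
    "C30H20O2": Formula(C=30, H=20, O=2),  # DBE - O above 10
}
retained = [name for name, f in candidates.items() if is_reliable(f)]
```

Keeping each criterion as a separate comparison makes it easy to report which rule rejected a formula if the thresholds need to be audited.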
The molecular properties and chemical character of the molecular formulae were calculated, including their nominal oxidation state of C (NOSC) (unitless; Garayburu-Caruso et al., 2020), Gibbs Free Energy GFE (in kJ/mol C; according to LaRowe and Van Cappellen, 2011), double bond equivalent DBE (unitless; according to Koch and Dittmar, 2006), and degree of aromaticity AImod (unitless; according to Koch and Dittmar, 2006).
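For readers who want to see how these per-formula properties follow from element counts, the sketch below implements the expressions as they are commonly stated in the cited literature. It is an illustration under assumptions rather than the ftmsRanalysis code: the function names are hypothetical, formulae are assumed to be neutral (charge zero), the Gibbs free energy uses the linear relation 60.3 − 28.5 × NOSC (kJ/mol C), and AImod uses the form that includes N and P in the numerator; negative or undefined AImod values are set to zero by convention. The exact variants used by the package should be checked before reuse.

```python
def nosc(c, h, n, o, p, s):
    """Nominal oxidation state of carbon for a neutral CHNOPS formula."""
    return 4 - (4 * c + h - 3 * n - 2 * o + 5 * p - 2 * s) / c


def gibbs_free_energy(nosc_value):
    """Gibbs free energy of the C oxidation half reaction, in kJ per mol C."""
    return 60.3 - 28.5 * nosc_value


def dbe(c, h, n, p):
    """Double bond equivalents."""
    return 1 + c - h / 2 + n / 2 + p / 2


def ai_mod(c, h, n, o, p, s):
    """Modified aromaticity index; returns 0.0 when the ratio is not meaningful."""
    numerator = 1 + c - 0.5 * o - s - 0.5 * (n + p + h)
    denominator = c - 0.5 * o - n - s - p
    if numerator <= 0 or denominator <= 0:
        return 0.0
    return numerator / denominator


# Worked example with glucose (C6H12O6), chosen only for illustration.
glucose_nosc = nosc(c=6, h=12, n=0, o=6, p=0, s=0)
glucose_gfe = gibbs_free_energy(glucose_nosc)
glucose_dbe = dbe(c=6, h=12, n=0, p=0)
glucose_ai = ai_mod(c=6, h=12, n=0, o=6, p=0, s=0)
```

Because NOSC enters the free energy expression linearly, more reduced formulae (lower NOSC) map directly to larger free energy values in this sketch.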

Machine-based learning examination of DOM composition
Our analysis focused on three categories: DOM data (matrix of assigned DOM molecular formulae); relevant environmental metadata pertinent to biological and/or chemical DOM processes (including pH, water temperature, concentrations of Cl−, F−, and nitrate, isotopic composition (δ18O, δ2H), and mean annual temperature (MAT), among others); and DOM molecular properties and chemical character. Molecular formulae present in less than 10% of samples were categorized as "rare" and excluded.
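The exclusion of rare formulae can be illustrated with a short sketch. It assumes each sample is represented as a set of assigned formulae and that "less than 10% of samples" is evaluated over all samples of one sample type; the function name, the data layout, and the toy data are hypothetical and not taken from the published code.

```python
from collections import Counter


def drop_rare_formulae(samples, min_fraction=0.10):
    """Remove formulae that occur in fewer than `min_fraction` of the samples.

    `samples` maps a sample identifier to the set of formulae detected in it.
    """
    n_samples = len(samples)
    occurrences = Counter(f for formulae in samples.values() for f in formulae)
    kept = {f for f, count in occurrences.items() if count / n_samples >= min_fraction}
    return {sample_id: formulae & kept for sample_id, formulae in samples.items()}


# Toy data: 20 samples; "A" is in all of them, "B" in two (10%), "C" in one (5%).
toy_samples = {f"S{i:02d}": {"A"} for i in range(20)}
toy_samples["S00"] |= {"B", "C"}
toy_samples["S01"] |= {"B"}

filtered = drop_rare_formulae(toy_samples)
```

In this toy case "C" is removed because it appears in only 5% of samples, while "B" sits exactly at the 10% boundary and is kept, since the text excludes only formulae below that level.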
The drivers of DOM molecular composition across diverse sites having similar characteristics can be difficult to interpret. To obtain a deeper understanding of differences and drivers of potentially small differences across the continental-scale dataset (Step A in Figure 1), we first reduced the dimensionality by applying molecular diversity indices (representing the composition of each sample) and counted observed richness as the number of unique molecular formulae per sample. Jaccard pairwise similarity coefficients were then calculated and used in Step B (Figure 1), resulting in an N × N engineered DOM dataset for both water (n = 265) and sediment (n = 239). Diversity metrics were calculated using the R package "vegan" (Oksanen et al., 2020) in the R environment (R Development Core Team, 2008). We then applied unsupervised k-means clustering to the PCA-transformed data, with the number of clusters decided by examining the distortion, inertia, and silhouette score for cluster numbers ranging from 2 to 10 (Supplementary Figure S1). Each sample type was best characterized by 3 distinct clusters, referred to as Sed-0, Sed-1, Sed-2, Wat-0, Wat-1, and Wat-2 (see Figure 2 for clustering following PCA-k-means). The clustering is based on the Jaccard index (commonly used to determine how similar sample sets are) and likely represents similar ecological influences on DOM formation and diversity (Step C). After the removal of rare molecular formulae (present in less than 10% of samples) and unsupervised k-means clustering, we observed 4,936 molecular formulae in the 265 water samples and 4,053 molecular formulae in the 239 sediment samples, with an overlap of 2,109 molecular formulae in both datasets. In total, 6,880 unique molecular formulae were found across water and sediment samples (Supplementary Table S1).
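Steps A to C can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the study computed diversity metrics with the R package vegan, whereas here the Jaccard similarity is computed with NumPy; the number of principal components, the random seeds, and the synthetic presence-absence matrix are all hypothetical; and only the silhouette score is used to choose the number of clusters, whereas the study also examined distortion and inertia.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def jaccard_similarity_matrix(presence):
    """Pairwise Jaccard similarity between samples (rows) of a binary matrix."""
    presence = np.asarray(presence, dtype=bool).astype(int)
    shared = presence @ presence.T                      # formulae present in both samples
    richness = presence.sum(axis=1)                     # observed richness per sample
    union = richness[:, None] + richness[None, :] - shared
    # Two empty samples are treated as identical (similarity 1).
    return np.divide(shared, union, out=np.ones(shared.shape), where=union > 0)


def select_k_and_cluster(similarity, k_values=range(2, 11), n_components=2, seed=0):
    """Reduce the N x N matrix with PCA, then pick k by the best silhouette score."""
    coords = PCA(n_components=n_components, random_state=seed).fit_transform(similarity)
    scores_by_k = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)
        scores_by_k[k] = silhouette_score(coords, labels)
    best_k = max(scores_by_k, key=scores_by_k.get)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=seed).fit_predict(coords)
    return best_k, labels, scores_by_k


# Synthetic presence-absence data: 45 samples, 60 formulae, three simulated groups.
rng = np.random.default_rng(0)
group_of_sample = np.repeat([0, 1, 2], 15)
probabilities = np.full((45, 60), 0.05)
for g in range(3):
    probabilities[group_of_sample == g, g * 20:(g + 1) * 20] = 0.9
presence = rng.random((45, 60)) < probabilities

richness = presence.sum(axis=1)
similarity = jaccard_similarity_matrix(presence)
best_k, labels, scores_by_k = select_k_and_cluster(similarity)
```

Clustering the rows of the similarity matrix groups samples by how they resemble every other sample, which is why the input to PCA is N × N rather than samples × formulae.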
To obtain a better understanding of environmental influences driving the formation of the DOM clusters, we developed ML models. One-hot encoding was used for categorical data pre-processing. Class imbalances were addressed by generating additional samples for minority classes (clusters with lower sample numbers). Histogram-based Gradient Boosting (HGB) models were trained with the metadata to predict DOM cluster formations (Step D). Hyperparameter tuning was undertaken using the BayesSearchCV algorithm in the Scikit-Learn framework (Pedregosa et al., 2011). The hyperparameters (Supplementary Table S2) were selected to prevent model overfitting. SHapley Additive exPlanations (SHAP) values were computed for each set of metadata, ranked, and plotted. Features with positive SHAP values positively impact the prediction, while those with negative values have a negative impact. The magnitude is a measure of how strong the effect is. In each stage, two models were developed for each sample type (water and sediment). During the first stage of the model development, using all metadata, the models were developed using default and tuned hyperparameters. In the second stage, based on the SHAP values obtained in stage one, 13 metadata parameters were used to train the water model (13 for sediment) using HGB's default hyperparameters and then the tuned hyperparameters. All models were evaluated for their performance using 10-fold cross validation (CV), test accuracy score, and accuracy score-based learning curves (Supplementary Table S3).
Please see Supplementary Figure S2 (sediment) and Supplementary Figure S3 (water) for the accuracy-based learning curves for each HGB model and Supplementary Figures S4, S5 for an overview of SHAP values from each HGB model. All data and codes are available at: https://github.com/WHONDRS-Crowdsourced-Manuscript-Effort/Topic4/tree/main.

Unsupervised learning reveals diverse and distinct DOM clusters
Applying an unsupervised ML method resulted in highly distinguishable and unique clusters for the two sample types, sediment and surface water. The 97 sampled systems and individual replicates (504 samples in total) were used as input for cluster analyses. For sediment samples, 76 samples were assigned to cluster 0 (Sed-0), 17 samples to cluster 1 (Sed-1), and the majority, 146 samples, to cluster 2 (Sed-2). For surface water, 147 samples were assigned to cluster 0 (Wat-0), 28 samples to cluster 1 (Wat-1), and 90 to cluster 2 (Wat-2). Concerning the prevalence of DOM molecular formulae, we found that most molecules identified in sediments were found across all clusters (70.1%), and some were exclusively present in Sed-0 and Sed-2 (29.9%), while a single unique formula was observed in Sed-2 and none were exclusively found in Sed-0 or Sed-1 (Supplementary Figure S6). Similar results were observed in water samples (Supplementary Figure S6), as an even higher number of molecular formulae were observed across the three clusters (80.5%), 17.7% in clusters Wat-0 and Wat-2, and just a few exclusively in Wat-0 (0.45%) or in both Wat-0 and Wat-1 (1.4%). These results point toward homogenization in terms of shared molecular formulae across clusters in both sediment and water habitats, and toward potential variation in DOM signatures of individual samples within each cluster (which is considered below).
Differences in DOM molecular compositions of sediment and water samples belonging to the three identified clusters were expected given the potential influence of diverse environmental factors (e.g., terrestrial input, microbial activity, human activities, and runoff patterns) across the continental-scale dataset (Stegen et al., 2022). Using chemical compound classes determined by differences in atomic O/C and H/C ratios (Kim et al., 2003), we observed that, in general, samples of sediment clusters had a larger relative contribution of protein-like and unclassified chemical compound classes compared to water clusters. Sediment clusters 0, 1, and 2 had distinct compositions: Sed-0 was mostly comprised of lignin- and lipid-like compounds, Sed-1 was dominated by condensed hydrocarbon-like (ConHC) and lignin-like compounds, and Sed-2 was dominated by lignin- and tannin-like compounds (Figure 3). The diversity in molecular composition potentially highlights different sources and processes affecting the clusters' composition. Water clusters also had distinct compositions, with Wat-0 being dominated by ConHC and tannin-like compounds, Wat-1 by amino sugar- and protein-like compounds, and Wat-2 by lignin-like compounds (Figure 3). Intriguingly, Sed-0 and Wat-1 contained the ranging from ~1,000 to 2,000 in sediment and slightly higher values between ~1,500 to 3,000 in water clusters (Figure 4). More information regarding the molecular composition of the clusters can be found in the Supplementary material (section 3). To summarize, the composition of sediment clusters indicates significant terrestrial inputs, particularly from vegetation. Sed-2 suggests influence from fresh plant debris due to its high CHO content. Nitrogen, sulfur, and phosphorus present in Sed-0 and Sed-1 point to microbial activity and human influences like agriculture and wastewater discharge. In water clusters, the abundance of CHO and lignin-like character in Wat-2 indicates terrestrial plant and soil inputs, hydrologic connectivity among soils and adjacent rivers, and the end-products of in-situ heterotrophic microbial degradation of DOM. Higher percentages of CHOS in Wat-0 may be attributed to biotic and abiotic sulfurization reactions under anoxic conditions or wastewater inputs. The abundance of lignin- and tannin-like character in Wat-0 and Wat-2 indicates that natural and anthropogenic terrestrial sources from runoff and land use are also significant contributors. Phosphorus-containing formulae in Wat-1 and Wat-2 hint at nutrient cycling as a key process, potentially influenced by agricultural runoff (see Supplementary material section 3 for more details). This ML clustering approach revealed features like those obtained by optical fluorescence analyses (excitation emission matrices, e.g., Yamashita et al., 2008), and a methodological cross-validation in future studies could guide researchers toward more cost-effective methods to characterize DOM pools across spatial, temporal, and cross-boundary scales.
The environmental parameters across sediment and water clusters showed distinct profiles for each cluster (see Supplementary Figures S8, S9). For the sediment clusters, only respiration rate and NPOC showed significant differences between the clusters (p-values of 0.01237 and 0, respectively; Supplementary Figure S8). Sed-2 displayed a significantly lower respiration rate, which could be related to the large contributions of lignin- and tannin-like classes: large, structurally complex molecules that are considered more recalcitrant or to result from microbial respiration. Sed-1 showed significantly higher NPOC concentrations. This cluster also contained more DOM having ConHC character, indicating the abundance of low-O containing DOM in the sediments, which may be related to in-situ processing or reflect the signature of previous processing in the water column before deposition. In contrast, the water clusters show a different pattern. For the water clusters, most variables showed significant differences, indicating that the clusters are highly distinct in terms of these environmental characteristics (Supplementary Figure S9). Wat-0 is characterized by greater distances from dams and gages and the highest median number of days since precipitation, indicating DOM molecular composition sources in potentially more remote locations having drier climates. The dry condition corresponds to the higher proportions of condensed hydrocarbons in Wat-0, as fire can be a primary source for these compounds. Wat-2 is generally closer to dams and gages, with the least variability in distance, and experiences precipitation more frequently. This may be linked to higher proportions of lignin-like composition in Wat-2, since hydrological events predominantly facilitate the transfer of terrestrial DOM into aquatic ecosystems. Across both sediment and water clusters, the variability and median values suggest that each cluster experiences unique environmental conditions, with some being more prone to extremes and others displaying more homogeneity in their environmental parameters.
Figure 3. Overview of the relative contribution (%) of compound classes to the three sediment (left) and water (right) clusters. A heatmap of the key molecular formulae contributing to the cluster formation can be found in Supplementary Figure S7.
To determine whether the distribution of clusters is statistically significant with respect to latitude, longitude, or a combination of both, we performed point-biserial correlation and ANOVA (Analysis of Variance) tests. Like Cui et al. (2024), we observed a slight but statistically significant correlation between sediment cluster membership and latitude (Supplementary Table S4). While Sed-1 was positively related (p = 0.0029), Sed-2 showed a negative relationship with latitude (p = 0.0021). Sed-0 showed no significant correlation with latitude, and all three clusters had no significant correlation with longitude. Water clusters displayed the opposite trend: no significant correlation with latitude, but a correlation with longitude. This was, however, only true for Wat-0, which displayed a slight but statistically significant negative correlation with longitude (p = 0.091). This analysis suggests that while latitude and longitude do have some (weak) predictive power for cluster membership, they are not the main factors influencing it. More complex models and in-depth knowledge are required for a more accurate prediction of cluster membership.
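The point-biserial test treats membership in one cluster as a binary variable and correlates it with a continuous variable such as latitude. The sketch below shows the coefficient only (it equals the Pearson correlation between the binary indicator and the continuous values); significance testing and the ANOVA are not reproduced. The function name, the cluster labels, and the latitudes are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(membership, values):
    """Point-biserial correlation between a binary indicator and a continuous variable."""
    in_group = [v for m, v in zip(membership, values) if m]
    out_group = [v for m, v in zip(membership, values) if not m]
    n1, n0, n = len(in_group), len(out_group), len(values)
    spread = pstdev(values)  # population standard deviation of all values
    if n1 == 0 or n0 == 0 or spread == 0:
        raise ValueError("both groups must be non-empty and values must vary")
    return (mean(in_group) - mean(out_group)) / spread * sqrt(n1 * n0 / n ** 2)


# Hypothetical cluster labels and latitudes, for illustration only.
labels = ["Sed-0", "Sed-1", "Sed-2", "Sed-1", "Sed-2", "Sed-0", "Sed-1", "Sed-2"]
latitudes = [38.2, 46.9, 33.1, 45.0, 35.4, 40.7, 47.3, 31.8]

# One-vs-rest coefficient for each cluster.
correlations = {
    cluster: point_biserial([label == cluster for label in labels], latitudes)
    for cluster in sorted(set(labels))
}
```

A positive coefficient means samples in that cluster tend to sit at higher latitudes than the remaining samples, and a negative coefficient means the opposite.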

Supervised machine-based learning reveals influence of environmental parameters on the molecular richness of DOM clusters
To gain deeper insights into how the various environmental parameters influenced the formation of DOM clusters in water and sediment samples, we applied a supervised ML algorithm (Histogram-based Gradient Boosting, HGB) and interpreted it using SHAP (SHapley Additive exPlanations) values. Sediment and water clusters were the targets, and all environmental variables were used to train the model (see methods for more details). The HGB ML model consistently highlighted surface water isotopic composition δ18O (‰) and mean annual temperature (MAT; °C) as influential across both sediment and water sample types (Figure 5). Both water and sediment SHAP values, however, also show that certain features have more variable impacts than others on DOM cluster formation, as indicated by the spread of the SHAP values along the x-axis (Figure 5). For example, NPOC concentration and respiration rates for sediment clusters and stream order and variations in natural flow in water clusters appear to have highly variable impacts (Figure 5). These examples demonstrate the complex, non-linear relationships inherent in ecological data.
For sediment clusters, the SHAP values suggest that NPOC and nitrate concentrations help predict the DOM clusters. Features related to isotopic composition (δ18O and deuterium, ‰) and mean annual temperature (MAT; °C) were also highlighted as influential, which indicates the importance of geographic water source and thermodynamically favorable hydrological processes. Deuterium provides information about the role of precipitation, groundwater, and evaporation processes in continental waters, all of which may have different outcomes on the molecular composition of DOM (Baskaran et al., 2009). McDonough et al. (2022) discovered that the transformation of DOM in groundwater resulted in the elimination of oxidized DOM composition, along with an accumulation of both reduced photodegradable compounds and aerobically biodegradable compounds exhibiting a pronounced microbial signature. Ide et al. (2017) found significant variations in the number of DOM molecular formulae in rainwater, throughfall, soil water, groundwater, and stream water, with a linear correlation between DOM molecular diversity and the number of lignin-like molecules. Lignin-like composition was particularly high in groundwater samples. Sediment clusters 0 and 2 were characterized by more lignin-like composition (Figure 3) and could potentially be influenced by groundwater discharge.
Nitrate, respiration rate, mean annual precipitation, and grass percentage within 100 m of the river all showed a negative influence on DOM cluster formation (Figure 5). The combination of respiration rate and water column depth could be interpreted to mean that areas with deeper waters and more biological activity, as indicated by respiration rates, may harbor more diverse organic molecules within the sediment. Variables related to precipitation, in combination with nitrate, suggest that rainfall and higher nitrate concentrations may be associated with non-biomass-building microbial processes leading to lower molecular richness in the sediment. Precipitation events have been shown to influence the amount and composition of DOM transported through river networks by mobilizing terrestrial DOM into the river water column and shifting flow paths to flushing upper, organic-rich soil horizons (Hong et al., 2012; Wagner et al., 2019). The negative influence of high nitrate concentrations could be due to conditions that are not conducive to molecular diversity, for example eutrophic conditions favoring the excessive production of algae-derived DOM. Elevated nitrate concentration is also commonly associated with agricultural influences, yet the impact on DOM can vary. Agricultural land use has been shown to increase microbially derived, protein-like DOM composition with decreased structural complexity (Wilson and Xenopoulos, 2008) and/or increase terrestrially derived, aromatic DOM composition (Shang et al., 2018; Ji et al., 2024). In comparing three types of riparian soils (forested, agricultural, and wetland soils) in headwater streams, Ji et al. (2024) found that agricultural soil DOM exhibited the lowest molecular richness, while agricultural particulate organic matter and DOM displayed the highest molecular richness.
The water clusters had a broader range of SHAP values, suggesting that the model finds a greater variation in how the environmental parameters affect water DOM clusters. Physical and chemical parameters like MAT and inorganic ions (sulfate and chloride) have a substantial influence on the DOM clusters (Figure 5). Chloride has been associated with environmental conditions or events that foster a diverse array of organic molecules, such as increased groundwater discharge (Gue et al., 2018). Research conducted in coastal aquifers explored the molecular diversity of DOM in the subterranean estuary, revealing a unique ecohydrological interface where marine organic matter mixes with groundwater containing aged C from terrestrial sources (Waska et al., 2021). McDonough et al. (2021) used FTICR-MS to investigate the molecular composition and character of DOM in groundwater and reported that the molecular character of reactive DOM in groundwater differs from that of surface water. Fluoride had a positive impact on DOM clusters as well, particularly at higher values (Figure 5), supporting the potential role of groundwater discharge in DOM richness. Liu et al. (2015) explored how geochemical processes, including the role of DOM derived from rock weathering and biodegradation of organic matter, affect fluoride concentrations in groundwater. They showed that competitive adsorption of HCO3− and OH− with F− can lead to the release of F− from the aquifer matrix into solution, increasing groundwater F− concentration. Overall, our ML approach can decipher environmental influences on DOM diversity that strongly agree with other published work.
The long-standing ecological conceptual model, the "River Continuum Concept," has argued that stream order serves as a general predictor of DOM diversity, with the highest diversity appearing in low-order streams (Vannote et al., 1980). Evidence from empirical data, however, varies geographically and with anthropogenic influence. For instance, using FT-ICR MS, Mosher et al. (2015) showed that 1st-order streams have the highest diversity of molecular formulae and compound classes in a forested catchment, while Roebuck Jr et al. (2020) showed that the influence of stream order was outweighed by land use in regulating DOM compositions along a river continuum. Stream order had both negative (blue) and positive (red) influence on DOM clusters (Figure 5), which could indicate that DOM richness is increased in clusters consisting of samples taken in low-order streams and decreased in clusters made up of samples taken in higher-order streams.
Primary sources introducing flow variability showed a positive impact on DOM clusters (Figure 5). Such features could, for example, be dams, and the presence of upstream dams indeed showed a noticeable cluster of positive SHAP values, indicating that the presence of a dam upstream can be an important predictor for the model outcome. Dams showed a mix of positive and negative impacts, supporting our finding that Wat-2 is characterized by a close association with dams and gages, whereas Wat-0 and Wat-1 are located further away. Dams have been shown to affect the structure of DOM (Wang et al., 2021). In reservoirs created by dams, certain areas experience slower water flow compared to free-flowing river segments. This reduction in flow velocity alters the physical, chemical, and biological environment of the water, which in turn impacts the concentration and composition of DOM. Wang et al. (2021) showed that the reservoir area had relatively higher terrestrial input and increased abundance of recalcitrant DOM, a consequence of water intrusion from the main stem of the stream caused by the construction and operation of the reservoir. Dam construction increases the residence time of DOM in the river (Hong et al., 2012), and Sun et al. (2017) noted that in slower flow areas of a mid-subtropical drinking water source reservoir, there was a higher content of certain DOM classes, supporting the notion that altered hydrodynamics can lead to variation in the DOM composition (Lynch et al., 2019). Non-anthropogenic organic debris dams in streams trap sediments and collect particulate organic matter, which again affects the concentration and composition of DOM in stream water (Bilby, 1981).
As observed for sediment clusters, recent precipitation events also had a cluster of high positive SHAP values at lower feature values (Figure 5), which suggests that rainfall events have a significant impact on water DOM clusters. DOM increases in streams during heavy rainstorms and snowmelt, mainly due to storm flow flushing through upper, organic matter-rich soil horizons (Kaiser and Guggenberger, 2005).
Recent research at the basin level, such as the studies by Danczak et al. (2023) and Cui et al. (2024), has documented a correlation between watershed attributes and the chemical diversity of DOM in water. This research revealed that in the Yakima River, DOM chemical diversity expands with the growth of the watershed area and fluctuates with different types of land cover. While the extent to which findings from specific sites can be generalized to larger areas remains uncertain, our analysis on a continental scale hints at the possibility that connections between DOM diversity and watershed traits might be widespread. A better understanding of the watershed characteristics driving DOM beta diversity and richness could be instrumental in forecasting the chemical diversity of riverine DOM across extensive geographical regions.

Summary and conclusions
Our limited capacity to unravel biogeochemical processes in lotic ecosystems worldwide at different spatial and temporal scales, combined with poor knowledge of the complex interactions between abiotic and biotic drivers, results in an urgent need to develop new strategies and tools to study DOM. Such new approaches may allow us to identify the major patterns governing ecological processes, so we can predict how they might be affected in a changing world. Here, we applied unsupervised and supervised ML approaches to analyze the diversity and molecular composition of continental-scale river and sediment DOM samples of the WHONDRS database. We then assessed the potential influence of environmental parameters on their molecular diversity. This data-driven approach provided a mechanism to identify common DOM clusters and the key environmental conditions that generate these groups of compounds. While both sediment and water samples shared some common influential features, we found clear differences in the range and nature of the most influential parameters. Supervised ML revealed that features like dams, precipitation events, and watershed characteristics had significant impacts on the DOM composition and diversity, particularly in water samples. The study also underscored the complex and non-linear relationships inherent in ecological data, highlighting the need for advanced analytical methods like ML to understand non-linear correlations in large data sets and to bridge relationship gaps among carbon cycling scientists in diverse ecological communities.

FIGURE 2 Principal component analysis (PCA)-based k-means clustering of the sediment (A) and water (B) samples. Different colors indicate the three different clusters found for the three principal components (PC1-PC3).

FIGURE 4 Diversity indices of sediment and water clusters. Observed richness for sediment clusters in (A) and for water clusters in (B). Lowercase letters a-c indicate significant differences between clusters based on Kruskal-Wallis followed by Dunn tests (p < 0.01). Each boxplot's lower and upper hinges correspond to the first and third quartiles, respectively. The whiskers extend from the hinges to the largest and smallest values within 1.5 times the interquartile range. Data beyond the whiskers are displayed as outlier points.

FIGURE 5 Summary plots for SHAP values of the two best performing HGB models for sediment (left) and water (right) sample types. Individual points represent samples. Features with positive SHAP values positively impact the prediction, while those with negative values have a negative impact. The magnitude is a measure of how strong the effect is. Features are arranged along the y axis based on their importance, which is given by the mean of their absolute SHAP values. The higher a feature's vertical position in the plot, the greater its importance to the model.

Metadata with high SHAP values computed from Stage 1 were selected for model training in Stage 2. Based on the SHAP

Frontiers in Water, doi: 10.3389/frwa.2024.1379284
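The Kruskal-Wallis comparison of observed richness across the three clusters, as reported in the Figure 4 caption, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the richness values are hypothetical, the helper names are invented for this example, the p-value shortcut holds only for exactly three groups (chi-square approximation with 2 degrees of freedom), and the Dunn post hoc tests are not implemented.

```python
import math


def midranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical observed richness (number of molecular formulae) per cluster.
cluster_0 = [1200, 1350, 1280, 1400]
cluster_1 = [1900, 2100, 2050, 1980]
cluster_2 = [2600, 2750, 2500, 2900]

h_stat = kruskal_wallis_h([cluster_0, cluster_1, cluster_2])
p_value = p_value_three_groups(h_stat)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A significant result only indicates that at least one cluster differs; pairwise tests with a multiple-comparison adjustment, such as the Dunn tests named in the caption, would be needed to assign the letters a-c.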