ORIGINAL RESEARCH article

Front. Environ. Sci., 12 January 2026

Sec. Freshwater Science

Volume 13 - 2025 | https://doi.org/10.3389/fenvs.2025.1716967

This article is part of the Research Topic: Environmental Management of Headwater Lakes and Catchments

Analysis of key factors influencing cyanoHAB intensity levels and evaluation of machine learning predictions

Hyeonsu Chung, Taesung Kim*
  • Yeongsan River Environment Research Center, National Institute of Environmental Research, Gwangju, Republic of Korea

This study quantified the effects of biological and physicochemical factors on the occurrence of cyanobacterial harmful algal blooms (cyanoHABs) at various stages in the Juksan Weir, where cyanoHABs frequently occur due to decreased flow velocity and prolonged residence time caused by artificial structures. The predictive performance of machine learning techniques such as Random Forest (RF), XGBoost (XGB), boosted regression trees (BRT), and a stacking model (RF + XGB) was evaluated. The analysis revealed distinct changes in physicochemical factors and differences in plankton community structure across the stages of cyanoHABs. Water temperature, COD, and DO were identified as common key physicochemical factors, with stage-specific key factors identified for plankton communities. At the low level, Cladocera and Chlorophyceae were dominant, contributing to initial community stabilization. At the moderate level, mixotrophic cryptophytes (e.g., Cryptomonas) and Copepoda increased, mediating energy flow redistribution and community changes. At the high level, community reorganization occurred as Protozoa exhibited functional responses to adapt to abundant organic matter and low-oxygen environments. In the machine learning analysis, RF and XGB generally showed high performance, whereas the stacking model (RF + XGB) exhibited the most consistent and accurate predictive power at all levels. All models except BRT were stable at the low level, whereas the stacking model achieved the highest performance at the moderate and high levels. These differences in performance highlight the advantages of utilizing ensemble (stacking) approaches in capturing stage-specific nonlinear interactions. The results of this study demonstrate that a stacking ensemble approach, which incorporates stage-specific key variables, is an effective strategy in achieving accuracy and stability in predicting the intensity levels of cyanoHABs. 
In the future, this study will provide scientific evidence to support the development of early warning systems for cyanoHABs.

1 Introduction

Freshwater ecosystems are facing serious environmental crises due to the combined effects of climate change and human activities. Among these, cyanobacterial harmful algal blooms (cyanoHABs) have emerged as one of the major water quality problems worldwide. CyanoHABs are caused by the excessive proliferation of harmful cyanobacteria, such as Microcystis, Dolichospermum, Oscillatoria, and Planktothrix (Van Hassel et al., 2022). These organisms secrete toxic substances, such as anatoxins, cylindrospermopsins, and microcystins, which have fatal impacts on aquatic organisms, including fish, zooplankton, and aquatic plants (Sultana et al., 2024). In addition, the production of odorous compounds, such as geosmin and 2-methylisoborneol, creates serious problems for the operation of water treatment plants and degrades the quality of tap water (Devi et al., 2021). In this regard, cyanoHABs are directly related to the health of aquatic ecosystems and the use of water resources, and understanding their generation mechanisms is essential for effective response and policy formulation (Paerl, 2018).

The Juksan Weir, the subject of this study, is an artificial weir located in the middle and lower reaches of the Yeongsan River, one of the four major rivers in Korea. It was installed in 2012 as part of the Four Major Rivers Restoration Project. Changes in hydraulic characteristics (e.g., decreased flow velocity and increased residence time) caused by the installation of artificial structures have been identified as key factors that increase the likelihood of cyanoHABs, and conditions favorable for the growth of harmful cyanobacteria have been reported (Chung et al., 2024; Kwak et al., 2016). In particular, the weir has structural characteristics that affect water level and flow velocity in the middle and lower reaches during its operation, due to its direct connection to the downstream Yeongsan River estuary bank. Frequent algal blooms and low-oxygen phenomena have been reported in summer, influenced by geographic factors and basin pollutant concentrations that decrease flow velocity and cause long-term water stagnation (Kim and Shin, 2021). From this perspective, the Juksan Weir provides suitable conditions for quantitatively analyzing biological responses and community changes, making it a representative case area for studying cyanoHAB mechanisms. Previous studies on cyanoHABs were largely confined to simple correlations between key water quality indicators (e.g., algal blooms, chlorophyll-a, water temperature, and nutrients), which limited understanding of intensity-level characteristics (Han et al., 2024; Lee et al., 2024). In particular, few studies have analyzed the relative contributions of ecological and environmental factors at different stages of cyanoHAB intensity. This has resulted in insufficient understanding of the mechanisms underlying the interactions between harmful cyanobacteria and changes in environmental factors associated with bloom intensity. 
Therefore, an approach that identifies harmful cyanobacteria–environment relationships according to intensity levels is required for effective understanding and management of cyanoHABs.

In this study, cyanoHAB intensity was divided into three stages (Low: <1,000 cells/mL; Moderate: 1,000–10,000 cells/mL; High: ≥10,000 cells/mL) based on harmful cyanobacterial cell counts in the Juksan Weir from 2016 to 2022. Significant biological and environmental factors for each stage were selected through Kruskal–Wallis statistical analysis, and the relative contributions of each variable were quantified using hierarchical partitioning analysis (HPA). Based on these results, Random Forest (RF), XGBoost (XGB), boosted regression trees (BRT), and stacking models were applied to compare predictive performance for cyanoHAB intensity. The aim of this study was to improve the accuracy and reliability of cyanoHAB intensity predictions by comparing the performance of multiple machine learning models that incorporated stage-specific biological and environmental factors. These findings will aid in the development of customized prediction and management strategies tailored to bloom intensity and provide scientific evidence for effective water quality management and the establishment of an early warning system.

2 Materials and methods

2.1 Study area

The Juksan Weir (JS) is an artificial weir located in the downstream area of the Yeongsan River. It was installed in 2012 in Dasi-myeon, Naju-si, Jeollanam-do, Korea, as part of the Four Major Rivers Restoration Project (Figure 1). The weir is primarily used to supply agricultural water, with a watershed area of approximately 2,359 km² and a reservoir capacity of 25,700,000 m³. The management water level is EL. 3.5 m, and the average water depth is maintained at approximately 5–6 m (Kim and Shin, 2021).

Figure 1. Location of the sampling site in the Yeongsan River at the Juksan Weir.

The Juksan Weir is a multi-purpose structure that performs various functions, including water level maintenance, discharge control, securing water for ecological maintenance, flood control, and small-scale hydropower generation. It plays a crucial role in water quality and algae management in the Yeongsan River system, as it influences flow and residence time in the middle and lower reaches owing to its location between the upstream Seungchon Weir and the downstream estuary bank (Son et al., 2018). In particular, the Juksan Weir can create stagnant water zones due to reduced flow caused by the installation of artificial structures. Flow velocity may be further reduced, exacerbating stagnation under the additional discharge control effect of the upstream Seungchon Weir (Kwon et al., 2024). In addition, a stagnant water zone forms in front of the weir due to the diversion of part of the main stream discharge by the intake of a small hydropower plant located 540 m upstream, which is known to create conditions favorable for cyanoHABs (Kim et al., 2024). Furthermore, the weir is prone to eutrophication because pollution sources (e.g., sewage treatment plant effluent and agricultural and livestock activities) are concentrated within the watershed. Based on these watershed and water environment conditions, this study selected the Juksan Weir as the target site for analyzing the characteristics and predictability of cyanoHABs at each stage.

2.2 Sampling and data collection

Weekly surveys of phytoplankton and water quality were conducted in the Juksan Weir (JS) from 2016 to 2022. Zooplankton data were obtained from the Ministry of Environment (ME) Water Environment Information System (http://water.nier.go.kr), and hydrological data were obtained from K-water (http://water.or.kr). For analysis, surface water samples (0.5 m depth) were collected using a Van Dorn Water Sampler and fixed with 2%–3% Lugol’s solution. The samples were stabilized for 24 h, and cell counts were measured using a Sedgwick-Rafter chamber (Wildco, MI, United States) and an optical microscope (100–1,000x magnification) (MOE, 2011).

Phytoplankton were classified into harmful cyanobacteria (Harmful_Cyanophyceae), Bacillariophyceae, Chlorophyceae, Cyanophyceae, and other plankton. For harmful cyanobacteria, some Anabaena species have been reclassified into the Dolichospermum genus, and Oscillatoria species into the Planktothrix genus, due to recent changes in the classification system. In this study, however, four harmful cyanobacterial genera (Microcystis, Anabaena, Aphanizomenon, and Oscillatoria) were analyzed according to ME water system monitoring standards (Chung et al., 2019).

Other plankton groups, including Cryptophyceae, Chrysophyceae, Dinophyceae, and Euglenophyceae, were identified to the species level (41 species in total). However, they were combined into the “other plankton” category because they represented only 5.1% of the total cell count. Krammer and Lange-Bertalot (2007) and Massaru et al. (1977) were used for phytoplankton identification, and classifications followed the system of Simonsen (1979). Physicochemical environmental factors were divided into field measurement items (water temperature, pH, DO, and electrical conductivity (EC)) and laboratory analysis items (BOD, COD, SS, TN, NO3-N, NH3-N, TP, PO4-P, Chl-a, and TOC). A boat was used for sampling, and field measurement items were directly measured on site using calibrated field equipment (EXO2 Sonde; YSI, Yellow Springs, OH, United States). Collected samples were stored in an ice cooler and transported to the laboratory for analysis of laboratory items in accordance with the water pollution process test method (ME, 2017). The diaphragm electrode method, the potassium permanganate method, and the high-temperature combustion method were used for BOD, COD, and TOC, respectively. An automatic analyzer (QuAAtro 39 Autoanalyzer; BLTEC, Osaka, Japan) was used for TN, TP, and PO4-P; high-temperature ion chromatography (Integrion; Dionex, Sunnyvale, CA, United States) for NO3-N; and ion chromatography (AA3; BLTEC, Osaka, Japan) for NH3-N. Chl-a was measured using a UV-Vis spectrophotometer (LAMBDA™ 25; Perkin-Elmer, Shelton, CT, United States).

2.3 Data analysis

2.3.1 Data preparation

The data were classified into three groups based on harmful cyanobacterial cell counts to distinguish the stages of cyanoHABs and analyze factors affecting each intensity level. Specifically, the low stage (n = 285) was defined as a cell count of less than 1,000 cells/mL, the moderate stage (n = 42) as a cell count of at least 1,000 but less than 10,000 cells/mL, and the high stage (n = 24) as a cell count of 10,000 cells/mL or greater. For phytoplankton cell counts and zooplankton abundances, log-transformed values (log10(x + 1)) were applied to improve normality and stabilize variance.
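The grouping and transformation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code (the analysis itself was run in R). The function names and example counts are hypothetical, and samples at exactly 10,000 cells/mL are assumed to fall in the high stage, following the ≥10,000 cells/mL threshold given in the Introduction.

```python
import math


def intensity_stage(cells_per_ml):
    """Assign a cyanoHAB intensity stage from a harmful cyanobacterial cell count."""
    if cells_per_ml < 1_000:
        return "low"
    if cells_per_ml < 10_000:
        return "moderate"
    return "high"  # assumption: exactly 10,000 cells/mL counts as high


def log_transform(count):
    """log10(x + 1) transform; the +1 keeps zero counts defined."""
    return math.log10(count + 1)


# Hypothetical cell counts (cells/mL), chosen to probe the stage boundaries.
counts = [0, 999, 1_000, 9_999, 10_000, 42_119]
stages = [intensity_stage(c) for c in counts]
transformed = [log_transform(c) for c in counts]

for count, stage, value in zip(counts, stages, transformed):
    print(f"{count:>6} cells/mL -> {stage:<8} log10(x + 1) = {value:.3f}")
```

The +1 offset matters in practice because weekly samples outside the bloom season can contain no harmful cyanobacteria at all, and log10(0) is undefined.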

2.3.2 Statistical analyses

Because normality was not confirmed for all variables by the Shapiro–Wilk test, differences between groups were tested using the Kruskal–Wallis test, followed by multiple comparisons using Dunn’s test. Factors that exhibited significant differences among groups were then selected, and their contributions were examined through hierarchical partitioning analysis (HPA). HPA was conducted using the hier.part package in R. The independent contributions (I) and joint contributions (J) of the variables were calculated with the hier.part function (Walsh and Mac Nally, 2013). In addition, a randomization-based test generated the distribution of I values for each variable through 1,000 randomizations using the rand.hp function, and statistical significance was evaluated based on the Z score (Z ≥ 1.65) (Zhao et al., 2022; Mac Nally, 2002).
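The significance step of the randomization test can be illustrated as follows. This is a minimal sketch of the Z-score calculation only, not of hierarchical partitioning itself. It assumes that the Z score is the observed I value minus the mean of the randomized I values, divided by their sample standard deviation. The randomized values below are made-up placeholders, not study data.

```python
import random
import statistics


def randomization_z(observed, randomized):
    """Z = (observed - mean(randomized)) / sd(randomized), using the sample SD."""
    return (observed - statistics.mean(randomized)) / statistics.stdev(randomized)


# Placeholder null distribution standing in for 1,000 randomized I values.
rng = random.Random(0)
null_i = [rng.gauss(5.0, 2.0) for _ in range(1000)]

observed_i = 12.0  # hypothetical observed independent contribution
z = randomization_z(observed_i, null_i)
significant = z >= 1.65  # threshold stated in the text

print(f"Z = {z:.2f}, significant = {significant}")
```

The 1.65 cutoff corresponds approximately to the upper 95% limit of a standard normal distribution, so the criterion is one-sided: only contributions larger than expected by chance are flagged.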

2.3.3 Machine learning-based prediction models

Four machine learning models (RF, XGB, BRT, and stacking RF + XGB) were applied to predict harmful cyanobacteria according to the intensity level of cyanoHABs (Low–Moderate–High), based on key contributing variables derived through HPA.

2.3.3.1 Random Forest (RF)

Random Forest (RF) is an ensemble-based, non-parametric learning algorithm that constructs multiple decision trees using bootstrap samples and aggregates their predictions to reduce variance and enhance robustness (Breiman, 2001). For regression tasks, the RF prediction can be expressed as:

$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$$

where $h_t(x)$ is the prediction of the $t$-th decision tree, $T$ is the total number of trees, and $x$ is the input predictor vector. In this study, the number of trees was set to 700 to prevent overfitting and optimize predictive performance (Huang et al., 2022).

2.3.3.2 Extreme Gradient Boosting (XGB)

Extreme Gradient Boosting (XGB) improves the traditional gradient boosting framework by incorporating regularization and efficient optimization strategies. Specifically, XGB builds trees sequentially by fitting new learners to the gradients of the loss function, enabling effective bias reduction and robust modeling of complex nonlinear relationships. The model minimizes the following regularized objective function:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l(y_i, \hat{y}_i)$ denotes the loss function, $f_k$ represents the $k$-th of $K$ decision trees, and $\Omega(f_k)$ is a regularization term controlling model complexity. Model stability was enhanced by setting a learning rate (eta) of 0.15, a maximum tree depth of 4, a subsampling rate of 80%, and 200 boosting iterations (Song et al., 2024; Li D. et al., 2022).

2.3.3.3 Boosted Regression Trees (BRT)

Boosted Regression Trees (BRT) integrate regression trees with a boosting framework and are effective for modeling nonlinear environment–organism relationships without prior data transformation (Agasild et al., 2024; Descy et al., 2016). In BRT, successive trees are fitted to the residuals of previous models, enabling gradual improvement of predictive accuracy. The BRT model can be expressed as an additive function:

$$\hat{y} = \sum_{m=1}^{M} \nu \cdot f_m(x)$$

where $f_m(x)$ is the $m$-th regression tree, $M$ is the total number of trees, and $\nu$ is the shrinkage (learning rate) parameter. For the low and moderate stages, parameters were set to n.trees = 200, interaction.depth = 4, and shrinkage = 0.01. For the high stage, the reduced sample size required adjustment to n.trees = 100 and interaction.depth = 3 (Bertani et al., 2017).
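The additive structure of boosting, in which each tree is fitted to the residuals left by the previous trees and scaled by the shrinkage, can be sketched with one-split trees (stumps) on a single predictor. This is a simplified illustration for squared-error loss, not the gbm implementation. The constant starting value (the mean of the response), the helper names, and the toy data are assumptions, and interaction depth is not modeled.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump (one split) on a single predictor."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, shrinkage):
    """Return a predictor of the form f0 + sum over m of shrinkage * f_m(x)."""
    base = sum(y) / len(y)  # assumed constant starting prediction
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + shrinkage * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + shrinkage * sum(t(xi) for t in trees)


def training_mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy data: hypothetical water temperature (°C) and log10 cell counts.
temperature = [10, 14, 18, 22, 26, 30]
log_counts = [1.0, 1.2, 2.0, 2.9, 3.8, 4.1]

# n_trees and shrinkage follow the low/moderate-stage settings quoted in the text.
model = boost(temperature, log_counts, n_trees=200, shrinkage=0.01)
print(f"training MSE = {training_mse(model, temperature, log_counts):.4f}")
```

With a shrinkage as small as 0.01, each tree contributes only a small correction, which is why a small learning rate is usually paired with a larger number of trees.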

2.3.3.4 Stacking ensemble (RF + XGB)

The stacking ensemble model integrates the complementary strengths of bagging- and boosting-based learners using a two-level learning framework. At the first level, RF and XGB base models were independently trained using the same set of input predictors, and their prediction outputs were generated for all samples. These prediction values were then used as new input features for the second-level model. At the second level, a meta-regression model was applied to combine the prediction outputs derived from the base learners and to produce the final prediction. In this study, a linear regression model was used as the meta-learner, allowing transparent integration of base model predictions and direct interpretation of their relative contributions. The stacking framework can be expressed as:

$$\hat{y}_{\mathrm{stack}} = \beta_0 + \beta_1 \hat{y}_{\mathrm{RF}} + \beta_2 \hat{y}_{\mathrm{XGB}}$$

where $\hat{y}_{\mathrm{RF}}$ and $\hat{y}_{\mathrm{XGB}}$ denote the predictions generated by the RF and XGB base models, respectively, and $\beta_0$, $\beta_1$, and $\beta_2$ represent the regression coefficients estimated by the meta-regression model. RF and XGB were selected as base learners based on their consistently superior standalone performance across cyanoHAB intensity levels. By combining variance reduction through bootstrap aggregation in RF and bias reduction through sequential residual learning in XGB, the stacking ensemble achieves improved prediction stability and accuracy under nonlinear and stage-dependent environmental conditions.
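The second-level step can be sketched as an ordinary least-squares fit of the observed values on the two base-model predictions. The sketch below is illustrative only: the base-model predictions are made-up numbers rather than outputs of fitted RF or XGB models, the function names are hypothetical, and the target is constructed from known coefficients so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_meta_learner(pred_rf, pred_xgb, observed):
    """Least-squares estimates of (beta0, beta1, beta2) via the normal equations."""
    rows = [[1.0, p, q] for p, q in zip(pred_rf, pred_xgb)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(3)]
    return solve_linear_system(xtx, xty)


def stack_predict(coefs, p_rf, p_xgb):
    beta0, beta1, beta2 = coefs
    return beta0 + beta1 * p_rf + beta2 * p_xgb


# Hypothetical base-model predictions (log10 cell counts).
pred_rf = [1.0, 1.5, 2.2, 2.9, 3.4]
pred_xgb = [0.8, 1.9, 2.0, 3.3, 3.1]
# Synthetic target built from known coefficients so that recovery can be verified.
observed = [0.1 + 0.3 * p + 0.7 * q for p, q in zip(pred_rf, pred_xgb)]

coefs = fit_meta_learner(pred_rf, pred_xgb, observed)
print([round(c, 6) for c in coefs])
```

Because the meta-learner is linear, the fitted coefficients can be read directly as the weights given to each base model, which is the interpretability advantage noted above.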

The randomForest, xgboost, and gbm packages in R (ver. 4.3.3) were used for model learning. The root mean square error (RMSE), mean absolute error (MAE), and Nash–Sutcliffe efficiency (NSE) were used as indicators for predictive performance comparison and optimal model selection (Rad et al., 2024; Tao et al., 2024). RMSE and MAE quantify the difference between observed and predicted values in the same unit, with smaller values indicating higher performance (Li et al., 2021). NSE evaluates the explanatory power of a model by comparing residual variance with the variance of observed values, with values closer to 1 indicating stronger explanatory power (Cao et al., 2016). To evaluate the robustness and generalization ability of the models, predictive performance was compared across all cyanoHAB intensity levels using RMSE, MAE, and NSE. Model stability was further ensured by tuning key hyperparameters for RF, XGB, and BRT, and by examining performance differences between the individual models and the stacking ensemble model. This approach minimized overfitting and supported consistent model behavior under varying environmental conditions. The formulas for the indicators are as follows.

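$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $n$ is the sample size, $y_i$ is the $i$-th observed value, $\hat{y}_i$ is the $i$-th value predicted by the model, and $\bar{y}$ is the mean of the observed values.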
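The three indicators translate directly into code. The following is a minimal sketch in plain Python, with hypothetical observed and predicted values, intended only to make the definitions concrete.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    total = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / total


# Hypothetical log10-transformed cell counts.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(f"RMSE = {rmse(observed, predicted):.4f}")
print(f"MAE  = {mae(observed, predicted):.4f}")
print(f"NSE  = {nse(observed, predicted):.4f}")
```

Two reference points help when reading NSE: a model that reproduces the observations exactly scores 1, and a model that always predicts the observed mean scores 0.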

3 Results

3.1 Differences in biological and environmental factors among cyanoHAB intensity levels

In this study, differences in plankton communities and environmental factors were analyzed according to cyanoHAB intensity levels. Based on the Kruskal–Wallis test results, both phytoplankton and zooplankton exhibited significant differences (Figure 2). Among environmental factors, water temperature, DO, COD, and some nutrients differed significantly, whereas hydrological factors showed no significant differences across levels. This could be attributed to the decreased impact of hydrological factors and the increased importance of physicochemical factors, as variability in flow velocity and residence time was reduced in the stagnant zone formed by the weir structure (Choi et al., 2020; Kim et al., 2020).

Figure 2. Comparisons of phytoplankton and zooplankton abundances across cyanoHAB intensity levels (low, moderate, high) in the Juksan Weir revealed significant differences, denoted by *(P < 0.05) and **(P < 0.01). (A) Bacillariophyceae, (B) Chlorophyceae, (C) Other phytoplankton, (D) Total phytoplankton, (E) Protozoa, (F) Rotifera, (G) Copepoda, and (H) Cladocera.

For phytoplankton, Bacillariophyceae and other plankton decreased by approximately 6.7- and 3.7-fold, respectively, from the low level (8,051 ± 12,091 cells/mL; 9,758 ± 1,267 cells/mL) to the high level (1,197 ± 1,538 cells/mL; 205 ± 306 cells/mL) (P < 0.01; Figures 2A,C). In contrast, Chlorophyceae increased by approximately 2.3-fold from the low level (2,181 ± 3,314 cells/mL) to the high level (5,040 ± 4,827 cells/mL) (P < 0.01; Figure 2B). The total phytoplankton population increased sharply from the moderate level (12,400 ± 5,844 cells/mL) to the high level (42,119 ± 39,256 cells/mL) (P < 0.01; Figure 2D).

For zooplankton, Protozoa and Rotifera abundance increased up to the moderate level and then decreased at the high level (Figures 2E,F), whereas Copepoda abundance gradually increased from the low level (14 ± 33 ind./L) to the high level (122 ± 126 ind./L) (P < 0.01; Figure 2G). Cladocera abundance also increased significantly at both moderate and high levels compared with the low level (P < 0.01; Figure 2H).

In the analysis of environmental factors, water temperature and COD gradually increased with increasing levels (Figures 3A,C). In particular, the water temperature at the high level (28.2 °C ± 2.9 °C) was approximately 13 °C higher than that at the low level (15.0 °C ± 7.5 °C) (P < 0.01; Figure 3A). DO and some nutrients (TN, NO3-N, and NH3-N) significantly decreased at higher levels compared with the low level (P < 0.01; Figures 3B,D–F). PO4-P decreased slightly at the moderate level compared with the low level and then increased again at the high level (Figure 3G). These results confirm that the major zooplankton communities and environmental factors changed significantly with cyanoHAB intensity level.

Figure 3. Comparisons of physicochemical factors across cyanoHAB intensity levels (low, moderate, high) in the Juksan Weir revealed significant differences, denoted by *(P < 0.05) and **(P < 0.01). (A) water temperature (WT), (B) dissolved oxygen (DO), (C) chemical oxygen demand (COD), (D) total nitrogen (TN), (E) nitrate nitrogen (NO3-N), (F) ammonia nitrogen (NH3-N), and (G) phosphate phosphorus (PO4-P).

3.2 Hierarchical partitioning analysis

In this study, HPA was conducted to identify key contributing factors affecting cyanoHAB intensity levels (Table 1). At the low level, Cladocera (57.0%), Chlorophyceae (11.2%), zooplankton (7.9%), Rotifera (6.7%), and Copepoda (6.2%) were identified as key biological contributors. Among physicochemical factors, water temperature (37.9%) exhibited the highest contribution, followed by TN (14.9%), NH3-N (13.0%), and NO3-N (11.7%). At the moderate level, other plankton (27.3%), Copepoda (13.7%), and phytoplankton (14.9%) were identified as key biological contributors. Among physicochemical factors, COD (26.2%) had the greatest impact, followed by DO (23.2%) and PO4-P (16.9%). At the high level, phytoplankton (61.4%), Protozoa (13.0%), zooplankton (7.8%), and Bacillariophyceae (6.1%) were identified as key biological contributors, whereas water temperature (37.3%), COD (31.7%), and DO (11.4%) had the highest contributions among physicochemical factors. These results demonstrate that the contributions of key environmental conditions and biological interactions vary with cyanoHAB intensity levels, providing fundamental data for customized management strategies and predictive variables at each stage.

Table 1. Identification of biological and physicochemical factors influencing cyanoHAB intensity levels (low, moderate, high) using hierarchical partitioning analysis. The percentage influence (% I) was estimated through hierarchical partitioning analysis, and Z-scores were derived by comparing observed % I values with the distribution of % I values obtained from 1,000 random permutations of the independent variable data matrices.

3.3 Predictive performance of the models

In this study, the performance of four machine learning models (RF, XGB, BRT, and stacking RF + XGB) was compared to distinguish and predict cyanoHAB intensity levels. Each model utilized stage-based key environmental variables derived through HPA as predictors, and model performance was comprehensively evaluated using RMSE, MAE, and NSE (Figures 4–6, Table 2). All regression lines shown in Figures 4–6 represent the relationship between the predicted and observed log10-transformed harmful cyanobacterial cell counts. At the low level, the RF, XGB, and stacking models showed higher predictive power than the BRT model (Figure 4). In particular, the stacking model exhibited the highest predictive accuracy, with y = 0.912x + 0.038 and r² = 0.91. The RF model (y = 0.648x + 0.156, r² = 0.90) and XGB model (y = 0.861x + 0.078, r² = 0.78) also demonstrated strong performance, whereas the BRT model showed low predictive accuracy (y = 0.265x + 0.249, r² = 0.44). At the moderate level, the stacking model (y = 0.940x + 0.205, r² = 0.94) exhibited the highest performance, followed by the RF model (y = 0.584x + 1.420, r² = 0.93) and XGB model (y = 0.892x + 0.370, r² = 0.86). The BRT model again showed poor performance (y = 0.136x + 2.932, r² = 0.25) (Figure 5). At the high level, the stacking model (y = 0.969x + 0.135, r² = 0.97) and XGB model (y = 0.944x + 0.227, r² = 0.92) achieved the best performance, whereas the RF model (y = 0.766x + 1.029, r² = 0.95) also showed excellent results. By contrast, the BRT model performed worst (y = 0.587x + 1.801, r² = 0.93) (Figure 6). Model performance was further evaluated using RMSE, MAE, and NSE, which are widely applied to assess the accuracy and reliability of machine learning-based environmental prediction models (Adnan et al., 2025). 
At the low level, the stacking model exhibited the highest performance (RMSE = 0.275, MAE = 0.198, NSE = 0.912), followed by the RF model (RMSE = 0.385, MAE = 0.248, NSE = 0.828) and XGB model (RMSE = 0.448, MAE = 0.136, NSE = 0.768). The BRT model showed the lowest performance (RMSE = 0.740, MAE = 0.471, NSE = 0.366).

Figure 4. Time series comparisons of observed and predicted values in the low intensity level for (A) Random Forest, (B) XGB, (C) BRT, and (D) Stacking (RF + XGB). For each model, the second plot is a scatter plot of the relationship between observed and predicted values.

Figure 5. Time series comparisons of observed and predicted values in the moderate intensity level for (A) Random Forest, (B) XGB, (C) BRT, and (D) Stacking (RF + XGB). For each model, the second plot is a scatter plot of the relationship between observed and predicted values.

Figure 6. Time series comparisons of observed and predicted values in the high intensity level for (A) Random Forest, (B) XGB, (C) BRT, and (D) Stacking (RF + XGB). For each model, the second plot is a scatter plot of the relationship between observed and predicted values.

Table 2. Performance of prediction models using significant environmental factors by cyanoHAB intensity levels (low, moderate, high).

At the moderate level, the stacking model again outperformed the others (RMSE = 0.074, MAE = 0.058, NSE = 0.940), followed by the XGB model (RMSE = 0.114, MAE = 0.041, NSE = 0.857), RF model (RMSE = 0.133, MAE = 0.115, NSE = 0.803), and BRT model (RMSE = 0.270, MAE = 0.225, NSE = 0.195).

At the high level, the stacking model (RMSE = 0.059, MAE = 0.043, NSE = 0.969) showed the highest predictive performance. The XGB model (RMSE = 0.099, MAE = 0.030, NSE = 0.912) and RF model (RMSE = 0.097, MAE = 0.069, NSE = 0.916) performed comparably, with XGB yielding the lower MAE and RF a marginally lower RMSE and higher NSE, whereas the BRT model showed the weakest performance (RMSE = 0.148, MAE = 0.122, NSE = 0.803).
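The r² values from the scatter plots and the NSE values reported here need not agree, because r² measures only the strength of the linear association between predicted and observed values, whereas NSE also penalizes departures from the 1:1 line. The sketch below uses made-up numbers, not study data, to show predictions that are perfectly correlated with the observations but compressed toward the mean. This may be one reason a model with a regression slope well below 1 can retain a high r² while its NSE is lower.

```python
def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency."""
    mean_o = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    total = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - residual / total


# Hypothetical log10 cell counts and predictions with slope 0.5 (compressed range).
observed = [3.0, 3.5, 4.0, 4.5, 5.0]
compressed = [0.5 * o + 2.0 for o in observed]

print(f"r^2 = {r_squared(observed, compressed):.2f}")
print(f"NSE = {nse(observed, compressed):.2f}")
```

Here the predictions lie exactly on a straight line, so r² equals 1, yet NSE is only 0.75 because low values are overpredicted and high values underpredicted.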

Overall, these results demonstrate that the stacking and XGB models provided consistently strong predictive performance compared with the other models at the target site. In particular, the high agreement between predicted and observed values for the stacking model supports the effectiveness of ensemble-based learning techniques (Ge et al., 2024; Jo et al., 2023).

In addition to evaluating predictive accuracy, the influence of key environmental variables on the model outputs was assessed to enhance interpretability. Water temperature and COD exhibited a positive influence on predicted cyanoHAB intensity levels, reflecting their stepwise increases across the Low–Moderate–High stages identified by the Kruskal–Wallis test. In contrast, DO showed a negative influence, consistent with its decreasing trend across stages. The magnitude of these effects aligned with the HPA-derived contribution rates, confirming that water temperature, COD, and DO were the dominant predictors shaping model behavior across all intensity levels.

Although the sample sizes for the moderate (n = 42) and high (n = 24) cyanoHAB stages were relatively small, model stability was maintained by restricting predictors to the key variables identified through HPA and by using ensemble-based learning structures. These approaches mitigated the risk of overfitting and enabled the models, particularly the stacking model, to produce consistent performance across stages. Nonetheless, the limited sample sizes in the upper intensity levels should be acknowledged as a structural constraint of field-collected cyanoHAB datasets, and the interpretation of stage-specific predictions should consider this limitation.

4 Discussion

4.1 Importance of stage-based ecological dynamics in cyanoHAB intensity levels

In this study, the interactions between biological and physicochemical factors and the resulting community structure were analyzed according to cyanoHAB intensity levels (Low–Moderate–High). To this end, significant factors for each level were identified using the Kruskal–Wallis test and HPA. Based on their contributions, change patterns were analyzed.
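As a rough illustration of the first screening step, the sketch below computes the tie-corrected Kruskal–Wallis H statistic for one variable across the three intensity levels. It is a minimal pure-Python version written for this example, not the authors' code; the water temperature values are hypothetical, and the closed-form p-value used here holds only for three groups (two degrees of freedom).

```python
import math

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    """Return the tie-corrected H statistic and its degrees of freedom."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    h /= 1.0 - ties / (n ** 3 - n)
    return h, len(groups) - 1

# Hypothetical water temperatures (degrees C) per intensity level.
low = [18.2, 19.5, 20.1, 21.0]
moderate = [23.4, 24.0, 24.8, 25.5]
high = [27.1, 28.3, 28.9, 29.6]

h, df = kruskal_wallis([low, moderate, high])
# With df == 2 the chi-square upper tail has the closed form exp(-H / 2).
p_value = math.exp(-h / 2.0)
print(f"H={h:.3f}, df={df}, p={p_value:.4f}")
```

A variable that differs significantly among levels in a test of this kind would then be a candidate for the contribution analysis.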

4.1.1 Low level–contributions of Cladocera and Chlorophyceae during the initial formative period

At the low level, Cladocera (57.0%) and Chlorophyceae (11.2%) exhibited high contributions. Cladocera, including Bosmina longirostris, which feeds effectively on small phytoplankton, played a crucial role in stabilizing the initial community structure (Kosiba and Krztoń, 2022; Leitão et al., 2018). Chlorophyceae proliferated rapidly under high solar radiation and water temperature (Li J. et al., 2022; Kim et al., 2020) and may have competed with harmful cyanobacteria at the initial stage. The key environmental factors during this period were water temperature (37.9%), TN (14.9%), and NH3-N (13.0%). In particular, the increase in water temperature accelerated the initial growth of harmful cyanobacteria. Therefore, Cladocera and Chlorophyceae led the community structure under high water temperature and nutrient conditions at the low level, serving as a foundation for changes at subsequent levels.

4.1.2 Moderate level–community response adjustment due to intensified cyanoHABs

At the moderate level, other plankton (27.3%) and Copepoda (13.7%) were identified as key contributors. According to a previous study that analyzed the same dataset (Chung et al., 2024), other plankton mainly consisted of Cryptomonas, a mixotrophic dinoflagellate. Cryptomonas is a preferred food source for Copepoda (Ger et al., 2019; Agasild et al., 2012), which avoid feeding on cyanobacteria and instead selectively consume alternative food sources. The actual increase in Copepoda populations during this period reflects this feeding selectivity and functional adaptation (Ger et al., 2014). In contrast, the reduction in Cryptomonas populations is attributed to cyanobacterial photoinhibition and resource competition (Zohary et al., 2021). The key environmental factors at this level were COD (26.2%), DO (23.2%), and PO4-P (16.9%). The increase in COD contributed to the intensified dominance of cyanobacteria through the accumulation of organic matter. Therefore, the moderate level can be interpreted as a transitional period in which the community begins reorganizing into a cyanobacteria-dominated structure, driven by the temporary dominance of Cryptomonas and the feeding selectivity of Copepoda.

4.1.3 High level–structural reorganization dominated by harmful cyanobacteria and the functional response of protozoa

At the high level, phytoplankton and Protozoa contributed 61.4% and 13.0% of the total community, respectively. The high contribution of phytoplankton resulted from the dominance of harmful cyanobacteria, which represented 78.8% of the community during this period. This dominance reflects the growth characteristics of cyanobacteria under high water temperature and organic matter accumulation. Within this dominant structure, Protozoa fulfilled functional roles while adapting to environmental changes. Among Protozoa, ciliates (e.g., Carchesium, Tintinnidium, and Vorticella) likely contributed to material cycling by feeding on organic matter and bacteria derived from cyanobacteria (Abirhire et al., 2023; Johnke et al., 2017). In contrast, Bacillariophyceae sharply decreased due to reduced light penetration and maladaptation to high water temperature (Zohary et al., 2021; Naselli-Flores et al., 2020). The key environmental factors during this period were water temperature (37.3%), COD (31.7%), and DO (11.4%). Rising water temperature and COD facilitated cyanobacterial growth, and reduced DO promoted community reorganization through low-oxygen stress. The dominance of a single group and the reduction in functional diversity may weaken community resilience by diminishing buffering capacity against changes in the external environment (Kim et al., 2023). Consequently, the high level is a critical stage that must be prioritized when establishing early warning systems and stepwise management strategies.

The analysis across all levels revealed that water temperature, COD, and DO were the key environmental factors consistently influencing cyanoHAB dominance. Water temperature was a key factor in the formation of dominance at all levels because it is closely related to the optimal growth conditions of harmful cyanobacteria (Hecht et al., 2022; Chapra et al., 2017). COD was closely related to bloom intensification, reflecting organic matter accumulation at the moderate and high levels (Ding et al., 2021). DO decreased significantly alongside the dominance of harmful cyanobacteria and served as an important environmental driver of community changes by inducing low-oxygen conditions (Li et al., 2024; Wen et al., 2020; Chen et al., 2018). These three shared factors therefore acted on cyanoHAB dominance regardless of intensity level.
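The percentage contributions quoted in this section come from hierarchical partitioning. As a hedged sketch of the idea, the code below takes each predictor's independent contribution to be its average gain in R² over all orders of entry into a linear model, then expresses the result as a share of the total. Ordinary least squares, R² as the goodness-of-fit measure, the function names, and the synthetic data are all assumptions made for illustration; the settings used in the study may differ.

```python
from itertools import combinations
from math import factorial

import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on the chosen columns plus an intercept."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def independent_contributions(X, y):
    """Average gain in R^2 from adding each predictor, over all orders of entry."""
    p = X.shape[1]
    # Fit every subset of predictors once and cache its R^2.
    fit = {s: r_squared(X, y, s)
           for k in range(p + 1) for s in combinations(range(p), k)}
    contrib = np.zeros(p)
    for j in range(p):
        others = [c for c in range(p) if c != j]
        for k in range(p):
            # Fraction of orderings in which exactly k predictors precede j.
            weight = factorial(k) * factorial(p - k - 1) / factorial(p)
            for s in combinations(others, k):
                with_j = tuple(sorted(s + (j,)))
                contrib[j] += weight * (fit[with_j] - fit[s])
    return contrib

# Synthetic data, for illustration only (not data from this study).
rng = np.random.default_rng(0)
n = 90
wt = rng.normal(25.0, 4.0, n)   # water temperature
cod = rng.normal(7.0, 1.5, n)   # chemical oxygen demand
do = rng.normal(9.0, 2.0, n)    # dissolved oxygen
X = np.column_stack([wt, cod, do])
y = 0.08 * wt + 0.15 * cod - 0.10 * do + rng.normal(0.0, 0.2, n)

contrib = independent_contributions(X, y)
share = 100.0 * contrib / contrib.sum()
for name, pct in zip(["WT", "COD", "DO"], share):
    print(f"{name}: {pct:.1f}%")
```

The independent contributions sum to the R² of the full model, which is why they can be reported as percentages. The number of subsets doubles with every added predictor, so this exhaustive approach is only practical for a modest number of variables.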

4.2 Stage-based prediction of cyanoHABs: performance evaluation of machine learning models

As confirmed in the previous analysis, biological and physicochemical contributions vary depending on cyanoHAB intensity levels. Therefore, stage-based design is required for machine learning predictions. It is challenging to ensure stable performance across all stages using a single predictive algorithm because of the nonlinear interactions and spatiotemporal variability of key environmental factors such as water temperature, COD, and DO (Visser et al., 2021). In this context, ensemble techniques that integrate outputs from multiple models have garnered attention. In particular, stacking has been considered an effective approach that enhances predictive accuracy and consistency by complementing the limitations of individual models (Szewczyk et al., 2025; Ly et al., 2023).

In the present study, predictions were performed based on cyanoHAB intensity levels (Low–Moderate–High) using stage-based key environmental factors derived through the Kruskal–Wallis test and HPA as predictive variables. The performance of four models (RF, XGB, BRT, and stacking RF + XGB) was then compared. The stacking method applied RF and XGB, which exhibited high performance in this study, as base learners. At the low level, all models showed stable performance, and the stacking model (RF + XGB) achieved the lowest RMSE and MAE and the highest NSE. At the moderate level, significant differences in predictive power were observed among the models, with the stacking model (RF + XGB) maintaining consistent predictive performance. At the high level, nonlinear interactions among variables were noticeable. The stacking model (RF + XGB) effectively captured these complex patterns while exhibiting the closest linear agreement between observed and predicted values.

The superior performance of the stacking model can be attributed to its algorithmic structure, which integrates complementary learning mechanisms from RF and XGB. RF, as a bagging-based ensemble, reduces prediction variance by aggregating multiple decorrelated decision trees, whereas XGB, as a boosting-based model, minimizes prediction bias by sequentially learning residuals and capturing complex nonlinear relationships. By combining these heterogeneous learners within a stacking framework, the meta-regression model effectively synthesized diverse predictive patterns and mitigated the limitations of individual models. This structural advantage is generally expected to be pronounced under conditions characterized by strong nonlinearity and stage-dependent interactions, which corresponded to the moderate and high cyanoHAB intensity levels observed in this study. Under these conditions, individual models exhibited greater variability in predictive performance, whereas the stacking model maintained stability and accuracy by leveraging complementary information from both base learners.

Previous studies have also reported the high performance of boosting-based XGB (Wang et al., 2021) and the improvement in time-series prediction stability achieved through stacking (Yan et al., 2019). Consistent with these findings, the stacking model (RF + XGB) and XGB model also exhibited high accuracy and stable performance in predicting complex aquatic environments. Overall, stage-based variable selection combined with stacking-based ensemble learning provides high accuracy and consistency in predicting each cyanoHAB level, as it enhances robustness against nonlinear environmental variability and improves generalization across intensity levels. These findings provide a scientific basis for developing early warning and real-time response systems and support the advancement of prediction models applicable to complex aquatic ecosystem conditions.
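The bagging-plus-boosting structure described above can be sketched with scikit-learn's `StackingRegressor`. This is a minimal illustration under stated assumptions, not the authors' implementation: `GradientBoostingRegressor` stands in for XGBoost so the example depends only on scikit-learn, the linear meta-learner and all hyperparameters are arbitrary choices, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data with three predictors (illustration only).
rng = np.random.default_rng(42)
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0.0, 0.05, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

stack = StackingRegressor(
    estimators=[
        # Bagging-based base learner.
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        # Boosting-based base learner (stand-in for XGBoost, an assumption).
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    # Meta-regression model fitted on out-of-fold base-learner predictions.
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

pred = stack.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
print(f"Held-out RMSE: {rmse:.3f}")
```

Training the meta-learner on cross-validated predictions, rather than on in-sample fits, keeps it from simply rewarding whichever base learner overfits the training data most. In a stage-based design, one such model would be fitted per intensity level using that level's selected predictors.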

5 Conclusion

In this study, the contributions and change patterns of stage-based biological communities and environmental factors were evaluated according to cyanoHAB intensity levels (Low–Moderate–High). Based on these findings, the need for stage-based design and the performance of machine learning predictions were assessed.

5.1 Stage-based community changes and contributions of key environmental factors

As cyanoHAB intensity increased, the key contributing biological groups and overall community structure changed accordingly. At the low level, the filter-feeding strategy of Cladocera and the rapid growth of Chlorophyceae under high solar radiation and water temperature contributed to initial community stabilization. At the moderate level, Copepoda, which avoid harmful cyanobacteria and selectively feed on alternative food sources such as mixotrophic dinoflagellates (e.g., Cryptomonas), played a functionally important role. At the high level, cyanobacteria dominated, whereas Protozoa contributed to material cycling by adapting to high organic matter and low-oxygen conditions and by feeding on cell debris and bacteria.

In terms of environmental factors, water temperature, COD, and DO were identified as common key drivers. Water temperature directly influenced the growth of cyanobacteria, whereas COD was closely associated with increasing intensity levels by reflecting organic matter accumulation. In contrast, hydrological factors (e.g., residence time, flow velocity, and water depth) showed relatively low explanatory power.

5.2 Predictions for each level of cyanoHABs: performance and applicability of machine learning models

Four models, Random Forest (RF), XGBoost (XGB), boosted regression trees (BRT), and stacking RF + XGB, were applied for stage-based predictions according to cyanoHAB intensity levels. Overall, RF and XGB demonstrated high performance, while the stacking model (RF + XGB), based on these two models, exhibited the most consistent and accurate predictive power across all levels. All models except BRT were stable at the low level, whereas the stacking model maintained high suitability even under complex environmental conditions at the moderate and high levels. Stacking ensured consistent agreement between measurements and predictions by integrating the strengths of RF and XGB, thereby effectively capturing nonlinear interactions among variables. These results indicate that variable selection combined with an ensemble (stacking) application, tailored to stage-based characteristics, is an efficient approach for predicting each cyanoHAB level.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author contributions

HC: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review and editing. TK: Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – review and editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication.

Acknowledgements

This work was supported by a grant from the National Institute of Environmental Research (NIER), funded by the Ministry of Environment (ME) of the Republic of Korea (NIER-2024-01-01-075).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fenvs.2025.1716967/full#supplementary-material

References

Abirhire, O., Davies, J. M., Imtiazy, N., Hunter, K., Emmons, S., Beadle, J., et al. (2023). Response of phytoplankton community composition to physicochemical and meteorological factors under different hydrological conditions in Lake Diefenbaker. Sci. Total Environ. 856 (2), 159210. doi:10.1016/j.scitotenv.2022.159210

Adnan, R. M., Mostafa, R. R., Wang, M., Parmar, K. S., Kisi, O., and Zounemat-Kermani, M. (2025). Improved random vector functional link network with an enhanced remora optimization algorithm for predicting monthly streamflow. J. Hydrology 650, 132496. doi:10.1016/j.jhydrol.2024.132496

Agasild, H., Zingel, P., and Nõges, T. (2012). Live labeling technique reveals contrasting role of crustacean predation on microbial loop in two large shallow lakes. Hydrobiologia 684, 177–187. doi:10.1007/s10750-011-0981-0

Agasild, H., Blank, K., Haberman, J., Tuvikene, L., Zingel, P., Noges, P., et al. (2024). Interactive effects shape the dynamics of Chydorus sphaericus (O.F. Muller, 1776) population in a shallow eutrophic lake. Hydrobiologia 852, 341–357. doi:10.1007/s10750-024-05612-4

Bertani, I., Steger, C. E., Obenour, D. R., Fahnenstiel, G. L., Bridgeman, T. B., Johengen, T. H., et al. (2017). Tracking cyanobacteria blooms: do different monitoring approaches tell the same story? Sci. Total Environ. 575, 294–308. doi:10.1016/j.scitotenv.2016.10.023

Breiman, L. (2001). Random forests. Mach. Learn. 45 (1), 5–32. doi:10.1023/A:1010933404324

Cao, H., Recknagel, F., and Bartkow, M. (2016). Spatially-explicit forecasting of cyanobacteria assemblages in freshwater lakes by multi-objective hybrid evolutionary algorithms. Ecol. Model. 342, 97–112. doi:10.1016/j.ecolmodel.2016.09.024

Chapra, S. C., Boehlert, B., Fant, C., Bierman, V. J., Henderson, J., Mills, D., et al. (2017). Climate change impacts on harmful algal blooms in U.S. freshwaters: a screening-level assessment. Environ. Sci. Technol. 51, 8933–8943. doi:10.1021/acs.est.7b01498

Chen, M., Ding, S., Chen, X., Sun, Q., Fan, X., Lin, J., et al. (2018). Mechanisms driving phosphorus release during algal blooms based on hourly changes in iron and phosphorus concentrations in sediments. Water Res. 133, 153–164. doi:10.1016/j.watres.2018.01.040

Choi, J., Min, J. O., Choi, B., Kim, D., Kang, J. J., Lee, S. H., et al. (2020). Key factors controlling primary production and cyanobacterial harmful algal blooms (cHABs) in a continuous Weir system in the Nakdong River, Korea. Sustainability 12 (15), 6224. doi:10.3390/su12156224

Chung, H., Son, M., Ryu, H. S., Park, C. H., Lee, R., Cho, M., et al. (2019). Variation of cyanobacteria occurrence pattern and environmental factors in Lake Juam. Korean J. Environ. Biol. 37 (4), 640–651. doi:10.11626/KJEB.2019.37.4.640

Chung, H. S., Son, M. S., Kim, T. S., Park, J. H., and Lee, W. S. (2024). Correlations between spatiotemporal variations in phytoplankton community structure and physicochemical parameters in the Seungchon and Juksan Weirs. Water 16 (20), 2976. doi:10.3390/w16202976

Descy, J. P., Leprieur, F., Pirlot, S., Leporcq, B., Van Wichelen, J., Peretyatko, A., et al. (2016). Identifying the factors determining blooms of cyanobacteria in a set of shallow lakes. Ecol. Inf. 34, 129–138. doi:10.1016/j.ecoinf.2016.05.003

Devi, A., Chiu, Y.-T., Hsueh, H.-T., and Lin, T.-F. (2021). Quantitative PCR based detection system for cyanobacterial geosmin/2-methylisoborneol (2-MIB) events in drinking water sources: current status and challenges. Water Res. 188, 116478. doi:10.1016/j.watres.2020.116478

Ding, S., Liu, Y., Dan, S. F., and Jiao, L. (2021). Historical changes of sedimentary P-binding forms and their ecological driving mechanism in a typical “grass-algae” eutrophic lake. Water Res. 204, 117604. doi:10.1016/j.watres.2021.117604

Ge, P., Yang, O., Hua, X., Chen, Z., He, J., Liu, Z., et al. (2024). Predicting bond strength of corroded reinforced concrete after high-temperature exposure: A stacking model and feature selection. Constr. Build. Mater. 456, 139290. doi:10.1016/j.conbuildmat.2024.139290

Ger, K. A., Hansson, L. A., and Lürling, M. (2014). Understanding cyanobacteria-zooplankton interactions in a more eutrophic world. Freshw. Biol. 59, 1783–1798. doi:10.1111/fwb.12393

Ger, K. A., Naus-Wiezer, S., Meester, L. R., and Lürling, M. (2019). Zooplankton grazing selectivity regulates herbivory and dominance of toxic phytoplankton over multiple prey generations. Limnol. Oceanogr. 64, 1214–1227. doi:10.1002/lno.11108

Han, J. W., Kim, T., Lee, S., Kang, T., and Im, J. K. (2024). Machine learning and explainable AI for chlorophyll-a prediction in Namhan River Watershed, South Korea. Ecol. Indic. 166, 112361. doi:10.1016/j.ecolind.2024.112361

Hecht, J. S., Zia, A., Clemins, P. J., Schroth, A. W., Winter, J. M., Oikonomou, P. D., et al. (2022). Modeling the sensitivity of cyanobacteria blooms to plausible changes in precipitation and air temperature variability. Sci. Total Environ. 812, 151586. doi:10.1016/j.scitotenv.2021.151586

Huang, H., Wang, W., Lv, J., Liu, Q., Liu, X., Xie, S., et al. (2022). Relationship between chlorophyll a and environmental factors in lakes based on the random forest algorithm. Water 14 (19), 3128. doi:10.3390/w14193128

Jo, B. G., Jung, W. S., Nam, S. H., and Kim, Y. D. (2023). Prediction of cyanobacteria using decision tree algorithm and sensor monitoring data. Appl. Sci. 13 (22), 12266. doi:10.3390/app132212266

Johnke, J., Boenigk, J., Harms, H., and Chatzinotas, A. (2017). Killing the killer: predation between protists and predatory bacteria. FEMS Microbiol. Lett. 364. doi:10.1093/femsle/fnx089

Kim, D., and Shin, C. (2021). Algal boom characteristics of Yeongsan River based on weir and estuary dam operating conditions using EFDC-NIER model. Water 13 (16), 2295. doi:10.3390/w13162295

Kim, K., Mun, H., Shin, H., Park, S., Yu, C., Lee, J., et al. (2020). Nitrogen stimulates Microcystis-dominated blooms more than phosphorus in river conditions that favor non-nitrogen-fixing genera. Environ. Sci. Technol. 54 (12), 7185–7193. doi:10.1021/acs.est.9b07528

Kim, D., Sung, J. W., Kim, T. H., Choi, H. M., Kim, J., and Park, H. J. (2023). Comparative seasonality of phytoplankton community in two contrasting temperate estuaries on the western coast of Korea. Front. Mar. Sci. 10, 1257904. doi:10.3389/fmars.2023.1257904

Kim, Y. H., Cho, I. H., Kim, H. K., Hwang, E. A., Han, B. H., and Kim, B. H. (2024). Assessing the impact of weirs on water quality and phytoplankton dynamics in the South Han River: a two-year study. Water 16 (6), 833. doi:10.3390/w16060833

Kosiba, J., and Krztoń, W. (2022). Insight into the role of cyanobacterial bloom in the trophic link between ciliates and predatory copepods. Hydrobiologia 849, 1195–1206. doi:10.1007/s10750-021-04780-x

Krammer, K., and Lange-Bertalot, H. (2007). “Bacillariophyceae 1. Teil: Naviculaceae,” in Süßwasserflora von Mitteleuropa, Band 2/1. Editors H. Ettl, J. Gerloff, H. Heying, and D. Mollenhauer (Berlin, Germany: Elsevier Book Co).

Kwak, S. D., Choi, J. W., and An, K. G. (2016). Chemical water quality and fish component analyses in the periods of before- and after-the weir constructions in Yeongsan River. J. Ecol. Environ. 39, 99–110. doi:10.5141/ecoenv.2016.011

Kwon, Y., Kim, J., Choi, J., Kim, T., Cha, S. M., and Kwon, S. (2024). Assessment of the impacts of constructing artificial structures on the water quality and hydrological environment of a meandering river. Water Environ. Res. 96 (9), e11120. doi:10.1002/wer.11120

Lee, B., Im, J. K., Han, J. W., Kang, T., Kim, W., Kim, M., et al. (2024). Multiple remotely sensed datasets and machine learning models to predict chlorophyll a concentration in the Nakdong River, South Korea. Environ. Sci. Pollut. Res. 31, 58505–58526. doi:10.1007/s11356-024-35005-y

Leitão, E., Ger, K. A., and Panosso, R. (2018). Selective grazing by a tropical copepod (Notodiaptomus iheringi) facilitates Microcystis dominance. Front. Microbiol. 9, 301. doi:10.3389/fmicb.2018.00301

Li, D., He, P., Liu, C., Xu, J., Hou, L., Gao, X., et al. (2022). Quantitative relationship between cladocera and cyanobacteria: a study based on field survey. Front. Ecol. Evol. 10, 915787. doi:10.3389/fevo.2022.915787

Li, H., Qin, C., He, W., Sun, F., and Du, P. (2021). Improved predictive performance of cyanobacterial blooms using a hybrid statistical and deep-learning method. Environ. Res. Lett. 16 (12), 124045. doi:10.1088/1748-9326/ac302d

Li, Y., Fang, L., Cao, G., Mi, W., Lei, C., Zhu, K., et al. (2024). Reservoir regulation-induced variations in water level impacts cyanobacterial bloom by the changing physiochemical conditions. Water Res. 259, 121836. doi:10.1016/j.watres.2024.121836

Li, J., An, X., Li, Q., Wang, C., Yu, H., Zhou, X., et al. (2022). Application of XGBoost algorithm in the optimization of pollutant concentration. Atmos. Res. 276, 106238. doi:10.1016/j.atmosres.2022.106238

Ly, Q. V., Tong, N. A., Lee, B. M., Nguyen, M. H., Trung, H. T., Nguyen, P. L., et al. (2023). Improving algal bloom detection using spectroscopic analysis and machine learning: a case study in a large artificial reservoir, South Korea. Sci. Total Environ. 901, 166467. doi:10.1016/j.scitotenv.2023.166467

Mac Nally, R. (2002). Multiple regression and inference in ecology and conservation biology, further comments on identifying important predictor variables. Biodivers. Conserv. 11 (8), 1397–1401. doi:10.1023/A:1016250716679

Masaru, A., Teru, I., Kozo, I., Hideo, K., Shigeru, K., Hiromu, K., et al. (1997). Illustration of the Japanese fresh-water algae. Tokyo, Japan: Uchidarokakuho Publishing Company.

ME (2017). Standard method for water and wastewater. Sejong, Republic of Korea: Ministry of Environment.

MOE (2011). Standard method for the examination of water pollution. Sejong, Republic of Korea: Ministry of Environment.

Naselli-Flores, L., Zohary, T., and Padisák, J. (2020). Life in suspension and its impact on phytoplankton morphology: an homage to Colin S. Reynolds. Hydrobiologia 848, 7–30. doi:10.1007/s10750-020-04217-x

Paerl, H. W. (2018). Mitigating toxic planktonic cyanobacterial blooms in aquatic ecosystems facing increasing anthropogenic and climatic pressures. Toxins 10 (2), 76. doi:10.3390/toxins10020076

Rad, M., Abtahi, A., Berndtsson, R., McKnight, U. S., and Aminifar, A. (2024). Interpretable machine learning for predicting the fate and transport of pentachlorophenol in groundwater. Environ. Pollut. 345, 123449. doi:10.1016/j.envpol.2024.123449

Simonsen, R. (1979). The diatom system: ideas on phylogeny. Bacillaria 2, 9–71. Available online at: https://scholar.google.com/scholar?q=The+diatom+system%3A+ideas+on+phylogeny+Simonsen+1979.

Son, M. S., Chung, H. S., Park, C. H., Park, J. H., Lim, C. H., and Kim, K. H. (2018). The change of phytoplankton community structure and water quality in the Juksan weir of the Yeongsan River watershed. Korean J. Environ. Biol. 36, 591–600. doi:10.11626/KJEB.2018.36.4.59

Song, J., Jiang, W., Xin, L., and Zhang, X. (2024). Predicting the temporal-spatial distribution of chlorophyll-a in the Yellow River Estuary using explainable machine learning. Estuar. Coast. Shelf Sci. 304, 108820. doi:10.1016/j.ecss.2024.108820

Sultana, S., Khan, S., Rahman, Z., Hena, S. M., Ahmed, M. S., Haque, M. M., et al. (2024). Influence of environmental factors on the dynamics and toxicology of Microcystis and Anabaena in eutrophic ponds. Aquac. Res. 2024, 8826738. doi:10.1155/2024/8826738

Szewczyk, T. M., Aleynik, D., and Davidson, K. (2025). Ensemble models improve near-term forecasts of harmful algal bloom and biotoxin risk. Harmful Algae 142, 102781. doi:10.1016/j.hal.2024.102781

Tao, Y., Ren, J., Zhu, H., Li, J., and Cui, H. (2024). Exploring spatiotemporal patterns of algal cell density in Lake Dianchi with explainable machine learning. Environ. Pollut. 356, 124395. doi:10.1016/j.envpol.2024.124395

Van Hassel, W. H. R., Andjelkovic, M., Durieu, B., Marroquin, V. A., Masquelier, J., Huybrechts, B., et al. (2022). Summer of cyanobacterial blooms in Belgian waterbodies: Microcystin quantification and molecular characterizations. Toxins 14, 61. doi:10.3390/toxins14010061

Visser, H., Evers, N., Bontsema, A., Rost, J. de., Niet, A., Vethman, P., et al. (2021). What drives the ecological quality of surface waters? A review of 11 predictive modeling tools. Water Res. 208, 117851. doi:10.1016/j.watres.2021.117851

Walsh, C., and Mac Nally, R. (2013). Hier.part: hierarchical partitioning. R package version 1.0-4. R project for statistical computing. Vienna: R Project. Available online at: http://CRAN.R-project.org/package=hier.part.

Wang, L., Zhu, Z., Sassoubre, L., Yu, G., Liao, C., Hu, Q., et al. (2021). Improving the robustness of beach water quality modeling using an ensemble machine learning approach. Sci. Total Environ. 765, 142760. doi:10.1016/j.scitotenv.2020.142760

Wen, S., Zhong, J., Li, X., Liu, C., Yin, H., Li, D., et al. (2020). Does external phosphorus loading diminish the effect of sediment dredging on internal phosphorus loading? An in-situ simulation study. J. Hazard. Mater. 394, 122548. doi:10.1016/j.jhazmat.2020.122548

Yan, J., Jia, S., Lv, A., and Zhu, W. (2019). Water resources assessment of China’s transboundary river basins using a machine learning approach. Water Resour. Res. 55 (1), 632–655. doi:10.1029/2018WR023044

Zhao, K., Wang, L., You, Q., Zhang, J., Pang, W., and Wang, Q. (2022). Impact of cyanobacterial bloom intensity on plankton ecosystem functioning measured by eukaryotic phytoplankton and zooplankton indicators. Ecol. Indic. 140, 109028. doi:10.1016/j.ecolind.2022.109028

Zohary, T., Flaim, G., and Sommer, U. (2021). Temperature and the size of freshwater phytoplankton. Hydrobiologia 848 (1), 143–155. doi:10.1007/s10750-020-04246-6

Keywords: cyanoHABs, ecological dynamics, hierarchical partitioning analysis (HPA), Juksan Weir, Korea, machine learning models, stacking ensemble, stage-based prediction

Citation: Chung H and Kim T (2026) Analysis of key factors influencing cyanoHAB intensity levels and evaluation of machine learning predictions. Front. Environ. Sci. 13:1716967. doi: 10.3389/fenvs.2025.1716967

Received: 01 October 2025; Accepted: 23 December 2025;
Published: 12 January 2026.

Edited by:

Yang Song, University of Michigan, United States

Reviewed by:

Russell Milne, University of Alberta, Canada
Wang Tian, North China Electric Power University, China

Copyright © 2026 Chung and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Taesung Kim, kimts3@korea.kr
