
ORIGINAL RESEARCH article

Front. Water, 29 January 2026

Sec. Water and Artificial Intelligence

Volume 8 - 2026 | https://doi.org/10.3389/frwa.2026.1719097

Multi-pollutant prediction and process parameter optimization of a wastewater treatment plant based on machine learning models


Hairong Chen1, Qiang Zhang1*, Jinge Xie1, Kaixuan Wang2, Wen Yue2
  • 1Langfang Qingquan Water Supply Co., Ltd., Langfang, Hebei, China
  • 2Civil Engineering Department, North China Institute of Aerospace Engineering, Langfang, China

Conventional wastewater treatment operation, heavily reliant on manual expertise and offline monitoring, suffers from response delays, copes poorly with load fluctuations, and leads to high resource consumption. To overcome these challenges, this study established a data-driven multi-pollutant prediction model using three years of daily monitoring data from a wastewater treatment plant (WWTP). The model integrates data cleaning, advanced feature engineering, multi-dimensional intelligent feature selection, and an ensemble learning strategy. Furthermore, combined with nitrification/denitrification mechanisms, a back-calculation model employing Particle Swarm Optimization-Support Vector Regression (PSO-SVR) was developed to predict optimal aeration intensity and carbon source dosage. The prediction model achieved R2 values of 0.96 for total nitrogen (TN), 0.94 for total phosphorus (TP), 0.91 for ammonia nitrogen (NH3-N), 0.92 for influent wastewater volume (Qw), and 0.75 for chemical oxygen demand (COD). The back-calculation models also demonstrated high precision, with test set R2 of 0.94 for aeration rate and 0.96 for carbon dosage. Additionally, this strategy achieved an estimated 15–20% aeration energy savings and reduced carbon source overdosing to below 5%, while ensuring stable effluent compliance. This closed-loop approach of “pollutant concentration prediction → process parameter back-calculation” dynamically responds to fluctuations, enabling quantitative and refined WWTP management, thereby demonstrating significant practical impact for improving treatment efficiency while reducing energy and resource consumption.

1 Introduction

In the context of aquatic ecosystem preservation and sustainable water resource management, wastewater treatment plants (WWTPs) serve as a critical barrier for reducing pollution. Their ability to ensure effluent compliance while minimizing operational costs relies heavily on precise process monitoring and control (Miao et al., 2021). Within this framework, accurate forecasting of key water quality parameters, specifically total nitrogen (TN), ammonia nitrogen (NH3-N), total phosphorus (TP), and chemical oxygen demand (COD), is essential for implementing dynamic control strategies and optimizing resource allocation (Xie et al., 2022).

However, current operational practices rely largely on manual experience and offline monitoring. This reliance creates significant limitations: response delays spanning hours to days are common, and plants cannot respond flexibly to sudden load fluctuations caused by industrial discharges or seasonal rainfall. Such inefficiencies frequently result in unstable treatment performance and the persistent overuse of critical resources, particularly aeration energy and supplemental carbon sources (Cardoso et al., 2021).

To address these operational challenges, recent research has focused on two main modeling approaches, mechanistic models and data-driven methods. Mechanistic models, such as Activated Sludge Models, are based on biochemical kinetics and provide valuable insights into core processes like nitrification and denitrification (Wu et al., 2023). However, their application is limited by complex calibration requirements, such as determining maximum specific growth rates, which demand extensive experimental validation. Furthermore, their inability to adapt to varying process configurations and microbial community shifts often limits predictive accuracy to below 80% in field applications (Huang et al., 2025), making them insufficient for real-time control. Conversely, the rise of data-driven machine learning has unlocked new potential for capturing the complex nonlinear relationships in wastewater quality dynamics (Kamalov et al., 2023). Despite this potential, practical application faces two major obstacles. Firstly, the dependence on high-frequency, high-quality data imposes high costs, with online sensor networks potentially consuming 15%−20% of a facility's budget (Kusuda and Moriyama, 1994). Secondly, existing models often suffer from poor adaptability. They are frequently tailored to specific predictors or unique facility configurations, lacking broader generalization capabilities (Jain et al., 2024). While algorithmic improvements, such as integrating Adaptive-Network-Based Fuzzy Inference System or multiple-kernel Support Vector Regression (SVR), have increased accuracy (Najafzadeh et al., 2016; Najafzadeh and Niazmardi, 2021), these solutions typically remain tied to specific targets, lacking a unified framework for multi-pollutant forecasting.

Addressing these gaps, this study proposed a high-precision, highly generalizable predictive framework. Using a 3-year dataset of daily monitoring records, a “generic” modeling architecture was established to predict distinct pollutants (Najafzadeh and Zeinolabedini, 2019; Zeinolabedini and Najafzadeh, 2019). This framework integrated robust data cleaning, advanced feature engineering, intelligent feature selection, and ensemble learning strategies. Initial validation focused on TN prediction, with subsequent extensions to NH3-N, TP, COD, and influent volume (Najafzadeh and Zeinolabedini, 2018) to rigorously evaluate cross-parameter generalizability. Crucially, this research moved beyond simple prediction to enable process optimization. By coupling a Particle Swarm Optimization-SVR (PSO-SVR) model with mechanistic principles of nitrification and denitrification, a process parameter inversion model was constructed. This allowed for the quantitative determination of optimal aeration intensity and carbon source dosage. The resulting “pollutant prediction–process inference” closed-loop mechanism provided direct decision support for operational control. Ultimately, this approach offers a viable technical pathway for reducing energy consumption and material waste, aligning WWTP operations with China's broader “dual carbon” strategic objectives.

2 Materials and methods

2.1 Study area and data source

The data for this study originated from the continuous monitoring records of a wastewater treatment plant located in Langfang City, Hebei Province, China, spanning the period from January 1, 2021 to December 31, 2023, and comprising a total of 1,074 daily records. The dataset primarily encompassed two major categories: water quality parameters, including Qw, concentrations of COD, NH3-N, TN, TP, and pH; concurrent meteorological and environmental parameters, such as daily maximum/minimum temperature, weather conditions, wind speed and direction, and air quality index (AQI), which provided essential evidence for analyzing the potential influence of the external environment on water quality variations. The descriptive statistics of the key parameters are presented in Table 1.

Table 1

Table 1. Descriptive statistics of key water quality and meteorological/environmental parameters (2021–2023).

2.2 Data preprocessing

To construct robust and high-precision prediction models, this study implemented a systematic data preprocessing workflow, as illustrated in Figure 1. Firstly, data cleaning and denoising were performed. The missing values of pH and AQI were processed using the median imputation method. Thirteen outliers in TN concentration were identified and replaced through the interquartile range method. The raw TN time series data contained high-frequency noise due to sensor fluctuations and environmental interference. To mitigate this noise while preserving the essential features of the signal, such as the height and width of load peaks, a Savitzky-Golay filter was applied. Unlike simple moving average filters, the Savitzky-Golay method fits a low-degree polynomial to adjacent data points, effectively smoothing the data without distorting the signal trend. In this study, the filter was implemented with a window size of 11 and a polynomial order of 3. Subsequently, advanced feature engineering was carried out, involving the construction of time features, lag features, sliding window statistical features, and interaction features, thus generating a high-dimensional space containing 161 candidate features. Thereafter, multi-dimensional intelligent feature selection was implemented, integrating Pearson Correlation, F-test, Recursive Feature Elimination, and Random Forest importance evaluation to screen out 60 optimal features. Finally, the dataset was divided in chronological order, with the last 60 days set as the test set and the rest as the training set. Robust standardization was applied to the features to eliminate differences in dimensions (Hyndman and Athanasopoulos, 2018; Pedregosa et al., 2011).

Figure 1
Flowchart detailing a data preprocessing workflow. It starts with raw data, sorted by detection time, followed by outlier detection and replacement using the median. A Savitzky-Golay filter is applied. Feature engineering includes temporal, lag, and statistical features, alongside rolling statistics, trend, and variation features. Intelligent feature selection combines correlation analysis, statistical methods, RFE, and Random Forest, reducing features from 161 to 60. The process concludes with preprocessing completed and 60 optimized features ready for modeling.

Figure 1. Preprocessing flow chart.
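The cleaning and splitting steps described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the 1.5 × IQR fence is the conventional default rather than a value stated in the article, and the filter uses the tabulated 11-point cubic Savitzky-Golay weights while leaving the first and last five points unsmoothed (library implementations usually treat the edges differently).

```python
import statistics

# Tabulated 11-point Savitzky-Golay smoothing weights for a cubic
# (equivalently quadratic) local fit; the integers sum to 429.
SG_11_3 = [-36, 9, 44, 69, 84, 89, 84, 69, 44, 9, -36]
SG_NORM = 429


def replace_outliers_iqr(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with the series median."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]


def savgol_smooth(values):
    """Smooth interior points with window 11 and polynomial order 3.

    Assumption of this sketch: the first and last 5 points are returned as-is.
    """
    half = len(SG_11_3) // 2
    smoothed = list(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        smoothed[t] = sum(c * v for c, v in zip(SG_11_3, window)) / SG_NORM
    return smoothed


def chronological_split(rows, test_days=60):
    """Keep time order: the last `test_days` records form the test set."""
    return rows[:-test_days], rows[-test_days:]


# Hypothetical TN readings (mg/L) with one obvious spike.
tn = [5, 6, 5, 6, 5, 6, 5, 6, 50]
cleaned = replace_outliers_iqr(tn)
train, test = chronological_split(list(range(100)))
```

Because a cubic local fit reproduces any cubic trend exactly, the filter leaves smooth load peaks intact while damping point-to-point noise, which is the property the text relies on.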

Further analysis of the feature selection results specifically highlights the physical interpretability of the lag features. Consistent across the prediction of pollutants (NH3-N, TN, COD), the immediate lag features (e.g., Lag 1 and Lag 2) were frequently identified as critical inputs.

This pattern correlates strongly with the system's hydraulic retention time (HRT) and biochemical reaction inertia. Since the biological treatment process requires a specific retention time (typically 10–20 hours for oxidation ditches) to degrade pollutants, the water quality state at the previous time step (t-1) exerts the most direct influence on the current state (t).

Additionally, the selection of Lag 7 (where applicable) captures the periodicity of influent characteristics driven by human activity cycles (i.e., weekday vs. weekend variations). This confirms that the PSO-SVR model effectively learned the underlying temporal dynamics and hydraulic characteristics of the wastewater treatment plant, rather than merely fitting numerical data.
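As a rough illustration of the lag features discussed above, the sketch below builds Lag 1, Lag 2, and Lag 7 inputs plus one sliding-window mean from a daily series. The function name, the dictionary layout, and the 3-day window are assumptions made for this example; the article does not specify how its 161 candidate features were coded.

```python
def build_lag_rows(series, lags=(1, 2, 7), window=3):
    """Return one feature row per day that has a full history available.

    Each row holds the target at day t, the values at t - lag for every lag,
    and the mean of the `window` days strictly before t (no leakage of day t).
    """
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(series)):
        row = {"target": series[t]}
        for lag in lags:
            row[f"lag_{lag}"] = series[t - lag]
        past = series[t - window:t]
        row[f"roll_mean_{window}"] = sum(past) / window
        rows.append(row)
    return rows


# Hypothetical daily values; a real series would be the cleaned TN record.
rows = build_lag_rows(list(range(10)))
```

The first seven days are dropped because Lag 7 is undefined for them, so a long lag shortens the usable training record by the same number of days.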

2.3 Construction and optimization of the generic prediction model

To construct an accurate and robust generic prediction framework, this study adopted an ensemble learning strategy, strictly selecting and combining multiple machine learning models with distinct theoretical foundations based on the inherent characteristics of wastewater treatment data to enhance overall performance. To address the high correlation prevalent among water quality parameters (e.g., COD and TN) and sporadic sensor noise, Ridge Regression and Elastic Net were selected to effectively mitigate multicollinearity issues, while the Huber Regressor was introduced to enhance model robustness by leveraging its insensitivity to outliers. Regarding the management of nonlinear heterogeneity, Random Forest (RF) and Gradient Boosting Decision Tree (GBDT) were employed to represent Bagging and Boosting strategies, respectively. The former reduces variance through bootstrap sampling, whereas the latter captures complex biochemical reaction patterns by iteratively correcting residuals. Furthermore, given the data magnitude in this study, the standard GBDT is sufficient to guarantee convergence performance while forming a beneficial complementarity with linear base models. Building upon this foundation, a two-layer Stacking architecture was implemented, wherein the predictive outputs of the aforementioned base models served as new features for the second-layer meta-learner (Ridge Regression). By automatically learning weight combinations and correcting systematic biases, this architecture realizes the complementary advantages of diversified models, ultimately yielding comprehensive prediction results with generalization capabilities and accuracy superior to those of any single base learner (Aljarah and Çetin, 2022; Takalo-Mattila et al., 2022).

The two-layer Stacking strategy operates as follows:

Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be the dataset. The first layer consists of $M$ base models (Ridge, Elastic Net, Huber, RF, GBDT). Let $f_m(x)$ denote the prediction of the $m$-th base model. The input features for the second-layer meta-learner (Ridge Regression) are constructed as:

$$X_{meta} = \{f_1(x), f_2(x), \ldots, f_M(x)\}$$

The final prediction $\hat{y}_{stacking}$ is generated by the meta-learner minimizing the regularized squared error:

$$\min_{\omega, b} \sum_{i=1}^{N} \left(y_i - (\omega^{T} X_{meta,i} + b)\right)^2 + \lambda \|\omega\|_2^2$$

where $\omega$ is the weight vector assigned to each base model's output, $b$ is the bias, and $\lambda$ is the regularization strength.
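The meta-learner objective above has a closed-form solution, which the following sketch makes concrete. It is a minimal illustration under stated assumptions: the helper names are hypothetical, the bias is left unpenalized (the usual Ridge convention), and the base-model predictions are toy numbers. In practice the rows of `base_preds` would be out-of-fold predictions from the five base learners.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge_meta(base_preds, y, lam):
    """Minimize sum_i (y_i - (w.x_i + b))^2 + lam * ||w||^2 over w and b."""
    n, k = len(y), len(base_preds[0])
    col_means = [sum(row[j] for row in base_preds) / n for j in range(k)]
    y_mean = sum(y) / n
    xc = [[row[j] - col_means[j] for j in range(k)] for row in base_preds]
    yc = [v - y_mean for v in y]
    # Normal equations on centered data: (Xc^T Xc + lam * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(r[j] * t for r, t in zip(xc, yc)) for j in range(k)]
    w = solve_linear(gram, rhs)
    b = y_mean - sum(wj * mj for wj, mj in zip(w, col_means))
    return w, b


# Toy example: two base models, target built as 0.3*f1 + 0.7*f2 + 1.
base_preds = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [0.3 * f1 + 0.7 * f2 + 1.0 for f1, f2 in base_preds]
w_plain, b_plain = fit_ridge_meta(base_preds, y, lam=0.0)
w_ridge, b_ridge = fit_ridge_meta(base_preds, y, lam=10.0)
```

With λ = 0 the weights recover the generating combination. A positive λ shrinks them, which is how the meta-learner avoids leaning too heavily on highly correlated base models.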

Firstly, taking TN prediction as an example, strict time-series cross-validation was used to train and evaluate the model on the training set. With 5-fold time-series partitioning, each validation fold followed its training fold in time, which simulated real forecasting conditions and avoided data leakage. For each base learner, grid search was applied to find the optimal combination within the hyperparameter space, with the coefficient of determination (R2), mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) serving as evaluation metrics (Masini et al., 2023). The specific formulas and value ranges for these metrics are defined as follows:

Coefficient of Determination (R2): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$

Range: (−∞, 1]. An R2 score of 1 indicates a perfect fit, and values closer to 1 represent better model performance.

Mean Absolute Error (MAE): $MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$

Range: [0, +∞). MAE measures the average magnitude of errors; a lower value indicates better accuracy.

Mean Squared Error (MSE): $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Range: [0, +∞). MSE penalizes larger errors more severely than MAE.

Root Mean Squared Error (RMSE): $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

Range: [0, +∞). RMSE is in the same unit as the target variable, with lower values indicating better fit. Here n is the number of samples, $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values.
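The four metrics and the time-ordered folds can be written out directly, as in the sketch below. The function names and the equal-sized expanding-window layout are assumptions for illustration; the article does not state how its five folds were sized.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, and RMSE for paired observations and predictions.

    Assumes y_true is not constant (otherwise R2 is undefined).
    """
    n = len(y_true)
    y_bar = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    mse = ss_res / n
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
    }


def expanding_window_folds(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) with validation always later.

    The training window grows fold by fold; every validation block has
    n_samples // (n_splits + 1) points and the last block ends at the final sample.
    """
    fold = n_samples // (n_splits + 1)
    folds = []
    for k in range(n_splits):
        train_end = n_samples - (n_splits - k) * fold
        folds.append((range(0, train_end), range(train_end, train_end + fold)))
    return folds


metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
folds = expanding_window_folds(12, n_splits=5)
```

Because every validation index is larger than every training index in the same fold, no future observation can leak into model fitting.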

Secondly, the generality of the methodology was examined to verify that the full pipeline, from data preprocessing and feature engineering to model integration, is not a scheme specific to TN prediction but a general prediction framework (Mahdavian et al., 2021). To this end, the same model was applied to predict other key water quality indicators: COD, NH3-N, TP, and Qw.

During the multi-pollutant prediction, the input (X) is a 60-dimensional feature vector composed of historical water quality data, meteorological parameters, and temporal interaction features. The output (Y) is the predicted concentration of a specific target pollutant on a future day. The strong correlation of immediate lag features (e.g., the input X observed at time t−1) with the target variable (Y at time t) captures the system's HRT and biochemical reaction inertia. Specifically, since biological treatment requires a specific retention time to degrade pollutants, the HRT imposes a temporal delay between influent disturbances and effluent responses. The biochemical reaction inertia means that the microbial activity state possesses memory, so the system state does not change abruptly. The water quality state at the previous time step therefore serves as the causal input determining the current output state. By capturing these physical-biological mechanisms, the model represents the nonlinear dependencies between historical operating conditions (input) and future water quality fluctuations (output).

2.4 Model for back-calculating process parameters based on predicted pollutant concentrations

Based on the predicted pollutant concentrations and the mechanisms of nitrification and denitrification, a back-calculation model for aeration intensity and carbon source dosage was constructed using a PSO-SVR framework. Predicting multi-pollutant concentrations and process parameters involves high non-linearity and multivariate coupling. SVR was chosen as the base learner due to its strong ability to model non-linear relationships and its superior generalization performance on small-to-medium-sized datasets. However, a key limitation of standard SVR is its sensitivity to hyperparameters (penalty factor C, insensitive loss ε, and kernel parameter γ), which can lead to overfitting or underfitting. To address this, PSO was employed. The PSO-SVR hybrid offers two main strengths: global optimization, effectively exploring the search space to avoid local optima, and robustness in maintaining prediction accuracy under fluctuating operating conditions. Although the training process of the PSO-SVR model incurs higher computational costs compared to simpler models, it offers a significant improvement in prediction accuracy, which can meet the requirements for precise process control. Specifically, the SVR model utilizes a Radial Basis Function kernel, where the optimization process aims to minimize prediction error by identifying the optimal combination of the penalty factor (C), kernel coefficient (γ), and epsilon (ε). To ensure the reproducibility of the proposed model, the specific parameter settings for the PSO-SVR algorithm and the search ranges for these hyperparameters are detailed in Table 2.

Table 2

Table 2. Hyperparameter settings and search ranges for the PSO-SVR model.

The model employs a feature vector comprising nine key variables: NH3-N, TN, COD, Qw, maximum temperature, minimum temperature, average temperature, year, and month. Crucially, the core pollutant indices (NH3-N, TN, and COD) utilize “future predicted values” generated by the preceding general model rather than contemporaneous in-situ measurements. This approach effectively circumvents the inherent time lag associated with traditional feedback control systems. Regarding auxiliary features, wastewater discharge volume and temperature metrics are derived from flow rate predictions and meteorological forecasts, respectively, while temporal features are treated as deterministic variables. Collectively, these inputs comprehensively reflect the substrate concentrations and environmental conditions anticipated for future biochemical reactions. During the process parameter optimization, the relationship between these inputs and the back-calculated outputs is governed by the underlying biological reaction kinetics. Firstly, the input parameters primarily include the predicted ammonia nitrogen (NH3-N) and temperature, and the output is aeration intensity. This relationship follows the theoretical oxygen demand of nitrification (approximately 4.57 g O2/g NH3-N), where the input ammonia load determines the stoichiometric oxygen requirement, and temperature inputs adjust for the metabolic activity rate of nitrifying bacteria. Secondly, the model links the input nitrate (NO3-N, estimated as TN minus NH3-N) and influent volume (Qw) to the output carbon source dosage. This relationship represents the electron donor balance, where the input nitrate load acts as the electron acceptor, dictating the required carbon quantity (theoretical C/N ratio ≈ 2.86) to achieve complete denitrification.

The datasets were randomly divided into three subsets: a training set (70%), a validation set (15%), and a test set (15%). This partition strategy is adopted to ensure a balanced distribution, where the training set is large enough to capture the underlying data; the validation set is used for hyperparameter optimization (specifically for tuning PSO-SVR parameters) and the test set remains strictly unseen to provide an unbiased evaluation of the final model (Hastie et al., 2009). In terms of temporal partitioning, the training set (January 1, 2022–April 13, 2023) utilized historical observed data for model fitting to capture the underlying physical mapping relationships between input features and process parameters. Conversely, during the validation and testing phases (April 13, 2023–October 31, 2023), the model was strictly supplied with predicted values as inputs. This strategy was designed to simulate the efficacy of feedforward control under real-world operating conditions and to rigorously evaluate the model's generalization capability.

The PSO algorithm optimizes the SVR hyperparameters by simulating bird flocking behavior. For a particle $i$ in a $D$-dimensional search space, its position $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ and velocity $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$ are updated iteratively using the following equations:

Velocity update:

$$v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 \left(P_{best,id}^{t} - x_{id}^{t}\right) + c_2 r_2 \left(g_{best,d}^{t} - x_{id}^{t}\right)$$

Position update:

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}$$

Where:

ω: Inertia weight (set to 0.7 as per Table 2).

c1, c2: Learning factors (set to 1.5).

r1, r2: Random numbers in (0, 1).

Pbest: The individual historical best position.

gbest: The global historical best position.
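The update rules above translate into a short loop. The sketch below is illustrative only: it uses the stated ω = 0.7 and c1 = c2 = 1.5, but the swarm size, iteration count, search bounds, and the quadratic stand-in objective are all hypothetical, since Table 2 is not reproduced here. In the article's setting the objective would instead be the validation error of an SVR trained with the candidate (C, γ, ε).

```python
import random


def pso_minimize(objective, bounds, n_particles=30, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search box.
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Hypothetical stand-in for "validation error as a function of
# (log10 C, log10 gamma, epsilon)", with its minimum placed at (1, -2, 0.1).
TARGET = (1.0, -2.0, 0.1)


def toy_validation_error(p):
    return sum((a - b) ** 2 for a, b in zip(p, TARGET))


BOUNDS = [(-1.0, 3.0), (-4.0, 1.0), (0.01, 1.0)]
best_pos, best_val = pso_minimize(toy_validation_error, BOUNDS)
```

Searching C and γ on a log scale is a common design choice because useful values span several orders of magnitude; whether the authors did so is not stated.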

The SVR model maps the input data into a high-dimensional feature space using the Radial Basis Function (RBF) kernel, defined as:

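$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$

where $\gamma$ is the kernel coefficient.

The optimization objective of SVR is to find a function $f$ that minimizes the structural risk, formulated as:

$$\min_{\omega, b, \xi, \xi^{*}} \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^{*})$$

Subject to:

$$\begin{cases} y_i - (\omega^{T}\phi(x_i) + b) \le \varepsilon + \xi_i \\ (\omega^{T}\phi(x_i) + b) - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i, \xi_i^{*} \ge 0 \end{cases}$$

where $C$ is the penalty factor (controlling the trade-off between margin maximization and error minimization), and $\varepsilon$ defines the insensitive zone.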
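The kernel and the role of ε can be checked numerically with a few lines. This is a small sketch with hypothetical function names; it evaluates the RBF kernel and the ε-insensitive loss, whose value for a sample equals the slack (ξ or ξ*) that sample needs at the optimum of the constrained problem.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y, f, eps):
    """Zero inside the tube |y - f| <= eps, linear in the excess outside it."""
    return max(0.0, abs(y - f) - eps)


same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
unit_apart = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)
inside_tube = epsilon_insensitive_loss(1.0, 1.05, eps=0.1)
outside_tube = epsilon_insensitive_loss(1.0, 1.5, eps=0.1)
```

A larger γ makes the kernel decay faster with distance, so each training point influences a smaller neighborhood. A larger ε widens the tube and ignores more small residuals. Both effects explain why the fit is sensitive to these hyperparameters.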

Based on the biochemical principles of nitrification and denitrification in wastewater treatment, the model construction adhered to the following core constraints. The aeration intensity back-calculation model was based on nitrification kinetics. It took real-time features such as the predicted influent NH3-N concentration (NH3-Npred), water temperature (T), and pH as inputs to the nonlinear function f1. By fitting the quantitative relationship between oxygen demand and NH3-N conversion, it output the optimal aeration intensity = f1 (NH3-Npred, T, pH). The carbon source dosage back-calculation model was based on the denitrification stoichiometric relationship (theoretical carbon demand ≈ 2.86 g COD/g NO3-N). It took the predicted nitrate nitrogen concentration (NO3-Npred, derived from the difference between the predicted total nitrogen TNpred and ammonia nitrogen NH3-Npred), the target carbon-to-nitrogen ratio (C/Nreq), and the effective utilization rate of the carbon source (η) as inputs to function f2 to calculate the optimal carbon source dosage = f2 (NO3-Npred, C/Nreq, η).

The back-calculation model relies on the following stoichiometric constraints derived from biological reaction kinetics:

Aeration Back-calculation (Nitrification):

$$\mathrm{NH_4^{+} + 2O_2 \rightarrow NO_3^{-} + 2H^{+} + H_2O}$$

Based on the molar mass comparison, the theoretical oxygen demand coefficient is calculated as:

$$R_{O_2/N} = \frac{2 \times 32\ \mathrm{g\,O_2}}{14\ \mathrm{g\,N}} \approx 4.57$$

This coefficient is utilized in function f1 to convert the predicted ammonia load into an oxygen requirement.
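The coefficient and the resulting oxygen requirement are simple to reproduce. The sketch below is an assumption-laden illustration, not the learned function f1: the function name, the example concentration, and the example flow are hypothetical, and it returns only the stoichiometric oxygen demand, without the temperature, pH, or oxygen-transfer effects that the SVR model is meant to capture.

```python
O2_MOLAR_MASS = 32.0  # g O2 per mol
N_MOLAR_MASS = 14.0   # g N per mol

# Two moles of O2 are consumed per mole of ammonium nitrogen oxidized.
R_O2_PER_N = 2 * O2_MOLAR_MASS / N_MOLAR_MASS


def nitrification_oxygen_demand(nh3n_mg_per_l, flow_m3_per_d):
    """Stoichiometric oxygen demand in kg O2/d for full nitrification.

    mg/L equals g/m3, so concentration times flow gives g N/d.
    """
    load_kg_n_per_d = nh3n_mg_per_l * flow_m3_per_d / 1000.0
    return R_O2_PER_N * load_kg_n_per_d


# Hypothetical operating point: 20 mg/L NH3-N at 10,000 m3/d.
demand = nitrification_oxygen_demand(20.0, 10_000.0)
```

At this hypothetical load of 200 kg N/d, the stoichiometric demand is roughly 914 kg O2/d, which is a lower bound on what the aeration system must actually supply.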

Carbon Dosage Back-calculation (Denitrification):

The theoretical carbon source demand (using methanol/acetate equivalent as COD) is based on the electron balance during nitrate reduction:

$$\mathrm{6NO_3^{-} + 5CH_3OH \rightarrow 3N_2 + 5CO_2 + 7H_2O + 6OH^{-}}$$

Converting this to COD equivalents yields the theoretical ratio:

$$R_{COD/N} \approx 2.86\ \mathrm{g\,COD/g\,NO_3\text{-}N}$$

This stoichiometric ratio serves as the baseline constraint for the carbon dosage function f2, adjusted by the utilization efficiency η.
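The 2.86 figure follows from the methanol equation once methanol is expressed as COD (CH3OH + 1.5 O2 → CO2 + 2 H2O, i.e. 48 g COD per mole). The sketch below works this out and applies one plausible form of f2. That form, dose = NO3-N × C/Nreq ÷ η, is an assumption made for illustration, as are the function names and example values; the article does not give the explicit expression.

```python
COD_PER_MOL_METHANOL = 1.5 * 32.0  # g O2 needed to oxidize 1 mol CH3OH
N_MOLAR_MASS = 14.0                # g N per mol

# 5 mol methanol reduce 6 mol nitrate nitrogen.
R_COD_PER_N = 5 * COD_PER_MOL_METHANOL / (6 * N_MOLAR_MASS)


def nitrate_from_predictions(tn_pred, nh3n_pred):
    """Approximate NO3-N as predicted TN minus predicted NH3-N (floored at 0)."""
    return max(tn_pred - nh3n_pred, 0.0)


def carbon_dose_cod(no3n_mg_per_l, cn_req=R_COD_PER_N, eta=1.0):
    """Assumed form of f2: required COD dose in mg/L for a given nitrate level.

    eta is the effective utilization rate of the carbon source (0 < eta <= 1).
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    return no3n_mg_per_l * cn_req / eta


# Hypothetical predictions: TN = 12 mg/L, NH3-N = 2 mg/L, 80% utilization.
no3n = nitrate_from_predictions(12.0, 2.0)
dose = carbon_dose_cod(no3n, eta=0.8)
```

Dividing by η raises the dose above the stoichiometric minimum to account for carbon that is consumed by other pathways or passes through unused.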

To facilitate the practical adoption of the proposed model in WWTPs, addressing the “black box” nature of machine learning is crucial for establishing operational trust. In this study, we employed the SHAP (SHapley Additive exPlanations) method to decode the complex non-linear relationships captured by the model, transforming raw predictions into intelligible operational insights. By quantifying the contribution of each driving factor—such as distinguishing whether a TN spike is driven by a sudden temperature drop or an influent hydraulic shock—SHAP provides the necessary diagnostic clarity. This interpretability not only empowers operators to understand the root causes of process fluctuations but also validates the model's reliability as a decision-making tool. Building upon this robust predictive framework, the study further implements an advanced control strategy.

With energy consumption minimization and effluent compliance rate maximization as the optimization objectives, the PSO-SVR algorithm was employed to perform real-time parameter optimization for functions f1 and f2. Through the closed-loop workflow of “concentration prediction → parameter inference → process regulation → effect validation”, the predicted effluent data was fed back into the model to enable iterative refinement of the parameters for f1 and f2, thereby enhancing the model's adaptability to complex operating conditions.

3 Results and discussion

3.1 Predictability analysis of raw data

High completeness of core variables and stability of sampling frequency are prerequisites for ensuring model reliability (Athanasiadis et al., 2010). To demonstrate the feasibility of this dataset for machine learning prediction, a systematic visual analysis was performed on the TN concentration. As shown in Figure 2, the data completeness of the core target variable (TN) and key predictive features exceeds 99%, with no significant missing values, which is a fundamental prerequisite for reliable model construction. Although there are a small number of missing values in auxiliary variables such as pH, such variables are generally weakly associated with the biological transformation process of TN (Bayram et al., 2014). Therefore, their absence has a limited impact on model performance, and there is no need to rely excessively on imputation methods. This is consistent with the “core feature priority” principle proposed by Utku and Can (2022).

Figure 2
Bar chart titled “Data Completeness Assessment of Key Variables” showing data completeness percentages for various variables. Most variables, including TP, TN, NH3-N, and others, have percentages above 99%, rated as excellent. The pH variable has a completeness below 80%, rated as poor. A legend indicates categories: Excellent (green), Very Good, Good, Fair, and Poor (red).

Figure 2. Data completeness assessment of key variables.

In addition, the sampling frequency of the dataset remains basically stable at once per day (Figure 3), which ensures the temporal continuity required by time-series models (Maryam et al., 2018). As shown in Figure 4, the concentration of TN fluctuates stably during the 3-year monitoring period, with a mean value of 6.2 mg/L and a standard deviation of 1.8 mg/L. Its probability distribution is slightly right-skewed (skewness coefficient = 0.32), and the proportion of outliers is only 1.2%. These results indicate that the data noise is low, which is consistent with the characteristics of high-quality modeling data (Liu and Zhang, 2019). Time series visualization shows that TN concentration has a significant annual seasonal cycle (Figure 5). The average peak concentration in winter (December to February of the following year) reaches 8.5 mg/L, while the average valley concentration in summer (June to August) is 4.1 mg/L. The cycle pattern is clear and can be used as a core input feature for time series models. It also confirms the impact of environmental factors such as temperature and rainfall on the wastewater treatment process (LaMartina et al., 2021), which justifies embedding time features (such as month and quarter) in the model.

Figure 3
Time Series Continuity Visualization shows a green line for sampled data above a red line indicating missing data. The timeline spans from January 2021 to January 2024. Inset details: time span of 1,094 days, actual sampling of 1,074 days, theoretical days were 1,095, a time coverage rate of 98.1%, and 21 missing days. Continuity assessment is marked as excellent.

Figure 3. Visualization analysis of time series.

Figure 4
Histogram showing TN concentration (mg/L) distribution with density on the y-axis. A blue line represents the TN distribution, a red dashed line shows the normal distribution reference, and a vertical red line marks the mean. The data peaks between 4 and 6 mg/L, highlighting higher density in this range.

Figure 4. Distribution of TN.

Figure 5
Time series graph titled “TN Concentration Time Series Analysis” showing TN concentration in milligrams per liter from 2021 to 2024. The blue line represents TN concentrations, the orange line is a 30-day moving average, and the dashed red line shows the trend. TN concentrations fluctuate, with notable peaks and troughs, while the moving average smooths short-term variations. The overall trend appears relatively steady.

Figure 5. Time series analysis of TN concentration.

Analysis of the correlation heatmap (Figure 6) shows that TN concentration has a moderate positive correlation with COD, NH3-N, and TP concentrations, with correlation coefficients of 0.32, 0.47, and 0.34, respectively, and a significant negative correlation with Qw and average temperature. This provides a direct quantitative basis for feature selection: these correlated variables are prioritized for inclusion in the model, which is consistent with the biochemical mechanism of nitrogen transformation in wastewater treatment. The correlation between TN and COD reflects the synergy of pollutant degradation, which is consistent with the regulation mechanism of carbon-nitrogen ratio on denitrification efficiency in the activated sludge process (Zhang, 2020), indicating that multi-feature fusion modeling can effectively improve the prediction accuracy. The radar chart for prediction feasibility (Figure 7), which is based on indicators such as data integrity, temporal continuity, and pattern clarity, shows that the dataset performs excellently on all indicators, confirming its suitability for predictive modeling. The TN concentration dataset possesses the characteristics of high quality and strong regularity, laying a solid foundation for building a high-precision prediction model.

Figure 6
Heatmap showing correlation coefficients between water quality parameters: TN, COD, NH3-N, TP, Qw, and Tavg. Values range from 1.0 to -0.4, with colors from dark red (high correlation) to light blue (low/negative correlation). The strongest correlation is TN with itself at 1.0.

Figure 6. Correlation heatmap of water quality parameters.

Figure 7
Radar chart titled “Prediction Feasibility Assessment” with six axes: Seasonal Features, Temporal Continuity, Data Completeness, Distribution Properties, Outlier Proportion, and Correlation Strength. Each axis is marked at 20%, 40%, 60%, 80%, and 100%. Overall score is 92.3%, indicating “Excellent” feasibility.

Figure 7. Radar chart of prediction feasibility assessment.

Overall, the excellent characteristics of this dataset support the use of complex algorithms for modeling. Considering the existence of nonlinear relationships in the data, this study selects ensemble learning algorithms, which have advantages in capturing nonlinear patterns and inherent resistance to overfitting (Wałęga et al., 2019).

3.2 Results of TN prediction model

After a complete process of model training, validation and testing, the TN prediction model exhibited excellent performance on the test set.

The quantitative evaluation results are as follows. The coefficient of determination (R2) of the model is 0.96, meaning that it explains about 96% of the variance of the target variable, an extremely high goodness of fit. This is higher than the range of 0.85–0.92 reported in similar studies (Zhang et al., 2020), which may be attributed to the high-quality dataset and the selection of an appropriate model architecture. The low error indicators (MAE = 0.2624, MSE = 0.1065, and RMSE = 0.3264) demonstrate the high consistency between predicted and actual values. The prediction accuracy meets the strict requirements for TN control in wastewater treatment plants. For example, the TN limit in the first-class A standard of China's Discharge Standard of Pollutants for Municipal Wastewater Treatment Plant (GB 18918-2002) is 15 mg/L, so the model error is only about 2% of the standard limit, which can support refined operation decisions (Shao et al., 2023).
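As a reading aid, the sketch below shows how the four reported indicators are conventionally computed and checks that the reported numbers are mutually consistent. The function name `regression_metrics` is an assumption for illustration, and the error metrics are assumed to be in mg/L, which the text does not state explicitly.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R2 for paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Consistency check on the values reported in the text.
reported_mae = 0.2624
reported_mse = 0.1065
reported_rmse = 0.3264
tn_limit = 15.0  # mg/L, first-class A limit of GB 18918-2002 cited in the text

rmse_from_mse = math.sqrt(reported_mse)   # should reproduce the reported RMSE
mae_share = reported_mae / tn_limit       # MAE as a fraction of the limit
rmse_share = reported_rmse / tn_limit     # RMSE as a fraction of the limit

print(f"sqrt(MSE) = {rmse_from_mse:.4f}")
print(f"MAE / limit = {mae_share:.1%}, RMSE / limit = {rmse_share:.1%}")
```

Under these assumptions the square root of the reported MSE agrees with the reported RMSE to three decimal places, and both error measures fall close to the "about 2%" of the 15 mg/L limit quoted in the text.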

Figure 8 compares the model's predicted output with the temporal variation of the actual effluent TN concentration. The model not only captures the long-term trend of TN concentration but also follows the short-term peaks and valleys, showing excellent dynamic response capability. In the sewage treatment process, short-term fluctuations of TN concentration are often related to sudden events such as influent shocks and equipment failures, so this dynamic response capability can help operators issue timely warnings and adjust the process (Zounemat-Kermani et al., 2022). In the scatter plot of predicted against actual values (Figure 9), the data points are closely distributed along the diagonal, showing a strong linear correlation. This again confirms the high accuracy of the model, and no obvious systematic deviation is found. It is also consistent with the random distribution of errors in the residual analysis, suggesting that the model does not miss key influencing factors (Zhang and Liu, 2022).
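One simple way to look for the kind of systematic deviation discussed here is to fit a straight line to predicted against actual values and to compute the mean bias. The sketch below is illustrative only: the helper names and the five data points are hypothetical, and the source does not say which diagnostic the authors used.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_bias(actual, predicted):
    """Average signed error; a value far from zero suggests systematic over- or under-prediction."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical TN values in mg/L, for illustration only.
actual = [8.0, 9.0, 10.0, 11.0, 12.0]
predicted = [8.2, 8.9, 10.1, 10.8, 12.0]

slope, intercept = fit_line(actual, predicted)
bias = mean_bias(actual, predicted)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, mean bias = {bias:.3f}")
```

A slope near 1, an intercept near 0, and a mean bias near 0 correspond to the regression line lying close to the perfect-prediction diagonal in a plot such as Figure 9.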

Figure 8
Line chart titled “TN Prediction Model Time Series Comparison” with TN concentration in milligrams per liter on the y-axis and time index on the x-axis. Blue circles represent actual values, while orange squares show predicted values. Both lines follow a similar fluctuating pattern, indicating the model's accuracy across the test set of sixty indices.

Figure 8. Time series regression of TN prediction.

Figure 9
Scatter plot titled “TN Prediction Model: Actual vs Predicted Values” showing TN concentration in milligrams per liter. Blue dots represent data points. The red dashed line indicates perfect prediction, while the green line is the regression line, closely aligning with the dashed line, suggesting strong predictive accuracy.

Figure 9. Scatter plot for comparison of model predicted values and actual values.

Overall, the TN prediction model in this study has shown outstanding performance in terms of prediction accuracy, stability, and interpretability, outperforming most existing research results (Luo et al., 2024; Shao et al., 2023). This fully demonstrates the model's potential in practical applications and provides a reliable tool for the intelligent management and control of TN in WWTPs.

3.3 Results on the universal applicability of the model

To validate the universality of the generic prediction model proposed in this study, the identical modeling workflow was applied to predict TP, NH3-N, and COD concentrations as well as Qw.

The evaluation results in Table 3 show that, except for COD, the R2 values of the three core indicators (TP, NH3-N, and Qw) are all 0.91 or higher, and the prediction error indicators remain low. In the time series regression curves (Figure 10), the predicted and actual curves almost overlap, while in the scatter plots of predicted against actual values (Figure 11) the data points are densely distributed around the diagonal. This close agreement between predictions and measurements further confirms the model's high prediction accuracy. It also indicates that the model can adaptively screen key features without large-scale structural adjustments for different targets, and thus has strong portability.

Table 3. Evaluation indicators for the prediction of various pollutant concentrations by the prediction model.

Figure 10
(a) Time series line graph comparing actual and predicted TP concentrations over 60 time indices. The y-axis represents TP concentration in milligrams per liter, and the x-axis represents the time index. Actual values are marked with blue circles, while predicted values are marked with orange squares. Both lines fluctuate significantly from zero to twelve on the TP concentration scale. (b)Time series graph comparing NH3-N concentration predictions to actual values. The x-axis represents time index, and the y-axis shows concentration in mg/L. Actual values are marked with blue circles, and predicted values are marked with orange squares. Data shows two significant peaks, with actual and predicted values aligning closely. (c) Line graph titled “COD Prediction Model Time Series Comparison” showing actual versus predicted values of COD concentration in milligrams per liter over a test set index. Actual values are depicted with blue circles and vary from approximately 8 to 18, while predicted values are shown with orange squares, closely following the actual values. The graph demonstrates fluctuations and trends in COD concentration across the test period. (d) Line chart comparing actual and predicted values of Qw over a time index. The x-axis represents the time index (test set), and the y-axis represents Qw in cubic meters per day, ranging from 32,000 to 44,000. Blue circles indicate actual values, and orange squares indicate predicted values, with both lines showing similar fluctuations. A legend on the right distinguishes between the two datasets.

Figure 10. Time series regression for pollutant concentration prediction: TP (a), NH3-N (b), COD (c), and Qw (d).

Figure 11
Scatter plot showing the actual versus predicted total phosphorous concentration in milligrams per liter. Blue dots represent data points. A red dashed line indicates perfect prediction (y equals x), while a green solid line represents the regression line. Most data points cluster near 0.04 milligrams per liter, with some spread along the regression line.

Figure 11. Scatter plot for comparison of model predicted values and actual values: TP (a), NH3-N (b), COD (c), and Qw (d).

This result is significantly higher than the performance of cross-indicator prediction models reported in similar studies, where R2 usually ranges from 0.80 to 0.88 (Kuwayama and Olmstead, 2020). Its advantage may stem from the accurate identification of core driving factors for different indicators by the “intelligent feature selection” module in the framework. For instance, for TP and NH3-N, the framework prioritizes retaining operating parameters related to biological adsorption and nitrification, such as TN and COD; whereas for Qw, it focuses on incorporating time-series features related to year and month, as well as environmental factors like maximum temperature, minimum temperature, and average temperature. This adaptive capability is highly consistent with the “feature-target matching theory” proposed by Mattioli et al. (2024).

The outstanding performance of the Qw model is particularly noteworthy. This result indicates that the general model is not only applicable to chemical indicators such as pollutant concentrations but also effective for physical quantities like hydraulic load, which are affected by factors such as pipe network dynamics and users' water consumption habits. This is consistent with the research conclusion of Zhang and Wang (2022), who pointed out that ensemble learning algorithms have inherent advantages in handling nonlinear, multi-factor-driven physical processes, and the gradient boosting tree adopted in this framework can precisely capture the diurnal and weekly periodic patterns of water consumption.

In contrast, the performance of the COD prediction model is weaker, with R2 = 0.75. Although this result is acceptable in most practical scenarios, it is significantly lower than that of the other models. To investigate this accuracy bottleneck, a quantitative analysis of the prediction residuals was conducted. It revealed that the COD error distribution displays a distinct “long-tail” characteristic (kurtosis > 3), implying that while the model fits well during most periods, large deviations occur at specific instances. This phenomenon is attributed to the nature of COD as a comprehensive index reflecting the total amount of reducing substances in water. Its composition encompasses not only readily biodegradable components (S_S), such as carbohydrates and proteins, but also inert, non-biodegradable organic matter (S_I) (Henze et al., 2006). Current routine monitoring data lack fine-grained capture of the specific ratios of these components (e.g., S_I/COD_total). Consequently, when the influent is subjected to shock loads from sudden industrial wastewater discharges, such as those resulting from intermittent manufacturing operations, implicit abrupt changes in the proportion of non-biodegradable components cause the actual COD degradation kinetics to deviate from the model's expectations (Negara, 2023), thereby generating peak prediction errors. This finding aligns with the perspective of Wang et al. (2024), who argue that the composite nature of COD, combined with its susceptibility to latent factors such as microbial community structure and inhibition by toxic substances (Sharma, 2023), renders it one of the most challenging parameters to predict accurately in wastewater treatment.
Therefore, to enhance COD prediction, future research could consider incorporating a pollutant component identification module into the general framework or coupling the system with deep learning models (e.g., Long Short-Term Memory [LSTM] networks) to capture latent correlations over long-term cycles (Ismanto et al., 2024).
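The "long-tail" diagnosis of the COD residuals rests on the kurtosis exceeding 3, the value for a normal distribution. A minimal sketch of that check follows; the residual values are hypothetical and the helper `pearson_kurtosis` is an illustrative name, not the authors' code.

```python
def pearson_kurtosis(values):
    """Fourth standardized moment (population form); equals 3 for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

# Hypothetical residuals in mg/L: mostly small errors plus two large misses,
# mimicking occasional shock-load events.
residuals = [0.1, -0.2, 0.0, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15, 0.0, 3.0, -2.5]

k = pearson_kurtosis(residuals)
heavy_tailed = k > 3.0
print(f"kurtosis = {k:.2f}, heavy-tailed: {heavy_tailed}")
```

Note that some libraries report excess kurtosis (kurtosis minus 3) by default, for example `scipy.stats.kurtosis`, in which case the comparison threshold is 0 rather than 3.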

Overall, four of the five models (80%) achieve an excellent performance level with R2 of 0.91 or higher, demonstrating the reliability of the generic prediction framework. Although COD prediction leaves room for improvement, these results surpass the performance expected of cross-indicator prediction tools (typically R2 > 0.70) (Environmental Protection Agency, 2021). This series of applications indicates that the data-driven modeling paradigm proposed in this study is robust and portable. It can be extended to the monitoring and prediction of multiple critical parameters in wastewater treatment plants, thereby laying a solid technical foundation for refined and intelligent water quality management.

3.4 Results of process parameter back-calculation model and analysis of energy-saving potential

Table 4 presents the results of the PSO-SVR process parameter back-calculation models, namely the aeration rate model and the carbon source dosage model, for the key control parameters in sewage treatment. The aeration rate model fits the training set almost perfectly (R2 = 0.9960). On the validation and test sets, R2 is 0.95 and 0.94, respectively. Despite this slight decrease, the model retains high explanatory power, indicating good generalization to unseen data. However, the large gap in MSE between the training and validation sets (0.00016 vs. 0.00497) indicates a certain degree of overfitting. An in-depth analysis suggests that this is partly attributable to sudden fluctuations in influent water quality. Specifically, the intrusion of industrial wastewater containing toxic substances (e.g., heavy metals and phenols) can inhibit the activity of nitrifying bacteria (Olya, 2016). These factors were not incorporated into the model inputs, which aligns with the perspective of Alagador and Cerdeira (2018), who identified latent interference factors as bottlenecks for model generalization. The phenomenon is also closely related to the optimization mechanism of the PSO-SVR algorithm itself. In minimizing training error, Particle Swarm Optimization (PSO) tends to drive the kernel parameter (γ) and the penalty coefficient (C) of the SVR toward extreme values. The model then overfits high-frequency noise in the training data, which compromises, to some extent, the smoothness required under unseen operating conditions.
Overall, despite this room for improvement, the aeration model effectively couples the oxygen demand of the nitrification process (theoretical oxygen demand of 4.57 g O2/g NH4+-N) through the inverse inference logic of “NH3-N concentration → Dissolved Oxygen demand → Aeration volume.” The test set result of R2 = 0.94 supports the model's capability to quantitatively capture temperature-driven dynamics of nitrifying bacteria activity (e.g., rate decay induced by low temperatures) and variations in ammonia nitrogen load.
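To make the PSO mechanism concrete, the sketch below implements a basic particle swarm over two hyperparameters, expressed as log10(C) and log10(γ), with positions clipped to a fixed search box so that neither parameter can drift to extreme values. This is an illustration under stated assumptions, not the authors' implementation: the swarm settings, the bounds, and in particular `surrogate_validation_error` are hypothetical. In a real PSO-SVR workflow the objective would be the validation or cross-validated error of an SVR trained with the candidate parameters.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iter=60,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO with positions clipped to the given bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                  # personal best positions
    best_val = [objective(p) for p in pos]          # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clip to the search box so parameters stay within a plausible range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val

def surrogate_validation_error(params):
    """Stand-in objective (hypothetical): smooth bowl with its minimum at
    log10(C) = 1 and log10(gamma) = -2. Replace with a real validation error."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2

bounds = [(-2.0, 3.0), (-4.0, 1.0)]  # assumed ranges for log10(C), log10(gamma)
best_params, best_error = pso_minimize(surrogate_validation_error, bounds)
print(best_params, best_error)
```

Searching in log space and scoring particles on held-out data rather than training error are common design choices for limiting the overfitting behaviour described in the text; whether they would close the training/validation gap reported here is not something the source establishes.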

Table 4. Evaluation metrics of the PSO-SVR model for predicting aeration intensity and carbon dosage.

The carbon source dosage model performed excellently in the training set (R2 = 0.98), validation set (R2 = 0.92), and test set (R2 = 0.96) (Table 4), reflecting the model's strong adaptability to new data and achieving accurate prediction of carbon source dosage. The relatively high error in the validation set (MSE = 0.25) may stem from the complex inhibitory effects in the denitrification process. For instance, the leakage of dissolved oxygen from the aeration tank to the anoxic zone, even at concentrations <0.5 mg/L, can act as a preferential electron acceptor, reducing the efficiency of carbon source utilization (Williams et al., 1999), yet such micro-environmental fluctuations are not quantified in the current model. Based on the principle of electron donor balance in denitrification (theoretical carbon-nitrogen ratio COD/NO3-N≈2.86), the carbon source dosage model accurately distinguishes the demand for internal/external carbon sources by integrating the residual influent COD and TN concentration. Meanwhile, it internalizes the dual constraints of hydraulic retention time and low temperature on denitrification rate through the variables of “sewage discharge + temperature”.
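The electron donor balance mentioned here can be written as a short calculation. The sketch is a simplified illustration, not the PSO-SVR model: the function name, the idea of crediting "available" influent COD against the requirement, and the example numbers are assumptions; only the 2.86 g COD/g NO3-N ratio comes from the text.

```python
COD_PER_NO3_N = 2.86  # g COD per g NO3-N, theoretical ratio cited in the text

def external_carbon_demand(flow_m3_per_d, no3_to_remove_mg_l, available_cod_mg_l):
    """External COD to dose (kg COD/d) after crediting usable influent COD."""
    required_cod_mg_l = COD_PER_NO3_N * no3_to_remove_mg_l
    shortfall_mg_l = max(required_cod_mg_l - available_cod_mg_l, 0.0)
    # mg/L equals g/m3, so flow * concentration is g/d; divide by 1000 for kg/d.
    return flow_m3_per_d * shortfall_mg_l / 1000.0

# Hypothetical operating point: 40,000 m3/d, 20 mg/L NO3-N to remove,
# 30 mg/L of influent COD assumed usable for denitrification.
dose = external_carbon_demand(40_000, 20.0, 30.0)
no_dose = external_carbon_demand(40_000, 20.0, 100.0)
print(f"external carbon: {dose:.0f} kg COD/d; with ample internal carbon: {no_dose:.0f} kg COD/d")
```

The second call shows the internal/external distinction: when the usable influent COD already covers the stoichiometric requirement, the computed external dose is zero.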

The time-series regression curves for the aeration intensity and carbon dosage models on the test set show near-perfect alignment between predicted and actual curves (Figure 12), and the corresponding scatter plots of predicted against actual values show data points tightly clustered along the diagonal (Figure 13). This close agreement between predictions and measurements reconfirms the accuracy of both models. With R2 values of 0.94 and 0.96, respectively, the models significantly outperform traditional Proportional-Integral-Derivative control models (typically R2 < 0.85) (Hernández-del-Olmo et al., 2023). This superiority stems from precise characterization of the core biochemical processes governing nitrification and denitrification. For instance, the aeration model's embedded theoretical oxygen demand coefficient of 4.57 g O2/g NH3-N agrees with the internationally accepted stoichiometry for nitrification reactions (Metcalf and Eddy, 2014), ensuring mechanistic soundness in model outputs. The high performance of the PSO-SVR model in process parameter inference validates the efficacy of the “biochemical mechanism + data-driven” integrated modeling approach.
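The role of the 4.57 coefficient can be shown with a small worked example of the first step in the "NH3-N concentration → oxygen demand" chain. The function name and the operating point are hypothetical, and the sketch stops at oxygen mass: converting it to an air flow would need a plant-specific oxygen transfer efficiency that the text does not give.

```python
O2_PER_NH4_N = 4.57  # g O2 per g ammonia nitrogen oxidized, as cited in the text

def nitrification_oxygen_demand(flow_m3_per_d, nh4_in_mg_l, nh4_out_mg_l):
    """Theoretical oxygen demand for nitrification, in kg O2 per day."""
    removed_mg_l = max(nh4_in_mg_l - nh4_out_mg_l, 0.0)
    # mg/L equals g/m3, so flow * concentration is g/d; divide by 1000 for kg/d.
    nitrogen_oxidized_kg_d = flow_m3_per_d * removed_mg_l / 1000.0
    return O2_PER_NH4_N * nitrogen_oxidized_kg_d

# Hypothetical operating point: 40,000 m3/d, ammonia nitrogen reduced from 30 to 1 mg/L.
demand = nitrification_oxygen_demand(40_000, 30.0, 1.0)
print(f"theoretical oxygen demand: {demand:.1f} kg O2/d")
```

Because the demand scales linearly with both flow and ammonia load, forecasts of Qw and NH3-N translate directly into a forecast of the oxygen that aeration must supply.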

Figure 12. Time series regression curve chart of test set: aeration intensity (a) and carbon source dosage (b).

Figure 13. Scatter plot for comparison of model predicted values and actual values: aeration rate (a) and carbon source dosage (b).

The high-precision control of the model provides a clear path for energy reduction in WWTPs. Accurate back-calculation of the aeration rate avoids the energy waste caused by excessive aeration. The aeration system is estimated to account for 40–60% of the total energy consumption of wastewater treatment plants (Soares et al., 2017). The low test set error of this model (MAE = 0.018) can reduce the aeration rate control deviation to within ±2%, and the annual power saving rate is expected to reach 15–20%. Specifically, the precise COD prediction allows for a lower Dissolved Oxygen (DO) setpoint buffer, supporting this projected energy reduction potential.
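For scale, the two ranges quoted here can be combined into a plant-level figure. The sketch assumes that the 15–20% saving applies to aeration energy only, as the abstract states; the resulting plant-level range is simple arithmetic on the quoted numbers and is not a result reported by the authors.

```python
def plant_level_saving(aeration_share, aeration_saving):
    """Fraction of total plant energy saved when only aeration energy is reduced."""
    return aeration_share * aeration_saving

# Aeration share of total energy: 40-60%; saving on aeration energy: 15-20%.
low = plant_level_saving(0.40, 0.15)
high = plant_level_saving(0.60, 0.20)
print(f"plant-level saving: {low:.0%} to {high:.0%} of total energy use")
```

Under that assumption the saving corresponds to roughly 6–12% of total plant energy consumption.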

Similarly, the precise regulation of carbon source dosage (MAE = 0.189) can reduce chemical waste. In traditional operations, the overdosing rate of carbon sources often reaches 10%−30% (Jamaludin et al., 2024). However, through the dynamic balance of carbon-nitrogen ratio, the model can control the overdosage rate below 5%, significantly reducing operating costs. To substantiate these savings, a quantitative cost-benefit analysis was performed based on a hypothetical wastewater treatment plant with a daily flow of 50,000 m3/d. In traditional operations, a safety factor of 1.2 (i.e., 20% overdosing) is typically applied to the external carbon source dosage to buffer against load fluctuations. However, given the low prediction error demonstrated by our model, this safety margin can be confidently reduced to 1.05 (5% buffer). Assuming a typical Carbon/Nitrogen ratio requirement of 4:1 and a generic carbon source price of 0.3 USD/kg (e.g., Sodium Acetate equivalent), the cost reduction is estimated via the following equation:

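$$\text{Cost}_{\text{saving}} = Q \times \Delta N \times \text{Ratio}_{C/N} \times \left(SF_{\text{old}} - SF_{\text{new}}\right) \times \text{Price}$$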

This calculation suggests an annual saving of approximately 65,700 USD, confirming that the proposed 5% overdosing rate is not only theoretically feasible but specifically achievable due to the model's ability to capture peak loads within a tight error margin.
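The arithmetic behind this estimate can be reproduced as follows. Two inputs are assumptions rather than values given in the text: the nitrogen load to be removed, ΔN, is set to 20 mg/L because that value reproduces the reported figure, and the daily cost is multiplied by 365 days to annualize it, since the cost equation as printed carries no explicit time factor.

```python
def annual_carbon_cost_saving(flow_m3_per_d, delta_n_mg_l, cn_ratio,
                              sf_old, sf_new, price_usd_per_kg, days_per_year=365):
    """Annual saving (USD) from lowering the carbon dosing safety factor."""
    # mg/L equals g/m3, so flow * concentration is g/d; divide by 1000 for kg/d.
    nitrogen_kg_d = flow_m3_per_d * delta_n_mg_l / 1000.0
    baseline_carbon_kg_d = nitrogen_kg_d * cn_ratio
    avoided_carbon_kg_d = baseline_carbon_kg_d * (sf_old - sf_new)
    return avoided_carbon_kg_d * price_usd_per_kg * days_per_year

saving = annual_carbon_cost_saving(
    flow_m3_per_d=50_000,   # hypothetical plant size from the text
    delta_n_mg_l=20.0,      # ASSUMED: not stated in the text
    cn_ratio=4.0,           # C/N requirement from the text
    sf_old=1.2,             # traditional safety factor from the text
    sf_new=1.05,            # reduced safety factor from the text
    price_usd_per_kg=0.3,   # carbon source price from the text
)
print(f"estimated annual saving: {saving:,.0f} USD")
```

Step by step: 50,000 m3/d × 20 g/m3 gives 1,000 kg N/d; a 4:1 ratio gives 4,000 kg of carbon source per day; a 0.15 reduction in the safety factor avoids 600 kg/d; at 0.3 USD/kg that is 180 USD/d, or about 65,700 USD over 365 days. The saving scales linearly with ΔN, so a different nitrogen load would change the figure proportionally.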

Compared with similar studies, the innovation of the PSO-SVR model lies in its “closed-loop dynamic response” capability. By coupling the pollutant concentration prediction module, the model can adjust aeration and carbon source dosage 6–12 h in advance, which is more effective in coping with the lag of water quality than real-time feedback control (Masala and Servetti, 2013). For future improvements, methods from Li and Liu (2024) can be referenced, such as introducing reinforcement learning algorithms to optimize the parameter update frequency, or integrating online biosensor data to improve the accuracy of capturing microbial activity.

In summary, the PSO-SVR model constructed in this study integrates biochemical mechanisms with data-driven algorithms. The high reliability on the test set (R2 = 0.94 for aeration rate and R2 = 0.96 for carbon source dosage) supports its value for feedforward control in real wastewater treatment plant scenarios. By accurately back-calculating the aeration rate and carbon source dosage, the model provides a technical path for dynamically optimizing denitrification efficiency and reducing energy and chemical consumption, supporting the refined operation of wastewater treatment under the dual carbon goal. It is recommended to further integrate real-time sensor data (pH/ORP/DO) to build an adaptive feedback mechanism, and to quantify the actual energy and chemical savings through pilot-scale verification, so as to move the model toward engineering application.

3.5 Applicability and practical utility of the study

First, this study employs an intelligent feature selection module, which allows the model to adaptively screen optimal features based on the data available at a new location. If a new WWTP lacks specific sensors, the framework can automatically re-optimize the input vector using a reduced feature set or substitute variables without changing the core algorithm structure. Second, the process parameter optimization is constrained by fundamental biochemical stoichiometry. Since the biological kinetics of nitrification and denitrification are consistent across locations, this “Mechanism + Data” hybrid approach ensures that the control logic remains valid for other WWTPs, requiring only the recalibration of hyperparameters to fit the specific hydraulic characteristics or microbial activity rates of the new plant. Finally, the data scarcity issues common at other locations were addressed: the model's ensemble structure is robust enough to handle missing values through the imputation methods described in the preprocessing procedure.

Additionally, the practical utility of this work is mainly reflected in three aspects: quantifying economic benefits, defining standardized operational workflows, and highlighting environmental impact. First, the cost savings were calculated for a hypothetical 50,000 m3/d WWTP. By reducing the carbon source safety factor from 1.2 to 1.05 (enabled by the model's high precision), the facility could save approximately 65,700 USD annually. Second, the model functions as a “Decision Support System” in a real-world setting. It acts as a virtual sensor for 24-h forecasting, allowing operators to switch from reactive (lagged) control to proactive (feed-forward) control. Finally, the reduction in energy consumption and chemical waste aligns with sustainability and “Dual Carbon” goals.

4 Conclusion

To address the core challenges inherent in the refined operation of WWTPs, this study constructed and validated a universal data-driven solution that integrates multi-pollutant prediction with the back-calculation of process parameters. The proposed framework, which incorporates systematic data preprocessing, multi-dimensional intelligent feature selection, and ensemble learning, demonstrated high adaptability and portability across varying prediction complexities. Specifically, the model achieved a high determination coefficient (R2 = 0.96) for TN prediction, and maintained R2 values of 0.91 or higher when applied to the prediction of TP, NH3-N, and influent wastewater volume (Qw) without necessitating structural adjustments. Although the COD prediction accuracy (R2 = 0.75) is lower, it remains acceptable for most practical scenarios.

Building upon these predictive capabilities and the biochemical mechanisms of nitrification/denitrification, a process parameter back-calculation model was established using PSO-SVR. This approach enabled the precise determination of aeration rate and carbon source dosage, yielding test set R2 values of 0.94 and 0.96, respectively. Consequently, a closed loop from prediction to control was formed, providing a quantitative basis for operational regulation. This methodology is anticipated to guide the implementation of feedforward dynamic regulation in WWTPs, thereby mitigating excessive aeration energy consumption and redundant carbon source addition. Such optimization contributes to reduced operating costs while ensuring effluent compliance, offering substantial technical support for the wastewater treatment industry's alignment with dual carbon goals.

However, there are two limitations in this study. First of all, the prediction accuracy for COD (R2 = 0.75) is lower than other parameters, due to the lack of fine-grained data on pollutant components (e.g., soluble vs. particulate, biodegradable vs. inert). Furthermore, while the PSO-SVR model performs well, it shows slight overfitting tendencies in the validation set, likely due to unmonitored influent shock loads (e.g., toxic substances). Future directions include integrating deep learning models (like LSTM) to capture long-term dependencies, incorporating additional sensor data (e.g., ORP, spectral data) to improve component identification, and conducting pilot-scale field tests to validate the economic benefits.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

HC: Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Writing – original draft. QZ: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Writing – review & editing. JX: Formal analysis, Methodology, Validation, Writing – review & editing. KW: Data curation, Software, Visualization, Writing – review & editing. WY: Investigation, Resources, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was supported by the internal R&D fund of Langfang Qingquan Water Supply Co., Ltd. and the North China Institute of Aerospace Engineering.

Acknowledgments

The authors sincerely acknowledge the support from the operational staff of the WWTP for providing the long-term monitoring data and technical assistance. We also thank the editors and reviewers for their valuable comments and suggestions, which significantly improved the quality of this manuscript.

Conflict of interest

HC, QZ and JX were employed by Langfang Qingquan Water Supply Co., Ltd.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Alagador, D., and Cerdeira, J. O. (2018). A quantitative analysis on the effects of critical factors limiting the effectiveness of species conservation in future time. Ecol. Evol. 8, 3457–3467. doi: 10.1002/ece3.3788

Aljarah, F., and Çetin, A. (2022). Prediction of Water Quality with Ensemble Learning Algorithms. Adv. Artif. Intell. Res. 3, 36–44. doi: 10.54569/aair.1200695

Athanasiadis, I. N., Rizzoli, A. E., and Beard, D. W. (2010). “Data mining methods for quality assurance in an environmental monitoring network,” in International Conference on Artificial Neural Networks (Berlin; Heidelberg: Springer Berlin Heidelberg), 451–456. doi: 10.1007/978-3-642-15825-4_60

Bayram, A., Kankal, M., Tayfur, G., and Önsoy, H. (2014). Prediction of suspended sediment concentration from water quality variables. Neural Comput. Appl. 24, 1079–1087. doi: 10.1007/s00521-012-1333-3

Cardoso, B. J., Rodrigues, E., Gaspar, A. R., and Gomes, Á. (2021). Energy performance factors in wastewater treatment plants: a review. J. Clean. Prod. 322:129107. doi: 10.1016/j.jclepro.2021.129107

Environmental Protection Agency (2021). Guidelines for Performance Evaluation of Water Quality Prediction Models. Washington, DC: EPA.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer.

Henze, M., Gujer, W., Mino, T., and van Loosdrecht, M. (2006). Activated Sludge Models ASM1, ASM2, ASM2d and ASM3. London: IWA Publishing.

Hernández-del-Olmo, F., Gaudioso, E., Duro, N., Dormido, R., and Gorrotxategi, M. (2023). Advanced control by reinforcement learning for wastewater treatment plants: a comparison with traditional approaches. Appl. Sci. 13:4752. doi: 10.3390/app13084752

Huang, S. Y., Wu, X. Q., Zhao, Q. B., Xie, J. F., Zheng, Y. M., Zhou, T. T., et al. (2025). Prediction and Optimization of Ammonia Removal in Direct Aeration Process Based on Wastewater Properties: An Integrated Experimental and Machine Learning Approach. Washington, DC: ACS ES&T Eng.

Hyndman, R. J., and Athanasopoulos, G. (2018). Forecasting: Principles and Practice. Melbourne, VIC: OTexts.

Ismanto, E., Ab Ghani, H., Aziz, N. H. B. A., Saleh, N. I. M., and Effendy, N. (2024). “Enhancing student performance prediction through LSTM-based deep learning models with unbalanced data handling using oversampling approach,” in Proceedings of the 4th International Conference on Communication, Language, Education and Social Sciences (CLESS 2023), Vol. 819 (Springer Nature), 192. doi: 10.2991/978-2-38476-196-8_18

Jain, B., Anand, K., Priya, S., and Bhargava, K. (2024). “Overcoming the limitations of conventional deep learning methods for gender classification and age prediction with transfer learning approach,” in AIP Conference Proceedings, Vol. 3131 (AIP Publishing LLC), 020011. doi: 10.1063/5.0229634

Jamaludin, M., Tsai, Y. C., Lin, H. T., Huang, C. Y., Choi, W., Chen, J. G., et al. (2024). Modeling and control strategies for energy management in a wastewater center: a review on aeration. Energies 17:3162. doi: 10.3390/en17133162

Kamalov, F., Cherukuri, A. K., Sulieman, H., Thabtah, F., and Hossain, A. (2023). Machine learning applications for COVID-19: a state-of-the-art review. Data Sci. Genom. 277–289. doi: 10.1016/b978-0-323-98352-5.00010-0

Kusuda, T., and Moriyama, K. (1994). Evaluation of wastewater treatment systems by cost-benefit analysis. Environ. Syst. Res. 22, 171–181. doi: 10.2208/proer1988.22.171

Kuwayama, Y., and Olmstead, S. M. (2020). Hydroeconomic modeling of resource recovery from wastewater: Implications for water quality and quantity management. J. Environ. Qual. 49, 593–602. doi: 10.1002/jeq2.20050

LaMartina, E. L., Mohaimani, A. A., and Newton, R. J. (2021). Urban Wastewater Bacterial Communities Show Seasonal Patterns. doi: 10.21203/rs.3.rs-968108/v1

Li, Q., and Liu, J. (2024). Reinforcement learning for dynamic aeration control in wastewater treatment: a case study. Bioresour. Technol. 385:129247.

Liu, J., and Zhang, Q. (2019). Statistical evaluation of water quality datasets for machine learning applications. Environ. Model. Softw. 111, 352–363.

Luo, A., Gurses, M. E., Gecici, N. N., Kozel, G., Lu, V. M., Komotar, R. J., et al. (2024). Machine learning applications in craniosynostosis diagnosis and treatment prediction: a systematic review. Child's Nerv. Syst. 40, 2535–2544. doi: 10.1007/s00381-024-06409-5

Mahdavian, A., Shojaei, A., Salem, M., Laman, H., Eluru, N., and Oloufa, A. A. (2021). A universal automated data-driven modeling framework for truck traffic volume prediction. IEEE Access 9, 105341–105356. doi: 10.1109/access.2021.3099029

Maryam, G., Kaveh, O., Saeid, E., and Singh, P. (2018). Application of time series modeling to study river water quality. Am. J. Eng. Appl. Sci. 11, 574–585. doi: 10.3844/ajeassp.2018.574.585

Masala, E., and Servetti, A. (2013). “Performance VS quality of experience in a remote control application based on real-time 3D video feedback,” in 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX) (IEEE), 28–29. doi: 10.1109/qomex.2013.6603198

Masini, R. P., Medeiros, M. C., and Mendes, E. F. (2023). Machine learning advances for time series forecasting. J. Econ. Surv. 37, 76–111. doi: 10.48550/arxiv.2012.12802

Mattioli, D., Sabia, G., Petta, L., Altobelli, M., Evangelisti, M., and Maglionico, M. (2024). A modeling analysis of wastewater heat recovery effects on wastewater treatment plant nitrification. Water 16:1074. doi: 10.3390/w16081074

Metcalf and Eddy (2014). Wastewater Engineering: Treatment and Resource Recovery, 5th ed. New York, NY: McGraw-Hill.

Miao, S., Zhou, C., AlQahtani, S. A., Alrashoud, M., Ghoneim, A., and Lv, Z. (2021). Applying machine learning in intelligent sewage treatment: a case study of chemical plant in sustainable cities. Sustain. Cities Soc. 72:103009.

Najafzadeh, M., Etemad-Shahidi, A., and Lim, S. Y. (2016). Scour prediction in long contractions using ANFIS and SVM. Ocean Eng. 111, 128–135. doi: 10.1016/j.oceaneng.2015.10.053

Najafzadeh, M., and Niazmardi, S. (2021). A novel multiple-kernel support vector regression algorithm for estimation of water quality parameters. Nat. Resour. Res. 30, 3761–3775. doi: 10.1007/s11053-021-09895-5

Najafzadeh, M., and Zeinolabedini, M. (2018). Derivation of optimal equations for prediction of sewage sludge quantity using wavelet conjunction models: an environmental assessment. Environ. Sci. Pollut. Res. 25, 22931–22943. doi: 10.1007/s11356-018-1975-5

Najafzadeh, M., and Zeinolabedini, M. (2019). Prognostication of waste water treatment plant performance using efficient soft computing models: an environmental evaluation. Measurement 138, 690–701. doi: 10.1016/j.measurement.2019.02.014

Negara, A. P. (2023). Modeling and Analysis of Microbial Activities in Industrial Wastewater Treatment Plant. doi: 10.33612/diss.626423313

Olya, M. E. (2016). A new method for prediction of wastewater treatment efficiency in the photocatalytic processes. Orient. J. Chem. 32, 1453–1463. doi: 10.13005/ojc/320319

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.

Shao, S., Fu, D., Yang, T., Mu, H., Gao, Q., and Zhang, Y. (2023). Analysis of machine learning models for wastewater treatment plant sludge output prediction. Sustainability 15:13380. doi: 10.3390/su151813380

Sharma, P. (2023). Exploring the microbial dynamics for heavy metals bioremediation in the industrial wastewater treatment: a critical review. Nov. Res. Microbiol. J. 7, 2034–2047. doi: 10.21608/nrmj.2023.307215

Soares, R. B., Memelli, M. S., Roque, R. P., and Gonçalves, R. F. (2017). Comparative analysis of the energy consumption of different wastewater treatment plants. Int. J. Archit. Arts Appl. 3, 79–86. doi: 10.11648/j.ijaaa.20170306.11

Takalo-Mattila, J., Heiskanen, M., Kyllönen, V., Määttä, L., and Bogdanoff, A. (2022). Explainable steel quality prediction system based on gradient boosting decision trees. IEEE Access 10, 68099–68110. doi: 10.1109/access.2022.3185607

Utku, A., and Can, Ü. (2022). Deep learning based air quality prediction: a case study for London. Türk Doğa ve Fen Dergisi 11, 126–134. doi: 10.46810/tdfd.1201415

Wałęga, A., Chmielowski, K., and Młyński, D. (2019). Nitrogen and phosphorus removal from sewage in biofilter-activated sludge combined systems. Pol. J. Environ. Stud. 28, 1939–1947. doi: 10.15244/pjoes/89898

Wang, P., Lehti-Shiu, M. D., Lotreck, S., Segura Abá, K., Krysan, P. J., and Shiu, S. H. (2024). Prediction of plant complex traits via integration of multi-omics data. Nat. Commun. 15:6856. doi: 10.1038/s41467-024-50701-6

Williams, M. D., Vermeul, V. R., Oostrom, M., Evans, J. C., Fruchter, J. S., Istok, J. D., et al. (1999). Anoxic Plume Attenuation in a Fluctuating Water Table System: Impact of 100-D Area in Situ Redox Manipulation on Downgradient Dissolved Oxygen Concentrations (No. PNNL-12192; EW 40). Richland, WA: Pacific Northwest National Lab (PNNL). doi: 10.2172/7649

Wu, T., Yang, S. S., Zhong, L., Pang, J. W., Zhang, L., Xia, X. F., et al. (2023). Simultaneous nitrification, denitrification and phosphorus removal: what have we done so far and how do we need to do in the future? Sci. Total Environ. 856:158977. doi: 10.1016/j.scitotenv.2022.158977

Xie, Y., Chen, Y., Lian, Q., Yin, H., Peng, J., Sheng, M., et al. (2022). Enhancing real-time prediction of effluent water quality of wastewater treatment plant based on improved feedforward neural network coupled with optimization algorithm. Water 14:1053. doi: 10.3390/w14071053

Zeinolabedini, M., and Najafzadeh, M. (2019). Comparative study of different wavelet-based neural network models to predict sewage sludge quantity in wastewater treatment plant. Environ. Monit. Assess. 191:163. doi: 10.1007/s10661-019-7196-7

Zhang, H., and Liu, W. (2022). Residual analysis for validating water quality prediction models. J. Hydrol. 607:127518.

Zhang, J., Ma, L., and Yan, Y. (2020). A dynamic comparison sustainability study of standard wastewater treatment system in the straw pulp papermaking process and printing & dyeing papermaking process based on the hybrid neural network and emergy framework. Water 12:1781. doi: 10.3390/w12061781

Zhang, L. (2020). Correlation analysis between COD and nitrogen removal in activated sludge systems. Bioresour. Technol. 305:123095.

Zhang, Q., and Wang, L. (2022). Predicting wastewater flow rate using gradient boosting machines: a case study. J. Hydrol. 609:127702.

Zounemat-Kermani, M., Alizamir, M., Keshtegar, B., Batelaan, O., and Hinkelmann, R. (2022). Prediction of effluent arsenic concentration of wastewater treatment plants using machine learning and kriging-based models. Environ. Sci. Pollut. Res. 29, 20556–20570. doi: 10.1007/s11356-021-16916-6

Keywords: energy conservation and consumption reduction, generic prediction model, machine learning, process parameter optimization, wastewater treatment

Citation: Chen H, Zhang Q, Xie J, Wang K and Yue W (2026) Multi-pollutant prediction and process parameter optimization of a wastewater treatment plant based on machine learning models. Front. Water 8:1719097. doi: 10.3389/frwa.2026.1719097

Received: 08 October 2025; Revised: 21 December 2025; Accepted: 05 January 2026; Published: 29 January 2026.

Edited by:

Ibrahim Demir, Tulane University, United States

Reviewed by:

Anurag Malik, Punjab Agricultural University, India
Mohammad Najafzadeh, Graduate University of Advanced Technology, Iran

Copyright © 2026 Chen, Zhang, Xie, Wang and Yue. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Qiang Zhang, MjI4Mzk4MTk1MEBxcS5jb20=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.