Data-driven modeling techniques for prediction of settled water turbidity in drinking water treatment

Drinking water treatment is a complex system of chemical, physical, and biological processes that is highly dependent on water quality and the design of the treatment process. To create decision-support tools, the prediction of key performance indicators, such as settled water turbidity, is needed. A variety of data-driven modeling techniques is available to formulate such predictions. Data-driven models provide valuable tools for formulating predictions where there is a lack of mechanistic models or the mechanisms are not fully understood, as in surface water treatment. The objective of this paper is to evaluate and compare the effectiveness of various data-driven techniques for this important, but difficult, problem. Recognizing that the size and quality of the dataset are most critical in this kind of analysis, this work uses one of the largest datasets used in this context, consisting of 2,527 vectors of water quality and operational data (a 2,527 × 9 data frame) from a full-scale water treatment plant. The paper constructs and compares the performance of several data-driven models, including k-nearest neighbor (KNN) regression, polynomial regression, and artificial neural networks (ANN). Based on test scaled root mean square error (RMSE), the ANN model was the most predictive (0.124). Similarly, the ANN model had the best predictive performance based on total scaled RMSE (0.086). These results show that ANNs have a high potential for the development of a future decision support system in selecting appropriate coagulant doses based on settled water turbidity.


Introduction and background
Drinking water treatment is a vital public health program to deliver potable and palatable water to customers. Treatment regimes depend upon the source water, which is typically either surface water (e.g., rivers, lakes, or reservoirs) or groundwater. Surface water treatment systems are prone to seasonal changes in water quality, as well as more rapid changes, particularly during storm events (Wu and Lo, 2008). As such, chemical dosages (e.g., coagulant, pre-oxidant, disinfectant) are often adjusted to maintain effective treatment during these changing water quality conditions. Decisions to change chemical dosage have historically been made from a combination of operator experience and bench-scale analyses (Menezes et al., 2018). Changes to coagulant dosage are typically evaluated with a bench-scale jar test. While jar tests help predict the chemistry of coagulation, they are relatively expensive to run and do not allow for responses to sudden changes in water quality (Joo et al., 2000; Menezes et al., 2018; Edzwald, 2019). Hence, decision-support tools have become increasingly important in applying consistent dosages of the coagulant.
Mathematical models are common tools to gain insight into the performance of complex systems and to be able to predict future behavior accurately. Models can range in both complexity and accuracy of prediction, with many conventional models providing limited insights in scenarios where relationships are highly non-linear and/or poorly understood. Several software programs and programming languages have been developed to facilitate ease of model development and evaluation of predictive accuracy for increasingly complex problems. Efforts have been undertaken to utilize data-driven mathematical models to better understand drinking water treatment processes and formulate predictions of performance based on collected data, especially in circumstances when science-based (such as from chemistry or physics) models are not available or are too inaccurate. As an example of data-driven models, artificial neural networks (ANNs) have been growing in use for modeling drinking water treatment processes, particularly in the prediction of turbidity at various points in the treatment process. ANN models are based on a model of the structure of human neural networks. Input nodes are connected to nodes in hidden layers through nonlinear transformation functions. These hidden nodes, which form hidden layers, can be connected to other hidden layers or an output layer that determines the predicted response variable. An example diagram of ANN architecture is shown in Figure 1.
Since the work presented here focuses on the prediction of settled water turbidity, a summary of the reported results from studies predicting settled water turbidity is given in Table 1. These studies show that data-driven models can be a highly effective decision-support tool for water treatment, especially when large data sets are available. However, one glaring deficiency in these studies is that they were almost exclusively developed using small data sets from full-scale operations or they relied on bench-scale data, thereby severely limiting wider applicability in the field. In particular, bench-scale studies can be difficult to scale to replicate full-scale operations. Physical processes, such as mixing and settling, pose many challenges for forming predictions when scaling from bench- to full-scale systems: the power input of a mixer, the surface-area-to-volume ratio, and the density of the fluid do not necessarily scale appropriately from bench- to full-scale systems. Additionally, bench-scale results do not account for the temporal and spatial variation of surface waters (Joo et al., 2000; Menezes et al., 2018). Hence, there continues to be a need to evaluate and establish the efficacy of data-based models for predicting water turbidity using full-scale data. It is for this reason that the current study uses a large data set from a full-scale water treatment plant. To the authors' knowledge, the dataset used in this paper is the largest ever used for this problem; in addition, this is also the first work to compare different modeling techniques, including KNN regression, for the modeling of a drinking water treatment process.

DWTP description
The drinking water treatment plant (DWTP) that is the source of data used in this work is dubbed "Plant A." Plant A is a conventional, publicly owned treatment works that includes rapid mixing, flocculant mixing, sedimentation, and filtration. The plant utilizes ferric chloride as a primary coagulant, with lime used for alkalinity addition and pH adjustment. Chlorine is applied in the rapid mix and post-sedimentation. Chlorine is also applied post-filtration with ammonia to generate chloramines for a distribution system residual.

Key performance indicators
Surface water treatment has several key performance indicators (KPIs) that can be evaluated to model plant performance. The KPIs for the effectiveness of a selected coagulant dose include total organic carbon (TOC) removal and settled water turbidity. Under the Stage-1 and Stage-2 Disinfectant and Disinfection Byproduct Rules (DBPRs), TOC is used as a surrogate for natural organic matter and its removal is required to reduce the formation of DBPs (US EPA, 1999). While vital to the performance and regulatory compliance of a DWTP, TOC removal requires laboratory analysis to measure, whereas turbidity can be measured by online instrumentation. Therefore, turbidity data is often more abundant than TOC data. Additionally, online data allows for a more immediate response to changing water quality conditions than laboratory data. Therefore, this study used settled water turbidity as the main KPI since abundant settled water turbidity data was available.

Data pre-processing
Data was collected in the period spanning 1 July 2011 to 30 June 2019 (aligning with the fiscal years of the public utility). Variables were initially selected based on interviews with DWTP Operations staff. Operational data included coagulant dose, raw water parameters (alkalinity, pH, turbidity), general plant parameters (water temperature and influent flow rate), and settled water turbidity. Additional operational parameters, such as chlorine dosing, were collected, but were not found to have a significant impact on model performance as they were held relatively constant. River parameters (flow rate and conductance) were collected from the United States Geological Survey (USGS) online database. A summary of the data used for model development is given in Table 2.

(Table 1 fragment: (1999); bench-scale; settled water turbidity; 0.9; polynomial. Note a: model goodness-of-fit determined using Nash-Sutcliffe (NS).)

Frontiers in Environmental Engineering, frontiersin.org

Operational data frequently contains corrupted data points and outliers. Two common approaches exist for handling outliers: 1) the standard deviation method; 2) the median absolute deviation (MAD) method. The standard deviation method relies on the assumption that the data is normally distributed and filters out data that is more than a certain number of standard deviations away from the mean. The MAD method similarly removes data that is more than a certain number of MADs away from the median, but it does not rely on the assumption that the data is normally distributed (Leys et al., 2013). In this work, a modified version of the standard deviation method was used, which included filtering out zero values for parameters that could not conceivably be zero. For example, river turbidity will fluctuate but will never be zero. In addition, data was filtered out if more than two standard deviations from the mean, and the data was scaled to be between zero and one. Since most data-driven techniques are distance-based, scaling of data reduces unwarranted impacts on model predictions for parameters that are in higher orders of magnitude than others (James et al., 2017). In total, the data set contained 2,527 vectors of data after processing.
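The filtering and scaling steps described above can be sketched for a single variable as follows. This is a minimal illustration, not the authors' code: the order of the steps (zeros removed first, then the two-standard-deviation filter, then min-max scaling), the function name `filter_and_scale`, and the sample values are all assumptions made for the example.

```python
from statistics import mean, stdev

def filter_and_scale(values, n_sd=2.0, drop_zeros=True):
    """Drop zeros and far-from-mean points, then min-max scale to [0, 1]."""
    # Step 1: remove zero readings for parameters that cannot physically be zero.
    kept = [v for v in values if not (drop_zeros and v == 0)]
    # Step 2: remove points more than n_sd sample standard deviations from the mean.
    mu, sd = mean(kept), stdev(kept)
    kept = [v for v in kept if abs(v - mu) <= n_sd * sd]
    # Step 3: min-max scale the remaining points to the interval [0, 1].
    lo, hi = min(kept), max(kept)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / (hi - lo) for v in kept]

# Hypothetical raw turbidity readings: one impossible zero and one extreme spike.
raw_turbidity = [4.0, 5.0, 0.0, 6.0, 5.5, 4.5, 5.0, 6.5, 4.8, 5.2, 60.0]
scaled = filter_and_scale(raw_turbidity)
print(len(scaled), min(scaled), max(scaled))
```

In a full data frame, a row would presumably be dropped whenever any of its variables fails the filter, so that each retained vector stays complete; that detail is not stated in the text and is left out of the sketch.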

Correlation and multi-collinearity assessment
The data space was explored with a correlation matrix and principal component analysis (PCA). The correlation matrix (Figure 2) shows the linear correlation between variables. Correlation matrices for the training, validation, and test sets are given in Supplementary Figures S1-S3, respectively. One concern is the phenomenon of multicollinearity, where two or more predictors are highly correlated with one another. To assess multicollinearity, the variance inflation factor (VIF) is calculated (Eq. 1) for each predictor (X_j) based on all other predictors (X_{-j}). The minimum value for VIF is 1, which suggests no multicollinearity. A VIF above 5 or 10 suggests the potential for multicollinearity to cause problems such as algorithm divergence and singularity during model development (James et al., 2017). For this prediction space, the VIF ranged between 1.06 and 2.3 for each predictor, suggesting an absence of multicollinearity.
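Eq. 1 is not reproduced in this text. As a hedged reconstruction, the standard definition of the VIF given by James et al. (2017), which matches the description above, is shown below; the symbol R^2_{X_j | X_{-j}} denotes the coefficient of determination from regressing predictor X_j onto all of the other predictors.

```latex
\mathrm{VIF}(X_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}
\qquad\Longleftrightarrow\qquad
R^2_{X_j \mid X_{-j}} = 1 - \frac{1}{\mathrm{VIF}(X_j)}
```

Under this definition, the reported VIF range of 1.06 to 2.3 corresponds to R^2 values of roughly 0.06 to 0.57, while the thresholds of 5 and 10 correspond to R^2 values of 0.8 and 0.9.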

Principal component analysis
A PCA was performed to assess the linearity of the prediction space. One method to assess linearity of a data space is with the cumulative proportion of variance explained (PVE). In a linear data space, the cumulative PVE of one or two components will achieve a threshold of 90 or 95% (James et al., 2017). The cumulative PVE for the prediction space (Figure 3) shows that six of eight components are required to achieve 90% cumulative PVE, indicating a nonlinear data space, as is common in water quality parameters. This suggests that linear models may not be optimal for formulating accurate predictions (Baxter et al., 1999; Chun et al., 1999; Van Leeuwen et al., 1999; Heddam et al., 2012; James et al., 2017; Kim and Parnichkun, 2017; Zhang et al., 2019).
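The cumulative PVE check can be sketched as below, assuming the per-component variances (the eigenvalues of the scaled data's covariance matrix) have already been obtained from a PCA routine. The eight variances used here are hypothetical values chosen only so that the example has eight components; they are not the values behind Figure 3.

```python
def cumulative_pve(component_variances):
    """Return the cumulative proportion of variance explained, largest component first."""
    total = sum(component_variances)
    running, out = 0.0, []
    for var in sorted(component_variances, reverse=True):
        running += var
        out.append(running / total)
    return out

def components_needed(component_variances, threshold=0.90):
    """Return the smallest number of components whose cumulative PVE reaches the threshold."""
    for count, pve in enumerate(cumulative_pve(component_variances), start=1):
        if pve >= threshold:
            return count
    return len(component_variances)

# Hypothetical variances for eight principal components (illustrative only).
variances = [2.0, 1.6, 1.3, 1.1, 0.8, 0.6, 0.4, 0.2]
print(cumulative_pve(variances))
print(components_needed(variances, threshold=0.90))
```

A flat variance profile like this one needs many components to reach the threshold, which is the pattern the text interprets as a nonlinear prediction space.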

Feature selection
In order to derive robust data-based models that can generalize well, appropriate features need to be identified from the source data. By "appropriate" we mean features that are best able to correlate the output with the inputs. Here, the following features were selected based on known impacts on drinking water treatment processes, a review of literature and professional judgement, and data availability.
• River flow rate was selected to represent precipitation events throughout the watershed. River conductance was selected to represent the ionic strength of the source water, which can impact coagulant demand and thereby treatment efficacy, particularly the removal of colloidal particles (Edzwald et al., 1974; Jiang, 2015).
• Water quality parameters (raw water turbidity, pH, alkalinity, and temperature) were selected due to their well-documented relationships with the performance of coagulation and flocculation (Jiang, 2015).
• Operational parameters (coagulant dose and plant flow rate) were selected to measure the impacts of operational setpoints. Coagulant dose is an operational setpoint selected by the engineering staff based on raw water quality. Plant flow rate is set to meet water demand needs, but also has an impact on the hydraulic retention times in the individual unit processes.

Modeling techniques
Several data-driven modeling techniques were used to formulate predictions of settled water turbidity, including regular subset linear regression, KNN regression, polynomial regression, and ANN. These models range in complexity from relatively simple (e.g., linear regression) to complex (e.g., ANN). The strategy of employing models of increasing complexity was intentionally designed to result in an optimal model that balances interpretability with accuracy. While the ANN model was presumed to provide the best-fit model, linear, KNN, and polynomial regressions were selected to provide a baseline for comparison. Data was divided into a 70-15-15 training-validation-test split using random sampling to develop and test each model. The efficacy of model predictions was evaluated with both the RMSE (Eq. 2) and the correlation coefficient between actual and predicted values (Eq. 3).

FIGURE 2
Correlation matrix of data space.
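Eqs. 2 and 3 are not reproduced in this text. The sketch below assumes the usual definitions: RMSE as the square root of the mean squared difference between actual and predicted values, and the Pearson correlation coefficient for R. It also shows one way to draw a random 70-15-15 split. The function names, the seed, and the rounding rule for the subset sizes are assumptions made for illustration.

```python
import math
import random

def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly partition range(n) into training, validation, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    # The test set takes whatever remains after the first two subsets.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def rmse(actual, predicted):
    """Root mean square error between paired actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

train, val, test = split_indices(2527)
print(len(train), len(val), len(test))
```

Because the data are scaled to between zero and one before modeling, an RMSE computed this way is expressed in scaled units, which is presumably why the reported values are described as scaled RMSE.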

Regular subset linear regression
Regular subset linear regression was used to determine the optimal combination of predictors without including unnecessary predictors (James et al., 2017; Lumley and Miller, 2020). The optimal set of predictors for the linear model includes raw water basin (RWB) effluent pH, RWB effluent turbidity, water temperature, plant flow rate, and river conductance.

KNN regression
KNN models provide a non-parametric approach to formulating predictions by averaging the value of the response variable over the k neighbors with the smallest Euclidean distance to the test value (James et al., 2017). KNN regression models were developed for all numbers of neighbors from one to the sample size (Beygelzimer et al., 2019).

Polynomial regression
Backward stepwise selection of parameters was used to determine predictors for the polynomial regression model using a generalized additive model, beginning with all parameters raised to the third power (R Core Team, 2019). Parameters were removed if they were insignificant (p > 0.05), resulting in the model parameters in Table 3.
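As an illustration of the starting point for the backward stepwise selection, the sketch below builds the candidate first-, second-, and third-order terms for one scaled observation. It assumes that no interaction terms were included, which the text does not state, and the predictor names are hypothetical labels rather than the column names used by the authors. The elimination step itself (dropping terms with p > 0.05) requires a statistical fitting routine and is not shown.

```python
def cubic_terms(row, names):
    """Return each predictor raised to the first, second, and third power."""
    terms = {}
    for name, x in zip(names, row):
        for power in (1, 2, 3):
            terms[f"{name}^{power}"] = x ** power
    return terms

# Hypothetical labels for the eight predictors described in the text.
names = [
    "coagulant_dose", "raw_alkalinity", "raw_ph", "raw_turbidity",
    "water_temperature", "plant_flow", "river_flow", "river_conductance",
]
row = [0.5] * len(names)  # one scaled observation, illustrative values only
terms = cubic_terms(row, names)
print(len(terms))
```

With eight predictors, this gives 24 candidate terms before any are removed.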

Artificial neural network (ANN)
ANN models were developed using MATLAB (The MathWorks, 2019). Two hyperparameter optimization tasks were undertaken for this work: training algorithm selection and model architecture optimization. The three training algorithms assessed were Levenberg-Marquardt, Bayesian Regularization, and scaled conjugate gradient (SCG) backpropagation (Demuth and Beale, 2004). The Levenberg-Marquardt method stops when generalization of the model stops improving, which is measured by the mean square error (MSE) of the validation set. Bayesian Regularization has a higher computational cost, but is applicable for smaller and noisier, i.e., more randomly distributed, datasets (Haykin, 2009). SCG backpropagation is an algorithm in which an approximation to the function within a neighborhood of the neural network architecture is iteratively minimized, often using first- or second-order Taylor expansions of the function (Møller, 1993). SCG backpropagation is often recommended for larger problems, due to its computational efficiency. The Levenberg-Marquardt method was used in this work. The default learning rate of 0.01 with a loss goal of 10^-5 over 300 epochs was selected for this study.

FIGURE 3
Cumulative PVE for prediction space.

Table 3 column headings: first-order parameters, second-order parameters, third-order parameters.

A manual grid search was performed to determine the optimal configuration of nodes based on test and total RMSE (Bergstra and Bengio, 2012). The manual search included between one and three hidden layers with between 1 and 100 nodes in each layer. Each model was fit three times to account for variations in model fit. The lowest test RMSE for each model architecture was recorded to compare the various architectures. The optimal ANN structure in this work was the largest size tested. Expanded architectures were tested (200, 300, 400, and 500 nodes in each hidden layer), but did not appear to improve performance, as measured by RMSE.
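The architecture search described above can be sketched as a loop over candidate hidden-layer configurations, fitting each one several times and keeping the lowest score. The ANN training itself was done in MATLAB and is not reproduced here: `fit_and_score` is a hypothetical callable standing in for "train a network with these hidden layers and return its test RMSE", and the node options and the toy scoring function are assumptions chosen only so the example runs.

```python
from itertools import product

def search_architectures(fit_and_score, node_options, max_layers=3, repeats=3):
    """Return (hidden_layers, score) for the architecture with the lowest recorded score."""
    best = None
    for n_layers in range(1, max_layers + 1):
        # Every combination of node counts for this number of hidden layers.
        for layers in product(node_options, repeat=n_layers):
            # Fit several times and record the lowest score, as described in the text.
            score = min(fit_and_score(layers) for _ in range(repeats))
            if best is None or score < best[1]:
                best = (layers, score)
    return best

def toy_score(layers):
    """Artificial stand-in for a trained network's test RMSE (not a real model)."""
    return abs(sum(layers) - 110) / 1000

best_layers, best_score = search_architectures(toy_score, node_options=(1, 10, 100))
print(best_layers, best_score)
```

With three node options and up to three hidden layers, this toy run evaluates 39 architectures; a search over every node count from 1 to 100 would be far larger, so the grid actually used was presumably coarser.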

Regular subset linear regression
The pre-processed data was initially explored with PCA. A common threshold in PCA is a cumulative PVE of 0.9, with more linear prediction spaces often having 90% of the variance explained by the first two components (James et al., 2017). The results of the PCA suggest that the prediction space is nonlinear, which is typical for water quality and operational data (Van Leeuwen et al., 1999; Heddam et al., 2012; Zhang et al., 2019). This would suggest that a linear model would not be an appropriate tool to use, as was observed in the baseline model analysis. Indeed, the linear model had the highest test RMSE (0.176) and lowest R (0.683) of any of the models analyzed.

KNN regression
Prior to fitting the final KNN regression model, the optimal number of neighbors needed to be determined. The optimal number of neighbors, based on the RMSE, was 35, with a test RMSE of 0.154. The KNN model performed better than the linear model based on test RMSE (0.154) and R (0.805). Since there appear to be no reported studies of KNN regression being used to predict settled water turbidity in a drinking water treatment context, the model results are only comparable to the other models within this study. KNN regression performed better than all other modeling techniques, except for the ANN model. However, when the model was applied to the entire data set, the RMSE was 0.147 and the correlation coefficient decreased to 0.714, suggesting possible overfitting and model specialization.
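The neighbor-averaging prediction and the search for the number of neighbors can be sketched as follows. The text does not say which subset the RMSE was computed on when choosing k, so the use of a held-out validation set here is an assumption, as are the function names and the toy one-dimensional data.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the response over the k training points nearest to the query."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

def best_k(train_X, train_y, val_X, val_y, k_values):
    """Return the k with the lowest RMSE on the held-out points."""
    def rmse_for(k):
        preds = [knn_predict(train_X, train_y, q, k) for q in val_X]
        return math.sqrt(sum((a - p) ** 2 for a, p in zip(val_y, preds)) / len(val_y))
    return min(k_values, key=rmse_for)

# Toy scaled data with a single predictor (illustrative only).
train_X = [(0.0,), (0.1,), (0.2,), (0.8,), (0.9,), (1.0,)]
train_y = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
val_X = [(0.05,), (0.95,)]
val_y = [0.05, 0.95]

print(knn_predict(train_X, train_y, (0.1,), k=3))
print(best_k(train_X, train_y, val_X, val_y, k_values=[1, 2, 6]))
```

When k equals the number of training points, every prediction collapses to the overall mean of the response, which is why very large values of k tend to underfit.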

Polynomial regression
The polynomial model was developed using backward stepwise selection (James et al., 2017). The model that was developed performed worse than the KNN regression and ANN models, but better than the linear model, based on test and total RMSE (0.171 and 0.14, respectively) and correlation coefficient (0.688 and 0.752, respectively). There are very few examples of polynomial regressions applied to drinking water treatment parameters. Van Leeuwen et al. (1999) developed a model for various plants for predicting alum dose using jar test and raw water quality parameters, with a correlation coefficient of 0.9. The results presented here do not match those results, which may result from inconsistencies in the collected data or more dramatic changes in water quality that make modeling more difficult in general. Additionally, this work utilized full-scale data, while Van Leeuwen et al. (1999) utilized bench-scale data. Bench-scale data may not provide a model that reflects the changes in the water quality of full-scale plants (Joo et al., 2000; Menezes et al., 2018; Edzwald, 2019).

ANN regression
The ANN led to the most effective models, with the lowest total RMSE (0.086) and highest total correlation coefficient (0.911) between the actual and predicted values. In particular, the ANN model appeared to formulate more accurate predictions around the extrema. As indicated by the PCA, the data space is nonlinear, which is common for water quality data (Kim and Parnichkun, 2017; Zhang et al., 2019). ANNs have been shown to be an effective tool for recognizing patterns in nonlinear data to develop a predictive model, even with little to no knowledge of the underlying mechanisms (Haykin, 2009; Kim and Parnichkun, 2017; Zhang et al., 2019).
The predictive accuracy of the ANN developed in this work aligns with those results presented in Table 4. However, this work utilizes a larger amount of full-scale data to predict settled water turbidity. The benefit of using full-scale data is the increased applicability over that of bench- or pilot-scale data, as there are no effects of scaling. The correlation coefficient between the actual and predicted values provides a good indication of the accuracy of predictions. The reported correlation values described in Table 1 range between 0.9 and 0.93, while the ANN total correlation coefficient is 0.91.

Summary of results
The test and total RMSE and correlation coefficient between actual and predicted values for the various models are given in Table 4. A summary of the training, validation, and test RMSE is given in Figure 4. Visualizations of the model test and total predictions plotted against the actual data points are given in Figures 5 and 6, respectively. The lowest test and total RMSE were achieved by the ANN model at 0.124 and 0.086, respectively. The highest correlation coefficient between actual and predicted test data was 0.821 for the ANN model, and the highest total correlation coefficient was 0.911 for the ANN model.

Limitations of results
The research presented here suffers from a few limitations. First, the data that was used to develop the model contained many outliers and some corrupted data points. This is characteristic of full-scale operational data, as opposed to bench-scale data. Better data management practices are recommended to further evaluate the performance of the models. Second, the data that was collected was daily averages. A higher degree of granularity in data would allow for the development of a model that would be more responsive to water quality changes. Third, additional model configurations, such as a generalized linear model or support vector regression (SVR), could possibly be explored instead of a strictly polynomial model. These models may provide a higher degree of predictive accuracy without sacrificing model interpretability, as "black box" methods do (James et al., 2017). Finally, the surface water treatment process is highly complex due to the ever-shifting nature of influent water quality, including the composition of natural organic matter. We plan to address some of these limitations, where possible. In fact, future studies for this plant will incorporate higher-quality data, including more granular data, which will improve the applicability of these models to develop decision support tools.

Conclusion
This study was conducted to develop and evaluate data-driven models for the prediction of settled water turbidity on a large set of full-scale data, where many studies to date have focused solely on bench-scale data. The use of bench-scale data presents several challenges when applying these models to drinking water systems, as bench-scale studies do not account for the spatial or temporal variation of surface waters and aspects of physical processes are difficult to scale. In this work, computational data-driven models were developed using operational and water quality data from a DWTP. The modeling techniques examined here were regular subset linear regression, KNN regression, polynomial regression, and ANN. By test RMSE, the regular subset linear model was the least predictive (0.176), and the ANN had the lowest test RMSE at 0.124. The total RMSE of the regular subset linear regression, KNN regression, and polynomial regression were all similar at 0.147. The ANN outperforms the other models, resulting in the lowest total RMSE at 0.086, which is an acceptable accuracy for water turbidity prediction.
The results presented here indicate that ANN is a powerful tool. Combined with a reliable, large data set, ANN modeling can predict, with high accuracy, appropriate coagulant doses based on settled water turbidity. Such models have the potential to replace time-consuming and expensive jar tests and to provide faster response time to changing raw water quality, and thus lead to cost and time savings for treatment plants.
Future extension of this work should include the development of a decision support tool for helping Plant A operations in determining the optimal ferric chloride dose, and the development of a model with a more granular time scale. The use of more granular data will allow for more real-time decisions to be made based on changes in raw water quality.

Some or all data, models, or code generated or used during the study are proprietary or confidential in nature and may only be provided with restrictions.

TABLE 1
Summary of reported results for models predicting turbidity in drinking water treatment.

TABLE 2
Plant A data for model development.

TABLE 3
Polynomial regression model parameters.

TABLE 4
Results table for prediction of settled water turbidity.

FIGURE 4
Training, validation, and test RMSE summary.