Pollutant Flux Estimation of the Lijiang River Based on an Improved Prediction-Correction Method

Pollutant flux estimation and the analysis of flux variations are the basis for water quality assessment and water pollution control. At present, pollution flux estimation has certain shortcomings, such as a low frequency of water quality monitoring and inadequate calculation methods. To improve the rationality and reliability of river pollution flux estimation results, an improved prediction-correction pollution flux estimation method was developed by combining the LOADEST model and the Kalman filtering algorithm. By establishing the regression equation between pollutant flux and daily discharge, the predicted pollution flux procedure can be calculated using the LOADEST model. In a subsequent step, the pollutant flux is corrected based on the Kalman filtering algorithm. The improved method was applied to estimate the fluxes of chemical oxygen demand (COD), ammonia nitrogen (NH3-N), and total phosphorus (TP) at the Guilin Section of the Lijiang River from 2010 to 2019. The estimated fluxes were in good agreement with the measured ones, with relative deviation values for COD, NH3-N, and TP of 2.27, 3.20, and 1.39%, respectively. The improved method can reasonably estimate fluctuations in river pollution fluxes without requiring more data. The results in the present study provide powerful scientific basis for pollutant flux estimation under low-frequency water quality monitoring.


INTRODUCTION
Globally, water pollution is one of the most important environmental issues. In rivers, various pollutants, impacted by physical, chemical, biological, ecological, climatic, and other factors, can cause eutrophication, acidification, or alkalization, posing a threat to river ecosystem health (Aparicio et al., 2016;Steward et al., 2018). Pollutants in rivers can be rapidly transported through surface and subsurface routes, directly influencing the landscape water quality and regional water safety (Zhang et al., 2017;Qin et al., 2019). The river pollutant flux can directly reflect the total pollution load in the watershed above the river section, representing the production and transportation characteristics of pollutants in the watershed, which is the basis for formulating pollution control plans and measures (Halliday et al., 2014). However, lowfrequency and discrete water quality monitoring data series pose great challenges to the reliable quantification of river pollutant fluxes (Li and Guo, 2017).
In China, water quality is routinely monitored once a month in most rivers. The monthly representative value method (i.e., assuming that the water quality monitoring data represent the monthly average water quality concentration) and the linear interpolation method (i.e., assuming that the water quality concentration changes linearly between two measured data) are conventional pollutant flux calculation methods suitable for rivers that are not subjected to considerable human activities (Valero, 2012;Gnann et al., 2018). However, when human activities are intensified, this has an obvious influence on water quality. Since river pollutant flux estimations are mainly based on the above two conventional methods, the actual situation is mostly not reflected, impeding the development of refined water management strategies. Therefore, to obtain more accurate simulation results, a method of pollutant flux estimation based on watershed pollution load modeling (e.g., SWAT model and HSPF model) is developed (Chang and Li, 2017;Bui et al., 2019). As such an approach requires numerous data types (e.g., terrain, meteorological, land use, soil, and vegetation data), which are difficult to obtain, it is necessary to develop a method which not only reasonably describes the water quality fluctuation characteristics but also has simple requirements for data and can be easily used (Murphy and Sprague, 2019;Terskii et al., 2019).
Based on previous studies, there are good statistical correlations between river pollutant flux and discharge (Li and Guo, 2017;Kim et al., 2018). Therefore, by establishing the regression relationship between pollutant flux and discharge, high-frequency discharge monitoring data can be used to interpolate the pollutant flux between two measured water quality values, facilitating the determination of the fluctuation process between two measured values. In this paper, the Load Estimator (LOADEST) model can establish the regression equation between pollutant flux and daily discharge and can subsequently estimate the river pollutant flux at different time and space scales using daily discharge monitoring data and lowfrequency water quality monitoring data (Runkel et al., 2004). Some authors have applied the LOADEST model to estimate pollutant fluxes in the North Jiulong River, the Three Gorges Reservoir Area, the Peru Creek, the Mississippi River, and the Ishikari River (Duan et al., 2013;Runkel et al., 2013;Pellerin et al., 2014;Gao et al., 2018;Zhu et al., 2019). However, affected by incomplete data and non-optimal parameters, there are inevitable errors between estimated and measured values, resulting in deviation during pollutant flux estimation. Therefore, to reduce the error and improve the calculation accuracy, an effective data correction method is crucial.
As an optimal autoregressive data-processing algorithm, Kalman filtering has the advantages of a small calculation workload and a short computing time. The main processes of Kalman filtering are prediction and correction (Cammalleri and Ciraolo, 2012). The prediction process mainly uses the time renewal equation to establish an a-priori estimation for the current state and calculates the values of state variable and error covariance to establish an a-priori estimation for the next time state. In the correction process, the measurement renewal equation is used to establish an improved posterior estimation of the current state based on the prior estimation of the prediction process and the measured variables (Evensen, 2003). It can improve the rationality and reliability of the estimation results, and is widely used in the real-time correction of hydrological and hydrodynamic models (Goncalves and Costa, 2013;Javaheri et al., 2019;Xiong et al., 2019).
This paper combines the LOADEST model and the Kalman filtering algorithm to improve the reliability of river pollutant flux estimation results. Based on low-frequency water quality monitoring data and daily discharge data collection, the optimal pollutant flux regression equation was selected by the LOADEST model, and the daily pollutant flux process was predicted based on the regression equation. Subsequently, the predicted pollutant flux was corrected by the Kalman filtering algorithm, thus obtaining the estimated pollutant flux.

Prediction of Pollutant Flux Based on the LOADEST Model
The LOADEST model estimates the river pollutant flux using multiple linear regression. The optimal flux regression equation of the corresponding pollutant is established based on continuous daily discharge monitoring data, the low-frequency water pollutant concentration monitoring data, and subsequently, the daily pollutant flux at different time scales is estimated, which makes up for the deficiency that conventional statistical methods cannot describe the fluctuation characteristics of water pollutant concentration (Kim et al., 2018).

Regression Equation of Pollutant Flux
Taking the water pollutant concentration monitoring data and the daily discharge monitoring data as input, the optimal regression equation between pollutant flux and discharge is selected by the LOADEST model to determine the daily pollutant flux (Gao et al., 2021). The LOADEST model provides 11 regression equations, as shown in Table 1.

Parameter Estimation and Test
Due to limited observation times and the inaccuracy of historical information, the data frequently contain non-specific values, which fall in a specific observation interval, or values greater (less) than a certain threshold rather than a specific value. In statistics, such data are called censored data. The LOADEST model uses different parameter estimation methods based on whether the residual error of pollutant flux follow normal distribution and whether censored data occur. When residual error values are normally distributed, the censored data are estimated by adjusted maximum likelihood estimation (AMLE), whereas uncensored data are estimated by minimum variance unbiased estimation (MVUE) (Cohn et al., 1989(Cohn et al., , 1992. The specific algorithms are shown in Eqs 1, 2, respectively. When the residual error does not meet the requirements of normal distribution, the least absolute deviation (LAD) is used whether the data are censored or not, as shown in Eq. 3 (Powell, 1984): where L AMLE , L MVUE , and L LAD are the estimated pollutant fluxes calculated by AMLE, MVUE, and LAD methods, respectively; H (a, b, s 2 , α, κ) represents the likelihood approximation function of infinite series; g m (m, s 2 , V) represents the Bessel function; α and κ represent the function of gamma distribution; a, b, and V represent the dependent variables; m represents the degree of freedom; s 2 represents the residual variance; e k represents the residual error; and n represents the number of data points for equation calibration. For the parameters of pollutant flux regression equation a 0 to a 6 , the LOADEST model mainly tests the validity by the following methods: 1) Determination coefficient (R 2 ) testing method. The determination coefficient is used to test the data fitness of the regression equation. According to the theory of mathematical statistics, R 2 > 80% indicates that the regression equation has a preferable fitting degree, and R 2 > 90% indicates that the regression equation fits well (Menard, 2000;Runkel et al., 2004). 2) Nash-Sutcliffe efficiency (NSE) testing method. The NSE represents the relationship between the calculated value and the average measured value of the model, which ranges from -∞ to 1. The larger the NSE value, the higher the coincidence between the simulated value and the measured value (Wu et al., 2019). The NSE value is calculated as follows: where Q mea,i and Q sim,i represent the measured and simulated values, respectively, and Q obs represents the average of the total measured value.
3) Serial correlation of residuals (SCR) testing method. The SCR is used to test whether there is sequence correlation in the residuals (Verbeke et al., 1998). The smaller the SCR value, the more independent the residual of the equation. For uncensored data, the probability plot correlation coefficient (PPCC) is used to test whether the residual of the regression equation is in accordance with normal distribution, and PPCC > 0.9 indicates that the residual meets the requirements of normal distribution (Vogel, 1986;Runkel et al., 2004). For censored data, the Turnbull-Weiss statistic method is used to test whether the residual of the regression equation is in accordance with normal distribution, and P < 0.05 indicates normal distribution (Turnbull and Weiss, 1978). 4) T-ratio testing method. As multicollinearity affects the result of regression analysis, the regression model uses the correlation coefficient to determine whether there is a correlation between independent variables. In the case of multicollinearity correlation, it can be eliminated by centralizing independent variables (Cohn et al., 1992). The equations are as follows: whereT represents the centering value; N represents the number of observation data used for parameter calibration; T is the mean value of the observation data.

Regression Equation Optimization
The regression equation is optimized by the Akaike Information Criterion (AIC) and the Schwarz Posterior Probability Criterion (SPPC) (Akaike, 1974;Schwarz, 1978). Both AIC and SPCC are standards to measure the complexity and the fitting accuracy of statistical models. They can be used to select the model with good fitness and least free parameters. When optimizing the regression equation, the AIC and SPCC values of each regression equation can be calculated by Eqs 7, 8, respectively. The equation with the minimum values is the optimum: Frontiers in Environmental Science | www.frontiersin.org March 2022 | Volume 10 | Article 868404 where SSR is the residual sum of squares; k represents the number of equation parameters; m represents the number of data groups for parameter estimation.

Pollutant Flux Correction Based on Kalman Filtering
The basic idea of the Kalman filtering algorithm is taking the minimum mean square error as the best estimation criterion, establishing the space model of signal and noise states, and using estimates of the previous time and observations of the present time to update the estimation of state variables, followed by obtaining the optimal estimates of the present time (Kalman, 1960;Campestrini et al., 2016). In this study, taking the measured pollutant flux values and the corresponding predicted pollutant flux values by the LOADEST model as inputs, the optimal pollutant flux estimates at the measured time are calculated based on the Kalman filtering algorithm. The difference between the optimal estimates and the measured values is the error at the measured time. Assuming that the errors follow linear distribution, the daily pollutant flux errors are obtained by extending the interpolation of the measured error values to the daily values (Aulenbach, 2013). Therefore, the corrected pollutant flux values are obtained by adding the daily flux pollutant error values to the predicted daily pollutant flux values. According to the Kalman filtering theory, the state equation needs to be established and is as follows: where X 0 is the initial iteration value by Kalman filtering; X L represents the pollutant flux estimated by the LOADEST model; A represents the state transition parameter, which is equal to the linear correlation coefficient between the measured and the predicted value using the LOADEST model; w is the model noise that meets the requirements of normal distribution with mean value of 0 and a variance of D. The value of D can be determined based on the estimated flux error by the LOADEST model.X k represents the predicted value of the pollutant flux for the kth iteration; X k-1 represents the corrected value of the pollutant flux for the k-1st iteration. The updated equation of the state estimation error covariance is as follows:P whereP k represents the predicted error covariance for the kth iteration; P k-1 represents the corrected error covariance for the k-1st iteration, and the initial value P 0 refers to the predicted error covariance by the LOADEST model. The Kalman gain can be calculated as follow: where K k represents the Kalman gain for the kth iteration; H is the transformation parameter matrix with the determinant value of 1; and B represents the variance of measured noise. The equation used for the filtering correction is as follows: where X k represents the corrected pollutant flux value for the kth iteration, and Y represents the measured pollutant flux value. The updated equation of the state filtering error covariance is as follows: where P k represents the state filtering error covariance, and I is the identity matrix. The autoregressive iterative calculation is performed sequentially according to Eqs 9-13. The optimal corrected pollutant flux value in the measured time is obtained when P k converges to a constant value.

Study Area and Data
In this study, the improved method was applied to the Lijiang River ( Figure 1) which is a tourist attraction of world-wide interest and belongs to the Pearl River basin. It is located in the northeast of the Guangxi Zhuang Autonomous Region, China. The Lijiang River originates from the northeast of the Mao'er Mountain and is famous for its picturesque scenery of mountains on the riverbanks, with a karst topography. The total length of the Lijiang River is 214 km, with a drainage area of 12,285 km 2 (24°18′-25°41′ N, 109°45′-110°40′ E). The climate of the Lijiang River basin is subtropical monsoonal climate, with warm, humid summers and cool, wet winters. The annual average temperature is 19.3°C, with an annual average precipitation of about 2,200 mm. Flooding mostly occurs in June and July, and the dry period is from October to March (Li et al., 2015).
The overall water quality of the Lijiang River is good. The rapid population growth, coupled with accelerated economic development, has gradually led to a deterioration of the water quality during the last few years (Deng et al., 2021). The discharge of wastewater from factories and the increase in industrial and domestic pollution discharge have resulted in water pollution risk. As it is difficult to control diffuse pollution from agricultural sources, the protection of the basin is challenging. Currently, water pollution is threatening the safety of water resources from the Lijiang River.
This paper estimated the river pollutant flux from 2010 to 2019 at the Guilin section of the Lijiang River. Due to nitrogen pollution caused by the extensive use of chemical fertilizers and pesticides on farmland, coupled with wastewater discharge from households and restaurants on both sides of the Lijiang River, chemical oxygen demand (COD), ammonia nitrogen (NH 3 -N), and total phosphorus (TP) were selected as characteristic pollutants (Ye et al., 2010). Simultaneous observed precipitation data were obtained from the National Meteorological Information Center of China. Monitoring data of daily average discharge and low-frequency monitoring water pollutant concentration data for the Guilin section from 2010 to 2019 were provided by the Guilin Hydrology Bureau. These data were examined and calibrated to test whether they were homogeneous, extreme, and temporally consistent. Therefore, the Frontiers in Environmental Science | www.frontiersin.org March 2022 | Volume 10 | Article 868404 data were considered to be reliable (Huang et al., 2019). The statistics of measured data, such as maximum, minimum, and mean of discharge and pollutant concentration parameters were seen in Table 2.

Regression Model
The pollutant flux regression Eqs 14-16 were obtained by using the daily average discharge, measured time, and water pollutant concentration data at the Guilin section of the Lijiang River; the parameters were calibrated by the optimization of AIC and SPCC.
The results of the LOADEST model show that the optimal regression models for both NH 3 -N and TP fluxes were sevenparameter equations, whereas the optimal regression model for COD was a three-parameter equation. The regression model analyses of COD, ammonia nitrogen, and TP are shown in Table 3.
FIGURE 1 | Location of the lijiang river basin and gauging station.
As shown in Table 3, the pollutant flux regression equations for COD, NH 3 -N, and TP fit well for the study period, with R 2 values ranging from 0.70 to 0.93. The highest R 2 value was obtained for COD; a large R 2 value indicates that the pollutant flux, daily discharge, and time are well correlated. When comparing the simulated pollutant flux to the measured one, the NSE values were 0.85 (COD), 0.83 (NH 3 -N), and 0.88 (TP), indicating that the numerical results for the Lijiang River and the monitoring results were similar, with a small error, and the LOADEST model is highly reliable. The p values obtained for COD, ammonia nitrogen, and TP were below 0.05, indicating that the equation coefficients were statistically significant. The SCR values were 0.27 (COD), 0.25 (NH 3 -N), and 0.16 (TP), indicating that the residuals were independent. Furthermore, the PPCC values were 0.99 (COD), 0.97 (NH 3 -N), and 0.99 (TP), indicating that the residuals meet the requirements of normal distribution. Based on the results, the regression equation optimized based on the LOADEST model can be used to predict the pollutant flux in the Lijiang River.

Pollutant Flux Estimation
In this study, the improved prediction-correction method was used to estimate the pollutant flux of the Lijiang River. The predicted pollutant flux based on the LOADEST model was corrected by applying the Kalman filtering algorithm. Generally, pollution flux depends on river discharge and water pollutant concentration. However, based on the results of the correlation analysis, the fluxes of COD, NH 3 -N, and TP were only correlated with river discharge, with R 2 values of 0.92, 0.73, and 0.89, respectively (Figure 2). Although the pollution flux is a product of river discharge and water pollutant concentration, here, the fluxes of COD, NH 3 -N, and TP mainly depended on the variation of river discharge during the study period, whereas the influence of water pollutant concentration was weak. These results indicate that there is a significant correlation between pollution flux and river 3 | Regression model analyses of chemical oxygen demand (COD), ammonia nitrogen (NH 3 -N), and total phosphorus (TP). Std. Dev, standard deviation; NSE, Nash-Sutcliffe efficiency value; PPCC, probability plot correlation coefficient; SCR, serial correlation of residuals. discharge, which is consistent with the results of previous studies (Park and Engel, 2016;Kim et al., 2018). The correlation between pollutant flux and concentration largely depends on the hydrological conditions; in rivers with stable discharge, it is higher than in those with large flow fluctuations (Li and Guo, 2017). The annual and seasonal variations of pollutant fluxes are shown in Figures 3, 4. Throughout the study period, the annual average fluxes of COD, NH 3 -N, and TP were 6,523.35, 997.53, and 237.21 t, respectively. As shown in Figure 3, the annual pollutant fluxes of both COD and TP exhibited an increasing trend from 2010 to 2019, with growth rates of 5.18 and 0.94%, respectively. This increase was mainly due to the economic and industrial development and the population increase in the Lijiang River Basin. The scouring on sediment causes more phosphorus to be released into the river, which may increase the pollutant flux of TP. The NH 3 -N flux increased from 2010 to 2013, followed by a decrease until 2019, with an overall decline rate of 6.79%. Most likely, this is a result of the reduction of non-point source pollution input (e.g., pesticides, livestock manure, and fertilizer) and the decreased rural wastewater discharge (Xu et al., 2020).
As shown in Figures 4, 5, annual fluxes of COD, NH 3 -N, and TP showed considerable seasonality, corresponding with variations in discharge and rainfall (Li and Guo, 2017). All pollutants showed larger fluxes in the wet season (from March to August) compared to the dry season (from September to February). The annual average fluxes of COD, NH 3 -N, and TP in the wet season were 4,970.98, 779.35, and 189.86 t throughout the study period, accounting for 76.20, 78.13, and 80.04% of the annual average fluxes, respectively. In the wet season, precipitation in the Lijiang River Basin is heavy, accounting for about 80% of the annual precipitation, which explains the large seasonal variations in pollutant fluxes (Figure 4). Pollutants are discharged into the river by runoff scouring after precipitation, resulting in increased pollutant loads. Therefore, precipitation is the main driving factor for the increase of runoff pollution in the river basin, and attention should be paid to the  indicating that concentration is another factor affecting the NH 3 -N flux.

Comparison With Other Methods
To verify the rationality of the pollutant flux estimation results based on the improved prediction-correction method, the estimated fluxes of COD, NH 3 -N, and TP were compared with the measured fluxes and the simulated ones based on the LOADEST model. As shown in Figures 6, 7  The fluxes of COD, NH 3 -N, and TP at Guilin Section were calculated for 2010 to 2019, using the monthly representative value method (MRVM) and the linear interpolation method (LIM), and subsequently compared with the fluxes estimated by the improved prediction-correction method. As shown in Tables 4, 5, the values obtained by the different methods largely varied. For periods with small fluctuations in water quality (e.g., 2013-2015), the annual pollutant fluxes calculated via MRVM, LIM, and in the present study only slightly differed, with a deviation ranging from 0.21 to 18.60%.
On the contrary, for periods with large fluctuations (e.g., 2010, 2011, 2016, and 2019), the annual pollutant fluxes differed largely depending on the applied method, with deviation ranging from 20.94 to 41.78%. Because of the slight fluctuations in monthly COD, NH 3 -N, and TP concentrations at Guilin Section in the dry season, the variation in pollutant flux was mainly determined via discharge, resulting in only slight differences depending on the method. In the wet season, the opposite was observed. The pollutant flux deviation in separate months exceeded 100% (Supplementary Tables S1-S10). In addition, there was no significant difference in the processes and total COD flux estimated by the different methods. The main reason is that the monthly change of COD concentration is small, ranging between 1 and 2 mg/L in most months, and its flux is mainly determined by river discharge. In summary, for water pollutant concentration within a narrow range, the LIM and the MRVM can be used for pollutant flux estimation,

CONCLUSION
To further develop the currently used river pollutant flux estimation methods, an improved prediction-correction method is proposed, including two steps: pollutant flux prediction based on the LOADEST model and pollutant flux correction using the Kalman filter algorithm. In the first step, the regression equation between pollutant flux and daily discharge is established to reflect the fluctuation characteristics of pollutants to compensate for the shortcomings of conventional calculation methods. In the second step, the predicted pollutant fluxes are corrected by the Kalman filter algorithm to reduce the error between the corrected and the measured values and to increase the reliability of the estimated results. The improved method has the advantages of simple data requirements and low technical complexity, making it the method of choice in large-scale applications. The results showed that the estimated fluxes of COD, NH 3 -N, and TP were in good agreement with the measured values, indicating that the results based on the combination of the LOADEST model and the Kalman filtering algorithm are reliable. Compared with the results of the LIM and the MRVM, the improved prediction-correction method can be used as a better choice for pollutant flux estimation. The improved method requires a good statistical regression relationship between pollutant flux and river discharge and is therefore suitable for rivers subjected to non-point source pollution. With the increase in the proportion of point source pollution, its application effect will decrease. The pollutant flux estimation method for rivers dominated by point source pollution should be further investigated.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

ACKNOWLEDGMENTS
Special thanks to Liangang Chen and Gang Wang for providing instructions and suggestion on this work. In addition, we thank the reviewers for their useful comments and suggestions.