Model prediction of radioactivity levels in the environment and food around the world’s first AP 1000 nuclear power unit

Objectives Model prediction of radioactivity levels around nuclear facilities is a useful tool for assessing human health risks and environmental impacts. We aim to develop a model for forecasting radioactivity levels in the environment and food around the world’s first AP 1000 nuclear power unit. Methods In this work, we report a pilot study using time-series radioactivity monitoring data to establish Autoregressive Integrated Moving Average (ARIMA) models for predicting radioactivity levels. The models were screened by Bayesian Information Criterion (BIC), and the model accuracy was evaluated by mean absolute percentage error (MAPE). Results The optimal models, ARIMA (0, 0, 0) × (0, 1, 1)4, and ARIMA (4, 0, 1) were used to predict activity concentrations of 90Sr in food and cumulative ambient dose (CAD), respectively. From the first quarter (Q1) to the fourth quarter (Q4) of 2023, the predicted values of 90Sr in food and CAD were 0.067–0.77 Bq/kg, and 0.055–0.133 mSv, respectively. The model prediction results were in good agreement with the observation values, with MAPEs of 21.4 and 22.4%, respectively. From Q1 to Q4 of 2024, the predicted values of 90Sr in food and CAD were 0.067–0.77 Bq/kg and 0.067–0.129 mSv, respectively, which were comparable to values reported elsewhere. Conclusion The ARIMA models developed in this study showed good short-term predictability, and can be used for dynamic analysis and prediction of radioactivity levels in environment and food around Sanmen Nuclear Power Plant.


Introduction
As a clean energy with near zero emissions, nuclear power has been vigorously developed in China in recent years as one of the important measures to achieve carbon neutrality by 2060 (1).Currently, there are 55 nuclear reactors in operation, and 18 under construction in China (2).During the operation of nuclear power plants, radioactive debris or effluents are inevitably discharged into the environment through air and water.As the continuous expansion of nuclear energy, the levels of radioactivity in environment and food have become a major concern for residents around the nuclear power plant (3)(4)(5), especially after the Fukushima Daiichi Nuclear Power Plan accident in 2011.
Radioactive substances present in the environment could potentially induce radiation exposure to human via external radiation and internal radiation through absorption, inhalation, and ingestion.As per the findings of the United Nations Scientific Committee on the Effects of Atomic Radiation (UNSCEAR 2000), an estimated 8% of natural human radiation exposure can be attributed to the ingestion of water and food (6).In order to safeguard the well-being of inhabitants, the World Health Organization (WHO) has established thresholds for gross alpha (0.5 Bq/L) and gross beta (1.0 Bq/L) in drinking water as a means of ensuring radiation safety (7). 90Sr is a high-yield byproduct of nuclear fission (8,9), possessing a half-life of 28.8 years.Its primary route of entry into human body is through the food chain, where it can accumulate in teeth, bones, and muscle tissues.Given its relatively long half-life and high radiotoxicity, 90 Sr is recognized as a crucial artificial radionuclide for evaluating radiation risks to both the environment and human health (10,11).
In view of the radiation exposure risks of and high level of public concern, prediction of radioactivity levels in the environment and food around the nuclear power plant are crucial to ensure radiation safety for the public and the environment.
Time series analysis is widely recognized as a valuable predictive model, and the Autoregressive Integrated Moving Average model (ARIMA) being one of the most important models.The ARIMA model, a hybrid of autoregressive and moving average models, was initially proposed in 1976 by Box and Jenkins (12).So far, it has been widely used in economic management, meteorological prediction, environmental prediction, and disease prediction (13)(14)(15)(16)).In addition, ARIMA model also has been applied in the field of radioactive monitoring and evaluation.In four districts of Istanbul, the model was used to predict concentrations of 226 Ra, 232 Th, and 40 K (17).After Fukushima Daiichi Nuclear Power Plan accident, Hemn Salh et al. reported the use of ARIMA models for prediction of air radiation dose rates around (18).Some scholars have already used it to predict radon concentrations and thus to predict earthquakes (19,20).
Sanmen Nuclear Power Plant (SNPP), located in Sanmen, Zhejiang, China, adopts the world's most advanced third-generation pressurized water reactor (AP1000) technology, which is one of the achievements of China's efforts to develop nuclear power.With the operation of SNPP, its impacts on the environment and residents' health have become a growing concern.To assess the impacts, we have continuously monitored radioactivity levels in the environment and food around SNPP since 2011.In this study, we analyzed the historical monitoring data and used ARIMA models, which is a time-series analysis technology, to fit and predict radioactive levels around SNPP for the first time.The predicted data can provide a basis for environmental impact assessment and human health risk assessment around SNPP.

The study region
All monitoring stations were located in Sanmen County.There were four monitoring stations for gross α and gross β in water representing surface water, factory water, tap water, and well water, respectively.Water samples were collected quarterly.There were three monitoring stations for 90 Sr in collected quarterly food including mullet, crucian carp, cabbage, and rice.There were 30 monitoring stations for ambient radiation, which were monitored quarterly.The images of the collected samples are shown in Figure 1.All data were obtained from Zhejiang Provincial Center for Disease Control and Prevention and Taizhou City Center for Disease Control and Prevention.In this study, all these monitoring data were averaged across the corresponding monitoring stations as shown in Figures 2, 3.The dataset of gross α in water was excluded because the activity concentrations were mostly below the detection limits.

90 Sr in food
Figure 5 illustrates the schematic flow for determination of 90 Sr in food, according to the Chinese national standard (22).

Ambient radiation
The cumulative ambient dose (CAD) was monitored by thermoluminescent dosimeter (TLD) (23).The LiF (Mg, Cu, and P) powder was placed into a hermetically sealed container in order to fabricate TLD.Each monitoring point was equipped with two TLDs installed at a height of 2 m.The TLDs were measured by TLD reader.

Data
The model training dataset (Supplementary Tables S1-S3) included 90 Sr activity concentrations in food from the second quarter (Q2) of 2011 to the fourth quarter (Q4) of 2022, gross β activity concentrations   An ARIMA model is defined by three parameters: p, d, and q, where p is the order of Auto-Regressive (AR) term, d is the order of differencing required to make the time-series stationary, and q is the order of Moving Average (MA) term.AR (p) usually explains the present value X t , unidirectionally it terms of its previous values X t − 1 ,X t − 2 ,, X t − p , and the current residuals ε t .It can be expressed as Equation ( 1): The model illustrates a linear association between the current observed value of the sequence at time t and the past observed values at the preceding p time points.This type of regression was known as autoregression because it is based on its own historical data.The regression with the observed values from the previous p time points is referred to as p-order autoregression.φ i (i = 1, 2…p) is its partial regression coefficient.
MA (q) refers to the current value of the time series X t in terms of its current and previous residuals It can be expressed as Equation ( 2): The model indicates that the value of the sequence at time t is independent of the past observed values at the preceding q time points, however, it exhibits a linear relationship with the preceding q stochastic disturbances.Therefore, the model is referred to as the q-order moving average model.The ε t is indicative of a stochastic disturbance sequence, also referred to as a white noise sequence, that is characterized by independence and adherence to a normal distribution.θ i (i = 1, 2…q) is its partial regression coefficient.
The ARIMA model is the combination of AR model and MA model algorithms.I in the ARIMA (p,d,q) refers to Integrated.When time-series is stationary, the ARIMA (p,d,q) model is ARMA (p,q), can be expressed as Equation (3): Transform the Equation (3) to Equation (4).
In order to facilitate ease of expression, a backshift operator is employed, which is akin to a time pointer.The multiplication of the present sequence value by a backshift operator is tantamount to shifting the temporal position of said sequence value one moment into the past.If denoted as B, the backshift operator yields Equation ( 5) and ( 6): Upon utilization of the backshift operator, Equation ( 4) is transformed to Equation (7).Procedure for determination of gross α and β in water.

FIGURE 5
Procedure for determination of 90 Sr determination in food.
Where φ(B) is non-seasonal autoregressive polynomial, θ(B) is non-seasonal movingaverage polynomial, which are expressed as Equation ( 8) and ( 9): If the time-series is non-stationary, which should be converted to stationary by differencing, the model used is ARIMA (p,d,q).Assume ΔX t is the sequence which obtained after first-order differencing of X t , which is expressed as Equation ( 10): Therefore, the equation of ARIMA (p,d,q) as follows Equation ( 11), which transforms from Equation (7).

Multiple seasonal auto-regressive integrated moving average [MSARIMA (p, d, q) (P, D, Q)]
The Seasonal-ARIMA [SARIMA(p, d, q)(P, D, Q)] model incorporates a non-seasonal ARIMA(p, d, q) model along with supplementary seasonal terms (P, D, Q) s , which accounts for the seasonality inherent in the time-series data over S time steps, corresponding to a singular seasonal period.P is the order of seasonal autoregressive term, D is the order of seasonal differencing, Q is the order of seasonal moving average term and S is the length of the seasonal cycle.The complete expression of the SARIMA model can be written as Equation ( 12): Where ε t was white noise; U(B S ) is seasonal autoregressive polynomial, and V(B S ) is seasonal moving average polynomial, which are expressed as Equation ( 13) and ( 14): When P = Q = D = 0, it means that there is no seasonal in the model.In this case, the model reduces to the standard ARIMA model.6A) shows fluctuations within a certain range and most ACFs (Figure 7A) fall into the confidence interval and tend to zero rapidly.This indicates that the sequence is stable and does not need differencing.In contrast, both the time-series diagrams (Figure 6B) and ACF plots (Figure 7B) for the 90 Sr activity concentrations in food exhibit clear periodic changes, indicating that the sequences require seasonal differencing.For gross β activity concentrations in water, the time series diagram (Figure 6C) appears relatively stable without any obvious trend, but all ACFs (Figure 7C) fall within the confidence interval.Therefore, it was determined to be white noise, and further analysis was halted.

Model establishment
After performing first-order seasonal differencing on the original time-series of 90 Sr activity concentrations in food, the ACF plot of the resulting time-series is shown in Figure 8.The majority of ACFs fall within the confidence interval and quickly reach zero, indicating that the new sequence is stable and does not exhibit any significant seasonal fluctuations.
The determination of p, q, P and Q is a crucial aspect for establishing the ARIMA model.Some researchers suggested that the most effective method to determine these values involves analyzing ACF and partial autocorrelation function (PACF) of time-series after differencing and seasonal differencing.The lag number of the peak entering the confidence level in ACF and PACF plots is used to determine the values of p, q, P and Q (24-28).However, some scientists criticized this estimation method for not being sufficient every time (29).Besides the ACF and PACF plots, there are other methods that can be used to determine the optimal ARIMA model, such as the Bayesian Information Criterion (BIC).The model with the lowest BIC value is considered as the best fit.
In this study, we tested various parameter values for p, q, P and Q (with a maximum value of 4) in ascending order, and selected the models with the lowest BIC values.After model identification and parameter estimation, we selected the models that met all standards, as shown in Table 1.
Another key point of model construction is the diagnosis of residual sequence and model parameters.If the residual sequence of the model follows a normal distribution and appears random like white noise, it suggests that the model has extracted most of the information from the original sequence (30).The residual sequences that are considered as white noise could be verified by Ljung-Box Q-test (31,32) with p value greater than 0.05.In addition, the parameters of the model should also be significantly different from zero (p < 0.05) to ensure the validity of the model.The models shown in Table 1 all passed both residual sequences Ljung-box Q-test and model parameters diagnosis, which were eventually selected as the optimal models.2).These low error values indicate that the model has a high level of accuracy in predicting the future values of these variables.
In this study, the established optimal models were applied to forecast the radioactivity levels around SNPP in 2024.As shown in Table 3, the activity concentrations of 90 Sr in food from Q1 to Q4 were predicted to be 0.067-0.77Bq/kg, and the CAD to be 0.067-0.129mSv.The forecasted values of 90 Sr in food are lower than the concentration limit recommended by the Chinese standard (33) and    ACF plot of the sequence of 90 Sr activity concentrations in food after performing first-order seasonal differencing.Based on radioactivity monitoring data around SNPP, ARIMA models were established to fit and predict the short-term changes.However, there were instance where the forecast values and actual values of individual quarters differ greatly, such as 90 Sr in food in Q1 and Q2 of 2019, CAD in Q4 of 2017 and 2020, and Q2 of 2021 (Figure 7).These differences may be related to large fluctuations of external conditions, such as climate (42), which were accounted by the ARIMA model.While ARIMA models focus on the role of time factors in fitting and forecasting, they do not analyze and discuss the relationship between the prediction object and the influencing factors (43).
When the prediction time becomes longer, these external influencing factors bring greater changes.Therefore, despite ARIMA models can maintain high accuracy when using historical data for short-and medium-term prediction, their accuracy may decrease when predicting further into the future.To improve the models accuracy in predicting radioactivity levels around SNPP for a longer period, continuous radioactivity monitoring data collection should be carried out to adjust and refine the model over time (44).This will ensure that the models reflect the variability and trends of radioactivity levels, leading to the best possible predictions.
In this study, the monitoring data from 2011 to 2022 as the training dataset.In June 2018, after the operation of SNPP, significant changes have taken place in external conditions.In our previous research, we found that the contributions of radioactive substances released to the environment after SNPP operation was negligible (45,46), we think this change does not affect the establishment of the model.In addition, if a nuclear accident or other unknown external input occurs, the model cannot predict.When the observed data are significantly higher than predicted, excluding external input, it indicates that SNPP significantly releases radioactive substances.
The ARIMA model has been successfully applied to SNPP, but its applicability to other nuclear power plants with different technologies, operating practices or environmental backgrounds, such as Qinshan Nuclear Power Plant, will depend on the regularity and randomness of the time series of their historical radioactivity data.If the time series has a certain regularity and norandom, the ARIMA model has good potential to be applied.

Conclusion
This study attempts to develop ARIMA models for medium and short-term prediction of radioactive levels in the environment and food around SNPP, utilizing historical time series data spanning from 2011 to 2023.The data from 2011 to 2022 were used as the training set, while the data from 2023 served as the testing set.The time series of gross β in water was a white noise series, which has no value in establishing a model.In contrast, the time series of 90 Sr in food and CAD were stationary, non random, and have short-term correlation, making them suitable for establishing ARIMA models.In this study, our established models showed good consistency between the fitted values and observed values, coupled with the relatively small MAPEs, suggesting satisfactory fitting effect and accuracy in the ARIMA models.
The CAD and 90 Sr activity concentrations in food in 2024 were predicted using the established model.The forecasted values of 90 Sr in food are lower than the recommended threshold by the Chinese standard, and comparable to the active concentrations of 90 Sr in food from other regions of the world.The forecasted values of CAD are comparable to the level of Qinshan Nuclear Power Plant.
However, we acknowledge the limitation of the ARIMA models, such as some external factors, e.g., climate, were not taken into account.This limitation might be compensated by continuous feeding of monitoring data thus to adjust and improve the models performance over time.In short, ARIMA models can be used as an additional tool for environmental radioactivity monitoring and human health risk assessment around SNPP.

Figure 4
Figure 4 illustrates the schematic flow for determination of gross α and β in water, according to the Chinese national standard (21).

FIGURE 1
FIGURE 1Samples collected in this study.

FIGURE 2
FIGURE 2Monitoring stations of water and food around SNPP.The map was produced by software of Lantu Draw (URL link: https://www.ldmap.net/).

FIGURE 3 Frontiers
FIGURE 3Monitoring stations for ambient radiation exposure of ambient environmental around SNPP.The map was produced by software of Lantu Draw (URL link: https://www.ldmap.net/).

Figures 4 ,
Figures 4, 5  show the time-series diagrams and autocorrelation function (ACF) plots of the original sequences of the CAD,90 Sr activity concentrations in food, and gross β activity concentrations in water.The time-series diagram for the CAD (Figure6A) shows fluctuations within a certain range and most ACFs (Figure7A) fall into the confidence interval and tend to zero rapidly.This indicates that the sequence is stable and does not need differencing.In contrast, both the time-series diagrams (Figure6B) and ACF plots (Figure7B) for the90 Sr activity concentrations in food exhibit clear periodic changes, indicating that the sequences require seasonal differencing.For gross β activity concentrations in water, the time series diagram (Figure6C) appears relatively stable without any obvious trend, but all ACFs (Figure7C) fall within the confidence interval.Therefore, it was determined to be white noise, and further analysis was halted.After performing first-order seasonal differencing on the original time-series of 90 Sr activity concentrations in food, the ACF plot of the resulting time-series is shown in Figure8.The majority of ACFs fall within the confidence interval and quickly reach zero, indicating that the new sequence is stable and does not exhibit any significant seasonal fluctuations.The determination of p, q, P and Q is a crucial aspect for establishing the ARIMA model.Some researchers suggested that the most effective method to determine these values involves analyzing ACF and partial autocorrelation function (PACF) of time-series after differencing and seasonal differencing.The lag number of the peak entering the confidence level in ACF and PACF plots is used to determine the values of p, q, P and Q (24-28).However, some scientists criticized this estimation method for not being sufficient every time(29).Besides the ACF and PACF plots, there are other methods that can be used to determine the optimal ARIMA model, such as the Bayesian Information Criterion (BIC).The model with the lowest BIC value is considered as the best fit.In this study, we tested various parameter values for p, q, P and Q (with a maximum value of 4) in ascending order, and selected the models with the lowest BIC values.After model identification and parameter estimation, we selected the models that met all standards, as shown in Table1.Another key point of model construction is the diagnosis of residual sequence and model parameters.If the residual sequence of the model follows a normal distribution and appears random like white noise, it suggests that the model has extracted most of the information from the original sequence (30).The residual sequences that are considered as white noise could be verified by Ljung-Box Q-test(31,32) with p value greater than 0.05.In addition, the parameters of the model should also be significantly different from zero (p < 0.05) to ensure the validity of the model.The models shown in Table1all passed both residual sequences Ljung-box Q-test and model parameters diagnosis, which were eventually selected as the optimal models.

Figure 9
Figure 9 shows the model forecasting results covering the training, testing and prediction periods.Each diagram consists of the observed, fitted and forecasted data, as well as the upper and lower confidence limits.The fitted values in the diagram follow the same trend as the observed values and are within the confidence interval, indicating that the fitting effect of the model is good.To further the accuracy of the model, MAPEs were calculated between the observed and the forecast values for 90 Sr activity concentrations in food, CAD from Q1 to Q4 in 2023, which were 21.4 and 22.4%, respectively (Table2).These low error values indicate that the model has a high level of accuracy in predicting the future values of these variables.In this study, the established optimal models were applied to forecast the radioactivity levels around SNPP in 2024.As shown in Table3, the activity concentrations of 90 Sr in food from Q1 to Q4 were predicted to be 0.067-0.77Bq/kg, and the CAD to be 0.067-0.129mSv.The forecasted values of 90 Sr in food are lower than the concentration limit recommended by the Chinese standard (33) and

FIGURE 6
FIGURE 6 Time-series diagram of monitoring data.(A) CAD; (B) 90 Sr activity concentrations in food; (C) Gross β activity concentrations in water.

FIGURE 7 ACF
FIGURE 7 ACF plot of original sequence.(A) CAD; (B) 90 Sr activity concentrations in food; (C) Gross β activity concentrations in water.

FIGURE 9
FIGURE 9 Time-series diagram for the model fit and forecast.(A) 90 Sr activity concentrations in food; (B) CAD.

TABLE 2
Observed and predicted radioactivity levels around SNPP from Q1 to Q4 in 2023.

TABLE 3
Forecast values of radioactivity levels around SNPP from Q1 to Q4 in 2024.

TABLE 4
Comparison of activity concentrations of 90 Sr in foods from different regions of the world.