Forecasting Scrub Typhus Cases in Eight High-Risk Counties in China: Evaluation of Time-Series Model Performance

Scrub typhus (ST) is expanding its geographical distribution in China and in many regions worldwide raising significant public health concerns. Accurate ST time-series modeling including uncovering the role of environmental determinants is of great importance to guide disease control purposes. This study evaluated the performance of three competing time-series modeling approaches at forecasting ST cases during 2012–2020 in eight high-risk counties in China. We evaluated the performance of a seasonal autoregressive-integrated moving average (SARIMA) model, a SARIMA model with exogenous variables (SARIMAX), and the long–short term memory (LSTM) model to depict temporal variations in ST cases. In our investigation, we considered eight environmental variables known to be associated with ST landscape epidemiology, including the normalized difference vegetation index (NDVI), temperature, precipitation, atmospheric pressure, sunshine duration, relative humidity, wind speed, and multivariate El Niño/Southern Oscillation index (MEI). The first 8-year data and the last year data were used to fit the models and forecast ST cases, respectively. Our results showed that the inclusion of exogenous variables in the SARIMAX model generally outperformed the SARIMA model. Our results also indicate that the role of exogenous variables with various temporal lags varies between counties, suggesting that ST cases are temporally non-stationary. In conclusion, our study demonstrates that the approach to forecast ST cases needed to take into consideration local conditions in that time-series model performance differed between high-risk areas under investigation. Furthermore, the introduction of time-series models, especially LSTM, has enriched the ability of local public health authorities in ST high-risk areas to anticipate and respond to ST outbreaks, such as setting up an early warning system and forecasting ST precisely.


INTRODUCTION
Scrub typhus (ST) is a mite-borne disease caused by the Orientia tsutsugamushi (O. tsutsugamushi). A number of rodent species have been identified to carry O. tsutsugamushi including Apodemus agrarius, Micromys minutus, Mus musculus, Rattus norvegicus, Microtus fortis, and Tscherskia triton O'Guinn et al., 2010;Sames et al., 2010).Once infected, patients will have typical clinical symptoms, such as fever, headache, fatigue, myalgia, chills, eschar, facial flushing, rash, acute hearing loss, and pneumonitis (Premaratna et al., 2006;Zhang et al., 2010). Delays in diagnosis of scrub typhus can lead to acute respiratory distress syndrome, septic shock, and multi-organ failure, leading to death (Chrispal et al., 2010). Currently, ST attracts considerable public health concerns in China, South Korea, India, and Thailand (Park, 2016;Rodkvamtook et al., 2018;Zheng et al., 2019). In China, not only the incidence of reported ST cases has significantly increased from 0.09 to 1.6 per 100,000 population in 2006 and 2016, respectively, but also ST cases are reported in the entire country within rural and urban communities (Li et al., 2020).
ST epidemiology has been extensively studied in China, such as identification of zoonotic sources of ST infection (Kuo et al., 2015), the clinical manifestations of ST infection 2012), the local spatial or spatio-temporal distributions of ST notifications (Kuo et al., 2011;Ding et al., 2012b), and the associations between the environmental factors and ST notifications (Tsai and Yeh, 2013;Wardrop et al., 2013;Li et al., 2014;Yang et al., 2014). Although the spatial variation of ST cases in China has been comprehensively studied, to date there are relatively few studies aiming to validate epidemiological time-series models to forecast ST cases. Precise forecasting of the ST cases can help local health administrative departments release an early warning of the increased risk of ST incidence and distribute reasonable medical resources in a timely manner for preventing and controlling the ST spread. Machine learning techniques have been developing rapidly during the past decade; random forest, support vector machine, or gradient boost machine techniques are used to determine the relationship between the studied natural attribute and the related environmental variable in the field of public health (Carvajal et al., 2018;He et al., 2018a). However, these methodologies have certain limitations on forecasting the disease in the future. The available literature indicates that several time-series modeling approaches have been applied to infectious diseases in China including the autoregressiveintegrated moving average (ARIMA) model and the seasonal ARIMA (SARIMA) model (Ding et al., 2012a;Yang et al., 2015).While the ARIMA and SARIMA modeling approaches cannot account for the effect of disease-related environmental factors (e.g., the meteorological and land cover factors), the SARIMAX model allows time-variant exogenous variables to be considered along with the temporal autocorrelation in disease counts. The SARIMAX models have previously been utilized to model the time series of hemorrhagic fever with renal syndrome in China (He et al., 2018b). Recently, a new time-series modeling approach based on recurrent neural networks (RNNs) has been developed (Yu et al., 2019;Sherstinsky, 2020). This modeling approach known as the long-short term memory (LSTM) model, a machine learning method, has shown strong ability for COVID-19 time-series forecasting (Chimmula and Zhang, 2020). The memory capability of LSTM to retain information from previous time instants is suitable for time-series forecasting, especially in the case of time-series with temporal correlations. The LSTM model inherits all features from RNN and the basic artificial neural network (ANN), such as self-learning, self-adaption, and selforganization; moreover, the structure of LSTM can to some extent solve the issue of vanishing and/or exploding temporal gradients that occur in RNN modeling (Gonzalez and Yu, 2018;DiPietro and Hager, 2020). However, to date the relative performance of these time-series modeling approaches at forecasting the time series of ST cases in high-risk areas in China has not been explored. Therefore, such a study will benefit the public health managers on providing precise ST forecasting models, especially in the high-risk ST counties.
In this study, we aimed to evaluate the performance of three competing time-series modeling approaches at uncovering the temporal variability of ST cases in eight high-risk counties of China and quantifying the role of ST-related environmental factors at explaining the temporal variation in ST incidence.

Data Collection and Pre-Processing
Monthly ST cases were collected for the top eight high-risk ST counties of China for the 2012-2020 period from the China Information System for Disease Control and Prevention, including Yingshang County (Anhui province), Guangning County (Guangdong province), Huaiji County (Guangdong province), Yingde City (Guangdong province), Longling County (Yunnan province), Gengma County (Yunnan province), Mang city (Yunnan province), and Yingjiang County (Yunnan province), as shown in Figure 1. The total cases of the eight counties over the time period of analysis were 2,177, 2,832, 3,395, 2,688, 4,052, 1,936, 1,952, and 2,073, respectively. The criteria for a confirmed ST case included epidemiological exposure patient histories (travel to an epidemic area and contact with chiggers or rodents within 3 weeks before the onset of illness), clinical manifestations (for example, skin rash, lymphadenopathy, high fever, and eschar or ulcers), and also positivity for at least one of the laboratory diagnostic criteria/tests: isolation of O. tsutsugamushi from clinical specimens, or detection of O. tsutsugamushi by polymerase chain reaction (PCR) in clinical specimens, or a 4-fold or greater rise in serum IgG antibody titers between acute and convalescent sera by using indirect immunofluorescence antibody assay (IFA) (Zhang et al., 2013;Li et al., 2020).
Environmental data used in this study included raster maps of the normalized difference vegetation index (NDVI) representing the amount of vegetation at specific locations which was collected from the MODIS-Terra products (MOD13A2, https://modis.gsfc. nasa.gov/data/dataprod/mod13.php) with spatial resolution 1 km. We also used meteorological data from weather monitoring stations in China from the China Meteorological Administration (http://www.cma.gov.cn/), including precipitation, pressure, relative humidity, sunshine duration, daily mean temperature, daily minimum temperature, daily maximum temperature, and mean wind speed. The multivariate El Niño/Southern Oscillation Index (MEI), regarded as global climate change proxy, was collected from the physical sciences laboratory of National Oceanic and Atmospheric Administration (https://psl.noaa.gov/enso/mei/).
The inverse distance weighted method was employed for mapping the six meteorological data with the same spatial resolution of the NDVI. Then, the NDVI and the six meteorological data were extracted by the administrative boundaries of the eight counties, and the mean values of each variable were calculated each month during the study period for further analysis.

SARIMAX Modeling
Based on the standard ARIMA model, the SARIMAX model considers simultaneously the seasonal variation in ST cases and accounts for the effects of exogenous risk factors for better understanding and fitting the considered time series. The basic equations of SARIMAX are as follows: where Y t represents the ST cases at time instant t, while X i,t represents the i th exogenous variables (i 1, 2, . . . , 8) at time instant t , and β i represent the coefficients of the exogenous variables; Z t representing the main trend of the ST time series satisfies the SARIMA equation (Eq. 2) based on the model structure (p, d, q) × (P, D, Q) S . In Eq. 2, S represents the periodicity, and ε t denotes the white noise; B represents the backshift operator, e.g., B i Z t Z t−i , while ∇ represents the differencing process, e.g., (1 − B S ) D with the non-seasonal and seasonal differencing orders d and D, respectively; ϕ p (B) 1 − p i 1 ϕ i B i and Φ P (B S ) 1 − P i 1 Φ i B iS represent the non-seasonal and seasonal autoregressive process with the orders p and P, respectively; represent the non-seasonal and seasonal moving average processes with the orders q and Q, respectively. In the current study, the periodicity parameter S was set to 12 months. The main procedure of defining the structure of SARIMAX is briefly described as follows: 1) SARIMA models with the parameters p, P, q, and Q ranged from 0, 1, and 2, and the parameters d and D ranged from 0 and 1 were constructed to fit the time series of ST cases in each of the eight studied counties. The model with the lowest Akaike information criterion (AIC) value was regarded as the best SARIMA model. 2) Multicollinearity was investigated before entering the exogenous variables into the SARIMAX model. We found a very high level of correlation (0.9834 and 0.9912, respectively) between mean temperature and daily minimum temperature (or daily maximum temperature), and therefore, the daily mean temperature was selected for modeling in the current study. In Frontiers in Environmental Science | www.frontiersin.org January 2022 | Volume 9 | Article 783864 our study, eight exogenous variables with a 0-5 month lag were prepared in an exogenous variable pool. Based on the SARIMA model, a forward stepwise variable selection procedure was implemented to build the SARIMAX model. During each of the variable selection loops, multicollinearity was tested before adding the exogenous variable into the model, and only the one with a variance inflation factor (VIF) smaller than 3 was regarded as a candidate. Furthermore, the variable candidates were added into the model separately, and only the one with the lowest AIC in each loop was added into the model. Finally, a t-test was used to exclude the exogenous variables with largest p-values (larger than 0.05). More detailed information can be found in the literature (He et al., 2018b). The first eight-year ST cases were employed to fit the SARIMAX model, and the last year data were used to test the performance of the built model. To evaluate the goodness of the fit of the model, we considered the R 2 , mean absolute error (MAE), and root mean square error (RMSE).

LSTM Modeling
In order to test the possibility of applying LSTM in ST modeling and forecasting, the exogenous variables included in the final SARIMAX model together with the ST time series were chosen as the LSTM input. LSTM is constructed by a number of connected cells (the basic unit of the network), while each cell consists of three gates, i.e., the forget gate, input gate, and output gate; see Figure 2. Specifically, the hidden state from the previous cell h t−1 and the current series Y t at time instant t are combined and flow through the forget gate, leading to useless information of loss by in other words, the forget gate can retain the useful information in series modeling. The input gate gathers the hidden state and current series to update the cell situation C t of LSTM, and two preparation workflows are designed, i.e., , and the cell situation can be calculated by C t f t pC t−1 + i t pC t ; in other words, the input gate collects and updates the information flow into the model. The output gate uses the hidden state and current series to update the hidden state as the input of the next cell of LSTM, i.e., o t σ(W o · [h t−1 , Y t ] + b o ) and h t optanh(C t ); in other words, the output gate integrates the information and generates the output information for the next time instant. In these equations, σ represents the sigmoid activate function, W and b represent the weights and bias in different parts of LSTM, respectively. Other than some other advanced LSTM models, such as forward and backward variate sensitive LSTM, convolutional LSTM, and convolutional neural network LSTM (Kim and Cho, 2019;Wan et al., 2019;Fouladgar and Främling, 2020), the current study used the basic LSTM as described above to forecast the ST series. To define the best structure of the LSTM model, the values of the hidden layers were considered to vary from 2 to 3, while the hidden dimension varies from 54 to 68 with interval 4, the batch size varies from 45 to 55 with interval 5, and the number of previous months used to forecast the current month ranges from 3 to 12. The optimal model was selected with the largest R 2 and smallest MAE and RMSE.

The Performance of SARIMA and SARIMAX Models on ST Forecasting
Various SARIMA and SARIMAX models were established with different model structures, and temporal lagged exogenous variables at the eight considered counties and the optimal model structure with the corresponding performance are presented in Supplementary Appendix Table A1. According to the optimal model structures, all models have the same seasonal difference order, i.e., D 1. Specifically, the models of the four counties in the Yunnan province have very similar seasonal characteristics, i.e., P, D, and Q are exactly the same, except the Q value for Yingjiang County. Regarding the SARIMA model, it showed good performance in modeling ST variations in the model fitting stage in Guangning County, Longling County, and Yingjiang County with R 2 larger than 0.8, while the SARIMA model showed better performance (in terms of R 2 ) at the model forecasting stage in Longling County, Gengma County, and the Mang city of the Yunnan province (Table 1).
Through exogenous variable selection processes, the SARIMAX models absorb the strength of one or several exogenous variables with temporal lags for ST variation modeling. Atmospheric pressure with a 4-month temporal lag was found to be the sole exogenous variable for SARIMAX modeling in Huaiji County. As shown in Supplementary Appendix Table A1, the significant ST-associated exogenous variables vary between the eight counties under investigation. Atmospheric pressure, sunshine duration, wind speed, and MEI with various temporal lags were found to be correlated with the ST temporal variation in Yingyang County and Guangning County; however, precipitation and relative humidity were also included in ST forecasting at Yingyang County, while NDVI was included in the model at Guangning County. At Gengma County and Yingjiang County, relative humidity and sunshine duration were significant variables associated with ST; specifically, mean temperature was also found to be associated with ST at Yingjiang County. In addition, mean temperature and wind speed were two important impact factors of ST at Yingde city and Mang city, but precipitation and MEI were considered as other important factors for the two cities, respectively. Finally, the ST variation at Longling County was closely correlated with precipitation, relative humidity, and mean temperature with various temporal lags. Compared to the SARIMA model, the results of the SARIMAX model showed that more accuracy performance (in terms of R 2 ) can be achieved at the model fitting stage in all of the eight counties; meanwhile, at the model forecasting stage, only the SARIMAX models in Yingshang County and Yingjiang County showed better forecasting accuracy than the SARIMA models in terms of RMSE and MAE.

ST Cases Forecast Performance of LSTM Models
By inputting the ST series and the corresponding significantly related exogenous variables shown in Supplementary Appendix Table A1 to the LSTM model, the optimal model structure was obtained according to the smallest R 2 in the model fitting stage. The results showed that better performance can be yielded by setting the length of input series as 12 months at Huaiji County, Yingde city, Longling County, Gengma County, and Yingjiang County, while the optimal length of input series at Yingshang County, Guangning County, and Mang city were 10, 3, and 11, respectively. The hidden dimension, batch size, and number of layers varied between counties. At the model fitting process, the LSTM model showed better performance than SARIMA at Yingshang County and Longling County in terms of R 2 ; at the model forecasting stage, LSTM showed more accuracy in forecasting ST cases than SARIMA and SARIMAX at Yingshang County and Yingde County, while only than SARIMAX at Guangning County and Longling County ( Table 1).

The Comparisons of SARIMA, SARIMAX, and LSTM Estimations
The model fitting and forecasting results were separated by a vertical dash line, as shown in Figure 3.

Methodological Considerations for Modeling Temporal Variation in ST Notifications
Although the performance of the three time-series forecasting methods varied between counties at both the model fitting stage and the model forecasting stage, the methodological comparisons among SARIMA, SARIMAX, and LSTM should be discussed as follows: 1) continuous series modeling was the core concept of the three methods, i.e., the previous conditions can be used to forecast the current or future conditions; furthermore, the SARIMA and SARIMAX models took a step ahead that included seasonal or cyclic parts in modeling, i.e., the seasonal auto regression part and seasonal moving average part with the cycle of 12 months. Similarly, the similar SARIMA model was employed for modeling the ST series at Laiwu city and Shandong province, China (Ding et al., 2012a;Yang et al., 2015). On the other hand, LSTM also borrows the continuous characteristics of ST series for forecasting. Compared to the standard artificial neural network (ANN), the forget gate was utilized to remove the information with larger lags, which is similar to the memory of humans that one can remember the recent things but will forget some of the things that occurred long time ago (Yu et al., 2019). In the current work, we have demonstrated that the length of the time-series data used to forecast ST varies between high-risk counties being 12 months in five counties, 11, 10, and 3 months in the other three counties. These results indicate that ST has predominantly an annual cycle, which is in line with the literature that 8-12 months is the general cycle of ST in Guangzhou city, China . 2) Previous studies have used a number of regression techniques to unravel the ST temporal variation in incidence, and its associations with environmental factors using the spatial Poisson regression model, negative binomial regression model, MaxEnt, random forest, and ANN models (Wardrop et al., 2013;Li et al., 2014;Yang et al., 2014;Kwak et al., 2015;Seto et al., 2017;Yu et al., 2018;Acharya et al., 2019). However, these models cannot capture the seasonal characteristics of ST incidence, which can be overcome by the SARIMAX model that can simultaneously depict the seasonal characteristics and the effects of exogenous factors. Given the complex and often nonlinear interplay between ST factors and ST infection (Elliott et al., 2019), the linear regression component of the SARIMAX model may not be sufficient for modeling and Frontiers in Environmental Science | www.frontiersin.org January 2022 | Volume 9 | Article 783864 forecasting ST incidence. The LSTM model offers a selforganizing, self-learning, and highly nonlinearity system from the ANN, and thus, it is rather suitable for modeling the ecological system of ST. In view of its model structure, the LSTM was initially postulated as the optimal model among the three methods. However, our results indicate that LSTM not always outperforms SARIMA or SARIMAX indicating that the selection of methods for ST series modeling and forecasting should be carefully considered through model comparison and validation to local conditions.

Environmental Factors Associated With the Temporal Variation in ST Notifications
Another outcome of the current work is the determination of environmental factors in each of the considered eight counties associated with ST cases using SARIMAX models. Including these factors (i.e., exogenous variables mentioned above), the SARIMAX showed better performance than SARIMA at the modeling fitting stage ( Table 1), demonstrating that the considered factors were significantly associated with ST variations. Temperature with various temporal lags was found to be positively related to ST variations at Yingde city, Longling County, Mang city, and Yingjiang County, which is in line with the ecological niche modeling of ST in the literature that temperature was found to be the key factor in determining ST occurrence . A previous study found that the sunshine duration was negatively and positively associated with ST with 1-3 month lags and 4-6 month lags, respectively . Similar findings can be found in the current study, i.e., sunshine duration with 3-, 2-, and 1-month lags were negatively associated with the ST variations at Yingshang County, Guangning County, and Gengma County, respectively, but the sunshine duration with 0-month lag showed positive relationship with ST at Yingjiang County. Warming the environment promotes the growth of vector larvae and rodents and increased exposure opportunities due to people wearing shorter clothes (Yao et al., 2019). The other study found that the ST case was positively correlated with the duration of sunshine, suggesting an occupational exposure where people possibly have longer time for agricultural field work, leading to an increased probability of exposure (Li et al., 2014). Besides, relative humidity and pressure were supposed to be key factors that influenced the regeneration of the rodents, i.e., a 1-2 month lag effects of relative humidity and pressure were closely related to the ST variation (Sun et al., 2017), which is similar to the findings of the results at Yingshang County, Guangning County, Huaiji County, Longling County, and Gengma County. In our study, we considered the wind speed in our SARIMAX models with 1-, 2-, 3-, 5-month lag since it can be a factor associated with the spawning conditions of mites (Kwak et al., 2015). Indeed, we found that wind speed was a contributor to ST variation at four counties of the current study, especially in Yingde city. Finally, our results indicate a significant effect of the multivariate ENSO index on the time series of ST cases in three of the eight counties investigated (i.e., Yingshang County, Guangning County, and Mang city). This is in line with other studies that found an association between ST incidence and the ENSO Index which is regarded as a global pattern of climatic oscillation affecting the local environment and thus human population behaviors .

Contribution to Public Health
Machine learning techniques have become popular in modeling nonlinear systems, including in the field of public health (dos Santos et al., 2019;Panch et al., 2018), by providing precise predictive models (Passos et al., 2016). Introduction of LSTM has enriched the ability of local public health authorities in ST high-risk areas to anticipate and respond to ST outbreaks. The proposed models, such as SARIMAX and LSTM, can be used locally in high-risk ST counties for ST early warning and precise ST forecasting programs; it can also enable local public health managers to monitor the variation of the environmental factors and deploy public health measures, such as health promotion alerts to the communities to prevent large ST outbreaks.

Limitations and Future Work
Certain limitations of the current work should be mentioned. First, due to lack of a longer time series of ST series data, this study only utilized a 96-month time series for model fitting, which may have hindered the performance of LSTM models; it may be the reason why LSTM showed low accuracy in ST estimation. Second, this study explored the basic LSTM model in structure and did not explore modeling combinations that considered SARIMAX + LSTM, which can also give accuracy predictions (Sheng and Jia, 2020); hence, future studies should explore the possibility of this combination for ST forecasting. Third, our SARIMAX models did not consider socioeconomic factors (including the gross domestic product, income, urbanization, population density, educational institutions, land use and land change, and medical institution), which may also play a role in the variations of ST (Ranjan and Prakash, 2018). Finally, although it is reported the warming condition might favor the reproduction of mites and increase the probability of human infection, the mechanism between the global climate change and ST outbreaks is still unclear (Jeung et al., 2016;Kuo et al., 2015); given the short-term impact of temperature mentioned in previous subsection, the long-term impact of global climate change (such as the warming condition) on ST outbreaks and its feasibility in forecasting ST outbreaks are worthy to explore in the future (Zhou et al., 2021); furthermore, the hydrology impact of climate change on ST variations will be another topic for consideration (Zhou et al., 2018).

CONCLUSION
In the current study, SARIMA, SARIMAX, and LSTM models were employed to model the temporal variation of ST cases in Frontiers in Environmental Science | www.frontiersin.org January 2022 | Volume 9 | Article 783864 high-risk communities in China. The results indicated that annual dynamics of ST vary significantly between the eight studied counties; with exogenous variables, the SARIMAX and LSTM models showed better performance than SARIMA models. Specifically, precipitation, atmospheric pressure, relative humidity, mean temperature, sunshine duration, wind speed, NDVI, and MEI were found to be partly associated with the time series of ST cases. The models and findings of the current study will support the development of local early warning systems for ST in the high-risk areas in China.

DATA AVAILABILITY STATEMENT
The data analyzed in this study are subject to the following licenses/restrictions: Patient data are protected by the China CDC and are unsuitable for public sharing. The scrub typhus data are not allowed to be publicly shared due to the local infection disease law Requests to access these datasets should be directed to data@ chinacdc.cn.

AUTHOR CONTRIBUTIONS
JH and WZ contributed to conception and design of the study. WZ, WY, YW, and QQ organized the database. JH performed the statistical analysis. JH wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, edit, and approved the submitted version.

FUNDING
This work is partly supported by the China Postdoctoral Science Foundation (2020M681825) and the projects from the 13th Five-Year Plan (Nos.18QNP063 and 17SAZ01).