ORIGINAL RESEARCH article

Front. Ecol. Evol., 02 May 2022

Sec. Environmental Informatics and Remote Sensing

Volume 10 - 2022 | https://doi.org/10.3389/fevo.2022.875000

Forecasting New COVID-19 Cases and Deaths Based on an Intelligent Point and Interval System Coupled With Environmental Variables

  • 1. School of Management, Lanzhou University, Lanzhou, China

  • 2. Research Center for Emergency Management, Lanzhou University, Lanzhou, China

Abstract

The outbreak of Coronavirus disease 2019 (COVID-19) has become a global public health event. Effective forecasting of COVID-19 outbreak trends is still a complex and challenging issue due to the significant fluctuations and non-stationarity inherent in new COVID-19 cases and deaths. Most previous studies mainly focused on univariate prediction and ignored the uncertainty prediction of COVID-19 pandemic trends, which may lead to insufficient results. Therefore, this study utilized a novel intelligent point and interval multivariate forecasting system that consists of a distribution function analysis module, an intelligent point prediction module, and an interval forecasting module. Aimed at the characteristics of the COVID-19 series, eight hybrid models composed of various distribution functions (DFs) and optimization algorithms were effectively designed in the analysis module to determine the exact distribution of the COVID-19 series. Then, the point prediction module presents a hybrid multivariate model with environmental variables. Finally, interval forecasting was calculated based on DFs and point prediction results to obtain uncertainty information for decision-making. The new cases and new deaths of COVID-19 were collected from three highly-affected countries to conduct an empirical study. Empirical results demonstrated that the proposed system achieved better prediction results than other comparable models and enables the informative and practical quantification of future COVID-19 pandemic trends, which offers more constructive suggestions for governmental administrators and the general public.

Introduction

Risk prevention and control of major infectious diseases are essential for human health and social stability. In recent years, with global warming, the deterioration of the ecological environment, and the acceleration of urbanization, an increasing number of pathogenic microorganisms have mutated, leading to the outbreak of major infectious diseases more frequently (Wu et al., 2017). In December 2019, infectious pneumonia caused by a novel coronavirus disease (COVID-19) was discovered and quickly spread to more than 200 countries worldwide. With the global novel coronavirus epidemic becoming more serious, the World Health Organization raised the global risk of the COVID-19 epidemic to the highest level.

The COVID-19 epidemic was non-linear, dynamic, and fuzzy, thereby increasing the difficulty of prevention and control decision-making. Practical modeling approaches to predict the spread of a novel virus in the population play an essential role in the preparation and formulation of health and economic policies of any government or authority figure. When new cases increase at rates of thousands per day, health care systems of even the most developed countries are overwhelmed and unable to handle influxes of such large numbers of patients. In overwhelming situations, timely outbreak forecasting supports responsible agencies in being prepared and in managing the response effectively. For example, by targeting exclusion zones and scheduling economic activities, managing medical resources, and planning for emergency hospitals, effective forecasting is strategically essential for decision-makers (Swapnarekha et al., 2020).

Recently, various models have been developed to forecast the upcoming number of COVID-19 cases and its spread in the near future. Epidemiological models have been widely adopted in predicting COVID-19 cases and deaths. Many of these models were based on the traditional SEIR model and have been widely adopted (Li et al., 1999; Barmparis and Tsironis, 2020; He et al., 2020; Ndaïrou et al., 2020; Pandey et al., 2020). Additionally, statistical forecasting models, artificial intelligence (AI) models, and hybrid forecasting models have also been practical for epidemic prediction. For example, Ceylan (2020) applied auto regressive integrated moving average model (ARIMA) to forecast the epidemiological trend in Italy, Spain, and France. Ghosal et al. (2020) used linear and multiple linear regression methods to predict the number of deaths in India over a short period of 6 weeks. Moftakhar and Seif (2020) used the ARIMA model to forecast the number of patients with COVID-19 in Iran in the next 30 days. Ala’raj et al. (2021) developed a dynamic hybrid model based on SEIRD and ARIMA models to provide long- and short-term forecasts with confidence intervals. Ly (2020) employed an Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict COVID-19 cases in the United Kingdom. The results showed that data from Spain and Italy increased the ability to forecast COVID-19 cases in the United Kingdom. Borghi et al. (2021) used a machine learning model based on the multilayer Perceptron artificial neural network structure, which effectively predicted the behavior of four time series (accumulated infected cases, new cases, accumulated deaths, and new deaths). Parbat and Chakraborty (2020) used support vector regression (SVR) for a 60-day forecast of COVID-19 cases in India based on time-series data reported from March 01, 2020, to April 30, 2020. 
Meanwhile, combining and mixing different models has also been regarded as an effective way to improve prediction, with applications in different fields, such as economic modeling and policy-making (Stock and Watson, 2004; McAdam and McNelis, 2005), electricity price forecasting (Yang et al., 2022), environmental pollution (Hao et al., 2021), and COVID-19 forecasting (Castillo and Melin, 2020).

Although these methods have contributed significantly to the field of COVID-19 prediction, most of the models focused on deterministic forecasts and ignored the uncertainty in the forecasts, leaving government disease control departments unable to assess and manage epidemic risk. Additionally, one area of research has been the impact of air pollution on new cases and deaths from COVID-19. It is known that air pollution can cause several diseases, including chronic respiratory diseases, stroke, and cardiovascular problems. Recent studies have identified links between air pollution (mainly nitrogen oxides such as NO2, and PM2.5) and deaths and cases of COVID-19. Travaglio et al. (2021) explored potential links between air pollutants and COVID-19 mortality and infectivity. They found that air pollutant concentrations, especially nitrogen oxides and PM2.5, were positively associated with COVID-19 mortality and infectivity. Konstantinoudis et al. (2021) used high geographical resolution to investigate the effect of long-term exposure to NO2 and PM2.5 on COVID-19 mortality in England. They found some evidence of an association of NO2 with COVID-19 mortality, while the effect of long-term exposure to PM2.5 remained uncertain. Lian et al. (2021) reported that urban lockdown was an effective method to reduce the number of new cases, and that nitrogen dioxide (NO2) concentrations can be used as an environmental indicator to assess the effectiveness of lockdown measures. Some studies discussed the influence of meteorological parameters on the transmission of COVID-19 and found that weather factors could affect its spread (Malki et al., 2020; Shi et al., 2020). For example, Wu et al. (2020) analyzed the relationship between temperature change and COVID-19 pneumonia and its impact on 166 countries. Wang et al. (2020) demonstrated that temperature can significantly modify the spread of COVID-19 and that there may be an optimal temperature for virus transmission. The above studies have pointed out the effects of environmental and meteorological factors on the survival and spread of the virus. Many studies support the view that both nitrogen oxides and temperature play an important role in the spread and infection of COVID-19, motivating the current study to take environmental and meteorological factors into account in the prediction of COVID-19. We sought to determine whether the addition of these variables would improve the outbreak prediction.

Hence, by taking into consideration the results of the above works, this study utilized a novel point and interval data-driven forecasting model consisting of a distribution function analysis module, an intelligent point prediction module, and an interval forecasting module. First, several distribution functions (DFs) optimized by a metaheuristic algorithm were effectively designed to analyze the characteristics of the COVID-19 series. Furthermore, we used environmental features, such as nitrogen dioxide (NO2) and temperature, as inputs to the multivariable hybrid prediction model, which is a combination of the sine cosine algorithm (SCA) and least square support vector machine (LSSVM). Based on the DFs and point forecasting results, interval forecasting was designed to obtain uncertain information. The new case and new death series collected from the top three affected countries were used for the empirical study. We compared the performance of the best data-driven univariate model and the best multivariate model in an attempt to generate better predictions.

Our main contributions are as follows:

  • 1. A practical epidemic analysis and prediction tool based on distribution function analysis, intelligent point prediction, and interval forecasting modules is proposed for the government and the public.

  • 2. Environmental variables, such as NO2 and temperature, were selected as inputs to construct a multivariable hybrid prediction model.

  • 3. Interval forecasting based on DFs and point forecasting results can provide more uncertainty information for decision-making.

The rest of the paper is organized as follows. Section “Methodology” introduces the related methodologies. Section “A Framework of the Developed Hybrid Forecasting System” describes the main process of the developed hybrid system. Section “Data Description and Evaluation Criteria” describes the research datasets and the evaluation criteria of this study. Section “Experimental Results and Analysis” discusses the forecasting results of the proposed model and compares them with those of other models. Finally, Section “Conclusion” summarizes the main conclusions of this paper.

Methodology

Some related methodologies are introduced in this section, including LSSVM, SCA, DFs, and interval prediction theory.

Least Squares Support Vector Machine

The support vector machine (SVM) proposed by Vapnik is an essential machine learning method that effectively handles pattern recognition and classification tasks. The SVM targets small-sample problems and is based on structural risk minimization; it mitigates the overfitting, non-linearity, curse-of-dimensionality, and local-minimum problems of earlier machine learning models and has good generalization ability. However, because a quadratic programming problem must be solved during learning, the method trains slowly and is less stable on large-scale training samples, which limits its scope of application. Therefore, Suykens and Vandewalle (1999) proposed the least squares support vector machine (LSSVM), an extension of the standard SVM that significantly reduces computational complexity and improves training speed by transforming the quadratic programming problem into a set of linear equations. More details on the LSSVM can be found in Suykens and Vandewalle (1999).

Different types of kernel functions can be used in the LSSVM model, such as the sigmoid, polynomial, and radial basis function (RBF) kernels. The RBF is a general-purpose kernel choice discussed in Keerthi and Lin (2003); it requires fewer parameters and performs well in applications. Accordingly, this study adopts the RBF as the kernel function.
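For reference, the RBF kernel and the linear system that replaces the quadratic program in LSSVM regression can be written in their standard textbook form as below. The symbols σ (kernel width), γ (regularization parameter), α (Lagrange multipliers), and b (bias) are assumed notation, and the 2σ² width convention is one common choice that may differ from the one used in this study.

```latex
K(x, x_i) = \exp\!\left(-\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}}\right)

\begin{bmatrix} 0 & \mathbf{1}^{\top} \\ \mathbf{1} & \mathbf{K} + \gamma^{-1}\mathbf{I} \end{bmatrix}
\begin{bmatrix} b \\ \boldsymbol{\alpha} \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix},
\qquad
\hat{y}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b
```

Here K is the N × N matrix with entries K(x_i, x_j), so training amounts to solving one (N + 1)-dimensional linear system.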

Sine Cosine Algorithm

Mirjalili (2016) proposed the SCA, which uses sine and cosine functions to explore different regions of the search space. It can effectively avoid local optima, converge toward the global optimum, and exploit promising areas of the search space during optimization. In SCA, the dimension of the search space is determined by the number of parameters to be optimized. The SCA creates multiple random initial candidate solutions and uses a mathematical model based on sine and cosine functions to make them fluctuate outward from or toward the best solution:

$$X_i^{t+1} = X_i^{t} + \mathrm{rand}_1 \times \sin(\mathrm{rand}_2) \times \left\lvert \mathrm{rand}_3 P_i^{t} - X_i^{t} \right\rvert, \quad \mathrm{rand}_4 < 0.5 \tag{2}$$

$$X_i^{t+1} = X_i^{t} + \mathrm{rand}_1 \times \cos(\mathrm{rand}_2) \times \left\lvert \mathrm{rand}_3 P_i^{t} - X_i^{t} \right\rvert, \quad \mathrm{rand}_4 \ge 0.5 \tag{3}$$

where $X_i^{t}$ is the current position at the tth iteration in the ith dimension, $P_i^{t}$ is the targeted optimal global solution, and rand1, rand2, rand3 ∈ [0,1] are random numbers. Eqs. (2) and (3) are applied under the conditions rand4 < 0.5 and rand4 ≥ 0.5, respectively, to balance exploitation and exploration.
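The update rule can be illustrated with a minimal sketch. It is not the implementation used in this study: the function name `sca_minimize`, the population size, the iteration count, and the sphere test function are illustrative assumptions, and the random-number ranges (a linearly decreasing first factor, an angle in [0, 2π], and a weight in [0, 2]) follow the usual description of the algorithm in Mirjalili (2016) rather than the [0,1] ranges stated above.

```python
import math
import random


def sca_minimize(objective, bounds, n_agents=30, n_iter=200, a=2.0, seed=0):
    """Minimize `objective` inside the box `bounds` with a basic sine cosine algorithm."""
    rng = random.Random(seed)
    # Random initial candidate solutions inside the search box.
    agents = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_agents)]
    best = min(agents, key=objective)[:]
    best_val = objective(best)

    for t in range(n_iter):
        # r1 shrinks linearly, moving the search from exploration to exploitation.
        r1 = a - t * a / n_iter
        for agent in agents:
            for i, (lo, hi) in enumerate(bounds):
                r2 = rng.uniform(0.0, 2.0 * math.pi)
                r3 = rng.uniform(0.0, 2.0)
                r4 = rng.random()
                step = abs(r3 * best[i] - agent[i])
                if r4 < 0.5:
                    agent[i] += r1 * math.sin(r2) * step  # sine branch, Eq. (2)
                else:
                    agent[i] += r1 * math.cos(r2) * step  # cosine branch, Eq. (3)
                agent[i] = min(max(agent[i], lo), hi)     # keep the agent inside the box
            val = objective(agent)
            if val < best_val:
                best, best_val = agent[:], val
    return best, best_val


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_f = sca_minimize(sphere, [(-5.0, 5.0)] * 2)
```

In the hybrid models discussed later, the objective would be a fitting or forecasting error and each agent would hold the parameters being tuned.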

Distribution Functions

The probability distribution function has played an essential role in time series analysis, resource evaluation, and interval prediction in recent years. Researchers have fitted the basic characteristics of historical data with various DFs to extract the relevant features and better understand data uncertainty. This study used the Weibull, Gamma, Lognormal, and Rayleigh DFs to study the statistical characteristics of new COVID-19 cases and deaths in three countries. These DFs are shown in Table 1.

TABLE 1

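| Distribution functions | Equations | Parameters |
| --- | --- | --- |
| Lognormal | $f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$ | μ, σ |
| Gamma | $f(x) = \frac{x^{\xi-1}}{\theta^{\xi}\,\Gamma(\xi)} \exp\left(-\frac{x}{\theta}\right)$ | ξ, θ |
| Weibull | $f(x) = \frac{k}{c}\left(\frac{x}{c}\right)^{k-1} \exp\left(-\left(\frac{x}{c}\right)^{k}\right)$ | k, c |
| Rayleigh | $f(x) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right)$ | σ |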

Four distribution functions.

Interval Prediction Theory

Building on deterministic prediction, many studies (Song et al., 2015; Xu et al., 2017; Tian and Hao, 2020) have proposed interval prediction techniques that quantify the uncertainty of future values of time series such as air pollutant concentrations, wind energy, macroeconomic indicators, and carbon trading prices. This type of interval prediction is a dynamic method that calculates the uncertainty of future values from the point prediction and the DFs. Therefore, the performance of the interval prediction model depends on the accuracy of the point prediction and of the estimated distribution function. Specifically, assuming that the observation is Yt, at the significance level α, the probability statement for the lower limit L and upper limit U can be expressed as:

$$P(L \le Y_t \le U) = 1 - \alpha$$

The above formula can also be described by the following equation.

Additionally, we assume that the forecast values follow DFs similar to those of the historical datasets. Therefore, once the DFs of the original time series are determined, the estimated variance can be obtained. As a result, the upper and lower bounds can be calculated for a given significance level α.

The above equation can also be expressed as:
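As a rough illustration of how a point forecast and a fitted distribution can be turned into an interval, the sketch below centers the interval on the point forecast and scales a standard normal quantile by the standard deviation implied by the fitted DF. This symmetric normal-quantile construction, the Rayleigh example, and the names `rayleigh_std` and `forecast_interval` are assumptions made for illustration; they are not the exact formulas of this study.

```python
import math
from statistics import NormalDist


def rayleigh_std(sigma):
    """Standard deviation of a Rayleigh distribution with scale parameter sigma."""
    return sigma * math.sqrt((4.0 - math.pi) / 2.0)


def forecast_interval(point_forecast, dist_std, alpha):
    """Symmetric interval around a point forecast at significance level alpha."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided standard normal quantile
    half_width = z * dist_std
    return point_forecast - half_width, point_forecast + half_width


# Hypothetical values: a point forecast of 5.0 and a Rayleigh fit with sigma = 2.0.
std_hat = rayleigh_std(2.0)
lower_80, upper_80 = forecast_interval(5.0, std_hat, alpha=0.20)
lower_60, upper_60 = forecast_interval(5.0, std_hat, alpha=0.40)
```

A larger α gives a narrower interval, which is the trade-off between width and coverage that interval metrics are designed to capture.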

A Framework of the Developed Hybrid Forecasting System

This section describes the details of the developed hybrid architecture framework, as shown in Figure 1. The framework consists of three modules: distribution function analysis, intelligent point prediction with environmental features, and interval forecasting.

FIGURE 1

Distribution Function Analysis Module

This module analyzes the characteristics of the raw epidemic data. First, the Weibull, Rayleigh, Lognormal, and Gamma distributions are introduced to fit the epidemic time series. To obtain the optimal parameter estimates, two estimation methods, namely maximum likelihood estimation (MLE) and a robust heuristic algorithm (SCA), are applied to each DF. Finally, the most suitable distribution function for each epidemic series is obtained by comparing the fitting ability of the eight resulting hybrid models (four DFs × two estimation methods).
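The fit-and-compare step can be sketched as follows for two of the four DFs, using closed-form MLE estimates. Measuring fitting ability by comparing the fitted CDF with the empirical CDF is an assumption made here for illustration, as are the function names and the synthetic sample; the study's own fitting target and data are not reproduced.

```python
import math
import random


def rayleigh_mle(xs):
    """Closed-form MLE of the Rayleigh scale parameter."""
    return math.sqrt(sum(x * x for x in xs) / (2.0 * len(xs)))


def rayleigh_cdf(x, sigma):
    return 1.0 - math.exp(-x * x / (2.0 * sigma ** 2))


def lognormal_mle(xs):
    """Closed-form MLE of the lognormal parameters (mu, sigma)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def fit_errors(xs, cdf):
    """MAE, RMSE and R2 between the empirical CDF and a fitted CDF."""
    xs = sorted(xs)
    n = len(xs)
    emp = [(i + 0.5) / n for i in range(n)]  # empirical CDF at the sorted points
    fit = [cdf(x) for x in xs]
    sq = sum((e - f) ** 2 for e, f in zip(emp, fit))
    mae = sum(abs(e - f) for e, f in zip(emp, fit)) / n
    rmse = math.sqrt(sq / n)
    mean_emp = sum(emp) / n
    r2 = 1.0 - sq / sum((e - mean_emp) ** 2 for e in emp)
    return mae, rmse, r2


# Synthetic positive-valued sample standing in for a normalized epidemic series.
rng = random.Random(1)
sample = [rng.lognormvariate(1.0, 0.5) for _ in range(500)]

sigma_r = rayleigh_mle(sample)
mu_l, sigma_l = lognormal_mle(sample)
err_rayleigh = fit_errors(sample, lambda x: rayleigh_cdf(x, sigma_r))
err_lognormal = fit_errors(sample, lambda x: lognormal_cdf(x, mu_l, sigma_l))
```

The DF with the smallest MAE and RMSE and the largest R2 would be retained; replacing the closed-form estimates with a metaheuristic search over the same error gives the SCA-based variants.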

Intelligent Point Prediction Module With Environmental Features

The volatility and non-linearity of new cases and new deaths of COVID-19 make modeling very difficult. A successful predictive model requires optimization as well as sufficient data to drive it. Previous studies have shown that some environmental variables are highly correlated with epidemic changes, especially nitrogen dioxide and temperature, which have a significant impact on the epidemic trend of COVID-19 (Bauwens et al., 2020; Shi et al., 2020; Wang et al., 2020; Travaglio et al., 2021). Thus, we took environmental features, such as nitrogen dioxide (NO2) and temperature, as inputs to construct a multivariable hybrid prediction model. To develop an intelligent point prediction model, we designed an LSSVM prediction model based on SCA optimization, namely, the hybrid SCA-LSSVM. Specifically, the SCA was introduced when training the LSSVM model, and the parameters (i.e., α, γ) of the LSSVM model were optimized by the SCA to achieve high-performance forecasting.
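A minimal sketch of a multivariable LSSVM regressor with tunable hyperparameters is given below. It is an illustration under stated assumptions rather than the study's implementation: the inputs are synthetic stand-ins for the four features, the tuned pair is taken to be the regularization parameter and the RBF width, and a small grid search replaces the SCA search to keep the example short.

```python
import math


def rbf(u, v, sigma):
    """RBF kernel value for two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def lssvm_fit(X, y, gamma, sigma):
    """Return (bias, multipliers) by solving the LSSVM linear system."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularization term on the diagonal
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(X_train, bias, coef, sigma, x):
    return bias + sum(c * rbf(xi, x, sigma) for c, xi in zip(coef, X_train))


def rmse(y, yhat):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y))


# Synthetic stand-ins for four inputs (e.g., lagged cases, lagged deaths,
# temperature, NO2); these are NOT the study's data.
X = [[math.sin(0.3 * t), math.cos(0.3 * t),
      0.5 * math.sin(0.15 * t), 0.5 * math.cos(0.15 * t)] for t in range(60)]
y = [2.0 * row[0] + row[2] for row in X]
cut = int(len(X) * 0.8)  # chronological 80/20 split
X_tr, y_tr, X_te, y_te = X[:cut], y[:cut], X[cut:], y[cut:]

# Grid search used here as a simple stand-in for the SCA hyperparameter search.
best = None
for gamma in (1.0, 10.0, 100.0):
    for sigma in (0.5, 1.0, 2.0):
        bias, coef = lssvm_fit(X_tr, y_tr, gamma, sigma)
        preds = [lssvm_predict(X_tr, bias, coef, sigma, x) for x in X_te]
        score = rmse(y_te, preds)
        if best is None or score < best[0]:
            best = (score, gamma, sigma)
best_rmse, best_gamma, best_sigma = best
```

In practice the hyperparameters would be selected on a validation portion of the training data, so that the test subset stays untouched until the final evaluation.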

Interval Forecasting Module

According to interval forecasting theory, interval prediction of the COVID-19 epidemic can be achieved based on the appropriate distribution function and point prediction values of COVID-19.

Data Description and Evaluation Criteria

Data Description

Prediction accuracy depends largely on the quality of the data and requires sufficient historical data. This study collected data from the open dataset Our World in Data (Coronavirus (COVID-19) Cases), which contains global daily data from the European Center for Disease Prevention and Control (ECDC). The significant fluctuations and non-stationarity inherent in the COVID-19 new case and death series make prediction challenging. To verify the performance of the model, we used new cases per 100 thousand of the population per day as one of the predictive variables:

$$\text{New cases per 100 thousand} = \frac{\text{daily new cases}}{\text{population}} \times 100{,}000 \tag{10}$$

The new deaths per 100 thousand of the population, calculated in the same way according to Equation (10), were also predicted based on available data.

The World Air Quality Index project (WAQI) (Covid-19 Worldwide Air Quality data) provides a dataset covering air quality for more than 130 countries, updated daily starting in the first quarter of 2020. The dataset contains the data of each air pollutant, i.e., CO, NO2, O3, SO2, PM10, and PM2.5, as well as meteorological data including humidity and temperature.

We focused on the three major countries that have been most strongly affected by COVID-19: the United States, India, and Brazil. The data of new cases and new deaths per 100 thousand of the population for the three countries, as well as the data of NO2 and temperature for the same period, were selected as input variables for the outbreak modeling. Notably, the first observation time (or start time) and the length of the time series are different for each country. Sample data from the United States were collected from February 29, 2020, to March 10, 2021. Sample data from India were collected from March 18, 2020 to March 10, 2021. Sample data from Brazil were collected from March 17, 2020, to March 10, 2021. Sample data were divided into two parts: a training subset and a testing subset. We used 80% of the total data as the training subset and the remaining 20% as the test subset.
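The per-100-thousand normalization and the chronological 80/20 split can be sketched as follows. The daily counts and the population figure are hypothetical values chosen for illustration, and the helper names `per_100k` and `chronological_split` are not from the study.

```python
def per_100k(daily_counts, population):
    """Convert raw daily counts to counts per 100 thousand of the population."""
    return [c / population * 100_000 for c in daily_counts]


def chronological_split(series, train_fraction=0.8):
    """Split a time series into training and testing parts without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical numbers for illustration only (not taken from the study's data).
new_cases = [1200, 1500, 900, 2000, 1800, 2100, 1700, 1600, 1900, 2200]
rate = per_100k(new_cases, population=50_000_000)
train, test = chronological_split(rate)
```

Keeping the observations in time order means the test subset always lies after the training subset, which matches how a forecast would be used.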

Evaluation Criteria

This study considered eight evaluation criteria to evaluate the model’s performance, as shown in Table 2. Specifically, the MAE, RMSE, and R2 were chosen as error criteria to determine the fitting level of the DFs. The MAE, RMSE, MAPE, TIC, IA, and R2 were used to reflect the prediction performance of the point forecasting models. The IFAW and IFCP were used to measure the validity of the interval prediction.

TABLE 2

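| Metric | Equation | Definition |
| --- | --- | --- |
| MAE | $\frac{1}{N}\sum_{n=1}^{N} \lvert y_n - \hat{y}_n \rvert$ | The mean absolute error of the N forecast results |
| RMSE | $\sqrt{\frac{1}{N}\sum_{n=1}^{N} (y_n - \hat{y}_n)^2}$ | The root-mean-square forecast error |
| MAPE | $\frac{1}{N}\sum_{n=1}^{N} \left\lvert \frac{y_n - \hat{y}_n}{y_n} \right\rvert \times 100\%$ | The mean absolute percentage error |
| TIC | $\frac{\sqrt{\frac{1}{N}\sum_{n=1}^{N} (y_n - \hat{y}_n)^2}}{\sqrt{\frac{1}{N}\sum_{n=1}^{N} y_n^2} + \sqrt{\frac{1}{N}\sum_{n=1}^{N} \hat{y}_n^2}}$ | Theil’s inequality coefficient |
| IA | $1 - \frac{\sum_{n=1}^{N} (\hat{y}_n - y_n)^2}{\sum_{n=1}^{N} (\lvert \hat{y}_n - \bar{y} \rvert + \lvert y_n - \bar{y} \rvert)^2}$ | The index of agreement of forecasting results ($\bar{y}$ is the mean of the actual values) |
| R2 | $1 - \frac{\sum_{n=1}^{N} (y_n - \hat{y}_n)^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2}$ | Coefficient of determination |
| IFAW | $\frac{1}{N}\sum_{n=1}^{N} (U_n - L_n)$ | Interval forecasting average width |
| IFCP | $\frac{1}{N}\sum_{n=1}^{N} b_n \times 100\%$ | Interval forecasting coverage probability |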

Eight evaluation rules.

Here y_n and ŷ_n represent the actual and predicted values at time n, respectively. N denotes the sample size. L_n and U_n are the lower and upper bounds of the interval forecast, and b_n is a Boolean value equal to 1 if y_n lies within [L_n, U_n] and 0 otherwise.
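The eight criteria can be computed with a short sketch such as the one below. The formulas are the commonly used definitions of these metrics, assumed here to match those of Table 2; the function names and the toy numbers are illustrative only.

```python
import math


def point_metrics(y, yhat):
    """Point-forecast criteria for actual values y and predictions yhat."""
    n = len(y)
    err = [p - a for a, p in zip(y, yhat)]
    sq = sum(e * e for e in err)
    ybar = sum(y) / n
    rmse = math.sqrt(sq / n)
    return {
        "MAE": sum(abs(e) for e in err) / n,
        "RMSE": rmse,
        # MAPE is undefined when an actual value is zero.
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(err, y)) / n,
        "TIC": rmse / (math.sqrt(sum(a * a for a in y) / n)
                       + math.sqrt(sum(p * p for p in yhat) / n)),
        "IA": 1.0 - sq / sum((abs(p - ybar) + abs(a - ybar)) ** 2
                             for a, p in zip(y, yhat)),
        "R2": 1.0 - sq / sum((a - ybar) ** 2 for a in y),
    }


def interval_metrics(y, lower, upper):
    """Average width and coverage (in percent) of forecast intervals."""
    n = len(y)
    covered = sum(1 for a, lo, hi in zip(y, lower, upper) if lo <= a <= hi)
    return {
        "IFAW": sum(hi - lo for lo, hi in zip(lower, upper)) / n,
        "IFCP": 100.0 * covered / n,
    }


# Toy values for illustration only.
actual = [10, 12, 14, 16]
predicted = [11, 12, 13, 17]
point = point_metrics(actual, predicted)
interval = interval_metrics(actual, [9, 11, 12, 16.5], [12, 13, 14, 18])
```

Because R2 is computed as one minus the ratio of residual to total sum of squares, it becomes negative whenever a model forecasts worse than the mean of the actual values.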

Experimental Results and Analysis

In this section, we conduct three experiments (Experiment I: DFs of COVID-19 cases; Experiment II: point prediction of COVID-19 cases; Experiment III: interval prediction of COVID-19 cases) to illustrate that the proposed hybrid system can effectively analyze the deterministic and uncertain information of COVID-19. Specifically, Experiment I used four probability DFs (Weibull, Rayleigh, Lognormal, and Gamma) to fit the distribution of epidemic cases. The parameters of the four probability DFs were optimized using the SCA. In Experiment II, a hybrid model with environmental features, TN-SCA-LSSVM, was proposed for the point prediction of new cases and deaths from COVID-19. Three countries were selected as experimental cases, and the proposed model was compared with benchmark models to verify its prediction accuracy. Five benchmark models, namely, ARIMA, back propagation neural network (BPNN), general regression neural network (GRNN), LSSVM, and SCA-LSSVM, were introduced for this comparison. Experiment III calculated the interval prediction of new cases and new deaths in three countries based on the best distribution function determined in Experiment I and the point prediction results with the highest accuracy in Experiment II. Details are shown in the following sections.

Experiment I: Distribution Functions of COVID-19 Cases

To obtain the characteristics of the COVID-19 series and determine the optimal distribution function, four DFs (Weibull, Rayleigh, Lognormal, and Gamma) were fitted to the new COVID-19 cases and deaths. Parameter estimation of the DFs was an essential step. Traditionally, the MLE method is used for parameter estimation of DFs. This study instead employed the SCA to optimize the relevant parameters, and MLE was used as a comparison method to illustrate the optimization performance of the SCA. Table 3 shows the estimated parameters of the different DFs determined by the MLE and SCA methods. To select the optimal DFs, the MAE, RMSE, and R2 were chosen as error criteria to determine the fitting level of these DFs. Table 4 shows the error values for the different distributions of new cases and new deaths in the three countries, and the bold values are the optimal results. For all four DFs and all six datasets, the R2 obtained with the SCA was larger than that obtained with the MLE method, with the largest gain for the Weibull fit of new cases in the United States (0.9184 versus 0.5553). The MAE and RMSE values obtained with the SCA were also smaller than those of the MLE method in all but one case (the MAE of the Rayleigh fit for new deaths in Brazil, 0.0323 versus 0.0317). Thus, the SCA used in this paper had better optimization performance and reproduced the distribution of the epidemic data more closely.

TABLE 3

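| Countries | Types of cases | Methods | Lognormal μ | Lognormal σ | Gamma θ | Gamma k | Weibull λ | Weibull k | Rayleigh σ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| United States | New cases | MLE | 1.5859 | 4.8799 | 235.4008 | 0.9975 | 232.5667 | 1.1314 | 221.8384 |
| United States | New cases | SCA | 5.0481 | 0.8191 | 162.3509 | 1.3498 | 238.4217 | 1.0433 | 135.1476 |
| United States | New deaths | MLE | 1.4159 | 0.9973 | 2.0827 | 2.1522 | 4.5237 | 1.2630 | 3.6913 |
| United States | New deaths | SCA | 1.2491 | 0.7112 | 2.1612 | 1.9676 | 4.5382 | 1.4115 | 3.0889 |
| Brazil | New cases | MLE | 1.2901 | 4.5723 | 117.6314 | 1.2773 | 160.8250 | 1.3241 | 126.2440 |
| Brazil | New cases | SCA | 4.8261 | 0.6909 | 70.6131 | 2.2244 | 176.4688 | 1.4984 | 122.9704 |
| Brazil | New deaths | MLE | 1.1251 | 0.9589 | 2.1763 | 1.6661 | 3.9725 | 1.5907 | 2.9499 |
| Brazil | New deaths | SCA | 1.2175 | 0.5933 | 1.5846 | 2.4602 | 4.2921 | 1.7905 | 2.9930 |
| India | New cases | MLE | 1.6692 | 2.4204 | 27.6680 | 0.8291 | 22.3740 | 0.9368 | 21.4061 |
| India | New cases | SCA | 2.7967 | 1.0588 | 27.4867 | 0.9043 | 24.9569 | 0.9138 | 15.6776 |
| India | New deaths | MLE | 1.3955 | −1.7107 | 0.3206 | 1.0024 | 0.3273 | 1.0529 | 0.2944 |
| India | New deaths | SCA | −1.4439 | 1.1266 | 0.2868 | 1.0798 | 0.3507 | 0.9885 | 0.2440 |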

The parameters values of the different distribution functions are determined by MLE and SCA.

TABLE 4

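| Countries | Types of cases | Criteria | Lognormal MLE | Lognormal SCA | Gamma MLE | Gamma SCA | Weibull MLE | Weibull SCA | Rayleigh MLE | Rayleigh SCA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| United States | New cases | MAE | 0.0839 | 0.0235 | 0.0441 | 0.0375 | 0.1634 | 0.0618 | 0.0420 | 0.0403 |
| United States | New cases | RMSE | 0.1023 | 0.0351 | 0.0540 | 0.0447 | 0.1930 | 0.0827 | 0.0505 | 0.0476 |
| United States | New cases | R2 | 0.8750 | 0.9853 | 0.9652 | 0.9761 | 0.5553 | 0.9184 | 0.9696 | 0.9730 |
| United States | New deaths | MAE | 0.0973 | 0.0165 | 0.0287 | 0.0148 | 0.0280 | 0.0214 | 0.0816 | 0.0472 |
| United States | New deaths | RMSE | 0.1140 | 0.0217 | 0.0349 | 0.0181 | 0.0375 | 0.0245 | 0.0949 | 0.0533 |
| United States | New deaths | R2 | 0.8455 | 0.9944 | 0.9855 | 0.9961 | 0.9833 | 0.9929 | 0.8930 | 0.9662 |
| Brazil | New cases | MAE | 0.0930 | 0.0526 | 0.0611 | 0.0327 | 0.0465 | 0.0223 | 0.0503 | 0.0491 |
| Brazil | New cases | RMSE | 0.1084 | 0.0587 | 0.0695 | 0.0392 | 0.0513 | 0.0298 | 0.0585 | 0.0572 |
| Brazil | New cases | R2 | 0.8591 | 0.9587 | 0.9421 | 0.9816 | 0.9684 | 0.9893 | 0.9590 | 0.9608 |
| Brazil | New deaths | MAE | 0.0917 | 0.0386 | 0.0611 | 0.0284 | 0.0424 | 0.0237 | 0.0317 | 0.0323 |
| Brazil | New deaths | RMSE | 0.1069 | 0.0486 | 0.0686 | 0.0354 | 0.0464 | 0.0296 | 0.0375 | 0.0368 |
| Brazil | New deaths | R2 | 0.8662 | 0.9724 | 0.9449 | 0.9853 | 0.9747 | 0.9897 | 0.9836 | 0.9842 |
| India | New cases | MAE | 0.0734 | 0.0385 | 0.0353 | 0.0226 | 0.0315 | 0.0232 | 0.1281 | 0.1213 |
| India | New cases | RMSE | 0.0853 | 0.0474 | 0.0408 | 0.0269 | 0.0368 | 0.0279 | 0.1595 | 0.1322 |
| India | New cases | R2 | 0.9131 | 0.9732 | 0.9801 | 0.9913 | 0.9838 | 0.9907 | 0.6962 | 0.7912 |
| India | New deaths | MAE | 0.0565 | 0.0396 | 0.0380 | 0.0325 | 0.0322 | 0.0231 | 0.1140 | 0.1010 |
| India | New deaths | RMSE | 0.0673 | 0.0482 | 0.0441 | 0.0380 | 0.0367 | 0.0291 | 0.1366 | 0.1189 |
| India | New deaths | R2 | 0.9452 | 0.9719 | 0.9765 | 0.9825 | 0.9838 | 0.9897 | 0.7743 | 0.8290 |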

The criteria values of different distribution functions of six datasets.

The bold values present the optimal results.

Furthermore, among the four DFs optimized by SCA, SCA-Lognormal achieved the best fit only for the new cases in the United States. SCA-Gamma achieved the best fit for both the new deaths in the United States and the new cases in India. SCA-Weibull achieved the best fit for the new cases and new deaths in Brazil and for the new deaths in India.

Experiment II: Intelligent Point Prediction for COVID-19 Cases

In this experiment, an intelligent hybrid prediction model coupled with environmental variables (TN-SCA-LSSVM) was used to perform a point prediction analysis of new cases and new deaths in three countries. The new cases and new deaths of COVID-19 and the environmental variables (temperature and NO2) were taken as inputs of the multivariable point prediction. Thus, the number of inputs of the LSSVM was set to 4. To evaluate the predictive advantages of the proposed hybrid model, five univariate approaches, namely, ARIMA, BPNN, GRNN, LSSVM, and SCA-LSSVM, were selected as benchmark models for comparison. In addition, six evaluation criteria (MAE, RMSE, MAPE, TIC, IA, and R2) were used to reflect the prediction performance of the models; the results are shown in Tables 5 and 6. The values marked in bold indicate the best value for each evaluation metric, and the optimal point prediction model is selected accordingly. Figure 2 compares the observed values with the predictions of the proposed model and the benchmark models. Further discussion of the experimental results follows.

TABLE 5

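| Countries | Criteria | ARIMA | BPNN | GRNN | LSSVM | SCA-LSSVM | TN-SCA-LSSVM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| United States | MAE | 103.3365 | 72.3343 | 65.6524 | 60.5696 | 58.4851 | 55.9566 |
| United States | RMSE | 125.7786 | 103.3695 | 102.0531 | 88.3619 | 88.5528 | 85.2538 |
| United States | MAPE (%) | 39.1424 | 17.6103 | 15.6919 | 14.1000 | 13.8393 | 12.9726 |
| United States | TIC | 0.1337 | 0.1136 | 0.1131 | 0.0927 | 0.0927 | 0.0951 |
| United States | IA | 0.8571 | 0.9261 | 0.9250 | 0.9524 | 0.9524 | 0.9533 |
| United States | R2 | 0.6493 | 0.7631 | 0.7710 | 0.8269 | 0.8402 | 0.8262 |
| Brazil | MAE | 147.5017 | 80.0527 | 78.6863 | 84.3194 | 59.0297 | 58.6014 |
| Brazil | RMSE | 191.1310 | 101.7826 | 103.0591 | 105.4444 | 75.1114 | 73.9705 |
| Brazil | MAPE (%) | 65.3496 | 51.0333 | 53.7500 | 51.2592 | 28.9399 | 28.0350 |
| Brazil | TIC | 0.3616 | 0.2230 | 0.2266 | 0.2369 | 0.1521 | 0.1478 |
| Brazil | IA | 0.3509 | 0.6131 | 0.6225 | 0.5957 | 0.7050 | 0.7084 |
| Brazil | R2 | −4.2445 | −0.4873 | −0.5248 | −0.5962 | 0.1901 | 0.2145 |
| India | MAE | 3.6180 | 2.8170 | 2.3159 | 1.9432 | 1.8260 | 1.7828 |
| India | RMSE | 4.3580 | 5.4566 | 3.3962 | 3.1049 | 3.2072 | 3.1651 |
| India | MAPE (%) | 36.4135 | 18.8504 | 17.5906 | 15.2222 | 14.5030 | 14.3134 |
| India | TIC | 0.1697 | 0.2183 | 0.1406 | 0.1320 | 0.1370 | 0.1358 |
| India | IA | 0.5170 | 0.5111 | 0.6722 | 0.7120 | 0.7283 | 0.7308 |
| India | R2 | −0.4228 | −1.2524 | 0.1275 | 0.2778 | 0.2219 | 0.2422 |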

The comparative forecasting error of different models for COVID-19 new cases.

TABLE 6

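| Countries | Criteria | ARIMA | BPNN | GRNN | LSSVM | SCA-LSSVM | TN-SCA-LSSVM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| United States | MAE | 3.9138 | 1.7581 | 1.8688 | 1.7476 | 1.7006 | 1.6252 |
| United States | RMSE | 4.8346 | 2.1134 | 2.2823 | 2.2780 | 2.1507 | 2.0040 |
| United States | MAPE (%) | 55.9991 | 25.2980 | 27.9532 | 26.1569 | 25.3390 | 24.3988 |
| United States | TIC | 0.2922 | 0.1289 | 0.1434 | 0.1387 | 0.1321 | 0.1262 |
| United States | IA | 0.5075 | 0.8354 | 0.7983 | 0.8166 | 0.8376 | 0.8470 |
| United States | R2 | −1.4148 | 0.5386 | 0.4618 | 0.4639 | 0.5221 | 0.5821 |
| Brazil | MAE | 2.0207 | 1.5995 | 1.5551 | 1.4997 | 1.2921 | 1.1995 |
| Brazil | RMSE | 2.4866 | 2.3221 | 2.3352 | 2.2038 | 1.7577 | 1.4880 |
| Brazil | MAPE (%) | 47.3042 | 48.2826 | 46.8319 | 44.1043 | 33.4255 | 26.4318 |
| Brazil | TIC | 0.2277 | 0.2232 | 0.2301 | 0.2185 | 0.1694 | 0.1381 |
| Brazil | IA | 0.3561 | 0.5578 | 0.5851 | 0.6269 | 0.7256 | 0.7876 |
| Brazil | R2 | −0.5565 | −0.3574 | −0.3727 | −0.2225 | 0.2223 | 0.4427 |
| India | MAE | 0.1176 | 0.0591 | 0.0330 | 0.0261 | 0.0227 | 0.0251 |
| India | RMSE | 0.1235 | 0.1428 | 0.0471 | 0.0396 | 0.0341 | 0.0400 |
| India | MAPE (%) | 144.3315 | 43.5405 | 23.6569 | 18.5391 | 17.6502 | 17.7402 |
| India | TIC | 0.3678 | 0.4568 | 0.1844 | 0.1582 | 0.1395 | 0.1583 |
| India | IA | 0.3677 | 0.4284 | 0.8035 | 0.8538 | 0.8727 | 0.8557 |
| India | R2 | −4.3463 | −6.5096 | 0.1844 | 0.4236 | 0.5931 | 0.4409 |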

The comparative forecasting error of different models for COVID-19 new death cases.

FIGURE 2

From Table 5, we can draw the following conclusions:

Among the single models (ARIMA, BPNN, GRNN, and LSSVM), Table 5 and Figure 3 show that LSSVM was the most accurate for the United States and India, with the lowest MAE, RMSE, MAPE, and TIC and the highest IA and R2. For instance, the MAPE values of new cases predicted by ARIMA, BPNN, GRNN, and LSSVM in the United States were 39.1424, 17.6103, 15.6919, and 14.1000%, respectively. In India, the corresponding MAPE values were 36.4135, 18.8504, 17.5906, and 15.2222%, respectively. In Brazil, BPNN, GRNN, and LSSVM performed comparably, with MAPE values of 51.0333, 53.7500, and 51.2592%, respectively, all well below the 65.3496% of ARIMA.

FIGURE 3

The proposed hybrid model with environmental features showed stronger predictive performance than the other models on most criteria. For example, in the United States, compared with LSSVM and SCA-LSSVM, TN-SCA-LSSVM led to 7.6160 and 4.3233% reductions in MAE, 3.5175 and 3.7255% reductions in RMSE, and 7.9957 and 6.2626% reductions in MAPE, respectively. In Brazil, compared with LSSVM and SCA-LSSVM, TN-SCA-LSSVM led to 30.5007 and 0.7256% reductions in MAE, 29.8488 and 1.5190% reductions in RMSE, and 45.3074 and 3.1267% reductions in MAPE, respectively. In India, based on the values in Table 5, the reductions relative to LSSVM and SCA-LSSVM were 8.2544 and 2.3658% in MAE and 5.9702 and 1.3073% in MAPE, respectively, whereas RMSE decreased by 1.3127% relative to SCA-LSSVM but increased by 1.9389% relative to LSSVM. Overall, the proposed hybrid multivariable model achieved the lowest MAE and MAPE and the highest IA for new cases in all three countries, although LSSVM or SCA-LSSVM retained a slight edge on some criteria (TIC and R2 in the United States; RMSE, TIC, and R2 in India).
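The percentage reductions quoted here are relative improvements over a baseline model, computed as (baseline − proposed) / baseline × 100. The short check below reproduces the two MAE reductions for the United States from the values in Table 5; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(baseline, proposed):
    """Relative reduction of an error metric, in percent of the baseline value."""
    return (baseline - proposed) / baseline * 100.0


# MAE of new cases in the United States, taken from Table 5.
mae_lssvm, mae_sca_lssvm, mae_tn_sca_lssvm = 60.5696, 58.4851, 55.9566

reduction_vs_lssvm = percent_reduction(mae_lssvm, mae_tn_sca_lssvm)
reduction_vs_sca_lssvm = percent_reduction(mae_sca_lssvm, mae_tn_sca_lssvm)
```

A negative result from this formula means that the error metric increased rather than decreased relative to the baseline.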

From Table 6, we can draw the following conclusions:

It can be seen from Table 6 and Figure 3 that the proposed TN-SCA-LSSVM showed stronger predictive performance than ARIMA, BPNN, GRNN, LSSVM, and SCA-LSSVM for new deaths in the United States and Brazil, where it achieved the best value on all six criteria (MAE, RMSE, MAPE, TIC, IA, and R2). In Brazil, for example, the MAPE fell from 44.1043% (LSSVM) and 33.4255% (SCA-LSSVM) to 26.4318%. In India, the univariate SCA-LSSVM performed best on all six criteria (for example, a MAPE of 17.6502% versus 17.7402% for TN-SCA-LSSVM), so adding the environmental variables did not improve the forecasts for that dataset. Among the single models, LSSVM was the most accurate for Brazil and India, while BPNN and LSSVM were comparable for the United States.

Remark

The proposed hybrid multivariable model with environmental features had strong prediction ability and effectively addressed the complexity and non-linearity of new cases and new deaths. The optimization method played an essential role in improving the prediction accuracy of the hybrid model. Results indicated that the SCA significantly improved the prediction performance of the LSSVM. In addition, the forecasting model with environmental variables further improved the prediction ability of the hybrid model.

Experiment III: Interval Forecasting of COVID-19 Cases

In Experiment III, based on the interval forecasting theory discussed in Section “Interval Forecasting Module,” the interval prediction of new cases and new deaths in three countries was calculated by incorporating the optimal distribution function determined in Section “Experiment I: Distribution Functions of COVID-19 Cases” and the point prediction results with the highest accuracy in Section “Experiment II: Intelligent Point Prediction for COVID-19 Cases.” In addition, two metrics listed in Table 1, the coverage probability (IFCP) and the average width (IFAW), were used to measure the validity of the interval prediction. At a given significance level α, a larger IFCP value (0 ≤ IFCP ≤ 100%) combined with a smaller IFAW value indicates better interval prediction performance. Table 7 shows the United States, India, and Brazil interval prediction results under five different significance levels (0.20, 0.25, 0.30, 0.35, and 0.40). From Table 7, it can be observed that the values of IFCP and IFAW varied across the significance levels. For example, when α was 0.3, the IFCP and IFAW of COVID-19 new cases in the United States were 100.00% and 372.9357, respectively; when α was 0.35, they were 100.00% and 270.2132, respectively.
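The two interval metrics can be sketched as follows, assuming the usual definitions: coverage is the percentage of observations that fall inside their prediction interval, and average width is the mean distance between the upper and lower bounds. The exact formulas are those given in Table 1 and may include a normalization that is not shown here; the function name `interval_metrics` and the data are illustrative only.

```python
import numpy as np


def interval_metrics(actual, lower, upper):
    """Return (coverage in percent, average interval width) for prediction intervals."""
    actual = np.asarray(actual, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if np.any(lower > upper):
        raise ValueError("each lower bound must not exceed its upper bound")

    covered = (actual >= lower) & (actual <= upper)
    coverage = float(100.0 * covered.mean())  # share of observations inside the interval
    avg_width = float(np.mean(upper - lower))  # mean distance between the bounds
    return coverage, avg_width


# Illustrative (made-up) observations and point forecasts.
actual = np.array([10.0, 12.0, 15.0, 11.0])
point = np.array([11.0, 12.0, 13.0, 11.0])

wide = interval_metrics(actual, point - 3.0, point + 3.0)
narrow = interval_metrics(actual, point - 1.0, point + 1.0)
print(wide, narrow)
```

The wide interval covers every observation at the cost of a larger width, while the narrow one misses the third observation; this is the trade-off that the two criteria are meant to balance.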

TABLE 7

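| Countries | Types of cases | Criteria | α = 0.2 | α = 0.25 | α = 0.3 | α = 0.35 | α = 0.4 | α = 0.45 |
|---|---|---|---|---|---|---|---|---|
| United States | New cases | IFCP | 100.00% | 100.00% | 100.00% | 100.00% | 98.65% | 87.84% |
| | | IFAW | 627.5977 | 489.2653 | 372.9357 | 270.2132 | 176.0058 | 86.8294 |
| | New death cases | IFCP | 100.00% | 100.00% | 98.65% | 93.24% | 90.54% | 66.22% |
| | | IFAW | 9.9036 | 7.7723 | 5.9529 | 4.3280 | 2.8256 | 1.3958 |
| Brazil | New cases | IFCP | 100.00% | 98.59% | 98.59% | 92.96% | 81.69% | 59.15% |
| | | IFAW | 300.3756 | 236.0082 | 180.9128 | 131.6092 | 85.9563 | 42.4711 |
| | New death cases | IFCP | 100.00% | 100.00% | 98.59% | 97.18% | 87.32% | 59.15% |
| | | IFAW | 5.5115 | 4.3527 | 3.3489 | 2.4426 | 1.5981 | 0.7904 |
| India | New cases | IFCP | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 85.71% |
| | | IFAW | 19.1788 | 14.8640 | 11.2819 | 8.1496 | 5.2975 | 2.6103 |
| | New death cases | IFCP | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 84.29% |
| | | IFAW | 0.2010 | 0.1548 | 0.1170 | 0.0842 | 0.0546 | 0.0269 |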

The interval prediction results under five different significance levels of COVID-19 cases.

To present the interval prediction results more visually, the results for COVID-19 cases at four significance levels (0.25, 0.3, 0.35, and 0.4) are plotted in Figure 4. Figure 4 contains six subplots showing the interval prediction results of new cases and new deaths for each of the three countries. The dots represent the actual values, and the color depth of the shaded area indicates the range of interval forecasting at different significance levels. As Table 7 indicates, when a larger significance level is selected, the interval is narrower and individual actual values fall outside the corresponding shaded area. When the significance level is small, the shaded area covers all the actual values well, but the prediction interval becomes wide and loses practical significance.

FIGURE 4

Discussion

The proposed point and interval forecasting approach with environmental variables obtained better prediction results than other comparable models. The reasons are as follows. First, the optimal DFs and their parameters that best fit the epidemic data of different countries were obtained by SCA. Second, the hybrid model SCA-LSSVM had a strong prediction ability and effectively addressed the complexity and non-linearity of new cases and new deaths. Third, the addition of environmental variables further improved the prediction ability of the hybrid model. Finally, interval forecasting was calculated based on the optimal DFs and point prediction results to capture uncertainty information for decision-making.
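As a rough illustration of the second and third points, the sketch below fits a least squares support vector machine for regression by solving its standard linear system with an RBF kernel, using several input features at once. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the hyperparameter values `gamma` and `sigma`, and the synthetic three-column input (standing in for a lagged case count, temperature, and NO2) are all hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def lssvm_fit(X, y, gamma, sigma):
    """Solve the LSSVM regression system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = X.shape[0]
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]  # bias b, dual weights alpha


def lssvm_predict(X_train, bias, alpha, X_new, sigma):
    """Evaluate the fitted model: f(x) = sum_i alpha_i k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + bias


# Synthetic inputs scaled to [0, 1]; the columns only mimic
# (lagged cases, temperature, NO2) and carry no real data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 3))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.5 * X[:, 1] - 0.3 * X[:, 2]

bias, alpha = lssvm_fit(X, y, gamma=1e3, sigma=0.5)
fitted = lssvm_predict(X, bias, alpha, X, sigma=0.5)
print(float(np.sqrt(np.mean((fitted - y) ** 2))))  # training RMSE
```

In a hybrid scheme, an optimizer such as SCA would choose `gamma` and `sigma` by minimizing a validation error rather than fixing them by hand as done here.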

Notably, because the interval prediction results were calculated from the point prediction results, the interval prediction performance depends mainly on the accuracy of the point prediction. In addition, a suitable significance level needs to be selected according to the actual situation in practical applications. In conclusion, the interval forecasting model proposed in this study could provide uncertainty information about future epidemic development and could be combined with the accurate deterministic information provided by the point prediction hybrid model in Experiment II. It could provide public health decision-makers with rich information for epidemic prevention and control decisions.

In practice, the proposed model could be driven by real-time data to dynamically and continuously optimize the model parameters by updating the data daily, making the model adaptable to complex epidemic scenarios that are non-linear, dynamic, and ambiguous. At the same time, this data-driven prediction would also help to establish a predictable safeguard mechanism, leaving a window of time for relevant decision-making departments to take measures and adjust strategies in advance to avoid the continuous spread of the epidemic.
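The daily updating described above corresponds to a walk-forward (rolling-origin) scheme: each day the model is re-fitted on all data observed so far and then asked for a one-step-ahead forecast. The sketch below shows only that loop; a simple least-squares autoregressive model is used as a stand-in for the proposed hybrid model, and all function names and parameter values are assumptions made for illustration.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares AR(p) model with intercept (a stand-in for the real forecaster)."""
    y = np.asarray(series, dtype=float)
    # Each row holds the p most recent values, newest first, followed by a constant 1.
    rows = np.array([y[t - p:t][::-1] for t in range(p, len(y))])
    design = np.column_stack([rows, np.ones(len(rows))])
    coef, *_ = np.linalg.lstsq(design, y[p:], rcond=None)
    return coef


def predict_next(series, coef):
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    lags = np.asarray(series[-p:], dtype=float)[::-1]
    return float(lags @ coef[:-1] + coef[-1])


def walk_forward(series, p=2, start=10):
    """Re-fit on all data up to each day, then forecast the following day."""
    forecasts = []
    for t in range(start, len(series)):
        history = series[:t]           # data available at decision time
        coef = fit_ar(history, p)      # parameters are re-estimated every day
        forecasts.append(predict_next(history, coef))
    return np.array(forecasts)


# Synthetic daily series, for illustration only.
days = np.arange(60)
series = 100.0 + 0.8 * days + 5.0 * np.sin(days / 4.0)
forecasts = walk_forward(series, p=2, start=20)
print(forecasts[:3], series[20:23])
```

Re-fitting from scratch each day is the simplest choice; for a costly model, the previous day's parameters could serve as a warm start for the optimizer instead.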

Conclusion

This study presented a novel point and interval forecasting approach with environmental variables, which was composed of a distribution function analysis module, an intelligent point prediction module, and an interval forecasting module. In the distribution function analysis module, according to the results of the MAE, RMSE, and R2, SCA-Lognormal achieved optimal simulation capability for the new cases in the United States, while SCA-Gamma achieved optimal simulation performance for both the new deaths in the United States and the new cases in India. SCA-Weibull obtained optimal simulation ability for new cases and new deaths in Brazil and new deaths in India. In the intelligent point prediction module, according to the MAE, RMSE, MAPE, IA, DA, and R2, the hybrid multivariate model TN-SCA-LSSVM achieved more robust predictive performance than other univariate approaches, such as ARIMA, BPNN, GRNN, LSSVM, and SCA-LSSVM, which indicated that SCA significantly improved the prediction performance of LSSVM and that the addition of environmental features (temperature and NO2) further improved the prediction ability of the hybrid model. For instance, the average MAPE values of the proposed TN-SCA-LSSVM model were 62.1521, 33.9225, 27.5146, 18.3956, and 5.8034% lower than those of ARIMA, BPNN, GRNN, LSSVM, and SCA-LSSVM, respectively. In the interval forecasting module, interval prediction results for new cases and new deaths of COVID-19 in the three countries were obtained based on the optimal DFs and the point prediction values of the proposed hybrid TN-SCA-LSSVM model. The results showed that the performance of interval prediction was excellent because most of the observed values were located in the shaded area, with higher values of IFCP and smaller values of IFAW at different significance levels. Overall, the proposed system achieved better prediction results than other comparable models and enabled the informative and practical quantification of future COVID-19 pandemic trends, which offers more constructive suggestions for governmental administrators and the general public.

In this study, epidemiological data and two environmental variables were considered inputs for point and interval prediction models. However, predicting COVID-19 is a complex problem related to multiple factors, such as meteorological, environmental, socioeconomic or policy factors. Thus, the forecasting model can be improved by incorporating more influencing factors from different data sources, which may be an interesting research pursuit.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Statements

Data availability statement

The original datasets used in the study are included in the article; further inquiries can be directed to the corresponding author.

Author contributions

ZQ: writing, conceptualization, and methodology. YS: writing-reviewing and editing. QX: formal analysis. YL: data curation and visualization. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 72004086).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

  • ARIMA: auto regressive integrated moving average model
  • BPNN: back propagation neural network
  • GRNN: general regression neural network
  • ANFIS: Adaptive Neuro-Fuzzy Inference System
  • LSSVM: least square support vector machine
  • SCA: sine cosine algorithm
  • DFs: distribution functions
  • MLE: maximum likelihood estimation
  • TN-SCA-LSSVM: SCA-LSSVM with NO2 and temperature
  • ECDC: European Center for Disease Prevention and Control
  • WAQI: World Air Quality Index project


Summary

Keywords

COVID-19, point forecasting, interval forecasting, artificial intelligence, environmental variables

Citation

Qu Z, Sha Y, Xu Q and Li Y (2022) Forecasting New COVID-19 Cases and Deaths Based on an Intelligent Point and Interval System Coupled With Environmental Variables. Front. Ecol. Evol. 10:875000. doi: 10.3389/fevo.2022.875000

Received

13 February 2022

Accepted

25 March 2022

Published

02 May 2022

Volume

10 - 2022

Edited by

Yan Hao, Shandong Normal University, China

Reviewed by

Yunxuan Dong, University of Macau, China; Ling Xiao, Xuzhou University of Technology, China

Copyright

*Correspondence: Yongzhong Sha,

This article was submitted to Environmental Informatics and Remote Sensing, a section of the journal Frontiers in Ecology and Evolution
