Global Forecasting Confirmed and Fatal Cases of COVID-19 Outbreak Using Autoregressive Integrated Moving Average Model

The World Health Organization (WHO) formally declared the novel coronavirus disease, COVID-19, a worldwide pandemic on March 11 2020. COVID-19 was first identified in Wuhan city, China, in December 2019, and has since spread to more than 198 countries. As the cities around China started getting contaminated, the number of cases increased exponentially. As of March 18 2020, the number of confirmed cases worldwide was more than 250,000, and Asia alone had more than 81,000 cases. The proposed approach uses time series analysis, specifically an autoregressive integrated moving average (ARIMA) model, to forecast the spread of COVID-19 around the world in the upcoming days. We analyze data from February 1 2020 to April 1 2020. The result shows that 120,000 confirmed fatal cases are forecasted by the ARIMA model by April 1 2020. Moreover, we have also evaluated the total confirmed cases, the total fatal cases, the autocorrelation function, and the white noise time series for both confirmed cases and fatalities in the COVID-19 outbreak.


INTRODUCTION
The first case of the virus came to light in Wuhan city, China, in November 2019. The population of Wuhan city is nearly 11 million, and it connects to many major cities in China. The number of cases grew to dozens and then hundreds by the end of December. Medical experts first suspected that it was viral pneumonia, which could not be cured with conventional medicines. Ever since the virus first started to infect people, it has continued to spread and has affected thousands of people (1). Further, every patient infected with this virus was infecting two or three other people. Until December 30 2019, no information was released from China regarding the deadly virus. Finally, on December 31 2019, officials from the public health department of China informed the World Health Organization about the medical issue that affected people in Hubei Province, China. The infection was described as a pneumonia-like illness in humans caused by a coronavirus, a member of a large family of pathogens. Coronaviruses are known to spread among people, mice, birds, bats, livestock, and other wild animals (2)(3)(4). In December 2019, the WHO was alerted by China to certain occurrences of a respiratory infection associated with people who had visited the seafood market in Wuhan city (5). Wuhan experienced the spread of a coronavirus disease, called Coronavirus Disease 2019 (COVID-19). In (6), the author presumed that COVID-19 likely originated in bats, since it most closely resembles two bat-derived coronavirus strains. However, the origin of COVID-19 has not yet been confirmed, and it requires more investigation. In 2003 and 2012, respectively, the Severe Acute Respiratory Syndrome (SARS) coronavirus and the Middle East Respiratory Syndrome (MERS) coronavirus were found to be zoonotic, meaning that they may be transmitted between animals and humans (7). COVID-19 is the third highly pathogenic human coronavirus that has been identified in the past two decades.
Person-to-person transmission has been described in both hospital and family settings (8). Therefore, it is necessary to prevent any further spread in the general population and in healthcare settings. Transmission of COVID-19 via contaminated dry surfaces, including self-inoculation of the mucous membranes of the nose, eyes, or mouth, has been proposed (9)(10)(11), which would make the virus even easier to transmit. Biocidal products like hydrogen peroxide, alcohols, sodium hypochlorite, and benzalkonium chloride are being used worldwide for disinfection, especially in social settings (12).
As of March 25 2020, 18,295 individuals had died from COVID-19 infection, while 107,089 patients had recovered. As per the WHO, there were more than 411,242 confirmed cases worldwide, with the majority of reported cases in Wuhan city. The outbreak led Wuhan to impose a citywide lockdown on January 23 2020, in which no individuals were permitted to enter or leave. The officials temporarily suspended all transportation, including trains, metro, airports, and public vehicles, to avoid the spread of COVID-19. A few other urban areas in Hubei province were also put under lockdown. One of the challenges posed by COVID-19 is its incubation period, which can last 2-14 days (13), and during this period the virus can spread to others. Besides, in (14), it is mentioned that the incubation period may range from 0 to 24 days depending upon the situation of the patient. The spread of such a disease is extremely dangerous. It requires exceptional measures and plans, which have been executed in different Chinese urban districts, particularly in the Hubei area. Hence it is essential to examine the number of confirmed cases at this time in order to start the necessary response plans. The main contribution of this research work is the use of an ARIMA model (15), which is capable of forecasting the global COVID-19 pandemic using the dataset shown in Table 1. The main contributions are as follows:
• We used an efficient forecasting model to estimate the confirmed cases of COVID-19 based on recently confirmed cases.
• An ARIMA model was used to forecast the confirmed fatalities of the coronavirus outbreak from February 1 2020 to April 1 2020.
• We evaluated total confirmed cases, total fatalities, confirmed cases concerning fatal cases, Q-Q plot of confirmed and fatal cases, white noise confirmed cases vs. fatal cases, and

LITERATURE SURVEY
Existing work has addressed estimation problems with methods such as the adaptive neuro-fuzzy inference system (ANFIS) (16), which is applied extensively to time series prediction and forecasting problems and has shown good performance in such applications. It offers flexibility in handling non-linearity in time series data by combining an artificial neural network (ANN) and a fuzzy approach. ARIMA models applied to historical hemorrhagic fever with renal syndrome (HFRS) incidence data are an important tool for HFRS surveillance in China. Chinese HFRS data from 1975 to 2008 were used to fit the ARIMA model. The Akaike information criterion (AIC) and the Ljung-Box test were used to assess the constructed models. Subsequently, the fitted ARIMA model was applied to obtain the fitted HFRS incidence from 1978 to 2008, which was compared with the corresponding observed values (17). This paper highlights the importance of adopting dynamic modeling approaches, points out difficulties in performing model selection across long time periods, and relates broadly to the predictability of complex adaptive systems. The work in (18) introduced an ensemble model for sequential forecasting using a frequentist computational bootstrap approach to evaluate the Ebola outbreak and generated short-term forecasts of the epidemic by combining two models, the generalized-growth model (GGM) and the generalized-logistic model (GLM) (19). The seasonal autoregressive integrated moving average (SARIMA) model has been used to forecast monthly cases of hand, foot, and mouth disease (HFMD) in China (20). A short-term forecast of incidence in China has been produced by applying ARIMA and exponential smoothing (ETS) to data from the Chinese Center for Disease Control and Prevention between 2005 and 2006 (21).
Ture and Kurt (22) presented a comparative study of different time series methods for forecasting Hepatitis A virus (HAV) infection. The methods considered were the ANN algorithm, radial basis function (RBF), time-delay neural networks (TDNN), and the ARIMA model, where the ANN algorithm was found to be more accurate than the others. In paper (15)         of Ebola virus disease (EVD) patients in a couple of African countries.
On the basis of the existing studies, in this study the ARIMA model was used for time series analysis, either to better understand the data or to forecast future values. This model is applicable in situations where the data may be non-stationary. Non-stationary behaviors can be trends, cycles, random walks, or combinations of the three. Non-stationary data are unpredictable and cannot be modeled or forecasted reliably. An analysis using non-stationary time series data may not be appropriate, as it might indicate a relationship between two variables where one does not exist. To get consistent, reliable results, the non-stationary data should be transformed into stationary data. In contrast to a non-stationary process, a stationary process reverts around a constant long-term mean and has a constant variance independent of time.
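The transformation from non-stationary to stationary data is commonly done by differencing. The following is a minimal sketch, not the procedure used in this study: it builds a toy trending series (made-up values), differences it once, and compares the means of the first and second halves as a crude indicator of a drifting mean. The helper names `difference` and `half_means` are illustrative assumptions.

```python
import random


def difference(series, order=1):
    """Apply first differencing `order` times: x[t] - x[t-1]."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def half_means(series):
    """Mean of the first and second halves; a large gap hints at a drifting mean."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return sum(first) / len(first), sum(second) / len(second)


random.seed(0)
# Toy non-stationary series: linear trend plus small Gaussian noise (illustrative only).
trend_series = [5.0 * t + random.gauss(0, 1) for t in range(200)]
diffed = difference(trend_series, order=1)

first, second = half_means(trend_series)
level_gap = abs(second - first)      # large: the mean drifts with time
first_d, second_d = half_means(diffed)
diff_gap = abs(second_d - first_d)   # small: the differenced series has a stable mean
print(f"gap before differencing: {level_gap:.2f}, after: {diff_gap:.2f}")
```

A formal check would use a unit-root test rather than this half-means comparison; the sketch only illustrates why differencing removes a trend.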

Dataset
The dataset considered for the study was collected from Kaggle (https://www.kaggle.com/c/covid19-globalforecasting-week-2/data). It contains the daily confirmed cases from all over the world between January 22 2020 and March 31 2020. An overview of the dataset is shown in Table 1. The dataset consists of a total of 22,032 rows and 7 columns. The COVID-19 dataset also includes 5 attributes, i.e., id, prov_state, country_region, confirmed case, and fatal case. The data is in the form of time-series data points. The date column is used as the index column; therefore, it is no longer a feature for us. This is because time-series operations are performed with respect to the date, which is why it is the most used parameter in our methodology (Table 1).
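As a rough illustration of how per-region rows can be turned into a global daily series indexed by date, the sketch below uses only the Python standard library. The column names `date`, `confirmed`, and `fatalities` and all values in the sample are hypothetical placeholders, not the actual file contents.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Toy rows in the spirit of the dataset; column names and values are made up.
SAMPLE_CSV = """id,prov_state,country_region,date,confirmed,fatalities
1,A,X,2020-01-22,10,1
2,,Y,2020-01-22,2,0
3,A,X,2020-01-23,15,1
4,,Y,2020-01-23,5,1
"""


def global_daily_totals(csv_text):
    """Sum confirmed cases and fatalities over all regions for each date."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["date"])
        totals[day][0] += int(row["confirmed"])
        totals[day][1] += int(row["fatalities"])
    # Sort by date so the result is a time series keyed (indexed) by date.
    return {day: tuple(vals) for day, vals in sorted(totals.items())}


series = global_daily_totals(SAMPLE_CSV)
for day, (confirmed, fatal) in series.items():
    print(day.isoformat(), confirmed, fatal)
```

With pandas, the same idea would typically be expressed by parsing the date column and setting it as the index before grouping.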

Autoregressive Integrated Moving Average (ARIMA)
ARIMA is a popular and flexible class of forecasting models that uses historical data to make predictions. This model is a basic forecasting technique that can serve as a starting point for more complex models (15). It works well when the data show a stable or consistent pattern over time with a minimal number of outliers. The ARIMA approach attempts to describe movements in a stationary time series as a function of autoregressive (AR) parameters and moving average (MA) parameters. We assume time is a discrete variable, $Z_t$ denotes the observation at time $t$, and $\epsilon_t$ denotes the zero-mean random noise term at time $t$. The MA($n$) (moving average) model assumes that $Z_t$ is generated by:

$$Z_t = \sum_{i=1}^{n} \gamma_i \epsilon_{t-i} + \epsilon_t,$$

where $\gamma_i$ denotes a coefficient. Similar to MA($n$) models, in the autoregressive model, denoted by AR($m$), $Z_t$ is a noisy linear combination of the previous $m$ observations:

$$Z_t = \sum_{i=1}^{m} \delta_i Z_{t-i} + \epsilon_t.$$

A more advanced model is ARMA($m$, $n$), a combination of AR($m$) and MA($n$) with a compact form that provides a flexible modeling tool. This model assumes that $Z_t$ is generated through the formula:

$$Z_t = \sum_{i=1}^{n} \gamma_i \epsilon_{t-i} + \sum_{i=1}^{m} \delta_i Z_{t-i} + \epsilon_t,$$

where $\epsilon_t$ is the zero-mean noise term. Adding a constraint on the AR($m$) part ensures a stationary process. A stationary and invertible ARMA($m$, $n$) model may be represented either as an infinite AR model (AR($\infty$)) or an infinite MA model (MA($\infty$)). For the ARIMA model, one can compute the first-order differences of $Z_t$ by $\nabla Z_t = Z_t - Z_{t-1}$ and the second-order differences by $\nabla^2 Z_t = \nabla Z_t - \nabla Z_{t-1}$. If the sequence $\nabla^d Z_t$ satisfies an ARMA($m$, $n$) model, we say that the sequence $Z_t$ satisfies ARIMA($m$, $d$, $n$):

$$\nabla^d Z_t = \sum_{i=1}^{n} \gamma_i \epsilon_{t-i} + \sum_{i=1}^{m} \delta_i \nabla^d Z_{t-i} + \epsilon_t,$$

which is specified by three order parameters $m$, $d$, $n$ with weight vectors $\delta \in \mathbb{R}^m$ and $\gamma \in \mathbb{R}^n$.
Forecasting with ARIMA($m$, $d$, $n$) is an inversion of the differencing operation. Assuming the time-series sequence $Z_t$ satisfies ARIMA($m$, $d$, $n$), one can predict the $d$th-order difference of the observation at time $t+1$ as $\nabla^d \tilde{Z}_{t+1}$ and then predict the observation at time $t+1$ as $\tilde{Z}_{t+1}$:

$$\tilde{Z}_{t+1} = \nabla^d \tilde{Z}_{t+1} + \sum_{i=0}^{d-1} \nabla^i Z_t.$$
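The inversion step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the AR weights `delta`, the MA weights `gamma`, and the past noise estimates `residuals` are assumed to be known already, and the function name `forecast_next` is hypothetical. The unknown noise term at time t + 1 is replaced by its mean, zero.

```python
def difference(series, order):
    """Return the series differenced `order` times (order 0 returns a copy)."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def forecast_next(z, d, delta, gamma=(), residuals=()):
    """One-step forecast of Z_{t+1} for ARIMA(m, d, n) with known coefficients.

    delta: AR weights delta_1..delta_m; gamma: MA weights gamma_1..gamma_n;
    residuals: past noise estimates, most recent last (assumed given here).
    """
    w = difference(z, d)  # the d-th order differenced series
    # AR part: delta_i times the differenced value i steps before t + 1.
    ar = sum(c * w[-i] for i, c in enumerate(delta, start=1))
    # MA part: gamma_i times the noise i steps before t + 1.
    ma = sum(c * residuals[-i] for i, c in enumerate(gamma, start=1))
    w_next = ar + ma  # predicted d-th order difference at t + 1
    # Invert the differencing: add back the last value of each lower-order difference.
    tails = [difference(z, i)[-1] for i in range(d)]
    return w_next + sum(tails)


squares = [1, 4, 9, 16, 25]
# Second differences of the squares are constant (2), so with d = 2 and a
# single AR weight of 1.0 the forecast continues the quadratic pattern.
print(forecast_next(squares, d=2, delta=(1.0,)))
```

For the toy input, the second difference is predicted as 2, and adding back the last first difference (9) and the last observation (25) gives the next square.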

RESULT AND DISCUSSION
For this study, data were analyzed using a Python library called matplotlib. It is a popular package for plotting 2D data. This library has been used to derive the line charts of the dataset. We analyzed the COVID-19 data and performed data visualization, which gave a concise overview of our dataset. For visualization, we used Python modules like pandas, matplotlib, and seaborn. We obtained summary statistics of the data using the describe method; this function prints the distribution of the dataset, e.g., the 50% and 75% percentiles. We used further visualization techniques to get a better insight into our data. Using various parameters, we have analyzed our data and described the total confirmed cases of COVID-19 starting from January 1 2020 to April 1 2020. We observed that the number of confirmed cases increased over time. In Figure 1, the x-axis indicates the time period in months, and the y-axis indicates the number of confirmed cases. Figure 2 describes the total fatalities of COVID-19 starting from January. We can observe that the number of fatalities also increased over time. Here the x-axis indicates the time period in months, and the y-axis indicates the number of fatalities. Figure 3 compares the increasing trends of confirmed and fatal cases over the same period of time. The legend is shown at the top of the diagram. From this graph, we observed that the number of confirmed cases was greater than the number of fatalities. While the number of confirmed cases increased gradually, it was not the same for fatalities. Figures 4, 5 are referred to as Q-Q plots in statistics. These plots are a graphical technique for determining whether two data sets come from populations with a common distribution. To produce these plots in Python, we used scipy.stats and pylab.
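To make the Q-Q construction concrete, the following sketch computes the points of a normal Q-Q plot using only the standard library. It is a simplified stand-in for library routines, not the code used for Figures 4, 5: the plotting positions (i - 0.5) / n are one common convention and are an assumption here, the sample values are made up, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def reference_line(sample):
    """Intercept and slope of the line y = mean + sd * x that normal data would follow."""
    return mean(sample), stdev(sample)


# Toy, made-up cumulative counts with exponential-like growth.
toy_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256]
points = normal_qq_points(toy_counts)
intercept, slope = reference_line(toy_counts)
for theo, obs in points:
    # Compare each observed quantile with the value the reference line predicts.
    print(f"{theo:+.3f}  observed={obs:>4}  line={intercept + slope * theo:8.2f}")
```

Points that stray far from the reference line, as strongly skewed growth data tend to, indicate a departure from the theoretical distribution.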
The above Q-Q plots of confirmed cases and fatalities show the observed values against the theoretical quantiles for both cases. If the data followed the theoretical distribution, the points in the graph would fall along the red line. The quantile-quantile (Q-Q) plots are used only for this comparison against the theoretical quantiles. Figures 6, 7 represent the white noise time-series data of the above COVID-19 data. White noise is a sequence of independent and identically distributed random variables with finite mean and variance. Figure 6 is the white noise figure of the confirmed cases and Figure 7 is the white noise figure of the fatal cases. It is worth mentioning that our dataset is stationary in nature because most values are around the mean in the white noise data. White noise describes a particular behavior of time-series data. Figure 8 is the comparison of the confirmed cases and white noise data. From the graph, we can see that the initial values are mostly around the white noise, which means our dataset is distributed well. Figure 9 represents the comparison of the fatalities and white noise data. From the graph, we can observe that most of the initial values are around the white noise, which means our dataset is distributed well.
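One plausible way to obtain a white noise series for such a comparison is to draw independent Gaussian values with the same mean and standard deviation as the observed series. The text does not state how the white noise in Figures 6 to 9 was generated, so the sketch below is an assumption; the function name `matched_white_noise` and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def matched_white_noise(series, seed=0):
    """Draw i.i.d. Gaussian noise with the same mean and standard deviation as `series`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mu, sigma = mean(series), stdev(series)
    return [rng.gauss(mu, sigma) for _ in series]


# Toy, made-up growth curve standing in for cumulative case counts.
cases = [float(2 ** k) for k in range(12)]
noise = matched_white_noise(cases)
for observed, simulated in zip(cases, noise):
    print(f"{observed:8.1f}  {simulated:10.1f}")
```

Plotting `cases` and `noise` on the same axes gives the kind of side-by-side comparison described above.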
Seasonality: Figures 10, 11 show the seasonality analysis of our data. A repeating pattern within a year is known as seasonality, although the term is applied more generally to repeating patterns within any fixed period. Here, we decomposed the time-series data and split them into observed, trend, seasonal, and residual components. Seasonal decomposition can be performed in two ways, i.e., additive or multiplicative. Here the term trend refers to a general systematic linear or (most often) nonlinear component that changes over time and does not repeat. Seasonal refers to the cyclical effects in the dataset. Residual means the error of prediction; in a time series, it depicts what is left over after fitting a model. ACF (autocorrelation function): Figures 12, 13 depict the ACF of both confirmed cases and fatalities over time. Autocorrelation is the correlation between a sequence and a lagged copy of itself. Statistically, it can be referred to as the correlation among the members of a variable. More generally, when the values of the observations are related to each other over time, this relationship is referred to as autocorrelation.
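The autocorrelation at lag k can be computed directly from its definition. The sketch below implements the usual sample autocorrelation estimator in plain Python as an illustration; it is not the plotting routine used for Figures 12, 13, and the input series are toy values.

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    dev = [x - mu for x in series]
    denom = sum(v * v for v in dev)  # lag-0 sum of squares, so r_0 = 1
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# A steadily increasing series is strongly correlated with its own recent past.
trend = list(range(100))
print([round(r, 3) for r in acf(trend, 3)])

# An alternating series flips sign at odd lags.
alternating = [1, -1] * 50
print([round(r, 3) for r in acf(alternating, 3)])
```

Slowly decaying autocorrelations, as in the trending example, are typical of cumulative counts and are one reason such series are differenced before an ARMA model is fitted.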
For model building, we have used the ARIMA model. By applying the ARIMA model, we forecast the future trend of confirmed cases and fatalities which is shown in Figure 14.
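The text does not state the order (m, d, n) or the fitting procedure that was used, so the following is only a minimal sketch of what model building and forecasting can look like. It assumes an ARIMA(1, 1, 0) model without intercept, estimates the single AR coefficient by least squares on the first differences, and rolls the forecast forward. The function names and the toy series are hypothetical; in practice a statistics library would normally handle estimation.

```python
def fit_ar1(w):
    """Least-squares AR(1) coefficient (no intercept) for the series w."""
    num = sum(cur * prev for cur, prev in zip(w[1:], w[:-1]))
    den = sum(prev * prev for prev in w[:-1])
    return num / den


def forecast_arima_110(z, steps):
    """Fit ARIMA(1, 1, 0) to z and forecast `steps` values ahead."""
    w = [b - a for a, b in zip(z, z[1:])]  # first differences
    phi = fit_ar1(w)
    level, last_diff = z[-1], w[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = phi * last_diff   # AR(1) forecast of the next difference
        level = level + last_diff     # undo the differencing
        forecasts.append(level)
    return phi, forecasts


# Toy series whose daily increments double each day (made-up values).
toy = [1, 2, 4, 8, 16, 32]
phi, future = forecast_arima_110(toy, steps=3)
print(phi, future)
```

Because the increments of the toy series double exactly, the estimated coefficient is 2 and the forecast continues the doubling; real case counts are noisier, and the choice of orders would normally be guided by the ACF and information criteria such as the AIC.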
A comparative analysis of COVID-19 studies is presented in Table 2. In this study, the ARIMA model was used for global forecasting of total confirmed cases and total fatal cases of COVID-19 at an early stage of the outbreak. Here in the current research work, population has been taken as a parameter.

CONCLUSION
In response to the COVID-19 pandemic, we applied time series analysis to examine properties such as the stationarity, trend, and pattern of the dataset. Various visualization techniques have been applied to the dataset for studying the COVID-19 outbreak, for which we relied on the seaborn and matplotlib modules. The graphs appropriately describe the trend and pattern of the COVID-19 pandemic outbreak. The ARIMA time-series model has been used to forecast future cases of COVID-19, and we have calculated the total confirmed cases and fatalities over the studied dates and produced Q-Q plots of confirmed cases and fatalities.
The dataset is stationary in nature; we have presented the ACF of both confirmed cases and fatalities over time and forecasted the future trend for the same. This study provides an advanced level of work, which may be useful in analyzing as well as fighting the pandemic. In future work, we can apply more advanced algorithms and techniques to build a model that forecasts more precisely.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.