Country-Wise Forecast Model for the Effective Reproduction Number Rt of Coronavirus Disease

Due to the particularities of SARS-CoV-2, public health policies have played a crucial role in the control of the COVID-19 pandemic. Epidemiological parameters for assessing the stage of the outbreak, such as the Effective Reproduction Number (Rt), are not always straightforward to calculate, raising barriers between the scientific community and non-scientific decision-making actors. Combining estimators of Rt with elaborate Machine Learning-based forecasting techniques provides a way to support decision-making when assessing governmental plans of action. In this work, we develop forecast models applying logistic growth strategies and auto-regression techniques based on Auto-Regressive Integrated Moving Average (ARIMA) models for each country that records information about the COVID-19 outbreak. Using the forecast for the main variables of the outbreak, namely the number of infected (I), recovered (R), and dead (D) individuals, we provide a real-time estimation of Rt and its temporal evolution within a timeframe. With such models, we evaluate Rt trends at the continental and country levels, providing a clear picture of the effect governmental actions have had on the spread. We expect this methodology of combining forecast models for raw data to calculate Rt to serve as valuable input to support decision-making related to controlling the spread of SARS-CoV-2.


INTRODUCTION
Different aspects of modern society favored the rapid spread of COVID-19 at a global level [1], so that it was declared a pandemic by the World Health Organization in March 2020 [2]. SARS-CoV-2, the novel coronavirus associated with this disease, was identified for the first time in the region of Wuhan, China, by sequence sampling from patients showing symptoms similar to pneumonia [3]. Genomic studies of SARS-CoV-2 suggest a phylogenetic relation with RaTG13, an endogenous variant reported in bats, based on the 96.2% identity between the two genomes [4]. Three different variants of SARS-CoV-2 have been reported, which are distributed across Asia, Europe, and America [5], to date accounting for 54 strains [6]. Additionally, among 103 strains of SARS-CoV-2 analyzed by Tang et al. [7], 101 exhibited a complete link between two specific Single Nucleotide Polymorphisms (SNPs): 72 strains exhibited a "CT" haplotype (defined as lineage L, because it is at the Leucine codon) and 29 strains exhibited a "TC" haplotype (defined as lineage S, because it is at the Serine codon) at these two SNPs. These lineages differ significantly in prevalence (70 and 30% for L and S, respectively), and evolutionary analyses suggested that the S lineage appeared to be more related to coronaviruses in animals, leaving open the question of whether these lineages might have different rates of transmission or replication [7]. All of the variability and particularities of SARS-CoV-2 mentioned above make the development of a vaccine or effective treatments more difficult, demanding a considerable effort from governmental actors to control the COVID-19 outbreak.
In the current scenario, mathematical models, data mining, and pattern recognition techniques play fundamental roles in understanding and forecasting the evolution of the spread and in supporting public health policies. Herein, we present some remarkable examples of these. Hu et al. [8] proposed a prognosis model to estimate in real time the number of contagious people and the time when the propagation of COVID-19 will finish. Guo et al. [9] developed predictive models for early detection and generation of alerts to avoid SARS-CoV-2 outbreaks. Following the same objective, mathematical models based on the well-known SIR model proposed by Kermack and McKendrick [10] have been employed to assess the situations in different countries and to support health policies [11,12]. Nevertheless, the use of these models requires the resolution of inverse problems, demanding extensive volumes of data and elaborate strategies to identify their parameters. Moreover, these models fail to represent the spread in countries with heterogeneous demographics [13]. Machine Learning approaches have been extensively used in the diagnosis of COVID-19, especially in the fields of X-ray and image analysis using deep convolutional neural network techniques [14-18], to predict critical patients in order to optimize hospital resources [19,20], and to search for candidate drugs for the treatment of SARS-CoV-2 [21,22].
Despite enormous efforts to make a prognosis of different variables to support and guide health policies, relevant parameters for studying the evolution of this outbreak are not always adequately delivered to the decision-making actors. The Effective Reproduction Number Rt, for example, is a well-known parameter used to evaluate the propagation of a disease. In previous work [23], we proposed a simple and fast methodology to estimate this rate directly from raw data. In this work, we apply a different approach to study Rt and its evolution. Through data mining and forecasting techniques based on Auto-Regressive Integrated Moving Average (ARIMA) models, we identify different spreading behaviors of the pandemic in countries around the world and develop models to forecast the spread of this pandemic. Using the forecast for the number of infected (I), recovered (R), and dead (D) individuals, we calculate Rt and its temporal evolution.

METHODS
The workflow to create forecast models of the variables needed to estimate the Effective Reproduction Number Rt can be summarized as follows. First, the variables Infected (I), Dead (D), and Recovered (R) are processed to obtain the daily values. Next, Logistic Growth models are applied to estimate Infected (I) cases, and ARIMA models are used to create forecast models of the Dead (D) and Recovered (R) values. Finally, all predicted variables are employed to estimate Rt.

Preparation of Datasets
All datasets were gathered from public repositories, which are updated on a daily basis [24]. Data pre-processing, such as filtering and scaling, was performed with scripts written in Python version 3.6 [25].
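The conversion of cumulative counts into daily values mentioned in the workflow could look like the following minimal sketch. The function name `daily_from_cumulative`, the example numbers, and the choice to set negative increments (retroactive corrections) to zero are assumptions made for illustration and are not taken from the original scripts.

```python
import numpy as np


def daily_from_cumulative(cumulative):
    """Convert a cumulative count series into daily increments."""
    cumulative = np.asarray(cumulative, dtype=float)
    # The first day keeps its own value; later days are day-to-day differences.
    daily = np.diff(cumulative, prepend=0.0)
    # Assumption: negative increments caused by retroactive corrections
    # in the reported data are replaced by zero.
    return np.clip(daily, 0.0, None)


# Hypothetical cumulative confirmed cases for one country.
cumulative_cases = [1, 3, 6, 6, 10, 9, 15]
daily_cases = daily_from_cumulative(cumulative_cases)
print(daily_cases)
```

Other filtering choices (for example, redistributing a correction over previous days) are equally plausible; the sketch only fixes one simple convention.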

Estimation of Rt
Using the data gathered for each country, we estimate Rt using the methodology proposed by Contreras et al. [23]. Assuming that the spreading dynamics of COVID-19 in a given territory are well described by a SIR model, represented by Equations (1)-(3), we can derive an expression for Rt:
$$\frac{dS}{dt} = -\beta \frac{S I}{N} \tag{1}$$
$$\frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I \tag{2}$$
$$\frac{dR}{dt} = \gamma I \tag{3}$$
where N is the total population and β and γ are the transmission and removal rates, respectively. Assuming that the active cases I can be expressed as a function of the susceptible fraction S, I(S), applying the chain rule in Equation (2) and substituting Equation (1), we obtain:
$$\frac{dI}{dS} = -1 + \frac{1}{R_t}\frac{N}{S} \tag{4}$$
where $R_t = \beta/\gamma$. Following the formalism of Contreras et al. [23], after using the hypothesis $S/N \approx 1$, we write the discrete version of the equation in a given timeframe $[t_{i-1}, t_i]$ that is consistent with the temporal resolution of the data:
$$\frac{\Delta I(t_i)}{\Delta S(t_i)} = -1 + \frac{1}{R_t} \tag{5}$$
As the different reported fractions must sum up to the total population, applying a mass balance, we may state the following dynamic condition:
$$\Delta S(t_i) = -\Delta \hat{I}(t_i), \qquad \Delta I(t_i) = \Delta \hat{I}(t_i) - \Delta \hat{R}(t_i) - \Delta \hat{D}(t_i) \tag{6}$$
By using Equation (6) in Equation (5), we obtain Equation (7):
$$R_t = \frac{\Delta \hat{I}(t_i)}{\Delta \hat{R}(t_i) + \Delta \hat{D}(t_i)} \tag{7}$$
where $\Delta \hat{I}$, $\Delta \hat{R}$, and $\Delta \hat{D}$ represent the new reported infections, recoveries, and deaths in the estimation timeframe.
FIGURE 1 | RMSE histograms for assessing the quality of the forecast models for each variable in the 185 countries considered. The red line in each histogram marks the threshold RMSE = 1, separating the models with RMSE ≤ 1 from the rest. As every country could be associated with one RMSE value, this figure provides a visual idea of the fraction of countries where the data was good enough to train reliable forecast models.
FIGURE 2 | Forecast for the evolution of Rt in selected countries. Even though countries such as Chile and the USA are approaching the control threshold of Rt = 1, the immediate forecast is not so optimistic. Details of the different governmental actions taken in the timeline are presented in Table 2. The Chilean curve is not continuous in the second week of June due to changes in the data-reporting criteria [30].
FIGURE 3 | [...] the number of countries our models successfully forecast, per continent. Several countries above the Q3 quartile exhibit Rt values above 4.6, denoting a lack of relevant control over the SARS-CoV-2 outbreak. All data were obtained from Dong et al. [24].
To smooth the different trends, we apply moving averages, which also serve as our variability estimation method. From its definition, Rt > 1 indicates that the outbreak might have exponential growth, while Rt < 1 would indicate a disappearing infection. The above results from the analysis of Equation (2),
$$\frac{dI}{dt} = \gamma I \left( R_t \frac{S}{N} - 1 \right) \tag{8}$$
which, under the hypothesis $S/N \approx 1$, has an unstable bifurcation when Rt = 1, exhibiting exponential growth or decay depending on whether Rt is greater or lower than 1, respectively.
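As a rough illustration of this estimator, the sketch below assumes that Equation (7) reduces to the ratio of new reported infections to new reported recoveries plus deaths in each timeframe, and that the moving average is applied to the daily series before taking the ratio. The 7-day window, the function names, and the example numbers are assumptions for illustration, not values taken from the text.

```python
import numpy as np


def moving_average(x, window=7):
    """Trailing moving average; the first window-1 entries are NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    if len(x) >= window:
        kernel = np.ones(window) / window
        out[window - 1:] = np.convolve(x, kernel, mode="valid")
    return out


def estimate_rt(new_infected, new_recovered, new_dead, window=7):
    """Assumed form of Equation (7): new infections / (new recoveries + new deaths)."""
    infected = moving_average(new_infected, window)
    removed = moving_average(new_recovered, window) + moving_average(new_dead, window)
    # Rt is left undefined (NaN) where nobody was removed in the window.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(removed > 0, infected / removed, np.nan)


# Hypothetical daily reports for one country.
new_i = [30, 32, 35, 31, 40, 42, 45, 50, 48, 55]
new_r = [10, 12, 11, 15, 14, 16, 18, 20, 22, 25]
new_d = [1, 2, 1, 2, 3, 2, 2, 3, 3, 4]
rt = estimate_rt(new_i, new_r, new_d)
print(np.round(rt, 2))
```

Values above 1 in the resulting series correspond to the growth regime discussed above, and values below 1 to decay.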

Forecast Models
Auto-Regressive Integrated Moving Average (ARIMA) models, which are related to auto-regression techniques [26], were used to develop forecast models for the number of deaths (D) and the number of recovered individuals (R). The hyperparameters of the algorithm were selected by optimizing the performance metric of the resulting models, in this case by minimizing the Root Mean Square Error (RMSE). All models were implemented using Python version 3.6 [25] and the libraries statsmodels [27] and scikit-learn [28].

TABLE 2 | Iconic dates for control measures in selected countries.
Country | Date | Action | Reference
... | ... | ... outbreak will be loosened | [34]
 | 2020-06-01 | Chinese authorities ban behaviors deemed "uncivilized," including placing a prohibition on sneezing or coughing without covering the nose or mouth and imposing a requirement to "dress properly." | [35]
 | 2020-06-08 | Chinese authorities announce that 95 foreign airlines will be permitted to resume commercial flights to Chinese destinations | [36]
USA | 2020-04-21 | Total closure of borders | [37]
 | 2020-06-07 | New York is out of quarantine | [38]
 | 2020-05-08 | Reopening of business in California | [39]
 | 2020-05-15 | Reopening of business in New York | [40]
 | 2020-05-18 | Reopening of business in Florida | [41]

Logistic Growth models [29], which follow Equation (9), were applied to create predictive models of the number of confirmed cases (I). The parameters r, P, and K were obtained and optimized for each country model by applying Non-linear Least Squares estimation.
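A non-linear least squares fit of a logistic growth curve can be sketched as follows. Since Equation (9) is not reproduced here, the sketch assumes the standard three-parameter logistic curve, with r the growth rate, K the carrying capacity, and the parameter P read as the initial value (called `p0` below). The starting guesses, the bounds, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def logistic(t, r, p0, k):
    """Assumed logistic curve: growth rate r, initial value p0, capacity k."""
    return k / (1.0 + (k - p0) / p0 * np.exp(-r * t))


def fit_logistic(t, cases):
    """Fit (r, p0, k) to cumulative cases by non-linear least squares."""
    t = np.asarray(t, dtype=float)
    cases = np.asarray(cases, dtype=float)
    # Rough starting point and positivity bounds (assumptions, not from the paper).
    guess = (0.1, max(cases[0], 1.0), 2.0 * cases.max())
    lower = [1e-6, 1e-6, cases.max()]
    upper = [5.0, cases.max(), np.inf]
    params, _ = curve_fit(logistic, t, cases, p0=guess, bounds=(lower, upper))
    return params


# Synthetic, noise-free example with known parameters.
days = np.arange(60)
observed = logistic(days, 0.2, 5.0, 1000.0)
r_fit, p0_fit, k_fit = fit_logistic(days, observed)
next_week = logistic(np.arange(60, 67), r_fit, p0_fit, k_fit)
print(r_fit, p0_fit, k_fit)
```

With real data, the fit is sensitive to the starting guess and to how far the outbreak has progressed toward its plateau, so the bounds above would likely need tuning per country.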
Finally, Rt for each country is estimated using Equation (7), taking as input the variables predicted by the forecast models described above.

RESULTS: FORECAST MODELS
Forecast models of the variables Infected (I), Recovered (R), and Dead (D) were developed for 185 countries that track the progression of the COVID-19 outbreak, including the United States, Italy, Australia, Chile, and Brazil, among others. Using the predictions generated by the forecast models, we estimate Rt and its evolution over time (Figure 2). The performance of each model was assessed using a root mean square error (RMSE)-based criterion. Figure 1 shows the RMSE histograms for each forecast variable in the different countries considered. Each histogram presents a division marked by a red line at RMSE = 1, setting a threshold for considering only those countries where the quality of the data provided was sufficient to obtain reliable predictors.
A more detailed assessment of the models can be made through the statistical distributions of the RMSE for each variable under study. Table 1 shows the error ranges obtained for each model, divided into quartiles. Forecast models for the variables D (Dead) and Rt present narrower ranges and lower values, mainly because of the low variability that these variables present in each country. In contrast, I and R sometimes exhibit abrupt increases on particular days and are more susceptible to errors in data acquisition, as the distribution of resources (sampling capabilities) and the criteria for clinical recovery are not homogeneous.
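The quartile-based assessment of the RMSE values can be reproduced with a few lines of NumPy. The sketch below is only illustrative: the country labels and RMSE values are hypothetical placeholders, not results from Table 1, and the function name `rmse_summary` is an assumption.

```python
import numpy as np


def rmse_summary(rmse_by_country, threshold=1.0):
    """Quartiles of per-country RMSE and the share of models at or below a threshold."""
    values = np.array(list(rmse_by_country.values()), dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": float(values.min()),
        "Q1": float(q1),
        "median": float(median),
        "Q3": float(q3),
        "max": float(values.max()),
        "fraction_below_threshold": float(np.mean(values <= threshold)),
    }


# Hypothetical RMSE values for the forecast model of one variable.
example_rmse = {"A": 0.5, "B": 1.0, "C": 2.0, "D": 4.0, "E": 8.0}
summary = rmse_summary(example_rmse)
print(summary)
```

Applying the same summary to each variable (I, R, D, and Rt) would give one row per variable, in the spirit of the quartile ranges described above.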
Data quality and the performance of the generated forecast model are deeply connected. For example, if the forecast model for D has an RMSE in the first quartile (Q1), the forecast models for the other variables are also likely to be satisfactory.
Figure 3A shows the SARS-CoV-2 propagation trend for different countries, divided by continents. To date, countries such as South Korea, China, and Australia have successfully controlled the spread of the pandemic, as they have reached the Rt < 1 zone. However, attention should be paid to slight increases in Rt weeks after reaching control of the spread, as they could account for new outbreaks. Nevertheless, such outbreaks can occur regardless of the stage of evolution of the pandemic. For example, countries like France and Ecuador, which have not yet reached the control threshold but are approaching it, have shown patterns indicating new contagion peaks (see Figure 3A). The USA and Ecuador show values far above the control threshold Rt = 1, without a clear decreasing tendency. Countries such as Chile, Canada, and Brazil, although presenting lower Rt values, are still fighting to control the spread of the virus. It is possible to associate differences in the Rt values with the actions applied to combat the SARS-CoV-2 outbreak. Moreover, Figure 3A highlights the effect of different health policies or government actions, such as border closings, periods of isolation or quarantine, and cancellation of massive events, on the spread of the virus. The effects of the action plans are not immediate due to the incubation and spread dynamics of the virus, among other reasons. However, the trend is clear: Rt curves decrease, on average, over time, which is consistent with the progressive actions countries have executed. A detailed analysis of Chilean trends in Rt is presented in Contreras et al. [31], and iconic dates for control measures in other countries from Figures 2, 3A are listed in Table 2.

EVOLUTION OF COVID-19, PUBLIC POLITICS, AND TENDENCIES OF COUNTRIES
A statistical analysis of the value of Rt for the most recent day of analysis (June 21) is presented in Figure 3B. A limited number of countries, such as China or S. Korea, have controlled the spread of the virus. However, a significant number of countries present Rt values above 4.6, lying above the third quartile (Q3) of the local distribution. In other words, most of the countries reporting progression of the COVID-19 outbreak have not reached the control threshold. At the continental level, Europe and Asia have a greater tendency toward higher quartiles, while most African states belong to the first quartile, indicating satisfactory control of the outbreak. Nevertheless, those values should be analyzed carefully, as the latter might be a sampling effect rather than the result of planned control, since the testing capabilities of most African countries have been overwhelmed by the contingency [42,43]. Moreover, there are several sources of error to be considered in the analysis of Rt, some associated with the data processing and reporting protocols and others with the nature of the virus.
Despite the several applications of Rt for the evaluation of government action plans and health policies and the assessment of the SARS-CoV-2 outbreak in a country, the estimators used remain somewhat naive, as they rely on the quality of the data. For example, some peaks that can be explained by incorrect data reporting or other sampling errors can be spotted in Figures 2, 3A. As the estimators do not consider possible errors related to the COVID-19 detection tests, temporal delays between diagnosis and records, or discrepancies among the clinical recovery criteria, proper data pre-treatment should be carried out before using them in order to correct some of these errors. Moreover, in countries with limited resources that do not have sufficient testing capacity to apply screening tests, Rt trends will be distorted, since the real dynamics will remain masked and uncertain.

DISCUSSION
We have developed prognostic models for the variables infected (I), recovered (R), and dead (D) to enable the estimation of the rate of spread of novel SARS-CoV-2 through the Effective Reproduction Number Rt in different countries worldwide. The models implemented are based on logistic growth techniques in combination with auto-regression, and their performance is assessed using the root mean square error (RMSE). Of the models generated for the 185 countries that record data related to the COVID-19 outbreak, 25% have RMSE values under the typical threshold of 1 and therefore yield predictions for Rt with minimal errors. The source code is available on request.
Asian countries such as China and S. Korea have controlled the spread in recent weeks, while in Europe the average trend approaches control. However, new data provide evidence of new outbreaks of COVID-19. At the same time, the panorama in America is much more complicated, since the trends clearly show Rt roaming far above the control threshold of Rt = 1.
Despite the usability of Rt, work should be done on estimating the magnitude of the sources of error and the variability of the data. For instance, uncertainties in diagnosis and differences in the testing strategy and clinical criteria of recovery, among others, might lead to temporal misclassification of patients, therefore heavily impacting the reported value of Rt. Moreover, we found discrepancies between the data provider servers of Dong et al. [24] and Info [30] that should be carefully studied. The lack of a protocol to assess and incorporate such errors can lead to unrealistic estimations of Rt, which are particularly dangerous. In this way, new strategies for estimating sources of error in Rt, together with the proposed forecasting methodology, can provide a robust tool for decision-making agents in the COVID-19 pandemic.