Modeling of COVID-19 Pandemic vis-à-vis Some Socio-Economic Factors

The impact of the COVID-19 epidemic on the socio-economic status of countries around the world should not be underestimated, given the role it has played in various countries. Many people became unemployed, many households became careful about their spending, and a greater social divide emerged in the population. For 14 countries from the Organization for Economic Co-operation and Development (OECD) and from Africa (that is, developed and developing countries), we considered epidemiological data on the spread of infection during the first and second waves, as well as socio-economic data. We established a mathematical relationship between the Theil and Gini indices, then investigated the relationship between epidemiological data and socio-economic determinants using several machine learning and deep learning methods. High correlations were observed between some of the socio-economic and epidemiological parameters, and we predicted three of the socio-economic variables in order to validate our results. These results show a clear difference between the first and second waves of the pandemic, confirming the impact of the real dynamics of the epidemic's spread in several countries and of the means by which it was mitigated.


INTRODUCTION
Modeling of COVID-19 by scientists, epidemiologists, and health experts was considered early on in the pandemic, as it began to ravage the world. The socio-economic determinants of the pandemic taken into account in this modeling are important because they condition the severity of the epidemic in an affected country and the way in which it is controlled. This leads us to consider the corresponding variables alongside the daily reproduction rates during the period of contagiousness of infected individuals. The aim of this article is to show the joint variations of socio-economic determinants and epidemiological parameters that can be observed between developed and developing states and between successive epidemic waves.
Some researchers have already worked on the socio-economic analysis of the COVID-19 pandemic. In [1], the authors examined the geoclimatic, demographic, and socio-economic determinants of COVID-19 prevalence and showed that the influence of these determinants differs between the first and second waves of the pandemic. The socio-economic impact of the COVID-19 pandemic in the United States of America (United States) was studied by Barlow and Vodenska [2], who investigated the systematic risk posed by sector-level industries within the United States. Ahmed et al. [3] modeled daily confirmed cases of COVID-19 in different countries across the globe using regression models, with predictions for upcoming scenarios. Kong et al. [4] worked on the socio-economic and environmental factors influencing the basic reproduction number of the COVID-19 pandemic by fitting a logistic growth curve to the reported daily cases up to the first peak of the pandemic, while Qiu et al. [5] studied the impact of socio-economic factors on the transmission of COVID-19 with China as a case study using an empirical model, concluding that these determinants have rich implications for ongoing efforts to contain the pandemic. The present article is an extension of [6], which analyzed the reproduction numbers of COVID-19 in relation to the Current Health Expenditure as a percentage of Gross Domestic Product (CHE/GDP) across several countries using machine learning tools. The results of that study show that some countries with a high CHE/GDP improved their public health strategy against the virus during the second wave of the pandemic, fighting it all the more effectively the more severely they had been affected during the first wave.
The present study differs in that it takes into account data from twice as many public sites [7][8][9][10][11][12][13][14][15][16][17] and focuses more on social inequality, quantified for example by the Social Fracture coefficient (SF, equal to the ratio between the incomes of the richest 10% and the poorest 10% of a given population) and by the Theil and Gini indices. It was shown previously in [18] that the Gini index is highly correlated (r = 0.93) with another demo-economic index, denoted DI and equal to the quotient (CHE/GDP)/SF, showing that all these indices are closely related and carry part of the causality of inter-country variations in epidemiological parameters.
The main objectives of this article are to establish a relationship between the Theil and Gini indices, critically analyze some of the socio-economic determinants of the pandemic, correlate them, predict three of the socio-economic variables, and perform some regression analyses. We also clustered countries according to these parameters with the help of the lasso (least absolute shrinkage and selection operator) method, which allowed us to select the best variables for the COVID-19 modeling.
The paper is divided into seven sections: after an introductory section, we explain in Section 2 the methodology used in this research, Section 3 deals with the variables used, Section 4 establishes a mathematical relationship between the Theil and Gini indices, Section 5 is dedicated to the visualization of the results obtained, and we finally give the discussion and conclusion in Sections 6 and 7, respectively.
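As a small worked illustration of the two indices just defined, the sketch below computes SF and DI for a single hypothetical country. The function names and the numerical inputs are assumptions made for the example and are not taken from the data sources of the study.

```python
def social_fracture(top10_income, bottom10_income):
    """SF: income of the richest 10% divided by income of the poorest 10%."""
    if bottom10_income <= 0:
        raise ValueError("bottom-decile income must be positive")
    return top10_income / bottom10_income


def demo_economic_index(che_gdp_percent, sf):
    """DI: the CHE/GDP percentage divided by the social fracture coefficient."""
    if sf <= 0:
        raise ValueError("SF must be positive")
    return che_gdp_percent / sf


# Hypothetical country: the richest decile holds 30% of total income,
# the poorest decile 2.5%, and 9% of GDP goes to health expenditure.
sf = social_fracture(30.0, 2.5)
di = demo_economic_index(9.0, sf)
print(f"SF = {sf}, DI = {di}")
```

With these inputs a wider income gap raises SF and therefore lowers DI for a fixed health budget, which is the behavior the definition of DI is meant to capture.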

METHODS
The use of machine learning methods to analyze data has been helpful over the years for understanding how a model behaves. In this research, we used supervised and unsupervised machine learning methods, and we also used one deep learning method, for the identification and visualization of clusters. To jointly interpret the socio-economic and epidemiological data, we chose these three main classes of descriptive statistical methods, which allow us to compare the two kinds of data. Supervised learning is used in its regression function (prediction of a quantitative variable from annotated examples) and unsupervised learning (in which the data are not labeled) in its classification function. As for deep learning, it makes it possible to create a model from large-scale unlabeled data.
The supervised machine learning methods we used are univariate polynomial regression, linear regression, lasso regression, and ridge regression. We also used some of these methods for prediction, training the model on part of the data and testing it on the remainder. Lasso regression helped us identify the best variables to use in the modeling. After the univariate regressions, we introduced multivariate least squares methods, which allow us to test much more complex relations between variables. The multivariate model can be represented as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m + \epsilon$$

where $\beta_0, \beta_1, \beta_2, \dots$ are coefficients or weights, $\epsilon$ is the residual noise, $y$ is the dependent variable, and $x_1, x_2, \dots$ are the independent variables.
Ridge and lasso regression are simple methods to reduce the model complexity and prevent the over-fitting that may result from linear regression. The cost function for ridge regression, for observations $i = 1, \dots, N$ and with $x_{i0} = 1$, is given below:

$$\sum_{i=1}^{N} \Big( y_i - \sum_{j=0}^{m} \beta_j x_{ij} \Big)^2 + \alpha \sum_{j=0}^{m} \beta_j^2$$

with, for some $c > 0$, $\sum_{j=0}^{m} \beta_j^2 < c$, while $\alpha$ is the penalty parameter that regularizes the coefficients: if the coefficients take large values, the objective function is penalized. Ridge regression thus puts a constraint on the coefficients $\beta$. We define the cost function for lasso regression in the same way, but replacing the L2 penalty term by an L1 one:

$$\sum_{i=1}^{N} \Big( y_i - \sum_{j=0}^{m} \beta_j x_{ij} \Big)^2 + \alpha \sum_{j=0}^{m} |\beta_j|$$

with, for some $c > 0$, $\sum_{j=0}^{m} |\beta_j| < c$.
After the supervised learning methods, we used unsupervised learning approaches to cluster variables across countries; the methods we used to validate our results were K-means clustering, hierarchical clustering, and Principal Component Analysis (PCA). We also performed correlation calculations among the parameters used in the modeling step, and we chose an optimization method called Ordinary Least Squares (OLS) for the socio-economic determinants of COVID-19. Finally, the deep learning methods we used were a Neural Network (NN) and the Multi-Layer Perceptron (MLP) regressor, which is a class of feedforward Artificial Neural Network (ANN).
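To make the difference between the L2 and L1 penalties concrete, the following minimal sketch fits a single slope under each penalty. It assumes one predictor and centers the data so that the intercept is left unpenalized, which is a simplification of the general cost functions; the function names and the toy data are illustrative only and are not the implementation used in the study.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_slope(x, y, alpha=0.0, penalty="ols"):
    """Slope b minimising sum((y - b*x)^2) + alpha * P(b) on centred data.

    penalty="ols":   no penalty
    penalty="ridge": P(b) = b^2  (L2), closed form Sxy / (Sxx + alpha)
    penalty="lasso": P(b) = |b|  (L1), closed form soft_threshold(Sxy, alpha/2) / Sxx
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if penalty == "ols":
        return sxy / sxx
    if penalty == "ridge":
        return sxy / (sxx + alpha)
    if penalty == "lasso":
        return soft_threshold(sxy, alpha / 2.0) / sxx
    raise ValueError(f"unknown penalty: {penalty}")


# Toy data (hypothetical), roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for name, alpha in [("ols", 0.0), ("ridge", 10.0), ("lasso", 10.0), ("lasso", 50.0)]:
    print(name, alpha, fit_slope(x, y, alpha, name))
```

Ridge shrinks the slope smoothly, whereas a large enough lasso penalty sets it exactly to zero, which is why lasso can act as a variable selector. Library implementations may scale the objective differently, so α values are not directly comparable across tools.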
The calculated socio-economic variables are as follows:
- The social fracture (SF) index is the ratio between the 10% highest income and the 10% lowest income. In brief, it is expressed by the equation below:

$$SF = \frac{\text{10\% highest income}}{\text{10\% lowest income}}$$

- The demo-economic (DI) index is the ratio between the percentage of GDP devoted to health expenditure and the social fracture index. It is expressed by the equation below:

$$DI = \frac{\text{CHE/GDP}}{SF}$$

We give a precise value of all variables in

Epidemiologic Variables
We have six epidemiologic variables: the first wave maximum $R_0$, the second wave maximum $R_0$, the first wave deterministic $R_0$, the second wave deterministic $R_0$, and the opposite of the initial autocorrelation slope of the daily new cases, averaged over 6 days, for the first and for the second wave, for developed and developing countries. All epidemiologic variable values were taken from [18] (see also the Appendix in [6]).
The epidemiologic variables were recorded during the exponential phase of the first and second waves of the pandemic. Daily new cases observed during the first 100 days were used to calculate the exponential slope for each wave. The opposite of the initial autocorrelation slope was averaged over 6 days for each wave. The maximum $R_0$ observed during the first and second waves in the countries considered was taken from [6]. We also took from [6] the deterministic $R_0$ for the first and second waves of the pandemic, with 6 days as the length of the contagiousness period.
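The text does not spell out how the exponential slope was computed. One common choice, sketched here purely as an assumption, is an ordinary least squares fit of the logarithm of daily new cases against the day index; the function name and the synthetic series are hypothetical.

```python
import math


def exponential_slope(daily_cases):
    """Least-squares slope of log(daily cases) against the day index.

    Days with zero (or negative) counts are skipped because the
    logarithm is undefined there.
    """
    points = [(t, math.log(c)) for t, c in enumerate(daily_cases) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two days with positive counts")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_l = sum(l for _, l in points) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in points)
    s_tl = sum((t - mean_t) * (l - mean_l) for t, l in points)
    return s_tl / s_tt


# Synthetic series (not real data): 10 cases on day 0, growing 20% per day
# in the continuous-time sense, i.e. cases(t) = 10 * exp(0.2 * t).
cases = [10.0 * math.exp(0.2 * t) for t in range(30)]
slope = exponential_slope(cases)
print("slope per day:", slope)
print("doubling time in days:", math.log(2) / slope)
```

On real surveillance data the fit would be restricted to the window identified as the exponential phase, since reporting delays and interventions bend the curve outside it.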
In the present study, we validated our results by performing cross-validation and by training on 80% of the data and testing on the remaining 20%.

Mathematical Approach
We first show the relationship between the Theil index and the Gini index mathematically. The Gini index is defined as in [16], where $x_k$ (respectively $y_k$) denotes the $k$th cumulative part of the population (respectively income). We choose population increments $d_k = x_k - x_{k-1}$ equal to $1/n$, and $E(\Delta)$ denotes the expectation of the increment $\Delta_k = y_k - y_{k-1}$ for the distribution $d_k$.
The Theil index applied to the percentage $y_k$ of the total income relative to a percentage $x_k$ of the total population is defined as in [17]. If the first increment of $y$, $\Delta_1 = y_1 \leq 1$, is close to 1 [which corresponds to a square-shaped Lorenz curve, i.e., close to a left right-triangle-shaped income vs. population curve (in red in Figure 1), or to a high Gini index close to 1], then we have $-\mathrm{Log}(\Delta_1) \sim 1 - \Delta_1$ and $\Delta_k \sim 0$ for $k > 1$. This leads to a relation between the Theil and Gini indices in which equality holds only if the Lorenz curve has a perfect left right-triangle shape.
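For readers who want to experiment with the two indices numerically, the sketch below computes them from cumulative population and income shares. It assumes the standard trapezoidal form of the Gini index and the standard Theil T index, with groups ordered from poorest to richest; the exact expressions and ordering used in the derivation above may differ, and the function names are illustrative.

```python
import math


def gini_from_lorenz(x, y):
    """Trapezoidal Gini index: 1 - sum (x_k - x_{k-1}) * (y_k + y_{k-1}).

    x and y are cumulative population and income shares, both starting
    at 0 and ending at 1.
    """
    return 1.0 - sum(
        (x[k] - x[k - 1]) * (y[k] + y[k - 1]) for k in range(1, len(x))
    )


def theil_from_lorenz(x, y):
    """Theil T index: sum Delta_k * log(Delta_k / d_k), with 0 * log(0) = 0.

    d_k = x_k - x_{k-1} is the population share of group k and
    Delta_k = y_k - y_{k-1} is its income share.
    """
    total = 0.0
    for k in range(1, len(x)):
        d = x[k] - x[k - 1]
        delta = y[k] - y[k - 1]
        if delta > 0:
            total += delta * math.log(delta / d)
    return total


# Four equal population groups (d_k = 1/4).
x = [0.0, 0.25, 0.5, 0.75, 1.0]
equal = [0.0, 0.25, 0.5, 0.75, 1.0]      # every group has the same income
concentrated = [0.0, 0.0, 0.0, 0.0, 1.0]  # one group holds all the income
for name, y in [("equal", equal), ("concentrated", concentrated)]:
    print(name, gini_from_lorenz(x, y), theil_from_lorenz(x, y))
```

Under these conventions both indices vanish for perfect equality, while full concentration in one of $n$ equal groups gives a Gini index of $1 - 1/n$ and a Theil index of $\log n$.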

Correlation
We correlated both the Theil and Gini indices with all epidemiologic, demographic, and socio-economic variables. As can be seen in Figure 2G, the Theil and Gini indices are highly positively correlated, with a correlation coefficient of 0.7.
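A correlation coefficient of this kind can be computed as in the following minimal sketch, which assumes the Pearson coefficient (the text does not name the coefficient used). The index values in the example are hypothetical and are not the values behind Figure 2G.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical Theil and Gini values for five countries.
theil = [0.20, 0.35, 0.28, 0.50, 0.41]
gini = [38.0, 45.0, 47.0, 53.0, 46.0]
print("r =", pearson_r(theil, gini))
```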

Neural Network for Theil Index and Gini Index
We used the neuralnet package in R to visualize the weights and biases of the network relating the Theil and Gini indices. As can be seen in Figure 2H, the weights are good, with low bias.

Regression Analysis Between Theil Index and Gini Index
Linear regression models use historic data on independent and dependent variables and assume a linear relationship between them, while polynomial regression models use a similar approach but model the dependent variable as a polynomial of degree $m$ in $x$ ($m = 2$ in the present study).
The linear regression model is given as:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon$$

where the $\beta_i$'s are the weights, $\beta_0$ is the intercept, and $\epsilon$ is the random error term. The above equation is the linear equation that needs to be obtained with the minimum error. Polynomial regression of order 2 is given below:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$$

We present the visualization of the regression results using this approach in Figures 2A-F.
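The linear and order-2 polynomial fits described above can be reproduced with a short least squares routine. The sketch below solves the normal equations by Gaussian elimination; it is an illustration with made-up data and illustrative function names, not the code used to produce Figures 2A-F.

```python
def solve_linear_system(a, b):
    """Solve a @ sol = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (m[i][n] - s) / m[i][i]
    return sol


def polyfit(x, y, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] via normal equations."""
    p = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(p)] for i in range(p)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(p)]
    return solve_linear_system(a, b)


def predict(coeffs, x_value):
    """Evaluate b0 + b1*x + b2*x^2 + ... at x_value."""
    return sum(c * x_value ** k for k, c in enumerate(coeffs))


# Made-up data lying exactly on y = 1 + 2x + 3x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]
print("degree 1:", polyfit(x, y, 1))
print("degree 2:", polyfit(x, y, 2))
```

The normal equations become ill-conditioned as the degree grows, so for higher-order fits a QR-based or library solver is usually preferable.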
For the linear regression shown in Figure 2A, the intercept is 31.03, the p-value is 0.0181, $R^2$ is 0.4881, and the residual standard error is 3.116; all coefficients are significant with p < 0.05 for both the train and test data, for linear and polynomial regression alike. The medians of the residuals in Figures 2B,F are 0.2111 and 0.2566 for linear and polynomial regression, respectively, which are low values. The normality of the residuals was tested using the Jarque-Bera and Durbin-Watson tests, which gave a high p-value, so we failed to reject the null hypothesis that the skewness and excess kurtosis of the residuals are statistically equal to zero. To assess the performance of the linear regression model, we trained it on 80% of the data and tested it on the remaining 20%, and also performed cross-validation to check the accuracy. The predicted and observed values are very close for the regression models used. For the linear model, we present the cross-validation result in Figure 2E, whose average mean square error over the five folds is 11.72794. We observed a high correlation between the test and predicted values ($R^2 = 0.97$). The test set p-value is 0.02, with a residual standard error of 3.528. For polynomial regression of order 2, the train set gives $R^2 = 0.6$, p-value = 0.002, and residual standard error = 2.935; the test set gives $R^2 = 0.99$, p-value = 0.008, and residual standard error = 0.5639.
Figure 3 corresponds to the ordinary multivariate least squares method, with $R^2 = 0.674$. Figure 3A shows Paraguay as an outlier not fitting the data, whereas Figure 3B normalizes all countries and does not single out any country in the plot.
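The 80/20 split and the five-fold cross-validation mentioned above can be sketched as follows for a simple linear model. The shuffling seed, the function names, and the synthetic data are assumptions for the example; the average it prints is not expected to match the value reported for Figure 2E.

```python
import random


def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def mean_squared_error(x, y, intercept, slope):
    """Mean of squared differences between y and the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


def train_test_split(x, y, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly, then hold out test_fraction of the rows."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(test_fraction * len(x)))
    test, train = idx[:n_test], idx[n_test:]
    return ([x[i] for i in train], [y[i] for i in train],
            [x[i] for i in test], [y[i] for i in test])


def k_fold_mse(x, y, k=5, seed=0):
    """Average held-out mean squared error over k shuffled folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k nearly equal, disjoint folds
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        errors.append(
            mean_squared_error([x[i] for i in fold], [y[i] for i in fold], b0, b1)
        )
    return sum(errors) / k


# Synthetic data: a noisy line, generated with a fixed seed.
rng = random.Random(42)
x = [float(i) for i in range(20)]
y = [30.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
x_tr, y_tr, x_te, y_te = train_test_split(x, y)
b0, b1 = fit_line(x_tr, y_tr)
print("test MSE:", mean_squared_error(x_te, y_te, b0, b1))
print("5-fold CV MSE:", k_fold_mse(x, y, k=5))
```

With small country-level samples a single 20% test set contains only a handful of points, so the cross-validated average is generally the more stable of the two estimates.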

Prediction of Gini Index Using MLP Regressor, and Linear, Lasso, and Ridge Regression
In this section, we used the cross-validation method to choose the best parameter α for the modeling, as shown in Figure 4C. For ridge regression, α = 0.142 with a mean square error of 1.36, and for lasso regression, α = 0.368 with a mean square error of 5.10. For Figure 4E, training score 1.000 and test score 0.641; for Figure 4F, training score 0.992 and test score 0.497; for Figure 4G, training score 0.99 and test score 0.406; and for Figure 4H, training score 0.984 and test score −0.077. It is evident from these results that linear regression best predicts the Gini index, with the highest test score, and the predicted values are very close to each other, as presented in Table 1. We also observed the same pattern of prediction in Figures 4E-H, showing that all methods used in this section have the same predictive behavior.
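To help interpret the training and test scores, including the negative ones, the sketch below assumes that each score is a coefficient of determination $R^2$. This is the default score of many regression libraries, but the text does not state the metric, so it is an assumption here; the numbers are illustrative only.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r2_score(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than predicting the mean
```

Under this reading, a score of 1 means perfect prediction, 0 means no better than predicting the mean of the held-out values, and a negative test score means the model does worse than that constant baseline. A training score near 1 combined with a much lower test score is the usual signature of over-fitting.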

Clustering Analysis of Latino-American Countries for Gini Index and Theil Index Alongside Other Socio-Economic Variables and Epidemiologic Variables
In Figure 5C, the first two clusters have 14 countries and the third has three countries, which are Uruguay and El Salvador on the same hierarchy while Argentina is on another hierarchy. We only show the cluster dendrogram for the first cluster. In Figure 5F, the Gini index has the highest positive correlation of 0.44 with the principal component PC 1 and Theil index has only the value 0.34 with PC 1. The main variable causing the separation into three classes is the Gini index.
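K-means is one of the clustering methods listed in the Methods section. As a minimal sketch of how such a partition is obtained, the code below implements Lloyd's algorithm with caller-supplied starting centroids; the two-dimensional points are made up and do not represent any country, and the initialization strategy is an assumption.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Made-up 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, it is common to standardize the variables first and to repeat the procedure from several initializations.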

Regression and Multivariate Analysis for Socio-Economic and Epidemiologic Variables
In Figures 6C,D, we modeled the dependent variable as a polynomial of degree $n$ in $x$ ($n = 6$ in the present study), an extension of Eq. 8. Figure 6 presents the regression analyses. Figure 6E shows some developing countries as outliers, while Belgium is the only developed country that does not fit the data.

Prediction of Percentage GDP Devoted to Health Expenditure
In this section, we used the cross-validation method to choose the best parameter α for the modeling, as shown in Figure 7C. For ridge regression, α = 0.012 with a mean square error of 2.32, and for lasso regression, α = 0.029 with a mean square error of 2.21. For Figure 7E, training score 0.983 and test score 0.607; for Figure 7F, training score 0.170 and test score 0.021; for Figure 7G, training score 0.854 and test score 0.115; and for Figure 7H, training score 0.980 and test score −2.386.
It is evident from the results that linear regression best predicts GDP percentage devoted to health expenditure with the highest test score, and all predicted values are very close.

Principal Component Analysis and Clustering Results
In Figures 8E,F, the first cluster has 15 countries, the second cluster has 27 countries, and the last cluster has 2 countries, Tanzania and Mauritius. We only show the first two cluster dendrograms. With PC1, the Gini index (GI) has the highest positive correlation (0.52) and the demo-economic index (DI) has the second highest negative correlation (−0.53), while with PC2, the first wave maximum $R_0$ has the highest positive correlation (0.70) (Figure 8C). The first cluster contains a majority of developing countries and the second cluster a majority of developed countries, the main variable causing the separation into two classes being the Gini index in PC1.
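The association between each variable and the first principal component can be explored with a small sketch such as the one below. It assumes that PCA is run on standardized variables (that is, on the correlation matrix) and extracts the first component by power iteration; the function names and the two-variable data set are hypothetical, and the procedure is not claimed to be the one behind Figure 8C.

```python
import math


def correlation_matrix(rows):
    """Correlation matrix of the columns of rows (one row per observation)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    z = [[(v - m) / s for v in c] for c, m, s in zip(cols, means, sds)]
    p = len(cols)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]


def first_component(corr, iterations=200):
    """Largest eigenvalue and unit eigenvector of corr by power iteration."""
    p = len(corr)
    # Asymmetric start vector, to avoid starting orthogonal to the
    # leading eigenvector in symmetric cases.
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p))
    # Fix the sign so that the largest-magnitude loading is positive.
    if v[max(range(p), key=lambda i: abs(v[i]))] < 0:
        v = [-c for c in v]
    return eigenvalue, v


# Hypothetical data: two strongly, positively related variables.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
corr = correlation_matrix(rows)
eigenvalue, loadings = first_component(corr)
print("PC1 loadings:", loadings)
print("share of variance explained:", eigenvalue / len(corr))
```

Variables with loadings of the same sign move together along the component, and variables with opposite signs move against each other, which is how a contrast such as the one between the Gini index and the demo-economic index would appear.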

Multivariate Analysis for Socio-Economic Variables and Epidemiologic Variables
Figures 9A-D correspond to the ordinary multivariate least squares method with $R^2 = 0.60$. Figure 9A shows Botswana and Tanzania as outliers not fitting the data.

Prediction of Temperature
In this section, we used the cross-validation method to choose the best parameter α for the modeling, as shown in Figure 10C. For ridge regression, α = 1.005 with a mean square error of 19.13, and for lasso regression, α = 6.018 with a mean square error of 16.93.
For Figure 10E, training score 0.647 and test score −2.228; for Figure 10F, training score 0.316 and test score 0.154; for Figure 10G, training score 0.573 and test score −1.136; and for Figure 10H, training score −6.728 and test score −4.714. It is evident from these results that the lasso regression best predicts temperature with the highest test score, and predicted values of temperature for lasso and ridge regression are close.
All the regression methods give about the same result with the maximum accuracy for the ridge regression.

Principal Component Analysis and Clustering Results
In Figures 11E,F, the first cluster has 40 countries, the second cluster has 13 countries, and the last cluster has only 1 country, Botswana. We only show the two cluster dendrograms with many countries. In Figure 11C, average life expectancy has the highest positive correlation.
Figure 12 corresponds to the ordinary multivariate least squares method with $R^2 = 0.90$. Figure 12A shows Iceland, the United States, Austria, and Belgium as outliers not fitting the data. We see in the partial regression plots of Figure 12C that the best correlation observed between parameters is between CHE/GDP and the demo-economic index DI, as observed before in [21].

Prediction of Percentage GDP Health Expenditure
In this section, we used the cross-validation method to choose the best parameter α for the modeling, as shown in Figure 13D. For ridge regression, α = 0.005 with a mean square error of 1.905, and for lasso regression, α = 0.027 with a mean square error of 1.657. For Figure 13E, training score 0.993 and test score 0.535; for Figure 13F, training score 0.898 and test score 0.629; for Figure 13G, training score 0.983 and test score 0.259; and for Figure 13H, training score −0.072 and test score −0.196. It is evident from these results that the lasso regression best predicts the percentage of GDP devoted to health expenditure, with the highest test score, and the predicted values are very close.
All the regression methods give about the same result with the maximum accuracy for the ridge regression.

Principal Component Analysis and Clustering Results
In Figures 14E,F, the first cluster has 20 countries, and the second has 5 countries: the United States and Bulgaria on the same hierarchy, Mexico and Costa Rica on the same hierarchy, and Chile standing alone. The third cluster has 12 countries. We only show the two highest cluster dendrograms. In Figure 14C, the Gini index and the social fracture index have the highest positive correlations with PC1 (0.45 and 0.46, respectively), while the percentage of GDP devoted to health expenditure and the demo-economic index have the highest positive correlations with PC2 (0.65 and 0.41, respectively).
The two main clusters both correspond to developed countries, but the countries in the first are more continental and those in the second more maritime, which could be explained by their difference in consumer confidence index (CCI), which is lower in maritime countries than in continental ones.

DISCUSSION
We have been able to develop new approaches to the socio-economic determinants for the modeling of the COVID-19 pandemic during the exponential phase. Some of these determinants have shown a high correlation with epidemiologic parameters, as can be seen in the heatmap diagrams in Figures 2G, 8A, 11A, 14A, and these correlations help explain the role of each variable.
For developed and developing countries, the lasso regression reduced the correlation between the social fracture index and the 10% highest income, while for OECD countries, the correlation between the Gini index and the social fracture index was reduced to zero. Some of our variables were not used in the OLS optimization method because of the multicollinearity observed in the results summary. For the two sets of countries, the consumer confidence index, the opposite of the initial autocorrelation slope averaged over 6 days for the first and second waves, the 10% lowest income, and the 10% highest income were not used in the modeling. The $R^2$ values of the OLS results for developed and developing countries and for OECD countries are 0.76 and 0.90, respectively, which shows a high significance rate (Figures 8E,F, 12).
The principal component analysis shows high correlation for the numbers of new cases used in this research. The social fracture index has a high correlation with PC1 in both cases, while in PC2, the percentage of GDP devoted to health expenditure is dominant for OECD countries and the maximum $R_0$ of the first wave is dominant for developed and developing countries (see Figures 8C, 14C). We can deduce from all these observations that the socio-economic determinants are a key to the modeling of infectious diseases like COVID-19, as these parameters give strong signals on the trend during the spread of the pandemic in various countries ([22][23][24][25][26][27][28][29][30]).

CONCLUSION
The systematic study of the correlations between socio-economic variables (Gini and Theil indices, percentage of GDP devoted to health expenditure, etc.) and epidemiological variables (reproduction rate, opposite of the slope of the autocorrelation at the origin, etc.) shows a disparity between developed and developing countries, as well as between epidemic waves. Developed countries with high indices of social divide, but high health expenditure, did not react better to the COVID-19 epidemic than developing countries during the first wave. On the other hand, the rapid implementation of isolation and vaccination measures enabled them to anticipate and reduce the effects of the second wave. In a subsequent work, we will study the evolution of this disparity between developed and developing countries during subsequent waves of SARS-CoV-2.