Development and Validation of a Predictive Model for Severe COVID-19: A Case-Control Study in China

Background: Predicting the risk of progression to severe coronavirus disease 2019 (COVID-19) could facilitate personalized diagnosis and treatment options, thus optimizing the use of medical resources. Methods: In this prospective study, 206 patients with COVID-19 were enrolled from regional medical institutions between December 20, 2019, and April 10, 2020. We collated a range of data to derive and validate a predictive model for COVID-19 progression, including demographics, clinical characteristics, laboratory findings, and cytokine levels. Variation analysis, along with the least absolute shrinkage and selection operator (LASSO) and Boruta algorithms, was used for modeling. The performance of the derived models was evaluated by specificity, sensitivity, area under the receiver operating characteristic (ROC) curve (AUC), Akaike information criterion (AIC), calibration plots, decision curve analysis (DCA), and Hosmer–Lemeshow test. Results: We used the LASSO algorithm and logistic regression to develop a model that can accurately predict the risk of progression to severe COVID-19. The model incorporated alanine aminotransferase (ALT), interleukin (IL)-6, expectoration, fatigue, lymphocyte ratio (LYMR), aspartate transaminase (AST), and creatinine (CREA). The model yielded a satisfactory predictive performance with an AUC of 0.9104 and 0.8792 in the derivation and validation cohorts, respectively. The final model was then used to create a nomogram that was packaged into an open-source and predictive calculator for clinical use. The model is freely available online at https://severeconid-19predction.shinyapps.io/SHINY/. Conclusion: In this study, we developed an open-source and free predictive calculator for COVID-19 progression based on ALT, IL-6, expectoration, fatigue, LYMR, AST, and CREA. 
The validated model can effectively predict progression to severe COVID-19, thus providing an efficient option for early and personalized management and the allocation of appropriate medical resources.


INTRODUCTION
The current outbreak of coronavirus disease 2019 (COVID-19) has spread rapidly and widely across the world, causing panic and major public health challenges in the international community (1). COVID-19 presents with a wide range of clinical manifestations, including asymptomatic infection, mild upper respiratory tract illness, and severe viral pneumonia with respiratory failure. Only a small proportion of cases (∼15-20%) progress to a severe condition; however, ∼40% of patients with severe disease die (2-5). Although some research has shown that initial therapy with remdesivir or non-invasive positive pressure ventilation (NIPPV) is very effective for severe cases, there is currently a lack of accepted recommendations for the individualized treatment of severe patients (6-8). Therefore, the rapid deterioration of patients with severe COVID-19 deserves special attention. There is an urgent need to develop options for the personalized diagnosis and treatment of such patients, particularly given the relative shortage of medical resources.
Fever, cough, and fatigue are commonly present in patients with mild COVID-19 (9,10). As the disease progresses further, patients may also experience respiratory failure, acute respiratory distress syndrome, heart failure, metabolic acidosis, and septic shock (11). Besides the well-defined clinical characteristics of COVID-19, previous studies have shown that abnormal laboratory findings and cytokine levels are often associated with disease progression, including coagulation-related markers such as D-dimer and fibrinogen (FIB), neutrophil count, lymphocyte count, and high-sensitivity C-reactive protein (HsCRP) (5,12-15). In addition, research has identified that a cytokine storm could be the primary driver of severe progression in COVID-19 patients (16,17). However, the application of these independent indicators is limited by many factors, including insufficient information, individual differences, the experience of the attending physician, and the complexity of the disease. Thus, there is an urgent need for advanced multivariable prediction models (18,19). Although several studies have attempted to develop prediction models, most of the existing models were developed in a single center and based on retrospective data; in some cases, only partial datasets were used, and there was a clear lack of validation. These factors may lead to the omission of key variables and the risk of over-fitting, thus limiting the clinical application of such models. Therefore, there is a critical need to develop more effective prediction models (14,15,20,21).
Here, we prospectively and consecutively enrolled a cohort of COVID-19 patients with a complete set of demographic data, clinical characteristics, laboratory findings, and cytokine information, and we then constructed a multiparameter prediction model for the early identification of severe COVID-19. Our model could help to monitor and guide precision medicine.

Participants
COVID-19 patients were prospectively and consecutively enrolled from regional medical institutions by the West China Medical Center between December 20, 2019, and April 10, 2020. The patients were divided into severe and non-severe groups according to the China National Health Commission Guidelines for Diagnosis and Treatment of COVID-19 infection (Versions 5 and 7). Serum samples were collected from patients within 3 days of infection confirmation and stored at −80 °C for the subsequent detection of cytokine levels. Demographic data, clinical characteristics, and laboratory findings were acquired from electronic medical records (Figure 1). Two independent researchers reviewed the data collection forms.

Diagnostic and Severity Classification Criteria
Patients with pneumonia, typical findings on chest computed tomography (CT), and positive severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleic acid results, as determined by real-time fluorescent reverse transcription-polymerase chain reaction of bronchoalveolar lavage (BAL) fluid or sputum, were considered COVID-19 "cases" according to the diagnosis and treatment guidelines released by the China Health and Medical Commission (22). Patients who met at least one of the following criteria during hospitalization were allocated to the severe group: (1) respiratory distress, with a respiratory rate ≥30 breaths/min; (2) oxygen saturation ≤93% at rest; or (3) a ratio of arterial oxygen partial pressure (PaO2) to fraction of inspired oxygen (FiO2) ≤300 mmHg. All patients had been discharged or had died by the time the model was developed.
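The three severity criteria can be read as a simple "any of" rule. The following Python sketch is illustrative only and is not the authors' code; the function name `is_severe` and its parameter names are assumptions, and criterion (1) is reduced to the respiratory-rate threshold because clinical respiratory distress is not a single numeric input.

```python
from typing import Optional


def is_severe(resp_rate: Optional[float] = None,
              spo2_rest: Optional[float] = None,
              pao2: Optional[float] = None,
              fio2: Optional[float] = None) -> bool:
    """Return True if at least one of the three severity criteria is met.

    resp_rate : breaths per minute
    spo2_rest : resting oxygen saturation, in percent
    pao2      : arterial oxygen partial pressure, in mmHg
    fio2      : fraction of inspired oxygen, as a fraction (0.21 for room air)
    Measurements that are not available can be left as None.
    """
    # (1) respiratory rate >= 30 breaths/min
    if resp_rate is not None and resp_rate >= 30:
        return True
    # (2) oxygen saturation <= 93% at rest
    if spo2_rest is not None and spo2_rest <= 93:
        return True
    # (3) PaO2/FiO2 <= 300 mmHg (skipped if either value is missing or FiO2 is 0)
    if pao2 is not None and fio2:
        return pao2 / fio2 <= 300
    return False
```

For example, a patient with a PaO2 of 60 mmHg on an FiO2 of 0.4 has a ratio of 150 mmHg and would be classified as severe by criterion (3) alone.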

Construction of the Predictive Model and Internal Validation
Patients from the Chengdu region formed the derivation cohort, which was divided into a training set for modeling and a testing set for internal validation. Three methods were used to select variables: stepwise selection based on p-values, the least absolute shrinkage and selection operator (LASSO), and the Boruta algorithm (23,24).
Stepwise selection based on p-values is a classic regression-based method. A variable with p < 0.05 was regarded as significant and was retained. This practice generally achieves a better performance in smaller datasets and has been extensively used in previous research. LASSO regression shrinks the coefficients of the features via a penalty function to obtain an optimally constrained model; this practice has been used effectively to avoid the over-fitting and collinearity that affect classical analysis methods based on significant differences, and it also enhances the ability of a model to generalize. The Boruta algorithm is a wrapper algorithm built around random forest classification. It iteratively removes features that prove to be less relevant than random probes and thus aims to retain the variables that are relevant to the response variable. In addition, LASSO and Boruta are particularly suitable for a dataset with a small sample size but a large number of variables. By using these three different variable selection methods, we were able to select three candidate predictor panels to construct different binary logistic regression models, which were then verified internally by 10-fold cross-validation. The optimal model was then selected by comparing the area under the curve (AUC) and the Akaike information criterion (AIC) in order to generate a nomogram that could be encapsulated as an open-source online predictive calculator.
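To illustrate how an L1 (LASSO) penalty performs variable selection in a logistic model, the sketch below fits a penalized logistic regression by proximal gradient descent in plain Python. It is not the authors' implementation (their analysis was done in R); the function names, the penalty weight `lam`, the step size, the iteration count, and the toy data are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero; values within +/- threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean negative log-likelihood + lam * sum(|beta_j|).

    The intercept is not penalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residual = predicted probability minus observed outcome.
        residuals = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - target
            for row, target in zip(X, y)
        ]
        grad_intercept = sum(residuals) / n
        grad_beta = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                     for j in range(p)]
        # Gradient step, then soft-thresholding (the proximal step for L1).
        intercept -= step * grad_intercept
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad_beta)]
    return intercept, beta


def make_toy_data():
    """Hypothetical data: feature 0 is associated with the outcome, feature 1 is not."""
    X, y = [], []
    for x1 in (1.0, -1.0):
        for x0, outcomes in ((1.0, (1, 1, 1, 0)), (-1.0, (1, 0, 0, 0))):
            for outcome in outcomes:
                X.append([x0, x1])
                y.append(outcome)
    return X, y
```

Calling `fit_lasso_logistic(*make_toy_data(), lam=0.05)` keeps a positive coefficient for the informative feature and sets the coefficient of the uninformative feature to exactly zero, which is the behavior that makes LASSO useful for selecting a small predictor panel. With a sufficiently large penalty (for these toy data, `lam=0.3`), every coefficient is shrunk to zero.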

Independent Validation
The independent validation cohort consisted of patients from outside Chengdu; this was used for external verification to predict the generalization ability of the model by comparing the predicted results with a set of follow-up results to calculate several metrics: sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). AUCs and decision curve analysis (DCA) were used to comprehensively evaluate the model's discrimination and net clinical benefits (25).
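The validation metrics named above all derive from a 2 × 2 confusion matrix, and the net benefit used in DCA adds a weighting by the chosen threshold probability. The sketch below shows the standard definitions; the function names are assumptions, and the counts in the usage note are hypothetical values, not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts.

    tp/fn: severe cases predicted as severe / non-severe.
    tn/fp: non-severe cases predicted as non-severe / severe.
    """
    return {
        "sensitivity": tp / (tp + fn),  # severe cases correctly flagged
        "specificity": tn / (tn + fp),  # non-severe cases correctly cleared
        "ppv": tp / (tp + fp),          # flagged patients who are truly severe
        "npv": tn / (tn + fn),          # cleared patients who are truly non-severe
    }


def net_benefit(tp, fp, n, threshold):
    """Net benefit at a given threshold probability (0 < threshold < 1).

    False positives are weighted by the odds of the threshold, so a higher
    threshold penalizes unnecessary interventions more heavily.
    """
    return tp / n - (fp / n) * (threshold / (1 - threshold))
```

With hypothetical counts `tp=8, fp=2, tn=18, fn=2`, `classification_metrics` gives a sensitivity and PPV of 0.8 and a specificity and NPV of 0.9; a decision curve is obtained by evaluating `net_benefit` over a range of thresholds and plotting it against the "treat all" and "treat none" strategies.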

Statistical Analysis
Continuous variables and categorical variables are presented as the median (upper and lower quartiles) and as a frequency, respectively. The chi-squared test for categorical variables and the Student's t-test or Mann-Whitney U-test for continuous variables were used to compare the groups. Pearson correlation was used to determine the linear correlation between two variables. The diagnostic performance of the models was then assessed by the AIC and the receiver operating characteristic (ROC) curve and quantified by AUCs. An open-source online predictive calculator was then created using the Shiny tool (version 1.2.0) in the R environment. All statistical analyses were completed using R version 3.5.0. All statistical tests were two-tailed, and p ≤ 0.05 was considered to indicate statistical significance.
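As a reminder of what the two model-comparison quantities measure, the sketch below computes the AUC as the probability that a randomly chosen severe case receives a higher score than a randomly chosen non-severe case (equivalent to the normalized Mann-Whitney U statistic), and the AIC from a model's log-likelihood. The function names and the example values are assumptions for illustration, not the authors' code or data.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: predicted risks; labels: 1 for severe, 0 for non-severe.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood
```

For instance, `auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])` ranks three of the four case pairs correctly and returns 0.75. When two candidate models fit the data equally well, the AIC favors the one with fewer parameters.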

Standard Protocol Approvals, Registrations, and Patient Consent
The protocol for this study was approved by the West China Hospital, Sichuan University Medical Ethics Committee (reference no. 193, 2020), and conformed to the principles of the Declaration of Helsinki. Written informed consent was obtained from all participants.

Epidemiological Characteristics
We recruited 206 patients with a confirmed diagnosis of COVID-19; of these, 44 patients progressed to severe COVID-19, and 162 patients were classified as having non-severe COVID-19. Patients in the severe group were significantly older (50 vs. 46, p = 0.005) and had a significantly higher frequency of underlying diseases (diabetes and hypertension) than the non-severe group (p < 0.001 and p = 0.013, respectively). There were no differences between the two groups in terms of gender (male: 54.940 vs. 56.810%, p = 0.400). With regard to epidemiological exposure, most of the patients (79.000%) in the severe group had been overseas or had visited Wuhan or surrounding regions within 14 days of disease onset; patients who had been overseas accounted for 50% of the patients with non-severe COVID-19. As of April 28, 2020, the median time to a negative nucleic acid test result was 11 days in the non-severe group and 18 days in the severe group, excluding three patients who died from multiple organ failure (MOF).

Differences in Characteristics and Correlation Analysis
Demographic data, clinical characteristics, laboratory findings, and cytokine levels are shown in Table 1 and Supplementary Figures 1, 2. Several cytokines were significantly elevated in the severe COVID-19 group (p ≤ 0.010). The predictive value of each single cytokine, and of a combined panel of cytokines, was evaluated by ROC curve analysis and quantified by AUC (Supplementary Figure 3). Results showed that the AUCs were 0.830, 0.796, 0.729, 0.707, 0.694, 0.667, 0.656, and 0.653 for single IL-10, IL-6, IL-1α, IL-1β, IL-17A, IL-4, TNF-α, and IL-2 and that the binary logistic model had a similar AUC (0.796-0.848). These data indicated that IL-10 and IL-6 may represent potential biomarkers for patients with severe COVID-19. We found significant differences between the severe and non-severe COVID-19 groups with regard to a range of clinical characteristics, including respiratory rate, cough, expectoration, dyspnea, asthma, and debilitation. Significant differences were also identified in several laboratory findings; lymphocyte ratio (LYMR), eosinophil ratio (EOSR), monocyte ratio (MONOR), total bilirubin (TBIL), total protein (TP), albumin (ALB), Ca, and URIC were all significantly lower in the severe COVID-19 group, while neutrophil ratio (NEUTR), FIB, aspartate transaminase (AST), glucose (GlU), and HsCRP were all significantly higher. However, the AUCs for these indicators when used to predict severe COVID-19 were all <0.690. Simple logistic analysis was not well suited to predicting severe COVID-19 because features had to be selected from such a large number of indicators. We identified significant correlations between each pair for all cytokines except IL-33 and IFN-β. In addition, IL-6, IL-10, and IFN-β were closely associated with certain laboratory indicators of hepatobiliary function. Similarly, hematocrit (HCT), TBIL, direct bilirubin (DBIL), indirect bilirubin (IBIL), TP, creatine kinase (CK), and myoglobin (Myo) were significantly associated with most cytokines except IL-33, which was not correlated with any of the indices.

[...] corresponding predictive models (predictive models A, B, and C, respectively) (Table 2, Figure 2). Predictive model B exhibited a better performance than the other two models in terms of sensitivity, specificity, discrimination, calibration, and clinical net benefit. In addition, the predictors included in this model are objective and universal. The optimal model, with seven features [alanine aminotransferase (ALT), IL-6, expectoration, fatigue, LYMR, AST, and serum creatinine (CREA)], was used to generate a nomogram (Figure 3) and was encapsulated as an open-source online predictive calculator with R/Shiny (https://severeconid-19predction.shinyapps.io/SHINY/).
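A nomogram of this kind is a graphical form of a logistic regression equation: each predictor contributes a weighted term to a linear score, which is then mapped to a probability. The sketch below shows that mapping for the seven predictors. The fitted coefficients are not reported in this text, so the function takes them as arguments; the function name, the dictionary keys, and the units and coding of the inputs are assumptions, and any values passed in are placeholders rather than the published model.

```python
import math

# Predictor names follow the seven features listed in the text.
PREDICTORS = ("ALT", "IL6", "expectoration", "fatigue", "LYMR", "AST", "CREA")


def predicted_risk(values, coefs, intercept):
    """Probability of progression from a logistic model.

    values : dict mapping each predictor to the patient's value
             (binary symptoms coded 0/1; the coding is an assumption)
    coefs  : dict mapping each predictor to its regression coefficient
             (must be supplied by the user; not taken from the paper)
    """
    missing = [name for name in PREDICTORS
               if name not in values or name not in coefs]
    if missing:
        raise KeyError(f"missing predictors: {missing}")
    # Linear predictor: intercept plus the weighted sum of the inputs.
    z = intercept + sum(coefs[name] * values[name] for name in PREDICTORS)
    # Numerically stable logistic transform.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

With all coefficients and the intercept set to zero, the function returns 0.5 for any patient, which is a convenient sanity check before real coefficients are plugged in.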

Validation of the Online Predictive Model
Finally, we predicted the disease progression of the 108 patients in the validation cohort using our model. The model predicted that 18 patients would progress to severe COVID-19 while the remaining 90 would not. Compared with the follow-up results (91 patients with non-severe COVID-19 and 17 patients with  (Figures 4, 5).

DISCUSSION
The accurate and individualized assessment of a patient who may progress to severe COVID-19 will promote the efficiency of clinical intervention and improve the rational use of medical resources. In the present study, we recruited 206 patients (162 patients with non-severe COVID-19 and 44 patients with severe COVID-19). We analyzed a range of indicators associated with severe COVID-19 and developed a novel predictive model that included ALT, IL-6, expectoration, fatigue, LYMR, AST, and CREA. This model proved to have excellent ability to predict the progression of COVID-19 during hospitalization, in both the derivation and validation cohorts.
Our final model was visualized in the form of a nomogram and was then packaged into an open-source and free predictive calculator (https://severeconid-19predction.shinyapps.io/SHINY/). The model represents a powerful tool with which to aid decision-making and guide treatment strategies for target patients who are at high risk of developing severe progression. The model could also be used to facilitate personalized management.
Previous research reported wide differences in the levels of a large number of cytokines between patients with non-severe and severe COVID-19 (26-28). Our present results identified marked elevations of various cytokines in patients with severe COVID-19, including IL-1α, IL-1β, IFN-γ, TNF-α, IL-2, IL-4, IL-6, IL-10, and IL-17A. Of these cytokines, IL-6 and IL-10 showed the highest fold-change, thus indicating the presence of a strong inflammatory reaction; this could be a sufficient response to trigger a cytokine storm. Univariate logistic analysis showed that a number of cytokines can be used as predictors for patients with severe illness, although their predictive efficacies can vary considerably; these cytokines could not be used individually. We also found that underlying diseases (diabetes and hypertension), initial clinical characteristics (cough, expectoration, dyspnea, asthma, and debilitation), and laboratory findings [LYMR, ALT, AST, CK, GlU, and procalcitonin (PCT)] were significantly associated with disease progression, although these were nonspecific. The extensive correlation between cytokines and the clinical response spectrum may be explained by multiple organ damage caused by the over-exuberant inflammatory response in severe COVID-19 (12,29).
Univariate logistic analysis indicated that a single evaluation index could not provide sufficient evidence for the prediction of progression and that modeling by data mining may be a more efficient and viable tool with which to compensate for the lack of a single source of information (30). We used the LASSO algorithm and logistic regression and compared different modeling approaches. Finally, we selected a predictive model that included ALT, IL-6, expectoration, fatigue, LYMR, AST, and CREA. Our model achieved satisfactory predictive performance with AUCs of 0.910 and 0.879 in the derivation and validation cohorts, respectively. We also packaged this model into an open-source online format for clinical use. Although several predictive models have been published previously, these studies were associated with obvious limitations, including the fact that they were retrospective reviews, were associated with suboptimal predictive abilities, or were not validated externally (31-33). Taking these limitations into account, our study is superior in several respects. First, we considered potential predictors for severe COVID-19 and included a comprehensive dataset retrospectively. Second, our shrinking model, featuring representative key variables, may exhibit better levels of performance than a complex model. This is supported by the fact that our predictive model was established by comparing several different methods; the optimal model had a significantly higher AUC than the other models, and this finding was reconfirmed in the validation cohort. Third, the predictive model was used to create a nomogram that was then used to generate an open-source online calculator that is visual and easy to operate.
There are also some limitations associated with our study that need to be considered. For example, we mainly focused on the changes of symptoms and the levels of key indicators in patients after SARS-CoV-2 infection and did not consider the influence of individual differences on the progression of disease. More in-depth investigations and longitudinal dynamic monitoring studies now need to be conducted to explain the specific characteristics of the potential predictors. Furthermore, the predictive model needs to be validated in a larger patient cohort and other populations outside of China.

CONCLUSION
In this study, we developed and validated an online predictive calculator that provides personalized probability for the progression of disease based on seven commonly used variables. The model will be vital for early personalized management, to promote the appropriate allocation of medical resources, and to ensure that patients who may develop severe COVID-19 can receive appropriate treatment as soon as possible.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by West China Hospital, Sichuan University Medical Ethics Committee. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
ZM and MW designed the research and wrote the manuscript. ZZ and YoZ were responsible for the recruitment of COVID-19 patients and clinical treatment. YW and SG were responsible for the detection of candidate biomarkers. ML, SY, and YaZ were responsible for collecting and organizing data. All authors contributed to the article and approved the submitted version.