Prediction of Sepsis in COVID-19 Using Laboratory Indicators

Background The outbreak of coronavirus disease 2019 (COVID-19) has become a global public health concern. Many inpatients with COVID-19 have shown clinical symptoms related to sepsis, which can aggravate the deterioration of patients’ condition. We aim to diagnose Viral Sepsis Caused by SARS-CoV-2 by analyzing laboratory test data of patients with COVID-19 and to establish an early predictive model for sepsis risk among patients with COVID-19. Methods This study retrospectively investigated laboratory test data of 2,453 patients with COVID-19 from electronic health records. Extreme gradient boosting (XGBoost) was employed to build four models with different feature subsets of a total of 69 collected indicators. Meanwhile, the explainable SHapley Additive exPlanations (SHAP) method was adopted to interpret predictive results and to analyze the feature importance of risk factors. Findings The model for classifying COVID-19 viral sepsis with seven coagulation function indicators achieved an area under the receiver operating characteristic curve (AUC) of 0.9213 (95% CI, 89.94–94.31%), a sensitivity of 97.17% (95% CI, 94.97–98.46%), and a specificity of 82.05% (95% CI, 77.24–86.06%). The model for identifying COVID-19 coagulation disorders with eight features provided early warning an average of 3.68 ± 4.60 days in advance, with an AUC of 0.9298 (95% CI, 86.91–99.04%), a sensitivity of 82.22% (95% CI, 67.41–91.49%), and a specificity of 84.00% (95% CI, 63.08–94.75%). Interpretation We found that an abnormality of the coagulation function was related to the occurrence of sepsis and that the other routine laboratory tests, represented by inflammatory factors, had a moderate predictive value for coagulopathy. This indicates that early warning of sepsis in COVID-19 patients could be achieved by our established model to improve the patient’s prognosis and to reduce mortality.


INTRODUCTION
The outbreak of coronavirus disease 2019 (COVID-19) in Wuhan, China, has developed into a global pandemic and major public health concern (Tu et al., 2020; Zhou et al., 2020). As of November 23, 2020, around 58 million patients had been diagnosed with severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) infection, and 1.4 million (2.37%) patients had died, according to the latest statistical data from Johns Hopkins University. Compared with severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS), SARS-CoV-2 infection is less lethal. However, because of its high infectivity, the virus has caused more severe and fatal cases (Tu et al., 2020; Vlachodimitropoulou Koumoutsea et al., 2020). Currently, recovery from COVID-19 depends essentially on the patient's immune system, and no specific drugs are available (Cao et al., 2020; The Lancet, 2020). So far, a variety of vaccines have been announced, each reporting good efficacy, but most results have been released only through press releases, and scientific uncertainty remains (The Lancet, 2020; Nat Nanotechnol, 2020). Therefore, it is crucial to monitor COVID-19 patients closely and to issue an early warning to prevent deterioration.
In COVID-19, in addition to lung injury, some patients showed impaired liver and kidney function and microcirculatory dysfunction that fulfilled the criteria for sepsis and septic shock under the Sepsis-3 International Consensus (Guan et al., 2020; Zhang et al., 2020). Sepsis is defined as life-threatening organ dysfunction caused by a dysregulated host response to infection (such as bacterial, viral, and/or fungal infections) (Singer et al., 2016). The mortality rate of sepsis is high, and it remains one of the main causes of death in the world. The identification and treatment of sepsis are a matter of great concern in the medical field and an urgent problem (Gaieski et al., 2013; Grondman et al., 2020). A broad range of pathogens can cause sepsis, including bacterial, fungal, or viral pathogens. Although bacterial infections are the main cause of sepsis, clinical research on and diagnosis of Viral Sepsis remain very rare (Lin et al., 2018; Musher, 2019; Grondman et al., 2020). Viral Sepsis secondary to viral pneumonia has been reported (Musher, 2019). For patients with COVID-19, secondary Viral Sepsis may be one of the critical causes of death. The view that the condition of COVID-19 patients is complicated by sepsis, causing aggravation and even death, has been widely recognized (Connors and Levy, 2020). The main reason for this phenomenon is that severe COVID-19 is accompanied by hypercytokinemia (Giamarellos-Bourboulis et al., 2020). Tumor necrosis factor-α (TNF-α) and interleukin-6 (IL-6) production by circulating monocytes was persistent, a complex pattern different from influenza or bacterial sepsis (Audo et al., 2020). Furthermore, elevated interleukin-10 (IL-10) has been reported to be a unique feature of the COVID-19 cytokine storm, and its concentrations strongly correlated with those of IL-6 and other inflammatory markers such as C-reactive protein (Lu et al., 2020).
The cytokine storm can damage the epithelium of the lungs and lead to extrapulmonary manifestations (cardiovascular, renal, hepatic, gastrointestinal, ocular, dermatologic, and neurological) (Falasca et al., 2020; Johnson et al., 2020; Maxwell et al., 2020). It also induces acute respiratory distress syndrome (ARDS) and secondary sepsis, which often lead to multiorgan failure and death (Lu et al., 2020; Opoka-Winiarska et al., 2020).
With the emerging demand for auxiliary diagnosis and computational tools, several machine learning approaches have been proposed for sepsis prediction in common medical settings. For example, Fohner et al. used latent Dirichlet allocation as an unsupervised learning model to assess clinical heterogeneity in sepsis, and Taylor et al. applied a random forest model to predict in-hospital mortality in emergency department patients with sepsis (Taylor et al., 2016; Fohner et al., 2019). Extreme Gradient Boosting (XGBoost), which iteratively fits weak classifiers to the residuals of previous models (Yao et al., 2020), has become one of the most popular machine learning models and often outperforms other models. It has been widely used in different medical applications (Ogunleye and Wang, 2020), and sepsis is no exception (Zabihi et al., 2019; Yao et al., 2020). Furthermore, explainable machine learning is a promising direction in medical applications, as it can offer more credible and traceable outcomes for clinicians (Tonekaboni et al., 2019). As a model-agnostic explanation method, SHAP (Lundberg and Lee, 2017) has gradually drawn the attention of researchers.
To our knowledge, there is no analytical tool that predicts which COVID-19 patients are most likely to develop sepsis in the near future. Our research aims to use interpretable machine learning to identify risk factors for Viral Sepsis Caused by SARS-CoV-2 (VSCS-2) and to rationalize these indicators using knowledge about viral sepsis. On this basis, laboratory indicators for the early warning of VSCS-2 were developed, which may also inform the study of other respiratory viruses. Predictive models were established to predict coagulopathy from the laboratory indicators and to issue warnings for the early diagnosis and treatment of VSCS-2, allowing a better prognosis.

Materials
This study was carried out at Tongji Hospital (the largest hospital in central China) and was approved by the ethical committee of Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, China. A total of 2,453 patients (1,257 males, 1,196 females) were recruited between December 2019 and March 2020. These patients were diagnosed by nucleic acid testing or clinical diagnosis. The age of the 2,453 patients with COVID-19 was 55.7 ± 15.3 years.

Identification of Risk Factors for VSCS-2
We used extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016), which has been actively promoted in the medical community, and the SHapley Additive exPlanations (SHAP) method (Lundberg and Lee, 2017), a tool for analyzing the impact of each feature on the prediction, to find the most relevant risk factors for VSCS-2.
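To make the idea behind SHAP concrete, the following minimal sketch computes exact Shapley values for a toy three-feature scoring function by enumerating every feature coalition. The scoring function, the patient values, and the baseline values are hypothetical and are not taken from the study; SHAP implementations for tree models generally use faster tree-specific algorithms rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to a baseline input.

    Features absent from a coalition are replaced by their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical toy risk score (NOT the study's model): linear terms plus one interaction.
def toy_score(f):
    return 0.3 * f["PT"] + 0.1 * f["APTT"] + 0.05 * f["PT"] * f["DD"]


patient = {"PT": 15.0, "APTT": 40.0, "DD": 2.0}    # illustrative values only
reference = {"PT": 12.0, "APTT": 35.0, "DD": 0.3}  # illustrative baseline only

phi = shapley_values(toy_score, patient, reference)
```

The contributions in `phi` sum to the difference between the score for the patient and the score for the baseline, which is the additivity property that allows a per-feature reading of a single prediction.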
We built the first classification model to gain a general understanding of VSCS-2 (Figure 1). According to sepsis-1 criteria (Table 1), the recruited COVID-19 patients were classified into two groups: the VSCS-2 group (1,376 patients, 56.1%) and the pure COVID-19 group (1,077 patients, 43.9%). Because of the complex pathogenesis of sepsis, this study incorporated selected laboratory test items alongside clinical signs and symptoms: inflammatory factors, coagulation factors, and blood routine indicators that may favor the occurrence of VSCS-2. Additionally, biochemical blood indicators that reflect the function of the pancreas, liver, and kidney, glucose metabolism, and myocardial injury were also included. The 69 indicators in total are classified into four types (Supplementary Table 1). All indicators were extracted from the patients' electronic health records. The 2,453 samples were divided at a ratio of 7:3, with 1,717 cases in the training set and 736 cases in the testing set.
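The 7:3 division can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle of sample identifiers; the text does not state whether the split was stratified or which random seed was used, so `split_7_3` and its `seed` parameter are hypothetical.

```python
import random


def split_7_3(sample_ids, seed=0):
    """Shuffle sample identifiers and split them 70/30 into training and testing sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # assumption: plain random split, fixed seed
    n_train = len(ids) * 7 // 10       # floor of 70% of the samples
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_7_3(range(2453))
```

With 2,453 samples, flooring 70% gives 1,717 training cases and leaves 736 testing cases, matching the counts reported above.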
Based on the knowledge of viral sepsis and the results of the first classification model, we noted a strong correlation between coagulation disorders and VSCS-2. To verify our hypothesis, seven coagulation function indicators were used to build the second classification model to explore the link between coagulation function and VSCS-2. The indicators are prothrombin time (PT), prothrombin activity (PTA), activated partial thromboplastin time (APTT), thrombin time (TT), international normalized ratio (INR), D-dimer (DD), and fibrinogen (FIB). The population was again divided into a training set and a testing set at a 7:3 ratio.

Predictive Models for Coagulation Disorder
To further analyze the factors related to coagulation disorders, we used the coagulation function factors to re-evaluate the 2,453 patients with COVID-19 and to identify whether they had coagulation disorders. If any one of the following criteria was met (PT > 14.5 seconds; DD > 0.5 µg/mL; APTT > 42 seconds; PTA < 75%), the patient was considered to have a coagulation disorder. Otherwise, the patient was considered to have normal blood coagulation. Finally, 988 patients with COVID-19 were labeled as having abnormal coagulation function, while 510 patients had normal coagulation function, giving a total of 1,498 patients. From the 69 indicators (see Supplementary Table 1), 22 blood routine factors, eight inflammatory factors, and 15 selected blood biochemistry indicators, a total of 45 features, were used to build the first predictive model to classify and predict COVID-19 coagulopathy. This model aimed to find laboratory indicators that have an important impact on COVID-19 coagulopathy.
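The labeling rule above can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, all four measurements are assumed to be available, and the handling of missing values is not described in the text.

```python
def has_coagulation_disorder(pt_s, dd_ug_ml, aptt_s, pta_pct):
    """Return True if any one of the four criteria from the text is met."""
    return (
        pt_s > 14.5          # prothrombin time, seconds
        or dd_ug_ml > 0.5    # D-dimer, ug/mL
        or aptt_s > 42.0     # activated partial thromboplastin time, seconds
        or pta_pct < 75.0    # prothrombin activity, percent
    )


# Illustrative values only, not patient data.
normal = has_coagulation_disorder(pt_s=13.0, dd_ug_ml=0.3, aptt_s=36.0, pta_pct=90.0)
prolonged_pt = has_coagulation_disorder(pt_s=15.0, dd_ug_ml=0.3, aptt_s=36.0, pta_pct=90.0)
low_pta = has_coagulation_disorder(pt_s=13.0, dd_ug_ml=0.3, aptt_s=36.0, pta_pct=70.0)
```

Because the criteria are combined with a logical OR, a single abnormal indicator is enough to label a patient, which is consistent with the larger size of the abnormal group (988 versus 510).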
To identify the predictive ability of the laboratory test factors for coagulation dysfunction, we randomly extracted 70 samples from the 1,498 patients whose detection time of inflammatory and blood routine factors was before the detection time of coagulation. This time interval between the inflammatory and
FIGURE 1 | The overview of the four models. The aim of the first classification model for the early identification of VSCS-2 is to identify the risk factors. The second classification model for VSCS-2 is to further clarify the relationship between seven coagulation indicators and VSCS-2. The first predictive model is to classify and predict COVID-19 coagulopathy, which hints at VSCS-2 through coagulation disorder. The second predictive model implemented the prediction of coagulation disorder with as few inflammatory and blood routine indicators as possible.
To determine the critical risk factors of coagulation dysfunction with clinical significance, we selected the most important features based on the first prediction model and the analysis results of the SHAP method. To verify the effect of these features and to provide a clinical reference, we developed the second predictive model for COVID-19 coagulopathy.

RESULTS
The detailed demographics and laboratory characteristics distribution are shown below in Table 2. The classification and predictive performance of the four models are shown in Figure 2 and Table 3.

Correlation Between Coagulopathy and VSCS-2
The classification performance of the first classification model (Table 3) shows that the area under the ROC curve (AUC) was 0.9349 (95% CI, 91.42-95.56%), and the sensitivity and specificity were 96.93% (95% CI, 94.68-98.29%) and 83.01% (95% CI, 78.28-86.92%), respectively. The critical risk factors of the model were then explained by the SHAP method. The results (Figure 3) suggest a strong correlation between the coagulation function indicators and VSCS-2: PT made the largest contribution to the model, APTT the second largest, and DD the sixth. The AUC of the second classification model, 0.9213 (95% CI, 89.94-94.31%) (Table 3), also pointed to the correlation between coagulation function indicators and VSCS-2. As shown in Figure 2, the classification performance for VSCS-2 using seven coagulation indicators is very close to that of the model using all 69 indicators. The results also show that a variety of biochemistry and blood routine indicators correlate strongly with VSCS-2, such as estimated glomerular filtration rate (eGFR), hematocrit (HCT), creatinine (Cr), and total bilirubin (TBil).
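For readers who want to reproduce this kind of summary, the sketch below shows how sensitivity, specificity, and AUC can be computed from labels and predicted scores, together with a confidence interval for a proportion. The Wilson score interval is an assumption chosen for illustration; the text does not state which interval method was used. The labels, scores, and the 0.5 decision threshold are toy values, not study data.

```python
from math import sqrt


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, ~95% for z = 1.96)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


# Toy data: 1 = VSCS-2, 0 = pure COVID-19.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
toy_auc = auc(y_true, scores)
sens_low, sens_high = wilson_interval(successes=2, total=3)
```

Sensitivity and specificity depend on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why both kinds of figures are reported side by side.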
The single-factor analysis using the SHAP method (Supplementary Figure 1) shows that impaired coagulation function indicated VSCS-2: PT above roughly 12 s, APTT above about 35 s, and DD above about 0.5 mg/L. These values are broadly consistent with the clinical detection of coagulation dysfunction. Figure 3 also shows a strong correlation between inflammatory indicators and VSCS-2, for example IL-10, TNF-α, IL-6, hypersensitive C-reactive protein (hs-CRP), and interleukin-2 receptor (IL-2R). The single-factor analysis (Supplementary Figure 1) further suggests that PT is closely related to IL-6, TNF-α, and interleukin-8 (IL-8); that APTT is closely related to glucose (Glu), corrected calcium (cCa), and estimated glomerular filtration rate (eGFR); and that DD is closely related to basophil percentage (BASOP), eosinophil percentage (EOP), and eosinophil absolute value (EOA).
In the single-factor analysis results (Supplementary Figure 2), the trend boundary of hs-CRP is clear. A value of 40 ± 5 mg/L or more increases the probability of coagulation disorder, and below this threshold the risk of abnormal coagulation is relatively low. If IL-6 is above 10-15 pg/mL, or IL-2R is above 600 U/mL, the probability of coagulopathy is higher. If IL-6 is below 10-15 pg/mL, or IL-2R is below 500 U/mL, the probability is lower. Moreover, LYMPHA below 1.5 × 10^9/L, GLB above 28 g/L, or TP below 70 g/L can each positively indicate the risk of VSCS-2.
In the second prediction model, we used an ablation experiment to successively add the crucial features obtained from Figure 4 and the medical analysis. XGBoost was used to analyze the performance of the models (Table 4). The SHAP method was applied to analyze the influence of the eight features on the prediction results (Supplementary Figure 3) and the single-factor influence of the features (Supplementary Figure 4).
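One way to read the feature-addition experiment is as a loop that grows the feature set one feature at a time and records the model score after each step. The sketch below assumes features are added in a fixed importance order; the `evaluate` callback, which would train and score a model on a feature subset, is hypothetical, and the toy gains are invented purely to make the example runnable. The exact procedure used in the study is not described in this section.

```python
def add_features_successively(ranked_features, evaluate):
    """Grow the feature set one feature at a time and record the score after each addition.

    ranked_features: feature names ordered from most to least important.
    evaluate: hypothetical callback that trains and scores a model on a feature subset.
    """
    selected, history = [], []
    for name in ranked_features:
        selected.append(name)
        history.append((tuple(selected), evaluate(selected)))
    return history


# Stand-in for model training: invented per-feature gains, NOT study results.
toy_gain = {"hs-CRP": 0.20, "IL-6": 0.10, "IL-2R": 0.05}


def toy_evaluate(features):
    return 0.5 + sum(toy_gain[f] for f in features)


history = add_features_successively(["hs-CRP", "IL-6", "IL-2R"], toy_evaluate)
```

Inspecting `history` shows where additional features stop improving the score, which is the kind of evidence that supports keeping the final model small.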

DISCUSSION
Sepsis is a concerning public health problem, as the host's response to the source of infection results in significant morbidity and mortality (Kempker et al., 2018; Alhazzani et al., 2020). Currently, sepsis with subsequent multiorgan dysfunction is one of the main causes of death in COVID-19 patients, and several previous studies have examined the activation of the coagulation system in advanced or severe patients with SARS-CoV-2 infection (Tang et al., 2020). Reliable definitions and increased attention are of utmost necessity in the medical domain, as proper and early treatment of illness demands an accurate preceding diagnosis (Fan et al., 2016). Therefore, novel technologies and detection methods that allow for the rapid and accurate identification of sepsis, or even coagulopathy, in patients with SARS-CoV-2 infection are urgently needed for the control and management of the disease in clinical practice (Channappanavar et al., 2016).
The present study enrolled approximately 2,500 patients to examine the feasibility and efficiency of measuring markers in routine laboratory tests for the diagnosis and prediction of sepsis in patients with SARS-CoV-2 infection. It was found that abnormal coagulation function strongly suggested the occurrence of sepsis and that the other parameters, represented by inflammatory factors including IL-2R, IL-6, and hs-CRP, had a moderate predictive value for coagulopathy. The established model, which combines the above markers such as inflammatory indexes, has the potential for early warning of sepsis in COVID-19 patients. It could help and guide clinicians to take available measures to improve the patient's prognosis. To the best of our knowledge, this study is the first clinical evaluation targeted at the early diagnosis of sepsis in patients with COVID-19 using machine learning. Existing evidence does not explain why some patients overreact with hyperinflammation and a "cytokine storm" while others show only slight signs, although the same microorganism may be found in their airways. It is therefore meaningful to consider the reasons for the various patterns of change before the occurrence of sepsis in SARS-CoV-2 infection.
It has been well established that cytokines play an essential part in the immune system during viral infections. A fast and well-organized innate immune response is the vanguard of defense against viral infections, but an imbalanced and excessive immune response can damage the organism (Law et al., 2005; Tynell et al., 2016; Ye et al., 2020). In vitro experiments have shown that after SARS infection, mainly airway epithelial cells, dendritic cells, monocytes, and macrophages participate in the release of chemokines and cytokines (Scheuplein et al., 2015; Tynell et al., 2016). After MERS infection, plasmacytoid dendritic cells are mainly involved in the release of chemokines and cytokines (Giannakopoulos et al., 2017). According to our data, after SARS-CoV-2 infection, the generation of inflammatory storms mainly involves T lymphocytes. Sepsis might be associated with endogenous activation of coagulation and fibrinolysis during COVID-19. Several studies demonstrated that dysregulation of procoagulant and fibrinolytic pathways may uniquely contribute to the pathophysiology of sepsis. However, this issue should be investigated further to obtain more details (Bouck et al., 2020; Colantuoni et al., 2020; Jose et al., 2020; Kang et al., 2020).
Based on our findings, we think that the earlier activated inflammation may be the forerunner of later coagulopathy, which further degenerates into septic shock and finally causes multiorgan dysfunction. This assumption is consistent with the previous theory that reducing inflammation, as one of the conventional approaches to sepsis pathophysiology and resilience, is considered an intuitive way in which organisms respond to microorganisms (Gotts and Matthay, 2016; Rosen et al., 2019). The intrinsic mechanism might be that the cell subsets mainly composed of lymphocytes are activated and secrete cytokines and chemokines during the body's early immune response, then gradually become exhausted and tend toward apoptosis (Mira et al., 2017; Patel et al., 2019). Multiple organs of the patient are damaged due to excessive inflammation and disorder of the endogenous coagulation activation pathway (Gomez and Kellum, 2016; Shen et al., 2019). All of the above finally lead to the collapse of the body's homeostasis.
The SHAP plot indicates not only the influence of each feature but also how that influence varies with the feature value. Each row in the figure represents a feature, and the abscissa is the SHAP value. Each point represents a sample. Colors closer to red indicate larger values, while colors closer to blue indicate smaller values. For example, IL-10 is an essential feature and is negatively correlated with VSCS-2; that is, the smaller the value, the higher the chance of VSCS-2 being determined.
There are also some limitations in this study. First, the number of patients included is reasonably large, but they all come from a single center. Further validation in more centers, with more patient cases and complete laboratory test data, is needed in the future. Second, even though the current study adds to the understanding of the progress of sepsis syndromes, this is a retrospective study and lacks verification in vivo. Much work remains to be done to understand these changes fully. Finally, the lack of some test items may cause bias and subsequently misleading results, so the conclusions of this study need to be confirmed in a prospective design.
In summary, our study provides a preliminary understanding of sepsis in patients with SARS-CoV-2 infection. We found that inflammation and coagulopathy might play a prominent part as precursors in its progression. We envisage that our findings could serve as an instrumental tool for diagnosing and predicting sepsis during COVID-19 treatment.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the ethical committee of Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, China. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
ZS, HJ, and AZ were responsible for leading this study. GT, YL, FL, and WL were responsible for discussing and designing the research plan. ZS, GT, and YL were responsible for providing the clinical dataset and interpretations of the laboratory analysis results. XioL, YN, YR, XiaL, and SW analyzed and designed the machine learning models. FL, WL, and XioL were responsible for preprocessing the dataset, developing and implementing algorithm details, and conducting the analysis on the dataset. GT, YL, FL, WL, and XioL were responsible for writing the manuscript and guaranteeing the data, analysis, and interpretation. All authors contributed to the article and approved the submitted version.
Here, we use sepsis-1 criteria (Table 1) to identify VSCS-2. Negative values indicate that VSCS-2 occurs before. The blue dots represent the number of patients confirmed on that day. As shown in the figure, the number of people predicted one day in advance was the highest, 43, while the remaining cases did not exceed five days. Here, the detection time of laboratory indicators (including inflammatory factors) can only be obtained retrospectively at the time point with complete test data. During the COVID-19 outbreak, the tests were usually case dependent, and some data are missing.