Machine learning models to predict in-hospital mortality in septic patients with diabetes

Background Sepsis is a leading cause of morbidity and mortality in hospitalized patients. To date, no well-established longitudinal networks link molecular mechanisms to clinical phenotypes in sepsis. Compounding the problem, about one in five septic patients presents with diabetes; for this subgroup, management is difficult and prognosis is hard to evaluate. Methods From three databases, a total of 7,001 patients were enrolled on the basis of the Sepsis-3 criteria and a diabetes diagnosis. Input variables were selected by correlation analysis combined with manual screening, leaving 53 variables. A total of 5,727 records were collected from the Medical Information Mart for Intensive Care database and randomly split into a training set and an internal validation set at a ratio of 7:3. Logistic regression with lasso regularization, Bayes logistic regression, decision tree, random forest, and XGBoost models were then built on the training set and tested on the internal validation set. Data from the eICU Collaborative Research Database (n = 815) and the dtChina critical care database (n = 459) served as external validation sets. Results In the internal validation set, the accuracies of logistic regression with lasso regularization, Bayes logistic regression, decision tree, random forest, and XGBoost were 0.878, 0.883, 0.865, 0.883, and 0.882, respectively. In external validation set 1, the corresponding accuracies were 0.879, 0.877, 0.865, 0.886, and 0.875; in external validation set 2, they were 0.715, 0.745, 0.763, 0.760, and 0.699.
Conclusion The top three models for the internal validation set were Bayes logistic regression, random forest, and XGBoost; the top three for external validation set 1 were random forest, logistic regression, and Bayes logistic regression; and the top three for external validation set 2 were decision tree, random forest, and Bayes logistic regression. The random forest model performed well across the training set and all three validation sets. The most important features were age, albumin, and lactate.


Introduction
In 2016, the definition of sepsis was updated to Sepsis-3 (1), namely, a life-threatening organ dysfunction caused by a dysregulated host response to infection, a breadth of definition that implies a large amount of work remaining in related fields. On the other hand, our understanding of this disease is limited, given the generalized definition and the persistently high incidence and mortality rates (2). From 1979 to 2015, hospital mortality remained as high as 17% and 26% for sepsis and severe sepsis, respectively, even when data from lower-income countries were excluded (3). To date, no longitudinal networks have been established that link molecular mechanisms to heterogeneous clinical phenotypes. Together with the factors that influence infection, individual host heterogeneity blurs the whole disease network. To address this, establishing criteria to distinguish subgroups of sepsis has been proposed (4).
Attention should be paid to septic patients presenting with diabetes, who account for about 20% of cases (5, 6). The prevalence of diabetes is surging, owing to improved living standards and the spread of Western lifestyles: the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014 (7). A major cause of blindness, kidney failure, heart attacks, stroke, and lower-limb amputation, diabetes was the ninth leading cause of death (8), with an estimated 1.5 million deaths per year directly attributed to it (7). Diabetes has clearly been a non-negligible global health burden for a long time. Moreover, its complicated pathogenesis and multi-organ complications make prognosis difficult for clinicians to evaluate, especially when sepsis develops. Many cohort studies have been published exploring the interactions between blood glucose level and infection, or between diabetes and sepsis; however, their conclusions are not highly consistent.
Therefore, to build an efficient clinical tool, five machine learning models were established to predict in-hospital mortality in septic patients with diabetes. Their accuracies were compared, and the important features were discussed to derive hints for guiding clinical practice.

Data sources and cohort selection
Data were collected from two of the largest critical care databases in the USA, the Medical Information Mart for Intensive Care (MIMIC-IV) database (9, 10) and the eICU Collaborative Research Database (eICU-CRD, version 2.0) (11), and from a large critical care database in China ("dtChina") (12). This study was approved by the Institutional Review Board (IRB) of the Massachusetts Institute of Technology (MIT) online (Record ID: 38889441), and informed consent was waived. All data were deidentified for privacy protection and extracted with Structured Query Language using PostgreSQL 9.6, as described in a previous study (13). The study is reported in accordance with the REporting of studies Conducted using Observational Routinely collected health Data (RECORD) statement (14).
The eligibility criteria were as follows: (1) diagnosis of sepsis per the Sepsis-3 definition (1); (2) diabetes as a comorbidity; (3) age ≥ 18 years; and (4) length of intensive care unit (ICU) stay ≥ 1 day. For patients admitted to the ICU more than once, only the first ICU stay was considered, and patients with no ICU admission were excluded. Variables with over 30% missing values were excluded. The data were then randomly divided into a training set (66% of the records) and an internal validation set (the remaining 34%); finally, the two large datasets of American and Chinese patients' records served as external validation sets (Validation Set 1 and Validation Set 2, respectively). A summary of the study methods is shown in Figure 1.
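As a minimal sketch, the eligibility filtering and random split described above could look as follows. This is illustrative Python (pandas/scikit-learn), not the software the authors used, and the column names (`icu_stay_seq`, `icu_los_days`, `hospital_expire_flag`) are hypothetical stand-ins for the actual database schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy cohort table; rows violating the criteria are illustrative.
cohort = pd.DataFrame({
    "subject_id":           [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "icu_stay_seq":         [1, 2, 1, 1, 1, 1, 1, 1, 1],  # ordinal ICU stay
    "age":                  [67, 70, 15, 54, 80, 71, 60, 45, 55],
    "icu_los_days":         [3.2, 2.0, 2.0, 0.5, 5.1, 1.1, 2.2, 4.0, 1.8],
    "hospital_expire_flag": [0, 0, 0, 0, 1, 0, 0, 1, 0],
})

# Eligibility: first ICU stay only, age >= 18, ICU stay >= 1 day.
eligible = cohort[
    (cohort["icu_stay_seq"] == 1)
    & (cohort["age"] >= 18)
    & (cohort["icu_los_days"] >= 1)
]

# 66% / 34% split, stratified on the outcome so that the mortality
# rate stays comparable between the two sets.
train, internal_val = train_test_split(
    eligible,
    test_size=0.34,
    stratify=eligible["hospital_expire_flag"],
    random_state=42,
)
```

Stratifying on the outcome is one way to obtain the near-identical mortality rates the paper reports for the training and internal validation sets.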

Data collection and definition
Demographic data collected included age and gender. In addition, histories of myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular disease, chronic pulmonary disease, peptic ulcer disease, paraplegia, chronic kidney disease, malignant cancer, and metastatic solid tumor were collected as comorbidities. The first vital-sign record on ICU admission was collected, including heart rate, respiratory rate, systolic blood pressure, diastolic blood pressure, mean blood pressure, and oxygen saturation. The first laboratory record within 72 h of ICU admission was collected, including white blood cells, neutrophils, lymphocytes, monocytes, eosinophils, basophils, red blood cells, platelets, hematocrit, hemoglobin, mean corpuscular hemoglobin (MCH), MCH concentration (MCHC), mean corpuscular volume (MCV), red cell distribution width (RDW), alanine transaminase (ALT), aspartate transaminase (AST), alkaline phosphatase (ALP), albumin, total bilirubin, blood urea nitrogen (BUN), blood creatinine, prothrombin time (PT), international normalized ratio (INR), blood sodium, potassium, calcium, chloride, bicarbonate, anion gap (AG), blood glucose, and lactate. Interventions included continuous renal replacement therapy (CRRT), ventilation, and vasopressor use (dobutamine, dopamine, epinephrine, norepinephrine, and phenylephrine). Variables with more than 30% missing values were removed, and in-hospital mortality was used as the outcome for all patients.
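The 30%-missingness filter above can be sketched in a few lines. The snippet below is illustrative Python (pandas), not the authors' pipeline, and the column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy feature table: 'sparse_marker' has 40% missing values and is dropped.
df = pd.DataFrame({
    "lactate":       [1.2, 2.5, np.nan, 4.0, 1.8, 2.2, 3.1, 1.5, 2.0, 2.6],
    "sparse_marker": [np.nan] * 4 + [1.0] * 6,
})

missing_frac = df.isna().mean()            # fraction of missing values per column
kept = df.loc[:, missing_frac <= 0.30]     # keep columns with <= 30% missing
```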

Statistical analysis
Categorical data are presented as percentages, and continuous data as mean with standard deviation (SD) or median. The chi-squared test or Fisher's exact test was used to compare categorical data; the Student's t-test or one-way ANOVA was used for normally distributed continuous variables, and the Mann-Whitney U-test for skewed data. A Spearman correlation matrix was produced to calculate correlations between all continuous variables. Extreme values were treated as missing data, and all missing data were then interpolated using the multiple imputation (mi) method. Next, using the predictors above, we constructed five machine learning prediction models: (1) logistic regression with lasso regularization (lasso regression), (2) Bayes logistic regression, (3) decision tree, (4) random forest, and (5) eXtreme Gradient Boosting (XGBoost). Lasso regularization shrinks the regression coefficients toward zero and retains only variables that pass a stringent threshold, yielding a parsimonious yet predictive logistic regression model. The naïve Bayes method applies Bayes' theorem to estimate posterior probabilities. A decision tree is a non-parametric, tree-like model that yields easily interpretable classification rules; the CART algorithm was used. For random forest and XGBoost, different combinations (top N) of features were tested, with 20-fold cross-validation. When negative events greatly outnumber positive events, the precision-recall curve (PRC) measures the quality of a predictive model more effectively, and provides more information, than the area under the receiver operating characteristic (AUC-ROC) curve (15).
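For orientation, the five model families can be instantiated as follows. This is an illustrative Python/scikit-learn sketch (the study itself used R with tidymodels); `GaussianNB` stands in for the Bayes logistic regression and `GradientBoostingClassifier` for XGBoost, and the synthetic data merely mimics a ~13% positive rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected predictors, ~13% positive class.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.87], random_state=0)

models = {
    # L1 penalty shrinks coefficients toward zero; coefficients that hit
    # exactly zero drop the corresponding variable from the model.
    "lasso_lr": LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    "naive_bayes": GaussianNB(),               # stand-in for Bayes logistic regression
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
}
for name, model in models.items():
    model.fit(X, y)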
In this study, the AUC-PRC, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) derived from the confusion matrix were used to evaluate each model. Statistical analyses were performed using StataSE 15.1 (Stata Corporation, College Station, Texas) and R version 4.0.2 (Vienna, Austria) with the tidymodels packages. A P-value below 0.05 was considered statistically significant.
(Figure 1 caption: Flow chart depicting the number of patients included in the analysis after applying the exclusion criteria. A total of 7,001 records were enrolled.)
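The confusion-matrix metrics listed above follow directly from the four cell counts. A minimal sketch (the counts below are invented for illustration, roughly matching the study's ~13% mortality rate):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),   # precision
        "npv":         tn / (tn + fn),
    }

# Illustrative counts for an imbalanced outcome (100 deaths in 1,000 patients).
m = confusion_metrics(tp=70, fp=60, tn=840, fn=30)
```

Note that with such imbalance the accuracy (0.91 here) is dominated by the majority class, while the PPV (about 0.54) is far lower, which is exactly why the PR-AUC is reported alongside accuracy.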

Summary of the training and validation sets
A total of 5,727 patients from the MIMIC-IV database, 815 from the eICU-CRD database, and 459 from the dtChina database were finally enrolled in our study (Figure 1). The baseline characteristics, clinical tests, interventions, and in-hospital mortality of the study cohorts are shown in Table 1. There was no significant difference between the training set and the internal validation set. The mortality rates of the training set, internal validation set, external validation set 1, and external validation set 2 were 12.70%, 12.69%, 12.27%, and 15.03%, respectively, with no significant differences in in-hospital mortality between the datasets (Table S1, see Supplementary Materials).

Model construction and validation
The correlations of the continuous variables showed good data consistency between the databases (Figure S1, see Supplementary Materials). Logistic regression with lasso regularization, Bayes logistic regression, decision tree, random forest, and XGBoost models were then built using the training set, whereas the internal and external validation sets were used to estimate each model's generalization capability.

Bayes logistic regression
Bayes logistic regression was conducted as described in a previous study (16). The randomness and distribution of all variables are shown in Figures 4A.

Decision tree
The decision tree model was established using all potential risk factors. After pruning, the tree had a depth of five and five leaf nodes, involving four influencing factors: lymphocytes, oxygen saturation, bicarbonate level, and lactate level (Figures 5A, B).
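Growing a full CART tree and then pruning it back can be sketched as follows. This is an illustrative Python/scikit-learn example on synthetic data (the study used R); `ccp_alpha` controls scikit-learn's cost-complexity pruning, which is one standard pruning scheme for CART, and the value 0.01 is an assumption for the demo, not the study's setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the study's predictors.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Fully grown CART tree (overfits), then a cost-complexity-pruned version.
full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.01).fit(X_tr, y_tr)
```

Pruning trades a little training fit for a far smaller, more interpretable tree, which is how a five-leaf rule set like the one above is obtained.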

Random forest
A total of 53 indicators were tested with 20-fold cross-validation; the variable importance and error counts are shown in Figures 6A, B. The five most predictive variables were lactate level, age, oxygen saturation at admission, systolic blood pressure at admission, and albumin level.
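Ranking variables by random forest importance and scoring the model by cross-validation can be sketched as below. This is illustrative Python/scikit-learn on synthetic data (the study used R and 20-fold cross-validation; 5-fold is used here only for brevity).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 15 candidate predictors, 5 of them informative.
X, y = make_classification(n_samples=400, n_features=15, n_informative=5,
                           weights=[0.85], random_state=2)

rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

# Impurity-based importances, ranked high to low (stand-in for Figure 6A).
ranking = np.argsort(rf.feature_importances_)[::-1]
top5 = ranking[:5]

# Cross-validated performance estimate.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=2),
    X, y, cv=5, scoring="accuracy",
)
```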

XGBoost
The XGBoost model was built with 20-fold cross-validation, and its top five most predictive variables were age, albumin level, lactate, systolic blood pressure, and ventilation (Figure 7A). For the internal validation set, the random forest model obtained a PR-AUC = 0.459, accuracy = 0.882, sensitivity

Models' comparison
The results of model performance in predicting in-hospital mortality for septic patients with diabetes are shown in Figure 8. The top three models for the internal validation set were Bayes logistic regression, random forest, and XGBoost, whereas the top three models for external validation set 1 were random forest, logistic regression, and Bayes logistic regression. In addition, the top three models for external validation set 2 were decision tree, random forest, and Bayes logistic regression.
(Figure caption: The change of the coefficients of the different variables with the penalty parameter λ (A); penalty parameter plot (B).)

Discussion
Both ROC and PR curves provide diagnostic tools for evaluating the performance of binary classification models. ROC curves visualize the trade-off between the true positive rate (TPR) and the false positive rate (FPR), whereas PR curves focus on precision (PPV) and recall (TPR) (17). Because of this, even when the proportion of positive and negative samples in the test set changes, the ROC curve can remain essentially unchanged (18). This stability under class imbalance reflects the ROC's ability to measure the predictive power of the model itself, independent of the proportions of positive and negative samples. In contrast, PR curves retain sensitivity to changes in that proportion (15, 19-21). Hence, when the positive and negative sample ratios are imbalanced, the PRC responds more directly to the goodness of the classifier than the ROC. This is why we use the PR-AUC to present our results.
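This property can be demonstrated numerically: if the same score distributions are kept but negatives are made nine times more frequent, the ROC-AUC barely moves while the average precision (a PR-AUC estimate) drops sharply. A minimal simulation (illustrative Python, not the study's data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def sample_scores(n_pos, n_neg):
    """Classifier scores: positives score higher on average; the score
    distributions themselves do not depend on the class ratio."""
    pos = rng.normal(1.0, 1.0, n_pos)
    neg = rng.normal(0.0, 1.0, n_neg)
    y = np.r_[np.ones(n_pos), np.zeros(n_neg)]
    s = np.r_[pos, neg]
    return y, s

y_bal, s_bal = sample_scores(1000, 1000)   # balanced classes
y_imb, s_imb = sample_scores(1000, 9000)   # 1:9 imbalance

roc_bal = roc_auc_score(y_bal, s_bal)
roc_imb = roc_auc_score(y_imb, s_imb)
pr_bal = average_precision_score(y_bal, s_bal)
pr_imb = average_precision_score(y_imb, s_imb)
```

The two ROC-AUC values are nearly identical, while the PR-AUC falls with the positive prevalence, mirroring the argument above.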
We analyzed a dataset composed of clinical data from 7,001 patients in the East and the West. The gender ratio in this study is 3:4 (female:male). According to both feature-importance score plots, as well as the remaining predictors, age is a statistically crucial predictor of death. For elderly people, even those without underlying diseases, sepsis is a critical health issue and a major cause of ICU admission (22). Furthermore, older patients suffer longer ICU stays and higher case fatality rates (23); more prolonged proinflammatory responses may account for much of this. Aging, on the other hand, is one of the non-modifiable factors that contribute to the increased incidence of diabetes. It has been suggested that aging and lower vascular telomere length in patients with type 2 diabetes (24) converge on endothelial dysfunction, an indicator of sepsis severity. Lactate, produced from glucose metabolism in a reaction catalyzed by lactate dehydrogenase, is in our research one of the most important variables for predicting in-hospital mortality in septic patients with diabetes. In clinical practice, serum lactate is commonly included in the management of patients with sepsis because pyruvate dehydrogenase is impaired in sepsis. According to the Sepsis-3 guidelines, septic shock should be clinically defined by a serum lactate above 2 mmol/L persisting despite adequate fluid resuscitation (1). High lactate concentrations are suggested to predict mortality, whereas lower lactate levels are related to improved clinical outcomes (25). Mechanistically, some studies indicate that extracellular lactate may have important regulatory effects on a variety of immune cells (26), and there is a view that aerobic glycolytic metabolism is important for initiating immune cells (27).
Compared with non-diabetic patients, patients with diabetes have higher plasma lactate levels, even in the prediabetes stage and under hyperinsulinemia. Intriguingly, there is some evidence that lactate can be used to predict the occurrence of diabetes (28, 29). However, no study has revealed a direct association between sepsis and diabetes via lactate; for lactate in this metabolic disorder, a question remains: cause, consequence, or both?
In this study, albumin level was among the most important predictors. Classically, albumin levels reflect a patient's nutritional status and organ function. One study revealed that patients with low albumin had higher mortality and longer hospital stays than patients with normal albumin, whereas patients with high albumin had lower mortality and shorter hospital stays (30). Intriguingly, a prospective observational study concluded that hypoalbuminemia during sepsis was caused by enhanced clearance from the circulation rather than by liver dysfunction (31). Moreover, ischemia-modified albumin could be an effective diagnostic marker of neonatal sepsis (32).
Holistically, we noticed that performance decreased in the Chinese validation dataset. Owing to the limited size of the accessible matched dataset, our attempt to train the predictor on Chinese data and validate it on Western data failed. Nevertheless, racial differences should be considered: some published studies reveal an increased incidence and severity of sepsis in black individuals compared with white individuals (33-35). Furthermore, one study showed that APOL1 risk variants, which are specifically present in individuals of African ancestry, contribute to exacerbated sepsis (36). From this perspective, concordant responses to sepsis across different races should not be expected.

Conclusion
To predict the outcomes of septic patients with diabetes, five machine learning models were established and validated. The random forest model performed well across the training set and all three validation sets. Among all variables, age, lactate, and albumin could be of high diagnostic value. Our results provide an approach to applying algorithms to the prediction of complicated disease conditions.
(Figure caption: Model performance in the internal validation set, external validation set 1, and external validation set 2 (A-C).)

Ethics statement
This study was approved by the Institutional Review Board (IRB) of the Massachusetts Institute of Technology (MIT) online (Record ID: 38889441), and informed consent was waived.

Author contributions
CS contributed to the conception of the study. JQ and JL contributed significantly to the analysis and manuscript preparation. HL helped perform the analysis with constructive discussions. KZ, ZD, NL, and DH helped with the enrollment of records. All authors contributed to the article and approved the submitted version.