Machine Learning Methods for Predicting Long-Term Mortality in Patients After Cardiac Surgery

Objective: This study aims to construct and validate several machine learning (ML) algorithms to predict long-term mortality and identify risk factors in unselected patients post-cardiac surgery. Methods The Medical Information Mart for Intensive Care (MIMIC-III) database was used to perform a retrospective administrative database study. Candidate predictors consisted of the demographics, comorbidity, vital signs, laboratory test results, scoring systems, and treatment information on the first day of ICU admission. Four-year mortality was set as the study outcome. We used the ML methods of logistic regression (LR), artificial neural network (NNET), naïve bayes (NB), gradient boosting machine (GBM), adapting boosting (Ada), random forest (RF), bagged trees (BT), and eXtreme Gradient Boosting (XGB). The prognostic capacity and clinical utility of these ML models were compared using the area under the receiver operating characteristic curves (AUC), calibration curves, and decision curve analysis (DCA). Results Of 7,368 patients in MIMIC-III included in the final cohort, a total of 1,337 (18.15%) patients died during a 4-year follow-up. Among 65 variables extracted from the database, a total of 25 predictors were selected using recursive feature elimination and included in the subsequent analysis. The Ada model performed best among eight models in both discriminatory ability with the highest AUC of 0.801 and goodness of fit (visualized by calibration curve). Moreover, the DCA shows that the net benefit of the RF, Ada, and BT models surpassed that of other ML models for almost all threshold probability values. Additionally, through the Ada technique, we determined that red blood cell distribution width (RDW), blood urea nitrogen (BUN), SAPS II, anion gap (AG), age, urine output, chloride, creatinine, congestive heart failure, and SOFA were the Top 10 predictors in the feature importance rankings. Conclusions The Ada model performs best in predicting 4-year mortality after cardiac surgery among the eight ML models, which might have significant application in the development of early warning systems for patients following operations.


INTRODUCTION
Every year, two million cardiac surgical procedures are being performed around the world (1). Risk prediction models of patients undergoing cardiac surgery might be helpful for clinicians for alerting, judgment, and intervention to improve postoperative survival (2). Some risk stratifications scores and models have been created to aid clinical decision making such as the original European System for Cardiac Operative Risk Evaluation (EuroSCORE) (3), EuroSCORE II (4), and the North American Society of Thoracic Surgeons (STS) (5)(6)(7). The majority of attention in such models has, however, been focused on those that predict short-term outcomes. There has been much less attention paid to the prediction of longterm outcomes, which are probably an equivalent indication of surgeon performance and surgical treatment appropriateness. Additionally, most of the prediction scores, using the traditional logistic regression method, were developed assuming that the predictors interact in a linear and additive way (8), despite the reality that the interactions are often non-linear and multifactorial (9). It might influence the predictive power of these scores. Several studies have reported that some of these scores overestimate the risk of mortality for patients with low risk in actuality while underestimating the risk for high-risk patients (10)(11)(12)(13)(14)(15).
Machine learning (ML), a branch of artificial intelligence, is a relatively new technique that arose from the development of complicated algorithms and the analysis of enormous datasets (16). ML has been applied in areas of medicine such as diagnosis, interpretation of medical imaging, treatment strategies, and outcome prediction (17). ML models can provide new insight into complicated interactions, non-linearities, unrecognized patterns and correlations, and the importance of trends in the explanatory variables (18). There are a growing number of studies that ML models could provide a more accurate risk prediction compared to conventional statistical methods. Moreover, several recent studies have applied ML to predict short-term mortality in patients after cardiac surgery (19)(20)(21). However, to the best of our knowledge, no predictive model for long-term mortality has been constructed targeting unselected patients post-cardiac surgery using ML techniques.
In the present study, we aimed to construct and validate eight ML models using easily accessible, early-stage, and wellgeneralized variables to predict long-term mortality and identify  (ICU) at the Beth Israel Deaconess Medical Center (26). After passing the "Protecting Human Research Participants" exam, we were granted access to the dataset (authorization codes: 33281932 and 41657645). Since the study was an analysis of a third-party anonymized publicly available database with preexisting institutional review board approval, the ethical approval statement and the requirement for informed consent were waived. In summary, this study conformed to the provisions of the Declaration of Helsinki (as revised in Edinburgh 2000). This study was reported according to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guideline (27).

Patient Selection
Of all patients in the MIMIC-III database, we included patients as follows: (1) age older than 18 years; (2) those who underwent cardiac surgery including coronary artery bypass grafting (CABG), valvular operation, revision procedures, and some indicators of cardiac surgery. Patients were excluded if they had: (1) multiple ICU admission; (2) a length of stay in the ICU < 24 h; and (3) incomplete follow-up information.

Data Extraction and Processing
Demographics, vital signs, laboratory tests, scoring systems, treatment information, and others were extracted from the MIMIC-III database using structured query language with PostgreSQL (version 9.4.6, www.postgresql.org). Only early-stage clinical and laboratory variables that can be obtained on the first day of ICU admission were incorporated in the prediction model. If patients received vital signs measurement or laboratory tests more than once on the first day of admission, only the initial test results were considered for subsequent analyses. For privacy considerations, the MIMIC-III database changes the date of birth to exactly 300 years before admission for those patients over the age of 89 at the time of admission. As a result, values of 300 for "age" were reverted to 89. The subject IDs were used to identify distinct adult patients. The predictors included: (1) demographics: age, gender, and ethnicity; (2) comorbidities: coronary artery disease, congestive heart failure, valvular disease, active endocarditis, cardiac arrhythmias, hypertension, pulmonary circulation disorders, chronic pulmonary disease, peripheral vascular disease, stroke, diabetes, dyslipidemia, anemia, renal failure, liver disease, coagulopathy, metastatic cancer, solid tumor (without metastasis), hypothyroidism, fluid and electrolyte disorders, obesity, weight loss, alcohol abuse, drug abuse, and smoker; (3) vital signs: systolic blood pressure (SBP), diastolic blood pressure (DBP), mean blood pressure (MBP), heart rate, respiratory rate, temperature, and urine output; (4) Laboratory findings: white blood cell (WBC), red blood cell (RBC), platelet, red blood cell distribution width (RDW), hematocrit, hemoglobin, sodium, potassium, calcium, magnesium, chloride, phosphate, prothrombin time (PT), international normalized ratio (INR), SpO 2 , pH, base excess (BE), anion gap (AG), bicarbonate, glucose, blood urea nitrogen (BUN), and creatinine; (5) prognostic scoring system: Sequential Organ Failure Assessment (SOFA), quick Sequential Organ Failure Assessment (qSOFA), and Simplified Acute Physiology Score II (SAPS II); (6) Treatment information: surgical type, mechanical ventilation, renal replacement therapy (RRT), and extracorporeal membrane oxygenation (ECMO). Finally, 4-year mortality was set as the study outcome.

Management of Missing Data
As extensive missing data might lead to bias, variables with over 20% missing values were excluded. Correspondingly, multivariable imputation was applied for variables with fewer than 20% missing values (28). Additionally, the extreme and error values were not omitted and treated as missing data for imputation. Variables for which multivariable imputation was adopted were listed in Supplementary Table 1.

Statistical Analysis
Values were presented as total numbers with percentages for categorical variables and the means with standard deviations (if normal) or medians with interquartile ranges (IQR) (if nonnormal) for continuous variables. Proportions were compared using χ² test or Fisher exact tests while continuous variables were compared using the Student t-test, or Wilcoxon rank-sum test, as appropriate.
In this study, the data were divided at random, with 70% utilized for training and 30% for testing. The most relevant variables were selected using recursive feature elimination (RFE) as a feature selection approach. In short, RFE recursively fits a model based on smaller feature sets until a specified termination criterion is reached. In each loop, in the trained model, features are ranked based on their importance. Finally, dependency and collinearity were eliminated. Features were then considered in groups of 5/15/25/35/45/55/ALL (ALL = 65 variables) organized by the ranks obtained after the feature selection method. To find the optimal hyperparameters, 5-fold cross-validation was used as a resampling method. In each iteration, every 9 folds are used as a training subset, and the remaining 1 fold was processed to tune the hyperparameters. This training-testing process was repeated thirty times. And in this way, each sample would be involved in the training model, and also participate in the testing model, so that all data were used to the greatest extent. In this study, we employed multiple diverse ML algorithms to develop models, containing artificial neural network (NNET), naïve bayes (NB), gradient boosting machine (GBM), adapting boosting (Ada), random forest (RF), bagged trees (BT), eXtreme Gradient Boosting (XGB), and logistic regression (LR). Initially, we conducted internal validation on the development sets to quantify optimism in the predictive performance and evaluate the stability of the prediction model. We use the Cross-validation technique with 30 repeats of 5-fold cross-validation to evaluate the internal validity of each model. All the models were assessed in multiple dimensions regarding their model performance. The median and 95% confidence intervals of the area under the receiver operating characteristic curves (AUC) were calculated, where an AUC value of 1.0 means perfect discrimination and 0.5 represents no discrimination. And the accuracy, sensitivity, specificity, negative predictive value, and positive predictive value were also calculated. Calibration plots were drawn to visualize the prediction abilities of the models. To determine the clinical usefulness of the included variables by quantifying the net benefit at different threshold probabilities, we conducted the decision curve analysis (DCA) (19). For the best-performing model, the significance of the model parameters was identified and reported. Finally, the "Shiny" package in the R was used to construct a visual data analysis platform.
All analyses were performed by the statistical software packages R version 4.0.2 (http://www.R-project.org, The R Foundation). In our study, we used the "Caret" R packages to achieve the process. P < 0.05 (two-sided test) were considered statistically significant.

Variable Importance
A total of 65 predictors were extracted from the database. Finally, 25 important predictors were selected by the RFE algorithm, including metastatic cancer, urine output, ECMO, RDW, AG, congestive heart failure, mechanical ventilation, sodium, SBP, bicarbonate, DBP, RBC, hemoglobin, age, BUN, chloride, SAPS II, creatinine, RRT, BE, renal failure, dyslipidemia, platelet, SOFA, and glucose (Figure 2). Then, these variables were used in all the subsequent analyses for all models in both training and testing sets. Each variable included in the study had varying importance over 4-year mortality relying on the ML approach (Figure 3). In the Ada model, we determined that RDW, BUN, SAPS II, AG, age, urine output, chloride, creatinine, congestive heart failure, and SOFA were the Top 10 predictors in the feature importance rankings.

Evaluation of Model Performance
The discriminatory abilities of all models for the prediction of mortality are in Figure 4, Table 2. Within the training set, the NNET, NB, LR, GBM, Ada, RF, BT, and XGB models were established, and the testing set obtained AUCs of 0.790, 0.786, 0.797, 0.748, 0.801, 0.789, 0.752, and 0.781, respectively.
Comparatively, the Ada model had the highest predictive performance among these eight models (AUC 0.801, 95% CI: 0.784-0.817). Calibration plots of the eight models are presented in Figure 5. The calibration curves of NNET and Ada performed better than the other models. The decision curve compared the net benefit of the best model and alternative approaches for clinical decision making. As is shown in Figure 6, the net benefit of the RF, Ada, and BT models surpassed that of other ML models for almost all threshold values, showing that these three models were more superior in predicting the risk of 4-year deaths in this cohort.

Development of Webservers for Convenient Clinical Use
We next used the Shiny to illustrate the impacts of key features on the death prediction model in individual patients. One visualized and publicly accessible online calculator based on the Ada model was built (https://pengchi2009.shinyapps.io/cardic/) (Figure 7). The webservers may generate an estimated survival probability by entering the covariates.

DISCUSSION
Long-term mortality risk prediction tools for cardiac surgery can play an important role in enhancing continuity of care and planning resource allocation appropriately. With the advancement of electronic medical records and artificial intelligence, ML algorithms have become more widely utilized in individualized medicine to assist clinical decision-making (29). In this study, several ML algorithms (NNET, NB, GBM, Ada, RF, BT, LR, and XGB) were developed and validated to predict 4-year mortality of patients undergoing cardiac surgery. Concerning the predictive performance, the Ada model exhibited the greatest AUC and outperformed the remaining ML models. Moreover, to help surgeons use the model, a visualized and publicly accessible online calculator was developed, which provided a user-friendly interface. This study was the first to establish a long-term prediction model after cardiac surgery using early-stage and easily obtained variables based on ML methods. It is worth noting that early and accurate prediction of long-term mortality in patients post-cardiac surgery could provide more time for clinicians to offer individualized treatment strategies.
Cardiac surgery, as a unique operation type, had a significant impact on circulation and physiology, as well as posing significant hurdles in terms of lowering mortality (30). In the field of cardiac surgery, there has been an increasing interest in risk prediction models for clinical use. Various risk stratification methods were cited in European guidelines for decision making, even though these scores cannot replace clinical judgment and multidisciplinary dialogue (31). Among the many scores that have been proposed, the original EuroSCORE, EuroSCORE II and STS scores are the most widely used to predict mortality after  cardiac surgery. However, several studies have reported that these scores have limitations in some surgeries or patient subgroups (11)(12)(13). Recently, a growing number of studies have focused on mid-term or long-term mortality after cardiac surgery (32)(33)(34)(35)(36). For example, Wu et al. (37) created a risk score predicting long-term mortality following isolated CABG surgery with the C-statistics ranging from 0.768 to 0.783 for mortality at 1, 3, 5, and 7 years of follow-up. Due to the need for more precise prediction models, the application of ML approaches has been increasingly studied. A recent meta-analysis using 15 studies showed that when compared with LR, ML models provide better discrimination in operative mortality prediction after cardiac surgery (38). In the present study, the Ada model had a better performance in both discriminatory ability with the higher AUC of 0.804 and goodness of fit (visualized by calibration curve) compared to the traditional LR methods. The potential advantage of ML models is their capacity to capture nonlinearity and the interactions among features without the need for the modeler to manually specify all interactions, as needed with LR. Moreover, compared with traditional statistical methods, ML algorithms can handle missing data more efficiently because they do not rely on data distribution assumptions and are capable of more complex calculations. Clinical models constructed by ML have been used to predict short-term mortality in cardiac surgery with the performance regarding AUC ranging from 0.77 to 0.92 (19,20,(39)(40)(41)(42)(43)(44)(45)(46)(47). Zhou et al. (39) and Ong et al. (40). Found that the RF models predict short-term  mortality better than other models in cardiac surgical procedures. Additionally, several studies showed that the XGBoost method performed better in predicting operative or in-hospital mortality than the other ML methods (19,20,(41)(42)(43). In our study, the study outcome was set as long-term mortality, and the Ada model performed better than the RF and XGBoost model. This also supports the so-called No Free-Lunch theorem in ML (48), which shows that there is no one model that works best for every problem or every dataset. Therefore, it is necessary to try and evaluate multiple ML models to determine which one performs best for a specific problem or study cohort. Actually, The Ada model is a technique that is gaining increasing application in clinical research (49)(50)(51). Our study is the first to apply the Ada model in the context of cardiac surgery.
Through sophisticated ML methods, we determined that RDW, BUN, SAPS II, AG, age, urine output, chloride, creatinine, congestive heart failure, and SOFA were the Top 10 predictors in the feature importance rankings. In general, the predictors for long-term mortality identified in the Ada model in this study are consistent with other studies. RDW is a simple measure of the broadness of erythrocyte size distribution, conventionally called anisocytosis (52). A growing body of evidence demonstrated that higher RDW is strong correlation with a higher mortality rate in widespread cardiovascular diseases such as cardiac surgery, heart failure, and acute coronary syndrome (53)(54)(55)(56). However, there is less research available about whether RDW affects long-term outcomes after cardiac surgery, for which our study is a novel contribution to the published literature. The SAPS II, based on a large international sample of patients, provides an estimate of the risk of death without having to specify a primary diagnosis (57). According to our findings, the SAPS II score seems to be more important than the SOFA score in the feature importance rankings of the Ada model. Similar to our findings, Schoe et al. (58) found that the SOFA score used as a mortality prediction model underperformed compared to the SAPS-II score in this large cohort of cardiac surgery patients. Urine output, BUN, and creatinine were all Top 10 important variables. Lassnigg et al. (59) reported that even a slight increase in serum creatinine is correlated with a considerable increase in 30-day mortality following cardiac surgery. Tseng et al. (60) developed and validated ML algorithms using 94 preoperative and intraoperative features to predict cardiac surgery-associated acute kidney injury, which is closely associated with increased morbidity and mortality. In their model, the importance matrix plot reveals that the most important variables contributing to the model were intraoperative urine output. Our results also underline the importance of detecting, evaluating, and improving preoperative renal function in patients requiring cardiac surgery, which might serve as a target for improving outcomes.
There are several strengths of our study. Firstly, this is the first study that established advanced ML death prediction models focusing on the long-term mortality of patients undergoing all types of cardiovascular surgery. Given the heterogeneity of patients on ICU admission, our findings can be used to identify patients at high-risk for death, and determine which patients would benefit most from cardiac surgery. Providers can then offer targeted individualized care such as more extensive evaluation, post-discharge home visits, closer surveillance by primary care physician, or earlier post-operative follow-up appointments for these patients, actions that might mitigate future adverse outcomes. Secondly, we used MIMIC-III, a highquality database with large sample size and extensive clinical data. Thirdly, we utilized advanced statistical methods, including eight ML models. To evaluate the performance of these models, the AUCs, calibration curves, and DCA were calculated and plotted, representing the discrimination, goodness of fit, and clinical application, respectively. Fourthly, the models were created based on the data readily available collected within the first 24 h after patients' admission. It is worth noting that early and accurate prediction of mortality can provide more time for clinicians to adjust corresponding treatment strategies. Finally, to help surgeons use the model at the bedside, a calculator was developed, which provided a user-friendly interface.
Our study had several limitations. Firstly, we used data from a single academic medical center in the USA, with the earliest cases from almost 20 years ago, when care may have been inconsistent with currently accepted standards. Therefore, a multicenter registry, prospective studies are needed to confirm these findings. Secondly, derived from the ICU adult participants, the results of our study cannot be generalized to other populations such as children and non-ICU patients. Thirdly, we did not obtain information including laboratory testing and interventions before ICU admission, which may cause confounders to some extent. Fourthly, restricted by the contents of the MIMIC-III database, some important information, including preoperative data (i.e., lactate, left ventricular ejection fraction, NYHA functional class, EuroSCORE score, and STS score), intraoperative data (i.e., intraoperative hypotension, vasopressorinotropes and cardiopulmonary bypass time), and postoperative data (i.e., complications, late extubation, and length of ICU stay) were recorded incompletely and not included in the analysis. Fifthly, although we included patients in the database with the primary diagnosis of receiving cardiac surgery, it cannot be ruled out that some patients were admitted to treating other diseases. Finally, although our study deeply explored 4-year mortality in the ICU settings, other outcomes, such as acute kidney injury incidence, are also needed for further investigation.

CONCLUSIONS
The Ada model performs better than the LR, NNET, NB, GBM, RF, BT, and XGB models in predicting long-term mortality after cardiac surgery. Our results suggest that RDW, BUN, SAPS II, AG, age, urine output, chloride, creatinine, congestive heart failure, and SOFA might be closely associated with 4-year mortality after cardiac surgery. We anticipate that this new risk model can become a handy risk stratification tool that can be used by clinicians and patients in the choice of treatment for cardiac disease. However, further external validations are warranted to test the generalization of our models.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
YY, ZW, and ZJ conceived the analysis. YY and CP extracted all data. YY, ZZ, and KS undertook and refined the inclusion process. YY, CP, and ZW co-wrote the paper. YY, CP, YZ, and JX undertook the statistical analyses. WX, PW, and ZW were consulted for clinical issues. All authors contributed to and revised the final manuscript.