Development and validation of an interpretable 3-day intensive care unit readmission prediction model using explainable boosting machines

Background: Intensive care unit (ICU) readmissions are associated with mortality and poor outcomes. To improve discharge decisions, machine learning (ML) could help to identify patients at risk of ICU readmission. However, as many models are black boxes, dangerous properties may remain unnoticed. Widely used post hoc explanation methods also have inherent limitations. Few studies evaluate inherently interpretable ML models for health care or involve clinicians in inspecting the trained model.

Methods: An inherently interpretable model for the prediction of 3-day ICU readmission was developed. We used explainable boosting machines, which learn modular risk functions and have already been shown to be suitable for the health care domain. We created a retrospective cohort of 15,589 ICU stays and 169 variables collected between 2006 and 2019 from the University Hospital Münster. A team of physicians inspected the model, checked the plausibility of each risk function, and removed problematic ones. We collected qualitative feedback during this process and analyzed the reasons for removing risk functions. The performance of the final explainable boosting machine was compared with a validated clinical score and three commonly used ML models. External validation was performed on the widely used Medical Information Mart for Intensive Care version IV database.

Results: The developed explainable boosting machine used 67 features and showed an area under the precision-recall curve of 0.119 ± 0.020 and an area under the receiver operating characteristic curve of 0.680 ± 0.025. It performed on par with state-of-the-art gradient boosting machines (0.123 ± 0.016, 0.665 ± 0.036) and outperformed the Simplified Acute Physiology Score II (0.084 ± 0.025, 0.607 ± 0.019), logistic regression (0.092 ± 0.026, 0.587 ± 0.016), and recurrent neural networks (0.095 ± 0.008, 0.594 ± 0.027). External validation confirmed that explainable boosting machines (0.221 ± 0.023, 0.760 ± 0.010) performed similarly to gradient boosting machines (0.232 ± 0.029, 0.772 ± 0.018). Evaluation of the model inspection showed that explainable boosting machines can be useful to detect and remove problematic risk functions.

Conclusions: We developed an inherently interpretable ML model for 3-day ICU readmission prediction that reached the state-of-the-art performance of black box models. Our results suggest that for the low- to medium-dimensional datasets that are common in health care, it is feasible to develop ML models that allow a high level of human control without sacrificing performance.

Example of hospital stays with corresponding ICU stays and labels. Readmission to an excluded ICU or IMC ward after discharge from an included ICU was still considered for labeling. (a) The first two ICU transfers are included as a negative and a positive instance, because readmission happened after three days and within three days, respectively. The third ICU transfer is excluded due to death at the ICU. (b) The first two ICU transfers are consecutive, so they are merged and considered a single stay. It receives a positive label because readmission to an excluded ICU or IMC ward occurs within three days. The last ICU transfer is also labeled positively because the patient dies within three days. (c) The first two ICU transfers are not included because, although merged, they are directly followed by a transfer to an excluded ICU or IMC ward. However, the subsequent stay at an included ICU is included with a negative label. The last ICU transfer is excluded because the follow-up period within the hospital is less than three days.

Table 1. Detailed information on the included ICUs. All ICUs are managed by the ANIT-UKM. The number of patients and transfers in rows six and seven corresponds to the input of the cohort selection. The last two rows represent the UKM cohort. Note that they do not sum up to the total number of patients and ICU stays because one patient or stay can be associated with more than one ICU. 1 These are estimated average values because the number of beds changed several times over the years.

Table 3. Overview of feature classes.

Description of data cleaning of included variables
We developed a data pipeline consisting of preprocessing, merging, filtering, and postprocessing.
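The four stages can be thought of as one composable routine applied per variable. The following is a minimal, hypothetical sketch; `clean_variable` and the example values are illustrative, not the actual pipeline code.

```python
# Hypothetical sketch of the four-stage cleaning pipeline; the function
# name and example values are illustrative, not the actual implementation.

def clean_variable(records, preprocess, merge=None, value_filter=None,
                   postprocess=None):
    """Run preprocessing, optional merging, filtering, and postprocessing."""
    values = [preprocess(r) for r in records]
    if merge is not None:
        values = merge(values)  # custom merging for differing data formats
    if value_filter is not None:
        # enforce an interval or a set of allowed values
        values = [v for v in values if value_filter(v)]
    if postprocess is not None:
        values = [postprocess(v) for v in values]
    return values

# Example: enforce a plausible interval for heart rate recordings (bpm)
cleaned = clean_variable(
    ["72", "300", "64"],
    preprocess=float,
    value_filter=lambda v: 20 <= v <= 250,
)
```

Structuring the pipeline this way lets preprocessing routines be reused between items while merging and filtering stay variable-specific, mirroring the description above.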
Duplicates with identical entries and numerical values were treated first. Preprocessing methods were applied to the raw recordings and could be reused between different items. They also included datatype-specific routines for continuous and categorical variables. Merging was optional and usually needed a custom merging procedure to account for different data formats. Filtering enforced an interval or a set of allowed values. Lastly, postprocessing was applied analogously to preprocessing. The most important data cleaning procedures are summarized below.
-Duplicates for non-medication and non-fluid variables. We used a similar approach as (…).
-Valid value sets for categorical variables. We removed values outside these sets and detected several malformed recordings, which were due to manual data entry. Whenever a valid value could be determined for those, we mapped them accordingly. Value sets were sometimes reduced to more general categories when this seemed more appropriate.
-Valid values for medications. The medication variables contained many artifacts, which made it impractical to use dosages as variables. Instead, we only used indicators of whether a certain medication was administered. To this end, we removed zero and negative entries. In addition, we included three medication categories with dosages; we determined valid maximum values for these during the medical review of the variables and removed all entries above them.
-Adjusting body temperature for measuring site. Based on the measuring site, body temperature measurements were adjusted toward core temperature. We applied an offset of 0.4°C for tympanic, 0.5°C for oral, and 0.6°C for axillary sites (2). Groin and axillary sites show similar behavior, so an offset of 0.6°C was also used for groin (3). Since no evidence could be found for the nasal site, we used the same offset as for oral.
-Adjusting blood pressure for measuring site. Non-invasive systolic, mean, and diastolic blood pressures were adjusted for measurement at the arm or thigh. We used offsets of 7, 4, and 3 mmHg for the arm and -5, 6, and 11 mmHg for the thigh (4).
-Missing values. We imputed values for only some variables, so that the missingness of the remaining variables could itself inform the model. However, when only a few values were missing or missingness indicated a normal value, we imputed them. For gender, a single missing value could be derived from another source. For the static variables patient class and responsible clinic, we used "inpatient" and "other", respectively. A missing Glasgow Coma Score or Richmond Agitation-Sedation Scale indicates a normal value in the clinic, so we imputed the normal value. For 29 time-series features with few missing values, we used the median value.
-Computation of estimated glomerular filtration rate (eGFR). The eGFR values in the PDMS were based on different formulas. We therefore recalculated the eGFR as a new variable from creatinine, gender, and age (5).
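The site adjustments and the eGFR recalculation above can be sketched as follows. The offsets come from the text, but treating them as additive corrections toward the reference value is an assumption of this sketch; likewise, since the text does not name the formula behind reference (5), the race-free 2021 CKD-EPI creatinine equation is used here purely as an illustration.

```python
# Offsets taken from the text; applying them additively toward the
# reference value is an assumption of this sketch.
TEMP_OFFSET_C = {"tympanic": 0.4, "oral": 0.5, "nasal": 0.5,
                 "axillary": 0.6, "groin": 0.6}

BP_OFFSET_MMHG = {"arm": (7, 4, 3),      # (systolic, mean, diastolic)
                  "thigh": (-5, 6, 11)}

def core_temperature(measured_c, site):
    """Estimate core temperature from a site-specific measurement."""
    return measured_c + TEMP_OFFSET_C[site]

def adjusted_blood_pressure(systolic, mean, diastolic, site):
    """Adjust non-invasive blood pressure readings for the measuring site."""
    ds, dm, dd = BP_OFFSET_MMHG[site]
    return systolic + ds, mean + dm, diastolic + dd

def egfr_ckdepi_2021(creatinine_mg_dl, age_years, female):
    """Race-free CKD-EPI 2021 creatinine equation (mL/min/1.73 m2).
    Used only as an illustration; the study's actual formula is cited
    as reference (5) and is not specified in this text."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    ratio = creatinine_mg_dl / kappa
    egfr = (142.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    return egfr * 1.012 if female else egfr
```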

Explainable Boosting Machine
Based on the implementation of interpret.glassbox.ExplainableBoostingClassifier in the interpret library, with slight modifications to enable unknown values and to expose the argument min_samples_bin.
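Conceptually, an EBM's prediction is additive over per-feature risk functions, so scoring reduces to one table lookup per feature plus an intercept. The following is a minimal sketch of that mechanism, not the interpret library internals; the bin edges and scores are invented for illustration and are not the trained model.

```python
import bisect
import math

# Conceptual EBM scoring: the logit is an intercept plus one lookup per
# feature into a piecewise-constant risk function. All numbers below are
# made up for illustration.
MODEL = {
    "intercept": -3.0,
    "risk_functions": {
        "heart_rate": {"edges": [60, 100, 130],
                       "scores": [0.2, -0.1, 0.3, 0.8]},
        "age": {"edges": [50, 70], "scores": [-0.2, 0.1, 0.4]},
    },
}

def predict_proba(features):
    """Return the predicted readmission probability for one patient."""
    logit = MODEL["intercept"]
    for name, value in features.items():
        rf = MODEL["risk_functions"][name]
        # find the bin containing the value and add its learned score
        logit += rf["scores"][bisect.bisect_right(rf["edges"], value)]
    return 1.0 / (1.0 + math.exp(-logit))
```

Because each risk function is a simple lookup table, clinicians can plot and inspect every function individually and, as in the model review described here, neutralize problematic functions without touching the rest of the model.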

Recurrent Neural Network with Long Short-Term Memory
Based on the implementation of tensorflow.keras.Model in the tensorflow library.

External Validation on MIMIC-IV
The Medical Information Mart for Intensive Care version IV (MIMIC-IV) database contains all data collected at the ICUs, making it a good candidate for external validation. We used code from the shared code repository to load the data and generate medical concepts.
We tried to mimic the cohort selection as closely as possible. MIMIC-IV already provided consecutive ICU stays as so-called concepts, so merging of transfers was not necessary. In particular, the manual procedure to classify gaps between ICU stays was irrelevant for MIMIC-IV. The flow chart in the supplement shows the cohort selection. First, 104 malformed ICU stays of 76 patients were excluded due to hospital discharges before admission, overlapping hospital stays, or a time of death before ICU admission. Second, stays that were not discharged from an ICU managed by the Department of Anesthesia, Critical Care, and Pain Medicine at BIDMC were excluded (n=43,154). The included ICUs were the Trauma Surgical Intensive Care Unit, Surgical Intensive Care Unit, Cardiovascular Intensive Care Unit, and Neuroscience Intensive Care Unit. Next, we excluded all stays in which the patient died during the ICU stay (n=1,888). This included two stays that were transferred to an ICU after death, which we also considered as death at the ICU. Analogously to the original cohort, we required a length of stay at the hospital after the ICU of at least 72 hours (n=12,187). The ratio of excluded cases was considerably higher than in the original cohort. Lastly, we removed cases that had no heart rate entries for at least two hours (n=99).
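The sequential exclusion steps above can be sketched as a single predicate per stay. The field names below are hypothetical placeholders, not actual MIMIC-IV columns, and the heart rate criterion is paraphrased as a maximum recording gap.

```python
# Illustrative sketch of the sequential exclusion criteria; field names
# are hypothetical, not actual MIMIC-IV columns.

INCLUDED_ICUS = {
    "Trauma Surgical Intensive Care Unit",
    "Surgical Intensive Care Unit",
    "Cardiovascular Intensive Care Unit",
    "Neuroscience Intensive Care Unit",
}

def include_stay(stay):
    """Return True if the ICU stay passes all exclusion steps."""
    if stay["malformed"]:  # e.g. hospital discharge before admission
        return False
    if stay["discharge_icu"] not in INCLUDED_ICUS:
        return False
    if stay["died_at_icu"]:
        return False
    if stay["hospital_hours_after_icu"] < 72:  # require 72 h follow-up
        return False
    if stay["max_heart_rate_gap_hours"] >= 2:  # no heart rate for >= 2 h
        return False
    return True
```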
We also labeled discharges from an included ICU to standard care as positive when the patient was readmitted to any ICU or IMC unit or died within three days. For the UKM cohort, we had designed a special procedure and performed manual annotation for transfers of at most twelve hours. However, this was not feasible for the MIMIC-IV cohort because we lacked clinical knowledge about the data. Hence, we used a simplified procedure that required stays at a standard care unit, or consecutive readmissions to an ICU or IMC unit, to last at least one hour in order to prevent artifacts. We also excluded the post-anesthesia care unit and unknown units from the standard care units because they could indicate a planned surgery leading to readmission to an ICU or IMC unit. BIDMC had more IMC units, so fewer ICU stays were directly discharged to a standard care unit. Hence, we expected a lower fraction of positive labels. Nevertheless, 1,626 ICU stays were labeled as readmission or death. Of those, 1,273 were readmitted to an ICU, 39 to an IMC unit, and 314 died within three days. We manually checked 20 random positive labels, stratified by ICU, to verify the labeling procedure.
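The simplified labeling rule can be sketched as follows. This is a hypothetical condensation of the rule described above: the function name, its fields, and the treatment of sub-one-hour stays as exclusions are assumptions of this sketch, not the study's exact implementation.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the simplified MIMIC-IV labeling rule; names and
# the artifact handling are illustrative assumptions.
EXCLUDED_DESTINATIONS = {"post-anesthesia care unit", "unknown"}

def label_icu_stay(discharge_time, destination, readmission_time=None,
                   death_time=None, min_ward_stay=timedelta(hours=1)):
    """Return True (positive), False (negative), or None (excluded)."""
    if destination in EXCLUDED_DESTINATIONS:
        return None  # may indicate a planned surgery; excluded from labeling
    window = discharge_time + timedelta(days=3)
    if death_time is not None and death_time <= window:
        return True  # death within three days
    if readmission_time is not None and readmission_time <= window:
        if readmission_time - discharge_time < min_ward_stay:
            return None  # ward stay under one hour treated as an artifact
        return True  # readmission to an ICU or IMC unit within three days
    return False
```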
We extracted 41 variables for the EBM model from MIMIC-IV. As for the original cohort, we used the data collected during the ICU stays. We searched through the item definitions and the provided code to identify relevant items. Analogously to the UKM cohort, we defined allowed value ranges and applied median value imputation. Only the variable procalcitonin was not contained in MIMIC-IV. As a result, 66 features were generated for the EBM and 515