Early Prediction of Multiple Organ Dysfunction in the Pediatric Intensive Care Unit

Objective: The objective of the study is to build models for early prediction of risk for developing multiple organ dysfunction (MOD) in pediatric intensive care unit (PICU) patients. Design: The design of the study is a retrospective observational cohort study. Setting: The setting of the study is at a single academic PICU at the Johns Hopkins Hospital, Baltimore, MD. Patients: The patients included in the study were <18 years of age admitted to the PICU between July 2014 and October 2015. Measurements and main results: Organ dysfunction labels were generated every minute from preceding 24-h time windows using the International Pediatric Sepsis Consensus Conference (IPSCC) and Proulx et al. MOD criteria. Early MOD prediction models were built using four machine learning methods: random forest, XGBoost, GLMBoost, and Lasso-GLM. An optimal threshold learned from training data was used to detect high-risk alert events (HRAs). The early prediction models from all methods achieved an area under the receiver operating characteristics curve ≥0.91 for both IPSCC and Proulx criteria. The best performance in terms of maximum F1-score was achieved with random forest (sensitivity: 0.72, positive predictive value: 0.70, F1-score: 0.71) and XGBoost (sensitivity: 0.8, positive predictive value: 0.81, F1-score: 0.81) for IPSCC and Proulx criteria, respectively. The median early warning time was 22.7 h for random forest and 37 h for XGBoost models for IPSCC and Proulx criteria, respectively. Applying spectral clustering on risk-score trajectories over 24 h following early warning provided a high-risk group with ≥0.93 positive predictive value. Conclusions: Early predictions from risk-based patient monitoring could provide more than 22 h of lead time for MOD onset, with ≥0.93 positive predictive value for a high-risk group identified pre-MOD.


INTRODUCTION
Pediatric multiple organ dysfunction syndrome (MODS) occurs in more than 25% of children admitted to the pediatric intensive care unit (PICU) (1)(2)(3) and is one of the leading pathways to mortality in critically ill children (4). MOD has been defined as the presence of two or more concurrent organ dysfunctions (4)(5)(6)(7). Multiple underlying conditions could lead to MOD, although sepsis stands out as the most commonly associated condition (3,(8)(9)(10) followed by trauma, burns, pancreatitis, inborn errors in metabolism, transplantation, and others (4). The all-cause incidence of MOD ranges between 6% in 283 admissions from a single PICU and as high as 57% in 3,671 admissions across nine PICUs (1,2,4,9,(11)(12)(13)(14). A higher number of dysfunctional organ systems is associated with increased risk of mortality regardless of the study population (8,15,16) with more than 75% difference in mortality between patients with the lowest and highest numbers of organ failures (9,10). In a prospective study of 831 patients in a single PICU, more than 90% of deaths were associated with MODS (5,6). Of children admitted to the PICU, 14-30% encounters have MOD on PICU day 1 (1,14,17) consisting of the largest proportion (78%) of all PICU MOD cases (18). New and progressive MOD (NPMOD) is defined as concomitant dysfunction of two or more organ systems occurring after PICU admission with no or single organ dysfunction, or additional dysfunctional organs following admission with MOD (19). NPMOD has been proposed as primary or secondary outcome instead of mortality in several pediatric critical care studies as PICU mortality has declined significantly over the last decades (19)(20)(21)(22). There continues to be insufficient knowledge of the physiologic trajectory of patients who develop new, progressive, or persistent multiple organ dysfunction. We hypothesized that a complex set of physiologic patterns likely preceded adverse events such as simultaneous failure of more than one organ system.
Due to the inherent diversity in the cause and sequence of physiologic events that lead to MOD, the need to understand the risk factors for, and progression of, the syndrome calls for investigating ways to predict MOD, with the ultimate goal of optimizing monitoring, staffing, and early intervention regimens based on the progression of the underlying condition. The primary objective of this study was to develop data-driven early prediction models for identifying patients with increased risk of transitioning from no-MOD to MOD state. The secondary objective of this study was to explore if the time-evolving risk of developing MOD among positive predictions presented any clustering patterns and, if so, whether these patterns could be used as a risk-stratification method.

Study Design and Patients
We conducted a retrospective cohort study including all children <18 years admitted to the Johns Hopkins PICU between July 2014 and October 2015. This dataset included time series physiologic monitoring data from bedside monitors as well as electronic medical record data including ICD-9/10 diagnostic and procedure codes, vital signs, drug infusions, respiratory support, laboratory results, fluid input-output, and transfusion data. We excluded PICU encounters with missing hospital admission and discharge times, missing demographics, or vital signs, those with <2 h of PICU data preceding time of death and those with PICU day 1 MOD without recurrent transitions from no-MOD to MOD state. We further excluded no-MOD to MOD transitions with <15 min of data available prior to MOD transition. This study was approved by the Johns Hopkins Institutional Review Board (IRB00210572).

Multiple Organ Dysfunction Definitions and Multiple Organ Dysfunction Label Evaluation
MOD is defined by the simultaneous presence of two or more organ dysfunctions (5-7, 18, 23). For this study, we used two widely accepted sets of organ dysfunction criteria published by Proulx et al. (4,18) and International Pediatric Sepsis Consensus Conference (IPSCC) (7). We assigned and updated organ dysfunction labels every minute using the preceding 24-h windows, starting at 24 h from the time of PICU admission until PICU discharge. Organ dysfunction was evaluated starting 24 h from the time of PICU admission because no data was captured prior to PICU admission and reporting delay in laboratory test results might cause the first few hours to be labeled as not having organ dysfunction due to non-availability of measured data. Therefore, PICU encounters where MOD occurred within 24 h after admission and never recurred during the PICU course and those with <15 min of data prior to transition from a no-MOD to MOD state any time during the PICU course were excluded as data available prior to MOD transition was deemed insufficient.
We generated continuous binary MOD labels based on the presence of two or more simultaneous organ dysfunction as defined by the Goldstein (7) and Proulx criteria (4,18). In order to populate infrequently measured physiologic data for continuous MOD labeling, we used carry-forward interpolation for a maximum period of 24 h (allowing up to 30 h only if a new test was ordered within 24-30 h of the previous) for laboratory tests and 6 h for vasoactive drug administration and continuous renal replacement therapy. We did carry forward interpolation for vital signs like non-invasive blood pressure, temperature, etc., which were not measured every minute. We defined MOD transition as the event when a patient transitioned from no organ or single-organ dysfunction to a state of at least two concomitant organ dysfunctions. We defined the interval of time over which the number of organ dysfunction remained the same as the dwell time in that state.

Features
A total of 228 features extracted from minute-to-minute bedside monitor data and electronic medical record data were used to train predictive models. A missingness indicator variable was created for each infrequently measured value that is not routinely collected from all patients (see Section

Model Building and Evaluation
The primary objective of this study was to build prognostic models, which could learn to distinguish physiologic patterns in the data collected prior to the development of MOD from those collected from non-MOD patients and provide a quantitative sense of likelihood of an impending no-MOD to MOD transition. We used different machine learning methods to build models that continuously output probability of developing MOD on a scale of 0-1, which we refer to as risk score in this study. The risk scores were also filtered by taking the median over the preceding 5 min windows to avoid sudden abrupt changes in risk scores from any outliers in minute-to-minute captured features. Using the time-varying risk scores evaluated on data collected prior to MOD transition as well as non-MOD patient data, we learned an optimal threshold value using the receiver operating characteristic (ROC) curve for each model. Since several criteria to choose thresholds from ROC curves have been described, we compared our model performances based on two different methods: point nearest from (0, 1) on ROC curve and point corresponding to maximum F1-score. We defined a high-risk alert event (HRA) as an instance when the risk score exceeded the optimal threshold and then remained above the threshold for at least 15 min. We defined the entire interval over which risk score stayed above the threshold as the high-risk interval (HRI), and the lead time between HRA and MOD transition as the early warning time (EWT).
We divided the complete set of eligible targets and controls into training and test sets by randomly splitting them in a 4:1 ratio in a stratified manner to preserve the relative ratio of MOD to non-MOD PICU admissions so that all MOD transitions from the same patient during a PICU stay are in the same data partition. We built predictive models for MOD transition events using training data consisting of up to 24 h of data collected prior to each MOD transition (target) and an equal number of randomly selected samples from no-MOD admissions (control). To test the model performance, we used the entire length of available data from both target and control periods to evaluate time-varying probability of developing MOD. The training set was subjected to 10-fold cross validation for parameter tuning and training performance assessment. We compared our model performance in terms of sensitivity, specificity, AUROC (area under ROC curve), accuracy, F1-score, PPV (positive predictive value), and NPV (negative predictive value) (24) (see Supplementary Section 1.2 on performance metrics). We also presented precision-recall curves in addition to ROC curves to account for the large class imbalance in the dataset. We compared the distribution of EWT, EWT normalized by length of available time and HRI across all positive predictions and reported if any difference in the distribution of these entities was observed across true positives and false positives using Wilcoxon rank sum test.
We compared the performance of models built using the following four methods: extreme gradient boosting (XGBoost) (25,26), random forest-based probability machines (random forest) (27,28), gradient boosted generalized linear models (GLMBoost) (29)(30)(31), and L1-regularized generalized linear models (LassoGLM) (24). All models were built in R (version 3.5.3) using the open source CRAN packages: xgboost (26), ranger (27), mboost (32), and glmnet (24), respectively, for the above methods. The choice of the above four methods was driven by the amount of available data, range of model complexity, and their capability of pruning features to yield a parsimonious model. The entire analysis, including model building, threshold selection, performance evaluation, and feature importance computation, was carried out independently for each of the above four applied methods.
Apart from being able to detect HRA events and evaluate EWTs in MOD transition cases, we also performed spectral clustering (33) on the risk score trajectories post-HRA to explore if the data resolved into distinct patterns and whether the accuracy of our predictions differ across different clusters. The spectral clustering method used in our analysis determined the number of clusters using the largest eigengap metric in the graph spectrum of similarity matrix built based on the K-nearest neighbor algorithm between each pair of positive predicted risk score trajectories in a 24-h window post-HRA. We also reported the positive predictive values across different risk groups by dividing the positive predicted cases into quartiles of risk scores at HRA, similar to analysis performed across deciles of risk scores by Liu et al. (34). Wilcoxon rank sum tests were used to compare the distribution of continuous variables such as EWT, HRI, etc., among different groups.
The choice of machine learning methods, training-test splitting, model building and threshold determination, performance metrics, and use of clustering and quartiles of risk-score at HRA to explore the presence of any underlying patterns in the risk score trajectories post-HRA were planned a priori. The observations made about differences in the distribution of EWT, HRIs (both unnormalized and normalized by length of available data) were based on inspection of the results obtained with our planned analysis.

Feature Importance Evaluation
Global feature importance was computed on XGBoost and random forest model using the average fractional contribution of each feature in the total information gain or reduction in Gini index, respectively (using built-in functions in the XGBoost and ranger CRAN packages). For GLMBoost and LassoGLM, their risk scores are computed in the same manner as ordinary generalized linear models, and hence, the trained set of coefficients for each feature reflected its relative importance within a model. Since all of these four methods use different metrics to define variable importance, we reported the quantitative importance scores as relative importance on a scale of 0-100 such that the feature with the highest importance has a relative importance equal to 100. These feature importance metrics may not be accounted for perfectly if a variable is used more often in a model in lieu of a different feature that is highly correlated with it. For GLM (generalized linear model)-based models, relative importance was calculated as the absolute value of model coefficient for a feature divided by the highest absolute value of model coefficient across all features.

Study Population Characteristics
The final cohort included 2,023 patients with 2,565 PICU encounters (Figure 1 and Table 1). MOD was present in 893 (36.2%) and 692 (28.1%) of PICU encounters, using the IPSCC and Proulx criteria, respectively. Most MOD was present on PICU day 1, rendering a final data set of 293 and 687 MOD transitions using the IPSCC and Proulx criteria, respectively, that had >15 min of data prior to the time of transition. Rates of individual and multiple organ dysfunction are shown in Table 1. In bivariate analysis, age, weight, cardiac surgery during the admission, a diagnosis of status asthmaticus, pneumonia, sepsis, cardiac arrest, or renal insufficiency, chronic conditions (35,36), and interventions including transfusions and ECMO, were significantly associated with MOD (p < 0.05) during the PICU encounter (Supplementary Table 2).   ) for XGBoost, random forest, GLMBoost, and LassoGLM methods, respectively. All cross-validation set performance metrics obtained using tuned model parameters are shown in Supplementary Figures 2, 3. The final model trained on the entire training set using tuned parameters that maximized cross-validation set performance and yielded AUC 0.92, 0.93, 0.91, and 0.92 (IPSCC criteria) and 0.92, 0.93, 0.91, and 0.92 (Proulx criteria) with XGBoost, random forest, GLMBoost, and LassoGLM methods, respectively, on the test set. All results discussed below refer to metrics evaluated on the test set unless specified otherwise. The ROC and precision-recall curves for all methods are shown in Figures 4, 5, respectively. The performance metrics for all methods are presented in Table 2. All four methods exhibited high specificity, with moderate sensitivity related to the low prevalence of MOD transition events in the data set. All models were able to learn a high degree of separation between the positive and negative classes reflected by the high AUC values (>0.91). Performance metrics were comparable across the four methods when maximum F1-score was used for optimal threshold detection. The highest positive predictive value (PPV) was obtained using the LassoGLM method (PPV, 0.72) for IPSCC criteria and XGBoost, GLMBoost, and LassoGLM methods (PPV, 0.81 for each method) for Proulx criteria, respectively, with maximum F1-score used for optimal threshold determination. Maximum F1-score criteria for threshold selection yielded consistently higher PPV than the point nearest (0, 1) on the ROC curve. The high negative predictive values (>0.92 for both IPSCC and Proulx criteria) for all methods indicate a low false alarm rate. Random forest (0.71) and XGBoost (0.81) were the best models in terms of maximum F1-score obtained using IPSCC and Proulx criteria, respectively. Precision-recall plot showed that the sensitivity could be boosted to ≥0.9 at the cost of reduced PPV (≤0.5) and vice versa (Figure 5).

Model Performance
Supplementary Figure 4 shows the histograms of EWT across all four methods for the two sets of MOD criteria.  Figure 5) show that the majority of positive predicted cases were detected with HRA at the first instance of available data. The observed HRA event relative to the length of available data was significantly different between true positives and false positives [median (95% confidence interval) of pairwise differences was −0.29 (−0.53, −0.10), p < 0.001 for random forest with IPSCC criteria, and −0.12 (−0.19, −0.06), p < 0.001 for XGBoost with Proulx criteria using Wilcoxon rank sum test], and usually, HRA occurred relatively sooner in true positives for all methods as well as for both criteria. The HRIs for more than 64% of true positives were the same as their EWTs, implying that the risk scores stayed above the threshold after HRA for those cases. The distribution of HRI normalized by EWT was significantly different between true positives and false positives [median (95% confidence interval) of pairwise differences was 0 (−0.52, 0), p < 0.001 for random forest with IPSCC criteria, and −0.02 (−0.46, 0) for XGBoost with Proulx criteria using Wilcoxon rank sum test] across all methods and for both MOD criteria (Supplementary Table 3).

Risk Group Stratification
Within the group of positive predictions, the risk scores of the true positives were consistently much higher than the threshold following the HRA event, while those of the false positives tended to be only marginally above threshold, rendering these cases less distinct from the true negatives than the true positives. We used time-evolving risk score trajectories from 24-h time windows following HRA from all positive predicted cases in our test set to perform spectral clustering. The positive predicted risk score trajectories separated into two clusters (labeled as high risk and low risk) or three clusters (labeled as high risk, moderate risk, and low risk), depending on the method and on the set of MOD criteria used (Figure 6). All methods for both IPSCC and Proulx criteria yielded PPV ≥0.93 in the high-risk cluster, emphasizing higher confidence in predictions within that group ( Table 3). Time to observed HRA relative to the length of available data was significantly smaller [median (95% confidence interval) of pairwise differences was 0 (−0.02, 0), p < 0.001 for random forest with IPSCC criteria, −0.16 (−0.34, −0.03) for XGBoost with Proulx criteria using Wilcoxon rank sum test] in the high-and moderate-risk groups than the low-risk group for all methods  except LassoGLM with IPSCC criteria (Supplementary Table 4). These risk groups appeared to follow similar patterns prior to HRA and showed a significant difference in their risk score trajectories immediately following HRA. This observation raised the intuitive question as to whether the instantaneous value of risk score at HRA was directly related to the accuracy of our predictions. We divided all positive predicted labels into four groups based on quartiles of risk scores at HRA and observed a monotonic increase in PPV with increasing average risk score at HRA across these quartiles for all models using IPSCC criteria and for GLMBoost and LassoGLM models using Proulx criteria (Supplementary Table 5). The PPV in the highest quartile for all methods was ≥0.93.

Feature Importance
Our models were trained using a large number of features, and each machine learning method employed a different feature selection scheme. We found that a small subset of features had a relatively high contribution to these models. Supplementary Tables 6, 7 show a list of the 20 most important features with their relative importance metric scaled to range 0-100 in each model and for each set of MOD criteria.

Predictive Power of Prognostic Models
The objective of this study was to explore suitable machine learning methods to build prognostic models for early identification of increased risk of transition from a no OD or single-organ dysfunction state to MOD in the general PICU population. The machine learning methods used in this study were selected based on the amount of available training data and the ability to tailor feature selection. All four methods achieved comparable and appreciable results for both diagnostic criteria for MOD. Maximum F1-score was preferred over the point nearest (0, 1) on the ROC curve for optimal threshold determination because it maximizes sensitivity (recall) and PPV (precision) instead of specificity in order to account for the class imbalance between target and control cases. AUC for all methods being higher than 0.9 for both IPSCC and Proulx criteria implied that our trained models were able to distinguish with high degree of accuracy data prior to MOD transition vs. data from no-MOD admission cases. The considerably large median EWT indicates that our proposed models could alert physicians almost a day in advance for most MOD cases in PICU allowing sufficient time for early therapeutic intervention. The relatively low temporal variation in risk score trajectories is primarily due to low underlying physiologic data measurement frequencies, especially laboratory tests, which are normally performed in intervals of several hours and sometimes only once or twice a day. The sustained highrisk scores post-HRA serves as evidence that the probability of developing MOD can be used as a continuous indicator of  severity over time. Models built for Proulx criteria demonstrates that a larger number of target samples can significantly boost the sensitivity without impacting the specificity and, hence, also yields slightly higher PPV.

Risk Group-Based Prediction
Risk-group stratification among positive predicted labels was pivotal in assigning confidence to the model predictions. We chose spectral clustering over more conventional clustering algorithms such as k-means to avoid assuming homogeneity in the density of different clusters. The risk stratification method was highly effective in identifying the high-risk group (≥93% prediction accuracy) and, therefore, could be used as an additional alert for severe cases with extremely high likelihood of developing MOD. Moreover, the risk score trajectories of high-risk groups showed very high (0.7-1.0), non-decreasing, and steady evolution of risk scores over time and exhibited tighter clustering around mean tendency than the moderate-and low-risk groups.
Due to the observed large separation in the risk score trajectories at HRA across different risk groups, we were able to also separate the patients into groups based on quartiles of their risk scores at HRA. The PPV in these four groups were monotonically increasing with increasing average risk in each group, and the highest quartile yielded 100% PPV for all methods (except Lasso-GLM with IPSCC criteria) and both sets of diagnostic criteria. Therefore, in addition to HRA predictions, assigning confidence based on the risk group assignment would reduce the burden of false alarms and allow patient severity monitoring.

Organ Dysfunction Labels
The observed rate of MOD during the PICU stay obtained using IPSCC and Proulx criteria (36.2 vs. 28.1%) was similar to the results presented by Villeneuve et al. (1) (37.3 vs. 21.4%, respectively). The discordance in MOD rates is primarily attributed to different rates of cardiovascular and neurologic dysfunction. Notably, the continuous binary MOD labels generated using IPSCC criteria had higher temporal stability than the Proulx criteria.

Feature Selection and Importance
Conventional regression-based methods suffer from multiple correlated variables and require manual feature pruning or regularized regression. We took advantage of the built-in feature selection process of the methods used for training. Decision treebased methods like random forest and XGBoost are ensembles of multiple decision trees that use a random subset of relevant features and a specified maximum depth to prune features. GLMBoost uses gradient boosting with component-wise linear models, which prunes features by restricting the number of boosting iterations and selecting only one feature in each iteration. LassoGLM is an L1-regularized version of a generalized linear model, which suppresses the smaller weight features in the model and retains only the most significant ones. Some procedures (e.g., arterial catheter placement) or laboratory tests are performed by clinicians only for sicker patients, which makes the presence vs. absence of a variable an important predictor as well. We added a missingness indicator for all laboratory test results and for placement of an arterial catheter. This concept was previously endorsed in work by Sharafoddini et al. (37). Patient vital signs (e.g., core vs. peripheral temperature difference) as well as specific laboratory test results [e.g., prothrombin time (PT), activated partial thromboplastin time (aPTT), arterial pH, pCO 2 , and red blood cell distribution width] and interventions (e.g., red blood cell transfusion volume, inotrope use) featured prominently on the list of the 20 most important features for each machine learning method and for both sets of MOD criteria. One caveat is that these methods are prone to inconsistency in feature importance among correlated features, and unlike in linear models, the feature importance of non-linear decision tree ensemble methods are not additive. We observed higher concordance in feature rankings among non-linear models (random forest and XGBoost) and among linear models (GLMBoost and Lasso-GLM), but lower agreement between linear vs. non-linear models. This difference in feature ranking can be attributed to the difference in learning paradigms of these models; random forest and XGBoost combines predictions from an ensemble of decision trees, whereas GLMBoost and Lasso-GLM are regression-based linear methods. The small differences in the relative ranking of features among the decision tree-based methods can be explained by the random sampling of features in each iteration. Many of the 20 most important features overlap with those used in the MOD diagnostic criteria, but interestingly, we also observed several outside of MOD diagnostic criteria, including red blood cell distribution width, core vs. peripheral temperature difference, patient weight and age, arterial bicarb, alkaline phosphatase, C-reactive protein, monocyte number and percent, eosinophil number and percent, mean corpuscular volume, serum calcium and ionized calcium, aspartate aminotransferase, urine pH, urine color, glucose, body temperature, and AHG antibody screening. Future larger studies should be undertaken to understand the contribution of each of these features to the pathophysiology and evolution of MOD.

LIMITATIONS
This study was limited to data from a single mixed PICU, and multicenter studies would be needed to evaluate generalizability. A larger dataset would also allow for comparison of prediction performance across different organ system failure transitions such as respiratory to respiratory plus neurologic, etc. Another limitation of this study includes latency in reporting of nonpoint-of-care laboratory test results, which may alter the exact time of observed organ dysfunction. Additionally, our models were unable to use PICU day 1 MOD transitions due to the absence of pre-PICU admission laboratory results and 24-h retrospective nature of the organ dysfunction definitions. We also recognize our limitation of not being able to compare the model's early prediction ability with a clinician's ability to identify the likelihood of transition to MOD state due to the retrospective nature of our observational study. Therefore, comparing the predictive ability of prediction models with live clinical decision making should be explored in future validation studies.

Considerations for the Use of Early Prediction Models in Clinical Settings
Artificial intelligence (AI)-based healthcare applications constitute an evolving and rapidly expanding field aiming to address a large number of important clinical questions by identifying patterns in data that are either too subtle for clinicians to see or are "hiding in plain sight" and often missed (38). AI applications have focused on early diagnosis, prediction of clinical outcomes, and treatment recommendations (39)(40)(41)(42)(43)(44)(45)(46). AI-based models can condense physiologic and laboratory monitoring data to informative and explainable composite quantitative scores to help clinicians monitor the patient's severity of illness as well as provide timely insights into future clinical outcomes. Obviously, clinicians cannot treat what they do not recognize; we posit early recognition will allow treatment when organ dysfunction is less severe and potentially more easily reversible. The spectral clustering-based risk stratification method could further allow clinicians to identify a high-risk group with high accuracy, fostering efficient, and timely patient management. Future studies are needed to study the impact of therapeutic interventions on the patient's risk for developing individual as well as MOD. The roadmap to deploying these models in real-time bedside monitoring algorithms should include rigorous validation (47) keeping in consideration the legal, social, and economic implications of AI in healthcare (45,48).

CONCLUSIONS
Our prognostic model-building approach has led to the development of a set of models with demonstrated ability to predict high-risk alerts in patients who transitioned into a MOD state. The risk stratification methodologies have shown ≥93% positive predictive value in the highest risk groups. Therefore, this combined approach of continuous time risk monitoring, early warning of development of MOD, and risk group stratification could significantly aid in the monitoring and ultimately the management of critically ill children.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the dataset contains identifiable patient health information. Requests to access the datasets should be directed to mbembea1@jhmi.edu.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Johns Hopkins Institutional Review Board. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
SB performed data cleaning, model building, statistical analysis, interpretation of prediction results and prepared first draft of the manuscript. JG and SS contributed to result interpretation, providing feedback on model building, result interpretation and manuscript revision. JF provided insights on clinical relevance and interpretation of prediction outcomes. MB and RW served as principal investigators of the study and contributed to the study design and direction, interpretation of results, and final drafting of the manuscript. All authors reviewed the final draft of the manuscript and approved for submission.