Machine Learning Identifies Complicated Sepsis Course and Subsequent Mortality Based on 20 Genes in Peripheral Blood Immune Cells at 24 H Post-ICU Admission

A complicated clinical course for critically ill patients admitted to the intensive care unit (ICU) usually includes multiorgan dysfunction and subsequent death. Owing to the heterogeneity, complexity, and unpredictability of the disease progression, ICU patient care is challenging. Identifying the predictors of complicated courses and subsequent mortality at the early stages of the disease and recognizing the trajectory of the disease from the vast array of longitudinal quantitative clinical data is difficult. Therefore, we attempted to perform a meta-analysis of previously published gene expression datasets to identify novel early biomarkers and train the artificial intelligence systems to recognize the disease trajectories and subsequent clinical outcomes. Using the gene expression profile of peripheral blood cells obtained within 24 h of pediatric ICU (PICU) admission and numerous clinical data from 228 septic patients from pediatric ICU, we identified 20 differentially expressed genes predictive of complicated course outcomes and developed a new machine learning model. After 5-fold cross-validation with 10 iterations, the overall mean area under the curve reached 0.82. Using a subset of the same set of genes, we further achieved an overall area under the curve of 0.72, 0.96, 0.83, and 0.82, respectively, on four independent external validation sets. This model was highly effective in identifying the clinical trajectories of the patients and mortality. Artificial intelligence systems identified eight out of twenty novel genetic markers (SDC4, CLEC5A, TCN1, MS4A3, HCAR3, OLAH, PLCB1, and NLRP1) that help predict sepsis severity or mortality. While these genes have been previously associated with sepsis mortality, in this work, we show that these genes are also implicated in complex disease courses, even among survivors. The discovery of eight novel genetic biomarkers related to the overactive innate immune system, including neutrophil function, and a new predictive machine learning method provides options to effectively recognize sepsis trajectories, modify real-time treatment options, improve prognosis, and patient survival.

A complicated clinical course for critically ill patients admitted to the intensive care unit (ICU) usually includes multiorgan dysfunction and subsequent death. Owing to the heterogeneity, complexity, and unpredictability of the disease progression, ICU patient care is challenging. Identifying the predictors of complicated courses and subsequent mortality at the early stages of the disease and recognizing the trajectory of the disease from the vast array of longitudinal quantitative clinical data is difficult. Therefore, we attempted to perform a meta-analysis of previously published gene expression datasets to identify novel early biomarkers and train the artificial intelligence systems to recognize the disease trajectories and subsequent clinical outcomes. Using the gene expression profile of peripheral blood cells obtained within 24 h of pediatric ICU (PICU) admission and numerous clinical data from 228 septic patients from pediatric ICU, we identified 20 differentially expressed genes predictive of complicated course outcomes and developed a new machine learning model. After 5-fold cross-validation with 10 iterations, the overall mean area under the curve reached 0.82. Using a subset of the same set of genes, we further achieved an overall area under the curve of 0.72, 0.96, 0.83, and 0.82, respectively, on four independent external validation sets. This model was highly effective in identifying the clinical trajectories of the patients and mortality. Artificial intelligence systems identified eight out of twenty novel genetic markers (SDC4, CLEC5A, TCN1, MS4A3, HCAR3, OLAH, PLCB1, and NLRP1) that help predict sepsis severity or mortality. While these genes have been previously associated with sepsis mortality, in this work, we show that these genes are also implicated in complex disease courses, even among survivors. The discovery of eight novel genetic biomarkers related to the overactive innate immune system, including neutrophil function, and a new predictive machine learning method provides options to effectively recognize sepsis trajectories, modify real-time treatment options, improve prognosis, and patient survival.
Keywords: sepsis, complicated course, critical care, machine learning, transcriptomics, biomarkers INTRODUCTION Critically ill patients are admitted to the Intensive Care Unit (ICU) for complex and dynamic care, preserving organ function and improving outcomes in otherwise dire situations. Among patients with complicated disease courses, septic patients represent a significant component (1). Sepsis is a life-threatening organ dysfunction caused by the overactive immune response to bacterial infection, often of pulmonary origin (2). Sepsis may have contributed to 20% of all global deaths in 2019 (3). Sepsis consists of a heterogeneous mix of phenotypes (4,5), various degrees of disease complexities, and trajectories leading to recovery or death (6,7). Different strategies have been pursued predicting deterioration (8)(9)(10) and managing patients with sepsis in critical care units (11) using physiological, clinical, and biomarker parameters. However, due to the heterogeneous nature of patients presenting to the ICU and the diverse disease course that follows, it has been difficult to identify generalized models of disease (12).
Statistical and machine learning methods have been developed to successfully utilize the multi-omics data for biomarker discovery for predicting survival from sepsis (13). Wong et al. (14) identified 12 biomarkers collectively called the Pediatric Sepsis Biomarker Risk Model (PERSEVERE) class genes. Further analyses resulted in the identification of 18 additional genes consisting of the PERSEVERE XP set (15). Mohammed et al. (16) identified 53 differentially expressed genes, involved mostly in immune response and chemokine activity, from expression data collected from patients admitted to the pediatric ICU (PICU) within 24 h of admission. Sweeney et al. (17) analyzed the results obtained from three independent scientific groups that developed mortality prediction models and identified additional subgroups of genes. While much has been studied about the risk for mortality, there is a dearth of machine learning approaches to predict disease trajectory, including complicated disease courses and poor clinical outcomes (18). Early identification of disease trajectory, including complicated disease courses, defined as persistence of 2 or more organ failures by day 7 or death by day 28, can aid in clinical management and targeted therapies to manage severe outcomes. Hence, there is a need to identify these biomarkers and build novel machine learning models to identify complex disease courses from plasma samples collected close to the time of ICU admission.
In this work, we performed a meta-analysis of previously published peripheral blood cell gene expression data (sampled within 24 h of sepsis onset; 228 pediatric ICU patients) and analyzed them using multiple statistical and machine learning methods to identify novel markers of sepsis disease trajectory. We found 20 highly stable genes that predict disease complexity with an average derivation AUROC of 0.82 and validation AUROCs of 0.72, 0.96, 0.83, and 0.82 within critically ill children, using peripheral blood collected within 24 h of ICU admission. We validated these variables by calculating their overlap with the well-established sepsis mortality predicting genes, conducting the functional gene-set enrichment and pathway analyses, and testing them on four external validation datasets. Earlier identification of disease complexity can inform care management and targeted therapy. Therefore, the 20 gene candidates identified by our rigorous approach can be used to identify, early in their ICU stay, patients who may ultimately develop significant organ dysfunction and complex care management.

Data Collection
The pediatric sepsis dataset GSE66099 (19) downloaded from the NCBI Gene Expression Omnibus (GEO) repository, contains the gene expression profiles extracted from the peripheral blood samples of patients who were admitted to the PICU during the first 24 h of admission. The microarray dataset was obtained from the Affymetrix GPL570 platform, which was submitted by Wong et al. on February 19, 2015, and last updated on March 25, 2019. We considered a complicated course as the primary outcome variable. This was defined as death by 28 days or persistence ≥2 organ failures at day 7 of septic shock (20). This dataset was used to train our model and derive the top gene variables.
Due to the lack of available external data that encodes for complicated courses as defined in our derivation cohort, we used a collection of four microarray gene expression datasets as the closest surrogate to validate our list of biomarkers. The first dataset, GSE54514 (21), based on an adult cohort, provided whole blood samples collected up to 5 days after ICU admission. In this cohort, we defined the complicated presentation as patients with APACHE II scores ≥25 observed within 24 h of admission to the ICU. The second dataset, E-MEXP-3850 (22) was used to study the temporal evolution of sepsis by collecting whole blood samples at six different time points from five critically ill children admitted to the PICU with meningococcal sepsis and sepsis-induced multiple organ failure. Barring one sample from patient four that was degraded and not used for microarray analysis, there were a total of 29 distinct time-course based gene expression measurements included in this study. We considered 28-day mortality as the primary outcome variable for this cohort. The third dataset, E-MEXP-3567 (23) was used to discover biomarkers of severe bacterial infection using transcriptomic data collected from whole blood samples of children suffering from either bacterial meningitis or bacterial pneumonia. The outcome variable that was chosen for this dataset was in-hospital mortality. The fourth dataset, GSE40586 (24), was used to study the pathways activated at the transcriptional level by extracting RNA from the whole blood samples of patients (infants, children, and adults) suffering from bacterial meningitis and following a complicated clinical course. The outcome variables used in our analysis included the complications observed by a neurologist during a patient's discharge from the hospital. The processed expression data, probe to gene identifiers, and the outcome labels for the datasets GSE54514, E-MEXP-3850, E-MEXP-3567, and GSE40586 were downloaded from (17). The gene expression data were extracted from whole blood samples and thus contain a mixture of red blood cells, white blood cells and platelets. However, RNA from these sources do not contribute to the expression of the immune-related genes expressed in the white blood cells.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional review boards of all participating institutions, and with the Declaration of Helsinki and its later amendments or comparable ethical standards.

Normalization and Background Correction
Technical variations in the gene expression data were eliminated using the Affy package (25) from R. This also helped remove the background noise. The Robust Multi-Average parameter method using "gcrma" package (26) from R was used to normalize data and perform background correction. Surrogate Variable Analysis was performed to infer batch effects and other unwanted sources of variation in the data. The "sva" package (27) in R was used for this purpose. The results of the surrogate variable analysis or the inferred batch effects were regressed with the actual batch variable (or year of measurements in this case) using the lm () function (28). The resulting variation was removed using the Combat () from the same package, and the cleaned expression set object was used for further analysis. Boxplots displaying the expression values before and after normalization and the relationship between inferred and observed batch effects were plotted using the boxplot () from R (28).

Probe to Gene Mapping, Identification of Differentially Expressed Genes
For the microarray datasets GSE66099, GSE54514, E-MEXP-3850, E-MEXP-3567, GSE40586, the Affymetrix probes were matched to gene symbols using the Affymetrix Human Genome U133 Plus 2.0 (hgu133plus2.db), Illumina Human HT12v4 annotation database (illuminaHumanv4.db), Affymetrix Human Gene 1.0 ST Array (HuGene-1_0-st), Affymetrix Human Genome HG-U133A (hgu133a.db), and Affymetrix Human Gene 1.0 ST Array (HuGene-1_0-st), respectively. In order to detect the expression values of genes with multiple probes, we averaged the expression for multiple probes that matched to the same gene. The limma package (29) from R was used to perform the differential gene expression analysis with a Benjamini-Hochberg correction (FDR cutoff = 0.1).

Functional Enrichment Analysis
We used the clusterProfiler package (30) in R to find enriched biological processes (BP), cellular components (CC), and molecular function (MF) terms. A plot displaying the enriched terms was drawn using the enrichplot package (31) in R. The STRING analysis online tool (32) was used to find significantly enriched Reactome and KEGG pathways.

Statistical Analysis
Affymetrix data download and gene mapping were done using the "affy" and the annotation package "hgu133plus2.db" package in R, respectively. A Fisher exact test was performed to determine the statistical significance among the functional enrichment terms. Benjamini Hochberg's multiple test correction methods were used to calculate the differentially expressed genes (DEGs). Heatmap for the DEGs was generated using the Heatmap () from the "complexHeatmap" package (33) in R. Since there was no explicit argument in the Heatmap () to scale the rows/columns, we scaled our expression matrix using the scale () (28) and then constructed the heatmap of the resulting scaled matrix. The Volcano plot and the MA plot were generated using the volcanoplot () and plotMA () of the limma package (29), respectively.

Variable Selection Methods
The processed microarray derivation dataset had 20,174 genes for 228 samples. Our next objective was to reduce the dimensionality of the dataset by removing redundant variables. To do that, we used three commonly used variable selection techniques, including random forest-based variable importance, LASSO, and Minimum Redundancy and Maximum Relevance. The variables generated by each of the above three methods were pooled together into one aggregated variable set, and the list of 17 differentially expressed genes were also added to that list. The Pediatric Risk of Mortality Score (PRISM) for each patient was computed and included in the model (34). For external validation, we used both LASSO based and tree-based variable selection techniques (using the extra trees classifier with 250 estimators) to select the top variables for further analysis.

Classification Models
Imbalanced data can negatively affect learning algorithms. Hence, in this study, we experimented with both oversampling and undersampling techniques to balance the training data. Among the undersampling techniques, we implemented Cluster centroids (CCN), Repeated Edited Nearest Neighbors (REDN), Edited Nearest Neighbors (EDN), Instance Hardness Threshold (IHT), and Random Undersampling (RUS). Cluster centroids (35) undersamples the majority class by first calculating the centroid of the majority class and then finding all the instances nearest to this centroid in the input variable space. Consequently, instances far away from the centroid are discarded. The scikit-learn implementation of this method uses the KMeans algorithm to find the cluster centroids. In case of the Edited Nearest Neighbor resampling technique (36), all instances whose class label differs from that of half of its k-nearest neighbors are discarded. In Repeated Edited Nearest Neighbors, the EDN technique is successively applied until no further instances can be removed from the majority class. The Instance Hardness Threshold undersampling technique (37) works by successively applying a set of n learning algorithms on the training set followed by removing those instances that are frequently misclassified. Random Undersampling involves randomly selecting instances from the majority class to remove from the training set. SMOTE (Synthetic Minority Oversampling Technique) (38) is a frequently used oversampling technique that works by synthesizing new instances of the minority class from existing data. Specifically, a data point (say a) from the minority class is first chosen at random and its k-nearest neighbors are identified. A randomly selected neighbor (say b) is then chosen and a synthetic example is created at a randomly selected point between a and b.
We then developed machine learning algorithms using several tree-based classifiers and logistic regression. The two tree-based classifiers were the Balanced Random Forest (BRF) and Extra Trees (ET) classifier. A random forest classifier combines the predictions from hundreds of decision trees built on random bootstrapped samples of the dataset using a random subset of variables when splitting the nodes. In case of imbalanced classification, a balanced random forest classifier balances data by randomly undersampling each bootstrapped sample. The extra trees classifier is a meta estimator that fits a user-defined number of randomized trees (base models) on different sub-samples of the data and then combines the predictive power of these models into one optimal model. In case of binary classification, a Logistic Regression (LOGIT) classifier works by calculating a linear combination of the log-transformed gene expression values across samples and generating a linear decision boundary to separate the two classes from one another. For the external validation analysis, in addition to the above classifiers, we also used three ensemble techniques, namely the Gradient Boosting classifier (GB), Easy Ensemble (EE), and XGBoost classifier. All three classifiers work by combining several base learning models into a strong predictive model. Gradient Boosting (39) aims to reduce the loss of the model by iteratively adding weak learners in a stagewise fashion using a gradient descent approach and XGBoost (Extreme Gradient Boosting) (40) is a fast and highly efficient implementation of the Gradient Boosting method that uses L1 and L2 regularization to generate more generalizable models. The Easy Ensemble classifier (41) is essentially a collection of AdaBoost learners trained on different random bootstrap samples of the training set. A general framework explaining how tree-based classifiers and logistic regression models work is available in the Supplementary Document.

Model Selection and Tuning
The learning process that was adopted in the manuscript is illustrated in Figures 1A-C. First, the dataset was randomly partitioned in a stratified fashion into five equal subsets ( Figure 1A). Four of the five subsets were combined into one training set. In each training phase, we performed (1) variable selection ( Figure 1B): The pooled list of variables using the three variable reduction approaches (described in the previous section), along with statistical filtering by DEG, were obtained. Finally, Recursive Feature Elimination (RFE) (43) was performed on the pooled variables to remove redundancies and arrive at a subset of the most important genes. (2) Hyperparameter tuning was performed using a cross-validated grid search technique over a parameter grid using the AUROC metric as the scoring function. (3) A classifier was trained using the hyperparameters, and the corresponding prediction scores were obtained on the hold-out test set ( Figure 1C). We repeated the entire process 10 times, resulting in 50 unique train and test splits. The classification performances obtained during each run were averaged, and the mean scores were reported. We assessed the performance of our classifiers using four widely used performance metrics: Sensitivity, Specificity, Mathews correlation coefficient, and Area under the ROC curve. They were defined as follows: False Positive Rate = FP FP + TN Mathews correlation coefficient: A quantity that measures the quality of binary classification models. Mathematically, this can be expressed as: The value of MCC ranges from −1 to 1 where −1 signifies a perfect misclassification and vice versa. Accuracy and F1 score are widely used performance metrics that can show overoptimistic inflated results for imbalanced data and hence were not included in our study. Area under the ROC curve: A probabilistic binary classifier outputs the probability of a particular data point belonging to a class. To map these probabilities to a binary category, The initial data is aggregated, normalized, corrected for batch normalization, and separated into even chunks using k-fold cross-validation (CV). In our pipeline, we used k = 5. (B) The training chunks of the CV are used for model development; the data analysis pipeline follows the Complete Cross-Validation (CCC) approach defined by Alder et al. (42). In addition to DEG, we apply three other variable selection methods to generate a pool of candidate genes. We then apply a wrapper method, namely the RFE to arrive on the most predictive genes. (C) The genes selected by the RFE method are then used to develop a predictive model. The model is then evaluated on the test fold of the CV. This process is repeated for the remaining training and test folds. Finally, the entire 5-fold CV is repeated 10 times to generate a total of 50 iterations, and the top predictors from (B) are saved and analyzed to generate a normalization score, which is a measure of how often a gene appears as a top predictor across each of the 50 iterations.
a classification threshold is usually defined. A value above that threshold will be classified as "positive" and vice versa. An ROC curve plots True Positive Rate vs. False Positive Rate and displays the performance of a binary classifier at different classification thresholds. Area under the ROC curve (or AUROC) measures the entire two-dimensional area under the ROC curve. Basically, it gives an aggregate measure of the performance of a binary classifier at various classification thresholds. Another way of interpreting AUROC is as the probability that a classifier ranks a random positive instance more highly than a random negative instance. AUROC ranges from 0 to 1 where 0 signifies a perfectly inaccurate classifier and vice versa. The ROC plots displaying the performance of the binary classifiers were generated using the matplotlib (44) plotting library in Python. A detailed discussion on each step of the workflow outlined in Figure 1 is available in the Supplementary Document.
To study the predictive power of our top 20 gene variables from the stability analysis, we performed validation on independent test sets. In the first step, we applied different undersampling and oversampling techniques on the derivation dataset. Then we performed a LASSO based variable selection to remove redundant variables. The hyperparameters of the classifier were tuned using the 5-fold cross-validation grid search technique over a parameter grid using the AUROC metric as the scoring function. The tuned classifier was then trained on the derivation set, and the corresponding prediction scores were obtained on the hold-out validation sets (GSE54514, E-MEXP-3850, E-MEXP-3567, and GSE40586). The predicted probabilities from a classification model are converted to discrete class labels using a parameter called "classification threshold." Consequently, if "x" was the reported classification threshold for a given classifier, then all instances with predicted probabilities greater than "x" were classified as having a complicated course outcome and vice versa. To further fine-tune the classifiers, instead of using the default classification threshold of 0.5, we experimented with different thresholds between 0 and 1 with step sizes of 0.001. This is particularly important for an imbalanced classification problem like ours and must be taken into account to make sure that the minority class examples are predicted correctly. The same set of performance metrics were used to assess the validation classifier's quality, as stated in the previous section.

Distribution of Variables
To study the class-wise distributional differences between the variables (or genes), we conducted the two-sample Kolmogorov-Smirnov (KS) test using scipy stats module. The distplot function from the seaborn package (45) was used to visualize the class-wise Gaussian kernel density estimate of each of the variables. In case of the E-MEXP-3567 dataset, owing to the small sample size (only 12 samples), we adopted a repeated resampling with replacement (or bootstrapping) technique to estimate the KS statistic values. The formulation of the KS test and an extended discussion on KDE plots is available in the Supplementary Document.

Clinical Characteristics of the Patients Included in the Derivation Dataset
Out of the 229 pediatric septic patients present in the derivation dataset (GSE66099), 52 had a complicated course outcome. One patient was excluded due to missing the outcome variable. The age of the cohort was 3.81   Figures 2A,B shows the boxplots of the average gene expression of samples derived from the derivation dataset GSE66099 before and after normalization. Before normalization, the expression values had inconsistent distributions. After normalization all the samples were aligned at the overall mean and variance. The output of the SVA consisted of two surrogate variables and only one of them had a significant association with the actual batch variable. The boxplot drawn in Figure 2C shows that there is a relationship between the inferred batch effect and the observed batch effect. This is similar to performing an ANOVA test and checking if the values of the SVs calculated for each sample is different between the batches or whether there is a difference in the means between the boxplots. An expanded discussion on the normalization process and identification of batch effects is available in the Supplementary Document. We removed the batch effects from the data due to the microarray experiments being conducted over multiple years using the Combat() in the "sva" package. In the given figure, the first SVA component ordered by date before batch effect correction shows that one of the inferred batch effects (or the surrogate variable) is associated with the actual batch variable. A volcano plot helps us to assess the adjusted p-values (significance), and the log fold changes (biological difference) of differential expression for the given list of genes at the same time. (C) An MA plot is a 2D scatter plot (each dot representing a gene) that represents log fold change vs. mean expression across two different conditions. All the significantly differentially expressed genes (FDR cutoff = 0.1) are colored in red and the genes without significant gene expression differences are colored in black.

Expression Levels of the DEGs Generated Prominent Clusters Separating the Complicated and Uncomplicated Course Patients
Based on an adjusted p-value cutoff <0.1, a total of 1,269 differentially expressed genes (DEGs) from 20,174 were found between the complicated and uncomplicated group of patients, including 808 upregulated genes and 461 downregulated genes. The DEGs with an absolute log2 fold change of at least one (n = 17) are shown in Supplementary  Figure 3A illustrates the heatmap of 17 DEGs (absolute log2 fold change >1 and adjusted p < 0.1) across the complicated vs. uncomplicated groups derived using the derivation dataset GSE66099. The samples are represented on the vertical axis and the genes are on the horizontal axis. The normalized gene expression values of the DEGs were used to construct the heatmap. Red and green represent the upregulated and downregulated genes, respectively. Both the samples and genes were clustered to give a better idea of the groups formed using the expression values. Two horizontal bars show annotations for the complicated course outcomes and mortality.
Three prominent sample clusters were observed from the heatmap. The first cluster contained none of the 52 complicated course patients, while the second cluster contained nine (17%) of the 52 complicated course patients and the third contained 43 (83%) of the complicated course patients. A similar cluster trend is noticed among survivors/non-survivors. None of the non-survivors were in the first cluster, four (14%) out of 28 nonsurvivors were in the second cluster and the remaining 24 (86%) were in the third cluster.
Genes were also clustered into two distinct groups of upregulated and downregulated genes. MME, TGFBI, and HCAR3 together formed one cluster of downregulated genes while the rest were in the upregulated category. LTF and IL1R2 were the most highly expressed genes, whereas CEP55, and MS4A3 were the least expressed genes. MMP8 was the most upregulated gene while HCAR3 was the most downregulated gene. Figure 3B shows the volcano plot of significantly different genes. Only genes having absolute log2 fold change >1 and adjusted p < 0.1 were considered differentially expressed. Excluding the above-mentioned downregulated genes, green dots represent log2 fold change >1 and adjusted p < 0.1; red dot represents genes with adjusted p < 0.1 but log2 fold change between −1 and 1; whereas black represent genes with log2 fold change between −1 and 1 and adjusted p > 0.1. Also, in Figure 3C, an MA-plot displays the log2 fold change between complicated course and non-complicated course samples as a function of the average expression level across all samples. Red dots are relatively larger than the black ones. The dataset included a total of 30 uncomplicated disease course patients who met the SIRS criteria when excluding these patients from our analysis we observed five genes that achieved robust normalization scores but did not meet the DEG cutoffs. Those genes were: TGFBI, DEFA4, CEP55, MME, and OLAH.

Functional Enrichment Analysis of the DEGs Displayed an Association With an Overactive Innate Immune System and Neutrophil Activity
The top 17 most significantly differentially expressed genes (absolute log fold change >1 and adjusted p < 0.1) obtained from the derivation dataset GSE66099 were analyzed. "Hematopoietic cell lineage" (hsa04640) was the most enriched KEGG pathway; "neutrophil degranulation" (HSA-6798695), "immune system" A biological process is a specific objective that an organism is genetically designed to accomplish. It is often described by the ending state or the outcome. For instance, a biological process defined as "cell division" results in the creation of a divided cell (two daughter cells) from a single parent cell. A Cellular Component (CC) defines a location occupied by a macromolecular machine during the execution of a specific molecular function. For instance, "cytoplasmic side of plasma membrane" is a cellular component defining the location of a gene product relative to cellular structures. A Molecular Function (MF) represents the primary activity of a gene product at the molecular level. Biochemical activities such as "binding" or "catalysis" are examples of GO terms representing molecular functions. In this figure, the y-axis represents the gene ontology terms and the x-axis represents the "Gene Ratio" or the percentage of total DEGs in the given GO term. The size of the dot or the "Gene Count" represents the number of genes associated with the enriched term and the color of the dot represents the significance of the terms (more significant terms being redder). "P.adjust" is the p-value adjusted using the Benjamini-Hochberg procedure.
(HSA-168256) and "antimicrobial peptides" (HSA-6803157) were some of the most enriched Reactome pathways. Gene Ontology (GO) analysis of the DEGs revealed known GO terms such as "neutrophil degranulation" (GO:0043312), "neutrophil activation involved in immune response" (GO:0002283), "neutrophil activation" (GO:0042119), and "neutrophilmediated immunity" (GO: 0002446) as some of the top enriched biological processes; "Serine-type endopeptidase activity" (GO:0004252), "endopeptidase activity" (GO: 0004175), and "serine-type peptidase activity" (GO:0008236) were some of the most enriched molecular functions; "Specific granule" (GO:0042581) and "specific granule lumen" (GO:0035580) were some of the most enriched cellular components. Figure 4 displays the results from the functional analysis. The y-axis represents the gene ontology terms and the x-axis represents "Gene Ratio" or the percentage of total DEGs in the given GO term.
We also performed functional analysis on the top predictors from our machine learning analysis which included three additional genes (PLCB1, NLRP1, and SDC4) apart from the DEGs ( Table 1). The top enriched biological process term was neutrophil degranulation (GO:0043312) similar to the results from our previous functional analysis with the DEGs. Some of the newer enriched terms included cell activation (GO:0001775), immune response (GO: 0006955), immune system process (GO:0002376), and response to stimulus (GO: 0050896). It is well-known that immunosuppression is a hallmark of sepsis and the top predictors capable of distinguishing between complicated and uncomplicated course patients show enrichment in the immune response process. Neutrophil degranulation (HSA-6798695), innate immune system (HSA-168249) were among the top enriched REACTOME pathways. The innate immune system is activated as the first response to an infection before the adaptive immune system. Most of the clinical phenotypes of sepsis can be attributed to the innate immune response. Interestingly, the adaptive immune response was not one of the enriched pathways for our list of genes which might warrant further investigation.

Machine Learning Models Built Using the Derivation Dataset Generate a Strong Predictive Model of Complicated Sepsis Based on the Top Gene and Clinical Variables
An ROC curve plots True Positive Rate vs. False Positive Rate and displays the performance of a binary classifier at different classification thresholds. Area under the ROC curve (or AUROC) measures the entire two-dimensional area under the ROC curve and estimates the ability of a binary classifier at discriminating between the two classes (complicated and uncomplicated course outcome). AUROC ranges from 0 to 1 and higher values of AUROC signifies better discriminative power of the generated classifier. The best performance in terms of overall mean AUROC (=0.823), mean specificity (=0.885) and mean MCC (=0.445) was obtained using an oversampling technique, SMOTE, and BRF classifier pair. The ROC curve of the best performing model is shown in Figure 5 and the results for the best performing classifiers is shown in Table 3. The orange-colored curve is the mean ROC curve obtained after averaging the results from different runs of the repeated 5-fold cross-validation. The list of the top clinical and genomic variables that were consistently chosen across different folds of the cross-validation of the topperforming model and their overlap with previously published predictor genes in sepsis is shown in Table 1. PRISM score, which was the only clinical variable included in our model was also chosen as one of the top predictors in our analysis. The "normalized score" represents the fraction of times the given predictor was consistently chosen over all experiments, representing the stability (46) of these genes as possible predictors of a complicated course. On the derivation cohort GSE66099 containing 228 patients, the best model in terms of mean AUROC (=0.823) was obtained using the top consistently chosen 20 gene biomarkers and PRISM score. Seventeen out of these 20 gene biomarkers were among the differentially expressed genes as shown in the heatmap and the volcano plot (green and orange dots) in Figure 3 and Table 1. Among the other classification metrics, the best mean sensitivity (=0.792) was obtained using an undersampling technique, REDN, and the BRF classifier ( Table 3).

Machine Learning Models Generated Using the Top 20 Gene Biomarkers Displayed High Classification Performances on Four Independent Test Datasets
For the Array Express dataset E-MEXP-3567 based on the pediatric cohort, the best performance in terms of overall AUROC (=0.83) was obtained using an undersampling technique, REDN, and LOGIT classifier pair. This model also gave the best specificity (=0.83) and MCC (=0.67). The best sensitivity (=1.0), however, was obtained using an undersampling technique REDN, and BRF classifier pair. The list of 10 genes that were chosen as top variables using the LASSObased variable selection technique were MMP8, CEACAM8,  LCN2, RETN, CLEC5A, TGFBI, CEP55, MME, OLAH, and SDC4. The ROC plots of the top performing models obtained using this dataset is shown in Figure 6A. The best classifier (AUROC = 0.83) was obtained using a classification threshold of 0.847. For the GSE54514 expression data based on the adult cohort, the best performance in terms of overall AUROC (=0.7233) was obtained using an undersampling technique, REDN, and BRF classifier pair. This model also gave the best MCC (0.395). The best specificity (0.80) was obtained using another undersampling technique, IHT and a stacked classifier containing BRF, GB, and ET. The best sensitivity (0.70) was obtained using the sampling technique, IHT and ET classifier pair. Both these methods used a LASSO based variable selection strategy. For the undersampling technique REDN, the list of genes that were selected using the LASSO based variable selection technique were MMP8, CEACAM8, LCN2, RETN, CLEC5A, TGFBI, CEP55, MME, OLAH, and SDC4 and for IHT the list of genes that were selected was OLFM4, CEACAM8, LCN2, RETN, ELANE, HCAR3, IL1R2, CLEC5A, MS4A3, TGFBI, DEFA4, CEP55, MME, SDC4, PLCB1, and NLRP1. The ROC plots of the top performing models obtained using this dataset is shown in Figure 6B. Predominantly two undersampling techniques, namely, Repeated Edited Nearest Neighbors and Instance Hardness Threshold gave consistently good performances in terms of AUROC. The best classifier (AUROC = 0.723) was obtained using a classification threshold of 0.701.
For the GSE40586 expression data based on a diverse group of patients (infants, children, and adults), the best performance in terms of overall AUROC (=0.822) was obtained using an undersampling technique RUS, and BRF classifier pair. This model also gave the best sensitivity (=0.875) and MCC (=0.626). The best specificity (=0.846), however, was obtained using the undersampling technique RUS and a stacked ensemble of BRF and ET classifiers. The list of top 15 genes selected by the LASSO technique were MMP8, OLFM4, CEACAM8, RETN, LTF, HCAR3, IL1R2, MS4A3, TGFBI, DEFA4, MME, OLAH, SDC4, PLCB1, and NLRP1. The ROC plots of the top performing models obtained using this dataset is shown in Figure 6C. The best classifier (AUROC = 0.822) was obtained using a classification threshold of 0.361.
For the Array Express dataset E-MEXP-3850 based on the pediatric cohort, the best performance in terms of overall AUROC (=0.9566) was obtained using an undersampling technique, RUS, and ET classifier pair. A tree-based variable ranking approach gave the best results in this case. The list of 11 genes selected by the tree-based technique were MMP8, TCN1, OLAH, CEP55, PLCB1, OLFM4, HCAR3, TGFBI, MS4A3, CEACAM8, and SDC4.
This model also gave the best specificity (=0.913), sensitivity (=1.0), and MCC (=0.828). The ROC plots of the top performing models obtained using this dataset is shown in Figure 6D.   The best classifier (AUROC = 0.956) was obtained using a classification threshold of 0.684. The classification metrics and the list of the top genomic predictors for the best performing models generated during the validation analysis are shown in Tables 1, 4, respectively. The list of tuned hyperparameters for the different samplingclassifier pairs that gave the best results is attached in Supplementary Tables 5A-D.

Both the Derivation and Validation Datasets Displayed Significant Distributional Differences in Terms of the Top Gene and Clinical Variables
The Kolmogorov-Smirnov test (or KS test) (47) is a nonparametric test of equality of two continuous distributions and is used to quantify the distance between two empirical distribution functions. The results of the two-sample KS-tests using the derivation set (GSE66099) are shown in Table 1. Both the value of the statistic and the p-value is given for each variable. A higher value of the statistic corresponds to greater distributional differences between the two experimental groups. For the derivation dataset (GSE66099) containing 228 patients, the Gaussian kernel density plots for the 20 genes and one clinical variable (PRISM score) demonstrating significant (p < 0.1) distributional differences is shown in Figure 7 and Table 1 Table 6B). Since the GSE54514 dataset had APACHE II scores, we evaluated the sensitivity of the model toward different score cutoffs of 15, 20, 25, and 30 (Supplementary Table 6A). The list of genes that showed a significant distributional difference between the two classes and their corresponding KS scores are shown in Supplementary Table 6A. It can be seen from the analysis that an APACHE II cutoff of 25 gave the highest number (10) of significant genes showing that only a fraction of the top 20 biomarkers discovered using pediatric data is actually indicative of a complex course outcome in an adult cohort. We then performed repeated 5-fold cross-validation experiments on the validation data using these 10 genes and achieved the best results using the sampling technique SMOTE and XGBoost classifier pair (mean AUROC = 0.72; mean Sensitivity = 0.78; mean Specificity = 0.65; mean MCC = 0.4). These results were similar to the external validation results for the GSE54154 dataset mentioned in the previous section. For the E-MEXP-3850 dataset, we derived a ranked list of the 20 gene variables using a tree-based variable ranking method and achieved a high  (MMP8, TCN1, OLAH, CEP55, PLCB1, OLFM4,  HCAR3, TGFBI, MS4A3, CEACAM8, and SDC4). Due to the inherent imbalance in the training data (GSE66099) we tried different classification thresholds for the tuned classifiers and reported the ones that gave the best classification performances. The list of top performing classifiers included both individual and stacked classifiers. The legend displays the names of the sampling-classifier combinations, the AUROC and the classification thresholds that gave the top results in brackets.
AUROC of 0.956 (Table 4). After performing the two-sample KS test using this ranked list, we further achieved a list of 11 candidate gene variables ( Table 1) that displayed a statistically significant distributional difference between the two classes.
Repeating the external validation analysis using this set of 11 genes gave us comparable classification performances (AUROC = 0.956, Specificity = 0.913, Sensitivity = 1.0, and MCC = 0.828). For the remaining datasets, E-MEXP-3567 and GSE40586,   A Kolmogorov-Smirnov test is a non-parametric test used to compare the equality of probability distributions. There are two scores associated with a KS test: a KS statistic that is used to quantify the distance between two distributions and the p-value which tells us the significance of the result. The differences in the distribution between the complicated and uncomplicated course groups in terms of the top 20 gene predictors and a severity score (PRISM) is shown in this plot.
we repeated the KS test analysis on the list of reported genes obtained using the LASSO-based variable selection technique.
In both cases, we found significant distributional differences between the respective classes. The complete list of genes and the KS test results for all four validation sets are shown in Supplementary Table 6B.

Functional Role of the Identified Biomarkers in Sepsis
This study revealed differentially expressed genes in peripheral blood samples from a previously published gene expression dataset collected within 24 h of PICU admission. Among the most significant genes from our machine learning analysis of complex sepsis trajectories, we found genes that are responsible for innate immune response. Sepsis is caused by the dysregulation of the host response to an infection and in its severe form, causes life-threatening organ dysfunction. Cytokines play an important role during sepsis by regulating the immune response to the infection. A variety of pro-inflammatory and anti-inflammatory mediators contribute to the inflammatory response and an imbalance between the two directly results in dysregulation. Matrix metalloproteinase-8 (MMP8) and Resistin (RETN) have been associated with the activation of proinflammatory cytokines TNF-α (51,52) which in turn stimulates systemic inflammation. MMP8 is involved in the homeostasis of the extracellular space, largely expressed by mononuclear cells and macrophages; it has been shown to be involved in roles supporting innate immunity (53). MMP8 knockout mice have been observed to have reduced phagocytosis and NET activity (54). A number of the identified genes have been long implicated in neutrophil extracellular traps (NETs), the activation of the NET results in NETosis, which can significantly complicate disease course. Among our genes, Olfactomedin 4 (OLFM4), Elastase Neutrophil Expressed (ELANE), and lactotransferrin (LTF) were identified (42,55,56). Recent findings have suggested CEA Cell Adhesion Molecule 8 (CEACAM8), C-Type Lectin Domain Containing 5A (CLEC5A) (57) may be further implicated in NETosis. Polymorphonuclear neutrophils, typically pro-inflammatory, also perform immunoregulatory roles, by expressing CEACAM8 and thus releasing soluble CEACAM8 after activation. Extracellular chromatids have been shown to activate the secretion of CEACAM8 through degranulation (58). A recent study finds CLEC5A, which has long been associated with dengue virus-induced lethal disease (59), to be an important factor associated with modulating innate immune response against bacterial infection in mice. This study suggests in mice with a knockout gene (CLEC5A), that prognosis was severe and by day 5, the mice had significant liver necrosis and increased risk for death. While these genes have been previously associated with mortality, in this work we show that these genes are also implicated in complex disease courses, even among survivors. A combination of the genes identified in this analysis also has been involved in microbiome homeostasis. Specifically, Lipocalin-2 (LCN2), an innate immune protein has been associated with maintaining an intestinal barrier against oxidative stress, having immunosuppressive character, and protects against multi-organ dysfunction (60)(61)(62). This protein has been increasingly suggested to be a therapeutic candidate to protect against gut-origin sepsis (62).
The complexity of the disease may also contribute to the ambiguity in identifying the correct class of pathogens, specifically in gram-negative/gram-positive bacterial differentiation. Therefore, interest has emerged in differentiating these characteristics through gene expressions, and Interleukin 1 Receptor Type 2 (IL1R2) has been specifically implicated (63) in this. Identifying such differentiation may also aid in determining the complexity and severity of sepsis by investigating the specific types of toxins released by either class of pathogens. Superantigens, for instance, produced by S. aureus and S. pyogenes have been suggested to cause a massive cellular immune response, leading to fatal toxic shock (64), while other microbial toxins have been involved in significant sepsis-led immunosuppression (65). Hence, earlier identification of such differentiation can improve case management and therapeutic selection. We identified Hydroxycarboxylic Acid Receptor 3 (HCAR3) and Membrane Metalloendopeptidase (MME) as two of the three downregulated genes similar to the findings of Kangelaris et al. (49) that study changes in gene expression among Acute Respiratory Distress Syndrome (ARDS) patients affected with sepsis. The genes overexpressed in ARDS were often found to be associated with a more severe sepsis outcome in other studies (including MMP8, RETN, and OLFM4) (66,67). Similar results were found among patients with Acute Kidney Injury (AKI) where the overexpression of genes such as MMP8, IL1R2, and OLFM4 was associated with increased severity and organ failure (50).
Some of the genes identified in this work overlap with approaches to predict sepsis mortality (17). DEGs identified in our study namely, CEACAM8, IL1R2, CEP55, TGFBI, and DEFA4 belonged to that list. Among the PERSEVERE genes (14) that were identified as DEGs, our analysis included MMP8, ELA2, LCN2, LTF, and RETN, while those overlapping with the PERSEVERE XP genes (15) included CEP55, TGFBI, and MME.
Among the top consistently chosen variables (that were not DEGs) in our machine learning analysis, Syndecan-4 (SDC4) has been found to be a potential biomarker with an anti-inflammatory function in patients with acute pneumonia (48). Microbiologic results of the patients from our dataset, identified Pneumococcus as the second most frequently occurring pathogen hence this result is quite significant. We also identified Phospholipase C Beta 1 (PLCB1) as one of the top genes that were not a DEG. Phospholipase (PLC) proteins are a class of membrane-associated enzymes that hydrolyze phospholipids and are responsible for signal transduction and gene transcription. The hydrolysis of phospholipids results in the activation of protein kinase C (PKC) signaling which in turn regulates macrophage mediated inflammatory response (68). Finally, the NLR Family Pyrin Domain Containing 1 (NLRP1) gene, which is one of the many genes responsible for the activation of the NLRinflammasome cascades was found to be significantly altered among septic patients in a recent study (69).
The exclusion of patients meeting SIRS criteria resulted in a lower number of DEGs, and the machine learning prediction scores were also reduced, with an AUROC of 0.61. This implies that the inclusion of such patients within the non-complicated disease course resulted in a more robust characterization of complex patients, thereby improving the predictive performance. Moreover, five genes, namely TGFBI, DEFA4, CEP55, MME, and OLAH, all except CEP55 are implicated as early markers of neutrophil activity. CEP55 overexpression has been implicated in T-cell lymphoma and genome instability (70,71).

Refining Gene Profiles Through Variable Selection and Stability Analysis
The gene expression profiles of the top 20 gene markers and one clinical variable (PRISM score) from the stability analysis had significant distributional differences between the complicated and uncomplicated course patients (Figure 7 and Table 1). When tested on a closely related independent adult validation set, two pediatric datasets and a mixed-age dataset, we achieved an outof-sample AUC of 0.72, 0.95, 0.83, and 0.82 respectively. The relatively lower AUC in the case of the adult cohort can be attributed to the fact that the derivation cohort is based on pediatric patients (mean age 3.81 years) while this validation cohort (GSE54514) is based on adult patients (mean age 59.13 years). Some of the gene predictors that are indicative of a complex course outcome for pediatric patients, might not play the same role for adult patients. This was further justified when we found that out of the top 20 gene biomarkers, only 10 had a significant class-wise distributional difference using the gene expression profiles from the validation cohort (GSE54514) using an APACHE II cutoff of 25 (Supplementary Table 6A). In the case of the two pediatric cohorts (E-MEXP-3850 and E-MEXP-3567), where we used 28-day mortality and in-hospital mortality as class labels, respectively, the number of top genes that had a significant distributional difference between the two classes were 11 and 10, respectively. This can be attributed to the fact that unlike these two validation cohorts, the variables used for the derivation cohort were based on a complicated course outcome (See Data collection under the Materials and Methods section) and not mortality. In the derivation cohort, 28 out of the 52 complicated course patients died. Finally, for the fourth validation set, a set of 15 out of the 20 genes gave the best AUROC of 0.822. One of the consistently chosen top variables (Figure 7 and Table 1) from our analysis using the derivation data (GSE66099) was the PRISM score which could not be used for the validation analysis because this information was not available in any of the validation cohorts. Also, even within the validation datasets, the complex disease course was interpreted based on severity of illness scores that were available at the time of ICU admission or mortality. Hence, a more similar dataset which derives complex disease courses longitudinally during ICU stay may have improved the performance of the proposed biomarkers.

Limitations
There are some limitations to our study. First, we present in this study a novel hybrid method of biomarker discovery that identifies stable and consistently chosen variables among multiple iterations of the pipeline as candidate gene markers of complex disease course. While we try to balance this novel approach by demonstrating traditional methods of identifying DEGs, further investigations in other datasets may be required to establish its validity. Second, due to the lack of any similar public dataset that identifies complicated disease courses within sepsis patients, we identified four closely related datasets which present severity of illness scores and mortality, using which we derived our interpretations of complicated and uncomplicated courses. Therefore, to ensure generalization and minimize selection bias, further validation must be performed on closely related prospective datasets. Third, in this study, we focus only on biomarkers derived from circulating blood leukocytes. Circulating biomarkers associated with cells from other dysfunctional organs, including tissue macrophages and vascular cells, might also be involved in a complicated clinical course for sepsis and were not included in this study. Finally, complicated and uncomplicated course outcomes were defined clinically, and so there may be some limitations in that interpretation, which contributes to bias.

Rapid Phenotyping of Complex Critically Ill Patients for Improved Situational Awareness
Sepsis is a highly heterogeneous condition requiring significant clinical resources to identify, manage, and forecast disease trajectories (4). Current work in sepsis biomarker discovery is primarily centered around the use of mortality as a key indicator of outcomes, with the assumption that the disease trajectory is uniform among survivors and non-survivors. Several traditional biomarkers such as Procalcitonin (PCT) and C-reactive protein (CRP) have been extensively used to predict mortality among septic shock patients due to their high prognostic value (72,73). While elevated levels of CRP have been associated with acute inflammation, organ failure, and mortality (74), PCT levels have been used as an indicator for decreased length of antibiotic treatments (75). However, CRP suffers from low specificity in diagnosing patients with systemic inflammation (76), and variation in early PCT levels is dependent on the type and severity of the initial infection and not necessarily on the severity of the disease itself (77). The primary outcome in most of these studies was either diagnosis of sepsis or mortality. The novelty of the proposed biomarker signature discussed here lies in identifying a holistic panel of non-invasive biomarkers to predict complicated sepsis among patients admitted to the PICU. This will minimize heterogeneity and cost in high-risk patient management.
While mortality is an important predictor variable, it does not capture the dynamic nature of several sepsisrelated complications, including, but not limited to disease severity, intervention effectiveness and progressive multi-organ dysfunction. Conversely, the use of complicated course outcomes as a predictor variable is associated with poor outcomes and has been proposed as clinically relevant endpoints in several studies (78,79). The biomarkers described in our study, which are predictive of a complicated course outcome among pediatric patients, will ultimately aid in situational awareness and clinical decision making as it pertains to the degree of care management required for patients suspected of sepsis. Unlike a set of biomarkers that may predict mortality, these collective genes enable clinicians to identify those patients who may likely survive the ICU stay but develop significant complications that result in significant expenditure of resources and clinical interventions.

CONCLUSION
This paper presents a novel list of genes that predict a complicated disease course for critically ill patients in the pediatric intensive care unit. The list of 20 genes was derived from a rigorous variable selection and validation pipeline, wherein we measure variable stability over 10 simulated iterations. While these genes have been previously associated with mortality, in this work we show that these genes are also implicated in complicated course outcomes, even among survivors. Many of the derived genes were attributed to an innate immunity function and contributed to NETosis. The resulting derivation AUROC of 0.82 and validation AUROCs of 0.723, 0.956, 0.83, and 0.82 suggests that these markers can reliably predict the outcome given only a single test of peripheral blood. Finally, this is an explorative study of biomarker discovery using computational predictions derived from publicly available gene expression data. Further evaluation of the performance of these genes in discriminating complicated course outcomes will require experimental validation using closely related prospective datasets and eventually within a prospective trial.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by UTHSC. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
RK conceived and designed the study. SB, RK, and AM developed the method, performed the data analysis, and interpretation. HW and NP interpreted the data and critically revised the article for intellectual content. All authors wrote and proofread the manuscript. This manuscript has been released as a pre-print at bioRxiv (80). This work was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002378. This content was solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.