Identification of Endotypes of Hospitalized COVID-19 Patients

Background: Characterization of coronavirus disease 2019 (COVID-19) endotypes may help explain variable clinical presentations and response to treatments. While risk factors for COVID-19 have been described, COVID-19 endotypes have not been elucidated. Objectives: We sought to identify and describe COVID-19 endotypes of hospitalized patients. Methods: Consensus clustering (using the ensemble method) of patient age and laboratory values during admission identified endotypes. We analyzed data from 528 patients with COVID-19 who were admitted to telemetry capable beds at Columbia University Irving Medical Center and discharged between March 12 and July 15, 2020. Results: Four unique endotypes were identified and described by laboratory values, demographics, outcomes, and treatments. Endotypes 1 and 2 included few intubated patients (1 and 6%) and exhibited low mortality (1 and 6%), whereas endotypes 3 and 4 included many intubated patients (72 and 85%) and had elevated mortality (21 and 43%). Endotypes 2 and 4 had the most comorbidities. Endotype 1 patients had low levels of inflammatory markers (ferritin, IL-6, CRP, LDH), low infectious markers (WBC, procalcitonin), and a low degree of coagulopathy (PTT, PT), while endotype 4 had higher levels of those markers. Conclusions: Four unique endotypes of hospitalized patients with COVID-19 were identified, which segregated patients based on inflammatory markers, infectious markers, evidence of end-organ dysfunction, comorbidities, and outcomes. A high comorbidity burden was not associated with the poor-outcome endotypes. Further work is needed to validate these endotypes in other cohorts and to study endotype differences in treatment response.


INTRODUCTION
Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-COV-2), has demonstrated a wide variety of clinical courses, including asymptomatic carriers (1), mild disease (2), brief hospitalizations (2), prolonged ICU courses (3,4), and COVID-19 "long-haulers" with prolonged symptoms (5). The spectrum of disease seems broader than the spectrum caused by other respiratory viruses, such as non-SARS-COV-2 coronaviruses. The international scientific community is currently endeavoring to understand the biological constructs that influence the course of disease after SARS-COV-2 infection. Improved understanding of the biological underpinnings of different COVID-19 courses could improve diagnosis, triage, management, and prognosis for patients.
Understanding endotypes of disease can shed light on biological underpinnings of disease and identify those who are most susceptible. Endotypes are subtypes of a clinical condition which possess distinct functional or pathobiological mechanisms (with an implicit variable likelihood of response to therapies across endotypes). It is envisaged that patients with a specific endotype present themselves within phenotypic clusters of disease, and because of the mechanistic differentiation, show response to specific therapies. Endotypes consist of subsets of the disease itself, rather than biological constructs which may or may not progress to disease (14). This approach has been used to describe subgroups in asthma (15), sepsis (16-19), trauma (20), and acute respiratory distress syndrome (ARDS) (21).
In clinical practice, baseline comorbidities and/or initial lab values do not explain the full range of COVID-19 presentations that are seen. We hypothesize that COVID-19 endotypes identified based on observable characteristics of the entire hospitalization (age and a representation of laboratory values) will reveal unexpected clinical courses and outcomes that defy prediction using classic risk factors. This approach contrasts with initial reports of clustering COVID-19 patients, which used initial laboratory values and clinical variables collected in the first 24 h (22) or 72 h (23); demographics, comorbidities, and maximum laboratory values (24); or principal component analysis (PCA) and k-means on 18 initial laboratory values, of which six were used in the final analysis (25). Additionally, clusters have been created from initial ICU clinical data for patients with COVID-19 ARDS (26) and from ICU patients using demographics, initial ICU labs, and other clinical variables (27). Finally, there have been descriptions of a hyperinflammatory phenotype identified by initial admission labs (28) or by serial labs using cluster analysis of three laboratory values (29).
In this study, we sought to uncover endotypes of the hospitalized COVID-19 patient population using a robust clustering method (consensus clustering of ensemble classification) on patient age and laboratory values over the course of hospital admission. These endotypes were examined for insights into comorbidities, expected clinical courses, and outcomes including intubation, length of stay (LOS), and mortality.

Participants
Adults (18 years or older) admitted consecutively to a telemetry capable bed at NewYork-Presbyterian Hospital/Columbia University Irving Medical Center were included in the study if they had a positive SARS-COV-2 nasopharyngeal PCR test during their inpatient admission and were discharged between March 12, 2020 and July 15, 2020. For patients with multiple admissions with a positive SARS-COV-2 nasopharyngeal PCR test, only data from the first admission were included. If a patient had a positive SARS-COV-2 test (any type) more than 21 days before the admission, the patient was excluded. Patients were identified prospectively for inclusion in the study cohort, but their laboratory information, outcomes, and past medical history were collected retrospectively. Clinical data were collected before clustering, so the investigators were blinded to endotype at the time of data collection. This study was approved by the Columbia University Institutional Review Board.
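The inclusion rules above can be read as a simple filter over admission records. The following Python sketch is illustrative only: the `Admission` record, its field names, and the inclusive treatment of the window endpoints are assumptions made here, and the telemetry-bed criterion is not modeled.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Admission:  # hypothetical record layout, not the study's data model
    patient_id: str
    age: int
    admitted: date
    discharged: date
    positive_pcr_during_admission: bool
    earliest_prior_positive: Optional[date] = None  # any test type, before admission

STUDY_START, STUDY_END = date(2020, 3, 12), date(2020, 7, 15)

def eligible(a: Admission) -> bool:
    """Apply the age, PCR, discharge-window, and 21-day prior-positive rules."""
    if a.age < 18 or not a.positive_pcr_during_admission:
        return False
    if not (STUDY_START <= a.discharged <= STUDY_END):  # endpoints assumed inclusive
        return False
    prior = a.earliest_prior_positive
    if prior is not None and (a.admitted - prior) > timedelta(days=21):
        return False
    return True

def select_cohort(admissions):
    """Keep only the first eligible admission per patient."""
    first = {}
    for a in sorted(admissions, key=lambda adm: adm.admitted):
        if eligible(a) and a.patient_id not in first:
            first[a.patient_id] = a
    return list(first.values())

# Made-up records that exercise each rule.
demo = [
    Admission("a", 70, date(2020, 3, 20), date(2020, 3, 30), True),
    Admission("a", 70, date(2020, 5, 1), date(2020, 5, 9), True),    # readmission
    Admission("b", 17, date(2020, 4, 1), date(2020, 4, 5), True),    # under 18
    Admission("c", 55, date(2020, 6, 1), date(2020, 6, 10), True,
              date(2020, 4, 1)),                                     # prior positive > 21 days
    Admission("d", 60, date(2020, 7, 10), date(2020, 7, 20), True),  # discharged too late
]
cohort = select_cohort(demo)
```

Only the first admission of patient "a" survives in this toy example; the other records each violate one rule.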

Features Used for Clustering
Features previously shown to correlate with the clinical course or outcomes of COVID-19 were considered. Laboratory values and age were used to identify endotypes (complete list available in Supplementary Material 1). Both the median and the interquartile range (IQR) of all values of each lab for a patient during admission were used as features. Features missing in more than 40% of patients were excluded from analysis.
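As a rough illustration of this feature construction, the sketch below summarizes each patient's repeated lab measurements as a median and an IQR and drops labs that are missing in more than 40% of patients. The input layout, the function names, the quartile method, and the choice to give a single measurement an IQR of zero are assumptions made for the example; how missing values in retained features were handled is not stated here, so they are left as `None`.

```python
from statistics import median, quantiles

def summarize_lab(values):
    """Return (median, IQR) for one patient's measurements of one lab."""
    if not values:
        return None  # lab never measured for this patient
    if len(values) < 2:
        return (values[0], 0.0)  # assumption: a single draw has zero spread
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (median(values), q3 - q1)

def build_features(patients, max_missing=0.40):
    """patients: {patient_id: {"age": int, "labs": {lab_name: [values, ...]}}}"""
    lab_names = sorted({lab for p in patients.values() for lab in p["labs"]})
    n = len(patients)
    kept = []
    for lab in lab_names:
        missing = sum(1 for p in patients.values() if not p["labs"].get(lab))
        if missing / n <= max_missing:  # drop only if missing in MORE than 40%
            kept.append(lab)
    rows = {}
    for pid, p in patients.items():
        row = {"age": p["age"]}
        for lab in kept:
            summary = summarize_lab(p["labs"].get(lab, []))
            med, iqr = summary if summary else (None, None)
            row[f"{lab}_median"], row[f"{lab}_iqr"] = med, iqr
        rows[pid] = row
    return rows, kept

# Made-up values: ferritin is missing in 2 of 3 patients and is dropped.
toy = {
    "p1": {"age": 66, "labs": {"crp": [10.0, 20.0, 30.0], "ferritin": [500.0]}},
    "p2": {"age": 71, "labs": {"crp": [5.0, 15.0]}},
    "p3": {"age": 58, "labs": {"crp": [8.0]}},
}
features, kept_labs = build_features(toy)
```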

Variables Used to Examine the Resulting Endotypes
Patient disposition was the primary outcome. Intubation status, length of intubation, length of stay, patient age, race, sex, comorbidities, and treatment with medications commonly used with COVID-19 patients were collected (complete list available in Supplementary Material 2).

Statistical Analyses
FIGURE 1 | Data collection and analysis schematic. Patients with positive SARS-COV-2 tests who were discharged between March 12, 2020 and July 15, 2020 were included in the study. Labs during hospitalization (median and IQR) and age were the features used for clustering and endotype discovery. Once endotypes were identified, they were analyzed for differences in demographics, outcomes, comorbidities, and treatments.
A schematic presentation of data collection and analysis can be seen in Figure 1. To discover endotypes, we relied on cluster analysis, which generally divides datasets into groups by minimizing the intra-group distance while maximizing the inter-group distance. Instead of using a single clustering algorithm, here we employed ensemble classification (30) by running multiple clustering algorithms (K-means, BIRCH, Gaussian mixture model, and agglomerative clustering) and integrating their results. Then, we applied consensus clustering (31) to the results of ensemble classification. Consensus clustering is a robust approach that relies on multiple iterations of the sampled dataset to derive more stable and meaningful clusters and has been widely used to identify biologically meaningful clusters. In our work, the consensus of the ensemble clustering was implemented with 50 bootstraps, each using 80% of the data.
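The ensemble and consensus steps can be sketched roughly as follows. This is not the study's implementation: it uses scikit-learn and SciPy, and several choices are assumptions made for illustration, namely drawing each 80% subsample without replacement, the default hyperparameters, pooling the four algorithms with equal weight in a co-association matrix, and cutting an average-linkage tree to obtain final labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.mixture import GaussianMixture

def consensus_matrix(X, k, n_boot=50, frac=0.8, seed=0):
    """Fraction of runs in which each pair of samples was co-clustered."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))  # runs where i and j shared a cluster
    sampled = np.zeros((n, n))   # runs where i and j were both subsampled
    for _ in range(n_boot):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        Xs = X[idx]
        rs = int(rng.integers(1_000_000))
        label_sets = [
            KMeans(n_clusters=k, n_init=10, random_state=rs).fit_predict(Xs),
            Birch(n_clusters=k).fit_predict(Xs),
            GaussianMixture(n_components=k, random_state=rs).fit_predict(Xs),
            AgglomerativeClustering(n_clusters=k).fit_predict(Xs),
        ]
        for labels in label_sets:
            same = (labels[:, None] == labels[None, :]).astype(float)
            together[np.ix_(idx, idx)] += same
            sampled[np.ix_(idx, idx)] += 1.0
    # Pairs never subsampled together keep a consensus of 0.
    return np.divide(together, sampled, out=np.zeros_like(together),
                     where=sampled > 0)

def consensus_labels(M, k):
    """Cut an average-linkage tree built on 1 - consensus into k groups."""
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")

# Synthetic two-blob data, used only to show the call pattern.
demo_rng = np.random.default_rng(1)
X_demo = np.vstack([demo_rng.normal(0, 0.3, (20, 3)),
                    demo_rng.normal(5, 0.3, (20, 3))])
M_demo = consensus_matrix(X_demo, k=2, n_boot=10)
labels_demo = consensus_labels(M_demo, k=2)
```

On well-separated synthetic data the consensus matrix is close to block-diagonal; on real laboratory features the blocks would be much less crisp, which is why stability across K needs to be examined.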
The stability of the consensus matrices (as the cluster number K varied from 2 to 10) was measured by obtaining their cumulative distribution function (CDF) as described by Monti et al. (31). Then, for each value of K, the proportional increase in area under the CDF (Δ(K)), the Calinski-Harabasz score (CH) (32), and the Davies-Bouldin score (DBS) (33) were calculated and compared to determine the optimal number of clusters. Finally, to visualize the underlying structure of the data, we generated dendrograms by applying hierarchical clustering to the consensus matrices. Pseudocode of our clustering approach is provided in Supplementary Material 3.
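A minimal NumPy sketch of the CDF-based stability measure is given below. The grid resolution, the trapezoidal integration, and the indexing convention for Δ(K) (relative change in area from the previous K, with the smallest K reporting its own area) are assumptions made here; conventions differ between implementations. The CH and DB scores are not reimplemented; scikit-learn provides them as `calinski_harabasz_score` and `davies_bouldin_score` in `sklearn.metrics`.

```python
import numpy as np

def cdf_area(M, n_grid=101):
    """Area under the empirical CDF of the off-diagonal consensus values."""
    upper = np.triu_indices_from(M, k=1)
    vals = np.sort(M[upper])
    grid = np.linspace(0.0, 1.0, n_grid)
    # CDF(c) = fraction of sample pairs with consensus <= c
    cdf = np.searchsorted(vals, grid, side="right") / vals.size
    # Trapezoidal rule, written out explicitly
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(grid)))

def delta_area(areas):
    """areas: {K: area under CDF}. Returns {K: proportional increase}."""
    ks = sorted(areas)
    delta = {ks[0]: areas[ks[0]]}  # smallest K reports its own area
    for prev, cur in zip(ks, ks[1:]):
        delta[cur] = (areas[cur] - areas[prev]) / areas[prev]
    return delta

# A perfectly crisp consensus matrix for two clusters of two samples each.
crisp = np.kron(np.eye(2), np.ones((2, 2)))
area_demo = cdf_area(crisp)

# Made-up areas, used only to show how delta flattens as K grows.
deltas_demo = delta_area({2: 0.50, 3: 0.60, 4: 0.63})
```

A common heuristic is to choose the K beyond which Δ(K) stops increasing appreciably, then check that choice against the other scores and the dendrogram.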
To compare the differences between endotypes, the Kruskal-Wallis test (34) and Dunn's multiple comparison test (35) were used for continuous variables, and chi-square tests were used for categorical variables. A p-value <0.05 was considered significant. The analysis was performed in MATLAB (The MathWorks, Inc., Natick, MA) and Python (www.python.org), where we used the OpenEnsembles library (36) to perform the consensus clustering.
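For readers who want to reproduce this style of comparison in Python, the sketch below runs a Kruskal-Wallis test on a continuous variable and a chi-square test on a contingency table using SciPy. All numbers are synthetic and are not study data; the group sizes and distributions are arbitrary. Dunn's post hoc test is not part of SciPy and is omitted.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal

ALPHA = 0.05
rng = np.random.default_rng(0)

# Synthetic length-of-stay values (days) for four hypothetical groups.
los_by_group = [
    rng.gamma(shape=2.0, scale=3.0, size=40),
    rng.gamma(shape=2.0, scale=4.0, size=40),
    rng.gamma(shape=4.0, scale=10.0, size=40),
    rng.gamma(shape=4.0, scale=9.0, size=40),
]
h_stat, p_continuous = kruskal(*los_by_group)

# Synthetic contingency table: rows = groups, columns = (yes, no).
table = np.array([[2, 98], [6, 94], [70, 30], [85, 15]])
chi2_result = chi2_contingency(table)
chi2_stat, p_categorical = chi2_result[0], chi2_result[1]

print(f"Kruskal-Wallis: H={h_stat:.1f}, significant={p_continuous < ALPHA}")
print(f"Chi-square: chi2={chi2_stat:.1f}, significant={p_categorical < ALPHA}")
```

A significant omnibus result only indicates that at least one group differs; pairwise comparisons with a multiplicity correction are needed to say which ones.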

RESULTS
Five hundred forty-four patients were identified prospectively for inclusion in the study. Sixteen patients were missing all laboratory data and were therefore excluded from analysis, leaving 528 patients in the final cohort. Baseline characteristics of the final cohort, their comorbidities, and hospital course are outlined in Table 1. In the study cohort, the median age was 66 (IQR 55-74), 209 (40%) were female, 103 (19.5%) were African American or Black, 1 (0.2%) was American Indian or Alaska Native, 7 (1.3%) were Asian,

Endotype Descriptions
Features missing in more than 40% of patients were excluded from further analysis: blood pH, blood pCO2, blood pO2, β-d-glucan, ionized calcium, and fibrinogen. After considering cluster quality and stability by examining the CDF plot, the measured Δ(K), CH, and DBS, and the underlying structure of the data using dendrograms (Supplementary Material 4), we opted for K = 4, which identified four endotypes.
Median values of the clustering features for each of the four endotypes are outlined in Table 2 and Supplementary Material 5. All of the features differed significantly across the endotypes except for median bilirubin and age (p > 0.05). Characteristics of the endotypes are outlined in Table 3. Some comorbidities varied significantly across endotypes (i.e., CKD, ESRD, HTN, DM, COPD, heart failure with reduced ejection fraction [HFrEF], and obesity), while others (asthma, hyperlipidemia, HIV infection, history of stroke, heart failure with preserved ejection fraction, and heart failure with unknown EF) did not differ significantly. Treatments differed by endotype (p < 0.05) except for remdesivir and prednisone. Mortality and hospital discharge rates also varied by endotype (Figure 2). Paired comparisons of characteristics are provided in Supplementary Material 6. A summary of the four endotypes is shown in Figure 3.
Endotype 1 patients had a median age of 68 years, included the most women (46%), and had the lowest mortality (1%), the shortest hospital length-of-stay (median: 5 days), and the fewest intubated patients (1%). This endotype had the lowest prevalence of HTN and DM and the greatest prevalence of COPD. Endotype 1 patients had the lowest inflammatory markers (ferritin, IL-6, CRP, ESR, LDH), lowest infectious markers (WBC, procalcitonin), and lowest degree of coagulopathy (PT and PTT, although not significantly lower than endotype 2). Endotype 1 patients received the reviewed medications least often of any endotype (except for enoxaparin) but overall had medication use similar to that of endotype 2 (except for hydroxychloroquine and methylprednisolone).
Endotype 3 patients had a median age of 66 years, included approximately the cohort average of women (42%), exhibited a mortality of 21%, had the longest hospital length-of-stay (median: 41 days), and had the second-highest prevalence of intubation (72%). Patients in this endotype had a relatively low number of comorbidities. Endotype 3 patients had inflammatory markers similar to those of endotype 2 (ferritin, CRP, ESR, and LDH, but not IL-6, which was significantly higher), the second-highest infectious markers (WBC and procalcitonin, although procalcitonin was not significantly higher than in endotype 2), and the second-highest coagulopathy markers (PT and PTT, although PT was not significantly lower than in endotype 4). Endotype 3 patients received the reviewed medications at rates similar to those of patients in endotype 4 (except for enoxaparin, heparin, and hydrocortisone).
Endotype 4 patients had a median age of 64 years, included the fewest women (27%), had the highest mortality (43%), had a fairly long hospital length-of-stay (median: 37 days), and were the most frequently intubated (85%). This endotype had a moderate prevalence of CKD and ESRD, a higher prevalence of HTN, and the highest prevalence of obesity. Endotype 4 patients had the highest inflammatory markers (ferritin and LDH were significantly higher than in endotype 3, while IL-6 and CRP were similarly high), the highest infectious markers (WBC, procalcitonin), and the greatest degree of coagulopathy (PT and PTT, although PT was not significantly higher than in endotype 3). The exception was ESR, which was lower than in endotypes 2 and 3. Endotype 4 patients received the most of the reviewed medications (except for enoxaparin and hydroxychloroquine). Of the medications, only hydrocortisone and heparin were used significantly more than in endotype 3.

DISCUSSION
Our study has three main findings. First, four distinct groups of patients were identified through consensus clustering of ensemble classification using age and laboratory values over the entire hospitalization as features. The groups as a whole did not vary significantly by age or race but differed in sex as well as comorbidities. We consider these patient subgroups to comprise endotypes (14) since the data used to segregate them include variables that are indicative of physiologic and inflammatory dysfunction. The endotypes were also treated with differing medications in the hospital. Endotypes 1 and 2 exhibited low mortality and short length of stay. However, endotype 2 had slightly worse outcomes and slightly higher inflammatory and organ damage markers. Endotypes 3 and 4 had higher mortality and longer length of stay, with endotype 4 having a markedly high mortality at 43% and the highest levels of markers of inflammation and end-organ dysfunction.
Second, we identified endotypes of COVID-19 patients with widely disparate outcomes that were not expected based on classic risk factors such as age, sex, and preexisting comorbidities (3,6). We documented patients with lower-risk features who had worse courses than traditionally expected. Endotype 2 had the greatest number of comorbidities overall but a relatively low mortality. Focusing on comorbidities alone would have resulted in misclassification of endotype 2 patients. Along the same lines, endotype 3 had far fewer comorbidities than endotype 2, and yet endotype 3 had significantly worse outcomes. IL-6, d-dimer, and WBC were significantly higher in endotype 3 than in endotype 2. Further examination of the different endotypes has the potential to yield clinical and pathobiological insight into what is driving the vastly different clinical courses experienced by patients with COVID-19.
Third, consensus clustering of ensemble classification (37) supported the previously hypothesized existence of subgroups of COVID-19 manifestations. In part because elevated inflammatory markers such as C-reactive protein, ferritin, and IL-6 were associated with poor outcomes (38,39), steroids were studied and proven effective at treating severe COVID-19 (5). Patients meeting proposed criteria for COVID-19-associated hyperinflammatory syndrome (including fever; ferritin and d-dimer elevation; NLR elevation or anemia/thrombocytopenia; LDH or AST elevation; and IL-6, triglyceride, or CRP elevation) were shown recently to have a higher risk of requiring mechanical ventilation and a higher risk of mortality (13). The endotypes we identified that have higher levels of circulating inflammatory markers have worse outcomes than patient clusters with lower inflammatory markers. This appears to hold true even among groups with high intubation rates: most patients in endotypes 3 and 4 were intubated, yet mortality and inflammatory markers were notably higher in endotype 4. Endotype 4 patients also had notably higher procalcitonin levels, a potential indication that these patients with higher inflammatory markers may have experienced more (or more severe) bacterial infections.
Identification of endotypes has several potential useful functions. Endotypes may point to unique pathobiologic mechanisms of disease that warrant further investigation in each specific subset of patients. Different endotypes may respond differently to treatments and may explain the heterogeneity of disease course. Examining endotypes for differential response to treatments could identify subsets of patients where treatments are beneficial. If endotypes can be identified early in disease course, endotypes can offer prognostic and clinical management information. Future studies will need to validate these endotypes.
There are several limitations to our study. First, this is a single-center study that prospectively collected data from patients admitted to telemetry capable beds. We have not validated the endotypes in the setting of more recent SARS-COV-2 variants. However, in the setting of this fast-moving disease, validation of endotypes in the setting of the most recent variant will continue to be a challenge for any large COVID-19 cohort study. Second, some lab variables had a high proportion of missing data. These variables were dropped, which may have introduced some bias. Third, standard of care treatments for patients with COVID-19 changed over time, so the treatments each endotype received may also have changed over time. Dosing data for medications were not available; therefore, anticoagulation medications were not classified as prophylactic or therapeutic. Fourth, the admission criteria for patients with COVID-19 may have changed over time.
In conclusion, disease endotypes have the potential to describe a subset of patients who are undergoing shared biologic processes resulting in a similar phenotype of disease and may identify groups of patients with different clinical courses and responses to therapy. However, having certain high or low risk features does not guarantee association with a certain outcome; rather, patients with certain features appear to have one of multiple different clinical courses. In this cohort of patients hospitalized with COVID-19, we identified four unique endotypes of patients by clustering laboratory values throughout the hospitalization together with patient age. The endotypes had differences in inflammatory markers, infectious markers, evidence of end-organ dysfunction, comorbidities, and outcomes. Further work is needed to validate these endotypes in other cohorts and to study endotype differences in treatment response.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Materials; further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Columbia University. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.