Risk Factors of Severe Clostridioides difficile Infection; Sequential Organ Failure Assessment Score, Antibiotics, and Ribotypes

We aimed to determine whether the Sequential Organ Failure Assessment (SOFA) score predicts the prognosis of patients with Clostridioides difficile infection (CDI). In addition, the association between the type of antibiotic used and PCR ribotypes was analyzed. We conducted a propensity score (PS)-matched study and machine learning analysis using clinical data from all adult patients with confirmed CDI in three South Korean hospitals. A total of 5,337 adult patients with CDI were included in this study, and 828 (15.5%) were classified as having severe CDI. The top variables selected by the machine learning models were maximum body temperature, platelet count, eosinophil count, oxygen saturation, Glasgow Coma Scale, serum albumin, and respiratory rate. After propensity score-matching, the SOFA score, white blood cell (WBC) count, serum albumin level, and ventilator use were significantly associated with severe CDI (P < 0.001 for all). The log-rank test of SOFA score ≥ 4 significantly differentiated severe CDI patients from the non-severe group. The use of fluoroquinolone was more related to CDI patients with ribotype 018 strains than to ribotype 014/020 (P < 0.001). Even after controlling for other variables using propensity score matching analysis, we found that the SOFA score was a clinical predictor of severe CDI. We also demonstrated that the use of fluoroquinolones in hospital settings could be associated with the PCR ribotype in patients with CDI.


INTRODUCTION
Clostridioides difficile is a toxin-producing and spore-forming Gram-positive anaerobic bacterium that may colonize and cause infections in the human intestinal tract due to dysbiosis resulting from antibiotic treatment (McDonald et al., 2018). C. difficile infection (CDI) is a leading cause of healthcare-associated diarrhea and is a global concern that can exacerbate patient conditions and increase morbidity and mortality (Davies et al., 2014; Lessa et al., 2015; McDonald et al., 2018). Since CDI is associated with worse clinical outcomes in patients, several indicators, including laboratory results like WBC count and serum albumin level, patient symptoms, and hemodynamic changes have been proposed to discriminate disease severity (Surawicz et al., 2013; McDonald et al., 2018; Johnson et al., 2021). Clinical guidelines recommend treatment regimens for patients with CDI based on disease severity classified by these indicators (Johnson et al., 2021; van Prehn et al., 2021). For clinicians, early prediction of patients with CDI at risk of severe disease is important for decision-making and disease management. However, there are no standardized and validated predictive indicators for identifying high-risk groups prior to disease progression.
The Sequential Organ Failure Assessment (SOFA) score is an objective, early obtainable value that is widely used to assess and/or predict a patient's prognosis in infectious disease research. SOFA is used as a measure of sepsis-related organ dysfunction, which can be identified as an acute change of two or more points in the total score (Singer et al., 2016) and is also useful in predicting the prognosis of critically ill patients (Vincent et al., 1998). However, validation studies applying the SOFA score to grade the severity of CDI in patients are still lacking.
The aim of this study was to determine whether the SOFA score predicts the prognosis of patients with CDI at the time of diagnosis. We then identified the top variables, including components of the SOFA score, with the largest impact on the prediction of severe CDI via machine learning (ML) analysis. ML techniques were used to evaluate the importance of clinical indicators along with conventional statistical approaches (Taylor et al., 2016; Chiew et al., 2020; Roimi et al., 2020). Furthermore, we also analyzed polymerase chain reaction (PCR) ribotyping of clinical C. difficile isolates to evaluate their association with the type of antibiotic used in the patient.

MATERIALS AND METHODS
Study Population and Data Collection
We retrospectively extracted data from all adult patients (≥18 years of age) with confirmed CDI from January 2011 to June 2021 using the Severance Clinical Research Analysis Portal. This electronic health records data collection program, which holds information from two university tertiary hospitals and one secondary hospital in South Korea, has been in existence since 2006. CDI was defined as three or more loose stools over a 24-h period together with a positive result for C. difficile toxin on a nucleic acid amplification test (for the toxin B gene) or a rapid antigen/toxin enzyme immunoassay. During the study period, 12,290 positive CDI test results were recorded. Repeat positive results in the same patient, other than the first test record, were excluded (n = 5,782). After selecting the first confirmed case from each of the resulting 6,508 patients, we excluded 1,171 patients who had no demographic information, leaving 5,337 patients for analysis.
We collected patient-level data, including demographics, underlying comorbidities, date of CDI diagnosis, and date of death. We obtained the most abnormal values within 72 h of a hospital visit and CDI diagnosis, by extracting both the maximum and minimum values of laboratory test results and vital signs. In addition, we investigated the use of mechanical ventilation, vasopressors, and the lowest Glasgow Coma Scale (GCS) within the 72-h period. The microbiological test results and history of antibiotic use within 60 days before CDI diagnosis were also recorded. We performed PCR ribotyping of 1,464 isolates collected from patients who could provide stool samples for C. difficile culture, as described in our previous studies (Kim et al., 2011, 2021).

Definition of Clostridioides difficile Infection
Following McDonald et al. (2007), severe CDI (the outcome of interest) was defined as the presence of one or more of the following: intensive care unit admission, need for interventional surgery, or death within 30 days of diagnosis. Progression-free survival (PFS) was defined as the length of time during which patients with CDI remained non-severe while on treatment.

Propensity Score-Matched Analysis
To reduce selection bias arising from differences in patients' baseline condition at the time of hospital visit, which can affect clinical outcomes, we conducted a PS-matched study and conditional logistic regression using the MatchIt package (Heinze and Jüni, 2011). Based on univariate analysis, we selected six variables measured at the time of hospital visit for adjustment: age, sex, the Charlson comorbidity index (Charlson et al., 1987), WBC count, serum albumin, and SOFA score (P < 0.001 for all) (Brookhart et al., 2006; Austin, 2008). We then performed a PS-matched analysis in which we attempted to match each patient with severe CDI to two patients with non-severe CDI (1:2 match) using the nearest-neighbor matching method. A match occurred when the difference in logits of PS was less than 0.2 times the standard deviation of scores.
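The matching rule described above can be illustrated with a minimal Python sketch. It is not the MatchIt implementation used in the study: it assumes the propensity scores have already been estimated, uses a simple greedy nearest-neighbor search without replacement, and takes the caliper as 0.2 times the standard deviation of the logit of the PS across both groups. The function name, patient identifiers, and scores are hypothetical.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def match_with_caliper(severe, non_severe, ratio=2, caliper_sd=0.2):
    """Greedy 1:ratio nearest-neighbor matching without replacement.

    severe, non_severe: dicts mapping a patient id to a propensity score.
    A control is accepted only if its logit(PS) differs from the case's
    logit(PS) by less than caliper_sd * SD of all logit(PS) values
    (assumption: the SD is pooled over both groups).
    """
    case_logit = {i: logit(p) for i, p in severe.items()}
    ctrl_logit = {i: logit(p) for i, p in non_severe.items()}
    pooled = list(case_logit.values()) + list(ctrl_logit.values())
    caliper = caliper_sd * statistics.stdev(pooled)

    available = dict(ctrl_logit)  # controls not yet used
    matches = {}
    # Process cases from highest to lowest score (one common ordering).
    for case_id, value in sorted(case_logit.items(), key=lambda kv: -kv[1]):
        chosen = []
        for _ in range(ratio):
            if not available:
                break
            nearest = min(available, key=lambda c: abs(available[c] - value))
            if abs(available[nearest] - value) >= caliper:
                break  # nearest remaining control is outside the caliper
            chosen.append(nearest)
            del available[nearest]
        if chosen:
            matches[case_id] = chosen
    return matches


# Hypothetical propensity scores, for illustration only.
severe_ps = {"s1": 0.60, "s2": 0.30}
non_severe_ps = {"n1": 0.58, "n2": 0.62, "n3": 0.31, "n4": 0.05, "n5": 0.90}
matched = match_with_caliper(severe_ps, non_severe_ps)
```

In this toy input, the second case finds only one control inside the caliper, which shows why a 1:2 design can yield fewer than two controls for some cases.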

Statistical Analysis
We described patient characteristics using numbers and percentages for categorical variables and medians and interquartile ranges (IQRs) for continuous variables. The statistical significance of differences between groups was tested with Fisher's exact test for qualitative data and the Mann-Whitney U test for quantitative data. We used conditional logistic regression for univariate and multivariate analysis comparing patients with severe and non-severe CDI. Independent variables included in the multivariate analysis were selected based on their statistical significance in the univariate analysis. We employed the Kaplan-Meier estimator to analyze PFS, and differences between groups were assessed using the log-rank test.
All reported P-values were two-tailed, and statistical significance was assumed if P < 0.05. Statistical analyses were performed using R statistical software version 4.1 (R Studio, Inc., Boston, MA, United States).
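For readers less familiar with the Kaplan-Meier estimator used for PFS, the following minimal Python sketch shows the product-limit calculation. It is an illustration only (the study's analyses were run in R); the function name and the toy follow-up data are hypothetical, with an event meaning progression to severe CDI and 0 meaning censoring.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of the survival function.

    durations: follow-up time for each patient.
    events: 1 if the event was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # events observed exactly at time t
        n_leaving = 0  # patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (days) and event indicators.
pfs_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored patients contribute to the risk set until their last follow-up time but never lower the curve themselves, which is the property that distinguishes this estimator from a simple proportion.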

Machine Learning Analysis
Before modeling, all continuous variables were standardized, and missing values were imputed using the median value (Chiew et al., 2020). The dataset was randomly split at a ratio of 4:1 into training and test sets. For each ML model, hyperparameter tuning was performed through a grid search with fivefold cross-validation. Candidate models were trained using the K-nearest neighbor (KNN), decision tree, random forest, light gradient boosting machine (LightGBM), eXtreme gradient boosting (XGBoost), support vector machine (SVM), and artificial neural network (ANN) algorithms. For each algorithm, the model with the highest area under the receiver operating characteristic curve (AUROC) was selected, and its AUROC with 95% confidence intervals (CIs), accuracy, and F1 score (the harmonic mean of precision and recall) were reported (LeDell et al., 2015; Taylor et al., 2016). ML analysis was performed using Python programming software version 3.7.12 (Python Software Foundation, Wilmington, DE, United States).
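The two headline metrics named above can be written out in a few lines of plain Python. This is a sketch of the definitions, not the code used in the study: AUROC is computed as the probability that a randomly chosen severe case receives a higher score than a randomly chosen non-severe case (ties count as one half), and the F1 score as the harmonic mean of precision and recall. The labels, scores, and the 0.5 threshold are hypothetical.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = severe CDI) and predicted probabilities.
labels = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4]
preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed threshold of 0.5

auc_value = auroc(labels, probs)
f1_value = f1_score(labels, preds)
```

Unlike accuracy, the F1 score ignores true negatives, which is why it is often reported alongside AUROC when the outcome of interest is the minority class, as severe CDI is here.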

Ethics Statement
The Institutional Review Board at Severance Hospital, affiliated with the Yonsei University Health System (3-2021-0508), approved this study.

RESULTS
Before Propensity Score Matching
A total of 5,337 adult patients with CDI between January 2011 and June 2021 were included in this study. The demographic and clinical characteristics of patients with severe and non-severe CDI are summarized in Table 1. The median age of the patients was 65 years (IQR, 51-75 years), and 828 (15.5%) had severe CDI. PCR ribotyping of the 1,464 available isolates (27.4% of patients) yielded 88 distinct C. difficile ribotypes. Among them, ribotype 014/020 (R014/020) accounted for the largest proportion (16.3%), followed by R018 (16.0%). Other ribotypes were observed at less than 10.0% each, and the hypervirulent strains were uncommon, with R078 accounting for 3.7% and R027 for 0.9%. At baseline, the severe and non-severe CDI groups showed statistically significant differences in most severity-related variables and epidemiologic characteristics, except for PCR ribotype and body temperature. Patients in the severe CDI group were older, more often male, more often had hospital-onset disease, and had a higher Charlson comorbidity index than those in the non-severe CDI group (P < 0.001 for all). Similarly, the severe CDI group had a higher baseline SOFA score, WBC count, and serum creatinine level, and lower systolic and diastolic blood pressures and serum albumin, than the non-severe CDI group (P < 0.001 for all).

After Propensity Score Matching
After PS matching, baseline characteristics including age, sex, SOFA score, minimum serum albumin, and maximum WBC count of both groups were well-balanced in 767 pairs at a 1:2 ratio and were not statistically different (Table 2). However, both the SOFA score at the time of CDI diagnosis and the rate of an increase in SOFA score of 2 or more points were significantly higher in the severe CDI group than in the non-severe CDI group (P < 0.001 for both). In addition, minimum systolic blood pressure, minimum serum albumin, maximum WBC count, minimum eosinophil count, maximum C-reactive protein (CRP), and maximum total bilirubin still differed significantly.
We used univariate and multivariate analysis with conditional logistic regression to identify risk factors for severe CDI (Table 3). After PS matching, seven independent variables were significant indicators of severe CDI in the univariate analysis. Because the SOFA score at the time of CDI diagnosis and an increase in SOFA score of 2 or more points were collinear, they were analyzed in separate multivariate models. In multivariate analysis model 1, the SOFA score (adjusted odds ratio [aOR], 1.16; 95% CI, 1.11-1.20; P < 0.001), maximum WBC count (aOR, 1.01; 95% CI, 1.00-1.02; P < 0.001), minimum serum albumin (aOR, 0.65; 95% CI, 0.52-0.51; P < 0.001), and ventilator use (aOR, 5.49; 95% CI, 2.23-13.55; P < 0.001) were associated with severe CDI. In multivariate analysis model 2, an increase in SOFA score of 2 or more points was also significantly associated with severe CDI, even after adjusting for other variables (aOR, 2.29; 95% CI, 1.68-3.11; P < 0.001). The ribotype of the strains was not associated with severe CDI.
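To make the size of the SOFA effect more concrete, the per-point odds ratio can be rescaled to a larger difference in score. The short Python sketch below does this on the log-odds scale. It assumes that the SOFA score entered model 1 as a continuous, per-point variable (the text does not state this explicitly) and it works from the rounded published values, so the result is approximate. The function name is hypothetical.

```python
import math


def scaled_odds_ratio(or_per_unit, units):
    """Odds ratio for a difference of `units`, given a per-unit odds ratio.

    In a logistic model the log-odds are linear in the covariate, so the
    odds ratio for k units is exp(k * ln(OR)), i.e. OR ** k.
    """
    return math.exp(units * math.log(or_per_unit))


# Published per-point estimate and 95% CI for the SOFA score (model 1).
sofa_or, sofa_lo, sofa_hi = 1.16, 1.11, 1.20

# Implied odds ratio for a 4-point difference in SOFA score.
or_4 = scaled_odds_ratio(sofa_or, 4)
ci_4 = (scaled_odds_ratio(sofa_lo, 4), scaled_odds_ratio(sofa_hi, 4))
```

Under these assumptions, a 4-point difference in SOFA score corresponds to an odds ratio of about 1.81 (approximately 1.52 to 2.07), holding the other covariates fixed.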

Sequential Organ Failure Assessment Scores in Clostridioides difficile Infection Patients and Comparison of the Predictive Models
The optimal cut-off value of the SOFA score for discriminating severe CDI was 4 points, as shown in the AUROC curve (Supplementary Figure 1). Among all patients, PFS differed significantly between those with a SOFA score ≥ 4 and those with a SOFA score < 4 (log-rank test, P < 0.001). PFS curves for the dichotomized SOFA scores are shown in Figure 1. The SOFA score, quick SOFA (qSOFA) score, and change in SOFA score consequent to CDI differed significantly between the severe and non-severe groups (P < 0.001 for all three indicators).
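The text does not state which criterion was used to pick the optimal cut-off from the ROC curve. One common choice is Youden's J statistic (sensitivity + specificity - 1), and the Python sketch below illustrates it as an assumption, not as the authors' method. The function name and the toy scores are hypothetical and are not intended to reproduce the cut-off reported in the study.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    A patient is classified as positive when score >= cutoff.
    labels: 1 for the outcome of interest, 0 otherwise.
    Ties in J keep the lowest cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < c and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if best is None or j > best[1]:
            best = (c, j)
    return best


# Hypothetical severity scores and outcomes, for illustration only.
toy_scores = [0, 1, 2, 3, 5, 6, 8, 9, 2]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
cutoff, j_stat = youden_cutoff(toy_scores, toy_labels)
```

Youden's J weights sensitivity and specificity equally; a criterion that penalizes missed severe cases more heavily would generally select a lower cut-off.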
The predictive performance of the SOFA score, qSOFA score, and ML models is summarized in Supplementary Table 1. In the analysis for early discrimination of severe CDI, the SOFA score and the change in SOFA score consequent to CDI showed similar performance (AUROC, 0.732; 95% CI, 0.712-0.751 for both; F1 score, 0.400 for SOFA score and 0.403 for changes in SOFA score ≥ 2), and qSOFA showed relatively inferior performance (AUROC, 0.685; 95% CI, 0.665-0.705; F1 score, 0.388). Among the ML algorithms evaluated on the internal test set, the XGBoost classifier showed the highest AUROC value of 0.806 (95% CI, 0.776-0.834), and the LightGBM classifier showed the highest accuracy of 0.859. In addition, the top predictors of the ML models for severe CDI are presented. The importance plot of the XGBoost classifier (Supplementary Figure 2) and the Shapley additive explanation (SHAP) analysis of the LightGBM classifier (Figure 2) show the most important indicators used in the ML analysis. Oxygen saturation, respiratory rate, blood urea nitrogen, GCS, and serum albumin were the top predictors in the XGBoost model, and body temperature, platelet count, eosinophil count, chemotherapy within 2 weeks, and serum lactate were selected in LightGBM.

The Relationship Between the Type of Antibiotic Used and the Main Ribotype of Clostridioides difficile
Table 4 shows the comparison between the two most common ribotypes (R014/020 and R018) in this study and the type of antibiotic used within 60 days before CDI diagnosis. In the period from 2011 to 2014, C. difficile R018 was the most common strain, accounting for 23.6% of all tested isolates; however, the relative incidence of R014/020 subsequently increased, and it became the most common strain. After adjusting for confounding factors, a history of fluoroquinolone use was more strongly associated with CDI due to R018 strains than with CDI due to R014/020 (aOR, 1.96; 95% CI, 1.31-2.93; P < 0.001). The annual incidence of fluoroquinolone prescription per 1,000 inpatient days in our hospitals is illustrated in Supplementary Figure 3 and has continued to decline since 2019.

DISCUSSION
We found that the SOFA score calculated with variables within 72 h of CDI diagnosis was statistically associated with patient outcome, even after PS matching and adjustment for other variables.
The observational approach of our study may have led to selection bias. There were significant systematic differences in the following initial parameters of patients between the severe and non-severe CDI groups: age, sex, underlying comorbidities, rate of proton pump inhibitor use and enteral feeding, and the most abnormal values of vital signs and laboratory test results. Differences in the baseline characteristics of patients known to be associated with severe CDI (Bliss et al., 1998; Loo et al., 2011; Surawicz et al., 2013; Abou Chakra et al., 2015; Trifan et al., 2017; McDonald et al., 2018) can act as confounding factors for clinical outcomes. A PS-matched study is a method of designing observational studies that mimics the characteristics of randomized controlled trials, allowing for a similar distribution of the observed baseline covariates between severe and non-severe CDI groups. Therefore, we conducted a multivariate analysis using PS-matched data to minimize selection bias (Austin, 2008; Heinze and Jüni, 2011; Wombwell et al., 2021).
Several scoring systems have been developed to predict the severity of CDI, but none of them have been validated (Barbut and Rupnik, 2012; Kassam et al., 2016; Ahmed et al., 2021), and there is still no consensus indicator that can be used to differentiate disease severity (Bauer et al., 2012; Surawicz et al., 2013; Debast et al., 2014; McDonald et al., 2018). The SOFA score is a widely accepted predictive model for patients with infectious diseases. It is a validated score that can be used to predict the prognosis of individual patients and helps to compare the quality of care between hospitals and to standardize studies. We included a large number of cases and attempted to control for confounders, which supports the association between the SOFA score and severe CDI. Furthermore, we also presented a dichotomous cutoff of SOFA scores to predict severe CDI using the AUROC and PFS curves in our study.
In addition to the SOFA score, other variables such as WBC count, serum albumin, and ventilator use were also significantly different between patients with severe and non-severe CDI, which is consistent with prior studies (Surawicz et al., 2013; Abou Chakra et al., 2015; McDonald et al., 2018). Certain C. difficile ribotypes, such as R027 and R078, have been shown to be more virulent than others in epidemic settings (He et al., 2013; Hensgens et al., 2013), and fluoroquinolone use was closely correlated with the emergence of CDI due to the resistance of the R027 strain to this antibiotic. However, other studies in non-outbreak settings found that this ribotype did not significantly predict severe CDI (Welfare et al., 2011; Walk et al., 2012; Abou Chakra et al., 2015). In our data, where R027 and R078 together accounted for less than 5% of the available strains, there was no statistical association between C. difficile ribotype and severe CDI. Dingle et al. (2017) reported that restriction of fluoroquinolone use reduced the incidence of CDI in a population-based study in England, mainly driven by the elimination of fluoroquinolone-resistant isolates. Similarly, in our hospital-based data, fluoroquinolone use was associated with the relative incidence of CDI by major PCR ribotypes and was observed more frequently in patients with CDI due to the R018 strain than in those with CDI due to the R014/020 strain. All R018 strains had gyrA mutations and showed resistance to quinolone, whereas only 8.1% of R014/020 isolates had a gyrA mutation. This suggests that the use of fluoroquinolone could act as a selective pressure favoring CDI due to antibiotic-resistant ribotypes (Loo et al., 2005; Muto et al., 2007; Kallen et al., 2009; He et al., 2013; Abou Chakra et al., 2015), and a decrease in the annual prescription of these antibiotics in our centers may have influenced the change in the relative incidence of C. difficile ribotypes.
Traditional multivariate analysis has fundamental limitations in selecting independent variables to be included in the model owing to the effects of multicollinearity and overfitting issues (Baxt, 1994). Comprehensive data analysis through ML can be utilized in conjunction with conventional statistical analysis to evaluate the adequacy of clinical indicators. In our study, we investigated 135 covariates in the clinical data, but only six variables were included in the final statistical models through univariate analysis. ML-based models have the advantage of accommodating non-linear relationships and multicollinearity of variables, which can provide new insights into various fields of clinical medicine (Baxt, 1994; Wiens and Shenoy, 2018). For example, in the SHAP analysis of this study, both the maximum and minimum values of body temperature were selected as the top predictors. The maximum value of body temperature was positively associated with the risk of severe CDI, while the minimum value showed a negative correlation, which is difficult to derive from conventional multiple logistic regression without additional definition and analysis. We conducted ML analysis to predict patients with severe CDI by utilizing all the variables investigated and demonstrated the top variables selected by the algorithms. Of these, serum albumin, maximum body temperature, and eosinophils were consistent with the predictors identified in previous studies (Kulaylat et al., 2018; McDonald et al., 2018), whereas oxygen saturation, GCS, platelet count, and respiratory rate are components of the SOFA or qSOFA score. Thus, these components of the SOFA score contributed to the early prediction of severe CDI.
In addition, the SOFA score showed a relatively high value in the F1 score, a more informative metric for evaluating predictive models on an imbalanced dataset of the outcome of interest (Saito and Rehmsmeier, 2015), and a fair AUROC value for predicting severe CDI (Hosmer et al., 2013). Therefore, in our data, the SOFA score was as good as the ML models in predicting patient prognosis.
Although we included a large number of CDI cases using an electronic data extraction program, our results are limited by the retrospective and single-country nature of the study. Thus, hidden bias and residual confounders might have influenced the generalization of the results, and PCR ribotyping of non-stored C. difficile strains could not be performed. Incomplete sampling may have led us to underestimate the impact of ribotypes on the outcomes of patients with CDI. Furthermore, the hospitals contributing data ranged from secondary to tertiary care centers, and patient populations could be inherently different. However, we tried to analyze risk factors for severe CDI by minimizing selection bias and multicollinearity using a PS-matched study and ML techniques.
Since the clinical course and outcomes of CDI are highly variable, from uncomplicated diarrhea to surgical intervention or death, predictive indicators of severe CDI are required at diagnosis. The SOFA score is a well-validated model in many clinical settings, based on standardized and early obtainable parameters. Even after controlling for other variables using PS-matching analysis, we found that the SOFA score was a clinical predictor of severe CDI. We also demonstrated that the use of quinolones in the hospital setting could be associated with the bacterial ribotype in patients with CDI because of antibiotic resistance.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Institutional Review Board at Severance Hospital, affiliated with the Yonsei University Health System (3-2021-0508). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
MHC analyzed the data. MHC and HK wrote the manuscript. DK, SHJ, HML, and HK collected the samples and clinical data. DK, SHJ, and HML critically read the manuscript. All authors contributed to the article and approved the submitted version.