Construction of a risk prediction model for pulmonary infection in patients with spontaneous intracerebral hemorrhage during the recovery phase based on machine learning

Xu, Jixiang; Li, Yuan; Zhu, Fumin; Han, Xiaoxiao; Chen, Liang; Qi, Yinliang; Zhou, Xiaomei

doi:10.3389/fneur.2025.1571755

ORIGINAL RESEARCH article

Front. Neurol., 18 June 2025

Sec. Artificial Intelligence in Neurology

Volume 16 - 2025 | https://doi.org/10.3389/fneur.2025.1571755

Jixiang Xu 1,†

Yuan Li 2,†

Fumin Zhu 1,3,†

Xiaoxiao Han 1,†

Liang Chen 1,†

Yinliang Qi 1,3,†

Xiaomei Zhou 1,4,†,*

1. Department of Hyperbaric Oxygen, The Second People's Hospital of Hefei, Hefei Hospital Affiliated to Anhui Medical University, Hefei, Anhui Province, China
2. Department of Neurology, Dazhou Central Hospital, Dazhou, Sichuan, China
3. Wannan Medical College, Wuhu, Anhui, China
4. Anhui Medical University, Hefei, Anhui, China

Abstract

Objective:

Pulmonary infection (PI) remains a prevalent and severe complication in patients recovering from spontaneous deep subcortical intracerebral hemorrhage (deep SICH). Accurate prediction of PI risk is crucial for early intervention and optimized clinical management. The aim of this study was to develop a machine learning (ML) model for predicting PI risk in patients during the recovery phase of deep SICH and to investigate the contributions of individual risk factors through explainable artificial intelligence techniques.

Methods:

We conducted a retrospective study involving 649 patients in the recovery phase of deep SICH between 2021 and 2023. The cohort was divided into a training set (70%, n = 454) and a testing set (30%, n = 195). Eight key clinical features were identified using the Boruta algorithm: mechanical ventilation, nasogastric feeding, tracheotomy, antibacterial drug use, hyperbaric oxygen therapy, procalcitonin levels, sedative drug use, and consciousness scores. Seven ML algorithms were employed to build predictive models, with performance evaluated based on the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy. The best-performing model was selected, and SHAP (Shapley Additive Explanations) analysis was performed to interpret feature importance.

Results:

Among 649 patients with deep SICH, no significant baseline differences were found between the training (n = 454) and testing (n = 195) sets. The Boruta algorithm identified eight key predictors of pulmonary infection (PI). The random forest (RF) model achieved the highest AUCs: 0.994 (95% CI: 0.989–0.998) in training and 0.931 (95% CI: 0.899–0.963) in testing. DeLong tests showed RF significantly outperformed several models (DT, SVM, LightGBM), while performance differences with XGBoost (p = 0.95), KNN (p = 0.80), and LR (p = 0.22) were not significant. SHAP analysis revealed mechanical ventilation, nasogastric feeding, and tracheotomy as key risk factors, with hyperbaric oxygen therapy and higher consciousness scores showing protective effects.

Conclusions:

This study provides a high-performing and interpretable ML-based risk stratification tool for pulmonary infection in patients during the recovery phase of deep SICH. The integration of SHAP enhances clinical applicability by demystifying complex model outputs, thereby supporting individualized preventive strategies. These findings underscore the promise of explainable AI in advancing neurocritical care and call for prospective multicenter validation and real-time dynamic model adaptation in future research.

Background

Spontaneous intracerebral hemorrhage (SICH) persists as one of the deadliest and most debilitating subtypes of cerebrovascular disease worldwide (1, 2). Despite considerable advancements in surgical techniques and critical care management, up to 50% of patients die within 30 days of SICH onset (3). For those who survive, the stabilization of their condition marks the start of a crucial recovery phase. Although there is no universally standardized definition of the recovery phase in SICH, mounting evidence and clinical experience suggest that the subacute to early chronic stage, ranging from 2 weeks to 6 months post-onset, represents a critical period for functional recovery and rehabilitation. The ESO guidelines recommend initiating rehabilitation within 24 to 48 h after SICH onset, though generally not before 24 h, and usually after clinical stabilization, which typically occurs by the second month (4). Saulle et al. (5) highlighted that clinical research specifically targeting the recovery phase remains scarce and lacks a uniform time definition. Liu et al. (6) demonstrated that initiating rehabilitation ~1 week after onset significantly reduces 6-month mortality and hospitalization duration. Notably, Cao et al. (7) defined the recovery phase in SICH patients as the period between 2 and 6 months post-onset. Kearns et al. (8) further reported that the interval from ~72 h to 14 days post-onset represents a crucial stage of hematoma resolution, inflammation attenuation, and early neurofunctional recovery, thereby supporting the rationale for selecting 2 weeks as a pragmatic threshold for defining the onset of the recovery phase. This definition aligns well with our clinical observations, wherein neurological deficits tend to stabilize and the demand for structured rehabilitation intensifies during this time frame. Drawing upon this converging body of evidence, we pragmatically define the recovery phase in the present study as the period extending from 2 weeks to 6 months following SICH onset.

Anatomically, SICH can be classified into lobar hemorrhage and deep subcortical hemorrhage. Previous research has demonstrated that lobar cerebral hemorrhage is typically associated with a more severe early prognosis and is mainly caused by non-hypertensive mechanisms, such as cerebral amyloid angiopathy, which poses more complex clinical challenges. In contrast, deep subcortical hemorrhage is often attributed to hypertensive causes and is associated with lower early mortality; however, patients remain susceptible to multiple complications during the recovery phase, including pulmonary infection (PI) (9). A meta-analysis of 130,000 post-stroke infection cases found that ~10% of SICH patients in the recovery phase develop PI (10), which increases mortality by ~30% (11, 12). Most existing studies on risk factors (13) for PI and clinical prediction models (14, 15) have predominantly focused on the acute phase of SICH, with limited attention given to the recovery phase. The physiological state of patients during recovery differs markedly from that of the acute phase and represents a critical window for functional restoration. Patients with deep subcortical hemorrhage are often bedridden for extended periods and may experience immunosuppression, making them particularly vulnerable to infections (16). These factors underscore an urgent clinical need for a dedicated risk stratification tool to predict PI specifically in deep SICH patients during the recovery phase.

Machine learning, a technology capable of identifying and learning patterns from large datasets, has shown significant potential in predicting diseases and treatment outcomes within the medical field (17, 18). Compared to traditional statistical models, machine learning methods excel in capturing complex non-linear relationships (19). This study focuses on patients with deep subcortical hemorrhage during the recovery phase and aims to develop a predictive model for PI using several ML algorithms, including logistic regression, random forest, decision tree, k-nearest neighbors, light gradient boosting machine, support vector machine, and extreme gradient boosting. The performance of each model will be evaluated, and the optimal model will be interpreted using SHapley Additive exPlanations (SHAP) (20). Importantly, the goal of this study is not to predict mortality or long-term functional outcomes, but rather to enable early identification of patients at high risk of PI. This facilitates proactive intervention and personalized care strategies during a critical window of neurological recovery. Given the distinct clinical characteristics and complication mechanisms of deep subcortical hemorrhage compared to lobar hemorrhage during recovery, this study offers important value in constructing a targeted prediction model for this specific patient population.

Materials and methods

Study design and patient selection

The study population consisted of 1,021 patients diagnosed with deep SICH and admitted to the Second People's Hospital of Hefei, Anhui Province, China, between January 2021 and December 2023. The inclusion criteria were: (1) diagnosis of deep SICH with confirmation of entering the recovery phase (21); (2) age ≥ 18 years; (3) complete clinical and follow-up data available. The exclusion criteria were: (1) presence of other severe neurological disorders or comorbidities, including but not limited to neurodegenerative diseases (e.g., Parkinson's disease, Alzheimer's disease), intracranial space-occupying lesions (e.g., brain tumors), epilepsy with recurrent seizures, or severe systemic conditions such as end-stage renal disease, advanced chronic obstructive pulmonary disease (COPD), or malignancies with systemic metastasis; (2) incomplete data or loss to follow-up. The initial dataset of 1,021 patients was extracted from the hospital information system (HIS). During preprocessing, patients with incomplete clinical records were excluded based on predefined criteria. All included variables were assessed for missing data using SPSS frequency analysis, and no missing values were detected in the final dataset. Ultimately, 649 patients were included in the analysis, and the flow chart of the selection process is presented in Figure 1.

Figure 1

Flowchart of the process of patient enrollment. Patients with other severe neurological disorders (e.g., Parkinson's disease, Alzheimer's disease, refractory epilepsy) or systemic comorbidities (e.g., end-stage renal disease, advanced COPD, metastatic cancer) were also excluded to ensure population homogeneity. No patients were excluded due to in-hospital or follow-up death. All 649 patients completed the study period without mortality.

The 649 patients included in the study were sequentially numbered based on their admission dates and defined as the overall cohort dataset. Using the “sample()” function in R, the overall cohort dataset was randomly divided into training and testing sets in a 7:3 ratio, comprising 454 patients in the training set and 195 patients in the testing set. This retrospective study was conducted using previously collected electronic medical records at Hefei Second People's Hospital. The study protocol was reviewed and approved by the Ethics Committee of Hefei Second People's Hospital (Ethics Number: 2022-Scientific Research-091). The requirement for informed consent was formally waived by the Ethics Committee, as the study involved no more than minimal risk to the participants, used fully de-identified data, and did not affect patient rights or welfare.
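As an illustration of the 7:3 random split described above, the following Python sketch partitions 649 sequential patient numbers into training and testing IDs. The study itself used the “sample()” function in R; the function name `split_cohort`, the seed value, and the rounding rule are assumptions made for this example and are not taken from the original analysis.

```python
import random


def split_cohort(patient_ids, train_fraction=0.7, seed=2023):
    """Randomly partition patient IDs into training and testing sets."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))  # rounding rule is an assumption
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


patient_ids = range(1, 650)  # 649 patients numbered by admission date
train_ids, test_ids = split_cohort(patient_ids)
print(len(train_ids), len(test_ids))  # 454 195
```

With 649 patients, rounding 0.7 × 649 gives 454 training and 195 testing IDs, matching the set sizes reported in the text.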

Data extraction

In this study, the selection of variables was systematically informed by clinical relevance, evidence-based literature, expert consensus, and the cumulative experience of our multidisciplinary research team. The selection process prioritized variables with plausible associations to the study's primary outcome—namely, the onset and progression of pulmonary infection (PI) in the post-ICH recovery context. The inclusion criteria for candidate variables were delineated as follows:

Clinical relevance

Variables with established significance in clinical practice were given priority. For instance, types of intracerebral hemorrhage (e.g., basal ganglia hemorrhage, brainstem hemorrhage, intraventricular hemorrhage, cerebellar hemorrhage, and thalamic hemorrhage) were included because of their established impact on patient prognosis and their potential to cause secondary complications, including pulmonary infections.

Previous studies

Variables identified as risk factors for PI or outcomes associated with intracerebral hemorrhage in prior research were incorporated. These variables included age, gender, smoking history, and drinking history (15, 22–24).

Biochemical indicators

Biochemical indicators obtained from the laboratory tests closest in time to the diagnosis of PI were selected for their diagnostic value in identifying and monitoring infection progression. The indicators included white blood cell count (WBC), absolute lymphocyte count (ALC), neutrophil percentage (NE%), hemoglobin (Hb), platelet count (PLT), total bilirubin (TBIL), direct bilirubin (DBIL), indirect bilirubin (IBIL), alanine aminotransferase (ALT), aspartate aminotransferase (AST), prealbumin (PAB), albumin (ALB), blood urea nitrogen (BUN), creatinine (Cr), serum potassium (K⁺), C-reactive protein (CRP), serum amyloid A (SAA), procalcitonin (PCT), lactate dehydrogenase (LDH), triglycerides (TG), total cholesterol (TC), prothrombin time (PT), activated partial thromboplastin time (APTT), and D-dimer (D-D).

Intervention-related variables

Variables associated with the provided interventions were included to evaluate their influence on PI outcomes, including the use of broad-spectrum antibiotics, hyperbaric oxygen therapy (HBOT), mechanical ventilation, vasoactive drugs, sedatives, analgesics, and anticoagulants.

Surgical and procedural factors

Factors associated with an increased risk of PI, such as invasive procedures and tracheotomy, were included.

Functional status assessment

The Barthel Index (BI), which measures the activities of daily living, along with variables such as consciousness status score (Glasgow Coma Scale, GCS) and dysphagia score (Standardized Swallowing Assessment, SSA), were selected to evaluate their relationship with the overall functional status of patients and the likelihood of developing PI.

Hospitalization data

Length of hospital stay and number of hospitalizations were included to understand the impact of prolonged or repeated hospitalizations on the risk of PI.

Feature selection and machine learning

This study used the Boruta algorithm to identify key features associated with the risk of PI in patients with deep SICH. The Boruta algorithm (25) is a feature selection method built on the random forest (RF) algorithm. During its application, each original feature is paired with a corresponding shadow feature, which is generated by randomly shuffling the values of the original feature. Both the original and shadow features are used as inputs to train the RF model, and importance scores are calculated for each feature. The algorithm then compares the importance scores of the original features with those of the shadow features, and only features with substantially higher importance than their shadow counterparts are deemed significant and retained in the final feature set. After feature selection, multiple machine learning algorithms were employed to construct a risk prediction model for PI, including Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), k-Nearest Neighbors (k-NN), Light Gradient Boosting Machine (LightGBM), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Each of these algorithms has distinct advantages: LR is mainly used for predicting categorical outcomes based on specific features; RF enhances prediction accuracy by aggregating multiple decision trees through majority voting; DT creates an interpretable tree structure by splitting attributes; k-NN is an instance-based learning approach that requires no explicit model training; LightGBM is optimized for efficiently processing large-scale datasets; SVM classifies by maximizing the margin between classes, making it suitable for high-dimensional data; and XGBoost improves predictive performance by iteratively building decision trees and minimizing the loss function.
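The shadow-feature comparison at the core of Boruta can be sketched in a few lines of Python. This is a simplified illustration, not the Boruta implementation used in the study: absolute Pearson correlation stands in for random forest importance, a simple hit count replaces Boruta's statistical test, and the data, feature names, and function names are all synthetic or hypothetical.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation, used only as a stand-in importance score."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_hits(features, outcome, n_iterations=50, seed=0):
    """Count how often each feature scores above the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iterations):
        # Shadow features: each original column with its values randomly permuted.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, outcome))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_correlation(values, outcome) > threshold:
                hits[name] += 1
    return hits


# Synthetic toy data (not study data): one informative and one noise feature.
rng = random.Random(1)
outcome = [rng.randint(0, 1) for _ in range(200)]
features = {
    "informative": [y + rng.gauss(0, 0.5) for y in outcome],
    "noise": [rng.gauss(0, 1) for _ in outcome],
}
hits = shadow_hits(features, outcome)
print(hits)
```

A feature that consistently beats the best shadow feature, as the informative one does here, is the kind of feature Boruta would retain; a feature that rarely does so would be rejected.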

Model construction and evaluation

In this study, the LR model was developed using the “glm” function, the RF model was constructed using the “randomForest” package, and the DT model was implemented using the “rpart” package. For k-NN, the “knn” function from the “class” package was used; the LightGBM model was developed with the “lightgbm” package, the SVM model was built using the “svm” function from the “e1071” package, and the XGBoost model was constructed with the “xgboost” package. Each model was trained exclusively within the training dataset using 10-fold cross-validation repeated 5 times. This approach was intended to provide robust internal validation and limit overfitting. No hyperparameter tuning was performed; models used default or empirically defined settings. Final evaluation was conducted on the independent test set. Pairwise AUC comparisons between models were performed using the DeLong test, implemented via the “pROC” package in R.
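To make the resampling scheme concrete, the sketch below generates index sets for 10-fold cross-validation repeated 5 times on a training set of 454 patients. It is an illustration in Python rather than the R workflow used in the study; the function name `repeated_kfold`, the seed, and the lack of stratification by outcome are assumptions, since the text does not specify how folds were formed.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, valid_idx in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, train_idx, valid_idx


splits = list(repeated_kfold(n_samples=454))
print(len(splits))  # 50 train/validation splits in total
```

Each of the 50 splits holds out 45 or 46 patients for validation and fits on the remainder, so every training-set patient is used for validation once per repeat.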

The evaluation metrics for the models include accuracy (ACC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC). Additionally, the Shapley Additive Explanations (SHAP) method was applied to further elucidate the contribution of each feature variable to the models (26). SHAP plots visualize the positive and negative contributions of each feature to the model's predictions, enabling the identification of features with significant influence on the prediction of PI risk.
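The evaluation metrics listed above follow directly from the confusion matrix and from the ranking of predicted scores. The Python sketch below computes them on a small made-up example; the labels, scores, and the 0.5 decision threshold are illustrative assumptions and not study data, and the helper names are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """ACC, SEN, SPE, PPV and NPV from binary labels (1 = PI, 0 = no PI).

    Assumes both classes occur among the true labels and the predictions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: true outcomes and predicted probabilities for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
print(metrics)              # every metric equals 0.75 for this example
print(auc(y_true, scores))  # 0.9375
```

Note that the AUC depends only on how the scores rank patients, whereas the other five metrics change with the chosen decision threshold.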

Statistical analysis

Statistical analyses were performed using SPSS 26.0 and R 4.3.3. Continuous variables were compared using the t-test (normally distributed data) or the Mann-Whitney U test (non-normally distributed data). Normally distributed data are presented as mean ± standard deviation, and non-normally distributed data as quartiles. Categorical data were analyzed using the chi-square test or Fisher's exact test and are presented as percentages. A two-sided P < 0.05 was considered statistically significant.
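As a worked example of the categorical comparison, the Python sketch below applies a Pearson chi-square test to a 2 × 2 table, such as the distribution of a binary variable across the training and testing sets. The counts are hypothetical and chosen only to sum to 454 and 195; omitting the continuity correction is also an assumption, and the study's own tests were run in SPSS and R.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = training/testing set, columns = variable present/absent.
chi2, p_value = chi_square_2x2(200, 254, 90, 105)
print(round(chi2, 3), p_value > 0.05)  # 0.244 True
```

A P value above 0.05, as in this hypothetical table, would indicate no significant difference in that variable between the two sets.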