Your new experience awaits. Try the new design now and help us make it even better

ORIGINAL RESEARCH article

Front. Neurol., 12 January 2026

Sec. Artificial Intelligence in Neurology

Volume 16 - 2025 | https://doi.org/10.3389/fneur.2025.1722965

AI-driven integration of imaging and radiology language improves stroke mortality prediction

  • 1Department of Radiology, Jordan University of Science and Technology, Irbid, Jordan
  • 2Department of Internal Medicine, Northwest Healthcare, Tucson, AZ, United States

Background: Quantifying radiologic phenotypes from unstructured text and imaging data may enhance clinical prediction in acute stroke but remains underexplored. This study evaluates the feasibility of automated stroke phenotyping across complementary data sources and assesses whether NLP-derived features from radiology reports improve mortality prediction beyond structured electronic health record (EHR) data.

Methods: Two complementary datasets were analyzed. First, MRI lesion masks from a public dataset (n = 60) were processed using NiBabel to calculate lesion volumes and characterize imaging features as a quantitative reference cohort. Second, 15,492 head CT/MRI reports from the MIMIC-III database were processed through a rule-based NLP pipeline to identify eight key stroke phenotypes: hemorrhage, infarct, midline shift, edema, chronic change, and vascular territories (ACA, MCA, PCA). Each classifier was trained on TF-IDF features using logistic regression and evaluated by AUC and F1-score. Probabilistic outputs from the NLP models were merged with structured admission data (age, sex, ICD-9 codes, length of stay) to predict in-hospital mortality in 3,999 stroke admissions using logistic regression.

Results: In the MRI reference cohort, mean lesion volume was 3.85 × 106 mm3 ± 4.83 × 105, demonstrating the feasibility of automated lesion quantification from open datasets. Across the MIMIC-III cohort, the best NLP models achieved AUC = 0.974 (hemorrhage) and 0.957 (edema) with balanced F1-scores (0.945 and 0.891, respectively). Incorporating text-derived phenotypes into mortality models improved discrimination modestly (AUC 0.616 vs. 0.558; ΔAUC = +0.058). Permutation analysis revealed ICD-9 codes (ΔAUC = 0.091), edema (0.051), and infarct (0.046) as top contributors to mortality prediction.

Conclusion: Automated extraction of stroke phenotypes from both quantitative imaging and radiology text is feasible and reproducible across open datasets. Although MRI lesion volume was not incorporated into mortality models due to dataset limitations, NLP-derived radiologic phenotypes from clinical text provided complementary, interpretable information beyond structured EHR data and modestly improved mortality risk stratification. These findings support the potential of text-derived imaging phenotypes as a scalable and clinically practical enhancement to stroke outcome prediction. Broader validation, incorporation of additional modalities including direct imaging metrics, and workflow-aware implementation strategies will be important next steps toward translating these models into actionable clinical decision support.

Introduction

Stroke remains the second leading cause of death globally, responsible for approximately 5.5 million deaths annually and representing 11.6% of all deaths worldwide (1, 2). The burden extends beyond mortality, with up to 50% of stroke survivors experiencing chronic disability, resulting in substantial economic and social consequences (2). While age-standardized stroke mortality rates have declined globally by 36% between 1990 and 2019, the absolute number of stroke-related deaths and incidents continues to rise due to population aging and growth (1, 3). This epidemiological transition underscores the urgent need for improved stroke prediction and management strategies, particularly in resource-limited settings where the majority of stroke burden resides (4).

The integration of neuroimaging data with clinical information has emerged as a promising approach for enhancing stroke outcome prediction, yet traditional methods often underutilize the wealth of information contained in unstructured radiology reports (5, 6). Recent advances in natural language processing have demonstrated remarkable success in extracting structured phenotypic information from free-text radiology reports, with studies achieving accuracies exceeding 95% for detecting stroke-related features such as hemorrhage, infarction, and vascular territorial involvement (5, 7). Simultaneously, quantitative MRI lesion analysis has shown strong associations with functional outcomes, with lesion volume and location serving as key predictors of stroke severity and recovery (8, 9). However, few studies have systematically combined automated imaging quantification with NLP-derived radiological phenotypes to create comprehensive prediction models for stroke outcomes.

The development of robust, automated approaches for stroke phenotyping and prediction of stroke-related mortality has significant implications for clinical decision-making, resource allocation, and research applications (10, 11). Large clinical databases such as MIMIC-III provide unprecedented opportunities to develop and validate such integrated approaches across diverse patient populations (12). By leveraging both structured clinical data and unstructured radiological text, machine learning models can potentially capture complex disease patterns that individual data modalities might miss, ultimately improving the precision of stroke prognostication and facilitating more personalized treatment approaches (13, 14). Accordingly, the objective of this study was to integrate MRI-based lesion quantification with automated extraction of stroke-related features from MIMIC-III radiology notes and to evaluate their combined utility for predicting stroke-related in-hospital mortality.

Methods

Overview of study design

The objective of this study was to determine whether stroke-relevant information embedded in radiology reports could improve prediction of in-hospital mortality beyond structured electronic health record (EHR) data alone. The analysis proceeded in five stages:

1. Dataset assembly: We combined a large clinical cohort with linked head-imaging reports from MIMIC-III and a smaller MRI cohort providing voxel-wise lesion masks to support cross-cohort comparison.

2. MRI lesion quantification: Lesion masks from the MRI dataset were processed to obtain quantitative lesion volumes. These volumes served as a biological reference for evaluating the plausibility of text-derived phenotypes.

3. Radiology NLP phenotyping: All head-imaging radiology reports in MIMIC-III were preprocessed and classified into eight clinically meaningful stroke phenotypes using supervised natural language processing models.

4. Outcome modeling: Predicted phenotype probabilities were merged with structured EHR variables (age, sex, ICD-9 diagnoses, length of stay). Baseline and enhanced logistic regression models were trained to predict in-hospital mortality, enabling quantification of the incremental value of text-derived phenotypes.

5. Feature importance analysis: Permutation-based ΔAUC analysis was used to identify which structured and text-derived variables contributed most to mortality prediction.

Together, these steps allowed us to evaluate (1) whether radiology-text phenotypes correspond to quantitative neuroimaging patterns, and (2) whether these phenotypes improve mortality risk stratification beyond standard EHR data.

Data sources

This study integrated two complementary datasets:

(1) the publicly available MIMIC-III Clinical Database (v1.4) (1517), a relational database that includes de-identified ICU admissions with linked demographic, diagnostic, and outcome information; and.

(2) the Full Head MRI and Segmentation of Stroke Patients dataset from Kaggle (18), comprising 64 anonymized brain MRI studies with voxel-wise lesion masks and metadata (age, sex, race).

Only adult patients were included. Stroke-related admissions in MIMIC-III were identified from the MIMIC-III table DIAGNOSES_ICD using ICD-9 codes 430–438, which correspond to various forms of cerebrovascular disease. In-hospital outcomes (death, home discharge, hospice) were obtained from the MIMIC-III table ADMISSIONS. A total of 3,999 stroke admissions with corresponding head imaging reports were identified from MIMIC-III.

The MRI dataset originally contained 64 MRI scans; however, four scans corresponded to healthy control subjects and therefore lacked lesion masks. Consistent with the dataset documentation, only the 60 scans with corresponding stroke-related lesion masks were included in the MRI cohort. Demographic variables (age, sex, race) supplied with the dataset were used in subsequent analyses (see Table 1).

Table 1
www.frontiersin.org

Table 1. Summary of datasets and variables.

MRI lesion quantification and processing

MRI lesion masks (*_label_deface.nii) were loaded using NiBabel, a freely available Python library for reading and manipulating NIfTI neuroimaging files. All lesion-volume computations were performed using custom Python scripts to ensure transparent and reproducible processing.

For each scan, voxel dimensions were extracted from the NIfTI affine matrix, which specifies the physical size of voxels in millimeters along each spatial axis. Lesion masks were binarized, and lesion volume was obtained by multiplying the total number of lesion voxels by the voxel volume.

These lesion-volume estimates were then merged with demographic metadata (age, sex, race) to create a unified subject-level MRI dataset.

We quantified lesion volume because it represents a biologically meaningful and widely used imaging measure of stroke severity, allowing us to:

1. Characterize the MRI cohort,

2. Assess demographic or anatomical variability, and

3. Provide a quantitative reference for comparing imaging-based characteristics with text-derived phenotypes in the MIMIC-III cohort.

The mean lesion volume across subjects was 3.85 × 106 mm3 ± 4.83 × 105, with mild variability across the dataset.

Radiology report processing and NLP phenotyping

All radiology reports labeled as “Radiology” were extracted from the MIMIC-III NOTEEVENTS relational table and filtered to include only head or brain imaging studies based on report metadata and free-text descriptors. Reports were preprocessed by lowercasing, removing protected-health-information placeholders, standardizing whitespace, and normalizing common radiology phrases.

Eight stroke-relevant phenotypes were defined: hemorrhage, infarct, midline shift, edema, chronic change, and vascular territories MCA, ACA, and PCA. Binary labels for each phenotype were generated using a rule-based keyword pipeline that incorporated context filters and negation detection (e.g., excluding statements such as “no evidence of hemorrhage”). These labels served as the reference standard for supervised model training.

For each phenotype, we trained a logistic regression classifier using Term Frequency–Inverse Document Frequency (TF–IDF) features, a well-established bag-of-words representation that scales word frequency by its rarity across the corpus to improve signal detection in clinical text. Models were trained on 80% of the labeled reports and evaluated on a 20% held-out set. Model objects were serialized using joblib, a Python utility for efficiently saving and loading machine-learning models to enable full reproducibility.

Model performance was quantified using area under the ROC curve (AUC), precision, recall, and F1-score. F1 was selected because stroke-related findings in radiology reports are often imbalanced (e.g., hemorrhage is less common than chronic changes), and F1 provides a balanced measure when false positives and false negatives are both clinically relevant. In particular, we sought to minimize false negatives (missed findings) while maintaining acceptable precision.

Phenotype–outcome integration and mortality modeling

For each MIMIC-III stroke admission, phenotype probabilities generated by the trained NLP models were assigned to the corresponding hospital encounter. These probabilities were merged with structured electronic health record (EHR) variables, including age, sex, ICD-9 diagnosis codes, and length of stay. ICD-9 codes were converted into binary indicator variables and aggregated at the admission level.

Two logistic regression models were trained to predict in-hospital mortality:

1. Baseline model: included only structured EHR features (age, sex, ICD-9 indicators, and length of stay).

2. Enhanced model: included all baseline features plus the eight text-derived phenotype probabilities (hemorrhage, infarct, midline shift, edema, chronic change, and MCA/ACA/PCA vascular territories).

Model performance was evaluated using 5-fold cross-validation. Discrimination was quantified using area under the receiver operating characteristic curve (AUC) and F1-score. F1 was included because mortality is a class-imbalanced outcome, and F1 provides a balanced metric when both false positives and false negatives carry clinical relevance.

The difference in discrimination between models (ΔAUC) was calculated to quantify the incremental predictive value added by text-derived phenotypes.

Feature importance analysis

To quantify the contribution of each variable to mortality prediction, we applied permutation feature importance using AUC as the scoring metric. For each feature, values in the test set were randomly shuffled (30 repetitions, random_state = 42), and the resulting decrease in model performance (ΔAUC) was calculated. This procedure isolates the unique predictive signal carried by each feature while maintaining the structure of all other variables.

Permutation importance was chosen because it is model-agnostic, directly measures the effect of feature disruption on discrimination, and is robust to multicollinearity—an important consideration when combining structured EHR data with text-derived phenotype probabilities.

ΔAUC values were summarized across repetitions and visualized as a horizontal bar chart comparing contributions from baseline ICD-9 code aggregates and the NLP-derived phenotype probabilities. Larger ΔAUC indicated a greater reduction in predictive accuracy and therefore greater importance to the enhanced model.

Results

Overview

A total of 3,999 stroke-related admissions with linked radiology reports and outcomes were included from MIMIC-III, along with 60 subjects in the MRI lesion-volume cohort. The results are organized to first describe lesion characteristics in the MRI reference dataset, then report the performance of the radiology NLP phenotyping pipeline, followed by mortality prediction results and feature importance analyses evaluating the relative contribution of each predictor.

MRI cohort lesion characteristics

The MRI cohort included 60 subjects with voxel-wise lesion masks. The mean lesion volume was 3.85 × 106 mm3 (SD 4.83 × 105 mm3), reflecting the expected range of lesion sizes observed across mixed ischemic and hemorrhagic stroke presentations. Lesion volume showed a weak inverse correlation with age (Pearson r = −0.17), and no meaningful differences by sex were observed.

The distribution of lesion volumes is shown in Figure 1. The histogram demonstrates a generally unimodal distribution with limited outliers, consistent with natural clinical variability but without pronounced left or right skew.

Figure 1
Bar chart titled

Figure 1. MRI lesion volume distribution across the study cohort. Lesion volumes were computed by multiplying lesion-mask voxel counts by voxel size from each NIfTI header. The dashed line marks the cohort mean (3.85 × 106 mm3), and the shaded band represents ±1 standard deviation (4.83 × 105 mm3). The distribution is approximately unimodal with limited outliers.

Radiology NLP phenotyping

The NLP pipeline successfully processed all 15,492 radiology reports labeled as “Radiology” in MIMIC-III, identifying 3,999 unique stroke admissions with linked clinical outcomes. After text cleaning (lowercasing, PHI removal, and token standardization), each report was classified into eight clinically relevant stroke phenotypes: hemorrhage, infarct, midline shift, edema, chronic changes, and vascular territories (ACA, MCA, PCA).

Hemorrhage and edema were the most common positive findings, whereas vascular territory mentions were less frequent. The highest-performing classifiers were those for hemorrhage (AUC = 0.974, F1 = 0.945) and edema (AUC = 0.957, F1 = 0.891). All models demonstrated balanced precision–recall performance across folds, indicating stable and reproducible extraction of stroke-related imaging phenotypes from radiology text (Figure 2).

Figure 2
Bar chart titled

Figure 2. Phenotype classifier performance (AUC) with F1-scores. Horizontal bars show test AUC for eight radiology-derived stroke phenotypes; F1-scores are annotated in gray. Models used TF-IDF features and logistic regression on a 20% hold-out set.

Mortality prediction models

Two logistic regression models were trained to predict in-hospital mortality.

The baseline model, using structured clinical variables alone (age, sex, ICD-9 codes, length of stay), achieved an AUC of 0.558 (F1 = 0.334). Adding the eight text-derived radiology phenotypes yielded an enhanced model with an AUC of 0.616 (F1 = 0.365), corresponding to an improvement of ΔAUC = +0.058.

To assess statistical robustness, bootstrap resampling was applied after accounting for demographic covariates. The mean ΔAUC was −0.002 [95% CI (−0.035, 0.032)], reflecting that although point estimates favored the enhanced model, the improvement did not reach statistical significance in this sample. Full model comparison results are summarized in Table 2.

Table 2
www.frontiersin.org

Table 2. Baseline vs. enhanced mortality prediction models.

Feature importance analysis using permutation-based ΔAUC identified the relative contribution of each predictor (Figure 3).

Figure 3
Bar chart showing the mean AUC decrease (permutation importance) for various factors. The ICD9_CODE with a score of 0.091 is most significant. Edema_prob and infarct_prob follow with scores of 0.051 and 0.046, respectively. Hemorrhage_prob, chronic_prob, and midline_shift_prob are minor factors. Negative values are shown for territory_PCA_prob, territory_ACA_prob, and territory_MCA_prob. Blue represents ICD-9 aggregate, and green indicates phenotype probabilities.

Figure 3. Feature importance for in-hospital mortality prediction. Bar chart shows mean decrease in AUC (ΔAUC) following random permutation of each feature in the enhanced model. Blue corresponds to the aggregate ICD-9 diagnostic code signal, while orange bars represent probabilities derived from radiology-text phenotypes. Higher ΔAUC values indicate greater contribution to model discrimination. The ICD-9 aggregate provided the strongest single contribution (ΔAUC ≈ 0.09), followed by edema and infarct probabilities derived from radiology text.

Limitations

This study has several limitations. The MRI cohort used for lesion quantification was relatively small (n = 60) and served primarily as a descriptive imaging reference rather than a predictive input. Because these scans originate from an openly available dataset with heterogeneous acquisition protocols and limited clinical annotation, lesion volume was not incorporated into the mortality models, preventing direct assessment of its independent prognostic value. Similarly, all radiology phenotypes were derived from rule-guided keyword extraction rather than manual chart review, which may miss nuanced or context-dependent findings. An additional limitation concerns the diagnostic coding framework: MIMIC-III contains only ICD-9 codes, which are no longer used in most healthcare systems and have lower diagnostic specificity than ICD-10 or later systems. This constraint may reduce granularity in the baseline clinical variables and modestly underestimate the predictive value of structured diagnostic information.

The mortality prediction models were trained exclusively on MIMIC-III, a single-center dataset whose documentation and case mix may not reflect broader clinical practice, and external validation was not possible. Despite these constraints, the study provides a reproducible and extensible framework for integrating radiology text phenotypes with structured EHR data. The NLP labeling pipeline demonstrated strong internal consistency, and the mortality models were rigorously evaluated with cross-validation and bootstrap resampling. Findings consistently showed that text-derived imaging features contribute measurable discriminatory value beyond traditional clinical variables. Although the observed gains were modest, they highlight a scalable strategy for leveraging radiology reports in outcome modeling and motivate future work using larger, multi-institutional datasets with harmonized diagnostic and imaging annotations.

Discussion

This study demonstrates that automated extraction of stroke-relevant phenotypes from radiology reports can be achieved with high accuracy (AUC 0.957–0.974 for hemorrhage and edema) and provides modest but measurable improvement in mortality prediction beyond structured EHR data alone (ΔAUC = +0.058). Permutation-based feature importance revealed that while diagnostic codes remain the strongest predictors (ΔAUC = 0.091), NLP-derived imaging phenotypes, particularly edema (ΔAUC = 0.051) and infarct (ΔAUC = 0.046), contribute independent predictive signal. These findings establish proof-of-concept that routinely generated radiology text can augment clinical risk models without requiring additional imaging processing or manual annotation, though the magnitude of improvement suggests that text-derived features complement rather than replace structured diagnostic information.

The observed gains align with a growing body of evidence examining machine learning approaches to stroke prognostication. Recent work using random forest and logistic regression models with stroke severity scores, patient demographics, and laboratory results has achieved strong performance in predicting 90-day prognosis and in-hospital mortality for hemorrhagic stroke patients (19). Studies incorporating comprehensive feature sets that combines clinical, imaging, and laboratory data have achieved AUCs greater than 0.8 in external validation cohorts (20). The more modest improvement observed here (ΔAUC = +0.058) likely reflects the incremental nature of adding text-derived features to a model that already includes diagnostic codes, which themselves encode substantial prognostic information. This finding suggests that maximum predictive performance may require integration of additional data modalities beyond text-derived phenotypes alone.

The high classification accuracy achieved by the NLP models for hemorrhage (AUC 0.974) and edema (AUC 0.957) is consistent with published benchmarks. Studies using various NLP tools have reported accuracies exceeding 95% for detecting stroke-related features including hemorrhage and infarction from radiology reports (5, 7). This consistency across different approaches and datasets supports the reliability of automated text-based phenotyping for these core imaging findings. The success of rule-based keyword extraction with contextual filters demonstrates that sophisticated architectures are not always necessary when applied to well-structured clinical documentation, enhancing the practical feasibility and scalability of this approach.

As shown in Figure 3, the finding that edema probability (ΔAUC = 0.051) emerged as the strongest contributor among imaging-derived features aligns with clinical knowledge that cerebral edema represents a critical complication associated with increased mortality risk in acute stroke. Similarly, the contribution of infarct probability (ΔAUC = 0.046) reflects the fundamental importance of tissue injury extent in determining outcomes. The minimal contribution from vascular territory indicators and chronic changes suggests that not all extractable radiologic features carry equal prognostic weight, or alternatively, that their predictive value may already be captured by structured diagnostic codes. This pattern underscores the importance of feature selection and model parsimony—incorporating text-derived phenotypes indiscriminately may add complexity without proportional predictive gain.

Explainable machine learning frameworks have enhanced the interpretability and clinical acceptance of predictive algorithms in stroke care (20, 21). Methods such as SHapley Additive exPlanations (SHAP) enable identification and quantification of both conventional and novel predictors within complex models. The permutation-based approach applied here provides similar interpretability benefits while being directly tied to model discrimination (AUC), making the relative importance of each feature immediately transparent to clinicians.

Recent studies have demonstrated the value of incorporating diverse data sources into stroke outcome prediction. Work examining Social Drivers of Health alongside established clinical predictors has shown improved accuracy in predicting stroke outcomes (21). Similarly, dynamic and interpretable machine learning frameworks are being developed to address fluctuating patient trajectories during post-acute care (22). The text-derived phenotypes identified in this study represent another dimension of interpretable features that can be integrated into comprehensive risk models. Unlike raw imaging features that may lack clinical interpretability, phenotypes such as presence of hemorrhage or edema directly correspond to established clinical concepts and can be validated against imaging findings.

The practical implications of integrating NLP-derived radiology phenotypes into clinical prediction models merit consideration. Because radiology reports are routinely generated as part of standard clinical care, this approach offers a scalable mechanism for enriching risk models without requiring additional imaging processing or manual annotation (23). Text-based pipelines have supported automated triage, clinical decision assistance, and EHR-embedded risk stratification tools in other settings (24). However, the modest performance gain observed (ΔAUC = +0.058) raises important questions about cost–benefit tradeoffs. While text-derived features are technically scalable, the incremental predictive value must be weighed against implementation complexity, computational resources, and ongoing model maintenance requirements.

Several challenges must be addressed before widespread implementation, including variability in reporting structure across institutions, the need for external validation across diverse clinical settings, and ensuring interoperability with modern coding systems such as ICD-10 or SNOMED CT, which offer higher specificity than ICD-9 (25). As healthcare systems increasingly adopt multimodal analytics, integrating NLP-derived imaging phenotypes with laboratory values, physiologic time-series data, and raw imaging features may yield more robust and generalizable models (26, 27). The findings presented here suggest that text-derived phenotypes provide a foundational layer that could be augmented with these additional modalities. Overcoming deployment barriers—such as workflow integration, data governance, model monitoring, and clinician interpretability requirements—remains essential to translating these methods into clinically actionable tools (28).

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at Johnson et al. (15) and Birnbaum et al. (18).

Ethics statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements.

Author contributions

AA: Investigation, Resources, Supervision, Funding acquisition, Conceptualization, Writing – review & editing, Project administration, Data curation, Visualization, Formal analysis, Writing – original draft, Software, Validation, Methodology. AM: Writing – review & editing, Investigation, Conceptualization, Resources, Writing – original draft, Supervision, Visualization. LA: Visualization, Writing – review & editing, Writing – original draft.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was used in the creation of this manuscript. Generative artificial intelligence was used solely for language refinement, grammar correction, and formatting assistance during manuscript preparation. The authors reviewed and edited all content to ensure full accuracy and take complete responsibility for the final version of the manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. GBD 2019 Stroke Collaborators. Global, regional, and national burden of stroke and its risk factors, 1990-2019: a systematic analysis for the global burden of disease study 2019. Lancet Neurol. (2021) 20:795–820. doi: 10.1016/S1474-4422(21)00252-0

Crossref Full Text | Google Scholar

2. Donkor, ES. Stroke in the 21st century: a snapshot of the burden, epidemiology, and quality of life. Stroke Res Treat. (2018) 2018:3238165. doi: 10.1155/2018/3238165,

PubMed Abstract | Crossref Full Text | Google Scholar

3. GBD 2016 Stroke Collaborators. Global, regional, and national burden of stroke, 1990–2016: a systematic analysis for the global burden of disease study 2016. Lancet Neurol. (2019) 18:439–58. doi: 10.1016/S1474-4422(19)30034-1,

PubMed Abstract | Crossref Full Text | Google Scholar

4. Feigin, VL, Brainin, M, Norrving, B, Martins, SO, Pandian, J, Lindsay, P, et al. World stroke organization: global stroke fact sheet 2025. Int J Stroke. (2025) 20:132–44. doi: 10.1177/17474930241308142,

PubMed Abstract | Crossref Full Text | Google Scholar

5. Casey, A, Davidson, E, Grover, C, Tobin, R, Grivas, A, Zhang, H, et al. Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports. Front Digit Health. (2023) 5:1184919. doi: 10.3389/fdgth.2023.1184919,

PubMed Abstract | Crossref Full Text | Google Scholar

6. Yu, AYX, Liu, ZA, Pou-Prom, C, Lopes, K, Kapral, MK, Aviv, RI, et al. Automating stroke data extraction from free-text radiology reports using natural language processing: instrument validation study. JMIR Med Inform. (2021) 9:e24381. doi: 10.2196/24381,

PubMed Abstract | Crossref Full Text | Google Scholar

7. Hsu, E, Bako, AT, Potter, T, Pan, AP, Britz, GW, Tannous, J, et al. Extraction of radiological characteristics from free-text imaging reports using natural language processing among patients with ischemic and hemorrhagic stroke: algorithm development and validation. JMIR AI. (2023) 2:e42884. doi: 10.2196/42884,

PubMed Abstract | Crossref Full Text | Google Scholar

8. Gao, W, Wang, M, Lin, J, Huang, J, Cai, L, Chen, X, et al. White matter hyperintensity burden and infarct volume predict functional outcomes in anterior choroidal artery stroke: a multimodal MRI study. Front Neurosci. (2025) 19:19. doi: 10.3389/fnins.2025.1625882,

PubMed Abstract | Crossref Full Text | Google Scholar

9. Zhang, Y, Zhuang, Y, Ge, Y, Wu, P-Y, Zhao, J, Wang, H, et al. MRI whole-lesion texture analysis on ADC maps for the prognostic assessment of ischemic stroke. BMC Med Imaging. (2022) 22:115. doi: 10.1186/s12880-022-00845-y,

PubMed Abstract | Crossref Full Text | Google Scholar

10. Ong, CJ, Orfanoudaki, A, Zhang, R, Caprasse, FPM, Hutch, M, Ma, L, et al. Machine learning and natural language processing methods to identify ischemic stroke, acuity and location from radiology reports. PLoS One. (2020) 15:e0234908. doi: 10.1371/journal.pone.0234908,

PubMed Abstract | Crossref Full Text | Google Scholar

11. Miller, MI, Orfanoudaki, A, Cronin, M, Saglam, H, Kim, ISY, Balogun, O, et al. Natural language processing of radiology reports to detect complications of ischemic stroke. Neurocrit Care. (2022) 37:291–302. doi: 10.1007/s12028-022-01513-3,

PubMed Abstract | Crossref Full Text | Google Scholar

12. Khaled, A, Sabir, M, Qureshi, R, Caruso, CM, Guarrasi, V, Xiang, S, et al. Leveraging MIMIC datasets for better digital health: a review on open problems, progress highlights, and future promises. arXiv. (2025). doi: 10.48550/ARXIV.2506.12808

Crossref Full Text | Google Scholar

13. Pinto, A, Mckinley, R, Alves, V, Wiest, R, Silva, CA, and Reyes, M. Stroke lesion outcome prediction based on MRI imaging combined with clinical information. Front Neurol. (2018) 9:1060. doi: 10.3389/fneur.2018.01060,

PubMed Abstract | Crossref Full Text | Google Scholar

14. Wang, H, Sun, Y, Ge, Y, Wu, P-Y, Lin, J, Zhao, J, et al. A clinical-radiomics nomogram for functional outcome predictions in ischemic stroke. Neurol Ther. (2021) 10:819–32. doi: 10.1007/s40120-021-00263-2,

PubMed Abstract | Crossref Full Text | Google Scholar

15. Johnson, A, Pollard, T, and Mark, R. MIMIC-III Clinical Database (version 1.4) PhysioNet. Cambridge, MA, United States (2016) RRID:SCR_007345.

Google Scholar

16. Johnson, AEW, Pollard, TJ, Shen, L, Lehman, LH, Feng, M, Ghassemi, M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. (2016) 3:160035. doi: 10.1038/sdata.2016.35,

PubMed Abstract | Crossref Full Text | Google Scholar

17. Goldberger, A, Amaral, L, Glass, L, Hausdorff, J, Ivanov, PC, Mark, R, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. (2000) 101:e215–20. doi: 10.1161/01.cir.101.23.e215, RRID:SCR_007345

Crossref Full Text | Google Scholar

18. Birnbaum, AM, Buchwald, A, Turkeltaub, P, Jacks, A, Carra, G, Kannana, S, et al. Full-head segmentation of MRI with abnormal brain anatomy: model and data release. ArXiv. (2025). doi: 10.48550/arXiv.2501.18716

Crossref Full Text | Google Scholar

19. Abujaber, AA, Albalkhi, I, Imam, Y, Yaseen, S, Nashwan, AJ, Akhtar, N, et al. Machine learning-based prediction of 90-day prognosis and in-hospital mortality in hemorrhagic stroke patients. Sci Rep. (2025) 15:16242. doi: 10.1038/s41598-025-90944-x,

PubMed Abstract | Crossref Full Text | Google Scholar

20. Yao, Z, Mao, C, Ke, Z, and Xu, Y. An explainable machine learning model for predicting the outcome of ischemic stroke after mechanical thrombectomy. J Neurointerv Surg. (2023) 15:1136–41. doi: 10.1136/jnis-2022-019598,

PubMed Abstract | Crossref Full Text | Google Scholar

21. Veledar, E, Zhou, L, Veledar, O, Gardener, H, Gutierrez, CM, Brown, SC, et al. Identifying determinants of readmission and death post-stroke using explainable machine learning. PLoS One. (2025) 20:e0332371. doi: 10.1371/journal.pone.0332371,

PubMed Abstract | Crossref Full Text | Google Scholar

22. Petrović, I, Njegovan, S, Tomašević, O, Vlahović, D, Rajić, S, Živanović, Ž, et al. Dynamic, interpretable, machine learning–based outcome prediction as a new emerging opportunity in acute ischemic stroke patient care: a proof-of-concept study. Stroke Res Treat. (2025) 2025:3561616. doi: 10.1155/srat/3561616,

PubMed Abstract | Crossref Full Text | Google Scholar

23. Pons, E, Braun, LMM, Hunink, MGM, and Kors, JA. Natural language processing in radiology: a systematic review. Radiology. (2016) 279:329–43. doi: 10.1148/radiol.16142770,

PubMed Abstract | Crossref Full Text | Google Scholar

24. Chng, SY, Tern, PJW, Kan, MRX, and Cheng, LTE. Automated labelling of radiology reports using natural language processing: comparison of traditional and newer methods. Health Care Sci. (2023) 2:120–8. doi: 10.1002/hcs2.40,

PubMed Abstract | Crossref Full Text | Google Scholar

25. O’Malley, KJ, Cook, KF, Price, MD, Wildes, KR, Hurdle, JF, and Ashton, CM. Measuring diagnoses: ICD code accuracy. Health Serv Res. (2005) 40:1620–39. doi: 10.1111/j.1475-6773.2005.00444.x,

PubMed Abstract | Crossref Full Text | Google Scholar

26. Amador, K, Winder, AJ, Fiehler, J, Barber, PA, Wilms, M, and Forkert, ND. A multimodal multitask deep learning model for predicting stroke lesion and functional outcomes using 4D CTP imaging and clinical metadata. Sci Rep. (2025) 15:38136. doi: 10.1038/s41598-025-21945-z,

PubMed Abstract | Crossref Full Text | Google Scholar

27. Chen, Q, Xia, T, Zhang, M, Xia, N, Liu, J, and Yang, Y. Radiomics in stroke neuroimaging: techniques, applications, and challenges. Aging Dis. (2021) 12:143–54. doi: 10.14336/AD.2020.0421,

PubMed Abstract | Crossref Full Text | Google Scholar

28. Topol, EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. (2019) 25:44–56. doi: 10.1038/s41591-018-0300-7,

PubMed Abstract | Crossref Full Text | Google Scholar

Glossary

ACA - Anterior Cerebral Artery

AI - Artificial Intelligence

AUC - Area Under the Receiver Operating Characteristic Curve

CI - Confidence Interval

CT - Computed Tomography

ΔAUC - Change in AUC

EHR - Electronic Health Record

F1 - Harmonic mean of precision and recall

ICD - International Classification of Diseases (specifically ICD-9 codes were used to identify stroke admissions)

ICU - Intensive Care Unit

LOS - Length of Stay (a structured EHR variable)

MCA - Middle Cerebral Artery

MIMIC-III - MIMIC-III Clinical Database (v1.4)

MRI - Magnetic Resonance Imaging

NIfTI - Neuroimaging Informatics Technology Initiative

NLP - Natural Language Processing

PCA - Posterior Cerebral Artery

PFCML-MT - Personalized Prediction of outcome using Machine learning in patients undergoing Mechanical Thrombectomy.

SD - Standard Deviation (used to describe the mean lesion volume variability)

SHAP - SHapley Additive exPlanations (a framework for explainable machine learning)

TF-IDF - Term Frequency-Inverse Document Frequency (features used in the NLP models)

Keywords: machine learning, MRI lesion quantification, natural language processing, outcome prediction, radiology report phenotyping, stroke

Citation: Alotaibi A, Moustafa A and Abualloush L (2026) AI-driven integration of imaging and radiology language improves stroke mortality prediction. Front. Neurol. 16:1722965. doi: 10.3389/fneur.2025.1722965

Received: 13 October 2025; Revised: 11 December 2025; Accepted: 22 December 2025;
Published: 12 January 2026.

Edited by:

Ian James Martins, University of Western Australia, Australia

Reviewed by:

Yankai Meng, The Affiliated Hospital of Xuzhou Medical University, China
Zeke J. McKinney, HealthPartners Institute, United States

Copyright © 2026 Alotaibi, Moustafa and Abualloush. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Albara Alotaibi, ZHJiYmFyYWFAZ21haWwuY29t

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.