ORIGINAL RESEARCH article
Front. Neurol.
Sec. Stroke
Volume 16 - 2025 | doi: 10.3389/fneur.2025.1668420
This article is part of the Research Topic: Post-stroke Epilepsy: Risks, Prognosis, and Prevention
Evaluating Machine Learning Models for Stroke Prediction Based on Clinical Variables
Provisionally accepted
- 1Clemson University, Clemson, United States
- 2College of Engineering, Anderson University, Anderson, SC, United States
- 3School of Medicine Greenville, University of South Carolina, Columbia, SC, United States
- 4College of Arts and Sciences, Anderson University, Anderson, SC, United States
- 5Department of Psychology, University of New Hampshire, Durham, NH, United States
Stroke remains one of the leading causes of global mortality and long-term disability, creating an urgent need for accurate, early risk prediction tools. Traditional models such as the Framingham Stroke Risk Score have provided foundational insights into stroke prevention but are constrained by linear assumptions and limited adaptability to complex real-world data. In contrast, machine learning (ML) techniques can model non-linear relationships and interactions among diverse clinical and demographic variables, supporting more personalized and flexible risk prediction. This study evaluates five supervised ML algorithms (Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and K-Nearest Neighbours (KNN)) using a publicly available dataset from Kaggle. Following class imbalance correction, models were assessed using multiple metrics, including accuracy, ROC-AUC, and confusion matrices. Logistic Regression and Gradient Boosting achieved the highest accuracy (95.11%) and ROC-AUC (0.836), although all models showed poor recall, reflecting the difficulty of identifying rare stroke cases. Feature importance analysis using the Random Forest model identified age, average glucose level, and BMI as the most influential predictors of stroke, in line with the Metabolic Syndrome Hypothesis and previous epidemiological findings. These findings underscore both the promise and the current limitations of ML in stroke risk prediction and highlight the need for future research that leverages multimodal datasets and advanced algorithmic strategies to improve sensitivity and clinical utility.
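The contrast between high accuracy and poor recall reported above can be illustrated with a minimal sketch. It does not reproduce the study's models or data: the cohort size, the 5% event rate, the "always predict no stroke" baseline, and the names `ConfusionMatrix`, `confusion_matrix`, and `roc_auc` are all assumptions introduced here for illustration. The sketch only shows how accuracy, recall, and ROC-AUC are derived from labels and scores, and why accuracy alone can be misleading when stroke cases are rare.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (1 = stroke, 0 = no stroke)."""
    tp: int  # stroke cases correctly flagged
    fp: int  # non-stroke cases wrongly flagged
    tn: int  # non-stroke cases correctly cleared
    fn: int  # stroke cases missed

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def recall(self) -> float:
        # Also called sensitivity: share of true stroke cases detected.
        positives = self.tp + self.fn
        return self.tp / positives if positives else 0.0


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionMatrix:
    """Tally the four outcome counts from true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionMatrix(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based ROC-AUC: the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort (NOT the study's data): 1,000 patients, 50 strokes.
y_true = [1] * 50 + [0] * 950

# Assumed baseline that never predicts stroke.
always_negative = [0] * len(y_true)
baseline = confusion_matrix(y_true, always_negative)
print(f"accuracy={baseline.accuracy:.3f} recall={baseline.recall:.3f}")
# prints: accuracy=0.950 recall=0.000

# Toy scores showing that ROC-AUC depends on ranking, not on a threshold.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"toy ROC-AUC={toy_auc:.2f}")
# prints: toy ROC-AUC=0.75
```

In this illustrative setting, a classifier that detects no stroke cases still reaches 95% accuracy, which is why recall and the confusion matrix need to be read alongside accuracy and ROC-AUC when the positive class is rare.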
Keywords: stroke risk prediction, machine learning in healthcare, clinical decision support systems, predictive modelling, feature importance analysis, imbalanced data handling
Received: 17 Jul 2025; Accepted: 20 Aug 2025.
Copyright: © 2025 Akinwumi, Ojo, Nathaniel, Wanliss, Karunwi and Sulaiman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Patrick O. Akinwumi, Clemson University, Clemson, United States
Stephen Ojo, College of Engineering, Anderson University, Anderson, SC, United States
Thomas I Nathaniel, School of Medicine Greenville, University of South Carolina, Columbia, SC, United States
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.