A machine learning model for visualization and dynamic clinical prediction of stroke recurrence in acute ischemic stroke patients: A real-world retrospective study

Background and purpose Recurrent stroke accounts for 25–30% of all preventable strokes. This study was conducted to establish a machine learning-based clinical prediction model for stroke recurrence within 1 year in patients with acute ischemic stroke (AIS). Methods A total of 645 AIS patients at The Second Affiliated Hospital of Xuzhou Medical University were screened, included, and followed up for 1 year, and comprehensive clinical data were collected. Univariate and multivariate logistic regression (LR) were used to screen for risk factors of stroke recurrence. The data set was randomly divided into a training set and a test set at a ratio of 7:3, and the following six prediction models were established using machine learning algorithms: random forest (RF), naive Bayes classifier (NBC), decision tree (DT), extreme gradient boosting (XGB), gradient boosting machine (GBM), and LR. The model with the strongest prediction performance was selected by 10-fold cross-validation and receiver operating characteristic (ROC) curves, and model interpretability was investigated with SHAP. Finally, the model was made available through a web calculator. Results Logistic regression analysis showed that right hemisphere, homocysteine (HCY), C-reactive protein (CRP), and stroke severity (SS) were independent risk factors for stroke recurrence in AIS patients. In 10-fold cross-validation, the area under the curve (AUC) ranged from 0.777 to 0.959. In ROC curve analysis, the AUC ranged from 0.887 to 0.946. The RF model had the best ability to predict stroke recurrence, and HCY made the largest contribution to the model. A web-based calculator (https://mlmedicine-re-stroke2-re-stroke2-baylee.streamlitapp.com/) has been developed accordingly. Conclusion This study identified four independent risk factors affecting recurrence within 1 year in stroke patients, and the constructed RF-based prediction model performed well.


Introduction
Stroke is characterized by acute focal injury of the central nervous system (i.e., brain, retina, or spinal cord) resulting in neurological dysfunction due to sudden rupture of blood vessels or obstruction of blood flow. It is categorized into ischemic and hemorrhagic stroke; the former is more common, accounting for 60-70% of all strokes (Sacco et al., 2013). The major clinical manifestation of stroke is the sudden onset of focal neurological deficits, and the clinical diagnosis is complemented by imaging of the brain and its vascular tree (Campbell and Khatri, 2020). Epidemiological data suggest that stroke is the second leading cause of death and disability worldwide, with a tremendous burden borne by low- and middle-income countries (Saini et al., 2021). According to 2016 Global Burden of Disease data, one in four people will have a stroke in their lifetime (GBD 2016 Neurology Collaborators, 2019), and its prevalence is approximately equal in men and women. However, the incidence of stroke is higher in older women (more than 50% higher than in men aged 75 years or older) and in some racial groups (e.g., 1.91 per 1,000 in Black or African American people and 0.88 per 1,000 in Caucasians) (Virani et al., 2020). Data from 2010 to 2017 showed a 5.3% increase in stroke morbidity and mortality and a 19.3% increase in prevalence (Goldstein, 2020).
Despite effective treatment approaches, stroke patients remain at measurable risk of recurrent episodes after initial recovery. Recurrent strokes account for 25-30% of all preventable strokes, the majority of which are ischemic, and their onset leads to higher mortality and disability rates than the initial episode (Luengo-Fernandez et al., 2012). The risk of early recurrence of ischemic stroke is approximately 5% at 7 days and 10% at 14 days; the long-term recurrence risk is approximately 11.1% [95% confidence interval (CI) 9.0 to 13.3] at 1 year (Hankey, 2014). Identifying risk factors for stroke recurrence therefore helps to single out populations at high risk of recurrence and allows early intervention to reduce morbidity and mortality. Previous studies have shown that both pathophysiological and lifestyle factors influence stroke recurrence. In addition, a history of previous cerebrovascular events and the stroke subtype are important risk factors for recurrence (Chin et al., 2018). It is therefore crucial to develop predictive models for effective secondary prevention and management.
With the rapid development of precision medicine in recent years, data science and predictive analytics have taken on significant roles in helping physicians deliver individualized care. However, the clinical application of models that predict recurrent stroke using regression or other statistical methods is often limited by the narrow range of variables (Chaudhary et al., 2019). For example, the area under the receiver operating characteristic (ROC) curve for multivariate logistic models of recurrent stroke within 1 year, developed using clinical and retinal characteristics, is 0.71-0.74 [higher area under curve (AUC) values indicate better model predictive power] (Yuanyuan et al., 2020). By contrast, when machine learning (ML) is applied to single- or multi-omics medical data, more detail can be mined from the data and better diagnostic and prognostic tools can be developed than with traditional statistical regression models (Bersanelli et al., 2016;Erickson et al., 2017;Dias-Audibert et al., 2020;Fleuren et al., 2020). Studies have demonstrated that ML can successfully predict favorable outcomes up to 3 months after an acute stroke event and that the AUC of deep neural network models is significantly higher than that of the ASTRAL score (0.888 vs. 0.839; P < 0.001) (Heo et al., 2019). In addition, ML can be used to efficiently label stroke patients in the emergency setting to facilitate triage (Abedi et al., 2020), as well as to build predictive models of long-term (5-year) stroke recurrence using six ML algorithms (Abedi et al., 2021).
In the current study, we constructed six different prediction models by adding observational indicators and explored factors influencing recurrence in stroke patients based on 1-year follow-up data. We evaluated model performance using sensitivity, specificity, accuracy, and the receiver operating characteristic (ROC) curve, and analyzed the relative importance and interpretability of the different factors in the models. We aimed to provide a reference for identifying stroke patients at high risk of recurrence, which is conducive to early diagnosis and treatment of stroke recurrence and may lead to improved survival and recovery.

Statistical methods
The collected clinicopathological and biochemical data were subjected to statistical analysis and model construction using R (version 4.0.5; https://www.r-project.org/) and Python (version 3.8). First, continuous variables were expressed as mean ± standard deviation and compared using an independent samples t-test; categorical variables were expressed as frequency (percentage, %) and analyzed with the χ2 test. Logistic regression (LR) analysis was used to identify risk factors independently associated with stroke recurrence. Variables with P-values less than 0.05 in univariate LR analysis were included in multivariate LR analysis. Finally, factors with a P-value < 0.05 in multivariate LR analysis were identified as independent risk factors for stroke recurrence, and the odds ratio (OR) and 95% CI were calculated for each variable.
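The step from a fitted LR coefficient to the reported OR, 95% CI, and P-value can be illustrated with a short sketch. It assumes the usual Wald approach (OR = exp(β), CI = exp(β ± 1.96·SE), two-sided P from a normal approximation); the paper does not state which CI method was used, and the coefficient and standard error below are hypothetical values chosen only for illustration, not study results.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald-type confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def wald_p_value(beta, se):
    """Two-sided P-value for H0: beta = 0 under a normal approximation."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))

# Hypothetical coefficient and standard error (not taken from the study).
beta, se = 0.69, 0.30
or_, lower, upper = odds_ratio_with_ci(beta, se)
p = wald_p_value(beta, se)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}, P = {p:.3f}")

# Screening rule described in the text: keep the variable if P < 0.05.
keep_variable = p < 0.05
```

A variable passes the screening rule when its CI for the OR excludes 1, which is equivalent to P < 0.05 under this approximation.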

Model building and validation
In our study, prediction models based on six different ML algorithms were used to analyze our data: logistic regression (LR), naive Bayes classifier (NBC), decision tree (DT), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGB). Using the training set, the accuracy of the ML-based models was verified by 10-fold cross-validation, and the average AUC of each model was calculated. In addition, the ROC curves of the models on the test set were plotted for external validation, and radar plots of the sensitivity, accuracy, and specificity of the models were drawn to evaluate their performance comprehensively. The algorithm with the highest average AUC was selected as the optimal algorithm. The contribution of each variable to the optimal model was then calculated with the interpretable model SHAP (SHapley Additive exPlanations) to determine the importance of the variables and whether each contributed positively or negatively to the model. Finally, a web calculator was built on this basis so that physicians can enter patient data and obtain a prediction of the risk of stroke recurrence within 1 year.
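The 10-fold cross-validation step above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the authors' pipeline: the fold assignment is stratified by outcome (the paper does not say whether stratification was used), the data are synthetic, and `fit_midpoint` is a hypothetical stand-in for a real learner such as RF.

```python
import random
from statistics import mean, stdev

def auc(labels, scores):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds, balancing classes (assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def cross_validated_auc(features, labels, fit, k=10, seed=0):
    """fit(train_X, train_y) must return a function mapping one row to a risk score."""
    aucs = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        score = fit([features[i] for i in train], [labels[i] for i in train])
        aucs.append(auc([labels[i] for i in held_out],
                        [score(features[i]) for i in held_out]))
    return mean(aucs), stdev(aucs)

# Synthetic one-feature data set: 20 recurrent and 80 non-recurrent cases.
rng = random.Random(42)
labels = [1] * 20 + [0] * 80
features = [[rng.gauss(1.0 if y else 0.0, 1.0)] for y in labels]

def fit_midpoint(train_X, train_y):
    """Hypothetical toy learner: score = distance above the class-mean midpoint."""
    pos = mean(row[0] for row, y in zip(train_X, train_y) if y == 1)
    neg = mean(row[0] for row, y in zip(train_X, train_y) if y == 0)
    midpoint = (pos + neg) / 2
    return lambda row: row[0] - midpoint

mean_auc, sd_auc = cross_validated_auc(features, labels, fit_midpoint)
print(f"mean AUC = {mean_auc:.3f} (SD {sd_auc:.3f})")
```

Repeating `cross_validated_auc` with a different `fit` function for each algorithm and comparing the mean AUCs corresponds to the model selection rule described in the text.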

Baseline population characteristics
A total of 832 patients with AIS were screened for this study. Of these, 48 patients with incomplete clinical data or inadequate ancillary tests, 64 patients with disturbance of consciousness or severe aphasia, 18 patients with severely abnormal organ function, and 57 patients with less than 1 year of follow-up for various reasons were excluded. The remaining 645 patients with AIS, with or without stroke recurrence within 1 year, were included (Figure 1). A total of 84 patients experienced recurrent stroke, giving a recurrence rate of 13%. Table 1 shows that the side of hemisphere (SOH), HCY, CRP, NSE, S100β, anticoagulation, PPI, dysphagia, and SS differed significantly between patients with and without stroke recurrence, suggesting that these variables may be independent risk factors for stroke recurrence.
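The patient flow and recurrence rate reported above can be checked with a few lines of arithmetic. The counts are taken from the text; the dictionary keys are shorthand labels introduced here.

```python
screened = 832
excluded = {
    "incomplete data or inadequate tests": 48,
    "disturbance of consciousness or severe aphasia": 64,
    "severely abnormal organ function": 18,
    "less than 1 year of follow-up": 57,
}
included = screened - sum(excluded.values())
recurrent = 84
recurrence_rate = recurrent / included

print(included, f"{recurrence_rate:.1%}")  # prints "645 13.0%"
```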

Model building and performance validation
Figure 1. Flowchart of the whole experiment.
Frontiers in Neuroscience 04 frontiersin.org
Four significantly different factors were defined as variables of the model by univariate and multivariate screening. To improve the prediction effectiveness of the model, 10-fold cross-validation was used for internal validation (Figure 2). The results showed that the RF model was the best predictor of stroke recurrence (mean AUC 0.959, standard deviation 0.017), whereas the LR model had the lowest AUC (0.777). In addition, the ROC curves of the models on the test set (Figure 3) were plotted for external validation. The RF model had the largest AUC (0.946), the LR model had the lowest (0.887), and the other models fell in between, indicating that the RF model fitted the data better. The radar plot of sensitivity and specificity (Figure 4) showed that the LR and RF models had better accuracy and sensitivity, while the NBC model had higher specificity. However, the RF model was more effective (F1 = 0.585) when precision and recall were considered together (F1 score). The performance of the six models is summarized in Table 3. We therefore chose the RF model as the final prediction model.
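The metrics compared across models (sensitivity, specificity, accuracy, and F1) all derive from the confusion matrix. A minimal sketch follows; the counts are hypothetical and were chosen only to make the arithmetic easy to follow, so they do not reproduce the values in Table 3.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the reported metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall: recurrences correctly flagged
    specificity = tn / (tn + fp)            # non-recurrences correctly cleared
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)              # flagged patients who truly recurred
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical test-set counts (not from the study).
metrics = classification_metrics(tp=12, fp=8, fn=4, tn=76)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, it stays low when recurrences are rare and precision is modest, even if accuracy and AUC are high.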

Interpretability study of variables in the RF model
Figure 2. Ten-fold cross-validation within the training set of the machine learning algorithms.
Figure 4. Radar plots of sensitivity and specificity of the six machine learning models.
Considering clinical utility, we used the interpretable model SHAP to examine the contribution of each variable to the final prediction of the ML-based RF model for AIS patients. With each variable treated as a participant and the model output as the collaborative outcome, the contribution of each variable, or SHAP value, was calculated. As can be seen in Figure 5 (left), the variables in descending order of importance are HCY, CRP, SS, and SOH. Figure 5 (right) shows the magnitude of the four variables in different colors, with higher values corresponding to red and vice versa. The figure takes a SHAP value of zero as the origin; points to the left (negative) or right (positive) indicate that the variable contributes negatively or positively to the predicted output. We therefore conclude that HCY has the greatest impact on the model, that HCY, SS, and CRP all contribute positively to the output of the RF model, and that SOH contributes negatively.
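The idea behind a SHAP value, a variable's average marginal contribution over all coalitions of the other variables, can be made concrete with a brute-force sketch. This is not how the SHAP library handles an RF model, and the linear risk score, its weights, the patient values, and the baseline below are all hypothetical; only the variable names come from the study.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; features outside a coalition
    are set to their baseline values."""
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(point)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi

# Hypothetical linear risk score (weights are illustrative, not fitted).
WEIGHTS = {"HCY": 0.05, "CRP": 0.03, "SS": 0.3, "SOH": -0.2}

def toy_risk_score(point):
    return sum(WEIGHTS[f] * point[f] for f in WEIGHTS)

patient = {"HCY": 20.0, "CRP": 10.0, "SS": 2.0, "SOH": 1.0}    # hypothetical patient
baseline = {"HCY": 12.0, "CRP": 5.0, "SS": 1.0, "SOH": 0.0}    # hypothetical reference

phi = shapley_values(toy_risk_score, patient, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the property that lets a plot like Figure 5 split a single prediction into positive and negative parts.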

Web-based calculator RF model
The RF-based model performed best among the six models. Therefore, we built a web-based calculator (https://mlmedicine-re-stroke2-re-stroke2-baylee.streamlitapp.com/) to facilitate the clinical application of this predictive model (Figure 6).

Discussion
Stroke occurs across all populations, but it is more likely in the middle-aged and elderly, in whom rates of disability and mortality are higher. Most strokes are ischemic strokes caused by arterial occlusion, and stroke recurrence carries a higher risk of death and disability than the first event (Tu et al., 2021, 2022). There is therefore an urgent need to identify risk factors for stroke recurrence in such patients in order to improve prevention, reduce recurrence and disability rates, and prolong survival. Traditional data mining and statistical methods usually require feature engineering to obtain effective and robust features before prediction or clustering models are constructed. With complex data and insufficient domain knowledge, both steps present many challenges (Miotto et al., 2018). Machine learning runs complex algorithms on large-scale, diverse datasets to find useful patterns, and it plays an important role in the biomedical field for disease detection, diagnosis, prevention, and treatment. Its development leads to more accurate early diagnosis, individualized treatment, and continuous monitoring, as well as effective screening for disease-related risk factors and prediction of disease recurrence (Goecks et al., 2020). Machine learning is particularly useful when datasets are too large or complex for human analysis, or when the data analysis process needs to be automated into reproducible and time-saving pipelines. The RF model used in this study, for example, has the advantage of reporting how important each feature is to the prediction, and its individual DTs are human-readable and easy to train and adjust. Its shortcomings are that it is less suited to regression tasks and that an ensemble of many DTs is difficult to interpret (Greener et al., 2022).
Throughout the study, we followed 645 patients with AIS, 84 of whom experienced recurrence. The rate of stroke recurrence was consistent with a previous result (13% vs. 11.1%) (Hankey, 2014). After comparing the baseline characteristics of the collected clinical variables by recurrence status, we identified four independent risk factors for stroke recurrence by univariate and multivariate LR analysis (i.e., right hemisphere, HCY, CRP, and SS). Six ML methods (LR, NBC, DT, XGB, GBM, and RF) were then used to develop models for individualized prediction of stroke recurrence in AIS patients (Figure 1). The results showed that the RF model had the best predictive ability, with a mean AUC of 0.959 in 10-fold cross-validation (range across the six models, 0.777 to 0.959). After a comprehensive evaluation of the performance of the six models (Table 3), we concluded that the RF model performed best. The relative importance of the four risk variables in the RF model, from highest to lowest, was HCY, CRP, SS, and SOH. Finally, an online web calculator was created to facilitate clinical application.
Table 1 shows the baseline characteristics of patients with AIS. There were no significant differences in gender or age between patients with and without stroke recurrence; however, there were significant differences in the side of the hemisphere (SOH; P = 0.014), blood homocysteine (HCY; P < 0.001), C-reactive protein (CRP; P < 0.001), neuron-specific enolase (NSE; P = 0.002), central neurospecific protein (S100β; P = 0.012), anticoagulation therapy (P < 0.001), proton pump inhibitor therapy (PPI; P = 0.007), dysphagia (P = 0.039), and stroke severity (SS; P < 0.001).
HCY, CRP, NSE, and S100β are all serum biomarkers. HCY is a sulfur-containing amino acid and an important intermediate in the metabolism of methionine and cysteine; it is not itself involved in protein synthesis. Elevated HCY is recognized as an independent risk factor for the development of atherosclerosis. HCY is also a pro-inflammatory marker; the inflammatory process has an important role in the pathophysiology of ischemic stroke, and elevated HCY is considered a risk factor for ischemic stroke (Chen et al., 2017). The data in Table 1 show that HCY was lower in patients with recurrent stroke than in patients with nonrecurrent stroke (13.2 vs. 16.2, P = 0.002). It was previously shown that hypertension with high homocysteine (HHcy) (H-type hypertension) and CRP can increase the incidence of ischemic stroke. Later data demonstrated that recurrent ischemic stroke (RIS) is associated with advanced age, male sex, diabetes, H-type hypertension, and C-reactive protein. Conversely, controlling H-type hypertension and C-reactive protein levels reduces the risk of RIS. Under normal conditions, the levels of NSE and S100β in body fluids are extremely low. When neuronal injury or necrosis occurs, NSE and S100β rapidly spill from the cells into the cerebrospinal fluid and enter the blood through the damaged blood-brain barrier, resulting in elevated serum NSE and S100β concentrations. These levels reflect the extent of neuronal damage, so elevated NSE and S100β suggest possible recurrence. In addition, other serum biomarkers, such as serum copeptin, are associated with recurrent stroke events and are predictors of severity at admission and of 1-year stroke recurrence in stroke patients (Tang et al., 2017). Serum fatty acid binding protein 4 (FABP4), serum CXCL12 (Gu et al., 2016), interleukin-37 (Zhang et al., 2021), and cystatin C (Liu et al., 2021) have also been reported to be associated with stroke recurrence.
Serum C-reactive protein, an acute-phase reactant protein that is elevated in the presence of infection, is second only to HCY in relative importance in the RF model. It is also a non-specific marker of systemic inflammation that reflects various infectious and non-infectious inflammatory conditions. A retrospective review showed that 26 studies reported on the association of CRP with recurrent stroke, of which 12 (46%) described a positive association (McCabe et al., 2021), a result that is consistent with ours. In addition to the abovementioned control of CRP and H-type hypertension, which reduces the risk of RIS, elevated serum Hs-CRP and HCY levels are associated with the risk of post-stroke depression (PSD) 1 year after stroke onset, and the combination of these two factors adds prognostic information to the early assessment of PSD (Cheng et al., 2018).
Table 2 shows that NSE, S100β, anticoagulation, PPI, and dysphagia were statistically significant in the univariate LR analysis but not when they were included in the multivariate LR analysis, which may be due to the sample size of the study population or to selection bias. SS was a good independent predictor of stroke recurrence (Table 2), with an odds ratio of 3.98 for recurrence in stroke patients, and it was also a relatively important factor in the RF model. A 2016 study of outcomes after ischemic stroke and their associated factors in elderly patients showed that, at 12 months after stroke, moderate stroke was associated with dependency and severe stroke was associated with both dependency and recurrence (Wu et al., 2016). In addition, a recent meta-analysis of stroke recurrence rates in patients with first ischemic stroke showed that hypertension, diabetes mellitus, atrial fibrillation, transient ischemic attack, and high SS were independent risk factors for recurrence (Kolmos et al., 2021).
Figure 5. Relative importance of variables based on SHAP for the RF prediction model.
Figure 6. Online calculator for predicting stroke recurrence.
The right hemisphere ranked last in relative importance in the RF model. Stroke patients experience impairments such as contralateral motor deficits and interhemispheric imbalances, including hyperexcitability of the contralateral hemisphere after stroke. Because recovery of motor function can be promoted by increasing the excitability of the affected hemisphere or decreasing that of the unaffected hemisphere, current brain treatments for stroke patients include brain-computer interface (BCI) and transcranial magnetic stimulation (TMS) therapies to reduce mortality and alleviate the degree of disability. Studies have shown that bilateral hemisphere treatment with TMS facilitates motor recovery of paralyzed hands in stroke patients (Takeuchi et al., 2009). In contrast, when patients present with bilateral hemispheric lesions, the axis between the central nervous system and the gastrointestinal system may be interrupted, leading to secondary symptoms such as dysphagia and gastrointestinal bleeding (Schaller et al., 2006). Similarly, our data show that dysphagia, although not an independent risk factor for stroke recurrence, differed significantly at baseline between patients with and without stroke recurrence.
The current study stresses the importance of primary prevention in patients with a first stroke and of secondary prevention in recurrent stroke. Primary prevention includes anticoagulation for atrial fibrillation, antihypertensive treatment for hypertension, and glucose control for diabetes (Diener and Hankey, 2020). The primary aim of secondary prevention is to prevent or reduce the risk of recurrent stroke and to reduce the degree of disability. Effective treatments include antithrombotic and anticoagulant therapy, revascularization, and implementation of structured evaluation and intervention (Hankey, 2014).
Although aspirin is effective for secondary prevention of ischemic stroke, it increases the risk of hemorrhagic stroke, upper gastrointestinal bleeding (UGIB), and dyspepsia. Prophylactic administration of proton pump inhibitors (PPIs) may reduce the risk of these digestive symptoms (Takabayashi et al., 2015). There is evidence that some PPIs can attenuate the antiplatelet effects of clopidogrel, but after multivariate adjustment, the data show that the use of PPIs is not associated with a significantly increased risk of recurrent stroke or death (Juurlink et al., 2011). This is consistent with the conclusions reached in this study.
However, this study has some limitations that need to be addressed in the future. First, the ML model we developed is based on data from a single hospital, which may limit its use in other regions. Second, the sample size is limited, and there is room for extending the follow-up period. Finally, the study is retrospective and therefore subject to the biases inherent in retrospective data. We will conduct further multicenter and prospective studies in the future.

Conclusion
In conclusion, we constructed six risk prediction models for stroke recurrence in patients with AIS using machine learning (ML) algorithms, incorporating four independent risk factors associated with stroke recurrence (i.e., right hemisphere, HCY, CRP, and SS). Among them, the RF model gave the most promising predictions: it performed best across internal and external validation combined, with comparable accuracy, sensitivity, and specificity. It is hoped that the web-based calculator can serve as an effective predictive tool to help prevent recurrence in stroke patients and to assist physicians in decision-making.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by the Ethics Committee of The Second Affiliated