- 1Department of Cardiology, The First Hospital of China Medical University, Shenyang, China
- 2Department of Cardiology, The First Affiliated Hospital of Dalian Medical University, Dalian, China
- 3Department of Gastric Surgery, Liaoning Cancer Hospital and Institute (Affiliated Cancer Hospital of Dalian University of Technology, Cancer Hospital of China Medical University), Shenyang, China
Background: We aim to construct a machine learning (ML) model to predict stroke risk in patients with hypertension.
Methods: In all, 68 variables, including demographic information, medical history and medication use, lifestyle, anthropometry, laboratory tests, electrocardiography, and echocardiography, were selected for baseline analysis. Of these, 10 optimal variables were selected by recursive feature elimination (RFE), and a prediction model was then trained and tested using eXtreme Gradient Boosting (XGBoost) within a 10-fold cross-validation cycle. Four traditional Cox regression models, including models based on the China-PAR Score and the Framingham Stroke Risk Score, were established and compared with the ML model. Finally, model performance was compared using C-statistics for discrimination and the Brier score for calibration.
Results: In all, we included 5,197 hypertensive participants (mean age = 57.16 ± 10.20 years) from the Northeast China Rural Cardiovascular Health Study (NCRCHS). Of these, end point events occurred in 294 patients (5.7%, 185 males and 109 females) during a mean follow-up period of 4.26 ± 1.03 years. Using RFE, 10 variables were selected to construct the XGBoost model. The ML model demonstrated better discrimination than the best performing Cox regression model [C-statistic 0.967 (95% CI, 0.956, 0.978) vs. 0.781 (95% CI, 0.772, 0.785), respectively] with an acceptable calibration (Brier score = 0.053).
Conclusion: Using the ML method, we constructed a high-precision prognostic model to predict stroke risk in patients with hypertension. This model exhibited a better classification effect and better performance compared with the traditional risk scales. The model could be used in clinical practice to achieve early prevention and intervention of stroke.
1 Introduction
Globally, stroke is among the most serious public health problems since it is the main cause of disability and the second leading cause of death (Mathers and Loncar, 2006). In 2017, a survey reported >13 million stroke cases in China (Wang et al., 2017). According to the National Disease Surveillance Points System, 1.6 million people die due to stroke in China every year, which is almost one-third of the total deaths from stroke worldwide (Liu et al., 2011; Feigin et al., 2015; Zhou et al., 2016; Wang et al., 2017). The incidence and total mortality of stroke in China are rising every year (Wang et al., 2020).
China has the largest number of hypertensive patients. Among Chinese adults ≥18 years old, approximately 244.5 million individuals have hypertension (HTN) and 435.3 million have pre-HTN. HTN is an important risk factor for stroke, especially in China. Over 60% of the patients with acute stroke present with high blood pressure (Miller et al., 2014). In Chinese hypertensive patients, every 10 mmHg increase in systolic blood pressure is associated with a 1.44-fold and 1.5-fold risk of ischemic and hemorrhagic stroke, respectively. Moreover, the stroke risk is remarkably higher among Chinese hypertensive patients than among their Caucasian counterparts (risk ratio, 2.58) (Zhang et al., 2006). The burden of stroke in China is therefore substantial, and accurate prediction of stroke in the hypertensive population is critical for stroke prevention.
To date, several stroke prediction models have been developed, such as the Framingham Stroke Risk Score (FHS) (Kannel et al., 1976; D'Agostino et al., 1994, 2008; Zhou et al., 2017), the ACC/AHA Pooled Cohort Equations (PCE) (Goff et al., 2014), and the QStroke score in the United Kingdom (Hippisley-Cox et al., 2013). However, the applicability of these models to the Chinese population has been questioned because they tend to overestimate risk (Liu et al., 2004; Chia et al., 2014; Jiang et al., 2020; Li J. et al., 2021). Although the Chinese Multi-provincial Cohort Study (CMCS) (Liu et al., 2004; Wang et al., 2016) and the Prediction for ASCVD Risk in China (China-PAR) model (Yang et al., 2016; Xing et al., 2019) were based on the Chinese population, their accuracy and application were compromised by the limited risk factors considered and the traditional analysis methods used.
Machine learning (ML) algorithms have shown good performance in the diagnosis (Chen, 2022), treatment (Chi et al., 2022), and prognosis (Motwani et al., 2017; Segar et al., 2019; Angraal et al., 2020; Wu et al., 2020; Shu et al., 2021) of cardiovascular diseases (CVDs). Nonetheless, no stroke prediction model to date has been developed from the general Chinese hypertensive cohort using ML algorithms (Wu et al., 2020; Yang et al., 2021). Hence, this study aims to develop an ML-based model that predicts stroke risk in Chinese hypertensive patients more accurately than traditional models.
2 Methods
2.1 Patient cohort
The Northeast China Rural Cardiovascular Health Study (NCRCHS) is a prospective cohort study conducted in the rural areas of Northeast China, whose inclusion criteria and design are described in a previous article (Guo et al., 2021a). In all, 11,956 participants aged ≥35 years were recruited from Dawa, Zhangwu, and Liaoyang counties of Liaoning province between 2012 and 2013 using a multi-stage, randomly stratified cluster-sampling scheme. The participants were followed up in 2015 and 2017. The NCRCHS was approved by the Ethics Committee of China Medical University (Shenyang, China). All participants provided written informed consent.
In this study, we included data from 5,260 participants of the NCRCHS study. Of these, 63 patients with missing or abnormal values were excluded. As a result, 5,197 hypertensive patients were included in the final analyses (Supplementary Figure S1).
2.2 Study variables and data collection
The data collection procedure is described in detail in our previous paper (Li Z. et al., 2021). The patients' demographic characteristics, medical history and medication use, lifestyle factors, and other information were obtained at baseline through interviews by trained research staff using a standardized validated questionnaire. Anthropometric indices such as weight, height, and waist circumference (WC) were measured. Body mass index (BMI) was computed as body weight in kilograms divided by the square of the height in meters. Blood pressure (BP) was measured three times after a 5-min rest using an automatic electronic sphygmomanometer (HEM-907; Omron, Kyoto, Japan), and the readings were averaged. HTN was defined according to the JNC 7 report as systolic blood pressure (SBP) ≥140 mmHg, diastolic blood pressure (DBP) ≥90 mmHg, and/or the use of antihypertensive medications (Chobanian et al., 2003). We collected blood samples after at least 12 h of fasting to determine the plasma levels of fasting glucose (FPG), triglycerides (TG), high-density lipoprotein cholesterol (HDL-C), uric acid, estimated glomerular filtration rate (eGFR), and blood routine biochemical indicators. Standard 12-lead electrocardiograms (ECGs) were recorded with a MAC 5500 (GE Healthcare, Little Chalfont, UK) as previously described (Li Z. et al., 2021). The ECGs were analyzed automatically, yielding QRS duration, PR duration, P axis, R axis, T axis, left ventricular hypertrophy (LVH) ECG (defined per Sokolow–Lyon criteria), and QT interval (Framingham correction). Atrial fibrillation (AF) was defined as having a previous history of AF or an ECG suggestive of AF. Echocardiography was performed for all participants based on the American Society of Echocardiography guidelines, consistent with our previous study (Li T. et al., 2021). A Doppler echocardiograph (Vivid; GE Healthcare, Connecticut, USA) with a 3.0-MHz transducer was used, including M-mode, two-dimensional, spectral, and color Doppler imaging.
Aortic dimension (AD), left atrial diameter (LAD), left ventricular end-diastolic internal dimension (LVIDd), left ventricular end-systolic internal dimension (LVIDs), interventricular septal thickness (IVSd), posterior wall thickness (PWTd), left ventricular ejection fraction (LVEF), E wave, and A wave were measured. In all, we selected 68 variables for model construction (Table 1), including subsets of characteristics related to the course, prognosis, hypertension-related target organ damage, and complications.
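The derived baseline variables described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the function names (`bmi`, `mean_bp`, `is_hypertensive`) are hypothetical, and the thresholds are the JNC 7 cut-offs quoted in the text.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def mean_bp(readings):
    """Average repeated (SBP, DBP) readings, e.g. the three readings taken per visit."""
    n = len(readings)
    sbp = sum(r[0] for r in readings) / n
    dbp = sum(r[1] for r in readings) / n
    return sbp, dbp


def is_hypertensive(sbp, dbp, on_antihypertensives=False):
    """JNC 7 definition quoted in the text: SBP >= 140 mmHg, DBP >= 90 mmHg,
    and/or current use of antihypertensive medication."""
    return sbp >= 140 or dbp >= 90 or on_antihypertensives


# Hypothetical participant, for illustration only.
sbp, dbp = mean_bp([(150, 88), (146, 92), (142, 90)])
print(round(bmi(70, 1.75), 1), sbp, dbp, is_hypertensive(sbp, dbp))
```

Averaging the readings before applying the threshold mirrors the order of operations described in the text, so a single elevated reading does not classify a participant as hypertensive by itself.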
2.3 Follow-up
During follow-up, we recorded new fatal or non-fatal strokes as end point events. According to the World Health Organization (WHO) Multinational Monitoring of Trends and Determinants in CVD criteria, stroke was defined as rapidly developing signs of focal or global disturbance of cerebral function lasting for >24 h (unless interrupted by surgery or death) with no apparent non-vascular cause (Asplund et al., 1988). Chronic cerebral vascular disease and transient ischemic attack were excluded. We collected medical records and death certificates for all participants who were possibly diagnosed with stroke or had died. All information was independently reviewed and adjudicated by the end point assessment committee.
2.4 Cox regression model construction
We established four Cox regression models and compared them with the ML model. In the first model, all variables entered a Cox proportional hazards analysis, and redundant variables were eliminated by forward conditional stepwise selection. Two further models used the variables established in the China-PAR Score (Yang et al., 2016) and the Framingham Stroke Risk Score (D'Agostino et al., 1994), respectively. The last Cox model used LASSO regression to select the variables.
2.5 ML model construction and calculation
To avoid data leakage, all data preprocessing and model construction procedures were conducted strictly within each training subset of a 10-fold cross-validation framework, while the corresponding validation subset was used only for model evaluation. Missing data were handled using multivariate imputation by chained equations (MICE), and outlier detection procedures were performed on the training data in each fold, with the derived parameters applied to the validation data.
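The fold-wise discipline described above (fit preprocessing on the training subset, then apply the fitted parameters to the validation subset) can be illustrated with a minimal Python sketch. As a loudly stated simplification, column-mean imputation stands in for MICE here, and all function names are hypothetical; this is not the authors' pipeline.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle row indices and split them into k disjoint validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean_imputer(rows):
    """Learn column means from observed (non-None) training values only.
    Simplified stand-in for MICE, used purely for illustration."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means


def apply_imputer(rows, means):
    """Fill missing values using parameters learned elsewhere (the training fold)."""
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def leakage_free_folds(rows, k=10, seed=0):
    """Yield (train, validation) pairs where imputation is fitted on train only."""
    for val_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(val_idx)
        train = [rows[i] for i in range(len(rows)) if i not in held_out]
        val = [rows[i] for i in val_idx]
        means = fit_mean_imputer(train)  # validation rows never influence the fit
        yield apply_imputer(train, means), apply_imputer(val, means)
```

The key design point is that `fit_mean_imputer` never sees the held-out rows, so the validation estimate is not contaminated by information from the data it is supposed to evaluate.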
The machine learning workflow, as shown in Supplementary Figure S1, involved feature selection using RFE followed by model training and testing with XGBoost within a 10-fold cross-validation cycle. RFE was performed independently in each training subset, and validation errors across all folds were calculated to identify the feature combination with the lowest average error. After feature selection, the Synthetic Minority Over-sampling Technique (SMOTE), implemented with the k-nearest neighbors approach (k = 5), was applied only to the training data to balance the class distribution, whereas the validation data retained their original distribution. Subsequently, XGBoost, a gradient tree boosting-based classifier that aggregates multiple weak learners into a strong learner, was used to develop the prediction model; it was trained using the gbtree booster (max_depth = 4, learning rate = 0.05, n_estimators = 300, subsample = 0.8, and colsample_bytree = 0.8). All machine learning analyses were implemented in the open-source R software (version 4.1.1).
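To make the oversampling step concrete, the following Python sketch shows the core SMOTE idea: each synthetic minority sample is an interpolation between a minority sample and one of its k nearest minority neighbors. It is a simplified illustration under stated assumptions (numeric features only, events labeled 1 and forming the minority class, hypothetical function names) and is not the implementation used in the study.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating toward one of
    the k nearest minority neighbors of a randomly chosen base sample."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbor)])
    return synthetic


def balance_training_set(X, y, k=5, seed=0):
    """Oversample class 1 until both classes are equally frequent.
    Apply to training folds only; validation data keep their distribution."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_new = max(0, (len(y) - len(minority)) - len(minority))
    new = smote(minority, n_new, k, seed)
    return X + new, y + [1] * len(new)
```

Because every synthetic point lies on a segment between two observed minority samples, the oversampled data stay inside the range of the observed events, which is why the step is restricted to training folds: applying it to validation data would distort the event rate the model is evaluated against.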
2.6 Statistical analysis and performance measures
Missing laboratory values were imputed using the mice package in R (m = 5 imputations, maxit = 10 iterations), with the imputation model fitted on the training data in each fold and then applied to the corresponding validation data. For outliers, the local outlier factor (LOF) was used for numerical variables and the attribute value frequency (AVF) algorithm for categorical variables. Continuous variables were represented as mean ± standard deviation (SD) and compared using the t-test or Mann–Whitney U test. Categorical variables were represented as frequency (n) and proportion (%) and compared using the chi-square test. C-statistics were used to evaluate the discrimination of the models (DeLong et al., 1988). Calibration was evaluated with the Brier score (range, 0–1) (Brier, 1950) and by comparing the observed and predicted event proportions within groups defined by deciles of predicted risk (Liu et al., 2004). A P-value < 0.05 was considered statistically significant. Decision curve analysis (DCA) was performed to assess the clinical net benefit of each model at different threshold probabilities, providing a visual comparison of their potential clinical utility.
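For readers less familiar with the two headline metrics, the sketch below computes them in plain Python. It assumes a binary outcome (1 = event, 0 = no event), in which case the C-statistic equals the area under the ROC curve; a censoring-aware concordance index for survival data would differ. Function names and the toy data are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 to 1;
    lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def c_statistic(y_true, y_prob):
    """Proportion of (event, non-event) pairs in which the event received the
    higher predicted risk; tied predictions count as half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy data, for illustration only.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(c_statistic(y, p))  # 3 of 4 pairs are concordant -> 0.75
```

Note that with a low event rate the Brier score can be small even for an uninformative model, so it is best read alongside the decile-based calibration comparison rather than in isolation.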
SHapley Additive exPlanations (SHAP) is a framework based on the additive feature attribution method that explains the output of the XGBoost model. A positive SHAP value indicates that the feature has a positive effect, while a negative SHAP value indicates that the feature reduces the outcome value and has a negative effect. This method can output the importance ranking of the features as well as the relationship between the features and the outcomes. SHAP-force plot was used to visualize the impact of individual feature values on the model's prediction for each observation. Descriptive analyses and comparisons between clinically defined groups were performed using R 4.1.1.
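The additive property behind SHAP can be sketched as follows: on the log-odds scale, an individual prediction equals a base value plus the sum of per-feature contributions, and a logistic transform turns it into a risk. The Python below is illustrative only; the contribution values are hypothetical, not taken from the study, and the function names are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_risk(base_value, contributions):
    """Additive attribution: log-odds = base value + sum of SHAP contributions."""
    log_odds = base_value + sum(contributions.values())
    return sigmoid(log_odds)


def rank_by_mean_abs(shap_rows):
    """Global importance ranking: mean absolute SHAP value per feature."""
    features = list(shap_rows[0].keys())
    mean_abs = {f: sum(abs(r[f]) for r in shap_rows) / len(shap_rows) for f in features}
    return sorted(mean_abs, key=mean_abs.get, reverse=True)


# Hypothetical contributions for one patient (log-odds scale).
patient = {"PWTd": 0.6, "age": 0.3, "eGFR": -0.2}
print(predicted_risk(-2.5, patient))
```

Positive contributions push the log-odds (and therefore the predicted risk) up and negative ones push it down, which is what a force-style plot visualizes for a single observation.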
3 Results
3.1 Participant characteristics
In all, 5,197 individuals with hypertension were included in the study. Of these, 49.4% were males, 50.6% were females, and the mean age was 57.16 years. End point events occurred in 294 (5.7%) patients during a mean follow-up period of 4.26 ± 1.03 years, among which 185 were males and 109 were females. Table 2 shows the distribution of the risk factors. Individuals with end points were older and had higher WC, SBP, DBP, FBG, QTc Framingham, AD, LAD, IVSd, LVIDs, PWTd, and A wave compared with those without end points. Furthermore, they took antihypertensive medications more frequently and were more likely to have a stroke history and left ventricular hypertrophy (LVH) on ECG. In contrast, eGFR, R axis, LVEF, and E wave were lower in individuals with end points than in those without end points.
3.2 Model evaluation and comparison
We selected 10 variables for the ML model using RFE combined with XGBoost. The 10-variable combination included PWTd, IVSd, age, SBP, eGFR, QTc Framingham, platelet, calcium, uric acid, and FBG. We constructed a SHAP summary plot of the XGBoost model (Figure 1) to identify the importance of each feature in the prediction model. PWTd, IVSd, age, and SBP were the most important risk factors for stroke (Figure 1). In contrast, eGFR was associated with a decreased risk of stroke. We further generated a SHAP force-style plot for a representative patient to illustrate the individualized prediction of the XGBoost model, showing how each feature contributed to the predicted stroke risk (Supplementary Figure S2).
Figure 2. Prediction of outcome events and observed end points in each decile using the ML and 4 Cox models.
The performance of the models was compared using C-statistics for discrimination and the Brier score for calibration (Table 3) under 10-fold cross-validation. For predicting the end point events, the C-statistic was highest for the ML model [0.967 (95% CI, 0.956, 0.978)] among the five models. The four Cox regression models performed similarly to each other, with the following C-statistics: Cox Regression (Framingham Stroke Risk Score) [0.747 (95% CI, 0.734, 0.762)], Cox Regression (China-PAR score) [0.725 (95% CI, 0.707, 0.731)], Cox Regression (Stepwise) [0.781 (95% CI, 0.772, 0.785)], and Cox Regression (LASSO) [0.764 (95% CI, 0.757, 0.772)]. Therefore, our ML model based on XGBoost showed better discrimination than the traditional risk scales.
The Brier score for the ML model was 0.053, indicating good agreement between the predicted risk and the observed 4.26-year risk. Calibration was also assessed by comparing the predicted and observed risks in each decile (Table 3). The largest difference for the ML model was small (2.9% in the 6th decile) compared with that of Cox (Framingham Stroke Risk Score) (42.7% in the 6th decile), Cox (China-PAR score) (33.8% in the 6th decile), Cox (stepwise) (27.9% in the 8th decile), and Cox (LASSO) (32.9% in the 8th decile).
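The decile comparison reported above can be reproduced in outline as follows. This Python sketch sorts participants by predicted risk, splits them into ten equal-sized groups, and reports the mean predicted and observed risk per group; the function name and the toy data are hypothetical, and equal-sized groups are an assumption about how the deciles were formed.

```python
def calibration_by_decile(y_true, y_prob, n_groups=10):
    """Return (group, mean predicted risk, observed event rate, absolute gap)
    for groups formed by ranking participants on predicted risk."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    n = len(order)
    table = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue  # fewer participants than groups
        predicted = sum(y_prob[i] for i in members) / len(members)
        observed = sum(y_true[i] for i in members) / len(members)
        table.append((g + 1, predicted, observed, abs(predicted - observed)))
    return table


# Toy data, for illustration only: 20 participants, events in the upper half.
probs = [i / 20 for i in range(20)]
outcomes = [0] * 10 + [1] * 10
worst = max(calibration_by_decile(outcomes, probs), key=lambda row: row[3])
print(worst[0])  # decile with the largest predicted-versus-observed gap
```

Reporting the largest gap per model, as in the text, gives a single worst-case summary of miscalibration that complements the overall Brier score.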
Decision curve analysis (DCA) showed that all five models provided net clinical benefit mainly at low threshold probabilities (approximately 0–0.15), whereas net benefit rapidly approached zero as the threshold increased (0.15–0.20), indicating limited utility at moderate-to-high thresholds (Figure 3). The DCA curves largely overlapped across the evaluable range, suggesting that the differences in net benefit among models were small and not clinically apparent; importantly, the ML model demonstrated net benefit comparable to the Cox regression models with only marginal variations across thresholds.
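The net benefit plotted in a decision curve follows the standard definition: true positives per participant minus false positives per participant weighted by the odds of the threshold probability. The Python sketch below is an illustration under that standard definition, with hypothetical function names and toy data; it is not the study's analysis code.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is at or above
    the threshold: TP/n - FP/n * threshold / (1 - threshold)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every participant regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data, for illustration only.
y = [1, 0, 0, 0]
p = [0.9, 0.2, 0.6, 0.1]
for t in (0.2, 0.5):
    print(t, net_benefit(y, p, t), net_benefit_treat_all(y, t))
```

Under this definition the treat-all strategy reaches zero net benefit when the threshold equals the event rate, so with a low-incidence outcome any model has limited room to show benefit at higher thresholds.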
4 Discussion
This study presents an ML model that integrates demographic characteristics, basic information, blood biochemical indicators, electrocardiographic variables, and echocardiographic indicators to predict the risk of stroke among Chinese hypertensive patients. We found that the ML model outperformed the four Cox regression models, with a substantially higher C-statistic.
ML methods are powerful tools and are increasingly applied in medicine to predict disease outcomes. XGBoost has been consistently shown to be one of the best-performing ML methods in supervised learning tasks. This algorithm can capture complex and non-linear interactions between variables. Additionally, it automatically learns a default split direction for samples with missing values, and it includes mechanisms that reduce overfitting and computational cost.
Nonetheless, only a few ML prognostic models have been reported for hypertensive patients, and these models have certain limitations. For instance, Wu et al. used an ML method to construct a prognostic model for predicting clinical outcomes in young patients with hypertension. Their ML approach was comparable with Cox regression and outperformed the recalibrated FRS model (Wu et al., 2020). However, the study focused on a young population and enrolled only 508 patients, of whom 42 had end point events; hence, its generalizability to individuals of all ages remains to be studied. Additionally, Yang et al. constructed a stroke risk prediction model for patients with hypertension based on large-scale electronic medical record systems (EMRs) and showed that the ML models performed better than the traditional methods (Yang et al., 2021). Nevertheless, since it was a retrospective study based on EMRs, numerous values were missing; hence, some important traditional scales were not included in the model. In contrast, the present study was prospective and based on a large-scale cohort, so the collection of basic information was reliable and the measurements were standardized. Thus, the results may apply to a wide range of populations, especially in Northeast China, which has a high stroke incidence.
In this research, we selected 10 variables for constructing the ML model. Of these, eight variables (PWTd, IVSd, eGFR, FBG, calcium, QTc Framingham, uric acid, and platelet) were not used in the traditional Cox models and have seldom been discussed in the risk prediction of stroke. According to the SHAP summary plot, PWTd and IVSd had the highest predictive value for stroke, while the predictive value of FBG was much lower. PWTd and IVSd are indices of LVH. LVH on echocardiography is an independent predictor of incident CVDs (Gupta et al., 2010; Leigh et al., 2016). In addition, hypertension with LVH has been shown to confer an extremely high risk of CVDs. Li Z. et al. (2021) found that ventricular septal thickness on echocardiography should be considered when constructing risk prediction models for CVDs. Studies have also revealed that eGFR is independently associated with cardiovascular events, although it may not be recognized as a major risk factor like SBP (Go et al., 2004; Chung et al., 2007; Sosner et al., 2015). This could be attributed to atherosclerosis, which can affect the renal blood vessels and lead to renal insufficiency. Guo et al. (2021b) found that each 10 ms increase in the QTc interval was associated with an HR of 1.12 for stroke. Uric acid ranked ninth on the list of influencing factors. Several studies have shown that uric acid is an independent risk factor for ischemic stroke, especially in Chinese hypertensive patients (Zhang et al., 2020; Dong et al., 2021). However, some studies failed to identify a significant association between uric acid levels and the risk of first stroke in Chinese adults with hypertension (Shi et al., 2017; Hu et al., 2021); hence, further studies are needed to validate the relationship. An abnormal T axis has been identified as an independent risk factor for CVD; hence, ECG monitoring to identify T-wave axis deviation can serve as an early indicator of CVD and help avoid cardiac events.
Inevitably, the study has several limitations. (1) We constructed an ML model based on XGBoost to compare with the traditional Cox regression models since it was previously reported to perform better than other ML models (Yang et al., 2021). However, other ML methods, such as SVM, decision trees, and KNN classifiers, which have also outperformed traditional models, were not included. (2) Potential predictors of stroke (Naganuma et al., 2013; Li et al., 2015; Su et al., 2020), such as cranial imaging, were not collected either at baseline or during follow-up. Meanwhile, competing risks may lead to overestimation of stroke risk, and we will subsequently apply the Fine-Gray model for sensitivity analysis. (3) Similar to previous articles, this research was performed on a rural population of Northeast China without validation in independent cohorts. We used 10-fold cross-validation to partly compensate for the lack of external validation, an approach that has been used in previous research (Motwani et al., 2017; Juhola et al., 2021). Ten-fold cross-validation can reduce the variance in prediction error and limit overfitting and optimism bias. In addition, the follow-up time of 4.26 ± 1.03 years was short, and long-term follow-up is required.
5 Conclusion
In summary, we used an ML method to construct a prognostic model with 10 selected variables for predicting the risk of stroke in patients with hypertension. The XGBoost model performed better than the traditional models. The ML predictive model may help identify hypertensive patients at risk of developing stroke so that targeted prevention strategies can be carried out, and it has potential for application in clinical practice.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Ethics statement
The studies involving humans were approved by the Ethics Committee of the First Hospital of China Medical University. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.
Author contributions
YZ: Funding acquisition, Writing – original draft, Data curation, Formal analysis. WD: Data curation, Formal analysis, Writing – original draft, Investigation, Methodology, Software. WW: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This study was supported by the Liaoning Province Science and technology plan joint program natural science foundation (2024-MSLH-268), Project of the Doctoral Research Start-up Fund Program of Liaoning Province (2025-BS-0634) and Shenyang Municipal Public Health Research and Development Special Project (24-214-3-163).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2026.1737655/full#supplementary-material
Supplementary Figure S1 | The process for building and evaluating the performance of a ML method.
Supplementary Figure S2 | Individual-level explanation of the XGBoost model prediction using a SHAP force-style plot. The model output is shown on the log-odds scale, starting from the base value (dashed line) and progressing to the final predicted risk of 18.1%. Each bar represents the contribution of an individual feature to the patient-specific prediction, with colors indicating whether the feature increases or decreases the predicted risk.
References
Angraal, S., Mortazavi, B. J., Gupta, A., Khera, R., Ahmad, T., Desai, N. R., et al. (2020). Machine learning prediction of mortality and hospitalization in heart failure with preserved ejection fraction. JACC Heart Fail. 8, 12–21. doi: 10.1016/j.jchf.2019.06.013
Asplund, K., Tuomilehto, J., Stegmayr, B., Wester, P. O., and Tunstall-Pedoe, H. (1988). Diagnostic criteria and quality control of the registration of stroke events in the Monica project. Acta Med. Scand. Suppl. 728, 26–39. doi: 10.1111/j.0954-6820.1988.tb05550.x
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78, 1–3. doi: 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
Chen, S. (2022). Models of artificial intelligence-assisted diagnosis of lung cancer pathology based on deep learning algorithms. J. Healthc. Eng. 2022:3972298. doi: 10.1155/2022/3972298
Chi, C. L., Wang, J., Ying Yew, P., Lenskaia, T., Loth, M., Mani Pradhan, P., et al. (2022). Producing personalized statin treatment plans to optimize clinical outcomes using big data and machine learning. J. Biomed. Inform. 128:104029. doi: 10.1016/j.jbi.2022.104029
Chia, Y. C., Lim, H. M., and Ching, S. M. (2014). Validation of the pooled cohort risk score in an Asian population - a retrospective cohort study. BMC Cardiovasc. Disord. 14:163. doi: 10.1186/1471-2261-14-163
Chobanian, A. V., Bakris, G. L., Black, H. R., Cushman, W. C., Green, L. A., Izzo, J. L. Jr., et al. (2003). The seventh report of the joint national committee on prevention, detection, evaluation, and treatment of high blood pressure: the JNC 7 report. JAMA 289, 2560–2572. doi: 10.1001/jama.289.19.2560
Chung, A., Iheonunekwu, N., Gilbert, D. T., and Barton, E. N. (2007). Cardiac disease in dialysis patients in a Jamaican hospital: echocardiographic findings that predict mortality. West Indian Med. J. 56, 305–308. doi: 10.1590/S0043-31442007000300024
D'Agostino, R. B., Wolf, P. A., Belanger, A. J., and Kannel, W. B. (1994). Stroke risk profile: adjustment for antihypertensive medication. the Framingham study. Stroke 25, 40–43. doi: 10.1161/01.STR.25.1.40
D'Agostino, R. B. Sr., Vasan, R. S., Pencina, M. J., Wolf, P. A., Cobain, M., Massaro, J. M., et al. (2008). General cardiovascular risk profile for use in primary care: the Framingham heart study. Circulation 117, 743–753. doi: 10.1161/CIRCULATIONAHA.107.699579
DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845. doi: 10.2307/2531595
Dong, Y., Shi, H., Chen, X., Fu, K., Li, J., Chen, H., et al. (2021). Serum uric acid and risk of stroke: a dose-response meta-analysis. J. Clin. Biochem. Nutr. 68, 221–227. doi: 10.3164/jcbn.20-94
Feigin, V. L., Krishnamurthi, R. V., Parmar, P., Norrving, B., Mensah, G. A., Bennett, D. A., et al. (2015). Update on the global burden of ischemic and hemorrhagic stroke in 1990-2013: the GBD 2013 Study. Neuroepidemiology 45, 161–176. doi: 10.1159/000441085
Go, A. S., Chertow, G. M., Fan, D., McCulloch, C. E., and Hsu, C. Y. (2004). Chronic kidney disease and the risks of death, cardiovascular events, and hospitalization. N. Engl. J. Med. 351, 1296–1305. doi: 10.1056/NEJMoa041031
Goff, D. C. Jr., Lloyd-Jones, D. M., Bennett, G., Coady, S., D'Agostino, R. B., Gibbons, R., et al. (2014). 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American college of cardiology/American heart association task force on practice guidelines. Circulation 129, 49–73. doi: 10.1161/01.cir.0000437741.48606.98
Guo, X., Li, Z., Zhou, Y., Yu, S., Yang, H., Sun, G., et al. (2021a). The effects of transitions in metabolic health and obesity status on incident cardiovascular disease: insights from a general Chinese population. Eur. J. Prev. Cardiol. 28, 1250–1258. doi: 10.1177/2047487320935550
Guo, X., Li, Z., Zhou, Y., Yu, S., Yang, H., Sun, G., et al. (2021b). Corrected QT interval is associated with stroke but not coronary heart disease: insights from a general Chinese population. Front. Cardiovasc. Med. 8:605774. doi: 10.3389/fcvm.2021.605774
Gupta, S., Berry, J. D., Ayers, C. R., Peshock, R. M., Khera, A., de Lemos, J. A., et al. (2010). Left ventricular hypertrophy, aortic wall thickness, and lifetime predicted risk of cardiovascular disease: the Dallas heart study. JACC Cardiovasc. Imaging 3, 605–613. doi: 10.1016/j.jcmg.2010.03.005
Hippisley-Cox, J., Coupland, C., and Brindle, P. (2013). Derivation and validation of QStroke score for predicting risk of ischaemic stroke in primary care and comparison with other risk scores: a prospective open cohort study. BMJ 346:f2573. doi: 10.1136/bmj.f2573
Hu, F., Hu, L., Yu, R., Han, F., Zhou, W., Wang, T., et al. (2021). Prospective study of serum uric acid levels and first stroke events in Chinese adults with hypertension. Front. Physiol. 12:807420. doi: 10.3389/fphys.2021.807420
Jiang, Y., Ma, R., Guo, H., Zhang, X., Wang, X., Wang, K., et al. (2020). External validation of three atherosclerotic cardiovascular disease risk equations in rural areas of Xinjiang, China. BMC Public Health 20:1471. doi: 10.1186/s12889-020-09579-4
Juhola, M., Joutsijoki, H., Penttinen, K., Shah, D., and Aalto-Setala, K. (2021). On computational classification of genetic cardiac diseases applying iPSC cardiomyocytes. Comput. Methods Programs Biomed. 210:106367. doi: 10.1016/j.cmpb.2021.106367
Kannel, W. B., McGee, D., and Gordon, T. (1976). A general cardiovascular risk profile: the Framingham study. Am. J. Cardiol. 38, 46–51. doi: 10.1016/0002-9149(76)90061-8
Leigh, J. A., O'Neal, W. T., and Soliman, E. Z. (2016). Electrocardiographic left ventricular hypertrophy as a predictor of cardiovascular disease independent of left ventricular anatomy in subjects aged ≥65 years. Am. J. Cardiol. 117, 1831–1835. doi: 10.1016/j.amjcard.2016.03.020
Li, J., Liu, F., Yang, X., Cao, J., Chen, S., Chen, J., et al. (2021). Validating World Health Organization cardiovascular disease risk charts and optimizing risk assessment in China. Lancet Reg. Health West Pac. 8:100096. doi: 10.1016/j.lanwpc.2021.100096
Li, J. L., Li, C. S., Fu, J. H., Zhang, K., Xu, R., and Xu, W. J. (2015). Evaluation of cranial and cervical arteries and brain tissue in transient ischemic attack patients with magnetic resonance angiography and diffusion-weighted imaging. Med. Sci. Monit. 21, 1726–1731. doi: 10.12659/MSM.894388
Li, T., Li, G., Guo, X., Li, Z., Yang, J., and Sun, Y. (2021). Predictive value of echocardiographic left atrial size for incident stoke and stroke cause mortality: a population-based study. BMJ Open 11:e043595. doi: 10.1136/bmjopen-2020-043595
Li, Z., Yang, Y., Zheng, L., Sun, G., Guo, X., and Sun, Y. (2021). It's time to add electrocardiography and echocardiography to CVD risk prediction models: results from a prospective cohort study. Risk Manag. Healthc. Policy 14, 4657–4671. doi: 10.2147/RMHP.S337466
Liu, J., Hong, Y., D'Agostino, R. B. Sr., Wu, Z., Wang, W., Sun, J., et al. (2004). Predictive value for the Chinese population of the Framingham CHD risk assessment tool compared with the Chinese multi-provincial cohort study. JAMA 291, 2591–2599. doi: 10.1001/jama.291.21.2591
Liu, L., Wang, D., Wong, K. S., and Wang, Y. (2011). Stroke and stroke care in China: huge burden, significant workload, and a national priority. Stroke 42, 3651–3654. doi: 10.1161/STROKEAHA.111.635755
Mathers, C. D., and Loncar, D. (2006). Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med. 3:e442. doi: 10.1371/journal.pmed.0030442
Miller, J., Kinni, H., Lewandowski, C., Nowak, R., and Levy, P. (2014). Management of hypertension in stroke. Ann. Emerg. Med. 64, 248–255. doi: 10.1016/j.annemergmed.2014.03.004
Motwani, M., Dey, D., Berman, D. S., Germano, G., Achenbach, S., Al-Mallah, M. H., et al. (2017). Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis. Eur. Heart J. 38, 500–507. doi: 10.1093/eurheartj/ehw188
Naganuma, T., Takemoto, Y., Shoji, T., Shima, H., Ishimura, E., Okamura, M., et al. (2013). Cerebral white matter hyperintensity predicts cardiovascular events in haemodialysis patients. Nephrology 18, 676–681. doi: 10.1111/nep.12115
Segar, M. W., Vaduganathan, M., Patel, K. V., McGuire, D. K., Butler, J., Fonarow, G. C., et al. (2019). Machine learning to predict the risk of incident heart failure hospitalization among patients with diabetes: the WATCH-DM risk score. Diabetes Care 42, 2298–2306. doi: 10.2337/dc19-0587
Shi, X., Yang, J., Wang, L., Zhao, M., Zhang, C., He, M., et al. (2017). Prospective study of serum uric acid levels and stroke in a Chinese hypertensive cohort. Clin. Exp. Hypertens. 39, 527–531. doi: 10.1080/10641963.2017.1281938
Shu, S., Ren, J., and Song, J. (2021). Clinical application of machine learning-based artificial intelligence in the diagnosis, prediction, and classification of cardiovascular diseases. Circ. J. 85, 1416–1425. doi: 10.1253/circj.CJ-20-1121
Sosner, P., Hulin-Delmotte, C., Saulnier, P. J., Cabasson, S., Gand, E., Torremocha, F., et al. (2015). Cardiovascular prognosis in patients with type 2 diabetes: contribution of heart and kidney subclinical damage. Am. Heart J. 169, 108–114. doi: 10.1016/j.ahj.2014.09.012
Su, J. H., Meng, L. W., Dong, D., Zhuo, W. Y., Wang, J. M., Liu, L. B., et al. (2020). Noninvasive model for predicting future ischemic strokes in patients with silent lacunar infarction using radiomics. BMC Med. Imaging 20:77. doi: 10.1186/s12880-020-00470-7
Wang, W. Z., Jiang, B., Sun, H. X., Ru, X. J., Sun, D. L., Wang, L. H., et al. (2017). Prevalence, incidence, and mortality of stroke in China results from a nationwide population-based survey of 480 687 adults. Circulation 135, 759–771. doi: 10.1161/CIRCULATIONAHA.116.025250
Wang, Y., Liu, J., Wang, W., Wang, M., Qi, Y., Xie, W., et al. (2016). Lifetime risk of stroke in young-aged and middle-aged Chinese population: the Chinese multi-provincial cohort study. J. Hypertens. 34, 2434–2440. doi: 10.1097/HJH.0000000000001084
Wang, Y. J., Li, Z. X., Gu, H. Q., Zhai, Y., Jiang, Y., Zhao, X. Q., et al. (2020). China stroke statistics 2019: a report from the national center for healthcare quality management in neurological diseases, China national clinical research center for neurological diseases, the Chinese stroke association, national center for chronic and non-communicable disease control and prevention, Chinese center for disease control and prevention and institute for global neuroscience and stroke collaborations. Stroke Vasc. Neurol. 5, 211–239. doi: 10.1136/svn-2020-000457
Wu, X., Yuan, X., Wang, W., Liu, K., Qin, Y., Sun, X., et al. (2020). Value of a machine learning approach for predicting clinical outcomes in young patients with hypertension. Hypertension 75, 1271–1278. doi: 10.1161/HYPERTENSIONAHA.119.13404
Xing, X., Yang, X., Liu, F., Li, J., Chen, J., Liu, X., et al. (2019). Predicting 10-year and lifetime stroke risk in Chinese population. Stroke 50, 2371–2378. doi: 10.1161/STROKEAHA.119.025553
Yang, X., Li, J., Hu, D., Chen, J., Li, Y., Huang, J., et al. (2016). Predicting the 10-year risks of atherosclerotic cardiovascular disease in Chinese population: the China-PAR project (prediction for ASCVD risk in China). Circulation 134, 1430–1440. doi: 10.1161/CIRCULATIONAHA.116.022367
Yang, Y., Zheng, J., Du, Z., Li, Y., and Cai, Y. (2021). Accurate prediction of stroke for hypertensive patients based on medical big data and machine learning algorithms: retrospective study. JMIR Med. Inform. 9:e30277. doi: 10.2196/30277
Zhang, S., Liu, L., Huang, Y. Q., Lo, K., Tang, S., and Feng, Y. Q. (2020). The association between serum uric acid levels and ischemic stroke in essential hypertension patients. Postgrad. Med. 132, 551–558. doi: 10.1080/00325481.2020.1757924
Zhang, X. F., Attia, J., D'Este, C., and Ma, X. Y. (2006). The relationship between higher blood pressure and ischaemic, haemorrhagic stroke among Chinese and Caucasians: meta-analysis. Eur. J. Cardiovasc. Prev. Rehabil. 13, 429–437. doi: 10.1097/00149831-200606000-00020
Zhou, M., Wang, H., Zhu, J., Chen, W., Wang, L., Liu, S., et al. (2016). Cause-specific mortality for 240 causes in China during 1990–2013: a systematic subnational analysis for the global burden of disease study 2013. Lancet 387, 251–272. doi: 10.1016/S0140-6736(15)00551-6
Keywords: hypertension, ML method, prediction, stroke, XGBoost
Citation: Zhou Y, Deng W and Wang W (2026) Machine learning increases the prediction of stroke for Chinese hypertensive patients. Front. Microbiol. 17:1737655. doi: 10.3389/fmicb.2026.1737655
Received: 02 November 2025; Revised: 28 December 2025; Accepted: 05 January 2026;
Published: 23 January 2026.
Edited by:
Jinghua Zhang, Hohai University, China
Reviewed by:
Mengkai Yan, Hohai University, China
Hao Xu, Garvan Institute of Medical Cancer Biology Laboratory, Australia
Copyright © 2026 Zhou, Deng and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Wentao Wang, wangwentao@cancerhosp-ln-cmu.com
†These authors have contributed equally to this work and share first authorship
Ying Zhou1†