Machine learning model for cardiovascular disease prediction in patients with chronic kidney disease

Introduction: Cardiovascular disease (CVD) is the leading cause of death in patients with chronic kidney disease (CKD). This study aimed to develop CVD risk prediction models using machine learning to support clinical decision making and improve patient prognosis.
Methods: Electronic medical records from patients with CKD at a single center from 2015 to 2020 were used to develop machine learning models for the prediction of CVD. Least absolute shrinkage and selection operator (LASSO) regression was used to select important features predicting the risk of developing CVD. Seven machine learning classification algorithms were used to build models, which were evaluated by receiver operating characteristic curves, accuracy, sensitivity, specificity, and F1-score, and Shapley Additive explanations was used to interpret the model results. CVD was defined as composite cardiovascular events including coronary heart disease (coronary artery disease, myocardial infarction, angina pectoris, and coronary artery revascularization), cerebrovascular disease (hemorrhagic stroke and ischemic stroke), deaths from all causes (cardiovascular deaths, non-cardiovascular deaths, unknown cause of death), congestive heart failure, and peripheral artery disease (aortic aneurysm, aortic or other peripheral arterial revascularization). A cardiovascular event was a composite outcome of multiple cardiovascular events, as determined by reviewing medical records.
Results: This study included 8,894 patients with CKD, with a composite CVD event incidence of 25.9%; a total of 2,304 patients reached this outcome. LASSO regression identified eight important features for predicting the risk of CKD developing into CVD: age, history of hypertension, sex, antiplatelet drugs, high-density lipoprotein, sodium ions, 24-h urinary protein, and estimated glomerular filtration rate. The model developed using Extreme Gradient Boosting in the test set had an area under the curve of 0.89, outperforming the other models, indicating that it had the best CVD predictive performance.
Conclusion: This study established a CVD risk prediction model for patients with CKD, based on routine clinical diagnostic and treatment data, with good predictive accuracy. This model is expected to provide a scientific basis for the management and treatment of patients with CKD.


Introduction
Chronic kidney disease (CKD) affects more than 10% of the global population and is closely associated with increases in the incidence and mortality rates of cardiovascular disease (CVD) (1,2). Therefore, effective prediction and management of the cardiovascular risk in this group are of paramount importance (3).
However, there are few established prediction tools developed specifically for this population, and classical models, such as the Framingham prediction model and the Systematic Coronary Risk Evaluation (SCORE) tool, perform poorly in patients with CKD (4-6). Hence, predicting the risk of cardiovascular events in patients with CKD is an important research area that still requires refinement and the development of more accurate and reliable prediction tools tailored for patients with CKD.
After conducting a literature review, we identified several modeling techniques commonly used in predictive modeling tasks. These include Logistic Regression (LR), Cox Model, Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Extreme Gradient Boosting (XGBoost), and Backpropagation Neural Network (BPNN). Of note, XGBoost has emerged as a prevalent machine learning method that has demonstrated effective outcomes in various risk prediction models. Zelnick et al. (7) used a predictive model developed using gradient boosting machines, which demonstrated superior performance in forecasting atrial fibrillation events among patients with CKD compared to previously published predictive models.
This study utilized data from patients' medical records to identify risk factors associated with cardiovascular events in patients with CKD. Subsequently, machine learning was employed to construct artificial intelligence models to predict disease occurrence and assist clinicians in the timely detection of cardiovascular events.

Study population
This was a single-center retrospective study of inpatients with CKD, using data from the electronic medical record system of the Chinese People's Liberation Army General Hospital (PLAGH), a large tertiary hospital. Data were collected from patients who received treatment at the Nephrology Department of the PLAGH between January 1, 2015, and December 31, 2020, totaling 8,894 cases. The inclusion criteria were as follows: (1) diagnosis of CKD according to the 2012 Kidney Disease: Improving Global Outcomes guidelines or a clinical diagnosis of CKD; (2) age ≥18 years; and (3) complete data on key clinical indicators including creatinine and routine urine indicators. Patients with acute kidney failure were excluded.
This study was approved by the Ethics Committee of the General Hospital of the Chinese People's Liberation Army (S2021-696-01). The study was conducted in accordance with the Declaration of Helsinki, and the ethics committee waived the requirement for signed informed consent.

Study outcome
CVD was defined as a composite of cardiovascular events, including coronary heart disease (coronary artery disease, myocardial infarction, angina pectoris, and coronary artery revascularization), cerebrovascular disease (hemorrhagic stroke and ischemic stroke), death from all causes (cardiovascular death, non-cardiovascular death, and unknown cause of death), congestive heart failure, and peripheral artery disease (aortic aneurysm, aortic or other peripheral arterial revascularization) (8-10). A cardiovascular event was defined as a composite outcome of multiple cardiovascular events, as determined by reviewing medical records.

Other definitions
The definition of CKD was based primarily on the pathological diagnosis from biopsy findings in the medical records, on a clinical diagnosis by a nephrologist, or on the Kidney Disease Outcomes Quality Initiative (KDOQI) guidelines, which define CKD as renal injury or an estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m² persisting for 3 months or more. Renal injury was defined as the presence of pathological abnormalities or injury markers, including abnormal blood or urine test results, or imaging findings. The eGFR was calculated using the Chronic Kidney Disease Epidemiology Collaboration equation (11).
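As an illustration of how an eGFR value can be derived from serum creatinine, the sketch below implements a creatinine-based CKD-EPI equation. The text does not state which version of the equation was used; the 2021 race-free form is assumed here purely for illustration, and the function names and the unit-conversion helper are hypothetical.

```python
def creatinine_umol_to_mg_dl(scr_umol_l: float) -> float:
    """Convert serum creatinine from umol/L to mg/dL (1 mg/dL = 88.4 umol/L)."""
    return scr_umol_l / 88.4


def egfr_ckd_epi_2021(scr_mg_dl: float, age_years: float, female: bool) -> float:
    """eGFR in mL/min/1.73 m^2 from serum creatinine (mg/dL).

    Assumption: 2021 CKD-EPI creatinine equation; the study may have
    used a different version of the CKD-EPI equation.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.241 if female else -0.302    # exponent applied below kappa
    ratio = scr_mg_dl / kappa
    egfr = (142.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.200
            * 0.9938 ** age_years)
    return egfr * 1.012 if female else egfr


# Hypothetical patient: 50-year-old man with creatinine 1.8 mg/dL.
value = egfr_ckd_epi_2021(1.8, 50, female=False)
print(f"eGFR = {value:.1f} mL/min/1.73 m^2, below 60: {value < 60}")
```

A single value below 60 is not sufficient on its own; under the definition above, the reduction must persist for 3 months or more.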

Clinical data extraction
We collected patients' demographic information, vital signs, and clinical data, including the following: age, sex, history of hypertension, C-reactive protein, serum creatinine (SCr), eGFR, blood urea nitrogen (BUN), total cholesterol, triglycerides, high-density lipoprotein (HDL), low-density lipoprotein, serum cystatin C, serum albumin, hemoglobin, blood potassium, blood sodium, blood calcium, blood chloride, blood phosphorus, blood magnesium, plasma D-dimer, interleukin 6, activated partial thromboplastin time, prothrombin time, and medication treatment measures, including whether antiplatelet drugs were taken. The 24-h urine protein, body mass index (BMI), neutrophil-to-lymphocyte ratio, monocyte-to-lymphocyte ratio, platelet-to-lymphocyte ratio, and platelet count × (neutrophil/lymphocyte count) were also calculated using information from the medical records.
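The derived variables listed above are simple ratios of routine measurements. A minimal sketch of how they might be computed is given below; the function names, the dictionary keys, and the label "SII" for platelet count × (neutrophil/lymphocyte count) are illustrative assumptions and do not come from the study.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI in kg/m^2."""
    return weight_kg / height_m ** 2


def inflammation_indices(neutrophils: float, lymphocytes: float,
                         monocytes: float, platelets: float) -> dict:
    """Derived blood-count ratios; all counts must share the same unit (e.g. 10^9/L)."""
    if lymphocytes <= 0:
        raise ValueError("lymphocyte count must be positive")
    return {
        "NLR": neutrophils / lymphocytes,               # neutrophil-to-lymphocyte ratio
        "MLR": monocytes / lymphocytes,                 # monocyte-to-lymphocyte ratio
        "PLR": platelets / lymphocytes,                 # platelet-to-lymphocyte ratio
        "SII": platelets * neutrophils / lymphocytes,   # platelet x (neutrophil/lymphocyte)
    }


# Hypothetical blood counts, for illustration only.
print(inflammation_indices(neutrophils=4.0, lymphocytes=2.0,
                           monocytes=0.5, platelets=200.0))
```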

Statistical analysis
Data were analyzed using SPSS version 26, R version 4.3.2, and Python version 3.4. Variables with >25% missing values were excluded, and the remaining missing data were imputed using the KNN method. Continuous variables conforming to normal or approximately normal distributions were expressed as mean ± standard deviation and compared using Student's t-test. Continuous variables not conforming to normal distributions were expressed as median (M) (quartile 1 [Q1], quartile 3 [Q3]) and compared using the Wilcoxon rank-sum (Mann-Whitney U) test. Categorical variables were described using counts (%) and compared using the chi-square test. A P-value <0.05 was considered significant.
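The missing-data handling described above can be sketched in two steps: drop variables with more than 25% missing values, then fill the remaining gaps from the nearest complete neighbours. The pure-Python code below is a simplified illustration, not the authors' implementation; the number of neighbours `k`, the distance measure, and the function names are assumptions, and in practice features would usually be scaled before distances are computed.

```python
import math


def drop_sparse_columns(rows, max_missing=0.25):
    """Keep only columns whose fraction of missing (None) values is <= max_missing."""
    n = len(rows)
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / n <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest donor rows."""
    def dist(a, b):
        # Root-mean-square difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                donors = sorted((dist(row, other), other[j])
                                for m, other in enumerate(rows)
                                if m != i and other[j] is not None)
                nearest = [v for _, v in donors[:k]]
                if nearest:  # leave the gap if no donor observes this column
                    filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy data: the third column is 75% missing and is dropped.
data = [[1.0, 10.0, None], [1.1, None, None], [5.0, 50.0, 3.0], [1.2, 12.0, None]]
kept_rows, kept_cols = drop_sparse_columns(data)
print(kept_cols, knn_impute(kept_rows, k=2))
```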

Data augmentation
To address the imbalance between patients with and without CVD, we employed the synthetic minority over-sampling technique (SMOTE) (12) for data augmentation. After augmentation, the groups with and without CVD each contained 6,640 cases.
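The core idea of SMOTE is to create each synthetic minority case by interpolating between a real minority case and one of its nearest minority-class neighbours. The sketch below shows that idea in pure Python; it is a simplified illustration rather than the implementation used in the study, and the neighbour count `k`, the seed, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Balancing 2,304 CVD cases against 6,640 non-CVD cases requires this many
# synthetic CVD cases (counts taken from the text).
n_needed = 6640 - 2304
print(n_needed)

# Hypothetical two-feature minority points, for illustration only.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(toy_minority, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two real minority points, the augmented cases stay within the region already occupied by the minority class.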

Model construction
The dataset was divided into training and testing sets in an 8:2 ratio. Least absolute shrinkage and selection operator (LASSO) regression analysis was used to select variables that could predict CVD risk. After a thorough literature search, we selected modeling methods in current use. Seven machine learning models were built and their performance compared: SVM, LR, Naive Bayes (NB), KNN, XGBoost, RF, and BPNN.
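For reference, LASSO feature selection for a binary outcome is commonly formulated as L1-penalized logistic regression. The expression below is the standard textbook form and is an assumption about the parameterization; the text does not state the exact formulation. Here $y_i \in \{0, 1\}$ is the CVD outcome of patient $i$, $x_i$ is the vector of $p$ candidate features, and $\lambda \ge 0$ is the tuning parameter.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
       - \log\!\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

As $\lambda$ increases, more coefficients $\beta_j$ are shrunk exactly to zero, so the features with non-zero coefficients at the chosen $\lambda$ form the selected set.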

Evaluation metrics for machine learning
Receiver operating characteristic (ROC) curves were drawn to assess model performance, with accuracy, sensitivity, specificity, F1-score (13), and area under the curve (AUC) used as indicators of each model's ability to predict cardiovascular events. The formulas for the model evaluation indices are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP represents the number of true positives; FP, number of false positives; FN, number of false negatives; and TN, number of true negatives.
These indicators were used to validate the results, evaluate their ability to predict cardiovascular events, and select the best model.
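The evaluation indices can be computed directly from the four cells of a confusion matrix. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not the study's results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, sensitivity, specificity, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=80, fp=20, fn=20, tn=80))
```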

Model interpretation
Shapley Additive Explanations (SHAP) was used to interpret the model results. SHAP (14) values are used to explain the output of any machine learning model by quantifying the impact of each feature on the prediction.
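SHAP values are based on Shapley values from cooperative game theory: each feature's contribution is its average marginal effect over all subsets of the other features. The sketch below computes exact Shapley values by enumeration for a small toy payoff function. It illustrates the definition only; it is not how SHAP libraries compute values for tree models, and the payoff function and its numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values; `value` maps a frozenset of feature indices to a payoff."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of subsets of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical additive "model": baseline risk 0.1 plus fixed feature effects.
effects = {0: 0.30, 1: -0.10, 2: 0.20}


def toy_value(subset):
    return 0.1 + sum(effects[j] for j in subset)


phi = shapley_values(toy_value, 3)
print(phi)
# The contributions sum to the prediction minus the baseline.
print(sum(phi), toy_value(frozenset({0, 1, 2})) - toy_value(frozenset()))
```

For an additive payoff like this one, each Shapley value equals the feature's own effect, which makes the toy case easy to check by hand.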

Study population
Data from 8,894 patients with CKD were collected; patients were divided based on the presence (n=2,304) or absence (n=6,640) of CVD. As there was an imbalance in the data, data augmentation was used to balance the number of patients with or without CVD to 6,640 cases each. The dataset was divided into a training set (n=10,624) and a testing set (n=2,656) at an 8:2 ratio. Data from the training set were used to evaluate important variables related to CVD and to establish prediction models. Data from the testing set were used to assess the performance of the prediction models trained on the training set. The data collection process is illustrated in Figure 1.
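The reported set sizes follow directly from an 8:2 split of the 13,280 augmented cases. The sketch below shows one way such a split could be made; whether the study stratified by outcome is not stated, so the stratification, the seed, and the function name are assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


# 6,640 cases per group after augmentation (counts taken from the text).
labels = [0] * 6640 + [1] * 6640
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```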

Clinical characteristics of the included patients
As shown in Table 1, a total of 13,280 patients were enrolled in the study following data augmentation. The median age of the participants was 52 years. Among them, 69.1% were male, and 30.9% were female. The incidence of composite CVD events was 50% (6,640/13,280), with 6,640 patients reaching the outcome. The average age of patients with CVD was significantly higher than that of patients without CVD. BMI and prevalence of hypertension were higher in the CVD group than in the non-CVD group. The proportion of males was significantly higher in the CVD group. Nearly all laboratory indicators, including hemoglobin, SCr, eGFR, 24-h urinary protein, HDL, BUN, and inflammatory markers such as the neutrophil-to-lymphocyte ratio, were significantly different between the CVD and non-CVD groups (P<0.05).

Feature selection
LASSO regression was used to select the important variables associated with CVD. The optimal tuning parameter (lambda) in the LASSO regression model was determined using five-fold cross-validation. In the cross-validation curve, dotted vertical lines mark the value of lambda with the minimum deviance (minimum criterion) and the most regularized value within one standard error of the minimum (Figure 2A). In the coefficient profile plot, a vertical line marks the value of lambda selected by five-fold cross-validation, at which eight features had non-zero coefficients (Figure 2B). These eight predictors of CVD occurrence (Figure 2) were age, history of hypertension, sex, antiplatelet medication, HDL, sodium, 24-h urinary protein, and eGFR.
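The two candidate values of lambda described above can be read off the cross-validation results with a simple rule. The sketch below assumes the mean cross-validated deviance and its standard error are already available for each candidate lambda; the function name and all numbers are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min: lambda with the lowest mean cross-validated deviance.
    lambda_1se: largest (most regularized) lambda whose mean deviance is
                within one standard error of that minimum.
    """
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[best], lambda_1se


# Hypothetical five-fold cross-validation results, for illustration only.
lambdas = [0.001, 0.01, 0.05, 0.1, 0.5]
cv_mean = [1.10, 1.02, 1.00, 1.03, 1.30]
cv_se = [0.04, 0.04, 0.04, 0.04, 0.04]
print(select_lambdas(lambdas, cv_mean, cv_se))
```

The one-standard-error choice trades a small loss in fit for a sparser model, whereas the minimum criterion keeps the best-fitting model.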

Model construction and evaluation
We evaluated seven machine learning models for predicting CVD using the training and testing datasets: SVM, LR, NB, KNN, XGBoost, RF, and BPNN. The training set was used to train the models, and the testing set was used to test their accuracy and generalizability. The performance of the different models is shown in Figure 3 and Table 2. The test set had the following AUC values: SVM, 0.817; LR, 0.817; KNN, 0.784 (lowest AUC); RF, 0.829; BPNN, 0.808; NB, 0.796; and XGBoost, 0.893. The XGBoost model had the highest AUC, exceeding that of all other models, which indicates a good ability to distinguish between the presence and absence of CVD. In addition, the XGBoost model exhibited the highest accuracy (0.806), specificity (0.8), and F1 score (0.806), which suggests that it is the best-performing model among those evaluated. These results indicate that the model has strong predictive ability and potential for use in clinical settings.
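The AUC has a direct probabilistic reading: it is the probability that a randomly chosen patient with CVD receives a higher predicted score than a randomly chosen patient without CVD, with ties counted as one half. The sketch below computes it by pairwise comparison; the function name and the toy scores are hypothetical, and the quadratic loop is for clarity rather than efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted risks, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(y_true, scores))
```

Under this reading, an AUC of 0.893 means that the model ranks a patient with CVD above a patient without CVD in roughly 89% of such pairs.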

Model interpretation
SHAP was used to interpret the predictions of the XGBoost machine learning model by calculating the contribution of each feature to the CVD prediction (14). Figure 4B shows the ranking of the eight risk factors in descending order of importance: age, history of hypertension, sex, 24-h urinary protein, antiplatelet medication, eGFR, sodium, and HDL. Age was the most influential feature, followed by history of hypertension and sex. SHAP summary plots (Figures 4A, 5) were used to visually represent the impact of each variable on the model's output. The position of the SHAP value (x-axis) indicates the impact of the feature on the prediction, with each point representing a sample and redder (bluer) colors indicating higher (lower) feature values. If the SHAP value increases with an increase in the feature value, it indicates a positive correlation between the feature and the predicted outcome; otherwise, it indicates a negative relationship (Figures 4A, 5). The results show that for older age, a history of hypertension, male sex, and lower eGFR, most patients' SHAP values are positive, indicating that these characteristics increase the predicted risk of cardiovascular events. Higher levels of HDL and the use of antiplatelet medications reduce the predicted risk of cardiovascular events. The prediction results of the XGBoost model are displayed using a confusion matrix, where the positive predictive value is 80.7% and the negative predictive value is 80.5% (Figure 6).

Discussion
This study retrospectively analyzed clinical data from the electronic medical records of 8,894 Chinese patients with CKD and successfully constructed a risk prediction model for the occurrence of CVD in these patients. To our knowledge, this is the first large-sample risk prediction model for cardiovascular events in adult patients with CKD that is based on a Chinese population and uses clinical indicators. The demographic characteristics of these patients, including age and sex, were representative of patients with CKD, and the model showed good efficacy and strong clinical application value.
Previous studies have explored the construction of prediction models for CVD in CKD (15-18). R. Avram et al. conducted a cohort study in which elastic net regression was employed to develop a proteomic risk model for predicting cardiovascular risk among 2,182 participants from a chronic kidney dysfunction cohort (19), with AUC values ranging from 0.84 to 0.89 over 1 to 10 years, yet the clinical model performed poorly, with AUCs between 0.70 and 0.73. In addition, the Chronic Renal Insufficiency Cohort study constructed a 10-year atherosclerotic cardiovascular disease risk prediction model for patients with CKD (20); the AUC of the Chronic Renal Insufficiency Cohort model developed using clinically available variables was 0.760, and the model targeted atherosclerotic cardiovascular disease rather than composite cardiovascular events. A cohort study aimed to predict atrial fibrillation events in CKD with models developed using machine learning methods in the CKD population, which were compared to previously published prediction models; however, the C-index of the model using clinical variables was only 0.67 (7).
This study utilized clinical variables to construct a risk prediction model for cardiovascular events in CKD with superior performance. Among the seven machine learning models, most artificial intelligence models have shown a higher predictive performance than traditional LR and Cox regression models. Artificial neural networks (21) are highly suitable for extensive datasets rich in sequential and unstructured features; they require the estimation of a large number of parameters and thus need substantial data to avoid overfitting. Machine learning, using routine clinical data, can accurately predict CVD in CKD. For this study, which primarily involves straightforward numerical variables, simpler machine learning models would be more appropriate. XGBoost, a gradient-boosted ensemble of decision trees, is relatively robust to multicollinearity and is characterized by its flexibility and efficiency.

FIGURE 3 Performance of the seven prediction models on the training dataset (A) and testing dataset (B). SVM, support vector machine; Log Reg, logistic regression; XGBoost, extreme gradient boosting; KNN, k-nearest neighbor; NB, naïve Bayes; RF, random forest; BPNN, backpropagation neural network.

In our study, we compared the performance of prediction models generated by seven machine learning algorithms and used the XGBoost ensemble machine learning method to construct the model. The results showed that the model had the highest AUC (0.89), sensitivity, and F1 score, indicating a good predictive effect on the risk of CVD in patients with CKD. This may be due to the effectiveness of XGBoost in handling complex patterns for disease prediction. Machine learning is often referred to as a "black box." To explain the decision-making process of the XGBoost model, we employed the SHAP visualization method to interpret our predictions (14). The combination of machine learning and SHAP can provide clear explanations for individualized risk predictions and allow doctors to intuitively understand the impact of key features in the model (22). This study explained the contribution of the model and risk factors using SHAP values, which showed that age, hypertension, sex, and dyslipidemia influence the risk of CVD in patients with CKD. Specifically, patients with CKD who are older, male, and have hypertension, lower eGFR, and lower high-density lipoprotein levels have a higher risk of CVD. The risk factors included in the prediction model can be used to predict the CVD risk. This finding supports previous research and emphasizes the importance of risk factors for the occurrence of CVD events later in life. Starting from the mid-20th century with the Framingham Study (23), previous research has shown that older age, hypertension, male sex, and dyslipidemia are traditional independent risk factors for CVD in patients with CKD (24-27), and these factors were also included in the Framingham prediction model (28). Sodium accumulates in tissues, potentially causing systemic inflammation and directly affecting myocardial and vascular structures. High sodium levels lead to blood pressure changes and sodium retention in patients with CKD, thereby increasing the risk of CVD (29). Compared with individuals without CKD, higher sodium content in the muscles and skin was observed in patients undergoing dialysis (30), which was positively correlated with systemic inflammation.
Previous research has identified multiple risk factors for CVD, while recent studies have focused on using artificial intelligence or regression-based models to identify new risk factors and provide insights into disease mechanisms, thereby improving the accuracy of CVD predictions in patients with CKD. These computational models attempt to overcome the limitations of traditional models by incorporating a broader range of variables and using advanced techniques. The integration of novel CKD-specific markers and the use of complex computational techniques are expected to address these limitations. The new models combine traditional and CKD-specific risk factors, recognizing the complex interactions between CKD progression and cardiovascular health. Many studies have confirmed that a decline in eGFR and albuminuria are independent risk factors for CVD death (31,32). Matsushita et al. (15) added kidney-specific indicators, such as eGFR and the urine albumin-creatinine ratio (UACR), to the CKD supplement model, verifying that their inclusion significantly improves the risk prediction of CVD in patients with CKD. This is also relevant because the primary clinical guidelines for CVD prevention have yet to adopt CKD in CVD risk prediction; SCr is recommended as a primary marker of the eGFR in the current KDOQI guidelines, and eGFR-defined CKD is related to adverse CVD outcomes (33). This study incorporated 24-h proteinuria into the model, which differs from the previously used albuminuria and UACR. Considering the higher cost of albumin measurement compared to total protein measurement, using the UACR or urine protein-creatinine ratio (UPCR) for population screening is reasonable (33). However, despite the convenience of the UPCR and UACR in quantifying proteinuria, their use has limitations. A randomly measured UPCR or UACR may not always reflect the 24-h excretion rate because protein or albumin excretion varies with the time of day, stress levels, fatigue, and other factors. Therefore, we incorporated the 24-h quantitative measurement of proteinuria into the model; the SHAP plots show that an increase in 24-h proteinuria is associated with an increased risk of CVD. As traditional CVD risk factors may carry different weights in the CKD population (34), other indicators can serve as valuable supplements to enhance the predictive capability of such models. SHAP visualizations provide information for clinical decision making, highlighting the complexity of predictive models. Factors included in the predictive model are rooted in established and emerging evidence. New indicators of kidney disease risk are integrated into relevant predictive models as complements, offering better possibilities for clinicians to accurately assess patients' cardiovascular risk and take appropriate intervention measures to reduce the incidence of cardiovascular events.
This study has some limitations. First, the model was not validated externally and was based on retrospective data, necessitating prospective cohort studies to verify the accuracy and stability of the model. Second, only routine clinical indicators were included in this study, and the addition of novel biomarkers from multi-omics studies may further enhance the model. However, this study was based on a large sample of the Chinese population and had a good model effect. Additionally, the use of clinically accessible indicators to build a CVD prediction model has strong clinical application value.
Moving forward, we may consider integrating our predictive model into clinical practice via mini-programs and mobile applications. This approach may facilitate precise diagnosis and treatment in clinical settings.
In conclusion, this study successfully established a risk prediction model for CVD in patients with CKD. The risk prediction model is expected to serve as a practical tool for clinicians to identify high-risk individuals at an early stage and initiate targeted interventions in a timely manner, thus improving the scientific accuracy of clinical decision-making.

FIGURE 6 The confusion matrix of the XGBoost model predictions.
Inclusion criteria: (1) diagnosis of CKD according to the 2012 Kidney Disease: Improving Global Outcomes guidelines or a clinical diagnosis of CKD; (2) age ≥18 years; and (3) complete data on key clinical indicators including creatinine and routine urine indicators. Patients with acute kidney failure were excluded.

FIGURE 2 Feature selection using the LASSO binomial regression model. LASSO, least absolute shrinkage and selection operator. (A) The partial likelihood deviance (binomial deviance) curve was plotted versus log (lambda). LASSO coefficient profiles of the 31 baseline features. (B) Tuning parameter (λ) selection in the LASSO model used 5-fold cross-validation via minimum criteria variable selection. LASSO coefficient profiles of the 8 features.

FIGURE 5 SHAP dependence plot of the XGBoost model.

TABLE 1 Clinical characteristics of patients following data augmentation.

TABLE 2 Performance metrics for seven models in testing dataset.
FIGURE 4 (A) SHAP summary plot of the XGBoost model with 8 variables. (B) Feature importance plot of the XGBoost model.