Development and Validation of a Predictive Model for Coronary Artery Disease Using Machine Learning

Early identification of coronary artery disease (CAD) can prevent its progression and effectively lower the mortality rate, so we aimed to construct and validate a machine learning model to predict the risk of CAD based on conventional risk factors and laboratory test data. A total of 3,112 CAD patients and 3,182 controls were enrolled from three centers in China. We compared the baseline and clinical characteristics between the two groups. The Random Forest algorithm was then used to construct a model to predict CAD, and the model was assessed by the receiver operating characteristic (ROC) curve. In the development cohort, the Random Forest model achieved an AUC of 0.948 (95% CI: 0.941–0.954) for distinguishing CAD patients from controls, with a sensitivity of 90%, a specificity of 85.4%, a positive predictive value of 0.863, and a negative predictive value of 0.894. Validation of the model also yielded favorable discriminatory ability, with an AUC, sensitivity, specificity, positive predictive value, and negative predictive value of 0.944 (95% CI: 0.934–0.955), 89.5%, 85.8%, 0.868, and 0.886 in validation cohort 1, respectively, and 0.940 (95% CI: 0.922–0.960), 79.5%, 94.3%, 0.932, and 0.823 in validation cohort 2, respectively. An easy-to-use tool that combines 15 indexes to assess CAD risk was constructed and validated using the Random Forest algorithm and showed favorable predictive capability (http://45.32.120.149:3000/randomforest). Our model is valuable for clinical practice and will be helpful for the management and primary prevention of CAD.


INTRODUCTION
Coronary artery disease (CAD) remains the leading cause of morbidity and mortality worldwide (1,2). The main pathogenic mechanism of CAD is atherosclerosis, a complex and progressive process of chronic inflammation characterized by endothelial cell dysfunction, cumulative deposition of lipoprotein particles, migration of monocytes and macrophages, and proliferation of vascular smooth muscle cells (VSMCs), which ultimately narrows the vessel and impedes blood supply to the heart (3,4). The reference standard for CAD diagnosis is invasive coronary angiography, which allows real-time evaluation of the location and degree of coronary stenosis and guides the choice of the most suitable therapy (5). However, its use for population screening is limited by the need for a specialized catheterization laboratory and by the associated radiation exposure (6,7). Consequently, sensitive, specific, and non-invasive indicators for CAD risk assessment are urgently needed.
The development and progression of coronary atherosclerosis are modulated by multiple interactions between genetic and environmental factors (8). Consistent and convincing evidence has established a causal relationship between lipoprotein-related lipid levels and cardiovascular disease prevalence (9-11). High levels of circulating low-density lipoprotein cholesterol (LDL-C) and triglyceride (TG)-rich lipoproteins are associated with a high risk of CAD, whereas high levels of high-density lipoprotein cholesterol (HDL-C) are associated with a low CAD risk. Furthermore, pedigree studies have demonstrated that triglyceride, LDL-C, and HDL-C concentrations are strongly determined by individual genetic architecture. For instance, rare variants in the apolipoprotein B (APOB) and LDL receptor (LDLR) genes and common variants in the apolipoprotein E (APOE) gene can increase LDL-C levels and are also associated with increased susceptibility to CAD (12). In addition to cholesterol, epidemiological studies have substantiated other canonical risk factors, such as age, male gender, smoking, alcohol drinking, hypertension, diabetes, and obesity. To devise and improve preventive strategies for CAD, it is essential to understand and properly quantify the etiological contribution of these risk factors. In this study, we sought to evaluate the predictive value of these traditional risk factors for CAD using machine learning algorithms.

Study Design and Data Collection
This three-stage case-control study, involving 3,112 CAD patients and 3,182 controls, retrospectively collected data from three clinical centers: the development cohort, with 2,014 CAD cases and 2,018 controls from Wuhan Asia Heart Hospital between March 2014 and October 2016; validation cohort 1, with 837 CAD cases and 876 controls from Zhongnan Hospital of Wuhan University between January 2016 and December 2017; and validation cohort 2, with 261 CAD cases and 258 controls from Shandong Provincial Hospital between January 2017 and February 2018. CAD was diagnosed by coronary angiography, defined as stenosis ≥50% in at least one main coronary artery or its major branches. Patients with other cardiac diseases, autoimmune diseases, systemic diseases, or cancers were excluded. The control groups consisted of non-CAD individuals identified by physical examination and medical history evaluation. Traditional CAD risk factors, such as age, gender, alcohol drinking, cigarette smoking, and histories of hyperlipidemia, hypertension, and type 2 diabetes mellitus (T2DM), and clinical information, including blood pressure, body mass index (BMI), fasting plasma glucose (FPG), total cholesterol (TC), total triglyceride (TG), LDL-C, and HDL-C, were retrospectively collected from the database of electronic medical records and laboratory test reports. The study was approved by the Ethics Committees of Wuhan Asia Heart Hospital, Zhongnan Hospital of Wuhan University, and Shandong Provincial Hospital and adhered to the tenets of the Declaration of Helsinki. Informed consent was obtained from all participants.

Machine Learning Algorithms
Logistic regression is a probabilistic statistical classification model that can be applied to predict the class of a nominal variable based on certain features. Classification is performed by using the logit function to estimate the outcome probability. The support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. It first maps the data into a multidimensional feature space constructed by the kernel function and then determines the optimal hyperplane that partitions the training set with the maximum margin. Decision trees, one of the simplest kinds of decision model, use a tree structure built by recursive partitioning to model the relationships between the features and the potential outcomes. Once the model is created, the resulting structure is shown in a human-readable format. Random forests (RF) are modified bagged trees that randomly select the predictor features considered for splitting at each node and aggregate the votes of many decision trees for classification. They retain many of the strengths of the decision tree and exhibit high accuracy in disease diagnosis and risk prediction.
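As a minimal illustration of two of the ideas described above, the following Python sketch shows how a logistic model turns a linear score into an outcome probability and how a forest aggregates the votes of its trees. It is not the code used in this study, which was performed in R; the function names, coefficients, and votes are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def logistic_probability(features, weights, intercept):
    """Convert a linear score into a probability with the inverse logit."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical inputs, for illustration only (not study data).
p = logistic_probability(features=[1.0, 2.0], weights=[0.5, -0.25], intercept=0.0)
label = majority_vote(["CAD", "control", "CAD", "CAD", "control"])
```

In practice the coefficients are estimated from the data and each vote comes from a fitted tree; the sketch only shows the final aggregation step.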

Statistical Analysis
Qualitative variables were expressed as frequencies with proportions, and the differences between cases and controls were examined by the Chi-square test. Quantitative variables were shown as mean with standard deviation (SD) and were assessed for normality by the Kolmogorov-Smirnov test. The independent t-test and the Mann-Whitney U-test were performed to compare two groups of continuous variables with or without a normal distribution, respectively. A two-sided P < 0.05 was considered statistically significant.
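For a binary risk factor compared between cases and controls, the Chi-square test reduces to a 2 x 2 table. The Python sketch below computes the Pearson statistic and its P value for one degree of freedom. It is an illustration under stated assumptions, not the authors' R code: no continuity correction is applied, and the counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = cases/controls, columns = exposed/unexposed.
    Returns the statistic and the P value for 1 degree of freedom."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be positive")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # For 1 degree of freedom, the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, for illustration only (not study data).
statistic, p_value = chi_square_2x2(30, 70, 50, 50)
significant = p_value < 0.05
```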
Least absolute shrinkage and selection operator (LASSO) regression analysis was applied to identify relatively important features. The logistic regression model was fitted to the data using the glm function with the family argument set to binomial, and the summary function was used to check the coefficients and their p-values. We used the e1071 package to build the linear SVM model, since it contains the tune.svm function, which can optimize the tuning parameters and kernel functions through cross-validation. To build the classification tree model, we used the rpart function from the rpart package and inspected the error per split to determine the optimal number of splits in the tree; the prune function was then used to prune the tree. The randomForest function from the randomForest package was used to build the random forest model. The optimal tree was determined by the minimum mean of squared residuals; the number of trees constructed in this model was 178, and three variables were randomly selected to split at each node. The prediction model was established by the four aforementioned machine learning algorithms, and the predictive capability was evaluated by the area under the receiver operating characteristic (ROC) curve and the precision-recall curve, using the precrec package. For the final model, the ROCR package was used to assess the classification accuracy in the development and validation cohorts. The methods and the interpretation of the results were verified by a machine learning expert. All data analysis was performed in R software (version 3.6.0).
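The two sources of randomness in a random forest, namely a bootstrap sample of patients for each tree and a random subset of candidate variables at each node, can be sketched as follows. This Python sketch is illustrative only and does not reproduce the internals of the randomForest package; the helper names, the seed, and the sample size of 100 are assumptions. The value of three candidate variables per node is taken from the text, and the feature labels follow the 15 variables named in this article.

```python
import random

# Labels for the 15 variables named in this article.
FEATURES = [
    "HDL-C", "LDL-C", "TG", "BMI", "TC", "DBP", "SBP", "FPG", "age",
    "hypertension", "smoking", "sex", "T2DM", "hyperlipidemia", "drinking",
]


def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in drawn]
    return in_bag, out_of_bag


def candidate_features(feature_names, mtry, rng):
    """Pick mtry distinct features to be considered for one node split."""
    return rng.sample(feature_names, mtry)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(100, rng)  # 100 rows is a hypothetical size
split_candidates = candidate_features(FEATURES, 3, rng)
```

The out-of-bag rows of each tree are the ones that can be used to estimate the error of that tree without a separate test set.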

Baseline and Clinical Characteristics of Study Population
In the development cohort, CAD patients were significantly older and had a higher body mass index (BMI), higher proportions of smoking, alcohol drinking, and histories of hyperlipidemia, hypertension, and T2DM, higher concentrations of TC, TG, LDL-C, and FPG, and lower levels of HDL-C, systolic blood pressure (SBP), and diastolic blood pressure (DBP) compared with controls (Table 1). In the two validation cohorts, the baseline and clinical characteristics of the study population were broadly similar to those of the development cohort (Table 1).

Machine Learning Model Evaluation
After LASSO regression filtering, all of these variables remained significant (Figure 1). We therefore used four machine learning algorithms (logistic regression, RF, decision tree classification, and SVM) to construct the full model with all of these variables, and the ROC curve (Figure 2) and the precision-recall curve (Figure 3) were used to assess their performance. Ultimately, the random forest model was chosen for further analysis owing to its highest predictive accuracy.
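The area under the ROC curve used to compare the four models can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The Python sketch below computes the AUC directly from that definition. It is a simple illustration, not the precrec or ROCR implementation, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of case-control pairs in which the case has the
    higher score; tied pairs count as one half. labels: 1 = case, 0 = control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted probabilities, for illustration only (not study data).
auc = roc_auc([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; rank-based formulas give the same value more efficiently.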

Model Construction and Validation
We constructed a risk prediction model with all of these variables using random forests. The out-of-bag (OOB) estimate of the error rate was 12.26%, indicating that the generalization error of this model is relatively small. The mean decrease in the Gini index indicated that HDL-C, followed by LDL-C, TG, BMI, and TC, were important features for CAD risk evaluation (Figure 4). In the development cohort, this model yielded a high AUC of 0.948 (95% CI: 0.941-0.954) for distinguishing CAD patients from controls, with a sensitivity of 90%, a specificity of 85.4%, a positive predictive value of 0.863, and a negative predictive value of 0.894 (Figure 5A). Consistent with the development cohort, favorable discriminatory ability was also demonstrated in the two validation cohorts, with an AUC, sensitivity, specificity, positive predictive value, and negative predictive value of 0.944 (95% CI: 0.934-0.955), 89.5%, 85.8%, 0.868, and 0.886 in validation cohort 1 (Figure 5B), respectively, and 0.940 (95% CI: 0.922-0.960), 79.5%, 94.3%, 0.932, and 0.823 in validation cohort 2 (Figure 5C), respectively. The model is also available as a web calculator to facilitate its application (http://45.32.120.149:3000/randomforest).
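Sensitivity, specificity, positive predictive value, and negative predictive value all derive from the four cells of a confusion matrix. The Python sketch below shows the relationships; the counts are hypothetical round numbers chosen for illustration and are not the confusion matrix of this study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic accuracy measures from confusion-matrix counts.
    tp/fn: cases classified as CAD / as control;
    tn/fp: controls classified as control / as CAD."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion of cases detected
        "specificity": tn / (tn + fp),  # proportion of controls correctly cleared
        "ppv": tp / (tp + fp),          # probability of CAD given a positive call
        "npv": tn / (tn + fn),          # probability of no CAD given a negative call
    }


# Hypothetical counts, for illustration only (not study data).
metrics = diagnostic_metrics(tp=90, fp=15, tn=85, fn=10)
```

Unlike sensitivity and specificity, the two predictive values depend on the proportion of cases in the sample, so values obtained in a roughly balanced case-control design need not carry over to a screening population.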

DISCUSSION
In the current study, we elucidated the significant contributions of age, gender, alcohol drinking, cigarette smoking, hyperlipidemia, hypertension, T2DM, TC, TG, HDL-C, LDL-C, SBP, DBP, and FPG to the risk of CAD. Subsequently, we constructed and validated a Random Forest model integrating these indexes, which showed favorable discriminability and can be helpful for the non-invasive identification of CAD patients.
The most important feature identified by Random Forest was HDL-C. HDL-C has been regarded as "good cholesterol" largely owing to an inverse association between high HDL-C levels and low CAD risk (13,14). The main functions of HDL are to facilitate reverse cholesterol transport and regulate inflammation (15). The atheroprotective effects observed for HDL from healthy individuals are markedly impaired in CAD patients (16,17). The Framingham study suggested that about 44% of clinical CAD events occurred in men with HDL-C >40 mg/dL and nearly 43% in women with HDL-C >50 mg/dL. In addition, Mendelian randomization genetic studies have suggested that genetically raised HDL-C levels were not associated with a decreased risk of myocardial infarction, in contrast to genetic variants associated with lower LDL-C levels (18). Furthermore, recent large-scale clinical studies have also failed to validate the preventive effect of HDL-C-raising treatments on coronary disease (19,20). These results highlight that HDL-C levels are not necessarily causally associated with coronary disease and that normal serum HDL-C levels do not guarantee freedom from CAD events (15). Nevertheless, our model showed that HDL-C levels made the greatest contribution to CAD risk prediction.
A high serum LDL-C level is a well-established risk factor for cardiovascular disease, especially CAD. Genetic studies have revealed that variants in PCSK9 (proprotein convertase subtilisin/kexin type 9), HMGCR (HMG-coenzyme A reductase), and NPC1L1 (Niemann-Pick C1-like intracellular cholesterol transporter 1) are associated with decreased LDL-C levels and low CAD risk (21-23). Moreover, large-scale clinical studies have suggested that decreasing LDL-C by targeting these proteins is a safe and effective approach to reduce the risk of coronary disease. Additionally, lowering serum LDL-C levels can reduce the mortality and morbidity of cardiovascular diseases in both primary and secondary prevention (24,25). Consistent with these findings, our model also showed that LDL-C levels were an important feature in CAD risk assessment.
Large triglyceride-rich lipoprotein particles, including chylomicrons (CM) and very low-density lipoprotein (VLDL) particles, pass through the arterial wall via transcytosis in specialized vesicles instead of direct penetration. These particles can be engulfed by arterial macrophages, leading to massive cholesterol deposition and foam cell formation in coronary arteries (26). Therefore, elevated blood triglyceride levels are associated with the development of CAD by directly participating in atherosclerotic plaque formation and progression (27). However, the relationship between increased plasma triglyceride levels and cardiovascular disease has been controversial in epidemiological studies. A large meta-analysis comprising 10,158 CAD patients from 262,525 individuals in 29 prospective studies suggested a modestly significant correlation between triglyceride levels and CAD risk (28). In contrast, other studies did not find a significant association after multivariable adjustment for smoking, hypertension, diabetes, BMI, and glucose levels. Meanwhile, some of these studies also implied that even slightly elevated triglyceride levels were correlated with a high risk of recurrent CVD events in patients receiving statin treatment and should be regarded as a useful risk indicator (29). Moreover, genetic studies have demonstrated that increased blood triglyceride levels are causally associated with high CAD risk (30). In addition, a Mendelian randomization study revealed that genetically decreased non-fasting plasma triglyceride concentrations could reduce all-cause mortality (31). In accord with these reports, our model suggested that blood triglyceride levels are helpful for CAD risk prediction.
Aside from HDL-C, LDL-C, and triglyceride levels, other variables such as BMI, TC, DBP, SBP, FPG, age, hypertension, smoking, sex, T2DM, hyperlipidemia, and drinking are of relative importance in stratifying a patient's risk. Age is the most important factor associated with the progression of CAD, as well as with death once coronary atherosclerosis occurs (32). Previous studies have demonstrated a conspicuous sex difference in CAD incidence and mortality (33,34). In general, men develop CAD earlier than women (35). Dietary cholesterol can raise the concentration of serum total cholesterol, which is associated with a high risk of cardiovascular disease (36). Obesity has been shown to be a common cause of cardiovascular mortality in developed countries (37). Excess abdominal visceral fat can result in atherosclerotic disease (37). Dysregulation of adipocyte-derived endocrine factors in overnutrition has been presumed to be implicated in the progression of atherosclerosis (38). Hypertension is pathologically related to CAD, and arterial hypertension can aggravate CAD (39). Furthermore, hypertension is often correlated with other risk factors of CAD, such as dyslipidemia and insulin resistance (40). Diabetes has been reported to be frequently correlated with high levels of triglyceride and low levels of HDL-C (41). Smoking can induce endothelial exposure and platelet adhesion to the subintimal layer, thus increasing lipoprotein particle penetration and proliferation of smooth muscle cells (SMCs) (42). Meanwhile, the cardiovascular system is sensitive to the toxic effects of alcohol. High-dose alcohol drinking can induce extensive coronary arterial damage and increase the risk of developing CAD (43).
The Framingham risk score is a well-known prediction algorithm that has been widely applied to evaluate CAD risk in different populations, including Chinese populations (44,45). However, since the risk equation was developed in 1976 and more than 99% of participants were of European descent, it is necessary to reconfirm the predictive values of traditional risk factors for Chinese populations owing to intrinsic differences in diet and lifestyle, social environment, and genetic predisposition. Furthermore, rapidly increasing per capita income, westernization of lifestyle, an aging population, and longer life spans have contributed to conspicuous changes in CVD epidemiology and risk factor patterns in China during the past decade (46). Therefore, an updated CAD risk appraisal tool developed from recent data on the Chinese population would generalize better to this population. Some recent studies used nomograms to assess CAD risk based on the results of multivariate logistic regression or Cox proportional hazard regression (47-50). Although these studies provided substantial clinical benefits, these models have the inherent drawback that the algorithm is sensitive to multicollinearity and missing values. Random forest is an ensemble classifier that applies many decision trees to the dataset and integrates the results from all the trees by taking a majority vote. It can improve prediction accuracy without considerably increasing the amount of computation. Our study established and validated a Random Forest model, which shows favorable predictive capability and clinical application value.
Some possible limitations of our study should be emphasized. First, this is a retrospective study, so some potential inherent biases cannot be ignored and causal inference is limited. Second, we took only 15 traditional CAD risk factors into account; future studies with more variables, including individual genetic information, are needed to further confirm our results. Finally, this was a three-center study of a Chinese population from only two provinces, which may restrict its generalizability. Therefore, future prospective multicenter studies from other areas of China are required to validate the findings of our study.
Collectively, an easy-to-use tool that combines 15 indexes to assess CAD risk was constructed and validated using the Random Forest algorithm and showed favorable predictive capability (http://45.32.120.149:3000/randomforest). Our model is valuable for clinical practice and will be helpful for the primary prevention and management of CAD.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committees of Wuhan Asia Heart Hospital, Zhongnan Hospital of Wuhan University, and Shandong Provincial Hospital. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
XG, BL, YX, XZ, and ZL collected clinical information and laboratory data. CW and BJ analyzed the data. CW and YZ generated the figures and wrote the manuscript. FZ designed and supervised this study and revised the manuscript. All authors read and approved the final manuscript.