Predicting the risk of subclinical atherosclerosis based on interpretable machine models in a Chinese T2DM population

Background Cardiovascular disease (CVD) has emerged as a global public health concern. Identifying and preventing subclinical atherosclerosis (SCAS), an early indicator of CVD, is critical for improving cardiovascular outcomes. This study aimed to construct interpretable machine learning models for predicting SCAS risk in type 2 diabetes mellitus (T2DM) patients.
Methods This study included 3084 T2DM individuals who received health care at Zhenhai Lianhua Hospital, Ningbo, China, from January 2018 to December 2022. The least absolute shrinkage and selection operator combined with random forest-recursive feature elimination were used to screen for characteristic variables. Linear discriminant analysis, logistic regression, Naive Bayes, random forest, support vector machine, and extreme gradient boosting were employed in constructing risk prediction models for SCAS in T2DM patients. The area under the receiver operating characteristic curve (AUC) was employed to assess the predictive capacity of the model through 10-fold cross-validation. Additionally, the SHapley Additive exPlanations were utilized to interpret the best-performing model.
Results The percentage of SCAS was 38.46% (n=1186) in the study population. Fourteen variables, including age, white blood cell count, and basophil count, were identified as independent risk factors for SCAS. Nine predictors, including age, albumin, and total protein, were screened for the construction of risk prediction models. After validation, the random forest model exhibited the best clinical predictive value in the training set with an AUC of 0.729 (95% CI: 0.709-0.749), and it also demonstrated good predictive value in the internal validation set [AUC: 0.715 (95% CI: 0.688-0.742)]. The model interpretation revealed that age, albumin, total protein, total cholesterol, and serum creatinine were the top five variables contributing to the prediction model.
Conclusion The construction of SCAS risk models based on the Chinese T2DM population contributes to its early prevention and intervention, which would reduce the incidence of adverse cardiovascular prognostic events.


Introduction
Type 2 diabetes mellitus (T2DM) is a metabolic disorder characterized by insulin resistance and relative insulin deficiency. In recent years, the prevalence of T2DM has increased steadily, making it a serious public health issue. Updated estimates for 2021 showed that about 10.5% of the global population had T2DM, and this figure was projected to increase to 12.2% by 2045 (1). Cardiovascular disease (CVD) is the leading cause of death and disability in T2DM (2,3). Studies have shown that the risk of CVD in patients with T2DM is two to four times higher than in individuals without diabetes (4,5). Atherosclerosis (AS), the predominant pathophysiologic process in CVD, may begin early in life and remain latent and asymptomatic for extended periods before progressing to advanced stages. Subclinical atherosclerosis (SCAS) serves as an early indicator of atherosclerotic burden, and its timely recognition can help slow down or prevent the progression to CVD (6). Therefore, the early identification and effective management of SCAS in individuals with T2DM are crucial strategies to mitigate progression to overt CVD, thereby improving life expectancy and quality of life.
Diagnostic methods for SCAS include angiography, intravascular ultrasound, carotid ultrasound (CUS), computed tomography (CT), and magnetic resonance imaging. Measuring carotid intima-media thickness (CIMT) and coronary artery calcification (CAC) using CUS and CT has become the mainstay for assessing SCAS, owing to their noninvasive and easily accessible nature (7,8). However, large-scale use of CUS and CT would inevitably strain medical resources and increase costs. Thus, establishing an assessment tool capable of screening individuals at high risk for SCAS without the need for imaging examinations is of great significance.
In recent years, artificial intelligence (AI) and machine learning (ML) have increasingly been utilized in the healthcare field (9). Several studies have applied ML methods to SCAS. For example, Sánchez-Cabo et al. (10) developed a SCAS risk prediction model for young asymptomatic individuals using four ML algorithms, demonstrating good clinical predictive value with an area under the receiver operating characteristic curve (AUC) of 0.890. Additionally, Nuñez et al. (11) used ML methods to identify circulating proteins that can predict SCAS, also showing good clinical predictive value with an AUC of 0.730. However, there are few reports on risk prediction models for SCAS in T2DM patients. The purpose of this study was to establish SCAS risk prediction models based on interpretable machine learning algorithms, contributing to the early identification of SCAS and guiding appropriate prevention and interventions.

Participants
This study enrolled 3140 T2DM individuals who had sought medical care through outpatient visits, inpatient admissions, and routine physical examinations at Zhenhai Lianhua Hospital in Ningbo, China, from January 2018 to December 2022. The sample size for this study adhered to the rule of 10 events per variable (12). The demographic data, comorbidities, complications, and biochemical parameters were obtained by questionnaires and laboratory tests. Inclusion criteria: participants aged ≥ 18 years who self-reported T2DM, were undergoing pharmacological treatment for T2DM, or met the diagnostic criteria of T2DM. These criteria include fasting blood glucose (FBG) levels of ≥ 7.0 mmol/L, 2-hour blood glucose levels of ≥ 11.1 mmol/L, or a glycated hemoglobin level of ≥ 6.5% (13). Exclusion criteria: individuals with other forms of diabetes mellitus, concurrent coronary heart disease or cerebral infarction, acute complications related to diabetes mellitus, malignant tumors, severe liver and kidney function abnormalities, or pregnancy. SCAS was defined as CIMT > 1.0 mm and/or the presence of plaque without clinical manifestations (14). Participants with more than 20% missing data were excluded (n=56), and missing values in those with less than 20% missing data were filled by multiple imputation (Supplementary Figure 1). Ultimately, 3084 T2DM patients were included in this study. The study's flow diagram is depicted in Figure 1.
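The events-per-variable rule mentioned above can be checked with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes that "events" refers to the smaller outcome class, which is the usual reading of the rule.

```python
def max_candidate_predictors(n_events: int, n_non_events: int, epv: int = 10) -> int:
    """Upper bound on candidate predictors under an events-per-variable rule.

    The limiting count is the smaller of the two outcome classes.
    """
    limiting = min(n_events, n_non_events)
    return limiting // epv


# Counts reported in the study: 1186 participants with SCAS, 1898 without.
print(max_candidate_predictors(1186, 1898))
```

With 1186 SCAS cases, the rule of 10 would allow on the order of a hundred candidate predictors, so the variable counts reported in this study sit well within it.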

Statistical analysis
The Kolmogorov-Smirnov test was used to assess the normality of sample distributions. Normal continuous variables were expressed as means (standard deviation, SD), non-normal continuous variables as median (interquartile range, IQR), and categorical variables as frequency (percentage, %). Between-group analyses involved independent samples t-tests for normal continuous variables, Mann-Whitney U tests for non-normal continuous variables, and chi-square tests for categorical variables. Box plots were used to elucidate the relationship between various metabolic parameters [including atherogenic index of plasma (AIP), Castelli risk index (CRI), metabolic score for insulin resistance (METS-IR), and triglyceride-glucose (TyG) index] and SCAS. These parameters were calculated as follows: AIP = Log(TG/HDL); CRI = TC/HDL; METS-IR = [Ln(2 * FBG + TG) * BMI]/Ln(HDL); TyG = Ln[(TG * FBG)/2]. Multivariate logistic regression identified independent risk factors for SCAS. Restricted cubic spline was employed to analyze the dose-response relationship between AIP and SCAS.
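As an illustration of the four metabolic parameters defined above, the following is a minimal sketch rather than the authors' code. The function names are hypothetical. It assumes that "Log" in the AIP formula denotes the base-10 logarithm, and it uses the commonly cited form of METS-IR, Ln(2 * FBG + TG) * BMI / Ln(HDL). The units of each input are not stated in the text and are left to the caller.

```python
import math


def aip(tg: float, hdl: float) -> float:
    """Atherogenic index of plasma: log10(TG / HDL) (base 10 is an assumption)."""
    return math.log10(tg / hdl)


def cri(tc: float, hdl: float) -> float:
    """Castelli risk index: TC / HDL."""
    return tc / hdl


def mets_ir(fbg: float, tg: float, bmi: float, hdl: float) -> float:
    """Metabolic score for insulin resistance: ln(2*FBG + TG) * BMI / ln(HDL)."""
    return math.log(2 * fbg + tg) * bmi / math.log(hdl)


def tyg(tg: float, fbg: float) -> float:
    """Triglyceride-glucose index: ln(TG * FBG / 2)."""
    return math.log(tg * fbg / 2)


# Hypothetical values for a single participant (illustration only).
print(round(aip(tg=1.8, hdl=1.1), 3))
print(round(cri(tc=5.0, hdl=1.25), 3))
```

Because METS-IR divides by Ln(HDL), the HDL value must be greater than 1 in whatever unit is used; otherwise the denominator is zero or negative.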
Least absolute shrinkage and selection operator (LASSO) combined with random forest-recursive feature elimination (RF-RFE) was used to screen for characteristic variables. Six ML methods, including linear discriminant analysis (LDA), logistic regression (LR), Naive Bayes (NB), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGboost), were used for model construction. The primary parameters used to evaluate the effectiveness of risk prediction models included accuracy, sensitivity, specificity, precision, recall, and the F1 score. AUC was utilized to assess the models' predictive ability. Calibration curves and the Brier score were used to assess calibration capability, while decision curve analysis (DCA) was employed to evaluate clinical applicability. Additionally, SHapley Additive exPlanations (SHAP) was used to interpret the best predictive model.
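The threshold-based performance parameters and the Brier score listed above can be written out from their standard definitions. The sketch below is illustrative only: the function names and the example counts are hypothetical and are not taken from the study's tables.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard metrics from confusion-matrix counts.

    Recall is the same quantity as sensitivity. No guard against empty
    classes is included, so every denominator must be non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
    }


def brier_score(y_true, y_prob) -> float:
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical counts and predictions, for illustration only.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(brier_score([1, 0], [0.8, 0.3]))
```

A lower Brier score indicates better agreement between predicted probabilities and observed outcomes, which is why it complements the calibration curve.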

Clinical baseline information of the study population
A total of 3084 participants were enrolled in this study, comprising 1898 individuals with T2DM without SCAS and 1186 individuals with T2DM with SCAS. The percentage of SCAS in the T2DM population was 38.46%. The median age of participants was 56 years (IQR: 49-61). Participants in the SCAS group were older, with a median age of 58 years (IQR: 53-62), compared to 54 years (IQR: 46-60) in the control group. The male proportion was similar in both groups (74.6% in the SCAS group vs. 73.8% in the control group, P > 0.05). Additionally, statistically significant differences were observed between the groups in terms of routine blood tests, lipid and glucose levels, and liver and kidney function (P < 0.05). The baseline clinical characteristics of the study population are presented in Table 1.
The AIP, CRI, METS-IR, and TyG index are metabolism-related parameters commonly used in the diagnosis and risk assessment of metabolism-related diseases (15-18). The current study showed that three metabolism-related parameters, including AIP, CRI, and TyG, were significantly higher in the SCAS group than in the control group (P < 0.05) (Figure 2).

Independent risk factors
Nineteen potential risk factors associated with SCAS were initially screened by univariate analysis (P < 0.05) (Table 1). To ensure the accuracy and credibility of the findings, we calculated the variance inflation factor (VIF) for each variable and considered variables to exhibit low multicollinearity when their VIF was below 10 (Supplementary Figure 2). Afterward, we performed stepwise backward logistic regression analysis with the Akaike information criterion to filter and remove multicollinear variables. Ultimately, fifteen variables were included in the multivariate logistic regression analysis, and fourteen of them, including Age, WBC, BASO, and LYC, were identified as independent risk factors for SCAS (P < 0.05) (Figure 3). Based on the independent risk factors, we proceeded to explore the correlation between the variables (Figure 4). From the correlation analysis, we observed a negative correlation between AIP and Age (r = -0.24, P < 0.01), MCV (r = -0.13, P < 0.01), and HDL (r = -0.69, P < 0.01). Additionally, positive correlations were observed between AIP and WBC (r = 0.14, P < 0.01), GGT (r = 0.28, P < 0.01), and SUA (r = 0.27, P < 0.01).
To further assess the clinical applicability of AIP, we conducted a diagnostic test evaluation and a dose-response analysis. The diagnostic test evaluation (Figure 5A) revealed that although AIP holds promise as a potential biomarker for SCAS, its diagnostic value on its own was limited (AUC: 0.535). The dose-response analysis (Figure 5B) demonstrated a linear association between AIP and the risk of SCAS (P-overall < 0.001, P-nonlinear = 0.319), with a significant increase in risk observed when AIP was greater than 0.625.

Construction of risk prediction models
FIGURE 2 Association of four metabolism-related parameters with risk of SCAS. AIP, atherogenic index of plasma; CRI, Castelli risk index; TyG, triglyceride-glucose; METS-IR, metabolic score for insulin resistance; SCAS, subclinical atherosclerosis.
The study population was divided into training and internal validation sets at a 6:4 ratio. The basic characteristics of the participants in the two sets did not differ (Table 2). LASSO is a dimensionality reduction algorithm that screens feature predictors by applying a penalty function that shrinks the regression coefficients of less informative variables to zero (19). RF-RFE is a recursive backward feature elimination method that evaluates the importance of variables and progressively removes the least important ones, ultimately screening the optimal number of features (20). In the training set, LASSO combined with RF-RFE was applied to screen the most characteristic variables for SCAS (Figures 6A, B). Subsequently, the common variables screened by both algorithms were selected as predictors for constructing the SCAS risk prediction models, which included Age, FBG, TC, HDL, LDL, TP, ALB, SUA, and SCR (Figure 6C). To determine the optimal risk prediction model, six machine learning algorithms, namely LDA, LR, NB, RF, SVM, and XGboost, were employed to construct risk prediction models.
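How an L1 penalty drives coefficients exactly to zero, and how the variables common to two selection methods are then taken, can be illustrated with a toy sketch. It uses the soft-thresholding operator, which is the closed-form LASSO solution in the simplified case of an orthonormal design. This is not the authors' pipeline: the coefficient values, the penalty strength, and the RF-RFE selection below are all made up for illustration.

```python
def soft_threshold(beta_ols: float, lam: float) -> float:
    """LASSO estimate for one coefficient under an orthonormal design.

    Coefficients whose unpenalized magnitude is at most `lam` become exactly 0;
    larger ones are shrunk toward 0 by `lam`.
    """
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0


# Hypothetical unpenalized coefficients (standardized predictors).
ols_coefs = {"Age": 0.42, "FBG": 0.15, "WBC": 0.04, "TC": -0.20, "PDW": -0.03}
lam = 0.05  # hypothetical penalty strength

# Variables that survive the L1 penalty.
lasso_selected = {
    name for name, beta in ols_coefs.items() if soft_threshold(beta, lam) != 0.0
}

# Hypothetical set retained by a second selector such as RF-RFE.
rfe_selected = {"Age", "TC", "PDW", "FBG"}

# Keep only the variables chosen by both methods.
common = sorted(lasso_selected & rfe_selected)
print(common)
```

Taking the intersection is a conservative choice: a variable is kept only if a linear, penalty-based method and a tree-based, importance-driven method both retain it.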

Validation of risk prediction models
FIGURE 4 Correlation analysis between the variables. MCV, mean red blood cell volume; HDL, high-density lipoprotein; PDW, platelet distribution width; MPV, mean platelet volume; FBG, fasting blood glucose; BASO, basophil count; AIP, atherogenic index of plasma; WBC, white blood cell count; LYC, lymphocyte count; GGT, gamma-glutamyl transpeptidase; SCR, serum creatinine; SUA, serum uric acid; TP, total protein. *P < 0.05; **P < 0.01.
FIGURE 5 Receiver operating characteristic (ROC) curve (A) and dose-response relationship (B) between AIP and subclinical atherosclerosis.
Within the training set, 10-fold cross-validation was employed to evaluate the predictive value of the models and showed that the RF model had the best clinical predictive value [AUC: 0.729 (95% CI: 0.709-0.749)], followed by the SVM model [AUC: 0.720 (95% CI: 0.705-0.735)] (Figure 7A). In the internal validation set, the RF model also demonstrated a good clinical predictive value [AUC: 0.715 (95% CI: 0.688-0.742)] (Figure 7B). Furthermore, a comprehensive comparison of other clinical performance parameters, such as sensitivity and specificity, was conducted among the prediction models (Table 3). The table shows that the RF model performed well across these parameters in the training set. The confusion matrix of the six machine learning models in the training set is shown in Figure 8. The calibration curve visually displays the fit of the risk prediction models. As shown in Figure 9, except for the XGboost and NB models, the predicted values of the other models closely match the theoretical values, demonstrating good clinical calibration.
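The two ingredients of this validation step, the AUC and the 10-fold partition, can be sketched in a few lines. The code below is a minimal illustration, not the software used in the study. It computes the AUC from its probabilistic definition (the chance that a randomly chosen case scores higher than a randomly chosen non-case, counting ties as one half) and builds a simple unstratified partition. The function names are hypothetical, and the text does not say whether the folds were stratified.

```python
import random


def auc(y_true, y_score) -> float:
    """AUC as the proportion of (case, non-case) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy example: four participants, two with the outcome.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

# Each fold serves once as the held-out set; the rest is used for fitting.
folds = k_fold_indices(25, k=10)
print([len(f) for f in folds])
```

In cross-validation the model is refitted k times and the held-out AUCs are summarized, which is how a single training set can yield both a point estimate and an interval.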
DCA was used to assess the clinical applicability of predictive models by showing the net benefit corresponding to different decision threshold probabilities. In the training set, all six ML models showed good clinical applicability (Figure 10A). Further, we calculated the risk threshold probability for the RF prediction model in the internal validation set, which showed that the RF model was clinically beneficial over a threshold probability range of 2%-70% (Figure 10B).
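The quantity plotted in a decision curve is the net benefit at each threshold probability pt, commonly defined as TP/N - (FP/N) * pt/(1 - pt). The sketch below applies that standard definition to toy data. The function names and the example values are hypothetical, and it is not the authors' implementation.

```python
def net_benefit(y_true, y_prob, threshold: float) -> float:
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold: float) -> float:
    """Reference strategy: treat every participant regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: outcomes and predicted risks for four participants.
y = [1, 1, 0, 0]
p = [0.9, 0.6, 0.7, 0.2]
print(net_benefit(y, p, threshold=0.5))
print(net_benefit_treat_all(y, threshold=0.5))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and the treat-none reference, whose net benefit is zero.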

Interpretation of risk prediction model
Based on the aforementioned analysis, we found that the RF prediction model performed well in both the training and internal validation sets, with the highest clinical predictive value observed in the training set [AUC: 0.729 (95% CI: 0.709-0.749)], and outperformed the other models in terms of accuracy, sensitivity, recall, and F1 score. Therefore, we selected the RF model as the optimal prediction model for further model interpretation. SHAP is a widely used method for interpreting predictive models in the field of ML; it interprets the model by computing the "contribution value" (Shapley value) of each characteristic predictor (21). Figure 11A depicts the contribution degree of the characteristic predictors to the prediction model, with the top five variables being Age, ALB, TP, TC, and SCR. Moreover, we observed that higher values of Age, TC, and SCR correspond to higher SHAP values and increased disease risk, whereas higher values of ALB and TP result in smaller SHAP values and reduced disease risk (Figure 11B).
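The idea of a per-predictor "contribution value" can be made concrete with an exact Shapley computation on a toy model. The sketch below is a simplification under stated assumptions: it compares one participant against a single reference point, whereas SHAP implementations typically average over a background dataset and use faster tree-specific algorithms for random forests. The risk score, its coefficients, and the input values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x: dict, baseline: dict) -> dict:
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(subset):
        mixed = {f: (x[f] if f in subset else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of a coalition of this size in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi


def toy_risk(v: dict) -> float:
    """Hypothetical linear risk score, not the fitted RF model."""
    return 0.03 * v["Age"] - 0.02 * v["ALB"] + 0.1 * v["TC"]


participant = {"Age": 60, "ALB": 40, "TC": 5.0}
reference = {"Age": 50, "ALB": 45, "TC": 4.5}
print(shapley_values(toy_risk, participant, reference))
```

In this toy case the contributions sum to the difference between the participant's score and the reference score, and an albumin value below the reference yields a positive contribution, mirroring the direction of effect described above.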

Discussion
This study included a total of 3084 T2DM individuals, of whom 1186 had SCAS. Multivariate logistic regression analysis identified 14 variables, such as Age, WBC, BASO, and LYC (P < 0.05), as independent risk factors for SCAS in T2DM patients. LASSO combined with RF-RFE algorithms revealed nine characteristic variables, namely Age, FBG, TC, HDL, LDL, TP, ALB, SUA, and SCR, as predictors for the SCAS risk model. Six ML models were developed and validated for clinical performance. Ultimately, the RF model exhibited the highest clinical predictive value in the training set [AUC: 0.729 (95% CI: 0.709-0.749)] and outperformed the other models in accuracy, sensitivity, recall, and F1 score. The SHAP interpretation of the RF model revealed that Age, ALB, TP, TC, and SCR were the top five variables that made the most significant contributions to the predictive model.
In this study, the percentage of SCAS in the T2DM population was 38.46%, lower than the 43.68% reported by Hashimoto et al. in a Japanese T2DM population (22), a difference that might be related to region and sample size. Multiple studies have demonstrated an association between the TyG index and the incidence of CVD, coronary artery stenosis, stroke, and AS (23,24). A meta-analysis has revealed that an elevated TyG index is associated with SCAS and arterial stiffness in the adult population (25). Notably, the I-Lan Longitudinal Aging Study identified an association between the TyG index and SCAS in non-diabetic individuals, but not in those with diabetes (26). Consistent with this finding, our study also found no statistically significant association between the TyG index and SCAS in the T2DM population. AIP has emerged as a novel predictive biomarker for CVD. Associations have been identified between elevated AIP levels and increased incidences of CAC and AS (27,28). In this study, we observed that for every 0.1 unit increase in AIP, the odds of SCAS increased by 31% [OR: 1.310 (1.201-1.401)]. However, the receiver operating characteristic curve indicated a limited diagnostic value for AIP alone (AUC: 0.535).
Age, PDW, MPV, SUA, and GGT were observed as independent risk factors for SCAS, consistent with previous studies (29-33). Inflammation-related markers such as WBC, BASO, and LYC were also found to be independent risk factors for SCAS. Long-term studies have shown that AS has a complex pathogenesis, primarily attributed to lipoprotein retention in the arterial wall and chronic inflammation (34,35). Hyperglycemia leads to increased inflammasome activity, upregulated nucleotide-binding oligomerization domain-like receptor 3, and ultimately elevated pro-inflammatory interleukin-1β and interleukin-18 levels (36). Our findings are consistent with the view that SCAS in T2DM is a chronic inflammatory condition. Dyslipidemia is a well-established independent risk factor for CVD. In our study, we observed that HDL is an independent risk factor for SCAS. While early research consistently demonstrated an inverse correlation between HDL levels and CVD risk (37,38), more recent studies have unveiled a non-linear, U-shaped relationship, with very high HDL levels associated with cardiovascular mortality (39,40).
Optimizing approaches for early diagnosis of SCAS and providing earlier and more precise interventions are crucial to reducing adverse cardiovascular events. Currently, CUS and CT examinations are the primary methods for screening SCAS, but their large-scale use inevitably leads to the wastage of medical resources and increased costs, particularly in low-income countries with limited resources. In recent years, with the growing demand for high-quality healthcare, AI has become a powerful tool in clinical medicine. ML, as a branch of AI, can analyze large datasets, find complex patterns, and generate insights that contribute to early disease diagnosis, drug discovery, and risk prediction (41,42). For instance, a study based on electronic health records used ML to generate an in-silico marker for coronary artery disease (CAD) that can non-invasively quantify AS and risk of death on a continuous spectrum, and identify underdiagnosed individuals (43). In addition, Ninomiya et al. (44) developed ML models to predict 5-year all-cause mortality in patients with CAD and assessed ML's benefit in guiding decision-making between percutaneous coronary intervention (PCI) and coronary artery bypass grafting (CABG). The results showed that the hybrid gradient boosting model was the most effective for predicting 5-year all-cause mortality (C-index of 0.78) and that ML is feasible and effective for identifying individuals who benefit from CABG or PCI. In this study, we have developed risk prediction models for SCAS in T2DM patients based on interpretable machine learning methods that could contribute to the early identification of high-risk individuals.
Our study carries significant clinical importance. This might be one of the first studies to perform SCAS risk prediction in the T2DM population using interpretable ML methods. As a chronic condition, SCAS is challenging to reverse once it develops, which underscores the value of early prevention over later treatment. This prediction model enables the identification of individuals at high risk of SCAS within the T2DM population, providing a valuable advantage for early disease prevention. Moreover, the prediction model could not only benefit medically underdeveloped regions but also inform the clinical decisions of physicians, thus contributing to the optimization of healthcare resources.
This study has certain unavoidable limitations. Firstly, the study population was limited to a specific region, which might impact the generalizability of the prediction model. Secondly, the collection of clinical data lacked comprehensiveness, which may have led to the omission of potential predictors. Thirdly, the risk prediction model has only undergone validation using internal datasets, necessitating further validation with external datasets. In future studies, we will conduct a long-term follow-up study and collaborate with multiple centers to further revise and improve the model.

Conclusions
In summary, the development, validation, and interpretation of the SCAS risk prediction model in a Chinese T2DM population have significant implications for the reduction and prevention of adverse cardiovascular events.
[...] renamed as The First Affiliated Hospital of Ningbo University), Ningbo, China (KY20220607). Informed consent was obtained from all participants, and the study data were anonymized.
FIGURE 8 The confusion matrix of the six machine learning models in the training set. (A) Linear discriminant analysis; (B) Logistic regression; (C) Naive Bayes; (D) Random forest; (E) Support vector machine; (F) Extreme gradient boosting.

TABLE 1
Univariate analysis of subclinical atherosclerosis.

TABLE 2
Characteristics of participants in different sets.

TABLE 3
Performance parameters of six machine learning prediction models in the training set.