AUTHOR=Chen Yingxi, Wang Chunyu, Liu Xiaozhu, Duan Minjie, Xiang Tianyu, Huang Haodong TITLE=Machine learning-based coronary heart disease diagnosis model for type 2 diabetes patients JOURNAL=Frontiers in Endocrinology VOLUME=Volume 16 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/endocrinology/articles/10.3389/fendo.2025.1550793 DOI=10.3389/fendo.2025.1550793 ISSN=1664-2392 ABSTRACT=Background: To establish a classification model for assisting the diagnosis of type 2 diabetes mellitus (T2DM) complicated with coronary heart disease (CHD). Methods: Patients with T2DM who underwent coronary angiography (CA) were enrolled from seven affiliated hospitals of Chongqing Medical University. Differences in clinical variables between T2DM patients with and without CHD were tested using univariate analysis. The original data were divided into a training set and a validation set in a 7:3 ratio. The training set data were used to screen features using Logistic regression, Lasso regression, or recursive feature elimination (RFE). Five machine learning algorithms were selected for modeling: Logistic regression, Support Vector Machine (SVM), Random Forest (RF), eXtreme gradient boosting (XgBoost), and Light Gradient Boosting Machine (LightGBM). The performance of the models was verified through 5-fold cross-validation and the training set. Results: Clinical data were collected from 1943 patients with T2DM complicated with CHD and 574 T2DM patients without CHD. Univariate analysis identified 20 optimal risk factors; because four of them had over 30% missing values, 16 risk factors were ultimately included. Logistic regression selected eight features, Lasso regression selected ten, and the RFE method selected eight, fourteen, sixteen, and thirteen features for SVM, RF, XgBoost, and LightGBM, respectively.
Among all models, the XgBoost model based on features selected by RFE+LightGBM demonstrated the best performance, achieving an AUC of 0.814 (95% CI, 0.779-0.847), accuracy of 0.799 (95% CI, 0.771-0.827), precision of 0.841 (95% CI, 0.812-0.868), recall of 0.920 (95% CI, 0.898-0.941), and F1-score of 0.879 (95% CI, 0.859-0.897) in the testing set. Conclusions: Based on T2DM data and machine learning theory, a Bayesian-optimized XgBoost model was established using features selected by the RFE+LightGBM method. This model effectively determines whether T2DM patients have CHD.
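
The metrics reported in the abstract (AUC, accuracy, precision, recall, and F1-score, each with a 95% CI) can be illustrated with a small sketch. The abstract does not say how the confidence intervals were obtained, so the percentile bootstrap used here is an assumption, not the authors' method. The function names and the ten-patient toy labels are hypothetical and have no relation to the study data; only the Python standard library is used.

```python
import random

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = CHD, 0 = no CHD)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_other, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(y_true, y_other) (assumed CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # skip resamples containing only one class
        stats.append(metric(yt, [y_other[i] for i in idx]))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    m = len(stats)
    lower = stats[int((alpha / 2) * m)]
    upper = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper

# Hypothetical toy data: true labels, hard predictions, predicted probabilities.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.4, 0.6, 0.2, 0.3, 0.75]

metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
auc_ci = bootstrap_ci(y_true, scores, auc_score)
recall_ci = bootstrap_ci(
    y_true, y_pred, lambda t, p: classification_metrics(t, p)["recall"]
)
print(metrics, auc, auc_ci, recall_ci)
```

With only ten cases the bootstrap intervals are very wide; on a test set of several hundred patients, as in the study, the same procedure would give much tighter intervals.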