ORIGINAL RESEARCH article

Front. Cardiovasc. Med.

Sec. General Cardiovascular Medicine

Volume 12 - 2025 | doi: 10.3389/fcvm.2025.1559831

An Explainable Machine Learning Model for Predicting Chronic Coronary Disease and Identifying Valuable Text Features

Provisionally accepted
Weipeng Gan1, Peipei Wang2, Xiangrong Xie3, Lingfei Yang1, Dasheng Lu1, Sheng Ye1, Mingquan Ye2*
  • 1Second Affiliated Hospital of Wannan Medical College, Wuhu, China
  • 2Wannan Medical College, Wuhu, Anhui Province, China
  • 3First Affiliated Hospital of Wannan Medical College, Wuhu, Anhui Province, China

The final, formatted version of the article will be published soon.

Background: Chronic Coronary Disease (CCD) is a leading global cause of morbidity and mortality. Existing Pre-test Probability (PTP) models rely mainly on in-hospital data and clinician judgment. This study aims to construct machine learning (ML) models that predict CCD from easily accessible text data and baseline characteristics, and to evaluate the contribution of text data to the diagnostic model.

Methods: Chief complaints, history of present illness, past medical history and vital signs were collected for patients from the internal medicine departments of the First Affiliated Hospital and the Second Affiliated Hospital of Wannan Medical College. The text data were structured using text mining. A customized "stop words" list and a "custom dictionary" for cardiovascular medicine were created to optimize text processing. ML algorithms were then used to build CCD prediction models. Finally, the Shapley additive explanation (SHAP) algorithm was used to interpret the models.

Results: A total of 21,876 patients were enrolled, with 7,449 in the CCD group and 14,406 in the non-CCD group. Patients in the CCD group were generally older, and a higher proportion were male. After feature engineering, a Random Forest model was constructed. The model achieved an area under the curve (AUC) of 0.93 (95% CI, 0.93-0.94) and performed well in comparison with the other models. The SHAP algorithm identified features that are important for CCD judgment, including text features such as "chest pain" and "chest tightness" and structured features such as age. An illustration of how these features influenced the model's decision-making process is also provided.
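The text-structuring step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary entries, stop words, example notes and the function names `tokenize` and `binary_features` are all hypothetical, and the sketch uses English phrases with a greedy longest-match rule purely to show how a custom dictionary keeps terms such as "chest pain" intact while a stop-word list discards uninformative tokens.

```python
import re

# Hypothetical entries for illustration only; the study's actual
# custom dictionary and stop-word list are not given in the abstract.
CUSTOM_DICTIONARY = {"chest pain", "chest tightness", "shortness of breath"}
STOP_WORDS = {"the", "a", "of", "for", "with", "and", "patient", "reports", "days"}

# Longest dictionary phrase, measured in words.
MAX_PHRASE_LEN = max(len(p.split()) for p in CUSTOM_DICTIONARY)


def tokenize(text):
    """Split free text into tokens, keeping dictionary phrases as single tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        # Greedy longest match against the custom dictionary (phrases of 2+ words).
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in CUSTOM_DICTIONARY:
                tokens.append(phrase)
                i += n
                break
        else:
            # No phrase starts here: keep the single word unless it is a stop word.
            if words[i] not in STOP_WORDS:
                tokens.append(words[i])
            i += 1
    return tokens


def binary_features(notes):
    """Return a sorted vocabulary and one 0/1 presence row per note."""
    tokenized = [set(tokenize(note)) for note in notes]
    vocab = sorted(set().union(*tokenized))
    rows = [[int(term in toks) for term in vocab] for toks in tokenized]
    return vocab, rows


if __name__ == "__main__":
    notes = [
        "Patient reports chest pain for 3 days",
        "Chest tightness and shortness of breath",
        "Cough with fever",
    ]
    vocab, rows = binary_features(notes)
    print(vocab)
    for row in rows:
        print(row)
```

The resulting 0/1 rows could be joined with structured variables such as age and vital signs before model fitting. The sketch does not handle negation (for example "no chest pain"), which a real clinical pipeline would need to address.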

Keywords: Chronic coronary disease, Pre-test probability, text mining, machine learning, early diagnosis

Received: 11 Feb 2025; Accepted: 29 Aug 2025.

Copyright: © 2025 Gan, Wang, Xie, Yang, Lu, Ye and Ye. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Mingquan Ye, Wannan Medical College, Wuhu, 241002, Anhui Province, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.