
ORIGINAL RESEARCH article

Front. Oncol.

Sec. Gastrointestinal Cancers: Colorectal Cancer

Volume 15 - 2025 | doi: 10.3389/fonc.2025.1575844

This article is part of the Research Topic: Advanced Machine Learning Techniques in Cancer Prognosis and Screening

Population-Based Colorectal Cancer Risk Prediction Using a SHAP-Enhanced LightGBM Model

Provisionally accepted
Guinian Du¹, Hui Lv¹, Yishan Liang¹, Jingyue Zhang¹, Qiaoling Huang¹, Guiming Xie¹, Xian Wu¹, Hao Zeng¹, Lijuan Wu¹, Jianbo Ye¹, Wentan Xie¹*, Xia Li²*, Yifan Sun¹*
  • ¹ Eighth Affiliated Hospital of Guangxi Medical University, Guigang, China
  • ² Liuzhou Liutie Central Hospital, Liuzhou, Guangxi Zhuang Region, China

The final, formatted version of the article will be published soon.

Background: Colorectal cancer (CRC) is among the most common cancers worldwide, and early detection and risk stratification play a critical role in reducing both incidence and mortality. We aimed to develop and validate a machine learning (ML) model using clinical data to improve CRC identification and prognostic evaluation.

Methods: We analyzed multicenter datasets comprising 676 CRC patients and 410 controls from Guigang City People's Hospital (2020-2024) for model training and internal validation, with 463 patients from Laibin City People's Hospital for external validation. Seven ML algorithms were systematically compared, and Light Gradient Boosting Machine (LightGBM) was selected as the best-performing framework. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC), calibration curves, Brier scores, and decision curve analysis. SHAP (SHapley Additive exPlanations) was used for feature interpretation.

Results: The LightGBM model demonstrated excellent discrimination, with AUROCs of 0.9931 (95% CI: 0.9883-0.998) in the training cohort and 0.9429 (95% CI: 0.9176-0.9682) in external validation. Calibration curves showed strong agreement between predicted and observed outcomes (Brier score = 0.139). SHAP analysis identified 13 key predictors, with age (mean SHAP value = 0.216) among the dominant contributors. Other significant variables included hematological parameters (WBC, RBC, HGB, PLT), biochemical markers (ALT, TP, ALB, UREA, uric acid), and gender. A web-based risk calculator was developed for real-time probability estimation in clinical use.

Conclusions: Our LightGBM-based model achieves high predictive accuracy while maintaining clinical interpretability, bridging the gap between complex ML systems and practical clinical decision-making. The identified biomarker panel provides biological insights into CRC pathogenesis. This tool shows significant potential for optimizing early diagnosis and personalized risk assessment in CRC management.
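The evaluation metrics named in the Methods (AUROC, Brier score, and the net benefit that underlies decision curve analysis) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' code: the function names `auroc`, `brier_score`, and `net_benefit`, the toy labels and probabilities, and the 0.3 decision threshold are all assumptions made for illustration only.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC as the probability that a random case is scored above a
    random control (ties count as one half)."""
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome
    (0 is perfect; lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit at one probability threshold, the quantity plotted
    against threshold in decision curve analysis."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Hypothetical toy data (1 = CRC case, 0 = control); not from the study.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

print("AUROC:", auroc(y_true, y_prob))
print("Brier score:", brier_score(y_true, y_prob))
print("Net benefit at 0.3:", net_benefit(y_true, y_prob, 0.3))
```

The pairwise AUROC loop is quadratic in sample size, which is acceptable for a sketch; an established library routine would be the usual choice for cohorts of the size reported here.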

Keywords: colorectal cancer, risk prediction, machine learning, LightGBM model, early diagnosis

Received: 13 Feb 2025; Accepted: 30 Jun 2025.

Copyright: © 2025 Du, Lv, Liang, Zhang, Huang, Xie, Wu, Zeng, Wu, Ye, Xie, Li and Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Wentan Xie, Eighth Affiliated Hospital of Guangxi Medical University, Guigang, China
Xia Li, Liuzhou Liutie Central Hospital, Liuzhou, Guangxi Zhuang Region, China
Yifan Sun, Eighth Affiliated Hospital of Guangxi Medical University, Guigang, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.