ORIGINAL RESEARCH article
Front. Public Health
Sec. Infectious Diseases: Epidemiology and Prevention
Volume 13 - 2025 | doi: 10.3389/fpubh.2025.1688708
Nine-Year Risk Stratification and Prediction of Helicobacter pylori Infection Using Group-Based Trajectory Modeling and Machine Learning in 35,206 Adults
Provisionally accepted
- 1 First Hospital of Shanxi Medical University, Taiyuan, China
- 2 Shanxi Medical University, Taiyuan, China
Background. Helicobacter pylori (H. pylori) infection remains prevalent in regions such as Shanxi, China, and contributes to gastrointestinal morbidity. Accurately identifying high-risk individuals is essential for effective screening and early intervention.

Methods. We conducted a retrospective longitudinal cohort study of 35,206 adults who underwent repeated annual health checkups with H. pylori testing at a single center from 2016 to 2024. Group-based trajectory modeling (GBTM) was used to identify risk subgroups. Multivariable logistic regression identified predictors of the high-risk trajectory, and alcohol consumption was assessed as an effect modifier. Five machine learning models, including Light Gradient Boosting Machine (LightGBM), Extreme Gradient Boosting (XGBoost), and logistic regression, were trained using a 7:3 training/validation split. Temporal validation (training on 2016–2020 data, validation on 2021–2024 data) assessed generalizability. SHapley Additive exPlanations (SHAP) were used to interpret model predictions. A prediction tool was deployed via R Shiny.

Results. GBTM identified a high-risk group (14.63%) and a low-risk group (85.37%). Protective factors included female sex (OR = 0.042, 95% CI: 0.039–0.046) and unmarried status (OR = 0.092, 95% CI: 0.085–0.099); risk factors included obesity (OR = 1.138, 95% CI: 1.070–1.210), blue-collar occupation (OR = 1.557, 95% CI: 1.454–1.666), and alcohol consumption (OR = 1.277, 95% CI: 1.165–1.401). In subgroup analysis, alcohol consumption interacted with all significant factors (all P < 0.001), with the strongest interaction observed for being married (OR = 8.622, 95% CI: 7.872–9.437). LightGBM achieved AUCs of 0.851 (training) and 0.843 (validation) on the 7:3 split, and 0.863 (training) and 0.831 (validation) under temporal validation. SHAP ranked marital status and sex as the top predictors. The tool is available at: https://prediction-model-for-hp.shinyapps.io/hp_shinyapp-/.

Conclusions. We developed an online, interpretable risk prediction tool with validated accuracy to support precision screening of H. pylori infection.
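
The temporal validation described in the Methods (training on 2016–2020 checkups, validating on 2021–2024 checkups) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record fields `year`, `label`, and `score`, the helper names `temporal_split` and `auc`, and the toy records are all hypothetical, and the AUC is computed by direct pairwise comparison of scores rather than with any particular library.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: checkup year, high-risk label (0/1), model score.
Record = Dict[str, float]


def temporal_split(
    records: Sequence[Record], cutoff_year: int = 2020, year_key: str = "year"
) -> Tuple[List[Record], List[Record]]:
    """Split by calendar time: years <= cutoff_year train, later years validate."""
    train = [r for r in records if r[year_key] <= cutoff_year]
    valid = [r for r in records if r[year_key] > cutoff_year]
    return train, valid


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs in which the positive
    case receives the higher score (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy records for illustration only; these are not data from the study.
toy_records: List[Record] = [
    {"year": 2016, "label": 0, "score": 0.10},
    {"year": 2019, "label": 1, "score": 0.80},
    {"year": 2020, "label": 0, "score": 0.40},
    {"year": 2021, "label": 1, "score": 0.35},
    {"year": 2023, "label": 0, "score": 0.20},
    {"year": 2024, "label": 1, "score": 0.90},
]

train_set, valid_set = temporal_split(toy_records)
valid_auc = auc(
    [int(r["label"]) for r in valid_set],
    [r["score"] for r in valid_set],
)
print(len(train_set), len(valid_set), valid_auc)
```

Unlike a random 7:3 split, a split on calendar year keeps every validation record later in time than every training record, which is why it gives a stricter check of how a model might perform on future checkups.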
Keywords: Helicobacter pylori, machine learning, risk prediction, group-based trajectory modeling, Shapley additive explanations
Received: 30 Aug 2025; Accepted: 22 Oct 2025.
Copyright: © 2025 Zhao, Liu, Wei, Wang, 肖 and Yao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Heping Zhao, zhaoheping360@126.com
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.