ORIGINAL RESEARCH article

Front. Genet.

Sec. Computational Genomics

Integrating GWAS and Machine Learning for Disease Risk Prediction in the Taiwanese Hakka Population

Provisionally accepted
Jing-Hong Xiao1,2, Hsiao-Yen Kang3, Li-Ching Wu1,2,4, Tien Hsu5, Chin-Pyng Wu6, LI JEN SU1,2,7,8*
  • 1National Central University, Taoyuan, Taiwan
  • 2Core Facilities for High Throughput Experimental Analysis, Taoyuan City, Taiwan
  • 3Landseed International Hospital, Department of Family Medicine, Department of Community Medicine, Taoyuan City, Taiwan
  • 4Education and Research Center for Technology Assisted Substance Abuse Prevention and Management, National Central University, Taoyuan City, Taiwan
  • 5Graduate Institute of Biomedical Sciences, China Medical University, Taichung City, Taiwan
  • 6Critical Care Center, Department of Internal Medicine, Landseed International Hospital, Taoyuan City, Taiwan
  • 7Department of Family Medicine, Department of Community Medicine, Taoyuan City, Taiwan
  • 8IHMed IVF Center, Taipei City, Taiwan

The final, formatted version of the article will be published soon.

Genome-wide association studies (GWAS) have identified numerous loci associated with complex diseases, yet their predictive power in small or genetically homogeneous populations remains limited. Integrating machine learning with GWAS offers a path to improve risk prediction and uncover functional variants relevant to precision medicine. DNA samples from Taiwanese Hakka individuals with type 2 diabetes, hypertension, and eye diseases were analyzed. After standard quality control, 295,589 SNPs were retained. Fourteen machine-learning algorithms were evaluated using SNPs selected through traditional GWAS filtering and refined via wrapper-based feature selection with a best-first search algorithm. Model performance was assessed by internal cross-validation and external validation using Taiwan Biobank data, and functional annotation was conducted through GTEx v10 cis-eQTL analysis. Predictive models relying solely on significant GWAS SNPs achieved moderate internal accuracy but limited generalizability. Incorporating feature-selected SNPs markedly improved performance: the Random Forest model achieved accuracies above 88% in cross-validation and above 85% in external validation, confirmed by 1,000× bootstrap resampling. eQTL analysis identified functional associations such as rs12121653-KDM5B and rs12121653-MGAT4EP, implicating pathways involved in metabolic and mitochondrial regulation. These findings demonstrate that integrating GWAS with machine-learning-based feature selection enables the construction of robust, population-specific disease risk models. Given the small sample size of the discovery cohort (n=96), all predictive results should be interpreted as exploratory. We employed stringent cross-validation and 1,000× bootstrap resampling to reduce overfitting, and genomic control metrics (QQ plots and λGC values) were evaluated to confirm the absence of major test-statistic inflation. Independent large-scale validation will still be required.
The approach effectively captures additive and interaction-driven genetic components and provides a scalable framework for applying precision medicine to underrepresented or isolated populations.
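
The genomic control check mentioned in the abstract can be illustrated with a minimal sketch. It assumes the conventional definition of λGC: the median of the observed 1-df chi-square association statistics divided by the median of the null chi-square distribution (about 0.4549). The abstract does not state how the authors computed it, so the function name genomic_inflation and the synthetic p-values below are hypothetical, not the authors' code or data.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Return lambda_GC for an iterable of two-sided p-values in (0, 1].

    Hypothetical helper, not the authors' code: each p-value is mapped back
    to a 1-df chi-square statistic via the squared normal quantile.
    """
    std_normal = NormalDist()
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Synthetic p-values, evenly spaced on (0, 1), standing in for a null GWAS.
n = 1000
null_p = [(i + 0.5) / n for i in range(n)]
# Squaring shifts every p-value toward zero, mimicking inflated statistics.
inflated_p = [p * p for p in null_p]

lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation(inflated_p)
print(f"null lambda_GC: {lambda_null:.3f}")
print(f"inflated lambda_GC: {lambda_inflated:.3f}")
```

A value close to 1 suggests little inflation of the test statistics, whereas a value well above 1 points to population stratification or another source of confounding.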

Keywords: type 2 diabetes, genome-wide association studies, machine learning, algorithmic rules, disease risk prediction

Received: 27 Aug 2025; Accepted: 19 Nov 2025.

Copyright: © 2025 Xiao, Kang, Wu, Hsu, Wu and SU. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: LI JEN SU, sulijen@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.