Voice-Based Prediction of Prediabetes Using Classical Machine Learning Models

Oreskovic, Jessica; Fazli, Ghazal; Varma, Vanita; Malik, Kinza; Kaufman, Jaycee; Fossat, Yan

doi:10.3389/fcdhc.2025.1697769

ORIGINAL RESEARCH article

Front. Clin. Diabetes Healthc.

Sec. Diabetes Innovative Devices

Provisionally accepted

Jessica Oreskovic¹*

Ghazal Fazli²

Vanita Varma³

Kinza Malik³

Jaycee Kaufman¹

Yan Fossat¹

¹Klick Inc, Toronto, Canada
²University of Toronto, Toronto, Canada
³Humber Polytechnic, Toronto, Canada

The final, formatted version of the article will be published soon.

Introduction: Prediabetes is a highly prevalent metabolic condition that significantly increases the risk of developing type 2 diabetes and cardiovascular disease. Despite its clinical importance, over 80% of individuals with prediabetes remain undiagnosed. Voice analysis has emerged as a non-invasive, accessible method for disease screening, with prior work showing promising results in detecting hypertension and type 2 diabetes from acoustic features. This study investigates whether voice-based machine learning models can identify individuals with prediabetes and evaluates the generalizability of these models across populations.

Methods: Participants were recruited from clinical sites in India and a community college in Canada. All participants recorded the same spoken phrase multiple times daily via a mobile app, and glycemic status was assessed using HbA1c levels. Voice recordings were preprocessed to remove silence and trimmed to exclude potentially uninformative sections. A total of 167 acoustic features were extracted from each sample using librosa, SciPy, and Parselmouth. Features were averaged per participant. Sex-specific models were developed under six experimental configurations varying by dataset balance (age/BMI-matched vs. unbalanced) and BMI inclusion. Feature selection was conducted using L1-regularized logistic regression (LASSO), and SMOTE was applied during training to address class imbalance. Twelve machine learning classifiers were evaluated using leave-one-subject-out cross-validation (LOSO-CV) on the India dataset. Final models were tested on a holdout India subset and the independent Canada dataset.

Results: In cross-validation, the best female model (XGBoost, balanced, no BMI) achieved a balanced accuracy of 0.78, and the best male model (Random Forest, balanced, no BMI) achieved 0.68. However, holdout set testing identified different optimal configurations for generalization: the male XGBoost model trained on an unbalanced dataset outperformed the cross-validated model. In the Canada dataset, models failed to generalize effectively, with several configurations unable to correctly identify prediabetic participants.

Discussion: Voice-based prediction models show potential for prediabetes screening in controlled populations, but their performance declines when applied across geographic or demographic boundaries. These findings highlight the need for more diverse training data and population-specific model tuning to support real-world applicability.
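
To make the evaluation terms in the abstract concrete, the following is a minimal illustrative sketch, not the authors' code. It shows three of the steps named in the Methods and Results: averaging features per participant, generating leave-one-subject-out folds, and computing balanced accuracy (the mean of sensitivity and specificity). All function names and the data layout are hypothetical, and the sketch uses only the Python standard library rather than the tooling used in the study.

```python
from statistics import mean


def average_per_participant(samples):
    """Collapse repeated recordings into one mean feature vector per participant.

    `samples` is a list of (participant_id, feature_vector) pairs; this layout
    is an assumption made for the sketch.
    """
    grouped = {}
    for participant_id, features in samples:
        grouped.setdefault(participant_id, []).append(features)
    return {
        pid: [mean(column) for column in zip(*vectors)]
        for pid, vectors in grouped.items()
    }


def loso_folds(participant_ids):
    """Yield (training_ids, held_out_id), holding out each participant once."""
    ids = sorted(set(participant_ids))
    for held_out in ids:
        yield [pid for pid in ids if pid != held_out], held_out


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = prediabetes)."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t != 1 and p != 1)
    return 0.5 * (true_pos / positives + true_neg / negatives)


if __name__ == "__main__":
    # Toy data: two recordings for participant "a", one for participant "b".
    recordings = [("a", [1.0, 2.0]), ("a", [3.0, 4.0]), ("b", [5.0, 6.0])]
    print(average_per_participant(recordings))
    print(list(loso_folds(["a", "b", "c"])))
    # A model that never predicts prediabetes on a 2-vs-4 toy label set.
    print(balanced_accuracy([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]))
```

In this toy case, a classifier that labels every participant as not prediabetic is right on 4 of 6 participants, yet its balanced accuracy is only 0.5. This is why the metric suits imbalanced screening data, and why a model that fails to identify any prediabetic participants cannot score well on it. The numbers here are invented for illustration and are not results from the study.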

Keywords: prediabetes, voice, vocal biomarker, type 2 diabetes, voice signal analysis

Received: 02 Sep 2025; Accepted: 14 Nov 2025.

Copyright: © 2025 Oreskovic, Fazli, Varma, Malik, Kaufman and Fossat. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Jessica Oreskovic, joreskovic@klick.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.