ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1601716
Machine learning based assessment of hoarseness severity: A multisensor approach centered on high-speed videoendoscopy
Provisionally accepted
- 1University Hospital Erlangen, Erlangen, Germany
- 2University of Erlangen Nuremberg, Erlangen, Bavaria, Germany
- 3Louisiana State University, Baton Rouge, Louisiana, United States
- 4LMU Munich University Hospital, Munich, Bavaria, Germany
- 5University Medical Center Regensburg, Regensburg, Bavaria, Germany
Introduction: Functional voice disorders are characterized by impaired voice production without primary organic changes, which makes standardized assessment difficult. Current diagnostic methods rely heavily on subjective evaluation and therefore suffer from inter-rater variability. High-speed videoendoscopy (HSV) offers an objective alternative by capturing true intra-cycle vocal fold behavior. Integrating time-synchronized acoustic and HSV recordings could allow an objective visual and acoustic assessment of vocal function from a single HSV examination. This study investigates a machine learning-based approach to hoarseness severity assessment that uses synchronous HSV and acoustic recordings alongside conventional voice examinations.
Methods: Three databases comprising 457 HSV recordings of the sustained vowel /i/, 634 HSV-synchronized acoustic recordings, and clinical parameters from 923 visits were analyzed. Subjects were classified into two hoarseness groups based on auditory-perceptual ratings, with the predicted scores serving as continuous hoarseness severity ratings. A videoendoscopic model was developed by selecting a suitable classification algorithm and a minimal-optimal subset of glottal parameters. This model was compared against an acoustic model based on the HSV-synchronized recordings and a clinical model based on parameters from other examinations. Two ensemble models were constructed: one combining the two HSV-based models (videoendoscopic and acoustic) and one combining all three models. Model performance was evaluated on a shared test set using classification accuracy, correlation with subjective ratings, and correlation between predicted and observed changes in hoarseness severity.
Results: The videoendoscopic, acoustic, and clinical models achieved correlations of 0.464, 0.512, and 0.638, respectively, with subjective hoarseness ratings. Integrating glottal and acoustic parameters into the HSV-based ensemble model improved the correlation to 0.603, confirming the complementary nature of time-synchronized HSV and acoustic recordings. The ensemble model incorporating all modalities achieved the highest correlation of 0.752, underscoring the diagnostic value of multimodal objective assessments.
Discussion: This study highlights the potential of synchronous HSV and acoustic recordings for objective hoarseness severity assessment, offering a more comprehensive evaluation of vocal function. While practical challenges remain, integrating these modalities led to notable improvements, supporting their complementary value in enhancing diagnostic accuracy. Future advancements could include flexible nasal endoscopy to enable more natural phonation and refined glottal parameter extraction to improve model robustness under variable recording conditions.
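The ensemble and evaluation steps named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state how the individual models are combined, so simple averaging of per-recording scores is an assumption here, as are the 0.5 decision threshold, the Pearson coefficient as the correlation measure, all function names, and the synthetic data, which merely stand in for the study's recordings and ratings.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ensemble_scores(*model_scores):
    """Average per-recording scores across models (assumed combination rule)."""
    return [sum(s) / len(s) for s in zip(*model_scores)]


def accuracy(scores, labels, threshold=0.5):
    """Fraction of recordings whose thresholded score matches the binary label."""
    hits = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(labels)


def change_correlation(pred_before, pred_after, obs_before, obs_after):
    """Correlation between predicted and observed changes across two visits."""
    pred_delta = [a - b for a, b in zip(pred_after, pred_before)]
    obs_delta = [a - b for a, b in zip(obs_after, obs_before)]
    return pearson(pred_delta, obs_delta)


# Synthetic stand-in data (hypothetical, not from the study).
rng = random.Random(0)
ratings = [rng.uniform(0, 3) for _ in range(200)]   # perceptual ratings on an assumed 0-3 scale
labels = [int(r >= 1.5) for r in ratings]           # two hoarseness groups


def noisy_model(noise):
    """Simulate a model whose score tracks the rating with Gaussian noise."""
    return [min(1.0, max(0.0, r / 3 + rng.gauss(0, noise))) for r in ratings]


video, acoustic, clinical = noisy_model(0.35), noisy_model(0.30), noisy_model(0.25)
hsv_ensemble = ensemble_scores(video, acoustic)
full_ensemble = ensemble_scores(video, acoustic, clinical)

for name, scores in [("videoendoscopic", video), ("acoustic", acoustic),
                     ("clinical", clinical), ("HSV ensemble", hsv_ensemble),
                     ("full ensemble", full_ensemble)]:
    print(f"{name:16s} r={pearson(scores, ratings):.3f} "
          f"acc={accuracy(scores, labels):.3f}")
```

Because the simulated models carry independent noise, averaging their scores tends to raise the correlation with the ratings, which mirrors the direction of the reported result but says nothing about its size on real data.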
Keywords: machine learning, deep learning, high-speed videoendoscopy, voice disorders, hoarseness, image processing, signal processing, feature selection
Received: 28 Mar 2025; Accepted: 12 May 2025.
Copyright: © 2025 Schraut, Schützenberger, Arias-Vergara, Kunduk, Echternach, Dürr, Werz and Döllinger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Tobias Schraut, University Hospital Erlangen, Erlangen, Germany
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.