AUTHOR=Schraut Tobias, Schützenberger Anne, Arias-Vergara Tomás, Kunduk Melda, Echternach Matthias, Dürr Stephan, Werz Julia, Döllinger Michael

TITLE=Machine learning based assessment of hoarseness severity: a multi-sensor approach centered on high-speed videoendoscopy

JOURNAL=Frontiers in Artificial Intelligence

VOLUME=Volume 8 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1601716

DOI=10.3389/frai.2025.1601716

ISSN=2624-8212

ABSTRACT=Introduction: Functional voice disorders are characterized by impaired voice production without primary organic changes, posing challenges for standardized assessment. Current diagnostic methods rely heavily on subjective evaluation, which suffers from inter-rater variability. High-speed videoendoscopy (HSV) offers an objective alternative by capturing true intra-cycle vocal fold behavior. Integrating time-synchronized acoustic and HSV recordings could allow for an objective visual and acoustic assessment of vocal function based on a single HSV examination. This study investigates a machine learning-based approach for hoarseness severity assessment using synchronous HSV and acoustic recordings, alongside conventional voice examinations. Methods: Three databases comprising 457 HSV recordings of the sustained vowel /i/, 634 HSV-synchronized acoustic recordings, and clinical parameters from 923 visits were analyzed. Subjects were classified into two hoarseness groups based on auditory-perceptual ratings, with predicted scores serving as continuous hoarseness severity ratings. A videoendoscopic model was developed by selecting a suitable classification algorithm and a minimal-optimal subset of glottal parameters. This model was compared against an acoustic model based on HSV-synchronized recordings and a clinical model based on parameters from other examinations. Two ensemble models were constructed by combining the HSV-based models and all models, respectively. Model performance was evaluated on a shared test set based on classification accuracy, correlation with subjective ratings, and correlation between predicted and observed changes in hoarseness severity. Results: The videoendoscopic, acoustic, and clinical models achieved correlations of 0.464, 0.512, and 0.638 with subjective hoarseness ratings, respectively. Integrating glottal and acoustic parameters into the HSV-based ensemble model improved correlation to 0.603, confirming the complementary nature of time-synchronized HSV and acoustic recordings. The ensemble model incorporating all modalities achieved the highest correlation of 0.752, underscoring the diagnostic value of multimodal objective assessments. Discussion: This study highlights the potential of synchronous HSV and acoustic recordings for objective hoarseness severity assessment, offering a more comprehensive evaluation of vocal function. While practical challenges remain, the integration of these modalities led to notable improvements, supporting their complementary value in enhancing diagnostic accuracy. Future advancements could include flexible nasal endoscopy to enable more natural phonation and refinement of glottal parameter extraction to improve model robustness under variable recording conditions.