AUTHOR=Anibal James, Huth Hannah, Li Ming, Hazen Lindsey, Daoud Veronica, Ebedes Dominique, Lam Yen Minh, Nguyen Hang, Hong Phuc Vo, Kleinman Michael, Ost Shelley, Jackson Christopher, Sprabery Laura, Elangovan Cheran, Krishnaiah Balaji, Akst Lee, Lina Ioan, Elyazar Iqbal, Ekawati Lenny, Jansen Stefan, Nduwayezu Richard, Garcia Charisse, Plum Jeffrey, Brenner Jacqueline, Song Miranda, Ricotta Emily, Clifton David, Thwaites C. Louise, Bensoussan Yael, Wood Bradford
TITLE=Voice EHR: introducing multimodal audio data for health
JOURNAL=Frontiers in Digital Health
VOLUME=Volume 6 - 2024
YEAR=2025
URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2024.1448351
DOI=10.3389/fdgth.2024.1448351
ISSN=2673-253X
ABSTRACT=Introduction: Artificial intelligence (AI) models trained on audio data may be able to rapidly perform clinical tasks, enhancing medical decision-making and potentially improving outcomes through early detection. Existing technologies depend on limited datasets collected with expensive recording equipment in high-income countries, which challenges deployment in resource-constrained, high-volume settings where audio data could have a profound impact on health equity. Methods: This report introduces a novel protocol for audio data collection and a corresponding application that captures health information through guided questions. Results: To demonstrate the potential of Voice EHR as a biomarker of health, this report presents initial experiments on data quality and multiple case studies. Large language models (LLMs) were used to compare transcribed Voice EHR data with data (from the same patients) collected through conventional techniques such as multiple-choice questions. Information contained in the Voice EHR samples was consistently rated as equally or more relevant to a health evaluation. Discussion: The HEAR application facilitates the collection of an audio electronic health record ("Voice EHR") that may contain complex biomarkers of health drawn from conventional voice/respiratory features, speech patterns, and spoken language with semantic meaning and longitudinal context, potentially compensating for the typical limitations of unimodal clinical datasets.