AUTHOR=Anibal James, Huth Hannah, Li Ming, Hazen Lindsey, Daoud Veronica, Ebedes Dominique, Lam Yen Minh, Nguyen Hang, Hong Phuc Vo, Kleinman Michael, Ost Shelley, Jackson Christopher, Sprabery Laura, Elangovan Cheran, Krishnaiah Balaji, Akst Lee, Lina Ioan, Elyazar Iqbal, Ekawati Lenny, Jansen Stefan, Nduwayezu Richard, Garcia Charisse, Plum Jeffrey, Brenner Jacqueline, Song Miranda, Ricotta Emily, Clifton David, Thwaites C. Louise, Bensoussan Yael, Wood Bradford
TITLE=Voice EHR: introducing multimodal audio data for health
JOURNAL=Frontiers in Digital Health
VOLUME=Volume 6 - 2024
YEAR=2025
URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2024.1448351
DOI=10.3389/fdgth.2024.1448351
ISSN=2673-253X
ABSTRACT=Introduction: Artificial intelligence (AI) models trained on audio data may be able to rapidly perform clinical tasks, enhancing medical decision-making and potentially improving outcomes through early detection. Existing technologies depend on limited datasets collected with expensive recording equipment in high-income countries, which challenges deployment in resource-constrained, high-volume settings where audio data could have a profound impact on health equity. Methods: This report introduces a novel protocol for audio data collection and a corresponding application that captures health information through guided questions. Results: To demonstrate the potential of Voice EHR as a biomarker of health, this report presents initial experiments on data quality and multiple case studies. Large language models (LLMs) were used to compare transcribed Voice EHR data with data (from the same patients) collected through conventional techniques such as multiple-choice questions. Information contained in the Voice EHR samples was consistently rated as equally or more relevant to a health evaluation. Discussion: The HEAR application facilitates the collection of an audio electronic health record ("Voice EHR") that may contain complex biomarkers of health drawn from conventional voice/respiratory features, speech patterns, and spoken language with semantic meaning and longitudinal context, potentially compensating for the typical limitations of unimodal clinical datasets.