ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1691499

This article is part of the Research Topic: Advancing Healthcare AI: Evaluating Accuracy and Future Directions.

Assessing the Quality of AI-Generated Clinical Notes: Validated Evaluation of a Large Language Model Ambient Scribe

Provisionally accepted
Erin Palm1, Astrit Manikantan1, Srikanth Subramanya Belwadi1, Herprit Mahal1, Mark E. Pepin, MD, PhD1,2*
  • 1Suki AI Inc, Redwood City, United States
  • 2Stanford Cardiovascular Institute, School of Medicine, Stanford University, Stanford, United States

The final, formatted version of the article will be published soon.

Background: Generative artificial intelligence (AI) tools have been rapidly adopted as "ambient scribes" that generate draft clinical notes from patient encounters. Despite this rapid adoption, few studies have systematically gauged the quality of AI-authored documentation against physician standards using validated frameworks.

Objective: This study aimed to compare the quality of large language model (LLM)-generated clinical notes ("Ambient") with physician-authored reference notes ("Gold") across five clinical specialties, using the Physician Documentation Quality Instrument (PDQI-9) as a validated framework to quantify documentation quality.

Methods: We pooled 97 de-identified audio recordings of outpatient clinical encounters spanning general medicine, pediatrics, obstetrics/gynecology, orthopedics, and adult cardiology. For each encounter, two notes were produced from the audio recording and its corresponding transcript: an LLM-generated "Ambient" note and a blinded physician-drafted "Gold" note. Two blinded specialty reviewers independently evaluated each note using a modified PDQI-9 (11 criteria rated on a Likert scale, plus binary hallucination detection). Inter-rater reliability was assessed using the within-group interrater agreement coefficient (RWG). Paired comparisons were conducted using t tests or Mann-Whitney tests.

Results: Paired analysis of the 97 encounters yielded 388 reviews of 194 notes (2 notes per encounter) and revealed high inter-rater agreement (RWG > 0.7) in all specialties except pediatrics and cardiology, which showed moderate concordance. Gold notes achieved higher overall quality (4.25/5 vs 4.20/5, P = 0.04) and higher ratings for accuracy (P = 0.05), succinctness (P < 0.001), and internal consistency (P = 0.004) than Ambient notes. By contrast, Ambient notes scored higher on thoroughness (P < 0.001) and organization (P = 0.03). Hallucinations were identified in 20% of Gold notes and 31% of Ambient notes (P = 0.01). Despite these limitations, overall reviewer preference favored Ambient notes (47% vs 39% for Gold).

Conclusions: LLM-generated Ambient notes demonstrated quality comparable to physician-authored notes across multiple specialties. Ambient notes were more thorough and better organized, though less succinct and more prone to hallucination. The PDQI-9 offers a validated and practical framework for evaluating AI-generated clinical documents, and this quality assessment methodology should inform iterative optimization and standardization of ambient AI scribes in clinical practice.
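The within-group agreement coefficient named in the Methods (RWG) compares observed rating variance against the variance expected under a uniform null distribution. A minimal illustrative sketch follows; the ratings and per-encounter scores below are made-up example values, not the study's data, and this is not the authors' analysis code.

```python
# Illustrative sketch of the RWG agreement statistic and a paired mean
# difference between note types. All numbers are invented examples.
from statistics import mean, variance

def rwg(ratings, n_options=5):
    """r_WG = 1 - (observed variance / uniform-null variance).

    For A response options, the uniform null variance is (A^2 - 1) / 12,
    e.g. 2.0 for a 5-point Likert scale. Values near 1 indicate strong
    within-group agreement; values below ~0.7 are often read as moderate.
    """
    sigma_eu = (n_options ** 2 - 1) / 12
    return 1 - variance(ratings) / sigma_eu

# Two reviewers' overall-quality ratings for one hypothetical note
print(rwg([4, 5]))  # → 0.75

# Hypothetical per-encounter overall scores for paired Gold vs Ambient notes
gold = [4.3, 4.1, 4.4, 4.2]
ambient = [4.2, 4.1, 4.3, 4.1]
diffs = [g - a for g, a in zip(gold, ambient)]
print(round(mean(diffs), 3))  # → 0.075 (mean paired difference)
```

In the study itself, such paired differences were tested with t tests or Mann-Whitney tests depending on distributional assumptions; the sketch only shows the point estimate.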

Keywords: large language models, artificial intelligence, medical scribe, clinical quality improvement, dictation accuracy

Received: 23 Aug 2025; Accepted: 30 Sep 2025.

Copyright: © 2025 Palm, Manikantan, Belwadi, Mahal and Pepin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Mark E. Pepin, MD, PhD, pepinme@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.