ORIGINAL RESEARCH article
Front. Digit. Health
Sec. Connected Health
Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1624786
Diagnostic Efficacy of Large Language Models in The Pediatric Emergency Department: A Pilot Study
Provisionally accepted
- 1Department of Pediatric Emergency, Ospedale Infantile Regina Margherita, Turin, Italy
- 2Department of Public Health and Pediatrics, Università degli Studi di Torino, Turin, Italy
- 3Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
- 4Division of Emergency Medicine and High Dependency Unit, Department of Medical Sciences, Università degli Studi di Torino, Turin, Italy
- 5Links Foundation, Turin, Italy
- 6Department of Clinical and Biological Sciences, Università degli Studi di Torino, Turin, Italy
Background: The Pediatric Emergency Department (PED) faces significant challenges, such as high patient volumes, time-sensitive decisions, and complex diagnoses. Large Language Models (LLMs) have the potential to enhance patient care; however, their effectiveness in supporting the diagnostic process remains uncertain, with studies showing mixed results regarding their impact on clinical reasoning. We aimed to assess the performance of LLM-based chatbots in realistic PED scenarios and to explore their use as diagnostic assistants in the pediatric emergency setting.
Methods: We evaluated the diagnostic effectiveness of 5 LLMs (ChatGPT-4o, Gemini 1.5 Pro, Gemini 1.5 Flash, Llama-3-8B, and ChatGPT-4o mini) compared to 23 physicians (including 10 PED physicians, 6 PED residents, and 7 Emergency Medicine residents). Both LLMs and physicians had to provide one primary diagnosis and two differential diagnoses for 80 real-practice pediatric clinical cases from the PED of a tertiary care Children's Hospital, spanning three levels of diagnostic complexity. The responses from both LLMs and physicians were compared to the final diagnoses assigned at patient discharge; two independent experts evaluated the answers using a five-level accuracy scale. Each physician or LLM received a total score out of 80, based on the sum of all answer points.
Results: The best-performing chatbots were ChatGPT-4o (score: 72.5) and Gemini 1.5 Pro (score: 62.75), with the former performing better (p<0.05) than PED physicians (score: 61.88). Emergency Medicine residents performed worse (score: 43.75) than both the other physicians and the chatbots (p<0.01). The chatbots' performance was inversely proportional to case difficulty, but ChatGPT-4o matched the majority of correct answers even for highly difficult cases.
Discussion: ChatGPT-4o and Gemini 1.5 Pro could be valid tools for ED physicians, supporting clinical decision-making without replacing the physician's judgment. Shared protocols for effective collaboration between AI chatbots and healthcare professionals are needed.
Keywords: artificial intelligence, Chatbot, Diagnostic accuracy, Large Language Model, Pediatric emergency department
Received: 08 May 2025; Accepted: 16 Jun 2025.
Copyright: © 2025 Monte, Barolo, Circhetta, Delmonaco, Castagno, Pivetta, Bergamasco, Franco, Olmo and Bondone. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Emanuele Castagno, Department of Pediatric Emergency, Ospedale Infantile Regina Margherita, Turin, Italy
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.