AUTHOR=Yang Guijun , Jiang Hejun , Yuan Shuhua , Tang Mingyu , Zhang Jing , Lin Jilei , Chen Jiande , Yuan Jiajun , Zhao Liebin , Yin Yong TITLE=Evaluating large language models in pediatric fever management: a two-layer study JOURNAL=Frontiers in Digital Health VOLUME=Volume 7 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2025.1610671 DOI=10.3389/fdgth.2025.1610671 ISSN=2673-253X ABSTRACT=Background: Pediatric fever is a prevalent concern, often causing parental anxiety and frequent medical consultations. While large language models (LLMs) such as ChatGPT, Perplexity, and YouChat show promise in enhancing medical communication and education, their efficacy in addressing complex pediatric fever-related questions remains underexplored, particularly from the perspectives of medical professionals and patients’ relatives. Objective: This study aimed to explore the differences and similarities among four common large language models (ChatGPT3.5, ChatGPT4.0, YouChat, and Perplexity) in answering thirty pediatric fever-related questions and to examine how doctors and pediatric patients’ relatives evaluate the LLM-generated answers based on predefined criteria. Methods: The study selected thirty fever-related pediatric questions answered by the four models. Twenty doctors rated these responses across four dimensions. To conduct the survey among pediatric patients’ relatives, we eliminated certain responses that we deemed to pose safety risks or to be misleading. Based on the doctors’ questionnaire, the thirty questions were divided into six groups, each evaluated by twenty pediatric patients’ relatives. The Tukey post-hoc test was used to check for significant differences. 
Some of the pediatric patients’ relatives were revisited for deeper insights into the results. Results: In the doctors’ questionnaire, ChatGPT3.5 and ChatGPT4.0 outperformed YouChat and Perplexity in all dimensions, with no significant difference between ChatGPT3.5 and ChatGPT4.0 or between YouChat and Perplexity. All models scored significantly better in accuracy than in the other dimensions. In the pediatric relatives’ questionnaire, no significant differences were found among the models; revisits revealed some reasons for these results. Conclusions: Internet search integration (YouChat and Perplexity) did not improve the ability of large language models to answer medical questions as expected. Patients’ relatives struggled to understand and analyze the model responses, owing both to their lack of professional knowledge and to the absence of clear central points in the models’ answers. When developing large language models for patient use, it is important to highlight the central points of the answers and ensure they are easily understandable.