ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1618378
Evaluation of Large Language Model-Generated Medical Information on Idiopathic Pulmonary Fibrosis
Provisionally accepted
- 1 Institute of Allergology, Charité University Medicine Berlin, Berlin, Germany
- 2 Universidad de Especialidades Espiritu Santo, Samborondon, Ecuador
- 3 Respiralab Research Center, Guayaquil, Guayas, Ecuador
- 4 Clinic of Pneumology, Medical Center - University of Freiburg, Germany; Faculty of Medicine, University of Freiburg, Freiburg, Germany
- 5 Respiratory Diseases Clinic, Regional Hospital of High Specialty of the Yucatan Peninsula, Instituto Mexicano del Seguro Social-Bienestar, Merida, Mexico
- 6 Instituto Nacional de Enfermedades Respiratorias (INER), Distrito Federal, Mexico
- 7 Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana, Feira de Santana, Brazil
- 8 Centre for Heart Lung Innovation and Department of Medicine, University of British Columbia and St Paul's Hospital, Vancouver, Canada
Background: The quality, reliability, readability, and concordance with clinical guidelines of Idiopathic Pulmonary Fibrosis (IPF) information generated by AI-powered large language models (LLMs) such as ChatGPT-4 and Gemini 1.5 Pro remain unexplored.
Research question: What are the quality, reliability, readability, and guideline concordance of LLM-generated medical and clinical content on IPF?
Study design and methods: Responses of ChatGPT-4 and Gemini 1.5 Pro to 23 questions derived from the ATS/ERS/JRS/ALAT IPF guidelines were compared. Six independent raters evaluated the responses for quality (DISCERN), reliability (JAMA Benchmark Criteria), readability (Flesch-Kincaid), and guideline concordance (0–4 scale). Descriptive statistics, the intraclass correlation coefficient, the Wilcoxon signed-rank test, and effect sizes (r) were calculated. Statistical significance was set at p < 0.05.
Results: According to the JAMA Benchmark Criteria, both ChatGPT-4 and Gemini 1.5 Pro provided partially reliable responses; however, readability analyses showed that the output of both models was difficult to understand. Gemini 1.5 Pro provided significantly better treatment information (DISCERN score: 56 versus 43, p < 0.001) and showed significantly higher concordance with international IPF guidelines than ChatGPT-4 (median 3.0 [3.0–3.5] vs 3.0 [2.5–3.0], p = 0.0029).
Interpretation: Both models offered useful medical insights, but their reliability is limited. Gemini 1.5 Pro provided higher-quality information than ChatGPT-4 and was more concordant with international IPF guidelines. Readability analyses found AI-generated medical information difficult to understand, underscoring the need for refinement.
What is already known on this topic: Recent advances in AI, especially large language models (LLMs) powered by natural language processing (NLP), have transformed the way medical information is retrieved and used.
What this study adds: This study highlights the potential and limitations of ChatGPT-4 and Gemini 1.5 Pro in generating medical information on IPF. Both models provided partially reliable responses; however, Gemini 1.5 Pro demonstrated superior quality in treatment-related content and greater concordance with clinical guidelines. Nevertheless, neither model answered in full concordance with established clinical guidelines, and readability remained a major challenge.
How this study might affect research, practice or policy: These findings highlight the need to refine AI models as LLMs evolve into healthcare reference tools that help clinicians and patients make evidence-based decisions.
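To make the two quantitative methods named above concrete, the following is a minimal sketch (not the authors' published code, which is not available) of a Flesch-Kincaid grade-level calculation and a paired Wilcoxon signed-rank test with the effect size r = Z/√N. The syllable counter is a rough vowel-group heuristic, and all function names and score values are illustrative assumptions.

```python
import re
import math
from scipy.stats import wilcoxon

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (min. 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def wilcoxon_effect_size(x, y):
    """Paired Wilcoxon signed-rank test; effect size r = |Z| / sqrt(N),
    with Z from the normal approximation (N = number of pairs, no tie
    correction)."""
    res = wilcoxon(x, y)
    n = len(x)
    mean_w = n * (n + 1) / 4                        # E[W] under H0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # SD[W] under H0
    z = (res.statistic - mean_w) / sd_w
    return res.pvalue, abs(z) / math.sqrt(n)

# Illustrative per-question DISCERN scores for the two models (made-up data).
gpt4 = [43, 41, 45, 40, 44, 42, 46, 43, 41, 44]
gemini = [56, 54, 57, 53, 58, 55, 56, 57, 54, 55]
p, r = wilcoxon_effect_size(gpt4, gemini)
print(f"Grade level: {flesch_kincaid_grade('The lungs scar over time.'):.1f}")
print(f"Wilcoxon p = {p:.4f}, effect size r = {r:.2f}")
```

A higher Flesch-Kincaid grade level means harder text; patient-facing material is commonly targeted at roughly a sixth- to eighth-grade level, which is the benchmark against which LLM output is typically judged difficult.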
Keywords: Idiopathic Pulmonary Fibrosis, artificial intelligence, natural language processing, machine learning, large language models, health information systems, quality of health care, clinical decision-making
Received: 28 Apr 2025; Accepted: 01 Sep 2025.
Copyright: © 2025 Cherrez-Ojeda, Frye, Hoheisel, Cortes-Telles, Robles-Velasco, Toledo, Figueiredo, Ryerson, Rodas-Valero and Calderón. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Iván Cherrez-Ojeda, Institute of Allergology, Charité University Medicine Berlin, Berlin, Germany
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.