ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1618378

Evaluation of Large Language Model-Generated Medical Information on Idiopathic Pulmonary Fibrosis

Provisionally accepted
Iván Cherrez-Ojeda1,2,3*, Björn Christian Frye4, Andreas Hoheisel4, Arturo Cortes-Telles5, Karla Robles-Velasco2,3, Heidegger Mateos Toledo6, Ricardo Figueiredo7, Christopher J. Ryerson8, Gabriela Rodas-Valero2,3, Juan Carlos Calderón2,3
  • 1Institute of Allergology, Charité University Medicine Berlin, Berlin, Germany
  • 2Universidad de Especialidades Espiritu Santo, Samborondon, Ecuador
  • 3Respiralab Research Center, Guayaquil, Guayas, Ecuador
  • 4Clinic of Pneumology, Medical Center - University of Freiburg, Germany; Faculty of Medicine, University of Freiburg, Freiburg, Germany
  • 5Respiratory Diseases Clinic, Regional Hospital of High Specialty of the Yucatan Peninsula, Instituto Mexicano del Seguro Social-Bienestar, Merida, Mexico
  • 6Instituto Nacional de Enfermedades Respiratorias (INER), Distrito Federal, Mexico
  • 7Programa de Pós-Graduação em Saúde Coletiva, Universidade Estadual de Feira de Santana, Feira de Santana, Brazil
  • 8Centre for Heart Lung Innovation and Department of Medicine, University of British Columbia and St Paul's Hospital, Vancouver, Canada

The final, formatted version of the article will be published soon.

Background: Information on Idiopathic Pulmonary Fibrosis (IPF) generated by AI-powered large language models (LLMs) such as ChatGPT-4 and Gemini 1.5 Pro has not been evaluated for quality, reliability, readability, or concordance with clinical guidelines.

Research question: What are the quality, reliability, readability, and guideline concordance of LLM-generated medical content on IPF?

Study design and methods: Responses from ChatGPT-4 and Gemini 1.5 Pro to 23 questions derived from the ATS/ERS/JRS/ALAT IPF guidelines were compared. Six independent raters evaluated responses for quality (DISCERN), reliability (JAMA Benchmark Criteria), readability (Flesch-Kincaid), and guideline concordance (scored 0–4). Descriptive statistics, intraclass correlation coefficients, Wilcoxon signed-rank tests, and effect sizes (r) were calculated. Statistical significance was set at p<0.05.

Results: According to the JAMA Benchmark Criteria, both ChatGPT-4 and Gemini 1.5 Pro provided only partially reliable responses, and readability evaluations showed that both models' output was difficult to understand. Gemini 1.5 Pro provided significantly better treatment information (DISCERN score: 56 versus 43, p<0.001) and showed significantly higher concordance with international IPF guidelines than ChatGPT-4 (median 3.0 [3.0–3.5] vs 3.0 [2.5–3.0], p=0.0029).

Interpretation: Both models offered useful medical insights, but their reliability is limited. Gemini 1.5 Pro provided higher-quality information than ChatGPT-4 and was more concordant with international IPF guidelines. Readability analyses found AI-generated medical information difficult to understand, underscoring the need for refinement.

What is already known on this topic: Recent advances in AI, especially large language models (LLMs) powered by natural language processing (NLP), have transformed the way medical information is retrieved and used.

What this study adds: This study highlights the potential and limitations of ChatGPT-4 and Gemini 1.5 Pro in generating medical information on IPF. Both models provided partially reliable responses; Gemini 1.5 Pro demonstrated superior quality in treatment-related content and greater concordance with clinical guidelines. Nevertheless, neither model produced answers in full concordance with established clinical guidelines, and readability remained a major challenge.

How this study might affect research, practice or policy: These findings highlight the need to refine AI models as LLMs evolve into healthcare reference tools that help clinicians and patients make evidence-based decisions.

Keywords: Idiopathic Pulmonary Fibrosis, artificial intelligence, natural language processing, machine learning, large language models, health information systems, quality of health care, clinical decision-making

Received: 28 Apr 2025; Accepted: 01 Sep 2025.

Copyright: © 2025 Cherrez-Ojeda, Frye, Hoheisel, Cortes-Telles, Robles-Velasco, Toledo, Figueiredo, Ryerson, Rodas-Valero and Calderón. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Iván Cherrez-Ojeda, Institute of Allergology, Charité University Medicine Berlin, Berlin, Germany

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.