ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Technology Implementation

Volume 7 - 2025 | doi: 10.3389/fdgth.2025.1574287

This article is part of the Research Topic: Digital Medicine and Artificial Intelligence

Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study

Provisionally accepted
  • 1University of Verona, Verona, Veneto, Italy
  • 2IRCCS Istituto Ortopedico Galeazzi, Unit of Clinical Epidemiology, Milan, Italy
  • 3Department of Orthopaedics, Duke University, Durham, NC, United States
  • 4Department of Medical Sciences, University of Udine, Udine, Italy
  • 5Azienda Sanitaria dell'Alto Adige (ASDAA), Merano, Italy
  • 6Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy

The final, formatted version of the article will be published soon.

Introduction: Artificial intelligence (AI) chatbots, which generate human-like responses based on extensive data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) remains unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.

Methods: We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1.

Results: We found high variability in the text consistency of AI chatbot responses (median range 26% to 68%). Intra-rater reliability ranged from "almost perfect" to "substantial", while inter-rater reliability ranged from "almost perfect" to "moderate". Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.

Conclusions: Despite variable internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since, depending on the chatbot, one-third to two-thirds of the recommendations provided may be inappropriate or misleading.
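To make the reliability and match-rate metrics concrete, the sketch below computes Fleiss' kappa and a CPG match rate for a small set of hypothetical ratings. This is an illustration in Python, not the authors' STATA/MP 16.1 analysis; the three-category coding scheme and the example data are assumptions made for demonstration only.

```python
# Minimal sketch (assumed workflow, not the study's actual code): Fleiss' kappa
# for rater agreement and a simple match rate against CPG recommendations.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a (subjects x categories) table of rating counts.

    Each row holds how many raters assigned that subject to each category;
    every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_subjects, _ = counts.shape
    n_raters = counts[0].sum()
    # Observed agreement: per-subject agreement, averaged over subjects.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement: from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 2 raters coding 9 chatbot answers as
# consistent / inconsistent / unclear relative to the CPG recommendation.
ratings = np.array([
    [2, 0, 0], [2, 0, 0], [0, 2, 0],
    [2, 0, 0], [1, 1, 0], [0, 2, 0],
    [2, 0, 0], [0, 0, 2], [2, 0, 0],
])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")

# Match rate: share of answers whose consensus code is "consistent"
# (column 0). argmax resolves ties toward the first category; the study
# resolved disagreements by discussion, which this simplification skips.
consensus = ratings.argmax(axis=1)
print(f"Match rate: {np.mean(consensus == 0):.0%}")
```

On the example data this yields a kappa of about 0.79 ("substantial" on the commonly used Landis and Koch scale) and a 67% match rate, mirroring the kind of figures reported in the Results.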

Keywords: artificial intelligence, physiotherapy, machine learning, musculoskeletal, natural language processing, orthopaedics, ChatGPT, chatbots

Received: 10 Feb 2025; Accepted: 09 Jun 2025.

Copyright: © 2025 Rossettini, Bargeri, Cook, Guida, Palese, Rodeghiero, Pillastrini, Turolla, Castellini and Gianola. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Lia Rodeghiero, Azienda Sanitaria dell'Alto Adige (ASDAA), Merano, Italy

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.