ORIGINAL RESEARCH article

Front. Public Health

Sec. Digital Public Health

This article is part of the Research Topic: Generative AI and Large Language Models in Medicine: Applications, Challenges, and Opportunities.

Evaluation of the Accuracy of Large Language Models in Answering Bone Cancer-Related Questions

Provisionally accepted
  • Ganzhou People's Hospital, Ganzhou, China

The final, formatted version of the article will be published soon.

Abstract

Introduction: Large language models (LLMs) excel at understanding medical terminology, parsing unstructured clinical data, and generating contextually relevant insights, and are emerging as transformative healthcare tools. Three leading LLMs (Deepseek, ChatGPT, and Grok) show great potential for medical education, clinical decision-making, and patient care. Bone cancer encompasses diverse primary and metastatic tumors, each with distinct diagnostic criteria, treatment pathways, and prognoses. Using the current NCCN clinical guideline for bone cancer as the reference standard, this study assesses the accuracy of Deepseek V3.1, ChatGPT 5, and Grok 4 in answering bone cancer-related questions.

Methods: Based on the clinical guideline for bone cancer released by the NCCN in April 2025, 52 bone cancer-related questions were developed. The researchers posed these questions to Deepseek V3.1, ChatGPT 5, and Grok 4 and collected the generated responses; each LLM was queried twice within a one-month period. The responses were independently evaluated and scored by two specialists in bone cancer treatment in accordance with the scoring criteria.

Results: Across the 52 bone cancer-related questions, Deepseek V3.1, ChatGPT 5, and Grok 4 each provided correct responses in both rounds more than 90% of the time. No correlation was observed among the LLMs' scores, word counts, and response times. The total scores of Deepseek V3.1, ChatGPT 5, and Grok 4 were 3.75 ± 0.71, 3.81 ± 0.60, and 3.87 ± 0.51, respectively. Their mean response word counts were 546.56 ± 194.49, 367.02 ± 273.18, and 194.16 ± 197.07 words, and their mean response times were 11.83 ± 3.41, 1.52 ± 0.52, and 42.48 ± 26.89 seconds, respectively. No statistically significant differences in scores were found for any of the LLMs between the two rounds; however, ChatGPT 5 showed a statistically significant difference in word count between the two rounds (360.12 ± 279.89 vs. 373.94 ± 268.86 words).

Conclusion: Deepseek V3.1, ChatGPT 5, and Grok 4 generally performed well when answering bone cancer-related questions. Specifically, when responding to questions about Ewing sarcoma, ChatGPT 5 and Grok 4 demonstrated higher accuracy than Deepseek V3.1. While each model has its own strengths and limitations, their collective potential to enhance medical knowledge and improve healthcare outcomes is undeniable.
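The abstract does not specify which statistical tests were used. As a minimal illustrative sketch only (not the authors' code), the two analyses described, namely the correlation check among score, word count, and response time, and the paired comparison between the two query rounds, could be run with SciPy as below, assuming nonparametric tests are appropriate for ordinal specialist scores. All data here are randomly generated placeholders, not study results.

```python
# Illustrative sketch of the analyses described in the abstract.
# Assumptions: Spearman correlation and Wilcoxon signed-rank test;
# all numbers below are random placeholders, NOT study data.
import numpy as np
from scipy.stats import spearmanr, wilcoxon

rng = np.random.default_rng(0)
n_questions = 52  # one row per guideline-derived question

# Hypothetical per-question measurements for one model
scores = rng.integers(3, 5, size=n_questions).astype(float)  # specialist scores
word_counts = rng.normal(550, 190, size=n_questions)         # response length (words)
response_times = rng.normal(12, 3.4, size=n_questions)       # latency (seconds)

# 1) Correlation among score, word count, and response time
for name, values in [("word count", word_counts), ("response time", response_times)]:
    rho, p = spearmanr(scores, values)
    print(f"score vs {name}: rho={rho:.2f}, p={p:.3f}")

# 2) Paired comparison of the two query rounds (e.g., word counts)
round1 = rng.normal(360, 280, size=n_questions)
round2 = rng.normal(374, 269, size=n_questions)
stat, p = wilcoxon(round1, round2)
print(f"round 1 vs round 2 word count: W={stat:.1f}, p={p:.3f}")
```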

Keywords: large language models, Deepseek V3.1, ChatGPT 5, Grok 4, bone cancer, health information

Received: 22 Nov 2025; Accepted: 25 Nov 2025.

Copyright: © 2025 Pan, Huang, Liu, Lin and Ye. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Shuxi Ye

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.