ORIGINAL RESEARCH article
Front. Public Health
Sec. Digital Public Health
This article is part of the Research Topic: Generative AI and Large Language Models in Medicine: Applications, Challenges, and Opportunities.
Evaluation of the Accuracy of Large Language Models in Answering Bone Cancer-Related Questions
Provisionally accepted
Ganzhou People's Hospital, Ganzhou, China
Abstract

Introduction: Large Language Models (LLMs) excel at understanding medical terminology, parsing unstructured clinical data, and generating contextually relevant insights, and are emerging as transformative healthcare tools. Three leading LLMs (Deepseek, ChatGPT, and Grok) show great potential for medical education, clinical decision-making, and patient care. Bone cancer comprises diverse primary and metastatic tumors, each with distinct diagnostic criteria, treatment pathways, and prognoses. This study therefore assesses the accuracy of Deepseek V3.1, ChatGPT 5, and Grok 4 in answering bone cancer-related questions.

Methods: Fifty-two bone cancer-related questions were developed based on the clinical guidelines for bone cancer released by the National Comprehensive Cancer Network (NCCN) in April 2025. The researchers posed the questions to Deepseek V3.1, ChatGPT 5, and Grok 4 and collected the generated responses; each LLM was queried twice within a one-month period. The responses were independently evaluated and scored by two specialists in bone cancer treatment according to predefined scoring criteria.

Results: Across the answers to the 52 bone cancer-related questions, the proportion of correct responses from Deepseek V3.1, ChatGPT 5, and Grok 4 exceeded 90% in both rounds. No correlation was observed among the LLMs' scores, word counts, and response times. The total scores of Deepseek V3.1, ChatGPT 5, and Grok 4 were 3.75±0.71, 3.81±0.6, and .87±0.51, respectively. The word counts of responses from Deepseek V3.1, ChatGPT 5, and Grok 4 were 546.56±194.49, 367.02±273.18, and 194.16±197.07 words, respectively. The response times of Deepseek V3.1, ChatGPT 5, and Grok 4 were 11.83±3.41, 1.52±0.52, and 42.48±26.89 seconds, respectively. No statistically significant differences in scores were found for any of the LLMs between the two rounds. However, ChatGPT 5 showed a statistically significant difference in word count between the two rounds (360.12 ± 279.89 vs. 373.94 ± 268.86 words).

Conclusion: When answering bone cancer-related questions, Deepseek V3.1, ChatGPT 5, and Grok 4 generally performed well. Specifically, when responding to questions about Ewing sarcoma, ChatGPT 5 and Grok 4 demonstrated higher accuracy than Deepseek V3.1. While each model has its own strengths and limitations, all three show clear potential to enhance medical knowledge and improve healthcare outcomes.
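The abstract does not report which statistical procedures were used for the descriptive statistics, the inter-round comparison, or the correlation analysis. The Python sketch below is purely illustrative of such an analysis: the 1-4 scoring scale, the Wilcoxon signed-rank test, the Spearman correlations, and all data values are assumptions, not details taken from the study.

import numpy as np
from scipy.stats import wilcoxon, spearmanr

rng = np.random.default_rng(seed=0)

# Hypothetical scores (assumed 1-4 scale) for the 52 questions in each of the two rounds,
# plus placeholder word counts and response times; none of these are the study's data.
round1_scores = rng.integers(3, 5, size=52).astype(float)
round2_scores = rng.integers(3, 5, size=52).astype(float)
word_counts = rng.normal(400, 200, size=52).clip(min=50)
response_times = rng.normal(12, 3, size=52).clip(min=1)

# Descriptive statistics in the mean +/- SD form used in the Results.
print(f"Round 1 score: {round1_scores.mean():.2f} +/- {round1_scores.std(ddof=1):.2f}")
print(f"Round 2 score: {round2_scores.mean():.2f} +/- {round2_scores.std(ddof=1):.2f}")

# Paired, non-parametric comparison of scores between the two query rounds
# (assumed test; the abstract does not name the one actually used).
stat, p_round = wilcoxon(round1_scores, round2_scores)
print(f"Wilcoxon signed-rank (round 1 vs. round 2): p = {p_round:.3f}")

# Rank correlations between scores, word counts, and response times.
rho_wc, p_wc = spearmanr(round1_scores, word_counts)
rho_rt, p_rt = spearmanr(round1_scores, response_times)
print(f"Score vs. word count: rho = {rho_wc:.2f}, p = {p_wc:.3f}")
print(f"Score vs. response time: rho = {rho_rt:.2f}, p = {p_rt:.3f}")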
Keywords: large language models, Deepseek V3.1, ChatGPT 5, Grok 4, bone cancer, health information
Received: 22 Nov 2025; Accepted: 25 Nov 2025.
Copyright: © 2025 Pan, Huang, Liu, Lin and Ye. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Shuxi Ye
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
