ORIGINAL RESEARCH article
Front. Aging
Sec. Musculoskeletal Aging
Evaluating the Performance of Large Language Models in Sarcopenia-Related Patient Queries: A Foundational Assessment for Patient-Centered Validation
Tao Huang 1
Ben Kirk 2
Jacqueline Close 3
Jae-Young Lim 4
Gustavo Duque 5
Peter Ebeling 6
Minghui Yang 7
Maoyi Tian 8
Chun Sing Chui 1
Chaoran Liu 1
Ning Zhang 1
Winghoi Cheung 1
Ronald Man Yeung Wong 1
1. The Chinese University of Hong Kong, Shatin, China
2. The University of Melbourne, Melbourne, Australia
3. Neuroscience Research Australia, Randwick, Australia
4. Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
5. McGill University, Montreal, Canada
6. Monash University School of Clinical Sciences at Monash Health, Clayton, Australia
7. Beijing Jishuitan Hospital Department of Orthopedics, Beijing, China
8. Harbin University, Harbin, China
Abstract
Background: Large language models (LLMs) have shown promise in clinical applications, but their performance in specialized areas such as sarcopenia remains understudied. Methods: A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into three LLMs (ChatGPT, DeepSeek, and Gemini), and the responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale ("Poor" to "Excellent"), and comprehensiveness was evaluated on a five-point scale for responses rated "Good" or higher. Results: All three LLMs performed well, with no responses rated "Poor" in any domain. DeepSeek produced the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in "risk factors" and "prognosis." ChatGPT provided the most concise replies (359.5 ± 87.89 words, p = 0.0011) but achieved the highest proportion of "Good" ratings (90%). Gemini excelled in "pathogenesis" and "diagnosis" but received the most critical feedback in "prevention and treatment." Although trends toward performance differences were observed, they did not reach statistical significance. Mean comprehensiveness scores were likewise similar across models (DeepSeek: 4.017 ± 0.77; Gemini: 3.97 ± 0.88; ChatGPT: 3.953 ± 0.83; p > 0.05). Conclusions: Despite minor differences across domains, all three LLMs demonstrated acceptable accuracy and comprehensiveness when responding to sarcopenia-related queries; their comparable results may reflect similarly recent training data and language capabilities. These findings suggest that LLMs could serve as a valuable tool in patient education and care for sarcopenia. This study provides an initial, expert-based assessment of the quality of LLM-generated information on sarcopenia; while the responses demonstrated good accuracy, the evaluation addresses content correctness from a clinical perspective only.
Future research should complement these findings by directly engaging cohorts of older adults before clinical implementation is considered. Moreover, human oversight remains essential to ensure safe, appropriate assessment and individually tailored advice and management.
Keywords
ChatGPT, DeepSeek, Gemini, large language models, sarcopenia
Received
25 September 2025
Accepted
17 February 2026
Copyright
© 2026 Huang, Kirk, Close, Lim, Duque, Ebeling, Yang, Tian, Chui, Liu, Zhang, Cheung and Wong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Ronald Man Yeung Wong
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.