ORIGINAL RESEARCH article

Front. Aging

Sec. Musculoskeletal Aging

Evaluating the Performance of Large Language Models in Sarcopenia-Related Patient Queries: A Foundational Assessment for Patient-Centered Validation

  • 1. The Chinese University of Hong Kong, Shatin, China

  • 2. The University of Melbourne, Melbourne, Australia

  • 3. Neuroscience Research Australia, Randwick, Australia

  • 4. Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea

  • 5. McGill University, Montreal, Canada

  • 6. Monash University School of Clinical Sciences at Monash Health, Clayton, Australia

  • 7. Department of Orthopedics, Beijing Jishuitan Hospital, Beijing, China

  • 8. Harbin University, Harbin, China

Abstract

Background: Large Language Models (LLMs) have shown promise in clinical applications, but their performance in specialized areas such as sarcopenia remains understudied.

Methods: A panel of sarcopenia clinician researchers developed 20 standardized patient-centered questions across six clinical domains. Each question was input into three LLMs (ChatGPT, DeepSeek, and Gemini), and the responses were anonymized, randomized, and independently assessed by three clinician researchers. Accuracy was graded on a four-point scale ("Poor" to "Excellent"), and comprehensiveness was evaluated for responses rated "Good" or higher using a five-point scale.

Results: All LLMs achieved good performance, with no responses rated "Poor" in any domain. DeepSeek produced the longest and most detailed responses (mean word count: 583.75 ± 71.89) and showed superior performance in "risk factors" and "prognosis." ChatGPT provided the most concise replies (359.5 ± 87.89 words, p = 0.0011) but achieved the highest proportion of "Good" ratings (90%). Gemini excelled in "pathogenesis" and "diagnosis" but received the most critical feedback in "prevention and treatment." Although trends toward performance differences were noted, they did not reach statistical significance. Mean comprehensiveness scores were also similar across models (DeepSeek: 4.017 ± 0.77, Gemini: 3.97 ± 0.88, ChatGPT: 3.953 ± 0.83; p > 0.05).

Conclusions: Despite minor differences in performance across domains, all three LLMs demonstrated acceptable accuracy and comprehensiveness when responding to sarcopenia-related queries. Their comparable results may reflect similarly recent training data and language capabilities. These findings suggest that LLMs could serve as a valuable tool in patient education and care for sarcopenia. This study provides an initial, expert-based assessment of LLM information quality regarding sarcopenia; while the responses demonstrated good accuracy, the evaluation focused on content correctness from a clinical perspective. Future research must complement these findings by directly engaging cohorts of older adults before clinical implementation can be considered. Even then, human oversight remains essential to ensure safe and appropriate assessment and individually tailored advice and management.

Keywords

ChatGPT, DeepSeek, Gemini, large language models, sarcopenia

Received

25 September 2025

Accepted

17 February 2026

Copyright

© 2026 Huang, Kirk, Close, Lim, Duque, Ebeling, Yang, Tian, Chui, Liu, Zhang, Cheung and Wong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ronald Man Yeung Wong

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, the publisher, the editors, or the reviewers. Any product that may be evaluated in this article, or any claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
