BRIEF RESEARCH REPORT article
Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 8 - 2025 | doi: 10.3389/frai.2025.1644093
This article is part of the Research Topic: The Applications of AI Techniques in Medical Data Processing.
Limitations of Broadly Trained LLMs in Interpreting Orthopedic Walch Glenoid Classifications
Provisionally accepted
Penn State Health Milton S Hershey Medical Center, Hershey, United States
Artificial intelligence (AI) integration in medical practice has grown substantially, with physician use nearly doubling from 38% in 2023 to 68% in 2024. Recent advances in large language models (LLMs) include multimodal inputs, showing potential for medical image interpretation and clinical software integration. This study evaluated the accuracy of two popular LLMs, Claude 3.5 Sonnet and DeepSeek R1, in interpreting glenoid diagrams using the Walch glenoid classification for preoperative shoulder reconstruction applications.

Test images comprised seven black-and-white Walch glenoid diagrams from Radiopaedia. The LLMs were accessed via Perplexity.ai without specialized medical training and were tested across multiple conversation threads with prompt instructions of varying length, ranging from 22-864 words for DeepSeek and 127-840 words for Claude.

Performance differed significantly between models. DeepSeek achieved 44% accuracy (7/16), while Claude had 0% accuracy (0/16). DeepSeek showed a mild positive correlation between instruction length and response accuracy. Common errors across both LLMs included misclassifying A2 as either A1 (32%) or B2 (20%).

These results reveal limitations in LLMs' ability to interpret even simplified medical diagrams. DeepSeek, with its continuous learning feature and open-source dataset integration, exhibited superior accuracy, although still insufficient for clinical applications. These limitations stem from LLM training data containing primarily text rather than medical images, creating pattern recognition deficiencies when interpreting visual medical information. Despite AI's growing adoption in healthcare, this study concludes that as of February 2025, publicly available broadly trained LLMs lack the consistency and accuracy necessary for reliable medical image interpretation, emphasizing the need for specialized training before clinical implementation.
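The evaluation described above reduces to tallying per-trial accuracy, a confusion count over misclassified Walch types, and a correlation between prompt length and correctness. A minimal sketch of that bookkeeping is shown below; the trial records are purely illustrative placeholders, not the study's actual data, and the function names are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical trial records: (prompt_word_count, true_label, predicted_label).
# Illustrative values only -- not the data reported in the study.
trials = [
    (22,  "A2", "A1"),
    (127, "A2", "B2"),
    (300, "B1", "B1"),
    (450, "A1", "A1"),
    (600, "B2", "B2"),
    (864, "C",  "C"),
]

def accuracy(records):
    """Fraction of trials where the model's label matches ground truth."""
    correct = sum(1 for _, truth, pred in records if truth == pred)
    return correct / len(records)

def misclassification_counts(records):
    """Tally (true -> predicted) pairs for the incorrect trials only."""
    return Counter((t, p) for _, t, p in records if t != p)

def pearson(xs, ys):
    """Plain Pearson correlation, e.g. prompt length vs. correctness (0/1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

lengths = [w for w, _, _ in trials]
correct = [1.0 if t == p else 0.0 for _, t, p in trials]
print(f"accuracy: {accuracy(trials):.0%}")
print("errors:", dict(misclassification_counts(trials)))
print(f"length/accuracy correlation: {pearson(lengths, correct):+.2f}")
```

With records like these, a "mild positive correlation" between instruction length and accuracy would appear as a modest positive Pearson coefficient, and the A2-to-A1 and A2-to-B2 error rates fall directly out of the confusion tally.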
Keywords: Claude 3.5-Sonnet, Orthopaedic surgery, DeepSeek R1, Walch glenoid morphology, large language model (LLM), Shoulder osteoarthritis, Walch glenoid type
Received: 09 Jun 2025; Accepted: 24 Jul 2025.
Copyright: © 2025 ElSayed and Updegrove. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Gary F. Updegrove, Penn State Health Milton S Hershey Medical Center, Hershey, United States
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.