ChatGPT4o's theranostic performance in the management of thoracolumbar spine fractures

Jia, Xuehai; Ma, Litai; Yang, Yi; Deng, Yi; Shen, Changyong; Zhang, Kerui; Li, Ya

doi:10.3389/fsurg.2025.1524396

OPINION article

Front. Surg., 25 February 2025

Sec. Orthopedic Surgery

Volume 12 - 2025 | https://doi.org/10.3389/fsurg.2025.1524396

Xuehai Jia¹

Litai Ma¹*

Yi Yang¹

Yi Deng¹

Changyong Shen¹

Kerui Zhang¹

Ya Li²

¹Department of Orthopedics, Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, Sichuan, China
²The First Affiliated Hospital of Shihezi University, Shihezi, Xinjiang Uyghur, China

Introduction

ChatGPT, developed by OpenAI (https://chat.openai.com), is a publicly accessible tool that uses machine learning algorithms to process and analyze large volumes of data and generate responses to user inquiries. On May 13, 2024, OpenAI launched the ChatGPT4o model, which the OpenAI website describes as its latest, fastest, and most advanced version. This model supports a context length of up to 128k tokens (roughly the length of a long novel) and offers multimodal capabilities, including text and image inputs, as well as text, image, and audio outputs (https://help.openai.com). While numerous studies have explored ChatGPT's potential applications and challenges in the biomedical field (1, 2), little research has addressed the specific capabilities of ChatGPT4o in the medical domain. A review article (3) published in Frontiers in Surgery notes that ChatGPT lacks sufficient expertise and background understanding in specialized fields. ChatGPT4o may have the potential to change this situation. To test this, we investigated the theranostic performance of ChatGPT4o in managing thoracolumbar spine fractures and assessed its potential effectiveness and applications in clinical practice.

Method

For our evaluation, we formulated 38 clinical questions based on the diagnostic, treatment, and management guidelines for thoracolumbar fractures established by the Congress of Neurological Surgeons (CNS) (4–14) and the Chinese Medical Association (CMA) (15). We entered all 38 questions into ChatGPT4o (OpenAI, accessed November 3, 2024) without providing additional context or guidelines. Each question was posed once, and the first response generated was recorded. To minimize variability, no iterative refinement of prompts was performed. The responses were anonymized and compiled in Supplementary Material S1. Each response was then reviewed by three independent spine surgery experts, who evaluated it against both the established guidelines and their own clinical experience. Each expert rated the responses on a five-point Likert scale: 1, completely incorrect; 2, more incorrect than correct; 3, an equal mix of correct and incorrect; 4, more correct than incorrect; and 5, completely correct. The median of the three experts' scores was used as the final rating to minimize bias.
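
The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual procedure or software: the function name `final_rating`, the label dictionary, and the example scores are hypothetical.

```python
from statistics import median

# Five-point Likert scale as defined in the Method section.
LIKERT_LABELS = {
    1: "completely incorrect",
    2: "more incorrect than correct",
    3: "an equal mix of correct and incorrect",
    4: "more correct than incorrect",
    5: "completely correct",
}


def final_rating(expert_scores):
    """Return the median of three expert Likert scores (hypothetical helper)."""
    if len(expert_scores) != 3:
        raise ValueError("expected exactly three expert scores")
    for score in expert_scores:
        if score not in LIKERT_LABELS:
            raise ValueError(f"invalid Likert score: {score!r}")
    # With three integer scores the median is always the middle value,
    # so the result is itself a valid Likert score.
    return int(median(expert_scores))


# Made-up ratings for a single response, for illustration only.
example = final_rating([5, 4, 2])
print(example, "=", LIKERT_LABELS[example])
```

Taking the median of three raters means that a single outlying score, high or low, cannot move the final rating on its own.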

Result

When ChatGPT4o was presented with “yes or no” questions, it typically responded with comprehensive diagnostic criteria and therapeutic principles rather than a simple “yes” or “no.” According to our results (Table 1), no response (0%) received a score of 1, 1 response (2.63%) received a score of 2, 1 response (2.63%) received a score of 3, 8 responses (21.05%) received a score of 4, and 28 responses (73.68%) received a score of 5. Overall, 36 of the 38 responses (approximately 94.7%) scored 4 or 5, that is, they were largely or entirely accurate.

Table 1. Five-point Likert scores for responses to inquiries posed to ChatGPT4o.
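
As a quick arithmetic check of the reported distribution, the following Python sketch recomputes the percentages from the score counts stated in the text. The list of final ratings is rebuilt from those aggregate counts only (the per-question ratings are not reproduced here), and the helper name `score_distribution` is hypothetical.

```python
from collections import Counter


def score_distribution(final_scores):
    """Return {score: (count, percent)} for Likert scores 1 to 5."""
    counts = Counter(final_scores)
    total = len(final_scores)
    return {
        score: (counts[score], round(100 * counts[score] / total, 2))
        for score in range(1, 6)
    }


# Rebuilt from the counts reported in the text; the order is irrelevant here.
reported_scores = [2] * 1 + [3] * 1 + [4] * 8 + [5] * 28

distribution = score_distribution(reported_scores)

# Share of responses rated 4 or 5 ("largely or entirely accurate").
high_share = round(
    100 * sum(1 for s in reported_scores if s >= 4) / len(reported_scores), 1
)

for score, (count, percent) in distribution.items():
    print(f"score {score}: {count} responses ({percent}%)")
print(f"scores 4 or 5: {high_share}%")
```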

Discussion

When asked, “Does the choice of surgical approach (anterior, posterior, or combined anterior-posterior) improve clinical outcomes in patients with thoracic and lumbar fractures?”, ChatGPT4o provided an affirmative answer along with detailed explanations. However, according to CNS guidelines, for patients with burst fractures of the thoracolumbar spine, surgeons may use an anterior, posterior, or combined approach, as the choice of approach does not significantly affect clinical or neurological outcomes (a Grade B recommendation). ChatGPT4o explained the indications for each approach in detail, and the experts judged the response generally accurate, but its final conclusion was not entirely consistent with the guideline recommendations. Furthermore, while ChatGPT4o appears capable of conducting targeted searches on open websites, its “independent reasoning” abilities require further refinement.

In summary, ChatGPT4o demonstrates promising performance in answering questions on the diagnosis and treatment of thoracolumbar trauma. Its ability to search open websites and provide detailed responses could make it a useful reference for clinical practitioners. However, ChatGPT4o does not consistently provide fully accurate answers, particularly to “yes or no” questions. Its dependence on specific sources for data retrieval may introduce biases that limit its broader application in the field of spine surgery. ChatGPT requires substantial medical data for further training to enhance model performance. Moreover, given the specific ethical considerations in medicine, the use of ChatGPT4o in clinical settings must ensure patient safety, data privacy, ethical standards, and adherence to relevant AI regulations. Although ChatGPT4o's responses may improve clinical efficiency, it should serve only as a clinical assistant, with spine surgeons validating the accuracy of its information.

This study has several methodological limitations. First, the lack of comparative analyses with established AI systems (e.g., Google Med-PaLM, IBM Watson) or traditional decision-support tools hinders definitive performance benchmarking. Second, simulated testing environments may overestimate system efficacy, so any degradation of diagnostic performance in real-world clinical settings urgently requires empirical validation. Finally, the rapid evolution of AI technology necessitates dynamically updated training databases and ethical evaluation frameworks. To address these gaps, subsequent research will incorporate the Partial Credit Model (PCM) and Item Response Theory (IRT) through latent trait modeling, systematically quantifying AI response difficulty levels, refining multidimensional scoring criteria, and strengthening clinical applicability assessments to establish a psychometrically based evaluation framework. This methodological advancement will enhance the granular understanding of AI's role in complex medical decision-making (e.g., surgical approach selection, prognostic stratification). Future research priorities include: (1) comparative effectiveness studies across AI systems, (2) real-world clinical validation of performance, and (3) development of specialty-specific human-AI collaboration guidelines to systematically improve the clinical utility of intelligent assistive tools in spinal surgery.
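
For readers unfamiliar with the Partial Credit Model mentioned above, the sketch below computes category probabilities for a single five-point item under the standard PCM formulation, in which the probability of a score category is proportional to the exponential of the cumulative sum of (theta - delta_j) over the steps up to that category. The text does not specify a parameterization or software, so the function name, the latent trait value `theta`, and the step difficulties are illustrative assumptions rather than estimates from this study.

```python
import math


def pcm_category_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the Partial Credit Model.

    theta: latent trait value (here, hypothetically, response quality).
    step_difficulties: delta_1..delta_m for the m steps between the
        m + 1 ordered categories (4 steps for a five-point scale).
    Returns a list of m + 1 probabilities, lowest category first.
    """
    # Cumulative sums of (theta - delta_j); the lowest category uses
    # the empty sum, which is 0 by convention.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))

    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the normalized probabilities.
    largest = max(cumulative)
    weights = [math.exp(value - largest) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Illustrative values only: four made-up step difficulties for scores 1 to 5.
probabilities = pcm_category_probabilities(1.5, [-2.0, -1.0, 0.0, 1.0])
for score, p in enumerate(probabilities, start=1):
    print(f"P(score = {score}) = {p:.3f}")
```

In such a framework, the step difficulties would be estimated from the expert ratings, which is what would allow question difficulty to be quantified separately from response quality.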

Author contributions

XJ: Investigation, Methodology, Writing – original draft. LM: Supervision, Writing – review & editing. YY: Data curation, Validation, Writing – original draft. YD: Data curation, Investigation, Writing – original draft. CS: Methodology, Validation, Writing – original draft. KZ: Data curation, Investigation, Methodology, Writing – original draft. YL: Data curation, Investigation, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that Generative AI was used in the creation of this manuscript. Generative AI provided answers to guideline-related questions regarding thoracolumbar spine fractures.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fsurg.2025.1524396/full#supplementary-material

References

1. Tian S, Jin Q, Yeganova L, Lai P-T, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. (2023) 25:bbad493. doi: 10.1093/bib/bbad493

2. Zhang J, Sun K, Jagadeesh A, Falakaflaki P, Kayayan E, Tao G, et al. The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant. J Am Med Inform Assoc. (2024) 31:1884–91. doi: 10.1093/jamia/ocae184

3. Giorgino R, Alessandri-Bonetti M, Luca A, Migliorini F, Rossi N, Peretti GM, et al. ChatGPT in orthopedics: a narrative review exploring the potential of artificial intelligence in orthopedic practice. Front Surg. (2023) 10:1284015. doi: 10.3389/fsurg.2023.1284015

4. Dailey AT, Arnold PM, Anderson PA, Chi JH, Dhall SS, Eichholz KM, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: classification of injury. Neurosurgery. (2019) 84:E24–7. doi: 10.1093/neuros/nyy372

5. Dhall SS, Dailey AT, Anderson PA, Arnold PM, Chi JH, Eichholz KM, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: hemodynamic management. Neurosurgery. (2019) 84:E43–5. doi: 10.1093/neuros/nyy368

6. Harrop JS, Chi JH, Anderson PA, Arnold PM, Dailey AT, Dhall SS, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: neurological assessment. Neurosurgery. (2019) 84:E32–5. doi: 10.1093/neuros/nyy370

7. Hoh DJ, Qureshi S, Anderson PA, Arnold PM, John HC, Dailey AT, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: nonoperative care. Neurosurgery. (2019) 84:E46–9. doi: 10.1093/neuros/nyy369

8. Chi JH, Eichholz KM, Anderson PA, Arnold PM, Dailey AT, Dhall SS, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: novel surgical strategies. Neurosurgery. (2019) 84:E59–62. doi: 10.1093/neuros/nyy364

9. Rabb CH, Hoh DJ, Anderson PA, Arnold PM, Chi JH, Dailey AT, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: operative versus nonoperative treatment. Neurosurgery. (2019) 84:E50–2. doi: 10.1093/neuros/nyy361

10. Arnold PM, Anderson PA, Chi JH, Dailey AT, Dhall SS, Eichholz KM, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: pharmacological treatment. Neurosurgery. (2019) 84:E36–8. doi: 10.1093/neuros/nyy371

11. Raksin PB, Harrop JS, Anderson PA, Arnold PM, Chi JH, Dailey AT, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: prophylaxis and treatment of thromboembolic events. Neurosurgery. (2019) 84:E39–42. doi: 10.1093/neuros/nyy367

12. Qureshi S, Dhall SS, Anderson PA, Arnold PM, Chi JH, Dailey AT, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: radiological evaluation. Neurosurgery. (2019) 84:E28–31. doi: 10.1093/neuros/nyy373

13. Anderson PA, Raksin PB, Arnold PM, Chi JH, Dailey AT, Dhall SS, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: surgical approaches. Neurosurgery. (2019) 84:E56–8. doi: 10.1093/neuros/nyy363

14. Eichholz KM, Rabb CH, Anderson PA, Arnold PM, Chi JH, Dailey AT, et al. Congress of neurological surgeons systematic review and evidence-based guidelines on the evaluation and treatment of patients with thoracolumbar spine trauma: timing of surgical intervention. Neurosurgery. (2019) 84:E53–5. doi: 10.1093/neuros/nyy362

15. Chinese Medical Doctor Association Orthopedics Branch, Editorial Board of the Evidence-based Clinical Practice Guidelines for Acute Thoracolumbar Spinal Cord Injury in Adults by the Chinese Medical Doctor Association Orthopedics Branch. Evidence-based clinical practice guidelines for orthopedics by the Chinese medical doctor association orthopedics branch: evidence-based clinical practice guidelines for acute thoracolumbar spinal cord injury in adults. Chin J Surg. (2019) 57(3):161–5. doi: 10.3760/cma.j.issn.0529-5815.2019.03.001

Keywords: ChatGPT4o, thoracolumbar spine fractures, theranostic performance, clinical practice, AI in medicine

Citation: Jia X, Ma L, Yang Y, Deng Y, Shen C, Zhang K and Li Y (2025) ChatGPT4o's theranostic performance in the management of thoracolumbar spine fractures. Front. Surg. 12:1524396. doi: 10.3389/fsurg.2025.1524396

Received: 7 November 2024; Accepted: 12 February 2025;
Published: 25 February 2025.

Edited by:

Wencai Liu, Shanghai Jiao Tong University, China

Reviewed by:

Harish Kempegowda, Boston University, United States
Hartanto Hartanto, Universitas Widya Dharma, Indonesia
Nicola Manocchio, University of Rome Tor Vergata, Italy

Copyright: © 2025 Jia, Ma, Yang, Deng, Shen, Zhang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Litai Ma, ma.litai@163.com
