
ORIGINAL RESEARCH article

Front. Endocrinol.

Sec. Clinical Diabetes

This article is part of the Research Topic: AI in Healthcare: Transforming Clinical Risk Prediction, Medical Large Language Models, and Beyond

Comparative Assessment of Large Language Models in Diabetic Foot Infection Management: Alignment with IWGDF/IDSA Guidelines

Provisionally accepted
Hongxia Wu1, Jiayi Deng2, Xu Qiu2, Li Xu2, Lumeng Lu1, Mingna Fan1, Danni Yu1, Chuanbo Liu3, Zhaohuan Chen3, Kai Wang4, Yuyan Wang3*, Haifang Zhou1*, Liyang Chang1*, Hanbin Wang3*
  • 1Hangzhou Traditional Chinese Medicine Hospital Affiliated to Zhejiang Chinese Medical University, Hangzhou, China
  • 2Zhejiang Chinese Medical University, Hangzhou, China
  • 3Hangzhou First People's Hospital, Hangzhou, China
  • 4The First People's Hospital of Hangzhou Lining District, Hangzhou, China

The final, formatted version of the article will be published soon.

ABSTRACT

Objective: To assess the clinical utility of artificial intelligence (AI) models (ChatGPT-4o, DeepSeek-R1, Grok-3 and Claude-3.7) in aligning with international guidelines for diabetic foot infection (DFI) management.

Background: AI systems have demonstrated potential value in numerous fields, but their specific effects in the medical and health sector still require in-depth exploration. DFI is a relatively common and serious complication among patients with diabetes, and the accurate communication of relevant information is of great significance. It is therefore important to evaluate whether AI can serve as an effective clinical auxiliary tool.

Methods: Responses from ChatGPT-4o, DeepSeek-R1, Grok-3 and Claude-3.7 were evaluated against DFI guidelines on four clinical dimensions (Accuracy, Overconclusiveness, Supplementary Value, and Completeness), each rated on a 5-point Likert scale, and were assessed for readability using Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL). Statistical analyses included ANOVA and post hoc comparisons.

Results: No significant differences were found across models for Accuracy and Overconclusiveness (p > 0.05). Supplementary Value differed significantly (p < 0.001): Grok-3 outperformed ChatGPT-4o (p < 0.0001), DeepSeek-R1 (p = 0.003), and Claude-3.7 (p < 0.0001). Completeness also differed significantly (p = 0.005): Grok-3 significantly outperformed ChatGPT-4o (p = 0.016) and Claude-3.7 (p = 0.010). Readability also varied: DeepSeek-R1 responses were more complex than ChatGPT-4o responses (p = 0.046).

Conclusion: All models performed comparably in terms of accuracy and in avoiding overconclusive statements. Grok-3 outperformed the other models in the dimensions of Supplementary Value and Completeness. DeepSeek-R1 generated the most complex text. These findings validate the feasibility of AI in the standardized management of DFI, but the models still need to be further verified through clinical trials to determine their value in real-world decision-making.
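For readers unfamiliar with the two readability metrics named in the Methods, the following minimal Python sketch shows how FRE and FKGL are computed from sentence, word, and syllable counts using the standard published formulas. It is an illustration only and not the authors' tooling: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic (an assumption; dedicated tools use dictionaries or more refined rules), and the input text is assumed to be non-empty English prose.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (sentences, words, syllables) for non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRE: higher scores indicate easier text."""
    n_sent, n_words, n_syll = text_counts(text)
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL: approximate US school grade level; higher means more complex."""
    n_sent, n_words, n_syll = text_counts(text)
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
```

Under these formulas, longer sentences and more syllables per word lower the FRE score and raise the FKGL score, which is the sense in which one model's responses can be described as "more complex" than another's.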

Keywords: adherence, artificial intelligence, diabetic foot infection, guideline, large language models

Received: 17 Jul 2025; Accepted: 09 Feb 2026.

Copyright: © 2026 Wu, Deng, Qiu, Xu, Lu, Fan, Yu, Liu, Chen, Wang, Wang, Zhou, Chang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Yuyan Wang
Haifang Zhou
Liyang Chang
Hanbin Wang

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.