Assessing the Adherence of Large Language Models to Clinical Practice Guidelines in Chinese Medicine: A Content Analysis

Zhao, Weilong; Lai, Honghao; Pan, Bei; Huang, Jiajie; Xia, Danni; Bai, Chunyang; Liu, Jiayi; Liu, Jianing; Jin, Yinghui; Shang, Hongcai; Liu, Jianping; Shi, Nannan; Liu, Jie; Chen, Yaolong; Estill, Janne; Ge, Long

doi:10.3389/fphar.2025.1649041

ORIGINAL RESEARCH article

Front. Pharmacol.

Sec. Ethnopharmacology

Volume 16 - 2025


Provisionally accepted

Weilong Zhao¹

Honghao Lai¹

Bei Pan²

Jiajie Huang³

Danni Xia¹

Chunyang Bai⁴

Jiayi Liu¹,⁵

Jianing Liu³

Jie Liu¹⁰

¹Lanzhou University School of Public Health, Lanzhou, China
²Lanzhou University School of Basic Medical Sciences, Lanzhou, China
³Gansu University of Chinese Medicine, Lanzhou, China
⁴Southern Medical University, Guangzhou, China
⁵Université de Genève, Geneva, Switzerland
⁶Zhongnan Hospital of Wuhan University Center for Evidence-Based and Translational Medicine, Wuhan, China
⁷Beijing University of Chinese Medicine Dongfang Hospital, Beijing, China
⁸Beijing University of Chinese Medicine, Beijing, China
⁹China Academy of Chinese Medical Sciences, Beijing, China
¹⁰China Academy of Chinese Medical Sciences Guang'anmen Hospital, Beijing, China
¹¹Lanzhou University Institute of Health Data Science, Lanzhou, China
¹²University of Madras School of Basic Medical Sciences, Chennai, India

The final, formatted version of the article will be published soon.

ABSTRACT

Objective: Whether large language models (LLMs) can effectively facilitate the acquisition of Chinese medicine (CM) knowledge remains uncertain. This study aims to assess the adherence of LLMs to Clinical Practice Guidelines (CPGs) in CM.

Methods: This cross-sectional study randomly selected ten CPGs in CM and constructed 150 questions across three categories: medication based on differential diagnosis (MDD), specific prescription consultation (SPC), and CM theory analysis (CTA). Eight LLMs (GPT-4o, Claude-3.5 Sonnet, Moonshot-v1, ChatGLM-4, DeepSeek-v3, DeepSeek-r1, Claude-4 Sonnet, and Claude-4 Sonnet Thinking) were evaluated using both English and Chinese queries. The main evaluation metrics were accuracy, readability, and use of safety disclaimers.

Results: Overall, DeepSeek-v3 and DeepSeek-r1 demonstrated superior performance in both English (median 5.00, interquartile range (IQR) 4.00-5.00 vs. median 5.00, IQR 3.70-5.00, respectively) and Chinese (both median 5.00, IQR 4.30-5.00), significantly outperforming all other models. All models achieved significantly higher accuracy in Chinese than in English responses (all p < 0.05). Accuracy varied significantly across question categories, with MDD and SPC questions proving more challenging than CTA questions. English responses had lower readability (mean Flesch Reading Ease score 32.7) than Chinese responses. Moonshot-v1 provided the highest rate of safety disclaimers (98.7% in English, 100% in Chinese).

Conclusion: LLMs showed varying degrees of potential as tools for acquiring CM knowledge. The performance of DeepSeek-v3 and DeepSeek-r1 was satisfactory. Optimizing LLMs to become effective tools for disseminating CM information is an important direction for future development.
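For context on the readability figure, the standard Flesch Reading Ease formula can be sketched as below. This is an illustration only: the abstract does not state which tool the authors used to compute the score, and the helper names (`flesch_reading_ease`, `count_syllables`, `score_text`) and the vowel-group syllable heuristic are assumptions introduced here, not the study's method. On the conventional interpretation of the scale, scores between 30 and 50 are read as "difficult", which is where the reported mean of 32.7 falls.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    return (
        206.835
        - 1.015 * (n_words / n_sentences)
        - 84.6 * (n_syllables / n_words)
    )


def count_syllables(word):
    """Rough heuristic (an assumption, not a dictionary lookup):
    count groups of consecutive vowels, discounting a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def score_text(text):
    """Score an English text using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

Because the syllable counter is heuristic, scores from this sketch will differ somewhat from those produced by dedicated readability tools. The formula is calibrated for English, so it does not apply directly to the Chinese responses.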

Keywords: Chinese medicine, Large Language Model, comparison, Clinical practice guideline, Knowledge acquisition

Received: 18 Jun 2025; Accepted: 17 Jul 2025.

Copyright: © 2025 Zhao, Lai, Pan, Huang, Xia, Bai, Liu, Liu, Jin, Shang, Liu, Shi, Liu, Chen, Estill and Ge. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Long Ge, Lanzhou University School of Public Health, Lanzhou, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.