ORIGINAL RESEARCH article
Front. Educ.
Sec. Assessment, Testing and Applied Measurement
Performance of ChatGPT-o1 and DeepSeek-R1 on health law–related questions in the Chinese National Medical Licensing Examination: A Comparative Study
Jihui Liu 1
Qingtian Zhu 2
Yadong Fan 3
Chenchen Yuan 2
Xiaowu Dong 2
Chengpeng Li 4
Weiwei Chen 3
Guotao Lu 3
1. Qidong County People's Court, Hengyang, China
2. Yangzhou Key Laboratory of Pancreatic Disease, Affiliated Hospital of Yangzhou University, Yangzhou, China
3. Faculty of Medicine, Yangzhou University, Yangzhou, China
4. Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Peking University Cancer Hospital & Institute, Beijing, China
Abstract
Background: This study aimed to compare the performance of two advanced large language models (LLMs), DeepSeek-R1 and ChatGPT-o1, on health law–related questions from the Chinese National Medical Licensing Examination (CNMLE), thereby evaluating their applicability to medical education in non-English contexts.
Methods: A total of 400 health law questions were randomly selected from the official CNMLE guidebook. Each question was independently administered to DeepSeek-R1 and ChatGPT-o1 via standardized application programming interface (API) prompts to minimize hallucination and memory effects. Model responses were compared against the official answers, and statistical analyses were conducted using McNemar's test, with p < 0.05 indicating significance.
Results: DeepSeek-R1 achieved an overall accuracy of 93.5% (374/400), significantly higher than ChatGPT-o1's 79.5% (318/400; p < 0.001). Subgroup analysis revealed that DeepSeek-R1 consistently outperformed ChatGPT-o1 across most legal categories, including medical institutions, infectious disease prevention, malpractice liability, and pharmaceutical regulation. The two models performed comparably in categories such as blood donation and maternal–child health law, and DeepSeek-R1 achieved perfect accuracy in smaller domains such as public health emergencies and occupational disease control.
Conclusions: DeepSeek-R1 outperformed ChatGPT-o1 in answering health law questions from the CNMLE, highlighting its potential as a reliable tool for medical education in China. The findings underscore the influence of linguistic and cultural context on LLM performance. Future work should extend the evaluation to open-ended and case-based questions and explore fine-tuning strategies to enhance accuracy in healthcare settings.
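The paired design described in the Methods (each question answered by both models) is what makes McNemar's test appropriate: the test depends only on the discordant pairs, i.e., questions one model answered correctly and the other did not. The sketch below shows an exact (binomial) McNemar test in pure Python. The discordant counts used here are illustrative assumptions chosen to be consistent with the reported margins (374/400 vs. 318/400); the article does not report the actual discordant split.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pair counts.

    b: pairs where model A is correct and model B is wrong.
    c: pairs where model A is wrong and model B is correct.
    Under H0 the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided p-value: double the tail probability of the smaller count.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

# Hypothetical discordant counts consistent with the reported accuracies;
# the true split is not given in the abstract.
b = 60  # DeepSeek-R1 correct, ChatGPT-o1 wrong
c = 4   # DeepSeek-R1 wrong, ChatGPT-o1 correct
print(f"p = {mcnemar_exact(b, c):.2e}")  # far below the 0.05 threshold
```

With large discordant counts, the asymptotic chi-square form (b − c)² / (b + c) gives a similar result; the exact binomial version is preferable when discordant pairs are few.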
Keywords
Chinese National Medical Licensing Examination, Comparative Study, Health Law and Ethics, Large Language Models (LLMs), Medical Education
Received
19 November 2025
Accepted
20 February 2026
Copyright
© 2026 Liu, Zhu, Fan, Yuan, Dong, Li, Chen and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Weiwei Chen; Guotao Lu
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.