ORIGINAL RESEARCH article

Front. Educ.

Sec. Assessment, Testing and Applied Measurement

Performance of ChatGPT-o1 and DeepSeek-R1 on health law–related questions in the Chinese National Medical Licensing Examination: a comparative study

  • 1. Qidong County People's Court, Hengyang, China

  • 2. Yangzhou Key Laboratory of Pancreatic Disease, Affiliated Hospital of Yangzhou University, Yangzhou, China

  • 3. Faculty of Medicine, Yangzhou University, Yangzhou, China

  • 4. Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Peking University Cancer Hospital & Institute, Beijing, China


Abstract

Background: This study compared the performance of two advanced large language models (LLMs), DeepSeek-R1 and ChatGPT-o1, on health law–related questions from the Chinese National Medical Licensing Examination (CNMLE), thereby evaluating their applicability to medical education in non-English contexts.

Methods: A total of 400 health law questions were randomly selected from the official CNMLE guidebook. Each question was independently administered to DeepSeek-R1 and ChatGPT-o1 via standardized application programming interface (API) prompts to minimize hallucination and memory effects. Model responses were compared against the official answers, and statistical analyses were conducted using McNemar's test, with p < 0.05 indicating significance.

Results: DeepSeek-R1 achieved an overall accuracy of 93.5% (374/400), significantly higher than ChatGPT-o1's 79.5% (318/400; p < 0.001). Subgroup analysis showed that DeepSeek-R1 consistently outperformed ChatGPT-o1 across most legal categories, including medical institutions, infectious disease prevention, malpractice liability, and pharmaceutical regulation. The two models performed comparably in categories such as blood donation and maternal–child health law, and DeepSeek-R1 achieved perfect accuracy in smaller domains such as public health emergencies and occupational disease control.

Conclusions: DeepSeek-R1 outperformed ChatGPT-o1 in answering health law questions from the CNMLE, highlighting its potential as a reliable tool for medical education in China. The findings underscore the influence of linguistic and cultural context on LLM performance. Future work should extend evaluation to open-ended and case-based questions and explore fine-tuning strategies to further improve accuracy in healthcare settings.
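The paired comparison described in the Methods can be sketched with an exact McNemar test, which considers only the discordant questions (those one model answered correctly and the other did not). Note that the abstract reports only the marginal accuracies (374 vs. 318 correct out of 400), not the 2×2 discordant cell counts, so the counts below are purely hypothetical, chosen to be consistent with those marginals:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pairs.

    b = questions only model A answered correctly,
    c = questions only model B answered correctly.
    Under H0 (equal accuracy), discordant pairs split 50/50,
    so the p-value is a two-sided exact binomial tail.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical discordant split consistent with the reported marginals:
# 374 - 318 = 56 net extra correct answers for DeepSeek-R1.
chatgpt_only, deepseek_only = 4, 60  # illustrative counts, not from the study
p = mcnemar_exact(chatgpt_only, deepseek_only)
print(f"p = {p:.3g}")  # far below the 0.05 threshold
```

With any discordant split this lopsided, the exact test agrees with the reported p < 0.001; in practice the same computation is available via `statsmodels.stats.contingency_tables.mcnemar` with `exact=True`.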

Keywords

Chinese National Medical Licensing Examination, comparative study, health law and ethics, large language models (LLMs), medical education

Received

19 November 2025

Accepted

20 February 2026

Copyright

© 2026 Liu, Zhu, Fan, Yuan, Dong, Li, Chen and Lu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Weiwei Chen; Guotao Lu

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
