
ORIGINAL RESEARCH article

Front. Med., 09 January 2026

Sec. Healthcare Professions Education

Volume 12 - 2025 | https://doi.org/10.3389/fmed.2025.1667104

This article is part of the Research Topic: Artificial Intelligence for Technology Enhanced Learning.

Supporting postgraduate exam preparation with large language models: implications for traditional Chinese medicine education


Baifeng Wang1, Meiwei Zhang2, Zhe Wang3, Keyu Yao4, Meng Hao4, Junhui Wang5, Suyuan Peng4*, Yan Zhu4*
  • 1Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing, China
  • 2School of Medical Informatics, Changchun University of Traditional Chinese Medicine, Changchun, China
  • 3Institute of Medical Informatics, Statistics, and Epidemiology, Leipzig University, Leipzig, Germany
  • 4Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, China
  • 5Guang'anmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China

Introduction: In China, the medical education system features multiple co-existing levels, with higher education often leading to better job prospects. In career advancement—especially for entry into competitive urban hospitals—the postgraduate examination often plays a more decisive role than the licensing examination. The application of Large Language Models (LLMs) in Traditional Chinese Medicine (TCM) has rapidly expanded. TCM rests on a distinctive theoretical framework and terminology, requiring LLMs to demonstrate advanced information processing and comprehension abilities in a Chinese-language context. While LLMs have shown strong performance in many countries' licensing examinations, their performance in selective TCM examinations remains underexplored. This study aimed to evaluate and compare the performance of Ernie Bot, ChatGLM, SparkDesk, and GPT-4 on the 2023 Chinese Postgraduate Examination for TCM (CPE-TCM), and explore their potential in supporting TCM education and academic development.

Methods: We assessed the performance of four LLMs using the 2023 CPE-TCM as a test set. Exam scores were calculated to evaluate subject-specific performance. Additionally, responses were qualitatively analyzed based on logical reasoning and the use of internal and external information.

Results: Ernie Bot and ChatGLM achieved accuracy rates of 50.30 and 46.67%, respectively, both above the passing score. Statistically significant differences in subject-specific performance were observed, with the highest scores in the medical humanistic spirit module. ChatGLM and GPT-4 provided logical explanations for all responses, while Ernie Bot and SparkDesk showed logical reasoning in 98.2 and 43.6% of responses, respectively. ChatGLM and GPT-4 incorporated internal information in all explanations, whereas SparkDesk rarely did. Over 60% of responses from Ernie Bot, ChatGLM, and GPT-4 included external information, which did not significantly differ between correct and incorrect answers. In SparkDesk, the presence of internal or external information was significantly associated with answer correctness (P < 0.001).

Discussion: Ernie Bot and ChatGLM surpassed the passing threshold for postgraduate selection, reflecting solid TCM expertise. LLMs demonstrated strong capabilities in logical reasoning and integration of background knowledge, highlighting their promising role in enhancing TCM education.

1 Introduction

Physicians in most high-income countries undergo similar levels of educational training and share a homogeneous academic background (1). In response to the demand for healthcare, China has developed a complex medical education system aimed at increasing the number of physicians. Furthermore, the criteria for graduation and employment of medical doctors vary. Doctors with lower levels of education mainly practice in townships and rural areas in China, while those with higher levels of education are more likely to work in urban areas (1). Within the current system, there is a 3-year junior college medical program, a 5-year Bachelor of Medicine degree program, a “5+3” Master of Medicine degree program, and an 8-year Doctor of Medicine program (2). The medical profession has established degree levels to meet the demand for professional healthcare and ensure the provision of basic and primary medical services to the public (3). Despite undergoing multiple reforms and advancements, there remains a significant gap between the medical education system and elite medical education.

A growing number of undergraduate students in China are now pursuing advanced degrees such as master's or doctoral programs (2). In the past 3 years, over four million candidates have taken the Chinese Postgraduate Examination (CPE) annually. Recent data on postgraduate enrollment showed medicine is among the top five disciplines in terms of enrollment size, and it had the highest growth rate in 2021 (4). As of the end of 2020, 59.5% of physicians in the workforce held a bachelor's degree or higher.

Within China's multi-tiered medical education and practitioner system, the National Medical Licensing Examination (NMLE) serves as the mandatory gateway for clinical practice. However, for career advancement and entry into competitive urban hospitals, CPE is often the more critical and decisive step (5). Existing studies have primarily focused on medical licensing examinations, indicating that large language models (LLMs) may play a supportive role in medical education. However, it remains unclear whether LLMs can assist medical students in succeeding in competitive graduate entrance examinations and advancing their academic qualifications.

Significant advancements have been observed in Artificial Intelligence (AI) modeling in recent years, leading to the rapid development of LLM technology. Particularly noteworthy is the emergence of products like ChatGPT (OpenAI, 2022), signaling a new era for the deployment of general-domain AI (6). Currently, LLMs have demonstrated promising applications in various areas of medical research, including diagnostics (7), medical image analysis (8), medical writing (9), and personalized medicine (10).

The medical licensing examination is commonly used to assess the performance of LLMs in the medical field due to its high standardization, regulation, and comprehensive coverage of various subjects. Multiple studies have systematically evaluated ChatGPT's performance on standardized tests across various languages. Notably, it has demonstrated excellent performance on assessments such as the United States Medical Licensing Examination (USMLE) (11–13), the Japanese Medical Licensing Examination (JMLE) (14), the Saudi Medical Licensing Examination (SMLE) (15), the Polish medical specialization licensing exam (PES) (16), and Taiwan's medical licensing exams (17). However, on the Chinese NMLE over the past 5 years, ChatGPT's scores have consistently fallen below the passing threshold, mainly because its training data is predominantly in English, with only a small amount in Chinese. Additionally, ChatGPT may face challenges in accurately comprehending healthcare policies within non-English speaking nations (18, 19). Cai et al. introduced a comprehensive benchmark for the Chinese medical domain and evaluated both general-domain and medical LLMs in Chinese. The findings suggest that general-domain LLMs possess substantial medical knowledge and may perform better than medical LLMs (20).

However, most existing evaluations remain focused on modern biomedicine. Research on LLM performance in traditional medical systems—such as Traditional Chinese Medicine (TCM) or Traditional Korean Medicine—is still scarce, despite their unique theoretical and linguistic characteristics.

The application of LLMs in the education and assessment of East Asian traditional medicine has been gradually expanding. Previous studies have evaluated the performance of GPT-4 on Korea's national licensing examination for Traditional Korean Medicine (21) and examined ChatGPT-4 on Taiwan's TCM physician licensing examination (22). Both studies showed that LLMs could handle standardized questions to some extent but still lacked explanatory depth and higher-order reasoning. These studies, however, focused on licensure-oriented exams that emphasize minimum clinical competency, regulatory compliance, and knowledge breadth.

In contrast, the present study adopts the Chinese Postgraduate Examination for TCM (CPE-TCM), which is an academically selective examination. Its evaluation focus shifts from assessing “minimum competency” to higher-level cognitive abilities. The design of the test places greater emphasis on knowledge depth, integration, and differentiation, aiming to comprehensively assess candidates' understanding of foundational TCM theories, clinical reasoning, and the ability to transfer knowledge across contexts. It represents a cognitively challenging task with a high degree of discrimination. Notably, in addition to standard single-choice questions (A-type and B-type), the exam also includes highly discriminative multiple-choice questions (X-type). These require examinees to accurately grasp and horizontally integrate multiple knowledge points, imposing significantly greater cognitive demands than those of typical licensing examinations. The exam content also focuses more specifically on core disciplines of TCM, excluding non-essential topics such as basic Western medicine and medical laws and regulations, which are commonly included in licensure exams. Therefore, CPE-TCM provides a more sensitive evaluation scenario for assessing the depth of understanding and the ability of LLMs to integrate domain-specific knowledge. In addition to these distinctions, data accessibility also influenced our choice of examination. Although the NMLE is equally authoritative, its complete and verified question sets are not publicly available, and online items lack authenticity. By contrast, CPE-TCM offers a legally published, standardized, and fully accessible corpus, ensuring reliable and reproducible evaluation.

Compared with previous studies that used physician or pharmacist licensing examinations, this study offers a new academic perspective by systematically evaluating the comprehension and reasoning abilities of LLMs in the context of advanced TCM education. This design not only expands the scope of education-oriented assessment tasks but also lays a theoretical foundation for future applications of LLMs in postgraduate education, instructional support, and personalized learning in TCM.

TCM is a comprehensive medical system rooted in centuries of clinical experience and theoretical development in China. It plays an important role in disease prevention and treatment, and its global impact is growing, with acupuncture formally recognized in over 100 WHO member states. Given its distinctive theoretical framework and contextual reasoning features, systematic evaluation of LLMs in answering TCM-related questions is necessary to ensure their reliability and applicability in this specialized domain.

Therefore, this study aims to systematically evaluate the performance of mainstream LLMs on CPE-TCM under simulated test conditions and to explore their potential in supporting postgraduate TCM education. This benchmark offers several advantages: it (i) comprehensively covers core TCM subjects; (ii) is authored by domain experts, ensuring content validity; (iii) provides highly standardized items with verifiable answers; and (iv) supplies official scoring anchors (i.e., passing thresholds).

2 Methodology

2.1 Models

In addition to the globally recognized GPT-4 (6), we selected three representative Chinese LLMs—Ernie Bot, ChatGLM, and SparkDesk—to enhance language adaptation and methodological coverage. Ernie Bot is built upon the ERNIE architecture, which adopts a knowledge-enhanced pre-training paradigm by integrating structured knowledge with large-scale corpora (23). It has demonstrated strong performance across a range of Chinese NLP tasks and is particularly suitable for addressing structured, knowledge-dependent professional questions as featured in this study. ChatGLM, derived from the GLM framework, employs an autoregressive blank infilling objective to support a unified pre-training architecture (24). It emphasizes architectural generalization and task compatibility, providing a solid foundation for logical reasoning and option-level analysis. SparkDesk prioritizes education-oriented deployment. It has been applied across multiple educational scenarios—including personalized learning, lesson planning, and classroom-integrated devices—demonstrating a high degree of engineering maturity and practical usability (25).

Collectively, these models represent three major technical pathways in Chinese LLM development—knowledge enhancement, unified architecture, and application-driven integration. Their inclusion enables a multi-faceted evaluation of LLM performance in the context of this study and provides a balanced basis for comparison with GPT-4.

All models were accessed via official web interfaces with browsing/plugins disabled and default inference settings (no manual parameter adjustments).

2.2 Medical examination datasets

CPE-TCM is a nationally administered selective examination organized and developed by the National Education Examinations Authority under the Ministry of Education of China. It is designed for all universities nationwide that enroll master's students in TCM programs, featuring a high degree of standardization and authority. Therefore, the examination questions reflect the core national requirements for the knowledge structure and competency level expected of candidates entering TCM postgraduate programs. The test content focuses closely on classical TCM theories and core clinical disciplines, and is widely regarded within the TCM education system as an important indicator of undergraduate learning outcomes and postgraduate training potential.

We collected questions from the CPE-TCM in 2023, for a total of 165 questions. The examination comprised single-choice and multiple-choice questions spanning seven distinct disciplines: Basic Theory of TCM, Diagnostics of TCM, Pharmacology of TCM, Formulas of TCM, Internal Medicine of TCM, Acupuncture and Moxibustion, and Medical Humanistic Spirit (Table 1).

Table 1. Subjects examined in the test questions.

2.3 Prompt engineering

Prompts have a significant impact on the output of LLMs (26). To elicit answers in the best form and with the best content, we supplied structured prompts and instructed the LLMs to furnish explanations and justifications for their selected answers.

Each question was input in the specified format: “The following is a single-choice question from the Chinese Comprehensive Ability of Clinical Medicine (TCM) Examination. Please choose the correct answer and provide a correct and reasonable answer analysis” (single-choice questions) or “The following is a multiple-choice question from the Chinese Comprehensive Ability of Clinical Medicine (TCM) Examination. For each sub-question, there are four options labeled A, B, C, and D, among which at least two meet the requirements of the question. Please choose the correct answers and generate a correct and reasonable answer analysis” (multiple-choice questions). The question and its corresponding options were then entered. Table 2 displays an example question along with the answers provided by the LLMs.

Table 2. Response scenarios for the identical question from Ernie Bot and GPT-4.

Both the prompts and the test items were presented to all models in their original Chinese form, and all responses were generated entirely based on Chinese text. The English examples shown in the manuscript are human translations provided after the evaluation for illustrative purposes only; they were not used in model inference and had no influence on model performance or evaluation outcomes.
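To make the prompt format described above concrete, the following minimal sketch (in Python) shows how each exam item could be assembled into a prompt before being submitted through a model's web interface. It is illustrative only: the actual evaluation used the original Chinese wording, the English templates below follow the translations given in the text, and the example item is a placeholder rather than an actual CPE-TCM question.

    # Illustrative sketch of the prompt format described above.
    # The evaluation itself used Chinese prompts; these English templates
    # follow the translations given in the manuscript.

    SINGLE_CHOICE_TEMPLATE = (
        "The following is a single-choice question from the Chinese Comprehensive "
        "Ability of Clinical Medicine (TCM) Examination. Please choose the correct "
        "answer and provide a correct and reasonable answer analysis.\n\n{question}\n{options}"
    )

    MULTIPLE_CHOICE_TEMPLATE = (
        "The following is a multiple-choice question from the Chinese Comprehensive "
        "Ability of Clinical Medicine (TCM) Examination. For each sub-question, there "
        "are four options labeled A, B, C, and D, among which at least two meet the "
        "requirements of the question. Please choose the correct answers and generate "
        "a correct and reasonable answer analysis.\n\n{question}\n{options}"
    )

    def build_prompt(question, options, multiple=False):
        """Assemble one exam item into the text pasted into the model's web interface."""
        option_block = "\n".join(f"{label}. {text}" for label, text in sorted(options.items()))
        template = MULTIPLE_CHOICE_TEMPLATE if multiple else SINGLE_CHOICE_TEMPLATE
        return template.format(question=question, options=option_block)

    # Placeholder item (not an actual CPE-TCM question):
    print(build_prompt("Which zang organ stores the blood?",
                       {"A": "Heart", "B": "Liver", "C": "Spleen", "D": "Kidney"}))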

2.4 Data analysis

All model testing was conducted by inputting questions and obtaining answers provided by the LLMs. All responses were recorded in an electronic spreadsheet, and the chosen answers from each response were extracted and compared to the standard answers. Each question was considered “correct” only if the answer choices matched the standard answer, whereas incorrect choices, missing options, or ambiguous responses were labeled as “incorrect.” The accuracy of LLMs' responses was calculated after the process.
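As a minimal sketch of the correctness rule just described: a response counts as correct only when its extracted option set exactly matches the standard answer, while wrong, missing, or ambiguous choices all count as incorrect. The record layout and field names below are assumptions for illustration, not the study's actual spreadsheet structure.

    # Sketch of the correctness rule described above (field names are illustrative).
    def is_correct(extracted, standard):
        """True only if the extracted option set exactly matches the standard answer."""
        if not extracted:                      # missing or ambiguous response
            return False
        return set(extracted.upper()) == set(standard.upper())

    def accuracy(records):
        """records: [{'extracted': 'BD', 'standard': 'BD'}, ...] (hypothetical layout)."""
        correct = sum(is_correct(r.get("extracted"), r["standard"]) for r in records)
        return correct / len(records)

    sample = [{"extracted": "B", "standard": "B"},
              {"extracted": "BD", "standard": "BCD"},   # partial match counts as incorrect
              {"extracted": None, "standard": "A"}]     # ambiguous/missing counts as incorrect
    print(f"Accuracy: {accuracy(sample):.2%}")          # Accuracy: 33.33%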

The scoring system for the question set was structured as follows (Table 3):

Table 3. Scoring structure.

Single-choice questions progress across three phases: foundational knowledge (Phase 1), clinical knowledge with humanities (Phase 2), and interdisciplinary integration (Phase 3), each calibrated with weighted points reflecting cognitive demands. Multiple-choice questions extend this framework to assess advanced problem-solving in complex scenarios. Differential scoring (1.5–2.0 points/question) emphasizes clinical reasoning and synthetic thinking, ensuring systematic evaluation of both specialized expertise and holistic TCM proficiency.

We calculated the accuracy rates of LLMs' responses across various subjects in the test questions and conducted statistical analysis on them. SPSS 27.0 software was used for data processing, and Pearson's χ2 test was used for comparison of rates, with P < 0.05 indicating a statistically significant difference. To compare within-model accuracy differences across disciplines (one omnibus χ2 test per model), we controlled the family-wise error rate (FWER) using the Bonferroni correction across the four primary tests (adjusted α = 0.05/4 = 0.0125). We report Bonferroni-adjusted P values. All tests were two-sided.
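The within-model heterogeneity test described above can be reproduced, in outline, as an omnibus χ2 test on a subjects × (correct, incorrect) contingency table, with the resulting P value multiplied by the number of primary tests for the Bonferroni adjustment. The counts below are placeholders, not the study's per-subject results.

    # Sketch of one omnibus chi-square test per model with Bonferroni correction
    # across the four primary tests (adjusted alpha = 0.05 / 4 = 0.0125).
    # The counts are placeholders, not the actual per-subject results.
    from scipy.stats import chi2_contingency

    def subject_heterogeneity(counts):
        """counts: one (n_correct, n_incorrect) pair per subject for a single model."""
        chi2, p, dof, _ = chi2_contingency(counts)
        return chi2, p

    N_PRIMARY_TESTS = 4
    placeholder_counts = [(10, 5), (12, 18), (7, 8), (20, 15), (9, 21), (6, 4), (3, 2)]
    chi2, p_raw = subject_heterogeneity(placeholder_counts)
    p_adj = min(p_raw * N_PRIMARY_TESTS, 1.0)   # Bonferroni-adjusted P value
    print(f"chi2 = {chi2:.2f}, raw P = {p_raw:.4f}, adjusted P = {p_adj:.4f}")
    print("significant" if p_adj < 0.05 else "not significant")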

2.5 Qualitative analysis

Three binary variables were employed to assess the responsiveness of LLMs toward each question:

1. Logical reasoning: The response demonstrated the logical connection between the provided information and the answer choices.

2. Ability to use internal information: The response incorporated information relevant to the question, including analyses derived from case content or essential explanations focused on the question.

3. Ability to use external information: The response incorporated external information beyond the scope of the question, including but not limited to additional analyses and expansions of the topic, provided answers, and distractors.

Each incorrect answer was categorized as one of three types based on the reason for the mistake:

1. Logical error: The parsing contained relevant information but did not correctly convert it into the answer; the options in the parsing differed from those in the final output; or the parsing contained contradictory statements.

Example Question: “What can cause low-grade fever?” Model output (GPT-4): “D. Qi stagnation: Qi stagnation is not directly related to low-grade fever, but prolonged Qi stagnation can transform into Fire, which can cause a low-grade fever.” The analysis showed that Qi stagnation could lead to a low-grade fever, but option D was not selected in the response. This is a logical error.

2. Information error: The extraction, interpretation, or application of internal information, or the presentation of external information, was incorrect.

Example Question: According to “Huang Di Nei Jing Su Wen,” what are the manifestations of men at “five times eight”? Model output (Ernie Bot): “According to Huang Di Nei Jing Su Wen, men at ‘five times eight’ exhibit strong tendons and bones (A is correct).” Due to this misinformation, the answer given is incorrect.

3. Combined error: Both types of errors occurred concurrently.

This study involved three evaluators: two senior postgraduate students in TCM who had successfully completed the CPE-TCM, and one senior clinical professional in TCM. To familiarize the evaluators with the scoring system, 30 questions from the 2022 CPE-TCM were used for training. Responses were independently assessed by the two postgraduate evaluators against the criteria above and then cross-checked. In the event of disagreement, the senior TCM clinical professional made the final judgment.
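To make the coding scheme concrete, the sketch below shows one possible record structure for a single annotated response, combining the three binary indicators with the error category assigned to incorrect answers. Field names, the question identifier, and the enum labels are illustrative assumptions, not the evaluators' actual instrument.

    # Sketch of a per-response annotation record for the qualitative coding scheme.
    # Field names and labels are illustrative.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class ErrorType(Enum):
        LOGICAL = "logical error"
        INFORMATION = "information error"
        BOTH = "both errors concurrently"

    @dataclass
    class ResponseAnnotation:
        question_id: str
        correct: bool
        logical_reasoning: bool      # explanation links the given information to the choice
        uses_internal_info: bool     # draws on case content or question-focused analysis
        uses_external_info: bool     # brings in knowledge beyond the question itself
        error_type: Optional[ErrorType] = None   # assigned only when correct is False

    # The GPT-4 "Qi stagnation" example above would be coded roughly like this:
    example = ResponseAnnotation(question_id="2023-low-grade-fever", correct=False,
                                 logical_reasoning=True, uses_internal_info=True,
                                 uses_external_info=True, error_type=ErrorType.LOGICAL)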

3 Results

3.1 Accuracy and scoring performance

The official passing threshold was 117 points. Ernie Bot, the highest achiever, attained a score of 146 by correctly answering 83 questions (50.30%). ChatGLM also successfully passed the exam with a score of 136.5, demonstrating an accuracy of 46.67%. However, SparkDesk and GPT-4 did not pass the examination, scoring 86 and 112 points, respectively.

3.2 Performance on different subjects

Table 4 summarizes the test results for each subject. After applying the Bonferroni correction across the four omnibus χ2 tests, only GPT-4 retained a statistically significant heterogeneity across disciplines (P_adj < 0.004). The heterogeneity for Ernie Bot (P_adj = 0.080), ChatGLM (P_adj = 0.104), and SparkDesk (P_adj = 1.000) was not statistically significant. All models involved in the test exhibited the highest accuracy rates in questions related to the medical humanistic spirit.

Table 4. The performance of LLMs on different subjects.

3.3 Qualitative analysis of LLMs' response quality

The performance of LLMs was evaluated using three qualitative indicators: logical reasoning, ability to use internal information, and ability to use external information (Table 5).

Table 5. Performance of LLMs' responses in evaluation metrics.

Logical reasoning: ChatGLM and GPT-4 provided logical explanations for every answer selection, whereas some responses from Ernie Bot and SparkDesk lacked logical reasoning. For Ernie Bot, the presence or absence of logical reasoning had no significant effect on response accuracy (P = 0.12). For SparkDesk, however, the presence of logical reasoning was significantly associated with accuracy (P < 0.001), suggesting that logical reasoning may affect correctness rates.

Ability to use internal information: Both ChatGLM and GPT-4 utilized internal information in all question explanations. In contrast, SparkDesk provided significantly less internal information in its parses compared to the other three models. The presence or absence of internal information in the response significantly impacted the accuracy rate for SparkDesk (P < 0.001).

Ability to use external information: For Ernie Bot, 75.9% (63/83) of correct responses and 80.5% (66/82) of incorrect responses presented external information (difference of 4.6%; P = 0.30). Similarly, for ChatGLM, 64.9% (50/77) of correct responses and 71.6% (63/88) of incorrect responses contained external information (difference of 6.7%; P = 0.23). However, in SparkDesk, only 8.6% (5/58) of correct responses and 32.7% (35/107) of incorrect responses presented external information (difference of 24.1%; P < 0.001). There was a significant effect on the accuracy rate for SparkDesk depending on the presence of external information. For GPT-4, the difference was 7.1% (P = 0.21).

Among the reasons for incorrect responses in the four LLMs, informational errors were the most common, followed by instances where both logical and informational errors occurred at the same time (Table 6).

Table 6. Statistical analysis of the reasons behind incorrect responses.

4 Discussion

4.1 Principal findings

Our results indicated that Ernie Bot performed the best in terms of scores and correctness, followed by ChatGLM. Both of them exceeded the passing score, demonstrating their expertise in TCM surpassed the passing threshold for CPE-TCM. SparkDesk and GPT-4 did not reach the passing threshold. This suggests that Ernie Bot and ChatGLM have the potential to be used in TCM and could serve various roles in the future.

Differences in AI accuracy across input languages may partially stem from the linguistic composition of their training datasets, as LLMs often exhibit a preference for languages more aligned with their training data (27). ChatGPT performs best with English input, emphasizing the influence of language on its accuracy (28). Yu et al. (29) found that ChatGPT demonstrated medium accuracy in answering open-ended medical questions in Chinese, with an accuracy rate of 31.5%. Considering the disparities between English and Chinese inputs, it is clear that ChatGPT needs further improvement to handle medical questions in Chinese. Furthermore, the performance of GPT-4 did not meet expectations in this study, likely because it is a general-domain model with relatively limited training data in TCM. To enhance the interoperability of Chinese medical terminology within international contexts, previous research has attempted to map Chinese medical entities to the Unified Medical Language System (UMLS), which helps cross-lingual models better understand and align TCM-related concepts (30).

After applying the Bonferroni correction across the four omnibus χ2 tests, only GPT-4 retained a statistically significant difference in performance across disciplines (P_adj < 0.004), while the apparent subject-level variations in Ernie Bot and ChatGLM (P_adj = 0.080 and 0.104, respectively) were no longer significant. This finding indicates that the previously observed heterogeneity for these two models may have been partly attributable to multiple testing and should therefore be interpreted with caution. The remaining subject-specific difference in GPT-4's performance suggests that even the most advanced general-purpose LLM still faces challenges in mastering certain specialized areas of TCM knowledge, highlighting the importance of domain-specific fine-tuning.

These four LLMs did well in responding to the questions on the medical humanistic spirit, possibly because answering such questions does not require specialized medical knowledge or clinical experience. In TCM education, students first learn foundational subjects before moving on to clinical subjects; the ability to address clinical questions effectively relies, in part, on mastering these fundamentals and applying them correctly. Ernie Bot showed higher accuracy in answering questions on Pharmacology of TCM (a foundational subject) and Internal Medicine of TCM (a clinical subject), while ChatGLM and GPT-4 performed well in Internal Medicine of TCM. The strong performance of the LLMs in Internal Medicine of TCM was therefore particularly notable in the present study. This may be because the Internal Medicine of TCM questions were mostly presented in case formats, where LLMs showed greater proficiency in extracting and processing internal information. Subject-level accuracy illustrates the suitability of various models for distinct domains, providing invaluable guidance for users seeking to choose a model tailored to their specific requirements (31).

In the qualitative analysis of responses: (1) Regarding logical reasoning, when the prompt specifically asked the LLMs to generate an answer parsing, Ernie Bot, ChatGLM, and GPT-4 produced more context-focused answers that better reflected the deductive reasoning process. For SparkDesk, 56.4% of answers lacked logical reasoning, and the accuracy of responses showing logical reasoning exceeded that of responses without it. This suggests an urgent need for SparkDesk to improve its deductive reasoning abilities to increase accuracy rates. (2) Regarding the ability to use internal and external information, Ernie Bot, ChatGLM, and GPT-4 incorporated internal information in over 95% of their answer parses, with over 60% integrating external information. The frequency of correct answers containing external information did not significantly surpass that of incorrect answers, suggesting that although the LLMs could connect questions to additional knowledge, such information did not substantially aid in making the correct choice.

The causes of incorrect responses were categorized into three groups: logical errors, information errors, and both types of errors at the same time. General-domain LLMs are mainly trained on widely accessible public datasets, which may lack training data specific to professional knowledge within specialized domains (32). When confronted with issues requiring domain-specific knowledge, general-domain LLMs are prone to hallucination, often producing factual fabrications (33). This closely aligns with the findings of the present study, where informational errors were identified as the most prevalent factor. Hence, prioritizing data selection and filtering from pre-trained corpora to gather high-quality TCM knowledge data is crucial. Moreover, considering the intricate terminology in TCM, the proficiency of LLMs in conducting expert analyses on TCM matters could be improved through careful selection of instruction-tuning data.

In this study, the evaluation of four LLMs' performance on the CPE-TCM revealed that Ernie Bot and ChatGLM were able to pass the exam. Furthermore, over 95% of their responses demonstrated logical reasoning and use of internal information, and the majority also presented external information, indicating that in most cases they could justify their responses through logical reasoning and contextual information. Ernie Bot and ChatGLM could process and comprehend natural language inputs by leveraging their broad pre-trained knowledge to deliver coherent and well-parsed responses. As a result, the superior performance and accuracy of ChatGLM and Ernie Bot imply greater proficiency in addressing TCM exam questions. This offers preliminary evidence that they could be integrated into TCM applications, showcasing their potential as TCM support tools. The knowledge accuracy and interpretive ability of SparkDesk and GPT-4 need further improvement.

4.2 Potential application of LLMs in TCM education

The logical reasoning and contextual comprehension demonstrated by LLMs highlight their potential as valuable tools in TCM education (18). LLMs can serve as “virtual assistants” to support various educational needs—such as generating multiple-choice questions, offering personalized feedback, simplifying complex concepts, and assisting in diagnostic training and prescription writing (34). Their ability to process nuanced language input and integrate background knowledge allows them to contribute meaningfully to knowledge consolidation and clinical thinking.

However, the deployment of LLMs in education requires caution. All outputs must be validated to ensure the safe and accurate dissemination of TCM knowledge, particularly given its specialized terminology and rule-based logic. While previous research has broadly envisioned LLMs as supportive tools, further specificity is needed to guide practical implementation in real-world teaching.

To address this gap, we propose stage-specific application scenarios that align LLM functionalities with the core competencies of each phase in TCM education. These scenarios are designed to be actionable, verifiable, and conducive to educational outcomes.

4.3 Application scenarios of LLMs aligned with the stages of TCM education

Building upon prior research on the use of LLMs in medical education, we further integrate the staged characteristics of TCM training and its unique pedagogical components—such as syndrome differentiation, formula composition, and rule-based prescription logic—to propose structured and practical application pathways.

1. Chinese postgraduate examination stage

During preparation for CPE-TCM, LLMs can be deeply embedded into intelligent assessment and adaptive learning systems to facilitate high-level competence evaluation and personalized learning guidance.

• Generation and analysis of X-type questions

X-type questions (multiple-choice questions with multiple correct answers) are the most discriminative item type in CPE-TCM. LLMs can automatically generate such questions that examine the logical interconnections among multiple knowledge domains—including basic theories, Chinese materia medica, formulas, and internal medicine—and provide detailed reasoning for each option. This allows students to engage in horizontal comparison across subfields, deepening comprehension and memory.

• Error tracing and knowledge mapping

Unlike traditional “question–answer–explanation” feedback, LLMs can map incorrect responses to specific knowledge deficiencies (e.g., incompatibility rules, misinterpretation of pathogenesis) and trigger personalized review plans. This enables an adaptive learning loop of “error diagnosis—spaced repetition—reinforced training.”

A typical implementation scenario may include the following stages (a minimal scheduling sketch in Python follows the list):

• Day 0: Initial attempt and immediate feedback

Students complete an LLM-generated X-type question. The system provides reasoning for each option and maps any incorrect choices to relevant weak points (e.g., “insufficient ability to differentiate between Qi deficiency and Yang deficiency” or “unclear understanding of the therapeutic functions of formulas”).

• Day 1: Rule reinforcement

The system pushes targeted microlearning content (e.g., comparison charts, mnemonic summaries) and focused discriminative exercises to strengthen key principles and correct fundamental misconceptions.

• Day 3: Knowledge integration and transfer

LLMs generate new questions linking the previously weak concepts with related domains (e.g., combining herbal properties, formula compatibility, and internal medicine pathogenesis), testing the learner's ability to apply and transfer knowledge to new contexts.

• Day 7: Contextualized comprehensive testing

The system presents a simulated clinical case requiring the learner to complete the entire reasoning chain—from syndrome differentiation to formula composition—thereby verifying whether fragmented knowledge has been internalized and integrated.
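As a minimal sketch of the Day 0/1/3/7 loop above, the snippet below maps a diagnosed weak point onto a dated sequence of review tasks; the task wording and schedule mapping are illustrative assumptions, and generation of the actual items would be delegated to an LLM rather than hard-coded.

    # Sketch of the Day 0/1/3/7 adaptive review loop described above.
    # Task descriptions are illustrative; real items would be generated by an LLM.
    from datetime import date, timedelta

    REVIEW_SCHEDULE = {
        0: "X-type question with option-by-option feedback and weak-point mapping",
        1: "targeted microlearning (comparison charts, discriminative exercises)",
        3: "new questions linking the weak concept with related domains",
        7: "simulated clinical case covering the full reasoning chain",
    }

    def plan_review(weak_point, start):
        """Turn one diagnosed weak point into a dated sequence of review tasks."""
        return [(start + timedelta(days=d), f"{task} for: {weak_point}")
                for d, task in sorted(REVIEW_SCHEDULE.items())]

    for due, task in plan_review("differentiating Qi deficiency from Yang deficiency",
                                 date(2025, 1, 1)):
        print(due.isoformat(), "-", task)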

2. Standardized residency training stage

At the stage of standardized residency training, LLMs can serve as virtual case simulators and prescription audit assistants, enhancing clinical reasoning ability and medication safety awareness.

• Case variant generation integrated with SP training

Prior studies have shown that LLM-based virtual patients can effectively improve medical trainees' diagnostic and communication skills (35). Based on standardized patient (SP) scripts, LLMs can automatically generate diverse case variants by modifying syndrome patterns, tongue and pulse manifestations, or comorbid conditions. This enables multi-scenario training for syndrome differentiation. For example, a patient with a “spleen deficiency” pattern may be supplemented with a “current use of ibuprofen” context to test whether the trainee can recognize the adverse gastrointestinal risks of NSAIDs, thereby reinforcing awareness of medication safety.

• Virtual patient interaction

LLMs can simulate natural patient communication, including realistic language style and emotional responses (35). The model dynamically adjusts dialogue pace and information disclosure depth based on the completeness and logic of the trainee's questions, improving the ability to collect the “Four Diagnostic Methods” (inspection, listening/smelling, inquiry, and palpation). Meanwhile, it can analyze the appropriateness of the trainee's diagnosis and prescription, identify logical flaws or safety warnings (e.g., formula–pattern mismatch, pregnancy contraindication), and provide instant, context-aware feedback.

• Prescription audit assistance

After the virtual consultation, trainees submit herbal prescriptions, which are automatically reviewed by the LLM to identify potential risks and provide revision suggestions to improve prescription safety and compliance.

Overall, LLMs can support the development of an integrated TCM Clinical Reasoning Training Platform that combines virtual and physical learning components:

• Syndrome evolution simulation

Starting from a prototype syndrome pattern (e.g., liver qi stagnation with spleen deficiency), the model can generate evolved patterns such as liver fire transformation or qi–yin deficiency, automatically adjusting inquiry content and physical findings to train students' diagnostic and differentiation skills.

• Symptom interference design

LLMs can deliberately obscure or disguise primary symptoms to test diagnostic accuracy—for instance, in a case of cardiac blood stasis pattern, an SP may primarily report “epigastric discomfort,” while the key symptom of “chest oppression” is downplayed, evaluating the trainee's ability to identify the core pathogenesis through detailed inquiry.

• Contextualized prescription feedback

Upon prescription submission, the LLM cross-references the patient history and syndrome pattern, flags potential risks, and cites classical sources or modern clinical guidelines to provide evidence-based modification suggestions.

By aligning these LLM-driven functions with the progressive stages of “cognitive learning—clinical reasoning—safe practice” in TCM education, a systematic and intelligent support framework can be established. Future pilot programs could be conducted in small teaching cohorts to verify feasibility, accuracy, and educational outcomes, thereby providing an evidence-based pathway toward the intelligent transformation of TCM education.

4.4 Limitations

This study had the following limitations. First, CPE-TCM evaluated the basic competence of TCM students comprehensively but did not include a more detailed specialist assessment. Furthermore, LLMs are rapidly evolving, with their architectures and parameters continuously optimized over time, which poses objective constraints on reproducing identical experimental conditions. Given the inherent stochasticity of model generation, this study reports single-run results as representative performance estimates within each model version. Future research could conduct multiple independent trials on fixed model versions to obtain more robust statistical conclusions.

Data availability statement

The datasets presented in this study can be found in online repositories. The dataset is publicly available on Figshare: https://doi.org/10.6084/m9.figshare.28682180.v3.

Author contributions

BW: Data curation, Writing – original draft, Conceptualization, Methodology, Writing – review & editing, Formal analysis. MZ: Data curation, Writing – original draft, Formal analysis. ZW: Methodology, Data curation, Conceptualization, Writing – review & editing, Funding acquisition. KY: Formal analysis, Validation, Writing – review & editing. MH: Validation, Writing – review & editing, Resources. JW: Validation, Writing – review & editing. SP: Conceptualization, Writing – review & editing, Project administration, Funding acquisition, Supervision, Methodology, Writing – original draft. YZ: Funding acquisition, Writing – review & editing, Supervision, Project administration, Methodology, Writing – original draft, Conceptualization.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the China Academy of Chinese Medical Sciences Basic Research Operating Expenses (ZZ170320, ZZ18XRZ069); the Beijing Natural Science Foundation (7252253, 7254504); and the Scientific and Technological Innovation Project of the China Academy of Chinese Medical Sciences (Cl2025C009LH).

Acknowledgments

The authors gratefully acknowledge the support of colleagues who contributed to this work. The authors acknowledge that an earlier version of this manuscript was made available as a preprint at Research Square (https://doi.org/10.21203/rs.3.rs-4392855/v1) (36).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Hsieh CR, Tang C. The multi-tiered medical education system and its influence on the health care market-China's Flexner report. Hum Resour Health. (2019) 17:50. doi: 10.1186/s12960-019-0382-4

2. Liu X, Feng J, Liu C, Chu R, Lv M, Zhong N, et al. Medical education systems in China: development, status, and evaluation. Acad Med. (2023) 98:43–9. doi: 10.1097/ACM.0000000000004919

3. Anand S, Fan VY, Zhang J, Zhang L, Ke Y, Dong Z, et al. China's human resources for health: quantity, quality, and distribution. Lancet. (2008) 372:1774–81. doi: 10.1016/S0140-6736(08)61363-X

4. National Graduate Enrolment Survey Report. (2023). Available online at: https://www.eol.cn/e_ky/zt/report/2024/abstract.html (Accessed June 15, 2025).

5. Wang W. Medical education in China: progress in the past 70 years and a vision for the future. BMC Med Educ. (2021) 21:453. doi: 10.1186/s12909-021-02875-6

6. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 Technical Report. arXiv:2303.08774. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv230308774O (Accessed June 15, 2025).

7. Berg HT, van Bakel B, van de Wouw L, Jie KE, Schipper A, Jansen H, et al. ChatGPT and generating a differential diagnosis early in an emergency department presentation. Ann Emerg Med. (2023). doi: 10.1016/j.annemergmed.2023.08.003

8. Srivastav S, Chandrakar R, Gupta S, Babhulkar V, Agrawal S, Jaiswal A, et al. ChatGPT in radiology: the advantages and limitations of artificial intelligence for medical imaging diagnosis. Cureus. (2023) 15:e41435. doi: 10.7759/cureus.41435

9. Liu H, Azam M, Bin Naeem S, Faiola A. An overview of the capabilities of ChatGPT for medical writing and its implications for academic integrity. Health Info Libr J. (2023). doi: 10.1111/hir.12509

10. Baxi V, Edwards R, Montalto M, Saha S. Digital pathology and artificial intelligence in translational medicine and clinical practice. Mod Pathol. (2022) 35:23–32. doi: 10.1038/s41379-021-00919-2

11. Sharma P, Thapa K, Thapa D, Dhakal P, Deep Upadhaya M, Adhikari S, et al. Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education. arXiv:2307.00112. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv230700112S (Accessed June 15, 2025).

12. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. (2023) 9:e45312. doi: 10.2196/45312

13. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. (2023) 2:e0000198. doi: 10.1371/journal.pdig.0000198

14. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ. (2023) 9:e48002. doi: 10.2196/48002

15. Aljindan FK, Al Qurashi AA, Albalawi IAS, Alanazi AMM, Aljuhani HAM, Falah Almutairi F, et al. ChatGPT conquers the Saudi medical licensing exam: exploring the accuracy of artificial intelligence in medical knowledge assessment and implications for modern medical education. Cureus. (2023) 15:e45043. doi: 10.7759/cureus.45043

16. Wójcik S, Rulkiewicz A, Pruszczyk P, Lisik W, Poboży M, Domienik-Karłowicz J. Reshaping medical education: performance of ChatGPT on a PES medical examination. Cardiol J. (2023). doi: 10.5603/cj.97517

17. Lin SY, Chan PK, Hsu WH, Kao CH. Exploring the proficiency of ChatGPT-4: an evaluation of its performance in the Taiwan advanced medical licensing examination. Digit Health. (2024) 10:20552076241237678. doi: 10.1177/20552076241237678

18. Wang X, Gong Z, Wang G, Jia J, Xu Y, Zhao J, et al. ChatGPT performs on the Chinese national medical licensing examination. J Med Syst. (2023) 47:86. doi: 10.1007/s10916-023-01961-0

19. Zong H, Li J, Wu E, Wu R, Lu J, Shen B. Performance of ChatGPT on Chinese national medical licensing examinations: a five-year examination evaluation study for physicians, pharmacists and nurses. BMC Med Educ. (2024) 24:143. doi: 10.1186/s12909-024-05125-7

20. Cai Y, Wang L, Wang Y, de Melo G, Zhang Y, Wang Y, et al. MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models. arXiv:2312.12806. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv231212806C (Accessed June 15, 2025).

21. Jang D, Yun TR, Lee CY, Kwon YK, Kim CE. GPT-4 can pass the Korean national licensing examination for Korean medicine doctors. PLOS Digit Health. (2023) 2:e0000416. doi: 10.1371/journal.pdig.0000416

22. Tseng L-W, Lu Y-C, Tseng L-C, Chen Y-C, Chen H-Y. Performance of ChatGPT-4 on Taiwanese traditional Chinese medicine licensing examinations: cross-sectional study. JMIR Med Educ. (2025) 11:e58897. doi: 10.2196/58897

23. Sun Y, Wang S, Feng S, Ding S, Pang C, Shang J, et al. ERNIE 3.0: Large-Scale Knowledge Enhanced Pre-training for Language Understanding and Generation. arXiv:2107.02137. (2021). Available online at: https://ui.adsabs.harvard.edu/abs/2021arXiv210702137S (Accessed June 15, 2025).

24. Du Z, Qian Y, Liu X, Ding M, Qiu J, Yang Z, et al. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. arXiv:2103.10360. (2021). Available online at: https://ui.adsabs.harvard.edu/abs/2021arXiv210310360D (Accessed June 15, 2025).

25. Xu H, Gan W, Qi Z, Wu J, Yu PS. Large Language Models for Education: A Survey. arXiv:2405.13001. (2024). Available online at: https://ui.adsabs.harvard.edu/abs/2024arXiv240513001X (Accessed June 15, 2025).

26. White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv230211382W (Accessed June 15, 2025).

27. Nicholas G, Bhatia A. Lost in Translation: Large Language Models in Non-English Content Analysis. arXiv:2306.07377. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv230607377N (Accessed June 15, 2025).

28. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. (2024) 10:e50965. doi: 10.2196/50965

29. Yu P, Fang C, Liu X, Fu W, Ling J, Yan Z, et al. Performance of ChatGPT on the Chinese postgraduate examination for clinical medicine: survey study. JMIR Med Educ. (2024) 10:e48514. doi: 10.2196/48514

30. Chen L, Qi Y, Wu A, Deng L, Jiang T. Mapping Chinese medical entities to the unified medical language system. Health Data Sci. (2023) 3:0011. doi: 10.34133/hds.0011

31. Farhat F, Chaudhry BM, Nadeem M, Sohail SS, Madsen DØ. Evaluating large language models for the national premedical exam in India: comparative analysis of GPT-3.5, GPT-4, and Bard. JMIR Med Educ. (2024) 10:e51523. doi: 10.2196/51523

32. Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A Survey of Large Language Models. arXiv:2303.18223. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv230318223Z (Accessed June 15, 2025).

33. Huang L, Yu W, Ma W, Zhong W, Feng Z, Wang H, et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv:2311.05232. (2023). Available online at: https://ui.adsabs.harvard.edu/abs/2023arXiv231105232H (Accessed June 15, 2025).

34. Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. (2023). doi: 10.1002/ase.2270

35. Öncü S, Torun F, Ülkü HH. AI-powered standardised patients: evaluating ChatGPT-4o's impact on clinical case management in intern physicians. BMC Med Educ. (2025) 25:278. doi: 10.1186/s12909-025-06877-6

36. Peng S, Zhu Y, Wang B, Zhang M, Wang Z, Yao K, et al. Performance of GPT-4 and Mainstream Chinese Large Language Models on the Chinese Postgraduate Examination Dataset: Potential for AI-Assisted Traditional Chinese Medicine. Durham, NC: Research Square (2024). doi: 10.21203/rs.3.rs-4392855/v1

Keywords: large language models (LLMs), traditional Chinese medicine, medical education, Ernie Bot, ChatGLM, SparkDesk, GPT-4

Citation: Wang B, Zhang M, Wang Z, Yao K, Hao M, Wang J, Peng S and Zhu Y (2026) Supporting postgraduate exam preparation with large language models: implications for traditional Chinese medicine education. Front. Med. 12:1667104. doi: 10.3389/fmed.2025.1667104

Received: 16 July 2025; Accepted: 02 December 2025;
Published: 09 January 2026.

Edited by:

Antonio Sarasa-Cabezuelo, Complutense University of Madrid, Spain

Reviewed by:

Ziming Yin, University of Shanghai for Science and Technology, China
Dongyeop Jang, Dong-Eui University, Republic of Korea

Copyright © 2026 Wang, Zhang, Wang, Yao, Hao, Wang, Peng and Zhu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Suyuan Peng, peng.suyuan@bjmu.edu.cn; Yan Zhu, zhuyan166@126.com
