- 1Department of Transfusion Medicine, The Affiliated Hospital of Shaoxing University, Shaoxing, China
- 2Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China
- 3Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China
- 4Zhejiang Pharmaceutical University, Ningbo, Zhejiang, China
- 5School of Computing, College of Science, Engineering and Technology, The University of South Africa, Roodepoort, South Africa
- 6Zhejiang-Italy Joint Laboratory on AI & Materials Medical Technology, Shaoxing, China
In recent years, large language models (LLMs) have achieved remarkable progress in natural language processing and demonstrated potential applications in medicine. However, their professional capabilities in specific medical subfields, such as immunology, still require systematic evaluation. This study systematically evaluated 11 representative LLMs, including DeepSeek, GPT, Llama, Gemma, and Qwen series, based on the Chinese National Health Professional Qualification Examination in Rheumatology and Clinical Immunology. The evaluation covered four dimensions: basic medical knowledge, related medical knowledge, immunology knowledge, and professional practice ability. Results show significant differences among LLMs. DeepSeek-R1 and Qwen3 achieve the best performance, with accuracy exceeding 90%. However, performance on professional practice ability tasks remained relatively low, highlighting limitations in complex clinical applications.
1 Introduction
Since OpenAI released the first Generative Pre-trained Transformer (GPT) model in 2018, large language models (LLMs) have experienced rapid development (1). Models such as GPT-3 and GPT-4 have demonstrated outstanding performance in numerous natural language processing tasks due to their powerful language understanding and generation capabilities (2). Concurrently, multiple research institutions and companies worldwide have launched distinctive models, such as the DeepSeek series (3), Google's Gemma series (4), Meta's Llama series (5), and Alibaba's Qwen series (6). These models vary in parameter scale, architecture design, and training data, fostering prosperity and competition in the field of artificial intelligence.
Explorations into the application of LLMs in the medical field are increasing, covering areas such as medical question answering, auxiliary diagnosis, medical record summarization, and patient education (7–10). For example, Google's Med-PaLM 2 scored 86.5% on United States Medical Licensing Examination (USMLE)-style questions, well above the passing threshold, suggesting that LLMs can deliver clinically meaningful performance in medical question answering (11). In specific fields such as mental health, studies have begun systematically evaluating LLMs' knowledge mastery and diagnostic capabilities (12, 13). In (13), the authors tested the professional knowledge of LLMs such as DeepSeek, Gemma, and GPT in mental health. The results showed that the highest accuracy on single-choice questions reached 86.83%, but accuracy was lower on multiple-choice questions. In radiology error detection and case-based question answering, GPT-4V's performance approached that of radiologists; preliminary studies in pathological image-assisted diagnosis also show potential (14).
As an important subfield of medicine, LLMs also have broad application prospects in immunology (15–18). Rider et al. evaluated the diagnostic and clinical management guidance capabilities of LLMs in real cases of primary immunodeficiency disorders (15). In the study, real anonymized cases were submitted to the models using multi-turn prompting, and immunology experts assessed the models' diagnostic accuracy and reasoning quality. They found that some models (e.g., GPT-4o) achieved diagnostic accuracy of approximately 96.2%. In (16), researchers used LLMs to automatically identify immune-related adverse events in electronic health records (EHRs) and clinical trial data during treatments such as immune checkpoint inhibitors. This system can help detect these side effects in vast amounts of unstructured text, improving safety monitoring. In (17), researchers attempted to use GPT-4 as a tool for estimating the proportions of various immune cell types in mixed samples from whole blood RNA-Seq data. The results showed approximately 70% consistency between GPT-4 and traditional bioinformatics tools (e.g., CIBERSORTx, xCell).
Although LLMs demonstrate strong application potential in the medical field, their professionalism, accuracy, and reliability still require rigorous evaluation and validation. Current evaluations of LLMs' medical capabilities mostly focus on general medical knowledge or a few popular specialties (13, 19). Evaluation methods include using standardized medical examination question banks (e.g., the USMLE) (20), constructing task-specific datasets (21), or simulating clinical scenarios (22). Only a small number of works have attempted to evaluate LLMs' capabilities in the field of immunology, such as (15). However, these works evaluated only specific tasks, such as diagnosing primary immunodeficiency disorders, and tested only a few LLMs, such as GPT-4o and Llama, lacking evaluation of recently released LLMs such as DeepSeek and Qwen.
Rheumatology and clinical immunology encompass numerous disease types with complex pathogenesis and highly specialized diagnosis and treatment plans. Evaluating LLMs' professional capabilities in this field can provide a basis for medical workers, educators, and researchers to select appropriate AI-assisted tools. It also helps identify current models' weaknesses in specialized knowledge understanding and clinical application, providing direction for subsequent domain-adaptation fine-tuning. This paper is the first attempt to comprehensively evaluate LLMs' professional knowledge capabilities in immunology, an important medical field. We select the latest representative LLMs, such as GPT-4.1, DeepSeek-R1, Gemma3, and Qwen3, as test subjects and conduct large-scale testing experiments. We purchase simulated questions from reputable online vendors that mimic the format and content of the Chinese Health Professional Technical Qualification Examination in Rheumatology and Immunology for testing. These questions cover basic medical knowledge, immunology-related medical professional knowledge, immunology professional knowledge, and professional practice ability. The test results show that DeepSeek-R1 and Qwen3 demonstrate strong capabilities in the evaluation of rheumatology and clinical immunology professional knowledge, especially in basic knowledge and professional knowledge, with accuracy stably above 90%, surpassing GPT-4.1 and Gemma3. However, these LLMs generally exhibit lower accuracy in handling professional practice ability questions, revealing limitations in complex clinical scenario applications. This study provides methodological and empirical references for evaluating LLMs' knowledge mastery and application abilities in specialized medical fields and points out directions for future model optimization and evaluation system improvement.
2 Methods
2.1 Dataset
The dataset used for evaluation in this study consists of simulated questions purchased from online vendors that are designed to reflect the format and content of the Rheumatology and Clinical Immunology subject of the Chinese Health Professional Qualification Examination. All questions are original Chinese-language items. The corresponding exam is a national-level qualification examination for health technicians engaged in rheumatology and clinical immunology work in China, serving as a standard examination to assess whether applicants possess the corresponding professional technical qualifications and abilities (23). The qualification examination aims to scientifically and fairly measure medical personnel's professional knowledge, professional skills, and clinical practice abilities in the field of immunology, ensuring that practitioners in this specialty meet a uniform, standardized basic level, thereby safeguarding medical quality and patient safety.
The question bank content used in this paper is comprehensive, covering the core knowledge and skills required for rheumatology and clinical immunology physicians. All questions are single-choice and divided into four categories: Basic Medical Knowledge, Immunology-Related Medical Professional Knowledge, Immunology Professional Knowledge, and Professional Practice Ability, totaling 2829 questions. The Basic Medical Knowledge category comprises 517 questions, covering basic medical disciplines such as anatomy, physiology, pathology, and pharmacology related to rheumatology and immunology. The Immunology-Related Medical Professional Knowledge category includes 1698 questions, involving knowledge closely related to the diagnosis and treatment of rheumatic and immunological diseases in related disciplines such as internal medicine, surgery, pediatrics, and dermatology. The Immunology Professional Knowledge category consists of 570 questions, focusing on core professional knowledge such as immune system fundamentals, autoimmune disease mechanisms, and immunological testing. The Professional Practice Ability category includes 44 questions, simulating clinical scenarios to assess comprehensive application abilities such as medical history analysis, auxiliary examination interpretation, diagnosis, and treatment plan selection.
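Because the four categories differ greatly in size, the total accuracy reported in Section 3 is effectively a question-count-weighted combination of the four per-category accuracies. The short Python sketch below (not the authors' code; the per-category accuracies passed in the example are hypothetical) illustrates this weighting using the category sizes listed above.

```python
# Illustrative sketch: combine per-category accuracies into an overall accuracy,
# weighting each category by its number of questions (counts from the question bank).
CATEGORY_COUNTS = {
    "Basic Medical Knowledge": 517,
    "Related Medical Professional Knowledge": 1698,
    "Immunology Professional Knowledge": 570,
    "Professional Practice Ability": 44,
}  # totals 2829 questions

def overall_accuracy(per_category_acc: dict) -> float:
    """Question-count-weighted average of per-category accuracies (fractions)."""
    total = sum(CATEGORY_COUNTS.values())
    return sum(per_category_acc[c] * n for c, n in CATEGORY_COUNTS.items()) / total

# Hypothetical per-category accuracies, for illustration only:
print(f"{overall_accuracy({'Basic Medical Knowledge': 0.94, 'Related Medical Professional Knowledge': 0.93, 'Immunology Professional Knowledge': 0.89, 'Professional Practice Ability': 0.82}):.4f}")
```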
Below is an example question from the Immunology Professional Knowledge category:
Question: Which of the following corresponds to the pathological features of Sjögren's syndrome (sicca syndrome) involving the lip mucosa? ( )
A. Angioma formation
B. Destruction of glandular tissue, focal lymphocytic infiltration
C. Leukocytosis
D. Complement deposition
E. NK cell infiltration
Answer: B
2.2 Tested models
To ensure a comprehensive evaluation of LLMs with different architectures and scales, this study includes a wide variety of models in the experiments. The selected LLMs include both large cutting-edge models that perform excellently in natural language processing tasks and small lightweight models deployable in resource-constrained environments. Considering cost and performance, we select the following representative model families: the DeepSeek series, OpenAI's GPT series, Google's Gemma series, Meta's Llama series, and Alibaba's Qwen series, totaling 11 LLMs, as detailed in Table 1.
2.3 Testing methods
This study employs a unified automated testing process to evaluate all participating LLMs. During testing, programmatic calls are made through each model's provided API interface. Automated test scripts written in Python read question and option information row by row from the Excel-format question bank file. To ensure methodological consistency and fair comparison across heterogeneous LLMs, all models are evaluated using a zero-shot prompting strategy. This avoids introducing model-specific biases related to system prompts, hidden instructions, or chain-of-thought behavior, which differ substantially among providers. Zero-shot evaluation reflects the default usage patterns in many real-world clinical and educational settings, where users typically input single-turn questions without specialized prompting. We therefore designed the evaluation to measure baseline reliability and factual correctness under uniform and controlled conditions.
To ensure comparability of evaluation results and reduce the impact of random factors, all LLMs use default parameter settings without any system prompts or chain-of-thought guidance to examine the models' native capabilities under default configurations. Note that all questions used in this study are original Chinese questions. No translation is performed at any stage of data preparation or model evaluation. All prompts given to the models were also in Chinese. This ensures consistency with the linguistic structure of the original exam and avoids any biases introduced by translation.
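As an illustration of the workflow just described, the following minimal Python sketch reads the question bank from Excel, builds a Chinese zero-shot prompt for each item, and scores the returned option letter. It is not the authors' script: the column names ("question", "A" to "E", "answer"), the exact prompt wording, the call_model() placeholder, and the first-letter answer-extraction rule are illustrative assumptions.

```python
# Minimal sketch of the zero-shot evaluation loop (illustrative, not the authors' code).
import re
import pandas as pd

def build_prompt(row: pd.Series) -> str:
    # Zero-shot Chinese prompt: question plus five options, no examples,
    # no system prompt, no chain-of-thought guidance. Wording is an assumption.
    options = "\n".join(f"{k}. {row[k]}" for k in ["A", "B", "C", "D", "E"])
    return f"以下是一道单项选择题，请只回答选项字母。\n{row['question']}\n{options}"

def call_model(prompt: str) -> str:
    """Placeholder: replace with the provider's chat-completion API call,
    using the model's default parameters as described above."""
    raise NotImplementedError

def extract_choice(reply: str) -> str:
    match = re.search(r"[A-E]", reply)  # take the first option letter in the reply
    return match.group(0) if match else ""

bank = pd.read_excel("question_bank.xlsx")  # one row per question (assumed layout)
bank["predicted"] = [extract_choice(call_model(build_prompt(r))) for _, r in bank.iterrows()]
accuracy = (bank["predicted"] == bank["answer"]).mean()
print(f"Accuracy: {accuracy:.2%}")
```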
3 Results
Based on the test results from the question bank, the performance of the 11 LLMs across the four knowledge dimensions is shown in Table 2, with total accuracy shown in Figure 1. Overall, the LLMs show significant differences in performance, with total accuracy ranging from 63.63% to 91.91%.
Table 2. Test results (I: basic medical knowledge; II: related medical professional knowledge; III: immunology professional knowledge; IV: professional practice ability).
In the Basic Medical Knowledge section (517 questions), DeepSeek-R1 ranks first with 94.2% accuracy, followed by Qwen3-32B (93.2%) and DeepSeek-V3Pro (93.23%). Gemma2-27B performs relatively weakly in this dimension, with an accuracy of 63.25%. In the Related Medical Professional Knowledge section (1698 questions), Qwen3-235B performs excellently, achieving 93.46% accuracy, with Qwen3-32B (92.79%) and DeepSeek-R1 (92.58%) ranking second and third, respectively. In the Immunology Professional Knowledge section (570 questions), DeepSeek-R1 leads with 88.91% accuracy, while Qwen3-32B (87.28%) and DeepSeek-V3Pro (87.52%) also perform excellently. In the Professional Practice Ability section (44 questions), GPT-4.1 and Qwen3-32B tie for the highest accuracy, both at 84.09%, followed by DeepSeek-R1 and DeepSeek-V3Pro at 81.82%.
In terms of total accuracy, as shown in Figure 1, DeepSeek-R1 ranks first with an overall accuracy of 91.91%, followed by Qwen3-235B (91.62%) and DeepSeek-V3Pro (91.34%) in second and third place, respectively. Qwen3-32B ranks fourth with a total accuracy of 91.23%. In contrast, the Gemma series models perform relatively poorly overall, with Gemma2-27B achieving 63.63% total accuracy and Gemma3-27B achieving 71.47%. To further verify the reproducibility of our findings, we conduct three independent runs for four representative models (DeepSeek-R1, GPT-4.1, Gemma3-27B, and Qwen3-235B) using the same zero-shot setting. Across the 570 items in the Immunology Professional Knowledge section, the per-run accuracy varies by less than 1.2 percentage points for all four models, indicating near-deterministic behavior under default decoding and confirming that the reported point estimates are highly stable across replicate calls.
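The reproducibility check amounts to comparing accuracies across replicate runs; the brief sketch below (with hypothetical run-level accuracies) shows how the spread in percentage points is obtained and compared against the 1.2-point bound reported above.

```python
# Sketch of the reproducibility check (per-run accuracies are hypothetical fractions).
per_run_accuracy = {"run_1": 0.889, "run_2": 0.884, "run_3": 0.891}
spread_pp = (max(per_run_accuracy.values()) - min(per_run_accuracy.values())) * 100
print(f"Per-run spread: {spread_pp:.1f} percentage points")  # compare against the 1.2-point bound
```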
To systematically evaluate performance differences among the 11 LLMs, we conduct pairwise McNemar tests on the 570 immunology professional knowledge questions. The results show that every pairwise comparison among DeepSeek-R1, Qwen3-235B, and Qwen3-32B yields p ≥ 0.62, indicating comparable performance, with differences attributable to random variation. However, each of these three top performers differs from GPT-4o, GPT-4.1, Llama4-scout, and the Gemma series at p < 0.001, forming a statistically significant stratification between the high- and mid-performing groups. Likewise, p-values for comparisons within the DeepSeek family or within the Gemma family remain at or above 0.62, suggesting that different models within the same series converge in performance on this dataset.
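For readers who wish to reproduce the significance analysis, the sketch below shows one way to run a pairwise McNemar test from per-item correctness vectors using statsmodels. It is not the authors' implementation, and the correctness vectors shown are randomly generated placeholders.

```python
# Sketch of a pairwise McNemar test on per-item correctness (placeholder data).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Paired comparison of two models on the same items (boolean vectors)."""
    both = np.sum(correct_a & correct_b)
    only_a = np.sum(correct_a & ~correct_b)
    only_b = np.sum(~correct_a & correct_b)
    neither = np.sum(~correct_a & ~correct_b)
    table = [[both, only_a], [only_b, neither]]
    # Exact binomial test on the discordant pairs (only_a vs. only_b).
    return mcnemar(table, exact=True).pvalue

rng = np.random.default_rng(0)
a = rng.random(570) < 0.89  # hypothetical per-item correctness for model A
b = rng.random(570) < 0.87  # hypothetical per-item correctness for model B
print(f"McNemar p-value: {mcnemar_pvalue(a, b):.3f}")
```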
4 Discussion
4.1 Domain-specific interpretation of LLM performance
The performance of the 11 evaluated LLMs demonstrates clear differences across the four major competency domains of the examination. These patterns reveal not only the relative strengths of current models in factual medical knowledge but also their limitations in deeper clinical reasoning. Below, we provide a domain-by-domain interpretation to highlight the cognitive and practical implications of the results.
First, within the domain of fundamental medical knowledge, most high-performing LLMs achieve accuracy rates exceeding 90%, with models such as DeepSeek-R1 and Qwen3-32B demonstrating mastery approaching expert-level proficiency. This dimension primarily assesses structured medical knowledge, content that is comprehensively represented within mainstream pre-training corpora. The test results indicate that LLMs excel at pattern-based semantic retrieval and factual recall, cognitive processes corresponding to the early stages of human medical learning. These findings confirm that contemporary LLMs have established a robust and reliable foundation of medical knowledge.
Second, in the Related Medical Professional Knowledge domain, which spans internal medicine, dermatology, pediatrics, and other adjacent clinical specialties, performance remained strong for most large-scale models. This domain requires broader semantic integration across multiple disciplines, yet the knowledge involved is still largely factual and well documented. The moderate performance differences among models suggest that large, diverse pretraining corpora enable effective generalization across medical subfields. However, the more pronounced decline observed in mid-sized models such as Gemma2-27B highlights the dependence of cross-disciplinary reasoning on model scale and training data diversity.
Third, performance in the Immunology Professional Knowledge domain displayed greater variability across models. Although leading models scored around 86-89%, overall accuracy was lower than in the previous two domains. Questions in this category demand understanding of mechanisms such as immune pathways, autoimmunity, and immunological testing, areas where high-quality textual resources are more specialized and less abundant. This type of knowledge requires deeper conceptual abstraction and mechanistic understanding, capabilities that current LLMs approximate but do not fully master. The results underscore the need for immunology-specific domain adaptation or fine-tuning to improve performance in specialized medical subfields.
Fourth, the domain of professional practice capability reveals the most significant limitations of current large language models. Even the best-performing models, GPT-4.1 and Qwen3-32B, achieve accuracy rates of only around 84%, with other models performing markedly worse. These test items simulate real clinical scenarios, requiring diagnostic reasoning based on symptoms and examination results. Unlike factual questions, such tasks demand multi-step reasoning, situational judgement, and probabilistic inference. The performance gap reflects the challenges LLMs face when executing evidence-based clinical reasoning. Despite possessing robust factual knowledge reserves, current models remain constrained when handling complex, context-dependent medical tasks.
4.2 Implications for medical education and assessment
The domain-specific performance patterns observed in this study carry several important implications for medical education and assessment. The strong results in factual and foundational knowledge indicate that advanced LLMs can serve as supportive tools for medical learning, helping generate practice questions, clarify complex immunology concepts, and provide structured summaries that ease the acquisition of basic content. However, the marked decline in the Professional Practice Ability domain highlights that core clinical reasoning, such as differential diagnosis, contextual judgment, and multi-step inference, remains a distinctly human strength and should continue to be emphasized in student evaluation and curriculum design. These results suggest that while LLMs can enhance early-stage knowledge acquisition, high-stakes assessment must still rely on tasks that probe deeper cognitive processes. At the same time, the domain-specific strengths and weaknesses revealed in this study provide a basis for developing targeted AI-assisted tutoring systems that reinforce foundational knowledge while offering guided reasoning scaffolds for complex clinical scenarios, ultimately supporting medical students without substituting expert oversight.
4.3 Technical factors, clinical implications, and future directions
Differences in model performance can be partly explained by variations in model architecture, pretraining corpus diversity, and domain exposure. Larger models such as DeepSeek-R1 and Qwen3-235B likely benefit from extensive medical coverage and richer Chinese pre-training data, contributing to their strong performance across most domains. In contrast, mid-sized models like Gemma2-27B and Gemma3-27B may lack sufficient medical domain representation during pretraining, resulting in weaker performance, especially in tasks requiring immunology-specific conceptual knowledge or integrated clinical reasoning. The absence of explicit domain fine-tuning in all evaluated models also helps explain the consistent decline observed in tasks involving higher-order pathophysiological interpretation.
The findings additionally highlight important clinical and methodological implications. Although the leading LLMs show impressive mastery of factual medical knowledge, their reduced accuracy in practice-oriented items indicates that autonomous diagnostic reasoning remains beyond their current capabilities. As such, LLMs may serve as educational support tools but should not be deployed for unsupervised clinical decision-making. The use of single-choice questions and zero-shot settings in this study, while reflecting common real-world usage, limits the evaluation of deeper reasoning processes and fails to capture the full complexity of clinical workflows.
Future research should therefore broaden the evaluation framework to include open-ended clinical reasoning tasks, longitudinal case simulations, and multimodal clinical data such as imaging or laboratory results. Incorporating chain-of-thought prompting, domain-specific fine-tuning, and reinforcement learning with expert feedback may further enhance reasoning stability and domain robustness. Comparative analyses using real patient cases and collaborative expert annotation will also be essential for understanding how LLMs behave in authentic clinical environments and for identifying safe, practical pathways for their integration into medical education and clinical decision support systems.
5 Conclusion
This study systematically evaluated the professional capabilities of 11 representative LLMs in the field of rheumatology and clinical immunology. The results show that the DeepSeek series and Qwen series models performed excellently in this professional field, with total accuracy rates exceeding 91%, demonstrating their significant advantages in mastering medical professional knowledge. However, all models showed decreased performance on questions examining comprehensive clinical practice ability, highlighting the limitations of LLMs in complex clinical reasoning.
The findings suggest that LLM performance is shaped by multiple factors, including model scale, training data composition, architectural design, and the degree of domain adaptation. While leading models now approach the knowledge reserve levels of human experts in basic and related medical fields, their constrained clinical reasoning indicates that current LLMs are better suited for roles in medical education and knowledge support rather than independent clinical decision-making. Future research should prioritize the development of more advanced and multifaceted evaluation frameworks, such as incorporating multimodal inputs, integrating simulations of authentic clinical decision pathways, and conducting deeper error analyses in collaboration with clinical immunology specialists. Additionally, domain-adapted fine-tuning and targeted enhancement of reasoning capabilities represent essential directions for improving the reliability and practical applicability of LLMs in clinical environments.
Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.
Author contributions
YW: Conceptualization, Data curation, Writing – original draft. YJ: Data curation, Software, Writing – original draft. WJ: Data curation, Software, Writing – original draft. YX: Data curation, Software, Writing – original draft. WL: Data curation, Software, Writing – original draft. JW: Conceptualization, Investigation, Software, Writing – review & editing. QS: Conceptualization, Investigation, Supervision, Writing – review & editing. ZF: Conceptualization, Investigation, Supervision, Writing – review & editing, Software.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Social Science Research Project of the Ministry of Education of China under Grant No. 22YJAZH016.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. arXiv [preprint] arXiv:230308774. (2023). doi: 10.48550/arXiv.2303.08774
2. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. (2020) 33:1877–901. doi: 10.48550/arXiv.2005.14165
3. Liu A, Feng B, Xue B, Wang B, Wu B, et al. DeepSeek-V3 technical report. arXiv [preprint] arXiv:241219437. (2024). doi: 10.48550/arXiv.2412.19437
4. Team G, Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, et al. Gemma: open models based on gemini research and technology. arXiv [preprint] arXiv:240308295. (2024). doi: 10.48550/arXiv.2403.08295
5. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. arXiv [preprint] arXiv:230709288. (2023). doi: 10.48550/arXiv.2307.09288
6. Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, et al. Qwen2.5 technical report. arXiv [preprint] arXiv:241215115. (2024). doi: 10.48550/arXiv.2412.15115
7. Thirunavukarasu AJ, Ting DSW, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. (2023) 29:1930–40. doi: 10.1038/s41591-023-02448-8
8. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. (2023) 620:172–80. doi: 10.1038/s41586-023-06291-2
9. Liévin V, Hother CE, Motzfeldt A, Winther O. Can large language models reason about medical questions? Patterns. (2024) 5:100943. doi: 10.1016/j.patter.2024.100943
10. Li H, Hou Y, Yang J, Song Y, Liu Y, Husile. Reunderstanding of the classification of syndromes in the Traditional Chinese Medicine classic “Shang Han Lun” based on large language models. In: 2024 IEEE International Conference on Medical Artificial Intelligence (MedAI). Chongqing: IEEE (2024). p. 348–353.
11. Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv [preprint] arXiv:231116452. (2023). doi: 10.48550/arXiv.2311.16452
12. Hanafi A, Saad M, Zahran N, Hanafy R, Fouda M. A comprehensive evaluation of large language models on mental illnesses. arXiv [preprint] arXiv:240915687. (2024). doi: 10.48550/arXiv.2409.15687
13. Xu Y, Fang Z, Lin W, et al. Evaluation of large language models on mental health: from knowledge test to illness diagnosis. Front Psychiatry. (2025) 16:1646974. doi: 10.3389/fpsyt.2025.1646974
14. Brin D, Sorin V, Barash Y, Konen E, Glicksberg B, Nadkarni G, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. (2025) 35:1959–65. doi: 10.1007/s00330-024-11035-5
15. Rider NL, Li Y, Chin A, DiGiacomo D, Dutmer C, Farmer J, et al. Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders. J Allergy Clin Immunol. (2025) 156:81–7. doi: 10.1016/j.jaci.2025.02.004
16. Bejan CA, Wang M, Venkateswaran S, Bergmann EA, Hiles L, Xu Y, et al. irAE-GPT: Leveraging large language models to identify immune-related adverse events in electronic health records and clinical trial datasets. medRxiv. (2025). doi: 10.1101/2025.03.05.25323445
17. Sarwal R, Bhattacharya S, Butte A. Leveraging GPT-4 for enhanced cell-type deconvolution in immunological research. J Immunol. (2024) 212(Suppl 1):0274.6013. doi: 10.4049/jimmunol.212.supp.0274.6013
18. Li H, Xia C, Hou Y, Hu S, Quan J, Liu Y. TCMRD-KG: design and development of a rheumatism knowledge graph based on ancient Chinese literature. In: 2024 IEEE International Conference on Medical Artificial Intelligence (MedAI). Chongqing: IEEE (2024). p. 588–593.
19. Levkovich I. Evaluating diagnostic accuracy and treatment efficacy in mental health: a comparative analysis of large language model tools and mental health professionals. Eur J Investig Health Psychol Educ. (2025) 15:9. doi: 10.3390/ejihpe15010009
20. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv [preprint] arXiv:230313375. (2023). doi: 10.48550/arXiv.2303.13375
21. Wang B, Xie Q, Pei J, Wang B. A survey of evaluation metrics used for large language models in medical applications. arXiv [preprint] arXiv:240605868. (2024). doi: 10.48550/arXiv.2404.15777
22. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. (2023) 6:120. doi: 10.1038/s41746-023-00873-0
23. National Health Commission P R China. National Health Professional Qualification Examination. (2025). Available online at: https://www.21wecan.com/wsrcw/c100190/chuzhongji.shtml (Accessed September 15, 2025).
Keywords: DeepSeek, immunology, knowledge test, large language models, model evaluation
Citation: Wang Y, Jiang Y, Jin W, Xu Y, Lin W, Wang J, Song Q and Fang Z (2026) Evaluation of large language models in rheumatology and clinical immunology: a systematic assessment based on Chinese national health professional qualification examination. Front. Med. 12:1716122. doi: 10.3389/fmed.2025.1716122
Received: 03 October 2025; Revised: 10 December 2025;
Accepted: 22 December 2025; Published: 15 January 2026.
Edited by:
Birger Moell, KTH Royal Institute of Technology, Sweden
Reviewed by:
Haotian Li, Beijing Key Laboratory of Functional Gastrointestinal Disorders Diagnosis and Treatment of Traditional Chinese Medicine, China
Ahmet Üşen, Istanbul Medipol University, Medical Faculty, Türkiye
Copyright © 2026 Wang, Jiang, Jin, Xu, Lin, Wang, Song and Fang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Qin Song, songq@zjpu.edu.cn; Zhaoxi Fang, fangzhaoxi@usx.edu.cn