
ORIGINAL RESEARCH article

Front. Digit. Health

Sec. Health Informatics

Evaluation of ChatGPT-4o's and DeepSeek R1's responses to urological problems: A comparative study

Provisionally accepted
  • 1Department of Urology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China
  • 2Eight-year Program of Clinical Medicine, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China

The final, formatted version of the article will be published soon.

Medical professionals are increasingly using artificial intelligence (AI) tools to support learning and diagnostic processes, yet little research has compared the performance of AI models in specific clinical fields such as urology. This study compared ChatGPT-4o and DeepSeek R1 on 809 single-choice urological questions from the Chinese National Qualification Examination for Attending Physicians in Urology. Both models were tested in three configurations: standard, advanced reasoning, and retrieval-augmented generation (RAG). ChatGPT-4o achieved accuracy rates of 78.12%, 73.79%, and 78.99%, respectively, while DeepSeek R1 achieved higher rates of 83.19%, 81.46%, and 84.55%; all differences were statistically significant (p < 0.001; Cohen's h = 0.129, 0.185, and 0.144). DeepSeek R1 was also more stable across reasoning modes, whereas ChatGPT-4o showed significant variability, and DeepSeek R1 performed notably better on complex, case-based questions. These findings indicate that DeepSeek R1 outperforms ChatGPT-4o in both accuracy and stability when evaluated in Chinese, particularly on complex urological cases, and suggest that optimized AI models may support medical education and clinical decision-making in Chinese-language contexts, while their performance in other languages or settings remains to be evaluated.
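The effect sizes above follow the standard Cohen's h formula for the difference between two proportions, h = 2·arcsin(√p₁) − 2·arcsin(√p₂). As a quick sketch (the function and variable names below are illustrative, not from the paper), the reported values of 0.129, 0.185, and 0.144 can be reproduced from the paired accuracy rates:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions, via the
    arcsine (variance-stabilizing) transformation."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Reported accuracies per configuration: (DeepSeek R1, ChatGPT-4o)
pairs = {
    "standard":           (0.8319, 0.7812),
    "advanced reasoning": (0.8146, 0.7379),
    "RAG":                (0.8455, 0.7899),
}

effects = {name: round(cohens_h(p1, p2), 3)
           for name, (p1, p2) in pairs.items()}
print(effects)
# {'standard': 0.129, 'advanced reasoning': 0.185, 'RAG': 0.144}
```

Each computed value matches the abstract, consistent with the authors having used the conventional arcsine-based definition of h.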

Keywords: Urology, DeepSeek R1, ChatGPT-4o, large language models (LLM), Clinical decision support, artificial intelligence

Received: 15 Aug 2025; Accepted: 11 Nov 2025.

Copyright: © 2025 LU, Zhang, Wang, Zhao, Liu, Qiu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Dongxu Qiu, qiudongxu1996@163.com
Yu-Shi Zhang, beijingzhangyushi@126.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.