ORIGINAL RESEARCH article
Front. Oncol.
Sec. Gastrointestinal Cancers: Hepato Pancreatic Biliary Cancers
Volume 15 - 2025 | doi: 10.3389/fonc.2025.1613818
Performance of Large Language Models in the Differential Diagnosis of Benign and Malignant Biliary Stricture
Provisionally accepted
- 1Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
- 2People's Liberation Army Joint Logistics Support Force 940th Hospital, Lanzhou, Gansu Province, China
- 3Third People's Hospital of Gansu Province, Lanzhou, Gansu Province, China
- 4Ankang Traditional Chinese Medicine Hospital, Ankang, China
- 5Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei Province, China
- 6First Affiliated Hospital of Anhui Medical University, Hefei, Anhui Province, China
- 7Yantai Ludong Hospital, Shandong Provincial Hospital Group, Yantai, Shandong Province, China
- 8Qinzhou Second People's Hospital, Qinzhou, China
- 9Xiang'an Hospital, Xiamen University, Xiamen, Fujian Province, China
Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performance of ten LLMs in the differential diagnosis of benign and malignant biliary strictures.

Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Their performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses compared hilar (n=29) with non-hilar strictures. The gold-standard diagnosis relied on histopathology and ≥3-month follow-up.

Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi demonstrated the highest accuracy (87%), significantly outperforming 7 of the 10 physicians (p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), and all surpassed the tumor markers (CA19-9 accuracy: 66%; CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures than for non-hilar strictures (87% vs. 79%, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05).

Using clinical, laboratory, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding that of clinical models and experienced physicians in differentiating benign from malignant biliary strictures. However, for hilar strictures, LLM performance was inferior to that of more than half of the physicians.
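As an illustration of how accuracy figures like those above are derived, the following minimal Python sketch computes accuracy, sensitivity, and specificity from a 2x2 table, and a two-sided exact McNemar p-value for comparing two raters on the same patients. It is a sketch under stated assumptions, not the authors' analysis: the abstract does not name the statistical test used, the function names are hypothetical, and the per-class counts in the example are invented. Only the cohort sizes (76 malignant, 83 benign) come from the abstract.

```python
from math import comb


def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # malignant cases correctly called malignant
        "specificity": tn / (tn + fp),  # benign cases correctly called benign
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value for paired raters.

    b = cases rater A got right and rater B got wrong,
    c = cases rater B got right and rater A got wrong.
    Assumption: the study's actual test is not stated in the abstract.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: row totals match the cohort (76 malignant, 83 benign),
# but the split between correct and incorrect calls is invented.
metrics = diagnostic_metrics(tp=66, fp=11, tn=72, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})

# Hypothetical discordant-pair counts for one model versus one physician.
print(mcnemar_exact(b=18, c=5))
```

With these invented counts, 138 of 159 cases are classified correctly, which rounds to an accuracy of 87%. Many different 2x2 tables give the same accuracy, so sensitivity and specificity cannot be recovered from the accuracy alone.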
Keywords: Large Language Model, Biliary stricture, Cholangiocarcinoma, Prediction model, diagnosis
Received: 17 Apr 2025; Accepted: 18 Jun 2025.
Copyright: © 2025 Kang, Li, Yang, Ren, Zhang, Wang, Liu, Wang, Shang, Hong, Wan, Du, Zeng, Liu, Li, Lou, Luo, Liang, Lv and Pan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Yong Lv, Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
Yanglin Pan, Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.