ORIGINAL RESEARCH article

Front. Oncol.

Sec. Gastrointestinal Cancers: Hepato Pancreatic Biliary Cancers

Volume 15 - 2025 | doi: 10.3389/fonc.2025.1613818

Performance of Large Language Models in the Differential Diagnosis of Benign and Malignant Biliary Stricture

Provisionally accepted
Chenxi Kang1, Jing Li1, Xintian Yang1, Gui Ren1, Linhui Zhang1, Wei Wang2, Xin Liu3, Lei Wang4, Guochen Shang5, Jianglong Hong6, Bingnian Wan7, Yu Du8, Wei Zeng9, Yaling Liu1, Tongxin Li1, Lijun Lou1, Hui Luo1, Shuhui Liang1, Yong Lv1*, Yanglin Pan1*
  • 1Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
  • 2People's Liberation Army Joint Logistics Support Force 940th Hospital, Lanzhou, Gansu Province, China
  • 3Third People's Hospital of Gansu Province, Lanzhou, Gansu Province, China
  • 4Ankang Traditional Chinese Medicine Hospital, Ankang, China
  • 5Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei Province, China
  • 6First Affiliated Hospital of Anhui Medical University, Hefei, Anhui Province, China
  • 7Yantai Ludong Hospital, Shandong Provincial Hospital Group, Yantai, Shandong Province, China
  • 8Qinzhou Second People's Hospital, Qinzhou, China
  • 9Xiang'an Hospital, Xiamen University, Xiamen, Fujian Province, China

The final, formatted version of the article will be published soon.

Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performance of ten LLMs in the differential diagnosis of benign and malignant biliary strictures.

Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Their performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent physicians experienced in pancreaticobiliary disease. Subgroup analyses assessed hilar (n=29) versus non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up.

Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi achieved the highest accuracy among the LLMs (87%), significantly outperforming 7 of the 10 physicians (p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19-9 accuracy: 66%, CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures than for non-hilar strictures (87% vs. 79%, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05).

Using clinical, laboratory, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding that of clinical models and experienced physicians in differentiating benign from malignant biliary strictures. However, for hilar strictures, LLM performance was inferior to that of over half of the physicians.

Keywords: Large Language Model, Biliary stricture, Cholangiocarcinoma, Prediction model, diagnosis

Received: 17 Apr 2025; Accepted: 18 Jun 2025.

Copyright: © 2025 Kang, Li, Yang, Ren, Zhang, Wang, Liu, Wang, Shang, Hong, Wan, Du, Zeng, Liu, Li, Lou, Luo, Liang, Lv and Pan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Yong Lv, Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
Yanglin Pan, Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi'an, China
