AUTHOR=Kang Chenxi, Li Jing, Yang Xintian, Ren Gui, Zhang Linhui, Wang Wei, Liu Xin, Wang Lei, Shang Guochen, Hong Jianglong, Wan Bingnian, Du Yu, Zeng Wei, Liu Yaling, Li Tongxin, Lou Lijun, Luo Hui, Liang Shuhui, Lv Yong, Pan Yanglin

TITLE=Performance of large language models in the differential diagnosis of benign and malignant biliary stricture

JOURNAL=Frontiers in Oncology

VOLUME=15

YEAR=2025

URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2025.1613818

DOI=10.3389/fonc.2025.1613818

ISSN=2234-943X

ABSTRACT=Background: Distinguishing benign from malignant biliary strictures remains challenging. Large language models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performance of ten LLMs in the differential diagnosis of benign and malignant biliary strictures. Methods: Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses assessed hilar (n=29) versus non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up. Results: Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC: 0.83), and six physicians achieved >80% accuracy. Kimi demonstrated superior accuracy (87%), significantly outperforming 70% of physicians (7/10, p<0.01). Three other LLMs (Deepseek-R1: 83%, Claude-3.5S: 82%, Llama-3.1: 81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19-9 accuracy: 66%, CEA: 71%). Physicians demonstrated higher accuracy for hilar strictures (87% vs. 79% for non-hilar, p<0.001). LLMs showed similar performance across stricture locations (hilar: 64-95%; non-hilar: 62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05). Conclusions: Using clinical, laboratory, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding clinical models and experienced physicians for differentiating benign versus malignant strictures. However, for hilar strictures, LLM performance was inferior to over half of the physicians.