Performance of large language models in the differential diagnosis of benign and malignant biliary stricture

Kang, Chenxi; Li, Jing; Yang, Xintian; Ren, Gui; Zhang, Linhui; Wang, Wei; Liu, Xin; Wang, Lei; Shang, Guochen; Hong, Jianglong; Wan, Bingnian; Du, Yu; Zeng, Wei; Liu, Yaling; Li, Tongxin; Lou, Lijun; Luo, Hui; Liang, Shuhui; Lv, Yong; Pan, Yanglin

doi:10.3389/fonc.2025.1613818

ORIGINAL RESEARCH article

Front. Oncol., 03 July 2025

Sec. Gastrointestinal Cancers: Hepato Pancreatic Biliary Cancers

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1613818

Chenxi Kang¹

Jing Li¹

Xintian Yang¹

Gui Ren¹

Linhui Zhang¹

Wei Wang²

Xin Liu³

Lei Wang⁴

Guochen Shang⁵

Jianglong Hong⁶

Bingnian Wan⁷

Yu Du⁸

Wei Zeng⁹

Yaling Liu¹

Tongxin Li¹

Lijun Lou¹

Hui Luo¹

Shuhui Liang¹

Yong Lv¹*

Yanglin Pan¹*

¹Xijing Hospital of Digestive Diseases, Air Force Medical University, Xi’an, China
²Department of Gastroenterology, People’s Liberation Army Joint Logistics Support Force 940th Hospital, Lanzhou, Gansu, China
³Department of Gastroenterology, Third People’s Hospital of Gansu Province, Lanzhou, Gansu, China
⁴Department of Gastroenterology, Ankang Traditional Chinese Medicine Hospital, Ankang, China
⁵Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
⁶First Affiliated Hospital of Anhui Medical University, Hefei, Anhui, China
⁷Yantai Ludong Hospital, Shandong Provincial Hospital Group, Yantai, Shandong, China
⁸Department of Gastroenterology, Qinzhou Second People’s Hospital, Qinzhou, China
⁹Xiang’an Hospital, Xiamen University, Xiamen, Fujian, China

Background: Distinguishing benign from malignant biliary strictures remains challenging. Large Language Models (LLMs) show promise in enhancing diagnostic accuracy. This study aimed to evaluate the performance of ten LLMs in the differential diagnosis of benign and malignant biliary strictures.

Methods: Consecutive patients with biliary strictures undergoing endoscopic retrograde cholangiopancreatography (ERCP) at Xijing Hospital between January and December 2024 were retrospectively analyzed. Ten LLMs were systematically prompted with standardized clinical, laboratory, and imaging data. Performance was compared against tumor markers (CA19-9, CEA), a new multivariable clinical model, and ten independent, experienced pancreaticobiliary physicians. Subgroup analyses assessed hilar (n=29) versus non-hilar strictures. Gold-standard diagnosis relied on histopathology and ≥3-month follow-up.

Results: Among the 159 included patients (83 benign, 76 malignant), four LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1), the clinical model (AUC:0.83), and six physicians achieved >80% accuracy. Kimi demonstrated superior accuracy (87%), significantly outperforming 70% of physicians (7/10, p<0.01). Three other LLMs (Deepseek-R1:83%, Claude-3.5S:82%, Llama-3.1:81%) and the clinical model performed comparably to physicians (78-84%, p>0.05), collectively surpassing tumor markers (CA19–9 accuracy:66%, CEA:71%). Physicians demonstrated higher accuracy for hilar strictures (87% vs. 79% for non-hilar, p<0.001). LLMs showed similar performance across stricture locations (hilar:64-95%; non-hilar:62-88%, p>0.05). For hilar strictures, 7/10 physicians achieved significantly higher accuracy (87-90%) than 8/10 LLMs (64-84%, p<0.05).

Conclusions: Using clinical, lab, and imaging data, some LLMs achieved diagnostic accuracy comparable to or exceeding clinical models and experienced physicians for differentiating benign versus malignant strictures. However, for hilar strictures, LLM performance was inferior to over half of the physicians.

Introduction

Biliary strictures, characterized by abnormal bile duct narrowing, can significantly obstruct bile flow. An estimated 54-87% of biliary strictures are malignant (1–5), arising from local or metastatic cancers. Benign biliary strictures have heterogeneous etiologies including surgical bile duct injury, chronic pancreatitis, or chronic cholangiopathies (e.g., primary sclerosing cholangitis) (6). Benign and malignant strictures differ significantly in management and prognosis. Benign strictures are typically managed by endoscopic dilation, stenting, or surgery (7–9), while malignant strictures require aggressive approaches including surgical resection, palliative drainage, and systemic therapy (10–12). Thus, accurately distinguishing between them is crucial for guiding treatment and prognostic assessment.

Characteristics of biliary strictures are typically assessed via brushing cytology, forceps biopsy, or cholangioscopic biopsy during endoscopic retrograde cholangiopancreatography (ERCP) (13). Brush cytology and forceps biopsy demonstrate diagnostic accuracies of 15–80% (14, 15), while cholangioscopic biopsy achieves 70–87% (13, 16). Endoscopic ultrasound-guided (EUS) fine needle aspiration (FNA) or fine needle biopsy (FNB) shows favorable accuracy, particularly for extrinsic mass-related strictures (1). Advanced techniques like intraductal ultrasonography (IDUS) (17), probe-based confocal laser endomicroscopy (pCLE) (18), and optical coherence tomography (OCT) (19), offer enhanced precision but are limited by invasiveness, cost, and availability.

Significant differences also exist in common clinical parameters between benign and malignant strictures, including age of onset, duration of liver function abnormalities, previous surgeries, and tumor markers. These noninvasive, accessible parameters facilitate convenient prediction. Wang et al. used CA50, CA19-9, and AFP to achieve an AUC of 0.879 (95% CI: 0.841–0.917) (20), while Zhang et al. combined MRI with inflammatory markers for an AUC of 0.802 (95% CI: 0.719–0.870) (18). Though promising, these models require further validation.

LLMs including Deepseek-R1, GPT-4T, Claude-3.5S, and Llama-3.1 show potential in improving diagnostic accuracy across medicine (21–24). Trained on vast medical data, LLMs may assist preliminary diagnosis by reducing interpretational variability. However, their utility for differentiating biliary strictures—a specialized, less common condition—remains unknown. We hypothesize that LLMs leveraging common clinical data (manifestations, blood tests, imaging) could aid this differentiation.

In this study, we aimed to evaluate the diagnostic performance of ten distinct LLMs for diagnosing benign versus malignant biliary strictures, comparing performance against tumor markers, a novel clinical model, and ten experienced pancreaticobiliary specialists.

Methods

Study design

This retrospective study evaluated the diagnostic performance of ten distinct large language models (LLMs) in differentiating between benign and malignant biliary strictures. The study protocol was approved by the Ethics Committee of Xijing Hospital. Written informed consent was obtained from all patients or their next of kin.

Patients

Consecutive patients aged ≥18 years admitted to Xijing Hospital for biliary stricture evaluation between January and December 2024 were eligible. Inclusion criteria required a definitive etiological diagnosis confirmed by pathological examination and regular follow-up exceeding 3 months. Patients with incomplete medical records were excluded. Malignancy diagnosis relied on pathological results obtained via endoscopic retrograde cholangiopancreatography (ERCP), percutaneous transhepatic cholangiographic drainage (PTCD), endoscopic ultrasound (EUS), biopsy, or surgery. Benign strictures required confirmation by benign pathology and absence of progression over ≥3 months.

Data collection

Demographic, clinical, imaging, and pathological data were extracted from electronic medical records. An independent physician standardized data inputs for ten LLMs, ensuring unbiased differentiation. Case data was structured uniformly, excluding diagnostic conclusions, and included clinical presentation, history, imaging reports (CT, MRI, MRCP), and lab results (complete blood count, liver and renal function tests, lipids, coagulation, tumor/inflammatory markers). Any indications of benignancy or malignancy from the reports were removed. A flowchart of the overall study design is shown in Figure 1, illustrating the process from patient selection to diagnostic performance evaluation.

Figure 1

Flowchart illustrating the process of evaluating biliary stricture cases using Large Language Models (LLMs), Machine Learning Models (MLMs), and experienced physicians. It includes inclusion and exclusion criteria, preprocessing of cases, and structured steps for LLMs, MLMs, and physicians. The analysis involves diagnostic performance metrics and agreement comparisons between models, measured by Kappa Value. The chart emphasizes standardized input and output collection for consistent analysis.

Figure 1. Flowchart of overall study design. LLM, Large Language Model; MLM, Machine Learning Model.

Large language models for differential diagnosis of biliary strictures

Ten LLMs were selected for evaluation, including 1) mainstream commercial models with proven medical reasoning capabilities in prior studies (GPT-4T, GPT-4o, Gemini-1.5 Pro, Claude-3.5S); 2) models developed by Chinese companies, chosen to align with the Chinese-language clinical data (Kimi, ERNIE-4, Qwen-2, GLM-4); and 3) open-source models (Deepseek-R1, Llama-3.1) (25, 26). To ensure consistency and reproducibility, a structured query approach was used. A case-specific prompt simulated consultation with an experienced pancreaticobiliary specialist: “Supposing you are an experienced physician specializing in pancreaticobiliary diseases, when encountering a case with biliary stricture, please provide a tentative judgment indicating whether the cause of the stricture is more likely to be malignant or benign.” (Detailed prompt provided in Supplementary Method S1). For each LLM, a new conversation session was initiated for every case to eliminate contextual carryover. Probabilistic LLM outputs (e.g., “likely benign,” “possibly malignant”) were converted into binary outcomes (benign/malignant) using predefined rules. Responses indicating malignancy (e.g., “likely malignant,” “probably malignant,” “suggestive of malignancy”) were classified as “malignant.” Responses indicating benignancy (e.g., “likely benign,” “probably benign,” “suggestive of a benign process”) were classified as “benign.” Explicit statements (“benign”/“malignant”) were directly categorized.
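
The conversion of free-text LLM verdicts into binary labels can be illustrated with a minimal Python sketch. It is not the study's actual rule set (that is described only as "predefined rules"); the regular expressions and the function name `to_binary_label` are assumptions made for illustration. Responses that mention both classes, or neither, are returned as `None` so that they can be reviewed manually, and negated phrasing such as "not malignant" is deliberately not handled.

```python
import re
from typing import Optional

# Hypothetical patterns; the study's predefined rules may differ.
MALIGNANT = re.compile(r"\bmalignan(?:t|cy)\b", re.IGNORECASE)
BENIGN = re.compile(r"\bbenign(?:ancy)?\b", re.IGNORECASE)


def to_binary_label(response: str) -> Optional[str]:
    """Map a hedged verdict such as 'likely malignant' to a binary label.

    Returns 'malignant' or 'benign' when exactly one class is mentioned,
    and None when the text is ambiguous and needs manual review.
    """
    says_malignant = bool(MALIGNANT.search(response))
    says_benign = bool(BENIGN.search(response))
    if says_malignant and not says_benign:
        return "malignant"
    if says_benign and not says_malignant:
        return "benign"
    return None  # both or neither class mentioned


for text in ["Likely malignant", "Suggestive of a benign process", "Unclear"]:
    print(text, "->", to_binary_label(text))
```

Returning `None` rather than guessing keeps ambiguous outputs visible instead of silently assigning them to one class.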

All queries employed identical deterministic parameters: temperature=0.0 (output consistency), max tokens=10240, and consistent system prompts. Queries were executed in January 2025 using contemporaneous model versions. The evaluated LLMs included: Deepseek-R1, GPT-4T, GPT-4o, Claude-3.5S, Gemini-1.5 Pro, Kimi, Llama-3.1 405B, ERNIE-4.0-Turbo-8K, Qwen-2-72B, and ChatGLM-4-9B (details in Supplementary Table S1). Chain-of-thought (CoT) outputs were extracted for reasoning pattern analysis. Cases were categorized by diagnostic accuracy (correct/incorrect), and reasoning traces were independently assessed by two gastroenterologists using predefined clinical logic criteria (Supplementary Table S3).

Development of a clinical prediction model

A multivariable logistic regression model incorporating key demographic, clinical, laboratory, and imaging parameters was developed. The modeling process began with univariable screening to identify candidate variables (p < 0.10), followed by multivariable regression analysis. Variables were retained in the multivariable model based on statistical significance (p < 0.05) or, for variables approaching significance, established clinical relevance. Collinearity was assessed using variance inflation factors (VIF) and Spearman correlation coefficients (|ρ| > 0.6), with clinical importance determining variable selection when collinearity was detected. Bidirectional stepwise selection (forward/backward) using Akaike (AIC) and Bayesian (BIC) information criteria optimized the model. Final model selection prioritized minimized AIC/BIC and maximized predictive performance, with internal validation implemented through bootstrapping (1,000 resamples). CA19-9 and CEA were separately assessed as independent diagnostic markers.
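
The bootstrap step can be sketched in Python as follows. This is only one plausible reading of "internal validation through bootstrapping": a percentile bootstrap of the AUC computed on already-fitted predicted probabilities, without refitting the model in each resample. The function names and the example scores are hypothetical, and labels are assumed to be coded 1 (malignant) and 0 (benign).

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_resamples=1000, seed=0):
    """Simple 95% percentile interval for the AUC from case resampling."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            aucs.append(auc([scores[i] for i in idx], ys))
    if not aucs:
        raise ValueError("no usable resamples")
    aucs.sort()
    lower = aucs[int(0.025 * (len(aucs) - 1))]
    upper = aucs[int(0.975 * (len(aucs) - 1))]
    return lower, upper


# Hypothetical predicted probabilities and true labels (1 = malignant).
example_scores = [0.91, 0.78, 0.66, 0.52, 0.47, 0.33, 0.21, 0.12]
example_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(auc(example_scores, example_labels))
print(bootstrap_auc_ci(example_scores, example_labels, n_resamples=200))
```

An optimism-corrected bootstrap, which refits the model on every resample, would be a stricter form of internal validation than this sketch.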

Experienced physicians’ evaluation

Ten experienced pancreaticobiliary specialists (each with ≥10 years of clinical practice) independently evaluated comprehensive clinical summaries identical to those processed by the LLMs. All diagnostic predictions regarding benign or malignant status were made without intercommunication among physicians to ensure independent assessment.

Outcome

The primary outcome was diagnostic accuracy, calculated as the average of sensitivity and specificity to account for class imbalance (range: 0-1; higher values indicate better performance). For LLMs, the highest accuracy from duplicate assessments was selected. Secondary performance metrics included sensitivity (proportion of true positives correctly identified), specificity (proportion of true negatives correctly identified), positive predictive value (PPV, proportion of positive predictions that were correct), negative predictive value (NPV, proportion of negative predictions that were correct), and F1-score (harmonic mean of precision and sensitivity, providing a balanced assessment).
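
As a worked illustration of these definitions, the Python sketch below derives every metric from the four cells of a 2x2 confusion table, treating malignant as the positive class. The function name and the example counts are hypothetical; the counts are chosen only so that they sum to the cohort size of 76 malignant and 83 benign cases and do not correspond to any evaluated model.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute the study's outcome metrics from confusion-table counts.

    tp/fn: malignant cases called malignant/benign.
    tn/fp: benign cases called benign/malignant.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    npv = ratio(tn, tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        # Accuracy as defined above: mean of sensitivity and specificity.
        "accuracy": (sensitivity + specificity) / 2,
        # F1: harmonic mean of precision (PPV) and sensitivity.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts: 76 malignant (66 detected), 83 benign (72 detected).
for name, value in diagnostic_metrics(tp=66, fn=10, tn=72, fp=11).items():
    print(f"{name}: {value:.3f}")
```

Because accuracy is defined here as the mean of sensitivity and specificity, a model that labels every case benign scores 0.5 rather than the benign prevalence.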

Statistical analysis

Continuous variables are presented as mean ± standard deviation (normally distributed) or median [interquartile range] (non-normal distributions), while categorical variables are expressed as proportions (%) with odds ratios (OR) and 95% confidence intervals (CI) for association analyses. McNemar and DeLong tests were used to compare performance metrics between models. Confidence intervals for accuracy, sensitivity, and specificity were calculated using the Clopper-Pearson method, with proportion differences reported in pairwise comparisons. Subgroup analysis evaluated LLM diagnostic performance by stricture location (hilar vs. non-hilar). Internal concordance of LLMs was assessed through weighted Cohen’s kappa (κ) with 95% CI based on duplicate assessments, interpreted as: κ ≤ 0.20 (slight concordance), 0.21-0.40 (fair), 0.41-0.60 (moderate), 0.61-0.80 (substantial), and 0.81-1.00 (almost perfect). Pairwise comparisons of classification accuracy employed McNemar’s test with Holm-Bonferroni correction for multiple comparisons (family-wise α = 0.05; significance threshold: adjusted p < 0.05). All statistical tests were two-sided. Statistical analyses were conducted using R version 4.3.1.
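
The pairwise comparison procedure can be sketched in Python with the standard library alone. The sketch assumes the exact (binomial) form of McNemar's test on per-case correctness, followed by Holm's step-down adjustment; whether the original analysis used the exact or the chi-square form is not stated, and all function names are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count cases where only rater A, or only rater B, was correct."""
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep monotone ordering
        adjusted[idx] = running_max
    return adjusted


# Hypothetical discordant counts for three model-versus-physician pairs.
raw = [mcnemar_exact(10, 2), mcnemar_exact(5, 5), mcnemar_exact(14, 1)]
print(raw)
print(holm_adjust(raw))
```

Only the discordant cases enter the test, so two raters with identical overall accuracy can still differ significantly if their errors fall on different patients.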

Results

Patient demographics and clinical characteristics

During the study period, 270 patients were diagnosed with biliary stricture, of whom 111 were excluded according to inclusion/exclusion criteria, resulting in a final cohort of 159 patients (83 benign, 76 malignant). Baseline demographic and clinical characteristics are summarized in Table 1. Compared with benign stricture patients, those with malignant lesions were older (65.13 vs. 57.19 years), exhibited higher bilirubin (149.14 vs. 78.15 μmol/L, p < 0.001) and CA19–9 levels (2888.33 vs. 246.32 U/mL, p = 0.003), a higher proportion of lymph node enlargement (47.4% vs. 19.3%, p < 0.001), and more frequent hilar strictures (26.3% vs. 10.8%, p = 0.020).

Table 1. Patient demographics, clinical characteristics, and diagnostic markers.

Clinical data-driven diagnostic model performance

A stepwise multivariable logistic regression was employed to develop the diagnostic model, with the final model retaining age, CA19-9, CEA, disease duration <1 month, C-reactive protein (CRP), surgical history, and lymph node enlargement (Table 2). Supplementary Figure S1 illustrates the impact of hyperparameter λ on model accuracy under L2 regularization, with the optimal λ value of 149.3 determined via 10-fold cross-validation to maximize test-set accuracy. The model achieved an AUC of 0.83 (95% CI: 0.70–0.96), accuracy of 0.83, sensitivity of 0.83, and specificity of 0.82 (Table 3, Figure 2), outperforming tumor markers alone. CA19-9 showed an AUC of 0.77 (95% CI: 0.69–0.84), accuracy of 0.66, sensitivity of 0.93, and specificity of 0.39; CEA yielded an AUC of 0.66 (95% CI: 0.58–0.75), accuracy of 0.71, specificity of 0.59, and sensitivity of 0.83. Internal validation via bootstrapping (1000 iterations) confirmed a consistent AUC of 0.83 (Supplementary Figure S2).

Table 2. Selection of variables based on univariate and multivariate logistic regression analysis.

Table 3. Performance metrics of LLMs, EPs, clinical model and tumor markers.

Figure 2

Panel A shows a Receiver Operating Characteristic (ROC) curve comparing a clinical model, CA19-9, and CEA, with AUC values of 0.83, 0.77, and 0.66 respectively. Panel B displays a radar chart for the clinical model, CA19-9, and CEA, illustrating AUC, accuracy, F1 score, sensitivity, and specificity with various colored lines.

Figure 2. ROC curve analysis (A) and radar chart (B) of diagnostic model and tumor markers. CA19-9, Carbohydrate Antigen 199; CEA, Carcinoembryonic Antigen; AUC, Area Under the Curve; CI, Confidence Interval; ROC, Receiver Operating Characteristic.

Comparative diagnostic performance of LLMs, tumor markers, clinical model, and physicians

Four LLMs (40%), the clinical model, and six physicians (60%) achieved accuracies ≥80%. Kimi demonstrated the highest accuracy (87%), significantly outperforming 70% of physicians (7/10, p < 0.01) (Table 3, Figure 3, Supplementary Table S2). Top-performing LLMs including Deepseek-R1 (0.83), Claude-3.5S (0.82), Llama-3.1 (0.81), and GPT-4T (0.79) showed comparable accuracy to physicians (0.78–0.84, p > 0.05), while physicians outperformed lower-performing LLMs (GPT-4o, Gemini-1.5-pro). Eighty percent of LLMs exceeded CA19-9 (0.66) and CEA (0.71) in accuracy, with the clinical model (0.83) competing with top LLMs.

Figure 3

Chart A shows the accuracy of different models, categorized by color: EP (purple), Tumor marker (green), Clinical model (blue), and LLM (orange). Chart B is a matrix illustrating p-values for model comparisons, with varying hues representing different p-value ranges. Both charts contain numerous models like EP, Qwen, and GPT, alongside Tumor marker measures, highlighting statistical significance variations across pairs.

Figure 3. Comparative performance evaluation. (A) Accuracy among Different LLMs, Experienced Physicians, Clinical Model and Tumor Markers. (B) A Comparative Analysis of Diagnostic Accuracy and Significance Testing among models. The p-values (Holm-adjusted) were from the comparison of accuracy between the predictive groups listed along the horizontal axis and those on the vertical axis. A positive p-value indicates that the accuracy of the group on the horizontal axis is statistically greater than that of the group on the vertical axis, whereas a negative p-value signifies the opposite. CA19-9, Carbohydrate Antigen 199; CEA, Carcinoembryonic Antigen; EP, Experienced physician.

CA19–9 exhibited the highest sensitivity (0.93) across 23 predictive groups, significant in 18 groups (p: 0–0.044). GPT-4T, GPT-4o, and Gemini-1.5-pro showed the highest specificity (0.99), significant in 14 groups (p: 0–0.048). Kimi led in F1-score (0.87, significant in 12 groups), GPT-4T in PPV (0.98, 13 groups), and EP1 in NPV (0.85, 12 groups). Top LLMs dominated four metrics (accuracy, specificity, F1-score, PPV), while tumor markers and physicians excelled in sensitivity and NPV, respectively.

Analysis of 360 incorrect diagnoses revealed that over 80% resulted from LLMs over-relying on a single data source (Supplementary Tables S3 and S6). For instance, Claude-3.5S misdiagnosed a benign stricture as malignant based solely on elevated CA19-9, whereas GPT-4T integrated bilirubin trends, imaging, and histology to reach the correct diagnosis (Supplementary Material).

Subgroup analysis by stricture location

Performance metrics for the hilar (n=29; 9 benign, 20 malignant) and non-hilar subgroups are detailed in Figure 4 and Supplementary Table S4. Given the limited number of hilar strictures, the subgroup analysis should be interpreted with caution due to potential overfitting risks. Despite this limitation, physicians showed higher accuracy in hilar versus non-hilar strictures (0.87 vs. 0.79, p < 0.001), while LLMs had comparable accuracy in both subgroups (hilar: 0.64-0.95; non-hilar: 0.62-0.88, p > 0.05). In the hilar subgroup, Deepseek-R1 showed the highest accuracy (0.95, 95% CI: 0.77-0.99), followed by Kimi (0.87) and Claude-3.5S (0.84). Notably, in this exploratory analysis, 7/10 physicians achieved superior accuracy over 8/10 LLMs (87-90% vs. 64-84%, p<0.05). In the non-hilar subgroup, LLMs showed accuracy comparable to or exceeding that of physicians. These findings are preliminary and require validation in larger cohorts.

Figure 4

Bar chart showing accuracy comparisons between Hilar and Non-hilar subgroups across various models. Each bar represents a model, with yellow for Hilar and turquoise for Non-hilar. Error bars indicate variability. Models include GPT-4T, Claude-3.5S, Qwen-2, and others, with accuracy ranging from 0.25 to 1.00.

Figure 4. Accuracy and 95% CIs across subgroups for different prediction models. Error bars represent 95% confidence intervals. Hilar strictures (n=29: 9 benign, 20 malignant), non-hilar strictures (n=130: 74 benign, 56 malignant). CA19-9, Carbohydrate Antigen 199; CEA, Carcinoembryonic Antigen; EP, Experienced physician.

LLM diagnostic concordance assessment

Deepseek-R1, GPT-4T, GPT-4o, Llama-3.1, Gemini-1.5-pro, and Claude-3.5S showed near-perfect internal concordance (κ=0.81-0.97). Kimi, ERNIE-4, and GLM-4 demonstrated substantial concordance, while Qwen-2 showed fair concordance. When compared to true values, no model reached almost perfect concordance: Deepseek-R1, Llama-3.1, Claude-3.5S, and Kimi showed substantial concordance; ERNIE-4, GLM-4, Qwen-2 moderate; Gemini-1.5 pro and GPT-4o fair (Supplementary Table S6, Supplementary Figure S3).
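
The concordance measure can be illustrated with a short Python sketch of Cohen's kappa for two runs of the same model, together with the interpretation bands given in the statistical analysis section. With only two categories (benign/malignant), weighted and unweighted kappa coincide, so the unweighted form is shown. The function names and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(run_1, run_2):
    """Cohen's kappa between two label sequences of equal length."""
    if len(run_1) != len(run_2) or not run_1:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run_1)
    observed = sum(a == b for a, b in zip(run_1, run_2)) / n
    freq_1, freq_2 = Counter(run_1), Counter(run_2)
    # Chance agreement from the marginal label frequencies of each run.
    expected = sum(freq_1[k] * freq_2[k] for k in freq_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs constant and identical
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Concordance bands as listed in the statistical analysis section."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical duplicate assessments of four cases by one model.
first = ["malignant", "malignant", "benign", "benign"]
second = ["malignant", "benign", "benign", "benign"]
k = cohen_kappa(first, second)
print(k, interpret_kappa(k))
```

The same function applies when the second sequence is the reference diagnosis instead of a repeat run, which corresponds to the comparison against true values reported above.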

Discussion

The differentiation between benign and malignant biliary strictures remains clinically challenging. Current endoscopic techniques show limited sensitivity: ERCP-based brushing/biopsy (0.45-0.67) (27–29) and cholangioscopy biopsy (0.43-0.74) (2, 30) often prove inadequate. Our study demonstrates that select large language models (LLMs) achieve diagnostic accuracy rivaling or exceeding human expertise. Kimi attained the highest accuracy (87%), significantly outperforming 70% of experienced physicians (7/10, p<0.01). Three additional LLMs (Deepseek-R1:83%, Claude-3.5S:82%, Llama-3.1:81%) and our clinical prediction model (83%) performed comparably to physicians (78-84%, p>0.05). Collectively, 80% of LLMs surpassed conventional tumor markers (CA19–9 accuracy:66%; CEA:71%).

To our knowledge, this is the first study to demonstrate that select large language models (LLMs) match or exceed the accuracy of a clinical model and experienced physicians. These findings suggest LLMs could serve as accessible, real-time diagnostic aids, particularly in resource-constrained settings where specialist expertise is limited. The variation in performance across LLMs is itself clinically informative. While GPT-4o and Gemini-1.5-Pro lead general benchmarking tasks (31), these models underperformed in our specific diagnostic application (accuracies 0.66 and 0.62, respectively). These underperforming models exhibited extreme specificity (0.99) and PPV (0.95-0.97) but critically low sensitivity (0.25-0.34), likely reflecting excessive safety prioritization during training. This pattern necessitates caution when using such models for biliary stricture assessment. Conversely, Kimi’s superior performance (87%) highlights how task-specific optimization can yield exceptional diagnostic capability irrespective of general benchmarking performance.

Our subgroup analysis revealed physicians outperformed LLMs in the evaluation of hilar strictures (n=29), with 7/10 physicians achieving significantly higher accuracy than 8/10 LLMs (87-90% vs. 64-84%, p<0.05). This finding aligns with established clinical knowledge that >90% of hilar strictures are malignant (1), suggesting experienced clinicians better integrate this epidemiological context. The performance gap may indicate incomplete learning of clinical nuances by current LLMs. However, targeted fine-tuning with medical knowledge or Retrieval-Augmented Generation (RAG) (32–34) could potentially bridge this gap in future iterations. Clinically, this underscores the continued value of expert judgment in anatomically complex presentations, while suggesting LLMs may currently serve best as diagnostic aids for non-hilar strictures where they demonstrated parity with physicians. However, owing to the small sample size of the hilar subgroup, this analysis is exploratory and requires validation.

Our clinical prediction model, incorporating established risk factors (age, CA19-9, CEA, disease duration <1 month, CRP, surgical history, lymphadenopathy), achieved an AUC of 0.83 (95% CI:0.70-0.96), aligning with prior reports (AUC 0.75-0.83) (35–38) while maintaining practical clinical utility. CA19–9 demonstrated an expected AUC (0.77) matching literature reports (0.759-0.783) (35, 38), but presented a diagnostic paradox with high sensitivity (0.93) yet poor specificity (0.39). This suggests potential utility as a rule-out screening tool requiring subsequent confirmation, while CEA demonstrated weaker discriminative capacity (AUC 0.66) than some prior studies (39–42), emphasizing context-dependent variability.

Despite promising diagnostic capabilities, clinical implementation of LLMs faces significant barriers requiring strategic resolution. Error analysis demonstrated that >80% of misdiagnoses originated from LLMs over-relying on isolated data elements rather than multimodal integration. Representative examples included Claude-3.5S misclassifying a benign stricture as malignant based solely on elevated CA19-9, contrasting with GPT-4T’s accurate diagnosis achieved through synthesizing bilirubin trends, imaging findings, and histology. This significant limitation persists despite recent demonstrations of LLMs outperforming physicians in controlled diagnostic settings (43), underscoring a critical challenge in translating artificial intelligence capabilities to clinical practice where multimodal reasoning is essential. Text-based implementation currently constrains LLMs, but emerging multimodal capabilities in radiology (44–46) and dermatology (47) suggest promising diagnostic extensions. Future integration of CT/MRCP imaging could substantially enhance biliary stricture evaluation, though diagnosing rare conditions requires specialized training approaches. To address these constraints, we propose a structured workflow comprising: 1) electronic medical record-integrated real-time malignancy probability scoring, and 2) automatic referral to targeted multidisciplinary review for cases with LLM confidence scores below 80%. This structure preserves physician oversight while optimizing diagnostic efficiency, particularly valuable in resource-limited settings.

However, clinical deployment of LLMs demands addressing several critical ethical considerations: 1) Accountability through legal frameworks addressing liability for diagnostic errors; 2) Hallucination mitigation requiring detection protocols (evidenced in 12% of erroneous outputs); 3) Patient acceptance considerations, with survey data showing 67% rejection of AI-exclusive diagnoses for cancer-related decisions; 4) Equity concerns including documented performance disparities in elderly populations; 5) Transparency requirements for interpretable decision pathways; and 6) Privacy mandates demanding robust data anonymization. Essential mitigation strategies include human-AI collaborative diagnostic models, algorithmic bias correction techniques, and targeted patient education initiatives clarifying LLMs’ assistive role.
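
The two-step workflow proposed earlier in this discussion (malignancy probability scoring, with referral to multidisciplinary review when LLM confidence falls below 80%) can be sketched as a simple routing rule in Python. How an LLM confidence score would be obtained is not specified in the text, so both inputs are assumed here to be values between 0 and 1, and the function name and route labels are hypothetical.

```python
def route_case(malignancy_probability, confidence, review_threshold=0.80):
    """Route a scored case according to the proposed confidence threshold.

    Both inputs are assumed to lie in [0, 1]. Cases below the threshold
    go to multidisciplinary review; all others are reported to the
    treating physician, who retains the final decision.
    """
    for value in (malignancy_probability, confidence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    if confidence < review_threshold:
        return "multidisciplinary_review"
    return "report_to_physician"


print(route_case(malignancy_probability=0.72, confidence=0.65))
print(route_case(malignancy_probability=0.91, confidence=0.93))
```

Routing depends only on confidence, so a high malignancy probability with low confidence is still referred, which matches the stated aim of preserving physician oversight.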

Our study has several limitations that warrant consideration. First, the retrospective single-center design introduces potential selection bias. Second, the modest sample size limits statistical power to address heterogeneity in biliary stricture presentations, potentially restricting generalizability across diverse healthcare settings. Third, the small hilar stricture subgroup (n = 29) limits statistical power for physician-LLM comparisons in this anatomically complex subset. Fourth, version-specific LLM evaluation restricts generalizability to updated iterations. Fifth, variability in physician experience levels may impact human performance benchmarks. Sixth, the absence of external validation constrains generalizability.

Conclusion

In conclusion, this study demonstrates select LLMs (Kimi, Deepseek-R1, Claude-3.5S, Llama-3.1) achieve diagnostic accuracy comparable to or exceeding clinical models and physicians for biliary strictures, though hilar cases remain challenging. Their optimal implementation involves augmenting clinical judgment rather than replacing it, especially valuable for non-hilar strictures where performance matched physicians.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Ethics statement

The studies involving humans were approved by the Ethics Committee of Xijing Hospital. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Author contributions

CK: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft. JL: Data curation, Formal analysis, Visualization, Writing – review & editing, Resources. XY: Data curation, Formal analysis, Visualization, Writing – review & editing, Software. GR: Investigation, Validation, Writing – review & editing. LZ: Investigation, Validation, Writing – review & editing. WW: Investigation, Writing – review & editing. XL: Investigation, Writing – review & editing. LW: Investigation, Writing – review & editing. GS: Investigation, Writing – review & editing. JH: Investigation, Writing – review & editing. BW: Investigation, Writing – review & editing. YD: Investigation, Writing – review & editing. WZ: Investigation, Writing – review & editing. YLL: Methodology, Writing – review & editing. TL: Validation, Writing – review & editing. LL: Validation, Writing – review & editing. HL: Investigation, Writing – review & editing. SL: Investigation, Writing – review & editing. YL: Conceptualization, Methodology, Supervision, Writing – review & editing. YP: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This study was supported by the National Key R&D Program of China (2022YFC2505100) and grants from the National Natural Science Foundation of China (82373117 and 82370619).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2025.1613818/full#supplementary-material

References

1. Elmunzer BJ, Maranki JL, Gómez V, Tavakkoli A, Sauer BG, Limketkai BN, et al. ACG clinical guideline: diagnosis and management of biliary strictures. Am J Gastroenterol. (2023) 118:405–26. doi: 10.14309/ajg.0000000000002190

2. Wen L-J, Chen J-H, Xu H-J, Yu Q, and Liu K. Efficacy and safety of digital single-operator cholangioscopy in the diagnosis of indeterminate biliary strictures by targeted biopsies: A systematic review and meta-analysis. Diagn Basel Switz. (2020) 10:666. doi: 10.3390/diagnostics10090666

3. de Moura DTH, Ryou M, de Moura EGH, Ribeiro IB, Bernardo WM, and Thompson CC. Endoscopic ultrasound-guided fine needle aspiration and endoscopic retrograde cholangiopancreatography-based tissue sampling in suspected malignant biliary strictures: A meta-analysis of same-session procedures. Clin Endosc. (2020) 53:417–28. doi: 10.5946/ce.2019.053

4. De Moura DTH, Moura EGHD, Bernardo WM, De Moura ETH, Baraca FI, Kondo A, et al. Endoscopic retrograde cholangiopancreatography versus endoscopic ultrasound for tissue diagnosis of malignant biliary stricture: Systematic review and meta-analysis. Endosc Ultrasound. (2018) 7:10–9. doi: 10.4103/2303-9027.193597

5. Tummala P, Munigala S, Eloubeidi MA, and Agarwal B. Patients with obstructive jaundice and biliary stricture ± mass lesion on imaging: prevalence of malignancy and potential role of EUS-FNA. J Clin Gastroenterol. (2013) 47:532–7. doi: 10.1097/MCG.0b013e3182745d9f

6. Hu B, Sun B, Cai Q, Wong Lau JY, Ma S, Itoi T, et al. Asia-Pacific consensus guidelines for endoscopic management of benign biliary strictures. Gastrointest Endosc. (2017) 86:44–58. doi: 10.1016/j.gie.2017.02.031

7. Ramchandani M, Lakhtakia S, Costamagna G, Tringali A, Püspöek A, Tribl B, et al. Fully covered self-expanding metal stent vs multiple plastic stents to treat benign biliary strictures secondary to chronic pancreatitis: A multicenter randomized trial. Gastroenterology. (2021) 161:185–95. doi: 10.1053/j.gastro.2021.03.015

8. Sato T, Kogure H, Nakai Y, Ishigaki K, Hakuta R, Saito K, et al. A prospective study of fully covered metal stents for different types of refractory benign biliary strictures. Endoscopy. (2020) 52:368–76. doi: 10.1055/a-1111-8666

9. Ponsioen CY, Arnelo U, Bergquist A, Rauws EA, Paulsen V, Cantú P, et al. No superiority of stents vs balloon dilatation for dominant strictures in patients with primary sclerosing cholangitis. Gastroenterology. (2018) 155:752–759.e5. doi: 10.1053/j.gastro.2018.05.034

10. Franssen S, Van Driel LMJW, Moelker A, and Groot Koerkamp B. Primary percutaneous stenting above the ampulla for palliative biliary drainage of malignant hilar biliary obstruction. J Clin Oncol. (2023) 41:527–7. doi: 10.1200/JCO.2023.41.4_suppl.527

11. Jang S, Stevens T, Parsi MA, Bhatt A, Kichler A, and Vargo JJ. Superiority of self-expandable metallic stents over plastic stents in treatment of malignant distal biliary strictures. Clin Gastroenterol Hepatol. (2022) 20:e182–95. doi: 10.1016/j.cgh.2020.12.020

12. Wang AY and Yachimski PS. Endoscopic management of pancreatobiliary neoplasms. Gastroenterology. (2018) 154:1947–63. doi: 10.1053/j.gastro.2017.11.295

13. Wallace MB, Wang KK, Adler DG, and Rastogi A. Recent advances in endoscopy. Gastroenterology. (2017) 153:364–81. doi: 10.1053/j.gastro.2017.06.014

14. Khamaysi I, Firman R, Martin P, Vasilyev G, Boyko E, and Zussman E. Mechanical perspective on increasing brush cytology yield. ACS Biomater Sci Eng. (2024) 10:1743–52. doi: 10.1021/acsbiomaterials.3c00935

15. Wang J, Xia M, Jin Y, Zheng H, Shen Z, Dai W, et al. More endoscopy-based brushing passes improve the detection of malignant biliary strictures: A multicenter randomized controlled trial. Am J Gastroenterol. (2022) 117:733–9. doi: 10.14309/ajg.0000000000001666

16. Bang JY, Navaneethan U, Hasan M, Sutton B, Hawes R, and Varadarajulu S. Optimizing outcomes of single-operator cholangioscopy-guided biopsies based on a randomized trial. Clin Gastroenterol Hepatol. (2020) 18:441–448.e1. doi: 10.1016/j.cgh.2019.07.035

17. Chen L, Lu Y, and Kulkarni P. Tu1008 A meta-analysis of the value of intraductal ultrasound (IDUS) in differentiating malignant from benign biliary strictures. Gastroenterology. (2020) 158:S-1003–S-1004. doi: 10.1016/S0016-5085(20)33183-8

18. Han S, Kahaleh M, Sharaiha RZ, Tarnasky PR, Kedia P, Slivka A, et al. Probe-based confocal laser endomicroscopy in the evaluation of dominant strictures in patients with primary sclerosing cholangitis: results of a U.S. multicenter prospective trial. Gastrointest Endosc. (2021) 94:569–576.e1. doi: 10.1016/j.gie.2021.03.027

19. Joshi V, Patel SN, Vanderveldt H, Oliva I, Raijman I, Molina C, et al. Mo1963 A pilot study of safety and efficacy of directed cannulation with a low profile catheter (LP) and imaging characteristics of bile duct wall using optical coherance tomography (OCT) for indeterminate biliary strictures initial report on in-vivo evaluation during ERCP. Gastrointest Endosc. (2017) 85:AB496–7. doi: 10.1016/j.gie.2017.03.1150

20. Wang Y, Wang W, Zhang S, Cai W, Song R, Mei T, et al. Diagnostic value of carbohydrate antigen 50 in biliary tract cancer: a large-scale multicenter study. Cancer Med. (2024) 13:e7388. doi: 10.1002/cam4.7388

21. Levine DM, Tuwani R, Kompa B, Varma A, Finlayson SG, Mehrotra A, et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study. Lancet Digit Health. (2024) 6:e555–61. doi: 10.1016/S2589-7500(24)00097-9

22. Peng L, Cai S, Wu Z, Shang H, Zhu X, and Li X. MMGPL: multimodal medical data analysis with graph prompt learning. Med Image Anal. (2024) 97:103225. doi: 10.1016/j.media.2024.103225

23. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, and Ting DSW. Large language models in medicine. Nat Med. (2023) 29:1930–40. doi: 10.1038/s41591-023-02448-8

24. Santoki A, Jones R, Peter M, and Mathew A. Implementing large language model-based artificial intelligence (AI) technology in proposing effective treatment plans in patients with cancer. J Clin Oncol. (2024) 42:e13660–0. doi: 10.1200/JCO.2024.42.16_suppl.e13660

25. Qiu P, Wu C, Zhang X, Lin W, Wang H, Zhang Y, et al. Towards building multilingual language model for medicine. Nat Commun. (2024) 15:8384. doi: 10.1038/s41467-024-52417-z

26. Zhou S, Luo X, Chen C, Jiang H, Yang C, Ran G, et al. The performance of large language model powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg. (2024) 110(10):6509–17. doi: 10.1097/JS9.0000000000001850

27. Yoon SB, Moon S-H, Ko SW, Lim H, Kang HS, and Kim JH. Brush cytology, forceps biopsy, or endoscopic ultrasound-guided sampling for diagnosis of bile duct cancer: A meta-analysis. Dig Dis Sci. (2022) 67:3284–97. doi: 10.1007/s10620-021-07138-4

28. Jeon TY, Choi MH, Yoon SB, Soh JS, and Moon S-H. Systematic review and meta-analysis of percutaneous transluminal forceps biopsy for diagnosing malignant biliary strictures. Eur Radiol. (2022) 32:1747–56. doi: 10.1007/s00330-021-08301-1

29. Navaneethan U, Njei B, Lourdusamy V, Konjeti R, Vargo JJ, and Parsi MA. Comparative effectiveness of biliary brush cytology and intraductal biopsy for detection of malignant biliary strictures: a systematic review and meta-analysis. Gastrointest Endosc. (2015) 81:168–76. doi: 10.1016/j.gie.2014.09.017

30. Sun X, Zhou Z, Tian J, Wang Z, Huang Q, Fan K, et al. Is single-operator peroral cholangioscopy a useful tool for the diagnosis of indeterminate biliary lesion? A systematic review and meta-analysis. Gastrointest Endosc. (2015) 82:79–87. doi: 10.1016/j.gie.2014.12.021

31. Chiang W-L, Zheng L, Sheng Y, Angelopoulos AN, Li T, Li D, et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024). Available online at: http://arxiv.org/abs/2403.04132 (Accessed August 26, 2024).

32. Tran H, Yang Z, Yao Z, and Yu H. BioInstruct: instruction tuning of large language models for biomedical natural language processing. J Am Med Inform Assoc. (2024) 31:1821–32. doi: 10.1093/jamia/ocae122

33. Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, et al. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatol Baltim Md. (2024) 80(5):1158–68. doi: 10.1097/HEP.0000000000000834

34. Du D, Zhong F, and Liu L. Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models. J Transl Med. (2024) 22:756. doi: 10.1186/s12967-024-05567-z

35. Wen N, Peng D, Xiong X, Liu G, Nie G, Wang Y, et al. Cholangiocarcinoma combined with biliary obstruction: an exosomal circRNA signature for diagnosis and early recurrence monitoring. Signal Transduct Target Ther. (2024) 9:107. doi: 10.1038/s41392-024-01814-3

36. Liang B, Zhong L, He Q, Wang S, Pan Z, Wang T, et al. Diagnostic accuracy of serum CA19–9 in patients with cholangiocarcinoma: A systematic review and meta-analysis. Med Sci Monit. (2015) 21:3555–63. doi: 10.12659/MSM.895040

37. Liu DSK, Prado MM, Giovannetti E, Jiao LR, Krell J, and Frampton AE. MOY 2 microRNAs as bile based biomarkers for pancreaticobiliary cancers (MIRABILE). Br J Surg. (2023) 110:znad241.005. doi: 10.1093/bjs/znad241.005

38. Gao L, Lin Y, Yue P, Li S, Zhang Y, Mi N, et al. Identification of a novel bile marker clusterin and a public online prediction platform based on deep learning for cholangiocarcinoma. BMC Med. (2023) 21:294. doi: 10.1186/s12916-023-02990-9

39. Loosen SH, Roderburg C, Kauertz KL, Koch A, Vucur M, Schneider AT, et al. CEA but not CA19–9 is an independent prognostic factor in patients undergoing resection of cholangiocarcinoma. Sci Rep. (2017) 7:16975. doi: 10.1038/s41598-017-17175-7

40. Li Y, Li D-J, Chen J, Liu W, Li J-W, Jiang P, et al. Application of joint detection of AFP, CA19-9, CA125 and CEA in identification and diagnosis of cholangiocarcinoma. Asian Pac J Cancer Prev. (2015) 16:3451–5. doi: 10.7314/APJCP.2015.16.8.3451

41. Tuzun Ince A, Yildiz K, Baysal B, Danalioglu A, Kocaman O, Tozlu M, et al. Roles of serum and biliary CEA, CA19-9, VEGFR3, and TAC in differentiating between malignant and benign biliary obstructions. Turk J Gastroenterol. (2014) 25:162–9. doi: 10.5152/tjg.2014.6056

42. Macias RIR, Kornek M, Rodrigues PM, Paiva NA, Castro RE, Urban S, et al. Diagnostic and prognostic biomarkers in cholangiocarcinoma. Liver Int. (2019) 39:108–22. doi: 10.1111/liv.14090

43. Brodeur PG, Buckley TA, Kanjee Z, Goh E, Ling EB, Jain P, et al. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv (2024). doi: 10.48550/ARXIV.2412.10849

44. Brin D, Sorin V, Barash Y, Konen E, Glicksberg BS, Nadkarni GN, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. (2025) 35(4):1959–65. doi: 10.1007/s00330-024-11035-5

45. Kaczmarczyk R, Wilhelm TI, Martin R, and Roos J. Evaluating multimodal AI in medical diagnostics. NPJ Digit Med. (2024) 7:205. doi: 10.1038/s41746-024-01208-3

46. Beşler MS. The accuracy of the multimodal large language model GPT-4 on sample questions from the interventional radiology board examination. Acad Radiol. (2024) 31:3476. doi: 10.1016/j.acra.2024.03.023

47. Zhou J, He X, Sun L, Xu J, Chen X, Chu Y, et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun. (2024) 15:5649. doi: 10.1038/s41467-024-50043-3

Keywords: large language model, biliary stricture, cholangiocarcinoma, prediction model, diagnosis

Citation: Kang C, Li J, Yang X, Ren G, Zhang L, Wang W, Liu X, Wang L, Shang G, Hong J, Wan B, Du Y, Zeng W, Liu Y, Li T, Lou L, Luo H, Liang S, Lv Y and Pan Y (2025) Performance of large language models in the differential diagnosis of benign and malignant biliary stricture. Front. Oncol. 15:1613818. doi: 10.3389/fonc.2025.1613818

Received: 17 April 2025; Accepted: 18 June 2025;
Published: 03 July 2025.

Edited by:

Xin-Rong Yang, Fudan University, China

Reviewed by:

Lei Xu, The First Affiliated Hospital of Xi’an Jiaotong University, China
Balu Bhasuran, Florida State University, United States

Copyright © 2025 Kang, Li, Yang, Ren, Zhang, Wang, Liu, Wang, Shang, Hong, Wan, Du, Zeng, Liu, Li, Lou, Luo, Liang, Lv and Pan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yong Lv, lvyongxj@yeah.net; Yanglin Pan, yanglinpan@hotmail.com
