- ¹Biomedical Information Center, Beijing You’An Hospital, Capital Medical University, Beijing, China
- ²Interventional Therapy Center for Oncology, Beijing You’An Hospital, Capital Medical University, Beijing, China
- ³Institute of Clinical Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- ⁴Graduate School of Dalian Medical University, Dalian Medical University, Dalian, China
- ⁵Department of Infectious Diseases, Beijing You’An Hospital, Capital Medical University, Beijing, China
Objective: This study aimed to evaluate and compare the performance of large language models (LLMs) and traditional Cox regression models in predicting the long-term progression risk in patients with intermediate to advanced hepatocellular carcinoma (HCC).
Methods: A total of 576 patients with intermediate to advanced HCC were included, comprising a training cohort (n = 403) and a validation cohort (n = 173) for model development and validation. We evaluated the predictive performance of LLMs (DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B) and the traditional Cox regression model for estimating the progression risk of HCC at 12, 24, and 36 months. Time-dependent area under the curve (AUC), decision curve analysis, calibration curve, net reclassification improvement, and integrated discrimination improvement were used to comprehensively assess model performance.
Results: Based on transarterial chemoembolization combined with targeted therapy, the addition of immune checkpoint inhibitors (ICIs) and/or ablation prolonged the progression-free survival (PFS): the combination of all four treatments showed the best outcome (median PFS = 12.3 months, 95%CI = 9.9–14.1). Univariate and multivariate Cox analyses identified independent prognostic factors, which were utilized to develop a progression risk nomogram. The model had good discrimination, with training cohort AUCs (at 12, 24, and 36 months) of 0.72 (95%CI = 0.67–0.78), 0.77 (95%CI = 0.69–0.86), and 0.96 (95%CI = 0.93–0.99), respectively, and validation cohort AUCs of 0.75 (95%CI = 0.67–0.83), 0.81 (95%CI = 0.71–0.91), and 0.97 (95%CI = 0.94–1.0), respectively. Three LLMs were evaluated on the same dataset. Except for DeepSeek R1 at 12 and 24 months (training cohort), all LLMs underperformed the Cox model across time points, indicating current limitations in predicting long-term progression risk.
Conclusion: The combination of ablation and/or ICIs with standard treatment could prolong PFS. In predicting the long-term HCC progression risk, the traditional Cox model outperformed the LLMs. Combining the two approaches may merge the stability of structured modeling with the multi-source data processing capacity of LLMs, potentially improving prediction accuracy.
Introduction
Hepatocellular carcinoma (HCC) is one of the leading causes of morbidity and mortality worldwide (1). For intermediate to advanced HCC, transarterial chemoembolization (TACE) combined with targeted therapy is an important treatment strategy (2, 3). In recent years, immune checkpoint inhibitors (ICIs) have become a standard type of systemic therapy and have been widely used in clinical practice (4, 5). Ablation therapy has been increasingly explored in recent years for use in intermediate to advanced patients, particularly in combination with systemic treatments (5, 6). These two therapies have contributed to the development of combination therapies (7, 8). However, the variety of treatment options has made it difficult to assess prognosis on an individual basis. The Cox proportional hazards model has been extensively utilized in studies concerning HCC prognosis due to its compatibility with structured clinical variables, stable statistical properties, and strong interpretability. This model has been incorporated into clinical guidelines for risk stratification and treatment planning.
At the same time, large language models (LLMs) have shown significant potential in extracting features from unstructured data such as clinical records, imaging reports, and histopathological data (9–11). These models have demonstrated competitive performance in various diagnostic and short-term risk prediction studies (12, 13). However, the applicability of these LLMs in long-term prediction modeling remains understudied, especially in real-world settings with diverse treatment regimens and complex temporal dynamics.
In this study, we first developed a Cox model to predict the progression-free survival (PFS) in patients treated with TACE combined with targeted therapy and further stratified based on the use of ICIs and/or local ablation therapy. We then evaluated the performance of three LLMs (DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B) on the same task, comparing them in terms of discriminative ability, clinical net benefit, and risk reclassification. Our research objective was to evaluate and compare the performance of LLMs and traditional Cox regression models in predicting the long-term progression risk in patients with intermediate to advanced HCC.
Materials and methods
Patients
The study included 576 patients aged 18–75 years who were admitted to Beijing You’An Hospital, Capital Medical University, from January 2020 to January 2023. HCC was diagnosed according to the pathological findings or clinical–radiological features in line with the criteria of the China Liver Cancer Staging (14). Unresectability of the tumor was determined by multidisciplinary team discussion. The inclusion and exclusion criteria are described in Supplementary Method 1. The demographic and clinicopathological data collected within 7 days before ablation are described in Supplementary Method 2.
Data processing
The identical structured dataset, which was stored in CSV format and contained the same patient characteristics and outcomes, served as the basis for both the Cox model and the LLMs. The Cox model processed the data directly as a numerical matrix; for the LLM-based inference, the same CSV tables were converted into JSON Lines format. When constructing the Cox model, continuous variables [i.e., age, tumor size, and prothrombin time activity (PTA)] were discretized according to the optimal cutoff values derived from the surv_cutpoint function. Variables such as age and tumor size were not standardized as their measurement units are clinically meaningful and the scale differences among them remained within an acceptable range. For variables exhibiting wide numerical ranges [e.g., alpha-fetoprotein (AFP)], a logarithmic transformation was applied to approximate a normal distribution.
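The preprocessing described above can be illustrated with a minimal sketch. It assumes hypothetical column names (patient_id, age, pta, afp), uses the age and PTA cutoffs reported for the final model (46 years and 67%), and assumes a natural-log transform of AFP + 1; the exact transform and field names used in the study are not stated, so these are illustrative only.

```python
import csv
import io
import json
import math

# Cutoffs for age and PTA follow the values reported for the final model.
# Column names and the log(AFP + 1) form are illustrative assumptions.
AGE_CUTOFF_YEARS = 46
PTA_CUTOFF_PERCENT = 67


def preprocess_row(row):
    """Discretize age and PTA and log-transform AFP for one patient record."""
    age = float(row["age"])
    pta = float(row["pta"])
    afp = float(row["afp"])
    return {
        "patient_id": row["patient_id"],
        "age_gt_46": int(age > AGE_CUTOFF_YEARS),
        "pta_gt_67": int(pta > PTA_CUTOFF_PERCENT),
        "log_afp": round(math.log(afp + 1.0), 4),
    }


def csv_to_jsonl(csv_text):
    """Convert CSV text into JSON Lines text, one JSON object per patient."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(preprocess_row(r)) for r in reader)


# Two made-up patients, for illustration only.
example_csv = "patient_id,age,pta,afp\nP001,58,72,400\nP002,44,60,0\n"
jsonl_text = csv_to_jsonl(example_csv)
records = [json.loads(line) for line in jsonl_text.splitlines()]
```

Each line of jsonl_text is an independent JSON object, so the same records can be read by a statistical pipeline or passed one at a time to an LLM prompt.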
Treatment and supplemental therapies
All enrolled patients initially received TACE combined with tyrosine kinase inhibitor (TKI) therapy. In this article, this combination therapy is referred to as “dual therapy.” The TACE procedure is described in Supplementary Method 3. TKI therapy was suspended for 3 days prior to each TACE session and resumed 3 days later if no severe TACE-related adverse events occurred. Some patients received additional treatments (e.g., ablation therapy and/or ICIs) based on the disease characteristics, which are described in Supplementary Method 4.
Follow-up
All patients were followed up every 4–8 weeks. The primary endpoint was progression, defined as new tumor lesions confirmed by imaging and/or histopathological examination after treatment, regardless of the primary or distant lesion site. PFS was defined as the time from treatment initiation to progression or death from any cause, and patients without progression and alive at the last follow-up were censored.
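To make the PFS definition and the handling of censored patients concrete, the following is a minimal Kaplan–Meier sketch in plain Python. It is not the code used in the study (the analyses were run in R); the function names and the six example patients are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.

    times  : follow-up in months
    events : 1 = progression or death, 0 = censored at last follow-up
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_total = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_total += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set after t.
        n_at_risk -= n_total
    return curve


def median_pfs(times, events):
    """Smallest time at which the estimated PFS probability is <= 0.5."""
    for t, s in kaplan_meier(times, events):
        if s <= 0.5:
            return t
    return None  # median not reached


# Hypothetical follow-up data for six patients.
example_times = [2, 4, 4, 6, 8, 10]
example_events = [1, 1, 0, 1, 0, 1]
example_curve = kaplan_meier(example_times, example_events)
example_median = median_pfs(example_times, example_events)
```

Censored patients contribute to the risk set up to their last follow-up but never lower the survival estimate, which is why they cannot simply be dropped or counted as progression-free.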
Efficacy and safety assessment
Tumor response was evaluated as complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD) based on the contrast-enhanced computed tomography or magnetic resonance imaging findings according to the modified Response Evaluation Criteria in Solid Tumors (mRECIST) (15). The objective response rate (ORR) was defined as the proportion of patients who achieved the best response of CR or PR and sustained it for at least 4 weeks. The disease control rate (DCR) was defined as the proportion of patients who achieved CR, PR, or SD. The interval between the beginning of therapy and the initial observation of CR or PR was defined as the time to response. Safety was assessed via treatment-related adverse events, which were monitored and recorded in accordance with the Common Terminology Criteria for Adverse Events version 5.0.
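As a worked example of these definitions, the short sketch below computes the ORR and DCR from best-response counts. The counts are those reported in the Results; the function name is hypothetical, and the 4-week confirmation requirement is assumed to have been applied already.

```python
def response_rates(counts):
    """Compute ORR and DCR (in percent) from best-response counts."""
    total = sum(counts.values())
    orr = 100.0 * (counts["CR"] + counts["PR"]) / total
    dcr = 100.0 * (counts["CR"] + counts["PR"] + counts["SD"]) / total
    return round(orr, 1), round(dcr, 1)


# Best-response counts as reported in the Results section.
best_response = {"CR": 116, "PR": 198, "SD": 97, "PD": 11}
orr, dcr = response_rates(best_response)  # orr = 74.4, dcr = 97.4
```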
Local deployment and inference environment
LLM tasks were run on a MacBook Air equipped with an Apple Silicon M4 processor and macOS 15. To enable local deployment and model inference, we utilized an open-source toolchain comprising Ollama (version 0.5.7) and LLM-Anything (version 1.7.3), described in Supplementary Method 5.
Input format and prompt design
All inputs were from structured .jsonl files, with each entry representing a progression risk prediction task. Each sample included standardized clinical variables [e.g., age, tumor size, and hepatitis B virus (HBV) status, among others], formatted into natural language prompts to guide the models. Details of the task descriptions are in Supplementary Method 6. The task was to estimate the 12-, 24-, and 36-month HCC progression probabilities. LLMs output the progression probabilities as decimal values between 0 and 1. All prompts used a unified template for consistency, reproducibility, and standardization (format below).
“You are a highly experienced oncologist assessing the risk of disease progression for a cancer patient over time. Based on the provided clinical information, estimate the patient’s probability of hepatocellular carcinoma progression at 12 months, 24 months, and 36 months. You only output JSONL. Return risk of progression, a decimal score between 0 and 1. Please also provide a clear and concise explanation of your reasoning in English.”
The models were instructed to return predictions in the following structured .jsonl format: risk_12m, risk_24m, and risk_36m, the predicted progression probabilities at 12, 24, and 36 months; and reasoning, an English explanation detailing the reasoning behind the predicted values. All inference outputs were saved in .jsonl format to facilitate downstream structured analysis, model performance evaluation, and cross-model comparisons. During inference, all models were run with fixed decoding settings (temperature = 0 and top_p = 0.9). With temperature = 0, decoding is effectively greedy, so outputs are close to deterministic and the top_p setting has little practical effect. These parameters were kept constant across all runs to ensure uniform generation behavior and fair comparisons between models.
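A minimal sketch of how a prompt could be assembled and a model response validated is given below. The field names risk_12m, risk_24m, risk_36m, and reasoning come from the output format above; the helper names, the "name: value" layout of the clinical variables, and the rejection rules are assumptions for illustration, not the study's actual pipeline.

```python
import json

RISK_KEYS = ("risk_12m", "risk_24m", "risk_36m")


def build_prompt(instruction, patient):
    """Append the patient's variables, as 'name: value' lines, to the instruction."""
    lines = [f"{name}: {value}" for name, value in patient.items()]
    return instruction + "\n\nClinical information:\n" + "\n".join(lines)


def parse_prediction(raw_line):
    """Parse one JSONL output line; return None if it is malformed.

    A line is rejected when it is not a JSON object or when any of the
    three risks is missing, non-numeric, or outside the interval [0, 1].
    """
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    parsed = {}
    for key in RISK_KEYS:
        value = record.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0.0 <= value <= 1.0:
            return None
        parsed[key] = float(value)
    parsed["reasoning"] = str(record.get("reasoning", ""))
    return parsed


# Hypothetical model output line.
good_line = '{"risk_12m": 0.6, "risk_24m": 0.8, "risk_36m": 0.95, "reasoning": "Large tumor."}'
prediction = parse_prediction(good_line)
```

Validating each line before analysis keeps malformed or out-of-range outputs from silently entering the AUC, DCA, NRI, and IDI calculations.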
Statistical analyses
Continuous variables are represented as the mean ± standard deviation or as median (interquartile range, IQR). The cutoff values of the quantitative variables were selected by applying the surv_cutpoint function as implemented in the “survminer” package. Risk factors were selected using univariate and multivariate Cox regression analyses and were employed to construct the final nomogram. The discrimination and the predictive accuracy were assessed using the area under the time-dependent receiver operating characteristic (ROC) curve (AUC). Consistency was evaluated using a calibration curve with Brier scores. Based on the established model, patients were stratified into the high-risk and low-risk groups according to their PFS probability. Decision curve analysis (DCA) was used to assess the clinical utility of the Cox model and the LLMs by quantifying the net benefit of using each model across different threshold probabilities. The net reclassification improvement (NRI) (16) was applied to calculate the proportion of correct reclassifications minus the proportion of incorrect reclassifications by the Cox model compared with the LLMs. The integrated discrimination improvement (IDI) (17) was applied to calculate the difference between the average predicted probabilities of the Cox model and the LLMs in the event and non-event groups. All statistical analyses were performed using R software (version 4.2.2) with the following packages: “rms,” “survival,” “riskRegression,” “pec,” “plotROC,” and “timeROC.” A two-sided p-value <0.05 was considered statistically significant.
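The net benefit, NRI, and IDI described above can be sketched for a single time point as follows. This is a simplified illustration that treats progression by the time point as a binary outcome and ignores censoring; the study itself used R packages with survival-aware estimators. Function names, the 0.5 risk threshold, and the six example patients are hypothetical.

```python
def net_benefit(probs, outcomes, threshold):
    """Net benefit of classifying patients with risk >= threshold as high risk."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def discrimination_slope(probs, outcomes):
    """Mean predicted risk in events minus mean predicted risk in non-events."""
    ev = [p for p, y in zip(probs, outcomes) if y == 1]
    ne = [p for p, y in zip(probs, outcomes) if y == 0]
    return sum(ev) / len(ev) - sum(ne) / len(ne)


def idi(new_probs, ref_probs, outcomes):
    """IDI of the new model relative to the reference model."""
    return discrimination_slope(new_probs, outcomes) - discrimination_slope(ref_probs, outcomes)


def categorical_nri(new_probs, ref_probs, outcomes, threshold=0.5):
    """Two-category NRI of the new model relative to the reference model."""
    net_up_events = 0
    net_down_nonevents = 0
    n_events = 0
    n_nonevents = 0
    for new, ref, y in zip(new_probs, ref_probs, outcomes):
        move = int(new >= threshold) - int(ref >= threshold)  # +1 up, -1 down
        if y == 1:
            n_events += 1
            net_up_events += move  # moving up is correct for events
        else:
            n_nonevents += 1
            net_down_nonevents -= move  # moving down is correct for non-events
    return net_up_events / n_events + net_down_nonevents / n_nonevents


# Hypothetical example: Cox model as reference, one LLM as the new model.
outcomes = [1, 1, 1, 0, 0, 0]
cox_probs = [0.6, 0.4, 0.7, 0.3, 0.6, 0.2]
llm_probs = [0.8, 0.6, 0.7, 0.2, 0.4, 0.6]
example_nri = categorical_nri(llm_probs, cox_probs, outcomes)
example_idi = idi(llm_probs, cox_probs, outcomes)
```

With the Cox model as the reference, positive NRI or IDI values favor the LLM and negative values favor the Cox model.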
Results
Characteristics of the patients
The baseline characteristics of the included patients are described in Table 1. A total of 576 patients with intermediate to advanced HCC at Beijing You’An Hospital were included, comprising 485 (83.6%) men and 91 (16.4%) women. The mean age of the patients was 58.39 ± 9.71 years. According to the Barcelona Clinic Liver Cancer (BCLC) staging system, 204 patients (35.4%) were classified as stage B and 372 (64.6%) as stage C. There were 126 patients (21.9%) with a single tumor, whereas 450 (78.1%) had multiple tumors. Among all patients, 98 (17.0%) received ablation only, 207 (35.9%) received ICIs only, 115 (20.0%) received both ablation and ICIs, and 155 (26.9%) received “dual therapy.” Following mRECIST, 116 (27.0%) patients achieved CR, 198 (47.0%) achieved PR, 97 (23.0%) had SD, and 11 (2.6%) had PD. The highest ORR (CR + PR) and DCR (CR + PR + SD) were 74.4% and 97.4%, respectively.
Progression-free survival analysis
At the end of follow-up, the median PFS of the overall patient cohort was 8.6 months (95%CI = 7.9–9.5) (Figure 1A). Subgroup analysis showed that the median PFS of the patients who received “dual therapy” + ablation therapy was 10.3 months (95%CI = 9.1–12.5) (Figure 1B), while the median PFS of the patients who received “dual therapy” + ICI therapy was 9.8 months (95%CI = 8.6–11.5) (Figure 1C). Further stratification by specific treatment combinations revealed that patients who received “dual therapy” + ablation therapy combined with ICIs had the longest median PFS of 12.3 months (95%CI = 9.9–14.1), followed by the group that received “dual therapy” + ablation therapy alone at 9.1 months (95%CI = 8.2–11.6) and the group that received “dual therapy” + ICI therapy alone at 8.5 months (95%CI = 7.1–10.3). In contrast, patients who received only “dual therapy” without other interventions had the shortest median PFS of 5.8 months (95%CI = 5.1–7.7) (Figure 1D). These results indicate that patients with intermediate to advanced HCC have a high risk of progression. Both ablation therapy and ICI therapy helped delay the progression of HCC, and combination therapy yielded the best prognostic outcomes.
Figure 1. Kaplan–Meier curves for the progression-free survival (PFS) of patients with various types of therapy. (A) PFS of all patients, with 95% confidence intervals. (B) Patients who received ablation therapy. (C) Patients who received immune checkpoint inhibitor (ICI) therapy. (D) Subgroup comparison of patients who received dual therapy alone [transarterial chemoembolization (TACE) combined with tyrosine kinase inhibitor (TKI) therapy, referred to as “dual therapy”], dual therapy with ablation only, dual therapy with ICIs only, or the combination of ablation, ICIs, and dual therapy. The vertical dashed line in each panel represents the median PFS, that is, the time at which 50% of the patients remained progression-free. p < 0.05 was considered statistically significant.
Selection of independent prognostic factors and nomogram established for PFS
The independent factors predictive of PFS based on univariate and multivariate Cox proportional hazards models are displayed in Table 2. In the univariate and multivariate analyses, age (>46 vs. ≤46 years), type of therapy (ablation and ICIs, ablation only, ICIs only, vs. no additional treatment), PTA (>67% vs. ≤67%), and mRECIST response (SD + PD vs. CR + PR) were identified as significant prognostic factors for PFS. These four independent risk factors were utilized to build a Cox model for predicting progression in patients with intermediate to advanced HCC (Figure 2A).
Figure 2. Construction and internal validation of the Cox model nomogram. (A) Nomogram for prediction of the 12-, 24-, and 36-month progression-free survival based on four prognostic factors. (B, C) Time-dependent receiver operating characteristic (ROC) curves and area under the curve (AUC) values at 12, 24, and 36 months in the training set (B) and the validation set (C). The numbers in the figure represent the AUC (95% confidence interval). (D, E) Brier score calibration curves at 12, 24, and 36 months in the training set (D) and the validation set (E). The X-axis represents the predicted probabilities, while the Y-axis represents the observed probabilities. The numbers in the figure represent the Brier score (95% confidence interval). (F, G) Progression-free survival in the training (F) and validation (G) sets. Patients were stratified into the high- and low-risk groups based on the nomogram-derived scores. Numbers at risk are shown below each plot. The p-values were calculated using the log-rank test.
The time-dependent ROC analysis showed that the AUC values for predicting progression at 12, 24, and 36 months were 0.72 (95%CI = 0.67–0.78), 0.77 (95%CI = 0.69–0.86), and 0.96 (95%CI = 0.93–0.99) in the training set, respectively, indicating that the model can effectively distinguish between progressive and non-progressive patients (Figure 2B). We assessed the calibration performance of the model using the Brier scores, which were 0.20 (95%CI = 0.18–0.22), 0.12 (95%CI = 0.09–0.14), and 0.03 (95%CI = 0.01–0.05) in the training set, respectively, indicating that the predicted results were highly consistent with the observed outcomes (Figure 2D). In the validation cohort, the nomogram also demonstrated strong discriminatory ability, with AUC values for predicting the 12-, 24-, and 36-month PFS of 0.75 (95%CI = 0.67–0.83), 0.81 (95%CI = 0.71–0.91), and 0.97 (95%CI = 0.94–1.0), respectively (Figure 2C). The Brier scores for progression at 12, 24, and 36 months were 0.20 (95%CI = 0.17–0.22), 0.13 (95%CI = 0.095–0.16), and 0.04 (95%CI = 0.007–0.07), respectively, further indicating that the predictive results were highly consistent with the observed outcomes (Figure 2E).
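For readers less familiar with these two metrics, the sketch below computes an AUC and a Brier score at a single time point. It is a simplification that treats progression by that time point as a binary outcome and ignores censoring, unlike the time-dependent estimators used in the study; the function names and example values are hypothetical.

```python
def auc(probs, outcomes):
    """Probability that a random event is ranked above a random non-event.

    Ties count as one half (the Mann-Whitney formulation of the AUC).
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical predicted risks and observed progression status.
example_probs = [0.9, 0.7, 0.4, 0.6, 0.2]
example_outcomes = [1, 1, 1, 0, 0]
example_auc = auc(example_probs, example_outcomes)
example_brier = brier_score(example_probs, example_outcomes)
```

Because the AUC is computed over event/non-event pairs, it rests on very few comparisons when one of the two groups is small, so the estimate becomes unstable.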
Patients were classified into high- and low-risk groups based on the nomogram-derived risk scores. The Kaplan–Meier curves showed that, in the training set, patients in the high-risk group had a shorter median PFS of 5.8 months (95%CI = 4.9–7.4) compared with 11.5 months (95%CI = 9.6–13.8) in the low-risk group (Figure 2F). In the validation set, the median PFS was 6.6 months (95%CI = 4.2–9.2) for the high-risk group and was 11.6 months (95%CI = 9.1–14.0) for the low-risk group (both p < 0.0001) (Figure 2G).
Comparison of the predictive performance of the Cox model with that of the LLMs for tumor progression
We utilized the LLMs of the LLM-Anything platform to predict the progression risk of HCC at 12, 24, and 36 months in the training and validation cohorts. In the training set (Figures 3A–C), the time-dependent ROC analysis showed that DeepSeek R1 achieved AUC values of 0.811 (95%CI = 0.765–0.857), 0.789 (95%CI = 0.708–0.869), and 0.826 (95%CI = 0.777–0.874) at 12, 24, and 36 months, respectively. In contrast, DeepSeek V3 had lower AUC values of 0.636 (95%CI = 0.578–0.694), 0.590 (95%CI = 0.507–0.674), and 0.586 (95%CI = 0.353–0.820), while Qwen/QWQ-32B had AUC values of 0.550 (95%CI = 0.489–0.612), 0.569 (95%CI = 0.473–0.665), and 0.630 (95%CI = 0.387–0.872) at the same time points. In the validation set (Figures 3D–F), DeepSeek R1 had AUC values of 0.583 (95%CI = 0.494–0.672), 0.581 (95%CI = 0.437–0.726), and 0.858 (95%CI = 0.715–1.000), respectively. DeepSeek V3 had AUC values of 0.415 (95%CI = 0.324–0.505), 0.464 (95%CI = 0.328–0.599), and 0.469 (95%CI = 0.294–0.644), while Qwen/QWQ-32B had AUC values of 0.457 (95%CI = 0.363–0.550), 0.508 (95%CI = 0.390–0.625), and 0.572 (95%CI = 0.448–0.696) at the same time points. Except for DeepSeek R1 at 12 and 24 months in the training set, none of these AUC values exceeded those of the Cox model.
Subsequently, we compared the DCA of DeepSeek R1, DeepSeek V3, Qwen/QWQ-32B, and the Cox model at 12, 24, and 36 months (Figures 4A–C, respectively) in the training cohort. DeepSeek R1 demonstrated the highest standardized net benefit at all time points, indicating superior clinical application value in predicting the progression of HCC. In contrast, DeepSeek V3 and Qwen/QWQ-32B had relatively lower net benefits, while the net benefit of the Cox model fell between that of DeepSeek R1 and those of the other two LLMs. In the validation cohort, the Cox model slightly outperformed the other models at 12 and 24 months (Figures 4D, E, respectively), while the four models performed similarly at 36 months (Figure 4F).
Figure 3. Time-dependent area under the curve (AUC) performance of DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B in the training and validation sets. (A–C) Time-dependent receiver operating characteristic (ROC) curves and AUCs of the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models at 12 months (A), 24 months (B), and 36 months (C) in the training set. (D–F) Time-dependent ROC curves and AUCs of the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models at 12 months (D), 24 months (E), and 36 months (F) in the validation set. The numbers in the figure represent the AUC (95% confidence interval).
Figure 4. Decision curve analysis (DCA) of the predictive models at different time points in the training and validation sets. (A–C) Net benefit curves comparing the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models with the Cox model at 12 months (A), 24 months (B), and 36 months (C) in the training set. (D–F) Net benefit curves comparing the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models with the Cox model at 12 months (D), 24 months (E), and 36 months (F) in the validation set. The X-axis represents the high-risk threshold (predicted probability), while the Y-axis represents the standardized net benefit. The “All” and “None” lines represent the net benefit under the strategies of treating all patients or no patients, respectively.
NRI- and IDI-based performance comparison between the Cox model and the LLMs
Lastly, the Cox model was used as the reference for the calculation of the NRI and IDI to assess the performance improvement of the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models. As shown in Figures 5A, B and in Table 3, in the training cohort, DeepSeek R1 demonstrated higher NRI values than the Cox model at 12 and 24 months, with NRI values of 0.26 (0.07–0.46, p < 0.01) and 0.46 (0.26–0.69, p < 0.01), respectively. In addition, DeepSeek R1 achieved a higher IDI at 12 months compared with the Cox model, with an IDI value of 0.18 (0.09–0.27, p < 0.01), suggesting a modest improvement in overall discrimination at that time point. Across the other time points and models, the Cox model consistently outperformed DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B in both the NRI and IDI metrics. In the validation cohort, the Cox model consistently demonstrated superior performance compared with all LLMs (Supplementary Figure S1, Table 3).
Figure 5. Net reclassification improvement (NRI) plots comparing the DeepSeek R1, DeepSeek V3, and Qwen/QWQ-32B models with the Cox model at 12, 24, and 36 months in the training set. (A–C) Category-based NRI plots for DeepSeek R1 versus the Cox model at 12 months (A), 24 months (B), and 36 months (C). (D–F) Category-based NRI plots for DeepSeek V3 versus the Cox model at 12 months (D), 24 months (E), and 36 months (F). (G–I) Category-based NRI plots for Qwen/QWQ-32B versus the Cox model at 12 months (G), 24 months (H), and 36 months (I).
Table 3. Comparison of the DeepSeek V3, DeepSeek R1, and Qwen/QWQ-32B models with the Cox model at 12, 24, and 36 months using net reclassification improvement (NRI) and integrated discrimination improvement (IDI).
Discussion
In patients with intermediate to advanced HCC receiving dual therapy, the addition of ablation or ICI therapy could further prolong PFS, with the most significant effects observed when all four treatment modalities were used in combination. Existing studies have also shown that ablation therapy could improve the overall survival and PFS in patients with intermediate to advanced HCC (6, 18, 19), supporting its use as a viable combination therapy in advanced treatment strategies.
Compared with Cox models, LLMs require larger-scale, structurally complex multimodal data (e.g., images, text, and event logs) and substantially higher computational resources for training. A number of studies have shown that LLMs offer significant advantages in disease diagnosis, treatment planning, and short-term risk assessment, particularly when processing unstructured data such as clinical notes and imaging reports. One study using pathology reports from The Cancer Genome Atlas Thyroid Cancer (TCGA-THCA) cohort demonstrated that LLMs could accurately extract key information and achieve precise American Joint Committee on Cancer (AJCC) staging and American Thyroid Association (ATA) risk stratification, with F1 scores ranging from 88.5% to 96.5% for ATA risk and from 94.2% to 99.7% for AJCC staging (10). Another study on cardiac arrest patients showed that ChatGPT-4 achieved AUCs of 0.85 for mortality prediction and 0.83 for neurological outcome, which are comparable to those of established prognostic scoring systems (20). In the present study, however, the LLMs failed to outperform the traditional Cox model, highlighting their limitations in long-term prognosis modeling. A recent study supports this view: when evaluated on real-world electronic health record datasets, the predictive performance of GPT-3.5 and GPT-4 was significantly lower than that of traditional machine learning models (AUC values of 0.537 and 0.629 vs. 0.847 for the VUMC dataset; 0.517 and 0.602 vs. 0.894 for the MIMIC dataset) (21). This difference is likely attributable mainly to the inability of LLMs to handle core survival analysis issues (e.g., censored data and time-varying risks), whereas Cox models were specifically designed for these situations and are more statistically robust (22–25). Most LLMs process only fixed single inputs and outputs and can hardly track time-varying factors or multistage disease progression.
Moreover, the progression of HCC is influenced by a variety of complex factors that change over time, such as the tumor biology (26, 27), pathological characteristics (28, 29), treatment efficacy (30), and liver function fluctuations (31). It is worth noting that, although the probability scores provided by the LLMs performed poorly in the quantitative evaluation, their generated reasoning text occasionally identified risk factors consistent with clinical guidelines, suggesting a potential for integrating medical knowledge. However, how to reliably extract, validate, and quantify such qualitative reasoning remains a major challenge before it can be used for serious clinical decision support. In general, LLMs are more suitable for short-term prediction tasks with clear structures and outcomes. For their application to long-term survival modeling, further optimization is needed in the model architecture and in the ability to learn from time-varying information so as to enhance their application value in clinical prognosis.
We propose Cox models and LLMs as complementary tools in survival analysis. Cox models rely on structured data and offer high interpretability, making them ideal for standardized clinical risk assessment (32). They remain a cornerstone of current clinical risk assessment systems. On the other hand, LLMs have excellent language comprehension and reasoning abilities and can extract insights from unstructured data (e.g., imaging reports, pathology descriptions, and clinical records) (33). They are also evolving into multimodal models capable of processing images and texts (34, 35), which helps in identifying complex patterns in medical images that traditional methods might miss (33, 36). These unstructured features could be converted into structured inputs for Cox models through embedding or vectorization. If LLMs could accurately identify such features and convert them into quantifiable inputs, predictive performance could be enhanced without the need for specialized segmentation models. In short, the Cox model could handle structured risk modeling, while the LLMs could deal with complex, unstructured, multimodal clinical data. The combination of the two could create a more versatile and interpretable hybrid clinical prediction framework fit for real-world environments with diverse and evolving clinical features. Such a framework could be implemented in five key steps. Firstly, structured data (e.g., age and laboratory parameters) and unstructured text (e.g., clinical notes and imaging reports) would be collected and preprocessed. Secondly, the preprocessed text would be input into a clinical LLM to obtain semantic embeddings (37, 38), which would be aggregated (using the [CLS] token or mean pooling) and then reduced in dimension via principal component analysis. Thirdly, the resulting text-derived features would be concatenated with the structured variables to form a unified feature vector. Fourthly, this combined feature set, along with the survival time and the event status, would be used to train the Cox model. Finally, model performance would be evaluated using the C-index, AUC, calibration curves, and DCA.
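The aggregation and concatenation steps of such a hybrid pipeline can be sketched as follows. This is a toy illustration only: the token embeddings are made-up two-dimensional vectors rather than real LLM outputs, the function names are hypothetical, and the principal component analysis and Cox fitting steps are omitted.

```python
def mean_pool(token_embeddings):
    """Average per-token vectors into one fixed-length document vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens for d in range(dim)]


def build_feature_vector(structured, token_embeddings):
    """Concatenate structured variables with the pooled text embedding."""
    return list(structured) + mean_pool(token_embeddings)


# Hypothetical inputs: two binary structured variables (e.g., dichotomized age
# and PTA) and a report of two tokens with 2-dimensional embeddings.
structured_vars = [1, 0]
report_tokens = [[1.0, 2.0], [3.0, 4.0]]
features = build_feature_vector(structured_vars, report_tokens)
```

Mean pooling yields a vector of the same length regardless of report length, which is what allows the text-derived features to be placed alongside the structured covariates in a single design matrix.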
This study also has certain limitations. Firstly, it was based on retrospective data from a single center (576 patients) and did not undergo external validation in a multicenter independent cohort; therefore, the robustness of its conclusions needs further confirmation. Secondly, the exceptionally high AUC values observed at 36 months for the Cox model (0.96 in training and 0.97 in validation) should be interpreted with caution as they are likely attributable to the extreme class imbalance at this late time point, with only three patients remaining event-free in each cohort. Such a small number of non-events made the discrimination metric estimates unstable. Consequently, the model performance at time points with a more sufficient sample size, such as at 24 months [AUC: training = 0.77 (95%CI = 0.69–0.86), validation = 0.81 (95%CI = 0.71–0.91)], provided a more robust and representative assessment of its generalizability. Finally, the evaluation of the LLMs relied solely on structured clinical data, without incorporating unstructured information (e.g., imaging, pathology, or genomics), which might have underestimated their potential predictive value.
Conclusions
In summary, combining ablation or ICIs with standard treatment significantly prolonged the PFS in patients with intermediate to advanced HCC, with the combination of dual therapy, ablation, and ICIs showing the longest PFS. In addition, the constructed Cox regression model effectively distinguished between patients at high and low risk of progression, and its predictive performance was superior to that of the three LLMs. The combination of Cox models and LLMs may integrate the robustness of structured models with the ability of LLMs to process multi-source heterogeneous data, potentially achieving higher prediction accuracy.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.
Ethics statement
The studies involving humans were approved by the ethics committee of the Beijing You’An Hospital, Capital Medical University. The studies were conducted in accordance with the local legislation and institutional requirements. This retrospective study using existing anonymized medical records has been approved by the ethics committee of the Beijing You’An Hospital, which waived the requirement for written informed consent as the study poses minimal risk, obtaining retrospective consent is impracticable, and patient privacy is strictly protected.
Author contributions
KL: Conceptualization, Writing – original draft, Writing – review & editing, Methodology, Software. CW: Writing – original draft, Writing – review & editing, Data curation, Methodology. YX: Data curation, Methodology, Writing – review & editing. YS: Methodology, Software, Writing – review & editing. YBZ: Methodology, Software, Writing – review & editing. DM: Funding acquisition, Writing – review & editing. CH: Writing – review & editing, Supervision. DG: Funding acquisition, Writing – review & editing. TM: Writing – review & editing, Methodology. AL: Writing – review & editing, Data curation. YHZ: Investigation, Supervision, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was financially supported by the Beijing Municipal Science and Technology Program (Z251100007125025), the National Key Research and Development Program of China (2022YFC2304404), and Beijing YouAn Hospital (BJYAYY-YN2023-10).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2026.1710529/full#supplementary-material
Keywords: ablation therapy, Cox proportional hazards model, intermediate to advanced HCC, large language models, progression prediction
Citation: Li K, Wang C, Xiong Y, Song Y, Zhang Y, Mou D, Hu C, Guo D, Mei T, Li A and Zhang Y (2026) Traditional Cox regression outperforms large language models in predicting long-term progression of intermediate to advanced hepatocellular carcinoma. Front. Oncol. 16:1710529. doi: 10.3389/fonc.2026.1710529
Received: 22 September 2025; Revised: 02 January 2026; Accepted: 12 January 2026; Published: 29 January 2026.
Edited by:
Stavros P. Papadakos, Laiko General Hospital of Athens, Greece
Reviewed by:
Fangshi Xu, The Second Affiliated Hospital of Xi’an Jiaotong University, China
Ioannis Katsaros, National and Kapodistrian University of Athens, Greece
Copyright © 2026 Li, Wang, Xiong, Song, Zhang, Mou, Hu, Guo, Mei, Li and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yonghong Zhang, zhangyh@ccmu.edu.cn
†These authors have contributed equally to this work
Danlei Mou5