TNM staging of esophageal cancer using fine-tuned pathology foundation models and multiple instance learning

Cao, Junxiang; Wang, Jiacheng

doi:10.3389/fonc.2026.1832365

ORIGINAL RESEARCH article

Front. Oncol., 29 May 2026

Sec. Cancer Imaging and Image-directed Interventions

Volume 16 - 2026 | https://doi.org/10.3389/fonc.2026.1832365

1. School of Computer, Jiangsu University of Science and Technology, Zhenjiang, China
2. Manteia Technology, Co., Ltd., Xiamen, China

Abstract

Background:

Esophageal cancer is a common and aggressive malignancy worldwide. Accurate TNM staging is essential for clinical decision-making and prognosis. While whole slide images (WSIs) offer rich histopathological details for automated staging, their sheer size and complexity make manual analysis subjective and time-consuming. Consequently, advanced deep learning models are critically needed to improve the accuracy and efficiency of WSI-based TNM staging.

Methods:

We propose a novel deep learning approach that integrates a fine-tuned pathology foundation model within a Multiple Instance Learning (MIL) framework for automated TNM staging from WSIs. To boost predictive accuracy, our model is enhanced with a feature attention mechanism to pinpoint critical tumor areas and an adaptive layer for effective multi-scale feature fusion. Furthermore, we developed a class-weighted cross-entropy loss function to address pronounced TNM stage imbalance, targeting composite pathological stage classification as a single multi-class prediction task rather than separate T, N, and M components. Performance was rigorously evaluated using three-fold cross-validation and benchmarked against state-of-the-art methods.

Results:

With UNI features, our model achieved an accuracy of 78.57%, a macro F1-score of 81.36%, and an AUC of 94.05% on the TCGA-ESCA dataset, exceeding the compared MIL methods on all three metrics. Visualization analysis further showed that the model localizes tumor regions and their boundaries, increasing the interpretability of its predictions.

Conclusion:

Our study validates the potential of a WSI-based deep learning approach for accurate prediction of esophageal cancer’s TNM stage. This tool has the potential to augment pathological diagnosis, refine patient risk assessment, and ultimately guide better treatment strategies.

1 Introduction

Esophageal cancer (ESCA) is a prevalent and highly lethal malignancy of the digestive tract worldwide, and early diagnosis together with accurate staging is critical for improving patient outcomes (, ). The two major histological subtypes of ESCA, squamous cell carcinoma (ESCC) and adenocarcinoma (AC), differ markedly in pathogenesis, epidemiological characteristics, and prognosis, and they also require distinct therapeutic strategies (). Currently, the TNM staging system is a widely accepted standard for treatment decision-making and prognostic evaluation in ESCA, providing essential guidance for clinical management. This system classifies patients according to pathological parameters including the depth of primary tumor invasion (T), the extent of regional lymph node involvement (N), and the presence of distant metastasis (M), thereby enabling stratified prognostic assessment and supporting personalized treatment planning (, ).

In recent years, with the continuous advancement of digital pathology, whole slide imaging (WSI) has become increasingly prevalent in clinical pathological diagnosis. By enabling high-resolution scanning of entire tissue sections, WSI allows for comprehensive digital storage and visualization of slides, enabling pathologists to freely zoom, navigate, and examine histological structures from multiple perspectives at a computer workstation. This capability has markedly improved both the efficiency and accuracy of diagnosis (–). Furthermore, WSI not only overcomes the limitations of conventional optical microscopy in terms of field of view and information accessibility but also expands the scope of pathology applications, providing robust support for telepathology, pathology education, case archiving, and the integration of artificial intelligence (AI) into diagnostic practice. Multiple studies have demonstrated that WSI achieves a high level of diagnostic concordance with traditional glass slides across a wide range of tumors. Moreover, in cancers with complex histological architecture and stringent staging requirements, such as ESCA, WSI offers a more comprehensive and objective representation of spatial distribution and infiltration patterns ().

In TNM staging of ESCA, WSI provides a more accurate and visually intuitive evaluation of the depth of primary tumor invasion, extent of lymph node involvement, and presence of distant metastasis. For instance, WSI enables continuous examination of entire tissue sections, facilitating the detection of small lymph node metastases and early tumor infiltration, thereby reducing the risk of missed diagnoses and misclassification (, ). Moreover, the digital nature of WSI offers a rich data foundation for the development of AI algorithms and automated staging assessment, holding significant potential to advance the standardization and intelligence of ESCA pathological staging.

However, the extremely large data scale of WSI and the limited granularity of available annotations pose significant challenges for analysis. Individual slides often contain hundreds of millions of pixels, while annotations are typically provided only at the slide level, lacking detailed labels for specific tissue regions. This limitation hinders the direct application of conventional supervised learning methods. In this context, the integration of digital pathology and AI has driven rapid advances in computational pathology (, ). Deep learning (DL) models, in particular, have demonstrated remarkable feature learning capabilities, enabling the automatic extraction of complex histological patterns, many of which surpass the visual perception of human observers. To address these challenges, the multiple instance learning (MIL) framework has been widely adopted (, ). MIL aggregates information from local image patches to generate slide-level predictions, allowing efficient processing of entire WSI without dense annotations. Furthermore, the incorporation of attention mechanisms enables the model to focus on diagnostically relevant regions within the tumor microenvironment, thereby significantly enhancing the accuracy and reliability of feature extraction and prediction (–).

Recent advances in imaging technology and computer-aided diagnosis (CAD) are further transforming the landscape of esophageal cancer detection. Notably, hyperspectral imaging (HSI) combined with deep learning has demonstrated promising results for early esophageal cancer detection, with YOLO-based hyperspectral models achieving high sensitivity for mucosal lesions that may be missed under conventional white-light endoscopy (, ). These complementary modalities highlight a broader paradigm shift toward AI-assisted, multimodal esophageal diagnostics. Our work contributes to this landscape by addressing a distinct and underexplored challenge: automated pathological TNM staging directly from WSIs of resected specimens, which provides a critical layer of objective histopathological assessment independent of endoscopic or radiological workup.

Building on the aforementioned background, this study aims to develop a novel MIL-based DL model that leverages only WSIs of the primary tumor to predict the composite overall pathological TNM stage (Stage I–IV as a single multi-class classification task). To enhance the model’s predictive capability, we incorporate a feature attention mechanism to strengthen the representation of diagnostically critical regions, enabling the network to focus on histological areas most relevant to tumor staging. Additionally, an adaptive layer is employed for multi-scale feature fusion, allowing the integration of information across different spatial resolutions and improving prediction performance across various staging categories. By combining these strategies, the model achieves improved accuracy, robustness, and interpretability.

2 Related work

AI has rapidly advanced in digital pathology, particularly in DL applications that span image segmentation, classification, prognosis prediction, and treatment response evaluation (–). Relevant studies across various cancers have demonstrated clinical value, such as subtype differentiation in lung cancer using pathology slides (), prognosis prediction in colorectal cancer (), precision treatment exploration in hepatocellular carcinoma (, ), and Gleason grading prediction in prostate cancer biopsies (). These achievements highlight the broad applicability and clinical potential of DL in pathological imaging.

In the field of ESCA, DL research has primarily focused on endoscopic image diagnostics, especially early detection and grading of Barrett’s esophagus (BE) and esophageal squamous cell carcinoma. For instance, Gong et al. (30) developed a model based on HDWLE images, achieving an accuracy of 93.9% in multi-center external testing; Tang et al. (31) introduced a real-time DCNN system, which significantly outperforms expert endoscopists in sensitivity and NPV; Cai et al. (32) developed a CAD system that exceeds the sensitivity and NPV of endoscopists at various levels, effectively improving overall diagnostic performance in clinical practice; Li et al. (33) demonstrated that CAD-NBI outperforms CAD-WLI in accuracy and specificity, especially among mid-level and junior endoscopists.

In addition to conventional endoscopic detection, some studies have extended DL to BE detection and quantitative analysis. Pan et al. (34) employed a fully convolutional network (FCN) to segment the gastroesophageal junction (GEJ) and squamocolumnar junction (SCJ), assisting BE detection; Tsai et al. (35) used EfficientNetV2B2 to build a CAD system that achieves over 94% accuracy and sensitivity on NBI images; Wu et al. (36) developed the ELNet system, which integrates classification and segmentation modules, outperforming existing methods in sensitivity and accuracy; Ali et al. (37) developed a DL-based 3D reconstruction system capable of automatically quantifying BE length and area, accurately extracting C&M scores. These studies not only demonstrate the potential of DL for automatic BE detection but also offer new approaches for quantification and standardization.

Moreover, research has extended to other modalities such as CT and pathology slides. Markowetz et al. (38) demonstrated the efficient diagnosis of BE using Cytosponge-TFF3 detection on pathology slides, reducing the pathologist’s workload by 57%; Bouzid et al. (39) proposed the BE-TransMIL model, which showed excellent generalization capability on H&E and TFF3-stained slides, with external validation AUROC exceeding 87%; in CT imaging, efforts have been made to predict chemoradiation therapy (CRT) responses in ESCA patients using DL models (30, 40). However, there is still a lack of DL research on WSI for differentiating ESCC and AC morphological features, or predicting CRT treatment responses, which remains an area for future breakthroughs.

Despite the significant potential of DL for early diagnosis and staging of ESCA, most research has focused on endoscopic images and other imaging modalities. In contrast, automated TNM staging based on WSI is still in its infancy and faces many challenges. Therefore, automated ESCA TNM staging prediction based on pathology slides holds significant clinical promise and represents a new technological breakthrough for future intelligent pathology image analysis and personalized treatment. The proposed research aims to lay a crucial foundation for precise diagnosis, clinical decision-making, and individualized treatment of ESCA, and provides a model that can be adapted for the pathological analysis of other cancer types.

3 Materials and methods

3.1 Patients and data collection

This study analyzes the ESCA dataset from The Cancer Genome Atlas (TCGA). The TCGA-ESCA cohort includes 151 pathologically confirmed ESCA patients, with a total of 151 H&E stained WSIs from these patients. All data were publicly obtained from the TCGA official data portal (https://portal.gdc.cancer.gov/).

This study followed the criteria outlined below for patient selection:

Inclusion criteria:

Pathologically diagnosed with ESCA.
Availability of H&E WSIs.
Complete American Joint Committee on Cancer (AJCC) TNM staging data and survival follow-up information.

Exclusion criteria:

Poor tissue slide quality (e.g., excessive fading, missing critical areas, severe folding, or artificial artifacts such as knife marks), which would affect subsequent morphological analysis.
Incomplete clinical pathological information (e.g., missing critical staging information, unknown survival status, or insufficient follow-up time of less than 30 days).
Specimens obtained after neoadjuvant therapy and surgical resection (to avoid interference from treatment-related morphological changes in the analysis).

After applying these criteria, a total of 128 patients were included in the subsequent analysis. All clinical data acquisition and data usage for the selected patients were in compliance with the TCGA data use guidelines and relevant ethical regulations. This study exclusively utilized retrospective, de-identified human pathological and clinical data from the publicly accessible TCGA repository; no direct patient contact or prospective data collection was performed. Use of TCGA data is governed by the TCGA Data Use Certification Agreement, and no additional institutional review board (IRB) approval was required for this publicly available, de-identified dataset. All WSIs used in this study are H&E-stained slides derived from primary esophageal resection specimens; no lymph node dissection slides or metastatic site specimens were included. The final cohort comprised 80 esophageal squamous cell carcinoma (ESCC) and 45 adenocarcinoma (AC) cases (3 not reported). TNM stage distribution was as follows: Stage I, n=14 (10.9%); Stage II, n=64 (50.0%); Stage III, n=45 (35.2%); Stage IV, n=5 (3.9%). The cohort included 109 male and 19 female patients, with a mean age at diagnosis of 60.1 ± 10.6 years (range 36–86). During 3-fold cross-validation, stratified splitting was employed to maintain consistent stage and histological subtype proportions across folds. Patient characteristics are summarized in Supplementary Table 1.
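The stratified splitting described above can be illustrated with a minimal sketch. It stratifies on stage only (the study also balanced histological subtype), uses the stage counts reported in this section, and deals cases out round-robin after shuffling. The helper name `stratified_folds` and the random seed are illustrative assumptions, not the authors' code.

```python
import random
from collections import Counter, defaultdict

# Stage counts reported for the final cohort (n = 128).
STAGE_COUNTS = {"I": 14, "II": 64, "III": 45, "IV": 5}


def stratified_folds(labels, k=3, seed=0):
    """Assign each case index to one of k folds, balancing labels per fold.

    Hypothetical helper: shuffles the indices of each class, then deals
    them out round-robin so every fold receives a near-equal share.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = {}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


# Synthetic label list with the reported stage distribution.
labels = [stage for stage, n in STAGE_COUNTS.items() for _ in range(n)]
fold_of = stratified_folds(labels, k=3)

# Per-fold stage counts; each fold serves once as the validation set.
per_fold = [Counter() for _ in range(3)]
for idx, fold in fold_of.items():
    per_fold[fold][labels[idx]] += 1

for fold, counts in enumerate(per_fold):
    print(fold, dict(sorted(counts.items())))
```

With only five Stage IV cases, any three-way stratified split leaves one or two of them in each validation fold, so per-stage metrics for Stage IV are necessarily noisy.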

3.2 Data preprocessing and enhancement

In this study, we adopted the standard CLAM (Clustering-constrained Attention Multiple Instance Learning) pipeline to process and analyze WSIs of ESCA. This pipeline provides an efficient solution for handling large-scale pathology image data, supporting consistent feature extraction and model training.

First, we used the OpenSlide library (v. 1.3.1) to load WSI files and parse their multi-resolution pyramid structure. All subsequent processing was conducted at the 40× magnification level to ensure sufficient tissue detail. Before extracting image patches, we performed background filtering to remove non-tissue areas, blank regions, and artifacts, ensuring that the extracted patches contained valid tissue. Next, fixed-size image patches (224×224 pixels) were extracted from the filtered slides. Each patch represents localized tissue information, and together the patches form the representation of the entire slide. To mitigate color variation between samples, we applied Reinhard color normalization to the extracted patches, adjusting all slides to a unified color standard. This allowed the model to focus on tissue structure rather than color variation, thereby improving its generalization capability. For data partitioning, we employed a stratified 3-fold cross-validation strategy so that the long-tailed stage distribution was preserved in every fold.
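The core of Reinhard normalization is a per-channel shift and rescale so that the channel statistics of a patch match those of a reference. The sketch below shows only that step, on plain lists of channel values; the conversion to and from the decorrelated color space in which the method is normally applied is omitted, and the function name and sample values are illustrative assumptions.

```python
from statistics import fmean, pstdev


def match_channel_stats(source, target_mean, target_std):
    """Shift and rescale one color channel to a target mean and std.

    Implements x' = (x - mean_src) / std_src * std_tgt + mean_tgt.
    A constant channel (std 0) is simply set to the target mean.
    """
    src_mean = fmean(source)
    src_std = pstdev(source)
    if src_std == 0:
        return [target_mean for _ in source]
    scale = target_std / src_std
    return [(x - src_mean) * scale + target_mean for x in source]


# Illustrative values for one channel of one patch (not real slide data).
channel = [52.0, 60.0, 61.5, 70.0, 48.5]
normalized = match_channel_stats(channel, target_mean=65.0, target_std=4.0)

print(round(fmean(normalized), 6), round(pstdev(normalized), 6))
```

Applying the same target statistics to every slide is what brings all patches to a common color standard.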

3.3 MIL model construction

To address the high dimensionality and strong noise inherent in directly modeling WSIs, this study developed an enhanced slide-level prediction model based on the classical MIL framework. The overall architecture consists of three core stages: feature extraction, instance feature aggregation, and slide-level classification. The model structure is shown in Figure 1. In the feature extraction stage, preprocessed image patches are encoded using a pretrained universal pathology feature encoder to obtain high-level semantic features with dimensionality d. Compared with training CNNs from scratch, leveraging a large pretrained model effectively enhances feature generalization and mitigates overfitting in small-sample scenarios. During the feature aggregation stage, all instance features from a single WSI are input into an attention-based MIL module. This module learns discriminative attention weights, assigning higher importance to key tumor regions while preserving global information, thereby emphasizing regions relevant to cancer staging. Finally, the slide-level embedding generated by the MIL module is passed through a fully connected classifier to predict multi-class labels. Unlike conventional mean or max pooling, the attention-driven aggregation employed in this study allows finer modeling of inter-instance variability and improves the recognition of long-tail categories.
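The difference between attention-driven aggregation and mean pooling can be seen in a small, dependency-free sketch. The attention scores here are supplied by hand, whereas in the model they come from a learned attention network; all function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weighted sum of instance features using softmax-normalized scores.

    features: list of N instance vectors, each of length d.
    scores:   list of N raw (unnormalized) attention scores.
    Returns the slide-level embedding and the attention weights.
    """
    weights = softmax(scores)
    dim = len(features[0])
    embedding = [
        sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)
    ]
    return embedding, weights


# Three toy patch features (d = 2); the third patch gets a high score.
patches = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
slide_vec, attn = attention_pool(patches, scores=[0.0, 0.0, 3.0])

# Equal scores reduce attention pooling to plain mean pooling.
mean_vec, uniform = attention_pool(patches, scores=[0.0, 0.0, 0.0])
print([round(a, 3) for a in attn], [round(v, 3) for v in slide_vec])
```

Mean pooling is thus the special case in which every patch receives the same score; learning the scores lets a few diagnostically relevant patches dominate the slide-level embedding.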

Figure 1

Accurately identifying key information relevant to cancer staging from noisy instances in the MIL aggregation module is critical for model performance. In this study, we introduce a combined attention and adapter structure within the aggregation module to achieve adaptive feature refinement and efficient modeling of important regions. Figure 2 shows the details of the adapter module used for residual-style enhancement. First, the input instance features x are mapped to a latent space of dimension L through a linear transformation followed by a nonlinear activation. Subsequently, a residual-style enhancement is applied via the adapter module:

Figure 2

Here, W_down and W_up denote the linear layers for down-projection and up-projection, respectively, and σ represents the nonlinear activation function. This adapter module enables lightweight nonlinear adjustments while maintaining the stability of the original features, thereby enhancing the model’s ability to capture fine-grained differences.
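The adapter expression itself is not reproduced in this text. One form consistent with the description is sketched below, where h is the latent instance feature of dimension L, φ is the unspecified activation of the initial mapping, r < L is an assumed bottleneck width, and bias terms are omitted. The symbols h, h', W_in, φ, and r are introduced here for illustration; only W_down, W_up, and σ appear in the original description.

```latex
h = \phi\left(W_{\mathrm{in}}\, x\right), \qquad
h' = h + W_{\mathrm{up}}\, \sigma\left(W_{\mathrm{down}}\, h\right),
\qquad
W_{\mathrm{in}} \in \mathbb{R}^{L \times d},\;
W_{\mathrm{down}} \in \mathbb{R}^{r \times L},\;
W_{\mathrm{up}} \in \mathbb{R}^{L \times r}
```

Because the adapter output is added to h rather than replacing it, the refined feature h' stays close to the original feature when the adapter branch is small, which matches the stated goal of lightweight adjustment with stable original features.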

Regarding the attention mechanism, the network uses Gated Attention. Figure 3 shows the details of the Gated Attention. The attention weights are computed as follows:

Figure 3

This architecture not only enhances the model’s sensitivity to critical instances through the attention mechanism but also increases the flexibility of feature representations via the Adapter module, enabling the model to better handle the heterogeneity and long-tail distribution inherent in pathology WSIs. Compared with conventional attention-based MIL, the proposed Adapter-enhanced structure demonstrates improved robustness in feature representation and model generalization.
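The attention-weight expression referred to above is not reproduced in this text. In the gated attention formulation commonly used for attention-based MIL, which the description appears to follow, the weight of instance k among N instances and the resulting slide-level embedding z are as sketched below. Here h_k denotes the instance feature entering the attention module, V and U are learned projection matrices, w is a learned vector, sigm is the logistic sigmoid, and ⊙ is element-wise multiplication; the hidden width M is an assumption, as is the term-by-term match to the authors' implementation.

```latex
a_k = \frac{\exp\left\{ w^{\top}\left( \tanh\left(V h_k\right) \odot \mathrm{sigm}\left(U h_k\right) \right) \right\}}
           {\sum_{j=1}^{N} \exp\left\{ w^{\top}\left( \tanh\left(V h_j\right) \odot \mathrm{sigm}\left(U h_j\right) \right) \right\}},
\qquad
z = \sum_{k=1}^{N} a_k\, h_k,
\qquad
V, U \in \mathbb{R}^{M \times L},\; w \in \mathbb{R}^{M}
```

The sigmoid branch acts as a gate on the tanh branch, and the softmax normalization ensures that the weights over all patches of a slide sum to one.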

During the training of the multi-instance learning model, the TNM stage distribution of ESCA exhibits a pronounced long-tail pattern, with substantial differences in sample sizes across stages. To mitigate the adverse effects of class imbalance on model training, this study employs a weighted cross-entropy loss function for optimization. Specifically, for each class c with n_c samples, the class weight is calculated as:

Here, the form of log(1 + n_c) allows the model to assign higher training attention to underrepresented classes while avoiding excessively large or small weights. The weighted cross-entropy loss function is then defined as:

Here, y_c denotes the one-hot encoding of the true label, ŷ_c represents the predicted probability from the model, and C is the total number of classes. This weighting strategy enables the model to learn more effectively from underrepresented classes during training, thereby improving the predictive performance for rare TNM stages, while maintaining stable performance for the majority classes.
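The weighting scheme can be sketched numerically. The weight and loss expressions are not reproduced in this text, so the block assumes the inverse-logarithmic form w_c = 1 / log(1 + n_c) suggested by the description, and a loss of the form L = -Σ_c w_c · y_c · log(ŷ_c) with ŷ_c the predicted probability of class c. The stage counts are those of the full cohort; in practice the weights would presumably be computed from each training fold.

```python
import math

# Stage counts of the full cohort; per-fold training counts would differ.
STAGE_COUNTS = {"I": 14, "II": 64, "III": 45, "IV": 5}


def class_weights(counts):
    """Assumed weighting: w_c = 1 / log(1 + n_c), natural logarithm."""
    return {c: 1.0 / math.log(1.0 + n) for c, n in counts.items()}


def weighted_cross_entropy(probs, true_class, weights):
    """Loss for one slide: -w_c * log(p_c) for the true class c.

    With a one-hot label, the sum over classes reduces to this one term.
    """
    return -weights[true_class] * math.log(probs[true_class])


weights = class_weights(STAGE_COUNTS)
for stage, w in weights.items():
    print(stage, round(w, 3))

# The same predicted probability (0.5) for a rare and a common true stage.
probs_iv = {"I": 0.2, "II": 0.2, "III": 0.1, "IV": 0.5}
probs_ii = {"I": 0.2, "II": 0.5, "III": 0.2, "IV": 0.1}
loss_iv = weighted_cross_entropy(probs_iv, "IV", weights)
loss_ii = weighted_cross_entropy(probs_ii, "II", weights)
print(round(loss_iv / loss_ii, 2))
```

Under this assumed form, the rarest stage (IV) is weighted about 2.3 times as heavily as the most common stage (II), compared with 12.8 times under plain inverse-frequency weighting. This illustrates how the logarithm raises the contribution of rare stages without producing extreme weights.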

3.4 Model development and validation

In the feature extraction stage, image patches were encoded using a pretrained UNI (41) to obtain high-level semantic features capturing cellular morphology, tissue architecture, and microenvironment information. These features enhance generalization in small-sample scenarios and are subsequently aggregated via the MIL module for slide-level prediction. We also employed CONCH (42) for feature extraction as a comparative baseline. For data splitting, this study employed 3-fold cross-validation with stratified sampling to preserve TNM stage and histological subtype (ESCC/AC) distributions across folds. Each patient is thus used for validation in exactly one fold and for training in the other two, which makes the performance estimate more robust. In each fold, the dataset was divided into training and validation sets, and the final results were obtained by averaging the performance metrics across the three folds. The model was implemented in PyTorch and trained on an NVIDIA A100 GPU (48 GB). The Adam optimizer was employed with an initial learning rate of 0.0001, dynamically adjusted using a cosine annealing scheduler. Since the MIL framework trains at the slide level, the batch size was set to 1, with 50 training epochs per fold. For each fold, the model achieving the best validation performance was saved. This training strategy is intended to promote the generalizability and stability of the model in complex pathological scenarios.
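The learning-rate schedule can be made concrete with the closed form of cosine annealing. The sketch below uses the stated initial rate (0.0001) and epoch count (50); the minimum rate of zero and the choice of a single half-cycle spanning all 50 epochs without restarts are assumptions, since the text does not state them.

```python
import math

BASE_LR = 1e-4   # initial learning rate stated in the text
EPOCHS = 50      # training epochs per fold stated in the text
MIN_LR = 0.0     # assumption: the floor of the schedule is not stated


def cosine_annealed_lr(epoch, base_lr=BASE_LR, total=EPOCHS, min_lr=MIN_LR):
    """Learning rate at a given epoch under one cosine half-cycle."""
    cosine = math.cos(math.pi * epoch / total)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


schedule = [cosine_annealed_lr(e) for e in range(EPOCHS + 1)]
print(schedule[0], schedule[25], schedule[50])
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and decays smoothly toward the floor by the final epoch.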

3.5 Statistical analyses

All statistical analyses in this study were conducted using Python (v3.10). Model performance was assessed using metrics including Accuracy (ACC), F1-score, and the Area Under the Curve (AUC) of the multi-class ROC. For cross-validation results, the mean and standard deviation across folds were reported to evaluate model stability. Multi-class confusion matrices were employed to visualize predictive performance across different TNM stages.
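As a reminder of how accuracy and a macro-averaged F1-score are derived from a multi-class confusion matrix, a minimal sketch follows. The matrix is made up for illustration and does not reproduce any result of this study; the function names are likewise illustrative.

```python
def accuracy(confusion):
    """Fraction of correct predictions; rows are true classes."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    A class with no true positives contributes an F1 of 0.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up 4-class confusion matrix (Stage I-IV); not results of the study.
confusion = [
    [3, 2, 0, 0],
    [1, 16, 3, 0],
    [0, 3, 11, 0],
    [0, 0, 1, 0],
]
print(round(accuracy(confusion), 4), round(macro_f1(confusion), 4))
```

In this toy example the single Stage IV case is missed, which leaves accuracy at 0.75 but pulls the macro F1 down to about 0.55, because macro averaging gives every stage equal weight regardless of its size.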

4 Results

4.1 Comparison of MIL models

Table 1 presents the classification performance of various MIL models using two feature extractors, CONCH and UNI, with metrics including ACC, AUC, and F1 score. Overall, both the choice of feature extractor and the MIL model substantially affect performance. With CONCH features, DS_MIL outperforms the other conventional MIL models in ACC, yet its F1 score (68.94%) remains lower than that of our proposed method, suggesting that conventional approaches may struggle with imbalanced samples or with identifying key instances. Our method achieves an ACC of 71.43%, AUC of 91.70%, and F1 of 74.28% on CONCH features: it matches DS_MIL in ACC and attains the highest F1, reflecting its ability to capture important instances and integrate global information, while DS_MIL retains the highest AUC (93.69%).

Table 1

Feature extractor	Models	ACC(%)	AUC(%)	F1(%)
CONCH	AB_MIL	64.28	89.33	44.59
	DS_MIL	71.43	93.69	68.94
	CLAM_MIL	59.52	61.31	59.52
	Trans_MIL	61.90	89.18	44.32
	Ours	71.43	91.70	74.28
UNI	AB_MIL	73.81	87.10	72.64
	DS_MIL	76.19	92.52	79.76
	CLAM_MIL	76.19	90.34	76.19
	Trans_MIL	66.67	88.00	66.67
	Ours	78.57	94.05	81.36

Performance comparison of different MIL-based models on CONCH and UNI feature extractors.

Metrics include ACC, AUC, and F1 score. “Ours” denotes the proposed method, which achieves the highest ACC and F1 with both feature extractors (tied with DS_MIL in ACC on CONCH) and the highest AUC with UNI.

With UNI features, every model improves in ACC and F1 relative to CONCH, suggesting that UNI features provide richer discriminative information; changes in AUC are mixed, with AB_MIL, DS_MIL, and Trans_MIL slightly lower. While DS_MIL and CLAM_MIL achieve an ACC of 76.19%, our proposed method attains the highest values across ACC (78.57%), AUC (94.05%), and F1 score (81.36%). Overall, our method attains the best or joint-best ACC and the best F1 with either feature extractor, showing clear advantages over existing MIL approaches.

4.2 Ablation studies

Table 2 summarizes the performance of various models using two different feature extractors, CONCH and UNI, evaluated in terms of ACC, AUC, and F1 score. For the CONCH feature extractor, the baseline model achieves an ACC of 64.28%, an AUC of 89.33%, and an F1 score of 44.59%. Introducing the Adapter module substantially improves all metrics, with ACC rising to 69.05%, AUC to 94.29%, and F1 to 74.23%. The Attention module improves ACC and F1 moderately but slightly decreases AUC. Our proposed method, which combines both modules, achieves an ACC of 71.43%, an AUC of 91.70%, and an F1 of 74.28%, outperforming the baseline and both single-module variants in ACC and F1, although the adapter-only variant has the higher AUC.

Table 2

Feature extractor	Models	ACC(%)	AUC(%)	F1(%)
CONCH	Baseline	64.28	89.33	44.59
	+Adapter	69.05	94.29	74.23
	+Attention	69.05	88.78	67.86
	Ours	71.43	91.70	74.28
UNI	Baseline	69.05	88.78	67.86
	+Adapter	76.19	91.84	81.40
	+Attention	73.81	94.24	74.60
	Ours	78.57	94.05	81.36

Ablation experiments evaluating the contribution of different modules in the proposed framework.

For the UNI feature extractor, the baseline model already achieves higher performance (ACC 69.05%, AUC 88.78%, F1 67.86%) compared with CONCH. Incorporating the Adapter module further boosts all metrics, reaching ACC 76.19%, AUC 91.84%, and F1 81.40%. The Attention module achieves a high AUC of 94.24% but a relatively lower F1 score. Our approach consistently maintains strong performance across all metrics, achieving ACC 78.57%, AUC 94.05%, and F1 81.36%. Overall, these results demonstrate that the proposed method achieves superior or competitive performance across different feature extractors, particularly in terms of ACC and F1, indicating its effectiveness in capturing discriminative features for the classification task.

4.3 Explainability

To further interpret the model’s predictions and evaluate its focus on diagnostically relevant regions, we visualized the attention distribution across WSIs. As shown in Figure 4, the attention heatmaps indicate that the model assigns higher weights to regions with dense tumor cells, nuclear atypia, and structural abnormalities. The top-15 attention patches extracted from each WSI correspond to areas critical for TNM staging, such as regions with invasive tumor fronts or high-grade dysplasia. These visualizations confirm that the MIL-based model not only integrates information from the entire slide but also effectively identifies the most informative tissue regions, enhancing interpretability and clinical relevance of the predictions.
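Selecting the highest-attention patches for visualization amounts to a top-k query over the per-patch attention weights. The sketch below uses a toy slide with made-up weights; the function name, the grid layout, and the weight profile are illustrative assumptions, and only the value k = 15 comes from the text.

```python
import heapq


def top_k_patches(attention, coords, k=15):
    """Return the k (weight, coordinate) pairs with the largest weights."""
    return heapq.nlargest(k, zip(attention, coords), key=lambda pair: pair[0])


# Toy slide: 100 patches on a 10 x 10 grid of 224-pixel tiles, with
# made-up attention weights that peak at patch index 42.
coords = [(224 * (i % 10), 224 * (i // 10)) for i in range(100)]
attention = [1.0 / (1.0 + abs(i - 42)) for i in range(100)]

top = top_k_patches(attention, coords, k=15)
print(top[0])
```

The returned coordinates identify where on the slide the selected patches lie, so they can be overlaid on the heatmap or cropped for review.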

Figure 4

5 Discussion

In this study, we developed a deep learning model based on WSIs that leverages a MIL framework to predict TNM staging of ESCA. The proposed model demonstrates that pathology-driven AI can provide accurate, slide-level predictions of overall TNM stage, highlighting the feasibility of automated histopathological staging in clinical practice.

A key advantage of our model lies in its ability to focus on critical morphological regions that are closely associated with tumor invasion and metastasis. Attention-driven aggregation enables the model to highlight regions that may reflect biological behaviors underlying disease progression, including lymphovascular invasion and submucosal spread. Compared with conventional TNM staging, which relies on manual pathological assessment, our model captures subtle morphological cues across the entire slide that are often overlooked. Similar approaches in other cancers, such as colorectal, lung, and breast cancer, have shown that MIL-based models can effectively link histopathological features with clinically relevant outcomes, demonstrating the translational potential of this methodology. Although N and M status cannot be directly observed on primary tumor WSIs, established histological surrogates of metastatic potential — including lymphovascular invasion, tumor budding, invasion depth, and high-grade nuclear atypia — are spatially encoded within the primary resection specimen and are precisely the type of patch-level signal that attention-based MIL is designed to aggregate. The composite pathological TNM stage thus serves as a valid, biologically grounded supervisory signal, and the model learns to approximate it from primary tumor morphology rather than from direct N or M observation. This paradigm is consistent with WSI-based prediction of nodal status and molecular subtypes from primary tumor histology in colorectal and lung cancer (, 43). We acknowledge, however, that Stage III vs. IV discrimination — where M1 is the sole differentiator — represents the biological ceiling of this approach, and has informed our decision to frame the present results as proof-of-concept performance.

Clinically, WSI significantly improves the detection of small lesions. The prognosis of ESCA is closely associated with lymph node metastasis, yet the identification of micrometastases under conventional microscopy heavily relies on the pathologist’s experience and attention, posing a risk of missed diagnoses. The high-resolution digital images provided by WSI offer a solid foundation for the development of artificial intelligence algorithms, facilitating automated and standardized pathological staging. For instance, deep learning models have demonstrated excellent performance in identifying tumor regions and quantifying invasion depth, with efficiency and reproducibility far surpassing manual evaluation. This not only alleviates pathologists from repetitive tasks, allowing them to focus on challenging cases, but also optimizes the overall diagnostic workflow, enhancing both clinical efficiency and accuracy.

Despite these promising results, this study has several limitations. First, the dataset was derived solely from TCGA, which may limit the generalizability of the model due to sample selection biases, limited histological diversity, and the absence of inter-institutional staining and scanner variability. The present findings should therefore be interpreted as proof-of-concept performance rather than clinical readiness; external multi-institutional validation is essential before translational deployment and is planned as a primary objective of future work. Second, the relatively small cohort (n=128) and pronounced class imbalance across TNM stages, particularly the underrepresentation of Stage I and Stage IV cases, may constrain the model’s learning of rare-stage-specific features, despite the use of class-weighted loss and stratified cross-validation. Expanding the dataset through additional public repositories (e.g., CPTAC-ESCA) and prospective multi-center data collection will be prioritized. Third, while attention heatmaps provide qualitative interpretability, quantitative validation against pathologist-annotated regions of interest (e.g., invasion fronts, lymphovascular invasion foci) is lacking; future work will incorporate such expert-guided evaluation to confirm the biological relevance of model-highlighted regions.

It is important to emphasize that the proposed WSI-based model is intended to augment, not replace, the conventional radiological staging workup, including computed tomography (CT) and positron emission tomography (PET-CT). While our approach provides complementary pathological staging information from resected primary tumor slides, it does not assess distant metastatic burden or pre-operative nodal status, which require dedicated imaging modalities. The clinical value of this tool lies in its ability to provide an objective, reproducible, and automated corroboration of pathological stage from histology alone, supporting multidisciplinary tumor board review and informing post-operative prognostic assessment. Moreover, integrating multimodal information, including radiological imaging, pathological slides, and genomic data, could further enhance predictive performance and provide a more comprehensive understanding of tumor biology, paving the way for more precise and individualized clinical decision-making.

6 Conclusion

In this study, we developed a WSI-based deep learning model that combines a fine-tuned pathology foundation model with a MIL framework to automatically predict the overall TNM stage of ESCA. By integrating attention mechanisms with adapter modules and training with a class-weighted loss, the model focuses on critical pathological regions, improves recognition of underrepresented stage categories, and improves overall prediction accuracy and robustness on the TCGA-ESCA cohort. Pending external validation, this work provides a proof-of-concept approach for automated pathological staging of ESCA and lays the foundation for future multimodal integration, personalized treatment decision-making, and intelligent pathology analysis.
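
The way a class-weighted cross-entropy loss favors underrepresented stage categories can be sketched as follows. This is a generic illustration, not the authors' implementation: the inverse-frequency weighting rule, the function names, and the toy label counts are assumptions, and the counts do not reflect the actual stage distribution of the cohort.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def inverse_frequency_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight class c by N / (K * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}


def weighted_cross_entropy(logits: Sequence[float], target: int,
                           class_weights: Dict[int, float]) -> float:
    """Per-slide loss: w[target] * -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return class_weights[target] * (log_sum_exp - logits[target])


if __name__ == "__main__":
    # Hypothetical labels for four stage classes (0..3), with classes 0 and 3 rare.
    labels = [0] * 2 + [1] * 10 + [2] * 6 + [3] * 2
    weights = inverse_frequency_weights(labels)
    uniform_logits = [0.0, 0.0, 0.0, 0.0]
    for stage in sorted(weights):
        loss = weighted_cross_entropy(uniform_logits, stage, weights)
        print(f"class {stage}: weight={weights[stage]:.3f} loss={loss:.3f}")
```

With identical logits, a misclassified slide from a rare class contributes a larger loss than one from a common class, which is the mechanism by which such weighting counteracts class imbalance during training.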

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

JC: Conceptualization, Data curation, Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing. JW: Resources, Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

Author JW was employed by Manteia Technology, Co., Ltd.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2026.1832365/full#supplementary-material

Supplementary Table 1

Patient characteristics and TNM stage distribution (n = 128).

Summary

Keywords

esophageal cancer, foundation model, multiple instance learning, TNM staging, whole slide imaging

Citation

Cao J and Wang J (2026) TNM staging of esophageal cancer using fine-tuned pathology foundation models and multiple instance learning. Front. Oncol. 16:1832365. doi: 10.3389/fonc.2026.1832365

Received

17 March 2026

Revised

24 April 2026

Accepted

11 May 2026

Published

29 May 2026

Volume

16 - 2026

Edited by

Mingon Kang, University of Nevada, United States

Reviewed by

Adel Aref, Children’s Medical Research Institute, Australia

Arvind Mukundan, Sanjivani University, India

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jiacheng Wang, jiachengw@manteiatech.com
