Deep Learning Model for Predicting the Pathological Complete Response to Neoadjuvant Chemoradiotherapy of Locally Advanced Rectal Cancer

Objective This study aimed to develop an artificial intelligence model for predicting the pathological complete response (pCR) to neoadjuvant chemoradiotherapy (nCRT) of locally advanced rectal cancer (LARC) using digital pathological images. Background nCRT followed by total mesorectal excision (TME) is a standard treatment strategy for patients with LARC. Predicting the PCR to nCRT of LARC remine difficulty. Methods 842 LARC patients treated with standard nCRT from three medical centers were retrospectively recruited and subgrouped into the training, testing and external validation sets. Treatment response was classified as pCR and non-pCR based on the pathological diagnosis after surgery as the ground truth. The hematoxylin & eosin (H&E)-stained biopsy slides were manually annotated and used to develop a deep pathological complete response (DeepPCR) prediction model by deep learning. Results The proposed DeepPCR model achieved an AUC-ROC of 0.710 (95% CI: 0.595, 0.808) in the testing cohort. Similarly, in the external validation cohort, the DeepPCR model achieved an AUC-ROC of 0.723 (95% CI: 0.591, 0.844). The sensitivity and specificity of the DeepPCR model were 72.6% and 46.9% in the testing set and 72.5% and 62.7% in the external validation cohort, respectively. Multivariate logistic regression analysis showed that the DeepPCR model was an independent predictive factor of nCRT (P=0.008 and P=0.004 for the testing set and external validation set, respectively). Conclusions The DeepPCR model showed high accuracy in predicting pCR and served as an independent predictive factor for pCR. The model can be used to assist in clinical treatment decision making before surgery.


INTRODUCTION
Colorectal cancer remains one of the leading causes of cancer death (1). For patients with locally advanced rectal cancer (LARC), neoadjuvant chemoradiotherapy (nCRT) followed by total mesorectal excision (TME) is recommended as a standard treatment strategy. nCRT can significantly reduce local recurrence and treatment-associated toxicity and more importantly, make tumors more amenable to resection. However, the treatment response to nCRT varies greatly among patients. Approximately 15-38% of patients could obtain a pathological complete response (pCR) and are recommended the watch and wait approach to avoid the side effects of surgery (2), while 20% of patients have little to no response to nCRT and might even suffer significant side effects and miss their best opportunity for surgery (3)(4)(5). More importantly, patients with pCR have better long-term outcomes, indicating a favorable prognosis (6). However, how to predict treatment response, especially to identify pCR candidates prior to nCRT, remains challenging for LARC.
Previous studies have shown that tumor stage, serum tumor markers before neoadjuvant therapy, and lymphocyte infiltration in the tumor microenvironment are associated with tumor regression to nCRT (7). Recently, with the development of artificial intelligence algorithms, radiological imaging has been used to evaluate the treatment response of LARC (8)(9)(10)(11)(12)(13)(14). The commonly adopted imaging techniques include diffusion-weighted magnetic resonance imaging (MRI) (11), diffusion kurtosis and T2weighted MRI (8), and a multiparametric MRI protocol with dynamic-contrast-enhanced MRI (13). For instance, Zhang et al. (10) developed a pCR prediction model based on diffusion kurtosis and T2-weighted MRI, and the area under the curve (AUC) was 0.70 (95% confidence interval (CI): 0.59, 0.79). Histopathological images prevail as the gold standard for patient diagnosis and contain abundant biological information. Therefore, we anticipate that more accurate predictions can be achieved by analyzing pathological images than by analyzing radiological images.
Compared with conventional machine learning, deep learning can automatically extract features from an image without the necessity of feature predefinition and is suitable for mining the most relevant feature representations. Multi-instance learning (MIL), as a weakly supervised deep learning technique, has achieved promising results on the topic of patient prognosis and outcome prediction (15)(16)(17)(18). MIL enables the network to learn more holistic information from whole-slide images (WSIs). To the best of our knowledge, there has been little investigation on the prediction of pCR based on histopathological images prior to nCRT with the MIL technique. The aim of this study was to develop a deep pathological complete response (DeepPCR) prediction model for the prediction of pCR directly from conventional hematoxylin & eosin (H&E)-stained histopathological images.

Study Cohort and Availability
Two different cohorts, i.e., the primary cohort and external validation cohort, were adopted for training and internal and external validation and included retrospectively identified LARC patients from January 1, 2010, to January 1, 2018, from three hospitals in China (the Sixth Affiliated Hospital of Sun Yat-sen University, Cancer Center of Sun Yat-sen University, and Yunnan Cancer Hospital). A total of 842 patients were recruited; among them, the primary cohort (783 patients from the Sixth Affiliated Hospital of Sun Yat-sen University and Cancer Center of Sun Yat-sen University) was randomly subgrouped into the training set (666 patients, 85%) and testing set (117 patients, 15%), and the external validation cohort (from Yunnan Cancer Hospital) contained 102 patients. The inclusion criteria were as follows: (1) patients had locally advanced disease determined by pretreatment TNM stage (T3/ T4, and/or N+); (2) biopsy was performed, and the biopsy specimen was pathologically diagnosed as adenocarcinoma; and (3) patients underwent nCRT followed by rectal resection. The exclusion criteria were as follows: (1) patients with familial adenomatous polyposis, distant metastases, or Lynch syndrome; and (2) patients with no information on tumor regression grade (TRG) and no available H&E-stained slides.
All patients accepted a standard treatment strategy based on the National Comprehensive Cancer Network (NCCN) guidelines (version 3, 2017). The nCRT regimen was 50 Gy pelvic radiation therapy with concurrent 5-fluorouracil-based chemotherapy (FOLFIRI or FOLFOX regimens). TME was performed by either anterior resection or abdominoperineal resection after nCRT of 4-8 weeks. The TRG after nCRT was used to divide patients into two groups based on H&E-stained slides after surgery: pCR (with no remaining viable cancer cells) and non-pCR (with small clusters of cancer cells or no response with extensive residual cancer). The flow diagram of patient enrollment into the two cohorts is shown in Figure 1.
Clinicopathological variables, such as age, sex, TNM stage, histological grade, TRG after surgery, and blood testing parameters, including lymphocytes, neutrophils, carcinoembryonic antigen (CEA), carbohydrate antigen 19-9 (CA19-9), and lactate dehydrogenase (LDH) prior to nCRT treatment, were collected. This study was approved by the Institutional Review Board of the Sixth Affiliated Hospital of Sun Yat-sen University.

Data Preparation
Formalin-fixed paraffin-embedded (FFPE) biopsy tissue blocks were cut into 4-µm sections for H&E staining. All slides were checked by a pathologist who ascertained that they contained tumor areas. WSIs were acquired at a magnification of 20× on an Aperio scanner.
Tumor tissue regions were hand-delineated by pathologists (Dr. XYL and Dr. HLL) using Aperio Image Scope software and subsequently cropped into patches with a size of 299×299 pixels at a magnification of 20×. The distribution of the number of patches per slide followed a long-tail distribution, with the majority of slides containing approximately 100 patches. For slides with more than 1000 patches, we randomly chose 1000 cropped patches.

pCR Candidate Classification
Four models were designed for classifying the input biopsy histological images, with patients' distinct TRG outcomes as the ground truth. The first three models were trained on 102,728 patches and tested on 18475 patches in the primary cohort, namely, the DeepPCR model, patch-based combined model, and patch-based individual model. The DeepPCR model was built upon the MIL strategy ( Figure 2). Specifically, a pretrained ResNet-18 model (19) was leveraged to extract the pathological feature representations of each cropped patch, i.e., the patch-wise phenotype representation (patchPR). Based on the patchPRs, the unsupervised K-means algorithm was used to categorize these features into six clusters (see Supplementary Figure 1). Each cluster occupied a subspace of the features and comprised a distinctive phenotype group. The patches in each cluster were further processed by a multi-instance fully convolutional model (MI-FCM) (20) to generate cluster-wise phenotype representation (clusPR). Herein, the MI-FCM was comprised of two pairs of Conv-ReLU layers, followed by a pooling layer. Afterwards, WSI-wise phenotype representation (wsiPR) was constructed by concatenating the clusPRs from the same WSI. The wsiPR sufficiently exploited the intercluster feature difference and intracluster feature dependence, constituting the most informative phenotype representation. Based on wsiPR, a two-layer fully connected network was leveraged to generate the final prediction. The DeepPCR model built a hierarchical feature structure from patch to WSI and explicitly modeled the mutual dependence between different phenotype groups for patient outcome prediction.
The patch-based combined model and patch-based individual model used patch-based approaches in which the cropped patches shared the same label with the original histopathological WSI and the prediction of patch-based methods was made for each patch rather than each WSI. Similar to DeepPCR, the pretrained ResNet-18 model was adopted to extract the phenotype representations of each cropped patch. According to the aggregation method of the patch-level prediction, we implemented these patch-based models in two ways. One was to predict each individual patch's label, and then combined them via majority voting, which was called the patch-based individual model. The other was to aggregate the patchlevel predictions of each subject by removing the clustering step in DeepPCR, called the patch-based combined model (remaining modules are the same as DeepPCR). To validate the effectiveness of pathological imaging data in pCR outcome prediction compared with nonpathological data, the fourth model (hematology model), based on clinical hematology data, including CEA, CA19-9, LDH, lymphocytes, and neutrophils, was built. A two-layer multilayer perceptron (MLP) model was adopted in the hematology model.

Phenotype Visualization
To visualize the representative phenotypes in each K-means cluster, t-distributed Stochastic Neighbor Embedding (t-SNE) (21) and the Raster Fairy method (22) were applied on the patchPRs. t-SNE is a technique for dimensionality reduction that is particularly well suited for the visualization of highdimensional data. The Raster Fairy method aims to transform the two-dimensional clustering data derived from t-SNE into a regular grid without destroying the neighborhood relations emerging from the clustering. The GradCAM method (23) was used to calculate the patch importance for target prediction.

Statistical Analysis
The predictive efficacy of the model was evaluated by the area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Univariate and multivariate logistic regression analyses were performed to investigate the predictive value for all biomarkers. The statistical significance of the differences in the clinicopathological characteristics of pCR and non-pCR patients were calculated using the Mann-Whitney test (two-tailed) for continuous variables and Fisher's exact test (two-tailed) for dichotomous variables. Comparisons of clinicopathological factors in the primary and external validation cohorts were performed using Student's t test for continuous variables and Fisher's exact test (two-tailed) for dichotomous variables. A twosided p value of less than 0.05 was considered statistically significant.

Patient Characteristics
The primary cohort included 783 patients: 295 patients from the Sixth Affiliated Hospital of Sun Yat-sen University and 488 patients from the Cancer Center of Sun Yat-sen University. A total of 201 and 582 patients were classified as pCR and non-pCR, respectively. The external validation cohort from Yunnan Cancer Hospital included 102 patients, of which 24 and 78 patients were classified as pCR and non-pCR, respectively. The clinicopathological characteristics of the patients in the primary and external validation cohorts are provided in Table 1. The clinicopathological characteristics, including clinical T stage and histological grade, were different between the primary and external validation cohorts (P<0.001 and P<0.001, respectively) (Supplementary Table 1).

pCR Candidate Prediction in the Primary Cohort
The DeepPCR model had a higher discriminative power, with an AUC-ROC of 0.710 (95% CI: 0.595, 0.808) and an AUC-PR of 0.875 (95% CI: 0.795, 0.935) in the primary cohort ( Figure 3A and Table 2A). The sensitivity, specificity, PPV and NPV were 72.6%, 46.9%, 70.4%, and 54.0%, respectively ( Table 2A) Table 2A). As shown in Figure 3E, the AUC-ROC of the DeepPCR model was significantly higher than that of the hematology model (P < 0.001) and patch-based individual model (P < 0 .05).  Table 2B). In external cohorts, the AUC-ROC of the DeepPCR model was significantly higher than that of the hematology model (P < 0.001) and patch-based individual model (P < 0 .05) ( Figure 3F).

Univariate and Multivariate Analyses
In the primary cohort, the univariate logistic regression analysis showed that CEA and DeepPCR model were significantly correlated with pCR (P=0.033 and 0.0001, respectively) ( Table 3A). Multivariate logistic regression analysis showed that only DeepPCR was an independent factor for predicting pCR (95% CI: 1.646, 28.743; P=0.008) ( Table 3B).

Histological Patterns Associated With TRG
To find some important clinical insights based on the DeepPCR model, we determined which types of histological patterns were most relevant to patient TRG, and the pipeline of this process is displayed in Figure 4. In Figure 4A, each grid represented an individual patch, and the patchPRs obtained from all these patches were categorized into six phenotype clusters ( Figure 4B), which were reduced into a two-dimensional feature space based on t-SNE and the Raster Fairy method. Here, the phenotypes could be color, edges, texture, curve and/or shape of cancer and normal tissues. To conduct an investigation into which types of phenotypes contribute the most to pCR prediction, the GradCAM method (23) was adopted to calculate the importance of patches. The importance heatmap is shown in Figure 4C, and darker colors indicate that the patches played a more important role in pCR prediction. We also calculated the sum of the importance values of the patches in each cluster ( Figure 4D). It can be seen that different clusters had different predictive powers for pCR prediction, and a larger value indicated that the corresponding cluster contributed more to DeepPCR. The size of bubbles represents the number of patches in the corresponding cluster. We found that patches in clusters 0 and l played more important roles in pCR candidate prediction. Specifically, the patch importance value of cluster 1 was significantly larger than that of clusters 2, 3, 4, and 5 (P<0.001, P<0.05, P<0.01, and P<0.001, respectively). There were no significant differences between cluster 0 and cluster 1 in terms of the patch importance value ( Figure 4D). Figure 4E shows the representative patches of cluster 1 and their distribution in a WSI, which also represented a special histological pattern and spatial pattern highly associated with pCR. Similarly, Figures 4F-J demonstrates the same patchPR visualization process but for the non-pCR group. The patch importance value of cluster 2 was significantly larger than that of clusters 0, 1, 3, 4, and 5 (P<0.001, P<0.001, P<0.001, P<0.001, and P<0.001, respectively) ( Figure 4I) in non-pCR candidate prediction.

DISCUSSION
In the present study, we developed a novel model to predict pCR in LARC using digital pathological images. We found that the DeepPCR model could achieve a relatively high AUC-ROC score of 0.710. Multivariate logistic analysis showed that the DeepPCR model was indeed an independent factor for predicting pCR, indicating that the model could assist in treatment decision making prior to surgery for LARC. In recent years, there has been increasing interest in digital pathology image analysis based on machine learning algorithms to assist in pathological diagnosis (24,25). In this work, we used a probability threshold of 0.7 (that is, any patient with a pCR prediction probability greater than 0.7 was reported as a pCR candidate  The CI value is inside the parentheses. Sen, sensitivity; Spe, specificity; PPV, positive predictive value; NPV, negative predictive value. In this work, we used a probability threshold of 0.7 (that is, any patient with a pCR prediction probability greater than 0.7 was reported as a pCR candidate).  Radiological imaging has its own limitation in distinguishing inflammatory lesions from neoplastic lesions. As the gold standard of disease diagnosis, conventional preoperative pathological biopsy is of great significance for the diagnosis and prognosis of tumors. The discriminative power of DeepPCR model was significantly higher than that of the hematology model (P<0.001) and the other two patch-based models (P<0.001 and P<0.001, respectively). The number of patients in our study was larger than other reported works (8)(9)(10)(11)(12)(13)(14). Moreover, the DeepPCR model was evaluated in independent cohorts. The external validation cohort came from another center with a different sample handling procedure and using a different scanner. Although the external validation cohort was different from the primary cohort in terms of the clinicopathological characteristics, such as clinical T stage and histological grade (P<0.001 and P<0.001, respectively), the proposed model achieved similar results as those in the primary cohort, indicating its generalizability and robustness.
The proposed model leveraged an MIL-based deep learning model and showed a superior performance compared to previous patch-based learning methods. Existing patch-based approaches can be categorized into two classes based on the level of the employed annotations. For the first class, patch-wise annotations are used to train deep learning models (26)(27)(28)(29)(30), and strong supervision is typically performed, benefiting from the precise labeling information. Nevertheless, these methods depend on pixel-level annotations by expert pathologists, and it would be labor intensive and hard to obtain sufficient high-quality annotation data. For the second class of methods, the groundtruth labels are provided for the whole images rather than the patches (31,32). When performing the learning process, the global image-level label of each WSI is taken as the patch-level label directly, and the final prediction is generated by combining the patch-level outputs. Although this type of method is very straightforward, there are two crucial problems. First, the cropped patches of WSIs are processed independently, and the spatial constraints of these patches are neglected. The second problem is that the patches in the same image indiscriminately share the same label and thus introduce a substantial disturbance to model training. To address these problems, several MIL-based approaches that aim to leverage the feature representations of all image patches to collaboratively predict the patient outcome have been developed (15)(16)(17)(18)33). Building upon these methods, our proposed model can effectively mine the dependence of feature representations at three different scales, i.e., patch-level, cluster-level, and WSI-level phenotype representations. In this patient outcome prediction task, the MIL-based learning method outperformed the patch-based learning methods. Specifically, the MIL-based methods were able to jointly consider intrapatch dependence; thus, the spatial relationships between tumor tissues (including cancer cells and surrounding stromal cells) were exploited. These tissues form the tumor microenvironment (34), and the characterization of the microenvironment plays an important role in tumor progression and the response to treatment. However, the patch-based learning methods only processed patches independently, and the spatial information among patches was neglected; thus, these methods showed poor performance. Our findings suggest that MIL-based learning models can handle the spatial information inherent in the tumor microenvironment.
Some deep learning-based studies visualized and interpreted the learned feature representations (31,(35)(36)(37)(38), which may provide some important clinical insights. For instance, Courtiol et al. (35) identified regions that contributed to patient outcome prediction (mesothelioma classification) by visualizing various scenarios predicted by the deep learning model. They found that these regions are mostly located in the stroma and are associated with inflammation, cellular diversity and vacuolization. Campanella et al. (36) assessed the model by visualizing the features reduced in a 2D space and found that a set of top-ranked patches with probabilities close to 0.5 contained glands suspicious of being malignant. In our study, patchPRs were categorized into six phenotype clusters based on the DeepPCR model. We determined that different clusters had different predictive powers for pCR prediction. We calculated the sum of the importance values of the patches in each cluster and found that the patches in cluster 0 and cluster 1 played more important roles in pCR candidate prediction. Although we did not analyze each cluster in more detail, we proposed that some histological patterns may be associated with the predicted TRG. The novel histological pattern may be associated with the morphological features and microenvironment of the tumor.
Previous studies showed that pretreatment serum CEA levels were significantly correlated with pCR (39). In our study, the univariate logistic regression analysis showed that CEA levels significantly correlated with pCR in the primary cohort (P=0.033) and in the external validation cohort (P=0.042). However, in multivariate logistic regression analysis, this association did not persist, and only the DeepPCR model was an independent factor for predicting pCR (95% CI: 1.646, 28.743; P=0.008). We also conducted pCR prediction experiments based on clinical data, i.e., CEA, CA19-9, LDH, lymphocytes, and neutrophils. In the experimental studies, an AUC-ROC of 0.403 was achieved based on these nonpathological data, showing that they may not be sufficient for prognostic pCR prediction.
Although promising results and relevant clinical insights were found, there are some limitations in this study. First, this study was a retrospective study. A multicenter prospective study is needed to confirm the performance of the prediction model. Second, due to the prevalence of tumor heterogeneity, the representativeness of biopsy specimens was limited. Another limitation of this study was that deep learning has the disadvantage of its black-box nature. Although we determined some histological patterns relevant to patient TRG, the morphological features and microenvironment of each histological pattern should be further investigated.
In conclusion, our study was the first to investigate the nCRT outcome prediction problem in LARC patients using presurgical biopsy pathological images. A clinically useful prediction model was developed using deep learning. The DeepPCR model was evaluated in an independent cohort and achieved stable results. This model has the potential to guide clinicians in making nCRT choices.