Prediction of Breast Cancer Recurrence Using a Deep Convolutional Neural Network Without Region-of-Interest Labeling

Purpose The present study aimed to assign a risk score for breast cancer recurrence based on pathological whole slide images (WSIs) using a deep learning model. Methods A total of 233 WSIs from 138 breast cancer patients were assigned either a low-risk or a high-risk score based on a 70-gene signature. These images were processed into patches of 512x512 pixels by the PyHIST tool and underwent color normalization using the Macenko method. Afterward, out of focus and pixelated patches were removed using the Laplacian algorithm. Finally, the remaining patches (n=294,562) were split into 3 parts for model training (50%), validation (7%) and testing (43%). We used 6 pretrained models for transfer learning and evaluated their performance using accuracy, precision, recall, F1 score, confusion matrix, and AUC. Additionally, to demonstrate the robustness of the final model and its generalization capacity, the testing set was used for model evaluation. Finally, the GRAD-CAM algorithm was used for model visualization. Results Six models, namely VGG16, ResNet50, ResNet101, Inception_ResNet, EfficientB5, and Xception, achieved high performance in the validation set with an overall accuracy of 0.84, 0.85, 0.83, 0.84, 0.87, and 0.91, respectively. We selected Xception for assessment of the testing set, and this model achieved an overall accuracy of 0.87 with a patch-wise approach and 0.90 and 1.00 with a patient-wise approach for high-risk and low-risk groups, respectively. Conclusions Our study demonstrated the feasibility and high performance of artificial intelligence models trained without region-of-interest labeling for predicting cancer recurrence based on a 70-gene signature risk score.


INTRODUCTION
Breast cancer is one of the most common cancer types found in women worldwide (1). Although the overall survival rate of breast cancer has improved in the last decade, prognostication regarding the risk of recurrence and potential biomarkers for assisting clinical treatment decisions have also been the focus of ongoing research (2). Thus, there are numerous studies reporting novel biomarkers and subtyping breast cancer according to recurrence risk (3,4). The standard method for breast cancer classification uses immunohistochemistry (IHC) markers such as progesterone receptor (PR), human epidermal growth factor receptor II (HER2), and estrogen receptor (ER) together with Ki67 (5,6). Other gene expression-based approaches such as PAM50, TargetPrint, MammaPrint, and BluePrint are also available options (7)(8)(9). However, these subtyping methods require analyzing mRNA expression levels using microarray platforms and clustering certain pre-selected genes for designating subtypes. Although each method has its own advantages in assisting clinical treatment decisions, all of them are time-consuming and costly.
The MammaPrint and BluePrint tests are established assays for predicting high and low recurrence risk and subtyping breast cancer into basal, HER2, and luminal (9). In the WGS-PRIMe study, the MammaPrint and BluePrint tests had a high impact in assisting physicians in making their treatment recommendations for early-stage luminal breast cancer patients, with over 92% adherence to the MammaPrint risk assessment for both low- and high-risk patients (10). In addition to this study, the IMPACt and MINDACT studies showed high concordance between treatment decisions and the MammaPrint risk score for determining the necessity of adjuvant chemotherapy (over 88% for low-risk patients and over 78% for high-risk patients) (11)(12)(13). This evidence supports the power and impact of this genetic test in guiding clinical treatment decisions.
In recent years, numerous studies using artificial intelligence (AI) tools for various biological problems have been documented (14)(15)(16)(17). There are two subdomains of AI, namely machine learning and deep learning, which use different approaches for feature selection during model training (18). While the machine learning approach requires domain knowledge to select significant features, deep learning is equipped with an automatic feature extraction capability to learn the differences between groups for prediction and classification tasks without prior knowledge (18). The applications of machine learning and deep learning have been shown to have an enormous impact on biomedical research (19,20). Over the past few years, deep learning methods have matured and are now well recognized in many biomedical fields of study (21,22). The majority of these studies used biomedical images such as pathological (23), radiological (24,25), and digital slide images (26) to train a convolutional neural network (CNN). Deep learning methods can also use other data types, such as DNA and RNA sequencing data and proteomics data, in either raw or processed format, to train a CNN (15)(27)(28)(29). Furthermore, the long-standing goal of using pathological images with deep learning to predict patient outcomes has recently been realized (30). Leveraging the plethora of biological data could facilitate both the CNN training process and independent validation, which could in turn raise model performance beyond the domain-expert level.
To continue the trend of integrating clinical and genetic data with AI, as well as assist physicians in making treatment recommendations, we tasked a deep neural network with predicting high and low risk of recurrence from pathological images of breast cancer patients. In addition, we also developed a novel deep learning pipeline using whole slide images (WSIs) as the only data source, without any input from pathologists for tumor region labeling.

MATERIALS AND METHODS
The workflow of our study is depicted in Figure 1. The WSIs underwent image segmentation for patch selection using the Otsu algorithm (31). Next, selected patches of 512x512 pixels were generated. These small patches then underwent normalization for hematoxylin and eosin staining using the Macenko method (32) and a Python script from https://github.com/schaugf/HEnorm_python with appropriate modifications. Next, blurry and pixelated images were removed using the Laplacian algorithm. The retained images were used for model training, validation, and testing. Finally, to visualize how different models learn to distinguish samples with low-risk and high-risk 70-gene signature scores, we used gradient-weighted class activation mapping (GRAD-CAM) to create the activation heatmap for each image.

Samples
A total of 233 WSIs from 138 breast cancer cases from Taipei Veterans General Hospital were used for model training, validation, and testing. Tumor sections from each patient were obtained and prepared for hematoxylin and eosin staining. Afterward, the stained slides were scanned with an Ultra-Fast Scanner (Philips, USA) to provide the digital slides in TIFF format. These WSIs were then used for generating patches. The same slide that was used for the 70-gene signature test was scanned for our study. The patients' demographics and other information such as age, Nottingham grade, estrogen receptor status, progesterone receptor status, HER2 status, and TNM stage are shown in Table 1.

Patch Generation
Patches from each WSI were prepared with PyHIST (33), which is a Python-based tool that allowed us to generate patches at specified dimensions. We set 512x512 pixels as our patch size, and these patches were obtained at the highest magnification level (20x). Mask down-sampling and tile-crossed image down-sampling were set as the defaults. The Otsu algorithm was used as the tile generation method. We set the tissue content threshold at 0.85 to select patches composed of at least 85% tissue. A total of 294,562 patches were generated from 233 WSIs, 57% of which (n=169,161) were used for model training and fine-tuning. These patches were divided into a training set consisting of 160,000 patches (n = 39 [100,000 patches] for the low-risk group and n = 30 [60,000 patches] for the high-risk group) and a validation set with 9,161 patches. The other 43% of the patches (n=125,401), of which 54,122 and 71,279 came from 20 high-risk patients and 35 low-risk patients, respectively, were used for independent testing of the model performance.
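The tissue-content rule described above can be illustrated with a minimal, standard-library sketch. It is not PyHIST's implementation: the function names are hypothetical, the Otsu threshold is assumed to be computed from a list of 8-bit gray values (for example, from a down-sampled slide), and tissue is assumed to be darker than the slide background.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for a flat list of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]              # number of pixels with value <= t
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0:
            continue
        if w_bright == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def tissue_fraction(patch, threshold):
    """Fraction of pixels in a 2D patch (list of rows) at or below the threshold."""
    values = [v for row in patch for v in row]
    return sum(v <= threshold for v in values) / len(values)


def keep_patch(patch, threshold, min_tissue=0.85):
    """Assumed rule: keep a patch when at least 85% of its pixels are tissue."""
    return tissue_fraction(patch, threshold) >= min_tissue


# Toy data: dark "tissue" values and bright "background" values.
slide_pixels = [20, 25, 30, 35, 40] * 10 + [230, 235, 240, 245, 250] * 10
t = otsu_threshold(slide_pixels)
dense_patch = [[30] * 9 + [240]] * 10        # 90% tissue
sparse_patch = [[30] * 8 + [240] * 2] * 10   # 80% tissue
print(t, keep_patch(dense_patch, t), keep_patch(sparse_patch, t))
```

With this rule, the 90%-tissue toy patch is retained and the 80%-tissue patch is discarded.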

Removal of Blurry and Pixelated Images
The WSIs had blurry regions that were locally out of focus. To overcome this problem, we used the Laplacian algorithm for detecting blurry images based on variance thresholding. The kernel size of the Laplacian operator was 13x13 pixels, which was selected by a trial-and-error approach over kernel sizes from 3x3 pixels to 15x15 pixels. The variance thresholds were set at >1e15 and <1e14. These steps were done with a custom Python script using the OpenCV library (34). After blurry and pixelated images were removed, we manually double-checked the whole dataset to confirm that these thresholds removed all such images.
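The idea behind variance-of-Laplacian filtering can be sketched as follows. This is a simplified illustration and not the custom script used in the study: it swaps the 13x13 OpenCV operator for a plain 3x3 four-neighbour Laplacian written with the standard library, and the band limits `low` and `high` are hypothetical parameters (the 1e14 and 1e15 values in the text depend on the 13x13 kernel and do not transfer to this toy operator).

```python
from statistics import pvariance


def laplacian_variance(img):
    """Variance of a 3x3 four-neighbour Laplacian over the interior pixels.

    img is a 2D list of gray values with at least 3 rows and 3 columns.
    """
    responses = []
    for i in range(1, len(img) - 1):
        for j in range(1, len(img[0]) - 1):
            lap = (img[i - 1][j] + img[i + 1][j] +
                   img[i][j - 1] + img[i][j + 1] - 4 * img[i][j])
            responses.append(float(lap))
    return pvariance(responses)


def in_focus_band(img, low, high):
    """Assumed rule: keep a patch whose Laplacian variance lies in [low, high].

    A very low variance suggests a blurry patch; a very high variance is
    assumed here to indicate a pixelated patch.
    """
    return low <= laplacian_variance(img) <= high


flat = [[128] * 6 for _ in range(6)]                                   # no edges
checker = [[255 * ((i + j) % 2) for j in range(6)] for i in range(6)]  # maximal edges
print(laplacian_variance(flat), laplacian_variance(checker))
```

A uniform patch has zero Laplacian variance, whereas a checkerboard has a very large one, so a band between the two extremes rejects both kinds of patch.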

Model Training
To speed up the training process, we applied transfer learning using weights from 6 pretrained models, namely VGG16, ResNet50 (Res50), ResNet101 (Res101), Inception_ResNet (In_Res), Xception (X_Cept), and EfficientB5 (EB5). These models have achieved high accuracy with the ImageNet dataset (35), which is used as a common model performance benchmark. The purpose of using multiple pretrained architectures was to take advantage of their high performance as well as to compare their performance in pathological image classification. The original patch size of 512x512 pixels was rescaled to 128x128 pixels for model input. The architecture of the pretrained models was kept as it was; however, we used two fully connected layers with 1,024 and 256 neurons to reduce the computational load. The second fully connected layer was connected to an output layer with one neuron to output the model prediction value. Thresholds of 0.3, 0.5, and 0.7 were applied to the sigmoid output. Depending on the selected threshold, a sample with a score below the threshold (<0.3, <0.5, or <0.7) was assigned to the high-risk group; otherwise, it was assigned to the low-risk group.
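The thresholding rule can be written out as a short sketch. The function names and the example logit are illustrative assumptions; only the rule itself (a score below the threshold is called high-risk) comes from the text.

```python
import math


def sigmoid(logit):
    """Map a raw output value to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def classify(score, threshold=0.5):
    """Scores below the threshold are high-risk; all others are low-risk."""
    return "high-risk" if score < threshold else "low-risk"


score = sigmoid(-0.4)  # hypothetical logit, giving a score of about 0.40
calls = {t: classify(score, t) for t in (0.3, 0.5, 0.7)}
print(round(score, 2), calls)
```

A borderline score such as this one is called low-risk at the 0.3 threshold but high-risk at 0.5 and 0.7, which is why the results are reported at all three thresholds.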
We used adaptive moment estimation (Adam) as the optimizer with a learning rate of 1e-5, together with a decay rate of 1e-5/50, for 50 epochs at a batch size of 64. The rectified linear unit (ReLU) was used as the activation function in the hidden layers, whereas sigmoid activation was used in the dense output layer.
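To make the effect of such a small decay rate concrete, the sketch below computes the learning rate over training. The text does not state the framework or the decay formula, so this assumes inverse-time decay applied per update step, as in the `decay` argument of legacy Keras optimizers; the number of steps per epoch is derived from the 160,000 training patches and the batch size of 64.

```python
import math

INITIAL_LR = 1e-5
DECAY = INITIAL_LR / 50        # decay rate given in the text (2e-7)
EPOCHS = 50
BATCH_SIZE = 64
TRAIN_PATCHES = 160_000

steps_per_epoch = math.ceil(TRAIN_PATCHES / BATCH_SIZE)


def lr_at_step(step):
    """Assumed inverse-time decay: lr = lr0 / (1 + decay * step)."""
    return INITIAL_LR / (1.0 + DECAY * step)


final_lr = lr_at_step(EPOCHS * steps_per_epoch)
print(steps_per_epoch, final_lr)
```

Under this assumption the learning rate falls by only about 2.4% over the 50 epochs, so training proceeds at an almost constant rate of 1e-5.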

Model Prediction Visualization
We used gradient-weighted class activation mapping (GRAD-CAM) to visualize the model predictions. The overall idea of GRAD-CAM is to use the final convolutional layer of the model to extract information on how the model made its decision for the final output class (36). After we trained our model with the WSI data and obtained the final optimal weight file, we used these weights to obtain the GRAD-CAM visualization with the last convolutional layer in our model.
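For reference, the GRAD-CAM heatmap is commonly written as follows, using the notation of the original GRAD-CAM formulation (36): A^k is the k-th feature map of the last convolutional layer, y^c is the model score for class c, and Z is the number of spatial positions in the feature map. How y^c is taken from the single sigmoid output neuron of our models is an implementation detail that is not specified here.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A_{ij}^{k}},
\qquad
L_{\mathrm{Grad\text{-}CAM}}^{c} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The weights \alpha_k^c average the gradients over each feature map, and the ReLU keeps only the regions that contribute positively to the class score; the resulting map is then upsampled to the input size and overlaid on the patch.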

RESULTS

Model Training and Validation
Model training with the VGG16, Res50, Res101, EB5, In_Res, and X_Cept pretrained models achieved 83%, 85%, 83%, 87%, 85%, and 91% accuracy, respectively. These results were generated with the validation set, which suggests that the models were not overfit to the training data. Apart from the accuracy metric, other model evaluation metrics were also calculated, namely the weighted precision (Figure 2A), weighted recall (Figure 2B), and weighted F1 score (Figure 2C). Precision is the ratio of true positives to the sum of true positives and false positives from the model prediction, whereas recall is the ratio of true positives to the sum of true positives and false negatives from the model prediction. The F1 score is the harmonic mean of precision and recall. In our study, the lowest precision, recall, and F1 score (83%) came from the Res101 model and the highest (91%) came from the X_Cept model (Figure 2).
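The definitions above can be made concrete with a small sketch that computes support-weighted precision, recall, and F1 from a confusion matrix. The function names and the toy matrix are illustrative and do not reproduce the study's data.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_metrics(cm):
    """Support-weighted averages; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    averages = [0.0, 0.0, 0.0]
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        support = sum(cm[c])  # number of true samples in class c
        for k, value in enumerate(precision_recall_f1(tp, fp, fn)):
            averages[k] += value * support / total
    return tuple(averages)


# Toy matrix: rows are true classes, columns are predicted classes.
toy_cm = [[8, 2],
          [1, 9]]
w_precision, w_recall, w_f1 = weighted_metrics(toy_cm)
print(round(w_precision, 3), round(w_recall, 3), round(w_f1, 3))
```

Note that the support-weighted recall always equals the overall accuracy (17/20 = 0.85 in the toy matrix), which is one reason the weighted metrics and the accuracy values tend to be close.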
Figures 3A-F display the normalized confusion matrix of each model. A confusion matrix shows the true labels and predicted labels of each class as well as the percentage of true positive, false positive, true negative, and false negative predictions. A darker blue color indicates a higher proportion of correct predictions for each class. The lowest-performing model was Res101 and the highest-performing model was X_Cept. For instance, Res101 had 15% false positive and 19% false negative predictions, while the X_Cept model had only 8% false positives and 12% false negatives when predicting high- and low-risk breast cancer patients.
To further visualize the true positive rate and false positive rate of each model on the validation set, we also plotted the receiver operating characteristic (ROC) curve of each model (Figures 4A-F). The highest area under the curve (AUC) was 0.90, which belonged to the X_Cept model, whereas the lowest belonged to Res101 (AUC=0.83). The In_Res and VGG16 models had the same AUC of 0.84, whereas Res50 and EB5 had an AUC of 0.85 and 0.87, respectively.
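As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The sketch below computes it by direct pairwise comparison; it is an illustration on toy data, assumes the positive class is the one with higher scores, and is only practical for small samples because it compares every positive-negative pair.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
print(auc)
```

In the toy example, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.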

Independent Testing of Model Performance With Test Dataset
We evaluated the model performance with the testing dataset, consisting of 125,401 patches from 55 breast cancer patients (20 high-risk and 35 low-risk patients). We used two approaches to evaluate the model prediction performance, namely patch-wise and patient-wise, because in clinical practice each patient would have 3-5 WSIs for the final risk assessment. Both approaches achieved high accuracy (>85%), but the patient-wise approach is better suited to clinical use because it provides higher confidence in the model prediction of the 70-gene signature score. We used only the X_Cept model for this evaluation step, owing to its highest performance in the training and validation phases. The model performance in the independent testing set is reported in Table 2. For the patch-wise approach, the precision, recall, and F1 score of the high-risk group were 0.86, 0.85, and 0.85, while these metrics in the low-risk group were 0.89, 0.89, and 0.89, respectively. Both the macro average and the weighted average were 0.87. The patient-wise results for each individual are displayed in Figures 5A-F. The model accuracy was consistent across the selected thresholds. A minor shift in the high-risk group was found between the chosen thresholds: the prediction probability of high-risk sample H2 shifted by 0.16, from 0.25 to 0.41. A sample from the low-risk group (L35) also showed a 16% difference in prediction probability at the 0.5 threshold relative to the 0.3 and 0.7 thresholds. However, the final prediction results for these 2 samples were unchanged. Overall, the model accuracy was 90% and 100% for the high-risk and low-risk groups, respectively. The overall accuracy reached 96.3% (53/55).
To decode the model learning process, we used GRAD-CAM with the last activation layer to create a heatmap superimposed on the original image (Figures 6A, B). This illustrates how each model learned to distinguish differences between the low-risk and high-risk 70-gene signature scores. VGG16, Res50, Res101, and X_Cept were activated on the tumor part of the patches. In_Res was activated in the tumor and peri-tumor stroma areas. EB5 was mainly activated on peri-tumor stroma areas. The activated areas of VGG16, Res50, Res101, and X_Cept were highly similar, while the activated area of X_Cept was smaller than those of Res50, Res101, and VGG16. The VGG16 model's heatmap was highly activated over the entire image, whereas the Res50 and Res101 models' heatmaps were activated in the middle of the image. These three models performed well with the last two images on the right-hand side, which had the majority of cells distributed in the lower and upper corners; the models still managed to recognize these areas as tumor cells, which demonstrated the models' capability and logic in distinguishing tumor and non-tumor areas. The EB5 and In_Res models' activation maps did not show reasonable pathological features in the images, perhaps because these models used a corner-based approach to determine
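The patient-wise evaluation reported in this section aggregates patch-level predictions into one call per patient. The aggregation rule is not spelled out in the text, so the sketch below assumes a simple majority vote over a patient's patches; the function name, the record layout, and the patient identifiers are hypothetical.

```python
from collections import defaultdict


def patient_level_calls(patch_records, threshold=0.5):
    """Aggregate (patient_id, sigmoid_score) records into one call per patient.

    Assumptions: a patch is high-risk when its score is below the threshold,
    and a patient is high-risk when more than half of the patches are.
    """
    counts = defaultdict(lambda: [0, 0])   # patient -> [high-risk patches, all patches]
    for patient_id, score in patch_records:
        counts[patient_id][0] += int(score < threshold)
        counts[patient_id][1] += 1
    calls = {}
    for patient_id, (n_high, n_total) in counts.items():
        fraction_high = n_high / n_total
        label = "high-risk" if fraction_high > 0.5 else "low-risk"
        calls[patient_id] = (label, fraction_high)
    return calls


# Toy records for two hypothetical patients.
records = [("P1", 0.10), ("P1", 0.20), ("P1", 0.80), ("P1", 0.30),
           ("P2", 0.90), ("P2", 0.70), ("P2", 0.40), ("P2", 0.95)]
calls = patient_level_calls(records)
print(calls)
```

Because the patient-level call depends on the fraction of patches on each side of the threshold, individual misclassified patches have little influence, which is consistent with the patient-wise accuracy exceeding the patch-wise accuracy.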

DISCUSSION
Breast cancer recurrence not only negatively affects patients' quality of life, but a majority of breast cancer patients also undergo chemo- and radiotherapy, which have a high cost and a high rate of side effects (37,38). Predicting the risk of recurrence for patients diagnosed with breast cancer at an early stage could help in making a suitable treatment plan, which could also prevent overtreatment of patients with highly toxic chemotherapy. In past decades, identification of novel prognostic biomarkers for recurrence was based on multiomics approaches, which required multi-step protocols and time-consuming analyses (39). Consequently, a rapid, robust, and highly accurate method is desirable. Lately, with the emergence of AI tools, machine learning and deep learning have shown promising results in almost every aspect of healthcare research, and they have demonstrated their value in assisting physicians and researchers in their routine duties.
In an attempt to predict breast cancer recurrence, we have combined deep learning and pathological images into a simple, yet comprehensive and highly accurate, AI pipeline. We developed a complete pipeline for predicting breast cancer recurrence using a single source of data, namely, pathological images with high and low risk scores provided by an established 70-gene signature. Six different pretrained models were used for transfer learning with pretrained weights from the ImageNet dataset. The best-performing model, which used the X_Cept architecture (40) and two fully connected layers of 1,024 and 256 neurons, achieved an AUC of 0.90 using the validation set and an accuracy of 0.87 using the testing dataset. Furthermore, we bypassed the necessity for region-of-interest/tumor labeling for each WSI, which is a tedious and laborious task for pathologists.
The benefit of the 70-gene signature risk score to breast cancer patients has been shown in several studies with large sample sizes (11,13). The important role of this test in assisting physicians with making treatment plans has been affirmed. Nevertheless, mRNA expression profiling of all 70 genes in the signature is needed for completing this test, which is costly and laborious. In addition, discrepancies between different ethnic groups may also lead to divergence in the expression level of certain genes (41,42), which may affect the risk assessment. Using pathological images as the training data has many advantages over gene expression data, such as rapidity, straightforwardness, low cost, and reusability. On top of that, deep learning uses a representation approach to learn from data (43), which has been reported to outperform human experts in many biomedical tasks such as predicting cancer metastasis from expression data (44), distinguishing diseased and normal cells (45), and detection of tumor areas (46), to name a few. In our study, we used WSIs for prediction that also contained adjacent tumor areas, which may exhibit morphological changes in breast cancers. These adjacent tumor areas contributed to model building because they were different between samples with low- and high-risk 70-gene signature scores. The morphological changes might be small in scale, but deep learning methods, especially CNN architectures, are designed with hundreds of filters and different kernel sizes, which are particularly practical in detecting these tiny alterations. The applications of AI tools for biomedical data have been extensively evaluated recently. Hundreds of researchers use different types of data, such as images and omics data, together with clinical information. The performance of these models varies depending on the prediction and classification problem.
For instance, in colorectal cancer outcome prediction, a deep learning model achieved an AUC of 0.69, whereas human experts achieved an AUC of 0.58 (30). In another study on non-small cell lung cancer, experts attempted to predict the mutation status of target genes such as TP53 and KRAS using pathological images, and the AUC ranged from 0.733 to 0.856 in the external validation dataset (47). In addition to pathological images, recent studies also used computed tomography scan images for deep learning model training and achieved an AUC of 0.75 for predicting lung cancer treatment response (48). In breast cancer research, various studies have also used deep learning to predict patient outcomes by integrating pathological images and genomic data and have achieved AUCs ranging from 0.681 to 0.821 (49). Interestingly, pathological images were also used to predict the gene expression level in a pan-cancer study, and AUC scores ranged from 0.65 to 0.98 depending on the type and subtype of cancer. In another large-scale study using 44,732 WSIs from 15,187 patients to predict clinical grade from pathological images, a deep learning model achieved a state-of-the-art AUC of 0.98 for all types of cancers. This study had the distinction of using the reported diagnoses for image labeling only (50). The prognostic 70-gene signature achieved 89%, 42%, and 65% for sensitivity, specificity, and overall accuracy, respectively (51,52). In contrast, our best model (the X_Cept model) achieved a sensitivity of 0.89 and a specificity of 0.86 (based on Table 2). Taken together, the feasibility and efficiency of using pathological images and AI tools to predict clinical information and patient outcomes have been demonstrated, and the next step is to conduct prospective studies for evaluating potential application in clinical practice.
With rapid advances in AI algorithms coupled with new hardware generations and a plethora of ready-to-use healthcare data, models can now be trained with larger datasets in a shorter period of time. As a matter of course, expensive genomic, transcriptomic, proteomic, and metabolomic tests for different clinical purposes such as patient outcomes, survival analyses, and cancer subtyping will inevitably receive assistance from faster and better AI tools. Eventually, AI tools are expected to either completely transform traditional healthcare approaches or create a hybrid form. These advances will help physicians make better and faster decisions in treatment planning that requires a personalized medicine approach.

CONCLUSION
In the present study, we developed a high-performance, automated deep neural network pipeline to predict the risk of breast cancer recurrence using pathological images, which reduces the cost and time of genetic testing and obviates the need for tumor region labeling. We also demonstrated that a deep neural network model could learn complex pathological features from images alone and was able to find tumor areas for distinguishing low- and high-risk breast cancers.

Limitation
One of the limitations of this study is the size of the dataset and of the study population. Although the models can reach up to 87% accuracy for the patch-wise approach and up to 96.3% for the patient-wise approach, more rigorous independent validations are required to establish their efficacy and reliability for future applications to bigger datasets from different study groups of varied ethnicities. In addition, the current study did not perform survival analysis because no events occurred in our study cohort.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the patients' data are from Taipei Veterans General Hospital and are prohibited from distribution for public use.
Requests to access the datasets should be directed to C-CH, chisheng74@gmail.com.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the internal review board of the Taipei Veterans General Hospital. The patients/participants provided their written informed consent to participate in this study.