Breast Cancer Molecular Subtype Prediction on Pathological Images with Discriminative Patch Selection and Multi-Instance Learning

Molecular subtypes of breast cancer are important references for personalized clinical treatment. To save cost and labor, usually only one of a patient's paraffin blocks is selected for subsequent immunohistochemistry (IHC) to obtain the molecular subtype. Because of tumor heterogeneity, the resulting block sampling error is risky and could delay treatment. Predicting molecular subtypes from conventional H&E pathological whole slide images (WSIs) with AI methods can help pathologists pre-screen the proper paraffin block for IHC. The task is challenging because only WSI-level molecular subtype labels from IHC are available, without detailed local region information. Gigapixel WSIs are divided into a huge number of patches to make deep learning computationally feasible; with only coarse slide-level labels, however, patch-based methods may suffer from abundant noise patches, such as folds, overstained regions, or non-tumor tissue. A weakly supervised learning framework based on discriminative patch selection and multi-instance learning (MIL) was proposed for breast cancer molecular subtype prediction from H&E WSIs. First, a co-teaching strategy using two networks was adopted to learn molecular subtype representations and filter out some noise patches. Then, a balanced sampling strategy was used to handle the subtype imbalance in the dataset. In addition, a noise patch filtering algorithm that uses the local outlier factor based on cluster centers was proposed to further select discriminative patches. Finally, a loss function integrating local patch and global slide constraint information was used to fine-tune the MIL framework on the obtained discriminative patches and further improve the prediction performance of molecular subtyping.
The experimental results confirmed the effectiveness of the proposed method, and our models outperformed even senior pathologists, which suggests the potential to assist pathologists in pre-screening paraffin blocks for IHC in the clinic.

INTRODUCTION
Breast cancer is intrinsically heterogeneous and has been commonly categorized into molecular subtypes since the late 1990s (1). According to various molecular expressions of certain genes, breast cancer can be classified into four molecular subtypes, namely, Luminal A, Luminal B, Her-2, and Basal-like (2). Molecular subtypes directly reveal the biological behavior of breast cancer and represent changes in gene expression, which can be used to determine tailored treatment approaches and predict prognosis (3).
In clinic, molecular subtype diagnosis usually comes from immunohistochemistry (IHC) (4). IHC uses the high specificity between antigen and antibody, as well as histochemical procedures to mark antigen and antibody positions. IHC staining is used to identify aberrant cells such as those found in cancerous tumors. Certain biological activities, such as growth or cell death, are associated with certain molecular markers (5). Four biomarkers, including estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki67, are commonly utilized to immunostain the slides to determine molecular subtypes of breast cancer. Diagnosed subtypes basically determine corresponding treatment strategies, such as targeted drugs for HER2-positive and hormone therapy for Luminal-A. Due to tumor heterogeneity, gene expression of ER, PR, and HER2 often varies in different paraffin blocks and thus may lead to inaccurate subtype diagnosis. For cost and labor savings, pathologists usually examine only one of the paraffin blocks in a case to determine the molecular subtype of breast cancer. Since molecular subtypes determine treatment strategies, inevitable sampling error is risky due to the tumor heterogeneity and could result in a delay in medical treatment. Molecular subtype prediction from conventional H&E pathological whole slide images (WSI) using the AI method is useful and critical to assist pathologists to pre-screen proper paraffin block for subsequent IHC in clinic.
Changes in gene expression cause variations in the texture of pathological images. Some pathologists have attempted to investigate the statistical relationship between specific gene expression and hematoxylin and eosin (H&E)-stained pathological images (6). Directly predicting molecular subtypes of breast cancer from H&E pathological images with AI is a promising direction, which may also help improve the diagnostic reliability of molecular subtyping.
Molecular subtyping on H&E-stained pathological images is a challenging task since we can only obtain the slide-level label for each molecular subtype without detailed local region information. Even experienced pathologists have difficulty annotating corresponding molecular subtype regions in H&E pathological images (7). Due to the extremely high resolution of whole slide images (WSIs), WSIs are computationally infeasible to be directly fed into a network for training and testing; therefore, they are usually divided into small patches. The lack of patch-level labels makes it a weak label problem for machine learning.
Deep learning is increasingly widely used in computer vision tasks. Most deep learning tasks require a large amount of finely labeled data for supervised learning, and producing such labels is time-consuming, especially in medical fields. Weakly supervised learning has therefore become a research hotspot for reducing the dependence on labeled data. Benenson et al. (8) adopted an interactive method in which human annotators and the model collaborate to complete the segmentation task. Berthelot et al. (9) augmented labeled data with unlabeled data for classification. To reduce the influence of noisy data, Cheng et al. (10) presented a weakly supervised learning method using a side information network, which largely alleviates the negative impact of noisy image labels. Qu et al. (11) addressed the noisy label problem by enforcing prominent feature extraction through matching the feature distributions of clean and noisy data.
In recent years, multi-instance learning (MIL) (12) methods have been widely adopted for weakly supervised learning. For WSI classification based on MIL, all patches extracted from a pathological image form a bag, and the patches are the instances of this bag. With only bag-level labels available in the training stage, the goal of MIL is to train a classifier that predicts bag-level labels and even instance-level labels. Previous work has extended and enhanced the MIL framework using multiple techniques. Wu et al. (13) proposed DE-MIMG, which allows each bag to contain pairs of instances and graphs and results in an optimal representation. Discriminative bag mapping (14) was adopted to build a discriminative instance pool that can properly separate bags in the mapping space. As the attention mechanism gained popularity in deep neural networks, Ilse et al. (15) and Shi et al. (16) introduced it to MIL, where attention weights represent how much each instance contributes to the bag label. Instead of assuming that instances in each bag are independent and identically distributed (i.i.d.), Zhang et al. (17) proposed MIVAE, which explicitly models the dependencies among instances within each bag for both instance-level and bag-level prediction. Li et al. (18) proposed using contrastive learning to extract multiscale WSI features together with a novel MIL aggregator that models the relations among instances. Shao et al. (19) devised a transformer-based correlated MIL that explores both morphological and spatial information. However, most attention-based and correlated MIL methods require large-scale training datasets and significant computational resources. In addition, feature clustering methods have also drawn some attention in MIL. Wang et al. (20) modeled each WSI as k groups of tiles with similar features to ensure learning both diverse and discriminative features. Similarly, Sharma et al. (21) performed K-means clustering on patches within each WSI and randomly sampled a certain number of patches from each cluster to accommodate computational limits without much information loss. However, besides the variability of patches within a WSI, the variability among WSIs of the same category is also considerable, and clustering techniques can be used to refine class-level learned features for more accurate subtyping.
Nevertheless, breast cancer molecular subtyping specifically on H&E images has been insufficiently studied. Shamai et al. (22) used logistic regression to explore correlations between histomorphology and biomarker expression and a deep neural network to predict biomarker expression in examined tissue. Rawat et al. (23) introduced "tissue fingerprints" that can learn H&E features to distinguish patients, which are further used to predict ER, PR, and HER2 status. In these studies, machine learning technique is adopted to predict biomarker expression level from H&E histomorphology; direct molecular subtype prediction, however, has not been achieved. Jaber et al. (24) proposed an intrinsic molecular subtype (IMS) classifier from H&E images and analyzed heterogeneity within patches from the same WSI. Although using Inception-v3 to extract features, they adopted traditional PCA and SVM for classification, leading to limited performance.
Since the patches cut from each WSI may come from various regions including lesion, benign, or background of the WSI, some research (25,26) regard the non-lesion areas in the patches of the pathological images as noisy labels. Differing from pathological classification tasks, such as ductal carcinoma in situ and invasive ductal carcinoma for breast cancer, where pathologists can label tumor regions with different pathological classes, it is impossible to distinguish tumor regions representing different molecular subtypes even for senior pathologists. Although tumor region annotations are useful information for deep networks to learn molecular subtypes, these manual annotations are time-consuming for pathologists. This paper focuses on molecular subtyping with only slide-level labeling instead of detailed tumor region labeling information. The crucial challenge is to eliminate the influence of noise patches and learn expressive features for classifying molecular subtypes.
In this paper, we modeled the patch-based molecular subtype prediction task of pathological slides as a noisy labeling problem in weakly supervised learning. A multi-instance learning framework DPMIL for pathological image molecular subtyping prediction based on discriminative patch filtering was proposed. First, in order to distinguish noise patches, a pre-classification strategy for molecular classification of pathological slides based on co-teaching was presented. This method adopted co-teaching strategy to train two backbone networks and used co-teaching loss function to filter out noise patches to update model parameters. Then, a local outlier factor algorithm was used to reveal the outliers in the feature space for each molecular subtype, and the patches with features close to the cluster center were retained as discriminative patches. Finally, based on the filtered discriminative patches, the pathological slide-level global loss and patch-level local loss were integrated to fine-tune the prediction model for better feature representation of molecular subtypes. The experimental results confirmed the effectiveness of our proposed framework on the molecular subtyping dataset; breast cancer pathological images were provided by Xiangya Hospital. Our AI models outperformed even senior pathologists, which has the potential to assist in prescreening proper paraffin block of patients for subsequent IHC molecular subtyping in clinic.

Data, Software, and Hardware
This paper used breast cancer H&E pathology dataset BCMT (Breast Cancer with Molecular Typing) provided by Xiangya Hospital. All the pathology WSIs used a pyramid storage structure.
As Table 1 shows, the BCMT dataset contains 1,254 pathological WSIs from 1,254 patients or cases with slide-level molecular subtype annotations between 2017 and 2019. The dataset contains 313 slides for Luminal A, 382 slides for Luminal B, 316 slides for Her-2 overexpression subtype, and 243 slides for the Basal-like subtype. We randomly divided the slides into training set and validation set with a ratio of 8:2 for each type. This paper uses accuracy, precision, recall, and F1 score to measure the performance of four molecular subtypes.
We use four GeForce RTX 2080 Ti GPUs with 11 GB of memory to train the network, and Python with PyTorch to implement our algorithm. The initial learning rate is 0.1, and the poly learning rate policy with a power of 0.9 is employed. The minibatch size is set to 32.
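As a minimal sketch of the poly learning rate policy mentioned above, the commonly used form multiplies the initial rate by (1 - step / max_steps) raised to the power. The function name `poly_lr` and the argument `max_steps` are illustrative assumptions; the text does not state whether the decay is applied per iteration or per epoch.

```python
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial ("poly") decay: base_lr * (1 - step / max_steps) ** power.

    `max_steps` is a hypothetical total number of decay steps (not given in the text).
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    # Clamp the step so the rate never becomes negative or complex.
    progress = min(max(step, 0), max_steps) / max_steps
    return base_lr * (1.0 - progress) ** power


# Example schedule with the initial rate 0.1 and power 0.9 quoted in the text.
schedule = [poly_lr(0.1, step, 1000) for step in (0, 250, 500, 750, 1000)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final step.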

Proposed Framework
This paper proposes a breast cancer molecular subtype prediction framework based on multi-instance learning and discriminative patch filtering. The pipeline of our framework is illustrated in Figure 1.
First, patches are extracted from H&E WSIs to train a molecular subtype classifier. Co-teaching (27) between two networks is used to obtain the patch-level classification and select candidate discriminative patches. Then, the local outlier factor (LOF) (28) based on the cluster centers of subtypes is adopted to further filter out noise patches and obtain discriminative patches. Based on these discriminative patches, we fine-tune a new molecular subtyping model initialized from the model that performed better in the co-teaching stage. Finally, the local loss function and the global loss function are combined as constraint information in the multi-instance learning framework to improve the feature representation of molecular subtypes. The fine-tuned model is used to obtain the final patch-level and slide-level molecular subtyping results.

Feature Construction and Patch Selection Based on Co-Teaching
In a multi-instance learning framework, each patch is usually assigned the same label as the WSI it belongs to (29)(30)(31). For molecular subtyping, however, patches from a WSI may contain benign or other tissues, which makes slide-level prediction difficult. To reduce these noise patches, this paper adopts the co-teaching strategy (27), which trains two neural networks and enables them to learn from each other. This strategy assumes that the samples with the lowest loss under each model are non-noisy samples. These selected instances are considered more representative of the category of the bag than other instances. Each network treats the samples with minimal loss in each batch as knowledge and feeds these samples to the other network. The co-teaching strategy is therefore inherently suitable for classification with noisy labels. This paper uses ResNet-50 (32) as the backbone for co-teaching. The parameters of the two models are randomly initialized, and the selection strategy of K follows (27). During the co-teaching process, the ResNet-50 network is used to obtain representative features and a confidence value for each patch. Patches with higher confidence are selected as candidate discriminative patches for the subsequent process.
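The small-loss exchange described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the per-sample losses are taken as given, the names `keep_rate`, `small_loss_indices`, and `co_teaching_exchange` are hypothetical, and the linearly shrinking keep rate follows the schedule commonly associated with co-teaching (27) rather than any setting confirmed in this paper.

```python
from typing import List, Sequence, Tuple


def keep_rate(epoch: int, noise_rate: float, warmup_epochs: int) -> float:
    """Fraction of each mini-batch kept; shrinks linearly from 1 to 1 - noise_rate.

    `noise_rate` and `warmup_epochs` are assumed hyperparameters.
    """
    ramp = min(epoch / warmup_epochs, 1.0)
    return 1.0 - noise_rate * ramp


def small_loss_indices(losses: Sequence[float], rate: float) -> List[int]:
    """Indices of the lowest-loss samples in a batch, keeping at least one."""
    n_keep = max(1, int(rate * len(losses)))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def co_teaching_exchange(
    losses_a: Sequence[float], losses_b: Sequence[float], rate: float
) -> Tuple[List[int], List[int]]:
    """Return (indices used to update network A, indices used to update network B)."""
    picked_by_a = small_loss_indices(losses_a, rate)
    picked_by_b = small_loss_indices(losses_b, rate)
    # Cross update: each network is trained on the samples its peer trusts.
    return picked_by_b, picked_by_a


# Toy batch of four patches with per-patch losses from the two networks.
update_a, update_b = co_teaching_exchange([0.2, 2.5, 0.1, 1.9], [0.3, 0.2, 2.2, 0.4], rate=0.5)
```

In a real training loop the selected indices would be used to mask the batch before each network's backward pass; that step depends on the deep learning framework and is omitted here.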

Noise Patch Filtering Using Local Outlier Factor
Although the above co-teaching strategy used co-teaching loss to filter out some noise patches, many noise patches from benign or other tissue regions remain. For selected high confidence patches, we can obtain the feature of each patch before the classification layer. Patches belonging to the same molecular subtype tend to gather into the same cluster in feature space.
This paper further proposes a noise filtering method based on the local outlier factor (LOF), which is a classic density-based algorithm (28). The main idea is to calculate a numerical score that represents the degree of abnormality of a sample relative to the cluster center with average density. In feature space, the density at a given point is compared with the average density of the points around it. If the density at the point is lower, the point may be abnormal, and vice versa. Figure 2 shows an example of a point set (blue points) in feature space for a certain molecular subtype. We query whether four additional points are outliers of the point set. The green point, which has a low LOF score, is not an outlier, and the red points, which have high LOF scores, are outliers. The size of each red point reflects its LOF score and represents the degree of abnormality of that point.
We perform LOF for each subtype of molecular features and regard patches that do not belong to a specific cluster of molecular subtype as noise patches.
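To make the density comparison concrete, here is a compact pure-Python sketch of the standard LOF score applied to the feature vectors of one subtype. It is not the authors' implementation: the neighbourhood size `k`, the cut-off `threshold`, and the function names are assumptions, and the cluster-center variant described in the paper is not reproduced.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _knn(points: List[Point], idx: int, k: int) -> List[Tuple[float, int]]:
    """k nearest neighbours of points[idx] as (distance, index) pairs, self excluded."""
    dists = [(math.dist(points[idx], q), j) for j, q in enumerate(points) if j != idx]
    return sorted(dists)[:k]


def local_outlier_factor(points: List[Point], k: int) -> List[float]:
    """Standard LOF: scores near 1 are inliers, scores well above 1 are outliers."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    neighbours = [_knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbours]  # distance to the k-th neighbour
    lrd = []  # local reachability density of every point
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        mean_reach = sum(reach) / k
        lrd.append(math.inf if mean_reach == 0 else 1.0 / mean_reach)
    scores = []
    for i in range(n):
        neighbour_density = sum(lrd[j] for _, j in neighbours[i]) / k
        # Exact duplicates have infinite density; treat them as inliers.
        scores.append(neighbour_density / lrd[i] if math.isfinite(lrd[i]) else 1.0)
    return scores


def filter_discriminative(features: List[Point], k: int = 2, threshold: float = 1.5) -> List[int]:
    """Indices of patches kept as discriminative (LOF at or below the assumed threshold)."""
    scores = local_outlier_factor(features, k)
    return [i for i, s in enumerate(scores) if s <= threshold]


# Toy 2-D "features": a tight cluster of four patches and one far-away noise patch.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
kept = filter_discriminative(toy, k=2)
```

In this toy example the four clustered points receive a score of 1 and the distant point receives a much larger score, so only the clustered points are retained.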

Multi-Instance Learning With Global and Local Constraint
The above selected discriminative patches are further used to improve feature representation of molecular subtypes based on multi-instance learning framework (MIL). MIL regards the WSI as a bag containing a number of patches. These patches are considered as instances, and their predictions are aggregated to obtain a bag-level prediction. ResNet-50 is also adopted as a backbone to train the MIL classification model. We initialize the MIL model with the model that performs better in co-teaching and use discriminative patches for fine-tuning.
We introduce a slide-level loss function to impose global information constraints that guide the MIL training. The slide-level loss function L_WSI is defined in Formula 1, where L_WSI^i represents the slide-level loss of the ith pathological image, defined as the cross-entropy function (32); N_WSI represents the total number of pathological slides in the training set; and a is the weight of the slide-level loss.
L_WSI^i is defined in Formula 2, where M is the number of molecular types and Y_{o,c} is the indicator function: it is set to 1 when the output prediction o is the same as the true label c of the pathological slide, and to 0 otherwise. P_c, defined in Formula 3, represents the slide-level confidence for molecular subtype c. N_p is the total number of patches of the pathological image, and P_{i,c} represents the confidence value when the ith patch of the WSI is classified as type c. As shown in Formula 3, the confidence values of all patches from the same WSI are averaged and used as the slide-level molecular subtyping confidence.
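Formulas 1 to 3 themselves are not reproduced in this text. A plausible reconstruction from the definitions above is sketched here; the exact form, in particular whether Formula 1 averages or sums over slides, is an assumption. Note that the index i denotes a slide in the first two lines and a patch in the third, as in the surrounding definitions.

```latex
% Formula 1 (assumed form): weighted mean of the per-slide losses
L_{WSI} = \frac{a}{N_{WSI}} \sum_{i=1}^{N_{WSI}} L_{WSI}^{i}

% Formula 2: cross-entropy of the slide-level confidence
L_{WSI}^{i} = -\sum_{c=1}^{M} Y_{o,c} \, \log P_{c}

% Formula 3: slide-level confidence as the mean patch confidence
P_{c} = \frac{1}{N_{p}} \sum_{i=1}^{N_{p}} P_{i,c}
```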
The two-stage training diagram is shown in Figure 3. We use the patches retained by LOF denoising as input. In each epoch, the training process is divided into two stages. The first stage uses all patches to calculate the patch-level loss to train the model, with cross-entropy as the loss function, as defined in Formula 4.
In Formula 4, M represents the total number of molecular types and Y_c is the indicator function, which is equal to 1 when prediction c equals the ground truth of the slide. p_c denotes the confidence that the current patch is classified as type c. The second stage trains the model for slide-level subtyping, using the slide-level loss function as global constraint information.
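The two loss terms can be illustrated with a small pure-Python sketch. It assumes that each patch already has a softmax confidence vector, that the patch-level loss is the usual cross-entropy against the slide label, and that the slide-level term is averaged over slides and scaled by the weight a; the function names and the epsilon clamp are illustrative choices, not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

EPS = 1e-12  # assumed numerical floor to avoid log(0)


def patch_loss(probs: Sequence[float], label: int) -> float:
    """Cross-entropy of one patch against the label of its slide (stage 1)."""
    return -math.log(max(probs[label], EPS))


def slide_confidence(patch_probs: List[Sequence[float]]) -> List[float]:
    """Average the per-class confidence over all patches of one slide."""
    n_patches = len(patch_probs)
    n_classes = len(patch_probs[0])
    return [sum(p[c] for p in patch_probs) / n_patches for c in range(n_classes)]


def slide_loss(patch_probs: List[Sequence[float]], label: int) -> float:
    """Cross-entropy of the averaged slide-level confidence (stage 2, unweighted)."""
    return -math.log(max(slide_confidence(patch_probs)[label], EPS))


def two_stage_losses(
    slides: List[Tuple[List[Sequence[float]], int]], a: float = 0.5
) -> Tuple[float, float]:
    """Return (mean patch-level loss, a-weighted mean slide-level loss)."""
    patch_terms = [patch_loss(p, y) for probs, y in slides for p in probs]
    slide_terms = [slide_loss(probs, y) for probs, y in slides]
    stage1 = sum(patch_terms) / len(patch_terms)
    stage2 = a * sum(slide_terms) / len(slide_terms)
    return stage1, stage2


# One toy slide with two patches and four subtypes; the slide label is class 0.
toy_slide = ([[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]], 0)
stage1, stage2 = two_stage_losses([toy_slide], a=1.0)
```

Because the slide-level term is computed on the averaged confidence, a few low-confidence noise patches influence it less than they influence the sum of per-patch losses.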

RESULTS
This section presents several experiments that evaluate the performance of our proposed framework DPMIL, covering the patch resampling strategy, co-teaching, LOF, and MIL training in turn. The performance of the model is evaluated using average accuracy, recall, precision, and macro F1 over the four subtypes.
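For reference, the macro-averaged metrics can be computed as in the sketch below. It assumes integer class labels 0 to 3 for the four subtypes and reports overall accuracy; whether the paper's "average accuracy" is computed this way or per class is not stated, so treat that choice as an assumption.

```python
from typing import Dict, List


def macro_metrics(y_true: List[int], y_pred: List[int], n_classes: int = 4) -> Dict[str, float]:
    """Overall accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for c in range(n_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


# Toy example: eight slides, two per subtype.
metrics = macro_metrics([0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 1, 1, 2, 2, 3, 0])
```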

Results of Patch Resampling and Co-Teaching
The total number of patches of each type at different resolutions is shown in Figure 4, which reveals the imbalance in the number of patches across molecular subtypes and resolutions. To deal with this imbalance, we use a patch resampling strategy to ensure category equalization. For each epoch, the number of training samples for each molecular subtype is set to a constant value. For the common subtypes, patches are randomly sampled from all patches, and the number of sampled patches depends on the resolution: 180,000 patches at 5×, 700,000 patches at 10×, and 5,000,000 patches at 20×. For the rare subtypes, additional data are generated by data augmentation, such as random flipping and horizontal and vertical mirroring.
We use ResNet-50 as the classifier to evaluate the performance of the sampling strategy. Figure 5 shows the results of molecular subtyping with patch resampling at different resolutions. At all three resolutions, the accuracy of the models with the resampling strategy is higher than that of the models without resampling.
F1 values improve by about 6% with patch resampling at all resolutions. In addition, the highest accuracy and F1 value are both achieved at 10×, which indicates that 10× offers a good compromise between patch size and tissue texture.
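A hedged sketch of such a per-epoch balanced sampler is given below. It assumes patches are referenced by identifier, that over-represented classes are subsampled without replacement, and that under-represented classes are padded with repeated patches flagged for augmentation; the function name `balanced_epoch` and the boolean augmentation flag are illustrative, not taken from the paper.

```python
import random
from typing import Dict, List, Tuple


def balanced_epoch(
    patches_by_class: Dict[str, List[str]], per_class: int, seed: int = 0
) -> List[Tuple[str, str, bool]]:
    """Return (patch_id, label, augment) triples with exactly per_class items per label."""
    rng = random.Random(seed)
    epoch: List[Tuple[str, str, bool]] = []
    for label, patches in patches_by_class.items():
        if not patches:
            raise ValueError(f"no patches available for class {label!r}")
        if len(patches) >= per_class:
            # Common class: random subsample without replacement.
            epoch += [(p, label, False) for p in rng.sample(patches, per_class)]
        else:
            # Rare class: keep every patch, then pad with copies marked for
            # augmentation (for example a random flip applied at load time).
            epoch += [(p, label, False) for p in patches]
            extra = [rng.choice(patches) for _ in range(per_class - len(patches))]
            epoch += [(p, label, True) for p in extra]
    rng.shuffle(epoch)
    return epoch


# Toy pools: ten patches of one subtype, three of another, five samples per class.
pools = {
    "Luminal A": [f"a{i}" for i in range(10)],
    "Basal-like": [f"b{i}" for i in range(3)],
}
epoch = balanced_epoch(pools, per_class=5)
```

Calling the sampler with a different seed each epoch would give every epoch a fresh, class-balanced subset.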

Molecular Subtype Classification Using Co-Teaching and LOF
This section describes experiments that verify the effectiveness of the co-teaching strategy. ResNet-50 was selected as the backbone of both networks for co-teaching. The model is trained for 20 epochs with a minibatch size of 32. The initial learning rate is 0.01, and the polynomial learning rate decay method (33) is used to adjust the learning rate. Figure 6 shows the results of molecular subtype classification with and without co-teaching at different resolutions. With co-teaching, the accuracy improves by 4% to 6% and the F1 score improves by 4% to 11%. The co-teaching framework trains two neural networks and enables them to learn from each other, which can reduce the influence of noise patches. The F1 value of 10×-Co-teaching reaches 0.604, an improvement of 4.5% over 10×-resampling. At each resolution, we selected the co-teaching model with the higher F1 value. Features before the classification layer were input into LOF-Denoising for patch filtering for all molecular types. Let S_i be the number of normal patches of the ith molecular type; there were then Σ_{i=1}^{4} S_i features in total. These features in co-teaching were used for statistical classification of output logits, the number of which is limited to 2,000. The experimental results are shown in Figure 6, which shows that LOF after co-teaching can further improve the metrics since more noise patches are filtered out. We select the 10× resolution in the following experiments.

Multi-Instance Learning with Global Information
Based on the above discriminative patch selection, we further verify the multi-instance learning framework with the slide-level loss. We used a four-class classification model for molecular subtyping and compared the results obtained with different weights in Formula 1. In the second training stage of the model with the global constraint, the influence of the weight a in the loss function of Formula 1 was examined.
When 0 ≤ a ≤ 1, the influence of the second stage on the model parameters is weakened. When a = 0, the second stage of training does not affect the model. When a > 1, the influence of the second stage is enhanced. We set the value to 0.5, 1.0, and 2.0, respectively, to evaluate the effectiveness of global loss constraint in the second stage of training. Figure 7 shows the results of MIL for molecular subtyping, proving that using slide-level loss function can improve the performance of the model. The reason may be that there are still some noise patches in the selected patches after noise filtering. We used a slide-level loss to add global constraint information, which can further reduce the influence of noise patches.

Binary Classification Model and Weighted Fusion
Apart from the four-class classification model, to further improve the performance of molecular subtype classification, we also tried binary classification models for each molecular subtype. Finally, a weighted fusion method is adopted to accomplish the final four-type classification. Binary classification models were trained similar to four-class classification model, including co-teaching, LOF, and slide-level loss of MIL. Parameter a is set to 0.5 for all the experiments. The prediction results of each molecular type of binary classification model are shown in Figure 8, where F1 reached over 0.72 for all subtypes. Notably, Basal-like molecular type obtained the highest F1 value of 0.774.
For four-class classification, we averaged the confidence levels of all patches from a WSI and used the result as the confidence of the molecular subtype of the WSI. We used grid search (34) to find the best weight setting for the four-class prediction model and finally took 0.6, 0.9, 0.5, and 0.7 as the weights of the four subtypes. The final weighted four-class classification results are shown in Figure 8. Compared with the direct four-class molecular type prediction from model 10×-(0.5) in Figure 7, the four-classifier weighted fusion in Figure 8 increases the accuracy by 6.7% and the F1 score by 3.2%.
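One way to read the weighted fusion is sketched below: each binary model yields a slide-level confidence by averaging its patch confidences, each confidence is scaled by its subtype weight, and the subtype with the largest weighted score is predicted. The mapping of the weights 0.6, 0.9, 0.5, and 0.7 to specific subtypes is not given in the text, so the order used here is an assumption, as are the function names.

```python
from typing import Dict, List

SUBTYPES = ["Luminal A", "Luminal B", "Her-2", "Basal-like"]
# Assumption: the weights are listed in the same order as SUBTYPES.
WEIGHTS = dict(zip(SUBTYPES, [0.6, 0.9, 0.5, 0.7]))


def slide_score(patch_confidences: List[float]) -> float:
    """Slide-level confidence of one binary model: mean over its patch confidences."""
    return sum(patch_confidences) / len(patch_confidences)


def fuse(binary_patch_conf: Dict[str, List[float]], weights: Dict[str, float] = WEIGHTS) -> str:
    """Predict the subtype whose weighted slide-level confidence is largest."""
    scores = {s: weights[s] * slide_score(binary_patch_conf[s]) for s in SUBTYPES}
    return max(scores, key=scores.get)


# Toy slide: positive-class patch confidences from each of the four binary models.
toy_conf = {
    "Luminal A": [0.8, 0.6],
    "Luminal B": [0.5, 0.5],
    "Her-2": [0.9, 0.7],
    "Basal-like": [0.2, 0.4],
}
prediction = fuse(toy_conf)
```

In this toy case the unweighted maximum would be Her-2, while the weighted scores favour Luminal B, which shows how the weights can change the final decision. A grid search would evaluate `fuse` over candidate weight tuples on a validation set and keep the best-scoring tuple.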
To compare our method with pathologists in molecular subtype classification, nine pathologists were invited to diagnose the molecular subtypes of 99 WSIs randomly selected from the test dataset. In the clinic, pathologists usually classify molecular subtypes on IHC images rather than on H&E-stained images. Therefore, the pathologists could conduct subtyping based only on image patterns and their clinical experience. Table 2 shows the average accuracy, precision, recall, and macro F1 scores computed from the labels that the pathologists (D1: 5 years' experience, D2: 10 years' experience, and D3: 15 years' experience) assigned to each H&E WSI. Specifically, we provide the means and ranges of the four metrics for the seven 5-year pathologists (D1s). As shown in Table 2, doctors with 5 years of experience can hardly make better predictions than random guessing, which indicates the unusual difficulty of breast cancer subtyping on H&E images. Encouragingly, more experienced doctors can provide a more accurate diagnosis of molecular subtypes. Our four-class classification model (10×-0.5) and fused binary classification model (10×-Weight fusion) show obvious superiority over doctors in all metrics, surpassing the predictions of the most experienced doctor (D3) by 15.4% and 21.5% in accuracy and F1, respectively.
Since WSI-level labels lack detailed region annotation information, most existing methods use patch-based approaches for WSI recognition. The key problem is how to eliminate the influence of noise patches and learn the corresponding features for molecular subtyping through the training process. Our work aims to predict slide-level labels of H&E pathological slides using only weakly annotated information at the slide level. This paper proposes a framework that selects discriminative patches to reduce the impact of noise patches and combines them with MIL for molecular subtype classification.
The experimental results show the effectiveness of our proposed framework on the partner hospital's breast cancer H&E pathological image dataset.
MIL has been applied in diverse diseases and image modalities including classification of cancer in histopathology    (39) proposed a graph attention clustering multiinstance learning algorithm based on texture features to predict the TNM staging of rectal cancer tumor metastasis and improved the accuracy of pathological slide staging. Wang et al. (40) proposed a classification framework for pathological slides for gastric cancer diagnosis, which used localization networks to extract patch features and critical filtered patches to replace the general clustering module. After local network extraction and screening of key patch feature maps, concatenation is performed to obtain an overall feature map describing pathological slides.
Recent studies rely largely on the powerful feature extraction capability of deep learning. Yang et al. (41) trained a six-type classifier based on EfficientNet (42) for the identification of lung lesions from WSIs. To obtain a slide-level diagnosis, they proposed a threshold-based tumor-first aggregation method that fuses majority voting and a probability threshold. Wang et al. (43) developed a second-order multiple instance learning method with an adaptive aggregator built by stacking an attention mechanism and an RNN for histopathological image classification, attempting to explore second-order statistics of deep features for histopathological images. The MIL framework can also be applied to similar tasks such as survival prediction. Yao et al. (12) proposed Deep Attention Multiple Instance Learning by introducing a Siamese MI-FCN that learns features from phenotype clusters and attention-based MIL pooling that performs trainable weighted aggregation. In contrast, our paper focuses on the selection of discriminative patches and combines local and global constraint information in a MIL framework.
The retrospective study design would have resulted in inevitable bias and all the data were collected from a single center, thereby limiting the sample size of the study. In future work, we will combine multi-center and multi-resolution information of pathological images to improve the accuracy and to evaluate on larger datasets.

CONCLUSIONS
Molecular subtype prediction from H&E pathological slides is a challenging task. Based on slide-level weak labels, this paper proposes a multi-instance learning framework for molecular subtype classification with discriminative patch selection. First, we use the co-teaching strategy to train the molecular subtype prediction model in the presence of noise patches. Then, the noise patches are filtered out with the local outlier factor algorithm, according to the features obtained from the model. Finally, based on the filtered discriminative patches, a multi-instance learning-based molecular subtyping model is fine-tuned using both slide-level and patch-level losses. The experimental results show the effectiveness of the proposed framework on the breast cancer H&E pathological image dataset from Xiangya Hospital. Although its performance is not sufficient to replace pathologists' clinical diagnosis directly, it is reasonable to employ our framework for preliminary screening to make molecular subtyping more convenient and reliable.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because no interviewees consented to their data being retained or shared due to the ethically sensitive nature of the research. Requests to access the datasets should be directed to HL, hliu@ict.ac.cn.