A novel dilated contextual attention module for breast cancer mitosis cell detection

Background: Mitotic count (MC) is a critical histological parameter for accurately assessing the degree of invasiveness in breast cancer, holding significant clinical value for cancer treatment and prognosis. However, accurately identifying mitotic cells poses a challenge due to their morphological and size diversity. Objective: We propose a novel end-to-end deep-learning method for identifying mitotic cells in breast cancer pathological images, with the aim of enhancing recognition performance. Methods: We introduce the Dilated Cascading Network (DilCasNet), composed of detection and classification stages. To enhance the model's ability to capture distant feature dependencies in mitotic cells, we devised a novel Dilated Contextual Attention Module (DiCoA) that utilizes sparse global attention during detection. To reclassify the mitotic cell areas localized in the detection stage, we integrate the EfficientNet-B7 and VGG16 pre-trained models (InPreMo) in the classification step. Results: On the canine mammary carcinoma (CMC) mitosis dataset, DilCasNet demonstrates superior overall performance compared to the benchmark model, with an F1 score of 82.9%, Precision of 82.6%, and Recall of 83.2%. With the incorporation of the DiCoA attention module, the model exhibited an improvement of over 3.5% in F1 during the detection stage. Conclusion: DilCasNet achieved favorable detection performance for mitotic cells in breast cancer and provides a solution for detecting mitotic cells in pathological images of other cancers.


A.2.2 DiCoA Attention Scores
Firstly, for a given input feature map X ∈ ℝ^(C×H×W), we extract the dilated contextual attention score matrix, transforming each sampled position into a scalar score. Here, (x_c, y_c) represents the coordinate center of the target box, and w, h denote the width and height of the bounding box, respectively. N represents the total number of pixels extracted in a neighborhood of size k, and d represents the dilation value.
During the training stage, the attention scores of DiCoA are updated by optimizing the objective loss with respect to the target using the cross-entropy method, where y represents the ground truth of a bounding box and ℓ denotes the negative log-likelihood loss function; the predicted attention scores, y, and ℓ together constitute the cross-entropy loss.
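The sparse sampling behind these scores can be sketched as follows. This is a minimal illustration, not the paper's implementation: the clamping behavior at the feature-map border and the softmax normalization of the scores are assumptions, and `dilated_neighborhood`/`attention_scores` are hypothetical helper names.

```python
import math

def dilated_neighborhood(xc, yc, k=3, d=2, H=64, W=64):
    """Collect the coordinates of a k x k neighborhood around the target
    center (xc, yc), with adjacent samples spaced d pixels apart (the
    dilation value). Out-of-bounds positions are clamped to the border,
    giving N = k * k sampled pixels in total."""
    r = k // 2
    coords = []
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y = min(max(yc + dy * d, 0), H - 1)
            x = min(max(xc + dx * d, 0), W - 1)
            coords.append((y, x))
    return coords

def attention_scores(raw):
    """Normalize the raw scalar scores of the N sampled pixels with a
    numerically stable softmax so they form an attention distribution."""
    m = max(raw)
    exps = [math.exp(s - m) for s in raw]
    total = sum(exps)
    return [e / total for e in exps]
```

With k = 3 and d = 2, the module attends to 9 pixels spread over a 5 × 5 window, which is how dilation widens the attention span without increasing N.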

A.3. DiCoA Attention Placement Locations
To effectively detect mitotic cells of various shapes, capture long-range dependencies in mitotic cell features, and adapt the extracted feature vectors to the physical sizes of different mitotic cells, we integrated the DiCoA attention module with a multi-scale network. We chose to incorporate the DiCoA attention module in the fifth stage of the bottom-up process of the FPN, specifically after the output of the 1 × 1 convolutional layer with a downsampling rate of 2^5 = 32, denoted as C5. Subsequently, we obtained P2, P3, P4, and P5 through four 1 × 1 convolution operations, derived from C2, C3, C4, and C5, respectively (as illustrated in Figure A.1). These operations reduced the channel dimensions of the feature maps to C = 256, facilitating better processing of features across different scales.
The rationale for incorporating the DiCoA attention module at the C5 stage of the FPN module is as follows: in this study, the image resolution is 0.25 µm/pixel, and a C5 stage feature vector represents a region of length 24 µm, as illustrated in Figure A.1 (a).
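The physical coverage of a feature vector at each FPN stage can be checked with simple arithmetic. In the sketch below, the 3 × 3 attention neighborhood is an assumption chosen to reconcile the C5 stride (32 px × 0.25 µm/px = 8 µm per cell) with the 24 µm figure quoted above; it is not stated explicitly in the text.

```python
RESOLUTION_UM_PER_PX = 0.25  # image resolution stated in the text

def stage_coverage_um(stage, neighborhood=3):
    """Physical side length (in micrometers) covered by a
    `neighborhood` x `neighborhood` block of feature vectors at FPN
    stage C_s, where stage C_s has a downsampling rate of 2**s."""
    stride_px = 2 ** stage
    return neighborhood * stride_px * RESOLUTION_UM_PER_PX

# C5: 3 * 32 px * 0.25 um/px = 24 um, on the order of a ~25 um mitotic cell
# C4: 12 um and C3: 6 um, too small to span a whole cell
```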

A.4. Target Center Adjustment
Despite window relocation having reassessed numerous false-positive samples at the boundaries of sliding windows, low-quality bounding boxes persist in the classification stage. We employed the target center adjustment method to mitigate input translation variance. Optimization was achieved by balancing the combination of the regression loss L_reg and the classification loss L_cls through the loss function L = λL_cls + (1 − λ)L_reg, as expressed in Equation (A.5). Here, the parameter λ is utilized to control the weighting of the two losses.
Here λ ∈ [0, 1] is the loss allocation weight; we set λ to 0.95. The classification loss L_cls is computed as the standard cross-entropy loss between the predicted and true target categories. The regression loss L_reg is derived from the distance between the predicted and true target centers.
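A minimal sketch of this weighted loss follows, assuming cross-entropy for the classification term and a Euclidean distance between centers for the regression term (the exact distance metric is not specified in the text, and the function names are hypothetical):

```python
import math

def cross_entropy(probs, true_idx):
    """Standard cross-entropy for one sample: -log p(true class)."""
    return -math.log(probs[true_idx])

def center_distance(pred_center, true_center):
    """Euclidean distance between predicted and true target centers
    (an assumed choice of regression distance)."""
    return math.dist(pred_center, true_center)

def combined_loss(probs, true_idx, pred_center, true_center, lam=0.95):
    """L = lam * L_cls + (1 - lam) * L_reg, with lam in [0, 1]
    controlling the allocation between the two losses."""
    l_cls = cross_entropy(probs, true_idx)
    l_reg = center_distance(pred_center, true_center)
    return lam * l_cls + (1 - lam) * l_reg
```

With λ = 0.95, the loss is dominated by the classification term, while the small regression term still nudges the predicted box toward the true target center.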

A.5. Experimental Setup for Target Center Adjustment
We employed the target center adjustment method, utilizing DenseNet201 as the network backbone during training. The mitotic cells located during the detection stage were subjected to re-identification in the classification stage. Table B.8 provides comparative data on the model's classification experiments and complexity. The spatial complexity of the model is represented by the parameter count (Params), which denotes the total number of trainable parameters in the network. The temporal complexity of the model is expressed through the floating-point operation count (FLOPs) and multiply-accumulate operations (MACs), serving as metrics for the computational complexity of the model. For the integration of multiple pre-trained models in InPreMo, the time and space complexities are the sums of the respective time and space complexities of the individual pre-trained models.
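The additivity of ensemble complexity described above can be expressed directly. The numbers below are hypothetical placeholders for illustration only, not the measurements reported in Table B.8:

```python
def ensemble_complexity(models):
    """Per the text, the Params/FLOPs/MACs of an InPreMo ensemble are
    the sums of the individual pre-trained models' complexities."""
    totals = {"params": 0, "flops": 0, "macs": 0}
    for m in models:
        for key in totals:
            totals[key] += m[key]
    return totals

# Hypothetical placeholder figures, NOT the paper's measurements:
effnet_b7 = {"params": 66_000_000, "flops": 74_000_000_000,
             "macs": 37_000_000_000}
vgg16 = {"params": 138_000_000, "flops": 31_000_000_000,
         "macs": 15_500_000_000}
```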
In the single-model classification task, the EfficientNet-B7 pre-trained model outperforms EfficientNet-B4, with increases of 3.2% and 1.1% in Precision and F1, respectively. DenseNet121, in turn, performs better than the DenseNet201 pre-trained model. VGG16, however, yields relatively poorer results than the other pre-trained models.
When combining two pre-trained models, although the VGG16 pre-trained model performs relatively poorly in the single-model experiments, the combination of EfficientNet-B7 and VGG16 achieves the best results. The combination of EfficientNet-B4 and VGG16 increased Precision and F1 by 3.2% and 0.9%, respectively, over EfficientNet-B4 alone, and by 3.6% and 1.1%, respectively, over VGG16 alone. The combination of EfficientNet-B4 and DenseNet201 also enhanced overall performance, while the combination of EfficientNet-B7 and DenseNet201 decreased performance.
When combining three pre-trained models (EfficientNet-B7, ResNet50, and VGG16), we achieved relatively favorable experimental results. The most significant improvement was in Precision: an increase of 1% over EfficientNet-B7 alone and of 1.8% over the combination of EfficientNet-B7 and VGG16. However, this performance enhancement is accompanied by a significant increase in the number of parameters.
In the classification stage, the combination of EfficientNet-B7 and DenseNet121 does not outperform the individual pre-trained model EfficientNet-B7, but it surpasses the individual pre-trained model DenseNet121. At the same time, the fusion of pre-trained models does not necessarily lead to performance improvement: the combination of EfficientNet-B7 and DenseNet201 lowers the overall performance below that of either single pre-trained model.

We observed that combining highly complex models often introduces adverse effects, especially for models that have exhibited overfitting. DenseNet121 performs better than DenseNet201, possibly because DenseNet201 has shown signs of overfitting. The combination of highly complex models such as EfficientNet-B7 and DenseNet201 results in a noticeable decrease in performance; the decline is larger than for the combination of EfficientNet-B7 and DenseNet121, potentially because the former pairs the higher model complexity of EfficientNet-B7 with the overfitted DenseNet201. Combining models with lower complexity can mitigate the negative impact of overfitting on performance. For instance, the combination of EfficientNet-B4 and DenseNet201, while employing the overfitted DenseNet201, benefits from the relatively lower complexity of the EfficientNet-B4 pre-trained model. Ultimately, compared to the EfficientNet-B4 model, this combination shows improvements of 0.7% in each of Precision, Recall, and F1; compared to the DenseNet201 model, it exhibits improvements of 0.4% and 0.3% in Precision and F1, respectively.
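The exact fusion rule used by InPreMo is not reproduced in this appendix; the sketch below assumes uniform averaging of per-model class probabilities, which is one common way to combine pre-trained classifiers, and the function names are hypothetical.

```python
def fuse_probabilities(per_model_probs):
    """Average the class-probability vectors produced by each
    pre-trained model in the ensemble (assumed fusion rule:
    uniform averaging)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models
            for c in range(n_classes)]

def predict(per_model_probs):
    """Predicted class index = argmax of the fused probabilities."""
    fused = fuse_probabilities(per_model_probs)
    return max(range(len(fused)), key=fused.__getitem__)
```

Under this rule an overfitted member drags the fused distribution toward its own confident errors, which is consistent with the degradation observed when pairing EfficientNet-B7 with the overfitted DenseNet201.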

Figure A.1 (b) demonstrates the annotated size of a mitotic cell with a diameter of 25 µm. The downsampling rate of the C5 stage precisely encompasses the entire region of a mitotic cell. Therefore, by applying DiCoA at the C5 stage, we can better capture information related to mitotic cells. In earlier stages, such as the C4 stage with a downsampling rate of 2^4 or the C3 stage with a downsampling rate of 2^3, DiCoA cannot adequately capture the information of mitotic cells.

Supplementary Figure A.1. Schematic representation of the attention receptive field size of DiCoA. (a) Explanation of the attention span for a single pixel; (b) a green circle highlights a mitotic cell.

B.9. Evaluation of Sensitivity and Specificity in the Classification Stage

Supplementary Table A.3. Model Architectures for the Detection Stage.

The input image resolution was set to 128 × 128. The network backbone was initialized with ImageNet pre-trained weights. Training was performed with a batch size of 64 using the Adam optimizer. The initial learning rate was set to 10^-4 and, during training, was adjusted to 1/10 of the initial value after the 22,500th and 27,000th iterations. In each experiment, λ was set to 0.95. To enhance the diversity of the training data, we applied random flipping and standard photometric data augmentation. For the CMC dataset, the threshold for the positive class was set to 0.2.
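The step schedule described above can be written out directly; this is a small sketch of the stated decay rule, with hypothetical function and parameter names:

```python
def learning_rate(iteration, base_lr=1e-4,
                  milestones=(22_500, 27_000), gamma=0.1):
    """Step schedule: starting from `base_lr`, the learning rate is
    multiplied by `gamma` (1/10) after each milestone iteration,
    matching the 22,500th and 27,000th iterations stated in the text."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr
```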

Supplementary Table A.4. Model Architectures for the Classification Stage.

Supplementary Table B.11. Single/Multi-Model Classification Experiments and Complexity Comparison (efficacy ablation experiments of InPreMo).

Supplementary Table B.12. Evaluation of Sensitivity and Specificity in the Classification Stage.