Efficient semi-supervised semantic segmentation of electron microscopy cancer images with sparse annotations

Electron microscopy (EM) enables imaging at a resolution of nanometers and can shed light on how cancer evolves to develop resistance to therapy. Acquiring these images has become a routine task.However, analyzing them is now a bottleneck, as manual structure identification is very time-consuming and can take up to several months for a single sample. Deep learning approaches offer a suitable solution to speed up the analysis. In this work, we present a study of several state-of-the-art deep learning models for the task of segmenting nuclei and nucleoli in volumes from tumor biopsies. We compared previous results obtained with the ResUNet architecture to the more recent UNet++, FracTALResNet, SenFormer, and CEECNet models. In addition, we explored the utilization of unlabeled images through semi-supervised learning with Cross Pseudo Supervision. We have trained and evaluated all of the models on sparse manual labels from three fully annotated in-house datasets that we have made available on demand, demonstrating improvements in terms of 3D Dice score. From the analysis of these results, we drew conclusions on the relative gains of using more complex models, and semi-supervised learning as well as the next steps for the mitigation of the manual segmentation bottleneck.


Introduction
Recent advances in cancer nanomedicine have made cancer treatment safer and more effective (Shi et al., 2017).Nanotechnology has elucidated interactions between tumor cells and their microenvironment showing key factors in cancer behavior and responses to treatment (Hirata and Sahai, 2017;Tanaka and Kano, 2018).Gaining a deeper understanding of the underlying mechanisms taking place during such interactions will help us understand how cancer grows and develops drug resistance, and ultimately help us find new, efficient, and safe therapeutic strategies aimed at disrupting cancer development (Baghban et al., 2020).
To do this, high-resolution information collected from the cellular components at a nanometer scale using focused ion beam-scanning electron microscopy (FIB-SEM) is especially useful as it provides volumes of serially-collected 2D SEM images, creating volume EM (vEM) image stacks, and allowing access to 3D information from tissues (Giannuzzi and Stevie, 2005).This fully automated protocol avoids artifacts associated with serial microtomies and enables voxels to be isotropic, thus yielding a similar image quality in all dimensions, beneficial for feature recognition and context within the volume (Bushby et al., 2011).
These advantageous features have made SEM desirable for use in clinical programs.However, the analysis-limiting step is the extraction of meaningful features, starting with the segmentation of cellular components present in these images.This is currently done by human experts through hand annotation.It is a tedious and time-consuming task, making it unsuitable for medical applications and decisions where time is a critical factor.To overcome this limitation and fully make use of FIB-SEM in a clinical setting, the development of automated and robust models is critical to speeding up this task (Perez et al., 2014).
Segmenting images acquired via FIB-SEM is a difficult problem.Indeed, these images differ considerably from natural ones (images representing what human beings would observe in the real world), and even from other microscopy techniques such as fluorescence microscopy, due to increased noise, different collection resolution, and the reduced number of image channels.EM images are single-channel (grayscale) and tend to have limited contrast between objects of interest and background (Karabǧ et al., 2020).Furthermore, the ultrastructure of tumor cells and their microenvironment vary from those of normal cells (Nunney et al., 2015), and EM analysis methods can be tissue type dependent; most current methods have been developed for neural images (Taha and Hanbury, 2015;Zhang et al., 2021).Therefore, segmentation methods designed to assist other microscopy modalities or other tissue types cannot be applied to ultrastructure segmentation of cancer cells imaged by EM.
We expanded on previous work from Machireddy et al. (2021), where authors showed that a sparsely manually annotated dataset, typically around 1% of the image stack, was sufficient to train models to segment the whole volume.While state-of-the-art in semantic segmentation has been dominated by attention-based models for natural images (Jain et al., 2021), convolutional architectures remain mainstream with EM data, and were used in Machireddy et al. (2021), and the companion paper within this journal volume.In this paper, we compared architectures as well as training frameworks to find the most suitable one for the task of semantic segmentation in the aforementioned specific context of FIB-SEM images.By optimizing the learning process, we expected to improve overall segmentation results and minimize the manual annotation bottleneck by reducing the number of manually labeled images needed for training.
In this study, we focused on the segmentation of nuclei and nucleoli in vEM image stacks acquired from human tumor samples, as both are commonly used as cancer cell identifiers (Zink et al., 2004) and have emerged as promising therapeutic targets for cancer treatment (Lindström et al., 2018).Segmenting both structures accurately has thus proven essential.We evaluated the selected models on FIB-SEM images of three longitudinal tissue biopsy datasets that are available on demand as part of the Human Tumor Atlas Network (HTAN).A quick visualization of the data and end results can be found in Figures 1, 2.

Dataset collection
In a sanctioned observational study approved by the institutional review board, we obtained volumes 2 and 3 at two distinct time points during cancer treatment from a patient diagnosed with metastatic ER + breast ductal carcinoma.Additionally, volume 1 was procured from a patient with pancreatic ductal adenocarcinoma.All biopsies underwent acquisition and analysis in accordance with the OHSU IRBapproved Molecular Mechanisms of Tumor Evolution and Resistance to Therapy protocol (IRB#16113).The HTAN sample IDs for volumes 2 and 3 are HTA9-1 bx1 and HTA9-1 bx2, respectively.Informed written consent was secured from all subjects.
The samples were preserved using Karnovsky's fixative [2.0%PFA, 2.5% Glutaraldehyde (Karnovsky, 1964)], post-fixed utilizing an OsO4-TCH-OsO4 staining protocol, and embedded in Epon resin (Riesterer et al., 2019).Post-fixation staining involved binding heavy metals to lipid-rich membranes to enhance contrast in electron microscopy imaging.To achieve high-resolution, chargefree, high-contrast, and low-noise images, a conductive coating with an 8 nm-thick layer of carbon was applied.The imaging process was carried out using a FEI Helios NanoLab 660 DualBeam ™ FIB-SEM to collect high-resolution 3D volumes of the resin-embedded blocks.
Targeted volumes were obtained through the use of a Ga + FIB source, sequentially slicing a few nanometers from the sample to expose a fresh surface for subsequent imaging.The slicing/imaging cycle was automated through the FEI AutoSlice and View ™ software extended package, with image collection during 3D data acquisition utilizing the in-column detector (ICD).The isotropic resolution for imaging metastatic breast cancer and primary pancreatic tissues was 4 and 6 nm, respectively.

Image preprocessing
Following data acquisition, images within the stack underwent translational alignment in the XY-plane using an internally developed stochastic version of TurboReg affine transformation (Thevenaz et al., 1998).During the alignment step, zero-padding was applied to the images to maintain a consistent size, after which they were cropped to generate the final 3D image volumes.The registration and edge cropping procedures resulted in a final resolution of 6065 × 3976 × 792 for volume 1, 5634 × 1912 × 757 for volume 2, and 5728 × 3511 × 2526 for volume 3.
At times, the brightness of certain images in the stack exhibited variability, introducing complexity and rendering segmentation more challenging.To ensure uniformity across the stack and mitigate image complexity, histogram equalization was implemented.

Training and evaluation
Previously, we trained a model for each dataset using a subset of manually labeled images spaced evenly along the volumes and evaluated on remaining unlabeled images.We reported results on using 7, 10, 15, and 25 training images on all datasets, which represents between 0.3% and 3.3% of all image slices depending on the dataset.Volume 1, 2, and 3 have a total of 792, 757, and 2526 slices respectively; all slides except for the training ones were used for producing evaluation Tables 3-6.As these EM images had large dimensions (typically around 6000 × 4000 pixels), they were cropped to 512 × 512 tiles.We followed the same procedure as (Machireddy et al., 2021) of extracting tiles of size 2048 × 2048 and downsampling them to 512 × 512 as a way to capture larger spatial context.We applied standard random flip and rotation data augmentations.When training with nucleoli, as they account for a small area in the total image, taking random crops effectively resulted in most crops being empty, and models collapsing to the prediction of background.To address this issue, we ensured that more than 99% of crops in a batch contain nucleoli.
Moreover, we selected the Dice score as an evaluation metric because it can be seen as a harmonic mean of precision and recall, and is fitted when dealing with imbalanced class settings, which is our case.However, we also report the 3D Dice score rather than the average of individual Dice scores across all slices as reported in Machireddy et al. (2021).Indeed, we found the latter to be biased towards giving more importance to slices with fewer foreground pixels, while the former effectively reflects the captured percentage of the target structure.For the sake of comparison, we reported the averaged version in our results section in addition to the 3D Dice scores.We recommend however to use the latter.The 3D Dice and average dice are more precisely defined as follows: 3D Dice Dice Predicted Volume, Ground Truth Volume ( )

Number of slices in volume
All models were trained and evaluated on one NVIDIA V100 GPU, as we strongly believe we should keep our clinical end goal in mind and aim to reflect image analysis capabilities available to teams with reasonable computational power.To this end, we also report training times and the number of parameters in Table 1.

Fully-supervised framework 2.3.1 ResUNet
We used previous work from Machireddy et al. (2021) as the baseline for nuclei and nucleoli segmentation.The model used was a Residual U-Net (ResUNet) (He et al., 2015), a simple yet robust fully convolutional encoder-decoder network.U-net and its variants are the most prominent architectures for image segmentation, as the residual connections solve the gradient vanishing problem faced when working with very deep models (Glorot and Bengio, 2010), while the different levels allow feature refinement at different scales.These features have made U-nets widely used in many computer vision problems, including analysis of medical data (Siddique et al., 2020;Su et al., 2021).

UNet++
UNet++ (Zhou et al., 2018) was the first model we decided to compare to the baseline.Our motivation to use this model came from the fact that it was heavily inspired by ResUNet, and was especially designed for medical-like image segmentation.The major differences from ResUNet are the presence of dense convolution blocks on the skip connections and deep supervision losses.Dense convolution blocks aim at reducing the semantic gap between the encoder and decoder, while deep supervision loss enables the model to be accurate (by averaging outputs from all segmentation branches) and fast (by selecting one of the segmentation maps as output).In our work, as we were primarily focused on accuracy, we used the average of all branches.We used the implementation available in the Segmentation Models Python library 1 with ResNet34 as the encoder backbone, and the soft Dice loss (DL) function which is commonly used in semantic segmentation for images when background and foreground classes are imbalanced and is defined as follows for a ground truth y and prediction p:

FracTALResNet
FracTALResNet (Diakogiannis et al., 2021) was also used for comparison.While the original model presented is designed for the task of semantic change detection, it can be adapted for semantic segmentation, and such architecture is in fact available in the authors' official implementation 2 .It was heavily inspired by ResUNet as well, but makes use of a multi-head attention layer (FracTAL block).It also makes use of boundaries and distance maps calculated from the segmentation masks in order to improve performances, but at the cost of both memory and computational time during training.It is trained using the Fractal Tanimoto similarity measure.

CEECNet
CEECNet was also introduced in Diakogiannis et al. ( 2021) and for the same purpose as FracTALResNet but managed to achieve state-of-the-art performances by focusing on context.Indeed, the CEECNet block stands for Compress-Expand Expand-Compress and is comprised of two branches.The first branch (CE block) processes a view of the input in lower resolution, while the second branch (EC block) treats a view in higher spatial resolution.The motivation behind using this model came from the fact that, as described in Section 2.2, feeding more context by down-sampling to a lower resolution is beneficial to segmentation accuracy.Since the core block of CEECNet is based on the compress and expand operations, we believed this network would be able to leverage contextual information in order to achieve better segmentation performances.Similar to FracTALResNet, it was trained with the Tanimoto similarity measure and needs computed boundaries and distance maps.

SenFormer
SenFormer (Bousselham et al., 2021) (Efficient Self-Ensemble Framework for Semantic Segmentation) was the last fullysupervised method tested.It is a newly developed ensemble approach for the task of semantic segmentation that makes use of transformers in the decoders and the Feature Pyramid Network (FPN) backbone.Our motivation behind using this model came from the fact that it is almost purely attention-based, which by definition adds spatial context to the segmentation.

Supervised model choice
Since model architecture is orthogonal to using a semisupervised framework, we picked the best-performing model using Dice scores, as can be seen in Table 2. UNet++ performed better on average, exhibited a low variance, often performed the best out of fully-supervised architectures, and almost never underperformed (as shown by the average rank).Detailed results were reported in Tables 3, 4. For these reasons, it was the model we chose to compare to the baseline in the semisupervised framework.

Semi-supervised learning (SSL) framework 2.4.1 Cross pseudo supervision (CPS)
As described in Section 2.2, roughly 1% of the collected images were manually annotated and used for training the fully-supervised methods.To take advantage of the potential semantic information contained in the unlabeled images, we used the CPS framework described in Chen et al. (2021).It trained two networks and in addition to using the standard pixel-wise cross-entropy loss on   labeled images, used pseudo-labels generated from the segmentation confidence map of one network to supervise the other as can be seen in Figure 3. Loss for unlabeled images in CPS is defined as follows with D u denoting unlabeled data, p i the segmentation confidence map, y i the predicted label map, ℓ ce the cross-entropy loss, 1, and 2 representing each network: The total loss is thus defined as follows with L s the standard pixelwise cross-entropy loss: We trained both ResUNet and UNet++ models with this framework and used the same number of labeled images as in the supervised setting, treating the remaining images in each volume as unlabeled data.In our study, we noticed that models trained with a loss more similar to the evaluation metric performed better.As a consequence, we replaced all of the cross-entropy losses originally used in Chen et al. ( 2021) with the soft Dice loss 1.We also noticed that learning needed to be driven by the supervised loss during the first epochs.At the beginning of training, models had no prior knowledge of the segmentation task, and thus, could not yield relevant pseudo-labels resulting in frequent collapsing to predict only the background, especially when working with nucleoli.To resolve this, we implemented a linear warm-up to λ, the parameter used to balance the CPS loss with the supervised loss, so that the latter has priority over the former during the early steps of training.We used a value of 1 for λ in all of our experiments and the following function for linear warmup.

Integration of CutMix data augmentation
CutMix (Yun et al., 2019) is a popular data augmentation method for training classifiers that shuffles information throughout the training batch, and has recently been used in semi-supervised segmentation tasks.It boosts accuracy by making the models focus on less discriminative parts of objects to segment, in our case we can imagine that some part of a nucleus is replaced with background, then models have to use something other than the nuclear membrane to detect the nucleus, forcing them to learn other features such as texture.In the authors' implementation, when using the CutMix strategy in the CPS loss, the latter is only optionally computed on labeled data.However, in our case, not having labels meant not being able to ensure the CPS batch contained any nucleoli as we did for supervised methods (see Section 2.2).This made the loss unstable as models performed poorly on empty images.To solve this issue, we trained all models with both the supervised and unsupervised CPS loss, and ensured that at least half of the CPS batch contained nucleoli.We tried using CutMix in the fully supervised setting, however, it did not yield any significant improvement.We believe this to be due to the fact that the number of images we trained on was so limited in the fullysupervised setting, CutMix could not add much new information during augmentation.

Accurate segmentation of nuclei and nucleoli
For comparison with previous results (Machireddy et al., 2021), we first used the same average of 2D dices (Tables 3, 4) while also providing the unbiased 3D Dice metric results (Tables 5, 6).The 3D dice scores are almost always higher than their average counterpart, which is positive as they are more representative of the segmentation quality.This is particularly visible for nucleoli (compare Tables 4, 6).As can be observed in Tables 3, 4, all evaluated models were able to accurately segment both nuclei and nucleoli.Despite the introduction of attention enabling models to gain a marginal edge over the baseline introduced in Machireddy et al. ( 2021), a clear advantage was obtained only when using SSL, especially when paired with CutMix data augmentation.The best results are given by UNet++ with SSL and CutMix, which indicates both model selection and using a semi-supervised framework help improve segmentation performance.

Benefits of semi-supervised learning
While fully-supervised models could sometimes outperform SSL ones on specific datasets (for example, CEECNet on Volume 1 nucleoli), SSL remained stable over all structures and volumes.It outperformed the baseline for all datasets, most noticeably on Volume 3 nucleoli, with an average gain of 0.11 in Dice, representing a 15.6% performance increase.One of the reasons behind this performance gain is the high heterogeneity in Volume 3 nucleoli, and most models struggled segmenting unseen structures, as can be observed in Figure 4.As the performances of different fully-supervised methods varied highly depending on the volumes (for example, Senformer under-performed in segmenting Volume 1 nucleoli), the SSL methods remained stable.When evaluating our models, we noticed that fully supervised methods performed really well around the images they were trained on (see Figure 5), yielding a near perfect Dice score.However, performances dropped as soon as evaluation images start being dissimilar to the training images, thus forming dips visible in the plot.This is a clear sign of overfitting that SSL prevented thanks to the regularization added by the CPS loss.This stability and consistency across image volumes allows, in addition to the performance gain, an easier postprocessing of the segmented volume by manual inspection and interpretation or algorithmic analysis.These result made us believe semi-supervised frameworks were key in attaining better generalization performance in our sparse annotation setting.Indeed, the Dice score of UNet++ with SSL and Cutmix are better most of the time with only 7 training images than what was achieved previously with 25 images in Machireddy et al. (2021) or with supervised models in this study.
Seeing that the semi-supervised framework yielded way more consistent results, we looked into whether segmentation networks trained under this framework would be able to generalize across samples.To test this, we trained a model on two of our three volumes and evaluated on the third.We noticed that the performance  dropped significantly (with dice scores between 0.4 and 0.6) depending on the volume.We hypothesize that two volumes worth of data is not enough to generalize across samples and leave to future work exploration of ways to deal with this issue such as expanding the dataset, data normalization, knowledge distillation or style transfer.

FIGURE
Comparison of scores for nuclei segmentation along all 757 slices in Volume 2 with 7 training images.Training slices are marked with vertical black lines.We can clearly observe the seven peaks in performance and drops in between for fully-supervised methods (beige to brown) as opposed to the stability of the SSL models (blue).
In this work, we investigated the segmentation of nuclei and nucleoli in vEM images of cancer cells.We studied the performances of several leading deep-learning models and assessed the relative performance gains of each method.We provided insight as to why semi-supervised methods were able to yield more robust results and managed to improve on previous work both in terms of reducing the amount of data needed and segmentation performances, with an improved Dice on all Volumes.We believe the key components that make these improvements possible in our pipeline are the combination of exploiting unlabeled data information (SSL framework), carefully picking a segmentation model, effective data augmentation methods (CutMix), efficient losses, and keeping dataset specificities in mind (e.g., adapting the batch selection and CPS framework to nucleoli sparsity (see Sections 2.2, 2.4.1, 2.4.2).We made the experiment code available at 3 and the complete manual annotations for the data have been provided through the HTAN data portal.We believe that semi-supervised methods are a key component in segmentation with sparse annotations as they proved to be superior in both quantitative and qualitative evaluations.

FIGURE 1
FIGURE 1 Example image slices and corresponding ground truth annotations from Volumes 1 and 3.The horizontal image width is equal to 25 µm.Nuclei are in red and nucleoli in blue.(A) Volume 1 example image.(B) Volume 1 example image annotations.(C) Volume 3 example image.(D) Volume 3 example image annotations.

FIGURE 3
FIGURE 3Illustration of the CPS training framework.X are inputs with the same data augmentation.θ 1 and θ 2 are two segmentation networks with the same structure and differently initialized weights.P i is the segmentation confidence map, Y i is the predicted label map.→ means forward operation, −−→ loss supervision, and // on → stopping the gradient.

TABLE 1
Number of parameters and training times for one volume.SSL-trained models require double the number of parameters because two models are trained at the same time.

TABLE 2
Average, standard deviation, and average rank for dice score over all volumes.

TABLE 3
Nuclei segmentation dice scores.Columns labeled 7, 10, 15, and 25 in the second line represent the number of training images for each volume.

TABLE 4
Nucleoli segmentation dice scores.Columns labeled 7, 10, 15, and 25 in the second line represent the number of training images for each volume.

TABLE 5
3D nuclei segmentation dice scores.Numbers 7, 10, 15, and 25 in the second line represent the number of training images for each volume.

TABLE 6
3D nucleoli segmentation dice scores.Numbers 7, 10, 15, and 25 in the second line represent the number of training images for each volume.