Hierarchical Encoder-Decoder With Soft Label-Decomposition for Mitochondria Segmentation in EM Images

Semantic segmentation of mitochondria from electron microscopy (EM) images is an essential step to obtain reliable morphological statistics about mitochondria. However, automatically delineating plenty of mitochondria of varied shapes from complex backgrounds with sufficient accuracy is challenging. To address these challenges, we develop a hierarchical encoder-decoder network (HED-Net), which has a three-level nested U-shape architecture to capture rich contextual information. Given the irregular shape of mitochondria, we introduce a novel soft label-decomposition strategy to exploit shape knowledge in manual labels. Rather than simply using the ground truth label maps as the unique supervision in the model training, we introduce additional subcategory-aware supervision by softly decomposing each manual label map into two complementary label maps according to mitochondria's ovality. The three label maps are integrated with our HED-Net to supervise the model training. While the original label map guides the network to segment all the mitochondria of varied shapes, the auxiliary label maps guide the network to segment subcategories of mitochondria of circular shape and elliptic shape, respectively, which are much more manageable tasks. Extensive experiments on two public benchmarks show that our HED-Net performs favorably against state-of-the-art methods.


INTRODUCTION
Mitochondria are the site of oxidative metabolism in eukaryotes and an essential place to synthesize adenosine triphosphate (ATP) to provide power for cells (Brand et al., 2013). Recent research (Seo et al., 2019) has found that mitochondria are closely related to the occurrence of genetic diseases and the survival of cancer cells. Changes in mitochondrial morphology have a direct impact on mitochondrial function. Mitochondria delineation in electron microscopy (EM) images plays a vital role in assisting neuroscientists to analyze the morphology and distribution of mitochondria. However, manual delineation of mitochondria in many high-resolution EM images requires a vast amount of time and effort from annotation experts. Therefore, automated mitochondria segmentation algorithms with sufficient accuracy are highly desirable to help neuroscientists analyze EM images. Mitochondria, however, have varied shapes, ranging from punctate structures to tubular networks (Wei et al., 2020), so accurately segmenting them from complex backgrounds is challenging. Example slices of EM images from two datasets are shown in Figure 1, where mitochondria show irregular shapes and other subcellular structures in the background are similar to mitochondria in appearance and shape. Notably, mitochondria show large variance in roundness. Figure 2A illustrates the ovality distribution of mitochondria on images from the EPFL dataset (Lucchi et al., 2013) and the Kasthuri++ dataset (Casser et al., 2020). The ovality p of each mitochondrion is defined as the length a of its major axis over the length b of its minor axis. The median of the ovality distribution is 1.56 on the EPFL dataset and 1.60 on the Kasthuri++ dataset. Further illustrations are shown in Figures 2B,C, in which we conducted ellipse fitting for each mitochondrion instance. It can be seen that some mitochondria have p over 10, while others have p lower than 1.5.
It is challenging for a model to delineate mitochondria of different shapes simultaneously, which motivates us to exploit subcategory information in our learning-based segmentation model.
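The ovality p = a/b defined above can be estimated in several ways. The following is a minimal sketch that assumes a moment-based ellipse fit (the source states only that ellipse fitting is used, not which algorithm); the function name `ovality` and the toy shapes are hypothetical.

```python
import math

def ovality(pixels):
    """Return the major/minor axis ratio of the moment-matched ellipse.

    `pixels` is an iterable of (row, col) coordinates of one instance.
    Assumption: a second-moment fit stands in for the ellipse fitting
    mentioned in the text.
    """
    pts = list(pixels)
    n = len(pts)
    mean_r = sum(r for r, _ in pts) / n
    mean_c = sum(c for _, c in pts) / n
    # Second central moments (covariance of the pixel coordinates).
    srr = sum((r - mean_r) ** 2 for r, _ in pts) / n
    scc = sum((c - mean_c) ** 2 for _, c in pts) / n
    src = sum((r - mean_r) * (c - mean_c) for r, c in pts) / n
    # Eigenvalues of the 2x2 covariance matrix are the squared axis scales.
    half_trace = (srr + scc) / 2.0
    root = math.sqrt(((srr - scc) / 2.0) ** 2 + src ** 2)
    lam_major = half_trace + root
    lam_minor = half_trace - root
    if lam_minor <= 0:
        return math.inf  # degenerate, line-like region
    return math.sqrt(lam_major / lam_minor)

# Toy shapes: a 6 x 6 square (round-ish) and a 4 x 16 bar (elongated).
square = [(r, c) for r in range(6) for c in range(6)]
bar = [(r, c) for r in range(4) for c in range(16)]
print(ovality(square), ovality(bar))
```

Under this assumption the square has ovality 1 and the bar has ovality well above the dataset medians reported above, so the two would fall into different subcategories.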
Recently, various methods (Lucchi et al., 2013; Cheng and Varshney, 2017; Cetina et al., 2018; Xiao et al., 2018; Casser et al., 2020; Peng and Yuan, 2020; Yuan et al., 2021) have been introduced to address mitochondria segmentation. According to the features they use, mitochondria segmentation methods can be categorized into two classes: traditional methods with handcrafted features (Lucchi et al., 2011, 2013; Cetina et al., 2018; Peng and Yuan, 2020) and deep learning methods with automatically learned features (Cheng and Varshney, 2017; Xiao et al., 2018; Casser et al., 2020; Yuan et al., 2020, 2021). Generally speaking, deep-learning-based methods, especially methods based on fully convolutional neural networks (Ronneberger et al., 2015; Litjens et al., 2017; Shelhamer et al., 2017), show better performance than traditional machine learning and computer vision methods (Lucchi et al., 2013; Cetina et al., 2018; Peng and Yuan, 2020). Since EM images are volumetric data, both 3D models and 2D models have been adopted in each class. Typically, 3D models (Çiçek et al., 2016; Xiao et al., 2018; Yuan et al., 2021) show better performance by taking advantage of full spatial contexts but at the expense of high computational cost; in contrast, 2D models (Ronneberger et al., 2015; Casser et al., 2020) are more computationally efficient but may neglect inter-slice consistency and show inferior performance. However, 2D methods are flexible to process EM images with large slice thickness. In this study, we follow the slice-by-slice segmentation strategy and aim to devise a powerful 2D model in the deep learning framework.
FIGURE 1 | Illustration of typical EM images, in which mitochondria show varied shapes. (A) The EPFL data (Lucchi et al., 2013) were taken from the CA1 hippocampus region of a mouse brain; (B) the Kasthuri++ data (Casser et al., 2020) were taken from mouse cortex. The red contours represent the corresponding ground-truth segmentation.
Among the deep-learning-based methods, the 2D U-Net (Ronneberger et al., 2015) and 3D U-Net (Çiçek et al., 2016), typical encoder-decoder networks with skip connections, are strong baseline models for 2D segmentation and 3D segmentation, respectively. Casser et al. (2020) used a modified 2D U-Net with on-the-fly data augmentation and Z-filtering post-processing, and their model showed clearly improved performance over the 2D U-Net. Cheng and Varshney (2017) improved the 2D/3D U-Net with factorized convolutions and online feature-level augmentations and showed improved results over the 2D/3D U-Net. Xiao et al. (2018) proposed an effective approach using a modified 3D U-Net, a 3D residual convolutional network with deep supervision. Yuan et al. (2021) introduced a lightweight HIVE-Net with state-of-the-art performance. Their method conducted 3D segmentation but essentially with only the more computationally efficient 2D convolutions. An auxiliary centerline detection task was added to capture an intrinsic shape prior. Given the high computational complexity of 3D networks, we follow the slice-by-slice segmentation strategy with 2D networks. However, due to the ambiguity of mitochondria segmentation, it is challenging to accurately delineate many mitochondria of varied shapes with information from a single image slice.
FIGURE 2 | The ovality distribution of mitochondria on images from the EPFL dataset and the Kasthuri++ dataset. The ovality p of each mitochondrion is defined as the ratio of the length of its major axis a to the length of its minor axis b, which can be obtained by performing ellipse fitting on the label maps. The median of the ovality distribution (A) is 1.56 on the EPFL dataset and 1.60 on the Kasthuri++ dataset. The red contours in (B,C) represent mitochondria with p over 1.60, and the green contours represent those with p lower than 1.60.
To address these challenges, we propose a novel hierarchical encoder-decoder network, named HED-Net, with a three-level nested encoder-decoder architecture to capture multi-scale contextual features, which are crucial to discriminate objects from complex backgrounds. Inspired by Qin et al. (2020), we use micro U-Nets to substitute standard convolutions, which constitute the deeper level of encoder-decoder in our HED-Net. To improve the identification of mitochondria of varied shapes, we propose to exploit shape knowledge from manual labels. Note that manual labels are typically used only as pixel-wise supervision on the final output layer during model training. However, the manual label map for each training image contains more global and semantic information that can be explored to boost the segmentation. Although it is impractical to build a statistical shape model as the prior for multi-object segmentation, it is relatively easy to identify the roundness of each mitochondrion. Based on this observation, we take an easy-to-hard strategy for this challenging binary segmentation problem and introduce subcategory information according to the roundness of the mitochondria, i.e., mitochondria of elliptic shape and mitochondria of circular shape. To guide the model training, we construct two auxiliary label maps with a soft label-decomposition strategy, which decomposes the ground truth label map into two complementary label maps. One label map takes higher values on mitochondria of elliptic shape and lower values on mitochondria of circular shape; the other label map takes lower values on mitochondria of elliptic shape and higher values on mitochondria of circular shape. All three label maps jointly supervise the proposed HED-Net. To this end, the outer level of our HED-Net consists of a soft label-decomposition subnet and a label-fusion subnet, both of which are encoder-decoders. A closely related method is the decompose-and-integrate strategy in Zhang et al. (2019) for multi-class segmentation, where a multi-class label map is split into several binary ones. The significant difference of our method is that we address the binary segmentation problem with a novel soft label-decomposition strategy. The underlying observation is that it is impractical to classify mitochondria into elliptic shapes and circular shapes using a hard threshold on the ovality of mitochondria.
The main contributions of this study can be summarized as follows:
• We propose a soft label-decomposition strategy to exploit side shape information in manual labels.
• A three-level nested encoder-decoder network is introduced to capture rich contextual information and facilitate the integration of subcategory-aware supervision.
• Validations on two challenging benchmarks show that the proposed 2D method can achieve competitive performance in terms of class-level and instance-level measures.
The remainder of this paper is arranged as follows. We elaborate on the proposed methods in section 2. We present experiments and results in section 3. Section 4 concludes this study.

METHOD
In this section, we describe the proposed HED-Net in detail. To segment mitochondria from volumetric EM images, we follow the slice-by-slice segmentation strategy. To capture inter-slice continuity, the proposed model takes five adjacent slices as the input but only outputs the prediction for the center slice of the multichannel input. Figure 3 provides an overview of the proposed HED-Net, which is composed of two stages of encoder-decoder with cross-stage skip-connections and is supervised by multiple label maps, which will be discussed later in detail. The first stage of the HED-Net is a two-head encoder-decoder, named Soft Label-Decomposition Sub-Net, which is supervised by auxiliary soft labels for subcategories of mitochondria. The second stage of the HED-Net is an encoder-decoder, named Soft Label-Fusion Sub-Net, which fuses the information from the predictions of the first stage and the original images and is supervised by the original ground truth label map.
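The 5-slice input described above can be sketched as a sliding window over the stack. This is an illustrative sketch only: the helper names are hypothetical, and clamping out-of-range indices at the stack boundary is an assumption, since the source does not say how the first and last slices are handled.

```python
def neighbor_indices(center, num_slices, window=5):
    """Indices of the `window` adjacent slices centered on `center`.

    Assumption: indices that fall outside the stack are clamped to the
    nearest valid slice (the source does not specify boundary handling).
    """
    half = window // 2
    return [min(max(center + k, 0), num_slices - 1)
            for k in range(-half, half + 1)]

def sliding_inputs(stack, window=5):
    """Yield (center_index, list_of_slices) for slice-by-slice inference."""
    for z in range(len(stack)):
        yield z, [stack[i] for i in neighbor_indices(z, len(stack), window)]

# Example with a stack of 165 placeholder slices.
stack = [f"slice_{z}" for z in range(165)]
inputs = list(sliding_inputs(stack))
print(inputs[0][1])   # window around the first slice (clamped at the start)
print(inputs[82][1])  # window around a middle slice
```

Each yielded window would be stacked along the channel axis, and the network would predict only the label map of the center slice.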

Overview of the Proposed Model
In both stages of the HED-Net, we use a slightly modified U-Net with residual connections as the encoder-decoder. The standard U-Net architecture has a contracting path as the encoder to extract semantic features and a symmetric expanding path as the decoder for precise boundary delineation, with skip-connections between the corresponding encoding and decoding layers. In our model, each encoder has four down-sampling layers, including one strided 3×3 convolution layer (the first layer) and three 2×2 max-pooling layers, and each decoder has four bilinear up-sampling layers. In terms of architecture, the main differences between the U-Net used in this paper and the standard U-Net (Ronneberger et al., 2015) are the use of strided 3×3 convolution for down-sampling and bilinear interpolation for up-sampling. Moreover, we use residual connections to achieve residual learning (He et al., 2016).
To capture rich contextual features, we replace standard 3×3 convolutional layers with micro U-Nets in all the encoder-decoders in the two stages. Each micro U-Net has three max-pooling layers for down-sampling in the encoder and three bilinear up-sampling layers in the decoder. Note that the idea of using small U-Nets as the building blocks of a U-Net was first introduced in Qin et al. (2020) for salient object detection.
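The nesting described above implies a simple resolution budget. The sketch below works through it under two labeled assumptions: every down-sampling layer halves the spatial size (the stride of the 3×3 convolution is not stated in the source), and the input is a 512×512 training crop as given in the implementation details. The function name is hypothetical.

```python
def pyramid_sizes(size, num_downsamplings, factor=2):
    """Spatial sizes after each down-sampling step, starting from `size`.

    Assumption: each down-sampling layer divides the side length by
    `factor` (2 for 2x2 max pooling; assumed for the strided convolution).
    """
    sizes = [size]
    for _ in range(num_downsamplings):
        if sizes[-1] % factor:
            raise ValueError(f"size {sizes[-1]} is not divisible by {factor}")
        sizes.append(sizes[-1] // factor)
    return sizes

# Outer encoder: four down-sampling layers applied to a 512 x 512 crop.
outer = pyramid_sizes(512, 4)
# A micro U-Net placed at the coarsest outer level: three max-pooling layers.
micro_at_bottom = pyramid_sizes(outer[-1], 3)
print(outer, micro_at_bottom)
```

Under these assumptions a 512-pixel side survives both levels of nesting without fractional sizes, which is one plausible reason to crop at a power of two.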

Soft Label-Decomposition Subnet
The significant challenges for mitochondria segmentation are the varied shapes and complicated background in EM images, which typically result in missed detections, false detections, and inaccurate boundary delineation, especially for mitochondria of irregular shape. Typically, the manual labels are used only as pixel-wise supervision on the final output layer during model training. However, the manual label map for each training image contains more global and semantic information that can be explored to boost the segmentation of mitochondria. Therefore, we explore general shape knowledge extracted from label maps as side information to improve the segmentation. Although it is impractical to build a statistical shape prior for simultaneously segmenting many mitochondria with varied shapes from each EM image, it is relatively easy to identify the roundness of each mitochondrion. Based on this observation, we take an easy-to-hard strategy for this challenging binary segmentation problem and introduce subcategory information according to the roundness of the mitochondria, i.e., mitochondria of elliptic shape and mitochondria of circular shape. The two decoders in the Soft Label-Decomposition Sub-Net focus on the segmentation of mitochondria belonging to the two subcategories, respectively.
FIGURE 3 | An overview of the proposed HED-Net, which is a three-level nested encoder-decoder with micro U-Nets as the basic building blocks. The HED-Net consists of a soft label-decomposition stage and a soft label-fusion stage. The model is trained under the supervision of ground truth label maps and subcategory-aware label maps.

Soft Label-Decomposition
Since the boundary between mitochondria of elliptic shape and mitochondria of circular shape is ambiguous, we introduce a soft label-decomposition strategy to construct auxiliary label maps for supervising the two decoders in the Soft Label-Decomposition Sub-Net. Specifically, given the label map Y of a training image X, we construct two auxiliary label maps Y1 and Y2 according to the roundness of each mitochondrion in Y. Suppose a is the length of the major axis of a mitochondrion instance, and b is the length of the minor axis. We measure the roundness of each mitochondrion instance in Y by p = a/b, where the major axis and minor axis are estimated by ellipse fitting, as shown in Figure 2. The mitochondria in Figure 2 show significant differences in roundness. As p approaches 1, the mitochondrion instance becomes more like a circle. With a given threshold on p, we can categorize the mitochondria into two sub-classes, mitochondria of circular shape and mitochondria of elliptical shape, as shown in Figure 4. The two auxiliary label maps are defined pixel-wise according to how p(x) compares with the ovality threshold T, where x is a pixel in X, p(x) is the roundness of the mitochondrion instance that x belongs to, α ∈ [0.5, 1] is a positive constant, and T is a positive value. With α ∈ (0.5, 1], the label maps Y1 and Y2 put unequal weights on mitochondria of different shapes. In our experiments, the parameter T is set to 1.6 according to the statistical distribution of the mitochondrion instances in the training dataset. We set α = 0.9 to make the two label maps Y1 and Y2 highlight mitochondria of different subcategories. Although we can also achieve a hard label-decomposition by setting α = 1, it is challenging for a segmentation model to identify mitochondria in each subcategory.
FIGURE 4 | Illustration of the proposed soft label-decomposition. A ground truth label map Y in the training data is decomposed into two complementary label maps Y1 and Y2 according to the roundness of each mitochondrion instance. All three label maps are jointly used to supervise the model training.
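One way to realize a soft label-decomposition consistent with the description above is sketched here. The exact weighting is an assumption: foreground pixels of the emphasized subcategory receive α and those of the other subcategory receive 1 − α, so that Y1 + Y2 = Y and α = 1 recovers a hard decomposition. Treating p ≤ T as "circular" and letting Y1 emphasize circular instances are also assumptions, as are the function and variable names.

```python
def soft_decompose(instance_map, ovality, threshold=1.6, alpha=0.9):
    """Split a labeled instance map into two soft label maps (Y1, Y2).

    instance_map: 2D list of ints, 0 = background, k > 0 = instance id.
    ovality: dict mapping instance id -> p = a / b.
    Assumed weighting: alpha on the emphasized subcategory, 1 - alpha on
    the other; Y1 emphasizes circular (p <= threshold) instances.
    """
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0.5, 1]")
    y1, y2 = [], []
    for row in instance_map:
        row1, row2 = [], []
        for inst in row:
            if inst == 0:                      # background stays 0 in both maps
                w1 = w2 = 0.0
            elif ovality[inst] <= threshold:   # circular subcategory
                w1, w2 = alpha, 1.0 - alpha
            else:                              # elliptic subcategory
                w1, w2 = 1.0 - alpha, alpha
            row1.append(w1)
            row2.append(w2)
        y1.append(row1)
        y2.append(row2)
    return y1, y2

# Toy example: instance 1 is round-ish, instance 2 is elongated.
toy_map = [[0, 1, 1],
           [2, 2, 0]]
toy_ovality = {1: 1.2, 2: 3.0}
y1, y2 = soft_decompose(toy_map, toy_ovality)
print(y1)
print(y2)
```

With this weighting each decoder still sees every mitochondrion, but with a smaller target value on the subcategory it is not responsible for.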

Subcategory-Aware Supervision
The two auxiliary label maps Y1 and Y2 are used as deep supervision to guide the model training. The soft label-decomposition subnet consists of two decoders: a circular decoder that is mainly responsible for detecting circular mitochondria, and an ellipse decoder that is mainly responsible for detecting oval-shaped mitochondria. Since the class boundary between circular mitochondria and oval-shaped mitochondria is ambiguous, each decoder segments all mitochondria but puts higher weights on its focused subcategory.

Soft Label-Fusion Subnet
To integrate the predictions and features of the soft label-decomposition stage, we introduce a soft label-fusion stage, which is supervised by full ground truth labels and jointly trained with the first stage. The soft label-fusion subnet takes the predictions of the first stage and the original image as input. The features in the encoders of the first stage are reused in the decoder of the second stage through skip-connections and concatenation. The soft label-decomposition subnet and the soft label-fusion subnet constitute the proposed HED-Net.

The Total Loss
Let P1, P2, and P be the predictions of the circle decoder, the ellipse decoder, and the decoder of the second stage, respectively. The total loss of our HED-Net combines the losses of the three decoders, with a positive trade-off parameter λ that balances the first stage against the second stage. L_circle, L_ellipse, and L denote the losses for the circle decoder, the ellipse decoder, and the decoder of the second stage, respectively. They are defined based on the Dice loss function, evaluated over all pixel locations x. It is noteworthy that, since the auxiliary label maps Y1 and Y2 take different values on different mitochondrion instances, Dice(P1, Y1) and Dice(P2, Y2) are essentially weighted Dice losses.
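A minimal sketch of how such a loss could be assembled follows. Both formulas are assumptions rather than the authors' definitions: the total loss is taken as L + λ(L_circle + L_ellipse), which matches the later remark that λ = 0.5 gives the two first-stage decoders the same combined weight as the single second-stage decoder, and the Dice loss uses squared terms in the denominator so that a prediction equal to a soft label map yields zero loss.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between flat lists of predictions and (soft) labels.

    Assumed variant: squared terms in the denominator, so pred == target
    gives a loss of 0 even when the target values are fractional.
    """
    inter = sum(p * y for p, y in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(y * y for y in target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def total_loss(p1, p2, p, y1, y2, y, lam=0.5):
    """Assumed combination: second-stage loss + lam * first-stage losses."""
    first_stage = dice_loss(p1, y1) + dice_loss(p2, y2)
    return dice_loss(p, y) + lam * first_stage

# Toy flattened maps: one circular pixel, one elliptic pixel, one background.
y = [1.0, 1.0, 0.0]
y1 = [0.9, 0.1, 0.0]
y2 = [0.1, 0.9, 0.0]
print(total_loss(y1, y2, y, y1, y2, y))           # perfect predictions
print(total_loss([0.0] * 3, [0.0] * 3, [0.0] * 3, y1, y2, y))  # empty predictions
```

Because Y1 and Y2 carry instance-dependent values, the per-pixel products in `dice_loss` weight each instance differently, which is the sense in which the text calls these weighted Dice losses.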

RESULTS AND ANALYSIS
In this section, we first evaluate the segmentation and detection performance of our method on two public benchmarks, and then conduct an ablation analysis of our model.

Datasets
We evaluate model performance on two mitochondria datasets, which have different voxel spacings and different volume sizes. The public EPFL dataset provides two stacks for model training and testing, respectively; each stack has 165 consecutive slices of size 768 × 1,024, which were scanned with focused ion beam scanning EM (FIB-SEM) from the CA1 hippocampus region of a mouse brain. The Kasthuri++ dataset contains 85 consecutive image slices of size 1,643 × 1,613 for model training and 75 slices of size 1,334 × 1,553 for model testing. The images in the Kasthuri++ dataset were taken from 3-cylinder mouse cortex with serial section EM (ssEM). The Kasthuri++ dataset was relabeled by Casser et al. (2020). The two datasets have significantly different voxel spacing: 5 × 5 × 5 nm per voxel for the EPFL dataset and 3 × 3 × 30 nm per voxel for the Kasthuri++ dataset.

Criteria for Evaluating Binary Segmentation
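Dice similarity coefficient (DSC) and Jaccard-index coefficient (JAC) are used to measure the agreement between the binary ground truth Y and the predicted segmentation P:
$$\mathrm{DSC}(Y, P) = \frac{2\,|Y \cap P|}{|Y| + |P|}, \qquad \mathrm{JAC}(Y, P) = \frac{|Y \cap P|}{|Y \cup P|}.$$
For instance segmentation, we report the aggregated Jaccard index (AJI) and the panoptic quality (PQ). The AJI is computed as
$$\mathrm{AJI} = \frac{\sum_{j=1}^{N} |Y_j \cap P_j^{*}|}{\sum_{j=1}^{N} |Y_j \cup P_j^{*}| + \sum_{k \in \mathrm{FP}} |P_k|},$$
where N is the total number of instances in Y, $P_j^{*}$ is the segment (i.e., connected region) in the predicted segmentation that has the largest overlap (in terms of JAC) with the segment $Y_j$, and FP is the set of false positive regions in P without matched mitochondria in Y. The PQ is computed as
$$\mathrm{PQ} = \frac{\sum_{(y, p) \in \mathrm{TP}} \mathrm{JAC}(y, p)}{|\mathrm{TP}| + \frac{1}{2}|\mathrm{FP}| + \frac{1}{2}|\mathrm{FN}|},$$
where true positives (TP), false positives (FP), and false negatives (FN) represent the matched pairs of segments with at least 50% overlap in JAC, the unmatched predicted segments, and the unmatched ground truth segments, respectively.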
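The class-level and instance-level measures can be illustrated on toy data. The sketch below uses the standard definitions of DSC, JAC, and panoptic quality and represents each segment as a set of pixel coordinates; the function names, the strict `> 0.5` matching rule (which keeps matches unique), and the convention of returning 1.0 for two empty inputs are assumptions for illustration.

```python
def dsc(y, p):
    """Dice similarity coefficient between two sets of pixel coordinates."""
    y, p = set(y), set(p)
    if not y and not p:
        return 1.0  # convention for two empty masks (assumption)
    return 2 * len(y & p) / (len(y) + len(p))

def jac(y, p):
    """Jaccard index between two sets of pixel coordinates."""
    y, p = set(y), set(p)
    if not y and not p:
        return 1.0  # convention for two empty masks (assumption)
    return len(y & p) / len(y | p)

def panoptic_quality(gt_instances, pred_instances):
    """PQ = sum of JAC over matched pairs / (TP + FP/2 + FN/2)."""
    matched, iou_sum, tp = set(), 0.0, 0
    for g in gt_instances:
        for k, q in enumerate(pred_instances):
            overlap = jac(g, q)
            if k not in matched and overlap > 0.5:  # strict rule, unique match
                matched.add(k)
                iou_sum += overlap
                tp += 1
                break
    fp = len(pred_instances) - tp
    fn = len(gt_instances) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 1.0

# Toy example: one well-matched instance, one miss, one false detection.
gt = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(5, 5)}]
pred = [{(0, 1), (0, 2), (0, 3)}, {(9, 9)}]
y_mask = set().union(*gt)
p_mask = set().union(*pred)
print(dsc(y_mask, p_mask), jac(y_mask, p_mask), panoptic_quality(gt, pred))
```

The example shows why the instance-level score is stricter: the masks overlap on half of their union, but the missed and spurious instances pull PQ down further.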

Criteria for Evaluating Detection
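Detection performance is measured by the F1 score computed over matched instances. By default, we use F1-75, which requires at least 75% overlap in JAC for a predicted segment to count as a match. Moreover, given the TP and FN, we also report the sensitivity (SEN) and specificity (SPE).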
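An F1 score at a given overlap threshold can be sketched as follows. The greedy best-overlap matching and the function names are assumptions, since the source does not describe the matching procedure; segments are again represented as sets of pixel coordinates.

```python
def iou(a, b):
    """Jaccard index (JAC) between two non-empty sets of pixel coordinates."""
    return len(a & b) / len(a | b)

def detection_f1(gt_instances, pred_instances, threshold=0.75):
    """F1 over instances; a match requires JAC >= threshold.

    Assumption: each ground-truth segment is greedily matched to the
    unmatched predicted segment with which it overlaps most.
    """
    matched, tp = set(), 0
    for g in gt_instances:
        best_k, best_iou = None, 0.0
        for k, q in enumerate(pred_instances):
            if k in matched:
                continue
            overlap = iou(g, q)
            if overlap > best_iou:
                best_k, best_iou = k, overlap
        if best_k is not None and best_iou >= threshold:
            matched.add(best_k)
            tp += 1
    fp = len(pred_instances) - tp   # unmatched predictions
    fn = len(gt_instances) - tp     # unmatched ground-truth segments
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one prediction overlaps its target at JAC = 0.75, one exactly.
gt = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(5, 5)}]
pred = [{(0, 1), (0, 2), (0, 3)}, {(5, 5)}]
print(detection_f1(gt, pred, threshold=0.75))
print(detection_f1(gt, pred, threshold=0.80))
```

Raising the threshold from 0.75 to 0.80 turns the first pair into one false positive plus one false negative, which is why stricter F1 variants reward more precise boundaries.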

Implementation Details
We use PyTorch (Paszke et al., 2019) on a workstation with 64 GB RAM and one GTX 2080Ti GPU to implement our experiments. The trade-off parameter λ is fixed at 0.5 so that the training losses of the first stage and the second stage have similar magnitudes. Thus, the two stages have the same importance. The model is optimized by Adam (Kingma and Ba, 2014), and the weight decay is set to 10⁻⁵. The initial learning rate is set to 5 × 10⁻⁴, and a step-wise learning rate decay scheme is employed. For the EPFL dataset, the step and decay rate are set to 30 and 0.9, respectively; for the Kasthuri++ dataset, the step and decay rate are set to 60 and 0.9, respectively. Our network is trained using randomly cropped images of size 512×512 and a batch size of 3 for both datasets. Images synthesized through flipping, Gaussian blur, median blur, and random rotations of ±90°, together with the corresponding label maps, are used as data augmentation for the training data. At inference time, we apply test-time augmentation, including flipping and rotation of ±90°, to further improve the performance.
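The step-wise learning rate decay can be written in closed form. This sketch assumes the step is counted in epochs (the source does not state the unit) and uses a hypothetical function name; it mirrors the behavior of a standard step scheduler rather than reproducing the authors' training script.

```python
def step_lr(epoch, base_lr=5e-4, step=30, gamma=0.9):
    """Learning rate after `epoch` epochs of step-wise decay.

    Assumption: the rate is multiplied by `gamma` once every `step` epochs.
    """
    return base_lr * gamma ** (epoch // step)

# EPFL setting (step 30) and Kasthuri++ setting (step 60), decay rate 0.9.
for epoch in (0, 29, 30, 60, 120):
    print(epoch, step_lr(epoch, step=30), step_lr(epoch, step=60))
```

With the longer step used for Kasthuri++, the rate decays half as often per epoch, so it stays closer to the initial value for longer.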

Segmentation Performance
We compare our method with both 2D methods and 3D methods, including traditional methods based on handcrafted features (Lucchi et al., 2013; Cetina et al., 2018; Peng and Yuan, 2020) and deep-learning-based methods (e.g., Xiao et al., 2018; Casser et al., 2020), on the EPFL dataset and the Kasthuri++ dataset. Since our HED-Net takes a 5-slice input, which is usually called a 2.5D method, we also compare our method with a 2D U-Net (Ronneberger et al., 2015) that takes five slices as input.

Visual Comparison
Figure 5 provides visual comparisons of the proposed method with two strong baselines, i.e., 2D U-Net (Ronneberger et al., 2015) and 3D U-Net (Çiçek et al., 2016), and a state-of-the-art 3D model, i.e., HIVE-Net (Yuan et al., 2021), on examples from the EPFL dataset and the Kasthuri++ dataset. Comparing the results in Figures 5B,C,E, we can see that the proposed method clearly shows fewer false detections and fewer missed detections than 2D U-Net and 3D U-Net. As shown in Figures 5D,E, the proposed 2D model shows comparable visual performance with the 3D model HIVE-Net but with slightly better shape integrity.

Segmentation Performance on EPFL Dataset
Table 1 presents the quantitative comparison of our method with 3D methods, 2.5D methods, and 2D methods for mitochondria segmentation on the EPFL dataset. While the methods in Lucchi et al. (2013), Cetina et al. (2018), and Peng and Yuan (2020) are traditional methods with handcrafted features, the other methods are deep-learning-based methods, which show better results than traditional methods. Compared to 2D methods that take a single slice as input, 2.5D methods take multiple slices as input. With more slices as input, the U-Net (5-slice input) outperforms the U-Net (1-slice input) by 1.4% in DSC. The 3D U-Net and the methods in Cheng and Varshney (2017), Xiao et al. (2018), and Yuan et al. (2021) directly segment 3D volumes and generally show better results than 2D methods and 2.5D methods. The HIVE-Net shows the best results among the 3D models. However, 3D models usually suffer from high computational complexity.
From Table 1, we can see that the proposed HED-Net not only shows the best segmentation performance among all the 2D models and 2.5D models, but also shows competitive performance in comparison with 3D models. In particular, for instance segmentation, our model outperforms the HIVE-Net by 0.7 and 1.1% in terms of AJI and PQ, respectively. For binary segmentation, the performance of our method is only slightly lower (≤0.2%) than that of the HIVE-Net. When taking a single slice as the input, our HED-Net (1-slice) outperforms most of the compared methods except for Xiao et al. (2018) and HIVE-Net (Yuan et al., 2021), and shows similar performance to Casser et al. (2020), who used a median filter along the z-dimension (Z-filtering) as post-processing to capture 3D information.

Segmentation Performance on Kasthuri++ Dataset
Table 2 presents the quantitative comparison of the performance of different methods for mitochondria segmentation on the Kasthuri++ dataset. The proposed method shows significant improvements over the strong baseline models, i.e., 2D U-Net (5-slice) and 3D U-Net, and obtains 96.1% in DSC for binary segmentation, and 91.6% in AJI and 86.6% in PQ for instance segmentation, outperforming the 3D U-Net by 1.8% in DSC, 3.4% in JAC, 3.7% in AJI, and 5.1% in PQ. The proposed model outperforms the method in Xiao et al. (2018) by 0.2% in DSC, 0.4% in JAC, 0.6% in AJI, and 1.5% in PQ. The proposed model shows competitive performance in comparison with the HIVE-Net but is flexible to process both 2D and 3D data.
FIGURE 6 | Detection performance in F1 with different overlap thresholds for matched instances. As the overlap threshold increases, the F1 of all methods decreases, but our method remains the most stable.

Detection Performance
The detection performance is also crucial for evaluating the proposed method. To this end, we compare our method with other methods in terms of F1-75, SPE, and SEN. Table 3 summarizes the quantitative comparison results on both the EPFL dataset and the Kasthuri++ dataset. Overall, our method shows the best performance on both of the two datasets and outperforms the 2D U-Net by 11.0 and 6.5% in F1-75 on the EPFL dataset and the Kasthuri++ dataset, respectively. Moreover, the proposed 2D model outperforms the state-of-the-art HIVE-Net by 1.9 and 1.2% in F1-75 on the EPFL dataset and the Kasthuri++ dataset, respectively. Significantly, our method shows higher specificity, which indicates that our model has strong ability to control false detection. These results demonstrate the effectiveness of our method. In addition to F1-75, we compare our method with other methods in terms of F1 values that use other overlapping requirements. The comparison results are illustrated in Figure 6. Our model shows superior performance over other methods, especially in terms of F1-80 and F1-85, which are very strict detection measures. Therefore, these results also indicate that the segmentation by our proposed model can match the ground truth segmentation better.

Impact of the Number of Input Slices
We first study the impact of the number of input slices on segmentation performance, as reported in Table 4. Comparing the results on the four measures, we draw three conclusions. First, using multi-slice input improves the performance of both the U-Net and our HED-Net, although the gain becomes marginal as the number of input slices increases. Second, compared to the U-Net, the proposed HED-Net obtains a smaller performance gain from multi-slice input. Third, 5-slice input is the best choice for our model. Therefore, by default, we use five neighboring slices as input for our model in the following experiments.

Impact of the Tradeoff Parameter λ
We further investigate the impact of the hyper-parameter λ on the segmentation performance of the proposed HED-Net. The hyper-parameter λ trades off the importance of the two stages of the HED-Net. Since the first stage of the HED-Net contains two decoders and the second stage contains only one decoder, we set λ as 0.5 by default to have equal importance on the two stages. In this section, we further test the segmentation performance with other choices of λ, i.e., 0.1, 0.25, and 1.0. The segmentation results are summarized in Table 5. We can see that balanced importance of the two stages with λ=0.5 results in a better performance.

Ablation Study
We conduct ablation studies to identify the effectiveness of the introduced two-stage network architecture, the impact of using micro U-Nets as building blocks, and the superiority of the proposed soft label-decomposition strategy. Table 6 reports the performance of the proposed HED-Net under different network settings. The last row is our complete model, which uses the two-stage network architecture with micro U-Nets as the basic building blocks and utilizes the soft label-decomposition strategy to achieve subcategory-aware learning. As can be seen, the models using the one-stage architecture are overall poorer than the one using the two-stage architecture. Notably, when using standard 3×3 convolution as the basic building block, the two-stage network outperforms the one-stage network by a large margin for both binary segmentation and instance segmentation. Moreover, the use of micro U-Nets in the two-stage HED-Net results in a performance gain of 0.6% in DSC, 1.1% in JAC, 1.1% in AJI, and 4.6% in PQ, which indicates its effectiveness. Furthermore, with the soft label-decomposition, we obtain a performance gain of 0.6% in DSC, 1.0% in JAC, 1.0% in AJI, and 1.3% in PQ. Compared to the HED-Net using hard label-decomposition, the HED-Net using soft label-decomposition shows superior performance. A visual comparison of the complete HED-Net with the HED-Net without label-decomposition and the U-Net is shown in Figure 7. The results of our HED-Net show far fewer false detections and more accurate boundary delineations. Figure 8 provides a further visual comparison of the predictions of the circle decoder and ellipse decoder in the first stage and the fusion decoder in the second stage. The segmentation results of the HED-Net with no label-decomposition, hard label-decomposition, and soft label-decomposition are illustrated. The results of the circle decoder and ellipse decoder of the HED-Net are complementary. Compared to the HED-Net with no label-decomposition and with hard label-decomposition, the HED-Net with soft label-decomposition shows fewer false positives and a stronger ability to capture mitochondria of large ovality.

The Impact of the Ovality Threshold T
We have also investigated the impact of choosing different ovality thresholds T for the HED-Net with hard and soft label-decomposition. Given the ovality distributions in Figure 2, we choose the median of p, i.e., T = 1.6, as the default setting. In this section, we test other choices of T, i.e., the first quartile (T = 1.3) and the third quartile (T = 2.1). The results on the EPFL dataset are reported in Table 7. We can see that the median of the ovality distribution is a better choice for the ovality threshold than the first quartile and the third quartile. The HED-Net with the soft label-decomposition consistently outperforms the HED-Net with the hard label-decomposition.
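Candidate thresholds of this kind can be read off the ovality distribution of the training instances. The sketch below uses Python's standard `statistics.quantiles`; the ovality values are made-up toy numbers chosen only so that the quartiles land on the thresholds quoted above, and the function name is hypothetical.

```python
import statistics

def ovality_quartiles(ovalities):
    """Return (first quartile, median, third quartile) of the ovalities."""
    q1, q2, q3 = statistics.quantiles(ovalities, n=4, method="inclusive")
    return q1, q2, q3

# Hypothetical per-instance ovalities (not taken from either dataset).
toy_ovalities = [1.1, 1.3, 1.6, 2.1, 4.0]
q1, median, q3 = ovality_quartiles(toy_ovalities)
print(q1, median, q3)
```

Choosing the median as T splits the training instances into two subcategories of equal size, which is one plausible reason it works better than either outer quartile.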

The Inference Time
One of the crucial aspects of deploying a segmentation network is the inference time. In this section, we compare the inference time of our HED-Net with other 2D/2.5D/3D methods, i.e., 2D U-Net (1-slice), 2D U-Net (5-slice), 3D U-Net, and HIVE-Net. More specifically, we calculate the total inference time on the testing stack of the EPFL data, which contains 165 consecutive images of size 768 × 1,024. While the 2D and 2.5D methods, i.e., 2D U-Net (1-slice), 2D U-Net (5-slice), and our HED-Net, conduct slice-by-slice segmentation, the 3D U-Net and HIVE-Net segment all the images in one pass. The comparative results are illustrated in Figure 9. It can be seen that our method takes a much shorter inference time than the compared methods. Notably, our proposed HED-Net takes 25.7 s for the inference of all the testing images, while the top-performing method HIVE-Net takes 133.5 s.
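The reported totals can be turned into per-slice cost and relative speed with simple arithmetic. The figures below come from the text; the helper names are hypothetical.

```python
def per_slice_seconds(total_seconds, num_slices):
    """Average inference time per slice."""
    return total_seconds / num_slices

def speedup(baseline_seconds, ours_seconds):
    """How many times faster `ours` is than the baseline."""
    return baseline_seconds / ours_seconds

hed_total, hive_total, num_slices = 25.7, 133.5, 165
print(round(per_slice_seconds(hed_total, num_slices), 3))  # seconds per slice
print(round(speedup(hive_total, hed_total), 2))            # relative speed
```

On these numbers the HED-Net spends roughly 0.16 s per 768 × 1,024 slice and is about five times faster than the HIVE-Net on the same stack.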

CONCLUSIONS
In this paper, we have proposed a hierarchical encoder-decoder network for mitochondria segmentation from EM images. To address the challenge of the varied shapes of mitochondria and complex backgrounds, we followed an easy-to-hard strategy. Specifically, we introduced a novel soft label-decomposition strategy, which resulted in additional subcategory-aware supervision for our model. The proposed network utilized a three-level nested U-shape architecture to capture rich contextual information and employed general shape information in manual labels to reduce missed detections and false detections. The proposed method has been evaluated on two challenging benchmarks. Comparisons with strong baseline models and top-performing 2D/3D methods showed that our method achieved state-of-the-art results. Notably, the proposed model showed superior results for instance segmentation and detection tasks. Ablation studies further demonstrated the effectiveness of the proposed model. In future work, we will utilize the size and symmetry attributes of the targets as side information to further explore the global information in the manual label maps. Moreover, we will employ more advanced image synthesis methods (Peng and Wang, 2021) to reduce the amount of labeled data required for model training.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
ZL and YW mainly implemented the method, conducted the experiments, and contributed to the writing of the paper. SL helped perform the analysis with constructive discussions. JP supervised the whole process, including the development of the concept, writing, revision, and other general advice. All authors approved it for publication.