Edited by: George Azzopardi, University of Groningen, Netherlands
Reviewed by: Antonio Greco, University of Salerno, Italy; Laura Fernández-Robles, Universidad de León, Spain
This article was submitted to Sensor Fusion and Machine Perception, a section of the journal Frontiers in Robotics and AI
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Computed Tomography (CT) is an imaging procedure that combines many X-ray measurements taken from different angles. The segmentation of areas in CT images provides valuable aid to physicians and radiologists in reaching a better patient diagnosis. CT scans of a body torso usually include several neighboring internal organs. Deep learning has become the state of the art in medical image segmentation. For such techniques to perform a successful segmentation, it is of great importance that the network learns to focus on the organ of interest and its surrounding structures, and that the network can detect target regions of different sizes. In this paper, we propose the extension of a popular deep learning methodology, Convolutional Neural Networks (CNN), by including deep supervision and attention gates. Our experimental evaluation shows that the inclusion of attention and deep supervision results in a consistent improvement of the tumor prediction accuracy across the different datasets and training sizes, while adding minimal computational overhead.
The daily work of a radiologist consists of visually analyzing multiple anatomical structures in medical images. Subtle variations in size, shape, or structure may be a sign of disease and can help to confirm or discard a particular diagnosis. However, manual measurements are time-consuming and could result in inter-operator and intra-operator variability (Sharma and Aggarwal,
Deep learning techniques, especially convolutional neural networks (CNN), have become the state-of-the-art for medical image segmentation. Fully convolutional networks (FCNs) (Long et al.,
Despite the success of deep CNN techniques, there are difficulties inherent to their applicability. First, large datasets are needed for the successful training of deep CNN models; in medical imaging, this may be problematic due to the cost of acquisition, data anonymization requirements, etc. Second, volumetric medical image data require vast computational resources: even when using graphics processing units (GPUs), the training process is very time-consuming. Therefore, every new proposal should take into account not only the performance but also the computational load.
Current CT-based clinical abdominal diagnosis relies on the comprehensive analysis of groups of organs, and the quantitative measures of volumes, shapes, and others, which are usually indicators of disorders. Computer-aided diagnosis and medical image analysis traditionally focus on organ or disease based applications, i.e., multi-organ segmentation from abdominal CT (Jimenez-del-Toro et al.,
There are two significant challenges in automatic abdominal organ segmentation from CT images (Hu et al.,
The task of detecting cancerous tissue in an abdominal organ is even more difficult because of the large variability of tumors in size, position, and morphology structure. Results are quite impressive when the focus is on detecting organs; an example of this is (Isensee et al.,
On the other hand, all the organs have a typical shape, structure, and relative position in the abdomen. The model could then benefit from an attentional mechanism consolidated in the network architecture, which could help to focus specifically on the organ of interest. For this purpose, we incorporated the idea of attention gates (AG) (Oktay et al.,
Many research papers have incorporated attention into artificial CNN visual models for image captioning (Xu et al.,
Deep supervision was first introduced by Lee et al. (
In the present work, we propose a methodology for a more reliable organ and tumor segmentation from computed tomography scans. The contribution of this work is three-fold:
A methodology that achieves state-of-the-art performance on several organ and tumor segmentation tasks; of special interest is the increase obtained in the precision of tumor segmentation.
A visualization of the feature maps from our CNN architecture to provide some insight into what is the focus of attention in the different parts of the model for better tumor detection.
Third and not least, we provide a novel and extended comparison of CNN architectures for organ-tumor segmentation of different organs from abdominal CT scans.
We provide the details of the proposed methodology in this section. First, we explain the preprocessing and normalization of the medical image data. Second, we provide a detailed description of the model architecture, the attention gates, and the deep supervision layers. The loss function, the optimizer, and other specifics of interest are detailed in the following subsection, which also describes the patch sampling and data augmentation techniques utilized to prevent overfitting. The last part briefly outlines inference and how the image patches are stitched back together. We provide a publicly available implementation of our methodology using PyTorch at:
CT scans might be captured by different scanners in different medical clinics with nonidentical acquisition protocols; therefore the data preprocessing step is crucial to normalize the data in a way that enables the convolutional network to learn suitable and meaningful features properly. We preprocess the CT scan images as follows (Isensee et al.,
All patient images are resampled to the median voxel spacing of the dataset, using third-order spline interpolation for the image data and nearest-neighbor interpolation for the segmentation masks.
The dataset is normalized by clipping to the [0.5, 99.5] percentiles of the intensity values occurring within the segmentation masks.
Z-score normalization is applied based on the mean and standard deviation of all intensity values occurring within the segmentation masks.
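The last two steps above (percentile clipping and z-score normalization) can be sketched as follows. This is a simplified illustration on a flat list of intensity values using a nearest-rank percentile; the actual implementation operates on the voxels occurring within the segmentation masks of the whole dataset.

```python
import statistics

def clip_and_normalize(values, lo_pct=0.5, hi_pct=99.5):
    """Clip intensities to the [0.5, 99.5] percentiles, then z-score normalize."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(p):
        # nearest-rank percentile (a simplification of interpolated percentiles)
        idx = min(n - 1, max(0, round(p / 100.0 * (n - 1))))
        return ordered[idx]

    lo, hi = percentile(lo_pct), percentile(hi_pct)
    clipped = [min(max(v, lo), hi) for v in values]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)
    # z-score: zero mean, unit standard deviation
    return [(v - mean) / std for v in clipped]
```

The resulting intensities have zero mean and unit variance, which stabilizes optimization across scans acquired with different protocols.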
Because of memory restrictions, the model was trained on 3D image patches. All the models were trained on an 11 GB GPU. A base configuration with an input patch size of 128 × 128 × 128 and a batch size of 2 was chosen to fit our hardware setup. The model then automatically adapts these parameters so that they reflect the median image size of each dataset. We consider two different approaches:
Deep learning techniques, especially convolutional neural networks, are nowadays the main focus of research in medical image segmentation and outperform most other techniques. A very popular convolutional neural network architecture in medical imaging is the encoder-decoder structure with skip connections at each image resolution level. The basic principle was first presented by Ronneberger et al. (
We follow encoder-decoder architecture choices applied to each dataset by Isensee et al. (
In addition to original encoder-decoder network architecture, we add attention gates (Oktay et al.,
A block diagram of the segmentation model with attention gates and deep supervision.
Attention coefficients, α_i ∈ [0, 1], identify salient image regions and prune feature responses to preserve only the activations relevant to the organ of interest. Following Oktay et al., additive attention is used to obtain the gating coefficient:

q_att = ψ^T σ1(W_x^T x_i + W_g^T g_i + b_g) + b_ψ,

α_i = σ2(q_att(x_i, g_i)),

where σ1 denotes the ReLU activation and σ2(z) = 1/(1 + exp(−z)) is the sigmoid activation; the linear transformations W_x, W_g, and ψ are computed as channel-wise 1 × 1 × 1 convolutions, and b_g, b_ψ are bias terms.
A block diagram of additive attention gate (AG) (Oktay et al.,
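To make the mechanism concrete, the following is a minimal NumPy sketch of an additive attention gate in the spirit of Oktay et al., simplified to a single intermediate channel and with per-channel weight vectors standing in for the 1 × 1 × 1 convolutions of the real model. All names and shapes here are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi, b_g=0.0, b_psi=0.0):
    """Additive attention gate (simplified 2D, single intermediate channel).

    x   : (H, W, Cx) skip-connection features
    g   : (H, W, Cg) gating features, assumed already resampled to x's size
    W_x : (Cx,), W_g : (Cg,), psi : scalar -- stand-ins for 1x1 convolutions
    Returns the gated skip features x * alpha and the attention map alpha.
    """
    # q_att = psi^T * ReLU(W_x x + W_g g + b_g) + b_psi
    q = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)   # ReLU, shape (H, W)
    alpha = sigmoid(psi * q + b_psi)               # attention coefficients in (0, 1)
    return x * alpha[..., None], alpha
```

With all weights at zero, the gate passes half of every activation (alpha = 0.5); training shifts alpha toward one inside the organ of interest and toward zero elsewhere.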
Deep supervision (Kayalibay et al.,
These additional segmentation maps do not primarily serve to further refine the final segmentation map created at the last layer of the model, because the context information is already provided by the long skip connections. Instead, the secondary segmentation maps speed up convergence by “encouraging” earlier layers of the network to produce better segmentation results. A similar principle has been used by Kayalibay et al. (
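The combination of secondary maps can be illustrated with a small 2-D, single-channel NumPy sketch: auxiliary logits from deeper decoder levels are upsampled to the finest resolution and summed into the final map. Nearest-neighbor upsampling is used here purely for illustration; the shapes and the helper name are assumptions, not the paper's code.

```python
import numpy as np

def deep_supervised_output(level_logits):
    """Sum auxiliary segmentation logits from several decoder levels.

    level_logits[0] is the full-resolution map; deeper maps are smaller by
    integer factors. Nearest-neighbor upsampling is emulated with np.kron
    (each coarse value is repeated over a factor x factor block).
    """
    target_h = level_logits[0].shape[0]
    total = np.zeros_like(level_logits[0], dtype=float)
    for logits in level_logits:
        factor = target_h // logits.shape[0]
        total += np.kron(logits, np.ones((factor, factor)))
    return total
```

During training, a loss can be attached to each level's map so that gradients reach the early layers directly, which is the "encouragement" described above.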
Unless stated otherwise, all models are trained with five-fold cross-validation. The network is trained with a combination of the dice (5) and cross-entropy (6) loss functions (4):

L_total = L_dice + L_CE,   (4)

L_dice = −(2 / |K|) Σ_{k∈K} [ Σ_i u_i^k v_i^k / (Σ_i u_i^k + Σ_i v_i^k) ],   (5)

L_CE = −Σ_i Σ_{k∈K} v_i^k log u_i^k,   (6)

where u is the softmax output of the network, v is the one-hot encoding of the ground-truth segmentation map, i runs over the voxels of a training patch, and k runs over the classes in K.
The dice loss is computed for each class and each sample in the batch and averaged over the batch and over all classes. We use the Adam optimizer with an initial learning rate of 3 × 10^−5 and
Gradient updates are computed by standard backpropagation using a small batch size of 2. Initial weight values are drawn from a normal distribution (He et al.,
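The batched soft dice term described above can be sketched in NumPy as follows. This is a hedged, framework-agnostic illustration written as 1 − dice (one common variant); the actual implementation uses PyTorch.

```python
import numpy as np

def soft_dice_loss(probs, onehot, eps=1e-8):
    """Soft dice loss over a batch.

    probs  : (B, K, H, W[, D]) softmax outputs of the network
    onehot : same shape, one-hot encoded ground truth
    The dice coefficient is computed per sample and per class over the
    spatial axes, then averaged over the batch and all classes.
    """
    axes = tuple(range(2, probs.ndim))             # spatial axes only
    intersect = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    dice = 2.0 * intersect / (denom + eps)         # (B, K) dice coefficients
    return 1.0 - dice.mean()                       # average over batch and classes
```

A perfect prediction yields a loss near zero, while a prediction with no overlap yields a loss near one.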
Training of the deep convolutional neural networks from limited training data suffers from overfitting. To minimize this problem, we apply a large variety of data augmentation techniques: random rotations, random scaling, random elastic deformations, gamma correction augmentation, and mirroring. All the augmentation techniques are applied on the fly during training. Data augmentation is realized with a framework which is publicly available at:
The patches are generated randomly on the fly during training, but we enforce that at least one of the samples in a batch contains at least one foreground class, to enhance the stability of the network training.
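The foreground-forcing rule can be sketched as follows. Here `contains_foreground` is a hypothetical oracle and the candidates are abstract patch descriptors; in practice, the patch center would be drawn from precomputed foreground voxel coordinates.

```python
import random

def sample_batch(candidates, contains_foreground, batch_size=2, max_tries=50):
    """Draw a random batch of patch candidates, resampling one slot until at
    least one patch in the batch contains a foreground class."""
    batch = [random.choice(candidates) for _ in range(batch_size)]
    if not any(contains_foreground(c) for c in batch):
        # replace the first slot with a foreground-containing patch
        for _ in range(max_tries):
            c = random.choice(candidates)
            if contains_foreground(c):
                batch[0] = c
                break
    return batch
```

This keeps the loss signal from degenerating on sparse labels, where a purely random patch is likely to contain only background.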
As in training, inference of the final segmentation mask is also made patch-wise. The output accuracy is known to decrease toward the borders of the predicted patch. Therefore, we overlap the patches by half the patch size and, when aggregating predictions across patches, weigh voxels close to the center higher than those close to the border. The weights are generated so that the center position in a patch is equal to one and the boundary pixels are set to zero; in between, the values follow a Gaussian with sigma equal to one-eighth of the patch size. To further increase stability, we use test-time data augmentation by mirroring all patches along all axes.
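The center weighting described above can be sketched per axis as a normalized Gaussian with sigma equal to one-eighth of the patch size; the full 3-D weight map is then the outer product of the per-axis vectors. Note that a Gaussian never reaches exactly zero at the border, but at this sigma the boundary weights are negligibly small.

```python
import math

def gaussian_weights(patch_size):
    """1-D importance weights for patch aggregation: 1 at the center,
    near-zero at the borders, Gaussian with sigma = patch_size / 8."""
    sigma = patch_size / 8.0
    center = (patch_size - 1) / 2.0
    w = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
         for i in range(patch_size)]
    peak = max(w)
    # normalize so the center weight equals one
    return [v / peak for v in w]
```

When predictions from overlapping patches are summed with these weights (and divided by the summed weights), voxels predicted near a patch border contribute far less than voxels predicted near a patch center.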
In order to show the validity of the proposed segmentation method, we evaluate the methodology on challenging abdominal CT segmentation problems. We assess the detection of cancerous tissue inside three different organs: pancreas, liver, and kidney.
The experiments are evaluated on three different CT abdominal datasets featuring organ and tumor segmentation classes: kidney, liver, and pancreas. Each dataset brings slightly different challenges for the model. More information about each dataset, the training setups, and the concrete network topologies is given below (see also
An overview of image shapes, training setups, and network topologies for each task.
Task | Parameter | Full res. | Low res. |
Kidney | Num. images training | 168 | 168 |
| Num. images validation | 42 | 42 |
| Median patient shape | 511 × 511 × 136 | 247 × 247 × 127 |
| Input patch size | 160 × 160 × 48 | 128 × 128 × 80 |
| Num. downsampling per axis | 5, 5, 3 | 5, 5, 4 |
| Batch size | 2 | 2 |
Liver | Num. images training | 105 | 105 |
| Num. images validation | 26 | 26 |
| Median patient shape | 482 × 512 × 512 | 189 × 201 × 201 |
| Input patch size | 96 × 128 × 128 | 96 × 128 × 128 |
| Num. downsampling per axis | 5, 5, 5 | 5, 5, 5 |
| Batch size | 2 | 2 |
Pancreas | Num. images training | 224 | 224 |
| Num. images validation | 57 | 57 |
| Median patient shape | 96 × 512 × 512 | 88 × 299 × 299 |
| Input patch size | 40 × 192 × 160 | 64 × 128 × 128 |
| Num. downsampling per axis | 3, 5, 5 | 3, 5, 5 |
| Batch size | 2 | 2 |
The dataset features a collection of multi-phase CT imaging, segmentation masks, and comprehensive clinical outcomes for 300 patients who underwent nephrectomy for kidney tumors at the University of Minnesota Medical Center between 2010 and 2018 (Heller et al.,
We perform five-fold cross-validation during training: 42 images are used for validation and 168 images for training. The median patient shape after resampling is 511 × 511 × 136 pixels in the case of high resolution and 247 × 247 × 127 pixels in low resolution. According to the median shapes, we use 5, 5, and 3 downsamplings for the respective image axes in high resolution and 5, 5, and 4 downsamplings in low resolution. The patch size in the case of high resolution is 160 × 160 × 48 pixels and 128 × 128 × 80 pixels for low resolution.
The dataset features a collection of 201 portal-venous-phase CT scans and segmentation masks for liver and tumor captured at IRCAD Hôpitaux Universitaires. Sixty-five percent (131) of these images have been released publicly as the training set for the 2018 MICCAI Medical Decathlon Challenge
We perform five-fold cross-validation during training: 26 images are used for validation and 105 images for training. The median patient shape after resampling is 482 × 512 × 512 pixels in the case of high resolution and 189 × 201 × 201 pixels in low resolution. According to the median shapes, we downsample five times along each image axis in both high and low resolution. The same patch size of 96 × 128 × 128 pixels is used for both high and low resolution.
The dataset features a collection of 421 portal-venous-phase CT imaging and segmentation masks for pancreas and tumor captured at Memorial Sloan Kettering Cancer Center. Seventy percent (282) of these images have been released publicly as the training set for the 2018 MICCAI Medical Decathlon Challenge
We perform five-fold cross-validation during training: 57 images are used for validation and 224 images for training. The median patient shape after resampling is 96 × 512 × 512 pixels in the case of high resolution and 88 × 299 × 299 pixels in low resolution. According to the median shapes, we use 3, 5, and 5 downsamplings for the respective image axes in both high and low resolution. The patch size in the case of high resolution is 40 × 192 × 160 pixels and 64 × 128 × 128 pixels for low resolution.
The network design allows us to visualize meaningful activation maps from the attention gates as well as from the deep supervision layers. These visualizations offer an interesting insight into the functionality of the convolutional network. Understanding how the model represents the input image at the intermediate layers can help to improve the model and to uncover at least part of the black-box behavior for which neural networks are known.
The low-resolution VNet was chosen to study the attention coefficients generated at different levels of a network trained on the Medical Decathlon Pancreas dataset.
Examples of attention maps (AM) obtained from attention gates in the three topmost levels of the low-resolution VNet (from left to right: full spatial resolution, downsampling of two and four).
The attention coefficients obtained from two randomly chosen validation images from each studied dataset are visualized in
Visualization of attention maps (AM) in low-resolution for VNet and two randomly chosen patient images from the validation set of each studied dataset. For each patient, the left picture shows the attention from the topmost layer (with the highest spatial resolution), and the right picture shows the attention from the second topmost layer.
The low-resolution VNet was also chosen to study the secondary segmentation maps created at lower levels of the network trained on the Medical Decathlon Pancreas dataset. The segmentation maps are shown in
The secondary segmentation maps (SSM) obtained from deep supervision layers of low-resolution VNet for one randomly chosen patient image from the validation set of the Medical Decathlon Pancreas dataset.
The deeper segmentation maps in the organ label channel are more challenging to interpret. The second-level map seems to be inverted, placing the pancreas in a darker part of the input image. On the other hand, the third-level map highlights all the organs present in the image. After summation of these two maps, we obtain the desired highlight of the pancreas. Overall, all the secondary segmentation maps have a relevant impact on the final result.
We use the following metrics to evaluate the final segmentation in the subsequent sections: precision, recall, and dice. Each metric is briefly explained below.
In the context of segmentation, precision and recall compare the results of the classifier under test with the ground-truth segmentation through a combination of the true positives (TP), false positives (FP), and false negatives (FN):

Precision = 100 · TP / (TP + FP),   Recall = 100 · TP / (TP + FN).
This way both the precision and recall are normalized in the range 〈0, 100〉, higher values indicating better performance.
When applied to a binary segmentation task, the dice score evaluates the degree of overlap between the predicted segmentation mask and the reference segmentation mask. Given binary masks U and V, the dice score is

Dice(U, V) = 100 · 2|U ∩ V| / (|U| + |V|).
In this variant, the dice score lays in the range 〈0, 100〉, higher values indicating better performance.
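The three scores can be sketched for flat binary masks as follows, scaled to the range 〈0, 100〉 as in the text. The zero-denominator convention (returning 0) is an assumption for the sketch.

```python
def segmentation_scores(pred, ref):
    """Precision, recall, and dice for binary masks given as flat 0/1 sequences."""
    tp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 1)
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    # dice = 2|U ∩ V| / (|U| + |V|) expressed through TP, FP, FN
    dice = 100.0 * 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return precision, recall, dice
```

Note that the dice score is the harmonic mean of precision and recall, so it penalizes both over- and under-segmentation.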
Next, we present a comprehensive study of the organ and tumor segmentation tasks on the three different abdominal CT datasets. For each dataset, four model variants were trained to show the impact of the different model architecture choices. The UNet utilizes max-pooling and upsampling layers, while the VNet is fully convolutional. Each architecture variant was trained on two different image resolutions: full resolution and low resolution. For more details about the model variants, please refer to section 2.2. Moreover, we provide assembly results from the respective full- and low-resolution models: the softmax output maps of the two variants are averaged, and only then is the final segmentation map created.
Kidney Tumor Challenge 2019.
Model | Res. | Kidney precision | Kidney recall | Kidney dice | Tumor precision | Tumor recall | Tumor dice |
UNet | Low res. | 94.96 ± 0.02 | 96.22 ± 0.08 | 95.50 ± 0.01 | 81.51 ± 2.30 | 82.62 ± 3.85 | 79.27 ± 0.30 |
Full res. | 95.55 ± 0.75 | 97.08 ± 1.21 | 96.21 ± 0.62 | 78.83 ± 5.21 | 81.44 ± 4.63 | 76.70 ± 2.46 | |
Assembly | 96.22 ± 1.32 | 97.11 ± 1.87 | 96.25 ± 1.12 | 83.88 ± 3.01 | 81.50 ± 6.23 | 78.68 ± 5.93 | |
VNet | Low res. | 94.79 ± 0.78 | 95.07 ± 1.42 | 94.63 ± 0.88 | 77.85 ± 3.43 | 78.51 ± 2.79 | 74.12 ± 2.66 |
Full res. | 96.01 ± 0.71 | 96.15 ± 1.19 | 95.93 ± 0.54 | 78.77 ± 3.60 | 79.72 ± 2.57 | 75.43 ± 1.59 | |
Assembly | 96.54 ± 1.06 | 96.63 ± 1.35 | 96.43 ± 1.06 | 82.71 ± 2.80 | 83.39 ± 8.21 | 79.94 ± 5.33 |
Medical Decathlon Challenge 2018—Task03-Liver.
Model | Res. | Liver precision | Liver recall | Liver dice | Tumor precision | Tumor recall | Tumor dice |
UNet | Low res. | 95.01 ± 0.92 | 95.52 ± 1.38 | 94.91 ± 1.57 | 63.65 ± 4.92 | 58.13 ± 7.66 | 53.27 ± 4.57 |
Full res. | 95.39 ± 1.03 | 96.28 ± 1.09 | 95.80 ± 1.16 | 58.24 ± 7.23 | 76.39 ± 9.51 | 58.87 ± 3.01 | |
Assembly | 95.95 ± 0.70 | 96.66 ± 1.68 | 96.28 ± 1.01 | 63.74 ± 9.51 | 72.86 ± 10.1 | 60.29 ± 3.85 | |
VNet | Low res. | 94.96 ± 0.87 | 95.19 ± 1.75 | 94.54 ± 1.97 | 65.17 ± 5.69 | 59.13 ± 11.5 | 54.72 ± 6.11 |
Full res. | 94.39 ± 1.23 | 95.59 ± 1.03 | 94.86 ± 1.25 | 61.12 ± 8.33 | 70.34 ± 9.36 | 57.74 ± 2.20 | |
Assembly | 95.57 ± 0.65 | 95.80 ± 1.36 | 95.74 ± 0.89 | 73.42 ± 5.76 | 67.41 ± 13.0 | 64.70 ± 3.08 |
Medical Decathlon Challenge 2018—Task07-Pancreas.
Model | Res. | Pancreas precision | Pancreas recall | Pancreas dice | Tumor precision | Tumor recall | Tumor dice |
UNet | Low res. | 80.39 ± 1.83 | 83.70 ± 2.02 | 80.96 ± 2.33 | 62.18 ± 3.35 | 58.12 ± 6.12 | 54.66 ± 4.54 |
Full res. | 80.88 ± 1.66 | 83.77 ± 0.59 | 81.15 ± 0.43 | 60.86 ± 1.41 | 54.36 ± 3.76 | 51.66 ± 4.70 | |
Assembly | 81.21 ± 0.62 | 84.51 ± 1.87 | 81.81 ± 0.98 | 62.98 ± 3.74 | 55.84 ± 1.42 | 52.68 ± 1.89 | |
VNet | Low res. | 79.36 ± 2.14 | 82.24 ± 1.71 | 79.62 ± 1.22 | 60.53 ± 2.72 | 55.19 ± 2.85 | 52.56 ± 2.89 |
Full res. | 79.92 ± 1.05 | 82.73 ± 1.37 | 80.09 ± 0.95 | 64.46 ± 5.23 | 51.30 ± 3.56 | 50.14 ± 4.14 | |
Assembly | 80.61 ± 0.37 | 84.10 ± 1.45 | 81.22 ± 0.64 | 64.62 ± 3.29 | 54.39 ± 1.26 | 52.99 ± 2.05 |
Due to the high inter-patient variability of position, size, and morphological structure, the tumor label segmentation was less successful than the organ segmentation. We can see lower score values and also more significant variability between the folds. The variability is especially high for the liver-tumor label, where the lesions are usually divided into many small occurrences, and missing some of them means a significant change in the segmentation scores. The model could benefit from some postprocessing, which may help to discard lesions detected outside the liver organ, as suggested in Bilic et al. (
Generally, the performance of the UNet and the fully convolutional VNet is comparable, but we observe slightly better scores achieved by the VNet on the MDC Liver and KiTS datasets, while the trend is reversed on the MDC Pancreas dataset, where the UNet provided better results than the VNet. Still, when it comes to the assembly results, the VNet benefits from its trainable parameters and achieves better results than the UNet variant on all three datasets.
The proposed network architecture was benchmarked against the winning submission of the Medical Decathlon Challenge (MDC), namely nnUNet (Isensee et al.,
Comparison of the proposed VNet-AG-DSV to the state-of-the-art network with similar parameters presented by Isensee et al. (
Isensee et al. ( |
47.01 | 79.45 | 49.65 | |
Isensee et al. ( |
94.11 | 77.69 | 42.69 | |
Isensee et al. ( |
95.43 | 61.82 | 79.30 | 52.12 |
VNet-AG-DSV—Low res. | 94.54 | |||
VNet-AG-DSV—Full res. | 57.65 | |||
VNet-AG-DSV—Best model |
The full- and low-resolution models with attention gates (VNet-AG-DSV) achieved higher dice scores for both labels on the pancreas dataset; of particular interest, the tumor dice scores were substantially increased, by three and seven points in low and full resolution, respectively. In the case of the liver dataset, we see a significant improvement in the low-resolution case: attention gates improved the tumor dice score by seven points while the liver segmentation precision was comparable. A decrease in dice score happened only on the tumor class in the full-resolution case. Finally, if we compare the best models presented in both papers, our model with attention gates and deep supervision (VNet-AG-DSV) wins on both datasets, adding nearly three score points on the liver-tumor class and two points on the pancreas label.
The performance of the model with and without the attention gates is quantitatively compared in
Performance comparison.
Metric | UNet | UNet-AG-DSV | VNet | VNet-AG-DSV |
Num. parameters [M] | 26.2453 | 26.2917 | 29.6873 | 29.7383 |
Train iteration time | 224.8231 | 260.6527 | 297.2699 | 338.3336 |
Eval iteration time | 189.7215 | 217.5776 | 268.6558 | 299.3836 |
The proposed architecture was evaluated on three publicly available datasets: Task03-Liver and Task07-Pancreas from the Medical Decathlon Challenge and the Kidney Tumor Segmentation 2019 Challenge dataset, to compare its performance with state-of-the-art methods. The next three subsections summarize the results for each dataset.
Our VNet with attention gates and deep supervision (VNet-AG-DSV) for the kidney-tumor task (
Test set results from the Kidney Tumor Challenge 2019 leaderboard.
Method | Composite dice | Kidney dice | Tumor dice |
Isensee and Maier-Hein ( | 91.23 | 97.37 | 85.09 |
Hou et al. ( | 90.64 | 96.74 | 84.54 |
Mu et al. ( | 90.25 | 97.29 | 83.21 |
VNet-AG-DSV | 87.96 | 96.63 | 79.29 |
The liver-tumor dataset was obtained from the Medical Decathlon Challenge (MDC) held at the MICCAI conference in 2018. We analyze the results from various research papers dealing with liver and liver-tumor segmentation. The Bilic et al. (
Comparison of the state-of-the-art methods for liver and liver-tumor segmentation from CT scans.
Method | Mean dice | Liver dice | Tumor dice |
Bilic et al. ( | 83.15 | 96.10 | 70.20 |
Bilic et al. ( | 81.00 | 96.30 | 65.70 |
Isensee et al. ( | 78.63 | 95.43 | 61.82 |
VNet-AG-DSV | 80.56 | 96.37 | 64.70 |
In comparison to other abdominal organs, the pancreas segmentation is a challenging task, as shown by the lower dice scores achieved in the literature. Roth et al. (
Comparison of the state-of-the-art methods for pancreas and pancreas-tumor segmentation from CT scans.
Method | Mean dice | Pancreas dice | Tumor dice |
Roth et al. ( | - | 81.27 | - |
Oktay et al. ( | - | 84.00 | - |
Isensee et al. ( | 65.71 | 79.30 | 52.12 |
VNet-AG-DSV | 67.11 | 81.22 | 52.99 |
Conventional artificial neural networks with fully connected hidden layers take a very long time to train. For this reason, the convolutional neural network (CNN) was introduced: it is specifically designed to work with images through the use of convolutional and pooling layers before ending with fully connected layers. Nowadays, convolutional neural network architectures are the primary choice for most computer vision tasks. CNNs take inspiration from biological processes in that the connectivity pattern between neurons corresponds to the organization of the animal visual cortex (Hubel and Wiesel,
Image segmentation is one of the most laborious tasks in computer vision since it requires the pixel-wise classification of the input image. Long et al. (
The deep supervision presented by Kayalibay et al. (
Apart from the skip connections, many researchers have tried to incorporate the concept of attention into artificial CNN visual models (Mnih et al.,
This work presents a comprehensive study of medical image segmentation via deep convolutional neural networks. We propose a novel network architecture extended with attention gates and deep supervision (VNet-AG-DSV) which achieves results comparable to the state-of-the-art performance on several very different medical image datasets. We performed an extensive study which analyzes the two most popular convolutional neural networks in medical imaging (UNet and VNet) across three different organ-tumor datasets and two training image resolutions. Further, to understand how the model represents the input image at the intermediate layers, the activation maps from attention gates and the secondary segmentation maps from deep supervision layers are visualized. The visualizations show an excellent correlation between the activations present and the label of interest. The performance comparison shows that the proposed network extension introduces a slight computational burden, which is outweighed by a considerable improvement in performance. Finally, our architecture is fully automatic and has shown its validity in detecting three different organs and tumors, i.e., it is more general than the state of the art, while providing similar performance to more dedicated methods.
Publicly available datasets were analyzed in this study. This data can be found here:
AT and TT coded the proposed methodology and performed the experiments. ZK helped to ensure the needed computation power. AT wrote the first draft of the manuscript. AR-S did the first approval reading. All authors contributed to the conception and design of the study, contributed to manuscript revision, and read and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling Editor declared a past co-authorship with one of the authors AR-S.
This paper was created with support of A.I. Lab (
1For example website Grand Challenges in Biomedical Image Analysis gathers multiple competitions;
2A one-hot encoding was created from the original ground-truth segmentation map in such a way that each image channel contains only one class present in the segmentation map; this way, all the classes are represented by the value one, just in different image channels. For example, if we have a ground-truth segmentation map of size (1 ×
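A minimal sketch of this one-hot encoding for a flat segmentation map (illustrative only; the real encoding operates on multi-dimensional image tensors):

```python
def one_hot(seg, num_classes):
    """One-hot encode a flat segmentation map: channel k of the output is 1
    exactly where seg equals class k, so each channel holds one class."""
    return [[1 if v == k else 0 for v in seg] for k in range(num_classes)]
```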
3
4