Robust fusion for skin lesion segmentation of dermoscopic images

Robust skin lesion segmentation of dermoscopic images remains very difficult. Recent methods often combine CNN and Transformer for feature abstraction and use multi-scale features for further classification. Both types of combination generally rely on some form of feature fusion. This paper considers these fusions from two novel points of view. For abstraction, Transformer is viewed as the affinity exploration of different patch tokens and can be applied to attend CNN features at multiple scales. Consequently, a new fusion module, the Attention-based Transformer-And-CNN fusion module (ATAC), is proposed. ATAC augments the CNN features with more global contexts. For further classification, it is desirable to adaptively combine the information from multiple scales according to their contributions to object recognition. Accordingly, a new fusion module, the GAting-based Multi-Scale fusion module (GAMS), is also introduced, which adaptively weights the information from multiple scales by a lightweight gating mechanism. Combining ATAC and GAMS leads to a new encoder-decoder-based framework. In this framework, ATAC acts as an encoder block to progressively abstract strong CNN features with rich global contexts attended by long-range relations, while GAMS works as an enhancement of the decoder to generate discriminative features through adaptive fusion of multi-scale ones. The framework is especially good at segmenting lesions of varying sizes and shapes and of low contrast, and its performance is demonstrated by extensive experiments on public skin lesion segmentation datasets.


Introduction
Skin cancer is listed as one of the fastest-growing cancers in the world (Jemal, 2017), and dermatologists usually identify lesions visually from dermoscopic images. However, manual identification is tedious and time-consuming. Therefore, automatic skin lesion segmentation, which can assist dermatologists in further analysis, is badly needed in clinical practice.
Skin lesions vary widely in shape and size and often have low contrast (Figure 1). This means that both global and local contexts are important for effective feature abstraction, which is also why some methods (Wu et al., 2022; Zhang et al., 2021; Xu et al., 2021; Chen et al., 2021) combine a convolutional neural network (CNN) with Transformer (Vaswani et al., 2017): CNN extracts features with rich local information while Transformer captures long-range relationships. They often fuse the two types of features serially (Chen et al., 2021), or after the last stage of the Transformer branch (Xu et al., 2021; Wu et al., 2022).
However, such fusions may not utilize the Transformer effectively. Transformer in principle computes affinities as attention for long-range relationships. Size and shape variations are significant characteristics of lesions (Figure 1), which means a more effective fusion can be obtained by applying Transformer as an augmentation to different scales at different encoding stages of the CNN. This progressive boost is very important, especially when facing the low-contrast appearance of lesions.
Therefore, we argue that the better way is to take the Transformer as a progressive attention tool that enhances the long-range information gradually, so that the feature responses of lesions are significantly enhanced. Accordingly, a new feature fusion module, the Attention-based Transformer-And-CNN fusion module (ATAC), is proposed. It performs the attention-based fusion progressively at multiple scales, which is different from the traditional fusion applied in tandem or after the last stage.
Effectively decoding from the strong features is also important for a successful segmentation, where the fusion of features from different scales is often considered an effective idea. Recent studies show that different scales may have different weights in fusion and features at suboptimal scales may reduce segmentation accuracy (Chen et al., 2016;Shi et al., 2018), e.g., large scales are more important for bigger lesions. Recent methods (Xu et al., 2021;Dai et al., 2022) fuse the multi-scale features with weights computed from several additional convolutions and thus increase the computation complexity.
We prefer a lightweight scheme to fuse the multi-scale features. Considering that a gating mechanism is effective in filtering features with few parameters, this paper proposes a new multi-scale fusion module, the GAting-based Multi-Scale fusion module (GAMS), to aggregate the multi-scale features adaptively by the weights from gating.
The two fusion modules ATAC and GAMS lead to a new skin lesion segmentation method. Built on the popular U-Net (Ronneberger et al., 2015) structure, it takes ATAC as an encoder block for the effective abstraction of features from both global and local contexts while adopting GAMS as an enhancement to the decoder for robust exploration of the multi-scale features. Experiments show that this method can accurately locate lesions of different shapes and sizes and of low contrast.
The main contributions can be summarized as follows.
• A novel CNN and Transformer fusion module, ATAC, which takes Transformer features as affinity estimation to attend CNN features for progressively boosting the global contexts.
• A novel multi-scale fusion module, GAMS, which takes weighted contributions from multi-scale features by gating to fuse information from different contexts.
• A new encoder-decoder-based skin lesion segmentation network for single images, which integrates ATAC and GAMS as the encoder block and decoder enhancement, respectively, and thus can reach robust segmentation of skin lesions despite size and shape variations and low contrast.
2 Related work
Parallel adoption of both CNN and Transformer has also been proposed (Wu et al., 2022; Zhang et al., 2021; Xu et al., 2021). Possible fusion methods include concatenation (Wu et al., 2022) and some attention-inspired mechanisms, such as convolution-based attention or direct attention-based supervision (Xu et al., 2021). However, all these fusions happen after the last stage of the Transformer branch and thus may not fully explore the rich contexts from the multi-scale features.

FIGURE 2
The pipeline of the proposed method. Building on U-Net and incorporating both CNN and Transformer, it adds two new fusion modules, ATAC and GAMS, to the encoder and decoder, respectively, to progressively boost the features during abstraction and to adaptively combine the features for classification. Note: PVT stands for PVT v2, which supplies the Transformer features to ATAC.

Multi-scale feature aggregation
Some methods designed for natural images (Zhao et al., 2017; Chen et al., 2018; Lin et al., 2017) first extract multi-scale features by a pyramid pooling module (PPM), pyramid atrous convolutions, or a feature pyramid network (FPN) and then combine these features to predict segmentation results. For skin lesion segmentation, researchers usually first extract multi-scale features by atrous convolution or standard convolution and then fuse them using concatenation or element-wise addition (Zhang et al., 2019; Liu et al., 2019; Cui et al., 2019). Recently, Xu et al. (2021) and Dai et al. (2022) fused multi-scale features by learned weights computed by additional convolutions. However, their methods increase the number of trainable parameters and consequently the computational complexity.

Methods
Overall, our proposed framework (Figure 2) takes the U-shaped encoder-decoder structure. The encoder adopts ATAC as a building block, which gradually fuses the Transformer and CNN features for feature abstraction. The decoder consists of the normal decoder and its enhancement GAMS. The normal decoder is linked to the encoder by skip connections as in the typical U-Net, while GAMS takes the features from the decoder for adaptive fusion.
The CNN features of the input images are attended by the Transformer features stage by stage in the encoder, so that globally augmented CNN features are obtained gradually. Then the normal decoder is applied to produce the final classification (Prediction 1), while the multi-scale decoder features are also input to GAMS so that effective features aggregated by adaptive fusion are generated (Prediction 2). The final prediction is based on the results from both predictions.
The four-stage PVT v2 supplies the Transformer features to ATAC. Each stage of the normal decoder is made up of up-sampling and two 3 × 3 convolutions, as in the decoder of U-Net (Ronneberger et al., 2015). The details of ATAC and GAMS are discussed below.

The attention-based transformer-and-CNN fusion module
3.1.1 Why progressive attention?
Generally, CNN is good at capturing features with rich local details, while Transformer can capture the long-range dependence that is vital to distinguish the target from the background. Therefore, to aggregate information from both CNN and Transformer features, there are two typical ways of fusion (Figures 3A,B). One can be called serial fusion, which treats CNN and Transformer as neighboring branches and fuses their features serially. The other can be called last-stage fusion, which treats CNN and Transformer as two parallel branches and fuses their features only after the last stage of the Transformer branch.
However, serial fusion may not obtain robust features after the second branch because of the possible information loss brought by the filtering effect of the first branch. Last-stage fusion tries to keep all information from the two branches, but combining only after the last stage of the Transformer branch may mix up the information from different scales of the Transformer and cannot utilize it effectively. A more efficient utilization of Transformer features is expected.
Transformer is different from CNN in computing the features. Transformer is configured with multi-head self-attention for learning the long-range dependencies of image patches. This attention mechanism means that Transformer actually captures the patch affinities globally. The multi-scale Transformer branch supplies rich affinity information from different scales and thus can be used to boost the CNN features progressively, as a general attention mechanism, for more robust exploration and fusion. Therefore, a novel fusion method called attentive fusion can be obtained (Figure 3C). Let's revisit the principle of the multi-head self-attention mechanism in Transformer.

FIGURE 4
The structure of ATAC. Pooling represents the regular max-pooling operation with pooling size 2. Maxpool and AvgPool represent the max-pooling and average-pooling operations along the channel axis, respectively.

Frontiers in Bioengineering and Biotechnology frontiersin.org
Given a set of $N$ tokens $T = \{t_1, t_2, \ldots, t_N\}$, where $t_n \in \mathbb{R}^{d}$ is the $d$-dimensional feature vector of the $n$-th token ($n = 1, \ldots, N$), the multi-head self-attention of Transformer first computes the query $Q_i$, key $K_i$ and value $V_i$ of the $i$-th head of all $p$ heads by a linear layer $L(\cdot)$,

$$(Q_i, K_i, V_i) = L(T). \tag{1}$$

Then it computes the attention matrix $A_i \in \mathbb{R}^{N \times N}$, representing affinities between tokens,

$$A_i = S\left(\frac{Q_i K_i^{\top}}{\sqrt{d_s}}\right), \tag{2}$$

where $S$ represents Softmax and $d_s$ is the column dimension of $Q_i$, $K_i$ and $V_i$. Afterwards, the feature map of the $i$-th head, $H_i \in \mathbb{R}^{N \times d_s}$, can be computed as

$$H_i = A_i V_i. \tag{3}$$

The final feature map $H$ is obtained by a linear layer after concatenating all $p$ head feature maps,

$$H = L(\mathrm{Concat}(H_1, H_2, \ldots, H_p)). \tag{4}$$

This principle shows that Transformer essentially models the patch affinity. Therefore, its features can be treated as affinities whose global information can be utilized to boost CNN features.
Consequently, to better benefit from Transformer, the proposed attentive fusion is installed as the Transformer attended module ATAC to boost the long-range relations inside the CNN features from different stages and thus progressively dig up significant large-scale contexts. This design fits well with lesions: their varying sizes and shapes need global contexts to capture, especially considering their possibly low contrast.
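To make the affinity view concrete, the following is a minimal NumPy sketch of multi-head self-attention as described above. It is illustrative only: the function and argument names are hypothetical, the linear layers are bias-free random matrices rather than trained weights, and the sizes are arbitrary toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(T, Wq, Wk, Wv, Wo):
    """T: (N, d) tokens; Wq, Wk, Wv: (p, d, d_s) per-head projections
    (assumed bias-free stand-ins for the linear layer); Wo: (p * d_s, d)."""
    heads, affinities = [], []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):
        Q, K, V = T @ q_w, T @ k_w, T @ v_w          # each (N, d_s)
        A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))   # (N, N) token affinities
        heads.append(A @ V)                          # (N, d_s) head feature map
        affinities.append(A)
    out = np.concatenate(heads, axis=1) @ Wo         # concatenate heads, then project
    return out, np.stack(affinities)

rng = np.random.default_rng(0)
N, d, p, d_s = 6, 8, 2, 4                            # toy sizes, chosen arbitrarily
T = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(p, d, d_s)) for _ in range(3))
Wo = rng.normal(size=(p * d_s, d))
out, A = multi_head_self_attention(T, Wq, Wk, Wv, Wo)
print(out.shape, A.shape)
```

Each row of every attention matrix sums to one, so a row can be read as how strongly one patch token relates to all the others, which is the affinity interpretation that motivates attentive fusion.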

The structure of ATAC
ATAC is designed as follows (Figure 4). First, the input feature map $F$ is passed through two 3 × 3 convolutions to extract the CNN feature $F_c$, and then regular max-pooling (pooling size 2) is applied to integrate the features as $F_p \in \mathbb{R}^{c \times h \times w}$ ($c$, $h$, $w$ represent the channel dimension, height and width of the feature map respectively),

$$F_c = C_{3\times3}(C_{3\times3}(F)), \qquad F_p = P(F_c), \tag{5}$$

FIGURE 5
The structure of GAMS.

FIGURE 6
Comparison of the DSC curves under different thresholds on ISIC 2018 and PH2.
where $C_{3\times3}$ and $P$ indicate 3 × 3 convolution and the regular max-pooling operation respectively. At the same time, the Transformer features of the corresponding scale, $F_t \in \mathbb{R}^{c \times h \times w}$, from PVT v2 are first mapped to the features $W_m \in \mathbb{R}^{1 \times h \times w}$ and $W_a \in \mathbb{R}^{1 \times h \times w}$ by max-pooling $M_m$ and average-pooling $M_a$ along the channel axis. This integrates information across all channels, which can be effective in highlighting informative regions. Then they are added to get the fused Transformer features $W$,

$$W = M_m(F_t) + M_a(F_t). \tag{6}$$

Then the attention is applied. The fused Transformer features $W$ are embedded into the CNN features $F_p$ by element-wise multiplication to get the enhanced CNN features $F_f$. Since $F_p$ and $W$ have the same width and height, the element-wise multiplication is broadcast along the channel axis,

$$F_f = F_p \odot W, \tag{7}$$

where $\odot$ indicates the Hadamard product. The output features $F_o$ are finally obtained by a 1 × 1 convolution of $F_f$.
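The attention step of ATAC can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: the learned 3 × 3 and 1 × 1 convolutions are omitted, the function names are hypothetical, and the inputs are random arrays standing in for real CNN and PVT v2 features.

```python
import numpy as np

def max_pool_2x2(x):
    """Regular max-pooling with pooling size 2 on a (c, h, w) array.
    Assumes h and w are even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def attentive_fusion(F_c, F_t):
    """F_c: CNN features (c, 2h, 2w) after the two convolutions (not modeled here).
    F_t: Transformer features at the matching scale, shape (c_t, h, w)."""
    F_p = max_pool_2x2(F_c)                   # (c, h, w)
    W_m = F_t.max(axis=0, keepdims=True)      # max-pooling along the channel axis, (1, h, w)
    W_a = F_t.mean(axis=0, keepdims=True)     # average-pooling along the channel axis, (1, h, w)
    W = W_m + W_a                             # fused Transformer attention map
    return F_p * W                            # Hadamard product, broadcast over channels

rng = np.random.default_rng(0)
F_c = rng.normal(size=(4, 8, 8))              # toy CNN features
F_t = rng.normal(size=(4, 4, 4))              # toy Transformer features
F_f = attentive_fusion(F_c, F_t)
print(F_f.shape)
```

Because the attention map has a single channel, every CNN channel is rescaled by the same spatial map, so the Transformer contributes only location-wise emphasis rather than new channels.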

FIGURE 7
Features from the general ablation study.

FIGURE 8
Features from the ablation study on the fusion methods in ATAC.

The GAting-based multi-scale fusion module
Contexts from different scales may have different influences on object perception. For example, large scales are more important for bigger lesions and vice versa. It is better to have a weighting scheme that automatically utilizes such differences. Considering that gating is an effective filter for such a purpose, this paper introduces GAMS (Figure 5) to improve the feature discrimination.
In GAMS, the input feature maps are first rescaled to the same scale by bilinear upsampling, giving $S_i$ ($i \in \{1, \ldots, n\}$; in our experiments, $n$ is set to four according to the four stages of the normal decoder). Then a 1 × 1 convolution $C_{1\times1}$ is applied to reduce the depth of the features to 1. Afterwards, the mapped features are fused by concatenation as $\tilde{S}$,

$$\tilde{S} = \mathrm{Concat}(C_{1\times1}(S_1), C_{1\times1}(S_2), \ldots, C_{1\times1}(S_n)). \tag{8}$$

Then, the gating map $W$ can be obtained by the activation function Softmax $S$,

$$W = S(\tilde{S}). \tag{9}$$

$W$ in Eq. 9 is further divided into $W_1, W_2, \ldots, W_n$ as the corresponding weights for the $n$ scales. These weights are used to weight all input features, which are then passed through a 1 × 1 convolution to give the aggregated output features $O$.
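The gating steps above can be sketched as follows. This is a minimal NumPy illustration under explicit assumptions: the inputs are taken to be already resized to a common resolution, each learned 1 × 1 convolution is replaced by a fixed weight vector over channels, the weighted features are aggregated by summation (the text does not state the aggregation operator), and the final 1 × 1 convolution is omitted. All names are hypothetical.

```python
import numpy as np

def gams_fuse(scales, proj):
    """scales: list of n arrays of shape (c, h, w), already at a common resolution.
    proj: list of n length-c vectors standing in for the learned 1x1 convolutions."""
    # A 1x1 convolution to depth 1 is a per-pixel weighted sum over channels.
    maps = np.stack([np.tensordot(w, s, axes=(0, 0)) for w, s in zip(proj, scales)])  # (n, h, w)
    # Softmax across the n scales gives one gate per scale at every pixel.
    maps = maps - maps.max(axis=0, keepdims=True)
    gates = np.exp(maps)
    gates /= gates.sum(axis=0, keepdims=True)
    # Assumption: gated features are summed; the final 1x1 convolution is not modeled.
    fused = sum(g[None] * s for g, s in zip(gates, scales))
    return fused, gates

rng = np.random.default_rng(0)
n, c, h, w = 4, 3, 5, 5                       # toy sizes; n = 4 as in the text
scales = [rng.normal(size=(c, h, w)) for _ in range(n)]
proj = [rng.normal(size=c) for _ in range(n)]
fused, gates = gams_fuse(scales, proj)
print(fused.shape, gates.shape)
```

The only parameters in this sketch are the n projection vectors, which is consistent with the lightweight motivation for gating given earlier.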

Loss function
The overall loss $L$ is set to be the weighted average of the losses from the two predictions shown in Figure 2, $L_{GAMS}$ and $L_{Normal}$, which are the losses from GAMS and the normal decoder respectively; the weight is denoted by $\lambda$ ($\lambda = 0.2$ in the experiment). Each loss $L_i$ ($i \in \{GAMS, Normal\}$) is estimated by the combination of the weighted binary cross-entropy (WBCE) and the weighted Intersection over Union (WIoU), both computed between the ground truth $p$ and the prediction $\hat{p}$, where $l^{w}_{IoU}(\cdot)$ and $l^{w}_{BCE}(\cdot)$ denote the WIoU and WBCE losses respectively.
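A rough sketch of such a two-branch loss is given below in plain Python. Several details are assumptions, since the text does not fix them: the two terms of each branch loss are simply summed, $\lambda$ is applied to the GAMS branch and $1 - \lambda$ to the normal decoder, and the per-pixel weights are passed in by the caller (uniform by default) because the weighting scheme is not specified here. Function names are hypothetical.

```python
import math

def weighted_bce(target, pred, weights=None, eps=1e-7):
    """Weighted binary cross-entropy over flat lists of pixels."""
    weights = weights if weights is not None else [1.0] * len(target)
    total = 0.0
    for t, p, w in zip(target, pred, weights):
        p = min(max(p, eps), 1.0 - eps)       # clamp to avoid log(0)
        total -= w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / sum(weights)

def weighted_iou(target, pred, weights=None, eps=1e-7):
    """Weighted soft IoU loss: 1 - intersection / union."""
    weights = weights if weights is not None else [1.0] * len(target)
    inter = sum(w * t * p for t, p, w in zip(target, pred, weights))
    union = sum(w * (t + p - t * p) for t, p, w in zip(target, pred, weights))
    return 1.0 - (inter + eps) / (union + eps)

def prediction_loss(target, pred, weights=None):
    # Assumption: the "combination" of the two terms is a plain sum.
    return weighted_iou(target, pred, weights) + weighted_bce(target, pred, weights)

def overall_loss(loss_gams, loss_normal, lam=0.2):
    # Assumption: lambda weights the GAMS branch and (1 - lambda) the normal decoder.
    return lam * loss_gams + (1.0 - lam) * loss_normal

target = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.4, 0.6, 0.5, 0.5]
print(overall_loss(prediction_loss(target, good), prediction_loss(target, bad)))
```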

Setup
The system is implemented in PyTorch and runs on a single NVIDIA GeForce RTX 2080 Ti GPU. The model is trained for 100 epochs with the Adam optimizer and an initial learning rate of 10^-4. For PH2, the batch size is set to 8; for the other three datasets, it is set to 16. All images are resized to 256 × 256 as input, with various data augmentations. The experiments use four public datasets, ISIC 2016, ISIC 2017, ISIC 2018, and PH2 (Mendonça et al., 2013), where the dataset division for ISIC 2017 is the same as in the previous study (Reza et al., 2022) and those of the other three follow the setting in FAT-Net (Wu et al., 2022). Details of the four datasets used in our experiments are described below.
• ISIC 2016 is provided by the International Skin Imaging Collaboration (ISIC). There are a total of 1279 RGB skin lesion images, of which 900 are used for training and 379 are used for testing.
• ISIC 2017 is also provided by ISIC and includes 2000 RGB skin lesion images as the training set with masks for segmentation. We randomly divide the original dataset into a training set, validation set, and testing set in a ratio of 7:1:2.
• ISIC 2018 is also collected by ISIC and contains 2594 RGB skin lesion images. Like the ISIC 2017 dataset division, we use 1815 samples for the training set, 259 samples for the validation set, and 520 samples for the testing set.

(Öztürk and Özkaya, 2020) and CKDNet (Jin et al., 2021)) and three Transformer-based models (TransUNet (Chen et al., 2021), FAT-Net (Wu et al., 2022) and TMUNet (Reza et al., 2022)). Among CNN-based models, U-Net and AttU-Net are basic medical image segmentation frameworks. DAGAN, iFCN, and CKDNet are specially designed for skin lesion segmentation. CPFNet, MCGU-Net, and SBPS are excellent segmentation networks from recent years that address large size and structure variations and boundary ambiguities, and they can be applied to various types of medical images. Among Transformer-based models, TMUNet and FAT-Net fuse CNN and Transformer features at the last stage, while TransUNet fuses CNN and Transformer features serially.
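The 7:1:2 division described for ISIC 2017 and ISIC 2018 can be checked with a short calculation. The rounding rule below (floor the training and validation counts, give the remainder to the test set) is an assumption; it is chosen because it reproduces the ISIC 2018 counts stated above, and the function name is hypothetical.

```python
def split_counts(n, ratio=(7, 1, 2)):
    """Split n samples into train/val/test counts for an integer ratio.
    Assumed rounding: floor train and val, assign the remainder to test."""
    total = sum(ratio)
    train = n * ratio[0] // total
    val = n * ratio[1] // total
    test = n - train - val
    return train, val, test

print(split_counts(2594))  # (1815, 259, 520), matching the ISIC 2018 counts in the text
print(split_counts(2000))  # (1400, 200, 400) for the 2000 ISIC 2017 images under the same rule
```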

Evaluation metrics
Five widely used metrics are employed to quantitatively evaluate the segmentation performance: Sensitivity (SE) (Yerushalmy, 1947), Specificity (SP) (Yerushalmy, 1947), Intersection over Union (IoU) (Everingham et al., 2015), Dice Similarity Coefficient (DSC) (Dice, 1945) and Accuracy (ACC) (per a la Normalització, 1994). They are defined as:

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}, \qquad IoU = \frac{TP}{TP + FP + FN},$$

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN},$$

where: 1) TP (True Positive) represents the number of pixels that are correctly classified as lesions; 2) TN (True Negative) represents the number of pixels that are correctly classified as background; 3) FP (False Positive) represents the number of pixels that are falsely classified as lesions; and 4) FN (False Negative) represents the number of pixels that are falsely classified as background.
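As a worked example of these five metrics, the sketch below counts TP, TN, FP and FN on a tiny flattened binary mask and evaluates each metric. The function names and the toy masks are illustrative only; a practical implementation would also need to guard against empty denominators.

```python
def confusion_counts(truth, pred):
    """Count TP, TN, FP, FN for flat binary masks (1 = lesion, 0 = background)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def segmentation_metrics(tp, tn, fp, fn):
    """Standard definitions; no guard against zero denominators."""
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "IoU": tp / (tp + fp + fn),
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
    }

truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(truth, pred)        # TP=2, TN=4, FP=1, FN=1
metrics = segmentation_metrics(*counts)
print(counts, metrics)
```

On this toy mask IoU is 0.5 and DSC is 2/3, which illustrates the general relation DSC = 2·IoU / (1 + IoU), so the two metrics always rank a pair of predictions on one image in the same order.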

Ablation studies
4.3.1 General ablation study
First, the general ablation study for the proposed modules and method for skin lesion segmentation is conducted. U-Net is taken as the baseline. ATAC and GAMS are added to the baseline in different configurations, which run in the same environment with the same data augmentations for a fair comparison.
• Baseline: the backbone network using U-Net;
• Baseline + ATAC: Baseline but replacing its encoder block with ATAC;

(Figure 6), which demonstrates the performance gains by ATAC, GAMS, and the full model over Baseline with the full model being the best among all methods.
The feature maps output by the third stage of the normal decoder in different configurations are also visualized (Figure 7). We randomly selected one-channel feature maps for different configurations, which are uniformly resized to 128 × 128 for better display. The lesions are of different sizes and shapes with the smaller ones in low contrast. ATAC can significantly remove the background distractions because of the global enhancement from Transformer, while GAMS further improves the object responses, especially for the smaller lesion, thanks to its varying weight scheme. Their combination, i.e., the full model, obtains the best result with the strongest maps.

Ablation study on the fusion method in ATAC
The ablation study on the fusion method in ATAC is also undertaken (Table 2). Two widely used fusion methods, concatenation and addition, are compared with our proposed attentive fusion by substituting the multiplication operation of ATAC with concatenation or addition, respectively. The attentive method achieves the best performance on both ISIC 2018 and PH2 among all methods.
The features abstracted with different fusion operations are also extracted (Figure 8). The method of feature visualization is the same as in Figure 7. The responses from attentive fusion are stronger and more focused than the other two operations, which also demonstrates the importance of attentive fusion for robust lesion segmentation.

Ablation study on the encoder
To further verify the effectiveness of fusing CNN and Transformer features, an ablation study comparing against encoders that use only CNN features or only Transformer features is also conducted. To use only CNN features, we replace ATAC with the encoder block of U-Net; to use only Transformer features, we replace ATAC with the block of PVT v2. As can be seen in Table 3, our fused encoder achieves the best performance compared with the CNN or Transformer encoder alone.

FIGURE 9
Qualitative comparison for the ablation study on the encoder.

In addition, the segmentation results of some representative images are visualized in Figure 9, including the lesions with various sizes, irregular shapes, and low contrast. The first and second rows show that our fusion encoder yields the best prediction for the smallest or largest lesions. The third row shows the segmentation results for lesions with low contrast.
It can be seen that both the CNN encoder and the Transformer encoder exhibit over-segmentation, while our fusion encoder achieves the best performance. The last row shows that our fusion encoder segments irregularly shaped lesions more accurately.
Next, we discuss the comparisons on the four datasets.
The quantitative results of existing methods are reported by Lin et al. (2022); Wu et al. (2022); Reza et al. (2022) (Table 4). Our method achieves the highest scores in all metrics except SE, where it shows a slight decrease.
Figure 10 shows some visualization results of different methods. As can be seen, the lesion in the last row of ISIC 2018 has low contrast and an ambiguous boundary. The compared methods exhibit under-segmentation. In addition, FAT-Net and TransUNet can struggle to localize a complete lesion because of the possible information loss brought by last-stage fusion and serial fusion, respectively. Our method benefits from information fusion at different encoder and decoder stages, which boosts the feature representation and thus yields more accurate segmentation results than the compared methods.

Evaluation on ISIC 2017
4.5.1 Quantitative results
The experimental results of DAGAN and MCGU-Net are those reported by TMUNet (Reza et al., 2022), and the remaining results are computed by us using the released code (Table 5). Our method also achieves the highest scores in most metrics. In addition, compared with the latest method TMUNet, ours is 0.83%, 1.45%, and 0.44% higher in DSC, IoU, and ACC, respectively.

FIGURE 11
Failure examples of the proposed method.
4.5.2 Qualitative results
Figure 10 shows that our method obtains more accurate results than other methods on ISIC 2017. In the last row of ISIC 2017, the lesion is affected by hair interference, yet our method still performs better than the compared methods. This is because ATAC can effectively utilize the long-range contexts from Transformer, which helps to distinguish different classes.

Evaluation on ISIC 2016
4.6.1 Quantitative results
The quantitative results of existing methods are those reported by FAT-Net (Wu et al., 2022), except that those of TransUNet and TMUNet are computed by us using their released code (Table 6). Ours again achieves the highest scores in most metrics.
Figure 10 gives some visual results. As shown in the first and second rows of ISIC 2016, the lesions exhibit a large variation in size. Thanks to the fusion performed by GAMS over different scales in the decoding stage, our method can detect lesions more accurately than other methods, even if they are very small or large.

Quantitative results
The quantitative results of existing methods are from FAT-Net (Wu et al., 2022), except for CPFNet, TransUNet, and TMUNet, which are computed by us using their code (Table 7). Our method again achieves the highest scores for all metrics.

Qualitative results
As can be seen in the PH2 rows of Figure 10, there are many details around the boundaries of these lesions. Nevertheless, the boundary obtained by our method is more accurate and closer to the ground truth than those of other methods. This advantage depends on the strong feature representation capabilities of ATAC.

Failure cases of the proposed method
Although our method is better than the current mainstream segmentation methods, some challenges are still not solved. Figure 11 shows some failure examples. It can be observed that these lesions have very complex boundary regions (see the first, third, and sixth columns) or serious noise interference (see the second, fourth, and fifth columns). Our method can basically detect the lesion locations, but in these complex scenes it gets poor segmentation results because it is difficult to obtain a feature representation robust enough to distinguish the different classes.

Conclusion
This paper aims at effective fusion policies for robust skin lesion segmentation from dermoscopic images and proposes a new method. Two new fusion modules, ATAC and GAMS, are incorporated in its encoder and decoder for robust feature abstraction and further classification, respectively. ATAC acts as the encoder block, which takes the Transformer to attend CNN features for augmentation of global contexts at different stages. This design makes the abstracted features better fitted to lesions of varying size and shape, especially when they are in low contrast. GAMS works as an enhancement to the normal decoder, which adaptively weights the features of multiple scales by gating. This module helps obtain, at low complexity, features that are adapted to different objects and highly discriminative for robust final inference. Quantitative and qualitative experiments demonstrate the efficacy of the proposed method.
However, ambiguous boundaries of lesions are still challenging. In addition, hair covering the lesions may also distract the model and thus affect the segmentation performances. In the future, we will study those problems and propose more robust methods accordingly.

Author contributions
QG: Writing-Original Draft, Methodology, Formal analysis. XF: Writing-review and editing, Conceptualization, Supervision. LW: Writing-review and editing. EZ: Writing-review and editing. ZL: Writing-review and editing. All authors contributed to the article and approved the submitted version.

Funding
This work is supported by the Natural Science Foundation of Anhui Province (2108085MF210).

Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.