DA-TransUNet: integrating spatial and channel dual attention with transformer U-net for medical image segmentation

Accurate medical image segmentation is critical for disease quantification and treatment evaluation. While traditional U-Net architectures and their transformer-integrated variants excel in automated segmentation tasks, they lack the ability to harness the image's intrinsic position and channel features. Existing models also struggle with parameter efficiency and computational complexity, often due to the extensive use of Transformers. Moreover, research employing dual attention mechanisms over position and channel has not been specifically optimized for the high-detail demands of medical images. To address these issues, this study proposes a novel deep medical image segmentation framework, called DA-TransUNet, which integrates the Transformer and a dual attention block (DA-Block) into the traditional U-shaped architecture. Tailored to the high-detail requirements of medical images, DA-TransUNet optimizes the intermediate channels of Dual Attention (DA) and employs DA in each skip connection to effectively filter out irrelevant information. This integration significantly enhances the model's feature extraction capability, thereby improving medical image segmentation performance. DA-TransUNet is validated on medical image segmentation tasks, consistently outperforming state-of-the-art techniques across five datasets. In summary, DA-TransUNet makes significant strides in medical image segmentation and offers new insights into existing techniques. It strengthens model performance from the perspective of image features, thereby advancing the development of high-precision automated medical image diagnosis. The code and parameters of our model will be publicly available at https://github.com/SUN-1024/DA-TransUnet.


Introduction
Medical image segmentation is the process of delineating regions of interest within medical images for diagnosis and treatment planning. It serves as a cornerstone of medical image analysis. Precise delineation of lesions plays a crucial role in quantifying diseases, assessing disease prognosis, and evaluating treatment efficacy. Manual segmentation is accurate for pathology diagnosis, but it is labor-intensive and costly, limiting its viability in standardized clinical settings. Conversely, automated segmentation ensures a reliable and consistent process, boosting efficiency, cutting labor and costs, and preserving accuracy. Consequently, there is substantial demand for highly accurate automated medical image segmentation technology in clinical diagnostics.
In the last decade, the traditional U-net structure has been widely employed in numerous segmentation tasks, yielding commendable outcomes. Notably, the U-Net model [1], along with its various enhanced iterations, has achieved substantial success. ResUnet [2] emerged during this period, influenced by the residual concept. Similarly, UNet++ [3] emphasizes enhancements in skip connections, and DAResUnet [4] incorporates a dual attention block with a residual block (Res-Block) in U-net. Both architectures benefit from the encoder-decoder idea, while skip connections provide initial features for the decoder, bridging the semantic gap between encoders and decoders. However, the limited receptive field and inductive biases of convolutional operations can compromise segmentation accuracy. Additionally, the inability to establish long-range dependencies and global context further constrains performance improvements.
The transformer [5], originally developed for sequence-to-sequence modeling in Natural Language Processing (NLP), has also found utility in the field of Computer Vision (CV). Vision Transformers (ViTs) [6] segment images into patches and feed their embeddings into a transformer network, achieving strong performance; their application in CV has amplified segmentation efficacy, especially in medical image segmentation. Inspired by ViTs, TransUNet [7] fuses the strengths of ViTs and U-net architectures to advance medical image segmentation: it utilizes a transformer-based encoder for robust image feature extraction while incorporating conventional convolutional neural networks and skip connections to achieve precise feature-map up-sampling. However, it omits consideration of image-specific attributes such as spatial position and channel information. Swin-Unet [8] combines the Swin-Transformer block with the U-net structure and achieves good results, yet adding extensive Transformer blocks inflates the parameter count without significantly improving results. Although the aforementioned medical image segmentation studies show progress in leveraging U-net and Transformer features, they have some limitations: 1) Although combining the Transformer with traditional U-Net architectures has shown promise in medical image segmentation, the Transformer lacks built-in mechanisms for considering the image-specific features of position and channel. This gap in functionality calls for additional investigation.
2) In the U-Net model, skip connections serve as a vital element, bridging the semantic divide between the encoder and the decoder. Despite their potential to improve segmentation performance, skip connections have seen limited optimization efforts to date.
3) Many studies merely stack multiple Transformers to enhance models, resulting in inflated parameters and computational complexity with marginal gains in performance. The intricate design of integrating Transformers and U-Net architectures warrants further investigation.
To address the aforementioned challenges, we propose DA-TransUNet, which incorporates DA-Blocks specifically designed to extract image-specific positional and channel features, thereby enhancing both parameter efficiency and performance. We believe that the extensive use of Transformers is not as impactful as a suite of precisely calibrated DA-Blocks optimized for image-specific features. The DA-Block within the transformer layer possesses a robust, specialized capability for extracting image-specific positional and channel features. This block integrates the Position Attention Module (PAM) and Channel Attention Module (CAM) from the Dual Attention Network for scene segmentation [9]. Positioned in the embedding layer of DA-TransUNet, the Dual Attention Block offers robust feature extraction capabilities. We also integrate the DA-Block into the three skip-connection layers to refine the features passed by the encoder; this narrows the semantic gap and aids in creating a unified feature representation. This fusion method maximizes the use of positional and channel features in the attention mechanism, optimizing the model. Furthermore, skip connections in the U-shaped structure are enhanced with DA-Blocks to filter irrelevant information, improving image reconstruction quality. Owing to these enhancements, both the decoding and medical image segmentation capabilities are significantly bolstered.
We mainly evaluate the effectiveness of the proposed DA-TransUNet on several medical image datasets: Synapse [10], CVC-ClinicDB [11], ISIC2018 [12,13], Kvasir-SEG [14], the Kvasir-Instrument dataset [15], and the Chest Xray Masks and Labels dataset [16,17]. DA-TransUNet demonstrates notable efficacy, as evidenced by quantifiable metrics. Our main contributions are summarized below: 1) We propose DA-TransUNet, a novel architecture that integrates dual attention mechanisms for processing positional and channel information into a Transformer U-net framework. This design improves the flexibility and functionality of the encoder-decoder structure, thereby improving performance in medical image segmentation tasks.
2) A well-designed dual-attention encoding mechanism is positioned ahead of the Transformer layer in the encoder. This enhances its feature extraction capabilities and enriches the functionality of the encoder in the U-net structure. (Section 3) 3) We enhance the effectiveness of skip connections by incorporating a Dual Attention Block into each layer, a modification substantiated by ablation studies, which results in more accurate feature delivery to the decoder and improved image segmentation performance. (Section 4.4) 4) Our proposed DA-TransUNet achieves state-of-the-art performance on multiple medical imaging datasets, which proves the effectiveness of our method and its contribution to advancing medical image segmentation.
The rest of this article is organized as follows. Section II reviews related work on automatic medical image segmentation, and a description of our proposed DA-TransUNet is given in Section III. Next, comprehensive experiments and visualization analyses are presented in Section IV. Finally, Section V concludes the work.
2 Related Work

U-net Model
Recently, attention mechanisms have gained popularity in U-net architectures [1]. For example, Attention U-net incorporates attention mechanisms to enhance pancreas localization and segmentation performance [18]; DAResUnet integrates both dual attention and residual mechanisms into U-net [4]; Attention Res-UNet explores the substitution of hard attention with soft attention [19]; SA-UNet incorporates a spatial attention mechanism in U-net [20]. Following this, TransUNet innovatively combines the Transformer and U-net structures [7]. Building on TransUNet, TransU-Net++ incorporates attention mechanisms into both skip connections and feature extraction [21]. Swin-Unet [8] improves on this by replacing every convolution block in U-net with the Swin-Transformer [22]. DS-TransUNet incorporates a TIF module into the skip connections to improve the model [23]. AA-TransUNet leverages the Convolutional Block Attention Module (CBAM) and Depthwise Separable Convolution (DSC) to further optimize TransUNet [24]. TransFuse uses dual-attention BiFusion blocks and attention gates (AG) to fuse the features of its two branches, a CNN and a Transformer [25]. Numerous attention mechanisms have been added to U-net and TransUNet models, yet further exploration is warranted. Diverging from prior approaches, our work introduces a dual attention mechanism and Transformer module into the traditional U-shaped encoder-decoder and its skip connections, yielding promising results.

Application of skip connections in medical image segmentation modeling
Skip connections in U-net aim to bridge the semantic gap between the encoder and decoder, effectively recovering fine-grained object details [26][27][28]. There are three primary modifications to skip connections. The first increases their complexity [29]: U-Net++ redesigned the skip connection to include a Dense-like structure [3], and U-Net3++ [30] changed the skip connection to a full-scale skip connection. The second refines the skipped features: RA-UNet introduces a 3D hybrid residual attention-aware method for precise feature extraction in skip connections [31]. The third combines encoder and decoder feature maps: an alternative extension to the classical skip connection was introduced in BCDU-Net, where a bidirectional convolutional long short-term memory (LSTM) module was added to the skip connection [32]. Aligning with the second approach, we integrate Dual Attention Blocks into each skip-connection layer, enhancing decoder feature extraction and thereby improving image segmentation accuracy.

Figure 1: For the input medical images, we feed them into an encoder with a transformer and Dual Attention Block (DA-Block). Then, the features at each of the three different scales are purified by a DA-Block. Finally, the purified skip connections are fused with the decoder, which subsequently undergoes CNN-based up-sampling to restore the channels to the same resolution as the input image. In this way, the final image prediction result is obtained.

The use of attentional mechanisms in medical images
Attention mechanisms are essential for directing model focus towards relevant features, thereby enhancing performance.
In recent years, dual attention mechanisms have seen diverse applications across multiple fields. In scene segmentation, the Dual Attention Network (DANet) employs position and channel attention mechanisms to improve performance [9]. A modularized DANs framework has been presented that adeptly merges visual and textual attention mechanisms [33]; this cohesive approach enables selective focus on pivotal features in both types of data, thereby improving task-specific performance. Additionally, the Dual Attention Module (DuATM) has been groundbreaking in the field of audio-visual event localization: it excels at learning context-aware feature sequences and performing attention-based sequence comparisons in tandem, effectively incorporating auditory-oriented visual attention mechanisms [34]. Moreover, dual attention mechanisms have been applied to medical segmentation, yielding promising results [35]. The Multilevel Dual Attention U-net for polyp segmentation combines dual attention and U-net in medical image segmentation [36]. While significant progress has been made in medical image segmentation, there is still ample room for further research into the potential of position and channel attention mechanisms in this field.

Method
In the subsequent section, we present the DA-TransUNet architecture, illustrated in Figure 1. We start with a comprehensive overview of the architecture. Next, we detail the architecture's key components in the following order: the dual attention block (DA-Block), the encoder, the skip connections, and the decoder.

Overview of DA-TransUNet
In Figure 1, the architecture of DA-TransUNet is presented. The model comprises three core components: the encoder, the decoder, and the skip connections. In particular, the encoder fuses a conventional convolutional neural network (CNN) with a Transformer layer and is further enriched by the DA-Block, which is exclusively introduced in this work. To address these constraints, we integrate DA-Blocks both preceding the Transformer layers and within the encoder-decoder skip connections. This achieves two goals: first, it refines the feature map input to the Transformer, enabling more nuanced and precise global feature extraction; second, the DA-Blocks in the skip connections optimize the transmitted features from the encoder, facilitating the decoder in reconstructing a more accurate feature map. Thus, our proposed architecture amalgamates the strengths and mitigates the weaknesses of both foundational technologies, resulting in a robust system capable of image-specific feature extraction.

Dual Attention Block(DA-Block)
As shown in Figure 2, the Dual Attention Block (DA-Block) serves as a feature extraction module that integrates the image-specific features of position and channel. This enables feature extraction tailored to the unique attributes of the image. Particularly in the context of U-Net-shaped architectures, the specialized feature extraction capabilities of the DA-Block are crucial. While Transformers are adept at using attention mechanisms to extract global features, they are not specifically tailored to image-specific attributes. In contrast, the DA-Block excels in both position-based and channel-based feature extraction, enabling a more detailed and accurate set of features to be obtained. Therefore, we incorporate it into the encoder and skip connections to enhance the model's segmentation performance. The DA-Block consists of two primary components: one featuring a Position Attention Module (PAM) and the other a Channel Attention Module (CAM), both borrowed from the Dual Attention Network for scene segmentation [9].

PAM (Position Attention Module):
As shown in Figure 3, PAM captures spatial dependencies between any two positions of the feature maps, updating each position's features through a weighted sum of the features at all positions. The weights are determined by the feature similarity between the two positions. Therefore, PAM is effective at extracting meaningful spatial features.
PAM initially takes a local feature, denoted as A ∈ R^(C×H×W) (C represents channels, H height, and W width). We then feed A into convolutional layers, producing three new feature maps B, C, and D, each of size R^(C×H×W). Next, we reshape B and C to R^(C×N), where N = H×W denotes the number of pixels. We perform a matrix multiplication between the transpose of C and B and apply a softmax layer to compute the spatial attention map S ∈ R^(N×N):

s_ji = exp(B_i · C_j) / Σ_{i=1}^{N} exp(B_i · C_j)

Here, s_ji measures the impact of the i-th position on the j-th position. We then reshape matrix D to R^(C×N). A matrix multiplication is performed between D and the transpose of S, and the result is reshaped to R^(C×H×W). Finally, we multiply it by a parameter α and perform an element-wise sum with the features A to obtain the final output E ∈ R^(C×H×W):

E_j = α Σ_{i=1}^{N} (s_ji D_i) + A_j

The weight α is initialized to 0 and learned progressively. PAM has a strong capability to extract spatial features. Since E is generated as a weighted sum of all position features plus the original features, it possesses global contextual features and aggregates context based on the spatial attention map. This ensures effective extraction of position features while maintaining global contextual information.
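The PAM computation above can be sketched in PyTorch. This is a minimal illustration following the DANet formulation, not the authors' exact implementation; in particular, the query/key channel-reduction factor of 8 is an assumption borrowed from DANet.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position Attention Module: a sketch of the computation described above."""
    def __init__(self, in_channels: int):
        super().__init__()
        # B and C use reduced channels (factor 8 is an assumption from DANet); D keeps C channels.
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)  # produces B
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)    # produces C
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)       # produces D
        self.alpha = nn.Parameter(torch.zeros(1))  # weight alpha, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        n = h * w
        b = self.query(a).view(bsz, -1, n).permute(0, 2, 1)  # (bsz, N, C')
        cmat = self.key(a).view(bsz, -1, n)                  # (bsz, C', N)
        s = self.softmax(torch.bmm(b, cmat))                 # spatial attention map (bsz, N, N)
        d = self.value(a).view(bsz, c, n)                    # (bsz, C, N)
        out = torch.bmm(d, s.permute(0, 2, 1)).view(bsz, c, h, w)
        return self.alpha * out + a                          # E = alpha * (weighted sum) + A
```

Because α starts at zero, the module initially acts as an identity mapping and gradually learns how much attention-weighted spatial context to mix into the features.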

CAM (Channel Attention Module):
As shown in Figure 4, CAM excels at extracting channel features.
Unlike PAM, we directly reshape the original feature A ∈ R^(C×H×W) to R^(C×N) and perform a matrix multiplication between A and its transpose. We then apply a softmax layer to obtain the channel attention map X ∈ R^(C×C):

x_ji = exp(A_i · A_j) / Σ_{i=1}^{C} exp(A_i · A_j)

Here, x_ji measures the impact of the i-th channel on the j-th channel. Next, we perform a matrix multiplication between the transpose of X and A, reshaping the result to R^(C×H×W). We then multiply the result by a scale parameter β and perform an element-wise sum with A to obtain the final output E ∈ R^(C×H×W):

E_j = β Σ_{i=1}^{C} (x_ji A_i) + A_j

The DA-Block's second branch is otherwise identical to the first, the only difference being that the PAM block is replaced with a CAM governed by the formula above. After the two attention branches produce their outputs, these outputs are aggregated by element-wise summation, and a final convolution restores the number of channels to yield the block's output.
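The CAM branch admits a similar sketch. Again, this is an illustrative PyTorch rendering rather than the authors' code; DANet additionally subtracts the row-wise maximum of the energy before the softmax for numerical stability, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel Attention Module: a sketch of the computation described above."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # scale parameter beta, learned from 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        flat = a.view(bsz, c, -1)                        # reshape A to (C, N)
        energy = torch.bmm(flat, flat.permute(0, 2, 1))  # (C, C) channel affinities
        attn = self.softmax(energy)                      # channel attention map X
        out = torch.bmm(attn, flat).view(bsz, c, h, w)   # reweighted channels
        return self.beta * out + a                       # E = beta * (weighted sum) + A
```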
This DA-Block architecture integrates the strengths of PAM and CAM to improve feature extraction, making it a critical component in enhancing the model's overall performance. By combining convolutional neural networks, transformer architectures, and dual attention mechanisms, the encoder attains a robust feature extraction capability.

Skip-connections with Dual Attention
Similar to other U-structured models, we incorporate skip connections between the encoder and decoder to bridge the semantic gap between them. To further minimize this gap, we introduce dual attention blocks (DA-Blocks), as depicted in Figure 1, in each of the three skip-connection layers. This decision was based on our observation that traditional skip connections often transmit redundant features, which DA-Blocks effectively filter. Integrating DA-Blocks into the skip connections allows them to refine the encoded features from both positional and channel perspectives, extracting more valuable information while reducing redundancy. By doing so, DA-Blocks assist the decoder in more accurate feature-map reconstruction. Moreover, the inclusion of DA-Blocks not only enhances the model's robustness but also mitigates overfitting, contributing to the overall performance and generalization capability of the model.

Decoder
As depicted in Figure 1, the right half of the diagram corresponds to the decoder. The primary role of the decoder is to reconstruct the original feature map by utilizing features acquired from the encoder and those received through skip connections, employing operations such as upsampling.
The decoder's components include feature fusion, a segmentation head, and three upsampling convolution blocks. The first component, feature fusion, integrates the feature maps transmitted through skip connections with the existing feature maps, helping the decoder faithfully reconstruct the original feature map. The second component, the segmentation head, restores the final output feature map to its original dimensions. The third component, the three upsampling convolution blocks, incrementally doubles the size of the input feature map at each step, effectively restoring the image's resolution.
Putting these parts together, the workflow begins by passing the input features through convolution blocks and then upsampling to enlarge the feature maps: at each step their spatial size doubles while their channel count is halved. The features received through the skip connections are then fused, followed by continued upsampling and convolution. After three iterations of this process, the resulting feature map undergoes one final round of upsampling and is restored to its original size by the segmentation head.
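One decoder stage of the workflow above can be sketched as follows. This is a generic U-Net-style up-block, not the authors' exact implementation; the channel widths, bilinear upsampling mode, and BatchNorm+ReLU choice are assumptions for illustration.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder stage: upsample x2, concatenate the skip feature, then convolve."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor = None) -> torch.Tensor:
        x = self.up(x)  # double the spatial size
        if skip is not None:
            x = torch.cat([x, skip], dim=1)  # fuse the skip-connection features
        return self.conv(x)  # halve/reduce the channel count
```

Three such stages chained together, followed by a 1x1 segmentation head, reproduce the upsampling pipeline described in the text.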
Thanks to this architecture, the decoder demonstrates robust decoding capabilities, effectively revitalizing the original feature map using features from both the encoder and skip connections.
Experiments

The experimental results demonstrate that DA-TransUNet outperforms existing methods across all six datasets. In the following subsections, we first introduce the datasets and implementation details, and then present the results on each of the six datasets.

Synapse
The Synapse dataset consists of 30 scans of eight abdominal organs. These eight organs include the left kidney, right kidney, aorta, spleen, gallbladder, liver, stomach, and pancreas. There are a total of 3779 axial contrast-enhanced abdominal clinical CT images.

CVC-ClinicDB
CVC-ClinicDB is a database of frames extracted from colonoscopy videos, which is part of the Endoscopic Vision Challenge. It is a dataset of endoscopic colonoscopy frames for polyp detection. CVC-ClinicDB contains 612 still images from 29 different sequences. Each image has an associated manually annotated ground truth covering the polyp.

Chest Xray
The Chest Xray Masks and Labels dataset provides X-ray images and corresponding masks. The X-rays were obtained from the Montgomery County Department of Health and Human Services Tuberculosis Control Program, Montgomery County, Maryland, USA. All images have been de-identified and are presented in DICOM format. The set contains a variety of abnormalities, including effusions and miliary patterns. It comprises 138 posterior-anterior radiographs, of which 80 are normal and 58 show abnormal manifestations of tuberculosis.

Kvasir SEG
Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated and verified by an experienced gastroenterologist. It contains 1000 polyp images and their corresponding ground truth. The resolution of the images varies from 332x487 to 1920x1072 pixels, and the file format is JPG.

Kvasir-Instrument
Kvasir-Instrument is a gastrointestinal instrument dataset. It contains 590 annotated endoscopic tool images and their ground-truth masks, comprising GI procedure tools such as snares, balloons, and biopsy forceps. The image resolution varies from 720x576 to 1280x1024, and the file format is JPG.

2018ISIC-Task
The dataset used in the 2018 ISIC Challenge addresses the challenges of skin diseases. It comprises a total of 2512 images in JPG format. The lesion images were obtained using various dermatoscopic techniques from different anatomical sites (excluding mucous membranes and nails) and are sourced from historical samples of patients undergoing skin cancer screening at multiple institutions. Each lesion image contains only a single primary lesion.

Baselines
To situate our work in the field of medical image segmentation, we benchmark our proposed model against an array of highly regarded baselines, including U-net, UNet++, DA-Unet, Attention U-net, and TransUNet. U-net has been a foundational model in biomedical image segmentation [1]. UNet++ brings added sophistication with its intermediate layers [3]. DA-Unet goes a step further by integrating dual attention blocks, amplifying the richness of the extracted features [36]. Attention U-net employs an attention mechanism for improved feature-map weighting [18], and TransUNet deploys a transformer architecture, setting a new bar in segmentation precision [7]. Through this comprehensive comparison with these eminent baselines, we aim to highlight the unique strengths and broad applicability of our proposed model. Additionally, we benchmarked our model against advanced state-of-the-art algorithms. UCTransNet allocates skip connections through an attention module in the traditional U-net model [37]. TransNorm integrates the Transformer module into the encoder and skip connections of standard U-Net [38]. A novel Transformer module was designed and used to build a model named MIM [39]. By extensively comparing our model with current state-of-the-art solutions, we intend to showcase its superior segmentation performance.

Implementation Details
We implemented DA-TransUNet using the PyTorch framework and trained it on a single NVIDIA RTX 3090 GPU [41].
The model was trained with an image resolution of 256x256 and a patch size of 16. We employed the Adam optimizer, configured with a learning rate of 1e-3, momentum of 0.9, and weight decay of 1e-4. All models were trained for 500 epochs unless stated otherwise; to ensure metric convergence despite differing dataset sizes, we trained for 50 epochs on the Chest Xray Masks and Labels and ISIC2018-Task datasets.
During the training phase on five datasets, including CVC-ClinicDB, the proposed DA-TransUNet model is trained in an end-to-end manner. Its objective function combines a weighted binary cross-entropy (BCE) loss and a Dice coefficient loss. The final loss function, termed "Loss," is formulated as follows:

Loss = 0.5 × L_BCE + 0.5 × L_Dice

To ensure a fair evaluation on the Synapse dataset, we utilized the pre-trained model "R50-ViT" with input resolution and patch size set to 224x224 and 16, respectively. We trained the model using the SGD optimizer, with the learning rate set to 0.01, momentum of 0.9, and weight decay of 1e-4. The default batch size was 24. The loss function employed for the Synapse dataset is defined as follows:

Loss = 0.5 × L_CE + 0.5 × L_Dice

This loss function balances the contributions of the cross-entropy and Dice losses, ensuring impartial evaluation during testing on the Synapse dataset.
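A minimal sketch of such a combined BCE-plus-Dice objective for binary masks; the equal 0.5/0.5 weighting is an assumption, and the implementation is illustrative rather than the authors' training code.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss computed on sigmoid probabilities (binary segmentation)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  w_bce: float = 0.5, w_dice: float = 0.5) -> torch.Tensor:
    """Weighted sum of BCE and Dice losses; the 0.5/0.5 weights are an assumption."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return w_bce * bce + w_dice * dice_loss(logits, target)
```

The BCE term drives per-pixel calibration, while the Dice term directly optimizes region overlap, which is helpful when the foreground is small relative to the background.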
When using the datasets, we split them in a 3:1 ratio, with 75% for training and 25% for testing, to ensure adequate training data.
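The 75/25 split can be sketched as a simple shuffled index partition. This is an illustrative helper, not the authors' data pipeline; the function name and fixed seed are assumptions.

```python
import numpy as np

def split_indices(n_samples: int, test_fraction: float = 0.25, seed: int = 0):
    """Shuffle sample indices and split them 75/25 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)
```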

Model Evaluation
In evaluating the performance of DA-TransUNet, we utilize a comprehensive set of metrics including Intersection over Union (IoU), Dice Coefficient (DSC), and Hausdorff Distance (HD). These metrics are industry standards in computer vision and medical image segmentation, providing a multifaceted assessment of the model's accuracy, precision, and robustness.
IoU (Intersection over Union) is one of the most commonly used metrics for evaluating computer vision tasks such as object detection, image segmentation, and instance segmentation. It measures the degree of overlap between the model's predicted region and the actual target region, which helps us understand the accuracy and precision of the model. In object detection tasks, IoU is usually used to determine the overlap between the predicted bounding box and the ground-truth bounding box. In image segmentation and instance segmentation tasks, IoU evaluates the overlap between the predicted region and the ground-truth segmentation region.

IoU = TP / (TP + FP + FN)
The Dice coefficient (also known as the Sørensen-Dice coefficient or F1-score; DSC) measures model performance in image segmentation tasks and is particularly useful for class-imbalance problems. It quantifies the overlap between the predicted and ground-truth segmentations and is especially effective for segmenting objects with unclear boundaries. The Dice coefficient is commonly used to measure a model's accuracy on the target region, and is particularly suitable for relatively small or uneven target regions.
Hausdorff Distance (HD) is a distance measure of the similarity between two sets and is commonly used to evaluate models in image segmentation tasks. It is particularly useful in medical image segmentation for quantifying the difference between predicted and true segmentations. The Hausdorff distance captures the maximum deviation between the true and predicted segmentation results, making it particularly suitable for evaluating segmentation performance in boundary regions.
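The three metrics can be sketched in NumPy as follows. The function names are illustrative; the Hausdorff distance shown is the symmetric variant computed over point coordinates (e.g., boundary pixels), which is one common convention.

```python
import numpy as np

def iou_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU = TP / (TP + FP + FN), computed on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC = 2*TP / (2*TP + FP + FN), computed on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total else 1.0

def hausdorff(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two (n, 2) point sets."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Note that IoU and DSC are monotonically related (DSC = 2*IoU / (1 + IoU)), so they rank models identically, while HD captures boundary error that overlap metrics miss.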
We evaluate using both Dice and HD in the Synapse dataset and both Dice and IOU in other datasets.
To demonstrate the superiority of the DA-TransUNet model proposed in this paper, we conducted the main experiments on the Synapse dataset and compared it with 11 state-of-the-art (SOTA) models (see Table 1). The segmentation score for the pancreas is notably higher, by 5.73%. In a comparative evaluation, DA-TransUNet demonstrates superior segmentation capabilities relative to TransUNet on six distinct organs. Nevertheless, it exhibits a marginal decrement in segmentation accuracy for the aorta and left kidney, by 0.69% and 0.17%, respectively. The model achieves the best segmentation scores for the right kidney, liver, pancreas, and stomach, indicating superior feature learning on these organs.
To further confirm that our model segments better than TransUNet, we visualized the segmentation maps of TransUNet and DA-TransUNet (see Figure 6). From the yellow and purple parts in the first column, our segmentation is clearly better than TransUNet's; in the second column, the purple region extends more accurately than in TransUNet, and there is no gap in the blue part; in the third column, there is a semicircle in the yellow part, and the gap in red is smaller than in TransUNet. It is evident that DA-TransUNet outperforms TransUNet in segmentation quality. In summary, DA-TransUNet significantly surpasses TransUNet in segmenting the left kidney, right kidney, spleen, stomach, and pancreas, and it also offers superior visualization performance in image segmentation.
We also evaluated DA-TransUNet on five datasets, CVC-ClinicDB, Chest Xray Masks and Labels, ISIC2018-Task, Kvasir-Instrument, and Kvasir-SEG, and compared it with several classical models (see Table 2). We further show the image segmentation visualization results of DA-TransUNet on these five datasets, alongside those of the comparison models. The visualization results for the Chest Xray Masks and Labels, Kvasir-SEG, Kvasir-Instrument, ISIC2018-Task, and CVC-ClinicDB datasets are presented in Figure 7, Figure 8, Figure 9, Figure 10, and Figure 11, respectively. As the figures show, DA-TransUNet achieves good segmentation performance. First, DA-TransUNet produces better segmentation results than TransUNet. In addition, compared with the four classical models U-net, Unet++, Attn-Unet, and Res-Unet, DA-TransUNet shows a clear improvement.
The effectiveness of DA-TransUNet is thus confirmed not only on the Synapse dataset but also on the five additional datasets (CVC-ClinicDB, Chest Xray Masks and Labels, ISIC2018-Task, kvasir-instrument, kvasir-seg). We further establish that DA-TransUNet excels in both 3D and 2D medical image segmentation.
Adding DA-Blocks to the skip connections raised the DSC from 77.48% to 78.28%, while the HD dropped from 31.69 mm to 29.09 mm. This indicates that the DA-Blocks at each skip-connection layer provide the decoder with more refined features, mitigating feature loss during upsampling and thereby reducing the risk of overfitting and enhancing model stability. Furthermore, incorporating DA-Blocks into the encoder before the Transformer yielded a significant enhancement, with the DSC increasing from the 77.48% baseline to 78.87% while the HD metric also improved, decreasing from 31.69 mm to 27.71 mm. In conclusion, based on the findings presented in Table 3, the inclusion of DA-Blocks both before the Transformer layer and within the skip connections effectively boosts medical image segmentation capabilities.

Effect of adding DA-Blocks to skip connections in different layers
Building on the quantitative results in Table 4, we experimented with various configurations of DA-Block placement across the three layers of skip connections to identify the optimal architectural layout. Specifically, when DA-Blocks were added to just the first layer, the DSC improved from the 78.87% baseline to 79.36%, and the HD decreased from 27.71 mm to 25.80 mm. Adding DA-Blocks to the second and third layers showed similar improvements, but the most significant enhancement was observed when DA-Blocks were integrated across all layers, yielding a DSC of 79.80% and an HD of 23.48 mm. In contrast to traditional architectures, where skip connections indiscriminately pass features from the encoder to the decoder, our approach with DA-Blocks selectively improves feature quality at each layer. As Table 4 shows, introducing DA-Blocks to even a single layer enhances performance, and the greatest gains are observed when they are applied across all layers. This confirms that layer-wise inclusion of DA-Blocks in skip connections is an effective strategy for enhancing both feature extraction and medical image segmentation.

Discussion
In this study, we observed promising outcomes from the integration of DA-Blocks with the Transformer and their combination with skip connections. Encouraging results were consistently achieved across all six experimental datasets.
To start with, the empirical results in Table 3 demonstrate that integrating the DA-Block within the encoder significantly enhances its feature extraction capability as well as its segmentation performance. In computer vision, the Vision Transformer (ViT) has been lauded for its robust global feature extraction capabilities [6]. However, it falls short in specialized tasks like medical image segmentation, where attention to image-specific features is crucial. To remedy this, in DA-TransUNet we strategically place DA-Blocks ahead of the Transformer module. These DA-Blocks first extract and filter image-specific features, such as spatial positions and channel attributes; the refined features are then fed into the Transformer for global feature extraction. This approach yields significantly improved feature learning and segmentation performance. In summary, placing DA-Blocks before the Transformer layer is a pioneering approach that elevates both feature extraction efficacy and medical image segmentation precision.
Moreover, building on the empirical data in Table 4, our integration of DA-Blocks with skip connections significantly improves semantic continuity and the decoder's ability to reconstruct accurate feature maps. While traditional U-Net architectures [1] use skip connections to bridge the semantic gap between encoder and decoder, incorporating Dual Attention Blocks within the skip-connection layers yields promising results: the DA-Blocks focus on relevant features and filter out extraneous information, making image reconstruction more efficient and accurate. In summary, the strategic inclusion of DA-Blocks in skip connections not only enhances feature extraction but also improves the model's performance in medical image segmentation.
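One decoder stage under this design can be sketched as follows. This is a minimal NumPy illustration only: `upsample2x`, `decode_step`, and the identity `da_block` passed in the usage are hypothetical stand-ins for the model's learned convolutional and attention modules, which are omitted here.

```python
import numpy as np

def upsample2x(feat):
    # Nearest-neighbour upsampling: (C, H, W) -> (C, 2H, 2W).
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def decode_step(dec_feat, skip_feat, da_block):
    # Refine the skip-connection features with a DA-Block before fusion,
    # so that irrelevant information is filtered out, then fuse the
    # upsampled decoder features with the refined skip by concatenation.
    refined = da_block(skip_feat)
    return np.concatenate([upsample2x(dec_feat), refined], axis=0)

# Toy usage: a (4, 4, 4) decoder map fused with a (2, 8, 8) skip map
# (identity in place of the real DA-Block) gives a (6, 8, 8) output.
dec = np.zeros((4, 4, 4))
skip = np.ones((2, 8, 8))
fused = decode_step(dec, skip, lambda f: f)
```

The point of the sketch is the ordering: attention-based refinement is applied to the skip features *before* they are concatenated, in contrast to plain U-Net skips that pass encoder features through unmodified.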
Despite these advantages, our model has some limitations. First, the introduction of the DA-Blocks increases computational complexity, which could hinder real-time or resource-constrained applications. Second, the decoder retains the original U-Net architecture. While this design choice preserves some of U-Net's advantages, it also means the decoder has not been specifically optimized for our application, leaving room for further research and improvement, particularly in the decoder section of the architecture.

Conclusion
In this paper, we proposed a novel approach to image segmentation by integrating DA-Blocks with the Transformer in the TransUNet architecture. The DA-Blocks, which focus on image-specific position and channel features, were further integrated into the skip connections to enhance the model's performance. Our experimental results, validated by an extensive ablation study, showed significant improvements across various datasets, particularly the Synapse dataset.
Our research revealed the potential of the DA-Block to enhance the Transformer's feature extraction capability and global information retention. The integration of DA-Block and Transformer substantially improved the model's performance without creating redundancy. Furthermore, introducing DA-Blocks into skip connections not only effectively bridges the semantic gap between the encoder and decoder, but also refines the feature maps, leading to enhanced image segmentation performance.
This study paves the way for further use of the DA-Block in image segmentation. Future work may focus on optimizing the decoder part of our architecture and on reducing the computational complexity introduced by DA-Blocks without compromising the model's performance. We believe our approach can inspire future research in medical image segmentation and beyond.

Figure 1 :
Figure 1: Illustration of the proposed dual attention Transformer U-Net (DA-TransUNet). The input medical images are fed into an encoder with Transformer and Dual Attention Block (DA-Block). The features at each of the three different scales are then purified by a DA-Block. Finally, the purified skip connections are fused with the decoder, which applies CNN-based upsampling to restore the features to the same resolution as the input image, producing the final prediction.

Figure 2 :
Figure 2: The proposed Dual Attention Block (DA-Block). The same input feature map is fed into two feature extraction branches, a position feature extraction block and a channel feature extraction block, and the two resulting features are fused to obtain the final DA-Block output.
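The two-branch structure in Figure 2 can be sketched as follows. This is a simplified NumPy illustration of the attention computation only; the learned query/key/value convolutions, learnable scale parameters, and post-fusion convolutions of the full DA-Block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat):
    # feat: (C, H, W). Self-attention over the N = H*W spatial positions.
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)           # (C, N)
    energy = flat.T @ flat                  # (N, N) affinity between positions
    attn = softmax(energy, axis=-1)
    return (flat @ attn.T).reshape(C, H, W)

def channel_attention(feat):
    # feat: (C, H, W). Self-attention over the C channels.
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)           # (C, N)
    energy = flat @ flat.T                  # (C, C) affinity between channels
    attn = softmax(energy, axis=-1)
    return (attn @ flat).reshape(C, H, W)

def da_block(feat):
    # Fuse the two branches by element-wise sum, keeping a residual path
    # so the intrinsic characteristics of the input map are preserved.
    return feat + position_attention(feat) + channel_attention(feat)
```

The sum fusion mirrors the figure: both branches see the same input, and the output combines position-refined and channel-refined features at the input's original shape.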

3.3 Encoder with Transformer and Dual Attention
As illustrated in Figure 1, the encoder consists of four key components: convolution blocks, a DA-Block, an embedding layer, and Transformer layers. Of particular significance is the placement of the DA-Block before the Transformer layers: it performs specialized processing on the post-convolution features, enhancing the Transformer's feature extraction for image content. While the Transformer preserves global context, the DA-Block strengthens its ability to capture image-specific features, effectively combining global features with image-specific spatial and channel characteristics. The first component comprises the three convolutional blocks of the U-Net architecture and its variants, integrating convolutional operations with downsampling; each convolutional block halves the spatial size of the input feature map and doubles its channel dimension, a configuration empirically found to balance feature expressiveness with computational efficiency. The second component, the DA-Block, extracts features at both the positional and channel levels, deepening the feature representation while preserving the intrinsic characteristics of the input map. The third component, the embedding layer, adapts the feature dimensions for the subsequent Transformer layers. The fourth component, the Transformer layers, performs global feature extraction beyond the reach of traditional CNNs. Together, these parts work as follows: the input image traverses three consecutive convolutional blocks, progressively expanding the receptive field; the DA-Block then refines the features through position-based and channel-based attention; the refined features are reshaped by the embedding layer and passed to the Transformer for global feature extraction; finally, the feature map produced by the Transformer is reshaped and passed on to the decoder.
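The four-stage encoder flow described above can be sketched schematically. This is a toy NumPy sketch: `conv_downsample` is a hypothetical stand-in for a learned U-Net convolution block (here a 2x2 average pool plus channel duplication, just to reproduce the halve-resolution/double-channels shape behaviour), and the `da_block` and `transformer` callables are placeholders for the real learned modules.

```python
import numpy as np

def conv_downsample(feat):
    # Stand-in for a U-Net conv block: halve the spatial size and
    # double the channels (2x2 average pool, then channel duplication).
    C, H, W = feat.shape
    pooled = feat.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
    return np.concatenate([pooled, pooled], axis=0)   # (2C, H/2, W/2)

def encoder(image, da_block, transformer):
    # image: (C, H, W). Returns the Transformer output along with the
    # three intermediate feature maps used by the skip connections.
    skips, feat = [], image
    for _ in range(3):                     # three convolutional blocks
        feat = conv_downsample(feat)
        skips.append(feat)
    feat = da_block(feat)                  # position/channel refinement
    tokens = feat.reshape(feat.shape[0], -1).T   # embedding: (N, C) tokens
    return transformer(tokens), skips

# Toy usage with identity stand-ins for the DA-Block and Transformer.
img = np.ones((1, 16, 16))
out, skips = encoder(img, lambda f: f, lambda t: t)
```

Note the ordering the sketch encodes: the DA-Block sits after the last convolution and before tokenization, so the Transformer receives features already refined by position and channel attention.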

Figure 5 :
Figure 5: Line chart of DSC and HD values of several advanced models on the Synapse dataset

Figure 6 :
Figure 6: Visual comparison of segmentation results between TransUNet and DA-TransUNet on the Synapse dataset.

Figure 7 :
Figure 7: Comparison of qualitative results between DA-TransUNet and existing models on the task of segmenting the Chest X-ray Masks and Labels dataset.

Figure 8 :
Figure 8: Comparison of qualitative results between DA-TransUNet and existing models on the task of segmenting the Kvasir-Seg dataset.

Figure 9 :
Figure 9: Comparison of qualitative results between DA-TransUNet and existing models on the task of segmenting the Kvasir-Instrument dataset.

Figure 10 :
Figure 10: Comparison of qualitative results between DA-TransUNet and existing models on the task of segmenting the ISIC2018-Task dataset.

Figure 11 :
Figure 11: Comparison of qualitative results between DA-TransUNet and existing models on the task of segmenting the CVC-ClinicDB dataset.

Table 1 :
Experimental results on the Synapse dataset

Table 2 :
Experimental results on five datasets (CVC-ClinicDB, Chest Xray Masks and Labels, ISIC2018-Task, kvasir-instrument, kvasir-seg). As shown in the table, the IoU and Dice values of DA-TransUNet are higher than those of TransUNet on all five datasets. DA-TransUNet also achieves the best segmentation on four of the five datasets. These results indicate that DA-TransUNet has stronger feature learning and image segmentation capabilities.