MFA-UNet: a vessel segmentation method based on multi-scale feature fusion and attention module

Introduction The accurate segmentation of retinal vessels is of utmost importance in the diagnosis of retinal diseases. However, the complex vessel structure often leads to poor segmentation performance, particularly in the case of microvessels. Methods To address this issue, we propose a vessel segmentation method composed of preprocessing and a multi-scale feature attention network (MFA-UNet). The preprocessing stage involves the application of gamma correction and contrast-limited adaptive histogram equalization to enhance image intensity and vessel contrast. The MFA-UNet incorporates the Multi-scale Fusion Self-Attention Module(MSAM) that adjusts multi-scale features and establishes global dependencies, enabling the network to better preserve microvascular structures. Furthermore, the multi-branch decoding module based on deep supervision (MBDM) replaces the original output layer to achieve targeted segmentation of macrovessels and microvessels. Additionally, a parallel attention mechanism is embedded into the decoder to better exploit multi-scale features in skip paths. Results The proposed MFA-UNet yields competitive performance, with dice scores of 82.79/83.51/84.17/78.60/81.75/84.04 and accuracies of 95.71/96.4/96.71/96.81/96.32/97.10 on the DRIVE, STARE, CHASEDB1, HRF, IOSTAR and FIVES datasets, respectively. Discussion It is expected to provide reliable segmentation results in clinical diagnosis.


Introduction
Fundus diseases may cause vision loss, visual field defect, and serious cases may lead to blindness.Common fundus diseases, such as macular degeneration, hypertensive retinopathy, and diabetic retinopathy (Lin et al., 2021;Badawi et al., 2022;Saranya et al., 2022), are characterized by morphological changes in the retinal vasculature during advanced stages, including optic disc atrophy, vascular proliferation, and macular hole formation.Currently, medical devices are being updated and replaced gradually.Fundus cameras provide an intuitive and easy way to detect and observe the eyes, enabling timely identification of certain fundus diseases that may not have been clearly characterized.They serve as the primary diagnostic tool for ophthalmologists and are widely utilized.However, the manual segmentation of retinal vessels in fundus images (Figure 1) by ophthalmologists is time-consuming and susceptible to subjective errors due to the low contrast and complex structure of the vessels (Liu et al., 2022a).The development of an automated algorithm for vessel segmentation shows great potential in enhancing the capabilities of fundus cameras, reducing the diagnostic burden on physicians, and improving diagnostic efficiency.
Although existing methods have achieved excellent segmentation performance and generalization ability, there are still unresolved issues.For example, if vessels with a width of less than three pixels are defined as microvessels and the rest are defined as macrovessels, nearly 70% of the pixels in the image can be classified as macrovessels (Yan et al., 2019).Such class imbalance makes existing methods unable to segment complete microvascular structures or even ignore the existence of microvessels, while microvessel segmentation plays a crucial role in the diagnosis of retinal diseases related to vascular proliferation.Moreover, due to the limited availability of publicly available data, deep neural networks may tend to overfit the training data, resulting in poorer generalization ability (Su et al., 2022).
In this work, we propose a retinal vessel segmentation network (MFA-UNet) based on multi-scale feature fusion and attention mechanism.The primary goal is to improve the segmentation performance of vessels under low contrast and preserve the complete microvascular structure.To alleviate the performance bottleneck caused by data scarcity, an image patch-based training method is applied for data augmentation to expand the training samples.MFA-UNet uses UNet as the backbone network and proposes three modules to improve the accuracy and sensitivity of vessel segmentation.First, the multi-scale fusion self-attention module (MSAM) is used to fuse multi-scale features in the skip path and build long-range dependency relationships of the fused features through a self-attention mechanism targeting spatial and channel dimensions, thus improving the ability to preserve complete vascular structures.In addition, different branches in the multi-scale decoding module achieve segmentation of blood vessels with different widths through deep supervision, and integrate and refine these different segmentation results.Finally, considering that the features in the skip paths may cause information redundancy in the decoder, we introduce a Parallel Attention Mechanism (PAM) to filter redundant information in the decoder.The combination of these three modules improves the generalization ability of MFA-UNet and its ability to capture fine microvascular features.
The workflow of the proposed retinal vessel segmentation system is illustrated in Figure 2. First, preprocessing techniques are used to improve the visibility of the vessel structure in the images.Subsequently, the training data volume is increased by dividing the image into patches through cropping.These cropped patches are used to train the model and generate the segmentation mask.Finally, the segmented masks are merged to reconstruct the complete retinal vessel segmentation map.The contributions of this paper are summarized as follows: • In order to address the difficulty of retinal vessel segmentation,we propose a U-Net network structure based on a multi-branch decoder module (MBDM) and a multi-scale feature fusion self-attention module (MSAM) to enhance the effectiveness of retinal vessel segmentation.• We design a novel channel self-attention mechanism and combine it with a spatial self-attention mechanism to adapt multi-scale features and efficiently learn the dependencies of channel and spatial dimensions.The rest of the paper is structured as follows: Section 2 provides an overview of related work.Section 3 outlines the structure of the proposed MA-UNet model and the techniques used.Section 4 presents the experimental results, demonstrating the effectiveness of the proposed PAM, MASM, and MBDM models.Section 5 provides a discussion and analysis of the experimental results.Finally, in section 6, we draw our conclusions.

Related works
Retinal vessel segmentation algorithms include unsupervised and supervised methods based on the learning approach, but these methods often fail to achieve high accuracy in microvessel segmentation.Many researchers have incorporated deep supervision techniques and attention mechanisms into supervised methods to improve segmentation results.This section provides a comprehensive review of the relevant literature, covering four aspects: (1) unsupervised methods, (2) supervised methods, (3) attention mechanisms, and (4) deep supervision techniques.

. Unsupervised methods
The automatic retinal vessel segmentation algorithm comprises both supervised and unsupervised learning algorithms (Yan et al., 2018).Unsupervised methods utilize inherent vessel characteristics, such as curvature, width, color, etc., as supplementary information for vessel segmentation.Kande et al. (2010) proposed a thresholdbased method for retinal vessel segmentation.They utilized spatially weighted fuzzy C-mean clustering and enhanced intensity using data from the green and red channels of the image.Additionally, they improved the contrast between vessels .Supervised methods Supervised learning methods, including support vector machines, Gaussian models, logistic regression, and artificial neural networks, utilize manually labeled segmentation masks as the optimization target, leading to superior accuracy compared to traditional methods (Li et al., 2016).Ricci and Perfetti (2007) employed image vectorization based on pixel grayscale values to train a Support Vector Machine, resulting in improved accuracy in automatic vessel segmentation.Palanivel et al. (2020) introduced a retinal segmentation method that utilizes the Holder index and Gaussian mixture model classifier.Initially, the Holder index was employed to quantify vessel features, followed by vessel segmentation using the Gaussian mixture model.These methods heavily rely on manually extracted features to achieve exceptional performance but often exhibit insensitivity toward microvessels.
Deep learning-based methods have demonstrated superior accuracy in vessel segmentation compared to machine learning approaches.For instance, Skip FCN (Liu et al., 2019) enhances the segmentation accuracy of microvessels by incorporating skip connections.Samuel and Veeramalai (2021) introduced a vessel segmentation method based on VGG16 Simonyan and Zisserman (2014) utilizing the VSC module to extract and transfer features from the region of interest (ROI).However, the continuous downsampling operations employed in these methods can lead to the loss of image details and pose a bottleneck to segmentation accuracy.Kar et al. (2023) employed an adversarial generative network for vessel segmentation.The generator synthesizes the vessel mask, while the discriminator distinguishes between the synthesized and real masks, thereby optimizing the segmentation performance of the generator.
Moreover, some researchers have focused on the utilization of multiscale features in the segmentation process to recover the vessel structure.For example, CcNet Feng et al. (2020) had introduced cross-connections to allow the decoder for learning with multiscale features in the decoding process, thus improving the accuracy of microvessel segmentation.Zhao et al. (2021) proposed a multiscale upsampling attention module and incorporated it into the U-Net model for cross-scale information transfer.Guo et al. (2019) had designed a multi-scale deep supervision network BRS-DSN with a short connection and enabled to utilize of multi-scale features in the output layer.Deng and Ye (2022) substituted the convolution kernel in the UNet structure with deformable convolution, constructed several skip links, and introduced a channel attention mechanism to extract the multi-angle feature information of the image.Inspired by these works, we proposed a module (MSAM) that integrates multi-scale features and constructs the global correlation.

. Attention mechanism
The neural network takes in an entire image as input and learns semantic information through the convolutional kernel, which consists of both relevant and redundant features (Pang et al., 2021).However, redundant information can negatively impact the performance of network in the given task.To address this, the attention mechanism is introduced to selectively emphasize crucial areas in images and filter out redundant information.The attention mechanism can be categorized into local attention and non-local attention.Local attention comprises the channel attention mechanism, spatial attention mechanism, and other mechanisms that facilitate feature selection based on local information.For instance, Guo et al. (2020) introduced the SA-Unet model, which incorporates a spatial attention mechanism to enhance the preservation of microvessel structures and achieve higher accuracy.Non-local attention generally refers to an attention mechanism that captures dependencies between distant features through global information statistics, with the self-attention mechanism being particularly notable.The selfattention mechanism explores correlations within an image to effectively preserve the structure of vessel branches and the vessel tree.Several studies have demonstrated this, including Pang et al. (2021), who employed the self-attention mechanism to adjust multi-scale features and facilitate the restoration of microvessel structures during the decoding process.Similarly, Shen et al. (2022) substituted the convolutional layer with the selfattention mechanism in UNet, leading to outstanding segmentation results and enhanced feature extraction efficiency.Liu et al. (2023)introduced an attentional fusion block to enhance a skip connection, thus improving multiscale feature representation and enabling precise retinal vessel detection.Ouyang et al. (2023) addressed the issue of losing fine-vessel features by employing a local feature enhancement module and attention block to expand the receptive field and optimize the capturing of microvessel features.
In this paper, we introduce the self-attention mechanism to construct correlation between vessel branches and vessel trees.Considering the correlation between channel dimensions, which is ignored in the above methods, we propose a channel self-attention mechanism to capture non-local information of channel dimensions.Additionally, we introduce a parallel attention mechanism into the decoder to effectively utilize the features from the skip path.

. Deep supervision
Deep supervision (DS) technology (Lee et al., 2014;Szegedy et al., 2015) was developed to address issues such as gradient vanishing and slow convergence in neural network training.Its fundamental concept involves incorporating auxiliary classifiers into intermediate layers of the model to leverage shallow features in downstream tasks and facilitate the seamless flow of gradients into the deep layers of the network (Zhou et al., 2022).The model generated by the auxiliary classifier can be considered a sub-model of the original model.Through the introduction of additional loss functions, each sub-model can be optimized and specialized for different tasks, thereby enhancing their performance and reducing bias in deep features related to the task.DS techniques have also been utilized in methods presented by Mo and Zhang (2017), He et al. (2021), andCao et al. (2023) to incorporate multi-scale information and improve the performance and robustness of models in target region segmentation tasks.However, the current application of DS techniques primarily focuses on low-resolution boundary segmentation of target regions, which can be influenced by features in the skip path.
Inspired by these ideas, in order to improve the segmentation accuracy of MFA-UNet on tiny vessels, we introduce MBDM at the end of the decoder.This is done to prevent interference from other features and perform three sub-tasks: macrovessel segmentation, microvessel segmentation, and vessel structure segmentation.We gradually improve the complexity of sub-tasks and integrate subtargets with different focuses to ensure the integrity of vessel structure in segmentation results.

Methodology . Overview
The proposed model utilizes U-Net as the backbone model.Due to the limited number of pixels in microvessels, the segmentation of vessel structure becomes inefficient.Additionally, the raw semantic information in the images may not be readily available in the deeper layers of the model.Therefore, a specialized model design is required for the vessel segmentation task.
The structure of the MFA-UNet is depicted in Figure 3.The network consists of an encoder and a decoder, where the encoder learns from input image patches.The features at different scales in the encoder are fused and sent to MSAM via skip paths.MSAM applies a self-attention mechanism to establish longrange dependencies and generates a weight matrix to adjust the importance of pixels and channels in the fusion features.These fusion features are then forwarded to the decoder for detailed reconstruction of the segmentation mask.During this process, a parallel attention mechanism is applied to filter out redundant features.Finally, MBDM uses different branches to generate macrovessels and microvessels masks from the output features of the decoder, while the fusion branch synthesizes the macrovessel and microvessel structures to obtain the final segmentation mask.It is important to note that in MBDM, all branches are supervised by the same ground truth.

. Multi-scale fusion self-attention module
In the microvessel segmentation task, it is common to use features of different scales to restore the edges of the target area and mitigate the information loss caused by downsampling.However, direct fusion of multi-scale features without any adjustment can introduce redundant feature information and affect segmentation performance.To effectively utilize multi-scale information, we propose a multi-scale feature fusion self-attention module (MSAM), which consists of three components: Channel Self-Attention (CSA), Spatial Self-Attention (SSA), and Multi-Layer Perceptron (MLP).The structure of the MSAM is illustrated in Figure 4.
Firstly, the three feature maps, denoted as F 1 , F 2 , F 3 , are reshaped to the same size as F 2 through upsampling or downsampling.Since feature maps with a 64-channel structure contain less semantic information, only feature maps with 128, 256, and 512 channels are then considered for subsequent operations.Subsequently, these feature maps are combined to yield the fusion feature F a .The fusion feature F a is encoded into key and value through a 1 × 1 convolution and linear layer.Furthermore, we establish a set of trainable channel embeddings represented by ch, which are encoded and transformed into queries using linear layer.In the CSA layer, the correlations between the channel dimensions of the fusion feature F a are computed using the generated query and key.This process activates the channel weight map M c through a softmax operation.Value are then element-wise multiplied with M c to obtain the adjusted feature map F ch .The operations within the CSA layer are described by Equation 1.
Here, W K and W V denote the weights of the linear layers used to encode the token feature z, while W Q represents the weights of the linear layer used to encode the channel embeddings, R D×D denoted a matrix of size D × D.
The SSA layer consists of a self-attention mechanism that performs linear operations by dot product to mine the information in the image and generate a weight matrix of spatial dimensions, adjusting the importance of each pixel.Finally, the MLP layers enable the transfer of feature information across channels by using combinations of information in channel dimensions.Selfattention mechanisms can establish dependencies between the vascular tree and the pixels of the vascular branches.MSAM integrates feature information at different scales and applies selfattention mechanisms in the spatial and channel dimensions of feature maps to establish long-range correlations between pixels, thereby improving the effect of vessel segmentation.The specific operations performed in MSAM are shown in the Algorithm 1. ./fnins. .

FIGURE
The proposed MFA-UNet model architecture.

FIGURE
The architecture of MSAM module.It samples and sums the three scales of features to obtain the fusion features, followed by CSA and SSA processing to construct the long-range dependencies of the channel and spatial dimensions, finally integrates the features by using MLP to obtain the refined features, and recovers it to the three scales of the input features by upsampling or downsampling and transfer them to the skip path. .

Multi-branch decoder module
In fundus images, nearly 77% of the vessel pixels belong to macrovessels, while only 23% belong to microvessels.This causes an imbalance of vessel proportions in the images, which makes the single network structure less accurate in the segmentation of microvessels (Yan et al., 2019).To solve this imbalance problem, we applied deep supervision technology to the decoder.By adding multiple branches before the output layer, the vessel segmentation is divided into three stages: macrovessel segmentation, microvessel segmentation, and vessel fusion.We describe this structure as a multi-branch decoder module, and the structure is shown in Figure 5.
Branch 1 is optimized by the Binary Cross Entropy (BCE) function for pixel-level loss.The function compares the predicted map with the ground truth pixel by pixel, and the imbalance in vessel proportion causes the function to penalize macrovessels more, making this branch better for macrovessel segmentation.The optimization of branch 2 is implemented by dice loss (Milletari et al., 2016) for f class-level loss.The dice loss maximizes the intersection of the segmentation result and the ground truth as the optimization objective, which reduces the negative impact caused

FIGURE
The architecture of MBDM and PAM.In MBDM, Branches and receive the feature maps from MFA-UNet as input and are supervised with di erent loss functions to achieve the segmentation of both macro-and micro-vessels.The fusion branch combines the outputs of branches and and adjusts them to obtain the final prediction.In PAM, the input features are sent to CAM and SAM to adjust the weights of the channel and spatial dimensions, respectively.This adjustment helps filter out redundant information.The resulting features are then added together to achieve mutual complementation of feature information.
by the imbalance of vessel proportions and makes the branch better for preserving the microvessel structure.In order to recover the broken vascular structure, we aim to gather local information by utilizing a convolutional kernel with a larger receptive field.To achieve this, we introduce convolutional layers of 7 × 7 and 9 × 9 in branches 1 and 2. This allows for a broader scope of information aggregation, aiding in the restoration of the vascular structure.The module merges the outputs of branches 1 and 2 to obtain a segmentation map that considers both macro and microvessel segmentation.The segmentation map is then used as input to the fusion branch, and convolutional layers with kernel sizes of 3 × 3 and 1 × 1 are applied to adjust the segmentation map.The 3 × 3 convolution kernel is used to adjust the fusion features and restore them based on vessel locality, while the 1× 1 convolution kernel is used to adjust individual pixel values.The fusion branch needs to consider both macro and microvessel segmentation, so a mixed loss function of BCE and Dice loss is used for training.The BCE loss and the Dice loss are calculated as follows: where p n,c and y n,c are the target label and prediction probability of the cth category with the nth pixel in the batch, Y is the ground truth, P is the prediction result, C is the number of categories, and N is the total number of pixels in the dataset in the batch. .

Parallel attention module
In the structure of U-Net, the input of the convolutional layer of the decoder is the fusion feature after cascading the output of the previous layer with the encoder feature, which contains many useless information such as the optic disc feature.In this paper, a parallel attention module (PAM) is included in the decoder to adjust the spatial and channel dimensions of the feature maps respectively, therefore reducing the redundant information and allowing the network to focus on the segmentation of blood vessels.
In the Channel Attention Mechanism (CAM) (Woo et al., 2018), both a max-pooling and average-pooling operation are performed along the spatial dimension to extract different feature information, and two feature maps with size 1 × 1 are cascaded input to the MLP and sigmoid activation layer to obtain the channel dimension weight matrix.This operation avoids the influence of spatial dimensional information.The Spatial Attention Mechanism (SAM) (Woo et al., 2018) also applies maximum pooling and average pooling to the channel dimensions to avoid the influence of spatial information, to extract different spatial dimensions of information, and cascades the features through a 7 × 7 convolutional layer to obtain the spatial dimension dependency, and finally passes through a sigmoid activation layer to obtain the spatial dimension weight matrix.We parallel the CAM and SAM to complement the feature map with the different dimensions.The architecture of the attention module used in this paper is shown in Figure 5.
The formulas for the channel attention mechanism and the spatial attention mechanism are as follows: Frontiers in Neuroscience frontiersin.org where M denotes the operation performed by the MLP, F is the input feature map, W denotes the weight of the fully connected layer in the MLP, f 7×7 is the weight value of the convolution kernel of 7 × 7, and σ as the sigmoid activation function.M c (F) is the weight matrix generated by channel attention, and M s (F) is the weight matrix generated by spatial attention.
Where M represents the operation performed by the MLP, F denotes the input feature map, W signifies the weight of the fully connected layer in the MLP, f 7×7 represents the weight value of the 7 × 7 convolution kernel, and σ denotes the sigmoid activation function.M c (F) denotes the weight matrix generated by channel attention, while M s (F) represents the weight matrix generated by spatial attention.

Experiments . Datasets
In this paper, we evaluated the performance of the proposed method, the baseline U-Net, and other segmentation methods based on U-Net variants using six public datasets: DRIVE, STARE, CHASE-DB1, HRF, IOSTAR, and FIVES.We trained and tested MFA-UNet, U-Net, and other methods separately on four different fundus image datasets.The DRIVE dataset consists of 40 fundus images, with 20 images for training and 20 images for testing.The STARE dataset (Hoover et al., 2000) contains 20 fundus images, including some images with lesions.We randomly selected 10 images for training and 10 images for testing.The CHASE-DB1 dataset (Owen et al., 2009) consists of 28 images, 20 of which are used for training and the rest for testing.The HRF dataset (Odstrcilík et al., 2013) contains 15 normal fundus images, 15 fundus images with diabetic retinopathy, and 15 fundus images with glaucoma, of which 35 images were used for training and the remaining images for testing according to Shen et al. (2022).The OSTAR dataset (Zhang et al., 2016) comprises 30 retinal images, with 20 images for training and 10 images for testing.The FIVES dataset (Jin et al., 2022) encompasses retinal images depicting various conditions, including health, diabetic retinopathy, glaucoma, and age-related macular degeneration.Moreover, corresponding vascular segmentation masks are provided for each image.Each category contains a total of 200 images.The dataset is partitioned into 600 images for training and 200 for testing.Each of the six publicly available datasets is described in detail in Table 1.

. Preprocessing
During the imaging process of fundus cameras, various factors such as room lighting and physician expertise can lead to uneven illumination and low contrast in fundus images, which can hinder the segmentation of retinal vessels.Therefore, preprocessing techniques are considered essential to improve the quality of fundus images.First, the NTSC conversion method is used to convert the RGB image to a grayscale image since the green channel contains more significant information than the red and blue channels.The formula for this conversion is as follows: To address the problem of low contrast in the vessels, we apply the Contrast-Limited Adaptive Histogram Equalization (CLAHE) technique to the images.This method effectively enhances the vessel features by equalizing the grayscale histogram of a specific portion of the image.In addition, we use gamma correction to brighten the image with a gamma value of 1.2, making the darker areas of the image more visible.The raw and pre-processed images are shown in Figure 6.

. Data augmentation
Due to the high resolution of the images in each dataset, using the entire image as input is likely to place a significant burden on the hardware, and the availability of training data is limited.To solve this problem, we extend the training data using the overlap method.Briefly, we use a window with a size of 128 by 128 and a step size of 15 to perform patch cropping for each image and generate 25,000 patches with an overlap of 87.5% for adjoining image patches.We then used random cropping, horizontal and vertical mirror inversion, and random angle rotation for data enhancement to improve sample diversity.

. Implementation details
Our experiment is performed using the PyTorch framework.All models are trained and tested on an NVIDIA RTX 3080 with 24G of memory.The MFA-UNet model is optimized using the Adam optimizer with the parameters β 1 = 0.9; β 2 = 0.99.During both the training and testing phases, the batch size is set to 32, and the initial learning rate is set to 5 × 10 −4 .To facilitate the training phase, a learning rate decay strategy is implemented with a decay period of 50 epochs and a decay rate of 0.1.We designate 10% of the training data as the validation set to evaluate the performance of MFA-UNet.The models with the highest the area under the ROC curve (AUC) in the validation set are saved for model performance evaluation.

. Evaluation metrics
To binarize the segmentation probability map, we utilize a fixed threshold, where pixel values exceeding this threshold are set to 255, and while those below the threshold are set to 0. In the binarized segmentation maps, pixels that are correctly classified as vessels are denoted as true positive (TP), while the pixels that are incorrectly classified as part of vessels are designated as false positive (FP).Pixels that are accurately identified as non-vessel are marked as true negative (TN), and the pixels that are wrongly assigned as nonvessel are recorded as false negative (FN).To evaluate the proposed model, we employ several widely used metrics, including Accuracy (Acc), Dice Similarity Coefficient (DSC), Sensitivity (Sen), and Specificity (Sp).The formulas for each metric are presented below: where N = TP + TN + FP + FN.The receiver operating characteristic curve (ROC) is generated by constructing a coordinate system using the true positive rate (Se) and false positive rate (1-Sp).AUC is then calculated to evaluate the overall classification ability of the model.

Results
In this section, we perform a comparative analysis of the proposed model with other vessel segmentation methods, including U-Net, MS-NFN, CcNet, and NestU-Net, etc. to demonstrate the superiority of our proposed method over other existing techniques.The evaluation results of the proposed model and other methods on the DRIVE, STARE, CHASE DB1, and HRF datasets are presented in Table 2, where mSen and mSp represent the mean values of the sensitivity and specificity of the model on the binary classification task.It is important to note that in each experiment, MFA-UNet and U-Net are trained and tested separately on four different fundus image datasets, while the results of other methods are obtained from the cited references.
Among the employed evaluation metirics, mSen and DSC represent the degree of overlap between the segmentation prediction of the model and the ground truth label.Accuracy and AUC, on the other hand, serve to evaluate the overall performance of the model in pixel-level classification.Additionally, mSp indicates the segmentation accuracy of the model for nonvessel areas.The completeness of the vascular structure in the segmentation mask is crucial in clinical analysis.Therefore, in the performance comparison, we focus on the superiority and inferiority of MFA-UNet and other methods in the three comprehensive metrics of accuracy, DSC, and AUC.
The comparative results on the DRIVE dataset showcase the superior performance of MFA-UNet over other state-ofthe-art algorithms in terms of accuracy, DSC, and AUC, with improvements observed across all metrics when compared to UNet.The DRIVE dataset, with its larger number of test samples, serves as a more representative benchmark for evaluating segmentation algorithms in real-world scenarios.It is worth noting that the method proposed in Naveed et al. ( 2021) demonstrates higher sensitivity compared to our approach, possibly due to the postprocessing effect of the block matching 3D (BM3D) speckle filter, which specifically enhances sensitivity in microvessels.On the other hand, the method proposed by Barkana et al. ( 2017) exhibits excellent specificity.However, this approach combines predictions from multiple algorithms for vessel segmentation, resulting in the introduction of excessive model parameters and increased inference time.
In the STARE dataset, Nest U-Net, as proposed in Wang et al. (2021), outperforms MFA-UNet in terms of specificity and segmentation accuracy.We believe that Nest U-Net achieves this by utilizing high-resolution feature maps to produce more refined vessel segmentation results.On the other hand, the proposed MFA-UNet achieves higher sensitivity, AUC, and DSC than other models, indicating that MFA-UNet maintains the segmentation effect for macriovessels while preventing the loss of microvessel structures due to the MSAM.
In the CHASE-DB1 dataset, MFA-UNet demonstrates superior performance in all metrics, except for sensitivity, when compared to other current methods.The approach proposed in Lu et al. (2020) outperforms other methods specifically in terms of sensitivity.This method employs two models to separately segment macrovessels and microvessels, and combines the resulting segmentation maps to obtain the final results, allowing for a combined segmentation effect on both macrovessels and microvessels.In contrast, MFA-UNet utilizes the Multi-Branch and Dense Module (MBDM) to target different vessel segmentations and combine the results from multiple branches, resulting in more efficient memory usage The bolded value is the maximum value in the comparison.
compared to an ensemble model.Similarly, the method presented in Shen et al. (2022) surpasses other methods in sensitivity, AUC, and DSC on the HRF dataset by replacing the convolutional layer with a self-attention mechanism, enabling a global perceptual field.However, it should be noted that this method requires significantly higher memory resources compared to other approaches.
In the IOSTAR dataset, our model achieved the highest level of performance in the comparative analysis.The dataset exhibits relatively prominent blood vessels, and the MFA-UNet model successfully segments the vascular structures comprehensively, outperforming the UNet model by effectively restoring interrupted vessels.
In the FIVES dataset, our proposed model has also achieved optimal performance across all metrics.The FIVES dataset comprises images featuring various categories of fundus diseases, where the presence of biomarkers like leptomeningeal fundus, exudation, and hemorrhage can significantly influence vascular segmentation performance.As a result, the sensitivity of MFA-UNet only reaches 80.07%.However, it still excels in terms of DSC and AUC, highlighting the remarkable robustness of MFA-UNet when confronted with the diverse fundus images.
We observed a common issue among all segmentation models evaluated on this dataset: square-shaped convolutional kernels struggle to preserve the curvatures of blood vessels, particularly those with smaller curvatures.We believe that during the encoding process, the model's perception of blood vessels with larger curvatures is compromised due to their relatively low proportion in the overall structure.Additionally, the local feature extraction capabilities of the convolutional kernels fail to accurately segment distributed and highly curved vessel structures.
Figure 7 displays the segmentation visualization results of MFA-UNet on various datasets.The visual analysis demonstrates the exceptional performance of our method in accurately segmenting microvessel while maintaining their overall structural integrity.Notably, MFA-UNet exhibits precise delineation of macrovessels, highlighting its robust segmentation capability for such structures.To emphasize challenging segmentation regions, such as areas with significant width variations and low contrast in blood vessels, we have zoomed in and positioned these vessels adjacent to each image.
We have depicted the learning curve of MFA-UNet on the DRIVE dataset in Figure 8, aiming to observe the performance variations of MFA-UNet throughout the training process.The learning curve tends to plateau after the 80th epoch, indicating the gradual convergence of MFA-UNet on the training set.Moreover, MFA-UNet did not exhibit significant overfitting during the training process, as evidenced by the similarity between the learning curves of the validation set and the training set.This observation validates the effectiveness of patch-based data augmentation techniques.

. Ablative studies
Apart from conducting comparisons with aforementioned methods, we have conducted two sets of ablative experiments to analyze the influence of different combinations of modules and loss functions on the segmentation performance of MFA-UNet.These ablative experiments were exclusively carried out on the DRIVE dataset due to computational resource limitations.

. . E ect of various module combinations on segmentation performance
In the proposed MFA-UNet, we have incorporated PAM, MSAM, and MBDM to enhance the segmentation performance of the model.To validate the effectiveness of these modules, we conducted ablation experiments.The model that excludes all of the aforementioned modules is referred to as the basebone.
Table 3 presents the performance of the models with the addition of each module to the basebone.The backbone model, which incorporates residual connections, large kernel convolution downsampling layers, and adjustments to the number of channels, reduces the number of parameters by 42.7% compared to U-Net while maintaining similar performance.The introduction of PAM into the backbone led to improvements in the sensitivity, accuracy, and AUC of the model.This enhancement can be attributed to improved feature extraction.The incorporation of MSAM significantly improved the sensitivity, which represents the accuracy of the model in blood vessel segmentation.This improvement indicates that multi-scale features can greatly enhance segmentation performance.However, accuracy does not improve, suggesting that the model misclassifies background pixels as blood vessels during pixel class discrimination.Finally, the inclusion of MBDM in the model results in MFA-UNet achieving the best performance in accuracy, AUC, and DSC.In the ablation experiments, we observed significant improvements in certain metrics with the addition of each module to the model.This suggests that incremental enhancements in the model contribute to the improvement of vessel segmentation performance (see Figure 9).By incorporating MBDM into the model, it achieves the highest values in comprehensive metrics and attains the best balance across each metric. .

. E ect of the choice of loss function on segmentation performance
In this set of experiments, we removed the MBDM component from MFA-UNet and trained the network with three different loss functions to verify the effectiveness of MBDM in improving segmentation performance.The visual segmentation results of these networks are shown in Figure 9.When employing the BCE loss function, the network can only segment the structure of coarse vessels well, but causes structural discontinuity in fine vessel segmentation.By employing the Dice loss function, the network achieves a sensitivity of 83.93% and exhibits improved segmentation of small vessels, albeit with some non-vessel areas being erroneously segmented.We noticed that when using a mixed loss function, the misclassification rate in some uncertain areas of the network is significantly reduced, which is manifested as an increase in specificity.Benefiting from the training method of MBDM, MFA-UNet can better preserve the structure of fine vessels and reduce the misclassification rate of non-vessel areas, achieving the highest accuracy, specificity, and DSC in Table 3.

. Cross-training experiments
To validate the generalization and robustness of the proposed method, we conduct cross-training experiments.Specifically, we evaluated the performance of a model trained and converged on one dataset when applied to another dataset.Unlike the method of retraining the neural network in Li et al. (2016)  superior sensitivity, specificity, and accuracy compared to other methods, albeit with a lower AUC.Conversely, on the STARE dataset, the specificity and accuracy of our results are inferior to other methods.We attribute this situation to the presence of lesion images in the STARE dataset, which facilitate the learning of sufficient features by the model, while the segmentation masks in the DRIVE dataset exhibit greater detail compared to those in the STARE dataset.  .

Comparison of parameters, flops, and speeds
We trained MFA-UNet and other methods separately on image patches and whole images, subsequently evaluating them to obtain various metrics for the performance comparison of the mentioned methods.The metrics used encompass the number of trainable parameters in the model, Floating Point Operations PerSecond (FLOPs), inference time for a single image, DSC, and AUC.Table 5 provides an overview of the trainable parameters, FLOPs, and inference time for MFA-UNet and other methods.To visually demonstrate the performance of each model in terms of trainable parameters, DSC, and AUC, we present a ball chart in Figure 10.According to Table 5, UNet has approximately 3.45 million trainable parameters.After incorporating all the proposed modules into UNet, MFA-UNet achieves smaller trainable parameters while maintaining excellent segmentation accuracy.In comparison, DUNet has 7.41 million trainable parameters but exhibits weaker performance in terms of DSC.Additionally, Attention U-Net and R2U-Net, as variants of U-Net, have a similar number of parameters to UNet.Attention U-Net, with the inclusion of the attention mechanism, achieves high segmentation accuracy, validating the effectiveness of this module.Our MFA-UNet outperforms other methods in terms of DSC and AUC, while maintaining smaller model parameters and complexity.
When we use whole images as training data, all compared methods exhibit varying degrees of reduction in DSC and AUC, with the exception of UNet and R2U-Net.We attribute this decline to the fact that when using entire images as input, segmentation models may not adequately focus on the smaller, finer blood vessels, thereby resulting in incomplete vessel segmentation and decreased performance.Inference time is reduced compared to patch-based methods, as utilizing an overlapping patch strategy increases the number of image samples significantly.Patch-based methods require segmentation of more images, extending inference  The bolded value is the maximum value in the comparison.
time.Nevertheless, this approach enhances the segmentation performance of different models and meets the high-precision requirements of tasks like retinal vessel segmentation.

Discussion
Figure 9 shows that the incorporation of PAM into the decoder restores the structure of some of the macrovessels, while the addition of MSAM empowers MFA-UNet to accurately segment the intricate branching patterns of microvessels.To adjust the weight of the features within the skip path, we propose MSAM based on the self-attention mechanism and introduce it into the skip path of MFA-UNet.Unlike traditional convolutional neural networks, where the convolutional layer integrates the local feature information of the image through windowed convolution operations, resulting in a network with a restricted perceptual field, MSAM has a self-attention mechanism with a global perceptual field.This empowers MFA-UNet to effectively capture the interdependencies between vessel pixels and other pixels within the image.
The effectiveness of the multi-objective segmentation strategy is demonstrated by the improvement observed across all metrics in Table 2. Since a single network is inadequate for segmenting both macrovessels and microvessels, we established multiple branches in the decoder, each with different optimization objectives.This structure introduces deep supervision and enables the segmentation of macrovessels and microvessels to be separate, allowing us to adjust the optimization objectives of MFA-UNet.The primary structure of the model is utilized for feature extraction  In our study, we investigated the impact of the position of the attention mechanism within the model.When the attention mechanism is applied after the cascade of feature maps from the decoder and skip path, we observed an improvement in sensitivity but a decline in specificity.Conversely, when the attention mechanism is applied after the cascade fusion of feature maps and upsampling, we observed an increase in specificity but a decrease in sensitivity.We hypothesize that adjusting the feature map within the skip path can enhance the vessel features in the feature map, but the model is prone to misclassify background pixels as vessel pixels.On the other hand, adjusting the upsampled feature maps will make the model focus on the classification of background pixels.Considering the requirement for a balanced performance across multiple metrics, we incorporate MSA into the skip path to enable the decoder to utilize multi-scale features for better segmentation.Additionally, we position the attention mechanism after the convolutional layer of the decoder to mitigate the misclassification rate of the model.
In comparison to the DRIVE, STARE, HRF, IOSTAR, and FIVES datasets, the CHASE-DB1 dataset is specifically curated from the eyes of children to mitigate the interference caused by lesions associated with eye diseases.Figure 7 illustrates that during the testing phase, the segmentation of the optic disc boundary by MFA-UNet resulted in a low specificity improvement of the model.We attribute this observation to the high contrast exhibited by the optic disc in the fundus image of the CHASE-DB1 dataset, as well as the similarity in shape between the optic disc boundary and blood vessels.Consequently, the model tends to misclassify the optic disc boundary as blood vessels.
In the IOSTAR dataset, We observed a common issue among all segmentation models evaluated on this dataset: square-shaped convolutional kernels struggle to preserve the curvatures of blood vessels, particularly those with smaller curvatures.We believe that during the encoding process, the perception of blood vessels of model with larger curvatures is compromised due to their relatively low proportion in the overall structure.Additionally, the local feature extraction capabilities of the convolutional kernels fail to accurately segment distributed and highly curved vessel structures.
In the FIVES dataset, our proposed model has achieved optimal performance across all metrics.We visualized the segmentation results of MFA-UNet and UNet on glaucoma and DR images to demonstrate the influence of retinal image characteristics on segmentation performance (Figure 7).The left column in Figure 7 shows the segmentation results on glaucoma images, while the right column displays the results on DR images.We observed that the low contrast and intensity in glaucoma images significantly degraded the segmentation performance of UNet.However, MFA-UNet partially restored the interrupted vessels using self-attention mechanisms, although the segmentation results were still affected.Furthermore, the leopard-like appearance present in glaucoma images, resembling blood vessels, caused MFA-UNet to misclassify background pixels as vessel pixels.In contrast, DR images exhibited higher intensity and contrast, enabling MFA-UNet to accurately segment most vessel structures.These observations are further supported by the quantitative results presented in Table 2.
The proposed MFA-UNet has achieved competitive performance on multiple public datasets, yet there are still some limitations worth discussing.Convolutional neural networks (CNNs) inherently involve downsampling to reduce the dimensionality of image information, which can result in the loss of fine image details.Consequently, CNN-based methods may face challenges in accurately segmenting fine blood vessels, even when incorporating multi-scale feature information during the decoding process.To address this limitation, we believe that leveraging the information from the original image to refine the segmentation mask can further enhance the sensitivity and DSC of the segmentation framework.Recent studies have shown promising results by employing post-processing techniques such as dense conditional random fields (Lin et al., 2019), morphological reconstruction (Soomro et al., 2019), and probability-regularized random walks (Mou et al., 2020).
In addition, it is important to note that the proposed approach includes preprocessing the input of MFA-UNet to enhance the contrast of blood vessels, which contributes to improved segmentation performance.Consequently, the results may not be optimal when conducting segmentation on unprocessed images.Furthermore, while our method has been evaluated on datasets encompassing various categories of ocular diseases, the performance of MFA-UNet has not been specifically validated on a single ocular disease dataset that exhibits varying severity levels.As a future direction, we are considering the utilization of graph convolutional neural networks to comprehensively analyze the vascular skeleton.This approach would enable the establishment of dependencies between vascular nodes and endpoints, facilitating the capture of contextual information and ultimately leading to more accurate vascular segmentation.

Conclusion
In this study, we present MFA-UNet, a novel neural network architecture that leverages self-attention mechanisms and multi-branch decoding modules to enhance the accuracy of microvascular segmentation and preserve microvessel structure.We also adopt preprocessing techniques to improve the quality of fundus images obtained by the fundus camera and employ patch-based data augmentation methods to mitigate overfitting issues that may arise due to the limited number of training samples.The MSAM performs the fusion of multi-scale features and establishes inter-pixel dependencies to enable the model with a global perceptual field and improve the segmentation performance on microvessels.Additionally, the MBDM enables the model to segment macrovessels and microvessels separately and merge the segmentation results to obtain an excellent segmentation mask, resulting in better performance in the segmentation of both macrovessels and microvessels.The PAM is included in the decoder to reduce the misclassification rate of the model.The experimental results show that the MFA-UNet has excellent performance in retinal vessel segmentation and outperforms current state-of-the-art algorithms in several metrics on the DRIVE, STARE, CHASEDB1, HRF, IOSTAR, and FIVES datasets.Moreover, MFA-UNet has smaller model parameters and complexity, requiring only 0.399 s to segment an image, suggesting that the proposed method holds promise for being transplanted into embedded software, thus further advancing the intelligence level of fundus cameras in the realm of vessel segmentation.

FIGURE
FIGURESampling process for fundus images.(A) A fundus camera; (B) an acquired sample image with various issues, including uneven illumination, low contrast, and the presence of complex vessel structures.

Input:
Feature map while down-sampling F i (i∈ [1, 2, 3]), the channel embedding ch Output: Multi-scale fusion features S i (i∈ [1, 2, 3]) 1: Sample F 1 and F 3 to become the same scale as F 2 ; 2: Apply a 3 × 3 convolution on F 1 and F 3 to change the number of channels to 256; 3: Obtain fusion feature F a by perform the sum 3 i=1 F i ; 4: Apply 1 × 1 convolution to reshape F a into the token feature z.Then, perform linear operations on of the token feature z by dot product to generate key and values; 5: Perform a dot product operation on key and channel embedding ch as queries to obtain W ch ; 6: Conduct matrix multiplication on W ch and values to obtain S ch ; 7: Perform a dot product operation on S ch to generate vvalues, key and queries; 8: Perform a dot product operation between queries and key to derive W sp , and perform a dot product operation between W sp and values to derive S sp ; 9: Impose linear calculations on S sp to obtain S; 10: Obtain S i for the sampling operation of S, where S i is reshaped to be the same size as F i (i∈ [1, 2, 3]); Algorithm .Self-attention based multi-scale feature fusion algorithm.Frontiers in Neuroscience frontiersin.orgCao et al.

FIGURE
FIGURERetinal vessel segmentation results on DRIVE, STARE, Chase-DB , HRF, IOSTAR, and FIVES datasets.The left and right columns present the segmentation results of di erent samples from the same dataset.In each column, from left to right, you will find the raw image, its local detail images, ground truth, result of MFA-UNet, and result of UNet, respectively.

FIGURE
FIGURE Learning curves for MFA-UNet trained on the DRIVE dataset.(A) Learning curve illustrating the changes in Accuracy; (B) learning curve illustrating the changes in dice score.

FIGURE
FIGUREVisualization of segmentation results of MFA-UNet and its ablated versions on the DRIVE dataset.(A) Results of various combination of modules, (B) results of di erent loss functions.

FIGURE
FIGUREComparison of model Params and performance.Note that the size of the circle indicates the number of model parameters.DSC, and AUC are used to evaluate the performance of models.The red font indicates that the method utilizes the whole image as input, while the black font indicates that the method utilizes patches as input.
TABLE Comparison of the proposed method with other methods in terms of evaluation metrics including mSen, mSp, Acc, AUC, and DSC, using the DRIVE, STARE, CHASE-DB , HRF, IOSTAR, and FIVES datasets (unit:%).
, we utilize the model trained in Section 4.6 for cross-training without multiple training.The outcomes of the cross-training experiments are presented in Table4.Notably, for the DRIVE dataset, our method achieves TABLE The quantitative results of the ablative studies.Effects of different module combinations on segmentation performance; Study 2: Comparison of segmentation performance for MFA-UNet with MBDM using different loss functions and MFA-UNet without MBDM.The bolded value is the maximum value in the comparison.
TABLE The results of cross-training experiments on the DRIVE and STARE dataset.
TABLE Comparison of parameters (unit: M), FLOPs (unit: G), inference time for models with patch segmentation and panoramic segmentation on a single image (unit: s) among di erent methods on the DRIVE dataset.