Deep learning based retinal vessel segmentation and hypertensive retinopathy quantification using heterogeneous features cross-attention neural network

Retinal vessels play a pivotal role as biomarkers in the detection of retinal diseases, including hypertensive retinopathy. The manual identification of these retinal vessels is both resource-intensive and time-consuming. The fidelity of vessel segmentation in automated methods directly depends on the quality of the fundus images. In instances of sub-optimal image quality, applying deep learning-based methodologies emerges as a more effective approach for precise segmentation. We propose a heterogeneous neural network that combines the local semantic information extraction of convolutional neural networks with the long-range spatial feature mining of transformer structures. This cross-attention network structure boosts the model's ability to capture vessel structures in retinal images. Experiments on four publicly available datasets demonstrate our model's superior vessel segmentation performance and its strong potential for hypertensive retinopathy quantification.


Introduction
Hypertension (HT) is a chronic ailment posing a profound menace to human wellbeing, manifesting in vascular alterations (1). Its substantial contribution to the global prevalence and fatality rates of cardiovascular diseases (CVD) cannot be overstated. The escalated incidence and mortality rates are not solely attributable to HT's correlation with CVD but also to the ramifications of hypertension-mediated organ damage (HMOD). This encompasses structural and functional modifications across pivotal organs, including arteries, heart, brain, kidneys, vessels, and the retina, signifying preclinical or asymptomatic CVD (2, 3). The principal aim of HT management remains to deter CVD incidence and mortality. Achieving this goal mandates meticulous adherence to HT guidelines, emphasizing precise blood pressure monitoring and evaluation of target organ damage (4). Consequently, the early identification of HT-mediated organ damage emerges as a pivotal concern. The retinal vascular system shares structural, functional, and embryological commonalities with the vascular systems of the heart, brain, and kidneys (5-9). Compared to other microvascular territories, the distinctive attributes of the retinal microcirculation enable relatively straightforward detection of localized HMOD (5, 9). Its capacity to offer a non-invasive and uncomplicated diagnostic tool positions retinal visualization as the simplest means of elucidating the microcirculatory system. In hypertensive patients, the retinal microvasculature gives insight into the wellbeing of the heart, kidneys, and brain (5, 10, 11). Early detection of HT-mediated retinal changes indirectly mirrors the vascular status of these organs, facilitating refined cardiovascular risk stratification, timely interventions, and improved prognostication, thereby holding substantial clinical significance. Traditional clinical methodologies for diagnosing HT-mediated retinal alterations, while reliant on the proficiency of ophthalmic professionals, often demand considerable time and specialized expertise (12). Figure 1 presents a sample fundus image, demonstrating the complexity of the retinal vasculature and image intensity variation. However, integrating AI-based models in ophthalmology holds promising prospects for revolutionizing this paradigm. Leveraging machine learning algorithms and deep neural networks, AI-enabled diagnostic tools have demonstrated the potential to expedite and enhance the assessment of HT-related retinal vessel changes (13-17). These AI models learn from extensive datasets of annotated medical images, swiftly recognizing subtle retinal anomalies that might elude human detection. By automating the analysis and interpretation of retinal images, AI-based systems offer the prospect of reducing diagnostic timeframes, improving accuracy, and potentially mitigating the need for extensive human oversight. In this work, we propose a heterogeneous features cross-attention neural network to tackle the retinal vessel segmentation task on color fundus images.

Related work
Segmenting blood vessels in retinal color fundus images plays a pivotal role in the diagnostic process of hypertensive retinopathy. Over the years, researchers have explored computer-assisted methodologies to tackle this task. For instance, Annunziata and Trucco (18) introduced a novel curvature segmentation technique leveraging an accelerating filter bank implemented via a speed-up convolutional sparse coding filter learning approach. Their method employs a warm initialization strategy, kickstarted by meticulously crafted filters. These filters are adept at capturing the visual characteristics of curvilinear structures and are subsequently fine-tuned through convolutional sparse coding. Similarly, Marín et al. (19) delved into the realm of hand-crafted feature learning methods, harnessing gray-level and moment-invariant-based features for vessel segmentation. However, despite the efficacy of such techniques, the manual crafting of filters is inherently time-intensive and prone to biases, necessitating a shift toward more automated and data-driven approaches in this domain.
Deep learning techniques based on data analysis have demonstrated superior performance to conventional retinal vessel segmentation approaches (18-20). For instance, Maninis et al. (21) developed a method wherein feature maps derived from a side-output layer contribute to vessel and optic disc segmentation. Along a similar line, Oliveira et al. (22) combined the multi-scale analysis of the stationary wavelet transform with a multi-scale fully convolutional neural network, resulting in a technique adept at accommodating variations in the width and orientation of retinal vessel structures. In terms of exploiting the U-Net structure, previous methods have achieved promising performance. For example, Yan et al. (23) implemented a joint loss function in U-Net, comprising two components responsible for pixel-wise and segment-level losses, aiming to enhance the model's ability to balance segmentation between thicker and thinner vessels. Mou et al. (24) embedded dense dilated convolutional blocks between encoder and decoder cells at corresponding levels of a U-shaped network, employing a regularized walk algorithm to post-process model predictions. Similarly, Wang et al. (25) proposed a Dual U-Net with two encoders, one focused on spatial information extraction and the other on context information, and introduced a novel module to merge information from both paths.
Despite the proficiency of existing deep learning methodologies in segmenting thicker vessels, combining heterogeneous features from different stages of deep models, such as those produced by transformer and CNN structures, remains a challenge. Generally, deep learning-based vessel segmentation can be improved from various angles, including multi-stage feature fusion and optimization of loss functions. This work proposes a heterogeneous features cross-attention neural network to address the above challenge.

Materials and methods

Heterogeneous features cross-attention neural network
A detailed overview of the model structure is shown in Figure 2. Two branches of feature extraction modules extract heterogeneous features from different stages of the backbone network: a CNN-based branch (Conv-Block) focusing on local semantic information and a transformer-based branch (Trans-Block) focusing on long-range spatial information. Both types of information are important for the vessel segmentation task.
The interaction between the two branches is realized as a cross-attention module that emphasizes the essential heterogeneous (semantic and spatial) features; it serves as the main structure facilitating the interaction and integration of local and long-range global features. Drawing inspiration from the work by Peng et al. (26), the intersecting network architecture within our model ensures that both Conv-Block and Trans-Block can concurrently learn features derived from the preceding Conv-Block and Trans-Block, respectively.

CNN blocks
In the structure depicted in Figure 2, the CNN branch adopts a hierarchical structure: the resolution of the feature maps is reduced as the network depth increases and the channel count expands. Each phase of this structure consists of several convolution blocks, each housing multiple bottlenecks. These bottlenecks, in accordance with the ResNet framework (27), comprise a sequence involving down-projection, spatial convolution, up-projection, and a residual connection to maintain information flow within the block. Distinctly, visual transformers (28) condense an image patch into a vector in one step, which unfortunately leads to the loss of localized details. Conversely, in CNNs the convolutional kernels operate on overlapping regions of the feature maps, retaining intricate local features. Consequently, the CNN branch ensures a sequential provision of localized feature intricacies to benefit the transformer branch.

Transformer blocks
In line with the approach introduced in ViT (28), this branch consists of N sequential transformer blocks, as showcased in Figure 2. Each transformer block combines a multi-head self-attention module with an MLP block, encompassing an up-projection fully connected layer and a down-projection fully connected layer. Throughout this structure, LayerNorm (29) is applied before each layer, and residual connections are integrated into both the self-attention layer and the MLP block. For tokenization, the feature maps generated by the backbone module are compressed into 16 × 16 patch embeddings without overlap. This compression is achieved using a linear projection layer, implemented via a 3 × 3 convolution with a stride of 1. Notably, since the CNN branch (3 × 3 convolution) already encodes both local features and spatial location information, the need for positional embeddings diminishes. This strategic adaptation yields an improved image resolution, advantageous for subsequent vision tasks.
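The non-overlapping tokenization above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function name `tokenize`, the toy shapes, and the use of a plain reshape plus matrix multiplication (equivalent to a patch-wise linear projection) are illustrative assumptions.

```python
import numpy as np

def tokenize(feature_map, proj, patch=16):
    """Split a C x H x W feature map into non-overlapping patch x patch
    tiles and linearly project each flattened tile to an embedding."""
    c, h, w = feature_map.shape
    gh, gw = h // patch, w // patch
    # Rearrange so each row is one flattened patch of size patch*patch*c
    patches = (feature_map
               .reshape(c, gh, patch, gw, patch)
               .transpose(1, 3, 2, 4, 0)
               .reshape(gh * gw, patch * patch * c))
    return patches @ proj  # (L, J) patch embeddings

rng = np.random.default_rng(0)
fmap = rng.standard_normal((64, 48, 48))        # toy C=64, H=W=48 features
proj = rng.standard_normal((16 * 16 * 64, 384)) # hypothetical J=384
tokens = tokenize(fmap, proj)
print(tokens.shape)  # (9, 384): L = (48/16)^2 = 9 patches
```

In a real network the projection would be a learned convolution; the reshape here only demonstrates how the spatial grid maps to the token sequence length L.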

Feature fusion blocks
Aligning the feature maps derived from the CNN branch with the patch embeddings within the transformer branch poses a significant challenge. To tackle this, we introduce the feature fusion block, aiming to continuously and interactively integrate local features with global representations. The substantial difference in dimensionality between the CNN and transformer features is noteworthy. While CNN feature maps have dimensions C × H × W (channels, height, and width, respectively), patch embeddings assume a shape of (L + 1) × J, where L, 1, and J denote the count of image patches, the class token, and the embedding dimension, respectively. To reconcile these disparities, feature maps transmitted to the transformer branch undergo an initial 1 × 1 convolution to align their channel number with the patch embeddings. Subsequently, a down-sampling module (depicted in Figure 2) aligns the spatial dimensions, after which the feature maps are amalgamated with the patch embeddings, as portrayed in Figure 2. For feedback from the transformer to the CNN branch, the patch embeddings are up-sampled (as illustrated in Figure 2) to match the spatial scale; the channel dimension is then aligned with that of the CNN feature maps through a 1 × 1 convolution before these adjusted embeddings are integrated into the feature maps. Furthermore, LayerNorm and BatchNorm modules are employed to regularize the features. Moreover, a significant semantic disparity exists between feature maps and patch embeddings: feature maps stem from local convolutional operators, while patch embeddings arise from global self-attention mechanisms. Consequently, the feature fusion block is incorporated into each block (excluding the initial one) to bridge this semantic gap progressively.
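The CNN-to-transformer direction of this alignment can be sketched in numpy. This is a simplified illustration, not the paper's code: the 1 × 1 convolution is written as a per-pixel matrix multiplication, down-sampling is plain average pooling, and the class token, LayerNorm, and learned weights are omitted; all names and shapes are assumptions.

```python
import numpy as np

def cnn_to_transformer(fmap, w1x1, grid, patch_emb):
    """Align a C x H x W CNN feature map with (L, J) patch embeddings:
    a 1x1 convolution (per-pixel linear map) matches channels, then
    average pooling matches the grid x grid patch layout."""
    c, h, w = fmap.shape
    j = w1x1.shape[1]
    # 1x1 conv == matmul over the channel axis at every pixel
    x = np.einsum('chw,cj->jhw', fmap, w1x1)
    # Average-pool each cell of the grid x grid layout
    s = h // grid
    x = x.reshape(j, grid, s, grid, s).mean(axis=(2, 4))  # (J, grid, grid)
    x = x.reshape(j, grid * grid).T                       # (L, J)
    return patch_emb + x  # fuse into the transformer branch

rng = np.random.default_rng(1)
fused = cnn_to_transformer(rng.standard_normal((64, 24, 24)),  # toy C=64
                           rng.standard_normal((64, 384)),     # 1x1 conv weights
                           grid=3,
                           patch_emb=rng.standard_normal((9, 384)))
print(fused.shape)  # (9, 384)
```

The reverse (transformer-to-CNN) path would mirror these steps with up-sampling and a second 1 × 1 channel projection.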

Loss functions
Commonly utilized region-based losses, like Dice loss (35), often result in highly precise segmentation. However, they tend to disregard the intricate vessel shapes because the multitude of pixels outside the target area overshadows the significance of those delineating the vessel (36-40). This oversight may contribute to relatively imprecise retinal vessel segmentation and, consequently, inaccurate quantification of hypertensive retinopathy. In response, we incorporated the TopK loss (Equation 1) (41, 42) to specifically emphasize the retinal vessels during the training process. When objects are not notably smaller than the convolutional neural network's (CNN) receptive field, the vessel emerges as the most variable component within the prediction, displaying the least certainty; thus, the loss within the vessel region tends to be the highest among the predictions (43). Building upon these observations, the TopK loss is formulated as follows:

L_TopK = -(1 / |K|) Σ_{i∈K} [g_i log(s_i) + (1 - g_i) log(1 - s_i)],  (1)

where g_i is the ground truth of pixel i, s_i is the corresponding predicted probability, and K is the set of the k% pixels with the lowest prediction accuracy (i.e., the highest per-pixel loss). Since a solely vessel-focused loss often causes training instability (44), a region-based loss such as Dice loss (Equation 2) (35) is needed at the early stage of training. We represent Dice loss as follows:

L_Dice = 1 - 2|V_g ∩ V_s| / (|V_g| + |V_s|),  (2)

where V_g is the ground truth label and V_s is the segmentation prediction. We couple the TopK loss with the region-based Dice loss as our final loss function (Equation 3) for retinal vessel segmentation:

L = L_Dice + L_TopK.  (3)
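A minimal numpy sketch of this combined loss, assuming an unweighted sum of the two terms (the paper's exact weighting and the soft-Dice variant used in training are not specified here):

```python
import numpy as np

def topk_loss(g, s, k=10.0, eps=1e-7):
    """Cross-entropy averaged over only the k% of pixels with the
    highest per-pixel loss, i.e. the least certain predictions."""
    g, s = g.ravel(), np.clip(s.ravel(), eps, 1 - eps)
    ce = -(g * np.log(s) + (1 - g) * np.log(1 - s))  # per-pixel CE
    n = max(1, int(len(ce) * k / 100.0))
    return float(np.sort(ce)[-n:].mean())            # keep the worst k%

def dice_loss(g, s, eps=1e-7):
    """1 minus the soft Dice overlap between ground truth and prediction."""
    g, s = g.ravel(), s.ravel()
    return float(1 - (2 * (g * s).sum() + eps) / (g.sum() + s.sum() + eps))

def total_loss(g, s, k=10.0):
    """Equation 3: Dice loss for stability plus TopK loss for hard pixels."""
    return dice_loss(g, s) + topk_loss(g, s, k)

g = np.array([1.0, 1.0, 0.0, 0.0])          # toy ground truth
good = np.array([0.99, 0.99, 0.01, 0.01])   # confident, correct prediction
poor = np.array([0.6, 0.2, 0.4, 0.4])       # uncertain prediction
assert total_loss(g, good) < total_loss(g, poor)
```

Because the TopK term averages only the hardest pixels, thin vessel pixels that the model keeps getting wrong dominate the gradient, which matches the motivation stated above.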

Experimental setting
To enrich the dataset, we apply random rotations on the fly to the input images in the training dataset for both segmentation tasks. Specifically, these rotations span from -20 to 20 degrees. Additionally, 10% of the training dataset is randomly chosen to serve as the validation dataset. The proposed network was implemented using the PyTorch library and executed on an Nvidia GeForce TITAN Xp GPU. Throughout the training phase, we employed the AdamW optimizer to fine-tune the deep model. To ensure effective training, a gradually decreasing learning rate was adopted, commencing at 0.0001, alongside a momentum parameter set at 0.9. For each iteration, a random patch of size 118 × 118 from the image was selected for training, with a batch size of 16. A ResNet50 backbone (27) is used in this work.

Evaluation metrics
The model's output is a probability map assigning to each pixel the probability of belonging to the vessel class. Throughout the experiments, a probability threshold of 0.5 was employed to binarize the results. To comprehensively assess the efficacy of our proposed framework during the testing phase, the following metrics are computed:
• Acc (accuracy) = (TP + TN) / (TP + TN + FP + FN)
• SE (sensitivity) = TP / (TP + FN)
• SP (specificity) = TN / (TN + FP)
• F1 (F1 score) = (2 × TP) / (2 × TP + FP + FN)
• AUROC = area under the receiver operating characteristic curve
In this context, a vessel pixel classified correctly is a true positive (TP), while a vessel pixel misclassified as background is a false negative (FN). Correspondingly, a non-vessel pixel classified correctly is a true negative (TN), whereas a non-vessel pixel misclassified as vessel is a false positive (FP).
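The metric definitions above can be computed directly from a probability map, as in this numpy sketch (the rank-based AUROC below assumes no tied probabilities; real evaluations would typically use a library routine):

```python
import numpy as np

def metrics(gt, prob, thr=0.5):
    """Pixel-wise Acc, SE, SP, F1 from a thresholded probability map,
    plus AUROC via the Mann-Whitney rank statistic."""
    pred = (prob >= thr).astype(int)
    gt = gt.astype(int)
    tp = int(((pred == 1) & (gt == 1)).sum())
    tn = int(((pred == 0) & (gt == 0)).sum())
    fp = int(((pred == 1) & (gt == 0)).sum())
    fn = int(((pred == 0) & (gt == 1)).sum())
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # AUROC: probability a random vessel pixel outranks a background pixel
    ranks = prob.ravel().argsort().argsort() + 1  # 1-based ranks, no ties
    pos = gt.ravel() == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    auroc = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return acc, se, sp, f1, auroc

gt = np.array([1, 1, 0, 0, 1, 0])
prob = np.array([0.9, 0.8, 0.3, 0.6, 0.4, 0.1])
print(metrics(gt, prob))
```

On the toy input, Acc, SE, SP, and F1 all evaluate to 2/3 and AUROC to 8/9, which can be verified by hand from the confusion counts.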

Compared methods
We compared our approach with other classic and state-of-the-art models that have achieved promising performance on different medical image segmentation tasks. All experiments are conducted under the same experimental setting. The compared methods are briefly introduced below:

Results
Vessel segmentation performance

Our proposed method outperforms the other compared methods on the DRIVE, CHASEDB1, STARE, and HRF datasets. In detail, our model achieved 83.3% F1 on the DRIVE dataset, outperforming Unet (45) by 3.6%, and also surpassed Swin-Transformer (47) and TransUnet (49). The latter two belong to the transformer-based model family, which demonstrates superior performance on many tasks. In this work, however, the limited data size is one of the leading reasons for their relatively low performance on these datasets. Another reason could be the nature of the vessel segmentation task itself, where local information matters more than long-range relationships between pixels. Thus, with two branches, transformer and CNN structures, and fusion modules, our proposed model can simultaneously exploit both local semantic information and long-range spatial information for the segmentation task.
Figure 3 shows the qualitative comparison between ours and the other compared methods. It demonstrates that our proposed method can segment the vessels more accurately. This is important for vessel segmentation tasks and for hypertensive retinopathy quantification, which relies on accurate vessel area calculation.

Ablation study

Ablation study on loss functions
We conducted ablation experiments on the loss functions, maintaining the same model structure and changing only the losses. In detail, we remove the Dice loss and the TopK loss, respectively, to evaluate their contributions to the performance of the proposed model. Furthermore, we replace the TopK loss with a cross-entropy loss to validate the effectiveness of the TopK loss in the segmentation task. Table 5 demonstrates that the Dice loss contributes a 6.2% F1 improvement and the TopK loss a 2.9% F1 improvement. In terms of sensitivity, the Dice loss contributes 15.5% SE and the TopK loss 2.8% SE on the DRIVE dataset. Additionally, compared with the cross-entropy loss, the TopK loss yields a 1.5% F1 improvement and a 2.3% SE improvement. Each loss function boosts the model's performance on different evaluation metrics, demonstrating that the adopted loss functions both contribute to the learning process and benefit the vessel segmentation performance.

Ablation study on the models' components
We conducted ablation experiments on the model's components, keeping the rest of the model unchanged and removing one module at a time: Trans-Block, CNN-Block, or Fusion-Block. Table 6 demonstrates that the Trans-Block contributes a 10% F1 boost, the CNN-Block a 10.3% F1 boost, and the Fusion-Block a 7.9% F1 boost. In terms of sensitivity, the Trans-Block contributes 3.3% SE, the CNN-Block 2.3% SE, and the Fusion-Block 0.9% SE on the DRIVE dataset. Each module boosts the model's performance on different evaluation metrics, demonstrating that the proposed modules all contribute to the learning process and benefit the vessel segmentation performance.

Hypertensive retinopathy quantification
The proposed method has demonstrated promising retinal vessel segmentation performance on different datasets and benchmarks. Additionally, precise segmentation of retinal vessels plays a vital role in hypertensive retinopathy detection, whereas manual segmentation tends to be cumbersome and time-consuming (50). The proposed model generates a binary mask distinguishing vessel pixels as one and background pixels as zero. This mask effectively quantifies the total count of vessel pixels within each mask. The ratio (R_vessel) between the count of vessel pixels and non-vessel pixels is defined as follows:

R_vessel = N_v / N_non,  (4)

where N_v represents the count of vessel pixels and N_non denotes the count of non-vessel pixels. The ratio R_vessel (Equation 4) serves as a valuable metric in identifying hypertensive retinopathy within fundus images. Hypertensive retinopathy leads to vascular constriction (51, 52), resulting in a decrease in the count of vessel pixels and hence in R_vessel. Detection of hypertensive retinopathy, characterized by vascular constriction, involves assessing changes in R_vessel across sequential examinations: a decrease in R_vessel suggests the occurrence or progression of hypertensive retinopathy. Hence, our proposed methods offer a straightforward approach for detecting hypertensive retinopathy.
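Equation 4 amounts to a pixel count on the binary mask. A minimal sketch, using hypothetical masks for two consecutive visits (the masks and the degree of narrowing are illustrative assumptions, not patient data):

```python
import numpy as np

def vessel_ratio(mask):
    """R_vessel = N_v / N_non for a binary segmentation mask."""
    n_v = int(mask.sum())       # vessel pixels (value 1)
    n_non = mask.size - n_v     # background pixels (value 0)
    return n_v / n_non

# Toy masks: a 4-pixel-wide vessel that narrows to 2 pixels at follow-up
visit1 = np.zeros((64, 64), dtype=int); visit1[:, 30:34] = 1
visit2 = np.zeros((64, 64), dtype=int); visit2[:, 31:33] = 1
r1, r2 = vessel_ratio(visit1), vessel_ratio(visit2)
assert r2 < r1  # a falling ratio is consistent with vascular constriction
```

Comparing R_vessel across sequential examinations in this way is what turns the segmentation output into a quantification signal.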
In the future, with larger datasets comprising fundus images from hypertensive and healthy patients, we can further analyze vessel changes within these images. In real-world clinical practice, comparing the R_vessel values obtained from consecutive visits can serve as a diagnostic tool. Additionally, newly formed vessels can be detected by subtracting images from successive visits after segmentation. This approach enables the identification and tracking of vasculature changes over time, offering potential insights for clinical assessment and monitoring.

Limitation and future works
While our deep learning method has shown promising results in the challenging tasks of retinal vessel segmentation and hypertensive retinopathy quantification, it is important to acknowledge the nuanced landscape of limitations accompanying such endeavors. One notable factor is the inherent variability present in medical imaging datasets. Our model's performance could be influenced by factors such as variations in image quality and disease severity across different datasets. Moreover, despite achieving commendable results overall, there are instances where the model might struggle to accurately delineate intricate vascular structures or detect subtle manifestations of hypertensive retinopathy. This suggests the need for further exploration and refinement of our approach.
In future research, attention could be directed toward enhancing the model's robustness and adaptability to diverse imaging conditions and patient populations. Techniques such as advanced data augmentation and domain adaptation strategies could prove instrumental in achieving this goal. Additionally, integrating complementary sources of information, such as clinical metadata or genetic markers, holds promise for enriching the predictive capabilities of our model and enhancing its clinical relevance. Furthermore, the pursuit of interpretability and explainability remains paramount. Providing clinicians with insights into how the model arrives at its predictions can foster trust and facilitate its integration into real-world clinical workflows. However, this pursuit must be balanced with ethical considerations, particularly concerning patient privacy, algorithmic bias, and the potential consequences of automated decision-making in healthcare settings. By addressing these multifaceted challenges, we can pave the way for more effective and responsible deployment of deep learning technologies in ophthalmology and beyond.

Conclusion
We have proposed a novel and comprehensive framework for retinal vessel segmentation and hypertensive retinopathy quantification. It takes advantage of heterogeneous features cross-attention, with a local-emphasis CNN branch and a long-range-emphasis transformer branch, together with a fusion module to aggregate the information. Our experiments on four public datasets have demonstrated that our framework can simultaneously deliver accurate segmentation and promising hypertensive retinopathy quantification.

FIGURE 1
Sample retinal fundus image for vessel segmentation and hypertensive retinopathy quantification. The yellow areas in Ground Truth represent the retinal vessel area that needs to be segmented for disease analysis.


FIGURE 2
Overview of our proposed model structure. Our model contains three modules: Trans-Block, CNN-Block and Fusion-Block. The detailed structure of each module is shown in the figure.

FIGURE 3
Qualitative results of the vessel segmentation. We compare our model with Unet ( ), Unet++ ( ), Swin-Transformer ( ), AttenUnet ( ), TransUnet ( ). Our method can produce more accurate segmentation results than the other methods compared with the ground truth.
TABLE Quantitative results comparison between our methods and other compared state-of-the-art methods on STARE dataset.
TABLE Quantitative results comparison between our methods and other compared state-of-the-art methods on HRF dataset. Performance is reported with Acc, SE, SP, F1 and AUROC. The 95% confidence interval is presented in brackets. The best performance is highlighted in bold.
TABLE 5 Quantitative ablation study results of the loss function on DRIVE dataset. Performance is reported with Acc, SE, SP, F1 and AUROC. The 95% confidence interval is presented in brackets. The best performance is highlighted in bold.
TABLE 6 Quantitative ablation study results of the model's components on DRIVE dataset. Performance is reported with Acc, SE, SP, F1 and AUROC. The 95% confidence interval is presented in brackets. The best performance is highlighted in bold.