
ORIGINAL RESEARCH article

Front. Med., 29 September 2025

Sec. Ophthalmology

Volume 12 - 2025 | https://doi.org/10.3389/fmed.2025.1660920

This article is part of the Research Topic: Innovative Advancements in Eye Image Processing for Improved Ophthalmic Diagnosis.

HyReti-Net: hybrid retinal diseases classification and diagnosis network using optical coherence tomography

Jikun Yang1,2, Chaoliang Hsu3, Jing Wang2, Bin Wu2, Yuanyuan Lu2, Yuxi Ding2, Zhenbo Zhao1,2, Kaili Tang1,2, Feng Lu4 and Liwei Ma1,2*
  • 1Aier Eye Medical Center of Anhui Medical University, Hefei, Anhui, China
  • 2Shenyang Aier Excellence Eye Hospital, Shenyang, Liaoning, China
  • 3University of Southern California Translational Biotechnology Department, Los Angeles, CA, United States
  • 4School of Automation, Shenyang Aerospace University, Shenyang, Liaoning, China

Background: With optical coherence tomography (OCT), doctors are able to see cross-sections of the retinal layers and diagnose retinal diseases. Computer-aided diagnosis algorithms such as convolutional neural networks (CNNs) and vision Transformers (ViTs) enhance diagnostic efficiency by automatically analyzing these OCT images. However, CNNs are less effective at extracting global features, while ViTs lack local inductive bias and typically require large amounts of training data.

Methods: In this paper, we presented a hybrid retinal diseases classification and diagnosis network named HyReti-Net which incorporated two branches. One branch extracted local features by leveraging the spatial hierarchy learning capabilities of ResNet-50, while the other branch was established based on Swin Transformer to consider the global information. In addition, we proposed a feature fusion module (FFM) consisting of a concatenation and residual block and the improved channel attention block to retain local and global features more effectively. The multi-level features fusion mechanism was used to further enhance the ability of global feature extraction.

Results: Evaluation and comparison experiments were used to show the advantages of the proposed architecture. Five metrics were applied to compare its performance with existing methods. Moreover, ablation studies were carried out to evaluate the effect of each component on the foundational model. For each public dataset, heatmaps were also generated to enhance the interpretability of OCT image classification. The results underscored the effectiveness and advantage of the proposed method, which achieved the highest classification accuracy.

Conclusion: In this article, a hybrid multi-scale network model integrating dual branches and a feature fusion module was proposed to diagnose retinal diseases. The proposed method produced promising classification results. On OCT-2014, OCT-2017 and OCT-C8, experimental results indicated that HyReti-Net achieved better performance than state-of-the-art networks. This study can provide a reference for the clinical diagnosis of ophthalmologists through artificial intelligence technology.

1 Introduction

The number of patients suffering from retinal diseases has increased significantly in recent years (1–3). Among the most prevalent retinal disease cases, age-related macular degeneration and diabetic macular edema are two common retinal diseases that are the leading causes of vision impairment and blindness (4). Therefore, automatic detection of retinal diseases has become a necessary procedure to produce accurate results for early diagnosis and timely treatment (5).

OCT serves as a vital imaging modality for the diagnosis of retinal lesions. This technique is distinguished by its non-invasive and contactless imaging capabilities, enabling the visualization of cross-sectional retinal layers and abnormalities (6–8). However, manual diagnosis of retinal OCT images presents several challenges (9–11). First, with the number of patients increasing annually, depending exclusively on qualified medical ophthalmologists is becoming insufficient to meet the growing diagnostic and therapeutic demands. Second, certain lesions present with subtle features that are challenging to detect, increasing the risk of misinterpretation and diagnostic oversights (12–14).

Due to the progress in high-performance computing systems and deep learning algorithms, automatic diagnosis of retinal diseases has become a reality. Many researchers have made substantial advancements in the classification of OCT images through CNNs (15–21). However, in the analysis of retinal diseases, CNNs may focus more on local lesion features and fail to effectively extract global features. Therefore, it is necessary to integrate global retinal information with local lesion information to enhance classification performance.

Alternatively, advancements in natural language processing have catalyzed the progression of computer image processing from CNNs to Transformer-based sequence networks (22). Among them, ViT has emerged as the dominant architecture in computer vision (23). It is notable for its use of self-attention to capture long-range dependencies across the whole image. However, utilizing a Transformer architecture exclusively may present difficulties in achieving adequate performance on small-scale datasets (13). Indeed, the Transformer architecture inherently lacks local inductive biases and generally demands extensive training data. This poses a considerable hurdle in retinal OCT image classification, where training images are scarce.

With the development of CNNs and ViTs, plenty of researchers have delved into the integration of CNNs with Transformers in the realm of image analysis tasks (24–26). On the basis of convolutional operations, CNNs excel at extracting local features and generating multi-scale feature information. Conversely, ViT utilizes a self-attention mechanism to effectively handle long-range and global dependencies in an image. The fusion of CNN and ViT promises to empower the model with enhanced robustness and efficiency in processing image information (12, 25, 27–30).

At present, the research gap in the classification of retinal diseases lay in the fact that traditional networks lacked multi-scale information and had limited ability to extract features from lesion areas, which reduced the classification accuracy of retinal diseases (31–33). To address the individual shortcomings of CNNs and ViTs, we proposed a hybrid model named HyReti-Net for classifying retinal diseases using OCT images. This model incorporated two branches: the CNN branch and the Transformer branch. Based on ResNet50, the CNN branch was responsible for extracting localized features and contextual information. Meanwhile, the Transformer branch, leveraging the Swin Transformer architecture, was tailored to capture global information and establish long-range dependencies. To evaluate the advantages of the HyReti-Net model, we performed experiments on three public OCT datasets: OCT-2014 (34), OCT-2017 (35) and OCT-C8 (26).

The primary contributions of this research include:

1. We proposed HyReti-Net, a dual-branch architecture that seamlessly integrated CNN and Swin Transformer.

2. An innovative feature fusion approach was proposed, incorporating the concatenation and residual block and the improved channel attention block.

3. The multi-level features fusion mechanism was used to further enhance the ability of global feature extraction.

4. The Grad-CAM method was used to improve the interpretability of the model and assist in confirming the accuracy of retinal disease diagnosis.

2 Materials and methods

2.1 Data sources

Deep learning models had demonstrated exemplary performance on large-scale datasets (36). However, it was difficult to access large and diverse datasets due to ethical concerns. Moreover, it was labor intensive and time consuming to obtain high-quality annotations for medical images. Consequently, we chose to use three existing OCT image datasets: OCT-2017, OCT-2014 and OCT-C8.

• OCT-2017 dataset consisted of 48,574 images in JPEG format divided into four categories: NORMAL, CNV, DME, and DRUSEN. This dataset was divided into training, testing, and validation folders for each image category.

• OCT-2014 dataset comprised a total of 3,231 SD-OCT images collected from 45 patients. The whole dataset encompassed three distinct retinal conditions: 15 patients with normal retinal health, 15 patients with AMD, and 15 patients with DME.

• OCT-C8 dataset contained 24,000 images in JPG format divided into eight categories: Age-related Macular Degeneration (AMD), Central Serous Retinopathy (CSR), Choroidal Neo-Vascularization (CNV), Diabetic Macular Edema (DME), Diabetic Retinopathy (DR), DRUSEN, Macular Hole (MH), and NORMAL.

Based on the original data division strategy, all three datasets were split image-wise into training, validation, and test sets in a ratio of 7:2:1. The dataset information and distribution were listed in Table 1.
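
As an illustration, the following minimal sketch shows one way to perform such an image-wise 7:2:1 split; the per-class folder layout, file extensions and helper names are assumptions for illustration rather than the exact organization of the public datasets.

```python
import random
from pathlib import Path

def split_dataset(root, ratios=(0.7, 0.2, 0.1), seed=42):
    """Image-wise 7:2:1 split into train/val/test, performed per class.

    `root` is assumed to contain one sub-folder per category (e.g. CNV, DME,
    DRUSEN, NORMAL); the actual layout of the public datasets may differ.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpeg")) + sorted(class_dir.glob("*.jpg"))
        rng.shuffle(images)
        n_train = int(ratios[0] * len(images))
        n_val = int(ratios[1] * len(images))
        splits["train"] += [(p, class_dir.name) for p in images[:n_train]]
        splits["val"] += [(p, class_dir.name) for p in images[n_train:n_train + n_val]]
        splits["test"] += [(p, class_dir.name) for p in images[n_train + n_val:]]
    return splits
```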

Table 1. Datasets information and distribution.

2.2 Experimental conditions

The proposed HyReti-Net network was trained and tested in PyTorch under Python 3.10. The hardware of the experimental system included an NVIDIA GeForce RTX 3090 GPU and an Intel Core i10 processor with 64 GB of RAM. By utilizing NVIDIA CUDA and its related components, we observed substantial improvements in the convergence speed and overall efficiency. Based on a grid search strategy, we selected the training hyperparameters for the proposed model. The Adam optimizer was utilized with a batch size of 128 and an initial learning rate of 1×10⁻⁴. The weight decay was set to 0.05, and the minimum learning rate was set to 1×10⁻⁶. The entire training period consisted of 100 epochs. During the training process, we adopted the proposed loss function, which combined focal loss and correntropy loss, to address the impact of class imbalance on classification tasks. The significance threshold p was set to 0.05. If the rate of change of adjacent errors was less than 0.5% during training, we considered that the training had stabilized. More experimental details of the training parameters were presented in Table 2.
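
For reference, a minimal training-setup sketch consistent with these hyperparameters is given below. The cosine annealing schedule is only one plausible way to reach the stated minimum learning rate, and `HyRetiNet`, `train_loader` and `hybrid_criterion` are hypothetical names standing in for components described elsewhere in the paper.

```python
import torch

# Hypothetical constructor and data loader; they stand in for the model of
# Section 2.3 and the datasets of Section 2.1 (batch_size=128 assumed in the loader).
model = HyRetiNet(num_classes=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.05)

# The paper states an initial lr of 1e-4 and a minimum lr of 1e-6 over 100 epochs;
# cosine annealing is one plausible way to realize this (an assumption).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = hybrid_criterion(logits, labels, epoch)  # focal then correntropy loss (Section 2.7)
        loss.backward()
        optimizer.step()
    scheduler.step()
```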

Table 2. The detail of training parameters.

2.3 Model design

The architecture of HyReti-Net model was shown in Figure 1. The overall structure consisted of four layers. Each layer encompassed multiple CNN and Swin Transformer blocks, accompanied by a feature fusion module. The CNN branch focused on extracting local features from OCT images, whereas the Transformer branch was responsible for capturing global information. The FFM further processed and refined the information from both branches, leveraging attention mechanisms to emphasize essential details and suppress unnecessary information.

Figure 1. The architecture of HyReti-Net model.

While the Swin Transformer captured global information, it primarily employed window-based multi-head self-attention, which was limited to local windows. Therefore, the multi-level features fusion mechanism was used to enhance the ability of global feature extraction. The fused features were transferred between FFMs: the fusion features of each layer were passed to subsequent FFMs, and features from two FFMs were added. By passing small-scale fusion features to large-scale fusion features, the feature representations at different scales were improved. The transmission direction among FFMs was shown by the red line in Figure 1. The multi-level features fusion mechanism enhanced the feature expression. The fused features were input into the classifier for the classification task. In the classifier, the fully connected layer flattened the output of the previous layer into one dimension to perform classification. In this paper, the number of neurons in the fully connected layer was set to 1,024, followed by the Softmax activation function for retinal disease diagnosis.
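
A structural sketch of how the two branches, the per-layer FFMs and the classifier could be wired together is shown below; all constructor arguments, projection modules and layer sizes are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HyRetiNetSketch(nn.Module):
    """Structural sketch of the dual-branch design; every argument is a
    hypothetical, externally constructed module (see the surrounding sections)."""

    def __init__(self, cnn_stem, cnn_stages, patch_embed, swin_stages,
                 ffms, level_projs, final_dim, num_classes):
        super().__init__()
        self.cnn_stem = cnn_stem                       # initial conv stem of the CNN branch
        self.patch_embed = patch_embed                 # patch partition + linear embedding
        self.cnn_stages = nn.ModuleList(cnn_stages)    # 4 stages of residual blocks
        self.swin_stages = nn.ModuleList(swin_stages)  # 4 stages of Swin blocks (2, 2, 6, 2)
        self.ffms = nn.ModuleList(ffms)                # one feature fusion module per layer
        self.level_projs = nn.ModuleList(level_projs)  # project fused map of layer i to layer i+1's scale
        self.norm = nn.LayerNorm(final_dim)            # classifier: layer norm ...
        self.fc = nn.Linear(final_dim, 1024)           # ... 1,024-neuron fully connected layer
        self.head = nn.Linear(1024, num_classes)       # softmax applied inside the loss

    def forward(self, x):
        d = self.cnn_stem(x)         # local branch, (B, C, H, W)
        k = self.patch_embed(x)      # global branch, reshaped to (B, C, H, W)
        prev_fused = None
        for i, (cnn, swin, ffm) in enumerate(zip(self.cnn_stages, self.swin_stages, self.ffms)):
            d, k = cnn(d), swin(k)
            # The FFM fuses both branches, adds the previous layer's (projected)
            # fused map -- the multi-level fusion path of Figure 1 -- and feeds
            # the refined features back into each branch.
            d, k, fused = ffm(d, k, prev_fused)
            if i + 1 < len(self.ffms):
                prev_fused = self.level_projs[i](fused)  # e.g. strided conv: halve size, double channels
        tokens = fused.flatten(2).transpose(1, 2)        # (B, H*W, C)
        feat = self.norm(tokens).mean(dim=1)             # layer norm + average pooling
        return self.head(torch.relu(self.fc(feat)))
```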

2.4 CNN branch

We designed a CNN branch inspired by the ResNet50 architecture, with the primary objective of extracting both local features and contextual information from images. As the model depth increased, the resolution progressively decreased while the number of channels increased. The overall structure comprised four layers, and each layer consisted of two residual blocks, as shown in Figure 2. Each residual block was made up of three convolutional blocks: a 1×1 convolution for channel reduction, a 3×3 convolution for spatial feature extraction, and a 1×1 convolution for channel expansion. In addition, a residual connection was incorporated to link the input and output. The CNN branch acquired local features by sliding convolutional kernels across the image, enabling it to more accurately capture local information and add more refined local details to the features. To improve training efficiency, the ResNet-50 backbone was pre-trained on ImageNet (1). The backbone was then fine-tuned from the pre-trained model on the OCT classification tasks.
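
The following is a minimal PyTorch sketch of such a bottleneck residual block; the stride and width choices follow the standard ResNet-50 design and are assumptions where the paper does not specify them.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck used in the CNN branch (a sketch)."""

    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),              # 1x1: channel reduction
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False),  # 3x3: spatial feature extraction
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),             # 1x1: channel expansion
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut projects the input when the shape changes, otherwise identity.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if (stride != 1 or in_ch != out_ch) else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```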

Figure 2. The structure of residual block.

2.5 Transformer branch

In the Transformer branch, we utilized the original Swin Transformer architecture as the core framework, which was responsible for extracting global information and long-range dependencies. Based on the application and analysis in Reference (13), the numbers of Swin Transformer blocks in the four layers were set to 2, 2, 6, and 2, respectively. The Swin Transformer module was shown in Figure 3.

Figure 3. The structure of Swin Transformer.

The Swin Transformer module was composed of two sub-modules. The first sub-module comprised layer normalization, a multi-layer perceptron and Window-based Multi-head Self-Attention (W-MSA). The second sub-module was analogous but adopted shifted W-MSA (SW-MSA). In addition, a patch merging module was utilized for downsampling.

Given an input image, the proposed model employed the patch partition module in the first layer to divide the image into non-overlapping patches with fixed dimensions. Subsequently, these patches were flattened along the channel dimension. Between the second and fourth layers, the patch merging module was employed for downsampling. According to the experimental results in Reference (26), the same division strategy was adopted. As features progressed from the second to the fourth layer, the resolution was progressively reduced by half at each stage, and the numbers of channels in the four layers were C, 2C, 4C and 8C, respectively.
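
A sketch of the patch merging step, following the standard Swin Transformer formulation (halving the resolution and doubling the channel count), is shown below.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsampling between layers: the resolution is halved and the channel
    count is doubled (standard Swin Transformer formulation)."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, h, w):
        # x: (B, h*w, C) token sequence
        b, _, c = x.shape
        x = x.view(b, h, w, c)
        # Gather each 2x2 neighbourhood along the channel dimension -> 4C channels.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        x = x.view(b, (h // 2) * (w // 2), 4 * c)
        return self.reduction(self.norm(x))   # (B, h/2 * w/2, 2C)
```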

Because standard self-attention computes relationships among all elements of an image, which is computationally expensive, the W-MSA and SW-MSA mechanisms were used in the Swin Transformer architecture. The principles of W-MSA and SW-MSA were shown in Figure 4.

Figure 4. The illustration of W-MSA and SW-MSA. (a) The window partition based on W-MSA; (b) The shifting process based on SW-MSA; (c) Window cyclic process based on SW-MSA; (d) The results with masked patch.

The W-MSA mechanism was used to split the whole image into patches of uniform size and apply attention separately within each window, as shown in Figure 4a. When W-MSA was utilized, the attention computation failed to establish cross-window connections, thereby constraining the ability to capture features of complex patterns. To address this limitation while maintaining the efficient computation of non-overlapping windows, the SW-MSA mechanism was used, as illustrated in Figure 4b. Figure 4c demonstrated the cyclic shifting mechanism, which was designed to redistribute the image into four sections. The sections marked A, B and C were positioned in the top-left corner. Then, these sections underwent cyclic shifting to align with their respective corresponding regions. The splitting process resulted in an increasing number of windows, which demanded more computational resources and might introduce uncertainty when calculating attention for non-contiguous regions. To address these limitations, a masking mechanism was proposed, as shown in Figure 4d. This mechanism was used to mask patches from different regions, allowing attention to be calculated only for patches within the boundaries of a single window. Through the shifting strategy, SW-MSA not only addressed the issue of limited interaction among different windows but also significantly decreased the computational burden, thereby enhancing overall computational efficiency.
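
The window partitioning and the cyclic shift used by SW-MSA can be sketched as follows; the attention mask that restricts interactions to originally adjacent patches is omitted for brevity.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws, ws, c)

def shifted_windows(x, ws):
    """SW-MSA pre-processing: cyclically shift the map by half a window before
    partitioning, so that patches from adjacent windows share a window; the
    attention mask (not shown) blocks pairs that were not actually adjacent."""
    shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
    return window_partition(shifted, ws)
```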

2.6 Feature fusion module

Given the differing priorities in information extraction between the CNN and Swin Transformer branches, the combination of these two types of information was necessary. In order to seamlessly combine the local features extracted by the CNN branch with the global features captured by the Transformer branch, we proposed a FFM, as depicted in Figure 5.

Figure 5. The structure of FFM.

The operational procedure of the FFM was outlined as follows. First, we established bidirectional links for the features from both branches. Subsequently, a channel attention mechanism was utilized to emphasize critical features. Finally, the fused features were divided into two parts along the channel dimension, producing two components, ki and di, which denoted the feature maps originating from the Swin Transformer branch and the CNN branch, respectively. These refined features were then redirected to their corresponding branches and added to the original inputs.

The whole feature fusion module comprised two parts: the Concatenation and Residual Block (CRB) and the Improved Channel Attention Block (ICAB). CRB was employed to combine and extract features from both branches. First, a reshape operation was employed to change ki of dimension (h×w)×c into h×w×c. Subsequently, this module concatenated di and ki along the channel dimension to fuse the features, which were then fed into a residual structure for further learning. The other part was ICAB. Although different channels in the feature map encompassed diverse information, not all of their features were important. Our objective was to accentuate significant regions while disregarding irrelevant environmental factors. Therefore, we introduced an enhanced channel attention mechanism. First, we employed a global average pooling operation to reshape the fused features into a one-dimensional vector. Subsequently, this vector underwent a transformation through a fully connected layer to produce attention weights. Ultimately, a weighted feature map was computed through a dot product between these attention weights and the input features. Along the channel dimension, the fused feature map was then divided into two parts, which were redirected to their corresponding branches and added to the original inputs to continue training. In conclusion, the FFM effectively integrated information from both branches. Consequently, the model was empowered to more effectively capture correlations among features.
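
A minimal sketch of one possible realization of the FFM is given below; the convolution widths, the reduction ratio of the channel attention and the handling of the previous layer's fused map are assumptions, since the paper does not list these hyperparameters.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Sketch of the feature fusion module: CRB (concatenation + residual block)
    followed by ICAB (improved channel attention); sizes are illustrative."""

    def __init__(self, c, reduction=4):
        super().__init__()
        # CRB: process the concatenated 2C-channel map with a residual conv block.
        self.crb = nn.Sequential(
            nn.Conv2d(2 * c, 2 * c, 3, padding=1, bias=False),
            nn.BatchNorm2d(2 * c), nn.ReLU(inplace=True),
            nn.Conv2d(2 * c, 2 * c, 3, padding=1, bias=False),
            nn.BatchNorm2d(2 * c),
        )
        # ICAB: global average pooling -> FC -> ReLU -> FC -> sigmoid channel weights.
        self.icab = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * c, 2 * c // reduction), nn.ReLU(inplace=True),
            nn.Linear(2 * c // reduction, 2 * c), nn.Sigmoid(),
        )

    def forward(self, d, k, prev_fused=None):
        # d: CNN features (B, C, H, W); k: Swin tokens already reshaped to (B, C, H, W);
        # prev_fused: fused map of the previous layer, projected to the same shape.
        fused = torch.cat([d, k], dim=1)
        fused = torch.relu(fused + self.crb(fused))     # residual fusion
        w = self.icab(fused).view(fused.size(0), -1, 1, 1)
        fused = fused * w                               # channel re-weighting
        if prev_fused is not None:                      # multi-level fusion
            fused = fused + prev_fused
        d_out, k_out = torch.chunk(fused, 2, dim=1)     # split back per branch
        return d + d_out, k + k_out, fused
```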

2.7 Loss function

The imbalance of training data could affect the performance of the model. Inspired by Reference (37), we used a novel hybrid loss function in this paper to address the impact of class imbalance on classification tasks. The hybrid loss function combined focal loss and correntropy loss.

1) Focal loss: Binary cross entropy was the theoretical basis of focal loss. Usually, binary cross entropy was defined as Equation 1:

$L_{ce}(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y}) = \begin{cases} -\log\hat{y}, & y = 1 \\ -\log(1-\hat{y}), & \text{otherwise} \end{cases}$    (1)

Where y represented the true label, and ŷ represented the predicted probability of y = 1. For further derivation, the cross entropy could be defined as Equation 2:

$L_{ce}(p) = -\log(p)$    (2)

Where p was ŷ if y = 1; otherwise, p was 1 − ŷ. On the basis of cross entropy, the focal loss function introduced a modulation factor (1 − p)^γ, which reduced the weight of simple examples and made the training focus on difficult examples. Then the focal loss function could be defined as Equation 3:

$L_{fl}(p) = -(1-p)^{\gamma}\log(p)$    (3)

Where γ was the focusing parameter. The contribution of easy samples decreased as the focusing parameter increased. In addition, the class imbalance problem could be alleviated by adding a weighting factor α to the focal loss. Therefore, the focal loss could be further redefined as Equation 4:

$L_{fl}(\hat{y}, y) = \begin{cases} -\alpha(1-\hat{y})^{\gamma}\log(\hat{y}), & y = 1 \\ -(1-\alpha)\hat{y}^{\gamma}\log(1-\hat{y}), & \text{otherwise} \end{cases}$    (4)

2) Correntropy loss: By adjusting the adaptability to sample distance through different L-norms, correntropy loss had better robustness against outliers. For very small errors, correntropy had the property of the L2 norm; as the error gradually increased, it eventually approached the L0 norm. Based on the principle of Reference (37), the correntropy-induced loss function LC(ŷ, y) could be defined as Equation 5:

$L_C(\hat{y}, y) = 1 - k_{\sigma}(y - \hat{y})$    (5)

Where kσ represented the kernel function. In this paper, the Gaussian kernel was used to calculate correntropy. Then LC(ŷ, y) could be expressed as Equation 6:

$L_C(\hat{y}, y) = 1 - G_{\sigma}(y - \hat{y}) = 1 - \exp\!\left(-(y - \hat{y})^2 / \sigma^2\right)$    (6)

3) The proposed loss function: By combining focal loss and correntropy loss, this paper proposed a hybrid loss function which was defined as Equation 7:

$L_{FC}(\hat{y}, y) = \begin{cases} L_{fl}(\hat{y}, y), & 0 < t \le n \\ L_C(\hat{y}, y), & n < t \le N \end{cases}$    (7)

Where t represented the current epoch, n represented the predefined threshold, and N referred to the total number of epochs. In this paper, n was set to 30: HyReti-Net was trained for n epochs using focal loss, and then correntropy loss was employed to complete the training. Correntropy loss had better generalization performance and robustness, but its non-convexity was prone to cause local minima. Therefore, we first used focal loss to pretrain the model and address the class imbalance problem.
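
A compact sketch of the loss terms defined in Equations 4, 6 and 7 is shown below; the α, γ and σ values are common defaults rather than values reported in the paper, and the binary form follows Equation 4.

```python
import torch

def focal_loss(y_hat, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss of Equation 4 (alpha and gamma are illustrative defaults)."""
    y_hat = y_hat.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - y_hat) ** gamma * torch.log(y_hat)         # y = 1
    neg = -(1.0 - alpha) * y_hat ** gamma * torch.log(1.0 - y_hat)   # y = 0
    return torch.where(y == 1, pos, neg).mean()

def correntropy_loss(y_hat, y, sigma=1.0):
    """Correntropy-induced loss of Equation 6 with a Gaussian kernel (sigma assumed)."""
    return (1.0 - torch.exp(-((y - y_hat) ** 2) / sigma ** 2)).mean()

def hybrid_loss(y_hat, y, epoch, n=30):
    """Equation 7: focal loss for the first n epochs, correntropy loss afterwards."""
    return focal_loss(y_hat, y) if epoch < n else correntropy_loss(y_hat, y)
```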

3 Results

3.1 Evaluation method

In this study, we adopted accuracy (Acc), precision (Pre), recall (Rec), F1-score (FS) and specificity (Spe) as metrics for evaluating retinal disease classification. These evaluation metrics were calculated according to Equations 8–12:

$Acc = \dfrac{TP + TN}{TP + TN + FP + FN}$    (8)

$Pre = \dfrac{TP}{TP + FP}$    (9)

$Rec/Sensitivity = \dfrac{TP}{TP + FN}$    (10)

$FS = \dfrac{2 \times Pre \times Rec}{Pre + Rec}$    (11)

$Spe = \dfrac{TN}{TN + FP}$    (12)

Where TP, TN, FP and FN represented true positive, true negative, false positive and false negative, respectively. Specifically, TP represented the number of correctly identified positive instances, and TN represented the number of negative instances accurately identified as negative. FP denoted the number of negative samples that were incorrectly classified as positive, and FN represented the number of positive instances that were mistakenly labeled as negative.
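
The metrics of Equations 8–12 can be computed per class from a confusion matrix in a one-vs-rest manner, as in the sketch below; the confusion matrix values in the example are illustrative, not results from the paper.

```python
import numpy as np

def per_class_counts(conf, k):
    """Derive TP, TN, FP and FN for class k from a confusion matrix `conf`
    (rows = true labels, columns = predicted labels)."""
    tp = conf[k, k]
    fp = conf[:, k].sum() - tp
    fn = conf[k, :].sum() - tp
    tn = conf.sum() - tp - fp - fn
    return tp, tn, fp, fn

def binary_metrics(tp, tn, fp, fn):
    """Per-class metrics of Equations 8-12 (one-vs-rest)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)                  # recall / sensitivity
    fs = 2 * pre * rec / (pre + rec)
    spe = tn / (tn + fp)
    return acc, pre, rec, fs, spe

# Example with an illustrative 3-class confusion matrix (hypothetical counts).
conf = np.array([[48, 1, 1],
                 [2, 47, 1],
                 [0, 1, 49]])
print(binary_metrics(*per_class_counts(conf, k=0)))
```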

3.2 Qualitative analysis

In this experiment, we systematically tracked the loss error for both the training and validation sets. The recorded values served as the basis for evaluating the model's classification capabilities. The error progression for the training and validation sets on the three public datasets was shown in Figure 6.

Figure 6. The errors comparison for both the training and validation sets on three public datasets. (a) Loss comparison on OCT-2014 and the average error was 0.41 ± 0.22; (b) loss comparison on OCT-2017 and the average error was 0.45 ± 0.28; (c) loss comparison on OCT-C8 and the average error was 0.49 ± 0.33.

Results from Figure 6 showed that training and validation errors decreased rapidly in the initial stage. If the rate of change of adjacent errors was less than 0.5%, we considered that the training had stabilized. On the OCT-2014, OCT-2017 and OCT-C8 databases, the minimum errors occurred at the 38th, 51st and 60th epochs, respectively, and the average errors were 0.41 ± 0.22, 0.45 ± 0.28, and 0.49 ± 0.33, respectively. Because the training and validation curves followed similar trends with small gaps, the model showed no signs of overfitting.
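
The stability criterion described here can be expressed as a small helper function, sketched below.

```python
def training_stabilized(errors, tol=0.005):
    """Return True when the relative change between adjacent epoch errors
    drops below 0.5%, the stability criterion used in this paper."""
    if len(errors) < 2:
        return False
    return abs(errors[-1] - errors[-2]) / max(abs(errors[-2]), 1e-12) < tol
```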

3.3 Quantitative analysis

In order to accurately and quantitatively analyze the performance of the model, we first provided the confusion matrices generated by HyReti-Net. As shown in Figure 7, the confusion matrices showed that HyReti-Net was capable of accurately classifying and detecting retinal diseases. The results from the confusion matrices verified the feasibility of the proposed model.

Figure 7. Confusion matrices of different datasets generated by HyReti-Net. (a) Confusion matrices from OCT-2014; (b) Confusion matrices from OCT-2017; (c) Confusion matrices from OCT-C8.

In the quantitative analysis, the performance of the network was evaluated and analyzed using five specific metrics: Acc, Pre, Rec, FS and Spe. To increase the generalizability of the model, we performed 3-fold cross-validation on the three public datasets, and the average values served as the final values. The detailed classification metrics on the test datasets were presented in Table 3.
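
The 3-fold protocol can be sketched as follows; `samples`, `labels` and `train_and_evaluate` are hypothetical placeholders for the image list, its class labels and a routine that trains HyReti-Net on one fold and returns the five metrics on the held-out part.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Sketch of the 3-fold protocol: metrics are averaged over folds.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(samples, labels):
    fold_scores.append(train_and_evaluate(samples, labels, train_idx, test_idx))
print(np.mean(fold_scores, axis=0))   # averaged Acc, Pre, Rec, FS, Spe
```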

Table 3. The detailed evaluation metrics of the HyReti-Net model.

Based on these results, HyReti-Net exhibited good performance in terms of the five specific metrics. Specifically, on the OCT-2017 and OCT-2014 datasets, the model exhibited remarkable performance in classifying the respective disease categories, with average accuracies of 99.29 and 99.11%, respectively. On the OCT-C8 dataset, HyReti-Net achieved an average accuracy of 98.55% for eight distinct cases. Data analysis showed that HyReti-Net had a high level of accuracy in identifying fundus diseases. This achievement highlighted the model's remarkable predictive performance.

3.4 Features map visualization

Given ethical considerations and the direct impact on human life and health, the interpretability of deep learning models could help clinicians, patients, and researchers comprehend, trust, and efficiently employ artificial intelligence technologies in the healthcare domain. In this section, we utilized the Grad-CAM method (38) to present Class Activation Mapping (CAM) visualizations, which could demonstrate the evidence underlying the predictions of the proposed model. The heat maps generated by the CAM technique emphasized the specific regions in an OCT image that were closely linked to the target class. As shown in Figures 8, 9, 10, the Grad-CAM visualizations were generated from the original OCT images. A deeper red color represented a stronger correlation with the predicted category. From the Grad-CAM visualizations, we found that the lesion regions all appeared in red. This demonstrated that the model paid attention to the crucial regions, which aligned with the diagnostic process of ophthalmologists.
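
A minimal hook-based Grad-CAM sketch is shown below; the choice of `target_layer` (for example, a late convolutional layer of the CNN branch) is an assumption, and this is an illustration of the general technique of Reference (38) rather than the authors' exact visualization code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Weight the target layer's activations by the spatially averaged gradients
    of the class score, then ReLU and upsample to the image size."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image.unsqueeze(0))          # image: (C, H, W)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # GAP over gradients
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))  # weighted sum + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()          # normalized heat map
```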

Figure 8. Features map visualization of HyReti-Net on OCT-2014 dataset. The first line is the original OCT images. The second line is class activation mapping visualizations based on Grad-CAM method.

Figure 9. Features map visualization of HyReti-Net on OCT-2017 dataset. The first line is the original OCT images. The second line is class activation mapping visualizations based on Grad-CAM method.

Figure 10. Features map visualization of HyReti-Net on OCT-C8 dataset. The first line is the original OCT images. The second line is class activation mapping visualizations based on Grad-CAM method.

3.5 Validation and comparative analysis

To further confirm the efficacy of the HyReti-Net model, we compared it with several widely adopted classification models. The comparative results on three public OCT datasets were presented in Table 4.

Table 4. The comparative results on three public OCT datasets.

Through analysis of the different models, the HyReti-Net model consistently showed superior results, outperforming the other selected models on all three datasets. As illustrated in Table 4, HyReti-Net achieved accuracies of 99.06, 99.44 and 98.76%, with F1-score values of 99.05, 99.16 and 98.33% on OCT-2017, OCT-2014 and OCT-C8, respectively.

To further evaluate the proposed model under limited training data scenarios, we used only one-tenth of the images from OCT-2014 to build a new dataset. The new dataset contained a total of 323 images, consisting of 214 training images, 65 validation images and 44 test images. To reduce the impact of imbalance on the classification results, we ensured that the number of images in each category was equal. We compared the performance of different models on the new dataset. The comparison results were shown in Table 5.

Table 5. The comparative results on the small OCT dataset.

The comparison results demonstrated that the CNN was more suitable than the ViT for classification tasks on small datasets. The results exhibited the shortcomings of ViT, which accorded with the conclusion from Reference (36). By integrating local and global features, the hybrid models outperformed ResNet50 and ViT. Compared with HTCNet, WaveNet-SF and MSLl-Net, HyReti-Net achieved the best performance on the new dataset, which demonstrated that HyReti-Net was able to achieve higher classification accuracy than other hybrid models under limited training data scenarios.

3.6 Ablation study

In this study, we proposed the HyReti-Net model with a dual-branch structure and the FFM module. To evaluate the effectiveness of HyReti-Net, we applied various modifications to the model. Acc, Pre, Rec, and FS were used as the evaluation metrics. Eight schemes were tested on the three public datasets, and the average values were taken as the final values. Scheme 0 was the baseline model, which combined ResNet and Transformer branches whose features were concatenated directly. Scheme 1 and Scheme 2 employed the ResNet50 and Swin architectures alone, respectively. Scheme 3 combined two ResNet50 architectures into a simple hybrid model whose features were concatenated directly instead of using CRB and ICAB. Scheme 4 was similar to Scheme 3 but combined two Transformer architectures. On the basis of the dual-branch structure, the CRB and ICAB modules were incorporated into the basic model to generate Scheme 5 and Scheme 6, respectively. Scheme 7 represented the proposed HyReti-Net model. The performance of the above eight schemes was presented in Table 6.

Table 6. The performance of eight schemes.

Comparing the results of Scheme 0, Scheme 1 and Scheme 2, Scheme 0 had better performance. This meant that the fusion of local and global features could achieve better classification results than a single type of feature. Compared with Scheme 3 and Scheme 4, Scheme 0 also had better performance, which demonstrated that multi-level feature fusion methods could improve the classification performance. From Scheme 0 to Scheme 4, none of the networks included CRB or ICAB, which isolated the impact of these modules. Building upon the dual-branch model, Scheme 5 incorporated the CRB module, resulting in substantial improvements in Acc, Pre, Rec and FS. This underscored a remarkable enhancement in performance attributed to the integration of the fusion block. Scheme 6 incorporated the ICAB module into the dual-branch model, producing further performance improvements. Specifically, compared to Scheme 5, there was an increase of 1.91% in accuracy, 0.25% in precision, 0.75% in recall, and 0.55% in FS. Ultimately, Scheme 7 incorporated both the dual-branch structure and the FFM module, achieving the highest values for all metrics: 99.25% Acc, 99.10% Pre, 99.12% Rec, and 99.11% FS. These results demonstrated the efficacy of the proposed HyReti-Net design in improving the overall performance of the model.

4 Discussion

The primary objective of our study was to develop a hybrid model named HyReti-Net for the accurate classification of retinal diseases. Inspired by many recent studies (29–33), HyReti-Net contained a dual-branch architecture for feature extraction and a FFM. By leveraging the strengths of both paradigms, HyReti-Net was able to capture local features via the CNN and global information with long-range dependencies via the Transformer. Moreover, the FFM was designed to merge these two types of information, thereby enabling the model to generate more comprehensive features for classifying OCT scans. Three different datasets were used to validate the performance and generalization of HyReti-Net.

Experimental results indicated that HyReti-Net achieved accuracies of 99.06, 99.44 and 98.76%, with F1-score values of 99.05, 99.16 and 98.33% on OCT-2017, OCT-2014 and OCT-C8, respectively. Compared to other state-of-the-art models (29–33), the performance of HyReti-Net was superior to traditional models and hybrid approaches. On the OCT-2017 dataset, HyReti-Net notably exceeded HTCNet, WaveNet-SF and MSLl-Net by 1.1, 0.92, and 0.49 percentage points in accuracy, respectively. On the OCT-2014 dataset, HyReti-Net exceeded HTCNet, WaveNet-SF and MSLl-Net by 1.59, 1.08, and 1.17 percentage points in accuracy, respectively. Similarly, on the OCT-C8 dataset, HyReti-Net exceeded HTCNet, WaveNet-SF and MSLl-Net by 3.43, 1.13 and 2.37 percentage points in accuracy, respectively. The comparative results underscored the effectiveness, robustness and superiority of HyReti-Net. Through analysis of the experimental results, we also found that classes with limited sample sizes exhibited comparatively lower precision and recall in the classification results. In addition, we evaluated and compared HyReti-Net with several classic methods; the results underscored the effectiveness and advantage of HyReti-Net, which achieved the highest classification accuracy.

In addition, an FFM module was proposed in this paper. The whole module consisted of CRB and ICAB. CRB was employed to combine and extract features from both branches. Then ICAB was utilized to emphasize critical features based on the channel attention mechanism. The proposed FFM module was different from traditional feature fusion mechanisms (29–32). Along the channel dimension, the fused feature map was divided and redirected to the corresponding branches, then added to the original inputs to continue training iteratively. Therefore, the model could achieve superior performance. Ablation studies were carried out to evaluate the effects of each component on the foundational model, as presented in Table 6. The comparison results further demonstrated the advantages of the FFM. In comparison to using a single ResNet50 or Swin Transformer branch, the dual-branch structure with the ICAB attention mechanism exhibited significant improvements in Acc, Pre, Rec, and FS. Moreover, the dual-branch structure with the FFM module achieved the best performance. The FFM module is also highly reusable. Similar to the traditional feature concatenation process, the FFM module is designed to fuse the features from two branches, and its overall structure still adopts basic operations such as convolution, pooling and residual connections while introducing the channel attention mechanism. By allowing different attention mechanisms and fusion forms, the FFM module offers strong scalability and computational efficiency.

For each public dataset, heatmaps were also generated to enhance the interpretability of OCT image classification, as shown in Figures 8, 9, 10. The heatmaps showed the specific regions in an OCT image that were closely linked to the target class, indicating that HyReti-Net was able to accurately focus on the lesion areas, which corresponded to its better classification performance.

Although our proposed hybrid model had demonstrated success, this study also had some limitations. First, due to the challenges in acquiring medical images and time limitations, HyReti-Net was trained and tested on public datasets instead of clinical datasets. Second, we only explored a dual-branch structure comprising a CNN branch and a Transformer branch; multi-branch structures could be further explored based on HyReti-Net. Third, we only used OCT images in this paper, without studying the performance on multimodal images. In the future, we will conduct research using clinical retinal images and more effective feature extraction methods. Additionally, we will focus on the application of multimodal images and external fundus disease datasets to verify the generalization ability and robustness of the model. Computational complexity analysis, including model size, parameter counts, inference speed, and training time, will be used to evaluate the practical utility of the proposed model.

5 Conclusion

Retinal diseases can lead to temporary or permanent visual impairment. In this study, we proposed a hybrid multi-scale network model, named HyReti-Net, integrating dual branches and a feature fusion module. One branch utilized ResNet50 to extract local features, while the other branch employed the Swin Transformer to capture global information and long-range dependencies. Finally, local and global features were fused in the feature fusion module. When evaluated on three public retinal datasets, HyReti-Net achieved accuracies of 99.06% in the four-category task, 99.44% in the three-category task and 98.76% in the eight-category task. Additionally, we used the Grad-CAM method to generate heat maps to improve the interpretability of the model's classification results. This study can provide a reference for the clinical diagnosis of ophthalmologists through artificial intelligence technology. Moreover, it helps improve the accuracy of retinal disease diagnosis and will play an important role in preventing blindness caused by retinal diseases.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

JY: Validation, Writing – review & editing, Software, Writing – original draft. CH: Project administration, Writing – review & editing, Methodology, Software. JW: Writing – review & editing, Project administration, Validation. BW: Methodology, Investigation, Writing – review & editing. YL: Writing – review & editing, Resources, Formal analysis. YD: Data curation, Software, Writing – review & editing. ZZ: Writing – review & editing, Conceptualization, Methodology. KT: Writing – review & editing, Software, Visualization. FL: Formal analysis, Methodology, Conceptualization, Writing – review & editing. LM: Funding acquisition, Investigation, Writing – original draft, Writing – review & editing, Supervision, Methodology.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by the Natural Science Foundation of Hunan Province, China (Grant No. 2023JJ70040).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Gen AI was used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Li, T, Bo, W, Hu, C, Kang, H, Liu, H, Wang, K, et al. Applications of deep learning in fundus images: a review. Med Image Anal. (2021) 69:101971. doi: 10.1016/j.media.2021.101971

2. Dutta, P, Sathi, KA, Hossain, MA, and Dewan, MAA. Conv-ViT: a convolution and vision transformer-based hybrid feature extraction method for retinal disease detection. J Imaging. (2023) 9:140. doi: 10.3390/jimaging9070140

3. Hassan, E, Elmougy, S, Ibraheem, MR, Hossain, MS, AlMutib, K, Ghoneim, A, et al. Enhanced deep learning model for classification of retinal optical coherence tomography images. Sensors. (2023) 23:5393. doi: 10.3390/s23125393

4. Luo, Y, Pan, J, Fan, S, Du, Z, and Zhang, G. Retinal image classification by self-supervised fuzzy clustering network. IEEE Access. (2020) 8:92352–62. doi: 10.1109/ACCESS.2020.2994047

5. Ferris, FL III, Wilkinson, CP, Bird, A, Chakravarthy, U, Chew, E, and Csaky, K. Clinical classification of age-related macular degeneration. Ophthalmology. (2013) 120:844–51. doi: 10.1016/j.ophtha.2012.10.036

6. Fercher, AF, Drexler, W, Hitzenberger, CK, and Lasser, T. Optical coherence tomography-principles and applications. Rep Prog Phys. (2003) 66:239. doi: 10.1088/0034-4885/66/2/204

7. Thomas, A, Harikrishnan, PM, Krishna, AK, Palanisamy, P, and Gopi, VP. A novel multiscale convolutional neural network based age-related macular degeneration detection using OCT images. Biomed Signal Process Control. (2021) 67:102538. doi: 10.1016/j.cmpb.2021.106294

8. Vali, M, Nazari, B, Sadri, S, Pour, EK, Riazi-Esfahani, H, Faghihi, H, et al. CNV-Net: segmentation, classification and activity score measurement of choroidal neovascularization (CNV) using optical coherence tomography angiography (OCTA). Diagnostics. (2023) 13:1309. doi: 10.3390/diagnostics13071309

9. Tsuji, T, Hirose, Y, Fujimori, K, Hirose, T, Oyama, A, Saikawa, Y, et al. Classification of optical coherence tomography images using a capsule network. BMC Ophthalmol. (2020) 20:114. doi: 10.1186/s12886-020-01382-4

10. Zhang, Q, Liu, Z, Li, J, and Liu, G. Identifying diabetic macular edema and other retinal diseases by optical coherence tomography image and multiscale deep learning. Diabetes Metab Syndr Obes. (2020) 13:4787–800. doi: 10.2147/DMSO.S288419

11. Mishra, SS, Mandal, B, and Puhan, NB. Macularnet: towards fully automated attention-based deep CNN for macular disease classification. SN Comput Sci. (2022) 3:142. doi: 10.1007/s42979-022-01024-0

12. Shen, J, Hu, Y, Zhang, X, Gong, Y, Kawasaki, R, and Liu, J. Structure-oriented transformer for retinal diseases grading from OCT images. Comput Biol Med. (2023) 152:106445. doi: 10.1016/j.compbiomed.2022.106445

13. Liu, Z, Lin, Y, Cao, Y, Hu, H, Wei, Y, Zhang, Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision. (2021):10012–22.

14. Chu, X, Tian, Z, Wang, Y, Zhang, B, Ren, H, Wei, X, et al. Twins: revisiting the design of spatial attention in vision transformers. Adv Neural Inf Process Syst. (2021) 34:9355–66. doi: 10.48550/arXiv.2104.13840

15. Zhao, J, Lu, Y, Qian, Y, Luo, Y, and Yang, W. Emerging trends and research foci in artificial intelligence for retinal diseases: bibliometric and visualization study. J Med Internet Res. (2022) 24:e37532. doi: 10.2196/37532

16. Spaide, RF, Jaffe, GJ, Sarraf, D, Freund, KB, Sadda, SR, Staurenghi, G, et al. Consensus nomenclature for reporting neovascular age-related macular degeneration data: consensus on neovascular age-related macular degeneration nomenclature study group. Ophthalmology. (2020) 127:616–36. doi: 10.1016/j.ophtha.2019.11.004

17. Varma, R, Bressler, NM, Doan, QV, Gleeson, M, Danese, M, Bower, JK, et al. Prevalence of and risk factors for diabetic macular edema in the United States. JAMA Ophthalmol. (2014) 132:1334–40. doi: 10.1001/jamaophthalmol.2014.2854

18. Ciulla, TA, Amador, AG, and Zinman, B. Diabetic retinopathy and diabetic macular edema: pathophysiology, screening, and novel therapies. Diabetes Care. (2003) 26:2653–64. doi: 10.2337/diacare.26.9.2653

19. Lu, W, Tong, Y, Yu, Y, Xing, Y, Chen, C, and Shen, Y. Deep learning-based automated classification of multi-categorical abnormalities from optical coherence tomography images. Transl Vis Sci Technol. (2018) 7:41. doi: 10.1167/tvst.7.6.41

20. Kermany, DS, Goldbaum, M, Cai, W, Valentim, CC, Liang, H, Baxter, SL, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. (2018) 172:1122–31.e9. doi: 10.1016/j.cell.2018.02.010

21. Huang, L, He, X, Fang, L, Rabbani, H, and Chen, X. Automatic classification of retinal optical coherence tomography images with layer guided convolutional neural network. IEEE Signal Process Lett. (2019) 26:1026–30. doi: 10.1109/LSP.2019.2917779

22. Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, AN, et al. Attention is all you need. Adv Neural Inf Process Syst. (2017) 30.

23. Dosovitskiy, A, Beyer, L, Kolesnikov, A, Weissenborn, D, Zhai, X, Unterthiner, T, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv. (2020). doi: 10.48550/arXiv.2010.11929

24. Hassani, A, Walton, S, Shah, N, Abuduweili, A, Li, J, and Shi, H. Escaping the big data paradigm with compact transformers. arXiv. (2021). doi: 10.48550/arXiv.2104.05704

25. Dai, Z, Liu, H, Le, QV, and Tan, M. Coatnet: marrying convolution and attention for all data sizes. Adv Neural Inf Process Syst. (2021) 34:3965–77. doi: 10.48550/arXiv.2106.04803

26. He, J, Wang, J, Han, Z, Ma, J, Wang, C, and Qi, M. An interpretable transformer network for the retinal disease classification using optical coherence tomography. Sci Rep. (2023) 13:3637. doi: 10.1038/s41598-023-30853-z

27. Xie, X, Niu, J, Liu, X, Chen, Z, Tang, S, and Yu, S. A survey on incorporating domain knowledge into deep learning for medical image analysis. Med Image Anal. (2021) 69:101985. doi: 10.1016/j.media.2021.101985

28. Ma, Z, Xie, Q, Xie, P, Fan, F, Gao, X, and Zhu, J. HCTNet: a hybrid ConvNet-transformer network for retinal optical coherence tomography image classification. Biosensors. (2022) 12:542. doi: 10.3390/bios12070542

29. Cheng, J, Long, G, Zhang, Z, Qi, Z, Wang, H, Lu, L, et al. WaveNet-SF: a hybrid network for retinal disease detection based on wavelet transform in the spatial-frequency domain. arXiv. (2025). doi: 10.48550/arXiv.2501.11854

30. Qi, Z, Hong, J, Cheng, J, Long, G, Wang, H, Li, S, et al. MSLI-net: retinal disease detection network based on multi-segment localization and multi-scale interaction. Front Cell Dev Biol. (2025) 13:1608325. doi: 10.3389/fcell.2025.1608325

31. Zhang, X, Gao, L, Wang, Z, Yu, Y, Zhang, Y, and Hong, J. Improved neural network with multi-task learning for Alzheimer's disease classification. Heliyon. (2024) 10:e26405. doi: 10.1016/j.heliyon.2024.e26405

32. Su, H, Gao, L, Wang, Z, Yu, Y, Hong, J, and Gao, Y. A hierarchical full-resolution fusion network and topology-aware connectivity booster for retinal vessel segmentation. IEEE Trans Instrum Meas. (2024) 73:1–16. doi: 10.1109/TIM.2024.3411133

33. Zuo, Q, Shi, Z, Liu, B, Ping, N, Wang, J, Cheng, X, et al. Multi-resolution visual mamba with multi-directional selective mechanism for retinal disease detection. Front Cell Dev Biol. (2024) 12:1484880. doi: 10.3389/fcell.2024.1484880

34. Kermany, D, Zhang, K, and Goldbaum, M. Labeled optical coherence tomography (OCT) and chest X-ray images for classification. Mendeley Data. (2018). doi: 10.17632/rscbjbr9sj.2

35. Srinivasan, PP, Kim, LA, Mettu, PS, Cousins, SW, Comer, GM, Izatt, JA, et al. Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images. Biomed Opt Express. (2014) 5:3568–77. doi: 10.1364/BOE.5.003568

36. Issa, M, Sukkarieh, G, Gallardo, M, Sarbout, I, Bonnin, S, Tadayoni, R, et al. Applications of artificial intelligence to inherited retinal diseases: a systematic review. Surv Ophthalmol. (2025) 70:255–64. doi: 10.1016/j.survophthal.2024.11.007

37. Luo, X, Li, J, Chen, M, Yang, X, and Li, X. Ophthalmic disease detection via deep learning with a novel mixture loss function. IEEE J Biomed Health Inform. (2021) 25:3332–9. doi: 10.1109/JBHI.2021.3083605

38. Selvaraju, RR, Cogswell, M, Das, A, Vedantam, R, Parikh, D, and Batra, D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision. (2017):618–26.

Keywords: vision transformer, convolutional neural network, feature fusion, retinal diseases, classification

Citation: Yang J, Hsu C, Wang J, Wu B, Lu Y, Ding Y, Zhao Z, Tang K, Lu F and Ma L (2025) HyReti-Net: hybrid retinal diseases classification and diagnosis network using optical coherence tomography. Front. Med. 12:1660920. doi: 10.3389/fmed.2025.1660920

Received: 07 July 2025; Accepted: 15 September 2025;
Published: 29 September 2025.

Edited by:

Yongfei Wu, Taiyuan University of Technology, China

Reviewed by:

Jianhao Bai, University of Miami Health System, United States
Jin Hong, Zhongshan Institute of Changchun University of Science and Technology, China

Copyright © 2025 Yang, Hsu, Wang, Wu, Lu, Ding, Zhao, Tang, Lu and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Liwei Ma, 18900913588@163.com
