VTG-Net: A CNN Based Vessel Topology Graph Network for Retinal Artery/Vein Classification

From diagnosing cardiovascular diseases to analyzing the progression of diabetic retinopathy, accurate retinal artery/vein (A/V) classification is critical. Promising approaches for A/V classification, ranging from conventional graph based methods to recent convolutional neural network (CNN) based models, have been proposed. However, traditional graph based methods cannot utilize the deep hierarchical features extracted by CNNs, while current CNN based methods fail to incorporate vessel topology information; these limitations hinder their effectiveness. In this paper, we propose a new CNN based framework, VTG-Net (vessel topology graph network), for retinal A/V classification that incorporates vessel topology information. VTG-Net exploits retinal vessel topology along with CNN features to improve A/V classification accuracy. Specifically, we transform vessel features extracted by a CNN in the image domain into a graph representation that preserves the vessel topology. Then, by exploiting a graph convolutional network (GCN), we enable our model to learn both CNN features and vessel topological features simultaneously. The final prediction is attained by fusing the CNN and GCN outputs. Using the publicly available AV-DRIVE dataset and an in-house dataset, we verify the high performance of our VTG-Net for retinal A/V classification over state-of-the-art methods (with ~2% improvement in accuracy on the AV-DRIVE dataset).


INTRODUCTION
The retinal vasculature is the only vascular network of the human body that is visible to non-invasive imaging techniques, and thus analysis of retinal vascular structures is a common way to diagnose a number of diseases. Signs such as arteriovenous nicking, arteriolar constriction, vessel dilation, and tortuosity alteration are vital for examining various cardiovascular diseases, diabetic retinopathy, and hypertension (1)(2)(3). Specifically, the arteriolar-to-venular ratio (AVR) is a key biomarker for quantifying the severity of such diseases. Hence, accurate classification of retinal vessels into arteries/veins (A/V) is of significant clinical interest.
Significant research has been done on automatic A/V classification. Early studies (4) focused on designing hand-crafted features. To exploit the tree-shaped retinal vasculature (5), graph based methods were proposed (3,6,7); such methods used the segmented vessel structures to generate a graph preserving the vessel topology, and the graph was then traversed for accurate vessel classification. Recently, convolutional neural network (CNN) based approaches for A/V classification have garnered considerable interest. In (8), a U-Net (9) based method was used for A/V classification. A SegNet (10) inspired encoder-decoder architecture (11) was proposed for pixel-wise classification. A multi-task framework with spatial activation was proposed (12) for simultaneous vessel segmentation and classification. Although outperforming traditional graph based methods, CNN approaches still suffer from several drawbacks: (i) limited vessel connectivity; (ii) multiple class assignments on a single vessel segment. Recently, Chen et al. (13) proposed a generative adversarial network based method in which a topology preserving module with triplet loss was introduced to address the issue of limited connectivity of classified vessels. However, effective solutions that address both of these drawbacks of known CNN based approaches remain highly sought.
CNN based approaches commonly use a series of feature extractors (also called spatial filters, kernels, or channels) to extract hierarchical information. Each filter extracts information from a fixed size spatial input neighborhood [the receptive field (14)] and propagates it to the output. Current spatial feature extraction methods cannot handle the issues of multiple class assignment and limited vessel connectivity well (e.g., see Figure 1). Some cases that are seemingly simple for graph based methods (3,6,7) can be wrongly classified by CNNs, possibly because their feature extractors do not capture vessel topology effectively. Thus, we believe that incorporating a deep graph based model that can effectively capture vessel topology into a CNN based approach will improve A/V classification.
Recently, graph convolutional network (GCN) models have been shown to be effective for analyzing graph-structured data. Information propagation on graphs can be formulated by conditioning the learning models on both such data and the adjacency matrices of the underlying graphs. Known approaches (15)(16)(17)(18)(19) have explored graph convolution for learning graph data in various applications, such as e-commerce (customer-product interaction), chemistry (molecule interaction), and citation networks (author-paper interaction). For retinal vessel classification, graph convolution was first proposed in (20) by generating a graph representation whose nodes were defined using a sampled vessel skeleton; graph edge information was extracted from the vessel skeleton, and graph node features were sampled from CNN feature maps using node locations. In (21), a model was proposed that used only vessel pixels as graph nodes, ignoring all non-vessel pixels; graph edges were built using a local patch based neighborhood, and node features were extracted from CNN feature maps using vessel segmentation masks. Although quite effective, these approaches failed to exploit the full potential of GCNs by ignoring non-vessel pixels in graph generation and representation.
To improve A/V classification on fundus images by incorporating vessel topological features with CNN features, we propose VTG-Net (vessel topology graph network). VTG-Net exploits graph convolution based learning by strategically transforming the hierarchical CNN features of an input fundus image into a graph representation that preserves vessel topology.
Specifically, using a CNN model trained on the input dataset, we first extract image features along with the segmented vessels in the input images. Next, by using CNN features and the segmented vessels (providing the underlying graph structure), a graph representation is produced while preserving the non-vessel pixels as isolated graph nodes. Employing a GCN, we classify the generated graph by extracting its topological features. Lastly, by fusing the CNN output and GCN output, the final prediction is attained.
In contrast to the known GCN based methods for A/V classification, our VTG-Net seeks to address the issue of broken vessels by retaining non-vessel pixels as (isolated) graph nodes. Our approach hinges on the observation that, if the information content of non-vessel pixels is discarded, the errors generated by the CNN (disconnected vessels due to, e.g., low image quality or limited model capacity) will propagate and cannot be corrected. The inclusion of isolated background nodes may facilitate CNN error correction, since GCNs in general leverage not only edge information but also node features for classification. Further, GCN features learned by VTG-Net from the connected graph portions (positive vessel examples) can help classify the disconnected portions. Disconnected vessels can still be classified with good accuracy using node features, since graph edges need not necessarily encode node similarity (the same label) (15), which is useful for A/V classification.
We evaluate our VTG-Net using a public dataset AV-DRIVE (22) and an in-house dataset, and our experimental results show its high efficacy.
The rest of this paper is organized as follows. In section 2, our proposed framework is presented. Experimental results are discussed in section 3. Ablation analysis is provided in section 4. Section 5 concludes the paper.

PROPOSED FRAMEWORK
Figure 2 gives an overview of our VTG-Net framework, which consists of three main steps. (1) A CNN model (Figure 2) is trained using the input dataset. (2) The extracted features and segmented vessels from the CNN are used to generate a vessel topology graph. (3) A GCN model is trained using the generated graphs to produce classified output (in the blue box of Figure 2). The final prediction is attained by fusing the CNN output and GCN output.

Graph Convolution Based Topology Analysis
In contrast to standard convolution, where information is exchanged only in a small neighborhood (determined by the filter receptive field), graph convolution enables long range information exchange by incorporating the adjacency matrices of graphs into message passing (23). Assume an undirected graph $G = (V, E)$ with $N$ nodes $v_i \in V$ (each node containing $C$ features) and $M$ edges $(v_i, v_j) \in E$. The edge connectivity (capturing topological neighboring relations) is represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$. The spectral graph convolution of a tensor $x \in \mathbb{R}^{N \times C}$ with a filter $g_\theta$ is defined as $g_\theta \star x = U g_\theta U^T x$, where $U$ is the matrix of eigenvectors of the graph Laplacian matrix $L$ (15), which is a matrix representation of the graph $G$. $L$ is defined as $L = I_N - D^{-1/2} A D^{-1/2}$, where $D$ ($D_{ii} = \sum_j A_{ij}$) is a diagonal matrix of node degrees and $I_N$ is the identity matrix (24,25). To reduce the cost of computing $U g_\theta U^T$, the above graph convolution is approximated using a truncated expansion of Chebyshev polynomials $T_k(x)$ up to the $K$-th order, i.e., $g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x$ (26), where the $\theta'_k$ are the filter parameters acting as node feature transformers. The rescaled graph Laplacian matrix $\tilde{L} = \frac{2}{\lambda_{max}} L - I_N$ ($\lambda_{max}$ is the largest eigenvalue of $L$) can be viewed as an encoder of the topological information of the graph $G$.
In (15), a first order approximation of the Chebyshev polynomial ($K = 1$) is shown to be effective. Using $K = 1$ and $\lambda_{max} = 2$, the graph convolution can be approximated as:

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x, \quad (1)$$

where $\theta$ is chosen as $\theta = \theta'_0 = -\theta'_1$ to constrain the number of parameters. To include self-connections of nodes in localized aggregation ($\hat{A} = A + I_N$) and to avoid vanishing/exploding gradients ($\hat{D}_{ii} = \sum_j \hat{A}_{ij}$), a normalization trick was proposed (15): $I_N + D^{-1/2} A D^{-1/2} \rightarrow \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$. Applying this normalization trick to Equation (1), the graph convolution can be generalized as:

$$Y = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X \Theta, \quad (2)$$

where $X \in \mathbb{R}^{N \times C}$ is the matrix of node feature vectors of the graph ($N$ nodes with $C$ dimensional features), $\Theta \in \mathbb{R}^{C \times F}$ is the matrix of filter parameters extracting $F$ hidden features, and $Y \in \mathbb{R}^{N \times F}$ is the output of the graph convolution operation.
Using the graph convolution shown in Equation (2), a neural network model $f(X, A)$ is trained [unlike a standard convolutional model $f(X)$] by conditioning $f(\cdot)$ simultaneously on the matrix of node features and the adjacency matrix of the graph [$\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X$ in Equation (2)]. Further, similar to a CNN, by stacking multiple layers performing graph convolution, hierarchical topological features can be extracted by a GCN model. Both the node definition (node features) and the graph structure (edge connectivity) play a key role in determining information propagation in a GCN. In the next section, we describe how we utilize the extracted CNN features and segmented vessel structures to generate the needed graph representation for our VTG-Net.
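To make the propagation rule concrete, below is a minimal sketch of a single graph convolution layer implementing Equation (2) in PyTorch. This is our own illustration (dense matrices for readability; an actual implementation would use sparse operations), not the exact code of VTG-Net.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One graph convolution layer per Equation (2):
    Y = D^{-1/2} (A + I) D^{-1/2} X Theta, with dense matrices for clarity."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Theta in R^{C x F}: the node feature transformer.
        self.theta = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, C) node features; adj: (N, N) adjacency matrix A.
        n = adj.size(0)
        a_hat = adj + torch.eye(n, device=adj.device)         # add self-connections
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))   # D^{-1/2}
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt              # normalized adjacency
        return a_norm @ self.theta(x)                         # aggregate, then transform
```

Note that isolated nodes are handled gracefully here: with the added self-connection, their degree is 1, so they simply propagate a transformed version of their own features.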

Graph Representation Generation
To leverage a GCN model to incorporate vessel topological features with the extracted CNN features, a graph representation of the CNN features is used. We propose a graph representation $G = (V, E)$, which is generated utilizing the CNN features for its nodes and the underlying vessel structure for its edge connectivity. Our proposed method for graph representation generation is illustrated in Figure 3 along with its major components. We first explain the CNN feature extraction, followed by the vessel structure generation. Finally, we combine these two types of information to generate our graph representation.

CNN Feature Extraction
For a CNN performing pixel-wise A/V classification [CNN output $\in \mathbb{R}^{P \times Q \times CL_{out}}$, $CL_{out}$ = (background, artery, crossing+unknown, vein)] on an input image $\in \mathbb{R}^{P \times Q \times CL_{in}}$ (i.e., height × width × channels), we can assume that the last (uppermost) layer of the network is the classifier, while the remaining layers function as the feature extractor. Utilizing the $H$ output features of the feature extractor, the classifier generates the final class probabilities ($\mathbb{R}^{P \times Q \times H} \rightarrow \mathbb{R}^{P \times Q \times CL_{out}}$). For instance, a convolutional layer with a $1 \times 1$ filter $f_{1 \times 1} \in \mathbb{R}^{1 \times 1 \times CL_{out}}$ is used as the CNN classifier in (12,13). Thus, we use the input feature maps of the last $1 \times 1$ convolution layer as our representative CNN features.
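As an illustration of how such features can be tapped in PyTorch, the sketch below registers a forward hook on the final $1 \times 1$ convolution; the attribute name `classifier` is our assumption for exposition, not a name from the cited works.

```python
import torch
import torch.nn as nn

def extract_cnn_features(model: nn.Module, image: torch.Tensor):
    """Capture the input feature maps of the model's final 1x1 conv classifier.

    Assumes `model.classifier` is the last 1x1 convolution (hypothetical name).
    Returns (logits, features), with features of shape (B, H, P, Q).
    """
    captured = {}

    def hook(module, inputs, output):
        captured["features"] = inputs[0].detach()  # input to the 1x1 classifier

    handle = model.classifier.register_forward_hook(hook)
    logits = model(image)  # (B, CL_out, P, Q)
    handle.remove()
    return logits, captured["features"]
```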

Vessel Structure Extraction
The underlying retinal vessel structure (captured by the segmented vessels) provides a guide on the connectivity among pixels of the input image. Thus, we use the segmented vessels to construct our graph representation. Specifically, each pixel of the input image is treated as a node in $G$, and if two adjacent pixels are both classified as the vessel class, an edge connects them in $G$. To identify all vessel pixels, the multi-class pixel-wise classified CNN output is converted to foreground/background classification (i.e., $\mathbb{R}^{CL_{out}} \rightarrow \{0, 1\}$). Then for each node (i.e., each pixel) of our graph representation $G$ ($N = P \times Q$), we explore the pixel's 8-connected neighborhood (shown in Figure 3). If and only if both adjacent $v_i$ and $v_j$ belong to the segmented foreground, $(v_i, v_j) \in E$, and the adjacency matrix $A$ of $G$ is updated accordingly. Background pixels (non-vessel pixels) are represented as isolated nodes in $G$ (shown in Figure 4).
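A minimal sketch of this edge construction is given below (our own implementation; a dense adjacency matrix is used for clarity, though a sparse representation would be needed at full image scale).

```python
import numpy as np

def build_vessel_adjacency(mask: np.ndarray) -> np.ndarray:
    """Build the N x N adjacency matrix A (N = P*Q) from a binary vessel mask.

    An edge connects two pixels iff both are vessel pixels and are 8-connected
    neighbors; background pixels remain isolated nodes (rows of zeros).
    """
    p, q = mask.shape
    adj = np.zeros((p * q, p * q), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]  # 8-connected neighborhood
    for y, x in zip(*np.nonzero(mask)):          # iterate over vessel pixels only
        for dy, dx in offsets:
            ny, nx = y + dy, x + dx
            if 0 <= ny < p and 0 <= nx < q and mask[ny, nx]:
                adj[y * q + x, ny * q + nx] = 1  # both endpoints are vessel pixels
    return adj
```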

Graph Representation Generation
Using the map of extracted CNN features and the vessel structure, we generate our graph representation $G = (V, E)$, with $N$ nodes and an adjacency matrix $A$. By combining the image channels ($CH$; RGB for fundus images) as additional features (shown in a blue box in Figure 3), each node has a feature vector of length $H + CH$. Let $X \in \mathbb{R}^{N \times (H+CH)}$ be the matrix of feature vectors of all the $N$ nodes in $G$. Combining $X$ and $A$, we are now ready to learn the GCN model $f(X, A)$ [i.e., to determine the values of the parameters of the model $f(X, A)$] by using Equation (2) for information propagation on our graph representation $G$.
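The node feature assembly itself is straightforward; below is a minimal sketch (our own, assuming the CNN feature maps are already at the input resolution so they align with the image pixels).

```python
import numpy as np

def build_node_features(cnn_features: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Stack per-pixel CNN features (P, Q, H) with image channels (P, Q, CH)
    into the node feature matrix X of shape (N, H + CH), where N = P*Q."""
    p, q, _ = cnn_features.shape
    stacked = np.concatenate([cnn_features, image], axis=-1)  # (P, Q, H + CH)
    return stacked.reshape(p * q, -1)                         # one row per node
```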

Graph Classification and Fusion
To extract hierarchical (topological) features, we propose a multi-layer GCN model, shown in Figure 5. Using the graph convolution operation [defined in Equation (2)], $X$ is transformed into $H'$ hidden feature channels ($\mathbb{R}^{N \times (H+CH)} \rightarrow \mathbb{R}^{N \times H'}$; see the left orange box in Figure 5). After activation and dropout (27), another graph convolution operation converts the hidden features into output class probabilities, i.e., $\mathbb{R}^{N \times H'} \rightarrow \mathbb{R}^{N \times CL_{out}}$. Using an appropriate loss function (e.g., cross-entropy) and gradient back-propagation, the model parameters ($\Theta$) of the GCN model $f(X, A)$ are learned.
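A minimal two-layer version of such a model, reusing the `GraphConvLayer` sketched earlier, could look as follows (the hidden width and dropout rate here are illustrative choices, not the paper's tuned values).

```python
import torch.nn as nn
import torch.nn.functional as F

class VesselGCN(nn.Module):
    """Two graph convolutions: node features -> hidden features -> class scores."""

    def __init__(self, in_features: int, hidden: int, num_classes: int,
                 dropout: float = 0.5):
        super().__init__()
        self.gc1 = GraphConvLayer(in_features, hidden)  # R^{N x (H+CH)} -> R^{N x H'}
        self.gc2 = GraphConvLayer(hidden, num_classes)  # R^{N x H'} -> R^{N x CL_out}
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))                    # activation
        x = F.dropout(x, self.dropout, self.training)   # dropout (27)
        return self.gc2(x, adj)                         # per-node class logits
```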
To obtain more accurate classification, we further fuse the pixel-wise classified CNN output ($\mathbb{R}^{P \times Q \times CL_{out}}$) and the GCN output ($\mathbb{R}^{N \times CL_{out}}$) to generate the final output of our model (i.e., $\mathbb{R}^{P \times Q \times CL_{out}} \odot \mathbb{R}^{N \times CL_{out}} \rightarrow \mathbb{R}^{N \times CL_{out}}$, where $\odot$ denotes a fusion operation). One possible way to perform this fusion is to use an agreement based voting scheme in which a class is assigned only when both outputs agree (e.g., $R^{N \times CL_i}_{Fused} = R^{P \times Q \times CL_i}_{CNN} \cap R^{N \times CL_i}_{GCN}$, where $i$ denotes an output class); disagreements between the CNN and GCN outputs are ignored. Another way is to assign weights to the CNN and GCN output class probabilities ($R^{N \times CL_i}_{Fused} = w_{CNN} R^{P \times Q \times CL_i}_{CNN} + w_{GCN} R^{N \times CL_i}_{GCN}$, where $w_{CNN}$ and $w_{GCN}$ are the weights for the CNN and GCN output class probabilities, respectively). After the fusion of individual class probabilities, a 50% threshold is applied to generate the final output for each class. Various fusion options with experimental details are presented in section 4.5.
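Both fusion variants can be sketched compactly as below (our own illustration; `w_cnn` and `w_gcn` are free parameters rather than the paper's tuned values, and both outputs are assumed to be flattened to per-node class probabilities of shape (N, CL_out)).

```python
import numpy as np

def fuse_outputs(cnn_probs: np.ndarray, gcn_probs: np.ndarray,
                 w_cnn: float = 0.5, w_gcn: float = 0.5,
                 agreement: bool = False) -> np.ndarray:
    """Fuse per-node class probabilities from the CNN and GCN.

    Agreement voting: a class is assigned only where both outputs agree.
    Weighted fusion: blend probabilities, then apply the 50% threshold.
    """
    if agreement:
        cnn_hard = cnn_probs.argmax(axis=1)
        gcn_hard = gcn_probs.argmax(axis=1)
        fused = np.zeros_like(cnn_probs)
        agree = cnn_hard == gcn_hard
        fused[agree, cnn_hard[agree]] = 1.0          # assign only on agreement
        return fused
    blended = w_cnn * cnn_probs + w_gcn * gcn_probs  # weighted class probabilities
    return (blended >= 0.5).astype(np.float32)       # 50% threshold per class
```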

Datasets
We use a public dataset AV-DRIVE (22) and an in-house dataset (which we call the Tongren dataset) to evaluate our VTG-Net for retinal A/V classification. The AV-DRIVE dataset (22) provides pixel-level artery/vein labels for the 40 color fundus images of the DRIVE dataset, which are evenly split into 20 training and 20 test images.

Experimental Setup
For the CNN training, we use PyTorch with the He initialization (28). To limit overfitting, data augmentation is performed using random flipping and rotation (14). A standard U-Net (9) is used as the CNN model. The CNN training uses a cross-entropy loss and the Adam optimizer (29) ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1e{-}08$) with an initial learning rate of 2e-05, which is halved every 10k epochs over 20k epochs of training. Following known studies (13), we use accuracy, sensitivity, and specificity as evaluation metrics. For variability analysis,
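As a sketch, the training configuration described above might be set up as follows (our own packaging of these settings; the function name is hypothetical).

```python
import torch
import torch.nn as nn

def configure_training(model: nn.Module):
    """Apply He initialization and build the optimizer/schedule stated above."""
    # He initialization (28) for convolutional layers.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-05,
                                 betas=(0.9, 0.999), eps=1e-08)
    # Halve the learning rate every 10k epochs (20k epochs of training in total).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=10000, gamma=0.5)
    criterion = nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```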

Results
Quantitative results obtained on the AV-DRIVE dataset (22) and the Tongren dataset are shown in Table 1. On the AV-DRIVE dataset, comparison is performed with graph based (3,6,7,30), deep learning (DL) based (8,12,13,31), and GCN based (21) methods. Evaluation under the same criteria as used by known studies reveals that VTG-Net achieves a mean accuracy (Acc) of 98.11% on the AV-DRIVE dataset, outperforming all state-of-the-art methods. In comparison with CNN-only approaches, utilizing the GCN improves classification accuracy and sensitivity (with p = 0.003 and p = 0.004, respectively). The improvement in specificity by the GCN is not statistically significant (p > 0.05) over the CNN-only approaches. Qualitative results of several example cases are given in Figure 6. Improved connectivity and reduction in multiple class assignments achieved by our VTG-Net are highlighted in Figure 6I, compared to the CNN-only outputs shown in Figure 6H.
Additional qualitative examples are shown in Figure 7, highlighting the CNN-only outputs and GCN-only outputs. In Figure 7G, a failure case produced by GCN with multiple class assignments is shown. On the Tongren dataset, VTG-Net yields considerably improved classification accuracy compared to the CNN-only method (p < 0.05).
We should mention that a main limitation of our evaluation is the relatively small size of the datasets used. For the public AV-DRIVE dataset, we followed the standard training/test split adopted in existing work [e.g., (12,13)], and our VTG-Net outperformed those methods on this dataset. In our future work, we plan to conduct training and validation of VTG-Net on larger datasets for a more thorough evaluation.

ABLATION STUDY
We perform a systematic ablation study on the graph structures, along with the different components of our proposed framework. All ablation experiments use the AV-DRIVE dataset to examine performance.

Graph Node Choices
Our framework utilizes isolated background pixels along with vessel pixels as graph nodes. To analyze the contribution of such isolated nodes, we remove all of them from the graph representation. Experiments are conducted using a graph containing only vessel pixels as graph nodes, and the results are shown in Table 2. Interestingly, in the absence of the isolated nodes, accuracy degrades; with the additional node information present, VTG-Net is able to improve classification accuracy.

Graph Node Feature Assignment
In VTG-Net, each graph node contains CNN features (some examples of hidden feature maps are shown in Figure 8).

Graph Edge Arrangement
VTG-Net utilizes the CNN-segmented vessels for graph edge assignment (shown in Figure 3), by exploring each node's (pixel's) 8-connected neighborhood. Note that during output generation, the CNN may incur some errors in vessel segmentation and classification. As thresholded classification is used to generate the CNN-segmented vessels, errors in the segmented vessels may affect the final output generation. Such CNN errors are likely to occur around the segmented vessels (e.g., broken vessels). In order to include the likely error pixels/nodes in the graph structure for possible GCN correction, we explore dilation of the segmented vessels for graph generation. Specifically, during graph generation, we dilate the segmented vessels with different dilation rates (e.g., see Figure 9). Using a disk-shaped area of radius r, dilation is performed on each segmented vessel pixel (with the pixel as the center of the disk area). The results are shown in Table 4. Observe that for a small dilation rate (r = 1), improvements in accuracy and specificity are observed. However, dilation with a larger r results in accuracy degradation.
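As a sketch, this dilation step can be written with scikit-image (a minimal version under our own naming; `r` is the disk radius varied in Table 4).

```python
import numpy as np
from skimage.morphology import binary_dilation, disk

def dilate_vessel_mask(mask: np.ndarray, r: int) -> np.ndarray:
    """Dilate a binary vessel mask with a disk-shaped structuring element of
    radius r, so that likely error pixels near vessels become graph nodes."""
    return binary_dilation(mask.astype(bool), footprint=disk(r))
```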

Graph Neural Network Models
In VTG-Net, we use GCN (15) to include topological features for A/V classification (as shown in Figure 5). Here, we experiment with various graph message passing techniques to evaluate the effects of different graph convolution models (16,17,32). The results are shown in Table 5. Note that the method in (16) (with k = 1) exhibits the best specificity. Compared with GCN (15), the other graph convolution models show no improvement in accuracy.

Fusion
VTG-Net utilizes an agreement based voting scheme between the CNN output and GCN output to generate the final fused output ($R^{N \times CL_i}_{Fused} = R^{P \times Q \times CL_i}_{CNN} \cap R^{N \times CL_i}_{GCN}$, where $i$ denotes an output class), as discussed in section 2.3. Here, we experiment with different fusion options by assigning different weights to the CNN and GCN output class probabilities ($R^{N \times CL_i}_{Fused} = w_{CNN} R^{P \times Q \times CL_i}_{CNN} + w_{GCN} R^{N \times CL_i}_{GCN}$, where $w_{CNN}$ and $w_{GCN}$ are the weights for the CNN and GCN output class probabilities, respectively). To achieve this, each individual class probability is stored separately (without applying argmax to either the CNN or GCN output) and used for fusion. After fusion, a threshold of 0.5 is used to generate the final output for each class. The results are shown in Table 6. Observe that the test cases with higher GCN class probability weights yield better sensitivity and specificity compared to the cases with higher CNN class probability weights. This is expected, since the GCN output is a more refined version of the CNN output.