Dynamic Deep Networks for Retinal Vessel Segmentation

Segmenting the retinal vasculature entails a trade-off between how much of the overall vascular structure we identify vs. how precisely we segment individual vessels. In particular, state-of-the-art methods tend to under-segment faint vessels, as well as pixels that lie on the edges of thicker vessels. Thus, they underestimate the width of individual vessels, as well as the ratio of large to small vessels. More generally, many crucial bio-markers---including the artery-vein (AV) ratio, branching angles, number of bifurcation, fractal dimension, tortuosity, vascular length-to-diameter ratio and wall-to-lumen length---require precise measurements of individual vessels. To address this limitation, we propose a novel, stochastic training scheme for deep neural networks that better classifies the faint, ambiguous regions of the image. Our approach relies on two key innovations. First, we train our deep networks with dynamic weights that fluctuate during each training iteration. This stochastic approach forces the network to learn a mapping that robustly balances precision and recall. Second, we decouple the segmentation process into two steps. In the first half of our pipeline, we estimate the likelihood of every pixel and then use these likelihoods to segment pixels that are clearly vessel or background. In the latter part of our pipeline, we use a second network to classify the ambiguous regions in the image. Our proposed method obtained state-of-the-art results on five retinal datasets---DRIVE, STARE, CHASE-DB, AV-WIDE, and VEVIO---by learning a robust balance between false positive and false negative rates. In addition, we are the first to report segmentation results on the AV-WIDE dataset, and we have made the ground-truth annotations for this dataset publicly available.


Introduction
Retinal vessels provide the only non-invasive view of the cardiovascular system. Thus, they are a key diagnostic feature for a number of diseases, including diabetic retinopathy [1], coronary heart disease [2], and atherosclerosis [3]. However, the current standard of care requires manual inspection by an ophthalmologist, which makes it more challenging for people in developing nations and low-income communities to receive care. For example, only 30% of African Americans in southern Los Angeles reported being screened for diabetic retinopathy, despite being the most at-risk ethnicity [4]. Therefore, it is vital to develop automatic retinal analysis methods to improve screening rates and public health outcomes.
The first step in a retinal analysis pipeline is to segment the regions in the image that correspond to the vasculature. Formally, vessel segmentation is a binary classification problem, but it presents unique challenges. First, existing datasets are small-typically on the order of 20-40 images-because an ophthalmologist has to manually trace every vessel in each fundus image. In addition, we need to classify hundreds of thousands or even millions of pixels per image, and the labels of nearby pixels are correlated.
Deep neural networks are the state of the art for a wide range of classification problems, but, due to the aforementioned challenges, training a deep network to segment retinal images is not straightforward. The two main strategies for applying deep learning to this domain are: (1) dividing each image into small patches to maximize the number of training samples [5] or (2) combining traditional convolutional layers with upsampling to learn both local and global features [6]. In particular, the U-net architecture [6] and its extensions (e.g., Recurrent U-net [7]) have achieved state-of-the-art results by applying the latter strategy.
However, U-net and other deep learning-based methods favor precision over recall, i.e., they tend to classify ambiguous pixels as background. This strategy is statistically sound since retinal images have about a nine to one ratio of background to vessel pixels, but it under-segments faint vessels, as well as pixels that lie on the edges of thicker vessels. Thus, current methods underestimate the width of individual vessels, as well as the ratio of large to small vessels. This, in turn, compromises later diagnoses since many crucial bio-markers-including the artery-vein (AV) ratio, branching angles, number of bifurcation, fractal dimension, tortuosity, vascular length-to-diameter ratio and wall-to-lumen length-require precise measurements of individual vessels.
In this paper, we propose a novel, two-stage segmentation method that combines both upsampling and patch-based processing. Our approach also leverages loss functions with stochastic weights to better detect faint and ambiguous vessel pixels. As our experiments show, our proposed pipeline achieves state-of-the-art results on five vessel segmentation datasets, including conventional fundus images [8][9][10], wide field-of-view (FOV) images [11], and low-resolution video stills and mosaics [12].
As Fig. 1 shows, we separate the overall vessel segmentation problem into two stages: (1) likelihood estimation and (2) targeted prediction. In the first half of our pipeline, we train a U-net architecture to estimate the vessel likelihood of each pixel using a stochastic cross entropy loss. That is, we randomly assign a weight to each class on each training iteration, which force the network to alternate between higher false positive vs. higher false negative rates during training, preventing it from getting stuck in a single minimum. In the second half of our pipeline, we first separate potential vessel pixels into two categories-easy and ambiguous-and then train a patch-based network to classify the latter. (Easy pixels can be classified with a simple threshold.) This second network, which we refer to as mini-U-net, is a scaled-down version of the network used in the first stage. To the best of our knowledge, we are the first to apply stochastic weights for vessel segmentation, as well as the first to combine both upsampling and patch-based learning into a single pipeline.
The rest of this paper is organized as follows. First, we discuss related work on vessel segmentation, particularly deep learning-based approaches. Then, we detail our complete methodology, including our stochastic loss function and targeted prediction steps. We then present and discuss experimental results on five datasets and explore avenues for future work.

Related work
Vessel segmentation has a long history, although the advent of deep neural networks has yielded significant improvements in recent years. Earlier techniques relied primarily on handcrafted features, including matched filters [13], quadrature filters [14], and Gabor filters [15,16]. The latter approach, in particular, uses a predefined kernel bank to incorporate all vessel widths and orientations. However, handcrafted features are limited by our ability to analytically model the segmentation process. Other classic techniques, as surveyed further in [17], include piecewise thresholding [9], region growing [18], and concavity measurements [19]. In addition to handcrafted features, some prior approaches used graph-theoretic techniques to trace the vascular structure, including shortest-path tracking [12] and the fast marching algorithm [20].
Since the breakthrough by convolutional neural networks (CNNs) on the ImageNet dataset in 2012 [21], deep learning has achieved tremendous success across a wide array of machine learning tasks [21][22][23][24]. Currently, deep networks are the state of the art in vessel segmentation. In particular, the most successful deep architecture in this domain-U-net [6]-combines traditional convolutional layers with feature upsampling to estimate both local and global image features. As noted above, earlier deep learning approaches [25,26] divided an image into a series of patches and applied a classifier to each patch independently. For instance, Dasgupta  We separate vessel segmentation into simpler tasks. The steps in gray boxes are fixed, while the ones in blue require training. In (a), we first extract the green channel and preprocess it with contrast adaptive histogram equalization. Then, in (b) we train a U-net architecture using a dynamic loss function. This network outputs a likelihood map across the entire image. In (c), we use two thresholds, support and resistance, (SRS) to separate easy (top image) from ambiguous (bottom image) pixels. We assign the final labels to easy pixels at this stage. To classify ambiguous pixels, in (d) we use the support threshold to generate a high-recall estimate of the vascular structure (right image) and then skeletonize this mask (left image). We then intersect this skeleton with a lattice mask (shown in (e)) to find a set of seed points (shown in (f). The left image in (f) shows the seed locations for the entire image, while the right image is a close-up of the square shown in red. We then define a small patch around each seed point and train a small network (mini-U-net) on these patches (g). Each patch has two channels. The channel highlighted with the purple border is extracted from the original likelihood map, while the channel with the green border includes only the ambiguous pixels in that region. In (h), we merge the mini-U-net predictions for ambiguous pixels with the labels for the easy pixels to obtain our complete segmentation result (shown in (i)). Figure best viewed on-screen.
image into small patches to train a CNN [25], while Ciresan et al. defined a patch around each pixel and used those patches to train their network [26]. One of the drawbacks of patch-based approaches is that we need to train a CNN with a large number of patches, most of which have redundant information. The upsampling in U-net, on the other hand, allows a network to train with fewer, significantly larger patches.
There are numerous extensions to the original U-net architecture. Alom et al. increased the performance of U-net by introducing residual connection, and a recursive training strategy [22], albeit at the expense of increased training time due to the extra connections and recursion. Zhuang et al. arranged two U-net architectures in serial fashion, enabling the architecture to learn more abstract features [27]. This serialization significantly increases the training time and requires heavy computational infrastructure to be able to train well. Jin et al. used deformable CNNs to construct better vascular features [28], but also with added training complexity. Finally, Yun et al. combined conditional generative adversarial networks (GANs) with U-net [23], achieving results comparable to the other U-net extensions. In contrast to these extensions, our approach not only yielded better results, but is simpler and faster to train, as detailed in the following section.

Methodology
As noted above, we separate vessel segmentation into two stages. Fig. 1 details each step of our pipeline. In the first stage, we use a stochastic loss function to obtain an overall likelihood map for every pixel in the image. Then, we identify those pixels that are most likely to be misclassified-generally faint vessels and pixels at the periphery of thick vessels-and apply a second, smaller network to these ambiguous regions. Below, we first present and motivate the use of stochastic loss functions and then discuss each step in our pipeline in turn.

Dynamic loss functions
A weighted loss function penalizes some types of errors more than others. They are useful for problems with unbalanced datasets or in which we want to prioritize certain outcomes (e.g., minimize false positives for a particular class). Traditionally, these weights have been fixed a priori by estimating them using a training set.
In this work, however, we propose using dynamic or stochastic weights. In this case, each weight is defined not as a single value but as a distribution. During training, every time we feed a new mini-batch to the network, we first sample weights for each class and then use those sampled weights to compute the loss. Since the weights will vary for each training iteration, the same error will be penalized more or less heavily at different points during training. Empirically, this forces the network to optimize across all ranges of false positive vs. false negative rates, leading to more robust classification. For concreteness, below we discuss the use of dynamic weights for the cross-entropy loss, but one can randomize any weighted loss function.
The cross-entropy loss is a natural choice for binary classification since it encourages values to converge towards zero or one. This loss is defined as follows: where p is the true distribution and q is the predicted distribution, i.e., the network's outputs. The training process aims to make these two distributions as close as possible. However, when the data is highly skewed, we can trivially obtain a high accuracy by always assigning the most common class to every pixel. In the case of vessel segmentation, very few pixels are part of a vessel, so always assigning a label of background will yield an accuracy of around 90% (but a recall of 0%). A straightforward way to ameliorate this imbalance is by applying different weights to each class: where w i are the class weights and i = 1, ..., C are the class labels (C = 2 for binary segmentation). By using weights, we force the network to care about both positive and negative samples equally during training. An easy way to define these weights is to estimate them from the training set, as was done for the original U-net architecture [6]. A major drawback of the above strategy is that it imposes a fixed ratio of vessel-to-background. As our experiments show, this ratio is a major determinant of whether ambiguous pixels will be mapped to either vessel or background. In other words, when faced with an ambiguous pixel, the network will use this ratio to "guess" which label to assign to it. We can see in Figs. 3(b) and 3(c) that a network trained with fixed weights will preferentially over-or under-segment ambiguous pixels.
Instead, we propose using stochastic weights which are randomly generated at each training iteration: where function rand(1, α, s) assigns random class weight within a range of 1 to al pha and s is the step by which the random class weights are taken. For example, s = 1 would generate random integers between [1, α]. Intuitively, dynamic weights prevent the network from getting stuck in a low-value local minimum. When we use fixed class weights, the neural network will always penalize all pixels by the same factor. As noted earlier and as Figs. 3(b) and 3(c) show, fixed weights fail to account for fainter vessels and are insensitive to noise. Stochastic costs reduce this bias because the network must vary in how aggressively it segments ambiguous pixels throughout the training process. Even if completely different weights are given in a subsequent iteration, some of the information learned in the previous step is preserved, leading to a more balanced segmentation of ambiguous regions. As Fig. 3(d) shows, dynamic weights capture fainter vessels more accurately than fixed weights.
We now discuss the various steps of our proposed pipeline below.

Likelihood map
The first step in our pipeline consists of training a U-net architecture using stochastic weights to derive a likelihood map over the entire image. The goal of this map is not to segment the image directly but to indicate how likely a particular pixel is to be part of a vessel. We do not segment the image at this stage because the up-sampling used by U-net blurs the original vessels, making it difficult to correctly label high-frequency elements (e.g., thin vessels or pixels near the edge of a thick vessel). Figure 3 contrast the outputs of our stochastic U-net vs. two fixed-weight U-nets on the datasets used in our experiments (see figure for details). Overall, networks trained with fixed weights struggle to capture faint vessels, as well as the pixels at the edges of thicker vessels. Interestingly, our stochastic weights are also better able to ignore vessel-like artifacts, e.g., the rim of a lens as seen in Fig. 3.

Support-resistance segmentation (SRS) for easy pixels
The likelihood map generated in the previous step ranges from [0,1]. Pixels with a higher probability of being vessels have values closer to one (and closer to zero otherwise). Intuitively, the closer a value is to either extreme, the more confident the network is in its classification. Thus, we use this likelihood map to separate pixels into either easy or ambiguous classes. The top and bottom of Fig. 1(c) show the easy and ambiguous pixels, respectively, of the likelihood map in Fig. 1(b).
We use two thresholds to classify pixels into these two classes. We refer to the lower threshold as support and the upper threshold as resistance. All pixels between resistance and support are considered ambiguous. Pixels below support (those with values near zero), which account for the majority of pixels, are background pixels whereas pixels above the resistance (those with value near one) are considered vessel pixels.
The choice of support and resistance thresholds affects the trade-off between precision and recall. Thresholds close to one and zero, respectively, lead to higher precision for easy pixels but cause more pixels to fall in the intermediate, ambiguous band. Thresholds closer to the middle As originally detailed in [6], the first half of the U-net architecture is a standard sequence of convolution and max-pooling layers.
The layers in the second half, on the other hand, receive two inputs: the features from the previous layer and the features from the corresponding layer in the first half of the network (i.e., the one with the same dimensionality). The features from previous layers are upsampled to match the larger dimensions of subsequent layers. Intuitively, the upsampled features provide more global information, while the features transferred from the earlier layer give more local information. Our mini-U-net architecture (highlighted by the light blue square) consists of the middle layers of the full U-net. lead to higher recall, but affect precision. For our experiments, we empirically set a support of 20 and a resistance of 235.

Targeted prediction for ambiguous pixels
The last stage of our pipeline segments ambiguous pixels, i.e., those that fall within the supportresistance band described above. Figs. 1(d)-(g) illustrate the various steps involved in this targeted prediction. Generally, most vessel pixels will lie above the support threshold, so they have already been predicted with high precision. To classify the pixels with intermediate values, we train a second, patch-based network only on those pixels. As Fig. 2 shows, this mini-U-net network is a scaled-down version that uses only the middle layers of the full U-net architecture. We use two channels as inputs to this network: the full likelihood map outputted by the first U-net ( Fig. 1(b)) and a version of this likelihood map which only contains ambiguous pixels (the bottom image in Fig. 1(c)). Unlike prior patch-based approaches (e.g., Ciresan et al. [5]), we do not extract a patch for every pixel. Instead, we use a sparse set of p × p patches centered at the intersections between ambiguous pixels and a regular lattice, as described below: 1. Raw estimate: We first produce a high-recall binary mask by considering all pixels above the support threshold as vessels. Intuitively, this mask should contain almost no false negatives, but will have numerous false positives. We then skeletonize the largest connected component in this mask (Fig. 1(d)).

Patch selection:
We then define a regular lattice of width p/2 ( Fig. 1(e)) and determine the intersection points between this lattice and the skeleton from the previous step ( Fig. 1(d)). We empirically verified that the union of the patches defined by these intersections always covered every ambiguous pixel. 3. Mini-U-net classification: Finally, we classify the ambiguous pixels in each patch using our second network ( Fig. 1(g)). For each patch, we extract the corresponding regions in the full likelihood map and the map with only ambiguous pixels, resp. Intuitively, the second channel indicates which pixels are most crucial to classify correctly. Finally, we use the outputs from the second network as the final labels for ambiguous pixels (Fig. 1(h)).

Experiments and Results
We carried out experiments on five different retinal vessel datasets to verify the effectiveness of our proposed approach. Below, we first detail our experimental protocol and then discuss our results.

Materials and methods
Hardware: All our experiments were conducted on a Dell Precision 7920R server with two Intel Xeon Silver 4110 CPUs, two GeForce GTX 1080 Ti graphics cards, and 128 GBs of RAM.

Datasets:
We used five datasets for our experiments (see the first column of Fig. 3 for a sample image from each dataset): • DRIVE [8]: 40, 565×584 color fundus images centered on the macula. Each image includes a circular field-of-view (FOV) mask. This dataset is pre-divided into 20 training and 20 test images, but we further divided the training set randomly into 15 training and 5 validation images.
• AV-WIDE [11,29]: 30 wide-FOV images. Image sizes vary, but are around 1300×800 pixels for most images. Images are loosely centered on the macula. As explained below, we used the existing, graph-based annotations to manually segment these images.
• VEVIO [12]: This dataset has two types of images: 16, 640×480 frames from 16 different video indirect ophthalmoscopy (VIO) videos and 16 corresponding mosaics (around 600×500 pixels) obtained by fusing different frames into a single image [30]. Due to the difficult image acquisition process, these images are blurry and have lens artifacts.

Data preparation:
We used only the green channel of each image. We then applied contrast adaptive histogram equalization [31] as a preprocessing step to enhance contrast. Given the limited number of images in each dataset, we used image augmentation techniques to increase our data size. Specifically, we flipped the input image horizontally and vertically randomly in each iteration.
Ground truth preparation: All datasets except AV-WIDE had preexisting ground-truth pixel labels. The original AV-WIDE dataset had only graph-based labels, so we manually traced the vessels in Adobe Photoshop CS (Adobe Systems Inc., San Jose, CA) using a Wacom Intuous3 graphics tablet (Wacom Co. Ltd, Kazo-shi, Saitama, Japan) to obtain pixel-level segmentations. The original, graph-based annotations had been carried out by an expert ophthalmologist, so we followed them exactly.
Network architectures: As noted above, we used a full-sized U-net architecture for our initial probability map and a smaller, mini-U-net network for targeted prediction. The U-net architecture consist of a series of down-sampling steps followed by the same number of up-sampling steps. Each up-sampling step receives, crops to match the size, and concatenates the feature map from the down-sample step on the same level, as shown in Fig. 2. We used 3×3-pixel kernels except in the last layer; here, we used 1×1 kernels to produce two outputs, which are then passed through a softmax layer to obtain two probabilities: one for that particular pixel being a vessel and another for it being part of the background. Depending on which of the probabilities is higher, a pixel is labeled as either vessel or background. As in the original U-net architecture [6], this network receives inputs of size 572×572 pixels, but only predicts values for the internal 388×388 pixels. The outer band of 92×92 pixel allows the U-net a wider view of the vascular structure. As in [6], we shift the U-net to obtain a label for every pixel and mirror the 388× 388 region if the outer band extends beyond the image. Our second network, the mini-U-net architecture, is identical to the middle three layers of the full-size U-net (see Fig. 2). Thus, its input size is 140 × 140 and it produces labels for the middle 100×100 pixels. Similar to the first network, we expand and mirror patches as needed. However, we train this network using a dice loss, rather than cross entropy because our goal is to maximize the F1 score of our pipeline. The weighted dice loss is given by the following formula: , where β controls how much we want to favor precision vs. recall. Higher beta yields higher precision and vice versa. As with the full-size U-net, we used a stochastic loss in which we sampled β for each training iteration: , where B r and (1, α, s) is picked randomly within range 1 -α with a stepsize of s.

Training parameters:
We trained our networks using Adam optimization [32] with a learning rate of 0.001 and a mini-batch size of 4. We trained the full U-net for 250 epochs and the mini-U-net for 60 epochs. For the full U-net, we used stochastic weights of w r and (1, 100, 1), i.e., where weights randomly oscillated within a range of 1 − 100 with a step size of 1. For the mini-U-net, we used a dice loss with stochastic weights of B r and (1, 2, 0.1), where β randomly oscillated between 1 − 2 with a step size of 0.1.

Model validation:
We used 5-fold cross validation with training, validation, and test sets on all datasets except DRIVE, since this dataset already had a predefined test set. For each fold, we randomly split the training set into 75% training and 25% validation images. In our tables, we report the performance on the test set across all the folds. Note that, using this protocol, each image is only included once in a test set.

Pipelines tested
To better understand the impact of each stage of our pipeline, we tested two versions of our approach, as well as two baseline methods: • Dynamic weights with targeted prediction: Our proposed approach.
• Dynamic weights only: We trained the full U-net using dynamic class weights, but omitted the targeted prediction step.
• Fixed weights only: We trained the full U-net using fixed class weights based on the ratio of vessel to background pixels in the training set. No targeted prediction step.
• Fixed weights with targeted prediction: We applied both fixed weights and targeted prediction.
The parameters for the Mini-U-net at the end of the full pipelines (variations 1 and 4) were the same for both configurations. In total, we carried out twenty experiments across the five datasets.

Results
We quantified our method's performance in terms of precision, recall and F1 score (also called F-measure). For comparison to the state of the art, we also calculated the accuracy of our model, even though this value can be misleading for unbalanced datasets [12]. The precision, recall, and accuracy values in our tables are the mean values across all test images. The reported F1 score is the harmonic average of the corresponding mean precision and recall values. As Forman et al.
showed [33], if the positive cases constitute around 7% or more of the samples, then this is an unbiased way to calculate the mean F1 score for k-fold cross validation. Across the different datasets, vessel pixels constituted around 10% of all pixels. Tables 1 to 6 summarize our results on the various datasets. A value in bold represents the best score for that particular column. For convenience, we also included those of current state-of-the-art approaches. Note, though, that we only list methods that reported F1 scores (or, equivalently, both precision and recall values). Due to the strong class imbalance between the two classes, accuracy alone is not an informative measure of performance. As the various tables show, our proposed pipeline consistently outperformed existing methods. To the best of our knowledge, our F1 scores and most of our accuracy values are state-of-the-art results across the five datasets.
Overall, dynamic weights outperformed fixed weights across the various datasets, including the highly studied DRIVE and STARE datasets. Specifically, stochastic weights led to more balanced precision and recall scores. For example, in the DRIVE dataset, the full, stochastic pipeline obtained a precision of 0.8284, and a recall of 0.8235 while the fixed-weight pipeline gave a more unbalanced results (precision of 0.7823 and recall of 0.8246). This pattern was replicated across datasets. Compared to prior methods, we consistently achieved better recall with little or no sacrifice in precision. Existing methods with high precision and low recall seldom misclassify a background pixel as a vessel, but they struggle to detect faint vessels.
Notably, our method outperformed other extensions of the original U-net architecture, including LadderNet [27], R2Unet [22], and DUNet [28]. The difference is particularly large for the CHASE-DB dataset (Table 3). Here, our method F 1 score was 2.66% higher than the previous state of the art (LadderNet).  Table 4 shows our results for the AV-WIDE dataset. Our paper is the first assessment of this dataset as we manually created the ground truth for it. As with the other datasets, we can observe a similar pattern of unbalanced precision and recall for fixed class weights. In contrast, our stochastic technique has more balanced precision and recall. Moreover, the F 1 score and accuracy are significantly higher for our method, particularly for the full pipeline with targeted prediction. Finally, our method also achieved state-of-the-art results in the highly challenging VEVIO dataset, which has two sets of images: composite mosaics and individual frames taken from a low-resolution video [30]. As with the other datasets, fixed class weights gave highly biased scores in terms of precision and recall, while dynamic weights gave more balanced results (Tables 5 and 6).

Discussion
The key novelties of our technique are two-fold. First, we use dynamic, as opposed to fixed, weights to allow our networks to learn a more robust balance between false positives and false negatives. Second, we separate different types of tasks involved in retinal vessel segmentation. The likelihood estimation step only focuses on capturing the vascular structure without worrying about the final segmentation. Conversely, the targeted prediction step relies on having a precomputed likelihood map with which to make the final predictions. We discuss specific properties of our approach below.

Dynamic weights yield smoother likelihood maps
In Fig. 3 (columns b and c), we can see that using class weights computed from the training set almost binarizes the image. All the likelihoods are very close to either 0 (background) or 1 (vessel); thus, we have no any useful information with which to refine our estimate in the targeted prediction step. Any information about fainter pixels was lost. However, if we use dynamic weights, a higher percentage of pixels will be assigned non-trivial likelihood values (Fig. 3(d)), which will allow our second network to better classify ambiguous pixels.

Dynamic weights yields more balanced precision and recall
As we can observe from the results (Tables 1 to 6), our method consistently achieves a good balance between precision and recall. This implies that our approach handles fainter or more ambiguous pixels, rather than ignoring them. In contrast, the state-of-the-art results for the DRIVE dataset [27] report precision and recall values of 0.8593 and 0.7896, respectively. Our corresponding, more balanced values were 0.8284 and 0.8163. We can see even more of a difference on the VEVIO dataset (Tables 5 and 6). For the mosaics, the precision and recall for the U-net with class weights were 0.5537 and 0.6832, respectively. However, our full, dynamic pipeline yielded values of 0.6573 and 0.6739. Our dynamic weights allow a network to settle on a local optimum that better balances precision and recall. As noted earlier, retinal images have a much higher percentage of background pixels (over 90%). Thus, a network trained on this type of data will be more reticent to label an ambiguous pixel a vessel (since the prior probability is so much lower). This imbalance makes the network settle on a local optimum that is skewed towards high precision and low recall. Previous approaches have ameliorated this effect by assigning fixed class weights in order to re-balance the classification problem. For example, if only 10% of pixels are vessels, then misclassifying those pixels will be penalized by a factor of 9. However, not all faint vessels are similar or distributed equally across different images, so a one-size-fits-all treatment is suboptimal. Our technique, on the other hand, applies stress tests during the training process to ensure that the network is robust across different ratios of precision and recall. As our experiments show, this approach allows the network to settle on a a better optimum. In more detail, Fig. 4 illustrates the effects of dynamic vs. fixed weights during training for the DRIVE dataset. Here, each subplot on the left-hand side shows precision vs. recall during training, while the subplots on the right show this value for the validation set. As Fig. 4(c) shows, a network that uses weights calculated from the training set will begin by strongly favoring recall, since a false negative is penalized more heavily than a false positive. In contrast, an unweighted network ( Fig. 4(e)) will favor precision at first, since the probability that an ambiguous pixel is a vessel is very low, a priori. Over time, both networks learned to classify the training set, but their initial bias lead to poor results on the validation set, suggesting overfitting ( Fig. 4(d and f)). This behavior implies that the network was very confident about easy pixels and therefore settled on an optimum that favored this class of pixels; however, this choice made it ignore fainter pixels, in turn. Our method, on the other hand, (Fig. 4(a, b)), had more balanced scores throughout the training process. The optimum on which it settled was more robust and thus achieved better scores on the validation and test sets. Our results suggest that the network was able to learn features that separate faint vessels from noise in the background.

Effect of different weighing schemes
Finally, we carried out additional analyses to investigate the impact of different weighing schemes on the final segmentation result. First, we trained a U-net architecture (without targeted prediction) with equal weights for the two classes. In other words, each misclassification, whether false positive or false negative, was weighted equally. Table 7 shows the results of this equal weighing on our different datasets (we also show the class-weighted results for each dataset for comparison). Equal weights yielded better precision but worst recall. Due to the roughly 9 to 1 class imbalance between background and vessel pixels, a network trained with equal weights will tend to label ambiguous pixels as background. We verified this trade-off further by systematically varying the class weights for the DRIVE dataset and recording the resulting precision and recall values. Table 8 lists the results when training with background-to-vessel weights of {1,5}, {1,1} (equal  weights), {5,1}, and {10,1}. For comparison, the class-based weights were roughly 1 to 10. We also tested a U-net with dynamic weights, but with a range of weights spanning only 10 to 1 (our main experiments used a range of 100 to 1). In this case, the dynamic weights has less variance across training epochs. Interestingly, the best fixed-weight result for the DRIVE dataset used equal weights (this result was not consistent across datasets, though, as Table 7 shows). Overall, we can see that the trade-off of precision vs. recall is driven by the ratio of class weights. Also, the dynamic weights with a 10-to-1 range did not balance the precision vs. recall as much as the 100-to-1 weights, so their results more closely resembled equal weights. We observed a similar pattern for the other datasets (we omitted these results for conciseness). Skewed weights tend to favor either precision or recall and reduce the other score accordingly. By effectively balancing these two scores, our dynamic weights achieve the best result.

Conclusion
An accurate segmentation of the vascular structure is crucial for retinal disease diagnosis. However, this task is not easy because the quality and features of a retinal image depend on many factors, including the imaging device, lighting conditions, and an individual's anatomy. Thus, trying to enforce a single, static approach for all retinal images is suboptimal. Instead, by separating retinal vessel segmentation into sub-tasks, we have more degrees of freedom with which to adapt our processing to the current image. Furthermore, dynamic weights allow a network to learn to classify all types of vessels, regardless of their colour or intensity, which in turn generates a likelihood map with clear distinctions between vessel, background, and noise. Our targeted prediction step then uses this likelihood map to better classify ambiguous pixels. Our technique gives better results than state-of-the-art techniques, many of which are much complex and less intuitive. Instead of optimizing a single, black box, our approach breaks down the problem into more manageable steps and makes these steps more robust by using dynamic weights. In future work, we plan to apply our pipeline to other medical imaging domains, including CT and MRI scans. We also intend to investigate the theoretical properties of our stochastic weights further.

Disclosures
The authors declare that there are no conflicts of interest related to this article.