Deep Active Learning via Open Set Recognition

In many applications, data is easy to acquire but expensive and time-consuming to label; prominent examples include medical imaging and NLP. This disparity has only grown in recent years as our ability to collect data improves. Under these constraints, it makes sense to select only the most informative instances from the unlabeled pool and request an oracle (e.g., a human expert) to provide labels for those samples. The goal of active learning is to infer the informativeness of unlabeled samples so as to minimize the number of requests to the oracle. Here, we formulate active learning as an open-set recognition problem. In this latter paradigm, only some of the inputs belong to known classes; the classifier must identify the rest as unknown. More specifically, we leverage variational neural networks (VNNs), which produce high-confidence (i.e., low-entropy) predictions only for inputs that closely resemble the training data. We use the inverse of this confidence measure to select the samples that the oracle should label. Intuitively, unlabeled samples that the VNN is uncertain about are more informative for future training. We carried out an extensive evaluation of our novel, probabilistic formulation of active learning, achieving state-of-the-art results on CIFAR-10 and CIFAR-100. In addition, unlike current active learning methods, our algorithm can learn tasks with non-i.i.d. distributions, without the need for task labels. As our experiments show, when the unlabeled pool consists of a mixture of samples from multiple tasks, our approach can automatically distinguish between samples from seen vs. unseen tasks.


Introduction
Supervised deep learning has achieved remarkable results across a variety of domains by leveraging large, labeled datasets [9]. However, our ability to collect data far outstrips our ability to label it, and this difference only continues to grow. This problem is especially stark in applications, such as medical imaging, where the ground truth must be provided by a highly trained specialist. Even in cases where labeled data is sufficient, there may be reasons to limit the amount of data used to train a model, e.g., time, financial constraints, or to minimize the model's carbon footprint.
Fortunately, the relationship between a model's performance and the amount of training data is not linear; there often exists a small subset of highly informative samples that can provide most of the information needed to learn to solve a task. In this case, we can achieve nearly the same performance by labeling (and training on) only those informative samples, rather than the entire dataset. The challenge, of course, is that the true usefulness of a sample can only be established a posteriori, after we have used it to train our model.

Figure 1: Framework overview. Our proposed system has two models, M1 and M2: a) an encoder followed by a linear classifier (M1); b) an encoder with an optional decoder (M2). The selected model is trained on the initial labeled pool L of K samples. Once this initial training stage finishes, our sampling module S is given the unlabeled pool U and the trained model, and returns the top-k most informative samples to be sent to the oracle for annotation, according to the available budget. After annotation, those top-k samples are removed from the unlabeled pool U and added to the labeled pool L.
The growing field of active learning (AL) is concerned with automatically predicting which samples from an unlabeled dataset are most worth labeling. In the standard AL framework, a selector identifies an initial set of promising samples; these are then labeled by an oracle (e.g., a human expert) and used to train a task network [5]. The selector then progressively requests labels for additional batches of samples, up to either a percentage threshold (e.g., 40% of the total data) or until a performance target is met. In short, an active learning system seeks to construct the smallest possible training set that will produce the highest possible performance on the underlying task(s).

In this paper, we formulate active learning as an open-set recognition (OSR) problem, a generalization of the standard classification paradigm. In OSR, only some of the inputs are from known classes; the classifier must label the remaining inputs as out-of-distribution (OOD) or unknown. Intuitively, our hypothesis is that the samples most worth labeling are those that are most different from the currently labeled pool. Training on these samples will allow the network to learn features that are underrepresented in the existing training data. In short, our AL selection mechanism consists of picking unlabeled samples that are OOD relative to the labeled pool. Figure 1 illustrates our proposed approach.

In more detail, our classifier is a variational neural network (VNN) [13], which produces high-confidence (i.e., low-entropy) outputs only for inputs that are highly similar to the training set. We use the inverse of this confidence measure to select which unlabeled samples to query next. In other words, our selector requests labels for the samples that the classifier is least confident about, because this implies that the existing training set does not contain items similar to them. As we detail in Sec. 4, our OSR-based approach achieved state-of-the-art results on a number of datasets and AL variations, far surpassing existing methods.
The rest of this paper is organized as follows. In Sec. 2, we provide a brief overview of current active learning and open-set recognition methods. In Sec. 3, we present our proposed approach, then detail our experiments in Sec. 4. Finally, we discuss avenues for future work in Sec. 5.

Related Work
Recent approaches to active learning can be categorized as query-acquiring or query-synthesizing [18]. The distinction lies in whether the unlabeled OOD samples are immediately accessible (pool-based) or are instead synthesized using a generative model [10,11,20]. Assuming access to a pool of unlabeled OOD data, a strategy must be devised that selects only the most useful or informative samples from that distribution. It has been routinely demonstrated that training samples do not contain equal amounts of useful information [16]; in other words, some training distributions result in better task performance than others. Thus, the aim of an active learning system is to minimize the amount of training data required to achieve the highest possible performance on an underlying task, e.g., image classification; this learning efficiency is the quantity we wish to maximize. As in [5], we therefore wish to learn the acquisition function that chooses the data points for which a label should be requested. The learned acquisition function is, in essence, an intelligent sampling function whose aim is to outperform random i.i.d. sampling of the unlabeled distribution, thereby maximizing the learning efficiency of the system. The various sampling strategies that have been proposed can typically be grouped into three broad categories [18]: uncertainty-based techniques, representation-based models [15], and hybrid approaches [14].

Open Set Recognition
Open Set Recognition (OSR), on the other hand, refers to the ability of a system to distinguish between data it has already seen (the training distribution) and data to which it has not yet been exposed (OOD data). Though OSR has been studied for decades, recent progress has come from careful design of the heuristics used to quantify the similarity between historical and current data distributions [12,13,6]. Since such measures have been shown to substantially improve OSR performance, and since OSR is an inherent necessity for active learning, it stands to reason that AL systems would benefit from integrating these techniques. This is intuitive: it would be unhelpful for a system to request labels for data points that are nearly identical to those it has already seen. Redundancy, or excess similarity, in the training distribution can therefore be said to decrease the learning efficiency of the system. One of the most promising approaches to OSR incorporates ideas from Extreme Value Theory (EVT) in order to quantify the epistemic uncertainty of the model [12].
Though seemingly complementary, very little work has been done to merge the distinct fields of active learning and OSR. In this work, we explicitly merge the two by adopting the heuristics used in [12] to quantify a model's predictive uncertainty w.r.t. newly acquired unlabeled data, in order to infer the degree to which that data is likely to improve the performance of a classifier if it were integrated into the labeled training distribution. In the process, we demonstrate how EVT-inspired heuristics can help improve the learning efficiency of deep learning systems.

Methodology
As noted above, our active learning approach iteratively selects samples from an unlabeled pool based on the confidence level of its OSR classifier. Below, we first formalize the active learning paradigm we are tackling, then detail our proposed system. In particular, we provide an overview of VNNs and explain how we use their outputs to select new samples to label.

Formal problem definition
Let us now describe our active learning protocol while introducing a few notations. Each active learning problem is denoted as P = (C, D_train, D_eval): it concerns the classification of the classes in a set C and comes with two sets of examples, the first, D_train, used to infer a prediction model, and the second, D_eval, used to evaluate the inferred model, with D_train ∩ D_eval = ∅.

Let D_train = {(x_i, y_i)}_{i=1}^N be a dataset consisting of N i.i.d. data points, of which only m are labeled (m ≪ N), where each sample x_i ∈ R^d is a d-dimensional feature vector and y_i ∈ {1, 2, ..., C} is the target label. The m labeled samples are a uniformly sampled subset of the N samples. D_train is partitioned into two disjoint subsets: a labeled set L, consisting of the m labeled data points, and an unlabeled set U, consisting of the N − m data points with unknown target labels. We denote the state of a subset at a given iteration of our algorithm as U_t (resp. L_t), for t ∈ {0, 1, ...}.

The active learning setup starts with a set of labeled samples L_0 of size m_0. We train a classifier f with parameters θ on the labeled pool L_0; afterwards, we select b data points from U using our OSR criterion (see Sec. 3.2) and send them to the oracle for annotation. The annotated data points are removed from U_0 and added to L along with their target labels, so the size of the labeled pool L_1 becomes m_1. The labeled pool grows as training progresses, L_0 ⊂ L_1 ⊂ · · · ⊂ L_t, with respective sizes m_0 < m_1 < · · · < m_t. We continue this process until the size m_t of the labeled pool L_t reaches a predefined limit (40% of D_train in our experiments).
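The iterative protocol above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `select`, `train`, and `oracle` are hypothetical stand-ins for the OSR criterion, the VNN training step, and the human annotator.

```python
import random

def active_learning_loop(pool, m0, budget, limit_frac, select, train, oracle):
    """Sketch of the iterative AL protocol: grow the labeled pool L_t
    (starting from a uniform L_0 of size m0) until it reaches
    limit_frac of the full training set."""
    n = len(pool)
    random.seed(0)
    labeled = set(random.sample(sorted(pool), m0))   # L_0: uniform initial pool
    unlabeled = pool - labeled                       # U_0
    model = train(labeled)
    while len(labeled) < limit_frac * n:
        query = select(model, unlabeled, budget)     # OSR criterion picks b points
        oracle(query)                                # annotate the selection
        unlabeled -= set(query)                      # move U_t -> L_{t+1}
        labeled |= set(query)
        model = train(labeled)
    return labeled, unlabeled

# Toy instantiation with placeholder components (not the paper's VNN):
pool = set(range(1000))
labeled, unlabeled = active_learning_loop(
    pool, m0=100, budget=50, limit_frac=0.4,
    select=lambda m, u, b: sorted(u)[:b],   # placeholder selector
    train=lambda L: len(L),                 # placeholder "model"
    oracle=lambda q: [0] * len(q),          # placeholder annotations
)
```

With m0 = 10% and a 5% budget per stage, the loop stops exactly when the labeled pool reaches the 40% limit.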
Importantly, other formulations of AL assume access to task boundaries and an i.i.d. distribution, which is limiting when the i.i.d. assumption is not satisfied or when task boundaries are unavailable. In our experimental setup, we assume that the unlabeled pool U can contain training data from multiple tasks, and we assume no task IDs. Our OSR selection criterion allows our system to learn multiple tasks without specifying the current task.

Active learning system
Our AL system ( Fig. 1) has two main components: a variational neural network [13], which serves as our classifier, and an entropy-based selection mechanism. We discuss each component below.
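The entropy-based selection mechanism can be illustrated with a short numpy sketch; the helper names are ours, and the probabilities below are invented for illustration.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_uncertain(probs, b):
    """Indices of the b samples the classifier is least confident about
    (highest predictive entropy)."""
    return np.argsort(-predictive_entropy(probs))[:b]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy, worth labeling
    [0.70, 0.20, 0.10],
])
picked = select_most_uncertain(probs, 1)
```

Low-entropy rows resemble the training set; the selector forwards the high-entropy rows to the oracle.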

Variational Neural Networks (VNNs)
Variational neural networks (VNNs) [13] are a supervised variant of β-variational autoencoders (β-VAE) [6]. The latter is itself a variant of VAEs [2] but with a regularized cost function. That is, the cost function for a β-VAE consists of two terms: the reconstruction error, as with a regular VAE, and an entanglement penalty on the latent vector. This penalty forces the dimensions of the latent space to be as uncorrelated as possible, making them easier to interpret.
A VNN combines the encoder-decoder architecture of a β-VAE with a probabilistic linear classifier (see Fig. 1 for a visual representation). As such, its loss function includes a classification error, i.e., a supervised signal, in addition to the reconstruction and entanglement terms:

L(θ, φ, ξ) = E_{q_θ(z|x)} [ log p_φ(x|z) + log p_ξ(y|z) ] − β KL( q_θ(z|x) ‖ p(z) )   (1)

As detailed in [13], θ, φ, and ξ are the parameters of the encoder, decoder, and classifier, resp., while p_φ(x|z) and p_ξ(y|z) are the reconstruction and classification terms. The last term is the entanglement penalty, given by the Kullback-Leibler divergence between the latent vector distribution q_θ(z|x) and an isotropic Gaussian prior p(z).
In this work, we evaluated both the full framework discussed above (dubbed M_2 in our experiments), which uses the loss function in Eq. 1, and a simplified version (M_1) without the reconstruction term:

L(θ, ξ) = E_{q_θ(z|x)} [ log p_ξ(y|z) ] − β KL( q_θ(z|x) ‖ p(z) )   (2)

Following the variational formulation in [13], both M_1 and M_2 have natural means to capture epistemic uncertainty. As our experiments show, both versions outperform the state of the art, but M_2 achieves better results overall.
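The individual terms of the loss can be sketched numerically. This is a minimal numpy illustration assuming a diagonal-Gaussian approximate posterior and precomputed class probabilities; it is not the paper's implementation.

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    This is the entanglement penalty of the beta-VAE objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def cross_entropy(probs, y, eps=1e-12):
    """Classification term -log p_xi(y|z) for integer targets y."""
    return -np.log(probs[np.arange(len(y)), y] + eps)

def vnn_loss(recon_err, cls_probs, y, mu, logvar, beta=1.0, with_recon=True):
    """Negative of the Eq. 1 objective (M2) when with_recon=True;
    dropping the reconstruction term gives the M1 variant (Eq. 2)."""
    loss = cross_entropy(cls_probs, y) + beta * kl_diag_gaussian(mu, logvar)
    if with_recon:
        loss = loss + recon_err
    return loss.mean()

# Posterior equal to the prior -> zero KL; perfect classifier -> ~zero CE.
mu = np.zeros((2, 4)); logvar = np.zeros((2, 4))
probs = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
loss_m1 = vnn_loss(None, probs, y, mu, logvar, with_recon=False)
```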

Sample Selection
Motivated by the class-disentanglement ability of the loss in Eq. 1, for a given budget size b we aim to select b data points from the unlabeled pool U_t. However, instead of using information about extreme distance values in the penultimate-layer activations to modify a Softmax prediction's confidence, we propose to apply EVT to the class-conditional posterior. In this sense, an unlabeled data point is regarded as containing useful information if its distance to the class latent means is extreme with respect to what has been observed for the majority of correctly predicted data instances, i.e., the data point falls into a region of low density under the aggregate posterior and is therefore likely to carry information unknown to the network with parameters θ at this point. We employ two sampling algorithms for our OSR methodology, detailed below.

1. Uncertainty sampling: This is the conventional sampling method, in which data points x_i are selected using an uncertainty measure. Model uncertainty can be measured in several ways; one approach is shown in Algorithm 2, which captures our model's epistemic (prediction) uncertainty for any given input x_i. We compute the utility of each data point in the unlabeled pool U, where the utility is the model uncertainty, and collect the b most informative data points. These are sent to the oracle to obtain their target labels y*. Note that the closer the confidence is to zero, the more uncertain the model is about that sample.
2. Weibull distribution sampling: Our second sampling technique, shown in Algorithm 3, is based on the heavy-tailed Weibull distribution. Consider any stage t of the active learning lifecycle, at which L_t is the data used to train the model f_θ. For each class c, we first obtain the mean latent vector over all correctly predicted seen data instances, constructing a statistical meta-recognition model:

μ_c,t = (1 / |Z_c,t|) Σ_{z ∈ Z_c,t} z   (3)

where Z_c,t denotes the latent codes of the correctly classified training samples of class c present in L_t,
and the respective set of latent distances of each correctly classified point to its class mean:

Δ_c,t = { f_d(z, μ_c,t) : z ∈ Z_c,t }   (4)

where f_d denotes the choice of distance metric. We proceed to fit a per-class heavy-tailed Weibull distribution ρ_c,t = (τ_c,t, κ_c,t, λ_c,t) on Δ_c,t for a given tail-size η. As the distances are based on each individual class-conditional approximate posterior, they bound the high-density regions of the latent space. The tightness of the bound is characterized by η, which can be seen as a prior belief about the quantity of outliers inherently present in the data. The choice of f_d determines the dimensionality of the obtained distance distributions; in our experiments, we find that the cosine distance, and thus a univariate Weibull distance distribution per class, is sufficient.
Using the cumulative distribution function of this Weibull model, we can estimate the outlier probability of any given data point:

ω_t(z) = min_c [ 1 − exp( −( max(f_d(z, μ_c,t) − τ_c,t, 0) / λ_c,t )^{κ_c,t} ) ]   (5)

If the outlier probability exceeds our threshold Ω_t, the instance is considered an outlier, as it is far from all known classes. Note that the closer the probability is to one, the more likely it is that the model has never seen the data point before. We set our threshold in the range 0.5 to 0.8, which helps eliminate total outliers and pick the samples that are most useful to the model.
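The outlier probability of Eq. 5 can be illustrated with a small numerical sketch. The class means and fitted Weibull parameters ρ_c,t = (τ, κ, λ) below are invented for illustration; in practice they come from the per-class fit on Δ_c,t (e.g., via `scipy.stats.weibull_min.fit` on the distance tail).

```python
import numpy as np

def weibull_cdf(d, tau, kappa, lam):
    """CDF of a translated Weibull rho = (tau, kappa, lam): the outlier
    probability implied by a latent distance d to one class mean."""
    x = np.maximum(d - tau, 0.0)
    return 1.0 - np.exp(-(x / lam) ** kappa)

def outlier_probability(z, class_means, rho):
    """Eq. 5: minimum over classes of the Weibull CDF of the cosine
    distance between latent code z and each class mean."""
    probs = []
    for c, mean in enumerate(class_means):
        cos = np.dot(z, mean) / (np.linalg.norm(z) * np.linalg.norm(mean))
        d = 1.0 - cos                       # cosine distance
        tau, kappa, lam = rho[c]
        probs.append(weibull_cdf(d, tau, kappa, lam))
    return min(probs)                       # far from ALL classes -> outlier

means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
rho = [(0.0, 2.0, 0.1), (0.0, 2.0, 0.1)]    # hypothetical fitted parameters
p_in  = outlier_probability(np.array([0.9, 0.1]),  means, rho)  # near class 0
p_out = outlier_probability(np.array([-1.0, -1.0]), means, rho) # far from both
```

Taking the minimum over classes means a point is only flagged when it is distant from every known class, matching the thresholding rule above.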

Noisy Oracle and Non-i.i.d. Setup
When applying active learning to real-world applications, human experts traditionally function as oracles that provide labels for the requested samples. When a user or model requests labels for the selected data samples, the quality or accuracy of those labels depends on the expertise of the oracle. Humans make mistakes, however, and these mistakes lead to noisy labels.
We consider two types of oracle: an ideal oracle, which provides labels for requested samples with no error, and a noisy oracle, which provides labels for the requested images with some percentage of error or noise. This error may occur because of a lack of expertise, or because the oracle confuses similar classes of images, some classes being inherently ambiguous. To simulate this scenario, we apply noise only among related, similar classes.
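A minimal sketch of such a noisy oracle, assuming hypothetical super-class groupings (a label is only ever corrupted into a class from the same super-class, mirroring ambiguity between similar classes):

```python
import random

def noisy_oracle(true_labels, superclass_of, noise_rate, seed=0):
    """Hypothetical noisy oracle: with probability noise_rate, replace a
    label by another class from the same super-class (similar classes)."""
    rng = random.Random(seed)
    groups = {}
    for cls, sc in superclass_of.items():
        groups.setdefault(sc, []).append(cls)
    noisy = []
    for y in true_labels:
        if rng.random() < noise_rate:
            siblings = [c for c in groups[superclass_of[y]] if c != y]
            y = rng.choice(siblings) if siblings else y
        noisy.append(y)
    return noisy

# Toy CIFAR-100-style grouping: fine classes mapped to coarse super-classes.
superclass_of = {0: "aquatic", 1: "aquatic", 2: "tree", 3: "tree"}
labels = [0, 1, 2, 3] * 250
noisy = noisy_oracle(labels, superclass_of, noise_rate=0.1)
```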

Experimental Results
In all our experiments, we start with N data points, of which m (the initial budget) are labeled and form the labeled pool L; the unlabeled pool U contains the N − m data points whose target class is unknown. The initial budget m is set to 10% of the training dataset D_train.
After the model is trained on the labeled pool, we use the utility estimates provided by the sampling strategy for each sample x_i in the unlabeled pool U to select an additional b samples, where the budget b is set to 5% of D_train at each stage. Once the oracle has annotated the target classes of the requested b data points, they are added to the labeled pool L_t, and the active learning model f_θ is trained on the new labeled pool. This annotation-and-training process continues until the size of the labeled pool L_t reaches 40% of the training set D_train. We consider both the case where the unlabeled pool consists of samples from the same distribution as the training dataset and the case where it also contains samples from an unknown dataset. We presume the oracle is perfect unless stated otherwise.
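The resulting labeling schedule follows directly from these percentages; a short sketch (function name is ours):

```python
def labeling_schedule(n_train, init_frac=0.10, step_frac=0.05, final_frac=0.40):
    """Labeled-pool sizes m_t at each AL stage: 10%, 15%, ..., 40% of D_train."""
    sizes = []
    m = int(init_frac * n_train)
    final = int(final_frac * n_train)
    step = int(step_frac * n_train)
    while m <= final:
        sizes.append(m)
        m += step
    return sizes

sched = labeling_schedule(50000)   # CIFAR-10/100 training-set size
```

For a 50k-image training set this gives an initial pool of 5000 and six subsequent stages of 2500 labels each.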
We evaluated our model on standard image classification tasks, CIFAR-10 and CIFAR-100, each with 60k images of size 32 × 32. We measured the performance of both of our models by the average accuracy over 5 runs. We trained our model at each step with 10%, 15%, 20%, 25%, 30%, 35%, and 40% of the training set D_train annotated, as target classes become available from the oracle.

Implementation Details and Baselines
Baselines: We compare our method against various competitive methods, including Variational Adversarial Active Learning (VAAL) [18], Core-Set [15], Monte-Carlo Dropout [4], and Ensembles with Variation Ratios (Ensembles w. VarR) [3][1]. We also report the performance of Deep Bayesian AL (DBAL) [5], performing sampling with their proposed max-entropy scheme to measure uncertainty. Finally, we show the results achieved with uniform random sampling from the unlabeled pool, which still serves as a competitive baseline in the field of active learning.

Implementation Details
We used VGG16 [17] as the encoder for both of our models, M1 and M2; the optional decoder architecture is based on 14-layer wide residual networks [6][19], in the variational cases with a latent dimensionality of 60. The encoder is followed by a classifier consisting of a single linear layer. We optimize all models with a mini-batch size of 128 using either SGD or ADAM [7], with a learning rate of 0.001 and a weight decay of 10^−5. For the EVT-based outlier rejection, we fit Weibull models with a tail-size set to 5% of the training data points per class present in the labeled pool L, using the cosine distance metric. Training continues for 200 epochs for all datasets. The initial labeled pool size for all experiments is 10% of the training set D_train, equivalent to 5000 images for each of CIFAR-10 [8] and CIFAR-100 [8]. The budget size b is set to 5% of D_train, equivalent to 2500 images for each of CIFAR-10 and CIFAR-100. For clarity of nomenclature, we denote our model variants by their architecture (M1 or M2) and optimizer (SGD or ADAM).
On CIFAR-10, our four model variants achieve mean accuracies of 84.4%, 89.24%, 89.97%, and 91.4% using 40% of the annotated data after 6 stages with a budget of 2500 per stage beyond the initial budget, whereas the baseline accuracy using the entire dataset, denoted as Top-1 accuracy in Fig. 2(a) (left), is 92.63%. As shown in Fig. 2(a) (left), the methods that perform closest to ours are VAAL with an accuracy of 80.71%, Core-Set with 80.37%, and Ensembles w. VarR with 79.465%. The mean accuracy of both of our models consistently outperforms all the methods shown in Fig. 2(a) (left), including random sampling, DBAL, and MC-Dropout. To test the effect of the optimizer, we ran all of our models with both SGD and ADAM and found that M1 with ADAM outperforms M1 with SGD.

Performance on image classification benchmarks
To evaluate the scalability of our approach, we evaluate it on the CIFAR-100 dataset, which has a larger number of classes. The maximum achievable mean accuracy on CIFAR-100 using 100% of the data, denoted as Top-1 accuracy in Fig. 2(b), is 63.14%. Our models can reach the top performance of VAAL using only 20% of the annotated training data, with M1 (SGD) requiring 30%. The proposed models consistently outperform the existing baselines.

Figure 2: Performance on classification tasks using a) CIFAR-10, b) CIFAR-100, compared to VAAL [18], Core-Set [15], Ensembles w. VarR [1], MC-Dropout [4], DBAL [5], and random sampling. M1 indicates our model (Eq. 2) and M2 indicates our model (Eq. 1). Legend names are in descending order of final accuracies. Best viewed in color. Data and code required to reproduce are provided in the supplementary material.

Effect of biased initial pool
We investigated the performance of our models when the initial labeled pool L_0 is biased. A good AL system is expected to discover data points of unknown classes at an early stage. Intuitively, initial bias can hurt model training by making the labeled pool L_0 unrepresentative of the underlying data distribution, failing to cover most regions of the latent space. We simulate this by intentionally removing all data points of c classes, so that the initial labeled pool L_0 contains no data point belonging to those classes (the superscript 0 indicates the stage). We experimented on CIFAR-100 with c values of 10 and 20, where the c classes are chosen at random, and compared against the case where samples are selected uniformly at random from all classes. As shown in Figure 6, our method is superior to VAAL, Core-Set, and random sampling in selecting informative data points from the classes that were underrepresented in the initial labeled pool.
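Constructing such a biased initial pool can be sketched as follows (names and label layout are illustrative):

```python
import random

def biased_initial_pool(labels, m0, excluded_classes, seed=0):
    """Draw an initial labeled pool L_0 of size m0 that contains no
    samples from the excluded classes (simulated initial bias)."""
    rng = random.Random(seed)
    eligible = [i for i, y in enumerate(labels) if y not in excluded_classes]
    return rng.sample(eligible, m0)

labels = [i % 100 for i in range(50000)]   # CIFAR-100-like label layout
excluded = set(range(10))                  # c = 10 withheld classes
pool0 = biased_initial_pool(labels, m0=5000, excluded_classes=excluded)
```

The AL system then has to discover the withheld classes purely through its selection criterion in later stages.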

Effect of budget size on performance
We repeated the experiments described in Sec. 4.1 to test the effect of different budget sizes b on our method, compared to the most competitive baselines on CIFAR-100. We tested budget sizes of b = 5% and b = 10% of D_train. As shown in Fig. 3, our model outperforms VAAL, Core-Set, Ensembles, and random sampling for both budget sizes, with VAAL second best, followed by Core-Set and Ensembles.

Figure 3: Robustness of our approach on the CIFAR-100 classification task to (a) budget size (left) and (b) a biased initial labeled pool (right), compared to VAAL [18], Core-Set [15], Ensembles w. VarR [1], MC-Dropout [4], DBAL [5], and random sampling. M1 indicates our model (Eq. 2) and M2 indicates our model (Eq. 1). Best viewed in color. Data and code required to reproduce are provided in our supplementary material.

We also evaluated our models in the presence of noisy data caused by an inaccurate oracle instead of an ideal one. Similar to the VAAL [18] setup, we assume that erroneous labels are due to ambiguity between some classes and are not adversarial attacks: a mislabeled sample receives a coarse label from within the super-class to which it belongs. As shown in Figure ??, our models consistently outperform existing models such as VAAL and Core-Set, which are independent of the task learner; in the case of VAAL, a separate VAE and discriminator are used as part of the sampling strategy.
We also consider an extreme case of active learning in which the i.i.d. assumption is relaxed. We intentionally added 20% more data, equivalent to 10,000 images from other datasets, to the existing unlabeled pool, so our network must not only distinguish informative from non-informative samples but also distinguish samples from the current distribution vs. out-of-distribution samples. Whenever a model selects a wrong sample from another dataset and sends it to the oracle, the human expert discards the sample, which affects the overall budget; the discarded samples are placed back in the unlabeled pool, so at any given stage the total number of images from other datasets remains 10,000. As the end of the corresponding curve shows, the increase in accuracy is small, since the unlabeled pool has a higher impact on the sampling methodology. We specifically used our second sampling strategy (Weibull distribution sampling) to handle this scenario.
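The discard-and-replace behavior of the oracle in this non-i.i.d. setup can be sketched as follows (placeholder selector and pool; the real system would use the Weibull criterion):

```python
import random

def query_with_discard(unlabeled, is_ood, select, budget):
    """Non-i.i.d. querying sketch: the oracle discards out-of-distribution
    picks, which return to the unlabeled pool (though they still consumed
    budget); only in-distribution picks are kept for annotation."""
    query = select(unlabeled, budget)
    kept = [i for i in query if not is_ood(i)]
    # Discarded OOD samples stay in U, so the OOD count stays constant.
    remaining = unlabeled - set(kept)
    return kept, remaining

random.seed(0)
pool = set(range(1000))
ood = set(random.sample(sorted(pool), 200))   # 20% injected from other datasets
kept, remaining = query_with_discard(
    pool, lambda i: i in ood,
    select=lambda u, b: sorted(u)[:b],        # placeholder selector
    budget=100)
```

Because discarded samples re-enter the pool, a selector that cannot reject OOD data keeps wasting budget on them stage after stage.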

Sampling Time Analysis
The sampling method in an active learning system plays a major role in time-efficient training. We compared our sampling time against the other baseline frameworks. The analysis was done on the CIFAR-10 dataset using a single NVIDIA 1080 Ti, and the overall time required for each model is shown in the table. Our method has sampling times similar to VAAL and DBAL. Ours is slightly higher than VAAL's because we must pass our latent vector through the linear classifier, whereas in VAAL the discriminator directly outputs the probability. However, we have better training time than VAAL: VAAL contains a classifier (a VGG16 similar to ours), a VAE, and a discriminator, with the VAE and discriminator trained in a min-max game, and optimizing all of them together takes far more time than optimizing a single model as in our case. MC-Dropout measures uncertainty using multiple forward passes over 10 dropout masks, which leads to its increased sampling time.

Figure 5: Performance on classification tasks using CIFAR-10 and CIFAR-100, compared to VAAL [18], Core-Set [15], Ensembles w. VarR [1], MC-Dropout [4], DBAL [5], and random sampling. M1 indicates our model with encoder and classifier, and M2 our model with encoder, decoder, and classifier. Best viewed in color. Data and code required to reproduce are provided in our code repository.

Figure 6: Performance on classification tasks using CIFAR-10 and CIFAR-100, compared to VAAL [18], Core-Set [15], Ensembles w. VarR [1], MC-Dropout [4], DBAL [5], and random sampling. M1 indicates our model with encoder and classifier, and M2 our model with encoder, decoder, and classifier. Best viewed in color. Data and code required to reproduce are provided in our code repository.