Unsupervised Few-Shot Feature Learning via Self-Supervised Training

Learning from limited exemplars (few-shot learning) is a fundamental, still largely unsolved problem that has been extensively explored in the machine learning community. However, current few-shot learners are mostly supervised and rely heavily on a large amount of labeled examples. Unsupervised learning is a more natural procedure for cognitive mammals and has produced promising results in many machine learning tasks. In this paper, we propose an unsupervised feature learning method for few-shot learning. The proposed model consists of two alternating processes, progressive clustering and episodic training. The former generates pseudo-labeled training examples for constructing episodic tasks, and the latter trains the few-shot learner on the generated episodic tasks, which further optimizes the feature representations of the data. The two processes facilitate each other and eventually produce a high-quality few-shot learner. In our experiments, our model achieves good generalization performance on a variety of downstream few-shot learning tasks on Omniglot and MiniImageNet. We also construct a new few-shot person re-identification dataset, FS-Market1501, to demonstrate the feasibility of our model in a real-world application.


OTHER METRIC LOSS FUNCTIONS USED IN OUR MODEL
The goal of our study is to find a good embedding feature space from the unlabeled dataset X = {x_i}, so that we can build a few-shot classifier that can be directly applied to downstream tasks efficiently. In principle, many metric loss functions can be used in our model. Here we present results for the triplet loss (Weinberger and Saul, 2009) and the hard-triplet loss (Hermans et al., 2017) (Table S1, Table S2); both have been widely used in face recognition and image retrieval. The triplet loss L_triplet is computed over a set of triplets, each of which consists of a query feature z, a positive feature z^+ and a negative feature z^-, and is written as

\[ L_{\mathrm{triplet}} = \sum_{(z,\, z^{+},\, z^{-})} \big[\, m - s(z, z^{+}) + s(z, z^{-}) \,\big]_{+}, \tag{S1} \]

where s(·,·) denotes the similarity between two embedded points, m controls the margin between two classes, and the hinge term [·]_+ plays the role of correcting triplets, so that the similarity of the positive example to the query point exceeds that of the negative example by at least the margin m. However, in the above form, positive pairs in those "already correct" triplets will no longer be pulled together due to the hard cutoff. We therefore replace the hinge term by a soft-margin formulation, which gives

\[ L_{\mathrm{triplet\text{-}SM}} = \sum_{(z,\, z^{+},\, z^{-})} \log\Big(1 + \exp\big(\, m - s(z, z^{+}) + s(z, z^{-}) \,\big)\Big). \tag{S2} \]

Eq. S2 is similar to Eq. S1, but it decays exponentially instead of having a hard cutoff and tends to be numerically more stable (Hermans et al., 2017).

[Figure S3. Comparison between k-nearest neighbours and k-reciprocal nearest neighbours on the FS-Market1501 dataset.]
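To make the two formulations concrete, the following is a minimal PyTorch sketch of Eqs. S1 and S2; the choice of negative squared Euclidean distance as the similarity s(·,·) is an assumption for illustration, not necessarily the metric used in our experiments.

```python
import torch.nn.functional as F

def similarity(a, b):
    # similarity assumed to be the negative squared Euclidean distance
    return -((a - b) ** 2).sum(dim=-1)

def triplet_loss(z, z_pos, z_neg, m=0.3):
    # Eq. S1: hinge form [m - s(z, z+) + s(z, z-)]_+
    return F.relu(m - similarity(z, z_pos) + similarity(z, z_neg)).mean()

def triplet_loss_soft_margin(z, z_pos, z_neg, m=0.3):
    # Eq. S2: soft-margin form log(1 + exp(m - s(z, z+) + s(z, z-)))
    return F.softplus(m - similarity(z, z_pos) + similarity(z, z_neg)).mean()
```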
Here we briefly discuss the relationship between the triplet loss and the prototypical loss used in the main text. Consider an M-way 1-shot episodic learning scenario, where a prototype c_k is the support point z_k itself. The prototypical loss (Eq. 4 in the main text) is then written as

\[ L_{\mathrm{proto}} = -\log \frac{\exp\!\big(s(z, z_{p})\big)}{\sum_{k=1}^{M} \exp\!\big(s(z, z_{k})\big)} = \log\Big(1 + \sum_{k \neq p} \exp\!\big(s(z, z_{k}) - s(z, z_{p})\big)\Big), \tag{S3} \]

where z_p is the support point of the query's class. From Eq. S3, we can see that the query point z is pulled towards the corresponding support point z_p, and meanwhile, z is pushed away from all other support points {z_k}_{k≠p}; whereas, when using the triplet soft-margin loss (Eq. S2), the query point z is only pushed away from a single negative point z^-. This implies that in each update, L_triplet-SM only interacts with a single negative example from one of the other classes and ignores many other negative examples. When K is small, optimizing the model with the two loss functions makes little difference. For example, when K = 2 and m = 0, Eq. S2 and Eq. S3 become exactly the same. However, when K becomes larger, the possible number of triplets grows cubically with M and linearly with K, which makes it difficult to select non-trivial triplets. In such a situation, optimizing on these uninformative triplets causes the model to get stuck in a local optimum and to suffer from slow convergence. This explains why the model has an inferior performance using the triplet loss compared to using the prototype loss (Table S1 and Table S2).
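For reference, a minimal sketch of the M-way 1-shot prototypical loss of Eq. S3 is given below, again assuming negative squared Euclidean distance as the similarity; it simply applies a cross-entropy over the similarities to the M support points.

```python
import torch
import torch.nn.functional as F

def proto_loss_1shot(query, supports, target):
    # query: (D,) embedding of the query point z
    # supports: (M, D) one support embedding z_k per class (1-shot prototypes)
    # target: index p of the query's class
    sims = -((query.unsqueeze(0) - supports) ** 2).sum(dim=-1)  # s(z, z_k), shape (M,)
    # cross-entropy over the similarities reproduces Eq. S3
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([target]))
```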
The inefficiency of the conventional triplet loss motivates us to mine hard triplets to alleviate its shortcomings (Wang et al., 2014; Cui et al., 2016; Hermans et al., 2017). Mining hard negative examples across the whole dataset is infeasible, since it is too time-consuming to evaluate all embedding vectors in a deep learning framework. We therefore perform hard example mining within a batch, i.e., we select the hardest positive and the hardest negative example when forming the triplets, and obtain

\[ L_{\mathrm{hardtriplet\text{-}SM}} = \log\Big(1 + \exp\big(\, m - \min_{z^{+}} s(z, z^{+}) + \max_{z^{-}} s(z, z^{-}) \,\big)\Big), \tag{S4} \]

where the minimum is taken over the positive examples of the query z within the batch (the hardest positive) and the maximum over its negative examples (the hardest negative). Compared to Eq. S3, which pushes a query point away from all other support points from different classes, Eq. S4 focuses on pulling the hardest positive example closer and pushing the hardest negative example away at the same time. In this way, we obtain a slightly better performance than with the prototype loss (Table S1, Table S2).
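A minimal sketch of this batch-hard mining (Eq. S4) is shown below; it assumes each identity appears at least twice in the batch and, as before, uses negative squared Euclidean distance as the similarity.

```python
import torch
import torch.nn.functional as F

def hard_triplet_soft_margin(embeddings, labels, m=0.3):
    # embeddings: (B, D) batch of embedded features; labels: (B,) identity labels
    sims = -torch.cdist(embeddings, embeddings) ** 2            # (B, B) pairwise similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)           # positive mask (includes self)
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    # hardest positive: least similar example of the same identity (excluding self)
    hardest_pos = sims.masked_fill(~same | eye, float('inf')).min(dim=1).values
    # hardest negative: most similar example of a different identity
    hardest_neg = sims.masked_fill(same, float('-inf')).max(dim=1).values
    return F.softplus(m - hardest_pos + hardest_neg).mean()
```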

CONSTRUCTION OF THE FS-MARKET1501 DATASET
FS-Market1501 is a person re-identification (Re-ID) dataset constructed from the Market1501 dataset.
In the original dataset, a total of six cameras were used to collect images, including five high-resolution cameras and one low-resolution camera. Overall, the original dataset contains 32,668 annotated bounding boxes of 1,501 identities, including 12,936 images with 751 pedestrian identities for training, 3,368 images with 750 pedestrian identities for query, and the remaining images as the gallery set. To increase the retrieval difficulty, the original gallery set also contains some distractors, e.g., the images with low DPM values and the images of identity "0". When constructing the FS-Market1501 dataset, we remove the distractors from the gallery set and keep the remaining gallery images as well as the query set as our testing set. In total, there are 12,936 images with 751 pedestrian identities for training and 16,483 images with the remaining 750 pedestrian identities for evaluating the few-shot performance of our model (see Table S4).
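As an illustration, the following is a minimal sketch of how the FS-Market1501 test split could be assembled from the public Market-1501 release; the directory names follow the standard Market-1501 layout, while the output path and the identity strings treated as distractors ("-1" and "0000") are assumptions and may differ from our exact preprocessing.

```python
import os
import shutil
from glob import glob

MARKET_ROOT = "Market-1501-v15.09.15"   # hypothetical path to the original dataset
OUT_TEST = "FS-Market1501/test"         # hypothetical output directory
os.makedirs(OUT_TEST, exist_ok=True)

def is_distractor(filename):
    # Market-1501 filenames start with the person ID; IDs "-1" and "0000"
    # mark junk/distractor bounding boxes (assumed here) that are excluded.
    pid = os.path.basename(filename).split("_")[0]
    return pid in ("-1", "0000")

kept = 0
for split in ("bounding_box_test", "query"):     # cleaned gallery + query -> test set
    for path in glob(os.path.join(MARKET_ROOT, split, "*.jpg")):
        if is_distractor(path):
            continue
        shutil.copy(path, OUT_TEST)
        kept += 1

print(f"kept {kept} test images")
```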

THE CHOICE OF ms AND ρ IN THE DBSCAN ALGORITHM
In the main text, we have described that for the clustering method DBSCAN, we set the minimum number of samples ms = 2 and the neighborhood radius ε to be the mean of the top P values of the pairwise distances, with P = ρN(N − 1)/2 and ρ = 0.0015. These values are set to be relatively small to ensure that feature points are well separated, so that diverse episodic tasks can be constructed. Here we analyze the effect of varying ρ when ms is fixed on the Omniglot dataset (see Table S3). The effect of varying ms when fixing ρ is the same.
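For concreteness, the sketch below shows one way to derive ε from ρ and run DBSCAN on the embedded features; it assumes Euclidean distances and interprets the "top P values" as the P smallest pairwise distances, which is an assumption consistent with keeping ε small.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN

def run_dbscan(features, rho=0.0015, min_samples=2):
    # features: (N, D) numpy array of embedded points
    n = features.shape[0]
    dists = np.sort(pdist(features))            # all N(N-1)/2 pairwise Euclidean distances
    p = max(1, int(rho * n * (n - 1) / 2))      # P = rho * N(N-1)/2
    eps = dists[:p].mean()                      # eps = mean of the P smallest distances (assumed)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    return labels, eps

# usage: labels, eps = run_dbscan(embeddings)
```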

TRAINING DETAILS OF THE UNSUPERVISED FEATURE LEARNING METHODS: AUTOENCODER, INFOGAN AND DEEPCLUSTERING
In Sec.4.3 (main text), we compared our model with some unsupervised feature learning methods: (Denoising) AutoEncoder (Vincent et al., 2008), InfoGAN (Chen et al., 2016), and DeepClustering (Caron et al., 2018). For a fair comparison, we modified the feature extractor (the encoder in the AutoEncoder model, the discriminator in the InfoGAN and the feature embedding network in the DeepClustering) to be the 4-layer network as described in Sec.4.2 (main text).
AutoEncoder: we run both the AutoEncoder and the Denoising AutoEncoder in the current study. We do not use parameter sharing, i.e., the decoder weights are not constrained to be the transpose of the encoder weights. The model is trained for 200 epochs in total. We used Adam with momentum to update the parameters in the encoder and the decoder, and the learning rate is set to 0.005 with an exponential decay after 100 epochs. The mini-batch size is 128.

InfoGAN: the model is an information-theoretic extension of the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. For training, we build upon the code available at https://github.com/Natsu6767/InfoGAN-PyTorch. On the Omniglot dataset, we set the dimension of the incompressible noise to 26, use a categorical code with dimension 10, and use two continuous codes that can capture variations that are continuous in nature. On the MiniImageNet dataset, we set the dimension of the incompressible noise to 128, use a categorical code with dimension 10, and use 10 continuous codes.
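As an illustration of the optimization setup just described for the (Denoising) AutoEncoder, here is a minimal PyTorch sketch; the exponential decay factor is not specified in the text, so the value below is a hypothetical choice.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

def make_optimizer(encoder, decoder, lr=0.005):
    # joint optimizer over encoder and decoder parameters
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = Adam(params, lr=lr)
    scheduler = ExponentialLR(optimizer, gamma=0.9)   # hypothetical decay factor
    return optimizer, scheduler

# training-loop sketch: the decay is only applied after epoch 100
# for epoch in range(200):
#     for batch in loader:              # mini-batch size 128
#         ...
#     if epoch >= 100:
#         scheduler.step()
```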
DeepClustering: the model jointly learns the parameters of a neural network and the cluster assignments of the resulting features. The main contribution of that work is to address the degenerate-solution problem in progressive clustering by reassigning empty clusters during the k-means optimization. We follow the training details in the authors' paper and train a 4-layer feature embedding network with a softmax classification learning objective. The number of clusters is set to 1,000 for both Omniglot and MiniImageNet. The readout layer is re-initialized after the k-means clustering in each iteration. The number of iterations is set to 20 and the number of training epochs in each iteration is set to 50. The initial learning rate in each iteration is 0.005 with an exponential decay at epoch 25. The mini-batch size is 128.
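For concreteness, the sketch below outlines the cluster-then-train alternation described above (k-means pseudo-labels, a re-initialized readout layer, then softmax training); feature extraction over the whole dataset in one pass and the decay factor are simplified assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deep_clustering(embed_net, images, n_clusters=1000, n_iters=20, epochs_per_iter=50):
    # images: (N, C, H, W) tensor; embed_net is assumed to return flat (N, D) features
    for it in range(n_iters):
        with torch.no_grad():
            feats = embed_net(images)                              # embeddings of the whole set
        pseudo_labels = torch.as_tensor(
            KMeans(n_clusters=n_clusters).fit_predict(feats.numpy()), dtype=torch.long)
        readout = nn.Linear(feats.shape[1], n_clusters)            # re-initialized every iteration
        optimizer = torch.optim.Adam(
            list(embed_net.parameters()) + list(readout.parameters()), lr=0.005)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # hypothetical
        criterion = nn.CrossEntropyLoss()
        for epoch in range(epochs_per_iter):
            for idx in torch.randperm(images.shape[0]).split(128):  # mini-batch size 128
                loss = criterion(readout(embed_net(images[idx])), pseudo_labels[idx])
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if epoch >= epochs_per_iter // 2:                       # exponential decay after epoch 25
                scheduler.step()
    return embed_net
```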

PERFORMANCES OF OUR MODEL COMPARED TO OTHER NON-EPISODIC UNSUPERVISED FEATURE LEARNING METHODS WITH CONFIDENCE INTERVALS
After obtaining the feature extractor in the three unsupervised feature learning models, we simply build a prototypical classifier to perform few-shot classification on the downstream tasks, that is, we perform classification by computing distances to the prototype representation of each class. Other methods could also be used to perform few-shot classification on top of the embedding network, such as a k-nearest neighbour classifier, a linear classifier, or a multi-layer perceptron. However, these methods do not benefit from the episodic learning paradigm and suffer from the problem of meta-overfitting, as reported in Hsu et al. (2018). Hence, we only run a prototypical classifier on top of these feature embedding networks in the current study (see Table S5 and Table S6).
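For reference, a minimal sketch of this nearest-prototype classification on a downstream N-way K-shot task is given below, assuming a frozen embedding network and Euclidean distance in the embedding space.

```python
import torch

def proto_classify(embed_net, support_x, support_y, query_x, n_way):
    # support_x: (N*K, ...) support images with labels support_y in {0, ..., n_way-1}
    # query_x:   (Q, ...)   query images to classify
    with torch.no_grad():
        s = embed_net(support_x)                         # (N*K, D) support embeddings
        q = embed_net(query_x)                           # (Q, D)   query embeddings
    # prototype of each class = mean of its support embeddings
    prototypes = torch.stack([s[support_y == c].mean(0) for c in range(n_way)])
    dists = torch.cdist(q, prototypes)                   # (Q, n_way) distances to prototypes
    return dists.argmin(dim=1)                           # predicted class per query
```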

PERFORMANCES OF OUR MODEL COMPARED TO THE SOTA UNSUPERVISED FEW-SHOT LEARNING MODELS WITH CONFIDENCE INTERVALS
Note that no confidence intervals are reported for the UMTRA model. The confidence intervals of the supervised learning methods MAML and ProtoNets are borrowed from Hsu et al. (2018) (see Table S7 and Table S8).

SUPERVISED TRAINING ON THE FS-MARKET1501
ResNet50 pretrained on ImageNet is a conventional backbone model for person Re-ID benchmarks. In the current study, we also use it as our backbone model on the FS-Market1501 dataset. Following Xiong et al. (2018), we add a batch normalization layer after the global pooling layer to prevent overfitting, and we directly use the batch-normalized global pooling features to calculate the prototype of each class. When training with the triplet loss and the hard-triplet loss, the margin m between positive and negative pairs is set to 0.3. When training with the prototype loss, the setting is the same as described in Sec. 4.2. For the results, see Table S2.
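The backbone modification described above can be sketched as follows; this is a minimal illustration of the architecture (ImageNet-pretrained ResNet50, global average pooling, then batch normalization), not the exact training code.

```python
import torch.nn as nn
from torchvision import models

class ReIDBackbone(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        resnet = models.resnet50(pretrained=True)                  # ImageNet-pretrained weights
        self.features = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool & fc
        self.pool = nn.AdaptiveAvgPool2d(1)                        # global average pooling
        self.bn = nn.BatchNorm1d(feat_dim)                         # BN after global pooling

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)                 # (B, 2048) pooled features
        return self.bn(x)                                          # batch-normalized embeddings
```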

USING RESNET12 AND ALEXNET AS THE FEATURE EMBEDDING NETWORK ON MINIIMAGENET
In Sec. 4.4, we showed that the performance of our model on MiniImageNet is competitive with other SOTA unsupervised few-shot learning methods, but it is not one of the SOTA models. One possible reason is that the feature embedding network is too simple (a 4-layer convnet) to extract the semantic meaning of images, especially under the unsupervised setting. In other words, a shallow embedding network may not make adequate use of UFLST's expressive capacity, so we opted to use a deeper embedding network to prevent underfitting. Here we use ResNet12 and AlexNet, which are more complex than the 4-layer convnet, as the feature embedding network to improve the performance of unsupervised few-shot learning. ResNet12 has been used in several supervised few-shot learning models (Mishra et al., 2017; Oreshkin et al., 2018), and is a smaller version of ResNet (He et al., 2016). AlexNet was proposed by Krizhevsky et al. (2012) and has had a large impact on the field of machine learning. Table S9 shows the results when using these deeper feature embedding networks.