A novel meta-learning-based hyperspectral image classification algorithm

To address hyperspectral image (HSI) classification with limited samples, this paper designs a joint spectral-spatial classification network based on metric meta-learning. First, to extract fine HSI features, the squeeze and excitation (SE) attention mechanism is introduced into the spectral-dimension channel to selectively extract useful HSI features and improve the sensitivity of the network to informative features. Second, for spatial feature extraction, VGG16 model parameters pre-trained on the HSRS-SC dataset are used to transfer spatial feature knowledge, and higher-level abstract features are then extracted to mine the intrinsic attributes of ground objects. Finally, a gated feature fusion strategy is introduced to combine the extracted spectral and spatial features of HSI and mine richer feature information. Extensive experiments are carried out on public hyperspectral datasets, including Pavia University and Salinas. The results show that the meta-learning method can quickly learn new categories from only a small number of labeled samples and generalizes well across different HSI datasets.


Introduction
A hyperspectral image (HSI) is a spectral image with a spectral resolution in the nanometer range, which captures rich spectral information together with spatial information on ground objects [1]. The unique advantage of hyperspectral imagery is that it provides not only multi-channel spectral information on ground objects but also complex spatial information on different types of ground objects, and its fused spatial-spectral features can effectively distinguish ground objects [2].
HSI classification methods based on deep learning can automatically extract spectral features, spatial features, or spectral-spatial features. Chen et al. [3] proposed a stacked autoencoder (SAE) to extract joint spectral-spatial features for HSI classification. Li et al. [4] utilized deep belief networks (DBNs) to extract spectral-spatial features and achieved better classification performance than SVM-based methods. Makantasis et al. [5] introduced 2D-CNN to HSI classification and obtained satisfactory performance by using a CNN to encode spectral-spatial information followed by a multi-layer perceptron. Chen et al. [6] used a 3D-CNN to simultaneously extract spectral-spatial features of HSI and achieved better classification results. Nevertheless, training very deep CNNs remains difficult because of the information loss caused by the vanishing gradient problem. To solve this problem, Wang et al. [7] introduced ResNet into HSI classification. Zhong et al. [8] designed a spectral-spatial residual network (SSRN) that identifies HSI spectral properties and spatial context using spectral and spatial residual blocks and achieved state-of-the-art HSI classification accuracy. Furthermore, Paoletti proposed deep pyramidal residual networks (PyResNet) [9] to learn more robust spectral-spatial representations from HSI cubes, with competitive advantages over state-of-the-art HSI classification methods in both classification accuracy and computation time.
Hyperspectral image classification based on deep learning has achieved great success, but deep learning methods require a large number of labeled training samples, and acquiring labeled samples is very difficult, requiring considerable manpower, material, and financial resources. In practical classification applications, new scene images often have very few labeled samples, whereas other scene images often have enough labeled samples. Meta-learning is an effective way to achieve few-sample classification. The learned meta-knowledge can help predict the target domain data and solve the problem of hyperspectral image classification when there are only a few labeled samples for each class. Meta-learning was proposed to address the insufficient generalization performance of traditional neural network models and their limited adaptability to new types of tasks. As early as the beginning of the 21st century, Hochreiter et al. verified that neural networks with memory modules could be used to deal with meta-learning problems [10]. Such networks cache information efficiently and accurately through learning and then input it into the memory module to complete the conversion of the output. Subsequently, Munkhdalai et al. proposed a meta-network that applied the idea of meta-learning to the memory network to solve the problem of small-sample learning [11]. This network extracts task-independent meta-level knowledge to achieve rapid parameterization of common tasks. The matching network model proposed by O. Vinyals et al. [12] is the earliest method combining metric learning with meta-learning. Subsequently, Snell et al. proposed prototypical networks by further improving and optimizing the matching network [13]. The prototypical network uses simple ideas to effectively reduce the number of parameters, simplify the training process, and achieve good classification results. C. Finn et al. proposed a model-independent meta-learning algorithm named model-agnostic meta-learning (MAML) [14], which can be seen as a meta-learning tool for training basic meta-learners. Andrei A. Rusu et al. [15] proposed a meta-learning idea that optimizes a hidden-layer embedding on the basis of MAML, constructing a hidden space in which the parameters can complete their inner-loop update, which effectively adapts the behavior of the model.
The contributions of the proposed method are as follows:
1. To address the scarcity of labeled training samples for HSI, this paper proposes a joint transfer classification framework based on the metric meta-learning method.
2. To mine the spatial-spectral features of HSI, this paper proposes a novel spatial-spectral feature extraction module. Moreover, the squeeze and excitation (SE) attention mechanism is introduced into the spectral-dimension channel of the spatial-spectral feature transfer network module to capture global information, selectively extract useful HSI features, reduce the influence of useless information, and increase the attention paid to important features.
3. A gated feature fusion strategy is introduced that uses the spectral and spatial feature information of HSI and adopts recursive merging to fuse the features gradually, thereby enhancing the ability of the network to adapt to the characteristics of HSI. Through gated fusion, the network can select a reasonable combination scheme for each pixel, enhance the appropriate features and suppress the inappropriate ones, and extract richer HSI feature information.

Materials and methods
Figure 1 shows the overall block diagram of the spectral-spatial joint transfer classification network based on metric meta-learning for HSI.
First, the hyperspectral dataset is divided into many different meta-tasks. Each task contains a small number of labeled samples (the support set) and unlabeled samples (the query set). The support set and query set samples are then sent together to the spectral-spatial joint transfer network module to extract the spatial and spectral embedded features. The parameters of the spatial feature branch are initialized with network parameters trained on the HSRS-SC dataset to realize the transfer learning of spatial feature knowledge, which offers a new approach to hyperspectral image classification when training samples are insufficient. Next, the extracted spatial and spectral features are fused to gain more knowledge about general HSI features. Finally, the fused spectral-spatial features are sent to the metric meta-learning classification module, where they are mapped into an embedding space that makes full use of the prior knowledge in the metric space, so that the model can classify image categories quickly and efficiently.
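The division into meta-tasks can be illustrated with a minimal sketch. It assumes a flat list of per-pixel class labels; the function name `sample_episode` and the parameters `n_way`, `k_shot`, and `n_query` are hypothetical names chosen for illustration and do not come from the paper.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Draw one meta-task: a support set and a query set of (index, label) pairs.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    # Only classes with enough samples can take part in an episode.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]   # labeled samples
        query += [(i, c) for i in picked[k_shot:]]     # samples to classify
    return support, query

# Toy labels (invented): 4 classes with 10 pixels each.
labels = [i % 4 for i in range(40)]
support, query = sample_episode(labels, n_way=3, k_shot=1, n_query=2,
                                rng=random.Random(0))
```

During training the query labels are known and are used only to compute the loss; the support and query indices of an episode never overlap.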

Spatial-spectral feature extraction module combined with transfer learning
Starting from joint spatial-spectral features and aiming to optimize feature extraction, this section proposes a spatial-spectral joint transfer network. For HSI classification, relying on spectral features alone makes it difficult to interpret the high-level semantic information of HSI scenes. Modeling spatial and spectral information jointly combines the spectral and spatial advantages of images and better reveals the properties of HSI. The two branches of the network extract the spatial features and spectral features of HSI, respectively, and a channel attention mechanism is applied after the spectral features are extracted. This mechanism strengthens the extracted features and makes them more discriminative, thus improving the classification of HSI. After the spatial features and spectral features are extracted, the two are combined by the gated fusion method, which selectively fuses the spatial-spectral features for classification at different positions according to the feature appearance of the input image.

Spectral feature extraction combined with SE attention mechanisms
For the spectral feature extraction model, the network is configured with one 1-D convolutional layer, one spectral residual block, one 1-D convolutional layer, and one FC layer. SE-Net adds an attention mechanism over channels through two key operations: squeeze and excitation. The module learns the weight of each channel autonomously, which is generally implemented by a neural network. This provides a feature recalibration mechanism that enables the model to focus on the small amount of important information in a large amount of data, thus avoiding wasting computing power on unimportant information.
Given an input $X$ of any size, the transformation $F_{tr}$ first maps it to a feature map $U$ ($U \in \mathbb{R}^{H \times W \times C}$). An SE block is then constructed to recalibrate the features. The squeeze step generates an embedding of the global distribution of channel-wise feature responses by aggregating the features of size C × H × W into a feature of size C × 1 × 1, so that the layer closest to the input layer can obtain the global receptive field. After the squeeze operation, an excitation operation takes the aggregated feature as input and generates a weight for each channel; these weights are applied to $U$ to obtain the final output $\tilde{X}$.
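The squeeze and excitation steps can be sketched in plain Python as follows. This is a generic SE block in the style of the original SE-Net, assuming two fully connected layers without bias, a ReLU between them, and a sigmoid at the end; the weight matrices `w1` and `w2` and the toy input are hypothetical, and the block used in the paper may differ in detail.

```python
import math

def se_block(u, w1, w2):
    """Recalibrate a C x H x W feature map (nested lists) channel by channel.

    w1 has shape (C/r) x C and w2 has shape C x (C/r); both are
    hypothetical learned weights used for illustration only.
    """
    # Squeeze: global average pooling turns C x H x W into C x 1 x 1.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: each channel of U is multiplied by its weight.
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return out, s

# Toy input (invented): 2 channels of size 2 x 2, reduction to 1 hidden unit.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
recalibrated, channel_weights = se_block(feature_map,
                                         w1=[[0.5, -0.5]],
                                         w2=[[1.0], [-1.0]])
```

Each channel weight lies in (0, 1), so the block can only attenuate or preserve a channel relative to the others; the shape of the feature map is unchanged.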

Spatial feature extraction network combined with transfer learning
As shown in Figure 1, the three parts of spectral feature extraction, spatial feature extraction, and spectral-spatial feature extraction constitute the joint spatial-spectral feature extraction network. A meta-learning training strategy is used to learn the embedding feature space suitable for the HSRS-SC dataset. The structure and parameters of the first seven layers of the VGGNet pre-trained on the HSRS-SC dataset are transferred to the spatial feature extraction model, which is then trained on data from the target domain. A CNN with 2D convolution, 2D max-pooling, and FC layers is then designed to extract spatial features. Each 2D convolutional layer is followed by a batch normalization (BN) layer, a ReLU activation function, and max pooling. The BN layer mitigates vanishing gradients and improves the generalization ability of the model. Finally, an FC layer is added to generate the spatial feature vectors.
After the spatial features and spectral features are extracted, the two features are combined by the gated fusion method, which can selectively fuse the spectral-spatial features for the classification of different positions according to the feature appearance of the input image. Through gated fusion, the network can select a reasonable combination scheme for each pixel, enhance suitable features and suppress inappropriate features, and extract richer HSI feature information.
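The text does not give the exact form of the gate, so the following is only a sketch of one common formulation, stated here as an assumption: a sigmoid gate computed from the concatenated features, used to form a convex combination of the spectral and spatial feature vectors. The names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import math

def gated_fusion(f_spec, f_spat, w_gate, b_gate):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Assumed form (not confirmed by the paper):
        g = sigmoid(W [f_spec; f_spat] + b)
        fused = g * f_spec + (1 - g) * f_spat
    """
    concat = f_spec + f_spat  # list concatenation of the two vectors
    gate = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, concat)) + b)))
            for row, b in zip(w_gate, b_gate)]
    # A gate near 1 keeps the spectral feature; near 0 keeps the spatial one.
    fused = [g * a + (1.0 - g) * c for g, a, c in zip(gate, f_spec, f_spat)]
    return fused, gate

# Toy features and gate parameters (invented values).
fused, gate = gated_fusion([1.0, 3.0], [3.0, 1.0],
                           w_gate=[[0.2, 0.0, -0.1, 0.0], [0.0, 0.3, 0.0, 0.1]],
                           b_gate=[0.0, 0.0])
```

With this form, every fused component stays between the corresponding spectral and spatial values, which is one way to read "enhance suitable features and suppress inappropriate features".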

Metric meta-learning classification module
As shown in Figure 1, the obtained spectral-spatial feature vector is classified by comparing the distance between labeled samples and unlabeled samples based on metric meta-learning. The method in this paper improves on the classic metric-based meta-learning algorithm. The estimated metric function minimizes the difference between similar tasks and maximizes the distance between dissimilar tasks, which improves the efficiency of task processing. The support samples $x_j$ and query samples $x_i$ are passed through the spectral-spatial transfer network module to produce the feature vectors $E_\varphi(x_j)$ and $E_\varphi(x_i)$, which are then combined by the concatenation operation $C(\cdot, \cdot)$. The distance between samples can be used to obtain the sample attributes $Con_{j}^{i}$ without using all the features of the samples, which makes more effective use of HSI features and reduces the model's dependence on training samples.
$M_\phi$ is a neural network consisting of three regular convolutional layers. The first two convolutional layers, of size 1 × 1 × 64, are each followed by a Leaky-ReLU activation function for non-linear mapping. The concatenated feature vector is fed into $M_\phi$ and convolved by the final convolutional layer, and a sigmoid activation function then outputs the similarity between samples. In this way, a relation score $m_{i,j}$ is generated for each pair of query sample $x_i$ and support sample $x_j$, and the class of the query sample is obtained by analyzing these similarities.
To determine the label of a query sample, the feature map of each combination is input into $M_\phi$ to generate a score that indicates the similarity between the two embedded samples. The output $m_{i,j}$ lies in the range [0, 1], and pairs with higher scores are considered more similar.
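The final labeling step can be sketched as follows, assuming the relation scores have already been computed and that each query sample takes the class of its highest-scoring support sample. The function name `predict_labels` and the toy scores are hypothetical; with more than one support sample per class, the scores would first need to be aggregated per class.

```python
def predict_labels(scores, support_labels):
    """Assign each query sample the label of its most similar support sample.

    scores[i][j] is the relation score m_{i,j} in [0, 1] between
    query sample i and support sample j.
    """
    preds = []
    for row in scores:
        best_j = max(range(len(row)), key=row.__getitem__)  # argmax over j
        preds.append(support_labels[best_j])
    return preds

# Toy scores (invented) for 2 query samples against 3 support samples.
scores = [[0.1, 0.9, 0.3],
          [0.8, 0.2, 0.4]]
predicted = predict_labels(scores, ["asphalt", "gravel", "bricks"])
```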
The comparison measurement model is trained with the mean squared error (MSE) loss function computed on the relation scores. When the two samples of a pair belong to the same category, the target value is 1; otherwise, it is 0. The loss function is shown as follows:
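A form of this objective that is consistent with the description, assuming the standard relation-network formulation, can be written as below. The labels $y_i$ and $y_j$ and the indicator $\mathbf{1}(\cdot)$ are notational assumptions introduced here, not necessarily the authors' notation.

$$
L(\varphi, \phi) = \sum_{i} \sum_{j} \left( m_{i,j} - \mathbf{1}(y_i = y_j) \right)^2,
\qquad
m_{i,j} = M_\phi\!\left( C\!\left( E_\varphi(x_i), E_\varphi(x_j) \right) \right)
$$

Here $\mathbf{1}(y_i = y_j)$ equals 1 when query sample $x_i$ and support sample $x_j$ share a category and 0 otherwise, so minimizing the loss pushes $m_{i,j}$ toward 1 for matching pairs and toward 0 for non-matching pairs.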

Results
To demonstrate the effectiveness of this method, classification experiments are carried out on two public datasets, Pavia University and Salinas, and the HSRS-SC dataset is selected as the transfer learning dataset. All experiments were conducted on an Intel(R) Xeon(R) 4208 CPU @ 2.10 GHz and an Nvidia GeForce RTX 2080Ti graphics card. The number of training iterations is set to 1,000. For each training iteration, K is set to 1 and N is set to 19, which is the number of categories in the HSI dataset; that is, 1 labeled sample and 19 unlabeled samples were selected randomly to form a training set for model training. In addition, the model is optimized using Adam, and the learning rate is set to 0.001.

Comparison with state-of-the-art methods
To evaluate the effectiveness of the meta-learning method, this paper compares it with supervised learning methods, including the extended morphological profile support vector machine (EMP-SVM) [16], deep convolutional neural network (DCNN) [17], and residual network (ResNet) [18], and with current few-shot learning methods, including the relation network (RLNet) [19], deep cross-domain few-shot learning (DCFSL) [20], and heterogeneous few-shot learning (HFSL) [21]. To ensure the fairness of the experiment, five labeled samples are randomly selected from each class of each HSI dataset as the supervised samples. All experiments were performed 10 times to remove the effect of random sampling. Tables 1 and 2 report the overall accuracy (OA), average accuracy (AA), and kappa coefficient on the Pavia University and Salinas datasets.
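For reference, the three reported metrics can be computed from a confusion matrix as in the sketch below, which uses their standard definitions. The function name and the toy matrix are hypothetical, and the sketch assumes every class has at least one test sample.

```python
def classification_metrics(cm):
    """Return (OA, AA, kappa) for a square confusion matrix.

    cm[t][p] is the number of samples of true class t predicted as class p.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_tot = [sum(cm[t]) for t in range(k)]                       # true counts
    col_tot = [sum(cm[t][p] for t in range(k)) for p in range(k)]  # predicted counts
    # Overall accuracy: fraction of all samples classified correctly.
    oa = sum(cm[c][c] for c in range(k)) / n
    # Average accuracy: mean of the per-class accuracies (recalls).
    aa = sum(cm[c][c] / row_tot[c] for c in range(k)) / k
    # Kappa: agreement corrected for the agreement expected by chance.
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Toy two-class confusion matrix (invented values).
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

OA is dominated by the large classes, whereas AA weights every class equally, which is why both are usually reported for imbalanced HSI scenes.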
From Tables 1 and 2, it can be seen that the proposed method achieves almost the highest classification accuracy in each class. In particular, it improves the classification accuracy more for classes with lower heights, such as the asphalt road class and the grass class in the Pavia University dataset. As shown in Table 1, the OA value on the Pavia University dataset is as high as 82.96%, which is 11.37%, 9.42%, 8.41%, 6.29%, 4.24%, and 3.11% higher than that of EMP-SVM, DCNN, ResNet, RLNet, DCFSL, and HFSL, respectively. These results demonstrate the superiority of meta-learning methods in HSI classification. For categories that other methods cannot classify accurately, such as gravel, bare soil, asphalt, and bricks, the meta-learning method obtains more accurate classification results, further demonstrating its effectiveness. Figures 2 and 3 show the HSI ground truth maps and the false-color maps of the classification results. For ground objects with few samples, such as gravel and bricks, the misclassification is significantly reduced, and the method can correctly classify various types of ground objects with a small number of training samples.
In Figure 3, the SVM algorithm considers only spectral features, and its misclassification rate for Vinyard_untrained and Grapes_untrained is higher. DCNN and the other deep learning algorithms classify Vinyard_untrained and Grapes_untrained better than SVM, which shows that they have good feature extraction ability on large-scale landforms. On this dataset, the meta-learning algorithm improves greatly on the other algorithms and gives the best classification result.

Discussion
To verify the HSI classification framework based on model transfer, the influence of different source datasets on the classification results in transfer learning was studied in order to construct the optimal classification framework. Figure 4 shows the classification results obtained with different datasets for transfer learning, where CP denotes the Pavia Center dataset, SA the Salinas dataset, HS the HSRS-SC dataset, and IM the natural image dataset ImageNet. The visualized results show that the model-based transfer learning method is effective. As shown in Figure 4F, the classification result obtained with HSRS-SC as the source dataset is better than the other three and is closer to the ground truth map in terms of spatial correlation and completeness, especially for the soil class. Rich information in the source domain data facilitates the learning of pre-trained models. A model trained on the HSRS-SC dataset has a stronger feature extraction ability and learns more general features rather than features limited to a specific dataset, so that the pre-trained model adapts better to the new learning task when it is transferred to the target domain.

Conclusion
In order to improve the classification accuracy of hyperspectral images, this paper designs a spatial-spectral joint transfer classification network based on metric meta-learning. Furthermore, to combine the spectral and spatial superiority of HSI, a joint spatial-spectral transfer learning network module is proposed in this paper, which can extract finer HSI features and capture cross-dimensional and spatial interaction information. The experimental results on two publicly available HSI datasets show that the meta-learning method proposed in this paper is more competitive and outperforms other classical methods and existing few-shot learning methods. In the future, we will study model compression and pruning to reduce the complexity of the proposed model and improve real-time performance without affecting the classification ability.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.