A hybrid neural architecture search for hyperspectral image classification

Convolutional neural networks (CNNs) are widely used in hyperspectral image (HSI) classification. However, the network architecture of a CNN is usually designed manually and requires careful fine-tuning. Recently, many neural architecture search (NAS) techniques have been proposed to design networks automatically, raising the accuracy of HSI classification to a new level. This paper proposes a circular kernel convolution-β-decay regularization NAS-confident learning rate (CK-βNAS-CLR) framework to automatically design the neural network structure for HSI classification. First, we construct a hybrid search space with 12 kinds of operations, which exploits the difference between enhanced circular kernel convolution and square kernel convolution in feature acquisition to improve the sensitivity of the network to hyperspectral features. Then, the β-decay regularization scheme is introduced to enhance the robustness of differentiable architecture search (DARTS) and reduce the discretization discrepancy in architecture search. Finally, we combine a confident learning rate strategy to alleviate the problem of performance collapse. Experimental results on public HSI datasets (Indian Pines, Pavia University) show that the proposed NAS method achieves impressive classification performance and effectively improves classification accuracy.


Introduction
Hyperspectral images (HSIs) collect rich spatial-spectral information in hundreds of spectral bands, which can be used to distinguish ground cover effectively. HSI classification is performed at the pixel level, and many traditional machine learning methods have been applied, such as the K-nearest neighbor (KNN) [1] and the support vector machine (SVM) [2]. HSI classification methods based on deep learning can extract robust features and achieve better classification performance [3][4][5].
The cost of computing resources and the workload of manual parameter tuning have inevitably promoted the development of techniques that design efficient neural networks automatically [6]. The goal of neural architecture search (NAS) is to select and combine different neural operations from a predefined search space and to automate the construction of high-performance network structures. Traditional NAS work uses reinforcement learning (RL) [7], evolutionary algorithms (EA) [8], and gradient-based methods to conduct the architecture search.
To reduce resource consumption, one-shot NAS methods based on a supernet have been developed [9]. DARTS is a one-shot NAS method with a differentiable search strategy [10]. By introducing the Softmax function, it relaxes the discrete search space into a continuous optimization process, which reduces the workload of network architecture design and avoids a large number of validation experiments [9].
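As a minimal sketch of this continuous relaxation, assuming PyTorch, the candidate operations and sizes below are illustrative rather than the paper's actual search space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One supernet edge: a Softmax-weighted mixture of candidate operations."""
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # 3x3 conv
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),  # 5x5 conv
            nn.AvgPool2d(3, stride=1, padding=1),                     # average pooling
            nn.Identity(),                                            # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)   # relax the discrete choice
        return sum(w * op(x) for w, op in zip(weights, self.ops))

out = MixedOp(16)(torch.randn(2, 16, 32, 32))    # -> torch.Size([2, 16, 32, 32])
```

After the search, the operation with the largest Softmax weight on each edge is kept, which discretizes the mixture back into a single architecture.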
The convolutional neural architecture search for HSI classification (CNAS) introduced DARTS into the HSI classification task for the first time. It uses point-wise convolution to compress the spectral dimension of the HSI to a few dozen channels and then applies DARTS to search for a network architecture suitable for the HSI dataset [11]. Subsequently, the 3D asymmetric neural architecture search (3D-ANAS) designed a pixel-to-pixel classification framework and alleviated operation redundancy with 3D asymmetric CNNs, significantly improving computational speed [12].
Traditional CNN designs use square kernels to extract image features, which poses significant challenges to the computing system because the number of arithmetic operations grows rapidly with network size. Moreover, the features acquired by a square kernel are usually unevenly distributed [13] because the weights near the central intersection are typically large. Inspired by circular kernel (CK) convolution, this paper studies a new NAS paradigm that classifies HSI data by automatically searching a hybrid search space. The main contributions of this paper are as follows: 1) An effective NAS framework, called CK-βNAS-CLR, is proposed; it is built on a hybrid search space of 12 operations combining circular convolutions with different convolution methods, different scales, and attention mechanisms to improve the feature acquisition ability. 2) β-decay regularization is introduced to stabilize the search process and make the searched architecture transferable among multiple HSI datasets. 3) A confident learning rate strategy is introduced to account for the confidence of the architecture gradient when updating the structure weights and to prevent over-parameterization.

Materials and methods
As shown in Figure 1, we describe the proposed NAS framework for HSI classification, called CK-βNAS-CLR. Compared with other HSI classification methods, it aims to alleviate the shortcomings of traditional micro NAS methods in three aspects, namely, the search space, the search strategy, and architecture resource optimization, and to effectively improve classification accuracy.
DARTS is the basic framework; it adopts weight sharing and combines supernet training with the search for the best candidate architecture to reduce the waste of computing resources. First, the hyperspectral image is cropped into patches by a sliding window and fed as input. Then, a hybrid search space combining CK convolution and attention mechanisms is constructed, and the operation search between nodes is carried out in this space to improve the feature acquisition ability of the receptive field. At the same time, the architecture parameter set β, which represents the importance of the operators, is decayed and regularized, which strengthens the robustness of DARTS and reduces the discretization discrepancy in the architecture search process. After the search is completed, the algorithm stacks multiple normal cells and reduction cells to form the optimal neural structure, and the classification results are obtained through a Softmax operation. In addition, CLR is combined with the decay regularization to alleviate the performance collapse of DARTS, improve memory efficiency, and reduce architecture search time.

FIGURE 1 Overall framework of the proposed CK-βNAS-CLR model.
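As a schematic illustration of this stacking step, assuming PyTorch: the cell bodies below are plain placeholders, not the searched topology, and all channel sizes are illustrative.

```python
import torch.nn as nn

def build_network(in_ch: int, num_classes: int, num_normal: int = 3) -> nn.Sequential:
    layers, ch = [], in_ch
    for _ in range(2):                            # two stages of the stack
        for _ in range(num_normal):               # "normal" cells keep resolution
            layers.append(nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(ch), nn.ReLU()))
        layers.append(nn.Sequential(              # "reduction" cell halves resolution
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(ch * 2), nn.ReLU()))
        ch *= 2
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, num_classes)]
    return nn.Sequential(*layers)                 # Softmax is applied at the loss/inference
```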

The proposed NAS framework for HSI classification

Integrating circular kernels into convolution
The circular kernel is isotropic and can be realized from all directions. In addition, a symmetric circular kernel ensures rotation invariance. We use bilinear interpolation to approximate the traditional square convolution kernel by a circular one and reparameterize the weight matrix through a matrix transformation, replacing the original matrix with the transformed one to realize the shift of the receptive field. Without loss of generality, the receptive field H of a standard 3 × 3 square convolution kernel with a dilation of 1 is written as

H = {(−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 0), (0, 1), (1, −1), (1, 0), (1, 1)},   (1)

where H represents the set of offsets of the neighborhood convolved around the center pixel. By convolving the input feature map J ∈ ℝ^{M×N} with the kernel R ∈ ℝ^{S×S}, the output feature map U ∈ ℝ^{M×N} is obtained, and its value at each position b is given by formula (2):

U(b) = Σ_{k∈H} R(k) · J(b + k).   (2)
So, we get U = R ⊗ J, where ⊗ represents the classical convolution operation used by the CNN. The changed receptive field of the circular 3 × 3 kernel is then given by formula (3):

H_circle = {(0, 0), (0, ±1), (±1, 0), (±√2/2, ±√2/2)}.   (3)
To sample with the circular convolution kernel, we select the offsets {Δb} of k for the different discrete kernel positions and resample the input J at these offsets to obtain the circular receptive field. Because the sampling positions of the circular kernel have fractional coordinates, we use bilinear interpolation to approximate the sampled values of the receptive field:

Ĵ(b) = Σ_k V(k, b) · J(k),
where b represents a (fractional) grid position in the circular receptive field, k enumerates all integer grid positions in the square receptive field, and V(·, ·) is the kernel of two-dimensional bilinear interpolation. Following bilinear interpolation, V can be factorized into two one-dimensional kernels:

V(k, b) = v(k_x, b_x) · v(k_y, b_y),   with   v(a, c) = max(0, 1 − |a − c|).
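A small illustrative sketch of this resampling, in NumPy: the eight ring samples of a 3 × 3 circular kernel fall at fractional grid positions, and the bilinear weights v(a, c) = max(0, 1 − |a − c|) recover their values from the square grid. All names here are ours, not the paper's.

```python
import numpy as np

def circular_offsets() -> np.ndarray:
    """Center plus eight samples on the unit circle (the 3x3 circular field)."""
    angles = np.arange(8) * np.pi / 4
    ring = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.vstack([[0.0, 0.0], np.round(ring, 6)])

def bilinear_weights(b, grid) -> np.ndarray:
    """Weight V(k, b) of every integer grid point k for fractional position b."""
    v = lambda a, c: max(0.0, 1.0 - abs(a - c))
    return np.array([v(kx, b[0]) * v(ky, b[1]) for kx, ky in grid])

grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]  # square 3x3 positions
for b in circular_offsets():
    w = bilinear_weights(b, grid)
    assert abs(w.sum() - 1.0) < 1e-6   # each resampled value is a convex mix
```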
Therefore, V(k, b) ≠ 0 and Σ_k V(k, b) = 1 hold only for the grid positions k adjacent to the location b in the circular receptive field. We then let Ĵ_RF(l) ∈ ℝ^{S²×1} and R ∈ ℝ^{S²×1} represent the vectorized receptive field centered at position l and the vectorized kernel, respectively. The standard convolution can then be defined as in formula (8), and after replacing the square kernel with the circular one, the circular convolution is given by formula (9):

U(l) = Ĵ_RF(l)^T · R,   (8)

U(l) = (C · Ĵ_RF(l))^T · R,   (9)
where C ∈ ℝ^{S²×S²} is a fixed sparse coefficient matrix. Letting J, U, and R denote the input feature map, the output feature map, and the kernel, respectively, the corresponding form of formula (9) can be written as formula (10):

U = R ⊗ (C * J),   (10)
where C * J denotes applying the transformation C to every local receptive field of J, i.e., changing the square receptive field into a circular one. Thus, we can fold the transformation into the kernel weights to realize the operation C * J, which avoids computing offsets for every convolution and reduces the cost of the kernel operation. Next, we analyze the actual effect of the transformation matrix. Let ΔR = R_{a+1} − R_a; the effect of a change in the kernel on the output is shown in formula (11), and its squared magnitude in formula (12):

ΔU(l) = Ĵ_RF(l)^T · C^T · ΔR,   (11)

‖C^T · ΔR‖² = ΔR^T · C · C^T · ΔR.   (12)
In contrast, the corresponding quantity for a traditional convolution layer is ΔR^T ΔR. Therefore, the transformation matrix C introduced by the circular kernel can provide a better choice of gradient descent path for DARTS.
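Continuing the previous sketch, the fixed sparse matrix C can be materialized row by row from the same bilinear weights; the comparison at the end mirrors formulas (11) and (12) under our assumed notation.

```python
import numpy as np

grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
angles = np.arange(8) * np.pi / 4
circle = np.vstack([[0.0, 0.0],
                    np.round(np.stack([np.cos(angles), np.sin(angles)], 1), 6)])
v = lambda a, c: max(0.0, 1.0 - abs(a - c))

# Row i of C holds the bilinear weights producing the i-th circular sample,
# so resampling is one fixed linear map instead of per-pixel offset lookups.
C = np.array([[v(kx, bx) * v(ky, by) for kx, ky in grid] for bx, by in circle])

dR = np.random.randn(9)            # a perturbation of the vectorized kernel
sq_circle = dR @ C @ C.T @ dR      # squared output change under the circular map
sq_square = dR @ dR                # squared output change for the square kernel
print(C.shape, sq_circle, sq_square)
```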

β-decay regularization scheme
To alleviate unfair competition among operators in DARTS, we introduce the β-decay regularization scheme [14], which improves the robustness and generalization ability of the search and effectively reduces the search memory and search cost of finding the best architecture, as shown in Figure 2.
Starting from the default setting of regularization, consider the one-step update of the architecture parameter α, where ς_α represents the learning rate of the architecture parameters:

α_l^{t+1} = α_l^t − ς_α · ∇_{α_l} L_valid − ς_α · λ · α_l^t.
For the special gradient descent algorithm of DARTS, these regularization gradients need to be normalized by their total magnitude; without normalization, the overall gradient cannot be distributed evenly among the operators.
In the DARTS search process, the architecture parameter set β is used to express the importance of all operators. Explicitly regularizing β standardizes the optimization of the architecture parameters more directly, thereby improving the robustness and architecture universality of DARTS. We use a function χ, with α as the independent variable, to express the total impact of the decay regularization on the one-step update:

β_l^{t+1} = χ_l^{t+1}(α_l^t) · β̃_l^{t+1},

where χ represents the overall influence of the β-decay regularization and F is the mapping function through which α acts. Iterating this single-step update expresses β_l^{t+1} as a weighted combination of the unregularized update β̃_l^{t+1} and the previous parameter values, and it can be seen that the mapping function F determines the impact of α on β. To avoid excessive regularization and optimization difficulties, Softmax is used to normalize α.
From this formulation, we can quantify the overall impact of the β-decay regularization on the search.
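A hedged sketch of how such a decay can be realized in practice, assuming PyTorch: since β = Softmax(α), one common realization (following the β-DARTS formulation) adds a smooth logsumexp penalty on α to the validation loss; λ_t and all sizes below are assumptions.

```python
import torch

def beta_decay_penalty(alpha: torch.Tensor) -> torch.Tensor:
    # alpha: (num_edges, num_ops) architecture parameters of the supernet.
    # logsumexp is a smooth maximum; shrinking it pulls softmax(alpha), i.e.
    # beta, toward a flatter, lower-magnitude distribution (the "decay").
    return torch.logsumexp(alpha, dim=-1).mean()

alpha = torch.randn(14, 12, requires_grad=True)   # 12 candidate ops per edge
val_loss = torch.tensor(0.0)                      # placeholder validation loss
lambda_t = 0.5                                    # assumed decay strength at step t
total = val_loss + lambda_t * beta_decay_penalty(alpha)
total.backward()                                  # the penalty's gradient decays alpha
```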

Confident learning rate strategy
When a NAS method is used to classify hyperspectral datasets, a large number of parameters are generated. When the training samples are limited, network performance may degrade due to over-fitting, and memory utilization during training is low. CLR is used to alleviate these two problems [15].
After applying the Softmax operation, the structure is relaxed to be continuous. The gradient descent algorithm is used to optimize the matrix α = {α^{(m,n)}}, and the original network weights are denoted by w. The cross-entropy losses L_train and L_valid are then computed in the training and validation stages, respectively.
To optimize both at the same time, we fix the values of the α = {α^{(m,n)}} matrix and update w on the training set with gradient descent, then fix the value of w and update α = {α^{(m,n)}} on the validation set with gradient descent, repeating alternately to obtain the best parameter values. The optimization stops after finding the best neural architecture α* that minimizes the validation loss L_valid(w*, α*).
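This alternating procedure can be sketched as follows, assuming PyTorch; w_opt and a_opt are assumed optimizers over the network weights and the architecture parameters, respectively, and this first-order variant is a simplification of the full DARTS bilevel update.

```python
import torch.nn.functional as F

def search_step(model, w_opt, a_opt, train_batch, valid_batch):
    # Step 1: fix alpha, update the network weights w on the training set.
    x, y = train_batch
    w_opt.zero_grad()
    F.cross_entropy(model(x), y).backward()   # L_train(w, alpha)
    w_opt.step()

    # Step 2: fix w, update the architecture parameters alpha on the validation set.
    x, y = valid_batch
    a_opt.zero_grad()
    F.cross_entropy(model(x), y).backward()   # L_valid(w, alpha)
    a_opt.step()
```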
The NAS architecture parameters tend to become over-parameterized as training proceeds. Therefore, the confidence of the gradient obtained from the parameterized DARTS should increase with training time when updating the architecture weights:
ς_clr = (a / A) · τ · ς_α, where a represents the number of epochs trained so far, A represents the preset total number of epochs, and τ is the confidence factor of CLR. Through the confident learning rate update, the network obtains L_valid and uses it for the gradient update.
The confident learning rate is thus applied during the architecture gradient update.
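As a concrete illustration, the following is a minimal sketch of one plausible confident learning rate schedule, assuming the linear ramp in the formula above; the function name and printed values are illustrative only.

```python
def confident_lr(base_lr: float, epoch: int, total_epochs: int, tau: float) -> float:
    """Scale the architecture learning rate by a confidence that grows with progress."""
    confidence = min(1.0, tau * epoch / total_epochs)
    return base_lr * confidence

for e in (0, 50, 100, 200):
    print(e, confident_lr(1e-3, e, total_epochs=200, tau=1.5))
```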

Results
Our experiments are conducted on an Intel Xeon 4208 CPU @ 2.10 GHz and an Nvidia GeForce RTX 2080Ti graphics card. We report the average over 10 runs for the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (K).
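For reference, here is a minimal sketch of how the three reported metrics are computed from a confusion matrix; these are the standard definitions of OA, AA, and Cohen's Kappa, and the toy labels are illustrative.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                     # rows: true, cols: predicted
    oa = np.trace(cm) / cm.sum()                          # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))            # mean per-class accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)                          # agreement beyond chance
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2]); y_pred = np.array([0, 1, 1, 1, 2, 0])
print(classification_metrics(y_true, y_pred, 3))          # (0.667, 0.667, 0.5)
```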

Comparison with state-of-the-art methods
In this section, we select several state-of-the-art methods for comparison to evaluate classification performance: the extended morphological profile combined with a support vector machine (EMP-SVM) [16], the spectral-spatial residual network (SSRN) [17], the residual network (ResNet) [18], the pyramid residual network (PyResNet) [19], the multi-layer perceptron mixer (MLP-Mixer) [20], CNAS [11], and the efficient convolutional neural architecture search (ANAS-CPA-LS) [21]. All experimental results are shown in Tables 1 and 2. Samples are cropped with a 32 × 32 sliding window, and the overlap rate is set to 50%. We randomly select 30 samples as the training set and 20 samples as the validation set. The number of training epochs is set to 200, and the learning rate is set to 0.001 for all three datasets. As shown in Table 1, compared with EMP-SVM, SSRN, ResNet, PyResNet, CNAS, MLP-Mixer, and ANAS-CPA-LS, the OA of our proposed method on the Indian Pines dataset is higher by 16.26%, 4.32%, 3.37%, 2.95%, 2.9%, 1.95%, and 1.33%, respectively. Figures 3 and 4 show the classification maps from a visual perspective; comparing them, we conclude that our algorithm achieves better performance. Compared with CNAS, our method uses a hybrid search space, which effectively enlarges the receptive field acquired per pixel, improves the flexibility of different convolution kernels in processing spectral and spatial information, and achieves higher classification accuracy.

Discussion
The ablation study results are provided in Table 3. When CNAS is combined with the hybrid search space, OA increases by 0.70%, 0.35%, and 0.54%, which shows that the hybrid search space improves the sensitivity of the network to hyperspectral features and slightly improves the classification performance of the model. Compared with CNAS, CK-NAS shows almost no change in search time on the three datasets but achieves better classification accuracy. CK-βNAS-CLR obtains better results with fewer parameters and lower computational complexity.

In this paper, the neural network search framework CK-βNAS-CLR is proposed. First, we introduce a hybrid search space with circular kernel convolution, which not only enhances the robustness of the model and its ability to acquire receptive fields but also yields a better optimization path. Second, we adopt the β-decay regularization scheme, which reduces the discretization discrepancy and the search time. Finally, the confident learning rate strategy is introduced to improve classification accuracy and reduce computational complexity. Experiments were conducted on two HSI datasets, and CK-βNAS-CLR was compared with seven methods; the results show that our method achieves state-of-the-art performance while using fewer computing resources. In future work, we will use an adaptive subset of the data even when training the final architecture, which may lead to faster runtime and a lower regularization term.