Identification of hickory nuts with different oxidation levels by integrating self-supervised and supervised learning

Hickory (Carya cathayensis) nuts are a traditional nut in Asia, valued for nutritional components such as phenols, steroids, amino acids, and minerals, and especially for their high levels of unsaturated fatty acids. However, the edible quality of hickory nuts deteriorates rapidly through oxidative rancidity. Deeper Masked Autoencoders (DEEPMAE), whose structure automatically extracts features that scale from local to global for image classification, has been considered a state-of-the-art computer vision technique for grading tasks. This paper aims to present a novel and accurate method for grading hickory nuts with different oxidation levels. By combining self-supervised and supervised processes, the method can predict the oxidation level of hickory nuts from their images. The proposed DEEPMAE model was built on the Vision Transformer (VIT) architecture, followed by Masked Autoencoders (MAE). The model was trained and tested on image datasets containing four classes, whose differences were mainly caused by varying levels of oxidation over time. The DEEPMAE model achieved an overall classification accuracy of 96.14% on the validation set and 96.42% on the test set. These results suggest that DEEPMAE may be a promising method for grading hickory nuts with different levels of oxidation.


. Introduction
There are more than 20 different varieties of walnut. According to FAO (2019), China produces more than half of the world's walnuts. From 2009 to 2019, China's walnut production increased by 11.3% year-on-year to 2,521,504 tons. Hickory (Carya cathayensis Sarg.) is found mainly in Lin'an District, China, where the mountainous, high-altitude climate allows it to thrive naturally. In Lin'an, hickory plantations cover an area of 40,000 km², with an annual production of 15,000 tons of hickory nuts. The output value of the whole hickory nut industry is about 5 billion yuan.
There are a total of 544 kinds of lipids in mature hickory nuts (Huang et al., 2022). Furthermore, a mature hickory nut kernel contains more than 90% unsaturated fatty acids and 70% oil, which is in the top place in all oil-bearing crops (Kurt, 2018;Narayanankutty et al., 2018;Zhenggang et al., 2021). The oxidation of hickory nuts is an inescapable problem and a major contributor to a decline in the quality of the nuts. It is generally accepted that the process of lipid oxidation of nuts proceeds by way of a free radical mechanism called autoxidation (Kubow, 1992;López-Uriarte et al., 2009).
As hickory nuts oxidize, a series of changes in color, odor, taste, and other properties occur. Most notably, the kernels change from light yellow to yellow-brown or brown, the taste gradually fades, and the nuts give off a strong rancid smell (Jiang et al., 2012). Traditional methods of identifying hickory nuts are mainly manual screening and electronic nose screening (Pang et al., 2011). Manual screening relies mainly on subjective human experience, which limits its accuracy and slows it down. Electronic nose technology can detect the substance content of hickory nuts according to the degree of oxidation and acidity in different storage years (Pang et al., 2019), i.e., hickory nuts with different degrees of oxidation produce different odors. However, electronic nose technology has a slow response time and requires special equipment, making it difficult to promote in the marketplace. Therefore, accurate identification and fine classification of hickory nuts based on color appearance could support factory production and processing and help safeguard consumers' food safety.
In classifying certain agricultural products, shape and color are the two fundamental characteristics. It is common knowledge that the most important distinguishing feature between naturally grown agricultural products is their appearance (Fernández-Vázquez et al., 2011;Rodríguez-Pulido et al., 2021). For instance, walnut varieties are distinguished by their size, roundness, length, and width. These characteristics are the core foundation for classification, and studies of walnuts rely on such morphological properties (Ercisli et al., 2012;Chen et al., 2014;Solak and Altinişik, 2018). Color characteristics on the surfaces of objects are also crucial for classification, and they are primarily derived from RGB and hyperspectral images. For example, color information in RGB images can be turned into a one-dimensional signal (Antonelli et al., 2004) or a matrix of signals, yielding excellent classification results for hazelnuts (Giraudo et al., 2018) and maize.
In addition, hyperspectral imaging technology can achieve a similarly high level of classification accuracy (Alamprese et al., 2021;Bonifazi et al., 2021). There is, however, a significant distinction between RGB and hyperspectral data: RGB data contains less information than hyperspectral data, but it is easier to acquire and far more widely used. Although the studies above have delivered successful results in specific applications, in most of them experts manually extracted or specified the features. The extracted features include both strong and weak ones, and if the strong features of a target are difficult to identify, it is challenging to produce very successful results.
Deep learning (LeCun et al., 2015) is a field of machine learning that has gained tremendous recognition in computer vision over the past decade. Its broad applicability gives it an advantage over the methods above. Deep learning methods are mainly multi-layer artificial neural networks (ANN; like high-dimensional abstract functions) constructed by computers. In ANNs, image features can generate feedback signals that help models adjust their parameters. This adjustment continues until the final ANN model captures the critical features that distinguish images from one another.
Deep learning technology has been used extensively for the classification of agricultural product quality (Javanmardi et al., 2021;Bernardes et al., 2022;Mukasa et al., 2022). A Convolutional Neural Network (CNN) with a shallow depth was set up to classify four classes of tobacco with a 95% accuracy (Li et al., 2021). Nasiri et al. (2019) employed a modified version of VGG16 to identify dates, achieving an accuracy of 96.98%. Various models were created to classify the maturity of agricultural products from different perspectives (Zhang et al., 2018;Garillos-Manliguez and Chiang, 2021). Moreover, Saranya et al. (2022) were able to differentiate between four different maturity levels of bananas with an accuracy of 96.14%. Because of their shallow architecture, the networks used in the aforementioned applications may not possess the necessary generalization capabilities. Chen et al. (2022b) developed a high-performance classification model based on a 152-layer deep ResNet to identify different types of walnuts. Additionally, because deep learning algorithms extract robust advanced features automatically, most studies have not explicitly specified what characteristics those algorithms have learned. In this respect, manual feature extraction is easier to explain, as in grading based on the shape, color, and size of strawberries (Liming and Yanchao, 2010). However, Su et al. (2021) successfully used the ResNet algorithm to assess the ripeness and quality of strawberries and noted that pigment-related information is essential for accurate ripeness recognition. Such explanations provide greater insight into the potential of deep learning algorithms. In addition to CNNs, deep learning based on VIT is developing rapidly for a variety of applications, such as the classification of weeds from drone images (Bi et al., 2022;Reedha et al., 2022).
With the ever-growing number of emerging technologies, applied research in agricultural products is becoming increasingly feasible.
A deep learning algorithm is utilized in this paper to automatically extract the appearance features of hickory nuts, thereby avoiding the shortcomings of traditional methods while achieving more effective results. In addition, deep learning-based classification models are able to process an image in milliseconds (Lu et al., 2022), which is conducive to enhancing the automation of factory production and processing and thus improving the ability to ensure food safety. In this paper, DEEPMAE, a model based on deep self-supervised and supervised learning, is constructed to identify and distinguish between various levels of oxidation and sourness of hickory nut kernels. The primary contributions of this paper are enumerated as follows:

. Materials and methods

. . Samples
The hickory nuts were harvested from the well-growing and ten-year-old hickory trees in Daoshi Town, China (Lin'an, 118°58'11" E, 30°16'50" N, elevation: 120 m) in September 2021. After harvesting, the nuts were transported to the laboratory and dried in an oven at 40 °C for 72 h to maintain their moisture content below 8%.

. . Experimental details and preparation
There are several steps in the experiments of this study, and we will describe the preparation and experimental details.

. . . To control experimental conditions
Hickory nuts are physically protected by their intact woody shell, so the lipids oxidize more slowly than they would without the shell, and the nuts are generally preserved with their shells intact. We also stored the nuts with the shells intact but sought to speed up lipid oxidation to shorten the experiment. Pre-experiments on small samples showed that the oxidation rate of hickory nuts at 35 °C was within the tolerable range for this experiment, so we placed the nuts in a constant temperature and humidity chamber at 35 °C and 35% relative humidity to accelerate the oxidation process. Over time, the lipids within the hickory nut kernels undergo continuous oxidation. We took samples for the experiment every 30 days.

. . . To acquire RGB images of nuts kernels
Samples of 280 hickory nuts per experiment were taken in this study, and the kernels were separated after the shells were broken by hand. RGB images of the kernels were then acquired. The image acquisition system consists of a smartphone, connected to a computer, placed on an experimental stand.
The smartphone is mounted horizontally on the experimental stand while keeping the vertical height constant. We use the computer to control the phone to avoid changes in its angle and position. Two symmetrical 4 W lamps provide fill light. More specifically, the phone was a Xiaomi 6X with LineageOS, the camera software was OpenCamera, the camera parameters were 20 megapixels, the lens aperture was f/1.75, the focal length was 4.07 mm, and the ISO was set to 100.

. . . To measure the physicochemical properties of hickory nuts
Immediately after completing image acquisition, we physically pressed the hickory nut kernels to obtain the nut oil. Then we measured the oil's peroxide value (POV) and acid value (AV). POVs were determined according to the Chinese standard method GB 5009.227-2016. The peroxide test indicates the rancidity of unsaturated oils, and the POV is the most commonly used value. It measures the extent to which the oil sample has undergone primary oxidation. In addition, the AV is one of the most sensitive indicators of nut spoilage. In this study, AV was measured using the method of the Chinese standard GB 5009.229-2016. Approximately 80 mL of oil was extracted in each experiment. Of this, 36 mL was divided into three replicate experiments for POV measurement, and the remaining oil was divided into three replicate experiments for AV measurement.

. . . Summary of preparations
Four samples with different oxidation times were taken in this study, resulting in four sets, A, B, C, and D, containing 1,090 good hickory nuts. In addition, 13,000 RGB images of their kernels were taken, all of which were cropped to 512 × 512 pixels. We then randomly chose 9,000 images as the training set, 2,800 as the validation set, and the remaining 1,200 as the test set.
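The random split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_indices` and the fixed `seed` are assumptions introduced here for repeatability, and only the three set sizes come from the text.

```python
import random


def split_indices(n_total, n_train, n_val, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into train/val/test lists."""
    if n_train + n_val > n_total:
        raise ValueError("train + val exceeds the number of images")
    indices = list(range(n_total))
    # The seed is an assumption; the paper does not report one.
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(13000, 9000, 2800)
print(len(train_idx), len(val_idx), len(test_idx))  # 9000 2800 1200
```

Splitting by index rather than by file keeps the three sets disjoint by construction.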

. An algorithm for aggregating image values
The CIELAB color space is expressed as three values: in human vision, the L-value from low to high indicates perceived brightness from black to white, the a-value from negative to positive represents green to red, and the b-value from negative to positive represents from blue to yellow. To investigate the relationship between the features produced by the deep learning model and the visual properties of hickory nut kernels, we did targeted processing of the kernels' RGB images in the CIELAB color space.
The original image I and the image I_g, an almost smoothed version generated by a fully convolutional network (FCNN; Equation 1), are first transformed from RGB to CIELAB (Figure 1). The CIELAB images are split according to the three values. The corresponding values in the CIELAB color space are combined in an "enhancement" operation, and the CIELAB images are then converted back into RGB images. The entire process is almost identical to EdgeFool (Shamsabadi et al., 2020), except for the "enhancement."
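For readers unfamiliar with the RGB to CIELAB step, the per-pixel conversion can be sketched as below. It assumes 8-bit sRGB input and a D65 reference white, which the paper does not state; the function name `srgb_to_lab` is hypothetical, and in practice an image library would apply the same mapping to whole arrays.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (assumes a D65 white point)."""

    def to_linear(c):
        # Undo the sRGB transfer curve to get linear light in [0, 1].
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        # Cube root with a linear segment near zero, as in the CIELAB definition.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    # Normalise by the D65 reference white before applying f.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    lightness = 116 * fy - 16
    a_value = 500 * (fx - fy)
    b_value = 200 * (fy - fz)
    return lightness, a_value, b_value


white = srgb_to_lab(255, 255, 255)
yellow = srgb_to_lab(255, 255, 0)
blue = srgb_to_lab(0, 0, 255)
```

White maps to an L-value close to 100 with a- and b-values close to 0, while yellow gives a positive b-value and blue a negative one, matching the axis descriptions above.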

FIGURE
The process of aggregating an image.

FIGURE
Datasets. The four columns are the four sets of experimental images of hickory nuts kernels, A, B, C, and D; the "Original" row is the acquired original images, AL* is aggregated images from the L* channel, Ab* is aggregated images from the b* channel, and AL*b* is aggregated images from both the L* and b* channels.

Our enhancement method, corresponding channel enhancement of image, is an aggregation algorithm that moves a set of data closer to a specified value β (Equation 2). In general, β falls within the range of the set. In addition, the L-value, a-value, and b-value can each be assigned a β value separately. There is aggregation of L-values (AL*), aggregation of b-values (Ab*), and co-aggregation of L-values and b-values (AL*b*), but no aggregation of the a-value (Figure 2).
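Equation 2 is not reproduced in this text, so the following is only a sketch of one way a set of channel values could be pulled toward a target β. The linear contraction and the `strength` parameter are assumptions made for illustration and should not be read as the authors' formula.

```python
def aggregate_toward(values, beta, strength=0.5):
    """Move every value part of the way toward beta.

    strength = 0 leaves the values unchanged and strength = 1 collapses
    them all onto beta. This linear rule is a hypothetical stand-in for
    the paper's Equation 2.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return [beta + (1.0 - strength) * (v - beta) for v in values]


# Toy L-values pulled halfway toward beta = 40.
l_values = [31.0, 37.0, 47.0]
aggregated = aggregate_toward(l_values, beta=40.0, strength=0.5)
print(aggregated)  # [35.5, 38.5, 43.5]
```

A contraction of this kind keeps the ordering of the values while narrowing their spread around β, which is the behavior the text describes for the AL* and Ab* images.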

. . Classification methods
Our final work relies on a deep-learning model for classification. Based on existing research, this study proposes a more effective and improved model, and this section describes the detailed construction of our model.

. . . VIT and MAE
The workflow of Vision Transformer (VIT; Dosovitskiy et al., 2020) firstly requires dividing the original image into several regular non-overlapping blocks and spreading the divided blocks into a sequence, after which the sequence is transmitted into the Transformer Encoder. Finally, the output features of the Transformer Encoder are handed over to the fully connected layer for classification.
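The block-splitting step can be sketched on a toy single-channel image. The helper name `patchify` is hypothetical, and a real VIT would follow this with a learned linear projection of each flattened block, which is omitted here.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch
    blocks and return them as a row-major sequence of flattened blocks."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    sequence = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ]
            sequence.append(block)
    return sequence


# A 4 x 4 toy image split into 2 x 2 blocks gives a sequence of 4 tokens.
toy = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(toy, 2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```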
Masked autoencoders (MAE; He et al., 2022) is a self-supervised learning method that infers the original image from local features strongly correlated with global information. MAE's Decoder can reconstruct the same number of features as the original image blocks, thereby reconstructing a complete image from a partial image. When applied to downstream classification tasks, the MAE can split the trained Encoder and Decoder and use only the features extracted by the Encoder for classification. That is similar to the process of a standard VIT for image classification. Compared to VIT, MAE uses only part of the image data for the classification task, which can significantly reduce computational effort. In addition, MAE's Decoder can reconstruct the original image from partial features, which also can represent feature information in the association.
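The idea of feeding only part of the image to the Encoder can be sketched as a random choice of patch indices. The function name, the seed, and the patch count of 196 are illustrative assumptions; only the notion of keeping a random subset comes from the text.

```python
import random


def random_masking(num_patches, mask_ratio, seed=0):
    """Return (kept, masked) patch indices for one image.

    Only the kept indices would be passed to the Encoder; the masked
    ones are what the Decoder has to reconstruct.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)  # the seed is an assumption
    kept = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return kept, masked


# 196 patches is a hypothetical count (a 14 x 14 grid), not the paper's.
kept, masked = random_masking(196, 0.75)
print(len(kept), len(masked))  # 49 147
```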

. . . Re-attention
The MAE is mainly built by stacking the Multi-Head Self-Attention (MHSA; Equation 3) module of the vanilla VIT. However, a Transformer-based structure does not obtain better results simply by being stacked deeper, as a convolutional neural network (CNN) structure does. Instead, it quickly saturates at deeper levels, a phenomenon called attention collapse. Re-attention (Equation 4) can replace the MHSA module in the VIT and regenerate the attention maps to establish cross-head communication in a learnable way.
In Re-attention, a learnable transformation matrix is multiplied by the self-attention map along the head dimension. Re-attention exploits the interactions between the different attention heads to collect complementary information, regenerating the attention maps at a small computational cost while enhancing the diversity of features between layers. It stands to reason that the DeepVIT model, which is built on the Re-attention mechanism, also achieves excellent performance on classification tasks.
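Equations 3 and 4 are not reproduced in this text. As a hedged reference, the forms usually given for scaled dot-product self-attention and for Re-attention in the DeepVIT work are shown below; the symbols Q, K, V (queries, keys, values), d (per-head dimension), H (number of heads), and Θ (the learnable H × H matrix that mixes heads) are assumed notation and may differ from the authors' own.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \\
\text{Re-Attention}(Q, K, V) &= \mathrm{Norm}\!\left(\Theta^{\top}\,
  \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right)\right) V,
  \qquad \Theta \in \mathbb{R}^{H \times H}
\end{align}
```

The only difference between the two lines is the head-mixing matrix and the normalization applied to the attention maps before they weight V.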

. . DEEPMAE
This paper proposes the DEEPMAE model with MAE and DeepVIT as the backbone (Figure 3). Firstly, unlike VIT, MAE, and DeepVIT, the block sequence input to DEEPMAE does not come from the original image but is composed of low-level features extracted from the original image by convolutional operations. Secondly, we introduce Re-attention into MAE, reduce the MAE model width, and increase its depth to achieve a deeper stacking of the Transformer and obtain a stronger representation of some of the blocks, which reduces the computational effort while avoiding attention collapse. In addition, unlike MAE, which uses only the trained parameters of the Encoder when processing classification tasks, DEEPMAE always retains both Encoder and Decoder and combines the reconstruction of image features and classification into one complete model. The reconstruction is a self-supervised process: it compares the output features of the Decoder with the original features and tries to make them as similar as possible. The classification is a supervised process. The complete structure of DEEPMAE therefore contains both self-supervised and supervised processes.

FIGURE
The architecture of DEEPMAE.
The block sequences for MAE, VIT, and DeepVIT are derived from the original images. This approach starts by slicing an original image horizontally and vertically and spreading the sliced blocks sequentially into a patch-embedding block sequence. By default, a patch (also called a block) is 16 × 16 pixels, implemented by a convolution with a kernel size and stride of 16. That results in many convolutional parameters and a high degree of randomness. The slicing process also produces large random matrices, which affects the stability of the patch embedding and, in turn, of the Transformer (Xiao et al., 2021). Earlier, VGG (Simonyan and Zisserman, 2014) compared the receptive fields of small and large CNN kernels and found that several successive layers of small kernels and a single layer with a large kernel were similar. VGG therefore replaced large convolutional kernels with stacks of 3 × 3 convolutions, and 3 × 3 kernels have dominated CNNs since (Simonyan and Zisserman, 2014;Iandola et al., 2016;Howard et al., 2019;Tan and Le QV, 2020).
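The comparison between stacked small kernels and one large kernel can be checked with a short calculation. The helper names below are hypothetical, and the channel count of 64 is an arbitrary example.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # stride spreads later layers apart
    return field


def conv_weights(kernel, channels_in, channels_out):
    """Weight count of one square convolution, ignoring biases."""
    return kernel * kernel * channels_in * channels_out


print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(16, 16)]))                # 16

channels = 64  # arbitrary example width
stacked = 3 * conv_weights(3, channels, channels)
single = conv_weights(7, channels, channels)
print(stacked, single)  # 110592 200704
```

Three stacked 3 × 3 layers see the same 7 × 7 window as a single 7 × 7 layer while using 27 rather than 49 weights per channel pair, which is the trade-off that motivates small-kernel stems.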
In addition to stability, the Transformer model has properties for global attention computation. However, it lacks some inductive biases inherent to CNNs, such as translation equivariance and locality (Han et al., 2020). The Transformer model, therefore, lacks some local features from earlier layers compared to CNNs. We thus change the patch embedding of DEEPMAE to an operation with multiple small convolutional kernels and convert the low-level features of the acquired images into patches, similar to the Image-to-Tokens module (Yuan et al., 2021). In MAE, the input to the Encoder is a subset of patches, and DEEPMAE does the same, using only a subset of patches composed of low-level image features as input to the Encoder. Finally, because images have inherently strong positional relationships, DEEPMAE uses a two-dimensional fixed sine-cosine encoding for the positions of the flattened patches.
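A two-dimensional fixed sine-cosine encoding can be sketched as follows. The layout (half of the channels for the row index, half for the column index, sines before cosines) and the base of 10000 follow a common convention and are assumptions here; the paper's exact layout is not given in this text.

```python
import math


def sincos_1d(dim, position):
    """Fixed 1D encoding: dim/2 sines followed by dim/2 cosines."""
    if dim % 2:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    sines = [math.sin(position * f) for f in freqs]
    cosines = [math.cos(position * f) for f in freqs]
    return sines + cosines


def sincos_2d(dim, grid_h, grid_w):
    """One dim-sized vector per grid cell, in row-major order.

    The first half of each vector encodes the row index and the second
    half encodes the column index.
    """
    if dim % 4:
        raise ValueError("dim must be divisible by 4")
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            table.append(sincos_1d(dim // 2, row) + sincos_1d(dim // 2, col))
    return table


pos = sincos_2d(8, 2, 3)
print(len(pos), len(pos[0]))  # 6 8
```

Because the table depends only on the grid size and the embedding width, it can be computed once and added to the patch embeddings without any learned parameters.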
DEEPMAE as a whole also consists mainly of two parts, an Encoder and a Decoder, with a classifier added after the Decoder to complete the model. The Encoder is composed of Transformer blocks built on Re-attention (RTB), and the Decoder consists of self-attention Transformer blocks (STB). The Encoder and Decoder are asymmetrical in both width and depth. In addition, the classifier does not use all the information from the Decoder's output; it relies only on some of the features reconstructed by the Decoder to make its classification decisions.
(Equation 10) (Labatut and Cherifi, 2012;Giraudo et al., 2018;Alamprese et al., 2021;Chen et al., 2022a;Saranya et al., 2022) in this paper.

. . Performance evaluation
In addition, the reconstruction of image features by Decoder is a critical component of DEEPMAE. We use the Multi-scale Structural Similarity Index (MS-SSIM; Wang et al., 2003), the Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS; Wald, 2000), and Visual Information Fidelity (VIF; Sheikh and Bovik, 2004) to measure the goodness of the reconstructed features. MS-SSIM is a multi-scale structural similarity method that considers the variation in observation conditions and provides a reliable approximation of perceived image quality. VIF is an image information metric that quantifies the fidelity of image information.
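Of the three metrics, ERGAS has the simplest closed form: 100 times the resolution ratio times the square root of the band-averaged squared ratio of RMSE to band mean. A minimal sketch is given below; the function name, the flat-list band layout, and the default ratio of 1 (same-resolution images) are assumptions made for illustration.

```python
import math


def ergas(reference, reconstructed, ratio=1.0):
    """ERGAS between two images given as lists of bands.

    Each band is a flat list of pixel values. Lower is better, and
    identical images score 0. ratio is the resolution ratio between
    the two images; 1.0 assumes both have the same resolution.
    Bands whose reference mean is zero are not handled here.
    """
    total = 0.0
    for ref_band, rec_band in zip(reference, reconstructed):
        count = len(ref_band)
        mse = sum((a - b) ** 2 for a, b in zip(ref_band, rec_band)) / count
        mean = sum(ref_band) / count
        total += mse / (mean ** 2)  # squared RMSE relative to the band mean
    return 100.0 * ratio * math.sqrt(total / len(reference))


ref = [[10.0, 10.0, 10.0, 10.0]]
rec = [[11.0, 9.0, 11.0, 9.0]]
score = ergas(ref, rec)  # RMSE 1 on a band mean of 10, so about 10
```

MS-SSIM and VIF involve multi-scale filtering and are better taken from an established image-quality library than re-implemented by hand.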

. . Lipid oxidation analysis for four samples
The quality of the oil extracted from hickory nuts was used to assess the physiological quality of the samples. The samples showed different POVs and AVs after different times of oxidation (Figure 4).
POV is an indicator of the quality of oils and fats (Beyhan et al., 2017). At 35 °C and 35% relative humidity, the POVs measured in the four samples, A, B, C, and D, increased gradually with storage time. Samples A and B showed a slow increase in POVs, while sample C exhibited a faster increase. Across the four samples, the POVs consistently increased, demonstrating that the hickory nut oil was undergoing continuous oxidation.
The AV reflects the degree of fat hydrolysis and rancidity by indicating the concentration of free fatty acids in the oil (Chatrabnous et al., 2018). The results of the four samples, which differ in the storage time of the hickory nuts, showed a significant upward trend. In samples A and B, the AVs accumulated more rapidly, while in the later samples they accumulated more slowly. Eventually, the AV of sample D exceeded 0.6 mg/g, double the value of sample A. The increase in AVs during the storage of hickory nuts is due to the enzymatic hydrolysis of lipids, which can adversely affect the hickory nuts.
The POVs and AVs of the hickory nut oils in the four samples suggest that the degree of oxidative deterioration of the samples was increasing in a sequential manner. This provides an objective basis for further distinguishing between samples with different levels of oxidative degradation.

. . Differences of kernels' images for four samples
Frontiers in Sustainable Food Systems frontiersin.org
The data distribution was analyzed after the RGB images were converted to CIELAB images. More importantly, this paper analyzed the relationship between changes in the exterior of hickory nut kernels and their internal lipid oxidation and rancidity. Ortiz et al. (2019) used the L-value as a measure of the browning of walnut kernels' exterior and analyzed the correlation between changes in the exterior of walnut kernels and the rancidity and oxidation process. The distributions of L-values and b-values of the kernels from the four samples clearly differ (Figure 5). L-values are concentrated around 47 in sample A, and around 37, 38, and 31 in samples B, C, and D, respectively. Looking first at the distribution of L-values (Figure 5A), there is much overlap between samples B and C, and the mean brightness of C is even slightly higher than that of B. However, the L-values of the four samples show an overall trend of gradual decrease. The b-values of the four samples are concentrated around 40, 24, 20, and 18, respectively. Taking sample A as the benchmark, the distributions of the L-value and b-value show that the changes in brightness and chromaticity of the kernels' appearance are uneven, larger at first and then smaller. For the later samples, the human eye's ability to differentiate is significantly weakened, so it may not even be possible to distinguish the appearance of the kernels directly with the naked eye. This unevenness of variation is explained by Yang et al. (2022): the leading causes of pecan browning are membrane peroxidation and enzymatic browning catalyzed by polyphenol oxidase.
Throughout the post-harvest storage period, hickory nuts maintained their antioxidant capacity, and the rate of browning was fastest in the early stages of storage, after which the rate of browning changed gradually and gently.
The results above indicate that there was some extent of correlation between the changes in the intrinsic oxidative rancidity of the hickory nuts and the changes in the appearance of kernels. For the same batch of hickory nuts, as the oxidation of their internal oils proceeded, the intrinsic quality of nuts would change, manifested in kernels' appearance as a decrease in L-value and a deviation from yellow in b-value. That also effectively supported the subsequent differentiation of different oxidized and acidified kernels by image features.

. . Classification results
Based on the above analysis, we also need to classify the images of hickory nut kernels to infer the internal quality from the kernels' appearance.

. . . General configuration
In this paper, the main optimization points of DEEPMAE based on its backbone model were previously mentioned. Ablation experiments are then conducted in order to evaluate the efficacy of the model at the three points specified.

1. A sequence consisting of blocks of low-level features extracted by a convolution operation replaces the original 16 × 16-pixel image block sequence in the backbone.
2. The most critical point in MAE is using partial images to extract features, reducing the application's computational effort. DEEPMAE also retains this feature, but because the low-level features of the images are not as redundant as the original images, DEEPMAE will have a different input scale for the Encoder than MAE, and we compare three mask ratios.
3. DEEPMAE incorporates both self-supervised and supervised learning and has an Encoder and Decoder. The Decoder, a self-supervised operation, could reconstruct the image features. That is very different from the inference process in MAE, so we want to verify the role of the Decoder in the classification process.
After establishing the core structure of DEEPMAE, several CNN models were introduced and compared with the Transformer models and the DEEPMAE model, and their classification performance was evaluated. The CNNs are AlexNet (Krizhevsky et al., 2017), VGG19 (Simonyan and Zisserman, 2014), SqueezeNet (Iandola et al., 2016), MobileNetV3 (Howard et al., 2019), and EfficientNet (Tan and Le QV, 2020), and the Transformer models are the backbones of DEEPMAE, mainly VIT (Dosovitskiy et al., 2020) and MAE. The CNNs are all implemented through the official torchvision interface of PyTorch. In addition, the learning rate, optimizer, data augmentation, and other controllable hyperparameters are kept consistent across models. Training is done in the same environment for each model (Table 1).

. . . DEEPMAE: Low-level features and RGB images data
Many researchers are combining convolution blocks and transformer blocks (Liu et al., 2022), not least with changes to the input data. Due to the redundancy of the RGB image, MAE uses the original image blocks as input, but DEEPMAE extracts the low-level features of the image as input. Therefore, this paper compares the patch embedding composed of the original RGB images with the patch embedding composed of low-level features. The low-level features are also much smaller than the original RGB image, which is a characteristic of the convolution operation. In comparison, the MHSA used by the Encoder and Decoder in DEEPMAE does not have to shrink the feature map, and the patterns of the layers are similar, making DEEPMAE easily scalable. Four practical structures based on DEEPMAE are then constructed for comparison (Table 2).
The number of parameters and classification accuracy of the two types of patches embedding from four different sizes of DEEPMAEs were compared in Table 3. The accuracy improvement was 1.14-1.17% on the validation set and 1.67-2.67% on the test set. For classification, the improvement of low-level features is significant, showing that the Transformer model is very effective after adding the low-level features extracted by convolutional operations.

. . . DEEPMAE: Mask ratios of input patches
As mentioned, the original MAE masks a certain percentage of the input patches, which reduces the number of operations and improves the model's inference time. DEEPMAE also absorbs this advantage. However, DEEPMAE's inputs are low-level features with less redundancy than the original images. In addition, DEEPMAE combines the whole process of classification and MAE-like pre-training, so it needs to attend to both the unmasked and the masked parts of the image. Therefore, the mask ratio of DEEPMAE will be different from that of MAE, and we carried out further comparison experiments.
The MAE default is 75% masking, i.e., Mask ratio = 0.75. Based on this, we compared mask ratios of 0.25, 0.5, and 0.75 on the DEEPMAE model. In addition, it can also be seen that DEEPMAE still shows an increasing trend (Figure 6B), so the number of training epochs in this section is set to an upper limit of 300.
The Mask ratio determines the number of features visible to the model, with a larger Mask ratio giving the model fewer features to learn. As the Mask ratio increases (Figure 7), the overall loss increases as well. Looking at the loss of the Decoder's reconstructed feature maps, the loss reached at approximately the 100th epoch for Mask ratio = 0.5 is equivalent to that reached after all 300 epochs for Mask ratio = 0.75, i.e., the training time for Mask ratio = 0.5 is only one-third of that for Mask ratio = 0.75, while that for Mask ratio = 0.25 is only one-third of that for 0.5. In classification loss, the loss for a larger mask ratio is significantly higher than for a smaller one. Therefore, a smaller Mask ratio can release more features for DEEPMAE training and achieve better results. Incidentally, our experiments achieved 97% accuracy at about the 240th epoch by deepening the Encoder depth to 32 while using a Mask ratio of 0.25. However, the smaller the Mask ratio, the more hardware and computational resources are required. Although using a smaller Mask ratio, deepening the network, and extending the training time can further improve accuracy, the additional computational resources required outweigh these accuracy improvements. Therefore, to balance the model's performance and cost, a moderate Mask ratio facilitates the implementation of the model. Furthermore, the masking operation has a considerable impact on CNNs. The default Mask ratio for the experiments in this paper is 0.5 unless otherwise stated.
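The link between Mask ratio and computational cost can be made concrete with a small calculation. It relies on the general property that the pairwise part of self-attention grows with the square of the number of tokens; the patch count of 1024 is a hypothetical example and not a figure from the paper.

```python
def visible_tokens(num_patches, mask_ratio):
    """Number of patches the Encoder actually sees."""
    return int(num_patches * (1 - mask_ratio))


def relative_attention_cost(mask_ratio):
    """Pairwise attention cost relative to no masking.

    Self-attention compares every visible token with every other one,
    so this part of the cost scales with the token count squared.
    """
    return (1 - mask_ratio) ** 2


for ratio in (0.25, 0.5, 0.75):
    print(ratio, visible_tokens(1024, ratio), relative_attention_cost(ratio))
# 0.25 768 0.5625
# 0.5 512 0.25
# 0.75 256 0.0625
```

Under these assumptions, moving from a Mask ratio of 0.75 to 0.25 triples the visible tokens and raises the pairwise attention cost ninefold, which illustrates why the smaller ratio demands noticeably more hardware.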

. . . DEEPMAE: Decoder for classification
Our DEEPMAE combines the self-supervised approach of image reconstruction used by MAE with the supervised process of classification. However, unlike MAE, which only employs pre-trained Encoder parameters for classification, DEEPMAE also uses Decoder parameters in the classification process to reconstruct some of the features for better classification. Therefore, DEEPMAE's image reconstruction is very closely related to classification. To explore the role of the Decoder in reconstructing images, we again use the four DEEPMAEs of different sizes in Table 2 for comparison. The performance of the four DEEPMAEs (Table 4) shows that the results of "classification and feature reconstruction" are higher than those of "classification only," which indicates that the image feature reconstruction of the Decoder is also a key factor in DEEPMAE.
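The idea of training on "classification and feature reconstruction" together can be sketched as a single objective that adds a reconstruction term to the classification term. The sketch below is illustrative and rests on assumptions not stated in this section: cross-entropy is used for classification, mean squared error over the masked patches (as in MAE) is used for reconstruction, and the weight `recon_weight` is a hypothetical parameter.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw class scores (log-sum-exp form)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked patch positions only."""
    errors = [
        (p - t) ** 2
        for i in masked
        for p, t in zip(pred[i], target[i])
    ]
    return sum(errors) / len(errors)


def joint_loss(logits, label, pred, target, masked, recon_weight=1.0):
    """Classification loss plus a weighted reconstruction loss (assumed form)."""
    return cross_entropy(logits, label) + recon_weight * masked_mse(pred, target, masked)


# Toy values: four class scores (A-D) and three patches with two features each.
logits = [2.0, 0.5, 0.1, -1.0]
target_patches = [[0.2, 0.4], [0.6, 0.8], [0.1, 0.3]]
pred_patches = [[0.2, 0.4], [0.5, 0.9], [0.0, 0.3]]
print(joint_loss(logits, 0, pred_patches, target_patches, masked=[1, 2]))
```

Setting `recon_weight` to 0 reduces this objective to the "classification only" case.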
In addition to the performance on the test set, this paper also measured the quality of the image features reconstructed by the Decoder. From a human visual point of view, the reconstructed feature images differ significantly from the original and appear difficult to understand (Figure 8). Therefore, the quality of the reconstructed feature images is measured using the MS-SSIM, ERGAS, and VIF metrics, and the images are compared on that basis. Comparing the three metrics (Table 5), it is clear that the image features constructed by "classification and feature reconstruction" outperform the "classification only" image features, which is an advantage of the Decoder. That means that considering both "classification" and "image reconstruction" improves the classification results while also preserving the quality of the image reconstruction. With classification only, the classification results are slightly lower, and the quality of the final image features extracted by the Decoder is negatively affected.
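Of the three metrics, ERGAS has the simplest closed form, so it serves as a worked example here. The sketch below follows the commonly cited definition (100 times a resolution ratio times the root of the band-averaged squared relative RMSE); the image layout as a list of flat bands and the default `ratio=1.0` for same-resolution images are assumptions of this example, and it is not the code used to produce Table 5. Lower ERGAS indicates a closer reconstruction, whereas higher MS-SSIM and VIF indicate better quality.

```python
import math


def ergas(reference, reconstructed, ratio=1.0):
    """ERGAS between two images given as lists of bands (flat pixel lists).

    Assumes every reference band has a non-zero mean.
    """
    terms = []
    for ref_band, rec_band in zip(reference, reconstructed):
        mse = sum((r - p) ** 2 for r, p in zip(ref_band, rec_band)) / len(ref_band)
        mean_ref = sum(ref_band) / len(ref_band)
        terms.append(mse / mean_ref ** 2)  # squared relative RMSE of this band
    return 100.0 * ratio * math.sqrt(sum(terms) / len(terms))


# Toy single-band example: every pixel is off by 1 around a mean of 10.
reference = [[10.0, 10.0, 10.0, 10.0]]
reconstructed = [[11.0, 9.0, 11.0, 9.0]]
print(ergas(reference, reconstructed))
```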
Figure 8. Examples of reconstructed feature images. Image data must be transformed into tensors before being input into a model in PyTorch, and the transformed images shown are rendered from these tensors.

. . . Comparing DEEPMAE with popular models
From the accuracy of each model on the validation set (Figure 6), it is easy to see that the MobileNetV3 and VGG19 models performed only moderately. They were slow to optimize, and their final accuracy was just over 80%. The remaining models, such as AlexNet, SqueezeNet, and EfficientNet, achieved high and stable recognition accuracy and benefited from the fast convergence of the convolution operation.
The VIT and MAE models, which are representatives of the Transformer family, performed steadily, with VIT reaching a maximum accuracy of 94.04% at the 95th epoch and MAE a maximum accuracy of 94.36%, which is close to the accuracy of CNN models such as EfficientNet. In addition, the Transformer models have high accuracy from the beginning and gradually become more accurate afterwards. That is because the Transformer models start from initialized parameters, whereas the CNN models start from random parameters. Initialization of the Transformer models was necessary, but this did not affect the comparison with the CNN models. The DEEPMAE model outperformed the above models, reaching a maximum accuracy of 96.14% at the 89th epoch, which was significantly higher than the other models.
Regarding the curves (Figure 6B), DEEPMAE shows relatively large oscillations in the first 60 epochs and only slight oscillations afterwards. The curves are still rising, and the model's performance does not reach a plateau within 100 epochs. On the validation set, DEEPMAE outperforms the common ANNs and is not inferior to the CNN models in classification. In addition, DEEPMAE is a set of networks that can be easily extended and fine-tuned in terms of both depth and width. Moreover, due to the globally associative nature of MHSA, the connections between the layers are more adjustable than those of CNNs.

. . . Compare DEEPMAE with the backbones of DEEPMAE
The original MAE in our experiments consists of an Encoder and a Decoder, both of which are purely stacked STB blocks. The Encoder and Decoder are pre-trained for 300 epochs, then the trained Encoder parameters are loaded and trained for classification. Because there is no generic hickory nuts dataset at the scale of ImageNet, we use the same dataset for the pre-training and classification processes, an approach called self pre-training by Zhou et al. (2022). In the classification process, MAE therefore starts from weights obtained through extended pre-training rather than from parameter initialization (Glorot and Bengio, 2010; He et al., 2015). As a result, MAE achieves an initial accuracy of over 90% on the validation set, which is far ahead of other models. However, MAE with the self pre-training approach does not improve significantly during the classification task, meaning that the MAE model still relies heavily on the pre-training image reconstruction process to update model parameters. Although the comparison in Figure 6 is "unfair," pre-training based on image reconstruction is a robust functionality of MAE, so the DEEPMAE model also retains the Decoder to reconstruct images.
The MAE has precisely the same number of parameters as the VIT with the same structure during classification training. However, because MAE randomly masks a certain proportion of the input patches, the original MAE's Encoder input accounts for only a quarter of the initial data volume. As a result, MAE is faster than VIT, and it is also more accurate. In addition, compared with the original image, the low-level features that DEEPMAE feeds to the Encoder carry more feature information and less redundancy. Hence, DEEPMAE sets a lower masking ratio than MAE, namely 50%.
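The relationship between masking ratio and Encoder workload can be checked with simple arithmetic. The sketch below is a rough estimate under stated assumptions: the patch count of 196 is hypothetical, the class token is ignored, and only the pairwise self-attention scores are assumed to scale with the square of the token count (the linear layers scale linearly).

```python
def encoder_cost(mask_ratio, num_patches=196):
    """Estimate the Encoder input size relative to an unmasked input."""
    visible = num_patches - int(num_patches * mask_ratio)
    fraction = visible / num_patches
    return {
        "visible_tokens": visible,
        "token_fraction": fraction,            # share of the input the Encoder sees
        "attention_fraction": fraction ** 2,   # share of pairwise attention scores
    }


for ratio in (0.75, 0.5):
    print(ratio, encoder_cost(ratio))
```

With these assumptions, a masking ratio of 0.75 leaves a quarter of the tokens, as described above for MAE, while the 0.5 ratio used by DEEPMAE leaves half of them.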
The confusion matrices of MAE (Figure 9A) and VIT (Figure 9B) on the test set show that both distinguish A images nearly completely. However, the MAE model misclassifies B images as A more often. Misclassification between B, C, and D is also inevitable with MAE and VIT. However, MAE is better at distinguishing D images. Correspondingly, VIT misidentified images from C and D more than MAE. The main reason for these significant discrimination errors may be the slight differences in the data itself. In addition, there are many similarities in the brightness and color of the hickory nuts kernels images from adjacent experiments. Furthermore, the individual difference in kernels also unavoidably influences the results. That results in some flaws in the image data, so the differences between classes are not absolute and complete, which is understandable in agriculture.
According to DEEPMAE's confusion matrix on the test set (Figure 9C), A images were correctly classified. DEEPMAE also had the lowest level of misclassification of the three models. Most noticeable was the significant improvement in the discrimination of C and D images, where DEEPMAE is the most adept of the three models. From the results, DEEPMAE is as good as MAE at identifying D, as good as VIT at identifying B, and slightly better than both at identifying A and C. Compared to the backbone models, DEEPMAE learns more critical distinguishing features.
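The per-class reading of a confusion matrix described above can be sketched as follows. The labels in this example are made-up toy data chosen for illustration and are not the results shown in Figure 9; the function names are likewise hypothetical.

```python
CLASSES = ["A", "B", "C", "D"]


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_metrics(matrix, classes=CLASSES):
    """Return {class: (precision, recall)} computed from a confusion matrix."""
    metrics = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[name] = (precision, recall)
    return metrics


# Toy labels only: one B image is predicted as A, and C and D are confused once each.
y_true = list("AAABBBCCCDDD")
y_pred = list("AAAABBCCDDDC")
matrix = confusion_matrix(y_true, y_pred)
for name, (precision, recall) in per_class_metrics(matrix).items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Off-diagonal entries between adjacent classes, such as C and D in this toy example, correspond to the kind of confusion between neighboring oxidation levels discussed above.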
The specific results of MAE, VIT and DEEPMAE on the test set were compared quantitatively to objectively evaluate their performance without bias (

. . What features learned by DEEPMAE
Due to the "black box" problem of the deep learning model, this paper examines whether the features extracted by our model match the changes in the image appearance. We introduce an algorithm for aggregating images. According to this algorithm, this paper performs the corresponding aggregation operations on the L-value and b-value of the original images to demonstrate that these two values are the key factors that affect the model's differentiation of the kernels' images.
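For readers who wish to inspect these two values directly, the sketch below computes them for RGB pixels. It assumes that the L-value and b-value are the L* and b* coordinates of the CIELAB color space and that the images are 8-bit sRGB with a D65 white point; these are assumptions of this example, and the aggregation algorithm itself is not reproduced here.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (L*, a*, b*), D65 white point."""

    def to_linear(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def f(t):
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)
    # Linear sRGB to CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 reference white
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def mean_l_and_b(pixels):
    """Average L* (brightness) and b* (blue-yellow axis) over RGB pixels."""
    labs = [srgb_to_lab(*p) for p in pixels]
    n = len(labs)
    return sum(lab[0] for lab in labs) / n, sum(lab[2] for lab in labs) / n


# Hypothetical pixel values for a lighter and a darker, browner kernel surface.
print(mean_l_and_b([(222, 196, 150), (214, 188, 140)]))
print(mean_l_and_b([(150, 110, 70), (140, 100, 62)]))
```

A darker kernel surface yields a lower mean L-value, which is the kind of brightness difference examined in this section.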
The β of our aggregation function is specified separately for each experiment for L-value and b-value, e.g., the β for A images with L-value is 47, and the rest of the experiments have β corresponding to Figure 5. The chromaticity change of enhanced images is represented in the same way as in Figure 5.
After image enhancement, the L-values of the four experiments become more aggregated and distinguishable (Figures 10A, C). In addition, the L-values of the enhanced B and C images are slightly more discernible than those of the original images. Also, the b-values of the enhanced images are more aggregated (Figures 10B, D). Compared to the statistical distribution in Figure 5, the images processed by the aggregation algorithm differ significantly from the originals because brightness and color are more clearly differentiated.
We trained DEEPMAE on the original dataset and tested it on the aggregated datasets AL, Ab, and ALb. Despite the discrepancies between the original and aggregated datasets, DEEPMAE still registers some effectiveness on the test datasets. The correlation between the distribution of L-values and b-values in Figure 5 and the classification results in the confusion matrix is apparent; for instance, the overlapping areas of the distributions led to poorer performance on the AL, Ab, and ALb datasets. The range of L-values of D in AL is much smaller than in Figure 5A, resulting in images of D being largely misclassified as the adjacent C. The ranges of b-values of B, C, and D are closely linked, which is consistent with C in Figure 11 being misclassified as B and D. After adjusting the L-value or b-value of images, the results of DEEPMAE demonstrated a strong relationship between the data distribution and the classification effect, indicating that the L-value and b-value characteristics are of great importance for the classification process of DEEPMAE. These values, which describe the brightness and color of the kernels' appearance, appear to be the main features learned by DEEPMAE to distinguish the nuts. The heat map of the features learned by DEEPMAE also confirms this conclusion (Figure 12).

. Conclusions
This study explores the link between changes in the physiological quality and appearance of hickory nuts kernels. It uses hickory nuts oxidation as the starting point and verifies through literature and experiments that oxidative changes in hickory nuts during storage cause changes in the brightness and color of the kernels. The aim of this paper is to use deep learning model optimization to distinguish nuts with different levels of oxidation and rancidity. The DEEPMAE model, a lighter deep learning model based on MAE, is designed to learn more key distinguishing features to help differentiate between varying levels of oxidation in hickory nuts. In particular, the antioxidant capacity of the nuts resulted in a slight change in the rate of browning during storage. Our DEEPMAE could distinguish hickory nuts based on the essential characteristics learned.
The results indicate that DEEPMAE achieves 96.14% accuracy on the validation set within the first 100 epochs of training and still tends to increase after that. With a deeper DEEPMAE and more feature learning, it can exceed 97% accuracy on both the validation and test sets at the 240th epoch. In addition, by aggregating information from image samples, we have confirmed that the critical features learned by DEEPMAE are precisely the brightness and color of the appearance of kernels. That is the same conclusion we obtained from our physiological experiments on hickory nuts. Additionally, this paper carries out ablation experiments to confirm the effectiveness of its three main improvements. Furthermore, we illustrate some differences in the topology of DEEPMAE and CNNs. In comparison, DEEPMAE shows greater flexibility, effectiveness, and scalability than CNNs.
This study provides an accurate and valid method for distinguishing the degree of oxidative rancidity in hickory nuts. In the future, we will focus our research on the applicability of the method, longer-term hickory nuts oxidation processes, and reflections on other physiological manifestations of hickory nuts.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions
HK, DD, and JZ planned and designed the experiment. HK conducted the experiment. HK, DD, JZ, ZL, SC, and LD analyzed the data and drafted the manuscript with input from DD and JZ. All authors contributed to the article and approved the submitted version.