
ORIGINAL RESEARCH article

Front. Plant Sci., 13 February 2026

Sec. Sustainable and Intelligent Phytoprotection

Volume 17 - 2026 | https://doi.org/10.3389/fpls.2026.1730047

This article is part of the Research Topic: Integrating Visual Sensing and Machine Learning for Advancements in Plant Phenotyping and Precision Agriculture.

Fine-grained few-shot class-incremental identification of medicinal plants via frequency-aware contrastive learning

Chaoqun Tan1, Zhonghan Qin2, Zihan Tang3, Yongliang Huang4 and Ke Li2*
  • 1School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, China
  • 2National Key Laboratory of Fundamental Science on Synthetic Vision, School of Computer Science, Sichuan University, Chengdu, China
  • 3School of Economics, Southwestern University of Finance and Economics, Chengdu, China
  • 4Department of Pharmacy, Hospital of Chengdu University of Traditional Chinese Medicine, Chengdu, China

Developing robust algorithmic tools for accurately identifying diverse medicinal plant species is critical for advancing precision medicine. Although deep learning methods have shown considerable promise, they generally require large-scale annotated datasets, which are often difficult to acquire given the vast taxonomic diversity and limited labeled samples available for many plant species. To address this, we propose a novel Frequency-Aware Guided Domain Enhancement Contrastive Learning (FGDE) framework, designed to incrementally learn new categories from few annotated examples while alleviating catastrophic forgetting and overfitting. Our approach integrates high- and low-frequency components to refine feature representations, using multi-frequency fusion to preserve detail-enhanced information. Contrastive learning is further employed to strengthen multi-semantic aggregation and extract discriminative features across both visual and label domains. Additionally, we introduce a multi-objective loss function to enhance semantic compactness within base classes and improve separation among incremental classes. Extensive experiments demonstrate that FGDE significantly outperforms state-of-the-art methods on our collected dataset and two public benchmarks. These results underscore the potential of our model to support practical applications in intelligent plant identification and precision agriculture.

1 Introduction

Medicinal plants, renowned for their therapeutic properties and historical significance, play a pivotal role in the clinical practice of traditional medicine (Sun et al., 2022; Zang et al., 2025). Consequently, they have garnered significant attention from both traditional healers and modern medical practitioners (Wang et al., 2020; Armijos et al., 2022; Chen et al., 2025). However, confusion between varieties has repeatedly been reported to compromise quality and commercial value, raising increasing public concern (Xiao et al., 2022; Zhang et al., 2022; Vani et al., 2025). Therefore, accurate authentication of medicinal plant species is critical for practical application. Conventionally, detecting active ingredients such as organic acids and flavonoids serves as the gold standard for identifying medicinal plant varieties (Wu et al., 2025). While these laboratory-based methods offer high precision, they are often time-consuming, costly, and reliant on specialized equipment (Fitzgerald et al., 2020; Xiao et al., 2025). Alternatively, intelligent sensory technologies combined with chemometric methods have gained traction, yet they remain constrained by specific instrumentation requirements.

With recent advancements in Deep Learning (DL), computer vision has emerged as a promising, non-destructive, and rapid solution for plant identification, demonstrating remarkable success in medical image classification (Pandey and Jain, 2022; Huang et al., 2025; Wang et al., 2025). The efficacy of DL-based approaches in automating taxonomy is widely acknowledged (Attri et al., 2023). However, these data-driven models typically rely on large-scale annotated datasets to learn robust feature representations (Wang et al., 2021). In the context of medicinal plants, the sheer diversity of species renders the construction of comprehensive, large-scale annotated datasets impractical. Furthermore, acquiring images across a broad spectrum of varieties presents significant challenges due to the inherent difficulties in sample collection (LeCun et al., 2015; Tan et al., 2024). Consequently, how can we design a model capable of effectively learning feature representations from limited annotated data? Developing a system that can rapidly adapt to new concepts using only a few annotated samples would be highly beneficial for the advancement of the field.

Few-Shot Learning (FSL) (Fei-Fei et al., 2006; Gao et al., 2023) aims to enable image classification models to adapt to new tasks using scarce annotated samples. These frameworks typically involve a training phase for model adaptability and an adaptation phase for new tasks (Dvornik et al., 2019). Several studies have successfully applied FSL to plant analysis, such as leaf classification (Argüeso et al., 2020), plant detection (Rezaei et al., 2024), and hyperspectral categorization (Cai et al., 2023). However, standard FSL methods are prone to catastrophic forgetting, where adapting to new tasks degrades performance on previous ones. To mitigate this, Few-Shot Class-Incremental Learning (FSCIL) (Tao et al., 2020) was introduced, utilizing techniques like neural gas networks (Martinetz and Schulten, 1991; Prudent and Ennaji, 2005) to dynamically model feature space topology. Despite this progress, mainstream approaches (Ahmed et al., 2024; Han et al., 2024) often employ a frozen backbone pre-trained with cross-entropy loss. This strategy frequently fails to effectively separate class margins, leading to poor generalization (Raichur et al., 2024; Zhou et al., 2024). Moreover, the data often presents fine-grained features: minimal distinction between different species (low inter-class variance) and significant variation within the same species (high intra-class variance). Such ambiguity hinders the model’s ability to discriminate between new and old classes, resulting in false classifications.

Most existing techniques in fine-grained classification primarily focus on extracting image edge signals or high-frequency features (Song et al., 2023). While these detailed features are generally effective in revealing subtle inter-class differences, it remains essential to further sharpen the distinction between fine-grained classes and achieve clearer clustering of novel and old data, even with limited samples.

Motivated by these challenges, this paper proposes a novel Frequency-Aware Guided Domain Enhancement Contrastive Learning Model (FGDE). This framework constructs discriminative features by integrating high- and low-frequency components and leverages the class-clustering capability of contrastive learning. The result is a feature distribution characterized by improved intra-class compactness and inter-class separability. As illustrated in Figure 1, the detailed features are refined by incorporating high-frequency components to enhance domain-specific representations (Li et al., 2023). The proposed method is described in the third section, and the experimental results and analysis are shown in the fourth section. The main contributions of this paper are summarized as follows:

Figure 1
Diagram showing three versions of an image: original, low-frequency, and high-frequency components, processed through a model. The output includes base classes and incremental sequences represented by color-coded dots.

Figure 1. Illustration of our FGDE. Different classes are marked in different colors. Our proposed network extracts the high-frequency and low-frequency features using the Discrete Cosine Transform (DCT). Enhanced features improve the clustering performance of the model.

1. A novel Frequency-Aware Guided Domain Enhancement Contrastive Learning Model (FGDE) is proposed to strengthen the fine-grained semantic extension of base classes and the separation of subsequent classes. It achieves detail-enhanced feature representation by integrating multi-frequency components, thereby refining domain-specific distinctions.

2. We leverage high-frequency and low-frequency components to enrich the original features and unearth class-discriminative information in both the visual and label domains. This enhances multi-semantic aggregation awareness, facilitating more precise differentiation of fine-grained images.

3. We introduce contrastive loss, cross-entropy loss, and feature augmentation loss. This mechanism minimizes intra-class variance while maximizing inter-class separation, significantly enhancing the model’s discriminative power and generalization capabilities.

4. We showcase robust performance on our datasets and public datasets, outperforming previous state-of-the-art methods. Furthermore, we perform a thorough analysis to evaluate the importance of each component.

2 Data collection and preprocessing

2.1 Sample preparation

We collected 28 different specimens and their derived products; all samples were sourced from the Lotus Pond Chinese Medicinal Plant Market in Chengdu, China. These samples were authenticated by experts from the Chengdu Institute of Food and Drug Control (Chengdu, China). The dried samples were obtained from the original intact specimens. Post-collection, they were stored under standard cold storage conditions.

2.2 Data acquisition

A self-developed high-resolution data acquisition device built around a Canon EOS 60D camera was used to acquire the images shown in Figure 2A. The device is composed of a box, a light system, and an image acquisition system, which together provide stable and consistent environmental conditions. The image acquisition process is illustrated in Figure 2.

Figure 2
A three-panel process image. Panel A shows a camera setup above a lightbox for imaging objects. Panel B displays numerous images acquired by A. Panel C shows the cropped results.

Figure 2. From image acquisition to detection results. (A) Image Acquisition, (B) Image Data, (C) Image Detection.

All images are captured using a 35mm CMOS sensor at a resolution of 5120×3840, as shown in Figure 2B. Images are annotated and cropped to obtain a target, as shown in Figure 2C. We remove incomplete, blurry, and inappropriate images. Our collected dataset is shown in Figure 3. Because the training dataset can be highly unbalanced, for example among the confusable classes within our dataset, we balance each class through data augmentation. Specifically, we augment the data to ensure a uniform distribution of 250 samples in each class.

Figure 3
Twelve panels labeled A to L display different biological cross-sections and seeds. Each panel shows a series of four images, highlighting variations in texture, color, and shape. Panels A and B depict ring-like cross-sections; C and D show seed pods; E and F display round, fruit-like shapes; G to L present various seeds and seeds within fruits, each with unique surface details and colors ranging from pale to dark.

Figure 3. Random samples from the dataset, which consists of 28 different CHMs and their produced products. Namely (A) chaoshanzha (B) honghuajiao (C) jiaoshanzha (D) hanyuanhuajiao (E) shanzhatan (F) qingjiao (G) sichuanhuajiao (H) tengjiao (I) jiangbanxia (J) lubei (K) songbei (L) shengbanxia.

Different processed products of Shanzha include Chaoshanzha, Jiaoshanzha, and Shanzhatan. Similarly, Banxia is processed in various ways to obtain Jiangbanxia, Fabanxia, Qingbanxia, and Jingbanxia. However, Shuibanxia is often used as a counterfeit of Qingbanxia. Additionally, Jiangnanxing, a processed product derived from Tiger's Paw Southern Star, is commonly sold as a fake of Jiangbanxia in the commercial market. Lubeimu, Qingbeimu, and Songbeimu are the most common forms of Chuanbeimu in circulation. Finally, all images are processed with object detection to remove redundant pixels that contain no information.

3 Methods

3.1 Problem definition

Continuous incremental sessions are the key factor in FSCIL. In this paper, the first session learns a generalizable representation, after which multiple few-shot incremental sessions are executed. Let $D^t = \{(x_i, y_i)\}_{i=0}^{N_t}$ denote the training data of session $t$, where $x_i$ and $y_i$ are the $i$-th image and its corresponding label, respectively. The training sequence is expressed as $D^s = \{D^0, D^1, \dots, D^N\}$. For the initial session $D^0$, the image domain contains $C^0$ classes and the label domain is $L^0$. For subsequent incremental sessions, the label domains do not overlap, and the classes contained in new sessions are unseen in the base data. After training on $D^t_{train}$, the model is evaluated on $D^t_{test}$, which covers all encountered classes $C^0 \cup C^1 \cup \dots \cup C^t$ up to session $t$. In FSCIL, the initial session provides many samples, while the model only has access to a few samples in the following sessions. Specifically, the incremental data are organized in the N-way K-shot format, where N is the number of classes and K is the number of training images per class.
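To make the session protocol concrete, the following is a minimal sketch (not the authors' released code) of how an N-way K-shot incremental session could be assembled from a labeled pool; the helper name and the data layout are hypothetical.

```python
import random
from collections import defaultdict

def build_incremental_session(samples, novel_classes, k_shot):
    """Assemble an N-way K-shot session.
    samples: list of (image_path, label) pairs; novel_classes: the N class labels
    introduced in this session; k_shot: annotated examples kept per novel class."""
    by_class = defaultdict(list)
    for path, label in samples:
        if label in novel_classes:
            by_class[label].append((path, label))
    session = []
    for label in novel_classes:
        # keep only K annotated examples per novel class (the K-shot constraint)
        session.extend(random.sample(by_class[label], k_shot))
    return session

# Example: a 3-way 5-shot session, matching the setting used for our dataset.
# pool = [("img_0001.jpg", 16), ("img_0002.jpg", 17), ...]   # hypothetical pool
# session_1 = build_incremental_session(pool, novel_classes=[16, 17, 18], k_shot=5)
```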

3.2 Overview

The architecture of our proposed FGDE model is illustrated in Figure 4. In the first phase, multiple predefined augmentations are applied to enrich the fine-grained images. Subsequently, Discrete Cosine Transform (DCT) is employed to extract multi-frequency features, which are fused with the original images to construct high-frequency and low-frequency enhanced representations. Simultaneously, label representations are expanded to encapsulate the semantic consistency of the images. The visual patches and these expanded labels interact to ensure cross-modal alignment and refine the embedding space, thereby improving the separability of base classes. In the second phase, generated contrastive learning pairs are utilized to enhance multi-semantic aggregation and mine class-discriminative information. Here, semantic granularity is enriched via contrastive learning. We jointly optimize contrastive, feature augmentation, and cross-entropy losses to minimize intra-class variance and maximize inter-class distance. During the third phase, the model adapts to new classes using limited few-shot samples. A similar metric is employed to assign incoming samples to their respective prototypes, ensuring robust generalization and stability while mitigating catastrophic forgetting.

Figure 4
The overall pipeline of our FGDE framework consists of three phases.

Figure 4. The overall pipeline of our FGDE framework consists of three phases. Phase 1 emphasizes learning richer representations of the original space for both the image and label domains through multiple predefined transformations. Phase 2 involves leveraging contrastive learning to distinguish between positive and negative sample pairs. Phase 3 focuses on training the limited new classes to mitigate catastrophic forgetting and reduce overfitting.

3.2.1 Frequency-aware guided multi-semantic feature enhancement

Data scarcity in base classes restricts the diversity of learned semantic features, leading to poor generalization and unclear class boundaries. To enhance feature robustness, we apply targeted visual transformations, focusing on color and shape as suggested by previous studies. Specifically, we employ random cropping to expand the fine-grained feature space. Given an image $X$, the cropping dimensions are defined as $w_{crop} \sim \mathrm{Rand}(w_{min}, w_{max})$ and $h_{crop} \sim \mathrm{Rand}(h_{min}, h_{max})$. Given a random point $(x_r, y_r)$, where $x_r \in [0, w - w_{crop}]$ and $y_r \in [0, h - h_{crop}]$, the point $(x_f, y_f)$ at the lower right corner of the final cropping area is computed as in Equation 1:

$\begin{cases} x_f = x_r + w_{crop} - 1 \\ y_f = y_r + h_{crop} - 1 \end{cases} \quad (1)$
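A minimal sketch of the crop-coordinate computation in Equation 1, assuming integer pixel coordinates; the function name is hypothetical and variable names follow the text.

```python
import random

def random_crop_box(w, h, w_min, w_max, h_min, h_max):
    """Sample a crop of random size and return its top-left and bottom-right corners."""
    w_crop = random.randint(w_min, w_max)
    h_crop = random.randint(h_min, h_max)
    x_r = random.randint(0, w - w_crop)   # random top-left corner (x_r, y_r)
    y_r = random.randint(0, h - h_crop)
    x_f = x_r + w_crop - 1                # bottom-right corner, Equation 1
    y_f = y_r + h_crop - 1
    return (x_r, y_r), (x_f, y_f)
```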

Random cropping of varying sizes is employed to capture local information, enhancing both local feature understanding and fine-grained semantic perception. To further enrich class-aware semantics, we introduce a transformation set $\{T_c, T_r\}$, consisting of color jittering ($T_c$) and random rotation ($T_r$). The processed RGB images are then transformed into the frequency domain using the 2D DCT, which expresses pixel data as a linear combination of cosine basis functions. Leveraging the superior energy compaction of the DCT over the complex-valued DFT (He et al., 2020; Huang et al., 2025), each channel of the input image $X$ is converted to the frequency spectrum $P^{2d}$ as in Equation 2:

$P^{2d}_{h,w} = \alpha_h \alpha_w \sum_{a=0}^{H-1} \sum_{b=0}^{W-1} X_{a,b} \cos\!\left(\frac{\pi h}{H}(a+0.5)\right) \cos\!\left(\frac{\pi w}{W}(b+0.5)\right), \quad (2)$

where $h \in \{0, 1, \dots, H-1\}$ and $w \in \{0, 1, \dots, W-1\}$ represent the horizontal and vertical frequency indices. The normalization coefficients $\alpha_h$ and $\alpha_w$ are defined as in Equation 3:

$\alpha_k = \begin{cases} \sqrt{1/N}, & \text{if } k = 0 \\ \sqrt{2/N}, & \text{otherwise} \end{cases}, \quad \text{where } N \in \{H, W\} \quad (3)$

In the resulting spectrum P2d, low-frequency components are concentrated near the origin (0,0), while high-frequency components are distributed in the peripheral regions. Then, we apply a binary mask M to separate the spectrum into low-frequency and high-frequency components. We define a cut-off threshold τ based on the Manhattan distance in the frequency domain. The mask M is defined as in Equation 4:

$M_{h,w} = \begin{cases} 1, & \text{if } h + w \le \tau \\ 0, & \text{otherwise} \end{cases} \quad (4)$

Subsequently, the low-frequency spectrum $P^{2d}_{low}$ and high-frequency spectrum $P^{2d}_{high}$ are derived via the Hadamard product ($\odot$): $P^{2d}_{low} = P^{2d} \odot M$, $P^{2d}_{high} = P^{2d} \odot (1 - M)$. Finally, we project the masked spectra back to the spatial domain using the 2D Inverse DCT (IDCT) in Equation 5:

$\tilde{I}^{2d}_{a,b} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \alpha_h \alpha_w P^{2d}_{h,w} \cos\!\left(\frac{\pi h}{H}(a+0.5)\right) \cos\!\left(\frac{\pi w}{W}(b+0.5)\right) \quad (5)$
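The decomposition of Equations 2–5 can be sketched with SciPy's orthonormal 2D DCT; the Manhattan-distance mask follows Equation 4. This is an illustrative re-implementation under our own conventions, not the released code, and the cut-off value in the usage comment is only an example.

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_decompose(x, tau):
    """Split a single-channel image into low- and high-frequency spatial components.
    x: (H, W) array; tau: Manhattan-distance cut-off in the DCT spectrum."""
    H, W = x.shape
    spectrum = dctn(x, type=2, norm="ortho")                 # Equation 2 (orthonormal DCT)
    h_idx, w_idx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    mask = (h_idx + w_idx <= tau).astype(x.dtype)            # Equation 4: low-frequency mask
    low = idctn(spectrum * mask, type=2, norm="ortho")       # Equation 5 applied to P_low
    high = idctn(spectrum * (1.0 - mask), type=2, norm="ortho")  # and to P_high
    return low, high

# Applied channel-wise to an RGB image (tau chosen here purely for illustration):
# img = np.random.rand(224, 224, 3)
# low_c, high_c = zip(*(frequency_decompose(img[..., c], tau=32) for c in range(3)))
```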

These components are then fed into the encoder to obtain the original-image feature map $I_X$, the low-frequency feature map $I_l$, and the high-frequency feature map $I_h$. This enables the extraction of discriminative details from high-frequency components and structural context from low-frequency components. The feature extraction is defined as in Equation 6:

$I_l = f_\theta(I_l \times X + X), \qquad I_h = f_\theta(I_h \times X + X) \quad (6)$

Utilizing enhanced discriminative feature maps as prior knowledge augments the model’s ability to capture critical information and adapt to incremental data. Specifically, by aligning samples with class prototypes via the high-frequency features Ih, we encode fine-grained details that effectively sharpen decision boundaries and enhance model performance. Then, the embedded image is computed by Equation 7:

$I = f_{DCT}(f_{T_r}(f_{T_c}(C^0))), \quad C^0 = \{X, I_h, I_l\} \quad (7)$

where $C^0$ is designated as the set of encountered classes. This extends diverse semantics and fills the unallocated image embedding space. It also provides semantic knowledge, which encourages extensive learning of different semantics for better generalization.
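A minimal PyTorch sketch of the fusion in Equation 6, where the reconstructed frequency components modulate the original image before a shared encoder $f_\theta$. Using a ResNet-18 backbone is consistent with the comparison in Section 4.6.1, but the exact fusion implementation here is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrequencyGuidedEncoder(nn.Module):
    """Encode the original image together with its frequency-enhanced variants (Eq. 6)."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.f_theta = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head

    def forward(self, x, x_low, x_high):
        # Equation 6: the frequency component gates the image, plus a residual path.
        i_x = self.f_theta(x).flatten(1)
        i_low = self.f_theta(x_low * x + x).flatten(1)
        i_high = self.f_theta(x_high * x + x).flatten(1)
        return i_x, i_low, i_high

# x = x_low = x_high = torch.randn(4, 3, 224, 224)
# i_x, i_low, i_high = FrequencyGuidedEncoder()(x, x_low, x_high)  # three 512-d embeddings
```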

For the label domain, the predefined alterations generate multiple augmented image-label pairs $(x, y)$, where $(x, y) = \{x_n, y_n\}_{n=1}^{N}$ and $N$ is the size of the transformed extension space. Here, $x_n$ are the generated extension images, and the corresponding labels are $y_n = y \times N + n$. Thus, the label space is extended with the fine-grained class-aware embedding derived from the original space. The association between the image domain and the label domain effectively provides richer semantic details to improve accuracy. Likewise, training within the embedding space can be expressed by Equation 8:

$\mathcal{L}_{class}(f; x, y) = \frac{1}{N} \sum_{n=1}^{N} l_{ce}(f(x_n), y_n) \quad (8)$
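The label-space extension $y_n = y \times N + n$ and the loss in Equation 8 reduce to the following sketch; the tensor shapes are assumptions about how the extended logits could be organized.

```python
import torch
import torch.nn.functional as F

def extended_targets(y, n_transforms):
    """Map original labels y (batch,) to extended labels y*N + n for each transform n."""
    return torch.stack([y * n_transforms + n for n in range(n_transforms)], dim=1)

def class_extension_loss(logits_per_view, y, n_transforms):
    """Equation 8: average cross-entropy over the N transformed views.
    logits_per_view: (batch, N, num_base_classes * N)."""
    targets = extended_targets(y, n_transforms)                       # (batch, N)
    losses = [F.cross_entropy(logits_per_view[:, n], targets[:, n])
              for n in range(n_transforms)]
    return torch.stack(losses).mean()
```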

3.2.2 Detail-enhanced discriminative feature representation

Although effective for coarse classification, existing methods are limited in handling fine-grained data. We therefore propose an embedding-based supervised contrastive learning strategy using the MoCo (He et al., 2020) framework. This method optimizes feature distances by clustering positive pairs and separating negative ones. Given an instance (x,y), we generate a query view xq=Augq(x) and a key view xk=Augk(x) via data augmentation. A shared encoder fq, comprising a feature extractor and a classifier, is then employed to extract the corresponding features. As shown in Equation 9:

$f_q = \omega^{T} f(x) \quad (9)$

where $\omega^T \in \mathbb{R}^{d \times |C^0|}$ is the weight matrix and $f(x) \in \mathbb{R}^{d \times 1}$ is the feature function. The query encoder $f_q$ is updated through gradient descent, while the key encoder $f_k$ is updated progressively via a momentum update driven by $f_q$. A queue of key embeddings is maintained to store the feature vectors.

In the label domain, a label queue maintains labels corresponding to the feature queue, facilitating the differentiation of positive and negative pairs. This queue preserves an identical length to the feature queue. Subsequently, the contrastive loss is computed to drive the model to capture discriminative fine-grained features. This optimization effectively minimizes intra-class distance while maximizing inter-class variation, thereby fostering deep interaction between the visual and label domains.
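A condensed sketch of the MoCo-style components described above: the momentum update of the key encoder and a paired feature/label queue. The queue size and the momentum coefficient are assumptions (common MoCo defaults), not values reported in the text.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Key encoder parameters track the query encoder via an exponential moving average."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

class FeatureLabelQueue:
    """Fixed-length queue storing key embeddings together with their labels."""
    def __init__(self, dim=512, size=4096):
        self.feats = torch.zeros(size, dim)
        self.labels = torch.full((size,), -1, dtype=torch.long)  # -1 marks empty slots
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys, labels):
        idx = (self.ptr + torch.arange(keys.shape[0])) % self.feats.shape[0]
        self.feats[idx] = keys
        self.labels[idx] = labels
        self.ptr = (self.ptr + keys.shape[0]) % self.feats.shape[0]
```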

3.2.2.1 Inter-class variation

Denote the representation of a class gathering center as its prototype $P_j$; all prototypes of different classes should be far away from each other. $P_j$ is expressed by Equation 10:

$P_j = \frac{1}{N_j} \sum_{a=1}^{N_j} X_a \quad (10)$

where $N_j$ is the number of samples in class $j$ and $X_a$ is the feature vector of the $a$-th sample. Thus, denoting two prototypes $P_j$ and $P_k$ for classes $j$ and $k$ in the base session, the Euclidean distance measuring inter-class variation is calculated by Equation 11:

$d^{inter}_{j,k} = \sqrt{\sum_{i=1}^{d} \left(P^i_j - P^i_k\right)^2} \quad (11)$

where $d$ is the dimension of the feature vector and $i$ indexes its components. For the subsequent incremental sessions, the novel classes are likewise characterized by computing the distance between their prototypes and the samples.

3.2.2.2 Intra-class distances

The analysis of intra-class distances involves computing the Euclidean distances between the samples and prototype Pj within the same class j, and then determining the average value. For the testing sample, the intra-class distances are computed by Equation 12:

$d^{intra}_{j} = \frac{1}{N_j} \sum_{a=1}^{N_j} \sqrt{\sum_{i=1}^{d} \left(x^i_j - P^i_j\right)^2} \quad (12)$

where $x^i_j$ is the feature vector of the samples. The smaller the intra-class distance, the more tightly the samples within the same class cluster. This enhances the distinct separation of local information in the feature space and is crucial for accurate fine-grained identification.
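Equations 10–12 amount to prototype means and Euclidean distances, as in the following short sketch.

```python
import torch

def class_prototype(features):
    """Equation 10: prototype = mean feature vector of one class; features: (N_j, d)."""
    return features.mean(dim=0)

def inter_class_distance(proto_j, proto_k):
    """Equation 11: Euclidean distance between two class prototypes."""
    return torch.linalg.norm(proto_j - proto_k)

def intra_class_distance(features, proto_j):
    """Equation 12: mean Euclidean distance from class samples to their prototype."""
    return torch.linalg.norm(features - proto_j, dim=1).mean()
```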

3.2.2.3 Augmentation feature analysis

The model should also attend to global features of the multi-transformation, imbalanced fine-grained data. To improve the generalization of class separation, we consider a global augmentation set derived from the generalization features. It serves as the query view and optimizes the feature space by learning the general features of different classes. Likewise, the image-label pairs $\{x_m, y_m\}$ for global augmentation are processed by Equation 13:

$\mathcal{L}_{Al}(f; x, y) = \frac{1}{M} \sum_{m=1}^{M} l_{ce}(f(x_m), y_m) \quad (13)$

With the augmentation feature analysis, the model can better concentrate the detail-enhanced information to distinguish imbalanced fine-grained images and optimize the feature space.

3.2.3 Incremental class inference

In the incremental sessions, the backbone network is frozen and the classifier is extended by computing the novel class prototypes. The novel class information thus enables the extension of the classifier with the prototypes of the base classes and the extended-augmentation classes, as shown in Equation 14:

$W^{class}_{N} = \{w^0_{11}, w^0_{12}, \dots, w^0_{bN}\} \cup \{w^1_{11}, w^1_{12}, \dots, w^1_{bN}\} \cup \dots \cup \{w^t_{11}, w^t_{12}, \dots, w^t_{bN}\} \quad (14)$

where $b$ is the number of base classes, $N$ is the size of the transformed extension space, and $t$ is the number of incremental sessions. The prototypes $W_n$ represent the focus of global and local fine-grained semantics from the original classes, as shown in Equation 15:

$W_n = \{w^0_{1n}, w^0_{2n}, \dots, w^0_{bn}\} \cup \{w^1_{1n}, w^1_{2n}, \dots, w^1_{bn}\} \cup \dots \cup \{w^t_{1n}, w^t_{2n}, \dots, w^t_{bn}\} \quad (15)$

Subsequently, the classifier is updated with the novel classes' prototypes combined with the original class prototypes. This helps push the novel samples away from the distributions of the old classes and benefits the generalization of novel class-aware semantic information. The FC layer of the model is updated by contrasting novel query samples with the slowly evolving key embeddings of base classes from the feature queue. Finally, the cosine similarity between the embedding and all prototypes is computed to obtain the inference result for the test image. It can be formulated as Equations 16 and 17:

$\mathrm{Pred} = \arg\max_{n=1,\dots,N} \mathrm{sim}\!\left(f(x_n), w^t_{bn}\right) \quad (16)$
$\omega^t_{bn} = \frac{1}{n^t_{bn}} \sum_{i=1}^{n^t_{bn}} f\!\left(x^i_{bn}\right) \quad (17)$

By adapting the classifier to the novel classes while keeping the backbone unchanged, it can maximize the preservation of previously acquired knowledge.
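A minimal sketch of this inference step: novel prototypes (Equation 17) are appended to the frozen classifier and the prediction is the class whose prototype is most cosine-similar to the embedding (Equation 16). The dictionary-based prototype store is an assumption made for brevity.

```python
import torch
import torch.nn.functional as F

def extend_classifier(prototypes, novel_features, novel_labels):
    """Equation 17: a novel class prototype is the mean of its few-shot embeddings.
    prototypes: dict {class_id: (d,) tensor}; novel_features: (K*N, d); novel_labels: (K*N,)."""
    for c in novel_labels.unique():
        prototypes[int(c)] = novel_features[novel_labels == c].mean(dim=0)
    return prototypes

def predict(embedding, prototypes):
    """Equation 16: assign the class whose prototype has maximal cosine similarity."""
    classes = sorted(prototypes)
    weight = torch.stack([prototypes[c] for c in classes])              # (C, d)
    sims = F.cosine_similarity(embedding.unsqueeze(0), weight, dim=1)   # (C,)
    return classes[int(sims.argmax())]
```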

3.3 Loss function

In this paper, the loss functions consist of three parts: cross-entropy loss (Equation 18), contrastive loss (Equation 19), and feature augmentation loss (Equation 20). The model leverages sufficient data within the base classes to obtain multi-semantic aggregated information from fine-grained images by optimizing the per-sample loss while simultaneously maximizing inter-class margins.

For the extension of frequency-aware guided alterations, the anchor image $x_i$ is supervised by the cross-entropy loss, which aligns class features with their targets. It is computed by Equation 18:

$\mathcal{L}_{ce}(x_i) = -\sum_{i=1}^{b} y_i \log\!\left(p(x_i)\right) \quad (18)$

To generate the query view $x_q$ and the key view $x_k$, we apply specific data augmentations. For a given anchor index $i$, we define the set of all indices in the current batch (or memory queue) as $A(i)$ and strictly categorize them into two subsets. Positive set: the indices of samples that share the same class label as the anchor $i$, $Q_i = \{j \in A(i) \mid y_j = y_i, j \neq i\}$. Negative set: the indices of samples from all other classes, $K_i = \{k \in A(i) \mid y_k \neq y_i\}$. We adopt the InfoNCE loss (Ahmed et al., 2024) as our contrastive objective, which maximizes the similarity between the anchor and its positive peers while minimizing the similarity with negative samples. The loss for anchor $x_i$ is formulated as Equation 19:

$\mathcal{L}_{cl}(x_i) = -\frac{1}{|Q_i|} \sum_{x_j \in Q_i} \log \frac{\exp(x_i \cdot x_j / \tau)}{\sum_{x_k \in K_i} \exp(x_i \cdot x_k / \tau)} \quad (19)$

where $\cdot$ denotes the dot product, $|Q_i|$ is the cardinality of the positive set, and $\tau$ is the temperature parameter (set to 16). The denominator sums over all contrastive samples to strictly regulate the embedding space. $x_j$ is an element of the positive set $Q_i$, and $x_k$ is an element of the negative set $K_i$. The objective pulls $x_j$ closer to $x_i$ and pushes $x_k$ further from $x_i$. To complement the contrastive loss, a global feature augmentation loss is computed for sample $x_i$ to improve the generalization of class separation. Denoting the prototype of each class as $P_j$, $\mathcal{L}_{Al}$ is expressed by Equation 20:

$\mathcal{L}_{Al}(x_i) = -\frac{1}{R} \sum_{i \in R} P_j \log\!\left(p(x_i)\right) \quad (20)$

where $R$ is the number of training images. The overall training objective is formulated as Equation 21:

$\mathcal{L}_{loss} = \mathcal{L}_{ce} + \mathcal{L}_{cl} + \mathcal{L}_{Al} \quad (21)$
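A compact sketch of the joint objective in Equations 18–21, with the contrastive term computed against a feature/label queue. The temperature value follows the text; treating the augmentation term as a plain cross-entropy over the global augmented views is an assumption.

```python
import torch
import torch.nn.functional as F

def supervised_info_nce(anchor, queue_feats, queue_labels, anchor_label, tau=16.0):
    """Equation 19: pull same-label keys toward the anchor, push other classes away.
    anchor: (d,) embedding; queue_feats: (Q, d); queue_labels: (Q,)."""
    sims = queue_feats @ anchor / tau
    pos = sims[queue_labels == anchor_label]                 # positive set Q_i
    neg = sims[queue_labels != anchor_label]                 # negative set K_i
    return -(pos - torch.logsumexp(neg, dim=0)).mean()       # averaged over positives

def total_loss(logits, targets, anchor, anchor_label,
               queue_feats, queue_labels, aug_logits, aug_targets):
    """Equation 21: sum of cross-entropy, contrastive, and feature-augmentation losses."""
    l_ce = F.cross_entropy(logits, targets)                                       # Eq. 18
    l_cl = supervised_info_nce(anchor, queue_feats, queue_labels, anchor_label)   # Eq. 19
    l_al = F.cross_entropy(aug_logits, aug_targets)                               # Eq. 20
    return l_ce + l_cl + l_al
```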

4 Experimental results and discussion

4.1 Dataset

To validate the generalization of our method, additional experiments are conducted on two publicly available herb datasets, with comparisons made against other methods.

The two herb datasets are the Chinese Medicine dataset (Thella and Ulagamuthalvi, 2021) and Medicinal Leaf (Huang and Xu, 2023), and excerpts of these datasets are illustrated in Figures 5 and 6.

Figure 5

Figure 5. The part of the Chinese Medicine dataset. The Chinese Medicine dataset comprises 20 different types of Chinese medicinal plants, comprising a total of 3000 images.

Figure 6

Figure 6. The part of Medicinal Leaf dataset. The Medicinal Leaf dataset contains 100 types of herbal plants, comprising a total of 10000 images.

4.2 Implementation details

The model is optimized by SGD with a momentum of 0.9. The initial learning rate is 0.1, and the learning rate decays according to a StepLR schedule. The batch size is set to 16, and the final model is obtained after 100 epochs of the base learning phase. For the incremental learning phase, we fine-tune the pre-trained model, and the novel query samples are compared with the key embeddings from base training. The model updates the classifier over 10 epochs to mitigate overfitting. The code is implemented in PyTorch 2.2.1 with Python 3.11. The model is trained on a PC equipped with an Intel i7 processor and an NVIDIA 4090 GPU (24 GB memory).
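For reproducibility, the optimizer and schedule described above can be configured as in the following sketch; the StepLR step size and decay factor are assumptions, since only the schedule type is stated in the text, and the placeholder model stands in for the full FGDE network.

```python
import torch

model = torch.nn.Linear(512, 28)   # placeholder so the snippet runs stand-alone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Step size and gamma are assumptions; the paper only states that StepLR is used.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):           # base-session training runs for 100 epochs
    # ... forward pass, loss computation, optimizer.step() ...
    scheduler.step()
```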

4.3 Performance metrics

We use Accuracy, Precision, Recall, Specificity, and F1-Score as evaluation metrics, as shown in Equations 22–27. Furthermore, the harmonic mean (HM) (Kim et al., 2023; Yang et al., 2023) is used to balance the inherent biases between base classes and incremental classes.

$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN} \quad (22)$
$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (23)$
$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (24)$
$\mathrm{Specificity} = \frac{TN}{FP + TN} \quad (25)$
$F1\ \mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (26)$
$\mathrm{HM} = \frac{2 \times A_{base} \times A_{inc}}{A_{base} + A_{inc}} \quad (27)$

where TN represents the number of True Negative, and TP denotes the number of True Positive. FN indicates the number of False Negative, and FP is the number of False Positive. Abase is the accuracy of base classes, Ainc is the top-1 accuracy of incremental classes.
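The confusion-matrix metrics and the harmonic mean of Equations 22–27 reduce to the following sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations 22-26: accuracy, precision, recall, specificity and F1-score."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, specificity, f1

def harmonic_mean(acc_base, acc_inc):
    """Equation 27: balances base-class and incremental-class accuracy."""
    return 2 * acc_base * acc_inc / (acc_base + acc_inc)
```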

4.4 Performance of identification results

4.4.1 Performance of model identification

We split our dataset into a training set and a testing set; specifically, 75% of the data is used for training and the remaining 25% for testing. The base training phase comprises 16 classes with 200 samples per class, followed by 4 incremental sessions, each containing 3 classes with 5 samples per class. The experimental results for each class are detailed in Table 1.

Table 1

Table 1. The experimental results of different varieties by our model.

As shown in Table 1, the identification performance is notably strong for the base classes. In particular, the model demonstrates high accuracy and robustness in classifying classes A, E, C, F, D, H, and I. In contrast, the F1-scores for the incremental classes predominantly range between 0.3 and 0.6, highlighting the model’s limited effectiveness in learning new classes. For instance, classes R and Q exhibit high recall but low precision; this can be attributed to significant inter-class feature similarity, which leads to misclassification. To further evaluate our model’s performance, we calculated the confusion matrix, and the experimental results are depicted in Figure 7.

Figure 7
Two confusion matrices, labeled (A) and (B), showing true vs. predicted labels for a classification task. Both matrices have a color gradient from light to dark blue, representing low to high values. Red rectangles in (A) and green rectangles in (B) highlight areas of significant values, indicating patterns of confusion in predictions. Numerical values range up to two hundred fifty. Axes are labeled as True Labels and Predicted Labels, with a color scale bar on the right indicating the value intensity.

Figure 7. The experimental results of the confusion matrix. (A) is ours, (B) is the result without high-frequency and low-frequency enhanced images. Significant contrast areas are marked with red and green.

While the identification results for base classes are comparable across methods, the improvements in incremental classes are more pronounced. In the confusion matrices, a brighter diagonal indicates higher identification accuracy, with significant contrast areas highlighted in red and green. As shown in Figure 7A, for base class identification, the marked areas demonstrate that FGDE achieves superior performance with fewer misclassifications compared to other methods. When comparing Figure 7A with Figure 7B, our model demonstrates distinct advantages over the baseline lacking high-frequency and low-frequency enhancement, particularly in the highlighted regions. Mechanistically, low-frequency components capture global structural features, while high-frequency components extract local fine-grained details, thereby enriching the image representation. Conversely, while the method without frequency-aware extension maintains a visible diagonal for base classes, it performs poorly on novel classes. In contrast, our method exhibits robust performance, indicating that it effectively adapts to novel classes without disrupting previous decision boundaries.

4.4.2 Different losses of model identification

To better evaluate the performance of the model and enhance the explanation of training, the loss and accuracy results of our model are shown in Figure 8.

Figure 8
Four line graphs show training and testing metrics over 100 epochs. Upper left: training loss with ce_loss, fa_loss, and cl_loss lines decreasing. Upper right: training accuracy increases reaching about 0.95. Lower left: testing loss decreases with fluctuations. Lower right: testing accuracy increases steadily, reaching around 0.9.

Figure 8. The loss and accuracy results of our model. (A) is the curve of different loss changes, and the curve for training accuracy. (B) is the curve of the testing loss and the curve of the testing accuracy of our model.

The loss curves and their convergence trends are illustrated in Figure 8. Throughout the training process, the loss consistently decreased while accuracy improved, eventually leading to model convergence. Specifically, the trajectories of the multi-objective losses are detailed in Figure 8A. The loss exhibits a steady decline until stabilizing at approximately 80 epochs, with the model achieving peak accuracy at epoch 88. Evaluations on the testing set confirm the model’s robust classification performance.

Simultaneously, as illustrated in Figure 9, while the CE loss baseline provides only marginal class separation, our proposed method demonstrates superior capability in distinguishing base classes and integrating novel classes with minimal feature overlap.

Figure 9
Two line graphs compare testing loss and accuracy over 100 epochs. The left graph shows testing loss, while the right shows testing accuracy. “Our” method, in blue, initially shows higher loss but stabilizes lower than the “CE” method, in orange. For accuracy, “Our” method consistently achieves higher values compared to “CE”.

Figure 9. The testing loss and accuracy results of our model and with only CE loss.

4.4.3 Visualization of class separation

To verify the effectiveness of our model, we visualize the identification results using a scatter plot, as shown in Figure 10. The horizontal axis represents the True Labels, while the vertical axis denotes the Predicted Labels. In this visualization, points aligned closely with the diagonal indicate accurate classification performance. Conversely, points deviating from the diagonal represent misclassifications, where the magnitude of deviation highlights the discrepancy between the predicted and ground truth labels.

Figure 10
Scatter plots labeled (A) and (B) show predictions versus true labels for 28 classes with a color-coded legend. Both plots demonstrate a positive correlation, with scattered points in various colors representing different classes. Plot (A) has a tighter cluster of points compared to plot (B), which exhibits a wider spread, particularly among higher classes. Each class maintains a distinct color, enabling identification across the plots.

Figure 10. The scatter plot of different classes. (A) is our model, and (B) is without high-frequency and low-frequency enhanced images. The horizontal axis represents the true labels (True Labels), and the vertical axis represents the predicted labels (Predictions). The color of each point represents a different class, and each color uniquely corresponds to a class.

In Figure 10A, the points are predominantly clustered along the diagonal, indicating that the model achieves robust overall classification performance. In contrast, while Figure 10B exhibits some alignment with the diagonal, a significantly larger number of points deviate from it. This dispersion is particularly pronounced in specific categories, such as classes 7 and 8, indicating higher misclassification rates. For the newly added classes, Figure 10A maintains relatively high accuracy despite occasional errors. Conversely, Figure 10B reveals a marked decline in performance for these later classes, evidenced by a substantial increase in misclassified points and greater deviations from the diagonal. In summary, our method leverages a frequency-based separation strategy: low-frequency components extract fundamental structural features, while high-frequency components capture fine-grained details. This approach enhances inter-class separability and minimizes the interference of new classes on existing representations. Consequently, our method demonstrates significant advantages, exhibiting superior accuracy and robustness.

4.5 Comparison with state of the arts

4.5.1 Our dataset

To evaluate the accuracy performance of our model, the state-of-the-art FSCIL models are compared with ours. The experimental results are shown in Table 2.

Table 2

Table 2. The results of the comparison of the state-of-the-art FSCIL models.

Our proposed method achieves a peak accuracy of 95.000% on the base classes, surpassing all existing baselines. Across subsequent incremental sessions, our method maintains robust performance, recording accuracies of 91.701% in the first session and 78.906% in the fourth. Notably, catastrophic forgetting is significantly better mitigated compared to competing approaches.

While most methods suffer performance degradation as new classes are introduced, distinct patterns emerge. iCaRL exhibits the most inferior performance, starting with a base accuracy of 70.542% and declining rapidly, resulting in the lowest Harmonic Mean (HM) of 58.774%. Although FSLL, FACT, and C-FSCIL achieve respectable base accuracies (91.421%, 92.040%, and 94.051%, respectively), they experience sharp drops in later sessions, yielding HM scores below 80%. SAVC and Wang’s method demonstrate relatively stronger resilience, with HMs of 84.328% and 83.101%, respectively. Nevertheless, our approach consistently outperforms these leading methods across all sessions, securing the highest HM of 86.599%.

This superior performance is attributed to our frequency decomposition strategy. By separating images into low-frequency components (capturing global structural features) and high-frequency components (preserving fine-grained details), we generate a detail-enhanced discriminative representation. This mechanism bolsters both base class separability and novel class generalization. In summary, our model retains base class knowledge while adapting to new sequences with minimal degradation, marking a significant advancement over state-of-the-art methods in solving the FSCIL challenge.

4.5.2 Visualization of class activation maps

Class Activation Maps (CAMs) are essential for interpreting model decisions by highlighting influential image regions. For each instance, the original image is shown with its corresponding CAMs in Figure 11. The color intensity represents the activation level, signifying the importance of each region in the model’s classification result.

Figure 11
Twelve sets of images are arranged in two rows labeled (A) and (B), each with three rows: “Img,” “Wang,” and “Our.” The top row shows original images of objects like huajiao or chuanbeimu. The middle row displays the “Wang” method, and the bottom row shows the “Our” method, both using color maps to visualize data or differences on the objects. Each set numbered one to twelve demonstrates a comparison among the three rows, highlighting visual variations across different methods.

Figure 11. The visualization of the different attention modules for each class. The heat maps of each class are randomly selected. The first is the original image, the second is the heatmap of Wang’s (Wang et al., 2024) method, and the last is ours. The results are organized into (A, B). (A) are the base classes, (B) are the incremental classes.

From Figure 11, a systematic comparison across both base and incremental classes reveals a consistent pattern of superior performance by our method. Our approach demonstrates significantly more accurate localization of target objects, with activations that adhere tightly to object boundaries while effectively suppressing background noise. For instance, in items (1), (4), (8), and (10), Wang's method (Wang et al., 2024) produces diffuse activations that often spill into the background or focus on restricted, peripheral regions. In contrast, our method generates heatmaps that are precisely centered on the targets, covering their salient regions more effectively. Furthermore, our model consistently yields more comprehensive activation maps that encompass the entire object, suggesting the acquisition of a holistic representation. This is particularly evident in items (2), (5), (7), and (11). Whereas Wang's method (Wang et al., 2024) tends to fixate on local textures or edges, our model captures the full semantic structure of the object. This holistic understanding is crucial for robust classification, rendering the model less susceptible to variations in orientation or partial occlusion. Moreover, the heatmaps generated by our method exhibit a more concentrated focus on the objects’ discriminative regions. In contrast, the activations of the Wang (Wang et al., 2024) model appear scattered and less intense, as evident in examples (5), (9), and (12). Our model, conversely, produces strong, focused activations localized on the core features of the objects. This indicates that our approach more effectively identifies key predictive features and is less prone to relying on spurious image correlations. These qualitative results strongly support our hypothesis that the proposed methodology facilitates learning more robust and interpretable feature representations. By generating more complete and accurately localized heatmaps, our model demonstrates a deeper semantic understanding of the image content and enhanced mitigation of catastrophic forgetting.

4.5.3 Chinese medicine dataset

We compare our method with other state-of-the-art methods on the Chinese Medicine dataset, and the comparison results are shown in Figure 12. In Figure 12A, there are 8 base classes and 4 incremental sessions, with the N-way set to 3. Our model exhibits the highest accuracy among the mainstream methods. In Figure 12B, the model is initialized with 60 base classes, followed by incremental sessions containing 8 classes each, with 3-shot samples per class.

Figure 12
Two line graphs compare accuracy over several stages for different methods. Graph A shows accuracy from stages zero to four for methods iCaRL, FSLL, FACT, C-FSCIL, SAVC, Wang, and Our. Graph B extends stages to eight, displaying similar trends. Overall, accuracy tends to decline across stages, with iCaRL generally showing the lowest accuracy compared to other methods.

Figure 12. Comparison with the state-of-the-art on two public datasets: (A) Chinese Medicine and (B) Medicinal Leaf. Error bars indicate the standard deviation, which are used to visualize the performance variance and stability of each method.

As observed in Figure 12, iCaRL exhibits the most precipitous performance decline, suggesting that its replay-based strategy lacks the stability required for FSCIL tasks and struggles to mitigate interference from novel classes. FSLL attains high initial accuracy but suffers a sharp drop in later stages, indicating that despite early effectiveness, its long-term generalization capability is limited. FACT and C-FSCIL display a more gradual decay, reflecting stronger knowledge retention, although performance degradation persists. Conversely, SAVC and the method by Wang et al. maintain relatively stable performance, particularly in later sessions. Furthermore, to evaluate the statistical reliability of our results, we add error bars representing the standard deviation in Figure 12. Our method's error bars remain consistently compact across incremental sessions in both Figures 12A and 12B. This low variance indicates that our FGDE model is highly robust to initialization differences and data sampling fluctuations, maintaining stable performance even as the number of classes increases. In contrast, baseline methods such as iCaRL and FSLL exhibit larger error bars in several sessions (e.g., Sessions 0 and 3 in Figure 12B), suggesting higher instability. Notably, our method outperforms all competing approaches, achieving consistent improvement and minimal degradation. By leveraging fine-grained feature comparison and multi-semantic discrimination, our approach significantly enhances adaptability to novel categories while effectively mitigating catastrophic forgetting.

4.6 Horizontal comparison and performance trade-offs

4.6.1 Results of different backbones

To select the appropriate backbone to extract features in our model, the various backbones are compared to reflect the impact of model efficiency. We also evaluate the model with a hybrid transformer structure, such as MBConv (Howard et al., 2017).

Using running time as a key feasibility metric, we compared various backbones in Table 3. ResNet20 and ResNet50 showed lower accuracy and longer inference times compared to ResNet18. VGG’s simpler architecture limited its feature extraction, reducing base class accuracy, while EfficientNet and DenseNet121 proved computationally expensive due to their complexity. Results from MBConv further highlighted the limitations of hybrid transformer structures in this setting. Although VGG19 improved incremental class accuracy by 1.322%, its slower running time made it less viable. Thus, ResNet18 was selected for achieving the highest comprehensive performance.

Table 3

Table 3. The results of the comparison of the different backbones.

4.6.2 Results of different number of classes

To evaluate the learning ability and generalization ability of our model on incremental classes, experiments with different numbers of base classes are performed for base training. This paper compares 12 base classes with 16 base classes. With 12 base classes, each incremental session contains 4 classes and the N-way is set to 4. With 16 base classes, each incremental session contains 3 classes and the N-way is set to 3. The experimental results are shown in Table 4.

Table 4

Table 4. The results of the comparison of the different numbers of classes.

With 12 base classes, the base accuracy is 1.875% higher than with 16 classes. However, the 12-base-class setting performs 10.380% worse than the 16-base-class setting on the incremental sessions. The experimental results demonstrate that a greater number of base classes enhances the ability of the model to learn and capture fine-grained features more effectively. Finally, we set the number of base classes to 16 with an N-way of 3.

4.6.3 Results of different sizes of cropping

In the original settings, images are initially cropped to enrich the fine-grained feature space and increase image diversity. To evaluate the impact of cropping sizes on model performance, our experiment examines the different cropping sizes. The comparison results are shown in Table 5.

Table 5

Table 5. The results of the comparison of the different sizes of cropping.

From Table 5, it is evident that crop size significantly affects feature diversity. The 128-crop size achieves the highest discrimination, surpassing the 96-crop and 32-crop settings by 7.909% and 11.980%. Additionally, the similar HM scores for crop sizes 96 and 64 suggest a consistent feature distribution at these scales. Given these results, we fixed the crop size at 128 to ensure optimal model performance.

4.6.4 Results of different numbers of base sequences

To verify the effectiveness of our network for few-shot images, we also compare the impact of different numbers of base samples per class on model performance. To ensure fairness, the remaining parameters are kept unchanged in the comparison. The experimental results are shown in Table 6.

Table 6

Table 6. The results of the comparison for the different numbers of base sequences.

The results indicate that the accuracy for base classes remains relatively stable regardless of the sample count. However, regarding the identification of incremental classes, the configuration with 200 base samples per class yields an accuracy 8.536% higher than that with 250 samples and 5.667% higher than that with 300 samples. These findings suggest that 200 base samples per class offer the optimal balance between feature diversity and model generalization.

5 Conclusion

In this paper, we proposed FGDE to address the challenge of FSCIL in the context of fine-grained images. This method innovatively synergizes the low-frequency and high-frequency components with dual-domain contrastive learning to enhance feature discriminability. Unlike existing methods that struggle with subtle inter-class differences, our approach effectively sharpens decision boundaries while maintaining the stability of base classes. Extensive experiments on both proprietary and public TCM datasets demonstrate that FGDE outperforms state-of-the-art methods, offering a robust solution for balancing plasticity and stability. In future work, we aim to further refine identification performance for highly fine-grained categories. Promising directions include integrating diffusion models to address data imbalance via high-fidelity sample generation and employing Graph Convolutional Networks (GCNs) to capture neighborhood structures, thereby further alleviating catastrophic forgetting.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

CT: Formal Analysis, Investigation, Software, Writing – original draft, Writing – review & editing. ZQ: Methodology, Software, Writing – review & editing. ZT: Resources, Software, Writing – review & editing. YH: Funding acquisition, Methodology, Supervision, Writing – review & editing. KL: Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This study was funded by the National Natural Science Foundation of China (No. 82405033), Natural Science Foundation of Sichuan Province (No. 2026NSFSC1837), China Postdoctoral Science Foundation (No. 2025MD774046), Sichuan Provincial Department of Human Resources and Social Security-Postdoctoral Research Special Foundation (No. TB2025094) and the Research Promotion Plan for Xinglin Scholars in Chengdu University of Traditional Chinese Medicine (No. BSZ2024030) and (No. QJRC2024007).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ahmed, N., Kukleva, A., and Schiele, B. (2024). “OrCo: towards better generalization via orthogonality and contrast for few-shot class-incremental learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (CVPR), (Washington: IEEE) 28762–28771.


Argüeso, D., Picon, A., Irusta, U., Medela, A., San-Emeterio, M. G., Bereciartua, A., et al. (2020). Few-Shot Learning approach for plant disease classification using images taken in the field. Comput. Electron. Agr. 175, 105542. doi: 10.1016/j.compag.2020.105542


Armijos, C., Ramírez, J., and Vidari, G. (2022). Poorly investigated Ecuadorian medicinal plants. Plants 11, 1590. doi: 10.3390/plants11121590


Attri, I., Awasthi, L. K., Sharma, T. P., and Rathee, P. (2023). A review of deep learning techniques used in agriculture. Ecol. Inform. 77, 102217. doi: 10.1016/j.ecoinf.2023.102217


Cai, Z., He, M., Li, C., Qi, H., Bai, R., Yang, J., et al. (2023). Identification of chrysanthemum using hyperspectral imaging based on few-shot class incremental learning. Comput. Electron. Agr. 215, 108371. doi: 10.1016/j.compag.2023.108371


Chen, G., Xia, Z., Ma, X., Jiang, Y., and He, Z. (2025). MobileNet-GDR: A lightweight algorithm for grape leaf disease identification based on improved mobileNetV4-small. Front. Plant Sci. 16, 1702071. doi: 10.3389/fpls.2025.1702071


Dvornik, N., Schmid, C., and Mairal, J. (2019). “Diversity with cooperation: Ensemble methods for few-shot classification,” in Proceedings of the IEEE/CVF international conference on computer vision. (CVPR), (Long Beach, CA: IEEE), 3723–3731.


Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE T. Pattern Anal. 28, 594–611. doi: 10.1109/TPAMI.2006.79


Fitzgerald, M., Heinrich, M., and Booker, A. (2020). Medicinal plant analysis: A historical and regional discussion of emergent complex techniques. Front. Pharmacol. 10, 1480. doi: 10.3389/fphar.2019.01480


Gao, Y., Li, H., and Fu, W. (2023). Few-shot learning for image-based bridge damage detection. Eng. Appl. Artif. Intel. 126, 107078. doi: 10.1016/j.engappai.2023.107078


Han, W., Huang, K., Geng, J., and Jiang, W. (2024). Semi-supervised few-shot class-incremental learning based on dynamic topology evolution. Eng. Appl. Artif. Intel. 133, 108528. doi: 10.1016/j.engappai.2024.108528


He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729–9738.


He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition. (CVPR), (Las Vegas: IEEE), 770–778.


Hersche, M., Karunaratne, G., Cherubini, G., Benini, L., Sebastian, A., and Rahimi, A. (2022). “Constrained few-shot class-incremental learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (CVPR), (New Orleans: IEEE), 9057–9067.


Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. doi: 10.48550/arXiv.1704.04861


Huang, H., Geng, X., Wang, L., Wang, X., Liu, F., Peng, Y., et al. (2025). Metabolic profiling and pharmacological evaluation of alkaloids in three Murraya species. Front. Plant Sci. 16, 1675533. doi: 10.3389/fpls.2025.1675533


Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. (2017). “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition. (CVPR), (Hawaii: IEEE), 4700–4708.


Huang, M. L. and Xu, Y. X. (2023). Image classification of Chinese medicinal flowers based on convolutional neural network. Math. Biosci. Eng. 20, 14978–14994. doi: 10.3934/mbe.2023671


Kim, D. Y., Han, D. J., Seo, J., and Moon, J. (2023). “Warping the space: Weight space rotation for class-incremental few-shot learning,” in The Eleventh International Conference on Learning Representations. (ICLR), (Kigali: ICLR).

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444. doi: 10.1038/nature14539

Li, M., Wang, D., Liu, X., Zeng, Z., Lu, R., Chen, B., et al. (2023). “PatchCT: Aligning patch set and label set with conditional transport for multi-label image classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. (ICCV), (Paris: IEEE), 15348–15358.

Martinetz, T. and Schulten, K. (1991). A “neural-gas” network learns topologies. Artificial Neural Networks 1, 397–402.

Mazumder, P., Singh, P., and Rai, P. (2021). “Few-shot lifelong learning,” in Proceedings of the AAAI Conference on Artificial Intelligence. (AAAI), (AAAI Press), 2337–2345.

Pandey, A. and Jain, K. (2022). A robust deep attention dense convolutional neural network for plant leaf disease identification and classification from smart phone captured real world images. Ecol. Inform. 70, 101725. doi: 10.1016/j.ecoinf.2022.101725

Prudent, Y. and Ennaji, A. (2005). “An incremental growing neural gas learns topologies,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. (IJCNN), (Montreal, QC: IEEE), 1211–1216.

Raichur, N. L., Heublein, L., Feigl, T., Rügamer, A., Mutschler, C., and Ott, F. (2024). Bayesian learning-driven prototypical contrastive loss for class-incremental learning. Transact. Mach. Learn. Res. 2025, 03. doi: 10.48550/arXiv.2405.11067

Rebuffi, S. A., Kolesnikov, A., Sperl, G., and Lampert, C. H. (2017). “iCaRL: Incremental classifier and representation learning,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. (CVPR), (Hawaii: IEEE), 2001–2010.

Rezaei, M., Diepeveen, D., Laga, H., Jones, M. G., and Sohel, F. (2024). Plant disease recognition in a low data scenario using few-shot learning. Comput. Electron. Agr. 219, 108812. doi: 10.1016/j.compag.2024.108812

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. doi: 10.48550/arXiv.1409.1556

Song, Z., Zhao, Y., Shi, Y., Peng, P., Yuan, L., and Tian, Y. (2023). “Learning with fantasy: Semantic-aware virtual contrastive constraint for few-shot class-incremental learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (CVPR), (Vancouver: IEEE), 24183–24192.

Sun, M., Xu, S., Mei, Y., Li, J., Gu, Y., Zhang, W., et al. (2022). MicroRNAs in medicinal plants. Int. J. Mol. Sci. 23, 10477. doi: 10.3390/ijms231810477

Tan, M. and Le, Q. (2019). “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning. (ICML), (Long Beach, CA: PMLR). doi: 10.48550/arXiv.1905.11946

Tan, C., Tian, L., Wu, C., and Li, K. (2024). Rapid identification of medicinal plants via visual feature-based deep learning. Plant Methods 20, 81. doi: 10.1186/s13007-024-01202-6

Tao, X., Hong, X., Chang, X., Dong, S., Wei, X., and Gong, Y. (2020). “Few-shot class-incremental learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (CVPR), (IEEE), 12183–12192.

Thella, P. K. and Ulagamuthalvi, V. (2021). An efficient double labelling image segmentation model for leaf pixel extraction for medical plant detection. Ann. Romanian Soc. Cell Biol. 22 (5), 2241–2251.

Vani, K. S., Sudharshanam, U., Mallela Venkata, N. K., Mandla, R., Sreenivas, A., Bedika, M., et al. (2025). Smart agriculture: A climate-driven approach to modelling and forecasting fall armyworm populations in maize using machine learning algorithms. Front. Plant Sci. 16, 1636412. doi: 10.3389/fpls.2025.1636412

Wang, C., Liu, B., Liu, L., Zhu, Y., Hou, J., Liu, P., et al. (2021). A review of deep learning used in the hyperspectral image analysis for agriculture. Artif. Intell. Rev. 54, 5205–5253. doi: 10.1007/s10462-021-10018-y

Wang, X., Sun, J., Tian, P., Wu, M., Zhao, J., Chen, J., et al. (2025). Intelligent grading of sugarcane leaf disease severity by integrating physiological traits with the SSA-XGBoost algorithm. Front. Plant Sci. 16, 1698808. doi: 10.3389/fpls.2025.1698808

Wang, W., Xu, J., Fang, H., Li, Z., and Li, M. (2020). Advances and challenges in medicinal plant breeding. Plant Sci. 298, 110573. doi: 10.1016/j.plantsci.2020.110573

Wang, Q. W., Zhou, D. W., Zhang, Y. K., Zhan, D. C., and Ye, H. J. (2024). Few-shot class-incremental learning via training-free prototype calibration. Adv. Neural Inf. Process. Syst. 36, 15060–15076. doi: 10.48550/arXiv.2312.05229

Wu, E., Chen, Y., Ma, R., and Zhao, X. (2025). A review of weed image identification based on deep few-shot learning. Comput. Electron. Agr. 237, 110675. doi: 10.1016/j.compag.2025.110675

Xiao, Q., Mu, X., Liu, J., Li, B., Liu, H., Zhang, B., et al. (2022). Plant metabolomics: a new strategy and tool for quality evaluation of Chinese medicinal materials. Chin. Med-UK 17, 45. doi: 10.1186/s13020-022-00601-y

Xiao, Y., Wang, J., Xiong, H., Xiao, F., Huang, R., Hong, L., et al. (2025). A large-scale lychee image parallel classification algorithm based on spark and deep learning. Comput. Electron. Agr. 230, 109952. doi: 10.1016/j.compag.2025.109952

Yang, Y., Yuan, H., Li, X., Lin, Z., Torr, P., and Tao, D. (2023). “Neural collapse inspired feature classifier alignment for few-shot class-incremental learning,” in International Conference on Learning Representations. (ICLR), (Kigali: ICLR).

Zang, H., Wang, Y., Peng, Y., Han, S., Zhao, Q., Zhang, J., et al. (2025). Automatic detection and counting of wheat seedling based on unmanned aerial vehicle images. Front. Plant Sci. 16, 1665672. doi: 10.3389/fpls.2025.1665672

Zhang, M., Shi, Z., Zhang, S., and Gao, J. (2022). A database on mycorrhizal traits of Chinese medicinal plants. Front. Plant Sci. 13, 840343. doi: 10.3389/fpls.2022.840343

Zhou, D. W., Wang, F. Y., Ye, H. J., Ma, L., Pu, S., and Zhan, D. C. (2022). “Forward compatible few-shot class-incremental learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (CVPR), (New Orleans: IEEE), 9046–9056.

Zhou, Y., Zhu, H., Xu, C., Zhang, R., Hua, G., and Yang, W. (2024). Class-incremental novel category discovery in remote sensing image scene classification via contrastive learning. IEEE J. Stars. 17, 9214–9225. doi: 10.1109/JSTARS.2024.3391512

Keywords: contrastive learning, fine-grained few-shot incremental learning, frequency-aware, identification, medicinal plant

Citation: Tan C, Qin Z, Tang Z, Huang Y and Li K (2026) Fine-grained few-shot class-incremental identification of medicinal plants via frequency-aware contrastive learning. Front. Plant Sci. 17:1730047. doi: 10.3389/fpls.2026.1730047

Received: 22 October 2025; Revised: 06 January 2026; Accepted: 26 January 2026;
Published: 13 February 2026.

Edited by:

Sathishkumar Samiappan, The University of Tennessee, United States

Reviewed by:

Panagiotis Madesis, University of Thessaly, Greece
Long Chen, Chinese Academy of Forestry, China

Copyright © 2026 Tan, Qin, Tang, Huang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ke Li, likescu@scu.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.