<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1301192</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Feature fusion network based on few-shot fine-grained classification</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yang</surname> <given-names>Yajie</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2517255/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Feng</surname> <given-names>Yuxuan</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zhu</surname> <given-names>Li</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2521983/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Fu</surname> <given-names>Haitao</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Pan</surname> <given-names>Xin</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Jin</surname> <given-names>Chenlei</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
</contrib-group>
<aff><institution>College of Information Technology, Jilin Agriculture University</institution>, <addr-line>Changchun</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Michal Wozniak, Wroc&#x00142;aw University of Science and Technology, Poland</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Mariusz Topolski, Faculty of Information and Communication Technology, Poland; Marcin Wozniak, Silesian University of Technology, Poland</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Li Zhu <email>jolielang&#x00040;163.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>09</day>
<month>11</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1301192</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>09</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>10</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Yang, Feng, Zhu, Fu, Pan and Jin.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Yang, Feng, Zhu, Fu, Pan and Jin</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>The objective of few-shot fine-grained learning is to identify subclasses within a primary class using a limited number of labeled samples. However, many current methodologies rely on a single feature metric, either global or local. In fine-grained image classification tasks, where the inter-class distance is small and the intra-class distance is large, relying on a single similarity measure can lead to the omission of either inter-class or intra-class information. We delve into inter-class information through global measures and tap into intra-class information via local measures. In this study, we introduce the Feature Fusion Similarity Network (FFSNet). This model employs global measures to accentuate the differences between classes, while utilizing local measures to consolidate intra-class data. Such an approach enables the model to learn features with enlarged inter-class distances and reduced intra-class distances, even with a limited dataset of fine-grained images, greatly enhancing the model&#x00027;s generalization capabilities. Our experimental results demonstrate that the proposed paradigm holds its ground against state-of-the-art models across multiple established fine-grained image benchmark datasets.</p></abstract>
<kwd-group>
<kwd>few-shot classification</kwd>
<kwd>fine-grained classification</kwd>
<kwd>similarity measurement</kwd>
<kwd>inter-class distinctiveness</kwd>
<kwd>intra-class compactness</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="4"/>
<equation-count count="11"/>
<ref-count count="43"/>
<page-count count="10"/>
<word-count count="6179"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Deep learning models have achieved remarkable results in the realm of visual recognition, spanning tasks like image and text classification (LeCun et al., <xref ref-type="bibr" rid="B15">2015</xref>; Dvornik et al., <xref ref-type="bibr" rid="B4">2019</xref>; Lin et al., <xref ref-type="bibr" rid="B23">2020</xref>). However, in real-world settings, the efficacy of these models often hinges on the presence of vast amounts of training data (Simonyan and Zisserman, <xref ref-type="bibr" rid="B34">2015</xref>; Gu et al., <xref ref-type="bibr" rid="B8">2017</xref>; Lin et al., <xref ref-type="bibr" rid="B25">2017b</xref>). For certain categories, only a handful of labeled instances might be available. Yet, humans can rapidly acquire knowledge with limited data (Li et al., <xref ref-type="bibr" rid="B18">2020</xref>). Drawing inspiration from this human-centric learning approach, the concept of few-shot learning (Jankowski et al., <xref ref-type="bibr" rid="B10">2011</xref>) has been introduced to align machine learning more closely with human cognition.</p>
<p>In recent years, a plethora of methods have emerged in the field of few-shot learning. Broadly, these can be categorized into meta-learning methods (Lake et al., <xref ref-type="bibr" rid="B14">2017</xref>; Ravi and Larochelle, <xref ref-type="bibr" rid="B30">2017</xref>; Nichol et al., <xref ref-type="bibr" rid="B28">2018</xref>; Rusu et al., <xref ref-type="bibr" rid="B31">2019</xref>) and metric learning methods (Vinyals et al., <xref ref-type="bibr" rid="B37">2016</xref>; Li et al., <xref ref-type="bibr" rid="B20">2017</xref>; Snell et al., <xref ref-type="bibr" rid="B35">2017</xref>). Meta-learning-based approaches concentrate on sampling learners from the distribution for each task or episode. They execute the optimizer or conduct several unfolded weight updates in parallel to adapt the model specifically for the task at hand. On the other hand, metric learning methods prioritize embedding both support and query samples into a common feature space to gauge feature similarity (Vinyals et al., <xref ref-type="bibr" rid="B37">2016</xref>). Among these, metric learning stands out for its simplicity, ease in introducing new categories, and capability for incremental learning. It has shown exemplary performance on fine-grained images (Lin et al., <xref ref-type="bibr" rid="B24">2017a</xref>; Li et al., <xref ref-type="bibr" rid="B16">2019a</xref>).</p>
<p>Fine-grained few-shot learning (Chen et al., <xref ref-type="bibr" rid="B2">2020</xref>; Shermin et al., <xref ref-type="bibr" rid="B33">2021</xref>; Wang et al., <xref ref-type="bibr" rid="B39">2021</xref>) frequently utilizes datasets such as Stanford-Cars (Krause et al., <xref ref-type="bibr" rid="B13">2013</xref>), Stanford-Dogs (Khosla et al., <xref ref-type="bibr" rid="B11">2013</xref>), FGVC-Aircraft (Maji et al., <xref ref-type="bibr" rid="B27">2013</xref>), and CUB-200-2011 (Wah et al., <xref ref-type="bibr" rid="B38">2011</xref>). Each of these datasets consists of numerous subcategories, with only a handful of images per subclass. Given the marked similarities between classes within fine-grained datasets, the crux of fine-grained image classification revolves around pinpointing local areas exhibiting nuanced differences (Wah et al., <xref ref-type="bibr" rid="B38">2011</xref>). Consequently, efficiently detecting foreground objects and identifying critical local area information has emerged as a pivotal challenge in the realm of fine-grained image classification algorithms (Zhang et al., <xref ref-type="bibr" rid="B42">2017</xref>; Li et al., <xref ref-type="bibr" rid="B17">2019b</xref>; Ma et al., <xref ref-type="bibr" rid="B26">2019</xref>).</p>
<p>Few-shot metric learning methods, favored for their simplicity, typically employ a single similarity measure. Traditional metric learning techniques, like matching networks and relation networks, often depend on whole-image features for recognition. This approach, however, is not particularly effective for fine-grained image classification tasks, where the inter-class distances are small and the intra-class distances are large. DN4 and TDSNet (Chang et al., <xref ref-type="bibr" rid="B1">2020</xref>) have introduced deep local descriptors, shifting the focus toward local feature information. Nevertheless, relying on a single similarity measure can introduce particular similarity biases, especially when training data are limited, which undermines the model&#x00027;s generalization capability. As illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>, merging global and local features can amplify inter-class differences and condense intra-class variances, enhancing accuracy in fine-grained image classification.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Motivation diagram. <bold>(A)</bold> In the diagram, the exploration of global information aims to maximize the separation distance between different classes; <bold>(B)</bold> in the diagram, the exploration of local information directs the network to concentrate on crucial local regions and assigns varying weights to make intra-class information more compact; ultimately, global and local information are combined for class determination.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1301192-g0001.tif"/>
</fig>
<p>In this paper, we introduce a Feature Fusion Similarity Network designed to assess global and local similarities within each task generated by the meta-training dataset. It fully leverages the global invariant features and local detailed features present in the images. The Feature Fusion Similarity Network comprises three modules, as illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>. The initial module is a convolution-based embedding module responsible for generating feature information for both query and support images and subsequently assessing similarity through global and local modules. During the meta-training phase, the total loss is computed as the sum of the global and local losses. Extensive experiments have been conducted to showcase the performance of the proposed Feature Fusion Similarity Network. Our contributions can be summarized as follows: (1) We introduce a novel few-shot fine-grained framework that incorporates global and local features, enabling the extraction of fine-grained image feature information during the meta-learning training process. (2) The fusion of global and local features not only explicitly constructs crucial connections between different parts of fine-grained objects but also implicitly captures fine-grained details with discriminative characteristics. (3) We conducted comprehensive experiments on four prominent fine-grained datasets to demonstrate the effectiveness of our proposed approach. The results of these experiments affirm the high competitiveness of our proposed model.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Schematic of the FFSNet framework. The support set initially passes through a data augmentation module and is then combined with the query set in the embedding module. Following the fusion of global information extraction and local important area feature weighting, the resulting features are utilized for classification purposes.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1301192-g0002.tif"/>
</fig>
</sec>
<sec id="s2">
<title>2. Related work</title>
<p>Fine-Grained Visual Recognition (FGVR) has consistently ranked among the most active research domains in computer vision, presenting a set of significant challenges. The strong performance of deep learning techniques on large datasets with labeled information has been noteworthy. For instance, Qi et al. (<xref ref-type="bibr" rid="B29">2022</xref>) employed a bilinear network to capture local distinctions among different subordinate classes, enhancing discriminative capabilities. Fu et al. (<xref ref-type="bibr" rid="B6">2017</xref>) demonstrated the feasibility of employing an optimization scheme to train attention proposal networks and region-based classifiers when dealing with related tasks. Lin et al. (<xref ref-type="bibr" rid="B22">2015</xref>) introduced the Diversified Visual Attention Network (DVAN), with a particular emphasis on diversifying attention mechanisms to better capture crucial discriminative information.</p>
<p>Building upon the foundation of fine-grained image recognition, few-shot fine-grained image recognition has garnered significant attention (Zhao et al., <xref ref-type="bibr" rid="B43">2017</xref>). This approach enables models to recognize new fine-grained categories with minimal labeled information. Few-shot fine-grained image recognition can be categorized into methods based on meta-learning and methods based on metric learning.</p>
<p>Meta-learning-based methods: In Rusu et al. (<xref ref-type="bibr" rid="B31">2019</xref>), a model adapts to new episodic tasks by generating parameter updates through a recursive meta-learner. MAML (Finn et al., <xref ref-type="bibr" rid="B5">2017</xref>) and its variants (Lake et al., <xref ref-type="bibr" rid="B14">2017</xref>; Ravi and Larochelle, <xref ref-type="bibr" rid="B30">2017</xref>) have demonstrated that optimizing the parameters of the learner model enables it to quickly adapt to specific tasks. However, Lake et al. (<xref ref-type="bibr" rid="B14">2017</xref>) pointed out that while these methods iteratively handle samples from all classes in task updates, they often struggle to learn effective embedding representations. To address this issue, one approach involves updating the weights only for the top layer and applying them to the &#x0201C;inner loop.&#x0201D; Specifically, the top-layer weights can be initialized by sampling from a generative distribution conditioned on task samples and then pretraining on visual features during the initial supervised phase. In contrast, metric-learning-based methods have achieved considerable success in learning high-quality features.</p>
<p>Metric-learning-based methods: Metric learning methods primarily focus on learning a rich similarity metric. In Snell et al. (<xref ref-type="bibr" rid="B35">2017</xref>), the concept of episode training was introduced in few-shot learning, while prototype networks (Li et al., <xref ref-type="bibr" rid="B20">2017</xref>) determine the category of a query image by comparing its distance to class prototypes in the support set, inspired by Wei et al. (<xref ref-type="bibr" rid="B40">2019</xref>). Relation networks (Sung et al., <xref ref-type="bibr" rid="B36">2018</xref>) utilize neural networks with cascaded feature embeddings to assess the relationship between each query-support set pair. DN4 (Li et al., <xref ref-type="bibr" rid="B16">2019a</xref>) employs K-nearest neighbors to create an image-to-class search space using local representations. Lifchitz et al. (<xref ref-type="bibr" rid="B21">2019</xref>) directly predict the classification of each local representation and compute the loss. Our method combines global and local metrics. Unlike relation networks, we integrate local information while considering global relationships. Through local measurements, we effectively compare two objects rather than two images, which enhances the ease and effectiveness of the process. By merging global and local features, our algorithm has demonstrated satisfactory results.</p>
</sec>
<sec id="s3">
<title>3. Approach</title>
<sec>
<title>3.1. Problem definition</title>
<p>Few-shot classification is often formulated as an <italic>N</italic> &#x02212; <italic>way</italic> <italic>K</italic> &#x02212; <italic>shot</italic> classification problem: a small labeled dataset <italic>S</italic>, known as the support set, comprises <italic>N</italic> categories of images, with each category containing <italic>K</italic> labeled sample images, where <italic>K</italic> typically ranges from 1 to 10. Each sample in a given query set <italic>Q</italic> is an unlabeled instance awaiting classification. The objective of few-shot classification is to classify the unlabeled samples in the query set <italic>Q</italic> using the limited information available in the support set <italic>S</italic>.</p>
<p>To address this issue, Vinyals et al. (<xref ref-type="bibr" rid="B37">2016</xref>) introduced an episode training approach. Given a task <italic>T</italic> &#x0003D; {<italic>D</italic><sub><italic>s</italic></sub>, <italic>D</italic><sub><italic>q</italic></sub>}, it comprises a support set <inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> (where <italic>x</italic><sub><italic>i</italic></sub> represents an image in the support set, and <italic>y</italic><sub><italic>i</italic></sub> its label) and a query set <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula>. The support set consists of a total of <italic>N</italic><sub><italic>S</italic></sub> &#x0003D; <italic>N</italic> &#x000D7; <italic>K</italic> labeled sample images, while the query set <italic>D</italic><sub><italic>q</italic></sub> contains <italic>N</italic><sub><italic>q</italic></sub> unlabeled sample images. In each episode, training is carried out by iteratively constructing the support set <italic>D</italic><sub><italic>s</italic></sub> and the query set <italic>D</italic><sub><italic>q</italic></sub> through random sampling, and the same sampling pattern is followed during testing. The trained model can then recognize each sample in the query set <italic>Q</italic> by utilizing the constructed support set <italic>S</italic>; the training and testing class sets are disjoint, <italic>C</italic><sub><italic>train</italic></sub> &#x02229; <italic>C</italic><sub><italic>test</italic></sub> &#x0003D; &#x02205;.</p>
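The episodic sampling described above can be sketched as follows. This is a minimal NumPy sketch, not code from the paper: the dataset layout (a dict mapping each class label to a list of image arrays) and all names are illustrative.

```python
import numpy as np

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15, rng=None):
    """Build one N-way K-shot episode from a dict: class label -> list of images.

    Returns (support, query) lists of (image, episode_label) pairs, where
    episode_label is re-indexed to 0..n_way-1 for this episode only.
    """
    rng = rng or np.random.default_rng()
    # Draw N distinct classes for this episode.
    classes = rng.choice(list(dataset.keys()), size=n_way, replace=False)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw K support images and n_query query images, without overlap.
        idx = rng.choice(len(dataset[cls]), size=k_shot + n_query, replace=False)
        images = [dataset[cls][i] for i in idx]
        support += [(img, episode_label) for img in images[:k_shot]]
        query += [(img, episode_label) for img in images[k_shot:]]
    return support, query  # |support| = N*K, |query| = N*n_query
```

Testing then reuses the same routine over the held-out classes, matching the requirement that train and test episodes share the sampling pattern while their class sets stay disjoint.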
</sec>
<sec>
<title>3.2. Feature fusion similarity network</title>
<p>The overall model framework is depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>. Firstly, data augmentation is applied to the support set, which is then fed along with the query set into the embedding module <italic>f</italic><sub>&#x003B8;</sub>. This embedding module can either be a simple convolutional layer or a ResNet (He et al., <xref ref-type="bibr" rid="B9">2016</xref>), used for extracting feature information to obtain image feature vectors. These image feature vectors serve as inputs for the Relation Matrix <italic>M</italic> Module, Attention Weight Module, Relation Matrix <italic>M</italic>_<italic>A</italic> Module, and Relation Module. The Relation Matrix <italic>M</italic> module constructs a local description space and calculates the similarity between the query image and the corresponding local region in the support set. The Attention Weight Module reweights the locally significant areas and, utilizing a <italic>K</italic>-Nearest-Neighbors classifier (<italic>K</italic> &#x02212; <italic>NN</italic> for short), calculates the similarity of the query image to each category in the support set, effectively reducing noise. The Relation Matrix <italic>M</italic>_<italic>A</italic> Module serves as a refinement of the Attention Weight Module. Its purpose is to assess the correctness of the locally important areas and to fuse and calculate similarities with the feature map obtained from the Relation Matrix <italic>M</italic> Module. The Relation Module concatenates the feature vector corresponding to each query set image with the feature vector of each image in the support set to calculate their similarity. These two similarities are then combined to obtain class information.</p>
<sec>
<title>3.2.1. Data augmentation module</title>
<p>In the data augmentation module, we found that applying extensive transformations to the original images reduced classification accuracy during training and caused substantial information loss in smaller data domains. We therefore opted for horizontal flipping and contrast adjustment, which perturb the data without sacrificing information, ensuring consistency between training and testing.</p>
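The two mild perturbations described above can be sketched in a few lines of NumPy. The contrast-factor range is an illustrative assumption, not a value stated in the text:

```python
import numpy as np

def augment(img, rng):
    """Lightly perturb an image of shape (H, W, C) with values in [0, 1]:
    a random horizontal flip plus a random contrast adjustment about the
    mean intensity. Neither operation discards spatial information."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]            # horizontal flip
    factor = rng.uniform(0.6, 1.4)       # contrast factor (illustrative range)
    out = np.clip((out - out.mean()) * factor + out.mean(), 0.0, 1.0)
    return out
```

Because both operations are label-preserving and bounded, the augmented support images stay in the same domain as the unaugmented query images.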
</sec>
<sec>
<title>3.2.2. Feature embedding module</title>
<p>We chose four convolutional blocks as the embedding module of the network to extract feature information. Each convolutional block consists of a 3 &#x000D7; 3 convolution with 64 filters, followed by batch normalization and a ReLU activation, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Structural framework of the embedding module, which is responsible for extracting feature information from both the support set and the query set.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1301192-g0003.tif"/>
</fig>
<p>The extracted <italic>f<sub>&#x003B8;</sub></italic>(<italic>S</italic>) and <italic>f<sub>&#x003B8;</sub></italic>(<italic>Q</italic>) from the embedding module are represented as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext>Input</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext>Input</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>N</italic>&#x000B7;<italic>K</italic> represents the total number of samples in the support set, with <italic>N</italic> classes and <italic>K</italic> samples per class. <italic>N</italic> represents the number of classes, <italic>K</italic> represents the number of samples per class in the support set and the query set, and <italic>C</italic>, <italic>H</italic>, <italic>W</italic> respectively represent the number of input channels, height, and width.</p>
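A minimal PyTorch sketch of the four-block embedding module described above. The 2 &#x000D7; 2 max-pooling after each block is an assumption (common in Conv-64F backbones) since the text only specifies the convolution, batch normalization, and ReLU:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution with 64 filters, batch norm, ReLU,
    # then 2x2 max-pooling (assumed, halves H and W).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class EmbeddingModule(nn.Module):
    """Four stacked convolutional blocks: (B, 3, H, W) -> (B, 64, H1, W1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64),
            conv_block(64, 64), conv_block(64, 64),
        )

    def forward(self, x):
        return self.net(x)
```

With 84 &#x000D7; 84 inputs, a standard size in few-shot benchmarks, this yields 64-channel feature maps of spatial size 5 &#x000D7; 5, i.e. H&#x000D7;W = 25 local descriptors per image.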
</sec>
<sec>
<title>3.2.3. Relation matrix <italic>M</italic> module</title>
<p>Recent research on DN4 and TDSNet has shown that features based on local descriptors are more distinctive than global features. Specifically, local descriptors have the ability to capture subtle details in the local areas of an image. After being extracted by the Feature Embedding module into <inline-formula><mml:math id="M7"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, a feature map can be represented as <italic>m</italic> &#x0003D; (<italic>H</italic> &#x000D7; <italic>W</italic>) individual <italic>C</italic>-dimensional local descriptors. The support set can be represented as <inline-formula><mml:math id="M8"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and similarly, the query set is <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Next, we construct the relationship matrix <italic>M</italic> for local similarity calculation.</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>O</mml:mi><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>COS</italic>(&#x000B7;) denotes cosine similarity, and each row of the relation matrix <italic>M</italic> represents the similarity between one local area of the query image and each local area of the support set.</p>
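<p>The construction of <italic>M</italic> in Equation (3) can be sketched as follows (an illustrative NumPy sketch; the function name and toy descriptor shapes are our assumptions):</p>

```python
import numpy as np

def relation_matrix(query_feat, support_feat):
    """Cosine similarity between every local descriptor of a query image
    and every local descriptor of a support class (Equation 3).

    query_feat:   (m, C) array, m = H*W local descriptors
    support_feat: (m, C) array of one support class's descriptors
    """
    q = query_feat / np.linalg.norm(query_feat, axis=1, keepdims=True)
    s = support_feat / np.linalg.norm(support_feat, axis=1, keepdims=True)
    return q @ s.T  # (m, m): row i = similarities of query region i

# toy descriptors: m = 4 local regions, C = 8 channels
rng = np.random.default_rng(0)
M = relation_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(M.shape)  # (4, 4)
```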
</sec>
<sec>
<title>3.2.4. Attention weight module</title>
<p>The <italic>M</italic><sup><italic>A</italic></sup> module takes the query set features and support set features as input and directly generates spatial attention to locate the support-class objects in the query image, thereby creating an attention map on the query image. In this module, we pass the features <italic>f</italic><sub>&#x003B8;</sub>(&#x000B7;) extracted by the Feature Embedding module through a 1 &#x000D7; 1 convolution block in order to reweight the local areas. When calculating local similarity, multiple duplicate areas in the query image that all overlap with the support set can lead to misjudgment during discrimination. This module therefore disperses attention and focuses on the relatively small but discriminative parts among the local descriptors, effectively reweighting them. Using the method of the Relation Matrix <italic>M</italic> Module, an Attention Matrix <italic>M</italic><sup><italic>A</italic></sup> is constructed. A threshold is then selected from the smallest values of the relation matrix via the <italic>TOP</italic> &#x02212; <italic>K</italic> step of the <italic>K</italic> &#x02212; <italic>NN</italic> (<italic>K</italic>-Nearest Neighbors) algorithm to suppress noise.</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>I</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E5"><label>(5)</label><mml:math id="M12"><mml:mrow><mml:mi>I</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>&#x0003E;</mml:mo><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <italic>&#x003B2;</italic> is selected by the <italic>K</italic> &#x02212; <italic>NN</italic> (<italic>K</italic>-Nearest Neighbors) algorithm from the three smallest values of the relation matrix <italic>M</italic>.</p>
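<p>A minimal sketch of Equations (4) and (5) follows. Since the exact rule for deriving <italic>&#x003B2;</italic> from the smallest values is not spelled out, the selection used below (the maximum of the <italic>k</italic> smallest entries) is one plausible reading, not the paper's definitive implementation:</p>

```python
import numpy as np

def attention_matrix(M, k=3):
    """Sketch of Equations (4)-(5): zero out weak similarities.

    Assumption: beta is taken from the k smallest entries of M (here,
    their maximum), and entries not exceeding beta are suppressed to 0.
    """
    beta = np.sort(M, axis=None)[k - 1]  # k-th smallest entry of M
    return np.where(M > beta, M, 0.0)

M = np.array([[0.9, 0.1], [0.05, 0.7]])
MA = attention_matrix(M, k=2)
print(MA)  # the two smallest entries (0.05 and 0.1) are zeroed
```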
</sec>
<sec>
<title>3.2.5. Relation matrix <italic>M</italic><sub>&#x02212;</sub><italic>A</italic> module</title>
<p>Using the weight matrix <italic>M</italic><sup><italic>A</italic></sup>, we compute its element-wise multiplication with the relation matrix <italic>M</italic>, which retains the regions that are crucial for classification. Finally, the local similarity score between the query image <italic>X</italic><sub><italic>q</italic></sub> and the support class <italic>X</italic><sub><italic>s</italic></sub> is computed by applying the attention map to the similarity matrix <italic>M</italic>, as shown below:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo></mml:mrow></mml:msub><mml:mi>A</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:mi>M</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>i</italic> and <italic>j</italic> are the row and column indices of the matrix and <italic>H</italic>&#x000B7;<italic>W</italic> is the number of local descriptors.</p>
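<p>Equation (6) then reduces to summing the reweighted relation matrix, as in this sketch (the all-ones matrices are toy inputs; the 1/<italic>HW</italic> normalization follows the formula):</p>

```python
import numpy as np

def local_similarity_score(M, MA):
    """Equation (6): sum of the element-wise product of the attention
    matrix M^A and the relation matrix M, normalized by H*W."""
    HW = M.shape[0]  # number of local descriptors
    return (MA * M).sum() / HW

# toy example: all-ones matrices with HW = 2
M = np.ones((2, 2))
MA = np.ones((2, 2))
print(local_similarity_score(M, MA))  # 2.0
```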
</sec>
<sec>
<title>3.2.6. Relation module</title>
<p>In the relation network, the relation module concatenates the feature vector <italic>f</italic><sub>&#x003B8;</sub>(<italic>x</italic><sub><italic>q</italic></sub>) of the query image with the feature information <italic>f</italic><sub>&#x003B8;</sub>(<italic>x</italic><sub><italic>n</italic></sub>) of each image in the support set.</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where || represents the concatenation operator. The similarity module <italic>g</italic><sub>&#x003C6;</sub> consists of two 3 &#x000D7; 3 convolution blocks, each followed by a 2 &#x000D7; 2 max-pooling layer, and a final fully connected layer.</p>
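<p>The concatenation operator || in Equation (7) can be illustrated as follows (the 64-channel, 19 &#x000D7; 19 feature-map shape is an assumption for illustration, not taken from the paper):</p>

```python
import numpy as np

# Hypothetical feature maps with C = 64 channels and 19x19 spatial size.
f_support = np.zeros((64, 19, 19))  # f_theta(x_n), one support image
f_query = np.zeros((64, 19, 19))    # f_theta(x_q), the query image

# The || operator in Equation (7): concatenation along the channel
# axis, producing the 128-channel input to the similarity module g_phi.
pair = np.concatenate([f_support, f_query], axis=0)
print(pair.shape)  # (128, 19, 19)
```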
<p>The final formula for the overall prediction score is:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>S</italic><sub>1</sub> represents the score of the Relation Matrix <italic>M</italic><sub>&#x02212;</sub><italic>A</italic> Module, and <italic>S</italic><sub>2</sub> represents the score of the Relation Module.</p>
</sec>
<sec>
<title>3.2.7. Loss function</title>
<p>In the Feature Fusion Similarity Network, we obtain two prediction results, <italic>y</italic><sup>1</sup> and <italic>y</italic><sup>2</sup>, through the local module and global module, respectively. Then, we calculate the losses <italic>L</italic><sub>1</sub> and <italic>L</italic><sub>2</sub> between these predictions and the true values.</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E10"><label>(10)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>q</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>y</italic><sup>1</sup> and <italic>y</italic><sup>2</sup> represent the labels predicted by the local and global modules, respectively, and <italic>y</italic> represents the true label.</p>
<p>Because the network architecture we have designed combines local and global information, after obtaining the loss <italic>L</italic><sub>1</sub> of the local module and the loss <italic>L</italic><sub>2</sub> of the global module separately, we need to add them together to calculate the overall loss of the model:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>L</italic><sub>1</sub> represents the loss function of the Relation Matrix <italic>M</italic><sub>&#x02212;</sub><italic>A</italic> Module, and <italic>L</italic><sub>2</sub> represents the loss function of the Relation Module.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Experiments</title>
<sec>
<title>4.1. Experiments settings</title>
<p>The experimental datasets used in this paper consist of four common datasets for few-shot classification. Among them, three datasets are frequently employed in fine-grained image classification in the context of few-shot learning: Stanford-Dogs, Stanford-Cars, and CUB-200-2011. To assess the generalizability of our model, we also conducted experiments on the widely used few-shot classification dataset Mini-ImageNet.</p>
<p>The Stanford-Dogs dataset comprises 120 categories of dogs, totaling 20,580 images. This dataset divides the 120 categories into training, validation, and test sets, consisting of 60, 30, and 30 classes, respectively. The Stanford-Cars dataset consists of 196 car classes, with a total of 16,185 images. It divides these 196 classes into training, validation, and test sets, consisting of 98, 49, and 49 classes, respectively. CUB-200-2011 includes 200 bird species and has a total of 11,788 images. This dataset divides its 200 classes into training, validation, and test sets, comprising 100, 50, and 50 classes, respectively. Mini-ImageNet is a subset of ImageNet (Deng et al., <xref ref-type="bibr" rid="B3">2009</xref>), containing 100 classes, with 600 images per class, totaling 60,000 images. Mini-ImageNet&#x00027;s 100 classes are divided into training, validation, and test sets, consisting of 64, 16, and 20 classes, respectively, following the method described in Ravi and Larochelle (<xref ref-type="bibr" rid="B30">2017</xref>). The total number of samples in the training, validation, and test sets of the above datasets can be found in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Data set partition table.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Datasets</bold></th>
<th valign="top" align="center"><bold>Divide</bold></th>
<th valign="top" align="center"><bold>Number of classes</bold></th>
<th valign="top" align="center"><bold>Sample number</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td valign="top" align="center">Train</td>
<td valign="top" align="center">60</td>
<td valign="top" align="center">10,337</td>
</tr>
<tr>
<td valign="top" align="left">Stanford-Dogs</td>
<td valign="top" align="center">Val</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">5,128</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Test</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">5,115</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Train</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">8,203</td>
</tr>
<tr>
<td valign="top" align="left">Stanford-Cars</td>
<td valign="top" align="center">Val</td>
<td valign="top" align="center">49</td>
<td valign="top" align="center">4,004</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Test</td>
<td valign="top" align="center">49</td>
<td valign="top" align="center">3,978</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Train</td>
<td valign="top" align="center">100</td>
<td valign="top" align="center">4,719</td>
</tr>
<tr>
<td valign="top" align="left">CUB-200-2011</td>
<td valign="top" align="center">Val</td>
<td valign="top" align="center">50</td>
<td valign="top" align="center">4,715</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Test</td>
<td valign="top" align="center">50</td>
<td valign="top" align="center">2,953</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Train</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">38,400</td>
</tr>
<tr>
<td valign="top" align="left">Mini-ImageNet</td>
<td valign="top" align="center">Val</td>
<td valign="top" align="center">16</td>
<td valign="top" align="center">9,600</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Test</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">12,000</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>4.1.1. Evaluation metrics</title>
<p>We reported the average accuracy (%) for 600 randomly generated episodes, along with the 95% confidence interval on the test set, following the approach commonly used in most methods (Jankowski et al., <xref ref-type="bibr" rid="B10">2011</xref>; Nichol et al., <xref ref-type="bibr" rid="B28">2018</xref>; Xu et al., <xref ref-type="bibr" rid="B41">2022</xref>). Our model was trained end-to-end, without any pre-training process.</p>
</sec>
<sec>
<title>4.1.2. Implementation details</title>
<p>We conducted experiments on four datasets: Stanford-Dogs, Stanford-Cars, CUB-200-2011, and Mini-ImageNet, using two settings, namely 5-Way 1-shot and 5-Way 5-shot. There were 15 query samples per class, and input images were resized to 84 &#x000D7; 84. We randomly sampled 60,000 and 40,000 tasks under the 1-shot and 5-shot experimental settings, respectively, to iteratively train the model. During training, we utilized the Adam optimizer (Kingma and Ba, <xref ref-type="bibr" rid="B12">2017</xref>) to optimize the model parameters with mean squared error loss, setting an initial learning rate of 0.001 and a weight decay rate of 0.</p>
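<p>The episodic task sampling described above can be sketched as follows (an illustrative sketch; the function and the toy dataset are our assumptions, not the paper's code):</p>

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=15):
    """Sketch of episodic sampling for one training task: pick n_way
    classes, then k_shot support and n_query query images per class."""
    classes = random.sample(sorted(class_to_images), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(class_to_images[cls], k_shot + n_query)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query

# toy dataset: 10 classes with 20 dummy image ids each
data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
sup, qry = sample_episode(data)
print(len(sup), len(qry))  # 5 75
```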
</sec>
</sec>
<sec>
<title>4.2. Performance comparison</title>
<sec>
<title>4.2.1. Few-shot fine-grained image classification</title>
<p>We conducted classification experiments using the proposed method on three fine-grained image classification datasets: Stanford-Dogs, Stanford-Cars, and CUB-200-2011, with 5-Way 1-shot and 5-Way 5-shot experiments. We compared the experimental results with the current mainstream methods, and the results are presented in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison with typical FSL methods on three fine-grained datasets.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center" colspan="6"><bold>5-Way accuracy (</bold><italic><bold>%</bold></italic><bold>)</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th valign="top" align="center" colspan="2"><bold>Stanford-Dogs</bold></th>
<th valign="top" align="center" colspan="2"><bold>Stanford-Cars</bold></th>
<th valign="top" align="center" colspan="2"><bold>CUB-200-2011</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th valign="top" align="center"><bold>1-shot</bold></th>
<th valign="top" align="center"><bold>5-shot</bold></th>
<th valign="top" align="center"><bold>1-shot</bold></th>
<th valign="top" align="center"><bold>5-shot</bold></th>
<th valign="top" align="center"><bold>1-shot</bold></th>
<th valign="top" align="center"><bold>5-shot</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Matching Net (Vinyals et al., <xref ref-type="bibr" rid="B37">2016</xref>)</td>
<td valign="top" align="center">45.05 &#x000B1; 0.66</td>
<td valign="top" align="center">59.60 &#x000B1; 0.73</td>
<td valign="top" align="center">45.29 &#x000B1; 0.82</td>
<td valign="top" align="center">64.00 &#x000B1; 0.74</td>
<td valign="top" align="center">55.65 &#x000B1; 0.38</td>
<td valign="top" align="center">72.60 &#x000B1; 0.45</td>
</tr>
<tr>
<td valign="top" align="left">Prototype Net (Snell et al., <xref ref-type="bibr" rid="B35">2017</xref>)</td>
<td valign="top" align="center">39.05 &#x000B1; 0.66</td>
<td valign="top" align="center">60.23 &#x000B1; 0.22</td>
<td valign="top" align="center">36.38 &#x000B1; 0.52</td>
<td valign="top" align="center">63.84 &#x000B1; 0.85</td>
<td valign="top" align="center">51.52 &#x000B1; 0.95</td>
<td valign="top" align="center">70.21 &#x000B1; 0.38</td>
</tr>
<tr>
<td valign="top" align="left">Relation Net (Sung et al., <xref ref-type="bibr" rid="B36">2018</xref>)</td>
<td valign="top" align="center">47.11 &#x000B1; 0.90</td>
<td valign="top" align="center">64.56 &#x000B1; 0.74</td>
<td valign="top" align="center">45.83 &#x000B1; 0.86</td>
<td valign="top" align="center">68.01 &#x000B1; 0.78</td>
<td valign="top" align="center">62.67 &#x000B1; 0.98</td>
<td valign="top" align="center">76.94 &#x000B1; 0.66</td>
</tr>
<tr>
<td valign="top" align="left">MAML (Finn et al., <xref ref-type="bibr" rid="B5">2017</xref>)</td>
<td valign="top" align="center">46.67 &#x000B1; 0.87</td>
<td valign="top" align="center">62.56 &#x000B1; 0.80</td>
<td valign="top" align="center">48.37 &#x000B1; 0.81</td>
<td valign="top" align="center">65.41 &#x000B1; 0.77</td>
<td valign="top" align="center">55.92 &#x000B1; 0.94</td>
<td valign="top" align="center">72.09 &#x000B1; 0.76</td>
</tr>
<tr>
<td valign="top" align="left">PCM (Wei et al., <xref ref-type="bibr" rid="B40">2019</xref>)</td>
<td valign="top" align="center">28.78 &#x000B1; 0.95</td>
<td valign="top" align="center">46.92 &#x000B1; 0.85</td>
<td valign="top" align="center">29.63 &#x000B1; 0.65</td>
<td valign="top" align="center">52.28 &#x000B1; 0.78</td>
<td valign="top" align="center">42.10 &#x000B1; 0.35</td>
<td valign="top" align="center">62.48 &#x000B1; 0.35</td>
</tr>
<tr>
<td valign="top" align="left">CovaMNet (Li et al., <xref ref-type="bibr" rid="B17">2019b</xref>)</td>
<td valign="top" align="center">49.11 &#x000B1; 0.56</td>
<td valign="top" align="center">63.04 &#x000B1; 0.76</td>
<td valign="top" align="center">56.65 &#x000B1; 0.86</td>
<td valign="top" align="center">69.56 &#x000B1; 0.78</td>
<td valign="top" align="center">52.42 &#x000B1; 0.76</td>
<td valign="top" align="center">63.76 &#x000B1; 0.64</td>
</tr>
<tr>
<td valign="top" align="left">DN4 (Li et al., <xref ref-type="bibr" rid="B16">2019a</xref>)</td>
<td valign="top" align="center">45.41 &#x000B1; 0.76</td>
<td valign="top" align="center">63.51 &#x000B1; 0.62</td>
<td valign="top" align="center">57.84 &#x000B1; 0.80</td>
<td valign="top" align="center"><bold>87.47</bold> <bold>&#x000B1;</bold> <bold>0.47</bold></td>
<td valign="top" align="center">46.84 &#x000B1; 0.81</td>
<td valign="top" align="center">74.92 &#x000B1; 0.64</td>
</tr>
<tr>
<td valign="top" align="left">BSNet (Li et al., <xref ref-type="bibr" rid="B19">2021</xref>)</td>
<td valign="top" align="center">50.68 &#x000B1; 0.56</td>
<td valign="top" align="center">67.93 &#x000B1; 0.75</td>
<td valign="top" align="center">54.39 &#x000B1; 0.92</td>
<td valign="top" align="center">72.37 &#x000B1; 0.77</td>
<td valign="top" align="center">65.89 &#x000B1; 0.46</td>
<td valign="top" align="center">78.48 &#x000B1; 0.65</td>
</tr>
<tr>
<td valign="top" align="left">TDSNet (Qi et al., <xref ref-type="bibr" rid="B29">2022</xref>)</td>
<td valign="top" align="center">52.48 &#x000B1; 0.87</td>
<td valign="top" align="center">66.45 &#x000B1; 0.49</td>
<td valign="top" align="center">57.35 &#x000B1; 0.91</td>
<td valign="top" align="center">73.64 &#x000B1; 0.72</td>
<td valign="top" align="center">67.34 &#x000B1; 0.85</td>
<td valign="top" align="center">79.38 &#x000B1; 0.59</td>
</tr>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center"><bold>53.98</bold> <bold>&#x000B1;</bold> <bold>0.96</bold></td>
<td valign="top" align="center"><bold>70.85</bold> <bold>&#x000B1;</bold> <bold>0.75</bold></td>
<td valign="top" align="center"><bold>58.32</bold> <bold>&#x000B1;</bold> <bold>0.62</bold></td>
<td valign="top" align="center">75.68 &#x000B1; 0.76</td>
<td valign="top" align="center"><bold>68.30</bold> <bold>&#x000B1;</bold> <bold>0.90</bold></td>
<td valign="top" align="center"><bold>80.64</bold> <bold>&#x000B1;</bold> <bold>0.64</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values represent the maximum value in each column.</p>
</table-wrap-foot>
</table-wrap>
<p>From <xref ref-type="table" rid="T2">Table 2</xref>, it can be observed that our proposed method achieves the highest classification accuracy in the 1-shot experiments on the datasets Stanford-Dogs, Stanford-Cars, and CUB-200-2011. This is attributed to our method assigning higher weights to significant local information while simultaneously integrating global information to enhance accuracy. In the 5-shot experiment on Stanford-Cars, our classification results are slightly lower than those of DN4. However, when compared to some classic few-shot models such as Matching Net, Prototype Net, Relation Net, and MAML, there is a noticeable improvement. In the 1-shot experiments, we see improvements of 8.93%, 14.93%, 6.87%, and 7.31% on Stanford-Dogs, and 13.03%, 21.94%, 12.49%, and 9.95% on Stanford-Cars. Similarly, in the 1-shot experiment on CUB-200-2011, we observe improvements of 12.65%, 16.78%, 5.63%, and 12.38%. In the 5-shot experiments, our method also demonstrates improvements of 11.25%, 10.62%, 6.29%, and 8.29% on Stanford-Dogs, and 11.68%, 11.84%, 7.67%, and 10.27% on Stanford-Cars. Additionally, in the 5-shot experiment on CUB-200-2011, there are improvements of 8.04%, 10.43%, 3.7%, and 8.55%.</p>
</sec>
<sec>
<title>4.2.2. Few-shot image classification</title>
<p>To further assess the generalization performance of the method proposed in this paper, we conducted 5-Way 1-shot and 5-Way 5-shot classification experiments on the Mini-ImageNet dataset. We compared the experimental results with mainstream few-shot image classification methods to validate the generalization performance of our proposed method. The results are presented in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>The accuracy of few-shot image classification on the Mini-ImageNet dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Type</bold></th>
<th valign="top" align="center" colspan="2"><bold>5-Way accuracy (</bold><italic><bold>%</bold></italic><bold>)</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th/>
<th valign="top" align="center"><bold>1-shot</bold></th>
<th valign="top" align="center"><bold>5-shot</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Meta-Learner LSTM (Ravi and Larochelle, <xref ref-type="bibr" rid="B30">2017</xref>)</td>
<td valign="top" align="center">Meta-learning</td>
<td valign="top" align="center">43.44 &#x000B1; 0.77</td>
<td valign="top" align="center">60.60 &#x000B1; 0.71</td>
</tr>
<tr>
<td valign="top" align="left">MAML (Finn et al., <xref ref-type="bibr" rid="B5">2017</xref>)</td>
<td valign="top" align="center">Meta-learning</td>
<td valign="top" align="center">46.47 &#x000B1; 0.82</td>
<td valign="top" align="center">62.71 &#x000B1; 0.71</td>
</tr>
<tr>
<td valign="top" align="left">Matching net (Vinyals et al., <xref ref-type="bibr" rid="B37">2016</xref>)</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center">48.14 &#x000B1; 0.23</td>
<td valign="top" align="center">63.48 &#x000B1; 0.66</td>
</tr>
<tr>
<td valign="top" align="left">Prototype net (Snell et al., <xref ref-type="bibr" rid="B35">2017</xref>)</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center">44.42 &#x000B1; 0.84</td>
<td valign="top" align="center">64.24 &#x000B1; 0.72</td>
</tr>
<tr>
<td valign="top" align="left">Relation net (Sung et al., <xref ref-type="bibr" rid="B36">2018</xref>)</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center">49.33 &#x000B1; 0.85</td>
<td valign="top" align="center">65.44 &#x000B1; 0.69</td>
</tr>
<tr>
<td valign="top" align="left">Matching nets FCE (Vinyals et al., <xref ref-type="bibr" rid="B37">2016</xref>)</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center">43.56 &#x000B1; 0.96</td>
<td valign="top" align="center">55.31 &#x000B1; 0.73</td>
</tr>
<tr>
<td valign="top" align="left">GNN (Garcia and Bruna, <xref ref-type="bibr" rid="B7">2018</xref>)</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center">50.33 &#x000B1; 0.36</td>
<td valign="top" align="center">65.23 &#x000B1; 0.86</td>
</tr>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">Metric learning</td>
<td valign="top" align="center"><bold>52.37</bold> <bold>&#x000B1;</bold> <bold>0.78</bold></td>
<td valign="top" align="center"><bold>68.19</bold> <bold>&#x000B1;</bold> <bold>0.95</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Bold values represent the maximum value in each column.</p>
</table-wrap-foot>
</table-wrap>
<p>As shown in <xref ref-type="table" rid="T3">Table 3</xref>, the meta-learning methods disregard the training complexity of the model and the challenges associated with model convergence, leading to lower overall classification accuracy. In comparison to MAML, our method exhibits improvements of 5.9 and 5.48% in the 1-shot and 5-shot experiments, respectively, and when compared to Meta-Learner LSTM, it demonstrates improvements of 8.93 and 7.59%. Moreover, in comparison to the metric-based approaches Matching Net, Prototype Net, Relation Net, GNN, and Matching Nets FCE listed in <xref ref-type="table" rid="T3">Table 3</xref>, our method achieves better classification performance in both the 1-shot and 5-shot experiments, underscoring its superior performance on the Mini-ImageNet dataset.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5. Analysis</title>
<sec>
<title>5.1. Ablation experiment</title>
<p>To comprehensively validate the effectiveness of the proposed Feature Fusion Similarity Network, we conducted an ablation experiment on the CUB-200-2011 dataset, using as the Baseline the feature embedding module and relation module presented in this paper. As shown in <xref ref-type="table" rid="T4">Table 4</xref>, the key modules were added to the baseline model incrementally. First, adding the aug (data augmentation) module increased accuracy by 3.35 and 2.93% in the 1-shot and 5-shot experiments, respectively. This improvement is mainly attributed to the data augmentation module enhancing the model&#x00027;s robustness, reducing its sensitivity to image variations, enlarging the training data, improving the model&#x00027;s generalization ability, and mitigating sample imbalance. Next, incorporating the LS (local similarity) module raised accuracy over the baseline by 5.23 and 5.64% in the 1-shot and 5-shot experiments, respectively. This demonstrates that the global module alone struggles to detect subtle local information; including the local module makes the system more sensitive to fine-grained local details. Finally, we verified the effectiveness of the att (attention weight) module, which improved 1-shot and 5-shot accuracy over the baseline by 8.98 and 7.42%, respectively. The att module reweights important local areas, suppresses noise, and enhances the model&#x00027;s performance.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Results of ablative experiments on CUB-200-2011 datasets.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center" colspan="2"><bold>5-Way accuracy (</bold><italic><bold>%</bold></italic><bold>)</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th valign="top" align="center"><bold>1-shot</bold></th>
<th valign="top" align="center"><bold>5-shot</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline</td>
<td valign="top" align="center">59.32 &#x000B1; 0.71</td>
<td valign="top" align="center">73.22 &#x000B1; 0.14</td>
</tr>
<tr>
<td valign="top" align="left">Baseline &#x0002B; aug</td>
<td valign="top" align="center">62.67 &#x000B1; 0.98</td>
<td valign="top" align="center">76.15 &#x000B1; 0.66</td>
</tr>
<tr>
<td valign="top" align="left">Baseline &#x0002B; aug &#x0002B; LS</td>
<td valign="top" align="center">64.55 &#x000B1; 0.89</td>
<td valign="top" align="center">78.86 &#x000B1; 0.37</td>
</tr>
<tr>
<td valign="top" align="left">Baseline &#x0002B; aug &#x0002B; LS &#x0002B; att</td>
<td valign="top" align="center"><bold>68.30</bold> <bold>&#x000B1;</bold> <bold>0.90</bold></td>
<td valign="top" align="center"><bold>80.64</bold> <bold>&#x000B1;</bold> <bold>0.64</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>aug, data augmentation module; LS, local similarity module; att, attention weight module. Bold values represent the maximum value in each column.</p>
</table-wrap-foot>
</table-wrap>
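The ablation variants in the table amount to toggling three switches on top of a shared baseline. A sketch of that wiring, with hypothetical configuration names (this is not the authors' code):

```python
# Hypothetical sketch of the ablation setup: each variant toggles the
# data-augmentation (aug), local-similarity (LS), and attention (att)
# modules on top of the same baseline embedding + relation modules.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_aug: bool = False   # data augmentation module
    use_ls: bool = False    # local similarity module
    use_att: bool = False   # attention weight module

    def name(self):
        parts = ["Baseline"]
        if self.use_aug: parts.append("aug")
        if self.use_ls:  parts.append("LS")
        if self.use_att: parts.append("att")
        return " + ".join(parts)

# The four rows of the ablation table, trained and evaluated identically.
variants = [
    AblationConfig(),
    AblationConfig(use_aug=True),
    AblationConfig(use_aug=True, use_ls=True),
    AblationConfig(use_aug=True, use_ls=True, use_att=True),
]
for v in variants:
    print(v.name())
```

Keeping everything else fixed while flipping one switch at a time is what lets each accuracy gain be attributed to a single module.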
</sec>
<sec>
<title>5.2. Case study</title>
<p>From <xref ref-type="fig" rid="F4">Figure 4</xref>, the color distribution on the CUB-200-2011 dataset shows that our proposed network achieves better classification accuracy than the original relation network. The relation network&#x00027;s inter-class discriminability and intra-class compactness on fine-grained datasets are limited, which produces numerous misjudged areas in the similarity matrices and consequently poor results. In contrast, our network adapts well to fine-grained datasets and excels in fine-grained classification, improving both inter-class discriminability and intra-class compactness and thereby achieving superior performance.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Visualization of the similarity scores predicted by the Relation Network and the network proposed in this paper on the CUB-200-2011 dataset is presented. In each matrix, the horizontal axis represents 15 query samples from each of the five classes, totaling 75 samples. The vertical axis represents the five classes in the task. The deeper the color, the higher the similarity.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1301192-g0004.tif"/>
</fig>
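A matrix of the kind visualized in Figure 4 scores every query sample against every class. The sketch below uses plain cosine similarity on random placeholder embeddings purely to illustrate the 5 &#x000D7; 75 layout; the paper's network instead learns this score through its relation, local-similarity, and attention modules:

```python
import math, random

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

random.seed(0)
n_way, n_query, dim = 5, 15, 64
# One representative embedding per class and 15 queries per class
# (random stand-ins here, not real features).
protos = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_way)]
queries = [[random.gauss(0, 1) for _ in range(dim)]
           for _ in range(n_way * n_query)]

# 5 x 75 matrix: rows are classes, columns are query samples (as in Figure 4).
sim = [[cosine(p, q) for q in queries] for p in protos]
print(len(sim), len(sim[0]))  # 5 75
```

In a well-separated model, each column has one clearly dominant entry (dark cell) in the row of its true class, which is the pattern the figure visualizes.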
<p>We visualized important areas in the original images using the gradient-based Grad-CAM (Selvaraju et al., <xref ref-type="bibr" rid="B32">2017</xref>) technique on the CUB-200-2011 dataset. Nine original images were randomly selected from the CUB-200-2011 test dataset, and these selected images were resized to match the size of the output features from the embedding module. Subsequently, Grad-CAM images were generated using the Matching Network, Prototype Network, Relation Network, and our Network. As shown in <xref ref-type="fig" rid="F5">Figure 5</xref>, it is evident from the images that the class-discriminative areas in our Network are primarily concentrated on the target object, whereas other methods also exhibit more class-discriminative areas distributed in the background. Therefore, our model is more efficient in focusing on the target object and reducing background interference.</p>
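Grad-CAM weights each channel of the last convolutional feature map by the spatial average of the class-score gradient for that channel, sums the weighted maps, and applies a ReLU (Selvaraju et al., 2017). A framework-free sketch of that arithmetic on toy activations and gradients:

```python
def grad_cam(activations, gradients):
    """activations, gradients: [channel][h][w] lists from the last conv layer.
    Returns an h x w heat map: ReLU(sum_k alpha_k * A_k), where alpha_k is
    the spatial mean of the class-score gradient w.r.t. channel k."""
    c = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pooled gradients (one weight per channel)
    alphas = [sum(sum(row) for row in gradients[k]) / (h * w) for k in range(c)]
    cam = [[0.0] * w for _ in range(h)]
    for k in range(c):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alphas[k] * activations[k][i][j]
    return [[max(0.0, v) for v in row] for row in cam]  # ReLU keeps positives

# Toy example: 2 channels on a 2x2 map; channel 2 has negative gradients,
# so its contribution is cancelled and then clipped by the ReLU.
acts = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 1.0]]
```

The resulting map is then upsampled to the input resolution and overlaid on the image, which is how the red class-discriminative regions in the following figure are produced.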
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Feature visualization of the Matching Network, Prototype Network, Relation Network, and our Network on the CUB-200-2011 dataset (the redder the area, the more class-discriminative it is).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1301192-g0005.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>The article introduces a network that integrates both global and local information, termed the Feature Fusion Similarity Network. This network comprises image data augmentation, feature embedding modules, and both global and local metric modules. By amalgamating global and local insights, it becomes possible to discern regions from various angles, which in turn amplifies classification accuracy. Thorough experimental evidence reveals that our approach showcases competitive performance on fine-grained image datasets when juxtaposed with other mainstream methodologies. Moving forward, our objective is to further refine the object features within images to optimize the model&#x00027;s classification capabilities.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>YY: Data curation, Methodology, Software, Writing&#x02014;original draft, Writing&#x02014;review &#x00026; editing. YF: Software, Writing&#x02014;original draft. LZ: Writing&#x02014;original draft, Writing&#x02014;review &#x00026; editing. HF: Writing&#x02014;original draft. XP: Writing&#x02014;review &#x00026; editing. CJ: Writing&#x02014;original draft.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>D.</given-names></name> <name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Xie</surname> <given-names>J.</given-names></name> <name><surname>Bhunia</surname> <given-names>A. K.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Ma</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>The devil is in the channels: mutual-channel loss for fine-grained image classification</article-title>. <source>IEEE Trans. Image Process</source>. <volume>29</volume>, <fpage>4683</fpage>&#x02013;<lpage>4695</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2020.2973812</pub-id><pub-id pub-id-type="pmid">32092002</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>W.-Y.</given-names></name> <name><surname>Liu</surname> <given-names>Y.-C.</given-names></name> <name><surname>Kira</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>Y.-C. F.</given-names></name> <name><surname>Huang</surname> <given-names>J.-B.</given-names></name></person-group> (<year>2020</year>). <article-title>A closer look at few-shot classification</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1904.04232</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Kai Li Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;ImageNet: a large-scale hierarchical image database,&#x0201D;</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x02013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dvornik</surname> <given-names>N.</given-names></name> <name><surname>Mairal</surname> <given-names>J.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Diversity with cooperation: ensemble methods for few-shot classification,&#x0201D;</article-title> in <source>2019 IEEE/CVF International Conference on Computer Vision (ICCV)</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3722</fpage>&#x02013;<lpage>3730</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00382</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Model-agnostic meta-learning for fast adaptation of deep networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>H.</given-names></name> <name><surname>Mei</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition,&#x0201D;</article-title> in <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4476</fpage>&#x02013;<lpage>4484</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.476</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garcia</surname> <given-names>V.</given-names></name> <name><surname>Bruna</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Few-shot learning with graph neural networks</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1711.04043</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Kuen</surname> <given-names>J.</given-names></name> <name><surname>Ma</surname> <given-names>L.</given-names></name> <name><surname>Shahroudy</surname> <given-names>A.</given-names></name> <name><surname>Shuai</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Recent advances in convolutional neural networks</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1512.07108</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="editor"><name><surname>Jankowski</surname> <given-names>N.</given-names></name> <name><surname>Duch</surname> <given-names>W.</given-names></name> <name><surname>Grabczewski</surname> <given-names>K.</given-names></name> <name><surname>Kacprzyk</surname> <given-names>J.</given-names></name></person-group> (eds) (<year>2011</year>). <source>Meta-Learning in Computational Intelligence, Volume 358 of Studies in Computational Intelligence</source>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Berlin Heidelberg</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-642-20980-2</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Jayadevaprakash</surname> <given-names>N.</given-names></name> <name><surname>Yao</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>F. L.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Novel dataset for fine-grained image categorization,&#x0201D;</article-title> in <source>CVPR Workshop on Fine-Grained Visual Categorization</source>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>ADAM: a method for stochastic optimization</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1412.6980</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Stark</surname> <given-names>M.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;3D object representations for fine-grained categorization,&#x0201D;</article-title> in <source>2013 IEEE International Conference on Computer Vision Workshops</source> (<publisher-loc>Sydney, NSW</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>554</fpage>&#x02013;<lpage>561</lpage>. <pub-id pub-id-type="doi">10.1109/ICCVW.2013.77</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lake</surname> <given-names>B. M.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name> <name><surname>Tenenbaum</surname> <given-names>J. B.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;One-shot learning by inverting a compositional causal process,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems 26 (NIPS 2013)</source>, <fpage>2526</fpage>&#x02013;<lpage>2534</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Huo</surname> <given-names>J.</given-names></name> <name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Luo</surname> <given-names>J.</given-names></name></person-group> (<year>2019a</year>). <article-title>Revisiting local descriptor based image-to-class measure for few-shot learning</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1903.12290</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Huo</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Luo</surname> <given-names>J.</given-names></name></person-group> (<year>2019b</year>). <article-title>Distribution consistency based covariance metric networks for few-shot learning</article-title>. <source>Proc. AAAI Conf. Artif. Intell</source>. <volume>33</volume>, <fpage>8642</fpage>&#x02013;<lpage>8649</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33018642</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>Z.</given-names></name> <name><surname>Xue</surname> <given-names>J.-H.</given-names></name> <name><surname>Ma</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>A concise review of recent few-shot meta-learning methods</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2005.10953</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>Z.</given-names></name> <name><surname>Ma</surname> <given-names>Z.</given-names></name> <name><surname>Cao</surname> <given-names>J.</given-names></name> <name><surname>Xue</surname> <given-names>J.-H.</given-names></name></person-group> (<year>2021</year>). <article-title>BSNet: bi-similarity network for few-shot fine-grained image classification</article-title>. <source>IEEE Trans. Image Process</source>. <volume>30</volume>, <fpage>1318</fpage>&#x02013;<lpage>1331</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2020.3043128</pub-id><pub-id pub-id-type="pmid">33315565</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Zhou</surname> <given-names>F.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>Meta-SGD: learning to learn quickly for few-shot learning</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1707.09835</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lifchitz</surname> <given-names>Y.</given-names></name> <name><surname>Avrithis</surname> <given-names>Y.</given-names></name> <name><surname>Picard</surname> <given-names>S.</given-names></name> <name><surname>Bursuc</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Dense classification and implanting for few-shot learning,&#x0201D;</article-title> in <source>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>9250</fpage>&#x02013;<lpage>9259</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00948</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>RoyChowdhury</surname> <given-names>A.</given-names></name> <name><surname>Maji</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Bilinear CNN models for fine-grained visual recognition,&#x0201D;</article-title> in <source>2015 IEEE International Conference on Computer Vision (ICCV)</source> (<publisher-loc>Santiago</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1449</fpage>&#x02013;<lpage>1457</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2015.170</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>W.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>D.</given-names></name> <name><surname>See</surname> <given-names>J.</given-names></name> <name><surname>Zou</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Partition-aware adaptive switching neural networks for post-processing in HEVC</article-title>. <source>IEEE Trans. Multimedia</source> <volume>22</volume>, <fpage>2749</fpage>&#x02013;<lpage>2763</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2019.2962310</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>W.</given-names></name> <name><surname>Shen</surname> <given-names>Y.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>M.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2017a</year>). <article-title>Learning correspondence structures for person re-identification</article-title>. <source>IEEE Trans. Image Process</source>. <volume>26</volume>, <fpage>2438</fpage>&#x02013;<lpage>2453</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2017.2683063</pub-id><pub-id pub-id-type="pmid">28320667</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>W.</given-names></name> <name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Xu</surname> <given-names>H.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>M.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2017b</year>). <article-title>A tube-and-droplet-based approach for representing and analyzing motion trajectories</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>39</volume>, <fpage>1489</fpage>&#x02013;<lpage>1503</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2016.2608884</pub-id><pub-id pub-id-type="pmid">28113652</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>Z.</given-names></name> <name><surname>Chang</surname> <given-names>D.</given-names></name> <name><surname>Xie</surname> <given-names>J.</given-names></name> <name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Wen</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Fine-grained vehicle classification with channel max pooling modified CNNs</article-title>. <source>IEEE Trans. Veh. Technol</source>. <volume>68</volume>, <fpage>3224</fpage>&#x02013;<lpage>3233</lpage>. <pub-id pub-id-type="doi">10.1109/TVT.2019.2899972</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maji</surname> <given-names>S.</given-names></name> <name><surname>Rahtu</surname> <given-names>E.</given-names></name> <name><surname>Kannala</surname> <given-names>J.</given-names></name> <name><surname>Blaschko</surname> <given-names>M.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Fine-grained visual classification of aircraft</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1306.5151</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nichol</surname> <given-names>A.</given-names></name> <name><surname>Achiam</surname> <given-names>J.</given-names></name> <name><surname>Schulman</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>On first-order meta-learning algorithms</article-title>. <source>arXiv</source>. [peprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1803.02999</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qi</surname> <given-names>Y.</given-names></name> <name><surname>Sun</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>N.</given-names></name> <name><surname>Zhou</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>A task-aware dual similarity network for fine-grained few-shot learning</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2210.12348</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ravi</surname> <given-names>S.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Optimization as a model for few-shot learning,&#x0201D;</article-title> in <source>ICLR</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <name><surname>Rao</surname> <given-names>D.</given-names></name> <name><surname>Sygnowski</surname> <given-names>J.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Pascanu</surname> <given-names>R.</given-names></name> <name><surname>Osindero</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Meta-learning with latent embedding optimization</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1807.05960</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Selvaraju</surname> <given-names>R. R.</given-names></name> <name><surname>Cogswell</surname> <given-names>M.</given-names></name> <name><surname>Das</surname> <given-names>A.</given-names></name> <name><surname>Vedantam</surname> <given-names>R.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Grad-CAM: visual explanations from deep networks via gradient-based localization,&#x0201D;</article-title> in <source>2017 IEEE International Conference on Computer Vision (ICCV)</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/ICCV.2017.74</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shermin</surname> <given-names>T.</given-names></name> <name><surname>Teng</surname> <given-names>S. W.</given-names></name> <name><surname>Sohel</surname> <given-names>F.</given-names></name> <name><surname>Murshed</surname> <given-names>M.</given-names></name> <name><surname>Lu</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Integrated generalized zero-shot learning for fine-grained classification</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2101.02141</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.1409.1556</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Snell</surname> <given-names>J.</given-names></name> <name><surname>Swersky</surname> <given-names>K.</given-names></name> <name><surname>Zemel</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Prototypical networks for few-shot learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 30</source>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sung</surname> <given-names>F.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Xiang</surname> <given-names>T.</given-names></name> <name><surname>Torr</surname> <given-names>P. H.</given-names></name> <name><surname>Hospedales</surname> <given-names>T. M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Learning to compare: relation network for few-shot learning,&#x0201D;</article-title> in <source>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1199</fpage>&#x02013;<lpage>1208</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00131</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Blundell</surname> <given-names>C.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Matching networks for one shot learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 29</source>.</citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wah</surname> <given-names>C.</given-names></name> <name><surname>Branson</surname> <given-names>S.</given-names></name> <name><surname>Welinder</surname> <given-names>P.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <source>The Caltech-UCSD Birds-200-2011 Dataset</source>. <publisher-name>California Institute of Technology</publisher-name>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Song</surname> <given-names>S.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Fine-grained few shot learning with foreground object transformation</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2109.05719</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>X.-S.</given-names></name> <name><surname>Wang</surname> <given-names>P.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Piecewise classifier mappings: learning fine-grained learners for novel categories with few examples</article-title>. <source>IEEE Trans. Image Process</source>. <volume>28</volume>, <fpage>6116</fpage>&#x02013;<lpage>6125</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2019.2924811</pub-id><pub-id pub-id-type="pmid">31265400</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>S.-L.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Wei</surname> <given-names>X.-S.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>Dual attention networks for few-shot fine-grained recognition</article-title>. <source>Proc. AAAI Conf. Artif. Intell</source>. <volume>36</volume>, <fpage>2911</fpage>&#x02013;<lpage>2919</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v36i3.20196</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Xiong</surname> <given-names>H.</given-names></name> <name><surname>Zhou</surname> <given-names>W.</given-names></name> <name><surname>Lin</surname> <given-names>W.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name></person-group> (<year>2017</year>). <article-title>Picking neural activations for fine-grained recognition</article-title>. <source>IEEE Trans. Multimedia</source>. <volume>19</volume>, <fpage>2736</fpage>&#x02013;<lpage>2750</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2017.2710803</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name> <name><surname>Feng</surname> <given-names>J.</given-names></name> <name><surname>Peng</surname> <given-names>Q.</given-names></name> <name><surname>Yan</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>Diversified visual attention networks for fine-grained object classification</article-title>. <source>IEEE Trans. Multimedia</source>. <volume>19</volume>, <fpage>1245</fpage>&#x02013;<lpage>1256</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2017.2648498</pub-id></citation>
</ref>
</ref-list>
</back>
</article>