<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="systematic-review" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2022.1041514</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Plant Science</subject>
<subj-group>
<subject>Systematic Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>AI-based object detection latest trends in remote sensing, multimedia and agriculture applications</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Nawaz</surname>
<given-names>Saqib Ali</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1984356"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Li</surname>
<given-names>Jingbing</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bhatti</surname>
<given-names>Uzair Aslam</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/505472"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shoukat</surname>
<given-names>Muhammad Usman</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ahmad</surname>
<given-names>Raza Muhammad</given-names>
</name>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>School of Information and Communication Engineering, Hainan University</institution>, <addr-line>Haikou</addr-line>, <country>China</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>State Key Laboratory of Marine Resource Utilization in the South China Sea, Hainan University</institution>, <addr-line>Haikou</addr-line>, <country>China</country>
</aff>
<aff id="aff3">
<sup>3</sup>
<institution>School of Automotive Engineering, Wuhan University of Technology</institution>, <addr-line>Wuhan</addr-line>, <country>China</country>
</aff>
<aff id="aff4">
<sup>4</sup>
<institution>College of Cyberspace Security, Hainan University</institution>, <addr-line>Haikou</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Nieves Goicoechea, University of Navarra, Spain</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Huan Yu, Chengdu University of Technology, China; Sijia Yu, The State University of New Jersey - Busch Campus, United States</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Jingbing Li, <email xlink:href="mailto:jingbingli2008@hotmail.com">jingbingli2008@hotmail.com</email>
</p>
</fn>
<fn fn-type="other" id="fn002">
<p>This article was submitted to Sustainable and Intelligent Phytoprotection, a section of the journal Frontiers in Plant Science</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>1041514</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>09</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>07</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Nawaz, Li, Bhatti, Shoukat and Ahmad</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Nawaz, Li, Bhatti, Shoukat and Ahmad</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Object detection is a vital research direction in machine vision and deep learning. Object detection techniques based on deep learning have achieved tremendous progress in feature extraction, image representation, classification, and recognition in recent years, owing to the rapid growth of deep learning theory and technology. Scholars have proposed a series of object detection algorithms as well as improvements in data processing, network structure, loss function, and so on. In this paper, we introduce the characteristics of standard datasets and the critical parameters of performance evaluation, and the network structures and implementation methods of two-stage, single-stage, and other improved algorithms are compared and analyzed. The latest improvement ideas for typical deep learning-based object detection algorithms are discussed, covering data enhancement, <italic>a priori</italic> box selection, network model construction, prediction box selection, and loss calculation. Finally, combined with the existing challenges, future research directions for typical object detection algorithms are surveyed.</p>
</abstract>
<kwd-group>
<kwd>deep learning</kwd>
<kwd>object detection</kwd>
<kwd>transfer learning</kwd>
<kwd>algorithm improvement</kwd>
<kwd>data augmentation</kwd>
<kwd>network structure</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="5"/>
<equation-count count="9"/>
<ref-count count="148"/>
<page-count count="21"/>
<word-count count="10799"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<title>1 Introduction</title>
<p>Computer vision, also known as machine vision, uses an image sensor in place of the human eye to obtain an image of an object, converts it into a digital image, and applies computer-simulated human discrimination criteria to understand, recognize, and analyze the image and draw conclusions. This technology gradually emerged from the successful application of remote sensing and medical image processing in the 1970s and has since been applied in many fields. At present, the application of computer vision technology in agriculture is increasing day by day. Object detection is widely used in different areas of agriculture and is gaining importance in fruit detection, disease recognition, and scene classification (<xref ref-type="bibr" rid="B139">Zhang et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B9">Bhatti et&#xa0;al., 2021</xref>).</p>
<p>The primary goal of object detection is to find all objects of interest in a given image with high accuracy and efficiency and to use a rectangular bounding box to determine the location and size of each detected object; the task is closely connected to object classification, semantic segmentation, and instance segmentation. In the process of object detection, the varying appearance, posture, shape, and quantity of target objects in an image, together with interference from factors such as illumination and occlusion, distort the target and increase the difficulty of detection (<xref ref-type="bibr" rid="B22">Chen and Wang, 2014</xref>; <xref ref-type="bibr" rid="B8">Bhatti et&#xa0;al., 2019</xref>).</p>    <p>Object detection algorithms are mainly divided into traditional algorithms and deep learning-based detection algorithms. Traditional detection approaches rely on hand-crafted features and shallow trainable architectures, which are ineffective when creating complicated object detectors and scene classifiers that combine many low-level image features with high-level semantic information. Traditional object detection algorithms mainly include the deformable parts model (DPM) (<xref ref-type="bibr" rid="B32">Doll&#xe1;r et&#xa0;al., 2009</xref>), selective search (SS) (<xref ref-type="bibr" rid="B123">Uijlings et&#xa0;al., 2013</xref>), Oxford-MKL (<xref ref-type="bibr" rid="B124">Vedaldi et&#xa0;al., 2009</xref>), and NLPR-HOGLBP (<xref ref-type="bibr" rid="B137">Yu et&#xa0;al., 2010</xref>). The basic structure of a traditional object detection algorithm mainly comprises three parts: 1) region selection: sliding windows of different sizes and proportions are set for a given image, and the entire image is traversed from left to right and top to bottom, framing specific parts of the image as candidate regions to be detected; 2) feature extraction: visual features are extracted from each candidate region, such as the scale-invariant feature transform (SIFT) (<xref ref-type="bibr" rid="B10">Bingtao et&#xa0;al., 2015</xref>), Haar (<xref ref-type="bibr" rid="B76">Lienhart and Maydt, 2002</xref>), and the histogram of oriented gradients (HOG) (<xref ref-type="bibr" rid="B117">Shu et&#xa0;al., 2021</xref>) commonly used in face and generic object detection; 3) classification: a trained classifier identifies the target category from the features, with commonly used classifiers including the deformable part model (DPM), AdaBoost (<xref ref-type="bibr" rid="B125">Viola and Jones, 2001</xref>), and support vector machines (SVM) (<xref ref-type="bibr" rid="B4">Ashritha et&#xa0;al., 2021</xref>). However, while these three parts achieved certain results, they also exposed inherent flaws: sliding-window region selection incurs high time complexity and window redundancy; the uncertainty of illumination changes and the diversity of backgrounds lead to poor robustness of hand-designed features (<xref ref-type="bibr" rid="B15">Cao et&#xa0;al., 2020a</xref>) and poor generalization; and the complex multi-stage pipeline results in slow detection and low accuracy (<xref ref-type="bibr" rid="B131">Wu et&#xa0;al., 2021</xref>). As a result, classic object detection approaches have struggled to meet the demand for high-performance detection.</p>
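<p>The three-part traditional pipeline above can be sketched in Python. The following is a minimal illustration only: the window size, stride, gradient-histogram feature, and linear classifier are hypothetical stand-ins for the SIFT/Haar/HOG features and trained SVM/AdaBoost classifiers named in the text, not an implementation from any cited work.</p>

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Part 1, region selection: traverse the image left-to-right,
    top-to-bottom, yielding (box, pixels) candidate regions."""
    h, w = image.shape[:2]
    for y in range(0, h - window[0] + 1, stride):
        for x in range(0, w - window[1] + 1, stride):
            yield (x, y, window[1], window[0]), image[y:y + window[0], x:x + window[1]]

def extract_features(region):
    """Part 2, feature extraction: a stand-in for HOG-like features,
    here a coarse histogram of gradient orientations."""
    gy, gx = np.gradient(region.astype(float))
    angles = np.arctan2(gy, gx)
    hist, _ = np.histogram(angles, bins=9, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def classify(features, weights, bias=0.0):
    """Part 3, classification: a stand-in for a trained linear SVM."""
    return float(features @ weights + bias)

image = np.random.default_rng(0).integers(0, 255, size=(128, 128))
weights = np.ones(9) / 9.0
detections = [(box, classify(extract_features(reg), weights))
              for box, reg in sliding_windows(image)]
print(len(detections))  # number of candidate windows scored
```

The window redundancy criticized in the text is visible here: every stride step spawns a full feature-extraction and classification pass, regardless of image content.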
<p>However, some complications remain when applying deep learning-based object detection algorithms, such as very small objects, insufficient detection accuracy, and insufficient data volume. Many scholars have improved the algorithms and have also summarized these improvements in reviews. <xref ref-type="bibr" rid="B122">Tong et&#xa0;al. (2020)</xref> analyzed and outlined improved techniques from the aspects of multi-scale features, data enhancement, and context information but ignored the performance gains that feature extraction networks bring to small object detection; moreover, their data enhancement part only considers improving small object detection by increasing the number and type of small targets in the dataset, which lacks diversity. <xref ref-type="bibr" rid="B133">Xu et&#xa0;al. (2021)</xref> and <xref ref-type="bibr" rid="B27">Degang et&#xa0;al. (2021)</xref> respectively introduced and analyzed typical object detection algorithms for regression-based and candidate-window-based detection frameworks. However, because the optimization schemes of the algorithms are not well classified in those texts, readers cannot clearly understand when and how to apply the improvement ideas to a detection algorithm. Mainstream deep learning object detection algorithms are mainly separated into two-stage detection algorithms and single-stage detection algorithms, as shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Object detection method based on deep learning <bold>(A)</bold> Single stage method <bold>(B)</bold> Two stage method.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g001.tif"/>
</fig>    <p>In <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>, the two-stage detection algorithm is based on candidate regions and is represented by the R-CNN series; the single-stage detection algorithm is a regression-based object detection algorithm represented by YOLO and SSD. This review covers the different object detection approaches, and the main contributions of this paper are as follows:</p>
<list list-type="bullet">
<list-item>
<p>Firstly, this review organizes the standard datasets and evaluation indicators. The datasets and their evaluation methods are examined in depth, drawing on literature from recent years.</p>
</list-item>
<list-item>
<p>Secondly, this review focuses on deep learning approaches for object detection, including two-stage and single-stage object detection algorithms as well as generative adversarial networks.</p>
</list-item>
<list-item>
<p>Thirdly, this paper surveys applications of deep learning-based object detection algorithms in multimedia, remote sensing, and agriculture. Finally, conclusions and future research directions are presented.</p>
</list-item>
</list>
</sec>
<sec id="s2">
<title>2 Common data sets and evaluation indicators</title>
<p>This section highlights the datasets used for objects in remote sensing, agriculture, and multimedia applications.</p>
<sec id="s2_1">
<title>2.1 Common datasets</title>
<p>In the task of object detection, a dataset with strong applicability can effectively test and assess the performance of the algorithm and promote the development of research in related fields. The most widely used datasets for deep learning-based object detection tasks are PASCAL VOC2007 (<xref ref-type="bibr" rid="B60">Ito et&#xa0;al., 2007</xref>), PASCAL VOC2012 (<xref ref-type="bibr" rid="B93">Marris et&#xa0;al., 2012</xref>), Microsoft COCO (<xref ref-type="bibr" rid="B82">Lin et&#xa0;al., 2014</xref>), ImageNet (<xref ref-type="bibr" rid="B28">Deng et&#xa0;al., 2009</xref>) and OICOD (Open Image Challenge Object Detection) (<xref ref-type="bibr" rid="B69">Krasin et&#xa0;al., 2017</xref>). Different features and quantities of images in datasets are listed in <xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Comparison of related data sets.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Dataset Name</th>
<th valign="top" align="center">Quantity</th>
<th valign="top" align="center">Type</th>
<th valign="top" align="center">Year</th>
<th valign="top" align="center">Features</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">CIFAR-10 (<xref ref-type="bibr" rid="B70">Krizhevsky and Hinton, 2009</xref>)</td>
<td valign="top" align="center">60000</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">2009</td>
<td valign="top" align="left">Color pictures of everyday things in daily life; takes up little storage space; the objects to be detected in the images are large; this dataset is often used to measure the classification ability of a model</td>
</tr>
<tr>
<td valign="top" align="left">PASCAL<break/>VOC 2007 (<xref ref-type="bibr" rid="B38">Everingham et&#xa0;al., 2010</xref>)<break/>PASCAL<break/>VOC 2012 (<xref ref-type="bibr" rid="B37">Everingham et&#xa0;al., 2015</xref>)</td>
<td valign="top" align="center">9963<break/>11530</td>
<td valign="top" align="center">20<break/>20</td>
<td valign="top" align="center">2010<break/>2015</td>
<td valign="top" align="left">Standardized datasets that can be used for image classification, object detection, and image segmentation; the standardized process has led most self-made datasets to adopt this format; most of the data are real-world images, which are difficult to detect; the good image quality and complete labels make these datasets well suited to evaluating model performance; every image corresponds one-to-one to its annotation file, which is easy to manage</td>
</tr>
<tr>
<td valign="top" align="left">ImageNet (<xref ref-type="bibr" rid="B110">Russakovsky et&#xa0;al., 2015</xref>)</td>
<td valign="top" align="center">14.19 Million</td>
<td valign="top" align="center">21841</td>
<td valign="top" align="center">2015</td>
<td valign="top" align="left">Because this dataset has extremely rich variety information and can contain the underlying features of most detected objects, it is often used as a dataset for pre-training models, which also makes the model extremely challenging in both object detection and object classification.</td>
</tr>
<tr>
<td valign="top" align="left">Microsoft<break/>COCO (<xref ref-type="bibr" rid="B82">Lin et&#xa0;al., 2014</xref>)</td>
<td valign="top" align="center">328000</td>
<td valign="top" align="center">91</td>
<td valign="top" align="center">2014</td>
<td valign="top" align="left">The image environment is complex and diverse, which increases the difficulty of detection; in addition to the category and location information of the image, it also contains a scene description of the image; the number of categories is far smaller than in the ImageNet, Open Image, and SUN datasets, but this also makes each category more difficult to detect; the larger the number of images, the better the detection ability of the model after training.</td>
</tr>
<tr>
<td valign="top" align="left">Open Image (<xref ref-type="bibr" rid="B74">Kuznetsova et&#xa0;al., 2020</xref>)</td>
<td valign="top" align="center">1.9 Million</td>
<td valign="top" align="center">600</td>
<td valign="top" align="center">2020</td>
<td valign="top" align="left">The largest dataset with object location annotations currently available; the annotation information is manually reviewed to ensure accuracy and consistency; the majority of the photographs are complex scenes containing several objects</td>
</tr>
<tr>
<td valign="top" align="left">Places (<xref ref-type="bibr" rid="B144">Zhou et&#xa0;al., 2017</xref>)</td>
<td valign="top" align="center">2.5 Million</td>
<td valign="top" align="center">205</td>
<td valign="top" align="center">2017</td>
<td valign="top" align="left">The Places dataset is a scene-centric database, and the scene categories in the images represent the scene information of each image</td>
</tr>
<tr>
<td valign="top" align="left">SUN (<xref ref-type="bibr" rid="B132">Xiao et&#xa0;al., 2016</xref>)</td>
<td valign="top" align="center">130519</td>
<td valign="top" align="center">899</td>
<td valign="top" align="center">2016</td>
<td valign="top" align="left">Compared with the Places dataset, it has more scene category information, but the average category of the SUN dataset in each scene is about 80 times different from the Places dataset, resulting in a weaker scene classification ability learned by the model using the SUN dataset; In addition to scene recognition, object recognition under the scene can be performed.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s2_2">
<title>2.2 Evaluation indicators</title>
<p>The performance of an object detection algorithm is mainly evaluated by the following parameters: intersection over union (IoU) (<xref ref-type="bibr" rid="B105">Rahman and Wang, 2016</xref>), frames per second (FPS), accuracy (A), recall (R), precision (P), average precision (AP), and mean average precision (mAP) (<xref ref-type="bibr" rid="B122">Tong et&#xa0;al., 2020</xref>). AP is the area enclosed by the P-R curve and the coordinate axes, and mAP is the mean of AP over all classes (<xref ref-type="bibr" rid="B64">Kang, 2019</xref>; <xref ref-type="bibr" rid="B126">Wang, 2021</xref>).</p>
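<p>The IoU and AP indicators can be illustrated with a short Python sketch. The boxes and the precision-recall points below are made-up values for illustration only: IoU is the ratio of the overlap area to the union area of two boxes, and AP accumulates precision over recall increments, i.e., the area under the P-R curve.</p>

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recalls, precisions):
    """Area under the P-R curve, accumulated over recall increments."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
print(average_precision([0.2, 0.5, 1.0], [1.0, 0.8, 0.6]))
```

mAP is then simply the mean of this AP value computed per object class.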
</sec>
</sec>
<sec id="s3">
<title>3 Deep learning approaches for object detection in multimedia</title>
<sec id="s3_1">
<title>3.1 Two-stage object detection algorithm</title>
<p>In two-stage object detection, one branch of object detectors is based on multi-stage models. Deriving from the work on R-CNN, one model is used to extract regions of objects, and a second model is used to classify and further refine the localization of the object. The two-stage approach primarily uses algorithms such as Selective Search or Edge Boxes (<xref ref-type="bibr" rid="B147">Zitnick and Doll&#xe1;r, 2014</xref>) to choose candidate regions (region proposals) (<xref ref-type="bibr" rid="B58">Hu and Zhai, 2019</xref>) that may contain objects in the input image, and then classifies and localizes those candidate regions. The R-CNN (<xref ref-type="bibr" rid="B44">Girshick et&#xa0;al., 2014</xref>) series, R-FCN (<xref ref-type="bibr" rid="B25">Dai et&#xa0;al., 2016</xref>), Mask R-CNN (<xref ref-type="bibr" rid="B49">He et&#xa0;al., 2017</xref>), and other algorithms are examples.</p>
<sec id="s3_1_1">
<title>3.1.1 OverFeat algorithm</title>
<p>The OverFeat algorithm was proposed in <xref ref-type="bibr" rid="B113">Sermanet et&#xa0;al. (2013)</xref> as an improvement of AlexNet. The approach combines AlexNet with multi-scale sliding windows (<xref ref-type="bibr" rid="B96">Naqvi et&#xa0;al., 2020</xref>) to achieve feature extraction, shares the feature extraction layers, and is applied to tasks including image classification, localization, and object detection. On the ILSVRC 2013 (<xref ref-type="bibr" rid="B81">Lin et&#xa0;al., 2018</xref>) dataset, the mAP is 24.3%, and the detection effect is much better than that of traditional approaches. The algorithm has heuristic relevance for deep learning object detection; however, it is ineffective at detecting small objects and has a high error rate.</p>
</sec>
<sec id="s3_1_2">
<title>3.1.2 R-CNN algorithm</title>
<p>R-CNN, a standard two-stage object detection approach, introduced the convolutional neural network (CNN) (<xref ref-type="bibr" rid="B71">Krizhevsky et&#xa0;al., 2012</xref>) to the task of object detection. It consists of three modules: region proposal, CNN-based deep feature extraction, and classification and regression:</p>
<list list-type="order">
<list-item>
<p>Use the selective search algorithm to extract about 2,000 candidate region frames that may contain target objects from each image;</p>
</list-item>
<list-item>
<p>Normalize the candidate regions to a fixed size for feature extraction;</p>
</list-item>
<list-item>
<p>Use AlexNet to extract features from each candidate region, feed them one by one into an SVM for classification, and refine the results with bounding-box regression and non-maximum suppression (NMS).</p>
</list-item>
</list>
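<p>Step 3 above relies on non-maximum suppression to discard overlapping candidate boxes. A minimal greedy-NMS sketch follows; the 0.5 IoU threshold and the sample boxes are illustrative choices, not values from the paper.</p>

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop any
    remaining box whose IoU with it exceeds the threshold; repeat."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is suppressed
```

In R-CNN this pruning runs once per class over the roughly 2,000 scored proposals, so each object ends up with a single surviving box.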
<p>The hinge loss with an L<sub>2</sub> regularization term (<xref ref-type="bibr" rid="B95">Moore and DeNero, 2011</xref>) is the loss function of the SVM classifier. The function is defined as follows:</p>
<disp-formula>
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>cls</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mtext>c</mml:mtext>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mtext>i</mml:mtext>
</mml:munder>
<mml:mi>max</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mo>.</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext>&#xa0;&#xa0;p</mml:mtext>
</mml:mrow>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:mtext>&#xa0;</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:msup>
<mml:mtext>w</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where the ground-truth class label of the object is represented by <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:msubsup>
<mml:mtext>p</mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, the predicted probability of the object class is represented by p<sub>i</sub>, and i denotes the index within the mini-batch. To improve the robustness of the prediction, the main premise of the localization loss is to penalize the deviation between the predicted bounding box and the ground truth. The function is defined as follows:</p>
<disp-formula>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:msubsup>
<mml:mtext>t</mml:mtext>
<mml:mi>x</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mtext>x</mml:mtext>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>x</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">/</mml:mo>
<mml:mtext>w,&#x2009;</mml:mtext>
<mml:msubsup>
<mml:mtext>t</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mtext>y</mml:mtext>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>y</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">/</mml:mo>
<mml:mtext>h</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula>
<label>(2)</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:msubsup>
<mml:mtext>t</mml:mtext>
<mml:mi>w</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mtext>log(w</mml:mtext>
</mml:mrow>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo stretchy="false">/</mml:mo>
<mml:mtext>w)</mml:mtext>
<mml:mo>,</mml:mo>
<mml:msubsup><mml:mtext>t</mml:mtext>
<mml:mi>h</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mtext>log(h</mml:mtext>
</mml:mrow>
<mml:mo>*</mml:mo>
</mml:msup>
<mml:mo stretchy="false">/</mml:mo>
<mml:mtext>h)</mml:mtext>
<mml:mtext>&#xa0;</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula>
<label>(3)</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>loc</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mtext>i</mml:mtext>
</mml:munder>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mtext>t</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>i</mml:mtext>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mtext>w</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>T</mml:mtext>
</mml:msubsup>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where the ground-truth coordinate is t<sup>*</sup> = (x<sup>*</sup>,y<sup>*</sup>,w<sup>*</sup>,h<sup>*</sup>) and the predicted coordinate is t = (x,y,w,h), where (x, y) denotes the coordinates of the box center and (w, h) denotes the width and height of the box. <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:msubsup>
<mml:mtext>w</mml:mtext>
<mml:mo>*</mml:mo>
<mml:mtext>T</mml:mtext>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is the learned weight vector, and &#x3d5;(t<sup>i</sup>) is the feature vector. The region scores are adjusted and filtered for location regression in a fully connected network (<xref ref-type="bibr" rid="B44">Girshick et&#xa0;al., 2014</xref>).</p>    <p>The R-CNN algorithm improves the mAP to 31.4% on the ILSVRC2013 dataset and to 58.5% on the VOC2007 dataset. Its performance is better than that of typical traditional object detection algorithms. However, the following issues persist:</p>
<list list-type="order">
<list-item>
<p>Because every stage must be trained separately, training involves a multi-stage pipeline that is slow and difficult to optimize.</p>
</list-item>
<list-item>
<p>Because CNN features must be extracted from every object proposal of each image, training the SVM classifier and the bounding-box regressor is time- and disk-intensive. This is critical for large-scale detection.</p>
</list-item>
<list-item>
<p>The test speed is slow, because CNN features need to be extracted from every object proposal in each test image, and there is no shared computation.</p>
</list-item>
</list>
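<p>The regression targets of Eq. (2) can be computed directly from a proposal box and its matched ground-truth box: a scale-invariant translation of the center plus a log-space shift of the width and height. A minimal sketch (the sample boxes are illustrative values, not from the paper):</p>

```python
import math

def regression_targets(proposal, gt):
    """Targets t* of Eq. (2). Boxes are (center_x, center_y, width, height);
    `gt` holds the starred (ground-truth) quantities."""
    x, y, w, h = proposal
    xs, ys, ws, hs = gt
    return ((xs - x) / w,      # t*_x: center shift, normalized by width
            (ys - y) / h,      # t*_y: center shift, normalized by height
            math.log(ws / w),  # t*_w: log-space width change
            math.log(hs / h))  # t*_h: log-space height change

tx, ty, tw, th = regression_targets((10.0, 10.0, 20.0, 20.0),
                                    (12.0, 11.0, 40.0, 20.0))
print(tx, ty, tw, th)  # 0.1, 0.05, log 2, 0.0
```

The normalization by w and h makes the targets invariant to box scale, which is why the same regressor can refine proposals of very different sizes.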
</sec>
<sec id="s3_1_3">
<title>3.1.3 SPP-Net algorithm</title>    <p>
<xref ref-type="bibr" rid="B51">He et&#xa0;al. (2015)</xref> presented the Spatial Pyramid Pooling Network (SPP-Net) in 2015 as a solution to the problem that R-CNN extracts features from all candidate regions separately, which takes a lot of time. Between the last convolutional layer and the fully connected layer, SPP-Net adds a spatial pyramid pooling structure that partitions the feature map at several standard scales and fuses the quantized local features into a mid-level representation. To avoid repeated feature extraction and break free of the fixed-size input constraint, a fixed-length feature vector is built on the feature map, and features are extracted only once per image. On the PASCAL VOC 2007 dataset, the SPP-Net algorithm is 24 to 102 times faster than the R-CNN algorithm in detection, and the mAP is increased to 59.2%. However, the following issues need to be addressed:</p>
<list list-type="order">
<list-item>
<p>A huge number of features must be stored, which consumes a lot of space;</p>
</list-item>
<list-item>
<p>the SVM classifier is still used, which requires many training steps and takes a long time.</p>
</list-item>
</list>
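<p>The fixed-length vector that SPP-Net builds on the feature map can be sketched as follows. The pyramid levels (1, 2, 4) and the feature-map sizes are illustrative choices: each level max-pools the feature map into an n x n grid of bins, and concatenating the bins yields the same output length regardless of the input size, which is how SPP-Net escapes the fixed-size-input constraint.</p>

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into n x n bins per pyramid level,
    then concatenate: output length is C * sum(n*n) for any H, W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # bin boundaries that partition H and W into n nearly equal spans
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(bin_.max(axis=(1, 2)))  # one C-vector per bin
    return np.concatenate(pooled)

fm_small = np.random.default_rng(0).random((8, 13, 9))
fm_large = np.random.default_rng(1).random((8, 40, 56))
print(spatial_pyramid_pool(fm_small).shape,
      spatial_pyramid_pool(fm_large).shape)
# both have length 8 * (1 + 4 + 16) = 168 despite different input sizes
```

Because the output length depends only on C and the pyramid levels, the same fully connected layer can follow feature maps computed from images of any resolution.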
</sec>
<sec id="s3_1_4">
<title>3.1.4 Fast R-CNN algorithm</title>
<p>
<xref ref-type="bibr" rid="B43">Girshick (2015)</xref> introduced the Fast R-CNN technique, based on bounding-box regression and multi-task loss classification, to solve the difficulties of SPP-Net. The algorithm simplifies the SPP layer into a single-scale ROI pooling layer, in which each candidate region of the image is pooled to a fixed size, the fully connected layers are compressed <italic>via</italic> SVD decomposition, and the Softmax classification score and bounding box are obtained from the network heads, as follows:</p>
<disp-formula>
<label>(4)</label>
<mml:math display="block" id="M5">
<mml:mrow>
<mml:mtext>L(p</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>u</mml:mtext>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mi>u</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>v)</mml:mtext>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>cls</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>p</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>u</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mtext>&#x3bb;[u</mml:mtext>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>1</mml:mn>
<mml:msub>
<mml:mrow>
<mml:mtext>]L</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>loc</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mi>u</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>v</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where L<sub>cls</sub>(p,u) = -log p<sub>u</sub> computes the log loss for ground-truth class u, and p<sub>u</sub> is taken from the discrete probability distribution p = (p<sub>0</sub>,&#x2026;,p<sub>C</sub>) over the C+1 outputs of the last FC layer. L<sub>loc</sub>(t<sup>u</sup>,v) is defined over the predicted offsets <inline-formula>
<mml:math display="inline" id="im3">
<mml:mrow>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mi>u</mml:mi>
</mml:msup>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>x</mml:mi>
<mml:mtext>u</mml:mtext>
</mml:msubsup>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>y</mml:mi>
<mml:mtext>u</mml:mtext>
</mml:msubsup>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>w</mml:mi>
<mml:mtext>u</mml:mtext>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>u</mml:mi>
</mml:msubsup>
<mml:mtext>&#xa0;</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and the ground-truth bounding-box regression targets v = (v<sub>x</sub>,v<sub>y</sub>,v<sub>w</sub>,v<sub>h</sub>), where x, y, w, and h denote the two coordinates of the box center, the width, and the height, respectively. Each t<sup>u</sup> uses the parameterization of (<xref ref-type="bibr" rid="B147">Zitnick and Doll&#xe1;r, 2014</xref>), which specifies an object proposal by a scale-invariant translation and a log-space height/width shift. The Iverson bracket indicator function [u &#x2265; 1] excludes all background RoIs. A smooth L<sub>1</sub> loss is used to fit the bounding-box regressors, giving additional robustness against outliers and avoiding exploding gradients:</p>
<disp-formula>
<label>(5)</label>
<mml:math display="block" id="M6">
<mml:mrow>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>loc</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mi>u</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>v</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mtext>i</mml:mtext>
<mml:mo>&#x2208;</mml:mo>
<mml:mtext>x</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>y</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>w</mml:mtext>
<mml:mo>,</mml:mo>
<mml:mtext>h</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mtext>smooth</mml:mtext>
<mml:mo>&#xa0;</mml:mo>
<mml:mtext>L</mml:mtext>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>u</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>And</p>
<disp-formula>
<label>(6)</label>
<mml:math display="block" id="M7">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext>smoothL</mml:mtext>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0.5</mml:mn>
<mml:msup>
<mml:mtext>x</mml:mtext>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mtext>x</mml:mtext>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>&lt;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mtext>x</mml:mtext>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>0.5</mml:mn>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr> </mml:mtable>
</mml:mrow>
</mml:mrow>
<mml:mtext>&#xa0;&#xa0;</mml:mtext>
</mml:mrow>
</mml:math>
</disp-formula>
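Equations (5) and (6) are easy to check numerically. The following is a minimal NumPy sketch of the smooth L<sub>1</sub> localization loss; the function and variable names are ours, not from the paper:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1 loss from Eq. (6):
    0.5*x^2 if |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x * x, np.abs(x) - 0.5)

def loc_loss(t_u, v):
    """L_loc from Eq. (5): sum of smooth L1 over the four box offsets."""
    return smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()

# Small errors are penalized quadratically, large ones only linearly,
# so a single outlier coordinate cannot dominate the gradient.
assert smooth_l1(0.5) == 0.125   # 0.5 * 0.5^2
assert smooth_l1(3.0) == 2.5     # |3| - 0.5
```

The linear tail for |x| &#x2265; 1 is exactly the robustness-to-outliers property the text attributes to this loss.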
</sec>
<sec id="s3_1_5">
<title>3.1.5 Faster R-CNN algorithm</title>
<p>The employment of candidate region generation methods such as selective search stymies further accuracy progress. <xref ref-type="bibr" rid="B109">Ren et&#xa0;al. (2015)</xref> presented Faster R-CNN as a solution to this problem, introducing a Region Proposal Network (RPN) to replace the selective search algorithm. By comparing proposals to reference boxes (anchors), regression toward the actual bounding boxes can be accomplished. Faster R-CNN uses anchors of three scales and three aspect ratios. The loss function resembles that of (4):</p>
<disp-formula>
<label>(7)</label>
<mml:math display="block" id="M8">
<mml:mrow>
<mml:mrow>
<mml:mtext>L</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mtext>t</mml:mtext>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mtext>N</mml:mtext>
<mml:mrow>
<mml:mtext>cls</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>cls</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mtext>p</mml:mtext>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mtext>p</mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mtext>&#x3bb;</mml:mtext>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mtext>N</mml:mtext>
<mml:mrow>
<mml:mtext>reg</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:msubsup>
<mml:mtext>p</mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:msub>
<mml:mtext>L</mml:mtext>
<mml:mrow>
<mml:mtext>reg</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mtext>t</mml:mtext>
<mml:mtext>i</mml:mtext>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mtext>i</mml:mtext>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where p<sub>i</sub> denotes the predicted probability that the i<sup>th</sup> anchor is an object. If the anchor is positive, the ground-truth label <inline-formula>
<mml:math display="inline" id="im4">
<mml:mrow>
<mml:msubsup>
<mml:mtext>p</mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is 1, otherwise, it is 0. <inline-formula>
<mml:math display="inline" id="im5">
<mml:mrow>
<mml:msubsup>
<mml:mtext>t</mml:mtext>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denotes the ground-truth box associated with a positive anchor, while t<sub>i</sub> contains the four parameterized coordinates of the predicted bounding box. L<sub>cls</sub> is a binary log loss, and L<sub>reg</sub> is a smooth L<sub>1</sub> loss similar to (5). On the PASCAL VOC 2007 dataset, Faster R-CNN achieves 73.2% mAP with the VGG-16 backbone network. However, there are still issues:</p>
<list list-type="bullet">
<list-item>
<p>The anchor scales chosen on the feature map are not adequate for all objects, notably for small object detection;</p>
</list-item>
<list-item>
<p>Predictions are made only from the output features of the last convolutional layer of the VGG-16 network, and the network features lose translation invariance and accuracy after the RoI Pooling layer;</p>
</list-item>
</list>
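The anchor mechanism discussed above can be illustrated with a small sketch. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the usual Faster R-CNN configuration, but the helper function itself is ours:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the 3 scales x 3 aspect-ratio anchors at one feature-map
    location (cx, cy), as (x1, y1, x2, y2) boxes.

    For each scale s and ratio r = h/w, the width is s/sqrt(r) and the
    height s*sqrt(r), so every anchor at scale s keeps the same area s^2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # width shrinks as the ratio h/w grows
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

boxes = make_anchors(0.0, 0.0)
areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
# The first three anchors (scale 128) all share the same area.
assert np.allclose(areas[:3], 128 * 128)
```

The RPN then classifies each such anchor as object/background and regresses the offsets t<sub>i</sub> from it, which is why fixed scales can struggle on very small objects.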
</sec>
<sec id="s3_1_6">
<title>3.1.6 R-FCN algorithm</title>
<p>The ideas and performance of the R-CNN series of algorithms mark the milestones of object detection. This series of structures is essentially composed of two subnetworks (Faster R-CNN adds the RPN, making three): the former is the backbone network for feature extraction, and the latter completes the classification and localization. Between the two subnetworks, the RoI Pooling layer turns the multi-scale feature map into a fixed-size feature map, but this step breaks the network&#x2019;s translation invariance and is unfavorable to object classification. Using the ResNet-101 (<xref ref-type="bibr" rid="B52">He et&#xa0;al., 2016</xref>) backbone network, <xref ref-type="bibr" rid="B25">Dai et&#xa0;al. (2016)</xref> developed position-sensitive score maps containing object location information in the R-FCN (Region-based Fully Convolutional Networks) algorithm.</p>
</sec>
<sec id="s3_1_7">
<title>3.1.7 Mask R-CNN algorithm</title>
<p>Mask R-CNN, proposed by <xref ref-type="bibr" rid="B49">He et&#xa0;al. (2017)</xref>, is a Faster R-CNN extension that uses the ResNet-101-FPN backbone network. Its multi-task loss combines the classification loss, the bounding-box regression loss, and a segmentation branch loss. A mask branch for per-RoI segmentation is added alongside object classification and bounding-box regression, enabling simultaneous object detection and instance segmentation. <xref ref-type="bibr" rid="B79">Lin et&#xa0;al. (2017a)</xref> proposed the RoIAlign layer to replace the RoI Pooling layer, using bilinear interpolation to sample pixels at non-integer positions and thereby avoiding the rounding of feature-map coordinates in the downsampling and RoI Pooling layers. The mAP on the COCO dataset increases to 39.8% at a detection speed of 5 frames per second. However, meeting real-time detection-speed requirements remains difficult, and the cost of instance-segmentation labeling is high.</p>
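The bilinear interpolation that RoIAlign relies on can be shown in isolation. This is a minimal single-point sketch (function name ours), not the full RoIAlign layer:

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map at a non-integer (y, x) position by
    bilinear interpolation -- the weighted average of the four nearest
    cells -- instead of rounding (y, x) to integers as RoI Pooling does."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x1] * dx
    bot = fmap[y1, x0] * (1 - dx) + fmap[y1, x1] * dx
    return top * (1 - dy) + bot * dy

fmap = np.array([[0.0, 2.0],
                 [4.0, 6.0]])
# Halfway between all four corners -> their average.
assert bilinear_sample(fmap, 0.5, 0.5) == 3.0
```

Because the sample position is never rounded, the mask branch stays pixel-aligned with the original image, which is the misalignment fix the text describes.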
</sec>
<sec id="s3_1_8">
<title>3.1.8 Comparison and analysis</title>
<p>On the COCO dataset, two-stage object detection uses a cascade structure and has been successful in instance segmentation. Although detection accuracy has improved over time, detection speed has remained poor. <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref> reviews the backbone networks of the two-stage object detection methods, together with their detection accuracy (mAP) and detection speed on the VOC2007 test set, VOC2012 test set, and COCO test set; &#x201c;&#x2014;&#x201d; signifies no relevant data.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Performance comparison of two-stage object detection algorithms.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g002.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>, the two-stage object detectors adopt deep backbone networks such as ResNet (<xref ref-type="bibr" rid="B2">Allen-Zhu and Li, 2019</xref>) and ResNeXt (<xref ref-type="bibr" rid="B54">Hitawala, 2018</xref>), and detection accuracy can reach 83.6%; however, the growth of the model increases the amount of computation, and the detection speed is only about 11 frames/s, which cannot meet real-time requirements. <xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref> outlines the benefits, drawbacks, and contexts in which these two-stage object detection techniques can be used.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Advantages, disadvantages, and applicable scenarios of two-stage Object detection algorithms.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Model</th>
<th valign="top" align="center">Advantage</th>
<th valign="top" align="center">Disadvantage</th>
<th valign="top" align="center">Applicable</th>
<th valign="top" align="center">References of Applications in Agriculture, Multimedia and Remote Sensing</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">OverFeat</td>
<td valign="top" align="left">Feature extraction using CNN</td>
<td valign="top" align="left">Using a sliding window, the time and space overhead is large</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B31">Diwan et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B87">Li et&#xa0;al., 2020</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">R-CNN</td>
<td valign="top" align="left">Combining CNN with the candidate box method</td>
<td valign="top" align="left">Feature extraction is complex, time-consuming, fixed image input size</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B134">Yan et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B63">Jiao et&#xa0;al., 2020</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">SPP-Net</td>
<td valign="top" align="left">Perform convolution operation on the entire image to realize multi-scale convolution calculation</td>
<td valign="top" align="left">High space cost</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B65">Karim et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B73">Kumar and Kumar, 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Fast R-CNN</td>
<td valign="top" align="left">Extract features with ROI Pooling layer, saving time and feature loading space</td>
<td valign="top" align="left">The selection of candidate regions is computationally complex</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B89">Li et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B135">Yi et&#xa0;al., 2021</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Faster R-CNN</td>
<td valign="top" align="left">Replacing region proposals with RPN to speed up training and accuracy</td>
<td valign="top" align="left">The model is complex and the spatial quantification is rough</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B24">Cynthia et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B140">Zhang et&#xa0;al., 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">R-FCN</td>
<td valign="top" align="left">Improved positioning accuracy</td>
<td valign="top" align="left">The model process is multifaceted and the amount of calculation is large</td>
<td valign="top" align="left">Object Detection</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B41">Gera et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B99">Nguyen, 2022</xref>; <xref ref-type="bibr" rid="B13">Cai and Zhang, 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">Mask R-CNN</td>
<td valign="top" align="left">Solve the misalignment between the feature map and the original image, combining detection and segmentation</td>
<td valign="top" align="left">Instance segmentation is expensive</td>
<td valign="top" align="left">Object detection, instance segmentation</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B62">Jian et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B119">Storey et&#xa0;al., 2022</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It can be seen from <xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref> that each two-stage object detection algorithm has made up for the faults of its predecessor, but problems such as large model scale and slow detection speed remain unsolved. In this regard, some researchers proposed transforming object detection into a regression problem, simplifying the algorithm model and increasing the detection speed while maintaining detection accuracy.</p>
</sec>
</sec>
<sec id="s3_2">
<title>3.2 Single-stage object detection algorithm</title>
<p>The single-stage object detection technique, also known as the regression-based object detection algorithm, skips the candidate region generation stage and obtains object classification and position information directly. It is typically represented by the YOLO and SSD series.</p>
<sec id="s3_2_1">
<title>3.2.1 YOLO object detection algorithm</title>
<p>
<xref ref-type="bibr" rid="B106">Redmon et&#xa0;al. (2016)</xref> proposed the YOLO (You Only Look Once) detector in 2016. The YOLO architecture comprises 24 convolutional layers and 2 FC layers, with the topmost feature map predicting bounding boxes and evaluating the likelihood of each class directly. The following loss function is optimized during training:</p>
<disp-formula>
<label>(8)</label>
<mml:math display="block" id="M9">
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:msub>
<mml:mi>&#x3bb;</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>B</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">[</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>  <mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>  <mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>&#x3bb;</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>B</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi> <mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msqrt>
<mml:mo>&#x2212;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>w</mml:mi> <mml:mo>^</mml:mo>
</mml:mover>   <mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi> <mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msqrt>
<mml:mo>&#x2212;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>h</mml:mi> <mml:mo>^</mml:mo>
</mml:mover>   <mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mo>+</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>B</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>C</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>  <mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>&#x3bb;</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>B</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>C</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="right">
<mml:mtd columnalign="right">
<mml:mo>+</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>p</mml:mi>
<mml:mo>^</mml:mo>
</mml:mover>  <mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>where (x<sub>i</sub>,y<sub>i</sub>) denotes the center of the box predicted in grid cell i relative to the cell bounds, and (w<sub>i</sub>,h<sub>i</sub>) are the width and height normalized by the image size. C<sub>i</sub> represents the confidence score, the presence of an object in cell i is indicated by <inline-formula>
<mml:math display="inline" id="im6">
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, and the fact that the prediction is made by the j<sup>th</sup> bounding box predictor in cell i is indicated by <inline-formula>
<mml:math display="inline" id="im7">
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x301b;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
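As a concrete illustration of the coordinate terms of (8), the following NumPy sketch applies the indicator mask and the square-root trick on widths and heights. The array shapes and all names are our assumptions, not from the paper:

```python
import numpy as np

def yolo_coord_loss(pred, truth, responsible, lambda_coord=5.0):
    """Coordinate terms of the YOLO loss (8) for one image.

    pred, truth: arrays of shape (cells, boxes, 4) holding (x, y, w, h)
    with non-negative w, h; responsible: 0/1 mask of shape (cells, boxes)
    playing the role of the indicator for "box j in cell i is responsible".
    Width/height errors are taken on square roots so that deviations on
    large boxes do not dominate those on small boxes.
    """
    xy_err = ((pred[..., :2] - truth[..., :2]) ** 2).sum(-1)
    wh_err = ((np.sqrt(pred[..., 2:]) - np.sqrt(truth[..., 2:])) ** 2).sum(-1)
    return lambda_coord * (responsible * (xy_err + wh_err)).sum()

pred = np.array([[[0.5, 0.5, 4.0, 1.0]]])   # one cell, one box
truth = np.array([[[0.5, 0.5, 1.0, 1.0]]])
mask = np.array([[1.0]])
# sqrt(4) - sqrt(1) = 1, squared and scaled by lambda_coord = 5.
assert yolo_coord_loss(pred, truth, mask) == 5.0
```

The remaining confidence and class-probability terms of (8) are plain squared errors weighted by the same indicator masks and &#x3bb;<sub>noobj</sub>.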
<p>The technique eliminates the candidate region generation stage and merges feature extraction, regression, and classification into a single network. YOLO runs in real time at 45 frames per second with an average detection accuracy (mAP) of 63.4%. On the other hand, YOLO detects small-scale objects poorly and easily misses detections in scenes where objects overlap and occlude one another.</p>
<p>
<xref ref-type="bibr" rid="B145">Zhou et&#xa0;al. (2022)</xref> proposed YOLOv5 with a total of four network models: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5 is very fast: inference takes 0.007 s per image, i.e., about 140 frames/s. However, the YOLO series generalizes poorly to objects of uncommon scale and requires multiple downsampling steps to obtain stable features. Moreover, owing to the spatial constraints of its bounding-box prediction, its small-object detection performance is poor.</p>
</sec>
<sec id="s3_2_2">
<title>3.2.2 SSD object detection algorithm</title>
<p>
<xref ref-type="bibr" rid="B83">Liu et&#xa0;al. (2016)</xref> introduced the SSD (Single Shot multi-box Detector) algorithm to balance detection accuracy and detection speed by combining the advantages of Faster R-CNN and YOLO. SSD uses the VGG-16 backbone network for feature extraction, converts FC6 and FC7 into convolutional layers, and appends four additional convolutional levels. SSD also predicts target categories and positions from the candidate boxes collected by anchors at multiple scales. The benefits of this mechanism are: (1) the convolutional layers predict the target location and category, reducing the amount of computation; (2) the detection process has no spatial limitations, allowing it to detect clusters of small targets effectively. SSD runs at 59 frames/s on an Nvidia Titan X, significantly faster than YOLO, and its mAP on the VOC2007 dataset reaches 79.8%, higher than that of Faster R-CNN.</p>
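The multi-scale default boxes mentioned above follow, in the original SSD paper, a simple linear scale rule per feature map: shallow maps get small boxes, deep maps large ones. A small sketch, using the paper's defaults s<sub>min</sub> = 0.2 and s<sub>max</sub> = 0.9 (the function name is ours):

```python
def ssd_scales(m, s_min=0.2, s_max=0.9):
    """Default-box scale (as a fraction of the input image size) for each
    of m feature maps, spaced linearly between s_min and s_max."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]

scales = ssd_scales(6)
# First (shallowest) map uses the smallest boxes, last the largest.
assert abs(scales[0] - 0.2) < 1e-9 and abs(scales[-1] - 0.9) < 1e-9
```

Each feature map then tiles default boxes of that scale at several aspect ratios, which is how SSD covers small and large objects from a single pass.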
</sec>
<sec id="s3_2_3">
<title>3.2.3 RetinaNet algorithm</title>
<p>
<xref ref-type="bibr" rid="B80">Lin et&#xa0;al. (2017b)</xref> borrowed the ideas of Faster R-CNN and multi-scale object detection (<xref ref-type="bibr" rid="B35">Erhan et&#xa0;al., 2014</xref>) to design and train the RetinaNet object detector. Its chief idea is to reshape the loss function into the Focal Loss, thereby addressing the class imbalance between positive and negative training samples that plagued previous detection models. RetinaNet is a single network composed of a ResNet backbone and two task-specific FCN subnetworks: the backbone computes convolutional features over the entire image, one subnetwork performs classification on the backbone&#x2019;s output, and the other performs convolutional bounding-box regression.</p>
<p>In one-stage detectors, the class imbalance between foreground and background is the main obstacle to the convergence of network training. During training, Focal Loss down-weights the many easy negative examples and focuses on hard samples. By handling the unbalanced positive and negative instances in the loss rather than by resampling, RetinaNet inherits the speed of single-stage detectors. Experimental results show that on the MS COCO test set, RetinaNet with the ResNet-101-FPN backbone improves AP by 6% over DSSD513, and with ResNeXt-101-FPN the AP improves by 9%.</p>
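The Focal Loss itself is compact enough to state directly. This sketch uses the paper's defaults &#x3b1; = 0.25 and &#x3b3; = 2 in the binary form (names ours):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p is the predicted foreground probability, y the 0/1 label. With
    gamma = 2, a well-classified example (p_t near 1) contributes almost
    nothing, so the loss is dominated by hard examples -- RetinaNet's
    answer to the extreme foreground/background imbalance.
    """
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, confidently rejected negative is heavily down-weighted
# compared with a hard, misclassified one.
easy = focal_loss(np.array(0.01), np.array(0))   # p_t = 0.99
hard = focal_loss(np.array(0.99), np.array(0))   # p_t = 0.01
assert easy < hard
```

Setting &#x3b3; = 0 recovers ordinary &#x3b1;-weighted cross-entropy, which makes the modulating factor (1 - p<sub>t</sub>)<sup>&#x3b3;</sup> the only new ingredient.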
</sec>
<sec id="s3_2_4">
<title>3.2.4 Tiny RetinaNet algorithm</title>
<p>
<xref ref-type="bibr" rid="B18">Cheng et&#xa0;al. (2020)</xref> proposed Tiny RetinaNet, which uses MobileNetV2-FPN as the backbone network for feature extraction and is primarily composed of a Stem block, SEnet, and two task-specific subnets, improving accuracy while reducing information loss. Its mAPs on the PASCAL VOC2007 and PASCAL VOC2012 datasets are 71.4% and 73.8%, respectively.</p>
</sec>
<sec id="s3_2_5">
<title>3.2.5 M2Det algorithm</title>
<p>
<xref ref-type="bibr" rid="B142">Zhao et&#xa0;al. (2019)</xref> proposed M2Det, based on a Multi-Level Feature Pyramid Network (MLFPN), to handle the scale variation between target instances. The model builds the final feature pyramid in three steps: (1) multi-layer features are extracted from many layers of the backbone network and fused into base features; (2) the base features are fed into blocks formed by alternately connecting TUMs (Thinned U-shape Modules) and FFMs (Feature Fusion Modules), each TUM producing decoding layers that serve as input to the next step; (3) decoding layers of equivalent scale are integrated to construct a feature pyramid of multi-level features. With a VGG backbone, M2Det obtains 41.0% AP at 1.8 frames/s using the single-scale inference strategy on the MS COCO test set, and 44.2% AP using the multi-scale inference strategy.</p>
</sec>
<sec id="s3_2_6">
<title>3.2.6 Comparison of single-stage object detection algorithms</title>
<p>The single-stage object detection algorithm was developed later than the two-stage object detection algorithm, but its simplified structure and efficient computation have attracted the interest of many researchers and driven its rapid development. Single-stage object detection algorithms are typically fast, but their detection precision has generally been inferior to that of two-stage detection methods. With the rapid advancement of computer vision, the speed and accuracy of current single-stage object detection frameworks have substantially increased. <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref> reviews the backbone networks of the single-stage detection algorithms together with their detection accuracy (mAP) and detection speed on the PASCAL VOC2007, PASCAL VOC2012 and COCO test sets, and <xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref> summarizes the advantages, disadvantages and applicable situations of the single-stage object detection algorithms.</p>
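<p>The mAP figures compared here all rest on the same matching criterion: a prediction counts as a true positive when its intersection-over-union (IoU) with a ground-truth box exceeds a threshold (0.5 for PASCAL VOC). A minimal sketch with boxes as (x1, y1, x2, y2) tuples:</p>

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

same = iou((0, 0, 10, 10), (0, 0, 10, 10))   # identical boxes: 1.0
half = iou((0, 0, 10, 10), (5, 0, 15, 10))   # half-width shift: 50/150 = 1/3
```

Note that a box shifted by half its width already falls to IoU 1/3, below the VOC threshold, which is one reason small-object localization errors are punished so heavily in these benchmarks.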
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>Performance assessment of single-stage Object detection algorithms in different datasets.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g003.tif"/>
</fig>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Advantages, disadvantages, and applicable situations of single-stage Object detection algorithms.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Model</th>
<th valign="top" align="center">Advantage</th>
<th valign="top" align="center">Disadvantage</th>
<th valign="top" align="center">Applicable</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">YOLO</td>
<td valign="top" align="left">Divide the image into grid cells for fast detection</td>
<td valign="top" align="left">Not good for dense and small object detection</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">YOLOv2</td>
<td valign="top" align="left">Use clustering to make anchor boxes to improve classification precision</td>
<td valign="top" align="left">Using pre-training, difficult to transfer</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">YOLOv3</td>
<td valign="top" align="left">Using the residual learning idea to realize multi-scale detection</td>
<td valign="top" align="left">The model is complex, and the detection effect of medium and large-scale objects is poor</td>
<td valign="top" align="left">Multi-scale object detection</td>
</tr>
<tr>
<td valign="top" align="left">YOLOv4</td>
<td valign="top" align="left">Excellent trade-off of detection accuracy and detection speed</td>
<td valign="top" align="left">Detection precision needs to be better</td>
<td valign="top" align="left">High-precision real-time object detection</td>
</tr>
<tr>
<td valign="top" align="left">YOLOv5</td>
<td valign="top" align="left">Small model size, lower deployment costs, high flexibility, and high detection speed</td>
<td valign="top" align="left">Performance needs to be improved</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">SSD</td>
<td valign="top" align="left">Multi-scale anchor box discretization of boundary space</td>
<td valign="top" align="left">The accuracy rate is low, the model is difficult to converge, and the detection effect of small targets is not improved.</td>
<td valign="top" align="left">Multi-scale object detection</td>
</tr>
<tr>
<td valign="top" align="left">DSSD</td>
<td valign="top" align="left">Use ResNet-101 as the backbone network to improve the detection consequence of small objects</td>
<td valign="top" align="left">Slow detection speed compared to SSD</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">R-SSD</td>
<td valign="top" align="left">Improved feature fusion method to improve detection accuracy</td>
<td valign="top" align="left">The model calculation is complex, and the detection speed is average</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">F-SSD</td>
<td valign="top" align="left">Reconstruct the pyramid feature map to fuse features of different scales, which is beneficial to small object detection</td>
<td valign="top" align="left">Slow detection speed compared to SSD</td>
<td valign="top" align="left">Multi-scale object detection</td>
</tr>
<tr>
<td valign="top" align="left">DSOD</td>
<td valign="top" align="left">No pretraining required</td>
<td valign="top" align="left">Normal detection speed</td>
<td valign="top" align="left">Object Detection</td>
</tr>
<tr>
<td valign="top" align="left">RetinaNet</td>
<td valign="top" align="left">Optimize the ratio of positive and negative samples through Focal Loss</td>
<td valign="top" align="left">When training with dense samples, it will cause sample imbalance</td>
<td valign="top" align="left">Lightweight, multi-scale object detection</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref> shows how single-stage object detection algorithms improve detection performance by employing feature pyramids to deal with pose changes and small-object detection problems, novel training strategies, data augmentation, combinations of different backbone networks, multiple detection frameworks, and other techniques. The YOLO series is not well suited to small-scale and dense object detection, and the SSD series has improved on this to achieve high-precision, multi-scale detection.</p>
</sec>
</sec>
<sec id="s3_3">
<title>3.3 Object detection algorithm based on Generative Adversarial Networks</title>
<p>
<xref ref-type="bibr" rid="B45">Goodfellow et&#xa0;al. (2014)</xref> proposed Generative Adversarial Networks (GANs), unsupervised generative models that learn a data distribution through adversarial training between a generator and a discriminator. The idea behind adversarial learning in detection is to train the detection network with an adversarial network that generates occluded and deformed image samples, and GANs are among the most widely used generative methods for modeling data distributions. A GAN is more than just an image generator; it can also use training data to support object detection, segmentation, and classification tasks across various domains.</p>
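<p>The adversarial game can be made concrete through the losses of the original minimax formulation: the discriminator minimizes -log D(x) - log(1 - D(G(z))) while the generator (in the non-saturating variant) minimizes -log D(G(z)). A toy sketch over illustrative discriminator scores (the numbers are assumptions, not results from any cited model):</p>

```python
import math

def discriminator_loss(d_real, d_fake):
    """Loss the discriminator minimizes: it should score real samples
    near 1 and generated samples near 0."""
    return (-sum(math.log(p) for p in d_real) / len(d_real)
            - sum(math.log(1.0 - p) for p in d_fake) / len(d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss from the original GAN paper:
    the generator is rewarded when the discriminator scores fakes near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# When the discriminator is fooled (fakes scored near 1), the generator's
# loss shrinks; when fakes are caught, it grows:
fooled = generator_loss([0.9, 0.8])
caught = generator_loss([0.2, 0.1])
```

The two losses pull in opposite directions on the same scores, which is exactly the tension that adversarial detection methods exploit to manufacture hard training samples.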
<sec id="s3_3_1">
<title>3.3.1 A-Fast-RCNN algorithm</title>
<p>
<xref ref-type="bibr" rid="B128">Wang et&#xa0;al. (2017)</xref> introduced the idea of adversarial networks and proposed the A-Fast-RCNN algorithm, which uses an adversarial network to generate hard positive samples. Unlike traditional methods that directly generate sample images, this method applies transformations on the feature map: (1) the Adversarial Spatial Dropout Network (ASDN), which deals with occlusion, adds a Mask layer to realize partial occlusion of the features, with the mask selected according to the loss; (2) the Adversarial Spatial Transformer Network (ASTN), which deals with deformation, achieves partial deformation of features by manipulating the corresponding feature channels. ASDN and ASTN are two different variants, and by combining them (the ASDN output serving as the ASTN input), the detector can be trained more robustly. In comparison with the OHEM (Online Hard Example Mining) method, on the VOC 2007 dataset, the method is slightly better (71.4% vs. 69.9%), while on the VOC 2012 dataset, OHEM is better (69.0% vs. 69.8%). The introduction of adversarial networks into object detection set a precedent, although in terms of improvement it is not as effective as OHEM, and some occlusion samples may lead to misclassification. <xref ref-type="table" rid="T4">
<bold>Table 4</bold>
</xref> shows data augmentation-based object detection in multimedia, agriculture and remote sensing.</p>
<table-wrap id="T4" position="float">
<label>Table&#xa0;4</label>
<caption>
<p>Data Augmentation-based object detection in Multimedia, Agriculture and Remote sensing.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Reference (Multimedia, Agriculture and Remote sensing)</th>
<th valign="top" align="center">Method description</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B48">Haruna et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="left">To improve the accuracy of deep learning models for identifying rice leaf disease, a GAN-based data augmentation pipeline was built with the state-of-the-art StyleGAN2-ADA and the variance of the Laplacian filter to generate high-quality synthetic rice leaf disease images.</td>
</tr>
<tr>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B7">Bhakta et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="left">Using state-of-the-art Generative Adversarial Network (GAN) technology, thermal images of a rice plant with bacterial leaf blight are simulated.</td>
</tr>
<tr>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B85">Liu et&#xa0;al., 2021</xref>)</td>
<td valign="top" align="left">A multiscale attention module that boosts the Cycle-Consistent Adversarial Network (CycleGAN) in both spatial and channel dimensions to boost the quality of synthetic images.</td>
</tr>
<tr>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B134">Yan et&#xa0;al., 2019</xref>)</td>
<td valign="top" align="left">The dataset was used to train a faster region-based convolutional neural network (Faster R-CNN) built on a ResNet-101 network, which was then used to classify both synthetic and real images.</td>
</tr>
<tr>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B12">Bosquet et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="left">Synthetic data of superior quality achieved by combining a GAN with image inpainting and blending.<break/>DS-GAN can generate plausible small objects.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s3_3_2">
<title>3.3.2 SOD-MTGAN algorithm</title>
<p>
<xref ref-type="bibr" rid="B5">Bai et&#xa0;al. (2018)</xref> developed an end-to-end multi-task generative adversarial network (Small Object Detection <italic>via</italic> Multi-Task Generative Adversarial Network, SOD-MTGAN) technique in 2018 to increase small object detection accuracy. It uses a super-resolution network to upsample small blurred images into fine-scale images and recover detailed information for more accurate detection. Furthermore, during the training phase, the discriminator&#x2019;s classification and regression losses are back-propagated into the generator to provide more specific information for detection. Extensive experiments on the COCO dataset demonstrate that the method is effective in recovering clear super-resolved images from blurred small images, and that it outperforms the state-of-the-art in detection performance (particularly for small objects).</p>
</sec>
<sec id="s3_3_3">
<title>3.3.3 SAGAN algorithm</title>
<p>In traditional convolutional Generative Adversarial Networks (CGANs), high-resolution details are generated as a function of only spatially local points on low-resolution feature maps. The Self-Attention Generative Adversarial Network (SAGAN) proposed by <xref ref-type="bibr" rid="B138">Zhang et&#xa0;al. (2019)</xref> allows attention-driven, long-range dependency modeling for image generation tasks. It can generate details from cues at all feature locations, and also applies spectral normalization to improve training dynamics, with remarkable results.</p>
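<p>The self-attention operation at the heart of SAGAN can be sketched in a few lines of NumPy: every spatial position attends to every other position, so a detail can be generated from cues anywhere in the feature map. This simplified single-head version omits SAGAN&#x2019;s learned 1&#xd7;1 query/key/value projections and residual gamma parameter:</p>

```python
import numpy as np

def self_attention(x):
    """Simplified self-attention over a flattened feature map x of shape
    (N, C), where N = H*W spatial positions and C channels. Uses identity
    projections in place of SAGAN's learned 1x1 convolutions."""
    q, k, v = x, x, x                        # identity projections (assumption)
    scores = q @ k.T / np.sqrt(x.shape[1])   # (N, N) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return attn @ v                          # every output mixes all positions

x = np.arange(12, dtype=float).reshape(4, 3)  # 4 positions, 3 channels
out = self_attention(x)
```

The (N, N) attention matrix is what distinguishes this layer from a convolution: its receptive field is the entire feature map rather than a local window.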
</sec>
<sec id="s3_3_4">
<title>3.3.4 Your local GAN algorithm</title>
<p>
<xref ref-type="bibr" rid="B26">Daras et&#xa0;al. (2020)</xref> proposed a two-dimensional local attention mechanism for generative models (2DLAMGM) and introduced a new local sparse attention layer that preserves 2D geometry and locality. It replaces the dense attention layer of SAGAN (Self-Attention Generative Adversarial Networks) and, on ImageNet, improves the FID score from 18.65 to 15.94. The sparse attention patterns of the proposed layers are designed using a new information-theoretic criterion based on information flow graphs, and a new method for inverting the attention of adversarial generative networks is also proposed.</p>
</sec>
<sec id="s3_3_5">
<title>3.3.5 MSG-GAN stabilized image synthesis algorithm</title>
<p>GANs, although partially successful in image synthesis tasks, have struggled to adapt to different datasets, in part due to instability during training and sensitivity to hyperparameters. One cause of this instability is that when the supports of the real and fake distributions do not overlap sufficiently, the gradients passed from the discriminator to the generator become uninformative. In response to these problems, <xref ref-type="bibr" rid="B66">Karnewar and Wang (2019)</xref> proposed the Multi-Scale Gradient Generative Adversarial Network (MSG-GAN), which allows gradients to flow from the discriminator to the generator at multiple scales, providing a stable method for high-resolution image synthesis. MSG-GAN converges stably on datasets of different sizes, resolutions, and domains, as well as with different loss functions and architectures.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Deep learning-based object detection algorithm improvement</title>
<p>The rapid development of deep learning has increased the feasibility of improving various classical object detection algorithms in many ways. This section summarizes the main popular improvement methods in terms of data processing, model construction, prediction targets and loss calculation, and discusses their characteristics, so that an appropriate improvement scheme can be chosen for each problem. The improvement schemes corresponding to the stages of the detection pipeline are shown in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>The corresponding improvement scheme of algorithm detection flow <bold>(A)</bold> Augmentation <bold>(B)</bold> Deep Learning <bold>(C)</bold> Results.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g004.tif"/>
</fig>
<sec id="s4_1">
<title>4.1 Data processing</title>
<sec id="s4_1_1">
<title>4.1.1 Data augmentation</title>
<p>In deep learning-based object detection algorithms, data augmentation techniques fall into two types: supervised and unsupervised. Supervised data augmentation methods can be separated into three classes: geometric transformations, color transformations, and hybrid transformations; unsupervised data augmentation methods can be divided into two types: generating new data and learning new augmentation strategies.</p>
<p>Currently, research on supervised data augmentation strategies has matured, and combining multiple data augmentation techniques to improve model performance has become the main requirement. The main reasons are as follows:</p>
<list list-type="order">
<list-item>
<p>The widespread use of supervised data augmentation methods has, to a certain extent, left unsupervised methods less valued;</p>
</list-item>
<list-item>
<p>Object detection algorithms are gradually developing towards end-to-end networks in which data augmentation is integrated into the algorithm itself; however, unsupervised data augmentation methods are difficult to integrate because of their complexity and heavy computation, so their scope of application is limited;</p>
</list-item>
<list-item>
<p>The generative adversarial network and reinforcement learning techniques required by unsupervised data augmentation methods are complex and diverse, which hinders researchers&#x2019; exploration.</p>
</list-item>
</list>
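<p>A supervised geometric augmentation for detection must transform the bounding-box labels together with the pixels. A minimal sketch of a horizontal flip (illustrative, with the image as a nested list of pixel values and boxes as (x1, y1, x2, y2) tuples):</p>

```python
def hflip_sample(image, boxes):
    """Horizontally flip an image (list of pixel rows) together with its
    bounding boxes, keeping the labels consistent with the flipped pixels."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    flipped_boxes = [(width - x2, y1, width - x1, y2)
                     for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
# A box hugging the left edge ends up hugging the right edge:
_, boxes = hflip_sample(img, [(0, 0, 1, 2)])
```

Color transformations, by contrast, leave the boxes untouched, which is why geometric and color augmentations are usually implemented as separate, composable steps.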
</sec>
</sec>
<sec id="s4_2">
<title>4.2 Model construction</title>
<sec id="s4_2_1">
<title>4.2.1 Improve the network structure</title>
<p>In 2015, the ResNet network first proposed the residual block, which allowed convolutional networks to become deeper while being less prone to degradation. As an improvement on ResNet, the DenseNet network <xref ref-type="bibr" rid="B56">Huang et&#xa0;al. (2017)</xref> achieves feature reuse by establishing dense connections between all preceding layers and the current layer, and can achieve better performance than ResNet with fewer parameters and less computational cost. The core part of the GoogLeNet network is the Inception module, which extracts image feature information through convolution kernels of different sizes and uses a 1&#xd7;1 convolution kernel for dimensionality reduction, significantly reducing the amount of computation. Feature Pyramid Networks (FPN) Lin et&#xa0;al. (2017) have made outstanding contributions to identifying small objects. As an improvement on FPN, the PANet network <xref ref-type="bibr" rid="B86">Liu et&#xa0;al. (2018)</xref> adds a bottom-up information transfer path to FPN to make up for the insufficient utilization of low-level features. The structure is shown in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>.</p>
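<p>The difference between the two connection schemes can be summarized in two lines: a residual block adds its input to its output (y = x + F(x)), while a dense block concatenates the outputs of all preceding layers. A schematic sketch with plain Python lists standing in for feature maps (the layer functions are placeholders, not real convolutions):</p>

```python
def residual_block(x, f):
    """ResNet-style skip connection: output = x + f(x), elementwise."""
    fx = f(x)
    return [a + b for a, b in zip(x, fx)]

def dense_block(x, layers):
    """DenseNet-style connectivity: each layer sees the concatenation of the
    block input and every earlier layer's output; the block returns them all."""
    features = [x]
    for layer in layers:
        concat = [v for feat in features for v in feat]
        features.append(layer(concat))
    return [v for feat in features for v in feat]

double = lambda vec: [2 * v for v in vec]      # placeholder "layer"
out_res = residual_block([1.0, 2.0], double)   # [1+2, 2+4] = [3.0, 6.0]
```

The addition in the residual block keeps the channel count fixed, whereas the concatenation in the dense block grows it with every layer; this is the source of DenseNet's feature reuse and also of its memory cost noted in Table 5.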
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>PANet model steps <bold>(A)</bold> FPN Backbone Network <bold>(B)</bold> Bottom Up Path Enhancement <bold>(C)</bold> Adaptive feature pooling <bold>(D)</bold> Fully Connected fusion.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g005.tif"/>
</fig>
<p>The presence of fully connected layers forces the input image size to be uniform; SPP-Net <xref ref-type="bibr" rid="B51">He et&#xa0;al. (2015)</xref> solves this problem so that the input image size is no longer restricted. EfficientNet <xref ref-type="bibr" rid="B120">Tan and Le (2019)</xref> does not pursue growth in a single dimension (depth, width, or image resolution) to improve overall model precision, but instead explores the best combination of these three dimensions. Based on EfficientNet, <xref ref-type="bibr" rid="B121">Tan et&#xa0;al. (2020)</xref> proposed a family of object detection frameworks, EfficientDet, which achieves good performance under different levels of resource constraints. A comparison of the above networks is shown in <xref ref-type="table" rid="T5">
<bold>Table&#xa0;5</bold>
</xref>.</p>
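<p>EfficientNet&#x2019;s compound scaling picks a single coefficient phi and scales depth, width, and resolution together as d = alpha^phi, w = beta^phi, r = gamma^phi, under the constraint alpha &#xb7; beta&#xb2; &#xb7; gamma&#xb2; &#x2248; 2 so that each increment of phi roughly doubles FLOPS. A sketch using the base coefficients reported in the EfficientNet paper (alpha = 1.2, beta = 1.1, gamma = 1.15):</p>

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet compound scaling: multipliers for network depth, width
    and input resolution derived from one compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPS grow roughly with depth * width^2 * resolution^2, so the constraint
# alpha * beta^2 * gamma^2 ~= 2 means each unit of phi about doubles the cost:
d, w, r = compound_scale(1)
flops_factor = d * w * w * r * r   # ~1.92, close to 2
```

This is why the table above notes both the accuracy gains and the hardware cost: each step up the B0&#x2013;B7 family trades a predictable doubling of compute for accuracy.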
<table-wrap id="T5" position="float">
<label>Table&#xa0;5</label>
<caption>    <p>Comparison of advantages and disadvantages of related networks.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Network name</th>
<th valign="top" align="center">Advantage</th>
<th valign="top" align="center">Disadvantage</th>
<th valign="top" align="center">References of applications in Multimedia, Agriculture and Remote Sensing</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SPP-Net</td>
<td valign="top" align="left">Facilitate multi-scale training</td>
<td valign="top" align="left">Requires huge storage space for feature extraction and SVM classification tasks</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B29">Ding et&#xa0;al., 2018</xref>; <xref ref-type="bibr" rid="B40">Gao et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B50">Hespeler et&#xa0;al., 2021</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">GoogLeNet</td>
<td valign="top" align="left">Use a 1&#xd7;1 convolution kernel to reduce the amount of computation; increase the width of the single-layer convolution to improve the network&#x2019;s ability to extract features</td>
<td valign="top" align="left">There is still 5&#xd7;5 convolution kernels to increase the network operation; including more complex hyperparameters, each transformation needs to specify the size and number of convolution kernels</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B30">Ding et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B36">Eser, 2021</xref>; <xref ref-type="bibr" rid="B31">Diwan et&#xa0;al., 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">ResNet</td>
<td valign="top" align="left">The residual module adopts skip connection, which alleviates the problem of gradient disappearance and degradation caused by the network being too deep.</td>
<td valign="top" align="left">The number of parameters is large, and the hardware requirements are slightly higher; when the network is too deep, the mitigation of problems such as vanishing gradients is greatly reduced</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B143">Zhong et&#xa0;al., 2018</xref>; <xref ref-type="bibr" rid="B101">Pan et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B119">Storey et&#xa0;al., 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">DenseNet</td>
<td valign="top" align="left">Compared with ResNet, the amount of parameters and computation is greatly reduced, and the accuracy is improved; it effectively solves the problem of overfitting caused by too few data sets; dense connections are used to strengthen feature propagation</td>
<td valign="top" align="left">During training, since the splicing operation will re-open a new memory storage space to save the spliced feature information, it consumes a lot of memory.</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B146">Zhu et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B33">Dubey et&#xa0;al., 2023</xref>; <xref ref-type="bibr" rid="B55">Huang et&#xa0;al., 2017</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">FPN</td>
<td valign="top" align="left">Multi-scale feature fusion to improve the accuracy of small Object detection</td>
<td valign="top" align="left">Top-down structure, the underlying features are not fully utilized</td>    <td valign="top" align="left">(<xref ref-type="bibr" rid="B57">Hu et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B46">Gunturu et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B84">Liu et&#xa0;al., 2021</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">PANet</td>
<td valign="top" align="left">Make full use of high-level semantic information and low-level location information</td>
<td valign="top" align="left">In addition to the top-down structure, a bottom-up structure is also constructed, which requires a lot of additional computational overhead</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B20">Cheng et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B21">Chen et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B104">Piao et&#xa0;al., 2021</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">ResNeXt</td>
<td valign="top" align="left">The multi-branch network structure is simplified by grouping convolution; the overall performance is better than ResNet when the parameter quantity remains basically unchanged; the modular structure is easy to transplant;</td>
<td valign="top" align="left">Compared with the overall operation, grouped convolution is less efficient in hardware execution.</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B78">Lin et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B112">Savarimuthu, 2021</xref>; <xref ref-type="bibr" rid="B116">Shi et&#xa0;al., 2021</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">EfficientNet</td>
<td valign="top" align="left">The three dimensions of network depth, width and image resolution are well balanced; in the case of reducing the amount of parameters, the detection accuracy has been qualitatively improved</td>
<td valign="top" align="left">There are too many network layers, and the intermediate results of all layers need to be saved during gradient calculation, which requires high hardware and occupies a large amount of video memory; when the image size is too large, the training speed will be slowed down</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B1">Alhichri et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B100">Nguyen et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B17">Chatterjee et&#xa0;al., 2022</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">EfficientDet</td>
<td valign="top" align="left">The Bidirectional Feature Pyramid Network (BiFPN) proposed on the basis of PANet has cross-scale connections and weighted feature fusion, making feature detection more efficient; compound scaling is applied to depth, width, and resolution simultaneously to find the best combination, yielding more accurate and objective results; it is ahead of common object detection models such as YOLOv3 and Mask R-CNN in terms of accuracy and computational complexity</td>
<td valign="top" align="left">Given that it uses neural architecture search to find the optimal architecture, the time and hardware cost required to train the model is extremely high; the object detection framework has a poorly modular structure, which is not conducive to integration</td>
<td valign="top" align="left">(<xref ref-type="bibr" rid="B129">Wei et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B17">Chatterjee et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B6">Basavegowda et&#xa0;al., 2022</xref>)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Some scholars have introduced the above optimization schemes when improving the network structures of related models to obtain better detection results. The GoogLeNet literature is a typical example of optimizing the Inception module (<xref ref-type="bibr" rid="B115">Shi et&#xa0;al., 2017</xref>), and the optimization process is shown in <xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>.</p>
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>Inception modules <bold>(A)</bold> Inception original module <bold>(B)</bold> Replacing the 5*5 convolution kernel with a 3*3 convolution kernel <bold>(C)</bold> Single * n kernel <bold>(D)</bold> Inception V4.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g006.tif"/>
</fig>
<p>To further improve model detection accuracy, today&#x2019;s network structures gradually increase the depth (residual module), width (Inception module) and context feature extraction capability of the network model (<xref ref-type="bibr" rid="B88">Li et&#xa0;al., 2016</xref>; <xref ref-type="bibr" rid="B42">Ghiasi et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B14">Cao et&#xa0;al., 2020b</xref>). However, the resulting models are complicated and redundant, making the improved algorithms more difficult to apply in real-life scenarios.</p>
</sec>
</sec>
<sec id="s4_3">
<title>4.3 Other improved algorithms</title>
<p>At present, researchers have studied two-stage and single-stage object detection algorithms extensively, giving both a solid theoretical basis. The two-stage object detection algorithm has an advantage in detection accuracy and needs continuous improvement to increase detection speed; the single-stage object detection algorithm has an advantage in detection speed and needs continuous improvement to increase detection accuracy. Some researchers have therefore combined the two types of algorithm models to balance detection accuracy and detection speed, as shown in <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>.</p>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>The Evolution of mainstream GAN.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1041514-g007.tif"/>
</fig>
<p>In 2017, the RON (Reverse connection with Objectness prior Networks) <xref ref-type="bibr" rid="B68">Kong et&#xa0;al. (2017)</xref> algorithm combined, effectively and efficiently, the two-stage detection framework represented by Faster R-CNN and the single-stage detection framework represented by YOLO and SSD. Under a fully convolutional network, RON, like SSD, uses VGG-16 as the backbone network; the difference is that RON changes the 14th and 15th fully connected layers of VGG-16 into convolutional layers with a kernel size of 2&#xa0;&#xd7;&#xa0;2. In tests, RON achieves state-of-the-art object detection performance: with 384&#xd7;384 input images, the mAP reaches 81.3% on the PASCAL VOC2007 dataset and 80.7% on the PASCAL VOC2012 dataset. <xref ref-type="bibr" rid="B141">Zhang et&#xa0;al. (2018)</xref> designed the RefineDet algorithm, which inherits the advantages of single-stage and two-stage detectors. RefineDet uses VGG-16 or ResNet-101 as the backbone network for feature extraction, and integrates the neck structure (feature pyramid and feature fusion) into the head structure.</p>
</sec>
</sec>
<sec id="s5">
<title>5 Object detection and recognition applications in agriculture using AI</title>
<p>The use of computer vision technology to inspect agricultural products has the advantages of being real-time, objective, and non-destructive, so it is widely favored. <xref ref-type="bibr" rid="B111">Salda&#xf1;a et&#xa0;al. (2013)</xref> discussed methods of applying computer vision technology to detect mango weight and fruit surface damage, analyzed the algorithm for determining the required image area, and established the correlation between mango weight and its projected image. Experiments showed that the accuracy of fruit surface damage classification was 76% and 80%, respectively. <xref ref-type="bibr" rid="B118">Slaughter and Harrell (1989)</xref> were among the first to study the use of chromaticity and brightness information from images taken under natural light to guide a citrus harvesting manipulator, and established a classification model that identifies citrus on trees using the color information in color images. The classifier was 75% accurate in identifying oranges in the orchard&#x2019;s natural environment.</p>
<p>
<xref ref-type="bibr" rid="B55">Huang et&#xa0;al. (2017)</xref> realized the detection and localization of apples through pattern recognition, mainly using an algorithm that identifies apples by filtering and boundary extraction of the original apple-tree image and then calculating the outline of the apple relative to the shape of the image. <xref ref-type="bibr" rid="B127">Wang and Cheng (2004)</xref> studied methods for identifying the apple stem and fruit body and for searching for fruit surface defects. Based on the characteristics of the apple stem, block scanning was proposed to judge whether the stem exists; the different reflection characteristics of damaged and undamaged surfaces, together with the statistical characteristics of pixels at different gray values, were analyzed to find the damaged surface, and the damaged area was separated from the fruit pedicel and calyx. The judging accuracy for 15 images without fruit stems was 100%, and the accuracy for 90 pictures with intact fruit stems was 88%. <xref ref-type="bibr" rid="B92">Mahanti et&#xa0;al. (2021)</xref> used line-scan and analog cameras, respectively, to detect apple damage, and showed that detecting apple damage with digital image processing technology can at least reach the accuracy of manual classification.</p>
<p>
<xref ref-type="bibr" rid="B136">Ying et&#xa0;al. (2000)</xref> proposed a new computer vision method for recognizing the fruit stalk of Huanghua pear. A computer vision system captured images of the pears, and image processing separated the fruit from the background. Because conventional stalk detection is slow, a fast algorithm was proposed: exploiting the small diameter of the pear stalk, templates of different sizes are selected to determine whether a stalk is present in the image and to obtain the coordinates of the intersection between the head of the stalk and the bottom of the pear, and tangent-slope information is used to judge the integrity of the stalk. Tests showed that the algorithm judged the presence of the stalk with 100% accuracy, and judged whether the stalk was intact with better than 90% accuracy. <xref ref-type="bibr" rid="B75">Li et&#xa0;al. (2018)</xref> applied computer vision to detect bruising in pears and proposed distinguishing multiple bruises by region labeling. To improve the measurement accuracy of the bruised area, a mathematical model of the bruised area was established from the shape of the pear and the characteristics of the bruise; the method accurately detects multiple bruises on a pear, and the relative error of most measurements stays within 10%. <xref ref-type="bibr" rid="B102">Patel et&#xa0;al. (2012)</xref> experimentally studied machine vision detection of the external dimensions and surface condition of Huanghua pear. By determining the image processing window, using the Sobel operator for edge detection and the Hilditch algorithm for edge thinning, and locating the centroid to derive a representative fruit diameter, they obtained a correlation coefficient of 0.96 between the predicted fruit diameter and the actual size.
For fruit surface defect detection, they proposed using the abrupt change of the red (R) and green (G) color components at the boundary between damaged and sound tissue to obtain suspicious points, and then recovering the entire damaged surface by region growing. <xref ref-type="bibr" rid="B16">Chang (2022)</xref> developed a machine vision system for the quality inspection of Huanghua pear, compared the influence of light sources of different intensities and of different backgrounds on the collected images, and extended the system to the quality inspection of other fruits. <xref ref-type="bibr" rid="B23">Cubero et&#xa0;al. (2011)</xref> developed a machine vision system for Huanghua pear quality inspection by studying the pear&#x2019;s spectral reflection characteristics. To accommodate the random orientation and irregular shape of fruit in actual production, a size-detection method based on the minimum enclosing rectangle (MER) was designed to find the maximum transverse diameter; experimental verification yielded a regression between the actual and predicted maximum transverse diameter with a correlation coefficient of 0.9962. The variation of the R, G, and B gray levels in defect areas of Huanghua pear was also analyzed, and the defect pixels were finally merged into complete defect regions.</p>
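The region-growing step mentioned above can be sketched as a simple flood fill from a seed point. This is a generic illustration, not the cited implementation: the gray-level tolerance and 4-connectivity are assumptions for the example.

```python
import numpy as np
from collections import deque

def region_grow(gray, seed, tol=20):
    """Grow a 4-connected region from a seed pixel, admitting neighbors whose
    gray value stays within tol of the seed value. tol is illustrative."""
    h, w = gray.shape
    region = np.zeros((h, w), dtype=bool)
    seed_val = int(gray[seed])
    queue = deque([seed])
    region[seed] = True
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not region[ny, nx]
                    and abs(int(gray[ny, nx]) - seed_val) <= tol):
                region[ny, nx] = True
                queue.append((ny, nx))
    return region

gray = np.full((5, 5), 200, dtype=np.uint8)   # bright, sound surface
gray[1:4, 1:4] = 60                           # dark "bruise" patch
mask = region_grow(gray, seed=(2, 2))
```

Starting from a suspicious boundary point and growing inward is what lets a single detected edge pixel expand into the full damaged area.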
<p>
<xref ref-type="bibr" rid="B77">Li et&#xa0;al. (2022)</xref> analyzed the color characteristics of germ rice in color images and proposed a method that uses saturation (S) as the characteristic parameter to distinguish the germ from the endosperm, in order to realize automatic computer vision detection of the rice germ-retention rate. Experiments with the established identification indicators and methods showed that the results of the computer vision system agreed with manual inspection in over 88% of cases.</p>
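A saturation-based split of the kind described above could look like the sketch below. The threshold and the sample pixel values are invented for illustration; the idea is only that the yellowish germ is more strongly colored (higher S) than the near-white endosperm.

```python
import numpy as np

def saturation(rgb):
    """HSV-style saturation S = (max - min) / max per pixel."""
    rgb = rgb.astype(float)
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    return np.where(mx > 0, (mx - mn) / (mx + 1e-9), 0.0)

pixels = np.array([[200, 170, 90],    # yellowish germ-like pixel
                   [230, 228, 225]])  # near-white endosperm-like pixel
s = saturation(pixels)
is_germ = s > 0.3                     # illustrative threshold
```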
</sec>
<sec id="s6">
<title>6 Object detection and recognition applications in agriculture using AI</title>
<p>The detection and recognition of objects in remote sensing images is a current research focus in the field of target detection. AI has brought substantial improvements to many computer vision applications, and recent progress continues to refine these methods (<xref ref-type="bibr" rid="B98">Nawaz et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B97">Nawaz et&#xa0;al., 2021</xref>). The detection and recognition methods used can be divided into two types: target detection algorithms based on traditional methods and target detection algorithms based on deep learning. Commonly used traditional algorithms include histogram of oriented gradients (HOG) features combined with a support vector machine (SVM) classifier and the Deformable Parts Model (DPM); deep learning-based detection and recognition algorithms can be roughly grouped into two categories, namely the R-CNN series based on the two-stage approach and the YOLO series based on the one-stage approach (<xref ref-type="bibr" rid="B47">Han et&#xa0;al., 2022</xref>), together with the SSD (Single Shot MultiBox Detector) series (<xref ref-type="bibr" rid="B3">Arora et&#xa0;al., 2019</xref>).</p>
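The HOG half of the classic HOG + SVM pipeline can be sketched with plain numpy: compute gradient orientations, pool them into an orientation histogram weighted by gradient magnitude, and normalize. This single-cell descriptor is a simplification of real HOG (which uses cells and overlapping blocks); the resulting vector would then be fed to a linear SVM.

```python
import numpy as np

def hog_descriptor(gray, n_bins=9):
    """One 9-bin orientation histogram over the whole image, magnitude-weighted
    and L2-normalized -- a single-cell simplification of HOG."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)                         # row- and column-wise gradients
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180       # unsigned orientation
    hist, _ = np.histogram(angle, bins=n_bins, range=(0, 180),
                           weights=magnitude)
    norm = np.linalg.norm(hist) + 1e-9                 # L2 normalization
    return hist / norm

img = np.tile(np.arange(8, dtype=float), (8, 1))       # horizontal intensity ramp
desc = hog_descriptor(img)                             # all energy in the 0-degree bin
```

The descriptor captures edge-direction statistics while the normalization discards absolute brightness, which is what made HOG features robust enough for sliding-window detectors before deep learning.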
<p>Initially, information was extracted from remote sensing images mainly through manual visual interpretation, so the amount of information obtained depended entirely on the professional ability of technicians. After more than ten years of development, a new approach emerged that detects and recognizes targets through statistical models. For example, <xref ref-type="bibr" rid="B103">Peng et&#xa0;al. (2018)</xref> used the maximum likelihood method for remote sensing image classification to achieve higher classification accuracy. <xref ref-type="bibr" rid="B67">Kassim et&#xa0;al. (2021)</xref> proposed a multi-degree learning method, which first combined feature extraction with active learning and then added a K-means classification algorithm to improve performance. <xref ref-type="bibr" rid="B34">Du et&#xa0;al. (2012)</xref> proposed an adaptive binary tree SVM classifier, which further improved the classification accuracy of hyperspectral images. <xref ref-type="bibr" rid="B91">Luo et&#xa0;al. (2016)</xref> studied a small random forest algorithm aimed at the low accuracy and overfitting problems of decision trees. However, owing to their low detection accuracy and long runtimes, traditional target detection methods cannot meet the real-time requirements of practical applications.</p>
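The K-means step mentioned above can be illustrated with a toy Lloyd's-algorithm implementation clustering single-band pixel intensities into spectral classes. Real remote-sensing pipelines cluster multi-band feature vectors; the data and k here are invented for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain 1-D Lloyd's algorithm: alternate nearest-center assignment and
    center recomputation. Toy version for scalar features."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        centers = np.array([points[labels == c].mean() for c in range(k)])
    return labels, centers

pixels = np.array([10., 12., 11., 200., 198., 205.])   # two spectral classes
labels, centers = kmeans(pixels, k=2)
```

Unsupervised clustering like this gives a label map without training data, which is why it was attractive as a complement to feature extraction and active learning.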
<p>In 2006, Geoffrey Hinton and his students published a paper on deep learning (<xref ref-type="bibr" rid="B53">Hinton and Salakhutdinov, 2006</xref>), which opened the door to object detection and recognition using deep learning. In recent years, breakthroughs in deep learning theory have effectively improved the detection accuracy and speed of target detection algorithms, so that feature information in images can be extracted by deep learning; this has gradually replaced manual and traditional feature extraction and become the main direction of object detection research.</p>
<p>In the 2012 ImageNet competition, a multi-layer convolutional neural network trained on more than a million images achieved a classification error rate of only about 15%, nearly 11 percentage points better than the second-place entry (<xref ref-type="bibr" rid="B72">Krizhevsky et&#xa0;al., 2017</xref>). Since then, many researchers have used deep learning to detect and recognize remote sensing image targets and have achieved good results and many breakthroughs. <xref ref-type="bibr" rid="B94">Mnih and Hinton (2010)</xref> applied deep learning to two remote sensing image datasets, extracting road features for training with good experimental results; this was the first application of deep learning to remote sensing. <xref ref-type="bibr" rid="B148">Zou et&#xa0;al. (2015)</xref> developed a new feature extraction algorithm built on a deep belief network structure and achieved an accuracy of 77% in feature extraction experiments. <xref ref-type="bibr" rid="B59">Ienco et&#xa0;al. (2019)</xref> combined deep learning with a patch classification system to detect ground cover, with good detection results. <xref ref-type="bibr" rid="B130">Wei et&#xa0;al. (2017)</xref> developed a more accurate convolutional neural network for road structure feature extraction, with remarkable results on road extraction from aerial images. <xref ref-type="bibr" rid="B19">Cheng et&#xa0;al. (2018)</xref> proposed a rotation-invariant CNN (RICNN) model, which effectively addresses the technical difficulties of object detection in high-resolution remote sensing images.
These experiments with deep learning on remote sensing imagery show that extracting target features by constructing deep model structures can effectively improve detection. Edge detection based on geometric algebra methods has also been used to identify objects in remote sensing images (<xref ref-type="bibr" rid="B9">Bhatti et&#xa0;al., 2021</xref>).</p>
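The core operation behind all the CNN-based results above is convolution: even a single hand-set kernel turns raw pixels into an edge-response feature map, and deep networks stack many learned layers of the same operation. A minimal sketch (valid-mode convolution with a Sobel kernel; everything here is illustrative):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive valid-mode 2-D cross-correlation (the CNN 'convolution')."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.zeros((5, 6))
image[:, 3:] = 1.0                                       # vertical step edge
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
response = conv2d_valid(image, sobel_x)                  # peaks along the edge
```

In a trained network the kernel weights are learned rather than hand-set, but the mechanism of producing spatial feature maps is the same.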
</sec>
<sec id="s7">
<title>7 Challenges for object detection in agriculture</title>
<sec id="s7_1">
<title>7.1 Insufficient individual feature layers</title>
<p>Deep CNN architectures generate hierarchical feature maps through pooling and subsampling operations, so the feature maps at different layers have different spatial resolutions. As is generally known, early-layer feature maps have higher resolution and correspond to smaller receptive fields, but they lack the high-level semantic information necessary for object detection. Later-layer feature maps, on the other hand, carry the additional semantic information required to detect and classify objects under distinct placements and illuminations. Higher-level feature maps are valuable for classifying large objects, but they may not be enough to recognize small ones.</p>
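The resolution argument can be made concrete with back-of-envelope arithmetic (the input size, stage count, and object sizes below are illustrative): with stride-2 downsampling at each stage, a small object shrinks below a single cell on the deepest feature map while a large object still spans several cells.

```python
def feature_map_sizes(input_size, n_stages):
    """Spatial side length after each stride-2 stage of a CNN backbone."""
    sizes = [input_size]
    for _ in range(n_stages):
        sizes.append(sizes[-1] // 2)       # each stage halves the resolution
    return sizes

def object_cells(object_px, stride):
    """Extent of an object, in feature-map cells, at a given total stride."""
    return object_px / stride

maps = feature_map_sizes(640, 5)           # 640 -> 320 -> 160 -> 80 -> 40 -> 20
small = object_cells(16, stride=32)        # a 16-px object covers half a cell
large = object_cells(256, stride=32)       # a 256-px object covers 8 cells
```

This is why feature-pyramid-style designs fuse high-resolution early layers with semantically rich deep layers instead of detecting from the last layer alone.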
</sec>
<sec id="s7_2">
<title>7.2 Limited context information</title>
<p>Small objects usually have low resolution, which makes them difficult to distinguish. Contextual information is crucial in small object detection because small objects themselves carry limited information. Contextual information has been utilized in object recognition from the &#x201c;global&#x201d; image level down to the &#x201c;local&#x201d; image level: a global image level takes into account image statistics from the entire image, whereas a local image level takes into account contextual information from the objects&#x2019; surrounding areas. Contextual characteristics can be divided into three categories: local pixel context, semantic context, and spatial context.</p>
</sec>
<sec id="s7_3">
<title>7.3 Class imbalance</title>
<p>The term &#x201c;class imbalance&#x201d; refers to the unequal distribution of data between classes. There are two kinds of class imbalance in detection; one is the disparity between foreground and background instances. Region proposal networks are utilized in object detection to create candidate regions containing objects by densely scanning the entire image. The anchors are rectangular boxes tiled densely over the full input image, with scales and aspect ratios pre-determined from the sizes of target objects in the training dataset. When detecting small objects, the number of anchors generated per image is higher than when detecting large objects. Only those anchors with a high IoU against the ground-truth bounding boxes are positive examples; anchors with little or no overlap with the ground-truth boxes are treated as negative examples. The sparseness of ground-truth bounding boxes and the IoU matching procedure between ground truth and anchors are both drawbacks of the anchor-based object detection methodology, and the dense sliding-window strategy has high time complexity, making training time-consuming.</p>
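The IoU criterion that drives this foreground/background split is the standard intersection-over-union of two axis-aligned boxes; the boxes below are made-up examples.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 50)
anchor_pos = (12, 12, 52, 52)        # heavy overlap -> positive candidate
anchor_neg = (200, 200, 240, 240)    # no overlap    -> background
score_pos = iou(gt, anchor_pos)
score_neg = iou(gt, anchor_neg)
```

Because the overwhelming majority of densely tiled anchors score near zero against any ground-truth box, background examples vastly outnumber foreground ones, which is exactly the imbalance described above.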
</sec>
<sec id="s7_4">
<title>7.4 Insufficient positive examples</title>
<p>Most deep neural network object detection models are trained on objects of varying sizes. They usually work well with large objects but much less well with small ones. A lack of small-scale anchor boxes to match the small objects, as well as an inadequate number of examples properly matched to the ground truth, could be the cause. The anchors are feature mappings from certain intermediate layers of a deep neural network, projected back onto the original image, and anchors matching small objects are scarce. In addition, the anchors must match the ground-truth bounding boxes. A widely used matching method is as follows: an anchor with a high IoU score against a ground-truth bounding box, for example above 0.9, is a positive example; furthermore, the anchor with the highest IoU score for each ground-truth box is designated a positive example. Even so, small objects usually end up with only a limited number of anchors matching the ground-truth bounding boxes.</p>
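The matching rule described above can be sketched directly: anchors clearing an IoU threshold become positives, and additionally the single best anchor for each ground-truth box is forced positive so that a small object keeps at least one match even when nothing clears the threshold. Boxes and thresholds are illustrative.

```python
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_anchors(anchors, gts, pos_thresh=0.5):
    """Threshold rule plus best-anchor-per-ground-truth fallback."""
    ious = np.array([[iou(a, g) for g in gts] for a in anchors])
    positive = ious.max(axis=1) >= pos_thresh
    positive[ious.argmax(axis=0)] = True   # best anchor per gt forced positive
    return positive

anchors = [(0, 0, 40, 40), (8, 8, 24, 24), (100, 100, 140, 140)]
gts = [(12, 12, 18, 18)]                   # one tiny object: no anchor clears 0.5
flags = match_anchors(anchors, gts)        # fallback still yields one positive
```

Without the fallback, this tiny ground-truth box would contribute no positive example at all, which is the failure mode the section describes.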
</sec>
</sec>
<sec id="s8" sec-type="conclusions">
<title>8 Conclusion</title>
<p>Deep learning-based object detection has become a popular research area due to its powerful learning capability and its superiority in handling occlusion, scale variation, and background changes. In this paper, we introduce the development of object detection algorithms based on deep learning and summarize the two main types of object detectors, single-stage and two-stage. We analyze in depth the network structures, advantages, disadvantages, and applicable scenarios of various algorithms, and compare the experimental results of related algorithms on mainstream benchmark datasets. Finally, we summarize several application areas of object detection in order to comprehensively understand and analyze its future development trends.</p>
<sec id="s8_1">
<title>Future work</title>
<p>Based on the analysis and summary of the above knowledge, we propose the following directions for future research.</p>
<list list-type="bullet">
<list-item>
<p>Video object detection faces problems such as unevenly moving targets, tiny targets, truncation, and occlusion, and it is difficult to achieve both high precision and high efficiency. Therefore, studying multi-faceted data sources such as motion-based objects and video sequences will be one of the most promising future research areas.</p>
</list-item>
<list-item>
<p>Weakly supervised object detection models aim to detect large numbers of non-annotated objects by learning from a small set of fully annotated images. Therefore, efficiently training networks to high effectiveness from limited images annotated with target objects and bounding boxes is an essential issue for future research.</p>
</list-item>
<list-item>
<p>Region-specific detectors tend to perform better, achieving higher detection accuracy on predefined datasets. Therefore, developing a general object detector that can detect multi-domain objects without prior knowledge is a fundamental research direction in the future.</p>
</list-item>
<list-item>
<p>Remote sensing images are frequently employed in the military and agricultural sectors, where detection must often run in real time. Automated model deployment and integrated hardware components will aid the rapid development of these fields.</p>
</list-item>
</list>
</sec>
</sec>
<sec id="s9" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s10" sec-type="author-contributions">
<title>Author contributions</title>
<p>Funding acquisition: JL; Project administration: MS, SN, JL, UB, and RA; Writing &#x2013; original draft: SN. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s11" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s12" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alhichri</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Alswayed</surname> <given-names>A. S.</given-names>
</name>
<name>
<surname>Bazi</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Ammour</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Alajlan</surname> <given-names>N. A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Classification of remote sensing images using EfficientNet-B3 CNN model with attention</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>14078</fpage>&#x2013;<lpage>14094</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3051085</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Allen-Zhu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>What can resnet learn efficiently, going beyond kernels</article-title>? <source>Adv. Neural Inf. Process. Syst.</source> <volume>32</volume>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1905.10337</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arora</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Grover</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Chugh</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Reka</surname> <given-names>S. S.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Real time multi object detection for blind using single shot multibox detector</article-title>. <source>Wireless. Pers. Commun.</source> <volume>107</volume> (<issue>1</issue>), <fpage>651</fpage>&#x2013;<lpage>661</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11277-019-06294-1</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ashritha</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Banusri</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Namitha</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Duela</surname> <given-names>J. S.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Effective fault detection approach for cloud computing</article-title>,&#x201d; in <source>Journal of physics: Conference series</source>, vol. <volume>1979</volume>. (<publisher-loc>Sidney, Australia</publisher-loc>: <publisher-name>IOP Publishing</publisher-name>), <fpage>012061</fpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bai</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Ding</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Ghanem</surname> <given-names>B.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Sod-mtgan: Small object detection via multi-task generative adversarial network</article-title>,&#x201d; in <source>Proceedings of the European conference on computer vision (ECCV)</source> (<publisher-loc>Munich, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>206</fpage>&#x2013;<lpage>221</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Basavegowda</surname> <given-names>D. H.</given-names>
</name>
<name>
<surname>Mosebach</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Schleip</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Weltzien</surname> <given-names>C.</given-names>
</name>
</person-group> (<year>2022</year>). <source>Indicator plant species detection in grassland using EfficientDet object detector</source> (<publisher-loc>Bonn, Germany</publisher-loc>: <publisher-name>GIL-Jahrestagung, K&#xfc;nstliche Intelligenz in der Agrar-und Ern&#xe4;hrungswirtschaft</publisher-name>), <fpage>42</fpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bhakta</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Phadikar</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Majumder</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Thermal image augmentation with generative adversarial network for agricultural disease prediction</article-title>,&#x201d; in <source>International conference on computational intelligence in pattern recognition</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>345</fpage>&#x2013;<lpage>354</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bhatti</surname> <given-names>U. A.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Mehmood</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Han</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Recommendation system using feature extraction and pattern recognition in clinical care systems</article-title>. <source>Enterprise. Inf. Syst.</source> <volume>13</volume> (<issue>3</issue>), <fpage>329</fpage>&#x2013;<lpage>351</lpage>. doi: <pub-id pub-id-type="doi">10.1080/17517575.2018.1557256</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bhatti</surname> <given-names>U. A.</given-names>
</name>
<name>
<surname>Ming-Quan</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Qing-Song</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Ali</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Hussain</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Yuhuan</surname> <given-names>Y.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Advanced color edge detection using Clifford algebra in satellite images</article-title>. <source>IEEE Photonics. J.</source> <volume>13</volume> (<issue>2</issue>), <fpage>1</fpage>&#x2013;<lpage>20</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JPHOT.2021.3059703</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bingtao</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Xiaorui</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Yujiao</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Zhaohui</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Jianlei</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A high-accuracy infrared simulation model based on establishing the linear relationship between the outputs of different infrared imaging systems</article-title>. <source>Infrared. Phys. Technol.</source> <volume>69</volume>, <fpage>155</fpage>&#x2013;<lpage>163</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.infrared.2015.01.010</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bochkovskiy</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>C. Y.</given-names>
</name>
<name>
<surname>Liao</surname> <given-names>H. Y. M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Yolov4: Optimal speed and accuracy of object detection</article-title>. <source>arXiv. preprint. arXiv.</source> <volume>2004</volume>, <fpage>10934</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2004.10934</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bosquet</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Cores</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Seidenari</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Brea</surname> <given-names>V. M.</given-names>
</name>
<name>
<surname>Mucientes</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Del Bimbo</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A full data augmentation pipeline for small object detection based on generative adversarial networks</article-title>. <source>Pattern Recogn.</source> <volume>133</volume>, <fpage>108998</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.patcog.2022.108998</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cai</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Rotating target detection for remote sensing images based on dense attention</article-title>,&#x201d; in <source>International conference on computing, control and industrial engineering</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>50</fpage>&#x2013;<lpage>63</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Shi</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>b). <article-title>Attention-guided context feature pyramid network for object detection</article-title>. <source>arXiv. preprint. arXiv.</source> <volume>2005</volume>, <fpage>11475</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2005.11475</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cao</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Kong</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Xie</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>a). &#x201c;<article-title>Target detection algorithm based on improved multi-scale SSD</article-title>,&#x201d; in <source>Journal of physics: Conference series</source>, vol. <volume>1570</volume>. (<publisher-loc>Zhangjiajie, China</publisher-loc>: <publisher-name>IOP Publishing</publisher-name>), <fpage>012014</fpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chang</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Application of computer vision technology in post-harvest processing of fruits and vegetables: Starting from shape recognition algorithm</article-title>,&#x201d; in <source>2022 international conference on applied artificial intelligence and computing (ICAAIC)</source> (<publisher-loc>Salem, India</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>934</fpage>&#x2013;<lpage>937</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chatterjee</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Chatterjee</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Islam</surname> <given-names>S. K.</given-names>
</name>
<name>
<surname>Khan</surname> <given-names>M. K.</given-names>
</name>
</person-group> (<year>2022</year>). <source>An object detection-based few-shot learning approach for multimedia quality assessment, Multimedia Systems</source> (<publisher-name>Springer</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>14</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Bai</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). &#x201c;<article-title>Tiny-RetinaNet: a one-stage detector for real-time object detection</article-title>,&#x201d; in <source>Eleventh international conference on graphics and image processing (ICGIP 2019)</source>, vol. <volume>11373</volume>. (<publisher-loc>Hangzhou, China</publisher-loc>: <publisher-name>International Society for Optics and Photonics</publisher-name>), <fpage>113730R</fpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Han</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection</article-title>. <source>IEEE Trans. Image. Process.</source> <volume>28</volume> (<issue>1</issue>), <fpage>265</fpage>&#x2013;<lpage>278</lpage>. doi: <pub-id pub-id-type="doi">10.1109/tip.2018.2867198</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Si</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Hong</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Yao</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Cross-scale feature fusion for object detection in optical remote sensing images</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>18</volume> (<issue>3</issue>), <fpage>431</fpage>&#x2013;<lpage>435</lpage>. doi: <pub-id pub-id-type="doi">10.1109/lgrs.2020.2975541</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>J. W.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>W. J.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>H. J.</given-names>
</name>
<name>
<surname>Hung</surname> <given-names>C. L.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>C. Y.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>S. P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A smartphone-based application for scale pest detection using multiple-object detection methods</article-title>. <source>Electronics</source> <volume>10</volume> (<issue>4</issue>), <fpage>372</fpage>. doi: <pub-id pub-id-type="doi">10.3390/electronics10040372</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>SAR target recognition based on deep learning</article-title>,&#x201d; in <source>2014 international conference on data science and advanced analytics (DSAA)</source> (<publisher-loc>Shanghai, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>541</fpage>&#x2013;<lpage>547</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cubero</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Aleixos</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Molt&#xf3;</surname> <given-names>E.</given-names>
</name>
<name>
<surname>G&#xf3;mez-Sanchis</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Blasco</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables</article-title>. <source>Food Bioprocess. Technol.</source> <volume>4</volume> (<issue>4</issue>), <fpage>487</fpage>&#x2013;<lpage>504</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11947-010-0411-8</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cynthia</surname> <given-names>S. T.</given-names>
</name>
<name>
<surname>Hossain</surname> <given-names>K. M. S.</given-names>
</name>
<name>
<surname>Hasan</surname> <given-names>M. N.</given-names>
</name>
<name>
<surname>Asaduzzaman</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Das</surname> <given-names>A. K.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Automated detection of plant diseases using image processing and faster r-CNN algorithm</article-title>,&#x201d; in <source>2019 international conference on sustainable technologies for industry 4.0 (STI)</source> (<publisher-loc>Dhaka, Bangladesh</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>R-fcn: Object detection <italic>via</italic> region-based fully convolutional networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>29</volume>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Daras</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Odena</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Dimakis</surname> <given-names>A. G.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Your local GAN: Designing two dimensional local attention mechanisms for generative models</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>. (<publisher-loc>Seattle, USA</publisher-loc>: <publisher-name>IEEE/CVF</publisher-name>), <fpage>14531</fpage>&#x2013;<lpage>14539</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Degang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Lu</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Fan</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2021</year>). <source>A review of typical target detection algorithms for deep learning [J/OL]</source> (<publisher-loc>Beijing, China</publisher-loc>: <publisher-name>Computer engineering and application</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>21</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Deng</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Dong</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Socher</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>L. J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Fei-Fei</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Imagenet: A large-scale hierarchical image database</article-title>,&#x201d; in <source>2009 IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Miami, Florida</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x2013;<lpage>255</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ding</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Deng</surname> <given-names>W. J.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Kuijper</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>A light and faster regional convolutional neural network for object detection in optical remote sensing images</article-title>. <source>ISPRS. J. Photogrammet. Remote Sens.</source> <volume>141</volume>, <fpage>208</fpage>&#x2013;<lpage>218</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2018.05.005</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ding</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Chang</surname> <given-names>X. L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A comparison: different DCNN models for intelligent object detection in remote sensing images</article-title>. <source>Neural Process. Lett.</source> <volume>49</volume> (<issue>3</issue>), <fpage>1369</fpage>&#x2013;<lpage>1379</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11063-018-9878-5</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Diwan</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Anirudh</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Tembhurne</surname> <given-names>J. V.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Object detection using YOLO: challenges, architectural successors, datasets and applications</article-title>. <source>Multimedia. Tools Appl.</source>, <fpage>1</fpage>&#x2013;<lpage>33</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11042-022-13644-y</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Wojek</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Schiele</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Perona</surname> <given-names>P.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Pedestrian detection: A benchmark</article-title>,&#x201d; in <source>2009 IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Miami, Florida</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>304</fpage>&#x2013;<lpage>311</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Dubey</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Bhagat</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Rana</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Pathak</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>A novel approach to detect plant disease using DenseNet-121 neural network</article-title>,&#x201d; in <source>Smart trends in computing and communications</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>63</fpage>&#x2013;<lpage>74</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Du</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Tan</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Xing</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>A novel binary tree support vector machine for hyperspectral remote sensing image classification</article-title>. <source>Optics. Commun.</source> <volume>285</volume> (<issue>13-14</issue>), <fpage>3054</fpage>&#x2013;<lpage>3060</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.optcom.2012.02.092</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Erhan</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Szegedy</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Toshev</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Anguelov</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Scalable object detection using deep neural networks</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>. (<publisher-loc>Columbus, Ohio</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2147</fpage>&#x2013;<lpage>2154</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sert</surname> <given-names>E.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A deep learning based approach for the detection of diseases in pepper and potato leaves</article-title>. <source>Anadolu. Tar&#x131;m. Bilimleri. Dergisi.</source> <volume>36</volume> (<issue>2</issue>), <fpage>167</fpage>&#x2013;<lpage>178</lpage>. doi: <pub-id pub-id-type="doi">10.7161/omuanajas.805152</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Everingham</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Eslami</surname> <given-names>S. M.</given-names>
</name>
<name>
<surname>Van Gool</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Williams</surname> <given-names>C. K.</given-names>
</name>
<name>
<surname>Winn</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zisserman</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>The pascal visual object classes challenge: A retrospective</article-title>. <source>Int. J. Comput. Vision</source> <volume>111</volume> (<issue>1</issue>), <fpage>98</fpage>&#x2013;<lpage>136</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-014-0733-5</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Everingham</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Van Gool</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Williams</surname> <given-names>C. K.</given-names>
</name>
<name>
<surname>Winn</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zisserman</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>The pascal visual object classes (voc) challenge</article-title>. <source>Int. J. Comput. Vision</source> <volume>88</volume> (<issue>2</issue>), <fpage>303</fpage>&#x2013;<lpage>338</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-009-0275-4</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fu</surname> <given-names>C. Y.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Ranga</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Tyagi</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Berg</surname> <given-names>A. C.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Dssd: Deconvolutional single shot detector</article-title>. <source>arXiv. preprint. arXiv.</source>, <fpage>1701.06659</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1701.06659</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Du</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Adaptive anchor box mechanism to improve the accuracy in the object detection system</article-title>. <source>Multimedia. Tools Appl.</source> <volume>78</volume> (<issue>19</issue>), <fpage>27383</fpage>&#x2013;<lpage>27402</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11042-019-07858-w</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gera</surname> <given-names>U. K.</given-names>
</name>
<name>
<surname>Siddarth</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Singh</surname> <given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Smart farming: Industry 4.0 in agriculture using artificial intelligence</article-title>,&#x201d; in <source>Artificial intelligence for societal development and global well-being</source> (<publisher-loc>India</publisher-loc>: <publisher-name>IGI Global</publisher-name>), <fpage>211</fpage>&#x2013;<lpage>221</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ghiasi</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>T. Y.</given-names>
</name>
<name>
<surname>Le</surname> <given-names>Q. V.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Nas-fpn: Learning scalable feature pyramid architecture for object detection</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name> (<publisher-loc>Long Beach, California</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7036</fpage>&#x2013;<lpage>7045</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Fast r-cnn</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE international conference on computer vision</conf-name> (<publisher-loc>Washington, DC. United States</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>), <fpage>1440</fpage>&#x2013;<lpage>1448</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Donahue</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Darrell</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Malik</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name> (<publisher-loc>Columbus, Ohio</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>580</fpage>&#x2013;<lpage>587</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goodfellow</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Pouget-Abadie</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Mirza</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Warde-Farley</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Ozair</surname> <given-names>S.</given-names>
</name>
<etal/>
</person-group>. (<year>2014</year>). <article-title>Generative adversarial networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>63</volume> (<issue>11</issue>), <fpage>139</fpage>&#x2013;<lpage>144</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3422622</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gunturu</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Munir</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Ullah</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Welch</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Flippo</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A spatial AI-based agricultural robotic platform for wheat detection and collision avoidance</article-title>. <source>AI</source> <volume>3</volume> (<issue>3</issue>), <fpage>719</fpage>&#x2013;<lpage>738</lpage>. doi: <pub-id pub-id-type="doi">10.3390/ai3030042</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Han</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Yuan</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>YOLOPv2: Better, faster, stronger for panoptic driving perception</article-title>. <source>arXiv. preprint. arXiv.</source>, <fpage>2208.11434</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2208.11434</pub-id>
</citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Haruna</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Qin</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Mbyamm Kiki</surname> <given-names>M. J.</given-names>
</name>
</person-group> (<year>2022</year>). <source>An improved approach to detection of rice leaf disease with GAN-based data augmentation pipeline</source> (<publisher-loc>USA</publisher-loc>: <publisher-name>SSRN</publisher-name>), SSRN 4135061.</citation>
</ref>
<ref id="B49">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Gkioxari</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Mask r-cnn</article-title>,&#x201d; in <source>Proceedings of the IEEE international conference on computer vision</source>. (<publisher-loc>Venice, Italy</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2961</fpage>&#x2013;<lpage>2969</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hespeler</surname> <given-names>S. C.</given-names>
</name>
<name>
<surname>Nemati</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Dehghan-Niri</surname> <given-names>E.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Non-destructive thermal imaging for object detection <italic>via</italic> advanced deep learning for robotic inspection and harvesting of chili peppers</article-title>. <source>Artif. Intell. Agric.</source> <volume>5</volume>, <fpage>102</fpage>&#x2013;<lpage>117</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.aiia.2021.05.003</pub-id>
</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Spatial pyramid pooling in deep convolutional networks for visual recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>37</volume> (<issue>9</issue>), <fpage>1904</fpage>&#x2013;<lpage>1916</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2015.2389824</pub-id>
</citation>
</ref>
<ref id="B52">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>. (<publisher-loc>Las Vegas, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hinton</surname> <given-names>G. E.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname> <given-names>R. R.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Reducing the dimensionality of data with neural networks</article-title>. <source>Science</source> <volume>313</volume> (<issue>5786</issue>), <fpage>504</fpage>&#x2013;<lpage>507</lpage>. doi: <pub-id pub-id-type="doi">10.1126/science.1127647</pub-id>
</citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hitawala</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Evaluating resnext model architecture for image classification</article-title>. <source>arXiv. preprint. arXiv.</source>, <fpage>1805.08700</fpage>.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Bi</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Ding</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>F.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Application of computer vision technology in agriculture</article-title>. <source>Agric. Sci. Technol.</source> <volume>18</volume> (<issue>11</issue>), <fpage>2158</fpage>&#x2013;<lpage>2162</lpage>.</citation>
</ref>
<ref id="B56">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Huang</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>van der Maaten</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Weinberger</surname> <given-names>K. Q.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Densely connected convolutional networks</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>. (<publisher-loc>Honolulu, Hawaii</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4700</fpage>&#x2013;<lpage>4708</lpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Dai</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Real-time detection of tiny objects based on a weighted bi-directional FPN</article-title>,&#x201d; in <source>International conference on multimedia modeling</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x2013;<lpage>14</lpage>.</citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Zhai</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>RGB-D image multi-target detection method based on 3D DSF r-CNN</article-title>. <source>Int. J. Pattern Recogn. Artif. Intell.</source> <volume>33</volume> (<issue>08</issue>), <fpage>1954026</fpage>. doi: <pub-id pub-id-type="doi">10.1142/S0218001419540260</pub-id>
</citation>
</ref>
<ref id="B59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ienco</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Interdonato</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Gaetano</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Minh</surname> <given-names>D. H. T.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Combining sentinel-1 and sentinel-2 satellite image time series for land cover mapping <italic>via</italic> a multi-source deep learning architecture</article-title>. <source>ISPRS. J. Photogrammet. Remote Sens.</source> <volume>158</volume>, <fpage>11</fpage>&#x2013;<lpage>22</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2019.09.016</pub-id>
</citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ito</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Comte</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Nazeeruddin</surname> <given-names>M. K.</given-names>
</name>
<name>
<surname>Liska</surname> <given-names>P.</given-names>
</name>
<name>
<surname>P&#xe9;chy</surname> <given-names>P.</given-names>
</name>
<etal/>
</person-group>. (<year>2007</year>). <article-title>Fabrication of screen-printing pastes from TiO2 powders for dye-sensitised solar cells</article-title>. <source>Prog. Photovoltaics.: Res. Appl.</source> <volume>15</volume> (<issue>7</issue>), <fpage>603</fpage>&#x2013;<lpage>612</lpage>. doi: <pub-id pub-id-type="doi">10.1002/pip.768</pub-id>
</citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jeong</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Park</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Kwak</surname> <given-names>N.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Enhancement of SSD by concatenating feature maps for object detection</article-title>. <source>arXiv. preprint. arXiv.</source>, <fpage>1705.09587</fpage>. doi: <pub-id pub-id-type="doi">10.5244/C.31.76</pub-id>
</citation>
</ref>
<ref id="B62">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jian</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Pu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Yao</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>SS R-CNN: Self-supervised learning improving mask r-CNN for ship detection in remote sensing images</article-title>. <source>Remote Sens.</source> <volume>14</volume> (<issue>17</issue>), <fpage>4383</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs14174383</pub-id>
</citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiao</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Dong</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Xie</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>AF-RCNN: An anchor-free convolutional neural network for multi-categories agricultural pest detection</article-title>. <source>Comput. Electron. Agric.</source> <volume>174</volume>, <fpage>105522</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2020.105522</pub-id>
</citation>
</ref>
<ref id="B64">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kang</surname> <given-names>H. J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Real-time object detection on 640x480 image with VGG16 + SSD</article-title>,&#x201d; in <source>2019 international conference on field-programmable technology (ICFPT)</source> (<publisher-loc>Tianjin, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>419</fpage>&#x2013;<lpage>422</lpage>.</citation>
</ref>
<ref id="B65">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karim</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yin</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Bibi</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Brohi</surname> <given-names>A. A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A brief review and challenges of object detection in optical remote sensing imagery</article-title>. <source>Multiagent. Grid. Syst.</source> <volume>16</volume> (<issue>3</issue>), <fpage>227</fpage>&#x2013;<lpage>243</lpage>. doi: <pub-id pub-id-type="doi">10.3233/MGS-200330</pub-id>
</citation>
</ref>
<ref id="B66">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Karnewar</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). <source>MSG-GAN: multi-scale gradient GAN for stable image synthesis</source>. (<publisher-loc>Long Beach, California</publisher-loc>: <publisher-name>CVF</publisher-name>).</citation>
</ref>
<ref id="B67">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kassim</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Mohan</surname> <given-names>B. S.</given-names>
</name>
<name>
<surname>Muneer</surname> <given-names>K. A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Modified ML-kNN and rank SVM for multi-label pattern classification</article-title>,&#x201d; in <source>Journal of physics: Conference series</source>, vol. <volume>1921</volume>. (<publisher-loc>Goa, India</publisher-loc>: <publisher-name>IOP Publishing</publisher-name>), <fpage>012027</fpage>.</citation>
</ref>
<ref id="B68">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kong</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Yao</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Lu</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Ron: Reverse connection with objectness prior networks for object detection</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name> (<publisher-loc>Honolulu, Hawaii</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5936</fpage>&#x2013;<lpage>5944</lpage>.</citation>
</ref>
<ref id="B69">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Krasin</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Duerig</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Alldrin</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Ferrari</surname> <given-names>V.</given-names>
</name>
<name>
<surname>Abu-El-Haija</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Kuznetsova</surname> <given-names>A.</given-names>
</name>
<etal/>
</person-group>. (<year>2017</year>). <source>OpenImages: A public dataset for large-scale multi-label and multi-class image classification</source>. Available at: <uri xlink:href="https://github.com/openimages">https://github.com/openimages</uri>.</citation>
</ref>
<ref id="B70">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Hinton</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Learning multiple layers of features from tiny images</article-title>. <source>University of Toronto, Dissertation</source>, <fpage>1</fpage>&#x2013;<lpage>60</lpage>.
</citation>
</ref>
<ref id="B71">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname> <given-names>G. E.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Imagenet classification with deep convolutional neural networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>25</volume>, <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3065386</pub-id>
</citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname> <given-names>G. E.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Imagenet classification with deep convolutional neural networks</article-title>. <source>Commun. ACM</source> <volume>60</volume> (<issue>6</issue>), <fpage>84</fpage>&#x2013;<lpage>90</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3065386</pub-id>
</citation>
</ref>
<ref id="B73">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kumar</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Kumar</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Comparative analysis of validating parameters in the deep learning models for remotely sensed images</article-title>. <source>J. Discrete. Math. Sci. Cryptograp.</source> <volume>25</volume> (<issue>4</issue>), <fpage>913</fpage>&#x2013;<lpage>920</lpage>. doi: <pub-id pub-id-type="doi">10.1080/09720529.2022.2068602</pub-id>
</citation>
</ref>
<ref id="B74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuznetsova</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Rom</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Alldrin</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Uijlings</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Krasin</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Pont-Tuset</surname> <given-names>J.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). <article-title>The open images dataset v4</article-title>. <source>Int. J. Comput. Vision</source> <volume>128</volume> (<issue>7</issue>), <fpage>1956</fpage>&#x2013;<lpage>1981</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-020-01316-z</pub-id>
</citation>
</ref>
<ref id="B75">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>W.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Detection of early bruises on peaches (Amygdalus persica l.) using hyperspectral imaging coupled with improved watershed segmentation algorithm</article-title>. <source>Postharvest. Biol. Technol.</source> <volume>135</volume>, <fpage>104</fpage>&#x2013;<lpage>113</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.postharvbio.2017.09.007</pub-id>
</citation>
</ref>
<ref id="B76">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lienhart</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Maydt</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2002</year>). &#x201c;<article-title>An extended set of haar-like features for rapid object detection</article-title>,&#x201d; in <source>Proceedings. international conference on image processing</source>, vol. <volume>1</volume>. (<publisher-loc>New York, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>I</fpage>&#x2013;<lpage>I</lpage>.</citation>
</ref>
<ref id="B77">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An improved EfficientNet for rice germ integrity classification and recognition</article-title>. <source>Agriculture</source> <volume>12</volume> (<issue>6</issue>), <fpage>863</fpage>. doi: <pub-id pub-id-type="doi">10.3390/agriculture12060863</pub-id>
</citation>
</ref>
<ref id="B78">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Shan</surname> <given-names>Y.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). &#x201c;<article-title>Dual semantic fusion network for video object detection</article-title>,&#x201d; in <source>Proceedings of the 28th ACM international conference on multimedia</source>. (<publisher-loc>Seattle, WA (USA)</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1855</fpage>&#x2013;<lpage>1863</lpage>.</citation>
</ref>
<ref id="B79">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>T. Y.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Hariharan</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Belongie</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>a). &#x201c;<article-title>Feature pyramid networks for object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>. (<publisher-loc>Honolulu, Hawaii</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2117</fpage>&#x2013;<lpage>2125</lpage>.</citation>
</ref>
<ref id="B80">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>T. Y.</given-names>
</name>
<name>
<surname>Goyal</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>b). &#x201c;<article-title>Focal loss for dense object detection</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE international conference on computer vision</conf-name>, (<publisher-loc>Venice, Italy</publisher-loc>: <publisher-name>IEEE</publisher-name>) <fpage>2980</fpage>&#x2013;<lpage>2988</lpage>.</citation>
</ref>
<ref id="B81">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Ji</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Tao</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Holistic CNN compression <italic>via</italic> low-rank decomposition with knowledge transfer</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>41</volume> (<issue>12</issue>), <fpage>2889</fpage>&#x2013;<lpage>2905</lpage>. doi: <pub-id pub-id-type="doi">10.1109/tpami.2018.2873305</pub-id>
</citation>
</ref>
<ref id="B82">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>T. Y.</given-names>
</name>
<name>
<surname>Maire</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Belongie</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Hays</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Perona</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Ramanan</surname> <given-names>D.</given-names>
</name>
<etal/>
</person-group>. (<year>2014</year>). &#x201c;<article-title>Microsoft COCO: Common objects in context</article-title>,&#x201d; in <source>European Conference on computer vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x2013;<lpage>755</lpage>.</citation>
</ref>
<ref id="B83">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Anguelov</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Erhan</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Szegedy</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Reed</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Fu</surname> <given-names>C. Y.</given-names>
</name>
<etal/>
</person-group>. (<year>2016</year>). &#x201c;<article-title>SSD: Single shot multibox detector</article-title>,&#x201d; in <source>European Conference on computer vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>21</fpage>&#x2013;<lpage>37</lpage>.</citation>
</ref>
<ref id="B84">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Celik</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>H. C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Gated ladder-shaped feature pyramid network for object detection in optical remote sensing images</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/lgrs.2020.3046137</pub-id>
</citation>
</ref>
<ref id="B85">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Synthetic data augmentation using multiscale attention CycleGAN for aircraft detection in remote sensing images</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/lgrs.2021.3052017</pub-id>
</citation>
</ref>
<ref id="B86">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Qi</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Qin</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Shi</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Path aggregation network for instance segmentation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, (<publisher-loc>Salt Lake City, UT, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>) <fpage>8759</fpage>&#x2013;<lpage>8768</lpage>.</citation>
</ref>
<ref id="B87">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Wan</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Meng</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Han</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Object detection in optical remote sensing images: A survey and a new benchmark</article-title>. <source>ISPRS. J. Photogrammet. Remote Sens.</source> <volume>159</volume>, <fpage>296</fpage>&#x2013;<lpage>307</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2019.11.023</pub-id>
</citation>
</ref>
<ref id="B88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Wei</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Dong</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>J.</given-names>
</name>
<etal/>
</person-group>. (<year>2016</year>). <article-title>Attentive contexts for object detection</article-title>. <source>IEEE Trans. Multimedia.</source> <volume>19</volume> (<issue>5</issue>), <fpage>944</fpage>&#x2013;<lpage>954</lpage>. doi: <pub-id pub-id-type="doi">10.1109/tmm.2016.2642789</pub-id>
</citation>
</ref>
<ref id="B89">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Lei</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Agricultural greenhouses detection in high-resolution satellite images based on convolutional neural networks: Comparison of faster r-CNN, YOLO v3 and SSD</article-title>. <source>Sensors</source> <volume>20</volume> (<issue>17</issue>), <fpage>4938</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s20174938</pub-id>
</citation>
</ref>
<ref id="B90">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>F.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>FSSD: feature fusion single shot multibox detector</article-title>. <source>arXiv preprint</source> arXiv:<fpage>1712.00960</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1712.00960</pub-id>
</citation>
</ref>
<ref id="B91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname> <given-names>Y. M.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>D. T.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>P. Z.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>H. M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>A novel random forests and its application to the classification of mangroves remote sensing image</article-title>. <source>Multimedia. Tools Appl.</source> <volume>75</volume> (<issue>16</issue>), <fpage>9707</fpage>&#x2013;<lpage>9722</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11042-015-2906-9</pub-id>
</citation>
</ref>
<ref id="B92">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mahanti</surname> <given-names>N. K.</given-names>
</name>
<name>
<surname>Pandiselvam</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Kothakota</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Ishwarya</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Chakraborty</surname> <given-names>S. K.</given-names>
</name>
<name>
<surname>Kumar</surname> <given-names>M.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Emerging non-destructive imaging techniques for fruit damage detection: Image processing and analysis</article-title>. <source>Trends Food Sci. Technol.</source> <volume>120</volume>, <fpage>418</fpage>&#x2013;<lpage>438</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.tifs.2021.12.021</pub-id>
</citation>
</ref>
<ref id="B93">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marris</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Deboudt</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Augustin</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Flament</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Blond</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Fiani</surname> <given-names>E.</given-names>
</name>
<etal/>
</person-group>. (<year>2012</year>). <article-title>Fast changes in chemical composition and size distribution of fine particles during the near-field transport of industrial plumes</article-title>. <source>Sci. Total. Environ.</source> <volume>427</volume>, <fpage>126</fpage>&#x2013;<lpage>138</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.scitotenv.2012.03.068</pub-id>
</citation>
</ref>
<ref id="B94">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mnih</surname> <given-names>V.</given-names>
</name>
<name>
<surname>Hinton</surname> <given-names>G. E.</given-names>
</name>
</person-group> (<year>2010</year>). &#x201c;<article-title>Learning to detect roads in high-resolution aerial images</article-title>,&#x201d; in <source>European Conference on computer vision</source> (<publisher-loc>Berlin, Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>210</fpage>&#x2013;<lpage>223</lpage>.</citation>
</ref>
<ref id="B95">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Moore</surname> <given-names>R. C.</given-names>
</name>
<name>
<surname>DeNero</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2011</year>). <source>L1 and L2 regularization for multiclass hinge loss models</source>.</citation>
</ref>
<ref id="B96">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Naqvi</surname> <given-names>S. F.</given-names>
</name>
<name>
<surname>Ali</surname> <given-names>S. S. A.</given-names>
</name>
<name>
<surname>Yahya</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Yasin</surname> <given-names>M. A.</given-names>
</name>
<name>
<surname>Hafeez</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Subhani</surname> <given-names>A. R.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). <article-title>Real-time stress assessment using sliding window based convolutional neural network</article-title>. <source>Sensors</source> <volume>20</volume> (<issue>16</issue>), <fpage>4400</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s20164400</pub-id>
</citation>
</ref>
<ref id="B97">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nawaz</surname> <given-names>S. A.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Bhatti</surname> <given-names>U. A.</given-names>
</name>
<name>
<surname>Bazai</surname> <given-names>S. U.</given-names>
</name>
<name>
<surname>Zafar</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Bhatti</surname> <given-names>M. A.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>A hybrid approach to forecast the COVID-19 epidemic trend</article-title>. <source>PloS One</source> <volume>16</volume> (<issue>10</issue>), <elocation-id>e0256971</elocation-id>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0256971</pub-id>
</citation>
</ref>
<ref id="B98">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nawaz</surname> <given-names>S. A.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Bhatti</surname> <given-names>U. A.</given-names>
</name>
<name>
<surname>Mehmood</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Ahmed</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Ul Ain</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A novel hybrid discrete cosine transform speeded up robust feature-based secure medical image watermarking algorithm</article-title>. <source>J. Med. Imaging Health Inf.</source> <volume>10</volume> (<issue>11</issue>), <fpage>2588</fpage>&#x2013;<lpage>2599</lpage>. doi: <pub-id pub-id-type="doi">10.1166/jmihi.2020.3220</pub-id>
</citation>
</ref>
<ref id="B99">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nguyen</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An efficient license plate detection approach using lightweight deep convolutional neural networks</article-title>. <source>Adv. Multimedia.</source> <volume>2022</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi: <pub-id pub-id-type="doi">10.1155/2022/8852142</pub-id>
</citation>
</ref>
<ref id="B100">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nguyen</surname> <given-names>T. T.</given-names>
</name>
<name>
<surname>Vien</surname> <given-names>Q. T.</given-names>
</name>
<name>
<surname>Sellahewa</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>An efficient pest classification in smart agriculture using transfer learning</article-title>. <source>EAI. Endorsed. Trans. Ind. Networks Intelligent. Syst.</source> <volume>8</volume> (<issue>26</issue>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>. doi: <pub-id pub-id-type="doi">10.4108/eai.26-1-2021.168227</pub-id>
</citation>
</ref>
<ref id="B101">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pan</surname> <given-names>T. S.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>H. C.</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>J. C.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>C. H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multi-scale ResNet for real-time underwater object detection</article-title>. <source>Signal. Image. Video. Process.</source> <volume>15</volume> (<issue>5</issue>), <fpage>941</fpage>&#x2013;<lpage>949</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11760-020-01818-w</pub-id>
</citation>
</ref>
<ref id="B102">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Patel</surname> <given-names>K. K.</given-names>
</name>
<name>
<surname>Kar</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Jha</surname> <given-names>S. N.</given-names>
</name>
<name>
<surname>Khan</surname> <given-names>M. A.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Machine vision system: a tool for quality inspection of food and agricultural products</article-title>. <source>J. Food Sci. Technol.</source> <volume>49</volume> (<issue>2</issue>), <fpage>123</fpage>&#x2013;<lpage>141</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s13197-011-0321-4</pub-id>
</citation>
</ref>
<ref id="B103">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Peng</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>Y. Y.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Maximum likelihood estimation-based joint sparse representation for the classification of hyperspectral remote sensing images</article-title>. <source>IEEE Trans. Neural Networks Learn. Syst.</source> <volume>30</volume> (<issue>6</issue>), <fpage>1790</fpage>&#x2013;<lpage>1802</lpage>. doi: <pub-id pub-id-type="doi">10.1109/tnnls.2018.2874432</pub-id>
</citation>
</ref>
<ref id="B104">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Piao</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Jiang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <source>PANet: Patch-aware network for light field salient object detection</source> (<publisher-loc>USA</publisher-loc>: <publisher-name>IEEE Transactions on Cybernetics</publisher-name>).</citation>
</ref>
<ref id="B105">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rahman</surname> <given-names>M. A.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Optimizing intersection-over-union in deep neural networks for image segmentation</article-title>,&#x201d; in <source>International symposium on visual computing</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>234</fpage>&#x2013;<lpage>244</lpage>.</citation>
</ref>
<ref id="B106">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Redmon</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Divvala</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Farhadi</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>You only look once: Unified, real-time object detection</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, (<publisher-loc>Las Vegas, NV, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>) <fpage>779</fpage>&#x2013;<lpage>788</lpage>.</citation>
</ref>
<ref id="B107">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Redmon</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Farhadi</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>YOLO9000: better, faster, stronger</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, (<publisher-loc>Honolulu, Hawaii</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>) <fpage>7263</fpage>&#x2013;<lpage>7271</lpage>.</citation>
</ref>
<ref id="B108">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Redmon</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Farhadi</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>YOLOv3: An incremental improvement</article-title>. <source>arXiv preprint arXiv:1804.02767</source>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1804.02767</pub-id>
</citation>
</ref>
<ref id="B109">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ren</surname> <given-names>S.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>28</volume>. doi: <pub-id pub-id-type="doi">10.1109/tpami.2016.2577031</pub-id>
</citation>
</ref>
<ref id="B110">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Russakovsky</surname> <given-names>O.</given-names>
</name>
<name>
<surname>Deng</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Su</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Krause</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Satheesh</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>S.</given-names>
</name>
<etal/>
</person-group>. (<year>2015</year>). <article-title>ImageNet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vision</source> <volume>115</volume> (<issue>3</issue>), <fpage>211</fpage>&#x2013;<lpage>252</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id>
</citation>
</ref>
<ref id="B111">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salda&#xf1;a</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Siche</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Luj&#xe1;n</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Quevedo</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Computer vision applied to the inspection and quality control of fruits and vegetables</article-title>. <source>Braz. J. Food Technol.</source> <volume>16</volume>, <fpage>254</fpage>&#x2013;<lpage>272</lpage>. doi: <pub-id pub-id-type="doi">10.1590/S1981-67232013005000031</pub-id>
</citation>
</ref>
<ref id="B112">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Savarimuthu</surname> <given-names>N.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Investigation on object detection models for plant disease detection framework</article-title>,&#x201d; in <source>2021 IEEE 6th international conference on computing, communication and automation (ICCCA)</source> (<publisher-loc>New Delhi, India</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>214</fpage>&#x2013;<lpage>218</lpage>.</citation>
</ref>
<ref id="B113">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sermanet</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Eigen</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Mathieu</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Fergus</surname> <given-names>R.</given-names>
</name>
<name>
<surname>LeCun</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>OverFeat: Integrated recognition, localization and detection using convolutional networks</article-title>. <source>arXiv preprint arXiv:1312.6229</source>, <fpage>1</fpage>&#x2013;<lpage>16</lpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1312.6229</pub-id>
</citation>
</ref>
<ref id="B114">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shen</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Jiang</surname> <given-names>Y. G.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Xue</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>DSOD: Learning deeply supervised object detectors from scratch</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE international conference on computer vision</conf-name>, (<publisher-loc>Venice, Italy</publisher-loc>: <publisher-name>IEEE</publisher-name>) <fpage>1919</fpage>&#x2013;<lpage>1927</lpage>.</citation>
</ref>
<ref id="B115">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Shi</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Jiang</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Single image super-resolution with dilated convolution based multi-scale information learning inception module</article-title>,&#x201d; in <source>2017 IEEE international conference on image processing (ICIP)</source> (<publisher-loc>Beijing, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>977</fpage>&#x2013;<lpage>981</lpage>.</citation>
</ref>
<ref id="B116">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shi</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Xia</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Xie</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Du</surname> <given-names>Z.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Identifying damaged buildings in aerial images using the object detection method</article-title>. <source>Remote Sens.</source> <volume>13</volume> (<issue>21</issue>), <fpage>4213</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs13214213</pub-id>
</citation>
</ref>
<ref id="B117">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shu</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Lai</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multi-feature fusion target re-location tracking based on correlation filters</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>28954</fpage>&#x2013;<lpage>28964</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3059642</pub-id>
</citation>
</ref>
<ref id="B118">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Slaughter</surname> <given-names>D. C.</given-names>
</name>
<name>
<surname>Harrell</surname> <given-names>R. C.</given-names>
</name>
</person-group> (<year>1989</year>). <article-title>Discriminating fruit for robotic harvest using color in natural outdoor scenes</article-title>. <source>Trans. ASAE.</source> <volume>32</volume> (<issue>2</issue>), <fpage>757</fpage>&#x2013;<lpage>763</lpage>. doi: <pub-id pub-id-type="doi">10.13031/2013.31066</pub-id>
</citation>
</ref>
<ref id="B119">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Storey</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Meng</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Leaf disease segmentation and detection in apple orchards for precise smart spraying in sustainable agriculture</article-title>. <source>Sustainability</source> <volume>14</volume> (<issue>3</issue>), <fpage>1458</fpage>. doi: <pub-id pub-id-type="doi">10.3390/su14031458</pub-id>
</citation>
</ref>
<ref id="B120">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tan</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Le</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-loc>Long Beach, California</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>6105</fpage>&#x2013;<lpage>6114</lpage>.</citation>
</ref>
<ref id="B121">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tan</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Pang</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Le</surname> <given-names>Q. V.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>EfficientDet: Scalable and efficient object detection</article-title>. <source>Proc. IEEE/CVF Conf. Comput. Vision Pattern Recogn.</source>, <fpage>10781</fpage>&#x2013;<lpage>10790</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01079</pub-id>
</citation>
</ref>
<ref id="B122">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tong</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>F.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Recent advances in small object detection based on deep learning: A review</article-title>. <source>Image. Vision Computing.</source> <volume>97</volume>, <fpage>103910</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.imavis.2020.103910</pub-id>
</citation>
</ref>
<ref id="B123">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Uijlings</surname> <given-names>J. R.</given-names>
</name>
<name>
<surname>Van De Sande</surname> <given-names>K. E.</given-names>
</name>
<name>
<surname>Gevers</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Smeulders</surname> <given-names>A. W.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Selective search for object recognition</article-title>. <source>Int. J. Comput. Vision</source> <volume>104</volume> (<issue>2</issue>), <fpage>154</fpage>&#x2013;<lpage>171</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-013-0620-5</pub-id>
</citation>
</ref>
<ref id="B124">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Vedaldi</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Gulshan</surname> <given-names>V.</given-names>
</name>
<name>
<surname>Varma</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Zisserman</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Multiple kernels for object detection</article-title>,&#x201d; in <source>2009 IEEE 12th international conference on computer vision</source> (<publisher-loc>Kyoto, Japan</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>606</fpage>&#x2013;<lpage>613</lpage>.</citation>
</ref>
<ref id="B125">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Viola</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Jones</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2001</year>). &#x201c;<article-title>Rapid object detection using a boosted cascade of simple features</article-title>,&#x201d; in <source>Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition</source>, vol. <volume>1</volume>. (<publisher-loc>Kauai, Hawaii</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>I</fpage>&#x2013;<lpage>I</lpage>.</citation>
</ref>
<ref id="B126">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Research towards YOLO-series algorithms: Comparison and analysis of object detection models for real-time UAV applications</article-title>,&#x201d; in <source>Journal of physics: Conference series</source>, vol. <volume>1948</volume>. (<publisher-loc>Lisbon, Portugal</publisher-loc>: <publisher-name>IOP Publishing</publisher-name>), <fpage>012021</fpage>.</citation>
</ref>
<ref id="B127">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>M. F.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Exposure of the shaded side of apple fruit to full sun leads to up-regulation of both the xanthophyll cycle and the ascorbate-glutathione cycle</article-title>. <source>HortScience</source> <volume>39</volume> (<issue>4</issue>), <fpage>887A</fpage>&#x2013;<lpage>888</lpage>. doi: <pub-id pub-id-type="doi">10.21273/hortsci.39.4.887a</pub-id>
</citation>
</ref>
<ref id="B128">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Shrivastava</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Gupta</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>A-Fast-RCNN: Hard positive generation via adversary for object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Honolulu, Hawaii</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2606</fpage>&#x2013;<lpage>2615</lpage>.</citation>
</ref>
<ref id="B129">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wei</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Xiang</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Dong</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Object detection with noisy annotations in high-resolution remote sensing images using robust EfficientDet</article-title>,&#x201d; in <source>Image and signal processing for remote sensing XXVII</source>, vol. <volume>11862</volume>. (<publisher-name>SPIE</publisher-name>), <fpage>66</fpage>&#x2013;<lpage>75</lpage>.</citation>
</ref>
<ref id="B130">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Road structure refined CNN for road extraction in aerial image</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>14</volume> (<issue>5</issue>), <fpage>709</fpage>&#x2013;<lpage>713</lpage>. doi: <pub-id pub-id-type="doi">10.1109/LGRS.2017.2672734</pub-id>
</citation>
</ref>
<ref id="B131">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Cao</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Zeng</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>J.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Improved Mask R-CNN for aircraft detection in remote sensing images</article-title>. <source>Sensors</source> <volume>21</volume> (<issue>8</issue>), <fpage>2618</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s21082618</pub-id>
</citation>
</ref>
<ref id="B132">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xiao</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Ehinger</surname> <given-names>K. A.</given-names>
</name>
<name>
<surname>Hays</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Torralba</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Oliva</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>SUN database: Exploring a large collection of scene categories</article-title>. <source>Int. J. Comput. Vision</source> <volume>119</volume> (<issue>1</issue>), <fpage>3</fpage>&#x2013;<lpage>22</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11263-014-0748-y</pub-id>
</citation>
</ref>
<ref id="B133">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Review of typical object detection algorithms for deep learning</article-title>. <source>Comput. Eng. Appl.</source> <volume>57</volume> (<issue>8</issue>), <fpage>10</fpage>&#x2013;<lpage>25</lpage>.</citation>
</ref>
<ref id="B134">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Tan</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Su</surname> <given-names>N.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A data augmentation strategy based on simulated samples for ship detection in RGB remote sensing images</article-title>. <source>ISPRS. Int. J. Geo-Inform.</source> <volume>8</volume> (<issue>6</issue>), <fpage>276</fpage>. doi: <pub-id pub-id-type="doi">10.3390/ijgi8060276</pub-id>
</citation>
</ref>
<ref id="B135">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yi</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Su</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>W. H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Probabilistic Faster R-CNN with stochastic region proposing: Towards object detection and recognition in remote sensing imagery</article-title>. <source>Neurocomputing</source> <volume>459</volume>, <fpage>290</fpage>&#x2013;<lpage>301</lpage>.</citation>
</ref>
<ref id="B136">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ying</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Jing</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Tao</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Jin</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Ibarra</surname> <given-names>J. G.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2000</year>). &#x201c;<article-title>Application of machine vision in inspecting stem and shape of fruits</article-title>,&#x201d; in <source>Biological quality and precision agriculture II</source>, vol. <volume>4203</volume>. (<publisher-name>SPIE</publisher-name>), <fpage>122</fpage>&#x2013;<lpage>130</lpage>.</citation>
</ref>
<ref id="B137">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>C.</given-names>
</name>
<etal/>
</person-group>. (<year>2010</year>). &#x201c;<article-title>Object detection by context and boosted HOG-LBP</article-title>,&#x201d; in <source>ECCV workshop on PASCAL VOC</source>. (<publisher-name>PASCAL</publisher-name>).</citation>
</ref>
<ref id="B138">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Goodfellow</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Metaxas</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Odena</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Self-attention generative adversarial networks</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-loc>Long Beach, California</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>7354</fpage>&#x2013;<lpage>7363</lpage>.</citation>
</ref>
<ref id="B139">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Gong</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yu</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Applications of deep learning for dense scenes analysis in agriculture: A review</article-title>. <source>Sensors</source> <volume>20</volume> (<issue>5</issue>), <fpage>1520</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s20051520</pub-id>
</citation>
</ref>
<ref id="B140">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Peng</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>A remote sensing object detection algorithm based on the attention mechanism and Faster R-CNN</article-title>,&#x201d; in <source>Artificial intelligence in China</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>336</fpage>&#x2013;<lpage>344</lpage>.</citation>
</ref>
<ref id="B141">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Wen</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Bian</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Lei</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>S. Z.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Single-shot refinement neural network for object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4203</fpage>&#x2013;<lpage>4212</lpage>.</citation>
</ref>
<ref id="B142">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Sheng</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname> <given-names>L.</given-names>
</name>
<etal/>
</person-group>. (<year>2019</year>). <article-title>M2Det: A single-shot object detector based on multi-level feature pyramid network</article-title>. <source>Proc. AAAI. Conf. Artif. Intell.</source> <volume>33</volume>, <fpage>9259</fpage>&#x2013;<lpage>9266</lpage>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33019259</pub-id>
</citation>
</ref>
<ref id="B143">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhong</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Han</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery</article-title>. <source>ISPRS. J. Photogrammet. Remote Sens.</source> <volume>138</volume>, <fpage>281</fpage>&#x2013;<lpage>294</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2018.02.014</pub-id>
</citation>
</ref>
<ref id="B144">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Lapedriza</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Khosla</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Oliva</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Torralba</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Places: A 10 million image database for scene recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>40</volume> (<issue>6</issue>), <fpage>1452</fpage>&#x2013;<lpage>1464</lpage>. doi: <pub-id pub-id-type="doi">10.1109/tpami.2017.2723009</pub-id>
</citation>
</ref>
<ref id="B145">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Peng</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Jiang</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>A survey of research on crowd abnormal behavior detection algorithm based on YOLO network</article-title>,&#x201d; in <source>2022 2nd international conference on consumer electronics and computer engineering (ICCECE)</source> (<publisher-loc>Guangzhou, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>783</fpage>&#x2013;<lpage>786</lpage>.</citation>
</ref>
<ref id="B146">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Jiao</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A multiscale object detection approach for remote sensing images based on MSE-DenseNet and the dynamic anchor assignment</article-title>. <source>Remote Sens. Lett.</source> <volume>10</volume> (<issue>10</issue>), <fpage>959</fpage>&#x2013;<lpage>967</lpage>. doi: <pub-id pub-id-type="doi">10.1080/2150704X.2019.1633486</pub-id>
</citation>
</ref>
<ref id="B147">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zitnick</surname> <given-names>C. L.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Edge boxes: Locating object proposals from edges</article-title>,&#x201d; in <source>European Conference on computer vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>391</fpage>&#x2013;<lpage>405</lpage>.</citation>
</ref>
<ref id="B148">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zou</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Ni</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Deep learning based feature selection for remote sensing scene classification</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>12</volume> (<issue>11</issue>), <fpage>2321</fpage>&#x2013;<lpage>2325</lpage>. doi: <pub-id pub-id-type="doi">10.1109/LGRS.2015.2475299</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>