<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2022.1016470</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Plant Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Apple detection and instance segmentation in natural environments using an improved Mask Scoring R-CNN Model</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Wang</surname>
<given-names>Dandan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1585009"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>He</surname>
<given-names>Dongjian</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1367254"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>College of Communication and Information Engineering, Xi&#x2019;an University of Science and Technology</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Xi&#x2019;an Key Laboratory of Network Convergence Communication</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country>
</aff>
<aff id="aff3">
<sup>3</sup>
<institution>College of Mechanical and Electronic Engineering, Northwest A&amp;F University</institution>, <addr-line>Xianyang</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Jeonghwan Gwak, Korea National University of Transportation, South Korea</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Zahid Ullah, Korea National University of Transportation, South Korea; Nisha Pillai, Mississippi State University, United States</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Dandan Wang, <email xlink:href="mailto:wdd_app@xust.edu.cn">wdd_app@xust.edu.cn</email>
</p>
</fn>
<fn fn-type="other" id="fn002">
<p>This article was submitted to Technical Advances in Plant Science, a section of the journal Frontiers in Plant Science</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>02</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>1016470</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Wang and He</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Wang and He</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The accurate detection and segmentation of apples during the growth stage are essential for yield estimation, timely harvesting, and retrieving growth information. However, factors such as uncertain illumination, overlaps and occlusions among apples, a homochromatic background, and the gradual change in the ground color of apples from green to red pose great challenges for apple detection and segmentation. To solve these problems, this study proposed an improved Mask Scoring region-based convolutional neural network (Mask Scoring R-CNN), termed MS-ADS, for accurate apple detection and instance segmentation in a natural environment. First, ResNeSt, a variant of ResNet, combined with a feature pyramid network was used as the backbone network to improve the feature extraction ability. Second, the high-level architectures, including the R-CNN head and mask head, were modified to improve the utilization of high-level features: convolutional layers were added to the original R-CNN head to improve the accuracy of bounding box detection (<italic>bbox_mAP</italic>), and a Dual Attention Network was added to the original mask head to improve the accuracy of instance segmentation (<italic>mask_mAP</italic>). The experimental results showed that the proposed MS-ADS model effectively detected and segmented apples under various conditions, such as apples occluded by branches, leaves or other apples, apples with different ground colors and shadows, and apples divided into parts by branches and petioles. The <italic>recall</italic>, <italic>precision</italic>, false detection rate, and <italic>F</italic>1 score were 97.4%, 96.5%, 3.5%, and 96.9%, respectively. A <italic>bbox_mAP</italic> of 0.932 and a <italic>mask_mAP</italic> of 0.920 were achieved on the test set, and the average run-time was 0.27 s per image. These results indicate that the MS-ADS method detects and segments apples in the orchard robustly and accurately with real-time performance. This study lays a foundation for follow-up work such as yield estimation, harvesting, and the automatic, long-term acquisition of apple growth information.</p>
</abstract>
<kwd-group>
<kwd>fruit</kwd>
<kwd>detection</kwd>
<kwd>segmentation</kwd>
<kwd>deep learning</kwd>
<kwd>Mask Scoring R-CNN</kwd>
<kwd>attention mechanism</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="6"/>
<equation-count count="4"/>
<ref-count count="47"/>
<page-count count="15"/>
<word-count count="7104"/>
</counts>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>The production and management of apple orchards mainly rely on experienced growers, which is time-consuming, labour-intensive, costly and imprecise (<xref ref-type="bibr" rid="B1">Barbole et&#xa0;al., 2021</xref>). With the rapid development of precision and intelligent agriculture, machine vision has become an important way to obtain apple growth information. Apple detection and segmentation through machine vision is the foundation of an innovative orchard management approach. It is of great significance for monitoring the growth and nutritional status of fruit, performing early yield estimation and timely harvesting, and it can effectively reduce the dependence on manual labour (<xref ref-type="bibr" rid="B39">Tian et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B16">Jia et&#xa0;al., 2020</xref>). However, environmental variables in the natural orchard, such as the complex growth environment, fluctuating illumination, uneven distribution of fruits, overlaps and occlusions of apples, changes in apple color during the growth process, and varying colors and shadows on the apple surface, have a significant impact on the accurate detection and segmentation of apples (<xref ref-type="bibr" rid="B36">Tang et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B42">Wang and He, 2022</xref>).</p>
<p>Many methods have been proposed to solve these problems. For instance, <xref ref-type="bibr" rid="B7">Gongal et&#xa0;al. (2015)</xref> first used histogram equalization to intensify the color differences between apples and the background, and then used Otsu thresholding and edge detection to detect foreground pixels. Finally, the Circular Hough Transform and blob detection were used to detect apples in images. The accuracy of this method was 82% with dual-side imaging. In another study, based on color, texture, and three-dimensional (3D) shape properties, <xref ref-type="bibr" rid="B32">Rakun et&#xa0;al. (2011)</xref> developed an apple image segmentation method in which color features and threshold segmentation were used to segment potential apple regions from the background. Texture analysis and 3D reconstruction were then used to refine the color-segmented area and achieve the final apple segmentation. Under artificial lighting at night, a bright spot appears on the surface of each apple; <xref ref-type="bibr" rid="B23">Linker and Kelman (2015)</xref> used this property to design a method for detecting green apples and found that it was insensitive to apple color. These traditional image processing methods use manually designed features for target detection and segmentation. However, the apple growth environment is complex, and illumination conditions constantly change over time. The texture, shape and color features of fruit change with light intensity, occlusions and overlaps. It is very difficult to extract universal features of apples in the natural environment, resulting in the poor generality of traditional methods (<xref ref-type="bibr" rid="B46">Zhou et&#xa0;al., 2012</xref>; <xref ref-type="bibr" rid="B31">Nguyen et&#xa0;al., 2016</xref>; <xref ref-type="bibr" rid="B5">Fu et&#xa0;al., 2020</xref>).</p>
<p>With the development of machine learning, deep learning has been widely applied in the agricultural field (<xref ref-type="bibr" rid="B37">Tian H. et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B30">Naranjo-Torres et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B35">Saleem et&#xa0;al., 2021</xref>). Compared with traditional image processing methods, deep learning-based methods avoid complex operations such as image pre-processing and handcrafted feature extraction. These methods take images as input and extract appropriate features automatically (<xref ref-type="bibr" rid="B9">Guo et&#xa0;al., 2016</xref>). Deep learning achieves outstanding results with good robustness, and it has recently been applied to fruit detection and segmentation (<xref ref-type="bibr" rid="B16">Jia et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B29">Maheswari et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B18">Jia et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B19">Jia et&#xa0;al., 2022a</xref>). For example, <xref ref-type="bibr" rid="B20">Kang and Chen (2019)</xref>; <xref ref-type="bibr" rid="B21">Kang and Chen (2020)</xref> designed a detection and segmentation network (DaSNet) to achieve the accurate segmentation of apples. <xref ref-type="bibr" rid="B22">Li et&#xa0;al. (2021)</xref> proposed an ensemble U-Net segmentation model for immature green apple segmentation. To compensate for the poor performance of deep convolutional neural networks at preserving target edges, the edge features of the apples were fused with the high-level features of U-Net (<xref ref-type="bibr" rid="B33">Ronneberger et&#xa0;al., 2015</xref>) to achieve accurate segmentation of the apples. The experimental results showed that this method ensured segmentation accuracy and improved the generalization ability of the model. A suppression Mask region-based convolutional neural network (R-CNN) was developed by <xref ref-type="bibr" rid="B3">Chu et&#xa0;al. (2021)</xref> to detect apples. In that study, a suppression branch was added to the standard Mask R-CNN (<xref ref-type="bibr" rid="B10">He et&#xa0;al., 2020</xref>), which effectively suppressed the generation of non-apple features and improved detection accuracy. To realize the accurate segmentation of green fruit, <xref ref-type="bibr" rid="B14">Jia et&#xa0;al. (2022b)</xref> proposed an efficient You Only Look One-level Feature (YOLOF)-Snake segmentation model, in which the contour-based Deep Snake instance segmentation module is embedded after the YOLOF regression branch. The method achieved fast and accurate segmentation of green fruit. <xref ref-type="bibr" rid="B27">Liu J. et&#xa0;al. (2022)</xref> proposed a DLNet model, consisting of a detection network and a segmentation network, to detect and segment obscured green fruits. In the detection network, a Gaussian non-local attention mechanism was added to the feature pyramid network (FPN) to build a refined pyramid network that continuously refines the semantic features generated by the residual network (ResNet) (<xref ref-type="bibr" rid="B11">He et&#xa0;al., 2016</xref>) and the FPN. The segmentation network was composed of a dual-layer Graph Attention Network (GAT). The experimental results showed that this method detected and segmented green fruits with high accuracy and good robustness. An obscured green apple detection and segmentation method based on a fully convolutional one-stage (FCOS) object detection model was proposed by <xref ref-type="bibr" rid="B25">Liu M. Y. et&#xa0;al. (2022)</xref>. They used a residual feature pyramid to improve the detection accuracy of green fruits of various sizes and fused a two-layer convolutional block attention network into FCOS to recover the edges of incomplete green fruits. The accuracies of detection and segmentation were 77.2% and 79.7%, respectively. Compared with traditional methods, the above deep learning-based methods significantly improve accuracy and generalization ability. However, most studies focus on immature green fruit or mature red fruit; the detection and segmentation of fruit whose ground color gradually changes from green to red throughout the whole growth period in a natural orchard remains a challenge. Research on deep learning-based apple detection and segmentation is still developing, and few studies address detection across the whole growth period. Additionally, existing methods mainly focus on detecting fruit with little occlusion under simple lighting conditions (<xref ref-type="bibr" rid="B17">Jia et&#xa0;al., 2022c</xref>), which falls short of the needs of intelligent orchard management.</p>
<p>Image segmentation includes semantic and instance segmentation. Semantic segmentation generates the same mask for the same class, rendering it ineffective in separating overlapping objects of the same class. Instance segmentation integrates object detection and segmentation and generates a different mask for each object. For apples grown in natural orchards, fruit overlap is common; hence instance segmentation is more applicable for apple detection and segmentation. Mask Scoring R-CNN (<xref ref-type="bibr" rid="B12">Huang et&#xa0;al., 2019</xref>) is one of the state-of-the-art instance segmentation methods, which is widely used in the detection and instance segmentation of various targets. For example, <xref ref-type="bibr" rid="B38">Tian Y. et&#xa0;al. (2020)</xref> applied Mask Scoring R-CNN to apple flower detection. They fused U-Net into Mask Scoring R-CNN, and proposed a MASU-R-CNN model. <xref ref-type="bibr" rid="B40">Tu et&#xa0;al. (2021)</xref> used Mask Scoring R-CNN to segment pig images, achieving the effective segmentation of adhesive pigs.</p>
<p>With the development of deep learning, the attention mechanism has gradually become an important component of network design. Fusing an attention mechanism into a network can effectively increase the expressive ability of the model, allowing it to focus on important target features while suppressing unnecessary ones (<xref ref-type="bibr" rid="B47">Zhu et&#xa0;al., 2019</xref>). Recently, attention mechanisms have also been used for fruit detection. <xref ref-type="bibr" rid="B15">Jiang et&#xa0;al. (2022)</xref> fused the non-local attention module (<xref ref-type="bibr" rid="B41">Wang et&#xa0;al., 2018</xref>) and the convolutional block attention module, inspired by the Squeeze-and-Excitation network (<xref ref-type="bibr" rid="B13">Hu et&#xa0;al., 2018</xref>), into You Only Look Once (YOLO) V4 to achieve high-efficiency detection of young apples. The experimental results showed that the added attention modules effectively improved detection accuracy. <xref ref-type="bibr" rid="B27">Liu J. et&#xa0;al. (2022)</xref> added a Gaussian non-local attention mechanism to the FPN to continuously refine the semantic features generated by the ResNet and FPN.</p>
<p>The overall goal of this study is to provide a reliable and efficient method for detecting and instance-segmenting apples throughout the whole growth period in complex environments. Inspired by the successful studies above, a method based on an improved Mask Scoring R-CNN (MS-ADS) that fuses attention mechanisms was proposed. The specific objectives are as follows:</p>
<list list-type="order">
<list-item>
<p>To improve the feature extraction ability of the backbone, ResNeSt, a variant of ResNet fused with attention mechanism, combined with FPN was used to replace the original backbone network of the Mask Scoring R-CNN.</p>
</list-item>
<list-item>
<p>To further improve the utilization of high-level features and enhance the accuracy of bounding box detection and instance segmentation, the R-CNN head and mask head of the Mask Scoring R-CNN were improved by adding convolution layers and Dual Attention Network (DANet), respectively.</p>
</list-item>
<list-item>
<p>To train and test the MS-ADS model to achieve accurate detection and instance segmentation of apples in the natural environment.</p>
</list-item>
</list>
<p>The MS-ADS method focuses on the reliable and efficient detection and segmentation of apples throughout all growth stages. This was achieved by improving the backbone and the high-level architectures, including the R-CNN head and mask head, of the original model. The improved backbone is more attentive to apple features and effectively ignores background features, which strengthens the feature extraction ability of the network. High-level feature maps, which contain rich context and semantic information, are useful for determining the invariant and abstract features that can serve a variety of vision tasks, including target detection and classification. Modifying the high-level architectures improves the utilization of high-level features and yields more accurate detection results and more refined edge segmentation. Accurate apple detection and segmentation throughout the growth period are crucial for yield estimation, timely harvesting and the automatic monitoring of fruit growth. The proposed method can be used to monitor the growth cycle of apples and to perform appropriate variable-rate irrigation and fertilization according to the monitored growth state or density of the fruits at different growth stages, thereby improving resource utilization efficiency. Additionally, this method can provide a reference for storage facilities based on production estimates.</p>
</sec>
<sec id="s2" sec-type="materials|methods">
<label>2</label>
<title>Materials and methods</title>
<sec id="s2_1">
<label>2.1</label>
<title>Image dataset acquisition</title>
<p>In this study, apple images were captured in an experimental apple orchard belonging to the College of Horticulture, Northwest A&amp;F University, Yangling, Shaanxi, China. The images used in this research were collected from 9:00 to 11:00 a.m. and 3:00 to 6:30 p.m. from May to September in 2019 during cloudy and sunny weather conditions. Images under natural daylight with backlight and direct sunlight conditions were acquired using an iPhone 7 Plus. The images were captured with a resolution of 4032 &#xd7; 3024 pixels and were saved in JPEG format.</p>
<p>To improve computational efficiency and to match images collected by low-resolution cameras, the images were rescaled to 369 &#xd7; 277 pixels. To make the apple edges clearer and to facilitate image annotation and subsequent feature extraction, the images were sharpened using the Laplace operator (<xref ref-type="bibr" rid="B8">Gonzalez and Woods, 2020</xref>). The rescaled and sharpened images were manually annotated with polygons using the VGG Image Annotator (VIA) (<xref ref-type="bibr" rid="B4">Dutta and Zisserman, 2019</xref>) for network training and testing. After annotation, 219 images acquired under various conditions were selected as the test set, and the remaining images were used as the training set. <xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref> shows the information of the apple dataset.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Information of apple dataset.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Conditions</th>
<th valign="top" align="center">Color of apple</th>
<th valign="top" align="center">Number of training set/Number of annotated apples</th>
<th valign="top" align="center">Number of test set/Number of annotated apples</th>
<th valign="top" align="center">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" rowspan="3" align="left">Direct sunlight</td>
<td valign="top" align="center">Red</td>
<td valign="top" align="center">173/428</td>
<td valign="top" align="center">35/68</td>
<td valign="top" align="center">208/496</td>
</tr>
<tr>
<td valign="top" align="center">Green</td>
<td valign="top" align="center">191/513</td>
<td valign="top" align="center">41/74</td>
<td valign="top" align="center">232/587</td>
</tr>
<tr>
<td valign="top" align="center">Uneven</td>
<td valign="top" align="center">187/479</td>
<td valign="top" align="center">36/65</td>
<td valign="top" align="center">223/544</td>
</tr>
<tr>
<td valign="top" rowspan="3" align="left">Backlight</td>
<td valign="top" align="center">Red</td>
<td valign="top" align="center">151/441</td>
<td valign="top" align="center">35/67</td>
<td valign="top" align="center">186/508</td>
</tr>
<tr>
<td valign="top" align="center">Green</td>
<td valign="top" align="center">160/536</td>
<td valign="top" align="center">35/73</td>
<td valign="top" align="center">195/609</td>
</tr>
<tr>
<td valign="top" align="center">Uneven</td>
<td valign="top" align="center">159/458</td>
<td valign="top" align="center">37/74</td>
<td valign="top" align="center">196/532</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="left"/>
<td valign="top" align="center">1021/2855</td>
<td valign="top" align="center">219/421</td>
<td valign="top" align="center">1240/3276</td>
</tr>
</tbody>
</table>
</table-wrap>
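The Laplace-operator sharpening described above can be sketched as follows. This is a minimal NumPy illustration assuming the common 4-neighbour Laplacian kernel; the paper does not specify the kernel variant, and in practice a library routine such as OpenCV's filter2D would typically be used instead of the explicit loop:

```python
import numpy as np

def laplacian_sharpen(img):
    """Sharpen a grayscale image by adding its 4-neighbour Laplacian response."""
    # 4-neighbour Laplacian kernel: flat regions give 0, edges give large responses
    k = np.array([[0, -1, 0],
                  [-1, 4, -1],
                  [0, -1, 0]], dtype=float)
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    # add the edge response back to the original image to enhance edges
    return np.clip(img + out, 0, 255)
```

A flat region passes through unchanged, while intensity transitions (fruit edges) are amplified, which is what makes the subsequent polygon annotation easier.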
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Apple detection and instance segmentation based on the improved Mask Scoring R-CNN (MS-ADS)</title>
<p>Mask Scoring R-CNN is one of the state-of-the-art detection and instance segmentation methods. It extends Mask R-CNN (<xref ref-type="bibr" rid="B10">He et&#xa0;al., 2020</xref>) by adding a MaskIoU branch to achieve accurate object detection and instance segmentation. In this study, an MS-ADS network model based on an improved Mask Scoring R-CNN was proposed to accurately detect and segment apples in orchards. <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> shows the structure of the MS-ADS network, which includes three parts: (1) Backbone network: ResNeSt (<xref ref-type="bibr" rid="B45">Zhang et&#xa0;al., 2022</xref>), a variant of ResNet, combined with FPN, was used as the backbone network for extracting image features. (2) The output of the backbone network was fed into the region proposal network (RPN) to generate region proposals; RoIAlign then extracted features from each proposal and properly aligned them with the input. (3) Classification and bounding box regression of apples were performed, and the apple masks were generated.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Structure of the MS-ADS network. 1. Input images are fed into the backbone for feature extraction. 2. The obtained feature maps are fed into the RPN and RoIAlign. 3. The resulting feature maps are fed into the R-CNN head for classification and bounding box regression, and into the attended mask head and MaskIoU head for apple instance segmentation.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g001.tif"/>
</fig>
<sec id="s2_2_1">
<label>2.2.1</label>
<title>Backbone network of MS-ADS</title>
<p>A backbone network extracts features from images for subsequent object detection and segmentation. In this study, ResNeSt-50, a variant of ResNet-50 fused with an attention mechanism, combined with FPN, was used as the backbone network.</p>
<p>ResNeSt (<xref ref-type="bibr" rid="B45">Zhang et&#xa0;al., 2022</xref>), which builds on ResNet, combines the advantages of Squeeze-and-Excitation networks (<xref ref-type="bibr" rid="B13">Hu et&#xa0;al., 2018</xref>), Selective Kernel networks (<xref ref-type="bibr" rid="B28">Li et&#xa0;al., 2019</xref>), and ResNeXt (<xref ref-type="bibr" rid="B43">Xie et&#xa0;al., 2017</xref>). As in ResNeXt blocks, a cardinality hyperparameter divides the feature map in a ResNeSt block into <italic>K</italic> groups, and a radix hyperparameter further divides each group into <italic>R</italic> splits. The input <italic>X</italic> is thus divided into <italic>G</italic> groups, <italic>G</italic> = <italic>KR</italic>, with <italic>X</italic> = {<italic>X</italic>
<sub>1</sub>, <italic>X</italic>
<sub>2</sub>,&#x2026;, <italic>X<sub>G</sub>
</italic>}. A series of transformations <italic>F</italic> = {<italic>F</italic>
<sub>1</sub>, <italic>F</italic>
<sub>2</sub>,&#x2026;, <italic>F<sub>G</sub>
</italic>} are performed on each individual group, then the intermediate representation of each group is <italic>U<sub>i</sub>
</italic> = <italic>F<sub>i</sub>
</italic>(<italic>X<sub>i</sub>
</italic>), <italic>i</italic> &#x2208;{1, 2,&#x2026;, <italic>G</italic>}. A weighted fusion of the cardinal group representation <italic>V<sup>k</sup></italic> &#x2208; &#x211d;<sup><italic>H</italic>&#xd7;<italic>W</italic>&#xd7;<italic>C</italic>/<italic>K</italic></sup> (<italic>H</italic>, <italic>W</italic> and <italic>C</italic> are the height, width and channel dimensions of the output feature map) is aggregated using channel-wise soft attention, where each feature map channel is produced by a weighted combination over the splits. The features of the <italic>c</italic>-th channel are calculated by formula (1).</p>
<disp-formula>
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:msubsup>
<mml:mi>V</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>R</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>K</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> denotes an assignment weight. The cardinal group representations are then concatenated along the channel dimension: <italic>V</italic>= Concat{<italic>V</italic>
<sup>1</sup>, <italic>V</italic>
<sup>2</sup>,&#x2026;, <italic>V<sup>K</sup>
</italic>}. In a standard residual block, if the input and output feature maps share the same shape, the final output <italic>Y</italic> of the ResNeSt block is produced using a shortcut connection: <italic>Y</italic> = <italic>V</italic> + <italic>X</italic>. For blocks with a stride, the shapes of the input and output feature maps differ; hence, an appropriate transformation <italic>T</italic> is applied to the shortcut connection to align the output shapes: <italic>Y</italic> = <italic>V</italic> + <italic>T</italic>(<italic>X</italic>).</p>
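The split-attention fusion of formula (1) can be illustrated with a small NumPy sketch. In the real network the attention logits come from learned fully connected layers applied after global pooling, so the `logits` argument here is a stand-in for those learned values:

```python
import numpy as np

def split_attention(splits, logits):
    """Weighted fusion of R splits into one cardinal-group representation V^k.

    splits: array of shape (R, H, W, C) -- the R intermediate maps U_i
    logits: array of shape (R, C)      -- per-channel attention logits
    Returns V^k of shape (H, W, C), where a_i^k(c) is a softmax over splits.
    """
    R = splits.shape[0]
    if R == 1:
        # radix 1 degenerates to a sigmoid gate, as in SE-Net
        a = 1.0 / (1.0 + np.exp(-logits))
    else:
        e = np.exp(logits - logits.max(axis=0, keepdims=True))
        a = e / e.sum(axis=0, keepdims=True)   # softmax across the R splits
    # broadcast a_i^k(c) over the spatial dims and sum over splits (formula 1)
    return np.sum(a[:, None, None, :] * splits, axis=0)
```

With equal logits the fusion reduces to a plain average of the splits; learned logits let each output channel weight the splits differently.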
<p>The ResNeSt block is shown in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>. An equivalent transformation of the network module shown in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref> was used in this experiment because it can be modularized and accelerated using group convolution and standard CNN layers (<xref ref-type="bibr" rid="B45">Zhang et&#xa0;al., 2022</xref>). In this study, we used ResNeSt-50 to extract features, with the parameter <italic>R</italic> set to 2 and <italic>K</italic> set to 1. The output of ResNeSt-50 was used as the input to the FPN, and together they functioned as the backbone network of the MS-ADS model. The FPN extracts multi-scale features from a pyramid hierarchy of convolutional feature maps and combines the features from each stage of ResNeSt-50, giving the network both semantic and spatial information and thus improving its accuracy.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>ResNeSt block. <italic>r</italic> is the split index, <italic>r</italic> = 1, 2,&#x2026;, <italic>R</italic>; <italic>k</italic> is the cardinal group index, <italic>k</italic> = 1, 2,&#x2026;, <italic>K</italic>. <italic>h</italic>, <italic>w</italic> and <italic>c</italic> represent the height, width and number of channels of the input feature map, respectively. Conv denotes a convolutional layer, and Global pooling denotes global average pooling. BN and ReLU denote batch normalization and the activation function, respectively.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g002.tif"/>
</fig>
</sec>
<sec id="s2_2_2">
<label>2.2.2</label>
<title>Generation of Region of interest and RoIAlign</title>
<p>The feature maps generated by the backbone network were fed into the RPN to search for RoIs where apples may be located. When generating RoIs, according to the typical size of a single fruit in the image, three area scales (32 &#xd7; 32, 64 &#xd7; 64 and 128 &#xd7; 128) and three aspect ratios (1:1, 1:2 and 2:1) were combined to generate nine anchors. The anchors were used to predict the locations of apples and enhance the accuracy of the RoI outputs. After the RoIs were generated, the RoIs and the corresponding feature maps were input into RoIAlign to resize each region to a fixed size. RoIAlign properly aligns the extracted features with the input, which improves pixel-level segmentation accuracy.</p>
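As an illustration, the nine anchors can be enumerated from the three scales and three aspect ratios. The width/height parameterization below follows the common RPN convention (each anchor keeps the area scale&#xb2; while the height/width ratio varies); this convention is an assumption, since the exact parameterization depends on the implementation:

```python
import numpy as np

def make_anchors(scales=(32, 64, 128), ratios=(1.0, 0.5, 2.0)):
    """Generate the 9 base anchor sizes (w, h) from 3 area scales x 3 ratios.

    Each anchor preserves the area scale**2 while h/w = ratio varies,
    i.e. w = scale / sqrt(ratio) and h = scale * sqrt(ratio).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return anchors
```

At inference the nine base shapes are tiled over every RPN feature-map position, and the network scores and refines each one into an RoI.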
</sec>
<sec id="s2_2_3">
<label>2.2.3</label>
<title>Apple detection and instance segmentation based on MS-ADS</title>
<p>The feature maps obtained from RoIAlign were used as input for the high-level heads of the MS-ADS model. The heads included an improved R-CNN head, an attended mask head and a MaskIoU head. High-level feature maps, containing rich context and semantic information, are useful for determining the invariant and abstract features needed in a variety of vision tasks, including target detection and classification. Modifying the high-level architectures (the R-CNN head and the mask head) improved the utilization of high-level features for detecting apples at various scales, and proved necessary and beneficial for obtaining more accurate detection results and more refined edge segmentation.</p>
<sec id="s2_2_3_1">
<label>2.2.3.1</label>
<title>Improved R-CNN head</title>
<p>The improved R-CNN head of the MS-ADS model, which was used for classification and bounding box regression, was composed of convolutional layers and a fully connected layer. The structure of the improved R-CNN head is shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. Four convolutional layers were added to the original R-CNN head to extract features more thoroughly and improve the accuracy of the final classification and regression. The kernel size, padding and stride of the added convolutional layers were 3 &#xd7; 3, 1 and 1, respectively, and the number of output channels was 256.</p>
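A quick check with the standard convolution output-size formula confirms that each added 3 &#xd7; 3 layer (stride 1, padding 1) preserves the spatial size of the RoI feature map, so the four extra layers refine features without shrinking them:

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A 3x3 convolution with stride 1 and padding 1 keeps the feature map size.
assert conv_out(14) == 14
```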
</sec>
<sec id="s2_2_3_2">
<label>2.2.3.2</label>
<title>Attended mask head</title>
<p>To further improve the accuracy of instance segmentation, in this research, the DANet (<xref ref-type="bibr" rid="B6">Fu et&#xa0;al., 2019</xref>) was inserted into the original mask head. The structure of DANet is illustrated in <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref>. DANet captures global context over local features with two modules: a position attention module and a channel attention module. The position attention module selectively integrates the feature at each position through a weighted sum of the features at all positions, so that similar features are related to each other regardless of their distance. The channel attention module selectively emphasizes interdependent channel maps by aggregating relevant features among all channel maps. DANet sums the outputs of the two attention modules to further enhance the feature representation and achieve more accurate segmentation results.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>Structure of DANet. (<italic>H</italic>, <italic>W</italic> and <italic>C</italic> are the height, width and channel of the input feature map, respectively).</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g003.tif"/>
</fig>
<p>In this study, DANet was inserted after the second convolutional layer of the original mask head (as shown in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>) to obtain a precise segmentation mask. The improved mask head was named the attended mask head.</p>
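The core operation of the position attention module, in which each position is recomputed as a softmax-weighted sum of the features at all positions, can be sketched as follows. This is a bare-bones illustration: the 1 &#xd7; 1 convolutions that produce the query, key and value projections and the learned residual scale of the full DANet module are omitted.

```python
import numpy as np

def position_attention(feat):
    """Core of DANet's position attention: each spatial position becomes
    a weighted sum of the features at all positions (the Q/K/V projections
    and learned residual scale of the full module are omitted)."""
    C, H, W = feat.shape
    v = feat.reshape(C, H * W)                 # (C, N), N spatial positions
    energy = v.T @ v                           # (N, N) pairwise similarities
    energy -= energy.max(axis=1, keepdims=True)
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over source positions
    out = v @ attn.T                           # attend over all positions
    return out.reshape(C, H, W)

y = position_attention(np.random.rand(4, 3, 3))
```

The channel attention module is analogous but computes the affinity matrix between channels instead of between positions.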
</sec>
<sec id="s2_2_3_3">
<label>2.2.3.3</label>
<title>MaskIoU head</title>
<p>The MaskIoU head consists of convolutional layers and fully connected layers. It regresses the IoU between the predicted mask and its ground truth mask. The output features of RoIAlign and the predicted mask were concatenated, and the concatenated result was used as the input for the MaskIoU head. The output dimension of the MaskIoU head equals the number of classes; in this study, the number of classes is 1, i.e., the apple class.</p>
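The quantity regressed by the MaskIoU head, the IoU between a predicted binary mask and its ground truth, can be computed as in this small sketch (the masks here are toy arrays, not outputs of the model):

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks: intersection area / union area."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

a = np.zeros((4, 4)); a[:2, :] = 1   # top half of the image
b = np.zeros((4, 4)); b[:, :2] = 1   # left half of the image
iou = mask_iou(a, b)                 # overlap 4 px, union 12 px -> 1/3
```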
</sec>
</sec>
<sec id="s2_2_4">
<label>2.2.4</label>
<title>Loss function</title>
<p>The loss function represents the difference between the prediction and the ground truth, which is very important in network training. The loss function of the MS-ADS network model was composed of two parts: RPN loss and the training loss of the three heads, as shown in formula (2).</p>
<disp-formula>
<label>(2)</label>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>L</italic> is the loss of the MS-ADS network model, <italic>L<sub>RPN</sub>
</italic> is the loss of RPN, which can be calculated by formula (3).</p>
<disp-formula>
<label>(3)</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
<mml:mo>+</mml:mo>
<mml:mi>&#x3bb;</mml:mi>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>b</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>b</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where, <italic>L<sub>RPN</sub>
</italic>
<sub>_</sub>
<italic>
<sub>cls</sub>
</italic> and <italic>L<sub>RPN</sub>
</italic>
<sub>_</sub>
<italic>
<sub>box</sub>
</italic> are the classification loss and the bounding box regression loss of RPN, respectively. &#x3bb; is a balance parameter. <italic>N<sub>RPN</sub>
</italic>
<sub>_</sub>
<italic>
<sub>cls</sub>
</italic> and <italic>N<sub>RPN</sub>
</italic>
<sub>_</sub>
<italic>
<sub>box</sub>
</italic> are the mini-batch size and the number of anchor locations, respectively. <italic>p<sub>i</sub>
</italic> is the classification probability of anchor <italic>i</italic>, and <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is the ground truth label probability of anchor <italic>i</italic>. <italic>t<sub>i</sub>
</italic> represents the difference between the predicted bounding box and the ground truth labelled box. <inline-formula>
<mml:math display="inline" id="im3">
<mml:mrow>
<mml:msubsup>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denotes the difference between the ground truth labelled box and the positive anchor.</p>
<p>
<italic>L<sub>heads</sub>
</italic> represents the total loss of the three heads, i.e., the sum of their individual losses. <italic>L<sub>heads</sub>
</italic> can be calculated by formula (4).</p>
<disp-formula>
<label>(4)</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where, <italic>L<sub>cls</sub>
</italic> and <italic>L<sub>box</sub>
</italic> are the classification loss and the bounding box regression loss of the improved R-CNN head, respectively, <italic>L<sub>mask</sub>
</italic> is the mask loss of attended mask head, and <italic>L<sub>maskIoU</sub>
</italic> is the MaskIoU loss of the MaskIoU head. The loss functions of the three heads in this study are the same as those of the original Mask Scoring R-CNN.</p>
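A toy numerical sketch of how formulas (2)-(4) compose is given below. The per-anchor loss values, the labels and &#x3bb; are illustrative assumptions only; in practice the individual terms come from the cross-entropy, smooth-L1, mask and MaskIoU losses of the network.

```python
def rpn_loss(cls_losses, box_losses, labels, lam=1.0):
    """Formula (3): mean classification loss over the mini-batch plus a
    lambda-weighted box regression loss, where the labels p_i* select
    which (positive) anchors contribute to the regression term."""
    n_cls = len(cls_losses)                       # N_RPN_cls: mini-batch size
    n_box = len(box_losses)                       # N_RPN_box: anchor locations
    cls = sum(cls_losses) / n_cls
    box = sum(p * l for p, l in zip(labels, box_losses)) / n_box
    return cls + lam * box

def total_loss(l_rpn, l_cls, l_box, l_mask, l_maskiou):
    """Formulas (2) and (4): L = L_RPN + (L_cls + L_box + L_mask + L_maskIoU)."""
    return l_rpn + l_cls + l_box + l_mask + l_maskiou

# Hypothetical per-anchor losses: only the first anchor is positive.
l_rpn = rpn_loss([0.2, 0.4], [0.3, 0.1], labels=[1, 0])   # 0.3 + 0.15 = 0.45
L = total_loss(l_rpn, 0.1, 0.2, 0.15, 0.05)               # 0.45 + 0.5 = 0.95
```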
</sec>
<sec id="s2_2_5">
<label>2.2.5</label>
<title>Network training and evaluation of MS-ADS network model</title>
<p>The experiments were run on an Intel Core i7-7700HQ processor with 16 GB of RAM and an 8 GB NVIDIA GTX 1070 GPU. The network was trained on Ubuntu 16.04, and Python 3.6 was used for training and testing the MS-ADS network model.</p>
<p>The original Mask Scoring R-CNN model pre-trained on the COCO dataset (<xref ref-type="bibr" rid="B24">Lin et&#xa0;al., 2014</xref>) was used to initialize the MS-ADS model to accelerate the training process. The manually annotated apple images were then used for training and testing the MS-ADS network. The number of iterations was set to 24 epochs. The initial learning rate was set to 0.02 and was divided by 10 at the 16th and 22nd epochs. The momentum and weight decay were set to 0.9 and 1 &#xd7; 10<sup>&#x2212;4</sup>, respectively. The total training time was 3&#xa0;h and 6&#xa0;min.</p>
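The learning-rate schedule described above can be expressed as a small step function; this is a sketch of the schedule itself, not of any particular training framework's API.

```python
def learning_rate(epoch, base_lr=0.02, milestones=(16, 22), gamma=0.1):
    """Step schedule used for training: the rate starts at 0.02 and is
    divided by 10 at the 16th and 22nd of the 24 epochs."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

schedule = [learning_rate(e) for e in range(24)]   # 0.02 -> 0.002 -> 0.0002
```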
<p>To test the performance of the proposed MS-ADS method on the detection and instance segmentation of apples, <italic>precision</italic>, <italic>recall, F</italic>1 score, mean average precision of the detection bounding box (<italic>bbox_mAP</italic>), mean average precision of the segmentation mask (<italic>mask_mAP</italic>) and average run time were used to evaluate the method.</p>
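For reference, the F1 score follows from precision and recall as F1 = 2PR/(P + R). The sketch below shows the computation with hypothetical detection counts, and verifies that the reported precision (96.5%) and recall (97.4%) reproduce the F1 value of 96.9% in Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Detection metrics (in percent) from true/false positives and
    false negatives; the counts passed in here are illustrative only."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# F1 recomputed from the precision and recall reported for MS-ADS.
p, r = 96.5, 97.4
f1 = 2 * p * r / (p + r)   # rounds to the 96.9% reported in Table 2
```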
</sec>
</sec>
</sec>
<sec id="s3" sec-type="results">
<label>3</label>
<title>Results</title>
<sec id="s3_1">
<label>3.1</label>
<title>Apple detection and instance segmentation using the MS-ADS method</title>
<p>To verify the effectiveness of the proposed MS-ADS method, 219 apple images captured during the growth stage were used to test the method. The <italic>precision</italic> and <italic>recall</italic> of the MS-ADS method were 96.5% and 97.4%, respectively, and the false detection rate was 3.5%. The <italic>bbox_mAP</italic> and <italic>mask_mAP</italic> were 0.932 and 0.920, respectively, on the test set, and the average run time was 0.27 s (<xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref>). Examples of the detection and instance segmentation results are illustrated in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>. To further analyze the detection results of apples under various conditions, the recall of apples affected by different factors, such as independent apples, occluded apples, apples divided into parts by branches and petioles, clustered apples, red apples, green apples, and apples with uneven colors, shadows or uneven illumination on the surface, was calculated and analyzed. The results are shown in <xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref>.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Detection and instance segmentation results of apples on test set.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Evaluations</th>
<th valign="top" align="center">
<italic>precision/</italic>%</th>
<th valign="top" align="center">
<italic>recall/</italic>%</th>
<th valign="top" align="center">
<italic>F1</italic>/%</th>
<th valign="top" align="center">
<italic>False detection</italic>/%</th>
<th valign="top" align="center">
<italic>bbox_mAP</italic>
</th>
<th valign="top" align="center">
<italic>mask_mAP</italic>
</th>
<th valign="top" align="center">run time/s</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">MS-ADS</td>
<td valign="top" align="left">96.5</td>
<td valign="top" align="left">97.4</td>
<td valign="top" align="left">96.9</td>
<td valign="top" align="center">3.5</td>
<td valign="top" align="center">0.932</td>
<td valign="top" align="center">0.920</td>
<td valign="top" align="center">0.27</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>Examples of detection and instance segmentation of apples. <bold>(A, C, E)</bold> Original images. Specifically, (A1) Green apples affected by shadows. (A2) Small green apples with strong illumination on the surface. (A3) Apples affected by overlap, occlusion, shadows, and strong illumination. (A4) Green apple image captured under a backlight condition. (A5) Green apples with uneven illumination on the surface. (A6) Green apples with high similarities to the background. (C1) Overlapped apples with uneven colors. (C2) Apples affected by occlusion, shadows, and uneven colors. (C3) Apples affected by overlap, occlusion, shadows, and uneven colors. (C4) Apples with uneven colors and shadows on the surface captured under backlight conditions. (C5) Apples affected by overlap, occlusion, uneven colors, and backlight. (C6) Apples affected by overlap, occlusion, and uneven colors. (E1) Red overlapped apples and apples with uneven colors. (E2) Red apples with uneven illumination and apples with uneven colors. (E3) Overlapped and small red apples. (E4) Red apples affected by occlusion and shadows. (E5) Red apples affected by overlap and shadows. (E6) Red apples affected by overlap, occlusion, and shadows. (B1-6, D1-6, F1-6) Detection and instance segmentation results of images in <bold>(A, C, E)</bold>.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g004.tif"/>
</fig>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Detection results of apples under different conditions.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Conditions</th>
<th valign="top" align="center">IA</th>
<th valign="top" align="center">OA</th>
<th valign="top" align="center">DA</th>
<th valign="top" align="center">CA</th>
<th valign="top" align="center">RA</th>
<th valign="top" align="center">GA</th>
<th valign="top" align="center">UC</th>
<th valign="top" align="center">SA</th>
<th valign="top" align="center">UI</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">
<italic>recall/</italic>%</td>
<td valign="top" align="left">98.9</td>
<td valign="top" align="left">96.3</td>
<td valign="top" align="left">95.8</td>
<td valign="top" align="left">96.6</td>
<td valign="top" align="left">98.3</td>
<td valign="top" align="left">96.8</td>
<td valign="top" align="left">97.5</td>
<td valign="top" align="left">97.2</td>
<td valign="top" align="left">98.1</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>IA, Independent apple; OA, Occluded apples; DA, Apples divided into parts by branches and petioles; CA, Clustered apples; RA, Red apples; GA, Green apples; UC, Apples with uneven colors on the surface; SA, Apples with shadows on the surface; UI, Apples with uneven illuminations on the surface.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>As can be seen in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref> and <xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref>, the MS-ADS method was accurate and effective in detection and instance segmentation of green apples (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4A</bold>
</xref>), apples with uneven colors on the surface (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4C</bold>
</xref>) and red apples (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4E</bold>
</xref>), and the detection recalls of these apples were 96.8%, 97.5% and 98.3%, respectively. The MS-ADS method achieved accurate detection for apples occluded by branches and leaves (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A1, A3, A6, C3, C5, C6, E4, E6</bold>
</xref>), and the detection recall was 96.3%. Among apples occluded by branches and leaves, apples divided into multiple parts by branches or petioles (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A6, C5, C6, E4</bold>
</xref>) are often considered a special case, and this kind of apple is relatively difficult to detect. However, the detection recall of apples under this condition using the MS-ADS method was 95.8%, indicating that the proposed method is applicable for the detection and segmentation of apples divided into parts by branches or petioles. The MS-ADS method was also effective in detecting clustered apples (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A3, A4, A6, C1, C2, C3, C5, C6, E1, E2, E3, E5, E6</bold>
</xref>), and the detection recall was 96.6%. Apples with shadows (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A1, A3, C2, C3, C4, E4, E5, E6</bold>
</xref>) and uneven illumination (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A5, E2</bold>
</xref>) on the surface were also accurately detected by the MS-ADS method; the detection recalls of apples with shadows and with uneven illumination on the surface were 97.2% and 98.1%, respectively. Additionally, the detection results for apples with extremely strong illumination (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A2, A3</bold>
</xref>), extremely dark illumination on the surface (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4A4</bold>
</xref>) and extremely small apples (<xref ref-type="fig" rid="f4">
<bold>Figures&#xa0;4A2, E3</bold>
</xref>) by the MS-ADS method were all satisfactory. The MS-ADS method was also effective in detecting apples that were similar to the backgrounds (<xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4A6</bold>
</xref>), a task that is difficult even for human eyes.</p>
<p>From the detection and instance segmentation results shown in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>, <xref ref-type="table" rid="T2">
<bold>Tables&#xa0;2</bold>
</xref> and <xref ref-type="table" rid="T3">
<bold>3</bold>
</xref>, it is clear that the proposed MS-ADS method overcame the effects of color, illumination, overlap, occlusion, complex backgrounds and shadows, and accurately and effectively detected and segmented apples under various conditions with good robustness.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Comparison with other methods</title>
<p>To further analyze the performance of the proposed MS-ADS method, parameters including <italic>precision</italic>, <italic>recall</italic>, <italic>F</italic>1 score, <italic>bbox_mAP</italic>, <italic>mask_mAP</italic>, and average run time were used to evaluate the MS-ADS method. The performance of the method was compared with those of six other methods: YOLACT (<xref ref-type="bibr" rid="B2">Bolya et&#xa0;al., 2019</xref>), PolarMask (<xref ref-type="bibr" rid="B44">Xie et&#xa0;al., 2020</xref>), Mask R-CNN (<xref ref-type="bibr" rid="B10">He et&#xa0;al., 2020</xref>) with ResNet-50-FPN as backbone, Mask R-CNN with ConvNeXt-T (<xref ref-type="bibr" rid="B26">Liu Z. et&#xa0;al., 2022</xref>) as backbone, Mask R-CNN integrated with GRoIE (<xref ref-type="bibr" rid="B34">Rossi et&#xa0;al., 2021</xref>), and Mask Scoring R-CNN (<xref ref-type="bibr" rid="B12">Huang et&#xa0;al., 2019</xref>). The configurations used in the seven methods are shown in <xref ref-type="table" rid="T4">
<bold>Table&#xa0;4</bold>
</xref>. In the comparison experiments, 5-fold cross-validation was used to evaluate the seven methods. We divided the dataset into five parts of 219, 256, 255, 255, and 255 images so that the ratio of training set to test set was about 8:2 in each experiment. <xref ref-type="table" rid="T5">
<bold>Table&#xa0;5</bold>
</xref> gives the detection and instance segmentation results of the seven methods; the results are the averages of the five independent experiments.</p>
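The train/test ratio of the five folds can be checked directly from the part sizes given above (a simple arithmetic sketch, not part of the original evaluation code):

```python
folds = [219, 256, 255, 255, 255]   # image counts of the five parts
total = sum(folds)                  # 1240 images in all

# In each of the five runs one part serves as the test set and the other
# four as the training set, giving a train:test ratio of roughly 8:2.
ratios = [(total - f) / total for f in folds]   # training fraction per run
```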
<table-wrap id="T4" position="float">
<label>Table&#xa0;4</label>
<caption>
<p>Configurations of seven methods.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Methods</th>
<th valign="top" align="center">Backbone</th>
<th valign="top" align="center">Initial learning rate</th>
<th valign="top" align="center">Momentum</th>
<th valign="top" align="center">Weight decay</th>
<th valign="top" align="center">Iteration number</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Mask R-CNN</td>
<td valign="top" align="left">ResNet-50-FPN</td>
<td valign="top" align="center">0.02</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">1 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">24 epochs</td>
</tr>
<tr>
<td valign="top" align="left">Mask R-CNN</td>
<td valign="top" align="left">ConvNeXt-T</td>
<td valign="top" align="center">0.00007</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">5 &#xd7; 10<sup>&#x2212;2</sup>
</td>
<td valign="top" align="left">24 epochs</td>
</tr>
<tr>
<td valign="top" align="left">Mask Scoring R-CNN</td>
<td valign="top" align="left">ResNet-50-FPN</td>
<td valign="top" align="center">0.02</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">1 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">24 epochs</td>
</tr>
<tr>
<td valign="top" align="left">YOLACT</td>
<td valign="top" align="left">ResNet-50-FPN</td>
<td valign="top" align="center">0.001</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">5 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">55 epochs</td>
</tr>
<tr>
<td valign="top" align="left">PolarMask</td>
<td valign="top" align="left">ResNet-50-FPN</td>
<td valign="top" align="center">0.01</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">1 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">12 epochs</td>
</tr>
<tr>
<td valign="top" align="left">Mask R-CNN + GRoIE</td>
<td valign="top" align="left">ResNet-50-FPN</td>
<td valign="top" align="center">0.02</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">1 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">12 epochs</td>
</tr>
<tr>
<td valign="top" align="left">MS-ADS</td>
<td valign="top" align="left">ResNeSt-50-FPN</td>
<td valign="top" align="center">0.02</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">1 &#xd7; 10<sup>&#x2212;4</sup>
</td>
<td valign="top" align="left">24 epochs</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As can be seen from <xref ref-type="table" rid="T5">
<bold>Table&#xa0;5</bold>
</xref>, the proposed MS-ADS method was more accurate in apple detection than the six other methods in terms of precision, F1 score and bbox_mAP. Although its recall and mask_mAP were slightly lower than those of the ConvNeXt-T-based Mask R-CNN, MS-ADS detected and segmented apples faster and with less computation. Although its run time was longer than those of YOLACT, PolarMask, Mask R-CNN (ResNet-50-FPN) and Mask Scoring R-CNN, the MS-ADS method was more accurate in detecting and segmenting apples throughout the whole growth period. Overall, the MS-ADS method outperformed the six other methods and enabled accurate, near-real-time detection and segmentation of apples under complex backgrounds.</p>
<table-wrap id="T5" position="float">
<label>Table&#xa0;5</label>
<caption>
<p>Detection and instance segmentation results of seven methods.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Methods</th>
<th valign="top" colspan="6" align="center">Evaluation parameters/Average</th>
</tr>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="center">
<italic>precision/</italic>%</th>
<th valign="top" align="center">
<italic>recall/</italic>%</th>
<th valign="top" align="center">
<italic>F</italic>1/%</th>
<th valign="top" align="center">
<italic>bbox_mAP</italic>
</th>
<th valign="top" align="center">
<italic>mask_mAP</italic>
</th>
<th valign="top" align="center">run time/s</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Mask R-CNN<break/>(ResNet-50)</td>
<td valign="top" align="center">92.8</td>
<td valign="top" align="center">94.5</td>
<td valign="top" align="center">93.6</td>
<td valign="top" align="center">0.919</td>
<td valign="top" align="center">0.908</td>
<td valign="top" align="center">0.17</td>
</tr>
<tr>
<td valign="top" align="left">Mask R-CNN<break/>(ConvNeXt-T)</td>
<td valign="top" align="center">95.9</td>
<td valign="top" align="center">
<bold>97.0</bold>
</td>
<td valign="top" align="center">
<bold>96.4</bold>
</td>
<td valign="top" align="center">0.925</td>
<td valign="top" align="center">
<bold>0.920</bold>
</td>
<td valign="top" align="center">0.39</td>
</tr>
<tr>
<td valign="top" align="left">Mask Scoring R-CNN</td>
<td valign="top" align="center">94.4</td>
<td valign="top" align="center">95.8</td>
<td valign="top" align="center">95.1</td>
<td valign="top" align="center">0.921</td>
<td valign="top" align="center">0.910</td>
<td valign="top" align="center">0.25</td>
</tr>
<tr>
<td valign="top" align="left">YOLACT</td>
<td valign="top" align="center">91.5</td>
<td valign="top" align="center">92.9</td>
<td valign="top" align="center">92.2</td>
<td valign="top" align="center">0.891</td>
<td valign="top" align="center">0.905</td>
<td valign="top" align="center">
<bold>0.16</bold>
</td>
</tr>
<tr>
<td valign="top" align="left">PolarMask</td>
<td valign="top" align="center">92.0</td>
<td valign="top" align="center">93.5</td>
<td valign="top" align="center">92.7</td>
<td valign="top" align="center">0.908</td>
<td valign="top" align="center">0.903</td>
<td valign="top" align="center">0.21</td>
</tr>
<tr>
<td valign="top" align="left">Mask R-CNN +GRoIE</td>
<td valign="top" align="center">94.8</td>
<td valign="top" align="center">96.3</td>
<td valign="top" align="center">95.5</td>
<td valign="top" align="center">0.923</td>
<td valign="top" align="center">0.908</td>
<td valign="top" align="center">0.40</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>MS-ADS</bold>
</td>
<td valign="top" align="center">
<bold>96.0</bold>
</td>
<td valign="top" align="center">96.9</td>
<td valign="top" align="center">
<bold>96.4</bold>
</td>
<td valign="top" align="center">
<bold>0.928</bold>
</td>
<td valign="top" align="center">0.918</td>
<td valign="top" align="center">0.29</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The best values are marked bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s4" sec-type="discussion">
<label>4</label>
<title>Discussion</title>
<sec id="s4_1">
<label>4.1</label>
<title>Analysis of detection and segmentation results of apples in the growth period</title>
<p>Accurate fruit detection and segmentation during the growth period are crucial for yield estimation, timely harvesting and automatic monitoring of fruit growth. Apples are grown in open, unstructured orchards; the detection and segmentation of apples are therefore affected by several factors, such as fluctuating illumination, overlapping and occlusion of apples, and similarities between immature green apples and the background color, which make accurate detection and segmentation challenging. The MS-ADS method was proposed in this study to solve these problems. To further improve the detection and segmentation accuracy of the Mask Scoring R-CNN model, ResNeSt, a variant of ResNet that incorporates an attention mechanism, was combined with FPN to replace the backbone network of the original Mask Scoring R-CNN. This improved the network's feature extraction capability by making it more attentive to apple features while effectively ignoring background features. Convolutional layers were added to the original R-CNN head to improve the accuracy of bounding box regression, and a dual attention network was inserted into the original mask head to improve segmentation accuracy. The apple detection and instance segmentation results showed that the MS-ADS method accurately detected and segmented apples under various conditions in near real time.</p>
<p>There were also false detection and segmentation when using the MS-ADS method, as shown in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>. False detection was mainly caused by the high similarities between the background and apples. As shown in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5A</bold>
</xref>, a tag, which was made by testers, was falsely detected as an apple. In the image shown in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5B</bold>
</xref>, a green leaf was falsely detected as an apple. Future improvements will include expanding the training set with samples containing similar backgrounds to reduce false detections. The false detection rate of the MS-ADS method in this study was 3.5%. Despite these false detections, the MS-ADS method achieved the best detection and segmentation results on the test set.</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>False detection and segmentation. <bold>(A, B)</bold> Original images. <bold>(C, D)</bold> Detection and instance segmentation results of original images <bold>(A, B)</bold>.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g005.tif"/>
</fig>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Effect of the improved parts of the model on apple detection and segmentation</title>
<p>The proposed MS-ADS method was built by modifying the Mask Scoring R-CNN (<xref ref-type="bibr" rid="B12">Huang et&#xa0;al., 2019</xref>). First, ResNeSt-50 combined with FPN was used as the backbone network to improve the feature extraction ability of the network. To further improve the accuracy of bounding box regression and segmentation, convolutional layers were added to the original R-CNN head for more thorough feature extraction, and DANet was inserted into the original mask head for more accurate segmentation. To analyze the effect of each improvement on the performance of apple detection and segmentation, the training loss function (<xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>), model parameters (<xref ref-type="table" rid="T6">
<bold>Table&#xa0;6</bold>
</xref>) and the detection and segmentation results on 219 test images (<xref ref-type="table" rid="T6">
<bold>Table&#xa0;6</bold>
</xref> and <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>) of the original Mask Scoring R-CNN (ResNet-50-FPN), Mask Scoring R-CNN with ResNeSt-50-FPN as the backbone network, Mask Scoring R-CNN with ResNeSt-50-FPN as the backbone network and improved R-CNN head, and the MS-ADS were compared.</p>
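The dual attention network (DANet) inserted into the mask head combines position attention and channel attention over the feature map. A minimal NumPy sketch of the position-attention idea, simplified by omitting the learned query/key/value convolutions and the learnable residual weight of the full module:

```python
import numpy as np

def position_attention(x: np.ndarray) -> np.ndarray:
    """Simplified DANet position attention: each spatial location is
    re-expressed as a weighted sum of all locations, with weights from a
    softmax over pairwise feature similarities, plus a residual connection.
    x: feature map of shape (C, H, W)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                   # (C, N) with N = H*W
    energy = flat.T @ flat                       # (N, N) pairwise similarities
    energy -= energy.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    out = flat @ attn.T                          # aggregate features, (C, N)
    return x + out.reshape(c, h, w)              # residual connection

feat = np.random.rand(8, 4, 4).astype(np.float32)
print(position_attention(feat).shape)  # (8, 4, 4): shape is preserved
```

The full module also applies an analogous attention over channels and sums the two branches before the mask prediction layers.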
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>Loss curves of four methods. <bold>(A)</bold> Overall loss curves. <bold>(B)</bold> Bounding box loss curves. <bold>(C)</bold> Mask loss curves. <bold>(D)</bold> Mask_IoU loss curves.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g006.tif"/>
</fig>
<table-wrap id="T6" position="float">
<label>Table&#xa0;6</label>
<caption>
<p>Detection and instance segmentation results of four methods.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left">Methods</th>
<th valign="top" align="center">Model size/MB</th>
<th valign="top" align="center">GFLOPs</th>
<th valign="top" align="center">Parameters</th>
<th valign="top" align="center">
<italic>precision/</italic>%</th>
<th valign="top" align="center">
<italic>recall/</italic>%</th>
<th valign="top" align="center">
<italic>F</italic>1/%</th>
<th valign="top" align="center">
<italic>bbox_mAP</italic>
</th>
<th valign="top" align="center">
<italic>mask_mAP</italic>
</th>
<th valign="top" align="center">Train/h</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Original Mask Scoring R-CNN<break/>(ResNet-50-FPN)</td>
<td valign="top" align="center">
<bold>481.4</bold>
</td>
<td valign="top" align="center">
<bold>85.4</bold>
</td>
<td valign="top" align="center">
<bold>60.0M</bold>
</td>
<td valign="top" align="center">94.9</td>
<td valign="top" align="center">96.4</td>
<td valign="top" align="center">95.6</td>
<td valign="top" align="center">0.924</td>
<td valign="top" align="center">0.915</td>
<td valign="top" align="center">
<bold>2.5</bold>
</td>
</tr>
<tr>
<td valign="top" align="left">Mask Scoring R-CNN (ResNeSt-50-FPN)</td>
<td valign="top" align="center">496.8</td>
<td valign="top" align="center">93.1</td>
<td valign="top" align="center">62.3M</td>
<td valign="top" align="center">96.5</td>
<td valign="top" align="center">97.1</td>
<td valign="top" align="center">96.8</td>
<td valign="top" align="center">0.924</td>
<td valign="top" align="center">0.919</td>
<td valign="top" align="center">2.7</td>
</tr>
<tr>
<td valign="top" align="left">Mask Scoring R-CNN (ResNeSt-50-FPN and improved R-CNN head)</td>
<td valign="top" align="center">507.3</td>
<td valign="top" align="center">207.8</td>
<td valign="top" align="center">65.4M</td>
<td valign="top" align="center">95.6</td>
<td valign="top" align="center">97.4</td>
<td valign="top" align="center">96.5</td>
<td valign="top" align="center">0.930</td>
<td valign="top" align="center">0.918</td>
<td valign="top" align="center">3.0</td>
</tr>
<tr>
<td valign="top" align="left">
<bold>MS-ADS</bold>
</td>
<td valign="top" align="center">510.4</td>
<td valign="top" align="center">207.8</td>
<td valign="top" align="center">63.6M</td>
<td valign="top" align="center">
<bold>96.5</bold>
</td>
<td valign="top" align="center">
<bold>97.4</bold>
</td>
<td valign="top" align="center">
<bold>96.9</bold>
</td>
<td valign="top" align="center">
<bold>0.932</bold>
</td>
<td valign="top" align="center">
<bold>0.920</bold>
</td>
<td valign="top" align="center">3.1</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The best values are marked bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>Comparison of detection and segmentation results of four methods. <bold>(A)</bold> Original images. <bold>(B)</bold> Results of the MS-ADS method. <bold>(C)</bold> Results of the original Mask Scoring R-CNN (ResNet-50-FPN). <bold>(D)</bold> Results of the Mask Scoring R-CNN with ResNeSt-50-FPN. <bold>(E)</bold> Results of the Mask Scoring R-CNN with ResNeSt-50-FPN and improved R-CNN head.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-1016470-g007.tif"/>
</fig>
<p>As can be seen from <xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>, the training loss curves of the proposed MS-ADS model are lower than those of the other three models. We improved the backbone network, R-CNN head, and mask head of the original Mask Scoring R-CNN; these improvements raised the quality of the generated bounding boxes and masks, so the overall loss was reduced in comparison to the other three models.</p>
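The panels of Figure 6 correspond to components of the model's multi-task objective: the overall loss tracked in Figure 6A aggregates, among others, the bounding box, mask, and mask-IoU terms shown in the other panels. A schematic, unweighted sketch (the per-task values are hypothetical, and the actual training objective may weight the terms differently):

```python
def overall_loss(cls_loss: float, bbox_loss: float,
                 mask_loss: float, mask_iou_loss: float) -> float:
    """Schematic multi-task objective: unweighted sum of the
    classification, bounding-box, mask, and mask-IoU terms."""
    return cls_loss + bbox_loss + mask_loss + mask_iou_loss

# Hypothetical per-task loss values at some training step
print(round(overall_loss(0.05, 0.08, 0.12, 0.02), 2))  # 0.27
```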
<p>During the experiments, the original Mask Scoring R-CNN based on ResNet-50 was used first. To further improve the feature extraction ability of the backbone network, ResNeSt-50, a variant of ResNet-50, was then used to replace ResNet-50. Comparing the detection and segmentation results of the two models shows that although the model size, computational cost, and number of parameters increased, the performance of the Mask Scoring R-CNN based on ResNeSt-50 improved markedly, as reflected in the <italic>precision</italic>, <italic>recall</italic>, <italic>F</italic>1 score, <italic>bbox_mAP</italic> and <italic>mask_mAP</italic> of the two methods. To make the detection more accurate and improve the <italic>bbox_mAP</italic>, we added four convolutional layers to the R-CNN head to extract features more thoroughly. The experimental results showed that although the <italic>bbox_mAP</italic> improved, <italic>precision</italic> and <italic>mask_mAP</italic> decreased. To further improve <italic>precision</italic> and <italic>mask_mAP</italic> while maintaining a high <italic>bbox_mAP</italic>, DANet was inserted into the mask head; after this addition, <italic>bbox_mAP</italic> and <italic>mask_mAP</italic> improved and <italic>precision</italic> rebounded. However, because we replaced the backbone of the original Mask Scoring R-CNN, added convolutional layers to the R-CNN head, and inserted DANet into the mask head, the proposed model was more complex than the original and its computational cost increased substantially, resulting in longer training and detection times. The results in <xref ref-type="table" rid="T6">
<bold>Table&#xa0;6</bold>
</xref> show that although the model size, computational cost, parameters, and training time of the proposed MS-ADS method increased, its detection and segmentation accuracy improved significantly, demonstrating that the MS-ADS model is suitable for the accurate detection and instance segmentation of apples in this study.</p>
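The <italic>F</italic>1 values in Table 6 follow directly from the listed precision and recall as their harmonic mean; for example:

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)

# MS-ADS row of Table 6: precision 96.5%, recall 97.4%
print(round(f1_score(96.5, 97.4), 1))  # 96.9
# Original Mask Scoring R-CNN row: precision 94.9%, recall 96.4%
print(round(f1_score(94.9, 96.4), 1))  # 95.6
```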
<p>
<xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref> shows the comparison results of the four methods. Although all four methods detected and segmented the apples in the images (i.e., their <italic>precision</italic> and <italic>recall</italic> were high), the quality of the detected bounding boxes (<italic>bbox_mAP</italic>) and segmented masks (<italic>mask_mAP</italic>) differed considerably. In contrast to the other three methods, the MS-ADS method achieved accurate detection and segmentation of the apples while ensuring the quality of the bounding boxes and masks.</p>
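Both <italic>bbox_mAP</italic> and <italic>mask_mAP</italic> rest on the intersection-over-union (IoU) between predictions and ground truth: a prediction counts as correct only when its IoU with a ground-truth apple exceeds a threshold. A minimal sketch for axis-aligned boxes (the coordinates below are hypothetical):

```python
def box_iou(a: list, b: list) -> float:
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical predicted vs. ground-truth apple boxes
print(round(box_iou([0, 0, 10, 10], [5, 5, 15, 15]), 3))  # 0.143
```

Mask IoU is the analogous ratio computed over segmented pixels rather than box areas, which is what the mask-scoring branch of the model estimates.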
</sec>
</sec>
<sec id="s5" sec-type="conclusions">
<title>Conclusions</title>
<p>The MS-ADS method was proposed in this study to accurately detect and instance-segment apples at different growth stages. The method was developed from the original Mask Scoring R-CNN. First, ResNeSt-50, a variant of ResNet-50 fused with an attention mechanism, combined with FPN, was used to replace the backbone network of the original Mask Scoring R-CNN to enhance the feature extraction ability of the network model. Second, convolutional layers were added to the original R-CNN head to extract features more thoroughly and further enhance the accuracy of the generated bounding boxes. Finally, DANet was inserted into the original mask head to further improve the accuracy of instance segmentation. Compared with the original Mask Scoring R-CNN, the proposed MS-ADS model performed better at detecting and segmenting apples under various conditions.</p>
<p>The MS-ADS method effectively and accurately detected and segmented apples under various conditions during the growth stage, with good robustness and real-time performance. The <italic>recall</italic>, <italic>precision</italic>, <italic>F</italic>1 score, <italic>bbox_mAP</italic>, <italic>mask_mAP</italic> and average run-time of our method on the test set were 97.4%, 96.5%, 96.9%, 0.932, 0.920 and 0.27 s per image, respectively. This research could serve as a reference for developing an automatic, long-term monitoring system for retrieving apple growth information.</p>
<p>The detection and instance segmentation results of this method improve on prior studies; however, the network model is relatively large, and several aspects still need refinement. In the future, we will continue to track the latest research results and further expand the training set to cover more apple varieties and apples under more varied conditions. We will also study methods to streamline the network model and to improve its efficiency and the accuracy of apple detection and segmentation.</p>
</sec>
<sec id="s6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s7" sec-type="author-contributions">
<title>Author contributions</title>
<p>WD: Conceptualization, Data curation, Methodology, Software, Formal analysis, Resources, Writing&#x2013;original draft, Supervision, Funding acquisition. HD: Conceptualization, Writing&#x2013;review and editing. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s8" sec-type="funding-information">
<title>Funding</title>
<p>This work was funded by the Natural Science Basic Research Program of Shaanxi (2022JQ-186); Talent introduction Program of Xi&#x2019;an University of Science and Technology (2050121002).</p>
</sec>
<sec id="s9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Barbole</surname> <given-names>D. K.</given-names>
</name>
<name>
<surname>Jadhav</surname> <given-names>P. M.</given-names>
</name>
<name>
<surname>Patil</surname> <given-names>S. B.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>A review on fruit detection and segmentation techniques in agricultural field</article-title>,&#x201d; in <source>Second international conference on image processing and capsule networks, (ICIPCN)</source>, vol. <volume>300</volume>. (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>269</fpage>&#x2013;<lpage>288</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bolya</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Xiao</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>Y. J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>YOLACT: Real-time instance segmentation</article-title>,&#x201d; in <source>In 2019 IEEE/CVF international conference on computer vision (ICCV)</source> (<publisher-loc>Seoul Korea</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>9157</fpage>&#x2013;<lpage>9166</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chu</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Lammers</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Lu</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep learning-based apple detection using a suppression mask r-CNN</article-title>. <source>Pattern Recognit. Lett.</source> <volume>147</volume> (<issue>6</issue>), <fpage>206</fpage>&#x2013;<lpage>211</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.patrec.2021.04.022</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dutta</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Zisserman</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>The VIA annotation software for images, audio and video</article-title>,&#x201d; in <conf-name>In Proceedings of the 27th ACM International Conference on Multimedia</conf-name> (<publisher-loc>New York</publisher-loc>: <publisher-name>ACM</publisher-name>) <fpage>2276</fpage>&#x2013;<lpage>2279</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1145/3343031.3350535</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fu</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Gao</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Application of consumer RGB-d cameras for fruit detection and localization in field: A critical review</article-title>. <source>Comput. Electron. Agric.</source> <volume>177</volume>, <fpage>105687</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2020.105687</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Fu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Tian</surname> <given-names>H. J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Bao</surname> <given-names>Y. J.</given-names>
</name>
<name>
<surname>Fang</surname> <given-names>Z. W.</given-names>
</name>
<etal/>
</person-group>. (<year>2019</year>). &#x201c;<article-title>Dual attention network for scene segmentation</article-title>,&#x201d; in <source>In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3141</fpage>&#x2013;<lpage>3149</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gongal</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Silwal</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Karkee</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Lewis</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Amatya</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Apple crop-load estimation with over-the-row machine vision system</article-title>. <source>Comput. Electron. Agric.</source> <volume>120</volume>, <fpage>26</fpage>&#x2013;<lpage>35</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2015.10.022</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gonzalez</surname> <given-names>R. C.</given-names>
</name>
<name>
<surname>Woods</surname> <given-names>R. E.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Digital image processing</source> (<publisher-loc>Beijing</publisher-loc>: <publisher-name>Publishing House of Electronics Industry</publisher-name>).</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Oerlemans</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Lao</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Lew</surname> <given-names>M. S.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Deep learning for visual understanding: A review</article-title>. <source>Neurocomputing</source> <volume>187</volume> (<issue>C</issue>), <fpage>27</fpage>&#x2013;<lpage>48</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2015.09.116</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Gkioxari</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Mask r-CNN</article-title>. <source>IEEE T. Pattern Anal.</source> <volume>42</volume> (<issue>2</issue>), <fpage>386</fpage>&#x2013;<lpage>397</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2018.2844175</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <source>In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Las Vegas</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Huang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Gong</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Mask scoring r-CNN</article-title>,&#x201d; in <source>In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6402</fpage>&#x2013;<lpage>6411</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Squeeze-and-excitation networks</article-title>,&#x201d; in <source>In 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Salt Lake</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7132</fpage>&#x2013;<lpage>7141</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>M. Y.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>C. J.</given-names>
</name>
<name>
<surname>Pan</surname> <given-names>N. N.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>X. B.</given-names>
</name>
<etal/>
</person-group>. (<year>2022</year>b). <article-title>YOLOF-snake: An efficient segmentation model for green object fruit</article-title>. <source>Front. Plant Sci.</source> <volume>13</volume>, <elocation-id>765523</elocation-id>. doi: <pub-id pub-id-type="doi">10.3389/fpls.2022.765523</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Song</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y. F.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Z. Y.</given-names>
</name>
<name>
<surname>Song</surname> <given-names>H. B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Fusion of the YOLO V4 network model and visual attention mechanism to detect low-quality young apples in a complex environment</article-title>. <source>Precis. Agric.</source> <volume>23</volume>, <fpage>559</fpage>&#x2013;<lpage>577</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11119-021-09849-0</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Tian</surname> <given-names>Y. Y.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z. H.</given-names>
</name>
<name>
<surname>Lian</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>Y. J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Detection and segmentation of overlapped fruits based on optimized mask r-CNN application in apple harvesting robot</article-title>. <source>Comput. Electron. Agric.</source> <volume>172</volume>, <fpage>105380</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2020.105380</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z. F.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z. H.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>X. B.</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>S. J.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>Y. J.</given-names>
</name>
</person-group> (<year>2022</year>c). <article-title>A fast and efficient green apple object detection model based on foveabox</article-title>. <source>J. King Saud. Univ. Com.</source> <volume>34</volume> (<issue>8</issue>), <fpage>5156</fpage>&#x2013;<lpage>5169</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jksuci.2022.01.005</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z. H.</given-names>
</name>
<name>
<surname>Shao</surname> <given-names>W. J.</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>S. J.</given-names>
</name>
<name>
<surname>Ji</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>G. L.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Foveamask: A fast and accurate deep learning model for green fruit instance segmentation</article-title>. <source>Comput. Electron. Agric.</source> <volume>191</volume>, <fpage>106488</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2021.106488</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z. H.</given-names>
</name>
<name>
<surname>Shao</surname> <given-names>W. J.</given-names>
</name>
<name>
<surname>Ji</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>S. J.</given-names>
</name>
</person-group> (<year>2022</year>a). <article-title>RS-net: Robust segmentation of green overlapped apples</article-title>. <source>Precis. Agric.</source> <volume>23</volume>, <fpage>492</fpage>&#x2013;<lpage>513</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11119-021-09846-3</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname> <given-names>H. W.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>C.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Fruit detection and segmentation for apple harvesting using visual sensor in orchards</article-title>. <source>Sensors</source> <volume>19</volume>, <fpage>4599</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s19204599</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname> <given-names>H. W.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Fruit detection, segmentation and 3D visualisation of environments in apple orchards</article-title>. <source>Comput. Electron. Agric.</source> <volume>171</volume>, <fpage>105302</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2020.105302</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Hou</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A novel green apple segmentation algorithm based on ensemble U-net under complex orchard environment</article-title>. <source>Comput. Electron. Agric.</source> <volume>180</volume> (<issue>6</issue>), <fpage>105900</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2020.105900</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Linker</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Kelman</surname> <given-names>E.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Apple detection in nighttime tree images using the geometry of light patches around highlights</article-title>. <source>Comput. Electron. Agric.</source> <volume>114</volume>, <fpage>154</fpage>&#x2013;<lpage>162</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2015.04.005</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname> <given-names>T. Y.</given-names>
</name>
<name>
<surname>Maire</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Belongie</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Hays</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Perona</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Ramanan</surname> <given-names>D.</given-names>
</name>
<etal/>
</person-group>. (<year>2014</year>). &#x201c;<article-title>Microsoft COCO: Common objects in context</article-title>,&#x201d; in <source>In European conference on computer vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x2013;<lpage>755</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>M. Y.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z. F.</given-names>
</name>
<name>
<surname>Niu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Ruan</surname> <given-names>C. Z.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An accurate detection and segmentation model of obscured green fruits</article-title>. <source>Comput. Electron. Agric.</source> <volume>197</volume>, <fpage>106984</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2022.106984</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Mao</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>C. Y.</given-names>
</name>
<name>
<surname>Feichtenhofer</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Darrell</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Xie</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>A ConvNet for the 2020s</article-title>,&#x201d; in <source>In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>New Orleans</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>11976</fpage>&#x2013;<lpage>11986</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhan</surname> <given-names>Y. N.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>W. K.</given-names>
</name>
<name>
<surname>Ji</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>DLNet: Accurate segmentation of green fruit in obscured environments</article-title>. <source>J. King Saud. Univ. Com.</source> <volume>34</volume> (<issue>9</issue>), <fpage>7259</fpage>&#x2013;<lpage>7270</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.jksuci.2021.09.023</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>W. H.</given-names>
</name>
<name>
<surname>Hu</surname> <given-names>X. L.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Selective kernel networks</article-title>,&#x201d; in <source>In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>510</fpage>&#x2013;<lpage>519</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maheswari</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Raja</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Apolo</surname> <given-names>O. E.</given-names>
</name>
<name>
<surname>P&#xe9;rez-Ruiz</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Intelligent fruit yield estimation for orchards using deep learning based semantic segmentation techniques&#x2013;a review</article-title>. <source>Front. Plant Sci.</source> <volume>12</volume>, <elocation-id>684328</elocation-id>. doi: <pub-id pub-id-type="doi">10.3389/fpls.2021.684328</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Naranjo-Torres</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Mora</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Hern&#xe1;ndez-Garc&#xed;a</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Barrientos</surname> <given-names>R. J.</given-names>
</name>
<name>
<surname>Valenzuela</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A review of convolutional neural network applied to fruit image processing</article-title>. <source>Appl. Sci.</source> <volume>10</volume> (<issue>10</issue>), <fpage>3443</fpage>. doi: <pub-id pub-id-type="doi">10.3390/app10103443</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nguyen</surname> <given-names>T. T.</given-names>
</name>
<name>
<surname>Vandevoorde</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Wouters</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Kayacan</surname> <given-names>E.</given-names>
</name>
<name>
<surname>De Baerdemaeker</surname> <given-names>J. G.</given-names>
</name>
<name>
<surname>Saeys</surname> <given-names>W.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Detection of red and bicolored apples on tree with an RGB-D camera</article-title>. <source>Biosyst. Eng.</source> <volume>146</volume>, <fpage>33</fpage>&#x2013;<lpage>44</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.biosystemseng.2016.01.007</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rakun</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Stajnko</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zazula</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Detecting fruits in natural scenes by using spatial-frequency based texture analysis and multiview geometry</article-title>. <source>Comput. Electron. Agric.</source> <volume>76</volume> (<issue>1</issue>), <fpage>80</fpage>&#x2013;<lpage>88</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2011.01.007</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ronneberger</surname> <given-names>O.</given-names>
</name>
<name>
<surname>Fischer</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Brox</surname> <given-names>T.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>,&#x201d; in <source>International conference on medical image computing and computer-assisted intervention</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>234</fpage>&#x2013;<lpage>241</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rossi</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Karimi</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Prati</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>A novel region of interest extraction layer for instance segmentation</article-title>,&#x201d; in <source>2020 25th international conference on pattern recognition (ICPR)</source>(<publisher-loc>Milan, Italy</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2203</fpage>&#x2013;<lpage>2209</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/ICPR48806.2021.9412258</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saleem</surname> <given-names>M. H.</given-names>
</name>
<name>
<surname>Potgieter</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Arif</surname> <given-names>K. M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Automation in agriculture by machine and deep learning techniques: A review of recent developments</article-title>. <source>Precis. Agric.</source> <volume>22</volume> (<issue>6</issue>), <fpage>2053</fpage>&#x2013;<lpage>2091</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11119-021-09806-x</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tang</surname> <given-names>Y. C.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Luo</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Zou</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Recognition and localization methods for vision-based fruit picking robots: A review</article-title>. <source>Front. Plant Sci.</source> <volume>11</volume>, <fpage>1</fpage>&#x2013;<lpage>17</lpage>. doi: <pub-id pub-id-type="doi">10.3389/fpls.2020.00510</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Qiao</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Computer vision technology in agricultural automation&#x2014;a review</article-title>. <source>Inf. Process. Agric.</source> <volume>7</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>19</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.inpa.2019.09.006</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Instance segmentation of apple flowers using the improved Mask R-CNN model</article-title>. <source>Biosyst. Eng.</source> <volume>193</volume>, <fpage>264</fpage>&#x2013;<lpage>278</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.biosystemseng.2020.03.008</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Apple detection during different growth stages in orchards using the improved YOLO-V3 model</article-title>. <source>Comput. Electron. Agric.</source> <volume>157</volume>, <fpage>417</fpage>&#x2013;<lpage>426</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2019.01.012</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tu</surname> <given-names>S. Q.</given-names>
</name>
<name>
<surname>Yuan</surname> <given-names>W. J.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Wan</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Automatic detection and segmentation for group-housed pigs based on PigMS R-CNN</article-title>. <source>Sensors</source> <volume>21</volume> (<issue>9</issue>), <fpage>3251</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s21093251</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Gupta</surname> <given-names>A.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Non-local neural networks</article-title>,&#x201d; in <source>In 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>Salt Lake</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7794</fpage>&#x2013;<lpage>7803</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>D.</given-names>
</name>
<name>
<surname>He</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Fusion of Mask R-CNN and attention mechanism for instance segmentation of apples under complex background</article-title>. <source>Comput. Electron. Agric.</source> <volume>196</volume>, <fpage>106864</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compag.2022.106864</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xie</surname> <given-names>S. M.</given-names>
</name>
<name>
<surname>Girshick</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Tu</surname> <given-names>Z. W.</given-names>
</name>
<name>
<surname>He</surname> <given-names>K. M.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Aggregated residual transformations for deep neural networks</article-title>,&#x201d; in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Honolulu</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1492</fpage>&#x2013;<lpage>1500</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xie</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Song</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>D.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). &#x201c;<article-title>Polarmask: Single shot instance segmentation with polar representation</article-title>,&#x201d; in <source>In proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source> (<publisher-loc>Seattle</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>12193</fpage>&#x2013;<lpage>12202</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>H. B.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<etal/>
</person-group>. (<year>2022</year>). &#x201c;<article-title>ResNeSt: Split-attention</article-title>,&#x201d; in <source>In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source> (<publisher-loc>New Orleans</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2736</fpage>&#x2013;<lpage>2746</lpage>.</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Damerow</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Blanke</surname> <given-names>M. M.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Using color features of cv. &#x2018;Gala&#x2019; apple fruits in an orchard in image processing to predict yield</article-title>. <source>Precis. Agric.</source> <volume>13</volume> (<issue>5</issue>), <fpage>568</fpage>&#x2013;<lpage>580</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11119-012-9269-2</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Dai</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>An empirical study of spatial attention mechanisms in deep networks</article-title>,&#x201d; in <source>In 2019 IEEE/CVF international conference on computer vision (ICCV)</source>(<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>10</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>