<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2022.991487</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Plant Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Real-time guava tree-part segmentation using fully convolutional network with channel and spatial attention</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Lin</surname> <given-names>Guichao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1246237/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wang</surname> <given-names>Chenglin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1906507/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Xu</surname> <given-names>Yao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Minglong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Zhihao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zhu</surname> <given-names>Lixue</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x002A;</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Mechanical and Electrical Engineering, Zhongkai University of Agriculture and Engineering</institution>, <addr-line>Guangzhou</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Guangdong Laboratory for Lingnan Modern Agriculture</institution>, <addr-line>Guangzhou</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Kioumars Ghamkhar, AgResearch Ltd, New Zealand</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Changji Wen, Jilin Agricultural University, China; Longsheng Fu, Northwest A&#x0026;F University, China</p></fn>
<corresp id="c001">&#x002A;Correspondence: Chenglin Wang, <email>wangchenglin055@163.com</email></corresp>
<corresp id="c002">Lixue Zhu, <email>zhulixue@zhku.edu.cn</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Technical Advances in Plant Science, a section of the journal Frontiers in Plant Science</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>09</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>991487</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>07</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>08</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Lin, Wang, Xu, Wang, Zhang and Zhu.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Lin, Wang, Xu, Wang, Zhang and Zhu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>It is urgent to develop intelligent harvesting robots to alleviate the burden of the rising costs of manual picking. A key problem in robotic harvesting is how to recognize tree parts efficiently without losing accuracy, thus helping the robots plan collision-free paths. This study introduces a real-time tree-part segmentation network that improves a fully convolutional network with channel and spatial attention. A lightweight backbone is first deployed to extract low-level and high-level features. These features may contain redundant information in their channel and spatial dimensions, so a channel and spatial attention module is proposed to enhance informative channels and spatial locations. On this basis, a feature aggregation module is investigated to fuse the low-level details and high-level semantics to improve segmentation accuracy. A tree-part dataset with 891 RGB images is collected, and each image is manually annotated in a per-pixel fashion. Experimental results show that with MobileNetV3-Large as the backbone, the proposed network obtained intersection-over-union (IoU) values of 63.33 and 66.25% for branches and fruits, respectively, while requiring only 2.36 billion floating point operations (FLOPs); with MobileNetV3-Small as the backbone, it achieved IoU values of 60.62 and 61.05% for branches and fruits, respectively, at a cost of only 1.18 billion FLOPs. These results demonstrate that the proposed network can segment tree parts efficiently without loss of accuracy, and can thus be applied to harvesting robots to plan collision-free paths.</p>
</abstract>
<kwd-group>
<kwd>tree-part segmentation</kwd>
<kwd>MobileNetV3</kwd>
<kwd>attention mechanism</kwd>
<kwd>neural network</kwd>
<kwd>harvesting robot</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="3"/>
<equation-count count="9"/>
<ref-count count="33"/>
<page-count count="11"/>
<word-count count="7039"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>Fruit harvesting is time-sensitive and labor-intensive, making manual picking expensive. To reduce this cost burden, it is of great significance to develop intelligent harvesting robots. In structured environments, fruit trees are often planted in a V shape (<xref ref-type="bibr" rid="B8">Chen et al., 2021</xref>) or planar shape (<xref ref-type="bibr" rid="B31">Zhang et al., 2018</xref>), and the key problems facing the robots, fruit detection and localization, have been well addressed. However, in unstructured environments, fruit trees have complex three-dimensional structures, so a major problem facing the robots is how to recognize tree parts (including fruits, branches, and background) in order to plan collision-free paths (<xref ref-type="bibr" rid="B17">Lin et al., 2021a</xref>). Due to the complex shape and uneven thickness of branches, tree parts are difficult to identify (<xref ref-type="bibr" rid="B3">Barth et al., 2018</xref>; <xref ref-type="bibr" rid="B18">Lin et al., 2021b</xref>). Guava is a fruit widely grown in Guangdong Province, China. In this study, a real-time and accurate guava tree-part segmentation method is investigated to enable guava-harvesting robots to work in unstructured environments.</p>
<p>Tree-part segmentation can be accomplished by traditional image analysis methods, which require manually designing classifiers <italic>via</italic> feature engineering (<xref ref-type="bibr" rid="B1">Amatya et al., 2016</xref>; <xref ref-type="bibr" rid="B14">Ji et al., 2016</xref>). Such methods are usually limited to specific environments and fruit trees. Currently, state-of-the-art tree-part segmentation methods are dominated by fully convolutional networks (FCNs). Our previous study used a VGG16-based FCN to segment guava branches with an intersection-over-union (IoU) of 47.3% and an average running time of 0.165 s (<xref ref-type="bibr" rid="B16">Lin et al., 2019</xref>). Furthermore, we employed Mask R-CNN to detect and segment guava branches simultaneously, and obtained a 51.8% F1 score at a speed of 0.159 s per image (<xref ref-type="bibr" rid="B18">Lin et al., 2021b</xref>); unfortunately, slender branches proved difficult to recognize. Li et al. deployed DeepLabV3 with Xception65 as the backbone to recognize litchi branches and fruits, and accomplished a mean IoU (mIoU) of 78.46% at a speed of 0.6 s (<xref ref-type="bibr" rid="B15">Li et al., 2020</xref>). <xref ref-type="bibr" rid="B22">Majeed et al. (2020)</xref> used a VGG16-based SegNet to segment tree trunks, branches, and trellis wires, and achieved boundary-F1 scores of 0.93, 0.89, and 0.91, respectively. Zhang et al. employed DeepLabV3+ with the lightweight backbone ResNet18 to identify apple tree trunks and branches; the IoUs for trunks and branches were 63 and 40%, respectively, and the average running time was 0.35 s per image (<xref ref-type="bibr" rid="B32">Zhang et al., 2021</xref>). <xref ref-type="bibr" rid="B8">Chen et al. (2021)</xref> applied a ResNet50-based DeepLabV3, a ResNet34-based U-Net, and Pix2Pix to segment occluded branches, and found that DeepLabV3 outperformed the other models in terms of mIoU, binary accuracy, and boundary-F1 score. <xref ref-type="bibr" rid="B5">Boogaard et al. (2021)</xref> segmented cucumber plants into eight parts by using the point cloud segmentation network PointNet++ and obtained 95% mIoU. Wan et al. developed an improved YOLOv4 to detect branch segments, applied a thresholding segmentation method to remove the background, and used a polynomial fit to reconstruct the branches; the detection F1 score was 90%, and the running speed was 22.7 frames per second (FPS) (<xref ref-type="bibr" rid="B28">Wan et al., 2022</xref>). Because manually annotating a large empirical dataset is time-consuming and costly, Barth et al. trained DeepLabV2 with VGG16 as the backbone on a large synthetic dataset and then fine-tuned it on a small empirical dataset; the final network categorized pepper plants into seven different parts with a mIoU of 40% (<xref ref-type="bibr" rid="B4">Barth et al., 2019</xref>). Furthermore, <xref ref-type="bibr" rid="B2">Barth et al. (2020)</xref> deployed a cycle generative adversarial network to generate realistic synthetic images for training DeepLabV2 and obtained 52% mIoU. Although the approaches mentioned above produce encouraging results, they are typically computationally inefficient because they employ very deep backbones to encode both low-level and high-level features. How to strike a balance between real-time performance and accuracy remains a key open problem.</p>
<p>Recently, some efforts have been made to develop real-time segmentation networks. These efforts can be roughly divided into two categories. The first category uses existing lightweight backbones to reduce computation. <xref ref-type="bibr" rid="B11">Howard et al. (2019)</xref> developed a shallow segmentation head, appended it to the top of MobileNetV3, and achieved a mIoU of 72% with only 1.98 million multiply-accumulate operations on the Cityscapes dataset. Hu et al. proposed a fast spatial attention module to enhance the features encoded by ResNet34, used a simple decoder to merge the features, and achieved 75.5% mIoU at 58 FPS on the Cityscapes dataset (<xref ref-type="bibr" rid="B13">Hu P. et al., 2020</xref>). The second category uses customized lightweight backbones to speed up network inference. Yu et al. proposed a novel network termed BiSeNetV2, which uses a semantic branch with narrow channels and deep layers to generate high-level semantics, applies a detail branch with wide channels and shallow layers to obtain low-level details, and combines these features to predict a segment map; it achieves 72.6% mIoU on the Cityscapes dataset at a speed of 156 FPS (<xref ref-type="bibr" rid="B30">Yu et al., 2021</xref>). <xref ref-type="bibr" rid="B10">Gao (2021)</xref> proposed a fast backbone consisting of many dilated block structures and used a shallow decoder to output the segmentation; the network achieves 78.3% mIoU at 30 FPS on the Cityscapes dataset. Overall, the first category is more attractive because it utilizes existing backbones to extract semantic features and hence allows us to focus on more important modules such as the decoder.</p>
<p>The objective of this study is to develop a real-time and accurate tree-part segmentation network so that harvesting robots can avoid obstacles during harvesting. Specifically, a state-of-the-art lightweight backbone is deployed to capture low-level and high-level features. Then, an attention module is proposed to enhance informative channels and spatial locations in these features. Subsequently, the features are fused by a feature aggregation module, and the final feature map is processed by a segmentation head to output a segment map. A comprehensive experiment is performed to evaluate the proposed tree-part segmentation network.</p>
<p>The contributions of this study are as follows:</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>A tree-part dataset containing 891 RGB images is provided, where each image is manually annotated at the pixel level.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>A real-time tree-part segmentation network is proposed by improving an FCN with channel and spatial attention.</p>
</list-item>
<list-item>
<label>(3)</label>
<p>The developed network achieves impressive results. Specifically, when using MobileNetV3-Large as the backbone, the network achieves an IoU of 63.33, 66.25, and 93.12% for the branches, fruits and background, respectively, at a speed of 36 FPS.</p>
</list-item>
</list>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and methods</title>
<p>In this section, the data used for this research, including data acquisition, split and annotation, is presented in section 2.1. The developed tree-part segmentation network is introduced in section 2.2. Section 2.3 explains the evaluation criteria used to measure the performance of the developed network.</p>
<sec id="S2.SS1">
<title>Data</title>
<sec id="S2.SS1.SSS1">
<title>Data acquisition</title>
<p>The data acquisition site is located in a commercial guava orchard on Haiou Island, Guangzhou, China. The guava cultivar is carmine. The spacing is 3.1 m between neighboring rows and 2.5 m between neighboring trees within a row. A low-cost RealSense D435i depth camera, which can simultaneously generate RGB and depth images, is used to capture the data; this study only uses the RGB images, which have a resolution of 480 pixels by 640 pixels. The images were taken on September 24, 2021 between 12:00 and 16:00, just in time for the guava harvest. The day was sunny with a temperature range of 30&#x2013;34&#x00B0;C. During image acquisition, the camera was held by hand and moved along the path between two rows, at a distance of about 0.6 m from the guava trees. A total of 41,787 images were acquired. Because adjacent images look similar and contribute little to network training, a subset of 891 images was sampled uniformly. <xref ref-type="fig" rid="F1">Figure 1A</xref> shows a captured image.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Image example. <bold>(A)</bold> A guava tree. <bold>(B)</bold> Different parts of the guava tree, where the red, green, and black regions represent the fruit, branch, and background, respectively.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g001.tif"/>
</fig>
</sec>
<sec id="S2.SS1.SSS2">
<title>Data split</title>
<p>These 891 RGB images were divided into a test set and a training set. The test set contains the first 30% of the images, and the training set contains the remaining 70%. This sequential partitioning keeps the two sets independent and therefore better examines the generalization performance of the network, as sketched below.</p>
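<p>For concreteness, a minimal sketch of this sequential split is given here, assuming a chronologically ordered list of image paths (the variable name is illustrative):</p>
<preformat>
# Sequential 30/70 split: the first 30% of the (chronologically ordered)
# frames form the test set and the remaining 70% the training set, so the
# two sets do not interleave near-duplicate neighboring frames.
# `image_paths` is an assumed ordered list of the 891 image files.
n_test = int(0.3 * len(image_paths))
test_set = image_paths[:n_test]
train_set = image_paths[n_test:]
</preformat>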
</sec>
<sec id="S2.SS1.SSS3">
<title>Data annotation</title>
<p>Because branches and fruits will prevent the robots from getting close to the targets, they should be annotated so that the network can recognize them. Each pixel of the images in the training and test sets was annotated as a branch, fruit, or background class using the open-source annotation program LabelMe (<xref ref-type="bibr" rid="B26">Russell et al., 2008</xref>). A visual example is shown in <xref ref-type="fig" rid="F1">Figure 1B</xref>. It is worth noting that per-pixel annotation is very time-consuming; we spent almost 2 months accomplishing the annotation task.</p>
</sec>
</sec>
<sec id="S2.SS2">
<title>Tree-part segmentation network</title>
<p>This section illustrates the proposed tree-part segmentation network in detail. An efficient network backbone for capturing low-level and high-level features is introduced in Section 2.2.1. The proposed channel and spatial attention module for boosting meaningful features is elaborated in Section 2.2.2. Section 2.2.3 describes the multi-level feature aggregation module for fusing low-level details and high-level semantics. Section 2.2.4 introduces the segmentation head, and Section 2.2.5 presents the network architecture.</p>
<sec id="S2.SS2.SSS1">
<title>Backbone</title>
<p>To realize real-time segmentation and thus enable the harvesting robots to work efficiently, the efficient neural network MobileNetV3 (<xref ref-type="bibr" rid="B11">Howard et al., 2019</xref>) is employed as the segmentation network backbone. MobileNetV3 builds on the latest techniques such as depth-wise separable convolution, the inverted bottleneck (<xref ref-type="bibr" rid="B27">Sandler et al., 2018</xref>), and the squeeze-excitation network (<xref ref-type="bibr" rid="B12">Hu J. et al., 2020</xref>), and has been widely deployed in mobile applications. Layers outputting feature maps of the same resolution are considered to be at the same stage, and MobileNetV3 has five stages. Let &#x007B;<italic>C</italic><sub>2</sub>, <italic>C</italic><sub>3</sub>, <italic>C</italic><sub>4</sub>, <italic>C</italic><sub>5</sub>&#x007D; denote the outputs of the last layer of stage 2, stage 3, stage 4, and stage 5, respectively. Typically, the output of a shallow stage such as <italic>C</italic><sub>2</sub> contains low-level details but limited semantics, while that of a deep stage such as <italic>C</italic><sub>5</sub> contains high-level semantics but has low resolution. These low-level details and high-level semantics can be combined to achieve highly accurate segmentation (<xref ref-type="bibr" rid="B30">Yu et al., 2021</xref>), so both are utilized in this study.</p>
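<p>For illustration, the following PyTorch sketch shows one possible way to expose &#x007B;<italic>C</italic><sub>2</sub>, <italic>C</italic><sub>3</sub>, <italic>C</italic><sub>4</sub>, <italic>C</italic><sub>5</sub>&#x007D; from the torchvision implementation of MobileNetV3-Large; the stage-boundary indices and the weight identifier are assumptions that should be verified against the deployed backbone:</p>
<preformat>
import torch
from torchvision.models import mobilenet_v3_large

class MobileNetV3Stages(torch.nn.Module):
    # Returns {C2, C3, C4, C5}: the last layer of each stage, i.e., the
    # block just before the feature stride doubles. The indices below are
    # assumptions for torchvision's MobileNetV3-Large; index 16 (the final
    # 1x1 convolution, the last layer of stage 5) is deliberately skipped,
    # matching the modification described in the next paragraph.
    def __init__(self, stage_ends=(3, 6, 12, 15)):
        super().__init__()
        self.features = mobilenet_v3_large(weights="IMAGENET1K_V1").features
        self.stage_ends = set(stage_ends)

    def forward(self, x):
        outputs = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.stage_ends:
                outputs.append(x)
        return outputs  # [C2, C3, C4, C5] at strides 4, 8, 16, 32
</preformat>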
<p>Because MobileNetV3 was originally designed to output 1,000 classes for ImageNet (<xref ref-type="bibr" rid="B25">Russakovsky et al., 2015</xref>), its last few layers have many channels, which may be redundant for our task. In this study, the last layer in stage 5 is directly excluded; we found that this modification improves both segmentation accuracy and speed. Additionally, it is common practice to place atrous convolution in the last few stages of the backbone to generate dense feature maps, which can effectively increase segmentation accuracy (<xref ref-type="bibr" rid="B7">Chen et al., 2018</xref>; <xref ref-type="bibr" rid="B11">Howard et al., 2019</xref>). However, when developing the network model in this paper, we found that atrous convolution harmed the performance of our network, so we do not use it in the backbone.</p>
<p>Global context information can reduce the probability of misclassification. The pyramid pooling module (PPM) (<xref ref-type="bibr" rid="B33">Zhao et al., 2016</xref>) is a practical technique for generating global context information: it uses global average pooling layers at four different scales to enlarge the network receptive field, up-samples the resulting feature maps by bilinear interpolation so that they have the same size as the original feature map, and then concatenates them as the final global context information. PPM is attached at the top of MobileNetV3.</p>
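<p>A minimal PPM sketch in PyTorch is given below, assuming the four bin sizes &#x007B;1, 2, 3, 6&#x007D; of Zhao et al. (2016) and an illustrative projection width; note that the batch normalization in the pooling branches requires a mini-batch size larger than one during training:</p>
<preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    # PPM sketch (Zhao et al., 2016): pool at several bin sizes, project
    # with 1x1 convolutions, up-sample back by bilinear interpolation, and
    # concatenate with the input before a final projection.
    def __init__(self, in_channels, out_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True))
            for b in bins)
        self.project = nn.Sequential(
            nn.Conv2d(in_channels + len(bins) * out_channels, out_channels,
                      1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, x):
        size = x.shape[-2:]
        feats = [x] + [
            F.interpolate(stage(x), size=size, mode="bilinear",
                          align_corners=False)
            for stage in self.stages]
        return self.project(torch.cat(feats, dim=1))
</preformat>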
</sec>
<sec id="S2.SS2.SSS2">
<title>Channel and spatial attention module</title>
<p>Formally, &#x007B;<italic>C</italic><sub>2</sub>, <italic>C</italic><sub>3</sub>, <italic>C</italic><sub>4</sub>, <italic>C</italic><sub>5</sub>&#x007D; encode different levels of channel and spatial information, but not every channel offers useful information. A channel attention mechanism (<xref ref-type="bibr" rid="B24">Roy et al., 2018</xref>; <xref ref-type="bibr" rid="B29">Woo, 2018</xref>; <xref ref-type="bibr" rid="B12">Hu J. et al., 2020</xref>) can be used to recalibrate these feature maps to focus on useful channels, thereby increasing their representation power. Note that the squeeze-and-excitation attention block of MobileNetV3 refines some intermediate layers, whereas the channel attention mechanism here only refines the output of the last layer of each stage. Moreover, pixel-wise spatial information is particularly important for semantic segmentation, so the feature maps can be further recalibrated along the spatial dimensions using a spatial attention mechanism, making them more informative spatially (<xref ref-type="bibr" rid="B24">Roy et al., 2018</xref>; <xref ref-type="bibr" rid="B29">Woo, 2018</xref>). To this end, a channel and spatial attention module (CSAM) is proposed, which consists of a channel attention module and a spatial attention module. CSAM is detailed as follows.</p>
<p>The channel attention module, inspired by <xref ref-type="bibr" rid="B11">Howard et al. (2019)</xref>, is developed to strengthen useful channels and weaken useless ones. Let <bold>X</bold> &#x2208; &#x211D;<italic><sup>H &#x00D7; W &#x00D7; C</sup></italic> denote a feature map, where <italic>H</italic> and <italic>W</italic> are the spatial height and width, and <italic>C</italic> is the number of channels. A global average pooling layer is first applied to <bold><italic>X</italic></bold>, resulting in a vector <bold>u</bold> &#x2208; &#x211D;<sup><italic>C</italic></sup> with its <italic>k<sup>th</sup></italic> element:</p>
<disp-formula id="S2.E1"><label>(1)</label><mml:math id="M1"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>H</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>h</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>H</mml:mi></mml:munderover><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>w</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>W</mml:mi></mml:munderover><mml:mrow><mml:mtext mathvariant="bold">u</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Vector <bold>u</bold> is then used to generate a gate vector <bold><italic>g</italic></bold> by employing a gating mechanism:</p>
<disp-formula id="S2.E2"><label>(2)</label><mml:math id="M2"><mml:mrow><mml:mtext mathvariant="bold">g</mml:mtext><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>1</mml:mn></mml:msub><mml:mtext mathvariant="bold">u</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where &#x03C3; refers to the sigmoid function, <inline-formula><mml:math id="INEQ7"><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>1</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mpadded width="+3.3pt"><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is a learnable tensor, and <italic>r</italic> is a reduction ratio used to limit model complexity. Gate vector <bold><italic>g</italic></bold> measures the usefulness of the channels and is used to recalibrate <bold><italic>X</italic></bold>:</p>
<disp-formula id="S2.E3"><label>(3)</label><mml:math id="M3"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mtext mathvariant="bold">X</mml:mtext><mml:mi>c</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mtext mathvariant="bold">g</mml:mtext><mml:mrow><mml:mo largeop="true" mathsize="160%" movablelimits="false" stretchy="false" symmetric="true">&#x2297;</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x03B4;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x002A;</mml:mo><mml:mtext mathvariant="bold">X</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where &#x2297; denotes the channel-wise multiplication, &#x03B4; is the ReLU function, &#x002A; refers to convolution, <inline-formula><mml:math id="INEQ10"><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mpadded width="+3.3pt"><mml:mn>1</mml:mn></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mpadded width="+3.3pt"><mml:mn>1</mml:mn></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mpadded width="+3.3pt"><mml:mi>C</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> denotes the filter kernel, and <inline-formula><mml:math id="INEQ11"><mml:mrow><mml:msub><mml:mtext mathvariant="bold">X</mml:mtext><mml:mi>c</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>H</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mpadded width="+3.3pt"><mml:mi>W</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the projection of <bold><italic>X</italic></bold>. Equation 3 not only depicts the interdependencies between the channels of <bold>X</bold>, but also highlights the useful channels while downplaying the useless ones.</p>
<p>In order to fully exploit the spatial information of the feature map, the spatial attention module developed by <xref ref-type="bibr" rid="B24">Roy et al. (2018)</xref> is deployed. Specifically, a gate map <bold>G</bold> &#x2208; &#x211D;<italic><sup>H &#x00D7; W</sup></italic> is first generated <italic>via</italic> squeezing the feature map along its channel dimension and employing a sigmoid function:</p>
<disp-formula id="S2.E4"><label>(4)</label><mml:math id="M4"><mml:mrow><mml:mtext mathvariant="bold">G</mml:mtext><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>3</mml:mn></mml:msub><mml:mo>&#x002A;</mml:mo><mml:msub><mml:mtext mathvariant="bold">X</mml:mtext><mml:mi>c</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="INEQ13"><mml:mrow><mml:msub><mml:mtext mathvariant="bold">W</mml:mtext><mml:mn>3</mml:mn></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:mpadded width="+3.3pt"><mml:mn>1</mml:mn></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mpadded width="+3.3pt"><mml:mn>1</mml:mn></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mpadded width="+3.3pt"><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:mpadded><mml:mo rspace="5.8pt">&#x00D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> is the filter kernel. Then gate map <bold><italic>G</italic></bold> is used to rescale the feature map:</p>
<disp-formula id="S2.E5"><label>(5)</label><mml:math id="M5"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mtext mathvariant="bold">X</mml:mtext><mml:mi>s</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mtext mathvariant="bold">G</mml:mtext><mml:mrow><mml:mo largeop="true" mathsize="160%" movablelimits="false" stretchy="false" symmetric="true">&#x2297;</mml:mo><mml:msub><mml:mtext mathvariant="bold">X</mml:mtext><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where &#x2297; denotes the element-wise multiplication. Equation 5 makes the network focus on important spatial locations and ignore useless ones.</p>
<p>The architecture of CSAM is illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>. CSAM is appended on <italic>C</italic><sub>2</sub>, <italic>C</italic><sub>3</sub>, <italic>C</italic><sub>4</sub> and the output of PPM, and the corresponding reduction ratios are set to &#x007B;1, 1, 2, 4&#x007D; for MobileNetV3-Large and &#x007B;1, 1, 1, 2&#x007D; for MobileNetV3-Small. The resulting feature maps are denoted as &#x007B;<italic>G</italic><sub>2</sub>, <italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D;. It is worth noting that CSAM is attached to PPM rather than <italic>C</italic><sub>5</sub> simply because PPM itself contains <italic>C</italic><sub>5</sub>. A similar attention module was proposed by <xref ref-type="bibr" rid="B24">Roy et al. (2018)</xref>; CSAM differs in that it introduces a reduction ratio to reduce module complexity, and information passes through the two sub-modules sequentially, progressively filtering out useless information.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Design details of CSAM. Note that <italic>Conv</italic> is convolutional operation, and BN is batch normalization; 1 &#x00D7; 1 represents the kernel size, <italic>H</italic> &#x00D7; <italic>W</italic> &#x00D7; <italic>C</italic> and <italic>H</italic> &#x00D7; <italic>W</italic> &#x00D7; <italic>C</italic>/<italic>r</italic> denote the tensor shape (height, width, and depth); the first &#x2297; refers to channel-wise multiplication, and the second &#x2297; is element-wise multiplication.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g002.tif"/>
</fig>
<p>Let us consider an input feature map of <italic>C</italic> channels. The channel attention module introduces <inline-formula><mml:math id="INEQ15"><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mi>r</mml:mi></mml:mfrac></mml:math></inline-formula> new weights, while the spatial attention module introduces <inline-formula><mml:math id="INEQ16"><mml:mfrac><mml:mi>C</mml:mi><mml:mi>r</mml:mi></mml:mfrac></mml:math></inline-formula> weights. Thus, a CSAM brings a total of <inline-formula><mml:math id="INEQ17"><mml:mfrac><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mpadded width="+3.3pt"><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">+</mml:mo><mml:mi>C</mml:mi></mml:mrow><mml:mi>r</mml:mi></mml:mfrac></mml:math></inline-formula> parameters. Because the feature maps of MobileNetV3 have relatively few channels, these extra parameters add only a small amount of computation to the backbone.</p>
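<p>The following PyTorch sketch assembles Equations 1&#x2013;5 into a single module; the layer ordering follows the text and Figure 2, while details such as bias settings are assumptions:</p>
<preformat>
import torch
import torch.nn as nn

class CSAM(nn.Module):
    # Channel and spatial attention sketch following Eqs. 1-5; the output
    # has C/r channels, matching the projection in Eq. 3.
    def __init__(self, channels, r=1):
        super().__init__()
        reduced = channels // r
        self.w1 = nn.Linear(channels, reduced)        # W1 in Eq. 2
        self.w2 = nn.Sequential(                      # delta(W2 * X) in Eq. 3
            nn.Conv2d(channels, reduced, 1, bias=False),
            nn.BatchNorm2d(reduced),
            nn.ReLU(inplace=True))
        self.w3 = nn.Conv2d(reduced, 1, 1)            # W3 in Eq. 4

    def forward(self, x):
        u = x.mean(dim=(2, 3))                        # Eq. 1: global pooling
        g = torch.sigmoid(self.w1(u))                 # Eq. 2: channel gate
        xc = self.w2(x) * g[:, :, None, None]         # Eq. 3: recalibration
        G = torch.sigmoid(self.w3(xc))                # Eq. 4: spatial gate map
        return xc * G                                 # Eq. 5: spatial rescaling
</preformat>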
</sec>
<sec id="S2.SS2.SSS3">
<title>Feature aggregation module</title>
<p>Typically, thin branches are harder to segment than thick branches, because detailed information is easily lost when the output stride is increased. This problem can be alleviated by fusing feature maps from different layers, such as &#x007B;<italic>G</italic><sub>2</sub>, <italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D;. A simple variant of the feature pyramid network (FPN) (<xref ref-type="bibr" rid="B19">Lin et al., 2016</xref>) is used to gradually up-sample and merge the feature maps from the deepest to the shallowest. As shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, our FPN variant first appends a 1 &#x00D7; 1 convolutional layer on the coarsest feature map <italic>G</italic><sub>5</sub> to reduce its channel dimension, up-samples <italic>G</italic><sub>5</sub> by a factor of 2, and then merges <italic>G</italic><sub>5</sub> with its corresponding bottom-up map <italic>G</italic><sub>4</sub> by element-wise addition. This process is repeated until the finest feature map is generated. A 3 &#x00D7; 3 convolutional layer is appended on each merged feature map to generate the final feature map with a fixed output dimension of 48. Here, batch normalization and ReLU are adopted after each convolution, which are omitted to simplify notation. On this basis, these feature maps are concatenated. Because lower-level feature maps may have larger values than higher-level ones, which probably destabilizes network training, the concatenated features should be normalized carefully. To this end, an <italic>L</italic><sub>2</sub> normalization layer (<xref ref-type="bibr" rid="B20">Liu et al., 2015</xref>) is applied to the concatenated features. Specifically, let <bold>X</bold> = (<bold>x</bold><sub>1</sub>, &#x2026;, <bold>x</bold><sub><italic>C</italic></sub>) be the concatenated features, where <italic>C</italic> is the number of channels. <bold><italic>X</italic></bold> is normalized with the following equation:</p>
<disp-formula id="S2.E6"><label>(6)</label><mml:math id="M6"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mtext mathvariant="bold">x</mml:mtext><mml:mi>c</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:msub><mml:mi mathvariant="normal">&#x03B3;</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mfrac><mml:msub><mml:mtext mathvariant="bold">x</mml:mtext><mml:mi>c</mml:mi></mml:msub><mml:msub><mml:mrow><mml:mo fence="true">&#x007C;&#x007C;</mml:mo><mml:msub><mml:mtext mathvariant="bold">x</mml:mtext><mml:mi>c</mml:mi></mml:msub><mml:mo fence="true">&#x007C;&#x007C;</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msub></mml:mfrac></mml:mrow></mml:mrow></mml:math></disp-formula>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Design details of FAM. Note that <italic>up</italic> refers to up-sampling by bilinear interpolation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g003.tif"/>
</fig>
<p>where &#x007C;&#x007C;&#x22C5;&#x007C;&#x007C;<sub>2</sub> denotes the <italic>L</italic><sub>2</sub> norm; <italic>c</italic> = 1, &#x2026;, <italic>C</italic>; and &#x03B3;<sub><italic>c</italic></sub> is a learnable scaling parameter, which prevents the resulting features from being too small and hence promotes network learning. In the experiments, the initial value of &#x03B3;<sub><italic>c</italic></sub> is set to 1. Subsequently, a CSAM with a reduction ratio of <italic>K</italic> is attached after the <italic>L</italic><sub>2</sub> normalization layer to further refine the feature map, where <italic>K</italic> is the number of feature maps fused. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the architecture of the proposed FAM.</p>
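<p>A compact sketch of FAM under the above definitions is given below, reusing the CSAM sketch from the previous section. Equation 6 is read here as normalizing each channel map by its own <italic>L</italic><sub>2</sub> norm, and the up-sampling of all smoothed maps to the finest resolution before concatenation is an assumption implied by the concatenation step:</p>
<preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAM(nn.Module):
    # Feature aggregation sketch: FPN-style top-down merging, 3x3 smoothing
    # to 48 channels, concatenation, channel-wise L2 normalization with
    # learnable scales gamma_c initialized to 1 (Eq. 6), then a CSAM whose
    # reduction ratio equals the number of fused maps (K).
    def __init__(self, in_channels, mid_channels=48):
        super().__init__()
        K = len(in_channels)
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, mid_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(mid_channels, mid_channels, 3, padding=1,
                          bias=False),
                nn.BatchNorm2d(mid_channels),
                nn.ReLU(inplace=True))
            for _ in in_channels)
        self.gamma = nn.Parameter(torch.ones(K * mid_channels))
        self.csam = CSAM(K * mid_channels, r=K)

    def forward(self, feats):  # feats = [G2, G3, G4, G5], fine to coarse
        maps = [conv(f) for conv, f in zip(self.lateral, feats)]
        for i in range(len(maps) - 1, 0, -1):  # top-down merging
            maps[i - 1] = maps[i - 1] + F.interpolate(
                maps[i], size=maps[i - 1].shape[-2:], mode="bilinear",
                align_corners=False)
        maps = [s(m) for s, m in zip(self.smooth, maps)]
        size = maps[0].shape[-2:]  # bring every map to the finest resolution
        maps = [F.interpolate(m, size=size, mode="bilinear",
                              align_corners=False) for m in maps]
        x = torch.cat(maps, dim=1)
        norm = x.flatten(2).norm(dim=2).clamp_min(1e-12)  # per-channel norm
        x = x / norm[:, :, None, None] * self.gamma[None, :, None, None]
        return self.csam(x)
</preformat>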
</sec>
<sec id="S2.SS2.SSS4">
<title>Segmentation head</title>
<p>The segmentation head is used to output a segment map of the same size as the input RGB image; the map is <italic>N</italic>-channeled, with <italic>N</italic> being the number of classes (<italic>N</italic> = 3 in this study). <xref ref-type="fig" rid="F4">Figure 4</xref> shows the segmentation head, which consists of a 3 &#x00D7; 3 convolution layer, a batch normalization layer, a ReLU activation, a 1 &#x00D7; 1 convolution layer, and an up-sampling operation <italic>via</italic> bilinear interpolation.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Illustration of the segmentation head. Note that <italic>S</italic> is the scale ratio of up-sampling, and <italic>N</italic> is the number of classes.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g004.tif"/>
</fig>
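<p>A sketch of this head is shown below; the intermediate channel width is left as a parameter, and <italic>S</italic> and <italic>N</italic> follow Figure 4:</p>
<preformat>
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    # Segmentation head sketch: 3x3 conv + BN + ReLU, 1x1 conv to N class
    # channels, then bilinear up-sampling by the scale ratio S (Figure 4).
    def __init__(self, in_channels, mid_channels, num_classes=3, scale=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, num_classes, 1))
        self.scale = scale

    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
</preformat>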
</sec>
<sec id="S2.SS2.SSS5">
<title>Network architecture</title>
<p>The overall architecture is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. MobileNetV3 forms the backbone network with PPM attached on the top to capture global contextual information. Feature maps from the last layers of stage 2, stage 3, stage 4, and PPM are refined by CSAM and then used as input to FAM to produce a feature map containing low-level details and high-level semantics. The output of FAM is processed by the segmentation head to make the final semantic segmentation.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Overview of the tree-part segmentation network, where three-dimensional blocks represent feature maps and two-dimensional blocks refer to convolutional modules.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g005.tif"/>
</fig>
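<p>Putting the pieces together, the following sketch chains the module sketches from the previous sections into one forward pass; the channel widths of &#x007B;<italic>C</italic><sub>2</sub>, &#x2026;, <italic>C</italic><sub>5</sub>&#x007D; are assumptions for MobileNetV3-Large rather than the exact configuration:</p>
<preformat>
import torch.nn as nn

class TreePartSegNet(nn.Module):
    # End-to-end sketch reusing MobileNetV3Stages, PyramidPoolingModule,
    # CSAM, FAM, and SegHead from the sketches above. The CSAM reduction
    # ratios follow the text ({1, 1, 2, 4} for MobileNetV3-Large).
    def __init__(self, num_classes=3):
        super().__init__()
        self.backbone = MobileNetV3Stages()
        chans = (24, 40, 112, 160)              # assumed widths of C2..C5
        self.ppm = PyramidPoolingModule(chans[-1], chans[-1])
        ratios = (1, 1, 2, 4)
        self.csams = nn.ModuleList(
            CSAM(c, r) for c, r in zip(chans, ratios))
        g_chans = [c // r for c, r in zip(chans, ratios)]
        self.fam = FAM(g_chans)                 # outputs 48 channels (r=K)
        self.head = SegHead(48, 48, num_classes, scale=4)

    def forward(self, x):
        c2, c3, c4, c5 = self.backbone(x)
        feats = [c2, c3, c4, self.ppm(c5)]      # PPM replaces C5
        feats = [m(f) for m, f in zip(self.csams, feats)]  # {G2, .., G5}
        return self.head(self.fam(feats))       # N-channel segment map
</preformat>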
<p>The tree-part segmentation network is trained in an end-to-end manner to minimize a cross-entropy loss defined on the output of the segmentation head. To stabilize network training, an auxiliary segmentation head is inserted after the output of stage 3, and an auxiliary cross-entropy loss with weight 0.4 is added to the final loss (<xref ref-type="bibr" rid="B33">Zhao et al., 2016</xref>), as shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. This auxiliary segmentation head is only used in the training phase and is removed in the inference phase. Furthermore, an <italic>L</italic><sub>2</sub> regularization term with weight 5e<sup>&#x2013;4</sup> on the parameters of the network, except those of the backbone, is added to the final loss to alleviate over-fitting. Because this study uses a MobileNetV3 pre-trained on ImageNet as the backbone, the <italic>L</italic><sub>2</sub> regularization is not placed on the backbone parameters.</p>
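<p>A hedged sketch of this training objective follows; realizing the selective <italic>L</italic><sub>2</sub> regularization as per-group weight decay in Adam, and the helper names, are implementation assumptions:</p>
<preformat>
import torch
import torch.nn as nn

# Training objective sketch: main cross-entropy plus an auxiliary
# cross-entropy (weight 0.4) on a head attached to stage 3. The L2
# regularization (weight 5e-4) is realized here as per-group weight
# decay applied only to the non-backbone parameters.
model = TreePartSegNet()                 # sketch from Figure 5 above
criterion = nn.CrossEntropyLoss()

def total_loss(main_logits, aux_logits, target):
    return (criterion(main_logits, target)
            + 0.4 * criterion(aux_logits, target))

backbone_ids = {id(p) for p in model.backbone.parameters()}
other_params = [p for p in model.parameters() if id(p) not in backbone_ids]
optimizer = torch.optim.Adam(
    [{"params": model.backbone.parameters(), "weight_decay": 0.0},
     {"params": other_params, "weight_decay": 5e-4}],
    lr=1e-4)
</preformat>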
</sec>
</sec>
<sec id="S2.SS3">
<title>Segmentation evaluation</title>
<p>To evaluate the accuracy of the tree-part segmentation network, three commonly used metrics are adopted: IoU, mIoU, and pixel accuracy (PA). For the sake of explanation, let <italic>N</italic> denote the total number of classes, and <italic>p</italic><sub><italic>ij</italic></sub> denote the number of pixels that belong to class <italic>i</italic> but are predicted to be class <italic>j</italic>. Thus, <italic>p</italic><sub><italic>ii</italic></sub>, <italic>p</italic><sub><italic>ij</italic></sub>, and <italic>p</italic><sub><italic>ji</italic></sub> represent the numbers of true positives, false negatives, and false positives, respectively. IoU is the ratio between the intersection and union of the ground-truth and predicted segmentation, and can be calculated by dividing the true positives by the sum of false positives, false negatives, and true positives. For class <italic>i</italic>, the IoU is computed as follows:</p>
<disp-formula id="S2.E7"><label>(7)</label><mml:math id="M7"><mml:mrow><mml:mrow><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>U</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mfrac><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">+</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>mIoU is an improved IoU which computes the IoU value for each class and then averages them:</p>
<disp-formula id="S2.E8"><label>(8)</label><mml:math id="M8"><mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>I</mml:mi><mml:mi>o</mml:mi><mml:mpadded width="+3.3pt"><mml:mi>U</mml:mi></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>i</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mfrac><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">+</mml:mo><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>PA measures the network's recall ability; it is the ratio between the number of true positives and the total number of pixels:</p>
<disp-formula id="S2.E9"><label>(9)</label><mml:math id="M9"><mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mpadded width="+3.3pt"><mml:mi>A</mml:mi></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>i</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>i</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="1.8pt">=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>To measure the real-time performance of the developed network, three metrics are utilized: the number of floating point operations (FLOPs), FPS, and the number of parameters. Note that FPS is determined by counting how many RGB images can be processed per second in the inference phase.</p>
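<p>Given an <italic>N</italic> &#x00D7; <italic>N</italic> confusion matrix, the three accuracy metrics can be computed directly from Equations 7&#x2013;9, as in the following sketch:</p>
<preformat>
import numpy as np

def segmentation_metrics(conf):
    # conf[i, j] counts pixels of ground-truth class i predicted as class j.
    tp = np.diag(conf).astype(float)                  # true positives
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp  # FP + FN + TP
    iou = tp / union                                  # Eq. 7, per class
    miou = iou.mean()                                 # Eq. 8
    pa = tp.sum() / conf.sum()                        # Eq. 9
    return iou, miou, pa
</preformat>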
</sec>
</sec>
<sec id="S3">
<title>Experimental setup</title>
<sec id="S3.SS1">
<title>Implementation details</title>
<p>The developed network is programmed in PyTorch and runs on a computer with a Windows 10 system, 32 GB RAM, an Intel i9-11900K CPU, and an NVIDIA GeForce RTX 3080 GPU. The backbone is pre-trained on ImageNet, and the other parameters are initialized using the default initialization method in PyTorch. Standard Adam is used to minimize the loss function, and a cosine learning-rate scheduler (<xref ref-type="bibr" rid="B21">Loshchilov and Hutter, 2016</xref>) is used to adjust the learning rate, with the initial learning rate set to 1e<sup>&#x2013;4</sup>. The network is trained on the training set for 150 epochs with a mini-batch size of 12. To avoid over-fitting, the following data augmentation methods are applied during training: horizontal flipping, vertical flipping, random rotation within the range of [&#x2212;45&#x00B0;, 45&#x00B0;], random scaling within the range of [0.8, 1.2], and randomly changing the hue, saturation, and value of the input image.</p>
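<p>An illustrative reproduction of this schedule is sketched below; the augmentation pipeline uses albumentations as an assumed implementation of the transformations described above, and <italic>optimizer</italic> refers to the Adam instance sketched in section 2.2.5:</p>
<preformat>
import torch
import albumentations as A

# Cosine learning-rate schedule over the 150 training epochs (initial
# learning rate 1e-4 is set on the optimizer itself).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

# Hypothetical augmentation pipeline approximating the description above.
train_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Rotate(limit=45, p=0.5),               # rotation in [-45, 45] degrees
    A.RandomScale(scale_limit=0.2, p=0.5),   # scale in [0.8, 1.2]
    A.HueSaturationValue(p=0.5),             # random HSV jitter
])
</preformat>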
</sec>
<sec id="S3.SS2">
<title>Ablation study</title>
<p>This section presents an ablation study to validate the effectiveness of each module in our network. In the following experiments, MobileNetV3-Large is used as the backbone, and the segmentation models are trained on our training set and evaluated on our test set. The ablation study is detailed as follows:</p>
<list list-type="simple">
<list-item><label>(1)</label>
<p>Ablation for the backbone. Placing atrous convolution in the last stage of the backbone can preserve details and has been widely utilized in semantic segmentation (<xref ref-type="bibr" rid="B7">Chen et al., 2018</xref>; <xref ref-type="bibr" rid="B11">Howard et al., 2019</xref>). However, it is unclear whether atrous convolution improves the segmentation accuracy of our network. In addition, it is unclear whether removing the last layer of stage 5 of the backbone improves efficiency and accuracy. Experiments are conducted to answer these questions.</p>
</list-item>
<list-item><label>(2)</label>
<p>Ablation for feature aggregation. High-level features contain semantic information but with limited details, while low-level features contain detailed information but with limited semantics. Fusing these features can improve segmentation accuracy. However, it is unclear which low-level and high-level features should be fused. We re-implement the network with different combinations of the low-level and high-level features, and find the best combination through experiments.</p>
</list-item>
<list-item><label>(3)</label>
<p>Ablation for the auxiliary segmentation head. Auxiliary segmentation heads have been widely used in semantic segmentation (<xref ref-type="bibr" rid="B33">Zhao et al., 2016</xref>; <xref ref-type="bibr" rid="B30">Yu et al., 2021</xref>). We insert the auxiliary segmentation head at different stages of the backbone in the training phase to reveal which position is most beneficial.</p>
</list-item>
</list>
</sec>
<sec id="S3.SS3">
<title>Comparison with existing methods</title>
<p>To evaluate the accuracy and real-time performance of the developed network, a comparison experiment is performed. MobileNetV3-Large and MobileNetV3-Small are used as the backbone of our network. Four state-of-the-art networks are used for comparison: DeepLabV3 (<xref ref-type="bibr" rid="B6">Chen et al., 2017</xref>), DeepLabV3+ (<xref ref-type="bibr" rid="B7">Chen et al., 2018</xref>), LR-ASPP (<xref ref-type="bibr" rid="B11">Howard et al., 2019</xref>), and FANet (<xref ref-type="bibr" rid="B13">Hu P. et al., 2020</xref>). For the sake of comparison, DeepLabV3, DeepLabV3+ and LR-ASPP use MobileNetV3-Large as the backbone, and apply the atrous convolution to the last block of MobileNetV3-Large to generate denser feature maps. FANet uses ResNet18 as the backbone, as suggested by <xref ref-type="bibr" rid="B13">Hu P. et al. (2020)</xref>. All of the comparison networks are implemented in PyTorch and trained according to the strategy described in section 3.1. Our network and the comparison networks are evaluated on the test set, and quantitative results including IoU, mIoU, PA, FPS, and FLOPs are reported and discussed.</p>
</sec>
</sec>
<sec id="S4" sec-type="results|discussion">
<title>Results and discussion</title>
<sec id="S4.SS1">
<title>Ablation study</title>
<p><xref ref-type="table" rid="T1">Table 1</xref> lists the results of different configurations of the backbone. As shown in the table, we observed that (1) when not employing the atrous convolution in the last block of the backbone to extract dense features, the mIoU and PA slightly improved by 0.20 and 0.19%, respectively, while being faster (row 1 vs. row 3), (2) removing the last layer in stage 5 of the backbone did not decrease the IoU and PA while being slightly faster (row 1 vs. row2, row 3 vs. row 4), and (3) when not employing the atrous convolution and removing the last layer in stage 5, the network obtained similar accuracies while being significant faster than its variants (row 4 vs. row 1, 2, and 3). These results indicate that the atrous convolution was not necessary for our task, and the MobileNetV3 backbone contained redundant layers which should be excluded.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Ablations on the backbone and feature aggregation module.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Row</td>
<td valign="top" align="center">AC</td>
<td valign="top" align="center">R</td>
<td valign="top" align="center">NF</td>
<td valign="top" align="center" colspan="3">IoU (%)<hr/></td>
<td valign="top" align="center">mIoU (%)</td>
<td valign="top" align="center">PA (%)</td>
<td valign="top" align="center">FPS</td>
<td valign="top" align="center">#Params</td>
<td valign="top" align="center">FLOPs</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center">Branch</td>
<td valign="top" align="center">Fruit</td>
<td valign="top" align="center">Background</td>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">63.37</td>
<td valign="top" align="center">66.67</td>
<td valign="top" align="center">93.05</td>
<td valign="top" align="center">74.03</td>
<td valign="top" align="center">93.76</td>
<td valign="top" align="center">32.84</td>
<td valign="top" align="center">6.9M</td>
<td valign="top" align="center">3.48B</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">62.51</td>
<td valign="top" align="center">67.05</td>
<td valign="top" align="center">93.18</td>
<td valign="top" align="center">74.25</td>
<td valign="top" align="center">93.87</td>
<td valign="top" align="center">33.85</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">3.08B</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">63.40</td>
<td valign="top" align="center">66.03</td>
<td valign="top" align="center">93.26</td>
<td valign="top" align="center">74.23</td>
<td valign="top" align="center">93.95</td>
<td valign="top" align="center">33.80</td>
<td valign="top" align="center">6.9M</td>
<td valign="top" align="center">2.44B</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">63.33</td>
<td valign="top" align="center">66.25</td>
<td valign="top" align="center">93.12</td>
<td valign="top" align="center">74.23</td>
<td valign="top" align="center">93.84</td>
<td valign="top" align="center">36.00</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">2.36B</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">58.72</td>
<td valign="top" align="center">63.14</td>
<td valign="top" align="center">92.20</td>
<td valign="top" align="center">71.35</td>
<td valign="top" align="center">92.96</td>
<td valign="top" align="center">34.67</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">1.66B</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">x</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">49.74</td>
<td valign="top" align="center">61.16</td>
<td valign="top" align="center">90.72</td>
<td valign="top" align="center">67.21</td>
<td valign="top" align="center">91.49</td>
<td valign="top" align="center">34.36</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">1.46B</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>AC, Apply atrous convolution in the last block of the backbone; R, Remove the last layer in stage 5 of the backbone; NF, Number of feature maps fused in FAM. When NF = 4, &#x007B;<italic>G</italic><sub>2</sub>, <italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; are fused. When NF = 3, &#x007B;<italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; are fused. When NF = 2, &#x007B;<italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; are fused. M and B represent million and billion, respectively.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>Aggregating different levels of features has varying effects on the network performance, as shown in <xref ref-type="table" rid="T1">Table 1</xref>. Fusing &#x007B;<italic>G</italic><sub>2</sub>, <italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; performed better than fusing &#x007B;<italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; and &#x007B;<italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; by 2.88 and 7.02%, respectively, in terms of mIoU, while requiring only slightly more computation. This illustrates that the network performance benefits from fusing as many features as possible. In this study, we therefore fused &#x007B;<italic>G</italic><sub>2</sub>, <italic>G</italic><sub>3</sub>, <italic>G</italic><sub>4</sub>, <italic>G</italic><sub>5</sub>&#x007D; to improve the network accuracy.</p>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows the effect of different positions to place the auxiliary segmentation head. As can be seen, inserting the auxiliary segmentation head into the output of stage 3 outperformed that of stage 2, stage 4 and stage 5 by 0.65, 1.12, and 1.47%, respectively, in terms of mIoU, and slightly underperformed that of stage 4 and stage 5 by 0.17 and 0.08%, respectively, in terms of PA. Therefore, we chose to attach the auxiliary segmentation head to the output of stage 3.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Ablations on the auxiliary segmentation head, which is inserted after the output of different stages in the backbone.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Stage</td>
<td valign="top" align="center" colspan="3">IoU (%)<hr/></td>
<td valign="top" align="center">mIoU (%)</td>
<td valign="top" align="center">PA (%)</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Branch</td>
<td valign="top" align="center">Fruit</td>
<td valign="top" align="center">Background</td>
<td/>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">62.45</td>
<td valign="top" align="center">65.32</td>
<td valign="top" align="center">92.95</td>
<td valign="top" align="center">73.58</td>
<td valign="top" align="center">93.68</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">63.33</td>
<td valign="top" align="center">66.25</td>
<td valign="top" align="center">93.12</td>
<td valign="top" align="center">74.23</td>
<td valign="top" align="center">93.84</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="center">64.04</td>
<td valign="top" align="center">61.96</td>
<td valign="top" align="center">93.33</td>
<td valign="top" align="center">73.11</td>
<td valign="top" align="center">94.02</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">63.07</td>
<td valign="top" align="center">61.98</td>
<td valign="top" align="center">93.23</td>
<td valign="top" align="center">72.76</td>
<td valign="top" align="center">93.92</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="S4.SS2">
<title>Comparison with existing methods</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> lists the accuracy and real-time performance of the proposed network and comparison methods. Overall, our network with MobileNetV3-Large as the backbone outperformed LR-ASPP, DeepLabV3, DeepLabV3+, and FANet in terms of the accuracy metrics, which validated the effectiveness of the proposed modules. Furthermore, our network performed faster than DeepLabV3, DeepLabV3+ and FANet in terms of FLOPs, likely because DeepLabV3 and DeepLabV3+ applied a very time-consuming atrous spatial pyramid pooling module to encode context information, and FANet used a relatively large backbone. Surprisingly, there was little difference in FPS between our network and the comparison networks, probably because the depth-wise convolution in MobileNets and the multi-branch design in ResNet increased the memory access cost, affecting the inference speed (<xref ref-type="bibr" rid="B9">Ding et al., 2021</xref>). Conclusively, the proposed network with MobileNetV3-Large as the backbone was more accurate than the comparison methods while being fast.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Accuracy and real-time performance of the proposed network and comparison methods on test set.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Methods</td>
<td valign="top" align="center">Backbone</td>
<td valign="top" align="center" colspan="3">IoU (%)<hr/></td>
<td valign="top" align="center">mIoU (%)</td>
<td valign="top" align="center">PA (%)</td>
<td valign="top" align="center">FPS</td>
<td valign="top" align="center">#Params</td>
<td valign="top" align="center">FLOPs</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">Branch</td>
<td valign="top" align="center">Fruit</td>
<td valign="top" align="center">Background</td>
<td/>
<td/>
<td/>
<td/>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">MobileNetV3-Large</td>
<td valign="top" align="center">63.33</td>
<td valign="top" align="center">66.25</td>
<td valign="top" align="center">93.12</td>
<td valign="top" align="center">74.23</td>
<td valign="top" align="center">93.84</td>
<td valign="top" align="center">36.00</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">2.36B</td>
</tr>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">MobileNetV3-Small</td>
<td valign="top" align="center">60.62</td>
<td valign="top" align="center">61.05</td>
<td valign="top" align="center">92.82</td>
<td valign="top" align="center">71.50</td>
<td valign="top" align="center">93.52</td>
<td valign="top" align="center">37.91</td>
<td valign="top" align="center">2.7M</td>
<td valign="top" align="center">1.18B</td>
</tr>
<tr>
<td valign="top" align="left">LR-ASPP</td>
<td valign="top" align="center">MobileNetV3-Large</td>
<td valign="top" align="center">60.05</td>
<td valign="top" align="center">58.60</td>
<td valign="top" align="center">92.85</td>
<td valign="top" align="center">70.50</td>
<td valign="top" align="center">93.52</td>
<td valign="top" align="center">36.67</td>
<td valign="top" align="center">5.7M</td>
<td valign="top" align="center">2.37B</td>
</tr>
<tr>
<td valign="top" align="left">DeepLabV3</td>
<td valign="top" align="center">MobileNetV3-Large</td>
<td valign="top" align="center">56.34</td>
<td valign="top" align="center">58.82</td>
<td valign="top" align="center">92.14</td>
<td valign="top" align="center">69.11</td>
<td valign="top" align="center">92.85</td>
<td valign="top" align="center">35.78</td>
<td valign="top" align="center">13.5M</td>
<td valign="top" align="center">11.58B</td>
</tr>
<tr>
<td valign="top" align="left">DeepLabV3+</td>
<td valign="top" align="center">MobileNetV3-Large</td>
<td valign="top" align="center">62.59</td>
<td valign="top" align="center">61.05</td>
<td valign="top" align="center">93.36</td>
<td valign="top" align="center">72.33</td>
<td valign="top" align="center">94.00</td>
<td valign="top" align="center">31.52</td>
<td valign="top" align="center">14.2M</td>
<td valign="top" align="center">35.73B</td>
</tr>
<tr>
<td valign="top" align="left">FANet</td>
<td valign="top" align="center">ResNet18</td>
<td valign="top" align="center">54.71</td>
<td valign="top" align="center">57.57</td>
<td valign="top" align="center">92.25</td>
<td valign="top" align="center">68.17</td>
<td valign="top" align="center">92.97</td>
<td valign="top" align="center">36.65</td>
<td valign="top" align="center">13.8M</td>
<td valign="top" align="center">6.93B</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Additionally, our network with MobileNetV3-Small as the backbone was slightly less accurate than DeepLabV3+ but more accurate than LR-ASPP, DeepLabV3, and FANet. Moreover, it achieved the best real-time performance: with MobileNetV3-Small as the backbone, the proposed network was the fastest of all the networks compared, at the cost of some accuracy.</p>
<p>Our network achieved a large IoU value for the background, probably because the background dominated the images, causing the network to pay more attention to it. This problem can be alleviated by reshaping the loss function, down-weighting the background class and up-weighting the others (<xref ref-type="bibr" rid="B23">Ronneberger et al., 2015</xref>). Besides, the IoU value of the branch class was lower than that of the fruit class. A possible reason is that some branches were very thin, so their fine details were easily lost, making them hard to segment. Although we fused multi-layer features to mitigate this problem, MobileNetV3 was too lightweight to provide sufficiently rich features. Future work will consider adding a detail branch (<xref ref-type="bibr" rid="B30">Yu et al., 2021</xref>) to the backbone to extract detailed information.</p>
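<p>As a minimal illustration of the suggested reweighting, the PyTorch sketch below applies a class-weighted cross-entropy loss; the weight values are hypothetical and would need tuning on the dataset.</p>
<preformat>
import torch
import torch.nn as nn

# Illustrative class weights (assumed values): down-weight the dominant
# background class and up-weight branch and fruit.
class_weights = torch.tensor([2.0, 1.5, 0.5])  # branch, fruit, background
criterion = nn.CrossEntropyLoss(weight=class_weights)

# logits: (N, 3, H, W) network output; labels: (N, H, W) integer class map
logits = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 3, (2, 64, 64))
loss = criterion(logits, labels)
</preformat>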
<p>Some qualitative results are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. Visually, our network was more accurate in tree-part segmentation. Specifically, the developed network captured the details of most thin branches, whereas the comparison networks struggled to segment them, as shown in the yellow boxes in columns 1&#x2013;3 of <xref ref-type="fig" rid="F6">Figure 6</xref>. Besides, our network outperformed the comparison networks in recognizing fruits, as shown in the white boxes in column 4 of <xref ref-type="fig" rid="F6">Figure 6</xref>. These results validate the effectiveness of the developed attention module and feature aggregation module. Although most of the branches were identified, some thin branches remained difficult to identify. In robotic harvesting, thin branches might jam the end effector, causing shear failure. Therefore, future work will focus on improving the segmentation accuracy of thin branches. A relevant video can be found at: <ext-link ext-link-type="uri" xlink:href="https://www.bilibili.com/video/BV1nS4y147wa/?vd_source=d082953b9cfe065d2d003486f259e84f">https://www.bilibili.com/video/BV1nS4y147wa/?vd_source=d082953b9cfe065d2d003486f259e84f</ext-link>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Visual examples illustrating results of our network and comparison networks. <bold>(A)</bold> RGB image. <bold>(B)</bold> Ground truth. <bold>(C)</bold> Ours (MobileNetV3-Large). <bold>(D)</bold> LR-ASPP. <bold>(E)</bold> DeepLabV3. <bold>(F)</bold> DeepLabV3+. <bold>(G)</bold> FANet.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-991487-g006.tif"/>
</fig>
</sec>
</sec>
<sec id="S5" sec-type="conclusion">
<title>Conclusion</title>
<p>This study aimed to develop a tree-part segmentation network that can segment fruits and branches efficiently and accurately so that harvesting robots can avoid obstacles. The experimental results validated that the proposed network accomplishes this objective. The specific conclusions drawn from the study are as follows:</p>
<list list-type="simple">
<list-item><label>(1)</label>
<p>A tree-part dataset was collected. The dataset consists of 891 RGB images captured in the field. Each image was manually annotated in a per-pixel fashion, which took almost 2 months. To the best of our knowledge, this is the first tree-part dataset intended to help harvesting robots avoid obstacles.</p>
</list-item>
<list-item><label>(2)</label>
<p>A tree-part segmentation network was developed, which consists of four components: a lightweight backbone, CASM, FAM, and a segmentation head. Here, CASM was used to enhance informative channels and locations in the feature maps (a simplified sketch of this idea is given after this list), and FAM was designed to fuse multi-layer feature maps to improve the segmentation accuracy. Experiments on the test set show that, when using MobileNetV3-Large as the backbone, the network achieved an IoU of 63.33, 66.25, and 93.12% for the branches, fruits, and background, respectively, at a computational cost of 2.36 billion FLOPs. These results validate that the network could segment tree parts efficiently and quite accurately. However, the IoU value of the branch class was the lowest, probably because the downsampling operations in the backbone lost the fine details of the thin branches, making them difficult to segment.</p>
</list-item>
</list>
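<p>As an illustration of the attention idea summarized in conclusion (2), the following is a simplified channel-and-spatial attention module in PyTorch. It follows the general pattern of a squeeze-and-excitation channel gate followed by a spatial gate and is not the exact CASM design; the kernel size and reduction ratio are assumptions.</p>
<preformat>
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Simplified channel-and-spatial attention in the spirit of CASM:
    re-weight informative channels, then informative locations."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)     # emphasize informative channels
        return x * self.spatial_gate(x)  # emphasize informative locations
</preformat>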
<p>The proposed network could be transferred to segment other fruits by fine-tuning it on new datasets. Future research will add two more classes (soft branch and hard branch) to the current dataset to allow harvesting robots to push soft branches aside and avoid hard ones for better fruit picking. Furthermore, future work will attempt to add a detail path to the backbone to preserve the fine details of the input image and thus improve the accuracy.</p>
</sec>
<sec id="S6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in this study are included in the article/supplementary material, further inquiries can be directed to the corresponding authors.</p>
</sec>
<sec id="S7">
<title>Author contributions</title>
<p>GL: methodology, investigation, and writing&#x2014;original draft. CW: investigation, methodology, and writing&#x2014;review and editing. YX and MW: writing&#x2014;review and editing. ZZ: conceptualization and data curation. LZ: methodology and supervision. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="funding-information">
<title>Funding</title>
<p>This work was supported by the Laboratory of Lingnan Modern Agriculture Project (Grant No. NZ2021038), the National Natural Science Foundation of China (Grant No. 32101632), the Basic and Applied Basic Research Project of Guangzhou Basic Research Plan (Grant No. 202201011310), and the Science and Technology Program of Meizhou, China (Grant No. 2021A0304004).</p>
</sec>
<sec id="S9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="S10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amatya</surname> <given-names>S.</given-names></name> <name><surname>Karkee</surname> <given-names>M.</given-names></name> <name><surname>Gongal</surname> <given-names>A.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Whiting</surname> <given-names>M. D.</given-names></name></person-group> (<year>2016</year>). <article-title>Detection of cherry tree branches with full foliage in planar architecture for automated sweet-cherry harvesting.</article-title> <source><italic>Biosyst. Eng.</italic></source> <volume>146</volume> <fpage>3</fpage>&#x2013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1016/j.biosystemseng.2015.10.003</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barth</surname> <given-names>R.</given-names></name> <name><surname>Hemming</surname> <given-names>J.</given-names></name> <name><surname>Van Henten</surname> <given-names>E. J.</given-names></name></person-group> (<year>2020</year>). <article-title>Optimising realism of synthetic images using cycle generative adversarial networks for improved part segmentation.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>173</volume>:<fpage>105378</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2020.105378</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barth</surname> <given-names>R.</given-names></name> <name><surname>IJsselmuiden</surname> <given-names>J.</given-names></name> <name><surname>Hemming</surname> <given-names>J.</given-names></name> <name><surname>Henten</surname> <given-names>E. J. V.</given-names></name></person-group> (<year>2018</year>). <article-title>Data synthesis methods for semantic segmentation in agriculture: A capsicum annuum dataset.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>144</volume> <fpage>284</fpage>&#x2013;<lpage>296</lpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2017.12.001</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barth</surname> <given-names>R.</given-names></name> <name><surname>IJsselmuiden</surname> <given-names>J.</given-names></name> <name><surname>Hemming</surname> <given-names>J.</given-names></name> <name><surname>Van Henten</surname> <given-names>E. J.</given-names></name></person-group> (<year>2019</year>). <article-title>Synthetic bootstrapping of convolutional neural networks for semantic plant part segmentation.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>161</volume> <fpage>291</fpage>&#x2013;<lpage>304</lpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2017.11.040</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boogaard</surname> <given-names>F. P.</given-names></name> <name><surname>van Henten</surname> <given-names>E. J.</given-names></name> <name><surname>Kootstra</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Boosting plant-part segmentation of cucumber plants by enriching incomplete 3d point clouds with spectral data.</article-title> <source><italic>Biosyst. Eng.</italic></source> <volume>211</volume> <fpage>167</fpage>&#x2013;<lpage>182</lpage>. <pub-id pub-id-type="doi">10.1016/j.biosystemseng.2021.09.004</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L. C.</given-names></name> <name><surname>Papandreou</surname> <given-names>G.</given-names></name> <name><surname>Schroff</surname> <given-names>F.</given-names></name> <name><surname>Adam</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>Rethinking atrous convolution for semantic image segmentation.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1706.05587</fpage>.</citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Papandreou</surname> <given-names>G.</given-names></name> <name><surname>Schroff</surname> <given-names>F.</given-names></name> <name><surname>Adam</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <source><italic>Encoder-decoder with atrous separable convolution for semantic image segmentation</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Ferrari</surname> <given-names>V.</given-names></name> <name><surname>Hebert</surname> <given-names>M.</given-names></name> <name><surname>Sminchisescu</surname> <given-names>C.</given-names></name> <name><surname>Weiss</surname> <given-names>Y.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_49</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Ting</surname> <given-names>D.</given-names></name> <name><surname>Newbury</surname> <given-names>R.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). <article-title>Semantic segmentation for partially occluded apple trees based on deep learning.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>181</volume>:<fpage>105952</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2020.105952</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ma</surname> <given-names>N.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name> <name><surname>Ding</surname> <given-names>G.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). &#x201C;<article-title>Repvgg: Making vgg-style convnets great again</article-title>,&#x201D; in <source><italic>Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</italic></source> (<publisher-loc>Nashville</publisher-loc>). <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01352</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>R.</given-names></name></person-group> (<year>2021</year>). <article-title>Rethink dilated convolution for real-time semantic segmentation.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>2111.09957</fpage>.</citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Sandler</surname> <given-names>M.</given-names></name> <name><surname>Chu</surname> <given-names>G.</given-names></name> <name><surname>Chen</surname> <given-names>L. C.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Tan</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Searching for mobilenetv3.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1905.02244</fpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00140</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>J.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name> <name><surname>Albanie</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>Squeeze-and-excitation networks.</article-title> <source><italic>IEEE Trans. Pattern Anal. Mach. Intell.</italic></source> <volume>42</volume> <fpage>2011</fpage>&#x2013;<lpage>2023</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2019.2913372</pub-id> <pub-id pub-id-type="pmid">31034408</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>P.</given-names></name> <name><surname>Perazzi</surname> <given-names>F.</given-names></name> <name><surname>Heilbron</surname> <given-names>F. C.</given-names></name> <name><surname>Wang</surname> <given-names>O.</given-names></name> <name><surname>Lin</surname> <given-names>Z.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name><etal/></person-group> (<year>2020</year>). <article-title>Real-time semantic segmentation with fast attention.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>2007.03815</fpage>. <pub-id pub-id-type="doi">10.1109/LRA.2020.3039744</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>W.</given-names></name> <name><surname>Qian</surname> <given-names>Z.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Tao</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>D.</given-names></name> <name><surname>Ding</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>Apple tree branch segmentation from images with small gray-level difference for agricultural harvesting robot.</article-title> <source><italic>Optik</italic></source> <volume>127</volume> <fpage>11173</fpage>&#x2013;<lpage>11182</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijleo.2016.09.044</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>X.</given-names></name> <name><surname>Lin</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Detection of fruit-bearing branches and localization of litchi clusters for vision-based harvesting robots.</article-title> <source><italic>IEEE Access</italic></source> <volume>8</volume> <fpage>117746</fpage>&#x2013;<lpage>117758</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3005386</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>G.</given-names></name> <name><surname>Tang</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>X.</given-names></name> <name><surname>Xiong</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Guava detection and pose estimation using a low-cost rgb-d sensor in the field.</article-title> <source><italic>Sensors</italic></source> <volume>19</volume>:<fpage>428</fpage>. <pub-id pub-id-type="doi">10.3390/s19020428</pub-id> <pub-id pub-id-type="pmid">30669645</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>G.</given-names></name> <name><surname>Zhu</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Zou</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>Y.</given-names></name></person-group> (<year>2021a</year>). <article-title>Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>188</volume>:<fpage>106350</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2021.106350</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>G.</given-names></name> <name><surname>Tang</surname> <given-names>Y.</given-names></name> <name><surname>Zou</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name></person-group> (<year>2021b</year>). <article-title>Three-dimensional reconstruction of guava fruits and branches using instance segmentation and geometry analysis.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>184</volume>:<fpage>106107</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2021.106107</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.</given-names></name> <name><surname>Doll&#x00E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Hariharan</surname> <given-names>B.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>Feature pyramid networks for object detection.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1612.03144</fpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.106</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Rabinovich</surname> <given-names>A.</given-names></name> <name><surname>Berg</surname> <given-names>A. C.</given-names></name></person-group> (<year>2015</year>). <article-title>Parsenet: Looking wider to see better.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1506.04579</fpage>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>SGDR: Stochastic gradient descent with restarts.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1608.03983</fpage>.</citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Majeed</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Fu</surname> <given-names>L.</given-names></name> <name><surname>Karkee</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name><etal/></person-group> (<year>2020</year>). <article-title>Deep learning based segmentation for automated training of apple trees on trellis wires.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>170</volume>:<fpage>105277</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2020.105277</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ronneberger</surname> <given-names>O.</given-names></name> <name><surname>Fischer</surname> <given-names>P.</given-names></name> <name><surname>Brox</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>U-net: Convolutional networks for biomedical image segmentation.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1505.04597</fpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-24574-4_28</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roy</surname> <given-names>A. G.</given-names></name> <name><surname>Navab</surname> <given-names>N.</given-names></name> <name><surname>Wachinger</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>Concurrent spatial and channel squeeze &#x0026; excitation in fully convolutional networks.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1803.02579</fpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-00928-1_48</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>Imagenet large scale visual recognition challenge.</article-title> <source><italic>Int. J. Comput. Vision</italic></source> <volume>115</volume> <fpage>211</fpage>&#x2013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>B. C.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Murphy</surname> <given-names>K. P.</given-names></name> <name><surname>Freeman</surname> <given-names>W. T.</given-names></name></person-group> (<year>2008</year>). <article-title>Labelme: A database and web-based tool for image annotation.</article-title> <source><italic>Int. J. Comput. Vision</italic></source> <volume>77</volume> <fpage>157</fpage>&#x2013;<lpage>173</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-007-0090-8</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sandler</surname> <given-names>M.</given-names></name> <name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Zhmoginov</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1801.04381</fpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00474</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wan</surname> <given-names>H.</given-names></name> <name><surname>Fan</surname> <given-names>Z.</given-names></name> <name><surname>Yu</surname> <given-names>X.</given-names></name> <name><surname>Kang</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>P.</given-names></name> <name><surname>Zeng</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <article-title>A real-time branch detection and reconstruction mechanism for harvesting robot via convolutional neural network and image segmentation.</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>192</volume>:<fpage>106609</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2021.106609</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Woo</surname> <given-names>S. A. P. J.</given-names></name></person-group> (<year>2018</year>). <source><italic>CBAM: Convolutional block attention module.</italic></source> <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer.</publisher-name> <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>C.</given-names></name> <name><surname>Gao</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>G.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Sang</surname> <given-names>N.</given-names></name></person-group> (<year>2021</year>). <article-title>Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation.</article-title> <source><italic>Int. J. Comput. Vision</italic></source> <volume>129</volume> <fpage>3051</fpage>&#x2013;<lpage>3068</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-021-01515-2</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>L.</given-names></name> <name><surname>Karkee</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Gao</surname> <given-names>Z.</given-names></name></person-group> (<year>2018</year>). <article-title>Branch detection for apple trees trained in fruiting wall architecture using depth features and regions-convolutional neural network (r-cnn).</article-title> <source><italic>Comput. Electron. Agric.</italic></source> <volume>155</volume> <fpage>386</fpage>&#x2013;<lpage>393</lpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2018.10.029</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Karkee</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Whiting</surname> <given-names>M. D.</given-names></name></person-group> (<year>2021</year>). <article-title>Computer vision-based tree trunk and branch identification and shaking points detection in dense-foliage canopy for automated harvesting of apples.</article-title> <source><italic>J. Field Robot.</italic></source> <volume>38</volume> <fpage>476</fpage>&#x2013;<lpage>493</lpage>. <pub-id pub-id-type="doi">10.1002/rob.21998</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Shi</surname> <given-names>J.</given-names></name> <name><surname>Qi</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Jia</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Pyramid scene parsing network.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. <volume>arXiv</volume>:<fpage>1612.01105</fpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.660</pub-id></citation></ref>
</ref-list>
</back>
</article>
