Single Shot MultiBox Detector for Urban Plantation Single Tree Detection and Location With High-Resolution Remote Sensing Imagery

Using high-resolution remote sensing images to automatically identify individual trees is of great significance to forestry ecological environment monitoring. Urban plantation has realistic demands for single tree management such as catkin pollution, maintenance of famous trees, landscape construction, and park management. At present, there are problems of missed detection and error detection in dense plantations and complex background plantations. This paper proposes a single tree detection method based on single shot multibox detector (SSD). Optimal SSD is obtained by adjusting feature layers, optimizing the aspect ratio of a preset box, reducing parameters and so on. The optimal SSD is applied to single tree detection and location in campuses, orchards, and economic plantations. The average accuracy based on SSD is 96.0, 92.9, and 97.6% in campus green trees, lychee plantations, and palm plantations, respectively. It is 11.3 and 37.5% higher than the latest template matching method and chan-vese (CV) model method, and is 43.1 and 54.2% higher than the traditional watershed method and local maximum method. Experimental results show that SSD has a strong potential and application advantage. This research has reference significance for the application of an object detection framework based on deep learning in agriculture and forestry.


INTRODUCTION
Single tree detection based on remote sensing images is a crucial technology for establishing a single tree database and monitoring single tree plantation resources, which is of great significance to urban landscape planning and ecological environment monitoring (Congalton et al., 2014;Faridatul and Wu, 2018;Ahl et al., 2019). Single tree detection is a cross-research field of computer vision, measurement, single tree management, and remote sensing (Kupidura et al., 2019;Zhang et al., 2020;Belcore et al., 2021). Researchers began to explore single tree detection methods a long time ago. As early as 1995, Gougeon et al. (Gougeon, 1995) used aerial photos to carry out single tree identification; they searched for the local minimum value at the bottom of a tree for the first time. Larsen et al. (Larsen and Rudemo, 1998) used an improved template matching method to detect crown vertices of a single tree. Poullot et al. (Pollock, 1996) used remote sensing imagery to determine the location of a single tree by selecting a moving window from 15 × 15 to 30 × 30.
Depending on the size of the canopy in the image, they determined the location of a single tree and used the local ray method to depict the young conifer forest crown. Wang et al. (Wang et al., 2004) used the watershed method to depict the canopy boundaries of white cloud fir forests. Zhang Ning et al. (Zhang et al., 2014) improved the application of the peak climbing method to the problem of canopy extraction and experimented in Quickbird images. The accuracy of experimental samples reached more than 85%. Jiang Renrong et al. (Jiang et al., 2016) used hydro analyzation and regional growth fusion methods for lychee single tree detection and canopy depiction. The overall accuracy was 78.69%. Yu et al. (Yu et al., 2018) applied the iterative threshold method to canopy extraction. The matching rate of the iterative threshold method was only 60.15%, due to complicated and discrepant texture and over-splitting phenomenon in a single tree canopy.
In recent years, researchers applied convolutional neural network (CNN) to single tree detection, solving the problems of traditional single tree detection methods (Liu et al., 2017;Zhao et al., 2020;Zhang et al., 2021). For example, traditional single tree detection methods detected seed points or matching templates by pixels, so traditional single tree detection methods required prior knowledge to specify characteristic parameters of different scenes. The traditional single tree detection method had low stability. The above problems could be solved by introducing CNN (Sharma et al., 2016;Chen et al., 2017). CNN could learn features automatically and abstract local low-level features into high-level features such as the color and the contour of trees gradually without specified parameters in advance (Chu et al., 2017;Mokroš et al., 2018). CNN showed good advantages in single tree detection (Roska and Chua, 2008;Wang et al., 2020). Weijia et al. (Weijia et al., 2016) applied the deep learning approach to detect densely planted Malaysian oil palm trees for the first time. Guirado et al. (Guirado et al., 2017) proposed a CNN-based shrub detection method. Dong Tianyang et al. (Dong et al., 2018) proposed a cascaded convolutional neural network of single tree detection in 2018. They applied it to Google Earth images in 2019 (Dong et al., 2019), and found that it was hard to repeat, had small object leakage, was inefficient, and that it was challenging to meet practical requirements.
At present, there is no single tree detection method that can adapt to various stands (Liu et al., 2019). It is a significant research direction to use existing technology to improve the accuracy of single tree detection and simplify the single tree detection process (Deng et al., 2010). Currently, object detection based on deep learning is divided into one-stage object detection and two-stage object detection. One-stage object detection is also known as endto-end, which only takes one step to obtain results. Single shot multibox detector (SSD) is one of the most widely used in onestage object detection. SSD uses multi-size convolutional layers to predict, adding data enhancement of small objects, which has the advantages of high accuracy and high efficiency .
In this study, we have improved the SSD backbone feature extraction network and optimized the aspect ratio of a preset box in single tree detection. The SSD model of single tree detection has been reduced in terms of parameters and computation burden. The optimal SSD model of single tree detection was obtained by comparing the experimental results of multiple groups. The optimal SSD model was used to detect quantity identification and location management of urban plantations. The optimal SSD model achieved better accuracy than the traditional watershed method, traditional local maximum method, and the latest template matching method, the CV model method.

METHODOLOGIES
This section focuses on the principles, training method, and other details of SSD. The principle of SSD is shown in Figure 1. During training, we only need to enter the original image and file that marks the original image's actual box. In multiple feature maps (e.g., 9 × 9 ( Figure 1B) and 5 × 5 ( Figure 1C)), SSD uses convolution kernel to calculate the category confidence of the detected box and the offset between the actual box and detected box. During training, we match these preset boxes with the actual box first. For example, if three preset boxes match three trees, they are regarded as positive samples. The remaining preset boxes are treated as negative samples.

Single Tree Detection Process
The overall process of single tree detection based on SSD is shown in Figure 2. The process of single tree detection based on SSD is divided into five parts, including: 1) Collect high resolution remote sensing image data. 2) Separate training data from validation data and label training data.
3) The SSD model is trained to obtain single tree characteristic parameters. 4) Single tree detection is carried out by single tree characteristic parameters.

5) Evaluation of test results.
A small number of factors can affect the quality of the SSD model, such as data richness, feature extraction accuracy, and robustness.

Simplified SSD Object Detection Framework
In this study, SSD is an object detection framework with 300 × 300×3 as the input image. SSD mainly divides into three parts: a central feature extraction network, feature layer processing network, and stacking adjustment parameters. The SSD model is simplified and shown in Figure 3. The backbone feature extraction network uses a visual geometry group network (VGG16) with the partial convolution layer removed. The optimal SSD model four extracts four feature layers for object detection, and the sizes of these four feature layers are 38 × 38×256, 19 × 19×512, 10 × 10×512, and 5 × 5×256, respectively. The 38 × 38×256 feature layer can be understood as dividing 300 × 300 image evenly into 38 × 38 parts. The center of each part generates preset boxes of different sizes and proportions for an Frontiers in Environmental Science | www.frontiersin.org November 2021 | Volume 9 | Article 755587 anchor point. The preset box is introduced in Section 2.4. The size of the feature map decreases gradually. The large feature map predicts small objects, while the small feature map predicts large objects.
For feature layer processing, the extracted four feature layers are convolved twice, one convolution to extract category confidence, another convolution to extract position adjustment parameters. Four parameters are needed to control the position of each preset box, including offset abscissa of point, ordinate of point, height, and width.

Non-Maximum Suppression
The main idea of non-maximum suppression is to search for local maximum values and suppress non-maximum values. As can be known from Section 2.4, SSD produces many preset boxes, overlapping between each preset box, and each preset box has a category confidence score. By introducing a non-maximum suppression method, we find the best location for a single tree by removing excess preset boxes and only retaining the optimal preset boxes. The process of non-maximum suppression is as follows:  Frontiers in Environmental Science | www.frontiersin.org November 2021 | Volume 9 | Article 755587 1) Sort each preset box by category confidence.
2) Select the preset box with the highest confidence in the category as the output box and remove it from the list of preset boxes. 3) Calculate the area of all preset boxes. 4) Calculate the intersection over union (IOU) value of the output box and other preset box. As shown in Figure 4, the intersection area of two boxes divides by their union area. 5) Remove a preset box with an IOU greater than threshold from the list of preset boxes. 6) Repeat steps (1) to (5) until the list of preset boxes is empty. Figure 5, the non-maximum suppression method is used to select what is most likely a tree canopy among detected objects. A single left tree outputs three preset boxes, and a single right tree outputs two preset boxes. The score sequence of preset boxes is {0.92, 0.86, 0.81, 0.49, 0.42}, and the highest score is 0.92. The preset box of 0.92 is taken as a detected single tree prediction box, and then the IOU value of the remaining preset boxes and prediction box is calculated. If the IOU of the preset boxes of 0.81 and 0.49 and the prediction box of 0.92 exceed threshold, the two preset boxes of 0.81 and 0.49 will be deleted from the list. The remaining two IOU areas of 0.86 and 0.42 are less than the set threshold, and they are rearranged as {0.86 and 0.42} according to score. The highest score is 0.82. The preset box of 0.82 is used as the detected single tree prediction box. The final test results are obtained by excluding preset box 0.42 through IOU.

Preset Box
The scale of the preset box follows a linear increment rule, increasing linearly as the size of the feature map decreases: In Eq. 1, m refers to the number of feature layers. Four feature layers are extracted, but m 3, because the first feature layer (Conv4) is set separately. S k represents the ratio of the preset box size to the image, and S min and S max represent minimum and maximum values of ratio. In this study, S min is set to 0.2 and S max to 0.9. For the first feature layer, the minimum ratio of the preset box to the original picture is S min 2 0.1, the size of preset box is 300 × 0.1 30. According to Eq. 1 calculation, the S k of each feature layer is S 1 0.2, S 2 0.375, and S 3 0.55. The scale of each feature layer preset box is 30, 60, 112.5, and 165.
The shape of the crown of a single tree is mainly round and oval, so the default width to height ratio of the frame that comes to mind at first is 1: 1 2 or 1:1. However, when cutting the picture for detection, the width to height ratio of the half tree frame is more than 1 2 . In this study, we select a r ∈ {1, 2, 3, 1 2 , 1 3 }, S k refers to the actual scale of preset box, and width (w a k S k a r √ ) and height (h a k S k a r √ ) of the present box are calculated. Most tree crowns are more circular. And each feature map will have an S k preset box of a r 1 and an S ' k scale. Besides, there is a preset box of S ' k S k S k+1 √ and a r 1, so that each feature map has two width to height ratios of preset boxes with an aspect ratio of 1: 1, they are 1 and 1 ' . The last feature layer needs S m+1 300 × 71 100 213 to  Frontiers in Environmental Science | www.frontiersin.org November 2021 | Volume 9 | Article 755587 calculate S ' m . Therefore, there are six preset boxes (a r ∈ {1, 2, 3, 1 2 , 1 3 , 1 ' }) for each feature map and each anchor point. The coordinate of anchor points can be obtained by point and |f k | are the size of the k-th feature map.
As shown from above, four feature layers are extracted, respectively 38 × 38×256, 19 × 19×512, 10 × 10×512, and 5 × 5×256. The number of preset boxes at each feature map is 4, 6, 6, and 6, so there are 8,692 preset boxes in each original image. The single tree position in the original image is retrieved using the thick preset box.

Loss Function
The loss function is defined on a single sample, and it is the error of a sample.
where L is the loss function of SSD. The loss function of SSD is divided into two parts: location loss and class confidence loss. L loc is location loss, and L conf is class confidence loss. The confidence loss is SoftMax loss over multiple classes confidences, c stands for confidence loss. N is the number of matched preset boxes. If N 0, loss is set to 0. The positioning loss is smoothing loss (Girshick; R.2015) between the preset box and actual box. l is the predicted box, g is the actual box. α is used to adjust the ratio between class confidence loss and location loss. By default, α 1.
where smooth L1 is smooth loss, Pos is positive samples, Neg is negative samples, (cx, cy) is the regressed offsets for the center of the preset box. w is the width of the preset box, h is the height of the preset box, d is the preset box, andĝ is the actual box that has been offset. p refers to category, p 0 represents the background. x p ij {1, 0} is an indicator for matching the i-th preset box to thet j-th actual box of p.

Increased Accuracy of Small Object
The main process of data enhancement is as follows: 1) Use entire original picture. 2) Take a small piece from the original image, and minimum overlap between this small piece and actual box is 0.1, 0.3, 0.5, 0.7, or 0.9. 3) Take a piece randomly from the original picture.
The ratio of sub-block size to original image size is between 0.1 and 1, and the width to height ratio is between 0.5 and 2. If the center of the actual single tree frame is within the intercepted subblock, overlap is retained as the actual box of the sub-block. Scale the size of each subblock to 300 × 300, and each sub-block has a probability of 0.5 to flip horizontally before training.
Distorting images is a way to enhance data, including randomly changing contrast, brightness, saturation, and tone of image and randomly disrupting the three RGB channels. After data enhancement, training images have richer features, which can enhance the robustness of the model. In this study, we prove that this strategy can effectively improve the detection accuracy of the SSD model.    Frontiers in Environmental Science | www.frontiersin.org November 2021 | Volume 9 | Article 755587 6 98°20′53.22″ E, 8°27′18.45″ N in Phang Nga province, Thailand. High-resolution remote sensing images are used in our experiments, with a spatial resolution of 0.27m, a scale of 800:1, and a visual field height of 1 km.

Training Dataset and Sample Dataset
The training data were collected from Google Earth images, and the training data of each experimental group are shown in Figure 6. Training data of palm trees were collected from around palm trees. Campus green tree training data were collected from universities in Beijing, China. Training data for lychee plantation were collected from around the lychee plantation.
Experimental results of the campus sample plot are shown in Figure 7A. The position data of a single reference tree were obtained from field measurements.
The sample plot representing the orchard is shown in Figure 8A. The position data of the single reference tree are obtained by visual annotation. The palm plantation area representing economic plantation is divided into three groups according to the characteristics of the palm plantation area. The location data of the single reference palm tree were obtained by visual annotation. Open economic plantation refers to the plantation area with a canopy density between 0.4 and 0.6.
The palm sample plot representing open economic plantation is shown in Figure 9A. Dense plantation refers to a plantation where the canopy density of palm plantation is between 0.7 and 1. Sample plots of palm trees representing dense economic plantations are shown in Figure 9C. Background detection in plantation areas also has a great impact on single tree detection. Especially, background color is like tree crown color, resulting in the background being wrongly identified as a single tree. In addition, complex ground features and shadow generated by sunlight in the background make the shape of a single tree abnormal, which greatly increases the difficulty of single tree detection. In this study, a plantation area with a complex background is experimented as a type alone. The sample plot representing a complex background economic plantation is shown in Figure 9E. Table 1 lists the main parameters used in the experiment. When a complete dataset passes through a neural network once and returns once, the process is called an "Epoch". When data cannot be passed through a neural network at one time, the dataset needs to be divided into several batch-sizes. Each "Batchsize" is equivalent to a new dataset. The "Score" is a confidence score. "Weight file size (MB)" is the size of the model.

Optimal SSD Model
The open economic plantation is experimentally studied. Palm trees are detected through the SSD object detection framework. Experimental results are shown in Table 2. The "trunk feature extractor" refers to the main network structure used in feature extraction. After the main network structure reframes, many feature layers are obtained. And some feature maps are selected in several feature maps to build preset boxes. The base scale of the preset box is related to the feature map. The number of parameters refers to the size of the model. As shown in Table 2, SSD has been improved in many areas for single tree detection, such as omitting some VGG16 feature  layers, network depth, and unwanted feature layers. The extracted feature layers by model 4 retains only K0-K3, which is approximately 13.6 and 5.8 MB less than model 2 and model 3 parameters, but accuracy does not decrease. In the process, it is found that there are more false positives in model 2 and model 3, and the deletion of feature layer can reduce false positives. Almost, no single tree has a crown height and width of 3 and 1 3 in the regular top view. SSD model 5 removes a r ∈ {3, 1 3 }, which makes detection speed increase to 0.43 s, omission rate increase, and accuracy decrease, because sample images are clipped during the experiment. The SSD model can recognize a separated half tree as a tree. After removing a r ∈ {3, 1 3 }, an incomplete canopy cannot be detected, resulting in some single trees being missed.
Precision-recall is one of the most useful weapons to detect the efficiency of the object detection model. As shown in Figure 10, model 5 performs the worst, AP 89.36%; model 4 has the best performance, AP 94.37%, in the precision-recall curve of all models. In model 4, the SSD object detection framework only extracts K0∼K3 feature layers and sets the aspect ratio to a r ∈ {1, 2, 3, 1 2 , 1 3 , 1 ' }. The accuracy of model 4 is 97%. Under the condition of reducing parameters and time, the accuracy of SSD is improved to the greatest extent.

Evaluation Criteria of Detection Accuracy
For a variety of single tree detection methods, the evaluation of their detection excellence depends on evaluation standard. At present, there is no unified evaluation standard. The spatial position difference between a ground reference single tree and a detected single tree can be considered as correct detection within a specific range. The geometric center of the actual box is the position of a single tree. The point coordinate of single tree detection and a single reference tree are denoted as M i and E j . There are three possibilities for the results of single tree detection: correct detection, error detection, and omission. A set threshold ε > 0, d (M i , M i ) is denoted as the distance between the two points M i and E j . M i is traverse: 1) When d (M i , E j ) < ε , it is considered that the detection of a single tree matches the single reference tree, and it is a correct detection. 2) If there is d (M i , E j ) > ε for any M i , there is no reference single tree matching with a detected single tree. The single tree detected is considered as a false detection.
3) If E j neither conforms to case (1) or case (2), E j is omission.
Based on the above conditions, N r is the number of reference single trees, N a is the number of detected single trees, and N match is the correct number of detected single trees in detected single trees. The calculation formula of all values is shown in Table 3, N leave is the number of undetected reference single trees, and is also the difference value between N r and N match , N error is the difference value between N match and N a . The recall rate is represented by symbol N mat , N om is commission rate, N com is omission rate, and M is accuracy.

SINGLE TREE DETECTION RESULTS
The optimal SSD model is applied to the single tree detection of an urban plantation, and experimental results are compared with the latest single tree detection method. Optimal SSD model 4 is called the SSD model in the following.

Campus
The available view of single tree detection around the campus is shown in Figure 7. The statistical analysis of experimental results is shown in Table 4. The accuracy of the five methods differs significantly. From the experimental results, the traditional local maximum method and watershed method are generally effective in single tree detection. The watershed method has the worst result. The accuracy of the watershed method is 32.9%. The latest template matching method and CV model method have a good effect. The accuracy of the SSD model is 96% and has the highest accuracy in the experimental results of five methods. Specifically, the SSD model gains high scores in recall rate and omission rate. The commission rate of the template matching method is zero. The omission rate of template matching is 29.1%.

Orchard
The visual results of the experiment in the lychee plantation are shown in Figure 8. The accuracy of the five methods is significantly different, as shown in Table 4. SSD has the highest accuracy of five methods. The accuracy of SSD is 92.9%. The lowest accuracy is 31.8%. The accuracy of the local maximum method is the lowest. The accuracy of SSD is 61.1% more than the accuracy of the local maximum method. The accuracy of watershed is 52.9%. Obviously, SSD's single tree detection effect is far better than the traditional single tree

Economic Plantation
Palm trees are not only a treasure, but are also one of the world's most important sources of oil (De Aguiar et al., 2020). Thailand is one of the main planting bases of palm trees, while China's largest area of palm trees is mainly distributed in Red River county, Yunnan province, China. Palm plantations represent economic plantations. The single palm tree detection experimental effect is shown in Figure 9, a detected box is red and a detected box outside the yellow line is cleared. Palm trees that were detected in error or  missed are flagged. A white marker denotes a mistakenly detected single tree, and a blue marker means the missed detected of a single tree.

Open Economic Plantation
On behalf of the open economic plantation, the open palm plantation is sample plot 1. The test results of the open palm plantation are shown in Figure 9B and Table 5. The SSD model has the highest accuracy, the highest recall rate, and the lowest omission in the experimental results of the five methods. The accuracy of SSD is 96.4%. The recall rate of SSD is 98.8%. The omission rate of SSD is 1.2%. Two palm trees are missed among 164 reference palm trees in SSD-detected results. Mistakenly, the template matching method identifies two background objects as palm trees. The lowest commission rate reaches 1.3%. the SSD model has the second lowest commission rate. The commission rate of the SSD model is 2.4%. The accuracy of the local maximum is 53.9%. Watershed has the lowest recall rate. The recall rate of watershed is 65.9%. Watershed has the highest omission rate. The highest omission rate of watershed is 34.1%. The commission rate of the local maximum is 38.4%.

Complex Background Economic Plantation
In this study, sample plot 2 is a complex background palm plantation, and the experimental results are shown in Figure 9D and Table 5. The SSD model has the highest accuracy in the experimental results of the five methods. The accuracy of SSD is 97.0%. The SSD model has the highest recall in the experimental results of the five methods. The recall of SSD is 98.2%. The SSD model has the lowest omission rate in the experimental results of the five methods. The lowest omission rate of SSD is 1.8%. Four palm trees are missed among 224 reference palm trees in SSDdetected results. Mistakenly, the SSD model identifies three background objects as palm trees. The lowest commission rate reaches 1.3%. The SSD model has the lowest commission rate. The accuracy of the local maximum is 31.4% and is the lowest accuracy. The recall rate of the local maximum is 41.5% and is the lowest recall rate. The omission rate of the local maximum is 58.5% and is the highest omission rate. The CV model has the highest commission rate. The commission rate of the CV model is 45.0%. Mistakenly, the CV model identifies 91 background objects as palm trees.

Dense Economic Plantation
Dense palm plantations have many problems, such as the fact that dense tree crowns completely block sunlight, obscure single tree crowns, and have high canopy density, which have increased the difficulty of gaining artificial statistics for single trees. Therefore, it is difficult to improve the accuracy of single tree detection and location in dense palm plantations when using the traditional method. In this study, sample plot 3 is a dense palm plantation, and experimental results are shown in Figure 9F and Table 5.
The SSD model has the highest accuracy in the experimental results of the five methods. The accuracy of SSD is 98.7%. The SSD model has the highest recall rate in the experimental results of the five methods. The recall rate of SSD is 99.3%. The omission rate of watershed is zero. The omission rate of the SSD model and template matching method is tied for second place with one missed detection among 148 reference palm trees. The omission rate of the SSD model and template matching method is 0.7%. The SSD model has the lowest commission rate in the experimental results of the five methods. Mistakenly, the SSD model identifies one background object as a palm tree. The commission rate of the SSD model is 0.7%. The accuracy of the local maximum is the lowest in the experimental results of the five methods. The accuracy of the local maximum is 66.7%. The lowest recall rate is the local maximum in the experimental results of the five methods. The recall rate of the local maximum is 83.8%. The omission rate of the local maximum is the highest in the experimental results of the five methods. As can be seen from Table 5, 24 reference trees are missed. The omission rate of the local maximum is 16.2%. Watershed has the highest commission rate in the experimental results of the five methods. The commission rate of watershed is 31.7%. The watershed mistakes 47 background objects as reference palm trees.

DISCUSSION
At present, it takes a lot of manpower and material resources to identify and locate tree species over a large area and in a scattered plantation depending on collecting information or identifying pictures with the naked eye. The manager of an artificial plantation divides it into four stages: young plantation, middle plantation, mature plantation, and overmature plantation. Each stage has different characteristics of individual trees, and machine recognition is more stable and produces fewer errors than human eye recognition. Researchers have developed a variety of methods to extract individual tree information from high-resolution remote sensing images instead of the human eye (Liu et al., 2016;Iqbal et al., 2021). However, the existing single tree detection methods still have shortcomings (Gebreslasie et al., 2011;Millikan et al., 2019;De Aguiar et al., 2020;Dersch et al., 2020).
In this study, the comparison of five experiment groups proves that single tree detection based on SSD has a better effect.
According to the problems existing in single tree detection, the experimental research is carried out one by one: 1) Experiments with different canopy densities have been completed, this article makes a comparison between an open plantation and dense plantation. 2) The detection effect of the SSD model has been verified in a plantation area with a complex background.
3) The application effect of the SSD model has been verified in urban single tree detection, including a single tree on a campus, a single tree in an orchard, and a single tree in an economic plantation.
The summary of the five experiment groups shows that the lowest accuracy is the local maximum method, with an average detection accuracy of 43.52%. Local maximum extracts the maximum value of an area. If there is no single tree in the local area, the local maximum value will be wrongly judged as a single tree. If there are multiple single trees in the local area, the local maximum can only identify a single tree with the largest value. Due to the above reasons, local maximum is mediocre in single tree detection. The average detection accuracy of the watershed method is 54.8%. If a tree has too many branches, watershed can easily identify it as two trees. This leads to the high commission rate of watershed. If the distance between two trees is very close, watershed will identify it as a tree, which leads to a high omission rate of detection. The average detection accuracy of the latest template matching method is 86.04%. Template matching is not as detailed as the SSD model in extracting single tree crown features. The average detection accuracy of the CV model method is 57.84%. The CV model combines the advantages of local maximum and watershed. However, the CV model also has the problems of oversegmentation and under-segmentation. The SSD model has the highest average accuracy in the experimental results of the five methods with an average accuracy of 96.32%, because the SSD model can capture single tree canopy features from high-resolution remote sensing images well. The average recall rate of the SSD model is 97.94% which is the highest average recall rate. When category confidence exceeds 0.5, the SSD model identifies an object as a single tree. The average commission rate of the SSD model is the lowest average commission rate, which is 1.7%. The SSD model has the lowest average omission rate in the experimental results of the five methods. The average omission rate of SSD is 2.06%. Therefore, the SSD model has the best single tree detection performance. Weijia Li et al. (Weijia et al., 2016) applied a convolutional neural network method based on deep learning to the detection of general palm trees, with an average recall rate of 96%. Dong Tianyang et al. (Dong et al., 2018) proposed a single tree detection method based on a progressive cascade convolutional neural network, which is applied to an open plantation and green trees with an average recall rate of 90%. In this study, the SSD model is applied to urban single tree detection, the average recall rate of SSD is 97.94%. So, the SSD model is more comprehensive and detailed in extracting single tree features. The SSD model is better than the common convolutional neural network in single tree detection.
Automatic identification and location of single trees based on high-resolution remote sensing images is of great significance for ecological park planning, plantation management, and large-scale ecological environment quality monitoring. In this study, the SSD model is applied to single tree detection of high-resolution remote sensing images. The study addresses a number of problems with previous approaches, such as too large trunk branches of a single tree, serious adhesion between crowns, and missing and false detection problems in complex backgrounds. The SSD model is applied to single tree detection in urban plantation. Accurately, the SSD model can capture the canopy features of single trees in high-resolution remote sensing images. The SSD model not only segments single tree crowns, but also recognizes single tree crowns in complex backgrounds. In the process of single tree detection, the SSD model has stronger anti-interference ability and is almost unaffected by the branches of a single tree. The SSD model has excellent performance in all aspects and shows good application potential. This study can be used as a reference for other agricultural and forestry products. It is also hoped that other techniques can be used to describe the crown contour of a single tree.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.