Accurate and fast detection of tomatoes based on improved YOLOv5s in natural environments

Uneven illumination, obstruction of leaves or branches, and the overlapping of fruit significantly affect the accuracy of tomato detection by automated harvesting robots in natural environments. In this study, a proficient and accurate algorithm for tomato detection, called SBCS-YOLOv5s, is proposed to address this practical challenge. SBCS-YOLOv5s integrates the SE, BiFPN, CARAFE and Soft-NMS modules into YOLOv5s to enhance the feature expression ability of the model. First, the SE attention module and the C3 module were combined to form the C3SE module, replacing the original C3 module within the YOLOv5s backbone architecture. The SE attention module relies on modeling channel-wise relationships and adaptive re-calibration of feature maps to capture important information, which helps improve feature extraction of the model. Moreover, the SE module’s ability to adaptively re-calibrate features can improve the model’s robustness to variations in environmental conditions. Next, the conventional PANet multi-scale feature fusion network was replaced with an efficient, weighted Bi-directional Feature Pyramid Network (BiFPN). This adaptation aids the model in determining useful weights for the comprehensive fusion of high-level and bottom-level features. Third, the regular up-sampling operator is replaced by the Content Aware Reassembly of Features (CARAFE) within the neck network. This implementation produces a better feature map that encompasses greater semantic information. In addition, CARAFE’s ability to enhance spatial detail helps the model discriminate between closely spaced fruits, especially for tomatoes that overlap heavily, potentially reducing the number of merging detections. Finally, for heightened identification of occluded and overlapped fruits, the conventional Non-Maximum-Suppression (NMS) algorithm was substituted with the Soft-NMS algorithm. 
Since Soft-NMS adopts a continuous weighting scheme, it is more adaptable to varying object sizes, improving the handling of small or large fruits in the image. Remarkably, this is carried out without introducing changes to the computational complexity. The outcome of the experiments showed that SBCS-YOLOv5s achieved a mean average precision (mAP (0.5:0.95)) of 87.7%, which is 3.5% superior to the original YOLOv5s model. Moreover, SBCS-YOLOv5s has a detection speed of 2.6 ms per image. Compared to other state-of-the-art detection algorithms, SBCS-YOLOv5s performed the best, showing tremendous promise for tomato detection in natural environments.


Introduction
The tomato is one of the world's most important crops (Peixoto et al., 2017), but harvesting tomatoes under natural conditions remains labor-intensive. Advances in artificial intelligence in laboratory research have significantly transformed fruit harvesting and paved the way for fruit-picking robots that are expected to supplant manual labor. The vision system plays a vital role in a fruit-picking robot because it must identify fruits accurately, a crucial first step on which the robot's precision, efficiency, and resilience depend. Nevertheless, natural conditions introduce complexities such as unbalanced lighting, occlusions, overlapping, and other unforeseeable elements (Gongal et al., 2015), all of which affect the detection accuracy of fruit-picking robots. Consequently, enhancing the accuracy, efficiency, and robustness of harvesting robots under these natural conditions is essential.
Many researchers have studied fruit detection to deal with the problems mentioned above. Digital image processing approaches based on color features (Goel and Sehgal, 2015; Yang et al., 2020), shape (Jana and Pareskh, 2017), and texture (Rakun et al., 2011) have been proposed to obtain reasonable detection results. Zhao et al. (2016a) developed a technique for detecting immature citrus fruits in natural environments based on cascaded pixel segmentation. A combination of color feature maps and a block-matching method was employed to identify potential fruit pixels. Subsequently, an SVM classifier was used for further refinement to eliminate false positives. However, because the initial segmentation stages relied solely on color features, numerous fruit instances remained undetected owing to the resemblance between green fruit and the background. Kurtulmus et al. (2011) introduced a new eigenfruit feature for identifying green citrus. This feature was paired with color information and circular Gabor texture analysis. Despite including texture characteristics alongside color features, their method had difficulty distinguishing some fruit from the background and struggled to detect heavily obscured fruit.
Other methods include K-means clustering (Jiao et al., 2020), Support Vector Machine (SVM) (Azarmdel et al., 2020), and AdaBoost algorithms (Payne et al., 2014). In tomato detection, Liu et al. (2019) developed an approach to identify mature tomatoes within natural environments. A naive Bayesian classifier with an oriented gradient histogram was used to distinguish each tomato. Subsequently, a color analysis step was used to remove false positives. Nevertheless, adapting this method to natural settings posed a challenge owing to the inherent limitations of manually crafted features in terms of their capacity for high-level abstraction. Similarly, Zhao et al. (2016b) used Haar-like feature thresholding and an AdaBoost classifier to detect tomatoes. Their study reported a recognition rate of 96% for tomatoes within their testing dataset. Nevertheless, their approach required a long training time.
The aforementioned methods relying on manually designed features have inherent limitations, particularly in scenarios where intricate feature extraction is demanded. The introduction of deep learning successfully addressed these challenges. For example, Rahnemoonfar and Sheppard (2017) showcased commendable fruit-counting capabilities through a customized Inception-ResNet architecture (Szegedy et al., 2017). However, this model focused exclusively on fruit counting and did not detect individual fruits. Santo et al. (2020) introduced a method to detect and track grape clusters within images captured in vineyards. This approach utilized the Mask-RCNN algorithm (He et al., 2017) for the precise detection of individual grape bunches. Furthermore, structure-from-motion techniques were applied to achieve the 3D alignment of images, enabling the effective mapping of features across various images. Their method achieved an F1-score of 91%. In recent years, the emergence of YOLO models has revolutionized object detection (Redmon et al., 2016; Redmon and Farhadi, 2017; Redmon and Farhadi, 2018; Bochkovskiy et al., 2020; Jocher et al., 2020; Wang et al., 2022). These YOLO models exhibited exceptional improvement in accuracy and speed, surpassing traditional two-stage pipelines (He et al., 2017; Girshick et al., 2014; Ren et al., 2015). They use a single feed-forward network to detect bounding boxes and corresponding classes. Wang et al. (2021) introduced an innovative method anchored in an enhanced YOLOv3-tiny model to identify disease occlusion and overlapping tomato leaves. This model strategically mitigated information loss during network transmission, resulting in a commendable mAP score of 93.1%. Bresilla et al. (2019) used DCNN architectures based on single-stage detectors. Leveraging deep learning techniques eliminates the need to manually code specific features tailored to particular fruit shapes, colors, or other attributes. This method achieved an accuracy of more than 90%. Liu et al. (2020) introduced YOLO-Tomato, a resilient model based on YOLOv3. This model achieved an Average Precision (AP) of 96.40% and a rapid detection speed of 54 ms. Chen et al. (2022) introduced a modified YOLOv4 to detect citrus. Their approach used an attention mechanism and a depth-wise separable convolution module. Moreover, they applied a pruning algorithm to eliminate the impact of irrelevant latent factors in the data. Their average improved from 92.89% to 96.15%, with 0.06 s to detect each image.
Expanding the scope, Cao et al. (2023) proposed a technique for persimmon recognition in natural environments. They harnessed an enhanced YOLOv5 model, achieving an average accuracy of 95.53%. Mbouembe et al. (2023) proposed a modified YOLOv4-tiny model for tomato recognition. Their enhancements included a refined backbone design, reducing computational complexity while augmenting feature extraction. A simplified CSP (Cross-Stage Partial Connections) Spatial Pyramid Pooling module was incorporated to improve the receptive field of the backbone. This modification aimed to enhance the ability of the model to capture information from a wider area of the input data. The CARAFE module in the neck network further improved the quality of the feature map. Their method produced an mAP of 82.8%.
Despite extensive research on fruit recognition under natural conditions, detection accuracy and robustness still need to be improved to meet the requirements of fruit detection. This study introduces a precise and resilient tomato detection methodology grounded in the YOLOv5s model to address these persisting challenges. Figure 1 gives an overview of SBCS-YOLOv5s. Focusing on these goals, the study aims to contribute to advancing tomato harvesting robots by developing an accurate tomato detection model that outperforms existing models in terms of accuracy and efficiency.
2 Theoretical background

YOLOv5 network
The YOLOv5s model (Jocher et al., 2020), pioneered by Ultralytics LLC in 2020, is composed of three core components: backbone, neck, and head networks. This study targets the YOLOv5s variant because of its superior performance compared to other iterations within the YOLO series. The backbone network employs a series of convolutional operations and fusion steps to extract the feature maps from input images. Subsequently, the neck network integrates feature maps of diverse dimensions obtained from the backbone network. This fusion yields an upgraded feature map that effectively preserves contextual information, mitigating information loss. This process leverages the FPN (Feature Pyramid Network) structure (Lin et al., 2017) to propagate robust semantic features from higher-level feature maps to their lower-level counterparts. Simultaneously, the PANet (Path Aggregation Network) architecture (Liu et al., 2018) transmits robust localization features from lower-level feature maps to their higher-level counterparts. The head network, the final segment of the model, consists of three layers that generate output feature maps at distinct scales.
Figure 1: Overview of the SBCS-YOLOv5s.
Touko Mbouembe et al. 10.3389/fpls.2023.1292766 Frontiers in Plant Science frontiersin.org
The CBS block is the conventional convolution layer in the YOLOv5s network. It applies a sequence of operations: convolution, batch normalization (Ioffe and Szegedy, 2015), and the SiLU activation function (Elfwing et al., 2018). YOLOv5 originally employed the BottleneckCSP module instead of the C3 module for feature extraction. The BottleneckCSP module combines the concepts of Bottleneck (He et al., 2016) and CSP (Cross-Stage Partial connections) (Wang et al., 2020a). It involves three successive convolutional kernel operations, with the output of the first being processed through two more convolutional kernels. This sequence culminates in the fusion of unprocessed and convolved features. The primary objective of the BottleneckCSP module is to deepen the model.
The CSP module introduced by Wang et al. (2020a) splits the input into two segments; one undergoes processing via a block (such as Bottleneck), while the other proceeds directly through a 1×1 convolutional layer. These two streams are then recombined. The C3 module supplants a 1×1 convolutional layer within the BottleneckCSP module, simplifying the network architecture to enable the extraction of feature maps and minimize the computational complexity. The C3 module comprises two branches, each involving a convolution operation that reduces the feature map channel count by half. The output from these two branches is concatenated using the Bottleneck module, followed by a convolutional layer within the second branch. These processes tightly integrate the output feature maps from both branches, with a final convolutional layer generating the output feature map of the module. Furthermore, the SPPF (Spatial Pyramid Pooling - Fast) module augments the ability of the backbone to express features. This module applies a sequence of three max-pooling operations with identical kernels and fuses the resulting features, which cover receptive fields of various sizes.

Content-aware reassembly of features
The YOLOv5 model uses nearest neighbor interpolation for its up-sampling process, applying the same kernel across the whole feature map. This approach does not leverage the semantic information in the feature map during up-sampling, resulting in a significant loss of features. This study integrates the CARAFE module (Wang et al., 2019) to address these limitations. The CARAFE module consists of two main components: a kernel prediction module and a content-aware reassembly module. It predicts a reassembly kernel for each location and uses it to reassemble the features within a predetermined local region around that location, based on the underlying content. Because the reassembly kernels are adjusted dynamically at each location according to the content information, CARAFE offers superior performance compared to alternative up-sampling methods such as interpolation. Every target location requires its own reassembly kernel, with the kernel size denoted as $k_{up}$. The reassembly procedure is given by Equation 1:
$$O_{l'} = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot I(i+n, j+m) \quad (1)$$
where O and I represent the output and input, respectively. $W_{l'}$ denotes the location-wise kernel associated with each location $l'$ based on the input. $l'$ signifies the neighboring location of $l = (i, j)$, and $r = \lfloor k_{up}/2 \rfloor$.
The CARAFE approach significantly enhances the semantic richness of the reassembled feature maps compared to the nearest neighbor interpolation up-sampling technique. This is achieved by strategically emphasizing crucial points within localized regions. In scenarios where tomatoes overlap or are densely packed, CARAFE's ability to enhance spatial detail helps the model distinguish between closely spaced fruits, potentially reducing the number of merged detections. It also helps the model to improve localization accuracy in tomato detection. In addition, CARAFE offers a wide field of view, content-aware handling, and a lightweight design, which keeps computation fast. Figure 2 shows the architectural representation of CARAFE.
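The reassembly step described above can be illustrated for a single output location. The following is a minimal sketch, not the CARAFE implementation: it assumes the predicted kernel scores are normalized with a softmax (as in the cited CARAFE paper), handles one channel and one location only, and pads with zeros outside the map. The function names and the toy 3×3 feature map are hypothetical.

```python
import math

def softmax(values):
    """Normalize raw kernel scores so they are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reassemble(feature, i, j, raw_kernel, k_up):
    """Weighted sum over the k_up x k_up neighbourhood centred on (i, j).

    feature    : 2D list, one channel of the input map I
    raw_kernel : k_up * k_up predicted scores for one output location
    """
    r = k_up // 2
    weights = softmax(raw_kernel)
    height, width = len(feature), len(feature[0])
    out = 0.0
    idx = 0
    for n in range(-r, r + 1):
        for m in range(-r, r + 1):
            y, x = i + n, j + m
            # Zero padding outside the map (an assumption of this sketch).
            inside = 0 <= y < height and 0 <= x < width
            out += weights[idx] * (feature[y][x] if inside else 0.0)
            idx += 1
    return out

feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0],
           [7.0, 8.0, 9.0]]

# Equal scores give a plain 3x3 average of the neighbourhood.
uniform = reassemble(feature, 1, 1, [0.0] * 9, k_up=3)
# A sharply peaked kernel copies (almost exactly) one neighbour.
peaked = reassemble(feature, 1, 1, [0, 0, 0, 0, 0, 0, 0, 0, 50], k_up=3)
# At the corner, out-of-range neighbours contribute zero.
corner = reassemble(feature, 0, 0, [0.0] * 9, k_up=3)
```

Because the kernel is predicted from the content, the same routine can behave like an average in smooth regions and like a selective copy near edges, which a fixed nearest-neighbor kernel cannot do.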

Image acquisition
Images of tomatoes were taken from December 2017 to November 2019 in the greenhouses of a tomato production base located in Shouguang city, China, with a digital camera (DSC-W170, Sony, Tokyo, Japan) at a resolution of 3648×2056 pixels. The camera was equipped with a 5× Carl Zeiss Vario-Tessar precision zoom lens. The distance between the camera and the target ranged from 500 mm to 1000 mm. Nine hundred and sixty-six images were captured under natural daylight (sunny and cloudy days) with different conditions such as shading, sunlight, occlusions, and overlaps. The training set had 725 images, while the test set contained 241 images. The scale of tomatoes in the images varies greatly, ranging from 200 to 1500 pixels in diameter.

Image augmentation
This study used data augmentation to counteract potential issues, such as over-fitting or non-convergence, that could arise during training. The images were augmented using diverse techniques, such as brightness transformation, blur, horizontal flip, noise, and rotation. These methods were employed to enhance the resilience of the model against noise and its ability to remain unaffected by variations in camera positioning. In particular, introducing a random Gaussian blur makes the model more resistant to camera blur, with a threshold of 25 pixels for maximum blur. In addition, the incorporation of horizontal and vertical flips helped the model perform consistently regardless of the orientation of the subject. A visual representation of the data augmentation techniques can be observed in Figure 3.
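Geometric augmentations such as the horizontal flip also require the bounding-box annotations to be transformed together with the image. The sketch below illustrates this under stated assumptions: the image is a nested list of pixel rows and boxes use a hypothetical `(x1, y1, x2, y2)` pixel format; the source does not describe its annotation format or augmentation tooling.

```python
def hflip_image(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def hflip_box(box, image_width):
    """Mirror an (x1, y1, x2, y2) box so it still covers the same object."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

image = [[1, 2, 3],
         [4, 5, 6]]
box = (0, 0, 1, 2)            # covers the left-most column

flipped_image = hflip_image(image)
flipped_box = hflip_box(box, image_width=3)
```

After the flip the box covers the right-most column, which now holds the pixels that were originally on the left.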

The SBCS-YOLOv5s architecture
The YOLOv5 model is a single-stage object detection algorithm that introduces substantial enhancements over other YOLO models. However, achieving both high accuracy and fast speed remains challenging in tomato detection, primarily because of the intricacies of the natural environment, such as occlusions and overlapping. This study proposes the SBCS-YOLOv5s model to address this issue by incorporating SE, BiFPN, CARAFE, and Soft-NMS into YOLOv5s. The first module of this approach is used for feature extraction, merging the SE module (Hu et al., 2018) and the C3 module of the YOLOv5s model. This fusion focuses the network on useful information, refines the feature extraction process, and improves the model's robustness to variations in environmental conditions. The neck network integrates the BiFPN (Tan et al., 2020) and CARAFE modules (Wang et al., 2019) into YOLOv5s, enriching features with more profound semantic information. The conventional NMS algorithm (Hosang et al., 2017) used in YOLOv5s was substituted with the Soft-NMS algorithm (Bodla et al., 2017) to make the network more effective in detecting occluded and overlapped fruits. Additional details of this approach are elaborated upon in subsequent sections. Figure 4 presents the architecture of SBCS-YOLOv5s.

The modified backbone network
Figure 2: Overall architecture of the CARAFE module.
The SE attention module (Hu et al., 2018) in Figure 5A is fused with the C3 module structure into an improved C3SE module. The SE module addresses the issue of feature maps containing both informative and less relevant channels. The re-calibration process empowers the network to prioritize informative channels while suppressing the less useful ones. In addition, the SE module's ability to adaptively re-calibrate features helps to improve the robustness of the model to variations in illumination and environmental conditions. It also helps to reduce over-fitting, which is essential for tomato detection to accurately identify the boundaries of individual tomatoes in an image. Figure 5B presents the structure of the C3SE module. The weight of each channel is allocated using the interdependence of the feature channels, so that the neural network focuses on significant feature information and the impact of feature redundancy is minimized. The SE attention module comprises three key operations: squeeze, excitation, and scale. The squeeze operation, also called compression, applies global average pooling to each channel of the feature map. This compresses the spatial dimensions of the feature map, converting each channel into a single value while maintaining the overall channel dimension. For example, if the input feature map has a size of H×W×C, and $V = [v_1, v_2, \ldots, v_C]$ is an example input set, the transformation of the squeeze operation can be expressed using Equation 2, with $z_c$ denoting the resulting descriptor of channel c:
$$z_c = F_{sq}(v_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} v_c(i, j) \quad (2)$$
where c ∈ C, and C denotes the number of feature channels, while W signifies the feature map width; H corresponds to the height of the feature map; $F_{sq}$ denotes the squeeze operation.
The excitation operation consists of two primary components: fully connected layers and a sigmoid activation function. The fully connected layers incorporate comprehensive information from all input features. Subsequently, the sigmoid function maps the result into the range [0, 1]. This process is expressed by Equation 3, in which z denotes the output of the squeeze operation and S the resulting channel weights:
$$S = F_{ex}(z, B) = \sigma(B_2\, \delta(B_1 z)) \quad (3)$$
where σ symbolizes the sigmoid activation function, δ signifies the ReLU activation function, and $F_{ex}$ denotes the excitation operation. $B_1$ and $B_2$ denote the weights of the two fully connected layers.
Finally, the scale operation multiplies each input channel by the channel weight derived from the two preceding operations. Equation 4 shows the rescaling operation:
$$\tilde{V}_c = F_{scale}(V_c, S_c) = S_c \cdot V_c \quad (4)$$
where $F_{scale}(V_c, S_c)$ refers to the channel-wise multiplication between $S_c$ and $V_c$.
Figure 4: The architecture of SBCS-YOLOv5s.
Figure 5: Original SE module and improved C3SE module architecture. (A) SE module architecture, (B) C3SE module architecture.
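The squeeze, excitation, and scale operations can be traced on a toy feature map. The following is a minimal sketch in plain Python, assuming the standard SE formulation of Hu et al. (2018): global average pooling, two fully connected layers with a ReLU between them, a sigmoid, and channel-wise rescaling. The weights `B1` and `B2` and the reduction ratio of 2 are hypothetical illustration values, and biases are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squeeze(feature_map):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def excite(z, b1, b2):
    """Two fully connected layers: ReLU after b1, sigmoid after b2."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in b1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in b2]

def scale(feature_map, s):
    """Multiply every value of channel c by its weight s[c]."""
    return [[[s_c * v for v in row] for row in ch]
            for ch, s_c in zip(feature_map, s)]

# Toy input with C = 2 channels and H = W = 2.
V = [[[1.0, 3.0], [5.0, 7.0]],
     [[0.0, 0.0], [2.0, 2.0]]]
B1 = [[0.5, -0.5]]        # C -> C/2 (hypothetical weights)
B2 = [[1.0], [-1.0]]      # C/2 -> C (hypothetical weights)

z = squeeze(V)            # per-channel means
S = excite(z, B1, B2)     # per-channel weights in (0, 1)
V_scaled = scale(V, S)    # re-calibrated feature map
```

In this toy case the first channel receives a weight above 0.5 and the second a weight below 0.5, so the first channel is emphasized and the second is attenuated.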

The modified neck network
In this study, the FPN+PANet network in the YOLOv5s neck was replaced with the weighted BiFPN. The rationale stems from large-scale objects possessing many pixels, whereas small objects have few. The features of large objects can be easily maintained in the convolution process, while the features of the smaller ones can be easily ignored. The YOLOv3 model introduced the FPN network structure (Lin et al., 2017), emphasizing the down-sampling process of semantic information extraction. Based on this, YOLOv5 incorporates PANet (Liu et al., 2018) to aggregate image features by adding a secondary bottom-up fusion path, as shown in Figure 6A. This approach integrates accurate low-level localization signals to enrich the entire feature hierarchy and facilitate the flow of information. However, PANet is characterized by simple two-way fusion, and the contributions of its input features to the output features often remain unequal because of the varying input resolutions. Furthermore, feature fusion in PANet involves a direct addition of distinct input features, leading to unbalanced output features and complicating computational processes.
The BiFPN, introduced by Tan et al. (2020), is a feature fusion module for object detection models. Its main strength lies in fusing information within a deep learning network both efficiently and accurately. It addresses the problem of correctly combining multi-scale features from multiple layers of a convolutional neural network in order to improve the detection accuracy of objects at various scales. The bottom-up and top-down paths are used to construct a feature pyramid that captures fine-grained features. The BiFPN combines the feature maps from the bottom-up and top-down paths. Furthermore, rather than letting all feature maps contribute equally, the BiFPN provides a learnable weight for each input feature map, allowing the network to assign varied priorities to different scales and resolutions. The notable enhancement brought by BiFPN is the introduction of bi-directional connections between neighboring levels of the network. This augmentation substantially improves the flow of information and gradient propagation during the training process. It also improves tomato localization, helping the model to capture details at different scales and make more accurate predictions of tomato locations. In addition, BiFPN is designed to be computationally efficient, making it well suited for real-time detection. Figure 6B shows the BiFPN architecture. (Equation 5) shows the fast normalized fusion between the feature maps from the bottom-up and top-down paths.
where the intermediate feature at Level 6 along the top-down pathway is $P_6^{td}$, while the resulting feature at Level 6 stemming from the bottom-up pathway is $P_6^{out}$. Conv and Resize correspond to convolution and sampling operations, respectively. w and ϵ represent the weight and a small pre-set value to avoid numerical instability, respectively. Usually, this value is set to 0.0001.
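The weighting idea behind fast normalized fusion can be shown numerically. The sketch below assumes the form used in the cited BiFPN paper, in which each non-negative learnable weight is divided by the sum of all weights plus ϵ; the convolution applied after fusion is omitted, the inputs are assumed to be already resized to the same shape, and the feature values and weights are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature maps as sum(w_i * f_i) / (eps + sum(w_i)).

    features : list of equally sized flat lists (already resized)
    weights  : one learnable scalar per input; ReLU keeps them non-negative
    """
    positive = [max(0.0, w) for w in weights]
    denom = eps + sum(positive)
    size = len(features[0])
    return [sum(w * f[k] for w, f in zip(positive, features)) / denom
            for k in range(size)]

p6_in = [1.0, 2.0, 3.0]         # input feature at level 6 (toy values)
p7_resized = [3.0, 2.0, 1.0]    # level-7 feature after Resize (toy values)

# The first input is trusted three times as much as the second.
fused = fast_normalized_fusion([p6_in, p7_resized], weights=[3.0, 1.0])
# A negative weight is clipped to zero, so that input is ignored.
clipped = fast_normalized_fusion([[1.0], [5.0]], weights=[2.0, -3.0])
```

Unlike a softmax over the weights, this normalization needs no exponentials, which is why it is described as fast.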
BiFPN improves the detection accuracy compared to the PANet used in the YOLOv5s model. Nevertheless, the BiFPN employs a nearest neighbor interpolation method for the up-sampling of feature maps. Using this approach could lead to a small receptive field and make the network focus only on sub-pixel spaces, resulting in the loss of rich semantic information. In this study, the CARAFE module was introduced to the BiFPN to tackle this problem. This integration produced feature maps with rich information and high resolution. Section 2.2 describes the CARAFE module in detail.

Soft-NMS (non-maximum suppression) algorithm
The Soft-NMS algorithm (Bodla et al., 2017) is a modified version of the conventional NMS algorithm (Hosang et al., 2017) used by the YOLOv5 framework. The fundamental principle behind the NMS algorithm involves selecting the bounding box with the highest confidence score. It then suppresses the other bounding boxes that overlap significantly with the selected box, which leads to missed detections of overlapping fruits. Moreover, the NMS algorithm does not perform optimally when dealing with different scales. Equation 6 shows the NMS algorithm:
$$\hat{S}_i = \begin{cases} \hat{S}_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \geq N_t \end{cases} \quad (6)$$
where $b_i$ and $\hat{S}_i$ denote the ith predictor and its score, respectively. $N_t$ is the pre-set threshold; M denotes the candidate box having the highest score; $\mathrm{IoU}(M, b_i)$ is the overlap between M and $b_i$.
Figure 6: Architectures of PANet and BiFPN. (A) PANet architecture, (B) BiFPN architecture.
The objective of the Soft-NMS algorithm is to overcome the limitations of the traditional NMS algorithm. It is designed to be more tolerant of overlapping objects. This is achieved using a softening function that progressively decreases the scores of bounding boxes overlapping with the one possessing the highest score. The primary goal is to reduce the severe suppression of surrounding boxes that might be slightly less confident but still contain useful information. This modification seeks to enhance the detection accuracy and improve the handling of cases involving overlapping fruits within the final detection results. It also helps maintain a consistent ranking of bounding box scores, even when there is overlap. The Soft-NMS algorithm is outlined in (Equation 7): where σ represents the hyperparameter of the penalty function. When $\mathrm{IoU}(M, b_i)$ exceeds the pre-defined threshold, the confidence score of the prediction box is reduced gradually instead of being set to zero. As a result, the detection accuracy of overlapping and occluded fruits can be improved.
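The difference between hard suppression and score decay can be seen on three toy boxes. The sketch below is one possible reading, not the authors' implementation: it assumes the Gaussian penalty of Bodla et al. (2017), in which every remaining score is multiplied by exp(−IoU²/σ); the variant used in the study may apply the penalty only above the threshold. The box format `(x1, y1, x2, y2)`, the values of `n_t`, `sigma`, and `min_score`, and the toy boxes are all assumptions.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def suppress(boxes, scores, soft, n_t=0.5, sigma=0.5, min_score=0.001):
    """Return (box_index, final_score) pairs kept by NMS or Gaussian Soft-NMS."""
    remaining = list(range(len(boxes)))
    scores = list(scores)
    kept = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])   # box M with the top score
        remaining.remove(m)
        kept.append((m, scores[m]))
        survivors = []
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if soft:
                scores[i] *= math.exp(-(overlap ** 2) / sigma)   # decay the score
            elif overlap >= n_t:
                scores[i] = 0.0                                  # hard suppression
            if scores[i] > min_score:
                survivors.append(i)
        remaining = survivors
    return kept

# Boxes 0 and 1 overlap heavily (two adjacent fruits); box 2 is isolated.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10), (30, 30, 40, 40)]
scores = [0.9, 0.8, 0.7]

hard_kept = suppress(boxes, scores, soft=False)
soft_kept = suppress(boxes, scores, soft=True)
```

With hard NMS the second box is discarded because its IoU with the top box is about 0.54; with the Gaussian penalty it survives with a reduced score, which is the behavior the text relies on for overlapping fruits.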

Loss function
The loss function used in this study is expressed as (Equation 8), which encompasses the regression error of the bounding box coordinates, the confidence error of the bounding box, and the classification error of the object category.
In this study, the bounding box regression loss uses CIoU (Complete IoU), as in (Equation 8.a). It accurately represents the gap between the prediction and annotation boxes, which benefits the network model during training. It considers crucial factors, such as the overlapping area (expressed in Equation 8.b), central point distance, and aspect ratio (expressed in Equation 8.c) between $b$ and $b^{gt}$.
where $b$ and $b^{gt}$ represent the predicted and ground truth bounding boxes, respectively. d signifies the distance between the predicted center point and the true center point; c is the diagonal length of the enclosing box covering $b$ and $b^{gt}$; α and v are the positive trade-off and aspect ratio parameters, respectively.
The object classification loss is expressed as (Equation 8.e); the process begins by calculating the confidence C of the grid cell as in (Equation 8.d), with $l_{i,j}$ expressed in (Equation 8.f): where s×s denotes the size of the grid; $N_B$ stands for the number of bounding boxes; Ci represents the confidence obtained from the prediction box; C i signifies the confidence threshold.
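The three factors that CIoU combines (overlapping area, center distance, and aspect ratio) can be computed directly for a pair of boxes. The sketch below assumes the standard CIoU definition, 1 − IoU + d²/c² + αv with v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))² and α = v/((1 − IoU) + v); it is not taken from the study's code. Boxes use a hypothetical `(x1, y1, x2, y2)` format and are assumed to have positive width and height.

```python
import math

def ciou_loss(b, b_gt):
    """CIoU loss for a predicted box b and a ground-truth box b_gt."""
    # Overlapping-area term (IoU).
    iw = max(0.0, min(b[2], b_gt[2]) - max(b[0], b_gt[0]))
    ih = max(0.0, min(b[3], b_gt[3]) - max(b[1], b_gt[1]))
    inter = iw * ih
    w, h = b[2] - b[0], b[3] - b[1]
    w_gt, h_gt = b_gt[2] - b_gt[0], b_gt[3] - b_gt[1]
    iou = inter / (w * h + w_gt * h_gt - inter)

    # Squared center distance d^2 and squared enclosing-box diagonal c^2.
    d2 = (((b[0] + b[2]) - (b_gt[0] + b_gt[2])) / 2) ** 2 \
        + (((b[1] + b[3]) - (b_gt[1] + b_gt[3])) / 2) ** 2
    c2 = (max(b[2], b_gt[2]) - min(b[0], b_gt[0])) ** 2 \
        + (max(b[3], b_gt[3]) - min(b[1], b_gt[1])) ** 2

    # Aspect-ratio consistency v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + d2 / c2 + alpha * v

perfect = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))    # identical boxes
shifted = ciou_loss((2, 0, 12, 10), (0, 0, 10, 10))    # same shape, moved right
reshaped = ciou_loss((0, 0, 20, 5), (0, 0, 10, 10))    # different aspect ratio
```

A shifted box of the same shape is penalized through the IoU and center-distance terms only, while a box with a different aspect ratio additionally pays the αv term.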

Experimental setup
The experiments of this research were conducted using an Intel i5 64-bit quad-core CPU operating at a frequency of 3.30 GHz (Santa Clara, CA, USA). The system had 16 GB of RAM and an NVIDIA GeForce GTX 1070Ti GPU with 8 GB of memory. The chosen model framework was PyTorch, with CUDA 11.1 and Python 3.8.10 for implementation. Table 1 lists some of the hyperparameters used in the experiments. The criteria used for assessing the performance of fruit detection were precision, recall, mean average precision (mAP), and F1 score (Padilla et al., 2020). The metrics are defined in (Equations 9-12):

$$R = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \quad (9)$$
where R and P are the recall and precision, respectively. Using mAP is a valuable approach to assess the model performance across different confidence levels.
with AP expressed in (Equation 11.a): where p(r) represents the calculated precision at a given recall value (r), while $N_{cls}$ is the total number of classes.
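The four evaluation criteria can be computed from detection counts and a precision-recall curve. The sketch below uses the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = 2PR/(P+R), AP as the area under p(r), and mAP as the mean of AP over classes); the all-point interpolation used for the area and the example counts are assumptions, not values from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    recalls must be sorted in increasing order, one precision per recall.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing when read from left to right.
    for k in range(len(p) - 2, -1, -1):
        p[k] = max(p[k], p[k + 1])
    return sum((r[k + 1] - r[k]) * p[k + 1] for k in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values (N_cls = number of classes)."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
# Hypothetical precision-recall samples for a single class.
ap = average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.9, 0.5])
m_ap = mean_average_precision([ap])
```

With a single class, as in tomato detection, mAP equals the AP of that class; mAP (0.5:0.95) additionally averages this value over IoU thresholds from 0.5 to 0.95.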
4 Results and discussions

Ablation study
The first step in this study was to determine which attention mechanism (CBAM (Woo et al., 2018), ECA (Wang et al., 2020b), CA (Hou et al., 2021), or SE (Hu et al., 2018)) works best on the tomato dataset after being fused with the original C3 module. Table 2 shows that integrating the SE attention module with the C3 module gave the best result: the mean average precision with an IoU of 0.5 to 0.95 reached 85.1%.
Since the SE attention module relies on modeling channel-wise relationships and adaptive re-calibration of feature maps to capture important information, it helps to improve the feature extraction of the model. The fusion of the SE attention module with the C3 module was implemented within the backbone network. Furthermore, BiFPN, CARAFE, and Soft-NMS were integrated in the neck to improve the detection capabilities of YOLOv5s. An ablation study was carried out to evaluate the effectiveness of each component.
Integrating the SE attention module with the C3 module resulted in a 0.9% increase in the mean average precision with an IoU of 0.5 to 0.95, as shown in Table 3. This enhancement underscores the efficacy of the SE attention module in steering the model towards useful information. Subsequently, a further increase of 0.6% in mAP was achieved by replacing PANet with BiFPN. This is because the BiFPN assists the model in determining useful weights for comprehensive fusion of high-level and low-level features, thereby improving detection performance. Discernible performance improvements became evident after incorporating the CARAFE module as an up-sampling operator within PANet and BiFPN. This is because CARAFE enhances spatial details and improves feature map resolution better than the original up-sampling method. The most remarkable results emerged when the Soft-NMS algorithm was applied to the BiFPN+CARAFE configuration, showing a 3.5% advancement over the original YOLOv5s model. This supports the advantage of the continuous weighting scheme of Soft-NMS. This sequence of observations indicates that each modification contributes to a substantial overall enhancement in detection performance.
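The reported gains can be cross-checked with simple arithmetic. The sketch below assumes that every gain is an absolute difference in percentage points and that the BiFPN gain is measured on top of the C3SE configuration; the text does not state this explicitly, so the intermediate values other than 85.1% are inferred rather than quoted from Table 3.

```python
final_map = 87.7     # SBCS-YOLOv5s, mAP (0.5:0.95) in percent
total_gain = 3.5     # reported gain over the original YOLOv5s

baseline = round(final_map - total_gain, 1)     # implied YOLOv5s baseline
with_c3se = round(baseline + 0.9, 1)            # after fusing SE into C3
with_bifpn = round(with_c3se + 0.6, 1)          # after replacing PANet with BiFPN
remaining = round(final_map - with_bifpn, 1)    # left for CARAFE and Soft-NMS
```

The implied C3SE value of 85.1% matches the figure quoted for the SE attention module in the attention-mechanism comparison, which suggests the gains are indeed reported in percentage points.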

Feature map visualization
Visualizations were performed to compare the improved model variants with the original YOLOv5s. Figure 7A presents an input image with tomatoes annotated for improved visibility. Figures 7B, C show the feature maps of the C3 and C3SE modules, respectively. In particular, Figure 7C highlights finer details that are more discernible. This observation underscores the role of the SE module in steering the backbone network towards useful information. Figures 7D, E represent the original neck of YOLOv5s and the modified neck used in SBCS-YOLOv5s, respectively. Figure 7E shows an improved feature map with heightened resolution after incorporating the BiFPN and CARAFE modules. These enhancements facilitate efficient context information aggregation and seamless fusion within the network.
Every modification produced superior, higher-resolution features compared to those in the original model (Figure 7). This visual evidence supports the claim that SBCS-YOLOv5s is more accurate, robust, and efficient than the original model.
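The weighted fusion performed by BiFPN can be sketched as follows. This is a minimal illustration that assumes the fast normalized fusion rule from the original BiFPN design, in which each input feature map receives a learnable non-negative weight and the weights are normalized by their sum; whether SBCS-YOLOv5s uses exactly this rule is not stated here, and the function name, the nested-list feature maps, and the `eps` value are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2-D feature maps with normalized non-negative weights.

    features: list of feature maps, each a list of rows (H x W)
    weights:  one learnable scalar per feature map (illustrative values here)
    """
    # ReLU keeps every weight non-negative so the fusion stays a weighted average.
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps  # eps avoids division by zero
    norm = [x / total for x in w]
    rows, cols = len(features[0]), len(features[0][0])
    return [
        [sum(n * f[r][c] for n, f in zip(norm, features)) for c in range(cols)]
        for r in range(rows)
    ]


# Example: a high-level map weighted 1.0 and a low-level map weighted 3.0.
fused = fast_normalized_fusion([[[1.0, 2.0]], [[3.0, 4.0]]], [1.0, 3.0])
```

Because the weights are learned, the network can emphasize whichever resolution level carries more useful information for a given fusion node, which is the behaviour the feature maps in Figure 7E are meant to reflect.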
The mean average precision (mAP) with an IoU of 0.5 to 0.95 was 3.8%, 9.7%, 5.8%, 16.4%, 4.6%, 9.3%, 4.5%, and 3.5% higher than those of Faster-RCNN, Dynamic RCNN, YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv7-tiny, and YOLOv5s, respectively (Table 4). Furthermore, the detection time reached 2.6 ms per image, fulfilling the real-time detection criteria. Moreover, the precision of the proposed model improved by 0.3%, 1.5%, 2.5%, 1.7%, 1.4%, 1.5%, 0.6%, and 1.4% compared to Faster-RCNN, Dynamic RCNN, YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv7-tiny, and YOLOv5s, respectively. The F1 score and mAP with an IoU of 0.5 increased by 1.64% and 0.5%, respectively, compared to the original YOLOv5s model. Hence, SBCS-YOLOv5s outperformed the other object detection networks. Importantly, the experimental results revealed the efficient real-time detection capability of SBCS-YOLOv5s in accurately identifying tomatoes in their natural environment.
The detection performance of the improved YOLOv5s surpassed that of the alternative models while demonstrating greater efficiency (Figure 8). The mAP with an IoU of 0.5 to 0.95 exhibited a notable 3.5% improvement compared to the original YOLOv5s model, and the processing time for detecting each image decreased by 1.5 ms. These results collectively indicate that the improved model achieves better accuracy, compactness, and efficiency when tasked with fruit detection in a natural environment.
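For reference, the two summary metrics compared above can be written as a short Python sketch. The F1 score is the harmonic mean of precision and recall, and mAP with an IoU of 0.5 to 0.95 is the average of the AP values obtained at the ten IoU thresholds 0.50, 0.55, ..., 0.95. The function names are hypothetical, and any input values passed to them are illustrative rather than taken from Table 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both expressed in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP(0.5:0.95)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + k * step, 2) for k in range(count)]


def map_50_95(ap_values):
    """Average the AP values measured at each IoU threshold, in threshold order."""
    thresholds = iou_thresholds()
    if len(ap_values) != len(thresholds):
        raise ValueError("expected one AP value per IoU threshold")
    return sum(ap_values) / len(ap_values)
```

Because AP generally decreases as the IoU threshold becomes stricter, mAP(0.5:0.95) rewards tight localization more than mAP at an IoU of 0.5 alone, which is why the two metrics can improve by different amounts for the same model.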

Performance of the improved model under different conditions
In a natural environment, tomatoes are exposed to different lighting conditions because of uneven illumination. Moreover, they can be obscured by leaves or branches and may overlap one another. The performance of the improved model was therefore assessed across diverse scenarios. With respect to illumination, the tomatoes were classified into sunlight and shade cases (Table 5). Within the test dataset, there were 425 tomatoes under shaded conditions and 487 tomatoes under sunlight conditions. With respect to the severity of occlusion and overlap, the tomatoes were classified into mild and severe occlusion cases, as delineated in Table 5. The latter pertains to situations where more than 50% of a tomato is obstructed by leaves, branches, or other tomatoes.
The correct detection rate was 97.2% for tomatoes under sunlight conditions and 97.4% for tomatoes under shaded conditions (Table 5). The false identification rates were 3.1% and 3.3% for sunlight and shade, respectively. Approximately 97.7% of the tomatoes with mild occlusion were detected correctly, compared with 96.4% of those with severe occlusion (Table 5). The rates of missed identification were 2.3% and 3.6% for mild and severe occlusions, respectively. Figure 9 presents some examples of detection results under various conditions. The results revealed the robustness of the improved model in handling variations in illumination and situations involving overlapping fruits.
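The bookkeeping behind the occlusion results can be sketched in a few lines of Python. The sketch assumes that each test tomato is recorded with its occluded fraction and a flag indicating whether it was detected; the record format, the function names, and the sample values are hypothetical and do not reproduce the data in Table 5. A tomato is assigned to the severe class when more than 50% of it is obstructed, following the definition above, and the missed identification rate is the complement of the correct detection rate within each class.

```python
def occlusion_level(occluded_fraction):
    """Severe occlusion means more than 50% of the tomato is obstructed."""
    return "severe" if occluded_fraction > 0.5 else "mild"


def rates_by_occlusion(records):
    """records: iterable of (occluded_fraction, detected) pairs.

    Returns the correct and missed detection percentages per occlusion level.
    """
    tally = {}
    for fraction, detected in records:
        level = occlusion_level(fraction)
        total, hits = tally.get(level, (0, 0))
        tally[level] = (total + 1, hits + int(detected))
    return {
        level: {
            "correct_pct": 100.0 * hits / total,
            "missed_pct": 100.0 * (total - hits) / total,
        }
        for level, (total, hits) in tally.items()
    }


# Illustrative records only; these are not values from the test dataset.
sample = [(0.1, True), (0.2, True), (0.3, False), (0.4, True), (0.6, True), (0.9, False)]
summary = rates_by_occlusion(sample)
```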

Qualitative analysis between SBCS-YOLOv5s and the original YOLOv5s model
Figure 10 shows some prediction results from SBCS-YOLOv5s and the original YOLOv5s model. As shown in Figure 10, the detection performance of SBCS-YOLOv5s was superior to that of the original YOLOv5s model. In particular, Figure 10G shows that the improved model focuses on more useful information and retains the features of small tomatoes. Moreover, the original YOLOv5s model returned some false negatives and false positives, as shown in Figures 10E, F.

5 Conclusions and future work
This paper introduced an accurate and efficient tomato detection solution named SBCS-YOLOv5s, which builds upon the YOLOv5s framework. The accuracy and efficiency of the model were improved by substituting the original C3 module within YOLOv5s with a C3SE module, which combines the SE attention module with the C3 module. This change amplified the feature extraction capabilities. Furthermore, the PANet in the neck of the original model was replaced with a weighted Bi-directional Feature Pyramid Network (BiFPN), enhancing the adaptability of the detector to objects of varying scales by fusing high-level and bottom-level features at high resolution. In addition, the traditional up-sampling operator within the BiFPN structure was substituted with the CARAFE module to yield more refined semantic information. Finally, the conventional NMS algorithm was replaced with the Soft-NMS algorithm to improve the detection accuracy of overlapped and occluded fruits.
Thorough experiments were carried out to validate the performance of SBCS-YOLOv5s. An ablation study substantiated the efficacy of each modification. The experiments showed that the mAP with an IoU of 0.5 to 0.95 improved by 3.8%, 9.7%, 5.8%, 16.4%, 4.6%, 9.3%, 4.5%, and 3.5% compared to Faster-RCNN, Dynamic RCNN, YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv7-tiny, and YOLOv5s, respectively, with a detection time of 2.6 ms per image.
Furthermore, the experiments underscored the robustness of SBCS-YOLOv5s, which effectively detected tomatoes across diverse scenarios involving varying lighting and occlusion conditions.
Despite the excellent performance of the improved model, there is room for enhancing the detection performance. In future work, the explicit incorporation of contextual information will be explored to refine the detection accuracy. Moreover, we will consider incorporating information about tomato maturity into the SBCS-YOLOv5s model presented in this study to enable differentiation among tomatoes at distinct growth stages.

FIGURE 1

FIGURE 3
Examples of data enhancement techniques. (A) Input image, (B, C) varied exposure, (D) noise (salt and pepper), and (E) horizontal flip.

FIGURE 4

FIGURE 7
(A) Annotated input image, (B) feature of the C3 module of YOLOv5s, (C) feature of the C3SE module of SBCS-YOLOv5s, (D) feature of the original neck of YOLOv5s, and (E) feature of the neck of SBCS-YOLOv5s.

FIGURE 8
Detection performance of different models (accuracy vs. inference time).

FIGURE 9
Some examples of tomato detection results under different conditions. (A-C) Sunlight cases, (D-F) shade cases, (G-I) slight occlusions, and (J-L) severe occlusions.

TABLE 1
Configuration of certain hyper-parameters.

TABLE 2
Ablation analysis of different attention mechanisms.

TABLE 4
Comparison of the different models.
*YOLOv5s version 6.1 is used in this study.

TABLE 5
Performance of the improved model under different conditions.