- College of Cyber Security, Tarim University, Alar, China
Introduction: Automated apple harvesting is hindered by clustered fruits, varying illumination, and inconsistent depth perception in complex orchard environments. While deep learning models such as Faster R-CNN and YOLO provide accurate 2D detection, they require large annotated datasets and high computational resources, and often lack the precise 3D localisation required for robotic picking.
Methods: This study proposes an enhanced K-Means clustering segmentation algorithm integrated with a stereo-vision system for accurate 3D apple localisation. Multi-feature fusion combining colour, morphology, and texture descriptors was applied to improve segmentation robustness. A block-matching stereo model was used to compute disparity and derive 3D coordinates. The method was evaluated against Faster R-CNN, YOLOv7, Mask R-CNN, SSD, DBSCAN, MISA, and HCA using metrics including Recognition Accuracy (RA), mean Average Precision (mAP), Mean Coordinate Deviation (MCD), Correct Recognition Rate (CRR), Frames Per Second (FPS), and depth-localisation error.
Results: The proposed method achieved >91% detection accuracy and <1% localisation error across challenging orchard conditions. Compared with Faster R-CNN, it maintained higher RA and lower MCD under high fruit overlap and variable lighting. Depth estimation errors ranged from 0.4% to 0.97% at 800–1100 mm distances, confirming high spatial accuracy. The proposed model exceeded YOLOv7, SSD, FCN, and Mask R-CNN in F1-score, mAP, and FPS during complex lighting, occlusion, wind disturbance, and dense fruit distributions.
Discussion and Conclusion: The clustering-based stereo-vision framework provides stable 3D localisation and robust segmentation without large training datasets or high-performance hardware. Its low computational demand and strong performance under diverse orchard conditions make it suitable for real-time robotic harvesting. Future work will focus on large-scale orchard deployment, parallel optimisation, and adaptation to additional fruit types.
1 Introduction
The apple is one of the most popular fruit crops, ranking second in global fruit production. Harvesting apples remains a crucial yet demanding operation because it requires substantial labor and time (Qu et al., 2015; Jia et al., 2020). Traditional harvesting relies primarily on manual labor, which raises costs, suffers from workforce shortages, and yields inconsistent quality and efficiency. Researchers have therefore extensively investigated automated fruit detection and harvesting technologies that utilize machine vision and clustering-based segmentation to boost efficiency and precision (Tu et al., 2010; Jia et al., 2020).
In recent years, deep learning techniques such as YOLO, SSD, Faster R-CNN, and Mask R-CNN have been widely applied in fruit detection and recognition (Onishi et al., 2019; Biffi et al., 2020; Jia et al., 2020; Zhang et al., 2020; Xiao et al., 2023). These systems fall into two categories: single-stage models (e.g., YOLO, SSD), which directly predict object locations and classes for faster processing, and two-stage models (e.g., Faster R-CNN, Mask R-CNN), which first propose candidate regions to improve classification and bounding accuracy (Likas et al., 2003; Wang et al., 2022; Mhamed et al., 2024; Tianjing and Mhamed, 2024; Shi et al., 2025). Recent studies have demonstrated the potential of UAV-based phenotyping and machine learning approaches for monitoring crop traits and yield in tomato and quinoa, highlighting the growing role of computer vision in precision agriculture (Johansen et al., 2019, 2020; Jiang et al., 2022a). Deep learning enhances fruit detection by extracting key colour, shape, and texture features for segmentation and recognition. However, accuracy in orchards is hindered by variable lighting, foliage cover, and clustered fruit, and reliance on large datasets, high computational demands, and long training times limits practical use in apple harvesting (Wang et al., 2022). Moreover, these models often produce only 2D bounding boxes, lacking the precise depth information needed for robotic harvesting. These constraints limit their suitability for real-time field deployment.
Beyond fruit detection, deep learning has advanced applications in remote sensing, radar imaging, and ecological monitoring (Guan et al., 2025). Recent studies on PolSAR ship detection (Gao et al., 2023a), scattering-aware networks, few-shot SAR classification (Gao et al., 2023b, 2024), and multi-source data fusion (Shen et al., 2024; Zhang et al., 2024) highlight its versatility in complex detection tasks. These cross-domain advances reinforce the relevance of developing efficient and adaptable methods for automated fruit detection and localization.
An alternative to deep learning is clustering-based segmentation. K-Means clustering is an unsupervised learning method that groups pixels by feature similarity, enabling effective fruit segmentation under complex orchard conditions (Likas et al., 2003; Na et al., 2010). K-Means delivers rapid and robust segmentation, standing out from methods such as Fuzzy C-Means and DBSCAN, which require more computation and struggle with noise (Song et al., 2013; Jamel and Akay, 2019; Ikotun et al., 2023). Previous studies have applied K-Means for apple recognition (Wang D. et al., 2015), while others utilized integrated extremum methods for fruit positioning (Jia et al., 2020). Recent studies further refined segmentation with fuzzy C-means (Sarbaini et al., 2022), CNN-based semantic segmentation (Ramadhani et al., 2022; Wang et al., 2022), and monocular vision approaches (Zubair et al., 2024). However, the challenge of achieving robust performance in real orchard conditions with limited data remains (Yang et al., 2012).
This study presents an enhanced K-Means clustering segmentation algorithm combined with multi-feature fusion (colour, morphology, and texture) and stereo vision for accurate 3D localization. The approach is designed to reduce misclassification and provide depth information critical for robotic harvesting. Unlike deep learning methods, the proposed system emphasizes computational efficiency, real-time applicability, and reduced training data requirements, making it well suited to practical orchard deployment. The method is comprehensively evaluated against state-of-the-art models, including Faster R-CNN, YOLOv7, and Mask R-CNN, and demonstrates superior accuracy, reduced coordinate deviation, and stable performance across different camera angles.
2 Materials and methods
The experimental setup consists of a four-arm parallel picking robot equipped with a high-precision vision system and a 3D stereo camera (1920 × 1080 pixels; Model: Hikvision MV-DL2125-04H-R) for apple detection and localization. The 3D camera was mounted at the front end of the robotic arm. Computational processing was performed on a high-performance computer with an Intel i7-12700 processor, ensuring efficient execution of clustering, segmentation, and localization tasks.

Apple images were collected from a commercial orchard with diverse lighting conditions (morning, noon, evening), varying shading levels, and different apple clustering patterns to ensure a representative dataset. A dataset comprising 4,200 sample images of Aksu apples, a variety cultivated in Aksu Prefecture, Xinjiang, China, was collected. The dataset includes 2,200 images of red apples against green foliage and 2,000 images of green apples against green foliage. Each apple within the images was manually annotated using a circle-fitting method to ensure precise localization and segmentation. The dataset was split in an 8:2 ratio, with 80% used for training and 20% for testing. This choice ensured sufficient data for training while maintaining an independent set for performance evaluation. As the proposed method is based on clustering and does not require iterative hyperparameter optimization, no separate validation set was used. A similar adjustment of dataset splitting has been discussed in previous studies with small datasets (Ashtiani et al., 2021). Each image was manually annotated using LabelImg software, and apples were labelled based on their position, size, and occlusion level.

To improve the model’s robustness, data augmentation was applied: random rotation (±15°), brightness variation (±20%), and Gaussian noise were introduced to simulate real-world orchard variability caused by lighting changes, fruit occlusion, and viewing-angle differences. This process reduced the risk of overfitting and enabled better generalization to unseen samples. Similar to findings in postharvest imaging studies (Javanmardi and Ashtiani, 2025), such augmentation strategies enhance dataset diversity and improve the reliability of classification models.
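As a concrete illustration of this augmentation step, the following Python sketch applies the stated transformations (±15° rotation, ±20% brightness variation, additive Gaussian noise) with standard OpenCV/NumPy calls. The noise level and the file path are illustrative assumptions, not values reported by the study.

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One random augmentation pass: rotation, brightness scaling, Gaussian noise."""
    h, w = image.shape[:2]

    # Random rotation in [-15, +15] degrees about the image centre.
    angle = rng.uniform(-15.0, 15.0)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    out = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Random brightness variation of +/-20% (gain in [0.8, 1.2]).
    gain = rng.uniform(0.8, 1.2)
    out = np.clip(out.astype(np.float32) * gain, 0, 255)

    # Additive Gaussian noise; sigma = 5 is an assumption (not stated in the paper).
    out = out + rng.normal(0.0, 5.0, out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
img = cv2.imread("orchard_sample.jpg")  # placeholder path
aug = augment(img, rng)
```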
In the next section, the equations describing standard image preprocessing operations, clustering formulations, stereo-vision disparity and depth estimation, and evaluation metrics are based on established methods documented in the literature (Hartigan and Wong, 1979; Hartley and Zisserman, 2003; Gonzalez and Woods, 2018). The enhanced K-Means clustering and stereo-vision localization method was implemented using standard Python and OpenCV libraries, with all parameters reported in this study. The dataset cannot be made publicly available due to restrictions, but a representative subset and implementation details are available from the corresponding author upon reasonable request.
2.1 Optimization of apple image segmentation using enhanced K-Means
A modified K-Means clustering method was constructed by combining morphological processing, feature optimization, and colour-space analysis. Enhanced colour sensitivity was achieved by converting RGB to HSI and using the H component for strong target–background contrast. Images were filtered using Gaussian and median filtering techniques to reduce noise (Supplementary Equation 2) and then transformed to greyscale to ensure consistency under varying illumination conditions (Supplementary Equation 1).
Then, we extracted the HSI colour space, which is highly sensitive to apple colour, for segmentation using Equation 1. The RGB colour space illustrated variations in colour intensity and brightness, whereas the HSI space replicated human visual perception. As Figure 1 shows, the RGB-to-HSI conversion mapped the RGB unit cube into a bicone. A 3D camera captured apple image features and stored them as RGB grayscale values, ensuring enhanced consistency for segmentation under variable lighting conditions.
$$H = \begin{cases}\theta, & B \le G\\ 360^{\circ}-\theta, & B > G\end{cases},\qquad \theta = \cos^{-1}\!\left\{\frac{\tfrac{1}{2}\left[(R-G)+(R-B)\right]}{\sqrt{(R-G)^{2}+(R-B)(G-B)}}\right\} \tag{1}$$

Where H indicates the hue component value and R, G, and B are the normalised colour channels.
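For reference, a minimal NumPy sketch of the hue extraction follows; it implements the textbook hue transform above and assumes channel values normalised to [0, 1].

```python
import numpy as np

def hue_channel(rgb: np.ndarray) -> np.ndarray:
    """Hue in degrees [0, 360) for an RGB image with values in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-12  # avoid divide-by-zero
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return np.where(b <= g, theta, 360.0 - theta)
```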
The H component proved useful for separating apples from the background. However, the conventional K-Means method showed errors, including mis-segmentation in challenging environments. To improve accuracy and robustness, the algorithm was enhanced through adaptive selection of the initial clustering centers (Equations 2, 3). The updated clustering method minimized intra-cluster variance (Equation 5).
Where Ck denotes the initial center of the k-th class; P(i) denotes the set of points; N(i) denotes the set of domain points; H(i) and H(j) represent the feature vectors or attribute values of pixels i and j.
$$D(x_0, y_0) = \sqrt{\sum_{m=1}^{n} w_m \left[F_m(x_0) - F_m(y_0)\right]^{2}}$$

Where D(x0, y0) is the weighted Euclidean distance between pixel points x0 and y0; wm denotes the feature weight of the m-th dimension; n denotes the total dimension of the feature space; Fm(x0) and Fm(y0) represent the pixel intensities of pixels x0 and y0 in the m-th dimension, respectively.
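The sketch below illustrates how such a weighted distance drives a K-Means pass over per-pixel feature vectors. The farthest-point initialisation is only a simple stand-in for the paper's adaptive centre selection (Equations 2–3), whose exact form is not reproduced here; it is an assumption for illustration.

```python
import numpy as np

def weighted_distance(x0, y0, w):
    """D(x0, y0): weighted Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum(w * (x0 - y0) ** 2))

def kmeans_segment(features, k, w, iters=50, seed=0):
    """features: (N, n) per-pixel feature matrix; w: (n,) weights w_m."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation: a stand-in for the adaptive centres (assumption).
    centers = [features[rng.integers(len(features))]]
    for _ in range(k - 1):
        d = np.min([np.sum(w * (features - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(d)])
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        d = np.stack([np.sum(w * (features - c) ** 2, axis=1) for c in centers])
        labels = np.argmin(d, axis=0)
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)  # minimise intra-cluster variance
    return labels, centers

# labels.reshape(h, w) recovers the segmentation map for an h x w image.
```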
The segmentation results underwent morphological processing, eliminating small noise elements and restoring target edges (Supplementary Equation 3). Boundary extraction utilized erosion to isolate object edges, as shown in Figure 2. Connected region calculation was performed using Supplementary Equation 4 to obtain complete target information.
Figure 2. Morphological boundary extraction through erosion and subtraction. Small artifacts are removed, and clean object edges are restored for clustering.
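A hedged OpenCV sketch of this morphological stage follows (opening/closing for noise removal, erosion-and-subtraction for boundary extraction, connected components for complete target information, as described above); the 5 × 5 elliptical kernel and the file path are assumptions.

```python
import cv2

mask = cv2.imread("apple_mask.png", cv2.IMREAD_GRAYSCALE)  # placeholder binary mask
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# Opening removes small noise elements; closing restores broken target edges.
clean = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
clean = cv2.morphologyEx(clean, cv2.MORPH_CLOSE, kernel)

# Boundary extraction by erosion and subtraction: boundary = A - (A eroded by B).
boundary = cv2.subtract(clean, cv2.erode(clean, kernel))

# Connected-region calculation (cf. Supplementary Equation 4).
n_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(clean)
```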
2.2 Multi-feature model for apple recognition and 3D positioning
Following segmentation and clustering, apple centroids were precisely recognized by integrating colour, morphology, and texture features. Stereo vision technology and 3D camera calibration principles were used to map apples from 2D image coordinates to 3D spatial coordinates, providing accurate positional data for the harvesting robot. Figure 3 displays the calibration principle for the stereo vision system and the 3D camera. The stereo vision system and 3D camera underwent calibration to synchronize the vision coordinate system with the robot coordinate system, which enabled precise target recognition and localization.
Figure 3. Schematic of the robotic apple detection system integrating a 3D camera, a visual identification module, and a graph neural network for precise recognition.
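A sketch of this calibration step, using OpenCV's standard chessboard routine, is shown below; the board geometry (9 × 6 inner corners, 25 mm squares) and image paths are placeholders, not the study's actual calibration target.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners of the calibration board (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 25.0  # 25 mm squares

obj_pts, left_pts, right_pts, size = [], [], [], None
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, pattern)
    okr, cr = cv2.findChessboardCorners(gr, pattern)
    if okl and okr:
        obj_pts.append(objp); left_pts.append(cl); right_pts.append(cr)
        size = gl.shape[::-1]

# Per-camera intrinsics first, then the joint left-right extrinsics (R, T).
_, M1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, M2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
ret, M1, d1, M2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, M1, d1, M2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```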
Single-feature detection showed high vulnerability to environmental conditions, including lighting and noise levels. Therefore, a multi-feature fusion approach was employed to enhance detection robustness and accuracy. Composite feature values determined target areas based on colour, texture, and morphology weights (Equation 4).
$$T(x, y) = \alpha_1 H(x, y) + \alpha_2\,\mathrm{GLCM}(x, y) + \alpha_3\,\mathrm{Shape}(x, y) \tag{4}$$

Where T(x, y) is the composite feature value used to determine whether a pixel belongs to the target area; α1, α2, and α3 are the weight coefficients of the colour, texture, and morphological features, respectively. The values of α1, α2, and α3 were empirically tuned on the training dataset, selecting the combination that achieved the best segmentation and detection performance under varying orchard conditions. H(x, y) denotes the colour feature; GLCM(x, y) denotes the grey-level co-occurrence matrix (GLCM) texture feature; Shape(x, y) represents the morphological feature.
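The following sketch computes a composite map in the spirit of Equation 4. The per-pixel texture term is approximated here by block-wise GLCM contrast, and the weights shown are illustrative defaults rather than the empirically tuned values.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_map(gray: np.ndarray, win: int = 16) -> np.ndarray:
    """Block-wise GLCM contrast, normalised to [0, 1] (coarse approximation)."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            patch = gray[i:i + win, j:j + win]
            glcm = graycomatrix(patch, [1], [0], levels=256, symmetric=True, normed=True)
            out[i:i + win, j:j + win] = graycoprops(glcm, "contrast")[0, 0]
    return out / (out.max() + 1e-12)

def composite(hue01, gray_u8, shape01, a1=0.5, a2=0.3, a3=0.2):
    """T(x, y) as a weighted sum; hue01 and shape01 pre-normalised to [0, 1]."""
    return a1 * hue01 + a2 * texture_map(gray_u8) + a3 * shape01
```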
Figure 4 illustrates the multi-feature fusion approach for apple image analysis, which involves analyzing multiple pose features from apples and extracting essential features after bias removal to enhance centroid recognition and localization. We calculated the center of mass using the weighted average of pixel coordinates within the region, as described in Supplementary Equation 5. Internal and external camera parameters were calibrated using Supplementary Equation 6.
Figure 4. Algorithm pipeline showing preprocessing, multi-feature extraction, feature weighting, fusion, and 3D localization outputs, with results illustrated in (a) the proposed algorithm and (b) the MISA method.
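A minimal sketch of the centre-of-mass step is given below, weighting pixel coordinates by the composite feature value within the segmented region; the exact weighting of Supplementary Equation 5 is an assumption.

```python
import numpy as np

def weighted_centroid(T: np.ndarray, mask: np.ndarray) -> tuple:
    """(u, v) centre of mass of a region, weighted by composite values T."""
    ys, xs = np.nonzero(mask)
    w = T[ys, xs].astype(float)
    return (xs * w).sum() / w.sum(), (ys * w).sum() / w.sum()
```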
The block-matching algorithm extracted parallax values to solve positional discrepancies between left and right camera images (Supplementary Equation 7). Depth information was then calculated using parallax values and triangulation principles (Supplementary Equation 8). Real-world coordinates were derived by mapping the center of mass and depth information to the camera’s coordinate system (Supplementary Equation 9).
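A hedged sketch of this disparity-to-depth step follows, using OpenCV's block matcher and the triangulation relation Z = f·B/d; the focal length, baseline, and principal point shown are placeholders, not the calibrated values of the study's camera.

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified pair (placeholder paths)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

bm = cv2.StereoBM_create(numDisparities=96, blockSize=15)
disp = bm.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

f, B = 1200.0, 60.0                       # focal length (px) and baseline (mm): assumptions
cx, cy = left.shape[1] / 2, left.shape[0] / 2

def locate(u: int, v: int) -> np.ndarray:
    """Map an image point (e.g., an apple centroid) to camera coordinates (mm)."""
    d = disp[v, u]
    if d <= 0:
        raise ValueError("no valid disparity at this pixel")
    Z = f * B / d                         # depth by triangulation
    return np.array([(u - cx) * Z / f, (v - cy) * Z / f, Z])
```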
The problem of environmental occlusion was solved by applying morphological techniques combined with depth interpolation methods (Supplementary Equation 10). Localization accuracy was further enhanced by adjusting camera parameters and refining feature fusion weights based on localization error (Equation 5).
Three-dimensional localization accuracy was tested by taking depth measurements at six corner points of the calibration board at distances ranging from 800 mm to 1100 mm. The difference between real and calculated depth values was assessed, while morphological and depth interpolation techniques minimized errors (Supplementary Equation 10).
$$E = \sqrt{(X_{\mathrm{real}}-X_{\mathrm{calc}})^{2} + (Y_{\mathrm{real}}-Y_{\mathrm{calc}})^{2} + (Z_{\mathrm{real}}-Z_{\mathrm{calc}})^{2}}$$

Where E represents the positioning error, (Xreal, Yreal, Zreal) are the actual coordinates, and (Xcalc, Ycalc, Zcalc) are the calculated coordinates.
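In code, this error check reduces to a few lines; expressing E relative to the measured distance mirrors how the percentage errors are reported in the Results, and that normalisation is an assumption.

```python
import numpy as np

def positioning_error(real, calc):
    """Euclidean positioning error E (mm) and relative error (%)."""
    real, calc = np.asarray(real, float), np.asarray(calc, float)
    E = float(np.linalg.norm(real - calc))
    return E, 100.0 * E / float(np.linalg.norm(real))

# Example: ground truth at 1000 mm depth vs. a slightly off estimate.
E_mm, E_pct = positioning_error([120.0, -35.0, 1000.0], [121.5, -34.2, 1004.0])
```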
2.3 Benchmark comparisons and performance evaluation
Benchmarking the proposed model against several state-of-the-art methods allowed for a comprehensive performance evaluation. The selected benchmarks include widely recognized and validated techniques in fruit detection and segmentation research. Faster Region-Based Convolutional Neural Network (Faster R-CNN), You Only Look Once version 7 (YOLOv7), and Masked Region-Based Convolutional Neural Network (Mask R-CNN) are leading deep learning models known for their high detection accuracy. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Mean-Shift Image Segmentation Algorithm (MISA), and Superpixel Segmentation Algorithm (SSA) are commonly used clustering and segmentation methods designed to handle spatial variation and noise. These methods were chosen to ensure a balanced comparison between deep learning and clustering-based approaches.
The segmentation performance was compared using Mean Coordinate Deviation (MCD) and Correct Recognition Rate (CRR) as evaluation metrics. For object detection and spatial localization, the proposed model was evaluated against YOLOv7, Single Shot MultiBox Detector (SSD), Fully Convolutional Networks (FCN), and Mask R-CNN under four real-world conditions: complex illumination, fruit occlusion, dynamic oscillation, and dense target distribution. Performance was measured using Recognition Accuracy (RA), mean Average Precision (mAP), and Frames Per Second (FPS). Additionally, the model’s stability was assessed across different camera angles (0°, 15°, 30°, and 45°) by comparing it with the Hierarchical Clustering Algorithm (HCA) and Region Growing Segmentation Algorithm (RGSA) using the standard deviation of recognition accuracy.
The proposed model was comprehensively evaluated using RA for detection accuracy, MCD for spatial precision, CRR for segmentation accuracy, F1-score for detection reliability, mAP for overall detection performance, FPS for real-time efficiency, and standard deviation for stability under varying conditions. These metrics collectively demonstrate the model’s accuracy, robustness, and practical efficiency for automated apple detection.
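For transparency, minimal definitions of the simpler metrics are sketched below. Precision, recall, and F1 are standard; MCD is taken here as the mean Euclidean deviation between matched predicted and ground-truth centroids, and CRR as correctly recognised targets over total targets. These MCD/CRR formulations are assumptions, since the paper does not spell them out.

```python
import numpy as np

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def mcd(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean deviation between matched (N, 2) centroid arrays."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def crr(correct: int, total: int) -> float:
    """Correct Recognition Rate: correctly recognised targets / total targets."""
    return correct / total
```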
3 Results
The proposed clustering-based segmentation and 3D localization algorithm demonstrated consistent superiority in detection precision and spatial localization under diverse orchard conditions. Figure 5 illustrates the variation in RA and MCD under different lighting and occlusion levels. The proposed method maintained an average accuracy above 91%, while Faster R-CNN exhibited a pronounced decline when fruit overlap exceeded 40%. In contrast, our algorithm achieved lower MCD values (≤ 0.3%), indicating more stable spatial localization across both daytime and nighttime datasets (Figure 5). Moreover, the consistently reduced MCD values throughout all collection distances indicate better localization accuracy of the proposed algorithm (Figures 6A, B). Figures 6C, D demonstrate that the proposed method consistently maintains a CRR above 90%, outperforming DBSCAN across varying overlap rates. The depth estimation accuracy of the stereo vision system was evaluated by comparing it with YOLOv7 and SSD across four scenarios: complex lighting conditions, fruit occlusion, dynamic oscillation conditions, and dense target distributions. Across all four tested scenarios, the proposed model showed better recall and precision than YOLOv7 and SSD (Figure 7).
Figure 5. Detection accuracy (RA) and mean coordinate deviation (MCD) of the proposed clustering algorithm and Faster R-CNN under different overlap rates, illustrated for (a) MCD during the day, (b) MCD during the night, (c) RA during the day, and (d) RA during the night.
Figure 6. Comparison between the proposed algorithm and DBSCAN across different collection distances (900–1700 mm), shown for (a) MCD under 40 images, (b) MCD under 45 images, (c) CRR under 40 images, and (d) CRR under 45 images.
Figure 7. Precision–Recall comparison of YOLOv7, SSD, and the proposed model under different field conditions, including (a) complex lighting, (b) apple occlusion, (c) dynamic oscillation, and (d) multi-target dense distribution environments.
Depth estimation accuracy was further validated, achieving a maximum localization error of 0.97% across 800–1100 mm collection distances (Figure 8). Errors ranged from 0.4% to 0.65% at 800 mm and from 0.4% to 0.5% at 1000 mm, increasing only slightly to 0.73%–0.79% at 1100 mm. All deviations remained below 1%, confirming high-precision depth estimation suitable for robotic harvesting applications. As shown in Figures 9A, B, the proposed algorithm outperformed MISA in detecting apple orientations on four trees at 0°, 45°, 90°, and 180°. It achieved the highest detection rate (> 40%) at 45°, whereas no apples were detected at 180°; MISA showed greater variation and overlap, indicating reduced stability. Results for multiple algorithms at the 45° orientation are summarized in Table 1. The proposed method achieved the highest recognition accuracy (93%), correctly identifying 39 apples, followed by the CNN model (88%). The template-matching (TM) approach had the lowest accuracy (70%, with 28 apples correctly identified).
Figure 11. Effective focal-length standard deviation of the stereo vision system under different numbers of images per sheet, evaluated for (a) camera angle 0°, (b) camera angle 15°, (c) camera angle 30°, and (d) camera angle 45°, comparing the proposed model with HCA and RGSA.
Figure 10. Comparison of apple detection performance among FCN, Mask R-CNN, and the proposed model under different field conditions, including (a) changes in lighting, (b) dense fruit distribution, (c) wind disturbance, and (d) mixed fruit types, evaluated using F1-score, mAP, and FPS.
In four real-world orchard scenarios, the proposed model was compared with FCN and Mask R-CNN (Figure 10). It consistently outperformed both, achieving an F1-score of 92% under varied illumination (Figure 10A) and an mAP of 91% for densely clustered fruits (Figure 10B). Under wind disturbance (Figure 10C), it maintained the highest frames per second (FPS), demonstrating strong real-time efficiency. Across multi-fruit orchard conditions (Figure 10D), the model again achieved the highest mAP, confirming its robustness and adaptability. Figure 11 shows that the proposed model maintained the lowest standard deviation across all camera angles (0°–45°), stabilizing after about 25 images. Even at 45°, where deviation slightly increased for all models, it remained the most stable, confirming reliable performance under varying camera orientations.
Figure 8. Measuring distance and relative error of the proposed stereo-vision depth estimation system across different collection distances, evaluated at (a) 800 mm, (b) 900 mm, (c) 1000 mm, and (d) 1100 mm, based on measurements from six corner points in the calibration board.
The proposed clustering-based stereo-vision approach achieved > 91% detection accuracy, < 1% localization error, and stable performance under varying lighting and camera angles, all with a modest dataset. These results demonstrate its suitability for real-time, low-cost robotic harvesting, offering reliable detection and positioning without extensive training or high computational demand—an effective solution for autonomous orchard operations in precision agriculture.
4 Discussion
Accurate segmentation is crucial for precise apple detection in challenging orchard environments (Kang and Chen, 2020). The improved MCD and RA values indicate that multi-feature fusion with adaptive K-means clustering increases robustness to lighting changes and occlusion. Deep-learning models such as Faster R-CNN often lose accuracy under these conditions (Bargoti and Underwood, 2017; Fu et al., 2020). In contrast, the proposed unsupervised approach remains stable with fewer samples. Compared with DBSCAN, it achieved higher stability and accuracy across distances and image counts (Hartigan and Wong, 1979; Limwattanapibool and Arch-Int, 2017). These results confirm strong generalization and real-time potential for orchard use.
The success of robotic apple picking depends heavily on precise 3D localization. Our results are consistent with earlier research in which YOLO-based algorithms struggled to adapt in real time to challenging agricultural settings (Bresilla et al., 2019; Parvathi and Selvi, 2021; Jiang et al., 2022b). Consistent with previous studies, YOLOv7 demonstrated better accuracy and recognition speed than SSD (Wang and Chen, 2024). By comparison, a previous study showed that YOLOv7 achieved exceptional detection of Camellia oleifera fruit in orchards, with 95.74% mAP, a 93.67% F1-score, 94.21% precision, 93.13% recall, and a detection time of 0.025 seconds (Wu et al., 2022). Recent research on brinjal detection using deep learning models has demonstrated the effectiveness of lightweight YOLO architectures and edge-based computing frameworks for real-time harvesting applications (Nahiduzzaman et al., 2025; Tamilarasi et al., 2025). These approaches, while achieving high precision and recall, still depend on large, annotated datasets and relatively intensive computational resources. In contrast, our clustering-based multi-feature method achieves stable performance with fewer training samples and reduced hardware requirements, underscoring its suitability for orchard conditions. Our results also agree with previous reports that SSD performs well in controlled environments but struggles more than YOLOv7 in complex scenarios; for example, Xu et al. (2024) reported lower SSD performance in typical agricultural environments where occlusion and cluttered backgrounds are common. Similarly, Deng et al. (2024) found that YOLOv7, when enhanced with attention mechanisms, consistently outperformed SSD in citrus detection under different orchard conditions. Apple posture detection is critical for establishing optimal picking strategies (Liu et al., 2024). The observed stable detection suggests that our method effectively addresses occlusion and angle-related distortions, a common challenge in fruit detection (Safari et al., 2024).
The proposed method showed stable performance relative to MISA and achieved higher accuracy than CNN, TM, and other traditional classifiers, reflecting improved feature extraction and classification capability. Similar challenges in illumination and feature consistency were noted by Sun et al. (2021). Consistent results under varying field conditions confirm that the model can maintain real-time reliability in orchard operations. Previous studies using FCN reported fruit-counting accuracies of 0.91–0.95 and yield accuracies up to 0.98 (Häni et al., 2020), while Faster R-CNN achieved an F1-score of 0.89 and 91% mAP. In contrast, our model achieved a higher mAP, F1-score, and frame rate, demonstrating superior detection in dense, multi-fruit environments. Real-world comparison with FCN and Mask R-CNN confirmed the proposed model’s superior accuracy and processing efficiency for dense, multi-fruit environments (He et al., 2017; Wan and Goudos, 2020). Mask R-CNN performed less well in our study than in previous reports, where its precision reached 97.31% and its recall 95.70% (Jia et al., 2020). These outcomes highlight the proposed model’s stability and real-time applicability under orchard conditions. Unlike deep-learning models that rely on large annotated datasets, the algorithm maintained strong performance with limited training images, reflecting better adaptability and lower data dependence (Koirala et al., 2019). Bargoti and Underwood (2017) found that 729 training images were necessary to stabilize AP for apple detection, while almond and mango models needed more data; their study also demonstrated that data augmentation enabled better apple detection using only 100 images than 300 images without augmentation. Similarly, 93% of apples were accurately detected in 50 images despite uneven lighting conditions in a previous study (Xu and Lv, 2018). Compared to deep-learning models such as Faster R-CNN and YOLOv7, the proposed method requires less computational power and no extensive training, making it suitable for real-time applications on standard hardware. While sequential processing may limit scalability in large-scale deployments, this can be optimized with parallel computing. The pipeline’s reliance on generalizable features such as colour, texture, and morphology also makes it adaptable to other fruits or crops with minor adjustments. However, large-scale field validation and integration with robotic harvesting systems are still required to confirm performance under real operating conditions, which will be addressed in future development.
In conclusion, this study presents a clustering-based stereo vision algorithm that combines K-means segmentation and multi-feature fusion for accurate apple detection and 3D localization in orchard environments. The method offers high accuracy, strong generalization, and real-time feasibility with minimal training data and computational demand—key advantages over deep-learning approaches. While sequential processing and limited field scale remain constraints, these can be addressed through parallel computing and large-scale robotic trials. Future work should focus on optimizing real-time performance and extending the framework to other fruit crops and intelligent harvesting systems.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.
Author contributions
JW: Funding acquisition, Visualization, Software, Conceptualization, Resources, Writing – original draft, Writing – review & editing, Project administration, Validation, Supervision. WS: Formal Analysis, Data curation, Visualization, Investigation, Writing – review & editing, Software.
Funding
The author(s) declare financial support was received for the research and/or publication of this article. This research was supported by the projects “Machine Learning-Based Vision System for Automatic Apple Harvesting” (No. TDZKSS202137) and “Medical Fabric Intelligent Management System Based on the Internet” (No. TDZKSS202135).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1598414/full#supplementary-material
References
Ashtiani, S.-H. M., Javanmardi, S., Jahanbanifard, M., Martynenko, A., and Verbeek, F. J. (2021). Detection of mulberry ripeness stages using deep learning models. IEEE Access 9, 100380–100394. doi: 10.1109/ACCESS.2021.3096550
Bargoti, S. and Underwood, J. (2017). “Deep fruit detection in orchards,” in IEEE International Conference on Robotics and Automation (ICRA) (New York, USA: IEEE), 3626–3633.
Biffi, L. J., Mitishita, E., Liesenberg, V., Santos, A., Goncalves, D. N., Estrabis, N. V., et al. (2020). ATSS deep learning-based approach to detect apple fruits. Remote Sens. 13, 54. doi: 10.3390/rs13010054
Bresilla, K., Perulli, G. D., Boini, A., Morandi, B., Corelli Grappadelli, L., and Manfrini, L. (2019). Single-shot convolution neural networks for real-time fruit detection within the tree. Front. Plant Sci. 10. doi: 10.3389/fpls.2019.00611
Deng, F., Chen, J., Fu, L., Zhong, J., Qiaoi, W., Luo, J., et al. (2024). Real-time citrus variety detection in orchards based on complex scenarios of improved YOLOv7. Front. Plant Sci. 15. doi: 10.3389/fpls.2024.1381694
Fu, L., Majeed, Y., Zhang, X., Karkee, M., and Zhang, Q. (2020). Faster R–CNN–based apple detection in dense-foliage fruiting-wall trees using RGB and depth features for robotic harvesting. Biosyst. Eng. 197, 245–256. doi: 10.1016/j.biosystemseng.2020.07.007
Gao, G., Bai, Q., Zhang, C., Zhang, L., and Yao, L. (2023a). Dualistic cascade convolutional neural network dedicated to fully PolSAR image ship detection. ISPRS J. Photogrammetry Remote Sens. 202, 663–681. doi: 10.1016/j.isprsjprs.2023.07.006
Gao, G., Wang, M., Zhou, P., Yao, L., Zhang, X., Li, H., et al. (2024). A multi-branch embedding network with bi-classifier for few-shot ship classification of SAR images. IEEE Trans. Geosci. Remote Sens. 63, 5201515. doi: 10.1109/TGRS.2024.3500034
Gao, G., Zhang, C., Zhang, L., and Duan, D. (2023b). Scattering characteristic-aware fully polarized SAR ship detection network based on a four-component decomposition model. IEEE Trans. Geosci. Remote Sens. 61, 1–22. doi: 10.1109/TGRS.2023.3336300
Guan, Y., Zhang, X., Gao, G., Cao, C., Li, Z., Fu, S., et al. (2025). A new indicator for assessing fishing ecological pressure using multi-source data: A case study of the South China Sea. Ecol. Indic. 170, 113096. doi: 10.1016/j.ecolind.2025.113096
Häni, N., Roy, P., and Isler, V. (2020). A comparative study of fruit detection and counting methods for yield mapping in apple orchards. J. Field Robotics 37, 263–282. doi: 10.1002/rob.21902
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. J. R. Stat. Society Ser. C (Applied Statistics) 28, 100–108. doi: 10.2307/2346830
Hartley, R. and Zisserman, A. (2003). Multiple view geometry in computer vision. (Cambridge, UK: Cambridge University Press).
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision (New York, USA: IEEE), 2961–2969.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., and Heming, J. (2023). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 622, 178–210. doi: 10.1016/j.ins.2022.11.139
Jamel, A. and Akay, B. (2019). A survey and systematic categorization of parallel K-Means and Fuzzy-C-Means algorithms. Comput. Syst. Sci. Eng. 34, 259–281. doi: 10.32604/csse.2019.34.259
Javanmardi, S. and Ashtiani, S.-H. M. (2025). AI-driven deep learning framework for shelf life prediction of edible mushrooms. Postharvest Biol. Technol. 222, 113396. doi: 10.1016/j.postharvbio.2025.113396
Jia, W., Tian, Y., Luo, R., Zhang, Z., Lian, J., and Zheng, Y. (2020). Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 172, 105380. doi: 10.1016/j.compag.2020.105380
Jiang, P., Ergu, D., Liu, F., Cai, Y., and Ma, B. (2022b). A Review of Yolo algorithm developments. Proc. Comput. Sci. 199, 1066–1073. doi: 10.1016/j.procs.2022.01.135
Jiang, J., Johansen, K., Stanschewski, C. S., Wellman, G., Mousa, M. A., Fiene, G. M., et al. (2022a). Phenotyping a diversity panel of quinoa using UAV-retrieved leaf area index, SPAD-based chlorophyll and a random forest approach. Precis. Agric. 23, 961–983. doi: 10.1007/s11119-021-09870-3
Johansen, K., Morton, M. J., Malbeteau, Y., Aragon, B., Al-Mashharawi, S., Ziliani, M. G., et al. (2020). Predicting biomass and yield in a tomato phenotyping experiment using UAV imagery and random forest. Front. Artif. Intell. 3. doi: 10.3389/frai.2020.00028
Johansen, K., Morton, M., Malbeteau, Y., Aragon Solorio, B. J. L., Almashharawi, S., Ziliani, M., et al. (2019). Predicting biomass and yield at harvest of salt-stressed tomato plants using UAV imagery. Int. Arch. Photogrammetry Remote Sens. Spatial Inf. Sci. - ISPRS Arch. XLII-2/W13, 407–411. doi: 10.5194/isprs-archives-XLII-2-W13-407-2019
Kang, H. and Chen, C. (2020). Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 171, 105302. doi: 10.1016/j.compag.2020.105302
Koirala, A., Walsh, K. B., Wang, Z., and Mccarthy, C. (2019). Deep learning–method overview and review of use for fruit detection and yield estimation. Comput. Electron. Agric. 162, 219–234. doi: 10.1016/j.compag.2019.04.017
Likas, A., Vlassis, N., and Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition 36, 451–461. doi: 10.1016/S0031-3203(02)00060-2
Limwattanapibool, O. and Arch-Int, S. (2017). Determination of the appropriate parameters for K-means clustering using selection of region clusters based on density DBSCAN (SRCD-DBSCAN). Expert Syst. 34, e12204. doi: 10.1111/exsy.12204
Liu, S., Xue, J., Zhang, T., Lv, P., Qin, H., and Zhao, T. (2024). Research progress and prospect of key technologies of fruit target recognition for robotic fruit picking. Front. Plant Sci. 15. doi: 10.3389/fpls.2024.1423338
Mhamed, M., Zhang, Z., Yu, J., Li, Y., and Zhang, M. (2024). Advances in apple’s automated orchard equipment: A comprehensive research. Comput. Electron. Agric. 221, 108926. doi: 10.1016/j.compag.2024.108926
Na, S., Xumin, L., and Yong, G. (2010). “Research on k-means clustering algorithm: An improved k-means clustering algorithm,” in Third International Symposium on Intelligent Information Technology and Security Informatics (IITSI) (New York, USA: IEEE), 63–67.
Nahiduzzaman, M., Sarmun, R., Khandakar, A., Faisal, M., Islam, M. S., Alam, M. K., et al. (2025). Deep learning-based real-time detection and classification of tomato ripeness stages using YOLOv8 on raspberry Pi. Eng. Res. Express 7, 015219. doi: 10.1088/2631-8695/ada720
Onishi, Y., Yoshida, T., Kurita, H., Fukao, T., Arihara, H., and Iwai, A. (2019). An automated fruit harvesting robot by using deep learning. ROBOMECH J. 6, 1–8. doi: 10.1186/s40648-019-0141-2
Parvathi, S. and Selvi, S. T. (2021). Detection of maturity stages of coconuts in complex background using Faster R-CNN model. Biosyst. Eng. 202, 119–132. doi: 10.1016/j.biosystemseng.2020.12.002
Qu, W., Shang, W., Shao, Y., Wang, D., Yu, X., and Song, H. (2015). Segmentation of foreground apple targets by fusing visual attention mechanism and growth rules of seed points. Spanish J. Agric. Res. 13, e0214. doi: 10.5424/sjar/2015133-7047
Ramadhani, S., Azzahra, D., and Tomi, Z. (2022). Comparison of K-Means and K-Medoids algorithms in text mining based on Davies Bouldin Index testing for classification of student’s thesis. Jurnal Teknologi Informasi dan Komunikasi 13, 24–33. doi: 10.31849/digitalzone.v13i1.9292
Safari, Y., Nakatumba-Nabende, J., Nakasi, R., and Nakibuule, R. (2024). A Review on automated detection and assessment of fruit damage using machine learning. IEEE Access 12, 1–12. doi: 10.1109/ACCESS.2024.3362230
Sarbaini, S., Saputri, W., and Muttakin, F. (2022). Cluster analysis menggunakan algoritma fuzzy K-means Untuk Tingkat Pengangguran Di Provinsi Riau. Jurnal Teknologi Dan Manajemen Industri Terapan 1, 78–84. doi: 10.55826/tmit.v1iII.30
Shen, B., Liu, T., Gao, G., Chen, H., and Yang, J. (2024). A low-cost polarimetric radar system based on mechanical rotation and its signal processing. IEEE Trans. Aerospace Electronic Syst. 61, 4744–4765. doi: 10.1109/TAES.2024.3507776
Shi, X., Wang, S., Zhang, B., Ding, X., Qi, P., Qu, H., et al. (2025). Advances in object detection and localization techniques for fruit harvesting robots. Agronomy 15, 145. doi: 10.3390/agronomy15010145
Song, H., Zhang, C., Pan, J., Yin, X., and Zhuang, Y. (2013). Segmentation and reconstruction of overlapped apple images based on convex hull. Trans. Chin. Soc. Agric. Eng. 29, 163–168. doi: 10.3969/j.issn.1002-6819.2012.22.025
Sun, S., Li, C., Chee, P. W., Paterson, A. H., Meng, C., Zhang, J., et al. (2021). High resolution 3D terrestrial LiDAR for cotton plant main stalk and node detection. Comput. Electron. Agric. 187, 106276. doi: 10.1016/j.compag.2021.106276
Tamilarasi, T., Muthulakshmi, P., and Ashtiani, S.-H. M. (2025). Smart edge computing framework for real-time brinjal harvest decision optimization. AgriEngineering 7, 196. doi: 10.3390/agriengineering7060196
Tianjing, Y. and Mhamed, M. (2024). Developments in automated harvesting equipment for the apple in the orchard. Smart Agric. Technol. 9, 100491. doi: 10.1016/j.atech.2024.100491
Tu, J., Liu, C., Li, Y., Zhou, J., and Yuan, J. (2010). Apple recognition method based on illumination invariant graph. Trans. Chin. Soc. Agric. Eng. 26, 26–31. doi: 10.3969/j.issn.1002-6819.2014.24.020
Wan, S. and Goudos, S. (2020). Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput. Networks 168, 107036. doi: 10.1016/j.comnet.2019.107036
Wang, H. and Chen, X. (2024). “Object detection of classroom students based on improved YOLOv7,” in Third International Symposium on Computer Applications and Information Systems (ISCAIS 2024) (Bellingham, WA, USA: SPIE), 484–489.
Wang, C., Liu, S., Wang, Y., Xiong, J., Zhang, Z., Zhao, B., et al. (2022). Application of convolutional neural network-based detection methods in fresh fruit production: a comprehensive review. Front. Plant Sci. 13. doi: 10.3389/fpls.2022.868745
Wang, D., Xu, Y., Song, H., He, D., and Zhang, H. (2015). Fusion of K-means and Ncut algorithm to realize segmentation and reconstruction of two overlapped apples without blocking by branches and leaves. Trans. Chin. Soc. Agric. Eng. 31, 227–234. doi: 10.11975/j.issn.1002-6819.2015.10.030
Wu, D., Jiang, S., Zhao, E., Liu, Y., Zhu, H., Wang, W., et al. (2022). Detection of Camellia oleifera fruit in complex scenes by using YOLOv7 and data augmentation. Appl. Sci. 12, 11318. doi: 10.3390/app122211318
Xiao, F., Wang, H., Xu, Y., and Zhang, R. (2023). Fruit detection and recognition based on deep learning for automatic harvesting: An overview and review. Agronomy 13, 1625. doi: 10.3390/agronomy13061625
Xu, L. and Lv, J. (2018). Recognition method for apple fruit based on SUSAN and PCNN. Multimedia Tools Appl. 77, 7205–7219. doi: 10.1007/s11042-017-4629-6
Xu, D., Ren, R., Zhao, H., and Zhang, S. (2024). Intelligent detection of muskmelon ripeness in greenhouse environment based on YOLO-RFEW. Agronomy 14, 1091. doi: 10.3390/agronomy14061091
Yang, H., Lauren, C., Nebojsa, D., Erik, W., and Predrag, B. (2012). “Performance analysis of EM-MPM and K-means clustering in 3D ultrasound breast image segmentation,” in IEEE International Conference on Electro/Information Technology, Indianapolis, IN, USA (New York, USA: IEEE).
Zhang, X., Gao, G., and Chen, S.-W. (2024). Polarimetric autocorrelation matrix: A new tool for joint characterizing of target polarization and Doppler scattering mechanism. IEEE Trans. Geosci. Remote Sens. 62, 65–75. doi: 10.1109/TGRS.2024.3398632
Zhang, Z., Igathinathane, C., Li, J., Cen, H., Lu, Y., and Flores, P. (2020). Technology progress in mechanical harvest of fresh market apples. Comput. Electron. Agric. 175, 105606. doi: 10.1016/j.compag.2020.105606
Keywords: apple detection, stereo vision system, orchard robotics/robotic harvesting, clustering-based segmentation, 3D localization, precision agriculture
Citation: Wang J and Sun W (2025) Cluster segmentation and stereo vision-based apple localization algorithm for robotic harvesting. Front. Plant Sci. 16:1598414. doi: 10.3389/fpls.2025.1598414
Received: 24 March 2025; Revised: 20 October 2025; Accepted: 04 November 2025;
Published: 27 November 2025.
Edited by:
Ning Yang, Jiangsu University, China
Reviewed by:
Seyed-Hassan Miraei Ashtiani, Dalhousie University, Canada
Zhenguo Zhang, Xinjiang Agricultural University, China
Xi Zhang, Ministry of Natural Resources, China
Copyright © 2025 Wang and Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Wenbing Sun, qoug265@163.com
Jianxia Wang