DATA REPORT article
Front. Plant Sci.
Sec. Sustainable and Intelligent Phytoprotection
An Annotated Image Dataset for Small Apple Fruitlet Detection in Complex Orchard Environments
Provisionally accepted
1 Xi'an University of Science and Technology, Xi'an, China
2 Xi'an Jiaotong University, Xi'an, China
priority. The cornerstone of such automated systems lies in the accurate and robust detection of small apples during the pre-thinning stage.

Deep learning-based approaches such as YOLO (Redmon et al., 2016) and DEIM (Huang et al., 2025a) have demonstrated remarkable success in various object detection applications. Their implementation has been extended to agricultural domains, including fruit detection (Xie et al., 2025; Jia et al., 2025), pest and weed identification (Suzauddol et al., 2025; Santhanambika and Maheswari, 2025; Betitame et al., 2025; Goyal et al., 2025), and automated harvesting (Jin et al., 2025). Nevertheless, the specific challenge of detecting small apples during the pre-thinning stage remains relatively under-explored. Since deep learning-based detection models depend on large-scale datasets for training, the absence of a dedicated, publicly available dataset for this particular task has significantly impeded research advancement. To address this gap, this study introduces a comprehensive dataset specifically designed for detecting small apples prior to thinning. The main contributions of this work are as follows:

(1) Present a publicly available dataset for pre-thinning small apple detection. The dataset captures a wide range of real-world challenges, including scale variation, occlusion, and diverse lighting conditions.

(2) Establish a rigorous data collection and annotation protocol, which incorporates a multi-stage quality control process to ensure high-quality annotations.

(3) Provide extensive baseline evaluations by testing a suite of object detection models using standard COCO metrics, offering a critical reference for future research.

The paper is organized as follows: Section 1, Introduction, outlines the significance and contributions of this study. Section 2, Related work, reviews existing research and its limitations while highlighting the focus of our work. Section 3 elaborates on the value and key characteristics of the proposed dataset.
Section 4 details the materials and methods, including the data acquisition setup, annotation process, benchmarking strategy, and experimental results. Section 5 acknowledges limitations and outlines directions for future research. Finally, Section 6 concludes the paper.

Deep learning has revolutionized visual perception in agriculture. Early applications primarily involved the direct adoption of generic object detection frameworks like Faster R-CNN (Ren et al., 2016) and SSD (Liu et al., 2016) for agricultural targets. However, these models often exhibited limited robustness when confronted with the inherent challenges of agricultural environments, including complex backgrounds, significant scale variation, and varying lighting conditions. To address these issues, subsequent research has focused on domain-specific architectural improvements.

A prominent direction is the enhancement of feature pyramid networks (FPN) to better handle the multi-scale nature of agricultural objects, from small flowers to large fruits (Jia et al., 2021; Wang et al., 2025). Furthermore, the integration of attention mechanisms, such as convolutional block attention modules (CBAM) (Woo et al., 2018) and squeeze-and-excitation (SE) blocks (Hu et al., 2018), has been widely explored to improve feature representation in the presence of occlusion and clutter. More recently, Transformer-based architectures (Lou et al., 2025; Chen et al., 2025) have been introduced for their superior global context modeling capabilities, showing promising results in fruit detection (Guo et al., 2024) and counting tasks (Yang et al., 2023). Despite these algorithmic advances, the performance of deep learning models remains heavily dependent on the availability of large-scale, high-quality, and task-specific datasets.
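As a concrete illustration of the attention mechanisms referenced above, the following is a minimal PyTorch sketch of a squeeze-and-excitation (SE) block in the spirit of Hu et al. (2018). The channel count and reduction ratio are illustrative and are not tied to any of the models benchmarked in this paper.

```python
# Minimal squeeze-and-excitation (SE) block sketch (after Hu et al., 2018).
# Sizes below are illustrative, not taken from any benchmarked model.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # excitation: reweight channels by learned importance
```

In detectors for cluttered orchard scenes, such a block is typically dropped in after a convolutional stage so that channels responding to fruit-like features can be amplified relative to foliage background.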
The availability of public datasets has been a key driver of progress in agricultural computer vision. Several benchmark datasets have been established for mature fruit detection, which is critical for harvesting robotics. Notable examples include the MinneApple dataset (Häni et al., 2020) for apple detection and the Deep Fruits dataset (Bargoti and Underwood, 2017). These datasets have facilitated the development and benchmarking of numerous detection algorithms. However, they are predominantly composed of images of mature or near-mature fruits, which are larger, exhibit more distinct color contrast against the foliage, and are often less densely clustered compared to fruits in the early growth stages. Consequently, models trained on these datasets are not directly applicable to the task of detecting small, green, and occluded fruitlets during the pre-thinning stage. While some studies have begun to address earlier phenological stages, they often lack the scale, diversity of challenges, or public accessibility required for robust model development. This creates a significant data gap, specifically for the critical agricultural practice of fruit thinning.

The gap in pre-thinning small apple detection

The task of small apple detection prior to thinning presents unique challenges that are not adequately addressed by existing datasets. As summarized in Table 1, the target characteristics differ substantially from those of mature fruits. Pre-thinning small apples are typically defined by their small size, minimal color differentiation from the background foliage, and occurrence in dense clusters with mutual occlusion. These factors result in a domain shift that limits the applicability of models trained on mature fruit data. Although techniques like data augmentation and transfer learning can provide some improvements, they are insufficient to overcome the fundamental data distribution mismatch.
Therefore, there is a pressing need for a dedicated, large-scale dataset that accurately captures the visual characteristics and challenges associated with the pre-thinning period. Such a dataset is essential for driving algorithmic innovation, enabling fair benchmarking, and ultimately facilitating the development of reliable vision systems for automated thinning. This work bridges this gap by introducing the small apple dataset, specifically designed to address the challenges of small apple detection during the thinning season. Our dataset not only provides the necessary data foundation but also establishes rigorous criteria to propel future research in this critical area.

(1) Diverse orchard imagery: The dataset was manually captured in an experimental orchard. Images were acquired under varying natural lighting and weather conditions prior to fruit thinning, ensuring robust applicability for real-world agricultural scenarios. The diversity of the dataset improves the generalization capability of deep learning models in real orchard environments.

(2) Ready-to-use annotations: All images were annotated using LabelImg and are provided in a format compatible with mainstream machine learning frameworks (e.g., PyTorch). This minimizes preprocessing effort and accelerates deployment for AI-driven agricultural research.

(3) Foundation for smart orchard research: This dataset serves as a resource for advancing computer vision applications in precision agriculture. It supports the development of machine learning models for apple fruitlet detection and early yield prediction, enabling data-driven orchard management.

(4) Challenging small-target detection in color-near scenes: This dataset presents a particularly valuable resource for investigating multi-scale object detection challenges, with special emphasis on small-target recognition in complex color-near environments.
The dataset captures the unique challenge where apple fruitlets exhibit similar coloration to the background foliage, making it ideal for: (i) developing robust detection algorithms for small, low-contrast targets in close-range agricultural scenes, and (ii) advancing research on occlusion handling in dense foliage environments. Beyond its primary application in apple fruitlet detection, the dataset serves as: (i) a resource for early-stage detection of various fruit species with similar color-blending characteristics, and (ii) a critical resource for developing automated fruit thinning systems that must operate in visually complex orchard conditions.

The primary objective of this study is to construct a dataset for small apple detection prior to fruit thinning. The dataset development pipeline, illustrated in Figure 1, comprises three main stages: data collection, data annotation, and dataset validation. These stages are detailed in the subsections below.

The image dataset was collected from an experimental orchard at the College of Horticulture, Northwest A&F University (Yangling, Shaanxi, China). The dataset was captured using two mobile devices: an iPhone 7 Plus, equipped with a 12-megapixel CMOS sensor (f/1.8 aperture, hybrid autofocus, optical image stabilization), and an iPhone 6, equipped with an 8-megapixel CMOS sensor (f/2.2 aperture, hybrid autofocus). The comprehensive camera specifications are provided in Table 2.

The data collection was designed to encapsulate a wide range of real-world orchard conditions, thereby enhancing the dataset's robustness. Image acquisition was conducted during three distinct sessions in May 2018 under varying weather conditions: May 1 (sunny), May 2 (sunny), and May 4 (cloudy), with daily collection periods spanning 09:00-11:30 and 14:30-18:30 to capture diverse lighting conditions (backlight and direct sunlight).
Complete data collection parameters are detailed in Table 3. A systematic protocol was followed to ensure comprehensive coverage and minimize bias:

(1) Plot selection: Data were collected from multiple rows of Fuji apple trees.

(2) Viewpoint and distance: The camera was positioned to simulate the viewpoint of an automated thinning robot, typically at a distance of 0.5 to 3 meters from the target canopy. The target trees had an approximate height range of 2.0-3.0 meters. A variety of angles, including horizontal, elevated, and top-down views, were employed to capture the fruit clusters from different perspectives.

(3) Scale variation: To account for the rapid growth of fruitlets, close-up shots of individual or small clusters of apples were taken alongside wider shots capturing the context within the canopy.

(4) Occlusion and complexity: Specific attention was paid to capturing images with varying degrees of occlusion, from fully visible apples to those heavily obscured by leaves, branches, or other fruits.

The image dataset was carefully designed to capture diverse field conditions, with samples acquired under varying natural daylight scenarios, including both backlight and direct sunlight illumination. All images were stored in standard JPEG format with full preservation of the original 12-megapixel resolution. During acquisition, the apple fruitlets were in their early developmental stage, as evidenced by horizontal diameters measuring less than 25 mm. Representative examples of the captured images, demonstrating the range of lighting conditions and fruitlet characteristics, are presented in Figure 2.

In total, over 3,000 raw images were captured. Following an initial quality check to remove blurry or severely over- or under-exposed images, a final set of 2,517 high-quality images was selected for annotation.
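The blur screening mentioned above can be approximated automatically with the common variance-of-Laplacian sharpness measure. The sketch below is a NumPy-only illustration; the threshold value is purely illustrative, and the paper does not state whether its screening was manual or automated.

```python
# Variance-of-Laplacian sharpness check, a common proxy for blur screening.
# Pure NumPy; the threshold below is illustrative, not from the paper.
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)


def laplacian_variance(gray: np.ndarray) -> float:
    """gray: 2-D grayscale array. Higher variance ~ sharper image."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    # Valid-mode 3x3 convolution with the Laplacian kernel.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return float(out.var())


def is_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    # A featureless (blurred) image has almost no Laplacian response.
    return laplacian_variance(gray) < threshold
```

In a screening pipeline, images flagged by such a check would still be reviewed by hand, since low texture (e.g., a frame dominated by sky) can also yield a low Laplacian variance.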
All images in the dataset were uniformly rescaled to 500×500 pixels to ensure processing efficiency. Manual annotation was performed using LabelImg software, with bounding box coordinates stored in XML format. To enhance usability across different platforms, we additionally provide annotations in TXT format (Figure 3).

In this study, a single annotator was responsible for the initial annotation to ensure consistency in style. To ensure dataset reliability, we implemented a rigorous quality control protocol following annotation completion. This involved a comprehensive review of all labeled images and corresponding annotations to identify and rectify any missing or erroneous labels. Our quality control protocol consisted of three key phases:

Phase 1: Iterative self-validation by the annotator. The annotation process was conducted in multiple batches. After completing each batch, the annotator took a mandatory 24-hour break before re-reviewing 100% of the images in that batch. This "cooling-off" period was critical for allowing the annotator to approach the data with a fresh perspective, making it easier to spot initial oversights or inconsistencies. During this self-review, the annotator carefully verified that every visible small apple was captured and that the bounding boxes were tightly fitted.

Phase 2: Automated consistency checking. Following the self-validation, we ran a custom Python script to analyze the generated XML annotation files. This script checked for common errors that are difficult to catch manually across a large dataset. Checked items included:

(1) Extremely small boxes: Flagging bounding boxes with a width or height of less than 5 pixels for manual re-inspection, as these could be annotation noise.

(2) Invalid coordinates: Ensuring all bounding box coordinates were within the image boundaries.

(3) Class label verification: Confirming that only the correct class label was present.

Phase 3: Final expert adjudication and check.
This was the most critical QC step. A senior agricultural expert independently reviewed 100% of the annotated images. The expert had the authority to correct any errors directly in the annotation files. This adjudicated version constitutes the final, released dataset. The error correction rate during this phase was found to be below 2%, indicating the high initial quality achieved by the previous phases.

A comprehensive statistical analysis was conducted to quantitatively characterize the composition and key challenges present in the proposed dataset. This analysis aims to provide a transparent overview of the data, facilitating a deeper understanding of its properties and the difficulties it presents for detection models. The statistical characterization of the dataset is shown in Table 4, and the details are as follows.

(1) Target statistics: The dataset comprises a total of 2,517 RGB images, in which a cumulative 22,415 small apple fruitlets under different conditions were meticulously annotated.

(2) Environmental condition distribution: The dataset was constructed to encompass a diverse range of real-world conditions. The distribution of images across different weather and lighting scenarios is summarized in Table 4. This deliberate variation ensures that models trained on the dataset are exposed to a wide spectrum of visual appearances, thereby enhancing their potential robustness.

(3) Temporal distribution: The distribution of images across the time of day (morning and afternoon) is shown in Table 4 to further characterize the data composition.

(4) Object scale distribution: The scale of objects is a critical factor, especially for small object detection. The distribution of bounding box areas (in pixels) for all annotated targets is illustrated in Table 4. Notably, over 60% of the bounding boxes have an area smaller than 32² pixels, formally categorizing them as small objects according to the COCO benchmark criteria.
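The COCO size convention used above (small < 32², medium < 96², large otherwise, in pixel area) can be applied directly to annotated boxes to reproduce such a scale distribution. A minimal sketch with made-up box dimensions:

```python
# Classify bounding boxes by the COCO area thresholds:
# small < 32^2 px, medium < 96^2 px, large otherwise.
# The example box sizes in the test are illustrative, not dataset values.
def coco_size_bucket(w: int, h: int) -> str:
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def size_distribution(boxes):
    """boxes: iterable of (width, height) in pixels -> fraction per bucket."""
    counts = {"small": 0, "medium": 0, "large": 0}
    for w, h in boxes:
        counts[coco_size_bucket(w, h)] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {k: v / total for k, v in counts.items()}
```

Running `size_distribution` over all 22,415 annotated boxes is how a claim like "over 60% of boxes are COCO-small" can be verified from the released annotation files.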
This distribution confirms the dataset's relevance to the core challenge of small object detection. In summary, the statistical characterization confirms that the dataset not only provides a substantial number of targets but also encapsulates the primary challenges of small apple detection: small object size and environmental variation. These quantified attributes are significant for evaluating the robustness of object detection algorithms in real-world orchard settings.

The dataset supports comprehensive preprocessing and partitioning for machine learning applications. Researchers may perform data augmentation through various image transformations, including noise injection, brightness adjustment, chromaticity modification, contrast variation, and sharpness alteration. For model development and evaluation, the dataset can be partitioned into training, validation, and test subsets using multiple ratio configurations (8:1:1, 7:2:1, or 6:2:2), providing flexibility for different experimental designs and ensuring robust evaluation of deep learning models.

To rigorously evaluate dataset effectiveness, we conducted evaluations using ten representative object detection architectures spanning different paradigms: (1) three two-stage detectors (Faster R-CNN (Ren et al., 2016), Cascade R-CNN (Cai and Vasconcelos, 2018), and Grid R-CNN (Lu et al., 2019)) and (2) seven single-stage detectors (RetinaNet (Lin et al., 2017), YOLOv5 (Jocher et al., 2020), YOLOv8, YOLOv11 (Khanam and Hussain, 2024), YOLOv12 (Tian et al., 2025), RT-DETR (Zhao et al., 2024), and DEIMv2 (Huang et al., 2025b)). This multi-model validation approach ensures robust assessment of the dataset's suitability for various detection paradigms.

The dataset was randomly partitioned into training (2,013 images), validation (253 images), and test (251 images) sets at an 8:1:1 ratio for model training.
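An 8:1:1 random partition like the one described above can be sketched as follows. The seed is a placeholder for reproducibility, and exact subset sizes depend on rounding, so the validation/test counts may differ by a couple of images from those reported.

```python
# Sketch of a seeded random train/val/test split at a given ratio.
# The seed is a placeholder; rounding may shift a few images between
# the val and test subsets relative to the counts reported in the paper.
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items with a fixed seed and cut into three disjoint subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    items = list(items)
    rng = random.Random(seed)  # local RNG: does not disturb global state
    rng.shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Passing `ratios=(0.7, 0.2, 0.1)` or `(0.6, 0.2, 0.2)` yields the alternative 7:2:1 and 6:2:2 configurations mentioned above.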
The random partitioning strategy was intentionally employed to demonstrate dataset generalizability and support reliable performance evaluation across different experimental configurations. All experiments were conducted on a workstation equipped with an Intel Core i9-11900H CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 3080 GPU (16 GB VRAM), running the Windows 10 operating system. The software environment consisted of Python 3.8, PyTorch 2.2.2, and CUDA 11.8 for GPU acceleration.

All experimental results are documented in Table 5, which provides a comprehensive evaluation of ten object detection models on our apple fruitlet dataset, assessed using standard COCO metrics: Average Precision (AP) and Average Recall (AR). The results reveal significant performance disparities among the models. The Transformer-based RT-DETR-L model leads in overall accuracy (AP = 0.669), demonstrating the most robust overall detection capabilities. In contrast, DEIMv2-N excels in recall (AR = 0.706) and loose-threshold precision (AP@0.5 = 0.921), offering particular value for applications like fruit thinning where a high recall rate is critical. Modern YOLO series models (v8, v11, v12) provide a balanced, high-performance alternative. Notably, the accuracy for small targets (AP_S) is substantially lower than for medium and large targets across all models. This performance gap further validates the inherent difficulty of detecting small apples prior to thinning and underscores the challenging nature of the presented dataset. Collectively, the benchmarking results from these ten models validate the effectiveness and utility of the proposed dataset for the challenging task of small apple detection.

Beyond its immediate application to CNN-based models, the dataset holds significant potential for advancing state-of-the-art methodologies.
Firstly, the high-quality manual annotations make the dataset particularly suitable for training and evaluating modern approaches such as self-supervised and unsupervised learning models, which require substantial amounts of data to learn meaningful representations without exhaustive manual labels. Secondly, the complexity and variety of scenes challenge unsupervised object detection algorithms, pushing the boundaries of models that can identify and segment objects without prior knowledge.

The primary limitation of this dataset is its origin from a single experimental station, which may limit the generalizability of models trained on it to commercial orchards operating under different geographical and managerial conditions. However, the dataset was explicitly designed to capture a wide spectrum of visual challenges (e.g., lighting variations, occlusion levels, scale changes) inherent to the task. To address the current geographical limitation, our immediate next step is to enrich the dataset with samples from a wider range of regions and cultivation systems. This expansion will, in turn, fuel our investigation into domain adaptation methods designed to ensure model performance and reliability in unfamiliar orchard environments.

An additional limitation is that the dataset is composed exclusively of Fuji apple instances, which may hinder generalization to other apple varieties or fruit species with significantly different morphological characteristics. Future work should therefore incorporate more diverse cultivars and species to enhance the robustness and broader applicability of the approach.
Keywords: apple fruitlet, computer vision in agriculture, image dataset, natural scenes, target detection
Received: 22 Jul 2025; Accepted: 08 Dec 2025.
Copyright: © 2025 Wang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Bo Wang
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.