- 1School of Artificial Intelligence, Changchun University of Science and Technology, Changchun, China
- 2Zhongshan Institute of Changchun University of Science and Technology, Zhongshan, China
- 3School of Data Science and Artificial Intelligence, Jilin Engineering Normal University, Changchun, China
- 4College of Electrical and Information Engineering, Jilin Engineering Normal University, Changchun, China
- 5College of Life Sciences and Agri-forestry, Southwest University of Science and Technology, Mianyang, China
Introduction: The real-time, accurate detection and classification of rice seeds are crucial for improving agricultural productivity, ensuring grain quality, and promoting smart agriculture. Although significant progress has been made using deep learning, particularly convolutional neural networks (CNNs) and attention-based models, earlier methods such as threshold segmentation and single-grain classification faced challenges related to computational efficiency and latency, especially in high-density seed agglutination scenarios. This study addresses these limitations by proposing an integrated intelligent analysis model that combines object detection, real-time tracking, precise classification, and high-accuracy phenotypic measurement.
Methods: The proposed model utilizes the lightweight YOLOv11-LA for real-time grain segmentation, which builds upon the YOLOv11 architecture. YOLOv11-LA incorporates several enhancements over YOLOv11, including separable convolutions, CBAM (Convolutional Block Attention Module) attention mechanisms, and module pruning strategies. These modifications not only improve detection accuracy but also significantly reduce the number of parameters by 63.2% and decrease computational complexity by 51.6%. For classification, the model employs a custom-designed, lightweight RiceLCNN classifier. Additionally, the DeepSORT algorithm is employed for real-time multi-object tracking, and sub-pixel edge detection along with dynamic scale calibration mechanisms are applied for precise phenotypic feature measurement.
Results: Compared to YOLOv11, the YOLOv11-LA model increases the mAP@0.5:0.95 score by 1.9%, showcasing its superior detection performance while maintaining lower computational overhead. The RiceLCNN classifier achieved classification accuracies of 89.78% on private datasets and 96.32% on public benchmark datasets. The system demonstrated high accuracy in measuring phenotypic features such as seed size and roundness, with measurement errors kept within 0.1 millimeters. The DeepSORT algorithm effectively managed multi-object tracking, reducing duplicate identifications and frame loss in real time.
Discussion: Experimental validation confirmed that the YOLOv11-LA model outperforms the original YOLOv11 in terms of both detection speed and accuracy, while also maintaining low computational complexity. The integration of the YOLOv11-LA, RiceLCNN, and DeepSORT algorithms, combined with advanced measurement techniques, underscores the model's potential for industrial applications, particularly in enhancing smart agricultural practices.
1 Introduction
Rice (Oryza sativa L.), as a globally significant staple crop, plays a crucial role in maintaining food security and agricultural economics through its yield and quality. Achieving efficient and precise rice seed detection and classification not only enables stringent quality control at the source and supports the breeding and promotion of superior varieties, but also provides critical technological support for the transformation of agricultural production toward intelligent and precision farming (Karunasena et al., 2020).
Compared to other crops, rice seeds are tightly enclosed by hard inner and outer glumes (husk), and their characteristics pose a series of unique challenges for phenotypic analysis. First, rice grains are small and elongated, often appearing in high-density, mutually contacting distributions during high-throughput imaging, significantly increasing the difficulty of target segmentation (Aukkapinyo et al., 2019; Sun et al., 2014; Ning et al., 2023). Second, visual differences between varieties are often subtle, primarily manifesting in traits such as grain length-to-width ratio, end contour, and surface texture, while environmental and cultivation management factors (e.g., day-night temperature differences, water and fertilizer conditions) further amplify intraspecific variation and reduce interspecific distinguishability by influencing quality traits like chalkiness (Zhou et al., 2018; Huo et al., 2023). Additionally, variations in husk surface gloss and color, along with localized high-reflectance and low-contrast regions, complicate feature extraction and threshold segmentation (Yan et al., 2018; Yang et al., 2023).
Traditional rice variety classification methods—manual phenotypic identification, chemical or biological analysis—remain valuable for accuracy and interpretability. However, they are generally destructive, time-consuming, and costly. Consequently, recent research has shifted toward image-based deep learning approaches (Cui et al., 2019; Li et al., 2022). In the typical scenario of “dense small objects in close contact,” object detection has become the mainstream technical approach: mask-based methods excel in contour delineation but are prone to over-segmentation or under-segmentation in densely packed rice seed scenarios, and are not conducive to high-throughput initial screening. In contrast, one-stage methods (the YOLO series) maintain accuracy while offering real-time performance. They enable rapid initial screening of rice seeds in detection-priority pipelines, followed by subsequent fine segmentation and measurement, aligning better with high-throughput workflows from field to laboratory. Previously, models like YOLOv5 have demonstrated strong performance in rice seed detection tasks (Phan et al., 2023); recent enhancements—including more efficient backbones and decoupled detector heads, improved small object training and sampling strategies, and end-to-end/low-latency inference—further boost their applicability in dense scenarios. YOLOv9 introduced the Generalized Efficient Layer Aggregation Network (GELAN) and programmable gradient information to enhance information utilization in lightweight models (Wang et al., 2024b); YOLOv10 achieved NMS-free training and significantly reduced latency through strategies like consistent dual assignment, pushing the performance boundaries of small object detection and real-time deployment (Wang et al., 2024a). Building upon this evolutionary trajectory, YOLOv11 comprises three major modules: the backbone network, the neck network, and the head network. The backbone handles multi-scale feature extraction, while the neck aggregates features through multi-layer convolutions and attention mechanisms before passing them to the head, which generates the final predictions. Performance improvements stem primarily from innovative module designs: replacing the traditional C2f with C3k2 enhances computational efficiency, and integrating Spatial Pyramid Pooling Fast (SPPF) with Cross-Stage Partial with Spatial Attention (C2PSA) further strengthens feature representation. Furthermore, YOLOv11 undergoes systematic optimization in model size and computational efficiency, demonstrating exceptional performance in multi-scale small object detection and proving particularly well-suited for edge computing scenarios (Rasheed and Zarkoosh, 2025). Accordingly, we build the segmentation frontend upon the lightweight YOLOv11-LA network to balance speed and accuracy while ensuring seamless integration with subsequent fine-grained classification modules.
For seed classification, RiceSeedNet employs a visual Transformer architecture supplemented by traditional image processing for RGB-based rice variety identification (Rajalakshmi et al., 2024); RiceNet focuses on color and shape feature extraction, demonstrating strong generalization capabilities across varying grain sizes and contours (Mohi Ud Din et al., 2024). These studies demonstrate that convolutional models based on standard image information can effectively characterize rice seed phenotype. However, when confronted with high-density, intra-class-diverse real-world workflows, classification networks alone struggle to meet the efficiency and accuracy demands of end-to-end processing. Therefore, this study adopts an integrated “detection front-end – fine-grained classification back-end” approach to balance high throughput and high accuracy in complex scenarios.
Experiments employed a single-factor randomized block design. The same rice variety was planted across nine different treatment combinations (straw incorporation, enzyme application, and various combinations of organic and chemical fertilizers) on plots demonstrating fertilizer efficacy, ensuring consistent cultivation practices. Post-harvest, rice seed samples were collected from each treatment plot and photographed against a standard black background (each image containing 200 grains) to analyze phenotypic variations induced by different fertilizer treatments. Methodologically, we employed an optimized lightweight YOLOv11-LA frontend for automated detection/segmentation of high-density grains, coupled with a lightweight RiceLCNN for phenotypic feature extraction and classification of individual seeds, forming a scalable two-stage segmentation-classification pipeline. The main contributions of this study are as follows: (1) Proposes a lightweight, structurally optimized YOLOv11-LA localization front-end tailored for dense, overlapping, low-contrast rice seed images, enabling real-time inference and consistently improving small object recall; (2) Designs the lightweight and efficient RiceLCNN classification network, enhancing its ability to recognize subtle phenotypic variations in rice seeds; (3) Constructs a rice seed classification dataset under standardized imaging conditions, providing data support for subsequent phenotypic analysis and management strategy evaluation; (4) Establishes a scalable “two-stage segmentation-classification” pipeline, facilitating real-time deployment and online iterative refinement.
2 Related work
In rice seed identification experiments, while individual grain imaging can reduce errors, it incurs extremely high labor and time costs; in practice, a more common approach involves capturing multiple grains in a single frame, followed by segmentation. Existing research, often prioritizing classification, typically employs the classical Otsu global threshold (Otsu, 1979) for segmentation. This method achieves unsupervised segmentation by maximizing inter-class variance, and several improvements have been introduced for rice seed scenarios to enhance robustness under complex lighting and background conditions (Qingge et al., 2020). However, under conditions of low contrast, noise interference, and grain adhesion, Otsu and its variants remain reliant on local adaptation and post-processing, with limited effectiveness in separating adhered objects. Therefore, we turn to detection-driven real-time segmentation: the YOLO series has demonstrated its detection and localization capabilities in agricultural imagery, such as size estimation based on YOLOv7 (Pawłowski et al., 2024) and grain measurement leveraging rotation awareness with YOLOv8 (Zhao et al., 2023). Within this framework, we adopt an optimized and lightweight YOLOv11-LA as the frontend. It inherits the CSP backbone and FPN+PAN feature fusion from YOLOv11n to enhance small object representation. Combined with pruning and quantization, it achieves a balance between accuracy and speed, enabling real-time instance-level segmentation and screening of densely packed, contacting grains.
Based on the instance regions output by this frontend, we employ the lightweight RiceLCNN for single-grain classification and phenotypic feature extraction. Early rice seed classification studies predominantly utilized machine learning algorithms such as Logistic Regression (LR), Linear Discriminant Analysis (LDA), k-Nearest Neighbors (k-NN), and Support Vector Machines (SVM). These methods perform well under conditions of thorough preprocessing and explicit feature extraction. For instance, Kiratiratanapruk et al. (2020) achieved 90.61% accuracy using SVM classification on RGB images. Kurade et al. (2023) combined geometric features with a Random Forest (RF) classifier to achieve a 77% classification rate. Phan et al. (2024) further proposed a hybrid model integrating deep feature extraction with traditional classifiers, enhancing model stability and generalization capabilities. As CNNs and their variants gain widespread adoption for rice variety image classification, Jin et al. (2024) enhanced image quality and training sample selection efficiency by integrating the EDSR network with the Kennard-Stone algorithm. Sharma et al. (2024) introduced transfer learning to rice variety recognition. Comparing architectures like ResNet50, Xception, and InceptionV3, they found ResNet50 performed optimally on large-scale datasets, achieving 80.5% accuracy. These studies demonstrate deep learning’s distinct advantages in feature extraction and phenotypic recognition. To further enhance model inference capabilities, some research attempts to incorporate multi-model fusion into rice variety classification systems. For instance, Rathnayake et al. (2023) fused GBDT with ANFIS to construct a nonlinear classifier, enabling intelligent assessment of rice seed aging. Although such models demonstrate strong classification capabilities, their complex structures and high computational costs limit practical deployment in edge computing and field environments. In contrast, Mohi Ud Din et al. (2024) proposed RiceNet, centered on a lightweight convolutional architecture. Tested on five common rice varieties, it achieved 94% accuracy, outperforming InceptionV3 (84%) and InceptionResNetV2 (81.33%). RiceNet’s compatibility with high accuracy and low computational complexity makes it suitable for rapid deployment and industrialization under resource-constrained conditions. Nevertheless, this approach primarily focuses on static image recognition, and future work should extend it to multi-task scenarios (such as counting and multi-attribute analysis) to enhance practicality.
3 Materials and methods
This study proposes a two-stage segmentation–classification framework for rice seed recognition and phenotypic analysis, consisting of two core modules: (1) the real-time grain segmenter YOLOv11-LA, designed to perform rapid detection and instance segmentation in dense contact scenarios; and (2) the rice seed classifier RiceLCNN, responsible for phenotypic discrimination and feature extraction from individual grain images. As illustrated in Figure 1, the overall workflow comprises dataset construction (collection, annotation, and quality control), model design and optimization (architecture refinement, pruning–quantization, and training strategies), tracking-based counting (instance-level inter-frame association), and phenotypic measurement and analysis, together forming a scalable pipeline suitable for high-throughput applications. To ensure the fairness and reliability of the experiments, all experiments were independently repeated k=5 times under the same conditions, and the average results were reported to minimize randomness.
3.1 Real-time grain segmenter: YOLOv11-LA
3.1.1 Image acquisition and detection data set construction
The data for this study were provided by the Rice Research Institute of the Jilin Academy of Agricultural Sciences. To generate sufficient phenotypic variation while keeping the genetic background constant, the same rice variety, “Tongjing 612”, was cultivated under nine fertilization regimes that systematically combined straw incorporation, enzyme application, organic fertilizer, and chemical fertilizer (Table 1). Plot 1 received neither straw, enzyme, organic fertilizer, nor chemical fertilizer and therefore served as the unfertilized control. Plots 2 and 3 were designed to test the effect of straw incorporation with and without enzyme application. Plots 4–9 combined straw with different types of fertilizers: Plot 4 received all four inputs (straw, enzyme, organic and chemical fertilizers), Plots 5 and 6 received straw and enzyme in combination with either chemical or organic fertilizer, respectively, while Plots 7–9 received straw with different combinations of organic and/or chemical fertilizers but without enzyme. This design allowed us to disentangle the effects of straw, enzyme, and fertilizer type on rice growth and phenotypic variation.
Imaging was performed using a Nikon D7100 camera under uniform black background and standardized lighting conditions, with a resolution of 6000×4000 (300 dpi). Each image captured 200 grains; an example is shown in Figure 2. Additionally, we integrated public datasets including RiceNet (Mohi Ud Din et al., 2024), Japanese Rice (Rathnayake et al., 2023), and RiceSeedSize (Rajalakshmi et al., 2024) to enhance training data diversity and cross-scenario adaptability. To enable traceable pixel-to-physical-scale conversion, black light-absorbing discs were placed within images as dynamic scale calibration references. In this study, a total of 25,000 rice seed samples were collected and annotated. These were divided into training, validation, and test sets in an 8:1:1 ratio, comprising 20,000, 2,500, and 2,500 seeds, respectively. The training set was used for model parameter learning and feature extraction; the validation set was employed during training to tune hyperparameters and monitor convergence, preventing overfitting; the test set remained strictly independent from the model development process and was used only in the final stage for unified evaluation of the model’s generalization ability and overall performance, ensuring the objectivity and comparability of experimental results. All images were annotated with rice seed bounding boxes in YOLO format, forming the rice seed detection dataset. YOLO annotations for each dataset are illustrated in Figure 3.
Figure 3. Example figure of multi-source datasets: (a) Absorbance reference plate; (b) RiceNet; (c) Japanese Rice; (d) RiceSeedSize; (e) Dataset used in this study.
3.1.2 Network structure optimization of YOLOv11-LA
This study designs a lightweight segmentation model, YOLOv11-LA (Lightweight-Attention), based on the YOLOv11n architecture. Its primary optimizations include:
(1) Lightweight Convolution Structure Design: In the model backbone, the original two-layer standard convolution structure (Conv + Conv) is replaced with a Conv + DWConv combination. Depthwise Separable Convolution (DWConv) is also adopted in subsequent multi-layer feature extraction modules to reduce computational burden (Shafik et al., 2025). The computational complexity is optimized from standard convolution (Equation 1) to the depthwise separable structure (Equation 2):
where K is the convolution kernel size, Cin and Cout are the input/output channel numbers, and H × W is the output feature map size.
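The equation bodies for Equations 1 and 2 are not reproduced above; using the symbols just defined, the standard complexity expressions for the two convolution types presumably read:

\mathrm{FLOPs}_{\mathrm{std}} = K^{2} \cdot C_{in} \cdot C_{out} \cdot H \cdot W \quad (\text{Equation 1})

\mathrm{FLOPs}_{\mathrm{DWS}} = K^{2} \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W \quad (\text{Equation 2})

Their ratio, 1/C_{out} + 1/K^{2}, is the usual argument for the computational savings of the depthwise separable structure.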
(2) Streamlined C3k2 Module Structure: Redundant computation is reduced by replacing deeper C3 variants with C3k2. A compression coefficient is applied to the channel count at the corresponding stages (while preserving the backbone resolution and receptive field) to reduce parameters and FLOPs. This modification inherits C3’s cross-stage feature fusion advantage (CSP mechanism) while significantly reducing the redundant computation caused by stack depth and intermediate channels (Jocher et al., 2022).
(3) CBAM Attention Module before the Detection Head (Woo et al., 2018): To highlight discriminative features of adhered and minute grains at low cost, we insert CBAM once before the classification/regression convolutions in the detection head (channel attention with reduction ratio r = 16, spatial attention using a 7×7 convolution, as shown in Equations 3 and 4). This placement focuses attention on high-level features after multi-scale fusion without unduly interfering with the backbone’s general representations, and it consistently improves performance while keeping GFLOPs largely unchanged.
where GAP(·) denotes global average pooling, which averages the spatial dimensions of the feature map F and outputs a C×1×1 descriptor; GMP(·) denotes global max pooling, which takes the maximum value over the spatial dimensions and likewise outputs a C×1×1 descriptor; MLP(·) represents a two-layer fully connected network (typically with a ReLU activation in the middle layer) used to generate the channel attention weights; f^{7×7}(·) denotes a convolution with a 7×7 kernel, taking 2 input channels (the concatenated channel-wise average and max maps) and producing 1 output channel; σ(·) denotes the Sigmoid activation function, which compresses values into the range (0, 1); and [·; ·] represents concatenation along the channel dimension.
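Equations 3 and 4 themselves are not shown above; assuming the standard CBAM formulation of Woo et al. (2018), which matches the definitions just given, they read:

M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\big) \quad (\text{Equation 3})

M_{s}(F') = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}_{ch}(F');\ \mathrm{MaxPool}_{ch}(F')])\big) \quad (\text{Equation 4})

where F' = M_{c}(F) ⊗ F, and the average/max pooling inside Equation 4 is taken along the channel dimension to produce the two single-channel maps mentioned above.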
(4) Channel width reduction: The maximum output channel count of the backbone is reduced from 1024 to 512. The output channels of P5 in the Head module are also simplified accordingly, effectively compressing the parameter scale and reducing memory consumption.
Furthermore, to address the limitation of conventional YOLO boxes failing to accurately fit tilted seed grain boundaries, the Otsu adaptive thresholding method is introduced. This performs fine segmentation on each detected seed grain region and extracts the minimum bounding rectangle, thereby enhancing the accuracy of subsequent phenotypic measurements. Figure 4 demonstrates the effect of using minimum bounding rectangles, and the model structure diagram is shown in Figure 5.
Figure 4. (a) Original YOLO detection box; (b) Minimum bounding rectangle detection box after applying Otsu.
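To make the refinement step concrete, the following is a minimal OpenCV sketch (not the authors’ exact implementation; function and variable names are illustrative) that applies Otsu thresholding inside a detected box and extracts the minimum bounding rectangle:

```python
import cv2

def refine_box_with_otsu(image_bgr, xyxy):
    """Refine a YOLO box: Otsu-threshold the crop, then fit a minimum-area rectangle."""
    x1, y1, x2, y2 = map(int, xyxy)
    crop = image_bgr[y1:y2, x1:x2]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding separates the bright grain from the black background
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    # The minimum (rotated) bounding rectangle fits tilted grains more tightly
    # than the original axis-aligned detection box
    (cx, cy), (w, h), angle = cv2.minAreaRect(largest)
    return (cx + x1, cy + y1), (w, h), angle
```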
3.2 Rice seed classifier: RiceLCNN
3.2.1 Classification dataset construction
Based on the detection outputs from YOLOv11-LA, each rice seed image was automatically cropped and uniformly resized to 224 × 224 pixels. After removing occluded, blurred, and incorrectly detected images, a high-quality rice seed classification dataset comprising 16,731 images was ultimately constructed. The data were partitioned into training, validation, and test sets at an 8:1:1 ratio, with specific counts shown in Table 2. The naming convention reflects the collection year and experimental plot number. Grain samples of each variety are displayed in Figure 6.
Figure 6. Rice seed samples harvested from plots with different fertilization treatments (Plots 1–9). The seed samples, collected in 2022, vary in grain shape, surface texture, and color based on the application combinations of straw return (S), enzymes (E), organic fertilizer (O), and chemical fertilizer (C). These differences provide distinct phenotypic features for image-based intelligent segmentation and classification. (a) Rice seeds from Plot 1, (b) Rice seeds from Plot 2, (c) Rice seeds from Plot 3, (d) Rice seeds from Plot 4, (e) Rice seeds from Plot 5, (f) Rice seeds from Plot 6, (g) Rice seeds from Plot 7, (h) Rice seeds from Plot 8, (i) Rice seeds from Plot 9.
3.2.2 RiceLCNN network structure design
To balance inference efficiency on mobile devices with rice seed classification accuracy, this paper proposes the lightweight convolutional neural network RiceLCNN. The model backbone consists of six cascaded convolutional modules followed by a fully connected classification head (Figure 7), adhering to the lightweight paradigm of “1 × 1 bottleneck restructuring + 3 × 3 local modeling.” Batch Normalization and LeakyReLU are applied after all convolutions to stabilize gradients and accelerate convergence (Bao et al., 2025; Wang et al., 2019). The first three modules employ a structure composed of sequential 1×1, 3×3, and 1×1 convolutions. The initial stage performs early downsampling using a stride s = 2 at the 3 × 3 convolution, followed by further spatial dimension reduction via 4 × 4 max pooling. The second and third stages maintain convolutions with stride s = 1 and similarly employ 4 × 4 max pooling to control resolution. After these three stages, the backbone introduces channel attention to enhance discriminative feature representation: Specifically, a Squeeze-and-Excitation (SE) mechanism is applied to the 64-channel features for channel recalibration. First, global average pooling is performed on the feature F to obtain the channel description z = GAP(F). Then, two fully connected layers with nonlinear transformations generate the weight vector, as shown in the following Equation 5.
where δ(·) and σ(·) denote ReLU and Sigmoid, respectively. s is used for per-channel scaling of the original features.
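The body of Equation 5 is not reproduced above; assuming the standard Squeeze-and-Excitation form consistent with these definitions (with W_{1} ∈ ℝ^{(C/r)×C}, W_{2} ∈ ℝ^{C×(C/r)}, and the reduction ratio r left unspecified in the text), it presumably reads:

z = \mathrm{GAP}(F), \qquad s = \sigma\big(W_{2}\,\delta(W_{1} z)\big), \qquad \tilde{F}_{c} = s_{c} \cdot F_{c} \quad (\text{Equation 5})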
Following SE, a 3 × 3 convolution layer and 2 × 2 max pooling continue feature refinement and compression. The fifth stage reuses the 1×1, 3×3, 1×1 convolution sequence followed by 2 × 2 pooling downsampling. The sixth stage employs a single 3×3 convolution for final semantic aggregation, followed by 2×2 pooling to obtain a compact spatial representation. Channel sizes follow a monotonically increasing pyramidal configuration along the network backbone: {16, 32, 64, 64, 128, 224}. Here, 1 × 1 convolutions perform cross-channel feature reorganization and dimension reduction at low cost, while 3×3 convolutions focus on critical local contexts. The SE module adaptively reweights channel importance during the mid-level semantic stage, enhancing feature selection capability and inter-class separability within limited parameter and computational budgets (Ding et al., 2021; Li et al., 2020; Liu et al., 2023). Finally, the network obtains a 224-dimensional global representation through a flattening operation and outputs classification results via a single fully-connected layer (224 × C, where C denotes the number of classes). This design effectively improves recognition performance while maintaining manageable inference overhead, making it suitable for resource-constrained mobile deployment scenarios.
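For illustration, a minimal PyTorch sketch of the building blocks described above is given below. The SE reduction ratio, the LeakyReLU slope, and the helper names are assumptions, while the 1×1–3×3–1×1 pattern, BN + LeakyReLU after every convolution, and the pooling sizes follow the text:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel recalibration (Equation 5)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        z = x.mean(dim=(2, 3))                                 # global average pooling
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # channel weights
        return x * s.unsqueeze(-1).unsqueeze(-1)               # per-channel scaling

def conv_bn_act(cin: int, cout: int, k: int, s: int = 1) -> nn.Sequential:
    """Convolution followed by BatchNorm and LeakyReLU, as used throughout RiceLCNN."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Bottleneck131(nn.Module):
    """The 1x1 -> 3x3 -> 1x1 stage pattern used in the first three RiceLCNN modules."""
    def __init__(self, cin: int, cmid: int, cout: int, stride: int = 1, pool: int = 4):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_act(cin, cmid, 1),
            conv_bn_act(cmid, cmid, 3, s=stride),
            conv_bn_act(cmid, cout, 1),
            nn.MaxPool2d(pool),
        )

    def forward(self, x):
        return self.block(x)
```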
3.3 Integrated tracking counting and morphometry module
3.3.1 Real-time tracking and counting
In the scenario of simultaneous multi-grain rice variety identification, we employ YOLOv11-LA as the front-end detector. For each frame image I_t, it outputs a detection set D_t = {d_i}, where each detection d_i = (u_i, v_i, w_i, h_i, c_i, s_i) encodes the center (u_i, v_i) and scale (w_i, h_i) of the bounding box, c_i represents the initial category (rice-seed), and s_i indicates the confidence score. The detection results are fed into DeepSORT for cross-frame association: for each tracked target, a Kalman filter state is maintained (comprising the bounding box center, aspect ratio, height, and their velocities), yielding the prior trajectory set. A cost matrix is constructed under a Mahalanobis-distance gating threshold together with an IoU constraint. The globally optimal matching between detections and trajectories is solved using cascade matching (starting from the most recently updated trajectories and proceeding from near to far in time) and the Hungarian algorithm, yielding the set of matched pairs. A Kalman update is then performed on each matched trajectory to obtain the corrected trajectory set. Unmatched detections initialize new trajectories, and trajectories that remain unmatched for more than a preset number of consecutive frames (max_age, Table 3) are terminated. This dual “cascade + IoU” constraint effectively reduces missed detections and ID switching (Yan et al., 2025), as illustrated in Figure 8.
For each confirmed track, the cropped region is input to the lightweight classifier RiceLCNN to obtain its category prediction and classification confidence. Combined with dynamic scale calibration, this yields phenotypic measurements including length, width, aspect ratio, and roundness. Counting employs a unique-ID deduplication strategy: confirmed trajectories that have first entered the counting region by time t and have not yet been counted are added to the cumulative total, so each trajectory ID contributes exactly once. In dense images with adhering but minimally occluded grains, this modular “detection-tracking-classification-measurement” pipeline maintains consistent cross-frame IDs and stable counting. Key parameters of DeepSORT are shown in Table 3, with ID switches (IDSW) on the test sequences reduced to 2.1 on average.
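A minimal Python sketch of the unique-ID deduplication counting described above (the track object’s `track_id` attribute and the counting-region test are assumptions standing in for the actual DeepSORT integration):

```python
counted_ids = set()   # track IDs that have already contributed to the total
total_count = 0

def update_count(confirmed_tracks, in_counting_region):
    """Count each confirmed track exactly once when it first enters the counting region."""
    global total_count
    for track in confirmed_tracks:
        if track.track_id not in counted_ids and in_counting_region(track):
            counted_ids.add(track.track_id)
            total_count += 1
    return total_count
```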
3.3.2 Real-time measurement
To achieve dynamic absolute scale conversion based on the image itself, this paper directly employs YOLOv11-LA for detection and segmentation of the calibration disk: the detector first outputs the disk’s region of interest (ROI) (category “scale disk”), and its segmentation head returns the disk mask within this ROI. The pixel-domain area is obtained by counting pixels in the mask, as shown in the following Equation 6.
where A_pix denotes the pixel-domain area of the disk (unit: px²), 𝟙[·] is the indicator function, and (x, y) represents pixel coordinates within the ROI.
Based on the circle area-diameter relationship, the equivalent pixel diameter of the disk is obtained as shown in the following Equation 7.
where Dpix denotes the diameter in the pixel domain (unit: px). This formula leverages mask geometry without requiring additional edge/ellipse fitting, making it suitable for real-time high-throughput applications.
Comparing the pixel diameter to the physical diameter yields a uniform millimeter/pixel scale across the entire image, as shown in the following Equation 8.
where Dreal is the true diameter of the calibrated disk (unit: mm), and α is the scale factor mapping lengths from the pixel domain to the physical domain. This paper assumes near-orthogonal imaging with square (isotropic) pixels; if minor perspective/distortion exists, camera calibration is applied for prior correction.
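Equations 6–8 are not reproduced above; a reconstruction consistent with the surrounding definitions (with M denoting the binary disk mask, a symbol introduced here for clarity) is:

A_{pix} = \sum_{(x,y)\in \mathrm{ROI}} \mathbb{1}\big[M(x,y) = 1\big] \quad (\text{Equation 6})

D_{pix} = 2\sqrt{A_{pix}/\pi} \quad (\text{Equation 7})

\alpha = D_{real} / D_{pix} \quad (\text{Equation 8})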
After obtaining the scale factor α, for each detected seed instance (category “seed”), contours are extracted within its detection bounding box using the Otsu method. These are then processed through Gaussian smoothing, Sobel gradient detection, and sub-pixel edge fitting to derive the pixel-domain metrics. The corresponding physical-domain quantities are converted according to the following Equation 9:
where L^(s) and W^(s) denote seed length and width (mm), A^(s) denotes area (mm²), and P^(s) denotes perimeter (mm); the superscript (s) indicates seed-related quantities.
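Correspondingly, Equation 9 presumably converts each pixel-domain quantity with the scale factor α (lengths scale linearly, areas quadratically):

L^{(s)} = \alpha\, L^{(s)}_{pix}, \quad W^{(s)} = \alpha\, W^{(s)}_{pix}, \quad P^{(s)} = \alpha\, P^{(s)}_{pix}, \quad A^{(s)} = \alpha^{2} A^{(s)}_{pix} \quad (\text{Equation 9})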
Roundness is a scale-invariant property and can be computed equivalently in both the pixel domain and the physical domain. The calculation process is shown in Equation 10.
where A,P must be taken from the same measurement domain; C reflects the degree of circularity of the contour (C ∈ (0,1]).
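Given the stated scale invariance and the range C ∈ (0, 1], Equation 10 is presumably the standard circularity measure:

C = \frac{4\pi A}{P^{2}} \quad (\text{Equation 10})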
To evaluate measurement uncertainty, this paper employs a first-order error propagation approximation, as shown in Equation 11.
where α̂ is the estimated scale, σ_{L_pix} is the pixel-domain length error caused by sub-pixel edge localization, and σ_α is the standard deviation of the scale estimate. This equation quantifies the combined contribution of segmentation-boundary noise and scale-estimation noise to the uncertainty of the physical-domain length.
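Equation 11, reconstructed as the first-order propagation of L^{(s)} = α L^{(s)}_{pix} using the quantities defined above, presumably reads:

\sigma^{2}_{L^{(s)}} \approx \hat{\alpha}^{2}\, \sigma^{2}_{L_{pix}} + \big(L^{(s)}_{pix}\big)^{2}\, \sigma^{2}_{\alpha} \quad (\text{Equation 11})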
4 Experiments
4.1 Experimental environment and parameters
This experiment leverages the PyTorch framework and utilizes a Quadro RTX 8000 GPU (48 GB VRAM, 672 GB/s bandwidth, 4608 CUDA cores) under CUDA 12.6 for accelerated training. The segmentation model uses a uniform input size of 1080 × 720 with a batch size of 8. The optimizer operates in auto mode with an initial learning rate of 0.01. The classification model employs an input size of 3 × 224 × 224 with a batch size of 64, trained for 50 epochs. The optimizers include SGD and Adam, with an initial learning rate of 0.001.
4.2 Experimental evaluation metrics
To comprehensively evaluate the performance of the proposed model in rice seed detection and classification tasks, this paper constructs a multidimensional evaluation metric system focusing on accuracy, robustness, and computational complexity. The detection task primarily assesses the model’s ability to accurately identify the location of rice seed targets, while the classification task emphasizes the model’s capability to distinguish between different rice seed varieties or grades. All evaluation metrics are calculated based on an independent test set, with specific definitions and calculation formulas as follows.
(1) Model Complexity: To measure how lightweight a model is and its computational overhead, the following metrics are calculated: Parameters (Params), the total number of trainable parameters in the model; and GFLOPs (giga floating-point operations), the floating-point computation required for forward inference on a single image, as shown in Equation 12.
where C_l^in and C_l^out denote the input and output channel counts of the l-th convolutional layer, respectively; K_l is the convolution kernel size; H_l × W_l represents the spatial resolution of the feature map at that layer; and L is the total number of convolutional layers.
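Equation 12 is not reproduced above; based on the definitions just given, the per-image FLOPs presumably accumulate over all convolutional layers as follows (some conventions additionally include a factor of 2 for multiply–add pairs):

\mathrm{FLOPs} = \sum_{l=1}^{L} K_{l}^{2}\, C_{l}^{in}\, C_{l}^{out}\, H_{l}\, W_{l} \quad (\text{Equation 12})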
(2)Object detection metrics: The quality of bounding box predictions is evaluated using three metrics—Precision (P), Recall (R), and Mean Average Precision (mAP) (Wen et al., 2025). The calculation processes are shown in Equations 13–15, respectively.
In Equation 13, TP denotes the number of samples predicted as positive and actually positive (true positives), while FP denotes the number of samples predicted as positive but actually negative (false positives); In Equation 14, FN denotes missed samples (false negatives). For detection tasks, the definitions of TP, FP, and FN depend on whether the Intersection over Union (IoU) between predicted and ground-truth bounding boxes exceeds a set threshold (e.g., 0.5).
where N is the total number of target classes, and P_i(R_i) denotes the precision-recall (PR) curve for the i-th class. This paper reports two metrics: mAP@0.5 (IoU threshold of 0.5) and mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95), reflecting detection performance under lenient and stringent conditions, respectively.
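Equations 13–15 are not reproduced above; the standard definitions consistent with the descriptions given are:

P = \frac{TP}{TP + FP} \quad (\text{Equation 13}), \qquad R = \frac{TP}{TP + FN} \quad (\text{Equation 14})

\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_{i}(R_{i})\, \mathrm{d}R_{i} \quad (\text{Equation 15})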
(3) Image Classification Metrics: To evaluate the model’s discriminative capability for rice variety classification, we introduce Accuracy and F1 Score [where Precision (P) and Recall (R) are defined in (13) and (14)]. Accuracy is defined as shown in Equation 16.
where TN denotes the number of samples correctly classified as negative. This metric measures the proportion of correctly classified samples across all categories. To further evaluate the model’s performance in positive class recognition, the F1 score is introduced as shown in Equation 17.
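Equations 16 and 17 presumably take the standard forms:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (\text{Equation 16}), \qquad F1 = \frac{2PR}{P + R} \quad (\text{Equation 17})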
F1 combines precision and recall, serving as their harmonic mean and being well-suited for evaluating models in scenarios with class imbalance. For multi-class classification tasks, the macro-average strategy is employed: P, R, and F1 are calculated separately for each class, and their arithmetic mean is taken to fairly reflect the importance of each class. Distinction: While Precision and Recall share identical calculation methods, their meanings and emphasis differ across tasks: In object detection, these metrics evaluate how well detection boxes overlap with targets, prioritizing spatial accuracy and false negative rates. In classification tasks, they measure a model’s ability to distinguish between categories, particularly crucial for identifying intermediate, adjacent classes.
4.3 YOLOv11-LA lightweight model improvement and ablation experiments
To reduce computational and memory overhead while maintaining detection accuracy, this paper proposes the lightweight detection network YOLOv11-LA based on the YOLOv11n baseline. The design focuses on four orthogonal structural optimizations: (i) replacing part of the standard convolutions with Depthwise Separable Convolution (DWConv+PWConv) to reduce parameters and FLOPs; (ii) streamlining the C3k2 feature extraction unit and simultaneously narrowing channel width to suppress redundant computation (Slim); (iii) introducing CBAM attention in the detection head to enhance the representation of key regions; and (iv) imposing a maximum-channel cap (512 channels) on the widest layers to further control model size.
To quantify how each strategy contributes to the performance–efficiency trade-off, we constructed seven ablation configurations (Table 4). A denotes the original YOLOv11n baseline. B (YOLOv11n-DW) isolates the effect of DWConv by only replacing the corresponding standard convolutions in A. C (YOLOv11n-Slim) keeps the convolution type unchanged but simplifies the C3k2 blocks and compresses channels, so that the impact of structural slimming on accuracy and complexity can be observed independently. D (YOLOv11n-Attn) inserts CBAM modules into the detection head on top of A, leaving the backbone unchanged, which isolates the contribution of attention. E (YOLOv11n-SlimAttn) combines the Slim backbone in C with the attention mechanism in D to investigate the interaction between compression and attention under a low-computation regime. F (YOLOv11n-Cap) applies only the maximum-channel cap to the baseline A, without DWConv, Slim, or attention, to isolate the influence of the channel cap itself on model capacity and efficiency. Finally, G (YOLOv11-LA) integrates all the above optimizations—DWConv, Slim C3k2 blocks with channel compression, CBAM attention, and the channel cap—to obtain the final lightweight model.
Table 4. Ablation study results of different configurations of YOLOv11-LA on rice seed detection (values reported as mean ± std).
As shown in Table 4, B significantly reduces computational and parameter overhead (GFLOPs drop from 6.4 to 5.8, i.e., a 9.4% reduction; parameters decrease from 2.59M to 2.44M) with almost no loss in detection accuracy (mAP@0.5 remains at 99.50%). C further compresses the model to 0.94M parameters and 3.3 GFLOPs (48.4% fewer FLOPs than A), but causes a noticeable drop in mAP@0.5:0.95 (from 91.16% to 82.09%), indicating that overly aggressive structural slimming alone harms localization and scale regression of dense small objects. D shows that CBAM attention mainly improves accuracy: compared with A, YOLOv11n-Attn slightly increases complexity (GFLOPs from 6.4 to 6.5), but boosts mAP@0.5:0.95 by 1.73 percentage points (from 91.16% to 92.89%), confirming that attention is beneficial even without compression.
The additional configuration F highlights the isolated effect of the channel cap. Compared with the baseline A, YOLOv11n-Cap achieves a substantial reduction in complexity (GFLOPs from 6.4 to 3.3, i.e., a 48.4% reduction; parameters from 2.59M to 0.94M, a 63.6% reduction), while maintaining high detection performance: mAP@0.5 only decreases slightly (from 99.50% to 99.40%), and mAP@0.5:0.95 even improves from 91.16% to 92.40%. Under a similar computational cost to C, F therefore provides much better overall detection quality (mAP@0.5:0.95 improves from 82.09% to 92.40%), indicating that redistributing capacity via a channel cap is more favorable than pure structural slimming.
The combined configurations E and G further reveal the interaction among these modules. Adding CBAM to the slimmed model (E) recovers most of the accuracy loss of C: mAP@0.5:0.95 is increased from 82.09% to 92.41%, while GFLOPs remain at only 3.4 and parameters below 1.0M. On this basis, integrating DWConv and the channel cap in G further improves the trade-off: YOLOv11-LA achieves mAP@0.5 of 99.50% and mAP@0.5:0.95 of 93.06% with only 3.1 GFLOPs and 0.95M parameters, corresponding to a 51.6% reduction in computation and about a 63.2% reduction in parameters compared with the baseline A, while also increasing mAP@0.5:0.95 by 1.90 percentage points. These results demonstrate that under dense, small-object and occlusion-prone scenarios, each lightweighting strategy contributes in a complementary way, and their combination in YOLOv11-LA maintains or even improves detection accuracy while significantly reducing model size and computational cost, making it well-suited for edge deployment.
4.4 Comparison experiment design
4.4.1 Detection model comparison experiment
To comprehensively evaluate the performance of the proposed lightweight improved network YOLOv11-LA, we trained and validated five object detection models on the rice seed detection dataset: YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, and the enhanced YOLOv11-LA. All models were trained with identical parameter settings and underwent three independent experimental runs to ensure result stability and reproducibility. Experimental results are shown in Table 5.
Table 5. Comparison of results from each model in the contrast experiment (values reported as mean ± std).
As shown in Table 5, YOLOv11-LA demonstrates competitive or superior performance at both common IoU thresholds relative to all comparison models. At IoU=0.5, YOLOv11-LA achieves mAP improvements of 0.08%, 0.03%, 0.45%, and -0.05% compared to YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n, respectively. Although numerically close to YOLOv11n, YOLOv11-LA demonstrates a more significant advantage under the stricter mAP@0.5:0.95 metric, achieving improvements of 1.69%, 1.25%, and 1.9% over YOLOv5n, YOLOv10n, and YOLOv11n, respectively, indicating its enhanced robustness across multiple scales and boundary conditions.
In computational complexity, YOLOv11-LA achieves 3.1 GFLOPs, reducing computational load by approximately 56.9%, 62.2%, 63.1%, and 51.6% compared to YOLOv5n, YOLOv8n, YOLOv10n, and YOLOv11n, respectively. YOLOv11-LA also demonstrates a significant advantage in parameter count: with 953,682 parameters, it amounts to only 38.0% of YOLOv5n (2,508,854), 31.7% of YOLOv8n (3,011,238), 35.2% of YOLOv10n (2,707,820), and 36.8% of YOLOv11n (2,590,230). This compression of roughly 2.5 to 3 times significantly reduces memory consumption, demonstrating strong lightweight characteristics and potential for edge deployment.
Although YOLOv8n exhibits good accuracy on certain categories, its parameter count and computational cost are significantly higher than YOLOv11-LA. Meanwhile, YOLOv11-LA achieves higher detection accuracy on most key categories while maintaining low complexity, demonstrating an excellent performance-complexity tradeoff. Figure 9 displays the Precision-Recall curve and Recall-Confidence curve of the YOLOv11-LA model on the rice seed detection task. The model demonstrates outstanding performance across both categories (refer and rice-seed), achieving peak accuracies of 0.995 and 0.987, respectively. Its overall mAP@0.5 reaches 0.991, indicating exceptional discrimination capabilities in multi-class object recognition.
4.4.2 Classification model comparison experiment
To comprehensively evaluate the classification performance of the RiceLCNN model, this study compared it against current mainstream deep learning classification models, including MobileNetV2 (Model A), MobileNetV3 (Model B, using small variant), Xception (Model C), ResNet50 (Model D), EfficientNetV2 (Model E), ShuffleNetV2 (Model G, using 0.5x variant) and Vision Transformer (Model H, using ViT-B/16). RiceLCNN was evaluated as the comparison model F. Specific comparison results include the number of parameters, training speed (seconds per epoch, abbreviated as “Speed”), classification accuracy (Accuracy, abbreviated as “Acc”), precision (P), recall (R), and F1 score for each model, as shown in Table 6.
Table 6 shows that RiceLCNN achieves the highest prediction accuracy (89.78%) among all comparison models, demonstrating outstanding classification performance. With only 527,913 parameters, RiceLCNN is considerably more lightweight than large-scale models such as ResNet50 (24,045,897) and Xception (20,045,897), giving it a pronounced advantage in resource efficiency. In terms of training efficiency, RiceLCNN also performs exceptionally well, completing a single training cycle in just 15.19 seconds—approximately 39.8% faster than MobileNetV3-small (25.23 seconds)—further highlighting its effectiveness. By contrast, the Vision Transformer (ViT-B/16, Model H) exhibited substantially poorer classification performance than all CNN-based counterparts, achieving only 65.89% accuracy—significantly lower than RiceLCNN (89.78%) and even lightweight CNN models such as MobileNetV3 (87.77%). Despite having the largest representation capacity with 85.8M parameters, ViT was markedly less efficient, requiring about 240.03 ± 20 seconds per epoch, which is over 15 times slower than RiceLCNN. The inferior performance of ViT can be explained by several factors. First, Vision Transformers generally demand large-scale datasets to fully exploit their self-attention mechanism, while the rice seed dataset used in this study is relatively limited in both size and diversity. This makes ViT more susceptible to underfitting or unstable feature learning, especially in fine-grained classification tasks such as distinguishing visually similar seed categories. Second, unlike CNNs, ViT lacks strong inductive biases (e.g., locality and translation equivariance) that are particularly advantageous when training data are scarce. As a result, whereas CNN-based models can effectively capture the local texture and morphological cues of rice grains, ViT struggles to generalize under the same conditions.
Regarding key performance evaluation metrics, RiceLCNN achieved the highest scores across accuracy (89.78%), recall (89.93%), and F1 score (89.81%), outperforming all other comparison models. It demonstrated superior stability and robustness, particularly in negative class identification and marginally classified sample discrimination. Figure 10 illustrates the accuracy and loss trends of RiceLCNN during training, validation, and testing. As shown in Figure 10a, the model’s accuracy rapidly improves and stabilizes within the first 15 to 20 training epochs. The training set accuracy approaches 95%, while the validation and test set accuracies stabilize around 89% with no noticeable overfitting. The loss curve in Figure 10b further validates the model’s convergence and training efficiency, with training, validation, and testing losses all rapidly decreasing and converging below 0.2 within 20 epochs.
Figure 10. (a) Training–val–test accuracy of RiceLCNN for 50 epochs (epoch 0 to 50) (b) Training–val–test loss of RiceLCNN for 50 epochs (epoch 0 to 50).
Figure 11 visualizes RiceLCNN’s classification performance on the test set. Figure 11a shows the confusion matrix for nine rice grain categories. The diagonal region indicates high recognition accuracy across all categories with minimal classification error. Figure 11b displays the precision-recall (PR) curves for each category. The PR curves for most categories approach the upper-right corner, indicating that the model maintains high precision while preserving high recall, demonstrating strong overall classification robustness.
Figure 11. (a) Confusion matrix (b) precision recall curve of RiceLCNN model on test set of rice grain dataset.
4.4.3 Public dataset evaluation and generalization ability verification
To further verify the generalization ability and classification robustness of the RiceLCNN model in different data domains, this paper selected the public rice seed image dataset provided by Srinagar, Sher-Kashmir Agricultural and Technological University (SKUAST) as an independent testing platform (Mohi Ud Din et al., 2024). The dataset includes five typical rice varieties: Jehlum, Mushkibudji, Sr-1, Sr-2, and Sr-4, comprising a total of 4,748 high-quality images (as shown in Figure 12). The dataset was collected using a high-resolution flatbed scanner (HP Scanjet 200), set to 200 dpi optical resolution and 24-bit color depth. Each image contains 24 seeds of the same rice variety, arranged randomly, with the background covered by black paper to ensure imaging consistency. All images are saved in PNG format, maximizing the preservation of grain arrangement, detailed features, and appearance differences, and presenting a certain level of image complexity and challenge.
Figure 12. (a) Rice variety: Jehlum; (b) Rice variety: Mushkibudji; (c) Rice variety: Sr-1; (d) Rice variety: Sr-2; (e) Rice variety: Sr-3.
For training and evaluation, the dataset was divided into a training set (3,278 images), a validation set (735 images), and a test set (735 images). The test set was used for independent evaluation of the RiceLCNN model, with its classification performance shown in Table 7 and the confusion matrix shown in Figure 13. The results show that RiceLCNN achieves excellent performance on this public rice dataset, with recognition accuracy and recall rates for the Jehlum and Mushkibudji varieties approaching or reaching 100%, and F1 scores of 99.66% and 98.64%, respectively. The F1 scores for the remaining three classes, Sr-1, Sr-2, and Sr-4, are 94.16%, 95.59%, and 93.52%, respectively. The overall accuracy of the model is 96.32%, and the Macro average metric remains between 96.31% and 96.32%, demonstrating good generalization ability and stability.
As shown in Figure 13, RiceLCNN achieved 100% accurate classification for Jehlum and Mushkibudji samples; There is some confusion between Sr series varieties, primarily between Sr-1 and Sr-2, as well as Sr-4, but the overall misclassification rate is low, indicating that the model has high discriminative ability in most categories and maintains good robustness in categories with similar phenotypic characteristics.
4.5 Implementation and deployment
To balance mobile inference efficiency and recognition accuracy, this paper constructs an integrated rice seed recognition system incorporating four key components: “lightweight detection—fine-grained classification—online tracking—morphometric measurement.” YOLOv11-LA handles instance-level localization, RiceLCNN performs single-grain classification, DeepSORT maintains cross-frame ID consistency, and the sub-pixel measurement module outputs phenotypic metrics like length and width (Figure 14). Experimental results demonstrate prediction-to-actual measurement errors within 0.1 mm.
Figure 14. (a) Measurement of the actual length of one of the rice seeds; (b) “Total” indicates that a total of 7 rice grains were identified, all predicted as variety 2022-8 (the ground-truth variety is also 2022-8). For each individual grain, “class” denotes the predicted category, “ID” represents the unique identity assigned during tracking, and “Time” indicates the tracking duration. “h” and “w” refer to the grain’s length and width in millimeters, respectively. “cof” refers to the classification confidence score—the higher the value, the more reliable the classification result (note: this is the confidence of the classification, not of the detection).
The system is deployed on a Jetson Orin Nano 8 GB platform (single input channel, batch = 1) with Super power mode enabled; key hardware parameters are listed in Table 8. This platform employs a Unified Memory Architecture (UMA), in which the GPU has no dedicated VRAM and shares LPDDR5 system memory with the CPU.
To evaluate the performance of different “detection/classification” combinations, we employ a unified pipeline: “detection → tracking → classification → measurement.” To ensure comparability, all combinations use the same input, with consistent threshold settings and post-processing strategies across configurations. Reported metrics include: throughput FPS (steady-state mean ± standard deviation), end-to-end latency percentiles p50–p99, GPU memory peak, and process resident memory (RSS) peak. All tests were conducted after 5 warm-up passes, with statistics aggregated from 100 consecutive inferences. Deployment results for both input resolutions are shown in Table 9.
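For reference, a minimal Python sketch of this benchmarking protocol (the `run_pipeline` callable stands in for the detection → tracking → classification → measurement chain and is an assumption):

```python
import time
import numpy as np

def benchmark(run_pipeline, frames, warmup=5, runs=100):
    """Measure steady-state FPS and p50/p99 end-to-end latency over consecutive inferences."""
    for frame in frames[:warmup]:
        run_pipeline(frame)                        # warm-up passes, not timed
    latencies = []
    for frame in frames[warmup:warmup + runs]:
        t0 = time.perf_counter()
        run_pipeline(frame)
        latencies.append(time.perf_counter() - t0)
    lat = np.asarray(latencies)
    fps = 1.0 / lat
    return {
        "fps_mean": fps.mean(),
        "fps_std": fps.std(),
        "p50_ms": np.percentile(lat, 50) * 1e3,
        "p99_ms": np.percentile(lat, 99) * 1e3,
    }
```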
Furthermore, to characterize the speed-accuracy tradeoff, we selected YOLOv11n as a control detector alongside the YOLOv11-LA + RiceLCNN baseline. For the control classifier, we employed Xception, the second-highest performing model on this dataset after RiceLCNN, yielding the combination YOLOv11n–Xception. Both approaches were evaluated under identical post-processing and evaluation protocols to highlight the impact of model size and computational load on real-time performance.
5 Discussion
Experimental results indicate differences in classification accuracy among fertilization treatments: First, 2022-1 and 2022-2 exhibited the most pronounced bidirectional misclassification, suggesting that visible grain phenotypes remain highly similar when both the “straw” and “enzyme” factors vary simultaneously.
Second, 2022-6 and 2022-7 exhibited misclassification almost exclusively between themselves, differing solely in the application of enzymes. This suggests enzymes exert only marginal effects on grain appearance (primarily the image-visible phenotype and color/chalkiness traits), as illustrated in Figure 11. This aligns with the established understanding that “nitrogen fertilizer primarily influences taste-related physicochemical properties, exerting limited and dose-dependent effects on appearance phenotypes” (Guo et al., 2023; Liang et al., 2021; Shi et al., 2022). In contrast, the classification of 2022-4 and 2022-8 was nearly “pure”, both receiving combined chemical fertilizer application. This suggests chemical fertilizer treatment generates stable phenotypic signals readily captured by visual models, aligning with field evidence that “optimized nitrogen-potassium management significantly reduces chalkiness incidence and severity” (Guo et al., 2024; Zhang et al., 2025).
Building upon this, we deployed a “detection-tracking-classification-quantification” pipeline on a Jetson Orin Nano 8 GB edge device: YOLOv11-LA serves as the front-end detector, integrated with DeepSORT for continuous instance-level ID tracking. RiceLCNN then performs fine-grained single-grain classification, coupled with sub-pixel measurement (error within 0.1 mm) to obtain key phenotypic traits including length, width, aspect ratio, and roundness. At 1080×1080 input resolution, the lightweight YOLOv11-LA–RiceLCNN combination achieves a 16.7% FPS improvement over YOLOv11n–Xception with reduced latency: p50/p99 latency decreases by approximately 11.3% and 10.5%, respectively, while GPU memory and process memory usage are reduced by 50% and 7.1%, respectively. At the high-resolution input of 3200×3200, the lightweight combination’s advantages become even more pronounced: FPS increases from 2 to 4, p50/p99 latency decreases by approximately 37.2% and 37.3%, respectively, GPU memory drops from 1.9 GB to 0.6 GB (a 68.4% reduction), and RSS memory decreases by 7.1%. As shown in Table 9, this demonstrates that the lightweight combination excels in high-resolution scenarios while meeting edge-device online processing constraints. In rice variety identification, higher resolutions provide finer-grained texture and contour information. Furthermore, measured frame rates and resource utilization demonstrate the system’s online processing capability. It supports rapid screening and parent selection for breeding and high-throughput phenotyping, random sampling and traceability for variety admixture, and quantitative assessment of treatment effects (e.g., fertilization, crop residue incorporation, enzyme application) within the field-to-postprocessing workflow. On quality control and processing lines, thresholding rules based on length, area, roundness, and variety category can generate actionable grade labels and anomaly alerts. Leveraging DeepSORT significantly reduces duplicate counting risks, enabling flexible switching between sampling and offline verification. Furthermore, research indicates that cross-varietal recognition rates generally exceed those of intra-varietal fertilization treatments, as shown in Table 7. This suggests that the explanatory power of genotype main effects surpasses that of fertilizer efficacy main effects, aligning with findings that rice quality is jointly regulated by genotype-environment interactions, with genotype contributions often dominant (Yu et al., 2023; Chen et al., 2012).
It should be noted that this study primarily covers imaging scenarios with “adhesion but minimal occlusion”; under conditions of severe overlap or strong occlusion, front-end detection may require integration with instance segmentation or geometric separation to maintain recall. Domain shifts across camera positions and lighting conditions may also cause performance fluctuations, necessitating mitigation through small-scale retraining or adaptive strategies. Concurrently, potential confounders such as soil moisture, pest/disease pressure, microclimate, and variations in harvest and imaging batches may still influence phenotypic expression and model classification.
6 Conclusions and future work
This study proposes and validates an integrated intelligent rice seed recognition system encompassing detection, tracking, classification, and morphometric analysis, achieving a favorable balance between model lightweighting, recognition accuracy, and edge deployability. It employs YOLOv11-LA as the front-end detector, utilizes DeepSORT for instance-level ID association, and leverages RiceLCNN for precise single-grain classification. Combined with sub-pixel measurement, it outputs key traits including length, width, aspect ratio, and roundness. The system operates stably on edge devices such as the Jetson Orin Nano 8 GB, enabling real-time processing for breeding screening, data collection, and quality monitoring on processing lines. A systematic analysis based on confusion matrices further indicates that, compared to fertilization treatments, variety (genotype) explains a higher proportion of seed appearance phenotypes. Among these, enzyme application alone has a marginal impact on visible phenotypes, while differences induced by chemical fertilizer treatments are more readily captured by visual models. These findings provide an efficient, stable, and engineering-feasible technical pathway for intelligent seed recognition and crop phenotyping.
Future work will focus on enhancing the system’s generalization capability and application depth in the following directions: (1) incorporate multi-modal sensing (e.g., near-infrared/hyperspectral imaging) and cross-modal feature alignment to enhance recognition and quantitative characterization across diverse physiological states and environmental conditions; (2) conduct robustness enhancement and domain adaptation studies under non-ideal natural conditions (backlighting, mud spots, strong shadows, cross-camera/cross-lighting), ensuring model robustness and interpretability through hierarchical cross-validation and uncertainty estimation; (3) promote deep integration with field agricultural machinery and IoT nodes to construct a closed-loop “perception-decision-execution” control prototype that supports online grading, anomaly removal, and adaptive adjustment of operational parameters; and (4) extend testing to multi-varietal datasets and real field environments, and explore integration with breeding programs to support large-scale phenotyping and accelerate genetic improvement, further advancing the digitalization and precision of agricultural production.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/5120191452/RiceLCNN.
Author contributions
DZ: Methodology, Software, Writing – original draft. SS: Funding acquisition, Resources, Writing – review & editing. JL: Validation, Writing – review & editing. WX: Data curation, Resources, Writing – review & editing. NX: Formal Analysis, Visualization, Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was financially supported by the Natural Science Foundation of Jilin Province (No. 20220101144JC).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2025.1673143/full#supplementary-material
References
Aukkapinyo, K., Sawangwong, S., Pooyoi, P., and Kusakunniran, W. (2019). Localization and classification of rice-grain images using region proposals-based convolutional neural network. Mach. Intell. Res. 16, 185–195. doi: 10.1007/s11633-019-1207-6
Bao, Y., Kang, G., Yang, L., Duan, X., Zhao, B., and Zhang, B. (2025). Normalizing batch normalization for long-tailed recognition. IEEE Trans. Image Process. 34, 209–220. doi: 10.1109/tip.2024
Chen, Y., Wang, M., and Ouwerkerk, P. B. F. (2012). Molecular and environmental factors determining grain quality in rice. Food Energy Secur. 1, 111–132. doi: 10.1002/fes3.11
Cui, S., Ma, X., Wang, X., Zhang, T.-A., Hu, J., Tsang, Y. F., et al. (2019). Phenolic acids derived from rice straw generate peroxides which reduce the viability of staphylococcus aureus cells in biofilm. Ind. Crops Products 140, 111561. doi: 10.1016/j.indcrop.2019.111561
Ding, E., Cheng, Y., Xiao, C., Liu, Z., and Yu, W. (2021). Efficient attention mechanism for dynamic convolution in lightweight neural network. Appl. Sci. 11, 3111. doi: 10.3390/app11073111
Guo, X., Wang, L., Zhu, G., Xu, Y., Meng, T., Zhang, W., et al. (2023). Impacts of inherent components and nitrogen fertilizer on eating and cooking quality of rice: A review. Foods 12, 2495. doi: 10.3390/foods12132495
Guo, C., Zhang, L., Jiang, P., Yang, Z., Chen, Z., Xu, F., et al. (2024). Grain chalkiness is decreased by balancing the synthesis of protein and starch in hybrid indica rice grains under nitrogen fertilization. Foods 13, 855. doi: 10.3390/foods13060855
Huo, X., Wang, J., Chen, L., Fu, H., Yang, T., Dong, J., et al. (2023). Genome-wide association mapping and gene expression analysis reveal candidate genes for grain chalkiness in rice. Front. Plant Sci. 14. doi: 10.3389/fpls.2023.1184276
Jin, C., Zhou, X., He, M., Li, C., Cai, Z., Zhou, L., et al. (2024). A novel method combining deep learning with the kennard–stone algorithm for training dataset selection for image-based rice seed variety identification. J. Sci. Food Agric. 104, 8332–8342. doi: 10.1002/jsfa.13668
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., et al. (2022). ultralytics/yolov5, Vol. v6.2. doi: 10.5281/zenodo.7002879
Karunasena, G., Priyankara, H., and Madushanka, B. (2020). Machine vision techniques for improve rice grain quality analyzing process. Int. J. Innovative Sci. Res. Technol. 5, 1001–1005. doi: 10.38124/IJISRT20JUN691
Kiratiratanapruk, K., Temniranrat, P., Sinthupinyo, W., Prempree, P., Chaitavon, K., Porntheeraphat, S., et al. (2020). Development of paddy rice seed classification process using machine learning techniques for automatic grading machine. J. Sensors 2020, 7041310. doi: 10.1155/2020/7041310
Kurade, C., Meenu, M., Kalra, S., Miglani, A., Neelapu, B. C., Yu, Y., et al. (2023). An automated image processing module for quality evaluation of milled rice. Foods 12, 1273. doi: 10.3390/foods12061273
Li, P., Chen, Y.-H., Lu, J., Changquan, Z., Liu, Q.-Q., and Li, Q.-F. (2022). Genes and their molecular functions determining seed structure, components, and quality of rice. Rice 15, 18. doi: 10.1186/s12284-022-00562-8
Li, D., Wen, G., Kuai, Y., Zhu, L., and Porikli, F. (2020). Robust visual tracking with channel attention and focal loss. Neurocomputing 401, 295–307. doi: 10.1016/j.neucom.2019.10.041
Liang, H., Tao, D., Zhang, Q., Zhang, S., Wang, J., Liu, L., et al. (2021). Nitrogen fertilizer application rate impacts eating and cooking quality of rice after storage. PloS One 16, e0253189. doi: 10.1371/journal.pone.0253189
Liu, Q., Wu, T., Deng, Y., and Liu, Z. (2023). Se-yolov7 landslide detection algorithm based on attention mechanism and improved loss function. Land 12, 1522. doi: 10.3390/land12081522
Mohi Ud Din, N., Assad, A., Dar, R. A., Rather, S. A., Bhat, M. A., Mushtaq, U., et al. (2024). Ricenet: A deep convolutional neural network approach for classification of rice varieties. Expert Syst. Appl. 235, 121214. doi: 10.1016/j.eswa.2023.121214
Ning, L., Sun, S., Zhou, L., Zhao, N., Wu, J., Wang, W., et al. (2023). High-throughput instance segmentation and shape restoration of overlapping vegetable seeds based on sim2real method. Measurement, 112414. doi: 10.2139/ssrn.4195243
Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66. doi: 10.1109/TSMC.1979.4310076
Pawłowski, J., Kołodziej, M., and Majkowski, A. (2024). Implementing yolo convolutional neural network for seed size detection. Appl. Sci. 14, 6294. doi: 10.3390/app14146294
Phan, T.-T.-H., Ho, H.-T., and Hoang, T.-N. (2023). “Investigating yolo models for rice seed classification,” in Lecture notes in networks and systems (Cham, Switzerland: Springer), 181–192. doi: 10.1007/978-3-031-25072-2_15
Phan, T.-T.-H., Vo, T., and Nguyen, H.-D. (2024). A novel method for identifying rice seed purity using hybrid machine learning algorithms. Heliyon. 10, e33941. doi: 10.1016/j.heliyon.2024.e33941
Qingge, L., Zheng, R., Zhao, X., Wei, S., and Yang, P. (2020). An improved otsu threshold segmentation algorithm. Int. J. Comput. Sci. Eng. 22, 146. doi: 10.1504/IJCSE.2020.10029225
Rajalakshmi, R., Faizal, S., Sivasankaran, S., and Geetha, R. (2024). Riceseednet: Rice seed variety identification using deep neural network. J. Agric. Food Res. 16, 101062. doi: 10.1016/j.jafr.2023.101062
Rasheed, A. F. and Zarkoosh, M. (2025). Yolov11 optimization for efficient resource utilization. J. Supercomputing 81, 1–21. doi: 10.1007/s11227-025-07520-3
Rathnayake, N., Miyazaki, A., Dang, T. L., and Hoshino, Y. (2023). Age classification of rice seeds in Japan using gradient-boosting and anfis algorithms. Sensors 23, 2828. doi: 10.3390/s23052828
Shafik, W., Tufail, A., Liyanage De Silva, C., and Awg Haji Mohd Apong, R. A. (2025). A novel hybrid inception–xception convolutional neural network for efficient plant disease classification and detection. Sci. Rep. 15, 82857. doi: 10.1038/s41598-024-82857-y
Sharma, K., Sethi, G., and Bawa, R. (2024). A comparative analysis of deep learning and deep transfer learning approaches for identification of rice varieties. Multimedia Tools Appl. 84, 6825–6842. doi: 10.1007/s11042-024-19126-7
Shi, S., Zhang, G., Li, L., Chen, D., Liu, J., Cao, C., et al. (2022). Effects of nitrogen fertilizer on the starch structure, protein distribution, and quality of rice. ACS Food Sci. Technol. 2, 1347–1354. doi: 10.1021/acsfoodscitech.2c00155
Sun, C., Liu, T., Ji, C., Jiang, M., Tian, T., Guo, D., et al. (2014). Evaluation and analysis of rice chalkiness by image processing. J. Cereal Sci. 59, 12–18. doi: 10.1016/j.jcs.2014.04.009
Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al. (2024a). Yolov10: Real-time end-to-end object detection. arXiv.
Wang, J., Li, S., An, Z., Jiang, X., Qian, W., and Ji, S. (2019). Batch-normalized deep neural networks for achieving fast intelligent fault diagnosis of machines. Neurocomputing 329, 53–65. doi: 10.1016/j.neucom.2018.10.049
Wang, C.-Y., Yeh, I.-H., and Liao, H.-Y. M. (2024b). Yolov9: Learning what you want to learn using programmable gradient information. arXiv. doi: 10.48550/ARXIV.2402.13616
Wen, G., Li, M., Tan, Y., Shi, C., Luo, Y., and Luo, W. (2025). Enhanced yolov8 algorithm for leaf disease detection with lightweight gocr-elan module and loss function: Wsiou. Comput. Biol. Med. 186, 109630. doi: 10.1016/j.compbiomed.2024.109630
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. (2018). “Cbam: Convolutional block attention module,” in Proceedings of the European Conference on Computer Vision (ECCV) (Cham, Switzerland: Springer). doi: 10.48550/arXiv.1807.06521
Yan, F., Sun, Y., Xu, H., Yin, Y., Wang, H., Wang, C., et al. (2018). Effects of wheat straw mulch application and nitrogen management on rice root growth, dry matter accumulation and rice quality in soils of different fertility. Paddy Water Environ. 16, 507–518. doi: 10.1007/s10333-018-0643-1
Yan, Z., Wu, Y., Zhao, W., Zhang, S., and Li, X. (2025). Research on an apple recognition and yield estimation model based on the fusion of improved yolov11 and deepsort. Agriculture 15, 765. doi: 10.3390/agriculture15070765
Yang, D., Wang, Y., and Wu, Q. (2023). Impact of tillage and straw management on soil properties and rice yield in a rice-ratoon rice system. Agronomy 13, 1762. doi: 10.3390/agronomy13071762
Yu, J., Zhu, D., Zheng, X., Shao, L., Fang, C., Yan, Q., et al. (2023). The effects of genotype × environment on physicochemical and sensory properties and differences of volatile organic compounds of three rice types (oryza sativa l.). Foods 12, 3108. doi: 10.3390/foods12163108
Zhang, X., Li, Y., Dong, J., Sun, Y., and Fu, H. (2025). Split application of potassium reduces rice chalkiness by regulating starch accumulation process under high temperatures. Agronomy 15, 116. doi: 10.3390/agronomy15010116
Zhao, J., Ma, Y., Yong, K., Zhu, M., Wang, Y., Wang, X., et al. (2023). Rice seed size measurement using a rotational perception deep learning model. Comput. Electron. Agric. 205, 107583. doi: 10.1016/j.compag.2022.107583
Keywords: YOLOv11-LA, RiceLCNN, rice seeds, object detection, classification
Citation: Zhang D, Song S, Liu J, Xu W and Xiayidan N (2025) Real-time segmentation and phenotypic analysis of rice seeds using YOLOv11-LA and RiceLCNN. Front. Plant Sci. 16:1673143. doi: 10.3389/fpls.2025.1673143
Received: 30 July 2025; Revised: 10 November 2025; Accepted: 20 November 2025;
Published: 08 December 2025.
Edited by:
Chengcheng Chen, Shenyang Aerospace University, China
Reviewed by:
Zhenping Qiang, Southwest Forestry University, China
Guodong Sun, Beijing Forestry University, China
Xiaofei Fan, Hebei Agricultural University, China
Copyright © 2025 Zhang, Song, Liu, Xu and Xiayidan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Weiwei Xu, xuww@jlenu.edu.cn