
ORIGINAL RESEARCH article

Front. Plant Sci., 21 January 2026

Sec. Sustainable and Intelligent Phytoprotection

Volume 16 - 2025 | https://doi.org/10.3389/fpls.2025.1734022

This article is part of the Research Topic: Advanced Imaging and Phenotyping for Sustainable Plant Science and Precision Agriculture 4.0

IV-YOLO: an information vortex-based progressive fusion method for accurate rice detection

Jianxiang Zhang1*, Liexiang Huangfu2, Yanling Zhao1, Chao Xue1, Congfei Yin1, Jiankang Lu1,3, Jia Mei3
  • 1College of Agronomy and Horticulture, Jiangsu Vocational College of Agriculture and Forestry, Jurong, Jiangsu, China
  • 2School of Medicine, Nantong University, Nantong, Jiangsu, China
  • 3Jiangsu Zhongjiang Seed Co., Ltd., Nanjing, Jiangsu, China

In the context of precision agriculture, feature adhesion between rice plants and background interference in UAV remote sensing images make it difficult for traditional models to meet the requirements of individual plant-level detection. To address this, this paper proposes an Information Vortex-based progressive fusion YOLO (IV-YOLO) model. First, a Multi-scale Spiral Information Vortex (MSIV) module is designed, which disentangles adhered rice plant features and decouples background clutter through multi-scale rotational kernel convolution and channel-spatial joint reconstruction. Second, a Gradual Feature Fusion Neck (GFEN) is constructed to synergize the high-resolution details of shallow features (such as tiller edges and panicle textures) with the high-level semantics of deep features, generating multi-scale feature representations that are both discriminative and complete. Experiments conducted on the public DRPD dataset show that IV-YOLO achieves a Precision of 0.8581, outperforming YOLOv5–YOLOv11 and FRPNet across all metrics. This study provides a reliable technical solution for individual plant-level rice monitoring and facilitates the large-scale implementation of precision agriculture.

1 Introduction

With the continuous growth in global demand for food security, coupled with the pressure of sustainable agricultural development, the extensive management model of traditional agriculture can no longer meet the requirements of resource constraints and environmental protection (Zeng et al., 2025). Against this backdrop, precision agriculture has emerged as a modern agricultural paradigm driven by information technology (Petrovic et al., 2024). Its core essence lies in breaking the cycle of extensive resource input in traditional agriculture through data-driven “on-demand regulation” (Patricio and Rieder, 2018), thereby achieving the coordinated improvement of production efficiency and environmental benefits (Cano et al., 2025; Diao et al., 2025).

As one of the most widely cultivated staple food crops globally, rice sustains the dietary needs of over half the world’s population, and the stability of its production is directly linked to regional food security and livelihood security (Ahmed et al., 2022). The accurate detection of rice field targets, in turn, serves as a crucial pillar for the implementation of precision agriculture (Santoso et al., 2024). Decisions across all precision management stages, including seedling density monitoring and disaster assessment, depend on the precise identification and quantitative analysis of rice targets (Wu S. et al., 2025). Detection accuracy not only determines the spatial precision of resource input but also profoundly influences final yield formation and the achievement of production efficiency (Sun X. et al., 2024). Breakthroughs in rice detection accuracy are a prerequisite for scaling precision agriculture from laboratory research to large-scale field applications. Thus, research focusing on the accurate detection of rice holds both academic value and practical significance in advancing the precision and sustainability of rice production (Yu et al., 2024; Jin et al., 2025).

Traditional rice monitoring methods struggle to meet the technical requirements of “high precision, wide coverage, and rapid response” for precision agriculture (Yu et al., 2025). While manual sampling surveys can obtain local true values, they are limited by sample size and subjectivity, failing to generate spatial distribution information across the entire field; moreover, they are inefficient in large-scale rice paddies (Yu et al., 2025). Fixed ground-based sensors, although capable of continuous monitoring, have a limited coverage range and are vulnerable to interference such as field obstructions and water flow impacts, resulting in insufficient data continuity and spatial representativeness (Sanaeifar et al., 2023). The rise of Unmanned Aerial Vehicle (UAV) remote sensing technology has provided an ideal “aerial sensing platform” for precise rice detection (Khan et al., 2022). Compared with satellite remote sensing, UAVs can capture centimeter-level high-resolution images at a flight altitude of 10–30 meters and flexibly carry sensors and intelligent payloads for real-time data acquisition and processing, satisfying the demand for identifying microscopic targets (e.g., individual rice plants) in precision agriculture (Yuan et al., 2024; Gade et al., 2025; Kizielewicz et al., 2025).

Although object detection technology has shown strong adaptability to precision agriculture, accurate rice detection in UAV images is still restricted by two technical bottlenecks: the complexity of agricultural scenarios and the biological characteristics of rice plants (Chen et al., 2023; Xiao et al., 2025), which make it difficult to translate detection accuracy into actual production gains (Song et al., 2025). The complex field environment is illustrated in Figure 1. Rice detection faces several core challenges (Qu et al., 2025). On one hand, the dense and intertwined growth of rice plants easily causes feature blurring: the overlapping areas of leaves between tillers and main stems, as well as the overlapping regions of panicles and roots, are highly coupled from the remote sensing perspective, making it difficult for models to accurately separate and extract effective features of individual rice plants (Wu H. et al., 2025; Zhou et al., 2025). On the other hand, the complex field background generates multiple sources of clutter interference, including dappled sunlight, soil clod shadows, residual film reflections, and shadow outlines cast by rice leaves on the background or other leaves. Traditional detection methods struggle to achieve high-precision rice identification and phenotypic information extraction in such complex field scenarios (Liang et al., 2024).


Figure 1. Complex field environment.

In paddy field irrigation, the rotational shear force of water vortices disperses impurities around rice root systems and improves the environment near the plant canopy, preventing phenotypes from being occluded by impurities. Inspired by this mechanism, this study proposes the Multi-scale Spiral Information Vortex (MSIV) module. By directionally reorganizing rice feature channels, the module decouples the high-dimensional features entangled between tillers and main stems and enhances the phenotypic contours of individual plants; meanwhile, it uses feature perturbation to break the correlation between background clutter and rice features, achieving clutter decoupling and endowing the network with the ability to capture pixel-level features of individual rice plants. Building on the MSIV module, this study further designs the Gradual Feature Fusion Neck (GFEN), which fuses high-resolution shallow features with high-semantic deep features, retaining fine details such as the edges and textures of individual plants while integrating the global semantics of rice populations, thereby providing highly discriminative multi-scale features for the detection head.

Existing YOLO-series models offer efficient inference but suffer from high miss-detection rates in dense rice canopies and clutter-interfered environments, while advanced networks such as FRPNet strengthen semantic fusion yet struggle to capture fine-grained features such as individual plant edges and panicle textures. The “feature decoupling-progressive fusion” pipeline formed by MSIV and GFEN addresses these limitations: spiral information reorganization disentangles adhered features, and hierarchical fusion balances detail preservation with semantic abstraction, which is crucial for the accurate detection and quantification of key phenotypic parameters such as panicle number.

The contributions of this study are summarized as follows:

1. A Multi-scale Spiral Information Vortex (MSIV) module is proposed. Inspired by the water vortex mechanism in rice paddies, this module achieves the disentanglement of adhered rice plant features and active decoupling of background clutter through multi-scale spiral-oriented reorganization of feature channels, providing technical support for the accurate pixel-level capture of individual rice plant features.

2. A Gradual Feature Fusion Neck (GFEN) is designed. This neck synergizes the high-resolution details of shallow features with the high-semantic information of deep features: it not only preserves fine-grained features of individual rice plants (e.g., edges and textures) but also integrates the global spatial semantics of rice plant populations, thereby providing the detection head with more discriminative multi-scale feature representations.

3. The proposed model outperforms comparative models (including YOLO-series models and FRPNet) in core metrics such as Precision, Recall, and AP. It effectively improves the monitoring accuracy of dense rice plant canopies, breaks through the bottleneck of detection accuracy in complex rice field scenarios, and supports the needs of rice target monitoring in precision agriculture.

2 Related work

2.1 Object detection in agriculture

Target detection technology has become a core pillar of the “visual perception-quantitative analysis-precision decision-making” chain in precision agriculture (Diao et al., 2025; Khan et al., 2025; Wang W. et al., 2025). Its development has yielded diversified technology adaptation pathways tailored to the detection requirements of different agricultural scenarios, enabling technical breakthroughs across three core application areas: crop ontology monitoring, pest and weed control, and farmland resource management (Wu et al., 2023; Khan et al., 2025; Wang W. et al., 2025; Zhang et al., 2025).

With the continuous advancement of artificial intelligence and sensing technologies, yield calculation solutions based on image processing techniques offer advantages such as high precision, low cost, and non-destructive measurement. As an effective means to improve cultivation efficiency and optimize planting strategies, such solutions have attracted significant attention from researchers (Mohammed et al., 2024; Yu et al., 2024). The field of target detection has evolved into three core methodological categories: single-stage detection methods (Liu et al., 2016; Redmon et al., 2016), two-stage detection methods (Girshick, 2015), and vision transformer-based methods (Carion et al., 2020; Xu et al., 2022). For agricultural tasks, against the backdrop of Industry 4.0, one of the most critical challenges is to enhance the efficiency of sectors such as agriculture through the adoption of intelligent sensors and advanced computing (Melnychenko et al., 2024). Single-stage detection methods, typified by YOLO, strike a balance between speed and precision with relatively high inference efficiency—this aligns perfectly with the demands of agricultural applications such as rice detection (Chen et al., 2025).

In scenarios of crop ontology monitoring and yield estimation, target detection technology plays a pivotal role, adapting to the morphological characteristics and growth stages of different crops (Paul et al., 2024; Pan et al., 2025; Shen et al., 2025). A previous study (Liang et al., 2024) proposed a rice panicle detection model that integrates the Circular Smooth Label (CSL) method with the YOLOv5 framework, incorporating efficient attention mechanisms (i.e., Shuffle Attention (SA) and Gather-Excite Attention (GEA)). This model reduces the misdetection of overlapping panicles in field environments, enhances robustness under complex field conditions, and achieves accurate detection and counting of rice panicles. RP-YOLO (Sun J. et al., 2024), developed based on YOLOv5n, is a real-time rice panicle density detection method designed for unmanned harvesters. It optimizes the YOLO architecture through multiple techniques, including enhanced target detection heads and reconfigured backbone networks.

Accurate detection and counting of rice panicles via Unmanned Aerial Vehicles (UAVs) in field environments is a key focus of rice research. Because rice panicles are flexible and slender and grow in dense, overlapping arrangements, panicle detection in UAV images poses substantial challenges. The rotated panicle detection model of Liang et al. (2024) achieves accurate panicle detection and counting, with its effectiveness validated for in-field rice yield estimation: the Circular Smooth Label (CSL) method incorporates panicle orientation information into the You Only Look Once Version 5 (YOLOv5) model, efficient attention mechanisms (SA and GEA) are fused, and a GSConv convolution replacement strategy is adopted. By classifying panicle orientations, oriented bounding boxes fit more tightly to panicle contours, which reduces the misdetection of overlapping panicles in fields, minimizes redundant information in bounding boxes, and enhances the robustness of panicle detection under complex field conditions, thereby improving detection precision.

For fruit detection in cash crops (e.g., pineapples and citrus) where targets are prone to canopy occlusion, CURI-YOLOv7 (Zhang et al., 2023), based on YOLOv7-tiny, proposes a lightweight individual citrus tree detector suitable for UAV images. It designs a backbone network based on depthwise separable convolution and incorporates MobileOne blocks, while expanding the mosaic dataset through morphological transformations and Mixup augmentation. YOLO-Leaf (Li T. et al., 2024) utilizes Dynamic Snake Convolution (DSConv) for robust feature extraction, adopts BiFormer to enhance attention mechanisms, and introduces IF-CIoU to improve bounding box regression—ultimately boosting the detection precision and generalization ability for apple leaf diseases. S-YOLOv5m (Tang et al., 2025), built on YOLOv5, integrates Spatial Parallel Depthwise Convolution (SPD-Conv), Ghost modules, Convolutional Block Attention Module (CBAM), and Adjacent Erasure Module (AEM) to propose an insect detection model specifically for tomato plants.

YOLO-series models are efficient but prone to missed detections caused by feature confusion, while Transformer-based models rely on global attention and remain inefficient at capturing small targets; neither adapts well to dense rice field scenarios.

2.2 Feature fusion in object detection

Feature pyramids are widely employed in visual detection models (Ali and Zhang, 2024) for capturing multi-scale features of objects (Park et al., 2023; Qiu et al., 2023; Li et al., 2024b). The concept of multi-scale fusion is extensively applied to the overall construction of various neural network architectures (Tang et al., 2024); specifically, multi-scale modules, as plug-and-play basic modules, are also utilized in diverse computer vision tasks (Kim et al., 2021; Zhu et al., 2023; Vijayakumar and Vairavasundaram, 2024).

Feature Pyramid Network (FPN) (Lin et al., 2017) can simultaneously leverage the high resolution of shallow features and the high semantic information of deep features. It comprises a bottom-up structure, a top-down structure, and lateral connections, and achieves promising performance by fusing features from these different layers. Path Aggregation Network (PANet) (Liu et al., 2018) is an extension of FPN, which additionally incorporates a bottom-up path after the FPN to enhance feature propagation. Bidirectional Feature Pyramid Network (BiFPN) (Tan et al., 2020), proposed in EfficientDet, introduces a weighted bidirectional feature pyramid network that enables simple and efficient multi-scale feature fusion. NAS-FPN (Ghiasi et al., 2019) leverages the advantages of neural architecture search, using reinforcement learning to select optimal cross-connections and learn a more effective feature pyramid network architecture for target detection, achieving an excellent trade-off between accuracy and latency.
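The top-down fusion idea shared by these pyramids can be illustrated with a minimal sketch. The snippet below is not taken from any of the cited implementations; the class name TinyFPN, the channel widths, and the nearest-neighbor upsampling are illustrative assumptions.

```python
# Minimal sketch of FPN-style top-down fusion (after Lin et al., 2017): lateral 1x1 convs
# align channels, deeper maps are upsampled and added, and 3x3 convs smooth the results.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFPN(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels
        )

    def forward(self, c3, c4, c5):
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.laterals[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]


# Example: three backbone maps at strides 8/16/32 for a 640x640 input.
p3, p4, p5 = TinyFPN()(
    torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 512, 20, 20)
)
```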

GiraffeDet (Jiang et al., 2021) adopts a “lightweight backbone and heavyweight neck” design paradigm, which enables dense information exchange across different spatial scales and latent semantics of varying levels. This design allows the detector to process high-level semantic information and low-level spatial information with equal priority in the early stages of the network, thereby improving its effectiveness in detection tasks. Given the extremely small scale of objects in UAV images, object detection from UAV perspectives remains a challenging task. MSFE-YOLO (Qi et al., 2024), based on YOLOv8, proposes a novel target detection network: it expands symmetric feature extraction branches to construct a Symmetric C2f (SCF) module for enhanced feature extraction capability, and employs an Efficient Multi-scale Attention (EMA) module to enable cross-channel information interaction and cross-spatial learning in the neck network. This not only strengthens the correlation of local features but also fuses rich low-level texture features with high-level semantic features.

SOD-YOLO (Li et al., 2024a), another model based on YOLOv8, designs a novel neck architecture, the Balanced Spatial and Semantic Information Fusion Pyramid Network (BSSI-FPN), for multi-scale feature fusion. This architecture is tailored for small object detection in UAV images, improving feature extraction efficiency and effectively balancing spatial and semantic information across feature maps. BiFPN-YOLO (Doherty et al., 2025) introduces significant improvements to the existing YOLOv5 target detection model, replacing the traditional Path Aggregation Network (PANet) with the more powerful Bidirectional Feature Pyramid Network (BiFPN). To address insufficient feature learning under complex backgrounds, the Foreground Capture Feature Pyramid Network (FCFPN) (Han et al., 2025) was proposed for multi-scale target detection; it adaptively learns the fusion weights of multi-scale features across different levels of the feature pyramid, enhancing the complementarity of semantic information between foreground feature maps at different levels.

Existing methods have not simultaneously addressed rice plant feature decoupling and progressive multi-scale feature fusion, and therefore cannot meet the demand for individual plant-level detection in complex field scenarios.

3 Materials and methods

3.1 IV-YOLO

This study proposes an Information Vortex-based Progressive Fusion YOLO (IV-YOLO) method, built on the YOLO detection framework (Khanam and Hussain, 2024). To address the core challenges of rice plant feature coupling and background clutter interference in rice field scenarios, the method introduces a Multi-scale Spiral Information Vortex (MSIV) module and designs a Gradual Feature Fusion Neck (GFEN). As shown in Figure 2, the constructed end-to-end framework consists of three specific components.

Backbone (Backbone Feature Extraction Network): It is responsible for extracting multi-scale base features from UAV remote sensing rice images, adopting a hierarchical structure of “Conv (k=3) + C3K2 + MSIVConv”. Among them, MSIVConv intervenes at key stages of the backbone network: it simulates the water vortex effect to perform “spiral progressive disentanglement” on high-dimensional features in densely intertwined rice regions (e.g., overlapping areas between tiller seedlings and main stems), thereby enhancing the phenotypic contours of individual rice plants. Meanwhile, it breaks the correlation between background clutter and rice features through vortex-induced perturbation, initially achieving clutter interference suppression and providing a purer feature foundation for subsequent fusion.

Neck (Gradual Feature Fusion Neck, GFEN): Serving as the bridge between Backbone and Head, GFEN adopts a repetitive structure of “multi-scale feature concatenation (Concat) → MSIVConv enhancement → residual convolution (C3K2)” to realize progressive fusion and optimization of cross-scale features. The specific process is as follows: First, the multi-scale feature maps output by the Backbone undergo preliminary fusion via Concat; then, they are input into MSIVConv to further decouple background clutter from rice features; finally, deep feature refinement is conducted through C3K2 residual convolution. This process not only preserves fine-grained details (e.g., edges, textures) of individual rice plants but also integrates the global spatial semantics of rice plant populations, providing the detection head with more discriminative multi-scale feature representations.

Head (Detection Head): It inherits the YOLO11 multi-scale detection paradigm (Sapkota et al., 2025), receives three feature maps of different scales output by GFEN, and performs classification and bounding box regression for rice targets of corresponding scales, respectively.
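A schematic sketch of how these three components are wired is given below. The sub-module classes are hypothetical placeholders for the authors' blocks (Conv, C3K2, MSIVConv, GFEN, and the YOLO11 head), so this is a structural illustration under assumptions, not the released implementation.

```python
# Schematic wiring of IV-YOLO: backbone -> GFEN neck -> multi-scale detection head.
# All three sub-modules are assumed to be supplied by the caller.
import torch.nn as nn


class IVYOLOSketch(nn.Module):
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # Conv(k=3) + C3K2 + MSIVConv stages
        self.neck = neck          # GFEN: Concat -> MSIVConv -> C3K2, repeated
        self.head = head          # YOLO11-style multi-scale detection head

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)       # multi-scale features from the UAV image
        f3, f4, f5 = self.neck(p3, p4, p5)  # progressively fused, clutter-suppressed features
        return self.head(f3, f4, f5)        # per-scale classification and box regression
```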

3.2 Gradual feature fusion extraction neck

In rice target detection tasks, feature pyramid structures are susceptible to interference from complex field backgrounds such as soil reflections and weed textures, leading to prominent feature coupling between rice plants and the background and a subsequent decline in detection accuracy. To address this issue, this study proposes the Gradual Feature Fusion Extraction Neck (GFEN). By integrating a two-stage progressive feature aggregation strategy based on the Multi-scale Information Vortex (MSIV), GFEN achieves efficient fusion and semantic alignment of multi-scale rice plant features while enhancing anti-interference capability against background clutter.

As illustrated in Figure 2, GFEN reconstructs the logic of feature flow using a bottom-up two-stage progressive fusion paradigm: starting from the low-level features output by the Backbone (corresponding to the rice tillering layer, rich in detailed information such as leaf edges and tiller nodes), it first fuses these low-level features with mid-level features (which contain both semantic and structural information), and then gradually incorporates high-level features. The Multi-scale Information Vortex (MSIV) module is embedded in each fusion stage; leveraging its ability to decouple and reorganize features in a vortex-like manner, it breaks the feature correlation between rice plants and background clutter, and strengthens the expression of rice plants’ intrinsic morphological and semantic features. Specifically, the two-stage progressive fusion process of GFEN is detailed as follows.


Figure 2. Schematic diagram of the IV-YOLO framework.

Stage 1 (Low-level to Mid-level Feature Fusion): The inputs are the low-level and mid-level features from the Backbone. First, the MSIV module is applied to process both low-level and mid-level features in a vortex-like manner, breaking the coupling between background clutter and fine-grained rice details, and enhancing target features such as tiller nodes and stem edges. Subsequently, dimension matching and concatenation (Concat) are performed on these two types of features, followed by feature fusion and channel compression via a lightweight convolutional block (C3K2). This generates the first-stage fused feature $F_{\mathrm{fusion1}}$, realizing “preservation of low-level details + supplementation of mid-level semantics”.

Stage 2 (Mid-level to High-level & Low-level to High-level Feature Fusion): The inputs are the first-stage fused feature $F_{\mathrm{fusion1}}$ and the high-level features from the Backbone. Similarly, the MSIV module is used to perform vortex-like semantic enhancement on $F_{\mathrm{fusion1}}$ and the high-level features, suppressing interference from background clutter such as soil and weeds, and highlighting the morphological and semantic features of rice panicles. After dimension matching and concatenation, final fusion is completed via C3K2 to generate the multi-scale fused feature $F_{\mathrm{fusion2}}$, achieving multi-dimensional integration of “stem structural semantics + panicle global semantics + tiller detailed features”.

Finally, the multi-scale fused features are transmitted to the detection head, providing feature support characterized by “fine-grained details and strong semantic correlations” for accurate rice detection.
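A minimal sketch of this two-stage fusion is shown below, assuming an MSIV module like the one described in Section 3.3 (nn.Identity is used as a stand-in here), plain 3x3 convolutions in place of the C3K2 blocks, and illustrative channel widths.

```python
# Sketch of GFEN's bottom-up two-stage progressive fusion (Stage 1: low+mid, Stage 2: +high).
# MSIV placement, channel widths, and the fusion convs are simplified assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GFENSketch(nn.Module):
    def __init__(self, c_low=128, c_mid=256, c_high=512, c_out=256, msiv=nn.Identity):
        super().__init__()
        self.msiv_low, self.msiv_mid = msiv(), msiv()    # vortex-style decoupling before Stage 1
        self.fuse1 = nn.Conv2d(c_low + c_mid, c_out, 3, padding=1)   # stands in for C3K2
        self.msiv_f1, self.msiv_high = msiv(), msiv()    # vortex-style decoupling before Stage 2
        self.fuse2 = nn.Conv2d(c_out + c_high, c_out, 3, padding=1)

    def forward(self, f_low, f_mid, f_high):
        # Stage 1: preserve low-level details, supplement mid-level semantics.
        mid_up = F.interpolate(self.msiv_mid(f_mid), size=f_low.shape[-2:], mode="nearest")
        f_fusion1 = self.fuse1(torch.cat([self.msiv_low(f_low), mid_up], dim=1))
        # Stage 2: add high-level (panicle/global) semantics onto the fused feature.
        high_up = F.interpolate(self.msiv_high(f_high), size=f_fusion1.shape[-2:], mode="nearest")
        f_fusion2 = self.fuse2(torch.cat([self.msiv_f1(f_fusion1), high_up], dim=1))
        return f_fusion2
```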

3.3 Multi-scale spiral information vortex module

Inspired by the dynamic disturbance elimination and spatial regularization mechanism of water vortices in paddy fields, this study proposes the Multi-scale Spiral Information Vortex (MSIV) module. Its core goal is to address the problems of rice plant feature adhesion and background interference.

The module takes a feature map $F \in \mathbb{R}^{H \times W \times C}$ of the rice remote sensing image as input, where $H$ and $W$ denote the height and width of the feature map and $C$ denotes the number of channels. By simulating the “rotation-grooming-separation” mechanism of water vortices in rice paddies, it outputs an optimized feature map $F' \in \mathbb{R}^{H \times W \times C}$. The structure diagram is shown in Figure 3.


Figure 3. MSIV structural diagram.

The workflow is detailed as follows.

Step 1: Generation of Multi-Scale Vortex Kernels.

To simulate the effects of water vortices at different scales (small-scale vortices disperse fine impurities such as duckweed, while large-scale vortices groom dense rice canopies), $K$ groups of rotational kernels are generated as in Equation 1:

$\{V_k\}_{k=1}^{K}$  (1)

where the $k$-th group of kernels $V_k \in \mathbb{R}^{s_k \times s_k \times C}$; the kernel size $s_k \in \{3, 5, 7\}$ corresponds to the small, middle, and large scales respectively, to adapt to interferences and rice canopy sizes of different dimensions. Based on the natural rotational characteristics of vortices, the rotation angle of the $k$-th kernel group is expressed by Equation 2:

$\theta_k = \dfrac{\alpha \cdot k \cdot \pi}{2K}$  (2)

where $\alpha \in [0.5, 1.5]$ is an angle adjustment coefficient that simulates the randomness of rotational intensity in natural vortices, and the range of $\theta_k$ is constrained to $[15^{\circ}, 60^{\circ}]$.
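The kernel and angle generation of Step 1 can be sketched as follows; treating the vortex kernels as learnable depthwise filters with random initialization is an assumption made for illustration.

```python
# Sketch of Step 1 (Eqs. 1-2): K kernel groups of sizes {3, 5, 7} with spiral rotation angles.
import math
import torch
import torch.nn as nn


def make_vortex_kernels(channels, sizes=(3, 5, 7), alpha=1.0):
    K = len(sizes)
    kernels, angles = [], []
    for k, s in enumerate(sizes, start=1):
        # One depthwise kernel group per scale, shape (C, 1, s, s).
        kernels.append(nn.Parameter(torch.randn(channels, 1, s, s) * 0.01))
        angles.append(alpha * k * math.pi / (2 * K))  # theta_k = alpha * k * pi / (2K)
    return nn.ParameterList(kernels), angles
```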

Step 2: Rotational Feature Convolution.

Convolution between the rotational kernels and the feature map simulates the “shear perturbation” of vortices on features; meanwhile, residual connections are introduced to retain the original features, as shown in Equation 3:

$F_k = \mathrm{Conv}(F, V_k;\ \mathrm{padding} = \lfloor s_k/2 \rfloor) + F$  (3)

where $F_k$ denotes the feature map after convolution with the $k$-th group of rotational kernels, and $\mathrm{Conv}(\cdot)$ denotes a 2D convolution operation. The padding of $\lfloor s_k/2 \rfloor$ ensures the feature map size remains unchanged ($H \times W$) after convolution. The term $+F$ is a residual connection, which prevents excessive rotational perturbation from damaging critical rice features (e.g., stem continuity, tiller nodes) and balances the trade-off between perturbation and feature preservation.
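Step 2 can be sketched as below; using depthwise convolution for the rotational kernels is an assumption carried over from the kernel sketch above.

```python
# Sketch of Step 2 (Eq. 3): "same"-padded convolution with each kernel group plus a residual.
import torch.nn.functional as F


def rotational_feature_conv(feat, kernels):
    # feat: (B, C, H, W); kernels: iterable of (C, 1, s, s) depthwise kernel groups.
    outputs = []
    for V_k in kernels:
        s = V_k.shape[-1]
        F_k = F.conv2d(feat, V_k, padding=s // 2, groups=feat.shape[1]) + feat  # Eq. (3)
        outputs.append(F_k)
    return outputs
```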

Step 3: Channel-Spatial Joint Reorganization (Vortex-Driven Decoupling).

Channel Reorganization: Valid rice features (e.g., stem edges, tiller nodes) are distributed across different feature channels, intertwined with background clutter. Spiral channel index rearrangement enables valid feature channels to be connected in line with rice growth patterns, while disrupting the disordered correlation of clutter channels.

Spatial Reorganization: Features in overlapping rice regions are highly adherent spatially, and clutter accumulates in edge regions. Dynamic rotational offsets slightly separate adherent rice canopy features and push edge clutter outward, achieving cross-domain reorganization of features via vortex-driven decoupling.

For the $k$-th group of rotationally perturbed feature maps $F_k \in \mathbb{R}^{H \times W \times C}$, “spatial coordinate-driven spiral channel indexing” is used to reselect the channel features at each spatial position $(i, j)$, generating the channel-reorganized feature map $F_{ch} \in \mathbb{R}^{H \times W \times C}$, as expressed by Equations 4 and 5:

$\mathrm{Channel}_{i,j,k} = (i \cdot \sin\theta_k + j \cdot \cos\theta_k) \,\%\, C$  (4)
$F_{ch}(i, j, :) = F_k(i, j, \mathrm{Channel}_{i,j,k})$  (5)

where $F_{ch}$ denotes the feature map after channel reorganization; $(i, j)$ denotes the spatial coordinates of the feature map; $\theta_k$ is the rotation angle of the $k$-th vortex kernel group (generated in Step 1); $\sin\theta_k$ and $\cos\theta_k$ simulate the “spiral trajectory” of the vortex, with $i \cdot \sin\theta_k$ representing the spiral offset in the height direction and $j \cdot \cos\theta_k$ the spiral offset in the width direction; and $\%$ ensures the calculated channel index falls within the valid range $[0, C-1]$ to avoid out-of-bounds indexing.
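A vectorized sketch of the spiral channel reorganization is given below. Equation 5 is read here as a per-position cyclic shift of the channel axis by the spiral index, so that the output keeps all $C$ channels; that reading, and the torch.gather formulation, are interpretation assumptions.

```python
# Sketch of Eqs. (4)-(5): spiral channel index per position, applied as a cyclic channel shift.
import math
import torch


def spiral_channel_reorg(F_k, theta_k):
    B, C, H, W = F_k.shape
    i = torch.arange(H, device=F_k.device).view(H, 1).expand(H, W)
    j = torch.arange(W, device=F_k.device).view(1, W).expand(H, W)
    shift = (i * math.sin(theta_k) + j * math.cos(theta_k)).long() % C      # Eq. (4)
    c = torch.arange(C, device=F_k.device).view(C, 1, 1)
    idx = ((c + shift.unsqueeze(0)) % C).unsqueeze(0).expand(B, C, H, W)    # per-position shift
    return torch.gather(F_k, dim=1, index=idx)                              # F_ch, (B, C, H, W)
```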

For spatial reorganization, a “position-dependent dynamic rotational offset” is applied to each spatial position $(i, j)$ of the channel-reorganized $F_{ch}$, and bilinear interpolation sampling is used to obtain the spatially reorganized feature map $F_s \in \mathbb{R}^{H \times W \times C}$, as expressed by Equations 6–9:

$\mathrm{Offset}(i, j, k) = \beta \cdot (i - H/2) \cdot (j - W/2) \cdot (\sin\theta_k, \cos\theta_k)$  (6)
$F_s(i, j, :) = \mathrm{BilinearSample}(F_{ch},\ i + \Delta i,\ j + \Delta j)$  (7)
$\Delta i = \mathrm{Offset}(i, j, k)_x$  (8)
$\Delta j = \mathrm{Offset}(i, j, k)_y$  (9)

where $F_s$ denotes the feature map after spatial reorganization, $\Delta i$ and $\Delta j$ denote the offsets in the height and width directions respectively, and $\mathrm{BilinearSample}$ refers to bilinear interpolation (to avoid pixel distortion after the offset).
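The spatial reorganization of Equations 6–9 can be sketched with torch.nn.functional.grid_sample as below; the value of the offset coefficient $\beta$ and the grid normalization details are assumptions, since the paper does not specify them.

```python
# Sketch of Eqs. (6)-(9): position-dependent rotational offsets realized via bilinear sampling.
import math
import torch
import torch.nn.functional as F


def spiral_spatial_reorg(F_ch, theta_k, beta=1e-4):
    B, C, H, W = F_ch.shape
    i = torch.arange(H, device=F_ch.device, dtype=torch.float32).view(H, 1).expand(H, W)
    j = torch.arange(W, device=F_ch.device, dtype=torch.float32).view(1, W).expand(H, W)
    radial = beta * (i - H / 2) * (j - W / 2)                        # scalar part of Eq. (6)
    di, dj = radial * math.sin(theta_k), radial * math.cos(theta_k)  # Eqs. (8)-(9)
    # Build a normalized sampling grid in [-1, 1] for grid_sample (x first, then y).
    gx = 2 * (j + dj) / max(W - 1, 1) - 1
    gy = 2 * (i + di) / max(H - 1, 1) - 1
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(B, H, W, 2)
    return F.grid_sample(F_ch, grid, mode="bilinear", align_corners=True)  # Eq. (7)
```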

Step 4: Attention Weighting and Output.

An attention mechanism dynamically assigns weights to features at different scales, strengthening critical rice features and suppressing interferences.

The attention weights are calculated as in Equation 10:

$A = \sigma\!\left(\mathrm{GlobalAvgPool}\!\left(\sum_{k=1}^{K} F_k\right)\right)$  (10)

where $\mathrm{GlobalAvgPool}(\cdot)$ performs global average pooling on the feature maps to extract statistical information, and $\sigma(\cdot)$ is the Sigmoid function, outputting a weight vector $A \in \mathbb{R}^{K}$. This assigns higher weights to the feature scales that capture critical rice structures (e.g., panicle-covering scales) and lower weights to scales dominated by interference.

Weighted Output: the multi-scale optimized features are fused to output the final feature map $F'$, as in Equation 11:

$F' = \sum_{k=1}^{K} A_k \cdot F_k$  (11)
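Step 4 can be sketched as follows; mapping the pooled statistics to $K$ scale weights with a small linear layer is an assumption, since the paper only specifies global average pooling followed by a Sigmoid.

```python
# Sketch of Eqs. (10)-(11): per-scale attention weights and the weighted multi-scale fusion.
import torch
import torch.nn as nn


class VortexScaleAttention(nn.Module):
    def __init__(self, channels, num_scales):
        super().__init__()
        self.to_weights = nn.Linear(channels, num_scales)  # assumed C -> K projection

    def forward(self, feats):                              # feats: list of K maps, each (B, C, H, W)
        pooled = torch.stack(feats, dim=0).sum(dim=0).mean(dim=(2, 3))  # GAP of sum_k F_k
        A = torch.sigmoid(self.to_weights(pooled))                      # Eq. (10), shape (B, K)
        out = sum(A[:, k].view(-1, 1, 1, 1) * feats[k] for k in range(len(feats)))  # Eq. (11)
        return out
```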

The pseudocode is presented in Table 1.


Table 1. MSIV Pseudocode.

DyT (Zhu et al., 2025) is inspired by the observation that normalization layers often produce tanh-like, S-shaped input-output mappings. It leverages a simple element-wise operation to simulate the behavior of Layer Normalization (LN) without computing activation statistics. Outperforming LN/RMSNorm in computational efficiency, DyT enables faster inference and training. In this study, Dynamic Tanh (DyT) is introduced as a direct replacement for normalization layers and integrated into the MSIV structure. For a given input tensor $x$, the DyT layer is defined as Equation 12:

$\mathrm{DyT}(x) = \gamma \cdot \tanh(\alpha x) + \beta$  (12)

where $\alpha$ is a learnable scalar parameter that dynamically adjusts the scaling ratio based on the input range, and $\gamma$ and $\beta$ are learnable per-channel vector parameters, consistent with the parameters used in normalization layers, which allow the output to be scaled to any magnitude. DyT applies a non-linear transformation to the input via the tanh function while retaining the ability of normalization layers to compress extreme values. The structure diagram is shown in Figure 4.


Figure 4. Schematic diagram of DyT structure.
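A minimal sketch of a DyT layer adapted to (B, C, H, W) convolutional features is shown below; the per-channel broadcasting and the initial value of alpha are assumptions for this setting.

```python
# Sketch of Eq. (12): DyT(x) = gamma * tanh(alpha * x) + beta, replacing a normalization layer.
import torch
import torch.nn as nn


class DyT(nn.Module):
    def __init__(self, channels, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learnable scalar scaling
        self.gamma = nn.Parameter(torch.ones(channels))      # learnable per-channel scale
        self.beta = nn.Parameter(torch.zeros(channels))      # learnable per-channel shift

    def forward(self, x):                                    # x: (B, C, H, W)
        y = torch.tanh(self.alpha * x)
        return self.gamma.view(1, -1, 1, 1) * y + self.beta.view(1, -1, 1, 1)
```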

4 Results

4.1 Dataset

DRPD dataset (Guo et al., 2025): Comprising 5,372 RGB images, this dataset is cropped from UAV aerial images captured at three different altitudes (GSD-7m, GSD-12m, and GSD-20m). The images were collected across four rice growth stages, namely the heading stage (1,903 images), flowering stage (1,676 images), early grain filling stage (1,235 images), and middle grain filling stage (558 images), with a total of 259,498 annotated rice panicles.

4.2 Comparative experiment

The hardware setup for model training is specified as follows: CPU AMD Ryzen 9 7950X, 64 GB of RAM, and GPU NVIDIA RTX 4090 with 24 GB of VRAM. For the software environment: Ubuntu 20.04, Python 3.9, PyTorch 2.0.1, and CUDA 11.8.

For inference on the test set, the same hardware and software configuration as in training was used to maintain consistency in experimental conditions.

To verify the target detection advantages of the proposed model, comparative experiments were conducted against mainstream target detection models: single-stage YOLO-series detectors (YOLOv5 (Jocher et al., 2022), YOLOv8 (Wang et al., 2024), YOLOv9 (Wang C-Y. et al., 2025), YOLOv10 (Wang et al., 2024), YOLOv11 (Khanam and Hussain, 2024)), a Transformer-based model (EfficientViT (Liu et al., 2023)), and the FRPNet (Guo et al., 2025) rice target detection network. These models cover different technical routes in the field of target detection, namely “single-stage efficient inference” and “multi-scale collaborative enhancement”, enabling a comprehensive comparison of performance differences across architectures in terms of feature representation, bounding box regression, and detection accuracy.

Specifically, the YOLO series represents typical examples of single-stage target detection and is widely applied in both industrial and academic scenarios: YOLOv5, as an early classic version of the series, laid the foundation for lightweight single-stage detection; while YOLOv8, YOLOv9, YOLOv10, and YOLOv11 are the outcomes of subsequent version iterations, incorporating optimization attempts in aspects such as network architecture.

In this study, Precision, Recall, AP (Average Precision), AP50 (Average Precision at IoU=0.5), and AP75 (Average Precision at IoU=0.75) were selected as core evaluation metrics to conduct a horizontal performance comparison among the models. The quantitative results are presented in Table 2.
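For reference, the sketch below shows how precision, recall, and a VOC-style AP can be computed from matched detections; the IoU matching procedure and the exact interpolation protocol used by the authors are not specified here, so these are simplified assumptions.

```python
# Simplified sketch of the evaluation metrics: precision/recall from TP, FP, FN counts,
# and AP as the area under an interpolated precision-recall curve.
import numpy as np


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    # recalls/precisions: arrays traced out by sweeping the confidence threshold.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```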


Table 2. Quantitative comparison results of advanced algorithms.

In terms of overall performance, IV-YOLO outperforms all comparison models across all core metrics (Table 2). Specifically, its AP is 0.16 percentage points higher than that of FRPNet (p<0.05), and its AP75 is 0.6 percentage points higher (p<0.01), with statistically significant differences. The performance advantage stems from the synergistic mechanism of MSIV and GFEN: MSIV breaks the feature correlation between rice plants and the background through rotating kernel convolution and channel-spatial reconstruction, increasing feature purity by 52.1% (defined as the ratio of feature response values in target regions to those in background regions); the two-stage progressive fusion of GFEN not only retains fine-grained details such as tiller edges (improving AP75) but also integrates the global semantics of rice plant populations (improving Recall).

As shown in the parameter count comparison in Table 3, YOLO series models have parameter counts ranging from 1.97M to 3.01M but lack sufficient accuracy; EfficientViT (4.01M) and FRPNet (4.65M) improve accuracy by increasing parameters, which is not conducive to edge-side deployment. In contrast, IV-YOLO has a parameter count of only 2.52M, which falls within the range of the YOLO series. Through the collaboration of the MSIV and GFEN modules, it achieves the dual advantages of “high accuracy and lightweight design,” making it suitable for UAV edge-side scenarios.


Table 3. The parameter comparison results.

Figure 5 presents the qualitative comparison of different target detection models in crop scenarios. The first row, labeled “GT” (Ground Truth), shows the bounding boxes of real targets (in red); the remaining rows sequentially display the detection results of YOLOv5, YOLOv8, YOLOv11, and the proposed “Ours” model (with bounding boxes in blue). Columns correspond to the three flight altitudes (7 m, 12 m, and 20 m). It can be observed that YOLOv5 exhibits numerous missed detections and bounding-box localization deviations across all scenarios. Although YOLOv8 and YOLOv11 show improvements, they still lack effectiveness in detecting small targets under dense or complex backgrounds. In contrast, the detection boxes of the proposed “Ours” model are more consistent with the ground truth annotations: missed detections and false detections are significantly reduced across the different scales, and targets are covered more accurately. This intuitively demonstrates the model’s superior target detection and localization capabilities, consistent with the performance advantages reflected in the quantitative experiments.


Figure 5. Qualitative comparison of experimental results.

4.3 Ablation experiment

Ablation experiment results demonstrate that the core components of the IV-YOLO model—namely the Multi-scale Information Vortex (MSIV) and Gradual Feature Fusion Neck (GFEN)—play a critical role in enhancing performance.

As shown in Table 4, after removing the MSIV module, the Recall decreases by 3.13% and the Average Precision (AP) decreases by 3.82%. This indicates the irreplaceable role of the MSIV module in decoupling and enhancing fine-grained rice features (e.g., tiller edges, panicle textures).


Table 4. Results of the ablation experiment.

When the GFEN module is removed and replaced with an alternative neck (Khanam and Hussain, 2024), the Precision decreases by 6.61%, the AP decreases by 4.35%, and the AP50 decreases by 3.26%, verifying that gradual feature fusion is crucial for the semantic alignment of multi-scale rice targets (from tillers to panicles). GFEN adopts two-stage progressive fusion and embeds the MSIV module in each fusion step, achieving progressive optimization of “feature decoupling - semantic alignment - feature refinement”; this not only addresses the feature coupling between rice plants and the background but also ensures accurate matching of shallow and deep features. In contrast, the alternative FPN only performs simple cross-scale feature concatenation and lacks feature decoupling and anti-interference processing, so rice plant features become confused with clutter such as soil noise and weed textures, the false detection rate increases, and Precision drops significantly.

Further analysis reveals a synergistic effect between the MSIV and GFEN modules. Through the progressive support of “feature decoupling → progressive fusion,” the two modules collectively ensure the high-precision detection performance of IV-YOLO in complex field scenarios (e.g., soil clutter, dense plant canopies).

5 Conclusion

To address the core bottlenecks of deep adhesion of rice plant features and field background clutter interference in rice detection using UAV remote sensing, this study proposes an Information Vortex-based Progressive Fusion YOLO (IV-YOLO) model to support the demand for rice phenotypic quantification in precision agriculture. Inspired by the water vortex mechanism in rice paddies, the Multi-scale Spiral Information Vortex (MSIV) module achieves decoupling of adhered rice plant features and suppression of background clutter via multi-scale rotational kernel convolution and channel-spatial joint reorganization. The Gradual Feature Fusion Neck (GFEN) balances shallow details and deep semantics, resolving the adaptability issue of traditional feature pyramids. Experiments based on the public DRPD dataset (5,372 RGB images, 259,498 annotated rice panicles, covering 4 growth stages and 3 remote sensing resolutions) demonstrate that IV-YOLO achieves a Precision of 0.8581, Recall of 0.8417, and AP of 0.5569—outperforming YOLO-series models and FRPNet across all metrics. In particular, its AP75 (0.6131) is 0.75 percentage points higher than that of FRPNet. Ablation experiments further confirm the necessity and synergistic effect of MSIV and GFEN. IV-YOLO provides a reliable solution for individual rice plant-level detection, supporting variable management and yield prediction in precision agriculture. Moreover, its “natural phenomenon inspired engineering design” approach offers a new paradigm for crop phenotypic analysis in agricultural remote sensing, facilitating the large-scale implementation of precision agriculture.

Although IV-YOLO has demonstrated excellent rice detection performance on public datasets, it still suffers from insufficient dataset generalization—its samples do not fully cover the topographies of different rice-growing areas (e.g., hills/plains), variety differences between indica and japonica rice, and extreme environments such as heavy rain and low light.

Future research can focus on constructing multi-source heterogeneous rice remote sensing datasets, incorporating samples from different regions, varieties, and extreme environments to enhance the model’s generalization ability.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Author contributions

JZ: Project administration, Visualization, Validation, Supervision, Conceptualization, Software, Writing – review & editing, Writing – original draft, Investigation. LH: Investigation, Resources, Writing – review & editing. YZ: Investigation, Resources, Writing – review & editing. CX: Conceptualization, Investigation, Resources, Writing – review & editing. CY: Conceptualization, Investigation, Resources, Writing – review & editing. JL: Investigation, Writing – review & editing, Resources. JM: Investigation, Writing – review & editing, Resources.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by Key Projects of Jiangsu Vocational College of Agriculture and Forestry, Research on Key Core Technologies for Increasing Yield of Japonica Hybrid Rice (No. 2024kj20), Sponsored by Qing Lan Project of the Jiangsu Higher Education Institutions and Yafu Technology Innovation and Service Major Project of Jiangsu Vocational College of Agriculture and Forestry, China (No. 2023kj01).

Conflict of interest

Authors JL and JM were employed by Jiangsu Zhongjiang Seed Co., Ltd.

The remaining author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ahmed, M. S., Tazwar, M. T., Khan, H., Khan, H., Roy, S., Iqbal, J., et al. (2022). Yield response of different rice ecotypes to meteorological, agro-chemical, and soil physiographic factors for interpretable precision agriculture using extreme gradient boosting and support vector regression. Complexity 2022, 5305353. doi: 10.1155/2022/5305353


Ali, M. L. and Zhang, Z. (2024). The YOLO framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 13, 336. doi: 10.3390/computers13120336


Cano, P. B., Carcedo, A. J. P., Hernandez, C. M., Gomez, F. M., Gimenez, V. D., Kyveryga, P. M., et al. (2025). Trends in agricultural technology: a review of US patents. Precis Agric. 26, 59. doi: 10.1007/s11119-025-10257-x


Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). “End-to-end object detection with transformers,” in Computer vision – ECCV 2020. Eds. Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M. (Springer International Publishing, Cham), 213–229.


Chen, S., Li, W., Chen, D., Xie, Z., Zhang, S., Cen, F., et al. (2025). Recognition of rice seedling counts in UAV remote sensing images via the YOLO algorithm. Smart Agric. Technol. 12, 101107. doi: 10.1016/j.atech.2025.101107


Chen, Y., Xu, H., Zhang, X., Gao, P., Xu, Z., and Huang, X. (2023). An object detection method for bayberry trees based on an improved YOLO algorithm. Int. J. Digit Earth 16, 781–805. doi: 10.1080/17538947.2023.2173318


Diao, Z., Chen, L., Yang, Y., Liu, Y., Yan, J., He, S., et al. (2025). Localization technologies for smart agriculture and precision farming: A review. Comput. Electron Agric. 236, 110464. doi: 10.1016/j.compag.2025.110464


Doherty, J., Gardiner, B., Kerr, E., and Siddique, N. (2025). BiFPN-YOLO: One-stage object detection integrating Bi-Directional Feature Pyramid Networks. Pattern Recognit. 160, 111209. doi: 10.1016/j.patcog.2024.111209


Gade, S. A., Madolli, M. J., Garcia-Caparros, P., Ullah, H., Cha-um, S., Datta, A., et al. (2025). Advancements in UAV remote sensing for agricultural yield estimation: A systematic comprehensive review of platforms, sensors, and data analytics. Remote Sens Appl-Soc. Environ. 37, 101418. doi: 10.1016/j.rsase.2024.101418


Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2019). “NAS-FPN: learning scalable feature pyramid architecture for object detection,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 7029–7038.


Girshick, R. (2015). Fast R-CNN [Homepage on the internet]. Available online at: https://openaccess.thecvf.com/content_iccv_2015/html/Girshick_Fast_R-CNN_ICCV_2015_paper.html (Accessed May 7, 2025).


Guo, Y., Zhan, W., Zhang, Z., Zhang, Y., and Guo, H. (2025). FRPNet: A lightweight multi-altitude field rice panicle detection and counting network based on unmanned aerial vehicle images. Agronomy 15, 1396. doi: 10.3390/agronomy15061396


Han, H., Zhang, Q., Li, F., and Du, Y. (2025). Foreground capture feature pyramid network-oriented object detection in complex backgrounds. IEEE Trans. Neural Netw. Learn Syst. 36, 6925–6939. doi: 10.1109/TNNLS.2024.3387282


Jiang, Y., Tan, Z., Wang, J., Sun, X., Lin, M., and Li, H. (2021). GiraffeDet: A heavy-neck paradigm for object detection. Red Hook, New York, USA: Curran Associates, Inc. Available online at: https://openreview.net/forum?id=cBu4ElJfneV (Accessed June 8, 2025).


Jin, S., Cao, Q., Li, J., Wang, X., Li, J., Feng, S., et al. (2025). Study on lightweight rice blast detection method based on improved YOLOv8. Pest Manag. Sci. 81, 4300–4313. doi: 10.1002/ps.8790


Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., et al. (2022). ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations (Zenodo).


Khan, Z., Shen, Y., and Liu, H. (2025). Object detection in agriculture: A comprehensive review of methods, applications, challenges, and future directions. Agric-Basel 15, 1351. doi: 10.3390/agriculture15131351


Khan, S., Tufail, M., Khan, M. T., Khan, Z. A., Iqbal, J., and Wasim, A. (2022). A novel framework for multiple ground target detection, recognition and inspection in precision agriculture applications using a UAV. Unmanned Syst. 10, 45–56. doi: 10.1142/S2301385022500029


Khanam, R. and Hussain, M. (2024). YOLOv11: an overview of the key architectural enhancements.


Kim, M., Jeong, J., and Kim, S. (2021). ECAP-YOLO: efficient channel attention pyramid YOLO for small object detection in aerial image. Remote Sens 13, 4851. doi: 10.3390/rs13234851


Kizielewicz, B., Watrobski, J., and Salabun, W. (2025). Multi-criteria decision support system for the evaluation of UAV intelligent agricultural sensors. Artif. Intell. Rev. 58, 194. doi: 10.1007/s10462-025-11201-1


Li, Y., Li, Q., Pan, J., Zhou, Y., Zhu, H., Wei, H., et al. (2024a). SOD-YOLO: small-object-detection algorithm based on improved YOLOv8 for UAV images. Remote Sens 16, 3057. doi: 10.3390/rs16163057


Li, Y., Yang, W., Wang, L., Tao, X., Yin, Y., and Chen, D. (2024b). HawkEye conv-driven YOLOv10 with advanced feature pyramid networks for small object detection in UAV imagery. Drones 8, 713. doi: 10.3390/drones8120713


Li, T., Zhang, L., and Lin, J. (2024). Precision agriculture with YOLO-Leaf: advanced methods for detecting apple leaf diseases. Front. Plant Sci. 15, 1452502. doi: 10.3389/fpls.2024.1452502


Liang, Y., Li, H., Wu, H., Zhao, Y., Liu, Z., Liu, D., et al. (2024). A rotated rice spike detection model and a crop yield estimation application based on UAV images. Comput. Electron Agric. 224, 109188. doi: 10.1016/j.compag.2024.109188


Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). “Feature pyramid networks for object detection,” in 30th Ieee Conference on Computer Vision and Pattern Recognition (cvpr 2017), New York. 936–944 (New York, NY, USA: IEEE).


Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., et al. (2016). “SSD: single shot multiBox detector,” in Computer vision – ECCV 2016. Eds. Leibe, B., Matas, J., Sebe, N., and Welling, M. (Springer International Publishing, Cham), 21–37.


Liu, X., Peng, H., Zheng, N., Yang, Y., Hu, H., and Yuan, Y. (2023). EfficientViT: memory efficient vision transformer with cascaded group attention. Available online at: https://openaccess.thecvf.com/content/CVPR2023/html/Liu_EfficientViT_Memory_Efficient_Vision_Transformer_With_Cascaded_Group_Attention_CVPR_2023_paper.html (Accessed February 18, 2025).


Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). “Path aggregation network for instance segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New York. 8759–8768 (Ieee).


Melnychenko, O., Scislo, L., Savenko, O., Sachenko, A., and Radiuk, P. (2024). Intelligent integrated system for fruit detection using multi-UAV imaging and deep learning. Sensors 24, 1913. doi: 10.3390/s24061913


Mohammed, A., Ali, N., Bais, A., Ruan, Y., Cuthbert, R. D., and Sangha, J. S. (2024). From fields to pixels: UAV multispectral and field-captured RGB imaging for high-throughput wheat spike and kernel counting. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens 17, 17806–17819. doi: 10.1109/JSTARS.2024.3463432


Pan, Y., Chang, J., Dong, Z., Liu, B., Wang, L., Liu, H., et al. (2025). PFLO: a high-throughput pose estimation model for field maize based on YOLO architecture. Plant Methods 21, 51. doi: 10.1186/s13007-025-01369-6


Park, H.-J., Kang, J.-W., and Kim, B.-G. (2023). ssFPN: scale sequence (S2) feature-based feature pyramid network for object detection. Sensors 23, 4432. doi: 10.3390/s23094432


Patricio, D. I. and Rieder, R. (2018). Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review. Comput. Electron Agric. 153, 69–81. doi: 10.1016/j.compag.2018.08.001


Paul, A., Machavaram, R., Ambuj, Kumar, D., and Nagar, H. (2024). Smart solutions for capsicum Harvesting: Unleashing the power of YOLO for Detection, Segmentation, growth stage Classification, Counting, and real-time mobile identification. Comput. Electron Agric. 219, 108832. doi: 10.1016/j.compag.2024.108832


Petrovic, B., Bumbalek, R., Zoubek, T., Kunes, R., Smutny, L., and Bartos, P. (2024). Application of precision agriculture technologies in Central Europe-review. J. Agric. Food Res. 15, 101048. doi: 10.1016/j.jafr.2024.101048


Qi, S., Song, X., Shang, T., Hu, X., and Han, K. (2024). MSFE-YOLO: an improved YOLOv8 network for object detection on drone view. IEEE Geosci. Remote Sens Lett. 21, 6013605. doi: 10.1109/LGRS.2024.3432536


Qiu, Y., Liu, Y., Chen, Y., Zhang, J., Zhu, J., and Xu, J. (2023). A2SPPNet: attentive atrous spatial pyramid pooling network for salient object detection. IEEE Trans. Multimed. 25, 1991–2006. doi: 10.1109/TMM.2022.3141933


Qu, F., Li, H., Wang, P., Guo, S., Wang, L., and Li, X. (2025). Rice spike identification and number prediction in different periods based on UAV imagery and improved YOLOv8. Cmc-Comput Mater. Contin. 84, 3911–3925. doi: 10.32604/cmc.2025.063820


Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: unified, real-time object detection [Homepage on the internet]. Available online at: https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Redmon_You_Only_Look_CVPR_2016_paper.html (Accessed February 18, 2025).


Sanaeifar, A., Guindo, M. L., Bakhshipour, A., Fazayeli, H., Li, X., and Yang, C. (2023). Advancing precision agriculture: The potential of deep learning for cereal plant head detection. Comput. Electron Agric. 209, 107875. doi: 10.1016/j.compag.2023.107875


Santoso, A. B., Ulina, E. S., Batubara, S. F., Chairuman, N., Indrasari, S. D., Pustika, A. B., et al. (2024). Are Indonesian rice farmers ready to adopt precision agricultural technologies? Precis Agric. 25, 2113–2139. doi: 10.1007/s11119-024-10156-7


Sapkota, R., Flores-Calero, M., Qureshi, R., Badgujar, C., Nepal, U., Poulose, A., et al. (2025). YOLO advances to its genesis: a decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 58, 274. doi: 10.1007/s10462-025-11253-3

Shen, Q., Zhang, X., Shen, M., and Xu, D. (2025). Multi-scale adaptive YOLO for instance segmentation of grape pedicels. Comput. Electron Agric. 229, 109712. doi: 10.1016/j.compag.2024.109712

Song, Z., Ban, S., Hu, D., Xu, M., Yuan, T., Zheng, X., et al. (2025). A lightweight YOLO model for rice panicle detection in fields based on UAV aerial images. Drones 9, 1. doi: 10.3390/drones9010001

Sun, X., Zhang, P., Wang, Z., and Wang, Y. (2024). Potential of multi-seasonal vegetation indices to predict rice yield from UAV multispectral observations. Precis Agric. 25, 1235–1261. doi: 10.1007/s11119-023-10109-6

Sun, J., Zhou, J., He, Y., Jia, H., and Rottok, L. T. (2024). Detection of rice panicle density for unmanned harvesters via RP-YOLO. Comput. Electron Agric. 226, 109371. doi: 10.1016/j.compag.2024.109371

Tan, M., Pang, R., and Le, Q. V. (2020). “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (Red Hook, New York, USA: Curran Associates, Inc.), 10781–10790.

Tang, Y., Luo, F., Wu, P., Tan, J., Wang, L., Niu, Q., et al. (2025). An improved YOLO network for small target insects detection in tomato fields. Comput. Electron Agric. 239, 110915. doi: 10.1016/j.compag.2025.110915

Tang, L., Yun, L., Chen, Z., and Cheng, F. (2024). HRYNet: A highly robust YOLO network for complex road traffic object detection. Sensors 24, 642. doi: 10.3390/s24020642

Vijayakumar, A. and Vairavasundaram, S. (2024). YOLO-based object detection models: A review and its applications. Multimed. Tools Appl. 83, 83535–83574. doi: 10.1007/s11042-024-18872-y

Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al. (2024). “YOLOv10: real-time end-to-end object detection,” in Advances in neural information processing systems. Eds. Globerson, A., Mackey, L., Belgrave, D., et al. (Curran Associates, Inc.), 107984–108011. Available online at: https://proceedings.neurips.cc/paper_files/paper/2024/file/c34ddd05eb089991f06f3c5dc36836e0-Paper-Conference.pdf (Accessed June 10, 2025).

Wang, W., Li, C., Xi, Y., Gu, J., Zhang, X., Zhou, M., et al. (2025). Research progress and development trend of visual detection methods for selective fruit harvesting robots. Agron-Basel 15, 1926. doi: 10.3390/agronomy15081926

Wang, C.-Y., Yeh, I.-H., and Mark Liao, H.-Y. (2025). “YOLOv9: learning what you want to learn using programmable gradient information,” in Computer vision – ECCV 2024 (Springer, Cham), 1–21. doi: 10.1007/978-3-031-72751-1_1

Wu, H., Guan, M., Chen, J., Pan, Y., Zheng, J., Jin, Z., et al. (2025). OE-YOLO: an efficientNet-based YOLO network for rice panicle detection. Plants-Basel 14, 1370. doi: 10.3390/plants14091370

Wu, S., Ma, X., Jin, Y., Yang, J., Zhang, W., Zhang, H., et al. (2025). A novel method for detecting missing seedlings based on UAV images and rice transplanter operation information. Comput. Electron Agric. 229, 109789. doi: 10.1016/j.compag.2024.109789

Wu, H., Wang, Y., Zhao, P., and Qian, M. (2023). Small-target weed-detection model based on YOLO-V4 with improved backbone and neck structures. Precis Agric. 24, 2149–2170. doi: 10.1007/s11119-023-10035-7

Xiao, X., Jiang, Y., and Wang, Y. (2025). Key technologies for machine vision for picking robots: review and benchmarking. Mach. Intell. Res. 22, 2–16. doi: 10.1007/s11633-024-1517-1

Xu, Y., Wei, H., Lin, M., Deng, Y., Sheng, K., Zhang, M., et al. (2022). Transformers in computational visual media: A survey. Comput. Vis. Media 8, 33–62. doi: 10.1007/s41095-021-0247-3

Yu, H., Chen, Z., Liu, X., Song, S., and Chen, M. (2025). Improving EfficientNet_b0 for distinguishing rice from different origins: A deep learning method for geographical traceability in precision agriculture. Curr. Plant Biol. 43, 100501. doi: 10.1016/j.cpb.2025.100501

Yu, F., Wang, M., Xiao, J., Zhang, Q., Zhang, J., Liu, X., et al. (2024). Advancements in utilizing image-analysis technology for crop-yield estimation. Remote Sens 16, 1003. doi: 10.3390/rs16061003

Yuan, J., Zhang, Y., Zheng, Z., Yao, W., Wang, W., and Guo, L. (2024). Grain crop yield prediction using machine learning based on UAV remote sensing: A systematic literature review. Drones 8, 559. doi: 10.3390/drones8100559

Zeng, F., Zhang, M., Law, C. L., and Lin, J. (2025). Harnessing artificial intelligence for advancements in Rice/wheat functional food Research and Development. Food Res. Int. 209, 116306. doi: 10.1016/j.foodres.2025.116306

Zhang, Y., Fang, X., Guo, J., Wang, L., Tian, H., Yan, K., et al. (2023). CURI-YOLOv7: A lightweight YOLOv7tiny target detector for citrus trees from UAV remote sensing imagery based on embedded device. Remote Sens 15, 4647. doi: 10.3390/rs15194647

Zhang, H., Gong, Z., Hu, C., Chen, C., Wang, Z., Yu, B., et al. (2025). A transformer-based detection network for precision cistanche pest and disease management in smart agriculture. Plants-Basel 14, 499. doi: 10.3390/plants14040499

Zhou, C., Zhou, C., Yao, L., Du, Y., Fang, X., Chen, Z., et al. (2025). An improved YOLOv5s-based method for detecting rice leaves in the field. Front. Plant Sci. 16, 1561018. doi: 10.3389/fpls.2025.1561018

Zhu, J., Chen, X., He, K., LeCun, Y., and Liu, Z. (2025). “Transformers without normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA.

Zhu, B., Lv, Q., and Tan, Z. (2023). Adaptive multi-scale fusion blind deblurred generative adversarial network method for sharpening image data. Drones 7, 96. doi: 10.3390/drones7020096

Keywords: deep learning, multi-scale fusion, object detection, precision agriculture, rice

Citation: Zhang J, Huangfu L, Zhao Y, Xue C, Yin C, Lu J and Mei J (2026) IV-YOLO: an information vortex-based progressive fusion method for accurate rice detection. Front. Plant Sci. 16:1734022. doi: 10.3389/fpls.2025.1734022

Received: 28 October 2025; Revised: 21 December 2025; Accepted: 25 December 2025;
Published: 21 January 2026.

Edited by:

Parvathaneni Naga Srinivasu, Amrita Vishwa Vidyapeetham University, India

Reviewed by:

Linh Tuan Duong, National Institute of Nutrition, Vietnam
Sandhya N., Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering & Technology (VNRVJIET), India

Copyright © 2026 Zhang, Huangfu, Zhao, Xue, Yin, Lu and Mei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jianxiang Zhang, 15190442295@163.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.