
ORIGINAL RESEARCH article

Front. Plant Sci., 09 February 2026

Sec. Sustainable and Intelligent Phytoprotection

Volume 17 - 2026 | https://doi.org/10.3389/fpls.2026.1775987

DFSNet: directional feature aggregation and shape-aware supervision for eggplant pest and disease detection

Hui Sun1, Weicun Fan1, Junbo Zhang2, Minghan Feng1, Fulin Wang1 and Rui Fu1*
  • 1Weifang University of Science and Technology, Weifang, China
  • 2Shandong First Medical University & Shandong Academy of Medical Sciences, Jinan, China

In natural planting environments, pest and disease detection on eggplant fruits is characterized by small lesion sizes, weak edge feature information, significant scale variations, and complex backgrounds. In particular, fruit borer holes, fruit rot lesions, and melon thrips bite marks exhibit obvious differences in size, edge structure, and spatial distribution, posing considerable challenges for real-time accurate detection. This paper proposes DFSNet, a lightweight improved network for pest and disease detection on eggplant fruits in natural scenes. First, PConv is introduced in the P1 and P2 shallow feature extraction stages of the baseline model's backbone network to enhance the modeling capability for fine-grained directional textures and weak edge information. Subsequently, an MSDA (Multi-Scale Directional Aggregation) module is designed and embedded into the feature enhancement modules at the P3, P4, and P5 layers of the backbone, which effectively improves the perception of insect hole edges and lesion contours through multi-directional depthwise separable convolution and a Directional Edge Enhancer (DEE). Furthermore, a CSP-MSLA structure is introduced into the neck network, combining a multi-scale linear attention mechanism with cross-stage partial connections to achieve selective enhancement of key pest and disease regions while maintaining low computational complexity. Finally, an SDDH (Shape-based Dynamic Detection Head) is introduced, which enhances the model's adaptability to different pest and disease geometric features and scale variations by introducing a Scale-based Dynamic Loss. Experimental results demonstrate that the model achieves a Precision of 81.0%, a Recall of 78.3%, and an mAP@50 of 80.5% on a self-constructed eggplant pest and disease dataset under natural scenes, improvements of 6.9, 8.8, and 7.8 percentage points, respectively, over the baseline model. Meanwhile, the model parameters and computational cost are compressed to 1.8 M and 5.4 GFLOPs, respectively, with an inference speed of up to 378.13 FPS. The proposed method effectively improves small-target detection accuracy and robustness under complex backgrounds while ensuring real-time performance, with particularly significant advantages for small targets such as fruit borer holes and melon thrips bite marks, demonstrating that the model is an efficient and robust real-time detector for eggplant fruit pests and diseases.

1 Introduction

Eggplant (Solanum melongena L.), rich in various vitamins and bioactive substances, is one of the most widely cultivated vegetables around the world. However, during the growth cycle, fruit pests and diseases become a significant obstacle to eggplant production, such as internal boring caused by fruit borers, fruit rot caused by fungi or bacteria, and surface scars and banded stripes caused by thrips feeding (such as melon thrips), as shown in Figure 1. These pests and diseases reduce the commercial value of the fruit and, in severe cases, cause substantial economic losses, affecting the income of plantations and the production enthusiasm of growers (Kellab et al., 2025). Traditional pest and disease diagnosis mainly relies on field inspection and empirical judgment by agricultural experts. This approach depends on growers' visual observation and production experience, so diagnostic results are easily influenced by personal experience and conditions. Meanwhile, it is difficult to achieve rapid inspection over large areas, and the approach cannot meet the requirements for accurate identification in modern large-scale agricultural production. Therefore, developing more efficient and intelligent detection technologies has become an urgent problem in modern agriculture.

Figure 1
Four eggplants on plants, each showing signs of damage. The first is slightly twisted, the second has scratches, the third shows yellow and brown marks, and the fourth has several holes.

Figure 1. Fruit borer holes (nearly circular cavities) and banded stripes caused by melon thrips feeding, as well as fruit rot caused by fungi, occurring in eggplant cultivation.

Early research was mostly based on traditional machine learning (Agarwal et al., 2020), performing disease classification through manually designed features, such as lesion segmentation (Spisni et al., 2020) and SVM classification (Wu et al., 2014). However, although such methods have high computational efficiency, their feature representation capability is limited, and generalization performance is poor under complex field conditions such as illumination variations and background interference. In contrast, deep learning methods (Fu et al., 2025) have achieved significant progress in agricultural monitoring and disease detection, opening up new pathways for intelligent agricultural management. Among them, Convolutional Neural Networks (CNNs) have demonstrated excellent performance in crop disease recognition and classification tasks, with accuracy rates substantially surpassing traditional methods. For example, Ashurov et al. (2025) proposed a DCNN model integrating depthwise separable convolution, an SE module, and improved residual connections, achieving a significant reduction in computational complexity and enhancing disease recognition in resource-constrained environments with an accuracy of 99.47% on the PlantVillage dataset. Shafik et al. (2025) proposed a plant pest and disease detection method based on a ResNet-9 deep convolutional neural network, which not only improved detection accuracy to 97.4% but also effectively alleviated the class imbalance problem in the dataset through data augmentation strategies. Salka et al. (2025) reviewed and compared CNN-based plant disease detection architectures, establishing EfficientNet-B4 with an accuracy of 99.97% as the current accuracy benchmark. Recent studies have explored target perception and image quality enhancement in complex environments. Li et al. (2025a) introduced a joint detection and tracking framework based on reinforcement learning, which improves the perception of weak targets under heavy clutter. Wang et al. (2023) addressed underwater image degradation through color compensation and multi-attribute adjustment. Later, Wang et al. (2026) proposed a multimodal diffusion model to enhance color fidelity and detail representation under limited data conditions. In addition, Li et al. (2025b) employed graph convolutional networks to exploit echo topology, improving target discrimination in low signal-to-noise scenarios. These studies provide useful insights for perception modeling in challenging environments. Although existing research has made certain progress, disease detection still faces severe challenges in complex agricultural scenarios. As shown in Figure 1, the lesions of fruit rot, fruit borer damage, and melon thrips on eggplant fruits are extremely small and easily confounded by complex backgrounds such as leaf textures, illumination shadows, and fruit surfaces, resulting in extremely low pixel proportions of lesion regions and weak visual features. Because the spatial resolution of single-scale features is insufficient, models struggle to capture the fine-grained geometric and texture information of small targets, causing high miss and false detection rates and severely restricting detection performance.

To address this, researchers have enhanced the model's representation of targets at different scales by introducing multi-scale feature fusion and attention mechanisms. Zhang et al. (2025a) proposed MAVM-UNet based on multi-scale aggregated vision Mamba, achieving pixel accuracy and MIoU of 82.07% and 81.48% respectively, with performance superior to HCFormer and VM-UNet. Zhang et al. (2023) designed DBCLNet, a dual-branch collaborative network combining multi-scale convolution and Focal Loss, achieving an accuracy of 99.89% on the PlantVillage dataset and significantly surpassing existing mainstream models. Furthermore, WMC-RTDETR, proposed by Zhang et al. (2025b), enhanced multi-scale feature extraction by integrating CSRFPN, achieving 97.7% mAP50 while reducing computational cost by 40.42%, enabling real-time detection on edge devices.

However, although multi-scale feature fusion and attention mechanisms have achieved positive progress, existing methods still have the following limitations: (1) channel and spatial attention based on global pooling are difficult to effectively model cross-scale high-dimensional feature dependencies; (2) complex attention structures are sensitive to noisy features, affecting the robustness of small target detection; (3) the introduction of multi-scale structures often leads to a significant increase in network parameters and computational complexity, which is unfavorable for lightweight deployment and real-time applications.

To address the above problems, this paper proposes a lightweight detection framework integrating edge feature enhancement, multi-scale feature modeling, and efficient feature selection. The main contributions of this study are as follows:

● PConv backbone design for shallow detail perception. To address the problems of blurred edges and weak directional texture features of pest and disease targets, this paper introduces PConv (Pinwheel Convolution) in the shallow stages (P1–P2) of the lightweight detection network. This module models local directional structural information in parallel through multi-directional asymmetric convolution kernels. It significantly enhances the representation capability of direction-sensitive features with minimal parameter increase, effectively improving the feature representation quality of small insect holes and early lesions, and lays a reliable low-level feature foundation for subsequent multi-scale feature fusion.

● MSDA structure for multi-directional multi-scale feature aggregation. To address the limitations of the C3K2 module in directional information modeling, this paper designs the MSDA (Multi-Scale Directional Aggregation) structure and embeds it into C3K2. This module models horizontal and vertical structural information in parallel through multi-directional depthwise separable convolution branches, while introducing a dual-branch edge enhancement module (DEE) to explicitly enhance the edge response of pest and disease targets, thereby improving the model’s recognition capability for fruit rot lesion contours and irregular insect bite marks.

● Lightweight attention-driven CSP-MSLA neck structure. To address the problems of complex background interference and redundant features, this paper constructs the CSP-MSLA structure in the neck network, combining Multi-Scale Linear Attention (MSLA) with the CSP mechanism to achieve adaptive enhancement of key pest and disease regions while controlling computational complexity. This structure improves the discriminability of multi scale feature fusion and enhances the detection robustness of the model under complex illumination variations and local occlusion conditions.

● Shape-aware dynamic detection head SDDH. To address the significant differences in scale distribution and geometric morphology among different pest and disease targets, this paper designs a shape-aware detection head structure and adopts Scale-based Dynamic Loss as the regression supervision strategy. This method alleviates the problem of insufficient gradient contribution of small-scale targets during the training process through a scale-adaptive dynamic supervision mechanism, further improving the generalization capability of the detection head for multi-scale pest and disease targets.

2 Related work

2.1 Lightweight object detection backbone network structure

In recent years, with the widespread application of object detection algorithms in mobile and embedded scenarios, lightweight network design has become a research hotspot. CNN-based MobileNets (Howard et al., 2019, 2017; Sandler et al., 2018) significantly reduced computational complexity by replacing standard convolutions with depthwise separable convolutions, while GhostNets (Han et al., 2020; Liu et al., 2024; Tang et al., 2022) further reduced the number of parameters by generating feature maps on half of the channels using cheap operations. However, these methods are limited by local receptive fields and struggle to capture global context. In contrast, the Vision Transformer (ViT) demonstrates advantages with its global receptive field and long-range dependency modeling capability, but the quadratic computational complexity of its self-attention mechanism brings higher computational overhead. To achieve a better trade-off between speed and accuracy, single-stage detection models represented by the YOLO (You Only Look Once) series balance efficiency and performance in real-time object detection through collaborative optimization of the backbone network, neck structure, and detection head.
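For illustration, the minimal PyTorch sketch below shows the standard depthwise separable convolution pattern underlying the MobileNet family discussed above (a per-channel depthwise convolution followed by a 1×1 pointwise convolution); the module name and hyperparameters are illustrative rather than taken from any of the cited models.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise-separable block: a per-channel 3x3 (depthwise) convolution
    followed by a 1x1 pointwise convolution, the factorization popularized by
    the MobileNet family to reduce FLOPs relative to a standard convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```

For a 3×3 kernel and a large number of output channels, this factorization costs roughly an order of magnitude fewer multiply-accumulate operations than the equivalent standard convolution.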

Among them, Pan et al. (2025) constructed the SSD-YOLO model by integrating the SENetV2 mechanism and the DySample lightweight sampling module, achieving efficient and accurate detection of rice diseases with only 6 MB of parameters. Song et al. (2024) constructed the extremely lightweight model DODN by fusing deformable convolution and Transformer components, achieving efficient and accurate detection of cucumber diseases in complex scenarios with only a 3.7 MB parameter scale and 3.9 GFLOPs. However, such methods rely on stacked spatial convolutions, which are not only limited by computational resources but also struggle to adapt to pest and disease detection because fine-grained textures are neglected; as a result, existing models often remain suboptimal in real agricultural scenarios, which is precisely the core problem this paper aims to solve.

2.2 Frequency domain feature modeling and the application of wavelet transform in visual tasks

In addition to traditional spatial domain convolution, frequency domain feature modeling has gradually received attention in recent years. MWCNN, proposed by Liu et al. (2018), introduced the Discrete Wavelet Transform (DWT) to replace traditional downsampling, retaining frequency domain information while compressing feature maps and effectively alleviating information loss. Li et al. (2020) achieved decoupling of high-frequency and low-frequency components through the DWT, significantly enhancing the noise robustness of the model. Wavelet-SRNet, proposed by Huang et al. (2017), utilized wavelet coefficient prediction to reconstruct facial details, solving the over-smoothing problem in super-resolution tasks. Wavelet transform, with its excellent time-frequency localization characteristics, achieves effective decoupling of structure and details through multi-scale frequency band decomposition, demonstrating significant advantages in tasks such as image restoration, super-resolution, and semantic segmentation.
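As a concrete illustration of the decomposition these methods rely on, the short example below uses the PyWavelets library to apply a one-level 2-D Haar DWT to a feature map; the wavelet choice and array sizes are illustrative only.

```python
import numpy as np
import pywt  # PyWavelets

# A one-level 2-D Haar DWT splits a feature map into a low-frequency
# approximation (LL) and three high-frequency detail bands (LH, HL, HH),
# the decomposition that MWCNN-style methods use in place of plain
# downsampling.
feat = np.random.rand(64, 64).astype(np.float32)
LL, (LH, HL, HH) = pywt.dwt2(feat, "haar")
print(LL.shape, LH.shape, HL.shape, HH.shape)  # (32, 32) each

# The inverse transform recovers the original resolution, so the
# "downsampling" step itself discards no information.
recon = pywt.idwt2((LL, (LH, HL, HH)), "haar")
print(np.allclose(recon, feat, atol=1e-5))  # True
```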

In the field of pest and disease detection, existing lightweight models neglect fine-grained textures due to their reliance on stacked spatial convolutions, making it difficult to cope with complex detection scenarios. To address this, researchers have attempted to introduce wavelet transforms to enhance texture perception. Li et al. (2022) introduced the Discrete Wavelet Transform (DWT) into YOLOv4, strengthening the extraction of pest and disease detail textures and achieving accurate detection of small targets under complex backgrounds. Li et al. (2022) utilized Continuous Wavelet Analysis (CWA) to process hyperspectral data, accurately discriminating the stress states of tea plants affected by tea green leafhoppers, anthracnose, and other similar symptoms. Panchananam et al. (2025) proposed WFS-YOLO, which enhances features in both the frequency and spatial domains through the DWT, improving the perception of small pests and diseases in complex environments.

However, such methods are difficult to adapt to lightweight deployment due to their structural complexity and computational expense. How to efficiently utilize frequency domain information under conditions of limited computing power remains a core challenge in current model design.

2.3 Edge and high-frequency feature enhancement methods

Edge information is an important basis for target contour and shape discrimination. In fine-grained visual tasks such as pest and disease detection, high-frequency textures and boundary features are crucial for distinguishing lesions, insect holes, and healthy regions. However, low-resolution images lose a large amount of high-frequency detail during imaging, resulting in edge blurring and texture degradation. To solve this problem, researchers have explored enhancement strategies for edge and high-frequency features from multiple perspectives. For example, Zhao et al. (2019) proposed EGNet, which guides target localization by explicitly modeling edge features (Edge Guidance Stream), compensating for the loss of boundary information in deep networks. Qiu et al. (2024) proposed an Adaptive Compressed Sensing (ACS) architecture that captures key edge regions through a cascaded guidance mechanism, providing a low-overhead solution for preserving pest and disease detail. Zheng and Yang (2024) proposed the Contextual Boundary Aware Network (CBA-Net), which strengthens the model's capture of salient object contours through a contextual boundary awareness mechanism.

Although existing methods have made progress in edge and high-frequency feature enhancement, they still have limitations in fine-grained tasks such as pest and disease detection. Existing methods mostly focus on salient edges, with insufficient reconstruction capability for small textures such as early lesions and insect holes; frequency domain and spatial domain mechanisms are often designed independently, making it difficult to collaboratively capture global and local features; in addition, improper module design can easily lead to a surge in computational overhead or feature distribution imbalance. Therefore, how to design a lightweight and efficient enhancement mechanism that balances fine-grained texture recovery and global reconstruction quality is the core problem of this paper.

2.4 Application of multi-scale attention mechanisms

Multi-scale feature fusion is a key technology for improving object detection performance. Classic structures such as FPN and PAN fuse features at different scales through top-down and bottom-up pathways, but their information interaction mainly relies on element-wise addition or concatenation, lacking explicit modeling of cross-scale semantic relationships. In recent years, attention mechanisms have been introduced into detection networks to enhance feature selectivity and context awareness. For example, Sun et al. (2025) proposed the SRCA attention module, which effectively integrates high- and low-resolution features through adaptive weighting and bidirectional fusion, significantly improving the multi-scale perception of tomato leaf lesions. Zhang (2025) designed the MHCF encoder, which enhances multi-scale feature fusion using the Transformer structure, achieving a balance between accuracy and efficiency in pomegranate detection in complex orchard environments.

However, existing attention-based multi-scale fusion methods are usually accompanied by high computational complexity, making it difficult to directly adapt to resource-constrained lightweight detection models. How to design computationally efficient attention mechanisms while maintaining the effectiveness of multi-scale feature fusion, achieving a balance between detection accuracy and model lightweight, is the core problem that this paper is committed to solving.

3 Methods

3.1 Overall network architecture

Based on the baseline model lightweight detection framework, this paper constructs an improved model DFSNet (Directional Feature Aggregation and Shape-Aware Supervision for Eggplant Pest and Disease Detection) for pest and disease detection on eggplant fruits in natural scenes. The structure is shown in Figure 2a. While maintaining the original inference efficiency advantages, DFSNet performs targeted optimization on backbone feature extraction, feature fusion, and detection head, specifically addressing the characteristics of pest and disease targets such as “small scale, multiple morphologies, and weak edges”.

Figure 2
Illustration of a neural network architecture with three sections: (a) Backbone, (b) Neck and Head, and (c) Detailed CSP-MSLA module. Section (a) shows sequential processing steps like PConv and MSDA. Section (b) highlights the Neck including concatenations, upsampling, and the CSP-MSLA leading to the Head, specifically SDDH. Section (c) details the CSP-MSLA module with Conv, Split, MSLA, and SiLU layers. A small image inset shows a plant stem with pore-like structures. The design is color-coded and labeled with module names and parameters.

Figure 2. Overall architecture of DFSNet: (a) the complete network, (b) the MSDA module and (c) the CSP-MSLA module.

In the backbone network, Conv is replaced by PConv (Pinwheel Convolution) in the P1–P2 layers to enhance shallow directional texture and fine-grained edge feature representation; meanwhile, the MSDA (Multi-Scale Directional Aggregation) module, shown in Figure 2b, is introduced into the C3K2 structure to improve the aggregation of multi-directional structural information. In the neck network, the CSP-MSLA structure, shown in Figure 2c, is designed to integrate the multi-scale linear attention mechanism into the cross-stage partial connection framework, achieving selective enhancement of key pest and disease regions. Finally, SDDH (Shape-Based Dynamic Detection Head) is introduced to improve the model's adaptability to different pest and disease geometric features through a shape-aware dynamic loss function.

3.2 PConv-enhanced shallow feature extraction

In the baseline model YOLOv11 original network, backbone feature extraction mainly relies on standard two-dimensional convolution operators. The convolution kernels in standard convolution have a unified response form in all spatial directions. This modeling approach implicitly assumes that local structures have similar statistical characteristics in different directions. However, in natural scene eggplant fruit pest and disease detection, this assumption is difficult to establish. For example, fruit borer holes typically exhibit extremely small scale and weak edge local structures, while melon thrips bite marks present obvious elongated stripe morphology with strong directional dependence. Standard convolution has limited discriminative capability for these directional features in the shallow stages, easily leading to the weakening of key information during the feature downsampling process. Therefore, this paper introduces the PConv (Pinwheel Convolution) structure in the P1, P2 layers of the YOLOv11 backbone network, as shown in Figure 3, to replace standard convolution.

Figure 3
Diagram illustrating a pinwheel-shaped convolution module. The process begins with a grid, forming an overlapping structure. Multiple CBS blocks are concatenated into a 3D grid. This undergoes a Conv(2,2) operation, resulting in a receptive field visualization, featuring a hash operation with a pinwheel pattern.

Figure 3. Pinwheel-shaped convolution module. The CBS module consists of three parts: Conv, BN (Batch Normalization), and SiLU. Concat stands for concatenate.

PConv (Yang et al., 2025) achieves explicit modeling of local structural directionality by introducing asymmetric padding strategies in different spatial directions of the input feature map and combining parallel convolution operations. Specifically, for the four directions of left, right, top, and bottom, different forms of asymmetric padding are applied respectively, as shown in Equation 1:

X^{(d)} = P^{(d)}(X), \quad d \in \{\mathrm{left}, \mathrm{right}, \mathrm{top}, \mathrm{bottom}\}    (1)

where P(d) represents the asymmetric padding operation applied along the d-th direction, used to introduce directionally biased spatial context information. Subsequently, CBS operations are performed on the four directionally enhanced feature maps respectively, with the expression given by Equation 2:

X_i = \mathrm{SiLU}\big(\mathrm{BN}(X^{(d)} \ast k_i)\big), \quad i \in \{1, 2, 3, 4\}    (2)

where k_i is the convolution kernel of the i-th branch and ∗ denotes convolution. Finally, the output feature X_out is obtained, as shown in Equation 3:

X_{out} = \mathrm{SiLU}\big(\mathrm{BN}(\mathrm{Concat}(X_1, X_2, X_3, X_4) \ast K_{2\times 2})\big)    (3)

Compared to traditional convolution, this approach can more sensitively capture target structures with obvious directional features, such as lesion edge contours and elongated insect bodies, while maintaining a low number of parameters. By introducing PConv before attention modeling, the model can proactively highlight morphological information related to pests and diseases and suppress the interference of complex background textures on subsequent feature fusion processes. Meanwhile, explicit directional structural modeling capability is introduced at the shallow stage, enhancing the discriminability of edge and texture features while maintaining lightweight characteristics.
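For reference, a minimal PyTorch sketch of a pinwheel-style convolution following Equations 1–3 is given below. The branch kernel sizes, the equal channel split across branches, and the stride-2 merge convolution are illustrative assumptions and may differ from the exact PConv implementation of Yang et al. (2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PinwheelConv(nn.Module):
    """Sketch of a pinwheel-style convolution (Eqs. 1-3): four branches apply
    asymmetric padding (left/right/top/bottom) before a 1xk or kx1 CBS block,
    and the concatenated responses are merged by a 2x2 convolution. Kernel
    sizes, branch widths, and the stride-2 merge are illustrative assumptions,
    not the authors' exact configuration."""
    def __init__(self, c_in, c_out, k=3, stride=2):
        super().__init__()
        c_mid = c_out // 4
        def cbs(kh, kw):
            return nn.Sequential(nn.Conv2d(c_in, c_mid, (kh, kw), bias=False),
                                 nn.BatchNorm2d(c_mid), nn.SiLU())
        # Asymmetric pads in (left, right, top, bottom) order for F.pad (Eq. 1).
        self.pads = [(k - 1, 0, 0, 0), (0, k - 1, 0, 0),
                     (0, 0, k - 1, 0), (0, 0, 0, k - 1)]
        self.branches = nn.ModuleList([cbs(1, k), cbs(1, k), cbs(k, 1), cbs(k, 1)])
        self.merge = nn.Sequential(nn.Conv2d(4 * c_mid, c_out, 2, stride, bias=False),
                                   nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        # Direction-specific padding followed by a CBS block per branch (Eq. 2).
        outs = [b(F.pad(x, p)) for b, p in zip(self.branches, self.pads)]
        return self.merge(torch.cat(outs, dim=1))  # Eq. 3

x = torch.randn(1, 16, 128, 128)
print(PinwheelConv(16, 32)(x).shape)  # torch.Size([1, 32, 64, 64])
```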

3.3 C3K2-MSDA

Although introducing the directional convolution PConv at shallow layers enhances detail representation, the baseline model, which adopts small convolution kernels and a shallow design, is prone to insufficient context modeling and limited receptive fields when dealing with large targets and complex backgrounds. This paper therefore introduces a wavelet-enhanced multi-scale directional aggregation module, MSDA (Multi-Scale Directional Aggregation), into C3K2 at the P3, P4, and P5 stages of the backbone, obtaining C3K2-MSDA as shown in Figure 4. In this module, one path maintains the original cross-stage connection of C3K2, while the other path introduces MSDA for enhanced modeling, with the structure shown in Figure 2b. This design preserves the continuity of the original C3K2 feature flow while providing additional multi-scale attention to the network. The MSDA structure consists of two branches: the multi-scale modeling branch and the edge enhancement (DEE) branch. Let the input feature be F ∈ ℝ^{C×H×W}, where C, H, and W denote the number of channels, height, and width, respectively.

Figure 4
Flowchart of the C3K2-MSDA model showing sequential components: a 1x1 convolution, a split, MSDA, another 1x1 convolution, SiLU activation, and more convolutions, concluding with a concatenation process. Arrows depict data flow between elements, with feedback loops and repetition indicated.

Figure 4. Architecture of C3K2-MSDA.

First, a 1×1 convolution is used to complete channel mapping as shown in Equation 4:

F_0 = \mathrm{Conv}_{1\times 1}(F)    (4)

3.3.1 Multi-scale modeling branch

The multi-scale modeling branch adopts a continuous WTConv structure combined with the GELU nonlinear function as shown in Figure 2b, and the feature transformation process can be expressed as Equation 5:

F_0 = [F_m, F_a]    (5)

where F_m is used for multi-scale feature modeling, and F_a is used for attention weight generation. In the multi-scale branch, a continuous feature transformation structure based on WTConv is introduced to enhance the perception of local patterns at different scales. The feature extraction process of this branch can be expressed as Equation 6:

F'_m = \phi_{w2}\big(\delta(\phi_{w1}(F_m))\big)    (6)

where ϕ_w1(·) and ϕ_w2(·) represent WTConv operations, and δ(·) is the GELU nonlinear activation function. To avoid information attenuation in deep networks while enhancing feature stability, the feature F_ms after introducing the residual connection is given by Equation 7:

F_{ms} = F_m + F'_m    (7)

This structure enables multi-scale features to obtain richer contextual representations while maintaining original structural information. WTConv (Finder et al., 2024) (as shown in Figure 5), as a key component for implementing large receptive field modeling in the MSDA module, is used to enhance the global representation capability of features while maintaining local structural information. WTConv decomposes the input feature map into four frequency subbands through discrete wavelet transform, including LL, LH, HL, and HH. Among them, LL mainly reflects the overall structure and semantic information of the target, while (LH, HL, and HH) correspond to detail features such as edges and textures. By independently modeling features in different frequency bands and fusing them in subsequent stages, WTConv can introduce cross-scale and cross-frequency feature responses without significantly increasing computational complexity. This multi-band feature aggregation approach enables the network to simultaneously perceive local details and larger-scale contextual information, thereby effectively expanding the receptive field in the MSDA module and enhancing the representation capability for edges and structural changes of pest and disease targets.

Figure 5
Diagram illustrating the WTConv architecture. Input \(X\) undergoes a convolution, producing intermediary outputs. Wavelet transform (WT) decomposes \(X\) into sub-bands \(X^{1}_{LL}, X^{1}_{LH}, X^{1}_{HL}, X^{1}_{HH}\). These are processed through convolution and additional WT steps, generating transformed sub-bands \(X^{2}_{LL}, X^{2}_{LH}, X^{2}_{HL}, X^{2}_{HH}\). After further transformations, inverse wavelet transform (IWT) recomposes the signals into \(X'\), contributing to the final output. Multiple operations merge through addition at various stages.

Figure 5. Architecture of WTConv.
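A simplified sketch of the WTConv idea, restricted to a single-level Haar decomposition, is shown below; the actual WTConv of Finder et al. (2024) may use more decomposition levels and different wavelets, and the residual form follows Equations 5–7 only loosely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed 2x2 Haar analysis filters producing the LL, LH, HL, HH subbands.
_HAAR = 0.5 * torch.tensor([[[ 1.,  1.], [ 1.,  1.]],   # LL
                            [[ 1.,  1.], [-1., -1.]],   # LH
                            [[ 1., -1.], [ 1., -1.]],   # HL
                            [[ 1., -1.], [-1.,  1.]]])  # HH

class WTConvLite(nn.Module):
    """Sketch of a WTConv-style block: a Haar DWT splits each channel into
    four half-resolution subbands, a depthwise 3x3 convolution is applied per
    subband, and the inverse transform restores the original resolution. A
    residual path keeps the spatial-domain information. The single-level Haar
    decomposition is a simplifying assumption."""
    def __init__(self, channels):
        super().__init__()
        self.c = channels
        self.register_buffer("haar", _HAAR.unsqueeze(1).repeat(channels, 1, 1, 1))
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, 3, padding=1,
                                      groups=4 * channels, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        b, c, h, w = x.shape
        sub = F.conv2d(x, self.haar, stride=2, groups=c)               # DWT
        sub = self.act(self.subband_conv(sub))                         # per-band conv
        rec = F.conv_transpose2d(sub, self.haar, stride=2, groups=c)   # IWT
        return x + rec                                                 # residual path

x = torch.randn(1, 64, 80, 80)
print(WTConvLite(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```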

3.3.2 Edge feature enhancement module DEE

In pest and disease detection tasks, the edges of pests and diseases such as fruit borer holes, fruit rot lesions, and melon thrips stripes present rapidly changing pixel intensity regions on eggplant fruit surface images. These regions change dramatically and constitute high-frequency information, which is also the core feature for distinguishing pest and disease features from healthy fruit images. Therefore, effective modeling of high frequency edge features can effectively highlight the contour and shape features of pests and diseases. However, traditional convolutional neural networks tend to produce smoothing effects on high-frequency details during layer-by-layer downsampling and feature fusion processes, resulting in the gradual weakening of edge information. To address this problem, this paper designs DEE (Edge feature Enhancement Module) as shown in Figure 6, which explicitly highlights high-frequency change regions in the input features, enabling the network to perceive the contour and shape information of pest and disease regions during the feature extraction stage.

Figure 6
Diagram depicting an edge feature enhancement module with a flowchart of operations. Input feature \( F_a \) is split and passed through two depthwise convolution (DWConv) units. Outputs \( F_{d1} \) and \( F_{d2} \) are combined by element-wise addition, then processed through a \( 1 \times 1 \) convolution and a sigmoid function, producing output \( A \). This is element-wise multiplied with the initial split input. A legend describes symbols for element-wise addition and multiplication.

Figure 6. Architecture of DEE module.

This module models gradient changes in the feature map and injects the enhanced edge response into the original features in residual form, thereby avoiding interference with the overall semantic structure. It is used to effectively strengthen the representation capability of edge and high-frequency information without changing the spatial resolution of the input feature map. The DEE module acts on intermediate layer features of the network, and its enhancement process is learnable, capable of adaptively adjusting the response intensity to different edge patterns according to task requirements.

This module takes intermediate layer feature mapping as input and first calculates feature gradients (F_d1, F_d2) in the horizontal and vertical directions respectively, representing regions with relatively drastic pixel intensity changes in the feature map, thereby explicitly extracting high-frequency information such as edges and textures. Subsequently, a comprehensive edge response is obtained through gradient magnitude fusion, and 1×1 convolution is used for channel mapping and adaptive reweighting of edge features. Finally, the enhanced edge features are combined with the original features in residual form, achieving effective enhancement of target boundaries and local structures while avoiding disruption of the original semantic information distribution. Equations 8-12 express the feature information flow process.

F_{d1} = \phi_{dw1}(F_{a1})    (8)
F_{d2} = \phi_{dw2}(F_{a2})    (9)

where ϕ_dw1(·) and ϕ_dw2(·) represent depthwise convolutions, used to obtain spatial structural information at lower computational complexity.

F_d = F_{d1} + F_{d2}    (10)

Subsequently, 1×1 convolution and Sigmoid function are used to generate attention weight A as shown in the equation:

A = \sigma\big(\phi_{1\times 1}(F_d)\big)    (11)

and perform element-wise recalibration on the input features:

F_{att} = F_d \odot A    (12)

where ⊙ represents element-wise multiplication. This process achieves joint feature selection at both spatial and channel levels without introducing global pooling or high-complexity operators.

DEE enhances the activation intensity of pest and disease target edge regions through adaptive weighting of multi-directional feature responses, improving the model's perception of fruit borer hole edges and fruit rot lesion contours. This module maintains a lightweight structure, effectively balancing directional information modeling capability and computational efficiency.
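The following minimal PyTorch sketch illustrates the information flow of Equations 8–12; the specific depthwise kernel shapes used to approximate horizontal and vertical gradients and the channel split are illustrative assumptions rather than the exact DEE configuration.

```python
import torch
import torch.nn as nn

class DEE(nn.Module):
    """Sketch of the Directional Edge Enhancer (Eqs. 8-12): the input is split
    along channels, each half passes through a depthwise convolution oriented
    horizontally or vertically (approximating directional gradients), the
    responses are summed, and a 1x1 conv + sigmoid produces a weight map that
    recalibrates the edge response. Kernel shapes and the channel split are
    illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.dw_h = nn.Conv2d(half, half, (1, 3), padding=(0, 1),
                              groups=half, bias=False)   # horizontal response
        self.dw_v = nn.Conv2d(half, half, (3, 1), padding=(1, 0),
                              groups=half, bias=False)   # vertical response
        self.proj = nn.Conv2d(half, half, 1)

    def forward(self, fa):
        fa1, fa2 = fa.chunk(2, dim=1)          # channel split of F_a
        fd = self.dw_h(fa1) + self.dw_v(fa2)   # Eq. 10
        a = torch.sigmoid(self.proj(fd))       # Eq. 11
        return fd * a                          # Eq. 12 (half the input channels)

x = torch.randn(1, 64, 40, 40)
print(DEE(64)(x).shape)  # torch.Size([1, 32, 40, 40])
```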

3.4 CSP-MSLA neck and shape-aware detection head

3.4.1 CSP-MSLA neck design

In natural scenes, complex illumination, leaf occlusion, and background texture similarity easily introduce a large amount of redundant features, weakening the multi-scale feature fusion effect. In object detection, the Neck structure plays an important role in connecting the backbone network and detection head, with its core objective being to achieve effective alignment and fusion of multi-scale features. The original Neck of YOLOv11n mainly relies on lightweight convolution modules such as C3K2 for feature transformation, possessing certain local modeling capability while ensuring computational efficiency. However, such structures are still essentially dominated by spatial domain convolution, with limited modeling capability for cross-scale contextual relationships and long-range dependencies. Especially in complex agricultural scenarios, the semantic associations between small scale lesions and medium-scale fruit regions are difficult to fully characterize. On the other hand, attention mechanisms demonstrate obvious advantages in modeling long-distance dependencies and global information interaction, but directly introducing standard self-attention structures often brings high computational and storage overhead, making them unsuitable for lightweight detection frameworks. Based on this, in the baseline Neck part, this paper introduces the Multi-Scale Linear Attention (MSLA) module for key scale feature layers P3, P4, and P5, and deeply integrates it with the C3K2 structure to reconstruct the module into CSP-MSLA units as shown in CSP-MSLA in Figure 2c. The CSP structure reduces redundant computation and enhances gradient flow through cross-stage feature splitting and recombination. The MSLA structure, as shown in Figure 7, explicitly introduces multi-scale global modeling capability, modeling cross-position and cross-scale global association relationships in multi-scale feature space. The combination of the two enables the neck network to significantly enhance the model’s capability to localize pest and disease target regions of interest and suppress irrelevant information while maintaining lightweight characteristics.

Figure 7
Diagram illustrating a multi-scale linear attention model. It shows a process from input channels through multi-scale feature extraction using depthwise convolutions (DWConv) with different kernel sizes (3x3, 5x5, 7x7, 9x9) and ReLU activation. Outputs are fed into a multi-head efficient attention mechanism involving linear transformations and matrix multiplications (Matmul). Combined results are processed via 1x1 convolution, resulting in an output. Various stages and connections are visually labeled, detailing the flow and transformations of data.

Figure 7. Architecture of MSLA.

In the MSLA (Multi-Scale Linear Attention) module, the three branches Q, K, and V are retained, but to reduce computational complexity, the global computation of QK^T is approximated linearly. Meanwhile, multi-scale convolutions (e.g., 3 × 3, 5 × 5, 7 × 7, 9 × 9) are applied to enhance the features of Q and K, and matrix multiplication is used to calculate weights for local regions. That is,

Y = \phi(Q) \cdot \big(\phi(K)^{T} \cdot V\big),

where ϕ represents a kernel function approximation or multi-scale feature transformation. This transformation bypasses the explicit Softmax, achieving the attention intensity distribution through kernel approximation or weight normalization. Because of the block computation order (i.e., first calculating K^T V, then multiplying with Q), the overall computational complexity is reduced from O(N^2) in traditional self-attention to O(N). By constructing multi-scale parallel convolution branches, MSLA can capture feature responses under different receptive fields and use the linear attention mechanism to perform weighted fusion of multi-scale features, enabling coordinated representation of local detail and global semantic information. This design promotes effective transfer of multi-scale features in the backbone network and helps improve cross-scale feature modeling.
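The linearization step can be summarized with the short sketch below, which implements a single-head kernel-based linear attention (ReLU feature map) and omits the multi-scale depthwise convolutions on Q and K; the normalization term is one common choice and is an assumption here.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Kernel-based linear attention sketch: with a non-negative feature map
    phi (ReLU here), computing phi(K)^T V first reduces the cost from O(N^2)
    to O(N) in the number of tokens N. The multi-scale depthwise convolutions
    on Q/K used by MSLA are omitted for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                      # x: (B, N, C), N = H*W tokens
        q = torch.relu(self.q(x))              # phi(Q)
        k = torch.relu(self.k(x))              # phi(K)
        v = self.v(x)
        kv = k.transpose(1, 2) @ v             # (B, C, C), computed once
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return (q @ kv) * z                    # (B, N, C), no explicit softmax

x = torch.randn(2, 40 * 40, 64)
print(LinearAttention(64)(x).shape)  # torch.Size([2, 1600, 64])
```

Because K^T V is a C×C matrix independent of the number of tokens, the memory and compute cost grow linearly with the feature map size, which is what makes the mechanism suitable for a lightweight neck.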

3.4.2 Shape-aware dynamic detection head

3.4.2.1 Baseline model loss function

The baseline model adopts a fixed loss function in the detection head, which consists of the classification loss Lcls, the objectness confidence loss Lobj, and the bounding box regression loss Lreg. The overall formulation can be expressed as Equation 13.

L_{\mathrm{YOLOv11}} = L_{cls} + L_{obj} + \lambda L_{reg}    (13)

Here, Lreg is typically optimized based on IoU or its variants, and λ denotes the weighting coefficient of the regression loss term Lreg. This loss function adopts a unified weight allocation strategy for all targets, without explicitly distinguishing the contributions of objects with different scales or shapes during training. In pest and disease detection tasks, such approximately uniform supervision is prone to causing gradient imbalance. On the one hand, small-scale targets (e.g., borer holes) occupy a relatively small proportion of pixels in the feature maps, making their Lreg easily overwhelmed in the overall loss. On the other hand, elongated and stripe-like targets (e.g., feeding traces of thrips) are highly sensitive to slight localization deviations under IoU-based constraints, which leads to instability in the regression process. To address these issues, this paper introduces a Scale-based Dynamic Loss on the basis of the original regression structure to dynamically adjust the regression supervision. This strategy is applied to the detection head, forming the SDDH (Shape-aware Dynamic Detection Head).

3.4.2.2 Scale-based dynamic loss

It is well known that IoU-based losses (Sloss) exhibit relatively large fluctuations in small object detection, which negatively affect model stability and regression performance. In Sloss with bounding box (BBox) annotations, smaller objects usually receive lower attention weights, whereas mask annotations have a greater impact on small or irregularly shaped objects. Therefore, some studies dynamically adjust the influence coefficients β of Sloss and Lloss according to object scale, so as to enhance the influence of Sloss on mask annotations and reduce the adverse effects of inaccurate annotations on the stability of the loss function, thereby ensuring that the model pays more attention to small or irregularly shaped objects. This loss function mainly consists of two components: LSDB (the Scale-based Dynamic Loss for the BBox) and LSDM (the Scale-based Dynamic Loss for the Mask). The computation of LSDB and its related parameters are given in Equations 14-20, while the computation of LSDM and its related parameters are presented in Equations 21-23.

• The SDB

The scale-based dynamic loss for the bounding box is composed of a scale consistency loss BS and a localization loss BL with corresponding weights. It is defined as Equation 14:

\mathrm{SDB} = \beta_1 \times B_S + \beta_2 \times B_L    (14)

Here, β1 ∈ [0.5, 1.0] and β2 ∈ [1.0, 1.5], as defined in Equation 18, denote the dynamic weighting coefficients for the bounding box scale loss B_S (see Equation 15) and the localization loss B_L (see Equation 16), respectively. The scale loss is defined as:

B_S = 1 - \mathrm{SIoU} + \gamma    (15)
B_L = \dfrac{d^{2}\big((x_{sbp}, y_{sbp}), (x_{sgt}, y_{sgt})\big)}{L^{2}}    (16)

Let BP and Bgt denote the predicted bounding box and the ground-truth bounding box, respectively. The scale-aware SIoU is defined as Equation 17:

\mathrm{SIoU} = \dfrac{|B_P \cap B_{gt}|}{|B_P \cup B_{gt}|}    (17)

β1 and β2 are defined in Equation 18. These coefficients dynamically adjust the loss weights according to the object scale. Here, a scale influence factor β3 is introduced, as shown in Equation 19:

\beta_1 = 1 - \delta + \beta_3, \qquad \beta_2 = 1 + \delta - \beta_3    (18)
\beta_3 = \min\!\left(\dfrac{|B_{gt}|}{\max |B_{gt}|} \times \theta \times \delta,\ \delta\right)    (19)

where δ = 0.5 is the upper limit for scale adjustment, used to constrain the range of weight variation and prevent instability during training, and max|B_gt| = 81 is the maximum size of an infrared small target (IRST) as defined by the Society of Photo-Optical Instrumentation Engineers (Zhang et al., 2003). The scale mapping factor θ is defined as Equation 20:

\theta = \dfrac{size_i}{size_f}    (20)

Here, β3 is the scale influence factor for both the BBox and Mask branches; the function d(·) denotes the Euclidean distance; L represents the diagonal length of the minimum enclosing rectangle that simultaneously bounds the predicted box B_P and the ground-truth box B_gt, used to normalize the center point distance; and size_i and size_f denote the dimensions of the original image and the current feature map, respectively.

• The SDM

The SDM is similarly composed of the mask scale loss M_S and the mask localization loss M_L with corresponding weights (β1′ ∈ [1.0, 1.5], β2′ ∈ [0.5, 1.0]), as defined in Equation 21:

\mathrm{SDM} = \beta'_1 \times M_S + \beta'_2 \times M_L    (21)

Let M_P and M_gt denote the sets of pixels in the predicted mask and the ground-truth mask, respectively, and let p be a weighting coefficient. The mask scale loss M_S is defined by Equations 22 and 23:

\mathrm{MIoU} = \dfrac{|M_P \cap M_{gt}|}{|M_P \cup M_{gt}|}    (22)
M_S = 1 - p \cdot \mathrm{MIoU}    (23)

The mask localization loss ML is defined as Equation 24:

M_L = 1 - \dfrac{\min(d_{mp}, d_{mgt})}{\max(d_{mp}, d_{mgt})} + \dfrac{4(\theta_{mp} - \theta_{mgt})^{2}}{\pi^{2}}    (24)

Here, d_mp and d_mgt denote the average distances of the predicted mask pixels and the ground-truth mask pixels from the origin in polar coordinates, respectively; θ_mp and θ_mgt represent the average angles of the predicted mask pixels and the ground-truth mask pixels in polar coordinates, respectively.

The Scale-based Dynamic Loss (SDLoss) employed in this study incorporates object scale information into the loss computation process to achieve adaptive constraints for targets of different scales. In the bounding box regression branch, the scale consistency term and the localization term are combined with weighted summation, and the weights are adjusted using scale factors, thereby applying differentiated supervision to targets of varying scales without altering the original regression formulation. In the mask branch, SDLoss integrates pixel-level overlap constraints with polar-coordinate-based spatial distribution modeling, providing joint constraints on the regional consistency and spatial distribution of mask predictions, which helps improve the modeling stability for irregularly shaped targets. It should be noted that the mathematical definition of the Scale-based Dynamic Loss is not modified in this work; rather, it is applied to pest and disease detection tasks and combined with the proposed detection head structure to accommodate the variations in scale and shape of pest and disease targets in natural scenes.
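To make the weighting behavior of Equations 18–20 concrete, the sketch below computes β1 and β2 from the ground-truth box area; the image/feature sizes and the clamping form are illustrative assumptions consistent with δ = 0.5 and the 81-pixel small-target bound used above.

```python
import torch

def scale_dynamic_weights(gt_area, size_img=640, size_feat=80,
                          delta=0.5, max_area=81.0):
    """Sketch of the scale-dependent weights in Eqs. 18-20. gt_area is the
    ground-truth box area; size_img/size_feat give the mapping factor theta;
    max_area follows the SPIE small-target bound cited in the paper. The
    clamping and area normalization details are illustrative assumptions."""
    theta = size_img / size_feat                                         # Eq. 20
    beta3 = torch.clamp(gt_area / max_area * theta * delta, max=delta)   # Eq. 19
    beta1 = 1.0 - delta + beta3                                          # Eq. 18
    beta2 = 1.0 + delta - beta3
    return beta1, beta2

# Small targets (tiny gt_area) get beta1 near 0.5 and beta2 near 1.5,
# shifting supervision toward the localization term B_L.
areas = torch.tensor([1.0, 9.0, 81.0])
b1, b2 = scale_dynamic_weights(areas)
print(b1, b2)
```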

4 Experiments and results

4.1 Dataset construction

In this study, a custom dataset of eggplant fruit pests and diseases was established, comprising four categories: FruitBorer, FruitRot, MelonThrips, and Healthy, as shown in Figure 8. The dataset was primarily derived from two sources. The first source consists of sample images collected on October 3, 2025, in a vegetable greenhouse in Shouguang, Shandong Province, China. The original images, captured using an iPhone 14 at a resolution of 1920×1080, covered various lighting conditions and shooting angles to enhance data diversity, resulting in 1,074 images in JPEG format. After removing blurred, duplicate, and invalid images, 673 valid samples remained. These images were annotated using the Label Studio tool to label the pest and disease regions and their corresponding categories, and the annotations were saved in YOLO format. The second source is the publicly available Eggplant Fruit Disease dataset from the Roboflow platform, from which 2,177 pest and disease images were randomly selected. Combined with the first source, a total of 3,250 sample images were obtained, and all images were resized to 640×640 pixels. To improve model generalization and robustness, data augmentation techniques, including Mosaic, random translation, horizontal/vertical flipping, non-uniform scaling, brightness adjustment, and Gaussian noise injection, were applied to expand the dataset to 7,256 images. The dataset was randomly split into training (5,080 images), validation (1,451 images), and test (725 images) sets in a 7:2:1 ratio.

Figure 8
Four eggplants are shown in separate stages of condition. The first is affected by a fruit borer with visible holes. The second has fruit rot with dark patches. The third is damaged by melon thrips, showing a scarred area. The fourth is a healthy eggplant with no visible damage, hanging on a plant with green leaves.

Figure 8. Some samples of eggplant fruit disease dataset.
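An offline augmentation pipeline approximating the operations listed above could be assembled, for example, with the Albumentations library as sketched below; parameter values are illustrative, and Mosaic augmentation is normally applied inside the detector's training loader rather than in such a pipeline.

```python
import albumentations as A

# Illustrative offline augmentation covering translation, flips, non-uniform
# scaling, brightness adjustment, and Gaussian noise; values are examples,
# not the settings used to build the dataset.
augment = A.Compose(
    [
        A.Affine(translate_percent=(-0.1, 0.1), scale=(0.8, 1.2), p=0.7),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.2),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
        A.GaussNoise(p=0.3),
        A.Resize(640, 640),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
# Usage: out = augment(image=img, bboxes=yolo_boxes, class_labels=labels)
```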

To further evaluate the generalization capability of the proposed method in agricultural vision tasks, comparative experiments were conducted on the publicly available PlantDoc dataset. PlantDoc is an open dataset for disease detection in real agricultural scenarios, covering multiple crop types and their corresponding disease categories. The dataset exhibits uneven object scale distribution, diverse lesion morphologies, and complex backgrounds, which effectively reflect the practical challenges of object detection tasks in natural cultivation environments. Representative sample images are shown in Figure 9.

Figure 9
A grid of plant images shows various diseases and conditions affecting crops. Top row: corn with gray leaf spots, bean with rust, grape with black rot, tomato with rust. Middle row: apple with scab, leaves with bacterial spot, tomato with septoria, potato with early blight. Bottom row: healthy blueberry plant, strawberry with yellow leaf, blueberry with bilberry, and a peach with a leaf.

Figure 9. Some samples of PlantDoc dataset.

4.2 Experimental environment

All experiments in this study were conducted using the same model parameters and environmental settings. The experimental environment and model parameter configurations are listed in Table 1; training was run for 150 epochs with a batch size of 16.

Table 1

Table 1. Experimental environment parameters.

4.3 Performance evaluation metrics

In the eggplant fruit pest and disease detection task, multiple evaluation metrics were employed to comprehensively assess the detection accuracy and computational efficiency of the lightweight model in complex natural environments. The metrics include Precision (P), Recall (R), mean Average Precision (mAP@50 and mAP@50–95), number of parameters (Params), and floating-point operations (GFLOPs). True positives (TP) are defined as correctly detected fruit borer holes, fruit rot lesions, or thrips feeding traces; false positives (FP) occur when fruit surface textures, glare, or other regions are incorrectly identified as pests or diseases; false negatives (FN) correspond to missed detections of existing borer holes or lesions, particularly for small-scale targets. The main formulas are presented in Equations 25–28.

P = \dfrac{TP}{TP + FP}    (25)
R = \dfrac{TP}{TP + FN}    (26)
AP = \int_{0}^{1} P(R)\, dR    (27)

Here, P(R) represents the precision-recall curve with recall as the horizontal axis. The mAP is then obtained by averaging over all categories:

mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i    (28)
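For clarity, Equations 25–28 can be computed as in the following sketch; the all-point interpolation used for AP is one common convention, and YOLO-style evaluators may use 101-point interpolation, so exact values can differ slightly. The numbers in the example are arbitrary, not results from this study.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. 25-26 from raw detection counts."""
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return p, r

def average_precision(recall, precision):
    """Eq. 27: area under the precision-recall curve (all-point interpolation).
    Inputs are assumed sorted by decreasing confidence (recall increasing)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]       # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

r = np.array([0.2, 0.4, 0.6, 0.8])
p = np.array([1.0, 0.9, 0.7, 0.5])
print(round(average_precision(r, p), 3))           # 0.62

# mAP (Eq. 28) is the mean of per-class AP values (arbitrary example values).
ap_per_class = [0.9, 0.8, 0.7, 0.6]
print(sum(ap_per_class) / len(ap_per_class))       # 0.75
```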

4.4 Comparative study

To comprehensively evaluate the performance of various object detection models in the eggplant pest and disease detection task, this study selected mainstream models including YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, Faster R-CNN, and the RT-DETR-r18 variant for experiments on the custom eggplant pest and disease dataset. The comparative results on the custom eggplant dataset and on the publicly available PlantDoc dataset are presented in Tables 2 and 3, respectively.

Table 2

Table 2. Comparison of results on the eggplant dataset.

Table 3

Table 3. Comparison of results on the PlantDoc dataset.

Overall, DFSNet shows stable and competitive performance on both the eggplant and PlantDoc datasets. On the eggplant dataset, the proposed method achieves favorable mAP results with fewer parameters and lower computational cost, indicating that the lightweight design improves efficiency without sacrificing accuracy. On the more challenging PlantDoc dataset, DFSNet maintains comparable or slightly improved accuracy relative to the baseline, while preserving its advantages in inference speed and model compactness. These results suggest that DFSNet offers a reasonable balance between detection performance and computational efficiency for disease and pest detection in agricultural applications.

4.5 Ablation study

To evaluate the contribution of each module to detection performance, YOLOv11n was used as the baseline model, and the PConv, C3K2-MSDA, CSP-MSLA, and SDDH modules were progressively incorporated to construct the complete DFSNet (OURS) model. The experiments were conducted on the custom eggplant pest and disease dataset with the evaluation metrics described in Section 4.3. The results of the ablation study are presented in Table 4.

Table 4

Table 4. Ablation study results of different modules.

The ablation study results indicate that the contributions of different improvement modules within the network exhibit clear hierarchical and complementary effects. The introduction of PConv in the lower backbone layers stabilizes the preservation of fine-grained features, while the integration of C3K2 and MSDA in the middle and higher layers enhances the network’s capability to represent multi-scale pest and disease targets. On this basis, the incorporation of CSP-MSLA into the Neck stage facilitates more comprehensive feature fusion, improving the information utilization efficiency across different scales. Finally, with the introduction of the SDDH loss function, the model demonstrates better adaptability in target matching and bounding box regression, particularly reflected in improvements in recall and overall detection stability. These results suggest that the modules do not operate in isolation but collaboratively achieve an optimal balance between performance and efficiency.

To further evaluate the effect of each module on the detection performance of different target categories, a category-level comparison of mAP@50 for four detection classes (FruitRot, FruitBorer, Healthy, MelonThrips) was conducted, as presented in Table 5.

Table 5

Table 5. Category-level mAP@50 results for different ablation blocks.

The category-level ablation results indicate that the Healthy class exhibits relatively stable performance across different configurations, whereas the FruitRot class shows a consistent improvement trend with the incorporation of multi-scale feature modeling. For small-scale and morphologically complex classes such as FruitBorer and MelonThrips, the collaborative effect of PConv and the multi-scale attention modules significantly enhances feature representation. With the further introduction of the SDDH loss function, the matching and regression performance of these difficult-to-detect classes is improved, leading to overall performance gains that are consistent with the conclusions drawn from the general ablation study.

4.6 Visualization

4.6.1 PConv compared to Conv

A representative eggplant image containing a FruitBorer hole was selected, and shallow feature extraction at the P1 and P2 layers of the backbone network was performed using both PConv and standard Conv. The comparative results are presented in Figure 10. The above comparison indicates that PConv, by employing multi-directional asymmetric convolutional kernels to model local directional structural information in parallel, enables the network to capture more discriminative edge and texture features at shallow layers. This direction-sensitive feature extraction approach facilitates the preservation of critical fine-grained details during subsequent downsampling and multi-scale feature fusion, providing a clear advantage for small-scale targets with prominent edge features, such as FruitBorer holes. In contrast, standard convolution primarily emphasizes the overall local texture distribution at shallow layers, exhibiting limited capability in distinguishing directional structures and fine edges, which may lead to progressive attenuation of small target features under complex natural backgrounds.

Figure 10
Input image of a leaf with red outlined squares highlighting areas is shown on the left. On the right, there are four black and white processed images divided into two layers, labeled “First layer” and “Second layer,” with two methods, “PConv” and “Conv,” showcasing different views at resolutions of 256 by 256 and 128 by 128, each highlighting similar regions with red squares.

Figure 10. Comparison of shallow feature responses between PConv and standard Conv.

4.6.2 Visualization of the model’s feature localization capability

Figure 11 presents heatmap comparisons between the proposed DFSNet model and the baseline model on four representative images. These visualizations intuitively reveal the key image regions that the models focus on when detecting different areas or shapes of eggplant fruit diseases and pests. As shown in Figure 11, distinct activation patterns can be observed across different categories in the Grad-CAM visualizations. For fruit rot samples, high-response regions are mainly concentrated on the diseased areas and show good spatial consistency with the ground-truth annotations and detection results. For melon thrips, the activation exhibits a clear vertically elongated pattern, which is consistent with the characteristic damage morphology. In healthy samples, no localized abnormal activation is observed, and the responses are primarily distributed over the fruit body. For fruit borer samples, the model produces concentrated activations around the infestation regions. Overall, the heatmap results indicate that the model is able to attend to disease- and pest-related regions while maintaining low responses to background areas.

Figure 11
Grid layout of eggplants showing four conditions: Fruit Rot, Melon Thrips, Healthy, and Fruit Borer. Each condition has four columns: original image, ground truth with green boxes, detection results with blue boxes and classification labels, and Grad-CAM heatmaps indicating focus areas for each classification.

Figure 11. Comparative visualization of detection results and Grad-CAM heatmaps for eggplant fruit diseases and pests.

4.6.3 Qualitative comparison on small fruit rot detection

Figure 12 shows that DFSNet (OURS) achieves the best performance on small Fruit Rot detection, providing more accurate localization and higher confidence than the other compared models. YOLOv5n–YOLOv12n, Faster R-CNN, and RT-DETR-r18 show limited robustness, with low confidence or imprecise bounding boxes under complex backgrounds. These results indicate that DFSNet is more effective at capturing fine-grained features of small disease regions.

Figure 12. Qualitative detection results.

4.6.4 Visualization of detection results under complex environments

To intuitively compare the performance of different detection models in identifying eggplant fruit diseases and pests under complex greenhouse conditions, representative samples were selected and the detection results of multiple mainstream object detection models were visualized. As shown in Figure 13, the detection outputs of YOLOv5n, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv12n, Faster R-CNN, RT-DETR-r18, YOLOv11n (Baseline), and the proposed DFSNet are presented for the same scene. Comparing bounding box positions, class predictions, and confidence distributions allows an intuitive assessment of each model's differences in target localization accuracy, class discrimination capability, and adaptability to complex background interference.
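As one possible way to script such a side-by-side comparison, the snippet below runs several Ultralytics-format checkpoints on the same image and saves the annotated outputs. The checkpoint file names, the DFSNet weight file, and the test image path are placeholders; baselines trained in other frameworks (e.g., Faster R-CNN) would need their own inference code.

```python
# Hypothetical batch-visualization script using the Ultralytics API;
# checkpoint names and the test image path are illustrative placeholders.
from ultralytics import YOLO
import cv2

checkpoints = {
    "yolov8n": "yolov8n.pt",
    "yolo11n_baseline": "yolo11n.pt",
    "dfsnet": "dfsnet_best.pt",      # assumed name for the trained DFSNet weights
}

image_path = "eggplant_sample.jpg"   # representative greenhouse image

for name, weights in checkpoints.items():
    model = YOLO(weights)
    results = model.predict(source=image_path, conf=0.25, imgsz=640, verbose=False)
    annotated = results[0].plot()    # BGR image with boxes, labels, confidences
    cv2.imwrite(f"compare_{name}.jpg", annotated)
```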

Figure 13. Comparative performance of different models in complex environments.

The visual comparison of detection results reveals notable differences in the localization and discrimination capabilities of various models under complex backgrounds for eggplant fruit diseases and pests. Some comparative models exhibit overlapping bounding boxes, low confidence scores, or class confusion over fruit surface disease regions, particularly for small-scale targets such as MelonThrips, which are susceptible to occlusion by leaves and interference from background textures. In contrast, DFSNet (OURS) achieves more accurate localization of pest and disease regions, with detection boxes closely aligned with the actual distribution of lesions, while significantly reducing false positives and redundant boxes. Overall, the improved model demonstrates more reliable performance in both target localization stability and class discrimination accuracy, consistent with the results of the previous experiments and ablation analyses, thereby validating the effectiveness and practical applicability of the proposed method in real-world pest and disease detection scenarios.

5 Conclusion

This study addresses the task of detecting eggplant fruit diseases and pests under greenhouse conditions, focusing on challenges such as small target scales, diverse morphologies, complex backgrounds, and limited computational resources on edge devices, and proposes a lightweight and efficient real-time detection approach. The baseline network architecture was specifically improved by incorporating the PConv, C3K2-MSDA, and CSP-MSLA modules into the backbone and neck structures and by combining them with the improved SDDH loss function, yielding the DFSNet model tailored for complex agricultural scenarios. Experimental results demonstrate that the proposed method achieves a favorable balance between detection accuracy and inference efficiency on the eggplant fruit disease and pest dataset. Compared with various mainstream detection models, DFSNet exhibits superior performance in Precision, Recall, and mAP metrics while maintaining low parameter counts and computational complexity, satisfying the requirements for real-time detection and deployment in practical natural environments. Ablation studies and visualization analyses further validate the complementary roles of the proposed modules in feature modeling and target discrimination, in particular providing more stable detection for small-scale and morphologically irregular disease and pest targets.

Despite these achievements, there remains room for improvement. First, in scenarios with severe occlusion or significant illumination variations, the model’s recognition of weakly textured lesions could be further enhanced. Second, the current study is primarily validated on a single crop dataset, and the generalization capability of the model across different crops and environmental conditions requires further evaluation.

Future work will explore the integration of more sophisticated cross-scale feature interaction mechanisms or temporal information to enhance model adaptability in complex dynamic scenarios. Additionally, techniques such as knowledge distillation or self-supervised learning could be employed to further improve the detection performance of lightweight models under small-sample conditions. Long-term operational stability and practical deployment of the model on agricultural robots or embedded devices will also be investigated to provide more reliable technical support for intelligent pest and disease monitoring in agriculture.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

HS: Methodology, Conceptualization, Investigation, Writing – original draft, Formal analysis, Data curation. WF: Data curation, Methodology, Writing – review & editing. JZ: Data curation, Investigation, Writing – review & editing. MF: Validation, Writing – review & editing, Software. FW: Data curation, Writing – review & editing, Investigation. RF: Writing – review & editing, Supervision, Formal analysis, Funding acquisition.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Shandong Provincial Natural Science Foundation (Grant No. ZR2025QC649) and the Shandong Province Higher Education Institutions 2025 Young Innovative Research Team (Grant No. 2025KJH190).

Acknowledgments

The authors wish to acknowledge the contributions of all participants in this study. The authors would like to thank the open-source community of Ultralytics and the agricultural research institutions that provided valuable data and technical support for this work.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Agarwal, M., Gupta, S. K., and Biswas, K. K. (2020). Development of efficient cnn model for tomato crop disease identification. Sustain. Computing: Inf. Syst. 28, 100407. doi: 10.1016/j.suscom.2020.100407

Ashurov, A. Y., Al-Gaashani, M. S. A., Samee, N. A., Alkanhel, R., Atteia, G., Abdallah, H. A., et al. (2025). Enhancing plant disease detection through deep learning: a depthwise cnn with squeeze and excitation integration and residual skip connections. Front. Plant Sci. 15. doi: 10.3389/fpls.2024.1505857

Finder, S. E., Amoyal, R., Treister, E., and Freifeld, O. (2024). “Wavelet convolutions for large receptive fields,” in European Conference on Computer Vision. 363–380 (Springer). doi: 10.48550/arXiv.2407.05848

Fu, R., Wang, S., Dong, M., Sun, H., Al-Absi, M., Zhang, K., et al. (2025). Pest detection in dynamic environments: An adaptive continual test-time domain adaptation strategy. Plant Methods 21, 53. doi: 10.1186/s13007-025-01371-y

Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., and Xu, C. (2020). “Ghostnet: More features from cheap operations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1580–1589. doi: 10.1109/CVPR42600.2020.00165

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., et al. (2019). “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. 1314–1324. doi: 10.1109/ICCV.2019.00140

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv. doi: 10.48550/arXiv.1704.04861

Huang, H., He, R., Sun, Z., and Tan, T. (2017). “Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution,” in Proceedings of the IEEE International Conference on Computer Vision. 1689–1697. doi: 10.1109/ICCV.2017.187

Kellab, R., Boulkenafet, F., Amokrane, S., Benmakhlouf, Z., Bensouici, C., Bounamous, A., et al. (2025). Chemical profiling and in vitro evaluation of the antioxidant, anti-inflammatory, and antibacterial effects of Algerian Solanum melongena L. Indian J. Pharm. Educ. Res. 59, 338–350. doi: 10.5530/ijper.20250132

Li, Q., Shen, L., Guo, S., and Lai, Z. (2020). “Wavelet integrated cnns for noise-robust image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7245–7254. doi: 10.1109/CVPR42600.2020.00727

Li, H., Shi, H., Du, A., Mao, Y., Fan, K., Wang, Y., et al. (2022). Symptom recognition of disease and insect damage based on mask r-cnn, wavelet transform, and f-rnet. Front. Plant Sci. 13. doi: 10.3389/fpls.2022.922797

Li, X., Sun, W., Ji, Y., and Huang, W. (2025a). A joint detection and tracking paradigm based on reinforcement learning for compact hfswr. IEEE J. Selected Topics Appl. Earth Observations Remote Sens. 18, 1995–2009. doi: 10.1109/JSTARS.2024.3504813

Li, X., Sun, W., Ji, Y., and Huang, W. (2025b). S2g-gcn: A plot classification network integrating spectrum-to-graph modeling and graph convolutional network for compact hfswr. IEEE Geosci. Remote Sens. Lett. 22, 1–5. doi: 10.1109/LGRS.2025.3623931

Liu, Z., Hao, Z., Han, K., Tang, Y., and Wang, Y. (2024). Ghostnetv3: Exploring the training strategies for compact models. arXiv. doi: 10.48550/arXiv.2404.11202

Liu, P., Zhang, H., Zhang, K., Lin, L., and Zuo, W. (2018). “Multi-level wavelet-cnn for image restoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 773–782. doi: 10.1109/CVPRW.2018.00121

Pan, C., Wang, S., Wang, Y., and Liu, C. (2025). Ssd-yolo: A lightweight network for rice leaf disease detection. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1643096

Panchananam, L. S., Chandaliya, P. K., Akhtar, Z., Upla, K., and Ramachandra, R. (2025). Waveletfusion: Enhancing plant leaf disease classification with multi-scale feature extraction and explainable ai. Expert Syst. Appl. 285, 127947. doi: 10.1016/j.eswa.2025.127947

Qiu, C., Yue, T., and Hu, X. (2024). “Reconstruction-free cascaded adaptive compressive sensing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2620–2630. doi: 10.1109/CVPR52733.2024.00253

Salka, T. D., Hanafi, M. B., Rahman, S. M. S. A. A., Zulperi, D. B. M., and Omar, Z. (2025). Plant leaf disease detection and classification using convolution neural networks model: A review. Artif. Intell. Rev. 58, 322. doi: 10.1007/s10462-025-11234-6

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520. doi: 10.1109/CVPR.2018.00474

Shafik, W., Tufail, A., De Silva, L. C., Haji Mohd Apong, R. A., and Kim, K. (2025). Deep learning technique for plant disease classification and pest detection and model explainability elevating agricultural sustainability. BMC Plant Biol. 25, 1491. doi: 10.1186/s12870-025-07377-x

Song, W., Hao, L., Hao, G., Hao, Q., Xu, Y., and Cui, L. (2024). Deformable object detection network for lightweight cucumber leaf disease detection. Proc. CCF Conf. Comput. Supported Cooperative Work Soc. Computing. 2344, 255–265. doi: 10.1007/978-981-96-2376-1_19

Spisni, E., Valerii, M. C., De Fazio, L., Rotondo, E., Di Natale, M., Giovanardi, E., et al. (2020). A khorasan wheat-based diet improves systemic inflammatory profile in semi-professional basketball players: A randomized crossover pilot study. J. Sci. Food Agric. 100, 4101–4107. doi: 10.1002/jsfa.9947

Sun, H., Li, X., Li, X., Wang, X., Cheng, Z., Al-Absi, M. A., et al. (2025). A multi-scale detection model for tomato leaf diseases with small target detection head. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1598534

Tang, Y., Han, K., Guo, J., Xu, C., Xu, C., and Wang, Y. (2022). Ghostnetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 35, 9969–9982.

Wang, H., Frery, A. C., Li, M., and Ren, P. (2023). Underwater image enhancement via histogram similarity-oriented color compensation complemented by multiple attribute adjustment. Intelligent Mar. Technol. Syst. 1, 12. doi: 10.1007/s44295-023-00015-y

Wang, H., Zhang, W., Xu, Y., Li, H., and Ren, P. (2026). Watercyclediffusion: Visual–textual fusion empowered underwater image enhancement. Inf. Fusion 127, 103693. doi: 10.1016/j.inffus.2025.103693

Wu, L., Zheng, Z., Qi, L., Ma, X., Liang, Z., and Chen, G. (2014). Field detection method of rice leaf blast lesions based on image processing. Res. Agric. Mechanization. 1, 32–35. doi: 10.3969/j.issn.1003-188X.2014.09.007

Yang, J., Liu, S., Wu, J., Su, X., Hai, N., and Huang, X. (2025). “Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 9202–9210. doi: 10.48550/arXiv.2412.16986

Zhang, X. (2025). Pg-detr: A lightweight and efficient detection transformer for early stage pomegranate fruit detection. IEEE Access. 13, 155547–155559. doi: 10.1109/ACCESS.2025.3605887

Zhang, W., Cong, M., and Wang, L. (2003). “Algorithms for optical weak small targets detection and tracking,” in Proceedings of the 2003 International Conference on Neural Networks and Signal Processing, Vol. 1. 643–647 (IEEE).

Zhang, Y., Song, J., Yu, X., and Ji, X. (2025b). Wmc-rtdetr: A lightweight tea disease detection model. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1574920

Zhang, W., Sun, X., Zhou, L., Xie, X., Zhao, W., Liang, Z., et al. (2023). Dual-branch collaborative learning network for crop disease identification. Front. Plant Sci. 14. doi: 10.3389/fpls.2023.1117478

Zhang, C., Zhang, T., and Shang, G. (2025a). Mavm-unet: Multiscale aggregated vision mambau-net for field rice pest detection. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1635310

Zhao, J.-X., Liu, J.-J., Fan, D.-P., Cao, Y., Yang, J., and Cheng, M.-M. (2019). “Egnet: Edge guidance network for salient object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. 8779–8788. doi: 10.1109/ICCV.2019.00887

Zheng, J., and Yang, Q. (2024). “Contextual boundary aware network for salient object detection,” in Proceedings of the 2024 7th International Conference on Image and Graphics Processing. 19–24. doi: 10.1145/3647649.3647653

Keywords: deep learning, edge feature enhancement, eggplant disease detection, multi-scale attention, real-time detection

Citation: Sun H, Fan W, Zhang J, Feng M, Wang F and Fu R (2026) DFSNet: directional feature aggregation and shape-aware supervision for eggplant pest and disease detection. Front. Plant Sci. 17:1775987. doi: 10.3389/fpls.2026.1775987

Received: 26 December 2025; Revised: 17 January 2026; Accepted: 20 January 2026;
Published: 09 February 2026.

Edited by:

Xiao Ming Zhang, Yunnan Agricultural University, China

Reviewed by:

Hao Wang, Laoshan National Laboratory, China
Yulong Nan, Yancheng Institute of Technology, China

Copyright © 2026 Sun, Fan, Zhang, Feng, Wang and Fu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rui Fu, furui19891209@wfust.edu.cn
