- 1 College of Information Engineering, Tarim University, Alaer, China
- 2 Key Laboratory of Tarim Oasis Agriculture, Ministry of Education, Tarim University, Alaer, China
Walnut leaf brown spot disease, caused by Ophiognomonia leptostyla, is among the most destructive fungal diseases in walnut cultivation. In the development of smart agriculture, precision grading of plant diseases remains a core technical challenge; this disease in particular is plagued by blurred lesion edges and inefficient extraction of complex features, which directly limits accurate grading. To address these issues, this study proposes a disease grading method integrating hierarchical feature selection and adaptive multi-scale dilated convolution, and develops the CogFuse-MobileViT model. This model overcomes the limitations of the standard MobileViTv3 model in capturing the blurred edges of tiny lesions via three innovative modules: the Hierarchical Feature Selection Module (HFSM) enables hierarchical selection of disease-related features; the Edge Convolution Fusion Module (ECFM) works in synergy with the HFSM to enhance the focus on lesion edge features; and the Adaptive Multi-Scale Dilated Dense Inception Convolution Module (AMSDDICM) achieves dynamic multi-scale fusion of lesion textures and global structures. Experimental results demonstrate that the proposed model achieves an accuracy of 86.61% on the test set, representing an improvement of 7.8 percentage points compared with the original MobileViTv3 model and significantly outperforming other mainstream disease grading models. This study confirms that the CogFuse-MobileViT model can effectively resolve the issues of blurred edges and inefficient feature extraction in this disease, provides a reliable technical solution for its precision grading, and holds practical application value for the intelligent diagnosis of plant diseases in smart agriculture.
1 Introduction
Walnut diseases pose a serious threat to walnut production in Xinjiang. The spread of crop diseases significantly exacerbates the security risks of the walnut industry if timely prevention and control measures are not taken (Cooke, 2006; Khan et al., 2021). As a core component of the disease prevention and control system, early precise grading plays a critical role in agricultural production management (Adaskaveg et al., 2009; Chiang et al., 2016; Lamichhane, 2014). Walnut brown spot, caused primarily by the fungus Ophiognomonia leptostyla, typically forms specific symptoms on organs such as leaves, flowers, and fruits. Among them, leaves, as the primary carrier of plant diseases, exhibit characteristics such as lesion morphology and color changes on their surfaces, which often serve as important bases for disease grading (Ghaiwat and Arora, 2014; Weber, 1980; Zarei et al., 2019). Traditional disease identification relies on manual experience (Moragrega et al., 2011; Wang et al., 2020), a method that is not only time-consuming but also subjective. Early image-processing studies therefore combined conversion to an alternative color system to reduce interference from light and leaf veins with lesion edge detection using the Sobel operator, ultimately achieving fast and accurate grading based on the ratio of lesion area to leaf area. Jadhav and Patil (2016) developed a leaf image partitioning technique based on k-means clustering and squared Euclidean distance, automatically quantifying damaged leaf area through the pixel ratio of lesions to leaves for disease grading. Arivazhagan et al. (2013) proposed a four-step processing system involving color conversion, green pixel removal, segmentation, and texture feature classification for leaf disease grading. However, traditional methods for acquiring disease information suffer from insufficient segmentation accuracy for plant leaf disease images with complex textures and unclear lesion boundaries, thereby resulting in poor disease severity grading performance.
With the continuous advancement of computational power, deep learning has increasingly become a key technology for addressing complex lesion segmentation, primarily due to its superior capability in automatic feature extraction (Mao et al., 2023). Chen et al. (2021) proposed the BLSNet method based on the UNet semantic segmentation network, enhancing lesion segmentation accuracy through the introduction of attention mechanisms and multi-scale feature fusion. Experiments showed that its segmentation and classification accuracy outperformed benchmark models such as DeepLabv3+ and UNet, preliminarily verifying the reliability of this method in automatic assessment of BLS disease severity. Ngugi et al. (2020) developed the KijaniNet segmentation network based on fully convolutional neural networks, demonstrating excellent performance in tomato leaf segmentation under complex backgrounds; Lin et al. (2019) developed a CNN semantic segmentation model that achieved high-precision pixel-level segmentation of cucumber powdery mildew on leaves. Additionally, some studies have fused traditional features with deep learning: Tripathi (2021) proposed a convolutional neural network method based on AlexNet, achieving disease grading by fusing features extracted by the model with external leaf segmentation features; Ma et al. (2017) introduced Comprehensive Color Features (CCF) combining hyper-red index, HSV/H and Lab/b components, and achieved lesion segmentation via interactive region growing, with experiments confirming that this method enables accurate grading of disease images such as cucumber downy mildew under actual field conditions. However, although such methods have improved grading accuracy to some extent, they still suffer from insufficient feature capture of blurred disease edges, making it difficult to achieve fine-grained differentiation of subtle differences between disease grades (Rastogi et al., 2015).
In recent years, the application of deep learning in plant disease classification has continued to expand. Parashar et al. (2024) systematically validated the capability of complex feature modeling for crop yield prediction, further substantiating the critical role of adaptive feature extraction in agricultural intelligent decision-making. Concurrently, Vishnoi et al. (2022) enhanced small-target detection precision in apple leaf disease identification through a spatial attention mechanism-based CNN architecture. Ozturk et al. (2025) constructed an ensemble learning classification model based on ResNet50, MobileNet, EfficientNetB0, and DenseNet121, enhancing generalization performance through statistical cross-validation and improving decision interpretability via Grad-CAM visualization. Experimental results showed that the ensemble model achieved stable high-precision classification performance across multiple validation rounds. These studies provide robust and interpretable solutions for intelligent plant disease recognition through technical integration and architectural innovation. Hu et al. (2021) integrated Retinex enhancement, Faster R-CNN detection, and VGG16 classification to improve the grading and detection accuracy of blurred diseased leaves. Picon et al. (2019) proposed an adaptive deep residual neural network algorithm for the classification of three European wheat diseases (Septoria leaf blotch, brown spot, and rust) in real-world scenarios, which effectively improved the classification accuracy of wheat diseases. Kim and Ahn (2021) employed deep CNN architectures such as ResNet, Xception, and DenseNet, combined with transfer learning and fine-tuning, to classify 9 categories of pests, diseases, and healthy states in tomatoes. Although DenseNet combined with the RMSprop algorithm achieved an accuracy of 98.63%, single CNN architectures have clear bottlenecks in handling complex plant lesions and overlapping multiple diseases, including insufficient specificity in feature extraction and limited ability to distinguish subtle differences between lesions. Shi et al. (2023) analyzed the application of CNNs in evaluating the severity of plant diseases, pointing out that hybrid architectures such as classical CNNs, improved CNNs, and segmentation networks have limitations in handling complex plant lesions: classification errors caused by concurrent multiple diseases, as well as constraints on model generalization ability due to unbalanced datasets and insufficient annotation accuracy. These issues significantly restrict their precise application in practical agricultural scenarios.
Although significant advancements have been made in plant lesion segmentation and disease classification using deep learning, the grading task for walnut leaf brown spot disease—characterized by blurred edge representation and complex small lesions—still faces notable challenges. Current MobileViT-based hybrid models fall into two categories: general architectures (MobileViTv1/v2/v3) focus on optimizing the CNN-Transformer balance for natural images; domain-improved models (Mobile-Former/EdgeViT) are oriented toward medical or industrial tasks, with their task focus on localization rather than grading. This focus does not match the requirements of severity grading and leaves the blurred boundaries of agricultural lesions, interference from healthy tissue, and morphological variation unaddressed. Thus, the present study proposes a novel CogFuse-MobileViT model specifically designed for disease grading, with the following contributions:
1. A diverse field-collected walnut leaf dataset was constructed, with images divided into four distinct severity levels based on qualitative assessment of disease severity, ensuring the accuracy of disease severity grading.
2. The Hierarchical Feature Selection Module (HFSM) enhances the fusion of local details (such as lesion texture and color) with global context through local and global attention mechanisms and task-driven feature selection, while suppressing interference from healthy regions.
3. The Edge Convolution Fusion Module (ECFM) strengthens edge details through Sobel convolution and residual connections, achieving effective integration of edge-specific features with general features and enabling more precise capture of lesion contours.
4. The Adaptive Multi-Scale Dilated Dense Inception Convolution Module (AMSDDICM) extracts differentiated features using multi-shaped convolution kernels and adaptively fuses multi-scale information via a dynamic weight mechanism, capturing lesion morphologies at different pathogenesis stages.
2 Materials and methods
2.1 Characteristics of walnut leaf brown spot and classification of disease severity levels
In the local standard Technical Regulations for Prevention and Control of Walnut Brown Spot (McGranahan and Leslie, 1991), walnut brown spot is divided into four severity levels. The severity of plant leaf diseases is generally determined using the spot coverage method. In this study, walnut leaves with different disease severities were processed to separate disease-infected spots from healthy leaf areas, and the percentage of lesion area to total leaf area was calculated. With reference to the local standard, the disease grades specified in the standard were re-determined through calculations in this study, as shown in Table 1. Classification of different severity levels of walnut leaf brown spot was conducted in accordance with Technical Regulations for Prevention and Control of Walnut Brown Spot (Wang et al., 2022; Yang et al., 2021).
Walnut leaf brown spot is caused by infection with the fungus Ophiognomonia leptostyla (Mircetich et al., 1980). In the early infection stage, near-circular or irregular small spots appear on the leaves, with a gray-brown center and dark yellow-green to purplish-brown edges, and diseased leaves tend to fall off prematurely. In the middle stage, elongated elliptical or irregular slightly sunken dark brown lesions form, with larger spot size and light brown edges; longitudinal cracks are often present in the center of the lesions. In the late stage, lesions often coalesce to form large scorched necrotic areas, surrounded by yellow to golden-yellow zones, and small black granules (conidiomata and conidia of the pathogen) are scattered on the surface of the diseased tissue (Mircetich et al., 1980). According to the degree of color and texture feature changes in infected leaves, walnut leaf disease is divided into four stages: healthy, early, middle, and late stages, corresponding to disease severity levels: healthy (Level 0), mild (Level 1), moderate (Level 2), and severe (Level 3). The different severity levels of walnut leaf brown spot are shown in Figure 1.

Figure 1. Walnut leaves infected with brown spot disease at different severity levels. (A) Healthy leaves; (B) Mildly infected leaves; (C) Moderately infected leaves; (D) Severely infected leaves.
Table 1 shows the ratio of walnut leaf brown spot lesion area to total leaf area. Level 0 (Healthy): Healthy leaves without symptoms. Level 1 (Mild): Near-circular or irregular small white spots appear on leaves, accounting for less than 5% of the leaf area. Level 2 (Moderate): Lesions expand into irregular dark brown spots, covering 5% to 30% of the leaf area. Level 3 (Severe): Abundant lesions coalesce to form large scorched necrotic areas, exceeding 30% of the leaf area.
2.2 Calculation algorithm for walnut leaf brown spot disease severity levels
After capturing walnut leaf brown spot samples using a camera, to minimize subjective bias, this study strictly adhered to the objective quantitative criteria defined in the Technical Regulations for Prevention and Control of Walnut Brown Spot (Table 1), utilizing the lesion area percentage (K) as the primary grading indicator. The severity levels were re-determined through computational analysis, with the specific determination process shown in Figure 2.

Figure 2. Computational method for different severity levels of brown spot infection in walnut leaves.
First, a Python-based color segmentation algorithm was employed to convert images from the BGR color space to the HSV color space for processing. The HSV color space offers inherent advantages in color segmentation tasks. By analyzing the hue, saturation, and value characteristics of diseased regions, the color threshold range in the HSV space was determined (Chaudhary et al., 2012). After generating an initial disease region mask based on this threshold range, morphological operations such as closing and opening were applied to optimize mask quality, effectively eliminating noise while preserving the integrity of lesion contours. Subsequently, the original image was converted to grayscale mode, and the leaf region was extracted using an adaptive threshold binarization algorithm. Leaf boundaries were localized via contour detection technology, and a complete leaf mask was generated through contour filling. Pixel statistical analysis was performed on both the leaf mask and disease mask to calculate the total leaf area and diseased region area, respectively. When a valid leaf area (>0) was detected, the disease severity index was calculated using the following formula: Disease Severity Index (K) = (Total Area of Diseased Regions / Total Area of the Complete Leaf) × 100%.
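The following minimal Python sketch illustrates this pipeline using OpenCV; the HSV threshold range, structuring-element size, and adaptive-threshold parameters are illustrative assumptions rather than the values used in this study.

```python
# Minimal sketch of the HSV-based severity computation described above.
# Threshold values and kernel sizes are illustrative assumptions.
import cv2
import numpy as np

def severity_index(image_path,
                   lesion_lower=(10, 60, 40),    # assumed HSV lower bound for brown lesions
                   lesion_upper=(30, 255, 200)): # assumed HSV upper bound
    bgr = cv2.imread(image_path)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

    # Initial lesion mask from the HSV threshold range
    lesion_mask = cv2.inRange(hsv, np.array(lesion_lower), np.array(lesion_upper))

    # Morphological closing then opening to remove noise while keeping lesion contours
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    lesion_mask = cv2.morphologyEx(lesion_mask, cv2.MORPH_CLOSE, kernel)
    lesion_mask = cv2.morphologyEx(lesion_mask, cv2.MORPH_OPEN, kernel)

    # Leaf mask: grayscale -> adaptive threshold -> largest contour filled
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 31, 5)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    leaf_mask = np.zeros_like(gray)
    if contours:
        leaf = max(contours, key=cv2.contourArea)
        cv2.drawContours(leaf_mask, [leaf], -1, 255, thickness=cv2.FILLED)

    leaf_area = cv2.countNonZero(leaf_mask)
    lesion_area = cv2.countNonZero(cv2.bitwise_and(lesion_mask, leaf_mask))
    if leaf_area == 0:
        return None
    return 100.0 * lesion_area / leaf_area  # severity index K, in percent
```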
2.3 Dataset construction
Data were collected from the Walnut Plantation Base of Tarim University (Alar, Xinjiang) between May and October 2024. Leaves from three locally dominant early-fruiting cultivars (‘Wen 185’, ‘Xinxin 2’, ‘Zha 343’) widely cultivated in southern Xinjiang were vertically photographed using an iPhone 13. This genotypic diversity ensures model generalizability by encompassing varied disease phenotypes. The orchard follows standardized cultivation with 4 m plant spacing, 5 m row spacing (≈50 plants/mu), and conventional management practices. Sampled trees were in full fruiting stage. To capture disease traits across microenvironments, samples were collected from safely accessible crown layers. The dataset includes healthy and brown spot-infected leaves at three severity levels, spanning varied time periods, light conditions, and angles. After removing duplicates and invalid images, 5,120 high-quality JPGs were retained for analysis.
2.3.1 Development of the walnut leaf brown spot dataset
Disease severity grading was established through quantitative lesion area analysis (Figure 2). A representative subset of 512 images (10% of the full 5,120-image dataset) was selected via stratified random sampling, accurately preserving the original severity distribution. To ensure labeling rigor, two plant protection specialists (>5 years’ expertise in walnut pathology, certified in local protocols) executed standardized annotation: Pre-annotation training unified diagnostic criteria for ambiguous cases, with formal annotation commencing only after achieving pre-calibration inter-rater agreement (Kappa coefficient ≥0.75). As validated in Table 2, dual statistical metrics confirmed gold-standard reliability—inter-rater Cohen’s Kappa reached 0.71, meeting Landis & Koch’s “substantial agreement” threshold; algorithm-expert consensus attained a weighted Fleiss’ Kappa of 0.77, substantially outperforming conventional Cohen’s Kappa in multi-annotator scenarios. A three-stage standardization pipeline eliminated preprocessing discrepancies. The final dataset, partitioned into training/testing sets (8:2 ratio), strictly adheres to plant disease survey protocols and provides benchmarked data for deep learning model development (Table 3).
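For reference, the agreement statistics described above can be computed as in the following sketch. The label arrays are placeholders, the quadratic weighting for Cohen's Kappa is an assumption suited to ordinal grades, and the standard (unweighted) Fleiss' Kappa from statsmodels is used here in place of the weighted variant reported in Table 2.

```python
# Sketch of inter-rater and multi-annotator agreement on the 512-image subset.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

expert_a = np.random.randint(0, 4, 512)   # placeholder labels from expert A (Levels 0-3)
expert_b = np.random.randint(0, 4, 512)   # placeholder labels from expert B
algorithm = np.random.randint(0, 4, 512)  # placeholder labels from the area-based algorithm

# Inter-rater agreement between the two experts; quadratic weights suit ordinal grades
kappa_experts = cohen_kappa_score(expert_a, expert_b, weights="quadratic")

# Multi-annotator agreement (experts + algorithm) via Fleiss' kappa
ratings = np.stack([expert_a, expert_b, algorithm], axis=1)  # (subjects, raters)
table, _ = aggregate_raters(ratings)
kappa_consensus = fleiss_kappa(table)

print(f"Expert-expert kappa: {kappa_experts:.2f}, multi-rater kappa: {kappa_consensus:.2f}")
```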
2.4 Construction of walnut leaf brown spot disease severity grading model
2.4.1 The optimized MobileViTv3 network: CogFuse-MobileViT
To address the challenge of difficult feature capture for small lesions in walnut leaf brown spot severity grading, this study proposes an innovative model, CogFuse-MobileViT. MobileViTv3 was selected as the backbone network due to its suitability for walnut brown spot grading. Its hybrid MobileNet-ViT architecture integrates CNN-based local feature extraction essential for lesion detail capture with Transformer-enabled global contextual modeling critical for analyzing lesion spatial distribution. This design aligns with pathological requirements for severity grading, necessitating concurrent attention to local lesion characteristics and global infection patterns. The lightweight architecture further supports real-time processing on embedded devices, enabling future field deployment.
The model embodies a “grading task-driven” design principle. Conceptually, it establishes a disease-specific framework of “hierarchical screening to edge enhancement to dynamic fusion.” This framework elevates general feature extraction to targeted solutions addressing three major agricultural challenges: suppressing interference from healthy tissues via hierarchical screening, resolving feature confusion; strengthening blurred lesion contours through edge enhancement, overcoming the bottleneck of edge blurriness; and adaptively handling multi-stage morphological variations using dynamic fusion. Structurally, as illustrated in Figure 3, the network implements a progressive feature extraction strategy. The Hierarchical Feature Selection Module (HFSM) first fuses shallow and middle-layer features. It employs a hierarchical attention mechanism to enhance semantic consistency while preserving spatial details. The Edge Convolution Fusion Module (ECFM) then processes these features. It utilizes learnable Sobel operators to extract edge information, which is fused with conventional convolutional features via residual connections to augment edge perception capability. Finally, the Adaptive Multi-Scale Dilated Dense Inception Convolution Module (AMSDDICM) enables adaptive processing of multi-scale edge features. This module is capable of both capturing fine edge changes in minute lesions and grasping the overall contour structure of large lesions, thereby comprehensively covering edge features across different developmental stages. Based on these enhanced edge features, the network accurately outputs the final disease severity grade.

Figure 3. CogFuse-MobileViT Architecture Diagram (HFSM suppresses healthy regions through dynamic masks, ECFM explicitly extracts edges via Sobel convolution, and AMSDDICM fuses multi-scale features through dynamic weights).
2.4.2 The Hierarchical Feature Selection Module
The HFSM (Hierarchical Feature Selection Module) is an innovative architecture proposed in this study. Unlike MobileViT’s direct feature concatenation, HFSM utilizes learnable prompt vectors (Equation 1) to generate spatial masks, dynamically suppressing non-lesion regions. Such task-driven selection is crucial for tiny objects. In the grading task of walnut leaf brown spot disease, this module utilizes a local attention mechanism to focus on micro-regions of walnut leaves, fusing shallow and middle-layer features. Meanwhile, the hierarchical feature selection mechanism enables precise capture of local detail features such as lesion texture and color, effectively suppressing interference from healthy leaf regions and enhancing disease features. This provides high-discriminative feature representations for brown spot disease grading while preparing for subsequent lesion edge processing. As depicted in Figure 4, the module processes two hierarchical feature maps. Initial 1×1 convolutions reduce both maps’ channels to half the output dimension, curtailing computational load. One reduced map undergoes bilinear upsampling for spatial alignment with the other. These aligned features are summed and processed by a 3×3 convolution to generate base-path features. Simultaneously, the original reduced map and upsampled map are fed into dual Local-Global Attention modules for parallel processing (Xu, 2024). Each group contains local and global branches. During branch processing, the feature map is partitioned into P×P non-overlapping patches via the Unfold operation. After calculating the mean of pixel features within each patch, the result is processed using the following core formulas to generate the attention distribution.
Equations 1 and 2 convert the mean value of patch features into a high-dimensional feature vector through multi-layer perceptrons (MLP) and layer normalization (LayerNorm), and then generate the attention distribution via the Softmax function. This mechanism enables the model to focus on the key features of lesion regions, weaken the interference from healthy leaf areas, enhance the specificity of feature selection, and provide high-quality features for subsequent precise grading. After the local and global features output by the two modules are concatenated along the channel dimension, they are combined with the base-path features to generate the output features of the module.
For leaf images with over 80% healthy tissue, dynamic masks are generated via learnable prompt vectors to suppress green texture features while enhancing the gray-brown features of lesions (McGranahan and Leslie, 1991).
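A minimal sketch of one Local-Global Attention branch consistent with this description is given below; the patch size, MLP width, and the way the attention weights are applied back to the patches are assumptions, and the published module may differ in detail.

```python
# Sketch of one attention branch: patch means -> MLP + LayerNorm -> Softmax weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchAttentionBranch(nn.Module):
    def __init__(self, channels: int, patch: int = 4, hidden: int = 64):
        super().__init__()
        self.patch = patch
        self.mlp = nn.Sequential(nn.Linear(channels, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, channels))
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.patch
        # Partition into non-overlapping P x P patches and take the per-patch mean
        patches = F.unfold(x, kernel_size=p, stride=p)          # (B, C*p*p, N)
        patches = patches.view(b, c, p * p, -1)                  # (B, C, p*p, N)
        patch_mean = patches.mean(dim=2).transpose(1, 2)         # (B, N, C)
        # MLP + LayerNorm, then Softmax over patches -> attention distribution
        attn = torch.softmax(self.norm(self.mlp(patch_mean)), dim=1)   # (B, N, C)
        # Re-weight each patch by its attention and fold back to the feature map
        weighted = patches * attn.transpose(1, 2).unsqueeze(2)         # (B, C, p*p, N)
        out = F.fold(weighted.view(b, c * p * p, -1),
                     output_size=(h, w), kernel_size=p, stride=p)
        return out
```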
2.4.3 ECFM Edge Convolution Fusion Module
In the walnut leaf brown spot grading task, lesion edge features are crucial for accurately determining disease severity. The standard MobileViT relies on CNN-Transformer blocks to implicitly learn edges, whereas the ECFM (Edge Convolution Fusion Module) incorporates Sobel convolutions, conventional convolutions, and residual connections, effectively fusing edge features with general features and thereby enhancing edge details. As shown in Figure 5, the Sobel branch employs Sobel convolution to extract image edge information, accurately capturing the contour details of brown spot lesions; the convolutional branch captures general features such as leaf color and texture through standard convolution. The two realize feature fusion via the following core formula (Equation 3):
Processing the input feature map through the Sobel convolution operator (SobelConv) specifically extracts edge information of lesions (such as the boundary contours between lesions and healthy tissues, and the edge textures of small lesions). For the commonly seen blurred edges in brown spot disease, Sobel convolution can enhance edge gradient changes, making the originally blurred lesion contours clearer, thus providing key edge feature support for subsequent grading. Conventional feature extraction and fusion (Equation 4):
Extracting the overall features of leaves (such as the color distribution of lesions and the overall texture of leaves) through standard convolution forms a complement to edge features, preventing the model from focusing only on local edges while ignoring global lesion information.
Concatenating the edge feature S and the conventional feature C along the channel dimension achieves the initial fusion of edge details and global features. This fusion enables the model to both identify the fine contours of lesions and judge the disease condition by combining the overall color and texture changes of lesions, thereby improving the accuracy of grading. The module also introduces feature addition and subsequent convolution operations: the fused features are first processed by the first convolution layer, the resulting features are added to the original input, and the final output is generated through the second convolution layer. In this process, the residual connection retains the original feature information, further strengthens the integration of edge features and conventional features, and ensures the effective transmission and enhancement of multi-level features.
The edge gradient information extracted by Sobel convolution (such as grayscale differences between diseased spots and healthy tissues) and the texture features from conventional convolution (such as the roughness of necrotic areas) are fused via residual connections. This not only preserves the blurred edges of early-stage lesions but also prevents edge features from being disconnected from the overall texture (Ghaiwat and Arora, 2014).
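The fusion path described above can be sketched as follows; the Sobel kernels initialize a depthwise convolution, while the channel widths and whether the Sobel weights remain learnable are assumptions.

```python
# Minimal sketch of the ECFM fusion path (Sobel branch + conventional branch + residual).
import torch
import torch.nn as nn

class ECFM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise Sobel convolution: horizontal + vertical gradients per channel
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        self.sobel = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                               groups=channels, bias=False)
        weight = torch.stack([gx, gy]).repeat(channels, 1, 1).unsqueeze(1)  # (2C, 1, 3, 3)
        self.sobel.weight = nn.Parameter(weight)       # Sobel initialisation (assumed learnable)
        self.edge_proj = nn.Conv2d(2 * channels, channels, 1)

        self.conv_branch = nn.Conv2d(channels, channels, 3, padding=1)  # general features
        self.fuse1 = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.fuse2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.edge_proj(self.sobel(x))      # edge features S (lesion contours)
        c = self.conv_branch(x)                # conventional features C (colour, texture)
        fused = self.fuse1(torch.cat([s, c], dim=1))
        return self.fuse2(fused + x)           # residual connection keeps original features
```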
2.4.4 AMSDDICM Adaptive Multi-Scale Dilated Dense Inception Convolution Module
Unlike MobileViT’s single-scale convolution, this module leverages multi-shaped convolution kernels to accurately extract lesion features of circular, irregular, and other shapes from multiple dimensions, enabling the identification of subtle differences in lesion edges and internal colors to enrich feature dimensions. Meanwhile, depthwise separable convolutions are employed to accommodate lesion scales at different stages of disease development. Specifically, the input features are first split into two groups, each entering an independent DMSconv2d branch to extract features via multi-scale depthwise separable convolutions. A dynamic weighting mechanism (global pooling + convolution + softmax) is used to generate weights for fusing multi-scale features. The processed features from the two groups are concatenated, and finally, cross-channel fusion is completed via 1×1 convolution to output the final optimized features. The entire process integrates multi-scale convolution, dynamic weight allocation, and feature fusion to enhance the model’s ability to capture complex features, as shown in Figure 6. In response to the morphological differences between early-stage small spots (1–3 mm in diameter) and late-stage fused spots (over 20 mm in diameter), the dynamic weight mechanism enables the model to adaptively allocate the contributions of 3×3 kernels for capturing local textures and 11×1 kernels for extracting strip-shaped spreading features. This addresses the limitation of fixed weights in traditional Inception architectures in adapting to multi-scale morphologies.

Figure 6. Architecture diagram of the Adaptive Multi-Scale Dilated Dense Inception Convolution Module (AMSDDICM).
When the input feature tensor X with the number of channels c enters the AMSDDICM module, it first undergoes a feature grouping operation: the input features are evenly split into two groups along the channel dimension, with each group containing half of the original number of channels (c//2). This grouping strategy not only reduces computational complexity but also creates conditions for subsequent multi-scale feature extraction. Each feature group is fed into an independent DMSconv2d module, which adopts different configurations such as 3×3, 5×5 square convolution kernels and 1×11, 11×1 strip-shaped convolution kernels — the square kernels capture local textures, while the strip-shaped kernels extract direction-sensitive long-range dependencies. This dynamic weight allocation draws inspiration from adaptive strategies in agricultural IoT systems: similar to how salinity-aware ETs models prioritize soil conductivity features under salt stress (Zhang et al., 2023), our mechanism selectively amplifies lesion morphological features critical for severity grading. The dynamic weighting mechanism of DMSconv2d generates weights through global average pooling and 1×1 convolution, with the core formulas as follows:
Weighted basic feature generation:
Equation 5 compresses the spatial dimensions of the input features through global average pooling (AvgPool2d), retains global statistical information (such as the overall distribution characteristics of lesions at different scales), and then transforms the dimensions via 1×1 convolution to generate base features for calculating dynamic weights. This step provides a basis for subsequent weight allocation, enabling the model to preliminarily evaluate the importance of features at different scales. Weight normalization:
Equation 6 normalizes the base features via the Softmax function, converting them into dynamic weights that sum to one across the convolution branches. This normalization determines the relative contribution of each scale, so that the kernel shapes most informative for the current lesion morphology receive larger weights. Weighted multi-scale feature fusion:
Equation 7 performs weighted fusion of the features extracted by the different convolution kernels according to the dynamic weights. For walnut brown spot lesions exhibiting size heterogeneity and morphological complexity—ranging from early-stage circular micro-lesions to late-stage coalesced irregular lesions—this mechanism adaptively modulates multi-scale feature contributions. It prioritizes retention of morphology-discriminative features, local textures in small lesions or directional distributions in large lesions, thereby comprehensively capturing pathological characteristics across developmental stages. The processed features undergo channel-wise concatenation to generate optimized outputs (Mircetich et al., 1980).
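A minimal sketch of this dynamic weighting and grouping scheme is given below; the kernel shapes follow the text (3×3, 5×5, 1×11, 11×1), while the branch structure and projection widths are assumptions.

```python
# Sketch of the dynamic multi-scale weighting in AMSDDICM (Equations 5-7).
import torch
import torch.nn as nn

def dw_separable(ch, k):
    """Depthwise separable convolution with an arbitrary kernel shape."""
    pad = (k[0] // 2, k[1] // 2)
    return nn.Sequential(nn.Conv2d(ch, ch, k, padding=pad, groups=ch, bias=False),
                         nn.Conv2d(ch, ch, 1, bias=False))

class DMSConv2d(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        kernels = [(3, 3), (5, 5), (1, 11), (11, 1)]
        self.branches = nn.ModuleList([dw_separable(ch, k) for k in kernels])
        self.pool = nn.AdaptiveAvgPool2d(1)                  # Eq. 5: global statistics
        self.weight_conv = nn.Conv2d(ch, len(kernels), 1)    # Eq. 5: base weight features

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, H, W)
        w = torch.softmax(self.weight_conv(self.pool(x)), dim=1)    # Eq. 6: normalised weights
        return (feats * w.unsqueeze(2)).sum(dim=1)                  # Eq. 7: weighted fusion

class AMSDDICM(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.g1, self.g2 = DMSConv2d(ch // 2), DMSConv2d(ch // 2)
        self.out = nn.Conv2d(ch, ch, 1)                       # cross-channel fusion

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)                       # split into two channel groups
        return self.out(torch.cat([self.g1(a), self.g2(b)], dim=1))
```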
2.5 Experimental process for severity grading of walnut leaf brown spot disease
Figure 7 is the overall flow chart of severity grading of walnut leaf brown spot disease. First is the image acquisition stage, where raw images of walnut leaves are captured and collected to construct an image database containing pictures to be classified (Singh et al., 2019). Next is the image preprocessing stage, which involves sequentially calculating and classifying disease severity levels, establishing image labels, and performing image preprocessing operations. Subsequently, the processed data are used to construct the training dataset. Finally, in the model training and performance evaluation stage, the preprocessed images are input into the model for classifying walnut leaf brown spot disease, after which performance evaluation is conducted on the model’s classification results to determine the model’s effectiveness and accuracy.
2.6 Experimental parameters and evaluation metrics
2.6.1 Test environment and hyperparameter setting
The experimental setup of this study is based on deep learning technology and leverages high-performance computing resources. All experiments were conducted on the Windows 10 operating system. The hardware platform consists of an AMD Ryzen 7 3700X processor and an NVIDIA RTX 2080Ti graphics card. The software environment was built using Python 3.8, the PyTorch 1.13.0 deep learning framework, and the CUDA 11.3 parallel computing platform. PyTorch was selected for its flexibility and wide adoption in research and development.
During model training, multiple adjustments were made to the hyperparameters to compare test results and select the optimal hyperparameter combination. The model accepts input images sized at 224×224 pixels, with a configured batch size of 32 during training and a maximum of 100 training epochs. The optimizer employs an initial learning rate of 0.003.
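A minimal training-setup sketch reflecting these hyperparameters is shown below; the data path, optimizer type, and the stand-in backbone are assumptions, since only the input size, batch size, epoch count, and initial learning rate are specified above.

```python
# Sketch of the training configuration (224x224 input, batch size 32,
# 100 epochs, initial learning rate 0.003); other details are assumptions.
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import mobilenet_v3_small

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("data/train", transform=transform)   # hypothetical path
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

model = mobilenet_v3_small(num_classes=4)          # stand-in for CogFuse-MobileViT
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003, momentum=0.9)  # assumed optimizer type

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```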
2.6.2 Evaluation metrics
To evaluate the performance of the model and verify the effectiveness of the improvement measures, we selected multiple evaluation metrics, including precision (P), recall (R), and related measures (Equations 8–16); all subsequent experimental results are reported as averages over 5-fold cross-validation so that model performance is evaluated comprehensively and reliably. These metrics can be calculated using the following formulas.
Each metric is finally reported as the arithmetic mean of its values across the five folds.
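As a reference, the sketch below computes these metrics with their standard definitions and averages them over the five folds; macro averaging and the placeholder predictions are assumptions.

```python
# Sketch of the evaluation metrics and their 5-fold averaging.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def fold_metrics(y_true, y_pred):
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "accuracy":  accuracy_score(y_true, y_pred),
    }

# Average each metric over the five cross-validation folds (placeholder predictions)
folds = [fold_metrics(np.random.randint(0, 4, 100), np.random.randint(0, 4, 100))
         for _ in range(5)]
mean_metrics = {k: float(np.mean([f[k] for f in folds])) for k in folds[0]}
print(mean_metrics)
```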
3 Experiments and results analysis
3.1 Core module design and validity experimental verification
3.1.1 Comparative test of necessity of HFSM module
To validate the necessity of the Hierarchical Feature Selection Module (HFSM) within the CogFuse-MobileViT framework, this study conducted comparative tests against two mainstream lightweight attention modules: SE (Squeeze-and-Excitation) and CBAM (Convolutional Block Attention Module). As illustrated in Table 4, the HFSM module achieves a significantly higher accuracy of 86.61%, outperforming the SE module and CBAM module by 7.38% and 4.46%, respectively. This demonstrates that its hierarchical feature selection mechanism more effectively captures discriminative features. In terms of computational efficiency, the parameters and FLOPs of HFSM remain comparable with those of the SE module and CBAM module, indicating that its performance breakthrough stems from innovative structural design under equivalent lightweight constraints. Comprehensive results confirm that replacing HFSM with SE or CBAM modules would incur a performance degradation exceeding 4%, thereby validating the indispensable value of HFSM in enabling efficient feature selection within the CogFuse-MobileViT framework.
3.1.2 Convolutional kernel selection in AMSDDICM
Rigorous validation via kernel-combination ablation studies (Table 5) demonstrates that the hybrid configuration (3×3 + 5×5 + 1×11) significantly outperformed square-only kernels (3×3 + 5×5), with an accuracy of 86.61% versus 82.15%, and with stage-specific recall gains of 9.4 percentage points for mid-stage lesions (Level 2) and 6.1 percentage points for late-stage lesions (Level 3). This empirically validates the necessity of 1×11 rectangular kernels for capturing linear pathological features, overcoming the limitation of isotropic kernels in detecting anisotropic structures.
3.1.3 The impact of new modules on computational complexity
Figure 8 shows that the HFSM module incurs a 169% FLOPs increase to achieve a breakthrough improvement in early-stage lesion detection—reducing Level 0/1 misclassification by 24%, thereby establishing the pathological foundation for severity grading. The ECFM module contributes a mere 3% FLOPs increment yet drives a 3.68-percentage-point accuracy gain through enhanced edge feature representation. With only a 0.028 G FLOPs overhead (1.3%), the AMSDDICM module enables adaptive fusion of multi-scale pathological deformations, culminating in the overall 7.80-percentage-point accuracy leap. These modules form a cascaded optimization paradigm: HFSM’s substantial cost resolves the core pathological bottleneck, while subsequent modules deliver superlinear returns—harvesting a 6.85-percentage-point accuracy gain with just 14% additional FLOPs—collectively establishing the globally optimal computation-performance equilibrium.

Figure 8. Comparison of the impact of each newly added module on computational complexity. (A) Accuracy (%); (B) FLOPs (G); (C) Params (M). Measured on an NVIDIA RTX 2080Ti GPU (PyTorch 1.13.0, CUDA 11.3) with 224×224 input.
3.1.4 Comparison of the influence of different module fusion on model performance
To validate the effect of module fusion, Table 6 compares model performances with different combinations. When only HFSM (Hierarchical Feature Selection Module) is introduced, precision increases from the baseline of 82.23% to 83.77%, but recall decreases by 3 percentage points to 72.31%, causing the F1 score to slightly decline to 78.04%. Accuracy rises to 82.15%, indicating improved overall classification correctness of the model, but with potential risk of missed detections.
When HFSM and ECFM (Edge Convolution Fusion Module) are synergistically introduced, precision increases to 86.34%, recall recovers to 74.69%, and both F1 score and accuracy are significantly optimized. HFSM fuses shallow and middle-layer features through a hierarchical attention mechanism, laying the foundation for subsequent edge feature processing; ECFM enhances lesion edge features. The combination of the two effectively improves the integrity of feature representation and the clarity of lesion boundaries.
The combination of HFSM and AMSDDICM (Adaptive Multi-Scale Dilated Dense Inception Convolution Module) further pushes recall to 76.24%, accuracy to 83.94%, and F1 score to 80.79%, outperforming the HFSM+ECFM combination. AMSDDICM compensates for HFSM’s deficiency in local feature refinement through attention-guided multi-scale detail fusion, which integrates multi-layer detail feature weights, making it especially suitable for scenarios with small targets or blurred features.
When the three modules work synergistically, all indicators reach optimal levels: precision, recall, F1 score, and accuracy increase by 12.19%, 7.67%, 9.62%, and 10.00% respectively compared with the baseline. Among them, HFSM lays the foundation for cross-layer feature fusion, ECFM effectively integrates edge-specific features with general features to enhance image edge information, and AMSDDICM adaptively fuses multi-scale and multi-type features through an attention mechanism, forming a progressive optimization chain from “hierarchical feature extraction” to “edge semantic enhancement” and then to “multi-layer detail fusion”. The experimental results show a balanced improvement in both precision and recall, indicating that the fusion of the three modules enables the model to focus more on learning the features of small brown spot lesions, thereby improving its classification performance. This synergistic task-specific design, combining hierarchical selection, explicit edge enhancement, and adaptive multi-scale fusion, fundamentally distinguishes CogFuse-MobileViT from prior MobileViT hybrids designed for general vision or localization tasks.
3.1.5 Influence of different module combinations on F1-scores of Levels 0–3
To verify the targeted improvement of each module for specific disease levels, we conducted the following comparative experiments, as shown in Table 6. When HFSM is introduced alone, the F1-score of Level 0 increases from 83.10% to 85.60%, effectively reducing feature confusion between healthy leaves and early-stage lesions. After adding ECFM, the F1-score of Level 1 rises from 73.20% (with HFSM alone) to 79.80%, significantly mitigating the recognition bias caused by blurred edges of small lesions. AMSDDICM increases the F1-score of Level 3 from 80.37% to 85.54%, enhancing the adaptability to the morphology of large-scale fused scorched areas. With the collaboration of the three modules, the F1-scores of Levels 0–3 reach 93.99%, 82.57%, 84.28%, and 85.54% respectively, achieving balanced optimization of performance across all levels.
3.2 Results comparison of different algorithms and statistical significance verification
3.2.1 Comparison of grading results for different classification models
To verify the effectiveness of the CogFuse-MobileViT model, we selected 9 commonly used classification models to compare with the optimized CogFuse-MobileViT model, and the results are the average of 5-fold cross-validation over 120 epochs. As shown in Table 7, the proposed CogFuse-MobileViT model achieved the highest grading performance in terms of Precision, Recall, and F1-score for identifying walnut leaf brown spot disease at different severity levels among all compared models. The improved CogFuse-MobileViT model exhibits an accuracy of 86.61%, representing a 7.80-percentage-point improvement over the original model. Overall, the experimental results highlight that the enhanced CogFuse-MobileViT model is more conducive to focusing on lesion edge details and accurately learning the features of different severity levels of walnut leaf brown spot disease, thereby improving the model’s classification performance.
3.2.2 Comparison of performance and reliability validation of different algorithms
In model training, to quantify the reliability of results and avoid random biases from single experiments, this study employs 5-fold cross-validation to generate 95% confidence intervals, as shown in Figure 9. The results demonstrate that CogFuse-MobileViT takes a significant lead with an accuracy of 86.61% (95% CI: [85.24%, 87.89%]). Its confidence interval is also the narrowest, indicating the strongest generalization capability. Among the comparative models, MobileViTv3 has a 95% CI of [77.20%, 80.42%]; its lower bound is significantly lower than that of CogFuse-MobileViT, confirming that the 7.8-percentage-point performance improvement is not due to random fluctuations. Furthermore, the confidence intervals constructed through 5-fold cross-validation further highlight the reliability of the experimental results.
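Such intervals can be reproduced from per-fold accuracies as in the sketch below, assuming a t-distribution interval; the fold accuracies shown are placeholders, not the study's values.

```python
# Sketch of a 95% confidence interval from 5-fold accuracies.
import numpy as np
from scipy import stats

fold_acc = np.array([0.8552, 0.8601, 0.8689, 0.8643, 0.8620])  # placeholder fold accuracies
mean = fold_acc.mean()
sem = stats.sem(fold_acc)                                # standard error of the mean
low, high = stats.t.interval(0.95, len(fold_acc) - 1, loc=mean, scale=sem)
print(f"accuracy {mean:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```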

Figure 9. Confidence intervals of accuracy for different classification models in 5-fold cross-validation.
3.2.3 Statistical significance verification of model improvement
To statistically evaluate the performance improvement of CogFuse-MobileViT over the baseline MobileViTv3, an independent samples t-test was conducted (Table 8). For each model architecture, five independent training runs were performed. The null hypothesis stated that the mean accuracy of CogFuse-MobileViT was equal to that of MobileViTv3, while the alternative hypothesis stated that CogFuse-MobileViT had a higher mean accuracy. CogFuse-MobileViT achieved a higher mean accuracy, significantly outperforming MobileViTv3, and the independent samples t-test confirmed this improvement (t(8) = 18.92, p = 3.7 × 10⁻⁸).
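The test can be reproduced as in the following sketch; the per-run accuracies are placeholders, not the study's values.

```python
# Sketch of the one-sided independent-samples t-test over five runs per model.
import numpy as np
from scipy import stats

acc_cogfuse = np.array([0.8655, 0.8672, 0.8641, 0.8668, 0.8669])   # placeholder runs
acc_baseline = np.array([0.7872, 0.7890, 0.7881, 0.7865, 0.7897])  # placeholder runs

# H1: CogFuse-MobileViT has a higher mean accuracy than MobileViTv3
t_stat, p_value = stats.ttest_ind(acc_cogfuse, acc_baseline, alternative="greater")
print(f"t({len(acc_cogfuse) + len(acc_baseline) - 2}) = {t_stat:.2f}, p = {p_value:.2e}")
```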

Table 8. Comparison of accuracy results between MobileViTv3 and CogFuse-MobileViT using independent samples t-test.
Under 5-fold cross-validation (Figure 10), CogFuse-MobileViT achieves rapid convergence within the first 20 epochs, demonstrating more efficient feature learning capability. Upon entering the steady-state phase, its validation loss is reduced by over 5-fold compared to the baseline MobileViTv3, directly confirming smaller prediction errors and superior generalization capability. Meanwhile, the smooth and minimally fluctuating loss curve of CogFuse-MobileViT reflects strong robustness against data noise and distribution variations. Combined with the previous statistical findings from accuracy confidence intervals and independent t-tests, these results collectively validate the statistical significance of the improved model.

Figure 10. Comparison of loss curves between the CogFuse-MobileViT model and the original model in 5-fold cross-validation within 120 epochs.
3.3 Model result analysis performance comparison of different models
Figure 11A shows that in the walnut leaf brown spot disease severity grading task, the classification accuracy of all models gradually improved and stabilized with the increase of training epochs, indicating that the models continuously optimized during learning until convergence. CogFuse-MobileViT stood out prominently: it achieved rapid accuracy improvement, reached a high level in relatively early training epochs with minimal subsequent fluctuations, and stably maintained the highest accuracy, demonstrating fast convergence and strong generalization ability to efficiently extract features distinguishing different disease levels. DenseNet also maintained high accuracy in the late training stage, but its accuracy improvement was slower, with slightly inferior convergence speed and final stability compared to CogFuse-MobileViT. Models like ResNet and RegNet showed limited accuracy gains with gentle upward trends, and their final stable accuracy values were significantly lower, reflecting insufficient feature extraction capabilities possibly due to network architecture or parameter optimization efficiency.

Figure 11. Comparison of Accuracy and loss across different Models (A) accuracy Line chart; (B) loss line chart.
In Figure 11B, all models showed high initial loss values that dropped rapidly and then stabilized, following the typical learning process of iterative optimization. CogFuse-MobileViT achieved significant early loss decline and stabilized near the minimum, indicating efficient feature learning and strong fitting capability. While DenseNet also reached a relatively low loss level, it remained slightly higher than CogFuse-MobileViT. Other models had higher late-stage loss values: some showed fast initial decline but converged to higher levels, indicating shortcomings in capturing key disease features.
3.4 Analysis of detection results for different classification models
The confusion matrix is a key indicator for evaluating classification models: the higher the diagonal values, the higher the classification accuracy for the corresponding category, and the lower the off-diagonal values, the fewer misjudgments. Figure 12 shows the classification results of different models for the four disease grades (0–3) of walnut leaf brown spot disease. It is clearly evident that the overall classification performance, particularly the accuracy of CogFuse-MobileViT across all categories, is remarkably high with minimal misjudgments, demonstrating that the model has enhanced discrimination ability for disease grades and performs optimally in distinguishing the four disease levels.

Figure 12. Confusion matrices of the improved cogfuse-mobilevit model and traditional classification models (A) DenseNet; (B) EfficientNet; (C) EfficientNetV2; (D) Swin Transformer; (E) MobileViTv3; (F) CogFuse-MobileViT.
The Edge Convolution Fusion Module (ECFM) effectively addressed Level 0 and Level 1 misclassification by capturing incipient lesion features. Baseline analysis revealed substantial false negatives for early-stage lesions, with misclassification of Level 1 as Level 0 reaching 32%. ECFM reduced this rate to 8%, demonstrating its capacity to resolve texture confusion through enhanced edge contour delineation. Similarly, the Adaptive Multi-Scale Dilated Dense Inception Convolution Module (AMSDDICM) significantly mitigated Level 2 and Level 3 mutual misclassification. AMSDDICM drastically reduced both errors by enhancing local fine-grained features in mid-stage lesions (Level 2) while strengthening global fusion features in late-stage coalesced lesions (Level 3). This resolves grading ambiguity arising from the morphological continuum of lesion progression.
To establish a comprehensive performance evaluation framework and quantify the model’s overall discriminative capability, Figure 13 presents the Receiver Operating Characteristic (ROC) curves of CogFuse-MobileViT across four severity levels. These curves compare the True Positive Rate (TPR) against the False Positive Rate (FPR) by dynamically adjusting the classification threshold. The Area Under the Curve (AUC) serves as the primary performance metric, where a higher value indicates stronger classification ability. Notably, the AUC values for all categories significantly exceed the level of random guessing (AUC=0.5), fully demonstrating that the model possesses robust discriminative capability across different severity levels.
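The per-level curves can be computed in a one-vs-rest fashion as sketched below; the labels and softmax scores are placeholders.

```python
# Sketch of one-vs-rest ROC/AUC computation across the four severity levels.
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

y_true = np.random.randint(0, 4, 200)            # placeholder severity labels (Levels 0-3)
y_score = np.random.rand(200, 4)                 # placeholder softmax outputs
y_bin = label_binarize(y_true, classes=[0, 1, 2, 3])

for level in range(4):
    fpr, tpr, _ = roc_curve(y_bin[:, level], y_score[:, level])
    print(f"Level {level}: AUC = {auc(fpr, tpr):.3f}")
```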
3.5 TSNE visualization of features extracted by different models
The TSNE visualization of features extracted from the models is shown in Figures 14A–D. This study analyzed the features of the original model at the 20th, 50th, and 100th training epochs, as well as the improved model at the 100th epoch. After 20 epochs of training, the original model showed a highly discrete feature distribution. Limited by the number of training epochs, the model failed to fully learn discriminative features, resulting in insufficient distinction between categories and significant overlap among different classes. This indicates that under this training intensity, the original model’s ability to capture meaningful patterns was relatively limited. When the original model was trained to 50 epochs, the feature aggregation significantly improved compared with Figure 14A. However, inter-class overlap still existed, suggesting that although prolonged training aided feature learning, the original model’s architecture had inherent defects in achieving clear feature separation. At 100 epochs of training, the original model exhibited more prominent feature clustering. Nevertheless, some regions still lacked clear separation between different categories, indicating that even after prolonged training, the original model faced challenges in maximizing inter-class distance and intra-class compactness. In Figure 14D, the improved model after 100 epochs of training demonstrated a superior feature distribution: each category was tightly clustered, achieving high intra-class compactness, while distinct boundaries between different categories were established, resulting in significant inter-class distance. This suggests that the model improvements effectively enhanced its ability to extract discriminative features, greatly reduced feature ambiguity, and improved feature discriminative power. Compared with the original model, it showed stronger feature discrimination and optimization potential.

Figure 14. TSNE visualization of features extracted by different models. After the three modules work synergistically, the boundaries of feature clustering between Level 0 (healthy) and Level 1 (early-stage) are significantly clearer, which verifies the effectiveness of edge-texture collaborative learning (A) The original model was trained for 20 epochs; (B) The original model was trained for 50 epochs; (C) The original model was trained for 100 epochs; (D) The improved model was trained for 100 epochs.
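Such embeddings can be produced as in the sketch below, assuming features are taken from the model's penultimate layer; the feature matrix and the t-SNE settings are placeholders and assumptions, respectively.

```python
# Sketch of the t-SNE feature visualisation used for Figure 14.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(400, 256)          # placeholder penultimate-layer features
labels = np.random.randint(0, 4, 400)        # severity levels 0-3

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="viridis", s=8)
plt.colorbar(label="severity level")
plt.savefig("tsne_features.png", dpi=200)
```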
3.6 Radar chart for comparison of classification performance between original and improved models
As shown in Figure 15, a radar chart compares the performance of the original model and the improved CogFuse-MobileViT model across multiple evaluation metrics. In terms of accuracy, CogFuse-MobileViT exhibits significantly higher values than the original MobileViTv3 model, indicating that the improved model achieves overall higher classification accuracy. Macro-average precision, which considers the average precision of each category, clearly shows that CogFuse-MobileViT also performs better, meaning it has superior precision across all categories. Meanwhile, CogFuse-MobileViT also demonstrates better macro-average recall. When examining weighted-average precision and weighted-average recall—metrics that account for class imbalance—CogFuse-MobileViT outperforms MobileViTv3 in both, indicating that in practical applications, even with uneven class distribution, the CogFuse-MobileViT model can deliver better performance. These results further demonstrate the model’s reliability and practicality.
3.7 Public data set experiment
To further validate the efficacy and generalizability of the proposed CogFuse-MobileViT model, experiments were conducted on the public dataset AppleLeaf9 (Yang et al., 2022). This dataset comprises healthy apple leaves and eight categories of apple leaf diseases captured in field environments without restrictions on imaging angles or noise interference. The dataset was partitioned into training and test sets at an 8:2 ratio. All images were resized to 224×224 pixels to optimize deep learning model training efficiency. Following the hyperparameter configurations specified in “Experimental Parameters and Evaluation Metrics”, both the baseline MobileViTv3 and CogFuse-MobileViT models were trained and evaluated.
Cross-species validation demonstrates that the CogFuse-MobileViT model delivers exceptional performance on the AppleLeaf9 dataset. As presented in Table 9, the model comprehensively surpasses the baseline MobileViTv3 across all four core metrics: precision increases by 0.16 percentage points, recall achieves a breakthrough improvement of 3.45 percentage points, F1-score rises by 1.18 percentage points, and accuracy elevates by 1.14 percentage points. The synergistic enhancement in both precision and recall signifies that performance gains stem from strengthened feature discriminability rather than metric trade-off compromises. The remarkable 3.45-percentage-point recall gain substantially mitigates leaf disease omission rates, while the holistic advance in F1-score further attests to the model’s robustness. These results collectively validate the generalizability of CogFuse-MobileViT’s core innovations in agricultural fine-grained disease grading, establishing a transferable paradigm for small-target pathology identification across plant species.

Table 9. Experimental results of the original model and CogFuse-MobileViT Model on Appleleaf9 data set.
4 Conclusion
This study addresses the challenges in precise grading of walnut leaf brown spot disease. By adopting dynamic feature filtering, edge gradient reinforcement, and multi-scale morphological adaptation, it effectively resolves three key limitations of existing hybrid architectures and the baseline model MobileViTv3 in plant disease grading: weak capability in capturing blurred edges, poor multi-scale adaptability, and interference from healthy tissues. Experimental results show that the model achieves an 86.61% grading accuracy on a dataset encompassing diverse lighting conditions, cultivars, and lesion stages, representing a 7.80-percentage-point improvement over the baseline MobileViTv3 and outperforming nine state-of-the-art models. At the same time, this study confirmed through strict statistical tests that the performance improvement of CogFuse-MobileViT is statistically significant and stable, providing a reliable method for accurate grading of walnut leaf brown spot disease. The constructed image dataset and proposed grading re-measurement method lay the foundation for accurate grading. This approach provides a new paradigm for small-target disease grading, with core modules transferable to multi-crop disease recognition. Beyond disease classification, context-aware AI frameworks demonstrate significant efficacy in agricultural management (Acharya et al., 2022). IoT systems dynamically adjust fertilization recommendations by integrating real-time soil-crop data, while salinity-corrected evapotranspiration (ETs) models optimize irrigation strategies for saline-alkali soils (Li et al., 2023; Su et al., 2023). Similarly, multimodal fusion techniques enhance reference evapotranspiration (ETo) prediction accuracy, facilitating precision water allocation (Jo et al., 2021; Rustia et al., 2023). These breakthroughs collectively validate the robust capability of adaptive feature processing in complex agricultural environments—a core principle that resonates with our dynamic weighting strategy for disease feature extraction. Future research should therefore integrate cross-modal and cross-domain capabilities to establish multidimensional assessment systems and regional dynamic monitoring frameworks, thereby delivering comprehensive technical solutions for small-target disease control.
5 Discussion and future work
Compared with existing methods, this study overcomes the limitation of traditional deep learning models relying on fixed convolution kernels, enabling adaptive capture of lesion features with multi-shaped convolution kernels to accurately extract edge textures of circular micro-lesions and regional contours of irregular lesions. Although the constructed multi-source dataset covers diverse lighting conditions, cultivars, and lesion development stages, the homogeneous internal features of severe diseases lead to a limited recall rate. Meanwhile, when lesions are severely overlapped or mixed with mechanical damage, insect damage, or other types of injuries, the discriminative accuracy of the model is affected. For ultra-small lesions, the local feature information is too weak and easily overlooked, and the computational efficiency still needs optimization on some hardware platforms. Furthermore, the adaptability of the current model in extreme field scenarios, such as high-density occlusion and compound damage from pests and diseases, remains to be further verified. In these scenarios, the coupling of complex interference factors exacerbates feature confusion, affecting grading reliability.
To address these issues, future work will focus on enhancing the model's performance in complex field environments through several synergistic strategies. These include constructing datasets that couple multiple interference factors, incorporating insect holes, lesions, and soil adhesion, to strengthen robustness against extreme field disturbances; introducing attention-based multi-damage feature decoupling modules and dedicated micro-lesion enhancement modules that combine super-resolution and feature interpolation to improve discriminability in complex scenes and for tiny lesions; optimizing dynamic weight calculations via approximate computation or hardware-friendly reconstruction, while developing lightweight real-time deployment frameworks integrated with UAV near-ground sensing technology to improve efficiency and practical monitoring coverage; and establishing cross-crop pathological transfer learning mechanisms that extend the model to multi-crop disease recognition tasks, with self-supervised pre-training to enhance algorithm universality and the mining of low-contrast lesion features. Through this research, the applicability and practicality of the model in complex field environments are expected to improve further, advancing intelligent plant disease grading toward greater accuracy, efficiency, and universality.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
YW: Writing – original draft. DZ: Writing – original draft. LZ: Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported in part by the Science and Technology Key Project of the Xinjiang Production and Construction Corps (2023AB063), "Development and application demonstration of green technology for online monitoring of fresh fruits and cold chain logistics in Xinjiang."
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Acharya, P., Burgers, T., and Nguyen, K.-D. (2022). Ai-enabled droplet detection and tracking for agricultural spraying systems. Computers and Electronics in Agriculture, 202, 107325. doi: 10.1016/j.compag.2022.107325
Adaskaveg, J., Förster, H., Thompson, D., Driever, G., Connell, J., Buchner, R., et al. (2009). Epidemiology and management of walnut blight. Walnut Res. Rep. 94, 225–256.
Arivazhagan, S., Shebiah, R. N., Ananthi, S., and Varthini, S. V. (2013). Detection of unhealthy region of plant leaves and classification of plant leaf diseases using texture features. Agricultural Engineering International: CIGR Journal, 15, 211–217.
Chaudhary, P., Chaudhari, A. K., Cheeran, A., and Godara, S. (2012). Color transform based approach for disease spot detection on plant leaf. International journal of computer science and telecommunications, 3, 65–70.
Chen, S., Zhang, K., Zhao, Y., Sun, Y., Ban, W., Chen, Y., et al. (2021). An approach for rice bacterial leaf streak disease segmentation and disease severity estimation. Agriculture, 11, 420. doi: 10.3390/agriculture11050420
Chiang, K.-S., Bock, C., El Jarroudi, M., Delfosse, P., Lee, I., and Liu, H. (2016). Effects of rater bias and assessment method on disease severity estimation with regard to hypothesis testing. Plant Pathology, 65, 523–535. doi: 10.1111/ppa.12435
Cooke, B. (2006). “Disease assessment and yield loss,” in The epidemiology of plant diseases (Dordrecht: Springer), 43–80.
Ghaiwat, S. N. and Arora, P. (2014). Detection and classification of plant leaf diseases using image processing techniques: a review. International Journal of Recent Advances in Engineering & Technology, 2, 1–7.
Hu, G., Wang, H., Zhang, Y., and Wan, M. (2021). Detection and severity analysis of tea leaf blight based on deep learning. Computers & Electrical Engineering, 90, 107023. doi: 10.1016/j.compeleceng.2021.107023
Jadhav, S. B. and Patil, S. B. (2016). Grading of soybean leaf disease based on segmented image using k-means clustering. IAES International Journal of Artificial Intelligence, 5, 13–21. doi: 10.11591/ijai.v5.i1.pp13-21
Jo, W. J., Kim, D. S., Sim, H. S., Ahn, S. R., Lee, H. J., Moon, Y. H., et al. (2021). Estimation of evapotranspiration and water requirements of strawberry plants in greenhouses using environmental data. Frontiers in Sustainable Food Systems, 5, 684808. doi: 10.3389/fsufs.2021.684808
Khan, M. A., Ali, M., Shah, M., Mahmood, T., Ahmad, M., Jhanjhi, N., et al. (2021). Machine learning-based detection and classification of walnut fungi diseases. Intelligent Automation & Soft Computing, 30, 771–785. doi: 10.32604/iasc.2021.018039
Kim, S.-K. and Ahn, J.-G. (2021). Tomato crop diseases classification models using deep CNN-based architectures. J Korea Acad-Ind Coop Soc, 22, 7–14. doi: 10.5762/KAIS.2021.22.5.7
Lamichhane, J. R. (2014). Xanthomonas arboricola diseases of stone fruit, almond, and walnut trees: progress toward understanding and management. Plant Disease, 98, 1600–1610. doi: 10.1094/PDIS-08-14-0831-FE
Li, X., Zha, T., Liu, P., Bourque, C. P.-A., Jia, X., and Tian, Y. (2023). Interannual variation in gross ecosystem production and evapotranspiration in a temperate semiarid grassland undergoing vegetation recovery. Agricultural and Forest Meteorology, 341, 109672. doi: 10.1016/j.agrformet.2023.109672
Lin, K., Gong, L., Huang, Y., Liu, C., and Pan, J. (2019). Deep learning-based segmentation and quantification of cucumber powdery mildew using convolutional neural network. Frontiers in plant science, 10, 155. doi: 10.3389/fpls.2019.00155
Ma, J., Du, K., Zhang, L., Zheng, F., Chu, J., and Sun, Z. (2017). A segmentation method for greenhouse vegetable foliar disease spots images using color information and region growing. Computers and Electronics in Agriculture, 142, 110–117. doi: 10.1016/j.compag.2017.08.023
Mao, R., Wang, Z., Li, F., Zhou, J., Chen, Y., and Hu, X. (2023). GSEYOLOX-s: An improved lightweight network for identifying the severity of wheat Fusarium head blight. Agronomy, 13, 242. doi: 10.3390/agronomy13010242
McGranahan, G. and Leslie, C. (1991). Walnuts (Juglans). Acta Horticulturae, 290, 907–924. doi: 10.17660/ActaHortic.1991.290.20
Mircetich, S. M., Sanborn, R., and Ramos, D. (1980). Natural spread, graft-transmission, and possible etiology of walnut blackline disease. Phytopathology, 70, 962–968. doi: 10.1094/Phyto-70-962
Moragrega, C., Matias, J., Aletà, N., Montesinos, E., and Rovira, M. (2011). Apical necrosis and premature drop of Persian (English) walnut fruit caused by Xanthomonas arboricola pv. juglandis. Plant Disease, 95, 1565–1570. doi: 10.1094/PDIS-03-11-0259
Ngugi, L. C., Abdelwahab, M., and Abo-Zahhad, M. (2020). Tomato leaf segmentation algorithms for mobile phone applications using deep learning. Computers and Electronics in Agriculture, 178, 105788. doi: 10.1016/j.compag.2020.105788
Ozturk, O., Sarica, B., and Seker, D. Z. (2025). Interpretable and robust ensemble deep learning framework for tea leaf disease classification. Horticulturae, 11, 437. doi: 10.3390/horticulturae11040437
Parashar, N., Johri, P., Khan, A. A., Gaur, N., and Kadry, S. (2024). An integrated analysis of yield prediction models: A comprehensive review of advancements and challenges. Computers, Materials & Continua, 80 (1). doi: 10.32604/cmc.2024.050240
Picon, A., Alvarez-Gila, A., Seitz, M., Ortiz-Barredo, A., Echazarra, J., and Johannes, A. (2019). Deep convolutional neural networks for mobile capture device-based crop disease classification in the wild. Computers and Electronics in Agriculture, 161, 280–290. doi: 10.1016/j.compag.2018.04.002
Rastogi, A., Arora, R., and Sharma, S. (2015). “Leaf disease detection and grading using computer vision technology & fuzzy logic,” in Paper presented at the 2015 2nd international conference on signal processing and integrated networks (SPIN).
Rustia, D. J. A., Lee, W.-C., Lu, C.-Y., Wu, Y.-F., Shih, P.-Y., and Chen, S.-K. (2023). Edge-based wireless imaging system for continuous monitoring of insect pests in a remote outdoor mango orchard. Computers and Electronics in Agriculture, 211, 108019. doi: 10.1016/j.compag.2023.108019
Shi, T., Liu, Y., Zheng, X., Hu, K., Huang, H., Liu, H., et al. (2023). Recent advances in plant disease severity assessment using convolutional neural networks. Scientific Reports, 13, 2336. doi: 10.1038/s41598-023-29230-7
Singh, U. P., Chouhan, S. S., Jain, S., and Jain, S. (2019). Multilayer convolution neural network for the classification of mango leaves infected by anthracnose disease. IEEE access, 7, 43721–43729. doi: 10.1109/ACCESS.2019.2907383
Su, Y., Wang, J., Li, J., Wang, L., Wang, K., Li, A., et al. (2023). Spatiotemporal changes and driving factors of reference evapotranspiration and crop evapotranspiration for cotton production in China from 1960 to 2019. Frontiers in Environmental Science, 11, 1251789. doi: 10.3389/fenvs.2023.1251789
Tripathi, R. (2021). “A deep learning approach for plant material disease identification,” in Paper presented at the IOP Conference Series: Materials Science and Engineering.
Vishnoi, V. K., Kumar, K., Kumar, B., Mohan, S., and Khan, A. A. (2022). Detection of apple plant diseases using leaf images through convolutional neural network. IEEE Access, 11, 6594–6609. doi: 10.1109/ACCESS.2022.3232917
Wang, F., Dun, C., Tang, T., Duan, Y., Guo, X., and You, J. (2022). Boeremia exigua causes leaf spot of walnut trees (Juglans regia) in China. Plant Disease, 106, 1993. doi: 10.1094/PDIS-10-21-2304-PDN
Wang, Q.-H., Fan, K., Li, D.-W., Han, C.-M., Qu, Y.-Y., Qi, Y.-K., et al. (2020). Identification, virulence and fungicide sensitivity of Colletotrichum gloeosporioides ss responsible for walnut anthracnose disease in China. Plant disease, 104, 1358–1368. doi: 10.1094/PDIS-12-19-2569-RE
Weber, B. C. (1980). How to diagnose black walnut damage Vol. 57 (North Central Forest Experiment Station, Forest Service, US Department of Agriculture).
Xu, W. (2024). HCF-Net: Hierarchical context fusion network for infrared small object detection. arXiv preprint arXiv:2403.10778. doi: 10.1109/ICME57554.2024.10687431
Yang, C., Deng, Y., Wang, F., Yang, H., Xu, X., Zeng, Q., et al. (2021). Brown leaf spot on Juglans sigillata caused by Ophiognomonia leptostyla in Sichuan, China. Plant Disease, 105, 4160. doi: 10.1094/PDIS-02-21-0344-PDN
Yang, Q., Duan, S., and Wang, L. (2022). Efficient identification of apple leaf diseases in the wild using convolutional neural networks. Agronomy, 12, 2784. doi: 10.3390/agronomy12112784
Zarei, S., Taghavi, S. M., Banihashemi, Z., Hamzehzarghani, H., and Osdaghi, E. (2019). Etiology of leaf spot and fruit canker symptoms on stone fruits and nut trees in Iran. Journal of Plant Pathology, 101, 1133–1142. doi: 10.1007/s42161-019-00283-w
Keywords: walnut, brown spot disease (Ophiognomonia leptostyla), hierarchical feature selection, edge features perception, adaptive multi-scale dilated convolution, disease grading
Citation: Wei Y, Zeng D and Zheng L (2025) A precision grading method for walnut leaf brown spot disease integrating hierarchical feature selection and dynamic multi-scale convolution. Front. Plant Sci. 16:1641677. doi: 10.3389/fpls.2025.1641677
Received: 05 June 2025; Accepted: 09 September 2025;
Published: 03 October 2025.
Edited by:
Ravinder Kumar, Indian Agricultural Research Institute (ICAR), India
Reviewed by:
Geza Bujdoso, Hungarian University of Agricultural and Life Sciences, Hungary
Vasudha Vedula, University of Texas of the Permian Basin, United States
Copyright © 2025 Wei, Zeng and Zheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Liangfang Zheng, 18160250498@163.com
†These authors have contributed equally to this work