
ORIGINAL RESEARCH article

Front. Artif. Intell., 30 January 2026

Sec. AI in Food, Agriculture and Water

Volume 8 - 2025 | https://doi.org/10.3389/frai.2025.1689865

LeafSightX: an explainable attention-enhanced CNN fusion model for apple leaf disease identification

  • 1Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
  • 2Centre for Image and Vision Computing (CIVC), Faculty of Artificial Intelligence and Engineering, Multimedia University, Cyberjaya, Malaysia
  • 3Department of Computer Science, New York Institute of Technology, New York, NY, United States
  • 4Department of Computer Science, City University of New York, New York, NY, United States
  • 5AI and Big Data Department, Woosong University, Daejeon, Republic of Korea

The rapid and precise identification of apple leaf diseases is crucial for minimizing yield loss in precision agriculture. However, many existing deep learning methods struggle to transfer to real-world settings, are not easily interpretable, and often lack sufficient statistical validation. To address these difficulties, we propose LeafSightX, a dual-backbone architecture that combines features from DenseNet201 and InceptionV3 using Multi-Head Self-Attention (MHSA), enhancing representational capability and spatial context reasoning. Our procedure includes specialized preprocessing and targeted data augmentation, improving model resilience across many scenarios. Furthermore, LeafSightX integrates explainable AI techniques with Grad-CAM visualizations to improve transparency. In assessments on a five-class apple leaf disease dataset featuring field and laboratory images, LeafSightX demonstrates exceptional performance, attaining a test accuracy of 99.64%, an F1-score of 0.9962, and AUC and PR-AUC scores of 1.000, surpassing all baseline CNNs. Cross-validated Cohen's Kappa (mean = 0.9917, σ = 0.0020) and AUC (mean = 0.9998) indicate a high level of predictive consistency. Despite its architectural complexity, the model offers real-time inference capabilities, with per-sample latency suitable for edge device deployment. Additionally, the proposed LeafSightX framework was trained and evaluated on an additional independent apple leaf disease dataset, achieving a test accuracy of 99.69%, demonstrating its robustness and generalization. Our approach is a rigorously evaluated, transparent, and highly accurate system for identifying plant diseases, providing a reproducible foundation for the practical application of AI in agriculture.

1 Introduction

Agriculture plays a vital role in ensuring global food security and maintaining economic stability (Aboelenin et al., 2025). Apple (Malus domestica) is of significant commercial importance among fruit crops due to its nutritional value and market demand. Nonetheless, the production and quality of apples are severely threatened by several foliar diseases, including Alternaria leaf spot, rust, gray spot, and brown spot (Cabrefiga et al., 2023). These diseases result in significant yield losses and directly influence fruit appearance and quality, causing severe economic harm (Hussain, 2024; Vurro et al., 2010; Bonkra et al., 2024). Early and precise diagnosis of apple leaf diseases is thus critical for sustainable production and disease management (Fadia et al., 2019).

Historically, farmers and agricultural specialists have relied on manual inspection to detect diseases on apple leaves (Khan et al., 2022). This technique is based on visual observation of leaf surfaces, during which any visible lesions, discoloration, fungal plaques, or necrotic tissue are assessed. Despite its simplicity and widespread use, manual diagnosis is slow, labor-intensive, and heavily dependent on expert experience. In addition, accuracy is often diminished by subjective judgment, fatigue, and environmental factors such as changes in lighting. Manual inspection is inefficient in large-scale orchards or rural settings, where qualified experts are scarce and timely decisions are difficult to make. Figure 1 illustrates the manual examination process performed by an agricultural expert in an orchard environment.


Figure 1. Manual inspection of leaves by an agricultural expert in an orchard setting. The process is time-consuming and subjective, motivating the need for automated computer vision systems (Fu et al., 2025).

Due to the rapid advancement of computer vision and deep learning, automated plant disease diagnosis has become indispensable (Shafay et al., 2025; Upadhyay et al., 2025). Convolutional neural network architectures such as VGG19, InceptionV3, DenseNet201, and Xception have demonstrated impressive results in image-based classification. Nevertheless, even with these developments, current practices continue to encounter significant challenges. Many models generalize poorly to actual field images that do not match laboratory conditions. Interpretability also tends to be lacking in most systems, making it hard to explain why they make the predictions they do (Doutoum and Tugrul, 2023). Moreover, the lack of reliable probability calibration can lead to overconfident or underconfident decisions, adversely affecting precision agriculture in practice. Finally, computational cost analysis is rarely discussed in existing studies on plant disease detection, yet it plays a critical role in determining the real-time applicability and practicality of deep learning models in agricultural environments.

To address these drawbacks, a robust, explainable deep learning model that provides reliable diagnoses across diverse cases is an immediate necessity. To meet this requirement, we present LeafSightX, a system for automatically detecting and identifying apple leaf diseases. LeafSightX combines the backbones of DenseNet201 and InceptionV3 via a Multi-Head Self-Attention mechanism, enabling the model to capture both local texture features and broader contextual features. A full preprocessing pipeline of Gaussian blur filtering, contrast-limited adaptive histogram equalization, and intensity normalization is used to maximize image quality and minimize environmental variation. In addition, Grad-CAM visualization highlights diseased areas, making the model more transparent and interpretable.

The contribution of this study is summarized as:

• LeafSightX is a new deep learning framework that combines DenseNet201 and InceptionV3 with Multi-Head Self-Attention to capture both small details and overall patterns for accurate apple leaf disease detection.

• A robust preprocessing pipeline using Gaussian blur, contrast adjustment, and normalization improves image quality and handles changes in light and environment.

• Explainable artificial intelligence using Grad-CAM highlights the leaf regions that influence predictions, making the results easier for experts to understand.

• Reliability and computational cost are evaluated using Brier Score, permutation testing, bootstrap confidence intervals, Cohen Kappa, and inference time, addressing the rare focus on efficiency for real-time deployment.

• LeafSightX outperforms existing methods in accuracy, reliability, and efficiency, providing more trustworthy results.

• The framework is validated on an additional dataset to demonstrate robustness and generalization to new images and conditions.

The proposed design mitigates the constraints identified above, providing an automated leaf disease detection system that serves as a replicable, interpretable, and efficient tool for precision agriculture and sustainable apple growing worldwide.

The subsequent sections of the paper are organized as follows: Section 2 presents a literature review on the identification of apple leaf diseases using deep learning techniques. Section 3 describes the proposed LeafSightX, the dataset pre-processing, and the training environment. The experimental results and model interpretability are explored in Section 4 with the aid of Grad-CAM visuals. Finally, Section 5 summarizes the article and provides guidance for further research.

2 Literature review

The precise, effective, and resilient identification of apple leaf diseases has been a fundamental challenge in precision agriculture, and recent advancements in deep learning have been a significant catalyst in addressing it. A-Net, proposed by Liu and Li (2024), was a YOLOv5-based framework that incorporated an additional Wise-IoU loss function and RepVGG modules. This model exhibited an exceptional detection rate, with a mean average precision (mAP@0.5) of 92.7%. However, its adaptability to diverse environmental conditions had not been examined. Wang et al. (2025) presented ELM-YOLOv8n, which integrated Efficient Multi-Scale Attention (EMA) and DESCS-DH blocks to balance speed and accuracy while maintaining a lightweight design. It achieved a mAP@0.5 of 96.7% and an F1-score of 94.0%, but still faced challenges in real-time applicability and robustness across heterogeneous datasets. Similarly, Gong and Zhang (2023) employed GhostConv and GAM modules in a compact YOLOv8 variant, attaining a mAP@0.5 of 86.9% with very few parameters; however, concerns remained regarding the model's generalizability and explainability.

Similarly, Lv and Su (2024) applied transformer encoders and CBAM in YOLOv5-CBAM-C3TR to improve feature description. Although it achieved 92.4% accuracy, it was limited by its inability to incorporate newer YOLO variants (YOLOv7/YOLOv8). Scientific (2025) proposed a VGG-DAGSVM model that employed bilateral filtering and SegNet-based segmentation. While it achieved 96.5% classification accuracy, the robustness of its preprocessing and segmentation methods was unsatisfactory. Moreover, Luo (2025) introduced AppleLite-YOLOv8, which integrated EdgeNeXt and C2f-SC modules. This system achieved 97.56% accuracy and 94.38% recall, yet struggled in complex backgrounds and uneven lighting.

In a recent work, Rajput et al. (2024) implemented Neutrosophic Logic with EfficientNetB0 to address uncertainty, reporting an accuracy of 99.51%. Nevertheless, the lack of cross-validation and interpretability reduced its reliability. Segmentation-integrated approaches also gained traction. Parashar and Johri (2024) achieved an accuracy of 94.76% by incorporating Canny edge detection and watershed transformation into a CNN framework. However, the segmentation methods were conventional, offered no explainability, and were not validated. Rohith et al. (2025) compared CNNs including ResNet, VGG19, and InceptionV3, which achieved maximal validation accuracies of 98.9%, 97.1%, and 97.4%, respectively. However, these studies lacked robustness testing and interpretability.

In contrast, Liu et al. (2024) proposed MCDCNet, which combined multi-scale fusion with constrained deformable convolution to enhance geometric adaptability. It demonstrated a detection accuracy of 66.8% and captured spatial deformations well, but struggled to identify overlapping lesions. Furthermore, Zhang et al. (2023) introduced BCTNet, which incorporated a Bole Convolution Module and bidirectional feature fusion. The model achieved 85.23% accuracy and a real-time inference rate of 33 FPS, but its effectiveness was limited by dataset diversity. Khan et al. (2022) proposed a two-step lightweight classification and symptom localization network, trained on 9,000 RGB images, achieving 88% classification accuracy and 42% mAP. Despite its ability to operate in real time, the dataset was limited in diversity, scalability was questionable, and the network misclassified subtle symptoms. In complex environments it achieved a recall of 49.0% and a mAP of 34.0%; although it performed reasonably well there, it remained prone to misclassifying small lesions and did not generalize across datasets.

Despite the increasing number of studies on apple leaf disease detection, several critical gaps persisted. First, the generalizability of the findings was limited by the lack of replication and validation methods, particularly cross-validation, which involves dividing the data into training and test sets to train and test models multiple times. Second, most studies employed segmentation and preprocessing algorithms (steps to highlight affected leaf areas and prepare data before analysis) that were either inadequately optimized or simplistic, rendering them ineffective in isolating complex or overlapping lesions (Sharma et al., 2020; Dayang and Meli, 2021). Third, explainability, meaning the ability to interpret how and why a model makes decisions, was often absent, which undermined trust in model predictions. Fourth, although models demonstrated high accuracy under controlled conditions, few were rigorously tested across varying environmental conditions, such as lighting, background clutter, or leaf orientation, thereby limiting their field applicability. Moreover, computational expense, referring to the amount of computing resources required, remained high due to complex architectures with numerous parameters. Yet, this issue was seldom quantified or analyzed, raising concerns about deployment on resource-constrained devices. Finally, recent research shows that insufficient application of statistical validation methods, such as analysis of variance (ANOVA) and t-tests, compromised the empirical rigor and robustness of the reported findings (Saho, 2025).

These shortcomings underscore the need for future research to prioritize computational efficiency, statistically rigorous validation, advanced preprocessing, and explainability. Addressing these gaps would improve the practical applicability, reliability, and portability of automated apple leaf disease detection systems in precision agriculture. This paper addressed these gaps by introducing a comprehensive preprocessing strategy and extensive data augmentation to improve model resilience. A novel dual-backbone transfer learning framework, LeafSightX, was developed, integrating MHSA operations for enhanced feature representation and classification accuracy. The model was validated through cross-validation and statistical testing to confirm the significance of improvements. Furthermore, XAI methods were incorporated to ensure interpretability and transparency. Finally, the practicality and generalizability of the system were enhanced by training on both laboratory-controlled and field-collected images under varied environmental conditions, while substantially reducing inference time per sample for real-time agricultural deployment.

3 Methodology

The proposed methodology for detecting apple leaf disease employs a systematic pipeline aimed at effectively identifying and classifying diverse apple leaf states. The framework's workflow is illustrated in Figure 2. The primary phases of this methodology encompass data gathering, preprocessing and augmentation, disease diagnosis, baseline and proposed model construction, training, and performance evaluation.


Figure 2. Workflow diagram of proposed Apple leaf disease detection framework.

3.1 Data acquisition

This research utilizes the Apple Tree Leaf Disease dataset, collected from Kaggle and made available by Nirmal (Kaggle, 2025). The dataset consists of images of diseased apple leaves gathered from four locations at China's Northwest University of Agriculture and Forestry Science and Technology. Images were taken with a Glory V10 mobile phone in diverse environmental settings, with approximately 52% captured in a controlled laboratory and 48% in natural growing fields. To evaluate the effectiveness of our approach, we also use an additional dataset (Dhar, 2023).

The dataset is divided into five categories: healthy leaf, alternaria leaf spot, rust, gray spot, and brown spot. The class distribution is presented in Table 1, while Figure 3 shows typical sample images from each category.


Table 1. Class distribution of the apple leaf disease dataset.


Figure 3. Sample images from the apple leaf disease dataset.

3.2 Image processing and feature enhancement

A standard preprocessing pipeline was applied before model training to enhance image quality and highlight disease-associated regions (Gudge et al., 2025). To ensure consistent model input, the images were first resized to a fixed resolution. A Gaussian blur filter was then applied to mitigate high-frequency noise and sensor artifacts, easing computation and enabling the model to focus on relevant patterns rather than noise (Gedraite and Hadad, 2011). Next, Contrast Limited Adaptive Histogram Equalization (CLAHE) was applied to the V channel of the HSV color space (Pizer et al., 1990); this step compensates for lighting variations and improves local contrast, making disease spots and lesions easier to detect in images captured under diverse lighting and weather conditions. Finally, pixel intensities were normalized to a comparable range. Collectively, these steps stabilized the training process, allowing the network to learn more effectively and generalize better to heterogeneous data, thereby enhancing the reliability and accuracy of the disease classification model. Table 2 summarizes the key preprocessing steps and their parameters.


Table 2. Preprocessing operations and parameters applied to the apple leaf images.
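To make the pipeline concrete, the following is a minimal sketch of these operations using OpenCV; the kernel size, CLAHE clip limit, tile grid, and target resolution are illustrative assumptions rather than the exact values in Table 2.

```python
import cv2
import numpy as np

def preprocess_leaf(path, size=(224, 224)):
    """Resize -> Gaussian blur -> CLAHE on the HSV V channel -> normalize.
    Parameter values here are illustrative assumptions (see Table 2)."""
    img = cv2.imread(path)                          # BGR image from disk
    img = cv2.resize(img, size)                     # consistent model input size
    img = cv2.GaussianBlur(img, (3, 3), 0)          # suppress high-frequency noise
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    hsv = cv2.merge([h, s, clahe.apply(v)])         # equalize brightness channel only
    img = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    return img.astype(np.float32) / 255.0           # intensities normalized to [0, 1]
```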

Figure 4 shows example images at various stages of the preprocessing pipeline, illustrating the effect of each operation.


Figure 4. Representative images from the preprocessing pipeline: original, Gaussian blurred, CLAHE-enhanced, and normalized images from left to right.

3.3 Data augmentation

To enhance the dataset's diversity and improve the model's generalization, data augmentation was implemented during preprocessing. These augmentations reproduce real-life variations in orientation, scale, and lighting (Mikolajczyk and Grochowski, 2018; Kumar et al., 2024). Random horizontal flipping helps the model become invariant to leaf orientation; random rotation captures leaves photographed at different angles; random zoom simulates changes in camera distance; and brightness adjustment mimics varying illumination. Collectively, these transformations increase data variability, contributing to a model that is more robust to varying input conditions. The augmentation techniques and their parameters are summarized in Table 3.


Table 3. Data augmentation techniques and parameters.
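As a sketch of how such an augmentation stage can be expressed, the snippet below uses Keras preprocessing layers; the specific ranges are placeholders, with the study's actual parameters listed in Table 3.

```python
import tensorflow as tf

# Illustrative ranges only; the study's exact settings appear in Table 3.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),                     # leaf-orientation invariance
    tf.keras.layers.RandomRotation(0.05),                         # small rotations (fraction of 2*pi)
    tf.keras.layers.RandomZoom(0.2),                              # simulates varying camera distance
    tf.keras.layers.RandomBrightness(0.2, value_range=(0., 1.)),  # simulates illumination changes
])

images = tf.random.uniform((4, 224, 224, 3))    # stand-in batch in [0, 1]
augmented = augment(images, training=True)      # applied on the fly during training
```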

This augmentation enlarged the dataset to five times its original size, with each sample consisting of one original image and four augmented variants. The resulting image counts per class are shown in Table 4.


Table 4. Number of images per class after augmentation.

3.4 Dataset splitting

The augmented dataset was divided into training, validation, and testing sets with ratios of 80%, 10%, and 10%, respectively. This division ensures that the model has sufficient data to learn from during training while reserving enough data for validating and finally testing model performance (Mamun et al., 2025). The split was performed on a class basis to ensure proportional representation in all subsets. Table 5 summarizes the number of images per class in each split.


Table 5. Dataset split summary (number of images per class).
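As an illustration, such a class-stratified 80/10/10 split can be reproduced with scikit-learn; the arrays below are stand-ins for the actual image paths and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000)                        # stand-ins for image paths/indices
y = np.random.randint(0, 5, size=1000)     # five disease classes

# First carve off 20%, then split that half-and-half into validation and test,
# stratifying at each step so class proportions match across all subsets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```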

3.5 Baseline models

This study utilizes five state-of-the-art convolutional neural networks, namely DenseNet201, InceptionV3, VGG19, NASNetMobile, and Xception, as baselines for classifying diseases in apple leaves. Each model was extended with an MHSA mechanism with four attention heads to enrich feature representations and capture long-range dependencies.

3.5.1 Densenet201-MHSA

The DenseNet201 architecture comprises 200 convolutional (Conv2D) layers interleaved with 201 batch normalization and 201 activation layers, enabling feature extraction and reuse through dense block-wise connectivity. Features are merged by 98 concatenation layers, allowing the network to retain and integrate information across layers. Spatial dimensions are progressively reduced through three average pooling layers and one max pooling layer, while zero-padding at two pooling stages preserves feature-map size. Feature extraction is followed by a flatten layer and three dense layers that perform the final classification; two dropout layers minimize overfitting, and two lambda layers carry out custom tensor operations. The architecture includes a single input layer and a four-head MHSA module placed after the convolutional blocks to learn long-range spatial correlations. The attention output is followed by layer normalization and an addition layer, establishing residual connections that improve learning stability and representation. Combined, these elements allow DenseNet201-MHSA to capture intricate patterns in diseased apple leaf images. To quantitatively describe the structural composition of the DenseNet201-MHSA architecture, the frequency and arrangement of its primary layer types are outlined in Table 6.


Table 6. Layer-wise summary of the DenseNet201-MHSA architecture.
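For concreteness, the following Keras sketch shows the general backbone-plus-MHSA pattern described above rather than the exact layer-for-layer network of Table 6; the dense head width, dropout rate, and attention key dimension are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def densenet201_mhsa(num_classes=5, input_shape=(224, 224, 3)):
    """Sketch: DenseNet201 feature maps treated as a token sequence for
    4-head self-attention, with residual add + layer norm, then a dense head."""
    base = tf.keras.applications.DenseNet201(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x = base.output                                   # (7, 7, 1920) feature maps
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    tokens = layers.Reshape((h * w, c))(x)            # spatial grid -> token sequence
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=c // 4)(tokens, tokens)
    tokens = layers.LayerNormalization()(layers.Add()([tokens, attn]))  # residual + LN
    feat = layers.GlobalAveragePooling1D()(tokens)
    feat = layers.Dropout(0.3)(layers.Dense(256, activation="relu")(feat))
    out = layers.Dense(num_classes, activation="softmax")(feat)
    return tf.keras.Model(base.input, out)
```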

3.5.2 InceptionV3-MHSA

InceptionV3-MHSA begins with a single input layer, followed by 94 convolutional layers coupled with batch normalization and activation layers that keep training balanced and fast. It features four max pooling layers and nine average pooling layers, which progressively downsample the spatial dimensions while preserving essential features. Fifteen concatenation layers interconnect the inception modules, enabling multi-scale feature learning across the convolutional branches. A global average pooling layer condenses the spatial dimensions into a compact feature vector, which is fed into three dense layers and a final classification layer. To avoid overfitting, two dropout layers are applied, and two lambda layers carry out customized tensor operations. A layer normalization layer improves stability during training and provides consistent feature scaling across the four attention heads, and an MHSA layer placed before the final dense layers improves spatial feature representation. This arrangement enables InceptionV3-MHSA to learn both local and global patterns in the apple leaf dataset. Table 7 presents the key layer summary of the InceptionV3-MHSA architecture.


Table 7. Layer-wise summary of the InceptionV3-MHSA architecture.

3.5.3 VGG19-MHSA

The VGG19-MHSA model begins with a single input layer, followed by 16 successive convolutional layers that generate hierarchical features from the input image. These are interspersed with five max pooling layers that compact the spatial dimensions while preserving essential structures. A global average pooling layer then reduces the spatial features to a compact representation after feature extraction. The output passes through three dense layers, with two dropout layers injected to help prevent overfitting and improve generalization. An MHSA layer with four attention heads enables the model to focus on different spatial regions and perform global contextual reasoning. A layer normalization layer stabilizes the attention output, and two lambda layers inject custom operations for greater architectural flexibility. The attended features are merged with an earlier representation via an addition layer, enriching the final classification pathway. Table 8 presents the key layer summary of the VGG19-MHSA architecture.


Table 8. Layer-wise summary of the VGG19-MHSA architecture.

3.5.4 Xception-MHSA

The Xception-MHSA model has a single input layer followed by 116 convolutional layers that use depthwise separable convolutions to learn fine spatial features efficiently. These are accompanied by 126 batch normalization layers and 126 activation functions, which enhance both nonlinearity and learning stability. The feature maps are down-sampled three times through max pooling, and nine average pooling layers plus a final global average pooling layer reduce the spatial dimensions to a feature vector. The network has three dense layers for final prediction, with two dropout layers in between for regularization. A Multi-Head Self-Attention mechanism with four heads increases global feature interaction, and layer normalization stabilizes the attention output. Custom computation is implemented in two lambda layers, and 24 add layers are strategically distributed to combine intermediate representations, maintaining information flow through residual connections. The final representation thus captures both local and global context. Table 9 presents the key layer summary of the Xception-MHSA architecture.


Table 9. Layer-wise summary of the Xception-MHSA architecture.

3.5.5 NasNetMobile-MHSA

The NASNetMobile-MHSA model begins with an input layer, followed by 36 conventional convolutional layers and 160 separable convolutional layers. Each convolution is paired with a batch normalization layer and an activation layer (192 of each), enabling the network to train to greater depth and learn non-linear representations. Eight max pooling and 52 average pooling layers perform spatial reduction and refinement. The architecture uses 24 zero-padding and four cropping layers to adjust input dimensions while preserving edge information. The network relies heavily on 20 concatenation layers and 81 addition layers to combine features across branches. The resulting feature maps undergo global average pooling and are fed into three dense layers for classification, with two dropout layers introduced to curb overfitting. An MHSA layer with four heads enables global context modeling, and a layer normalization layer stabilizes training. Two lambda layers enable custom tensor operations, adding architectural flexibility. Table 10 shows the key layer summary of the NASNetMobile-MHSA architecture.


Table 10. Layer-wise summary of the NasNetMobile-MHSA architecture.

3.5.6 LeafSightX: proposed hybrid deep feature fusion Dense201 and InceptionV3

The proposed LeafSightX model combines the DenseNet201 and InceptionV3 architectures through a deep feature fusion strategy complemented by Multi-Head Self-Attention (Li et al., 2023). The model takes preprocessed input images; in parallel, each pre-trained network is followed by a global average pooling layer and an attention layer to extract rich, complementary features. These features are concatenated and passed through fully connected layers with dropout for regularization before the final classification. The architecture contains a total of 114 concatenation layers, 294 convolutional layers, 295 batch normalization and 295 activation layers, and two Multi-Head Attention modules. Other elements include pooling layers, layer normalization, and residual connections introduced through lambda and zero-padding layers, giving the model a tightly structured design. The LeafSightX architecture is illustrated in Figure 5, which shows the dual-backbone feature extractor, the multi-head self-attention modules, and the fusion pattern that collectively learn discriminative representations for classifying apple leaf diseases. Table 11 provides a layer-wise summary of the proposed LeafSightX architecture and its training details, and Algorithm 1 outlines the overall workflow.


Figure 5. Architecture of the proposed LeafSightX model (the dual backbone feature extraction, multi-head self-attention modules, and the fusion mechanism that jointly learn discriminative representations for apple leaf disease classification).


Table 11. Layer-wise summary of the LeafSightX architecture (DenseNet + InceptionV3 + MHSA).


Algorithm 1. LeafSightX: dual-backbone transfer learning with multi-head self-attention.
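A condensed Keras sketch of this dual-backbone fusion follows. It mirrors the structure described above (per-backbone four-head attention with residual connections, pooling, concatenation, and a dense head), while the head width and dropout rates are illustrative assumptions rather than the exact values of Table 11.

```python
import tensorflow as tf
from tensorflow.keras import layers

def attended_features(base, name):
    """Four-head self-attention over one backbone's spatial tokens, then pooling."""
    x = base.output
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    tokens = layers.Reshape((h * w, c), name=f"{name}_tokens")(x)
    attn = layers.MultiHeadAttention(
        num_heads=4, key_dim=c // 4, name=f"{name}_mhsa")(tokens, tokens)
    tokens = layers.LayerNormalization(name=f"{name}_ln")(
        layers.Add(name=f"{name}_add")([tokens, attn]))          # residual connection
    return layers.GlobalAveragePooling1D(name=f"{name}_gap")(tokens)

inp = layers.Input((224, 224, 3))
dense_bb = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", input_tensor=inp)
incep_bb = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_tensor=inp)

fused = layers.Concatenate()([attended_features(dense_bb, "dense"),
                              attended_features(incep_bb, "incep")])
x = layers.Dropout(0.4)(layers.Dense(512, activation="relu")(fused))
out = layers.Dense(5, activation="softmax")(x)                   # five leaf classes
leafsightx = tf.keras.Model(inp, out)
```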

3.6 Model training settings

The CNN baseline models and the proposed LeafSightX framework were trained with consistent hyperparameters and procedures to provide a fair and comparable assessment. Images were resized and normalized, and pretrained weights were used to speed up convergence. Multi-Head Self-Attention modules with four attention heads were incorporated to capture global dependencies. All models were trained with the Adam optimizer and categorical cross-entropy loss, with dropout and L2 weight decay as regularization to prevent overfitting. Early stopping and learning-rate scheduling improved training robustness; in particular, the ReduceLROnPlateau callback monitored the validation loss and reduced the learning rate by a factor of 0.5 after three consecutive epochs without improvement. The overall training settings used across all models are described in Table 12.


Table 12. Training settings and hyperparameters for CNNs and LeafSightX.
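A minimal sketch of this configuration follows; the model, data, learning rate, and epoch count are stand-ins (the study's exact values are in Table 12), while the callback mirrors the ReduceLROnPlateau policy described above.

```python
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stand-in model; in practice this is the LeafSightX network sketched earlier.
model = tf.keras.Sequential([
    tf.keras.layers.Input((224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # placeholder LR
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    # Halve the learning rate after 3 epochs without validation-loss improvement.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, verbose=1),
    EarlyStopping(monitor="val_loss", patience=8, restore_best_weights=True),
]

# Stand-in data; real pipelines yield batches of (image, one-hot label).
x = tf.random.uniform((32, 224, 224, 3))
y = tf.one_hot(tf.random.uniform((32,), maxval=5, dtype=tf.int32), 5)
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)

model.fit(train_ds, validation_data=train_ds, epochs=5, callbacks=callbacks)
```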

3.7 Proposed model explainability

To enhance the interpretability of the LeafSightX model, we employed Grad-CAM on both the DenseNet201 and InceptionV3 backbones. Grad-CAM produces class-discriminative heatmaps by calculating the gradient of a class score with respect to the feature maps of a selected convolutional layer. This approach identifies the areas of the input image that contribute most to the model's predictions.

In every backbone, activations of the last convolutional layer were used to generate heatmaps of the original images of leaves. These visual explanations showed that DenseNet201 and InceptionV3 both focused on biologically significant areas, including disease lesions, texture alterations, and color aberrations. Grad-CAM demonstrates that the LeafSightX model determines actions by considering relevant visual elements, thereby increasing transparency and reliability in disease classification.
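As an illustration of this computation, a minimal Grad-CAM sketch in TensorFlow follows; the layer name in the usage comment is the final dense-block concatenation in Keras' DenseNet201 build and would be adjusted per backbone, and `leafsightx` and `img` refer to the model and a preprocessed image from earlier sketches.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Class-discriminative heatmap: gradients of the top class score w.r.t.
    a convolutional layer's feature maps, averaged into channel weights."""
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_idx = int(tf.argmax(preds[0]))          # explain the predicted class
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)            # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))      # global-average-pool gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalized [0, 1] heatmap

# e.g. for the DenseNet201 branch:
# heat = grad_cam(leafsightx, img, "conv5_block32_concat")
```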

3.8 Evaluation metrics

To comprehensively assess the performance of the proposed LeafSightX model, multiple evaluation metrics were employed.

Accuracy measures the overall correctness of the predictions and is defined as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.

Precision evaluates how well the model identifies positive cases, defined as:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall measures the model's ability to detect actual positive cases:

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1-Score balances precision and recall using their harmonic mean:

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

A confusion matrix summarizes all TP, FP, TN, and FN values, providing a clear view of classification errors.

Area Under the ROC Curve (AUC) measures the model's ability to distinguish between classes across different thresholds. In contrast, Precision-Recall AUC (PR AUC) quantifies the trade-off between precision and recall over varying thresholds.
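These metrics map directly onto scikit-learn routines; the sketch below computes them from softmax outputs on random stand-in data, with macro averaging assumed since the averaging scheme is not stated above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score,
                             average_precision_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, 200)            # stand-in labels for five classes
y_prob = rng.dirichlet(np.ones(5), 200)     # stand-in softmax outputs
y_pred = y_prob.argmax(axis=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
print("AUC      :", roc_auc_score(y_true, y_prob, multi_class="ovr"))
onehot = np.eye(5)[y_true]                  # PR-AUC needs one-hot targets
print("PR-AUC   :", average_precision_score(onehot, y_prob, average="macro"))
print(confusion_matrix(y_true, y_pred))     # per-class error breakdown
```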

Finally, the robustness and reliability of LeafSightX were evaluated against baseline models using statistical significance testing, ensuring that observed improvements are not due to chance.

4 Results and discussion

In this section, we systematically analyze the performance of the proposed LeafSightX model in comparison to established baselines. We explore its accuracy, robustness, and interpretability through extensive experiments, emphasizing its potential to advance automated plant disease diagnosis. The discussion highlights key insights gained from the results and situates our findings within the broader context of agricultural AI applications.

4.1 Model performance overview

The classification accuracy of all evaluated models was comparatively assessed on the training, validation, and test sets and is presented in Table 13. The training accuracies of all architectures were high, indicating that discriminative features for leaf disease classification were learned effectively. DenseNet201 and InceptionV3 generalized well, with both validation and test accuracy well above 98%, whereas VGG19 and NASNetMobile performed relatively poorly, possibly because their feature extraction is less effective or their capacity is too small. The proposed LeafSightX model outperformed all baselines, achieving 99.02% and 99.64% accuracy on the validation and test sets, respectively, close to its training accuracy of 99.98%. The small gap reflects effective overfitting mitigation enabled by the dual-backbone fusion and MHSA modules, which enhance discriminative feature representation. These findings highlight the strengths and potential of LeafSightX for real-world application as an agricultural diagnostic system.


Table 13. Comparison of train, validation, and test accuracy for different models.

4.2 Comprehensive performance metrics: precision, recall, F1-score, AUC, and PR AUC

To provide a comprehensive evaluation beyond accuracy, Table 14 presents additional performance metrics, including precision, recall, F1-score, AUC, and PR AUC across the training, validation, and test sets. These metrics assess class balance, model sensitivity, and confidence calibration. The proposed LeafSightX model consistently outperforms all baselines, achieving mean precision, recall, and F1-scores above 0.99 on both the validation and test sets. High recall implies strong identification of diseased cases with few missed detections, high precision implies few false positives, and the F1-scores indicate balanced classification. AUC and PR AUC values near 1.000 indicate high discrimination and a strong precision-recall trade-off, which is significant when classes are imbalanced. Conversely, baseline models show a visible drop in these measures from the training to the validation and test sets, indicating overfitting and poorer generalization. These findings support the robustness, reliability, and applicability of the proposed model for accurate diagnosis of leaf diseases in precision agriculture.


Table 14. Average precision, recall, F1-score, AUC, and PR-AUC for train, validation, and test sets.

4.3 Assessing model generalization via 5-fold cross-validation

To assess the models' robustness and generalization, a 5-fold cross-validation was conducted, and the results are shown in Table 15. In 5-fold cross-validation, the dataset is divided into five equal parts; each part is used once as a validation set while the remaining four are used for training, and the process repeats five times so all data is tested. LeafSightX achieved the best performance across all folds, with an average validation accuracy of 99.19% and a low standard deviation of 0.14%, indicating stable performance across data splits. Its average Cohen's Kappa of 0.9917 confirms agreement well beyond chance. Validation AUC and PR AUC were also very high, 0.9998 and 0.9994, respectively, indicating good discriminative power and multi-class performance even when classes are imbalanced. The nearest baseline was DenseNet201, achieving an average validation accuracy of 99.02% and a Kappa of 0.9876. InceptionV3 and Xception achieved roughly 97%–98% accuracy with somewhat higher variation. VGG19 and NASNetMobile showed lower accuracy and Kappa scores and poorer generalization, likely due to weaker feature extraction. On the whole, these findings indicate that LeafSightX exhibits consistent, reliable, and robust performance and justify the efficiency of dual-backbone fusion and MHSA for managing leaf diseases in practice.


Table 15. 5-Fold validation metrics for different models.
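The fold bookkeeping of this protocol can be sketched as follows; a simple scikit-learn classifier and random arrays stand in for the actual LeafSightX training run on image data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.random((500, 8))                   # stand-in features
y = rng.integers(0, 5, 500)                # five classes

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accs, kappas = [], []
for tr, va in skf.split(X, y):
    # A simple classifier stands in for the full LeafSightX training run.
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    pred = clf.predict(X[va])
    accs.append(accuracy_score(y[va], pred))
    kappas.append(cohen_kappa_score(y[va], pred))

print(f"accuracy: mean={np.mean(accs):.4f}, std={np.std(accs):.4f}")
print(f"kappa:    mean={np.mean(kappas):.4f}")
```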

4.4 Computational cost analysis

The radar chart in Figure 6 illustrates the computational cost of the six deep learning models used for apple leaf disease classification in terms of training time, test inference time, and per-sample inference time. DenseNet201 has a moderate training time of 364.30 seconds but a relatively high test inference time of 23.32 seconds and a per-sample time of 0.0284 seconds. InceptionV3 requires 467.86 seconds of training, with a test inference time of 10.17 seconds and a per-sample time of 0.0124 seconds. VGG19 has the longest training time of 758.99 seconds, with test inference and per-sample times of 10.63 and 0.0129 seconds, respectively. Xception proves efficient across the board, with training, test, and per-sample inference times of 387.49, 8.65, and 0.0105 seconds, respectively. NASNetMobile has the lowest training time of 304.92 seconds, with test inference and per-sample times of 12.23 and 0.0149 seconds. Finally, the proposed LeafSightX model carries comparatively high computational costs: 633.85 seconds for training, 30.03 seconds for test inference, and 0.0365 seconds per sample. The chart thus highlights the trade-offs between training and inference efficiency across the models for apple leaf disease detection.


Figure 6. Radar plot illustrating the computational cost of six deep learning models for Apple Leaf disease classification, including training time, test inference time, and per-sample inference time.
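The per-sample figures above can be estimated with a simple wall-clock protocol such as the sketch below; the paper does not specify its exact timing procedure, so the warm-up and batch handling here are assumptions.

```python
import time
import numpy as np

def per_sample_latency(model, images, warmup=5):
    """Average wall-clock inference time per image, after a short warm-up
    to exclude one-off graph construction and caching costs."""
    for _ in range(warmup):
        model.predict(images[:1], verbose=0)
    t0 = time.perf_counter()
    model.predict(images, verbose=0)
    return (time.perf_counter() - t0) / len(images)

# e.g. with the fusion model sketched earlier:
# batch = np.random.rand(100, 224, 224, 3).astype("float32")
# print(f"{per_sample_latency(leafsightx, batch):.4f} s/sample")
```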

4.5 Performance evaluation via learning curves, confusion matrix, and AUC-based metrics

To comprehensively assess the proposed model's performance, we employ a suite of visual evaluation tools. These include learning curves to monitor training dynamics, a confusion matrix to analyze class-wise predictions, and AUC-based metrics to evaluate classification quality under varying thresholds. Such visualizations provide deeper insight into the model's strengths, weaknesses, and generalization behavior.

4.5.1 Performance trends across training epochs

Figure 7 shows the training and validation loss curves of all evaluated models. The loss curves for DenseNet201 demonstrate effective learning, with training and validation losses steadily declining from initial values of 2.0927 and 0.3448, respectively, to a final validation loss of 0.2059, suggesting robust generalization with little overfitting. InceptionV3 shows a steady decrease in training loss (from 0.7851 to 0.0680), while its validation loss falls from 0.3592 to 0.1259 with slight fluctuations but good generalization. VGG19 converges more slowly, with the training loss reaching 0.0734 and the validation loss 0.1281, indicating a higher computational burden and slower optimization. Xception shows smoother and more efficient convergence: the training loss drops to 0.0718 and the validation loss to 0.1450. NASNetMobile converges rapidly, with training and validation losses dropping to 0.1048 and 0.1881, respectively, suggesting fast learning and a sensible level of generalization. Finally, the LeafSightX model achieves the best performance: the training loss drops to 0.0365 and the validation loss to 0.0900, demonstrating excellent optimization, stability, and generalization. Overall, the loss curves indicate that all models converge, with LeafSightX exhibiting the most efficient and effective training dynamics.


Figure 7. Training and validation loss and accuracy curves across epochs for six models: (a) DenseNet201, (b) InceptionV3, (c) VGG19, (d) Xception, (e) NASNetMobile, and (f) LeafSightX. The figure provides a comprehensive comparison of training and validation performance across all models.

4.5.2 Confusion matrix analysis

The classification performance of the models in classifying leaf diseases is shown in the confusion matrices in Figure 8. DenseNet201 accurately recognized the majority of classes, including 136 healthy leaves and all cases of Alternaria leaf spot, while rust and gray spot were slightly confused with brown spot. The same was observed for InceptionV3, with minor misclassifications between rust and brown spot, while healthy leaves and Alternaria leaf spot were generally correct. VGG19 had more misclassifications, particularly for rust, gray spot, and healthy leaves, suggesting lower discriminative capacity. Xception produced more evenly distributed errors, with most samples correctly classified and only slight misclassifications in a few categories. NASNetMobile was more confused: healthy leaves were classified as rust 15 times, rust was predicted as gray spot 7 times, and brown spot was predicted as rust 4 times, although the remaining classes had few misclassifications. The proposed model, LeafSightX, was the most effective, with the fewest misclassifications: 138 healthy leaves were classified correctly with a single confusion with brown spot, Alternaria leaf spot contained no errors, rust was correct in 196 of its predictions with only two small misclassifications, and gray spot and brown spot were correctly identified in all 205 and 172 cases, respectively. These findings confirm that LeafSightX is more accurate and less prone to confusion than the other models, particularly for rust, gray spot, and brown spot.


Figure 8. Confusion matrices for various deep learning models on apple leaf disease classification, showing true vs. predicted labels for five classes: Healthy leaf, Alternaria leaf spot, Rust, Gray spot, and Brown spot.

4.6 Model calibration, statistical significance, and reliability metrics

Additional complementary indicators and tests were employed to assess the reliability, calibration, and statistical significance of the classification models. The Brier Score assesses the precision of probabilistic forecasts and their calibration across all classes. Permutation testing provided robust nonparametric p-values, indicating whether the observed accuracies are substantially larger than those expected by chance. Bootstrap resampling was used to construct confidence intervals for the accuracy measurements, providing insight into performance stability. Lastly, Cohen's Kappa statistic measures the extent to which predicted labels match the actual ones beyond chance level, serving as an additional reliability metric. The average Brier Scores, accuracy, permutation test p-values, bootstrap accuracy with 95% confidence intervals, and Cohen's Kappa scores for all models are summarized in Table 16.


Table 16. Average brier scores, accuracy, permutation p-values, bootstrap accuracy (95% CI), and Cohen's kappa.

The low average Brier Scores across all models, ranging from 0.0001 for the proposed LeafSightX model on the training set to 0.0122 for NASNetMobile on the test set, indicate well-calibrated predicted probabilities and reliable confidence estimates. Accuracy scores also emphasize the models' predictive abilities: LeafSightX achieves 99.98% on the training data and 99.64% on the test data, whereas all models exceed 96% on the test data. These results are confirmed by the permutation test p-values, which are always below 0.001, indicating that the classification performance significantly exceeds random chance. The bootstrap 95% confidence intervals are tight; for instance, LeafSightX's test accuracy is estimated at 99.15–100.00% and DenseNet201's at 98.05–99.39%, indicating stable and precise estimates. Additionally, the Cohen's Kappa values on the test set, ranging from 0.9506 for NASNetMobile to 0.9954 for LeafSightX, demonstrate near-perfect agreement between predicted and actual labels, far beyond chance.
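The sketch below shows one way to compute these complementary statistics on stand-in predictions; note that multi-class Brier score definitions vary (here, the mean squared distance to the one-hot target), and the permutation and bootstrap repetition counts are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, 300)               # stand-in labels
y_prob = rng.dirichlet(np.ones(5) * 5, 300)    # stand-in softmax outputs
y_pred = y_prob.argmax(axis=1)
acc = (y_pred == y_true).mean()

# Multi-class Brier score: mean squared distance to the one-hot target
onehot = np.eye(5)[y_true]
brier = np.mean(np.sum((y_prob - onehot) ** 2, axis=1))

# Permutation test: p-value of the observed accuracy under shuffled labels
perm_acc = np.array([(y_pred == rng.permutation(y_true)).mean() for _ in range(1000)])
p_value = (np.sum(perm_acc >= acc) + 1) / (len(perm_acc) + 1)

# Bootstrap 95% confidence interval for accuracy
idx = rng.integers(0, len(y_true), size=(1000, len(y_true)))
boot_acc = (y_pred[idx] == y_true[idx]).mean(axis=1)
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])

kappa = cohen_kappa_score(y_true, y_pred)
print(f"brier={brier:.4f} acc={acc:.4f} p={p_value:.4f} "
      f"CI=({ci_low:.4f}, {ci_high:.4f}) kappa={kappa:.4f}")
```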

Together, these complementary measures, probabilistic calibration via the Brier Score, statistical verification via permutation testing, uncertainty quantification via bootstrap confidence intervals, and agreement assessment via Cohen's Kappa, form an in-depth and stringent evaluation scheme. This strategy promotes transparency and reliability in model evaluation and supports the real-world applicability of such classifiers for disease identification in agriculture. To evaluate the calibration of the proposed LeafSightX model, we present the Expected Calibration Error (ECE) bar plot, which illustrates how well the predicted probabilities align with actual outcomes across confidence bins, as shown in Figure 9.


Figure 9. Expected calibration error (ECE) bar plot for the proposed LeafSightX model, showing the alignment between predicted probabilities and actual outcomes across different confidence bins.

Figure 9 presents the Expected Calibration Error (ECE) per bin for the five classes of leaf disease. The x-axis shows the average predicted probability and the y-axis the proportion of positives in each probability bin; the color-coded bars show the distribution of predicted probabilities for each disease group. The dotted diagonal line indicates perfect calibration, that is, predicted probabilities equal to the actual class frequencies. As shown in the figure, most classes are well calibrated, with bars closely following the diagonal line. Nevertheless, Gray spot and Brown spot exhibit calibration flaws, especially in higher-probability bins, suggesting the model is either overconfident or underconfident for these classes. We also include the Confidence Distribution Curve and Maximum-confidence Histogram for the proposed LeafSightX system to further assess the model's confidence across all disease classes. These additions provide more information about the model's behavior and help evaluate and optimize calibration, especially for more challenging classes such as Gray spot and Brown spot.
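Figure 9 plots per-class reliability; the scalar ECE it summarizes is commonly computed from the maximum-probability prediction, as in the sketch below (the bin count is an assumption).

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin samples by top-class confidence and average the gap between
    confidence and accuracy, weighted by bin occupancy."""
    conf = y_prob.max(axis=1)                       # model's confidence per sample
    correct = (y_prob.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():                              # skip empty bins
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```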

Figure 10 shows the Confidence Distribution Curve (left) and the Histogram of Maximum Prediction Confidence (right) for the proposed LeafSightX system. The Confidence Distribution Curve shows that the predicted class probabilities concentrate near zero or one, with only a small number taking moderate values. This suggests that the model tends to give firm predictions, either confident in a class or confident against it, rather than spreading probability across a wide range of values.


Figure 10. Confidence distribution curve and maximum prediction confidence histogram for the proposed LeafSightX system.

This is further supported by the Histogram of Maximum Prediction Confidence, which indicates that most samples have maximum predicted probabilities close to 1. The model is thus typically highly certain about its predictions, with significantly fewer low-confidence samples. These results imply that the model is confident in its predictions, although additional calibration may be needed to better handle less certain cases.

4.7 Ablation study: evaluating the impact of multi-head self attention on LeafSightX and backbone models

In our ablation study, we evaluated the test accuracy of DenseNet201, InceptionV3, and their combination with and without Multi-Head Self-Attention (MHSA). Without MHSA, DenseNet201, InceptionV3, and the DenseNet201 + InceptionV3 fusion achieved test accuracies of 0.9854, 0.9672, and 0.9927, respectively. After integrating MHSA, DenseNet201 and InceptionV3 reached 0.9878 and 0.9830, while the fusion model attained the highest accuracy of 0.9964. These results indicate that MHSA improves the model's ability to focus on relevant spatial features, enhancing classification performance.

The combination of DenseNet201 and InceptionV3 with MHSA outperforms all individual models. The improvement over the fusion model without MHSA highlights the importance of attention mechanisms in identifying key patterns and producing more reliable predictions. These findings confirm LeafSightX's effectiveness in achieving superior accuracy and generalization, establishing it as a robust tool for leaf disease classification and agricultural diagnostics.

4.8 Interpretation of learned features in the fusion backbone using Grad-CAM

The Grad-CAM visualizations for the DenseNet201 and InceptionV3 backbones are shown in Figures 11, 12, respectively, highlighting the spatial regions of leaf images that most influence model predictions. These backbones are the main feature extractors in LeafSightX, and the heatmaps provide insight into the patterns each relies on to identify disease-specific features before fusion, which aids interpretation and reliability.


Figure 11. Grad-CAM visualization for DenseNet201 backbone.


Figure 12. Grad-CAM visualization for InceptionV3 backbone.

Figure 11 indicates that DenseNet201 concentrates on affected regions, including water-soaked blotches, necrotic lesions, and distorted leaf edges. MHSA further refines this focus, making the localized features more semantically precise and relevant. These findings indicate that DenseNet201 captures physiologically significant patterns, consistent with the visual cues used by plant pathologists, and provides valuable input for disease classification.

Similarly, in Figure 12, the Grad-CAM heatmaps of the InceptionV3 backbone highlight critical indicators, including textural changes, fungal plaques, and tissue atrophy. These visualizations show that InceptionV3 effectively identifies disease-relevant regions, capturing precise patterns essential for accurate diagnosis. The backbone demonstrates the ability to focus on disease-specific variations across spatial scales, enhancing interpretability and reliability. Analyzing individual backbones is crucial for understanding and validating the model's decision-making, especially in domains like agriculture and plant pathology.

Although Grad-CAM successfully visualizes the spatial attention of the individual backbones, the feature-fusion stage (concatenation followed by attention and fully connected layers) breaks the spatial alignment between features and the original image, making reliable Grad-CAM heatmaps of the fused output technically infeasible. The visualizations of the individual backbones, however, remain highly interpretable: the highlighted regions consistently align with diseased areas and were judged biologically relevant by plant pathology experts. It is therefore plausible that the fused LeafSightX model captures these salient features at least as well, supporting both its interpretability and its diagnostic accuracy.
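For reproducibility, the following is a minimal Grad-CAM sketch in TensorFlow/Keras, computed per backbone as discussed above. It is not the authors' exact implementation: the target layer name is an assumption (for Keras' DenseNet201, "conv5_block32_concat" is the final convolutional block output), and input preprocessing is omitted.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap of where `model` looks for its prediction."""
    # Map the input to both the chosen conv feature map and the class scores.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add batch axis
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))      # predicted class
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # d(score)/d(features)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # pooled channel weights
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # weighted channel sum
    cam = tf.nn.relu(cam)                                # keep positive evidence
    cam = cam / (tf.reduce_max(cam) + 1e-8)              # normalize to [0, 1]
    return cam.numpy()  # upsample to image size and overlay for display

Applying this function to each backbone separately, with the corresponding final convolutional layer, reproduces the kind of per-backbone heatmaps shown in Figures 11, 12.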

4.9 Comparative results using an additional dataset

Table 17 presents the comparative results of several deep learning architectures evaluated on an additional dataset (Dhar, 2023). All models achieved consistently high accuracy across the training, validation, and test sets, indicating strong generalization. Among them, the proposed LeafSightX model achieved the best overall performance, with a test accuracy of 0.9969 and an F1-score of 0.9970, surpassing all baseline architectures. DenseNet201 and InceptionV3 also performed strongly, with test accuracies above 0.99 and near-perfect AUC and PR-AUC values; however, their marginally lower precision and recall suggest a minor class-prediction imbalance relative to the proposed method. The consistent superiority of LeafSightX across all metrics, including AUC (0.9999) and PR-AUC (1.0000), confirms its robustness and adaptability to unseen data. These results validate the scalability and generalization potential of the proposed framework, showing that it maintains high discriminative power on datasets beyond the primary experimental setup.

Table 17. Comparative performance of deep learning models on the apple leaf disease dataset (additional dataset results).
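For clarity, the sketch below shows how such multi-class metrics are typically computed with scikit-learn. It assumes one-hot ground-truth labels y_true and softmax probabilities y_prob over the five classes, and it mirrors the reported quantities (accuracy, macro F1, one-vs-rest AUC, PR-AUC, and Cohen's Kappa) rather than reproducing the authors' exact evaluation code.

import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             cohen_kappa_score, f1_score, roc_auc_score)

def evaluate(y_true, y_prob):
    """y_true: one-hot array (n, 5); y_prob: softmax scores (n, 5)."""
    labels = y_true.argmax(axis=1)
    preds = y_prob.argmax(axis=1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "kappa": cohen_kappa_score(labels, preds),
        # One-vs-rest ROC AUC, macro-averaged over all five classes.
        "auc": roc_auc_score(labels, y_prob, average="macro",
                             multi_class="ovr"),
        # PR-AUC (average precision) takes the one-hot indicator matrix.
        "pr_auc": average_precision_score(y_true, y_prob, average="macro"),
    }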

4.10 Benchmarking LeafSightX against existing literature

Table 18 compares LeafSightX in detail with existing deep learning models for apple leaf disease prediction. The proposed LeafSightX achieves 99.64% accuracy, surpassing all evaluated approaches, which report accuracies ranging from 66.8% to 99.51%. This result underscores the efficacy of a dual-backbone architecture that integrates DenseNet201 and InceptionV3 with MHSA, enhancing the model's ability to extract both local and global disease features from leaf images. Unlike most prior studies, LeafSightX integrates XAI, ensuring transparent, interpretable predictions. Beyond building trust among end users, its Grad-CAM heatmaps can help agricultural specialists derive actionable insights by highlighting diseased areas of leaves. Moreover, LeafSightX achieves these gains without imposing a significant computational burden, preserving efficient runtime inference and resource use. This balance of high accuracy, interpretability, and modest computational cost makes LeafSightX viable in low-resource agricultural contexts. Consequently, our framework bridges the divide between advanced research and practical application, yielding more user-friendly and dependable tools for diagnosing plant diseases.

Table 18. Comparative analysis of existing models and LeafSightX for apple leaf disease detection.

5 Conclusion

This study presented LeafSightX, a deep learning diagnostic framework that addresses key challenges in the automated diagnosis of apple leaf diseases, including limited generalizability, interpretability issues, and sensitivity to domain shifts. LeafSightX combines features from the DenseNet201 and InceptionV3 backbones using Multi-Head Self-Attention, effectively capturing both fine-grained and high-level spatial information. A robust preprocessing and augmentation pipeline, together with Grad-CAM visualizations for explanation, further enhances the model's reliability and transparency. Experimental results show that LeafSightX delivers outstanding performance, with 99.64% accuracy, an F1-score above 0.996, and perfect AUC and PR-AUC scores, while maintaining low inference latency, making it suitable for real-time field applications. These results surpass multiple baselines and demonstrate the model's consistency across cross-validation splits, as evidenced by a mean Cohen's Kappa of 0.9917 with a standard deviation of 0.0020. In addition, the framework was trained and evaluated on an independent apple leaf disease dataset, achieving a test accuracy of 99.69% and demonstrating its robustness and generalizability. The broader significance of this research lies in its dual focus on predictive accuracy and model interpretability, meeting the growing demand for trustworthy AI in agricultural diagnostics. LeafSightX is a feasible solution, particularly for on-device deployment in under-resourced rural areas. A limitation is that LeafSightX was trained primarily on region-specific datasets, which may limit its generalizability to apple leaf diseases in other geographic regions.

Future work will extend LeafSightX with larger datasets, integrate it into edge computing systems, and adapt it to the temporal dynamics of leaf infections through sequential imaging. These directions aim to make LeafSightX a more intelligent, explainable tool for apple leaf disease management in precision agriculture.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author contributions

MH: Conceptualization, Methodology, Writing – original draft. FF: Software, Visualization, Writing – review & editing. MS: Conceptualization, Formal analysis, Software, Writing – original draft. MA: Formal analysis, Software, Validation, Visualization, Writing – original draft. JU: Conceptualization, Supervision, Writing – review & editing. HA: Funding acquisition, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the Multimedia University, Cyberjaya, Selangor, Malaysia [Grant Number: PostDoc (MMUI/240029)].

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aboelenin, S., Elbasheer, F., Eltoukhy, M., El-Hady, W. M., and Hosny, K. M. (2025). A hybrid framework for plant leaf disease detection and classification using convolutional neural networks and vision transformer. Complex Intell. Syst. 11:142. doi: 10.1007/s40747-024-01764-x

Bonkra, A., Pathak, S., Kaur, A., and Shah, M. A. (2024). Exploring the trend of recognizing apple leaf disease detection through machine learning: a comprehensive analysis using bibliometric techniques. Artif. Intell. Rev. 57:21. doi: 10.1007/s10462-023-10628-8

Cabrefiga, J., Salomon, M. V., and Vilardell, P. (2023). Improvement of Alternaria leaf blotch and fruit spot of apple control through the management of primary inoculum. Microorganisms 11:101. doi: 10.3390/microorganisms11010101

Dayang, P., and Meli, A. S. K. (2021). Evaluation of image segmentation algorithms for plant disease detection. Int. J. Image, Graph. Signal Proc. 12:14. doi: 10.5815/ijigsp.2021.05.02

Dhar, S. (2023). Apple leaf disease classification dataset. Available online at: https://www.kaggle.com/datasets/showravdhar/apple-disease-dataset (Accessed November 1, 2023).

Doutoum, A. S., and Tugrul, B. (2023). A review of leaf diseases detection and classification by deep learning. IEEE Access 11, 119219–119230. doi: 10.1109/ACCESS.2023.3326721

Fadia, R. N., Dwi, R., and Anisa, A. (2019). Risk mitigation of sustainable supply chain for food product based on apple commodity. Russian J. Agric. Soc. Econ. Sci. 96, 60–68. doi: 10.18551/rjoas.2019-12.08

Fu, R., Wang, X., Wang, S., and Sun, H. (2025). PMJDM: a multi-task joint detection model for plant disease identification. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1599671

Gedraite, E. S., and Hadad, M. (2011). "Investigation on the effect of a Gaussian blur in image filtering and segmentation," in Proceedings ELMAR-2011, 393–396.

Gong, X., and Zhang, S. (2023). A high-precision detection method of apple leaf diseases using improved Faster R-CNN. Agriculture 13:240. doi: 10.3390/agriculture13020240

Gudge, S., Tiwari, A., Ratnaparkhe, M., and Jha, P. (2025). On construction of data preprocessing for real-life soyleaf dataset & disease identification using deep learning models. Comput. Biol. Chem. 117:108417. doi: 10.1016/j.compbiolchem.2025.108417

Hussain, S. (2024). Advancing plant health management: challenges, strategies, and implications for global agriculture. Int. J. Agric. Sustain. Dev. 6, 73–89.

Kaggle (2025). Apple tree leaf disease dataset. Available online at: https://www.kaggle.com/datasets/nirmalsankalana/apple-tree-leaf-disease-dataset (Accessed August 8, 2025).

Khan, A. I., Quadri, S., Banday, S., and Shah, J. L. (2022). Deep diagnosis: a real-time apple leaf disease detection system based on deep learning. Comput. Electr. Agric. 198:107093. doi: 10.1016/j.compag.2022.107093

Kumar, T., Brennan, R., Mileo, A., and Bendechache, M. (2024). Image data augmentation approaches: a comprehensive survey and future directions. IEEE Access 12, 187536–187571. doi: 10.1109/ACCESS.2024.3470122

Li, W., Peng, Y., Zhang, M., Ding, L., Hu, H., and Shen, L. (2023). Deep model fusion: a survey. arXiv preprint arXiv:2309.15698.

Liu, B., Huang, X., Sun, L., Wei, X., Ji, Z., and Zhang, H. (2024). MCDCNet: multi-scale constrained deformable convolution network for apple leaf disease detection. Comput. Electr. Agric. 222:109028. doi: 10.1016/j.compag.2024.109028

Liu, Z., and Li, X. (2024). An improved YOLOv5-based apple leaf disease detection method. Sci. Rep. 14:17508. doi: 10.1038/s41598-024-67924-8

Luo, M. (2025). The application of deep learning technology in smart agriculture: lightweight apple leaf disease detection model. Int. J. Simul. Multidisc. Design Optim. 16, 1–16. doi: 10.1051/smdo/2025006

Lv, M., and Su, W. (2024). YOLOv5-CBAM-C3TR: an optimized model based on transformer module and attention mechanism for apple leaf disease detection. Front. Plant Sci. 14:1323301. doi: 10.3389/fpls.2023.1323301

Mamun, A. A., Ray, P. C., Nasib, M. R. U., Das, A., Uddin, J., and Absur, M. N. (2025). "Optimizing deep learning for skin cancer classification: a computationally efficient CNN with minimal accuracy trade-off," in 2025 2nd International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM) (IEEE), 1–6. doi: 10.1109/NCIM65934.2025.11159853

Mikolajczyk, A., and Grochowski, M. (2018). "Data augmentation for improving deep learning in image classification problem," in 2018 International Interdisciplinary PhD Workshop (IIPhDW), 117–122. doi: 10.1109/IIPHDW.2018.8388338

Parashar, N., and Johri, P. (2024). Enhancing apple leaf disease detection: a CNN-based model integrated with image segmentation techniques for precision agriculture. Int. J. Mathem. Eng. Manag. Sci. 9, 943–964. doi: 10.33889/IJMEMS.2024.9.4.050

Pizer, S., Johnston, R., Ericksen, J., Yankaskas, B., and Muller, K. (1990). "Contrast-limited adaptive histogram equalization: speed and effectiveness," in Proceedings of the First Conference on Visualization in Biomedical Computing, 337–345. doi: 10.1109/VBC.1990.109340

Rajput, A. S., Rajput, A. S., and Thakur, S. S. (2024). A novel approach to apple leaf disease detection using neutrosophic logic-integrated EfficientNetB0. Neutros. Sets Syst. 73:17. Available online at: https://digitalrepository.unm.edu/nss_journal/vol73/iss1/17/ (Accessed July 28, 2025).

Rohith, D., Saurabh, P., and Bisen, D. (2025). An integrated approach to apple leaf disease detection: leveraging convolutional neural networks for accurate diagnosis. Multim. Tools Applic. 84, 40307–40342. doi: 10.1007/s11042-025-20735-z

Saho, M. (2025). A comparative study of traditional and modern analytical approaches to ordinal data. PhD thesis, North Dakota State University.

Scientific, L. L. (2025). VGG-DAGSVM for multi-class apple leaf disease detection: a transfer learning approach. J. Theor. Appl. Inf. Technol. 103, 4793–4802. Available online at: https://www.jatit.org/volumes/Vol103No13/24Vol103No13.pdf (Accessed July 28, 2025).

Shafay, M., Hassan, T., Owais, M., Hussain, I., Khawaja, S. G., Seneviratne, L., et al. (2025). Recent advances in plant disease detection: challenges and opportunities. Plant Methods 21:140. doi: 10.1186/s13007-025-01450-0

Sharma, P., Hans, P., and Gupta, S. C. (2020). "Classification of plant leaf diseases using machine learning and image preprocessing techniques," in 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (IEEE), 480–484. doi: 10.1109/Confluence47617.2020.9057889

Upadhyay, A., Chandel, N. S., Singh, K. P., Chakraborty, S. K., Nandede, B. M., Kumar, M., et al. (2025). Deep learning and computer vision in plant disease detection: a comprehensive review of techniques, models, and trends in precision agriculture. Artif. Intell. Rev. 58:92. doi: 10.1007/s10462-024-11100-x

Vurro, M., Bonciani, B., and Vannacci, G. (2010). Emerging infectious diseases of crop plants in developing countries: impact on agriculture and socio-economic consequences. Food Secur. 2, 113–132. doi: 10.1007/s12571-010-0062-7

Wang, G., Sang, W., Xu, F., Gao, Y., Han, Y., and Liu, Q. (2025). An enhanced lightweight model for apple leaf disease detection in complex orchard environments. Front. Plant Sci. 16. doi: 10.3389/fpls.2025.1545875

Zhang, H., Yang, J., Lv, C., Wei, X., Han, H., and Liu, B. (2024). Incremental RPN: hierarchical region proposal network for apple leaf disease detection in natural environments. IEEE/ACM Trans. Comput. Biol. Bioinform. 21, 2418–2431. doi: 10.1109/TCBB.2024.3469178

Zhang, Y., Zhou, G., Chen, A., He, M., Li, J., and Hu, Y. (2023). A precise apple leaf diseases detection using BCTNet under unconstrained environments. Comput. Electr. Agric. 212:108132. doi: 10.1016/j.compag.2023.108132

Keywords: apple leaf disease, convolutional neural networks (CNNs), DenseNet201, explainable artificial intelligence (XAI), InceptionV3, multi-head self-attention (MHSA), plant disease detection, precision agriculture

Citation: Haque ME, Farid FA, Siam MK, Absur MN, Uddin J and Abdul Karim H (2026) LeafSightX: an explainable attention-enhanced CNN fusion model for apple leaf disease identification. Front. Artif. Intell. 8:1689865. doi: 10.3389/frai.2025.1689865

Received: 21 August 2025; Revised: 17 December 2025; Accepted: 22 December 2025;
Published: 30 January 2026.

Edited by:

Aalt-Jan Van Dijk, University of Amsterdam, Netherlands

Reviewed by:

Prashanta Kumar Patra, Siksha O Anusandhan University, India
Nitalaksheswara Kolukula, Gandhi Institute of Technology and Management, India

Copyright © 2026 Haque, Farid, Siam, Absur, Uddin and Abdul Karim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jia Uddin, jia.uddin@wsu.ac.kr; Hezerul Abdul Karim, hezerul@mmu.edu.my
