
ORIGINAL RESEARCH article

Front. Plant Sci., 05 December 2025

Sec. Sustainable and Intelligent Phytoprotection

Volume 16 - 2025 | https://doi.org/10.3389/fpls.2025.1704663

This article is part of the Research Topic: AI-Driven Plant Intelligence: Bridging Multimodal Sensing, Adaptive Learning, and Ecological Sustainability in Precision Plant Protection.

Detection techniques for tomato diseases under non-stationary climatic conditions

Zhenzhen Wu, Jiao Han, Shiyu Wang, Xiangwei Meng and Rui Fu*
  • Shandong Facility Horticulture Bioengineering Research Center, Weifang University of Science and Technology, Weifang, Shandong, China

Tomato growth is highly susceptible to diseases, making accurate identification crucial for timely intervention. While deep learning models like the YOLO family have demonstrated success in detecting diseases in agricultural settings, they typically assume that training and testing data are independently and identically distributed (i.i.d.), which often does not hold in real-world scenarios. When pre-trained models are applied to new environments, performance can degrade due to domain shifts. To address this, we propose CTTA-DisDet, a continuous test-time domain adaptation framework for tomato disease detection that adapts models to evolving environments during testing, improving generalization in unseen domains. CTTA-DisDet utilizes a teacher-student architecture where both models share the same structure. Dynamic data augmentation is introduced, involving explicit and implicit augmentations. Explicit augmentation corrupts input images, while implicit augmentation uses large language models (LLMs) to generate new domain data. The teacher model learns generalized knowledge, and the student model mimics the teacher to distill domain-specific information. During testing, pseudo-labels generated by the teacher update the student model. To prevent catastrophic forgetting, a subset of neurons is randomly restored to their original weights during each test-time iteration. The teacher model is continuously updated via exponential moving average (EMA). Experimental results demonstrate that CTTA-DisDet achieves an average mAP50 of 67.9% in continuously changing cross-domain environments, significantly benefiting practical applications in non-stationary settings.

1 Introduction

The escalating global population and climate shifts pose formidable hurdles for crop cultivation. Plant diseases, once contracted, can swiftly propagate and inflict heavy economic losses. Tomatoes, widely cultivated across temperate and tropical regions, are a vital food crop due to their high nutritional and economic value. However, like many crops, they are vulnerable to various diseases during growth (Benelli et al., 2020), which can drastically reduce yields if not detected in time (Liu and Wang, 2020).

Traditional early detection often relies on manual inspection by experts, which is time-consuming and error-prone. In recent years, machine learning and computer vision have emerged as effective tools for identifying plant diseases through image analysis, offering a faster and more accurate alternative.

Several studies have explored these approaches. For instance, Lu et al. (2023) combined spectral and image data to detect cotton yellow wilt disease, while Zhou et al. (2024) and Duan et al. (2024) proposed image-text retrieval and hyperspectral learning frameworks for rice and chili diseases. Other efforts, such as Triki et al. (2023) and Gao et al. (2023), applied simulation and remote sensing technologies to improve recognition performance. Although these works achieved notable progress, they primarily capture basic visual traits and still struggle to represent complex, fine-grained disease patterns, limiting their practical applicability.

In computer vision, deep learning is transforming disease detection, as evidenced by promising findings in Khalid et al. (2023) and Johnson et al. (2021). YOLO-based detectors and other neural network architectures have demonstrated notable advantages in agricultural applications, offering both high accuracy and real-time inference (Abade et al., 2021; Zheng et al., 2022). For example, MG-YOLO (Li et al., 2023) improves spore detection in cases of blur, dense clustering, and irregular morphology, achieving a 6.8% improvement over YOLOv5. SeptoSympto (Mathieu et al., 2024) leverages a U-Net and YOLO hybrid to quantify wheat leaf necrosis, while Torres-Lomas et al. (2024) introduced an image–text paired model for tomato disease diagnosis without manual annotation. Transformer-based segmentation approaches, such as Li et al. (2024a), have further enhanced feature extraction for rice leaf disease localization. Beyond detection, several studies explore model robustness, transferability, and fine-grained segmentation. Li et al. (2024b) evaluated multiple pre-trained CNNs for plant disease diagnosis under transfer learning, while MC-UNet (Deng et al., 2023) integrates multi-scale fusion for tomato disease segmentation. DC2Net (Feng et al., 2024) incorporates deformable and dilated convolutions for Asian soybean rust detection, setting new performance benchmarks. Weakly supervised and few-shot segmentation has also been explored in Zhou et al. (2023), while Tang et al. (2023) employed proximity feature aggregation to handle inter-class similarity in custom datasets. Additionally, Dong et al. (2023) released a comprehensive set of pre-trained models for plant disease recognition, facilitating downstream applications. Temporal monitoring strategies such as Tschurr et al. (2023) utilized high-resolution time-series imagery for canopy frost assessment, and Xu et al. (2023) demonstrated the effectiveness of hyperspectral and RGB-based deep learning models for tea coal smut classification, outperforming traditional baselines such as SVM and showcasing the potential of multimodal learning.

In particular, You Only Look Once (YOLO) is an advanced real-time object detection system. It processes images at 30 FPS on a Pascal Titan X and achieves a mean average precision (mAP) of 57.9% on the COCO test-dev dataset. Recent research closely related to crop diseases includes several advancements (Zhu et al., 2023; Qi et al., 2023; Dong et al., 2024). Dai et al. (2023) enhanced the YOLOv5m model by incorporating Swin Transformer (SWinTR) and Transformer (C3TR) mechanisms for plant pest detection. Tests showed 95.7% accuracy, 93.1% recall, an F2 score of 94.38%, and an mAP of 96.4%, outperforming the original YOLOv3, YOLOv4, and YOLOv5m models. Ye et al. (2023) enhanced the YOLOv5s algorithm for detecting pine wilt disease (PWD), creating the PWD-YOLO model. It offers superior efficiency with a compact size of 2.7 MB, 3.5 GFLOPs complexity, 1.09 MB parameters, and processes 98.0 frames/s on their tailored dataset.

The prevailing research in disease recognition tends to concentrate on static scenarios, relying on image-based identification under controlled settings. This method, albeit beneficial, is fundamentally constrained by the i.i.d. data assumption. As illustrated in Figure 1, such static modeling fails to maintain accuracy under real-world dynamic agricultural environments, motivating the need for adaptive frameworks like CTTA-DisDet.


Figure 1. In dynamic real-world settings (e.g., varying weather), our target model starts with a pre-trained source network. It updates in real-time using current data without needing source data, overcoming issues like error buildup and forgetting to maintain performance over time. Our approach ensures sustained adaptation in fluctuating environments.

Agricultural dynamics are strongly shaped by environmental fluctuations and disease progression, which cause data distribution shifts that degrade model accuracy. Variability in symptom appearance under different weather conditions often leads to misclassification when models are trained on static datasets. For example, an atypical blight strain in a tomato plantation may be overlooked by a non-adaptive model, delaying treatment and increasing losses. Frameworks like CTTA-DisDet are designed to remain adaptive at inference time, thereby supporting timely and reliable diagnosis in real-world agricultural environments.

As machine learning evolves, researchers have increasingly focused on addressing such distribution gaps through test-time adaptation (TTA), which dynamically refines the model on the target domain without requiring access to source data (Chi et al., 2021; Varsavsky et al., 2020; Sun et al., 2020; Wang et al., 2020). Recent studies have validated the effectiveness of TTA across multiple vision tasks: Shanmugam et al. (2021) and Xu et al. (2022) show that improved aggregation strategies during testing can significantly enhance classification robustness, while Cohen et al. (2023) demonstrates that diverse unsupervised augmentations reduce false positives in anomaly detection. In agriculture, Saadati et al. (2024) applied TTA to pest classification and incorporated OOD awareness to avoid unreliable predictions on unfamiliar species, and Cui et al. (2023) proposed a teacher–student TTA scheme based on domain-augmented training to improve cross-domain generalization.

In this work, to handle the challenges of tomato disease detection in non-static scenarios, we draw inspiration from the recent progress in TTA and knowledge distillation. We further take into account the dynamic nature of real-world tomato plantations, where environmental factors such as weather, pest outbreaks, and disease progression constantly introduce domain shifts, causing each target-domain sample to exhibit different visual characteristics.

To address this issue, we propose a novel continuous test-time adaptation framework, CTTA-DisDet, which enables the model to adapt online during inference, thereby improving generalization in unseen environments. Specifically, CTTA-DisDet adopts a teacher–student design with identical network architectures. Both models are based on YOLOv9, which provides a favorable balance between detection accuracy and computational cost in practical deployment; meanwhile, the framework itself is model-agnostic and can be readily extended to other detectors such as YOLOv10, Fast R-CNN, or SSD.

Unlike the student model, which directly processes the original images, the teacher model is trained on data generated by our Dynamic Data Augmentation (DDA) pipeline. Following recent advances in knowledge distillation, DDA incorporates both explicit and implicit augmentations: the explicit augmenter generates M = K × S corruption-based variants per image using K transformation strategies and S fixed perturbation levels, while the implicit augmenter leverages multimodal LLMs [e.g., GPT4o (Islam and Moushi, 2024)] to synthesize additional domain-diverse samples guided by textual prompts. A fidelity classifier further filters inconsistent or irrelevant samples to ensure augmentation quality.

Notably, by combining fixed explicit perturbations with flexible LLM-driven implicit augmentations, CTTA-DisDet benefits from both controlled domain expansion and realistic distribution shifts, enabling effective test-time adaptation and improved robustness in previously unseen agricultural environments.

In summary, this work introduces CTTA-DisDet, a state-of-the-art solution for tomato disease detection under non-stationary conditions, with the following key contributions:

● We propose a continuous test-time domain adaptation framework (CTTA-DisDet) that allows integration with mainstream YOLO detection models. This flexibility ensures that CTTA-DisDet can accommodate different levels of computational resources and performance requirements.

● CTTA-DisDet features a dynamic data augmentation (DDA) module that encompasses explicit and implicit augmenters. The former creates varied image corruptions to simulate environmental perturbations, while the latter uses LLMs to synthesize new domain-relevant data. This comprehensive augmentation enriches the model’s training, enhancing its adaptability to diverse and changing conditions.

● By employing a teacher-student configuration, CTTA-DisDet leverages knowledge distillation to transfer cross-domain generalization capabilities. The teacher model, trained on augmented data, imparts its learned knowledge to the student model, which is then fine-tuned for the specific target domain, ensuring improved accuracy in detecting diseases across different environmental contexts.

● Through rigorous experimental validation, we demonstrate that CTTA-DisDet achieves an average mAP50 of 67.9% in continuously changing cross-domain environments, outperforming conventional models. This significant enhancement in performance not only underscores the framework’s technical superiority but also highlights its practical applicability, offering a concrete advantage for agricultural stakeholders by facilitating more precise and timely disease detection and management.

2 Materials and methods

2.1 Overview

To achieve the goal of tomato disease detection in non-stationary environments, we propose a novel continuous test-time domain adaptation framework (CTTA-DisDet) that combines the advantages of both knowledge distillation and dynamic data augmentation. Compared with the vanilla YOLO, our proposed CTTA-DisDet consists of two innovative components: dynamic data augmentation (DDA) and continuous test-time domain adaptation (CTTA). The proposed CTTA-DisDet framework is illustrated in Figure 2.


Figure 2. Illustration of the proposed continuous test-time domain adaptation framework (CTTA-DisDet). The backbone follows the network structure of the YOLOv9 model; in addition, the framework includes a knowledge distillation architecture with a teacher model θ_T and a student model θ_S. Moreover, the dynamic data augmentation (DDA) generates augmented images that adapt to changing environmental conditions, thereby enhancing the model’s generalization ability in unseen domains. We note that the training phase of CTTA-DisDet is similar to the conventional training process; by contrast, at test time, the model continuously adapts to the changing environment, improving its performance in non-stationary scenarios. This continual adaptation is achieved through the teacher-student architecture, where the teacher model θ_T learns generalized knowledge from augmented images and the student model θ_S distills domain-specific information.

In this framework, the teacher and student models are constructed based on the same YOLOv9-s backbone and neck to ensure architectural consistency. No additional attention modules are introduced, and parameters are fully shared across the feature-extraction layers, while the detection heads are updated independently during the adaptation process. This design allows the teacher to provide stable supervision through EMA updates and the student to adapt flexibly to domain shifts in real time.
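For illustration, the teacher-student setup described above could be initialized as in the following PyTorch sketch (ours, not the authors' released code); the use of plain SGD for the student is an assumption that matches the single-step update of Equation 11, and all helper names are hypothetical.

```python
import copy

import torch
from torch import nn


def init_teacher_student(source_model: nn.Module, lr: float = 1e-3):
    """Clone the source-pretrained detector theta^(0) into a student
    (updated by gradients at test time) and a teacher (updated only by EMA)."""
    student = copy.deepcopy(source_model)        # theta_S^(0)
    teacher = copy.deepcopy(source_model)        # theta_T^(0)
    for p in teacher.parameters():
        p.requires_grad_(False)                  # teacher receives no gradients
    # plain gradient descent matches the single-step update of Eq. 11
    optimizer = torch.optim.SGD(student.parameters(), lr=lr)
    return student, teacher, optimizer
```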

Recent advances have shown that data augmentation is a powerful technique for enhancing the generalization ability of deep learning models. However, existing data augmentation strategies are static and do not adapt to changing environments. In this work, we propose a dynamic data augmentation (DDA) scheme that combines explicit and implicit augmentations (A_exp, A_imp) and can dynamically generate augmented images based on the current environmental conditions. A_exp follows the standard data augmentation pipeline and produces 16 augmentations of each input image using 4 corruption strategies (Gaussian noise, Brightness adjustment, Pixel loss, and Motion blur) with 4 fixed levels. A_imp is built on a current cutting-edge text-to-image generative model [e.g., GPT-4o (Islam and Moushi, 2024)] to generate images with greater expressiveness and diversity from text descriptions and the original image. In our work, we use 4 environmental conditions (Foggy day, Dark night, Rainy day, Snowy day) and 4 levels each to construct the textual prompts and generate 32 stylized images. The implicit augmenter is followed by a binary classifier that filters out generated images that are not relevant to the original image; the remaining 16 images are retained, and generation is repeated until sufficient stylized images are produced.

In addition to DDA, the proposed CTTA-DisDet also employs a teacher–student architecture to facilitate knowledge distillation, where the teacher θ_T and the student θ_S share the same network architecture derived from the YOLOv9 model θ^{(0)} (Wang and Liao, 2024). Fed with the augmented images generated by DDA, θ_T is trained to learn generalized knowledge. Then, the student model θ_S is trained to mimic the teacher using the proposed continuous test-time domain adaptation (Chi et al., 2021; Varsavsky et al., 2020), thereby distilling domain-specific information. During the test phase, for the t-th given image, the teacher model θ_T^{(t-1)} produces a pseudo-label, which is then used to obtain the adapted student model θ_S^{(t)} from θ_S^{(t-1)} through an additional gradient descent step. Moreover, to prevent catastrophic forgetting, a small subset of the student model's neurons is randomly restored to the original pre-trained weights θ^{(0)} in each test-time iteration. This random restoration mechanism ensures that both the teacher and the student retain the original knowledge while adapting to the new domain. After the adaptation, the adapted student model θ_S^{(t)} produces the final prediction via a single forward pass. Then, the updated teacher model θ_T^{(t)} is obtained using an exponential moving average (EMA) (Klinker, 2011) of the previous teacher parameters θ_T^{(t-1)} and the adapted student parameters θ_S^{(t)}.

2.2 Dynamic data augmentation

Recent advancements have demonstrated that data augmentation is a powerful technique for enhancing the generalization ability of deep learning models (Zhang et al., 2022; Fleuret, 2021; Chen et al., 2023; Wang et al., 2022). Inspired by this success, we propose a dynamic data augmentation (DDA) scheme that adapts to changing environments and generates augmented images suited to the current environmental conditions. DDA consists of an explicit and an implicit augmenter (A_exp, A_imp), which are designed to enrich the training data and distill cross-domain generalized information, as shown in Figure 3. Next, we introduce the explicit and implicit augmenters in detail.


Figure 3. The proposed dynamic data augmentation (DDA), which involves both an explicit and an implicit augmenter (A_exp, A_imp). The former generates corrupted counterparts of input images, while the latter leverages large language models (LLMs) to produce new-domain data. The augmented images are then fed into the teacher model to learn cross-domain generalized knowledge, and the student model is subsequently trained to mimic the teacher, thereby distilling domain-specific information and improving the model’s generalization ability in unseen domains.

Explicit Augmenter A_exp enhances model generalization through various image transformations, including geometric transformations, brightness adjustments, noise addition, and pixel-level transformations, which have been proven effective in numerous related studies. We note that A_exp is parameter-free and is designed to generate 16 augmented images for each input image. Specifically, we select 4 common explicit augmentation methods, Gaussian noise, Brightness adjustment, Pixel loss, and Motion blur, each with 4 different parameter settings, to generate a more diverse set of samples. The explicit augmenter is designed to simulate common image variations in real-world scenarios and to improve the model’s robustness in practical applications. These four explicit augmentation methods were selected because they cover common image variation scenarios and are easy to implement and control.

These methods are intuitive, easy to implement, and able to introduce controlled uncertainty and difficulty, which effectively enriches the data samples and prompts the model to learn more comprehensive and robust feature representations, thereby improving the accuracy and generalization ability of pest and disease recognition and detection. The four mathematical models of the explicit augmenter are as follows:

(1) Gaussian Noise. To emulate sensor imperfections and environmental interference, Gaussian noise is added to each pixel value of the input image. This augmentation helps the model learn invariant representations and suppress overfitting to clean data. For a given image x, the augmented versions x(i) are generated by the Equation 1:

$q(x^{(i)} \mid x) = \mathcal{N}(x^{(i)};\, x,\, \sigma_i^{2} I),$    (1)

where N(x^{(i)}; x, σ_i^2 I) denotes a Gaussian distribution centered at the original image x with a covariance matrix given by the variance σ_i^2 times the identity matrix I. The augmented image x^{(i)} is thus perturbed by Gaussian noise with standard deviation σ_i. In our experiments, we selected four distinct standard deviations, σ = [5, 10, 15, 20], to explore the effect of varying noise levels on model performance.

(2) Brightness Adjustment. Brightness variations due to diverse lighting conditions can impede object recognition in images. To mitigate this, we apply brightness adjustment to simulate real-world lighting and improve model adaptability. The process adjusts the image intensity by scaling the original image x with a factor α and adding a bias b, as defined in Equation 2:

$x^{(i)} = x \times \alpha + b$    (2)

The scaling factor α, uniformly distributed between 0.8 and 1.2, modulates image brightness, while the bias term b, ranging from -0.5 to 0.5, fine-tunes color balance, collectively enhancing model performance across varying illumination.

(3) Pixel Loss. In dynamic environments, occlusions among leaves lead to information loss, which we emulate through pixel loss augmentation. This technique involves setting random pixels to zero or replacing them with the average of neighboring pixels, mimicking real-world occlusions and prompting the model to concentrate on salient features. Let m^{(i)} ∈ {0, 1} denote a binary mask with a uniform loss rate α; the corrupted image x^{(i)} is then generated by element-wise multiplication, as formalized in Equation 3:

$x^{(i)} = x \odot m^{(i)},$    (3)

where ⊙ denotes the Hadamard product. The binary mask m^{(i)} is generated by setting each pixel to zero with probability α_i and to one otherwise. We set α = [0.1, 0.15, 0.2, 0.25] to create varying levels of pixel loss, enabling the model to learn robust features in the presence of occlusions.

(4) Motion Blur. This augmentation technique introduces a level of blur that mimics the effect of relative motion between the camera and the subject. Our augmenter employs a range of kernel sizes to generate varying degrees of blur: small kernels (1 × 1, 3 × 3) for a subtle effect, medium kernels (5 × 5, 7 × 7) for moderate blurring, and large kernels (9 × 9 to 15 × 15) for a pronounced blur effect. This approach enables the model to learn from a diverse set of scenarios, improving its generalization capabilities. The mathematical formulation of motion blur is encapsulated in the convolution integral in Equation 4:

$x^{(i)} = \iint h(u,v)\, x(i-u,\, j-v)\, du\, dv.$    (4)

Here, h(u, v) denotes the point spread function (PSF), which characterizes the blur induced by motion. x^{(i)} is the resulting intensity at pixel coordinates (i, j), and x(i − u, j − v) represents the original image’s grayscale values at the shifted coordinates. By analyzing h(u, v), we can understand the impact of motion blur and potentially apply de-blurring techniques to restore image clarity.

After applying the parameter-free explicit augmenter A_exp with its 4 corruption strategies and 4 fixed levels, we obtain 16 augmented images for each input image. To distinguish them clearly from the outputs of the implicit augmenter A_imp, we hereafter denote the explicit augmented images as {x_exp^{(i)}}_{i=1}^{16}, where x^{(i)} represents the i-th augmentation, and the implicit augmented images as {x_imp^{(i)}}_{i=1}^{16}. We note that A_exp is effective in simulating common image variations in real-world scenarios and in improving the model’s robustness in practical applications.
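To make the explicit pipeline concrete, the following sketch (ours, not the authors' released code) produces the 16 corrupted variants described above using NumPy and OpenCV; the specific brightness (α, b) pairs and blur kernel sizes are illustrative choices taken from the stated ranges.

```python
import numpy as np
import cv2  # used only for the motion-blur convolution


def explicit_augment(img: np.ndarray) -> list:
    """Generate 16 corrupted variants (4 strategies x 4 levels) of an HxWx3 uint8 image."""
    out = []
    x = img.astype(np.float32)

    # (1) Gaussian noise with sigma in {5, 10, 15, 20} (Eq. 1)
    for sigma in (5, 10, 15, 20):
        out.append(np.clip(x + np.random.normal(0.0, sigma, x.shape), 0, 255).astype(np.uint8))

    # (2) Brightness adjustment x' = alpha * x + b (Eq. 2); bias expressed on the 0-255 scale
    for alpha, b in ((0.8, -0.3), (0.9, -0.1), (1.1, 0.1), (1.2, 0.3)):
        out.append(np.clip(alpha * x + b * 255.0, 0, 255).astype(np.uint8))

    # (3) Pixel loss: zero out pixels with probability alpha (Eq. 3)
    for rate in (0.10, 0.15, 0.20, 0.25):
        mask = (np.random.rand(*x.shape[:2]) > rate).astype(np.float32)[..., None]
        out.append((x * mask).astype(np.uint8))

    # (4) Motion blur with increasing kernel size (Eq. 4), using a horizontal PSF
    for k in (3, 5, 9, 15):
        kernel = np.zeros((k, k), np.float32)
        kernel[k // 2, :] = 1.0 / k
        out.append(cv2.filter2D(img, -1, kernel))

    return out
```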

Implicit Augmenter A_imp represents an advanced data augmentation technique that enhances model generalization by introducing training uncertainty. This implicit augmentation method leverages large language models (LLMs) to generate images from descriptive text, significantly improving the naturalness and realism of the visual content. It surpasses the limitations of traditional tools such as OpenCV, which often fall short in simulating complex physical conditions and offer limited parameter flexibility. By integrating LLMs, we create a diverse set of cross-domain augmented images that reflect various weather conditions, environments, and visibility levels, leading to feature generalization prior to data input. We focus on four common environmental conditions, Foggy, Dark, Rainy, and Snowy days, each with four distinct parameter settings, totaling 16 textual prompts. These conditions are chosen for their prevalence and representativeness in real-life scenarios, as well as their potential to simulate complex weather phenomena, preparing the model for a broad range of dynamic environmental changes and enhancing its recognition efficiency in unseen domains. Each text prompt generates 2 augmented images, so the 16 prompts yield 32 implicit augmentations, which are then evaluated by a fidelity classifier that retains high-assurance images and discards those with low fidelity, keeping only the 16 higher-quality images {x_imp^{(i)}}_{i=1}^{16}. This binary classifier ensures that only reasonable images are retained, avoiding the impact of extremely unreasonable results, and fortifies the model’s generalization capacity by ensuring the generated images adhere to the expected features.

(1) Foggy day. To achieve diverse foggy weather effects, we utilize a textual template with four levels of fog intensity. Based on the original input image, the LLM generates augmented images that simulate these foggy conditions. The foggy day template is as follows: Add a [level] fog effect to the given image, without any changes to the other context. Concretely, the four levels of fog intensity are: very slight, light, moderate, and heavy. Since each prompt yields two images, the LLM produces eight distinct foggy-day variants, each with unique characteristics.

● Add a very slight fog effect to the given image, without any context changes.

● Add a light fog effect to the given image, without any context changes.

● Add a moderate fog effect to the given image, without any context changes.

● Add a heavy fog effect to the given image, without any context changes.

(2) Dark night. To imbue images with a realistic and varied nocturnal ambiance, we employ 4 prompts tailored for nighttime effects. These strategies are designed to capture the nuances of darkness that can occur in real-world scenarios. Our approach utilizes Large Language Models (LLMs) with a specific template request: “Generate different levels dark night effect to the given image”. The four levels of darkness are: slight, half, grey, and deep, as detailed below:

● Generate slight dark night effect to the given image.

● Generate half dark night effect to the given image.

● Generate grey dark night effect to the given image.

● Generate deep dark night effect to the given image.

(3) Rainy day. For this case, the textual template is: Generate different levels raining effect to the given image. The model is prompted to generate four different levels of raining effect for the given image. The specific levels of raining effect are: drizzle, moderate, heavy, and thunderstorm, as described below:

● Generate drizzle raining effect to the given image.

● Generate moderate raining effect to the given image.

● Generate heavy raining effect to the given image.

● Generate thunderstorm raining effect to the given image.

(4) Snowy day. We utilize 4 text prompts to generate augmented images with various snowy weather effects. The template of LLMs is: Generate different levels snow effect to the given image. The four levels of snowy day are: light snow, moderate snow, heavy snow, and blizzard, as detailed below:

● Generate light snow effect to the given image.

● Generate moderate snow effect to the given image.

● Generate heavy snow effect to the given image.

● Generate blizzard snow effect to the given image.

We note that these structured textual prompts account for the diversity of weather conditions and the need for the model to adapt to various environmental changes. The model is expected to generate a variety of foggy, dark, rainy, and snowy images, each with its own unique characteristics and style. Moreover, for each prompt we generate two augmented images, resulting in a total of N = 32 augmented images from the 16 textual prompts.

We chose GPT-4o for its multi-modal text-to-image capability that supports fine-grained prompt control (e.g., fog density levels), unlike StyleGAN which requires domain-specific training. This capability allows flexible adaptation to agricultural scenes and ensures semantic alignment between generated and real images.
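The prompt construction for the implicit augmenter can be summarized by the short sketch below (ours); the call to the text-and-image-conditioned generator is left as a placeholder `generate_fn`, since the paper does not specify the exact API used to query GPT-4o.

```python
PROMPT_TEMPLATES = {
    "fog":   "Add a {level} fog effect to the given image, without any context changes.",
    "night": "Generate {level} dark night effect to the given image.",
    "rain":  "Generate {level} raining effect to the given image.",
    "snow":  "Generate {level} snow effect to the given image.",
}

LEVELS = {
    "fog":   ["very slight", "light", "moderate", "heavy"],
    "night": ["slight", "half", "grey", "deep"],
    "rain":  ["drizzle", "moderate", "heavy", "thunderstorm"],
    "snow":  ["light", "moderate", "heavy", "blizzard"],
}


def build_prompts() -> list:
    """Return the 4 conditions x 4 levels = 16 textual prompts used by A_imp."""
    return [PROMPT_TEMPLATES[c].format(level=lvl) for c in PROMPT_TEMPLATES for lvl in LEVELS[c]]


def implicit_augment(image_path: str, generate_fn) -> list:
    """Produce two stylized images per prompt (32 in total) for one input image.
    `generate_fn(image_path, prompt)` is a placeholder for the generator call."""
    images = []
    for prompt in build_prompts():
        images.extend(generate_fn(image_path, prompt) for _ in range(2))
    return images
```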

2.2.1 Fidelity classifier

To filter out generated images that are not relevant to the original image, we introduce a binary classifier to evaluate the fidelity of the generated images. The classifier is trained to distinguish between high-fidelity and low-fidelity images, ensuring that only high-quality images are retained for model training. The classifier is implemented using a pre-trained ResNet-19 model, which is fine-tuned on our dataset. After the augmentations are generated by the LLMs, the classifier eliminates scenes with suboptimal effects. For example, if a generated rainy scene is overly smooth and does not conform to the characteristics expected of a real rainy scene, the classifier can identify and remove such an ineffective sample, retaining rainy-scene images that better meet the requirements and have greater authenticity and diversity. If fewer than 16 generated implicit images pass the classifier, a new round of image generation is performed until the full set of 32 augmented images (16 explicit and 16 implicit) is complete.

Given the true label y_i of the i-th generated sample (a binary category label) and the probability q_i predicted by the model that the sample is high-fidelity, the loss function is formally expressed in Equation 5:

$\mathcal{L}_{\mathrm{fidelity}} = -\sum_{i=1}^{N} \left[ y_i \log q_i + (1 - y_i) \log(1 - q_i) \right]$    (5)

Here, N = 32 is the number of augmented samples, y_i represents the true label of the i-th sample, and q_i represents the probability that the model assigns to the sample being high-fidelity. By minimizing this cross-entropy loss, the model’s predictions are brought closer to the real situation. The 16 fixed explicit augmented versions, together with the 16 new-domain images generated by the implicit augmenter, provide rich data sources for model training. In this way, the model can not only learn generalized knowledge from the augmented domains but also perform personalized processing according to the characteristics of different input samples, showing higher recognition accuracy and robustness in practical applications.
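As a concrete illustration, the fidelity filter could be implemented as in the following PyTorch sketch (ours, not the authors' released code); a torchvision ResNet-18 backbone is used here as a stand-in for the ResNet model described above, with a single-logit head and the binary cross-entropy of Equation 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights


class FidelityClassifier(nn.Module):
    """Binary classifier scoring each generated image as high-fidelity (1) or low-fidelity (0)."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # single logit
        self.net = backbone

    def forward(self, x):                                # x: (N, 3, H, W)
        return torch.sigmoid(self.net(x)).squeeze(1)     # q_i in (0, 1)


def fidelity_loss(q, y):
    """Binary cross-entropy of Eq. 5 over the N = 32 augmented samples."""
    return F.binary_cross_entropy(q, y.float())


@torch.no_grad()
def filter_augmentations(model, images, keep=16):
    """Retain the `keep` highest-scoring generated images (images: tensor of shape (N, 3, H, W))."""
    scores = model(images)
    idx = torch.argsort(scores, descending=True)[:keep]
    return images[idx]
```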

2.3 Continuous test-time domain adaptation

In a dynamic environment T, our objective is to engineer a model capable of adapting to ongoing changes for effective object detection (Cui et al., 2023; Zhang et al., 2022). To this end, we employ the target data x_T^{(i)}, where 0 < i < |T|, to fine-tune the base YOLO model, denoted θ^{(0)}, using our CTTA-DisDet. Our framework enables the base detection model to evolve in response to environmental shifts without reliance on the original datasets, achieving improvement through ongoing adjustments during the inference phase. During test-time adaptation, the model incrementally adapts to incoming target domain data x_T^{(t)}, with the CTTA-DisDet framework generating the adapted models θ_S^{(t)} and θ_T^{(t)} for each image t. This process facilitates more accurate predictions for each subsequent target data point x_T^{(t)}.

In agriculture, the environment around crops is constantly evolving due to changes in geographical location, climatic conditions, and time. In this context, the model needs to make accurate perceptual decisions in real time and quickly adapt to these changes. At test time, the model has not previously seen the target domain data X_T. The target domain is presented sample by sample in time-series order, which means that at any given time point t, the model can only access the target domain data available at that moment. Based on the target domain data x_T^{(t)}, θ_S must make an immediate prediction and adjust itself at the same time, so as to better cope with the new data x^{(t+1)} that may appear later.

We introduce an adaptive strategy designed for real-time model adaptability within a fluctuating target domain, as illustrated in Figure 4. This strategy leverages a pre-trained model that is equipped to respond swiftly to domain changes through a dynamic adjustment mechanism. To address the potential issue of error accumulation during self-training, we employ knowledge distillation and pseudo-labeling techniques. For each test sample x^{(t)}, the proposed DDA generates the explicit and implicit augmented images {x_exp^{(i)}}_{i=1}^{16} and {x_imp^{(i)}}_{i=1}^{16}. During testing, 8 images are randomly sampled from these 32 augmentations (denoted as a subset S) and fed into the teacher model θ_T. The teacher model θ_T then generates the pseudo-label ŷ_T^{(t)} as shown in Equation 6:


Figure 4. In the continuous test-time domain adaptation diagram, the teacher model θ_T is used to generate pseudo-labels, and the student model θ_S is used to adapt to the new domain. ⊙ denotes element-wise multiplication and ⊕ denotes addition.

$\hat{y}_T^{(t)} = \frac{1}{8} \sum_{i \in S} \theta_T^{(t-1)}\big(x_{\mathrm{aug}}^{(i)}\big),$    (6)

We note that this scheme modulates the pseudo-labels, mitigating error propagation and enhancing prediction accuracy. Once the pseudo-label is obtained, the student model θ_S is updated by minimizing the cross-entropy loss between its prediction and the pseudo-label, yielding θ_S^{(t)} from θ_S^{(t-1)}. Then, we utilize the exponential moving average (EMA) to update the teacher model and obtain θ_T^{(t)} (Wang et al., 2022; Klinker, 2011). Furthermore, to prevent catastrophic forgetting when the model adapts to new target domains, we incorporate a random restoration mechanism based on the base model θ^{(0)}. This approach ensures that the model retains essential knowledge from its initial training, integrating this knowledge at strategic points during the adaptation process. Hence, it maintains its foundational expertise while accommodating new domain-specific insights.
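For concreteness, a minimal sketch of the pseudo-label averaging in Equation 6 is given below (names are ours); it assumes the teacher returns a dense prediction tensor of fixed shape for every augmented view, so that averaging can be done before any post-processing such as non-maximum suppression.

```python
import random

import torch


@torch.no_grad()
def make_pseudo_label(teacher, augmentations, k=8):
    """Eq. 6: average the teacher's dense predictions over k views randomly
    sampled from the 32 DDA variants of the current test image."""
    subset = random.sample(augmentations, k)
    preds = torch.stack([teacher(view) for view in subset], dim=0)
    return preds.mean(dim=0)        # pseudo-label y_hat_T^(t)
```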

Unlike TENT and CoTTA which perform single-step adaptation, CTTA-DisDet introduces a dual-loop mechanism combining EMA-updated teacher and random-restored student for continuous adaptation over time, effectively mitigating catastrophic forgetting and error accumulation. This dynamic frequency adjustment of EMA and stochastic recovery ratio distinguishes our framework from prior TTA approaches.

2.3.1 Continual test-time adaptation

The student model θ_S^{(t)} refines its predictions by minimizing the loss relative to the teacher model.

The classification loss L_cls^{(t)} quantifies the categorical discrepancy between the student and teacher model predictions, as defined in Equation 7:

$\mathcal{L}_{\mathrm{cls}}^{(t)} = -\sum_{c} \hat{p}_{\mathrm{cls}}^{(t)} \log \hat{q}_{\mathrm{cls}}^{(t)},$    (7)

where p̂_cls^{(t)} and q̂_cls^{(t)} represent the teacher and student models’ categorical predictions, respectively.

The confidence loss L_conf^{(t)} measures the divergence in confidence scores between the models, as shown in Equation 8:

$\mathcal{L}_{\mathrm{conf}}^{(t)} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{p}_{\mathrm{conf}}^{(t)} \log \hat{q}_{\mathrm{conf}}^{(t)} + \big(1 - \hat{p}_{\mathrm{conf}}^{(t)}\big) \log\big(1 - \hat{q}_{\mathrm{conf}}^{(t)}\big) \right].$    (8)

Here, N denotes the number of prediction boxes, and p̂_conf^{(t)} and q̂_conf^{(t)} represent the confidence scores of the teacher and student models, respectively.

The bounding box loss L_bbox^{(t)} evaluates the spatial accuracy of the predictions, as shown in Equation 9:

$\mathcal{L}_{\mathrm{bbox}}^{(t)} = \sum_{i=1}^{4} \mathrm{Smooth}_{L1}\big(\hat{p}_{\mathrm{bbox}}^{(t)},\, \hat{q}_{\mathrm{bbox}}^{(t)}\big).$    (9)

The total loss L at each time step t is a weighted sum of these losses as shown in Equation 10:

$\mathcal{L}^{(t)} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}^{(t)} + \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}}^{(t)} + \lambda_{\mathrm{bbox}} \mathcal{L}_{\mathrm{bbox}}^{(t)},$    (10)

where λ_cls = 0.4, λ_conf = 0.3, and λ_bbox = 0.3 are the balancing parameters.

The student model θ_S^{(t)} is then updated using a single gradient descent step with a learning rate η = 0.001, as shown in Equation 11:

$\theta_S^{(t)} \leftarrow \theta_S^{(t-1)} - \eta \cdot \nabla \mathcal{L}^{(t)}.$    (11)
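The following sketch (ours, not the authors' released code) shows one way the weighted distillation objective of Equations 7-10 and the single gradient step of Equation 11 could be written in PyTorch; the assumed prediction layout, a dict holding per-box class probabilities ('cls'), confidences ('conf'), and boxes ('bbox'), is an illustrative simplification of the YOLO output.

```python
import torch
import torch.nn.functional as F


def distillation_loss(p, q, lam=(0.4, 0.3, 0.3), eps=1e-7):
    """Total loss of Eq. 10 from teacher pseudo-labels p and student predictions q.
    Both are dicts with 'cls' (N, C) probabilities, 'conf' (N,) confidences in (0, 1),
    and 'bbox' (N, 4) box coordinates."""
    l_cls = -(p["cls"] * torch.log(q["cls"] + eps)).sum(dim=1).mean()   # Eq. 7, averaged over boxes
    l_conf = F.binary_cross_entropy(q["conf"], p["conf"])               # Eq. 8
    l_bbox = F.smooth_l1_loss(q["bbox"], p["bbox"])                     # Eq. 9
    w_cls, w_conf, w_bbox = lam
    return w_cls * l_cls + w_conf * l_conf + w_bbox * l_bbox            # Eq. 10


def student_step(optimizer, loss):
    """Single gradient-descent update of Eq. 11 (learning rate eta = 0.001 in the paper)."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```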

With the adapted student model, the final prediction can be obtained, that is, [q_cls, q_conf, q_bbox] = θ_S^{(t)}(x^{(t)}).

2.3.2 Exponential moving average

A key objective in self-training is aligning model predictions with generated pseudo-labels, typically achieved by minimizing the cross-entropy between predictions and pseudo-labels for the target data x_T^{(t)} and model θ_S^{(t)}. While using model predictions as pseudo-labels is effective in static domains, it poses challenges in dynamic domains, where shifts in the data distribution can degrade pseudo-label accuracy. To address this, we adopt an approach that produces higher-quality pseudo-labels: we introduce an exponential moving average (EMA) (Wang et al., 2022; Klinker, 2011) to enhance the model’s adaptability in a continuously evolving target domain. Initially, the teacher model’s weights θ_T^{(0)} mirror the source pre-trained network. At each time step t, the student and teacher models generate the primary prediction q̂^{(t)} and the pseudo-label p̂^{(t)}, respectively. The teacher model’s weights are then updated via EMA, blending them with the student model’s weights using a smoothing factor of µ = 0.95 to ensure a smooth transition, as shown in Equation 12:

$\theta_T^{(t)} = \mu\, \theta_T^{(t-1)} + (1 - \mu)\, \theta_S^{(t)}.$    (12)

After the teacher model is updated and the student model is fine-tuned, the model is ready to accept the next target domain data point x_T^{(t+1)}. This process continues iteratively, with the model adapting to each new data point in real time, ensuring accurate predictions and robust performance in dynamic environments.
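A minimal sketch of the EMA update in Equation 12, applied parameter-wise to the teacher (buffer statistics are ignored for brevity; names are ours):

```python
import torch


@torch.no_grad()
def ema_update(teacher, student, mu=0.95):
    """Eq. 12: theta_T <- mu * theta_T + (1 - mu) * theta_S, parameter by parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(mu).add_(p_s, alpha=1.0 - mu)
```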

2.3.3 Random restoration

While pseudo-labels are crucial for maintaining model accuracy across domains, prolonged self-training can introduce errors, especially in the presence of significant domain shifts. Such distribution changes may result in increasingly inaccurate predictions, potentially leading to a scenario in which self-training reinforces false predictions with erroneous labels. Additionally, models may suffer from catastrophic forgetting after adapting to new domains, compromising their performance on the original data. To address this, we introduce a random restoration mechanism (Cui et al., 2023) designed to re-integrate knowledge from the source pre-trained model θ^{(0)}. Specifically, at the beginning of each iteration of the student model θ_S^{(t)}, the restoration is governed by Equations 13 and 14: a proportion 1 − p of the parameters is retained, and a proportion p of the parameters is randomly restored to the base model θ^{(0)}. This random restoration mechanism ensures that the model retains essential knowledge from the source model during self-training, enhancing its adaptability to new data while preserving its understanding of the original data and preventing catastrophic forgetting due to over-adaptation. Formally, it can be expressed as:

$H \sim \mathrm{Bernoulli}(p),$    (13)
$\theta^{(t+1)} = H \odot \theta^{(0)} + (1 - H) \odot \theta^{(t)}.$    (14)

Here, ⊙ denotes element-wise multiplication, p is a small restoration probability, and H is a binary mask. We note that H determines which weight elements revert to the source weights θ^{(0)}, allowing us to deliberately recover knowledge from the source model and enhance the model’s adaptability to new data without forgetting the original.
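A minimal sketch of the stochastic restoration in Equations 13 and 14, assuming the source weights θ^{(0)} are kept as a state dict (names are ours):

```python
import torch


@torch.no_grad()
def random_restore(student, source_state, p=0.1):
    """Eqs. 13-14: revert each weight element to its source value theta^(0)
    with probability p, keeping the adapted value otherwise."""
    for name, param in student.named_parameters():
        mask = torch.bernoulli(torch.full_like(param, p))        # H ~ Bernoulli(p)
        param.copy_(mask * source_state[name].to(param.device) + (1.0 - mask) * param)
```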

The final model prediction leverages the student model’s output, integrating information from past iterations to mitigate performance degradation due to catastrophic forgetting. This approach not only bolsters the model’s robustness during continuous adaptation but also reduces error accumulation from distribution shifts, thereby improving the model’s generalization to new, unseen domains. It ensures not only immediate performance in the target domain but also long-term stability in a dynamically changing environment.

3 Experiments

3.1 Experiment setting and dataset

Our experiments are implemented in PyTorch 2.2.2 with CUDA 12.1 and trained on an NVIDIA GeForce RTX 4090 GPU. The random restoration probability is p = 0.1, the learning rate for the student model update is η = 0.001, and the smoothing factor in EMA is µ = 0.95. The batch size is 16, the number of epochs is 100, and the Adam optimizer is used to obtain the base model, whose network architecture is YOLOv9s.
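For reference, the reported hyperparameters can be gathered into a single configuration sketch (the key names are ours; the paper does not publish a configuration file):

```python
# Hyperparameters reported in Section 3.1
CONFIG = {
    "backbone": "YOLOv9s",
    "optimizer": "Adam",      # used to train the base (source) model
    "epochs": 100,
    "batch_size": 16,
    "student_lr": 1e-3,       # eta in Eq. 11 (test-time student update)
    "ema_momentum": 0.95,     # mu in Eq. 12
    "restore_prob": 0.1,      # p in Eqs. 13-14
}
```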

We use a self-collected tomato disease dataset, comprising a training set of 4,059 images, a validation set of 451 images, and a test set of 450 images. The categories are tomato_healthy, tomato_leaf blight, tomato_leaf curl, tomato_septoria leaf spot, and tomato_verticillium wilt. The test set extends each original image with the four fixed explicit corruptions (Gaussian noise, Brightness adjustment, Pixel loss, and Motion blur) and the four flexible implicit conditions (Foggy day, Dark night, Rainy day, Snowy day) to evaluate the recognition ability of the model in the cross-domain case.

To ensure the reliability of the experiments, we also conducted additional experiments using the PlantDoc public dataset and augmented it into the eight types of cross-domain images mentioned above to evaluate the model’s performance.

3.2 Analysis of dynamic data augmentation

Our explicit augmenter utilizes the four fixed augmentation methods, Gaussian noise, brightness adjustment, pixel loss, and motion blur, to generate 16 augmented images for each input image. Despite the diversity of these fixed augmentations, other parameter-free augmentation strategies may be feasible. To explore this possibility, we introduce two additional augmentation methods, cropping and rotation, and compare their performance with the four fixed augmentations. Concretely, for the cropping method, we randomly crop a square region covering 20% of the original image, while for the rotation method, we rotate the image by 90 degrees. The experimental results are shown in Table 1, where the proposed A_exp outperforms the cropping and rotation methods in terms of precision, recall, mAP50, and mAP50-95. Although each of the four explicit augmentation methods has its own characteristics, adding a further augmentation method has only a small influence on the overall experimental results. Consequently, we conclude that the four selected explicit augmentation methods are representative and broadly applicable.


Table 1. Experimental results of our explicit augmentation methods (Gaussian noise, Brightness adjustment, Pixel loss, and Motion blur), and their combination with other specific augmentation methods (Cropping, Rotation), illustrating the performance of the YOLOv9s model on our custom-built dataset.

The implicit augmenter A_imp generates 16 augmented images for each input image using the four flexible augmentation methods: foggy day, dark night, rainy day, and snowy day. To explore the potential of other flexible augmentation methods, we introduce a new augmentation method, frost day, and compare its performance with the four flexible augmentations. The prompt template is similar to our proposed ones, i.e., Generate different levels frost effect to the given image, and the other settings are the same as for the four A_imp augmentations. As shown in Table 2, the proposed A_imp clearly outperforms the frost day method in terms of precision, recall, mAP50, and mAP50-95. This indicates that our four flexible augmentation methods are sufficiently effective, and more complex generation strategies are not necessary.


Table 2. Experimental results of our implicit augmentation methods (Foggy day, Dark night, Rainy day, Snowy day), and their combination with another specific augmentation method (Frost day), illustrating the performance of the YOLOv9s model on our custom-built dataset.

Analysis of Table 3 reveals that Dynamic Data Augmentation (DDA) significantly enhances performance metrics (Precision, Recall, mAP50, and mAP50-95) across all models compared to Non-Augmentation and Traditional Augmentation conditions. For YOLOv9c, under Non-Augmentation, the metrics are Precision = 63.2%, Recall = 70.5%, mAP50 = 69.6%, and mAP50-95 = 55.0%; Traditional Augmentation improves these to 74.2%, 75.9%, 81.1%, and 64.3%, respectively; while Dynamic Data Augmentation further elevates them to 77.1%, 75.4%, 83.8%, and 65.9%, reflecting gains of 13.9%, 4.9%, 14.2%, and 10.9% over Non-Augmentation, and notable improvements over Traditional Augmentation. Similarly, for other models like YOLOv8s, Dynamic Data Augmentation boosts Precision from 61.5% (Non-Augmentation) and 66.9% (Traditional Augmentation) to 71.5%, and mAP50-95 from 57.5% and 59.5% to 63.7%; for YOLOv8m, mAP50-95 reaches 65.8% under Dynamic Data Augmentation, an 8.6% increase over Non-Augmentation (57.2%). This consistent performance uplift across models underscores the reliability and effectiveness of Dynamic Data Augmentation in enhancing robustness and generalization capability.


Table 3. Comparison of various models under non-augmentation, traditional augmentation, and our DDA conditions, illustrating the performance of the models on our custom-built dataset.

Additionally, we employed the DDA algorithm to test image augmentation under a single scenario, with the results presented in Table 4. By comparing the experimental outcomes, we observed that while image augmentation in a single scenario may not be as effective as composite augmentation, it still outperforms traditional methods of image augmentation. It is noteworthy that although the four distinct augmentation scenarios each demonstrated certain advantages when used individually, the performance differences among them were not significant.


Table 4. Comparison of various models under DDA with different single-scenario augmentations, illustrating the performance of the models on our custom-built dataset.

Table 4 further examines the effect of each single-scenario augmentation within DDA, providing granular evidence of how different weather conditions influence cross-domain robustness; it thus complements Table 3 and Figure 5 rather than duplicating them.


Figure 5. The detection results of the target domain adaptation model in the source domain are illustrated. (a) tomato leaf blight, (b) tomato leaf curl, (c) tomato verticillium wilt, and (d) tomato verticillium wilt.

Image augmentation in a single scenario, though less comprehensive than composite augmentation, can still enhance performance. This is significant for applications with limited resources or specific needs, showing that simple augmentation can be beneficial without complex strategies.

Furthermore, this insight also indicates that when devising image augmentation strategies, there is no need to overly pursue complex composite methods. Instead, the most appropriate augmentation strategy should be selected based on the actual application scenario and available resources. In some cases, a simple and effective single-scenario augmentation may be sufficient to meet the requirements, thereby conserving computational resources and time.

It can be seen from Figure 6 that the confidence of the model trained with DDA on the detected targets increases significantly, with an average increase of 0.175, and the attention distribution maps are more focused on the plant disease sites. This shows that DDA significantly improves the cross-domain recognition ability of the model.


Figure 6. Comparison of the baseline model, the traditional augmentation and DDA (original diagram and heat map). (a) YOLOv9c, (b) Heat map of baseline, (c) Traditional Augmentation, (d) Heat map of Traditional Augmentation, and (e) DDA.

In Figure 7, orange (triangles) denotes the source model, blue (squares) the traditional augmentation, and green (circles) DDA. It can be clearly seen that, on the source-domain data, the DDA-trained model learns faster, reaches the same accuracy earlier, and attains a higher final accuracy, with the final accuracy ordered as DDA > traditional augmentation > source model.


Figure 7. Comparison of non-augmentation (orange triangles), traditional augmentation (blue squares), and DDA (green circles). (a) metrics/mAP50 (B) and (b) metrics/mAP50–95 (B).

3.3 Comparison of cross-domain detection performance

Table 5 comprehensively presents the performance of various methods on a custom tomato dataset under two conditions (Traditional Augmentation and DDA) across different augmentation types, with performance measured by mAP50. A detailed examination of the data reveals the outstanding effectiveness and robustness of CTTA-DisDet(v9s) in handling varying environmental conditions under both augmentation scenarios. Under Traditional Augmentation, CTTA-DisDet(v9s) achieves an average mAP50 of 65.3%, surpassing other methods such as YOLOv9s (60.6%), BN (57.2%), PL (36.3%), TENT (35.5%), LAME (63.3%), and CoTTA (62.1%), with particularly strong results in specific augmentations like FO (49.4%), PL (71.5%), DA (76.5%), GN (77.4%), BR (72.7%), and MB (58.9%), where it consistently secures the highest or near-highest scores, highlighting its superior adaptability across diverse image distortion scenarios. When evaluated under DDA, CTTA-DisDet(v9s) further improves its average mAP50 to 67.9%, again outperforming YOLOv9s (63.1%), BN (60.3%), PL (39.1%), TENT (38.5%), LAME (66.4%), and CoTTA (64.8%), while maintaining a lead in augmentations such as FO (51.5%), SN (53.4%), PL (73.8%), DA (79.1%), GN (80.1%), BR (74.9%), and MB (62.2%), even surpassing methods like LAME (e.g., SN: 64.4%) and CoTTA (e.g., GN: 79.2%) that excel in specific categories. Notably, CTTA-DisDet(v9s) demonstrates consistent improvement from Traditional Augmentation to DDA, with gains such as FO increasing from 49.4% to 51.5%, PL from 71.5% to 73.8%, and GN from 77.4% to 80.1%, underscoring how DDA enhances the model’s resilience to complex environmental variations. Overall, CTTA-DisDet(v9s) not only achieves a significantly higher average performance compared to other methods but also excels in most individual augmentation types, validating its effectiveness in leveraging dynamic data augmentation and continuous test-time domain adaptation to boost detection accuracy and robustness in non-stationary agricultural settings like tomato disease detection, thereby offering reliable support for real-world applications.


Table 5. Performance of various methods on the custom tomato dataset under different augmentations (mAP50/%).

Table 6 evaluates adaptation performance on the PlantDoc dataset under diverse environmental augmentations. While YOLOv9s achieves competitive scores in specific traditional augmentation scenarios (e.g., FO: 50.6%, PL: 51.4%, DA: 56.7%), its performance fluctuates significantly across conditions like SN (27.1%) and MB (20.5%), exposing vulnerability to extreme distortions. CTTA-DisDet(v9s) demonstrates balanced robustness, attaining the highest average mAP50 in both Traditional Aug. (46.7%) and DDA (49.8%) without extreme performance drops. Notably, it addresses critical weaknesses of specialized methods: under DDA, CTTA-DisDet(v9s) surpasses LAME’s strong MB performance (43.1% vs. 42.9%) while outperforming CoTTA in SN (42.6% vs. 43.4%) and GN (49.3% vs. 38.7%), showcasing multi-threat mitigation capability. The method exhibits strategic improvements from Traditional Aug. to DDA, particularly enhancing RA (45.3%→48.4%) and GN (46.2%→49.3%), where other approaches plateau. Unlike the tomato dataset analysis, PlantDoc reveals CTTA-DisDet(v9s)’s ability to narrow the performance gap between specialized augmentations (e.g., improving RA by 3.1% while maintaining FO gains), achieving more homogeneous robustness across all test conditions. This contrasts with LAME’s inconsistent adaptation, which excels in MB (43.1%) but falters in FO (43.1%). The results validate CTTA-DisDet(v9s)’s cross-dataset effectiveness, particularly in handling PlantDoc’s complex multi-class detection scenarios through stable, non-oscillating adaptation.


Table 6. Performance of various methods on the PlantDoc dataset under different augmentations (mAP50/%).

To demonstrate that the model does not experience catastrophic forgetting while enhancing detection capability in the target domain and retaining the knowledge acquired from the source domain, we conducted a re-evaluation on the source domain after the model adapted to the target domain. As shown in Figure 5, the accuracy of the model remains consistent with that of the version without cross-domain adaptation, indicating that our method of random recovery effectively prevents catastrophic forgetting.

As shown in Table 7, CTTA-DisDet’s FPS of 47 is notably lower than that of YOLOv9s (163) and YOLOv9e (55), which is closely tied to its significantly higher computational load of 303.64 GFLOPs compared to 26.9 for YOLOv9s and 189.5 for YOLOv9e. However, its precision (Avg. mAP50) reaches 67.9, substantially surpassing YOLOv9s (60.6) and YOLOv9e (60.2), highlighting its superior accuracy and reliability in object detection tasks. In terms of parameters, CTTA-DisDet (14,636,736) falls between the two, suggesting that its complexity is driven not by parameter scale but by computational intensity. Although its FPS is lower, 47 frames per second remains sufficient for real-time processing in many practical applications, and the method even offers advantages in precision-critical scenarios. Thus, despite its high computational demand, CTTA-DisDet effectively meets usage requirements through its enhanced precision, striking a reasonable balance between performance and practicality. Furthermore, as shown in Figure 8, the experimental results of CTTA-DisDet under different scenarios demonstrate the model’s robust detection capability and adaptability across diverse field conditions.

Figure 8. CTTA-DisDet experimental results diagram under different scenarios. (a) Snowy, (b) Foggy, (c) Rainy, and (d) Frost.

Table 7. Evaluation of methods’ efficiency.
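For context, efficiency figures of the kind reported in Table 7 can be reproduced with a simple benchmark such as the sketch below (PyTorch, illustrative only; the 640x640 input size, warm-up count, and iteration count are assumptions rather than the paper's exact protocol). FLOPs are typically obtained separately with a dedicated profiling tool.

```python
# Minimal sketch for measuring parameter count and FPS of a detector;
# input resolution and iteration counts are assumptions, not the paper's setup.
import time
import torch

@torch.no_grad()
def measure_efficiency(model, input_size=640, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, input_size, input_size, device=device)
    for _ in range(warmup):                 # warm-up runs to stabilize timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    fps = iters / (time.perf_counter() - start)
    return n_params, fps

# Example (hypothetical): n_params, fps = measure_efficiency(detector_model)
```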

4 Discussion

In this study, we propose an innovative framework in which dynamic data augmentation enables the model to learn more diverse feature representations and thereby identify diseases more accurately in unseen fields. The continuous test-time domain adaptation approach allows the model to adjust in real time to the changing target domain, maintaining a rapid response to new conditions. CTTA-DisDet adopts a teacher–student configuration in which both models share the same architecture. The teacher model is trained on augmented data, acquiring generalized knowledge that it transfers to the student model through knowledge distillation.

This setup allows the student model to be fine-tuned for specific target domains, enhancing its detection accuracy across different environmental contexts. These methods not only strengthen the model's generalization ability but also improve its cross-domain detection performance, markedly increasing its ability to detect crop diseases in unseen fields, which is of practical significance for improving crop yield and quality.
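To make the teacher–student update during testing concrete, the sketch below outlines one adaptation step under the assumptions of a PyTorch detector with YOLO-style outputs (box coordinates followed by a confidence score). The helper `detection_loss` and the 0.5 confidence threshold are hypothetical placeholders, not the exact settings of our implementation; in practice the thresholded pseudo-labels would be converted to the detector's training-target format before the loss is computed.

```python
# Minimal sketch of a pseudo-label-driven student update, assuming PyTorch and
# YOLO-style predictions [x, y, w, h, conf, class scores...]; `detection_loss`
# and `conf_thresh` are illustrative placeholders.
import torch

def adaptation_step(teacher, student, optimizer, images, detection_loss, conf_thresh=0.5):
    teacher.eval()
    student.train()
    with torch.no_grad():
        teacher_preds = teacher(images)                        # teacher sees the target-domain batch
        pseudo_labels = [p[p[:, 4] > conf_thresh] for p in teacher_preds]  # keep confident boxes only

    student_preds = student(images)
    loss = detection_loss(student_preds, pseudo_labels)        # student mimics the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```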

From the experimental results, it can be observed that the integration of dynamic data augmentation (DDA) and continuous test-time adaptation (CTTA) leads to consistent improvements in precision, recall, and mAP across multiple datasets. The explicit augmentations contribute to robustness against common environmental distortions, while the implicit augmentations generated by large language models introduce realistic weather variations that effectively simulate non-stationary conditions. This demonstrates that the proposed dual-augmentation strategy can successfully narrow the domain gap between training and testing data, enabling stable adaptation even under rapid environmental shifts. Furthermore, the dual-loop mechanism combining exponential moving average (EMA) updates with stochastic restoration effectively mitigates catastrophic forgetting, maintaining model stability during long-term adaptation.
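The EMA side of this dual-loop mechanism can be summarized by the short sketch below (PyTorch); the decay value of 0.999 is illustrative and not necessarily the value used in our experiments.

```python
# Minimal sketch of the exponential moving average (EMA) teacher update;
# the decay of 0.999 is an illustrative choice.
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
    # Buffers (e.g., BatchNorm running statistics) are typically copied directly
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)
```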

Beyond empirical performance, the broader implication of CTTA-DisDet lies in its potential for real-world agricultural automation. By enabling models to self-adapt without source data access, this framework provides a practical pathway for intelligent monitoring systems capable of long-term deployment in complex, evolving environments. However, it is important to note that CTTA-DisDet may demand higher computational resources, which could limit its application in resource-constrained edge devices. Additionally, the fidelity of implicit augmentations still depends on the generative quality of LLM-based image synthesis, which may vary across domains. Future research should explore lightweight adaptation strategies, uncertainty-aware pseudo-label filtering, and hybrid augmentation mechanisms to further enhance stability and efficiency during continuous adaptation.

Data availability statement

The required data have been deposited in a public repository and are available on Zenodo at https://doi.org/10.5281/zenodo.17659449.

Author contributions

ZW: Conceptualization, Funding acquisition, Methodology, Writing – original draft. JH: Methodology, Software, Writing – review & editing. SW: Data curation, Writing – review & editing. XM: Funding acquisition, Methodology, Software, Writing – review & editing. RF: Funding acquisition, Project administration, Supervision, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the Shandong Provincial Natural Science Foundation (Grant No. ZR2025QC649), the Weifang University of Science and Technology A-Class Doctoral Research Fund (Grant Nos. KJRC2024006 and KJRC2024014), and the Weifang University of Science and Technology Doctoral Research Fund (Grant No. KJRC2023045).

Acknowledgments

The authors would like to acknowledge the contributions of the participants in this study.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abade, A., Ferreira, P. A., and de Barros Vidal, F. (2021). Plant diseases recognition on images using convolutional neural networks: A systematic review. Comput. Electron. Agric. 185, 106125. doi: 10.1016/j.compag.2021.106125

Benelli, A., Cevoli, C., and Fabbri, A. (2020). In-field hyperspectral imaging: An overview on the ground-based applications in agriculture. J. Agric. Eng. 51, 129–139. doi: 10.4081/jae.2020.1030

Chen, L., Zhang, Y., Song, Y., Shan, Y., and Liu, L. (2023). “Improved test-time adaptation for domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24172–24182. doi: 10.48550/arXiv.2304.04494

Chi, Z., Wang, Y., Yu, Y., and Tang, J. (2021). “Test-Time fast adaptation for dynamic scene deblurring via meta-auxiliary learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9137–9146. doi: 10.1109/CVPR46437.2021.00902

Cohen, S., Goldshlager, N., Rokach, L., and Shapira, B. (2023). Boosting anomaly detection using unsupervised diverse test-time augmentation. Inf. Sci. 626, 821–836. doi: 10.1016/j.ins.2023.01.081

Cui, Q., Sun, H., Lu, J., Li, W., Li, B., Yi, H., et al. (2023). “Test-time personalizable forecasting of 3d human poses,” in Proceedings of the IEEE/CVF International Conference on Computer Vision. 274–283. doi: 10.1109/ICCV51070.2023.00032

Dai, M., Dorjoy, M. M. H., Miao, H., and Zhang, S. (2023). A new pest detection method based on improved yolov5m. Insects 14, 54. doi: 10.3390/insects14010054

Deng, Y., Xi, H., Zhou, G., Chen, A., Wang, Y., Li, L., et al. (2023). An effective image-based tomato leaf disease segmentation method using mc-unet. Plant Phenomics 5, 49. doi: 10.34133/plantphenomics.0049

Dong, Q., Sun, L., Han, T., Cai, M., and Gao, C. (2024). Pestlite: a novel yolo-based deep learning technique for crop pest detection. Agriculture 14, 228. doi: 10.3390/agriculture14020228

Dong, X., Wang, Q., Huang, Q., Ge, Q., Zhao, K., Wu, X., et al. (2023). Pddd-pretrain: a series of commonly used pre-trained models support image-based plant disease diagnosis. Plant Phenomics 5, 54. doi: 10.34133/plantphenomics.0054

Duan, Z., Li, H., Li, C., Zhang, J., Zhang, D., Fan, X., et al. (2024). A cnn model for early detection of pepper phytophthora blight using multispectral imaging, integrating spectral and textural information. Plant Methods 20. doi: 10.1186/s13007-024-01239-7

Feng, J., Zhang, S., Zhai, Z., Yu, H., and Xu, H. (2024). Dc2net: An asian soybean rust detection model based on hyperspectral imaging and deep learning. Plant Phenomics 6, 163. doi: 10.34133/plantphenomics.0163

Fleuret, F. (2021). “Test time adaptation through perturbation robustness,” in NeurIPS 2021 Workshop on Distribution Shifts: Connecting Methods and Applications.

Gao, W., Yang, X., Cao, L., Cao, F., Liu, H., Qiu, Q., et al. (2023). Screening of ginkgo individuals with superior growth structural characteristics in different genetic groups using terrestrial laser scanning (tls) data. Plant Phenomics 5, 92. doi: 10.34133/plantphenomics.0092

Islam, R. and Moushi, O. M. (2024). “Gpt-4o: The cutting-edge advancement in multimodal llm.” in Intelligent Computing-Proceedings of the Computing Conference. (Cham: Springer Nature Switzerland), 47–60.

Johnson, J., Sharma, G., Srinivasan, S., Masakapalli, S. K., Sharma, S., Sharma, J., et al. (2021). Enhanced field-based detection of potato blight in complex backgrounds using deep learning. Plant Phenomics. doi: 10.34133/2021/9835724

Khalid, S., Oqaibi, H. M., Aqib, M., and Hafeez, Y. (2023). Small pests detection in field crops using deep learning object detection. Sustainability 15, 6815. doi: 10.3390/su15086815

Klinker, F. (2011). Exponential moving average versus moving exponential average. Mathematische Semesterberichte 58, 97–107. doi: 10.1007/s00591-010-0080-8

Li, Z., Sun, J., Shen, Y., Yang, Y., Wang, X., Wang, X., et al. (2024b). Deep migration learning-based recognition of diseases and insect pests in Yunnan tea under complex environments. Plant Methods 20. doi: 10.1186/s13007-024-01219-x

Li, J., Zhao, F., Zhou, G., Xu, J., Gao, M., Li, X., et al. (2024a). A multi-modal open object detection model for tomato leaf diseases with strong generalization performance using pdc-vld. Plant Phenomics. doi: 10.34133/plantphenomics.0220

Li, K., Zhu, X., Qiao, C., Zhang, L., Gao, W., and Wang, Y. (2023). The gray mold spore detection of cucumber based on microscopic image and deep learning. Plant Phenomics 5, 11. doi: 10.34133/plantphenomics.0011

Liu, J. and Wang, X. (2020). Tomato diseases and pests detection based on improved yolo v3 convolutional neural network. Front. Plant Sci. 11. doi: 10.3389/fpls.2020.00898

Lu, Z., Huang, S., Zhang, X., Shi, Y., Yang, W., Zhu, L., et al. (2023). Intelligent identification on cotton verticillium wilt based on spectral and image feature fusion. Plant Methods 19. doi: 10.1186/s13007-023-01056-4

Mathieu, L., Reder, M., Siah, A., Ducasse, A., Langlands-Perry, C., Marcel, T. C., et al. (2024). Septosympto: a precise image analysis of septoria tritici blotch disease symptoms using deep learning methods on scanned images. Plant Methods 20. doi: 10.1186/s13007-024-01136-z

Qi, F., Wang, Y., Tang, Z., and Chen, S. (2023). Real-time and effective detection of agricultural pest using an improved yolov5 network. J. Real-Time Image Process. 20, 33. doi: 10.1007/s11554-023-01264-0

Saadati, M., Balu, A., Chiranjeevi, S., Jubery, T. Z., Singh, A. K., Sarkar, S., et al. (2024). Out-of-distribution detection algorithms for robust insect classification. Plant Phenomics 6, 0170. doi: 10.48550/arXiv.2305.01823

Shanmugam, D., Blalock, D., Balakrishnan, G., and Guttag, J. (2021). “Better aggregation in test-time augmentation,” in Proceedings of the IEEE/CVF international conference on computer vision. 1214–1223. doi: 10.1109/ICCV48922.2021.00125

Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., and Hardt, M. (2020). “Test-time training with self-supervision for generalization under distribution shifts,” in International conference on machine learning. (PMLR). 9229–9248. doi: 10.48550/arXiv.1909.13231

Tang, Z., He, X., Zhou, G., Chen, A., Wang, Y., Li, L., et al. (2023). A precise image-based tomato leaf disease detection approach using plpnet. Plant Phenomics 5, 42. doi: 10.34133/plantphenomics.0042

Torres-Lomas, E., Lado-Bega, J., Garcia-Zamora, G., and Diaz-Garcia, L. (2024). Segment anything for comprehensive analysis of grapevine cluster architecture and berry properties. Plant Phenomics 6, 202. doi: 10.34133/plantphenomics.0202

Triki, H. E., Ribeyre, F., Pinard, F., and Jaeger, M. (2023). Coupling plant growth models and pest and disease models: An interaction structure proposal, mimic. Plant Phenomics 5, 77. doi: 10.34133/plantphenomics.0077

Tschurr, F., Kirchgessner, N., Hund, A., Kronenberg, L., Anderegg, J., Walter, A., et al. (2023). Frost damage index: The antipode of growing degree days. Plant Phenomics 5, 104. doi: 10.34133/plantphenomics.0104

Varsavsky, T., Orbes-Arteaga, M., Sudre, C. H., Graham, M. S., Nachev, P., and Cardoso, M. J. (2020). “Test-time unsupervised domain adaptation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. (Cham: Springer International Publishing), 428–436. doi: 10.1007/978-3-030-59710-8_42

Wang, Q., Fink, O., Van Gool, L., and Dai, D. (2022). “Continual test-time domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7201–7211. doi: 10.48550/arXiv.2203.13591

Wang, C.-Y. and Liao, H.-Y. M. (2024). “Yolov1 to yolov10: The fastest and most accurate real-time object detection systems,” in APSIPA Transactions on Signal and Information Processing. doi: 10.48550/arXiv.2408.09332

Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. (2020). Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726. doi: 10.48550/arXiv.2006.10726

Xu, Y., Mao, Y., Li, H., Sun, L., Wang, S., Li, X., et al. (2023). A deep learning model for rapid classification of tea coal disease. Plant Methods 19. doi: 10.1186/s13007-023-01074-2

Xu, Z., York, L. M., Seethepalli, A., Bucciarelli, B., Cheng, H., and Samac, D. A. (2022). Objective phenotyping of root system architecture using image augmentation and machine learning in alfalfa (medicago sativa l.). Plant Phenomics. doi: 10.34133/2022/9879610

Ye, X., Pan, J., Liu, G., and Shao, F. (2023). Exploring the close-range detection of uav-based images on pine wilt disease by an improved deep learning method. Plant Phenomics 5, 129. doi: 10.34133/plantphenomics.0129

Zhang, M., Levine, S., and Finn, C. (2022). Memo: Test time robustness via adaptation and augmentation. Adv. Neural Inf. Process. Syst. 35, 38629–38642. doi: 10.48550/arXiv.2110.09506

Zheng, C., Abd-Elrahman, A., Whitaker, V. M., and Dalid, C. (2022). Deep learning for strawberry canopy delineation and biomass prediction from high-resolution images. Plant Phenomics. doi: 10.34133/2022/9850486

Zhou, H., Hu, Y., Liu, S., Zhou, G., Xu, J., Chen, A., et al. (2024). A precise framework for rice leaf disease image–text retrieval using fhtw-net. Plant Phenomics 6, 168. doi: 10.34133/plantphenomics.0168

Zhou, L., Xiao, Q., Taha, M. F., Xu, C., and Zhang, C. (2023). Phenotypic analysis of diseased plant leaves using supervised and weakly supervised deep learning. Plant Phenomics 5, 22. doi: 10.34133/plantphenomics.0022

Zhu, R., Hao, F., and Ma, D. (2023). Research on polygon pest-infected leaf region detection based on yolov8. Agriculture 13, 2253. doi: 10.3390/agriculture13122253

Keywords: tomato disease detection, non-stationary environments, test-time domain adaptation, knowledge distillation, dynamic data augmentation, teacher–student, pseudo-labeling, EMA

Citation: Wu Z, Han J, Wang S, Meng X and Fu R (2025) Detection techniques for tomato diseases under non-stationary climatic conditions. Front. Plant Sci. 16:1704663. doi: 10.3389/fpls.2025.1704663

Received: 13 September 2025; Accepted: 24 October 2025;
Published: 05 December 2025.

Edited by:

Xiaojun Jin, Nanjing Forestry University, China

Reviewed by:

Sanjay Mate, Government Polytechnic Daman, India
Guo Xu, Shanghai Dianji University, China

Copyright © 2025 Wu, Han, Wang, Meng and Fu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rui Fu, furui19891209@wfust.edu.cn

ORCID: Shiyu Wang, orcid.org/0009-0000-2600-6521
Xiangwei Meng, orcid.org/0009-0001-1238-7623
