A mini-review on data science approaches in crop yield and disease detection

Agriculture constitutes a sector with a considerable environmental impact, a concern that is poised to increase with the projected growth in population, thereby amplifying implications for public health. Effectively mitigating and managing this impact demands the implementation of intelligent technologies and data-driven methodologies collectively called precision agriculture. While certain methodologies enjoy widespread acknowledgement, others, despite their lesser prominence, contribute meaningfully. This mini-review report discusses the prevalent AI technologies within precision agriculture over the preceding five years, with a specific emphasis on crop yield prediction and disease detection domains extensively studied within the current literature. The primary objective is to give a comprehensive overview of AI applications in agriculture, spanning machine learning, deep learning, and statistical methods. This approach aims to address a notable gap wherein existing reviews predominantly focus on singular aspects rather than presenting a unified and inclusive perspective.


Introduction
Agriculture plays a central role in the global economy, offering vital income generation and employment opportunities (Phasinam et al., 2022). It holds critical responsibilities in ensuring food quality and safety, preserving the environment, fostering integrated rural development, and maintaining social structure and cohesion in rural areas (Loizou et al., 2019). For instance, in 2022 the European Union's agricultural sector contributed a gross value added of 222.3 billion euros, about 1.4% of the total gross domestic product (GDP) of Europe. Particularly noteworthy was the relative increase in the estimated agricultural income per annual work unit, which reached a level 44.3% higher than that observed in 2015 (Eurostat, 2023). Furthermore, agriculture remained a crucial employer, with 8.7 million individuals employed in the agricultural sector across Europe in 2020, affirming its continued prominence within the EU (Eurostat, 2020). These figures are projected to rise further in response to the expected increase in the global population, which should reach 9.7 billion by 2050 (Pew Research Center, 2019). The most substantial population increase is expected in Africa, with projected growth of approximately 92.3%, followed by Latin America and Asia, which are expected to grow by about 21% and 15.23%, respectively (Pew Research Center, 2019). This population surge in specific regions leads to a notable escalation in food demand. Alexandratos and Bruinsma (2012) underscore the need to increase global agricultural production by 60% to meet this growing food requirement. Developing countries face an even greater challenge, as they would need to raise agricultural output by 77%, while developed countries should aim for a 24% increase (Malhi et al., 2021). Consequently, the environmental impact of the agricultural sector has grown, and over the next four decades its emissions are expected to increase by more than 60% (Fróna et al., 2019). In general, agriculture accounts for more than 11% of total anthropogenic emissions from direct sources (Maraseni and Qu, 2016), and this value grows by about 3-6% if storage, transportation, packaging and agricultural input production are included (Tan et al., 2022). Considering direct agricultural emissions, 81% of global ammonia (NH3) emissions come from the agricultural sector (Damme et al., 2021) as a result of the increase in animal feeding operations (Schultz et al., 2019). NH3 has a high impact on ecosystems, leading to acidification and eutrophication, and also plays a key role in the generation of particulate matter with a diameter of 2.5 micrometers or less (PM2.5), which is responsible for serious health problems such as chronic obstructive pulmonary disease and lung cancer (Lelieveld et al., 2015; Apte et al., 2018). Other emissions from the agricultural sector are methane (CH4) and nitrous oxide (N2O), which are greenhouse gases (GHGs) and contribute to climate change. They are produced during enteric fermentation, manure management, synthetic fertilizer use, rice cultivation, application of manure to soils and pastures, decomposition of crop residues, cultivation of organic soils, and burning of crop residues (Han et al., 2019). Agriculture thus has a very large influence on climate change, which in turn has a negative effect on agriculture itself. Indeed, agriculture is highly susceptible to climate variations and suffers from significant fluctuations in temperature and rainfall. These variations directly influence crop yields and quality, posing challenges to food production and agricultural sustainability. For instance, extended precipitation can delay production processes because of muddy soils and fields inaccessible to machinery; high temperatures cause a lack of winter chill, which reduces the quality of asparagus and rhubarb and affects flowering time; and increased CO2 reduces micro- and macronutrients in lettuce and celery (Bisbis et al., 2018).
To mitigate the impact of climate change on agriculture and simultaneously reduce agriculture's contribution to climate change, new technologies based on data science must be embraced. In fact, data-driven decision-making holds the potential to revolutionize farming practices by enabling more efficient use of water, pesticides, and fertilizers, thereby minimizing environmental impacts (Akkem et al., 2023).

Data science in agriculture
Nowadays, many new technologies based on the Internet of Things (IoT), wireless connectivity, cloud computing, and blockchain have the potential to revolutionize crop monitoring. One example is remote sensing: satellite-based (e.g., Sentinel-3) or Unmanned Aerial Vehicle (UAV) systems use spectral images to measure reflected radiation (Toth and Jóźków, 2016). When analyzed, these images provide valuable vegetation indices, including the widely used Normalized Difference Vegetation Index (NDVI) (Skakun et al., 2018), which assesses crop health based on red and near-infrared reflectance. Beyond general vegetation indices, specific pigment content can be evaluated using remote sensing data. For instance, the Normalized Red Index quantifies chlorophyll levels, while the Normalized Green Index focuses on pigments other than chlorophyll (Qi et al., 1994). In addition to remote sensing, field wireless sensor networks are employed to measure vital weather and soil variables, such as temperature, air humidity, soil moisture, and pH (Priya and Yuvaraj, 2019). All these technologies guide agriculture toward a digital revolution, leading to the rise of precision agriculture (PA), which customizes agricultural practices to fit the unique characteristics of each crop, field, and environmental context. PA advocates the adoption of cutting-edge technologies and data-driven approaches to address the inherent heterogeneities within a field (Finger et al., 2019), increasing productivity while using fewer natural resources such as energy and water (Pathan et al., 2020). PA applies to many agricultural practices, offering benefits in resource efficiency and crop management. For instance, in irrigation, PA enables precise water delivery, avoiding waste and ensuring optimal water use. Similarly, in fertilization, PA identifies the specific areas within a field where nutrients are needed, providing targeted support to plant growth and minimizing resource losses due to over-application. Furthermore, PA extends to pest control and disease detection, where early warnings from predictive models enable proactive intervention, reducing potential damage and optimizing treatment strategies (Shafi et al., 2019). Figure 1 reports the domains where PA techniques are applied.
As Figure 1 shows, the majority of publications in precision agriculture are concentrated in the crop domains (green). Specifically, disease detection (22%) and yield prediction (20%) stand out as the dominant subdomains in research. The third most studied domain is livestock production, accounting for 12% of the publications. The availability of these new technologies paves the way for big data in agriculture and makes the field attractive for advanced data analysis methodologies such as deep learning (DL) and machine learning (ML), which are the most used in the recent literature on PA applications (Ayoub Shaikh et al., 2022). The following sections report recent literature on ML and DL techniques for yield prediction and disease detection, since these are the domains in which precision agriculture is most studied; another common class of models in PA applications is then reviewed.
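As a concrete illustration of the vegetation indices mentioned above, the following minimal sketch computes NDVI from red and near-infrared reflectance using the standard definition NDVI = (NIR - Red) / (NIR + Red). The function names, the 2x2 reflectance grids, and the choice to return 0.0 for pixels with zero total reflectance are assumptions made for illustration; they are not taken from the cited studies.

```python
def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red) for a single pixel."""
    total = nir + red
    if total == 0:
        # Assumption: pixels with no reflected signal are mapped to 0.0.
        return 0.0
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally shaped 2D grids."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical reflectance values (fractions between 0 and 1).
nir_band = [[0.60, 0.50], [0.30, 0.10]]
red_band = [[0.10, 0.10], [0.20, 0.10]]

result = ndvi_grid(nir_band, red_band)
for row in result:
    print([round(value, 3) for value in row])
```

Values close to 1 indicate dense, healthy vegetation, while values near 0 or below typically correspond to bare soil or non-vegetated surfaces.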

Prominent machine and deep learning techniques employed in precision agriculture applications
Crop yield prediction is one of the most important areas of precision agriculture because accurate model predictions help farmers optimize crop management. The task remains complex, however, because of the hierarchical nature of crop yield, which involves variables ranging from plant genotype to environmental descriptors across time and space. Some of the most recent publications propose semiparametric DL networks to encode nonlinear relationships between variables. For instance, Jeong et al. (2022) developed a methodology for early-stage prediction of rice yield at pixel scale, using as input variables vegetation indices, transplanting dates, minimum and maximum temperatures, solar radiation, administrative information, and yearly rice maps. The outputs of the remote-sensing integrated crop model (RSCM) (Pistenma et al., 1977) were used to train five different DL models. The selected model was a Long Short-Term Memory (LSTM) network combined with a 1D Convolutional Neural Network (CNN). The county-scale and pixel-scale models were also compared: county-scale yields lack the significant advantages of satellite images and are less sensitive to spatial variations within each county, while pixel-scale yields better represent variations within a region. CNNs are also used in strawberry cultivation to detect and count mature strawberries, immature strawberries, and blossoms in UAV and near-ground digital images, in order to predict strawberry yield and the best harvesting time (Zhou et al., 2021). Another DL technique applied to crop yield prediction is the deep neural network, a multilayer feed-forward neural network that is well suited to large datasets. Training commonly relies on gradient-based methods, which can converge slowly or become trapped in local minima depending on the random initialization of the weights. To address this issue, a fusion of deep neural networks and genetic algorithms has been explored, which aims to avoid local minima by identifying a reduced-dimensional subspace of weights. This integration becomes especially relevant when environmental and genotype data are employed for accurate crop yield prediction (Bi and Hu, 2021).
Disease detection is vital to avoid losses in crop yield and quality. Because pesticides have usually been applied uniformly to the whole field, classifying and predicting the early stage of a disease and finding critical infestation areas are crucial to avoid economic losses and environmental problems; such models mainly use hourly weather data spanning two to five years (Fenu and Malloci, 2021). Within this field, ML techniques have been introduced for disease management, as in the work by Bhatia et al. (2022). This study compared three ML methods, namely k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Naïve Bayes (NB). The aim was to develop an optimized spray prediction model against powdery mildew using the tomato powdery mildew dataset (TPMD). This dataset encompasses weather variables such as temperature, relative humidity, wind speed, and global radiation, along with leaf wetness data. SVM exhibited the best classification performance, making it the most suitable choice for this prediction task. Furthermore, a hybrid variant of the SVM was introduced for the detection of powdery mildew. In this approach, SVM worked as a wrapper, refining the training set and minimizing the possibility of sample mislabeling. A logistic regression model was then applied to the refined training set, reducing the classification error (Bhatia et al., 2020). Random Forest (RF) has been proposed as an ML classifier for tomato diseases: an RF uses leaf images of Early Blight, Late Blight, Septoria Leaf Spot, Spider Mite, Mosaic Virus, and Yellow Leaf Curl Virus to distinguish healthy from diseased leaves (Govardhan and M B, 2019). RFs have also been observed to outperform other supervised ML and DL algorithms such as CNN, SVM and k-NN in classifying maize leaf diseases (Arora et al., 2020).
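To make the classification setting described above more tangible, here is a minimal k-NN sketch in plain Python. It is only an illustration: k-NN is chosen for brevity (the study cited above found SVM to perform best), and the six observations, the feature set (temperature, relative humidity, leaf wetness hours), and the labels "spray" and "no_spray" are all hypothetical and not taken from the TPMD.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical observations: (temperature in °C, relative humidity in %,
# leaf wetness in hours). Real data would need feature scaling first,
# because Euclidean distance is dominated by the largest-valued feature.
train_X = [
    (22.0, 85.0, 9.0),
    (24.0, 90.0, 10.0),
    (21.0, 80.0, 8.0),
    (30.0, 40.0, 1.0),
    (32.0, 35.0, 0.0),
    (29.0, 45.0, 2.0),
]
train_y = ["spray", "spray", "spray", "no_spray", "no_spray", "no_spray"]

print(knn_predict(train_X, train_y, (23.0, 88.0, 9.5)))  # spray
print(knn_predict(train_X, train_y, (31.0, 38.0, 0.5)))  # no_spray
```

In practice, the candidate classifiers would be compared on held-out data (for example with cross-validation) rather than on a handful of hand-made points.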

Mechanistic-deterministic models in precision agriculture applications
Big data leads to the use of another class of models, namely mechanistic-deterministic models (MDMs), which are not based on statistical relationships between variables but instead model biophysical processes, accounting for deterministic relationships between crop growth and environmental, management and genetic factors. MDMs are useful for understanding complex crop-related phenomena and for managing agrosystems optimally (Pasquel et al., 2022). These characteristics make them a widespread tool in the agro-environmental field, since they can work without massive amounts of data that are time-consuming and expensive to collect, such as disease observations at leaf level. Among the many applications developed in this modeling framework, a selection of models is summarized below.
AquaCrop, a prominent crop modeling tool developed by the FAO, predicts crop biomass and yield under diverse water management scenarios. It comprises multiple modules, each simulating an aspect of the agroecosystem with its own equations; the main components are outlined here. The Phenology module identifies plant development stages, while the Climate module includes variables such as air temperature, rainfall, and evapotranspiration demand. The Soil module manages the daily water balance, considering soil characteristics. The Canopy module models soil surface coverage, which is influenced by stress and phenological stage. The Biomass module (Equation 1) calculates plant biomass over time using the formula:
$$B = WP \cdot \sum Tr \qquad (1)$$
where B signifies final biomass, WP represents water productivity (biomass per cumulative transpiration unit), and Tr denotes daily crop transpiration. The remaining components supply the quantities in this equation. Dependencies exist among components, such as the influence of carbon dioxide levels on water productivity (Climate) and the connection between green canopy cover and the Soil module. Green canopy cover is affected by air temperature and evapotranspiration (Climate), creating a web of interdependencies (Raes et al., 2009; Steduto et al., 2009). AquaCrop's versatility spans various locations and seasons, facilitating its application in a wide range of contexts. Notably, it has been successfully coupled with remote sensing data, specifically green fractional vegetation cover, to estimate maize growth and total aboveground dry biomass in Belgium (Mohamed Sallah et al., 2019). Its efficacy has also been demonstrated in investigating diverse irrigation treatments in semi-arid tropical areas of India (Umesh et al., 2022) and in exploring the impact of varied soil conditions on maize growth (Shan et al., 2022).
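The biomass relationship described above can be sketched in a few lines of Python, assuming the form B = WP · ΣTr implied by the definitions of B, WP and Tr in the text. The function name, the units, and the numeric values are hypothetical, and the sketch deliberately ignores the dependencies among modules (for example, the effect of carbon dioxide on water productivity), so it should not be read as AquaCrop's actual implementation.

```python
def final_biomass(water_productivity, daily_transpiration):
    """Return B = WP * sum(Tr) for a sequence of daily transpiration values."""
    if water_productivity < 0:
        raise ValueError("water productivity must be non-negative")
    if any(tr < 0 for tr in daily_transpiration):
        raise ValueError("daily transpiration values must be non-negative")
    return water_productivity * sum(daily_transpiration)


# Hypothetical inputs: WP in biomass units per mm of transpired water,
# Tr in mm per day over a five-day window.
wp = 15.0
tr = [2.0, 3.5, 4.0, 4.5, 3.0]

biomass = final_biomass(wp, tr)
print(biomass)  # 255.0
```

Because biomass is proportional to cumulative transpiration in this form, any stress that reduces daily transpiration lowers the predicted biomass by the same proportion.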
Another well-known MDM is the Decision Support System for Agrotechnology Transfer (DSSAT) (Jones et al., 2003). It covers a wide range of applications, such as fertilization management (Si et al., 2021), irrigation management (Malik and Dechmi, 2019), and the assessment of climate change impacts (Hasan and Rahman, 2020). One of the main characteristics of DSSAT is that it has been developed using a modular approach, where each module has a distinct goal and works independently using a different MDM. For instance, the Soil module provides information about soil water using the CERES-Wheat model (Ritchie and Otter, 1985), simulating the daily changes in soil water content due to infiltration of rainfall and irrigation, vertical drainage, unsaturated flow, soil evaporation, and root water uptake. The CROPGRO model (Boote et al., 1998) employs input data regarding crop growth, including optimal temperatures for various developmental stages and information on photosynthesis and nitrogen fixation. It uses this information to simulate outputs such as the emergence day, harvest maturity date, daily senescent plant matter, and other quantities critical for determining plant stress, such as the nitrogen stress factor. The modular structure of DSSAT makes it easy for users to integrate new modules with different goals (e.g., livestock management), even when they are written in different programming languages. There are other MDMs whose structure is based on different sub-models but that pursue the same goal, namely optimal agrosystem management (Brown et al., 2014; de Wit et al., 2019). For pest management, Savary et al. (2012) proposed a compartmental susceptible-exposed-infectious-removed (SEIR) model composed of four compartments, namely healthy (H), latent (L), infectious (I), and post-infectious (P) sites, coupled with other variables such as crop growth, tissue senescence (induced by disease or physiological) and the spatial aggregation of the disease. These compartments are used to simulate rice and wheat diseases (Savary et al., 2015) over a 120-day duration using a daily time step.
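The compartmental structure described above can be illustrated with a deliberately simplified simulation over 120 days with a daily time step. This is a sketch under strong assumptions and not the published model: the transition rules (first-order flows out of the latent and infectious compartments), the parameter names `infection_rate`, `latency_days` and `infectious_days`, and all numeric values are hypothetical, and crop growth, senescence and spatial aggregation are omitted.

```python
def simulate_hlip(days=120, healthy=1000.0, latent=0.0, infectious=1.0,
                  removed=0.0, infection_rate=0.3, latency_days=5.0,
                  infectious_days=10.0):
    """Simulate healthy (H), latent (L), infectious (I) and post-infectious (P)
    sites with a daily time step. Returns a list of (day, H, L, I, P) tuples."""
    h, l, i, p = healthy, latent, infectious, removed
    history = []
    for day in range(days + 1):
        history.append((day, h, l, i, p))
        total = h + l + i + p
        # New infections scale with infectious sites and the healthy fraction.
        new_infections = min(h, infection_rate * i * h / total)
        # Assumption: first-order transitions instead of fixed delays.
        new_infectious = l / latency_days
        new_removed = i / infectious_days
        h -= new_infections
        l += new_infections - new_infectious
        i += new_infectious - new_removed
        p += new_removed
    return history


history = simulate_hlip()
day, h, l, i, p = history[-1]
print(f"day {day}: H={h:.1f} L={l:.1f} I={i:.1f} P={p:.1f}")
```

The total number of sites is conserved at every step, which provides a simple sanity check when parameters are changed.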

Statistical methods in precision agriculture applications
"Pure" statistical methods remain less prevalent in PA applications; however, they continue to play a significant role in specific sectors of agriculture. For instance, Mixed Effects Models (MEM) are commonly employed in genome-wide association studies (GWAS) for crop breeding prediction, as exemplified by Berhe et al. (2021). In GWAS, Principal Component Analysis (PCA) is also frequently used because it reduces data complexity by transforming the data into a limited number of principal components. These components can subsequently be incorporated as covariates in MEM, often to capture population structure (Abdi et al., 2023). PCA's suitability for various applications, including genotype-by-environment interaction analysis and trait selection for yield modeling, further underscores its importance (Abdipour et al., 2019; Ahakpaz et al., 2021). In soil mapping, geostatistical techniques such as regression kriging remain prominent because they account for spatial autocorrelation, a factor not fully embraced by many ML methods (Heuvelink and Webster, 2022). Finally, within crop yield prediction and disease detection studies, statistical methodologies such as regression models (Chen et al., 2020; Kodaty and Halavath, 2021) and Bayesian networks (Kocian et al., 2020; Singh and Gupta, 2020) have been proposed. Table 1 reports the studies cited above, including the goal of each study, the variables and the method used.

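As an illustration of how PCA condenses marker data into a component that can serve as a covariate for population structure, the sketch below extracts the first principal component by power iteration on the sample covariance matrix. Everything here is an assumption made for illustration: the 0/1/2 marker matrix is invented, the function name is hypothetical, and power iteration is used only to keep the example dependency-free (a real analysis would normally rely on a numerical library).

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    Assumes the data have non-zero variance and that the starting vector
    is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered observation onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical genotypes: 6 lines x 4 markers, coded 0/1/2,
# built so that the first three and last three lines form two groups.
markers = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 1, 0],
]

direction, scores = first_principal_component(markers)
print([round(s, 2) for s in scores])
```

The scores of the two groups fall on opposite sides of zero, which is the kind of structure a mixed model can absorb when the component is added as a covariate.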
Discussion
The objective of this concise review is to offer a comprehensive overview of the prevailing data science methodologies, highlighting their popularity and significance in the field. An extensive portion of the literature focuses on machine learning and deep learning because black-box (opaque) AI methodologies may require less work from experts, albeit at the price of much more computational work because of the large sample sizes required. Mechanistic-deterministic models make up the other part of the literature, with applications ranging from fertilization management to disease prediction, but they often neglect inferential uncertainty, with the risk of overconfident inferential statements. MDMs clearly offer significant advantages in agrosystem management, enabling predictions across various scenarios of interest. Achieving this predictive power often requires calibration, which entails identifying optimal, context-specific parameter values (input values) for solving the underlying equations. These parameter [...] (Kennedy and O'Hagan, 2001).
In crop modelling with MDMs, trial and error is the most used calibration procedure (Della Nave et al., 2022; Rai et al., 2022; Terán-Chaves et al., 2022; Alvar-Beltrán et al., 2023), with authors using historical data or running new experiments to achieve their prediction goals. Statistical procedures can be employed in the input value selection phase to facilitate uncertainty quantification in predictions. However, their application within these studies remains limited, in part because of the complexity of these techniques, but also because of the prominent role that the adopted calibration method plays in the resulting prediction errors (Gao et al., 2020). The literature cited in this work highlights the limited number of contributions dealing with statistical methodologies in the PA field, particularly for crop yield prediction and disease detection; future work might consider the quantitative integration of experts' degree of belief into agricultural decision-making processes (Valleggi et al., 2023, 2024). This first step also seems helpful for fully harnessing the power of modern structural causal models (Pearl, 2009) and improving decision-making in PA (Stefanini and Valleggi, 2022).

References
Bhatia, A., Chug, A., Singh, A. P., Singh, R. P., and Singh, D. (2022). A machine learning-based spray prediction model for tomato powdery mildew disease. Indian Phytopathol. 75, 225-230. doi: 10.1007/s42360-021-00430-3
Bi, L., and Hu, G. (2021). A genetic algorithm-assisted deep learning approach for crop yield prediction. Soft Comput. 25, 10617-10628. doi: 10.1007/s00500-021-05995-9
Bisbis, M., Gruda, N., and Blanke, M. (2018). Potential impacts of climate change on vegetable production and product quality - A review. J. Cleaner Production 170, 1602-1620. doi: 10.1016/j.jclepro.2017.09.224

FIGURE 1
Distribution of precision agriculture publications by domain, as reported by Liakos et al. (2018).

TABLE 1
AI studies in precision agriculture.