Performances of Machine Learning Algorithms in Predicting the Productivity of Conservation Agriculture at a Global Scale

Assessing the productive performance of conservation agriculture (CA) has become a major issue due to growing concerns about global food security and sustainability. Numerous experiments have been conducted to assess the performance of CA under various local conditions, and meta-analysis has become a standard approach in agricultural sector for analysing and summarizing the experimental data. Meta-analysis provides valuable synthetic information based on mean effect size estimation. However, summarizing large amounts of information by way of a single mean effect value is not always satisfactory, especially when considering agricultural practices. Indeed, their impacts on crop yields are often non-linear, and vary widely depending on a number of factors, including soil properties and local climate conditions. To address this issue, here we present a machine learning approach to produce data-driven global maps describing the spatial distribution of the productivity of CA versus conventional tillage (CT). Our objective is to evaluate and compare several machine-learning models for their ability in estimating the productivity of CA systems, and to analyse uncertainty in the model outputs. We consider different usages, including classification, point regression and quantile regression. Our approach covers the comparison of 12 different machine learning algorithms, model training, tuning with cross-validation, testing, and global projection of results. The performances of these algorithms are compared based on a recent global dataset including more than 4,000 pairs of crop yield data for CA vs. CT. We show that random forest has the best performance in classification and regression, while quantile regression forest performs better than quantile neural networks in quantile regression. The best algorithms are used to map crop productivity of CA vs. CT at the global scale, and results reveal that the performance of CA vs. CT is characterized by a strong spatial variability, and that the probability of yield gain with CA is highly dependent on geographical locations. This result demonstrates that our approach is much more informative than simply presenting average effect sizes produced by standard meta-analyses, and paves the way for such probabilistic, spatially-explicit approaches in many other fields of research.


INTRODUCTION
Increasing food production and its stability over time becomes more difficult due to the negative effects of climate change on agricultural systems (Renard and Tilman, 2019;Ortiz-Bobea et al., 2021). The development of sustainable cropping systems, such as conservation agriculture (CA), has been proposed as a path to increase food security (Pradhan et al., 2018), preserve biodiversity (González-Chávez et al., 2010;Page et al., 2020), and increase the resilience of agriculture to climate change (Kassam et al., 2009;Michler et al., 2019). Numerous experiments have been conducted to compare the productivity of different farming practices or cropping systems under a diversity of soil and climate conditions. The wealth of experimental data available offers an opportunity to identify the most efficient practices and systems based on robust scientific evidence. In this context, meta-analysis has become a standard method for analysing experimental agricultural data and estimating mean effect sizes as a way of summarizing the performances of cropping systems. Specifically, several meta-analyses were conducted during the past decade to estimate the average performances of CA compared to CT (Pittelkow et al., 2015a;Pittelkow et al., 2015b;Corbeels et al., 2020) showing conflicting results on the relative performance of CA. Although meta-analysis is a powerful tool to analyse large experimental datasets, this approach has several limitations (Fitz-Gibbon, 1984;Eisler, 1990;Flather et al., 1997). One of them is that while mean effect sizes can summarize experiments conducted in contrasting conditions and account for the average performance of a practice or system, they cannot provide a detailed description of the variability induced by local conditions (Walker et al., 2008;Goulding et al., 2011;Krupnik et al., 2019). This is an important limitation for the analysis of agricultural production because crop yields are highly dependent on the local climate conditions (Pittelkow et al., 2015a;Pittelkow et al., 2015b;Steward et al., 2018), soil characteristics (Holland, 2004;Govaerts et al., 2007;Farooq and Siddique, 2015), and agricultural management practices (Scopel et al., 2013;Pittelkow et al., 2015a;Pittelkow et al., 2015b;Farooq and Siddique, 2015;Su et al., 2021a), which often vary in time and space. This makes it hard for standard meta-analyses to provide accurate predictions for a given geographical region.
To gain further insight and overcome this limitation, we define a new approach to analyse large experimental agricultural datasets based on standard machine learning algorithms. These algorithms have proven their usefulness over the last few years and are now widely used in numerous areas to process and analyse complex, heterogeneous and highdimensional data (Schmidt et al., 2019). Here, we have relied on these algorithms to develop a machine learning pipeline ( Figure 1) that standardizes the process of comparing the performance of different cropping systems and mapping them at the global scale. The proposed framework includes several steps covering algorithms selection, model training, model tuning by cross-validation, model testing, and global projection of results. The value of this pipeline is illustrated using a recent global crop yield dataset (Su et al., 2021b) comparing cropping systems under CA systems (and their variants) and CT systems. Twelve different machine learning algorithms (See Supplementary Table S1 for the details) are applied to train for classification, regression and quantile regression models. These models are used to analyse the yield ratio of CA vs. CT ( Yield CA Yield CT ), relative yield change ( Yield CA −Yield CT Yield CT ) and the range of relative yield change as a function of climate variables, geolocations, soil characteristics, crop management practices. The results are used to identify the most suitable machine learning algorithms to assess the performance of a range of no-tillage systems from continuous and categorical data. The proposed pipeline can be easily adapted to analyse land management systems in general or to compare and identify suitable machine learning algorithms for other types of datasets.
Moreover, we also developed a new evaluation metric for quantile regression, error score (ES), which can assess the overall model performance for the interval prediction ability for all quantiles, rather than simply using the traditional prediction interval coverage probability (PICP) that can only evaluate the interval prediction ability for a specific prediction interval (1 − α).

Dataset Establishment
The literature search was conducted in February 2020 using the keywords 'conservation agriculture or no-till or no tillage or zero tillage and 'yield or yield change' in the websites 'ScienceDirect' and 'Science Citation Index'. The details of the paper screening and data collection procedure (Supplementary Figure S1) were described in previous publications (Su et al., 2021a;Su et al., 2021b). The final dataset includes 4,403 paired yield observations for no tillage (NT) and CT under different farming practices for eight major crop species, which were extracted from 413 papers (published between 1983 and 2020). The experimental sites cover 50 countries from 1980 to 2017. This dataset contains both numerical data (such as climatic conditions, year of NT implementation) and categorical data (such as soil characteristics, crop species and different farming practices).
As shown in Figure 1, all the models are trained based on 80% of the dataset, while the rest 20% is used for model testing. And these models are trained based on the model inputs that describing climate conditions over crop growing seasons (precipitation balance, minimum temperature, average temperature, and maximum temperature), soil texture, agricultural management practices (crop rotation, soil cover, fertilization, weed and pest control, irrigation and CA/NT implementation year) and location (latitude and longitude).
Note that CA is defined as NT with soil cover and with crop rotation based on the FAO's definition (Food and Agriculture Organization of the United Nations FAO, 2021). The brief description of algorithms and packages used were listed Supplementary Table S1.

Model Tuning With 10-Fold Cross-Validation
The hyperparameters of all the algorithms except GLM are tuned by searching the hyperparameter space (grid search (Chan and Treleaven, 2015), see Supplementary Table S2) for the best score of 10-fold cross-validation. In detail, the model performance under current setting of hyperparameters is calculated from 10-fold cross-validation. In each fold of the FIGURE 1 | Proposed machine learning pipeline for predicting the performance of cropping systems and comparing different algorithms.
Frontiers in Environmental Science | www.frontiersin.org February 2022 | Volume 10 | Article 812648 3 cross-validation, 90% of the training dataset are used to train the model, and 10% of the training dataset is used to test the model performance. This testing dataset is not overlapping in each fold and covers the whole dataset after the crossvalidation. The cross-validation procedure was repeated for all the hyper-parameters within the searching grid, after which the hyperparameters that give the best performance are selected for use in model testing stage. As for GLM, the final model is selected with a stepwise algorithm (Venables and Ripley, 2002) implemented using the step function (from the 'stats' package, version 4.0.4 in R) run with AIC (Aho et al., 2014).
In cross-validation, the performance of classification model is assessed based on the area under the receiver operating characteristics curve (AUC) (Fernández, 2018). AUC corresponds to the probability that a classifier can rank a positive instance (yield gain in this case) higher than a negative one (yield loss). Thus, a higher AUC indicates a superior classification performance by the model (Fernández, 2018;Su et al., 2021a). The performance of quantitative predictions of relative yield change ( YieldCA−YieldCT YieldCT ) is assessed using the root-mean-square error (RMSE), which estimates the differences between predictions and the observations. Lower RMSE values indicate a better model (Carpenter, 1960). Quantile regression models were assessed by comparing the coverage probability (PICP) of the range of values defined by the quantiles from α 2 and (1 − α 2 ) to its nominal target prediction interval (1 − α) (Newcombe, 1998;Landon and Singpurwalla, 2008;Wang, 2008;Khosravi et al., 2010). Here we set (1 − α) 0.8, thus, the model with the PICP value closest to 0.8 is selected as the final model (see Supplementary Figure S2).
The performances from cross-validation for all the models are presented in Supplementary Table S3-5.

Model Testing
The performances of the trained models are determined using an independent testing dataset including 20% of the total number of data ( Figure 1) using the criteria AUC (Fernández, 2018) and RMSE (Carpenter, 1960) for classification models and regression models, respectively. For quantile regression models, the prediction interval coverage probability (PICP) can only represent the interval prediction ability of the model at the range defined by the quantiles from α 2 and (1 − α 2 ) (Newcombe, 1998;Landon and Singpurwalla, 2008;Wang, 2008;Khosravi et al., 2010). To assess the overall performance for all quantiles, we generalized this approach by plotting PICP vs. its corresponding prediction interval (1 − α) for (1 − α) ranging between 0 and 1, then comparing the resulting curve to the reference line (1: 1 line). The error score (ES) is then defined by Eq. 1 and can be written as the area between the curve of PICPs vs. (1 − α) and the 1:1 line, divided by 0.5, which serves here as a benchmark. This value of 0.5 is reached when the PICP is independent from α. The criterion ES measures the overall performance of the model over all quantiles. Smaller ES values indicate better model interval prediction ability.

Error Score
Area between the curve of PICPs vs.
(1 − ∝ ) and 1: 1 line 0.5 × 100% (1) Interval scores (IS) (Gneiting and Raftery, 2007;Papacharalampous et al., 2019) for (1 − α) 0.5, 0.8 and 0.9 were also calculated for different quantile regression models to cross-check the relevance of this new criterion (Supplementary Figure S3). The interval score is defined by Eq. 2, where the upper (u) and lower (l) endpoints are the predicted quantiles at levels α 2 and (1 − α 2 ), respectively. The first term of the IS indicates the range of prediction interval, while the second and third terms are the penalty scores added to this range when the observations are outside the prediction interval. For a given (1 − α), a smaller IS indicates a better model performance.
Interval Score (2) The final performances from model testing for all the models are presented in Supplementary Table S3-5.

Global Projection
The algorithms with the best final model performance are used for classification (yield gain vs. loss), for quantitative prediction (ratio of relative yield change of CA vs. CT), and for computing intervals (intervals of relative yield change ratios). Model outputs include the probabilities of yield gain with CA for the classification models, the relative yield change with CA for regression models, and the 10th and 90th percentiles of relative yield change for the range regression models. These outputs are mapped at the global scale with climate data over the 1981-2010 time slice at a spatial resolution of 0.5°latitude × 0.5°longitude (average air temperature (NOAA/OAR/ESRLPSL, 2020), maximum air temperature (NOAA/OAR/ESRLPSD, 2020), minimum air temperature (NOAA/OAR/ESRLPSD, 2020), precipitation (NOAA/OAR/ ESRLPSL, 2020), potential evapotranspiration (Miralles et al., 2011;Martens et al., 2017)). As for the farming practices, we set up CA as a combination of no tillage with crop rotation and with soil cover, and CT as a combination of tillage without crop rotation and without soil cover. Both systems are incorporated with fertilizer utilization and weed and pest control. All maps are generated with MATLAB R2020a (Version 9.8.0.1451342). Details of model settings for global projection and the sources of climate data and soil data are listed in Supplementary Table S6.  The regression models achieved RMSEs larger than 0.25, and R squared (R 2 ) of all algorithms under 0.6 ( Figure 3A, Supplementary Table S4), revealing low to moderate explanatory power. Among these algorithms, the best performance is achieved by RF, with a R 2 of 0.52, and a RMSE of 0.25, while GLM has the worst performance, with a R 2 of 0.12, and a RMSE of 0.34 ( Figure 3A, Supplementary  Table S4). Figures 3B,C show the scatterplot of observations versus predictions of relative crop yield change from the RF and GLM models, respectively. Predictions with GLM are poorly related to observations, while RF is able to explain a substantial fraction of the total variability. For range regression models, different rankings are obtained depending on whether outliers in the dataset were included or not (Figure 4). With outliers included, the best performance is obtained with QRF (ES 1.73%), while QRGBM performs better when the outliers were filtered (ES 1.38%). QRNN performs better with one than with two hidden layers, but never outperforms QRF and QRGBM (with ES equals to 5.35 and 15.69% without and with outliers, respectively). The results of interval scores for different prediction intervals (0.5, 0.8 and 0.9) also show that QRF has the best performance among all algorithms (Supplementary Figure S3). QRGBM has a relatively low ES ( Figure 4) and a relatively high IS (Supplementary Figure S3A) when the outliers are included. This indicates that, for QRGBM, the observations fall within the prediction interval in expected proportions (i.e., 0.5, 0.8 and 0.9), although relatively large errors can occur for data falling outside the prediction interval.

RESULTS
The productivity of CA vs. CT systems for spring barley was mapped at the global scale. To reveal the differences among models, we mapped the probability of yield gain ( Figures 5A,B) and relative yield change (Supplementary Figure S4) of CA vs. CT based on the results obtained with the best (RF) and the worst (GLM) algorithms. Results obtained with both algorithms show that -with fertilizer inputs and an appropriate control of weeds and pests -the probability of yield gain with CA vs. CT is higher than 0.5 in western North America, central Asia, and many regions in the east and central Africa, while the probability is lower in eastern North America and Europe ( Figure 5A, Figure 5B). However, there are many inconsistencies in the predictions of the two algorithms in other regions. For example, according to RF, the use of CA instead of CT would most likely lead to yield loss (with a probability of yield gain under 0.5) in south America, and in most regions within eastern and southern Asia, while opposite results are provided by GLM in those regions ( Figure 5A, Figure 5B, respectively). This contradiction reveals that the choice of an inappropriate model (such as GLM, here) would lead to wrong conclusions, highlighting the importance of the model selection step in our procedure. This is confirmed by the relative yield changes of CA vs. CT mapped with the RF and GLM models in Supplementary Figure S4. According to RF, yield gains are expected when shifting from CT to CA in western North America, central Asia, and many regions in the east and central Africa, while yield losses are predicted over most of Europe (Supplementary Figure S4A). This is in line with the results from RF classification model ( Figure 5A). Conversely, according to the results of the GLM regression model, yield gains are predicted in eastern North America and Europe, while yield losses are predicted in central Asia (Supplementary Figure S4B), which is not consistent with the GLM classification model ( Figure 5B) and RF models ( Figure 5A, Supplementary Figure S4A).
To describe the uncertainty associated with this projection, we plotted the relative yield change of CA vs. CT that corresponding to 10th and 90th percentiles ( Figures 5C,D). We show that there is 10% chance that the relative yield change of CA vs. CT will be higher than 0.45 in western North America, southern South America, eastern and central Africa ( Figure 5D), while there is 10% chance that relative yield changes be lower than 0.35 in part of western North America, central Asia, and northern China ( Figure 5C). We also plotted the differences between the 10th and 90th percentiles, and the results show that the uncertainty in western North America, central Africa, central Asia are relatively larger than other regions (Supplementary Figure S5).
SHAP dependence plot (Roth, 1988) was generated with respect to two control variables: the precipitation balance (PB precipitation -evapotranspiration) (Steward et al., 2018;Su et al., 2021a;Su et al., 2021b;Su et al., 2021c) and the number of years since the switch to no tillage (NTyear). This is an alternative to partial dependence plots (PDP), and a means to assess how PB and NTyear would affect the performance of CA. While PDPs show average effects, the SHAP dependence plot also shows the variance on its y-axis. The results show that a relatively lower PB or a longer period of no tillage implementation is likely to improve the performance of CA compared to CT (Supplementary Figures S6, S7, respectively).

DISCUSSION
In this study, we trained a broad set of machine learning (ML) models for different purposes: classification, quantitative prediction, and range prediction. This is the first time that 12 ML algorithms are implemented and compared in an application dealing with a major, global issue for agriculture. And we developed a new evaluation metric for quantile regression, the error score, based on the traditional coverage probability of a specific prediction interval, which enables us to assess the overall interval prediction ability for all quantiles and the interval prediction ability is for each prediction interval. It is a more accurate and comprehensive evaluation metric than coverage probability in evaluating and comparing quantile regression models.
Here we produced global maps obtained with the most accurate algorithms, which reveal a strong geographical variation (Sun et al., 2020;Su et al., 2021a;Su et al., 2021c) of the probability of yield gain with CA, and of the predicted relative yield change resulting from its adoption over convention tillage.  CT under 10th and 90th percentiles, respectively, based on QRF algorithm. There was a 90% chance that the relative yield change will be higher than the ratio shown on the map in plot c, and conversely a 10% chance that the relative change will be lower. This result shows that the mere presentation of an average effect size -as often done in standard meta-analyses -does not provide sufficient information on the performance of one cropping system compared to another. Contrary to standard meta-analyses, our approach can be used to describe the variability of this relative performance and to identify geographical areas where one cropping system outranks the other with a higher spatial resolution. This is an important advantage in a context where the choice of cropping systems should be adapted to the local context to provide optimal performance. Here, the global maps generated from our machine learning approach reveal that CA can be competitive in western North America and central Asia, in particular in dry regions, where CA tends to have better performance (Supplementary Figure S6). The result is consistent with recent studies (Pittelkow et al., 2015a;Pittelkow et al., 2015b;Su et al., 2021a), and related to the fact that no tillage and soil cover can reduce soil evapotranspiration (O'Leary and Connor, 1997;Nielsen et al., 2005;Page et al., 2020) and increase the water holding ability (Liu, 2013), which help the crops coping with agricultural drought, thus, potentially increasing the crop yield and CA competitiveness in dry regions.
Our comparative analysis shows that, RF has the best performance for both classification and quantitative prediction, followed by GBM, XGBOOST, ANNs, SVM, and KNN, while GLM and NB have the worst performance compared to other algorithms. The reason why RF and GBM perform better is likely linked to the type of data used for model training. As many agricultural datasets, our dataset includes both quantitative and categorical features. RF, GBM, SVM are able to handle such heterogeneous features, while ANNs, KNN, XGBOOST work usually better with numerical and homogeneous inputs (Ali et al., 2019;Zhou, 2021). Converting categorical features to numerical will also increase the number of feature inputs, it will dramatically increase the computational time for, e.g., ANNs (Nataraja and Ramesh, 2019). Moreover, the numerical features in our dataset are in very different scales, it is necessary to normalize them for ANNs and KNN to reduce effects of disparate ranges (Sarle et al., 1991;Han et al., 2011;Ali et al., 2019). Concerning SVM, it gives poor performance with large dataset (Nataraja and Ramesh, 2019). Finally, VB and GLM are often inaccurate when the data are heterogeneous and when the responses are non-linear (Nataraja and Ramesh, 2019). Other studies from agriculture-related sectors also reported similar trend in the performance of machine learning algorithms. For example, Cao et al. (2021) reported that RF had better performance than ANNs in wheat yield prediction. Rahmati et al. (2020) revealed that RF had better classification ability than SVM when predicting agricultural droughts in Australia. Dubois et al. (2021) reported that RF and SVM had better quantitative prediction ability than ANNs for forecasting soil moisture in the short-term. In other fields, Uddin et al. (2019) unveiled that in disease detection, RF had the highest chance to show excellent classification capability (with an AUC over 0.8), followed by SVM, NB, ANNs, KNN. However, RF is not systematically ranked first in previous studies (Bozdağ et al., 2020;Schwalbert et al., 2020;Khanam and Foo, 2021) and the performance of different algorithms may shift depending on the type of applications and data provided. It is therefore essential to evaluate a range of different candidate algorithms for each application, and not to systematically rely on the same approach. Our methodological framework appears very useful in this context because it allows the comparison of several algorithms on an objective basis.
Concerning interval prediction, it is reported that the performances of quantile regression algorithms often vary depending on the quantiles considered (Newcombe, 1998;Wang, 2008;David et al., 2018;He et al., 2020;Córdoba et al., 2021). This highlights the importance of evaluating these algorithms for a wide range of quantiles. In this perspective, the new evaluation metric we developed for quantile regressionerror score (ES) -is meaningful and can be used to assess any quantile regression model over the whole range of quantiles. This criterion can be used to select or tune algorithms performing over a large range of quantiles and not only for specific quantile values.
In this study, we prove that the experimental data collected from published studies can be used to conduct more complex analyses via machine learning techniques (such as the random forest algorithm) than those done in standard meta-analyses, usually based on linear models. The maps created from the machine learning pipeline we proposed here provides detailed geographical information about the performance of one agricultural system compared to a reference baseline, and this pipeline can be easily adapted to analyse a diversity of outcomes or land management systems. These may involve the effects of crop management practices on soil organic carbon dynamics, greenhouse gas emissions, biodiversity, etc. for different types of cropping systems, such as organic agriculture or agroforestry, and thus provide valuable information on the local performance of sustainable farming practices together with a global perspective.

CODE AVAILABILITY STATEMENT
All the R and MATLAB codes are available under request from the corresponding author.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://doi.org/10.6084/ m9.figshare.12155553.

AUTHOR CONTRIBUTIONS
YS wrote the main manuscript text, HZ, BG and DM modified it; YS collected the data; All authors designed the machine learning pipeline together; YS and HZ worked on the model training, model cross-validation, model testing, and model comparison; YS, BG and DM worked together to prepare the figures and tables; All authors reviewed the manuscripts.