Plant Genotype to Phenotype Prediction Using Machine Learning

Danilevicz, Monica F.; Gill, Mitchell; Anderson, Robyn; Batley, Jacqueline; Bennamoun, Mohammed; Bayer, Philipp E.; Edwards, David

doi:10.3389/fgene.2022.822173

REVIEW article

Front. Genet., 18 May 2022

Sec. Genomics of Plants and Plant-Associated Organisms

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.822173

Plant Genotype to Phenotype Prediction Using Machine Learning

1. School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
2. School of Physics, Mathematics and Computing, University of Western Australia, Perth, WA, Australia

Article metrics

View details

Citations

21k

Views

4,9k

Downloads

Abstract

Genomic prediction tools support crop breeding based on statistical methods, such as the genomic best linear unbiased prediction (GBLUP). However, these tools are not designed to capture non-linear relationships within multi-dimensional datasets, or deal with high dimension datasets such as imagery collected by unmanned aerial vehicles. Machine learning (ML) algorithms have the potential to surpass the prediction accuracy of current tools used for genotype to phenotype prediction, due to their capacity to autonomously extract data features and represent their relationships at multiple levels of abstraction. This review addresses the challenges of applying statistical and machine learning methods for predicting phenotypic traits based on genetic markers, environment data, and imagery for crop breeding. We present the advantages and disadvantages of explainable model structures, discuss the potential of machine learning models for genotype to phenotype prediction in crop breeding, and the challenges, including the scarcity of high-quality datasets, inconsistent metadata annotation and the requirements of ML models.

Introduction

The expansion of genome sequencing technology has led to a rapid growth in plant genomic resources, including advanced genome and pangenome assemblies, providing a better understanding of plant genetic variation. Genotyping methods including genome resequencing (Morrell et al., 2011; Hu et al., 2018; Juliana et al., 2018) and single nucleotide polymorphism (SNP) arrays (Wang et al., 2014; Mason et al., 2017) have supported the expansion of crop genomic resources, assisting the identification of genetic variations related to agronomic traits. The emergence of plant pangenomes as a reference better represents the genomic variability within a species (Bayer et al., 2020; Danilevicz et al., 2020; Golicz et al., 2020), allowing researchers to expand genotypes from SNPs and indels to include gene presence absence variation, a structural variation that has been associated with crop disease resistance and stress tolerance (Golicz et al., 2016; Montenegro et al., 2017). Plant breeders now have a wide variety of tools available to assess genetic variation, however the most efficient way to apply these for crop improvement is often unclear (Heffner et al., 2009; Scheben et al., 2017; Nuccio et al., 2018). Genomic selection (GS) using best linear unbiased prediction (BLUP) was developed for livestock breeding (Meuwissen et al., 2001), but has seen success in plant breeding (Cooper et al., 2014; Gaffney et al., 2015; Crossa et al., 2017; Lebedev et al., 2020; Ukrainetz and Mansfield, 2020). This approach takes a subpopulation of breeding material, and through linear modelling, estimates the contribution of each SNP to phenotypes of interest. Due to the simplicity of the modelling, BLUP is straightforward to implement, and the contribution of each SNP to the phenotype is relatively easy to calculate. However, compared to animal breeding, genomic selection in plants has to account for the greater genotype by environment interactions, and requires the addition of appropriate multi-environment trial data (Voss-Fels et al., 2019).

Machine Learning (ML) and Deep Learning (DL) algorithms are more complex than linear prediction models and can discover non-linear relationships within datasets. The use of complex ML and DL models to associate plant genotypes with phenotypes is gaining popularity, with an increasing number of publications predicting a wide variety of agronomic traits, including yield, days to heading and thousand kernel weight (Montesinos-López A. et al., 2018, Montesinos-López et al., 2018 O. A.; Ma et al., 2018; Crossa et al., 2019; Khaki et al., 2019; Grinberg et al., 2020). Random forests, support vector machines and artificial neural networks may more easily capture the complex relationships between genotype, phenotype and environment compared with previous methods due to their non-linearity, and ML and DL have significant potential to support plant breeding. Here we review the progress that has been made applying ML and DL methods, and how to interpret their results within a biological context.

Machine Learning Methods for Genotype to Phenotype Prediction

Genotype to phenotype prediction has expanded with the application of GS. Genomic Best Linear Unbiased Prediction (GBLUP), a linear based modelling system has been used extensively in GS (Meuwissen et al., 2001), as have Bayesian systems (Pérez et al., 2010). Despite the successes of linear methods in GS, they can run into challenges due to the high dimensionality of marker data versus the number of individuals, and the presence of complex relationships that are difficult to account for (Crossa et al., 2017).

To improve on linear models in GS, there has been an increased use of nonlinear methods such as ML and DLmodels to predict plant phenotypes (Montesinos-López et al., 2021a). Nonlinear methods have been theorised to be better able to capture small interactions between markers, account for environment interactions and generate predictions with higher accuracy for data with high dimensionality (Pérez-Rodríguez et al., 2012; Cuevas et al., 2016). The ML and DL architectures can also include multimodal data and data types that are not suited to simple tabular formats (Figure 1). For example, the ML method random forest can capture patterns in high dimensional data to deliver accurate predictions and can also take into account non-additive effects (Heslot et al., 2012). Its use as a model for genomic selection has also demonstrated superior performance in comparison to linear models such as Bayesian Least Absolute Shrinkage and Selection Operator (LASSO) (Ornella et al., 2014) and Ridge Regression BLUP (RR-BLUP), depending on the genetic architecture of the predicted trait (Spindel et al., 2015). Other ML models that have shown potential for genomic selection include convolutional neural networks and feed forward deep neural networks that can outperform linear methods with correct optimisation of hyperparameters (Ma et al., 2018; Sandhu et al., 2020). Multi-trait DL models can help understand the relationship between related traits for improved prediction (Montesinos-López O. A. et al., 2018), and ensemble models that are a powerful way of combining multiple ML methods that may produce weaker predictions by themselves (Jubair and Domaratzki, 2019; Banerjee et al., 2020).

FIGURE 1

Whether ML and DL approaches improve prediction compared to traditional GBLUP for genomic selection is still unclear. Several publications compare DL models with linear ML models, but not with GBLUP directly, and many breeding methods use BLUP values, whereas ML and DL studies tend to predict single traits. In an example where GBLUP was compared to DL methods for multi-trait prediction, when the genotype by environment interaction term was included in the GBLUP modelling, it outperformed DL across 8 of the 9 datasets, whereas when it was not included, the DL model outperformed GBLUP in 6 of the 9 datasets (Montesinos-López A. et al., 2018). Another study observed that their DL model outperformed linear and Bayesian prediction models when predicting under strong epistatic interactions, using SNPs from octoploid strawberry and tetraploid blueberry (Zingaretti et al., 2020). As with many problems that involve algorithm choice, the no free lunch theorem suggests that no one algorithm will perform the best across all problems, and this seems to be the case for genomic prediction (Azodi et al., 2019). DL is an efficient method to extract representative features from large datasets, with the capacity to account for feature interaction effects. However, conventional ML methods and mixed linear models are still well suited to deal with limited datasets, in many cases generating more accurate predictions than DL models. We propose that phenotype prediction should be expanded beyond GBLUP methods to ensure sufficient models are evaluated for each genomic selection problem.

Genetic Variation Representation

The most common form of encoding whole genome SNP data for ML and DL, is to use one hot encoding, where each SNP position is represented by four columns, each one representing the four bases of DNA: A, T, C, and G. Presence of the base at each position is indicated by a 1 in this column, whereas absence is indicated by 0 (Zou et al., 2019). In this way a letter can be encoded as a binary representation, suitable for ML and DL processes, that only accept numeric input. SNP one hot encoding is one of the most common data representations for DNA sequence data in phenotype prediction (Montesinos-López A. et al., 2018, Montesinos-López et al., 2018 O. A.; Ma et al., 2018; Khaki et al., 2019; Grinberg et al., 2020).

In ML and DL, the number of features is usually lower than the number of samples, however it is common in trait association studies that the number of features significantly outnumbers the samples. Genotyping usually generates large numbers of genetic markers and phenotyping large cohorts of plants can be prohibitively expensive, restricting the number of samples. Feature selection, minor allele frequency and genome wide association study (GWAS) can be applied to reduce the dimensionality of datasets and remove excess information, presenting varying degrees of popularity in plant breeding. In human research, some studies have used minor allele frequency, position within promoter regions (Yin et al., 2019), selective SNP number reduction, and integrated transcription data (Zhou et al., 2019), where previous association results are used to strategically reduce SNP numbers. Other strategies to reduce SNP numbers include a focus on rare variants, where loss of function variants have been noted for their influence in phenotype (Monroe et al., 2020) and using GWAS to select regions of interest, though this has had mixed results (Spindel et al., 2016; Rice and Lipka, 2019; Jeong et al., 2020).

Genotype encoding in plants has been mostly limited to SNP encoding, however there are other forms of genomic variation, and different ways of encoding genetic variation data that could be employed for phenotype prediction. A recent article used methylation at cytosine positions to predict traits in Populus balsamifera, input cytosines were encoded using variable importance scores and individual methylation intensity, The results show that biomass and physiochemical wood traits can be modelled using methylation data and DL (Champigny et al., 2020). Outside of plant research, genomic k-mers were used to encode genomic variability and employed to predict antibiotic resistance in multiple species of pathogenic bacteria, using classification and regression trees and set covering machines algorithms for interpretability (Pérez et al., 2010; Drouin et al., 2019). In Chen et al. (2019), rare genetic variants from a Mycobacterium tuberculosis dataset were grouped by locus, using binary encoding to account for mutation presence/absence and feature labelling to encode the type of mutation (e.g., SNP, insertion/deletion, change in frameshift within a coding region). The dataset was used to predict antibiotic resistance in the pathogen. improving prediction for both logistic regression models and neural networks. The expansion of plant pangenome assemblies provides an opportunity to include gene presence/absence variation in trait prediction, encoded in a similar way to SNPs (Golicz et al., 2016; Jin et al., 2016; Hurgobin and Edwards, 2017; Gao et al., 2019). The way researchers choose to encode the genomic information and use the associated data (environmental or gene expression data) is an important aspect of trait association analysis, and a wider variety of encoding methods should be explored.

Integrating High Throughput Phenotyping Into Genotype to Phenotype Models

Genotype to phenotype models are commonly applied to sparsely collected phenotypic traits such as plant height and germination rate, and these traits are often hand-collected, potentially introducing bias and increasing experimental cost. In contrast, high throughput phenotyping allows monitoring hundreds or thousands of plants under field and greenhouse conditions (Tattaris et al., 2016; Shakoor et al., 2017), performing non-destructive measurement of traits such as biomass (Quirós Vargas et al., 2019; Marino and Alvino, 2020), plant height (Anderson et al., 2020), wheat spike count (Hasan et al., 2018; Xiong et al., 2019), and disease (De Castro et al., 2015; Nagasubramanian et al., 2018). The increased density of phenotypic data produced by high throughput phenotyping enables researchers to dynamically measure changes in plant growth, evaluating the impact of genomic variation at different developmental stages. For example, a study using images collected weekly by an unmanned aerial vehicle observed that two SNPs related to maize height had a larger effect during early growth (10–25 cm), with their contribution to this phenotype decreasing towards the end of the season (Adak et al., 2021). In rice, high throughput imagery was used to calculate six vegetation indexes that were used as input traits for the identification of eight quantitative trait loci (QTLs), one of which was corroborated by another analysis using ground-truth agronomic data (Naito et al., 2017). Similarly, vegetation indices and canopy temperature were employed in the identification of QTLs related to abscisic acid concentration and other physiological traits in cotton using inclusive composite interval mapping for multi-environment trials (Pauli et al., 2016). Vegetation indices and temperature were also used to support genomic selection for grain yield in wheat (Rutkoski et al., 2016). These studies show the potential of incorporating raw or extracted features from image data to detect genomic variation associated with desirable traits. Intermediate phenotypes such as transcriptome, proteome or metabolome data can also be associated in a multidimensional dataset, providing a more detailed description of the plant response to environmental conditions, and potentially increasing phenotype prediction accuracy (de Abreu E Lima et al., 2018; Voss-Fels et al., 2019; Scossa et al., 2021).

The addition of multidimensional datasets can increase the complexity of analysis exponentially, requiring algorithms that are able to uncover the relationships between the data types and the target trait. DL models have shown success in dealing with complex multimodal datasets, being effectively applied to generate image descriptions (Hossain et al., 2021), predict injuries in sports (Song et al., 2021) and disease detection from medical datasets (Khan et al., 2020; McKenzie et al., 2020; Fan et al., 2021). Recently, several studies were published using DL for trait prediction using high throughput plant phenotyping images as input. These take advantage of convolutional neural networks for extracting the spectral features of the leaf reflectance for disease classification (Picon et al., 2018; Nagasubramanian et al., 2019), wheat spike segmentation and counting (Misra et al., 2020), and QTLs related to root architecture traits (Pound et al., 2017). A wide diversity of DL architectures that have been developed to address problems in other areas of research can be adapted for use in crop breeding. For example, multimodal DL models are composed of multiple models, each trained using a single input type (e.g., rainfall, soil measurements, genetic data, hyperspectral imagery), or a single model trained on concatenated multimodal data (Figure 2). The different modalities contribute to enrich the available features for model learning, contributing to an improved final prediction (Baltrusaitis et al., 2019; Khaki et al., 2019; Gadiraju et al., 2020; Hoang Trong et al., 2020; Maimaitijiang et al., 2020). The use of multimodal models and other DL architectures, such as recurrent neural networks and graph neural networks, are still largely unexplored in genotype to phenotype predictions, but present a powerful alternative to traditional statistical methods.

FIGURE 2

Challenges in deploying DL models emerge mainly from plant phenotypic plasticity, as plants present a wide range of phenotypes depending on the environmental conditions (

Nicotra et al., 2010

). The effectiveness of ML/DL models also depends on appropriately tuning the model hyperparameters to the target task, with many packages designed to assist tuning such as Optuna and HyperOpt (

Bergstra et al., 2013

;

Akiba et al., 2019

). Nonetheless, tuning hyperparameters for DL models tends to be more computationally intensive than conventional ML. Below we list some challenges that arise from phenotype variability that may need to be addressed to effectively use ML for genotype to phenotype prediction (

Figure 3

i) Consistent protocol for data collection and processing during training and model deployment (Hagiwara et al., 2020; Mårtensson et al., 2020). Because DL models learn directly from the dataset, varying the methods for data collection and processing can add noise causing the model to underperform. It is important to maintain a consistent protocol for data handling, as well as regular assessment to ensure that the model remains suitable for the task;
ii) Avoid the curse of dimensionality (Altman and Krzywinski, 2018). High throughput phenotyping platforms, hyperspectral cameras and pangenome assemblies can generate massive amounts of data, making it harder for the model to define which data points are representative of the trait. Feature selection algorithms can assist in selecting the most representative data subset for training the DLmodel (Cen et al., 2016; Khaki and Wang, 2019);
iii) Data imbalance. Scarcity of samples representing a specific genotype or environment can introduce bias to the model. This can be addressed by employing sampling methods such as over and under sampling, or generative DL to build an artificially augmented dataset (Radford et al., 2015);
iv) Changes in environmental conditions (interannual weather variation, differences in agroecological zone and crop management practices) may impact model performance due to plant phenotypic plasticity. Environmental effects on the phenotype should be considered when defining the model validation and future applicability, and can be addressed by collecting data that mimic the conditions that the model will see when predicting the phenotype (Montesinos-López O. A. et al., 2018; Khaki et al., 2019; Shook et al., 2021).

FIGURE 3

As crop research advances, it is advisable that the models are updated with new data collected. ML and DL models are under intense research, with improvements and novel architectures being published regularly. Selecting the most appropriate architecture and addressing the challenges above will depend on the goal of the project and the data availability. Some data types benefit from complex DL models that draw complex non-linear relationships from the data, while others can achieve high performance using simpler tree-based models with greater potential for interpretability of the results. Nonetheless, the successful implementation of ML/DL workflows in crop breeding will require the familiarisation of breeders and other stakeholders to the models capabilities, so they can be used appropriately.

Data-Driven Breeding Requires Structured Datasets

The prediction accuracy of machine learning models is intrinsically related to the quality of the dataset employed during development (LeCun et al., 2015). ML models calibrate their internal weights based on the data provided, requiring a representative dataset with accurate annotation (Sasidharan Nair and Vihinen, 2013; Schaafsma and Vihinen, 2018). Consequently, a frequent challenge for training robust ML models is the lack of appropriate datasets with enough data points and sample variability. It has been suggested that the scarcity of plant phenotype datasets is because these are either inefficiently shared with the community due to missing information and difficulties to find the public repository in which it is stored (Zamir, 2013; Lobet, 2017), or because the data is maintained in data silos with restricted access (Bayer and Edwards, 2020). A few international consortia such as the AgBioData (Harper et al., 2018) and Breeding API (Selby et al., 2019) are making an effort to share and transform breeding datasets to become more findable, accessible, interoperable and reusable (FAIR) (Wilkinson et al., 2016). However, a centralised platform for hosting and managing phenotypic datasets is needed to make data more widely available, similar to approaches used to share genomic data. Another aspect that prevents researchers from employing previously published datasets is the lack of standardised metadata descriptions, encompassing experimental design, data collection protocol, field management, environmental variables, and other information. The observed plant phenotype is the result of the conditions that the plant experiences, thus reusing previously published data requires that all the factors influencing the target trait are described for the user. The Minimum information about a plant phenotype project (MIAPPE) (Papoutsoglou et al., 2020) offers a resource to guide researchers on how to annotate metadata to increase the usability and interoperability of the datasets, but the development and application of suitable standards need to be expanded to support the growth in ML/DL applications.

A large number of high quality phenotypic datasets are generated each year during field trials led by industry, government or academic groups, however these often have controlled access to protect industry knowledge. An alternative approach to protect sensitive information while supporting collaboration towards data-driven breeding is the establishment of federated learning cohorts (Konečný et al., 2016). Within these, each participant institution trains the model with its own dataset and shares the updated model peer to peer or to a centralised server that will aggregate the models weights (Figure 4). The updated model parameters improve the baseline model, that is then shared among institutions (Yang et al., 2019). Federated learning has seen increasing application in digital health, where data sensitivity is a major issue (Brisimi et al., 2018; Lee et al., 2018; Rieke et al., 2020).

FIGURE 4

Explainable (Interpretable) Machine Learning for Genotype to Phenotype Models

The increasing use of large scale data, such as genomic variation and high throughput imagery, have produced studies with highly accurate ML and non-ML prediction models (Montesinos-López et al., 2021b). However, building models capable of predicting a biological output can only be seen as one of the goals. Models should also attempt to address biological questions, which requires an understanding of how the models make predictions. The use of ML as a method to ask biological questions presents researchers with the problem of model explainability. Explainability in prediction models is a relatively new area for genomic prediction, as the main goal has often been to achieve optimum prediction performance, with model explainability less of a focus. This idea can be represented with GBLUP, one of the most widely implemented genomic selection methods. GBLUP offers researchers accurate prediction of breeding values but low explainability of its predictions, as estimating individual SNP effects can be difficult for genomic prediction datasets due to the “large-p little-n” problem (de Los Campos et al., 2013; Shen et al., 2013). As crop genomic prediction ability advances, it is becoming increasingly important for researchers to explain how a “black box” model has made predictions, to understand the biological implications, justify further research questions and support problem specific model improvement.

For genotype to phenotype prediction in crops, explainability provides the ability to identify important genomic markers and then apply these genomic markers to reduce the size of model inputs required to make further predictions. In Khaki and Wang (2019), analysis of feature importance determined that weather had a larger effect on crop yield than genotype, and models trained on a reduced selection of top ranking input features did not lead to a significant reduction in performance (Khaki and Wang, 2019). Soybean yield prediction models using genotype and other input data identified the importance of features from August-September, which coincides with crop reproduction (Shook et al., 2021). The identification of features from explainable models provides biological understanding and supports results from other methods that identify important features. For example, the application of saliency maps as a post-hoc method to a convolutional neural network for soybean trait prediction identified an overlap with significant SNPs found from a GWAS conducted with the same data (Liu et al., 2019). Model explainability should be a consideration for genomic prediction problems and included as a determining factor for evaluating optimum model performance.

When predicting plant phenotypes from genotype information, the use of an interpretable model offers an opportunity to select high ranking markers as a feature selection strategy, and there is evidence that selecting a subset of important markers can improve the prediction for a given phenotype (Oakey et al., 2016). This is due to the large number of SNPs acting as background noise for prediction, resulting in diminishing returns on performance unless a major proportion of included SNPs are associated with the trait (Pérez-Enciso et al., 2015). A tool such as CGBayesNets can be used to firstly select a sample of features that are informative for phenotype prediction (McGeachie et al., 2014). Harvestman is another tool that selects a representative and non-redundant subset of features with a specific focus on minimising overfitting issues, which is common in high dimensionality prediction tasks (Frisby et al., 2021). The best subset and encoding of features can then be used to train new models. Ensemble methods can also be implemented, where interpretable ML methods can be used for feature selection, and the high ranking features can then be input into another model, such as a DL architecture, to improve predictions (Azodi et al., 2019). A benefit of feature selection is that input feature reduction can reduce the computing resources and time required to train models.

Model interpretation is complex, as definitions of interpretability are variable (Lipton, 2018), and evaluation of these interpretations is non-standardised, making evidence based comparisons between interpretable methods is difficult (Doshi-Velez and Kim, 2017). Rudin (2019) argues that instead of extracting meaning from the ‘black box’ model after training, ML models should be built with interpretability in mind (Rudin, 2019). Less complex ML algorithms, such as decision tree-based models, are inherently interpretable, with inbuilt functions to determine feature importance, however it can be hard to quantify importance between different models as the importance of these features often uses a metric that measures relative feature importance against other input features for a given model (Lundberg et al., 2018). For example, extreme gradient boosting’s (XGBoost) inbuilt functions offer multiple methods of calculating feature importance (weight, cover, gain), each offering different importance scores and ranking of input features, leading to variable interpretation of results depending on the method used. Model-agnostic local explanation methods, such as Shapley additive explanations (SHAP) (Lundberg and Lee, 2017) and Local interpretable model-agnostic explanations (LIME) (Ribeiro et al., 2016) have the potential to overcome this issue due to how the methods consistently and transparently quantify the input’s effect on prediction across most model types (Lundberg et al., 2018). However, this leads back to the original criticism of describing the extraction of meaning post-hoc from the black box as a practice with potential bias [57], the ability to purposely engineer explanations (Slack et al., 2020) and the likelihood of false conclusions being made by inexperienced users (Chromik et al., 2021).

Explainability for genotype to phenotype prediction is a relatively new area in genomic prediction studies, and the interest in explainability and interpretability suggests that new DL algorithms may have enhanced interpretability (Montesinos-López et al., 2021b). This presents a challenge to evaluate new algorithms, as conclusion extracted from interpretable models can be erroneous if the model’s predictive performance is poor (Murdoch et al., 2019). Additionally, interpretable ML extracts associations through the correlation between features and outcomes (Azodi et al., 2020), but are often interpreted with the goal of guiding hypotheses to suggest possible causal relationships from features (Lipton, 2018; Murdoch et al., 2019). Placed in a genotype to phenotype context, this would mean the identification of important genomic regions for phenotype outcome would remain noncausal and associative until further work was conducted to determine causality. One such example of why further work is required is the possibility of an untested factor not included in the genotype data, such as epigenetic or environmental features, that cause both the genomic region and the predicted phenotype to be associated with each other, without the genomic region being the causal feature.

Discussion

ML models have the potential to predict complex phenotypic traits such as yield, biotic, and abiotic stress tolerance due to the way they capture the non-linear relationships between the genetic variation and environmental features extracted from the dataset. The growth of high throughput phenotyping and genotyping will continue to support crop breeding, however we need algorithms capable of handling data at this scale and complexity. Further research into the application of ML/DL models in crop breeding is required as the application of the model is highly dependent on representativeness of the training dataset, which may cause difficulties to apply the model in a real-world setting. Increased efforts to share datasets, or use data augmentation methods may help improve model generalisation. In addition, updating model weights as new datasets become available may also contribute to improve model prediction accuracy. Most genotype to phenotype studies currently lack interpretation of their predictions, which should be addressed to gain insight from the features. As researchers and breeders use advanced data analysis approaches, explainable ML models will help answer biological questions. Explainable MLwill be a key area for genotype to phenotype research, with an increasing focus on harnessing the potential of ML to generate accurate predictions combined with reliable interpretations.

Statements

Author contributions

MD, MG, and RA wrote the manuscript with input and edits from JB, MB, PB, and DE.

Funding

This work was supported by the Australian Government through the Australian Research Council (Projects DP210100296, DP200100762 and DE210100398) and the Grains Research and Development Corporation (Projects 9177539 and 9177591). MG, RA, and MD are supported by Research Training Program scholarships. MD received further support from the Forrest Research Foundation. The funding bodies had no involvement in the study design or its development.

Acknowledgments

We would like to thank Felipe Casaprima for his assistance with designing the figures.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Abbreviations

DL, Deep Learning; GBLUP, Genomic Best Linear Unbiased Predictions; GS, Genomic Selection; GWAS, Genome-Wide Association Study; LASSO, Least Absolute Shrinkage and Selection Operator; LIME, Local Interpretable Model-agnostic Explanations; MIAPPE, Minimum Information About a Plant Phenotype Experiment; ML, Machine Learning; SHAP, SHapley Additive exPlanations; SNP, Single Nucleotide Polymorphism; RR, Ridge Regression; QTL, Quantitative Trait Loci.

References

1
AdakA.ConradC.ChenY.WildeS. C.MurrayS. C.Anderson IIS. L.et al (2021). Validation of Functional Polymorphisms Affecting maize Plant Height by Unoccupied Aerial Systems (UAS) Discovers Novel Temporal Phenotypes, 11. G3 (Bethesda). 10.1093/g3journal/jkab075
- CrossRef
- Google Scholar
2
AkibaT.SanoS.YanaseT.OhtaT.KoyamaM. (2019). “Optuna: A Next-Generation Hyperparameter Optimization Framework,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2623–2631.
- Google Scholar
3
AltmanN.KrzywinskiM. (2018). The Curse(s) of Dimensionality. Nat. Methods15, 399–400. 10.1038/s41592-018-0019-x
- CrossRef
- Google Scholar
4
AndersonS. L.MurrayS. C.ChenY.MalamboL.ChangA.PopescuS.et al (2020). Unoccupied Aerial System Enabled Functional Modeling of maize Height Reveals Dynamic Expression of Loci. Plant Direct4, e00223. 10.1002/pld3.223
- CrossRef
- Google Scholar
5
AzodiC. B.BolgerE.McCarrenA.RoantreeM.de Los CamposG.ShiuS.-H. (2019). Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3 (Bethesda)9, 3691–3702. 10.1534/g3.119.400498
- CrossRef
- Google Scholar
6
AzodiC. B.TangJ.ShiuS.-H. (2020). Opening the Black Box: Interpretable Machine Learning for Geneticists. Trends Genet.36, 442–455. 10.1016/j.tig.2020.03.005
- CrossRef
- Google Scholar
7
BaltrusaitisT.AhujaC.MorencyL.-P. (2019). Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell.41, 423–443. 10.1109/TPAMI.2018.2798607
- CrossRef
- Google Scholar
8
BanerjeeR.MarathiB.SinghM. (2020). Efficient Genomic Selection Using Ensemble Learning and Ensemble Feature Reduction. J. Crop Sci. Biotechnol.23, 311–323. 10.1007/s12892-020-00039-4
- CrossRef
- Google Scholar
9
BayerP. E.EdwardsD. (2020). Machine Learning in Agriculture: from Silos to Marketplaces. Plant Biotechnol. J.19, 648–650. 10.1111/pbi.13521
- CrossRef
- Google Scholar
10
BayerP. E.GoliczA. A.SchebenA.BatleyJ.EdwardsD. (2020). Plant Pan-Genomes Are the New Reference. Nat. Plants6, 914–920. 10.1038/s41477-020-0733-0
- CrossRef
- Google Scholar
11
BergstraJ.YaminsD.CoxD. D. (2013). Hyperopt: A python Library for Optimizing the Hyperparameters of Machine Learning Algorithms. Proc. 12th Python Sci. Conf.13, 20. 10.25080/majora-8b375195-003
- CrossRef
- Google Scholar
12
BrisimiT. S.ChenR.MelaT.OlshevskyA.PaschalidisI. C.ShiW. (2018). Federated Learning of Predictive Models from Federated Electronic Health Records. Int. J. Med. Inform.112, 59–67. 10.1016/j.ijmedinf.2018.01.007
- CrossRef
- Google Scholar
13
CenH.LuR.ZhuQ.MendozaF. (2016). Nondestructive Detection of Chilling Injury in Cucumber Fruit Using Hyperspectral Imaging with Feature Selection and Supervised Classification. Postharvest Biol. Tech.111, 352–361. 10.1016/j.postharvbio.2015.09.027
- CrossRef
- Google Scholar
14
ChampignyM. J.UndaF.SkybaO.SoolanayakanahallyR. Y.MansfieldS. D.CampbellM. M. (2020). Learning from Methylomes: Epigenomic Correlates of Populus Balsamifera Traits Based on Deep Learning Models of Natural DNA Methylation. Plant Biotechnol. J.18, 1361–1375. 10.1111/pbi.13299
- CrossRef
- Google Scholar
15
ChenM. L.DoddiA.RoyerJ.FreschiL.SchitoM.EzewudoM.et al (2019). Beyond Multidrug Resistance: Leveraging Rare Variants with Machine and Statistical Learning Models in Mycobacterium tuberculosis Resistance Prediction. EBioMedicine43, 356–369. 10.1016/j.ebiom.2019.04.016
- CrossRef
- Google Scholar
16
ChromikM.EibandM.BuchnerF.KrügerA.ButzA. (2021). “I Think I Get Your point, AI! the Illusion of Explanatory Depth in Explainable AI,” in 26th International Conference on Intelligent User Interfaces, New York, NY, USA (ACM), 307–317. 10.1145/3397481.3450644
- CrossRef
- Google Scholar
17
CooperM.MessinaC. D.PodlichD.TotirL. R.BaumgartenA.HausmannN. J.et al (2014). Predicting the Future of Plant Breeding: Complementing Empirical Evaluation with Genetic Prediction. Crop Pasture Sci.65, 311. 10.1071/CP14007
- CrossRef
- Google Scholar
18
CrossaJ.MartiniJ. W. R.GianolaD.Pérez-RodríguezP.JarquinD.JulianaP.et al (2019). Deep Kernel and Deep Learning for Genome-Based Prediction of Single Traits in Multienvironment Breeding Trials. Front. Genet.10, 1168. 10.3389/fgene.2019.01168
- CrossRef
- Google Scholar
19
CrossaJ.Pérez-RodríguezP.CuevasJ.Montesinos-LópezO.JarquínD.de Los CamposG.et al (2017). Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci.22, 961–975. 10.1016/j.tplants.2017.08.011
- CrossRef
- Google Scholar
20
CuevasJ.CrossaJ.SoberanisV.Pérez‐ElizaldeS.Pérez‐RodríguezP.CamposG. d. l.et al (2016). Genomic Prediction of Genotype × Environment Interaction Kernel Regression Models. Plant Genome9. 10.3835/plantgenome2016.03.0024
- CrossRef
- Google Scholar
21
DanileviczM. F.Tay FernandezC. G.MarshJ. I.BayerP. E.EdwardsD. (2020). Plant Pangenomics: Approaches, Applications and Advancements. Curr. Opin. Plant Biol.54, 18–25. 10.1016/j.pbi.2019.12.005
- CrossRef
- Google Scholar
22
de Abreu E LimaF.WillmitzerL.NikoloskiZ. (2018). Classification-driven Framework to Predict maize Hybrid Field Performance from Metabolic Profiles of Young Parental Roots. PLoS ONE13, e0196038. 10.1371/journal.pone.0196038
- CrossRef
- Google Scholar
23
De CastroA. I.EhsaniR.PloetzR.CraneJ. H.AbdulridhaJ. (2015). Optimum Spectral and Geometric Parameters for Early Detection of laurel Wilt Disease in Avocado. Remote Sensing Environ.171, 33–44. 10.1016/j.rse.2015.09.011
- CrossRef
- Google Scholar
24
de Los CamposG.HickeyJ. M.Pong-WongR.DaetwylerH. D.CalusM. P. L. (2013). Whole-genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics193, 327–345. 10.1534/genetics.112.143313
- CrossRef
- Google Scholar
25
Doshi-VelezF.KimB. (2017). Towards A Rigorous Science of Interpretable Machine Learning, 08608. arXiv:1702.
- Google Scholar
26
DrouinA.LetarteG.RaymondF.MarchandM.CorbeilJ.LavioletteF. (2019). Interpretable Genotype-To-Phenotype Classifiers with Performance Guarantees. Sci. Rep.9, 4071. 10.1038/s41598-019-40561-2
- CrossRef
- Google Scholar
27
FanZ.LiJ.ZhangL.ZhuG.LiP.LuX.et al (2021). U-net Based Analysis of MRI for Alzheimer's Disease Diagnosis. Neural Comput. Applic33, 13587–13599. 10.1007/s00521-021-05983-y
- CrossRef
- Google Scholar
28
FrisbyT. S.BakerS. J.MarçaisG.HoangQ. M.KingsfordC.LangmeadC. J. (2021). HARVESTMAN: a Framework for Hierarchical Feature Learning and Selection from Whole Genome Sequencing Data. BMC Bioinformatics22, 174. 10.1186/s12859-021-04096-6
- CrossRef
- Google Scholar
29
GadirajuK. K.RamachandraB.ChenZ.VatsavaiR. R. (2020). “Multimodal Deep Learning Based Crop Classification Using Multispectral and Multitemporal Satellite Imagery,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA (ACM), 3234–3242. 10.1145/3394486.3403375
- CrossRef
- Google Scholar
30
GaffneyJ.SchusslerJ.LöfflerC.CaiW.PaszkiewiczS.MessinaC.et al (2015). Industry-Scale Evaluation of Maize Hybrids Selected for Increased Yield in Drought-Stress Conditions of the US Corn Belt. Crop Sci.55, 1608–1618. 10.2135/cropsci2014.09.0654
- CrossRef
- Google Scholar
31
GaoL.GondaI.SunH.MaQ.BaoK.TiemanD. M.et al (2019). The Tomato Pan-Genome Uncovers New Genes and a Rare Allele Regulating Fruit Flavor. Nat. Genet.51, 1044–1051. 10.1038/s41588-019-0410-2
- CrossRef
- Google Scholar
32
GoliczA. A.BayerP. E.BarkerG. C.EdgerP. P.KimH.MartinezP. A.et al (2016). The Pangenome of an Agronomically Important Crop Plant Brassica oleracea. Nat. Commun.7, 13390. 10.1038/ncomms13390
- CrossRef
- Google Scholar
33
GoliczA. A.BayerP. E.BhallaP. L.BatleyJ.EdwardsD. (2020). Pangenomics Comes of Age: from Bacteria to Plant and Animal Applications. Trends Genet.36, 132–145. 10.1016/j.tig.2019.11.006
- CrossRef
- Google Scholar
34
GrinbergN. F.OrhoborO. I.KingR. D. (2020). An Evaluation of Machine-Learning for Predicting Phenotype: Studies in Yeast, rice, and Wheat. Mach. Learn.109, 251–277. 10.1007/s10994-019-05848-5
- CrossRef
- Google Scholar
35
HagiwaraA.FujitaS.OhnoY.AokiS. (2020). Variability and Standardization of Quantitative Imaging. Invest. Radiol.55, 601–616. 10.1097/RLI.0000000000000666
- CrossRef
- Google Scholar
36
HarperL.CampbellJ.CannonE. K. S.JungS.PoelchauM.WallsR.et al (2018). AgBioData Consortium Recommendations for Sustainable Genomics and Genetics Databases for Agriculture. Database (Oxford)2018. 10.1093/database/bay088
- CrossRef
- Google Scholar
37
HasanM. M.ChopinJ. P.LagaH.MiklavcicS. J. (2018). Detection and Analysis of Wheat Spikes Using Convolutional Neural Networks. Plant Methods14, 100. 10.1186/s13007-018-0366-8
- CrossRef
- Google Scholar
38
HeffnerE. L.SorrellsM. E.JanninkJ.-L. (2009). Genomic Selection for Crop Improvement. Crop Sci.49, 1–12. 10.2135/cropsci2008.08.0512
- CrossRef
- Google Scholar
39
HeslotN.YangH.-P.SorrellsM. E.JanninkJ.-L. (2012). Genomic Selection in Plant Breeding: A Comparison of Models. Crop Sci.52, 146–160. 10.2135/cropsci2011.06.0297
- CrossRef
- Google Scholar
40
Hoang TrongV.Gwang-hyunY.Thanh VuD.Jin-youngK. (2020). Late Fusion of Multimodal Deep Neural Networks for Weeds Classification. Comput. Elect. Agric.175, 105506. 10.1016/j.compag.2020.105506
- CrossRef
- Google Scholar
41
HossainM. Z.SohelF.ShiratuddinM. F.LagaH.BennamounM. (2021). Text to Image Synthesis for Improved Image Captioning. IEEE Access9, 64918–64928. 10.1109/ACCESS.2021.3075579
- CrossRef
- Google Scholar
42
HuH.SchebenA.EdwardsD. (2018). Advances in Integrating Genomics and Bioinformatics in the Plant Breeding Pipeline. Agriculture8, 75. 10.3390/agriculture8060075
- CrossRef
- Google Scholar
43
HurgobinB.EdwardsD. (2017). SNP Discovery Using a Pangenome: Has the Single Reference Approach Become Obsolete?Biology6, 21. 10.3390/biology6010021
- CrossRef
- Google Scholar
44
JeongS.KimJ.-Y.KimN. (2020). GMStool: GWAS-Based Marker Selection Tool for Genomic Prediction from Genomic Data. Sci. Rep.10, 19653. 10.1038/s41598-020-76759-y
- CrossRef
- Google Scholar
45
JinM.LiuH.HeC.FuJ.XiaoY.WangY.et al (2016). Maize Pan-Transcriptome Provides Novel Insights into Genome Complexity and Quantitative Trait Variation. Sci. Rep.6, 18936. 10.1038/srep18936
- CrossRef
- Google Scholar
46
JubairS.DomaratzkiM. (2019). “Ensemble Supervised Learning for Genomic Selection,” in 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (IEEE), 1993–2000. 10.1109/BIBM47256.2019.8982998
- CrossRef
- Google Scholar
47
JulianaP.SinghR. P.PolandJ.MondalS.CrossaJ.Montesinos‐LópezO. A.et al (2018). Prospects and Challenges of Applied Genomic Selection-A New Paradigm in Breeding for Grain Yield in Bread Wheat. Plant Genome11, 180017. 10.3835/plantgenome2018.03.0017
- CrossRef
- Google Scholar
48
KhakiS.WangL.ArchontoulisS. V. (2019). A CNN-RNN Framework for Crop Yield Prediction. Front. Plant Sci.10, 1750. 10.3389/fpls.2019.01750
- CrossRef
- Google Scholar
49
KhakiS.WangL. (2019). Crop Yield Prediction Using Deep Neural Networks. Front. Plant Sci.10, 621. 10.3389/fpls.2019.00621
- CrossRef
- Google Scholar
50
KhanM. A.AshrafI.AlhaisoniM.DamaševičiusR.SchererR.RehmanA.et al (2020). Multimodal Brain Tumor Classification Using Deep Learning and Robust Feature Selection: A Machine Learning Application for Radiologists. Diagnostics10, 565. 10.3390/diagnostics10080565
- CrossRef
- Google Scholar
51
KonečnýJ.McMahanH. B.RamageD.RichtárikP. (2016). Federated Optimization: Distributed Machine Learning for On-Device Intelligence. CoRR abs/1610.02527.
- Google Scholar
52
LebedevV. G.LebedevaT. N.ChernodubovA. I.ShestibratovK. A. (2020). Genomic Selection for forest Tree Improvement: Methods, Achievements and Perspectives. Forests11, 1190. 10.3390/f11111190
- CrossRef
- Google Scholar
53
LeCunY.BengioY.HintonG. (2015). Deep Learning. Nature521, 436–444. 10.1038/nature14539
- CrossRef
- Google Scholar
54
LeeJ.SunJ.WangF.WangS.JunC.-H.JiangX. (2018). Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis. JMIR Med. Inform.6, e20. 10.2196/medinform.7744
- CrossRef
- Google Scholar
55
LiptonZ. C. (2018). The Mythos of Model Interpretability. Queue16, 31–57. 10.1145/3236386.3241340
- CrossRef
- Google Scholar
56
LiuY.WangD.HeF.WangJ.JoshiT.XuD. (2019). Phenotype Prediction and Genome-wide Association Study Using Deep Convolutional Neural Network of Soybean. Front. Genet.10, 1091. 10.3389/fgene.2019.01091
- CrossRef
- Google Scholar
57
LobetG. (2017). Image Analysis in Plant Sciences: Publish Then Perish. Trends Plant Sci.22, 559–566. 10.1016/j.tplants.2017.05.002
- CrossRef
- Google Scholar
58
LundbergS.LeeS.-I. (2017). A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst.30, 4765–4774.
- Google Scholar
59
LundbergS. M.ErionG. G.LeeS.-I. (2018). Consistent Individualized Feature Attribution for Tree Ensembles. arXiv:1802.03888.
- Google Scholar
60
MaW.QiuZ.SongJ.LiJ.ChengQ.ZhaiJ.et al (2018). A Deep Convolutional Neural Network Approach for Predicting Phenotypes from Genotypes. Planta248, 1307–1318. 10.1007/s00425-018-2976-9
- CrossRef
- Google Scholar
61
MaimaitijiangM.SaganV.SidikeP.HartlingS.EspositoF.FritschiF. B. (2020). Soybean Yield Prediction from UAV Using Multimodal Data Fusion and Deep Learning. Remote Sensing Environ.237, 111599. 10.1016/j.rse.2019.111599
- CrossRef
- Google Scholar
62
MarinoS.AlvinoA. (2020). Agronomic Traits Analysis of Ten Winter Wheat Cultivars Clustered by UAV-Derived Vegetation Indices. Remote Sensing12, 249. 10.3390/rs12020249
- CrossRef
- Google Scholar
63
MårtenssonG.FerreiraD.GranbergT.CavallinL.OppedalK.PadovaniA.et al (2020). The Reliability of a Deep Learning Model in Clinical Out-Of-Distribution MRI Data: A Multicohort Study. Med. Image Anal.66, 101714. 10.1016/j.media.2020.101714
- CrossRef
- Google Scholar
64
MasonA. S.HigginsE. E.SnowdonR. J.BatleyJ.SteinA.WernerC.et al (2017). A User Guide to the Brassica 60K Illumina Infinium SNP Genotyping Array. Theor. Appl. Genet.130, 621–633. 10.1007/s00122-016-2849-1
- CrossRef
- Google Scholar
65
McGeachieM. J.ChangH.-H.WeissS. T. (2014). CGBayesNets: Conditional Gaussian Bayesian Network Learning and Inference with Mixed Discrete and Continuous Data. Plos Comput. Biol.10, e1003676. 10.1371/journal.pcbi.1003676
- CrossRef
- Google Scholar
66
McKenzieE. M.SanthanamA.RuanD.O'ConnorD.CaoM.ShengK. (2020). Multimodality Image Registration in the Head‐and‐neck Using a Deep Learning‐derived Synthetic CT as a Bridge. Med. Phys.47, 1094–1104. 10.1002/mp.13976
- CrossRef
- Google Scholar
67
MeuwissenT. H. E.HayesB. J.GoddardM. E. (2001). Prediction of Total Genetic Value Using Genome-wide Dense Marker Maps. Genetics157, 1819–1829. 10.1093/genetics/157.4.1819
- CrossRef
- Google Scholar
68
MisraT.AroraA.MarwahaS.ChinnusamyV.RaoA. R.JainR.et al (2020). SpikeSegNet-a Deep Learning Approach Utilizing Encoder-Decoder Network with Hourglass for Spike Segmentation and Counting in Wheat Plant from Visual Imaging. Plant Methods16, 40. 10.1186/s13007-020-00582-9
- CrossRef
- Google Scholar
69
MonroeJ. G.ArciniegasJ. P.MorenoJ. L.SánchezF.SierraS.ValdesS.et al (2020). The Lowest Hanging Fruit: Beneficial Gene Knockouts in Past, Present, and Future Crop Evolution. Curr. Plant Biol.24, 100185. 10.1016/j.cpb.2020.100185
- CrossRef
- Google Scholar
70
MontenegroJ. D.GoliczA. A.BayerP. E.HurgobinB.LeeH.ChanC. K. K.et al (2017). The Pangenome of Hexaploid Bread Wheat. Plant J.90, 1007–1013. 10.1111/tpj.13515
- CrossRef
- Google Scholar
71
Montesinos‐LópezO. A.Montesinos‐LópezA.Hernandez‐SuarezC. M.Barrón‐LópezJ. A.CrossaJ. (2021a). Deep‐learning Power and Perspectives for Genomic Selection. Plant Genome14 (3), e20122. 10.1002/tpg2.20122
- CrossRef
- Google Scholar
72
Montesinos-LópezA.Montesinos-LópezO. A.GianolaD.CrossaJ.Hernández-SuárezC. M. (2018a). Multi-environment Genomic Prediction of Plant Traits Using Deep Learners with Dense Architecture. G3 (Bethesda)8, 3813–3828. 10.1534/g3.118.200740
- CrossRef
- Google Scholar
73
Montesinos-LópezO. A.Montesinos-LópezA.CrossaJ.GianolaD.Hernández-SuárezC. M.Martín-VallejoJ. (2018b). Multi-trait, Multi-Environment Deep Learning Modeling for Genomic-Enabled Prediction of Plant Traits. G3 (Bethesda)8, 3829–3840. 10.1534/g3.118.200728
- CrossRef
- Google Scholar
74
Montesinos-LópezO. A.Montesinos-LópezA.Pérez-RodríguezP.Barrón-LópezJ. A.MartiniJ. W. R.Fajardo-FloresS. B.et al (2021b). A Review of Deep Learning Applications for Genomic Selection. BMC Genomics22, 19. 10.1186/s12864-020-07319-x
- CrossRef
- Google Scholar
75
MorrellP. L.BucklerE. S.Ross-IbarraJ. (2011). Crop Genomics: Advances and Applications. Nat. Rev. Genet.13, 85–96. 10.1038/nrg3097
- CrossRef
- Google Scholar
76
MurdochW. J.SinghC.KumbierK.Abbasi-AslR.YuB. (2019). Definitions, Methods, and Applications in Interpretable Machine Learning. Proc. Natl. Acad. Sci. U.S.A.116, 22071–22080. 10.1073/pnas.1900654116
- CrossRef
- Google Scholar
77
NagasubramanianK.JonesS.SarkarS.SinghA. K.SinghA.GanapathysubramanianB. (2018). Hyperspectral Band Selection Using Genetic Algorithm and Support Vector Machines for Early Identification of Charcoal Rot Disease in Soybean Stems. Plant Methods14, 86. 10.1186/s13007-018-0349-9
- CrossRef
- Google Scholar
78
NagasubramanianK.JonesS.SinghA. K.SarkarS.SinghA.GanapathysubramanianB. (2019). Plant Disease Identification Using Explainable 3D Deep Learning on Hyperspectral Images. Plant Methods15, 98. 10.1186/s13007-019-0479-8
- CrossRef
- Google Scholar
79
NairP. S.VihinenM. (2013). VariBench: a Benchmark Database for Variations. Hum. Mutat.34, 42–49. 10.1002/humu.22204
- CrossRef
- Google Scholar
80
NaitoH.OgawaS.ValenciaM. O.MohriH.UranoY.HosoiF.et al (2017). Estimating rice Yield Related Traits and Quantitative Trait Loci Analysis under Different Nitrogen Treatments Using a Simple tower-based Field Phenotyping System with Modified Single-Lens Reflex Cameras. ISPRS J. Photogrammetry Remote Sensing125, 50–62. 10.1016/j.isprsjprs.2017.01.010
- CrossRef
- Google Scholar
81
NicotraA. B.AtkinO. K.BonserS. P.DavidsonA. M.FinneganE. J.MathesiusU.et al (2010). Plant Phenotypic Plasticity in a Changing Climate. Trends Plant Sci.15, 684–692. 10.1016/j.tplants.2010.09.008
- CrossRef
- Google Scholar
82
NuccioM. L.PaulM.BateN. J.CohnJ.CutlerS. R. (2018). Where Are the Drought Tolerant Crops? an Assessment of More Than Two Decades of Plant Biotechnology Effort in Crop Improvement. Plant Sci.273, 110–119. 10.1016/j.plantsci.2018.01.020
- CrossRef
- Google Scholar
83
OakeyH.CullisB.ThompsonR.ComadranJ.HalpinC.WaughR. (2016). Genomic Selection in Multi-Environment Crop Trials. G3 (Bethesda)6, 1313–1326. 10.1534/g3.116.027524
- CrossRef
- Google Scholar
84
OrnellaL.PérezP.TapiaE.González-CamachoJ. M.BurgueñoJ.ZhangX.et al (2014). Genomic-enabled Prediction with Classification Algorithms. Heredity112, 616–626. 10.1038/hdy.2013.144
- CrossRef
- Google Scholar
85
PapoutsoglouE. A.FariaD.ArendD.ArnaudE.AthanasiadisI. N.ChavesI.et al (2020). Enabling Reusability of Plant Phenomic Datasets with MIAPPE 1.1. New Phytol.227, 260–273. 10.1111/nph.16544
- CrossRef
- Google Scholar
86
PauliD.Andrade-SanchezP.Carmo-SilvaA. E.GazaveE.FrenchA. N.HeunJ.et al (2016). Field-Based High-Throughput Plant Phenotyping Reveals the Temporal Patterns of Quantitative Trait Loci Associated with Stress-Responsive Traits in Cotton. G3 (Bethesda)6, 865–879. 10.1534/g3.115.023515
- CrossRef
- Google Scholar
87
PérezP.CamposG.CrossaJ.GianolaD. (2010). Genomic‐Enabled Prediction Based on Molecular Markers and Pedigree Using the Bayesian Linear Regression Package in R. Plant Genome3, 106–116. 10.3835/plantgenome2010.04.0005
- CrossRef
- Google Scholar
88
Pérez-EncisoM.RincónJ. C.LegarraA. (2015). Sequence- vs. Chip-Assisted Genomic Selection: Accurate Biological Information Is Advised. Genet. Sel. Evol.47, 43. 10.1186/s12711-015-0117-5
- CrossRef
- Google Scholar
89
Pérez-RodríguezP.GianolaD.González-CamachoJ. M.CrossaJ.ManèsY.DreisigackerS. (2012). Comparison between Linear and Non-parametric Regression Models for Genome-Enabled Prediction in Wheat. G3 (Bethesda)2, 1595–1605. 10.1534/g3.112.003665
- CrossRef
- Google Scholar
90
PiconA.Alvarez-GilaA.SeitzM.Ortiz-BarredoA.EchazarraJ.JohannesA. (2019). Deep Convolutional Neural Networks for mobile Capture Device-Based Crop Disease Classification in the Wild. Comput. Elect. Agric.161, 280–290. 10.1016/j.compag.2018.04.002
- CrossRef
- Google Scholar
91
PoundM. P.AtkinsonJ. A.TownsendA. J.WilsonM. H.GriffithsM.JacksonA. S.et al (2017). Deep Machine Learning Provides State-Of-The-Art Performance in Image-Based Plant Phenotyping. Gigascience6, 1–10. 10.1093/gigascience/gix083
- CrossRef
- Google Scholar
92
Quirós VargasJ. J.ZhangC.SmitchgerJ. A.McGeeR. J.SankaranS. (2019). Phenotyping of Plant Biomass and Performance Traits Using Remote Sensing Techniques in Pea (Pisum Sativum, L.). Sensors19, 2031. 10.3390/s19092031
- CrossRef
- Google Scholar
93
RadfordA.MetzL.ChintalaS. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434.
- Google Scholar
94
RibeiroM. T.SinghS.GuestrinC. (2016). “"Why Should I Trust You?",” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, New York, USA (ACM Press), 1135–1144. 10.1145/2939672.2939778
- CrossRef
- Google Scholar
95
RiceB.LipkaA. E. (2019). Evaluation of RR‐BLUP Genomic Selection Models that Incorporate Peak Genome‐Wide Association Study Signals in Maize and Sorghum. Plant Genome12, 180052. 10.3835/plantgenome2018.07.0052
- CrossRef
- Google Scholar
96
RiekeN.HancoxJ.LiW.MilletarìF.RothH. R.AlbarqouniS.et al (2020). The Future of Digital Health with Federated Learning. Npj Digit. Med.3, 119. 10.1038/s41746-020-00323-1
- CrossRef
- Google Scholar
97
RudinC. (2019). Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell.1, 206–215. 10.1038/s42256-019-0048-x
- CrossRef
- Google Scholar
98
RutkoskiJ.PolandJ.MondalS.AutriqueE.PérezL. G.CrossaJ.et al (2016). Canopy Temperature and Vegetation Indices from High-Throughput Phenotyping Improve Accuracy of Pedigree and Genomic Selection for Grain Yield in Wheat. G3 (Bethesda)6, 2799–2808. 10.1534/g3.116.032888
- CrossRef
- Google Scholar
99
SandhuK. S.LozadaD. N.ZhangZ.PumphreyM. O.CarterA. H. (2020). Deep Learning for Predicting Complex Traits in spring Wheat Breeding Program. Front. Plant Sci.11, 613325. 10.3389/fpls.2020.613325
- CrossRef
- Google Scholar
100
SchaafsmaG. C. P.VihinenM. (2018). Representativeness of Variation Benchmark Datasets. BMC Bioinformatics19, 461. 10.1186/s12859-018-2478-6
- CrossRef
- Google Scholar
101
SchebenA.WolterF.BatleyJ.PuchtaH.EdwardsD. (2017). Towards CRISPR/Cas Crops - Bringing Together Genomics and Genome Editing. New Phytol.216, 682–698. 10.1111/nph.14702
- CrossRef
- Google Scholar
102
ScossaF.AlseekhS.FernieA. R. (2021). Integrating Multi-Omics Data for Crop Improvement. J. Plant Physiol.257, 153352. 10.1016/j.jplph.2020.153352
- CrossRef
- Google Scholar
103
SelbyP.AbbeloosR.BacklundJ. E.Basterrechea SalidoM.BauchetG.Benites-AlfaroO. E.et al (2019). BrAPI-an Application Programming Interface for Plant Breeding Applications. Bioinformatics35, 4147–4155. 10.1093/bioinformatics/btz190
- CrossRef
- Google Scholar
104
ShakoorN.LeeS.MocklerT. C. (2017). High Throughput Phenotyping to Accelerate Crop Breeding and Monitoring of Diseases in the Field. Curr. Opin. Plant Biol.38, 184–192. 10.1016/j.pbi.2017.05.006
- CrossRef
- Google Scholar
105
ShenX.AlamM.FikseF.RönnegårdL. (2013). A Novel Generalized ridge Regression Method for Quantitative Genetics. Genetics193, 1255–1268. 10.1534/genetics.112.146720
- CrossRef
- Google Scholar
106
ShookJ.GangopadhyayT.WuL.GanapathysubramanianB.SarkarS.SinghA. K. (2021). Crop Yield Prediction Integrating Genotype and Weather Variables Using Deep Learning. Plos one16 (6), e0252402. 10.1371/journal.pone.0252402
- CrossRef
- Google Scholar
107
SlackD.HilgardS.JiaE.SinghS.LakkarajuH. (2020). “Fooling LIME and SHAP,” in Proceedings of the AAAI/ACM Conference on AI. New York, NY, USA. Ethics, and SocietyACM, 180–186. 10.1145/3375627.3375830
- CrossRef
- Google Scholar
108
SongH.xiu-ying Hanx. y.Montenegro-MarinC. E.krishnamoorthyS. (2021). Secure Prediction and Assessment of Sports Injuries Using Deep Learning Based Convolutional Neural Network. J. Ambient Intell. Hum. Comput12, 3399–3410. 10.1007/s12652-020-02560-4
- CrossRef
- Google Scholar
109
SpindelJ.BegumH.AkdemirD.VirkP.CollardB.RedoñaE.et al (2015). Genomic Selection and Association Mapping in rice (Oryza Sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of rice Genomic Selection in Elite, Tropical rice Breeding Lines. Plos Genet.11, e1004982. 10.1371/journal.pgen.1004982
- CrossRef
- Google Scholar
110
SpindelJ. E.BegumH.AkdemirD.CollardB.RedoñaE.JanninkJ.-L.et al (2016). Genome-wide Prediction Models that Incorporate De Novo GWAS Are a Powerful New Tool for Tropical rice Improvement. Heredity116, 395–408. 10.1038/hdy.2015.113
- CrossRef
- Google Scholar
111
TattarisM.ReynoldsM. P.ChapmanS. C. (2016). A Direct Comparison of Remote Sensing Approaches for High-Throughput Phenotyping in Plant Breeding. Front. Plant Sci.7, 1131. 10.3389/fpls.2016.01131
- CrossRef
- Google Scholar
112
UkrainetzN. K.MansfieldS. D. (2020). Prediction Accuracy of Single-step BLUP for Growth and wood Quality Traits in the lodgepole pine Breeding Program in British Columbia. Tree Genet. Genomes16, 64. 10.1007/s11295-020-01456-w
- CrossRef
- Google Scholar
113
Voss-FelsK. P.CooperM.HayesB. J. (2019). Accelerating Crop Genetic Gains with Genomic Selection. Theor. Appl. Genet.132, 669–686. 10.1007/s00122-018-3270-8
- CrossRef
- Google Scholar
114
WangS.WongD.ForrestK.AllenA.ChaoS.HuangB. E.et al (2014). Characterization of Polyploid Wheat Genomic Diversity Using a High‐density 90 000 Single Nucleotide Polymorphism Array. Plant Biotechnol. J.12, 787–796. 10.1111/pbi.12183
- CrossRef
- Google Scholar
115
WilkinsonM. D.DumontierM.AalbersbergI. J.AppletonG.AxtonM.BaakA.et al (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data3, 160018. 10.1038/sdata.2016.18
- CrossRef
- Google Scholar
116
XiongH.CaoZ.LuH.MadecS.LiuL.ShenC. (2019). TasselNetv2: In-Field Counting of Wheat Spikes with Context-Augmented Local Regression Networks. Plant Methods15, 150. 10.1186/s13007-019-0537-2
- CrossRef
- Google Scholar
117
YangQ.LiuY.ChenT.TongY. (2019). Federated Machine Learning. ACM Trans. Intell. Syst. Technol.10, 1–19. 10.1145/3298981
- CrossRef
- Google Scholar
118
YinB.BalvertM.van der SpekR. A. A.DutilhB. E.BohtéS.VeldinkJ.et al (2019). Using the Structure of Genome Data in the Design of Deep Neural Networks for Predicting Amyotrophic Lateral Sclerosis from Genotype. Bioinformatics35, i538–i547. 10.1093/bioinformatics/btz369
- CrossRef
- Google Scholar
119
ZamirD. (2013). Where Have All the Crop Phenotypes Gone?Plos Biol.11, e1001595. 10.1371/journal.pbio.1001595
- CrossRef
- Google Scholar
120
ZhouJ.ParkC. Y.TheesfeldC. L.WongA. K.YuanY.ScheckelC.et al (2019). Whole-genome Deep-Learning Analysis Identifies Contribution of Noncoding Mutations to Autism Risk. Nat. Genet.51, 973–980. 10.1038/s41588-019-0420-0
- CrossRef
- Google Scholar
121
ZingarettiL. M.GezanS. A.FerrãoL. F. V.OsorioL. F.MonfortA.MuñozP. R.et al (2020). Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Front. Plant Sci.11, 25. 10.3389/fpls.2020.00025
- CrossRef
- Google Scholar
122
ZouJ.HussM.AbidA.MohammadiP.TorkamaniA.TelentiA. (2019). A Primer on Deep Learning in Genomics. Nat. Genet.51, 12–18. 10.1038/s41588-018-0295-5
- CrossRef
- Google Scholar

Summary

Keywords

machine learning, plant phenotyping, phenotype prediction, plant breeding, big data

Citation

Danilevicz MF, Gill M, Anderson R, Batley J, Bennamoun M, Bayer PE and Edwards D (2022) Plant Genotype to Phenotype Prediction Using Machine Learning. Front. Genet. 13:822173. doi: 10.3389/fgene.2022.822173

Received

25 November 2021

Accepted

07 March 2022

Published

18 May 2022

Volume

13 - 2022

Edited by

Pasquale Tripodi, Council for Agricultural and Economics Research (CREA), Italy

Reviewed by

Chuang Ma, Northwest A&F University, China

Osval Antonio Montesinos-López, Universidad de Colima, Mexico

Updates

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: David Edwards, dave.edwards@uwa.edu.au

This article was submitted to Plant Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Genomics of Plants and Plant-Associated Organisms

REVIEW article

Plant Genotype to Phenotype Prediction Using Machine Learning

Abstract

Introduction

Machine Learning Methods for Genotype to Phenotype Prediction

Genetic Variation Representation

Integrating High Throughput Phenotyping Into Genotype to Phenotype Models

Data-Driven Breeding Requires Structured Datasets

Explainable (Interpretable) Machine Learning for Genotype to Phenotype Models

Discussion

Statements

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Abbreviations

References

Summary

Outline

Figures

Cite article

Article metrics

REVIEW article

Plant Genotype to Phenotype Prediction Using Machine Learning

Abstract

Introduction

Machine Learning Methods for Genotype to Phenotype Prediction

Genetic Variation Representation

Integrating High Throughput Phenotyping Into Genotype to Phenotype Models

Data-Driven Breeding Requires Structured Datasets

Explainable (Interpretable) Machine Learning for Genotype to Phenotype Models

Discussion

Statements

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Abbreviations

References

Summary

Outline

Figures

Cite article

Share article

Article metrics