Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species

Genomic prediction (GP) is the procedure whereby the genetic merits of untested candidates are predicted using genome wide marker information. Although numerous examples of GP exist in plants and animals, applications to polyploid organisms are still scarce, partly due to limited genome resources and the complexity of this system. Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms that have excelled at many prediction tasks. A potential advantage of DL for GP over standard linear model methods is that DL can potentially take into account all genetic interactions, including dominance and epistasis, which are expected to be of special relevance in most polyploids. In this study, we evaluated the predictive accuracy of linear and DL techniques in two important small fruits or berries: strawberry and blueberry. The two datasets contained a total of 1,358 allopolyploid strawberry (2n=8x=112) and 1,802 autopolyploid blueberry (2n=4x=48) individuals, genotyped for 9,908 and 73,045 single nucleotide polymorphism (SNP) markers, respectively, and phenotyped for five agronomic traits each. DL depends on numerous parameters that influence performance and optimizing hyperparameter values can be a critical step. Here we show that interactions between hyperparameter combinations should be expected and that the number of convolutional filters and regularization in the first layers can have an important effect on model performance. In terms of genomic prediction, we did not find an advantage of DL over linear model methods, except when the epistasis component was important. Linear Bayesian models were better than convolutional neural networks for the full additive architecture, whereas the opposite was observed under strong epistasis. However, by using a parameterization capable of taking into account these non-linear effects, Bayesian linear models can match or exceed the predictive accuracy of DL. A semiautomatic implementation of the DL pipeline is available at https://github.com/lauzingaretti/deepGP/.

Genomic prediction (GP) is the procedure whereby the genetic merits of untested candidates are predicted using genome wide marker information. Although numerous examples of GP exist in plants and animals, applications to polyploid organisms are still scarce, partly due to limited genome resources and the complexity of this system. Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms that have excelled at many prediction tasks. A potential advantage of DL for GP over standard linear model methods is that DL can potentially take into account all genetic interactions, including dominance and epistasis, which are expected to be of special relevance in most polyploids. In this study, we evaluated the predictive accuracy of linear and DL techniques in two important small fruits or berries: strawberry and blueberry. The two datasets contained a total of 1,358 allopolyploid strawberry (2n=8x=112) and 1,802 autopolyploid blueberry (2n=4x=48) individuals, genotyped for 9,908 and 73,045 single nucleotide polymorphism (SNP) markers, respectively, and phenotyped for five agronomic traits each. DL depends on numerous parameters that influence performance and optimizing hyperparameter values can be a critical step. Here we show that interactions between hyperparameter combinations should be expected and that the number of convolutional filters and regularization in the first layers can have an important effect on model performance. In terms of genomic prediction, we did not find an advantage of DL over linear model methods, except when the epistasis component was important. Linear Bayesian models were better than convolutional neural networks for the full additive architecture, whereas the opposite was observed under strong epistasis. However, by using a parameterization capable of taking into account these non-linear effects, Bayesian linear models can match or exceed the predictive accuracy of DL. A semiautomatic implementation of the DL pipeline is available at https://github.com/ lauzingaretti/deepGP/.

INTRODUCTION
Deep learning (DL) techniques comprise a heterogeneous collection of machine learning algorithms which have excelled at many prediction tasks, and this is a very active area of research (Min et al., 2017;Pattanayak, 2017;Namin et al., 2018). All DL algorithms employ multiple neuron layers and numerous architectures have been proposed: multiple layer perceptrons (MLPs), recurrent neural networks (RNNs), convolutional neural networks (CNNs) (LeCun et al., 2015) and others. DL is relatively straightforward to implement (https://keras.io/whyuse-keras/), but optimum performance depends on an adequate hyperparameter choice, which is not trivial and requires considerable computational resources (Young et al., 2015;Chan et al., 2018). Although previous, limited evidence does not show a consistent advantage of DL over penalized linear methods for genomic prediction (GP) purposes (González-Recio et al., 2014;Ma et al., 2017;Bellot et al., 2018;Montesinos-López et al., 2018a;Montesinos-López et al., 2018b;Montesinos-López et al., 2019a), more efforts are needed to fully understand the behavior and potential constraints and capabilities of DL in GP scenarios.
Genomic selection (GS) is the breeding strategy consisting in predicting complex traits using genomic-wide genetic markers. The idea was developed to overcome the limitations of markerassisted selection (MAS) and was formalized by Meuwissen et al. (2001). While MAS establishes a model with only the markers with significant associations, genomic selection includes all, or most available markers, for GP, irrespective of their effect and its significance. Due to the decrease in genotyping costs, genomic selection is becoming the standard tool in many plant and animal breeding programs (Bernardo, 2008;González-Camacho et al., 2012;Crossa et al., 2013b;Meuwissen et al., 2013;Wiggans et al., 2017). There is an increasing number of successful applications of genomic selection in diploid and polyploid organisms where its use has generated important genetic gains by improving the accuracy of breeding value prediction and dramatically reducing generation intervals (Crossa et al., 2013a;Castillo-Juárez et al., 2015;Duangjit et al., 2016;Juliana et al., 2019;de Bem Oliveira et al., 2019).
In any scenario, GP poses statistical challenges since the number of markers is usually much larger than the number of individuals, i.e., the so-called large p (number of features) small n (sample size) paradigm Pérez and De Los Campos, 2014). In this context, statistical methods require either shrinkage, variable selection, or a combination of both (Tibshirani, 1996). Most GP methods are based on linear models, such as Genomic Best Linear Unbiased Prediction (GBLUP) (VanRaden, 2008), the Bayesian GP family (Meuwissen et al., 2001;Pérez and De Los Campos, 2014), or LASSO (Tibshirani, 1996). In GBLUP, all marker effects are assumed to be normally distributed with equal variance and a homogeneous shrinkage is induced, whereas Bayesian models are more flexible and differential shrinkages and/or variable selection can be applied to distinct marker subsets. Note that these methods are linear and, in contrast to DL, have not been designed to model nonadditive genetic effects (such as dominance or epistasis); however, these effects can be incorporated in the model with appropriate parameterizations.
One potential advantage of DL for GP over standard methods is that the whole genetic merit, including all non-additive effects, can potentially be predicted without the need to partition all effects. This is an interesting property for clonally propagated outcrossing species, because genomes can be asexually reproduced from single plants once the desirable individual is found. It should also be a promising strategy in polyploids, although their complex genetic structure has delayed the availability of whole genome markers and of specific analytic tools for, e.g. SNP calling (Slater et al., 2016;Gezan et al., 2017;Bourke et al., 2018). A few studies have demonstrated the potential advantages of GS in allo and autopolyploids (Gezan et al., 2017;Enciso-Rodriguez et al., 2018;Nyine et al., 2018;Amadeu et al., 2019;de Bem Oliveira et al., 2019;Juliana et al., 2019;Zingaretti et al., 2019), although its implementation is still in its infancy.
When non-additive effects are investigated, there are two important points that need to be considered for higher ploidy levels: i) there is a portion of the intra-locus allele interaction (i.e., dominance) that is passed to the progeny (particularly full-sibs), and ii) the definition of non-additive effects is more complex than in diploids as higher order interaction exist (Osborn et al., 2003). Thus, methodologies that could model the whole genetic merit without restrictive assumptions could facilitate and improve the prediction for polyploid species, making DL an attractive choice for genomic prediction. In practice, DL aims at predicting the whole genetic merit, including interactions irrespective of their origin.
Among the polyploid species, strawberries (Fragaria x ananassa) and blueberries (Vaccinium corymbosum) are considered two of the most important soft fruit commodities. Considered a rich source of vitamins and minerals, fruit markets for both species have experienced a global increase in production and consumption over the past decade (https://www.nass.usda. gov/Publications/Todays_Reports/reports/ncit0619.pdf). To ensure that production and fruit quality meet the global demand, genetic improvement, and particularly GP, has a role to play in maximizing the utility, diversity, and yield of resources. In this sense, previous experimental assessments performed in blueberry de Bem Oliveira et al., 2019) and strawberry (Gezan et al., 2017) have proven the feasibility of incorporating genomic selection to either accelerate the pace or improve the efficiency of breeding programs. From a genetic standpoint, one important difference between both species is its inheritance pattern. Cultivated strawberry (Fragaria x ananassa) is an allo-octoploid hybrid plant originated by cross between two wild octoploid species F. chiloensis and F.virginiana (Hancock et al., 2008) both descendants of Fragaria diploid species; referred as allopolyploids, meiosis is mainly dictated by preferential pairing, exhibiting a diploid-like (or disomic) segregation. In contrast, blueberry is a tetraploid organism originated from genome duplication within the same species. In autopolyploids, the meiotic pairing is mainly described by forming either random bivalents or multivalent during the division. Since the molecular mechanisms in auto and allopolyploids are quite complex, comparing new algorithms is a relevant issue to the prospect of GP in these and other polyploid species.
In this study, we evaluated the performance of deep learning for genomic prediction in two important horticultural species: allo-octoploid strawberry and auto-tetraploid blueberry. We complement the empirical study with simulations to understand better the impact of genetic architecture on DL performance. Given the complexity of implementing DL, we also provide a guideline on best practices for hyperparameter tuning and evaluate its importance in terms of predictive ability. To facilitate reproducibility of these methods, a python-based package for semiautomatic DL implementation, including auto and allopolyploid organisms have been made available at https:// github.com/lauzingaretti/deepGS/.

Plant Material and Genotypes
Predictive performances were compared in two polyploid species (blueberry and strawberry), for a series of traits with presumably contrasting genetic architecture. A summary of both experimental data sets is presented in the Table 1.
Regarding strawberry, we used 1,233 unique genotypes which correspond to five advanced selection trials (T2, T4, T6, T8, and T10) from the strawberry breeding program at the University of Florida, Institute of Food and Agricultural Sciences (USA). These advanced trials were planted in five consecutive seasons and were given an even code starting with season 2013-2014 as T2 and ending with season 2018-2019 as T10. The number of lines in each trial was 217, 240, 236, 272, and 393 for T2, T4, T6, T8, and T10, respectively. Some of the genotypes in the last trial T10 were already tested in earlier trials, making the total number of observations sum up to 1,358 (instead of 1,233). Plants were genotyped with the Axiom IStraw90 SNP array (Bassil et al., 2015). After quality control, in which those markers with minor allele frequencies (MAF) < 5% and with missing marker data > 5% were eliminated, 9,908 polymorphic SNP markers were available. A total of five yield and fruit quality traits were evaluated in each trial: soluble solid content (brix), average fruit weight (AveWtT), total marketable yield (MktWtT), early marketable yield (MktWtE), and percentage of culled (unmarketable) fruit (CullsTPer). Additional details for T2 and T4 can be found in Gezan et al. (2017).
The blueberry population used in this study encompasses one cycle of the University of Florida blueberry breeding program's recurrent selection and comprised 1,802 lines from 117 full-sib families. The population was originated from 146 parents that presented superior phenotypic performance (cultivars and advanced stage of breeding). Individuals were evaluated for five yield and fruit quality-related traits: firmness, fruit size, weight, yield, and picking scar, which were collected during two production seasons. Phenotypes were pre-corrected for fixed year effects, as detailed in Amadeu et al. (2019) and de Bem Oliveira et al. (2019). A total of 73,045 SNPs was obtained using sequence capture by Rapid Genomics (Gainesville, FL), after aligning the reads against the high-quality "Draper" genome assembly (Colle et al., 2019) as described in Benevenuto et al. (2019). Marker filtering followed these criteria: biallelic, mean coverage > 40, minimum allele frequency > 0.01; maximum missing data = 0.5%; minimum quality = 20. Also, individuals with more than 50% missing data were removed, missing genotypes were simply imputed with the mean. Tetraploid genotypes were called and the allele dosages were inferred with the updog R package (Gerard et al., 2018). Standard genotype calling with updog allows inferring genotypes according to the number of allele copies, and genotypes can be coded say 0,1,2,3,4. In addition, as in de Bem Oliveira et al. (2019), here we considered a set of "diploidized" genotypes that were obtained pooling all heterozygous genotypes in a single class, i.e., genotypes above 0,1,2,3,4 can be recoded as 0,1,1,1,2. The rationale is that there can be incertitude on the exact number of allele copies in heterozygous genotypes.
The GP methods evaluated in this study were assessed by true validation, which was obtained by splitting data into a training and a validation dataset. In the strawberry dataset, we considered that predicting performance of the last stage lines (T10) is the most interest application for the industry and therefore the population was divided between training (T2, T4, T6, and T8 trials) and validation (T10) subsets with 965 and 393 lines, respectively. In the case of blueberry data, all samples were equally important and 30% of randomly sampled genotypes were assigned to the validation set. Predictive ability (PA) was defined as the correlation between observed and predicted phenotypes in the validation set; prediction was computed from parameters estimated in the training dataset only.

Genetic Structure and Heritability Inference
Potential genetic structure was assessed by principal component analysis (PCA) using all genotypes. Since genetic architecture may have an impact on GP performance and on the optimum GP model , additive and non-additive genetic features were assessed by computing variance components from the model: where the vector y represents the adjusted phenotype, m1 is the intercept, a∼N(0, As 2 a ),d∼N(0, Ds 2 d ) and e∼N(0, Es 2 e ) are the additive, dominant and epistatic effects, respectively, and e ∼ N(0, Is 2 e ) is the residual component. Matrices A and D were obtained using AGHmatrix package (Amadeu et al., 2016) for both strawberry (as diploid) and blueberry (autotetraploid) species. For diploids, A and D were computed using VanRaden (2008) and Vitezica et al. (2013) for genotypes 0, 1, and 2, respectively. In the case of ploidy = 4, D was obtained as in Slater et al. (2016). The epistatic matrix (E) considered is the Hadamard product of additive effects, i.e. A ⊙ A (Henderson, 1984) Posterior distributions of genetic parameters were obtained using Reproducing Kernel Hilbert Spaces (RKHS) regression with BGLR package (Pérez and De Los Campos, 2014). The additive, dominance and epistatic ratios were estimated as: ) andĥ 2 e = s 2 e =(s 2 a + s 2 d + s 2 e + s 2 e ); where s 2 i the i th mean posterior estimates of s 2 as in Equation 1. We used both training and validation datasets combined in this stage, since this is purely a descriptive analysis and the values obtained are not employed in the later prediction stages.

Penalized Linear Methods
We compared the prediction performance of DL models with two well-established linear methods: Bayesian Lasso (BL, Meuwissen et al., 2001) and Bayesian Ridge Regression (BRR, Gianola, 2013). In these models, the trait can be expressed as: where m1 is the overall mean, g=Xb, X is the genotypes' matrix and b is a vector of marker effects. In BRR, prior distributions of marker effects b are N(0, Is 2 b ), whereas the prior distributions for ). Note that the Laplace distribution does not remove markers so, contrary to its frequentist counterpart, BL is not a variable selection approach. Each model was fitted by using only phenotypes from the training subset. The models were run using the BGLR package (Pérez and De Los Campos, 2014) with a Gibbs sampler algorithm for a total of 6,000 cycles, discarding the first 1,000 samples for burn-in.
The above parameterization assumes additivity of effects, although linear models can address non-linear relationships if properly parameterized. Non-linear interactions can be modeled by expressing g (Equation 2) in a general way, i.e., g = W w where W (centered and scaled) is a matrix of dummy variables that indicates the number of copies of the reference allele ranging from 0 to the ploidy level (Slater et al., 2016;Enciso-Rodriguez et al., 2018). This model is, in principle, a good parameterization to account for nonlinear interactions and we will refer to it as BRR general model (BRR-GM), since Bayes ridge regression was used. For more details, see Enciso-Rodriguez et al. (2018) and Amadeu et al. (2019).
Non-linearity can also be managed by means of RKHS regression (Gianola et al., 2006) as an alternative to a linear regression for capturing complex interactions. This model , a kernel function where h is de bandwidth parameter controlling how fast the covariance function drops with the distance between pairs of markers and jjx i − x i jj 2 is the Euclidean distance between any two pairs of genotypes. This parameterization induces a general matrix of genetic covariance between markers. The key point here is that the kernel can model non-linear relationships because it is a non-linear transformation of the distances between the input variables. Empirical evidence confirms that it is an accurate approach to predict phenotypes of complex traits (Gianola et al., 2008;de Los Campos et al., 2010). BRR-GM and RKHS were only implemented for strawberry and simulated scenarios, since it was in strawberry where we found the trait with the largest epistasis component, as described below.
Deep Learning (Convolutional Neural Networks) DL has been described as a universal learning approach able to solve supervised, semi-supervised and unsupervised problems. Several DL architectures have been proposed, such as MLPs, RNNs, CNNs, Generative Adversarial Networks (GANs) and Reinforcement Learning (RL). Figure 1 shows a generic pipeline to evaluate DL in a GP context. In our previous experiment (Bellot et al., 2018), CNNs were the best performing methods and therefore are the only ones discussed here. The advantage of CNNs in a GP context is that they can model the correlation between adjacent input variables, that is, linkage disequilibrium between nearby SNPs. This is done via a mathematical operation called convolution (Goodfellow et al., 2016). A typical CNN is made up of "convolutional layers", "pooling", "flatten" and "dense" fully connected layers (Figure 2). In the "convolutional layer", an operation called convolution is performed along the input of predefined width and strides, which are known as "kernel" and "filter" in the DL jargon, respectively. From a mathematical view, a convolution s(t) is a function that can be defined as an "integral transform" (Widder, 1954): where one of the functions (k or f) in Equation (3) must be a kernel. Assuming that the kernel is represented by k, the convolution is the transformation of f (input data in the DL context) into s(t). The operation is just the weighted sum of an infinite number of copies f shifting over the kernel. The discrete version of Equation 3 follows naturally as: One of the main advantages of convolution networks is their capability to reduce the number of operations, i.e., the hyperparameters to be estimated. As usual, an activation function (generally non-linear) is applied after each convolution FIGURE 1 | A generic deep learning (DL) pipeline for genomic prediction (GP) purposes. The general process includes the training and validation steps. In the training step, data are split into training and testing, DL hyperparameters are optimized by internal cross-validation with the test set and the model with the best predictive ability (PA) is chosen. In the validation step, the model PA is evaluated using a new set of data.
FIGURE 2 | General CNN architecture employed in our workflow. The input layer is a SNP matrix of size n x p, where n is the number of training set and p, the number of SNPs. The convolutional layer consists on a n filters convolution followed by a max-pooling layer with pool size = 3 and an optional dropout. The outputs of max-pooling layer are joined together into one vector by flattening. All the neurons in the flatten layer are fully connected to the first dense layer. We tune the network using i dense layers with a variable number of hidden neurons in the respective hidden layers. The output of these dense layers is the prediction layer that uses linear function as activation. The neurons in convolutional and dense layers use relu, tanh or linear function as activations.
to produce the output layer. Finally, "pooling" layers reduces dimension and achieves a smoother representation, summarizing adjacent neurons by computing their maximum or mean.

Hyperparameter Optimization
Since DL depends on numerous parameters that influence performance, optimizing hyperparameter can be a critical unresolved step, which relies heavily on heuristics. Hence, it is surprising that many DL applications in GP have not paid enough attention to this problem (Ma et al., 2017;Montesinos-López et al., 2018b;Montesinos-López et al., 2019b). Several approaches have been proposed for hyperparameter tuning (e.g., Bellot et al., 2018;Cho and Hegde, 2019;Le et al., 2019;Rajaraman et al., 2019;Yoo, 2019). Here, DL architectures were optimized using Talos (Autonomia Talos, 2019), which works combining all parameters in a grid. Talos can choose the best model either maximizing the predictive accuracy or minimizing the error; the former criterion was employed here. Since the approach can be expensive as the number of hyperparameters increases, a random search is the best strategy in practice. This rule evolves a list of CNN models for each phenotypic trait. We optimized the following hyperparameters (values considered within parentheses): activation function (relu, tanh, linear), number of filters (16, 32, 64, 128), regularization (i.e., weight decay in DL terminology, 0, 0.1, 0.01, 0.001), learning rate (0.1, 0.01, 0.001, 0.0025), number of neurons in fully connected layer (4,8,12,16), number of hidden layers (1,5,10), and dropout (0, 0.01, 0.1, 0.2).
Talos output is the accuracy for each hyperparameter combination; we then used hyperparameter values as independent variables and accuracy as target variable to run a random forest algorithm, which allowed us to compute the hyperparameter value importance, measured as the decrease in Gini's coefficient when adding the given hyperparameter. This hyperparameter importance can be then used as guide to improve interpretability. The R package randomForest (Liaw and Wiener, 2002) was employed for this analysis.
The DL algorithms used in this study were implemented in Keras (Chollet, 2015) and Tensorflow (Abadi et al., 2015) and run on a GPU equipped Linux workstation. A generic script is publicly available at https://github.com/lauzingaretti/deepGS/.

Simulation
We studied the impact of genetic architecture on prediction performance by simulation using the actual observed strawberry genotypes, assessing predictive performance with the same T10 strawberry genotypes (and genotypic data) as in the real experiment, except that phenotypic responses were simulated. Three contrasting genetic architectures were considered: 1. Additive: 200 randomly chosen SNPs were considered as causal loci. No dominance was simulated. Total individual genetic value was the sum of effects across loci. 2. Epistatic: 100 epistatic pairs of SNPs were randomly sampled.
Epistasis was multiplicative by pairs, i.e., the genotype was the product of individual genotypes in each pair. Total genetic value was the sum of effects across pairs of loci. 3. Mixed: 80 individual additive SNPs and 60 epistatic SNP pairs were randomly chosen. Total genetic value was the sum of effects across pairs of loci and individual additive loci.
Allele substitution effects were sampled from a gamma distribution G(a = 1, b = 0.2). The trait was obtained adding the genetic value to an environmental normal residual. Environmental variance was chosen such that broad-sense heritability was set to 0.50. For each genetic architecture, five replicates were run. We compared BRR, BRR-GM, RKHS, and DL. DL architectures were specifically optimized to each phenotypic trait, since no universal architecture is able to make accurate predictions for all cases.

Population Structure and Genetic Parameters
No clear population structure was observed, neither in the strawberry nor in the blueberry datasets ( Figure S1). Note that genetic relationships between trials in strawberry data are rather uniform, irrespective of whether they are successive seasons or not. This, together with the fact that little genotype by environment (or year) interaction was observed (Gezan et al., 2017), suggests a favorable scenario for GP.
Heritability estimates in strawberry are slightly different from those obtained in the same material by Gezan et al. (2017) since here we used additional data and we removed genotypes tested since here we used additional material and we removed genotypes tested more than once on different seasons. Nevertheless, in agreement with previous results de Bem Oliveira et al., 2019;Gezan et al., 2017) narrow-sense heritabilities were moderate, ranging from 0.25 to 0.35 for most strawberry ( Figure 3) and blueberry (Figure 4) traits, except for strawberry average fruit weight (h 2 a = 0:58) The degree of dominance found was quite low in general, especially in strawberry. An exception was blueberry yield, where dominant and epistatic variances were similar to the additive variance ( Figure 4E). A remarkable case is percentage of culled fruit (CulsTPer) in strawberry, where the epistatic ratio (18%) was only slightly smaller than the additive one (25%, Figure 3E).

Hyperparameter Importance
CNN hyperparameters were optimized for each strawberry trait separately. Figure 5A shows the importance of each hyperparameter obtained from random forest by regressing the model predictive accuracies (obtained by an inner crossvalidation) on all hyperparameter values combinations. Interestingly, the number of filters was overall the most relevant factor, whereas other factors such as learning rate, whose importance has been claimed in the literature as critically important (Maas et al., 2013;Bawa and Kumar, 2019;Feng and Lu, 2019), played only a minor role. We also observed that the effect of each hyperparameter depends on the layer, e.g. regularization or dropout were more important in first than in deep layers.
In Figure 5A, the "trait" effect was excluded since it cannot be controlled by the experimenter, although it was by far the most influential variable. This is illustrated in Figure 5B, which shows the distribution of accuracies for each trait studied. Not only maximum accuracies varied across trait, the profiles were also extremely different, usually multimodal. This suggests interactions between hyperparameter combinations, and it also indicates that traitspecific optimization should be performed whenever possible. Figure 5 illustrates the kind of complex interactions that we observed in hyperparameter optimization. For instance, Figures  5C, E, show the distinct influence of activation functions in percentage of culled fruit ( Figure 5C) and brix ( Figure 5E). Although "relu" activation function has been suggested as the activation of choice in recent DL literature (Maas et al., 2013;Pouladi et al., 2016), here we observed that linear or even sigmoid-like hyperbolic tangent (tanh) seemed to be a safer choice overall. It is relevant to note that interactions were clearly observed for some hyperparameters, such as the number of filters. For CulsTPer, either 16 or 128 filters resulted in optimum accuracies, although they were also associated with the worst hyperparameter combinations. In contrast, either 32 or 64 filters are to be preferred for average weight in strawberry ( Figure 5F).
The final sets of hyperparameters for strawberry and blueberry phenotypes are indicated in Tables S1 and S2, respectively. Overall, our study shows that shallow architectures are more competitive than deep architectures in terms of PA, since the majority of models only included one CNN layer. The number of filters -in combination with dropouthas a large effect in the PA, but is highly dependent of the trait. For instance, all optimal architectures for strawberry contain 128 convolutions, but this is much more variable in the case of blueberry, with a range between 16 and 128 convolutional operations. As for the fully connected layers, the situation is less clear, and no obvious pattern is observed. We can highlight some characteristics though, for example, the number of hidden fully connected layers is quite variable, but only a few neurons (4,8,12) are preferable in most of the architectures. As also reported by Waldmann (2018), combining weight decay and dropout regularization is an efficient option to increase PA. Finally, the best overlapping stride was 1 and optimum window size was 3 in the convolutional layer, confirming Bellot et al. (2018) results.

Comparing Deep Learning With Bayesian
Penalized Linear Models Figure 6A shows observed predictive abilities for each of the five GP methods compared: BL, BRR, BRR-GM, RKHS, and CNNs in strawberry. When averaged over traits in the strawberry species,    Figure 3E).  As for the blueberry phenotypic traits, we found no differences between GP methods BL and BRR (average PA = 0.42), whereas CNNs were somewhat underperforming (average PA = 0.40). The most remarkable result in blueberry is that CNN performance was barely affected by the ploidy level employed to build the genetic relationship matrix. In fact, the "diploid" option seemed more robust than the tetraploid one, except in fruit yield, the only trait that was measured using a rating scale. Table 2 presents the main simulation results and Table S3, the CNN architectures used for computing the PA in each replicate. Some interesting remarks can be made from these simulations. First, although biased, the variance component estimates do detect whether epistasis is important: h 2 e estimates are larger than the narrow-sense heritability in the presence of complete epistasis. Results are far less clear when only a fraction of loci show epistasis. But the most relevant result is that, as we hypothesized, predictive accuracies of CNN and additive penalized methods were affected by genetic architecture. BRR and RKHS were better than CNNs for the pure additive architecture, whereas the opposite was observed with pure epistasis. However, BRR-GM, which accounts for non-linear relationships, was better than either CNNs or pure additive linear models in most of the studied cases.

DISCUSSION
Supervised DL methods are examples of predictive modelling, consisting of approximating a mapping function (f) from input FIGURE 6 | Predictive ability (PA) measured as correlation between observed and predicted phenotypes in the validation dataset in strawberry (A) and blueberry (B). Bayesian linear models (lasso and BRR) PAs in blueberry were computed with tetraploid genotypes, but were almost identical to those obtained with the diploidized ones. (X) to output (y) variables (Hornik et al., 1990). These problems include classification or regression tasks, to use the machine learning jargon. Numerous successful applications of DL in classification contexts have been published, e.g. pattern recognition (Drayer and Brox, 2014;Liang and Hu, 2015;Işin et al., 2016;Badrinarayanan et al., 2017) and natural language processing (NLP) (Deng and Liu, 2018). The DL implementation in regression tasks is less abundant and the benefit of using these methods remains uncertain (Bellot et al., 2018;Montesinos-López et al., 2018a;Azodi et al., 2019). Most GP problems fall into the regression task due to the complex nature of quantitative traits (MacKay, 2009). So far, GP problems have been mainly addressed using penalized linear models Crossa et al., 2013a). More recently, the prediction of complex traits from genetic data is receiving attention from DL users (Ma et al., 2017;Montesinos-López et al., 2018a;Montesinos-López et al., 2019b). The present work aim was to study the strengths and weaknesses of applying CNN to GP problems in polyploid species. CNN networks are attractive for addressing these problems, as they can accommodate situations where input variables are distributed along a space pattern, as with the case of SNPs. Implementing GS in polyploids is challenging. In allopolyploids, genetic analyses have been traditionally implemented assuming diploidy, taking advantage of the fact that systems present disomic inheritance. High predictive performances have been observed in a variety of allopolyploid species (e.g. cotton, strawberry, wheat) and traits (Gezan et al., 2017;Gapare et al., 2018;Juliana et al., 2019). Recently, the importance of accounting for the contribution of subgenomes-potentially expressing epistatic effects-was considered in wheat, which shed light on the importance of accounting for this source of variation within the GP models (Santantonio et al., 2019). However, the scenario is even more complex in autopolyploid species. Even with the recent advances in genotyping and sequencing technologies, the amount of genomic information, and understanding, in most autopolyploid species is still limited when compared to allopolyploid crops. One of the challenges is resolving the allelic dosage of individual locus (Bourke et al., 2018;Gerard et al., 2018). From a quantitative genetics standpoint, we emphasize that polyploid species might present higher degrees of complete and partial intra-locus interactions than diploids (Gallais, 2003;Ferrão et al., 2018). Here, the interest of investigating DL methods in polyploids is to take advantage of its non-linearity and less restrictive assumptions for GP in comparison to the traditional linear model-based methods.
Previous studies (Ma et al., 2017;Bellot et al., 2018;Montesinos-López et al., 2018a;Montesinos-López et al., 2019b) have not shown clear advantages of DL over linear model GP, as conventional models were competitive in terms of PA, with their added benefit of being faster and with more biological interpretability. However, DL could be better suited to explore non-linear components than linear models, especially when genotypes can be transmitted integrally, as occurs with asexual propagation. Certainly, the weak performance of classical additive models in the presence of non-additive variance (e.g. Figure 6 for percentage of culled fruit) confirms the relevance of developing methodologies that can incorporate non-linearity (Ober et al., 2015;Gezan et al., 2017). This purpose can be attained by different approaches. The simplest one is to incorporate a general matrix into the linear models made up of dummy variables. This model contains as many degrees of freedom as ploidy level per locus and allowing for any interaction structure between alleles (Enciso-Rodriguez et al., 2018;Amadeu et al., 2019). RKHS models (Gianola et al., 2006;Gianola et al., 2008; are also able to capture complex interaction patterns in a relatively straightforward manner. Alternatively, a CNN can be implemented using simply the raw data. Our analyses suggest that DL can perform better than additive and RKHS models for traits where the epistatic component is important and where narrow-sense heritability is low (e.g. percentage of culled fruit, Figure 6). The simulation study performed in this work (Table 2) suggested that BRR including additivity, dominance and the general dummy matrix described above can improve upon CNNs when the non-additive component is important, although CNNs were better than strict additive linear models. Additional analyses with a wider range of phenotypic traits, genetic structures and in larger datasets are needed to validate our results.
An underlying goal of our work was to investigate the effect of accounting for allele dosage in a GP context. Owing to the complex nature of polyploids, genotype calling can be a challenge and "diploidization", i.e., considering a polyploid genome as diploid is usual (Bourke et al., 2018). Some studies have recently investigated the effect of accounting the ploidy level in prediction accuracy in polyploids (Endelman et al., 2018;Nyine et al., 2018;Amadeu et al., 2019;Lara et al., 2019;Zingaretti et al., 2019). As in these previous results (de Bem Oliveira et al., 2019;Zingaretti et al., 2019), here we found that "diploidization", in which all heterozygous genotypes are pooled, is as efficient and accurate as polyploid genotyping for prediction purposes, albeit it is trait dependent. Therefore, we conjecture that genomic selection, particularly for low levels of ploidy, can pay off in polyploids even with simplified genotyping strategies. We need to be careful though as this approach may not be equally appropriate for all levels of ploidy and heterozygosity. For instance, this might be an issue with sugarcane (with ploidy starting from 2n=20) as most individuals will be heterozygous.
It is traditionally thought that DL requires extremely large datasets to be trained effectively Liang and Hu, 2015;Xiong et al., 2015). However, this and related works (Ma et al., 2017;Bellot et al., 2018;Montesinos-López et al., 2018a;Montesinos-López et al., 2019b) have shown that DL performance in GP is comparable to those of linear methods. Furthermore, the largest dataset analyzed so far with DL for prediction (~100k individuals) did not show a consistent advantage of DL (Bellot et al., 2018). Therefore, it seems that is the trait what really influences the success of DL and it appears not so critical the size of the dataset. This does not preclude, of course, that a large N is needed to advance in our knowledge on best GP strategies. In fact, a larger N can be especially recommended in clonally propagated species. It is well known that an efficient breeding program tests a low number of crosses with a high number of genotypes in each of them. A cross would need to be tested if not much information is available though. Numerous clonally propagated species of agricultural interest are polyploids, leading to high heterozygosity, non-linear interactions, and scarce prior knowledge about the crosses. In this scenario, as many cross-combinations as feasible should be produced to ensure the discovery and evaluation of the best genotypes (Grüneberg et al., 2009). The actual balance will depend on the level of epistasis and dominance. If dominance is large, then the best clone would be within families with good performance; if dominance is low, this is not necessarily so.
A drawback of DL models is that they lack biological or process interpretability and neither feature selection nor feature importance are obvious. In our opinion, GP algorithms are not too useful for providing biological insight into the genetic basis of phenotypes; genome wide association studies should be more appropriate. In all, our results suggest that DL performance improve as non-additive variance increases, a situation is usually encountered in fitness related traits.
DL hyperparameter tuning is critical and difficult, especially in terms of computational resources. Our analysis allows us to provide some generic recommendations though. First, we and others (Bellot et al., 2018;Montesinos-López et al., 2018a;Montesinos-López et al., 2019b) concluded that the predictive accuracy is mainly dependent of the trait, i.e., the architecture needs to be tuned for each trait individually. Second, here we show that the popular relu activation function is not necessarily a universally valid activation function, that interactions between hyperparameter combinations should be expected and that the number of convolutional filters and regularization in the first layers can have an important effect into the model performance ( Figure 5). In general, we and other authors (Bellot et al., 2018;Waldmann, 2018) have reported that a shallow network is the best scenario in most cases. Nevertheless, DL can still be attractive because it does not require feature engineering, a critical step in most machine learning methods. A further strength of DL is its flexibility, e.g., it allows to define latent variables by using autoencoder or embedding as a generative latent variable model. In addition, networks, even if shallow, can model complex relationships employing any non-linear activation function.
Overall, there is no evidence that applying DL in GP applications necessarily improves the prediction accuracy upon that of classical linear model methods. PA depends on the trait and is affected by many factors; no one algorithm is uniformly better for all species and traits (Hu and Greene, 2018;Pérez-Enciso and Zingaretti, 2019). PA usually decays if heritability is low or in the presence of high epistatic effects. Even under these conditions though, Bayesian models were better than CNNs in almost all cases (Tables S1, S2, Table 2). Even if performance of DL for GP is not outstanding, we cannot ignore that plant breeding is based on both genotyping and phenotyping, and that high throughput phenotyping is critical for genomic dissection of complex traits (Cobb et al., 2013). Imaging and computer vision can be employed to measure the physiological, growth, development, and other phenotypic properties of plants with the advantage of being fast, non-invasive and a low-cost strategy (Fahlgren et al., 2015), hyperspectral imaging is useful to measure plant traits under say disease progression (Bergsträsser et al., 2015), infrared thermography is able to scan temperature and transpiration; NMR (nuclear magnetic resonance spectroscopy) and mass spectrometry (MS) are applied in plants metabolite evaluation (Hong et al., 2016). These examples should be an ideal scenario to neural networks as they involve imaging at high scale, complex, and heterogeneous datasets with multiple variables and outcome. In summary, we believe that the enormous amount of data that can be automatically recorded revolutionizing plant breeding and the flexible nature of neural networks makes them promising for meeting this future challenge.

AUTHOR CONTRIBUTIONS
MP-E conceived and supervised research. VW, PM, LO, SG, and LF contributed experimental data. LZ developed software and performed research. LZ and MP-E wrote the initial manuscript draft. All authors contributed to discussion and to writing the final draft.

ACKNOWLEDGMENTS
The VW lab acknowledges Dr. Sujeet Verma for curation of strawberry SNP data.