Genome-Wide Prediction of Complex Traits in Two Outcrossing Plant Species Through Deep Learning and Bayesian Regularized Neural Network

Genomic selection models were investigated to predict several complex traits in breeding populations of Zea mays L. and Eucalyptus globulus Labill. For this, the following methods of Machine Learning (ML) were implemented: (i) Deep Learning (DL) and (ii) Bayesian Regularized Neural Network (BRNN) both in combination with different hyperparameters. These ML methods were also compared with Genomic Best Linear Unbiased Prediction (GBLUP) and different Bayesian regression models [Bayes A, Bayes B, Bayes Cπ, Bayesian Ridge Regression, Bayesian LASSO, and Reproducing Kernel Hilbert Space (RKHS)]. DL models, using Rectified Linear Units (as the activation function), had higher predictive ability values, which varied from 0.27 (pilodyn penetration of 6 years old eucalypt trees) to 0.78 (flowering-related traits of maize). Moreover, the larger mini-batch size (100%) had a significantly higher predictive ability for wood-related traits than the smaller mini-batch size (10%). On the other hand, in the BRNN method, the architectures of one and two layers that used only the pureline function showed better results of prediction, with values ranging from 0.21 (pilodyn penetration) to 0.71 (flowering traits). A significant increase in the prediction ability was observed for DL in comparison with other methods of genomic prediction (Bayesian alphabet models, GBLUP, RKHS, and BRNN). Another important finding was the usefulness of DL models (through an iterative algorithm) as an SNP detection strategy for genome-wide association studies. The results of this study confirm the importance of DL for genome-wide analyses and crop/tree improvement strategies, which holds promise for accelerating breeding progress.


INTRODUCTION
Artificial neural networks (ANNs) are computational methods of interest in the area of Machine Learning (ML) research, which has proved to be a powerful tool in several studies of genomic prediction (Drummond et al., 2003;Gianola et al., 2011;González-Recio and Forni, 2011;González-Recio et al., 2014;Leung et al., 2015;Glória et al., 2016;Romagnoni et al., 2019;Yin et al., 2019;Grinberg et al., 2020), due to its ability of dealing with a wide variety of high-dimensional problems in a computationally flexible manner (González-Recio et al., 2014;Ranganathan et al., 2018). In this regard, Gianola et al. (2011) pointed out that this method may be useful for the prediction of complex traits when the number of unknown variables is much larger than the number of samples (high-dimensional genomic information), since ANNs have the ability to capture non-linearities, adaptively . Moreover, Pérez-Rodríguez et al. (2012) found that Bayesian Regularized Neural Networks (BRNNs) and Radial Basis Function Neural Networks (RBFNNs) (non-linear models) had higher predictive accuracy and smaller predictive meansquared error than Bayesian linear regression models (Bayesian LASSO: BL, Bayesian Ridge Regression: BRR, Bayes A and Bayes B) for grain yield and days to heading in wheat.
Over the last several years, complex ANN architectures have been implemented to predict complex traits in several plant species (Ma et al., 2017;Khaki and Wang, 2019). For instance, Deep Learning (DL), a form of ML, has gained increasing interest in prediction studies, which typically use multiple hidden layers, trying to learn functions that connect the input data (inputs layer) and response variables (output layer) in absence of a model (Sheehan and Song, 2016). However, unlike traditional neural networks, algorithms of DL consider many hidden layers during the training of network (Sheehan and Song, 2016;Min et al., 2017), which transform the input data into a more abstract representation at each stacked layer (Khaki and Wang, 2019). In this regard, LeCun et al. (2015) pointed out that the use of multiple hidden layers in DL can reveal non-linear relationships between input data and response variables and can perform extremely intricate functions that are sensitive to minute details and insensitive to large irrelevant variations. Moreover, the use of non-linear functions is a powerful alternative to linear regression because it offers the most flexible curve-fitting functionality, which seeks to minimize the standard error of the estimate to increase the prediction accuracy (Abebe et al., 2018). Rachmatia et al. (2017) implemented Deep Belief Network (DBN), one of the architectural DL methods, for developing genomic prediction models in maize and found that DBN outperformed other methods of prediction such as Reproducing Kernel Hilbert Space (RKHS), BL, and best linear unbiased predictor (BLUP). In another study, Montesinos-López et al. (2018) compared DL with the genomic BLUP (GBLUP) method using nine published genomic datasets (three of maize and six of wheat), in which the DL method had a better performance in most cases. Importantly, they corroborated that there are no universally best prediction machines, and thus, several available methods should be tested in a given breeding population. Khaki and Wang (2019) designed a deep neural network approach to predict crop yield that took advantage of state-of-the-art modeling and solution techniques. The authors pointed out that a salient feature of this learning model is that they treat the response variables as an implicit function of the input variables (e.g., genes and environmental components), which could be a non-linear function and highly complex.
In this study, genomic prediction models based on DL and BRNN were investigated to predict complex traits in two economically important plant species: Eucalyptus globulus Labill. and Zea mays L. E. globulus is one of the most widely planted hardwood tree species in temperate regions of the world, mostly used as raw material for pulp and paper industry due to its highquality cellulose pulp, low lignin, and lipid content (Aumond et al., 2017). This species is also used for the production of essential oils in the pharmaceutical industry (Shao et al., 2020). Notably, E. globulus has been successfully grown in a broad range of environmental conditions, and it stands out as the targets of multiple breeding programs to improve economically important traits such as tree growth and wood quality . On the other hand, Z. mays is one of the world's leading cereal grains along with rice and wheat (Reeves et al., 2016). This species plays an important role in food security and has many uses, including biofuel, animal feed, pharmaceutical, and agro-industrial products (Suleiman et al., 2013). Due to the importance of maize in the global context, several efforts have been undertaken addressing the efficient use of germplasm materials . Therefore, in these two important outcrossing species, we have assessed ML architectures, which were compared with GBLUP and different linear Bayesian regression models (Bayes A, Bayes B, Bayes Cπ, BRR, and BL) and the non-linear model RKHS. One of the main difficulties in implementing DL models is to find the best hyperparameter configuration, which requires time and some basic understanding of hyperparameters to optimize (Pérez-Enciso and Zingaretti, 2019;Zingaretti et al., 2020). In this regard, we evaluated the importance of mini-batch and activation function hyperparameters in terms of predictive ability (PA) and computational time required. The findings of this study can be useful as a guide to the analysis of DL for the genomic prediction of complex traits, facilitating its implementation in operational breeding programs.

Maize Experiment
The panel was composed of 322 inbred lines representing a core collection of tropical maize germplasm of the State University of Maringa, Parana State, Brazil. All inbred lines were obtained from selfing and selection in nurseries, starting from populations and hybrids released by Brazilian public and private sector corn breeding programs. This maize panel was sown in the growing season 2018 in two locations of southern Brazil: Cambira and Sabaudia. The experimental design in both trials was an alpha-lattice with 24 incomplete blocks, and three replications per inbred line. The following traits related to flowering were evaluated: Male and Female Flowering time (MF and FF, respectively), measured in each line as the number of days from sowing to anther extrusion from the tassel glumes and to visible silks, respectively, and the Anthesis-Silking Interval (ASI) calculated as the difference between MF and FF . The target population of maize was genotyped using genotyping by sequencing (GBS) (Elshire et al., 2011;Glaubitz et al., 2014). The raw database of GBS was filtered considering a minor allele frequency (MAF) > 0.05, resulting in 291,633 highquality SNPs. Subsequently, missing data were imputed through LDkNNi imputation (linkage disequilibrium k-nearest neighbor imputation) (Money et al., 2015). SNPs with a MAF < 0.01 and a proportion of missing data per location > 10% were eliminated from the imputed dataset, resulting in a total of 290,973 SNPs. Finally, a subset of 10,000 SNPs was randomly selected for genomic prediction analyses.

Eucalypt Experiment
This breeding population was composed of 62 full-sib and 3 half-sib families of E. globulus located in the La Poza sector, Purranque, administrative region of Los Lagos, Chile. The experimental design was a randomized complete block, with 30 blocks, single-tree plots, and a spacing of 2.5 m between the trees within a block. The traits were assessed in this breeding population: pilodyn penetration (WD), stem straightness (ST), branch quality (BQ), diameter at breast height (DBH), and tree height (TH). WD was indirectly estimated based on pilodyn penetration (in millimeters) at breast height using a Pilodyn 6J Forest (PROCEQ, Zurich, Switzerland) according to Valenzuela et al. (2019). ST was evaluated in the first 2/3 of the total height of the tree, considering an ordinal scale of seven levels, in which 0 represents trees with curvature in the first third of the total height of the tree, and 6 represents trees that could present a slight curvature in the upper third of the tree without loss of productivity . BQ was evaluated according to different criteria of quality (i.e., diameter, angle, and distribution of branches in the tree), considering an ordinal scale of six levels, in which 1 represents a tree with serious limitations and 6 represents a tree with all branching variables in good condition . Total tree height was measured using a Suunto R hypsometer, while DBH was measured with a diameter tape at 1.3 m above ground level. Genomic DNA was isolated from leaf tissue of 646 individuals randomly selected from the breeding population (approximately 10 individuals per family) and genotyped using the EUChip60K SNP system (GeneSeek, Lincoln, NE, United States) (Silva-Junior et al., 2015). The genotyping quality was evaluated using Genome Studio software (Illumina, San Diego, CA, United States). Subsequently, SNPs with a MAF < 0.05 and a proportion of missing data > 10% were eliminated. A final set of 14,442 SNPs was retained.

Heritability Estimates
Narrow-sense heritability (ĥ 2 ) of E. globulus was estimated as follows:ĥ where σ 2 a and σ 2 e are the additive and residual variances, respectively. Due to the mix families (half-sib and full-sib) in E. globulus, only additive variance was considered. On the other hand, in Z. mays, broad-sense heritability (Ĥ 2 ) was computed as: where σ 2 G is the total genotypic variance (i.e., σ 2 G = σ 2 a + σ 2 d ; additive and dominance genetic variance, respectively). Thus, the additive and dominance ratios were estimated as: All estimates and calculations described in this section were performed using the R package sommer (Team R. C., 2013;Covarrubias-Pazaran, 2016). The additive and dominance relationship matrices were estimated using the A.mat (Endelman, 2011) and D.mat (Su et al., 2012) functions, respectively.

Bayesian Alphabet
Genomic prediction linear Bayesian models were fitted using the following approaches: Bayesian Ridge Regression (BRR; Gianola, 2013), Bayesian Lasso (BL; Legarra et al., 2011), Bayes A , Bayes B , and Bayes Cπ (Habier et al., 2011). The models were adjusted as follows: where y is a vector of phenotypes pre-corrected for nongenetic effects (i.e., block effect; experimental design); µ is an overall constant; A ij and D ij are genotype indicator variables for individuals i at locus j; a j and d j for j = 1, 2, . . . p are the additive (a j ) and dominance (d j ) genetic effect of the jth SNP; and ε i is the residual associated to the observation on individual i, with distribution ε ∼ N(0, Iσ 2 ε ), where I is an identity matrix, and σ 2 ε is the residual variance. Genotypes in A matrix were coded as 0 for "aa, " 1 for "Aa, " and 2 for "AA" to capture additive effects, while that for modeling dominance effects in D matrix, the genotypes "aa, " "Aa, " and "AA" were coded as 0, 1, and 0, respectively. For the BRR model, the marker effect (a j ) is distributed as follows: a j |σ 2 a ∼ N(0, σ 2 a ), and the common variance (σ 2 a ) is treated as unknown and p σ 2 a ∼ χ −2 σ 2 a df a , S a , with degrees of freedom and scale parameter df a and S a , respectively. The BL method assumes that the conditional prior distribution of each marker effect follows a double exponential (DE), p a j , σ 2 ε = DE a j 0, λ, σ 2 ε , where is the regularization parameter and σ 2 ε is a specified scaled inverse Chi-squared prior density p σ 2 ε ∼ χ −2 σ 2 ε df ε , S ε , with degrees of freedom df ε and scale parameter S ε (Park and Casella, 2008). Bayes A assumes that the conditional prior distribution of a marker effect a j is assumed to be Gaussian with null mean and marker-specific variance σ 2 a j , independent from each other. In this model, the variance of each marker is assumed to be distributed scaled inverse Chi-squared, with p(σ 2 a j ) = χ −2 σ 2 a j df , S 2 , where df and S 2 are known degrees of freedom and scale parameters, respectively (Pérez and de Los Campos, 2014). The Bayes B method assumes that only a few loci contribute with some genetic variance and that some genetic markers have zero effect, such that the prior distribution of the effects of all markers is given by: p a j σ 2 a j , π = 0 with probability π N(0, σ 2 a j ) with probability (1−π) where π is the proportion of markers with null genetic effects. A scaled inverse Chi-square prior distribution χ −2 (df a , S a ) is assumed for σ 2 a j (j = 1, . . . , p), which is equal for all markers (Meuwissen et al., 2001;Pérez and de Los Campos, 2014). Bayes Cπ is similar to Bayes B, in which all markers are considered to have a common variance (σ 2 a ) and promote the selection of variables. The marker effects are assumed to be a j |σ 2 a j ∼ N(0, σ 2 a ), and the inclusion of each marker in the model is modeled by an indicator variable δ j , which is equal to 1 if the marker j is fitted in the model and is 0 otherwise.
The Bayesian models for maize data were implemented in the library BGLR (Pérez and de Los Campos, 2014) of R 3.6.1. All models were run with 1,000,000 iterations, a burn-in period of 100,000, and a thin of 50. For eucalypt data, the results of the Bayesian alphabet models (implemented in BGLR package) are available in Ballesta et al. (2019), which were used in this study for comparison purposes.

Genomic Best Linear Unbiased Prediction (GBLUP) and Reproducing Kernel Hilbert Spaces (RKHS)
The GBLUP was performed using the R package sommer (Covarrubias-Pazaran, 2016). The additive and dominance relationship matrices were estimated using the A.mat (Endelman, 2011) and D.mat (Su et al., 2012) functions, respectively.
In the RKHS model (Gianola et al., 2006), the genomic relationship matrix used in GBLUP is replaced by a kernel matrix (K), which enables non-linear regression in a higherdimensional feature space. This model considers that where h is a bandwidth parameter that controls the rate of decay between pairs of markers, and x i − x i 2 is the Euclidean distance between any two pairs of genotypes i and i normalized to range from 0 to 1. The RKHS model was implemented in the library BGLR (Pérez and de Los Campos, 2014) of R 3.6.1., which was run with 1,000,000 iterations, a burn-in period of 100,000, and a thin of 50.

BAYESIAN REGULARIZED NEURAL NETWORK
The regularization process of BRNN was obtained by considering the weights (feature vector in the feature space, which represent the strength of connections between neurons) as random variables with a given prior distribution (defined below) according to Glória et al. (2016). In general, the structure of BRNN consists of three parts: (I) an input layer, which is given by genomic information of individuals (independent variables), (II) one hidden layer with n neurons that connect the input and output layers, and (III) an output layer with only one neuron that produces as output the prediction values of interest (Glória et al., 2016). The neurons allow the connection of the different layers inside a network, and the strength of the connection between neurons is called weight (Glória et al., 2016). The means of estimated weights measure the influence of the predictor variables on the response variable (learned information from training data). The posterior distribution for the weights from BRNN proposed by Gianola et al. (2011) can be accessed according to Bayes theorem: is the prior distribution for the weights vector, and P (Y | α, γ, M) is the marginal likelihood of the data; Y represents the observed data (markers genotypes matrix and the adjusted phenotypic values); w is the unknown weights vector; M denotes the architecture of the neural network used; α and γ are the regularization parameters that control the compensation between the smoothing of the network and goodness of fit Okut et al., 2011;Glória et al., 2016). Ten BRNN-based architectures were tested, in which the following hyperparameters were considered to find the optimal architecture that increases the PA of genomic prediction: activation functions (i.e., linear: purelin; log-sigmoid: logsig; tangent sigmoid: tansig) and layers number (1-3 layers). The architectures tested with one layer (and one neuron) were brnn1, brnn2, and brnn3, in which each one considered the activation function purelin, logsig, and tansig, respectively. The architectures brnn4, brnn5, brnn6, brnn7, and brnn8 considered two layers, with two neurons and one neuron in each layer, respectively. The architectures brnn4, brnn5, and brnn6 consider in both layers a repetition of the activation function purelin, logsig, and tansig, respectively. Brnn7 and brnn8 used a combination of activation functions, in which brnn7 considered the activation functions tansig and purelin, in the first and second layer, respectively, while for brnn8, the activation functions logsig (layer 1) and purelin (layer 2) were considered. The architectures brnn9 and brnn10 considered three layers, a combination of the three activation functions and two, two, and one neuron in layers 1, 2, and 3, respectively. Brnn9 used tansig, logsig, and purelin activation functions in layers 1, 2, and 3, respectively, while brnn10 used logsig (layer 1), tansig (layer 2), and purelin (layer 3) activation functions (details about architectures are shown in Supplementary Table S1). All architectures of BRNN studied were fitted through the trainbr(x) function implemented in neural networks toolbox of Matlab 2019a (Beagle et al., 2019).

Long Short-Term Memory Network (LSTM)
LSTMs are a special kind of recurrent neural network (RNN) designed to learn long-term dependencies (Pouladi et al., 2015;Pérez-Enciso and Zingaretti, 2019). These networks are one of the most popular methods of RNN for their favorable convergence properties, adding additional interactions per module (or cell) and allowing one to overcome the vanishing gradient problem, which is a difficult task in RNN (Pouladi et al., 2015;Le et al., 2019). A typical LSTM network is composed of memory blocks called cells (Sak et al., 2014;Hua et al., 2019;Le et al., 2019). These cells are a recurrently connected subnet that contains memory cell in charge of remembering the temporal state of the neural network and gates responsible for controlling the flow of information and avoid the long-term dependency problem (Figure 1).
The first step for constructing a LSTM network is to determine what kind of information is not required and will be removed from the memory cell state. This process is implemented by a sigmoid function (called forget gate), which is determined by a vector with values ranging from 0 to 1, corresponding to each number in the cell state (Le et al., 2019). This function takes the part from the old output (h t−1 ) at time t − 1, and the current input (X t ) at time t, for calculating the components that control the cell state and hidden state of the layer: where CL represents the components of the LSTM layer (input gate [i t ], forget gate [f t ], and output gate [o t ]), σ is the sigmoid function, while W and b are the weight matrices and bias (respectively). The learnable weights and bias of the LSTM layer are the input weights W (Input Weights), the recurrent weights R (Recurrent Weights; associated with h t−1 ), and the bias b (Bias). These matrices are concatenations of their weights and bias in each component, respectively (Beagle et al., 2019).
where i, f, g, and o indicate the input gate, forget gate, cell candidate, and output gate, respectively. The next step is divided into two parts: (1) deciding whether the new information should be updated or ignored (0 or 1) through a sigmoid layer (i t ), and (2) deciding the level of importance (−1 to 1) of values that passed through a hyperbolic tangent (tanh) layer (g t ). Next, these two parts are combined to trigger an update to the memory cell state, wherein this new memory is then added to old memory C t−1 (at time t − 1) resulting in C t (at time t).
The last step is to decide the output values, which is performed by multiplication between the value obtained from the output of the sigmoid layer and the new values created by the hyperbolic tangent (Tanh) layer from the cell state (C t ) (ranging between −1 and 1).
The LSTM layer, by default, uses the Tanh function to compute the state activation function. Pouladi et al. (2015) replaced the Tanh function by a Rectified Linear Units (ReLU) learning strategy, for genotype imputation and phenotype sequences prediction. This study shows that the ReLU methods have a better performance in training (less error) compared to other LSTM models and better phenotype prediction compared to the results of the sparse partial least squares method. In this model, the recurrent weight matrix is initialized to an identity matrix and the biases are set to zero (Pouladi et al., 2015).
Considering the previous observations, six LSTM architectures were tested with different activation functions (Tanh or ReLU) and subsets of the training set to evaluate the gradient of the loss function and update of the weights (mini-batch). The architectures lstm1, lstm3, and lstm5 used the ReLU activation function, while for lstm2, lstm4, and lstm6, the Tanh activation function was used. The mini-batch considered 10, 50, and 100% of the training dataset. In this sense, lstm1 and lstm2 used a mini-batch of 10%, while for lstm3 and lstm4, a mini-batch of 50% was considered. For lstm5 and lstm6, a mini-batch of 100% was used (details about architectures are shown in Supplementary Table S2). All architectures of LSTM networks were implemented in Matlab (Beagle et al., 2019) (details about scripts are shown in Supplementary Data Sheet S1). LSTM is named in other parts of the manuscript as "Deep Learning" (DL). Glória et al. (2016) were the first to empirically test the BRNN model for estimating marker effects (considering from one to three layers) through methods proposed by Dimopoulos et al. (1995) and Goh (1995), which are based on partitioning the connection weights to determine the relative importance of the SNP markers and the sensitivity of the network for each SNP, respectively. However, models of DL typically consider multiple hidden layers and different types of layers (e.g., fully connected or normalization layers) for reducing the regression errors (Khaki and Wang, 2019), which prevents the use of methods proposed by Goh and Dimopoulos. In this sense, Wang et al. (2012) proposed an estimation of SNP effects by using the vector of genotype effects (or breeding values, a), the diagonal matrix of weights for variances of SNPs (D), and a matrix relating genotypes of each locus (Z), which represent substitution effects for each marker FIGURE 1 | Diagram of a Long Short-Term Memory (LSTM) block. This block is a recurrently connected subnet that contains memory cell and gates functional modules. X t , h t−1 , and C t−1 are the inputs of the LSTM unit, which correspond to the input of the current time step, the output from the previous LSTM unit, and the memory of the previous unit, respectively. C t denotes the memory of the current unit and h t denotes the output of the current network (outputs of LSTM unit). The LSTM block is divided into three parts: gates forget (blue), update (green), and output the cell (yellow). Each part is composed of a sigmoid function (σ), which computes the gates activation function (f t : forget gate, i t : input gate, o t : output gate) from input weights (WX t ), recurrent weights (Wh t−1 ), and bias (b). The update and output parts use a hyperbolic tangent (Tanh) function to calculate the input update (g t ) and the memory of the current unit (h t ), respectively.

Estimates of SNP Effects From ML Models
locus (coding {AA, Aa, aa} as {0, 1, 2}). It is assumed that the vector of genotype effects (or breeding values) is a function of SNP effects (VanRaden, 2008;Strandén and Garrick, 2009;VanRaden et al., 2009;Misztal et al., 2012;Wang et al., 2012), such that: where u is a vector of SNP marker effects. Therefore, considering the equation of Strandén and Garrick (2009), the SNP effects is given by: u = DZ ZDZ −1 a Wang et al. (2012) created an iterative algorithm for the estimation of D fromâ and Z. In the present study, D was estimated for three iterations of this algorithm, which proceeded as follows: 1.û t = D t Z ZD t Z −1 whereâ is the genotype effects (or breeding values), D 0 is the identity matrix, when t = 0. Finally, the vector of prediction of SNP effects is given byû t . In each model (Bayesian alphabet models, GBLUP, RKHS, BRNN, and LSTM), the PA was measured as the average of Pearson correlation coefficient between observed and predicted phenotypes in the validation set. The GP methods evaluated in this study were assessed by 50 cycles of cross-validation, in which the dataset was divided randomly into two independent training (90%) and validation (10%) groups. In the dataset of Z. mays, 290 genotypes were randomly selected as training dataset at each cycle of cross-validation, and the remainder 32 genotypes were used as validation samples. Similarly, in E. globulus, each cycle of cross-validation was performed considering 581 trees (randomly selected) as training dataset and the remainder 65 trees as validation set. The Tukey-Kramer test was performed to compare the PA values for each trait among the evaluated models.

The Importance of Hyperparameters in DL and BRNN Architectures
Tables 1, 2 show the PA, obtained by cross-validation, of the hyperparameter combinations in each architecture tested for DL (LSTM) and BRNN, respectively. The lstm5 network had the highest PA values among all the DL networks tested. Moreover, lstm5 was statistically different from all models for ASI (environment CAM, maize) and WD (eucalypt). In the DL models, the use of ReLU as the activation function in the architectures lstm3 (mini-batch = 50%) and lstm5 (minibatch = 100%) was more efficient in terms of PA values. In the architectures with smaller mini-batches (10%), ReLU was the function with higher PA values in the eucalypt population, while the hyperbolic tangent (Tanh) function was better in maize ( Table 1). This result indicates that the ReLU function is more efficient in terms of PA than Tanh when the minibatch size increased. This finding may be due to the fact that the optimization and backpropagation of the error by the ReLU function is more efficient when gradient estimates are less noisy and have larger partial data (large mini-batch), whereas Tanh is more efficient when subsets of the training set are smaller (small mini-batch) and gradient estimates are noisier (Masters and Luschi, 2018;Thafar et al., 2019). On the other hand, methods with mini-batch of 100% (lstm5 and lstm6) were the most efficient in terms of the computational time required for the genomic prediction of genotype effects in the cross-validations, performing up to four times faster than the mini-batches of 10% (Supplementary Table S3).
In BRNN architectures, the brnn1 and brnn4 networks showed the best PA. In the flowering traits, the PA of the brnn4 network was the highest within Bayesian models, while that in wood-related traits, the highest PA was in brnn1. Although the PA values of the brnn1 model were slightly lower than brnn4 in flowering traits of maize, the predictions of both networks were very competitive, and these evidenced no significant differences ( Table 2). Notably, brnn1 was the most efficient in terms of the computational time required in comparison with brnn4, with differences of up to 12.4 h (Supplementary Table S4). This result is due to fact that the architectures that used one layer converged faster than architectures with two or three layers. On the other hand, the activation functions in the Bayesian architectures are related to the PA values, since the architectures with one or two layers were more efficient (in terms of PA) when the purelin activation functions were used and showed lower PA when the logsig function was used.
The above results indicated that lstm5 and brnn1 were the most efficient architectures in the prediction of the study traits, due to their high PA values and low computational time required. Therefore, these architectures were selected to compare PA values   (Pérez and de los Campos, 2013;Wang et al., 2015). In this regard, GBLUP was the computationally most efficient model, running each iteration in ∼1 min (Supplementary Tables S5, S6). However, this model The study traits were pilodyn penetration (WD), stem straightness (ST), branch quality (BQ), tree height (TH), and diameter at breast height (DBH) of eucalypt trees. Predictive ability values followed by a common letter are not significantly different according to the Tukey-Kramer test at a level of significance of 0.01. *Narrow-sense heritability. SE, standard error.
was not better than the DL method in terms of the PA for all traits.

Estimates of SNP Effects in DL Model
The SNP marker effects of the DL (lstm5) model, estimated using the iterative algorithm of Wang et al. (2012), are shown in Supplementary Data Sheet S2. The estimate of the first iteration was similar to those obtained by the BRR model, while the marker effect estimates of the second and third iterations were not similar to BRR (Supplementary Figures S1, S2), since the marker effects were re-estimated in each iteration, reducing or increasing their values (Supplementary Figure S3). As an example, Figure 2 shows the estimates of marker effects (in absolute terms) for male flowering of maize plants and stem straightness of eucalypt trees, estimated using the DL (lstm5) method (estimates of marker effects for all traits are presented in Supplementary Figures S4, S5). For ST, the marker effect estimates varied from 1.2 × 10 −8 to 4.0 × 10 −3 , with an average of 3.8 × 10 −4 , considering the first iteration of the algorithm, whereas for MF, the marker effects varied from 3.4 × 10 −7 to 3.4 × 10 −2 , with a mean of 5.9 × 10 −3 . The use of iterative algorithm  showed that the markers with small effects were reduced in the second iteration (Figure 2), while the markers with large effects were increased even more in the second and third iteration. In fact, in the second and third iteration, SNPs with large effects increased approximately four and nine times their values (respectively) in both methods with respect to the first iteration. Therefore, the results from this algorithm can be used to map and identify QTLs, an aspect highlighted by Wang et al. (2012), that is, because the algorithm increases the differences between the SNPs with high and low effects, enhancing the visual interpretation of the plots (Figure 2). Based on this approach, SNPs with the greatest effect for MF of maize (SNP effects > 0.2) and for ST of eucalypt trees (SNP effects > 0.1) were arbitrarily considered as QTL. In this sense, four SNPs-on chromosomes 1, 3, 4, and 5-were associated with MF of maize plants, whereas five SNPs-on chromosomes 4, 5, 6, and 7-were associated with ST of eucalypt trees (Supplementary Data Sheet S3). Based on the physical position of maize reference genome 1 , 147 candidate genes were identified nearby (1 Mb) the SNPs with greatest effect. For E. globulus, considering the physical position of the reference genome of E. grandis 2 , a total of 277 candidate genes were identified nearby (1 Mb) the five SNPs with major effect (Supplementary Data Sheet S3).

DISCUSSION
Over the last several years, many approaches have been proposed to increase the prediction accuracy in GP studies, such as linear models of Bayesian alphabet (Pérez and de Los Campos, 2014;De Los Campos and Pérez-Rodríguez, 2016). However, few approaches have included non-parametric   approaches and non-linear functions. In this study, two ML-based approaches that implement non-parametric methods and numerous non-linear activation functions were used Montesinos-López et al., 2019a,b;Pérez-Enciso and Zingaretti, 2019). The results of this study showed that the DL model had a higher PA than GBLUP, linear (Bayes A, Bayes B, Bayes Cπ, BRR, and BL) and non-linear (RKHS and BRNN) Bayesian regression models in the prediction of several complex traits in both breeding populations. Bellot et al. (2018) developed genomic prediction models for human complex traits using UK Biobank data and found that the DL model was more competitive than the penalized linear methods. However, the predictive ability of DL was dependent on the study phenotype. Particularly, the results of this genomic prediction showed that DLs performed comparatively better as narrow-sense heritability decreased and the contribution of dominance increased. Similarly, Zingaretti et al. (2020) implemented GP in polyploid outcrossing species (i.e., strawberry and blueberry) and found that DL did not show clear advantages over the linear models BL and BRR, except when the nonadditive effects (dominance or epistasis) were important. The authors also pointed out that the use of DL methods in GP of polyploid plants allows one to exploit its non-linearity, and it has less restrictive assumptions in comparison to traditional linear model-based methods. Moreover, polyploid plants might present higher degrees of complete and partial intra-locus interactions compared with diploid species (Zingaretti et al., 2020). These results are not in agreement with this study, in which the PA of DL did not evidence differences between the contributions of the additive or dominance effects (Table 4), since DL was the best method in the prediction (in terms of PA) of all traits. These findings may be due to the fact that Bellot et al. (2018) and Zingaretti et al. (2020) used the Convolutional Neural Networks, whereas in the present study, the LSTM method was used; however, we emphasize that other studies must be performed to corroborate this argument. Interestingly, Alves et al. (2020) compared the predictive performance of GBLUP with ANN method in simulated traits considering different levels of dominance effects. They found that ANN had a higher prediction accuracy compared with GBLUP for traits with moderate narrow-sense heritability (h 2 = 0.30) and dominance effects of 0 or 0.15. In the present study, the DL approach outperformed GBLUP despite the low dominance effect (ĥ 2 d < 0.035; Table 4). This is indicative that DL is a promising alternative tool for GP independent on the contribution of additive and/or dominance genetic effects.
Several studies have reported that the combination of hyperparameters critically influences the predictive performance of the DL model, emphasizing the need to carefully optimize hyperparameters in the ML architectures (Glória et al., 2016;Bellot et al., 2018;Pérez-Enciso and Zingaretti, 2019;Zingaretti et al., 2020). Particularly, Glória et al. (2016) found that a simple architecture of BRNN outperformed other more complex architectures (by adding layers and/or more complex activation functions) in terms of PA. This finding implies that the complexity of neural networks provides a decrease in PA, increasing the standard error of prediction. In this study, 10 BRNN architectures were tested through the combination among the number of layers (1-3) and activation functions (purelin, logsig, and tansig). Similarly, the six architectures tested in the DL method correspond to the combination of the activation functions Tanh and ReLU, with different minibatch sizes (10, 50, and 100%). A close association between the hyperparameter of activation function and the efficiency of the genomic prediction was observed in the BRRN model, as the higher PA values were observed in the brnn1 and brnn4, which had different numbers of layers (1 and 2, respectively), but the same activation functions (purelin). On the other hand, the poor performance of architectures with three layers could be due to an overfit to the training set, an aspect observed by Glória et al. (2016). However, it is worth noting that a combination of activation functions was used in the networks with three layers, and these activation functions were not repeated as in the networks with two layers. Therefore, the reduction in the PA values could be caused by the logsig and tansig functions. In spite of this, it was evidenced that brnn1 and brnn4 were very competitive, revealing that the increase in the number of layers did not present a significant increase in the PA. Moreover, the results showed that the increased number of layers was more expensive computationally (Supplementary Table S4).
The hyperparameters assessed in DL showed that the ReLU activation function was more efficient in terms of PA than the Tanh function when the mini-batch was larger. ReLU can represent a linear function and thus has the advantage of preserving the properties of linear and non-linear models, i.e., easy to optimize and backpropagate the error (Thafar et al., 2019). Furthermore, ReLU offers better performance and generalization in DL compared to the sigmoid and Tanh activation functions (Bouktif et al., 2018;Nwankpa et al., 2018). Hesamifard et al. (2017) observed that deep neural networks with the ReLU activation function had better performance in the classification of encrypted data compared to the sigmoid and Tanh functions. Similarly, Li et al. (2018) showed that the use of ReLU function greatly improves the performance over Tanh and sigmoid functions, in sequential MNIST classification and language modeling (using the character-level Penn Treebank dataset). On the other hand, a large mini-batch size was more efficient in terms of PA. Previous studies have indicated that a small mini-batch size achieves better training stability and generalization performance, while a larger mini-batch tends to have degradation in the quality of the model (Li et al., 2014;Keskar et al., 2016;Smith et al., 2017). Moreover, a smaller minibatch increases the velocity of model updates and the efficient use of memory. Masters and Luschi (2018) showed that increasing the mini-batch size provides stable convergence and acceptable test performance. Furthermore, the training samples of the minibatch are randomly drawn in every step, so the resulting gradients are less accurate and gradient estimates are noisier. In this study, the mini-batch did not affect the efficiency of genomic prediction; however, the time of processing was affected. In this sense, the increasing of mini-batch significantly reduced the training time (Supplementary Table S3), due to the fact that a large mini-batch size has fewer training processes in each epoch and takes less step to converge (Smith et al., 2017;Wang et al., 2017). Therefore, the best architecture in DL was lstm5 due to its efficiency in terms of PA and reduced computational times.
The Bayesian linear models did not show differences with BRNN in terms of PA. This was also noted by Glória et al. (2016) who did not observe differences among BRNN, BL, and RR-BLUP in the prediction of quantitative traits. On the other hand, DL had a better result than Bayesian models in terms of PA. In this sense, Liu and Wang (2017) found that DL outperformed traditional statistical models (RR-BLUP, BL, and Bayes A) in the genomic prediction of grain yield, in soybean, and stem height, in loblolly pine. In a GP study for meat tenderness, Lopes et al. (2020) found similar results to the current study, in which the DL model had higher PA than all models of the Bayesian alphabet (Bayes A, Bayes B, Bayes Cπ, BRR, and BL). Notably, in this study, ReLU was the best activation function used for training DL, because it is faster to learn than sigmoid and hyperbolic tangent functions, and it has better performance during the random grid search. The results of the present study indicated that DL can provide superior genomic predictions for quantitative traits, despite the relatively small sample sizes used (322 maize inbred lines and 646 half/full-sib progenies of eucalypt). The computational time required for the prediction analysis is also expected to be reduced in the DL method, as demonstrated in this study, since the hyperparameter selection can reduce the time of analysis and enhance the performance of genomic predictions.
From the genomic point of view, the molecular marker techniques used in this study present clear differences, which have been broadly studied and discussed (Pérez-Enciso et al., 2015;De Moraes et al., 2018). For example, De Moraes et al.
(2018) pointed out that both techniques have differences in linkage disequilibrium patterns, MAF, missing data, and marker distribution. However, the results of their study, considering 13 wood quality and growth traits of Eucalyptus trees, demonstrated that both genotyping methods are equivalent in terms of PA in the GP models RR-BLUP and Bayes B. In this regard, the missing marker data and MAF are two major quality control factors in genome-wide studies (Ali et al., 2020). These factors along with the population size may affect the PA in GP models. In this sense, in studies of genomic prediction that consider a small population size, the PA values can be limited (Edwards et al., 2019). In the present study, we assessed two breeding populations that differ in population size, genotyping method (i.e., DNA chip array and GBS), the degree of missing marker data, and MAF values. Despite these differences, our results indicated that DL-based prediction models presented high PA values in both different breeding populations, indicating that this model can help to decrease phenotypic cost within breeding programs.
The present study leveraged the iterative algorithm of Wang et al. (2012) to estimate the marker effects of the DL method, which increased the differences between the markers with high and low effects, in each iteration. Therefore, the results of this iterative algorithm can be used to map and identify QTLs. In this sense, Wang et al. (2014) showed the accuracy of the iterative algorithm for the QTL identification in 6 weeks body weight in broiler chickens. In this study, four and five SNPs were considered as QTL due to its great effect on the expression of traits in maize (MF) and eucalypt (ST), respectively, according to Wang et al. (2012). Overall, 147 and 277 candidate genes were identified nearby the SNPs with a major effect for MF and ST, respectively. In eucalypt, chromosome 4 had three candidate genes (Eucgr.D02209, Eucgr.D02250, and Eucgr.D02208) that are described as RING zinc finger protein, which has been related to ST by Bartholomé et al. (2016) and Li et al. (2016) in radiata pine and maritime pine, respectively. Li et al. (2016) also identified a protein serine/threonine kinase associated with ST, which is in accordance with our finding, since four candidate genes (Eucgr.D02135, Eucgr.G02060, Eucgr.G02065, and Eucgr.G02273) presented this same description (Supplementary Data Sheet S3). On the other hand, candidate gene GRMZM2G111491 of maize is a homolog of AT4G29380 in Arabidopsis thaliana, which encodes phosphatidylinositol 3-kinase, a protein involved in the development and germination of pollen (Xu et al., 2011). Finally, it should be noted that the advantages of using this approach in the QTL identification include the possibility of using a complex model with single or multiple traits and a computational implementation that is fast and simple (Wang et al., , 2014.

CONCLUSION
Hyperparameter optimization is a fundamental step for successfully implementing a DL model. In this sense, the results of this study suggested that architectures with the activation function ReLU and a mini-batch of large size were the most optimal for the genomic prediction of complex traits in maize and eucalypt. Furthermore, our results showed that DL had a superior performance than GBLUP, Bayes A, Bayes B, Bayes Cπ, BRR, BL, RKHS, and BRNN. On the other hand, the iterative algorithm proposed by Wang et al. (2012) was first used in marker effect estimation from a DL model. This can be seen as a new approach for GWAS through DL, since it allows the identification of the most relevant genomic regions affecting the traits of interest. The results of this study confirm the importance of DL models in genome-wide studies and crop/tree improvement, which holds promise for accelerating breeding progress.

AUTHOR CONTRIBUTIONS
FM-P, CM, CAS, and ATDAJ conceived the research plans. SA and J-TC performed the data curation. CM and FM-P analyzed the genomic data and wrote the first draft of the manuscript. FM-P, RIC-S, and CAS supervised the field experiments. FM-P, CM, SA, J-TC, and ATDAJ reviewed and edited the final version of the manuscript. All authors reviewed and approved the manuscript for publication.

FUNDING
This study was supported by the FONDECYT (grant no. 1201973) and Semillas Imperial SpA for the eucalypt experiment, and by CNPq and CAPES for the maize experiment.