Abstract
The rapid growth in genomic selection data provides unprecedented opportunities to discover and utilize complex genetic effects for improving phenotypes, but the methodology is lacking. Epistasis effects are interaction effects, and haplotype effects may contain local high-order epistasis effects. Multifactorial methods with SNP, haplotype, and epistasis effects up to the third-order are developed to investigate the contributions of global low-order and local high-order epistasis effects to the phenotypic variance and the accuracy of genomic prediction of quantitative traits. These methods include genomic best linear unbiased prediction (GBLUP) with associated reliability for individuals with and without phenotypic observations, including a computationally efficient GBLUP method for large validation populations, and genomic restricted maximum estimation (GREML) of the variance and associated heritability using a combination of EM-REML and AI-REML iterative algorithms. These methods were developed for two models, Model-I with 10 effect types and Model-II with 13 effect types, including intra- and inter-chromosome pairwise epistasis effects that replace the pairwise epistasis effects of Model-I. GREML heritability estimate and GBLUP effect estimate for each effect of an effect type are derived, except for third-order epistasis effects. The multifactorial models evaluate each effect type based on the phenotypic values adjusted for the remaining effect types and can use more effect types than separate models of SNP, haplotype, and epistasis effects, providing a methodology capability to evaluate the contributions of complex genetic effects to the phenotypic variance and prediction accuracy and to discover and utilize complex genetic effects for improving the phenotypes of quantitative traits.
Introduction
Genomic estimation of variance components and associated heritabilities and genomic prediction for quantitative traits using single nucleotide polymorphism (SNP) markers and mixed models have become a widely used approach for genetic improvement in livestock and crop species. The rapid growth in genomic selection data provides unprecedented opportunities to discover and utilize complex genetic mechanisms, but methodology and computing tools are lacking for investigating complex genetic mechanisms using the approach of genomic estimation and prediction. The integration of global low-order epistasis effects and local high-order epistasis effects contained in haplotypes for genomic estimation and prediction is a step forward for the discovery and application of complex genetic mechanisms to improve the phenotypes of quantitative traits.
The theory of genetic partition of two-locus genotypic values defines four types of epistasis values: additive × additive (A×A), additive × dominance (A×D), dominance × additive (D×A), and dominance × dominance (D×D) epistasis values (Cockerham, 1954; Kempthorne, 1954). The Cockerham method defines each epistasis coefficient as the product of the coefficients of the two interacting effects that each can be an additive or dominance effect (Cockerham, 1954). This definition of epistasis coefficient is the basis for defining epistasis model matrices in terms of the model matrices of additive and dominance effects. Cockerham also defines a pedigree epistasis relationship as the product between the pedigree additive and dominance relationships (Cockerham, 1954), and this definition is the theoretical basis for Henderson’s approach to express epistasis relationship matrices as the Hadamard products of the additive and dominance relationship matrices (Henderson, 1985). Henderson’s Hadamard products for epistasis relationship matrices were suggested for genomic prediction using epistasis effects by replacing the pedigree additive and dominance relationship matrices with the genomic additive and dominance relationship matrices calculated from SNP markers (Su et al., 2012; Muñoz et al., 2014; Vitezica et al., 2017). This genomic version of Henderson’s Hadamard products avoids the use of large epistasis model matrices that can be difficult or impossible to compute but contains intra-locus epistasis effects that are not present in the epistasis model (Martini et al., 2020). For this reason, the genomic version of Henderson’s Hadamard products could be described as approximate genomic epistasis relationship matrices (AGERM). Formulations have been developed to obtain the exact genomic epistasis relationship matrices (EGERM) that remove the intra-locus epistasis effects in AGERM by modifying Henderson’s Hadamard products without creating the epistasis model matrices (Jiang and Reif, 2015; Martini et al., 2016; Jiang and Reif, 2020; Martini et al., 2020). The difference between AGERM and EGERM tends to diminish as the number of SNPs increases (Jiang and Reif, 2020). Henderson’s Hadamard products and hence AGERM are applicable to any order of epistasis effects, and EGERM also has a general formula for any order of epistasis effects (Jiang and Reif, 2020). However, limited tests showed that fourth-order global epistasis contributed virtually nothing to the phenotypic variance but generated considerable computing difficulty (Liang et al., 2021), raising questions about the value of global epistasis effects beyond the third-order. Methods of genomic estimation and prediction of global epistasis effects up to the third-order should have wide-range applications, given that the number of reported epistasis effects lags far behind the number of single-point effects (Carlborg and Haley, 2004; Phillips, 2008; Ritchie and Van Steen, 2018) even though epistasis effects are important genetic effects (Cordell, 2002; Segre et al., 2005; Mackay, 2014). In contrast to the computing difficulty and uncertain impact of global high-order epistasis effects beyond the third-order, local high-order epistasis effects in haplotypes with potentially many SNPs were responsible for the increased accuracy of predicting phenotypic values of certain traits (Liang et al., 2020; Bian et al., 2021). The integration of haplotype and epistasis effects provides an approach to investigate the contributions of global low-order epistasis effects and local high-order epistasis effects to the phenotypic variance and the accuracy of genomic prediction under the same model.
An epistasis GWAS in Holstein cattle showed that intra- and inter-chromosome epistasis effects affected different traits differently, for example, the daughter pregnancy rate was mostly affected by inter-chromosome epistasis effects, whereas milk production traits were mostly affected by intra-chromosome epistasis effects (Prakapenka et al., 2021), and genomic heritability estimates of intra- and inter-chromosome heritabilities for the daughter pregnancy rate using methods in this article showed that inter-chromosome A×A heritability was much higher than the intra-chromosome A×A heritability (Liang et al., 2022). Therefore, dividing pairwise epistasis effects into intra- and inter-chromosome epistasis effects allows the investigation of the contributions of intra- and inter-chromosome pairwise epistasis effects to the phenotypic variance and prediction accuracy.
The purpose of the multifactorial model in this article is to integrate haplotype effects and epistasis effects up to the third-order for genomic estimation of variance components and associated heritabilities, as well as genomic prediction of genetic and phenotypic values of quantitative traits, to provide a general and flexible methodology framework for genomic prediction and estimation using complex genetic mechanisms and to provide methodology details of the EPIHAP computer package that implements the integration of haplotype and epistasis effects (Liang et al., 2021, 2022). The methodology in this article will facilitate the discovery and utilization of global low-order and local high-order epistasis effects relevant to the phenotypic variances and prediction accuracies of quantitative traits, and obtain new knowledge of complex genetic mechanisms underlying quantitative traits.
Materials and methods
Quantitative genetics model with single nucleotide polymorphism, haplotype, and epistasis effects and values
The mixed model with single-SNP additive and dominance effects, haplotype additive effects, and pairwise SNP epistasis effects in this article is based on the quantitative genetics (QG) model resulting from the genetic partition of single-SNP genotypic values (Da et al., 2014; Wang and Da, 2014), haplotype genotypic values (Da, 2015), and pairwise genotypic values (Cockerham, 1954). An advantage of this QG model is the readily available quantitative genetics interpretations of SNP additive and dominance effects, values, and variances; haplotype additive effects, values, and variances; epistasis effects, values, and variances; and the corresponding SNP, haplotype, and epistasis heritability estimates. Two QG models are developed: Model-I with 10 effect types, including SNP additive and dominance effects, haplotype additive effects, and epistasis effects up to the third-order; and Model-II with 13 effect types resulting from replacing the pairwise epistasis effects of Model-I with intra- and inter-chromosome epistasis effects. Detailed descriptions of the effects, values, model matrices, the coding of the model matrices, as well as the precise definition of each term in the two QG models, are provided in Supplementary Text S1 and Supplementary Table S1. With these precise definitions of genetic effects, values, and model matrices in the QG models, a concise multifactorial QG model covering both Model-I and Model-II can be established, that iswhere genetic effects of the effect type from the original QG model based on genetic partition, model matrix of , genetic values of the effect type from the original QG model, and f = number of effect types. For Model-I, subscripts represent SNP additive (A), SNP dominance (D), haplotype additive, A×A, A×D, D×D, A×A×A, A×A×D, A×D×D, and D×D×D effects sequentially. For Model-II, subscripts represent SNP additive, SNP dominance, haplotype additive, intra-chromosome A×A, intra-chromosome A×D, intra-chromosome D×D, inter-chromosome A×A, inter-chromosome A×D, inter-chromosome D×D, A×A×A, A×A×D, A×D×D, and D×D×D effects sequentially. The variance–covariance matrix of the genetic values of Eqs 1 and 2 iswhere genetic variance of the effect type under the original QG model is common to all individuals (all j values). It is of note that is not a genomic relationship matrix but is the primary information for calculating each genomic relationship matrix. The structure of the G matrix of Eqn. 3 assumes independence between the genetic values of different effect types. However, the GBLUP values of different effect types using the G matrix of Eqn. 3 could be correlated. Under the Hardy–Weinberg equilibrium (HWE) and LE assumptions, additive, dominance, and epistasis effects are independent of each other (Cockerham, 1954; Kempthorne, 1954). For genome-wide SNPs, the LE assumption generally does not hold for closely linked loci, and nonzero Hardy–Weinberg disequilibrium (HWD) may exist numerically. These and other unknown factors in real data may result in the existence of correlations between different effect types. Haplotype additive values are correlated with SNP additive effects because a haplotype additive value is the sum of all SNP additive values and an epistasis value within the haplotype block plus a potential haplotype loss (Da et al., 2016). In two recent haplotype studies for genomic prediction, the integration of SNP and haplotype effects increased the prediction accuracy for four of the seven traits in the human study (Liang et al., 2020) and for three of the eight traits in the swine study (Bian et al., 2021), showing that SNP and haplotype additive values compensated each other for prediction accuracy and that the correlation between SNP and haplotype additive values were incomplete for those traits. The correlation between haplotype and epistasis values can be complex. The correlation should be nonexistent if the A×A values are inter-chromosome A×A values or intra-chromosome A×A values involving distal SNPs not covered by the haplotypes, but the correlation could be strong if the A×A values are intra-chromosome A×A values involving proximal SNPs covered by the haplotypes.
The reparametrized and equivalent quantitative genetics model for genomic estimation and prediction
Genomic relationship matrices are used for genomic estimation and prediction, and the use of genomic relationship matrices results in a reparametrized and equivalent model of the original QG model for genetic values, to be referred to as the RE-QG model, where “reparametrized” refers to the reparameterization of the genetic effects, model matrix, and genetic variance of each effect type; and “equivalent” refers to the requirement of the same first and second moments for the original QG model (Eqs 1–5) and the RE-QG model. This RE-QG model of genetic values can be expressed aswhere= variance of the genetic effects of the effect type common to all individuals= average variance of all individuals for the genetic values of the effect type= variance–covariance matrix of the genetic values of the effect type
Equations 8–10 are the reparametrization of the genetic effects, model matrices, and genetic variances of the original QG model, whereas Eqs 11 and 12 show the genetic values and the variance–covariance matrix of the genetic values are the same under the RE-QG and QG models. In Eq.10, = the genetic variance of the individual for the effect type = the diagonal element of the matrix defined by Eq. 12. The formula of Eq. 14 as the average of the diagonal elements of was originally proposed for genomic additive relationships (Hayes and Goddard, 2010) and was used for genomic dominance relationships (Da et al., 2014; Wang and Da, 2014), haplotype additive genomic relationships (Da, 2015), and pairwise epistasis genomic relationships (Vitezica et al., 2017). The need for this RE-QG model is due to the use of the genomic relationship matrices (e.g., Eq. 13) because the QG model does not contain genomic relationship matrices (Eq. 3). Detailed notations of the QG model of Eqs 1–5 in reference to the RE-QG model described by Eqs 6–14 are summarized in Supplementary Table S1.
The formula of the genomic relationship matrix ( of Eq. 13) is based on the model matrix of each effect type and can be difficult or impossible to compute if epistasis model matrices are used. This computing difficulty of epistasis model matrices is removed by calculating based on the model matrices of SNP additive and dominance effects without creating the epistasis model matrices using either AGERM or EGERM. AGERM refers to the genomic version of Henderson’s Hadamard products between pedigree additive and dominance relationship matrices (Henderson, 1985), with the pedigree additive and dominance relationship matrices replaced by the genomic additive and dominance relationship matrices (Su et al., 2012; Muñoz et al., 2014; Vitezica et al., 2017). AGERM contains intra-locus epistasis that should not exist (Martini et al., 2020), and EGERM removes intra-locus epistasis from AGERM based on products between genomic additive and dominance relationship matrices (Jiang and Reif, 2020; Martini et al., 2020).
The QG and RE-QG models have the same prediction accuracy due to the equivalence between these two models. The genetic values (, Eqs 2, 11) and the variance–covariance matrix of the genetic values (, Equations 5 and 12) under the QG and RE-QG models are identical, although and have different expressions under the QG and RE-QG models. Consequently, the QG model without using genomic relationship matrices and the RE-QG model using genomic relationship matrices have identical accuracy of genomic prediction. The choice of the formula for defining the genomic relationship matrix does not affect the accuracy of genomic prediction but affects the interpretation and application of the genetic variance and genomic relationships for each effect type. Since the interpretation of each genetic variance is a focus, whereas the interpretation of the genomic relationships is not a focus in this study, the interpretation of the genetic variance and associated heritability is the consideration in choosing the formula of Eq.14.
The RE-QG model using genomic relationships (Equations 6–14) has two major advantages over the QG model without using genomic relationship matrices (Equations 1–5), although the two models have the same prediction accuracy. First, the use of genomic relationships, originally proposed for genomic additive relationships (VanRaden, 2008), provides a genomic version of the traditional theory and methods of best linear unbiased prediction (BLUP) that uses pedigree relationships, and this genomic version can utilize a wealth of BLUP-based theory, methods, and computing strategies. Second, the genetic variance of the genetic effects of each effect type under the RE-QG model can be used for estimating genomic heritability, whereas the genetic variance of the genetic effects under the QG model cannot be used for estimating genomic heritability. With the value defined by Eq. 14, the variance of the genetic effects of the effect type, (Eq.10), has the unique interpretation as the average variance of the genotypic values of all individuals and is a common variance to all individuals. Moreover, is unaffected by the number of levels for each effect type, unless the number of levels such as the number of SNPs is too small to provide sufficient coverage of the genome (Da et al., 2014; Tan et al., 2017; Liang et al., 2020). In contrast, the QG model does not have a method to estimate genetic variance components for calculating genomic heritabilities because is an inverse function of the number of effect levels. As the number of effect levels such as the number of SNPs increases or decreases, the value of each element in changes in the same direction and the estimate changes in the opposite direction, that is, as the number of effect levels increases or decreases, decreases or increases. Consequently, the estimate does not have a unique interpretation and cannot be used for estimating genomic heritability (Da et al., 2014). Moreover, the variance of the genetic value of an individual cannot be used for calculating genomic heritability because of the individual specificity of the values, as shown as follows.
The exact relationship between the genetic variance for the effect type of the individual under the RE-QG model and the QG model can be described based on the matrix defined by Eq. 12:where the diagonal element of the matrix defined by Eq.12 = the genetic variance of the individual for the genotypic value of the effect type, and = the element of defined by Eq.11. Equation 15 shows that different individuals do not have a common variance of the genetic values unless all diagonal elements of or are identical, which could not happen with genome-wide SNP data in the absence of identical twins because genome-wide SNPs have a high degree of individual specificity. Consequently, is not a common variance to all individuals and cannot be used for calculating the genomic heritability of the effect type. In contrast, of Eq.10 under the RE-QG model as the average variance of the genotypic values of all individuals is common to all individuals and can be used for calculating the heritability of each effect type. For the example of Model-I, the exact genetic interpretation of is the variance of the genomic additive (breeding) value of the individual for , the variance of the genomic dominance value of the individual for , the variance of the genomic haplotype additive value of the individual for , the variance of the A×A value of the individual for , the variance of the A×D value of the individual for , the variance of the D×D value of the individual for , the variance of the A×A×A value of the individual for , the variance of the A×A×D value of the individual for , the variance of the A×D×D value of the individual for , and the variance of the D×D×D value of the individual for . These genetic interpretations, along with those for intra- and inter-chromosome pairwise epistasis effects of Model-II under the QG and RE-QG models, are summarized in Supplementary Table S1.
Results and discussion
The multifactorial model of phenotypic values
Based on the RE-QG model of Eqs 6–14, the multifactorial model for phenotypic values iswhere y = N×1 column vector of phenotypic observations, Z = incidence matrix allocating phenotypic observations to each individual = identity matrix for one observation per individual (N = n), N = number of observations, n = number of individuals, column vector of fixed effects such as heard-year-season in dairy cattle, c = number of fixed effects, model matrix, b,e = column vector of random residuals, = residual variance, and (Eq. 7). The phenotypic values are assumed to follow a normal distribution with mean and variance–covariance matrix of V. The methods described below for genomic estimation and prediction are based on the conditional expectation (CE) method, which is more efficient computationally than the methods based on mixed-model equations (MME) when the number of genetic effects is greater than the number of individuals (Da et al., 2014; Da, 2015).
For Model-I with 10 effect types, the genomic epistasis relationship matrices can be calculated using either EGERM or AGERM. However, EGERM or AGERM did not consider intra- and inter-chromosome genomic epistasis relationship matrices that are required by Model-II with 13 effect types. This research derives intra- and inter-chromosome genomic epistasis relationship matrices for both EGERM and AGERM.
Intra- and inter-chromosome genomic epistasis relationship matrices
The main derivation of the intra- and inter-chromosome genomic epistasis relationship matrices is the partition of the numerator of a genomic epistasis relationship matrix into intra- and inter-chromosome numerators. The first step is to derive the intra-chromosome numerator, and the second step is to derive the inter-chromosome numerator as the difference between the whole-genome numerator and the intra-chromosome numerator. The last step is to divide the intra-chromosome numerator by the average of the diagonal elements of the intra-chromosome numerator and to divide the inter-chromosome numerator by the average of the diagonal elements of the inter-chromosome numerator. Using this procedure, intra- and inter-chromosome epistasis relationship matrices were derived for both EGERM and AGERM (Supplementary Text S1).
Genomic best linear unbiased prediction and reliability
Based on the multifactorial genetic model of Eqs 16 and 17, the GBLUP of the genetic values of the effect type () and the best linear unbiased estimator (BLUE) or generalized least squares (GLS) estimator of fixed effect () arewhere . The GBLUP of total genetic values of the n individuals is the summation of all types of genetic values:
Reliability of GBLUP is the squared correlation between the GBLUP of a type of genetic value and the unobservable true genetic value being predicted by the GBLUP. The expected accuracy of predicting the genetic values by the GBLUP is the square root of reliability or the correlation between the GBLUP of a type of genetic effect and the unobservable true genetic effects being predicted by the GBLUP. In the absence of validation studies for observed prediction accuracy, reliability or the expected prediction accuracy is the measure of prediction accuracy of the GBLUP. The reliability of the GBLUP of the total genetic value (Eq. 2) of the individual iswhere (Eq. 7), , and subscript or superscript jj denotes the diagonal element. The reliability formula for any or a combination of genetic values can be readily derived from Eq. 21, for example, the reliability of (GBLUP of haplotype additive values) is obtained from Eq. 21 by deleting all terms except in the numerator and in the denominator, with changes in the V and P matrices accordingly.
Calculation of genomic best linear unbiased prediction and reliability for individuals with and without phenotypic observations separately
Two strategies are available for calculating GBLUP and the reliability of Eqs 20 and 21. Strategy-1 is a one-step strategy that includes all individuals with and without phenotypic observations in the same system of equations so that GBLUP and reliability are calculated simultaneously for all individuals. This strategy essentially augments the mixed model for individuals with phenotypic observations with a set of null equations consisting of “0”s but uses each genomic relationship matrix for all individuals, and these null equations and the use of the relationship matrix for all individuals do not affect the GBLUP, reliability, and heritability of individuals with phenotypic observations. The advantage of this one-step strategy is the simplicity of data preparation. For example, for a k-fold cross validation study, the phenotypic input file only needs to have k columns of the trait observations, with one column for each validation where the phenotypic observations for the validation individuals are set as “missing,” and the X and Z model matrices for the “missing” observations are set to zero. With this strategy, the genotypic data need to be processed only once. As the number of traits increases for validation studies, this one-step strategy becomes more appealing due to the savings in data preparation work. This strategy has been implemented in our computing tools of GVCBLUP (Wang et al., 2014), GVCHAP (Prakapenka et al., 2020), and EPIHAP (Liang et al., 2021, 2022). However, when the number of validation individuals or individuals without phenotypic values is large, each genomic relationship matrix ( matrix) is large, and the one-step strategy becomes more difficult as the number of individuals increases.
For large numbers of individuals without phenotypic observations, calculating GBLUP for individuals with and without phenotypic values separately is more efficient computationally than calculating GBLUP for all individuals in the same system of equations by applying Henderson’s BLUP for animals without phenotypic observations (Henderson, 1977) to GBLUP. Let = number of individuals with phenotypic observations, = number of individuals without phenotypic observations, , and let the matrix be partitioned aswhere genomic relationship matrix of the genetic values of the effect type for individuals with phenotypic observations, genomic relationship matrix of the genetic values of the effect type between individuals without phenotypic observations and individuals with phenotypic observations, genomic relationship matrix between individuals with phenotypic observations and individuals without phenotypic observations, and = genomic relationship matrix of the genetic values of the effect type for individuals without phenotypic observations. In Eqs 16 and 17, , the Z matrix needs to be changed to , the vector partitioned as , and the g vector partitioned as , where incidence matrix allocating phenotypic observations to individuals with phenotypic observations, incidence matrix with elements “0” connecting phenotypic observations to individuals without phenotypic observations. With these changes and Eq. 22, the V matrix of Eq. (17) can be re-written asand the GBLUP and reliability for individuals with and without phenotypic observations can be calculated aswhere column vector of the GBLUP of the genetic values of the effect type for individuals with phenotypic observations, column vector of the GBLUP of the total genetic values for individuals with phenotypic observations, reliability for the individuals with phenotypic observations, column vector of the GBLUP of the genetic values of the effect type for individuals without phenotypic observations, column vector of the GBLUP of the total genetic values for individuals without phenotypic observations, reliability for the individuals without phenotypic observations, , , , , and .
Equations 27 and 28 yield identical results if exists. However, when the number of individuals is greater than the number of effect levels, such as the number of SNPs, in Eq. 28 does not exist, and Eq. 27 still can calculate the GBLUP. The usefulness of Eq. 28 is that it shows the GBLUP of individuals without phenotypic observations is the regression of the genetic values of individuals without phenotypic observations on the genetic values of individuals with phenotypic observations. The advantage of Eq. 27 is that it does not calculate and hence is unaffected by the singularity of . Therefore, Eq. 27 is recommended for calculating GBLUP for individuals without phenotypic observation when the number of such individuals is large. The GBLUP calculations of Eqs 24, 27, and 28 do not involve the genomic relationship matrix among individuals without phenotypic observations , which is much larger than when is much larger than . The reliability calculation for individuals without phenotypic observations (Eq. 30) only uses the diagonal elements of and not the entire .
Advantage of the integrated model over separate models
The multifactorial model of Eqs 16 and 17 integrating SNP, haplotype, and epistasis effects has the advantage of using more effect types and assessing each effect type based on the phenotypic values adjusted for all remaining effect types over separate models for SNP, haplotype, and epistasis effects that do not have a mechanism to adjust for effect types not in the model, and each uses a smaller number of genetic effects in the model.
This advantage of the multifactorial model assessing each effect type based on the phenotypic values adjusted for all remaining effect types can be shown using the MME version of the GBLUP for the effect type: where = phenotypic observations adjusted for the fixed effects and all random genetic values except those of , = phenotypic observations adjusted for all random genetic values, and is a generalized inverse of . Eq. 31 shows the MME version of uses the phenotypic values adjusted for the GBLUP of all other effect types in the model. Since the MME version of (Eq. 31) and (Eq. 32) are identical to the CE version of (Eq. 18) and (Eq. 19), the CE version of (Eq. 18) uses the phenotypic values adjusted for the GBLUP of all other effect types in the model even though the CE version does not do such adjustments explicitly.
Genomic restricted maximum estimation (GREML) of variances and heritabilities
The estimation of variance components uses GREML and a combination of EM-REML and AI-REML algorithms of iterative solutions. EM-REML is slow but converges, whereas AI-REML is fast but fails for zero heritability estimates. In our GVCBLUP, GVCHAP, and EPIHAP computing packages that implement these two algorithms (Wang et al., 2014; Prakapenka et al., 2020; Liang et al., 2021), EM-REML is used automatically when AI-REML fails. The EM-REML iterative algorithm for the multifactorial model of Eqs 16 and 17 iswhere j = iteration number. The AI-REML iterative algorithm is an extension of the early formulations (Johnson and Thompson, 1995; Lee and van der Werf, 2006) to the multifactorial model of Eqs 16 and 17:where = column vector of variance–covariance components, = residual variance, = column vector of the partial derivatives of the log residual likelihood function with respect to each variance component, and j = iteration number. A typical term in () and a typical term in AI () arewhere . For the full Model-I or Model-II, some effect types inevitably may have zero variances. In those cases, AI-REML (Equations 35–37) fails and EM-REML (Equations 33 and 34) still converges, although a slow convergence rate can be expected for the full Model-I or Model-II. Once the effect types with zero variances are removed from the model, AI-REML converges, and a fast convergence rate can be expected. The estimate of the genomic heritability for each type of genetic effects and the total heritability of all types of genetic effects arewhere phenotypic variance.
The heritability estimates of Eq. 38 can be used for model selection by removing effect types with heritability estimates below a user-determined threshold value from the prediction model. Since different traits may have different genetic architectures, we hypothesize that some traits may involve only a small number of the effect types, and some traits are more complex and involve more effect types; global epistasis may be more important than local high-order epistasis effects of haplotypes for some traits, whereas the reverse may be true for other traits, and some traits may be affected by both global high-order and local high-order epistasis effects. The heritability estimates from Eq. 37 provide an approach to evaluate these hypotheses and identify effect types relevant to the phenotypic variance, whereas the total heritability of Eq. 38 provides an estimate of the total genetic contribution to the phenotypic variance. In addition to the use of heritability estimates, prediction accuracy based on GBLUP can be used for model selection by requiring a threshold accuracy level for the effect type to be included in the prediction model, for example, we identified the A + A×A model to have the same accuracy of predicting the phenotypic values of daughter pregnancy rate as the full Model-I in U.S. Holstein cows (Liang et al., 2022).
Estimation of pairwise epistasis effect and heritability
The heritability of an SNP, haplotype block, or pairwise epistasis effect is the contribution of the genetic effect to the phenotypic variance and is also the contribution to the heritability of the effect type, and is estimated through the GBLUP of the corresponding genetic effects. These heritability estimates can be used to identify genome locations with large contributions to the phenotypic variance. The estimation of pairwise epistasis effects and heritability is the most demanding computation because the pairwise epistasis model matrices must be creased and are no longer avoidable. Estimating the effects and heritabilities for third-order epistasis effects is computationally unfeasible and is not considered. GBLUPs of SNP, haplotype, and pairwise epistasis effects of Model-I (Supplementary Table S1) are calculated aswhere is the column vector of SNP additive effects for , or SNP dominance effects for ; or column vector of haplotype additive effects for ; or column vector of A×A epistasis effects for , or column vector of A×D epistasis effects for , or column vector of D×D epistasis effects for . For ; the order of A×D and D×A effects is determined by the order of the model matrices of those effects, that is, if , or if . The heritability of the effect of the effect type () is estimated as a faction of the genomic heritability of the effect type ():where the effect of estimated variance of the effect type, estimated variance of the effect of the effect type, and = genomic heritability of the effect type defined by Equation (52). For proving Equation 57, and can be formulated based on the method of mixed-model equations (MME):where is the submatrix in the inverse or generalized inverse of the coefficient matrix of the MME corresponding to the effect type, = number of effects of the effect type, and . Dividing Eq. 43 by and multiplying by yield Eq. 41:
It is readily seen that the sum of all heritability estimates of the effect type is the genomic heritability of the effect type: . It is of note that Eqs 42 and 43 using MME are only for proving Eq. 41. The MME method is computationally prohibitive for estimating genetic effects and their variances under the multifactorial model, although the MME method yields results identical to the CE method, which is computationally feasible for genomic estimation and prediction under the multifactorial model.
Comparison between exact and approximate genomic epistasis relationship matrices
We evaluated the differences between AGERM and EGERM in genomic heritability estimates and prediction accuracies using a publicly available swine genomics data set that had 3,534 animals from a single PIC nucleus pig line with five anonymous traits and 52,842 genotyped and imputed autosome SNPs after filtering by requiring minor allele frequency (MAF) > 0.001 and proportion of missing SNP genotypes < 0.100 (Cleveland et al., 2012). The EGERM followed the method used by Jiang and Reif (2020), and the AGERM methods are described in Supplementary Text S1. The heritability results showed that EGERM had slightly higher heritability estimates than AGERM except for the A×A heritability of T3, where AGERM had a slightly higher estimate than EGERM (0.280 vs. 0.278, Table 1). From Table 1, effect type with nonzero heritability estimates was included in the prediction model for evaluating the observed prediction accuracy as the correlation between the GBLUP of genotypic values and the phenotypic values in each validation population and then averaged over all 10 validation populations. The results showed that AGERM and EGERM had the same prediction accuracy for this swine sample (Table 2). A disadvantage of EGERM is the computing time for the construction of EGERM, about 9.51 times as much time for pairwise relationship matrices, 8.29 as much time for third-order and 9.44 times as much time for fourth-order as required for AGERM (Table 3). However, computing time is not the deciding factor for choosing between the exact and approximate methods because the multi-node approach that calculates each genomic relationship matrix in pieces and adds those pieces together can reduce the computing time to an acceptable level when multiple threads/cores are available, and the two-step strategy can be used so that each genomic relationship is calculated only once for different traits and validation populations (Prakapenka et al., 2020). Prediction accuracy is the ultimate deciding factor in choosing between different methods. We reported results of comparing AGERM and EGERM using 60,671 SNPs and 22,022 first-lactation Holstein cows with phenotypic observations of daughter pregnancy rates, showing that AGERM and EGERM had the same heritability estimates and prediction accuracy, but EGERM required 21 times as much computing time as that required by AGERM, which required 1.32 times as much time for the genomic additive relationship matrix (Liang et al., 2022). The combined results of the swine and Holstein samples indicated that EGERM and AGERM had similar results and that the computing difficulty of EGERM over AGERM increased rapidly as the sample size increased. Given the computing difficulty of EGERM and the negligible differences between EGERM and AGERM in prediction accuracy, AGERM should be favored for its mathematical simplicity and computing efficiency, at least for samples with 50,000 SNPs or more.
TABLE 1
| Trait | |||||
|---|---|---|---|---|---|
| T1 | T2 | T3 | T4 | T5 | |
| Effect | Exact genomic epistasis relationship matrices (EGERM) | ||||
| A | 0.023 | 0.217 | 0.131 | 0.336 | 0.366 |
| D | 0.000 | 0.013 | 0.000 | 0.000 | 0.052 |
| A×A | 0.046 | 0.186 | 0.278 | 0.017 | 0.054 |
| A×D | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 |
| D×D | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 |
| A×A×A | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| A×A×D | 0.000 | 0.000 | 0.079 | 0.000 | 0.000 |
| A×D×D | 0.000 | 0.000 | 0.102 | 0.000 | 0.000 |
| D×D×D | 0.000 | 0.000 | 0.117 | 0.000 | 0.000 |
| Total heritability | 0.069 | 0.416 | 0.889 | 0.354 | 0.471 |
| Effect | Approximate genomic epistasis relationship matrices (AGERM) | ||||
| A | 0.022 | 0.215 | 0.139 | 0.329 | 0.360 |
| D | 0.000 | 0.013 | 0.000 | 0.000 | 0.051 |
| A×A | 0.043 | 0.176 | 0.280 | 0.016 | 0.050 |
| A×D | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 |
| D×D | 0.000 | 0.000 | 0.090 | 0.000 | 0.000 |
| A×A×A | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| A×A×D | 0.000 | 0.000 | 0.075 | 0.000 | 0.000 |
| A×D×D | 0.000 | 0.000 | 0.095 | 0.000 | 0.000 |
| D×D×D | 0.000 | 0.000 | 0.109 | 0.000 | 0.000 |
| Total heritability | 0.065 | 0.404 | 0.879 | 0.346 | 0.461 |
Genomic heritability estimates of additive, dominance, and epistasis effects up to the third-order for five traits in a swine population.
TABLE 2
| Trait | |||||
|---|---|---|---|---|---|
| T1 | T2 | T3 | T4 | T5 | |
| Prediction accuracy of SNP model | |||||
| A | 0.066 | 0.495 | 0.326 | 0.468 | 0.493 |
| A + D | 0.056 | 0.495 | 0.326 | 0.468 | 0.496 |
| Epistasis model | A + AA | A + D + AA | A + AA + AD + DD+ | ||
| AAD + ADD + DDD | A + AA | A + D + AA | |||
| EGERM | |||||
| Prediction accuracy | 0.063 | 0.498 | 0.336 | 0.468 | 0.497 |
| Accuracy increase (%) | −4.545 | 0.606 | 3.067 | 0.000 | 0.202 |
| AGERM | |||||
| Prediction accuracy | 0.063 | 0.498 | 0.336 | 0.468 | 0.497 |
| Accuracy increase (%) | −4.545 | 0.606 | 3.067 | 0.000 | 0.202 |
Observed prediction accuracy of epistasis models relative to the additive model for five traits in a swine population.
“Prediction accuracy” is the observed prediction accuracy calculated as the correlation between the GBLUP of genotypic values and the phenotypic values in each validation population and then averaged over all 10 validation populations. “Accuracy increase” is the percentage increase of the observed prediction accuracy of the epistasis model over the observed prediction accuracy of the best SNP model, which was the additive model (A) for T1–T4 and the A + D model for T5. A = additive effects, D = dominance effects, AA = A×A effects, AD = A×D effects, DD = D×D effects, AAA = A×A×A effects, AAD = A×A×D effects, ADD = A×D×D dominance effects, and DDD = D×D×D dominance effects.
TABLE 3
| Genomic epistasis relationship matrices | Pairwise | Third-order | Fourth-order |
|---|---|---|---|
| EGERM | 666 | 796 | 1,256 |
| AGERM | 70 | 96 | 133 |
| EGERM/AGERM | 9.51 | 8.29 | 9.44 |
Computing time (in seconds) for the construction of exact and approximate genomic epistasis relationship matrices for a swine population with 3,534 pigs and 52,843 SNPs using 20 threads of the Mangi supercomputer of the Minnesota Supercomputer Institute at the University of Minnesota.
Numerical demonstration
The methods of genomic epistasis relationship matrices based on the additive and dominance model matrices, GREML, GBLUP and reliability, and estimation of effect heritability are demonstrated using an R program (DEMO.R) and a small artificial sample for the convenience of reading the numerical results (Supplementary Text S2 and R program). Because of the artificial nature and the extremely small sample size, this numerical demonstration does not have any genetic and methodology implications and is for showing calculations of the methods only. This R program is an extension of the R demo program of GVCHAP that integrates SNP and haplotype effects and has a computing pipeline for producing the input haplotype data from the SNP data (Prakapenka et al., 2020).
Conclusion
The multifactorial methods with SNP, haplotype, and epistasis effects up to the third-order provide an approach to investigate the contributions of global low-order and local high-order epistasis effects to the phenotypic variance and the accuracy of genomic prediction. Genomic heritability of each effect type from GREML and prediction accuracy from validation studies using GBLUP can be used jointly to identify effect types contributing to the phenotypic variance and the accuracy of genomic prediction, and the GBLUP for the multifactorial model with selected effect type can be used for genomic evaluation. With many capabilities, including the use of intra- and inter-chromosome separately, the multifactorial methods offer a significant methodology capability to investigate and utilize complex genetic mechanisms for genomic prediction and for understanding the complex genome–phenome relationships.
Statements
Data availability statement
Publicly available datasets were analyzed in this study. These data can be found at: https://academic.oup.com/g3journal/article/2/4/429/6026060#supplementary-data.
Author contributions
YD conceived this study and derived the formulations. ZL contributed to formulations of the epistasis genomic relationships, implemented the epistasis methods in EPIHAP, and validated and evaluated the methods. DP contributed to the data processing for methodology evaluation. YD and ZL prepared the manuscript.
Funding
This research was supported by the National Institutes of Health’s National Human Genome Research Institute, grant R01HG012425, as part of the NSF/NIH Enabling Discovery through Genomics (EDGE) Program; grant 2020-67015-31133 from the USDA National Institute of Food and Agriculture; and project MIN-16-124 of the Agricultural Experiment Station at the University of Minnesota. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The use of the Ceres and Atlas computers in this research was supported by USDA-ARS projects 8042-31000-002-00-D and 8042-31000-001-00-D.
Acknowledgments
The supercomputer of Minnesota Supercomputer Institute at the University of Minnesota and the Ceres and Atlas high-performance computing system of USDA-ARS were used for the evaluation and testing of the methods and EPIHAP computing package. The authors thank Paul VanRaden, Steven Schroeder, and Ransom Baldwin for help with the use of the USDA-ARS computing facilities.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2022.922369/full#supplementary-material
References
1
BianC.PrakapenkaD.TanC.YangR.ZhuD.GuoX.et al (2021). Haplotype genomic prediction of phenotypic values based on chromosome distance and gene boundaries using low-coverage sequencing in Duroc pigs. Genet. Sel. Evol.53 (1), 78–19. 10.1186/s12711-021-00661-y
2
CarlborgÖ.HaleyC. S. (2004). Epistasis: Too often neglected in complex trait studies?Nat. Rev. Genet.5 (8), 618–625. 10.1038/nrg1407
3
ClevelandM. A.HickeyJ. M.ForniS. (2012). A common dataset for genomic analysis of livestock populations. G32 (4), 429–435. 10.1534/g3.111.001453
4
CockerhamC. C. (1954). An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics39 (6), 859–882. 10.1093/genetics/39.6.859
5
CordellH. J. (2002). Epistasis: What it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet.11 (20), 2463–2468. 10.1093/hmg/11.20.2463
6
DaY. (2015). Multi-allelic haplotype model based on genetic partition for genomic prediction and variance component estimation using SNP markers. BMC Genet.16 (1), 144. 10.1186/s12863-015-0301-1
7
DaY.TanC.ParakapenkaD. (2016). 0336 Joint SNP-haplotype analysis for genomic selection based on the invariance property of GBLUP and GREML to duplicate SNPs. J. Animal Sci.94 (5), 161–162. 10.2527/jam2016-0336
8
DaY.WangC.WangS.HuG. (2014). Mixed model methods for genomic prediction and variance component estimation of additive and dominance effects using SNP markers. PLoS One9 (1), e87666. 10.1371/journal.pone.0087666
9
HayesB.GoddardM. (2010). Genome-wide association and genomic selection in animal breeding. Genome53 (11), 876–883. 10.1139/G10-076
10
HendersonC. (1977). Best linear unbiased prediction of breeding values not in the model for records. J. Dairy Sci.60 (5), 783–787. 10.3168/jds.s0022-0302(77)83935-0
11
HendersonC. (1985). Best linear unbiased prediction of nonadditive genetic merits in noninbred populations. J. Animal Sci.60 (1), 111–117. 10.2527/jas1985.601111x
12
JiangY.ReifJ. C. (2020). Efficient algorithms for calculating epistatic genomic relationship matrices. Genetics216 (3), 651–669. 10.1534/genetics.120.303459
13
JiangY.ReifJ. C. (2015). Modeling epistasis in genomic selection. Genetics201 (2), 759–768. 10.1534/genetics.115.177907
14
JohnsonD.ThompsonR. (1995). Restricted maximum likelihood estimation of variance components for univariate animal models using sparse matrix techniques and average information. J. Dairy Sci.78 (2), 449–456. 10.3168/jds.s0022-0302(95)76654-1
15
KempthorneO. (1954). The correlation between relatives in a random mating population. Proc. R. Soc. Lond. B Biol. Sci.143 (910), 102–113.
16
LeeS. H.van der WerfJ. H. (2006). An efficient variance component approach implementing an average information REML suitable for combined LD and linkage mapping with a general complex pedigree. Genet. Sel. Evol.38 (1), 25–43. 10.1186/1297-9686-38-1-25
17
LiangZ.PrakapenkaD.DaY. (2022). Comparison of two methods of genomic epistasis relationship matrices using daughter pregnancy rate in U.S. Holstein cattle. Abstract 2466V, page 409 of ADSA2022 Abstracts. Available at: https://www.adsa.org/Portals/0/SiteContent/Docs/Meetings/2022ADSA/Abstracts_BOOK_2022.pdf?v=20220613 (Accessed September 27, 2022).
18
LiangZ.PrakapenkaD.DaY. (2021). Epihap: A computing tool for genomic estimation and prediction using global epistasis effects and haplotype effects. Abstract P167, page 223 of ADSA2021 Abstracts, ADSA 2021 Virtual Annual Meeting. Available at: https://www.adsa.org/Portals/0/SiteContent/Docs/Meetings/2021ADSA/ADSA2021_Abstracts.pdf (Accessed September 27, 2022).
19
LiangZ.TanC.PrakapenkaD.MaL.DaY. (2020). Haplotype analysis of genomic prediction using structural and functional genomic information for seven human phenotypes. Front. Genet.11 (1461), 588907. 10.3389/fgene.2020.588907
20
MackayT. F. (2014). Epistasis and quantitative traits: Using model organisms to study gene–gene interactions. Nat. Rev. Genet.15 (1), 22–33. 10.1038/nrg3627
21
MartiniJ. W.ToledoF. H.CrossaJ. (2020). On the approximation of interaction effect models by Hadamard powers of the additive genomic relationship. Theor. Popul. Biol.132, 16–23. 10.1016/j.tpb.2020.01.004
22
MartiniJ. W.WimmerV.ErbeM.SimianerH. (2016). Epistasis and covariance: How gene interaction translates into genomic relationship. Theor. Appl. Genet.129 (5), 963–976. 10.1007/s00122-016-2675-5
23
MuñozP. R.ResendeM. F.GezanS. A.ResendeM. D. V.de los CamposG.KirstM.et al (2014). Unraveling additive from nonadditive effects using genomic relationship matrices. Genetics198 (4), 1759–1768. 10.1534/genetics.114.171322
24
PhillipsP. C. (2008). Epistasis—The essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet.9 (11), 855–867. 10.1038/nrg2452
25
PrakapenkaD.LiangZ.JiangJ.MaL.DaY. (2021). A Large-scale genome-wide association study of epistasis effects of production traits and daughter pregnancy rate in US Holstein cattle. Genes12 (7), 1089. 10.3390/genes12071089
26
PrakapenkaD.WangC.LiangZ.BianC.TanC.DaY. (2020). Gvchap: A computing pipeline for genomic prediction and variance component estimation using haplotypes and SNP markers. Front. Genet.11, 282. 10.3389/fgene.2020.00282
27
RitchieM. D.Van SteenK. (2018). The search for gene-gene interactions in genome-wide association studies: Challenges in abundance of methods, practical considerations, and biological interpretation. Ann. Transl. Med.6 (8), 157. 10.21037/atm.2018.04.05
28
SegreD.DeLunaA.ChurchG. M.KishonyR. (2005). Modular epistasis in yeast metabolism. Nat. Genet.37 (1), 77–83. 10.1038/ng1489
29
SuG.ChristensenO. F.OstersenT.HenryonM.LundM. S. (2012). Estimating additive and non-additive genetic variances and predicting genetic merits using genome-wide dense single nucleotide polymorphism markers. PloS One7 (9), e45293. 10.1371/journal.pone.0045293
30
TanC.WuZ.RenJ.HuangZ.LiuD.HeX.et al (2017). Genome-wide association study and accuracy of genomic prediction for teat number in Duroc pigs using genotyping-by-sequencing. Genet. Sel. Evol.49 (1), 35. 10.1186/s12711-017-0311-8
31
VanRadenP. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci.91 (11), 4414–4423. 10.3168/jds.2007-0980
32
VitezicaZ. G.LegarraA.ToroM. A.VaronaL. (2017). Orthogonal estimates of variances for additive, dominance, and epistatic effects in populations. Genetics206 (3), 1297–1307. 10.1534/genetics.116.199406
33
WangC.DaY. (2014). Quantitative genetics model as the unifying model for defining genomic relationship and inbreeding coefficient. PLoS One9 (12), e114484. 10.1371/journal.pone.0114484
34
WangC.PrakapenkaD.WangS.PulugurtaS.RuneshaH. B.DaY. (2014). Gvcblup: A computer package for genomic prediction and variance component estimation of additive and dominance effects. BMC Bioinforma.15 (1), 270. 10.1186/1471-2105-15-270
Summary
Keywords
multifactorial model, epistasis, haplotype, SNP, GBLUP, GREML
Citation
Da Y, Liang Z and Prakapenka D (2022) Multifactorial methods integrating haplotype and epistasis effects for genomic estimation and prediction of quantitative traits. Front. Genet. 13:922369. doi: 10.3389/fgene.2022.922369
Received
17 April 2022
Accepted
12 September 2022
Published
14 October 2022
Volume
13 - 2022
Edited by
Kefei Chen, Curtin University, Australia
Reviewed by
Tianjing Zhao, University of California, Davis, United States
José Marcelo Soriano Viana, Universidade Federal de Viçosa, Brazil
Updates
Copyright
© 2022 Da, Liang and Prakapenka.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yang Da, yda@umn.edu
This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.