Multifactorial methods integrating haplotype and epistasis effects for genomic estimation and prediction of quantitative traits

The rapid growth in genomic selection data provides unprecedented opportunities to discover and utilize complex genetic effects for improving phenotypes, but the methodology is lacking. Epistasis effects are interaction effects, and haplotype effects may contain local high-order epistasis effects. Multifactorial methods with SNP, haplotype, and epistasis effects up to the third-order are developed to investigate the contributions of global low-order and local high-order epistasis effects to the phenotypic variance and the accuracy of genomic prediction of quantitative traits. These methods include genomic best linear unbiased prediction (GBLUP) with associated reliability for individuals with and without phenotypic observations, including a computationally efficient GBLUP method for large validation populations, and genomic restricted maximum estimation (GREML) of the variance and associated heritability using a combination of EM-REML and AI-REML iterative algorithms. These methods were developed for two models, Model-I with 10 effect types and Model-II with 13 effect types, including intra- and inter-chromosome pairwise epistasis effects that replace the pairwise epistasis effects of Model-I. GREML heritability estimate and GBLUP effect estimate for each effect of an effect type are derived, except for third-order epistasis effects. The multifactorial models evaluate each effect type based on the phenotypic values adjusted for the remaining effect types and can use more effect types than separate models of SNP, haplotype, and epistasis effects, providing a methodology capability to evaluate the contributions of complex genetic effects to the phenotypic variance and prediction accuracy and to discover and utilize complex genetic effects for improving the phenotypes of quantitative traits.

where g = 9 1  column vector of genotypic values;  = common mean; io α = additive effect of SNP i ( i 1,2  ); αi w = 9 1  model matrix of io α ; io δ = dominance effect of SNP i ( i 1,2  ); δ1 w = 9 1  model matrix of io δ , i αi io α  a w = 9 1  column vector of additive values of SNP i ( i 1,2  ); A A genotypes of the th i SNP respectively, with i p = allele frequency of 1 A and i q = allele frequency of 2 A . Equation (S1.1) is the foundation for pairwise epistasis effects as the interaction between 1o α and 2o α , 1o α and 2o δ , 1o δ and 1o α , and 1o δ and 2o δ under the assumption of linkage equilibrium (LE) that allows simplified genomic epistasis relationship matrices based on SNP additive and dominance relationship matrices without creating the epistasis model matrices. Subscript 'o' denotes a genetic effect form the original quantitative genetics model of Equation (S1.1) and later will be removed in the raparameterized and equivalent model resulting from the use of genomic relationship matrices.

Epistasis effects, model matrices and values
The epistasis effects defined by Cockerham method can cover high-order epistasis effects. However, the biological significance of high-order epistasis is unknown. This study only considers second-and third-order epistasis. Based on the single-SNP model of two loci defined by Equation (S1.1) and the epistasis effects defined by Cockerham (Cockerham, 1954), the model matrices of pairwise epistasis effects for n individuals are expressed as the Hadamard products between the model matrices of the additive and dominance effects:      An important point of Equations (S1.3) and (S1.4) is that the epistasis model matrices are expressed as the Hadamard products between SNP additive and dominance model matrices. This result is the foundation for calculating genomic epistasis relationship matrices using SNP additive and dominance model matrices without creating the epistasis model matrices.

Quantitative genetics model with haplotype additive effects and values
The haplotype additive effects and values are derived by the multiallelic genetic partition of haplotype genotypic values and variances, and the relationship between the haplotype additive effects and values is:  (Da, 2015).

Integrated QG model and multifactorial notations (Model-I)
This integrated model is advantageous over the separate SNP, haplotype, and epistasis models, because the predicted effects or values of each effect type is based on the phenotypic values after removing the predicted effects or values of the remaining effect types, whereas such predicted effects or values are unavailable from the separate models, as shown in the main text.
Combining the additive and dominance values , haplotype additive values (Equation S1.5), and the epistasis values (Equations S1.3 and S1.4), The integrated QG model with SNP additive and dominance, haplotype additive and epistasis effects and values is: With the precise definitions of genetic effects, values and model matrices by Equations S1.3-S1.6, the integrated QG model of Equation S1.6 can be expressed succinctly using multifactorial notations as:

Integrated QG model with intra-and inter-chromosome epistasis effects and multifactorial notations (Model-II)
Replacing the pairwise epistasis effects, values and model matrices in Equation S1 .6 with the intraand inter-chromosome epistasis effects, values and model matrices of Equation S1 .8 the integrated QG model with intra-and inter-chromosome epistasis effects is: int ra int er (2)int ra (2)inter h int ra int er (3) int ra int ra int er int er With the precise definitions of genetic effects, values and model matrices by Equations S1.6 and S1.8, the integrated QG model of Equation S1.9 can be expressed succinctly using multifactorial notations as:

General multifactorial model that covers both Model-I and Model-II
The multifactorial model of Equations S1.7 and S.19 can be summarized by one general multifactorial model as: where io τ = genetic effects of the th i effect type from the original QG model, i W = model matrix W τ = genetic values of the th i effect type, and f = number effect types, with f =10 for Model-I of Equations S1.6 and S1.7, and f = 13 for Model-II of Equations S1.9 and S1.10.

Genomic epistasis relationship matrices using model matrices
Based on the general multifactorial model of Equation S.11 that covers both Model-I and Model-II, the genomic relationship matrices are: where i S = genomic epistasis relationship matrix of the th i effect type, and i k = the average of the diagonal elements of i i ' W W , originally proposed for genomic additive relationship matrix (Hayes and Goddard, 2010).
This article uses the i k value of Equation S1.13 because the heritability estimate of each effect type has the interpretation of being the average variance of the genetic values for the effect type over all individuals, as shown in the main text. The original genomic relationship matrix uses the coefficient of total additive variance of all SNPs as the i k value (VanRaden, 2008). We referred to the VanRaden method as Definition-I and the Hayes-Goddard method Definition-II Wang and Da, 2014). Definition-I is the genomic version of the pedigree additive relationship because Definition-I maintains the properties of the pedigree relationships and would be conceptually appropriate as the 'genomic version' of pedigree relationships. However, Definition-II will be used in this study because Definition-II yields the interpretation that the variance component is the average of the genetic variances of all individuals for the th i type of genetic effects, although Definition-I and Definition-II have the same prediction accuracy, as shown in the main text.
Equations S1.12 and S1.13 use the original model matrices. For epistasis effects, the model matrices are difficult or impossible to compute. Two methods are available for calculating Equations S1.12 and S1.13 without creating the epistasis model matrices.

Genomic epistasis relationship matrices without creating epistasis model matrices
The computing difficulty due to creating epistasis model matrices can be removed by computing the approximate genomic epistasis relationship matrices (AGERM) as the genomic version of Henderson's Hadamard products between additive and dominance relationship matrices (Su et al., 2012;Muñoz et al., 2014;Vitezica et al., 2017), or by the exact genomic epistasis relationship matrices (EGERM). AGERM contains intra-locus epistasis that should not exist (Martini et al., 2020) and EGERM removes intra-locus epistasis from AGERM based on products between SNP genomic additive and dominance relationship matrices (Jiang and Reif, 2020;Martini et al., 2020). Although EGERM is theoretically more appealing than AGERM for being exact, the difference between these two methods in prediction accuracy and heritability estimates was nonexistent for a Holstein dataset with 78,964 SNPs (Liang et al., 2022) and was negligible for a swine dataset with 52,842 SNPs in this article, but EGERM required 21 times as much computing time as required by AGERM for the Holstein dataset, and required about 9 times as much computing time as required by AGERM for the swine dataset in this article. The lack of difference between these two methods and the computing inefficiency of EGERM should favor AGERM for its mathematical simplicity and computing efficiency at least for datasets with 50,000 or more SNPs. In this article, we use general notations of genomic epistasis relationship matrices defined by Equation S1.12 that can be either AGERM or EGERM. The EPIHAP computing package (Liang et al., , 2022 implements both AGERM or EGERM. The use of AGERM or EGERM implies the assumption of linkage equilibrium (LE). Both AGERM and EGERM have general formulations to cover any order of epistasis effects, but no evidence is available in livestock populations showing the epistasis beyond the third-order have contributions to the phenotypic variance and prediction accuracy. Therefore, this article limits to second-and third-order epistasis effects.

Exact genomic epistasis relationship matrices (EGERM)
For EGERM, we use the general formula of Jiang and Reif (Jiang and Reif, 2020). For the example of pairwise epistasis effects in Equation S1.6, the products of the epistasis model matrices for defining pairwise EGERM are calculated as: In Equations S1.14-S1.16, SNP additive and dominance model matrices are used, and no epistasis model matrix needs to be created. In Equation

Pairwise intra-and inter-chromosome EGERM
Based on the products of the epistasis model matrices, the numerators of intra-and interchromosome pairwise EGERM are calculated as: where ch = number of chromosomes. The intra-and inter-chromosome pairwise EGERM are: int ra int ra int ra int ra 4 4 αα αα αα As shown by Equations S1.14-S1.22, the calculations of Equations S1.23-S1.34 are based on the model matrices of SNP additive and dominance effects and do not involve the epistasis model matrices.

Approximate genomic epistasis relationship matrices (AGERM)
Let A = pedigree additive relationship matrix, and D = pedigree dominance relationship matrix. Then, Henderson defined each epistasis relationship matrix as a Hadamard product of A and D matrices (Henderson, 1985). For Model-I: Approximate genomic epistasis relationship matrices are the genomic version of Henderson's epistasis relationship matrices with the pedigree A and D matrices replaced by the genomic additive and dominance relationship matrices (Su et al., 2012). Using notations in this article, genomic additive and dominance relationship matrices that are the genomic versions of pedigree additive and dominance relationship matrices are: Replacing the A and D in Equations S1.35-S1.41 with those of Equations S1.42 and S1.43 yields the AGERM.

Pairwise intra-and inter-chromosome AGERM
For Model-II, the numerators of the intra-and inter-chromosome AGERM can be derived from those of AGERM. In Equations S1.14-S1.16, the minus term is for removing intra-locus epistasis that is contained in AGERM but should exist. Therefore, removing the minus terms of Equations S1.14-S1.16 yields the numerators of AGERM, i.e.: noting that the left-hand-side of Equations S1.14-S1.16 and S1.23-S1.25 are the same, but the right-hand-side are different. Similarly, the numerators of the intra-chromosome epistasis relationships based on the additive and dominance model matrix of each chromosome, i.e.: where ch = number of chromosomes. Then, the numerators of the inter-chromosome epistasis relationship matrices are calculated as the difference between the numerators of the # A A , # A D and # D D genomic relationships matrices and the numerators of the intra-chromosome relationship matrices, i.e.: int er int er int ra int ra αα αα αα αα αα αα Then, the pairwise intra-and inter-chromosome AGERM are: Note that AGERM of Equations S1.53-S1.58 and EGERM of Equations S1.23-S1.28 have the same general expression defined by Equations S1.12-S1.13 but the right-hand-side of the EGERM and AGERM are different.

Text S2. Numerical Demonstration
This is a summary of the results from DEMO.R program for numerical demonstration of the methods in this article. The small dataset is for the convenience of reading the numerical results. The haplotype data is adopted from the numerical example in Da (2015) and is not derived from the small SNP data set, which does not have any accuracy for imputing haplotypes. Detailed procedure to produce haplotype data from the SNP data is described in the GVHHAP computing pipeline (Prakapenka et al., 2020). The SNP and haplotype data are completely artificial for showing calculations only without any genetic or methodology implications.

Phenotypic values, SNP and haplotype genotypes, fixed effects
The dataset in in this demo consists of 5 individuals with one missing phenotypic value for individual #5. The column vector of the phenotypic values for the four individuals with phenotypic observations is: y = (1.753222, 1.617655, 1.713119, 1.738835)ʹ.
The fixed non-genetic effects are male and female, with individuals 1 and 2 having a sex code of '2', and individuals 3 and 4 having a sex code of '1'. The X matrix for these fixed effects is: Each individual has one phenotypic observation, so the Z matrix is an identity matrix.
The 5 individuals have one haplotype block that is treated as a 'locus' with four haplotypes that each was treated as an 'allele'. The haplotype frequencies of the four haplotypes are 0.4, 0.3, 0.2 and 0.1 for haplotypes 1, 2 3 and 4 respectively.

Calculation of allele frequencies
Lines 42-75 in DEMO.R calculate the allele frequencies of the 6 SNPs from the SNP genotypes of the 5 individuals.

SNP additive and dominance codings
Lines 78-100 in DEMO.R calculate the additive coding for each SNP genotype, and lines 110-132 calculate the dominance coding for each SNP genotype using formulae described in the main text.

Model matrices of SNP additive and dominance effects
Lines 102-107 and line 144 in DEMO.R calculate the SNP additive model matrix, and lines 110-139 and line 145 calculate the SNP dominance model matrix.    Lines 1143-1169 calculate third-order epistasis genomic relationship matrices using additive and dominance model matrices. The results are the same as the direct calculations using the thirdorder epistasis model matrices.

Heritability estimates from EPIHAP
All results of DEMO.R are verified by the EIPHAP program , and DEMO.R and EPIHAP have identical results, which provide strong mutual validation for the correct implementation of the methods in this article. The following only the heritability estimates from EPIHAP without repeating all identical results.