A revised Fisher model on analysis of quantitative trait loci with multiple alleles

Zeng et al. (2005) proposed a general two-allele (G2A) model to model bi-allelic quantitative trait loci (QTL). Comparing with the classical Fisher model, the G2A model can avoid using redundant parameters and be fitted directly using standard least square (LS) approach. In this study, we further extend the G2A model to general multi-allele (GMA) model. First, we propose a one-locus GMA model for phase known genotypes based on modeling the inheritance of paternal and maternal alleles. Next, we develop a one-locus GMA model for phase unknown genotypes by treating it as a special case of the phase known one-locus GMA model. Thirdly, we extend the one-locus GMA models to multiple loci. We discuss how the genetic variance components can be analyzed using these GMA models in equilibrium as well as disequilibrium populations. Finally, we apply the GMA model to a published experimental data set.


INTRODUCTION
Currently there are two types of statistical genetic models that are commonly used in genetic analysis of quantitative traits. One is the F ∞ type models that concentrate on direct modeling of the expected genotypic values at targeted quantitative trait loci (QTL) or genetic markers and association testing for various allelic effects and interactions (Fisher, 1918;Cheverud, 2000;Hansen and Wagner, 2001;Wang, 2011). Another is Fisher's analysis of variance (ANOVA) models that focus on assessing variations contributed by some genetic components (i.e., grouped allelic effects or allelic interactions) at targeted QTL or genetic markers (Fisher, 1918;Cockerham, 1954Cockerham, , 1963Kempthorne, 1969;Weir and Cockerham, 1977;Wang and Zeng, 2006). A considerable amount of discussion has been made about the distinction between these two different types of genetic models (Zeng et al., 2005;Álvarez-Castro and Carlborg, 2007;Yang and Álvarez-Castro, 2008;Wang and Zeng, 2009). More recently, a comprehensive review of various genetic models was also given in Álvarez-Castro (2012).
In genetic association studies, we are often interested in direct comparison of the expected genotypic values at certain QTL or marker loci. The F ∞ models are appealing in this setting due to their simplicity in interpretation of their model parameters, which are often referred as the fixed genetic effects such as the additive and dominance effects or the allelic effects and allelic interactions in terms of the expected genotypic values. By applying the F ∞ models, we can compare the expected genotypic values via hypothesis tests on various fixed genetic effects. However, as pointed out in Wang and Zeng (2009), the p-value based association tests on these fixed genetic effects could highly depend upon the regression model, the distribution assumption and the available sample size. Besides, a statistically significant genetic effect with a small enough p-value may not necessarily imply a clinically important finding. On the other hand, the Fisher type models allow us to assess the genetic variations contributed by certain genetic components to the overall variation of a quantitative trait. By definition, these genetic variance components do not depend on the sample size, and they can provide additional information on better understanding the genetic etiology and assessing for clinical importance. Nonetheless, both the F ∞ and the Fisher type models form basis in the analysis of quantitative traits. They provide different perspectives in assessing the genetic effects of QTL or genetic markers. The basic genetic model on assessing the genetic variance components was first proposed by Fisher (1918). Cockerham (1954Cockerham ( , 1963) extended Fisher's one-locus model to two bi-allelic QTL with a particular focus on epistatic variance components. Kempthorne (1954Kempthorne ( , 1957 further extended the two-locus model to multiple alleles. Wang and Zeng (2006) also explored the Fisher type multi-allele two-locus model on partition of the genotypic variance in the presence of linkage disequilibrium (LD). With their model parameters referred as the average effect of the gene substitution (see Falconer and Mackay, 1996) or the average allelic effects and interactions (Wang and Zeng, 2009), these classical Fisher type models often contain constraints on their model parameters due to an over-parameterization of the expected genotypic values. These constraints could make it difficult to fit this type of models using standard least square (LS) regression approach. To avoid this problem, under a regression model framework, Zeng et al. (2005) proposed a general twoallele (G2A) model for variance component analysis on bi-allelic QTL. The G2A model retains the nice property of the classical Fisher's model on orthogonal partition of the genotypic variance in equilibrium populations. Meanwhile, without using redundant parameters, the G2A model can be fitted directly using the standard LS approach. Wang and Zeng (2009) further explored the origin of the G2A model and clarified its theoretical basis on pertaining the orthogonal partition of the genotypic variance in equilibrium populations. Álvarez-Castro and Carlborg (2007) also proposed a different way for re-parameterization of the expected genotypic values, which the authors referred as the natural and orthogonal interactions (NOIA) model. It appears that the NOIA model first defines the genetic effects in terms of the expected genotypic values, and then derives the design matrix for the genetic effects via an inverse of the matrix. Though the matrix formulation is helpful for facilitating the transformation between different parameterizations, making inverse of the matrix to construct the design matrix is difficult to implement analytically. For multiple QTL, how to define various genetic effects in terms of the expected genotypic values could also be a challenge especially when locus-by-locus interactions are involved. Besides, the NOIA model assumes that the design matrix is of full rank, which makes it unsuitable for reduced re-parameterization of the genotypic values. Currently, no NOIA model has been proposed for reduced re-parameterization of the expected genotypic values, or for multiple QTL in the presence of LD.
In this study, we further extend the G2A model to QTL with multiple alleles and multiple loci. In bi-allelic case, only one additive effect and one dominance effect are needed at each locus, and the locus-by-locus interactions can be easily included for constructing a full re-parameterization of the genotypic values. For one QTL with multiple alleles, how to define the dominance effects for various allelic interactions is not straightforward especially when phases of its genotypes are unknown. The extension to multiple loci is also cumbersome by the much more complex structure of locus-by-locus interactions. How to present the model and define various genetic variance components are not trivial tasks. To construct one-locus general multi-allele (GMA) model, we overcome the phase problem by appropriately merging the paternal and maternal allelic effects and allelic interactions in the phase-known situation. Typically, with phase unknown genotypes at a locus, we may have to assume that the paternal and maternal alleles have the same frequencies and contribute the same genetic effects so that we could merge them without distinguishing their parental origins. With phase-known genotypes, we can further break down the additive variance component into paternal and maternal variance components. For multiple QTL with multiple alleles, we develop concise expressions for constructing multi-locus GMA models and defining various genetic components. Explicit formulas for calculating various genetic variance components in equilibrium population are also derived.
The structure of this manuscript is organized as the following. First, we consider one multi-allele QTL with phase known genotypes. Following the same strategy as adopted in Wang and Zeng (2009), we start by introducing indicator variables that describe the inheritance of paternal and maternal alleles. Then we make mean corrections on these indicator variables and build onelocus GMA model based on the mean-corrected index variables. Next, we consider one QTL with phase unknown genotypes. We construct a one-locus GMA model for unphased genotypes by appropriately combining the paternal and maternal allelic effects in the phase known one-locus GMA model. We derive formulas on partitioning the genotypic variance into the additive and dominance variance components under Hardy-Weinberg equilibrium (HWE) as well as in Hardy-Weinberg disequilibrium (HWD) for both the phase-known and phase-unknown one-locus GMA models. Thirdly, we extend the one-locus GMA models to multiple loci with either phase known or phase unknown genotypes. Based on these multi-locus GMA models, we describe how the various genetic components can be defined. An orthogonal partition on the genotypic variance in an equilibrium population is also presented in both the phase known and unknown cases. In addition, we discuss how to construct the reduced multi-locus GMA models in practice. The difference in using F ∞ models and GMA models to build reduced models for the expected genotypic values is explored. Finally, we apply the GMA model to a published experimental data set.

METHODS AND RESULTS
The variation of a quantitative trait Y is usually assumed to be contributed by both genetic and environmental effects. Let G denote the true (unobservable) genotypic value from the joint contribution of all the genetic factors to the quantitative trait Y. Given genotypes at targeted QTL or genetic marker loci, we focus on assessing the variations (i.e., variances) contributed by the allelic effects and interactions of the QTL or marker loci to the total genotypic variance V G .

ONE-LOCUS MODEL FOR PHASE KNOWN
First, let us consider a single QTL with multiple alleles A 1 , . . . , A m (m ≥ 2). Suppose we know the parental origins of the alleles for observed genotypes. Then we can distinguish the paternal and maternal allelic effects separately. With m alleles, there are in total m 2 possible phased genotypes: (A i , A j ), i, j = 1, . . . , m, where A i and A j represent the paternal and maternal allele, respectively. Let G i j = E[G|g = (A i , A j )] denote the expected genotypic value for a phased genotype (A i , A j ) in a study population. Then there are totally m 2 possible expected genotypic values G i j , i, j = 1, . . . , m. To model these expected genotypic values, Fisher's classical one-locus model is given by where α i (or α j ) is the so-called average additive or main allelic effect of a paternal (or maternal) allele A i (or A j ), and δ i j is the average allelic interaction between a paternal allele A i and a maternal allele A j . The above model is a typical two-way ANOVA model with the paternal and maternal gametes being treated as two independent risk factors. Although the paternal and maternal gametes often share (but do not have to) the same set of alleles A 1 , . . . , A m at the QTL, it allows the paternal and maternal gametes to have different allele frequencies and allelic effects at the QTL.
Let p i be the frequency of allele A i on paternal gametes m i = 1 p i = 1 , and p j be the frequency of allele A j on the maternal gametes m j = 1 p j = 1 . One nice feature of the Fisher model above is that it can assess the additive and dominance variance components V A and V D , which are defined as variations contributed by the additive allelic effects from the paternal and maternal alleles and the allelic interactions between the paternal or maternal alleles, respectively. For example, under HWE, it is well known that the total genotypic variance V G = i,j p i p j (G i j − μ) 2 has an orthogonal partition Note that there are in total m 2 + 2m + 1 = (m + 1) 2 parameters being involved in Fisher model (1) including the intercept μ, which is more than the total number m 2 of the expected genotypic values G i j , i, j = 1, . . . , m. As a result, not all the model parameters are estimable. To avoid this problem, some constraints on the model parameters are often required. It is usually assumed that all the genetic effects are averaged to zero over any index; i.e., With these constraints, Fisher (1918) showed that the least square estimates (LSE) of the model parameters are given by In simple cases, we could estimate the model parameters using these formulas. In general, however, those "irregular" constraints in (2) make it difficult to fit model (1) using the standard LS approach via commonly used software such as SAS (SAS Institute INC, Raleigh, NC), especially when we need to adjust for certain environmental covariates.
Here we propose a way to get rid of the redundant parameters. Let us first introduce the following indicator variables that describe the transmission inheritance of the paternal and maternal alleles.
for i, j = 1, 2, . . . , m at the QTL. Next, using the same strategy as adopted in Wang and Zeng (2006), we further make mean corrections on these indicator variables z P i , z M j and define the following mean-corrected index variables Then we can re-write the Fisher model (1) as where E(G|g) denotes the expected genotypic value given a genotype g at the targeted QTL. As shown in Wang and Zeng (2006), the above model is simply a different representation of the Fisher model (1) with the same constraints being applied on the model parameters. However, as the HWD measurements can be quantified as covariances between those mean-corrected index variables x P i 's and x M j 's, this model expression can facilitate derivation on examining the genetic variance components under HWD. Now, we further exclude the redundant parameters in model (3). For a diploid subject such as human being, his or her genotype at a locus on a pair of homologous chromosomes consists of two alleles with one from the father and the other one from the mother. Therefore, we always have m i = 1 z P i (g) ≡ 1 and m j = 1 z M j (g) ≡ 1. Thus, (3), which leads to the following revised Fisher model Model (4) provides a full re-parameterization of the m 2 expected genotypic values G i j , i, j = 1, . . . , m, without using redundant parameters. We refer it as a revised one-locus Fisher model or a general multi-allele (GMA) model. We also refer its model parameters α * i (or α * j ) as the average (additive) allelic effect of the paternal (or maternal) allele A i (or A j ), and δ * i * j the average allelic interaction between the paternal allele A i and maternal allele A j , with respect to the baseline allele A m . Here, we choose allele A m as the baseline allele but it could be any other allele as well. It is easy to show that, in terms of the parameters in the original Fisher model (1) . . , m. Model (4) retains most of the nice features of the original Fisher model (1) on partition of the genotypic variance. Based on this model, we have the genetic additive variance components , which denote variations contributed by the additive effects of the paternal and maternal alleles, respectively. The genetic dominant variance component , which accounts for a variation contributed by the interactions between paternal and maternal alleles in addition to the additive variance components. Under HWE, the paternal index variables {x P i , i = 1, . . . , m} are independent of the maternal index variables {x M j , j = 1, . . . , m}. Therefore, we have μ = E(G) and an orthogonal partition on the variance of the expected genotypic values: In HWD, the disequilibrium measurements can be captured by the covariances between the index variables x P i 's and x M j 's. In this case, we no longer have an orthogonal partition on the variance of the genotypic values for the additive and dominance variance components (see Appendix A). It should be pointed out that although the model parameters defined in model (4) depend upon the choice of the reference allele A m , the additive and dominant variance components V A P , V A M , V D as well as their covariance components do not depend on such a choice. Note that the partition of the total genotypic variance V G based on the Fisher model (1) assumes that the genotypic values are completely determined by the QTL. In general, beyond the targeted QTL, the genotypic value G could also be affected by other unobserved QTL. In this case, the total genotypic variance is a variation that cannot be explained by the targeted QTL. So the Fisher model (1), or its revised model (4), actually provides a partition on the variance V(E(G|g)) of the expected genotypic values at the targeted QTL instead of the total genotypic variance V G . In practice, given a random sample from the study population, if we ignore the genetic by environmental interaction, a quantitative trait can typically be expressed in a linear regression model form as where, for k = 1, . . . , n, g k is the observed QTL genotype of subject k, z k is a vector of subject k from some adjusted environmental covariates with fixed effects β, and k is a model residual contributed by other environmental and genetic factors that cannot be captured by z k and g k at the targeted QTL. Assuming that k , k = 1, . . . , n, are independent and identically distributed (i.i.d) with a variance σ 2 , we have V y = V(E(G|g)) + σ 2 . To further partition V(E(G|g)) based on the QTL genotypes, we can first calculate the allele frequencies and compute the values of the index variables {x P i (g k ), x M j (g k )} for each subject k = 1, . . . , n, in the sample. Then, by treating these index variables as fixed covariates, we can incorporate the GMA model (4) into the regression model above and fit the model using the standard LS approach.
Based on the LSEα * i ,α * j andδ * i * j of the model parameters, we can compute the additive and dominance genetic com- Finally, the genetic variance components V A P , V A M and V D can be estimated as the sample variances of {A P (k), k = 1, . . . , n},

ONE-LOCUS MODEL FOR PHASE UNKNOWN
In this subsection, we consider modeling a multi-allele QTL with phase unknown genotypes-a more common situation in practice. As we cannot distinguish the parental origins of alleles in QTL genotypes, as usual, we assume that the paternal and maternal gametes share the same set of alleles with the same allele frequencies. Let A 1 , . . . , A m (m ≥ 2) denote the alleles at a target QTL or marker locus with allele frequencies p i , i = 1, . . . , m.
Ignoring the phases, there are m possible homozygous genotypes A i A i , i = 1 . . . , m, and m(m − 1)/2 possible heterozygous genotypes A i A j , i = j. We also assume that these alleles contribute the same genetic effects regardless of their parental origins, which implies that the expected genotypic values G ij = E(G|g = A i A j ) satisfy the symmetric property: G ij = G ji , for i = j. So there are totally m(m + 1)/2 possible distinctive expected genotypic values G ij , i, j = 1, . . . , m. In this case, by treating the paternal and maternal gametes as two independent risk factors, the Fisher's ANOVA model for the expected genotypic values G ij can be written as where α i is the average (additive) allelic effect of the paternal or maternal allele A i (i = 1, . . . , m), and δ ij is the average allelic interaction between two alleles A i and A j (i, j, = 1, . . . , m). As pointed out in Wang (2011), the above model is different from the classical two-way ANOVA model in that the paternal and the maternal gametes share the same set of alleles and have the same allelic effects at the locus. To avoid the inestimability of model parameters in model (5) due to over-parameterization, the following constraints are often added on the model parameters From the symmetric property of G ij , we also assume δ jk 's are symmetric. Similarly, the above constraints together with the symmetry property of δ jk make it difficult to fit model (5) using the standard LS approach. Note that model (5) can be treated as a special case of model (1) or (3) To construct a similar GMA model for model (5), we can combine the term where, for i = 1, . . . , m − 1, Here A c ij denotes an allele which is different from A i and A j . Note that the combined index variables w i (g), v ii (g), and v ij (g) above are well defined on unphased genotypes, although x P i , x M j are not. We refer them as genotype coding variables, and the model parameter α * i as the average allelic effect of allele A i (i = 1, . . . , m), and δ * ij as the average allelic interaction between two alleles A i and A j (i, j, = 1, . . . , m), with respect to the baseline allele A m . In terms of the parameters in the original model (5), we can show that α Model (6) is an extension of the one-locus G2A model proposed in Zeng et al. (2005) to one QTL with multiple alleles. Note that no v ij (g)'s for i < j are needed in the bi-allelic case. The combined index variables v ii (g), i = 1, . . . , m − 1, are also slightly different from the ones defined in Zeng et al. (2005) by a scalar of (−2) folds. Here we drop the scalar (−2) so that the coefficient δ * ii can keep the same interpretation as δ * i * i in model (4). In addition, the combined index variables v ij (i < j) defined above are not exactly the same as the ones we suggested in the discussion section of Wang (2011). Based on the previously defined inheritance indicator variables, we define w Still, model (6) Here we define δ * ji = δ * ij , for i < j. In HWD, the desired orthogonal partition on the variance of the expected genotypic values is no longer held. But the model can still allow us to capture the disequilibria via covariances between those index variables x P i 's and x M j 's (see Appendix B).
As an example, let us consider a QTL with 3 alleles A 1 , A 2 , and A 3 . By taking A 3 as the baseline allele, model (6) leads to Or, in a matrix form, we have ⎛ If we choose A 1 (or A 2 ) instead of A 3 as the baseline allele, we can obtain different re-parameterizations of the six expected genotypic values. But they all give the same partition on the variance of the expected genotypic values. Álvarez-Castro and Yang (2011) also presented a similar re-parameterization of the six expected genotypic values from their NOIA model. It appears that their one-locus 3-allele NOIA model and the GMA model (7) above are likely equivalent on partitioning the variance of the six expected genotypic values, as we will see later in the Example Subsection 2.4. Similar to the phase-known case, we can estimate the additive, dominance variance components and the covariance Cov(A, D) from a random sample when the QTL genotypes are available. First, we can compute the values of the combined index variables w i (g k ) and v ij (g k ) at the target QTL for each subject k = 1, . . . , n. Next, we can incorporate model (6) into a regression model with other possible adjusted covariates. By treating those combined index variables as regular fixed covariates, we can fit the regression model using the standard LS approach. Then, based on the fitted model, we compute the additive and dominance com-

MULTI-LOCUS MODELS
The one-locus GMA models can be extended to multiple loci. Typically, for each locus k, we can create the mean-corrected index variables x (k) P i , x (k) M j (or combined index variables w k,i , v k,ij for phase-unknown genotypes) in the same way as in the one-locus case. Then we can build a multi-locus GMA model by including the within locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases). The proposed multilocus GMA model allows LD among multiple loci even though the LD may affect the partition on the variance of the expected genotypic values.
Consider L loci with alleles A k1 , . . . , A km k at the kth locus for k = 1, . . . , L. With phase known, there are totally m 2 1 · · · m 2 L possible expected genotypic values: G s 1 ···s L t 1 ···t L = E[G|g = (A 1s 1 · · · A Ls L , A 1t 1 · · · A Lt L )], where the joint genotype (A 1s 1 · · · A Ls L , A 1t 1 · · · A Lt L ) is formed by the union of a paternal gamete A 1s 1 · · · A Ls L and a maternal gamete A 1t 1 · · · A Lt L with s k , t k ∈ {1, . . . , m k } for k = 1, . . . , L. Let p P ks k and p M kt k be the frequencies of the paternal allele A ks k and maternal allele A kt k , respectively. At each locus k = 1, . . . , L, we define the meancorrected index variables x (k) Ps k for paternal alleles A ks k (s k = 1, . . . , m k ) and x (k) Mt k for maternal alleles A kt k (t k = 1, . . . , m k ) in the same way as in the one-locus case. Then, when we choose A 1m 1 , . . . , A Lm L as the baseline alleles, a fully parameterized Llocus GMA model can be expressed as where the summation of s k (or t k ) is from 1 up to (m k − 1) for k = 1, . . . , L; s k (or t k ) refers to a particular paternal allele A ks k (or maternal allele A kt k ) at the k-th locus; and i k (or j k ) is an indicator variable for the presence or absence of a paternal (or maternal) allele at locus k which is involved in a term. The coefficient α s * 1 ···s * L t * 1 ···t * L in each term represents an average allelic effect of a single paternal or maternal allele, or an allelic interaction from a set of paternal and maternal alleles that are involved in this term with respect to the baseline alleles A 1m 1 , . . . , A Lm L . The superscripts (or subscripts) in α s * 1 ···s * L t * 1 ···t * L are defined as s * k = i k · s k (or t * k = j k · t k ), indicating which paternal (or maternal) allele at locus k is involved in this term. If the term does not have any paternal (or maternal) allele at locus k being involved, then we have s * k = 0 (or t * k = 0). Note that na P = L k = 1 i k (or na M = L k = 1 j k ) specifies the total number of paternal (or maternal) alleles being involved in a term. We refer the total number of both paternal and maternal alleles that are involved in a term na = na P + na M as the order of this term.
The multi-locus GMA model (8) provides a full reparameterization of the m 2 1 · · · m 2 L expected genotypic values G s 1 ···s L t 1 ···t L with phase-known genotypes without using redundant parameters. Note that E(G|g) can also be partitioned into a sum of the following genetic components where i 1 , j 1 , . . . , i L , j L = 0, 1 are indicator variables, which specify the parental origins of the contributing alleles. In other words, each genetic component consists of the terms with the same order and having their alleles all coming from the same subset of loci with the same parental origins but allowing varied allelic types. The classical genetic variance components can then be defined as variances of these genetic components. Note that the component C 0···0 0···0 is a constant, which corresponds to an intercept without any alleles being involved. Therefore, for a L-locus GMA model with phase known genotypes, there are in total (2 2L − 1) genetic variance components V C i 1 ···i L j 1 ···j L , where i 1 , j 1 , . . . , i L , j L = 0, 1 but not all being zeros. Among them, there are L paternal and L maternal variance components of order 1 with each having a single paternal or maternal allele being involved, L dominance variance components of order 2 with each having two alleles within the same locus being involved, L(L − 1)/2 paternal by paternal (or maternal by maternal) variance components of order 2 with two paternal (or maternal) alleles coming from different loci, and L(L − 1) paternal by maternal variance components of order 2 with one paternal and one maternal allele coming from two different loci. For examples, the within-locus paternal, maternal and dominance variance components at a locus k can be written as 1, . . . , L). For a pair of loci j = k, the two-locus paternal by paternal variance component is Ps j x (k) and the two-locus paternal by maternal variance component is The variance component of epistases with the highest order is which has 2L alleles being involved with two alleles per locus. Based on these genetic components, we can partition the variance of the expected genotypic values into the variances and covariances of the genetic components. Still, the coefficients α Mt j , s j , t j = 1, . . . , m j } at a locus j are also independent of the index variables {x (k) Ps k , x (k) Mt k , s k , t k = 1, . . . , m k } at a different locus k (k = j). To achieve an orthogonal partition on the variance of the expected genotypic values, we need to further assume that all the paternal index variables {x (k) Ps k , s k = 1, . . . , m k , k = 1, . . . , L} are independent of the maternal index variables {x (k) Mt k , t k = 1, . . . , m k , k = 1, . . . , L} across all the loci; i.e., the so-called gametic equilibrium (see Wang and Zeng, 2006). Under both the gametic and linkage equilibria, each genetic component has its mean E C i 1 ···i L j 1 ···j L = 0, and the covariances between different genetic components are zeros. Thus, we have E(G) = α 0···0 0···0 and an orthogonal partition on the variance of the expected genotypic values is given by where Similarly, we can construct multi-locus GMA models for QTL with phase unknown genotypes. Without distinguishing the parental origin of the alleles, there are totally L k = 1 m k (m k + 1)/2 L possible expected genotypic values: , for s k , t k = 1, . . . , m k and k = 1, . . . , L. We assume that the paternal and maternal allele frequencies are the same and denoted by p ks k . We define the combined index variables w k,i (g) and v k,ij at each locus k in the same way as the one-locus case for k = 1, . . . , L. Then, when we choose A 1m 1 , . . . , A Lm L as the baseline alleles, a fully parameterized L-locus GMA model for QTL with phase unknown genotypes can be expressed as where 1 {i k =j} is the Kronecker function which equals 1 when i k = j and 0 otherwise, for k = 1, . . . , L and j = 1 or 2; the summation of s k (or t k ) is from 1 up to m k − 1 with s k (or t k ) referring to allele A ks k (or A kt k ) at a locus k (k = 1, . . . , L); and i k specifies how many alleles at a locus k are involved in this term. If i k = 1, this term has only one allele A ks k (either paternal or maternal) being involved via w k,s k and we set s * k = s k . If i k = 2, both the paternal and maternal alleles A ks k and A kt k are involved via v k,s k t k in this term and we set s * k = s k t k regardless of the order of s k and t k . When i k = 0, this term does not have any alleles at locus k being involved and we have w k,s k t k = 1. The coefficient α s * 1 ···s * L represents the average allelic effect of a single allele, or an allelic interaction from all the alleles that are involved in this term, with respect to the baseline alleles A 1m 1 , . . . , A Lm L . We still refer the total number of alleles being involved in each term; i.e., na = L k = 1 i k , as the order of this term. Based on the above model, we can define the genetic components as for i 1 , . . . , i L = 0, 1, 2, with each component being a summation of the terms with their alleles all coming from the same subset of loci and having the same number of alleles from each locus. Excluding the one with i 1 = · · · = i L = 0, which corresponds to an intercept, there are in total 3 L − 1 genetic variance components V(C i 1 ···i L ) for i 1 , . . . , i L = 0, 1, 2. Among them, there are L additive variance components with each having a single paternal or maternal allele being involved, L dominance variance components with each having both paternal and maternal alleles within the same locus being involved, L(L − 1)/2 additive by additive variance components with two alleles coming from different loci, etc. For examples, the within-locus additive and dominance variance components at a locus k can be written as respectively, for k = 1, . . . , L. For a pair of loci j = k, the two-locus additive by additive interaction is The variance component of epistases with the highest order of 2L is given by Under both the gametic and linkage equilibria, we have E(G) = α 0···0 and an orthogonal partition on the variance of the expected genotypic values where Here, when i k = 2, we have s * k = s k t k and s * k = s k t k . We also define α s * 1 ···s * L (or α s * 1 ···s * L ) to be the same if we switch the order of s k and t k in s * k (or s k and t k in s * k ) for k = 1, . . . , L. In the presence of disequilibria, the desired orthogonal partition may no longer hold. However, regardless of the equilibrium, the coefficients α s * 1 ···s * L in model (10) are defined based on the baseline alleles A 1m 1 , . . . , A Lm L , while the variances and covariances of the genetic components do not depend on the choice of these baseline alleles.
In practice, we do not have to rely on the derived formula to estimate the genetic variance or covariance components. Similar to the one-locus case, given the observed QTL genotypes for a random sample from a study population, we can always incorporate model (8) or (10) into a regression model with other possible adjusted covariates and fit the model using standard LS approach. Then we can estimate various genetic variance components as well as the covariances among different genetic components based on the fitted model. A good fit of a fully parameterized GMA model often requires that the expected genotypic values for all possible joint genotypes of the QTL are estimable from the study sample. If certain genotypes are not observable or rarely present in subjects from the study sample, a situation which likely happens when the number of alleles or the number of QTL is large with moderate or small sample size, the design matrix for the genetic effects could become singular which implies that some genetic variance components cannot be estimated reliably. But we do not have to use fully parameterized GMA models to model the expected genotypic values. In this case, we may want to build a reduced GMA model that can provide a good approximation to the expected genotypic values overall and meanwhile has a less complicated model structure. The fact that two terms within the same genetic component are unavoidably correlated suggests that we should perhaps treat each genetic component as a whole and keep or drop its terms all at once in building a GMA model. As genetic components of lower orders tend to have bigger impact on the expected genotypic values than the higher order ones, one way to construct a reduced GMA model is perhaps to go through a stepwise forward selection procedure by hierarchically adding the lowest order genetic component that can achieve a nominal significance level (e.g., 5%) but has not yet been selected in the model into the model one at a time. Here, the classical likelihood ratio statistic can be used to assess each genetic component for entering into or dropping from the model.
It has been known that the model building procedure is often sensitive to potential confounding among the selected variables. A GMA model uses the mean-corrected index variables x (k) Ps k , x (k) Mt k (or their combined index variables w k,s k , v k,s k t k in the phase unknown case) as the basic units in constructing all its model terms. At least in an equilibrium population, these mean-corrected index variables can reduce the confounding among different genetic components and make them uncorrelated. This orthogonality implies that each genetic component can be assessed independently regardless of which components are presented in the model. On the other hand, an F ∞ model can be thought of directly using the inheritance indicator variables z (k) Ps k and z (k) Mt k (or their merged genotype coding variables Ms k in phase unknown case) as the basic units in constructing its terms (Wang and Zeng, 2009;Wang, 2011). Even in an equilibrium population, an F ∞ model could have its low-order terms being highly confounded with other high-order terms when they contain shared alleles (see Zeng et al., 2005). As the result, in building a predictive model based on F ∞ models, the stepwise forward selection procedure could be problematic because failing to include a significant higher order term (or component) in a reduced F ∞ model could make the assessment of some low-order terms (or components) unreliable. On selecting significant QTL from a set of loci without having the locus-by-locus interactions being involved, the choice of using GMA or F ∞ model in building a reduced model for the expected genotypic values should not matter much because using mean-corrected index variables mainly affects the intercept in this case. However, when we consider including epistases for a given set of QTL, the GMA model can appropriately use the orthogonal property among different genetic components to dissect the confounding at least in equilibrium populations, while it appears that the F ∞ models cannot make full use of the equilibrium information. When disequilibria are present, as Hardy-Weinberg equilibria are expected to be held in most of the human genomic regions and LD mainly present for closely linked loci, we would expect that most of the genetic components in a GMA model are likely uncorrelated. Therefore, in most cases, using the GMA model could still be preferable to using F ∞ model in building reduced models for expected genotypic values especially when epistases are involved.

EXAMPLE
As an example, we apply the GMA model to a published experimental data set on the polymorphism at the human acid phosphatase locus (ACP1). The analysis of this data set was first conducted by Greene et al. (2000), and recently re-analyzed as an example in Álvarez-Castro and Yang (2011). The ACP1 gene involves 3 alleles A, B, C. Two phenotypic traits are considered, which measure the ACP1 enzyme activity (y ac ) and inhibition (y in ). The estimates of the expected genotypic values and the genotype frequencies are summarized in Table 1.
From the genotype frequencies, we first estimate the allele frequencies as p A = 0.3534, p B = 0.5818, and p C = 0.0647. Then, for each trait, we fit a separate GMA model (7) to its expected genotypic values by taking C as the baseline allele and creating the combined index variables w 1 (g), w 2 (g), v 11 (g), v 22 (g), v 12 (g 12 v 12 (g k ), k = 1, . . . , 6, for the six ACP1 genotypes. Based on the genotype frequencies, we then calculate the genetic variance components. For ACP1 enzyme activity, we obtain V A = 658.868, V D = 0.973 and the covariance Cov(A, D) = −0.356. The total variance of the expected genotypic values is V(E(G ac |g)) = 659.129, and V A /V(E(G ac |g)) = 99.96%. For ACP1 enzyme inhibition, we have V A = 44.920, V D = 0.146 and the covariance Cov(A, D) = −0.217. The total variance of the expected genotypic values is V(E(G in |g)) = 44.632, and V A /V(E(G in |g)) = 100.65%. Note that the partition V(E(G|g)) = V A + V D + 2Cov(A, D), which is not orthogonal for both traits due to a slight deviation from HWE at ACP1 locus as we have observed from the genotype frequencies in Table 1. It is also interesting to see that for ACP1 enzyme inhibition, V A /V(E(G in |g)) is bigger than 100% due to HWD and the fact that V D + 2Cov(A, D) < 0.
Using the same allele frequencies but assuming HWE, we would have slightly different genotype frequencies. Note that the LSE of the model parameters keep the same and do not depend on the genotype frequencies because they are completely determined by the allele frequencies and the six expected genotypic values when a fully-parameterized GMA is used. But the total variance of the expected genotypic values and its variance components will be different. For ACP1 enzyme activity, we obtain V A = 660.588, V D = 0.971, Cov(A, D) = 0.006, V(E(G ac |g)) = 661.573 and V A /V(E(G ac |g)) = 99.85%. For ACP1 enzyme inhibition, we have V A = 46.422, V D = 0.152, Cov(A, D) = 0.001, V(E(G in |g)) = 46.573 and V A /V(E(G in |g)) = 99.68%. The minor deviations from orthogonal partitions are likely due to some numerical round-off and the fact that the summation of the three allele frequencies is 0.9999 instead of exactly 1. For this three-allele example, we also applied the NOIA model using the formulas (10) and (11) provided in Álvarez-Castro and Yang (2011), and in both HWE and HWD cases we obtained exactly the same results as above based on our own calculation, which however appear to be slightly different from what had been reported in Álvarez-Castro and Yang (2011). For example, based on the observed genotype frequencies, an estimate of V D = 1.15 for the trait of ACP1 enzyme inhibition was reported in Álvarez-Castro and Yang (2011), which is noticeably larger than V D = 0.146 that we obtain above.

DISCUSSION
In the analysis of genetic variance components, a separation of the variations contributed by the additive allelic effects and allelic interactions is complicated by the fact that the observed genotypes are often phase-unknown. In this study, by appropriately merging the paternal and maternal allelic effects and allelic interactions in the phase-known situation, we propose a way to construct one-locus and multi-locus GMA models on analysis of genetic variance components for QTL with multiple alleles. In the same way as building a G2A model, we construct a GMA model by first specifying its design matrix for the genetic effects via some mean-corrected index variables. As these mean-corrected index variables are well defined based on the observed genotypes and allele frequencies, they can be treated as regular covariates for coding QTL genotypes. These one-locus or multi-locus GMA models can then be incorporated into standard regression models with other possible adjusted covariates and fitted using standard LS approach. Based on the fitted models, we can further estimate the genetic variance and covariance components through the sample variances and covariances of various genetic components. As we have pointed out, these GMA models can be applied to equilibrium populations as well as populations in Hardy-Weinberg and/or linkage disequilibria. By using the full set or a loworder subset of the index variables (or genetic components), the GMA model allows us to make either full or reduced reparameterization of the genotypic values. When some loci have phase known genotypes while other loci have phase unknown genotypes (a possible hypothetical situation), a mixed GMA model could also be constructed by adopting the same modeling strategy.
Sometimes we may want to perform hypothesis tests on the existence of certain genetic variance and covariance components. Note that the GMA models have allele frequencies being involved in their design matrices. As allele frequencies often need to be estimated from the genotype data, they could contribute another source of variation in the LSE (β) of the model parameters as well as the genetic variance components. When the allele frequencies can be accurately estimated, we could simply treat them as fixed constants. When the residuals in a regression model do not depend on the allele frequency estimates, based on the linear model theory (see Ravishanker and Dey, 2002),β are known to be unbiased with its covariance matrix cov(β) = σ 2 E[(X X) −1 ], where X is the design matrix of a GMA model. In this case, we can assess the existence of variance components by performing traditional hypothesis tests onβ. In general, we could also assess the existence of these genetic variance and covariance components through a bootstrap procedure. By repeatedly drawing random samples of the same size from the observed random sample with replacement, we can estimate the genetic variance and covariance components for each bootstrap sample and meanwhile assess the variances in estimates of these genetic variance and covariance components and test for their existence.
In genetic studies, QTL with missing genotypes is a common phenomenon. GMA model can be used to fit QTL with missing genotypes. Rather than excluding patients with missing QTL genotypes, we could treat "missing" as an allele although this strategy may induce potential bias as we assume that all the missing alleles have the same genetic effect. GMA models could also be applied incorporation with various imputation methods. In recent years, there has been a great deal of interest in developing methodologies for QTL mapping using recombinant intercrosses from multiple inbred lines. In this case, the putative QTL often have their locations and genotypes unknown. But the allele frequencies of QTL could probably be inferred from the study design and the QTL genotypes might be imputable from their neighboring genetic markers. How to apply GMA models to this type of experimental crosses for QTL mapping could be a research topic for further exploration.
In summary, the analysis of genetic variance components for multi-allele QTL has been challenging due to complex allelic interactions and locus-by-locus interactions. In this study, we thoroughly explored the architecture of one-locus and multilocus GMA models with either phase known or unknown genotypes. Particularly, we described in detail the architecture of the multi-locus GMA model, and how the model terms can be grouped into various genetic components. Under equilibria populations, we also derived formulas for orthogonal partition of the genetic variance components, which could be useful for analytical assessment of the variance components. Comparing to the classical Fisher model, the GMA models can estimate the genetic variance and covariance components more conveniently via standard LS approach for either one or multiple QTL with multiple alleles, in equilibrium as well as disequilibrium populations.