Analysis of Variance Components for Genetic Markers with Unphased Genotypes

Wang, Tao

doi:10.3389/fgene.2016.00123

ORIGINAL RESEARCH article

Front. Genet., 13 July 2016

Sec. Statistical Genetics and Methodology

Volume 7 - 2016 | https://doi.org/10.3389/fgene.2016.00123

Tao Wang^*

Division of Biostatistics, Institute for Health and Society, Medical College of Wisconsin, Milwaukee, WI, USA

An ANOVA type general multi-allele (GMA) model was proposed in Wang (2014) on analysis of variance components for quantitative trait loci or genetic markers with phased or unphased genotypes. In this study, by applying the GMA model, we further examine estimation of the genetic variance components for genetic markers with unphased genotypes based on a random sample from a study population. In one locus and two loci cases, we first derive the least square estimates (LSE) of model parameters in fitting the GMA model. Then we construct estimators of the genetic variance components for one marker locus in a Hardy-Weinberg disequilibrium population and two marker loci in an equilibrium population. Meanwhile, we explore the difference between the classical general linear model (GLM) and GMA based approaches in association analysis of genetic markers with quantitative traits. We show that the GMA model can retain the same partition on the genetic variance components as the traditional Fisher's ANOVA model, while the GLM cannot. We clarify that the standard F-statistics based on the partial reductions in sums of squares from GLM for testing the fixed allelic effects could be inadequate for testing the existence of the variance component when allelic interactions are present. We point out that the GMA model can reduce the confounding between the allelic effects and allelic interactions at least for independent alleles. As a result, the GMA model could be more beneficial than GLM for detecting allelic interactions.

1. Introduction

Typically, there are two different ways of assessing the statistical association of a categorical variable with a continuous outcome. We can either make a direct comparison of the group means among groups defined by the categorical variable or assess the variation that the categorical variable may contribute to the total variance of the continuous outcome. The former approach is usually conducted via fitting a general linear model (GLM) with or without an adjustment for other covariates. The latter approach is referred to as the analysis of variance (ANOVA), which examines a quantitative outcome variable by partitioning its total variance into variance components attributable to different sources of variation. The original ANOVA model was proposed initially by Fisher (1918) and formalized later in Fisher (1925). The traditional ANOVA approach to estimation of variance components was mainly based on Henderson's methods I-III, which equate the observed sums of squares to their expected values (Henderson, 1953). Currently, for predictors with observed group levels, their variance components can be estimated via ANOVA tables, which are mostly based on the sequential (Type I) sums of squares or partial reductions in (Type III) sums of squares via fitting GLM (Searle, 1971; Searle et al., 1992). For predictors with observed or unobserved group levels, their variance components could also be evaluated via maximum likelihood estimation (MLE) or restricted maximum likelihood (REML) (see Searle et al., 1992).

In genetic association studies, the standard GLM or ANOVA approaches may not be directly applicable to genetic markers with unphased genotypes. In humans, a marker genotype is a combination of one paternal and one maternal allele. But the phase information (i.e., the parental origin of each allele) is often unavailable from most current genotyping technologies. Appropriate modification of the classical GLM or ANOVA methods is therefore needed in order to overcome this unknown phase problem. In Wang (2011), we introduced several coding schemes for unphased marker genotypes in constructing the dummy-variable based GLM. In Zeng et al. (2005) and Wang (2014), some revised Fisher's ANOVA models were also proposed for ANOVA analysis of quantitative trait loci or genetic markers with unphased genotypes, which were referred to as the general biallelic (G2A) or general multi-allelic (GMA) models. The estimation of genetic variance components based on the G2A and GMA models was also briefly explored.

In this study, by applying the GMA model, we further examine estimation of genetic variance components for genetic markers with unphased genotypes from a random sample of individuals in a study population. First, for a single locus GMA model, we derive the least square estimates (LSE) of model parameters in fitting the GMA model based on the genotypic group means and allele frequency estimates. Then we construct estimators of the variance components for one marker in Hardy-Weinberg disequilibrium (HWD). Next, we consider a fully parameterized two-locus GMA model. We derive the LSE of model parameters in fitting the GMA model and develop estimators of the variance components for two genetic markers in an equilibrium population. In both the one-locus and two-locus cases, we also explore the difference between the GLM and GMA in association analysis of genetic markers. We show that the GMA models can provide the same partition of the genetic variance components as the original Fisher's ANOVA models but the GLM cannot. We clarify that the F-statistics based on the partial reductions in sums of squares from the standard ANOVA table for association testing of the fixed allelic effects in a GLM could be inadequate for testing the existence of variance components when higher order fixed allelic interactions are present. In addition, we point out that the GMA model can reduce the confounding between allelic effects and allelic interactions, at least when the inheritance of alleles is independent. As a result, the GMA model could be more beneficial than GLM for detecting allelic interactions. Finally, a simulation example is presented to show the difference between GLM and GMA models in partitioning variance components. The performance in model selection between using GLM and GMA models is also examined.

2. Models and Results

Consider a random sample of size N from a study population. Let y_i, i = 1, … , N, be their observed phenotypic values for a quantitative trait Y, and g_i, i = 1, … , N, be their observed unphased genotypes at certain genetic marker loci. The quantitative trait Y is assumed to be affected by both genetic and environmental effects. Let G denote the true (unobservable) genotypic value, which could be affected by many genetic factors. If we ignore the genetic by environmental interactions, the relationship between the quantitative trait and marker genotypes can be modeled via a GLM

\begin{array}{l} y_{i} = β z_{i} + E (G | g_{i}) + ϵ_{i}, i = 1, \dots, N, & (1) \end{array}

where z_i is a vector of the adjusted environmental covariates with fixed effects β, E(G|g_i) represents the expected genotypic value of the i-th individual given his/her observed marker genotypes g_i, and ϵ_i is a model residual error contributed by other environmental and genetic factors that cannot be captured by the covariates z_i and marker genotypes g_i. We also assume that ϵ_i, i = 1, … , N, are independent and identically distributed (i.i.d) with E(ϵ_i) = 0 and V(ϵ_i) = V_ϵ. Besides, {ϵ_i, i = 1, … , N} are independent of {g_i, i = 1, … , N}. Then, the total phenotypic variance V_Y = V(E(G|g)) + E(V(G|g)) = V(E(G|g)) + V_ϵ. In the rest of the paper, we focus on comparing the GLM and GMA modeling on the expected genotypic values E(G|g) and assessing the genetic variance components contributed by the allelic effects and allelic interactions at the marker loci to the expected genotypic variance V(E(G|g)).

2.1. One-Locus Models

Consider a single marker locus with multiple alleles A₁, …, A_m (m ≥ 2). Let p_j, j = 1, … , m, be the allele frequencies ( $\sum_{j = 1}^{m} p_{j} = 1$ ), and p_jk = P(A_jA_k), j, k = 1, … , m (j ≤ k), be the genotype frequencies ( $\sum_{j \leq k} p_{j k} = 1$ ), in a study population. For a random sample of size N from the study population, suppose that there are n_jk individuals carrying genotypes A_jA_k for j ≤ k, j, k = 1, … , m ( $N = \sum_{j = 1}^{m} \sum_{k = j}^{m} n_{j k}$ ). For notation simplicity, we also let p_kj = p_jk and n_kj = n_jk for k > j, j, k = 1, … , m. Then, $p_{j} = p_{j j} + \sum_{k \neq j} p_{j k} / 2$ , for j = 1, … , m. Let $n_{j \cdot} = 2 n_{j j} + \sum_{k \neq j} n_{j k}$ , which is the total number of alleles A_j carried by the sampled individuals (j = 1, … , m). Assume that the observed marker genotypes g_i, i = 1, … , N, are i.i.d. and follow a multinomial distribution. Then the MLE of the genotype frequencies are given by ${\hat{p}}_{j k} = n_{j k} / N$ , and the MLE of the allele frequencies are given by ${\hat{p}}_{j} = n_{j \cdot} / (2 N) = {\hat{p}}_{j j} + \sum_{k \neq j} {\hat{p}}_{j k} / 2$ , for j = 1, … , m. Also, let y_jki,i = 1, … , n_jk be the observed phenotypic values of the individuals carrying genotypes A_jA_k. We define the observed genotypic group means $ȳ_{j k \cdot} = ȳ_{k j \cdot} = \sum_{i = 1}^{n_{j k}} y_{j k i} / n_{j k}$ for j, k = 1, … , m, the allele averaged means $ȳ_{j \cdot \cdot} = ȳ_{\cdot j \cdot} = (2 n_{j j} ȳ_{j j \cdot} + \sum_{k \neq j} n_{j k} ȳ_{j k \cdot}) / n_{j \cdot}$ for j = 1, … , m, and the grand mean $ȳ_{\cdot \cdot \cdot} = \sum_{j = 1}^{m} \sum_{k = j}^{m} n_{j k} ȳ_{j k \cdot} / N$ . 
In addition, we define the allele weighted means $ȳ_{j \cdot}^{*} = ȳ_{\cdot j}^{*} = \sum_{k = 1}^{m} {\hat{p}}_{k} ȳ_{j k \cdot}$ for j = 1, … , m, and the weighted overall mean $ȳ_{\cdot \cdot}^{*} = \sum_{j = 1}^{m} \sum_{k = 1}^{m} {\hat{p}}_{j} {\hat{p}}_{k} ȳ_{j k \cdot} = \sum_{j = 1}^{m} {\hat{p}}_{j}^{2} ȳ_{j j \cdot} + \sum_{j = 1}^{m - 1} \sum_{k = j + 1}^{m} 2 {\hat{p}}_{j} {\hat{p}}_{k} ȳ_{j k \cdot}$ . Note that in a classical two-way factorial ANOVA model, each individual receives one and only one level assignment from each of the two factors. At a marker locus, however, an individual can carry a homozygous genotype “A_jA_j” (j = 1, … , m) whose two A_j alleles are indistinguishable. In general, the allele weighted means $ȳ_{j \cdot}^{*}$ may not be the same as the allele averaged means ȳ_j··, and the weighted overall mean $ȳ_{\cdot \cdot}^{*}$ may also differ from the grand mean ȳ_···.
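As a minimal sketch of the summary statistics defined above, the following Python snippet computes the allele frequency MLE, the allele averaged means, the allele weighted means, the grand mean, and the weighted overall mean from genotype counts and genotypic group means. The function name `allele_summaries`, the 0-based allele labels, and the toy counts and means are illustrative assumptions and are not taken from the paper.

```python
def allele_summaries(counts, means, m):
    """Summaries for one locus with m alleles labelled 0..m-1 (hypothetical helper).

    counts[(j, k)] and means[(j, k)] hold n_jk and the group mean of the
    unphased genotype A_j A_k, keyed with j <= k; all groups are non-empty.
    """
    key = lambda j, k: (min(j, k), max(j, k))
    n = lambda j, k: counts[key(j, k)]
    ybar = lambda j, k: means[key(j, k)]
    N = sum(counts.values())

    # n_j. = 2 n_jj + sum_{k != j} n_jk: copies of allele A_j in the sample
    n_dot = [2 * n(j, j) + sum(n(j, k) for k in range(m) if k != j)
             for j in range(m)]
    p_hat = [n_dot[j] / (2 * N) for j in range(m)]

    # Allele averaged means: weights are the observed genotype counts
    averaged = [(2 * n(j, j) * ybar(j, j)
                 + sum(n(j, k) * ybar(j, k) for k in range(m) if k != j)) / n_dot[j]
                for j in range(m)]
    # Allele weighted means: weights are the estimated allele frequencies
    weighted = [sum(p_hat[k] * ybar(j, k) for k in range(m)) for j in range(m)]

    grand = sum(counts[g] * means[g] for g in counts) / N
    overall = sum(p_hat[j] * p_hat[k] * ybar(j, k)
                  for j in range(m) for k in range(m))
    return p_hat, averaged, weighted, grand, overall


# Toy biallelic sample that departs from Hardy-Weinberg proportions
counts = {(0, 0): 30, (0, 1): 40, (1, 1): 30}
means = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}
p_hat, averaged, weighted, grand, overall = allele_summaries(counts, means, 2)
```

In this toy sample the allele weighted mean of the first allele (1.5) differs from its allele averaged mean (1.4), and the weighted overall mean (2.25) differs from the grand mean (2.3), because the genotype counts are not in Hardy-Weinberg proportions.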

To assess the allelic effects on the expected genotypic values (or phenotypic values), let us first briefly review Fisher's one-locus ANOVA model (see Kempthorne, 1957; Weir and Cockerham, 1977). Since the marker genotypes are unphased, we usually assume that the paternal and maternal gametes share the same set of alleles, have the same allele frequencies, and contribute the same allelic effects at the marker locus. The fact that each allele contributes the same genetic effect regardless of its parental origin implies that the expected genotypic values G_jk = E(G|g = A_jA_k) should satisfy the symmetric property: G_jk = G_kj, for j ≠ k. So there are in total m(m + 1)/2 possible distinct expected genotypic values G_jk, j, k = 1, …, m. By treating the paternal and maternal gametes as two independent risk factors, the traditional Fisher's one-locus ANOVA model for the expected genotypic values G_jk can be written as

\begin{array}{l} G_{j k} = μ + α_{j} + α_{k} + δ_{j k}, j, k = 1, \dots, m, & (2) \end{array}

where α_j is referred to as the average additive effect of the paternal or maternal allele A_j (j = 1, …, m), and δ_jk as the average allelic interaction between two parental alleles A_j and A_k (j, k = 1, …, m). It is known that not all the parameters in the above model are estimable due to an over-parameterization of the model on the expected genotypic values. Typically, the following constraints can be added on the model parameters (see Kempthorne, 1957).

\begin{array}{l} \sum_{j = 1}^{m} p_{j} α_{j} = 0, \quad \sum_{j = 1}^{m} p_{j} δ_{j k} = 0 \text{ for } k = 1, \dots, m . & (3) \end{array}

From the symmetric property of G_jk, we also assume that δ_jk = δ_kj, for j ≠ k. With these constraints, if we further incorporate model (2) into GLM (1) and ignore the adjusted covariates, the LSE of parameters are given by $\hat{μ} = ȳ_{\cdot \cdot}^{*}$ , ${\hat{α}}_{j} = ȳ_{j \cdot}^{*} - ȳ_{\cdot \cdot}^{*}$ for j = 1, … , m, and ${\hat{δ}}_{j k} = ȳ_{j k \cdot} - ȳ_{j \cdot}^{*} - ȳ_{k \cdot}^{*} + ȳ_{\cdot \cdot}^{*}$ for j, k = 1, … , m. It has been known that, under Hardy-Weinberg equilibrium (HWE), the above model (2) can also provide an orthogonal partition of the expected genotypic variance as V(E(G|g)) = V_A + V_D, where $V_{A} = 2 \sum_{j} p_{j} α_{j}^{2}$ and $V_{D} = \sum_{j, k} p_{j} p_{k} δ_{j k}^{2}$ are the so-called additive and dominance variance components. Weir and Cockerham (1977) also explored model (2) on partitioning the expected genotypic variance in HWD. In practice, as pointed out in Wang (2014), the symmetric property of δ_jk's and the irregular constraints (3) could make it difficult to fit model (2) using standard statistical software especially when the adjusted covariates are involved. Besides, the random variables that are responsible for the additive and dominance variance components are not explicitly defined in model (2).
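The LSE formulas above can be sketched in a few lines of Python. The snippet below computes $\hat{μ}$, ${\hat{α}}_{j}$ and ${\hat{δ}}_{j k}$ from allele frequency estimates and genotypic group means, checks that they satisfy the constraints (3) with p_j replaced by ${\hat{p}}_{j}$, and evaluates the plug-in versions of V_A and V_D that apply under HWE. The helper name `fisher_lse` and the toy inputs are assumptions made for illustration only.

```python
def fisher_lse(p_hat, means, m):
    """LSE of mu, alpha_j and delta_jk in model (2) under constraints (3).

    Hypothetical helper: means[(j, k)] is the genotypic group mean of the
    unphased genotype A_j A_k with j <= k, alleles labelled 0..m-1.
    """
    ybar = lambda j, k: means[(min(j, k), max(j, k))]
    # Allele weighted means and weighted overall mean
    weighted = [sum(p_hat[k] * ybar(j, k) for k in range(m)) for j in range(m)]
    mu = sum(p_hat[j] * weighted[j] for j in range(m))
    alpha = [weighted[j] - mu for j in range(m)]
    delta = [[ybar(j, k) - weighted[j] - weighted[k] + mu for k in range(m)]
             for j in range(m)]
    return mu, alpha, delta


p_hat = [0.5, 0.5]                                   # illustrative frequencies
means = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}      # illustrative group means
mu, alpha, delta = fisher_lse(p_hat, means, 2)

# Constraints (3): frequency-weighted sums of the effects vanish
alpha_sum = sum(p_hat[j] * alpha[j] for j in range(2))
delta_sums = [sum(p_hat[j] * delta[j][k] for j in range(2)) for k in range(2)]

# Plug-in additive and dominance variance components (valid under HWE)
V_A = 2 * sum(p_hat[j] * alpha[j] ** 2 for j in range(2))
V_D = sum(p_hat[j] * p_hat[k] * delta[j][k] ** 2
          for j in range(2) for k in range(2))
```

For these toy inputs the additive effects are -0.75 and 0.75, giving V_A = 1.125 and V_D = 0.0625.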

The expected genotypic values can also be modeled via a classical dummy-variable based GLM. As shown in Wang (2011), we can introduce the following indicator variables to describe the inheritance of the two parental alleles for each individual

\begin{array}{l} z_{1 j} = \begin{cases} 1, & \text{the inherited paternal allele is } A_{j} \\ 0, & \text{the inherited paternal allele is not } A_{j} \end{cases}, \\ z_{2 j} = \begin{cases} 1, & \text{the inherited maternal allele is } A_{j} \\ 0, & \text{the inherited maternal allele is not } A_{j} \end{cases} \end{array}

for j = 1, …, m. Though z_1j and z_2j cannot be defined on unphased heterozygous genotypes, we can define the following genotype coding variables for unphased genotypes

w_{j} (g) = z_{1 j} + z_{2 j} = \begin{cases} 2, & \text{if } g = A_{j} A_{j} \\ 1, & \text{if } g = A_{j} A_{j}^{c} \\ 0, & \text{if } g = A_{j}^{c} A_{j}^{c} \end{cases}

for j = 1, …, m, and

\begin{array}{l} v_{j j} (g) = z_{1 j} z_{2 j} = \begin{cases} 1, & \text{if } g = A_{j} A_{j} \\ 0, & \text{otherwise} \end{cases}, \\ v_{j k} (g) = z_{1 j} z_{2 k} + z_{1 k} z_{2 j} = \begin{cases} 1, & \text{if } g = A_{j} A_{k} \\ 0, & \text{otherwise} \end{cases} \end{array}

for j ≠ k, j, k = 1, …, m. Here, $A_{j}^{c}$ denotes any allele other than A_j. By choosing A_m as a reference allele, we can construct the following GLM

\begin{array}{l} E (G | g_{i}) = μ_{0} + \sum_{j = 1}^{m - 1} a_{j} w_{j} (g_{i}) + \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} d_{j k} v_{j k} (g_{i}), & (4) \end{array}

for i = 1, …, N, where a_j is usually referred to as the fixed additive allelic effect of the paternal or maternal allele A_j, and d_jk as the fixed allelic interaction between two parental alleles A_j and A_k, with respect to the reference allele A_m. In terms of the expected genotypic values, we can show that μ₀ = G_mm, a_j = G_jm − G_mm and d_jk = (G_jk − G_km) − (G_jm − G_mm), for j = 1, …, m − 1 and k = j, …, m − 1.

Model (4) provides a full re-parameterization of the m(m + 1)/2 expected genotypic values. Suppose that there are no empty genotype groups for the observed random sample; i.e., n_jk > 0 for all j, k = 1, … , m. When we incorporate model (4) into GLM (1) and ignore the adjusted covariates, the LSE of the expected genotypic values are given by Ĝ_jk = ȳ_jk·, for j, k = 1, … , m and j ≤ k. Therefore, the LSE of parameters in model (4) can be easily derived in terms of the observed genotypic group means as ${\hat{μ}}_{0} = ȳ_{m m \cdot}$ and â_j = ȳ_jm· − ȳ_mm· for j = 1, … , m − 1, and ${\hat{d}}_{j k} = ȳ_{j k \cdot} - ȳ_{j m \cdot} - ȳ_{k m \cdot} + ȳ_{m m \cdot}$ for j, k = 1, … , m − 1 and j ≤ k. Note that these LSE could be sensitive to phenotypic outliers, especially for small genotypic groups. For example, only a few individuals may carry genotype “A_jA_m” (j ≠ m) for a rare allele “A_j.” If the genotypic group mean ȳ_jm· of this small group is much larger than ȳ_mm·, the LSE â_j will be large. By choosing a common allele “A_m” as the reference, we could improve the accuracy of LSE. Through incorporating model (4) into a GLM, we could also perform hypothesis tests on its model parameters via contrasts. In analysis of genetic variance components, however, it is known that the w_j's are often correlated with the v_jk's as random variables over the random individuals even when the inheritance of paternal and maternal alleles is independent (Wang and Zeng, 2009). As a result, model (4) leads to a different partition of the expected genotypic variance V(E(G|g)) from the one defined in Fisher's ANOVA model (2).

To further dissect the confounding between the additive and dominance effects in model (4), we can make mean corrections on the indicator variables {z_1j} and {z_2j}, and introduce the following mean-corrected index variables (see Wang, 2014)

\begin{array}{l} x_{1 j} (g) = z_{1 j} (g) - E [z_{1 j} (g)] = \begin{cases} 1 - p_{j}, & \text{the paternal allele is } A_{j} \\ - p_{j}, & \text{the paternal allele is not } A_{j} \end{cases} \\ x_{2 j} (g) = z_{2 j} (g) - E [z_{2 j} (g)] = \begin{cases} 1 - p_{j}, & \text{the maternal allele is } A_{j} \\ - p_{j}, & \text{the maternal allele is not } A_{j} \end{cases} \end{array}

Then, we define the following modified genotype coding variables

\begin{array}{l} w_{j}^{*} (g) = x_{1 j} + x_{2 j} = \begin{cases} 2 (1 - p_{j}), & \text{if } g = A_{j} A_{j} \\ 1 - 2 p_{j}, & \text{if } g = A_{j} A_{j}^{c} \\ - 2 p_{j}, & \text{if } g = A_{j}^{c} A_{j}^{c} \end{cases}, \\ v_{j j}^{*} (g) = x_{1 j} x_{2 j} = \begin{cases} {(1 - p_{j})}^{2}, & \text{if } g = A_{j} A_{j} \\ - p_{j} (1 - p_{j}), & \text{if } g = A_{j} A_{j}^{c} \\ p_{j}^{2}, & \text{if } g = A_{j}^{c} A_{j}^{c} \end{cases} \end{array}

for j = 1, … , m, and

v_{j k}^{*} (g) = x_{1 j} x_{2 k} + x_{1 k} x_{2 j} = \begin{cases} (1 - p_{j}) (1 - p_{k}) + p_{j} p_{k}, & \text{if } g = A_{j} A_{k} \\ - 2 p_{k} (1 - p_{j}), & \text{if } g = A_{j} A_{j} \\ - 2 p_{j} (1 - p_{k}), & \text{if } g = A_{k} A_{k} \\ - p_{k} (1 - 2 p_{j}), & \text{if } g = A_{j} A_{j k}^{c} \\ - p_{j} (1 - 2 p_{k}), & \text{if } g = A_{k} A_{j k}^{c} \\ 2 p_{j} p_{k}, & \text{if } g = A_{j k}^{c} A_{j k}^{c} \end{cases}

for j ≠ k, j, k = 1, …, m. Here, $A_{j k}^{c}$ denotes an allele which is different from both A_j and A_k. Note that the modified genotype coding variables $w_{j}^{*} (g) = w_{j} (g) - 2 p_{j}$ , $v_{j j}^{*} (g) = v_{j j} (g) - p_{j} w_{j} (g) + p_{j}^{2}$ , and $v_{j k}^{*} (g) = v_{j k} (g) - p_{j} w_{k} (g) - p_{k} w_{j} (g) + 2 p_{j} p_{k}$ are well defined on unphased genotypes, although the mean-corrected index variables x_1j, x_2k cannot be defined on unphased heterozygous genotypes. By choosing A_m as a reference allele, we can construct the following GMA model (Wang, 2014)
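To make the point that these coding variables need no phase information concrete, here is a small Python sketch that evaluates $w_{j}^{*} (g)$ and $v_{j k}^{*} (g)$ directly from the identities above, using only the allele counts of an unphased genotype. The function name `gma_codes`, the 0-based allele labels, and the allele frequencies are hypothetical choices for illustration.

```python
def gma_codes(g, p, j, k):
    """Return (w_j*(g), v_jk*(g)) for an unphased genotype g = (a, b).

    Hypothetical helper: alleles are labelled 0..m-1 and p[t] is the
    frequency of allele t. Only allele counts are used, so no phase is needed.
    """
    def w(t):
        # w_t(g): number of copies of allele t carried by genotype g
        return (g[0] == t) + (g[1] == t)

    if j == k:
        v = 1 if w(j) == 2 else 0                      # v_jj(g)
        v_star = v - p[j] * w(j) + p[j] ** 2
    else:
        v = 1 if (w(j) == 1 and w(k) == 1) else 0      # v_jk(g)
        v_star = v - p[j] * w(k) - p[k] * w(j) + 2 * p[j] * p[k]
    return w(j) - 2 * p[j], v_star


p = [0.2, 0.3, 0.5]                          # illustrative allele frequencies
w_het, v_het = gma_codes((1, 0), p, 0, 1)    # heterozygote; allele order is irrelevant
_, v_hom = gma_codes((0, 0), p, 0, 1)        # homozygote for the first allele
_, v_other = gma_codes((2, 2), p, 0, 1)      # carries neither of the two alleles
_, v_jj_het = gma_codes((0, 2), p, 0, 0)     # v_jj* for a carrier of one copy
```

With these frequencies the values agree with the piecewise definitions: for example, the heterozygote gives (1 − p_j)(1 − p_k) + p_j p_k = 0.62 and the genotype carrying neither allele gives 2 p_j p_k = 0.12.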

\begin{array}{l} E (G | g_{i}) = μ^{*} + \sum_{j = 1}^{m - 1} α_{j}^{*} w_{j}^{*} (g_{i}) + \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} δ_{j k}^{*} v_{j k}^{*} (g_{i}), & (5) \end{array}

for i = 1, …, N. Still, in model (5), we refer to the parameters $α_{j}^{*}$ as the average additive allelic effects and $δ_{j k}^{*}$ (j ≤ k) as the average allelic interactions, with respect to the reference allele A_m. Both the GMA model (5) and the GLM (4) can provide a full parameterization of the expected genotypic values G_jk. Compared to model (4), one major advantage of model (5) is that it can retain the same partition of the genetic variance components as the one from Fisher's ANOVA model (2). In fact, under the constraints (3) plus the symmetric property of δ_jk, model (2) can be re-written as

\begin{array}{l} E (G | g_{i}) = μ + \sum_{j = 1}^{m} α_{j} w_{j}^{*} (g_{i}) + \sum_{j = 1}^{m} \sum_{k = j}^{m} δ_{j k} v_{j k}^{*} (g_{i}), & (6) \end{array}

for i = 1, …, N. Model (5) is a simplified version of model (6) by further replacing the redundant variables $w_{m}^{*}$ , $v_{j m}^{*}$ and $v_{m m}^{*}$ by $w_{m}^{*} = - \sum_{j = 1}^{m - 1} w_{j}^{*}$ , $v_{j m}^{*} = - v_{j j}^{*} - \sum_{k = 1}^{m - 1} v_{j k}^{*}$ for j = 1, … , m − 1, and $v_{m m}^{*} = \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} v_{j k}^{*}$ . We can see that both models (5) and (2) share exactly the same additive and dominance variance components. They become equivalent when we take μ* = μ, $α_{j}^{*} = α_{j} - α_{m}$ , and $δ_{j k}^{*} = δ_{j k} - δ_{j m} - δ_{k m} + δ_{m m}$ , for j, k = 1, … , m − 1 and j ≤ k. Note that model (5) does not contain redundant parameters. Therefore, it does not require constraints on its model parameters. Besides, the random variables that are responsible for the additive and dominance variance components are explicitly defined in model (5). In practice, similar to enforcing the constraints (3) on model (2), we can create the variables $w_{j}^{*}$ and $v_{j k}^{*}$ by replacing the allele frequencies p_j's by their MLE ${\hat{p}}_{j}$ 's. Then, by incorporating model (5) into GLM (1), we can treat these modified genotype coding variables as regular fixed covariates and fit the model using the ordinary LS approach. The hypothesis of H₀:V_A = 0 for existence of the additive variance component can also be tested via testing $H_{0} : α_{j}^{*} = 0, j = 1, \dots, m - 1$ for the average additive allelic effects.

The GLM (4) and GMA model (5) can be transformed easily from one to the other. From the relationship between their genotype coding variables, we can show that

\begin{array}{l} μ^{*} = μ_{0} + 2 \sum_{j = 1}^{m - 1} p_{j} a_{j} + \sum_{j = 1}^{m - 1} \sum_{k = 1}^{m - 1} p_{j} p_{k} d_{j k} \\ α_{j}^{*} = a_{j} + \sum_{k = 1}^{m - 1} p_{k} d_{j k}, \quad j = 1, \dots, m - 1 \end{array}

and $δ_{j k}^{*} = d_{j k}$ for j ≤ k, j, k = 1, … , m − 1. Here, for notational simplicity, we define d_kj = d_jk, for j < k. To fit model (5), instead of solving its normal equations directly, we can derive the LSE of its model parameters from the LSE of model (4) as follows.

\begin{array}{l} {\hat{μ}}^{*} = \sum_{j = 1}^{m} {\hat{p}}_{j}^{2} {\bar{y}}_{j j \cdot} + \sum_{j = 1}^{m - 1} \sum_{k = j + 1}^{m} 2 {\hat{p}}_{j} {\hat{p}}_{k} {\bar{y}}_{j k \cdot} = {\bar{y}}_{\cdot \cdot}^{*} \\ {\hat{α}}_{j}^{*} = {\bar{y}}_{j \cdot}^{*} - {\bar{y}}_{m \cdot}^{*}, \quad j = 1, \dots, m - 1 \end{array}

and ${\hat{δ}}_{j k}^{*} = {\hat{d}}_{j k} = ȳ_{j k \cdot} - ȳ_{j m \cdot} - ȳ_{k m \cdot} + ȳ_{m m \cdot}$ for j ≤ k, j, k = 1, … , m − 1. Note that the LSE of the average additive allelic effects are calculated from allele weighted means, which could be more robust to phenotypic outliers than the LSE of fixed additive allelic effects. It is also interesting to see that both the fixed additive allelic effects a_j's and the fixed allelic interactions d_jk's may affect the average additive allelic effects $α_{j}^{*}$ 's, though the fixed allelic interactions remain the same as the average allelic interactions. In general, the hypothesis test of H₀ : a_j = 0, j = 1, … , m − 1, for the fixed additive effects in a GLM (4) is not equivalent to the hypothesis test of $H_{0} : α_{j}^{*} = 0, j = 1, \dots, m - 1$ , for the average additive effects in an equivalent GMA model (5). Therefore, the standard F-statistics for testing the fixed additive effects in model (4) could be inadequate for testing the existence of the additive variance component V_A when significant fixed (or average) allelic interactions are present.
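The transformation from the LSE of model (4) to the LSE of model (5) can be checked numerically. The sketch below, for a hypothetical three-allele locus with the last allele as the reference, first computes ${\hat{μ}}_{0}$, â_j and ${\hat{d}}_{j k}$ from genotypic group means, then applies the relations above, and finally compares the result with the direct formulas based on allele weighted means. The function names, allele frequencies, and group means are illustrative assumptions.

```python
def glm_lse(means, m):
    """LSE of model (4) from genotypic group means; the last allele is the reference."""
    ybar = lambda j, k: means[(min(j, k), max(j, k))]
    ref = m - 1
    mu0 = ybar(ref, ref)
    a = [ybar(j, ref) - mu0 for j in range(ref)]
    d = [[ybar(j, k) - ybar(j, ref) - ybar(k, ref) + mu0 for k in range(ref)]
         for j in range(ref)]
    return mu0, a, d


def glm_to_gma(mu0, a, d, p):
    """Map GLM parameters (mu0, a, d) to GMA parameters (mu*, alpha*, delta*)."""
    r = len(a)
    mu_star = (mu0 + 2 * sum(p[j] * a[j] for j in range(r))
               + sum(p[j] * p[k] * d[j][k] for j in range(r) for k in range(r)))
    alpha_star = [a[j] + sum(p[k] * d[j][k] for k in range(r)) for j in range(r)]
    return mu_star, alpha_star, d   # delta* coincides with d


p_hat = [0.2, 0.3, 0.5]                       # illustrative allele frequencies
means = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 1.5,
         (1, 1): 3.0, (1, 2): 2.5, (2, 2): 4.0}
mu0, a, d = glm_lse(means, 3)
mu_star, alpha_star, delta_star = glm_to_gma(mu0, a, d, p_hat)

# Direct formulas from allele weighted means, for comparison
ybar = lambda j, k: means[(min(j, k), max(j, k))]
weighted = [sum(p_hat[k] * ybar(j, k) for k in range(3)) for j in range(3)]
overall = sum(p_hat[j] * weighted[j] for j in range(3))
```

In this toy example the fixed additive effects are â = (−2.5, −1.5) while the average additive effects are α̂* = (−1.5, −0.5); the two differ because every fixed interaction ${\hat{d}}_{j k}$ equals 2 rather than 0.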

In ANOVA, we treat {x_1j, j = 1, … , m − 1} and {x_2k, k = 1, … , m − 1} as random variables over the sampled individuals. Model (5) provides an approximation to the expected genotypic values E(G|g) by a linear combination of the random variables x_1j, j = 1, … , m − 1 and x_2k, k = 1, … , m − 1, plus their cross-product terms for allelic interactions. To model the genotypic distributions, let $g_{i}^{*} = (z_{1 j} (g_{i}) z_{2 k} (g_{i}), j, k = 1, \dots, m, j \leq k)$ be a vector that describes the unphased genotypic categories for i = 1, … , N. As usual, we assume that the marker genotypes $g_{1}^{*}, \dots, g_{N}^{*}$ are i.i.d. from a multinomial distribution of Mult(1, {p_jk, j, k = 1, … , m, j ≤ k}). Under this assumption, we can show that the vectors g_1i = (z₁₁(g_i), … , z_1m(g_i)), i = 1, … , N, of paternal allele indicator variables are i.i.d. from Mult(1, {p₁, … , p_m}); and the vectors g_2i = (z₂₁(g_i), … , z_2m(g_i)), i = 1, … , N, of maternal allele indicator variables are also i.i.d. from Mult(1, {p₁, … , p_m}). When the inheritance of paternal and maternal alleles are independent, the study population is under HWE and the mean corrected index variables {x_1j(g_i), j = 1, … , m} are independent of {x_2j(g_i), j = 1, … , m} for any randomly sampled individual i. In HWD, however, the mean corrected index variables {x_1j, j = 1, … , m} could be correlated with {x_2j, j = 1, … , m}. Let us define

\begin{array}{l} D_{j j} = \mathrm{Cov} (z_{1 j}, z_{2 j}) = E (x_{1 j} x_{2 j}) = E (v_{j j}^{*}) = p_{j j} - p_{j}^{2} \\ D_{j k} = E (x_{1 j} x_{2 k} + x_{1 k} x_{2 j}) / 2 = E (v_{j k}^{*}) / 2 = p_{j k} / 2 - p_{j} p_{k} \end{array}

for j ≠ k, j, k = 1, … , m. Then, {D_jk, j, k = 1, … , m} measure the departures from HWE, which satisfy $\sum_{j = 1}^{m} D_{j k} = \sum_{k = 1}^{m} D_{j k} = 0$ . From the observed genotypes, we have MLE ${\hat{D}}_{j j} = n_{j j} / N - {\hat{p}}_{j}^{2}$ for j = 1, … , m, and ${\hat{D}}_{j k} = n_{j k} / (2 N) - {\hat{p}}_{j} {\hat{p}}_{k}$ for j, k = 1, … , m and j ≠ k.
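A minimal sketch of the disequilibrium estimates, assuming genotype counts stored in a dictionary keyed by unphased genotype: the snippet below computes ${\hat{p}}_{j}$ and ${\hat{D}}_{j k}$ and checks numerically that each row of the ${\hat{D}}$ matrix sums to zero. The helper name `hwd_coefficients` and the counts are hypothetical.

```python
def hwd_coefficients(counts, m):
    """MLE of allele frequencies and of the HWD coefficients D_jk.

    Hypothetical helper: counts[(j, k)] is n_jk for the unphased genotype
    A_j A_k with j <= k, alleles labelled 0..m-1.
    """
    N = sum(counts.values())
    n = lambda j, k: counts.get((min(j, k), max(j, k)), 0)
    p_hat = [(2 * n(j, j) + sum(n(j, k) for k in range(m) if k != j)) / (2 * N)
             for j in range(m)]
    # D_jj = n_jj / N - p_j^2 ; D_jk = n_jk / (2N) - p_j p_k for j != k
    D_hat = [[(n(j, k) / N if j == k else n(j, k) / (2 * N)) - p_hat[j] * p_hat[k]
              for k in range(m)] for j in range(m)]
    return p_hat, D_hat


# Illustrative three-allele sample of N = 100 individuals
counts = {(0, 0): 10, (0, 1): 20, (0, 2): 10,
          (1, 1): 25, (1, 2): 15, (2, 2): 20}
p_hat, D_hat = hwd_coefficients(counts, 3)
row_sums = [sum(row) for row in D_hat]
```

Here ${\hat{p}}$ = (0.25, 0.425, 0.325), and the first row of ${\hat{D}}$ is (0.0375, −0.00625, −0.03125), which sums to zero as required.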

It is more convenient to use GMA model (5) rather than GLM (4) for estimation of genetic variance components. In model (5), let $A = \sum_{j = 1}^{m - 1} α_{j}^{*} w_{j}^{*} (g)$ and $D = \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} δ_{j k}^{*} v_{j k}^{*} (g)$ represent the additive and dominance components, respectively. In general, we can partition the expected genotypic variance as V(E(G|g)) = V_A + V_D + 2Cov(A, D). In Wang (2014), we have derived formulas for estimating the variance components V_A, V_D and covariance component Cov(A, D) based on the parameters in model (5). By plugging in LSE of the parameters, we obtain estimators of the variance components V_A, V_D and the covariance component Cov(A, D) as shown in Appendix A in Supplementary Material. It should be pointed out that these estimators of the variance and covariance components are different from the traditional ANOVA estimators when the marker genotypes are in HWD. Unlike the ANOVA estimators of variance components which could be negative when data are unbalanced, our estimators of the variance components are guaranteed to be non-negative. Similar to the ANOVA estimators, when the model residuals are normally distributed, these estimators become MLE of the variance and covariance components and likely possess the asymptotic normality property (see Searle, 1995). Meanwhile, with variations from both the genotypic group means and the MLE estimates of allele frequencies, these estimators are only asymptotically unbiased, while the original ANOVA estimators are unbiased.

Under HWE, we would expect an orthogonal partition of the expected genotypic variance as V(E(G|g)) = V_A + V_D. By taking ${\hat{D}}_{j k} = 0$ for j, k = 1, … , m in Appendix A in Supplementary Material, we obtain the following estimators

\begin{array}{l} {\hat{V}}_{A} = 2 \sum_{j = 1}^{m} {\hat{p}}_{j} {({\bar{y}}_{j \cdot}^{*} - {\bar{y}}_{\cdot \cdot}^{*})}^{2}, \\ {\hat{V}}_{D} = \sum_{j = 1}^{m} \sum_{k = 1}^{m} {\hat{p}}_{j} {\hat{p}}_{k} {({\bar{y}}_{j k \cdot} - {\bar{y}}_{\cdot \cdot}^{*})}^{2} - 2 \sum_{j = 1}^{m} {\hat{p}}_{j} {({\bar{y}}_{j \cdot}^{*} - {\bar{y}}_{\cdot \cdot}^{*})}^{2} \end{array}

If we further incorporate model (5) into GLM (1) and ignore the adjusted covariates, the estimator of residual variance is given by ${\hat{V}}_{ϵ} = \sum_{j = 1}^{m} \sum_{k = j}^{m} \sum_{i = 1}^{n_{j k}} {(y_{j k i} - ȳ_{j k \cdot})}^{2} / N$ . The estimator of total phenotypic variance is given by ${\hat{V}}_{Y} = \sum_{j = 1}^{m} \sum_{k = j}^{m} \sum_{i = 1}^{n_{j k}} {(y_{j k i} - ȳ_{\cdot \cdot \cdot})}^{2} / N$ . We can show that ${\hat{V}}_{Y} = {\hat{V}}_{A} + {\hat{V}}_{D} + {\hat{V}}_{ϵ}$ . Note that when ${\hat{D}}_{j k} = 0$ for j, k = 1, … , m, the allele weighted means and the allele averaged means become the same; i.e., $ȳ_{j \cdot}^{*} = ȳ_{j \cdot \cdot}$ for j = 1, … , m. Besides, the weighted overall mean $ȳ_{\cdot \cdot}^{*}$ is the same as the grand mean ȳ_···. However, due to sampling variation, under HWE we could still have some ${\hat{D}}_{j k} \neq 0$ , j, k = 1, … , m, which may lead to a slight deviation from the orthogonal partition of the expected genotypic variance.
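As a hedged illustration of the HWE estimators, the following Python sketch evaluates ${\hat{V}}_{A}$ and ${\hat{V}}_{D}$ for a toy biallelic sample whose genotype counts are in exact Hardy-Weinberg proportions (so that every ${\hat{D}}_{j k} = 0$), and compares their sum with the between-genotype variance computed from the counts and the grand mean. The function name `variance_components_hwe` and all numbers are assumptions made for the example.

```python
def variance_components_hwe(p_hat, means, m):
    """Plug-in estimators of V_A and V_D for one locus under HWE.

    Hypothetical helper: means[(j, k)] is the genotypic group mean of the
    unphased genotype A_j A_k with j <= k, alleles labelled 0..m-1.
    """
    ybar = lambda j, k: means[(min(j, k), max(j, k))]
    weighted = [sum(p_hat[k] * ybar(j, k) for k in range(m)) for j in range(m)]
    overall = sum(p_hat[j] * weighted[j] for j in range(m))
    V_A = 2 * sum(p_hat[j] * (weighted[j] - overall) ** 2 for j in range(m))
    V_G = sum(p_hat[j] * p_hat[k] * (ybar(j, k) - overall) ** 2
              for j in range(m) for k in range(m))
    return V_A, V_G - V_A


# Counts in exact Hardy-Weinberg proportions for p = (0.4, 0.6), N = 100
counts = {(0, 0): 16, (0, 1): 48, (1, 1): 36}
means = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}
N = sum(counts.values())
p_hat = [(2 * counts[(0, 0)] + counts[(0, 1)]) / (2 * N),
         (2 * counts[(1, 1)] + counts[(0, 1)]) / (2 * N)]

V_A, V_D = variance_components_hwe(p_hat, means, 2)

# Between-genotype variance from the observed counts and the grand mean
grand = sum(counts[g] * means[g] for g in counts) / N
between = sum(counts[g] * (means[g] - grand) ** 2 for g in counts) / N
```

For this sample ${\hat{V}}_{A}$ = 1.2288 and ${\hat{V}}_{D}$ = 0.0576, and their sum equals the between-genotype variance 1.2864, so adding the within-group term ${\hat{V}}_{ϵ}$ recovers ${\hat{V}}_{Y}$.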

When ${\hat{D}}_{j k} = 0$ for j, k = 1, … , m, another nice feature of the GMA model (5) is that the LSE $\{{\hat{α}}_{j}^{*}, j = 1, \dots, m - 1\}$ of its main effects remain the same in a reduced GMA model that ignores the allelic interactions. Let us represent the full GMA model (5) in a matrix form with the design matrix $X = (1_{N}, X_{α^{*}}, X_{δ^{*}})$ , where 1_N is an N by 1 vector with all its elements being 1, $X_{α^{*}} = {(w_{1}^{*} (g_{i}), \dots, w_{m - 1}^{*} (g_{i}); i = 1, \dots, N)}_{N \times (m - 1)}$ and $X_{δ^{*}} = {(v_{11}^{*} (g_{i}), \dots, v_{1, m - 1}^{*} (g_{i}), \dots, v_{m - 1, m - 1}^{*} (g_{i}); i = 1, \dots, N)}_{N \times m (m - 1) / 2}$ , which correspond to the main effects $α^{*} = (α_{1}^{*}, \dots, α_{m - 1}^{*})$ and allelic interactions $δ^{*} = (δ_{11}^{*}, \dots, δ_{1, m - 1}^{*}, \dots, δ_{m - 1, m - 1}^{*})$ , respectively. We can show that $X^{'} X = \mathrm{diag} (N, X_{α^{*}}^{'} X_{α^{*}}, X_{δ^{*}}^{'} X_{δ^{*}})$ , which is a block diagonal matrix when ${\hat{D}}_{j k} = 0$ , for j, k = 1, … , m (see proof in Appendix B in Supplementary Material). Therefore, the LSE are ${\hat{μ}}^{*} = ȳ_{\cdot \cdot \cdot}$ and ${\hat{α}}^{*} = {(X_{α^{*}}^{'} X_{α^{*}})}^{- 1} X_{α^{*}}^{'} Y$ , which do not depend on $X_{δ^{*}}$ . In other words, ignoring the average allelic interactions in model (5) does not affect the LSE of the intercept and the average additive effects in this case. The GLM (4) does not have such a property. In a reduced GLM (4) without the fixed allelic interactions, the LSE of its additive effects are $â_{j} = {\hat{α}}_{j}^{*} = ȳ_{j \cdot}^{*} - ȳ_{m \cdot}^{*}$ for j = 1, … , m − 1, while in a full GLM (4) the LSE become â_j = ȳ_jm· − ȳ_mm·, for j = 1, … , m − 1.
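The block diagonal structure of X'X can be illustrated for the simplest case m = 2, where the design matrix has only three columns: the intercept, $w_{1}^{*}$ and $v_{11}^{*}$. The sketch below builds these columns for a toy sample in exact Hardy-Weinberg proportions (an assumption that makes every ${\hat{D}}_{j k} = 0$) and computes the three off-diagonal cross-products. The variable names and counts are illustrative only.

```python
# Toy biallelic sample in exact Hardy-Weinberg proportions (p = 0.4, N = 100)
counts = {(0, 0): 16, (0, 1): 48, (1, 1): 36}
N = sum(counts.values())
p = (2 * counts[(0, 0)] + counts[(0, 1)]) / (2 * N)   # MLE of the first allele frequency


def codes(g):
    """Return (w_1*(g), v_11*(g)) for an unphased genotype g, with p estimated above."""
    copies = g.count(0)                       # copies of the first allele
    w_star = copies - 2 * p
    v_star = (1 if copies == 2 else 0) - p * copies + p ** 2
    return w_star, v_star


# One design row per sampled individual
rows = [codes(g) for g, n in counts.items() for _ in range(n)]

# Off-diagonal entries of X'X for X = (1_N, w_1*, v_11*)
sum_w = sum(w for w, _ in rows)         # 1_N' w*
sum_v = sum(v for _, v in rows)         # 1_N' v*
sum_wv = sum(w * v for w, v in rows)    # w*' v*
```

All three cross-products vanish up to floating-point rounding, so the intercept, additive, and interaction columns are mutually orthogonal in this sample.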

We can also estimate the genetic variance components with an adjustment for other covariates. First, by incorporating the GMA model (5) into GLM (1), we can fit GLM (1) using the ordinary LS approach. Next, based on the fitted model, we can calculate the fitted additive and dominance components $Â_{i} = \sum_{j = 1}^{m - 1} {\hat{α}}_{j}^{*} w_{j}^{*} (g_{i})$ and ${\hat{D}}_{i} = \sum_{j = 1}^{m - 1} \sum_{k = j}^{m - 1} {\hat{δ}}_{j k}^{*} v_{j k}^{*} (g_{i})$ , for each individual i = 1, … , N. Then, we can estimate V_A and V_D as the sample variances of {Â_i, i = 1, … , N} and $\{{\hat{D}}_{i}, i = 1, \dots, N\}$ , respectively. Meanwhile, the covariance component Cov(A, D) can be estimated as the sample covariance between {Â_i, i = 1, … , N} and $\{{\hat{D}}_{i}, i = 1, \dots, N\}$ .

2.2. Two-Locus Models

We consider an extension of the previous one-locus models to two loci. Assume that marker 1 has alleles A₁₁, …, A_{1m₁} (m₁ ≥ 2) with p₁₁, … , p_{1m₁} being the allele frequencies, and marker 2 has alleles A₂₁, …, A_{2m₂} (m₂ ≥ 2) with p₂₁, … , p_{2m₂} being the allele frequencies. Let p_jkrs = P(A_1jA_1k, A_2rA_2s) denote the joint genotype frequencies at the two marker loci in a study population. For a random sample of size N from the study population, let n_jkrs be the number of individuals who carry unphased genotypes A_1jA_1k (j ≤ k) at locus 1, and A_2rA_2s (r ≤ s) at locus 2, for j, k = 1, … , m₁ and r, s = 1, … , m₂ ( $N = \sum_{j = 1}^{m_{1}} \sum_{k = j}^{m_{1}} \sum_{r = 1}^{m_{2}} \sum_{s = r}^{m_{2}} n_{j k r s}$ ). Then, the MLE of genotype frequencies are ${\hat{p}}_{j k r s} = n_{j k r s} / N$ . Let y_{jkrs, i},i = 1, … , n_jkrs, be the observed phenotypic values of individuals carrying the joint genotypes (A_1jA_1k, A_2rA_2s). We define the observed genotypic group means $ȳ_{j k r s \cdot} = \sum_{i = 1}^{n_{j k r s}} y_{j k r s, i} / n_{j k r s}$ , for j, k = 1, … , m₁ and r, s = 1, …, m₂. Without distinguishing the origin of parental alleles, we assume that n_kjrs = n_jkrs for k > j, j, k = 1, … , m₁, and n_jksr = n_jkrs for r > s, r, s = 1, … , m₂. Let $n_{j k \cdot \cdot} = \sum_{r = 1}^{m_{2}} \sum_{s = r}^{m_{2}} n_{j k r s}$ for j, k = 1, … , m₁, and $n_{\cdot \cdot r s} = \sum_{j = 1}^{m_{1}} \sum_{k = j}^{m_{1}} n_{j k r s}$ for r, s = 1, … , m₂. We have the MLE of allele frequencies ${\hat{p}}_{1 j} = (2 n_{j j \cdot \cdot} + \sum_{k \neq j} n_{j k \cdot \cdot}) / (2 N)$ for j = 1, … , m₁ at locus 1, and ${\hat{p}}_{2 r} = (2 n_{\cdot \cdot r r} + \sum_{s \neq r} n_{\cdot \cdot r s}) / (2 N)$ for r = 1, … , m₂ at locus 2. For notational simplicity, we define ȳ_kjrs· = ȳ_jkrs· for k > j (j, k = 1, … , m₁) and ȳ_jksr· = ȳ_jkrs· for r > s (r, s = 1, … , m₂).
In addition, we define the weighted genotypic group means $ȳ_{j \cdot r s}^{*} = ȳ_{\cdot j r s}^{*} = \sum_{k = 1}^{m_{1}} {\hat{p}}_{1 k} ȳ_{j k r s}$ , $ȳ_{j k r \cdot}^{*} = ȳ_{j k \cdot r}^{*} = \sum_{s = 1}^{m_{2}} {\hat{p}}_{2 s} ȳ_{j k r s}$ , $ȳ_{j k \cdot \cdot}^{*} = ȳ_{k j \cdot \cdot}^{*} = \sum_{r, s = 1}^{m_{2}} {\hat{p}}_{2 r} {\hat{p}}_{2 s} ȳ_{j k r s}$ , $ȳ_{\cdot \cdot r s}^{*} = ȳ_{\cdot \cdot s r}^{*} = \sum_{j, k = 1}^{m_{1}} {\hat{p}}_{1 j} {\hat{p}}_{1 k} ȳ_{j k r s}$ , $ȳ_{j \cdot \cdot \cdot}^{*} = ȳ_{\cdot j \cdot \cdot}^{*} = \sum_{k = 1}^{m_{1}} \sum_{r, s = 1}^{m_{2}} {\hat{p}}_{1 k} {\hat{p}}_{2 r} {\hat{p}}_{2 s} ȳ_{j k r s}$ , and $ȳ_{\cdot \cdot r \cdot}^{*} = ȳ_{\cdot \cdot \cdot r}^{*} = \sum_{j, k = 1}^{m_{1}} \sum_{s = 1}^{m_{2}} {\hat{p}}_{1 j} {\hat{p}}_{1 k} {\hat{p}}_{2 s} ȳ_{j k r s}$ , for j, k = 1, … , m₁ and r, s = 1, … , m₂. The weighted overall mean $ȳ_{\cdot \cdot \cdot \cdot}^{*} = \sum_{j, k = 1}^{m_{1}} \sum_{r, s = 1}^{m_{2}} {\hat{p}}_{1 j} {\hat{p}}_{1 k} {\hat{p}}_{2 r} {\hat{p}}_{2 s} ȳ_{j k r s}$ .
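As a small illustration of this notation, the sketch below computes the allele-frequency MLE and the weighted overall mean from unphased joint genotype counts. The dictionary layout, the function names, and the toy counts and group means are assumptions made for the example; it also assumes that no genotype class is empty.

```python
from collections import defaultdict


def allele_freqs(counts):
    """MLE of allele frequencies at both loci from unphased joint counts.

    counts maps (j, k, r, s), with j <= k and r <= s, to n_jkrs.
    Each individual carries two alleles per locus, hence the 2N divisor.
    """
    N = sum(counts.values())
    copies1, copies2 = defaultdict(int), defaultdict(int)
    for (j, k, r, s), n in counts.items():
        copies1[j] += n
        copies1[k] += n   # a homozygote (j == k) is counted twice
        copies2[r] += n
        copies2[s] += n
    p1 = {a: c / (2 * N) for a, c in copies1.items()}
    p2 = {a: c / (2 * N) for a, c in copies2.items()}
    return p1, p2


def weighted_overall_mean(means, p1, p2):
    """Sum over ordered allele quadruples of p1j p1k p2r p2s times the group mean."""
    total = 0.0
    for j in p1:
        for k in p1:
            for r in p2:
                for s in p2:
                    key = (min(j, k), max(j, k), min(r, s), max(r, s))
                    total += p1[j] * p1[k] * p2[r] * p2[s] * means[key]
    return total


# Toy data: two biallelic loci, N = 16, alleles labelled 1 and 2.
counts = {(1, 1, 1, 1): 1, (1, 1, 1, 2): 2, (1, 1, 2, 2): 1,
          (1, 2, 1, 1): 2, (1, 2, 1, 2): 4, (1, 2, 2, 2): 2,
          (2, 2, 1, 1): 1, (2, 2, 1, 2): 2, (2, 2, 2, 2): 1}
# Hypothetical group means: the number of "1" alleles carried at both loci.
means = {key: float(sum(a == 1 for a in key)) for key in counts}
p1, p2 = allele_freqs(counts)
y_bar_star = weighted_overall_mean(means, p1, p2)
```

Note that the weights are products of allele frequencies rather than observed genotype frequencies, which is what distinguishes these weighted means from the ordinary sample mean.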

Let G_jkrs = E(G(g)|g = A_1jA_1k, A_2rA_2s) be the expected genotypic value of individuals with the joint genotypes (A_1jA_1k, A_2rA_2s), for j, k = 1, …, m₁ and r, s = 1, …, m₂. Without distinguishing the origin of parental alleles, we assume that {G_jkrs} satisfy the symmetry properties G_jkrs = G_kjrs = G_jksr = G_kjsr, for j < k and r < s. In general, there are m₁m₂(m₁ + 1)(m₂ + 1)/4 distinct expected genotypic values in total. By treating the paternal and maternal gametes as two independent risk factors, Fisher's two-locus ANOVA model for the expected genotypic values G_jkrs can be written as (see Kempthorne, 1957)

\begin{array}{l} G_{j k r s} = μ + α_{1 j} + α_{1 k} + δ_{1 j k} + α_{2 r} + α_{2 s} + δ_{2 r s} + {(α α)}_{j r} \\ + {(α α)}_{j s} + {(α α)}_{k r} + {(α α)}_{k s} + {(α δ)}_{j, r s} + {(α δ)}_{k, r s} \\ + {(δ α)}_{j k, r} + {(δ α)}_{j k, s} + {(δ δ)}_{j k, r s} & (7) \end{array}

for j, k = 1, …, m₁ and r, s = 1, …, m₂. The parameters α_1j and δ_1jk are referred to as the average additive and dominance effects at locus 1, α_2r and δ_2rs the average additive and dominance effects at locus 2, (αα)_jr the additive by additive interactions, (αδ)_{j, rs} the additive by dominance interactions, (δα)_{jk, r} the dominance by additive interactions, and (δδ)_{jk, rs} the dominance by dominance interactions. Still, because the model for the expected genotypic values is over-parameterized, not all the model parameters are estimable. From the symmetry property of the expected genotypic values, we usually assume that δ_1jk = δ_1kj, δ_2rs = δ_2sr, (αδ)_{j, rs} = (αδ)_{j, sr}, (δα)_{jk, r} = (δα)_{kj, r}, (δδ)_{jk, rs} = (δδ)_{kj, rs} = (δδ)_{jk, sr}. In addition, the following constraints need to be imposed on the model parameters (Gallais, 1974; Weir and Cockerham, 1977):

\begin{array}{l} \sum_{j = 1}^{m_{1}} p_{1 j} α_{1 j} = 0, \sum_{r = 1}^{m_{2}} p_{2 r} α_{2 r} = 0, \sum_{j = 1}^{m_{1}} p_{1 j} δ_{1 j k} = 0, \sum_{r = 1}^{m_{2}} p_{2 r} δ_{2 r s} = 0, \\ \sum_{j = 1}^{m_{1}} p_{1 j} {(α α)}_{j r} = 0, \sum_{r = 1}^{m_{2}} p_{2 r} {(α α)}_{j r} = 0, \sum_{j = 1}^{m_{1}} p_{1 j} {(α δ)}_{j, r s} = 0, \\ \sum_{r = 1}^{m_{2}} p_{2 r} {(α δ)}_{j, r s} = 0, \sum_{j = 1}^{m_{1}} p_{1 j} {(δ α)}_{j k, r} = 0, \sum_{r = 1}^{m_{2}} p_{2 r} {(δ α)}_{j k, r} = 0, \\ \sum_{j = 1}^{m_{1}} p_{1 j} {(δ δ)}_{j k, r s} = 0, \sum_{r = 1}^{m_{2}} p_{2 r} {(δ δ)}_{j k, r s} = 0 & (8) \end{array}

In other words, the weighted sum of the average genetic effects of a genetic component over any index is zero. It is known that model (7) under constraints (8) provides an orthogonal partition of the genetic variance components when the four paternal and maternal alleles at the two loci are inherited independently (Kempthorne, 1957; Weir and Cockerham, 1977). Still, as pointed out in Wang (2014), it is difficult to estimate the parameters in model (7) under the complicated constraints (8) when adjustment covariates are involved. Besides, the random variables that constitute the random sources of the genetic variance components are not explicitly defined in model (7).

To model the expected genotypic values through a GLM, we can re-list all the sampled individuals by an index variable i = 1, … , N. Similar to the one-locus case, we introduce the indicator variables $z_{1 j}^{(1)}$ and $z_{2 k}^{(1)}$ for inheritance of the paternal and maternal alleles at locus 1, and $z_{1 r}^{(2)}$ and $z_{2 s}^{(2)}$ for inheritance of the paternal and maternal alleles at locus 2. Then, for phase-unknown genotypes, we define the genotype coding variables w_1j, v_1jk for j, k = 1, … , m₁ (j ≤ k) at locus 1, and w_2r, v_2rs for r, s = 1, … , m₂ (r ≤ s) at locus 2, in the same way as we did in the one-locus case. By including the within-locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), a fully parameterized two-locus GLM can be written as (Wang, 2011),

\begin{array}{l} E (G | g_{i}) = μ_{0} + \sum_{j = 1}^{m_{1} - 1} a_{1 j} w_{1 j} + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} d_{1 j k} v_{1 j k} + \sum_{r = 1}^{m_{2} - 1} a_{2 r} w_{2 r} \\ + \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} d_{2 r s} v_{2 r s} + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(a a)}_{j r} w_{1 j} w_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(a d)}_{j, r s} w_{1 j} v_{2 r s} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(d a)}_{j k, r} v_{1 j k} w_{2 r} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(d d)}_{j k, r s} v_{1 j k} v_{2 r s} & (9) \end{array}

for i = 1, …, N. A nice feature of the above GLM is that we can easily establish the relationship between its model parameters and the expected genotypic values by starting from the lowest order parameter μ₀ = G_{m₁m₁m₂m₂}. Suppose that there are no empty genotypes; i.e., n_jkrs > 0 for any j, k = 1, … , m₁ and r, s = 1, … , m₂. When we incorporate the above model (9) into a GLM (1) and ignore the adjusted covariates, the LSE of parameters in model (9) can be derived as shown in Appendix C in Supplementary Material. Similar to the one-locus GLM, the LSE of these fixed allelic effects are highly dependent upon the genotypic group means, which could be sensitive to phenotypic outliers in small genotypic groups. More importantly, model (9) cannot provide the same partition of the expected genotypic variance V(E(G|g)) as the one defined in the original Fisher's ANOVA model (7).

To construct a two-locus GMA model for analysis of genetic variance components, we can introduce the mean-corrected index variables $x_{1 j}^{(1)}, x_{2 k}^{(1)}$ for j, k = 1, … , m₁ at locus 1, and $x_{1 r}^{(2)}, x_{2 s}^{(2)}$ for r, s = 1, … , m₂ at locus 2. Then we define the modified genotype coding variables $w_{1 j}^{*}$ , $v_{1 j k}^{*}$ for j, k = 1, … , m₁ at locus 1, and $w_{2 r}^{*}$ , $v_{2 r s}^{*}$ for r, s = 1, … , m₂ at locus 2 in the same way as we did in the one-locus case. By including the within-locus additive and dominance effects as well as the locus-by-locus interactions (i.e., epistases), we can build a fully parameterized two-locus GMA model as follows:

\begin{array}{l} E (G | g_{i}) = μ^{*} + \sum_{j = 1}^{m_{1} - 1} α_{1 j}^{*} w_{1 j}^{*} + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} δ_{1 j k}^{*} v_{1 j k}^{*} + \sum_{r = 1}^{m_{2} - 1} α_{2 r}^{*} w_{2 r}^{*} \\ + \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} δ_{2 r s}^{*} v_{2 r s}^{*} + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(α α)}_{j r}^{*} w_{1 j}^{*} w_{2 r}^{*} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(α δ)}_{j, r s}^{*} w_{1 j}^{*} v_{2 r s}^{*} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(δ α)}_{j k, r}^{*} v_{1 j k}^{*} w_{2 r}^{*} \\ + \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(δ δ)}_{j k, r s}^{*} v_{1 j k}^{*} v_{2 r s}^{*} & (10) \end{array}

for i = 1, …, N. Similar to the one-locus case, the original Fisher's ANOVA model (7) under constraints (8) and the symmetric properties of the dominance effects can be re-written as

\begin{array}{l} E (G | g_{i}) = μ + \sum_{j = 1}^{m_{1}} α_{1 j} w_{1 j}^{*} + \sum_{j = 1}^{m_{1}} \sum_{k = j}^{m_{1}} δ_{1 j k} v_{1 j k}^{*} + \sum_{r = 1}^{m_{2}} α_{2 r} w_{2 r}^{*} \\ + \sum_{r = 1}^{m_{2}} \sum_{s = r}^{m_{2}} δ_{2 r s} v_{2 r s}^{*} + \sum_{j = 1}^{m_{1}} \sum_{r = 1}^{m_{2}} {(α α)}_{j r} w_{1 j}^{*} w_{2 r}^{*} \\ + \sum_{j = 1}^{m_{1}} \sum_{r = 1}^{m_{2}} \sum_{s = r}^{m_{2}} {(α δ)}_{j, r s} w_{1 j}^{*} v_{2 r s}^{*} \\ + \sum_{j = 1}^{m_{1}} \sum_{k = j}^{m_{1}} \sum_{r = 1}^{m_{2}} {(δ α)}_{j k, r} v_{1 j k}^{*} w_{2 r}^{*} \\ + \sum_{j = 1}^{m_{1}} \sum_{k = j}^{m_{1}} \sum_{r = 1}^{m_{2}} \sum_{s = r}^{m_{2}} {(δ δ)}_{j k, r s} v_{1 j k}^{*} v_{2 r s}^{*} . & (11) \end{array}

Model (10) is actually a simplified version of model (11) by further removing the redundant parameters. The two models (10) and (11) are equivalent when we take

\begin{array}{l} μ^{*} = μ, α_{1 j}^{*} = α_{1 j} - α_{1 m_{1}}, α_{2 r}^{*} = α_{2 r} - α_{2 m_{2}} \\ δ_{1 j k}^{*} = δ_{1 j k} - δ_{1 j m_{1}} - δ_{1 k m_{1}} + δ_{1 m_{1} m_{1}} \\ δ_{2 r s}^{*} = δ_{2 r s} - δ_{2 r m_{2}} - δ_{2 s m_{2}} + δ_{2 m_{2} m_{2}} \\ {(α α)}_{j r}^{*} = {(α α)}_{j r} - {(α α)}_{j m_{2}} - {(α α)}_{m_{1} r} + {(α α)}_{m_{1} m_{2}} \\ {(α δ)}_{j, r s}^{*} = [{(α δ)}_{j, r s} - {(α δ)}_{j, r m_{2}} - {(α δ)}_{j, s m_{2}} + {(α δ)}_{j, m_{2} m_{2}}] \\ - [{(α δ)}_{m_{1}, r s} - {(α δ)}_{m_{1}, r m_{2}} - {(α δ)}_{m_{1}, s m_{2}} + {(α δ)}_{m_{1}, m_{2} m_{2}}] \\ {(δ α)}_{j k, r}^{*} = [{(δ α)}_{j k, r} - {(δ α)}_{j m_{1}, r} - {(δ α)}_{k m_{1}, r} + {(δ α)}_{m_{1} m_{1}, r}] \\ - [{(δ α)}_{j k, m_{2}} - {(δ α)}_{j m_{1}, m_{2}} - {(δ α)}_{k m_{1}, m_{2}} + {(δ α)}_{m_{1} m_{1}, m_{2}}] \\ {(δ δ)}_{j k, r s}^{*} = [{(δ δ)}_{j k, r s} - {(δ δ)}_{j k, r m_{2}} - {(δ δ)}_{j k, s m_{2}} + {(δ δ)}_{j k, m_{2} m_{2}}] \\ - [{(δ δ)}_{j m_{1}, r s} - {(δ δ)}_{j m_{1}, r m_{2}} - {(δ δ)}_{j m_{1}, s m_{2}} + {(δ δ)}_{j m_{1}, m_{2} m_{2}}] \\ - [{(δ δ)}_{k m_{1}, r s} - {(δ δ)}_{k m_{1}, r m_{2}} - {(δ δ)}_{k m_{1}, s m_{2}} + {(δ δ)}_{k m_{1}, m_{2} m_{2}}] \\ + [{(δ δ)}_{m_{1} m_{1}, r s} - {(δ δ)}_{m_{1} m_{1}, r m_{2}} - {(δ δ)}_{m_{1} m_{1}, s m_{2}} \\ + {(δ δ)}_{m_{1} m_{1}, m_{2} m_{2}}] \end{array}

for j, k = 1, … , m₁ − 1 (j ≤ k) and r, s = 1, … , m₂ − 1 (r ≤ s). Therefore, both models (10) and (11) share exactly the same genetic components as the ones defined in the original Fisher's two-locus ANOVA model (7). In model (10), let $A_{1} = \sum_{j = 1}^{m_{1} - 1} α_{1 j}^{*} w_{1 j}^{*}$ , $D_{1} = \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} δ_{1 j k}^{*} v_{1 j k}^{*}$ , $A_{2} = \sum_{r = 1}^{m_{2} - 1} α_{2 r}^{*} w_{2 r}^{*}$ , $D_{2} = \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} δ_{2 r s}^{*} v_{2 r s}^{*}$ , $A_{1} A_{2} = \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(α α)}_{j r}^{*} w_{1 j}^{*} w_{2 r}^{*}$ , $A_{1} D_{2} = \sum_{j = 1}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(α δ)}_{j, r s}^{*} w_{1 j}^{*} v_{2 r s}^{*}$ , $D_{1} A_{2} = \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} {(δ α)}_{j k, r}^{*} v_{1 j k}^{*} w_{2 r}^{*}$ and $D_{1} D_{2} = \sum_{j = 1}^{m_{1} - 1} \sum_{k = j}^{m_{1} - 1} \sum_{r = 1}^{m_{2} - 1} \sum_{s = r}^{m_{2} - 1} {(δ δ)}_{j k, r s}^{*} v_{1 j k}^{*} v_{2 r s}^{*}$ , which represent the additive and dominance components at locus 1, the additive and dominance components at locus 2, and the additive by additive, additive by dominance, dominance by additive and dominance by dominance genetic components, respectively. Each genetic component consists of a set of average allelic effects (additive or interaction) contributed by alleles from the same locus or the same set of loci, and at each locus at most two alleles can be involved. We refer to the number of alleles involved in each average allelic effect of a genetic component as the order of the genetic component, and to the number of non-redundant average allelic effects involved in a genetic component as the degrees of freedom (df) of the genetic component. Note that genetic components of the same order may have different df. 
For example, the additive by additive component A₁A₂ has order 2 and df = (m₁ − 1)(m₂ − 1), while the dominance component D₁ at locus 1 also has order 2 but $d f = m_{1} (m_{1} - 1) / 2$ , the number of index pairs j ≤ k in its defining sum.

Both the GLM (9) and GMA model (10) provide a re-parameterization of the expected genotypic values. From the relationship between their genotype coding variables, these two models can be transformed equivalently from one to the other. We can represent the parameters in GMA model (10) in terms of the parameters in its equivalent GLM (9) as shown in Appendix D in Supplementary Material. It is interesting to see that an average allelic effect can be affected not only by its corresponding fixed allelic effect but also by other higher-order fixed allelic effects. As a result, the hypothesis test of an average allelic effect in the GMA model (10) is not equivalent to the hypothesis test of the corresponding fixed allelic effect in an equivalent GLM (9) when higher-order fixed allelic effects are present. From Appendices C, D in Supplementary Material, we can also derive the LSE of parameters for the GMA model (10) as follows:

\begin{array}{l} {\hat{μ}}^{*} = {\bar{y}}_{\cdot \cdot \cdot \cdot}^{*}, {\hat{α}}_{1 j}^{*} = {\bar{y}}_{j \cdot \cdot \cdot}^{*} - {\bar{y}}_{m_{1} \cdot \cdot \cdot}^{*}, {\hat{α}}_{2 r}^{*} = {\bar{y}}_{\cdot \cdot r \cdot}^{*} - {\bar{y}}_{\cdot \cdot m_{2} \cdot}^{*} \\ {\hat{δ}}_{1 j k}^{*} = {\bar{y}}_{j k \cdot \cdot}^{*} - ({\bar{y}}_{j m_{1} \cdot \cdot}^{*} + {\bar{y}}_{m_{1} k \cdot \cdot}^{*}) + {\bar{y}}_{m_{1} m_{1} \cdot \cdot}^{*} \\ {\hat{δ}}_{2 r s}^{*} = {\bar{y}}_{\cdot \cdot r s}^{*} - ({\bar{y}}_{\cdot \cdot r m_{2}}^{*} + {\bar{y}}_{\cdot \cdot m_{2} s}^{*}) + {\bar{y}}_{\cdot \cdot m_{2} m_{2}}^{*} \\ {\hat{(α α)}}_{j r}^{*} = {\bar{y}}_{j \cdot r \cdot}^{*} - ({\bar{y}}_{j \cdot m_{2} \cdot}^{*} + {\bar{y}}_{m_{1} \cdot r \cdot}^{*}) + {\bar{y}}_{m_{1} \cdot m_{2} \cdot}^{*} \\ {\hat{(α δ)}}_{j, r s}^{*} = {\bar{y}}_{j \cdot r s}^{*} - ({\bar{y}}_{m_{1} \cdot r s}^{*} + {\bar{y}}_{j \cdot r m_{2}}^{*} + {\bar{y}}_{j \cdot m_{2} s}^{*}) \\ + ({\bar{y}}_{j \cdot m_{2} m_{2}}^{*} + {\bar{y}}_{m_{1} \cdot r m_{2}}^{*} + {\bar{y}}_{m_{1} \cdot m_{2} s}^{*}) - {\bar{y}}_{m_{1} \cdot m_{2} m_{2}}^{*} \\ {\hat{(δ α)}}_{j k, r}^{*} = {\bar{y}}_{j k r \cdot}^{*} - ({\bar{y}}_{j k m_{2} \cdot}^{*} + {\bar{y}}_{j m_{1} r \cdot}^{*} + {\bar{y}}_{k m_{1} r \cdot}^{*}) \\ + ({\bar{y}}_{j m_{1} m_{2} \cdot}^{*} + {\bar{y}}_{k m_{1} m_{2} \cdot}^{*} + {\bar{y}}_{m_{1} m_{1} r \cdot}^{*}) - {\bar{y}}_{m_{1} m_{1} m_{2} \cdot}^{*} \end{array}

and ${\hat{(δ δ)}}_{j k, r s}^{*} = {\hat{(d d)}}_{j k, r s}$ , for j, k = 1, …, m₁ − 1, j ≤ k; and r, s = 1, …, m₂ − 1, r ≤ s. Similar to the one-locus GMA model, the LSE of these average allelic effects depend on the weighted genotypic group means. Except for the LSE of the highest-order effects (the dominance by dominance interactions), the LSE of the lower-order average allelic effects could be more robust to phenotypic outliers in small genotypic groups than the LSE of their corresponding fixed allelic effects.

When the two markers are unlinked and in HWE, the four paternal and maternal alleles at the two loci are inherited independently. Therefore, the four sets of indicator variables $\{z_{1 j}^{(1)}, j = 1, \dots, m_{1}\}$ , $\{z_{2 k}^{(1)}, k = 1, \dots, m_{1}\}$ , $\{z_{1 r}^{(2)}, r = 1, \dots, m_{2}\}$ , and $\{z_{2 s}^{(2)}, s = 1, \dots, m_{2}\}$ are independent of each other, although the variables within each set could still be correlated. In this case, all the genetic components are independent of each other, and the GMA model (10) can provide an orthogonal partition of the expected genotypic variance. In Wang (2014), we have derived formulas for computing the genetic variance components based on the model parameters in a general multi-locus GMA model. For the two-locus GMA model (10), by plugging in the LSE of its model parameters, we obtain estimators of the genetic variance components as shown in Appendix E in Supplementary Material. It should be pointed out that Weir and Cockerham (1977) also derived estimates of the genetic variance components based on the model parameters in Fisher's two-locus ANOVA model (7), but they did not construct the estimators of genetic variance components in terms of the weighted genotypic group means. Note that the orthogonal partition of the genetic variance components implies that the genetic components constitute independent random sources contributing to the expected genotypic variance. Therefore, we could estimate and test for each genetic component separately. The Bonferroni criterion can also be applied to correct for the association testing of multiple genetic components.
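The orthogonality claim can be checked numerically for two biallelic loci. The sketch below enumerates the four independently inherited alleles and computes exact population covariances among the eight coding variables, once for the GLM dummies and once for the GMA variables. The allele frequencies 0.4 and 0.2 and the coding w* = x₁ + x₂, v* = x₁x₂ with x = z − p are assumptions chosen for illustration.

```python
from itertools import product


def coded_variables(z, p1, p2, centered):
    """Eight coding variables for one phased genotype z = (z11, z21, z12, z22).

    centered=False gives the GLM dummies; centered=True gives the GMA
    variables built from mean-corrected indicators x = z - p (assumed coding).
    """
    a, b = (z[0] - p1, z[1] - p1) if centered else (z[0], z[1])
    c, d = (z[2] - p2, z[3] - p2) if centered else (z[2], z[3])
    w1, v1, w2, v2 = a + b, a * b, c + d, c * d
    return [w1, v1, w2, v2, w1 * w2, w1 * v2, v1 * w2, v1 * v2]


def covariance_matrix(p1, p2, centered):
    """Exact population covariances under HWE and linkage equilibrium."""
    rows = []
    for z in product((0, 1), repeat=4):
        prob = 1.0
        for zi, p in zip(z, (p1, p1, p2, p2)):
            prob *= p if zi else 1 - p
        rows.append((prob, coded_variables(z, p1, p2, centered)))
    mean = [sum(pr * x[i] for pr, x in rows) for i in range(8)]
    return [[sum(pr * (x[i] - mean[i]) * (x[j] - mean[j]) for pr, x in rows)
             for j in range(8)] for i in range(8)]


def max_off_diagonal(cov):
    """Largest absolute covariance between two different variables."""
    return max(abs(cov[i][j]) for i in range(8) for j in range(8) if i != j)


glm_cov = covariance_matrix(0.4, 0.2, centered=False)
gma_cov = covariance_matrix(0.4, 0.2, centered=True)
glm_max, gma_max = max_off_diagonal(glm_cov), max_off_diagonal(gma_cov)
```

Under these assumptions `gma_max` is zero up to floating-point error, whereas `glm_max` is clearly positive, which reflects the collinearity among the dummy variables.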

When the inheritance of the paternal and maternal alleles at the two loci is dependent, non-zero covariances among different genetic components may be present. The dependency among inheritance of the paternal and maternal alleles at the two markers can be measured by $D_{j k r s} = p_{j k r s} / [(2 - 1_{\{j = k\}}) (2 - 1_{\{r = s\}})] - p_{1 j} p_{1 k} p_{2 r} p_{2 s}$ , for j, k = 1, … , m₁ and r, s = 1, … , m₂. Let ${\hat{D}}_{j k r s} = {\hat{p}}_{j k r s} / [(2 - 1_{\{j = k\}}) (2 - 1_{\{r = s\}})] - {\hat{p}}_{1 j} {\hat{p}}_{1 k} {\hat{p}}_{2 r} {\hat{p}}_{2 s}$ . Similar to the one-locus case, when ${\hat{D}}_{j k r s} = 0$ for j, k = 1, … , m₁ and r, s = 1, … , m₂, we can show that the LSE of the average additive, dominance, additive by additive, additive by dominance, dominance by additive and dominance by dominance effects in the full model (10) remain consistent when some components are excluded from the model.
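A minimal sketch of the sample disequilibrium measure follows. It assumes that the divisor is the product of the two factors (2 − 1 for a homozygote, 2 for a heterozygote at each locus), i.e., the number of phase orderings behind an unphased genotype; the function name and both sets of toy counts are hypothetical.

```python
def disequilibrium(counts):
    """Sample disequilibrium D-hat_jkrs for each observed joint genotype.

    counts maps (j, k, r, s), with j <= k and r <= s, to n_jkrs.
    Genotype classes absent from counts are simply not reported.
    """
    N = sum(counts.values())
    p1, p2 = {}, {}
    for (j, k, r, s), n in counts.items():
        for allele in (j, k):                      # allele-frequency MLE, locus 1
            p1[allele] = p1.get(allele, 0.0) + n / (2 * N)
        for allele in (r, s):                      # allele-frequency MLE, locus 2
            p2[allele] = p2.get(allele, 0.0) + n / (2 * N)
    d_hat = {}
    for (j, k, r, s), n in counts.items():
        orderings = (1 if j == k else 2) * (1 if r == s else 2)
        d_hat[(j, k, r, s)] = n / N / orderings - p1[j] * p1[k] * p2[r] * p2[s]
    return d_hat


# Counts in exact equilibrium proportions (all allele frequencies 0.5).
balanced = {(1, 1, 1, 1): 1, (1, 1, 1, 2): 2, (1, 1, 2, 2): 1,
            (1, 2, 1, 1): 2, (1, 2, 1, 2): 4, (1, 2, 2, 2): 2,
            (2, 2, 1, 1): 1, (2, 2, 1, 2): 2, (2, 2, 2, 2): 1}
# Counts in which the genotypes at the two loci are perfectly associated.
associated = {(1, 1, 1, 1): 4, (1, 2, 1, 2): 8, (2, 2, 2, 2): 4}
d_balanced = disequilibrium(balanced)
d_associated = disequilibrium(associated)
```

For the balanced counts every estimate is zero, whereas the associated counts give non-zero values, which signals that covariances among the genetic components should be expected.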

In general, we can always incorporate the two-locus GMA model (10) into a regression model (1). By treating the modified genotype coding variables $w_{1 j}^{*}$ , $w_{2 r}^{*}$ , $v_{1 j k}^{*}$ , and $v_{2 r s}^{*}$ as regular fixed covariates, we can fit the regression model using the ordinary LS approach. Based on the fitted model, we can calculate the fitted genetic components for each individual i = 1, … , N. Then the genetic variance components and their possible covariances can be estimated as the sample variances and covariances of the fitted values of these genetic components. Similar to the one-locus case, these estimators of the variance and covariance components differ from the traditional ANOVA estimators when the two markers are linked or their genotypes are in HWD. Because the estimators of the variance components are sample variances, they are guaranteed to be non-negative. When the model residuals are normally distributed, these estimators of the variance and covariance components become MLE and likely possess the asymptotic normality property. Besides, these variance and covariance estimators are likely asymptotically unbiased.
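The fitting and estimation steps just described can be sketched with NumPy for two biallelic loci. Everything specific here is an assumption made for illustration: the toy sample is built in exact equilibrium proportions (frequencies 0.4 and 0.2, N = 10000), the response is noise-free so that the recovered coefficients are exact, and the coding w* = w − 2p̂, v* = v − p̂w + p̂² is assumed to match the one-locus definitions.

```python
import numpy as np

p1, p2, N = 0.4, 0.2, 10000
probs1 = {0: (1 - p1) ** 2, 1: 2 * p1 * (1 - p1), 2: p1 ** 2}
probs2 = {0: (1 - p2) ** 2, 1: 2 * p2 * (1 - p2), 2: p2 ** 2}

# One row per individual; a genotype is the number of copies of allele "1".
g1, g2 = [], []
for a, fa in probs1.items():
    for b, fb in probs2.items():
        n_ab = round(N * fa * fb)
        g1 += [a] * n_ab
        g2 += [b] * n_ab
g1, g2 = np.array(g1), np.array(g2)

# Noise-free toy response generated from GLM-type effects.
y = 10.0 + g1 + (g1 == 2) + g2 + (g2 == 2) + g1 * g2


def gma_pair(g):
    """Assumed biallelic GMA coding based on the sample allele frequency."""
    p_hat = g.mean() / 2
    return g - 2 * p_hat, (g == 2) - p_hat * g + p_hat ** 2


w1, v1 = gma_pair(g1)
w2, v2 = gma_pair(g2)
X = np.column_stack([np.ones(len(g1)), w1, v1, w2, v2,
                     w1 * w2, w1 * v2, v1 * w2, v1 * v2])

# Ordinary LS fit, then one fitted genetic component per column.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
components = X[:, 1:] * beta[1:]

# Sample variances (diagonal) and covariances (off-diagonal) of the components.
cov = np.cov(components, rowvar=False)
variance_components = np.diag(cov)
total_variance = components.sum(axis=1).var(ddof=1)
```

In this balanced toy sample the off-diagonal entries of `cov` vanish, so the eight variance components add up to `total_variance`; with linked markers or HWD the covariances would have to be reported alongside the variances.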

2.3. A Simulation Example

Consider two biallelic marker loci with allele frequencies p₁ = 0.4 and q₁ = 0.6 for alleles “1” and “0” at locus 1, and p₂ = 0.2 and q₂ = 0.8 for alleles “1” and “0” at locus 2. Assume that the two marker loci are in linkage and gametic equilibria. The expected genotypic values at the two marker loci are simulated based on a two-locus GLM (9) with μ₀ = 10 and a₁₁ = a₂₁ = d₁₁₁ = d₂₁₁ = (aa)₁₁ = 1 and (ad)_{1, 11} = (da)_{11, 1} = (dd)_{11, 11} = 0. From Appendix D in Supplementary Material, we can show that this GLM is equivalent to a GMA model (10) with μ* = 11.72 and $α_{11}^{*} = 1.8$ , $α_{21}^{*} = 2$ , $δ_{111}^{*} = 1$ , $δ_{211}^{*} = 1$ , ${(α α)}_{11}^{*} = 1$ , ${(α δ)}_{1, 11}^{*} = 0$ , ${(δ α)}_{11, 1}^{*} = 0$ , and ${(δ δ)}_{11, 11}^{*} = 0$ . Based on the true expected genotypic values and the genotypic distribution, we also know that the total expected genotypic variance, which is V_G = 3.09, has an orthogonal partition which consists of eight variance components contributed by $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}$ . Besides, these eight components $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}$ contribute 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% of the total expected genotypic variance V_G. For each random sample of size n, we first generate genotypes of individuals independently based on the genotypic distribution. Then, based on the genotypes, we create dummy variables w₁ = w₁₁, v₁ = v₁₁₁, w₂ = w₂₁, v₂ = v₂₁₁, ww = w₁₁ × w₂₁, wv = w₁₁ × v₂₁₁, vw = v₁₁₁ × w₂₁, vv = v₁₁₁ × v₂₁₁ and mean-corrected index variables $w_{1}^{*} = w_{11}^{*}, v_{1}^{*} = v_{111}^{*}, w_{2}^{*} = w_{21}^{*}, v_{2}^{*} = v_{211}^{*}, w w^{*} = w_{11}^{*} w_{21}^{*}, w v^{*} = w_{11}^{*} v_{211}^{*}, v w^{*} = v_{111}^{*} w_{21}^{*}, v v^{*} = v_{111}^{*} v_{211}^{*}$ . Each phenotypic value is the sum of the genotypic value and a residual, with the residuals being independent of the genotypes. 
To generate phenotypic values, we assume that the residuals ϵ_i, i = 1, … , n, are i.i.d. and normally distributed with the residual variance being V_ϵ = 17.51, which accounts for 85% of the total phenotypic variance (i.e., the broad sense heritability is about 15%).
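The stated equivalence between the two parameterizations can be verified directly by evaluating both models on the nine unphased genotype classes, and the sampling scheme can be sketched in a few lines. The coding w* = w − 2p, v* = v − pw + p² for a biallelic locus, as well as the function names and the seed, are assumptions made for this illustration; the paper's own simulations were run in SAS.

```python
import random
from itertools import product

P1, P2 = 0.4, 0.2        # frequencies of allele "1" at loci 1 and 2
RESIDUAL_VAR = 17.51     # residual variance used in the simulation


def glm_value(g1, g2):
    """Model (9) with mu0 = 10 and a11 = a21 = d111 = d211 = (aa)11 = 1."""
    w1, v1 = g1, int(g1 == 2)
    w2, v2 = g2, int(g2 == 2)
    return 10 + w1 + v1 + w2 + v2 + w1 * w2


def gma_value(g1, g2):
    """Model (10) with the GMA parameter values quoted in the text."""
    w1s, v1s = g1 - 2 * P1, int(g1 == 2) - P1 * g1 + P1 ** 2
    w2s, v2s = g2 - 2 * P2, int(g2 == 2) - P2 * g2 + P2 ** 2
    return 11.72 + 1.8 * w1s + 1.0 * v1s + 2.0 * w2s + 1.0 * v2s + 1.0 * w1s * w2s


# Largest disagreement between the two models over all nine genotype classes.
max_gap = max(abs(glm_value(g1, g2) - gma_value(g1, g2))
              for g1, g2 in product((0, 1, 2), repeat=2))


def simulate_sample(n, seed=0):
    """Draw (g1, g2, y) triples: independent alleles plus a normal residual."""
    rng = random.Random(seed)
    sd = RESIDUAL_VAR ** 0.5
    sample = []
    for _ in range(n):
        g1 = (rng.random() < P1) + (rng.random() < P1)
        g2 = (rng.random() < P2) + (rng.random() < P2)
        sample.append((g1, g2, glm_value(g1, g2) + rng.gauss(0.0, sd)))
    return sample


sample = simulate_sample(500)
```

A `max_gap` of zero, up to floating-point error, confirms that the two parameter sets describe the same nine expected genotypic values.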

First, we show that the GLM and GMA models provide different partitions of the expected genotypic variance. To minimize the sampling variation, we simulate a random sample with n = 100,000. We fit both GLM and GMA models to the random sample using SAS “proc glm” (SAS Institute, INC.). For the fitted GLM, we have LSE ${\hat{μ}}_{0} = 9.99$ , â₁₁ = 1.01, â₂₁ = 1.05, ${\hat{d}}_{111} = 0.97$ , ${\hat{d}}_{211} = 0.95$ , ${\hat{(a a)}}_{11} = 0.97$ , ${\hat{(a d)}}_{1, 11} = 0.02$ , ${\hat{(d a)}}_{11, 1} = - 0.03$ and ${\hat{(d d)}}_{11, 11} = 0.07$ , which are close to the true values. From the fitted GLM, we calculate estimates of the variance components contributed by w₁, v₁, w₂, v₂, ww, wv, vw, vv as 0.4894, 0.1274, 0.3533, 0.0345, 0.4153, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 1.4199, which is about 46% of the total expected genotypic variance estimate V_G = 3.0896. In other words, in the GLM based partition of the expected genotypic variance, about 54% is contributed by the covariance components due to the collinearity among the variables w₁, v₁, w₂, v₂, ww, wv, vw, vv. For the fitted GMA, we have LSE ${\hat{μ}}^{*} = 11.72$ , ${\hat{α}}_{11}^{*} = 1.78$ , ${\hat{α}}_{21}^{*} = 2.02$ , ${\hat{δ}}_{111}^{*} = 0.96$ , ${\hat{δ}}_{211}^{*} = 0.98$ , ${\hat{(α α)}}_{11}^{*} = 0.97$ , ${\hat{(α δ)}}_{1, 11}^{*} = 0.05$ , ${\hat{(δ α)}}_{11, 1}^{*} = - 0.02$ and ${\hat{(δ δ)}}_{11, 11}^{*} = 0.07$ , which are also close to the true values. From the fitted GMA model, we compute estimates of the variance components contributed by $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}$ , vw*, vv* as 1.5321, 0.0534, 1.3079, 0.0244, 0.1462, 0.00, 0.00, 0.00, respectively. The summation of these variance components is 3.064, which is about 99% of the total expected genotypic variance estimate V_G = 3.0896. 
This indicates that the GMA model leads to a partition of the expected genotypic variance that is almost orthogonal, with only about 1% coming from the covariance components. Note also that these estimates of the variance components contributed by $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}$ , vw*, vv* account for approximately 49.59, 1.73, 42.33, 0.79, 4.73, 0.00, 0.00, 0.00% of the expected genotypic variance estimate V_G = 3.0896, which are close to the true proportions 50.74, 1.87, 41.53, 0.83, 5.03, 0.00, 0.00, 0.00% that the variables $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}$ , vw*, vv* contribute to the expected genotypic variance.

We also look at the difference between the partition of the expected genotypic variance and the Type III sums of squares for this random sample. Both the GLM and GMA models have the same regression sum of squares SSR = 307788.89 and mean square error MSE = 17.52. In the fitted GLM, the Type III sums of squares for the eight variables w₁, v₁, w₂, v₂, ww, wv, vw, vv are 13415.25, 3475.46, 8472.60, 840.54, 4149.25, 0.24, 1.04, and 0.67, respectively. The summation of these Type III sums of squares is 30355.05, which is less than 10% of the total SSR. In the fitted GMA, the Type III sums of squares for the eight variables $w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}$ are 153184.54, 5340.61, 130751.17, 2438.75, 14620.38, 2.92, 0.44, and 0.67, respectively. The summation of these Type III sums of squares is 306339.50, which is about 99.5% of the total SSR. This indicates that these Type III sums of squares also provide an approximately orthogonal partition of the SSR. In other words, the hypothesis tests of the eight genetic components are approximately orthogonal.
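The comparison rests on a simple piece of arithmetic: with mutually orthogonal regressors the Type III sums of squares would add up to the SSR, so the ratio of their sum to the SSR indicates how close a coding is to orthogonality. The short sketch below recomputes that ratio from the values reported above; the variable names are illustrative.

```python
SSR = 307788.89  # regression sum of squares shared by both fitted models

# Type III sums of squares as reported in the text, in the order
# w1, v1, w2, v2, ww, wv, vw, vv (GLM) and their starred counterparts (GMA).
glm_type3 = [13415.25, 3475.46, 8472.60, 840.54, 4149.25, 0.24, 1.04, 0.67]
gma_type3 = [153184.54, 5340.61, 130751.17, 2438.75, 14620.38, 2.92, 0.44, 0.67]

# Share of the SSR accounted for by the sum of the partial (Type III) terms.
glm_share = sum(glm_type3) / SSR
gma_share = sum(gma_type3) / SSR
```

The GLM share stays below one tenth while the GMA share is close to one, in line with the percentages quoted in the text.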

Next, we examine the performance in model selection between using the GLM and GMA models. We consider varied sample sizes of n = 500, 1000, 2000, and 5000. Under each simulation scenario, we run 1000 simulations. For each simulation sample, we first apply one commonly used method, forward stepwise selection, for model selection on {w₁, v₁, w₂, v₂, ww, wv, vw, vv} and $\{w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}\}$ , separately. We run “proc glmselect” in SAS with the criterion p = 0.05 for both entry and stay in the model. For the GMA model selection on index variables $\{w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}\}$ , as these variables are independent in the underlying true model, we can rank them in the order of $w_{1}^{*}, w_{2}^{*}, w w^{*}, v_{1}^{*}, v_{2}^{*}, v w^{*}, w v^{*}$ , and vv* according to their variance contributions from the largest to the smallest. Intuitively, we expect that the selected models would include the higher ranked variables more often than the lower ranked ones. 
We mainly focus on the top five variables $w_{1}^{*}, w_{2}^{*}, w w^{*}, v_{1}^{*}, v_{2}^{*}$ and classify the selected models into 10 categories: I = all the five variables $w_{1}^{*}, w_{2}^{*}, w w^{*}, v_{1}^{*}, v_{2}^{*}$ are selected; II = $w_{1}^{*}, w_{2}^{*}, w w^{*}, v_{1}^{*}$ are selected but not $v_{2}^{*}$ ; III = $w_{1}^{*}, w_{2}^{*}, w w^{*}, v_{2}^{*}$ are selected but not $v_{1}^{*}$ ; IV = $w_{1}^{*}, w_{2}^{*}, w w^{*}$ are selected but not $v_{1}^{*}, v_{2}^{*}$ ; V = $w_{1}^{*}, w_{2}^{*}$ are selected but not $w w^{*}, v_{1}^{*}, v_{2}^{*}$ ; VI = $w_{1}^{*}, w w^{*}$ are selected but not $w_{2}^{*}$ ; VII = $w_{2}^{*}, w w^{*}$ are selected but not $w_{1}^{*}$ ; VIII = ww* is selected but $w_{1}^{*}, w_{2}^{*}$ are missed; IX = $w_{1}^{*}$ is selected but $w_{2}^{*}, w w^{*}$ are missed, or $w_{2}^{*}$ is selected but $w_{1}^{*}, w w^{*}$ are missed; X = $w_{1}^{*}, w_{2}^{*}, w w^{*}$ are all missed. For the GLM model selection on the dummy variables {w₁, v₁, w₂, v₂, ww, wv, vw, vv}, we define the analogous 10 categories I-X based on the variables w₁, v₁, w₂, v₂, ww, wv, vw, vv. Under each simulation scenario, the counts of selected GLM and GMA model types for stepwise selection are listed in Table 1.
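The ten categories can be expressed as a small classification routine, which may help when tallying simulation results. The variable labels ("w1", "w2", "ww", "v1", "v2") and the function name are hypothetical, and one case is handled by assumption: a selection containing w1 and w2 without ww but with v1 or v2 matches none of the ten definitions, so the sketch returns `None` for it.

```python
def classify(selected):
    """Map the set of selected variable names to a category I-X (or None)."""
    s = set(selected)
    w1, w2, ww = "w1" in s, "w2" in s, "ww" in s
    v1, v2 = "v1" in s, "v2" in s
    if w1 and w2 and ww:
        if v1 and v2:
            return "I"
        if v1:
            return "II"
        if v2:
            return "III"
        return "IV"
    if w1 and w2:                       # both main effects, interaction missed
        return None if (v1 or v2) else "V"
    if w1 and ww:
        return "VI"
    if w2 and ww:
        return "VII"
    if ww:
        return "VIII"
    if w1 or w2:                        # exactly one main effect, no interaction
        return "IX"
    return "X"


# Example tally over three hypothetical selected models.
tally = {}
for model in [{"w1", "w2", "ww", "v1", "v2"}, {"w1", "ww"}, {"v1"}]:
    label = classify(model)
    tally[label] = tally.get(label, 0) + 1
```

The lower-ranked variables wv, vw, and vv are ignored by the routine, matching the focus on the top five variables.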

Table 1. Counts of selected GLM and GMA model types for “Stepwise Selection”.

For the stepwise selection, the GMA model shows a clear advantage over the GLM in choosing the Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models. Meanwhile, the GLM tends to miss one of the main fixed effects (Type VI, VII, or VIII) when the sample size is ≤ 2000, even though a Type VI, VII, or VIII GLM could imply an equivalent GMA model of Type IV. Overall, as the sample size increases, the model selection improves using either GLM or GMA models. But the selected GLMs are more varied, while the selected GMA models are more predictable.

It is known that conventional stepwise model selection has problems: the stepwise F-statistics may no longer follow F-distributions, and the collinearity among the selected variables can be aggravated (Harrell, 2001). Several regularized regression methods have been developed to deal with the model selection problem, especially for high-dimensional data. Here we also apply one of the regularized regression methods, the adaptive LASSO, using “proc glmselect” in SAS. We adopt the Bayesian information criterion (BIC) for model selection. The variables are also standardized prior to the model selection. Note that the standardization of the variables {w₁, v₁, w₂, v₂, ww, wv, vw, vv} or $\{w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}\}$ only affects the scales of the regression coefficients and does not change the correlation structure within each set of variables, while making mean-corrections on the indicator variables $\{z_{1 j}^{(1)}, z_{2 k}^{(1)}\}$ and $\{z_{1 r}^{(2)}, z_{2 s}^{(2)}\}$ can lead to different correlation structures between the dummy variables {w₁, v₁, w₂, v₂, ww, wv, vw, vv} and the index variables $\{w_{1}^{*}, v_{1}^{*}, w_{2}^{*}, v_{2}^{*}, w w^{*}, w v^{*}, v w^{*}, v v^{*}\}$ . The counts of selected GLM and GMA model types for adaptive LASSO are listed in Table 2.
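For readers unfamiliar with the method, the idea behind the adaptive LASSO can be sketched in NumPy. This is not the SAS procedure used in the study: it is a plain coordinate-descent illustration that assumes LS-based adaptive weights 1/|b_ols| and a fixed, hand-picked penalty `lam` instead of BIC-based tuning, and it is run on a hypothetical noise-free toy data set.

```python
import numpy as np


def adaptive_lasso(X, y, lam, n_sweeps=200):
    """Adaptive LASSO via coordinate descent on rescaled columns.

    Minimizes (1/2n)||y - X b||^2 + lam * sum_j |b_j| / |b_ols_j|,
    where b_ols is the ordinary LS fit. X and y are centered internally.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    scale = np.abs(b_ols)          # inverse of the adaptive weights
    Xs = X * scale                 # rescaling turns the problem into a plain LASSO
    gamma = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            z = Xs[:, j] @ Xs[:, j] / n
            if z == 0.0:           # column wiped out by a zero LS estimate
                gamma[j] = 0.0
                continue
            partial = y - Xs @ gamma + Xs[:, j] * gamma[j]
            rho = Xs[:, j] @ partial / n
            gamma[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z  # soft threshold
    return gamma * scale           # back to the original coefficient scale


# Hypothetical toy data: two active predictors and two inactive ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.8, 0.0, 2.0, 0.0])
beta = adaptive_lasso(X, y, lam=0.01)
```

Predictors with a small LS estimate receive a large penalty and are dropped, while those with a large LS estimate are only mildly shrunk, which is the property that makes the adaptive weights useful for variable selection.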

Table 2. Counts of selected GLM and GMA model types for “Adaptive LASSO”.

For the adaptive LASSO, it appears that the GMA model performs slightly better in choosing Type I (the true model), Type I+II, Type I+II+III, Type I+II+III+IV, or Type I-V models when the sample size is 500. However, this advantage diminishes for larger sample sizes. Still, the selected GLMs appear to be more varied, while the selected GMA models are more predictable. On the other hand, it seems that the selected GMA models are more likely to miss all three top variables $w_{1}^{*}, w_{2}^{*}, w w^{*}$ than the GLMs are to miss w₁, w₂, ww. A brief comparison between Tables 1, 2 also reveals that the stepwise selection performs better than the adaptive LASSO under our simulation setting. But this may not be a fair comparison, as the variables available for selection in our simulation setting are very limited. Moreover, besides BIC, many other criteria are available for model selection in “proc glmselect.” Further exploration of those criteria is needed.

3. Discussion

In this study, we applied the GMA models to the analysis of variance components for genetic markers with unphased genotypes. We pointed out that the traditional Fisher's ANOVA model does not explicitly specify the random variables that contribute to the various genetic variance components. The constraints on its model parameters also complicate the model fitting. Meanwhile, the classical dummy-variable based GLM does not provide the same partition of the genetic variance components as the original Fisher's ANOVA model. The hypothesis tests of the fixed allelic effects in a GLM based on the partial (Type III) reduction in sums of squares could also be inadequate for testing the existence of variance components when allelic interactions are present. Alternatively, the GMA model can retain the same partition of the genetic variance components as the traditional Fisher's ANOVA model. Similar to the classical GLM, the GMA model does not require constraints on its model parameters and can be fitted using the ordinary least squares approach. As a result, the GMA model allows us to estimate and test for the genetic variance components more conveniently than the classical GLM.

The classical GLM is appealing in genetic association studies due to its simplicity in interpretation of the model parameters, which often represent certain comparisons of the expected genotypic values in different genotype groups. However, the GLM-based approach faces challenges in dealing with allelic interactions because the fixed lower-order allelic effects are often confounded with the higher-order allelic interactions in a GLM even when the paternal and maternal alleles are independently inherited. In order to test for a particular fixed allelic interaction based on the classical GLM, we need to include all its lower-order effects in the model to make this allelic interaction interpretable. Besides, ignoring certain higher-order interactions in the model could also affect the definition of this particular allelic interaction due to their potential confounding, as pointed out in Zeng et al. (2005). On the other hand, analysis of genetic variance components using ANOVA type models such as GMA provides an alternative way of assessing the allelic effects and interactions. A nice feature of the GMA model is that the genetic components are independent of each other when multiple genetic markers are unlinked and in equilibrium populations. Therefore, at least in equilibrium populations, each genetic component can be treated as a random source of variation and tested individually regardless of its lower- or higher-order genetic components. A statistically significant higher-order genetic variance component implies variation contributed by varied allele types from a set of loci, which could be easier to interpret than a significant fixed allelic interaction as evidence of allelic interactions.

As shown in Wang (2014), the extension of GMA to multiple loci is straightforward. The GMA model could also be applied to other phenotypic outcomes. For example, under the generalized linear mixed model framework, we can consider the genetic components contributing to the g-transformed expected outcome, where g is a link function. For survival outcomes, we can likewise examine the genetic components contributing to the hazard function. However, it should be pointed out that the estimation of genetic variance components for genetic markers relies on the genotypic distribution of the markers in the study population. When a sample is not obtained by simple random sampling, an adjustment for the sampling strategy is needed in order to make appropriate statistical inference about the general population. Currently, genome-wide association studies (GWAS) often adopt a case-control design. When the case and control sampling rates are known, we could assess the variance components of genetic markers based on their odds ratio estimates. A thorough exploration of applying GMA-type models to case-control GWAS could be a topic for future research.

Author Contributions

TW planned the study, conducted the derivation, and wrote the manuscript.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer ZY and handling editor declared their shared affiliation, and the handling editor states that the process nevertheless met the standards of a fair and objective review.

Acknowledgments

The author would like to thank his colleague Dr. Kwang Woo Ahn in the Division of Biostatistics at the Medical College of Wisconsin for his advice on implementation of the adaptive LASSO method.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene.2016.00123

References

Fisher, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinburgh 52, 399–433.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh; London: Oliver & Boyd.

Gallais, A. (1974). Covariances between arbitrary relatives with linkage and epistasis in the case of linkage disequilibrium. Biometrics 30, 429–446.

Harrell, F. (2001). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York, NY: Springer-Verlag.

Henderson, C. R. (1953). Estimation of variance and covariance components. Biometrics 9, 226–252.

Kempthorne, O. (1957). An Introduction to Genetic Statistics. New York, NY: Wiley.

Searle, S. R. (1971). Linear Models. New York, NY: John Wiley & Sons, Inc.

Searle, S. R. (1995). An overview of variance component estimation. Metrika 42, 215–230.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. New York, NY: John Wiley & Sons, Inc.

Wang, T. (2011). On coding genotypes for genetic markers with multiple alleles in genetic association study of quantitative traits. BMC Genet. 12:82. doi: 10.1186/1471-2156-12-82

Wang, T. (2014). A revised Fisher model on analysis of quantitative trait loci with multiple alleles. Front. Genet. 5:328. doi: 10.3389/fgene.2014.00328

Wang, T., and Zeng, Z. B. (2009). Contribution of genetic effects to genetic variance components with epistasis and linkage disequilibrium. BMC Genet. 10:52. doi: 10.1186/1471-2156-10-52

Weir, B. S., and Cockerham, C. C. (1977). “Two-locus theory in quantitative genetics,” in Proceedings of the International Conference on Quantitative Genetics, eds E. Pollak, O. Kempthorne, and T. B. Bailey (Ames: Iowa State University Press), 247–269.

Zeng, Z.-B., Wang, T., and Zou, W. (2005). Modeling quantitative trait loci and interpretation of models. Genetics 169, 1711–1725. doi: 10.1534/genetics.104.035857

Keywords: Fisher's ANOVA model, analysis of variance, general linear model, general multi-allelic model, genetic variance components, orthogonal partition, allelic interactions, least square estimates

Citation: Wang T (2016) Analysis of Variance Components for Genetic Markers with Unphased Genotypes. Front. Genet. 7:123. doi: 10.3389/fgene.2016.00123

Received: 25 March 2016; Accepted: 22 June 2016;
Published: 13 July 2016.

Edited by:

Steven J. Schrodi, Marshfield Clinic Research Foundation, USA

Reviewed by:

Zhan Ye, Marshfield Clinic, USA
Charles M. Rowland, Quest Diagnostics, USA

Copyright © 2016 Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Tao Wang, taowang@mcw.edu