Comparison of F-tests for Univariate and Multivariate Mixed-Effect Models in Genome-Wide Association Mapping

Genome-wide association mapping (GWA) has been widely applied to a variety of species to identify genomic regions responsible for quantitative traits. The use of multivariate information could enhance the detection power of GWA. Although mixed-effect models are frequently used for GWA, the utility of F-tests for multivariate mixed-effect models is not well-recognized. Thus, we compared the F-tests for univariate and multivariate mixed-effect models with simulations. The superiority of the multivariate F-test over the univariate test varied depending on three parameters: phenotypic correlation between variates (r), relative size of quantitative trait locus effects between variates (ad), and missing proportion of phenotypic records (mprop). Simulation results showed that, when mprop was low, the multivariate F-test outperformed the univariate test as r and ad differ, and as mprop increased, the multivariate F-test outperformed as ad increased. These observations were consistent with results of the analytical evaluation of the F-value. When mprop was at the maximum, i.e., when no individual had phenotypic values for multiple variates, as in the case of meta-analysis, the multivariate F-test gained more detection power as ad increased. Although using multivariate information in mixed-effect model contexts did not always ensure more detection power than with univariate tests, the multivariate F-test will be a method applied when multivariate data are available because it does not show inflation of signals and could lead to new findings.


INTRODUCTION
Genome-wide association mapping (GWA) has been widely applied to humans, animals, and plants to identify genomic regions responsible for quantitative traits, which has been made feasible by decreases in the cost and time required to obtain genome-wide single-nucleotide polymorphisms (SNPs) and sequences. Whereas various statistical methods have been proposed for GWA, particularly in recent animal and plant studies, mixed-effect models are often used to correct population stratification (e.g., Zhao et al., 2011;Sahana et al., 2014;Yano et al., 2016;Frischknecht et al., 2017) inspired by Yu et al. (2006). The detection power of GWA primarily depends on the sample size, which is common for statistical methods. Thus, when few samples (genotypes) are available, as often seen in plant studies (e.g., Zhao et al., 2011;Yano et al., 2016;Minamikawa et al., 2017), GWA may fail to detect any responsible regions or, even if it does, only a few regions are found.
In contrast, biological data are usually multivariate. For example, phenotypes are often measured for multiple traits. Phenotypes may be measured on multiple occasions or in different locations and/or by multiple individual using a variety of methods. For plants, multi-year and/or multi-environment evaluation is often conducted to understand the complex reactions of plants to environmental stimuli, which can be observed as genotype-by-environment interactions. Moreover, meta-analysis or phenotypes belonging to different groups of individuals, for example, sex, age, or geographical populations, can be considered as multivariate with the phenotypic record of each sample consisting of only a single variate.
Thus, the use of multivariate information to enhance the detection power of GWA is straightforward. Indeed, various methods for multivariate GWA or quantitative trait locus (QTL) mapping have been proposed (Piepho, 2005;Banerjee et al., 2008;Ferreira and Purcell, 2009;Kim and Xing, 2009;O'Reilly et al., 2012;Zhang et al., 2014;Guo et al., 2015;Ray et al., 2016;Wang et al., 2016;Cheng et al., 2017), and these methods have been compared (Galesloot et al., 2014). However, these methods do not include polygenic effects, essential parts of mixed-effect models, which are effective for adjustment of population structure. Although GWA based on multivariate mixed-effect models has also been studied, the focus has been on computational efficiency (Zhou and Stephens, 2014;Furlotte and Eskin, 2015;Joo et al., 2016), and not on the properties of statistical tests. Simulation analysis for statistical tests based on multivariate mixed-effect models conducted to date remains insufficient in terms of the ranges of scenarios; relative QTL effect sizes among variates, phenotypic correlation, and the missing proportion of phenotypes are not fully considered (e.g., Korte et al., 2012;Zhou and Stephens, 2014).
In the present study, we formulated the F-test for multivariate mixed-effect models of general form; both empirical and analytical evaluations were performed to elucidate the properties of the F-tests for multivariate mixed-effect models. The F-test (or the Wald-type test) is usually used for association analysis based on univariate mixed-effect models (e.g., R package, rrBLUP, Endelman, 2011). The multivariate F-test simultaneously tests the effect of SNPs on multiple variates. Thus, the purpose of the multivariate F-test introduced here is to identify genomic regions that are common to multiple variates and cannot be detected by the univariate test for each variate.

MATERIALS AND METHODS
Univariate F-test in GWA GWA is often conducted assuming the following univariate mixed-effect model, where Y is the vector of response variables (e.g., phenotypic records), X is the covariate matrix, including intercepts, genotypes of the SNP to be tested, and principal components for adjustment of population structure, B is the fixed effects of the covariates, Z is the design matrix, u is the polygenic effects, and e is the residuals. u and e follow multivariate normal distributions, where A is the genomic relationship matrix defined by genomewide SNPs (e.g., VanRaden, 2008), and σ 2 u is the additive genetic variance explained by SNPs and where σ 2 e is the residual variance. The significance of B can be assessed by the F-test (Henderson, 1984;Kennedy et al., 1992). (1) Here, V is the phenotypic covariance, i.e., where n is the number of individuals and p is the number of fixed effects. H is a matrix to indicate which effects inB are tested. For example, when p is five and the second effect is tested, H ′ = [0 1 0 0 0]. Here we assume that when the fixed effect of interest is 0, the F-statistic follows an F distribution with the numerator degrees of freedom being f and the denominator degrees of freedom being n − p. We refer to this test as the univariate F-test. However, note that this assumption ignores the reduction of the denominator degrees of freedom owing to estimation of variance components. Because of this reduction, the F-test introduced here can underestimate (inflate) p (−log10P) values. Nevertheless, we propose that the methods and results presented here are useful in practice because p-values obtained for negative SNPs (i.e., SNPs that were unlinked to QTLs) were not underestimated in simulations presented later, and thus the reduction of degrees of freedom would not affect GWA given the number of samples usually used in GWA. This issue is revisited in the Results and Discussion.

Multivariate F-test
A multivariate mixed model can be written as Here, Y m is the vector of response variables of length n × d, where d is the number of variates (e.g., traits or experiment years). X m is the matrix of covariates, including the intercepts, SNP genotypes, and so on. For example, when d is two, B m is the vector of fixed effects of length p × d where p is the number of fixed effects for each variate. Z m is the design matrix, which for the example of d = 2, takes the following form, u m and e m are the polygenic effects and residuals, respectively, and where ⊗ indicates the Kronecker product and 2 u is the d × d genetic covariance matrix, and where 2 e is the d × d residual covariance matrix. In the F-test introduced here for this multivariate mixed-effect model, SNP effects on all of the variates are tested simultaneously. The F-statistic in the multivariate case is (2) We assume that when the fixed effect of interest is 0, the Fstatistic follows an F distribution with the numerator degrees of freedom being f and the denominator degrees of freedom being n × d − p. When Y m includes missing records, the dimensions of Y m , X m , and V m and the denominator ofσ 2 m decrease accordingly, excluding the missing records from the model. In this multivariate setting, H is constructed as follows. For example, when d is two (variates 1 and 2) and , where µ 1 and µ 2 are the intercepts and B 1 and B 2 are the effects of a SNP, H becomes We refer to this test as the multivariate F-test. Korte et al. (2012) used the F-test for bivariate mixedeffect models, and GEMMA provides the Wald, likelihood ratio, and score tests (Zhou and Stephens, 2014). The F-test is asymptotically equivalent to the Wald test, and the likelihood ratio test is equivalent to the Wald test when the parameters except for the one to be tested are equal in the null and alternative models. Furlotte and Eskin (2015) and Joo et al. (2016) also provide the multivariate F-tests based on transformed matrices for high computational efficiency.

Marker Genotypes
Genome-wide markers were generated using a coalescent simulator, Genome (Liang et al., 2007). We assumed a diploid selfing species because small sample sizes are often seen in crop studies (e.g., 176 rice varieties in the study by Yano et al., 2016). The number of chromosomes was five, and it was assumed that there were 2,000 SNPs on each chromosome. There were 200 individuals. The parameters of Genome used for the simulation were as follows: "-pop 1 200 -N 1000 -c 1 -pieces 2000 -s 2000 -rec 0.0002." A typical distribution of linkage disequilibrium among SNPs is shown in Figure S1. SNP genotypes were generated 100 times with the simulator. We pruned SNPs with a minor allele frequency <0.02, which resulted in the average number of SNPs being 6390 (± 82). Using these genotype data, 100 data sets were generated for each scenario described below.

QTL Effects and Phenotypes
We selected four SNPs per chromosome randomly as QTLs (i.e., a total of 20 QTLs). Among the QTLs, one QTL was randomly selected as a "target QTL, " and the remaining ones were used as "background QTLs." For the background QTLs, the additive genetic effects were simulated using a multivariate normal distribution, where B m,bQTL is a vector of background QTL effects with length d (number of variates), and Σ is a d by d correlation matrix represented with a single correlation parameter r, i.e., For the target QTL, the QTL effect on the first variate was generated from the standard normal distribution, and then the effect on the jth variate (d ≥ j > 1) was determined as where a j is a real value between −1 and 1. We investigated the detection power of the multivariate F-test by assessing the power on the target QTLs while varying a j and r values. Total genetic effects, u m , were generated by summing all of the products between the QTL effects and QTL genotypes. The residuals of individual i were generated using the same correlation matrix Σ, where S is a diagonal matrix whose elements were the standard deviations of the total genetic effects (u m ). Thus, the heritability was 0.5 throughout the simulation. The phenotypic correlation was largely Σ. Because, as illustrated in the previous section, the F-value is influenced by the phenotypic covariance (V) rather than the individual variance components ( 2 u and 2 e ), we did not simulate situations where the genetic and residual correlations differ.
We set d to 2, 4, and 8. When d = 2, r was set to −0.95, −0.9, −0.8, . . . , 0.8, 0.9, and 0.95. When d = 4 and 8, r was 0, 0.4, 0. We did not assess the detection power of F-tests using the background QTLs. The rationale for this is as follows. When we draw random variables from a multivariate normal distribution and represent the variables using a d , a wide range of a d values are obtained. For example, when we draw random values from Approximately 80% of variable pairs are represented with a d < 0.9, and even 10% of pairs show a d < 0. However, as we show later, a d has an impact on the detection power of multivariate F-tests. Thus, using the background QTLs, the power of multivariate F-tests could not be assessed appropriately.

Missing Phenotypes
Missing phenotypic values were randomly generated, such that every individual had a phenotypic value for at least one variate. In the analysis of each scenario, we used SNPs that were not selected as QTLs to construct the genomic relationship matrix for the polygenic effects. Then, the effect of the target QTL was tested. Principal components were not added. Other than the QTL genotypes, only intercepts for each variate were added to the model as the fixed effects.

Implementation
The univariate and multivariate F-tests were implemented using R (R Development Core Team, 2011). Variance components (σ 2 u , σ 2 e , 2 u , and 2 e ) were estimated using remlf90 software (Misztal et al., 2002), which is based on an EM algorithm. The estimation of variance components was conducted using the null model, i.e., a model that did not include any SNP genotypes as fixed effects; then, F-tests were performed for each SNP using the estimated variance components. This approximation procedure was proposed in Kang et al. (2010) and Zhang et al. (2010). R scripts for the multivariate F-tests are available at https://github.com/Onogi/MultivariateFtest. Simulated data and analysis results are available upon request.

RESULTS AND DISCUSSION
We assessed the power of the multivariate F-test by comparing the p-values for the target QTLs obtained by the multivariate and univariate F-tests. Because the additive effect of the target QTL on the first variate was simulated to be greatest among the variates (see Materials and Methods), we compared the-log10p-values of the multivariate F-tests with those of the univariate tests for the first variate, which was expected to be highest among the variates.
When d = 2 and the data had no missing records, the multivariate F-test outperformed the univariate test as the differences between a d and r became large (Figure 1). This tendency was also confirmed by analytical evaluation of the numerator of the F-value in the multivariate F-test (Equation 2). The numerator can be expressed as a function of a d and r by simplifying model assumptions as follows. Suppose that d = 2, V = 1 r r 1 , β tQTL,2 = a 2 β tQTL,1 , and A = I. Then, the numerator of the F-value is where x is the vector of the genotypes of the QTL. The heat map of f (a 2 , r), shown in Figure 2, is consistent with the heat map when d = 2, shown in Figure 1.
The results when d = 4 and 8 differ from that when d = 2, particularly when both r and a d are near zero (Figure 1). This finding is probably because we determined a j (d > j >1) as the intermediate between 1 and a d with constant intervals (see Materials and Methods). For example, when d = 4 and a 4 = 0, we set a 2 and a 3 to 0.666 and 0.333, respectively. This means that the QTL effects on the first to fourth variates become B tQTL,1 , 0.666B tQTL,1 , 0.333B tQTL,1 , and 0, respectively. If we focus on variates 1 and 2 and examine the heat map for d = 2 (Figure 1), with this a d value (0.666), the multivariate F-test is superior to the univariate test within the range from r = 0 to 0.3, which is consistent with the results when d = 4 (Figure 1). These results suggest that, although a j will take various values when d > 2 in real-data analysis, the behavior of the multivariate F-test can be interpreted based on the results when d = 2 to some extent.
For each d, as the missing proportion increased, the range of combinations of r and a d where the multivariate F-test showed superiority gradually increased (Figure 3). This tendency was prominent when the number of variates was high. In contrast, the difference of-log10p (i.e., the surface of the heat map) became FIGURE 1 | Mean difference of -log10p values between the multivariate and univariate F-tests. The p-values of the univariate F-test were obtained from the test on the first variate, which had the greatest QTL effect. Warm colors indicate that the multivariate F-test showed higher-log10p-values than the univariate test, and cold colors indicate that the univariate F-test showed higher values. d, r, and a d denote the number of variates, phenotypic correlation, and the relative size of QTL effects between variates, respectively. Missing proportions of phenotypes are zero. For the results when d = 4 and d = 8, the mean differences of the scenarios that were not tested were interpolated using spline regression implemented in the mgcv R package.  (a 2 , r) is a term appearing in equation (3), which is the numerator of the F-value of the multivariate F-test, making several assumptions for simplification (see the main text). Higher values are relevant with lower p-values. similar (flatter) among the simulation scenarios. Note that, from Figure 3, we dropped the results when d = 8 and m prop = 0.656 because of a failure of variance component estimation in 98.1% of analyzes. We return to this issue later. When m prop was the maximum, i.e., when every individual had a phenotypic value only for a single variate, the superiority of the multivariate F-test depended on a d , i.e., as a d increased, the test outperformed the univariate one. However, it was also observed that the gain in-log10p was minimal. The fact that r became less influential when m prop was the maximum can be illustrated analytically. Suppose that d = 2 (variates 1 and 2) and the first half of individuals (group X) has values only for variate 1 and the second half (group Y) has values only for variate 2. Let , and Then, the phenotype (co)variance matrix appearing in equation (2) can be written as where I X and I Y are the identity matrices for groups X and Y, respectively. Because groups X and Y do not have values for variates 2 and 1, respectively, the relevant rows and columns are removed from V m resulting in: The parameter r disappears from the residual covariance (the second term). Although r still appears in the genetic covariance (the first term), the influence of r on the F-value is expected to decrease.
To examine the multivariate F-test not inflating the p-values of negative SNPs, QQ plots were drawn for the SNPs that were on the same chromosome as the target QTLs and were not associated with any QTLs (i.e., r 2 < 0.1 with any QTLs; Figures S2-S5). The results showed that the multivariate F-tests did not inflate the p-values of negative SNPs. Rather, p-values tended to be overestimated as d increased (Figures S4, S5).
A major drawback of the multivariate F-test is the computational difficulty in estimating the variance components, which became severe as d and m prop increased. We observed that the solver we used (remlf90) occasionally failed to converge within the default iteration number (5,000) or to return a positive-definite matrix. As mentioned above, when d = 8 and m prop = 0.656, the estimation failed in most cases in this scenario. In other scenarios, the frequencies of failure were 0.004, 0.054, and 0.081 when d = 2, 4, and 8, respectively. This issue would be solved using Gibbs sampling or by estimating the covariance for each pair of variates independently, as is often done in the genetic analysis of multiple traits (e.g., Nogi et al., 2011). When missing records are included, this issue may be mitigated by imputation. Computational speed also may be problematic, particularly when n and d increase. The most time-consuming step for evaluation of the F-value is the calculation of X ′ m V −1 m Y m , which is proportional to n (n + 1) d 3 .
Considering this computational difficulty, meta-analyzes will be an attractive alternative to use multivariate information. For example, TETAS can combine p-values obtained by univariate mixed-effect models considering correlations between variates (van der Sluis et al., 2013). Comparing such meta-analysis methods with F-tests based on multivariate mixed-effects models is interesting, and we leave this issue for future studies. As mentioned in the Materials and Methods, explicitly, the statistics in equations (1) and (2) do not follow the F distributions with the denominator degrees of freedom, n − p and n × d − p, respectively. However, we did not observe inflation of −log10pvalues in simulations (Figures S2-S5), and thus we consider that the reduction of the denominator degrees of freedom owing to variance component estimation is not influential given the sample size in GWA. When accurate p-values are required, several remedies are available. The first one is permutation or parametric bootstrapping. These methods are time-consuming because, for each SNP, the null model is fitted to the generated data repeatedly to obtain the null distribution. Thus, genomewide scans using these methods will not be feasible. Nevertheless, these methods will be useful for obtaining accurate p-values for SNPs that are pruned using other methods, including Ftests. The second one is to adjust the statistics such that the statistics follow F distributions as proposed in Kenward and Roger (1997). The method estimates the denominator degrees of freedom and a scaling parameter to match the moments of the statistics to those of the F distributions. Several extensions to multivariate models can be found in Alnosaier (2007). Although pursuing this second approach in GWA may be important from the theoretical viewpoint, in practice, applying permutation or parametric bootstrapping to a limited number of SNPs detected by F-tests or other methods would be useful.
When missing records are included, both the multivariate and univariate F-tests can be conducted for imputed data. Phenotype imputation is essential for several estimation algorithms for multivariate mixed-effect models because of transformation using eigen vectors, which is required for high computational efficiency (e.g., Zhou and Stephens, 2014;Lee and van der Werf, 2016). Phenotype imputation using multivariate information has been proposed for GWA (Dahl et al., 2016) and plant breeding (Hori et al., 2016). Although we evaluated the multivariate F-test without imputation in the present study, the performance of the multivariate and univariate F-tests based on imputed data would also depend on the magnitude of correlation between variates (r) as observed here because it will influence the imputation accuracy regardless of the imputation methods.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and has approved it for publication.

FUNDING
Japan Science and Technology Agency PRESTO (grant number 15656049).