A Set-Based Mixed Effect Model for Gene-Environment Interaction and Its Application to Neuroimaging Phenotypes

Imaging genetics is an emerging field for the investigation of neuro-mechanisms linked to genetic variation. Although imaging genetics has recently shown great promise in understanding biological mechanisms for brain development and psychiatric disorders, studying the link between genetic variants and neuroimaging phenotypes remains statistically challenging due to the high-dimensionality of both genetic and neuroimaging data. This becomes even more challenging when studying gene-environment interaction (G×E) on neuroimaging phenotypes. In this study, we proposed a set-based mixed effect model for gene-environment interaction (MixGE) on neuroimaging phenotypes, such as structural volumes and tensor-based morphometry (TBM). MixGE incorporates both fixed and random effects of G×E to investigate homogeneous and heterogeneous contributions of multiple genetic variants and their interaction with environmental risks to phenotypes. We discuss the construction of score statistics for the terms associated with fixed and random effects of G×E to avoid direct parameter estimation in the MixGE model, which would greatly increase computational cost. We also describe how the score statistics can be combined into a single significance value to increase statistical power. We evaluated MixGE using simulated and real Alzheimer's Disease Neuroimaging Initiative (ADNI) data, and showed statistical power superior to other burden and variance component methods. We then demonstrated the use of MixGE for exploring the voxelwise effect of G×E on TBM, made feasible by the computational efficiency of MixGE. Through this, we discovered a potential interaction effect of gene ABCA7 and cardiovascular risk on local volume change of the right superior parietal cortex, which warrants further investigation.

Imaging genetics is an emerging field for the investigation of neuro-mechanisms linked to genetic variation. Although imaging genetics has recently shown great promise in understanding biological mechanisms for brain development and psychiatric disorders, studying the link between genetic variants and neuroimaging phenotypes remains statistically challenging due to the high-dimensionality of both genetic and neuroimaging data. This becomes even more challenging when studying gene-environment interaction (G×E) on neuroimaging phenotypes. In this study, we proposed a set-based mixed effect model for gene-environment interaction (MixGE) on neuroimaging phenotypes, such as structural volumes and tensor-based morphometry (TBM). MixGE incorporates both fixed and random effects of G×E to investigate homogeneous and heterogeneous contributions of multiple genetic variants and their interaction with environmental risks to phenotypes. We discuss the construction of score statistics for the terms associated with fixed and random effects of G×E to avoid direct parameter estimation in the MixGE model, which would greatly increase computational cost. We also describe how the score statistics can be combined into a single significance value to increase statistical power. We evaluated MixGE using simulated and real Alzheimer's Disease Neuroimaging Initiative (ADNI) data, and showed statistical power superior to other burden and variance component methods. We then demonstrated the use of MixGE for exploring the voxelwise effect of G×E on TBM, made feasible by the computational efficiency of MixGE. Through this, we discovered a potential interaction effect of gene ABCA7 and cardiovascular risk on local volume change of the right superior parietal cortex, which warrants further investigation.

INTRODUCTION
Neurodegenerative diseases such as Alzheimer's disease (AD) are highly heritable, but how specific genetic variants contribute to these diseases remains largely undetermined (Pedersen et al., 2004;Bertram et al., 2010;Ridge et al., 2013). Neuroimaging has received great attention for understanding the genetic contribution to psychiatric disorders. It provides attractive intermediate phenotypes, including brain morphology, functional activity, and brain wiring, etc, that are closer to the biology of genetic function than clinical phenotypes of illnesses or cognitive phenotypes Mattay et al., 2008;Bigos and Weinberger, 2010). Imaging genetics has thus become an emerging field for the investigation of neural mechanisms underlying genetic variation. Imaging genetics aims to determine how differences in single nucleotide polymorphism (SNP) lead to differences in brain anatomy and function, and hence to understand how variants in SNPs lead to diseases. Although imaging genetics has recently shown great promise in the domains of studying brain development (Viding et al., 2006;de Geus et al., 2008;Mattay et al., 2008;Rasch et al., 2010), as well as psychiatric disorders Meyer-Lindenberg and Weinberger, 2006;Domschke and Dannlowski, 2010;Durston, 2010;Scharinger et al., 2010;Tost et al., 2012), studying the link between genetic variants and neuroimaging phenotypes remains statistically challenging, especially when dealing with high dimensional neuroimaging data, such as tensor-based morphometry, whole brain functional activity, etc. This becomes even more challenging when studying the effect of G×E on neuroimaging phenotypes.
Genome-wide association studies (GWAS) is a simple and widely used technique to determine associations between genetic polymorphisms and neuroimaging phenotypes, such as structural volumes (Stein et al., 2012). For GWAS, each SNP is independently tested for association with a scalar measure using regression models. There are however some limitations to genome-wide association studies (GWAS). Common genetic variants only account for a small fraction of the heritability of brain phenotypes such as hippocampal and subcortical volumes (Stein et al., 2012;Hibar et al., 2015). "Missing, " or perhaps "hidden, " heritability may be accounted for by rare variants (Korte and Farlow, 2013), which are not even considered in GWAS due to the lack of statistical power (Lee et al., 2014). Population genetic theory supports the idea that rare variants may play a significant role in brain-related diseases (Manolio et al., 2009). But rare variants can only be identified using GWAS with a large sample size (Manolio et al., 2009), which is difficult to achieve in imaging genetics. In fact, even common variants do not usually pass genome-wide significance if they have very small effect sizes, after adjusting for the very large number of variants that are tested (Gibson, 2010). GWAS becomes even more difficult if voxelwise neuroimaging measures are used as phenotypes, such as voxelwise GWAS (Stein et al., 2010;Hibar et al., 2011;Ge et al., 2012;Huang et al., 2015), due to computational cost and the lack of statistical power.
Set-based association tests are an alternative class of techniques that are able to overcome the aforementioned limitations of GWAS with regards to rare variants or common variants with small effect sizes. In contrast to GWAS, this class of techniques examines whether a set of genetic variants is collectively associated with a phenotype. The variant set can be defined as variants that belong to a particular gene, pathway, or any other biologically meaningful systems. With a set-based association test, genetic effects of individual variants are accumulated within the set into a single large effect, such that the set could be significantly associated with phenotype even if the individual variants are not. Furthermore, a setbased association test reduces the number of independent statistical tests conducted, which implies less stringent correction for multiple comparisons than that for GWAS. Lee et al. (2014) reviewed five different types of set-based association tests, including burden tests, adaptive burden tests, variance component tests, combined tests, and Exponential Combination test. Burden tests, such as Cohort Allelic Sum Test (Morgenthaler and Thilly, 2007) and Weighted Sum Statistic (Madsen and Browning, 2009), collapse variants into a burden score, and are more powerful when a large proportion of the variants in a set is causal, and the effects are homogeneous (equally deleterious for example). Adaptive burden tests, such as Data-adaptive Sum Test (Han and Pan, 2010) and Variable Threshold (Price et al., 2010), use data adaptive weights, and are more powerful than burden tests but are computationally intensive. Variance component tests, such as C-alpha test (Neale et al., 2011) and sequence kernel association test (SKAT) (Wu et al., 2011), examine the variance of genetic effects, and are more powerful when the effects of individual genetic variants are heterogeneous, even if the variants have opposing directions of effects. Combined tests -such as Optimal SKAT (Lee et al., 2012), Fisher method (Derkach et al., 2013), and mixed-effects score test (MiST)  -and Exponential Combination test (Chen et al., 2012) combine burden and variance component tests and hence incorporate both homogeneous and heterogeneous effects of individual genetic variants.
Combined tests are generally more desirable because only mild distributional assumptions are made about the underlying nature of the effects of individual genetic variants on disease, which is largely unknown (Lee et al., 2014). For MiST, two score statistics are derived -one for homogeneous characteristics, and the other for heterogeneous effects -and both are easy to calculate under their respective null hypotheses to test genetic effects on phenotypes. The significance levels of the score statistics are also easy to determine because their respective distributions are known under reasonable parametric assumptions. In addition, since the two score statistics are independent of one another, the resulting p-values of the two score statistics can be combined using simple formulae, which leads to a single significance value. This results in the computational efficiency of MiST. MiST is a very general model that includes burden test and SKAT as special cases. In comparison with MiST, the weighted linear combination of SKAT and burden statistics in Optimal SKAT is less ideal, since the two score statistics are not statistically independent . For Fisher method and exponential combination test, the estimation of the significance level of the statistics is computationally intensive because the joint distribution of the statistics is not known and permutation analysis is required (Lee et al., 2014).
While genetics has significant contributions to many psychiatric disorders, they are often functional under certain environment. Indeed, G×E has been shown to be relevant for many psychiatric disorders (Cadoret et al., 1983;Milberger et al., 1997;Wahlberg et al., 1997;Yaffe et al., 2000;Eley et al., 2004;Kim-Cohen et al., 2006). Environmental factors, such as disease risk factors, could interact with genetic factors to result in variations in phenotypes. For example, genetics may modulate the effects of various risk factors on the manifestation of a disease, causing varying severities of the disease across individuals even though they may be exposed to the same risk factors. Recently, G×E has been incorporated into burden tests (Jiao et al., 2013) and variance component tests (Lin et al., 2013(Lin et al., , 2016Ge et al., 2015). However, as mentioned earlier, burden tests and variance component tests make assumptions on the underlying nature of the effects of individual genetic variants on disease, and these tests become less powerful when the assumptions do not hold. There is a need to incorporate burden and variance component tests for G×E into one statistical model.
The aim of this study is to extend MiST to examine effects of G×E on neuroimaging phenotypes. We chose MiST because it has been shown to be consistently more powerful than other tests, especially at stringent thresholds, while still controlling for false positive rate (Moutsianas et al., 2015). Another advantage of MiST is its computational efficiency, which is especially important for imaging genetics, since this makes the use of voxelwise neuroimaging measures as phenotypes feasible. Hence, in this study we proposed a set-based MixGE. MixGE is designed to incorporate both fixed and random effects of G×E, to investigate homogeneous and heterogeneous contributions of sets of genetic variants and their interaction with environmental risks to phenotypes. In Section 2, we explained the MixGE model, and how the score statistics for the terms associated with fixed and random effects of G×E were calculated and combined into a single significance value. We then described the simulated data generated to test the type 1 error rate, and power of MixGE as compared to other existing methods. Next we described the real data used in this study. In brief, we used data from the ADNI database, including 21 candidate risk genes for late-onset AD as variant sets, the first principal component of six cardiovascular disease risk factors as an environmental factor, and hippocampal volume and TBM as the neuroimaging phenotypes. In Section 3, we showed the results from testing MixGE on simulated data as well as on the two neuroimaging phenotypes of the ADNI data. Running MixGE on the first neuroimaging phenotype of hippocampal volume serves to replicate the results of the kernel machine method (hereafter referred to as kernel machine method (KMM) for convenience) of (Ge et al., 2015) as a sanity check. Running MixGE on the second neuroimaging phenotype of TBM demonstrated the potential of MixGE for voxelwise neuroimaging phenotypes of imaging genetics. We implemented the MixGE model in MATLAB, and the software and demo are available at http://www.bioeng.nus.edu.sg/cfa/imaginggenetics. html.

A Set-Based Mixed Effect Model for Gene-Environment Interaction (MixGE)
We now introduce a unified set-based mixed effect model for gene-environment interaction (MixGE). We assume that there are N unrelated subjects with brain measures, genotyping, environmental factor, and demographic data. Let y i be a quantitative brain measure for the i th subject. Let X i be a m × 1 vector of potential confounding covariates for the i th subject, such as demographic variables and population stratification. For simplicity of notation, we incorporate the intercept into X i . Let G i be a variant set that is a p × 1 vector with the genotypes of p SNPs for the i th subject. Each SNP takes the values of 0 (homozygotic major alleles), 1 (heterozygote), or 2 (homozygotic minor alleles). Let e i be an environmental risk factor for the i th subject. Based on the above notation, we define the MixGE model as: is a link function. When y i only takes the values of 0 or 1, g(·) can be a logit function. When y i is continuous, g(·) can be an identity function. T represents the vector transpose operation. β x and β e are the regression coefficients for X i and e i respectively, where β x is a m × 1 vector and β e is a scalar. W is a p × q weight matrix based on q variant characteristics (such as nonsense, missense, insertion, deletion, etc) of p SNPs, if effects of the various characteristics are expected to be different . In the simplest case, W can be a p × 1 vector with elements of 1/p, which means that all SNPs are equally weighted according to one common variant characteristic. W can also be used to exclude SNPs from the model by giving them a weight of 0. π (1) and π (2) are q × 1 vectors and account for fixed effects of the variant set G i , and its interaction with an environmental risk factor e i respectively. On the other hand, δ = [δ 1 , δ 2 , · · · , δ p ] T is a p × 1 vector and accounts for random effects of individual SNPs in G i , and their interaction with the environmental risk factor e i , that cannot be explained by the fixed effects. We assume that δ follows a distribution with mean 0 and variance τ 2 . The MixGE model in Equation (1) is a generalized form of commonly used models such as burden test and SKAT (Morgenthaler and Thilly, 2007;Madsen and Browning, 2009), with the addition of G×E. For instance, if heterogeneous effects of individual SNPs are expected to be subtle, we may assume that τ 2 = 0. The MixGE model is hence simplified to an interaction model of burden test. In particular, when W is set as a p × 1 vector of 1/p, the MixGE model becomes an interaction model of Cohort Allelic Sum Test (Morgenthaler and Thilly, 2007). Whereas, when W is set to be a vector of weights, the MixGE model becomes an interaction model of Weighted Sum Statistic (Madsen and Browning, 2009). On the other hand, if π (1) and π (2) are set to zero, this model is equivalent to an interaction model of SKAT (Wu et al., 2011).

Score Test Statistics
From the model in Equation (1), our null hypothesis H 0 is no interaction effect between the variant set, G i , and the environmental factor, e i , on brain measure, y i . The examination of this null hypothesis is equivalent to showing that π (2) = 0 and τ 2 = 0. We use score statistics to test H 0 so that -unlike the likelihood ratio test -there is no need to estimate τ 2 under the alternative. Estimating τ 2 can be computationally expensive as it requires p-dimensional multiple integration. As the score statistic for π (2) asymptotically follows a normal distribution and the score statistic for τ 2 asymptotically follows a mixture of chi-square distributions, it is challenging to derive the joint distribution of π (2) and τ 2 , especially when they are correlated. We propose a sequential way to calculate the score statistics for τ 2 and π (2) , which will make them independent of one another and hence make the testing of H 0 feasible. For this, we first derive the score statistic for the variance component under the null hypothesis of τ 2 = 0 without assuming that π (2) = 0. We then derive the score statistic for π (2) under H 0 . This sequential derivation ensures the independence of these two score statistics, which has been proven in Sun et al. (2013).
To be more general, when the variances of δ j are not expected to be the same for all j, we can define the variance as ω j τ 2 . The score statistic for ω j τ 2 can be calculated as: with ω = diag(ω 1 , · · · , ω p ). For instance, ω j can be set to be negatively correlated with minor allele frequency (MAF) since by evolutionary theory rarer genetic variants tend to have larger effect sizes. Second, we compute the score statistic for π (2) as: (1) ) .
β x ,β e , andπ (1) are the estimates of β x , β e , and π (1) respectively under the null hypothesis of τ 2 = 0 and π (2) = 0. These coefficients can be obtained via solving the least-squares linear regression of g{E(Y)} = Xβ x + Eβ e + GWπ (1) . Based on the central limit theory and the law of large number, we can show that −1/2 U π (2) converges to a qdimensional normal distribution with mean of zero and covariance of identity matrix. = To avoid computing the square root of , we compute U π (2) −1 U π (2) that follows a χ 2 distribution with q degrees of freedom. We denote the p-value of U π (2) −1 U π (2) as P π (2) .
Based on the above construction, the score statistics of U π (2) and S τ 2 are statistically independent . This independence property makes it feasible to combine U π (2) and S τ 2 and examine our null hypothesis of τ 2 = 0 and π (2) = 0. Two possible ways to combine P π (2) and P τ 2 are the Fisher and Tippett methods (Koziol and Perlman, 1978). For the Fisher method, H 0 is rejected if −2 log P π (2) − 2 log P τ 2 ≥ χ 2 4,α . Permutation is not required in this case, unlike in Derkach et al. (2013), because the two score statistics are independent. For the Tippett method, H 0 is rejected if min(P π (2) , P τ 2 ) ≤ 1 − (1 − α) 1/2 , where α is the significance level. The Fisher method is more powerful when the alternative hypotheses of both score statistics are likely to be true, whereas the Tippett method is more powerful when the alternative hypothesis of only one of the two score statistics is likely to be true.

Simulation
MixGE was run on simulated data to ensure that it has a reasonable type 1 error rate, and to compare its power against other existing methods. Two separate simulations were carried out for binary and continuous phenotypes. For binary phenotype, simulated data were generated according to the following model adapted from Lin et al. (2013): (8) where i refers to the i th subject, j refers to the j th variant of the genetic set, E i ∼ N(0, 1) is an environmental factor, and G ij is a genetic set generated under Hardy-Weinberg equilibrium with MAF ∼ U(0, 0.5). How a j and b j were set differs depending on whether the simulation is for type 1 error or for power, and will be explained accordingly in the respective subsections 2.3.1 and 2.3.2. The constant coefficients are set to be the same as Lin et al. (2013). Y i can be generated by taking the logistic of the right hand side of Equation (8), and using that as the probability of success for a Bernoulli trial. Success would represent a disease case subject, whereas failure would represent a control subject. Simulated subjects were generated until there were sufficient cases and controls. For continuous phenotype, simulated data were generated according to the following model modified from Equation (8): where ǫ is noise with distribution N(0, σ 2 ). The intercept term log(0.01/0.99) found in Equation (8) was removed for Equation (9) as it models disease prevalence and is thus meaningless for the continuous phenotype, although the simulation results would not have been affected even if it were included. Simulated data were generated separately for small samples with 200 subjects (100 disease cases and 100 controls) as well as large samples with 2000 subjects (1000 disease cases and 1000 controls). The reason for simulating small samples is to examine the behavior of MixGE when the number of subjects available is small, which is typical for imaging genetics studies. For every simulated scenario, 10,000 simulations were run.

Type 1 Error Simulation
For type 1 error simulation, a was fixed as [1, 1, −1, −1, 0, 0, 0, 0] T while b was fixed as zero meaning that there is no G×E effect. G ij and E i were regenerated for each simulation. To determine type 1 error rate, p-values for the G×E effect for 10,000 simulations were calculated. The p-values were then thresholded at a significance level of 0.05. Type 1 error rate is the proportion of simulations that pass the significance threshold, even though there is no G×E effect, out of the total 10,000 simulations run.

Power Simulation
For power simulation, MixGE was compared against two existing methods: set-based gene-environment interaction test (SBERIA) (Jiao et al., 2013) and KMM (Ge et al., 2015). These two methods were chosen as they perform set-based G×E burden and variance component tests respectively. These methods are hence good representatives of their respective class of methods, and are good bases of comparison for MixGE which is a combination test. a j ∼ Bernoulli(0.5)×[1−2×Bernoulli(0.5)]. Three different scenarios were tested in the power simulation, each with a different b. b was fixed as C × [1, 1, 1, 1, 1, 1, 1, 1] T for the burden scenario, as C ×[1, 1, −1, −1, 0, 0, 0, 0] T which has zero mean for the variance component scenario, and as C × [1, 1, 0, 0, 0.5, 0.5, 0.5, 0.5] T for the mixture of burden and variance component scenario. For binary phenotype, C is given an arbitrary range of positive values such that the simulation is within reasonable power range. For continuous phenotype, C is fixed as 1 while varying the variance of the Gaussian noise. a j , G ij , and E i were regenerated for each simulation. Power in this simulation is defined as the rate at which the various methods are able to significantly detect G×E effect, i.e. the proportion of simulations that pass the significance threshold of 0.05 out of the total 10,0000 simulations run for each C for each scenario.

ADNI Data
The neuroimaging data -including hippocampal volumes and TBM -genetic data, environmental factors, demographic data are the same as those used in previous studies (Stein et al., 2010;Hibar et al., 2011;Ge et al., 2012Ge et al., , 2015. We describe the genetic and neuroimaging data, and their preprocessing steps below. The data used in this study were obtained from the ADNI database http://adni.loni.usc.edu/. ADNI was launched in 2003 by the National Institute of Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million, 5-year public private partnership. The primary goal of ADNI has been to test whether biological markers such as serial magnetic resonance imaging (MRI) and positron emission topography (PET), and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years, and 200 people with early AD to be followed for 2 years (see http://www.adniinfo.org/ for up-to-date information). The data were analyzed anonymously, using publicly available secondary data from the ADNI study, therefore no ethics statement is required for this work.
This study only included 697 subjects (age: 55-91 years) with MRI scans and genotype, environmental and demographic data. Among them, there were 295 females and 402 males. The amount of education received by these subjects ranged from 4 to 20 years.

Environmental Factors
The environmental factors used in this and previous studies (Ge et al., 2015) are six cardiovascular disease risk variables, including age, gender, body mass index, systolic blood pressure, current smoking status and diabetes. These cardiovascular disease risk factors were chosen as environmental factors because sufficient evidence suggests that they increase the risk for AD and accelerate AD progression (Kivipelto et al., 2001;Luchsinger et al., 2005). For gender, females were given a value of 1 whereas males were given a value of 0. For current smoking status, smokers were given a value of 1 whereas non-smokers were given a value of 0. For diabetes, diabetics were given a value of 1 whereas non-diabetics were given a value of 0. Here, we performed principal component analysis on these six environmental variables. The first principal component accounted for 81.9% of the total variation of the six environmental variables. It was used as the environmental factor for MixGE, as MixGE is only able to accommodate one environmental factor. The loadings of the six cardiovascular disease risk variables on the first principal component are 0.213, -0.522, 0.548, 0.301, 0.537, and 0.0424 respectively.

Genotyping and Variant Sets
The genome-wide SNP data was preprocessed according to the ENIGMA2 1KGP cookbook (v3): http://enigma.ini.usc.edu/wpcontent/uploads/2012/07/ENIGMA2_1KGP_cookbook_v3.pdf. Twenty-one candidate genes that were previously identified as risk genes for AD (Lambert et al., 2013;Lu et al., 2014) were used in this study. These sets of genetic variants were also previously used to examine the G×E influence on the aging brain (Ge et al., 2015). The SNPs on the coding regions, as well as 20kb upstream and downstream of these 21 genes were extracted. The names of the 21 genes and the number of SNPs for each are listed in Table 1. The SNPs for each gene were considered as a separate set of variants. Each of the 21 candidate risk genes were put through MixGE separately.

MRI Acquisition and Analysis
All ADNI 1.5T structural brain MRI scans were processed using FreeSurfer (Dale et al., 1999). Intra-cranial volume and bilateral hippocampal volumes were automatically computed using FreeSurfer after skull stripping, B1 bias field correction, segmentation, and labeling (Fischl et al., 2002), and passed rigorous visual quality control checks (Ge et al., 2015).
The process used to generate the TBM data was described in (Hua et al., 2008), and is restated here. First, a random subset of healthy elderly subjects were chosen. The intensities of their MRI scans were normalized, aligned using a 9 parameter affine transform, and averaged voxelwise to create an initial affine average template. The scans of this subset of subjects were then warped to the affine average template using a non-linear inverse consistent elastic intensity-based registration algorithm (Leow et al., 2005), and averaged to create a non-linear average intensity template. Inverse geometric centering of the displacement fields to the non-linear average intensity template was performed to construct the minimal deformation template (MDT). The MDT serves as an unbiased atlas image to which all other images were transformed using the same non-linear inverse consistent elastic intensity-based registration algorithm as earlier (Leow et al., 2005). The determinant of the Jacobian matrix of the deformation was computed to assess volumetric tissue difference at each voxel, which encodes local volume excess or deficit relative to the atlas image. This volumetric tissue difference relative to the atlas at each voxel was used as a quantitative measure of brain tissue volume difference for examining the MixGE. From here on, we shall refer to "the determinant of the Jacobian matrix of the deformation" as "local volume change." There are a total of 2,061,878 voxels per image.

MixGE Application to Hippocampal Volume and TBM
We applied MixGE to examine the interactive effect of cardiovascular risks and AD candidate genes on hippocampal volume and TBM of the brain using the aforementioned ADNI dataset. For this, the environmental factor was defined as the first principal component of the six cardiovascular factors mentioned earlier. The genes are listed in Table 1.
For TBM, we applied MixGE to the local volume change measure and examined the null hypothesis H 0 , at the voxellevel of the atlas. We then performed false discovery rate (FDR) adjustment on the obtained p-values, according to the procedure outlined in Storey (2002). Lastly we thresholded the adjusted p-values at a significance level of 0.05.

RESULTS AND DISCUSSION
In this section, we first present the results of the simulated data to evaluate the type 1 error rate of MixGE and to compare its power with two existing techniques: SBERIA (Jiao et al., 2013) and KMM (Ge et al., 2015).
For real ADNI data, we run MixGE on hippocampal volume to compare with the results obtained by Ge et al. (2015) using KMM. We then run MixGE on TBM to demonstrate the potential of MixGE for voxelwise imaging genetics. In both experiments, intra-cranial volume and education were considered as covariates in the MixGE model, where intra-cranial volume was automatically calculated by FreeSurfer. Hence, in the MixGE model, X is a (697 × 3) matrix with the first column of constant, the second column of intra-cranial volume, and the third column of education. W is a p × 1 vector with elements of 1/p.

Type 1 Error Simulation
According to the simulated data for binary phenotype, MixGE had type 1 error rates of 0.0479 and 0.0507 for small samples of 200 subjects and large samples of 2000 subjects respectively. The type 1 error rates for SBERIA (Jiao et al., 2013) were determined to be 0.0523 and 0.0504 for small samples of 200 subjects and large samples of 2000 subjects respectively, while that of KMM (Ge et al., 2015) were determined to be 0.0287 and 0.0751 respectively.
The results for the type 1 error simulation for continuous phenotype is shown in Figure 1. The type 1 error rates for both MixGE and SBERIA (Jiao et al., 2013) are close to 0.05 regardless of the variance of Gaussian noise added. The type 1 error rates for KMM (Ge et al., 2015) is higher than 0.05 when variance of Gaussian noise added is low, and drops below 0.05 when variance of Gaussian noise added is high. Sample size appear to have little effect on type 1 error rates for continuous phenotype.
These results show that MixGE is a valid method that does not have an inflated false positive rate.  (Jiao et al., 2013) and KMM (Ge et al., 2015) for these three scenarios against these various values of C are plotted in Figure 2. Figure 2 shows that in general, when the number of subjects is small, C has to be much larger in order to detect the G×E effect. Also, when the number of subjects is small, all three methods suffer great statistical power loss for the burden scenario. Furthermore, the power for KMM (Ge et al., 2015) begins to decrease when C becomes larger than 0.8. This is probably because of increase in the estimated variance of b as C increases, thereby causing power loss. The same trends but to a smaller extent is also observed for the mixture of burden and variance component scenario, as expected.

Power Simulation
We indeed see that SBERIA (Jiao et al., 2013) and KMM (Ge et al., 2015) performed best for the scenarios that they were designed for (burden and variance component respectively) but performed worst for the opposite scenario (variance component and burden respectively). They both performed somewhat average for the mixture of burden and variance component scenario.
MixGE on the other hand performs consistently well for all three scenarios. For the simulated data with small samples, MixGE outperformed the other two methods for all scenarios. It is especially notable, that while MixGE also suffered significant power loss for the burden scenario, it is by far the least affected by small sample size as compared to the other two methods. For the simulated data with large samples, MixGE has more power than the other two methods for the burden and the mixture of burden and variance component scenarios. For the variance component scenario, the performance of MixGE and KMM (Ge et al., 2015) are very close. KMM (Ge et al., 2015) slightly outperforms MixGE when the effect size is smaller, whereas MixGE outperforms KMM (Ge et al., 2015) when the effect size is larger. This power simulation shows the strength of MixGE. MixGE exhibits relatively high power whether for datasets with small or large sample sizes. MixGE also performs well for all three scenarios, which shows that MixGE can be applied to any dataset without having to first make any assumption on the underlying nature of the G×E effects.  (Jiao et al., 2013) and KMM (Ge et al., 2015) for these three scenarios against these various values of σ 2 are plotted in Figure 3. Figure 3 shows that in general, when the number of subjects is small, σ 2 has to be smaller in order to detect the G×E effect. In all scenarios, MixGE and KMM (Ge et al., 2015) performed  similarly, with MixGE having slightly more power than KMM (Ge et al., 2015). On the other hand, SBERIA (Jiao et al., 2013) performed poorly as compared to the other two methods.

Hippocampal Volume
In this experiment, we employed MixGE and KMM (Ge et al., 2015) to investigate the interactive effect of individual variant sets listed in Table 1 with cardiovascular risks on hippocampal volume. Y therefore represents hippocampal volume. For MixGE the first principal component of the six cardiovascular risk factors was used as the environmental factor, as MixGE is only able to accommodate one environmental factor. Columns of P Fisher and P Tippett in Table 2, respectively, show the results from the MixGE model based on the Fisher and Tippett methods for combining the p-values of the two score statistics. Additionally, in Table 2, column P π (2) is equivalent to a burden test with G×E, and column P Random is equivalent to SKAT with G×E. On the other hand, the column P all shows the results of KMM (Ge et al., 2015) with all six cardiovascular risk factors as environmental factors, and column P PC shows the results of KMM (Ge et al., 2015) with the first principal component as the only environmental factor. From Table 2, SKAT with G×E, KMM (Ge et al., 2015), and MixGE, revealed the significant interactive effect of EPHA1 and the cardiovascular risks on the hippocampal volume with the smallest p-value given by MixGE. SKAT and KMM (Ge et al., 2015) also revealed the significant interactive effect of CR1 with the cardiovascular risks on hippocampal volume. However, MixGE failed to detect such an interactive effect after Bonferroni correction. As mentioned earlier, the MixGE method is most powerful when the G×E comprises a good mixture of fixed and random effects. If the interaction is predominantly of random effects, then MixGE could become weaker than the variance component methods. This is a slight limitation of MixGE, in exchange for the benefit of not having to make any assumptions of the effects of G×E on phenotypes, which is largely unknown. Another point to note is that MixGE is only able to accommodate one environmental factor, which in this case is the first principal component of the six cardiovascular risk factors that only accounts for 81.9% of the total variation of all six factors. Weaker statistical power in the MixGE method could be due to loss of variance in the environmental factor, which is a limitation of the MixGE method in comparison with KMM (Ge et al., 2015). When the first principal component was used as the only environmental factor for KMM (Ge et al., 2015), no significant results were obtained at all.

TBM
In this experiment, we employed MixGE to examine the interactive effect of individual variant sets listed in Table 1 with cardiovascular risks on TBM. In particular, we describe our findings using the Fisher combined score method below. It took < 7 min to perform the computation for each variant set on a consumer grade laptop. On the other hand, it would take an estimated 90 days to perform the same computation using KMM (Ge et al., 2015).
MixGE first discovered the interactive effect of EPHA1 and cardiovascular risk on local volume change of the right inferior lateral ventricle and the right posterior hippocampus ( Table 3).
The TBM analysis provided additional information of anatomical location and demonstrated the coherent influence of the G×E on these two adjacent structures (Figure 4). This finding is in line with the result on hippocampal volume, suggesting the robustness and statistical power of the MixGE model. This finding is also consistent with previous finding of the effect of EPHA1 on hippocampal volume in MCI (Wang et al., 2015).
MixGE further revealed the interactive effect of ABCA7 and cardiovascular risk on local volume change of the right superior parietal cortex (Table 3). ABCA7 (ATP-binding cassette sub-family A member 7) is a transmembrane transporter that mediates the biogenesis of high-density lipoprotein (Tanaka et al., 2011), and has the capacity to stimulate cellular cholesterol efflux and regulate amyloid precursor protein processing resulting in an inhibition of β-amyloid production (Chan et al., 2008). ABCA7 is involved in lipid metabolism and transport, and is a possible link between cardiovascular risk factors and AD (Stampfer, 2006;Jones et al., 2010;Reitz et al., 2013). Furthermore, ABCA7 has been shown to be nominally associated with the brain atrophy in the posterior portion of the cerebral cortex (Carrasquillo et al., 2014), which is consistent with the significant cluster identified in this study.

CONCLUSIONS
In summary, we proposed the MixGE method to examine the effect of G×E on neuroimaging phenotypes. In particular, the MixGE model incorporates both fixed and random effects of the G×E, which is superior to burden and variance component tests, since for MixGE no assumptions have to be made of the effects of G×E on phenotypes, which is largely unknown. We evaluated MixGE with simulated data and showed that it controlled well for type 1 error. MixGE also had more statistical power than the other two methods that were compared, SBERIA (Jiao et al., 2013) and KMM (Ge et al., 2015), in all simulated scenarios. Furthermore, KMM (Ge et al., 2015) did not seem to control well for type 1 error. We also demonstrated the use of MixGE on real ADNI data with brain structural volumes and TBM as phenotypes. Our results showed consistent findings across the hippocampal volume and TBM measures, suggesting the robustness of the MixGE model. MixGE further revealed the interactive effect of ABCA7 and cardiovascular risk on local volume change of the right superior parietal cortex. This is a potential new discovery that warrants further investigation. The computational efficiency of MixGE made it feasible for MixGE to be used on voxelwise phenotypes such as TBM in the first place. It took < 7 min to perform the computation for each variant set on a consumer grade laptop. On the other hand, it would take an estimated 90 days to perform the same computation using KMM (Ge et al., 2015).
The MixGE model proposed in this study can only accommodate one environmental factor for a gene set and environment interaction, which is less flexible in comparison with KMM (Ge et al., 2015). Nevertheless, for clinical applications, it is often important to understand how individual environmental factors interact with gene on clinical phenotypes rather than all together.

AUTHOR CONTRIBUTIONS
CW, AQ, JS, and CG contributed to the method derivation and implementation. CW, BG, and AQ also contributed to the experiment design and implementation. TG and DH contributed to the experiment neuroimage analysis. All authors contributed to the manuscript writing.