This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics
This is an openaccess article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The explosion of biobank data offers unprecedented opportunities for geneenvironment interaction (GxE) studies of complex diseases because of the large sample sizes and the rich collection in genetic and nongenetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for setbased G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a
Human complex diseases such as neurodegenerative diseases, psychiatric disorders, metabolic syndromes, and cancers are complex traits for which disease susceptibility, disease development, and treatment response are mediated by intricate genetic and environmental factors. Understanding the genetic etiology of these complex diseases requires collective consideration of potential genetic and environmental contributors. Studies of geneenvironment interactions (G×E) enable understanding of the differences that environmental exposures may have on health outcomes in people with varying genotypes (
When assessing G×E effects, setbased tests are popular approaches to detecting interactions between an environmental factor and a set of single nucleotide polymorphism (SNPs) in a gene, sliding window, or functional region (
Largescale biobanks collect genetic and health information on hundreds of thousands of individuals. Their large sample sizes and rich data on nongenetic factors offer unprecedented opportunities for indepth studies on G×E effects. While the explosion of biobank data collections provides great hopes for novel G×E discoveries, it also introduces computational challenges. In particular, many setbased G×E tests can be cast as variance component (VC) tests under a random effects modeling framework (
A number of methods attempt to ease this computational burden by bypassing the estimation of nuisance variance components, either through approximation of the variance or kernel matrices (
In this work, we focus on continuous traits and introduce a Scalable Exact AlGorithm for Largescale setbased G×E tests (SEAGLE) for performing G×E VC tests on biobank data. Here “exact” refers to the fact that SEAGLE computes the original VC test statistic without any approximations, rather than the null distribution of the test statistic being asymptotic or exact. Exactness and scalability are achieved through the judicious use of modern matrix computations, allowing us to dispense with approximations and assumptions. Our numerical experiments illustrate that SEAGLE produces Type 1 error rates and power identical to those of the original GxE VC methods (
The rest of the paper proceeds as follows.
We describe the standard mixed effects model for G×E effects and testing procedures (
We present the standard mixed effects model for studying G×E effects, the scorelike test statistic, and its
Consider the linear mixed effects model (
Here,
The SNPset analysis models the G and G×E effects of the
To simplify the null model of (
In
Given the
We identify three computational bottlenecks.
1. The test statistic
2. The REML EM algorithm (
3. Computing the pvalues requires two eigenvalue decompositions: 1) an eigenvalue decomposition of
We present our approach for overcoming the three computational challenges in the previous section: Multiplication with
The test statistic
LEMMA 1 [Section 2.1.4 in
Applying Lemma 1 to the product of the inverse of
applyVinv

We present the scalable version of the REML EM algorithm in
We assume throughout that
In iteration
Then the updates in
The two bottlenecks in the REML EM algorithm are the computation of the nonsymmetric “squareroot”
To avoid explicit formation of the full matrix in (
Thus,
REMLEM.

Furthermore, we apply
Computation of the
The explicit formation of
SEAGLE.

Combining the algorithms from
We evaluate the performance of our proposed method SEAGLE using simulation studies from two settings. In the first, we simulate data from a random effects genetic model according to Model (1). This enables us to evaluate SEAGLE’s estimation and testing performance. We include experiments with a smaller
In each setting, we study the Type 1 error rate and power. We consider three baseline approaches: i) the original G×E VC test (referred to as OVC) (
In all simulations, we obtain the genotype design matrix
Given the
We begin by examining the Type 1 error rate for SEAGLE with
Type 1 error rate with 95

Type 1 error  Std. Error  95% CI 

5 × 10^{−2}  0.0497635  0.0000486  (0.0496681, 0.0498588) 
5 × 10^{−3}  0.0049571  0.0000157  (0.0049263, 0.0049878) 
5 × 10^{−4}  0.0004894  0.0000049  (0.0004797, 0.0004990) 
5 × 10^{−5}  0.0000481  0.0000016  (0.0000451, 0.0000511) 
2.5 × 10^{−6}  0.0000027  0.0000004  (0.0000020, 0.0000035) 
SEAGLE vs. original G×E VC (OVC) results based on




SEAGLE  Bias  −3.06 × 10^{−4}  −1.01 × 10^{−3} 
MSE  4.37 × 10^{−2}  4.18 × 10^{−4}  
OVC  Bias  −6.93 × 10^{−4}  −5.56 × 10^{−4} 
MSE  4.36 × 10^{−2}  1.05 × 10^{−4} 
SEAGLE vs. original G×E VC (OVC) results based on
Since the data are generated from a random effects model under
In
Mean squared error (MSE) of the pvalues obtained from different G×E VC tests, compared to the “Truth” pvalues. Results are obtained with
MSE of 


SEAGLE  3.49 × 10^{−4} 
MAGEE  224.96 × 10^{−4} 
OVC  3.49 × 10^{−4} 
FastKM  17.88 × 10^{−4} 
Regarding the computational cost, the left panel in
Type 1 error at
Power at
Type 1 error at
Time in seconds for random effects simulations with
Power at
To study the performance of our proposed method when the data may not adhere to our model assumptions, we follow previous work (
We first evaluate the Type 1 error of SEAGLE by simulating
Type 1 error at
Computation time in seconds for fixed effects simulations with
For power evaluation, we simulate
Power at
Power at
To illustrate the scalability of the G×E VC test using SEAGLE, we apply SEAGLE and MAGEE to the Taiwan Biobank (TWB) data. TWB is a nationwide biobank project initiated in 2012 and has recruited more than 15,995 individuals. Peripheral blood specimens were extracted and genotyped using the Affymetrix Genomewide Axiom TWB array, which was designed specifically for a Taiwanese population. We conduct the genebased G×E analysis and evaluate the interaction between gene and physical activity (PA) status on body mass index (BMI), adjusting for age, sex and the top 10 principal components for population substructure. The PA status is a binary indicator for with/without regular physical activity. Our G×E analyses focuses on a subset of 11,664 unrelated individuals who have nonmissing phenotype and covariate information. After PLINK quality control (i.e., removing SNPs with call rates < 0.95 or HardyWeinberg Equilibrium
The median run time of SEAGLE and MAGEE is 2.4 and 1.3 s, respectively, Both SEAGLE and MAGEE do not find any significant G×PA interactions at the genomewide Bonferroni threshold 0.05/13,260 = 3.77 × 10^{−6}. We hence discuss the results using a less stringent threshold, i.e., 5 × 10^{−4} and summarize the results in
Upset plot of significant G×PA interactions identified by SEAGLE and MAGEE in the Taiwan Biobank at the 5 × 10^{−4} nominal level.
We introduced SEAGLE, a scalable exact algorithm for performing setbased G×E VC tests on largescale biobank data. We achieve scalability and accuracy by applying modern numerical analysis techniques, which include avoiding the explicit formation of products and inverses of large matrices. Our numerical experiments illustrate that SEAGLE produces Type 1 error rates and power that are identical to those of the original VC test (
Stateoftheart computational approaches such as MAGEE bypass the traditional timeconsuming REML EM algorithm, and instead compute an approximation to the scorelike test statistic by assuming that the G main effect size is small. In practice, however, the G main effect size is often unknown. Our numerical experiments illustrate that SEAGLE generally achieves better Type 1 error and power with comparable computation time.
In general, computational time differences between SEAGLE and MAGEE depend primarily on the effect size of the genetic main effect (G effect) and the size of the data, particularly the number of loci L. This is due to the fact that SEAGLE computes the scorelike test statistic exactly and therefore requires an EM algorithm to estimate the main effect parameter
We highlight the fact that most of our timing experiments were performed on a 2013 Intel Core i5 laptop with a 2.70 GHz CPU and 16 GB RAM. Therefore, SEAGLE performs efficient and exact setbased G×E tests on biobankscale data with
Recent studies suggested incorporating functional annotations can further enhance power and accuracy in setbased association analysis [e.g., STAAR (
We conclude with a discussion of possible avenues for future extensions. First, the current SEAGLE framework only allows for quantitative traits. Extension to binary traits can be based on developing a scalable algorithm for the G×E VC test described in
Second is the extension from a single environmental factor to a set of factors represented by
Here
Third is the extension from unrelated samples to family samples. With family samples, Model (1) will include an additional nuisance VC associated with kinship matrix to account for the withinfamily correlation. Similar scalable algorithms as for the multiE analysis would also be useful for analyzing family samples.
The last is the extension to other types of kernels (
The datasets for this article are not publicly available but requests to access the data from the Taiwan Biobank can be emailed to
The studies involving human participants were reviewed and approved by the ethical committee at Taichung Veterans General Hospital (IRB TCVGH No. CE16270B2). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.
JC, II and JYT conceived the presented ideas and study design. JC implemented the methods and performed the numerical studies under the supervision of II and JYT. JC, THH, CHL, LSW, WPL, TPL, and JYT contributed to data processing, analysis and result interpretation of the real data analysis. JC, II and JYT drafted the manuscript with input from THH, CHL, LSW, WPL, and TPL. All authors helped shape the research, discussed the results and contributed to the final manuscript.
This work has been partially supported by National Science Foundation Grant DMS1760374 (to JC and II), National Science Foundation Grant DMS2103093 (to JC), National Institutes of Health Grants U54 AG052427 (to LSW and WPL), U24 AG041689 (LSW, WPL and JYT), and P01 CA142538 (to JYT), and Taiwan Ministry of Science and Technology Grants MOST 1062314B002097MY3 (to TPL) and MOST 1092314B002152 (to TPL).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
The authors would like to thank the reviewers for their helpful feedback, thank Yueyang Huang for his help with the example data and PLINK1.9 code for the SEAGLE R package tutorials, and thank Caizhi Huang for his help in the simulations for the revision.
The Supplementary Material for this article can be found online at: