METHODS article

Front. Genet., 21 November 2013

Sec. Applied Genetic Epidemiology

Volume 4 - 2013 | https://doi.org/10.3389/fgene.2013.00243

Testing for direct genetic effects using a screening step in family-based association studies

  • SM

    Sharon M. Lutz 1,2*

  • SV

    Stijn Vansteelandt 3

  • CL

    Christoph Lange 2

  • 1. Department of Biostatistics, University of Colorado Aurora, CO, USA

  • 2. Department of Biostatistics, Harvard School of Public Health Boston, MA, USA

  • 3. Department of Applied Mathematics, Computer Science and Statistics, Ghent University Ghent, Belgium

Abstract

In genome wide association studies (GWAS), family-based studies tend to have less power to detect genetic associations than population-based studies, such as case-control studies. This can be an issue when testing if genes in a family-based GWAS have a direct effect on the phenotype of interest over and above their possible indirect effect through a secondary phenotype. When multiple SNPs are tested for a direct effect in the family-based study, a screening step can be used to minimize the burden of multiple comparisons in the causal analysis. We propose a 2-stage screening step that can be incorporated into the family-based association test (FBAT) approach similar to the conditional mean model approach in the Van Steen-algorithm (Van Steen et al., 2005). Simulations demonstrate that the type 1 error is preserved and this method is advantageous when multiple markers are tested. This method is illustrated by an application to the Framingham Heart Study.

Introduction

Some of the recently published genome-wide association studies identified the same genetic locus as a disease susceptibility locus for different complex diseases (Amos et al., 2008; Thorgeirsson et al., 2008). One possible mechanism is that the marker locus is pleiotropic and has genetic effects on several, different phenotypes. Determining whether the marker acts directly on each of these phenotypes or only indirectly via one or more intermediate phenotypes is an important step in understanding the biological significance of the genetic associations. In order to understand and characterize the underlying genetic effect, methods have been proposed to disentangle these potential direct and indirect genetic effects (Vansteelandt et al., 2009; Vansteelandt, 2010; Berzuini et al., 2012; Vansteelandt and Lange, 2012; VanderWeele et al., 2012). All currently available methods focus on the direct and indirect genetic effects relative to one (group of) secondary phenotypes. Because the magnitude of the indirect effect depends on how strongly these secondary phenotypes affect the primary phenotype, these methods consider adjustment for confounding of the relationship between these phenotypes by measured extraneous factors. Some of these methods quantify both the direct and indirect genetic effects, but assume that none of these extraneous confounding factors is influenced by the considered marker (VanderWeele et al., 2012). Some of these methods allow for some of the extraneous confounding factors to be influenced by the considered marker, but they merely quantify direct genetic effects (Vansteelandt et al., 2009; Vansteelandt, 2010; Berzuini et al., 2012).

Regardless of the considered framework, all available methods only test one gene at a time and need to be corrected for multiple comparisons. This concern over multiple comparisons becomes an issue in family-based genome wide association studies (GWAS). When there is a region with a strong association with both the endo-phenotype and phenotype, identifying SNPs in the region that are still associated with the phenotype of interest after accounting for the association with the endo-phenotype requires testing for a direct causal effect for every SNP in the region. In order to increase power to detect this direct genetic effect, we propose a 2-stage testing strategy to minimize the burden of multiple comparisons in the causal analysis (Van Steen et al., 2005; Murphy et al., 2008; Won et al., 2009). The application of a screening step when testing for direct genetic effects is an important advantage in this scenario where the multiple-comparison problem is a major hurdle. The power of our approach is assessed by simulation studies. We show that the type-1 error is preserved and the method is shown to be advantageous when multiple SNPs are tested for a direct effect on the phenotype of interest.

Methods

Suppose that in the family-based study, n trios (offspring and both parents) have been genotyped at a specific marker locus. Assuming there is no bias due to ascertainment conditions, the variable Xi denotes the coded genotype of the offspring and Si denotes the parental genotypes for individual i. If genotypic data is unavailable for the parents but genotypic information is available on the subject's siblings, the variable Si denotes the sufficient statistic by Rabinowitz and Laird (2000) For offspring i, Yi denotes the target phenotype in the association study and Ki denotes the secondary phenotype in the study.

Suppose that an association has been observed between the secondary phenotype of interest, Ki, and the marker locus. Given this association, the goal is to test for an association between the target phenotype Yi and the marker locus that cannot be explained by a possible indirect effect mediated by Ki. To achieve this goal, data is needed on all risk factors of the secondary phenotype Ki that are also associated with the primary phenotype (Cole and Hernan, 2002). Let Li denote this collection of measured confounding variables. Because L may be high-dimensional, we do not assume that it is only related with Y by means of a causal effect, but allow for their association to be itself confounded by potentially unmeasured factors U. This is shown in the causal diagram of Figure 1, where the presence of U additionally captures the potential for confounding of the genetic association as a result of population admixture (Vansteelandt and Lange, 2012). Throughout, in contrast to other mediation analysis techniques (namely those based on so-called natural direct and indirect effects), we will allow for the possibility that some of these confounding variables are themselves affected by the studied marker, as illustrated via the edge from X to L in the causal diagram (VanderWeele et al., 2012).

Figure 1

Consider model where γj for j = 0, 1, …, 3 denote the mean parameters and can be estimated by ordinary least squares. Note that γ1 represents the true effect of Ki on Yi and not a spurious association because, by assumption, the above model includes all relevant risk factors of Ki. In order to construct an adjustment principle that tests for a direct genetic effect of the marker locus X on the target phenotype Y, the effect of the secondary phenotype K has to be estimated. Vansteelandt et al. use an estimate for γ1 based on model (1) to adjust the phenotype Yi to Yi − γ1Ki. A family-based association test (FBAT) on this adjusted phenotype is then a test for the direct genetic effect in the family-based setting (provided that the distribution of the test statistic acknowledges the uncertainty in the estimated effect γ1) (Vansteelandt et al., 2009).

To reduce the number of multiple comparisons, we adapt the conditional mean model approach in the VanSteen-algorithm (Van Steen et al., 2005) to model (1). By replacing the observed marker score in model (1) by the expected marker score conditional upon the parental genotypes or sufficient statistic, the genetic effects of locus Xi can be assessed without having to adjust the α-level of any subsequently computed FBATs (Lange et al., 2003a,b; Van Steen et al., 2005). Similar to the idea of the conditional mean model approach, model (1) can be rewritten by substituting Xi with its expected value E(Xi|Si),

As shown in the proof given in the appendix, the parameter γ1 is the same in both model (1) and model (2) when the null hypothesis holds that there is no direct effect and, moreover, there is no confounding due to population substructure. For testing the null hypothesis of no direct genetic effect, model (2) can thus be used to estimate the parameter γ1 in a screening step without biasing the significance level since Xi is not included in this model, provided there is no confounding due to population substructure.

For the screening step, each subject contributes where and is the ordinary least squares estimate for γ1 in model (2), which does not involve the genetic marker X. is not adjusted for the covariates Li since including factors such as Li in the phenotypic adjustment would introduce bias if the common risk factor Li is associated with the DSL Xi (Vansteelandt et al., 2009). The parameters y and k are the observed phenotypic means of Y and K in the sample, respectively. Then the test statistic for the screening step is where where is calculated based on the sample variance of and ϵ*i denotes the residual from model (2). μ*(i)k = E(K|Li, E(Xi|Si)) is the predicted value for K under a linear regression model for K with the covariates Li and E(Xi|Si), and σ*2k denotes the residual variance in that model. The variance correction given in Equation (5) is needed to account for estimating γ1 in the proposed phenotype adjustment (Vansteelandt et al., 2009).

For step 1, the test statistic given in Equation (4) can be used for the screening step to pick the SNPs with the highest power since X is not used in this test statistic. For step 2, this smaller subset of SNPs are used to test the null hypothesis of no direct effect using the test statistic based on Equation (1) proposed by Vansteelandt et al. (2009) where and is the ordinary least square estimate for γ1 in model (1), which does involve the genetic marker X. Using this association test with the adjusted phenotype as the target phenotype provides a robust and valid test for the null hypothesis that there is no direct effect between the target phenotype Yi and the DSL; i.e., the association between the target phenotype Yi and the DSL is solely a result of the association between the secondary phenotype Ki and the DSL. Adjusting for estimating γ1 based on model (1), the test statistic is distributed chi-square with one degree of freedom under the null hypothesis of no direct effect of X on Y and has the following form where where is calculated based on the sample variance of and ϵi denotes the residual from model (1). μ(i)k = E(K|Li, Xi, E(Xi|Si)) is the predicted value for K under a linear regression model for K with the covariates Li, Xi, and E(Xi|Si), and σ2k denotes the residual variance in that model. The variance correction given in Equation (8) is needed to account for estimating γ1 in the proposed phenotype adjustment (Vansteelandt et al., 2009). Note that Equation (3) is similar to Equation (6), but Equation (6) contains the genetic marker Xi. Similarly, Equation (5) is similar to Equation (8), but Equation (8) contains the genetic marker Xi.

Note that under the alternative hypothesis, the association between K and Y is different in models (1) and (2), even in the absence of population admixture. Model (1) represents the causal effect of K on Y under the alternative hypothesis, but model (2) does not represent the causal effect of K on Y because there is a remaining spurious association between X and Y along the path K ← X → Y in Figure 1. Under the null hypothesis, this path does not exist. As a result, the proposed approach is valid for testing in the absence of population stratification, but may have less power when either the XK or the XY link is strong. This scenario is explored further in the simulation section of this paper.

Because the test statistic for the screening step given in Equation (4) is susceptible to population stratification, we examined this scenario in the simulation section as well. Principal component analysis (PCA) can be used in the screening step to correct for population stratification.

Simulations

Using simulation studies, we asses the type-1 error rate, the power, and robustness of this new approach which uses a trait that estimates γ1 based on model (2) in the screening step and compare it to the approach proposed by Vansteelandt et al. (2009) which uses a trait that estimates γ1 based on model (1). Similar to Vansteelandt et al. (2009), both methods are evaluated under various conditions. All simulations use a sample size of 1000 trios and are based on 5000 replications. The simulations are run for allele frequencies 5, 10, 15, 20, 25, 30, 35, 40, and 45%.

To reflect a realistic setting, the data is simulated to reflect covariances found in the Framingham Heart Study (Herbert et al., 2006). The phenotype of interest Y is simulated such that it resembles FEV1. The secondary phenotype K resembles weight and the set of common confounding variables resemble height and age. As seen in Figure 2, the first scenario assumes there is a direct genetic effect of the marker on the intermediate phenotype K and on the common covariate L. Each genetic effect has a locus specific heritability of 1%. The intermediate phenotype K explains 1% of the phenotypic variation in Y, creating an association between the SNP and Y. The second scenario is similar to the first scenario except that there is no genetic effect on the confounder L. The genetic association with the intermediate phenotype K is still present. The third scenario is similar to the first scenario except there is no association between K and Y. The fourth scenario is similar to the second scenario except that there is no genetic effect on the intermediate phenotype K.

Figure 2

As seen in Table 1, the type-1 error rate is similar whether model (1) or model (2) is used to estimate γ1. For lower allele frequencies, under scenario 1 and 3, the type-1 error rate is 1–2% higher than expected. For higher allele frequencies under all four scenarios, the type-1 error rate is 0.5% lower than expected. In general, the type-1 error rate is close to 0.05 regardless of how γ1 is estimated. As seen in Table 2, the power is similar whether model (1) or model (2) is used to estimate γ1 assuming no population admixture. For lower allele frequencies, the method by Vansteelandt et al. (2009) has higher power and for higher allele frequencies the proposed method has higher power. However, this difference in power is negligible; the power never differs by more than 2%.

Table 1

Allele frequency (%)Type-1 error rate when 1 SNP is tested
51015202530354045
Scenario 1: Model 10.0710.0590.0490.0470.0450.0470.0490.0510.050
Scenario 1: Model 20.0690.0580.0480.0460.0460.0460.0490.0500.051
Scenario 2: Model 10.0440.0450.0450.0450.0450.0450.0450.0430.045
Scenario 2: Model 20.0450.0440.0450.0450.0450.0430.0450.0430.045
Scenario 3: Model 10.0580.0480.0430.0450.0450.0460.0440.0470.044
Scenario 3: Model 20.0520.0500.0440.0460.0440.0460.0450.0470.046
Scenario 4: Model 10.0440.0450.0450.0430.0460.0440.0450.0450.042
Scenario 4: Model 20.0440.0440.0450.0430.0460.0440.0460.0450.042

This table displays the type-1 error rate for the test statistics using Model 1 [the Vansteelandt et al. test statistic (Vansteelandt et al., 2009)] or Model 2 (the screening test statistic) to estimate γ1 for different allele frequencies.

Table 2

Allele frequency (%)Power when 1 SNP is tested
51015202530354045
Scenario 1: Model 10.2640.3630.4480.5040.5760.6290.6690.6920.706
Scenario 1: Model 20.2410.3610.4440.5080.5810.6330.6710.6960.710
Scenario 2: Model 10.1800.3020.4060.4920.5640.6100.6490.6670.686
Scenario 2: Model 20.1800.3020.4080.4910.5630.6100.6460.6660.685
Scenario 3: Model 10.2650.3650.4490.5040.5810.6320.6690.6960.712
Scenario 3: Model 20.2460.3610.4510.5100.5860.6340.6710.6990.716
Scenario 4: Model 10.1750.3040.4080.4990.5580.6070.6470.6710.681
Scenario 4: Model 20.1740.3030.4070.4980.5570.6050.6480.6720.682

This table displays the power for the test statistics using Model 1 [the Vansteelandt et al. test statistic (Vansteelandt et al., 2009)] or Model 2 (the screening test statistic) to estimate γ1 for different allele frequencies.

The advantage of our approach becomes clear when testing multiple SNPs. Table 4 shows how the power to detect the causal SNP for our approach compares to Vansteelandt et al. (2009) when one SNP has a direct effect on the phenotype as simulated above in Table 2 and 49 other SNPs are not associated with the phenotype of interest. Table 1 shows the type-1 error rate in this scenario where the one SNP has an indirect effect on the phenotype as simulated above in Table 1 and 49 other SNPs are not associated with the phenotype of interest or any of the other phenotypes. Table 6 shows how the power to detect the causal SNP for our approach compares to Vansteelandt et al. (2009) when one SNP has a direct effect on the phenotype as simulated above in Table 2 and 99 other SNPs are not associated with the phenotype of interest. Table 5 shows the type-1 error rate in this scenario where the one SNP has an indirect effect on the phenotype as simulated above in Table 1 and 99 other SNPs are not associated with the phenotype of interest or any of the other phenotypes.

Our approach allows for a screening step similar to the Van Steen algorithm (Van Steen et al., 2005) where the top 3 SNPs out of 50 and the top 5 SNPs out of 100 with the highest test statistic given by Equation (4) are chosen. We chose 3 SNPs out of 50 and 5 SNPs out of 100 since this is roughly 5% of the SNPs. After the top 3 or 5 SNPs are chosen based on the screening step, the test statistic described in Equation (7) is used to obtain a p-value which is compared to α/3 and α/5, respectively. We compare our approach with the screening step to the approach by Vansteelandt et al. (2009) with a Sidak correction. Since our approach allows for a screening step, we are better able to detect the SNP that has a direct causal effect on the target phenotype as seen in Tables 4, 6.

Note that the power in Tables 4, 6 is lower than that in Table 2 which is expected since multiple SNPs are tested. For more common allele frequencies, the power of using the proposed method with a screening step is more than double that of the Vansteelandt algorithm with a Sidak correction while the type-1 error rates are similar as seen in Tables 3, 5. Therefore, if multiple SNPs are tested, the proposed approach has better power to detect the SNP that has a direct effect on the phenotype of interest.

Table 3

Allele frequency (%)Type-1 error rate when 50 SNPs are tested
51015202530354045
Scenario 1: Model 10.00180.00080.00080.00060.00040.00060.00080.00060.0010
Scenario 1: Model 20.00140.00060.00020.00060.00120.00120.00060.00120.0006
Scenario 2: Model 10.00140.00060.00080.00120.00040.00080.00040.00080.0002
Scenario 2: Model 20.00040.00100.00120.00160.00120.00060.00100.00040.0006
Scenario 3: Model 10.00180.00060.00080.00140.00060.00100.00080.00080.0002
Scenario 3: Model 20.00140.00060.00080.00160.00120.00100.00120.00040.0006
Scenario 4: Model 10.00140.00060.00080.00120.00040.00080.00040.00080.0002
Scenario 4: Model 20.00080.00100.00130.00160.00120.00060.00100.00040.0006

This table displays the significance rate when one SNP does not have a direct effect on the phenotype Y but acts as seen in Figure 2 without the arrow from X to Y and 49 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 is used estimate γ1 with a screening step where the three SNPs with the largest test statistic given by Equation (8) are tested.

Since the proposed approach is valid for testing, but may have less power when either the XK or the XY link is strong, we looked at the effect of increasing the association between X and K when K influences Y (XK) and X and Y (XY). We increased the correlation between X and K from 0.025 to 0.05 and then 0.075. We also increased the correlation between X and Y from 0.05 to 0.10 and then 0.15. The power of both statistics remained very close. At most, the power of the Vansteelandt et al. statistic (Vansteelandt et al., 2009) was 0.9% better than our approach.

Since the test statistic for the screening step given in Equation (4) is susceptible to population stratification, we examined a few scenarios where population stratification was present. We simulated half of the subjects to have allele frequency of 5, 5, 20, and 40% and the other half of the subjects to have allele frequency of 10, 45, 25, and 45%, respectively. Similar to Tables 3, 4, one SNP has a direct effect on the phenotype of interest and 49 other SNPs are not associated with the phenotype of interest in Tables 7, 8. Similar to Tables 5, 6, one SNP has a direct effect on the phenotype of interest and 99 other SNPs are not associated with the phenotype of interest in Tables 9, 10. As seen in Tables 7, 9, the type-1 error rates are similar for both methods. As seen in Tables 8, 10, even though there is some population stratification present, the proposed method with a screening step still performs better than the Vansteelandt algorithm, especially when the allele frequencies are more common.

Table 4

Allele frequency (%)Power when 50 SNPs are tested
51015202530354045
Scenario 1: Model 10.0310.0390.0730.0750.1200.1500.1760.1910.191
Scenario 1: Model 20.0380.0730.1330.1880.2550.3210.3560.3680.431
Scenario 2: Model 10.0130.0300.0400.0740.1100.1120.1580.1620.172
Scenario 2: Model 20.0150.0560.1170.180.2360.2920.3440.3560.378
Scenario 3: Model 10.0310.0390.0740.0830.1210.1300.1850.1910.201
Scenario 3: Model 20.0380.0730.1360.1940.2570.3120.3680.3700.445
Scenario 4: Model 10.0120.0300.0630.0760.1100.1130.1590.1760.177
Scenario 4: Model 20.0150.0570.1070.1810.2350.2900.3440.3760.416

This table displays the power when one SNP has a direct effect on the phenotype Y and 49 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 is used estimate γ1 with a screening step where the three SNPs with the largest test statistic given by Equation (8) are tested.

Table 5

Allele frequency (%)Type-1 error rate when 100 SNPs are tested
51015202530354045
Scenario 1: Model 10.00100.00060.00040.00060.00070.00060.00020.00040.0005
Scenario 1: Model 20.00080.00040.00040.00060.00060.00060.00080.00020.0006
Scenario 2: Model 10.00060.00000.00080.00000.00000.00040.00020.00060.0002
Scenario 2: Model 20.00040.00040.00080.00020.00040.00060.00100.00040.0008
Scenario 3: Model 10.00100.00100.00020.00040.00000.00040.00020.00080.0000
Scenario 3: Model 20.00080.00040.00020.00020.00020.00060.00020.00020.0004
Scenario 4: Model 10.00060.00030.00040.00060.00070.00060.00020.00040.0005
Scenario 4: Model 20.00020.00040.00040.00060.00060.00060.00080.00020.0006

This table displays the significance rate when one SNP does not have a direct effect on the phenotype Y but acts as seen in Figure 2 without the arrow from X to Y and 99 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 is used estimate γ1 with a screening step where the five SNPs with the largest test statistic given by Equation (8) are tested.

Table 6

Allele frequency (%)Power when 100 SNPs are tested
51015202530354045
Scenario 1: Model 10.0140.0280.0490.0480.0840.1090.1110.1470.142
Scenario 1: Model 20.0210.0560.0990.1360.1960.2620.2770.3320.351
Scenario 2: Model 10.0040.0180.0400.0550.0760.0990.0980.1160.123
Scenario 2: Model 20.0140.0420.0880.1450.1780.2460.2490.2840.332
Scenario 3: Model 10.0180.0280.0380.0490.0870.0940.1120.1280.139
Scenario 3: Model 20.0230.0570.0990.1370.1980.2290.2830.3150.368
Scenario 4: Model 10.0060.0180.0400.0410.0760.0860.0980.1160.123
Scenario 4: Model 20.0110.0420.0880.1260.1780.2090.2490.2840.332

This table displays the power when one SNP has a direct effect on the phenotype Y and 99 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 is used estimate γ1 with a screening step where the five SNPs with the largest test statistic given by Equation (8) are tested.

Table 7

Allele frequencyType-1 error rate when the following population stratification is present
5 and 10%5 and 45%20 and 25%40 and 45%
Scenario 1: Model 10.00120.00140.00110.0013
Scenario 1: Model 20.00060.00060.00040.0005
Scenario 2: Model 10.00100.00060.00040.0006
Scenario 2: Model 20.00120.00130.00180.0020
Scenario 3: Model 10.00090.00020.00040.0011
Scenario 3: Model 20.00080.00120.00160.0008
Scenario 4: Model 10.00060.00140.00080.0009
Scenario 4: Model 20.00090.00060.00060.0012

This table displays the significance level when one SNP has an indirect effect on the phenotype Y as seen in Figure 2 without the arrow from X to Y and 49 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 to is used estimate γ1 in a screening step where the three SNPs with the largest test statistic given by Equation (4) are tested. Population stratification is present such that half of the subjects have one of the allele frequencies listed and the other half of the subjects have the other allele frequency listed.

Table 8

Allele frequencyPower when the following population stratification is present
5 and 10%5 and 45%20 and 25%40 and 45%
Scenario 1: Model 10.0250.0700.1110.171
Scenario 1: Model 20.0640.1990.2480.394
Scenario 2: Model 10.0160.0700.1030.163
Scenario 2: Model 20.0400.2050.2270.366
Scenario 3: Model 10.0250.0700.1130.172
Scenario 3: Model 20.0640.2020.2490.396
Scenario 4: Model 10.0160.0640.1030.163
Scenario 4: Model 20.0400.1860.2270.366

This table displays the power when one SNP has a direct effect on the phenotype Y and 49 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 to is used estimate γ1 with a screening step where the three SNPs with the largest test statistic given by Equation (4) are tested. Population stratification is present such that half of the subjects have one of the allele frequencies listed and the other half of the subjects have the other allele frequency listed.

Table 9

Allele frequencyType-1 error rate when the following population stratification is present
5 and 10%5 and 45%20 and 25%40 and 45%
Scenario 1: Model 10.00110.00050.00070.0003
Scenario 1: Model 20.00090.00060.00080.0003
Scenario 2: Model 10.00040.00150.00090.0005
Scenario 2: Model 20.00030.00110.00120.0005
Scenario 3: Model 10.00040.00100.00080.0004
Scenario 3: Model 20.00060.00090.00100.0006
Scenario 4: Model 10.00080.00130.00070.0004
Scenario 4: Model 20.00100.00080.00110.0006

This table displays the significance level when one SNP has an indirect effect on the phenotype Y as seen in Figure 2 without the arrow from X to Y and 99 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 to is used estimate γ1 with a screening step where the five SNPs with the largest test statistic given by Equation (4) are tested. Population stratification is present such that half of the subjects have one of the allele frequencies listed and the other half of the subjects have the other allele frequency listed.

Table 10

Allele frequencyPower when the following population stratification is present
5 and 10%5 and 45%20 and 25%40 and 45%
Scenario 1: Model 10.0220.0500.0730.157
Scenario 1: Model 20.0440.1410.1700.324
Scenario 2: Model 10.0140.0460.0710.148
Scenario 2: Model 20.0360.1370.1610.298
Scenario 3: Model 10.0220.0500.0760.159
Scenario 3: Model 20.0450.1430.1740.326
Scenario 4: Model 10.0140.0460.0710.148
Scenario 4: Model 20.0360.1370.1610.298

This table displays the power when one SNP has a direct effect on the phenotype Y and 99 SNPs are not associated with the phenotype Y.

Model 1 is used to estimate γ1 with a Sidak correction and Model 2 to is used estimate γ1 with a screening step where the five SNPs with the largest test statistic given by Equation (4) are tested. Population stratification is present such that half of the subjects have one of the allele frequencies listed and the other half of the subjects have the other allele frequency listed.

Data analysis: an application to the framingham study

We evaluated the practical relevance of the proposed adjustment principle by an application to the Framingham Heart Study with 1400 probands (Herbert et al., 2006). For the target phenotype, we selected the lung-function measurement FEV1. For the secondary phenotype K, we selected height. Gender, and age represent L, the collection of common risk factors between FEV1 and height. For rs2415815 a SNP associated with both height and FEV1, the test statistic equals 0.044 with corresponding p-value equal 0.83. As a result, we fail to reject the null hypothesis and conclude that there is no evidence that the SNP acts directly on FEV1 other than via body height.

Discussion

Our proposed FBAT assesses the direct genetic effect of a marker locus on the phenotype of interest, other than through another correlated phenotype. The adjustment is based on the conditional mean model approach and can be incorporated into the FBAT-approach in a straightforward fashion. The power of the approach is assessed by simulation studies and shown to be similar to the Vansteelandt et al. method when only one SNP is being tested and superior when multiple SNPs are being tested (Vansteelandt et al., 2009). Unlike the Vansteelandt et al. method, this method uses a screening step and has the unique advantage in situations in which a large number of SNPs are tested for a direct effect on the phenotype of interest. Since the number of tests will be much smaller than the total number of SNPs, this will lead to substantial reduction in the adjustment for multiple-comparisons and will result in improved overall statistical power. In this process, the screening step works under the assumption of no population admixture, but the final analysis of the selected SNPs is robust against it.

While we considered several causal scenarios, if the causal relationships assumed in the DAGs are not true this could cause problems for the proposed method. For example, a causal arrow KY or LY could introduce spurious association for this method. Therefore, one needs to makes sure that the assumptions of the DAG are met before using the proposed approach. While the simulations considered 50 and 100 SNPs, a realistic application could involve thousands of GWAS SNPs. This leads to extreme multiple test corrections and may lead to very different behavior than the behavior observed in the simulation studies (Morris and Elston, 2011). Furthermore, if phenotypes of the founders are known, the proposed method could perform poorly compared to population-based approaches.

For the screening step in the Simulations section, we chose 3 out of 50 and 5 out of 100 SNPs since this is roughly 5% of the tested SNPs. Another number of SNPs could be chosen for the screening step. Although, if the majority of SNPs are chosen in the screening step (i.e., 40 out of 50 SNPs), this increases the number of multiple comparisons and can decrease power. If too few SNPs are chosen in the screening step (i.e., 1 out of 50 SNPs), this decreases the number of multiple comparisons, but one may fail to detect the causal SNP since too few SNPs were chosen. Care needs to be given to the number of SNPs chosen in the screening step (Van Steen et al., 2005). One cannot simply choose different numbers of SNPs for the screening step until significant results are found since this will inflate the type-1 error rate (Van Steen et al., 2005).

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Statements

Acknowledgments

This work was funded by NIH/NHLBI U01 HL089856 Edwin K. Silverman, PI. COPDGene is supported by NHLBI Grant Nos U01HL089897 and U01Hl089856. This work was also funded by FWO grant G.0111.12.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    AmosC. I.WuX.BroderickP.GorlovI. P.GuJ.EisenT.et al. (2008). Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25. Nat. Genet. 40, 616622. 10.1038/ng.109

  • 2

    BerzuiniC.VansteelandtS.FocoL.PastorinoR.BernardinelliL. (2012). Direct genetic effects and their estimation from matched case-control data. Genet. Epidemiol. 36, 652662. 10.1002/gepi.21660

  • 3

    ColeS. R.HernanM. A. (2002). Fallibility in estimating direct effects. Int. J. Epidemiol. 31, 163165. 10.1093/ije/31.1.163

  • 4

    HerbertA.GerryN. P.McQueenM. B.HeidI. M.PfeuferA.IlligT.et al. (2006). A common genetic variant is associated with adult and childhood obesity. Science312, 279283. 10.1126/science.1124779

  • 5

    LangeC.DeMeoD.SilvermanE. K.WeissS. T.LairdN. M. (2003a). Using the noninformative families in family-based association tests: a powerful new testing strategy. Am. J. Hum. Genet. 73, 801811. 10.1086/378591

  • 6

    LangeC.LyonH.DeMeoD.RabyB.SilvermanE. K.WeissS. T. (2003b). A new powerful non-parametric two-stage approach for testing multiple phenotypes in family-based association studies. Hum. Hered. 56, 1017. 10.1159/000073728

  • 7

    MorrisN.ElstonR. (2011). A note on comparing the power of test statistics at low significance levels. Am. Stat. 65, 164. 10.1198/tast.2011.10117

  • 8

    MurphyA.WeissS. T.LangeC. (2008). Screening and replication using the same data set: testing strategies for family-based studies in which all probands are affected. PLoS Genet. 4:e1000197. 10.1371/journal.pgen.1000197

  • 9

    RabinowitzD.LairdN. (2000). A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum. Hered. 50, 211223. 10.1159/000022918

  • 10

    ThorgeirssonT. E.GellerF.SulemP.RafnarT.WisteA.MagnussonK. P.et al. (2008). A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature452, 638642. 10.1038/nature06846

  • 11

    VanderWeeleT. J.AsomaningK.TchetgenE. J.HanY.SpitzM. R.SheteS.et al. (2012). Genetic variants on 15q25.1, smoking and lung cancer: an assessment of mediation and interaction. Am. J. Epidemiol. 175, 10131020. 10.1093/aje/kwr467

  • 12

    VansteelandtS. (2010). Estimating direct effects in cohort and case-control studies. Epidemiology21, 278. 10.1097/EDE.0b013e3181cd72aa

  • 13

    VansteelandtS.GoetgelukS.LutzS.WaldmanI.LyonH.SchadtE. E.et al. (2009). On the adjustment for covariates in genetic association analysis: a novel, simple principle to infer direct causal effects. Genet. Epidemiol. 33, 394405. 10.1002/gepi.20393

  • 14

    VansteelandtS.LangeC. (2012). Causation and causal inference for genetic effects. Hum. Genet. 131,16651676. 10.1007/s00439-012-1208-9

  • 15

    Van SteenK.McQueenM. B.HerbertA.RabyB.LyonH.DemeoD. L.et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat. Genet. 37, 683691. 10.1038/ng1582

  • 16

    WonS.BertramL.BeckerD.TanziR. E.LangeC. (2009). Maximizing the power of genome-wide association studies: a novel class ofpowerful family-based association tests. Stat. Biosci. 1, 125143. 10.1007/s12561-009-9016-z

Appendix

The following proof shows that the test statistics in the first and second screening steps are uncorrelated under the null hypothesis. As discussed in the introduction and methods sections, is the adjusted phenotype for the effect that the phenotype K has on the target phenotype Y. For ease of notation, we will use for this proof. Suppose that the null hypothesis is true that X has no effect on Y other than through K. Let where Φ equals the identity link or exponential link and w(U) is an arbitrary function. Without loss of generality, for the following proof, let Φ equal the identity link. This model does not involve X because we are working under the null hypothesis of no direct effect. Furthermore, the parameter γ1 in this model is the same as in model for some function w*(X, L, S) of (X, L, S), which can be seen by inferring this model from model (9) upon noting that Y ⫫ (L, S)|X, K, U and UK|L, X, S. Using model (9) and model (10) and noting that YS|X, K, U and XU|S, then

As a result of Equation (11) and model (9), the can be written as follows where

We will show that the by showing that each part of the above equation equals zero.

because XU|S.

because YS|X, K, U.

because YS|X, K, U and YX|K, U.

Assuming that Var(Y|K, U) is constant, as we do throughout, it is immediate that the term Part3 is zero. As a result, this shows that .

Summary

Keywords

family-based association analysis, causal inference, genetic pathway, mediation, pleiotropy

Citation

Lutz SM, Vansteelandt S and Lange C (2013) Testing for direct genetic effects using a screening step in family-based association studies. Front. Genet. 4:243. doi: 10.3389/fgene.2013.00243

Received

28 June 2013

Accepted

25 October 2013

Published

21 November 2013

Volume

4 - 2013

Edited by

Xiangqing Sun, Case Western Reserve University, USA

Reviewed by

Nathan Morris, Case Western Reserve University, USA; Sungho Won, Chung-Ang University, South Korea

Copyright

*Correspondence: Sharon M. Lutz, Department of Biostatistics, University of Colorado, 13001 E. 17th Place, B119 Building 500, Room W3128, Anschutz Medical Campus, Aurora, CO 80045, USA e-mail:

This article was submitted to Applied Genetic Epidemiology, a section of the journal Frontiers in Genetics.

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics