Localization of Association Signal from Risk and Protective Variants in Sequencing Studies

Aggregating information across multiple variants in a gene or region can improve power for rare variant association testing. Power is maximized when the aggregation region contains many causal variants and few neutral variants. In this paper, we present a method for the localization of the association signal in a region using a sliding-window based approach to rare variant association testing in a region. We first introduce a novel method for analysis of rare variants, the Difference in Minor Allele Frequency test (DMAF), which allows combined analysis of common and rare variants, and makes no assumptions about the direction of effects. In whole-region analyses of simulated data with risk and protective variants, DMAF and other methods which pool data across individuals were found to outperform methods which pool data across variants. We then implement a sliding-window version of DMAF, using a step-down permutation approach to control type I error with the testing of multiple windows. In simulations, the sliding-window DMAF improved power to detect a causal sub-region, compared to applying DMAF to the whole region. Sliding-window DMAF was also effective in localizing the causal sub-region. We also applied the DMAF sliding-window approach to test for an association between response to the drug gemcitabine and variants in the gene FKBP5 sequenced in 91 lymphoblastoid cell lines derived from white non-Hispanic individuals. The application of the sliding-window test procedure detected an association in a sub-region spanning an exon and two introns, when rare and common variants were analyzed together.


INTRODUCTION
Traditional genome-wide association analysis approaches, which analyze a single variant at a time, are underpowered to detect associations with rare variants (Bansal et al., 2010). Small to moderate effects at multiple rare variants could play an important role in explaining the missing heritability observed for many complex traits (Manolio et al., 2009). Several proposed methods have sought to improve power by aggregating information across a set of variants, for example, in a gene (Bansal et al., 2010). Many methods assume all rare variants have the same direction of effect; such methods are subject to loss of power in the presence of both risk and protective variants. There is a need for flexible methods which can detect associations with both risk and protective variants.
Power to detect association is reduced when the region includes many non-causal variants, which decrease the signal-to-noise ratio (Li and Leal, 2008). Therefore, when causal variants are clustered in a sub-region within a larger region of interest, power would be maximized by analyzing only that sub-region, because neutral variants outside the sub-region would be excluded. However, because the location of the causal sub-region (if any) is unknown for real data, it is necessary to test multiple sub-regions. There is a need for methods which analyze sub-regions -potentially increasing power to detect association through exclusion of some neutral variants -while minimizing power loss due to multiple testing.
A sliding-window based approach could meet this need. Sliding windows have been shown to improve power for detection of an association due to 1-3 causal variants, compared to single-marker analysis (Li et al., 2007;Tang et al., 2009) and haplotype block partitioning (Guo et al., 2009), even after multiple-test correction. However, there is a need for region-based sliding-window approaches to enable the analysis of longer windows containing more than three causal variants of small effect. In addition, one challenge of region-based association methods is that detecting an association in a large region is not informative about the specific functional elements within the region that may be causal. Slidingwindow based analysis is a means of localizing the association signal to a smaller sub-region, such as an exon within a candidate gene or a gene within a candidate pathway.
In this paper, we present a sliding-window, region-based approach for rare variant association testing which makes no assumptions about the direction of effects. Our approach uses a novel method for analysis of rare variants, the Difference in Minor Allele Frequency test (DMAF). DMAF allows combined analysis of common and rare variants, and can be extended to the analysis of pooled sequencing data, for which many collapsing methods are not applicable. Our method allows weighting of markers based on minor allele frequency (MAF; Madsen and Browning, 2009) or functional information (Price et al., 2010). We compared DMAF with eight other methods for whole-region analysis on a comprehensive set of simulations, and found that DMAF's use of a positive function of the difference in MAF between cases and controls is effective in retaining power across simulations involving risk and protective variants, as well as the scenarios with only risk variants. We then applied the sliding-window DMAF to simulated regions containing a cluster of approximately 65 causal variants. For a wide range of window sizes, the sliding-window approach improved power compared to the whole-region analysis and was effective in localizing the causal sub-region. Finally, we applied this method to a cell based model system to localize an association between the gene FKBP5 and response to the drug gemcitabine.

DMAF RARE VARIANT TESTING APPROACH
For each single nucleotide variant (SNV) j, j = 1, . . ., J, let D j represent the absolute value of the difference in MAF between cases and controls, D j = |X j − Y j |. An alternative function, D j = (X j − Y j ) 2 , was also compared; these approaches are distinguished as DMAF abs and DMAF sq . By using a positive function of the difference in MAF, our method places equal importance on risk variants (which are expected to have X j > Y j ) as on protective variants (which are expected to have X j < Y j ). The test statistic is then computed as the weighted sum over the variants of interest, V = j∈A w j D j , where w j is the weight for variant j and A is the set of variants of interest. A may include all variants in a window or only rare variants. We used a threshold of MAF ≤ 0.05 to classify variants as rare. When using DMAF sq with equal numbers of cases and controls, the test statistic V is equivalent to Q, the test statistic of SKAT (Wu et al., 2011). Unlike SKAT, DMAF places equal weight on cases and controls, regardless of their relative sample sizes, to emphasize information available from potentially limited numbers of cases. The significance of V is determined empirically by permuting case-control status n times and recalculating V for each permutation. We used n = 1000. For the sliding-window analysis, multiple-test correction for windows of a given size was performed using a step-down approach based on a second set of permutations (see below).
Various choices exist for w j , such as weights based on functional information (Price et al., 2010). However, there are many situations in which functionally based weights are unreliable or unavailable, such as intergenic regions. For this reason, we used weights based on the MAF: w j = 1/ n j q j (1 − q j ), where n j is the number of individuals genotyped (or imputed) for variant j and q j is the overall MAF for the variant. This model places greater emphasis on rare alleles, which are a priori believed to be more likely to have larger effect sizes (Manolio et al., 2009). It also prioritizes large relative differences in MAF, even for small absolute differences at rare variants. This model is similar to that used by Madsen and Browning (2009); however, we base q j on cases and controls, rather than controls only, to put equal emphasis on risk and protective alleles.

STEP-DOWN PERMUTATION-BASED CORRECTION FOR MULTIPLE TESTING
For sliding-windows of a given size (number of variants), multiple-test correction was performed using a step-down permutation-based approach (Westfall et al., 1999). For each window, an empirical distribution of the test statistic V was generated from 1000 permutations of the phenotype. This distribution was used to produce an empirical p-value for the test statistic V for each window. The phenotype was then permuted an additional 1000 times, and an empirical p-value for V was determined for the second set of permuted phenotypes. These p-values comprised a p-value matrix M, consisting of m rows by 1000 columns, where m is the number of windows of the given size. The p-values based on the observed phenotype were then ordered from smallest to largest in the vector p, and the rows of M were reordered in the same order. Then, the first element (smallest p-value) of p was compared to M min = min 1≤i≤m M i,j : 1 ≤ j ≤ 1000 , the set of column minimums of M. The multiple-test corrected p-value for the window corresponding to the element p 1 is the proportion of elements of M min that are smaller than p 1 . The first row of M was then removed and p 2 was compared to the column minimums of the smaller M to achieve the step-down correction, which is less conservative than a Bonferroni correction. To preserve monotonicity of p-values, the multiple-test corrected p-value for p j , j > 1 was calculated as max p j−1 , {M min < p j } /1000 . One goal of the current study was to assess the robustness of results to the window size. Therefore, no correction was made for the multiple window sizes tested.

Simulated data
The coalescent simulators ms (Hudson, 2002) and msHOT (Hellenthal and Stephens, 2007) were used to simulate sequence data under no natural selection for three regions. Each region was 50 kb in length and had a mutation rate of µ = 10 −8 mutations/bp/generation, an effective population size of 10,000, and a recombination rate of 1 cM/Mb. Regions 2 and 3 also had a hotspot of length 2 kb in which the recombination rate was 15 cM/Mb. We simulated 100,000 diploid individuals and generated phenotypes according to a null model and six models with causal SNVs ( Table 1). All of the models used a multiplicative model for genetic effect: Pr(y i = case|genotype) = Π j c j OR ij , where OR ij is the odds ratio of the variants carried by individual i for variant j and c j is a constant of proportionality. For each region and genetic model, c j was chosen to produce a population prevalence of 10%. To test the sensitivity of DMAF and other methods of rare variant analysis, we sampled 100 sets of 200 cases and 200 controls from each simulated data set to mimic a small but realistic sample size for sequencing studies (Wang et al., 2010;Jeoung et al., 2012;Silva et al., 2012), in which detection of rare variant associations is more challenging than in larger studies. We included causal effects at both rare and low-frequency variants to permit sufficient power for discrimination among analysis methods using realistic effect sizes for a sample size of 400 subjects.
In models A-F, half of the rare variants were risk alleles, and half were neutral or protective; different thresholds were used to classify variants as rare ( Table 1). The DMAF method is expected to have greater power for models D and F than A and E, respectively, because models D and F contain more causal variants. In contrast, methods that do not accommodate protective variants are expected to have reduced power for models D and F, as the signal from risk variants will be canceled out by protective variants in the same region. Like previous simulations for rare variant analysis (Liu and Leal, 2010;Neale et al., 2011), we applied smaller ORs for more common alleles: OR = 1.5 for risk variants with MAF > 0.01, OR = 1.7 for 0.001 ≤ MAF ≤ 0.01; OR = 2.0 for MAF < 0.001. The ORs for protective variants were the reciprocals of the effect sizes for the risk variants. In contrast to many previous simulations, we have simulated large regions with many causal variants ( Table 2). Due to the low MAF of the causal variants and the low population prevalence, the proportion of trait variance (Nagelkerke, 1991) explained by the set of causal variants ranged from 3.8 to 13.6%, with a mean of 7.9%. These values are consistent with the proportion of variation explained by individual linkage groups for growth-related phenotypes in brook charr (Sauvage et al., 2012) and the proportion explained by the set of known loci for type 2 diabetes in humans (Taneera et al., 2012). Therefore, our simulations realistically model complex traits influenced by a large number of variants of small effect, a situation that has been observed and hypothesized in humans and other species (Manolio et al., 2009;Ai et al., 2012;Marian, 2012).

Rare variant association methods assessed
We compared DMAF to eight other rare variant association testing methods ( Table 3). Both DMAF abs and DMAF sq were used to analyze all variants or variants with MAF ≤ 0.05, denoted as DMAF abs,all , DMAF sq,all , DMAF abs,rare , or DMAF sq,rare , respectively. The C-alpha test, like DMAF, was applied to all variants or to rare variants only (MAF ≤ 0.05), denoted as C-alpha all or Calpha rare . All other methods, except Variable Threshold (VT), were applied to variants with MAF ≤ 0.05. KBAC was applied using software obtained from the authors; VT was applied using software obtained 1 . All other approaches were implemented in R (R Development Core Team, 2011). The nine rare variant association testing approaches differed in the dimension across which information is pooled. For example, DMAF computes the frequency difference for each variant; thus information from all subjects is pooled into a single piece of information for the variant. In addition to DMAF, the methods C-alpha, 1 http://genetics.bwh.harvard.edu/rare_variants Hotelling's T 2 , and CMC also pool information across subjects. In contrast, the other methods examined pool information across variants.

SIMULATION STUDY II: SLIDING-WINDOW DMAF ANALYSIS
Using the sequence data from simulation study I, we simulated two sets of phenotypes with causal variants clustered in a subregion of regions 1, 2, and 3. In both models, the set of rare variants (MAF ≤ 0.05) was subdivided into thirds based on position. The first and last third of rare variants and all common variants were neutral. The middle third of rare variants were all risk variants (model G) or half risk, half protective (model H). The effect sizes for risk variants were OR = 1.7 for variants with MAF > 0.01, OR = 2.0 for 0.001 ≤ MAF ≤ 0.01, and OR = 2.2 for MAF < 0.001. We analyzed 1000 simulated data sets based on models G and H for regions 1, 2, and 3 using window sizes ranging from 10 SNVs to the entire region, in increments of 10 SNVs. For the longest window sizes, fewer than 1000 simulations were analyzed since not all simulations included more than 110 polymorphic variants. Window sizes that were analyzed for 700 or more simulations were included in the power calculations. While models G and H included effects at rare variants only, both rare and common variants were analyzed together to reflect realistic circumstances under which the MAF threshold for causal "rare" variants is unknown. DMAF abs was used for the analysis since it outperformed DMAF sq for a majority of the scenarios in simulation study I when rare and common variants were analyzed together. The window position was shifted in increments of 5 SNVs or 10% of the window length, whichever was greater. In addition to the analysis of models G and H, we analyzed 1000 simulations based on a null model for each region and window size to check that type I error was controlled.
An analysis for a given simulation and window size was considered to have detected an association if the multiple-test corrected p-value for any window was less than 0.05. In addition to assessing power, we were interested in the effectiveness of sliding windows for localizing the causal sub-region. For each simulation and window size, we determined whether the window (or set of windows) with the most significant p-value overlapped the causal sub-region by at least half the length of the window or set (measured by the number of markers).

APPLICATION OF DMAF TO A PHARMACOGENOMIC STUDY
Gemcitabine is a chemotherapy drug used to treat pancreatic, breast, and other solid tumors. A previous expression study identified FKBP5 as a candidate gene for association with gemcitabine resistance . Using Illumina's Genome Analyzer, we resequenced FKBP5 in 91 lymphoblastoid cell lines derived from Caucasians in the Human Variation Panel , and identified 641 variants. The quantitative drug response phenotype of gemcitabine IC50 (effective dose that kills 50% of the cells) was estimated using a four parameter logistic model per cell line (Gallant, 1987), followed by the adjustment of log(IC50) for sex, age, and batch of cell lines (Tan et al., 2011). We defined a binary endpoint, with the top 50% of adjusted log IC50 values considered "resistant," and the bottom 50% considered "sensitive." All SNVs in FKBP5 were analyzed using DMAF and the other methods listed in Table 3. Subsequently, a sliding-window analysis was performed using DMAF in windows of length 10-50 SNVs, with window size and position adjusted in increments of 5 SNVs.

SIMULATION STUDY I: ASSESSMENT OF DMAF TESTING FRAMEWORK
The difference in Minor Allele Frequency test, along with C-alpha, CMC, and aSum, had slightly elevated type I error rates (Table 4). For a fair comparison across methods, we computed the power of each method at an empirical type I error rate of 0.05. When applied to simulations without protective variants (models A, B, C, and E), most methods performed well (Figure 1), with WSS and methods analyzing all variants having lower power. In the presence of both risk and protective variants (models D and F), the methods DMAF, C-alpha, Hotelling's T 2 , and CMC (warm colors, Figures 1D,F) had greater power than other methods. This demonstrates that DMAF is as powerful as or more powerful than a wide range of frequently used methods, establishing its feasibility as the base method for a sliding-window analysis approach. As expected, the methods KBAC, RVT1, VT, and WSS, which assume that all causal variants have the same direction of effect, suffered reduced power on models D and F compared to models A and E, respectively, while DMAF, C-alpha, and Hotelling's T 2 experienced increased power or no significant change in power ( Table 5). CMC did not suffer reduced power, although it is not specifically designed to accommodate protective variants, while aSum suffered reduced power despite its intended accommodation of protective variants. Models A, B, and D had no causal variants with MAF > 0.05. For these simulations, DMAF rare consistently outperformed DMAF all . This is consistent with previous findings that including neutral variants in the analysis decreases power (Li and Leal, 2008). Models C, E, and F included causal variants with MAF > 0.05. For these models, DMAF all was not consistently superior to DMAF rare (Figure 1). This contrasts with the results of Li and Leal (2008), who found that excluding causal variants from analysis was more detrimental to power than including excess neutral variants. When rare and common variants were analyzed together, power for DMAF abs was better than or equal to DMAF sq in 15 out of 18 scenarios. For this reason, DMAF abs was used for the analysis of Simulation study II.

SIMULATION STUDY II: SLIDING-WINDOW DMAF ANALYSIS
At α = 0.05, the type I error rates for the sliding-window analyses were between 0.037 and 0.058 for all window sizes that allowed analysis of at least 700 simulations. Therefore, the step-down permutation procedure adequately controlled type I error. For region 1, most sizes of sliding window gave power similar to the wholeregion analysis. For regions 2 and 3, the sliding-window analysis outperformed the whole-region analysis for all window sizes from 30 to 110 SNVs (Figure 2). This demonstrates that sliding-window DMAF can improve power compared to a whole-region analysis, and the improvement is robust to choice of window size. The sliding-window approach was also effective in localizing the causal sub-region. The window with the most significant p-value overlapped the causal sub-region by at least half the window's length in a majority of simulations (Table 6), particularly for regions 2 and 3.

APPLICATION OF ASSOCIATION METHODS TO A PHARMACOGENOMIC STUDY
When rare variants alone were analyzed, no method found a significant association (p < 0.05) between FKBP5 and gemcitabine sensitivity ( Table 7). When rare and common variants were analyzed together, we found a significant association using DMAF, along with CMC, Hotelling's T 2 , and C-alpha * , an alternative version of C-alpha based on heterogeneity of odds ratios (Zelterman and Chen, 1988). This suggests that variants with MAF > 0.05 play an important role in the detection of gemcitabine response associations, at least when sample sizes are small. This possibility is supported by the fact that the VT method, which utilizes multiple MAF thresholds, had the next lowest p-value (0.129). The methods aSum, KBAC, RVT1, and WSS were unable to detect a significant association even when common variants were included in the analysis. Despite the implication of an association with common variants, a single-marker analysis of the 152 SNVs with MAF > 0.05 found no variants significant at the α = 0.05 level after correction for multiple testing. This demonstrates the value of methods which aggregate information from multiple variants even when common variants may be associated with the trait.
The sliding-window analysis of FKBP5 successfully localized the association peak to the sub-region spanning introns 1 and 2b and the exon between them (Figure 3). All window lengths identified a peak in this region, further demonstrating our method's robustness to window size. Analysis using only rare variants localized the peak to approximately the same location, intron 2b; however, this association was not significant after multiple-test correction.

DISCUSSION
In this paper, we introduced DMAF, a novel approach for aggregating information across a genetic region to increase power for association testing. DMAF's use of a positive function of the difference in minor allele frequencies between cases and controls places equal importance on risk and protective variants, and its accommodation of both rare and common variants, with weights based on MAF or biological information (e.g., functional variants given more weight), gives our method flexibility to adapt to a wide range of traits. DMAF demonstrates improved power compared to many widely used methods in the presence of both risk and protective variants. Its sliding-window implementation can increase power to detect an association due to a causal sub-region relative to whole-region analysis, and can localize associations within a gene or region. Both implementations of DMAF are available as an R package 2 .
On simulations with both risk and protective variants, rare variant analysis methods which pooled data from individuals (DMAF, CMC, Hotelling's T 2 , C-alpha) had the greatest power. We also found that aSum, a method designed to accommodate protective variants but which pools information from variants, had weaker power on simulations with protective variants. Taken together, these results suggest that pooling data from individuals improves power for analysis of risk and protective rare variants. This indicates that sequencing pooled DNA could be used as a cost-saving measure in association studies without loss of power due to a combination of risk and protective variants.
Our simulations of causal variants clustered in a sub-region demonstrate that by using sliding windows, the power gained from reducing the proportion of neutral variants in windows which overlap the causal sub-region can outweigh the power loss due to multiple testing of the various window locations. The superior performance of the sliding-window approach on regions 2 and 3, compared to region 1, could be due to the hotspots in these regions, which interrupt linkage disequilibrium (LD) between the causal sub-region and the rest of the region. This reduced correlation results in a lower signal-to-noise ratio (compared to region 1) when non-causal variants are included in the analysis. The fact that a wide range of window sizes in the sliding-window analysis performed better than the whole-region analysis demonstrates that the sliding-window approach is a robust, powerful alternative for region-based rare variant analysis.
In the sliding-window analysis of FKBP5 for gemcitabine response, the strongest association was p = 0.019, which was more significant than the whole-gene association (p = 0.049); however, multiple tests (window sizes) were required to achieve this. Due to the robustness of results to the window size that was demonstrated in our simulations and real data analysis, in the future it would be possible to perform a sliding-window analysis using a single window size. This could result in increased power, without the need for additional multiple-test correction. Sliding-window analyses also allow localization of region-based association signals, which will be valuable for understanding the functional regions responsible for identified associations. To highlight the usefulness of region-based association methods such as DMAF, our simulation studies involved large numbers of causal variants which explain a modest proportion of the trait variance. This is a realistic model for sequencing studies of candidate genes or pathways. In a candidate gene study, many non-synonymous changes to the gene could affect the shape of the protein encoded by the gene, and many synonymous variants could each have a small impact on the protein's translation rate. However, if these variants are rare SNVs, or even de novo mutations, then a region-based analysis method such as DMAF would be necessary to detect the variants' combined effect on a trait influenced by the protein. In candidate pathway studies, multiple genes are grouped to detect their combined effect on a trait. If each gene in the pathway contributes a few rare variants of moderate effect, then the combined whole-region analysis would include dozens or hundreds of causal variants, as simulated here.
The results of our simulation studies ( Table 6) indicate that the sliding-window analysis is effective in localizing associations within a region using a relatively small sample size of 200 cases and 200 controls. This localization may implicate a fairly large sub-region, as in the gemcitabine analysis of 91 cell lines, in which the association peak spanned two introns comprising approximately 77 kb. In the future, it would be valuable to analyze a larger sample, as this might enable more precise localization.
The dichotomization of gemcitabine response based on upper and lower 50% of IC50 values is somewhat artificial, as it results in assigning distinct phenotypes to individuals with similar intermediate responses. DMAF, like many methods for rare variant analysis, is currently applicable only to binary traits. In the future, it would be beneficial to extend DMAF to quantitative traits. It would also be worthwhile to explore the possibility of determining window size based on number of base pairs or extent of LD, rather than number of SNVs. Finally, it would be valuable to explore other variant-weighting schemes, such as a positional weighting scheme within each sliding window.