A set-based association test identifies sex-specific gene sets associated with type 2 diabetes

Single variant analysis in genome-wide association studies (GWAS) has been proven to be successful in identifying thousands of genetic variants associated with hundreds of complex diseases. However, these identified variants only explain a small fraction of inheritable variability in many diseases, suggesting that other resources, such as multilevel genetic variations, may contribute to disease susceptibility. In this work, we proposed to combine genetic variants that belong to a gene set, such as at gene- and pathway-level to form an integrated signal aimed to identify major players that function in a coordinated manner conferring disease risk. The integrated analysis provides novel insight into disease etiology while individual signals could be easily missed by single variant analysis. We applied our approach to a genome-wide association study of type 2 diabetes (T2D) with male and female data analyzed separately. Novel sex-specific genes and pathways were identified to increase the risk of T2D. We also demonstrated the performance of signal integration through simulation studies.


INTRODUCTION
Advancements in microarray and next generation sequencing technologies enable people to find more and more genetic variants, including small variations in single nucleotide polymorphisms (SNPs) and large variations, such as indel and copy number variation. Genome-wide association studies (GWAS), which focus on the association between SNPs and complex diseases, have been proven to be a powerful tool to unveil the genetic secret of complex diseases. However, the identified SNPs can only account for a small fraction of heritability in many diseases, motivating people to develop more advanced statistical methods and novel models with the hope to find the missing heritability. Among the large efforts being pursued, it is natural to consider gene-set or pathway-based analysis given that genes in a pathway or network tend to work coordinately to fulfill their tasks. The subtle effects in multiple SNPs can be combined so that the joint signal in a gene set could be potentially boosted. A variety of public resources are available to create the gene-set, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, 2000), Reactome (Joshi-Tope et al., 2005) and Gene Ontology (Ashburner et al., 2000). Signal-based combination methods have been broadly applied in this context and can be classified into two categories: the one-step approach and the two-step approach. The former one treats all SNPs equally and combines effects of all SNPs within a certain pathway to come up with a total signal. The latter one firstly obtains a gene-score by combining SNP effects in a gene, then constructs a higher hierarchical test statistic to score a gene-set (or pathway) using the previously achieved gene scores. In either case, the significance of a bigger set (e.g., a pathway) is evaluated using either asymptotic approach, or more practically, permutation techniques. However, the one-step approach might lose power since the association signals are equally considered in different genes without considering the hierarchical structure of a gene set.
Toward the signal integration, p-value combination is a popular one since p-value possesses several nice characteristics such as model-free, clear interpretation and standard scale. Different p-value combination methods were widely discussed in the literature, such as Fisher's method that summarizes signals using a transformation of the p-value product as the test statistic (Fisher, 1932), the truncated product (TP) method which considers the product of p-values less than a pre-defined threshold (Zaykin et al., 2002), the rank truncated product (RTP) method which employs top R most significant p-values with R being the preselected rank truncation threshold (Zaykin et al., 2006), and the adaptive rank truncation product (ARTP) method where the rank truncation point is adaptively selected from T truncation candidates R 1 , · · · , R T to optimize the permutation p-value of the test statistic (Hoh et al., 2001;Dudbridge and Koeleman, 2004). Furthermore, p-value combination can be applied either in one-step combination or two-step combination. Although the one-step ARTP method (e.g., Hoh et al., 2001;Dudbridge and Koeleman, 2004) is more flexible to the choices of the truncation point compared to other one-step truncation methods, it uses one single threshold for all the genes in a pathway, disregarding of their individual sizes and linkage disequilibrium (LD) structures, hence may not be reasonable in practice. In fact, an overestimated truncation might dilute the signal, while underestimation will lead to major signal loss. To overcome the limitation of the one-step analysis, a two-step ARTP model was developed Yu et al. (2009). However, there is no unified criterion for the choice of T. In addition, as a permutation-based method, ARTP is computationally expensive to achieve a small p-value, and this burden could be much heavier when T is chosen to be large.
These limitations motivate us to consider a two-step analysis while maintaining the hierarchical genetic structure in a gene set. Li et al. (2011) proposed a one-step p-value combination approach to infer pathway regulations in an eQTL mapping framework. By introducing the hierarchical pathway structure, we extend this method to a two-step one under the GWAS framework and name it as scaled chi-square combination (SCC) method. We propose to combine individual SNP p-values in a lower genetic system (e.g., a gene) in the first step, then summarize the gene signals in a higher system (e.g., a pathway) in the second step, while considering the correlations among genetic variants in the SNP-and gene-level. This joint analysis could spark additional insights into pathway function on a disease trait that otherwise could be missed by looking at individual signals alone. Another advantage using SCC method is that there is no need for truncation, which also implies no information dropping. Besides, since p-value is derived based on a chi-square approximation, small p-values can be obtained which cannot be easily achieved by permutation-based methods.
Examples of other two-step methods also include gene set scan (GSS) (Schaid et al., 2012) and modified gene set enrichment analysis (MGSEA) (Wang et al., 2007). In GSS, each gene is firstly scored using average SNP signals, which are individually evaluated via a score statistics from a regression model. Then a pathway signal is scored by the weighted "average" signals from genes, and a normal distribution is assumed to compute the p-value of a pathway. While in MGSEA, the smallest p-value among all SNPs in a gene is chosen to represent the gene signal. Pathways are then scored using weighted Kolmogorov-Smirnov like running sum statistic, and the significance of a pathway is evaluated based on permutation approach. Obviously the GSS method is limited by taking average when large number of noisy SNPs are present.
The article begins with a detailed description of using Fisher's combination statistic to score genes and pathways, and the Satterthwaite approximation of the null distribution of the combined p-values. Section 3 provides real data analysis on two type 2 diabetes (T2D) cohorts data sets. In section 4 we demonstrate the utility of SCC via extensive simulations and compare it to other two-step methods (e.g., ARTP and MGESA) under a variety of scenarios, followed by the discussion in section 5.

COMBINING AND APPROXIMATING
Assume L tests are conducted for the association between a disease trait and L SNPs in a given genetic system, and let p 1 , p 2 , · · · , p L be the p-values for the L individual tests. For example, if we want to test whether a gene (consisting L SNPs) is associated with a disease trait, the gene would be the system and individual SNP is the unit; pathway and gene would be the system and unit respectively when testing the association between a disease trait and a pathway containing L genes. Define W i = −2 log p i .
Under the null hypothesis of no genetic effect, each of the L pvalues is uniformly distributed and W i ∼ χ 2 2 for i = 1, · · · , L. If we assume the L tests are independent, the Fisher's combined statistic Z = L i = 1 W i ∼ χ 2 2L under the global null hypothesis of no genetic effect in the genetic system.
Since units in a genetic system tend to work coordinately, they are more or less correlated. Hence the L p-values are not independent and the chi-square distribution with 2L degrees of freedom (df ) in Fisher's method may not be true. To adaptively account for the combined effect of L correlated χ 2 2 distributions, Satterthwaite's approximation can be introduced, where a scaled chi-square distribution is used to approximate the null distribution of Z, i.e., and the scale parameter a and the df parameter d are chosen by equating the first and second moments of the scaled chi-square distribution with the ones of Z under the null hypothesis, respectively. Under the independence case, the means and variances of the statistic Z under the null are is the correlation coefficient between the log-transformed p-values W i and W j . Solving the moment equations E(Z)=E(aχ 2 d ) = ad and Var(Z)=Var(aχ 2 d ) = 2a 2 d, we haveâ When the L units in a genetic system are completely independent, i.e., ρ ij = 0 for all i = j, the Satterthwaite's approximation (â = 1,d = 2L) is exactly the same as the distribution of the Fisher's combined statistic which assumes independence. When the L units are perfectly dependent in the meaning of ρ ij = 1 for all i = j, the approximation degenerates toâ = L,d = 2. Beyond these extreme cases, the scaled chi-square approximation has much flexibility to account for the correlation structure among the units.

GENE SET SCORING
In the following we describe how to score genetic unit in each level. Assume the pathway of interest consists of K genes, where the kth gene consisting n k SNPs (1 ≤ k ≤ K) and (p k 1 , · · · , p k n k ) are p-values of those SNPs. Let k be the correlation matrix of the transformed p-value vector w k = (w k 1 , · · · , w k n k ), which represents the correlation structure among the n k SNPs within the kth gene. could be obtained. Together with the shape parameter (â k ,d k ) estimated in (2), a gene-based p-value can be obtained. This finishes the signal combination of the first step. Let w gene = (w gene 1 , · · · , w gene K ) be the -2log transformed pvalue vector for the K genes obtained in the first step, and is the corresponding correlation matrix. In the second step, the pathway can be scored as z path = K m = 1 w gene m and the shape parameters (â 2 ,d 2 ) can be estimated in a similar way. Then a p-value of the pathway, which reflects its association strength with a disease trait, can be obtained by p path = P z path >â 2 χ 2 d 2 . The problem remaining is to estimate the correlation matrices among L SNPs as well as among K genes, which is discussed in the next section.

ESTIMATING CORRELATION MATRIX
A permutation (or resampling) procedure is applied here to estimate the correlation matrix of a transformed p-value vector on a gene-or pathway-level. We first obtain the L dimensional transformed p-value vector w (0) = {w (0) 1 , · · · , w (0) L } through single-SNP testing based on the original data. Then we generate B datasets under the null hypothesis of no association between the SNPs and trait, by simply permuting the trait label while keeping the genotype fixed such that the correlation structure among SNPs/genes is not disrupted. Through each of the permuting process, we could obtain a new transformed p-value vector w (b) (b = 1, · · · , B). Then the correlation matrices k , k = 1 · · · , K and can be estimated using the sample correlation matrices from the permuted data. Note that only one single layer of permutations is needed, and moderate permutations are needed to estimate the correlations in k , k = 1 · · · , K and . Typically B = 200 is enough to obtain reasonable estimation of the correlation matices, compared to large number of permutations to obtain small p-values as implemented in the ARTP procedure.

REAL DATA ANALYSIS
A growing number of experimental evidence shows that there is gender effect related to type 2 diabetes (T2D). For instance, individual SNP test results indicate moderately differential signals between male and female population (Wu and Cui, 2013), and there also exists sex difference in the impact of T2D on coronary heart disease risk (Juutilainen et al., 2004). Therefore, we are specifically interested in identifying sex-specific pathways associated with T2D with the hope to gain novel insight into the disease etiology of T2D in different sex groups. We applied four different two-step methods including SCC, ARTP, MGSEA and GSS to two nested case-control cohort T2D datasets, the Nurses' Healthy Study (NHS) and the Health Professional Fellowup Study (HPFS), which are part of the Gene, Environment Association Studies (GENVEA) (Cornelis et al., 2010). Please refer to Hu et al. (2001) and van Dam (2002) for more detailed information about the datasets. The raw datasets originally include 3391 female and 2599 male participants with European ancestry, respectively. Individuals with large proportion of missing SNPs (>10%) or large kinship relationship with others were removed. We also removed SNPs with minor allele frequency (MAF) <0.05 or deviation from Hardy-Weinberg equilibrium (p < 0.001) in controls. The final data set contains 672,105 SNPs in 3371 females (1572 cases and 1799 controls) and 671,669 SNPs in 2494 males (1161 cases and 1333 controls). SNPs that are within 50 kb up-and down-stream of a gene were assigned to the corresponding gene based on Human Genome Build 37.3 . Totally 186 KEGG pathways was retrieved from the Molecular Signatures Database (MSigDB) covering 5015 genes using the Gene Set Enrichment Analysis package (Subramanian et al., 2005). The final data set contains 143,137 SNPs in female and 143,272 SNPs in male.
We conducted separated analysis in the two gender groups, in each of which SNP signals were combined together in two steps to represent the overall contribution of the pathway. Seven covariates (p-value < 0.05) was included into the model, including family history of diabetes among first degree relatives, reported high blood pressure at/before blood draw, reported high blood cholesterol at/before blood draw, total physical activity, body mass index, heme iron intake and glycemic load, to adjust for the covariates' effect when fitting a logistic regression model to test individual SNP effect. An additive gene action model was assumed when coding the SNP effect. The obtained p-values were then applied to the four methods, SCC, ARTP, MGSEA, and GSS following the recommended settings of each method. Figure 1 shows the Manhattan plot of the single SNP signals across the 22 autosomal chromosomes for the female and male groups with the vertical axes represent the −log10 p-values. The dashed line corresponds to the genome-wide Bonferroni threshold. The overall pattern is quite similar in the male and female population with a few clear differences, especially on chromosome 2, 3, 4, 5, 10, and 13. The signals passing the genome-wide Bonferroni threshold located on chromosome 10 in the male group are from gene TCF7L2, one of the genes associated with T2D and being replicated in several GWAS studies (Grant et al., 2006;Sladek et al., 2007). Figure 2 shows the KEGG gene signals when combining SNP signals in a gene. Again the dashed line corresponds to the genome-wide Bonferroni threshold. Genes passing the Bonferroni threshold are IL18RAP (on chr2), BST1 (on chr4), TCF7L2 (on chr10) and PIGQ (on chr16) in the female population, and SNRNP200 and DUSP2 (on chr2), MCM3 (on chr6), ABO (on chr9), TCF7L2 (on chr10) and GYS2 (on chr12) in the male population. Again we observed gender difference in association signals at the gene level. It is worthy to note that although all SNPs in gene TCF7L2 in the female population do not show significance, the combined gene signal reached the genome-wide gene-level significance. This shows the power of signal combination to boost the association signal.
In a short summary, we observed significant gender difference in association signals at both the SNP and gene level. In addition, we found that the gene-level signal was more significant through signal combination.  where multiple causal SNPs with moderate or small effects are not significant individually, the gene-level signal can be potentially boosted through combination. This is one of the major advantages of using combined method. The top gene-level signals using the ARTP method with the recommended parameter settings, were partially similar to the results obtained using the SCC method (see Supplementary Table 1 for details). However, more permutations are needed to achieve small p-values by using the ARTP method which imposes more computation burden. Note that not all SNPs with strong signals in the SNP-level analysis can be mapped to a gene. Thus, it is not surprise why some locations have strong SNP signals but weak gene signals. Table 1 summarizes the significant KEGG pathways in female and male groups after the FDR control with q-value < 0.05 (Storey and Tibshirani, 2003). Total 13 pathways were enriched in female group and 5 were enriched in male group, with 3 in common. There is clear heterogeneity in pathway association between the two sex groups. Since TCF7L2 belongs to several enriched pathways (3 in female group, 5 in male group) and is widely recognized as a gene conferring risk of T2D, we conducted the same analysis but deleting this gene in all pathways. Figure 3 shows the pathway-level signals across 186 KEGG pathways with and without TCF7L2, in female and male groups, respectively, where the solid line represents the FDR threshold. It can be seen from Figure 3 that there is no significant change between pathway signals with and without gene TCF7L2 in female group, while the strong signals in male group are almost vanished after deleting the gene. This observation suggests potential difference in T2D

Frontiers in Genetics | Statistical Genetics and Methodology
November 2014 | Volume 5 | Article 395 | 4  etiology in the pathway level in each gender group. The significance of the pathways in the male group is largely dominated by gene TCF7L2. The other three methods, ARTP, MGSEA, and GSS, all failed to detect any significant pathways after the FDR control (see Supplementary Tables 2-4). GSS gives p-values all greater than 0.08 for the 186 pathways. This is not surprising since GSS tends to lose power due to averaging when large number of noise variants are included (Schaid et al., 2012). The method could be benefitted by a SNP pre-selection (Wu and Cui, 2014). It should be noted that KEGG type II diabetes mellitus pathway, which is a manually annotated gene set, is only identified using the SCC method. This indicates the robustness of the SCC method.

SIMULATION STUDY
To evaluate the statistical performance of different combination approaches, extensive Monte Carlo simulations were conducted. Following the simulation design in Biernacka et al. (2011), we selected SNPs in KEGG pathway matu-rity_onset_diabetes_of_the_young which contains 25 genes, to simulate genotypes, based on a total of 5961 observed individuals. The 25 genes covering 599 SNPs were first mapped to 14 chromosomes and then phased using software fastPHASE (Scheet and Stephens, 2006). In each of the simulation replicate, 6000 haplotypes were simulated using the hapsim library in R (Montana, 2005) based on the haplotype frequencies obtained in the first step. Then pairs of hyplotypes were randomly assigned to 3000 individuals, where the disease status of the ith subject Y i was generated through a Bernoulli distribution, i.e., Y i ∼ Ber(p i ) with logit(p i |G i ) = α(G i ), and G i is the genotype vector whose lth component G il was coded as 0, 1, or 2 according to the counts of minor alleles at the lth SNP (1 ≤ l ≤ 599). Based on the 3000 subjects, equal number (m) of cases and controls were randomly picked to form the case-control data (m = 500, 1000) for each simulation replicate. The type I error (false positive rate) and power were evaluated based on the 0.05 significance level and 1000 simulation replicates.
Under the null hypothesis of no genetic association between phenotype and pathway, we let α(G i ) = 0, i.e., logit(p i ) = 0.5, which leads to case:control=1:1. Under the alternative hypothesis, three scenarios were considered based on the following model, where β is a vector with β l denoting the marginal effect of the lth SNP in the pathway, and is a matrix whose (s, t) entry st represents the interaction between the sth and tth SNP. The three scenarios, which were described in Figure 4 with detailed information listed in Table 2, correspond to different gene actions. Scenario A considers the case in which there is only one moderate marginal effect (odds ratio = 1.4) in each of the five genes. Scenario B assumes there are several small effects (odds ratio = 1.1) in each of the five genes. Scenario C assumes small effects as well as weak interactions within a gene and between genes. In the simulation, only five genes out of 25 were assumed to be associated with the trait. The causal SNPs were fixed in each simulation replicate. The MAF information of the causal SNPs calculated from the real data were given in Supplementary Table 5. Table 3 summarizes the power and the type I error at the gene level combination using the ARTP and SCC methods. The type I error for MGSEA was rendered to the pathway level analysis. In the table, the results obtained with the SCC method are listed in the parenthesis. The ones with higher power are highlighted with FIGURE 4 | Three simulation scenarios. OR, odds ratio; IWG, interactions within a gene; IBG, interactions between genes. β l = log OR for isolated nodes (SNPs), and st = log OR for any two nodes connected with edge.  Table 4 shows the power and type I error at the pathway level analysis. The type I error rates are reasonably controlled in all cases. Compared to SCC, ARTP and MGSEA achieve better performance under scenario A. However, under scenario B and C where the joint effect of several SNPs with small marginal effect and/or interactions, SCC is more powerful than ARTP and MGSEA in testing pathway association. Among the three, MGSEA performs the worst. Although SCC is less powerful than the other two under small sample size (2m = 1000), it catches up quickly when sample size increases (2m = 2000). The simulation results indicate the power gain of the SCC method, in particular under scenarios C when interactions between SNPs are present. To further distinguish the pathway-level power difference under Scenarios A and B with case:control = 1000:1000, as per one reviewer suggestion, we did more simulation by reducing the effect size (i.e., OR). Under Scenario A, ARTP has better power while SCC is quite close to it. Under Scenario B, SCC clearly dominates the other two (see Supplementary Table 6).

DISCUSSION
In this work, we proposed a pathway-based association method by combining single SNP p-values in two steps to identify pathways (or gene-sets) associated with a disease trait. Although our model was applied and simulated based on the binary disease trait, it can be utilized for other data types such as quantitative or count trait. The strategy also allows to adjust for the effects of covariates as well as gene-environment interactions. Comparing to other signal combination methods, our methods has the following advantages: (1) Unlike other methods using hard truncation, our method avoids the risk of throwing any important but marginally weak signals while accounting for the correlation structure among the combined signals. For the MGSEA method, a gene signal is represented by the strongest SNP signal within the gene, thus it suffers from power loss by dropping important SNP signals. For GSS, the averaging strategy could lose power when large number of noise SNPs are presented in a set. Although the performance of ARTP might be improved by doing large permutations (e.g., >10,000), it is computational intensive and furthermore, there is no unified criterion on the choice of truncation rank upper bound T. The computational load of ARTP would be further enhanced if T is large (e.g., causal SNPs/genes are dense in gene/pathway); (2) The scaled chi-square approximation can achieve small p-values which is computationally burdensome for other permutation based methods such as ARTP. Hence, SCC is computationally efficient and small p-values can be achieved without large scale permutations; (3) As revealed by simulation studies (Table 4), SCC exhibits great power than other two methods when there exist interactions between SNPs in a gene set.
We applied our method to two cohort studies of type 2 diabetes with pathways defined in KEGG databases. Several sex-specific patterns were observed. Firstly, both the gene-and pathwaylevel scan results showed that there was not much signal overlap between the female and male population, as also revealed by applying other tools (e.g., ARTP, MGSEA, and GSS). Secondly, more pathways were enriched in female population comparing to male population. SCC exhibited greater performance in  detecting associated pathways, especially for the curated KEGG T2D pathway type II diabetes mellitus which cannot be detected by other methods such as ARTP, MGSEA, and GSS. In addition, gene TCF7L2 in female population also shows significance after p-value combination, while no single SNP in this gene shows significance. This clearly demonstrates the power of signal combination to identify SNPs that function jointly but could be missed by marginal analysis. We also did the QQ plot of the p-values (results not shown). The gene-based p-value QQ plots showed a reasonable pattern at different significance levels. But for the pathway-based p-value QQ plots, there is a moderate departure from the expected. This is due to the fact of dependent p-values since many pathways share common genes and the pathway-based p-values are not completely independent.
Our simulation studies showed reasonable control of false positive rate of the method, on either gene-or pathway-level combination. Simulation studies also revealed that the optimal choice of combination methods depends on the underlying true disease model. As what we expected, under scenario A where only one SNP with moderate effect functions in each of the five genes, ARTP and MGSEA were more powerful. This is because the default T is 1 for SNP-level and 10 for genelevel in ARTP, and MGSEA only picks the most significant signal at each level. Under scenario B and C where there exist several SNPs with small marginal effects and interactions contributing to a disease risk, SCC obtained better performance, benefitted from no truncation (i.e., no information dropping). In reality, the true model is generally unknown although scenarios B and C are more likely. We suggest users to apply different methods and pool top signals for further functional validation.

ACKNOWLEDGMENTS
This work was partially supported by grants from NSF (MCP-1121650 and DMS-1209112) and from National Natural Science Foundation of China (31371336). Funding support for the GWAS of Gene and Environment Initiatives in Type 2 Diabetes was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004399). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_ id=phs000091.v2.p1, through dbGaP accession number phs000091.v2.p1.