A modified generalized Fisher method for combining probabilities from dependent tests

Rapid developments in molecular technology have yielded a large amount of high throughput genetic data to understand the mechanism for complex traits. The increase of genetic variants requires hundreds and thousands of statistical tests to be performed simultaneously in analysis, which poses a challenge to control the overall Type I error rate. Combining p-values from multiple hypothesis testing has shown promise for aggregating effects in high-dimensional genetic data analysis. Several p-value combining methods have been developed and applied to genetic data; see Dai et al. (2012b) for a comprehensive review. However, there is a lack of investigations conducted for dependent genetic data, especially for weighted p-value combining methods. Single nucleotide polymorphisms (SNPs) are often correlated due to linkage disequilibrium (LD). Other genetic data, including variants from next generation sequencing, gene expression levels measured by microarray, protein and DNA methylation data, etc. also contain complex correlation structures. Ignoring correlation structures among genetic variants may lead to severe inflation of Type I error rates for omnibus testing of p-values. In this work, we propose modifications to the Lancaster procedure by taking the correlation structure among p-values into account. The weight function in the Lancaster procedure allows meaningful biological information to be incorporated into the statistical analysis, which can increase the power of the statistical testing and/or remove the bias in the process. Extensive empirical assessments demonstrate that the modified Lancaster procedure largely reduces the Type I error rates due to correlation among p-values, and retains considerable power to detect signals among p-values. We applied our method to reassess published renal transplant data, and identified a novel association between B cell pathways and allograft tolerance.


INTRODUCTION
Rapid developments in molecular technology have created high throughput data in search of genetic variants associated with complex traits. As the cost of experiments goes down, the amount of data that can be generated, and the resulting complexity of statistical analysis required to interpret the data goes up. The increase of genetic variants requires more statistical testing to be performed simultaneously, which poses a challenge to control the genome wide Type I error rate. False discovery rate (FDR) and its extended methods have been proposed to adjust p-values in multiple tests in order to control the genome wide Type I error (Benjamini and Hochberg, 1995;Cheng and Pounds, 2007). However, in large-scale hypothesis testing, these methods often require very a large sample size to maintain power of detecting risk factors.
The global test (also named omnibus test) of p-values can combine evidence and turn dimensionality from a curse into rich information. From a systems biology perspective, genes, cells, tissues, and organs function as a system through metabolic networks and cell signaling networks. In non-Mendelian inheritance patterns, such as complex disorders, a subset of genetic variants may jointly confer moderate effects in mediating molecular activities. As a result, signals may not be significant in single marker-single trait analysis, but many such values from related genes might provide valuable information on gene function and regulation. For instance, in pathway analysis (Khatri et al., 2012) and gene set enrichment analysis (Subramanian et al., 2005), multiple genes that work together to serve a particular biological function are often analyzed jointly as a gene set. Several pathway repositories, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2004), PANTHER classification system for protein sequence data (Nikolsky and Bryant, 2009), and Reactome pathways in humans (Matthews et al., 2009) have been established, and are continually being updated. For non-Mendelian diseases and complex traits, identification of isolated genetic variants is insufficient to summarize the complex association with disease. The "most-significant SNPs/genes" approach often detects variants with small effect sizes and odds ratios ranging between 1.3 and 2 (Wacholder et al., 2004). Therefore, integrating information from pathways, gene sets, and networks will provide useful information in understanding the gene regulation mechanism. Furthermore, filtration techniques can be integrated with global testing of p-values to remove sets of genetic variants that are not related to traits, and thereby reduce the dimensionality of the data (Dai and Charnigo, 2008;Dai et al., 2012a).
The global test of p-values evaluates the pattern (distribution) of p-values instead of selecting p-values less than an arbitrary threshold. Therefore, this method has the potential to identify multiple genes with small effects. If we assume that all individual tests are independent and arise from genetic variants with no effects, then p-values are identically and independently distributed as Uniform (0, 1). Taking this as a null hypothesis for the pattern of p-values in the global test, one can assess whether p-values, especially small p-values, are generated by chance. The global test of p-values is robust and can be applied to p-values from varying statistical models including t-tests, analysis of variance (ANOVA), linear mixed models, and so forth. Multiple simulation studies and case studies have demonstrated that this approach usually has sufficient power to detect signals of genetic association from a group of genes. For instance, Peng et al. (2010) has assessed Fisher's combination test and Sidak's combination test, Sime's combination test and the FDR method using 13 published genome wide association studies (GWAS), and the results indicate that combined p-value approaches can identify biologically meaningful pathways associated with the disease susceptibility. A review of methods of global test of p-values, developmental trends and their application to genetic data analysis has been presented by (Dai et al., 2012b).
One category of global tests of p-values involves combining p-values in the form of i H(p i ), where p-values might first be transformed by a function H. So far, several statistical methods have been developed to combine p-values. Let p i (i = 1, 2, . . . , n) be independent p-values obtained from n hypothesis tests. Under the null hypothesis (H 0 ) that p-values follow a Uniform (0, 1) distribution, Fisher (1932) shows that −2 n i = 1 ln(p i ) follows a chi-square distribution with 2n degrees of freedom. For a one sided test with a nominal error rate of α, one can reject the null hypothesis when the test statistics exceeds the (1 − α) * 100% percentile of χ 2 2n . Stouffer (Stouffer et al., 1949) proposed a z-test by transforming p-values to standard normal variables, i.e., n , where −1 is the inverse Cumulative Distribution Function (CDF) for N(0, 1). Under the null hypothesis, the z-test statistic follows N(0, 1).
Although there is no consensus regarding the most powerful method of combining p-values, Folks (1971, 1973) demonstrated that the Fisher's method of combining independent tests is asymptotically Bahadur efficient (Bahadur, 1967). Subsequently, weighting schemes have been incorporated into the Fisher's method and the z-test. Lancaster (1961) generalized the Fisher method by converting independent p-values to chi-square variables with w i degrees of freedom and he is the inverse CDF of Gamma distribution. Mosteller and Bush (1954) proposed a weighted z-test, i w i −1 (1 − p i )/ i w 2 i , which follows N(0, 1) under H 0 . In a separate paper, we have proved that the Lancaster procedure achieves the optimal Bahadur efficiency. We further demonstrated that the Lancaster procedure yields higher Bahadur efficiency than the weighted z-test. The Bahadur efficiency ratio gives the limiting ratio of sample sizes required by two statistics to attain an equally small significance level. Thus, Bahadur efficiency is an important method to compare test statistics. From the perspective of Bahadur efficiency, the Lancaster procedure asymptotically requires a relatively smaller sample size than other weighted p-value combining methods. This prompted us to focus on modification of the Lancaster procedure for correlated genetic data in this work.
Although the Fisher's method and Lancaster procedure both achieve the optimal Bahadur efficiency, the Lancaster procedure is more general and can be viewed as a generalized Fisher's method with weighting functions. There are three advantages to carefully select appropriate weight functions in genetic data analysis. Firstly, weight functions allow incorporation of prior biological information. Genetic data are complex and can be measured from different sources. Thus, weight functions can be used as a tool to incorporate meaningful information from different sources in order to interpret and derive biological insight from gene expression profiles. (Wu and Lin, 2009) provides a review of statistical methods for analysis of microarray data by incorporating prior biological knowledge using gene sets and biological pathways, which consist of groups of biologically similar genes. They show that the use of prior knowledge has led to a better understanding of the biological mechanisms underlying phenotypic responses. Secondly, weight functions can be used to remove bias. For instance, larger genes may contain more probes and/or SNPs. Therefore, larger genes will exert a stronger influence on the p-value combining methods as compared to smaller genes (Wang et al., 2007). To avoid this bias, one can consider a weight function to adjust for gene size when combining p-values. We will illustrate this approach in sections Empirical Assessments and Case Study: Renal Transplant Tolerance Data. Thirdly, as suggested by Benjamini and Hochberg (1997), Genovese et al. (2006), procedures that assign weights positively associated with the underlying alternative hypotheses will usually improve power. Therefore, one needs to carefully choose an appropriate weight function, either based on the biological knowledge, or by statistical hypotheses. An arbitrary weight is inappropriate for the Lancaster procedure.
In this work, we will provide modifications to the Lancaster procedure to accommodate correlation structures among p-values. The proposed method provides a generalization to the Fisher's method with a weight function and can be used in pathway analysis and gene sets enrichment analysis for a variety of genetic data including microarray gene expression data, GWAS data, and next generation sequencing data. In essence, investigators first dissect genetic variants by biological functions or prior knowledge, then combine the p-values from these gene sets to identify whether a proportion of genetic variants are associated with traits.

CORRELATED LANCASTER PROCEDURES
In this section, we allow p-values to be correlated. Consider a Lancaster test statistic is the inverse CDF of Gamma distribution with a shape parameter w i /2 and a scale parameter 2. This transformation converts p i ∼ Uniform(0, 1) to a chi-square distribution, i.e., γ −1 The parameter w i serves as a weight function to adjust the individual p-values. When p-values are independent, T has an exact chi-square distribution with n i = 1 w i degrees of freedom.
For correlated p-values, The distribution of T does not have an explicit analytical form. To address this issue, we consider a Satterthwaite approximation by approximating a scaled T statistic with a new chi-square distribution (Li et al., 2011). Let cT ≈ χ 2 v where c > 0 is a scalar and v > 0 is the degree of freedom for the approximated chi-square distribution. Note that We propose the following five approaches to approximate the distribution of T. In approximation (A), we use the Satterthwaite method to match the mean and variance of cT and χ 2 v , and then solve the equations to derive c and v. Koziol (1996) have proposed multiple methods to approximate the Lancaster procedure, but these approximations require the assumption of independence. In approximation (B)-(E), we extend the work of Koziol (1996) to correlated data by first approximating cT with χ 2 v then approximating χ 2 v using varying methods.
• T A approximation. Correlation among p-values is taken into consideration, and then Satterthwaite's approximation is used (Patnaik, 1949) to derive new degrees of freedom: • T B approximation. cT is first approximated by χ 2 v , followed by Fisher's approximation (Fisher, 1922) to χ 2 v : • T c approximation. After approximating cT by χ 2 v , the Wilson-Hilferty approximation is performed (Wilson and Hilferty, 1931) • T D approximation.
Approximate cT by χ 2 v , followed by the Cornish-Fisher expansion (Fisher and Cornish, 1960) • T E approximation.
When the covariance ρ ij is unknown, one can use the permutation approach to estimate ρ ij by shuffling the phenotype variable among subjects. For the kth permutation (k = 1, 2, . . . , m), we keep the genetic variants within the subject to preserve the correlation structure, then randomly assign the phenotype variable to subjects. Individual hypothesis testing can be done on all n genetic variants separately to generate the p-value vector The permutation is repeated m = 1000 times, and ρ ij is estimated from (p 1 , p 2 , . . . p m ).
The accuracy of the five approximate distributions to the correlated Lancaster procedure is then assessed using p-values with varying correlation structures. We consider six different types of correlation structures, including fixed and random compound symmetric as well as random positive definite variancecovariance structures for . Let I be an identity matrix, 1 be a vector of 1 s, ⊗ be the Kronecker product, and superscript t www.frontiersin.org February 2014 | Volume 5 | Article 32 | 3 be the transposition. In Cases I-V, let = Block ⊗ I 20 be compound symmetric variance matrices with 20 blocks of size 5 where Block = 1 5 1 t 5 ρ + (1 − ρ)I 5 . We vary ρ over two fixed values with ρ = 0.3 for moderate dependence and ρ = 0.6 for strong dependence. In addition, we simulate random correlation coefficients from beta and uniform distributions, i.e., ρ ∼ β(0.3, 1.5) and ρ ∼ uniform(−0.2, 0.2), which ensures that 20 variance blocks have distinct correlation coefficients ρ within . More generally, we consider random positive definite correlation matrices that vary across samples and simulation runs.
The quantile-quantile (Q-Q) plot assessing the accuracy of the proposed methods when the correlation coefficient ρ = 0.3 is shown in Figure 1. For clarity, the Lancaster statistic T that combines n p-values is renamed as T Lancaster

EMPIRICAL ASSESSMENTS
We assess the Type I error rates and power for the proposed correlated Lancaster procedures and compare them to the independent Lancaster procedure (Lancaster, 1961). SNPs from a pathway of haploid GWAS are simulated using linkage disequilibrium (LD) (Li et al., 2011). Let q 1 and q 2 be the minor allele frequencies (MAFs) at loci 1 and 2. Assuming Hardy-Weinberg equilibrium, the genotype at locus 1 can be randomly generated using a binomial distribution. Given the distribution of SNP at locus 1, one can simulate the genotype at locus 2. To do so, let D be a measure of LD. Then the conditional probability for the genotype at locus 2 given the genotype at locus 1 can be expressed as A and B represent the minor alleles at the two loci. For a diploid genome, similar idea can be applied and the simulation details can be found at Cui et al. (2008). We simulate a pathway with 5 genes with varying numbers of SNPs in each gene listed in parenthesis i.e., G1(12), G2(8), G3(5), G4(3), G5(2). The MAF of each SNP was set to be 0.3. We simulate different levels of LD for SNPs from
• Case V: ln(p/(1 − p)) = β 1 G 3, 1 + β 2 G 3, 3 + β 3 G 4, 2 + β 4 G 3, 2 G 3,4 + β 5 G 4, 3 G 5, 1 . • Case VI: ln(p/(1 − p)) = β 1 G 1, 2 + β 2 G 2, 2 + β 3 G 3, 3 + β 4 G 5, 2 + β 5 G 1, 5 G 1,7 + β 6 G 3, 3 G 5, 1 . Weight functions can be used to remove potential bias when combining p-values. Wang et al. (2007) and others have noted that larger genes contain more probes and/or SNPs. Therefore, larger genes may exert a stronger influence on the p-value combining methods compared to smaller genes. To avoid this bias, we set the weight function w i = 2/ √ n i where n i is the number of SNPs in the ith gene. When n i = 1, We simulate data with sample sizes n = 200 (Tables 1, 4) and n = 400 (Tables 2, 3), respectively. For simplicity, we assume the same effect size for all of the regression coefficients. For each set of data, we perform the original and modified Lancaster procedures to assess the pathway data by combining p-values from individual tests. We set nominal error rate to be 0.05. The simulation is repeated 1000 times.
Due to LD, SNPs from the same gene are correlated. We first assess the Type I error rate of the test statistics by testing H 0 : The power of all test statistics was compared for regression coefficient values set at β = 0.4 and β = 0.6, respectively. The results in Tables 1, 2 suggest strong and comparable power among the modified Lancaster procedures. In most simulated cases, the proposed methods have more than 80% power to detect β = 0.4. When the effect size increases to β = 0.6, the power of proposed methods increases to 90% or above. Also the power of these tests improves as sample size increases from n = 200 to n = 400.
We simulate different levels of LD for SNPs with D = 0, 1.5, 2, and uniform(0, maximum of LD). To save the space, we only show the results for D = 1.5 ( Table 3) and D = 2 (Tables 1, 2). Our findings show that the inflation of Type I error rate for the original Lancaster procedure gets severe when LD is strong (Tables 1, 2). The modified Lancaster procedures (T A − T E ) have well-controlled Type I error rates and power for both moderate and strong LD (Tables 1-3).
In Table 4, we assess the performance of all tests without a weighting function. We then compare the results in Table 4 (without a weight function) vs. Table 1 (with a weight function). All other simulation parameters are held the same in Tables 1, 4. We note that the original Lancaster procedure without a weighting function (Table 4) tends to have higher Type I error rates than the original Lancaster procedure with a weighting function ( Table 1). For modified tests (T A − T E ), the power is increased when a weighting function is used. This confirms that an appropriate weight function is beneficial to the Lancaster procedure.

CASE STUDY: RENAL TRANSPLANT TOLERANCE DATA
We revisited a kidney transplant data first collected and analyzed by Newell et al. (2010). Data were downloaded from the GEO website with ID = GDS4266 (http://www.ncbi.nlm.nih. gov/sites/GDSbrowser?acc=GDS4266). A group of tolerant renal transplant recipients (Tolerant,n = 19), as defined by stable graft function in the absence of immunosuppression for more than 1 year, were compared to subjects with stable graft function who were receiving standard immunotherapy (SI, n = 27) as well as to a group of healthy controls (Control, n = 12). Gene expression profiles of whole-blood total RNA from all subjects were measured by microarray. The goal of the study was to identify genetic variants associated with long-term allograft survival without the requirement for continuous immunosuppression, a condition known as allograft tolerance. Newell et al. (2010) performed statistical analysis to identify differentially expressed genes between the SI group and the Tolerant group. The results revealed a critical role for B cells in regulating alloimmunity, and provided a candidate set of genes for wider-scale screening of renal transplant recipients. However, no comprehensive pathway analysis was conducted by this group (Newell et al., 2010).
To further understand molecular mechanisms underlying renal allograft tolerance, we have applied the modified Lancaster procedure to this dataset to identify candidate cellular pathways. Gene expression levels were normalized using Robust Multichip Average (rma) preprocessing methodology, which included background subtraction, quantile normalization, and summarization via median-polish.
Gene expression levels were summarized for a total of 54,675 probes from 21,049 genes. Expression levels were compared among three groups using the Bioconductor "Limma" package. Three pair wise comparisons were conducted, including: SI vs. Control, SI vs. Tolerant, and Tolerant vs. Control. Then three comparisons were combined into one F-test. This is equivalent to a One-Way ANOVA for each gene except that the residual mean squares have been moderated across genes. P-values from multiple hypothesis testing were adjusted by FDR (Benjamini and Hochberg, 1995). Our results of differentially expressed genes are consistent with the previous published work. See Newell et al. (2010) for the gene analysis findings.
Although (Newell et al., 2010) identified a set of differentially expressed genes, our analysis demonstrates that these significant genes have small effect sizes with fold changes <1.5. Therefore, a limited number of individual genes in the absence of a biological context is inadequate to explain the total variation of allograft tolerance among renal transplant patients.
To address this issue, we performed the modified Lancaster procedure (T A ) as described in Section Correlated Lancaster Procedures to combine p-values from pathways. Combining p-values allows us to integrate small effects in pathway and gain the power of statistical testing. A total of 1454 Gene Ontology human pathway gene sets were analyzed. The size of pathways ranged from 9 genes to 2131 genes, with a median of 27 genes per pathway. Also, the number of probes per gene was highly variable. In order to map genes to pathways, we removed genes without gene symbols from the analysis. Among 21,049 genes with gene symbols, approximately 48% (n = 10161) of genes were interrogated with a single probe, 26% (n = 5389) of genes were queried using 2 probes, 14% (n = 2842) of genes were assessed by 3 probes. There were 3 or more probes for each on the remaining genes (range: 4-17). This finding indicates that larger genes would have more p-values and a stronger impact to pathway analysis. To prevent this bias, we set the weight function as w i = 2/ √ n i where n i is the number of probes for the ith gene.
We performed pathway analysis for the One-Way ANOVA test and three pair wise comparisons. The top 10 significant pathways based on the One-Way ANOVA test are listed in Table 5. The top two pathways, B cell differentiation (GO:0030183) and B cell activation (GO:0042113), confirm the signature of B cell involvement described by Newell et al. (2010). Furthermore, we identified other pathways related to B cell activation and function. These include antigen binding (GO:0003823), map kinase kinase kinase activity (GO:0004709) and lymphocyte differentiation (GO:0030098). These pathways are biologically consistent with the proposed role of B-lymphocytes in renal transplant tolerance reported by Newell et al. In contrast, when we performed the traditional Fisher's method without considering correlation structures (LD) within pathways or applying a weighting function to compensate for variability in the number of probes per gene, the result was a list of larger pathways, some containing >1000 genes, describing more general cellular processes and not specifically related to immune functions (See Table 6, #gene and

GO accession Pathway name
Gene ontology URL # Gene # Probes Adjusted #probe). Furthermore, when comparing the SI group and the Control group, the traditional method identified 1078 significant pathways while our proposed method narrowed the list down to 64 significant pathways (adjusted p-value <0.05). The increase in number of significant pathways identified by the traditional approach is primarily due to false positive discovery, and is consistent with the inflation of Type I error rate as presented in Section Empirical Assessments. Thus, by accounting for correlation structures (LD) within pathways and the number of probes per gene, our proposed method minimized identification of larger, nonspecific cellular processes pathways, and instead revealed more focused and functionally relevant biological pathways implicating a role for a humoral immune response in immunotolerance to renal transplants (See Table 5, #gene and #probe).

DISCUSSION AND CONCLUSIONS
Modifications to the Lancaster procedure are proposed to take correlations among p-values into account. Extensive simulation studies show that the original Lancaster procedure has inflated Type I error rates due to correlation among p-values. By using permutation approach to estimate the correlation among p-values, the proposed methods have well-controlled Type I error rates and maintain strong power to detect signals related to SNPs in pathways. Among five proposed approximation methods (T A , . . . , T E ), the Satterthwaite approximation (T A ) is the most computationally efficient. Other approximation methods (T B , . . . , T E ) are based on the Satterthwaite approximation. Therefore, we recommend using the Satterthwaite approximation (T A ) as the standard procedure to modify the Lancaster procedure. Among other approximation methods, simulation results in Section Correlated Lancaster Procedures show that, for data with stronger internal correlation, T D and T E have better approximation than T B and T C . Our simulation study and the case study further provide evidence that T D tends to have slightly higher power than the Satterthwaite approximation T A . The R code for five approximation is posted at http://d.web.umkc.edu/daih/.