Pathway-Wide Association Study Implicates Multiple Sterol Transport and Metabolism Genes in HDL Cholesterol Regulation

Pathway-based association methods have been proposed to be an effective approach in identifying disease genes, when single-marker association tests do not have sufficient power. The analysis of quantitative traits may be benefited from these approaches, by sampling from two extreme tails of the distribution. Here we tested a pathway association approach on a small genome-wide association study (GWAS) on 653 subjects with extremely high high-density lipoprotein cholesterol (HDL-C) levels and 784 subjects with low HDL-C levels. We identified 102 genes in the sterol transport and metabolism pathways that collectively associate with HDL-C levels, and replicated these association signals in an independent GWAS. Interestingly, the pathways include 18 genes implicated in previous GWAS on lipid traits, suggesting that genuine HDL-C genes are highly enriched in these pathways. Additionally, multiple biologically relevant loci in the pathways were not detected by previous GWAS, including genes implicated in previous candidate gene association studies (such as LEPR, APOA2, HDLBP, SOAT2), genes that cause Mendelian forms of lipid disorders (such as DHCR24), and genes expressing dyslipidemia phenotypes in knockout mice (such as SOAT1, PON1). Our study suggests that sampling from two extreme tails of a quantitative trait and examining genetic pathways may yield biological insights from smaller samples than are generally required using single-marker analysis in large-scale GWAS. Our results also implicate that functionally related genes work together to regulate complex quantitative traits, and that future large-scale studies may benefit from pathway-association approaches to identify novel pathways regulating HDL-C levels.


INTRODUCTION
Genome-wide association studies (GWAS) have been very successful in identifying genes or loci that are associated with complex diseases or quantitative traits. However, for most GWAS, the collection of known loci explains only a minor fraction of disease heritability (Maher, 2008;Manolio et al., 2009). Since most GWAS are not powered to detect the majority of genuinely associated loci, researchers could expand sample size indefinitely in an attempt to find more associated loci; however, this approach has its limitations as the GWAS are likely to have diminishing returns. That is, after the first few loci (presumably more readily detectable with larger effect sizes) have been identified and replicated, the remaining loci will be more and more difficult to find as they will have much smaller effect sizes (Goldstein, 2009). An alternative approach to detect trait-associated loci in GWAS, which can provide complementary information to single-marker analysis, is a pathway-based approach, which examines whether test statistics for related genes in the same pathway are consistently associated with a disease trait under study (Wang et al., 2010).
There have been a few published examples demonstrating the power and effectiveness of pathway-wide association approaches for several diseases. We have previously developed a pathway association approach (Wang et al., 2007), by adopting the Gene Set Enrichment Analysis (Subramanian et al., 2005) strategies, but with appropriate modifications to account for the linkage disequilibrium between SNP markers, the different sizes of genes and www.frontiersin.org pathways. This approach has been successfully applied to uncover association between the IL12/IL23 pathway and Crohn's disease (Wang et al., 2009a), the involvement of cell adhesion molecules in autism spectrum disorders (Wang et al., 2009b), as well as the WNT-signaling pathway in Type 2 Diabetes (T2D) (Perry et al., 2009). Additionally, Lesnick et al. (2007) have demonstrated that SNPs in the candidate pathway (axon guidance) collectively predisposed to a Parkinson's disease, even though no individual SNP reaches genome-wide significance level. Dinu et al. (2007) applied a pathway approach and discovered that multiple complement factor pathway genes, in additional to complement factor H (a well known susceptibility locus), are associated with risk for developing age-related macular degeneration (AMD). Torkamani et al. (2008) applied a pathway-based approach to analyze the Wellcome Trust Case Control Consortium (WTCCC) data sets, and discovered numerous pathways implicated in disease predisposition, many of which have long been assumed to contain polymorphic genes that lead to disease predisposition. Askland et al. (2009) analyzed GWAS data sets on bipolar disorder using a pathway association approach, and implicated genes mediating ion channel activity and synaptic neurotransmission. Holmans et al. (2009) also described a pathway approach and applied it on meta-analysis of bipolar disorder, and implicated a collection of genes involved in the modulation of transcription and cellular activity. Furthermore, several recent studies have focused on incorporating biological network structure into the analysis of GWAS (Pan, 2008;Baranzini et al., 2009). Collectively, all these studies show that pathway-based approaches can be applied to GWAS data sets to gain insights into disease susceptibility mechanisms, and to provide enhanced or complementary information to the conventional single-marker analysis.
Unlike previous disease studies, our current study focused on a quantitative trait, plasma levels of high-density lipoprotein cholesterol (HDL-C), to examine whether pathway-based approaches can be effective in enriching a group of truly associated genes. HDL-C, together with low-density lipoprotein cholesterol (LDL-C) and triglyceride (TG), are three major lipid phenotypes in the human serum. Lipid levels are heritable traits that are strongly associated with cardiovascular risk, therefore they are widely measured in clinical practice. Lipid measures represent excellent secondary phenotypes for genetic investigation by many GWAS, and relatively large sample sizes can be obtained to study these traits. Three large-scale GWAS have been reported to identify genetic loci associated with lipid traits: Kathiresan et al. (2009) conducted GWAS screens in 19,840 individuals and replication in up to 20,623 individuals, and identified 30 loci (36 trait-loci associations) controlling HDL-C, LDL-C, or TG levels. Sabatti et al. (2009) have assayed a birth cohort from a founder population with ∼4,500 subjects, and identified 17 trait-loci associations for lipid levels. Aulchenko et al. (2009) analyzed 16 European population cohorts with ∼20,000 subjects and identified 22 trait-loci associations on lipid traits. More recently, Teslovich et al. (2010) have published an even larger meta-analysis on lipid traits involving over 100,000 subjects. Despite these early successes, a more complete ensemble of lipids genes is needed for better understanding of the gene networks involved, and the current successes probably merely represent the start of a long road (Edmondson and Rader, 2008;Holleboom et al., 2008) for the following reasons: First, among genes that cause Mendelian disorders of lipid traits, only a subset of them have been identified in GWAS (Edmondson and Rader, 2008). Second, multiple candidate genes believed to affect HDL-C levels based on preclinical models have not yet been identified in GWAS for HDL-C. Finally, the collection of HDL-C genes identified in GWAS only explains a minor fraction of inter-individual variability in lipoprotein levels . Therefore, it is likely that many more genes regulating HDL-C levels remain to be discovered.
In this study, we applied a pathway-based association approach on a GWAS for the purpose of identifying genes regulating HDL-C levels. The study is distinct from previous GWAS on lipids in that it was originally designed and ascertained as a case/control study, so the study participants have either extremely high or low HDL-C levels but without syndromic dyslipidemia features. Although the initial single-marker association analysis identified only one well known locus that significantly associates with HDL-C, the pathway-based association approach resulted in ∼100 loci that potentially associate with HDL-C, including multiple previously known lipid loci. We further validated some of the novel findings through literature survey of previously known Mendelian forms of lipid disorders, previously published candidate gene association studies, as well as the examination of metabolic phenotypes of gene knockout mice maintained by the Jackson laboratory. Our study has significant implications for the study of quantitative traits and the analysis of GWAS data: First, it demonstrated that a modest case/control data set sampling from two extreme tails of a quantitative trait can be very useful in identifying genuinely associated genes. Second, it illustrated that pathway-wide association approaches could provide biological insights that are not readily available from single-marker analysis on orders of magnitude larger sample sizes.

Discovery cohort
Cases and control subjects were ascertained for the purpose of identifying genetic variation regulating HDL-C levels. All study participants were self-declared to be of European ancestry, and were genotyped on the Affymetrix Genome-wide Human SNP 6.0 arrays. The "cases" were defined as persons with HDL-C levels >94th percentile for age and gender. The control subjects were defined as persons with HDL-C <25th percentile for age and gender. Based on the whole-genome genotyping data, we performed multi-dimensional scaling analysis on a set of independent SNP markers ( Figure S1 in Supplementary Material), to ensure that only subjects of genetically inferred European ancestry were used in analysis. We also removed subjects with call rates of SNPs less than 95%. Additionally, we have removed pairs of subjects with cryptic relatedness, as defined by identity-by-descent score higher than 0.2. The final data set for association analysis included a total of 653 cases and 784 control subjects with 681,050 SNP markers. The genomic control inflation factor was 1.057, indicating that cases and control subjects were well matched and that no systematic batch effects were present.

Replication cohort
The replication cohort was compiled from PennCATH and Med-Star, two ongoing GWAS on coronary artery disease, which were previously described in detail (Myocardial Infarction Genetics . The PennCATH study is a University of Pennsylvania Medical Center based angiographic study of over 3,800 subjects that has been used previously for replication of novel genes and risk factors for atherosclerotic cardiovascular disease and T2D (Helgadottir et al., 2006;Steinthorsdottir et al., 2007). The MedStar study is a Washington Hospital Center based angiographic study of 1,500 subjects specifically designed for biomarker and genetic association studies of acute and chronic coronary atherosclerosis. Since the PennCATH and MedStar samples were genotyped together at the same site and were analyzed by identical procedures, to improve power to detect association, we combined the two data sets together. To define HDL-C case-control status, the subjects were dichotomized by sex-specific cutoff points: those subjects with HDL-C higher than 80% percentile were regarded as "cases" and those with HDL-C levels lower than 20% percentile were regarded as control subjects. The thresholds differ from the discovery cohort, because the discovery cohort is specifically designed and ascertained for studying high HDL levels, whereas the PennCATH and MedStar were not ascertained in this manner. Subjects who participated in the discovery cohort were not included in the replication cohort. Through these criteria, a total of 496 cases and 373 control subjects were included in the analysis.

Ethics statement
All subjects have been enrolled in a University of Pennsylvania or Washington Hospital Institutional Review Board approved protocol, and all subjects gave written informed consent.

PATHWAY ASSOCIATION ANALYSIS
The method for pathway-based association test was previously described in detail elsewhere (Wang et al., 2007(Wang et al., , 2009a. Briefly, we assign each SNP to the overlapping gene/genes, or its closest gene (subject to a distance constraint) in the genome if it does not overlap with a gene. For each gene, we assigned the highest statistic value among all SNPs mapped to the gene as the statistic value of the gene. For all the N genes that are represented by SNPs in the GWAS, we sorted their statistic values from the largest to the smallest, denoted by r (1) , . . ., r (N ) . For any given gene set S composed of N H genes, we then calculated a weighted Kolmogorov-Smirnovlike running sum statistic (Hollander and Wolfe, 1999) that reflects the overrepresentation of genes within the set S at the top of the entire ranked list of genes in the genome, p and p (default value = 1) is a parameter that gives higher weight to genes with extreme statistic values. To adjust for differences in gene size (hence the different number of SNPs located within/nearby each gene), as well as the linkage disequilibrium between SNPs within the same gene, we conducted a two-step correction procedure. In the first step, we permuted the disease labels of all samples ensuring the same number of individuals in each phenotype group for case-control studies. During each permutation (denoted by π), we repeated the calculation of enrichment score as described above as ES(S, π). In the second-step, we calculated normalized enrichment score (NES) as a Z -score, defined as [ES(S) − mean(ES(S, π))]/SD[ES(S, π)], so that different gene sets are directly comparable to each other. The P-value can be calculated from the π permutations, and a false discovery rate (FDR) procedure can be used to control the fraction of expected false positive findings below a certain threshold (Reiner et al., 2003). For a gene set, let NES * denote the NES in the observed data. The FDR is calculated as Alternatively, we also calculated the FWER P-value as the fraction of all permutations whose highest NES score in all gene sets are higher than the NES* for a given gene set. For the discovery cohort, a total of 10,000 phenotype permutation cycles were used so that a more accurate P-value can be calculated for each pathway; for replication cohort, 1,000 phenotype permutation cycles were used. The software implementing this method was made freely available at http://www.openbioinformatics.org/gengen. The current software uses chi-square test statistics for association, so it cannot incorporate covariates in the analysis. The software package included pathway definitions for the three pathway collections (KEGG, BioCarta, GO) used in the study, as well as library files mapping SNP identifiers to gene names for the Affymetrix Genome-wide Human SNP 6.0 array. The pathway collections were accessed and assembled on March 03, 2008. These default library files, as well as all default parameters in the software, were used in our pathway association analysis.

SINGLE-MARKER AND PATHWAY ASSOCIATION ANALYSIS ON HDL-C LEVELS
We examined a previously unpublished cohort on HDL-C levels (discovery cohort, Table 1), genotyped on the Affymetrix Genome-wide Human SNP 6.0 arrays. Unlike other disease studies that investigated HDL-C levels as an additional/secondary phenotype for investigation, this cohort consists of "cases"with extremely high HDL-C levels and "control" subjects with low HDL-C levels (see Material and Methods). One locus on 16q13 reached genomewide significance levels of P < 5 × 10 −8 (Figure 1), with the most significant SNP being rs1532625 (P = 6.8 × 10 −12 , OR = 1.68) within CETP (cholesteryl ester transfer protein). This locus was a well-established HDL-C locus before the GWAS era (Koizumi et al., 1985;Kurasawa et al., 1985;Inazu et al., 1990). Therefore, through single-marker association analysis, no novel lipid genes were identified that met genome-wide significance criterion, though we were able to replicate several loci from other published studies on lipid levels (Table S1 in Supplementary Material). Altogether, these single-marker association results testified the high quality of the phenotype data, and demonstrated the unique power www.frontiersin.org  of case/control data set that samples from extreme tails of normal distributions on a quantitative trait.
To screen for novel biological pathways or collections of functionally related genes that associate with HDL-C levels, we next applied a pathway-based association approach (Wang et al., 2007) on this GWAS data set, with the same set of subjects and SNPs used in single-marker association analysis. This pathway association method permutes the case/control labels a large number of times, and for each permuted data set, it re-calculates the association test statistics for all SNP markers, and then assesses the significance of the pre-defined pathways or sets of genes by comparing the observed test statistics with the null distribution generated by the permutations. The collection of pathways or gene sets analyzed in the study was retrieved from the BioCarta (BioCarta, BioCarta-Charting pathways of life), KEGG (Kanehisa et al., 2010) and level 4 of the Gene Ontology (Ashburner et al., 2000) databases. We performed 10,000 permutations and assessed a total of 568 pathways passing size threshold (20-200 genes). We found that seven pathways were associated with HDL-C levels at a nominal P-value less than 0.01 (Table 1), slightly more than expected by chance (5.68). Among these pathways, four have false discovery rate (FDR) less than 25%. However, we note that all genes in the "sterol metabolic process" pathway belong to the "steroid metabolic process" pathway as both were from the Gene Ontology collection (Figure S2 in Supplementary Material), therefore we excluded the "steroid metabolic process" pathway from further discussion as it has lower Z -score and P-value than its subset. The full list of genes and representative SNPs for the sterol metabolism and transport pathways were given in Figure S2 in Supplementary Material. To further evaluate the sensitivity of the pathway analysis to the inclusion of CETP -a well known HDL-C gene that showed genome-wide significance in our single-marker association analysis -we removed CETP from the pathway analysis. This practice is similar to removing APOE from pathway analysis on Alzheimer's disease (Jones et al., 2010) or removing NOD2 from pathway analysis on Crohn's disease (Wang et al., 2009a). We still observed association signals for the "sterol metabolic process" and "sterol transport" pathways but with reduced significance levels (P = 0.0011 and P = 0.007, respectively), suggesting that the observed pathway associations above were not due to the effect of CETP alone.
Similar to single-marker association tests, pathway-based association tests may be susceptible to spurious associations, so we next sought to replicate the pathway findings using independent data sets. Since the pathway association method requires phenotype permutation of case/control status, we compiled a case-control dataset (496 cases, 373 controls) from ongoing GWAS at University of Pennsylvania which have fasting lipid measures on some study participants (see Materials and Methods). The replication P-values for sterol metabolic process pathway and sterol transport pathway are 0.029 and 0.083, respectively. These values do not reach stringent levels of statistical significance considering the need to test two hypotheses, and this may be due to the lack of power in the replication study.

ADDITIONAL EVIDENCE SUPPORTING PATHWAY ASSOCIATION SIGNALS
Although our study demonstrated that a group of genes were collectively associated with HDL-C levels, only one locus (CETP) truly showed a genome-wide significance signal in single-marker association analysis. The Gene Ontology pathways were curated Frontiers in Genetics | Applied Genetic Epidemiology from heterogeneous data sources such as electronic annotation from protein homology searches, so they may contain genes that are not genuinely associated with HDL-C levels. Therefore, we applied three complementary approaches, to search for additional evidence supporting the role of a subset of genes in the pathways in regulating HDL-C levels, and summarized the results in Table 2.
First, we examined the sterol transport and metabolism genes in three independent GWAS [the Kathiresan et al. (2009) metaanalysis, the Aulchenko et al. (2009) meta-analysis, the Sabatti et al. (2009) study on founder populations), as well as the largescale Teslovich et al. (2010) meta-analysis. Among the 102 unique genes (96 unique loci), genes in the sterol transport and metabolism pathways, 9 and 15 loci were reported as genuine HDL-C loci and/or LDL-C/TG loci in at least one of these publications, respectively ( Table 2). The abundance of previously known lipid genes in the identified pathway therefore served as de novo confirmation supporting the biological relevance of the pathways. Additionally, when examining previously published candidate gene association studies, we found that multiple genes in the sterol transport/metabolism pathway, such as LEPR, APOA2, SOAT2, HDLBP, were previously associated with lipid levels ( Table 2), albeit with relatively small effect sizes and in some cases not replicated.
Second, we examined published literature on monogenic form of lipid disorders, and determined whether any of the sterol transport/metabolism genes have been previously associated with such diseases. Interestingly, multiple additional genes, such as DHCR24 and APOA2, have never been implicated by GWAS on lipid traits, but were identified as genuine lipid genes through the study of rare forms of lipid disorders ( Table 2). It is possible that no common causal variants in these genes exist in European populations, or that the effect sizes of causal variants are too small to be detected by the sample sizes used in previous studies. Nevertheless, this fact implicates the complementary nature of pathway-based association approach to single-marker association approach, and supports the utility of the pathway-based approach in identifying biologically relevant genes.
Third, we examined previous functional studies conducted on model organisms, including mouse knockout studies. Unlike many complex human diseases where genetic models are difficult to replicate or study in animal models, the lipid traits are easily measurable and quantifiable, so they were almost always included in knockout experiments, facilitating the functional study of lipid genes. This analysis revealed multiple additional previously unreported genes, including SOAT1, LEPR, and PON1, that had a metabolic phenotype (increased or decreased cholesterol levels) in knockout experiments ( Table 2).
Altogether, the three complementary approaches presented above resulted in the confirmation of a prioritized set of genes that have strong a prior evidence of playing a role in HDL-C regulation. Although the sterol transport/metabolism pathways contains 102 genes, this particular subset of genes may represent a prioritized subset that are promising candidates to be investigated in functional studies of lipid traits.

DISCUSSION
In this study, we performed pathway-based genome-wide association analysis on two GWAS data sets, and demonstrated that a group of sterol transport and metabolism genes are collectively associated with HDL-C levels. We further validated a subset of these genes through examination of previously published candidate gene association studies, known Mendelian forms of lipid disorders, and known dyslipidemia phenotypes in knockout mice. Compared to previously published large-scale GWAS, our results supplied a richer list of additional candidate genes regulating HDL-C levels, and demonstrated the enhanced power of pathway-based association analysis on finding biologically relevant genes. Given the relatively small sample size, we were not able to identify previously unknown pathways as being involved in regulation of HDL-C levels; however, the fact that known pathways and genes are identified suggest that the approach may be applicable to larger sample sets to identify novel pathways.
The physiology of HDL metabolism results from the complex interplay of multiple gene products including apolipoproteins, enzymes, lipid transfer proteins, cellular transporters, and receptors. Although the pathways identified in the study are intuitive in a biological sense, caution should be exercised as some of the genes within the pathways probably play more important and direct roles than other genes in regulating HDL-C homeostasis. This explains the rationale of our three-tiered prioritization approach to identify subsets of HDL-C genes for future functional studies.
We also wish to discuss a few points related to sample ascertainment scheme. One particular point is that all individuals in the discovery cohort are ascertained from the general population with extreme lipids measures, yet the individuals in the replication cohort have CAD. HDL-C levels might be more disturbed by lipid medications or other clinical factors in CAD patients. Therefore, although it would be valid to use individuals with CAD for replication purposes, it would be most appropriate to make discoveries from general population, to ensure that the data ascertainment scheme does not bias the results (for example, pathways that associate with CAD status rather than lipids levels). Another point is that our study did not utilize a larger cohort and took a QTL approach for the association analysis. It is expected that under fixed genotyping budget, extreme phenotypes have greatly improved power to detect associations due to increased odds ratio measures. However, lipids measures are easily obtainable, so they represent excellent secondary phenotypes available from many GWAS on diverse phenotypes with large sample sizes. Therefore, we expect that a QTL approach for a very large cohort may also have sufficient power to identify the implicated pathways.
Although the name "pathway" was used in our analysis, we caution that the term merely refers to a collection of genes, rather than an inter-connected biological process. The pathway-based association analysis is borrowed from microarray data analysis, where sets of related genes are combined into a single group to examine the up or down-regulation of gene expression levels for a group of genes that are consistent. Therefore, "pathway" does not necessarily correspond to biologists' understanding of biological process, but rather a functional classification of genes based on a set of pre-defined criteria. Our study used pathways compiled from three resources, but we note that other pathway collections could also be explored. Furthermore, we note that pathways analyses have the potential to move beyond the so-called www.frontiersin.org  (Huang et al., 1995), hyopbetalipoproteinemia (Homanics et al., 1993) Hypobetalipoproteinemia (Young et al., 1987   "common-disease/common-variant" paradigm, and consider SNP and genes together rather than in isolation. In conclusion, using two distinct GWAS data sets based on modestly-sized samples ascertained by extremes of a quantitative trait (HDL-C), we have applied pathway analysis to suggest that an ensemble of ∼100 genes is collectively associated with HDL-C levels. A fraction of these genes has already been suggested by previously published literature and by animal models, lending strong support to the validity of the pathway-based approach. Our results therefore confirm the validity and potential utility of the pathway association approach as applied to the extreme tails of a quantitative trait.

ACKNOWLEDGMENTS
We would like to thank all participating subjects and families involved in the study. Recruitment of the high/low HDL cohort was supported by a Doris Duke Charitable Foundation Distinguished Clinical Scientist Award to Daniel J. Rader. Recruitment of the PennCATH cohort was supported by the Cardiovascular Institute of the University of Pennsylvania. Recruitment of the MedStar cohort was supported by a research grant from GlaxoSmithKline and from the MedStar Research Institute. Genotyping was done at the Center for Applied Genomics at the Children's Hospital of Philadelphia. Genotyping of the high/low HDL cohort was supported by R01 HL089309 from the NHLBI and of the PennCATH and MedStar cohorts by GlaxoSmithKline through an Alternate Drug Discovery Initiative research alliance award (to Muredach P. Reilly and Daniel J. Rader) with the University of Pennsylvania School of Medicine. Andrew C. Edmondson was supported by a National Heart, Lung and Blood Institute Ruth L. Kirschstein National Research Service Award for Individual Predoctoral MD/PhD Fellows (F30). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.