Implications of pleiotropy: challenges and opportunities for mining Big Data in biomedicine

Pleiotropy arises when a locus influences multiple traits. Rich GWAS findings of various traits in the past decade reveal many examples of this phenomenon, suggesting the wide existence of pleiotropic effects. What underlies this phenomenon is the biological connection among seemingly unrelated traits/diseases. Characterizing the molecular mechanisms of pleiotropy not only helps to explain the relationship between diseases, but may also contribute to novel insights concerning the pathological mechanism of each specific disease, leading to better disease prevention, diagnosis and treatment. However, most pleiotropic effects remain elusive because their functional roles have not been systematically examined. A systematic investigation requires availability of qualified measurements at multilayered biological processes (e.g., transcription and translation). The rise of Big Data in biomedicine, such as high-quality multi-omics data, biomedical imaging data and electronic medical records of patients, offers us an unprecedented opportunity to investigate pleiotropy. There will be a great need of computationally efficient and statistically rigorous methods for integrative analysis of these Big Data in biomedicine. In this review, we outline many opportunities and challenges in methodology developments for systematic analysis of pleiotropy, and highlight its implications on disease prevention, diagnosis and treatment.


Introduction
In the past decade, genome-wide association studies (GWAS) have been conducted to study the genetic basis for thousands of phenotypes Eicher et al., 2015), including diseases (e.g., the seven diseases from WTCCC, The Wellcome Trust Case Control Consortium, 2007), clinical traits (e.g., cholesterol levels), anthropometric traits (e.g., height, Wood et al., 2014), brain structures (Hibar et al., 2015) and social behaviors (e.g., educational attainment, Rietveld et al., 2013;marriage, Domingue et al., 2014). As of April, 2015, more than 15,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated (p < 5 × 10 −8 ) with at least one phenotype (see GWAS catalog, Welter et al., 2014). By exploring these fruitful findings from GWAS, recent progress has suggested that a single locus may influence multiple seemingly different phenotypes (Solovieff et al., 2013). This phenomenon, termed "Pleiotropy, " was formally introduced into the scientific literature by the German geneticist Ludwig Plate in 1910 (Stearns, 2010). Accumulating evidence suggests that pleiotropy widely exists among complex traits, such as psychiatric disorders (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013a,b), metabolic syndrome traits (Vattikuti et al., 2012) and cancers (Sakoda et al., 2013). Examples of pleiotropy in human diseases include the PTPN22 locus associated with multiple auto-immune disorders (Cotsapas et al., 2011), such as rheumatoid arthritis, Crohn's disease, and type I diabetes; the TERT-CLPTM1L locus associated with bladder, glioma, and lung cancers (Fletcher and Houlston, 2010). Although the first survey (Sivakumaran et al., 2011) of pleiotropic effects (PE) was published in 2011, it may underestimate PE as the database of GWAS results was far less than what is available today . Fortunately, the Genome-Wide Repository of Associations between SNPs and Phenotypes (GRASP) database has been built up as a repository of many results from published GWAS (Leslie et al., 2014). A recent update (Eicher et al., 2015) on GRASP has provided even more comprehensive GWAS results-about 8.87 million SNP-phenotype associations in 2082 studies with p ≤ 0.05. Such a rich data resource allows characterizing the molecular mechanisms of PE on diverse phenotypes. Undoubtedly, it will not only greatly deepen our understanding of the genetic architecture that underlies complex human phenotypes, but also have clinically important implications.

Benefits from Characterization of Pleiotropic Effects
To demonstrate the potential benefits from characterization of PE, we consider a recent study from Butte's team at Stanford (Li et al., 2014b). Based on available information from the VARiants Informing MEDicine (VARIMED) database (Ashley et al., 2010), Butte's team hypothesized that diseases (e.g., type 2 diabetes) and non-disease traits (e.g., blood pressure and cholesterol levels; for convenience, we shall refer to "non-disease traits" as "traits" hereafter) could be related to each other through shared genetic variants. They first identified significant associations between 801 unique genes and 69 diseases, and between 796 unique genes and 85 traits. Next, they identified 120 disease-trait pairs that could be reliably linked via shared genetic variants, and 26 of them are novel to the community. Among these novel findings, five pairs can be directly validated by electronic medical records of patients from three independent clinical centers: Stanford Hospital and Clinics, Mount Sinai Medical Center and Columbia University Medical Center. For example, gastric cancer and the serum magnesium level is one of these five pairs. This pair is linked via three genes-MUC1, THBS3, and TRIM46, as implicated by some previous studies of gastric cancer (Wadhwa et al., 2013) and the serum magnesium level (Meyer et al., 2010). To validate this disease-trait pair, 804 patients were selected as cases because they had a magnesium measurement 1 year before their diagnosis of gastric cancer, and 324,160 individuals who had at least one magnesium measurement without diagnosis of gastric cancer were selected as controls. The comparison showed that the cases had a significantly higher magnesium levels than the controls. If this finding could be further replicated independently, it would have a very important clinical implication-the serum magnesium level could be used as a bio-marker which predicts the risk of gastric cancer 1 year beforehand.
There would be more benefits in clinical practice if the biological mechanisms of PE on some disease-trait pairs could be timely validated. One immediate benefit is the development of more affordable clinical tests, such as blood tests and imaging tests, for disease prevention and diagnosis, as implied by the real example above. Another potential benefit is the discovery of new drugs for disease treatment. Consider two diseases that are connected through a common biological mechanism. If a drug works well in treating one disease, it will also likely to be effective for the other disease. For example, calcium antagonist drugs have been used for the treatment of hypertension since 1960s (Wood et al., 1999). Recently, the Psychiatric Genomics Consortium (PGC) identified the L-type calcium channel subunit gene CACNA1C as a risk gene for several psychiatric disorders (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013b). As reported in Solovieff et al. (2013), this new finding has drawn the attention in trials of calcium antagonist drugs because these drugs could be potentially useful for psychiatric disorder treatment.

Statistical Perspectives on Characterization of Pleiotropic Effects
Although results from thousands of GWAS are readily available, systematical analysis of existing GWAS results toward fully characterizing PE is not trivial. Historically, the genome-wide significant level was set to p ≤ 5 × 10 −8 (McCarthy et al., 2008) as empirical data in European-decent GWAS suggested adjustment for 1-2 million independent tests (The International HapMap Consortium, 2005;McCarthy et al., 2008). However, it may not be wise to narrow down the search range of PE within the significant GWAS hits because they can only explain a very small proportion of phenotypic variance, which is known as the "missing heritability" phenomenon (Manolio et al., 2009). In 2010, Yang et al. showed that 45% of the variance for human height of 3925 unrelated individuals could be explained by 294,831 common SNPs (Yang et al., 2010). So far, researchers have found similar results for many other complex phenotypes , such as metabolic syndrome traits (Vattikuti et al., 2012), and psychiatric disorders (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013a; Yang et al., 2014). An important lesson learned from GWAS is the polygenic architecture of complex human phenotypesbesides the significant GWAS hits, a complex human phenotype is often affected by many genetic variants with small or moderate effects . Due to the limited sample size, these genetic risk variants may not be identified at the stringent genome-wide significant level, i.e., p ≤ 5 × 10 −8 . The survey of PE in 2011 showed that 4.6% of identified SNPs and 16.9% of identified genes were associated with multiple phenotypes (Sivakumaran et al., 2011). This result could be an underestimate of PE as many variants with weak or moderate effects have not been identified. For better characterization of PE among multiple phenotypes, the uncertainties arise in single-GWAS analysis should be taken into account. For example, it may be not easy to determine whether a variant with p-value about 10 −5 is disease-associated or not if we only focus on the variant itself. Statistics plays a critical role in incorporating indirect but relevant information (Efron, 2010) to account for uncertainties as discussed below.
Before discussion of statistical methods to characterize PE, we first introduce the local and global measures of PE. Originally, the term "pleiotropy" referred to the phenomenon that a single locus affects two or more phenotypes (Stearns, 2010). Let u (1) and u (2) be the effect sizes of this locus on two phenotypes, respectively. According to this definition, this locus is said to have pleiotropic effect if both u (1) and u (2) are nonzero. Clearly, this is a local measure of pleiotropy as it only refers to a specific locus. In the genomics era, the local measure has been extended to a global one which is defined as the correlation between the effect sizes of all genetic variants on the two phenotypes (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013a). Here we first introduce the method to estimate the global PE, and then discuss the issue about localization of PE in the next section.
At first glance, it seems to be straightforward to obtain the global PE by a two-step strategy: estimate the effect sizes individually for each variant and then calculate their correlation. However, accurate estimation of all the effect sizes in the first step can be very challenging due to limited sample sizes. Uncertainty arises in characterizing global PE due to the estimation error accumulated in the first step, resulting in the reduction of efficiency in correlation estimation in the second step. Fortunately, linear mixed models (LMM) arise as a powerful tool to address this challenge (Lee et al., 2012).
For simplicity, let us consider GWAS of two distinct phenotypes and without overlaped samples. Denote the phenotypes and genotypes as y (k) ∈ R n k ×1 and G (k) ∈ R n k ×M , respectively, where M is the number of SNPs in the genotyped matrix and n k is the sample size of the k-th GWAS, k = 1, 2. Bivariate LMM can be used for estimating global PE as follows: where X (k) is a design matrix of fixed effects collecting all the covariates (e.g., sex and age) for the k-th phenotype, β (k) is the coefficient vector of the fixed effects, u (k) is the coefficient vector of the genetic effects viewed as random, and e (k) is the residual, k = 1, 2. Let u (1) m and u (2) m be the m-th element of u (1) and u (2) , respectively. Then they are assumed to jointly follow the bivariate Gaussian distribution as where σ u k is the standard deviation of the effect sizes on the k-th phenotype, and ρ u ∈ [−1, 1] is the genetic correlation to capture global PE. In such a way, these random effects u (k) can be integrated out analytically, which helps us bypass the great challenge in accurate estimation of weak individual effects. After that, the model parameters (σ u 1 , σ u 2 , ρ u ) can be estimated using maximum likelihood (ML) or restricted maximum likelihood (REML) (Lee et al., 2012). In other words, all available information can be simultaneously incorporated under a statistically rigorous framework unlike the naive twostep strategy discussed above. Through this approach, the genetic relationship between five psychiatric disorders has been explored based on genome-wide SNPs (Cross-Disorder Group of the Psychiatric Genomics Consortium, 2013a). The estimated genetic correlation was very strong between schizophrenia and bipolar disorder (ρ u = 0.68 with s.e. 0.04), low between schizophrenia and autism spectrum disorders (ρ u = 0.16 with s.e. 0.06), and non-significant for some other pairs of disorders. The results may have clinically important implications for the classification of psychiatric disorders which has been done conventionally on the basis of symptoms. Due to the limited space here, readers interested in technical details of LMM shall refer to Jiang (2007) and McCulloch et al. (2008). As discussed above, the global PE can be estimated accurately after accounting for the uncertainties in estimating individual genetic effects. Similarly, leveraging pleiotropy and accounting for this estimation uncertainty can lead to significant improvement of disease risk prediction, as recently demonstrated in Li et al. (2014a); Maier et al. (2015). Interested readers can easily try the LMM approach in GWAS data analysis using efficient softwares, e.g., GCTA (Yang et al., 2011) and GEMMA (Zhou and Stephens, 2014).

Challenges and Opportunities in Characterization of Pleiotropic Effects
Although LMM is a very powerful tool to capture the global PE, systematical localization of PE is still at the beginning stage. In this section, we outline the challenges and opportunities from the statistical point of view.
There are two major challenges in localization of PE. First, a substantial proportion of phenotypic variance is attributed to many variants with small effects due to the polygenicity of complex human phenotypes. Identification of those risk variants with a very high certainty may not be supported by the sample size of current GWAS. A loose threshold (e.g., p ≤ 0.05) may lead to many false positives (Type I error), while a stringent threshold (e.g., p ≤ 5 × 10 −8 ) may produce too many false negatives (type II errors). Therefore, it may be problematic to examine pleiotropy based on the intersection of identified GWAS loci after a simple thresholding. Second, spurious pleiotropy may appear due to the artifact in experimental design and the linkage disequilibrium (LD) among genetic variants. On one hand, ascertainment bias can introduce spurious PE when sample recruitment based on the first phenotype changes the prevalence of the second phenotypes (Smoller et al., 2000). For example, it is possible that patients who suffer from two or more illnesses are more likely to be recruited than those with only one. As accumulation of experience in GWAS design, the sampling strategy and techniques to control confounding factors have been improved (Feero et al., 2010). Thus, this type of artifact may be reduced over time (Solovieff et al., 2013). On the other hand, the strong LD with two risk variants may cause spurious PE. Suppose SNPs A and B are two nearby risk variants for two distinct phenotypes. It is possible that they are strongly linked due to LD but locate in two genes with different functions. Statistically, it would be very challenging to distinguish this spurious PE from truly biological PE, given the limited information. Incorporation of functional information of the two SNPs can be very helpful to exclude this spurious PE as their distinct functions tell their differences.
To systematically characterize PE, additional information should be incorporated into statistically rigorous analysis. Instead of solely relying on genetic information, the Encyclopedia of DNA Elements (ENCODE) project (The ENCODE Project Consortium, 2012) has provided a comprehensive map of the regulatory elements in the human genome. Although most of its results are from cell lines, it has highlighted the importance of regulatory regions in the human genome. Another project, named Roadmap Epigenomics project (Kundaje et al., 2015), used primary cells instead of cell lines, aiming at providing reference epigenomes of more than one hundred tissues and cell types to tackle human diseases. Apart from epigenetic markers, the Genotype-Tissue Expression project (GTEx) (Lonsdale et al., 2013) aims at collection of 20,000 tissues from 900 donors, serving as a comprehensive atlas of gene expression and regulation across human tissues. Clearly, integration of pleiotropy and functional annotation to drive advanced scientific hypotheses is calling for rigorous analysis (Ritchie et al., 2015). Besides the statistical methods mentioned in the timely review paper by Solovieff et al. (2013), several statistical approaches to characterizing PE have been developed more recently, including GPA (Chung et al., 2014), CPASSOC (Zhu et al., 2015b), MGAS (Van der Sluis et al., 2015), Bayesian Test for Colocalization (Giambartolomei et al., 2014) and others.
As an example, here we briefly introduce GPA (Chung et al., 2014), a statistical approach recently developed by us. As a first attempt, GPA is designed to integrate pleiotropy and functional annotation information for risk SNP prioritization, and significance assessment of pleiotropy and annotation enrichment. A notable feature of GPA is that it only requires the summary statistics from GWAS, rather than the genotype data at the individual level, as its input, which greatly facilitates GWAS data integration. GPA has been applied to analyze psychiatric disorders and bladder cancer, with various types of functional annotation. Our analysis results suggest that integration of genomics data may potentially lead to abundant novel findings (Chung et al., 2014).
It is noteworthy that characterization of PE should not be restricted to phenotypes at the organismal level. Investigation the PE between the organismal phenotype (e.g., disease status) and the cellular phenotype (e.g., DNA methylation, gene expression, protein expression and metabolite abundance) may lead to even more fruitful discoveries as the genetic variants often have much larger effect sizes on cellular phenotypes (Battle et al., 2015). Mining the biomedical data representing different biological processes at different layers is becoming feasible as these data have been well organized recently, including the GTEx database (Lonsdale et al., 2013), a cross-platform collection of human gene expression data (Zhu et al., 2015a), a tissue-based human protein database (Uhlén et al., 2015) and an atlas of genetic influences on human blood metabolites (Shin et al., 2014). The potential impact of PE at different layers can be amplified as the costs of cellular trait measurements continue to drop with the advent of new technologies.

Conclusion
The advent of big data has revolutionized biomedical research. We are able to comprehensively characterize the health condition of a human subject with quantitative measurements generated at both cellular and organismal levels. From these data, we may find direct or indirect evidence to resolve long-standing problems and motivate advanced scientific hypotheses. Characterization of PE based on integration of various data types is such an exciting process, while the risk of identification of spurious PE may be largely increased due to the enlarged search space. We believe that statistically rigorous methods which effectively account for uncertainties in data integration will continue to play a critical role in improving statistical power and decreasing false positive findings.