MetaPhat: Detecting and Decomposing Multivariate Associations From Univariate Genome-Wide Association Statistics

Background Multivariate testing tools that integrate multiple genome-wide association studies (GWAS) have become important as the number of phenotypes gathered from study cohorts and biobanks has increased. While these tools have been shown to boost statistical power considerably over univariate tests, an important remaining challenge is to interpret which traits are driving the multivariate association and which traits are just passengers with minor contributions to the genotype-phenotypes association statistic. Results We introduce MetaPhat, a novel bioinformatics tool to conduct GWAS of multiple correlated traits using univariate GWAS results and to decompose multivariate associations into sets of central traits based on intuitive trace plots that visualize Bayesian Information Criterion (BIC) and P-value statistics of multivariate association models. We validate MetaPhat with Global Lipids Genetics Consortium GWAS results, and we apply MetaPhat to univariate GWAS results for 21 heritable and correlated polyunsaturated lipid species from 2,045 Finnish samples, detecting seven independent loci associated with a cluster of lipid species. In most cases, we are able to decompose these multivariate associations to only three to five central traits out of all 21 traits included in the analyses. We release MetaPhat as an open source tool written in Python with built-in support for multi-processing, quality control, clumping and intuitive visualizations using the R software. Conclusion MetaPhat efficiently decomposes associations between multivariate phenotypes and genetic variants into smaller sets of central traits and improves the interpretation and specificity of genome-phenome associations. MetaPhat is freely available under the MIT license at: https://sourceforge.net/projects/meta-pheno-association-tracer.


INTRODUCTION
Genome-wide association studies (GWAS) of common diseases and complex traits in large population cohorts have linked thousands of genetic variants to individual phenotypes. In emerging biobank studies as well as in some disease specific collections have focused on, for example, Type 2 diabetes (T2D) (Mahajan et al., 2018) or coronary artery disease (CAD) (Ripatti et al., 2016), multiple related quantitative traits are simultaneously available for genetic association studies. The statistical power in these discovery efforts can be boosted considerably by multivariate tests, which have become more practical through recent implementations that require only univariate summary statistics, such as MultiPhen (O'Reilly et al., 2012), TATES (van der Sluis et al., 2013), CONFIT (Gai and Eskin, 2018), MTAG (Turley et al., 2018), MTAR (Guo and Wu, 2019), and metaCCA (Cichonska et al., 2016). The merits of many of these methods are further discussed by Chung et al. (2019). Concretely, canonical correlation analysis (CCA) (Hotelling, 1936) is the direct extension of the correlation coefficient to identify linear associations between two sets of variables, and it has been successfully applied also to GWAS (Inouye et al., 2012). Moreover, metaCCA extended CCA to work directly from GWAS summary statistics (effect size estimates and standard errors) of related traits and studies. However, a remaining challenge is to interpret which traits are driving the multivariate association and which traits are just passengers contributing little to the association statistic. A successful identification of a subset of central traits for each associated variant can lead to new biological insights in studies of disease progression and heterogeneity. To address this important task, we have introduced MetaPhat (Meta-Phenotype Association Tracer), a novel method to efficiently and systematically: 1. identify and annotate significant variants via multivariate GWAS from univariate summary statistics using metaCCA; 2. perform decomposition by systematically tracing the traits of highest and lowest statistical importance to identify subsets of central traits at each associated variant; 3. plot the traces of trait decompositions and cluster the variants based on the ranking of the importance of traits.

Workflow
MetaPhat requires as input a set of related GWAS summary statistics from correlated traits. The program implements efficient multi-trait genome-wide association testing, identification of significant associations, and systematic tracing of trait subsets to identify the central traits that consist of a statistically optimal set of traits together with a set of driver traits. A workflow is shown in Figure 1. In steps one to three, genome-wide significant variants [P < 5e-8, the established genome-wide threshold in the field (Sherry et al., 2001;Pe'er et al., 2008)] were identified and were clumped into independent groups that are subsequently represented by the lead variant of each group (i.e., the variant with the smallest P-value). By default, two lead variants were defined as independent if their distance is higher than 1 million base pairs. At step four, we carried out the decompositions of multivariate association by starting from model with all K traits and removing one trait at a time until only one trait remains. We proceeded via two different strategies that we named the highest trace and the lowest trace. More specifically, starting from the model with all K traits, we tested all unique combinations of (K-1) traits to find the subset with the highest CCA statistic (lowest P-value) that we assigned to the highest trace and the subset with the lowest CCA statistic (highest P-value) that we assigned to the lowest trace. We continued both traces iteratively until only a single trait remained by always choosing the subset with the highest CCA statistic on the highest trace and the subset with the lowest CCA statistic on the lowest trace. Intuitively, at each step, the trait dropped on the highest trace was the trait that was best replaceable by the other traits in the model with respect to the genetic association considered. Analogously, at each step, the trait dropped on the lowest trace was the trait that was most irreplaceable by the other traits in the model with respect to the genetic association considered. Altogether, we evaluated K 2 subsets out of all possible 2 K subsets while building these two traces. Base pair distances, GWAS P-value thresholds, and other program parameters could be updated using command-line arguments. We used the two traces to identify central traits that are primarily responsible for the association with the variant as explained next.

Evaluating Models
We used two quantities to evaluate models: CCA P-values and Bayesian Information Criterion (BIC; Schwarz, 1978). P-values allowed us to compare each association to the established "genome-wide significance threshold" of 5e-8 (Pe'er et al., 2008). By using the lowest trace, we could identify those traits without which the multivariate P-value is no longer genomewide significant by simply collecting the traits that have been removed from the full model when the P-value on the lowest trace is first time above 5e-8. We call these traits the driver traits since they drive the association in the sense that without them the association does not anymore reach genome-wide significance and hence would not have been reported as a discovery in a GWAS. This definition of driver traits is based on a fixed P-value threshold, which is an established practice in the field, but does not claim any statistical optimality properties in terms of model comparison. Hence, to more rigorously compare models with different dimensionalities, we used BIC, which approximates the negative marginal likelihood of the model and thus penalizes for the model dimension (Schwarz, 1978). A lower BIC value suggests a statistically better description of the data. A subset of traits with minimum BIC would thus be the model of choice. We defined the optimal subset as the subset with the lowest BIC among all subsets on the highest trace and all subsets on the inverted lowest trace. The inverted lowest trace aggregates the traits that have been dropped on the lowest trace, and, in particular, includes the set of the driver traits as one of its subsets. FIGURE 1 | MetaPhat workflow 1. GWAS results for K traits are accepted as input. 2. After quality control and filtering, a multivariate GWAS is performed on the full model with all K traits using metaCCA via efficient multi-processing and chunking to reduce computation time. 3. Lead SNPs are detected and sorted based on the leading canonical correlation/P-value and then clumped based on a user-specified window size. Custom variants can be added. 4. Decomposition of chosen variants is performed through highest and lowest traces to find an optimal subset with a minimum BIC and driver traits based on the established P-value threshold. 5. MetaPhat results include trace plots for P-values and BIC, univariate association statistics plots for all lead SNPs, cluster maps (shown in Figure 2), and a summary table listing central traits (union of drivers and optimal subset).
Subsequently, we defined the central traits as the union of traits from the drivers and optimal BIC subset. MetaPhat traces and terms are summarized in Table 1. Computing P-Values and BIC From GWAS Summary Statistics metaCCA outputs the first canonical correlation r 1 between the genetic variant x and the set of k traits y 1 ,. . .,y k and computes the corresponding P-value (Clarke et al., 2011;Cichonska et al., 2016). In this case, the first canonical correlation r 1 equals to the maximum correlation between the variant and any linear combination of the traits and hence is equal to the square root of the variance explained R 2 from the linear regression of x on y 1 ,. . .,y k . In general, the expression for BIC is where n is the sample size, k is the number of parameters (here traits), and log lk is the maximized log-likelihood. Next, we have shown how to use metaCCA output r 1 to derive BIC from the maximized likelihood of the linear model written as a function of R 2 = r 1 2 . Consider a linear model between a (mean-centered) variant x and (mean-centered) traits y = (y 1 ,. . .,y k ) T .
where we do not include the intercept parameter as its maximum likelihood estimate (MLE) is zero after mean-centering. The loglikelihood function is and MLEs are Thus, the log-likelihood at maximum is

Highest trace
Starting from the full model of K traits, we tested all unique combinations of (K-1) traits to find the subset with the highest CCA statistic (lowest P-value), and we iterated until K = 2. The goal was to drop most replaceable traits first.

Lowest trace
Starting from the full model of K traits, we tested all unique combinations of (K-1) traits to find the subset with the lowest CCA statistic (highest P-value), and we iterated until K = 2. The goal was to drop most irreplaceable traits first.

Inverted trace
Aggregates the traits that have been dropped on the lowest trace. The goal was to include the driver sets into the search space for the optimal set.
Drivers/driver traits The traits that have been dropped on the lowest trace at the step where the multivariate P-value was for the first time no longer genome-wide significant. Interpretation: traits that make the multivariate association statistically significant.

Optimal set
The subset of traits that has the lowest BIC among subsets across all three traces. Interpretation: the set that is a statistically optimal description of the multivariate association.
Central traits Union of drivers and optimal set. Interpretation: includes the important traits of the multivariate association.
Hence, the logarithm of the likelihood ratio between the MLE and the null model can be written as Hence, we have that, for an additive constant which is possible to compute directly from the metaCCA output for models with at least two traits up to an additive constant c.
Since c does not depend on the model dimension, we can ignore it in the BIC calculation, when we are only interested in the differences in BIC between models. Finally, for a single-trait model, R 2 can be computed directly from the univariate GWAS summary statistics as which can be plugged in the BIC formula above to yield BIC for the single-trait model.

Implementation and Output
MetaPhat is written in Python (compatible for 2.7 and 3+) and requires R (3.4+) for plotting. The command-line based program has been tested on multiple operating systems and cloud images. Library requirements and command options are further described in Supplementary Table S1, and test data are accessible from the project page: https://sourceforge.net/projects/ meta-pheno-association-tracer. MetaPhat outputs tabular text files and several plots. A summary result file contains, for each chosen variant, the driver traits and the optimal subset with their P-value and BIC statistics. For each variant, trace plots using P-values and BIC are generated, showing the highest trace, the lowest trace and the inverted lowest trace. In addition, the univariate P-values and directions of effects for each trait are also plotted. The estimated phenotype correlation matrix, clustered heatmaps of trait importance for the chosen variants and a similarity between variants using trait rankings on the lowest trace are produced. Optionally, intermediate statistics during the decomposition can be plotted to get a more detailed view of the decomposition process.

Materials
Our lipidomics data set consisted of the univariate GWAS results of 21 correlated lipid species with polyunsaturated fatty acids that were reported to exhibit high heritability (Tabassum et al., 2019) and showed high correlation (Supplementary Figure S2). These results originated from 2,045 Finnish subjects with imputed genotypes available at ∼8.5 million SNPs. The arbitrarily assigned lipid species identifiers along with their class names and fatty acid chemical properties are listed in Table 2A. To further validate MetaPhat, we processed summary statistics from four basic lipids [high-density lipoprotein (HDL) cholesterol, lowdensity lipoprotein (LDL) cholesterol, triglycerides (TG), and total cholesterol (TC)] conducted by the Global Lipids Genetics Consortium (GLGC) (Willer et al., 2013;Zhu et al., 2018), and these are listed in Table 2B. With the GLGC data set our aim was to compare MetaPhat results with univariate results reported by GLGC for all variants reported to be significantly associated with two or more traits by GLGC.

RESULTS
Using the lipidomics data sets with GWAS summary statistics from the 21 polyunsaturated lipids, MetaPhat found seven independent lead variants after clumping the 415 variants exceeding the standard GWAS P-value threshold of 5e-8 within a window of 1 Mb. Table 3 lists these variants along with their gene annotation, multivariate P-value, and central traits. MetaPhat has strongly reduced the multivariate association for all seven variants into smaller and more specific groups of central traits.
We considered in more detail rs7412, which is a missense variant in the APOE gene and is known for its effect on LDL, as reported, for example, in the GLGC analysis (Willer et al., 2013). With the lipidomics data, this variant would not have been identified from any of the 21 univariate GWAS as the smallest univariate P-value was 1.1e-4 (trait PCO23, Supplementary  Figure S3.6). On contrary, the multivariate GWAS by MetaPhat clearly highlighted this variant associated with the multivariate lipidomics (P = 4.2e-13) and further determined that the association was driven by CE14 and PCO23 (P-value after excluding these driver traits is 1.8e-06). The BIC-optimal subset for this variant extended the drivers by one additional trait and included CE14, PC36, and PCO23, which form the central traits. The trace plots for rs7412 are shown in Figure 2A (Pvalues for defining driver traits) and Figure 2B (BIC for defining optimal subset). Variants rs66505542 near BUD13 and rs261290 near ALDH1A2 both have only one driver trait (PI9 for BUD13 and PE7 for ALDH1A2) and three or five central traits (Table 3). Earlier, the APOA1 variant rs964184 within 100 kb of rs66505542 has been reported to be associated with TG (lead trait, P = 7.0e-224), TC, HDL, and LDL in GLGC data and rs66505542 itself with several cell phenotypes (platelet count, red cell distribution width, sum of eosinophil and basophil counts) in the GWAS catalog, while rs261290 has been reported to be associated with HDL (lead trait, P = 1.0e-188), TC, and TG in GLGC data (mapped to LIPC gene) and with HDL in the GWAS catalog.
A very different picture emerges for rs174567 near FADS1/2 since its 18 central traits show its wide effects across the lipidomics traits studied here. Previously reported FADS1/2 associations are with all lipid traits (TG lead trait, P = 7.0e-38) in GLGC data and with metabolite measurements and gallstones in the GWAS catalog.
Trait importance map that clusters each variant based on the lowest trace is shown in Figure 2C and the similarity of the variants as measured by rank correlation of the traits on the lowest trace is shown in Figure 2D. The trace plots for the other six variants than rs7412 are shown in Supplementary Figure S1.

Validation and Global Lipids Genetics Consortium
We processed the Global Lipids Genetics Consortium (GLGC) GWAS study for four plasma lipids (HDL, LDL, TC, and TG, as listed in Table 2B). These correlated traits along with large sample sizes and available summary files are suitable for MetaPhat GWAS and decomposition. We focused on the 13 variants reported by GLGC to have associations with three or more lipid traits (Supplementary Tables S2 and S3 from Willer et al., 2013). In Table 4, we validated that at 12 out of the 13 variants the same associations are confirmed by MetaPhat's central traits. The only discordance was at rs6831256 (DOK7) where we found TC and TG as central traits compared to previously reported univariate associations with TC, TG, and LDL. As TC and LDL are highly correlated, it is understandable The lipid trait class names and acyl chain properties are listed in Table 2A. *Variant region reported as significant for basic lipids by GLGC (Willer et al., 2013).
that the smaller dimension of the set TC, TG, may in some analyses be preferred over the set that also includes LDL. In Supplementary Table S2, we further report high concordance between our central traits and GLGC variants found associated with two or more standard lipids.

Performance
For computing the test statistic, MetaPhat uses metaCCA that, for a single SNP, has previously been shown to reliably estimate the results of standard CCA applied to individual level data (canoncorr function in Matlab) (Cichonska et al., 2016). Additionally, we also empirically validated MetaPhat multivariate findings with GLGC results. MetaPhat considerably cuts down the computational demands of comprehensive subset testing. With K traits, there are 2 K -1 non-empty subsets that have quickly become infeasible to systematically assess, while MetaPhat only considers about K 2 models. For example, in our example with K = 21 traits, the gain in performance is about 4,700-fold compared to the complete subset testing. To further increase performance and usability, we have implemented flexibility for multi-thread processing to enable high performance and memory efficiency. On a moderate Google cloud image (16 vCPUs, 8 GB), the complete MetaPhat workflow for our lipidomics analysis, containing 21 lipids and 8.5 million SNPs, was completed in less than 2.5 h (143 min). Using 10 processors and 9 gigabytes of memory, the GLGC job with the four basic lipids and 2.4 million imputed SNPs completed in 24 min. MetaPhat also allows decomposition and plotting of custom SNPs. For example, the custom analysis of the 13 GLGC variants associated with three or more traits, shown in Table 4, was run again on existing GLGC MetaPhat results, and decomposition and plotting took only 2 min. We note that the run time could be longer on shared servers but also substantially shorter using more powerful dedicated cloud images.

DISCUSSION
It is expected that a particular genetic variant may affect only a subset of related biomarkers that are risk factors of complex disorders, such as T2D or coronary heart disease. We implemented MetaPhat to systematically decompose and visualize statistically significant multivariate genome-phenome associations into a smaller group of central traits, based only on univariate GWAS summary statistics. We are not aware of comparable software to MetaPhat that would automatically carry out multivariate GWAS and identify central traits for the associations from summary statistics. ASSET (Bhattacharjee et al., 2012) aims to find the best trait subsets within a pool of multiple studies and has been applied particularly for casecontrol studies. MTAG (Turley et al., 2018) can be applied to GWAS results of multiple related traits and overlapping samples, but its aim is to improve the accuracy of the univariate effect sizes by using the information from correlated traits rather than decomposing the multivariate association to individual traits.
In our results from an analysis of 21 lipidomics traits, we demonstrated that the APOE association (rs7412) benefited from multivariate testing (driven by CE14 and PCO23 traits), as the univariate P-value was insignificant (P > 1e-4) across all 21 GWAS traits (shown in Supplementary Figure S3.6), but multivariate P-value was low (P < 5e-13). This variant is known to have a strong effect on LDL, and Table 2 shows that CE14 has the highest correlation with LDL (0.464). The other two central traits of this variant, PCO23 and PC36, did not have any correlation to basic lipids larger than 0.20 in absolute magnitude. Table 3 lists the multivariate results including which four of these seven variants were previously reported by GLGC as associated with at least one of the four basic lipids. The other three variants also have some nearby variants that have been reported in the GWAS catalog (Buniello et al., 2019). First, rs8736 in MBOAT7 has been previously reported to be associated with human blood metabolites (Shin et al., 2014) as well as alcohol related cirrhosis of the liver (Buch et al., 2015). Second, variants in the region of rs146327691, near the SLCO1A2 gene, have been previously reported for response to serum metabolites (Krumsiek et al., 2012) and, interestingly, also for response to statins (Ho et al., 2006;Carr et al., 2019). Lastly, variants in the region of rs188167837 have been previously identified to be associated with nasopharyngeal carcinoma (Su et al., 2013). Additionally, MetaPhat decomposed most variants to substantially smaller sets of central traits than the full set of 21 traits, which can provide new biological insight regarding the variants identified.
On the other hand, the essential role of FADS2 gene region in regulating unsaturation in fatty acids was clearly reflected in MetaPhat results, as we observed as many as 18 central traits at the lead variant. Provided that the exact mechanistic roles of polyunsaturated lipids toward heart disease (Teslovich et al., 2010;Malovini et al., 2016;Pizzini et al., 2017) are under active investigation, our findings warrant further evaluation. We further confirmed good concordance (60/67, Supplementary  Table S2) with MetaPhat central traits with respect to the earlier reported GLGC associations with two or more standard lipids, and excellent concordance (12/13) with the associations with three or more standard lipids. MetaPhat optimal subsets are derived from the minimum BIC score representing the model that best describes the data when we account for both the model fit and the model dimension. Qualitatively BIC statistic is similar to the widely-used AIC (Akaike, 1973) statistic, but BIC quantitatively differs from AIC by favoring smaller dimensions, which also improves the interpretation of the optimal models. As intuitively expected, and as seen in Table 3, the driver traits tend to be members of the optimal set although they do not always agree, since the driver traits are defined by a GWAS-specific criterion of P-value threshold 5e-8, which does not need to coincide with the optimal subset chosen by a more statistically justified BIC criterion.
Our software implements flexible parameters for custom multi-thread chunking to enable high performance, genomewide, multi-trait meta-analysis while integrating metaCCA for multivariate testing followed by systematic decomposition of traits. Thus, a limitation of MetaPhat is that it relies on metaCCA, but other multivariate GWAS algorithms could also be used provided that these methods can work with univariate GWAS results as inputs and produce suitable metrics that can be used to derive the model comparison statistics. With regard to false positives, we used the standard GWAS cutoff (P = 5e-8), as carried out only a single multivariate GWAS to pick the lead variants. This cutoff can be adjusted according to the preferences of the users. MetaPhat also optionally allows the running of metaCCA+ (Cichonska et al., 2016) shown to protect against false positives via shrinkage that adds robustness to the analysis.
Finally, we remind the reader that MetaPhat decompositions are sequential, dropping one trait at a time, and hence are not guaranteed to produce the globally optimal subset. Additionally, for highly correlated traits, such as LDL and total cholesterol, the choice of which one is dropped first may not be completely robust to small changes in data.
The ability of MetaPhat to identify and visualize central traits will also be valuable in supporting efforts and pipelines (Fatumo et al., 2019) comparing results between univariate and multivariate associations as well as in studies that aim to increase specificity of multi-trait associations. We also expect that the multi-phenotype clustering results of MetaPhat can assist researchers investigating disease subtypes.

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
MP, SR, and JL conceived the project. MP developed the theory. JL implemented and tested the method. RT assisted with the testing. JL and MP draft the manuscript. All authors provided critical feedbacks and important contributions to the final manuscript.

FUNDING
This work was supported by the Academy of Finland (Grant Nos. 288509, 312076, 319181, and 325999).