Determining Genetic Causal Variants Through Multivariate Regression Using Mixture Model Penalty

With the availability of high-throughput sequencing data, identification of genetic causal variants accurately requires the efficient incorporation of function annotation data into the optimization routine. This motivates the need for development of novel methods for genome wide association studies with special focus on fine-mapping capabilities. A penalty function method that is simple to implement and capable of integrating functional annotation information into the estimation procedure, is proposed in this work. The idea is to use the prior distribution of the effect sizes explicitly as a penalty function. The estimates obtained are shown to be better correlated with the true effect sizes (in comparison with a few existing techniques). An increase in the positive and negative predictive value is demonstrated using Hapgen2 simulated data.


INTRODUCTION
Detection and estimation of the genetic causal variants associated with a particular phenotypic trait is one of the most challenging problems in modern day statistical genetics. Mathematical techniques are formulated with primary focus on fine-mapping studies, phenotype prediction, and heritability estimation (Servin and Stephens, 2007;Lee et al., 2009;Gaffney et al., 2012;Maller et al., 2012;Valdar et al., 2012;Zuber et al., 2012;de los Campos et al., 2013;International Multiple Sclerosis Genetics Consortium et al., 2013;Zhou et al., 2013;Mahajan et al., 2014;Pickrell, 2014;Spain and Barrett, 2015;Schweiger et al., 2016). Algorithms that integrate functional annotation data into the estimation procedure (Schork et al., 2013;Zhou et al., 2013;Kichaev et al., 2014;Zablocki et al., 2014;Vilhjálmsson et al., 2015) are being continually developed with the understanding that Linkage Disequilibrium (LD) and polygenicity reduces the likelihood of the identified genetic variant being biologically causal (Visscher et al., 2012). The resulting procedures have better fine-mapping and effect size estimation capabilities (Kichaev et al., 2014;Kichaev and Pasaniuc, 2015).
While fine-mapping studies focuses on detecting causal variants, regression, or Bayesian optimization methods integrate these fine-mapping results into the estimation procedure to accurately determine the effect sizes. Currently, fine-mapping studies either use summary statistics or raw genotype data to arrive at quantitative assessment of causal nature of the SNPs. For example, CAVIAR (Hormozdiari et al., , 2015, PAINTOR (Kichaev et al., 2014), and RiVIERA (Li and Kellis, 2016) use summary statistics, and DAP (Wen et al., 2016), CavMeN (Brown et al., 2017) use the raw genotype data. End results of these analyses, typically probability of the SNP being causal, are used in target gene identification studies.
With the understanding that GWAS significant SNPs harbor more than one causal variant, few researchers have attempted to utilize multivariate methods to detect additional association signals (Newcombe et al., 2016;Ning et al., 2017) from summary statistics. Newcombe et al. (2016) utilized the correlation structure of the variants from the reference panel to develop a Bayesian regression framework that accounts for various models with respect to the number of causal SNPs per region. Ning et al. (2017) used the covariance structure between the variants, and between the variant and phenotype vector to obtain LASSO results for a series of λ's (regularization parameter). These studies demonstrate the improvement achieved through additional analysis on already identified potential causal SNPs. The current line of work follows a similar strategy wherein prior information regarding the causal nature of the SNPs (in terms of p-values or posterior probabilities) are used to localize additional causal variants (using raw genotype data) that might have been gone undetected due to their small effect size, lower posterior probability, or possibly due to small sample size.
The currently available methods for variable selection and estimating the effect sizes can be broadly categorized into Bayes' theorem based or penalty function based. Bayesian methods proceed by assigning a prior probability density function (pdf) to effect sizes and use either maximum likelihood estimation method or Markov Chain Monte Carlo (MCMC) simulations to determine the posterior effect sizes (effect sizes conditioned on the measured phenotypic data). The various methods that fall under this category differ in the specification of the prior pdf (Meuwissen et al., 2001;Habier et al., 2011;Zhou et al., 2013). Some of them are the Bayesian alphabet models (BayesA, BayesB, BayesC, BayesCπ, BayesR, etc.), and Bayesian Sparse Linear Mixture Model (BSLMM). Regression methods, on the other hand, aim to minimize an objective function with a penalty term, which is chosen to impart sparse characteristics to the effect size estimates. For example, the least angle absolute shrinkage operator (LASSO) uses a L 1 penalty (Tibshirani, 1996), and the Ridge Regression (RR) uses the L 2 penalty (Hoerl and Kennard, 1970). Methods that are a combination of either Bayesian and regression methods (Bayesian LASSO; Park and Casella, 2008;Li et al., 2010) or two regression based methods (Elastic Net; Zou and Hastie, 2005) have also been developed.
While in the Bayesian methods, the prior probabilities aid in variable selection, the shrinkage constraints does the equivalent job in regression based methods. Both the Bayesian and regression methods are geared toward accurate identification of the causal variants and phenotypic prediction. A review of the currently available methods can be found in Zhou et al. (2013) and de los Campos et al. (2013). The Bayesian methods though are mostly independent of tuning parameters, suffer from practical applicability to large datasets (in terms of efficient effect size estimation).
In this work, we formulate a simple and efficient optimization routine which combines the flexibility of Bayesian methods and simplicity of penalty function methods into a single framework. The idea is to use the prior pdf of the effect sizes explicitly as a penalty function. The motivation of the paper is to introduce the method, provide details regarding the implementation of the procedure, and demonstrate its various capabilities. At this stage, theoretically, we do not claim superiority over existing methods developed for effect size estimation and phenotype prediction.

PROBLEM STATEMENT
Considering N individuals, n genetic markers, and a linear model; the N × 1 phenotype vector y is related to N × n genotype matrix X through the n × 1 vector of effect sizes β as (Meuwissen et al., 2001): Here ε is the vector of noise terms modeled as N(0, ε ). Elements of X are typically coded as 0, 1, or 2 (prior to normalization). We aspire to select the causal variants and determine their effect sizes, β such that ε = (Xβ − y) T (Xβ − y) is minimun. Due to the correlated and sparse nature of the SNPs, the univariate results often end up being erroneous estimates (Kim et al., 2009;de los Campos et al., 2013;Zhou et al., 2013). This resulted in the evolution of multivariate methods for determining the causal variants (see, for example, Hoerl and Kennard, 1970;Tibshirani, 1996;Zou and Hastie, 2005;Kim et al., 2009;de los Campos et al., 2013;Zhou et al., 2013).

Regression Methods
The Elastic Net (EN) (Zou and Hastie, 2005) provides the most generalized representation of the commonly used objective functions in penalty methods, and is given as: In the above expression, α = 1 corresponds to the LASSO (Tibshirani, 1996), and α = 0 results in the Ridge regression (RR) (Hoerl and Kennard, 1970). There exists several algorithms that minimizes the above objective function while efficiently determining the regularization parameters (Tibshirani, 1996;Fu, 1998;Efron et al., 2004;Zou and Hastie, 2005;Park and Casella, 2008;Wu and Lange, 2008). A variant of the LASSO, termed group LASSO (Meier et al., 2008) employs multiple regularization parameters to different groups of SNPs.

Bayesian Methods
Modeling β as a mixture pdf, β ∼ π 1 p 1 (β) + (1 − π 1 )p 0 (β) (Habier et al., 2011;de los Campos et al., 2013;Zhou et al., 2013), with p 1 (β) denoting pdf of the causal SNPs and p 0 (β) denoting the pdf of the null SNPs, the posterior pdf of β is given as The estimate of the effect sizes is determined as β|y , where • denotes the mathematical expectation operator. Irrespective of the distribution of β, the likelihood function, p(y|β) can be shown to be Normal with mean Xβ, and variance ε (Robert, 2004). Existing variants of the Bayesian methods could be obtained by changing the pdfs p 1 (β) and p 0 (β) (de los Campos et al., 2013;Zhou et al., 2013). Supplementary Material provides information on the equivalence between Bayesian and regression based methods for a few priors.

Mixture Model Penalty Method
We design a penalty function that intuitively accomplishes shrinking the regression coefficients while incorporating any prior information about the causal nature of the SNPs. The motivation for this penalty function stems from the understanding that the effect sizes can be realistically represented using a multimodal pdf (Meuwissen et al., 2001;Bukszár et al., 2009;Logsdon et al., 2010;Guan and Stephens, 2011;Habier et al., 2011;Yang et al., 2011;Zhou et al., 2013;Holland et al., 2016) and functional annotations help in classifying the SNPs as either being causal or not (Schork et al., 2013). In our formulation, the main error minimizing term is the negative log-likelihood function, and a mixture prior cost function imparts the necessary sparsity to the effect size estimates. Several researchers in the genetics community have used the Spike and Slab pdf (Ishwaran and Rao, 2005) as prior pdf of effect sizes (de los Campos et al., 2013;Zhou et al., 2013). However, using this pdf explicitly as a penalty function has not been attempted in genetic association studies. This also sidesteps the computationally expensive Markov Chain Monte Carlo (MCMC) method used for obtaining posterior effect size estimates.
The likelihood function, p(y|β) ∼ N(Xβ, ε ) is expressed as We construct a cost function, C that has the ability to capture the causal nature of SNPs: where the cost associated with the jth SNP is given as Hereπ 1 = [π 11 ,π 12 , · · · ,π 1n ] T is the n × 1 vector of non-null prior probabilities associated with the functional annotation of the SNPs. That is, if the jth SNP is highly likely to be causal, then a higher value (say 0.5) is specified to that SNP. p 1j (•) and p 0j (•) denote the pdf of causal and null SNPs, respectively. The cost function, thus, acts as a medium to incorporate the enrichment details of individual SNPs. Typically, we use a normal pdf, φ 1j (•; 0;σ 1j ), to model the causal effects. Note thatπ 1 denotes the assumed prior probability and π 1 denotes the true unknown probability. Denoting by L the negative log-likelihood function, the function to be minimized is written as The nonlinear conjugate gradient method (NCG) (Hestenes and Stiefel, 1952;Fletcher and Reeves, 1964;Polak and Ribiere, 1969;Shewchuk, 1994;Dai and Yuan, 1999;Hager and Zhang, 2006) with Newton-Raphson line search algorithm is used to minimize F. A step-wise implementation of the optimization procedure is given in Supplementary Material. We show that when the NCG method is used for optimization, the evaluation of whole n × n Hessian matrix can be avoided. This significantly reduces the computational cost whilst not compromising the accuracy of the solution (Equation S12). (6) is conceptually similar to the penalty function proposed by Ročková and George (2016). The authors, using a mixture Laplace pdf, estimate the prior probability of the effects using a coordinatewise optimization routine. We, however, specify different prior probabilities for each variant, so as to incorporate any information on LD or functional annotations. Furthermore, our motivation to use a mixture model stems from our understanding of the genetic architecture of the human genome. 2. The likelihood ratio and the cost function are weighed equally, so that the minimum error solution is sparse. Unequal weights can be specified, say higher for L if it is known that the genetic architecture is highly polygenic, and low if only a few genetic causal variants influence the phenotype under consideration. 3. Variants of the proposed method could be obtained by changing the pdfs used in constructing the mixture modelfor example, Laplace or non-local pdfs (Johnson and Rossell, 2010) could be used instead of two normal pdfs. These however, are minor modifications, and our main contribution lies in proposing an explicit mixture model pdf as a penalty function.

The cost function shown in Equation
4. Instead of considering cost associate with individual SNPs, the SNPs can be clustered through specification of suitable correlation. This could possibly capture the underlying LD information. However, this requires incorporation of LD metrics such as r 2 into covariance structure of the clustered SNPs. Such studies have not been pursued in this present work, and will likely be a part of future efforts. 5. Higher prior probabilities can be specified to cluster of SNPs in a given LD block that is envisioned to contain the causal signal. SNPs not in this LD block may be provided lower or zero prior probability. 6. SNPs that belong to certain functional annotation category have higher likelihood of being causal. Hence, SNPs in these regions are deemed to be enriched, i.e., have higher probability of influencing a particular phenotype. Typically SNPs tagging regulatory and coding regions are considered to be enriched in comparison with introns and intergenic SNPs (Schork et al., 2013). SNPs in the MHC region can be considered to be enriched when studying immune related diseases (Ellinghaus et al., 2016). 7. Using existing packages such as CAVIAR, DAP, PAINTOR, RiVIERA, and S-LDSC , one could obtain a quantitative assessment of the causal nature of individual SNPs. These results can be directly used as prior probabilities (π 1 ) in the proposed optimization routine. Probabilities could also be based on GWAS p-values. However, these values tend to alter with increase in power. 8. As mentioned earlier, for each regression based method, there exists a Bayesian equivalent. In the Bayesian methods, assuming a prior pdf, samples are drawn from the posterior distribution using MCMC. The proposed method avoids sampling from the prior and posterior pdf of the effect sizes by specifying the prior information explicitly as a penalty function. This distinguishes the method from the Bayesian LASSO and BSLMM. 9. Fine-mapping methods typically require data from dense genotyping arrays, which are further imputed using reference panels, such as 1,000 Genomes (1000Genomes Project Consortium et al., 2012. The mixture-model method, on the other hand, uses whole genome wide data to locate the causal signal. In this aspect, genotype data preferred for finemapping studies, may be unsuitable for the proposed method.

Whole Genome Analysis
For whole genome analysis, due to computational issues, the analysis can be carried out using SNP windows such that no two potential causal SNPs in LD are separated (Berisa and Pickrell, 2016). In this work, we consider SNPs associated with individual chromosomes in each sliding window. For each chromosome, the first 20,000 SNPs with minor allele frequency >0.01 are considered in the analysis, resulting in a total of 440,000 SNPs. The number of causal variants are taken to be 50% of the SNPs in the functional annotation category-Exon, 3 ′ UTR, 5 ′ UTR. This gives rise to 4,233 causal SNPs-approximately 1% of the total SNPs considered. Thus, SNPs belonging to these categories are considered to be enriched, i.e., have higher likelihood of being causal. Three different genotype matrices and three different true effect sizes are considered for the analysis, resulting in a total of 18 cases for estimating the normalized mean squared error (NMSE). The phenotype is simulated using Equation (1) with a heritability of 0.5. Specifying prior probability to individual SNPs requires functional annotation information for the genotyped SNPs. Assuming such significant information is unavailable, we initially specify equalπ 1 andσ 1 values for all the SNPs, i.e.,π 1j = π 1 ,σ 1j = 1, ∀j-Mixture Model with Constant Priors (MM-CP). An estimate of ε for determining the likelihood function is obtained by assuming a heritability of 0.0227 per SNP window. For comparison, results are obtained using the Regularized Pseudo Inverse (RegPI)-analytical solution withπ 1 = π 1 and no mixture model (Supplementary Material), LASSO, and univariate regression method. The enrichment factors used in simulating the data are specified as prior probabilities in the optimization routine, and the resulting estimates are denoted as Mixture-Model with Enriched Priors (MM-EP), that is,π 1 = π 1 . The RegPI method and MM-EP methods differ only in the procedure followed to obtain the effect size estimates, and in principle, are equivalent. While the RegPI method provides a closed form solution for the effect sizes, the MM-EP method utilizes an optimization algorithm to achieve the same goal. As mentioned in section 3.3.1, results from a prior analysis (typically fine-mapping studies) could be used to improve the detection and estimation capabilities of the method. We use end results of DAP as prior probabilities and the resulting estimates are denoted as MM-DAP. It is to be noted that the MM-DAP method utilizes the genotype information twice, once for estimating the prior probability of causality (DAP), and once in our optimization routine (for effect size estimation and detection of additional causal variants). However, the context in which the information is used slightly differs. An adaptive/iterative method for estimating the causal probabilities could avoid this. We are working toward achieving this. Matlab implementation of the proposed method is included along with this paper. The implementation has provision for gradual increment or decrement ofπ 1 and σ 0 values.

RESULTS
The correlation between the true and estimated effect sizes, the percentage of positive and negative predictive values (PPV, NPV) are used to measure the accuracy of different methods (Figure 1). PPV is defined as the ratio of number of true variants identified to the total number of variants identified. Similarly, NPV is defined as the ratio of number of true null SNPs identified to the total number of null SNPs identified. For the RegPI and MM-EP methods,π 1 = π 1 , hence these methods provide an upper bound for both the effect size correlation, PPV and NPV. This can be considered as a case where complete genetic architecture of the effect size distribution is known. For the MM-CP method,π 1j = 0.01∀j, and the null pdf is taken to be Laplacian. Specifyingπ 1j = 1 is the special case of the infinitesimal model, where all the SNPs are assumed to be causal with Normal effect size distribution. The MM-DAP method used DAP results as prior probabilities for the genetic variants. This constitutes partial knowledge about the distribution of causal SNPs in the genome. The MM-CP, Infinitesimal, LASSO, and Univariate methods do not use functional annotation information. Thus any improvement in the effect size estimates obtained using MM-CP method, in comparison with the other three, is deemed significant. Though a sparse structure is imposed on the penalty function in the MM-CP method, the method essentially does not incorporate any enrichment factors. A slight improvement in the effect size correlation can be observed in Figure 1. The figure illustrates the advantage of the proposed formulation in terms of locating the causal variants accurately. The positive and negative predictive value for the Infinitesimal and Univariate methods are not shown in the figure. In obtaining the LASSO estimates, initially, a two-fold cross validation has been carried out for SNP window 1 (i.e., chromosome 1), resulting in λ = 0.508. NMSE estimates are obtained for a grid of values between 0.45 and 0.70 for all the other chromosomes, and the estimates corresponding to the minimum NMSE value is reported in Figure 1. The MM-DAP estimates lie between the MM-CP and MM-EP estimates, as the prior probabilities are based on a previous analysis which identifies few significant SNPs.
The correlation between estimated and true effect sizes, PPV and NPV values have been obtained for a grid ofπ 1 values and plotted in Figures 2, 3, respectively. Similar study is carried out for LASSO (with respect to the regularization parameter λ). For the MM-CP method, the x-axis is plotted in the reverse direction so that moving along the x-axis toward right implies increase in sparseness (consistent with the x-axis of LASSO). From Figures 2, 3, it can be observed that the MM-CP and LASSO estimates follow a similar trend with increase in sparseness. Enforcing a sparse structure with arbitrary prior probabilities for the effect sizes result in estimates that are better at localizing the genetic causal variants. A good understanding of the distribution of the SNPs across the genome helps in incorporating the functional annotations, thereby further improving the effect size estimates-RegPI and MM-EP methods. A sensible choice of the enrichment factors require prior knowledge about the phenotype under study. For example, SNPs in the MHC region are shown to have a larger impact on Ankylosing Spondylitis than the non MHC SNPs (Ellinghaus et al., 2016). In this case, one could provide a higher π 1 value for the SNPs in the MHC region (differential enrichment). The MM-DAP results are obtained using this strategy, i.e. use prior information to improve the performance of the estimation procedure.

DISCUSSION
The correlation plotted in Figure 2 reflects the accuracy of the estimation procedures, that is, how close the estimated effect size is to the true effect size (which is unknown). Hereπ 1 = 1 implies that all the SNPs (regression coefficients) contribute to the phenotype y. This leads to the underestimation of the effect sizes due to the distribution of the true signal among several SNPs. The sparse representation, sayπ 1 = 10 −2 on the other hand has the advantage of distributing the total signal among few selected non-zero SNPs. Depending on the selected SNPs, the correlation between true and estimated SNPs may vary. For the infinitesimal case, though all the causal SNPs have been identified, will result in low correlation value, because the effect sizes of these causal variants have been underestimated. The same phenomenon can be observed when the infinitesimal model (π 1 = 1) is compared with LASSO in Figure 1.
It is straightforward to note that the amount of shrinkage achieved depends on the characteristics of the penalty function used. Using the mixture model pdf explicitly as a penalty function places a probabilistic sparse constraint on the effect sizes, as opposed to the distance based constraints used typical in penalty function based method. Specifying a sparse structure without any knowledge about the underlying genetic architecture (i.e., specifying an arbitrary π 1 ) is shown to equip the optimization routine with better fine-mapping capabilities, and result in estimates that are at least as good as the LASSO and Univariate methods.
The non-linear conjugate gradient method is used to solve the optimization problem efficiently by harnessing the structure of the objective function's Hessian matrix (Supplementary Material). Thus, the method in its current form can be applied to whole genome analysis without any difficulty. Application of the method to specialized chip sequenced data, say the Immunochip or Oncochip, requires a careful approach in specifying theπ 1 values to various SNPs. We are working toward developing methodologies to automatically determine the probability of SNP association, and SNP correlation with in the optimization framework. Alternately, end results from existing fine-mapping studies such as DAP or PAINTOR can be used as prior probabilities. Therefore, the method needs to be viewed as an efficient optimization algorithm capable of integrating functional annotation data (if available). Interpretation of the framework as a means to incorporate functional annotation and LD information, while at the same time achieving good variable selection and effect size estimation capabilities are some of the features we believe are important to the genetics community.

AUTHOR CONTRIBUTIONS
VS: Wrote manuscript, performed analyses, contributed to study design, interpretation of results, and critical revision of manuscript; C-CF and DH: Contributed to data preparation, interpretation of results, and critical revision of manuscript; AD: Performed analyses, contributed to study design, interpretation of results, and critical revision of manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.