Abstract
Introduction: Compared to Genome-Wide Association Studies (GWAS) for common variants, single-marker association analysis for rare variants is underpowered. Set-based association analyses for rare variants are powerful tools that capture some of the missing heritability in trait association studies.
Methods: We extend the convex-optimized SKAT (cSKAT) test set procedure which learns from data the optimal convex combination of kernels, to the full Generalised Linear Model (GLM) setting with arbitrary non-genetic covariates. We call this extended cSKAT (ecSKAT) and show that the resulting optimization problem is a quadratic programming problem that can be solved with no additional cost compared to cSKAT.
Results: We show that a modified objective is related to an upper bound for the p-value through a decreasing exponential term in the objective function, indicating that optimizing this objective function is a principled way of learning the combination of kernels. We evaluate the performance of the proposed method on continuous and binary traits using simulation studies and illustrate its application using UK Biobank Whole Exome Sequencing data on hand grip strength and systemic lupus erythematosus rare variant association analysis.
Discussion: Our proposed ecSKAT method enables correcting for important confounders in association studies such as age, sex or population structure for both quantitative and binary traits. Simulation studies showed that ecSKAT can recover sensible weights and achieve higher power across different sample sizes and misspecification settings. Compared to the burden test and SKAT method, ecSKAT gives a lower p-value for the genes tested in both quantitative and binary traits in the UKBiobank cohort.
1 Introduction
Genome-wide association studies (GWAS) (Visscher et al., 2017) have been shown to be a powerful way to identify common genetic variants (Cordell and Clayton, 2005); (Hindorff et al., 2009). However, for most diseases, the common susceptibility variants identified to date explain only a small proportion of the heritable component of disease risk. It is known that for low-frequency variants and rare variants, the power to detect the effect is limited (Lee et al., 2014) in GWAS. Rare variants are known to play an important role in human diseases. It is a well-established hypothesis that rare variants may be able to explain the missing heritability (Zuk et al., 2014).
The common approach to tackle this problem is to aggregate the variants in a gene set. Most of these testing procedures can be classified into two groups: the burden test, which collapses the SNPs in the gene set into one scalar value to then be regressed onto the trait (Morgenthaler and Thilly, 2007); (Lee et al., 2014); (Guo et al., 2016), and variance component tests, of which the sequence kernel association test (SKAT) is the prototypical procedure in genetic testing (Wu et al., 2011; Lee et al., 2012; Liu et al., 2019). SKAT is more flexible than the burden test because it makes fewer assumptions about the data, but the burden test has greater power when a large proportion of variants are causal and effects are in the same direction. In reality, the burden test can be shown to be a special case of SKAT (Wu et al., 2011), and the unifying framework is that of kernels (Aronszajn, 1950; Hofmann et al., 2008). The kernel framework allows for building a kernel, encoding affinity between two objects such as gene sets (Borgwardt, 2011). However, with the current availability of data, it is possible to learn a superior kernel from the data itself, commonly known as learning the kernel or multiple kernel learning (Cortes et al., 2012; Gönen and Alpaydın, 2011; Sonnenburg et al., 2006). Posner et al. (2020) proposed a way to learn a kernel as a convex combination of linear kernels from the data itself, calling this procedure convex SKAT (cSKAT). Although Posner et al. (2020) claimed to handle non-genetic covariates beyond the regression setting, their derivation is flawed due to a mistake, leading to the wrong denominator in the objective (Supplementary Appendix SA1). Due to this flaw, their method (cSKAT) only optimizes the correct objective for the regression setting with no non-genetic covariates and does not extend to the case when we have non-genetic covariates, which is important for accounting for population stratification (Cardon and Palmer, 2003), for potentially binary traits.
In this paper, we extend cSKAT (Posner et al., 2020) to any statistical model that comes under the mixed linear model (Zhang et al., 2010) (an extension of the generalized linear model (GLM) framework Nelder and Wedderburn (1972), which generalizes regression to a large family of models, including logistic regression) while allowing for non-genetic covariates. We call the resulting model extended cSKAT (ecSKAT). We note that although Posner et al. (2020) focused on annotated genetic data, their method applies beyond this setting (Section 2.4), which shows the process in detail. Although ecSKAT can be applied to annotated genetic data, we do not pursue this direction here and instead focus on the standard genetic data included in our experiments. The model has several advantages over standard SKAT because it allows for learning the kernel in a data-driven way by solving an optimization problem shown to be equivalent to a quadratic program (QP), leading to increased power over using a hand-picked kernel. Hence, it has the same computational and memory complexity as that of Posner et al. (2020), so the additional generality comes with no extra cost. Theoretically, we use concentration inequalities for a convex combination of independent χ2-random variables to show that a modified ecSKAT objective relates to the null p-value through an upper bound that reaches zero exponentially in the objective value.
2 Materials and methods
We are interested in hypothesis testing using the score test in the context of genomic studies; in particular, given a dataset of patients with a response, non-genetic covariates, and a number of gene sets, for each gene set, we are interested in testing for the association between the gene set and response, taking into account the non-genetic covariates. Our work extends cSKAT (Posner et al., 2020), which itself comes from the line of work initiated by Wu et al. (2011). Interestingly, both Lee et al. (2012) and Ionita-Laza et al. (2013) proposed a convex combination of two very specific kernels, but the papers did not make this connection explicitly, which was pointed out in Larson et al. (2019). Zhao et al. (2015) created an algorithm for learning a kernel from multiple base kernels, which is probably the work closest to our proposed method, except for cSKAT. However, their algorithm is expensive as it requires us to calculate p-values for each iteration, while ours is efficient due to the optimization step being QP.
2.1 Model
We consider a genetic association study of sample size n where we try to find a significant association between genetic variants and some outcomes (a trait or phenotype generally associated with disease) while controlling for stratification by taking into account non-genetic covariates. Given non-genetic and genetic covariates denoted by and g ∈ {0,1,2}p1, respectively, and output from n subjects, where m and p are the non-genetic and genetic dimensions, we collect the data into a dataset . Additionally, we define the non-genetic and genetic design matrices and , respectively, and the output vector . For logistic regression, we encode an outcome as 1 and a lack of outcomes as 0. We limit ourselves to the setting of linear and logistic regression but note that other models, such as multinomial and Poisson regression, are easily handled due to the flexibility of the GLM framework (Nelder and Wedderburn, 1972), and the model we use for the testing procedure is that of the generalized linear mixed-effect model (Gelman and Hill, 2006), relating the phenotype to the genetic and non-genetic covariates.
For a gene set giving rise to the genetic design matrix G, we are interested in testing for an association between the trait and genetic information. We use the standard frequentist hypothesis testing framework (Casella and Berger, 2021), and we formulate the null and alternative hypotheses as follows:
Letting h to be linear in some set of parameters , then the function space containing h consists of linear functions of the form with some inner products defined between functions , and for some positive semi-definite Λ. If Λ is full-rank, Eq. 1 becomes
2.2 Score tests
SKAT was first introduced in Wu et al. (2011), who highlighted the need for new association tests that may take into account the importance of rare variants while being able to incorporate genetic effects that are sparse and have different directions of impact on the response, such as some genetic variants being deleterious while others being beneficial. The burden test (Madsen and Browning, 2009), another commonly used testing procedure, typically struggles with these kinds of settings. SKAT uses the variance component score test (Lin, 1997) to devise a testing procedure that takes into account effect sizes of differing signs and allows injecting prior knowledge about how rarity is related to the magnitude of effect size.
The family of SKAT-like testing procedures, including their many descendants (Lee et al., 2012; 2014) (Larson et al., 2019), assume that β follows Gaussian distribution, which makes it explicitly linked to Gaussian processes (Rasmussen, 2003), and something that was also pointed out in Wu et al. (2011). In particular, assuming that h follows a Gaussian process with mean function 0 and covariance function τK(⋅, ⋅) with τ ≥ 0 and K a valid kernel function, the vectorized output follows the Gaussian distribution N(0, τK)2, where Kij = K(gi, gj) is the so-called kernel matrix encoding affinity between the individuals through the gene set. When , this is equivalent to β ∼ N(0, τ ⋅ diag(w))3 for some w, where we assume w ∈ △p, with △p being the set of probability vectors of dimension p4, for simplicity. Choosing w allows us to design a test that fits the domain knowledge by weighing the contribution of individual variants according to the weight vector w. From here on, we will only consider weighted linear kernels where w ∈ △p.
In this case, we can reformulate Eq. 1 as an equivalent to testing ifand the variance component test (Lin, 1997), in this case, reduces to the SKAT statistic (Wu et al., 2011). Let and be the maximum likelihood conditional mean and be the raw residuals of the null model. Then, the SKAT statistic takes the following form:where is the maximum likelihood estimate of the dispersion parameter. For the case of a binary outcome, and for the continuous outcome which is the residual sample variance under the null model. Under the null hypothesis, the asymptotic distribution of follows a mixture of independent variables with the mixture coefficients being the eigenvalues of the matrix , where , P0 is a positive semi-definite, and V = diag(v), where vi is the maximum likelihood conditional variance of yi under the null model. For binary outcomes, , while for continuous outcomes, it takes the form (Wu et al., 2011; Posner et al., 2020). We let λ0 be the eigenvalues of A, denoted by λ0 = eig(A).
Based on the aforementioned findings, we obtain the p-value function as follows:where and the probability is with respect to the null distribution. We cannot evaluate p0 analytically, but it can be evaluated numerically up to arbitrary precision using Davie’s method (Davies, 1980) or approximately (Liu et al., 2009).
2.3 Extended convex-optimized SKAT
In this section, we extend the analysis of Posner et al. (2020) as follows: first, we show that their annotated kernel formulation can be reformulated as a specific kernel through an explicit feature map of g and we generalize this formulation. Second, we derive the objective in case of the existence of non-genetic covariates and models other than linear regression and show that this leads to an objective that results in a similar but qualitatively different solution compared to centered kernel target alignment, which can nevertheless be solved efficiently through a QP. Finally, using a large deviation theory, we show that the objective proposed in Posner et al. (2020) is related to an upper bound on the p-value under the null hypothesis as maximizing the objective minimizes the upper bound, putting the proposed solution on a principled footing and clarifying the nature of the weights that ecSKAT learns.
2.4 Reducing cSKAT to ecSKAT
Although Posner et al. (2020) introduced their method in the setting of learning with genetic annotations, we show, in this section, that their method is a multiple kernel learning method in disguise and so can be applied to any setting where we have a dataset in the form of an input matrix G, confounding input matrix X and traits y, where G can now be any matrix of stacked feature vectors per patient which we want to relate to traits y. In particular, G can be the original genetic design matrix or the aggregated genetic annotation matrix defined as follows.
Posner et al. (2020) assumed that for each SNP indexed by j, we have a sequence of annotation vectors , where dl is the dimensionality of al,j for any j. The form of the kernel they proposed is , where w ∈ △L and . It should be noticed that is a scalar function of al,j, so replacing this by any scalar function , where we enforce positivity to make sure that Kl is a valid kernel, does not change the form of Kl. This would allow for further flexibility in choosing how to aggregate annotation data, if available, with some suggestions being for p ≥ 1 or ϕl(x) = exp(−x), which leads to the so-called softmax weighing function. In practice, one could choose ϕl from a pool of candidates (for example, from those outlined previously) using cross-validation on the train set ((Hastie et al., 2009; Section 7.10) for an introduction to cross-validation). For simplicity, we can assume from here that ϕl(x) = x2, which reduces to using the aggregation method in Posner et al. (2020). Let be the vector such that Φl,j = ϕl(al,j), Dl = diag(Φl), and . Then, let the transformed genetic vector be and define to be the design matrix of this new genetic dataset of the cohort. The kernel matrix can then be expressed in the form , which is of the form we considered previously. This shows that the setting of Posner et al. (2020) can be handled in the general linear kernel case where the genetic feature vectors are first preprocessed using F, and w is then learned using multiple kernel learning techniques (Cortes et al., 2012). In this work, we do not use annotations for simplicity and only consider .
2.5 General cSKAT
Posner et al. (2020) laid out a strategy for how to select w in a data-driven way but only derived an explicit form for the case of no non-genetic covariates (only fitting α0, which can be shown to be equivalent to centering y in this case) and linear regression. Here, we solve the full case when X is non-zero and for any valid conditional response model that comes under the GLM framework with a canonical link function. They proposed to split the data into a train and a validation set, D = Dtr ∪ Dts, where Dtr is used to finding w and Dts to perform the hypothesis test using w. The learning procedure of w is defined through the following objective:where we view and λ0 as functions of Kw through w. The induced optimization problem becomes
As shown in theorem 1.1, we may rewrite , where and is the component score vector where the square is applied component-wise, and the solution w* in Eq. 7 can be shown to be proportional to the solution of QP:which can be solved effectively using modern convex solvers (Diamond and Boyd, 2016).
2.6 Relating optimization objective to p-value
As pointed out in Posner et al. (2020, A1), there is no a-priori reason that optimizing the objective (Eq. 6) will lead to a test with good power, what we would like to do theoretically is to maximize the power directly. As a proxy to power maximization, we would instead prefer minimizing the p-value (Eq. 5) on the training set in terms of w. However, it is not clear how to optimize the p-value since it is highly non-convex and complicated. A commonly used approach in optimization is to instead optimize an upper bound, p0(Q(w); w) ≤ u(w), where u(w) is tight and convex. Here, we have explained explicitly the dependency of p0(q) on w.
In theorem 1.2, we show using large deviation theory (Wainwright, 2019; Vershynin, 2018) in the form of sub-exponential concentration inequalities applied to the linear combination of independent random variables that the p-value is upper-bounded throughwhere J(w) differs from the cSKAT objective (Eq. 7) as the numerator now takes the form , where b = diag(B), the diagonal of B as a vector, instead of . Assuming that J(w) ≥ 1, then the upper bound is exp(−J(w)/8). Since the function f(x) = exp(−x/8) is decreasing, we observe that minimizing f is equivalent to maximizing J(w), which again reduces to a QP problem similar to Eq. 22. The result shows that the modified cSKAT objective is a principled objective and that maximizing it is equivalent to minimizing an upper bound on the p-value, and furthermore, as the objective grows in value, the upper bound decreases and reaches zero as J(w) → ∞ exponentially fast. In particular, for any a ≥ 0, fixing a minimum level p0(Q(w); w) ≤ 10−a can be certified as long as , which is easy to check after optimization.
3 Results
3.1 Synthetic and semi-empirical models
In order to benchmark the models (including ecSKAT), we need to know how the genetic and non-genetic covariates are related to the output. We simulate the data using a model of the relationship of the GLM form as follows:where we generate α0, α and β together with a potential genetic interaction term , where Γ has zero diagonal and is only non-zero for the causal terms corresponding to β. The interaction term is only used when evaluating the performance of the model under model misspecification and Γ = 0 when there is no model misspecification. Model misspecification aims to answer the question of what happens when using a model that specifies some assumptions of the world when these assumptions are violated in some pre-specified way. In this case, we aim to capture what happens when the model is violated in the sense that the linear term is replaced by the linear and interaction term . As in practice, our model is always misspecified, seeing how the methods performing under misspecification is integral, and we would prefer the procedure to degrade gracefully when there is non-severe model misspecification. The following settings depend on the marginal distribution and functional relationship, among others (Supplementary Appendix SA1), for a detailed specification for each setting.
We benchmark ecSKAT against the burden test using the sum as an aggregation, uniform SKAT, where the weights are equal to 1/p, and SKAT, where we set the weights using β(1, 25)-pdf, as outlined in Wu et al. (2011); we denote these algorithms by ecSKAT, Burden (Sum), SKAT (Uniform), and SKAT (β(1, 25)) in the figures and experiments, in terms of estimated Type I error and power. For the marginal distributions of the non-genetic and genetic covariates (ρX and ρG), we either generate them synthetically or use the empirical distributions of the UKBiobank dataset through using available datapoints (patients) relating to the gene PARK7 with non-genetic covariates of age (in years), sex (one-hot encoded), and 10 principal components from the full genetic whole-exome sequencing (WES) dataset and sample without replacement (due to the size of the database the id violation is negligible) from these patients 1,000 times in order to get 95% confidence intervals for Figures 3, 4. For ecSKAT, we use a train ratio of 0.3. Another approach would be to consider the marginal distribution of evolutionary simulation models such as in Wu et al. (2011), Hamilton (2021), and Yuan et al. (2012). We instead use the empirical data directly as they explain the ground truth and large sample size in UKBiobank to make this feasible without introducing artifacts due to the finite size of the original dataset.
Figures 1, 2 show that under idealized settings, ecSKAT manages to recover sensible weights, in particular, weights that add mass to semi-rare causal variants. Although this does not prove that the resulting test will perform well, it provides evidence that ecSKAT discards non-causal SNPs.
FIGURE 1
FIGURE 2
In the Type I error experiment (Figure 3), we show how the Type I error behaves as a function of the size of the dataset under the correctly specified setting. We test this for two sample sizes, 1,000 and 1,0000, and for α = 10−1, 10−2, 10−3, and perform this experiment 1,000 times to get 95% confidence intervals. We see that all algorithms control the Type I error for big α but struggle for α = 10−3. However, it should be noted that for all algorithms in all plots, the current significance level falls within the confidence interval.
FIGURE 3
Finally, for the power experiment (Figure 4), we look at the power of the correct (row 1) and misspecified cases (row 2) for different significance levels. As in the Type I error experiment, we repeat this experiment 1,000 times and calculate 95% confidence intervals. It should be noted that the confidence intervals are too small to be seen. For the correctly specified case (row 1), ecSKAT rejects the null hypothesis correctly for all plots, which can be seen by the straight line at 1 (maximum power). For a smaller α and large sample size, the other SKAT methods (β(1, 25), Uniform) also have maximum power, while the burden (Sum) test fails to perform well, probably because it assumes all weights to have the same sign, which is not true here. For the misspecified case (row 2), we see that only SKAT (β(1, 25)) and ecSKAT manage to perform well with their performance improving in the sample size and decreasing with smaller α. From this, we can see that ecSKAT performs best and SKAT variants perform well in terms of power for the correctly specified case, while only ecSKAT and SKAT (β(1, 25)) perform well in the misspecified case, probably due to the data-dependent nature of how they reweigh each genetic covariate.
FIGURE 4
3.2 Application to the UKBiobank data
We applied our proposed method to analyze the UK Biobank exome sequencing data. We tested associations of hand grip strength (quantitative trait) in 73,424 individuals and systemic lupus erythematosus (SLE) (binary trait) with 966 cases and 4,296 controls, adjusting for sex, age, and 10 principal components. We restricted our analysis to the predicted loss-of-function (LoF; i.e., essential splice site changes, stop codon gain, or frameshifts) (MacArthur et al., 2012)) variants with MAF <0.01. In addition to testing for an association via ecSKAT, we also applied the weighted sum burden test, weighted max burden test, and SKAT with a weighting of β(1, 25)-pdf of the MAF. For handgrip strength, we analyzed variants in TPTEP2-CSNK1E and ZDHHC8 (Karczewski et al., 2022), which were reported to be associated with the phenotype in Genebass. For SLE, we uncovered the association of aggregation of LoF in gene PCSK9, which was not reported in Genebass but was shown to be associated with disease activity in SLE (Frostegard et al., 2020). The underlying cause could be that oxidized LDL promotes DC activation, which depends on PCSK9, with a higher effect among SLE patients. PCSK9 could play an unexpected immunological role in SLE. Our proposed ecSKAT was the most powerful test for both quantitative and binary traits and has much smaller p-values than the burden test and SKAT for the genes that we tested (Tables 1, 2).
TABLE 1
| Method | |||||
|---|---|---|---|---|---|
| Gene | Category | Burden (sum) | Burden (max) | SKAT | ecSKAT |
| TPTEP2-CSNK1E | LoF | 5.14 × 10−2 | 5.14 × 10−2 | 4.73 × 10−2 | 3.08 × 10−2 |
| ZDHHC8 | LoF | 4.91 × 10−2 | 4.91 × 10−2 | 4.85 × 10−2 | 3.76 × 10−2 |
Association test for hand grip strength (p-value).
TABLE 2
| Method | |||||
|---|---|---|---|---|---|
| Gene | Category | Burden (sum) | Burden (max) | SKAT | ecSKAT |
| PCSK9 | LoF | 1.4 × 10−5 | 2.5 × 10−5 | 1.0 × 10−5 | 0.8 × 10−5 |
| CSNK2A1 | LoF | 2.47 × 10−2 | 2.47 × 10−2 | 2.45 × 10−2 | 2.31 × 10−2 |
Association test for systemic lupus erythematosus (p-value).
4 Discussion
This study generalised the cSKAT formulation to general GLM models with non-genetic covariates and showed that this formulation, while being considerably more general and applicable in practice as compared to the linear model, the no covariate setting of
Posner et al. (2020)still allows for finding the optimal weights through a QP, thus being equally computationally complex to the simpler setting. Our theoretical and methodological contributions are threefold.
1. We showed that the weighted annotation method of Posner et al. (2020) can be formulated as an instance of the SKAT setting where we first apply a linear feature map to the genetic covariates.
2. We completed the analysis of cSKAT for the case of an arbitrary GLM model when the covariates are non-zero, showing that the objective can be solved using a similar QP procedure as the original cSKAT algorithm, retaining the same computational complexity.
3. Finally, we showed that a slight modification of the cSKAT objective is related to an upper bound on the p-value as a function of w and that this bound is tight as the objective goes to infinity, indicating that cSKAT is a principled objective since it relates well to the objective of the study (the actual p-value).
Simulation studies showed that ecSKAT can recover sensible weights and achieve higher power across different sample sizes and misspecification settings. In real data analysis, we applied the method to both the binary (SLE) and quantitative (hand grip strength) traits in the UKBiobank cohort. Compared to the burden test and SKAT method, ecSKAT gives a slightly lower p-value for the genes tested in both quantitative and binary traits.
In the future, we would like to theoretically analyze the power in terms of the true value w* and the size of the training and validation sets. Furthermore, given a fixed dataset size n, we would like to analyze the optimal training set size, which would be of interest in practice. Finally, we would like to perform more large-scale experiments, in particular, on the newly released 500-k WES cohort of UKBiobank (Backman et al., 2021; Szustakowski et al., 2021).
Statements
Data availability statement
The data analyzed in this study are subjected to the following licenses/restrictions: The UK Biobank resource is available to bona fide researchers for health-related research in the public interest and can be accessed via the Access Management System. Requests to access these datasets should be directed to UKBiobank.
Ethics statement
The studies involving humans were approved by The UK Biobank Ethics Advisory Committee. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.
Author contributions
QG conceived the project and provided guidance. IF and QG drafted and revised the manuscript. IF developed the theoretical formalism and performed simulations. QG, MZ, and JN conducted experiments. QG and IF contributed to the analysis and interpretation of the data. All authors contributed to the article and approved the submitted version.
Acknowledgments
This research was conducted using the UK Biobank Resource under Application Number 43138. Using real patient data is crucial for clinical research and finding the right treatment for the right patient. The authors would like to thank all participants who are a part of the UK Biobank, and who volunteered to give their primary and secondary care and genotyping data for the purpose of research. UKBiobank is generously supported by its founding funders the Wellcome Trust and UK Medical Research Council, as well as the British Heart Foundation, Cancer Research UK, Department of Health, Northwest Regional Development Agency, and Scottish Government.
Conflict of interest
Authors MZ, JN, and QG were employed by BenevolentAI.
The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2023.1245238/full#supplementary-material
Footnotes
1.^The value is given by the number of minor variants of the SNP at the marker, and other schemes can trivially be handled.
2.^N(μ, V) is the Gaussian distribution with mean μ and covariance matrix V.
3.^For a vector , diag(x) is the n × n diagonal matrix with x on the diagonal.
4.^Explicitly, w ∈ △p is equivalent for any i ∈ {1, …, p}, wi ≥ 0, and .
References
1
AronszajnN. (1950). Theory of reproducing kernels. Trans. Am. Math. Soc.68, 337–404. 10.1090/s0002-9947-1950-0051437-7
2
BackmanJ. D.LiA. H.MarckettaA.SunD.MbatchouJ.KesslerM. D.et al (2021). Exome sequencing and analysis of 454,787 UK biobank participants. Nature599, 628–634. 10.1038/s41586-021-04103-z
3
BorgwardtK. M. (2011). “Kernel methods in bioinformatics,” in Handbook of statistical bioinformatics (Springer), 317–334.
4
CardonL. R.PalmerL. J. (2003). Population stratification and spurious allelic association. Lancet361, 598–604. 10.1016/S0140-6736(03)12520-2
5
CasellaG.BergerR. L. (2021). Statistical inference. Cengage Learning.
6
CordellH. J.ClaytonD. G. (2005). Genetic association studies. Lancet366, 1121–1131. 10.1016/S0140-6736(05)67424-7
7
CortesC.MohriM.RostamizadehA. (2012). Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res.13, 795–828. 10.5555/2503308.2188413
8
DaviesR. B. (1980). The distribution of a linear combination of χ2 random variables. J. R. Stat. Soc. Ser. C Appl. Statistics)29, 323–333. 10.2307/2346911
9
DiamondS.BoydS. (2016). Cvxpy: A python-embedded modeling language for convex optimization. J. Mach. Learn. Res.17, 83–2913. 10.5555/2946645.3007036
10
FrostegardJ.FrostegardJ.RahmanM.HafstromI.AjeganovaS.LiuA. (2020). Pcsk9 is associated with disease activity and implicated in immune activation and cardiovascular disease in systemic lupus erythematosus. Eur. Heart J.41, ehaa946–3623. 10.1093/ehjci/ehaa946.3623
11
GelmanA.HillJ. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
12
GönenM.AlpaydınE. (2011). Multiple kernel learning algorithms. J. Mach. Learn. Res.12, 2211–2268. 10.5555/1953048.2021071
13
GuoM. H.DauberA.LippincottM. F.ChanY.-M.SalemR. M.HirschhornJ. N. (2016). Determinants of power in gene-based burden testing for monogenic disorders. Am. J. Hum. Genet.99, 527–539. 10.1016/j.ajhg.2016.06.031
14
HamiltonM. B. (2021). Population genetics. John Wiley & Sons.
15
HastieT.TibshiraniR.FriedmanJ. H.FriedmanJ. H. (2009). The elements of statistical learning: Data mining, inference, and prediction, 2. Springer.
16
HindorffL. A.SethupathyP.JunkinsH. A.RamosE. M.MehtaJ. P.CollinsF. S.et al (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci.106, 9362–9367. 10.1073/pnas.0903103106
17
HofmannT.SchölkopfB.SmolaA. J. (2008). Kernel methods in machine learning. Ann. statistics36, 1171–1220. 10.1214/009053607000000677
18
HornR. A.JohnsonC. R. (2012). Matrix analysis. Cambridge University Press.
19
Ionita-LazaI.LeeS.MakarovV.BuxbaumJ. D.LinX. (2013). Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet.92, 841–853. 10.1016/j.ajhg.2013.04.015
20
KarczewskiK. J.SolomonsonM.ChaoK. R.GoodrichJ. K.TiaoG.LuW.et al (2022). Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK biobank exomes. Cell. Genomics2, 100168. 10.1016/j.xgen.2022.100168
21
LarsonN. B.ChenJ.SchaidD. J. (2019). A review of kernel methods for genetic association studies. Genet. Epidemiol.43, 122–136. 10.1002/gepi.22180
22
LeeS.AbecasisG. R.BoehnkeM.LinX. (2014). Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet.95, 5–23. 10.1016/j.ajhg.2014.06.009
23
LeeS.EmondM. J.BamshadM. J.BarnesK. C.RiederM. J.NickersonD. A.et al (2012). Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet.91, 224–237. 10.1016/j.ajhg.2012.06.007
24
LinX. (1997). Variance component testing in generalised linear models with random effects. Biometrika84, 309–326. 10.1093/biomet/84.2.309
25
LiuH.TangY.ZhangH. H. (2009). A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Statistics Data Analysis53, 853–856. 10.1016/j.csda.2008.11.025
26
LiuY.ChenS.LiZ.MorrisonA. C.BoerwinkleE.LinX. (2019). Acat: A fast and powerful p value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet.104, 410–421. 10.1016/j.ajhg.2019.01.002
27
MacArthurD. G.BalasubramanianS.FrankishA.HuangN.MorrisJ.WalterK.et al (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science335, 823–828. 10.1126/science.1215040
28
MadsenB. E.BrowningS. R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet.5, e1000384. 10.1371/journal.pgen.1000384
29
MorgenthalerS.ThillyW. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (cast). Mutat. Research/Fundamental Mol. Mech. Mutagen.615, 28–56. 10.1016/j.mrfmmm.2006.09.003
30
NelderJ. A.WedderburnR. W. (1972). Generalized linear models. J. R. Stat. Soc. Ser. A General.135, 370–384. 10.2307/2344614
31
PosnerD. C.LinH.MeigsJ. B.KolaczykE. D.DupuisJ. (2020). Convex combination sequence kernel association test for rare-variant studies. Genet. Epidemiol.44, 352–367. 10.1002/gepi.22287
32
RasmussenC. E. (2003). “Gaussian processes in machine learning,” in Summer school on machine learning (Springer), 63–71.
33
SonnenburgS.RätschG.SchäferC.SchölkopfB. (2006). Large scale multiple kernel learning. J. Mach. Learn. Res.7, 1531–1565. 10.5555/1248547.1248604
34
StyanG. P. (1973). Hadamard products and multivariate statistical analysis. Linear algebra its Appl.6, 217–240. 10.1016/0024-3795(73)90023-2
35
SzustakowskiJ. D.BalasubramanianS.KvikstadE.KhalidS.BronsonP. G.SassonA.et al (2021). Advancing human genetics research and drug discovery through exome sequencing of the UK biobank. Nat. Genet.53, 942–948. 10.1038/s41588-021-00885-0
36
VershyninR. (2018). High-dimensional probability: An introduction with applications in data science, 47. Cambridge University Press.
37
VisscherP. M.WrayN. R.ZhangQ.SklarP.McCarthyM. I.BrownM. A.et al (2017). 10 years of gwas discovery: biology, function, and translation. Am. J. Hum. Genet.101, 5–22. 10.1016/j.ajhg.2017.06.005
38
WainwrightM. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, 48. Cambridge University Press.
39
WuM. C.LeeS.CaiT.LiY.BoehnkeM.LinX. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet.89, 82–93. 10.1016/j.ajhg.2011.05.029
40
YuanX.MillerD. J.ZhangJ.HerringtonD.WangY. (2012). An overview of population genetic data simulation. J. Comput. Biol.19, 42–54. 10.1089/cmb.2010.0188
41
ZhangZ.ErsozE.LaiC.-Q.TodhunterR. J.TiwariH. K.GoreM. A.et al (2010). Mixed linear model approach adapted for genome-wide association studies. Nat. Genet.42, 355–360. 10.1038/ng.546
42
ZhaoN.ChenJ.CarrollI. M.Ringel-KulkaT.EpsteinM. P.ZhouH.et al (2015). Testing in microbiome-profiling studies with mirkat, the microbiome regression-based kernel association test. Am. J. Hum. Genet.96, 797–807. 10.1016/j.ajhg.2015.04.003
43
ZukO.SchaffnerS. F.SamochaK.DoR.HechterE.KathiresanS.et al (2014). Searching for missing heritability: designing rare variant association studies. Proc. Natl. Acad. Sci.111, E455–E464. 10.1073/pnas.1322563111
Summary
Keywords
GWAS, kernel learning, reproducing kernel Hilbert space, score testing, SKAT, target alignment, WES
Citation
Falk I, Zhao M, Nait Saada J and Guo Q (2023) Learning the kernel for rare variant genetic association test. Front. Genet. 14:1245238. doi: 10.3389/fgene.2023.1245238
Received
23 June 2023
Accepted
14 September 2023
Published
09 October 2023
Volume
14 - 2023
Edited by
Segun Fatumo, University of London, United Kingdom
Reviewed by
Asuman Turkmen, The Ohio State University, United States
Rounak Dey, Insitro, Inc., United States
Updates
Copyright
© 2023 Falk, Zhao, Nait Saada and Guo.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Qi Guo, guoqi_cindy@hotmail.com
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.