Learning the kernel for rare variant genetic association test

Introduction: Compared to Genome-Wide Association Studies (GWAS) for common variants, single-marker association analysis for rare variants is underpowered. Set-based association analyses for rare variants are powerful tools that capture some of the missing heritability in trait association studies. Methods: We extend the convex-optimized SKAT (cSKAT) testing procedure, which learns from data the optimal convex combination of kernels, to the full Generalised Linear Model (GLM) setting with arbitrary non-genetic covariates. We call this extended cSKAT (ecSKAT) and show that the resulting optimization problem is a quadratic programming problem that can be solved at no additional cost compared to cSKAT. Results: We show that a modified objective is related to an upper bound on the p-value through a decreasing exponential term in the objective function, indicating that optimizing this objective function is a principled way of learning the combination of kernels. We evaluate the performance of the proposed method on continuous and binary traits using simulation studies, and illustrate its application using UK Biobank Whole Exome Sequencing data in rare variant association analyses of hand grip strength and systemic lupus erythematosus. Discussion: Our proposed ecSKAT method enables correcting for important confounders in association studies, such as age, sex or population structure, for both quantitative and binary traits. Simulation studies showed that ecSKAT can recover sensible weights and achieve higher power across different sample sizes and misspecification settings. Compared to the burden test and the SKAT method, ecSKAT gives a lower p-value for the genes tested for both quantitative and binary traits in the UK Biobank cohort.


APPENDIX 1. Notation
For an integer n let [n] = {1, . . ., n}. Let R^n be the space of n-dimensional real vectors, identified with column vectors, and R^{n×m} the space of n × m matrices. Furthermore, let S^n be the space of n × n real symmetric matrices, and S^n_+, S^n_{++} the n × n positive-semidefinite and positive-definite matrices respectively. Let R^n_+, R^n_{++} be the spaces of n-dimensional real vectors with all components nonnegative and positive respectively. Let △_n be the set of vectors in R^n_+ summing to one (the set of probability vectors). We let boldface denote matrices and vectors, such as X, and let X^T be the transpose of X. Let 1_n, 0_n be the n-dimensional vectors of all ones and zeros, and let 1_{n,m}, 0_{n,m} be the n × m matrices of ones and zeros respectively. We drop the subscript when the dimensions are clear from context. For two vectors x, y ∈ R^n let ⟨x, y⟩ = x^T y = Σ_{i=1}^n x_i y_i, and let ∥x∥_p = (Σ_{i=1}^n |x_i|^p)^{1/p}, which is a true norm when p ≥ 1, with ∥x∥_∞ = max_i |x_i|. For two matrices X, Y ∈ R^{n×m} let ⟨X, Y⟩_F = Σ_{i,j} X_{ij} Y_{ij} be the Frobenius inner product, ∥X∥_F the induced norm (the Frobenius norm), and X ⊙ Y the Hadamard product, (X ⊙ Y)_{i,j} = X_{i,j} Y_{i,j}. For X ∈ R^{n×n} the trace is defined as Tr(X) = Σ_{i=1}^n X_{ii}; if X can be eigendecomposed with eigenvalues (λ_i)_{i=1}^n, then Tr(X) = Σ_{i=1}^n λ_i and det(X) = ∏_{i=1}^n λ_i. Let X ∈ R^{n×m}, and let I = {I_j : j = 1, . . ., |I|} be a partition of [n] and J = {J_j : j = 1, . . ., |J|} be a partition of [m]. We let X_{I_j} be the rows corresponding to I_j stacked as an R^{|I_j|×m} matrix and X_{J_j} be the columns corresponding to J_j stacked as an R^{n×|J_j|} matrix. Let H denote an RKHS and K the corresponding kernel.
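A few of the identities above (the Frobenius inner product as a trace, the Hadamard product, and the trace as the sum of eigenvalues) can be checked in a short numpy sketch; the matrices here are arbitrary random examples of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
Y = rng.standard_normal((3, 3))

# Frobenius inner product <X, Y>_F = sum_{i,j} X_ij Y_ij = Tr(X^T Y)
frob_inner = np.sum(X * Y)
assert np.isclose(frob_inner, np.trace(X.T @ Y))

# The Frobenius norm is the norm induced by this inner product.
assert np.isclose(np.sqrt(np.sum(X * X)), np.linalg.norm(X, "fro"))

# Hadamard product: (X ⊙ Y)_{ij} = X_ij Y_ij, elementwise in numpy.
H = X * Y
assert np.isclose(H[1, 2], X[1, 2] * Y[1, 2])

# For a symmetric matrix, the trace equals the sum of its eigenvalues.
S = X + X.T
assert np.isclose(np.trace(S), np.sum(np.linalg.eigvalsh(S)))
```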
We will use capital letters to denote random variables and lower-case letters for observations, X, x. Let I(E) denote the indicator variable of the event E. For a family of distributions ρ_θ parameterized by θ ∈ Θ with a pdf, we write p_ρ(y; θ) for the value of the pdf of ρ_θ at y. Below is a table of distributions we will use, together with some information about these distributions. Let B(α) = ∏_{i=1}^n Γ(α_i)/Γ(Σ_{i=1}^n α_i) be the beta function.

On a flaw in the cSKAT derivation for the extended setting
In this section we detail a flaw in the derivation of cSKAT extended to the general case in (Posner et al., 2020, Appendix A.1). We state it in terms of our notation.
The flaw can be found in (Posner et al., 2020, Eq. 11 and Eq. 12), and the derivation is found in (Posner et al., 2020, A.1). It is due to not using the right denominator ∥λ_0∥_2 in the objective when going beyond the continuous case and/or having non-genetic covariates (note that we cancel out the φ_0 terms since they occur in both the numerator and denominator). As stated in (Posner et al., 2020, p. 4), φ_0 · λ_0 = eig(P_0^{1/2} K P_0^{1/2}). For the continuous case with no non-genetic covariates, P_0 can be shown to be the centering matrix, and assuming the kernel matrix K is centered, the objective reduces to y^T K y / ∥K∥_F, since ∥λ_0∥_2 = ∥K∥_F in this case, as 1) K is centered and so P_0^{1/2} K P_0^{1/2} = K, and 2) P_0 is the centering matrix (which is only true in this particular case). However, when the V or X terms enter P_0, P_0 is no longer the centering matrix, and so the numerator and denominator need to change as follows: 1) y^T K y turns into r^T K r, where r is the residual vector under the null hypothesis, and 2) ∥K∥_F turns into ∥λ_0∥_2 = ∥eig(P_0^{1/2} K P_0^{1/2})∥_2. Keeping the continuous-case numerator and denominator in the general setting leads to the wrong objective.
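The distinction can be sketched numerically. Below is a minimal example of ours for the simplest special case, a Gaussian null model with identity link, where P_0 reduces to the complement of the hat matrix of the covariate regression; all function names and the toy data are our own:

```python
import numpy as np

def null_projection_gaussian(X):
    """P0 = I - X~ (X~^T X~)^{-1} X~^T for a Gaussian null model with an
    intercept and non-genetic covariates X (a simple special case)."""
    n = X.shape[0]
    Xt = np.column_stack([np.ones(n), X])
    return np.eye(n) - Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)

def corrected_objective(K, r, P0):
    """r^T K r / ||lambda_0||_2 with lambda_0 = eig(P0^{1/2} K P0^{1/2})."""
    lam, U = np.linalg.eigh(P0)  # P0 is symmetric PSD
    P0h = U @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    lam0 = np.linalg.eigvalsh(P0h @ K @ P0h)
    return (r @ K @ r) / np.linalg.norm(lam0, 2)

def naive_objective(K, y):
    """y^T K y / ||K||_F -- only valid when P0 is the centering matrix
    and K is centered."""
    return (y @ K @ y) / np.linalg.norm(K, "fro")

rng = np.random.default_rng(1)
n, p = 50, 8
X = rng.standard_normal((n, 2))                    # non-genetic covariates
G = rng.integers(0, 3, size=(n, p)).astype(float)  # toy genotype matrix
y = rng.standard_normal(n)
K = G @ G.T
P0 = null_projection_gaussian(X)
r = P0 @ y                                         # residuals under the null
print(naive_objective(K, y), corrected_objective(K, r, P0))
```

With non-trivial covariates the two values differ, illustrating why the denominator must track P_0 rather than stay fixed at ∥K∥_F.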

Theorems
THEOREM 1. Assume a weighted linear kernel K_w = G diag(w) G^T. Given a dataset D = (X, G, y) and a GLM model η(µ(x, g)) = α_0 + α^T x + β^T g giving rise to residuals r = y − μ_0, variance matrix V = diag(v), null projection matrix P_0 and matrix B = G^T P_0 G, the maximizer over w ∈ △_p of

J(w) = r^T K_w r / ∥eig(P_0^{1/2} K_w P_0^{1/2})∥_2    (1)

is given by w* = z*/∥z*∥_1, where z* solves the quadratic programme

z* = argmin_{z ∈ R^p_+} z^T (B ⊙ B) z − 2 ⟨s, z⟩,    with s = (G^T r) ⊙ (G^T r).

PROOF. Note that J(w) = r^T K_w r / ∥eig(P_0^{1/2} K_w P_0^{1/2})∥_2. We first simplify the numerator and the denominator, starting with the numerator. We will use the identity (Horn and Johnson, 2012)

w^T (A ⊙ B) v = Tr(diag(w) A diag(v) B^T)    (3)

at several points in the proof.

First note that r^T K_w r = Tr(diag(w) (G^T r)(G^T r)^T), which by (3) is equal to w^T ((G^T r) ⊙ (G^T r)) = w^T s. For the denominator we have

∥eig(P_0^{1/2} K_w P_0^{1/2})∥_2^2 = Tr(K_w P_0 K_w P_0) = Tr(diag(w) B diag(w) B),

which, since B ∈ S^p_+, is of the form required by (3), meaning that ∥eig(P_0^{1/2} K_w P_0^{1/2})∥_2^2 = w^T (B ⊙ B) w, where, since B is positive semi-definite (or positive definite), so is B ⊙ B (Styan, 1973).

Combining the above, we have that

J(w) = w^T s / (w^T (B ⊙ B) w)^{1/2}.

Finally, it can be seen that the result of Cortes et al. (2012) still applies, and finding w* is equivalent to solving the quadratic programme above and letting w* = z*/∥z*∥_1.
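The quadratic programme of Cortes et al. (2012), in the form we read off the proof above, is a non-negative QP and can be solved with standard methods. Below is a sketch of ours using projected gradient descent (the solver choice, function names and toy data are our own, not the paper's implementation):

```python
import numpy as np

def ecskat_weights(G, r, P0, n_iter=5000):
    """Solve z* = argmin_{z >= 0} z^T (B ⊙ B) z - 2 s^T z by projected
    gradient descent and return w* = z*/||z*||_1, where
    s = (G^T r) ⊙ (G^T r) and B = G^T P0 G (as read off the proof)."""
    s = (G.T @ r) ** 2
    B = G.T @ P0 @ G
    M = B * B                                  # Hadamard square; PSD by the Schur product theorem
    step = 1.0 / (2.0 * np.linalg.norm(M, 2))  # 1/L for the gradient 2(Mz - s)
    z = np.ones(G.shape[1]) / G.shape[1]
    for _ in range(n_iter):
        z = np.maximum(0.0, z - step * 2.0 * (M @ z - s))
    return z / z.sum()

rng = np.random.default_rng(2)
n, p = 60, 5
G = rng.integers(0, 3, size=(n, p)).astype(float)  # toy genotypes
y = rng.standard_normal(n)
P0 = np.eye(n) - np.ones((n, n)) / n               # centering matrix: simplest null model
r = P0 @ y
w = ecskat_weights(G, r, P0)
print(w)  # non-negative kernel weights summing to one
```

Any off-the-shelf QP solver would do equally well here; projected gradient keeps the sketch dependency-free.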

Experimental Setup
For each of the hypothesis testing settings we use the UKBB WES dataset of the gene PARK7, giving 200'643 patients or datapoints. The PARK7 gene has 462 variants. For each experiment we sample n patients without replacement, where n depends on the setting. This means that for each experiment we run, we get a new genetic and a new non-genetic covariate matrix, each with n rows; the genetic matrix has 462 columns and the non-genetic matrix has 12 columns, since we keep the columns sex, age and the first 10 principal components computed from the full genetic matrix on the whole UKBB WES dataset.
Each setting is specified and generated via the causal_ratio (which we set to 0.1), the ratio of causal variants in the gene. Causal variants are sampled with replacement according to the probability vector generated by mapping all of the MAFs through f(x) = x^{−0.5} and normalizing so that it sums to one; in this way we satisfy the common assumption that rarer variants are more often causal. For the interaction term (if it is used) we have a parameter interaction_ratio (which we set to 1.0, so all causal variants interact), specifying how many of the causal variants interact with other variants, and a misspecification_factor, which scales the final interaction matrix Γ so that ∥β∥/∥Γ∥_2 = misspecification_factor; we set this to 1.0, which means we are severely misspecified. The beta coefficient of the causal variant SNP at index j is given by −log_10(MAF_j), where MAF_j is calculated from the full dataset, and its sign is flipped with probability beta_random_sign_flip_prob. Finally, we use a link function to map from the mean to the sampling of the output: for continuous output this is the identity link, and for binary output the Bernoulli link function. We generate α, α_0 from a standard multivariate Gaussian and then normalize them to have norm 1.
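The causal-variant sampling and effect-size construction above can be sketched as follows. The MAFs here are random placeholders of ours (the paper computes them from the full UKBB WES dataset), and we sample without replacement for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical MAFs for the variants in a gene (placeholders; the paper
# computes these from the full dataset).
p_variants = 20
maf = rng.uniform(0.0005, 0.05, size=p_variants)

# Sampling probabilities proportional to MAF^{-0.5}, normalised to sum to
# one, so rarer variants are more likely to be chosen as causal.
probs = maf ** -0.5
probs /= probs.sum()

causal_ratio = 0.1
n_causal = max(1, int(round(causal_ratio * p_variants)))
causal_idx = rng.choice(p_variants, size=n_causal, replace=False, p=probs)

# Effect size -log10(MAF_j) for each causal SNP, with a random sign flip.
beta_random_sign_flip_prob = 0.5
beta = np.zeros(p_variants)
signs = np.where(rng.random(n_causal) < beta_random_sign_flip_prob, -1.0, 1.0)
beta[causal_idx] = signs * -np.log10(maf[causal_idx])
print(causal_idx, beta[causal_idx])
```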
For the interaction term, we pick causal variants at random until a fraction interaction_ratio of them has been chosen, and each such variant interacts at random with one of the other variants in the gene (excluding itself). The interaction term is added on top of the previously used α_0, α, β, so that the misspecified case does not change the linear terms used previously but only adds the interaction term.
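The rescaling of the interaction matrix Γ to hit the target ∥β∥/∥Γ∥_2 ratio amounts to one normalisation step; here is a small sketch of ours with illustrative interacting pairs:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 20
beta = rng.standard_normal(p)  # linear effect sizes (illustrative)

# Unscaled interaction matrix with a couple of hypothetical interacting pairs.
Gamma = np.zeros((p, p))
for i, j in [(0, 3), (5, 7)]:
    Gamma[i, j] = rng.standard_normal()

# Rescale Gamma so that ||beta|| / ||Gamma||_2 equals the target factor
# (||.||_2 here is the spectral norm).
misspecification_factor = 1.0
Gamma *= np.linalg.norm(beta) / (misspecification_factor * np.linalg.norm(Gamma, 2))
```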
For the continuous setting, which corresponds to a Gaussian model, we set the noise variance to σ^2 = 4, which corresponds to a heritability coefficient h^2 of 0.2 in this case.
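The stated correspondence between σ^2 = 4 and h^2 = 0.2 follows if the genetic component of the linear predictor has unit variance, which is our assumption (the text does not state the genetic variance explicitly):

```latex
h^2 = \frac{\operatorname{Var}(\beta^\top g)}{\operatorname{Var}(\beta^\top g) + \sigma^2}
    = \frac{1}{1 + 4} = 0.2
```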
As done in previous works (e.g. Posner et al. (2020)), we center all of the kernel matrices.
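Kernel centering here is the standard double-centering with H = I − 11^T/n; a minimal sketch (helper name is ours):

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix: K_c = H K H with H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(5)
G = rng.integers(0, 3, size=(10, 4)).astype(float)
K = center_kernel(G @ G.T)
# Rows and columns of the centred kernel sum to (numerically) zero.
print(np.abs(K.sum(axis=0)).max())
```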