Subsampling Technique to Estimate Variance Component for UK-Biobank Traits

The estimation of heritability has been an important question in statistical genetics. Due to the clear mathematical properties, the modified Haseman–Elston regression has been found a bridge that connects and develops various parallel heritability estimation methods. With the increasing sample size, estimating heritability for biobank-scale data poses a challenge for statistical computation, in particular that the calculation of the genetic relationship matrix is a huge challenge in statistical computation. Using the Haseman–Elston framework, in this study we explicitly analyzed the mathematical structure of the key term tr(KTK), the trace of high-order term of the genetic relationship matrix, a component involved in the estimation procedure. In this study, we proposed two estimators, which can estimate tr(KTK) with greatly reduced sampling variance compared to the existing method under the same computational complexity. We applied this method to 81 traits in UK Biobank data and compared the chromosome-wise partition heritability with the whole-genome heritability, also as an approach for testing polygenicity.


INTRODUCTION
Given the increasing sample size and sequencing capability, high-throughput genetic data is presented as the standard input that challenges statistical computation. For example, in the estimation of heritability for complex traits using all markers concurrently, both (i) constructing the genetic relationship matrix [GRM, denoted as K and the mathematical expression can be seen in section Materials and Methods, with its computational cost O(MN 2 )] and (ii) the estimation of heritability using linear mixed model [O(N 3 )] are computationally expensive (Yang et al., 2010). In order to alleviate computational burden, various solutions have been proposed. Modified Haseman-Elston regression (HE) can be used to estimate heritability with reduced computational cost in the estimation step [O(N 2 )], but the construction of GRM is still needed (Chen, 2014). Using summary statistics, such as those estimated from the genome-wide association study (GWAS), rather than individual-level data, can provide a theoretical equivalence estimate of the heritability under the assumption that the source of summary statistics and the linkage disequilibrium (LD) reference are homogeneous , if not always the case.
Even under the HE framework, given the availability of biobank-scale data, such as UK Biobank (UKB) data (Bycroft et al., 2018), the computational cost for GRM poses a challenge for heritability estimation mentioned procedure above. In order to reduce the computational cost of GRM, recently a randomized estimation of heritability has been introduced by Wu and Sankararaman (2018), called randomized Haseman-Elston regression (RHE), a promising method that can be used for both single-trait and bitrait analyses (Sankararaman, 2019). This method is built on a hybrid framework which can be applied to biobank-scale data, and a key innovation involved is a quick evaluation for tr(K T K), the trace of the multiplication of GRM with its transpose. Direct computation of tr(K T K) can be time-consuming, at the time cost of O(N 2 M), but in RHE the numerical evaluation of tr(K T K) can be realized via a randomization method expressed in quadric form. However, we found that the sampling variance of RHE in the original report is incorrect because of their wrong derivation (refer to Appendix A3 in Wu and Sankararaman's original report). In this study, we further investigate the statistical property of RHE, in particular the term about tr(K T K), and its relevant extensions.
This report was written for three purposes. First, we found that the provided randomization estimate for tr(K T K) is correct but with its sampling variance, which is proportional to tr(K T KK T K), not properly treated in Wu and Sankararaman's original report. We derived and numerically validated the sampling variance of tr(K T K). Second, recently a hybrid routine that can use either individual-level data and summary statistics has also been found (Zhou, 2017;Wu and Sankararaman, 2018), in which subsampling technique is used to evaluate tr(K T K); however, its sampling variance was not available. We provided a similar method as subsampling but with availability of its analytical sampling variance. Third, we partitioned the heritability based on the effective number of markers and applied them in the partitioning of heritability for some complex traits in UKB.

Genetic Relationship Matrix
For a homogenous unrelated sample, its genotypic matrix can be written as X, a matrix of N rows-individuals, and M columnscoding the count of the reference allele for a biallelic locus. After standardization for each genotypex kl = x kl −2p l √ 2p l q l , in which 2p l is the allele frequency and √ 2p l q l the square root of the variance, we can define GRM as K = 1 MXX T . Given K, we can easily derive some characters of K. Denote K o as the off-diagonal elements, and it is easy to see that E (K o ) = −1 N−1 , because the summation of the diagonal is N − 1. var (K o ) is the sampling variance of the all off-diagonal elements.
Of note, var (K o ) relates to the concept, so-called effective number of markers, denoted as M e thereafter. As noticed, M e is defined as the reciprocal of var (K o ). M e = E ρ l 1 l 2 2 , in which E(ρ l 1 l 2 ) is the expected Pearson's correlation between the l th 1 and l th 2 loci. Alternatively, M e = 1 E ρ l 1 l 2 2 . It is known that for a population, the averaged linkage disequilibrium across the genome is nearly a constant given the markers; in other words, M e is a constant genetic parameter. The definition of M e in this report allows researchers to calculate M e based on a reference population of the same origin to the population in question. Similarly, M e.c represents the averaged LD for any pair of markers on the c th chromosomes. As the causal variants are hardly observed directly, their relationship with markers are surrogated by relationship between markers, as reflected in M e . As M e is a critical parameter in many genetic applications, a conceptional parameter is its involvement in genetic prediction (Dudbridge and Wray, 2013), or power calculation for the estimation of heritability (Visscher et al., 2014). In the estimation for variance components, as shown below, M e is a key parameter.

Haseman-Elston Regression Framework for the Estimation of Heritability
Haseman-Elston regression (HE) has been initially proposed for the linkage analysis (Haseman and Elston, 1972). With its original kernel relatedness between sib pairs via linkage replaced by linkage disequilibrium for unrelated samples, the modified HE can be used for the estimation of heritability (Chen, 2014). Due to its clear mathematical property, HE has been found a bridge to connect and develop various parallel methods for the estimation of heritability, such as LD score regression that estimates heritability and uses summary statistics from GWAS Zhou, 2017).
However, LD score regression is based on various assumptions that may or may not be met in practice. LD score regression uses SNPs in a sliding window instead of all genome-wide SNPs to calculate LD scores, which will lose efficiency if heterogeneity exists between the reference population and the population that generates the GWAS summary statistics. If we directly use individual-level data, the time cost will be unaffordable, such as for the restricted maximum likelihood estimation method (REML); in contrast, a method of moment (MoM) can provide equivalent estimation for the heritability for complex traits.
We assume that in which y is the standardized phenotype for a trait of interest, X is the standardized genotypic matrix of N individuals and M the biallelic markers, β is the additive effect associated with each marker, e is the residual, h 2 is the SNP heritability, and σ 2 e is the residual variance. It is easy to know that var y = E yy T −E y E y T = h 2 MXX T +σ 2 e I=h 2 K+σ 2 e I.

Estimation for Heritability via Modified Randomized Haseman-Elston Regression
Consequently, we extend the work by Wu and Sankararaman (2018); the moment estimator is to minimize Frontiers in Genetics | www.frontiersin.org By taking the differentiation in terms of h 2 and σ 2 e , we have ∂Q ∂h 2 = tr h 2 K T K + σ 2 e K − yy T K = 0 ∂Q ∂σ 2 e = tr h 2 K+σ 2 e I − yy T I = 0 After algebra, we have the normal equations below: The estimator forĥ 2 can be written aŝ However, different computational strategies deal with the computational expensive part for both the numerator and the denominator. In particular, for the numerator, y T Ky can be decomposed as y TXXT y/M, and y TX , equal to X T y T . Each elementX T j y ofX T y is just the regression coefficient between the jth marker and y that can be computed via simple linear regression, or multivariate linear regression if covariates are included. It is easy to recognize that y TXXT y/M follows χ 2 1 after scaling by the sample size N, and a possible non-central parameter related to the heritability of the trait. Alternatively, we derive the mathematical expectation E y T Ky = NE χ 2 1|h 2 = N(1 + Nh 2 r 2 ), in which r 2 is the averaged LD score between a marker to a causal variants in LD.
The denominator involves the trace of K T K, a high-order function for GRM. Alternatively, according to the property of the trace of a matrix, it can be calculated that tr(K T K) = N i,j K 2 i,j , a summation of the square of each element in K. From the first glance, it seems inevitable to compute K, the computational cost of which is O(N 2 M), a substantial cost given a large sample size, such as for UKB of about 500,000 samples (Bycroft et al., 2018). In order to have a proper estimate for tr(K T K) but reduce computation cost, three methods are proposed for estimating tr(K T K).

Estimating tr(K T K)
We present three methods in estimating tr(K T K). Sampling method I has been proposed by Wu and Sankararaman, but we provide its correct sampling variance, which was incorrectly given in their original report (Wu and Sankararaman, 2018). Sampling method II derives the expectation of tr(K T K) and estimates it in a reference population with the similar genetic origin of the population of question. Sampling method III slightly modifies method II if the reference population is big and yields smaller sampling variance of tr(K T K) than that of method II.

Sampling Method I: The Randomized Estimator With Corrected Analytical Sampling Variance
Using randomized estimation, an unbiased estimator L B is employed to estimate tr(K T K) in RHE (Wu and Sankararaman, 2018). The rational for a randomized estimate is as below: In each iteration, a vector z, of length N, is generated from the standard normal distribution. As long as z has been generated B time and B is large enough, it is guaranteed to approach tr(K T K). As z T b XX T can be calculated easily, the computational cost is O(NMB). Then, L B can be plugged into a normal equation (Equation 2).
In Wu and Sankararaman's original report, the sampling variance of L B was given as var (L B ) = 2tr K T K /B, which was incorrect, and the correct one should have been The derivation of the penultimate step uses the quadratic variance calculation formula. The sampling variance of L B is proportional to tr(K T KK T K), the computational cost of which is likely to be infeasible for biobank-scale data. However, its practical sampling variance can be estimated from B iterations above var

Sampling Method II: Estimating tr(K T K) by Subsampling
In an alternative route, we bypass the direct computation of tr(K T K). It is shown that tr K T K = N 2 /M e + N for unrelated samples (see Supplementary Notes). N is the sample size, a known parameter; we only need to estimate M e . As noted above, M e can be estimated by subsampling a proportion of the study population (Figure 1) or by a reference population of the same origin with the population of study (Zhou, 2017). Thus, we can estimate M e using a small proportion of the sample, as long as we can estimateM e ; we can easily get the estimator of tr K T K . We define a new L S estimator: is not directly known but can be directly estimated in the third method proposed below.

Sampling Method III: Estimating tr(K T K) via Shotgun Randomization
However, σ 2 M e in method II is not analytical probably because each individual will be involved s times in the estimation of variance. With slightly modification, we developed a new estimator L T (lower triangle shotgun sampling estimator) to estimate tr(K T K) in RHE. The difference in sampling schemes between methods II and III can be visualized as Figure 1. Given the whole GRM, method II samples a square matrix of size s × s after rearrangement and calculates half elements, whereas method III randomly samples n = s 2 /2 elements in the whole GRM without replacement so as to reduce overlapping of samples (Figure 1). This sampling idea is similar to the shotgun method in the first-generation DNA sequencing technology, so we call the method III shotgun sampling estimator.
Given a random subset of n elements A ⊆ {1, 2, · · · , N(N − 1)/2}, we define It can be proved that L T is an unbiased estimator of tr(K T K) with its sampling variance N 4 var(K o 2 )/n, which can be estimated by Supplementary Notes). Therefore, we can get the unbiased estimate of tr(K T K) and its sampling variance at the same time in one step. It does not need to calculate all the elements in K o but the corresponding pairs of the individuals, and to calculate the mean of the product of all their genetic values. Therefore, each item in the summation can be computed in O(M), and the total running time is O(nM).

The Estimation of Variance Components and Its Sampling Variance
If we replace tr K T K with its subsampling estimators, we can get the synthesized estimator for heritabilitŷ is the mean of χ 2 1 for each SNP with or without the adjustment of covariates. Using the Delta method, we show in Supplementary Notes that the variance ofĥ 2 can be formulated as and each item in the above formula is estimable, then we can get the variance estimator of the variance component Except forĥ 2 , all other parts involved are independent to the phenotype, so given a specific sample of question, the estimator has a linear relationship with the square of the estimated heritability.
Genetic Partitioning of Heritability Yang et al. (2010) estimated the chromosome-wise partitioned heritability and found that the heritability of complex trait, such as human height is proportional to the length of the chromosome, that is, proportional to the number of causal variants. Some researchers gave more weight to large effects to explain heritability and to study polygenicity (O'Connor et al., 2019;Yang and Zhou, 2020). In this report, we instead calculated heritability based onM e and compared the chromosome-wise partition heritability with the whole-genome heritabilitŷ in whichM e.c is the effective of markers for the c th chromosome and r 2 c is the averaged squared correlation between a casual variant and a marker on the c th chromosome. Under the assumption of polygenicity,ĥ 2 c=1 r 2 c , and the ratio between h 2 h 2 C = r 2 22 c=1 r 2 C . As both r 2 and 22 c=1 r 2 c are unknown, we use 1 M e and 1 C c=1 M e.c as the surrogates for r 2 and 22 c= 1 r 2 c . By breaking the GRM of the whole genome K in Equation (1) into the GRMs for 22 autosomes, we can also estimate the chromosome heritability jointly in one model. This method has to inverse a 23 × 23 matrix. Under the assumption that the genotype of each chromosome contains the same N individuals, the inversed matrix is completely upon N and M e.c , so a computation cost linear to 23, without bothering the conventional matrix inversion procedure, a computation cost of 23 3 , can be written down analytically. In particular, the c th diagonal element of the inverse matrix is M e.c /N 2 , and the last column/row is −M e.c /N 2 . For more details, please see Supplementary Notes.

Heritability for the Weighted Genetic Relationship Matrix
Given the definition of the weighted GRM we can get an estimator of the weighted heritability as well as its variance estimator based on weighted GRM through a similar derivation whereM ew is the estimation of M e for K w , and χ 2 1|h 2 ,l is the square of the z-score for the l th SNP with or without the adjustment of covariates. The weighted chromosome-wise partition heritability can be expressed aŝ

Connection to Other Estimators
The BOLT-LMM method  might be the most widely used method in the field of heritability estimation for large-scale data. Theoretically, the computational complexity of BOLT-LMM is O(PMN 1.5 ), where P is the number of iterations for convergence. In the L T estimator, the subsample size n ≪ N 1.5 , so our calculation time is less than BOLT-LMM in theory. In terms of actual calculation, the L B estimator used less calculation time to get an accuracy similar to BOLT-LMM (Wu and Sankararaman, 2018); the variance of our estimators is about an order of magnitude smaller than L B under the same calculation time. Thus, our method is better than BOLT-LMM in calculation accuracy and time. In terms of memory, the memory complexity of BOLT-LMM is O(NM/4), while the memory of our subsampling estimators is proportional to M and the subsample size s in the L S estimator, which generally does not exceed 10% of the total sample size.
Given the availability of the estimators and their sampling variances, it is able to evaluate the statistical power of the estimators and estimate the sample size for the given type I and type II error rates. Under the null hypothesis h 2 = 0, the sampling variance for the additive variance component can be reduced toσ h 2 ≈ 2M e N 2 , which are equivalent to that of REML (Visscher et al., 2014). It is consequently known that the statistical power of the presented method will be equivalent to REML. In contrast, the original Haseman-Elston regression has doubled sampling variances whereσ h 2 ≈ 4M e N 2 (Chen, 2014), because the original HE regression only uses the off/upper-diagonal of the matrix, as presented in the numerator above. The connection to LD score regression is obviously too; here, the whole M e can be seen as a genome LD score, rather than being partitioned into genomic bins.

RESULTS
Simulation Results for the Evaluation of tr(K T K) In the simulation and in the real data, we compared the mean and variance of the three estimators L B , L S , and L T , and the results are as presented in Figure 2. We took n = BN and s = √ 2BN to make sure the three estimators are under the equal computational cost of O(NMB) (see Figure 1 for an example). In the simulation, we set the genotype in two ways: (1) The minor allele frequency (MAF) of each SNP was randomly generated from a uniform distribution between 0.03 and 0.5, and two levels of LD (linkage disequilibrium, in terms of Lewontin's D ′ , the normalized LD parameter) strength were set as 0-0.2 (weak LD) and 0.6-0.8 (strong LD) with the SNP number M = 2,000, 5,000 and sample size N = 500, 1,000, and 2,000, respectively. (2) The real genotype data consisted of 12,980 adjacent markers on chromosome 22 of 2,000 randomly sampled unrelated white British individuals in UKB. B was set from 5 to 50 and repeated 100 times for each to assess the mean and variance of the three estimators.
Across different parameter settings (sample sizes, number of loci, MAF, and LD), it yielded a similar pattern for the evaluated results of tr(K T K). We chose M = N = 2,000 and strong LD for detailed presentation, and the rest were shown in Supplementary Figures 1, 2). Figure 2 shows that all the three estimators were unbiased. The variance of each of these estimators, as expected, was inversely proportional to B. The real sampling variance of L B was several times larger than the analytical incorrect result given in Wu and Sankararaman's study (refer to Appendix A3 in their original report) but was consistent with 2tr K T KK T K /B, just the corrected one as derived in this study (Equation 3). The sampling variance of L T was about an order of magnitude smaller than that of L B . The simulation results in the real data shown in Supplementary Figure 3 were consistent with Figure 2.

Real Data Analysis for tr(K T K)
We compared the performance of the three estimators L B , L S , and L T in UKB. After quality control, 525,460 autosome SNPs with MAF > 0.01 for 278,788 unrelated British white individuals, whose pairwise genetic relationship coefficient <0.0125, were included for analysis. We set B = 5, 10, 20, 40, 60, 80, and 100, and calculated each of the three estimators 100 times to get the mean and the variance for each B. We compared the means of the three estimators with the expected value of tr K T K = N 2 /M e +N, whereM e was estimated from subsamples; given witĥ M e ≈ 87,351, tr K T K was expected to be 1,168,573 for each of the three estimators.
The calculation was performed on an Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70-GHz server cluster, and about 30 threads were allocated for each calculation. The actual calculation time of the three estimators were basically the same (see Supplementary Table 1) and conformed to the theoretical calculation complexity O(NMB). The variances of the three estimators L B , L S , and L T for tr(K T K) are listed in Table 1.
In particular, between the randomized estimator and the subsampling estimators, there was a huge difference between their variances. Under the real data, the sampling variance of L B was large, while the sampling variances of the other two estimators were smaller and the variance of L T was about half that of L S . The variances of each of the three estimators decreased with the increasing B, consistent with the simulation.

Chromosome-Wise Partitioning for Heritability
Equation (2) presents how heritability is estimated using all 22 autosomes, and Equation (5) offers an alternative method by summation of chromosome-wise estimation for heritability. For ease of comparison, we only estimated heritability for 81 traits as demonstrated by Ge et al. (2017). We used the first two principal components as the covariates to control the possible population stratification; other covariates were adjusted upon the traits. The chromosome-wise partition heritability was calculated by the summation of the heritability estimated for M e.c for each chromosome ( Table 2), and the whole-genome heritability was calculated from the GRM of the whole genome. The estimated heritability of some selected UKB traits is listed in Table 3 (see Supplementary Table 2 for all the 81 traits). The heritability of all traits was basically very similar to Ge et al.'s result and within the error range. Several physiological traits, such as height and weight had high heritability, while social traits that were more affected by social factors, such as the duration of certain activities, showed lower heritability. This result was consistent with the mainstream conclusion. The left part of Equation (4) for variance estimators of the whole-genome heritability ( 2M e N 2 ) contributed a large part of the total variance (about 0.0017 in 0.002 for N = 270,000). Although the variances of theL B ,L S , andL T were several times different, they influenced little on the variance of estimated heritability.
In the comparison of the two kinds of heritability for each trait, all the chromosome-wise partition heritability was higher than the whole-genome heritability except for the trait of the age diabetes diagnosed (the explanation of this exception is given below). For a certain polygenic trait, the heritability attributed to each chromosome was proportional toM e.c according to the heritability estimation formula (Equation 5). Since the LD score between chromosomes could be considered as 0, this causes theM e of the whole genome to be diluted by a large number  of blank LD, so the overallM e was smaller than the averagê M e.c of each chromosome. In order to eliminate the influence of blank LD to see the contribution of effects of causal variants to heritability, Figure 3 shows the relationship of these two estimations of the heritability for the 81 traits. The slope of the solid gray line in the figure represents the ratio of the whole-genomeM e to c i=1M e.c , a ratio of 0.729. This figure was to capture traits that do not meet the assumptions of polygenic assumption-or fitness of the model. If a trait were purely polygenic, the point representing this trait would be expected just along the solid gray line. However, the points were mostly distributed above the line, indicating that the effect size of causal variants was not evenly distributed on the chromosomes. In particular, the trait of the age diabetes was diagnosed, the Manhattan plot of which showed many statistically significant SNPs concentrated on the major histocompatibility complex (MHC) region on chromosome 6. They all belong to MHC, which is related to many human traits. Obviously, these loci breached the polygenic assumption underlying. After deleting these loci, we reestimated the two kinds of heritability, and all the traits were basically close to the solid gray line and were closer compared with Figure 3 (see Supplementary Figure 4). This shows that the model assumptions were basically valid, and the estimated value of heritability had a certain degree of reliability.
Alternatively, the chromosome-wise partitioning heritability could be estimated jointly by fitting 22 autosomes altogether. It was basically the same as those calculated singly but slightly lower than the latter. It was because when calculating the heritability of chromosomes jointly, we set N for the whole genome in Equation (5), but smaller N were taken in the equation for estimating the heritability of each chromosome singly, as fewer individuals met the quality control standards for a single chromosome. We mentioned in Method that the fast estimation of joint heritability should meet the precondition that N of each chromosome are equal. The heritability estimated by the two methods will be strictly equal if this precondition holds (see Supplementary Notes). For traits with large sample sizes, this precondition could be met well, and the heritability estimated by the two methods was almost the same.
We also estimated the weighted chromosome-wise partition heritability and the weighted whole-genome heritability for these traits (see Supplementary Figure 5). In general, the weighted estimation of heritability was similar to that without weight.

DISCUSSION
In this study, we corrected the erroneous variance of the L B estimator and proposed another two unbiased estimators of tr(K T K), which was the most time-consuming term in RHE (Wu and Sankararaman, 2018). Instead of plotting the running time and accuracy of different methods like most articles, we used a different experimental design to make a special comparison with the L B estimator. We borrowed the sampling size parameter B in L B and adjusted the sample size of our estimators so that the theoretical calculation time of the three estimators was the same under different sample size parameter B. Under the same time complexity, our results showed better stability with smaller variances. In other words, under the same accuracy requirements, our method could greatly reduce the computation cost. We noted that Wu and Sankararaman further reduced the calculation time in matrix multiplication by introducing the mailman algorithm (Liberty and Zucker, 2009), which could also be used in our calculation by writing our estimators in the form of multiplication of genetic matrix and a random vector with multinoulli distribution. From these perspectives, our estimators were superior substitutions of the L B estimator in Haseman-Elston regression.
We also gave the sampling variance of the subsampling estimator, which could be calculated by one sampling without additional calculation. As a result, the variance estimator of the heritability could be easily derived. Although the variance of the L B estimator 2tr K T KK T K /B could also be derived by the subsampling method (beyond the scope of this study), its time complexity greatly exceeded the calculation of tr K T K as far as we know.
The variance of L S was always slightly larger than that of L T . This was because L T randomly extracted nearly uncorrelated elements in the lower triangular matrix K o , while L S extracted all elements in a triangle of K o (after reordering the individuals). Although their sampling variance was approximately equal to the population variance var(K o ), the sampling variance of L T was relatively smaller because it uses less related individuals.
One possible drawback of the L T estimator relies on a much larger reference population than that of L S . When the reference sample size is small, it is obvious that L T becomes L S . Therefore, the L T estimator can make full use of large sample size, such as that of UKB. Although the difference between the variances of these two estimators is small, and the difference in the final heritability estimation is even slight, we still provide a novel and simple subsampling idea, which can be used in many situations involving large samples.
In the early analysis of heritability, both GRM and Haseman-Elston regression were applied to related individuals under the context of linkage analysis using sibling data. Under linkage, relatedness is actually related to the concept of identity by descent (IBD). However, with the increasing amount of data, the significance and application range of GRM and HE have been expanded. The unrelated individuals we emphasize here are mainly to distinguish from the linkage analysis of pedigree data. There is no problem in the estimation of heritability with related individuals, as demonstrated below. The  Supplementary Table 2. The horizontal axis represents their chromosome-wise partition heritability, and the vertical line at each point is their error bar; the vertical axis represents their whole-genome heritability. The red color represents the chromosome-wise partition heritability calculated jointly, and the green color represents the chromosome-wise partition heritability calculated singly. The long-dash line crosses the origin has a slope of 1. The slop of the solid line is 0.729, the ratio of  Supplementary Table 4. In general, the heritability increased compared to the previous results of the unrelated set, but negligible. It shows that our estimators are basically applicable among a more realistic population even containing partially related individuals but leave some concerns in theoretical soundness.
Using modified Haseman-Elston regression to estimate heritability is becoming more and more popular in summary statistics. We further explored an important connection between Haseman-Elston regression and M e , the effective number of independent SNPs, which is also a critical concept in quantitative genetics. We found that M e plays a pivotal role in the estimation of variance components and heritability. As long as we get the estimation of M e , we can easily get the estimation of its corresponding variance components.
Although we used only individual-level data to estimate heritability in this report, the nature of M e allows researchers to estimate heritability based on a reference population of the same origin to the population in meta-analysis. However, the existence of family structure will make M e shrink (see Supplementary Table 3; the expansion of trace means the shrinkage of M e ), and different family structures make it shrink differently, leading to inaccurate meta-analysis. Therefore, we do not recommend using our method in samples with various related individuals, but it raises a very interesting question for the estimation theory using mega-scale family trees (Kaplanis et al., 2018;Shor et al., 2019).
Due to the statistical property of M e , we can easily extend M e to the dominant model and use the same method to obtain both additive and dominant heritability, as long as their codes for the count of the reference allele are orthogonal, as discussed (Vitezica et al., 2017;Álvarez-Castro and Crujeiras, 2019). We can also extend M e to estimate a genetic correlation for a pair of traits, in which tr K T K = N 1 N 2 /M e + N o , where N o is the overlap sample size between a pair of cohorts, which have N 1 and N 2 individuals, respectively.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found at: https://biobank.ctsu.ox.ac.uk/.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
TX developed the method and wrote the manuscript. G-AQ analyzed the data. JZ and H-MX revised the manuscript. G-BC designed the study and improved the manuscript. All authors read and approved the final manuscript.