Original Research ARTICLE
A Log-Ratio Biplot Approach for Exploring Genetic Relatedness Based on Identity by State
- 1Department of Statistics and Operations Research, Technical University of Catalonia, Barcelona, Spain
- 2Department of Biostatistics, University of Washington, Seattle, WA, United States
- 3Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain
- 4Genomes For Life - GCAT Lab, Institute for Health Science Research Germans Trias i Pujol (IGTP), Badalona, Spain
The detection of cryptic relatedness in large population-based cohorts is of great importance in genome research. The usual approach for detecting closely related individuals is to plot allele sharing statistics, based on identity-by-state or identity-by-descent, in a two-dimensional scatterplot. This approach ignores that allele sharing data across individuals has in reality a higher dimensionality, and neither regards the compositional nature of the underlying counts of shared genotypes. In this paper we develop biplot methodology based on log-ratio principal component analysis that overcomes these restrictions. This leads to entirely new graphics that are essentially useful for exploring relatedness in genetic databases from homogeneous populations. The proposed method can be applied in an iterative manner, acting as a looking glass for more remote relationships that are harder to classify. Datasets from the 1,000 Genomes Project and the Genomes For Life-GCAT Project are used to illustrate the proposed method. The discriminatory power of the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population the classification rate of the log-ratio principal component approach outperforms the classical graphics across the whole allele frequency spectrum, using only identity by state. In these circumstances, simulations show that with 35,000 independent bi-allelic variants, log-ratio principal component analysis, combined with discriminant analysis, can correctly classify relationships up to and including the fourth degree.
The detection of pairs of related individuals in genomic databases is important in many areas of genetic research. In population-based gene-disease association studies, the assumption of independent observations which is usually made in the statistical modeling of the data, may be violated due to related individuals. Cryptic relatedness can lead to an increased false positive rate in association studies, in particular if related individuals are oversampled (Voight and Pritchard, 2005). In conservation genetics, unrelated individuals are carefully selected in breeding programs in order to maximize genetic diversity (Oliehoek et al., 2006). In quality control of genetic variants produced by high-throughput techniques, accidental duplication of samples in genetic studies can be detected by a relatedness analysis (Abecasis et al., 2001). In ecology, samples of species often contain an excess of close relatives. This can lead to biased estimates of population-genetic parameters, lower the precision of their estimates, and inflated type 1 error rates of tests for genetic equilibria (Wang, 2018). In practice, most relatedness investigations are based on allele-sharing statistics such as the average number of identical-by-state (IBS) alleles shared by a pair of individuals over a set of loci, or by estimating the probabilities of sharing 0, 1, or 2 alleles identical-by-descent (IBD; Thompson, 1975, 1991), known as Cotterman's coefficients (Cotterman, 1941). Plots of these sharing statistics typically show clusters that correspond to unrelated pairs (UN), parent-offspring pairs (PO), full sibs (FS), half sibs (HS), monozygotic twins (MZ), avuncular pairs (AV), first cousins (FC), grandparent-grandchild (GG), or more remote relationships (see Figures 1A–C).
Figure 1. Classical graphics for relatedness research and log-ratio PCA biplot. Plots show the CEU sample of the 1000G project. IBS/IBD statistics were calculated over a set of 26,081 complete, LD-pruned autosomal SNPs with MAF above 0.4, and HWE exact test p-value above 0.05. (A) Scatterplot of the mean and standard deviation of the number of IBS alleles. (B) Scatterplot of the fraction of variants sharing two (p2) against the fraction sharing zero (p0) IBS alleles. (C) Scatterplot of the estimated probability of sharing one () against the estimated probability of sharing zero () IBD alleles. (D) log-ratio PCA biplot.
All these methods collapse the data to two statistics, that can summarize relatedness in two dimensions. Classical plots are the mean vs. the standard deviation of the shared number of alleles over loci [the (m, s) plot, see Figure 1A], the fractions of loci for which a pair of individuals shares 0 or 2 IBS alleles [the (p0, p2) plot, see Figure 1B], or the estimated probabilities of sharing 0 or 1 allele IBD [the (, ) plot, see Figure 1C]. However, all allele sharing statistics are estimated from the genotype data. For a pair of individuals with bi-allelic variants, there exist six possible pairs of genotypes, and their counts over the k variants determine the IBS allele sharing statistics. From this perspective, the observed genotype sharing data consists of vectors of six elements, that, when expressed in percentage form, occupy a five dimensional space. This suggests that the classical approaches of collapsing the data into two dimensions by plotting the summary statistics may not extract all information about relatedness that is present in the data. In this paper we propose to explore the data in five dimensions by using log-ratio principal component analysis (PCA), which is specially designed for analyzing compositional data (Aitchison, 1983). A log-ratio PCA allows us to construct comprehensive biplots that uncover the main relatedness features of the data.
Biplots are widely used in genetic research, in particular for the graphical representation of quantitative traits of genotypes in plant genetics (Anandan et al., 2016; Pandit et al., 2017; Sharma et al., 2018). In relatedness research, a PCA of bi-allelic genetic variants, coded in 0, 1, 2 format (for AA, AB, and BB respectively) is often used to investigate the existence of population substructure, that is, remote genetic relatedness. The plots obtained by this kind of PCA are, in principle, biplots, though often the genetic variants are omitted in such plots because there are too many of them. Substructure is also often investigated by multidimensional scaling (MDS) of allele sharing distances between individuals. The resulting MDS maps only represent individuals, and some authors prefer the term monoplots for such graphics (Gower et al., 2011). If MDS is based on the Euclidean distances, then a covariance-based PCA and MDS are in fact equivalent (Mardia et al., 1979). The MDS plots, PCA biplots without variable vectors for the genetic variants, are particularly popular in substructure investigations in human genetics (Jakobsson et al., 2008; Sabatti et al., 2009; Pemberton et al., 2010, 2013; Wang et al., 2010).
The biplot approach proposed in this paper differs from the classical applications described above in several ways. We propose a biplot of the genetic data of pairs of individuals, that represents artificially related pairs of a reference set of given familial relationships, generated by a respampling of the genetic data. The empirically observed pairs are used in a supplementary way, and are projected onto the reference biplot. The data matrix used in this biplot is not a (0, 1, 2) genetic data matrix, neither a distance matrix of allele sharing distances, but consists of vectors of counts of genotype patterns [(AA,AA), (AA,AB), etc.] which we treat as compositions, and we therefore use a log-ratio approach. More details are given in the section 2 below.
An important additional advantage of using log-ratio PCA in this context is that it allows us to explore the data iteratively with a peel and zoom procedure. A first log-ratio PCA may clearly reveal a cluster of FS pairs. Once identified, the corresponding pairs can be removed from the database, and log-ratio PCA can be repeated on the remaining pairs. The second analysis will focus more closely on more remote relationships that may be present in the database, and thereby act as a magnifying glass for the latter. The aforementioned classical graphics do not have this property, as they are invariant under removal of a relationship category.
The remainder of this paper is organized as follows. In the section 2 we provide background on relatedness research and log-ratio PCA, and show how to construct biplots that are useful for relatedness research. In the section 3 we study the discriminative power of log-ratio PCA and compare this with the classical plots in a simulation study. We also describe two empirical examples of our method with data from two different population-based datasets; a next generation sequencing dataset from the 1,000 Genomes Project (The 1000 Genomes Project Consortium, 2015) and a genome-wide SNP array technology dataset from the GCAT Genomes For Life Cohort Study of the Genomes of Catalonia (Galván-Femenía et al., 2018; Obón-Santacana et al., 2018). A discussion finishes the paper.
We first summarize some basic methods for relatedness research (section 2.1), then give a brief account of log-ratio PCA (section 2.2), and finally show how log-ratio PCA can be used in relatedness research (section 2.2).
2.1. Relatedness Research
We briefly review some fairly standard procedures that are currently used in relatedness research. Relatedness investigations are focused on the extent to which alleles are shared between individuals. Two individuals can share 0, 1, or 2 alleles for any autosomal variant. Alleles can be identical by state (IBS) or identical by descent (IBD). A pair of individuals share IBS alleles if they match irrespective of their provenance; whereas they share IBD alleles only if they come from a common ancestor. Table 1 shows all the possible combinations of the IBS alleles shared for a pair of individuals at a biallelic variant. Considering k biallelic variants, each pair of individuals has a vector of 0, 1, and 2 IBS counts of length k. In IBS studies, the means (m) and standard deviations (s) of the vector of the IBS allele counts (Abecasis et al., 2001), or the proportions of variants sharing 0, 1, and 2 IBS alleles (denoted p0, p1, and p2 respectively, Rosenberg, 2006) can be plotted (see Figures 1A,B). These plots reveal characteristic clusters corresponding to MZ, PO, FS, UN, and other pairs. Alternatively, in an IBD based approach, the probability of sharing 0, 1, or 2 IBD alleles for a pair of individuals (usually denoted by k0, k1, and k2 and referred to as Cotterman's coefficients) can be represented in a scatterplot (see Figure 1C, Nembot-Simo et al., 2013). The Cotterman coefficients can be estimated by the method of moments (Purcell et al., 2007), maximum likelihood (Thompson, 1991; Milligan, 2003; Weir et al., 2006), or robust estimation methods (the KING program, Manichaikul et al., 2010). In IBD studies, reference values for the standard relationships are available (see Table 2). Related pairs can also be distinguished, albeit at lower resolution, by using the co-ancestry coefficient defined as θ = k1/2+k2 or the kinship coefficient defined as ϕ = θ/2. Galván-Femenía et al. (2017) give an overview of graphics used in relatedness research. Figure 1 shows a panel plot of some standard graphics used in IBS and IBD studies for all the pairs of individuals from the CEU population of the 1.000 Genomes project. These plots distinguish UN, PO, FS, and second degree pairs. Alternatively, a Markov-chain approach with the calculation of likelihood ratios for putative and alternative relationship has been developed by Epstein et al. (2000; the Relpair program) and by 2000; the Prest-plus program). Throughout this paper we employ the classical notion of degree of relationship, shown in the second column of Table 2, with PO and FS being considered first degree, HS, GG and AV, second degree, FC third degree, first cousins once removed fourth degree, second cousins fifth degree and second cousins once removed sixth degree, and so on.
2.2. Log-Ratio Principal Component Analysis
Aitchison (McPeek and Sun (1983) proposed log-ratio principal component analysis (PCA) for the exploration of compositional data. Many successful applications of log-ratio PCA have been described in the literature, notably in geology. We briefly summarize log-ratio PCA and biplot construction (see Pawlowsky-Glahn et al., 2015 for a comprehensive account). Log-ratio PCA is usually performed by applying the centered log-ratio transformation to the compositional data, and we will follow that approach here. Let X be a matrix with n compositions in its rows, and having D parts (columns). Compositional data can be defined as strictly positive vectors for which the information of interest is in the ratios between the components (Aitchison, 1986). We consider the centered log-ratio transformation (clr) of a composition x (a row of X) given by
where gm(x) is the geometric mean of the components of the composition x. Let Xℓ be the log transformed compositions, that is Xℓ = ln (X) with the natural logarithmic transformation applied element-wise. The clr transformed data can be obtained by just centering the rows of this matrix, using the centering matrix . Then
The rows of Xclr are subject to a zero sum constraint because Hr1 = 0. If there are no additional linear constraints, then Xclr will have rank D−1. We now column-center the clr transformed data, producing a double-centered data matrix that has zero column and row means:
where Hc is the centering matrix . Matrix Xcclr is used as the input for a classical principal component analysis. We perform PCA by the singular value decomposition:
with Fp = UD and Gs = V. Matrix Fp contains the principal components, and its first two columns contain the biplot coordinates of the compositions. The columns of Gs are the eigenvectors of the covariance matrix of Xcclr, its first two columns contain the biplot coordinates of the parts of the compositions. We use sub-indexes p and s to distinguish principal and standard biplot coordinates. We will need to project supplementary compositions onto a given biplot (see section 3). This can be accomplished by regression (Graffelman and Aluja-Banet, 2003). The biplot coordinates, , of a matrix of supplementary compositions, Y, can be found as
where Ycclr contains the clr-transformed supplementary compositions, but centered with respect to the compositions in X, that is
We will construct a biplot of genotypic reference compositions by using Equation (4), and project empirical genotype compositions onto the biplot by using Equations (5) and (6).
2.3. Log-Ratio PCA of Genotype Sharing Data
For bi-allelic variants with alleles A and B, there exist six possible pairs of genotypes whose counts over k variants can be laid out in a triangular array shown in Table 3, where kij refers to the number of variants that have i B alleles for one individual, and j B alleles for the other individual. Consequently, each pair can be represented by a vector of six counts which can be expressed as a composition by division by its total (closure):
The total number of variants is given by . For PO pairs this vector has, in theory, a structural zero, k20 = 0, because PO pairs share at least one IBS allele. However, for empirical data k20 = 0 is, with large k, almost never observed due to the existence of some mutations and genotyping error. Given n individuals, we construct matrix X with pairs in its rows, and propose to study relatedness by a log-ratio PCA of this q×6 matrix of compositions. This will allow the construction of a biplot, where each pair of individuals is represented by a point, and each part of the clr transformed composition by a vector. A drawback of the representation of pairs of individuals in a log-ratio PCA biplot is that the type of relationship cannot be inferred if it is undocumented. Without additional analysis one does not know for sure whether observed clusters correspond to FS, HS, or other pairs. We resolve this by first identifying a subset of approximately unrelated individuals in the database, having a co-ancestry coefficient with other individuals that is below 0.05. We next simulate pairs of related individuals of known relationships by constructing pedigrees from this subset, applying the Mendelian inheritance rules. For example, PO pairs are simulated by first drawing two parents at random from the unrelated subset. A child is then simulated by drawing one allele at random from both these parents. The process is repeated in order to generate many random PO pairs. FS, HS, and pairs of other relationships are simulated in an analogous manner. This process is based on a re-sampling the alleles of the observed individuals. The artificially generated data set forms a reference set or training set against which the empirically observed data can be compared. This reference set is generated conditionally on the allele frequencies of the observed sample. We now first apply log-ratio PCA to the pairs of the reference set (X), and construct a biplot of the reference set. The empirically observed pairs (Y) are projected onto this PCA biplot and their relationship is inferred, according to which simulated type of relationship is most close to the empirical pair. This can be done in a quantitative way by classifying all empirical pairs with linear discriminant analysis (LDA) (Johnson and Wichern, 2002), using the simulated pairs as a training set.
In this section we first validate the proposed methodology with some simulations, comparing the log-ratio PCA approach with the well-known aforementioned (m, s), (p0, p2), and (, ) plots, and then show two examples with empirical genetic data.
We simulated 35,000 independent genetic bi-allelic variants by sampling from a multinomial distribution under the Hardy-Weinberg assumption, using a minor allele frequency (MAF) of 0.5 for all variants. Using Mendelian inheritance rules, 100 independent pairs of each type of relationship were simulated. We assume a homogeneous population without mutation and genotyping error, generating simulated data sets that are free of Mendelian inconsistencies. The classical plots and the log-ratio PCA biplot of a simulation are shown in Figure 2. This figure shows that first and second degree pairs are easily identified by all methods. We will therefore focus on third and higher degree relationships which are harder to distinguish as they tend to blur in the plots. We investigated the effect of MAF and number of SNPs on the classification rate of our procedure, using different numbers of principal components for classification of third through sixth degree pairs (100 of each). Figure 3 shows the classification rates obtained as a function of the minor allele frequency (MAF), the number of SNPs and the number of principal components used. These figures show, as expected, that the classification rate increases with the MAF and the number of SNPs. The simulations show that all five components are needed at low MAF, where more components increase the classification rate. At high MAF (0.40–0.50) there is little or no benefit in using more than two components. With 35,000 SNPs at 0.50 MAF the classification rate is around 95% irrespective of the number of components. With 35,000 SNPs at 0.10 MAF the classification rate varies from below 50% with one component through 93% using all five components. We report the false positive rates in Table S1; No UN or 6th degree individuals were misclassified as 4th degree or lower, and only 1.8% of the 5th degree pairs are misclassified as 4th degree. The simulations show that IBS based log-ratio PCA can discriminate higher degree relationships if a sufficient number of independent highly polymorphic variants is available. In the light of the simulations, we decided to use three principal components for classification with high MAF variants for the empirical data sets described in section 3.3.
Figure 2. Classical graphics and log-ratio PCA biplot for simulated samples. 100 pairs of each type of relationship [UN, sixth, fifth, fourth, third (FC), second (HS), FS, and PO] were generated using 35,000 independent bi-allelic variants with minor allele frequencies of 0.5, assuming Hardy-Weinberg equilibrium. (A) Scatterplot of the mean and standard deviation of the number of IBS alleles. (B) Scatterplot of the fraction of variants sharing two (p2) against the fraction sharing zero (p0) IBS alleles. (C) Scatterplot of the estimated probability of sharing one () against the estimated probability of sharing zero () IBD alleles. (D) log-ratio PCA biplot.
Figure 3. Classification rate of log-ratio PCA combined with LDA for simulated samples. Classification rate for a varying number of principal components (PCs). Classification rates were obtained using 100 pairs of each type of relationships (UN, sixth, fifth, fourth, and third) using independent variants simulated assuming Hardy-Weinberg equilibrium. (A,B) Classification rates are shown as a function of the MAF for 5,000 and 35,000 SNPs. (C,D) Classification rates are shown as a function of the number of SNPs of a given MAF (0.10 and 0.50).
3.2. Method Comparison
We compare our method with aforementioned classical procedures for identification of related pairs. Figure 4 shows the classification rate as a function of the number variants with MAF 0.50 for four methods: the two IBS-based methods, the (m, s) plot and the (p0, p2) plot; one IBD-based method, the (, ) plot, using the KING estimator (Manichaikul et al., 2010); and the log-ratio PCA approach proposed in this paper. These classification rates were obtained by averaging over 25 replicates of the simulations, for each value of the MAF and the number of variants. It is clear that the log-ratio PCA approach (using three principal components) gives the best classification rates for all relationships. There is little difference in classification rate for third degree relationships, which are relatively more easy to classify. Interestingly, in terms of classification rate the (m, s) and (p0, p2) plots are seen to be fully equivalent, as they have exactly the same classification rate profile. Posteriorly, we found these statistics to be related by the equations m = 1 − p0 + p2 and . As expected, classification rate increases with the number of variants. The results suggest that for all four methods 25,000 variants with MAF 0.50 are sufficient to almost perfectly classify PO, FS, second, third, and fourth degree relationships. The difference in classification rate between the log-ratio PCA approach and the conventional methods is larger for the more remote relationships. This simulation concerns a relatively ideal dataset with independent variants and maximally polymorphic variants. For empirical data sets, the independence of the variants can be approximately achieved by LD pruning variants. In practice, many variants have a low MAF. We therefore also investigated the effect of the MAF on the discriminatory power of the different methods, by simulating variants with different MAFs. Figure 5 shows how the classification rate varies as a function of the MAF, using a fixed number of 5.000 bi-allelic polymorphisms. The log-ratio PCA approach, using five principal components, is seen to outperform the classical plots over the full MAF range.
Figure 4. Classification rates for different methods vs. number of SNPs. Classification rates for the different degrees of relationship (third, fourth, fifth, sixth, UN, and All) are shown for four methods, using five principal components. Classification rate profiles for the (m, s) plot and the (p0, p2) plot virtually coincide. The last panel All refers to the classification rate for third through UN relationships jointly. Rates are shown as a function of the number of SNPs with MAF 0.50, and were obtained by linear discriminant analysis. 100 pairs of each type of relationship (UN, fifth, fourth, third, second, FS, and PO) were generated assuming Hardy-Weinberg equilibrium.
Figure 5. Classification rates for different methods vs. MAF. Classification rates for the different degrees of relationship (third, fourth, fifth, sixth, UN, and All) are shown for four methods, using three principal components. Classification rate profiles for the (m, s) plot and the (p0, p2) plot virtually coincide. The last panel All refers to the classification rate for third through UN relationships jointly. Rates are shown, using 5,000 SNPs, as a function of the MAF, and were obtained by linear discriminant analysis. 100 pairs of each type of relationship (UN, sixth, fifth, fourth, and third) were generated assuming Hardy-Weinberg equilibrium.
3.3. Empirical Data Sets
In this section we use log-ratio PCA for a relatedness study of two genomic data sets. We use the CEU population of the 1,000 genomes project (www.internationalgenome.org, The 1000 Genomes Project Consortium, 2015), whose family relationships have been analyzed in detail by Pemberton et al. (2010), Kyriazopoulou-Panagiotopoulou et al. (2011), Huff et al. (2011), and Stevens et al. (2011; 2012). We also present a relatedness study of the population-based GCAT Genomes for Life project (a cohort study of the genomes of Catalonia, www.genomesforlife.com, Obón-Santacana et al., 2018). For both projects, we used Plink 1.90 (Purcell et al., 2007) for data manipulation, filtering and IBD estimation, and R (R Core Team, 2014) for log-ratio PCA and discriminant analysis.
3.3.1. The CEU Sample
First and second degree relationships for the CEU population were documented by Pemberton et al. (2010) using IBS methods, and confirmed by Kyriazopoulou-Panagiotopoulou et al. (2011), who used hidden Markov models and suggested additional third and fourth degree relationships. Stevens et al. (2012) used IBD methods confirming the results of Pemberton et al. (2010). We detail the analysis of the CEU panel using log-ratio PCA. Variants were filtered according to missingness (only variants genotyped for all individuals were used), MAF (>0.40) and Hardy-Weinberg equilibrium test result (exact test mid p-value >0.05, Graffelman and Moreno, 2013). Variants were LD-pruned with Plink using a sliding window of 50 SNPs with an overlap of 5 SNPs between successive windows, and SNPs are removed from the window until no variants remain that have a squared correlation above 0.20 (Plink option indep-pairwise 50 5 0.2). The final data set contained 31,370 autosomal variants. The CEU panel consists of 165 individuals, mainly PO trios, giving 13,530 possible pairs of individuals. The classical plots of the allele sharing statistics were shown previously in Figure 1, including a log-ratio PCA biplot of all pairs (Figure 1D). We now illustrate the log-ratio PCA approach, using an iterative peel and zoom procedure. Figure 1D showed PO pairs to be outlying in the first dimension, for having a low k02/k00 ratio. Theoretically, this ratio is zero for PO pairs, though with large numbers of variants it is non-zero due to mutations and genotyping errors. In fact, the 96 reported PO pairs are easily identified and excluded from the data by filtering with k02 < 0.005. Log-ratio PCA biplots, obtained by simulation with unrelated individuals of the CEU sample, are shown in Figure 6. The simulated pairs of given relationships are represented by convex hulls, and the projected empirical pairs by open dots that are colored according to their predicted relationship, where the latter are inferred from the posterior probabilities obtained in LDA. The convex hulls delimit the cloud of the positions of the simulated UN, sixth, fifth, fourth, and third degree pairs (using 100 pairs of each). The overall classification rate of the simulated data was 91.4%, using three principal components. Classification rates for third, fourth, fifth, sixth, and UN were, respectively 100, 100, 90, 77, and 90%. Results in Figures 1, 6 suggest the CEU sample has 96 PO pairs, one FS pair, two second degree pairs, one third degree pair, five fourth degree pairs, and many fifth and sixth degree pairs that merge with UN pairs. The analysis without PO pairs in Figure 6A shows the documented FS and AV pairs as outliers in the first dimension, for having high k00/k02 and k22/k02 ratios. Re-analysis after removal of the FS pair gives Figure 6B, showing the two AV pairs now as strong outliers in the first dimension. Peeling these two pairs, we obtain Figure 6C, with the single documented third degree pair being now the most prominent outlier. Five additional pairs are seen to separate from the UN cloud, and are classified as fourth degree pairs. Re-analysis after peeling off the third degree pair gives a plot with a more clear separation of the fourth degree pairs (Figure 6D). Another set of pairs, presumably of fifth degree, is seen to bud off from the UN cloud more clearly, once the fourth degree pairs are removed from the analysis (Figure 6E), and additional pairs, classified as sixth degree, separate out partly in the third dimension of this analysis. An exploration of the data up to the fifth dimension of the analysis, after peeling the most obvious PO, FS, AV, third, and fourth degree outliers, is shown in Figure S1. These graphs suggest there is information on relatedness up to and including at least the third dimension of the analysis.
Figure 6. Log-ratio PCA biplots for the CEU sample obtained by peeling and zooming. (A) log-ratio PCA biplot, PO pairs excluded. (B) PO and FS pairs excluded; (C) PO, FS, and AV pairs excluded; (D) PO, FS, AV, and third degree pairs excluded; (E) PO, FS, AV, third and fourth degree pairs excluded (PC1 vs. PC2); (F) PO, FS, AV, third and fourth degree pairs excluded (PC1 vs. PC3). Convex hulls delimit the region of the pairs obtained by simulation.
The classification of the empirical pairs by k02 filtering followed by linear discriminant analysis confirmed the 96 PO and the single FS pair relationships described by Pemberton et al. (2010) (results not shown), as well as the additional FC pair reported by Kyriazopoulou-Panagiotopoulou et al. (2011). First and second degree relationships in the CEU sample are easily and almost certainly identified. Much more uncertainty resides in relationships of the third and higher degrees, and for these relationships conflicting inferences are reported in the literature. We therefore carried out a linear discriminant analysis with a simulated training sample containing pairs with a third through sixth degree relationship, as well as UN pairs, and classified all empirical pairs which clearly had no first or second degree relationship. Third and fourth degree relationships uncovered by Kyriazopoulou-Panagiotopoulou et al. (2011) are reported in Table 4, together with the posterior probabilities obtained in our log-ratio PCA approach. We extended Table 4 with additional fifth degree pairs uncovered by log-ratio PCA, for which LDA gave the highest posterior probability. In total, 18 pairs were classified as fifth degree relationship pairs, of which 10 had a posterior probability above 0.95 (marked in bold in Table 4). We tentatively suggest the CEU panel to contain at least ten fifth degree pairs. We found 1,285 sixth degree pairs, but do not report all these pairs in the light of the overlap with the UN cluster and the somewhat poorer classification rate of the sixth degree observed in the simulations.
Table 4. Predicted relationships of third (3rd), fourth (4th), and fifth (5th) degree pairs of the CEU sample.
Our results confirm a third degree pair (pair 1 in Table 4) reported by Kyriazopoulou-Panagiotopoulou et al. (2011). We also confirm four of the fourth degree pairs reported by the latter authors (pairs 2–5 in Table 4). However, we also observed considerably incongruence of our results with those of the latter authors. We found an FC pair to be classified as fourth degree (pair 6) by our method and 11 reported fourth degree pairs were classified as fifth or sixth degree. We also compared results with those published by Huff et al. (2011), who estimate recent shared ancestry (ERSA) by using IBD segments. Our work confirms three fourth degree pairs and one fifth degree pair reported by the latter authors, though we found two additional fourth degree pairs, and several fifth degree pairs, which are not confirmed by Huff et al. (2011).
3.3.2. The GCAT Sample
We use samples from the GCAT Genomes for life project, a cohort study of the genomes of Catalonia (www.genomesforlife.com). GCAT is a prospective cohort study that includes 17,924 participants (40–65 years, release August 2017) recruited from the general population of Catalonia, a Mediterranean region in the northeast of Spain. Participants are mainly part of the Blood and Tissue Bank (BST), a public agency of the Catalan Department of Health. Detailed information regarding the GCAT project is described in Obón-Santacana et al. (2018). We study relatedness of 5,075 GCAT Spanish participants from Caucasian origin using 736,223 SNPs that passed quality control (Galván-Femenía et al., 2018). Inferred relatives of first and second degree were confirmed by the BST public agency, for pairs sharing one surname (PO, second degree pairs) or two surnames (FS pairs), respecting the privacy of the participants. According to the same filtering procedures used in the CEU samples, 26,006 SNPs (MAF > 0.40, LD-pruned, HWE exact mid p-value >0.05, and missing call rate 0) were considered for relatedness analysis. PO and MZ pairs potentially having structural zeros were filtered with k02 < 0.005. Log-ratio PCA biplots representing over twelve million pairs, combined with the classification of the individuals by LDA, and using the peel and zoom procedure, are shown in Figure 7. This analysis shows the different relationships have in general, a larger variability than expected according to the simulated pairs. The FS cluster has a particular high variability, with pairs apparently less related than FS, and pairs stronger related than FS, in comparison with the FS hull. One apparent FS pairs is actually classified as second degree (Figure 7A). This fusion of FS and second degree pairs suggested us that three-quarter siblings might exist in the database and we therefore re-analyzed the data using a training set that included three-quarter siblings. Three-quarter siblings (3/4S) share more IBD alleles than second degree pairs but fewer than FS. 3/4S have one common parent, while their unshared parents can be FS or PO (see Figure S2). Three-quarter siblings have IBD probabilities k0 = 3/8, k1 = 1/2, and k2 = 1/8, such that their kinship coefficient is ϕ = 3/16, below the value ϕ = 1/4 of full siblings. In the re-analysis in Figure 7B, we found 63 FS pairs, 12 2nd pairs, and eight pairs were indeed classified as three-quarter siblings with large posterior probability (see Table 5). Two of these pairs (67, 71) had their kinship coefficient very close to the expected value of ϕ = 3/16. Because Spanish people have both paternal and maternal surnames, three-quarter siblings share both surnames just as siblings do. The pairs classified as 3/4 siblings shared indeed both surnames, confirming these pairs are actually not second degree. Peeling siblings and three-quarter siblings reveals apparent second degree pairs more clearly (Figure 7C). Tentatively peeling second degree pairs brings the third degree pairs in focus (Figure 7D), and in this analysis we find 174 third, 66 fourth, 31 fifth, and 3,517 sixth degree pairs. Further peeling is difficult as the different clusters increasingly merge. In log-ratio PCA the clusters representing the different relationships have more elliptical shapes that separate better. Note that the number of pairs classified as sixth degree decreases as the lower degree relationships are peeled in the analysis.
Figure 7. Log-ratio PCA biplot of GCAT sample obtained by peeling and zooming. (A) log-ratio PCA biplot, PO and 3/4S pairs excluded. (B) 3/4S pairs included; (C) FS and 3/4S pairs excluded; (D) FS, 3/4S, and second degree pairs excluded. Convex hulls delimit the region of the pairs obtained by simulation.
For all simulated and empirical data sets studied above, the first principal component in the log-ratio PCA's is seen to strongly correlate with the kinship coefficient. The corresponding scatterplots and correlation coefficients are shown in Figure S3. The first principal component is clearly interpretable as a relatedness index. In Figures 6A, 7A (without PO), the biplot vectors show that the first component separates homogeneous homozygote pairs (AA & AA; BB & BB) from heterogeneous homozygote pairs (AA & BB). The second principal component separates double heterzygote pairs from single heterozygote pairs. When FS pairs are removed, the second principal component changes, and reflects a contrast between pairs with heterozygotes and without heterozygotes.
We have developed a log-ratio PCA based procedure that can be used for uncovering cryptic relatedness in homogeneous populations. Simulations show the procedure has a better classification rate than the classical IBS and IBD based approaches. The log-ratio PCA approach exploits the compositional nature of genotype sharing counts over variants, and can potentially use five dimensions for analysis, whereas the classical approaches collapse the data in two dimensions. The analysis of the CEU sample has led to the identification of a set of hitherto unreported pairs for which a fifth degree relationship is highly plausible (Table 4). Our conclusion is that log-ratio PCA, combined with LDA, increases the resolution of relationship discrimination. The classification rate for 6th degree pairs can still be improved if more than 35,000 independent MAF 0.50 variants would be used (see Figure 4). The (p0, p2), (m, s), and (, ) scatterplots display estimates in a constrained space (Galván-Femenía et al., 2017), where Euclidean distances between points cannot be safely interpreted. This is particularly true for the higher degree relationships that merge toward the vertex of the triangular region inside the scatterplot. Log-ratio PCA, besides using more dimensions, frees the data of the unit sum constraint, and clearly enhances the discrimination of the higher degree relationships. We have compared our log-ratio based procedure with some basic procedures used in relatedness research. Its performance could be further explored in a more extensive comparison that includes IBD-segment based methods, such as the comprehensive study reported by Ramstetter et al. (2017).
The analysis of the GCAT samples shows, for almost all relationship categories, larger variability in the relationship clusters than would be expected under strict Mendelian sampling of alleles from unrelated individuals. This excess variability can, at least in part, be explained by the presence of additional relatedness between (unobserved) close relatives of the individuals in the database. This leads to increased autozygosity, which is a characteristic of more endogamous populations. The occurrence of three-quarter siblings is just a particular instance of this phenomenon. Consequently, the degree of relatedness of two individuals tends to become a continuous variable, which is increasingly hard to discretize into the standard relationship categories.
The simulated reference data sets were obtained by resampling genetic variants independently, and this does not take linkage disequilibrium (LD) and recombination into account (Hill and Weir, 2011). If the genotype data is phased, a biologically more realistic simulated data set can be obtained by sampling haplotypes. We have avoided this issue by LD pruning the data base prior to resampling, so removing tightly correlated markers. The reference data set is therefore constructed on the basis of a subset of variants that can expected to be approximately independent. This subset is then used as the basis for relationship estimation. This procedure has the advantage that it avoids potential additional uncertainty generated by using a phasing algorithm. However, the proposed procedure may be improved in the future by accounting for haplotype structure and recombination. The pruning threshold used in our method (0.20) is a compromise between precision and satisfying the independence assumption. A larger value will admit more variants and can increase the resolution, but due to correlation between variants it will invalidate the independence assumption used to generate the reference set of related pairs.
The proposed method for classifying pairs combining log-ratio PCA and discriminant analysis is seen to perform well with both simulated and empirical data. The sampling of artificially related pairs from the observed data requires a considerable number of approximately unrelated individuals to be present in the database. We therefore suggest the method to be used for large samples with thousands of individuals, where such a substantial subset of unrelated individuals can be identified. This is probably not an obstacle for the use of our method, as increasingly large samples are being used in epidemiological genomics. The sampling of artificial pairs from the observed data respects the allele frequency distribution of the original data, and provide reference areas for the different relationships given the allele frequencies of the observed data. Note that with only one hundred simulated pairs of each relationship, we build a classifier that can be used to classify millions of pairs. Our method is computationally feasible for over 5,000 individuals and 26,000 variants like in the GCAT sample. Most of the computation time is spent on the projection of the empirical pairs onto the reference structure, and these computations could easily be parallelized. Many public repositories of genomic data are currently available, but without recruitment and relatedness information, and for which the relatedness techniques discussed in this paper could be usefully applied.
The log-ratio transformation in Equation (1) does not admit zeros for the genotype sharing counts. In theory MZ pairs have k10 = k20 = k21 = 0, and PO pairs have k20 = 0. In practice, due to the summing over large numbers of variants, zeros are almost never observed as a consequence of some genotyping error and incidental mutations. If a few zero counts are observed, a replacement by 1 or 0.5 can eventually be used in order to proceed with the analysis. If there is a substantial amount of zeros, a ratio-preserving multiplicative replacement (Fry et al., 2000; Martín-Fernández et al., 2003) or a Bayesian procedure (Martin-Fernandez et al., 2015) are recommended. The zero problem is well-known in compositional data analysis, and a distinction is usually drawn between structural and rounding zeros (Martín-Fernández et al., 2003, 2011). In principle, MZ and PO pairs have structural zeros. However, MZ and PO pairs are the most easily detected relationships, and are easily dealt with separately, prior to applying the log-ratio transformation to the data. Exclusion of the relationships up to the second or third degree is in fact desirable if possible, as it will allow the study of the more remote relationships at higher resolution.
We recommend the use of discriminant analysis in allele-sharing studies as employed in this paper. The posterior probabilities of the different relationships give a quantitative criterion for deciding upon which relationship is most likely for a given pair of individuals. In allele sharing studies this decision is mostly made graphically by inspecting a (p0, p2) plot in IBS studies, or a (, ) plot in IBD studies. We note that these posterior probabilities differ from those used in a standard discriminant analysis, in the sense that they are affected by additional uncertainty generated by using a training set obtained by a resampling of the observed data.
Applications of IBD based methods typically employ three Cotterman coefficients that are constrained to sum one, and therefore represent relatedness in only two dimensions. However, IBD based methods can estimate additional Jacquard coefficients (Milligan, 2003) and thus potentially exploit more dimensions than is usually done in practice.
The current paper is focused on homogeneous populations. If population substructure exists, then log-ratio PCA can be expected to separate the different populations in its biplot. Methods that address substructure (distant relatedness) and family relationships (recent relatedness) jointly have been developed (Manichaikul et al., 2010; Conomos et al., 2015). Population substructure can be accounted for by using only variants with low weights on the first components for a relatedness analysis, as is done in the UK Biobank project (Bycroft et al., 2018), as the first components mostly capture substructure. In future work, the usefulness of the log-ratio PCA approach for the joint study of remote and recent relatedness could be further explored.
Our study does use data from human subjects, but concerns data that is available in public repositories.
JG and IG contributed equally to this paper, where JG conceived the methodology and wrote the paper. IG developed computer programs, ran simulations, and performed data analysis. RdC supervised GCAT data analysis. RdC and CBV proof-read the manuscript. All authors contributed to the improvement of the paper.
This work was partially supported by grants MTM2015-65016-C2-2-R (JG), MTM2015-65016-C2-1-R (IG and CBV) and ADE 10/00026 (RdC) (MINECO/FEDER) of the Spanish Ministry of Economy and Competitiveness and European Regional Development Fund, by grants SGR1269 and 2017 SGR529 (RdC) of the Generalitat de Catalunya, by grant R01 GM075091 (JG) from the United States National Institutes of Health, and by the Ramon y Cajal action RYC-2011-07822 (RdC).
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We are grateful for the publicly available data sets of the 1,000 Genomes project, available at www.internationalgenome.org. This study makes use of data generated by the GCAT Genomes for Life Cohort study of the Genomes of Catalonia, IGTP. A full list of the investigators who contributed to the generation of the data is available from www.genomesforlife.com. IGTP is part of the CERCA Program of the Generalitat de Catalunya.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00341/full#supplementary-material
Anandan, A., Anumalla, M., Pradhan, S., and Ali, J. (2016). Population structure, diversity and trait association analysis in rice (Oryza sativa L.) germplasm for early seedling vigor (esv) using trait linked ssr markers. PLoS ONE 11:e0152406. doi: 10.1371/journal.pone.0152406
Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L., Sharp, K., et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209. doi: 10.1038/s41586-018-0579-z
Conomos, M., Miller, M., and Thornton, T. (2015). Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness. Genet. Epidemiol. 39, 276–293. doi: 10.1002/gepi.21896
Galván-Femenía, I., Obón-Santacana, M., Piñeyro, D., Guindo-Martinez, M., Duran, X., Carreras, A., et al. (2018). Multitrait genome association analysis identifies new susceptibility genes for human anthropometric variation in the GCAT cohort. J. Med. Genet. 55, 765–778. doi: 10.1136/jmedgenet-2018-105437
Graffelman, J., and Aluja-Banet, T. (2003). Optimal representation of supplementary variables in biplots from principal component analysis and correspondence analysis. Biometr. J. 45, 491–509. doi: 10.1002/bimj.200390027
Huff, C., Witherspoon, D., Simonson, T., Xing, J., Watkins, W., Zhang, Y., et al. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome Res. 21, 768–774. doi: 10.1101/gr.115972.110
Jakobsson, M., Scholz, S., Scheet, P., Gibbs, J., VanLiere, J., Fung, H., et al. (2008). Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451, 998–1003. doi: 10.1038/nature06742
Kyriazopoulou-Panagiotopoulou, S., Kashef-Haghighi, D., Aerni, S., Sundquist, A., Bercovici, S., and Batzoglou, S. (2011). Reconstruction of genealogical relationships with applications to Phase III of HapMap. Bioinformatics 27, i333–i341. doi: 10.1093/bioinformatics/btr243
Manichaikul, A., Mychaleckyj, J., Rich, S., Daly, K., Sale, M., and Chen, W. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873. doi: 10.1093/bioinformatics/btq559
Martín-Fernández, J., Barceló-Vidal, C., and Pawlowsky-Glahn, V. (2003). Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math. Geol. 35, 253–278. doi: 10.1023/A:1023866030544
Martin-Fernandez, J., Hron, K., Templ, M., Filzmoser, P., and Palarea-Albaladejo, J. (2015). Bayesian-multiplicative treatment of count zeros in compositional data sets. Stat. Model. 15, 134–158. doi: 10.1177/1471082X14535524
Martín-Fernández, J., Palarea-Albaladejo, J., and Olea, R. (2011). “Dealing with zeros,” in Compositional Data Analysis: Theory and Applications, eds V. Pawlowsky-Glahn and A. Buccianti (Chichester: John Wiley & Sons), 43–58.
Nembot-Simo, A., Graham, J., and McNeney, B. (2013). CrypticIBD check: an R package for checking cryptic relatedness in nominally unrelated individuals. Source Code Biol. Med. 8:5. doi: 10.1186/1751-0473-8-5
Obón-Santacana, M., Vilardell, M., Carreras, A., Duran, X., Velasco, J., Galván-Femenía, I., et al. (2018). GCAT|Genomes for Life: a prospective cohort study of the genomes of catalonia. BMJ Open 8:e018324. doi: 10.1136/bmjopen-2017-018324
Oliehoek, P., Windig, J., van Arendonk, J., and Bijma, P. (2006). Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics 173, 483–496. doi: 10.1534/genetics.105.049940
Pandit, E., Tasleem, S., Barik, S., Mohanty, D., Nayak, D., Mohanty, S., et al. (2017). Genome-wide association mapping reveals multiple qtls governing tolerance response for seedling stage chilling stress in indica rice. Front. Plant Sci. 8:552. doi: 10.3389/fpls.2017.00552
Pemberton, T., DeGiorgio, M., and Rosenberg, N. (2013). Population structure in a comprehensive genomic data set on human microsatellite variation. Genes Genomes Genet. 3, 891–907. doi: 10.1534/g3.113.005728
Pemberton, T. J., Wang, C., Li, J. Z., and Rosenberg, N. A. (2010). Inference of unexpected genetic relatedness among individuals in hapmap phase iii. Am. J. Hum. Genet. 87, 457–464. doi: 10.1016/j.ajhg.2010.08.014
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M., Bender, D., et al. (2007). Plink: a toolset for whole-genome association and population-based linkage analysis. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795
Ramstetter, M., Dyer, T., Lehman, D., Curran, J., Duggirala, R., Blangero, J., et al. (2017). Benchmarking relatedness inference methods with genome-wide data from thousands of relatives. Genetics 207, 75–82. doi: 10.1534/genetics.117.1122
Rosenberg, N. A. (2006). Standardized subsets of the HGDP-CEPH Human Genome Diversity cell line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70, 841–847. doi: 10.1111/j.1469-1809.2006.00285.x
Sabatti, C., Service, S., Hartikainen, A., Pouta, A., Ripatti, S., Brodsky, J., et al. (2009). Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41, 35–46. doi: 10.1038/ng.271
Sharma, S., MacKenzie, K., McLean, K., Dale, F., Daniels, S., and Bryan, G. (2018). Linkage disequilibrium and evaluation of genome-wide association mapping models in tetraploid potato. G3 (Bethesda) 8, 3185–3202. doi: 10.1534/g3.118.200377
Stevens, E., Baugher, J., Shirley, M., Frelin, L., and Pevsner, J. (2012). Unexpected relationships and inbreeding in HapMap Phase III populations. PLoS ONE 7:e49575. doi: 10.1371/journal.pone.0049575
Stevens, E., Heckenberg, G., Roberson, E., Baugher, J., Downey, T., and Pevsner, J. (2011). Inference of relationships in population data using indentity-by-descent and identity-by-state. PLoS Genet. 7:e1002287. doi: 10.1371/journal.pgen.1002287
Wang, C., Szpiech, Z., Degnan, J., Jakobsson, M., Pemberton, T., Hardy, J., et al. (2010). Comparing spatial maps of human population-genetic variation using procrustes analysis. Stat. Appl. Genet. Mol. Biol. 9:13. doi: 10.2202/1544-6115.1493
Keywords: allele sharing, composition, identity by state, identity by descent, log-ratio transformation
Citation: Graffelman J, Galván Femenía I, de Cid R and Barceló Vidal C (2019) A Log-Ratio Biplot Approach for Exploring Genetic Relatedness Based on Identity by State. Front. Genet. 10:341. doi: 10.3389/fgene.2019.00341
Received: 05 December 2018; Accepted: 29 March 2019;
Published: 24 April 2019.
Edited by:Steven J. Schrodi, Marshfield Clinic, United States
Reviewed by:Himel Mallick, Merck, United States
Wei-Min Chen, University of Virginia, United States
William C. L. Stewart, The Research Institute at Nationwide Children's Hospital, United States
Fabrice Larribe, Université du Québec à Montréal, Canada
Copyright © 2019 Graffelman, Galván Femenía, de Cid and Barceló Vidal. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jan Graffelman, firstname.lastname@example.org
†These authors have contributed equally to this work
‡On behalf of GCAT Project Team