Current Developments in Detection of Identity-by-Descent Methods and Applications

Identity-by-descent (IBD), the detection of shared segments inherited from a common ancestor, is a fundamental concept in genomics with broad applications in the characterization and analysis of genomes. While historically the concept of IBD was extensively utilized through linkage analyses and in studies of founder populations, applications of IBD-based methods subsided during the genome-wide association study era. This was primarily due to the computational expense of IBD detection, which becomes increasingly relevant as the field moves toward the analysis of biobank-scale datasets that encompass individuals from highly diverse backgrounds. To address these computational barriers, the past several years have seen new methodological advances enabling IBD detection for datasets in the hundreds of thousands to millions of individuals, enabling novel analyses at an unprecedented scale. Here, we describe the latest innovations in IBD detection and describe opportunities for the application of IBD-based methods across a broad range of questions in the field of genomics.


INTRODUCTION
The rapid growth and increasing availability of biobank-scale datasets have led to their increased utilization in human genetics studies; however, the demographic and evolutionary forces that underlie genomic patterns within these data are often overlooked. Biases in sample recruitment have led to underrepresentation of non-European ancestry participants, limiting the scope and broad applicability of medical genomics and precision medicine. Additionally, standard genetic analytical frameworks often overlook the fine-scale population structure relevant to the segregation of rare variants, despite their role in common, complex diseases becoming increasingly apparent (Hernandez et al., 2019; Taliun et al., 2021). For these reasons, there is an increasing need for novel methods that can account for demographic substructure driving patterns of variation across the site frequency spectrum in large, diverse cohorts (Gravel et al., 2011). The principle of identity-by-descent (IBD) offers a framework through which we can interpret and leverage the demographic histories of large-scale human genomic data, and improve statistical power to detect causal variants.
Identity-by-descent is the shared inheritance of an identical portion of the genome between two individuals (Browning, 2008; Gusev et al., 2008; Browning and Browning, 2010, Browning and Henn et al., 2012; Thompson, 2013). This is distinct from identity-by-state (IBS), in which a portion of two individuals' genomes may appear identical, but not necessarily due to recent shared co-inheritance. Leveraging properties of IBD allows researchers to infer a vast amount of information about a population's demographic history (Carmi et al., 2013; Palamara and Pe'er, 2013; Nait Saada et al., 2020), allowing for evolutionary and pedigree-derived insights that can aid in the interpretation of genetic variation. Further, identifying these shared segments from a recent common ancestor can enrich for shared patterns of rare variation, due to the relationship between allele age and frequency (Slatkin and Rannala, 2000). In essence, inference of IBD sharing at the population level can allow for the same genetic frameworks behind pedigree studies and linkage analyses to be applied to large population-level genotyped or sequenced data sets. In this review, we explore the population genomic principles governing patterns of IBD sharing, past and recent methods for detecting IBD in population-scale data, and downstream applications in contemporary human genomics.

GOVERNING EVOLUTIONARY POPULATION GENETICS PRINCIPLES
Methods of IBD detection, or the identification of haplotypes likely to arise from a recent common ancestor, are well established in theory but are rarely applied to modern, biobank-scale datasets. Modern IBD detection algorithms have been shown to have high accuracy and quick computational run times (Ramstetter et al., 2017). The underlying principle is that long haplotypes shared between individuals are statistically more likely to arise from relatedness due to deep, shared population history than from random recombination or mutation (Browning, 2008; Browning and Browning, 2015). The more closely related individuals are, the higher the percentage of their genomes that will be shared IBD, since they share a common ancestor more recently in their genealogical history than two randomly sampled individuals. As populations both diverge and intermix over time, IBD segments are shortened by recombination (Carmi et al., 2013; Palamara and Pe'er, 2013); longer haplotypic segments therefore tend to represent more recent relatedness, because recombination has had fewer generations in which to break them down (Henn et al., 2012). For a given set of observed genetic data and associated recombination rate estimates, the unknown population history can be modeled by the population genetics principle of the coalescent. This results in an abundance of information that can be inferred from the properties of the shared IBD segments. The length of a shared IBD segment serves as a proxy for the age of the most recent common ancestor at that genomic region, i.e., a longer IBD segment reflects a more recent common ancestor. Therefore, by using IBD to measure local relatedness between individuals along the genome, it is possible to infer aspects of a population's demographic history.
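The relationship between segment length and ancestor age can be made concrete with a short calculation. The sketch below rests on a simplified model that is an assumption here, not a method from the works cited above: two haplotypes are separated by 2g meioses (g generations up to the common ancestor and g back down), crossovers occur at a rate of one per Morgan per meiosis, and segment lengths are therefore approximately exponential with mean 100/(2g) cM. The function names are illustrative only.

```python
import math


def expected_ibd_length_cm(generations_to_ancestor: int) -> float:
    """Mean IBD segment length in cM under the simplified exponential model.

    Assumes 2 * g meioses separate the two haplotypes and a crossover rate
    of one per Morgan (100 cM) per meiosis.
    """
    if generations_to_ancestor < 1:
        raise ValueError("generations_to_ancestor must be at least 1")
    return 100.0 / (2 * generations_to_ancestor)


def prob_segment_longer_than(length_cm: float, generations_to_ancestor: int) -> float:
    """P(segment length > length_cm) under the same exponential model."""
    rate_per_cm = 2 * generations_to_ancestor / 100.0
    return math.exp(-rate_per_cm * length_cm)


if __name__ == "__main__":
    for g in (2, 10, 25, 50):
        print(g, expected_ibd_length_cm(g), round(prob_segment_longer_than(3.0, g), 4))
```

Under these assumptions, the expected length falls from 5 cM at 10 generations to 1 cM at 50 generations, which is why length thresholds act as an implicit filter on the age of the shared ancestry being studied.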
For instance, factors such as the effective population size over antecedent generations, bottlenecks and subsequent founder effects may be estimated given the distribution of observed IBD in a contemporary population (Browning and Browning, 2015). This has implications at the population level, as represented by patterns of IBD-sharing genome-wide, but can also be informative at specific loci along the genome, and can provide demographic and historical context to loci associated with complex traits. IBD can also capture the demographic history surrounding a given risk allele: a variant that arises through mutation or recombination, and then spreads and survives in a population through demographic events and genetic drift, carries information encoded in the spanning inherited segment that is informative of evolutionary and complex disease processes (Nelson et al., 2018; Tian et al., 2019). With the concept of IBD explained, we now describe some of its applications in contemporary human genomics.
A crucial goal in population genetics is the estimation of the mutation rate across the genome. IBD-based methods can augment mutation rate estimation approaches by leveraging IBD segments to condition on recent ancestry as part of the estimation process. Prior techniques involved using trios of parents and offspring to estimate mutation rate. However, this approach is difficult to implement due to the logistical challenges of recruiting trios, and is sensitive to genotyping errors or somatic mutations being incorrectly classified as de novo mutations (Shah et al., 2018; Tian et al., 2019). By identifying IBD segments, researchers can quantify the de novo mutation rate on each segment in relation to the degree of kinship between the samples to reduce the false positive rate, particularly when compared to small pedigree-based studies. Furthermore, IBD methods allow for the expansion beyond pedigree studies to large-scale population-based datasets by leveraging the inherent background IBD present in human populations, with recent investigations further narrowing the range of mutation rate estimates to between 1.02 × 10⁻⁸ and 1.56 × 10⁻⁸ (Campbell et al., 2012; Palamara et al., 2015). Other studies have shown that inferring short IBD segments into longer IBD segments can help to adjust estimations of the de novo mutation rate (Chiang et al., 2016). By leveraging IBD, the fundamental question of what mutation rates are across the genome can be more confidently assessed by creating more complete models of mutation, recombination and kinship.
Alongside interrogating the mutation rate of the genome, there has been significant interest in determining the variation in the recombination landscape among global human populations. In addition to having different population-level prevalences, the same complex disease loci may exhibit local differences in linkage disequilibrium that directly impact fine-mapping and other common genetic analyses (Wojcik et al., 2019). This means that population-specific recombination maps will be important for fine-mapping both common and rare variants in complex diseases in diverse populations. One recent study showed that building a recombination map from IBD segments yields better estimation of recombinational endpoints and time-to-most-recent-common-ancestor when compared to LD- or admixture-based approaches (Zhou et al., 2020a). Here, IBD methods, particularly those that can work accurately and at scale, can help to create population-specific recombination maps that will in turn allow for more accurate simulations of each specific population's demographic history, leading to other downstream applications such as improved imputation.
Identity-by-descent detection also plays into the recent advances in population structure estimation, particularly at fine scale. Inherent to the idea of a population is the idea of shared ancestry, and with this shared ancestry comes a higher probability of relatedness and a larger portion of the genome shared IBD between any two individuals sampled from the same population, when compared to two individuals sampled from different populations. We consider, as an example, the question of improving admixture inference accuracy. By identifying IBD segments among individuals in a population, admixture can be measured with higher accuracy than by comparing genotypes alone, which may be additionally influenced by errors or somatic mutations. In addition, as studies grow larger, the search space for identifying shared cryptic ancestry as captured by IBD tends to scale quadratically (i.e., with the total number of pairs of individuals). Thus, a high degree of cryptic relatedness can be present in large-scale genetics studies when a prior, smaller study in the same population may have shown little to no cryptic relatedness. To account for this component of population structure, IBD methods allow researchers to reduce confounding in their study design and better reflect the populations' allele frequencies by matching cases and controls on the basis of genetic ancestry (Palin et al., 2011; Nelson et al., 2018; Sohail et al., 2019).
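The quadratic growth of the search space follows directly from counting unordered pairs, n(n − 1)/2. A minimal sketch (the function name is illustrative, and the cohort sizes are arbitrary examples rather than figures from any study cited here):

```python
def n_pairs(n_individuals: int) -> int:
    """Number of unordered pairs of individuals: n * (n - 1) / 2."""
    return n_individuals * (n_individuals - 1) // 2


if __name__ == "__main__":
    for n in (1_000, 50_000, 500_000):
        print(f"{n:>9,} individuals -> {n_pairs(n):>18,} pairs")
```

A 500-fold increase in cohort size, from 1,000 to 500,000 individuals, increases the number of pairs that could share IBD roughly 250,000-fold, which is why naive all-pairs comparison becomes impractical at biobank scale.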
Concurrent with GWAS, mapping of genetic variants to IBD segments and/or clusters is an alternative method that can help to detect significant associations with a trait of interest. This is similar to how the technique of linkage mapping narrows the genetic signal to a linkage peak (Gusev et al., 2011; Browning and Thompson, 2012). Rare, causal variants preserved in the population while being affected by population demography, drift, selection and substructure have been shown to fall within segments of the genome that are IBD between pairs of individuals in study populations. Analyses of founder populations offer examples of how rare variants can be identified using IBD methods: one example showed how broadly rare European variants contribute disproportionately to disease risk in Quebec (Nelson et al., 2018). Similarly, the elevated IBD patterns present in island populations have empowered novel discoveries, such as the link between height-associated loci and a collagen disorder found in Puerto Ricans (Belbin et al., 2017). With increasing recognition of the role of rare variants in complex disease, and the highly structured manner in which they segregate, methods that leverage IBD for rare variant detection have the potential to be increasingly useful for rare variant discovery.
Finally, imputation can be dramatically improved when leveraging the population-specific information inherent to IBD. With growing reference panels from global populations, imputation is resulting in more accurate haplotype matching (Kowalski et al., 2019). IBD can further improve this by indicating how to match sample haplotypes to appropriate ancestral references for imputation, a concept called a Study-Specific Reference Panel (SSRP; Gusev et al., 2012; Uricchio et al., 2012; Abney and ElSherbiny, 2019). In practice, modern imputation methods hosted on current servers attempt to approximate this process, but do not recapitulate the augmentation of standard reference panels with appropriate SSRPs (Das et al., 2016). Even without a well-annotated pedigree, modern IBD techniques show that imputation quality can be drastically improved when leveraging SSRPs over typical LD-based imputation methods (Abney and ElSherbiny, 2019). Not only is IBD useful alone, but it also augments more standard imputation methods by improving imputation probabilities at difficult-to-impute SNPs. By creating custom SSRPs, recruitment efforts to improve representation of understudied populations in human genetics (Bustamante et al., 2011; Popejoy and Fullerton, 2016) can be efficiently leveraged for imputing rare variants, particularly those with greater population-specificity (Gravel et al., 2011).
With the utility of IBD detection outlined, we will next describe the theoretical, statistical and computational means through which IBD detection algorithms are implemented.

OVERVIEW OF METHODS
Both novel computational paradigms and improvements in computational architecture have led to scalable and accurate methods for IBD detection (Table 1). Originally, whether through strict string pattern matching or fuzzier matching, methods were not equipped to deal with the inherent quadratic scaling of IBD, limiting the size of initial investigations. The era of high-throughput IBD detection began with GERMLINE (Gusev et al., 2008), which was developed to detect variation in IBD patterns efficiently and explore how they are influenced by population processes. GERMLINE builds a hash table of short, exact haplotype matches and extends them into longer, fuzzy (i.e., allowing for small SNP mismatches or genotype errors) IBD segments. This "seed and extend" paradigm, which leverages the inherent efficiency of short hashing functions for speedup beyond standard pairwise comparisons, has been adopted by subsequent detection algorithms (Shemirani et al., 2019; Nait Saada et al., 2020), and improves efficiency over hidden Markov model (HMM)-based algorithms or simpler string matching approaches. The computational efficiency garnered by GERMLINE allows computational time to scale approximately linearly with the number of samples and genotyped variants. While GERMLINE demonstrated accuracy and efficiency in identifying known IBD from simulated datasets and early GWAS studies, it does not easily scale to sample sizes in the hundreds of thousands of individuals, as seen in many contemporary genetic cohorts [although it can provide meaningful insights into biobank-scale data with extensive parallelization (Sapin and Keller, 2021)]. Thus, the primary value in detailing GERMLINE is to describe how it influenced the current IBD calling algorithms outlined below. While GERMLINE works in both diploid and haploid modes, much recent work has focused on haploid methods given the ubiquity of phasing in modern genomic analyses, although we discuss recent efforts in diploid IBD detection as well.
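The "seed and extend" idea can be illustrated with a toy example. The sketch below is not GERMLINE's implementation; it is a minimal illustration that assumes phased haplotypes encoded as strings of "0" and "1", fixed-size non-overlapping windows, and hypothetical parameter names (`window`, `max_mismatch`, `min_windows`). Haplotypes that hash to the same bucket in a window form a seed, and each seed is extended in both directions while neighboring windows stay within the mismatch tolerance.

```python
from collections import defaultdict
from itertools import combinations


def count_mismatches(hap_a, hap_b, window_index, window):
    """Number of differing alleles between two haplotypes in one window."""
    start, end = window_index * window, (window_index + 1) * window
    return sum(x != y for x, y in zip(hap_a[start:end], hap_b[start:end]))


def seed_and_extend(haplotypes, window=4, max_mismatch=1, min_windows=2):
    """Return candidate shared segments as (name_a, name_b, start_site, end_site).

    haplotypes: dict mapping a haplotype name to an equal-length '0'/'1' string.
    end_site is exclusive. Sites beyond the last full window are ignored.
    """
    n_sites = len(next(iter(haplotypes.values())))
    n_windows = n_sites // window

    # Seed: within each window, bucket haplotypes by their exact allele string.
    seeds = defaultdict(set)  # (name_a, name_b) -> windows with an exact match
    for w in range(n_windows):
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[w * window:(w + 1) * window]].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                seeds[pair].add(w)

    # Extend: grow each seed left and right while windows match approximately.
    segments = []
    for (a, b), windows in seeds.items():
        hap_a, hap_b = haplotypes[a], haplotypes[b]
        covered = set()
        for w in sorted(windows):
            if w in covered:
                continue
            lo = hi = w
            while lo > 0 and count_mismatches(hap_a, hap_b, lo - 1, window) <= max_mismatch:
                lo -= 1
            while hi + 1 < n_windows and count_mismatches(hap_a, hap_b, hi + 1, window) <= max_mismatch:
                hi += 1
            covered.update(range(lo, hi + 1))
            if hi - lo + 1 >= min_windows:
                segments.append((a, b, lo * window, (hi + 1) * window))
    return sorted(segments)
```

Only pairs that collide in at least one bucket are ever compared, which is where the saving over exhaustive pairwise comparison comes from. Real tools apply thresholds in genetic distance (cM) rather than window counts.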
One recent innovation in the rapid detection of IBD segments is the ILASH algorithm (Shemirani et al., 2019). ILASH works on the principle of locality sensitive hashing (Leskovec et al., 2020) to efficiently search the genome. It begins with a similar "seed and extend" approach, hashing small stretches of DNA to identify candidate pairs of individuals in a data set and extending the match if the two stretches meet criteria for IBD similarity. The locality sensitive hashing implemented in ILASH is scalable to IBD detection in tens to hundreds of thousands of individuals, such as in the PAGE Study and UK Biobank. Furthermore, it parallelizes computation across multiple stages of the algorithm to improve performance. While ILASH is optimized for the biobank era of genetics and proves easy to use in standard analysis pipelines, there are other algorithms with alternative mathematical and computational approaches. Another solution to efficient IBD detection is RaPID (Naseri et al., 2019). Instead of locality sensitive hashing, RaPID works through random low-resolution projections of the genetic data, applying the positional Burrows-Wheeler transform (PBWT; Durbin, 2014) between phased individual haplotypes until a perfect match is obtained. These matches are also stored in a hash table and extended with further matches as previously detailed, combining those results into an IBD segment. While PBWT is an efficient data transformation for genetic data, a key additional step in RaPID adds the approximate matching needed to tolerate small mismatches, while only adding trivially to the computational time. Furthermore, the accuracy of results can be improved by subsequent iterations of PBWT, albeit at the cost of longer analysis time. Developers also benchmarked RaPID on simulated and UK Biobank data, showing performance and accuracy results similar to those of ILASH.
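To give a sense of why the PBWT is efficient, the sketch below follows the prefix-array and divergence-array updates described by Durbin (2014) and uses them to report exact matches of at least a minimum number of sites. It is a minimal illustration under stated assumptions: haplotypes are lists of 0/1 alleles, lengths are measured in sites rather than cM, the reporting loop is a simplified version that enumerates pairs within each block, and RaPID's random projection and approximate matching steps are not included. The function and variable names are illustrative.

```python
def pbwt_long_matches(haps, min_len):
    """Find maximal exact matches of at least min_len sites between haplotypes.

    haps: list of equal-length lists of 0/1 alleles.
    Returns a set of (i, j, start, end) with i < j, meaning
    haps[i][start:end] == haps[j][start:end] and the match cannot be extended.
    """
    n_haps, n_sites = len(haps), len(haps[0])
    order = list(range(n_haps))  # haplotypes sorted by reversed prefix before site k
    div = [0] * n_haps           # div[i]: start of match between order[i] and order[i-1]
    matches = set()

    for k in range(n_sites + 1):
        # Report: neighbours whose shared match starts at or before k - min_len
        # form a block; pairs in a block that diverge at site k end a long match.
        block_start = 0
        for i in range(1, n_haps + 1):
            if i == n_haps or div[i] > k - min_len:
                for x in range(block_start, i):
                    for y in range(x + 1, i):
                        hx, hy = order[x], order[y]
                        if k == n_sites or haps[hx][k] != haps[hy][k]:
                            start = max(div[x + 1:y + 1])
                            matches.add((min(hx, hy), max(hx, hy), start, k))
                block_start = i
        if k == n_sites:
            break

        # Update: stable partition by the allele at site k, tracking divergence.
        zeros, ones, div_zeros, div_ones = [], [], [], []
        p = q = k + 1
        for idx, d in zip(order, div):
            p, q = max(p, d), max(q, d)
            if haps[idx][k] == 0:
                zeros.append(idx)
                div_zeros.append(p)
                p = 0
            else:
                ones.append(idx)
                div_ones.append(q)
                q = 0
        order, div = zeros + ones, div_zeros + div_ones

    return matches
```

Because haplotypes that share a long match ending at a site are adjacent in the sorted order, each site is processed in a single pass over the haplotypes, rather than by comparing every pair.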
Another method that has been developed on top of existing theory is hap-IBD (Zhou et al., 2020b). Building on extensive previous work in IBD estimation through the Beagle software program, researchers have made significant advances in haploid IBD speed. In their most recent efforts, they developed hap-IBD as an algorithm that implements the PBWT in a manner similar to RaPID. It differs from RaPID in that it accounts for genotype error or mutation by allowing for small gaps of non-IBS between IBD segments. This also allows the algorithm to account for gene conversion, a common phenomenon that can disrupt otherwise IBD segments. In addition, hap-IBD can run the PBWT in parallel, and showed the best performance among algorithms benchmarked in UK Biobank data. Similarly, investigators at 23andMe leveraged the same PBWT to develop their new Templated PBWT (TPBWT) framework (Freyman et al., 2021), with similar properties and efficient, scalable runtime. TPBWT is notable for attempting to identify and correct phase switch errors, thereby improving IBD tract length estimation and long-range phasing.
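The effect of tolerating short non-IBS gaps can be sketched as a merge over sorted segments. This is a simplified illustration and not hap-IBD's actual rule set, which also applies minimum seed and extension lengths; the function name, the `max_gap_bp` parameter, and the use of base-pair coordinates for a single pair of haplotypes are assumptions made for the example.

```python
def merge_across_gaps(segments, max_gap_bp):
    """Merge IBS segments for one haplotype pair when the gap between them is short.

    segments: iterable of (start_bp, end_bp) tuples, in any order.
    Two segments are joined when the second starts no more than max_gap_bp
    after the first ends. Returns a sorted list of merged (start_bp, end_bp).
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_bp:
            # Close enough to the previous segment: treat the gap as an
            # isolated discordance and extend the previous segment.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_across_gaps([(1000, 5000), (5600, 9000), (20000, 25000)], max_gap_bp=1000)` joins the first two segments and leaves the third separate, so a single discordant site does not split one long segment into two shorter ones that might each fall below a length threshold.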
Another novel algorithmic extension that builds on IBD detection and that shows high performance in accuracy as well as speed is FastSMC (Nait Saada et al., 2020). FastSMC builds upon the hash table GERMLINE method as a first identification step and adds a validation step that uses an approximate coalescent HMM (Palamara et al., 2018). This second step distinguishes between segments of IBS and IBD by estimating the probability that a shared IBS segment is due to recent common ancestry, thus allowing for IBD calls within shorter windows. This coalescence probability is reported as an IBD quality score, providing a further layer of information in addition to the IBD haplotypes themselves. By implementing this validation step, FastSMC shows higher accuracy in IBD identification at limited additional computational cost when compared to other algorithms. FastSMC is just one of many IBD identification tools that extend the frameworks originated in GERMLINE to improve performance and accuracy, and because of its two-step design, it could easily be adapted to utilize one of the newer IBD detection methods to further improve efficiency of the initial step.
While many IBD detection methods rely upon accurate phasing of alleles, one approach, IBIS, does not have this caveat. IBIS works through long-range allelic sharing, detecting shared homozygous alleles between individuals and using Boolean logic operators to determine IBD from a given rule set (Seidman et al., 2020). The main benefit of IBIS compared to other methods is the time and computational resources saved from not having to pre-phase the genetic data before IBD detection. The major caveat is that without phase information providing haplotype resolution, excess homozygosity within putative IBD segments can increase the false positive rate, and the shortest segments detectable in diploid IBD are longer than in haploid methods. However, this limitation on segment length (say ∼7 cM for diploid, versus 2-3 cM for haploid) can be acceptable for certain analyses. As previously stated, more recently related individuals share longer IBD segments, which may empower risk allele identification or analyses where measuring the length of long IBD segments is of particular importance. Researchers may be especially interested in IBIS as an intermediate analysis strategy, balancing accuracy and speed, for preliminary exploration of a dataset, or for applications that do not require phasing.
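One way to see why phase is not strictly required: barring genotype error or mutation, two individuals who share a haplotype IBD across a region cannot carry opposite homozygous genotypes (0 and 2 copies of an allele) at any site within it. The sketch below uses only that observation and is not IBIS's actual rule set, which is not detailed here; the genotype encoding (0, 1, 2 alternate allele copies), the `min_sites` threshold, and the function name are assumptions made for illustration.

```python
def runs_without_opposite_homozygotes(geno_a, geno_b, min_sites):
    """Find stretches of sites where two unphased individuals never conflict.

    geno_a, geno_b: equal-length sequences of genotypes coded 0, 1 or 2.
    A site conflicts when one individual is 0 and the other is 2.
    Returns a list of (start_site, end_site) runs, end exclusive, that
    contain no conflicts and span at least min_sites sites.
    """
    n_sites = len(geno_a)
    runs = []
    run_start = None
    for i in range(n_sites):
        a, b = geno_a[i], geno_b[i]
        conflict = (a == 0 and b == 2) or (a == 2 and b == 0)
        if conflict:
            # A conflict ends the current run, if any.
            if run_start is not None and i - run_start >= min_sites:
                runs.append((run_start, i))
            run_start = None
        elif run_start is None:
            run_start = i
    if run_start is not None and n_sites - run_start >= min_sites:
        runs.append((run_start, n_sites))
    return runs
```

Because heterozygous sites are compatible with any genotype in the other individual, long conflict-free runs can also arise by chance in regions of low diversity or high heterozygosity, which is consistent with the longer minimum segment lengths noted above for diploid detection.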
A final value of IBD is that in association studies looking for rare, causal variants in complex disease with biobank-scale data sets, IBD offers improved statistical power over traditional GWAS methods. This is because rare variants are much more likely to be found within an IBD cluster (Nait Saada et al., 2020). Coalescence simulation-based work has shown the concordance between IBD and rare exomic variants (Nait Saada et al., 2020). Similarly, in the UK Biobank, researchers found significant associations with blood-related traits otherwise not detected in exome-based tests by using IBD methods to predict sharing of ultra-rare, causal variants (MAF < 0.0001; Nait Saada et al., 2020). By identifying regions of IBD where rare, causal variants are likely to occur, the threshold for significance can be appropriately lowered, analogous to how a linkage peak narrows the search for a genetic signal. As a result of looking for associations between IBD segments and complex disease status, we propose the coining of the term "IBDWAS" to make the value of IBD-driven insights more pronounced.

CONCLUSION
To summarize, IBD has significant but often-overlooked meaning in human genetics studies in the context of biobank-scale data. All genetic variants affecting traits are influenced by the combination of the evolutionary forces of selection and genetic drift. While in the past inferring the demographic history of a study's population was difficult, the field of genomics has reached datasets so large that ignoring underlying population history can lead to inappropriate conclusions in disease associations and pathogenicity adjudication. As biobank-scale datasets continue to grow, IBD-based analyses offer a paradigm to address unanswered questions within the field of genomics, and with recent advances in IBD-detection methods there are new opportunities to study these patterns of relatedness at scale. It is therefore relevant to incorporate methods of IBD detection into genetic studies to gain insights into the demographic history of variants of interest, to improve statistical power in detecting rare, causal variants, and to improve the accuracy of imputation, among other relevant analyses.

AUTHOR CONTRIBUTIONS
ES initially drafted the manuscript with edits and contributions from GB and CG. All authors contributed to the article and approved the submitted version.

FUNDING
This work was partially funded by the National Institutes of Health under R01HG011345 and U01HG009080.