Evaluating the Impact of Dropout and Genotyping Error on SNP-Based Kinship Analysis With Forensic Samples

Technological advances in sequencing and single nucleotide polymorphism (SNP) genotyping microarray technology have facilitated advances in forensic analysis beyond short tandem repeat (STR) profiling, enabling the identification of unknown DNA samples and distant relationships. Forensic genetic genealogy (FGG) has facilitated the identification of distant relatives of both unidentified remains and unknown donors of crime scene DNA, invigorating the use of biological samples to resolve open cases. Forensic samples are often degraded or contain only trace amounts of DNA. In this study, the accuracy of genome-wide relatedness methods and identity by descent (IBD) segment approaches was evaluated in the presence of challenges commonly encountered with forensic data: missing data and genotyping error. Pedigree whole-genome simulations were used to estimate the genotypes of thousands of individuals with known relationships using multiple populations with different biogeographic ancestral origins. Simulations were also performed with varying error rates and types. Using these data, the performance of different methods for quantifying relatedness was benchmarked across these scenarios. When the genotyping error was low (<1%), IBD segment methods outperformed genome-wide relatedness methods for close relationships and are more accurate at distant relationship inference. However, with an increasing genotyping error (1–5%), methods that do not rely on IBD segment detection are more robust and outperform IBD segment methods. The reduced call rate had little impact on either class of methods. These results have implications for the use of dense SNP data in forensic genomics for distant kinship analysis and FGG, especially when the sample quality is low.

There are two methods examined in detail: genome-wide relatedness, and IBD segment methods. Genome-wide relatedness methods, most frequently implemented using the KING-robust method, estimated genome wide relatedness of samples (Manichaikul et al., 2010). These methods rely on calculations of relatedness and IBD segment sharing across entire genomes simultaneously. Methods of this type are recommended to use population specific allele frequencies, but it is not required to do so. As a result of this methodology, the KING-robust methodology is suited for a wide range of SNP densities. Genome wide relatedness measures have been shown to be effective at prediction of close relatives (3 rd degree or closer), with decreasing accuracy as relationships become more distant. More recent IBD segment methods are more sensitive, and able to accurately distinguish between relatives as distant as 6 th , 7 th or 8 th degree. They are also more computationally expensive, often (but not always) requiring phasing and possibly imputation. These methods identify segments of homologous chromosomal SNP data from a seed-extend approach, marking all potential IBD segments with identity to another individual, and calculating kinship coefficient as a function of the genetic distance (in centiMorgans) shared between individuals. To accomplish this, most IBD segment methods rely on accurate phasing of genomic SNP data. This method is more sensitive but requires high-quality data to accurately identify and extend segments.

Genome-wide relatedness: KING
The KING algorithm (Manichaikul et al., 2010) allows for the presence of unknown population substructure, enabling robust estimation of the kinship coefficient in the presence of population structure. Simulation and analysis with real data as benchmarked showed accurate relationship estimation of third degree relatives (Ramstetter et al., 2017). The KING publication showed that the assumption of population homogeneity in PLINK (Purcell et al., 2007;Chang et al., 2015) resulted in biased results, systematically inflating the apparent relatedness degree (kinship coefficient). The publication presented two estimators of the kinship coefficient -one that assumes homogeneity, and another that does not (KING-robust). The authors also found that results of relationship inference on data with population structure were similar using a full set of SNPs as compared to a subset of rare SNPs. The KING software implementation is not permissively licensed and thus was not evaluated any further. Additionally, the KING software only takes PLINK-formatted .bed/.bim/.fam files as input, rather than the more widely used industry standard VCF, and thus does not take advantage of HTSlib for quickly reading and parsing genetic variation data. AKT is an efficient toolkit that re-implements the KING-robust algorithm in C++ using HTSlib for reading industry standard VCF/BCF (Arthur et al., 2017). AKT takes standard VCF as input, provides a consistent API, and provides a tidy output structure for pairwise kinship coefficients and inferred IBD proportions. The KINGrobust algorithm as implemented in AKT has very low resource requirements: its input is a set of SNPs (optionally LD-pruned) and does not require any preprocessing such as phasing or imputation.

IBD segments: IBIS
IBIS (Saada et al., 2020) allows for IBD segment inference from dense genotyping data without the requirement for phased data. The publication benchmarked against RefinedIBD (Browning and Browning, 2013), GERMLINE (Gusev et al., 2009), and KING (Manichaikul et al., 2010). Benchmarking included assessing IBD segment identification accuracy, but more importantly, classification accuracy of inferred relationship degrees using both simulated and real data. IBIS accepts PLINK bed/bim/fam as inputs, and outputs a list of IBD segments. Notably, the phase-free inference makes IBIS much faster than competing methods and robust to phasing errors that can be introduced with genotyping data and small sample sizes.

IBD segments: hap-IBD
The software hap-IBD (Zhou et al., 2020) enables extremely fast and scalable IBD segment detection. It uses a positional Burrows-Wheeler transform and parallelization to achieve speed and a novel compression method to reduce memory. The publication describes benchmarking against TRUFFLE (Dimitromanolakis et al., 2019), iLASH (Shemirani et al., 2019), and RaPID (Naseri et al., 2019). The benchmarks indicated that hap-IBD has lower false positives than RaPID and higher true positives compared to iLASH, while TRUFFLE has high error rates for short IBD segments. Computationally, hap-IBD was faster than all other methods at large sample sizes, but for smaller samples sizes (e.g. 5,000) there was no appreciable absolute difference.

Other methods not evaluated
Other IBD segment methods that were not evaluated included TRUFFLE (Dimitromanolakis et al., 2019), iLASH (Shemirani et al., 2019), and RaPID (Naseri et al., 2019). These tools are not permissively licensed and as such were not further evaluated. Additionally, other tools such as ERSA (Huff et al., 2011), PRIMUS (Staples et al., 2014), and PADRE (Staples et al., 2016) were not evaluated also because of restrictive licensing. Methods that rely on genotype likelihoods (Korneliussen et al., 2014;Korneliussen and Moltke, 2015;Waples et al., 2019) fundamentally did not rely on called genotypes and were therefore beyond the scope of this evaluation. We furthermore did not evaluate GERMLINE, as other tools assessed here benchmarked with higher accuracy and speed than GERMLINE in other benchmarking studies (Gusev et al., 2009;Browning and Browning, 2011;Zhou et al., 2020).

Supplementary methods 2: subsetting variants to common array (GSA) sites
We aimed to keep only SNPs represented on a commonly used microarray platform. The Illumina Global Screening Array (GSA) is the most widely used genotyping platform in the direct-toconsumer genomics and ancestry testing space, used by 23andMe, AncestryDNA, Family Tree DNA, and MyHeritage DNA (International Society of Genetic Genealogy, 2021). Accordingly, the overwhelming majority of data in the most comprehensively used genealogy platform, Gedmatch, is derived from the Illumina GSA, and this overrepresentation will become even more pronounced over the coming years. While this effort does not rely on Gedmatch and is not evaluating Gedmatch in any way, if we are to reduce representation of founders' SNP genotype data to a manageable subset that mimics data generated on a dense SNP genotyping platform, using the GSA as this backbone is the most sensible place to start. There are 583,089 biallelic SNPs genotyped on the GSA that have genotype data in the 1000 Genomes Project data.
In addition to the sites from the GSA, small panels of kinship-informative SNPs were added. These SNPs include 1,262 "reliable" SNPs for kinship analysis (Arthur et al., 2017) -these SNPs are biallelic SNPs in 1000 Genomes Phase 3 with global MAF>=5%, present in high-coverage samples sequenced at Illumina with similar allele frequency to 1000 Genomes, and reliably genotyped on several microarrays commonly in use at the time of publication, including the Illumina Human Core Exome microarray, Illumina Infinium Core Exome microarray, Illumina Omni2.5 microarray, Affymetrix Genome-Wide Human SNP Array 6.0 (now Thermo-Fisher), and the pre-GSA 23andMe v4 microarray. Next, Verogen recently announced their new "Kintelligence" product, which uses the PC-relate method (Conomos et al., 2016) to discover first-third (perhaps fourth) degree relatives in Gedmatch by genotyping 10,055 autosomal markers on a MiSeq FGx. Of these 10,055 SNPs, 9,977 (99.2%) are already present on the GSA. After adding the "reliable" kinship SNPs and the Kintelligence SNPs that were not already on the GSA, the total number of biallelic autosomal SNPs retained from the 1000 Genomes data came to 590,588.
Whole-genome sequencing typically calls variant genotypes with respect to the reference, resulting in VCF files that generally do not contain genotypes at invariant sites. This down selection strategy enabled more rapid initial benchmarking of methods in Task 3 across the hundreds of simulations described below (thousands of simulation-tool-parameter combinations). Following initial characterization, deeper analysis can be performed on specific tools and specific scenarios using a larger initial starting pool of SNPs from which to simulate.

Classification accuracy
Classification accuracy of all tools is well above the accuracy of guessing for all tools, using either Full or the GSA+reliable set of SNPs. Accuracy of guessing indicated by the dashed red line denotes the floor of classification accuracy which is realized if all samples are assigned to the most common class (Unrelated). However, KING shows less than 1% reduced accuracy with the Full data set at 0 error, while IBD methods (IBIS and hap-IBD) show less than 1% improvement when all SNPs are included. The figure below shows the accuracy rates of all tools split by error rates and SNP selection. At 0% error, all tools performed with > 97% accuracy. At 20% error, IBD tools perform at the same rate as accuracy of guessing (a result of classifying all individuals of "Unrelated"). KING performs similarly with the Full selection, but accuracy decreases below that of guessing when GSA+reliable SNPs are used. This finding is likely a result of KING misclassifying unrelated individuals as related due to random error generating false positive matches. For a reduced data set, this will result in a larger number of individuals being misclassified as related. These errors are classified as less than the variation between simulated data sets and are acceptable for this work.

Kinship Coefficients
The effect of errors on KING described above (KING shows worse-than-guessing accuracy when high error is present) can be more simply demonstrated using a density plot of the kinship coefficients calculated by KING. The figure below is a density plot of all the kinship coefficients calculated by all 4 simulations, with relatedness only classified up to fourth degree relatives, with all others classified as "Unrelated". With higher error, "Unrelated" individuals have a higher average kinship coefficient than at low error, causing their kinship coefficient to rise above the threshold and be classified as fourth degree relatives. For Hap-IBD and IBIS, the calculated kinship coefficient for all individuals at 20% genotyping error is 0.

Supplementary Methods 3: data simulation with ped-sim
Initial simulations of "missing data" produced simulations with individual genotypes on all individuals in the pedigree missing at random at 1%, 5%, 10%, etc. When comparing two individuals in which each has 10% missing data, ~19% of sites will be missing in either. To capture all n, choose 2 pairwise relationships among n simulated samples while mimicking a 10% missing data rate in a single sample compared to all others, we modified the "missing data" paradigm to thinning instead of random dropout. That is, rather than simulating 10% missing at random in all n samples, we randomly thinned the entire sample by 10%. As such, each pairwise comparison of n choose 2 pairs would mimic the situation of a 10% data loss due to 10% missing in a single query sample. This effectively becomes a limit of detection kind of experiment: how many markers are required for robust kinship estimation with different methodologies (at zero error and as genotyping error increases)?
Ped-sim (Caballero et al., 2019) version 1.3.1 was run using a pedigree definition that resulted in the large multi-generation pedigree shown in Supplementary Figure 1. Data was simulated using the high-resolution sex-specific map from (Bhérer et al., 2017) with a crossover interference model from (Campbell et al., 2015). Unrelated individuals from each of the three populations (GBR, AFR, MXL) were used as founders to ped-sim so resulting simulated offspring genotypes would be present in the resulting VCF.

Supplementary Figures
Supplementary Figure 1. Pedigree structure for each of the simulations conducted. Simulated families were defined using a pedigree definition as described in the ped-sim documentation. These relationships include parent-child (1 st degree), full sibling (1 st degree), avuncular (2 nd degree), grandparent-grandchild (2 nd degree), first cousin (3 nd degree), great-great-grandparent/child (4 th degree), grand-avuncular (4 th degree), second cousin (5 th degree), third cousin, first cousin once removed, first cousin twice removed, second cousin once removed, etc.

Degree n
This is a provisional file, not the final typeset article Supplementary Figure 3: Kinship coefficient inferred using KING (default parameters) against the simulated kinship coefficient for pairs of individuals simulated from GBR founders. Error increases in panels going left-to-right. Missing data increase in panels top-to-bottom. Pairs of relationships are color-coded by the true relationship degree. This graphic shows that error degrades the detected kinship coefficient, with degradation most notable in high-error simulations (5-10%).

Supplementary Figure 4:
Kinship coefficient inferred using hap-IBD (default parameters) against the simulated kinship coefficient for pairs of individuals simulated from GBR founders. Error increases in panels going left-to-right. Missing data increase in panels top-to-bottom. Pairs of relationships are color-coded by the true relationship degree. This graphic shows that even small amounts of genotyping error (1%) severely degrades the performance of kinship estimation and relationship inference, while larger errors (5-10%) completely eliminate hap-IBD's ability to detect IBD segments using default parameters.

Supplementary Figure 5:
Kinship coefficient inferred using IBIS (default parameters) against the simulated kinship coefficient for pairs of individuals simulated from GBR founders. Error increases in panels going left-to-right. Missing data increase in panels top-to-bottom. Pairs of relationships are color-coded by the true relationship degree. This graphic shows that relatively higher error (5-10%) severely degrades the performance of kinship estimation and relationship inference, while small errors (1%) negatively impact IBIS's ability to accurately assess full siblings, which will share on average about 25% of the genome IBD2.

Supplementary Tables
Supplementary Table 1 For tools that calculate IBD segments, the most impactful measures are related to the allowed error rate to continue a segment, the minimum length of segment reported, and the maximum distance between SNPs for a segment to be continued. Selected parameters for these tools are shown below. In general, The length requirement (mL for IBIS and min-output for hap-ibd) refers to the minimal size of an identified IBD segment to keep, lowering this would allow keeping shorter segments, assuming error in the simulations break the segments. The markers required to generate a segment looks at the ability to identify a segment, again shortened to allow errors to not skip IBD segment generation. Ibis also allows for an allowable error rate, which was tested between 0.01, 0.

min-markers
Minimum markers for seed of hap-IBD segments 100, 64

Max-gap
Maximum distance between two markers before segment is broken