Imputation of microsatellite alleles from dense SNP genotypes for parentage verification across multiple Bos taurus and Bos indicus breeds

To assist cattle producers in transitioning from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification, we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflict with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes by MS genotyping animals whose imputed MS alleles fail parentage verification and then incorporating those animals into the reference dataset.


INTRODUCTION
Single nucleotide polymorphisms (SNP) are preferred to microsatellite (MS) markers for parentage verification and genomic selection due to their higher genotyping accuracies, speed of genotyping, lower overall cost per genotype, and ease of automation. While SNP genotypes per animal (N = 3000 to > 770,000) assayed on Illumina platforms are routinely > 99% for call rate and concordance (McClure et al., 2009; Rincon et al., 2011), individual MS are known to have a 1-5% genotyping error rate (Baruch and Weller, 2008). When individual genetic markers each have an error rate of 1%, the probability of having at least 1 genotype error in an individual genotyped for 11 MS markers is >10% (Weller et al., 2006). Also, we have observed that single nucleotide insertions or deletions within the amplified MS region can result in the rounding up or down of the called MS allele fragment size, resulting in a 2 bp difference in the reported allele size. Therefore, the high inherent chance of genotyping errors has led several studies to suggest that 2 MS marker conflicts must exist for an animal to be excluded in parentage verification (Bonin et al., 2004; Weller et al., 2004; Baruch and Weller, 2008). In a comparison of a bovine parentage MS panel vs. a 32 SNP parentage panel (Heaton et al., 2002) employed for sire discovery for 287 calves from US beef and dairy farms, the SNP panel routinely outperformed the MS panel, assigning a sire at 100% probability 81.9% of the time vs. 38.3% of the time for the MS panel (Stewart Bauck, GeneSeek a Neogen Company, Pers. Commun. 3/10/2013). Recent work by Fernández et al. (2013) showed that, even in an inbred Brazilian Angus herd, only 24 SNP were needed to obtain a matching probability (MP) for parental verification equivalent to that of 18 microsatellites.
Similarly, 43 SNP provided 2-4 orders of magnitude greater MP than 11 MS in 6 Northern Ireland cattle breeds (Aberdeen Angus, Belgian Blue, Charolais, Holstein, Limousin, and Simmental) (Allen et al., 2010).
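The >10% figure quoted above for 11 MS markers can be checked with a short calculation. The sketch below assumes that errors are independent across markers and that every marker has the same error rate; the function name is illustrative and is not taken from the cited studies.

```python
def prob_at_least_one_error(per_marker_error: float, n_markers: int) -> float:
    """P(at least one erroneous genotype), assuming independent, equal error rates."""
    return 1.0 - (1.0 - per_marker_error) ** n_markers

# 1% error per marker across 11 MS markers (the case cited from Weller et al., 2006)
p11 = prob_at_least_one_error(0.01, 11)
# Same per-marker error rate applied to a 12-marker panel
p12 = prob_at_least_one_error(0.01, 12)
print(f"11 markers: {p11:.3f}; 12 markers: {p12:.3f}")
```

Under these assumptions the probability is slightly above 0.10 for 11 markers and grows with every marker added, which is the motivation for requiring 2 conflicts before exclusion.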
SNP technology is used not only in numerically large breeds, such as Holstein and Angus, but also in numerically mid-size and small breeds for the identification of genetic disease carriers and for genomic selection. Recently, it has also become more practical and cost effective to use SNP-based tools for parentage verification. Some cattle breed associations, such as the US Jersey Association, have begun to use SNPs exclusively for parentage verification. However, most breeds are just beginning the transition from MS to SNP markers. Traditionally, when a livestock industry transitions to a new technology for parentage verification, the additional cost of re-genotyping the transition generation(s) with the newer technology is absorbed by the producer or breed association. In an effort to reduce the cost of SNP technology adoption across cattle breeds, we initially developed a method to impute MS alleles from dense SNP genotypes (McClure et al., 2012). Our initial report in 4 dairy breeds (Holstein, Brown Swiss, Jersey, and Guernsey) found that 17% of the SNP-MS haplotypes were preserved across 2-4 of the studied breeds, suggesting that while many haplotypes are breed specific, some are present in phylogenetically distant breeds, possibly because they are identical by descent (IBD) from the common breed ancestor.
The objective of this study was to develop a SNP-MS haplotype reference panel set that could be used globally across the majority of commercial Bos taurus breeds and the major B. indicus breeds.
An additional objective was to provide a data set and workflow so that any lab or service provider could implement our results for the benefit of the world-wide cattle community.

GENOTYPES
Twenty-five groups, representing government, academic, and DNA service providers from the North American, South American, European, and Australian continents, including the International Bovine HapMap Project (International Bovine Hapmap Consortium, 2006), provided MS and partial Illumina BovineHD (Illumina Inc., 2010) (Illumina Inc., San Diego, CA, USA) genotypes on 16,564 animals representing 51 breeds plus 135 B. taurus crossbred animals (Table 1). All animals that were registered with their respective breed associations have accurate pedigree information, which was available to this project. The provided genotypes were for SNP located within 500 kb (N = 3732) of 12 MS markers (BM1818, BM1824, BM2113, ETH3, ETH10, ETH225, INRA023, SPS115, TGLA53, TGLA122, TGLA126, TGLA227). These 12 MS loci comprise the International Society of Animal Genetics' (ISAG) recommended bovine parentage markers (http://www.isag.us/Docs/CattleMMPTest_CT.pdf) for inclusion in test panels used by service laboratories. All SNP data were captured and output in Illumina AB format. Genotypes for the ISAG-sanctioned MS bovine panel on the individuals and/or their parents were obtained from > 30 breed associations or their corresponding authorized data repositories. These MS genotypes were generated by multiple labs including GeneSeek (Lincoln, NE), MetaMorphix Inc. (Davis, CA), Maxxam (Mississauga, ON, Canada), UC Davis Veterinary Genetics Lab (Davis, CA), Zoetis (Kalamazoo, MI), Weatherbys DNA Laboratory (Kildare, Ireland), Deoxi Biotecnologia (Araçatuba, São Paulo, Brazil), and LABOGENA (Jouy-en-Josas, France). Selected HapMap project individuals from less conventional or popular U.S. breeds were MS genotyped at UC-Davis Veterinary Genetics Lab, and Brahman individuals were MS genotyped by Zoetis according to ISAG genotyping standards.
From these MS and SNP genotypes, two populations were generated (Table 1). The reference population contained 8077 individuals from 39 breeds as well as 29 B. taurus crossbred animals with both MS and SNP genotypes. Seven to 12 (average of 9) MS genotype records were provided for each animal in the reference population, resulting in each MS having 2403-8031 genotyped individuals in this group (Table 2). The validation population was based on animals with only SNP data and contained 8622 animals representing 45 breeds and 106 B. taurus crossbred animals. MS genotypes on 1301 of the validation animals' parents, mainly sires, were also available for the evaluation of imputation accuracy. Only 89 validation animals had a parent present in the reference population. Both populations contained B. taurus and B. indicus purebreds and composite animals. BEAGLE (Browning and Browning, 2007) was used to impute the <2% of missing SNP genotypes in the reference and validation populations. This step was considered robust based on previous reports where SNP genotypes were imputed with >95% accuracy with only a few hundred reference animals (Pausch et al., 2013) and with 98-99% accuracy in multi-breed reference populations (Larmer et al., 2010). A separate validation population (GGP-val) comprising 122 animals from 9 breeds (Angus, Ankole-Watusi, Belgian Blue, Charolais, Devon, Dexter, Holstein, Maine-Anjou, and Texas Longhorn) was assembled to test MS imputation from the GGP-LD (GeneSeek Genomic Profiler Low Density) Beadchip (Neogen Corporation, 2012). While the GGP-LD contains ∼80% of the original MS imputation SNP reported in McClure et al. (2012), these SNP genotypes were not imputed to the higher SNP density available in the reference population. These animals were also genotyped for the 12 MS at UC-Davis Veterinary Genetics Lab.

HAPLOTYPE ESTIMATION
BEAGLE input files for the reference population were created for each MS marker and flanking SNP within 500 kb. Animals were filtered on their MS genotypes so that for each MS the BEAGLE file contained only individuals with a MS genotype; thus, 12 files were generated, each containing 2403 to 8031 animals (Table 2). All reference individuals were phased together using BEAGLE with 100 iterations. Williams et al. (2012) observed that phasing human ethnic groups together instead of separately resulted in increased phasing accuracy, as long as a single cohort did not dominate the dataset (>80% of the total population). Our reference population was fairly evenly distributed (Table 1) and each breed represented an average of 2.5% of the total population with only 2 breeds representing over 10% (Charolais at 13.5% and Limousin at 19.8%).
SNP haplotypes for MS imputation were identified using a similar process as reported in McClure et al. (2012). Optimal haplotype size for MS imputation was determined by analysing phased haplotypes, centered on the MS, using sliding windows that increased in size (10-20 flanking SNP increments). The number of unique reference population haplotypes that were linked to 1 MS allele 100% of the time and the number of haplotypes that were linked to >1 MS alleles but matched 1 MS allele ≥90% of the time were tallied. The optimal haplotype size was determined when either of the following criteria was met: 2. Increasing the haplotype size by 10 SNP resulted in ≤ 1% increase in the total number of tallied haplotypes.
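The tallying step described above can be sketched in a few lines. This is an illustration rather than the authors' code: the data layout (one SNP string plus one MS allele per phased chromosome), the function names, and the choice to return the smaller window once the gain falls to ≤1% are all assumptions. Only the stopping criterion that survives in the text (criterion 2) is implemented.

```python
from collections import Counter, defaultdict

def tally_haplotypes(phased, center, flank, min_major=0.90):
    """Count unique SNP haplotypes usable for MS imputation at one window size.

    phased: list of (snp_string, ms_allele) pairs, one per phased chromosome.
    center: index in snp_string where the MS sits (hypothetical layout).
    flank:  number of SNP taken on each side of the MS.
    Returns (n_linked_to_one_allele, n_majority_allele_at_or_above_min_major).
    """
    by_hap = defaultdict(Counter)
    for snps, ms in phased:
        window = snps[max(0, center - flank):center + flank]
        by_hap[window][ms] += 1
    exact = majority = 0
    for counts in by_hap.values():
        if len(counts) == 1:
            exact += 1  # haplotype seen with a single MS allele only
        elif max(counts.values()) / sum(counts.values()) >= min_major:
            majority += 1  # several alleles, but one dominates
    return exact, majority

def pick_window(phased, center, sizes, max_gain=0.01):
    """Return the first size after which the tally grows by <= max_gain."""
    prev = None
    for flank in sizes:
        total = sum(tally_haplotypes(phased, center, flank))
        if prev and prev[1] > 0 and (total - prev[1]) / prev[1] <= max_gain:
            return prev[0]
        prev = (flank, total)
    return prev[0]

# Toy data: 4-SNP chromosomes with the MS between SNP 2 and SNP 3.
toy = [("AABB", 101), ("AABB", 101), ("ABAB", 103), ("ABBB", 103), ("ABBB", 105)]
print(tally_haplotypes(toy, center=2, flank=1), pick_window(toy, 2, [1, 2]))
```

In practice the window sizes would increase in steps of 10-20 flanking SNP, as described above, rather than the toy steps used here.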

IMPUTATION REFERENCE POPULATION CREATION
Two MS-SNP haplotype imputation reference populations were created from the full reference population using the optimal SNP haplotype size for each MS (

MICROSATELLITE IMPUTATION
Two validation subpopulations, BT-val and BT + BI-val, were created from the validation population in the same manner as the imputation reference populations. Imputation was performed using either the 880 minimum SNP (min) panel (Table S1) from the optimal haplotype sizes identified above or all 3732 SNP within 500 kb of a MS marker (1 Mb where the first, second, third and fourth term represent: validation population, reference population, SNP panel used, number of BEAGLE iterations.

MENDELIAN INHERITANCE CONFLICTS OF MICROSATELLITE ALLELES
For the 1301 validation population animals with submitted parental MS genotypes, the animal's BEAGLE-imputed MS alleles were checked for Mendelian inheritance consistency against the MS genotype of its parents. Mendelian inheritance verification was also evaluated for 3457 reference population animals that had individual and parental MS genotypes submitted by the breed associations. An ANOVA was performed to determine statistical differences between the Mendelian consistencies of BT-val imputed MS and BT-ref reported MS genotypes, and between the different MS imputation parameter combinations. For the 122 GGP-val genotyped animals the concordance between their imputed and reported MS genotypes was determined. Both imputed MS alleles had to match the reported MS alleles to be considered concordant.
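A single-parent Mendelian consistency check of this kind can be sketched as follows. The function names and the dictionary layout are assumptions made for illustration; the TGLA126 values are those of the Simmental-679 and sire example reported in this article, while the ETH10 allele sizes are made up.

```python
def is_conflict(offspring, parent):
    """True if the offspring shares no allele with the parent at one MS marker."""
    return not (set(offspring) & set(parent))

def count_conflicts(offspring_gt, parent_gt):
    """Count markers where offspring and parent genotypes are incompatible.

    Both arguments map marker name -> (allele1, allele2). Markers missing
    in either animal are skipped rather than counted as conflicts.
    """
    shared = offspring_gt.keys() & parent_gt.keys()
    return sum(is_conflict(offspring_gt[m], parent_gt[m]) for m in shared)

# TGLA126: imputed calf genotype vs. reported sire genotype from the article.
# ETH10: hypothetical allele sizes used only to show a consistent marker.
calf = {"TGLA126": (123, 115), "ETH10": (217, 219)}
sire = {"TGLA126": (117, 117), "ETH10": (219, 223)}
n = count_conflicts(calf, sire)
print(n)
```

When both parents are genotyped, a stricter trio check would additionally require that one offspring allele can be assigned to each parent; that extension is omitted here.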

MS HAPLOTYPE IMPUTATION
The number of SNP used for haplotype imputation for each MS ranged from 40 to 110 (average 73), with 83.16% of the reference population haplotypes being linked to only 1 MS allele 100% of the time or 1 MS allele ≥ 90% of the time across all breeds (Table S2). Less than 6% of the SNP haplotypes were associated with >1 MS allele and when this occurred, the other MS alleles were often within 2 bp of the most commonly associated allele (Table S3). These associations are potentially caused by a combination of rare haplotypes and MS genotyping errors, insertions and deletions within the amplified MS region that caused a rounding up or down of the called MS allele fragment size, or SNP haplotypes present in multiple breeds that are associated with multiple MS alleles in each breed due to recombination. On average, a haplotype that was associated with only 1 MS allele 100% of the time was present in 2.3 breeds, with some such haplotypes being common across up to 23 breeds. For haplotypes that were associated with >1 MS allele, the most common MS allele was present in an average of ∼7 breeds with a maximum of 36 breeds (Table 2). The distribution of MS-SNP haplotypes present in ≥1 breed across the whole reference population is shown in Figure 1. The large number of MS-SNP haplotypes observed only once or twice within the reference population are considered rare MS-SNP haplotypes (Table S3). While the majority of the MS-SNP haplotypes, 74.5%, were breed specific, the observation of 25.5% of the MS-SNP haplotypes in 2-36 breeds indicates that MS haplotype data from one breed can be informative for the imputation of MS alleles in other breeds.
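The breed-specific versus shared split reported above amounts to counting, for each MS-SNP haplotype, the number of breeds in which it was observed. A minimal sketch, assuming the phased data are available as (haplotype, MS allele, breed) records; the record layout, function name, and toy values are hypothetical.

```python
from collections import defaultdict

def breed_sharing(observations):
    """Return (fraction breed specific, fraction seen in >= 2 breeds).

    observations: iterable of (snp_haplotype, ms_allele, breed) tuples.
    """
    breeds = defaultdict(set)
    for hap, ms, breed in observations:
        breeds[(hap, ms)].add(breed)
    n = len(breeds)
    specific = sum(1 for b in breeds.values() if len(b) == 1)
    return specific / n, (n - specific) / n

# Toy records: one MS-SNP haplotype is shared by Holstein and Jersey.
obs = [
    ("AAB", 123, "Holstein"),
    ("AAB", 123, "Jersey"),
    ("ABB", 117, "Holstein"),
    ("BBB", 115, "Angus"),
    ("BBA", 119, "Angus"),
]
specific_frac, shared_frac = breed_sharing(obs)
print(specific_frac, shared_frac)
```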

IMPUTATION ACCURACIES
The concordance between imputed and reported MS for the  Table 4). On average, 68.09% of the 1291 BT-val animals with imputed MS had no Mendelian inheritance conflicts with their parents' MS genotype, 22.83% had only 1 conflict, 4.95% had only 2 conflicts and 4.13% had >2 conflicts. In comparison, the 3457 reference animals with parental MS data had 85.25% with no conflicts, 10.65% with 1 conflict, 2.34% with 2 conflicts, and 1.76% with >2 conflicts ( Table 6). There was variability in the average Mendelian inheritance accuracy of imputed MS among breed and MS in the validation population with an average breed accuracy of 94% across all imputation strategies ( Table 6).
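The conflict distributions above map directly onto the ~91% and 96% figures for animals with ≤1 conflict. The arithmetic is sketched below using only the percentages reported in this paragraph; the variable and function names are illustrative.

```python
# Percentage of animals by number of Mendelian inheritance conflicts,
# copied from the values reported in the text.
bt_val = {"0": 68.09, "1": 22.83, "2": 4.95, ">2": 4.13}  # imputed MS
bt_ref = {"0": 85.25, "1": 10.65, "2": 2.34, ">2": 1.76}  # reported MS

def share_at_most_one(dist):
    """Percentage of animals with 0 or 1 conflict."""
    return round(dist["0"] + dist["1"], 2)

val_ok = share_at_most_one(bt_val)
ref_ok = share_at_most_one(bt_ref)
print(val_ok, ref_ok)
```

Under a rule that requires 2 conflicts for exclusion, these are the shares of animals that would pass parentage verification in each group.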
For the 25 BT-val animals with a parent in the reference population and a MS conflict, if the matching SNP haplotypes are taken into consideration, 17 have 100% parent verification. Only 7 animals had 1 haplotype conflict (i.e., 1 MS conflict) and one animal had 2 haplotype conflicts. Taking the matching SNP haplotypes into consideration means that for the 89 validation animals with a parent in the reference population, 91% have no MS or SNP haplotype conflicts, 98.88% have ≤1 conflict and 100% have ≤2 conflicts. These conflict statistics are higher than the MS parent verification statistics for the BT-ref animals in Table 7.  500 kb each side of the MS were included in the imputation process compared to when the most parsimonious number of flanking SNPs were used (Tables 4, 5, 7). While the imputed MS alleles showed greater Mendelian inheritance conflicts than the reported MS alleles did, this was expected as previous research has documented that MS marker genotypes themselves have a 1-5% error rate and only 85% of the reference animals had no parentage MS conflicts. An analysis of the SNP haplotypes for the 25 BT-val animals with Mendelian inheritance conflicts and with sires in the BT-ref population indicated that many of their SNP haplotypes were not in conflict (Table S4). In these cases, the sire haplotype may have harbored a mis-scored MS allele. For instance, Table S4 (Tab TGLA126) shows the TGLA126 SNP haplotypes for Simmental-679 and its sire (Simmental-334). The imputed MS genotype for Simmental-679 (123/115) was in conflict with its sire's reported genotype (117/117), even though both animals share a common haplotype. When the shared SNP haplotype was examined in Table S3 (Tab chr20-TGLA126, column UP), the most common MS allele observed for this haplotype was 123. The haplotype was associated with the 123 allele 937 times (99.68%) across 17 breeds and the 117 allele only once (0.11%).
While it is possible that the sire's reported MS genotype is correct, it appears more likely that the sire's genotype was incorrectly scored. This 0.11% error rate is within the MS error rates reported in the literature (Baruch and Weller, 2008). Of note, the other TGLA126 SNP haplotype for this sire was associated with the 117 allele 301 times (88.79%) across 11 breeds (Table S3, tab chr20-TGLA126, column VI). It is possible that when this animal was genotyped the 123 allele failed to PCR amplify, amplified too weakly to be called, or simply failed to be called, such that the animal was genotyped as a 117 homozygote instead of 117/123.

RECOMMENDATIONS
The optimized SNP haplotypes reported here and the reference population data represent a robust standard data set that can be used to impute MS at high accuracy (Table 4, average 95%) for the loci within the ISAG recommended bovine parentage MS panel. This standard can be used in breeds that are not represented in the reference panel with only a small reduction in accuracy (Table 7).
For the research reported here to be implemented by the industry we suggest the following work flow: 1. Genotype animals with a SNP assay that contains our reported min SNP set (Table S1) and parentage SNP (Heaton et al.,    (Scheet and Stephens, 2006), findhap (Vanraden, 2011), HAPI_UR (Williams et al., 2012), or other appropriate program. Then match the haplotype with the appropriate MS tab in Table S3 and return the most common MS allele to impute the animal's MS genotype.

4. Use the imputed MS genotypes for parentage verification.
5. If parentage verification fails, then genotype the animal with the MS panel.
a. If the actual and imputed MS genotypes match, then consider retesting the parent with MS to correct the genotype error.
b. If the actual and imputed MS genotypes do not match, then phase the animal's SNPs and MS genotypes and add this animal to the reference population.
6. Generate an updated reference haplotype population by adding any new animal with actual MS and SNP genotype data to the reference population dataset and rephase all of the SNP and MS genotypes.
7. Use the updated reference population at Step 3.
By MS genotyping the animal whenever a discrepancy occurs, the process described above will self-correct for MS genotyping errors and capture rare MS-SNP haplotypes. Generation of new reference panels (Step 6 above) will help to (A) increase the imputation accuracy and (B) identify rare or breed-specific MS-SNP haplotypes. This process will also speed up the adoption of the accurate 101 SNP panel (Heaton et al., 2002) or a derivative for parentage verification over the current MS panel.
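The lookup and decision steps of the workflow above can be sketched as follows. This is a minimal illustration under stated assumptions: the reference is represented as a mapping from SNP haplotype to MS allele counts (a stand-in for one tab of Table S3), the haplotype key `"hapA"` and all function names are hypothetical, and the counts reuse the TGLA126 example discussed earlier (123 observed 937 times, 117 once).

```python
from collections import Counter

def impute_allele(haplotype, reference):
    """Return the MS allele most often linked to a SNP haplotype, or None."""
    counts = reference.get(haplotype)
    return counts.most_common(1)[0][0] if counts else None

def next_action(parentage_ok, imputed, actual=None):
    """Decision logic for steps 4-6 of the suggested workflow.

    imputed / actual: MS genotypes as (allele1, allele2); actual is None
    until the animal has been genotyped with the MS panel.
    """
    if parentage_ok:
        return "accept imputed MS genotype"
    if actual is None:
        return "genotype animal with MS panel"
    if sorted(actual) == sorted(imputed):
        return "consider retesting parent with MS"
    return "add animal to reference population and rephase"

# Hypothetical reference entry; counts follow the TGLA126 example in the text.
reference = {"hapA": Counter({123: 937, 117: 1})}
allele = impute_allele("hapA", reference)
print(allele, "|", next_action(False, (123, 115)))
```

Haplotypes absent from the reference return `None` here; how such cases are handled in practice (for example, by MS genotyping the animal) is left to the implementing laboratory.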
For those who wish only to verify parentage and transition between MS and SNP genetic markers, it is currently more cost effective to genotype the animal with the ISAG MS panel ($15-€20) and a 116 SNP panel ($15) than to use a Super-GGP, GGP-HD, BovineHD, or IDB beadchip (€30-$185) (Jeremy Walker, GeneSeek, and John Flynn, Weatherbys, Pers. Commun., 22/07/2013). For those wishing to obtain genomic breeding values, select genetic disease status, and parentage SNP and MS genotypes on an animal, the listed beadchips combined with MS imputation do represent an economically viable option, as no additional cost is incurred to obtain MS genotypes.
As part of this international collaborative effort, the phased reference population data (BT-ref and BT + BI-ref) and marker (1 Mb and Min) BEAGLE files are available (Supplementary Data Sheets 1-3) to facilitate MS imputation in DNA service laboratories world-wide. Our results demonstrate the power of continued sharing of MS genotypes, together with SNP genotypes within 500 kb of each MS from the BovineSNP, GGP-HD, Super-GGP, or IDB panels, to increase imputation accuracy. The haplotypes reported for these reference populations can be applied to impute MS alleles with high accuracy on animals that have been genotyped for the flanking SNP, regardless of breed.
Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the US Department of Agriculture. The USDA is an equal opportunity provider and employer.