ORIGINAL RESEARCH article

Front. Genet., 11 March 2019

Sec. Livestock Genomics

Volume 10 - 2019 | https://doi.org/10.3389/fgene.2019.00183

Selection of Optimal Ancestry Informative Markers for Classification and Ancestry Proportion Estimation in Pigs

  • 1. Beijing Advanced Innovation Center for Food Nutrition and Human Health, China Agricultural University, Beijing, China

  • 2. State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China

Abstract

Using small sets of ancestry informative markers (AIMs) is a cost-effective way to accurately estimate the ancestry proportions of individuals. This study aimed to derive a small, effective set of AIMs from ∼60 K porcine single nucleotide polymorphism (SNP) data and to estimate three ancestry proportions [East China pig (ECHP), South China pig (SCHP), and European commercial pig (EUCP)] in Asian breeds and European domestic breeds. A total of 186 samples from 10 pure breeds were divided into three groups: ECHP, SCHP, and EUCP. Using these samples and a one-vs.-rest support vector machine (SVM) classifier, we found that only seven AIMs could completely separate the three groups. Subsequently, we used supervised ADMIXTURE to calculate ancestry proportions and found that a set of 129 AIMs performed well on ancestry estimates for pseudo admixed individuals. Furthermore, another 969 samples from 61 populations were used to evaluate the performance of the 129 AIMs. Estimates from the 129 AIMs were highly correlated with estimates from the ∼60 K SNP data for the three ancestry components: ECHP (Pearson correlation coefficient (r) = 0.94), SCHP (r = 0.94), and EUCP (r = 0.99). Our results provide an example of using a small number of pig AIMs for classification and for estimating ancestry proportions with high accuracy and in a cost-effective manner.

Introduction

Autosomal single-nucleotide polymorphisms (SNPs) and insertion-deletions (InDels) are widely used for human ancestry inference and population assignment (Bauchet et al., 2007; Tian et al., 2009; Sun et al., 2016). Ancestry informative markers (AIMs) are genetic markers whose allele frequencies differ between populations (Shriver et al., 2003). Multiple statistics have been used to obtain AIMs, including F statistics (FST), absolute allele frequency differences (δ), the informativeness for assignment measure (In), and principal component loading scores (Rosenberg et al., 2003; Zhang et al., 2009; Ding et al., 2011; vonHoldt et al., 2016; Barbosa et al., 2017; Peterson et al., 2017). AIMs are considered sufficiently accurate for ancestry inference for limited population size, without requiring whole-genome markers. Consequently, they offer an economical way to screen and analyze thousands of samples. Santos et al. (2016) reported that 192 AIMs selected from ∼370 K SNP data can be used to accurately estimate the ancestry proportions of three major populations in Brazil. Li et al. (2016) developed a panel of 74 AIMs to infer the ancestry proportions of 500 test individuals from 11 populations. Owing to the high resolution of AIMs, a 23-AIM panel generated by Zeng et al. (2016) distinguished four major American populations and correctly assigned ancestry for nine additional populations.

For animal population genetics, AIMs have been successfully applied to identify breeds of different varieties and to evaluate genetic compositions in hybrid populations (Dimauro et al., 2015; Bouchemousse et al., 2016). Bertolini et al. (2017) found that 96 AIMs performed well in discriminating six dairy cattle breeds. In another study, 63 AIMs selected from 427 canids were utilized to assess genetic admixture in coyotes (Monzon et al., 2014). Recently, 74 AIMs were used to calculate ancestry proportions in crossbred sheep (Awassi with two native breeds in Ethiopia), and it was found that different admixture levels of Awassi significantly affected the traits of lamb growth and ewe reproduction (Getachew et al., 2017).

Pigs (Sus scrofa) diverged into European and Asian wild boars during the mid-Pleistocene (1.2–0.8 million years ago) (Larson et al., 2005; Frantz et al., 2013). Pig domestication in China occurred ∼9,000 years ago (Larson et al., 2005). Chinese domestic pigs have been divided into six types according to their region of origin and phenotypic characteristics (I–North China, II–Lower Changjiang Basin, III–Central China, IV–South China, V–Southwest, and VI–Plateau) (Li et al., 2004; Fang et al., 2005). In a recent study, Yang et al. (2017) tracked the ancestries of various Chinese breeds and identified two major distinct ancestries: East China origin (e.g., Meishan and JinHua) and South China origin (e.g., Luchuan and Bamaxiang). In addition, genomic introgression from European commercial breeds into Chinese indigenous pigs has also been reported (Ai et al., 2013; Bosse et al., 2014; Zhu et al., 2017), making the genetic composition of modern Chinese pigs even more complicated.

Although AIMs have been widely applied in other animals and are of great importance in specific application scenarios, including market surveillance and genetic resource protection, no study currently exists that specifically addresses the problem of efficiently using AIMs to distinguish pig breeds or to estimate ancestry proportions. Here, using ∼60 K pig SNP chip data, we searched for the optimal number of AIMs for distinguishing pigs of East China, South China, or European origin. Based on 129 selected AIMs, we estimated the ancestry proportions of these origins for other Chinese pigs. Our findings suggest that AIMs selected from unadmixed reference populations can be used to accurately estimate ancestry proportions in hybrid populations. Our results provide a useful example of using AIMs for breed classification and ancestry estimation in pigs.

Materials and Methods

Data Collection and Quality Control

Genotyping data of 2,113 samples were retrieved from the Dryad Digital Repository¹. Only samples from Asian and European breeds were used in this study (a total of 1,157 samples from 71 populations; details in Supplementary Table S1). Samples and SNPs were excluded according to the following criteria: (1) individuals with more than 10% missing genotypes; (2) SNPs with a call rate lower than 95%; (3) SNPs with a minor allele frequency less than 0.05; (4) SNPs located on sex chromosomes; and (5) SNPs that were not biallelic. The missing genotypes were subsequently imputed using BEAGLE (version 3.3.2) (Browning and Browning, 2007). Finally, 45,562 SNPs and 1,155 samples remained. The 1,155 samples were then split into two datasets. For the reference set, 186 samples were chosen from 10 representative populations of the three major ancestry groups: East China pig (ECHP), South China pig (SCHP), and European commercial pig (EUCP). The 10 populations were selected because, according to Yang et al. (2017), there was no obvious admixture between populations belonging to the ECHP and SCHP groups. This dataset is summarized in Table 1. The test dataset contained the remaining 969 samples from 61 populations (details in Supplementary Table S2). For convenience in practical application, the genotype data of the test dataset were extracted directly from the raw data without phasing or imputation.
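The sample-level and SNP-level thresholds listed above can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 copies of one allele with `None` for a missing call; the function names and this coding are hypothetical illustrations, not the study's actual pipeline, and the sex-chromosome and biallelic filters are not shown.

```python
def snp_passes_qc(genotypes, min_call_rate=0.95, min_maf=0.05):
    """Apply the call-rate and minor-allele-frequency filters to one SNP.

    genotypes: list of 0/1/2 allele counts across samples, None = missing
    (assumed coding, not taken from the study).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    # Criterion (2): SNPs with a call rate lower than 95% are excluded.
    if len(called) / len(genotypes) < min_call_rate:
        return False
    # Criterion (3): SNPs with a minor allele frequency below 0.05 are excluded.
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq) >= min_maf


def sample_passes_qc(genotypes, max_missing=0.10):
    """Criterion (1): exclude individuals with more than 10% missing genotypes."""
    missing = sum(g is None for g in genotypes)
    return missing / len(genotypes) <= max_missing
```

In practice these filters are usually applied with a dedicated genotype toolkit; the sketch only makes the thresholds explicit.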

Table 1

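| Group | Subpopulation | Abbreviation | Number |
| --- | --- | --- | --- |
| South China pig (SCHP) | China_Bamaxiang | CNBX | 16 |
| South China pig (SCHP) | China_Congjiangxiang | CNCJ | 16 |
| South China pig (SCHP) | China_Guangdongdahuabai | CNDH | 16 |
| South China pig (SCHP) | China_Luchuan | CNLU | 18 |
| East China pig (ECHP) | China_Erhualian | CNEH | 20 |
| East China pig (ECHP) | China_Jinhua | CNJH | 20 |
| East China pig (ECHP) | China_Meishan | CNMS | 20 |
| European Commercial pig (EUCP) | Duroc | DUR2 | 20 |
| European Commercial pig (EUCP) | Pietrain | PIT1 | 20 |
| European Commercial pig (EUCP) | Landrace | LDR1 | 20 |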

Pig breeds information in the reference set.

Population Structure

Principal component analysis (PCA) was performed on the ∼60 K chip data of the reference set using SMARTPCA (version 6.1.4) (Patterson et al., 2006). To confirm the unadmixed status, unsupervised ADMIXTURE (version 1.23) (Alexander et al., 2009) was used to compute the ancestry proportions of samples from the reference set, with the number of ancestral populations (K) set from K = 3 through K = 15. The ChromoPainter v2 (Lawson et al., 2012) linked model was also used to explore similarity and dissimilarity among individuals in the reference set. In detail, the recombination map file was generated using the script makeuniformrecfile.pl provided by fineSTRUCTURE (version 2.1.1) (Lawson et al., 2012). Using a hidden Markov model, ChromoPainter v2 infers haplotypes of “donor” and “recipient” individuals to create a co-ancestry matrix. Initially, 20 expectation-maximization steps were used to estimate the mutation and switch rates on a random one-fifth of all individuals, with all autosomes considered. The inferred mutation and switch rates for each chromosome were then averaged. Subsequently, with the estimated mutation and switch rates and default values for all other parameters, ChromoPainter v2 was run again to generate the co-ancestry matrix for all individuals. Finally, the MCMC algorithm implemented in fineSTRUCTURE was employed to hierarchically cluster individuals, with a burn-in of 1,000,000 iterations and a runtime of 6,000,000 iterations.

Selection of AIMs

All 186 samples in the reference dataset were used to compute FST and In. Candidate SNPs were selected with the AIMs selection algorithm implemented in AIMs_generator.py from the ANTseq pipeline². Specifically, we first excluded SNPs in high linkage disequilibrium (LD) by retaining only one SNP from each region of strong LD (r² > 0.3) within a 500 kb distance. Within each group, SNPs that exhibited heterogeneous frequencies among populations were further excluded based on a Chi-squared test (Galanter et al., 2012). Second, FST and In were computed for each of the three paired groups: ECHP vs. EUCP, SCHP vs. EUCP, and ECHP vs. SCHP (Rosenberg et al., 2003).
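To make the two ranking statistics concrete, here is a minimal sketch of how a per-SNP FST and In could be computed for one paired group from allele frequencies. The text does not state which FST estimator AIMs_generator.py uses, so the equal-weight Wright formulation below is an assumption; the In function follows the definition of Rosenberg et al. (2003) for two equally weighted groups. All function names and the toy frequencies are hypothetical.

```python
import math


def wright_fst(p1, p2):
    """Wright's FST for one biallelic SNP, from the frequency of the same
    allele in two groups (equal group weights): FST = (HT - HS) / HT."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)      # expected heterozygosity, pooled
    if h_total == 0:                        # same allele fixed in both groups
        return 0.0
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def informativeness(p1, p2):
    """Informativeness for assignment (In) for a biallelic SNP and two
    equally weighted groups, using natural logarithms."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0

    total = 0.0
    for a1, a2 in ((p1, p2), (1 - p1, 1 - p2)):   # loop over the two alleles
        mean = (a1 + a2) / 2
        total += -xlogx(mean) + (xlogx(a1) + xlogx(a2)) / 2
    return total


# Hypothetical allele frequencies (group A, group B) for three SNPs.
toy_freqs = {"snp_a": (0.95, 0.05), "snp_b": (0.60, 0.40), "snp_c": (0.10, 0.90)}
# Rank SNPs from most to least differentiated, as done when picking top AIMs.
ranked = sorted(toy_freqs, key=lambda s: wright_fst(*toy_freqs[s]), reverse=True)
```

Both statistics are zero when the two groups share the same allele frequency and reach their maximum (1 for FST, ln 2 for In) when the groups are fixed for different alleles.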

Group Classification With Minimum AIMs

Using the reference dataset, we first compared the discriminatory power of the AIMs selected by FST or In. Binary classification was performed separately for each of the three paired groups. For each paired group, we started by selecting the top two through top 30 AIMs, with an increment of one AIM. Samples in the corresponding paired group were randomly split into two parts, 75% for training and 25% for testing, and this operation was repeated 50 times. GridSearchCV implemented in the Scikit-learn (version 0.18) package was then used to determine the optimal parameters for a support vector machine (SVM) classifier (Da Mota et al., 2014). The parameters for the SVM are summarized in Supplementary Table S3. For the model with optimal parameters, the accuracy of classification was evaluated by the mean of the Matthews correlation coefficient (MMCC) over the 50 repeats as follows:

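$$\mathrm{MMCC} = \frac{1}{50}\sum_{i=1}^{50} \frac{TP_i \times TN_i - FP_i \times FN_i}{\sqrt{(TP_i + FP_i)(TP_i + FN_i)(TN_i + FP_i)(TN_i + FN_i)}}$$

where TN_i and FN_i are the numbers of true negatives and false negatives, and TP_i and FP_i are the numbers of true positives and false positives, for the i-th run.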
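As a small illustration of this metric, the sketch below computes the Matthews correlation coefficient for each run from its confusion-matrix counts and averages over runs. The function names are hypothetical, and returning 0 when the denominator is zero is a common convention assumed here rather than something stated in the text.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom


def mean_mcc(runs):
    """Mean MCC over repeated train/test splits.

    runs: iterable of (tp, tn, fp, fn) tuples, one per split.
    """
    runs = list(runs)
    return sum(mcc(*r) for r in runs) / len(runs)
```

A perfect separation in every one of the 50 splits gives a mean of 1, which is the value the Results section reports for the best AIM sets.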

To determine the minimum number of AIMs for distinguishing ECHP, SCHP, and EUCP simultaneously, a multiclass one-vs.-rest SVM approach was applied to the reference dataset (Hong and Cho, 2008). Similarly, we began by selecting the top two through top 200 AIMs from each of the paired groups, with an increment of one AIM, resulting in 199 AIM sets in total. In each set, AIMs selected from the three paired groups were merged and duplicated AIMs were removed (Supplementary Table S4). Since MMCC was not designed for evaluating the accuracy of multiclass classification, the confusion matrix, Cohen’s kappa statistic and the balanced error rate were used instead to evaluate classification accuracy. A higher Cohen’s kappa and a lower balanced error rate indicate more accurate classification. We again used GridSearchCV to estimate the best parameters for the one-vs.-rest SVM; these parameters are summarized in Supplementary Table S3. We also generated random SNP sets of equal size from the whole genome to compare their discriminatory power with that of the selected AIMs.
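The two multiclass metrics can be computed directly from a confusion matrix, as in the following minimal sketch. It assumes rows are true classes and columns are predicted classes and that every class has at least one true sample; the function names are hypothetical.

```python
def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows = true class,
    columns = predicted class)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals for each class.
    expected = sum(
        sum(cm[i]) * sum(cm[j][i] for j in range(k)) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def balanced_error_rate(cm):
    """Mean of the per-class error rates (1 - recall of each true class)."""
    k = len(cm)
    return sum(1 - cm[i][i] / sum(cm[i]) for i in range(k)) / k
```

For a perfectly diagonal confusion matrix these functions return kappa = 1 and a balanced error rate of 0, the values used in this study as the criterion for complete separation.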

Ancestry Inference With Optimal AIMs

AIMs have been widely used to estimate ancestry proportions in hybrid populations, even in cases in which they were selected from unadmixed populations. To estimate the ancestry proportions of possibly admixed pig populations from the selected AIMs, we employed a strategy similar to that used by Pardo-Seco et al. (2014). We first generated pseudo admixed individuals by randomly selecting genotypes of the selected AIMs from samples in the reference dataset with equal proportions. Therefore, the expected ancestry proportion of these pseudo admixed individuals was 1/3 (∼0.3333) from each group (ECHP, SCHP, and EUCP). For each of the 199 AIM sets generated above, 1,000 simulations were performed. Supervised ADMIXTURE (K = 3) was used to estimate the ancestry proportions. Performance was evaluated by the mean and the coefficient of variation (CV) of the estimated ancestry proportions. The CV of the estimated ancestry proportions was fitted against the number of AIMs using the Curve Expert 1.4 program³. The optimal number of AIMs was determined by choosing a threshold on the slope of the tangent to the fitted curve, beyond which performance was stable. As an additional validation, we simulated pseudo admixed individuals with random ancestry proportions using the determined optimal number of AIMs. The ancestry proportions of ECHP, SCHP, and EUCP were randomly assigned, with a minimum proportion of 10%.
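The simulation step can be sketched as follows. This is one possible reading of the procedure, not the authors' code: for each AIM, an ancestry group is drawn with the target proportions and the genotype of a random reference sample from that group is copied. The data layout (a dict mapping group name to a list of per-sample genotype lists) and the function names are assumptions made for illustration.

```python
import random
import statistics


def pseudo_admixed(reference, proportions, rng):
    """Build one pseudo admixed individual, one AIM at a time.

    reference:   {group: [genotype list per sample]} (assumed layout)
    proportions: {group: target ancestry proportion}, summing to 1
    rng:         random.Random instance for reproducibility
    """
    groups = list(proportions)
    weights = [proportions[g] for g in groups]
    n_aims = len(reference[groups[0]][0])
    individual = []
    for locus in range(n_aims):
        group = rng.choices(groups, weights=weights)[0]  # pick ancestry for this AIM
        donor = rng.choice(reference[group])             # pick a reference sample
        individual.append(donor[locus])
    return individual


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Toy reference in which each group is fixed for a different genotype code.
toy_reference = {
    "ECHP": [[0] * 3000] * 20,
    "SCHP": [[1] * 3000] * 20,
    "EUCP": [[2] * 3000] * 20,
}
equal = {"ECHP": 1 / 3, "SCHP": 1 / 3, "EUCP": 1 / 3}
toy_individual = pseudo_admixed(toy_reference, equal, random.Random(0))
```

In the study, the simulated genotypes are then passed to supervised ADMIXTURE (K = 3), and the mean and CV of the resulting estimates are compared with the expected proportions.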

On the basis of the AIMs selected in the last step, we performed ancestry inference for the 969 individuals in the test dataset using supervised ADMIXTURE. Performance was evaluated by the Pearson correlation coefficient between the ancestry proportions estimated from the genome-wide SNPs and those estimated from the optimal AIM set.

Results

Population Structure of Reference Populations

Populations in the reference set were expected to be minimally admixed. Indeed, ECHP, SCHP, and EUCP were well separated in a principal component plot (Figure 1A). The genome-wide FST distribution (Figure 1B) showed higher differentiation both between ECHP and EUCP (mean = 0.2197, 95% CI 0.0006–0.7267) and between SCHP and EUCP (mean = 0.2153, 95% CI 0.0005–0.7570), while the differentiation between ECHP and SCHP (mean = 0.0588, 95% CI 0–0.3342) was noticeably less pronounced. Using ADMIXTURE, all breeds were divided into the anticipated groups (Figure 1C) when K = 3, in accordance with the previous study by Yang et al. (2017). When K = 10, the 10 populations could be separated clearly, consistent with our expectation that they were minimally admixed (Supplementary Figure S1).

FIGURE 1

For further quantification, the ChromoPainter v2 and fineSTRUCTURE programs were employed to examine the relationships among these breeds while accounting for LD. As shown in the coancestry heatmap (Figure 2), individuals within each group exhibited a homogeneous pattern, and those from the same group shared more genetic chunks with each other than with individuals from other groups. In particular, EUCP individuals had a negligible degree of coancestry with individuals from Chinese indigenous breeds. Samples from ECHP and SCHP showed a higher degree of coancestry with each other, but individuals from the same group still tended to cluster together more than individuals from different groups. In summary, the results suggested that the samples in the reference dataset exhibited a negligible level of admixture.

FIGURE 2

Group Classification Using AIMs

In order to build an effective set of AIMs, we first compared the performance of the FST and In statistics. For the paired groups ECHP vs. EUCP and SCHP vs. EUCP, a minimum of two AIMs was sufficient to achieve perfect separation (MMCC = 1), whether the AIMs were selected by top FST or by top In (Supplementary Figure S2). However, to separate ECHP vs. SCHP, at least four AIMs were required when using FST, and at least five when using In. We found that the informative AIMs selected by In largely overlapped with those selected by FST, indicating that FST is at least as informative as In. Therefore, the following analyses were based only on AIMs selected by FST.

Next, we attempted to identify the number of AIMs that could separate ECHP, SCHP and EUCP simultaneously using a multiclass approach. As described in Materials and Methods, the top-ranked two to 200 AIMs were sequentially selected from each of ECHP vs. EUCP, SCHP vs. EUCP and ECHP vs. SCHP, resulting in 199 AIM sets of increasing size (Supplementary Table S4). AIMs in each set were merged and deduplicated. For example, for the largest set, 171 out of 200 AIMs were shared between ECHP vs. EUCP and SCHP vs. EUCP (Supplementary Figure S3), 12 out of 200 AIMs were shared between SCHP vs. EUCP and ECHP vs. SCHP, and 14 out of 200 AIMs were shared between ECHP vs. EUCP and ECHP vs. SCHP. All 199 AIM sets were fed to a one-vs.-rest SVM classifier. As shown in Figure 3 and Supplementary Table S5, seven AIMs were sufficient to completely separate ECHP, SCHP and EUCP, with Cohen’s kappa = 1 and balanced error rate = 0. Detailed information on the seven AIMs is summarized in Table 2 and Supplementary Table S7.
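The merge-and-deduplicate step can be sketched as below, assuming each paired comparison provides a list of SNP names already ranked by FST; the function name and the toy SNP names are hypothetical. As a consistency check on the overlaps quoted above, inclusion-exclusion gives 3 × 200 − (171 + 12 + 14) = 403 distinct AIMs if no SNP is shared by all three comparisons, which matches the size of the largest AIM set (403) mentioned later in the text.

```python
def merged_aim_set(ranked_by_pair, k):
    """Merge the top-k SNPs of each paired comparison, dropping duplicates
    while keeping first-seen order.

    ranked_by_pair: {comparison name: [SNP names, best first]} (assumed layout)
    """
    merged = []
    seen = set()
    for ranking in ranked_by_pair.values():
        for snp in ranking[:k]:
            if snp not in seen:
                seen.add(snp)
                merged.append(snp)
    return merged


# Hypothetical rankings for the three paired groups.
toy_rankings = {
    "ECHP_vs_EUCP": ["s1", "s2", "s3"],
    "SCHP_vs_EUCP": ["s2", "s4", "s5"],
    "ECHP_vs_SCHP": ["s1", "s6", "s7"],
}
top2 = merged_aim_set(toy_rankings, 2)

# Size of the largest merged set implied by the pairwise overlaps in the text,
# assuming no SNP is shared by all three comparisons.
largest_set_size = 3 * 200 - (171 + 12 + 14)
```

Because many top-ranked SNPs are shared between ECHP vs. EUCP and SCHP vs. EUCP, the merged sets grow much more slowly than three times the per-comparison count.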

FIGURE 3

Table 2

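| SNP | Chr | Position | ECHP vs. EUCP | SCHP vs. EUCP | ECHP vs. SCHP |
| --- | --- | --- | --- | --- | --- |
| ALGA0003690 | 1 | 64094344 | 0.4724 | 0.0300 | 0.3069 |
| MARC0023378 | 1 | 148309548 | 0.9672 | 1 | 0.0084 |
| INRA0004282 | 1 | 149824800 | 0.9672 | 1 | 0.0084 |
| DRGA0001542 | 1 | 150801717 | 0.9672 | 0.9556 | 0.0005 |
| INRA0004312 | 1 | 152175324 | 0.9672 | 1 | 0.0084 |
| H3GA0002811 | 1 | 153144281 | 0.9508 | 0.9835 | 0.0084 |
| INRA0004460 | 1 | 158429254 | 0.9028 | 0.9355 | 0.0084 |
| ASGA0004738 | 1 | 159450284 | 0.9028 | 0.9355 | 0.0084 |
| MARC0036323 | 1 | 162068596 | 0.9028 | 0.9355 | 0.0084 |
| H3GA0002947 | 1 | 162610557 | 0.9508 | 0.9835 | 0.0084 |
| M1GA0001158 | 1 | 163303972 | 0.9672 | 1 | 0.0084 |
| ASGA0005014 | 1 | 179090814 | 0.0534 | 0.9069 | 0.6833 |
| DRGA0001670 | 1 | 192768198 | 0.7230 | 0.9835 | 0.0811 |
| DRGA0001766 | 1 | 201005516 | 0.0390 | 0.5438 | 0.4165 |
| INRA0005593 | 1 | 211778061 | 0.9355 | 1 | 0.0169 |
| INRA0005652 | 1 | 214740742 | 0.9347 | 0.9512 | 0.0042 |
| ALGA0007467 | 1 | 215925031 | 0.9347 | 0.9512 | 0.0042 |
| ALGA0007539 | 1 | 220517767 | 0.1429 | 0.0826 | 0.3608 |
| M1GA0002066 | 1 | 307474784 | 0.3636 | 0.0154 | 0.3125 |
| H3GA0005443 | 1 | 311685793 | 0.9185 | 0.9512 | 0.0084 |
| ASGA0102470 | 2 | 2261977 | 0.9355 | 0.9355 | 0 |
| ASGA0008848 | 2 | 7823419 | 0.0031 | 0.5319 | 0.5905 |
| ASGA0091359 | 2 | 96158022 | 0.0084 | 0.3750 | 0.3460 |
| ASGA0012212 | 2 | 140996142 | 0.0018 | 0.4149 | 0.3737 |
| ALGA0016543 | 2 | 145257971 | 0.2061 | 0.0234 | 0.3266 |
| MARC0065978 | 3 | 55717402 | 1 | 0.9412 | 0.0154 |
| ALGA0019771 | 3 | 77636015 | 0 | 0.4293 | 0.4293 |
| DIAS0003766 | 3 | 81986256 | 0.9512 | 0.9512 | 0 |
| ALGA0107390 | 3 | 86085644 | 0.9355 | 0.9355 | 0 |
| ASGA0101711 | 3 | 123638149 | 0.9344 | 0.9373 | 0 |
| ASGA0016597 | 3 | 132275049 | 0.0573 | 0.6011 | 0.3479 |
| ALGA0024245 | 4 | 30874523 | 0.0003 | 0.3275 | 0.3426 |
| ASGA0019402 | 4 | 41300090 | 1 | 0.9702 | 0.0076 |
| DRGA0004757 | 4 | 43615316 | 0.9512 | 0.9512 | 0 |
| ALGA0025201 | 4 | 61401615 | 0.0061 | 0.4776 | 0.3969 |
| INRA0014351 | 4 | 66793036 | 0.9512 | 1 | 0.0127 |
| MARC0090092 | 4 | 68409038 | 0.9512 | 1 | 0.0127 |
| INRA0014612 | 4 | 73495852 | 1 | 1 | 0 |
| ASGA0021073 | 4 | 103804488 | 0.0390 | 0.4081 | 0.5349 |
| ALGA0031043 | 5 | 19526692 | 0.1747 | 0.0480 | 0.3044 |
| ALGA0031742 | 5 | 39327879 | 0.9512 | 0.9512 | 0 |
| DRGA0005727 | 5 | 40915894 | 0.9512 | 0.4550 | 0.2000 |
| INRA0019276 | 5 | 42724575 | 0.9671 | 0.9671 | 0 |
| ASGA0025483 | 5 | 44622629 | 0.9671 | 0.9671 | 0 |
| DRGA0005762 | 5 | 45252255 | 0.9671 | 0.9671 | 0 |
| DRGA0005767 | 5 | 46381993 | 1 | 1 | 0 |
| ALGA0031838 | 5 | 47243385 | 1 | 1 | 0 |
| MARC0046863 | 5 | 48236817 | 1 | 1 | 0 |
| DRGA0005792 | 5 | 49040715 | 1 | 1 | 0 |
| ALGA0031894 | 5 | 50991368 | 0.9671 | 0.9671 | 0 |
| INRA0019346 | 5 | 52798149 | 1 | 1 | 0 |
| ALGA0032094 | 5 | 62157042 | 0.0260 | 0.4682 | 0.3097 |
| ALGA0108031 | 5 | 67710315 | 0.1429 | 0.0491 | 0.3005 |
| ALGA0032500 | 5 | 68352730 | 0.4423 | 0.0205 | 0.3048 |
| ASGA0026083 | 5 | 69584032 | 0.3413 | 0 | 0.3461 |
| INRA0020365 | 5 | 96352161 | 1 | 0.9702 | 0.0076 |
| ALGA0037079 | 6 | 133726563 | 0.9355 | 0.9355 | 0 |
| ALGA0117693 | 6 | 148742719 | 0.9185 | 0.9512 | 0.0084 |
| ASGA0094022 | 6 | 151217323 | 0.0784 | 0.3521 | 0.6331 |
| MARC0041948 | 6 | 152894649 | 0.3793 | 0 | 0.3793 |
| MARC0115216 | 7 | 3195217 | 0.9348 | 0.9835 | 0.0127 |
| DBKK0000285 | 7 | 60252514 | 0.4906 | 0 | 0.4906 |
| H3GA0021983 | 7 | 67646165 | 0.3886 | 0.0010 | 0.3767 |
| ALGA0042537 | 7 | 77009237 | 0.0114 | 0.4408 | 0.3630 |
| DRGA0007820 | 7 | 77973037 | 0.0402 | 0.5930 | 0.4218 |
| DIAS0000146 | 7 | 109783926 | 0.9512 | 0.9068 | 0.0115 |
| ALGA0045522 | 7 | 127332840 | 0.6741 | 0.0977 | 0.3392 |
| H3GA0024530 | 8 | 23601551 | 0.1984 | 0.0820 | 0.4514 |
| BGIS0004952 | 8 | 39300733 | 0.9512 | 0.9512 | 0 |
| ASGA0038742 | 8 | 41242759 | 0.9835 | 0.9835 | 0 |
| ALGA0047876 | 8 | 52213568 | 0.0084 | 0.4505 | 0.4206 |
| H3GA0024898 | 8 | 61863927 | 1 | 0.5715 | 0.1579 |
| ALGA0047992 | 8 | 65489064 | 0.0169 | 0.5903 | 0.5294 |
| INRA0029873 | 8 | 69998730 | 0.0014 | 0.3115 | 0.3256 |
| ALGA0048179 | 8 | 77021275 | 0.9835 | 0.9247 | 0.0154 |
| ALGA0048253 | 8 | 78671695 | 1 | 0.9850 | 0.0038 |
| ASGA0039683 | 8 | 121829327 | 0.9671 | 0.9671 | 0 |
| ASGA0039832 | 8 | 130909240 | 0.9670 | 0.9835 | 0.0042 |
| INRA0030531 | 8 | 131517760 | 0.9835 | 1 | 0.0042 |
| H3GA0025494 | 8 | 135031919 | 0.0057 | 0.2447 | 0.3073 |
| ASGA0095368 | 8 | 145709748 | 0.0455 | 0.2561 | 0.4043 |
| H3GA0055769 | 9 | 12292138 | 0.1918 | 0.0313 | 0.3512 |
| ALGA0119045 | 9 | 15055604 | 0.0292 | 0.4774 | 0.3070 |
| BGIS0007566 | 9 | 53579054 | 0.9672 | 0.9850 | 0.0017 |
| ASGA0043529 | 9 | 66808115 | 0.0057 | 0.2435 | 0.3044 |
| ASGA0096819 | 9 | 73486663 | 0.0042 | 0.5017 | 0.5172 |
| ASGA0043850 | 9 | 85054037 | 0.1853 | 0.0811 | 0.4384 |
| ALGA0054899 | 9 | 129924803 | 0.9190 | 0.9685 | 0.0083 |
| H3GA0028160 | 9 | 130830299 | 0.9355 | 0.9805 | 0.0083 |
| H3GA0053792 | 10 | 13926830 | 0.0014 | 0.3924 | 0.3550 |
| ALGA0057773 | 10 | 25842467 | 0.2107 | 0.0296 | 0.3532 |
| M1GA0014504 | 11 | 55295 | 1 | 0.9556 | 0.0115 |
| INRA0036515 | 11 | 55728615 | 0.7857 | 0.1899 | 0.3166 |
| M1GA0016423 | 12 | 23116412 | 0.9671 | 0.9671 | 0 |
| MARC0072483 | 12 | 46436962 | 0.2169 | 0.1199 | 0.5465 |
| ASGA0104770 | 12 | 60372315 | 0.0883 | 0.1469 | 0.3674 |
| MARC0010739 | 13 | 40671308 | 0.9512 | 0.5537 | 0.1429 |
| ALGA0069709 | 13 | 42825042 | 0.9512 | 0.9068 | 0.0115 |
| MARC0094198 | 13 | 43808238 | 1 | 0.9556 | 0.0115 |
| ALGA0114810 | 13 | 49676914 | 0.4135 | 0.0134 | 0.3055 |
| ASGA0057953 | 13 | 71998014 | 0.0526 | 0.1905 | 0.3333 |
| ALGA0070726 | 13 | 73168055 | 0.5376 | 0.0017 | 0.5543 |
| INRA0040831 | 13 | 115237462 | 0.9348 | 0.9835 | 0.0127 |
| INRA0040844 | 13 | 117860412 | 0.9348 | 0.9835 | 0.0127 |
| INRA0040883 | 13 | 124139277 | 0.9185 | 0.9512 | 0.0084 |
| ALGA0076648 | 14 | 30849871 | 0.0547 | 0.5746 | 0.3880 |
| DBMA0000255 | 14 | 131114363 | 0.5385 | 0.0512 | 0.3135 |
| M1GA0019170 | 14 | 135148446 | 0.9512 | 0.9512 | 0 |
| ALGA0113804 | 15 | 2724715 | 0.1694 | 0.0563 | 0.3810 |
| INRA0048834 | 15 | 14037549 | 0.3636 | 0.0031 | 0.3135 |
| ALGA0084945 | 15 | 40545702 | 0.0069 | 0.4234 | 0.3399 |
| ALGA0088237 | 15 | 148849949 | 0.8321 | 0.0550 | 0.5715 |
| MARC0040430 | 16 | 2142400 | 0.2813 | 0.0230 | 0.4293 |
| ASGA0071886 | 16 | 2745009 | 0.1903 | 0.0568 | 0.3759 |
| ASGA0072342 | 16 | 14465203 | 0.1594 | 0.0716 | 0.3679 |
| H3GA0046303 | 16 | 24940106 | 0.0899 | 0.8021 | 0.4683 |
| ALGA0090172 | 16 | 35245008 | 0.0344 | 0.1554 | 0.3116 |
| MARC0080217 | 16 | 58476421 | 0.9835 | 0.9702 | 0.0010 |
| ALGA0094674 | 17 | 39199731 | 0.0243 | 0.2865 | 0.4208 |
| ALGA0095308 | 17 | 50025811 | 0.9355 | 0.9355 | 0 |
| MARC0041179 | 18 | 617438 | 0.2592 | 0.8749 | 0.3044 |
| ALGA0097196 | 18 | 16559678 | 1 | 1 | 0 |
| ASGA0079061 | 18 | 17817003 | 0.6083 | 0.0472 | 0.3757 |
| ASGA0079737 | 18 | 44927369 | 0.9671 | 0.9671 | 0 |
| DBWU0000187 | 18 | 46589140 | 0.0006 | 0.2977 | 0.3202 |
| M1GA0023257 | 18 | 48941351 | 0.9512 | 0.9068 | 0.0115 |
| ALGA0098723 | 18 | 54367433 | 0.9348 | 0.9835 | 0.0127 |
| ALGA0098742 | 18 | 55295182 | 0.9672 | 1 | 0.0084 |
| ASGA0080420 | 18 | 58822946 | 0.9355 | 0.9556 | 0.0010 |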

The pairwise FST values for the 129 AIMs.

The information of the seven AIMs which could completely separate ECHP, SCHP and EUCP are indicated in bold font. Chr, chromosome.

Accurate Ancestry Proportion Estimation Using AIMs

AIMs selected from unadmixed populations have been reported to be successfully applied to estimate ancestry proportions in admixed populations (Lee et al., 2012; Maples et al., 2013). To validate the practicability of this approach in our study, we performed data simulation. If the approach is practical, we should observe high consistency between simulated and estimated ancestry proportions. For each AIM set, supervised ADMIXTURE was used to calculate ancestry proportions in 1,000 simulations. For each simulation, the genotypes of 60 samples selected from ECHP, SCHP and EUCP were randomly mixed at each AIM.

As shown in Figures 4A,B, when 80 or fewer AIMs were included, large differences between the mean estimated value and the expected value (∼0.3333) were observed. For example, the seven AIMs that worked perfectly for classification were not sufficient to infer the ancestry proportions accurately: ECHP (mean = 0.2994, coefficient of variation (CV) = 0.8450), SCHP (mean = 0.3909, CV = 0.7783) and EUCP (mean = 0.3097, CV = 0.9895). However, when the top 82 or more AIMs were included, the estimated proportions gradually converged to the expected values (Figure 4A). The CV plot showed the same tendency: the CV decreased as the number of AIMs increased (Figure 4B).

FIGURE 4

In order to determine the optimal AIM set, we fitted the CV curves in Figure 4B with a reciprocal logarithmic function (Supplementary Figure S4) for AIM sets of between 82 and 403 AIMs. Since the slope of the tangent to the curve approaches zero asymptotically, we chose an arbitrary slope threshold of −0.0004, which corresponds to the set of 129 AIMs, by considering both the stability of the CV value and the genotyping cost for SNPs (Supplementary Table S6). The set of 129 AIMs performed well in ancestry inference for simulated samples (Figure 5), yielding ECHP: mean = 0.3310, standard deviation (std) = 0.0772; SCHP: mean = 0.3356, std = 0.0751; and EUCP: mean = 0.3334, std = 0.0394. We also observed that the performance of the 129-AIM set differed very little from that of the 403-AIM set, suggesting that the 129-AIM set was optimal (Supplementary Table S6).
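The slope-threshold rule can be sketched as follows, assuming the reciprocal logarithmic model has the form y = 1/(a + b·ln x), whose derivative is dy/dx = −b / (x·(a + b·ln x)²). The fitted coefficients are not reported in the text, so `a` and `b` in the usage line are hypothetical placeholders, and the function names are illustrative only.

```python
import math


def reciprocal_log(x, a, b):
    """Assumed model form: CV as a function of the number of AIMs x."""
    return 1.0 / (a + b * math.log(x))


def reciprocal_log_slope(x, a, b):
    """Analytic derivative of the model, i.e. the slope of the tangent at x."""
    return -b / (x * (a + b * math.log(x)) ** 2)


def first_size_reaching(threshold, a, b, sizes):
    """Return the first AIM-set size whose tangent slope is at least
    `threshold` (i.e. no steeper than the threshold), or None if none is."""
    for n in sizes:
        if reciprocal_log_slope(n, a, b) >= threshold:
            return n
    return None


# Hypothetical coefficients, used only to show how the rule is applied.
chosen = first_size_reaching(-0.0004, a=1.0, b=1.0, sizes=range(82, 404))
```

With the coefficients actually fitted in the study, the same rule with a threshold of −0.0004 is what selects the 129-AIM set; the threshold itself remains a judgment call balancing CV stability against genotyping cost.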

FIGURE 5

To assess the practicability of the 129-AIM set, we next simulated pseudo admixed individuals with unequal random ancestry proportions using the same AIMs. We first produced 10 random sets of ancestry proportions for the three groups, and then ran 1,000 simulations for each set of proportions. For each simulation, 60 pseudo admixed individuals were generated. As shown in Table 3, the 129 AIMs worked very well, even for samples with random ancestry proportions.

Table 3

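| Set | ECHP expectation | ECHP mean | ECHP 95% CI | SCHP expectation | SCHP mean | SCHP 95% CI | EUCP expectation | EUCP mean | EUCP 95% CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.2083 | 0.2015 | 0.0495–0.3528 | 0.4333 | 0.4393 | 0.2914–0.5864 | 0.3583 | 0.3585 | 0.2818–0.4366 |
| 2 | 0.1583 | 0.1504 | 0.0000–0.2954 | 0.3000 | 0.3076 | 0.1652–0.4486 | 0.5417 | 0.5420 | 0.4617–0.6210 |
| 3 | 0.2500 | 0.2440 | 0.0968–0.3925 | 0.6000 | 0.6056 | 0.4604–0.7486 | 0.1500 | 0.1501 | 0.0924–0.2119 |
| 4 | 0.7333 | 0.7344 | 0.6049–0.8512 | 0.1167 | 0.1151 | 0.0000–0.2365 | 0.1500 | 0.1501 | 0.0920–0.2116 |
| 5 | 0.5083 | 0.5121 | 0.3612–0.6553 | 0.3750 | 0.3719 | 0.2283–0.5173 | 0.1167 | 0.1167 | 0.0643–0.1735 |
| 6 | 0.3500 | 0.3521 | 0.2009–0.5014 | 0.4667 | 0.4640 | 0.3153–0.6123 | 0.1833 | 0.1834 | 0.1210–0.2489 |
| 7 | 0.4000 | 0.4027 | 0.2509–0.5520 | 0.4500 | 0.4466 | 0.2987–0.5942 | 0.1500 | 0.1500 | 0.0912–0.2135 |
| 8 | 0.2250 | 0.2184 | 0.0692–0.3681 | 0.5333 | 0.5394 | 0.3932–0.6839 | 0.2417 | 0.2422 | 0.1729–0.3141 |
| 9 | 0.7083 | 0.7100 | 0.5830–0.8284 | 0.1167 | 0.1156 | 0.0000–0.2391 | 0.1750 | 0.1749 | 0.1132–0.2398 |
| 10 | 0.2333 | 0.2272 | 0.0818–0.3765 | 0.6083 | 0.6145 | 0.4699–0.7566 | 0.1583 | 0.1583 | 0.0998–0.2228 |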

Simulation of random ancestry proportions using the 129 AIMs.

As anticipated, PCA based on the 129 AIMs (Table 2 and Supplementary Table S7) clearly divided the 10 populations into the three corresponding groups (Supplementary Figure S5). Interestingly, in comparison to Figure 1A, substructure among populations within each group was less obvious.

Ancestry Proportion Estimation for the Test Dataset

It has been reported that some Asian pig breeds are admixed with European domestic breeds, especially commercial breeds. For instance, eight Asian breeds (Korean local breed (KPKO), Thailand local breed (THCD), China Lichahei (CNLC), China Sutai (CNST), China Kele (CNKL), China Guanling (CNGU), China Leanhua (CNLA), and China Minzhu (CNMZ)) have been reported to carry at least 20% introgressed European ancestry (Yang et al., 2017). In order to systematically identify and quantify the introgression, we used the 129 selected AIMs to estimate the ancestry compositions of another 969 samples from 61 populations that are possibly admixed to at least some extent.

Overall, using supervised ADMIXTURE, we found a strong correlation (Figure 6) between ancestry proportions calculated from the 129 AIMs and those calculated from all ∼60 K chip data at the individual level. A Bland–Altman plot also showed agreement between the ancestry proportions estimated from the genome-wide data and those estimated from the 129 AIMs (Figure 7). For breeds known to be introgressed from EUCP, we obtained reasonable results. As shown in Figure 8 and Supplementary Table S8, the mean ancestry proportions estimated for the CNMZ population using the 129 AIMs (ECHP:0.5325, SCHP:0.2456, EUCP:0.2219) were similar to those estimated using the ∼60 K SNP data (ECHP:0.6457, SCHP:0.1291, EUCP:0.2252). For the LargeWhite-Meishan crossbreed (CSLM), which has been documented as an F1 generation from LargeWhite × MeiShan, our ancestry proportion estimates from the 129 AIMs (ECHP:0.4992, SCHP:0.0455, EUCP:0.4553) were consistent with the expectation and similar to the result from the ∼60 K SNP data (ECHP:0.5128, SCHP:0.0020, EUCP:0.4852). In another case, for Russia Minisibs (RUMS), which has been reported to possess approximately half European ancestry, we also obtained a high level of EUCP ancestry using either the 129 AIMs (ECHP:0.1428, SCHP:0.4780, EUCP:0.3791) or the ∼60 K SNP data (ECHP:0, SCHP:0.5349, EUCP:0.4651).
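The two agreement measures used here can be sketched in a few lines. The inputs below are the three population-mean EUCP proportions quoted above (CNMZ, CSLM, RUMS), used only as toy data; the study itself computed these measures at the individual level across 969 samples. The function names are hypothetical, and the 1.96 × SD limits of agreement are the conventional Bland–Altman choice, assumed rather than taken from the text.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) of agreement between x and y,
    using bias +/- 1.96 sample standard deviations of the differences."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# EUCP population means from the text: CNMZ, CSLM, RUMS.
eucp_129_aims = [0.2219, 0.4553, 0.3791]
eucp_60k_snps = [0.2252, 0.4852, 0.4651]
r = pearson_r(eucp_129_aims, eucp_60k_snps)
bias, lower, upper = bland_altman(eucp_129_aims, eucp_60k_snps)
```

Correlation alone does not reveal a systematic offset between the two estimates, which is why the Bland–Altman bias and limits of agreement are a useful complement.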

FIGURE 6

FIGURE 7

FIGURE 8

Discussion

Since the 19th century, pig breeders in the West have hybridized Chinese pigs with European pigs to improve their breeding stock (Groenen, 2016). Bianco et al. (2015) found that European domestic pigs carry 20% genomic introgression from Asian pigs. On the other hand, Yang et al. (2017) reported that European pigs contributed at least 20% to eight Asian breeds. In recent years, evidence has been presented that local Chinese farmers cross local pigs with imported commercial pigs (Berthouly-Salazar et al., 2012). Introgression introduces new genetic material, which might help to improve certain characteristics, especially production performance. Unfortunately, introgression, whether in a narrow sense, as admixture with foreign breeds, or in a broad sense, as admixture with breeds from different areas within a nation, also introduces “genetic pollution” that is hardly avoidable. For example, in a recent study, Zhang et al. (2019) found that almost all Chinese indigenous chickens carry gene introgression from commercial broilers.

Because pork from indigenous breeds sells at a higher price than pork from European commercial pigs in China, false advertising and adulterated products have become increasingly common on the market. Significant attention has been paid to the issue of pork adulteration; however, at this stage, identification has mostly been based on the intuition and experience of customers (Dai et al., 2009; Kwon et al., 2017). Fortunately, pig products from the 10 breeds in our reference set are dominant in China (Bosse et al., 2015; Gong et al., 2018; Zhao et al., 2018), so our method constitutes a promising way to detect pork adulteration at the DNA level in market surveillance. From a researcher's perspective, differing genetic ancestries between cases and controls in genome-wide association studies lead to population stratification. Selecting samples with similar ancestry proportions, or including ancestry as a covariate in the regression model to adjust for stratification, would therefore help to reduce false positives (Qin et al., 2014).

Overall, tracing origin and estimating genetic ancestry are highly important for genetic resource protection, market surveillance, and the control of population stratification. AIMs provide a cost-effective approach compared with whole-genome SNPs, and are thus very suitable for large-volume testing.

In the present study, we found that as few as two AIMs are sufficient to distinguish Chinese pigs from European commercial pigs, and that 10 pure breeds could be accurately assigned to three corresponding groups (ECHP, SCHP and EUCP) using as few as seven AIMs. Through data simulations, we demonstrated that AIMs selected from unadmixed individuals can also be successfully applied to estimate ancestry proportions for admixed individuals. We further developed a panel of 129 AIMs to infer ancestry proportions in possibly admixed individuals effectively. Considering flexibility, reliability and serviceability, the Agena MassARRAY platform would currently be the best choice for genotyping the 129-AIM set. However, for very large-volume testing, a customized low-density SNP chip or multiplex PCR-based next-generation sequencing would be more cost-effective.

Our work provides a useful example of using a small number of AIMs for classification and for estimating ancestry proportions. Efforts can still be made to reduce the AIMs to a minimum number if necessary. For example, among the 129 AIMs, those representing the differences between EUCP and ECHP or SCHP could possibly be reduced. Alternatively, more AIMs could be included to increase the power of discrimination between ECHP and SCHP.

It is worth noting that one important prerequisite for obtaining effective AIMs, for either classification or ancestry estimation, is to find good reference populations. For example, Daya et al. (2013) reported that a panel of 96 AIMs could be used to infer ancestry proportions for the South African Colored (SAC) population by using representative populations. However, these markers did not perform well in inferring South Asian and East Asian ancestries. In our study, 10 pure pig breeds from three groups (ECHP, SCHP, and EUCP) were chosen as reference populations, for several reasons. First, European commercial pigs and crossbreeds between indigenous and European commercial breeds have become increasingly common in China, so the major imported European commercial breeds, including Duroc, Pietrain, and Landrace, were chosen as representative populations of EUCP. Second, the Chinese breeds included in this study cover two designated ancestry backgrounds. In the study of Yang et al. (2017), China_Erhualian (CNEH), China_Jinhua (CNJH), and China_Meishan (CNMS) pigs are clearly derived from one ancestry, and China_Bamaxiang (CNBX), China_Congjiangxiang (CNCJ), China_Guangdongdahuabai (CNDH), and China_Luchuan (CNLU) are clearly derived from the other. Admixture analysis showed that they are the least introgressed by EUCP and can be clearly separated from each other. Together, they thus constitute the best reference population available so far, considering both genetic purity and the ability to reveal potential admixture in other Chinese breeds. If more pure breeds are included in the reference set in the future, one could expect more accurate estimation as well as a wider range of populations to which our method could be applied.

Statements

Author contributions

YZ conceived and supervised the study. ZL analyzed the main content of the data with the assistance of LB, YQ, YP, and RY. ZL and YZ wrote the manuscript. All authors read and approved the final manuscript.

Funding

The project was supported by the National Key Technology Research and Development Program (2015BAD03B01-01) and the National Natural Science Foundation of China (U1704233).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00183/full#supplementary-material

References

  • 1

    Ai, H., Huang, L., and Ren, J. (2013). Genetic diversity, linkage disequilibrium and selection signatures in Chinese and Western pigs revealed by genome-wide SNP markers. PLoS One 8:e56001. doi: 10.1371/journal.pone.0056001

  • 2

    Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. doi: 10.1101/gr.094052.109

  • 3

    Barbosa, F. B., Cagnin, N. F., Simioni, M., Farias, A. A., Torres, F. R., Molck, M. C., et al. (2017). Ancestry informative marker panel to estimate population stratification using genome-wide human array. Ann. Hum. Genet. 81, 225–233. doi: 10.1111/ahg.12208

  • 4

    Bauchet, M., McEvoy, B., Pearson, L. N., Quillen, E. E., Sarkisian, T., Hovhannesyan, K., et al. (2007). Measuring European population stratification with microarray genotype data. Am. J. Hum. Genet. 80, 948–956. doi: 10.1086/513477

  • 5

    Berthouly-Salazar, C., Thevenon, S., Van, T. N., Nguyen, B. T., Pham, L. D., Chi, C. V., et al. (2012). Uncontrolled admixture and loss of genetic diversity in a local Vietnamese pig breed. Ecol. Evol. 2, 962–975. doi: 10.1002/ece3.229

  • 6

    Bertolini, F., Galimberti, G., Schiavo, G., Mastrangelo, S., Di Gerlando, R., Strillacci, M. G., et al. (2017). Preselection statistics and random forest classification identify population informative single nucleotide polymorphisms in cosmopolitan and autochthonous cattle breeds. Animal 12, 12–19. doi: 10.1017/S1751731117001355

  • 7

    Bianco, E., Soto, H. W., Vargas, L., and Perez-Enciso, M. (2015). The chimerical genome of Isla del Coco feral pigs (Costa Rica), an isolated population since 1793 but with remarkable levels of diversity. Mol. Ecol. 24, 2364–2378. doi: 10.1111/mec.13182

  • 8

    Bosse, M., Madsen, O., Megens, H. J., Frantz, L. A. F., Paudel, Y., Crooijmans, R. P., et al. (2015). Hybrid origin of European commercial pigs examined by an in-depth haplotype analysis on chromosome 1. Front. Genet. 5:442. doi: 10.3389/Fgene.2014.00442

  • 9

    Bosse, M., Megens, H. J., Frantz, L. A. F., Madsen, O., Larson, G., Paudel, Y., et al. (2014). Genomic analysis reveals selection for Asian genes in European pigs following human-mediated introgression. Nat. Commun. 5:4392. doi: 10.1038/Ncomms5392

  • 10

    Bouchemousse, S., Liautard-Haag, C., Bierne, N., and Viard, F. (2016). Distinguishing contemporary hybridization from past introgression with postgenomic ancestry-informative SNPs in strongly differentiated Ciona species. Mol. Ecol. 25, 5527–5542. doi: 10.1111/mec.13854

  • 11

    Browning, S. R., and Browning, B. L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097. doi: 10.1086/521987

  • 12

    Da Mota, B., Tudoran, R., Costan, A., Varoquaux, G., Brasche, G., Conrod, P., et al. (2014). Machine learning patterns for neuroimaging-genetic studies in the cloud. Front. Neuroinform. 8:31. doi: 10.3389/Fninf.2014.00031

  • 13

    Dai, F. W., Feng, D. Y., Cao, Q. Y., Ye, H., Zhang, C. M., Xia, W. G., et al. (2009). Developmental differences in carcass, meat quality and muscle fibre characteristics between the Landrace and a Chinese native pig. S. Afr. J. Anim. Sci. 39, 267–273.

  • 14

    Daya, M., van der Merwe, L., Galal, U., Moller, M., Salie, M., Chimusa, E. R., et al. (2013). A panel of ancestry informative markers for the complex five-way admixed South African coloured population. PLoS One 8:e82224. doi: 10.1371/journal.pone.0082224

  • 15

    Dimauro, C., Nicoloso, L., Cellesi, M., Macciotta, N. P. P., Ciani, E., Moioli, B., et al. (2015). Selection of discriminant SNP markers for breed and geographic assignment of Italian sheep. Small Rumin. Res. 128, 27–33. doi: 10.1016/j.smallrumres.2015.05.001

  • 16

    Ding, L. L., Wiener, H., Abebe, T., Altaye, M., Go, R. C. P., Kercsmar, C., et al. (2011). Comparison of measures of marker informativeness for ancestry and admixture mapping. BMC Genomics 12:622. doi: 10.1186/1471-2164-12-622

  • 17

    Fang, M., Hu, X., Jiang, T., Braunschweig, M., Hu, L., Du, Z., et al. (2005). The phylogeny of Chinese indigenous pig breeds inferred from microsatellite markers. Anim. Genet. 36, 7–13. doi: 10.1111/j.1365-2052.2004.01234.x

  • 18

    Frantz, L. A. F., Schraiber, J. G., Madsen, O., Megens, H. J., Bosse, M., Paudel, Y., et al. (2013). Genome sequencing reveals fine scale diversification and reticulation history during speciation in Sus. Genome Biol. 14:R107. doi: 10.1186/Gb-2013-14-9-R107

  • 19

    Galanter, J. M., Fernandez-Lopez, J. C., Gignoux, C. R., Barnholtz-Sloan, J., Fernandez-Rozadilla, C., Via, M., et al. (2012). Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas. PLoS Genet. 8:e1002554. doi: 10.1371/journal.pgen.1002554

  • 20

    Getachew, T., Huson, H. J., Wurzinger, M., Burgstaller, J., Gizaw, S., Haile, A., et al. (2017). Identifying highly informative genetic markers for quantification of ancestry proportions in crossbred sheep populations: implications for choosing optimum levels of admixture. BMC Genet. 18:80. doi: 10.1186/s12863-017-0526-2

  • 21

    Gong, H., Xiao, S., Li, W., Huang, T., Huang, X., Yan, G., et al. (2018). Unravelling the genetic loci for growth and carcass traits in Chinese Bamaxiang pigs based on a 1.4 million SNP array. J. Anim. Breed. Genet. 136, 3–14. doi: 10.1111/jbg.12365

  • 22

    Groenen, M. A. M. (2016). A decade of pig genome sequencing: a window on pig domestication and evolution. Genet. Sel. Evol. 48:23. doi: 10.1186/s12711-016-0204-2

  • 23

    Hong, J. H., and Cho, S. B. (2008). A probabilistic multi-class strategy of one-vs.-rest support vector machines for cancer classification. Neurocomputing 71, 3275–3281. doi: 10.1016/j.neucom.2008.04.033

  • 24

    Kwon, T., Yoon, J., Heo, J., Lee, W., and Kim, H. (2017). Tracing the breeding farm of domesticated pig using feature selection (Sus scrofa). Asian Aust. J. Anim. Sci. 30, 1540–1549. doi: 10.5713/ajas.17.0561

  • 25

    Larson, G., Dobney, K., Albarella, U., Fang, M. Y., Matisoo-Smith, E., Robins, J., et al. (2005). Worldwide phylogeography of wild boar reveals multiple centers of pig domestication. Science 307, 1618–1621. doi: 10.1126/science.1106927

  • 26

    Lawson, D. J., Hellenthal, G., Myers, S., and Falush, D. (2012). Inference of population structure using dense haplotype data. PLoS Genet. 8:e1002453. doi: 10.1371/journal.pgen.1002453

  • 27

    Lee, S., Epstein, M. P., Duncan, R., and Lin, X. H. (2012). Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36, 293–302. doi: 10.1002/gepi.21621

  • 28

    Li, C. X., Pakstis, A. J., Jiang, L., Wei, Y. L., Sun, Q. F., Wu, H., et al. (2016). A panel of 74 AISNPs: improved ancestry inference within Eastern Asia. Forensic Sci. Int. Genet. 23, 101–110. doi: 10.1016/j.fsigen.2016.04.002

  • 29

    Li, S.-J., Yang, S.-H., Zhao, S.-H., Fan, B., Yu, M., Wang, H.-S., et al. (2004). Genetic diversity analyses of 10 indigenous Chinese pig populations based on 20 microsatellites. J. Anim. Sci. 82, 368–374. doi: 10.2527/2004.822368x

  • 30

    Maples, B. K., Gravel, S., Kenny, E. E., and Bustamante, C. D. (2013). RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288. doi: 10.1016/j.ajhg.2013.06.020

  • 31

    Monzon, J., Kays, R., and Dykhuizen, D. E. (2014). Assessment of coyote-wolf-dog admixture using ancestry-informative diagnostic SNPs. Mol. Ecol. 23, 182–197. doi: 10.1111/mec.12570

  • 32

    Pardo-Seco, J., Martinon-Torres, F., and Salas, A. (2014). Evaluating the accuracy of AIM panels at quantifying genome ancestry. BMC Genomics 15:543. doi: 10.1186/1471-2164-15-543

  • 33

    Patterson, N., Price, A. L., and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet. 2:e190. doi: 10.1371/journal.pgen.0020190

  • 34

    Peterson, R. E., Edwards, A. C., Bacanu, S. A., Dick, D. M., Kendler, K. S., and Webb, B. T. (2017). The utility of empirically assigning ancestry groups in cross-population genetic studies of addiction. Am. J. Addict. 26, 494–501. doi: 10.1111/ajad.12586

  • 35

    Qin, P., Li, Z., Jin, W., Lu, D., Lou, H., Shen, J., et al. (2014). A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese. Eur. J. Hum. Genet. 22, 248–253. doi: 10.1038/ejhg.2013.111

  • 36

    Rosenberg, N. A., Li, L. M., Ward, R., and Pritchard, J. K. (2003). Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73, 1402–1422. doi: 10.1086/380416

  • 37

    Santos, H. C., Horimoto, A. V. R., Tarazona-Santos, E., Rodrigues-Soares, F., Barreto, M. L., Horta, B. L., et al. (2016). A minimum set of ancestry informative markers for determining admixture proportions in a mixed American population: the Brazilian set. Eur. J. Hum. Genet. 24, 725–731. doi: 10.1038/ejhg.2015.187

  • 38

    Shriver, M. D., Parra, E. J., Dios, S., Bonilla, C., Norton, H., Jovel, C., et al. (2003). Skin pigmentation, biogeographical ancestry and admixture mapping. Hum. Genet. 112, 387–399. doi: 10.1007/s00439-002-0896-y

  • 39

    Sun, K., Ye, Y., Luo, T., and Hou, Y. (2016). Multi-InDel analysis for ancestry inference of sub-populations in China. Sci. Rep. 6:39797. doi: 10.1038/srep39797

  • 40

    Tian, C., Kosoy, R., Nassir, R., Lee, A., Villoslada, P., Klareskog, L., et al. (2009). European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol. Med. 15, 371–383. doi: 10.2119/molmed.2009.00094

  • 41

    vonHoldt, B. M., Kays, R., Pollinger, J. P., and Wayne, R. K. (2016). Admixture mapping identifies introgressed genomic regions in North American canids. Mol. Ecol. 25, 2443–2453. doi: 10.1111/mec.13667

  • 42

    Yang, B., Cui, L. L., Perez-Enciso, M., Traspov, A., Crooijmans, R. P. M. A., Zinovieva, N., et al. (2017). Genome-wide SNP data unveils the globalization of domesticated pigs. Genet. Sel. Evol. 49:71. doi: 10.1186/s12711-017-0345-y

  • 43

    Zeng, X. P., Chakraborty, R., King, J. L., Larue, B., Moura-Neto, R. S., and Budowle, B. (2016). Selection of highly informative SNP markers for population affiliation of major US populations. Int. J. Legal Med. 130, 341–352. doi: 10.1007/s00414-015-1297-9

  • 44

    Zhang, C., Lin, D., Wang, Y., Peng, D., Li, H., Fei, J., et al. (2019). Widespread introgression in Chinese indigenous chicken breeds from commercial broiler. Evol. Appl. 12, 610–621. doi: 10.1111/eva.12742

  • 45

    Zhang, F., Zhang, L., and Deng, H. W. (2009). A PCA-based method for ancestral informative markers selection in structured populations. Sci. Chin. Series C Life Sci. 52, 972–976. doi: 10.1007/s11427-009-0128-y

  • 46

    Zhao, P., Yu, Y., Feng, W., Du, H., Yu, J., Kang, H., et al. (2018). Evidence of evolutionary history and selective sweeps in the genome of Meishan pig reveals its genetic and phenotypic characterization. Gigascience 7. doi: 10.1093/gigascience/giy058

  • 47

    Zhu, Y., Li, W., Yang, B., Zhang, Z., Ai, H., Ren, J., et al. (2017). Signatures of selection and interspecies introgression in the genome of Chinese domestic pigs. Genome Biol. Evol. 9, 2592–2603. doi: 10.1093/gbe/evx186

Summary

Keywords

ancestry informative markers, FST, classification, pig, ancestry proportion

Citation

Liang Z, Bu L, Qin Y, Peng Y, Yang R and Zhao Y (2019) Selection of Optimal Ancestry Informative Markers for Classification and Ancestry Proportion Estimation in Pigs. Front. Genet. 10:183. doi: 10.3389/fgene.2019.00183

Received

28 September 2018

Accepted

19 February 2019

Published

11 March 2019

Volume

10 - 2019

Edited by

Denis Milan, Institut National de la Recherche Agronomique (INRA), France

Reviewed by

Jesús Fernández, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Spain; Simon Boitard, Institut National de la Recherche Agronomique de Toulouse, France

Copyright

*Correspondence: Yiqiang Zhao,

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
