Edited by: Tad Stewart Sonstegard, Acceligen, United States
Reviewed by: Yuri Tani Utsunomiya, São Paulo State University, Brazil; Ino Curik, University of Zagreb, Croatia
This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
A variety of statistical methods, such as admixture models, have been used to estimate genomic breed composition (GBC). These methods, however, tend to assign non-zero components to reference breeds that share some genomic similarity with a test animal. These non-essential GBC components, in turn, reduce the estimated GBC for the breed to which the animal belongs. As a result, not all purebred animals have 100% GBC for their respective breeds, which implies an elevated false-negative rate when purebred animals are identified using 100% GBC as the cutoff. Alternatively, a lower cutoff of estimated GBC has to be used, which is arbitrary and makes the results less interpretable. In the present study, three admixture models with regularization were proposed, which produced sparse solutions by suppressing the noise in the estimated GBC due to genomic similarities. The regularization or penalty forms included the L1 norm penalty, the minimax concave penalty (MCP), and the smoothly clipped absolute deviation (SCAD) penalty. The performance of these regularized admixture models in estimating GBC was examined in purebred and composite animals, respectively, and compared with that of the non-regularized admixture model as the baseline. The results showed that, given optimal values for λ, the three sparsely regularized admixture models had higher power, and thus a lower false-negative rate, for the breed identification of purebred animals than the non-regularized admixture model. Of the three regularized admixture models, the two with a non-convex penalty outperformed the one with the L1 norm penalty. In Brangus, a composite cattle breed, estimates of GBC were roughly comparable among the four admixture models, but all four models underestimated the GBC for these composite animals when non-ancestral breeds were included as references.
In conclusion, the admixture models with sparse regularization gave more parsimonious, consistent and interpretable results of estimated GBC for purebred animals than the non-regularized admixture model. Nevertheless, the utility of regularized admixture models for estimating GBC in crossbred or composite animals needs to be taken with caution.
The estimation of genomic breed composition (GBC) of individual animals is useful in many aspects, such as predicting heterosis (Akanno et al.,
A variety of statistical methods and software packages have been developed to estimate GBC (Alexander et al.,
In the present study, regularized admixture methods were utilized to produce sparse solutions of admixture coefficients, thus imposing penalties on small, non-essential components due to genomic similarity. Three forms of sparse regularization were incorporated into the admixture models: the L1 norm penalty, the minimax concave penalty (MCP), and the smoothly clipped absolute deviation (SCAD) penalty. The L1 norm is the most commonly used convex surrogate (Tibshirani,
The dataset used in the present study included 107,593 animals from ten breeds (nine pure breeds and one composite breed). All these animals were genotyped on the GeneSeek Genomic Profiler (GGP) bovine 50K version 1 SNP chip (49,463 SNPs), except for 349 Brahman animals, which were genotyped on the Illumina 777K bovine SNP chip (777,962 SNPs). The reference populations consisted of eight
Descriptive statistics of genotype data for the ten cattle breeds used in the present study.
Angus | 20,359 (20,322) | 49,463 | 0.492 (0.247) |
Brahman | 349 (349) | 777,962 | 0.439 (0.343) |
68 (43) | 49,463 | 0.431 (0.363) | |
Brangus | 3,605 | 49,463 | 0.477 (0.231) |
Hereford | 2,423 (2,421) | 49,463 | 0.496 (0.271) |
Holstein | 20,350 (20,246) | 49,463 | 0.489 (0.254) |
Jersey | 15,689 (15,607) | 49,463 | 0.489 (0.288) |
Limousine | 5,043 (5,041) | 49,463 | 0.490 (0.228) |
Shorthorn | 1,232 (1,218) | 49,463 | 0.491 (0.258) |
Simmental | 14,754 (14,727) | 49,463 | 0.490 (0.226) |
Wagyu | 23,721 (21,844) | 49,463 | 0.483 (0.302) |
Genomic breed composition was estimated based on SNP panels. The largest panel had 15,708 SNPs (referred to as the 16K SNP panel) which were common SNPs across five commercial bovine SNP chips, namely, Illumina Bovine high-density (HD or 777K) chip, GGP ultra-high-density (UHD or 150K) SNP chip, GGP HD (80K) SNP chip, GGP 50K version 1 SNP chip, and GGP low-density (LD or 40K) version 4 SNP chip. The main reason for us to use the shared content of these commercial SNP chips was to facilitate the estimation of GBC using currently available SNP chips in the market. Then, three panels of uniformly-distributed SNPs (1K, 5K, and 10K) were selected from the list of 16K common SNPs using the selectSNP package (Wu et al.,
The reference animals for each of the nine pure breeds (not including Brangus) were selected using the 5K SNP panel based on the likelihood approach previously described by He et al. (
Consider
The log-likelihood of all the observed genotypes on this individual was given by:
The above likelihood (2) can be written as:
where
In the ADMIXTURE-L1 model, estimates of sparse solution
where λ (λ > 0) is a Lagrange multiplier (i.e., a regularization parameter) that determines the amount of sparsity in
The gradient of
where
In (4),
In ADMIXTURE-MCP and ADMIXTURE-SCAD, the estimate of sparse solution
where λ (λ > 0) and
Given γ > 1, SCAD has
In the above, γ is the concavity parameter of MCP or SCAD, which essentially characterizes the concavity of the MCP or SCAD regularizer: A larger γ implies that the regularizer is less concave. In this paper, we let γ = 3 as usual. Please refer to
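As a rough illustration of the three penalty forms named above, the following Python sketch evaluates the L1, MCP, and SCAD penalties on a single admixture coefficient and adds them to a binomial admixture log-likelihood. It assumes the commonly used textbook definitions of MCP and SCAD and a standard admixture likelihood with independent SNPs; the function names, the argument layout, and the unscaled form of the objective are illustrative assumptions and are not taken from the authors' software. Constraints on the coefficients (non-negativity, normalization) are not shown.

```python
import math


def log_likelihood(q, genotypes, freqs):
    """Binomial admixture log-likelihood for one animal (SNPs assumed independent).

    q         -- admixture coefficients, one per reference breed
    genotypes -- allele counts (0, 1, or 2), one per SNP
    freqs     -- freqs[k][j] is the counted-allele frequency of SNP j in breed k
    """
    total = 0.0
    for j, g in enumerate(genotypes):
        # Expected allele frequency of this animal at SNP j under the mixture.
        p = sum(q[k] * freqs[k][j] for k in range(len(q)))
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total


def l1_penalty(theta, lam):
    """L1 norm penalty: grows linearly with |theta|."""
    return lam * abs(theta)


def mcp_penalty(theta, lam, gamma=3.0):
    """Minimax concave penalty (textbook form); flat once |theta| > gamma * lam."""
    t = abs(theta)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def scad_penalty(theta, lam, gamma=3.0):
    """SCAD penalty (textbook form); L1-like near zero, flat for large |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= gamma * lam:
        return (2.0 * gamma * lam * t - t * t - lam * lam) / (2.0 * (gamma - 1.0))
    return lam * lam * (gamma + 1.0) / 2.0


def penalized_objective(q, genotypes, freqs, lam, penalty):
    """Negative log-likelihood plus the summed penalty (to be minimized)."""
    return -log_likelihood(q, genotypes, freqs) + sum(penalty(qk, lam) for qk in q)
```

With λ = 0.25 and γ = 3, a coefficient of 1.0 is charged 0.25 under L1 but only a constant amount under MCP or SCAD, whereas a small coefficient such as 0.1 is charged almost the same under all three. This is one way to see why the non-convex penalties can suppress small components while distorting large ones less.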
The optimal values for the parameter λ of the three sparsely regularized admixture models were obtained using three-fold cross-validation, based on the 5K SNP panel, and illustrated in three cattle breeds (Angus, Holstein, and Limousin). The non-regularized admixture model served as the baseline model for comparison because it was equivalent to ADMIXTURE-L1 with λ = 0. Briefly, all the animals for each breed were randomly split into three subsets. Then, the animals in two subsets were combined and used as the reference population for estimating the allele frequencies of SNPs in the 5K panel. The third subset was used as the testing set, in which GBC was computed for each animal. The procedure rotated three times so that each subset was used for testing once and only once. The percentage of animals with GBC = 1 for their respective breeds was computed for each of the three sparsely regularized admixture models under varied settings for the regularization parameter λ. Then, the optimal values of the regularization parameter λ were taken to be those for which each sparsely regularized admixture model gave a higher percentage of purebred animals with 100% GBC for their respective breeds than the non-regularized ADMIXTURE (λ = 0). By this criterion, the range of optimal values of λ for the three regularized admixture models appeared to be 0 < λ < 0.60 for Holstein, 0 < λ < 0.36 for Angus, and 0 < λ < 0.30 for Limousin (see
Percent of individuals with GBC = 1 obtained by the three regularized ADMIXTURE methods, each with a varying value for the regularization parameter lambda (λ). Curves were extracted from the surfaces in this figure by fixing GBC = 1 for ADMIXTURE-L1, ADMIXTURE-MCP, and ADMIXTURE-SCAD in Angus, Holstein, and Limousin, respectively.
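The three-fold procedure used to choose λ can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `estimate_own_breed_gbc` is a hypothetical user-supplied function that returns the estimated GBC of a test animal for its own breed (it would also need the allele frequencies of the other reference breeds, which are omitted here), and the helper names and the tolerance used for "GBC = 1" are illustrative.

```python
import random


def three_fold_splits(animal_ids, seed=0):
    """Yield (reference_ids, test_ids) so each animal is tested exactly once."""
    ids = list(animal_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::3] for i in range(3)]
    for i in range(3):
        reference = [a for j in range(3) if j != i for a in folds[j]]
        yield reference, folds[i]


def allele_frequencies(genotypes):
    """Per-SNP frequency of the counted allele from 0/1/2 genotypes."""
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(g[j] for g in genotypes) / (2.0 * n_animals) for j in range(n_snps)]


def percent_pure(gbc_values, tol=1e-6):
    """Percentage of animals whose own-breed GBC equals 1 (within tol)."""
    hits = sum(1 for v in gbc_values if v >= 1.0 - tol)
    return 100.0 * hits / len(gbc_values)


def cv_percent_pure(genotypes_by_animal, estimate_own_breed_gbc, lam, seed=0):
    """Three-fold cross-validated percentage of animals with GBC = 1 for one breed.

    genotypes_by_animal    -- dict mapping animal id to its list of genotypes
    estimate_own_breed_gbc -- hypothetical callable (freqs, genotype, lam) -> float
    """
    gbc = []
    for reference, test in three_fold_splits(sorted(genotypes_by_animal), seed):
        # Allele frequencies come from the two reference folds only.
        freqs = allele_frequencies([genotypes_by_animal[a] for a in reference])
        gbc.extend(
            estimate_own_breed_gbc(freqs, genotypes_by_animal[a], lam) for a in test
        )
    return percent_pure(gbc)
```

Evaluating `cv_percent_pure` over a grid of λ values and comparing each result with the value at λ = 0 would reproduce the selection criterion described in the text.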
With the optimal λ values given to the regularized models and λ = 0 for the non-regularized model, GBC was estimated for animals in each of the nine pure breeds using the four statistical models. In
Percent (%) of animals by categories of estimated GBC obtained using four statistical models with the 16K SNP panel in Angus (A), Holstein (H), and Limousin (L).
1 | 69.6 | 70.7 | 47.4 | 94.1 | 97.7 | 65.1 | 98.6 | 99.2 | 72.5 | 96.5 | 99.6 | 70.9 |
[0.9, 1) | 18.9 | 19.5 | 9.4 | 3.3 | 1.2 | 6.7 | 0.4 | 0.3 | 4.4 | 2.3 | 0.1 | 4.3 |
[0.8, 0.9) | 8.5 | 7.0 | 9.4 | 1.5 | 1.0 | 5.8 | 0.5 | 0.4 | 3.6 | 0.4 | 0.1 | 4.5 |
[0.7, 0.8) | 1.8 | 2.4 | 9.2 | 0.5 | 0.1 | 8.7 | 0.1 | 0.0 | 5.0 | 0.2 | 0.0 | 6.1 |
[0.6, 0.7) | 0.4 | 0.2 | 13.5 | 0.2 | 0.0 | 6.9 | 0.2 | 0.0 | 5.5 | 0.2 | 0.0 | 7.2 |
[0.5, 0.6) | 0.3 | 0.0 | 6.2 | 0.1 | 0.0 | 2.8 | 0.1 | 0.0 | 4.4 | 0.1 | 0.0 | 3.5 |
[0.4, 0.5) | 0.2 | 0.0 | 2.6 | 0.1 | 0.0 | 2.0 | 0.0 | 0.0 | 1.8 | 0.1 | 0.0 | 1.1 |
[0.3, 0.4) | 0.1 | 0.0 | 1.2 | 0.1 | 0.0 | 0.9 | 0.0 | 0.0 | 1.2 | 0.0 | 0.0 | 0.8 |
[0.2, 0.3) | 0.1 | 0.0 | 0.5 | 0.1 | 0.0 | 0.8 | 0.0 | 0.0 | 0.8 | 0.0 | 0.0 | 0.4 |
[0.1, 0.2) | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.4 | 0.0 | 0.0 | 0.3 |
[0, 0.1) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
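A tabulation of this kind can be produced by binning each animal's own-breed GBC into the category "1" or a half-open interval of width 0.1. The sketch below is illustrative only; the function names and the numerical tolerance are assumptions.

```python
from collections import Counter


def gbc_category(gbc, tol=1e-9):
    """Return '1' for GBC = 1, otherwise the half-open 0.1-wide interval label."""
    if gbc >= 1.0 - tol:
        return "1"
    # Index of the interval [decile/10, (decile+1)/10) that contains gbc.
    decile = min(int(gbc * 10 + tol), 9)
    return f"[{decile / 10:.1f}, {(decile + 1) / 10:.1f})"


def percent_by_category(gbc_values):
    """Percentage of animals falling in each GBC category."""
    counts = Counter(gbc_category(v) for v in gbc_values)
    n = len(gbc_values)
    return {category: 100.0 * c / n for category, c in counts.items()}
```

For example, `percent_by_category([1.0, 1.0, 0.95, 0.42])` places half of the animals in category "1" and one quarter each in `[0.9, 1.0)` and `[0.4, 0.5)`.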
The power of identifying purebred animals varied with the size of SNP panels. The 1K SNP panel had the highest power for identifying purebred animals in most of the nine breeds, e.g., Angus and Limousin, and the power of identifying purebred animals decreased as the SNP panel size increased (
Among the four admixture models, the three regularized models had higher power in the identification of purebred animals than the non-regularized admixture model. With the 16K panel, for example, the percentage of animals with Angus GBC = 1 was 69.6% with the non-regularized admixture model, and it was substantially higher (94.1–97.3%) with the three regularized models (
The power of identifying purebred animals varied drastically among the nine breeds. The percentage of animals with GBC = 1 was the lowest (47.4–74.4%) in Limousin (
Histogram of the means of estimated GBC for 5,041 Limousin animals, obtained using four statistical models, respectively. Bar plot of the mean GBC across the 10 breeds, which were estimated by ADMIXTURE, ADMIXTURE-L1 (λ = 0.1), ADMIXTURE-MCP (λ = 0.25), and ADMIXTURE-SCAD (λ = 0.25) using the 5K SNP panel. Standard deviations (SD) are labeled on the Limousin bars.
The four admixture models were also used to estimate GBC for the 3,605 Brangus animals. This composite beef breed was developed to utilize the superior traits of Angus and Brahman cattle. For official registration, a Brangus animal is expected to be genetically stabilized at 3/8 Brahman and 5/8 Angus, solid black or red, and polled, and both sire and dam must be recorded with the International Brangus Breeders Association (IBBA). Unlike estimating GBC for a purebred animal, our interest for a composite animal was to know how much of its genome was inherited from each of its ancestral breeds.
With the nine reference populations and the 5K SNP panel, small admixture coefficients showed up for non-ancestral breeds, such as Hereford, Limousin, Shorthorn, and Simmental, in addition to the two large admixture components for the two ancestral breeds (
Percent (%) of animals by categories of estimated GBC obtained using four statistical models in Brangus.
ADMIXTURE | 54.3 (68.3) | 11.9 | 25.1 (31.7) | 6.31 | 71.1 | 6.70 | 28.9 | 6.70 |
ADMIXTURE-L1 | 61.5 (68.2) | 15.6 | 28.6 (31.8) | 12.1 | 77.1 | 8.70 | 22.9 | 8.70 |
ADMIXTURE-MCP | 59.8 (68.1) | 12.9 | 27.9 (31.9) | 9.1 | 74.6 | 7.10 | 25.4 | 7.10 |
ADMIXTURE-SCAD | 59.5 (67.9) | 13.1 | 28.1 (32.1) | 10.4 | 75.3 | 7.50 | 24.7 | 7.50 |
Histogram of the means of estimated GBC for 3,605 Brangus animals (0.625 Angus, 0.375 Brahman), obtained using the four statistical models, respectively. Bar plot of the mean GBC across the ten breeds, which were estimated by ADMIXTURE, ADMIXTURE-L1 (λ = 0.1), ADMIXTURE-MCP (λ = 0.25), and ADMIXTURE-SCAD (λ = 0.25) using the 5K SNP panel. Standard deviations (SD) are labeled on the Angus and Brahman bars.
The estimated Angus composition in these Brangus animals, as obtained using the four models, was presumably higher than the pedigree-expected Angus proportion of 62.5%. There are two possible reasons for the elevated Angus GBC. Firstly, the Brangus have been selected for traits in which Angus has advantages. Hence, the selection, in turn, could have shifted allele frequencies toward those of the Angus origin. Secondly, there was a mixture of UltraBlack animals in this Brangus dataset. A KING-robust principal component analysis (PCA) based on the genotypes of the 3,605 Brangus was conducted to infer the genetic relationships of these Brangus animals using the KING software (Manichaikul et al.,
Population distribution across the first (PC1) and second (PC2) principal components of the genotype data of the Brangus individuals. Animals are labeled based on their Angus percentage of GBC estimated by ADMIXTURE.
Finally, two assumptions under the present models are worth discussion. First, it was assumed that each reference population comprised samples of purebred animals only. This assumption, however, can be violated in reality because a low level of introgression in the reference samples can occur. For example, Brahman cattle carry an average composition of 91% Bos indicus and 9% Bos taurus (O'Brien et al.,
Secondly, the present admixture models assumed that the allele frequencies of the ancestral breeds are known and estimated a priori, which differs from the unsupervised model-based clustering algorithms. The latter were originally conceived not only to estimate ancestry in admixed individuals but also to study the trajectory of divergence between the ancestral populations that produced the empirical data. This is important because modern-day breeds of cattle, especially Bos taurus breeds, were formed quite recently (i.e., on an evolutionary scale) from mixtures of previously geographically isolated lineages that were only moderately divergent (FST < 0.10), and are not necessarily pure, distinct lineages from a population genetics standpoint. Assuming fixed allele frequencies for the ancestral populations ignores the trajectory of their genetic characteristics over time, but it simplifies the computing in practice. This is particularly advantageous with the proposed sparsely regularized admixture models, which are often more computationally intensive than the non-regularized admixture models. Finally, some methods can even accommodate complex admixtures, such as support vector machines (Haasl et al.,
Estimation of GBC for purebred animals is complicated by the presence of small admixture components assigned to non-ancestral breeds due to genomic similarities. Thus, not all purebred animals have 100% GBC for their respective breed categories, leading to an increased false-negative rate for pure-breed identification. Alternatively, a lower cutoff of estimated GBC for purebred animals needs to be used, which, however, is arbitrary. Our results showed that the use of sparse regularization in the admixture models with appropriately chosen values of λ effectively shrank non-ancestral GBC estimates toward zero, thereby reducing the false-negative rate and at the same time increasing the identification power of purebred animals. Of the three sparsely regularized admixture models, the two models with non-convex penalties (ADMIXTURE-MCP and ADMIXTURE-SCAD) outperformed the admixture model with the L1 norm penalty (ADMIXTURE-L1).
The power of breed identification of purebred animals varied with the reference SNP panels used in the non-regularized admixture model. The 1K panel gave the greatest power in most breeds because it had the smallest average LD between SNPs, which approximately satisfied the model assumption of independence between SNPs. Therefore, the likelihood values computed using the 1K panel were more accurate than those computed using the larger panels (5K, 10K, and 16K). Nevertheless, the three regularized admixture models were more robust to the violation of the SNP independence assumption than the non-regularized admixture model when estimating GBC using various SNP panels, because the power of purebred identification with the regularized admixture models decreased at a considerably slower rate than with the non-regularized admixture model as the SNP panel size increased. As a rule of thumb, the recommended GBC cutoff for pure-breed identification is 0.95 for the non-regularized admixture model and between 0.98 and 0.99 for the regularized admixture models, assuming no significant population stratification and no significant genomic correlations between the reference breeds.
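The rule of thumb above can be written as a small helper. This is only an illustrative sketch: the function name and arguments are hypothetical, and the default of 0.98 for regularized models is simply the lower end of the recommended range.

```python
def is_purebred(own_breed_gbc, regularized, cutoff=None):
    """Apply a GBC cutoff for pure-breed identification.

    Defaults follow the rule of thumb in the text: 0.95 for the
    non-regularized model and 0.98 (lower end of 0.98-0.99) for
    regularized models. Pass cutoff explicitly to override.
    """
    if cutoff is None:
        cutoff = 0.98 if regularized else 0.95
    return own_breed_gbc >= cutoff
```

An animal with an estimated own-breed GBC of 0.96 would pass under the non-regularized cutoff but not under the regularized one, which reflects the stricter threshold that the sparser estimates allow.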
For composite animals, the three admixture models with sparse regularization tended to produce larger GBC for these Brangus animals than the non-regularized admixture model, which possibly indicated the presence of estimation bias with the regularized models. While imposing sparse regularization on estimated GBC is favorable for reducing false-negative error rate when identifying purebred animals, it can lead to bias in estimated GBC for crossbred or composite animals, in particular when dynamic segregation was still going on. Hence, the utility of regularized admixture models for estimating GBC in composite animals needs to be taken with caution and the results need to be checked against those obtained using non-regularized admixture models.
Finally, a software package that implements the admixture models with regularization is made available for non-commercial use (The web link will be provided once the paper is accepted).
The supplementary results, four reference SNP panels, namely 1K, 5K, 10K, and 16K (actually 14K after data cleaning), and two example GGP 50K genotype files (each with 1000 animals) are available at the following link:
Ethical review and approval were not required for the study because the genotypes were extracted from the data repositories of Neogen genotyping laboratories. All the cattle samples (hair, blood, and ear tags) used for genotyping were collected following routine procedures for commercial selection purposes.
YW, X-LW, and GR conceived this study, in discussion with ZB, RT, and SB. YW and X-LW drafted and revised the manuscript. YW and ZL conducted the data analysis. All the authors read and approved this manuscript.
X-LW, RT, and SB were employed by the company Neogen. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: