Abstract
Current genome-wide association studies (GWAS) rely on genotype imputation to increase statistical power, improve fine-mapping of association signals, and facilitate meta-analyses. Due to the complex demographic history of Latin America and the lack of balanced representation of Native American genomes in current imputation panels, locally relevant disease variants are likely to be missed, limiting the scope and impact of biomedical research in these populations. Better representation of diversity in genomic databases is therefore a scientific imperative. Here, we expand the 1000 Genomes reference panel (1KGP) with 134 Native American genomes (1KGP + NATS) to assess imputation performance in Latin American individuals of mixed ancestry. Our panel increased the number of SNPs above the GWAS quality threshold, thus improving statistical power for association studies in the region. It also increased imputation accuracy, particularly in low-frequency variants segregating in Native American ancestry tracts. The improvement is subtle but consistent across countries and proportional to the number of genomes added from local source populations. To project the potential improvement with a higher number of reference genomes, we performed simulations and found that at least 3,000 Native American genomes are needed to equal the imputation performance of variants in European ancestry tracts. This reflects the concerning imbalance of diversity in current references and highlights the contribution of our work to reducing it while complementing efforts to improve global equity in genomic research.
Introduction
Over the past years, GWAS have identified thousands of genetic associations with multiple phenotypes (Visscher et al., 2017), revealed targets for potential new drugs, and facilitated disease stratification. However, most GWAS have been performed in populations with European ancestry, and the findings of these large-scale studies have limited portability to other ancestry groups due to population substructure. This is a major limitation for Latin American populations, which are the result of recent admixture primarily between Native American, European, and African populations: only 1.3% of both discovery and replication studies have been performed in these populations. Furthermore, the genetic composition of Latin American populations is heterogeneous both between countries and within countries (Harris et al., 2018). Different demographic histories often lead to different variants being associated with a given phenotype. For example, variants in the SLC16A11 gene have been associated with an increased risk of diabetes in Mexicans and appear to be segregating at low frequency in Latin American populations specifically. Likewise, APOL1 risk variants associated with renal disease in West African populations are also found in the Americas as a result of the Transatlantic slave trade, which differentially shaped the frequency spectrum of disease variants among Afro-descendant Latino populations. If the current bias in catalogs of human variation persists, many population-specific variants will be overlooked, and precision medicine strategies will not benefit all populations equally.
A critical step when performing a GWAS is genotype imputation, which leverages linkage disequilibrium (LD) structure and haplotype sharing to estimate variation not typed on a SNP array based on a reference panel. Genotype imputation increases statistical power, improves fine-mapping of association signals, and facilitates meta-analysis. Currently available imputation panels do not have an explicit representation of Native American genomes. A previous study showed that in Latin American populations, SNPs in chromosomal segments with Native American ancestry have reduced imputation quality compared to those in chromosomal segments of European ancestry. Association signals coming from chromosomal segments with Native American ancestry will therefore be harder to detect, which limits the scope and impact of biomedical research in the region.
Several projects and initiatives around the world are helping to reverse this trend. For example, the Uganda Genome Resource comprises genome-wide data for 6,400 individuals, including a subset of 1,978 whole genomes, which is enabling researchers to explore the genetic substructure of the region, improve imputation in African populations, and foster the discovery of novel association signals. In Latin America, recent sequencing efforts have generated whole-genome data from dozens of Native American genomes, including the Peruvian Genome Project (Harris et al., 2018) and the 12G and 100G-MX Projects from the National Institute of Genomic Medicine (INMEGEN) in Mexico. However, only a subset of the generated data is available to the scientific community given the data sharing mechanisms implemented in each country. An ongoing multi-institutional effort in Mexico, the MX Biobank Project, is generating genome-wide data for more than 6,000 individuals nationwide, including 50 whole genomes of Native American ancestry representing indigenous genetic diversity within Mexico (http://www.mxbiobankproject.org). At a global scale, the value of including diverse populations in disease association research has been well demonstrated by the PAGE study (Wojcik et al., 2019), which combines genome-wide data for 49,839 individuals with diverse ancestries, enabling the discovery of novel association signals for well-studied phenotypes. Here, we combine novel and publicly available data from multiple sources to build a population-specific reference panel of Native American variation aimed at improving imputation performance in Latin American populations by expanding the current and widely used reference of the 1000 Genomes Project (1KGP) (The 1000 Genomes Project Consortium et al., 2015) with 134 Native American genomes.
Using a demographic simulation framework, we also explore the number of additional reference genomes that should be sequenced to bridge the gap in imputation quality between different ancestries. Strengthening these efforts in diverse populations is not only a question of equality in genomics, but it also entails the scientific advantage of furthering our understanding of complex phenotypes in biomedical research.
Materials and Methods
Building a Native American Reference Panel
Our panel consists of 134 Native American individuals broadly distributed across the continent (Figure 1; Supplementary Tables S1, S2). We gathered publicly available whole-genome sequencing (WGS) data from HGDP (61 individuals), SGDP (11 individuals), and INMEGEN (12 individuals). Additionally, we sequenced the whole genomes of 50 Mexican individuals with the highest Native American ancestry (99% on average) from the MX Biobank Project (http://www.mxbiobankproject.org). These were selected to maximize indigenous ancestry and geographical representation across Mexico. Individual genetic ancestry proportions were estimated with ADMIXTURE at K = 3, using Utah residents with Northern and Western European ancestry (CEU), Yoruba in Ibadan, Nigeria (YRI), and the Latin Americans (AMR) of 1KGP as references.
FIGURE 1
To construct the panel, we restricted the datasets to biallelic SNPs with no missing data in any individual within each of the four data sources (Supplementary Table S3). Data processing was done with VCFtools v0.1.17 (
Finally, we phased the data using SHAPEIT2 v2.r837 (
Whole-Genome Sequencing and Variant Calling
Fifty individuals from the MX Biobank Project were sequenced at 40X on Illumina HiSeqX instruments using dual indexed barcodes. The raw reads were aligned to the human genome assembly GRCh37 using BWA v.0.7.17-r1198-dirty (
Creating a SNP Array Subset From WGS Data for Imputation Performance Evaluation
To evaluate the performance of our panel, we used WGS data from the 347 AMR individuals in 1KGP as target individuals for imputation: Puerto Ricans in Puerto Rico (PUR), Peruvians in Lima (PEL), Colombians in Medellín (CLM), and individuals with Mexican ancestry in Los Angeles (MXL). We generated an array dataset by subsetting the AMR genomes to the positions present in the Multi-Ethnic Global Array (MEGA) using VCFtools v0.1.17, and saved the positions removed from the WGS data for imputation validation. Illumina’s MEGA array includes nearly 1.8 M markers (1,779,819) distributed genome-wide and was designed to leverage SNP content from various global sequencing efforts, mostly Phase 3 of the 1000 Genomes Project. To better approximate a real scenario, we unphased the array dataset with PLINK v1.9 (
Local Ancestry Inference
To evaluate performance by ancestry, we inferred local ancestry for the Latin American individuals from 1KGP. We used 70 YRI individuals in 1KGP as the African reference, 70 CEU individuals from 1KGP as the European reference, and 70 Native American individuals from (
Imputation and Imputation Performance
We implemented a leave-one-out strategy for imputation: each target individual was removed from the 1KGP reference before being imputed. We performed imputation with IMPUTE2 for chromosomes 2 and 9. These chromosomes, being one of the largest and of intermediate size, respectively, were selected to ensure a representative subset of variants across the genome while keeping the project within the available computational capacity. We used 1KGP and 1KGP + NATS as reference panels. When using 1KGP as a reference, we used the flag --k_haps 1000, and when using 1KGP + NATS, we used the flags --merge-ref-panels and --k_haps 1250.
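The leave-one-out design can be illustrated with a minimal Python sketch. It only shows how a per-target reference sample list might be derived before imputation; the sample identifiers and the function name `leave_one_out_panels` are hypothetical and are not part of the original pipeline.

```python
def leave_one_out_panels(reference_ids, target_ids):
    """Yield (target, panel) pairs where the panel excludes the target sample."""
    reference_ids = list(reference_ids)
    for target in target_ids:
        # Keep every reference sample except the one being imputed.
        panel = [sample for sample in reference_ids if sample != target]
        yield target, panel


# Hypothetical sample identifiers, for illustration only.
reference = ["REF_A", "REF_B", "TARGET_1", "TARGET_2"]
targets = ["TARGET_1", "TARGET_2"]

panels = dict(leave_one_out_panels(reference, targets))
for target, panel in panels.items():
    print(target, "->", panel)
```

Each target is imputed against a panel that still contains the other target individuals, so only the individual under evaluation is withheld.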
We obtained the imputed dosages with the formula P(Aa) + 2P(aa), where P(Aa) and P(aa) are the imputed probabilities of the heterozygous genotype and of the homozygous genotype for allele a, respectively. We computed the squared Pearson correlation (r2) between the imputed dosages and the real dosages for each individual using R software. Overall imputation accuracy was stratified by minor allele frequency and local ancestry diplotype (AFR_AFR, AFR_EUR, AFR_NAT, EUR_EUR, EUR_NAT, NAT_NAT). We also compared the number of SNPs above the GWAS quality threshold (MAF >= 0.01 and INFO > 0.3) for both reference panels stratified by local ancestry diplotype in the target individuals.
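As an illustration of the accuracy metric, the following Python sketch computes expected dosages from genotype probabilities and the squared Pearson correlation with the true dosages. The analysis itself was done in R; this is an equivalent sketch in Python, and the genotype probabilities, true dosages, and function names are hypothetical.

```python
def expected_dosage(p_het, p_hom_alt):
    """Expected dosage of allele a: P(Aa) + 2 * P(aa)."""
    return p_het + 2.0 * p_hom_alt


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variation
    return (sxy * sxy) / (sxx * syy)


# Hypothetical (P(AA), P(Aa), P(aa)) triplets for five SNPs in one individual.
probabilities = [
    (0.90, 0.10, 0.00),
    (0.10, 0.80, 0.10),
    (0.00, 0.20, 0.80),
    (0.70, 0.30, 0.00),
    (0.05, 0.15, 0.80),
]
true_dosages = [0, 1, 2, 0, 2]  # hypothetical sequenced genotypes

imputed_dosages = [expected_dosage(p[1], p[2]) for p in probabilities]
r2 = pearson_r2(imputed_dosages, true_dosages)
print(round(r2, 3))
```

Using dosages rather than hard genotype calls keeps the uncertainty of each imputed genotype in the accuracy estimate.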
Demographic Simulation
We simulated neutral genetic sequence data under a coalescent model. We used the msprime (
To simulate genotype array data for the target individuals, we downsampled the simulated neutral sequence to match the allele frequency spectrum in European populations of 1KGP and the average distance between SNPs of the MEGA array. We used the European populations in 1KGP to mirror the ascertainment bias towards European ancestry in current array designs. We estimated local ancestry using RFMix for the 300 admixed individuals used as imputation targets. We randomly selected 100 simulated individuals from each ancestral population (African, European, and Asian) as references for the local ancestry inference. Here, again, we used Asians as the closest proxy for Native Americans in the available simulation model.
We conducted imputation with the base reference panel plus a varying number of additional reference genomes (0, 100, 134, 200, 400, 600, 800, 1,000, 1,500, 2,000, and 3,000). Finally, we compared imputation r2 across reference panels, stratified by local ancestry and allele frequency in the target individual genomes.
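The stratified comparison can be sketched as a simple grouping step. The Python example below labels each variant by its local ancestry diplotype and averages r2 per reference panel size and diplotype; the record layout, field names, and r2 values are hypothetical and serve only to show the bookkeeping.

```python
from collections import defaultdict
from statistics import mean


def diplotype_label(ancestry_hap1, ancestry_hap2):
    """Order-independent diplotype label, e.g. ('NAT', 'EUR') -> 'EUR_NAT'."""
    return "_".join(sorted((ancestry_hap1, ancestry_hap2)))


def mean_r2_by_stratum(records):
    """Average r2 per (number of added genomes, ancestry diplotype)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["added_genomes"], diplotype_label(rec["anc1"], rec["anc2"]))
        groups[key].append(rec["r2"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical per-variant accuracy records, for illustration only.
records = [
    {"added_genomes": 0, "anc1": "NAT", "anc2": "NAT", "r2": 0.80},
    {"added_genomes": 0, "anc1": "NAT", "anc2": "EUR", "r2": 0.90},
    {"added_genomes": 0, "anc1": "EUR", "anc2": "NAT", "r2": 0.92},
    {"added_genomes": 134, "anc1": "NAT", "anc2": "NAT", "r2": 0.84},
]

summary = mean_r2_by_stratum(records)
for key in sorted(summary):
    print(key, round(summary[key], 3))
```

Sorting the two haplotype ancestries before joining them yields the same six labels used above (AFR_AFR, AFR_EUR, AFR_NAT, EUR_EUR, EUR_NAT, NAT_NAT) regardless of haplotype order.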
Results
The Native American Reference Panel NATS
We built a Native American reference panel (NATS) representing indigenous populations across Latin America. The panel consists of publicly available data [HGDP (
Imputation Performance of the NATS Reference Panel
To assess the impact of our panel on imputation performance, we imputed the AMR individuals (from Colombia, Peru, Puerto Rico, and Mexico in 1KGP) at SNPs not found on the MEGA array using a leave-one-out strategy, with either 1KGP or 1KGP + NATS as reference panels (Materials and Methods). We chose the MEGA array because it was specifically designed to capture global variation better. We compared the mean number of SNPs above the standard quality threshold for human genetic studies (MAF >= 1% and INFO >= 0.3) using the two reference panels. Using our NATS panel increased the number of SNPs above the quality threshold in all four populations (Table 1). The magnitude of the increase is correlated with the individual’s proportion of Native American ancestry (Supplementary Figure S1). Furthermore, the majority of these SNPs fall into diploid European tracts of the genome (Supplementary Figure S2) regardless of the ancestry composition of each population and of which reference panel was used for imputation. This is because even though the 1KGP has as many African individuals as Europeans, European ancestry is more predominant in AMR individuals.
TABLE 1
| Population | SNPs above quality threshold (1KGP) | SNPs above quality threshold (1KGP + NATS) | Increase of SNPs using 1KGP + NATS | Average proportion of Nat. American ancestry |
|---|---|---|---|---|
| Peru (PEL) | 244,818 | 248,087 | 3,269 (p-value = 2.03e-49) | 0.70 |
| Mexico (MXL) | 265,619 | 268,254 | 2,635 (p-value = 6.5e-31) | 0.42 |
| Colombia (CLM) | 279,828 | 281,911 | 2,163 (p-value = 8.3e-47) | 0.18 |
| Puerto Rico (PUR) | 291,035 | 292,734 | 1,699 (p-value = 2.9e-67) | 0.06 |
SNPs above the standard quality threshold using both panels after imputing missing variants. We show the average number of SNPs with MAF >= 0.01 and INFO >= 0.3 using both reference panels and the overall proportion of Native American ancestry of the population. p-value was calculated with a two-tailed paired t-test. The average number of SNPs with MAF <0.01 and INFO >0.3 for both panels is shown in Supplementary Table S4.
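The comparison summarized in Table 1 can be sketched in Python as follows: count the SNPs passing the quality threshold, then compute a paired t statistic on per-individual counts obtained with the two panels. The thresholds follow the table caption (MAF >= 0.01 and INFO >= 0.3); all data values and function names are hypothetical. Only the statistic and its degrees of freedom are computed here; a two-tailed p-value would additionally require the Student t distribution, for example via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def passes_threshold(maf, info, min_maf=0.01, min_info=0.3):
    """True if a SNP meets the quality threshold used in the table caption."""
    return maf >= min_maf and info >= min_info


def count_passing(snps):
    """Count (maf, info) pairs that pass the quality threshold."""
    return sum(1 for maf, info in snps if passes_threshold(maf, info))


def paired_t_statistic(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical (MAF, INFO) values for four imputed SNPs.
example_snps = [(0.02, 0.90), (0.005, 0.95), (0.10, 0.20), (0.01, 0.30)]
n_passing = count_passing(example_snps)

# Hypothetical per-individual counts of passing SNPs under each panel.
counts_1kgp = [100, 120, 90, 110]
counts_1kgp_nats = [103, 124, 92, 115]
t_stat, dof = paired_t_statistic(counts_1kgp, counts_1kgp_nats)
print(n_passing, round(t_stat, 3), dof)
```

A paired test is the natural choice here because the same individuals are imputed with both panels, so each individual serves as its own control.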
To determine imputation accuracy, we computed the correlation between the real allele dosages and the imputed dosages (Materials and Methods). We checked imputation accuracy in 1KGP admixed individuals trimmed down to SNP array positions stratified by diploid ancestry (Figure 2A). Overall, imputation accuracy is worse in AMR populations with the highest proportion of Native American ancestry (Supplementary Figure S3). As previously reported (
FIGURE 2

Imputation accuracy by local ancestry and population using both reference panels. (A) Imputation accuracy of the four AMR populations stratified by diploid local ancestry for the MEGA array using 1KGP as reference panel. (B) Imputation accuracy for the Native and European diploid ancestries using 1KGP and 1KGP + NATS as reference panel focusing on rare alleles. Imputation accuracy was calculated with the Pearson squared correlation between imputed and real allele dosages.
Predicting Imputation Improvement From Additional Native American Genomes Using Simulations
Our results show that after adding 134 Native American genomes to the most widely used reference panel of global variation, we observe a promising trend of improvement. Still, we do not come close enough to equal the imputation performance for other better represented ancestries. The question remains of how many additional genomes are still needed to close the gap. To explore this, we employed demographic simulations using stdpopsim (
We confirmed the ancestry proportions of our simulated data using ADMIXTURE (Supplementary Figure S5). To replicate the imputation pipeline, we created a genotype array dataset for the simulated target individuals by matching the mean distance between markers and the European-population allele frequencies of SNPs in the MEGA array, to mirror the bias in standard arrays (Materials and Methods and Supplementary Figure S6). Then, we imputed the 300 target individuals with the base reference plus either 0, 100, 134 (to mirror the sample size in NATS), 200, 400, 600, 800, 1,000, 1,500, 2,000, or 3,000 Native Americans. We were able to recover roughly the same pattern of imputation accuracy (Supplementary Figure S7): the less represented an ancestry was in the base reference, the lower its accuracy, with Native American as the worst-performing ancestry. One caveat is that the best-performing ancestry is African, contrary to what we see in the real data (Figure 2A). This is likely because the 661 African individuals are from the population that contributed to the admixed population in the simulation, which is not the case for real data. Different African ancestries contributed more or less to different Latin American populations (
When incorporating additional Native American genomes, imputation accuracy increased only in tracts with some Native American ancestry (Supplementary Figure S8). Furthermore, for imputation accuracy in Native American diploid ancestry tracts to equal that in European diploid ancestry tracts, 3,000 Native American genomes were needed for variants with frequency >= 2%, while 1,500 were enough for variants with frequency < 2% (Figure 3A). To ask whether the increase in imputation accuracy in the Native American diploid ancestry reaches a saturation point, we compared the difference between accuracy with the base reference and with each extended reference. As expected, the behavior differs for common (frequency > 0.05), low-frequency (frequency < 0.05 and > 0.01), and rare (frequency < 0.01) variants (Figure 3B). None of them seems to reach a saturation point at 3,000 newly added Native American genomes. The steepest increase is achieved for the rare alleles, whereas for the common alleles the increase is slower. This agrees with the previous result, in which more genomes were needed to match the Native American imputation accuracy to the European one for common variants. It is also evident that common variants are closest to saturation in accuracy, as their values were already close to one (Figure 3A).
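The two comparisons in this paragraph, the gain in accuracy over the base reference and the panel size at which Native American accuracy reaches the European level, can be sketched in Python. All r2 values below are hypothetical and are not results from the simulations. The handling of the exact bin boundaries (0.01 and 0.05) is also an assumption, since the text defines the frequency classes with strict inequalities.

```python
def frequency_class(maf):
    """Assign a frequency class; treatment of exact boundaries is an assumption."""
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "low"
    return "common"


def gain_over_base(r2_by_panel):
    """Difference in r2 between each panel size and the base reference (size 0)."""
    base = r2_by_panel[0]
    return {size: r2_by_panel[size] - base for size in sorted(r2_by_panel)}


def smallest_panel_matching(r2_by_panel, target_r2):
    """Smallest number of added genomes whose r2 reaches target_r2, else None."""
    for size in sorted(r2_by_panel):
        if r2_by_panel[size] >= target_r2:
            return size
    return None


# Hypothetical mean r2 in NAT_NAT tracts by number of added genomes.
native_r2 = {0: 0.80, 100: 0.83, 1000: 0.90, 3000: 0.95}
european_r2 = 0.94  # hypothetical mean r2 in EUR_EUR tracts

gains = gain_over_base(native_r2)
matching_size = smallest_panel_matching(native_r2, european_r2)
print(gains)
print(matching_size)
```

A saturation point would appear as gains that stop growing between consecutive panel sizes; in this hypothetical series the gain is still increasing at the largest panel.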
FIGURE 3

Predicted imputation accuracy according to demographic simulations. (A) Imputation accuracy in the diploid Native American (solid colored lines) and diploid European (thick dashed line) ancestries using different simulated reference panels of incremental sizes. Ref 0 stands for the base reference (as it has 0 additional reference genomes). Given the available demographic model (
Discussion
GWAS requires large sample sizes to detect genetic associations to complex phenotypes, and more so as the field moves toward studying rare variants (
One major caveat of our panel is that it does not comprehensively reflect the indigenous genetic variation across the Americas. Most of the data come from individuals from Mexico. Furthermore, the 134 genomes added are only a small increment (5%) with respect to 1KGP. The contribution of this panel is small in comparison to projects like the Uganda Genome Resource that sequenced 1,978 novel genomes (
Our panel increased the number of SNPs above the standard quality threshold for human genetic studies, increasing statistical power in the four AMR populations of 1KGP. This mirrors what has been achieved by other studies in other populations (
We were able to increase imputation accuracy in rare variants of Native American diploid ancestry in the MXL population. This was not the case for the other three populations. We expected that, since PEL is the population with the highest Native American ancestry proportion, it would also be the population most benefited by the use of our extended panel. However, there can be high levels of genetic differentiation among Native American groups, even if they are geographically close (
These results are important with regard to not only GWAS but also their further applications. For instance, one of the applications of GWAS summary statistics is Polygenic Risk Scores (PRS). A PRS estimates the genetic “risk” of an individual for a particular phenotype by summing the risk alleles present in that individual (Torkamani et al., 2018). To be accurate, PRS require summary statistics calculated in a population as close as possible to the target individuals. Previous studies have shown that this is not a trivial task (Tropf et al., 2017;
The question of how much data are needed remained. To answer it, we employed demographic simulations. We replicated the same pattern of imputation accuracy of our data and of previous studies (
Statements
Data availability statement
The newly generated data presented in the study are deposited in the European Genome-phenome Archive (EGA) repository under accession number EGAD00001008354 (https://ega-archive.org/datasets/EGAD00001008354).
Ethics statement
The studies involving human participants were reviewed and approved by the Research Ethics Committee (Approval CI-1479) of the National Institute of Public Health, Mexico. The patients/participants provided their written informed consent to participate in this study.
Author contributions
AJ-K, AM-E, AYC, AC, SF-V, AJM, MS, AH, and DD-V designed the study. AM-E, LG-G, LF-R, LC-H, TT-L, HM-M, CA-S, and NM-R selected and provided DNA samples from the MX Biobank Project. AM-E and AH sequenced the data. HK, NK, SS, MT, CQ-C, and MP-M performed the whole-genome variant calling and curated the data. AJ-K, AYC, and SM-M analyzed the data. AJ-K and AM-E drafted the manuscript, with input from LG-G, CQ-C, LF-R, LC-H, TT-L, HM-M, CA-S, AH-C, MS, SM-M, NM-R, and GD-S. All authors read and approved the manuscript.
Funding
This work was supported by “The Mexican Biobank Project: Building Capacity for Big Data Science in Medical Genomics in Admixed Populations”, a binational initiative between Mexico and the UK co-funded by CONACYT (Grant number FONCICYT/50/2016), and The Newton Fund through The Medical Research Council (Grant number MR/N028937/1) awarded to AME and AVSH. It was also supported by the International Center for Genetic Engineering and Biotechnology (ICGEB, Italy) grant number CRP/MEX20-01. MS was partially supported by the Chicago Fellows program of the University of Chicago. DODV is supported by the UC MEXUS CONACYT collaborative program (Grant number CN-19-29), and the UNAM PAPIIT funding program (Grant number IA200620).
Acknowledgments
We thank the participants of the Encuesta Nacional de Salud, 2000 (2000 National Health Survey, ENSA 2000), conducted in Mexico nationwide by the Secretaría de Salud (Health Secretariat) and the Instituto Nacional de Salud Pública (National Institute of Public Health, INSP). We are grateful to Mitzi Flores and Adriana Garmendia for project management support and to Carlos Conde, Victor Guerrero Lemus, Armando Mendez Herrera, Cruz Portugal García, Ma. Luisa Ordóñez-Sánchez, Rosario Rodriguez-Guillen, and Manuel Velazquez Mesa for biobank maintenance and sample preparation. We also thank Mary Ortega, Cecilia Gutiérrez, and Sara García for technical assistance, Jacob Cervantes for IT support, and Aaron Ragsdale for comments on earlier versions of the manuscript.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.719791/full#supplementary-material
References
1
Abul-Husn, N. S., and Kenny, E. E. (2019). Personalized Medicine and the Power of Electronic Health Records. Cell 177 (1), 58–69. doi:10.1016/j.cell.2019.02.039
2
Adrion, J. R., Cole, C. B., Dukler, N., Galloway, J. G., Gladstein, A. L., Gower, G., et al. (2020). A Community-Maintained Standard Library of Population Genetic Models. eLife 9, e54967. doi:10.7554/eLife.54967
3
Agrawal, N., and Brown, M. A. (2014). Genetic Associations and Functional Characterization of M1 Aminopeptidases and Immune-Mediated Diseases. Genes Immun. 15 (8), 521–527. doi:10.1038/gene.2014.46
4
Aguilar-Ordoñez, I., Pérez-Villatoro, F., García-Ortiz, H., Barajas-Olmos, F., Ballesteros-Villascán, J., González-Buenfil, R., et al. (2021). Whole Genome Variation in 27 Mexican Indigenous Populations, Demographic and Biomedical Insights. PLoS One 16 (4), e0249773. doi:10.1371/journal.pone.0249773
5
Ahmad, M., Sinha, A., Ghosh, S., Kumar, V., Davila, S., Yajnik, C. S., et al. (2017). Inclusion of Population-Specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy. Sci. Rep. 7 (1), 6733. doi:10.1038/s41598-017-06905-6
6
Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Res. 19 (9), 1655–1664. doi:10.1101/gr.094052.109
7
Amendola, L. M., Berg, J. S., Horowitz, C. R., Angelo, F., Bensen, J. T., Biesecker, B. B., et al. (2018). The Clinical Sequencing Evidence-Generating Research Consortium: Integrating Genomic Sequencing in Diverse and Medically Underserved Populations. Am. J. Hum. Genet. 103 (3), 319–327. doi:10.1016/j.ajhg.2018.08.007
8
Berg, J. J., Harpak, A., Sinnott-Armstrong, N., Joergensen, A. M., Mostafavi, H., Field, Y., et al. (2019). Reduced Signal for Polygenic Adaptation of Height in UK Biobank. eLife 8, e39725. doi:10.7554/eLife.39725
9
Bergström, A., McCarthy, S. A., Hui, R., Almarri, M. A., Ayub, Q., Danecek, P., et al. (2020). Insights into Human Genetic Variation and Population History from 929 Diverse Genomes. Science 367 (6484), eaay5012. doi:10.1126/science.aay5012
10
Biddanda, A., Rice, D. P., and Novembre, J. (2020). A Variant-Centric Perspective on Geographic Patterns of Human Allele Frequency Variation. eLife 9, e60107. doi:10.7554/eLife.60107
11
Browning, S. R., Browning, B. L., Daviglus, M. L., Durazo-Arvizu, R. A., Schneiderman, N., Kaplan, R. C., et al. (2018). Ancestry-Specific Recent Effective Population Size in the Americas. PLoS Genet. 14 (5), e1007385. doi:10.1371/journal.pgen.1007385
12
Chacón-Duque, J.-C., Adhikari, K., Fuentes-Guajardo, M., Mendoza-Revilla, J., Acuña-Alonzo, V., Rodrigo, B., et al. (2018). Latin Americans Show Wide-Spread Converso Ancestry and Imprint of Local Native Ancestry on Physical Appearance. Nat. Commun. 9 (1), 5388. doi:10.1038/s41467-018-07748-z
13
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., and Lee, J. J. (2015). Second-Generation PLINK: Rising to the Challenge of Larger and Richer Datasets. GigaScience 4, 7. doi:10.1186/s13742-015-0047-8
14
Chatterjee, N., Shi, J., and García-Closas, M. (2016). Developing and Evaluating Polygenic Risk Prediction Models for Stratified Disease Prevention. Nat. Rev. Genet. 17 (7), 392–406. doi:10.1038/nrg.2016.27
15
Cirulli, E. T., White, S., Read, R. W., Elhanan, G., Metcalf, W. J., Tanudjaja, F., et al. (2020). Genome-Wide Rare Variant Analysis for Thousands of Phenotypes in over 70,000 Exomes from Two Cohorts. Nat. Commun. 11 (1), 542. doi:10.1038/s41467-020-14288-y
16
Collins, R. (2012). What Makes UK Biobank Special? Lancet 379 (9822), 1173–1174. doi:10.1016/s0140-6736(12)60404-8
17
Danecek, P., Adam, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011). The Variant Call Format and VCFtools. Bioinformatics 27 (15), 2156–2158. doi:10.1093/bioinformatics/btr330
18
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., et al. (2021). Twelve Years of SAMtools and BCFtools. GigaScience 10 (2), giab008. doi:10.1093/gigascience/giab008
19
Delaneau, O., Marchini, J., and 1000 Genomes Project Consortium (2014). Integrating Sequence and Array Data to Create an Improved 1000 Genomes Project Haplotype Reference Panel. Nat. Commun. 5, 3934. doi:10.1038/ncomms4934
20
Duncan, L., Shen, H., Gelaye, B., Meijsen, J., Ressler, K., Feldman, M., et al. (2019). Analysis of Polygenic Risk Score Usage and Performance in Diverse Human Populations. Nat. Commun. 10 (1), 3328. doi:10.1038/s41467-019-11112-0
21
Faust, G. G., and Hall, I. M. (2014). SAMBLASTER: Fast Duplicate Marking and Structural Variant Read Extraction. Bioinformatics 30 (17), 2503–2505. doi:10.1093/bioinformatics/btu314
22
Flannick, J., Thorleifsson, G., Beer, N. L., Jacobs, S. B., Grarup, N., Burtt, N. P., et al. (2014). Loss-of-Function Mutations in SLC30A8 Protect against Type 2 Diabetes. Nat. Genet. 46 (4), 357–363. doi:10.1038/ng.2915
23
GenomeAsia 100K Consortium (2019). The GenomeAsia 100K Project Enables Genetic Discoveries across Asia. Nature 576 (7785), 106–111. doi:10.1038/s41586-019-1793-z
24
Gurdasani, D., Carstensen, T., Tekola-Ayele, F., Pagani, L., Tachmazidou, I., Hatzikotoulas, K., et al. (2015). The African Genome Variation Project Shapes Medical Genetics in Africa. Nature 517 (7534), 327–332. doi:10.1038/nature13997
25
Gurdasani, D., Carstensen, T., Fatumo, S., Chen, G., Franklin, C. S., Prado-Martinez, J., et al. (2019). Uganda Genome Resource Enables Insights into Population History and Genomic Discovery in Africa. Cell 179 (4), 984–1002.e36. doi:10.1016/j.cell.2019.10.004
26
Harris, D. N., Song, W., Shetty, A. C., Levano, K. S., et al. (2018). Evolutionary Genomic Dynamics of Peruvians Before, During, and after the Inca Empire. Proc. Natl. Acad. Sci. U.S.A. 115 (28), E6526–E6535.
27
Howie, B., Marchini, J., and Stephens, M. (2011). Genotype Imputation with Thousands of Genomes. G3 Genes|Genomes|Genetics 1 (6), 457–470. doi:10.1534/g3.111.001198
28
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J., and Abecasis, G. R. (2012). Fast and Accurate Genotype Imputation in Genome-Wide Association Studies through Pre-phasing. Nat. Genet. 44 (8), 955–959. doi:10.1038/ng.2354
29
Kehdy, F. S. G., Gouveia, M. H., Machado, M., Magalhães, W. C. S., Horimoto, A. R., Horta, B. L., et al. (2015). Origin and Dynamics of Admixture in Brazilians and its Effect on the Pattern of Deleterious Mutations. Proc. Natl. Acad. Sci. U.S.A. 112 (28), 8696–8701. doi:10.1073/pnas.1504447112
30
Kelleher, J., Etheridge, A. M., and McVean, G. (2016). Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes. PLoS Comput. Biol. 12 (5), e1004842. doi:10.1371/journal.pcbi.1004842
31
Li, H., and Durbin, R. (2009). Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform. Bioinformatics 25 (14), 1754–1760. doi:10.1093/bioinformatics/btp324
32
Li, H. (2011). A Statistical Framework for SNP Calling, Mutation Discovery, Association Mapping and Population Genetical Parameter Estimation from Sequencing Data. Bioinformatics 27 (21), 2987–2993. doi:10.1093/bioinformatics/btr509
33
MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., et al. (2017). The New NHGRI-EBI Catalog of Published Genome-Wide Association Studies (GWAS Catalog). Nucleic Acids Res. 45 (D1), D896–D901. doi:10.1093/nar/gkw1133
34
Magalhães, W. C. S., Araujo, N. M., Leal, T. P., Araujo, G. S., Viriato, P. J. S., Kehdy, F. S., et al. (2018). EPIGEN-Brazil Initiative Resources: A Latin American Imputation Panel and the Scientific Workflow. Genome Res. 28 (7), 1090–1095. doi:10.1101/gr.225458.117
35
Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., et al. (2016). The Simons Genome Diversity Project: 300 Genomes from 142 Diverse Populations. Nature 538 (7624), 201–206. doi:10.1038/nature18964
36
Maples, B. K., Gravel, S., Kenny, E. E., and Bustamante, C. D. (2013). RFMix: A Discriminative Modeling Approach for Rapid and Robust Local-Ancestry Inference. Am. J. Hum. Genet. 93 (2), 278–288. doi:10.1016/j.ajhg.2013.06.020
37
Marchini, J., and Howie, B. (2010). Genotype Imputation for Genome-Wide Association Studies. Nat. Rev. Genet. 11 (7), 499–511. doi:10.1038/nrg2796
38
Marchini, J., Howie, B., Myers, S., McVean, G., and Donnelly, P. (2007). A New Multipoint Method for Genome-Wide Association Studies by Imputation of Genotypes. Nat. Genet. 39 (7), 906–913. doi:10.1038/ng2088
39
Martin, A. R., Gignoux, C. R., Walters, R. K., Wojcik, G. L., Neale, B. M., Gravel, S., et al. (2017). Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Am. J. Hum. Genet. 107 (4), 788–789. doi:10.1016/j.ajhg.2017.03.004
40
Martin, A. R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B. M., and Daly, M. J. (2019). Clinical Use of Current Polygenic Risk Scores May Exacerbate Health Disparities. Nat. Genet. 51 (4), 584–591. doi:10.1038/s41588-019-0379-x
41
McKennaA.HannaM.BanksE.SivachenkoA.CibulskisK.KernytskyA.et al (2010). The Genome Analysis Toolkit: A MapReduce Framework for Analyzing Next-Generation DNA Sequencing Data. Genome Res.20 (9), 1297–1303. 10.1101/gr.107524.110
42
MichelettiS. J.BrycK.Ancona EsselmannS. G.FreymanW. A.MorenoM. E.PoznikG. D.et al (2020). Genetic Consequences of the Transatlantic Slave Trade in the Americas. Am. J. Hum. Genet.107 (2), 265–277. 10.1016/j.ajhg.2020.06.012
43
MillsM. C.Rahal.C. (2019). A Scientometric Review of Genome-Wide Association Studies. Commun. Biol.2, 9. 10.1038/s42003-018-0261-x
44
MinikelE. V.KarczewskiK. J.MartinH. C.CummingsB. B.WhiffinN.RhodesD.et al (2020). Evaluating Drug Targets through Human Loss-Of-Function Genetic Variation. Nature581 (7809), 459–464. 10.1038/s41586-020-2267-z
Moreno-Estrada, A., Gignoux, C. R., Fernández-López, J. C., Zakharia, F., Sikora, M., Contreras, A. V., et al. (2014). The Genetics of Mexico Recapitulates Native American Substructure and Affects Biomedical Traits. Science 344 (6189), 1280–1285. doi: 10.1126/science.1251688
Mostafavi, H., Harpak, A., Agarwal, I., Conley, D., Pritchard, J. K., Przeworski, M. (2020). Variable Prediction Accuracy of Polygenic Scores within an Ancestry Group. eLife 9, e48376. doi: 10.7554/eLife.48376
Mulder, N., Adebamowo, S. N., de Vries, J., Matimba, A., Olowoyo, P., Ramsay, M., et al. (2018). H3Africa: Current Perspectives. Pharmacogenomics Pers. Med. 11, 59–66. doi: 10.2147/pgpm.s141546
Nadkarni, G. N., Gignoux, C. R., Sorokin, E. P., Rahman, R., Barnes, K. C., Wassel, C. L. (2018). Worldwide Frequencies of APOL1 Renal Risk Variants. New Engl. J. Med. 379 (26), 2571–2572. doi: 10.1056/nejmc1800748
Nelson, M. R., Tipney, H., Painter, J. L., Shen, J., Nicoletti, P., Shen, Y., et al. (2015). The Support of Human Genetic Evidence for Approved Drug Indications. Nat. Genet. 47 (8), 856–860. doi: 10.1038/ng.3314
Popejoy, A. B., Fullerton, S. M. (2016). Genomics Is Failing on Diversity. Nature 538 (7624), 161–164. doi: 10.1038/538161a
Romero-Hidalgo, S., Ochoa-Leyva, A., Garcíarrubio, A., Acuña-Alonzo, V., Antúnez-Argüelles, E., Balcazar-Quintero, M., et al. (2017). Demographic History and Biologically Relevant Genetic Variation of Native Mexicans Inferred from Whole-Genome Sequencing. Nat. Commun. 8 (1), 1005. doi: 10.1038/s41467-017-01194-z
SIGMA Type 2 Diabetes Consortium, Williams, A. L., Jacobs, S. B. R., Moreno-Macías, H., Huerta-Chagoya, A., Churchhouse, C., Márquez-Luna, C., et al. (2014). Sequence Variants in SLC16A11 Are a Common Risk Factor for Type 2 Diabetes in Mexico. Nature 506 (7486), 97–101. doi: 10.1038/nature12828
Sirugo, G., Williams, S. M., Tishkoff, S. A. (2019). The Missing Diversity in Human Genetic Studies. Cell 177 (4), 1080. doi: 10.1016/j.cell.2019.04.032
Soares-Souza, G., Borda, V., Kehdy, F., Tarazona-Santos, E. (2018). Admixture, Genetics and Complex Diseases in Latin Americans and US Hispanics. Curr. Genet. Med. Rep. 6 (4), 208–223. doi: 10.1007/s40142-018-0151-z
Sohail, M., Maier, R. M., Ganna, A., Bloemendal, A., Martin, A. R., Turchin, M. C., et al. (2019). Polygenic Adaptation on Height Is Overestimated Due to Uncorrected Stratification in Genome-Wide Association Studies. eLife 8, e39702. doi: 10.7554/eLife.39702
Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., Prins, P. (2015). Sambamba: Fast Processing of NGS Alignment Formats. Bioinformatics 31 (12), 2032–2034. doi: 10.1093/bioinformatics/btv098
The 1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., et al. (2015). A Global Reference for Human Genetic Variation. Nature 526 (7571), 68–74. doi: 10.1038/nature15393
Torkamani, A., Wineinger, N. E., Topol, E. J. (2018). The Personal and Clinical Utility of Polygenic Risk Scores. Nat. Rev. Genet. 19 (9), 581–590. doi: 10.1038/s41576-018-0018-x
Tropf, F. C., Lee, S. H., Verweij, R. M., Stulp, G., van der Most, P. J., de Vlaming, R., et al. (2017). Hidden Heritability Due to Heterogeneity across Seven Populations. Nat. Hum. Behav. 1 (10), 757–765. doi: 10.1038/s41562-017-0195-1
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., et al. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101 (1), 5–22. doi: 10.1016/j.ajhg.2017.06.005
Wojcik, G. L., Graff, M., Nishimura, K. K., Tao, R., Haessler, J., Gignoux, C. R., et al. (2019). Genetic Analyses of Diverse Populations Improves Discovery for Complex Traits. Nature 570 (7762), 514–518. doi: 10.1038/s41586-019-1310-4
Summary
Keywords
Imputation, reference panels, GWAS, Native American ancestry, Latin Americans, underrepresented populations
Citation
Jiménez-Kaufmann A, Chong AY, Cortés A, Quinto-Cortés CD, Fernandez-Valverde SL, Ferreyra-Reyes L, Cruz-Hervert LP, Medina-Muñoz SG, Sohail M, Palma-Martinez MJ, Delgado-Sánchez G, Mongua-Rodríguez N, Mentzer AJ, Hill AVS, Moreno-Macías H, Huerta-Chagoya A, Aguilar-Salinas CA, Torres M, Kim HL, Kalsi N, Schuster SC, Tusié-Luna T, Del-Vecchyo DO, García-García L and Moreno-Estrada A (2022) Imputation Performance in Latin American Populations: Improving Rare Variants Representation With the Inclusion of Native American Genomes. Front. Genet. 12:719791. doi: 10.3389/fgene.2021.719791
Received
03 June 2021
Accepted
01 November 2021
Published
03 January 2022
Volume
12 - 2021
Edited by
Tony Merriman, University of Otago, New Zealand
Reviewed by
Inaho Danjoh, Tohoku University, Japan
Mohamad Saad, Qatar Computing Research Institute, Qatar
Copyright
© 2022 Jiménez-Kaufmann, Chong, Cortés, Quinto-Cortés, Fernandez-Valverde, Ferreyra-Reyes, Cruz-Hervert, Medina-Muñoz, Sohail, Palma-Martinez, Delgado-Sánchez, Mongua-Rodríguez, Mentzer, Hill, Moreno-Macías, Huerta-Chagoya, Aguilar-Salinas, Torres, Kim, Kalsi, Schuster, Tusié-Luna, Del-Vecchyo, García-García and Moreno-Estrada.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Andrés Moreno-Estrada, andres.moreno@cinvestav.mx; Lourdes García-García, garcigarml@gmail.com
This article was submitted to Human and Medical Genomics, a section of the journal Frontiers in Genetics
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.