Re-analysis of Whole Genome Sequence Data From 279 Ancient Eurasians Reveals Substantial Ancestral Heterogeneity

Supervised clustering or projection analysis is a staple technique in population genetic analysis. The utility of this technique depends critically on the reference panel. The most commonly used reference panel in the analysis of ancient DNA to date is based on the Human Origins array. We previously described a larger reference panel that captures more ancestries on the global level. Here, I reanalyzed DNA data from 279 ancient Eurasians using our reference panel. I found substantially more ancestral heterogeneity than has been reported. Reanalysis provides evidence against a resurgence of Western hunter-gatherer ancestry in the Middle to Late Neolithic and evidence for a common ancestor of farmers characterized by Western Asian ancestry, a transition of the spread of agriculture from demic to cultural diffusion, at least two migrations between the Pontic-Caspian steppes and Bronze Age Europe, and a sub-Saharan African component in Natufians that localizes to present-day southern Ethiopia.


INTRODUCTION
Before the technological advances that permitted ancient DNA studies, historical inferences were made using present-day samples in conjunction with well-established theory and techniques in population genetics and phylogenetic reconstruction. Inferences regarding population structure have been based on the popular software STRUCTURE (Pritchard et al., 2000), ADMIXTURE (Alexander et al., 2009), and variants thereof. The basic idea is to perform model-based estimation of ancestries from multi-locus genotypes. Having learned ancestry-specific allele frequencies in unsupervised clustering analysis from a data set, it is computationally efficient to project new samples onto the ancestries in order to learn about population structure in the new samples. The utility and quality of projection analysis, or supervised clustering analysis, strongly depends on the reference panel of learned ancestries.
Commercially available and widely used genotyping arrays designed for studies of human medical genetics have unknown patterns of marker ascertainment bias. This ascertainment bias reflects differences in the allele frequency spectra in non-European populations compared to European populations as well as enrichment of array content for ancestry informative markers and other content specifically of medical value. The Human Origins genotyping array is a collection of 13 panels, each designed with known ascertainment, for studies of human population genetics (Patterson et al., 2012). In the analysis of ancient DNA, the reference panel most widely used to date comprises fewer than 3,000 individuals genotyped using the Human Origins array (Lazaridis et al., 2014; Günther et al., 2015; Mathieson et al., 2015; Cassidy et al., 2016), although some data are not freely available.
One consequence of using a single reference panel is consistency within the ancient DNA field. Unfortunately, the labels for ancestries used in these papers lack overlap with labels used by other researchers for ancestries in present-day peoples. Furthermore, none of the results in the ancient DNA papers has been replicated using a second reference panel. We previously combined completely public domain data to generate a reference panel comprising 5,966 individuals from 282 samples, from which we estimated 21 ancestries (Baker et al., 2017). Our reference panel covers more ancestral diversity than the Human Origins-based panel, with no evidence of marker ascertainment bias (Baker et al., 2017). Using this reference panel, I investigated (1) the reproducibility of findings previously reported in ancient DNA studies and (2) the impact of broader coverage of ancestries. After projecting 279 ancient Eurasians onto our reference panel, I reached a distinct series of conclusions regarding the genetic history of Europe and Western Asia.

Ancient DNA Data
I retrieved and integrated data from 279 ancient Eurasians from 49 samples (Keller et al., 2012; Lazaridis et al., 2014, 2016; Skoglund et al., 2014; Allentoft et al., 2015; Günther et al., 2015; Jones et al., 2015; Mathieson et al., 2015; Broushaki et al., 2016; Hofmanová et al., 2016). Data were provided either as called genotypes in VCF files or aligned sequences in BAM files. To generate the most probable genotypes from aligned sequences, BAM files were processed using the program bam2mpg (available at https://github.com/nhansen/bam2mpg) with a quality filter of 20 and the reference human genome sequence hs37d5 in FASTA format, with results saved in VCF files. For each of the 279 individuals, sampling locations are shown in Figure S1 and metadata are provided in Table S1.

Supervised Clustering
The global reference panel was previously described (Baker et al., 2017). Briefly, this panel consists of ancestry-specific allele frequencies for 21 ancestries and 19,075 SNPs, generated from unsupervised clustering of 5,966 individuals from 282 global samples. The present-day geographic distributions of the 21 ancestries are shown in Figure S2.
Genotype data were extracted from VCF files using VCFtools version 0.1.14 (Danecek et al., 2011). Supervised clustering was performed by projecting the ancient Eurasians onto the reference panel using ADMIXTURE version 1.3 (Alexander et al., 2009). Standard errors were obtained from 200 bootstrap replicates. Inverse variance weighting was used to combine ancestry proportions across individuals within samples, accounting for both within- and between-individual variance. Assuming approximate normality, sparsity was induced by zeroing out any ancestry for which the 95% confidence interval included 0. Finally, the significant ancestry proportions were renormalized to sum to 1.
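The combining, sparsifying, and renormalizing steps described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the text does not state how between-individual variance was estimated, so the DerSimonian-Laird estimator used here is an assumption, as are the function names, the small floor on standard errors, and the normal critical value of 1.96.

```python
import math

def combine_ancestry(props, ses, floor=1e-6):
    """Inverse-variance-weighted mean of one ancestry across individuals.

    props: per-individual ancestry proportions
    ses:   per-individual bootstrap standard errors
    Between-individual variance is estimated with the DerSimonian-Laird
    method (an assumption; the source does not name the estimator).
    """
    ses = [max(s, floor) for s in ses]          # avoid division by zero
    k = len(props)
    w = [1.0 / s ** 2 for s in ses]             # within-individual weights
    sw = sum(w)
    fixed = sum(wi * p for wi, p in zip(w, props)) / sw
    q = sum(wi * (p - fixed) ** 2 for wi, p in zip(w, props))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Weights that account for within- and between-individual variance.
    w_star = [1.0 / (s ** 2 + tau2) for s in ses]
    sws = sum(w_star)
    mean = sum(wi * p for wi, p in zip(w_star, props)) / sws
    return mean, math.sqrt(1.0 / sws)

def sparsify_and_renormalize(means, ses, z=1.96):
    """Zero out ancestries whose 95% CI includes 0, then rescale to sum to 1."""
    kept = {a: m for a, m in means.items() if m - z * ses[a] > 0}
    total = sum(kept.values())
    return {a: (kept.get(a, 0.0) / total if total > 0 else 0.0) for a in means}
```

In use, `combine_ancestry` would be called once per ancestry within a sample, and the resulting means and standard errors passed together to `sparsify_and_renormalize`.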

Ethics
This project was determined to be excluded from IRB Review by the National Institutes of Health Office of Human Subjects Research Protections, Protocol 17-NHGRI-00282.

Eneolithic to Middle Bronze Age Steppe Peoples
Third, I considered the Eneolithic to Middle Bronze Age steppe peoples (Figure 1C). I included the Potapovka sample here because the sum of absolute differences in ancestry was greater post-Potapovka than post-Poltavka. In these steppe peoples, 76.2% of Y DNA haplogroups were R1b and 86.7% of mitochondrial haplogroups were H, J, T, or U (Table 2).

Late Bronze to Iron Age Steppe Peoples
The 21 Late Bronze to Iron Age steppe individuals (Sintashta, Andronovo, Srubnaya, and Scythia) had 50.9% Northern European, 23.9% Southern European, 17.7% Southern Asian, 3.4% Kalash, 2.5% Western Asian, and 1.6% Amerindian ancestries (Figure 1F and Table 1). Thus, post-Potapovka, population change in the steppes involved an increase of 16.0% Southern European and 1.7% Western Asian ancestries (a ratio of 9.4) and a decrease of 10.1% Southern Asian, 3.8% Northern European, and 2.6% Amerindian ancestries. This change does not fit with gene flow of people like the Early Neolithic peoples (Mathieson et al., 2015), who had a ratio of Southern European to Western Asian ancestry of 1.5; the source is more consistent with European Copper or Bronze Age peoples, in whom the ratios were 6.6 and 6.8, respectively. All nine Y DNA haplogroups were R1a, while 85.7% of mitochondrial haplogroups were H, V, U, K, J, or T (Table 2).
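The ratio argument in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; ranking candidate sources by distance on a log scale is an illustrative choice made here, not a procedure described in the study, and the variable names are hypothetical.

```python
import math

# Post-Potapovka increases in steppe ancestry, as stated in the text
# (percentage points).
delta_southern_european = 16.0
delta_western_asian = 1.7
observed_ratio = delta_southern_european / delta_western_asian

# Southern European to Western Asian ratios of candidate sources,
# as stated in the text.
candidate_ratios = {
    "Early Neolithic": 1.5,
    "European Copper Age": 6.6,
    "European Bronze Age": 6.8,
}

def log_distance(a, b):
    """Absolute difference of two ratios on a log scale (an assumption)."""
    return abs(math.log(a / b))

# Candidates ordered from closest to farthest from the observed ratio.
ranked = sorted(
    candidate_ratios,
    key=lambda name: log_distance(observed_ratio, candidate_ratios[name]),
)

print(round(observed_ratio, 1))
print(ranked)
```

Under this comparison the Copper and Bronze Age ratios sit much closer to the observed ratio than the Early Neolithic ratio does, which is the contrast the paragraph draws.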

Western Asian Peoples
The two Georgian hunter-gatherers did not group with the European hunter-gatherers (Figure 1G). The Georgian hunter-gatherers averaged 45.8% Western Asian and 37.7% Southern Asian ancestries, with only 4.9% Northern European and no Southern European ancestries (Table 1). Both Y DNA haplogroups were J (Table 2). The mitochondrial haplogroups were H and K; neither matriline was observed in the European hunter-gatherers (Table 2). The ancient Iranians were characterized through the Late Neolithic period by predominantly Southern Asian ancestry (Figure 1G and Table 1). The proportion of Western Asian ancestry doubled through the Iron Age. The ancient Armenians resembled the Georgian hunter-gatherers in having a mixture of Western Asian and Southern Asian ancestries (Figure 1G and Table 1). In the ancient Armenians relative to the Georgian hunter-gatherers, Northern European, Southern European, and Arabian ancestries increased in the Copper Age.
The Natufian sample consisted of 61.2% Arabian, 21.2% Northern African, 10.9% Western Asian, and 6.8% Omotic ancestry (Figure 1G and Table 1). The transition in the Levant from the Epipaleolithic to the Neolithic period involved an increase of Arabian ancestry at the expense of Northern African and Omotic ancestries. The transition from the Neolithic period to the Bronze Age involved the acquisition of principally Western Asian ancestry, with smaller contributions of Southern European and Southern Asian ancestries.

DISCUSSION
Using a large, global reference panel, I found more population structure than previously reported among 279 ancient Eurasians. All samples showed multiple autosomal ancestries, Y DNA haplogroups, and mitochondrial haplogroups. Given such a large amount of ancestral heterogeneity, previous estimates of allele frequencies, including claims of natural selection, may have been confounded by this unrecognized population structure.
In contrast to previous reports, Western and Eastern hunter-gatherers were not homogeneous for different ancestries (Lazaridis et al., 2016) nor were they separated (Gallego-Llorente et al., 2016). Amerindian, Circumpolar, and Southern Asian ancestries existed in Eastern and, to a lesser extent, Scandinavian hunter-gatherers thousands of years before the European Bronze Age and in higher proportions than in Bronze Age steppe populations. Amerindian and Circumpolar ancestries were absent from Europeans from the Early Neolithic through the Bronze Age. These results are consistent with a shared relationship predating the Neolithic.
The transition from Eneolithic to Early and Middle Bronze Age steppe peoples involved increases in Southern Asian and Southern European ancestries that do not fit with a European hunter-gatherer source (Mathieson et al., 2015) and more broadly do not fit with any of the samples, suggesting an unknown source population. Currently, Southern Asian ancestry co-localizes with Y DNA haplogroup L and correlates with Indo-Iranian languages (Baker et al., 2017). Although there are no L haplogroups in any of these Early to Middle Bronze Age steppe individuals, the correlation with Indo-Iranian languages strengthens the connection between Early to Middle Bronze Age steppe peoples and the introduction of Indo-European languages into Europe.
Northern and Southern European ancestries were primarily associated with Y haplogroup I from before the Neolithic until the Copper Age. A low level of Y DNA haplogroup R was present in Europe prior to and during the Neolithic. In the Bronze Age, an increase in the proportion of Y haplogroup R while the distribution of mitochondrial haplogroups remained essentially unchanged is consistent with male-biased gene flow (Goldberg et al., 2017). Collectively, the findings suggest two male-biased migrations from the steppes to Europe, rather than one prolonged event (Goldberg et al., 2017). The first event was associated with Eneolithic to Middle Bronze Age steppe peoples located north and east of the Black Sea and characterized by R1b; this incoming ancestry is associated with present-day Southern European ancestry. The second event was associated with Late Bronze Age steppe peoples located north and east of the Caspian Sea and characterized by R1a; this incoming ancestry is associated with present-day Northern European ancestry. There was also gene flow from Europe to the steppes associated with the transition from the Middle to Late Bronze Age.
In the Copper Age, Northern African ancestry increased while Arabian ancestry decreased, possibly indicating entry into Europe from northwest Africa rather than northeast Africa. The ratio of 5.4-fold more Southern European than Northern European ancestries and the presence of Northern African ancestry acquired from the Early Neolithic to the Copper Age are inconsistent with a resurgence of peoples related to Western hunter-gatherers, given that Western hunter-gatherers had 1.6-fold more Northern European than Southern European ancestry and no Northern African ancestry. Instead, this ancestral profile is suggestive of an expansion of peoples from Southern Europe resembling those from the Remedello culture.
In the ancient Iranians, the proportion of Western Asian ancestry doubled through the Iron Age, suggesting gene flow from the Caucasus rather than the Levant (Lazaridis et al., 2016), while smaller amounts of Arabian and South Indian ancestries suggest gene flow from the west and the east, respectively. In the ancient Armenians, Northern European, Southern European, and Arabian ancestries increased in the Copper Age, again suggesting gene flow from multiple directions. In the Levant, Lazaridis et al. (2016) suggested that the transition from the Neolithic period to the Bronze Age resulted from admixture from people resembling Chalcolithic Iranians. This putative source is unlikely because none of the ancient Iranian samples had Southern European ancestry; a Caucasian source, such as the Chalcolithic or Early Bronze Age Armenians, provides a better fit.
The Early Neolithic samples, i.e., early farmers, qualitatively differed from hunter-gatherers by harboring more diverse sets of Y DNA haplogroups and mitochondrial lineages. This result suggests that the initial spread of agriculture occurred by demic diffusion involving both males and females. The early European farmers had no Southern Asian ancestry, which does not support an origin in the eastern part of Western Asia, i.e., present-day Iran. However, ancient Western Asian peoples and early European farmers shared Western Asian ancestry, and thus were not genetically dissimilar (Gallego-Llorente et al., 2016). The increase of Western Asian ancestry in the Bronze Age Levant and throughout Neolithic Western Asia is consistent with demic diffusion of agriculture via a single origin, with the original people characterized by Western Asian ancestry. Even if farming was introduced into Europe by such individuals, subsequent migrations of semi-nomadic pastoralists from the steppes suggest that the ultimate spread of agriculture occurred by cultural, not demic, diffusion.
Previously, no significant sharing of ancestral components with sub-Saharan African populations was found to accompany the presence of Y haplogroup E1b1b1b2 (Lazaridis et al., 2016). E1b1b1b1a-M81, not E1b1b1b2-Z830, is presently common among Berbers in North Africa (Arredi et al., 2004; Trombetta et al., 2015). E1b1b1b1a-M81 has a time to most recent common ancestor of only 2,300 (95% confidence interval [1,900, 2,700]) years before present (Urasin, 2017) and therefore was not prevalent in Northern African ancestry during the Epipaleolithic. Ancestry shared by Omotic-speaking peoples is found predominantly in present-day southern Ethiopia and is associated with haplogroup E, thus revealing a plausible source.
Using TreeMix (Pickrell and Pritchard, 2012) to reconstruct migration graphs from ancestries inferred by ADMIXTURE, we previously observed that Southern European and Northern European ancestries clustered with 77% probability and that Southern European and Arabian ancestries clustered with 23% probability (Shriner et al., 2014). We hypothesized that the primary mode reflected the relationship between R1a, characteristic of present-day Northern European ancestry, and R1b, characteristic of present-day Southern European ancestry. We further hypothesized that the secondary mode reflected the relationship between I2, present in lower frequencies in present-day Southern European ancestry, and J (more precisely, J1), characteristic of Arabian ancestry. The current findings support both hypotheses. The fact that Southern European ancestry experienced a replacement of haplogroup I by haplogroup R and yet was inferred by ADMIXTURE to be one ancestry, rather than two distinct ancestries, serves as a strong caveat in the interpretation of ancestries, while TreeMix could detect both stages of Southern European ancestry.
All ancestries in our reference panel were estimated from present-day individuals and therefore reflect present-day ancestry-specific allele frequencies. As these allele frequencies change through evolutionary time, it is possible to relate ancestries phylogenetically and make inferences about the common ancestors of ancestries. Projecting ancient individuals onto present-day ancestries will lead to increasingly incorrect inference as the age of the ancient individual increases. Thus, this issue is a bigger problem for Ice Age Europeans than for Bronze Age Europeans. This problem could be solved if allele frequencies for each of the ancestors of the present-day ancestries were known.
In summary, rather than three (Lazaridis et al., 2014) or four (Jones et al., 2015;Lazaridis et al., 2016) ancestral populations, I found considerably more population structure across 279 ancient Eurasians, involving a total of 18 autosomal ancestries, 13 Y DNA haplogroups, and 14 mitochondrial haplogroups, such that no sample was ancestrally homogeneous. Even if ancestries are inferred from extant individuals, ancestry analysis can provide historical insight in the absence of ancient DNA samples. Perhaps most importantly, using a consistent, unified nomenclature will enhance research of both ancient and present-day peoples.

AUTHOR CONTRIBUTIONS
DS designed the study, performed the research, interpreted the results, and wrote the manuscript.