The Geographic Origins of Ethnic Groups in the Indian Subcontinent: Exploring Ancient Footprints with Y-DNA Haplogroups

Several studies have evaluated the movements of large populations to the Indian subcontinent; however, the ancient geographic origins of smaller ethnic communities are not clear. Although historians have attempted to identify the origins of some ethnic groups, the evidence is typically anecdotal and based upon what others have written before. In this study, recent developments in DNA science were assessed to provide a contemporary perspective by analyzing the Y chromosome haplogroups of some key ethnic groups and tracing their ancient geographical origins from genetic markers on the Y-DNA haplogroup tree. A total of 2,504 Y-DNA haplotypes, representing 50 different ethnic groups in the Indian subcontinent, were analyzed. The results identified 14 different haplogroups with 14 geographic origins for these people. Moreover, every ethnic group had representation in more than one haplogroup, indicating multiple geographic origins for these communities. The results also showed that despite their varied languages and cultural differences, most ethnic groups shared some common ancestors because of admixture in the past. These findings provide new insights into the ancient geographic origins of ethnic groups in the Indian subcontinent. With about 2,000 other ethnic groups and tribes in the region, it is expected that more scientific discoveries will follow, providing insights into how, from where, and when the ancestors of these people arrived in the subcontinent to create so many different communities.


INTRODUCTION

First Arrivals
Homo sapiens, or modern humans, spread from Africa to Asia and Europe in several migratory movements (Stringer, 2000; Walter et al., 2000). Based on the geographical distances between populations and measures of population differentiation derived from quantitative cranial datasets, multiple dispersals took place between ∼37 and 135 kya (thousand years ago) (Reyes-Centeno et al., 2015). The initial migrants traveled north and crossed into the Arabian Peninsula. Early H. sapiens fossils outside Africa were discovered in the prehistoric caves of Qafzeh and Skhul, in present-day Israel. New mass-spectrometric techniques have dated these fossils to ∼80-106 kya (McDermott et al., 1993). Some migrants traveled further north into central Asia, which became the staging ground for migrations into Siberia and Europe.
The Indian subcontinent, comprising India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, and Myanmar, became one of the first geographical regions of the world to be populated by H. sapiens (Dennell and Petraglia, 2012; Blinkhorn et al., 2013). One group from the Arabian Peninsula took the coastal route through India, Myanmar, and Malaysia to Australia. A study (Elhaik et al., 2013) conducted by the National Geographic Society's Genographic Project (Behar et al., 2007) found that people living in a village near Madurai in South India carried the same rare genetic markers as some Australian aborigines and people living in Africa (Wells, 2007). The findings showed a link between the three continents and confirmed that the people in Australia and India with this genetic marker were likely descendants of the original coastal migrants from Africa. More migrations out of Africa followed.
The migration through India was interrupted about 75 kya by the eruption of Mount Toba in Sumatra, Indonesia, which is recorded as one of the largest volcanic eruptions in this planet's history (Blinkhorn et al., 2014), resulting in an extended volcanic winter and ice age (Rampino and Self, 1993; Huang et al., 2001; Robock et al., 2009). Michael Petraglia and his team of archeologists discovered stone tools at Jwalapuram in Andhra Pradesh, South India, above and below a thick layer of ash from the Toba eruption (Petraglia et al., 2007). These tools match those used in Africa during the same period and suggest the presence of modern humans in India at the time of the Toba event. A more recent study disputes this interpretation and contends that the toolmakers were a prehuman species (Mellars et al., 2013). After the climate warmed, new migrations out of Africa from ∼50 kya populated India with large numbers of humans who later became known as Dravidians.

Archaeological and Historical Perspectives
A recent archeological discovery was the Homo sapiens balangodensis (Balangoda man) in Sri Lanka, dated to ∼37 kya (Tan, 2012). Other findings include the prehistoric rock shelters of Bhimbetka near Bhopal in Madhya Pradesh, India, which date back to 30,000 BCE. These rock shelters served as habitation sites during the lower Paleolithic period (Anon, 2017). There are more than 700 caves with more than 400 paintings carved in stone, which makes them one of the oldest known rock art sites in the world. Other archeological discoveries include the sites of the Indus Valley civilization in northwest India and Pakistan, which are dated ∼8-9 kya (Khandekar, 2012). One of the largest Indus Valley sites was recently discovered at Rakhigarhi, which is located about 160 km from New Delhi and is dated ∼7 kya (Subramanian and Khan, 2016). The people of the Indus Valley were known as the earliest agriculturists in South Asia (Harris, 1996).
Because writing is a relatively recent development, there is no written record of ancient Indian history, and no reliable written history of the Indian subcontinent exists before Alexander the Great's campaign in India in 327 BCE (Smith, 1921). As a result, the deep ancient origins of the founder populations of the Indian subcontinent have remained ambiguous for a long time. In more recent times, a series of migrations and invasions, both peaceful and violent, from adjoining areas became a recurring theme in the history of the Indian subcontinent. The different ethnicities of people that arrived and settled included Afghans, Arabs, Armenians, Aryans, Chinese, Greeks, Huns, Iranians, Mongols, Persians, Scythians, Syrians, Tajiks, Turks, Uzbeks, and others (Ahloowalia, 2009). The subcontinent has been aptly described as an "ethnological museum."

Tracing Human Origins
Y chromosome (Y-DNA) and mitochondrial DNA (MT-DNA) studies have been used to support ideas about modern human origins. These DNA technologies exploit two types of genetic markers: the short tandem repeats (STRs), and single nucleotide polymorphisms (SNPs). The STRs are found on the Y chromosome (Y-STRs) and used exclusively for tracing male lines of heredity. The SNPs are found on the Y chromosome and in MT-DNA. They are used to trace male and female lines of heredity. The result of the test is a set of numbers, referred to as the haplotype, that represents the allele values of DYS markers (D for DNA, Y for chromosome, and S for segment) on a portion of the DNA. The haplotype is used to identify the haplogroup of an individual. Thus, the haplogroup represents a group of people who have inherited common genetic characteristics from the same most recent common ancestor (MRCA) going back several thousand years. All humans belong to haplogroups which are designated according to their Y-DNA and MT-DNA.
The nonrecombining portion of the human Y chromosome is paternally inherited. This chromosome passes from father to son and is essentially unchanged; however, occasionally random small changes, known as polymorphisms, occur. These polymorphisms serve as beacons or markers and can be mapped. Correct interpretation of these changes in the Y chromosome can improve our understanding of temporal and spatial aspects of human history. Thus, the Y chromosome haplogroup, which is a population group descended from the MRCA, can be used as a valuable tool to trace the paternal line of the individual (Jobling and Tyler-Smith, 2003).
Y-DNA tests are available only for men. Short tandem repeats (STRs) or single nucleotide polymorphisms (SNPs) on the Y chromosome are assessed. Because Y-DNA haplogroups are closely linked to geography and populations, they are important genetic indicators for tracing paternal lineages and their ancient origins. This study relied on the Y-DNA haplogroup as the primary gauge for exploring the deep ancestry and geographical origins of the MRCAs.
Recent developments in DNA science were assessed to provide a contemporary perspective on the ancient geographic origins of 50 key ethnic groups of the Indian subcontinent. After identifying the Y haplogroups of these ethnic groups, the ancient geographical origins were ascertained from genetic markers in the Y-DNA Haplogroup Tree and published sources. The ancient origins of the ethnic groups were traced to 14 different geographical areas of the world. DNA science thus yielded a new assessment of the ancient genetic origins of these ethnic groups.

Sample Dataset
A dataset of 2,504 Y chromosome profiles of 50 ethnic groups in the Indian subcontinent was compiled from eight different sources (Table 1). These included the Genographic Project database (Behar et al., 2007; Genographic, 2016), with permission of the National Geographic Society, and seven published sources (Sengupta et al., 2006; Nagy et al., 2007; Zhao et al., 2008; Giroti and Talwar, 2010; Nair et al., 2011; Chennakrishnaiah et al., 2013; Lee et al., 2014). The dataset represented 50 geographically diversified ethnic groups of the subcontinent, of which 39 groups were in India, nine were in Pakistan, and two were in Bangladesh.
The haplogroups of 2,191 or 88% of the profiles in the dataset were predetermined at source based on examination of SNPs on the Y chromosome with actual DNA samples of men. These were collected from the Genographic Project database and published sources. For the remaining 313, or about 12% of the profiles in the dataset, the allele frequencies were collected from published material, and the haplogroups were identified with Whit Athey's Haplogroup Predictor software (Athey, 2006).
All haplogroups, both those predetermined at source and those identified with the software, were merged and sorted in a database according to their ethnic groups. Only the predominant top-level haplogroups were identified (the subclades or subhaplogroups were not used).

Haplogroup Prediction Software
Several software programs are available that use mathematical calculations to predict haplogroups from Y-STR profiles. A study of a software tool, Haplogroup Classifier, developed at the University of Arizona showed that by using machine learning algorithms and data derived from a set of Y-linked STRs, it was possible to assign Y chromosome haplogroups to individual samples with a high degree of accuracy (Schlecht et al., 2008). Another software tool, yHaplo, was developed at 23andMe, a DNA testing company, to enable researchers to identify the Y chromosome haplogroups of men in genetic samples. The software has been tested on more than 600,000 samples of men in the 23andMe database (Poznik, 2016). For this study, Whit Athey's online Haplogroup Predictor software (http://www.hprg.com/hapest5/), which uses Y-STR values with a Bayesian allele-frequency approach (Athey, 2005, 2006), was used. Whit Athey's software offers fast and easy prediction of a Y chromosome haplogroup from Y-STR values. The latest version of the program (10 Dec 2012) has adopted the 111-marker set of Family Tree DNA as the standard. Of the 86 markers used in the previous version of the program, only DYS508 is not included in the 111-marker set. A "batch" version of the software is now available for application to large numbers of haplotypes.
The Bayesian approach used in the software considers the frequency of each haplogroup in the geographic region where the haplotype originated. These frequencies are called the "prior probabilities" or "priors," and they differ from one geographical area to another (for example, Northwest Europe and South Asia). The software provides an option to select the desired geographical area for the analyses. The current options available are Northwest Europe, East Europe, Mediterranean, and Equal Priors. After the area selection is made, the markers are entered in an online form. The results provide "goodness-of-fit" scores for haplogroups, and the probabilities for each score. If a haplogroup gets a probability of 100%, it means that the haplotype most likely only exists in that haplogroup. Typically, the results produce more than one haplogroup for a haplotype. For this study, 9-17 Y-STR markers were used for each profile, and the haplogroup with the highest probability was selected. The software was deployed for only 313 profiles in the total dataset.
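The role of the priors described above can be illustrated with a minimal sketch. It is a generic naive-Bayes calculation, not Athey's implementation: the allele-frequency tables, the two-marker haplotype, the `floor` value, and the function name `predict_haplogroup` are all hypothetical and chosen only to show how a posterior probability combines a regional prior with per-marker allele frequencies.

```python
from math import prod

# HYPOTHETICAL allele frequencies P(allele | haplogroup) for two Y-STR markers.
# The numbers are invented for illustration; they are not real population data.
ALLELE_FREQ = {
    "R": {"DYS19": {14: 0.6, 15: 0.3, 16: 0.1}, "DYS390": {23: 0.1, 24: 0.5, 25: 0.4}},
    "H": {"DYS19": {14: 0.1, 15: 0.7, 16: 0.2}, "DYS390": {22: 0.6, 23: 0.3, 24: 0.1}},
    "J": {"DYS19": {14: 0.5, 15: 0.4, 16: 0.1}, "DYS390": {22: 0.1, 23: 0.6, 24: 0.3}},
}


def predict_haplogroup(haplotype, priors, freq=ALLELE_FREQ, floor=1e-3):
    """Return posterior probabilities P(haplogroup | haplotype).

    haplotype: dict mapping marker name -> observed allele (repeat count).
    priors:    dict mapping haplogroup -> prior probability for the region.
    floor:     small probability used for alleles missing from a table
               (an assumption of this sketch).
    """
    scores = {}
    for hg, prior in priors.items():
        # Markers are treated as independent: multiply per-marker frequencies.
        likelihood = prod(
            freq[hg][marker].get(allele, floor)
            for marker, allele in haplotype.items()
        )
        scores[hg] = prior * likelihood
    total = sum(scores.values())
    return {hg: score / total for hg, score in scores.items()}


haplotype = {"DYS19": 14, "DYS390": 24}
equal_priors = {"R": 1 / 3, "H": 1 / 3, "J": 1 / 3}
skewed_priors = {"R": 0.1, "H": 0.1, "J": 0.8}  # hypothetical regional priors

for name, priors in (("equal", equal_priors), ("skewed", skewed_priors)):
    posterior = predict_haplogroup(haplotype, priors)
    best = max(posterior, key=posterior.get)
    print(name, best, {hg: round(p, 3) for hg, p in posterior.items()})
```

With equal priors the sketch favors R for this haplotype, while the skewed priors shift the call to J, which is why the choice of geographical area matters when the marker set is small.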

The Y-DNA Haplogroup Tree
The geographic origins of a Y chromosome haplogroup for males can be deciphered from the phylogenetic tree of mankind, or the Y-DNA Haplogroup Tree, maintained by the International Society of Genetic Genealogy (ISOGG, 2016). The haplogroups contain many branches called subhaplogroups or subclades. The top-level haplogroups are expressed as letters (A, B, C, etc.). Their subhaplogroups or subclades are expressed as letters and numbers (G2, R1b1, E3b1b, etc.). The markers on the phylogenetic tree provide pieces of evidence regarding the date and geographical origin of the MRCA in the distant past. The geographic origins of the 14 haplogroups identified from the dataset were deciphered from the phylogenetic tree and other published sources.
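Because only top-level haplogroups were used, a subclade label has to be reduced to its leading letter. The following is a minimal sketch of that step, assuming labels follow the letter-first convention shown above (G2, R1b1, E3b1b); the function name `top_level` is hypothetical.

```python
def top_level(haplogroup: str) -> str:
    """Reduce a haplogroup or subclade label to its top-level letter.

    Assumes the label starts with the top-level letter, as in the
    examples 'G2', 'R1b1', and 'E3b1b'.
    """
    label = haplogroup.strip()
    if not label or not label[0].isalpha():
        raise ValueError(f"not a haplogroup label: {haplogroup!r}")
    return label[0].upper()


for subclade in ("G2", "R1b1", "E3b1b"):
    print(subclade, "->", top_level(subclade))
```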

RESULTS
The data revealed that 14 different haplogroups representing 14 different geographic origins were present in the 50 ethnic groups used in this study (Table 3), confirming multiple lines of ancestry and geographic origins. Every ethnic group had members that belonged to more than one haplogroup, indicating that they had different lines of ancestors. There was no ethnic group in these analyses that could trace the genetic ancestry of all its members to a single MRCA. For example, members of the large Brahmin ethnic group belonged to 11 different haplogroups, indicating 11 different lines of ancestors. Similarly, the Malayali and Nair groups had members in 10 different haplogroups, indicating at least 10 different ancestral lines.
Some groups had few ancestral lines. The tribal Ho and Mizo groups had members in only two haplogroups, with O being the predominant one, indicating that there may be one major line of ancestry. Similarly, the Malanis, who live in the small hermit village of Malana in the Himalayas with a population of only about 1,100 people, primarily belonged to only two predominant haplogroups, J and R (Giroti and Talwar, 2010).
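The per-group counts reported above amount to a simple tabulation of profiles by ethnic group and top-level haplogroup. The sketch below shows one way to do this; the records are hypothetical placeholders, not rows from the study's dataset, and the function names are assumptions.

```python
from collections import Counter, defaultdict

# HYPOTHETICAL records of (ethnic group, top-level haplogroup).
RECORDS = [
    ("Brahmin", "R"), ("Brahmin", "R"), ("Brahmin", "H"), ("Brahmin", "J"),
    ("Ho", "O"), ("Ho", "O"), ("Ho", "H"),
    ("Mizo", "O"), ("Mizo", "O"), ("Mizo", "O"), ("Mizo", "R"),
]


def haplogroups_by_group(records):
    """Count profiles per haplogroup within each ethnic group."""
    table = defaultdict(Counter)
    for group, haplogroup in records:
        table[group][haplogroup] += 1
    return table


def summarize(table):
    """Report how many haplogroups each group has and which one predominates."""
    return {
        group: {
            "n_haplogroups": len(counts),
            "predominant": counts.most_common(1)[0][0],
        }
        for group, counts in table.items()
    }


for group, info in summarize(haplogroups_by_group(RECORDS)).items():
    print(group, info)
```

A group with more than one haplogroup in such a table cannot trace all of its members to a single MRCA, which is the pattern described for every group in the dataset.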
Although there were 14 haplogroups in the total dataset, about 90% of the people belonged to seven haplogroups: F, G, H, J, L, O, and R. About 77% of the people belonged to the four largest haplogroups R, H, L, and J. These haplogroups are described below.

Dataset | Details | References
The National Geographic Society's Genographic Project | The Genographic Project is studying the genetic signatures of ancient human migrations and creating a database of yDNA and mtDNA profiles. Currently, there are over 800,000 participants from over 140 countries. | Genographic, 2016
The Ethnic Groups of South Asia | The study covered a high-resolution assessment (69 informative Y-chromosome binary markers and 10 microsatellite markers) of a large set of representative ethnic groups of South Asia. This included 728 samples from India representing 36 populations, with 17 tribal populations, from six geographic regions and different social and linguistic categories, and 176 samples from Pakistan representing eight populations. | Sengupta et al., 2006
The Origin of Romanies | The haplotype frequencies for 11 Y-STR markers in a Romani population (n = 63) from Slovakia, Jats of Haryana (n = 84), and Jat Sikhs (n = 80) from India were assessed. | Nagy et al., 2007
Paternal Lineages among North Indians | A total of 32 Y-chromosomal markers in 560 North Indian males collected from three higher caste groups (Brahmins, Chaturvedis, and Bhargavas) and two Muslim groups (Shia and Sunni) were genotyped. | Zhao et al., 2008

Haplogroup R (38.5%)
This is one of the largest haplogroups in India and Pakistan. This is also the largest haplogroup in the dataset used in this study. It originated in north Asia about 27,000 years ago. It is one of the most common haplogroups in Europe, with its branches reaching 80 percent of the population in some regions (Eupedia, 2017). One branch is believed to have originated in the Kurgan culture, known to be the first speakers of the Indo-European languages and responsible for the domestication of the horse (Smolenyak and Turner, 2004). From somewhere in central Asia, some descendants of the man carrying the M207 mutation on the Y chromosome headed south to arrive in India about 10,000 years ago (Wells, 2007).

Haplogroup H (16.1%)
This is an old haplogroup with a large representation in the Indian subcontinent. It can be referred to as the Indian haplogroup. Its ancestors came from the Middle East or south central Asia, and its marker M69 originated in western India about 30,000 years ago (Wells, 2007). This group is considered part of a second wave of migrations to the Indian subcontinent. The Romany people, also known as gypsies and believed to have originated in India, belong to a subclade of this haplogroup (ISOGG, 2008).

Haplogroup L (11.2%)
This haplogroup is present in the Indian population at an overall frequency of about 7-15% (Basu et al., 2003;Cordaux et al., 2004). Genetic studies indicate that this may be one of the original haplogroups of the creators of Indus Valley Civilization (McElreavey and Quintana-Murci, 2005;Sengupta et al., 2006). It has a frequency of about 28% in western Pakistan and Baluchistan, from where the agricultural creators of this civilization emerged (Qamar et al., 2002). The origins of this haplogroup can be traced to marker M11, and the rugged and mountainous Pamir Knot region in Tajikistan (Wells, 2007), which is also the home of the Bactria-Margiana Archaeological Complex that represents the Oxus civilization of around 4000 BCE (Wood, 2007).

Haplogroup J (11.1%)
The ancestor carrying the M304 mutation was born around 15,000 years ago in the Middle East area known as the Fertile Crescent, comprising Israel, the West Bank, Jordan, Lebanon, Syria, and Iraq. There is a dominant Arabic lineage. This group and its subclades are found predominantly around the coast of the Mediterranean, the Middle East, North Africa, and Ethiopia. Middle Eastern traders brought this genetic marker to the Indian subcontinent (Kerchner, 2013).

Geographic Origins
A phylogenetic tree (without branches) showing the 14 top-level Y-DNA haplogroups, and their markers, found in the ethnic groups used in this study appears in Figure 1.
The geographic origins of the 14 different haplogroups were ascertained from the phylogenetic tree of mankind maintained by the International Society of Genetic Genealogy (ISOGG, 2016) and published sources. They are summarized in Table 4.

DYS385a | The order of DYS385a may be reversed. Its sequence is referred to as the Kittler order.
DYS385b | The order of DYS385b may be reversed. Its sequence is referred to as the Kittler order. | [GAAA]n | 7-28 | 0.21 | AC022486; n = 11 repeats | Z93950; has 10 repeats | DYS385 II
DYS389I | DYS389 is a multi-copy marker, and includes DYS389i and DYS389ii. DYS389ii refers to the total length of DYS389. Therefore, when there is a one-step mutation at DYS389i, it will also appear in DYS389ii.

Sample Size
In statistical sampling, the required sample size increases at a diminishing rate as the population grows, and remains relatively constant once it reaches about 380. A sample of about 384 is generally representative of a population of one million or more (Krejcie and Morgan, 1970). Therefore, to ascertain a representative distribution of haplogroups in any large ethnic community, the sample size should ideally be 380, and preferably larger. The samples available for the 50 ethnic groups used in this study were smaller than the ideal size, and ranged from 6 for the Assamese in India to 288 for the Pathans in Pakistan. Although the sample for each ethnic group revealed key haplogroups, it did not represent a statistically representative distribution for the total population of the ethnic group. Larger samples for these ethnic groups are likely to reveal a few additional haplogroups and provide a more complete picture for each ethnic group.
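The plateau near 384 follows from the sample-size formula published by Krejcie and Morgan (1970), s = X²NP(1 − P) / (d²(N − 1) + X²P(1 − P)). The sketch below evaluates it with the values used in their table (X² = 3.841 for one degree of freedom at 95% confidence, P = 0.5, d = 0.05); the function name and the rounding to the nearest integer are choices made for this illustration.

```python
def required_sample_size(population, chi_sq=3.841, proportion=0.5, margin=0.05):
    """Krejcie and Morgan (1970) sample size for a finite population.

    s = X^2 * N * P * (1 - P) / (d^2 * (N - 1) + X^2 * P * (1 - P))
    """
    variance_term = chi_sq * proportion * (1 - proportion)
    numerator = variance_term * population
    denominator = margin ** 2 * (population - 1) + variance_term
    return round(numerator / denominator)


# The required sample grows quickly for small populations, then levels off.
for n in (100, 1_000, 10_000, 1_000_000):
    print(f"population {n:>9,}: sample {required_sample_size(n)}")
```

Under these assumptions, the largest sample in this study (288 Pathans) still falls short of the plateau value for a large population.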

Potential Errors in Identifying Haplogroups
Because of the need for precision in matters relating to criminal and civil laws, the forensic genetics community is generally not in favor of determining haplogroups with STR profiles. It is held that matching STR haplotypes are not always identical by descent; they may instead be identical by state and rooted in different haplogroups.
A study that used STR profiles of 119 males in Argentina to determine haplogroups with two software programs (Whit Athey's Haplogroup Predictor, used in this study, and a Haplogroup Classifier developed at the University of Arizona) showed that the results were not totally accurate (Muzzio et al., 2011). Another study of 165 males in Nicaragua showed that Athey's Haplogroup Predictor produced accurate results for 95.2% of the sample, but 4.8% of the results were inaccurate (Núñez et al., 2012). For greater reliability in identifying Y chromosomal haplogroups, the forensic community's preferred method is to analyze single nucleotide polymorphisms (SNPs) on the Y chromosome in the lab with actual DNA samples.
Athey has explained that the main drawback of the haplogroup prediction method in his software is the size of the database of some Y-STR haplotypes from which the allele frequencies are calculated. For most haplogroups there is sufficient Y-STR haplotype data. However, for some haplogroups, such as C, H, L, N, and Q, the database of Y-STR haplotypes is smaller, and the results may be prone to error (Athey, 2006).

Sources for markers and date scale: Smolenyak and Turner, 2004; Wells, 2007; Y-DNA Haplogroup Tree, markers, and descriptions at ISOGG, http://isogg.org/tree/index.html. kya, thousand years ago.

From the total dataset used in this study, 313 or about 12% of the records were processed through Athey's software to determine their haplogroups. Assuming an error rate of 5%, as reported in the Nicaraguan study (Núñez et al., 2012), 16 haplotypes (5% of 313) may have identified incorrect haplogroups. That represents a potential error of only 0.6% (16/2504) in the total dataset used in the study.
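The arithmetic behind this estimate can be reproduced with a short sketch. The counts come from the text above; the 5% error rate is the assumption stated there, and the function name `potential_error` is hypothetical.

```python
def potential_error(total_profiles, predicted_profiles, assumed_error_rate):
    """Estimate how many software-predicted haplogroups may be wrong.

    Returns the expected number of misassigned profiles (rounded to a
    whole profile) and that number as a share of the whole dataset.
    """
    expected_errors = round(predicted_profiles * assumed_error_rate)
    share_of_dataset = expected_errors / total_profiles
    return expected_errors, share_of_dataset


errors, share = potential_error(
    total_profiles=2504,       # all Y-DNA profiles in the dataset
    predicted_profiles=313,    # profiles whose haplogroup was predicted by software
    assumed_error_rate=0.05,   # assumption taken from the Nicaraguan study
)
print(f"expected misassignments: {errors}")
print(f"share of total dataset: {share:.1%}")
```

The estimate scales linearly with the assumed error rate, so a less favorable rate for sparsely represented haplogroups such as H or L would raise it proportionally.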

Population Mixture
The population of the Indian subcontinent at ∼12 kya was statistically estimated at about 100,000 people (McEvedy and Jones, 1979). Currently, there are about 1.7 billion people on this subcontinent. The increase in population reflects a complex history of migrations and invasions of people from outside the subcontinent, resulting in an influx of foreign genes to the subcontinent.
When the population of the subcontinent was small, people did not travel far. However, over time, humans have moved to new and distant habitats. They have also shown variations in their post-marital residence practices. About 70% of humans practice some form of patrilocality, with men remaining in and women migrating from their household, clan, lineage, tribe, or village (Fox et al., 1967; Murdoch, 1981). Some societies display matrilocal or bilocal migration patterns, in which men, or members of both sexes, leave their birthplace to live with their mate elsewhere (Fox et al., 1967). It is believed these practices prevailed in the Indian subcontinent in ancient times. The people admixed freely and gradually scattered in different directions, merging with communities they found in their paths, or creating entirely new communities.

Emergence of Endogamy
Genetic studies have shown that most ethnic groups of the Indian subcontinent descended from a mixture of two divergent populations. These were Ancestral North Indians related to Central Asians, Middle Easterners, Caucasians, and Europeans and Ancestral South Indians who were not closely related to any groups outside the subcontinent (Reich et al., 2009;Moorjani et al., 2013). After the arrival of Indo-European speakers in North India about 4 kya, the caste system was introduced, and a stratified social hierarchy evolved.
The upper-caste populations were thought to have started practicing and encouraging endogamy about 70 generations (more than 2,000 years) ago (Basu et al., 2016). Another study suggested that endogamy originated much later, around the time of foreign invasions in north India (Vadivelu, 2016). As ethnic groups developed with their own identities, endogamy in the Indian subcontinent became the general norm. The preference for close kin unions, i.e., consanguineous marriages between people, such as cousins, who have at least one recent common ancestor, is another type of endogamy. Currently, couples related as second cousins or closer account for an estimated 10.4% of the global population, with the highest rates in certain regions including West, Central, and South Asia (Bittles and Black, 2010). According to the International Institute for Population Sciences in Mumbai, about 16% of marriages in India are consanguineous (Kuntla et al., 2013). In Pakistan, where first cousin marriages have occurred for generations, the rate is 67% (Yaqoob et al., 1993).
At this time, there are several thousand different ethnic and tribal groups in the Indian subcontinent (Papiha, 1996; Xing et al., 2010). Members of these communities share common self-identities that are based on languages, customs, cuisines, and at least six major religions. There are 22 official languages and hundreds of dialects in India (Annamalai, 2006), which reflect the genetic diversity of the population. About 125 million people, roughly 10% of the population, now speak the English language (Mehtabul et al., 2013), and members of many ethnic groups have migrated to other countries throughout the world.
Although endogamy has become the general norm in the Indian subcontinent, and consanguinity is practiced in some communities, there has been considerable admixture in the past, resulting in a very mixed set of genes and geographic origins throughout the subcontinent. This is evidenced by the distribution of 14 different haplogroups in the 50 ethnic groups used in this study.

Genetic Distance
In another study, the results of AMOVA (analysis of molecular variance) and MDS (multidimensional scaling plots) tests confirmed that the ethnic group of Jats had genetic affinities with several foreign populations (Mahal and Matsoukas, 2017). Similar studies of other ethnic groups of the Indian subcontinent will provide additional insights into their genetic makeup and geographic origins.

CONCLUSION
The human Y chromosome provides a powerful molecular tool for analyzing Y-STR haplotypes and determining their haplogroups, which in turn lead to the ancient geographic origins of individuals. For this study, 50 ethnic groups in the Indian subcontinent were analyzed, and their haplogroups were identified. Using markers from the Y-DNA haplogroup tree and available descriptions of haplogroups, the geographic origins and migratory paths of the ancestors were ascertained and documented.
The results showed that every ethnic group in the dataset had members that belonged to more than one haplogroup, indicating multiple lines of ancestry and geographic origins. Additionally, even with their potentially different languages, religions, nationalities, customs, cuisines, and physical differences, members of different ethnic groups who belonged to the same haplogroup were genetically related and had the same ancient MRCAs and geographic origins in the distant past. Although historians have attempted scholarship on the deep origins of people, their assessments do not go back far enough in time because of lack of documentation. Based on recent developments in DNA science, this study has provided new insights into the ancient multi-source geographic origins of some distinct ethnic groups of the Indian subcontinent. It is expected that more scientific studies will follow, providing insights about where and when the founding populations of the Indian subcontinent arrived, and how they spread in different directions to create so many diverse ethnic communities.

ETHICS STATEMENT
The study presented in this manuscript did not involve human or animal subjects. All data used in the study are from existing databases and published sources, which are cited.

AUTHOR CONTRIBUTIONS
DM: analyzed data and wrote the paper; IM: wrote the paper.