Population Genetic Diversity and Clustering Analysis for Chinese Dongxiang Group With 30 Autosomal InDel Loci Simultaneously Analyzed

Compared with short tandem repeats (STRs), the most widely used genetic markers in forensic science, insertion/deletion (InDel) markers offer additional benefits, such as the absence of stutter peaks, a low mutation rate, and suitability for mixed stain analysis. In this study, a total of 169 unrelated healthy Dongxiang individuals living in Dongxiang Autonomous County of Gansu province were recruited to appraise the forensic usefulness of a panel of 30 autosomal diallelic genetic markers. The insertion allele frequencies ranged from 0.1598 at HLD 111 to 0.8550 at HLD 118. The cumulative match probability and the combined probability of exclusion, estimated under the assumption of independence between loci, were 3.96 × 10⁻¹¹ and 0.9886, respectively, which shows that this panel has considerable potential for forensic personal identification in the Chinese Dongxiang group. It could also be used as a complementary tool for forensic parentage testing when combined with standard STR genetic markers. Furthermore, DA distances and Fst values of pairwise populations were calculated, and phylogenetic reconstruction, multidimensional scaling analysis, and STRUCTURE clustering analysis were conducted to probe the genetic relationships between the Dongxiang group and 30 reference populations. The results indicated that the Dongxiang ethnic group might be genetically closer to most of the Chinese populations involved in our study, especially the Tibetan groups, the Xibe group, and several Han populations.


INTRODUCTION
Insertion/deletion (InDel) polymorphisms, characterized by abundance in the genome, relatively low mutation rates, small amplicon sizes, and compatibility with current genotyping platforms (Wei et al., 2014), are gradually becoming a possible alternative for forensic applications, overcoming some inevitable limitations of traditional STRs such as stutter products and high mutation rates. In recent years, InDels have been shown to be useful in human identification (Pereira et al., 2009), mixed stain identification and deconvolution (Oldoni et al., 2017), and population genetic analysis, including biogeographic ancestry inference and population substructure determination (Santos et al., 2010). Sun et al. (2016) verified the considerable potency of a multi-InDel panel for ancestry inference of subpopulations in China, and Caputo et al. (2017) reported the potential use of a 33 X-InDel panel in Argentinian populations. A year later, Hwa et al. (2018) reported the efficiency of a panel in degraded and non-degraded DNA mixtures, with SNPs and InDels simultaneously analyzed via MPS technology. The commercially available Investigator DIPplex kit, which contains 30 autosomal InDel loci and the amelogenin gene, has been tested in many populations to evaluate its efficiency for forensic applications. Chinese population data including Tujia, Uyghur (Mei et al., 2016), Xibe (Meng et al., 2015), Hui (Xie et al., 2018), Dong (Li et al., 2015), Tibetan, Kazak (Kong et al., 2017), Zhuang (Li et al., 2015), and Yi (Li et al., 2015) were reported previously. However, the existing data did not include the Dongxiang ethnic minority in Gansu province of China, which is why we chose the Dongxiang group as our research subject.
As a country with a long history of civilization, China is composed of 56 ethnic groups. This ethnic and cultural diversity makes China a valuable resource for genetic analyses. The Dongxiang group is one of the Muslim ethnic groups, mainly distributed in Gansu province and Xinjiang Uyghur Autonomous Region, China. According to the 2010 census, the population of Dongxiang reached 515,000 (Wen et al., 2013). Comparing the population size of Dongxiang with those of the other ethnic minorities of China explains the constraint on our sample size. Because convincing historical records are lacking, the origin of the Dongxiang group remains unclear and is still debated by historians. However, researchers have conducted a number of studies on this issue based on diverse genetic markers, such as Y-STRs (Wang et al., 2003), mitochondrial DNA, and so on. In the present study, a panel of 30 InDel loci was applied to the Dongxiang group for the first time, with the cumulative match probability (CMP) and cumulative probability of exclusion (CPE) calculated to assess its forensic efficiency in this region. In addition, phylogenetic reconstructions, multidimensional scaling (MDS) analysis, STRUCTURE clustering analysis, and heatmaps of the fixation index (Fst) and DA values of pairwise populations were constructed based on these 30 InDels to explore the interpopulation genetic relationships between Dongxiang and 30 reference populations.

Sample Collection and Ethics Statement
A total of 169 healthy unrelated Dongxiang individuals were recruited from Dongxiang Autonomous County of Gansu province. All individuals declared no kinship among them within at least three generations and no immigration events in their family history. Written informed consent was obtained from each participant before the study proceeded. Five milliliters of peripheral blood was collected from each individual, and genomic DNA was extracted by the paramagnetic particle method according to the manufacturer's recommendations. The procedures in our experiment complied with the human and ethical research principles of Southern Medical University and Xi'an Jiaotong University, China. Genotypic data of the 30 InDel loci for the 169 Dongxiang individuals can be found in the public database figshare (10.6084/m9.figshare.6743057).

PCR Amplification and Subsequent InDel Genotyping
Thirty InDel loci were co-amplified in a single PCR system, with reagents and reaction conditions set strictly according to the manufacturer's protocol for the Investigator DIPplex commercial kit (Qiagen, Hilden, Germany), on a GeneAmp PCR 9700 Thermal Cycler (Applied Biosystems, Foster City, CA, United States). PCR products were subsequently genotyped on an ABI 3500XL Genetic Analyzer (Applied Biosystems, Foster City, CA, United States) according to the manufacturer's recommendations, and alleles were called with GeneMapper ID-X version 1.5 software (Applied Biosystems, Foster City, CA, United States). Positive and negative controls were included to ensure accurate InDel genotyping.

Statistical Analysis
Insertion/deletion allele frequencies and forensic statistical parameters, including match probability (MP), discrimination power (DP), probability of exclusion (PE), polymorphism information content (PIC), and observed heterozygosity (Ho), were calculated for the 30 InDels with a modified PowerStats version 1.2 spreadsheet. Linkage disequilibrium (LD) analysis was carried out with SNPAnalyzer version 2.0 software (ISTECH, Goyang, South Korea) (Yoo et al., 2008). Locus-by-locus P-values for interpopulation differentiation comparisons were computed in Arlequin version 3.5.1.2 software (Excoffier et al., 2007). In the pairwise LD plot (Supplementary Figure S1), no block was encircled by thick black lines with the r² threshold set at 0.8, revealing that no LD existed between any two InDel loci. More detailed information on several LD indices is presented in Supplementary Table S1. Combining the results of the Hardy-Weinberg equilibrium (HWE) tests and the LD analysis, we concluded that our population data were representative and that the 30 InDel loci were independent of each other. Thus, the product rule could be used to calculate the cumulative match probability (CMP) and cumulative probability of exclusion (CPE).
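As a rough illustration of the product rule invoked above, the following Python sketch combines per-locus MP and PE values into CMP and CPE under the independence assumption. The function names and the three example loci are hypothetical and are not taken from the study data.

```python
import math


def cumulative_match_probability(match_probs):
    """Product rule: chance that two unrelated people match at every locus."""
    return math.prod(match_probs)


def cumulative_probability_of_exclusion(exclusion_probs):
    """One minus the chance that a non-parent escapes exclusion at every locus."""
    return 1.0 - math.prod(1.0 - pe for pe in exclusion_probs)


# Hypothetical per-locus values for three loci (illustration only).
example_mp = [0.375, 0.40, 0.50]
example_pe = [0.10, 0.20, 0.15]

cmp_value = cumulative_match_probability(example_mp)
cpe_value = cumulative_probability_of_exclusion(example_pe)
print(f"CMP = {cmp_value:.4g}, CPE = {cpe_value:.4f}")
```

Both functions are only valid when the loci are independent, which is why the LD and HWE checks precede this step in the analysis.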

Allele Frequency Diversities and Forensic Efficiency Parameters
To further evaluate the forensic potency of the 30-InDel panel in the Dongxiang ethnic group, InDel allele frequencies and forensic efficiency parameters of the 30 InDels were also calculated. In the locus-by-locus comparisons, significant differences from the Dongxiang group were observed for four European populations [Central Spanish (Martín et al., 2013), Basque (Martín et al., 2013), Dane (Friis et al., 2012), and Hungarian (Kis et al., 2012)] at 10, 11, 12, and 14 loci. Two African indigenous populations [Zulu (Hefke et al., 2016) and Xhosa (Hefke et al., 2016)] differed most significantly from the Dongxiang group, at 21 and 22 loci, respectively. Clearly, compared with non-Chinese populations, closer genetic relationships might exist between Dongxiang and the other Chinese populations. As for single-locus diversity, the four loci showing the most remarkable differences between Dongxiang and the reference populations were HLD 118, HLD 39, HLD 111, and HLD 99, of which HLD 118 and HLD 99 displayed differentiation in all the non-East Asian populations.
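The per-locus parameters discussed here can be sketched for a single diallelic InDel locus as follows. This is a minimal Python sketch, assuming the commonly used definitions (MP as the sum of squared genotype frequencies, DP = 1 − MP, the two-allele PIC formula, and the heterozygosity-based PE formula). Whether the modified spreadsheet used in the study applies exactly these formulas is an assumption, and the genotype counts are hypothetical.

```python
def indel_locus_stats(n_ins_ins, n_ins_del, n_del_del):
    """Forensic parameters for one diallelic InDel locus from genotype counts."""
    n = n_ins_ins + n_ins_del + n_del_del
    g_ii, g_id, g_dd = n_ins_ins / n, n_ins_del / n, n_del_del / n

    p_ins = g_ii + g_id / 2          # insertion allele frequency
    p_del = 1.0 - p_ins              # deletion allele frequency
    ho = g_id                        # observed heterozygosity
    mp = g_ii**2 + g_id**2 + g_dd**2 # match probability
    dp = 1.0 - mp                    # discrimination power
    # PIC for two alleles: 1 - (p^2 + q^2) - 2 p^2 q^2
    pic = 1.0 - (p_ins**2 + p_del**2) - 2.0 * p_ins**2 * p_del**2
    # PE from heterozygote (h) and homozygote (H) frequencies: h^2 (1 - 2 h H^2)
    homo = 1.0 - ho
    pe = ho**2 * (1.0 - 2.0 * ho * homo**2)

    return {"p_ins": p_ins, "p_del": p_del, "Ho": ho,
            "MP": mp, "DP": dp, "PIC": pic, "PE": pe}


# Hypothetical genotype counts: 25 ins/ins, 50 ins/del, 25 del/del.
stats = indel_locus_stats(25, 50, 25)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

A locus with both alleles at a frequency of 0.5 and genotypes in Hardy-Weinberg proportions gives the smallest possible MP for a diallelic marker (0.375), which is why a panel of many InDels is needed to reach a low CMP.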

A Heatmap of Deletion Allele Frequency Distributions of the 30-InDel Loci for 31 Populations
Additionally, a heatmap of deletion allele frequencies for the 30 loci was constructed. As shown in Figure 1, the color scale ranged from blue for the lowest deletion allele frequency to red for the highest. A clustering analysis of the 30 InDel loci was displayed at the top of the figure, and three primary clusters were easily distinguished. Cluster 1 (HLD 118, HLD 99, HLD 64, HLD 81, HLD 67, and HLD 84) exhibited relatively small deletion allele frequencies in most Chinese populations, with the exception of the Kazak and Uyghur groups, while a small branch of cluster 3 (HLD 39, HLD 111, and HLD 122) showed larger deletion allele frequencies in these populations. Hence, we speculated that these loci might have potential for biogeographic ancestry inference in the Chinese populations involved in our study. In terms of the distribution of deletion allele frequencies across the 30 loci, the Dongxiang group was similar to several Chinese populations (Qinghai Tibetan, Tibet Tibetan, Chengdu Han, Beijing Han, and Henan Han) but distinct from most non-Chinese populations, which suggested a similar genetic structure among Dongxiang and these Chinese populations. Furthermore, the deletion allele frequencies of the 30 loci were approximately identical among the four European populations included in our study, which indicated that the 30-InDel panel could be suitable for personal identification in these populations.

DA and Fst Values of Pairwise Populations
Nei's DA distance is one of the most commonly used genetic distances for measuring the genetic divergence between species or between populations within the same species, and it was developed under the assumption that genetic differences originate from genetic drift and mutation events (Nei and Roychoudhury, 1974). In this study, Nei's DA distance was calculated and a heatmap of DA distance values was constructed to reflect the genetic relationships between Dongxiang and the 30 reference populations. Fst is generally considered a measure of population differentiation due to genetic structure (Jakobsson et al., 2013). A heatmap of Fst values was also constructed in our study to show the degree of differentiation between pairwise populations. As shown in Figure 3, the color scale ranged from white to dark blue: the darker the block, the greater the genetic differentiation between the two populations. Blocks with lighter colors appeared between Dongxiang and most Chinese reference populations, which meant that only small genetic differences existed between Dongxiang and these populations.
To further illustrate the genetic relationships between Dongxiang and the other 30 populations, a line chart showing the trends of the DA distance and Fst values was drawn in Microsoft Excel 2016. As shown in Figure 4, the green line representing the DA distance values and the light blue line representing the Fst values showed consistent trends, which supported the credibility of the results. Detailed DA distance values and Fst values are provided in Supplementary Tables S3, S4, respectively.
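The two distance measures compared in this section can be sketched for diallelic loci as follows. This is a simplified Python illustration: the DA function follows the usual definition based on the square roots of allele frequency products, while the Fst function is a basic heterozygosity-based (GST-style) estimator with equal population weights, which is an assumption and not the estimator implemented in Arlequin. The allele frequencies are hypothetical.

```python
import math


def nei_da(ins_freqs_x, ins_freqs_y):
    """Nei's DA distance from insertion allele frequencies at diallelic loci."""
    total = 0.0
    for px, py in zip(ins_freqs_x, ins_freqs_y):
        # Sum over both alleles (insertion and deletion) at this locus.
        total += math.sqrt(px * py) + math.sqrt((1.0 - px) * (1.0 - py))
    return 1.0 - total / len(ins_freqs_x)


def simple_fst(ins_freqs_x, ins_freqs_y):
    """GST-style Fst: (HT - HS) / HT, summed over loci, equal population weights."""
    h_total = 0.0
    h_within = 0.0
    for px, py in zip(ins_freqs_x, ins_freqs_y):
        p_mean = (px + py) / 2.0
        h_total += 2.0 * p_mean * (1.0 - p_mean)
        h_within += (2.0 * px * (1.0 - px) + 2.0 * py * (1.0 - py)) / 2.0
    if h_total == 0.0:
        return 0.0  # every locus fixed for the same allele in both populations
    return (h_total - h_within) / h_total


# Hypothetical insertion allele frequencies at two loci in two populations.
pop_a = [0.5, 0.2]
pop_b = [0.5, 0.8]
print(f"DA = {nei_da(pop_a, pop_b):.4f}, Fst = {simple_fst(pop_a, pop_b):.4f}")
```

Both measures are zero for identical allele frequencies and grow as the frequencies diverge, which is consistent with the parallel trends of the two lines described for Figure 4.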

Multidimensional Scaling Analyses Among the 31 Populations
Multidimensional scaling (MDS) analysis is a commonly used method for visualizing the level of similarity among individual cases of a dataset. An MDS algorithm places each object in an N-dimensional space such that the distances between objects are preserved as well as possible (Borg and Groenen, 1997). In our study, the MDS plot was constructed based on pairwise Fst values to reflect the genetic relationships among the 31 populations. As shown in Figure 5, all 31 populations were displayed as small icons colored according to their language families. The population distributions in the plot were in general concordance with their geographic regions: all the East Asian populations involved in our study were located in the right part of the plot, the two Central Asian groups (Kazak and Uyghur) were positioned in the middle, and the left part was occupied by six Mexican groups, four European groups, and three African groups. The studied Dongxiang group clustered closely with most Chinese populations (Tibetan in Tibet and Qinghai, Han populations in several different regions, Xibe, Hui, Tujia, Miao, Dong, and Zhuang), lay relatively far from the two Central Asian populations (Uyghur and Kazak), the six Mexican populations, and the four European populations, and lay farthest from the three African populations. Thus, the MDS plot also supported a close genetic relationship between the Dongxiang group and most Chinese populations, especially the Tibetan groups, the Xibe group, and several Han populations.
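To make the idea of preserving pairwise distances concrete, the following is a minimal pure-Python sketch of classical (Torgerson) MDS. It is an assumption that this variant resembles the algorithm behind Figure 5, since the software and MDS variant used for the plot are not stated here. The function name and the toy distance matrix are hypothetical, and the example uses plain Euclidean distances, not Fst values.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Classical MDS of a symmetric distance matrix using power iteration."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    # Row means are reused as column means, which assumes a symmetric matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def matvec(m, v):
        return [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic, normalized start vector for the power iteration.
        vec = [math.sin(i + 1.0 + d) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
        for _ in range(iters):
            nxt = matvec(b, vec)
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm < 1e-12:
                break  # remaining matrix is numerically zero
            vec = [v / norm for v in nxt]
        # Rayleigh quotient gives the eigenvalue with its sign.
        eigval = sum(v * w for v, w in zip(vec, matvec(b, vec)))
        scale = math.sqrt(max(eigval, 0.0))  # negative eigenvalues add no axis
        for i in range(n):
            coords[i][d] = scale * vec[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - eigval * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


# Toy example: pairwise distances of a 3-4-5 right triangle.
toy_dist = [[0.0, 3.0, 4.0],
            [3.0, 0.0, 5.0],
            [4.0, 5.0, 0.0]]
toy_coords = classical_mds(toy_dist)
for point in toy_coords:
    print([round(c, 3) for c in point])
```

For distances that are exactly Euclidean, as in the toy example, the two-dimensional configuration reproduces the input distances up to rotation and reflection. A pairwise Fst matrix is generally not exactly Euclidean, so the resulting plot is an approximation.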

Population Substructure Analysis for Dongxiang and 30 Reference Populations
STRUCTURE analysis is commonly used to infer population structure and assign individuals to populations using multi-locus genotypic data (Pritchard et al., 2000). In the present study, STRUCTURE clustering analysis was performed to reflect the memberships of biogeographic ancestry components for the Dongxiang group and the reference populations, with the number of hypothetical populations (K) set from 2 to 7. A burn-in period of 10,000 was used to obtain representative estimates of the parameters. As shown in Figure 6, population names and their corresponding language families were labeled at the bottom and the top of the figure. The width of each bar was proportional to the population sample size. At K = 2 and 3, East Asian groups and non-East Asian groups could be differentiated by distinct differences in color composition. At K = 4, two African indigenous populations (Xhosa and Zulu), six Mexican groups (Chihuahua Mexican, Jalisco Mexican, Mexico Mexican, Veracruz Mexican, Yucatan Mexican, and Amerindian Mexican) together with the Cape Colored population, a subset of European groups (Dane, Hungarian, Basque, Central Spanish), and two Central Asian groups (Uyghur and Kazak) could be further distinguished. Similar clustering results were obtained at K = 5 and 6. At K = 7, the Cape Colored population differed from the Mexican groups, with fewer brown components and more pink components. We found that the Dongxiang group showed hypothetical ancestry components similar to those of the other East Asian populations involved in our study at K = 3, 4, 5, 6, and 7, which meant that the Dongxiang group was genetically closer to most of the Chinese populations involved in our study than to the non-Chinese populations.

Phylogenetic Reconstruction Generated Based on DA Distance Values and Allele Frequencies
Using the neighbor-joining method, a phylogenetic tree was constructed based on DA values among Dongxiang and the 30 reference populations and is displayed in Figure 7. Each population was colored according to its language family. Four distinct branches were easily distinguished, with the first, second, third, and fourth branches composed of eighteen Asian populations, four European populations, six American populations, and three African populations, respectively. The clustering of the 31 populations roughly corresponded to their geographic locations and language families. The studied Dongxiang group clustered with the Tibetan groups in Tibet and Qinghai, the Xibe group, the Hui group, and Han populations of diverse regions (Chengdu, Beijing, and Henan), indicating relatively close genetic relationships among these populations. Apart from the Cape Colored group, the two African indigenous groups were distantly related to most of the populations, which was in good accordance with previous studies. Furthermore, an unrooted tree (Supplementary Figure S2) was constructed based on allele frequencies of the 30 loci with Phylip version 3.69 software, and its population distribution was quite similar to that of the above-mentioned neighbor-joining tree, which further validated our results.
Recently, the development of DNA genotyping technology has provided a promising approach to exploring the genetic background of the Dongxiang group and has, to a certain extent, facilitated research into its origin. Xie et al. (2002) constructed a phylogenetic tree for Dongxiang and its reference groups and reported genetic similarities among Dongxiang, Hui, Tibetan, and Beijing Han populations. Moreover, Yao et al. (2016) reported that the Dongxiang ethnic group displayed remarkable genetic homogeneity with Hans in Linxia and several additional East Asian populations. The studies cited above indicated that the Dongxiang group might be closely related to the Tibetan group and Han populations, which was largely in agreement with our findings. Besides genetic markers, population-specific origins can be explored from multiple aspects, such as language and culture. The Dongxiang group is one of the Muslim groups of China. The language of the Dongxiang ethnic group is a member of the Mongolic family. Today, villagers residing in northeastern Dongxiang county also speak the "Tang Wang" language, a creolized language recognized as a mixture of Mandarin and their original language. The surnames of Dongxiang people have also been largely influenced by intermarriage, with a prevalence of Mongol, Han Chinese, and Tibetan surnames, such as Wang, Kang, and Zhang (Howard, 1998). Evidence from the humanities similarly indicated that relatively frequent gene flow could have existed between Dongxiang and the adjacent Tibetan group and Han population, which supported the close genetic relationships among these groups.

CONCLUSION
In this study, the forensic efficiency of the 30-InDel panel was assessed in the Chinese Dongxiang ethnic group with the enrollment of 169 unrelated healthy individuals. The CMP (3.96 × 10⁻¹¹) and CPE (0.9886) values confirmed the usefulness of these 30 InDel loci for forensic personal identification. In addition, to further clarify the genetic origin of the Dongxiang ethnic group, we applied the 30 insertion/deletion polymorphic genetic markers for the first time to explore the genetic relationships between the studied Dongxiang group and 30 reference populations. The observations indicated that Dongxiang was closely related to the Xibe group, the Tibetan groups in Tibet and Qinghai, and Han populations of several different regions (Chengdu, Beijing, and Henan). We believe the data presented here can meaningfully enrich research on the genetic background of the Dongxiang group.

ETHICS STATEMENT
This study was carried out according to the recommendations of "Human and Ethical Committee of Southern Medical University and Xi'an Jiaotong University, China" with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the "Human and Ethical Committee of Southern Medical University".

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00279/full#supplementary-material
FIGURE S1 | Pairwise LD analysis of the 30 InDel loci in the Chinese Dongxiang group by SNPAnalyzer version 2.0 software.
FIGURE S2 | An unrooted phylogenetic tree constructed on the basis of allele frequencies of the 30 InDel loci by Phylip version 3.69 software.
TABLE S1 | Several indices for LD including |D'|, r² and LOD of pairwise InDel loci.