Multi-Indel: A Microhaplotype Marker Can Be Typed Using Capillary Electrophoresis Platforms

Since the concept of microhaplotypes was proposed by Kidd in 2013, microhaplotype markers have been investigated for various forensic purposes, such as individual identification, deconvolution of DNA mixtures, and forensic ancestry inference. In our opinion, compound markers can also be regarded as generalized microhaplotypes, encompassing two or more variants in a short segment of DNA (e.g., 200 bp). That is, a set of variants (referred to herein as multi-variants) within a certain length may include single nucleotide polymorphisms (SNPs), insertion/deletion polymorphisms (Indels), or short tandem repeat polymorphisms (STRs). At present, work on multi-variants focuses mainly on multi-SNPs. However, the haplotype genotyping of multi-variants relies on single-strand analysis, mainly using massively parallel sequencing (MPS). Here, we describe a method based on a capillary electrophoresis (CE) platform that can directly obtain haplotypes of individuals. We screened microhaplotypes consisting of three or more Indels with different insertion or deletion lengths within a range of less than 200 bp, each of which had at least three haplotypes. As a result, the haplotype of an individual was reflected directly by amplicon length. Finally, we established a multiplex amplification system containing 18 multi-Indel markers that could identify haplotypes on each chromosome of an individual. The combined power of discrimination (CPD) and the cumulative probability of exclusion (CPE) were 0.999999999997234 and 0.9984, respectively.


INTRODUCTION
Owing to the variety of forensic cases encountered in practice, compound markers have attracted the interest of forensic DNA scientists. Compound markers consisting of two or more variants that occur in short DNA segments (for example, ∼200 bp) can be regarded as generalized microhaplotypes. They include insertion and deletion polymorphisms (Indels) closely linked to short tandem repeat polymorphisms (STRs) (DIP-STR), single nucleotide polymorphisms (SNPs) closely linked to STRs (SNP-STR), Indel polymorphisms closely linked to SNPs (DIP-SNP), and several Indel polymorphisms linked very tightly in physical position (multi-Indels) (Castella et al., 2013; Wang et al., 2015; Wendt et al., 2016; Tan et al., 2017; Tan et al., 2018; Oldoni and Podini, 2019).
Haplotypes are presently interpreted in three ways. The first is statistical inference (for example, with PHASE) after separately genotyping each locus, but this cannot reflect the true haplotype of an individual (Kong et al., 2008; Kidd et al., 2013, 2014). The second uses DIP-STR, SNP-STR, DIP-SNP, SNP-SNP, or other compound markers for detection. Allele-specific PCR primers are designed so that the 3′ end of the primer pairs with the upstream DIP or SNP allele, and a shared reverse primer is designed downstream of the other STR or SNP marker. Two allele-specific sequences are then obtained using PCR. The genotype of haplotype markers from an individual can be determined using a capillary electrophoresis (CE) platform and a two-step detection method, but the phase of a haplotype can be unambiguously determined only when the microhaplotype includes two variants (Castella et al., 2013; Cereda et al., 2014; Oldoni et al., 2015; Tan et al., 2017; Liu et al., 2018, 2019; Moriot and Hall, 2019; Oldoni and Podini, 2019; Zhang et al., 2020). Additionally, the main limitation of microhaplotype markers comprising only two variants is the difficulty of increasing polymorphism. The third method relies on single-stranded haplotypes that are resolved by experimental analyses such as massively parallel sequencing (MPS), which can directly detect the phases of haplotypes on sequenced strands (Borsting and Morling, 2015; Snyder et al., 2015; Wang et al., 2015; Wendt et al., 2016; Chen et al., 2019; Turchi et al., 2019; Zhu et al., 2019; de la Puente et al., 2020; Pang et al., 2020; Sun et al., 2020). However, forensic scientists face many practical challenges due to the complexity of MPS, extensive data processing requirements, and higher costs.
Since the discovery and identification of 2,000 human diallelic Indels in 2002, many studies have found that Indels can serve as important complements to forensic genetic markers in addition to STR and SNP (Weber et al., 2002). Compared to STR, Indel amplicon fragments are shorter, and mutation rates are lower. Compared to SNP, Indels have length polymorphism, which can be directly detected by CE of PCR products. This can be easily achieved in most forensic DNA laboratories without complex detection methods. However, most Indels have only two alleles, the polymorphisms are relatively poor and the discriminatory power is relatively lower than that of STR. The present study considered a marker containing at least two Indel loci in a short segment of DNA (namely multi-Indel), as a new microhaplotype. This marker not only increased Indel polymorphism, but also retained the advantages of SNP and STR. Since Indels are markers with length polymorphism, we selected Indel loci with different allele lengths to form a microhaplotype that was directly detectable by CE. According to length polymorphism, it unambiguously reflected the phases of haplotypes from individuals.
Previous studies of multi-Indels have been limited to increasing polymorphism. However, as the length of an insertion or deletion in alleles of an Indel is not specific, some polymorphism information is lost (Huang et al., 2014;Sun et al., 2016). Additionally, individual haplotypes have been statistically inferred after genotyping each Indel locus, which does not reflect the true haplotype of an individual (Zhao et al., 2018). In the present study, we proposed a strategy based on a CE platform to obtain accurate haplotypes of individuals, and constructed a multiplex amplification system containing 18 multi-Indel markers to improve the discrimination power of Indels.

Ethics
The participants provided their written informed consent to participate in this study and for participants under the age of 16, the legal guardian provided written informed consent to participate. All samples were obtained under the supervision of the Ethical Committee of the Sichuan University (KS2019042).

Samples and DNA Extraction
This study included 335 samples of EDTA blood collected from Sichuan Province, China. The samples were collected under written informed consent from 170 unrelated Sichuan Han individuals, 30 unrelated Sichuan Yi individuals, and 83 parent-child pairs. Notably, 134 samples were from 17 unrelated extended families from which the 83 parent-child pairs were drawn; thus, some parent-child pairs had the same alleged parent or alleged child. We extracted DNA using BioTeke DNA kits (BioTeke Corp., Beijing, China) as described by the manufacturer. The collected DNA was quantified using the NanoDrop™ 1000 spectrophotometer (Thermo Fisher Scientific Inc., Waltham, MA, United States).

Selection of Multi-Indel Markers
Candidate Indels were selected from 208 samples, comprising 103 Han Chinese in Beijing, China (CHB) and 105 Southern Han Chinese (CHS), in the 1000 Genomes Project phase 3 using VCFtools (Sudmant et al., 2015). Candidates had to meet the following criteria: each Indel was biallelic, had a minor allele frequency (MAF) >0.1, was located in a non-coding region or intron, and had an allele length variation of 1 to 30 bp; each multi-Indel comprised at least three Indels; the physical distance between the selected Indels in one multi-Indel marker was <200 bp; their alleles had different lengths, and the length of any allele was not equal to the sum of the lengths of two or more of the other alleles, so that each theoretical haplotype has a unique amplicon length; different multi-Indel markers were >10 Mb apart if on the same chromosomal arm; no other Indel variation with MAF >0.005 lay within this range; and the number of haplotypes calculated by Haploview was ≥3, with at least three haplotypes having a frequency of ≥0.05 (Barrett et al., 2005).
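The unique-length criterion can be checked mechanically. The following is a minimal sketch, not the authors' screening pipeline. It assumes each biallelic Indel is summarized by the number of base pairs its longer allele adds to the amplicon, and the function names and example values are hypothetical.

```python
from itertools import product


def haplotype_length_offsets(indel_lengths):
    """Map each theoretical haplotype to its amplicon length offset (bp).

    indel_lengths: bp added by the longer allele of each biallelic Indel.
    A haplotype is a tuple of 0/1 flags, where 1 means the longer allele
    is present at that Indel.
    """
    offsets = {}
    for hap in product((0, 1), repeat=len(indel_lengths)):
        offsets[hap] = sum(
            length for flag, length in zip(hap, indel_lengths) if flag
        )
    return offsets


def lengths_are_unique(indel_lengths):
    """True if every theoretical haplotype has a distinct amplicon length."""
    offsets = haplotype_length_offsets(indel_lengths)
    return len(set(offsets.values())) == len(offsets)


# Hypothetical examples (not markers from the panel):
print(lengths_are_unique([1, 2, 4]))  # True: all eight subset sums differ
print(lengths_are_unique([1, 2, 3]))  # False: 1 + 2 == 3, two haplotypes collide
```

With three Indels there are eight theoretical haplotypes, so a marker passes only when all eight length offsets are distinct.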

Primer Design and Optimization
We designed PCR primers using the online tool Primer3web according to the following criteria: PCR product size, 70-250 bp; Tm values, 55-62 °C; and GC content, 30-60%. Potential secondary structures between the obtained primer pairs (including the formation of primer dimers and hairpin structures) were examined using AutoDimer, and primer specificity was checked using Primer-BLAST. All primer pairs were then assigned to dye channels according to the predicted amplicon length, and one primer of each pair was labeled with one of the fluorochromes FAM, HEX, TAMRA, or ROX. All primers were synthesized (Thermo Fisher Scientific Inc.) and then purified using high performance liquid chromatography (HPLC). Subsequently, we used 1-5 samples to perform a singleplex PCR reaction for each microhaplotype locus, and CE was used to detect the PCR products of each locus. Homozygous samples were also amplified using the corresponding primers without fluorescent labels for verification by Sanger sequencing. The sequenced size of each allele was compared with its size on CE to determine the electrophoretic mobility of each allele.
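The design ranges above can be expressed as a simple filter. This is an illustrative sketch only: the function names and the example primer are hypothetical, and the melting temperature is passed in as a value reported by the design tool rather than recomputed here.

```python
def gc_content(seq):
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def passes_design_criteria(primer, tm_celsius, product_size_bp):
    """Check one primer against the ranges stated in the text.

    Ranges: product size 70-250 bp, Tm 55-62 degrees C, GC content 30-60%.
    """
    return (
        70 <= product_size_bp <= 250
        and 55.0 <= tm_celsius <= 62.0
        and 30.0 <= gc_content(primer) <= 60.0
    )


# Hypothetical primer and values, for illustration only.
primer = "AGCTTGACCTGAATCGGTCA"
print(gc_content(primer))
print(passes_design_criteria(primer, tm_celsius=58.0, product_size_bp=150))
```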

Multiplex PCR Amplification
In multiplex PCR amplification, the initial concentration of each primer was 0.2 µM. The multiplex amplification system was then optimized based on primer concentrations and peak heights. We programmed the thermal cycler according to the manufacturer's instructions. To minimize the influence of the annealing temperature on the multiplex system, the 18 multi-Indel markers were co-amplified across an annealing temperature gradient (56.9, 57.6, 58.4, and 59.1 °C) and with different numbers of PCR cycles (25, 27, 29, and 32) using 1 ng of control DNA F312. The optimal annealing temperature and cycle number were chosen as those that gave the most balanced genotyping profiles. The final reaction volume of 10 µL included 5 µL of 2× Multiplex PCR Master Mix (Qiagen GmbH, Hilden, Germany), 2 µL of primer mixture, 1 µL of target DNA (1 ng/µL), and 2 µL of RNase-free water. The samples were amplified by PCR using the GeneAmp 9700 PCR System (Applied Biosystems, Foster City, CA, United States) under the following cycle conditions: 95 °C for 15 min, then 27 cycles of 30 s at 94 °C, 90 s at 58.4 °C, and 60 s at 72 °C, followed by a final hold at 60 °C for 60 min. All 335 samples were genotyped using the 18 multi-Indel markers in one multiplex PCR reaction.

Detection and Analysis
The PCR products were detected using the ABI 3500 Genetic Analyzer (Applied Biosystems) and a preloaded AGCU E5 dye fragment analysis run module. Samples were prepared for CE by mixing 1 µL of the PCR products with 8.9 µL of Hi-Di formamide (Applied Biosystems) and 0.1 µL of SIZ500 size standard (AGCU ScienTech, Jiangsu, China). Samples were injected at 1.2 kV for 5 s and resolved by electrophoresis at 15 kV for 1,310 s in Performance Optimized Polymer-4 (POP-4 polymer) (Applied Biosystems). Genotyping data were then analyzed using the GeneMapper™ ID Software v3.2.1 (Applied Biosystems), with an allele peak threshold of 100 relative fluorescence units (RFU).

Allele Nomenclature
Since a nomenclature system for multi-Indel markers has not been standardized and they are essentially a type of microhaplotype, we named the multi-Indel markers in this study according to the convention suggested by Kidd (2016). Within each multi-Indel marker, we labeled the allele with the smallest amplicon as 0, and any other allele that was N bp larger than the smallest allele was called N. New alleles identified in this study were also named according to their length (Huang et al., 2014).
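The length-based allele naming can be illustrated with a short sketch. The function name and the amplicon sizes below are hypothetical and are not taken from the panel.

```python
def name_alleles(amplicon_sizes_bp):
    """Name each allele by its size difference (bp) from the smallest allele.

    The smallest amplicon is named 0; an allele N bp larger is named N.
    """
    smallest = min(amplicon_sizes_bp)
    return {size: size - smallest for size in sorted(set(amplicon_sizes_bp))}


# Hypothetical amplicon sizes observed at one marker.
print(name_alleles([120, 123, 131]))  # {120: 0, 123: 3, 131: 11}
```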

Sensitivity Study
We evaluated the sensitivity of our multiplex system. Serially diluted control DNA F312 (2 ng µL⁻¹ stock) (Beijing Microread Genetics, Beijing, China) was amplified in triplicate with quantities of 1, 0.5, 0.25, 0.125, and 0.0625 ng. These samples were processed under the same reaction conditions described above.

Degradation Study
To evaluate the ability of our multiplex system to detect DNA in degraded samples, we simulated several degraded samples, which were amplified and resolved by electrophoresis as described above. The control DNA M308 was ultrasonically degraded by 0, 100, 200, 300, or 400 cycles of 200 W for 10 s per cycle, with 4-s intervals between cycles. The DNA extracted from EDTA blood was ultrasonically degraded by 0, 200, 400, or 600 cycles of 400 W for 10 s per cycle, with 4-s intervals between cycles.

Statistical Analysis
Each allele was considered as one haplotype, so the allele frequency was taken as the haplotype frequency. Allele frequencies, the forensic parameters power of discrimination (PD), power of exclusion (PE), typical paternity index (TPI), and observed heterozygosity (Ho), and the exact tests of Hardy-Weinberg equilibrium (HWE) were calculated using a modified spreadsheet within PowerStat v1.2 (Promega Corp., Madison, WI, United States) (Zhao et al., 2003). Linkage disequilibrium (LD) between pairwise loci was analyzed using GENEPOP (Rousset, 2008). The effective number of alleles (Ae) was calculated based on the formula proposed by Kidd and Speed (2015).
The paternity index (PI) is a likelihood ratio: the probability of the observed DNA results if the alleged father is the biological father of the child, divided by the probability of the same results if a random man is the biological father. The PI was calculated based on LR principles according to the International Society for Forensic Genetics (ISFG) (Gjertson et al., 2007). The combined paternity index (CPI) was the product of the PI values for all multi-Indel markers tested in each parent-child pair.
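The effective number of alleles is commonly written as Ae = 1/Σp_i², where p_i is the frequency of haplotype i at a locus. The sketch below assumes that this is the form used here. The function names and example data are hypothetical.

```python
from collections import Counter


def haplotype_frequencies(observed_haplotypes):
    """Estimate haplotype (allele) frequencies from observed chromosomes."""
    counts = Counter(observed_haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


def effective_number_of_alleles(frequencies):
    """Ae = 1 / sum(p_i^2), assuming the standard definition."""
    return 1.0 / sum(p * p for p in frequencies)


# Hypothetical locus: ten chromosomes carrying alleles named 0, 3, and 11.
freqs = haplotype_frequencies([0, 0, 0, 0, 0, 3, 3, 3, 11, 11])
print(freqs)
print(effective_number_of_alleles(freqs.values()))
```

Under this definition, a locus with k equally frequent haplotypes has Ae = k, and skewed frequencies lower the value.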
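As an illustration of the likelihood-ratio principle, the following minimal sketch computes a per-locus PI for a parent-child duo (no second parent typed) and multiplies the values into a CPI. It assumes Hardy-Weinberg proportions, no mutation, and no correction for population substructure. It is not the authors' spreadsheet, and all names and frequencies are hypothetical.

```python
from math import prod


def genotype_frequency(genotype, freq):
    """Hardy-Weinberg probability of an unordered genotype (a, b)."""
    a, b = genotype
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_child_given_parent(child, parent, freq):
    """P(child genotype | parent genotype), other allele drawn from the population."""
    c1, c2 = child
    total = 0.0
    for transmitted in parent:  # each parental allele is passed on with probability 1/2
        if c1 == c2:
            if transmitted == c1:
                total += 0.5 * freq[c1]
        elif transmitted == c1:
            total += 0.5 * freq[c2]
        elif transmitted == c2:
            total += 0.5 * freq[c1]
    return total


def paternity_index(child, parent, freq):
    """Likelihood ratio: alleged parent is the true parent vs. a random person."""
    return prob_child_given_parent(child, parent, freq) / genotype_frequency(child, freq)


def combined_paternity_index(per_locus_pi):
    """CPI as the product of per-locus PI values (loci assumed independent)."""
    return prod(per_locus_pi)


# Hypothetical allele frequencies at one locus (alleles named by length offset).
freq = {0: 0.5, 3: 0.3, 11: 0.2}
print(paternity_index(child=(0, 3), parent=(0, 3), freq=freq))
print(paternity_index(child=(0, 3), parent=(11, 11), freq=freq))  # 0.0: no shared allele
```

A PI of zero at a locus corresponds to an apparent exclusion under these assumptions; in practice, mutation models are usually applied rather than treating a single inconsistency as decisive.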

Marker Selection and General Information
We screened candidate Indels that met the inclusion criteria from the 1000 Genomes Project database. The filter of biallelic Indels with MAF > 0.1 and the allele length variation of each Indel from 1 to 30 bp resulted in 629,402 candidates, which were then filtered according to differences between allele lengths of all loci within a physical distance of <200 bp, and 26,092 potential haplotype markers remained. These were filtered according to each haplotype containing at least 3 Indels, which left 1,642 candidates. Loci in gene coding regions and those positioned <10 Mb apart on the same chromosomal arm were excluded. According to the number and the frequency of haplotypes calculated by Haploview and filtering according to our primer design criteria, only 52 candidates remained. Finally, 18 candidate multi-Indel markers containing 54 Indel loci were genotyped in one multiplex panel after removing loci for which correct genotype results could not be obtained due to long homopolymer structures or 2-15 nucleotide tandem repeats. Table 1 shows the general information of the 18 multi-Indel markers, and Supplementary Table S1 shows the haplotype frequency calculated by Haploview.

Multiplex Assays
Before performing the multiplex amplification, we verified the amplification of the primer pairs at each marker by performing singleplex PCR reactions and detection by CE. The size of each allele was determined based on the results of Sanger sequencing of the corresponding homozygous samples. The CE detection results of the singleplex PCR reactions and the Sanger sequencing of the corresponding markers are shown in the Supplementary Data. After the development and optimization of this multiplex panel, 18 microhaplotype markers were successfully amplified in a single PCR reaction. The optimal annealing temperature was 58.4 °C and the optimal cycle number was 27, as in the optimized PCR conditions presented in section "Multiplex PCR Amplification." With one PCR reaction followed by one CE run, 18 multi-Indel markers containing 54 Indel loci were genotyped per DNA sample. Complete profiles for all 18 markers were detected in each test sample. Figure 1 shows an example of a capillary electropherogram obtained by genotyping the control DNA F312. Supplementary Figure S1 shows a capillary electropherogram of the control DNA M308, and Table 1 includes information about the sequences and concentrations of all primers in the system.

Sensitivity Study
The sensitivity of our multiplex assay was tested with control DNA F312 serially diluted to template amounts of 1, 0.5, 0.25, 0.125, and 0.0625 ng. Each template amount was amplified three times. Sample inputs ≥0.125 ng consistently generated full profiles (Figure 2) when amplified for 27 PCR cycles with an allele-calling threshold of 100 RFU. As the template amount was reduced from 1 to 0.125 ng, the average detected peak height fell from 4,144 to 351 RFU. When the template DNA F312 was reduced to 0.0625 ng, profiles were partial: on average, 91.36% of the alleles were detected, with an average peak height of 212 RFU. Therefore, our multiplex system obtained reliable profiles at a threshold of 100 RFU with template amounts of 0.125 ng or more.
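The percentage of alleles detected at a given threshold can be computed as in the sketch below. It assumes that a peak is called when its height is at least the threshold (whether the comparison is inclusive is an assumption). The function name, marker labels, and peak heights are hypothetical.

```python
def profile_completeness(peak_heights_rfu, expected_alleles, threshold_rfu=100):
    """Percentage of expected alleles whose peak height meets the threshold.

    peak_heights_rfu: dict mapping (marker, allele) to peak height in RFU.
    expected_alleles: list of (marker, allele) pairs known for the control DNA.
    """
    called = [
        allele
        for allele in expected_alleles
        if peak_heights_rfu.get(allele, 0) >= threshold_rfu
    ]
    return 100.0 * len(called) / len(expected_alleles)


# Hypothetical low-template replicate: one of four expected peaks drops out.
expected = [("mhA", 0), ("mhA", 3), ("mhB", 0), ("mhB", 11)]
peaks = {("mhA", 0): 240, ("mhA", 3): 85, ("mhB", 0): 310, ("mhB", 11): 150}
print(profile_completeness(peaks, expected))  # 75.0
```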

Degradation Study
We simulated the degradation of the control DNA M308 and of DNA extracted from fresh EDTA blood to determine the effects of sample degradation. After the control DNA M308 was disrupted using 0-400 ultrasound cycles of 200 W, full profiles were still obtained using a peak height analysis threshold of 100 RFU. However, the average peak height gradually decreased as the number of cycles increased (Figure 4). For the DNA sample extracted from fresh EDTA blood (a conventional case sample), only 83% of the alleles were called after 200 ultrasound cycles at 400 W, and 33.33 and 23.33% of alleles were called after 400 and 600 cycles, respectively (Figure 5).

Statistical Analysis
We genotyped 200 unrelated individuals from Sichuan using our multiplex panel of 18 multi-Indel markers containing 54 Indel loci. Supplementary Table S2 shows their genotype profiles. The mean distance between the outermost Indels of each multi-Indel was 58 (5-142) bp. The average amplicon length was 182 (107-326) bp. The actual and theoretical amplicon sizes differed in seven multi-Indel markers. Our multiplex detected 77 specific amplicons (that is, 77 haplotypes) in 200 Sichuan individuals. One marker, mh01zl001, was monomorphic in the surveyed population, so we excluded this locus from further statistical analysis. We found 2, 3, 4, 5, 7, 9, and 10 haplotypes in 3, 4, 3, 4, 1, 1, and 1 multi-Indel markers, respectively. Supplementary Table S3 lists the alleles of 17 multi-Indel markers and their frequencies. The mean and median values of Ae for these 17 loci were 2.83 and 2.92, respectively (Figure 6).
We also tested each locus for conformity to the HWE model and for potential LD. The threshold p value for the HWE test was set at 0.00037 after the Bonferroni correction, and no deviations from linkage equilibrium were significant between pairwise loci after the Bonferroni correction (p > 3.68 × 10⁻⁴; Supplementary Table S4). Table 2 lists the PD, PE, Ho, probability of matching (PM), polymorphism information content (PIC), TPI, and p values for HWE of the 17 multi-Indel markers. The average PD value was 0.7585 (range, 0.5146-0.9469). The average PE value for the 17 loci was 0.591 (range, 0.0888-0.5535). The Ho ranged from 0.355 to 0.775, and the combined PD and combined PE were 0.999999999997234 and 0.998414249965817, respectively.
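Combined values of this kind are conventionally obtained as one minus the product of the per-locus complements, which assumes independent loci. The sketch below shows that calculation together with a Bonferroni threshold for pairwise tests. The significance level of 0.05 is an assumption (it is not stated in the text), the per-locus values are hypothetical, and the function names are illustrative.

```python
from math import comb, prod


def combined_power(per_locus_values):
    """Combined PD or PE: 1 - product of (1 - value), assuming independent loci."""
    return 1.0 - prod(1.0 - value for value in per_locus_values)


def bonferroni_pairwise_threshold(n_loci, alpha=0.05):
    """Per-test threshold when all pairs of loci are tested (alpha is assumed)."""
    return alpha / comb(n_loci, 2)


# Hypothetical per-locus PD values, not those of Table 2.
print(combined_power([0.80, 0.75, 0.90]))

# 17 loci give 136 pairs; 0.05 / 136 is approximately 3.68e-4.
print(bonferroni_pairwise_threshold(17))
```

With 17 loci and an assumed significance level of 0.05, the pairwise threshold agrees with the 3.68 × 10⁻⁴ quoted for the LD tests.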

Application in Paternity Testing
We analyzed 83 parent-child pairs and calculated the PI from genotype data obtained with the multi-Indel multiplex panel. Supplementary Table S5 shows the genotypes of the 83 parent-child pairs, the PI per locus, and the CPI per parent-child pair. The allele frequencies of the 17 multi-Indel markers were obtained separately from the 200 unrelated individuals. All the parent-child pairs conformed to the Mendelian laws of inheritance. No mutation or recombination was found in any of the multi-Indel markers in the 83 parent-child pairs. Overall, the CPI in the 83 parent-child pairs determined by the panel of 17 multi-Indel markers averaged 2.82066955485148 × 10⁶ (range, 0.58394420522483 × 10³ to 5.06111014257473 × 10⁷). Fourteen parent-child pairs had a CPI below 10,000, which did not support a biological parent-child relationship between them. However, their CPI values were >0.0001, so a biological parent-child relationship cannot be excluded. The number of loci would need to be increased, or the panel combined with STR kits, to clarify this situation.

DISCUSSION
A multi-variant is slightly different from the traditional microhaplotype. We believe that a set of all variants, including SNPs, Indels, and STRs, within a specified short length can be considered a generalized microhaplotype. Only microhaplotypes containing two SNPs can presently be genotyped on the CE platform due to limitations of the system. Therefore, we selected Indels from the 1000 Genomes Project as the basis for constructing microhaplotypes that could be analyzed using this platform. The human Indel mutation rate ranges from 0.53 to 1.5 × 10⁻⁹ per base per generation (Kondrashov, 2003; Lynch, 2010; Campbell and Eichler, 2013; Ramu et al., 2013; Besenbacher et al., 2015; Zhao et al., 2018). This mutation rate is one order of magnitude lower than that for SNPs and five orders of magnitude lower than that for STRs. Therefore, Indels combine the advantages of SNPs and STRs. Multi-Indels increase polymorphism while retaining the advantages of Indels. We used Haploview to initially screen haplotype frequencies. Since Haploview can only recognize biallelic loci, and biallelic loci are the most prevalent among Indels, this study investigated only biallelic Indels. We extracted 2,052,970 biallelic Indels from 22 autosomes in the 1000 Genomes Project using VCFtools. We further restricted the alleles according to their length. In theory, different amplicon lengths represent different haplotypes, so haplotype polymorphism can be determined according to allele frequency. In addition, the allele frequencies of SNPs and Indels vary significantly among different populations. When applied to individual identification in forensic cases, population-specific allele frequencies are necessary (Oldoni et al., 2018). Because this study initially considers application in the Chinese population, only the CHB and CHS populations in the 1000 Genomes Project phase 3 were used as the source for screening candidate markers.
The observed haplotype frequencies of some multi-Indel markers differed from the theoretical values obtained from the 1000 Genomes Project database using Haploview (Supplementary Tables S1, S3). According to the law of free combination, three biallelic markers in linkage equilibrium should display eight (2³) different haplotypes. A haplotype with a minimum frequency of 0.001 can be obtained using Haploview calculations. However, we found three multi-Indel markers (mh04zl001, mh10zl002, and mh18zl001) in which three Indels yielded only two different haplotypes (two alleles), which might be related to complete LD between closely adjacent markers (distances were 17, 38, and 16 bp, respectively). Additionally, seven multi-Indel markers were inconsistent with the theoretical amplicon length, and their haplotype frequencies also differed. We verified the homozygous samples of each marker by Sanger sequencing, especially each amplicon that was inconsistent with the theoretical length. Although our screen excluded regions containing other Indels with MAF >0.005, mh02zl001 and mh02zl003 had 10 and 9 haplotypes, respectively, because additional Indels were detected in this range. In addition, according to the Sanger sequencing results, novel mutations were found in the mh10zl001 and mh21zl001 loci, which caused the actual allele sizes and frequencies to be inconsistent with the theoretical values. For the other two loci, mh03zl003 and mh21zl002, we did not find additional mutations in the homozygous samples that were Sanger sequenced, but their observed alleles were nevertheless inconsistent with the theoretical values. These Indels were not included in the database because the goal of the 1000 Genomes Project is to capture the most common human genetic variations (Bergstrom et al., 2020). The development and progress of sequencing technology allows the collection of more varied information.
Our multi-Indel multiplex panel has many advantages. Each multi-Indel marker requires only one primer pair, and genotyping requires only one PCR amplification and one CE run. The elimination of sequences with 2-15 nucleotide tandem repeats improved genotyping accuracy and avoided stutter, which is a benefit when analyzing mixtures. Low mutation rates are highly significant in paternity testing, but our results showed that our panel could only serve as an effective supplement to STRs, because the PE was not high enough (Huang et al., 2014; Gao et al., 2015; Zhao et al., 2018).
A generalized microhaplotype is essentially a set of all variants in a short fragment, namely multi-variants, which have higher polymorphism. The MPS technology can directly obtain sequences within the read length range, and thus directly determine the phase of a haplotype. Currently, the CE platform is more prevalent in forensic laboratories, so multi-Indels have other potential applications. With the future popularization of MPS, the application of generalized microhaplotypes will become more widespread.

CONCLUSION
In this study, we proposed that the generalized microhaplotype is essentially a collection of all variants in a very short fragment (200 bp), that is, multi-variants with high polymorphism. Because the CE platform is currently widely used in forensic genetic laboratories, we described a method based on the CE platform. This method can simultaneously detect 18 microhaplotype markers consisting of three or more Indels with different insertion or deletion lengths within a range of less than 200 bp. Our multi-Indel microhaplotype panel has shorter fragments than conventional STR markers and therefore has more potential for degraded DNA. In addition, multi-Indel microhaplotypes do not generate the stutter associated with PCR amplification, which gives them more potential for mixtures of DNA from two or more individuals. Multi-Indel microhaplotypes also offer a much lower mutation rate than STR markers and can be used as a supplement in paternity cases involving STR mutations. The combined power of discrimination (CPD) (0.999999999997234) demonstrated the usefulness of our panel for forensic personal identification. However, our results also showed that our panel can only be used as an effective supplement to STRs, because the CPE (0.9984) is not high enough. Overall, microhaplotypes consisting of three or more Indels that can be resolved on the CE platform have great application potential in forensic genetics.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ethical Committee of the Sichuan University. The patients/participants provided their written informed consent to participate in this study.