Identification and Characterization of Nine Novel X-Chromosomal Short Tandem Repeats on Xp21.1, Xq21.31, and Xq23 Regions

The application of X-chromosomal short tandem repeats (X-STRs) has been recognized as a powerful tool in complex kinship testing. To support further development of X-STR analysis in forensic use, we identified nine novel X-STRs, which could be clustered into three linkage groups (LGs) on Xp21.1, Xq21.31, and Xq23. A multiplex PCR system was built based on capillary electrophoresis (CE). A total of 198 unrelated Shanghai Han samples along with 168 samples from 43 families were collected to investigate the genetic polymorphism and forensic parameters of the nine loci. Allele numbers ranged from 5 to 12, and amplicon sizes ranged from 146 to 477 bp. The multiplex showed high values for the combined power of discrimination (0.99997977 in males and 0.99999999 in females) and combined mean exclusion chances (0.99997918 and 0.99997821 in trios, 0.99984939 in duos, and 0.99984200 in deficiency cases). The linkage between all pairs of loci was estimated via the Kosambi mapping function and a linkage disequilibrium test, and further investigated through the family study. The data from 43 families strongly demonstrated independent transmission between LGs and tight linkage among loci within the same LG. Together, these results indicate that the newly described X-STRs and the multiplex system are highly promising for further forensic use.


INTRODUCTION
On account of their distinct genetic characteristics and inheritance pattern, X-linked short tandem repeat (STR) markers provide enhanced weight of evidence in routine forensic practice. The application of X-chromosomal STRs (X-STRs) has been recognized as a powerful tool in complex kinship testing, such as deficiency paternity cases and inbred cases, where the analysis of autosomal STRs may fail to give a clear conclusion (Szibor et al., 2003; Krawczak, 2007; Szibor, 2007; Pinto et al., 2011; Tillmar et al., 2017; Gomes et al., 2020). In some cases, the analysis of low-size X-STRs may provide greater statistical power than autosomal STRs in analyzing difficult samples such as skeletal human remains (Szibor, 2007). In addition, in individual identification from mixed stains containing female components, X-STR profiling could generate more informative evidence than autosomal STR typing.
To date, more than 50 X-STR loci have been identified and investigated for forensic practice (http://www.chrx-str.org). Several commercial kits have been developed for X-STR analysis (Tao et al., 2020; Xiao et al., 2021), and the most widely used are the Investigator Argus X-12 QS Kit (Hakim et al., 2021) and the AGCU X19 STR Kit (Zhang et al., 2016). Most commercial kits adopt X-STRs from four basic linkage groups (LGs) located on the Xp22 (Hundertmark et al., 2008), Xq12 (Hering et al., 2006), Xq26, and Xq28 regions, whose linkage relationships are well established. However, the information provided by these X-STRs may not be sufficient for specific cases, especially where mutational or recombination events are observed within a linkage group (Hering et al., 2015).
Studies have been conducted to identify novel X-STR markers clustering in other LGs at the centromere (Edelmann et al., 2010), Xq21 (Szibor et al., 2005), and Xq22.1 (Edelmann et al., 2002), and attempts have been made to add new LGs to the four established LGs to increase detection efficiency. Nonetheless, the physical distance between the linkage group at the centromere and the one on Xq12 is relatively short. Previous studies have disagreed on whether these two X-STR clusters are linked as one LG (Yang et al., 2019) or are independent as two LGs. This disagreement could confuse the interpretation of X-STR profiling results in actual forensic practice.
In this study, to provide a potential enhancement to the basic four LGs, we screened out nine novel X-STRs, which could be clustered into three novel LGs located >15 Mb from each other and from the basic four LGs. A multiplex PCR assay for STR typing was built based on capillary electrophoresis (CE). A total of 198 unrelated Han Chinese individuals were collected to investigate the polymorphism and forensic parameters of the nine novel loci. The linkage between and within the three presumable LGs was estimated via the Kosambi mapping function and a linkage disequilibrium (LD) test, and further investigated through the family study.

Samples and DNA Extraction
Blood samples were collected from 198 unrelated Shanghai Han individuals (71 females and 127 males) and 43 families (N = 168). Informed consent was obtained from all participants prior to sample collection. Two types of families were included. Type One families were three-generation pedigrees comprising a grandfather, a mother, and a son. Type Two families were two-generation pedigrees comprising a father, a mother, and at least two offspring. When all offspring in a family were male, a sample from the father was not required for our study. The biological relationships in each family had been confirmed by paternity tests with autosomal STRs. Genomic DNA was extracted with the ReadyAmp Genomic DNA Purification System (Promega, United States). Control DNA 2800M (Promega, United States) was used as a reference in each run. This study was approved by the Ethics Committee of Fudan University.

Short Tandem Repeat Search and Primer Design
The lobSTR tool (Gymrek et al., 2012) and the data sets of Han Chinese in Beijing and Southern Han Chinese from the 1000 Genomes Project were used to search for novel loci on the X chromosome (except for the Xp22, Xq12, Xq26, and Xq28 regions). The criteria were as follows: 1) STR with a 4- to 5-bp repeat motif, 2) more than five repeats, 3) more than three novel loci in the same region, 4) power of discrimination in females higher than 0.5, and 5) primers that were easy to design. Genome build GRCh38 (hg38) was used to locate positions and flanking sequences. The primers for constructing a CE multiplex were designed using the PRIMER5 software.

Genotyping and Sequence Analysis
DNA amplification of all samples was carried out with Ex Taq DNA Polymerase (TaKaRa, China) according to the protocol of the manufacturer. All primers were pooled at a final concentration of 1 µM. Thermal cycling followed the recommendation of the manufacturer: 30 cycles of denaturation at 98°C for 10 s, annealing at 55°C for 30 s, and extension at 72°C for 60 s, followed by a final hold at 4°C. The PCR products were separated on an ABI 3130xl Genetic Analyzer (Applied Biosystems, United States) with the GeneScan 500 LIZ dye Size Standard (Applied Biosystems, United States). All alleles were determined with GeneMapper ID v3.2 (Life Technologies, United States). To confirm the whole sequences of the amplification products and locate polymorphic repeat regions, Sanger sequencing was performed using both forward and reverse reads.
The genetic distances between STR pairs were estimated from the sequence-level genetic map constructed by Halldorsson et al. (2019). The Kosambi mapping function was applied to convert the genetic distance in cM into a recombination fraction (Kosambi, 1943; Phillips et al., 2012). Maximum log odds (LOD) score analysis was carried out to investigate linkage between all pairs of loci (Yoo and Mendell, 2008). The recombination fraction was also calculated based on family data.

Novel Short Tandem Repeat Information
A total of nine novel X-STRs were screened out according to the criteria. An X-chromosome idiogram with details of the loci analyzed in this study is shown in Figure 1. The first marker group, named LG1 in this study, is located on Xp21.1 and spans approximately 0.763 cM. The second group, named LG2, is located on Xq21.31 and spans approximately 0.417 cM. The third group, named LG3, is located on Xq23 and spans approximately 1.906 cM. To assess the level of linkage statistically, we first computed the Kosambi recombination fraction within and between the presumable LGs. The intragroup recombination fractions of LG1, LG2, and LG3 were 0.008, 0.004, and 0.019, respectively. The intergroup genetic distances were approximately 37.60 cM between LG1 and LG2, 25.67 cM between LG2 and LG3, and 62.85 cM between LG1 and LG3. The corresponding intergroup recombination fractions were 0.318 between LG1 and LG2, 0.236 between LG2 and LG3, and 0.425 between LG1 and LG3.
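The conversion from genetic distance to recombination fraction reported above can be reproduced with a short sketch. It assumes the standard form of the Kosambi mapping function, r = 0.5 · tanh(2d) with d in Morgans; the function name is illustrative and not taken from the original analysis.

```python
import math


def kosambi_recombination_fraction(distance_cm: float) -> float:
    """Convert a genetic distance in centimorgans to a recombination
    fraction with the Kosambi mapping function r = 0.5 * tanh(2d),
    where d is the distance in Morgans."""
    d_morgans = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * d_morgans)


# Genetic distances (cM) as reported in the text.
reported_distances_cm = {
    "LG1 (intragroup)": 0.763,
    "LG2 (intragroup)": 0.417,
    "LG3 (intragroup)": 1.906,
    "LG1-LG2": 37.60,
    "LG2-LG3": 25.67,
    "LG1-LG3": 62.85,
}

for label, cm in reported_distances_cm.items():
    r = kosambi_recombination_fraction(cm)
    print(f"{label}: {cm} cM -> r = {r:.3f}")
```

For short distances the function is nearly linear (r ≈ d), which is why the intragroup fractions are almost identical to the distances expressed in Morgans, whereas for the intergroup distances the result flattens toward the 0.5 limit of free recombination.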

Multiplex PCR
Primers for constructing the CE multiplex PCR are shown in Table 1. All forward primers were labeled with fluorescent dyes: loci X003 and X016 with ROX; loci X006, X008, and X033 with FAM; loci X018 and X029 with HEX; and loci X019 and X028 with TAMRA. The Amel locus was also included in the multiplex assay, with its forward primer labeled with TAMRA. The amplicon size ranged from 146 bp (X033) to 477 bp (X028) (Table 1). An example electropherogram of the positive control 2800M is shown in Supplementary Figure S1.

Sequence Analysis
Using lobSTR, the STR sequence structure was estimated by retrieving variations from the data sets of Han Chinese in Beijing and Southern Han Chinese from the 1000 Genomes Project. Loci X016, X019, and X029 were assumed to have simple repeat structures, while the others were assumed to be complex or compound STRs (Table 2). To confirm the whole sequences of the amplification products and identify the variable repeat regions for each locus, Sanger sequencing was performed. Although X008 and X018 were assumed to be complex STRs by lobSTR, the sequencing results showed that variation occurred in only one repeat region of the assumed repeat structure: the fifth repeat region with a motif of ATAG at X008 and the first repeat region with a motif of ATAG at X018. To simplify the nomenclature, they were defined as simple STRs. The allele nomenclature for X033 was also simplified. However, the sequencing results confirmed that X033 has two different repeat structures: [TCTA]n and [TCTA]n [TCTG]1 [TCTA]n. According to the index of 1000 Genomes phase 3 (dbSNP 2.0 Build 154 v2), the second repeat structure was formed by a variation (rs782031781, A to G) in the last position of a motif. The modified repeat structures for the nine loci are shown in Table 2, together with the STR sequences of the positive control 2800M.
Potential variations within the flanking regions were also investigated, especially insertions and deletions (InDels), which could interfere with allele nomenclature. Within the upstream flanking region of X018, an AC deletion (hg38: 87326311-87326312) was found in Shanghai Han samples. Alleles with the deletion were determined as N.2 by CE but modified to N + 1 based on the sequencing results. In addition, the SNP rs621508 was found in the upstream flanking region of X018. Allele A was observed in the reference sequence from GenBank, while allele G was observed in all the sequenced Shanghai Han samples.
Once the repeat structures were determined, the STR lengths and allele sizes were obtained (Table 1). The STR length ranged from 24 bp (X029) to 105 bp (X006), and the allele designations ranged from 5 (X019) to 24 (X028).

Hardy-Weinberg Equilibrium Test and Linkage Disequilibrium Test
The Hardy-Weinberg equilibrium (HWE) analysis based on the female data is presented in Supplementary Table S1. The observed heterozygosity varied from 0.23944 (X016) to 0.91549 (X006) with an average of 0.68388, and the expected heterozygosity varied from 0.51573 (X029) to 0.86914 (X006) with an average of 0.71293. Eight of the nine loci were in HWE; the exception was X016 (p = 0.0000), which may be due to the small sample size. The pairwise exact test of LD was also performed. Significant associations were found in two pairs of loci after Bonferroni correction (significance threshold p = 0.00138889): X016-X018 (0.00019802) and X018-X028 (0.00099009) (Supplementary Tables S2, S3).
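The Bonferroni threshold quoted above is consistent with a nominal significance level of 0.05 divided by the number of pairwise comparisons among nine loci. The sketch below checks this arithmetic; the nominal level of 0.05 is an assumption, since the text reports only the corrected value, and the variable names are illustrative.

```python
from math import comb

N_LOCI = 9
NOMINAL_ALPHA = 0.05  # assumed nominal level; not stated explicitly in the text

# Number of distinct locus pairs tested for LD: C(9, 2) = 36.
n_pairs = comb(N_LOCI, 2)
bonferroni_threshold = NOMINAL_ALPHA / n_pairs

# p-values of the two locus pairs reported as significant.
reported_p_values = {
    ("X016", "X018"): 0.00019802,
    ("X018", "X028"): 0.00099009,
}

print(f"pairs tested: {n_pairs}, threshold: {bonferroni_threshold:.8f}")
for pair, p in reported_p_values.items():
    verdict = "significant" if p < bonferroni_threshold else "not significant"
    print(f"{pair[0]}-{pair[1]}: p = {p:.8f} -> {verdict}")
```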

Forensic Statistical Analysis
Allele frequencies and forensic parameters of the nine X-STRs are shown in Table 3. A total of 71 alleles were observed in the
The combined powers of discrimination in males and females in the Shanghai Han population were 0.99997977 and 0.99999999, respectively. The combined MECs were 0.99997918 and 0.99997820 in trios, 0.99984939 in duos, and 0.99984200 in deficiency cases (Table 3).
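As a rough illustration of how such values arise, the sketch below computes the per-locus power of discrimination for an X-linked marker and combines values across loci. It assumes the commonly used formulas PD_male = 1 − Σp² and PD_female = 1 − 2(Σp²)² + Σp⁴, and it combines loci by multiplying match probabilities, which treats them as independent. For tightly linked loci within one LG, haplotype frequencies would be the more appropriate input, so this is a simplification rather than the procedure used in the study. The allele frequencies are hypothetical.

```python
from typing import Iterable, Sequence


def pd_male(freqs: Sequence[float]) -> float:
    """Power of discrimination in males (hemizygous): 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in freqs)


def pd_female(freqs: Sequence[float]) -> float:
    """Power of discrimination in females: 1 - 2*(sum p_i^2)^2 + sum p_i^4."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 1.0 - 2.0 * s2 ** 2 + s4


def combined_power(values: Iterable[float]) -> float:
    """Combine per-locus powers as 1 - prod(1 - PD_i), assuming independence."""
    match_probability = 1.0
    for value in values:
        match_probability *= 1.0 - value
    return 1.0 - match_probability


# Hypothetical allele frequencies for two loci (not data from Table 3).
hypothetical_loci = {
    "locus_A": [0.40, 0.30, 0.20, 0.10],
    "locus_B": [0.25, 0.25, 0.25, 0.25],
}

males = [pd_male(f) for f in hypothetical_loci.values()]
females = [pd_female(f) for f in hypothetical_loci.values()]
print("combined PD (males):  ", combined_power(males))
print("combined PD (females):", combined_power(females))
```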

Heredity of X-Chromosomal Short Tandem Repeats in Families
A total of 168 samples from 43 families (86 meioses) were collected to investigate allele segregation of the nine X-STRs. Maternal haplotypes were identified successfully for each female sample. In this study, two inconsistencies were found to be one-step mutations (Table 4). In Family One, a paternal mutation from allele 16 to allele 17 was observed at locus X006. The other was a maternal mutation from allele 11 to allele 12 at locus X018 in Family Two. Apparent intragroup recombination was observed in two families (Table 4). One occurred between loci X003 and X006 (Family Three), where the mother was heterozygous at both loci, but the two sons inherited different alleles only at locus X003. In Family Four, although intragroup recombination was evident within LG3, the location of the recombination breakpoint is unclear because the mother was homozygous at locus X029. In addition, four more ambiguous X-STR transmissions (two at locus X003, one at locus X008, and one at locus X018) were observed in this study (Table 4). They were explicable by either an intragroup recombination or a single-step mutation. The LOD score for each locus pair was also calculated and is shown in Table 5. LOD scores for all intragroup locus pairs ranged from 5.14 to 9.79, and LOD scores between loci from different LGs were all less than 1. The recombination fraction for all pairs of loci was also calculated and is shown in Table 6. The recombination fraction varied from 0 to 0.065 for all pairs of intragroup loci and from 0.300 to 0.500 for all pairs of loci from different LGs.
Note (Table 4). One-step mutations are in bold and ambiguous transmissions are in italics.
Frontiers in Genetics | www.frontiersin.org | November 2021 | Volume 12 | Article 784605
All the maternal haplotypes for the three LGs were informative, as no haplotype was completely homozygous. Hence, recombination events between adjacent LGs could be identified directly. Intergroup recombination was observed in 29 families. The recombination fractions between LG1 and LG2 and between LG2 and LG3 were 0.366 and 0.415, respectively (Supplementary Table S4).
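The relationship between the observed recombination fraction and the LOD score can be sketched for the simplest case. The example assumes phase-known, fully informative meioses, for which the two-point LOD score is log10[θ^k (1 − θ)^(n − k) / 0.5^n] with k recombinants in n meioses and is maximized at θ = k/n (capped at 0.5). This is a textbook simplification and not necessarily the exact procedure of Yoo and Mendell (2008); the counts below are hypothetical and are not taken from Tables 5 and 6.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses at recombination
    fraction theta, relative to free recombination (theta = 0.5)."""
    non_recombinants = meioses - recombinants

    def log_term(count: int, probability: float) -> float:
        # count * log10(probability), with the convention 0 * log10(0) = 0.
        if count == 0:
            return 0.0
        if probability == 0.0:
            return float("-inf")
        return count * math.log10(probability)

    log_l_theta = log_term(recombinants, theta) + log_term(non_recombinants, 1.0 - theta)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def max_lod(recombinants: int, meioses: int) -> tuple[float, float]:
    """Return (theta_hat, maximum LOD), with theta_hat = k/n capped at 0.5."""
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, lod_score(recombinants, meioses, theta_hat)


# Hypothetical counts: a tightly linked pair and an unlinked pair.
print(max_lod(0, 20))   # no recombinants: theta_hat = 0, LOD = 20 * log10(2)
print(max_lod(9, 20))   # many recombinants: theta_hat near 0.5, LOD close to 0
```

Under these assumptions, a LOD score above 3 is the conventional threshold for accepting linkage, while scores near zero are compatible with free recombination, which matches the contrast between intragroup and intergroup locus pairs described above.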

DISCUSSION
In this study, we identified nine novel X-STRs, which could be clustered into three LGs located on Xp21.1, Xq21.31, and Xq23. A multiplex PCR assay was built based on capillary electrophoresis. A total of 198 unrelated Shanghai Han samples along with 168 samples from 43 families were collected to investigate the polymorphism and forensic parameters. Allele numbers ranged from 5 to 12, and amplicon sizes ranged from 146 to 477 bp. The multiplex showed high forensic efficiency, with high values for the combined power of discrimination and the combined MECs.
In kinship testing with X-STRs, the calculation of exact likelihoods requires the consideration of linkage and LD (Krawczak, 2007). Thus, we used the Kosambi mapping function, the LD test, and a comprehensive family study to assess the tightness and strength of linkage between loci and between LGs. The intragroup Kosambi recombination fractions implied that recombination is least likely to disrupt the haplotype of LG2 and most likely to disrupt that of LG3. This was corroborated by the family data, as intragroup recombination was observed only in LG1 and LG3. Unexpectedly, however, the intergroup Kosambi recombination fractions differed considerably from the results generated from the actual family data. Since recombination fractions obtained from the direct observation of STR transmissions in families are more trustworthy than those estimated from HapMap (Nothnagel et al., 2012), the discrepancy in our study may be reasonable and may point to the existence of a potential recombination hotspot between LG2 and LG3.
The LD test revealed only very limited LD, together with a suspicious association between loci X018 and X028. This may be due to the small sample size and the relatively low power of LD tests (Tillmar et al., 2017).
In the family study, five ambiguous X-STR transmissions were uninformative for the investigation of allele segregation. In Family Four, there were two plausible scenarios to explain how recombination disrupted the linkage. In Scenario A, the two daughters inherited different maternal alleles, and the intragroup recombination occurred between loci X028 and X029. Alternatively, in Scenario B, allele 8 of the two daughters originated from the same maternal fragment, and the intragroup recombination occurred between loci X029 and X033. The other four ambiguous X-STR transmissions were explicable by either an intragroup recombination or a single-step mutation. For these cases, additional typing of flanking markers may provide further information (Nothnagel et al., 2012). LOD scores were calculated based on the family data. Significant linkage within each group was strongly supported by the fact that LOD scores for all intragroup locus pairs were larger than 3. The pattern of free recombination between loci from different groups was also revealed, as the LOD scores were less than 1. Together, these findings demonstrate that the newly described X-STRs and the multiplex system are highly promising for forensic use. Additionally, the capability and efficiency of the system can be increased in the future, as the remaining capacity of the multiplex system allows more loci to be included.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Fudan University. The patients/participants provided their written informed consent to participate in this study.