Comparative Analyses of Chloroplast Genomes From 14 Zanthoxylum Species: Identification of Variable DNA Markers and Phylogenetic Relationships Within the Genus

Zanthoxylum L. is an economic crop with a long history of cultivation and domestication and has important economic, ecological, and medicinal value. To solve the classification problems caused by the similar morphological characteristics of Zanthoxylum and establish a credible phylogenetic relationship, we sequenced and annotated six Zanthoxylum chloroplast (cp) genomes (Z. piasezkii, Z. armatum, Z. motuoense, Z. oxyphyllum, Z. multijugum, and Z. calcicola) and combined them with previously published genomes for the Zanthoxylum species. We used bioinformatics methods to analyze the genomic characteristics, contraction, and expansion of inverted repeat (IR) regions; differences in simple sequence repeats (SSRs) and long repeat sequences; species pairwise Ka/Ks ratios; divergence hotspots; and phylogenetic relationships of the 14 Zanthoxylum species. The results revealed that cp genomes of Zanthoxylum range in size from 158,071 to 158,963 bp and contain 87 protein-coding, 37 tRNA, and 8 rRNA genes. Seven mutational hotspots were identified as candidate DNA barcode sequences to distinguish Zanthoxylum species. The phylogenetic analysis strongly supported the genus Fagara as a subgenus of Zanthoxylum and proposed the possibility of a new subgenus in Zanthoxylum. The availability of these genomes will provide valuable information for identifying species, molecular breeding, and evolutionary analysis of Zanthoxylum.


INTRODUCTION
Zanthoxylum L. belongs to the Rutoideae subfamily of the Rutaceae family. Zanthoxylum is widely distributed in tropical and subtropical regions and consists of approximately 250 species (Huang, 1997). Owing to its important economic and medicinal value, Zanthoxylum has a long history of cultivation and domestication in Asia. The Chinese pharmacopeia (2015 Edition) records that the dried and mature peel of Zanthoxylum schinifolium Sieb. et Zucc. and Zanthoxylum bungeanum Maxim. have the effects of warming the middle-jiao to alleviate pain, destroying parasites, and relieving itching, and can be used to treat cold pain of the gastric cavity and abdomen, vomiting and diarrhea, and abdominal pain due to parasitic infestation. The pericarp of Z. armatum, Z. bungeanum, and other Zanthoxylum species has a strong fragrance and is one of the traditional Chinese "eight major condiments." In addition, owing to its economic potential, Zanthoxylum has been planted to restore farmland to forests and has high ecological value for soil and fertilizer preservation (Hu et al., 2012).
Due to the high medicinal and culinary value of the Zanthoxylum genus plants, there are some phenomena of homonym, heteronym, or defective products that are substituted for the qualified ones in the market (Shen et al., 2004). The traditional classification and identification of Zanthoxylum relied on its morphological characteristics, such as leaf morphology, fruit color, fruit maturity, leaf gland points, and leaf thorns (Tu et al., 2001), however, the morphological characteristics are easily affected by the growth environment. Also, many Zanthoxylum plants are similar in morphology and difficult to identify, which restricts the progress and development of the Zanthoxylum industry. Furthermore, inappropriate selection of cultivars in agricultural production of Zanthoxylum may cause economic losses to farmers (Feng et al., 2017). Chaotic use of Zanthoxylum medicinal materials also directly affects the safety and efficacy of clinical medications. Therefore, reliable and distinguishable genetic markers are urgently needed for the healthy development of the Zanthoxylum industry. Although Zanthoxylum is widely distributed in China, the systematic breeding of Zanthoxylum varieties has not been carried out on a large scale. The classification and naming of the existing Zanthoxylum varieties are based on historical, customary names inherited by the local people and lack an authoritative and unified naming.
To solve the problem of Zanthoxylum species identification, Li et al., developed simple sequence repeats (SSRs) markers for the chloroplast (cp) genome of Z. bungeanum. Li et al. (2019) showed that SRs can be used for genetic diversity analysis among different Zanthoxylum species and used cpSSRs to distinguish between Z. bungeanum, Z. armatum, and Z. piperitum. Feng et al. (2017) developed SSR markers to identify Z. bungeanum based on transcriptome data. Wang et al. (2016) used the internal transcribed spacer 2 (ITS2) to identify Zanthoxylum plants and found that ITS2 could not completely distinguish all of the Zanthoxylum species they selected. However, using the combined barcode of ITS2 and psbA-trnH derived from the cp genome, five species of Zanthoxylum were distinguished (Wang et al., 2016). This result is consistent with the results of Xin-Ye et al. (2014). The research on molecular markers of Zanthoxylum is confined to a few species in some regions only. We are committed to annotating further Zanthoxylum complete cp genomes to compare variation and discover additional molecular markers that are applicable to a wider range of Zanthoxylum for identification and breeding. The cp genome is characterized by a typical quadripartite structure that contains a pair of inverted repeat (IR) regions separated by a large single-copy (LSC) and a small single-copy (SSC) region (Bendich, 2004). A full cp genome is a valuable resource of information for studying plant taxonomy, phylogenetic reconstruction, and historical biogeographic inference (Liu et al., 2018). The plant cp genome, a research hotspot for screening DNA barcoding sequences, can also be used as a super-barcode for phylogenetic and species identification studies (Jansen et al., 2007). The use of cp genomes to solve the problem of classification of related species is of enormous significance for species identification of herbal medicine and the entire plant community.
In this study, we sequenced and assembled six Zanthoxylum whole cp genomes and combined this data with eight previously published Zanthoxylum cp genomes to perform a comprehensive analysis, including genome features, repeats, selective pressures, divergence hotspots, and phylogenetic relationships. Our goals in this study were to: (1) present the complete cp genome sequence of six new assembled Zanthoxylum species and compare the global structures with another eight previously published Zanthoxylum species; (2) examine variations of long repeat sequence and SSRs in 14 Zanthoxylum cp genomes; (3) identify divergence hotspots as potential genetic markers for DNA barcoding; and (4) reconstruct the phylogeny of Zanthoxylum species using protein-coding sequences of cp genomes and infer their phylogenetic location within Rutaceae.

DNA Extraction and Sequencing
The plant materials of Z. piasezkii, Z. armatum, Z. motuoense, and Z. oxyphyllum were collected from Nyingchi (Tibet, China); Z. multijugum and Z. calcicola were obtained from Kunming (Yunnan, China). Fresh, healthy leaves were directly dried with silica gel after collection. Total genomic DNA was isolated using a modified CTAB method . The DNA integrity and concentration were measured using agarose gel electrophoresis and a NanoDrop 2000 Spectrophotometer (Thermo Scientific, Carlsbad, CA, United States). Purified DNA was randomly sheared into fragments using physical methods. Paired-end reads (150 bp) were generated on an Illumina HiSeq X 10 System (San Diego, CA, United States). Total genomic DNAs were also sent to BGI (Shenzhen, China) for library (400 bp) preparation for genome skimming sequencing. Paired-end (150 bp) sequencing was conducted on the Illumina HiSeq X-10 platform, generating ∼2 Gb data per sample. Low-quality sequences were filtered by NGS QC Toolkit v2.3.333 (Patel and Jain, 2012) with Q30 (base Phred quality score of ≥ 30) was used to obtain highquality reads.

cp Genome Assembly and Annotation
We assembled the chloroplast genomes with NOVOPlasty (Dierckxsens et al., 2017) using clean data, with parameters of K-mer (33), and annotated them with GeSeq 1 (Tillich et al., 2017), coupled with manually edited start and stop codons in Geneious 11.1.4 software (Kearse et al., 2012). The Z. bungeanum cp genome sequence (NCBI Accession number: NC031386) was used as a reference. The annotation results were checked using DOGMA 2 (Wyman et al., 2004) and CpGAVAS . In addition, all tRNA genes were further verified using tRNAscan-SE v1.21 (Brooks and Lowe, 2005). The boundaries of LSC, SSC, IRa, and IRb were determined through local BLAST software. Finally, the Organellar Genome DRAW tool 3 (Lohse et al., 2013) was used to draw the circular gene maps of the Zanthoxylum cp genome.

Comparative Genomic Analysis and Molecular Marker Identification
IR expansion and contraction in the cp genomes of the 14 Zanthoxylum species were detected using IRscope (Amiryousefi et al., 2018). The nucleotide diversity (Pi) of coding and non-coding regions was extracted  and then computed with DnaSP (Rozas et al., 2003). The variable, parsimony informative, conserved sites of DNA barcode sequences were identified using DnaSP software (Rozas et al., 2003).

Repeat Sequencing Analysis
MISA 4 was used to identify SSRs in each species with the minimum numbers of repeats set to 8, 5, 3, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexanucleotides, respectively (Thiel et al., 2003). The long repetitive sequences containing forward, palindromic, reverse, and complementary repeats were analyzed using the software REPuter 5 with a 30 bp minimum repeat size and a Hamming distance of 3 (Kurtz et al., 2001).

Evolutionary and Phylogenetic Analysis
We used the KaKs calculator program (Zhang et al., 2006) with the NG model to calculate the rates of non-synonymous substitutions (Ka), synonymous substitutions (Ks), and their ratio (Ka/Ks). When Ks = 0, the value cannot be computed and was represented by * . When Ka = 0 and Ks = 0, the value was represented by NaN. The Sapindales species Z. bungeanum (NC031386.1) was used as a reference.
The genome sequences of 20 plastomes of the Rutaceae species were downloaded from the National Center for Biotechnology Information Search database (Supplementary Table 5), and six newly assembled Zanthoxylum cp genomes were used to reconstruct phylogenetic relationships; Xylocarpus rumphii (NC038199.1) was used as the outgroup in the phylogenetic analysis. A total of 76 protein-coding genes (Supplementary Table 6), shared by the cp genomes of 30 Rutaceae species, were selected to construct maximum likelihood (ML) and Bayesian trees. A total of 76 gene sequence alignments were deposited into MAFFT 7.0 (Osaka University, Suita, Japan; Katoh and Standley, 2013) and were adjusted manually where necessary. Phylogenetic trees were constructed using IQ-TREE (Nguyen et al., 2015) and MrBayes 3.2.6 software (Ronquist et al., 2012;Zhang et al., 2020). The Bayesian Inference tree was constructed under the GTRGAMMA model (two parallel runs, 2,000,000 generations), with the initial 25% of sampled data discarded as burn-in. We selected the GTRGAMMA model of nucleotide substitution for ML analysis (Nguyen et al., 2015).
The cp genomes of the 14 Zanthoxylum species, their sizes, GC content, number of genes, and other information are shown in Supplementary Table 1. The 14 Zanthoxylum cp genomes ranged in size from 158,071 bp for Z. madagascariense to 158,963 bp for Z. schinifolium (Figure 1 and Supplementary Table 1). The differences between the lengths of the Zanthoxylum cp genomes were no greater than 892 bp (Supplementary Table 1). The 14 Zanthoxylum cp genomes displayed a typical quadripartite structure, consisting of a pair of IRs (26,955-27,679 bp) separated by one LSC (85,340-86,528 bp) and one SSC (17,526-18,301 bp) region.

Contraction and Expansion of IRs
The contraction and expansion of IR regions is the main contributor to the size variation in cp genomes and alters the evolutionary rate of the cp genome (Zhang et al., 2013). We compared the IR boundaries in 14 Zanthoxylum species and found that the IR boundary regions varied slightly (Figure 2). The IRa/LSC boundaries were located downstream of the rps19 gene. The ycf1 gene crossed over the IRa/SSC border and extended into the IRa region; the length of the ycf1 gene extending into the IRa region ranged from 1,085 to 1,704 bp. At the IRb/SSC border, 23 bp of the ndhF gene was located within the IRb while the remainder was situated in the SSC regions, except in Z. oxyphyllum, Z. multijugum, Z. calcicoca, Z. pinnatum, Z. madagascariense, Z. schinifolium, Z. paniculatum, and Z. tragodes, where the ndhF gene was fully present within the SSC region, indicating that these species may have a closer genetic relationship.

Long Repeat Analysis
Repeat sequences in Zanthoxylum cp genomes were detected by REPuter, with the criterion of a copy size of 30 bp or longer. A total of 379 long repeats consisting of 159 forward repeats, 192 palindromic repeats, 20 reverse repeats, and 8 complement repeats were detected (Figure 3 and Supplementary Table 2
Z. schinifolium had the largest number of long repeats with 47, and Z. bungeanum had the smallest number of long repeats with 17 ( Figure 3B and Supplementary Table 2). The number of forward repeats varied between 8 (Z. bungeanum) and 24 (Z. schinifolium), and the number of palindromic repeats varied from 9 (Z. bungeanum) to 21 (Z. schinifolium) (Figure 3B and Supplementary

SSRs Analysis
SSRs are important genetic markers to identify closely related species (Zhao et al., 2020;Zheng et al., 2020). Here, SSRs in the cp genomes of 14 Zanthoxylum species were detected using MISA software. The number of SSRs in the 14 Zanthoxylum species ranged from 244 (Z. tragodes) to 268 (Z. simulans); no significant differences were found in SSR numbers in the 14 Zanthoxylum species (Figure 4A and Supplementary Table 3). In our study, mononucleotide to tetranucleotide SSRs were found in all species. Pentanucleotide repeats were found in Z. madagascariense, Z. schinifolium, and Z. oxyphyllum cp genomes, and hexanucleotide repeats were also found in Z. paniculatum and Z. madagascariense ( Figure 4B and Supplementary Table 3). Among these SSRs, mononucleotide repeats were the most common (Figures 4A,B).
Only a small proportion consisted of dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeat motifs (Figures 4A,B and Supplementary Table 3). These newly detected SSRs will contribute to the development of genetic markers for the Zanthoxylum species in future studies.

Divergence Hotspots
We analyzed the nucleotide diversity (Pi) values to measure the divergence level in protein-coding genes and intergenic regions of the 14 Zanthoxylum species. The level of sequence divergence among protein-coding genes was more conserved than in intergenic regions. In 76 protein-coding genes, the average Pi value was 0.00456. Based on a considerably higher Pi value of > 0.01, we found four highly variable regions (psbT, ndhF, matK, and atpH), and the matK gene had the highest divergence value of 0.01253 ( Figure 5A). Within the intergenic regions, trnR-UCU-atpA, psbZ-trnG-GCC, trnH-GUG-psbA, ccsA-ndhD, ycf4-cemA, rpl32-trnL-UAG, and psbK-psbI showed a significantly higher Pi value of > 0.02. The trnR-UCU-atpA region had the highest divergence value of 0.06304 ( Figure 5B). We analyzed sequence characteristics such as sequence length range, GC content, and the average number of mutation sites of the seven candidate barcode sequences ( Table 2). ndhF had the longest sequence length (2,211∼2,229 bp), ccsA-ndhD had the shortest sequence length (224∼304 bp). The region with the most average number of mutation sites is rpl32-trnL-UAG (13.6%), and the region with the least is ndhF (4.4%). From the sequence length and the percentage of parsimony informative sites, trnH-GUG-psbA, rpl32-trnL-UAG, and ccsA-ndhD contained approximately equal numbers of parsimony informative sites ( Table 2).
Among these high nucleotide diversity regions, we selected seven regions (matK, ndhF, ccsA-ndhD, psbK-psbI, ycf4-cemA, rpl32-trnL-UAG, and trnH-GUG-psbA) with suitable lengths and low sequence identities as candidate barcode sequences. To evaluate the ability of the seven selected sequences to identify Zanthoxylum plants, we constructed the ML tree separately based on each sequence (Supplementary Figure 1). The number of species successfully identified in the ML tree is used to evaluate the resolution power of the sequence (Zheng et al., 2020). When the value of the node is lower than 50, the species on the branch cannot be distinguished (Zheng et al., 2020). matK and rpl32-trnL-UAG had the same highest resolution power (identification success rate) of 93%, followed by trnH-GUG-psbA with 86%, and ycf4-cemA with 79%. CcsA-ndhD had the lowest resolution power of 57% (Table 3). Additionally, we examined the tree topology of the constructed ML trees based on each region (Figure 6). The tree topology of the constructed ML tree based on the matK gene is closest to the evolutionary tree, which was

Selective Pressures in the Evolution of Sapindales
We analyzed the synonymous and non-synonymous change rates of 68 protein-coding genes in Sapindales (Supplementary Table 4). Eight genes (ccsA, cemA, matK, psaI, psbK, psbM, rps12, and rps16) were identified under positive selection (Ka/Ks ratio > 1; Supplementary Table 4). This shows that although Sapindales face relatively weak selection pressure, some are undergoing essential adaptations to their environment. Among the eight genes, rps16, psbK, matK, ccsA, cemA, and rps12 showed high rates for one species. The genes psbM and psaI presented high rates for two and five species, respectively.

Phylogenetic Analysis Within Rutaceae
A total of 30 Rutaceae cp genomes were selected to perform phylogenetic analysis. Xylocarpus rumphii (NC038199.1) was used as the outgroup. The phylogenetic tree was constructed using the ML and Bayesian inference (BI) methods and resulted in similar phylogenetic trees based on 76 protein-coding sequences. Seven Subgen. Fagara and five Subgen. Zanthoxylum species were clustered together to form a single clade, which is consistent with the record in Flora of China (Huang, 1997). However, it is noteworthy that Z. madagascariense and Z. paniculatum clustered together to form a single clade and then gather with other Zanthoxylum species. The traditional classification system of Zanthoxylum mainly relies on the differentiation of calyx and petals. The existing classification of Zanthoxylum may be imperfect. Therefore, we speculate that Z. madagascariense and Z. paniculatum may constitute a new subgenus.

DISCUSSION
In the present study, we sequenced and annotated the cp genomes of six Zanthoxylum species, compared genomic features among the species of Zanthoxylum, identified SSR, tandem repeats and suitable polymorphic loci for designing of suitable molecular markers. Our results have laid the foundation for future studies on the molecular identification of Zanthoxylum species.
The cp genome of most angiosperm species contains 74 protein-coding genes, while a few species contain another five protein-coding genes (Millen et al., 2001;Bock and Knoop, 2012). In this study, the 14 Zanthoxylum cp genomes contain 87 proteincoding genes (79 unigenes were protein-coding), 37 tRNA genes, and 8 rRNA genes, which is similar to Citrus (Carbonellcaballero et al., 2015). Although there is one less protein-coding gene in Z. piperitum compared with other Zanthoxylum, after careful proofreading, we found that the original author missed an annotation for the rps12 gene (Lee et al., 2015).
Cp genomes are typically 120-160 kb in size since IR regions expand and contract (Wicke et al., 2011). The cp genomes of the 14 Zanthoxylum are ∼158 kb, and the length does not change significantly. Although IR boundary regions have no significant changes in Zanthoxylum, we found most of the ndhF genes of Subgen. Zanthoxylum have 23 bp located in the SSC regions, which indicates that the location information of the genes in the IR region can indicate the distance between species to a certain extent.
In addition to identifying closely related species, the variation in SSR copy numbers in the cp genome is an efficient marker for the study of plant population genetics, polymorphism investigations, and evolutionary history (Xue et al., 2012(Xue et al., , 2019He et al., 2012;Wheeler et al., 2014). Li et al. (2019) developed SSR markers derived from cp genomes that can effectively distinguish Z. bungeanum, Z. armatum, and Z. piperitum and used the SSR markers to analyze the genetic diversity among different species of Zanthoxylum. The number of Poly (A)/(T) SSRs in the Zanthoxylum cp genome is much greater than that of poly(G)/(C), which is consistent with the results of previous studies (Xue et al., 2019;Zhao et al., 2020). The abundant SSRs we found in the cp genomes of the 14 Zanthoxylum species laid the foundation for the identification of assays detecting polymorphisms at the population level of Zanthoxylum.
Since the whole cp genome contains abundant mutation sites, it can be used directly as a super barcode for species identification., on the other hand, hypervariable regions can be screened out as potential molecular markers for species identification (Nock et al., 2011;Li et al., 2015). Recently, researchers have successfully identified plants such as Amomum (Cui et al., 2019) and Ligularia (Chen et al., 2018) based on the whole cp genome. Hollingsworth et al., proposed using the rbcL + matK gene combination derived from the cp genome as the core barcode for land plant identification (Hollingsworth et al., 2009). In this research, we selected seven regions (matK, ndhF, ccsA-ndhD, psbK-psbI, ycf4-cemA, rpl32-trnL-UAG, and trnH-GUG-psbA) as candidate barcode sequences to identify Zanthoxylum. Although the protein-coding genes of cp genomes are relatively conservative and are mainly used for the study of higher classification levels, they also have applications in lower classification levels. The matK gene is a good identifier for plants of Apocynaceae (Cabelin and Alejandro, 2016), Dipterocarpaceae , and Juniperus (Hong et al., 2014). Given the  excellent performance of the matK gene in the construction of the evolutionary tree of Zanthoxylum, we recommend that the matK gene be used to reconstruct phylogenetic relationships of Zanthoxylum where there is a lack of genomic information.
Although the psbT gene in Zanthoxylum cp genomes has a high Pi value, the length of the psbT gene is too short (102 bp) to provide sufficient mutation sites, so we believe that psbT is not suitable as a barcode gene. Similarly, the rpl22 gene has a high Pi value in Zanthoxylum cp genomes mainly due to the variation in the rpl22 gene length, making it difficult to design universal primers and so rpl22 is not suitable as a barcode gene. Due to less selective pressure, non-coding sequences have higher evolutionary rates than coding regions and so provide more systematically significant information sites (Borsch and Quandt, 2009). trnH-psbA and rpl32-trnL are often used in genus level and subgenus level relationships, phylogenetic location, and species identification research. In our study, trnH-psbA and rpl32-trnL had excellent species identification success rates in Zanthoxylum. We recommend matK, trnH-psbA, and rpl32-trnL sequences as potential molecular markers for the identification and marker-assisted breeding of Zanthoxylum. In previous studies, the construction of Zanthoxylum phylogeny was mostly based on SSR markers (Feng et al., 2016), random amplified polymorphic DNA (Hong et al., 2008), sequencerelated amplified polymorphism markers (Feng et al., 2015), and single-copy nuclear genes (Feng et al., 2017). The lack of genomic information hinders the accurate evolutionary analysis of Zanthoxylum and its related species (Feng et al., 2017). Since the cp genome sequence is relatively conservative and is less affected by paralogous genes such as nuclear genes when constructing phylogenetic trees, it has often been used in angiosperm phylogeny construction and speculation of species evolution history in recent years (Dong et al., 2017;Zhang et al., 2017;Zhao et al., 2020). In our study, 14 Zanthoxylum species were represented with strongly supported phylogenetic trees using ML and BI analysis. The results of our phylogenetic analysis strongly support the genus Fagara as a subgenus of Zanthoxylum, and proposes the possibility of a new subgenus in Zanthoxylum. Z. bungeanum and Z. schinifolium recorded in the Chinese Pharmacopeia belong to a different subgenus and are relatively distantly related, consistent with the research of Wang et al. (2016). Identification success rate (%) = [(the total number of species -the number of species with bootstrap values below 50%)/the total number of species] × 100% (Zheng et al., 2020). Overall, the phylogenetic position of Zanthoxylum revealed by our phylogenetic tree is more credible than in previous studies given the higher number of cp genomes analyzed in our research.

DATA AVAILABILITY STATEMENT
The cp genome sequences of Zanthoxylum species were submitted on the National Center for Biotechnology Information