Edinburgh Research Explorer Characteristics and mutational hotspots of plastomes in Debregeasia (Urticaceae)

plastomes from other genera. Together with the one published plastome for Debregeasia , we analyzed plastome structure and character, identiﬁed mutation hotspots and loci under selection, and constructed phylogenies. The plastomes of Debregeasia were found to be very conservative, with a size from 155,743 bp to 156,065 bp, and no structural variation. Eleven mutation hotspots were identiﬁed, including three ( rpoB - trnC - GCA , trnT - GGU - psbD and ycf1 ) that are highly variable both within Debregeasia and among genera; these show high potential value for future DNA barcoding, population genetics and phylogenetic reconstruction. Selection pressure analysis revealed nine genes ( clpP , ndhF , petB , psbA , psbK , rbcL , rpl23 , ycf2, and ycf1 ) that may experience positive selection. Phylogenomic analyses results suggest that Debregeasia was monophyletic, and closest to Boehmeria among genera examined. Within Debregeasia , D. longifolia was sister to D. saeneb , whereas D. elliptica , D. orientalis with D. squamata formed the other subclade. This study enriches organelle genome resources for Urticaceae, and highlights the utility of plastome data for detecting mutation hotspots for evolutionary and systematic analysis.


INTRODUCTION
Chloroplasts are vital organelles within plants (Raubeson and Jansen, 2005), and their genomes comprise 120 kb to 160 kb of often highly conserved DNA and gene sequences (Wicke et al., 2011), providing rich resources for the study of evolution, DNA barcoding, taxonomy and phylogeny (Borsch and Quandt, 2009;Dong et al., 2012;Ruhfel et al., 2014). Over the past decade, analysis of whole plastomes and/or protein-coding genes has been used successfully to address phylogenetic relationships at multiple taxonomic levels (e.g., Ma et al., 2014;Du et al., 2017;Li H. T. et al., 2019). Repeating sequences can cause structural changes in genomes, and because of their variability between and within lineages, they can be used to study the population genetics of taxa (Timme et al., 2007), such as in Aristolochia ; they can also serve as information regions for developing genomic markers for phylogenetic analysis, including taxonomically challenging species complexes. Such repeating markers include simple sequence repeats (SSRs), known as microsatellites, which comprise 1-6 nucleotide repeat units and are ubiquitous in the genome (Powell et al., 1996). Certain genes exhibit high variability, especially ycf1, which can therefore potentially be used as a barcode for terrestrial plants (Dong et al., 2015), and rpl20, which has an important role in protein synthesis and is involved in protein translation (Weglöhner and Subramanian, 1992). Furthermore, understanding plastome genetic variation within and between populations provides important information that can be used for conserving species and populations, helping them adapt to climate and habitat changes, and for more successful plant breeding (Daniell et al., 2016). Combining genome-wide information with that from hyper-variable regions provides the best approach to elucidate relationships and identify species among taxonomically critical groups (e.g., Bi et al., 2018;Fu et al., 2019).
Debregeasia Gaud. (Urticaceae) occurs mostly in East Asia, and comprises about eight species (Chen et al., 2003;Wilmot-Dear and Friis, 2012). Debregeasia is economically important because of its stem fibers, which are usually used to make ropes and fishing nets, and its edible fruits can be used to make wines (Chen et al., 2003). Additionally, Debregeasia has been used to treat diarrhea, bone fractures, tumors, skin diseases and urinary complaints, and contains compounds with anti-bacterial, immune suppressant, anti-fungal and antiinflammatory properties (Akbar and Malik, 2002;Almubayedh and Ahmad, 2019). Thus far, morphology-based taxonomic treatments for Debregeasia have been controversial (Chen et al., 2003;Wilmot-Dear and Friis, 2012), whereas phylogenetic analyses have so far used too few loci to achieve full resolution within Debregeasia (Wu et al., 2013(Wu et al., , 2018. Therefore, new methods based on plastome genomic data need to be explored to study the systematics of Debregeasia. However, only one plastome (D. orientalis) has been reported in Debregeasia , and neither plastome characteristics nor mutation hotspots have so far been investigated in the genus.
In the present study, a total of 25 complete plastomes of Urticaceae were newly assembled and annotated (including 12 individuals from 5 Debregeasia species). Together with the one published plastome, these were used to: (1) analyze variation in genome size, content and structure, as well as IR contraction and expansion; (2) identify microsatellite types, hotspot regions for sequence divergence and variation and adaptive selection; (3) reconstruct phylogenetic relationships of Debregeasia. The present study therefore enriches organelle genome resources for Urticaceae.

Plant Material
Leaf materials were collected from healthy living plants in the field, and subsequently dried and stored in silica gel. In addition, a few individuals were sampled from herbarium specimens. In total, thirteen individuals of five Debregeasia species were included (Supplementary Table S1), all newly sequenced except for Debregeasia orientalis_LAD10 (MH196364)  which was downloaded from GenBank. An additional 13 species within Urticaceae, which represented all four main clades of the family (Wu et al., 2013(Wu et al., , 2018 were adopted as outgroups ( Table 1). All voucher specimens were deposited in the herbarium of Kunming Institute of Botany, Chinese Academy of Sciences (KUN); Royal Botanic Garden, Edinburgh (E); and Royal Botanic Gardens, Kew (K) (Supplementary Table S1).

DNA Extraction, Sequencing, Plastomes Assembly and Annotation
For silica gel dried materials, DNA was extracted using a modified hexadecyltrimethylammonium bromide (CTAB) method (Doyle and Doyle, 1987), whereas for herbarium specimens, DNA was extracted using Tiangen DNA secure Plant Kits (DP320) (Tiangen Biotech, Beijing, China). The quality and quantity of DNA were measured on 1% Tris-acetate-ethylenediamine tetraacetic acid (TAE) agarose gels and using fluorometric quantification on the Qubit (Invitrogen, Carlsbad, California, United States). Paired-end libraries with 500 bp insert-size were prepared and then sequenced using the Illumina HiSeq X Ten platform, the length of reads was 150 bp. A total of 2 to 4 Gb clean data were generated for each individual. De novo assemblies were constructed with Spades (Bankevich et al., 2012). GetOrganelle v1.7.0 (Jin et al., 2018) was used to improve accuracy and efficiency in de novo assembly. Reference-guided connecting was subsequently conducted using Bandage (Wick et al., 2015) and Geneious v8.1 (Kearse et al., 2012), to generate circular plastomes. The newly generated genomes were automatically annotated by PGA (Qu et al., 2019), then adjusted and confirmed using Geneious. The patterns of genomic variation among the plastomes were calculated and visualized using OGDRAW v1.3.1 (Greiner et al., 2019) and Circos v0.69-9 (Krzywinski et al., 2009).

Repeat Sequence Analysis
REPuter (Kurtz et al., 2001) was used to identify dispersed (including forward, reverse and complement repeat sequences) and palindrome repeat sequences according to the following settings: sequence identity was 90%, Hamming distance equal to 3, the minimum repeat size was 30 bp and the maximum computed repeats was 100. The tandem repeats were identified using the online Tandem Repeats Finder (Benson, 1999). The alignment parameters match, mismatch, and indels were 2, 7, and 7, respectively. The minimum alignment score to report repeats was 80. The maximum period size and TR array size were limited to 500 bp and two million bp, respectively. ESTs (Thiel et al., 2003) was used to identify simple sequence repeats (SSRs) with the minimum repeat number set to 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-and hexa-nucleotides, respectively.

Estimation of Sequence Divergence and Mutational Hotspots
In order to determine the structure and sequence divergence of the plastomes of Debregeasia, we used the Mauve alignment The numbers in parenthesis indicate the genes duplicated in the IR regions.
Frontiers in Genetics | www.frontiersin.org tool embedded in Geneious, and the VISTA framework (Frazer et al., 2004) to compare the 13 plastomes. The boundaries between the IR and SC regions of these were compared and analyzed. Individual coding and non-coding regions were extracted by Geneious, and homologous loci were aligned using MAFFT v1.3.3 (Katoh et al., 2002). Then we determined the percentage of variable sites, calculated thus: (number of nucleotide substitutions + number of indels) / (length of aligned sites minus length of indels + number of indels) * 100%. Following this, the seven regions with the highest mutation rate were identified as mutation hotspots for Debregeasia. Due to the over-conserved genomic structure of Debregeasia plastomes, we compared in a similar way the 13 outgroup species, with each other and with Debregeasia, to investigate plastome structures and sequence divergence across Urticaceae, and hence identified the seven most variable regions at family level.

Positive Selection Tests
Non-synonymous (dN) and synonymous (dS) nucleotide substitution rates, as well as their ratios (w = dN/dS) were analyzed using Codeml (PAML v4.7) (Yang and Nielsen, 2002;Yang, 2007). The protein-coding genes were extracted and aligned using MAFFT. Six site-specific models (M0, M1, M2, M3, M7, and M8) were applied, to identify the selection pressure across plastomes. These models allowed the ω ratio to vary among sites, with a fixed ω ratio in all the branches. The dN, dS, and ω values were calculated with Codeml (seqtype = 1, model = 0, NSsites = 0, 1, 2, 3, 7, 8). Then we compared pairs of site-specific models as follows: M0 (one-ratio) vs. M3 (discrete), M1 (nearly neutral) vs. M2 (positive selection) and M7 (β) vs. M8 (β and ω) to analyze the existence of positive selection, with p values for each comparison determined via a Likelihood ratio test (LRT). Bayes Empirical Bayes inferences were calculated in site models M2 and M8 to estimate the posterior probabilities and positive selection pressures of the selected genes.
Within Debregeasia, no IR contraction was observed in any plastomes, whereas IR expansion generally seemed very conservative. In outgroups, the LSC/IR and IR/SSC boundaries showed some differences from Debregeasia (Figure 2). In Gonostegia hirta_Go1, the gene rps11 crossed from LSC to IRb, and the rpl36 gene was near the IRa/LSC boundary. In Droguetia iners_Dr4, the gene rps19 was only in the large single-copy. In Parietaria debilis_Pa1, the genes rps19 and trnH-GUG crossed from the LSC to the IRb and IRa regions, respectively. In Hesperocnide tenella_W277, trnH-GUG was copied in both IR regions.

Repeat Structure and Simple Sequence Repeats
A total of 932 repeats were identified in Debregeasia, falling into three categories (Table 3). Of these, the most frequent   were palindromic repeats, which occurred 363 times (38.95%), followed by tandem repeats (337 instances, 36.16%), and dispersed repeats (forward, reverse, or complement), of which there were 232 (24.89%). The individual accession with the greatest number of repeats was D. squamata_Q05 with 87, comprising 22 dispersed repeats, 31 palindromic repeats, and 34 tandem repeats. The greatest numbers of dispersed, palindromic and tandem repeats were found in D. elliptica_De19 (22), D. elliptica_De07 (31) and D. squamata_Q05 (34), respectively (Figure 3). Six kinds of SSRs (mono-, di-, tri-, tetra-, penta-and hexanucleotide) were identified in the plastomes of Debregeasia, with 1,091 SSRs detected in total (Supplementary Table S2 and Figure 4). The most frequent SSRs were mononucleotides, making up 72.41% of the total, of which T, A, C and G mononucleotides comprised 41.61%, 29.51%, 1.28%, and none, respectively (Supplementary Table S3 and Figure 4). The frequency of SSRs was inversely proportional to their length, except that tetranucleotide SSRs were more common than trinucleotide SSRs. Within D. longifolia, the total number of SSRs varied from 79 (D.  between accessions examined, so the variation in SSR number in D. longifolia is unusual in the genus (Figure 4).

Sequence Divergence and Mutational Hotspots
In general, our results showed that the plastome of Debregeasia is comparatively conserved, and that all genes were always present in the same order (Supplementary Figures S2, S3); this also applies across all 13 outgroup taxa (Supplementary Figure S4). Moreover, the non-coding regions had more variation, and higher levels of divergence, than the coding regions. The seven regions with the highest levels of variation were psbK-psbI, rpoB-trnC-GCA, trnT-GGU-psbD, trnT-UGU-trnL-UAA, ycf4-cemA, trnP-UGG-psaJ, and ycf1. Of these regions, ycf1 straddled the SSC/IR boundary, whereas all of the others were located in the LSC region ( Figure 5A). All had >0.5% variation across Debregeasia species examined. These seven regions could be considered as mutational hotspots and utilized as potential DNA barcodes for future population genetic analysis, phylogeny reconstruction and species identification studies in Debregeasia.

Positive Selection Sites
We investigated the rate of non-synonymous (dN) and synonymous (dS) substitutions to evaluate the selective pressure for 72 common protein-coding genes among the 13 Debregeasia individuals examined (Supplementary Tables S4, S5), using codon substitution models to identify possible sites under positive selection. Eighteen genes with positive selection sites were identified, and these were as follows: one subunit of the Acetyl-Co A-carboxylase gene (accD), one C-type cytochrome synthesis gene (ccsA), one gene for envelope membrane protein (cemA), one subunit of the rubisco gene (rbcL), one gene for a component of the trans locus of an envelope protein (ycf1), one gene for photosystem I subunit (psaB), two subunits of ATP synthase genes (atpA and atpB), two genes for subunits of NADH-dehydrogenase (ndhD and ndhF), four genes for the synthesis of small and large ribosomal subunit proteins (rps3, rps4, rps15, and rpl20), and four DNA-dependent RNA polymerase genes (rpoA, rpoB, rpoC1, and rpoC2).

Phylogenetic Relationships
Phylogenetic analysis based on five Debregeasia species plus 13 outgroup species, using Maximum likelihood, Maximum parsimony, and Bayesian Inference, showed that all Debregeasia  species examined formed a single clade with high bootstrap and posterior probability support (Figure 6 and Supplementary Figure S6). The genus comprised two well-supported subclades, including D. longifolia plus D. saeneb, and D. elliptica plus D. orientalis plus D. squamata. The four species with multiple accessions examined were each monophyletic. Additionally, species from Boehmeria were resolved as the sister group to Debregeasia.

Plastome Character and Potential Microsatellite Markers
Among the five Debregeasia species examined here, the plastomes appeared highly conserved, with no changes to gene order or overall structure (e.g. gene duplication, deletion and reverse transcription) observed in the genomes of Debregeasia. This might be because the species diverged fairly recently (Wu et al., 2015), or possibly due to the relatively conservative ecological niches of the genus.
The GC content of the LSC and SSC regions in all the Debregeasia species were much lower than those of the IR regions. A possible explanation for this is that the IR contains four rRNA genes, and the 16S rRNA has a very high GC content in Archaea (65-66.5%) (Yamane et al., 2011), with similar results in other terrestrial plants (Zeb et al., 2020).
Repeating sequences in plastomes can cause structural changes, and their variability across lineages makes these an appropriate source of for developing genomic markers for population genetics (Powell et al., 1996), especially when they are abundant and polymorphic. This clearly applies in Debregeasia and Urticaceae, wherein varying abundances of dispersed, palindromic and tandem repeats among the plastomes, both within and between species (Supplementary Table S2) may provide additional phylogenetic signals and evolutionary information. Additionally, large numbers of SSRs (Microsatellites) were detected in all plastomes of Debregeasia, with mononucleotide SSRs the most frequent, providing ample markers for further population and phylogenetic analysis. The number of SSRs was considerably more variable within D. longifolia than in D. orientalis, although four individuals of each were examined (Figures 3, 4 and Supplementary  Table S2). Our data does not show an obvious reason for this, as D. orientalis shows more variation in both latitude and altitude than D. longifolia (Supplementary Table S2), but D. longifolia might exhibit greater variation in habitats occupied.

Utility of Plastomes in Phylogenomics and DNA Barcoding
Complete plastome sequences are increasingly being used to solve taxonomic problems among closely related groups, providing valuable information for phylogenetic reconstruction (e.g., Ma et al., 2014;Dong et al., 2018;Li H. T. et al., 2019). In Debregeasia, phylogenetic relationships within have so far remained insufficiently resolved, probably because previous studies (Wu et al., 2015(Wu et al., , 2018 have employed a limited number of DNA loci, providing insufficient information for full resolution. Here, the monophyly of Debregeasia received maximum bootstrap and Bayesian support, improving on previous studies using less data (Wu et al., 2013(Wu et al., , 2018. Support for groupings within the genus also increased, and tree topology generally did not vary across methods or datasets, except for a few less well-supported groups at the tree tips, for example: D. elliptica appears nested within D. orientalis for some analyses and data sets, but not others (Figure 6), however, these relationships are not strongly supported. This may reflect recent divergence of these species, and hence it is possible that more intensive sampling of populations within both species, together with nuclear genomic data will provide a clearer picture in the future.
DNA super barcodes (whole genome) and mini barcodes (a proportion of a barcode) are extensions to the practice of routine DNA barcoding (Little, 2014;Hollingsworth et al., 2016). Theoretically, whole plastomes or nuclear genomes will provide the final solution for species identification. However, from both an economic and a practical perspective, a barcode or mini barcode is often sufficient, e.g., for Taxus  and macrophyte (Ortega et al., 2020) identification. In our study, the whole plastome can clearly distinguish all five Debregeasia species examined ( Figure 6A). Meanwhile, three regions (rpoB-trnC-GCA, trnT-GGU-psbD and ycf1) showed high levels of variation at both within Debregeasia and between genus (Urticaceae) levels (percentage of variability >0.5% and >6.0%, respectively), and can distinguish all five Debregeasia species (Figure 6C). Indeed ycf1, recently proposed as the most promising plastid DNA barcode across all land plants (Dong et al., 2015), could separate all five Debregeasia species on its own (data not shown). These mutational hotspots have the potential to resolve taxonomic issues in the family, and for future use as barcodes and for species identification. Therefore, plastome data shows great potential for the study of evolution, taxonomy and phylogenetic relationships in the genus Debregeasia and elsewhere in the Urticaceae.

Positive Selection Regions
Variation in both synonymous and non-synonymous nucleotide sites is also very useful in evolutionary studies (Ogawa et al., 1999). In this study, eighteen genes with sites under positive selection were identified (Supplementary Tables S4, S5), which is comparable to the sixteen detected in Orchidaceae , rather fewer than the 51 detected across 97 Pinus species (Zeb et al., 2020), but more than the seven detected among 22 Lythraceae species (Gu et al., 2019). Notably, the gene ycf1 was both under positive selection, and a mutational hotspot, in Debregeasia. This gene is one of the largest genes in the plastome, encoding a component of the trans locus of the envelope protein in vivo (Drescher et al., 2000). The ycf1 gene has been useful for phylogenetic analysis in other groups, and contains a site that is under positive selection in other plant lineages (e.g., Greiner et al., 2008;Hu et al., 2015). Our results could indicate a role for ycf1 in speciation and habitat adaptation within Debregeasia. The roles of all genes under selection in the genus merit further investigation, with regard to the range of habitats occupied by Debregeasia, which include moist places by streams, thickets, forests in mountain valleys, and slopes of limestone mountains (Chen et al., 2003).

DATA AVAILABILITY STATEMENT
All datasets generated for this study are included in the Table 1.

AUTHOR CONTRIBUTIONS
JL and Z-YW conceived the work, and carried out the field work. R-NW, Z-YW, X-YD, and JL analyzed the data. R-NW drafted the manuscript. RM, JL, and Z-YW revised the manuscript. All authors approved the final manuscript.

ACKNOWLEDGMENTS
We are deeply indebted to Profs. De-Zhu Li and Lian-Ming Gao, for their invaluable advice on the study. We also want to thank Mr. Xue-Wen Liu and Tao Liu for their help for the field sampling. Special thanks are due to Miss Wan-Lin Dong for assistance with data analysis. We would like to thank the Laboratory of Molecular Biology at the Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, to provide platform for molecular lab work.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2020.00729/full#supplementary-material FIGURE S1 | Comparison of the borders of LSC, SSC, and IR regions in Debregeasia, based on 13 individuals of five species.
FIGURE S2 | Sequence identity plot comparing the plastomes based on 13 individuals of five Debregeasia species using mVISTA with D. elliptica_De07 as a reference. Genome regions are color coded as protein coding, rRNA coding, tRNA coding or conserved non-coding sequences.           Figure 1A.  Figure 1B.