Species Identification of Dracaena Using the Complete Chloroplast Genome as a Super-Barcode

The taxonomy and nomenclature of Dracaena plants are much disputed, particularly for several Dracaena species in Asia. However, neither morphological features nor common DNA regions are ideal for identification of Dracaena spp. Meanwhile, although multiple Dracaena spp. are sources of the rare traditional medicine dragon’s blood, the Pharmacopoeia of the People’s Republic of China has defined Dracaena cochinchinensis as the only source plant. The inaccurate identification of Dracaena spp. will inevitably affect the clinical efficacy of dragon’s blood. It is therefore important to find a better method to distinguish these species. Here, we report the complete chloroplast (CP) genomes of six Dracaena spp., D. cochinchinensis, D. cambodiana, D. angustifolia, D. terniflora, D. hokouensis, and D. elliptica, obtained through high-throughput Illumina sequencing. These CP genomes exhibited typical circular tetramerous structure, and their sizes ranged from 155,055 (D. elliptica) to 155,449 bp (D. cochinchinensis). The GC content of each CP genome was 37.5%. Furthermore, each CP genome contained 130 genes, including 84 protein-coding genes, 38 tRNA genes, and 8 rRNA genes. There were no potential coding or non-coding regions to distinguish these six species, but the maximum likelihood tree of the six Dracaena spp. and other related species revealed that the whole CP genome can be used as a super-barcode to identify these Dracaena spp. This study provides not only invaluable data for species identification and safe medical application of Dracaena but also an important reference and foundation for species identification and phylogeny of Liliaceae plants.


INTRODUCTION
Dracaena Vand. ex L., belonging to the Dracaeneae tribe of Liliaceae, comprises approximately 50 species that are mainly distributed in tropical Africa, Australia, and Asia (Chen and Turland, 2000). In China, there are six species of Dracaena, mainly distributed in Yunnan, Guangxi, and Hainan Provinces (Chen and Turland, 2000). At present, there are many disputes over the taxonomy of the Dracaena plant, especially with regard to several Dracaena spp. in Asia that has been studied very little for several decades (Chen and Turland, 2000;Paul et al, 2012), consequently there are many taxononomic and nomenclatural problems and large numbers of unidentified specimens in herbaria (Chen and Turland, 2000;Yan, 2005). In addition, the traditional classification of the genus Dracaena is mainly based on leaf sheath, leaf blade central costa, tepals, and flaments (Wilkin et al.,2013). However, the diversity of natural environment in Southeast Asia area results in significant morphological variations of the Dracaena species, making it difficult to accurately identify the Dracaena species according to the above identification characteristics. As for the molecular identification, previous studies have demonstrated that barcoding sequences (e.g., ITS, trnL, matK, psbA-trnH, and rbcL) are also not ideal for accurate identification of Dracaena spp. (Li, 2012;Zheng, 2013;Wang et al., 2015).
In application, Dracaena species are usually used as ornamental horticultural plants or as source plants of rare medicinal material dragon's blood (Adolt and Pavlis, 2004;Aslam et al., 2013). Dragon's blood is a deep red resin that is widely used throughout the world; it is known to enhance immune function, promote skin repair, stop bleeding, and enhance blood circulation (Fan et al., 2014), and it has been utilized as a traditional medicine for wounds, fractures, piles, leucorrhea, diarrhea, stomach and intestinal ulcers, and even some types of cancer in the histories of many cultures (Gupta et al., 2007;Tapondjou et al., 2008;Kougan et al., 2010;Xu et al., 2010). Modern chemical and pharmacological studies have indicated that the flavonoids, stilbene, saponins, and terpenes in dragon's blood are the main effective compounds (Hu et al., 2001;González et al., 2004;Chen et al., 2012;Mei et al., 2013;Fan et al., 2014). Among the Dracaena plant, multiple Dracaena species are the source plants of dragon's blood (Gupta et al., 2007;Lyons, 1974) such as D.
cinnabari Balf.f, D. draco L, Dracaena cochinchinensis (Lour.) S.C. Chen and D. cambodiana Pierre ex Gagnp (Gupta et al., 2007;Pearson and Prendergast, 2001), but the Pharmacopoeia of the People's Republic of China has defined D. cochinchinensis as the only source plant. At present, dragon's blood resources in China mainly depend on imported supplies from Thailand, Laos, Myanmar, and other countries. However, the Dracaena species in these areas are more abundant, the inaccurate identification of the Dracaena species will inevitably affect the clinical application effect of dragon's blood. Therefore, the identification of molecular markers that can unambiguously distinguish the Dracaena spp. is critical for ensuring the beneficial effects of medicinal products.
The chloroplast (CP), an organelle that converts solar energy to carbohydrates through photosynthesis, has important roles in the biosynthesis of starch, fatty acids, amino acids, and pigments (Jansen and Ruhlman, 2012;Zhao et al., 2015;Daniell et al., 2016). The CP genome is a circular DNA molecule that includes a small single-copy (SSC) region, a large single-copy (LSC) region, and two inverted repeats (IRa and IRb) (Sato et al., 1999). It is highly conserved (Tonti-Filippini et al., 2017) in plants and therefore ideal for ecological, evolutionary, and diversity studies (Wicke et al., 2011). Recently, researchers have used the CP genome as a super-barcode to distinguish species (Xia et al., 2016;Chen et al., 2018b), or have screened sequences from the whole CP genome for species identification Thode and Lohmann, 2019;Zhang et al., 2017a). At present, several CP genomes from Liliaceae have previously been reported and deposited in the National Center for Biotechnology Information (NCBI) database. However, from the largest genus of tribe Dracaeneae, Dracaena, only one CP genome (D. cambodiana) has been published previously (Zhu et al., 2018). In this study, we report the CP genomes of six Dracaena spp., including D. cochinchinensis (Lour.) S.C.Chen, D. cambodiana Pierre ex Gagnep, D. angustifolia (Medik.) Roxb, D. terniflora Roxb., D. hokouensis G.Z.Ye, and D. elliptica Thunb. & Dalm., obtained via high-throughput Illumina sequencing technology. Our aim is to utilize the CP genome as a super-barcode for the identification of Dracaena spp. and to provide invaluable genetic information for understanding their phylogenetic relationships and other future studies.

Plant Materials, DNa extraction, and Illumina Sequencing
Plant materials from six Dracaena spp. were collected from the Yunnan Branch of the Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences; the voucher specimens were deposited in the Institute of Medicinal Plant Development herbarium. Total genomic DNA was extracted from clean leaves from samples frozen at −80°C using the TaKaRa MiniBEST Universal Genomic DNA Extraction Kit (TaKaRa, Kusatsu, Japan) with a standard protocol, DNA quality was assessed using a Nanodrop 2000 (Thermo Scientific, Waltham, MA, USA) and by electrophoresis in a 1% (w/v) agarose gel. The OD 260/280 values ranged from 1.8 to 2.0, and >2 μg of DNA was equally pooled from individuals of the six species to generate shotgun libraries. cpDNA samples were randomly sheared, incubated with fragmentation buffer, and broken into 300-500-bp fragments in an M220 focused ultrasonicator (Covaris, Woburn, MA, USA). Connect the A & B connector at both ends of the DNA fragment, Screen the segments, then remove the self-connecting segments of the connector. Fragment screening by electrophoresis in agarose gel, keep the fragment with A connector at one end and B connector at the other end, which subsequent addition of NaOH to denatures, produce single-stranded DNA fragments. The library was sequenced using Illumina HiSeq 4000 sequencing platform at the Major Company (Shanghai, China).

Genome assembly and annotation
Raw data for the six Dracaena spp. were generated with a pairedend read length of 150 bp. The low-quality reads (Q < 20) were filtered out and the clean reads were used for CP genome assembly. A reference database was created by downloading all plant CP sequences from NCBI. The high-quality reads were mapped to the database, and the mapped reads were extracted based on sequence coverage and similarity. The extracted reads were assembled into contigs using SOAPdenovo2 , and the resulting contigs were combined and extended to obtain complete CP genome sequence. Finally, the positions of the LSC, SSC, and two IR regions of the CP genomes were determined by localization, and complete CP genomes were obtained. Annotation of the six complete Dracaena CP genomes were executed using the online program Dual Organellar GenoMe Annotator (DOGMA, http://dogma.ccbb.utexas.edu/) (Wyman et al., 2004) coupled with manual correction. The tRNAscan-SE software (v2.0, University of California, Santa Cruz, CA, USA) (Schattner et al., 2005) and DOGMA (Wyman et al., 2004) were used to identify the tRNA genes. The BLAST versus reference sequences was used for verifying boundaries genes, intron/exon and coding regions. Moreover, MEGA 6.0 was used to analyze the GC content (Tamura et al., 2013). The Organellar Genome DRAW (OGDRAW) (v1.2, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) (Lohse et al., 2007) was used with default settings to draw the gene map, which was then checked manually. Finally, a.sqn file was generated to submit the sequences to NCBI. The complete and correct CP genome sequences of the six Dracaena species were deposited in GenBank, accession numbers are MN200193 hokouensis), and MN200198 (D. terniflora) respectively.

Genome Structure and Comparative Genome analysis
The CodonW software (University of Texas, Houston, TX, USA) was used to investigate the distribution of codons based on the relative synonymous codon usage (RSCU) ratio (Sharp and Li, 1987). Conserved sequences were identified between the CP genomes of D. cochinchinensis and those of D. cambodiana, D. angustifolia, D. terniflora, D. hokouensis, and D. elliptica by BLASTN analysis with an E-value cutoff of 1e−10. The mVISTA (Frazer et al., 2004) program was used in Shuffle-LAGAN mode to compare the six Dracaena CP genomes using the D. cochinchinensis CP genome as a reference.

Repeat Sequence analysis
RSCU value, which is the ratio between the frequency of use and the expected frequency of a particular codon, was used to detected non-uniform synonymous codon usage within a coding sequence (Sharp and Li, 1987). Moreover, simple sequence repeats (SSRs) were detected using the MISA software (http://pgrc.ipk-gatersleben.de/misa/), with parameters set to encompass the number of repeat units of a mononucleotide SSR ≥10 nucleotides in length, followed by ≥5 and ≥4 repeat units for di-and tri-nucleotide SSRs, respectively, and ≥3 repeat units for tetra-, penta-, and hexa-nucleotide SSRs. The sizes and locations of repeat sequences in the CP genomes of the six Dracaena spp. were identified using REPuter (University of Bielefeld, Bielefeld, Germany) (Kurtz et al., 2001) with the parameters set to a similarity percentage of scattered repeat copies ≥90% and a minimal repeat size of 30 bp.

Phylogenetic analysis
To determine phylogenetic positions of the six Dracaena species within Liliaceae, we analyzed the CP genomes of 37 species, encompassing 31 additional taxa within this lineage. The CP genome sequences of 31 species were downloaded from NCBI. The sequences were initially compared using MAFFT (Katoh et al., 2005), followed by multiple sequence visual analysis and manual adjustment using BioEdit (Hall, 1999). We used the CP genomes of Stemona japonica (Blume) Miq. (NC_039675.1) as outgroups and constructed phylogenetic trees employing 37 CP genomes sequences using the maximum parsimony (MP) and neighbor-joining (NJ) methods in MEGA 6.0 (Tamura et al., 2013) with 1000 bootstrap replicates.

CP Genomes of Six Dracaena spp.
At present, complete CP genome sequencing is feasible with the genome skimming approach as a result of the wide application of next-generation DNA sequencing technology and bioinformatics software tools (Twyford and Ness, 2016). Illumina technology can be used for assembling the complete CP genome without the tedious of separating CP DNA from nuclear DNAs (Wang et al., 2016a). The raw data from the six Dracaena species is 7.04 Gb for D. cochinchinensis, 7.02 Gb for D. cambodiana, 6.59 Gb for D. angustifolia, 6.24 Gb for D. terniflora, 7.74 Gb for D. hokouensis, and 6.88 Gb for D. elliptica. The genome sequences assembled using the reads obtained from the Illumina sequencing platform ranged from 155,055 bp for D. elliptica to 155,449 bp for D. cochinchinensis (Table 1), which are similar to other Liliaceae CP genomes (Kim et al., 2016;Park et al., 2017;Liu et al., 2018). The genome exhibited typical cyclic tetramer structure, including the SSC and LSC regions separated by two IR regions (Figure 1), and the four regions from the six species had similar lengths ( Table S1). The SSC lengths ranged from 18,456 to 18,494 bp, which is larger than those from the six Ipomoea species (Park et al., 2018) and Strobilanthes cusia (Nees) Kuntze (Chen et al., 2018a). Meanwhile, the LSC lengths ranged from 83,621 to 83,907 bp, which is shorter than those from the six Ipomoea species and S. cusia. In addition, the IR lengths each ranged from 26,489 bp to 26,530 bp, which is larger than those obtained from S. cusia (Chen et al., 2018a), Taxillus chinensis (DC.) Danser , and T. sutchuenensis (Lecomte) Danser . Moreover, the GC contents of the six Dracaena CP genomes were similar, and the IR regions (42.9%) had higher GC contents than the single-copy regions (LSC: 35.4%-35.6% and SSC: 31.1%-31.2%). This result is consistent with previous reports (Park et al., 2018;Li and Zheng, 2018). The high GC contents in the IR regions may be attributable to the presence of four rRNA genes (rrn16, rrn23, rrn4.5, and rrn5) (Chen et al., 2018b). Furthermore, the six Dracaena CP genomes have similar gene contents, sequences, and orientations, which are the typical characteristics of higher plant CP genomes (Wicke et al., 2011;Qian et al., 2013). The genes in the six CP genomes consisted of 84 protein-coding genes, 38 tRNA genes, and 8 rRNA genes. Among them, seven protein-coding genes (rps19, rpl2, rpl23, ycf2, ndhB, rps7, and rps12), four rRNA genes (rrn4.5×2, rrn5×2, rrn16×2, and rrn23×2), and eight tRNA genes (trnR-ACG, trnA-UGC, trnI-GAU, trnV-GAC, trnL-CAA, trnI-CAU, trnH-GUG, and trnN-GUU), as well as the IR regions, contained two repeat units, similar to Ligularia intermedia Nakai (Chen et al., 2018b) and Artemisia annua L. (Shen et al., 2017). Moreover, 18 of the genes contained introns, and among them, 10 protein-coding genes (petB, petD, atpF, ndhA, ndhB, rpoC1, rps12, rps16, rpl16, and rpl2) and 6 tRNA genes contained a single intron, and 2 protein-coding genes (clpP and ycf3) contained two introns ( Table 2). Introns play an important role in the regulation of gene expression by enhancing the expression of exogenous genes at specific sites and at specific times in plants (Xu et al., 2003). In total, the coding regions including protein-coding genes, tRNAs, and rRNAs occupied 57.35%-58.37% of the six Dracaena CP genomes, while the non-coding regions constituted 41.63%-42.68% of the genomes.  Table S2. The CP protein-coding genes of these six species contained 61 codons encoding 20 amino acids. The coding regions (CDS) comprised 25,710 codons in D. elliptica to 26,248 codons in D. terniflora and D. hokouensis. Codons for isoleucine (5.63%~5.89%) and lysine (5.41%~5.66%) were the most abundant in the six Dracaena CP genomes, whereas those for cysteine (0.37~~ 0.39%) and arginine (0.44~~0.48%) were observed least often, at lower frequencies than in artichoke (Curci et al., 2015) and the six Ligularia species (Chen et al., 2018b). An RSCU value <1.00 indicates that the codon usage frequency is lower than expected, whereas an RSCU value >1.00 indicates that the codon usage frequency is higher than expected (Sharp and Li, 1987). In this study, other than leucine and isoleucine, amino acid codons in the CP genomes of the six Dracaena spp. preferentially ended with A or U (RSCU >1). Codons ending in A and/or U accounted for 69.76% (D. cochinchinensis) to 72.87% (D. terniflora) of all CDS codons in the six CP genomes. These results are similar to those observed for Papaver rhoeas L.and Papaver orientale L. . The codon usage pattern may be determined by the high proportion of A/T composition bias. In other terrestrial higher plant CP genomes, high codon preference in codon usage, especially A/T bias, is very common (Kim and Lee, 2004;Qian et al., 2013). This phenomenon indicates that stable CP evolution is beneficial for protecting important CP genes from harmful mutations and adaptation to selection stress (Wang et al., 2016b;Ivanova et al., 2017;Zuo et al., 2017).
FIGURe 2 | Repeat analysis in six Dracaena complete chloroplast (CP) genomes. REPuter was used to identify repeat sequences with length ≥30 bp and sequence identified ≥90% in the CP genomes. F, P, R, and C indicate the repeat types F (forward), P (palindrome), R (reverse), and C (complement), respectively. Repeats with different lengths are indicated in different colors.
Frontiers in Pharmacology | www.frontiersin.org November 2019 | Volume 10 | Article 1441 FIGURe 3 | Structure comparison of the six Dracaena CP genomes by using the mVISTA program. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cut-off value of 70% identity was used for the plots, and the Y-scale represents the percent identity between 50% and 100%.

SSRs in the CP Genomes of the Six Dracaena spp.
SSRs are widely present in CP genomes, consisting of 1-6-nucleotide repeat units, and are valuable molecular markers of high variation within the same species (Powell et al., 1995).
It is well known that the repeat sequences are associated with plastome tissues, mostly in the intergenic and intron regions, and only a small portion is distributed in the genetic region (Salih et al., 2017;Zhou et al., 2017). In the present study, SSRs were less abundant in the CDS (15~16 SSRs) of the six Dracaena CP genomes than in the non-coding region. In addition, the types and distributions of the potential SSRs were investigated. The most abundant type was repeated mononucleotide (54.69%~61.97%), which were found 14~15 times in the six Dracaena species. These were followed by dinucleotide (19.72%~23.44%), trinucleotide (4.23%~4.69%), tetranucleotide (12.68%~14.06%), pentanucleotide (0%~4.35%), and hexanucleotide repeats (1.41%~2.99%). Among these repeats, the A/T repeat was the most abundant motif, followed by AT/TA dinucleotide repeats, then AAAT/ATTT tetranucleotide repeats. These results were consistent with previous reports that CP SSRs usually consist of short poly-A or poly-T repeats and rarely contain tandem G or C repeats in many plants (Kuang et al., 2011). In addition, with the exceptions of D. cochinchinensis and D. cambodiana, the other four species had no pentanucleotide SSRs (Table 3). At present, SSR markers are widely used in genetic diversity and population structure assessments, comparative genomics, development of genetic maps, and marker-assisted selective breeding (Chen et al., 2015;Zhou et al., 2018;Zhang et al., 2017b). The repeats identified in this study will provide valuable resources for species identification and population studies of Dracaena spp.

Comparative Genome analysis
Comparative analysis of CP genome is advantageous for understanding the genetic diversity and evolutionary relationships of plants in different environments (Daniell et al., 2016;Zhao et al., 2018). The sequence homology of the six Dracaena CP genomes was compared and analyzed using the mVISTA software, and D. cochinchinensis was used as a reference sequence (Figure 3). The comparison revealed that some differences between D. cochinchinensis and the other five Dracaena spp. existed within intergenic spacers (IGS), such as rps12-trnV and psaI-ycf4; and the CP genomes of the other five Dracaena spp. were not significantly different., only some differences existed in the ycf1 gene and intergenic regions, such as trnS -trnG, rpoB-trnC, rpl16-rps3, psbE-petL, and ndhF-rpl32. As shown in Table 4, the gene with the most variable sites in coding regions is ycf1, which has 73 variable sites and 35 indels within the 5,466 bp alignment. the intergenic regions with the most variable sites is rps7 → trnV-GAC, which has 59 variable sites and 16 indels within the 2,716 bp length. On the whole, the variable sites of the six Dracaena CP genomes are more than some other species (Zhang et al., 2017c). Furthermore, we found that the majority of sequence variations were within the LSC and SSC regions, whereas the IR regions had relatively fewer sequence variations. This result further supports the findings of previous studies: the two IR regions are more conserved than the LSC and SSC regions (Nazareno et al., 2015;Li et al., 2017;Lu et al., 2017). This may be because mutations in the IR sequences are corrected by gene conversion (Khakhlova and Bock, 2006). In addition, the sequence variation in the non-coding regions was higher than in the gene coding regions in the CP genomes of the six Dracaena spp., which is consistent with previous findings for most gymnosperms Lu et al., 2017;Zhao et al., 2018).

Identification and Phylogenetic analysis of Dracaena Species
The highly variable regions of CP genomes can be used as potential DNA barcodes for species identification and phylogenetic analyses, such as in Fritillaria species . Our results showed high similarity among the six sequences. The largest The CP sequence is crucial for studying phylogenetic relationships and determining taxonomic status among angiosperms (Jansen et al., 2007). To identify the phylogenetic positions of the six Dracaena spp. within the Liliaceae, we obtained 30 complete CP genome sequences belonging to 20 genera of the family Liliaceae (including two species each from Aletris, Allium, Aloe, Asparagus, Hosta, Yucca, Polygonatum, Fritillaria, Lilium, and Paris, and one species each from Anemarrhena, Chlorophytum, Maianthemum, Rohdea, Cordyline, Tricyrtis, Chionographis, Trillium, Veratrum, and Gloriosa) and one species belonging to Stemonaceae from the RefSeq database (Table S5). In the MP (Figure 4) and NJ (Figure S1) tree, the six species of the genus Dracaena are clustered together and separated from the other 20 genera of the family Liliaceae, indicating the close relationship of the six Dracaena species. Meanwhile, the six Dracaena spp. were distinct from one another, with support values ≥85%; D. angustifolia and D. terniflora were clustered together, D. hokouensis exhibited a sister relationship, and D. cambodiana and D. cochinchinensis were clustered together. The results showed that the CP genomes can be used to identify the six Dracaena species. In addition, we explored the construction of a phylogenetic tree based on 36 complete Liliaceae CP genomes sequence multiple alignments in present study. Our results will provide an important reference and foundation for species identification and phylogeny of Liliaceae plants.

CONCLUSION
This study reported the CP genomes from six Dracana species, range from 155,055 bp (D. elliptica) to 155,449 bp (D. cochinchinensis), and the structure and composition of the CP genomes are highly similar. The CP genomes genes consisted of 84 protein-coding genes and 38 tRNAs, with 8 rRNA genes in the six genomes. Among the six Dracana species, D. angustifolia had the closest relationship with D. terniflora, D. cochinchinensis had the closest relationship with D. cambodiana respectively. The MP and NJ tree showed that the CP genome can be used to identify the six Dracana species and is expected to become a super-barcode for the identification of Dracana species.

aUThOR CONTRIBUTIONS
ZZ and XM conceived and design the study. XM acquired the funding. ZZ and YG collected samples and determined the species. YZ performed the genome assembly and analysis on the data. ZZ wrote the manuscript. XM supervised the manuscript. All authors have read and approved the final manuscript.