Identification of Ligularia Herbs Using the Complete Chloroplast Genome as a Super-Barcode

More than 30 Ligularia Cass. (Asteraceae) species have long been used in folk medicine in China. Morphological features and common DNA regions are both not ideal to identify Ligularia species. As some Ligularia species contain pyrrolizidine alkaloids, which are hazardous to human and animal health and are involved in metabolic toxification in the liver, it is important to find a better way to distinguish these species. Here, we report complete chloroplast (CP) genomes of six Ligularia species, L. intermedia, L. jaluensis, L. mongolica, L. hodgsonii, L. veitchiana, and L. fischeri, obtained through high-throughput Illumina sequencing technology. These CP genomes showed typical circular tetramerous structure and their sizes range from 151,118 to 151,253 bp. The GC content of each CP genome is 37.5%. Every CP genome contains 134 genes, including 87 protein-coding genes, 37 tRNA genes, eight rRNA genes, and two pseudogenes (ycf1 and rps19). From the mVISTA, there were no potential coding or non-coding regions to distinguish these six Ligularia species, but the maximum likelihood tree of the six Ligularia species and other related species showed that the whole CP genome can be used as a super-barcode to identify these six Ligularia species. This study provides invaluable data for species identification, allowing for future studies on phylogenetic evolution and safe medical applications of Ligularia.


INTRODUCTION
Ligularia Cass., belonging to the Senecioneae tribe of Asteraceae, comprises about 140 species of perennial herbs. These species are distributed in Asia and Europe, with a total of 123 species distributed in China, 89 of which are endemic (Liu and Illarionova, 1989). In China, Ligularia species are mainly distributed in mountainous areas in the southwest (Liu and Illarionova, 1989) and more than 30 Ligularia species have long been used in folk medicine (Wang, 2007). The roots, stems, leaves, and flowers of them contain various chemical compounds, such as

Illumina Sequencing and Genome Assembly
Approximately 5-10 µg of high-quality DNA were used to build shotgun libraries with insert sizes of 500 bp and were sequenced in accordance with the protocol for Illumina Hiseq X technology. The total raw data of the six species were produced with 150 bp paired-end read lengths. The software Trimmomatic (Bolger et al., 2014) was employed to filter low-quality reads from the raw data. After filtering for quality sequences, the remaining clean reads were used to assemble the CP genome sequences. The CP sequences of all plants downloaded from the National Center for Biotechnology Information (NCBI) were used to create a reference database. Then, the clean sequences were mapped to the database and the mapped reads were extracted on the basis of coverage and similarity. The extracted reads were assembled into contigs using SOAPdenovo2 (Luo et al., 2012). The scaffold of the CP genome was constructed using SSPACE (Boetzer et al., 2011), and the gaps were filled using GapFiller (Boetzer and Pirovano, 2012).

Validation, Annotation, and Sequence Submission
The accuracy of the assembly of the four boundaries (SSC, LSC, IRa, and IRb regions) of the CP sequences was confirmed through PCR and Sanger sequencing using validated primers (Supplementary Table S2). The thermocycler conditions for the PCR were as follows: 94 • C for 5 min; 94 • C for 30 s, 56 • C for 30 s, 72 • C for 1.5 min, and 32 cycles; 72 • C for 10 min. The online programs Dual Organellar GenoMe Annotator (DOGMA) (Wyman et al., 2004) and CPGAVAS  were used for the initial annotation of the CP genomes of the six species, followed by manual correction. The complete data from the study were submitted to NCBI under the BioProject ID PRJNA400300 and BioSample ID SAMN07562669. The assembled complete CP genome sequences of the six Ligularia species were submitted to NCBI GenBank with the accession numbers MF539929-MF539933, and MG729822.
were removed. The distribution of codon usage was studied using CodonW with the relative synonymous codon usage (RSCU) ratio (Sharp and Li, 1987). The online program Predictive RNA Editor for Plants suite (Mower, 2009) with a cutoff value of 0.8 were used to predict RNA editing sites in the six CP genomes of Ligularia species.

Phylogenetic Analysis
For identification purposes and to further phylogenetic research on this genus, we used mVISTA (Thompson et al., 1994) to compare six Ligularia species with L. hodgsonii as the reference genome. MEGA 6.0 was used to construct the phylogenetic tree with Platycodon grandiflorus and Adenophora remotiflora included as outgroups based on ML analysis. The details of the selected species (excluding the six Ligularia species) are presented in Supplementary Table S3.

CP Genome Structure of Six Ligularia Species
The raw data from the six Ligularia species is 9.1 Gb for L. intermedia, 7.2 Gb for L. hodgsonii, 7.4 Gb for L. jaluensis, 6.4 Gb for L. mongolica, 6.3 Gb for L. veitchiana, and 6.2 Gb for L. fischeri. The sizes of the six CP genomes range from 151,118 bp for L. mongolica to 151,253 bp for L. veitchiana, which are similar to other Asteraceae CP genomes (Liu et al., 2013;Salih et al., 2017;Wang et al., 2017;Zhang et al., 2017). The investigated genomes showed typical circular tetramerous structure, including an SSC region and an LSC region, separated by two IR regions (Figure 1).  (Lee et al., 2016). Our results showed that all six of the newly sequenced CP genomes have a GC content of 37.5%, which is lower than some Asteraceae species (Liu et al., 2013;Salih et al., 2017;Wang et al., 2017;Zhang et al., 2017). The GC content of four homologous regions of the six CP genomes is the same. However, the distribution of the GC content in each region is uneven. The GC content in the IR region is the largest (43.0%), followed by the LSC region (35.6%), and the region with the lowest GC content is the SSC region (30.7%). Our analysis showed that the high GC content in the IR region is attributed to four rRNA genes (rrn16, rrn23, rrn4.5, and rrn5).
The AT content of the first, second, and third position of proteincoding genes in the six CP genomes are 54.5-54.6%, 61.9-62.0%, and 70.1%, respectively. The higher AT content in the third site has also been observed in other plants (Yi and Kim, 2012;He et al., 2017;Zhou et al., 2017) and is usually used to distinguish DNA of CP, nucleus, and mitochondria origin (Clegg et al., 1994). Each of the six CP genomes contains 134 genes, including 87 protein-coding genes, 37 tRNA genes, eight rRNA genes, and two pseudogenes (ycf1 and rps19; Table 2). Seven protein-coding genes (ndhB, rp12, rp123, rps12, rps7, ycf15, and ycf2), seven tRNAs (trnA-UGC, trnI-CAU, trnI-GAU, trnL-CAA, trnN-GUU, trnR-ACG, and trnV-GAC), and all of the rRNAs (rrn16, rrn23, rrn4.5, and rrn5) are duplicated in the IR regions, which is similar to Artemisia annua (Shen et al., 2017) and Artemisia frigida (Liu et al., 2013). The CP genomes of the six Ligularia species contain a small 3.4 kb inversion within a large 23 kb inversion in the LSC region, which is a unique feature in Asteraceae (Kim et al., 2005;Liu et al., 2013). The LSC region included 62 protein-coding genes and 22 tRNA genes. The SSC region included 11 proteincoding genes and one tRNA gene (trnL-UAG). The CP genomes of each of these six Ligularia species did not have an inverted SSC region, which has also been found in the CP genomes of A. frigida (Liu et al., 2013), Scutellaria baicalensis (Jiang et al., 2017), Carthamus tinctorius (Lu et al., 2015), and Juglans L. . In contrast, the SSC regions of Helianthus annuus, Lactuca sativa (Timme et al., 2007), and Aster spathulifolius (Choi and Park, 2015) are inverted. The functional ycf1 copy is located in the IRb-SSC boundary and the pseudogene ycf1 copy is located in the IRa region. The functional rps19 copy is on the boundary of LSC and IRa and the pseudogene rps19 copy is on the IRb region. The coding region occupied 59.67-59.72% of the CP genomes of six Ligularia species, including protein-coding genes, tRNA genes, and rRNA genes. Meanwhile, non-coding regions, including introns, pseudogenes, and intergenic spacers occupied 40.28-40.33% of the CP genomes of the six Ligularia species.

Codon Usage and RNA Editing Sites
All protein-coding genes in the six Ligularia CP genomes are composed of 26,136-26,138 codons. The most and least universal amino acids of the CP genomes of the six Ligularia species are leucine (10.8%) and cysteine (1.1%), respectively (Figure 2). This is also similar to the CP genome from artichoke (Curci et al., 2015). However, the most universal amino acid from A. frigida is isoleucine (Liu et al., 2013). The most and the least abundant amino acids in the Taraxacum obtusifrons and Taraxacum amplum CP genomes are serine and methionine (Salih et al., 2017), respectively. Figure 2 shows that with the increase of specific amino acid codes the RSCU increases accordingly. Most of the amino acid codons have preferences, except for methionine and tryptophan. Potential RNA editing sites were predicted for 35 genes from the CP genomes of the six Ligularia species. Forty-eight RNA editing sites were identified. S to L of amino acid change appeared most frequently, while R to W and T to I occurred least. Each corresponding gene from FIGURE 3 | Repeat sequences in six CP genomes. REPuter was used to identify repeat sequences with length ≥30 bp and sequences identified ≥90% in the CP genomes. F, P, R, and C indicate the repeat types F (forward), P (palindrome), R (reverse), and C (complement). Repeats with different lengths are indicated in different colors. the RNA editing sites of the six Ligularia species is at the same nucleotide position (Supplementary Table S4). A total of 18 genes containing introns, including 12 proteincoding genes (atpF, clpP, ndhA, ndhB, petB, petD, rpl16, rpl2, rpoC1, rps12, rps16, and ycf3), and six tRNA genes (trnA-UGC, trnG-UCC, trnI-GAU, trnK-UUU, trnL-UAA, and trnV-UAC; Supplementary Table S5), were identified in this study. Nine protein-coding genes contain only one intron and three proteincoding genes (clpP, rps12, and ycf3) contain two introns. All six tRNAs contain only one intron. TrnK-UUU has the longest intron (2,556 bp), which contains matK. The clpP gene and ycf3 gene are both located in the LSC region. The rps12 gene is a special trans-splicing gene with the 5 exon located in the LSC region, but the 3 exon located in the IR region. This condition exists in many species, such as A. frigida (Liu et al., 2013), artichoke (Curci et al., 2015), and Aster spathulifolius (Choi and Park, 2015).

Long Repeats and SSRs in the CP Genomes From the Six Ligularia Species
Repeat sequences, which are related to plastome organization (Salih et al., 2017), are mostly distributed in intergenic regions and intron regions, and only a small fraction is present in the FIGURE 4 | Sequence identity plot comparing six CP genomes with L. hodgsonii as a reference using mVISTA. Gray arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cutoff of 70% identity was used for the plots, and the Y -scale represents the percent identity ranging from 50 to 100%. genetic region. Four types of long repeats were observed in the CP genomes of the six Ligularia species, including forward, palindromic, reverse, and complement repeats (Figure 3). The length of the repeat unit ranged from 30 to 48 bp. Ligularia intermedia and L. jaluensis both had 19 forward and 20 palindromic repeats. Ligularia hodgsonii had the following repeats: 18 forward, 20 palindromic, and one reverse. Ligularia mongolica had 18 forward and 20 palindromic repeats. Ligularia veitchiana had 20 forward and 21 palindromic repeats. Ligularia fischeri had the following repeats: 19 forward, 19 palindromic, and one complement. The long repeat sequences observed in the CP genomes of the six Ligularia species, with L. hodgsonii as the reference, are presented in Supplementary Table S6.
Simple sequence repeats, also called microsatellites, exist widely in the genome, and the sequences consist of one to six nucleotide repeat units (Powell et al., 1995). SSRs are widely used in studies on species identification, population genetics, and phylogenetic studies based on polymorphisms (Yang et al., 2011;Jiao et al., 2012;Xue et al., 2012). Four types of SSRs were found in the CP genomes from the six Ligularia species: mononucleotide (56.6-60.7%), dinucleotide (11.5-13.2%), trinucleotide (9.3-9.8%), and tetranucleotide (18.0-21.6%); the SSRs were mainly distributed in the noncoding region of the LSC and SSC. Of all these SSRs, the number of mononucleotide SSRs (A/T) is the largest, ranging from 29 in L. hodgsonii to 37 in L. veitchiana, enriching A and T in the CP genomes. The next most common SSR is dinucleotide (AT/AT), six dinucleotide SSRs in CP genomes of L. hodgsonii and L. mongolica and seven dinucleotide SSRs in other four CP genomes. All of the CP genomes from the six species have two trinucleotide AAG/CTT SSRs, one ATC/ATG trinucleotide SSR, and 11 tetranucleotide SSRs ( Table 3). The CP genome of L. veitchiana has three AAT/ATT trinucleotide SSRs, while the other five species only have two trinucleotide SSRs.

Identification and Phylogenetic Analysis of Ligularia Species
The CP genomes from the six Ligularia species are highly similar. Among the few variations, non-coding regions exhibited higher The IR regions of the six CP genomes are conservative regardless of the number and order of the genes. Previous research screened highly variable region from CP genomes as the potential DNA barcodes for authenticating species, such as Dioscorea (Ma et al., 2018) and Fritillaria species .
Sequence homology was investigated compared with the reference CP genome from L. hodgsonii using the mVISTA software (Figure 4). Our results showed high similarity among all sequences. Differences were observed in the intergenic regions of matK-trnK and ndhG-ndhI (Figure 4). There was only one variable site in the matK-trnK region and five variable sites in ndhG-ndhI region, but this is not enough to distinguish among the six Ligularia species. Because of the highly conservative sequences, the structure, and size of the CP genomes of Ligularia species, no obvious hypervariable region was screened. Thus, the complete CP genomes were considered to distinguish Ligularia species.
In addition to the six CP genomes sequenced in this study, 25 other CP genomes from Asteraceae were chosen to construct the phylogenetic tree, and P. grandiflorus and A. remotiflora (Campanulaceae) were included as outgroups (Figure 5). In the ML tree, we identified two main clades (clade A and B) excluding outgroup species. Six species of Ligularia were a monophyly with well-supported (100%). The support values in clade A were not less than 60%, and L. fischeri and L. jaluensis have a close relationship. Ligularia is most closely related to L. sativa, Saussurea involucrata, Centaurea diffusa, and Carthamus tinctorius. The results showed that the CP genomes can be used to identify the six Ligularia species.

CONCLUSION
This study reported the CP genomes from six Ligularia species, and the structure and composition of the CP genomes are highly similar. Like most Asteraceae species, the CP genomes of the six Ligularia species had a small 3.4 kb inversion within a large 23 kb inversion in the LSC region. The ML tree showed that the CP genome can be used to identify the six Ligularia species and is expected to become a super-barcode for the identification of Ligularia species.

AUTHOR CONTRIBUTIONS
HY conceived the study and acquired the funding. XC, YW, and BD collected samples and conducted the experiment. JZ and YC performed the genome assembly and analysis on the data. XC and JZ wrote the manuscript. All authors have read and approved the final manuscript.