Candidate Genes Involved in the Biosynthesis of Triterpenoid Saponins in Platycodon grandiflorum Identified by Transcriptome Analysis

Background: Platycodon grandiflorum is the only species in the genus Platycodon of the family Campanulaceae, which has been traditionally used as a medicinal plant for its lung-heat-clearing, antitussive, and expectorant properties in China, Japanese, and Korean. Oleanane-type triterpenoid saponins were the main chemical components of P. grandiflorum and platycodin D was the abundant and main bioactive component, but little is known about their biosynthesis in plants. Hence, P. grandiflorum is an ideal medicinal plant for studying the biosynthesis of Oleanane-type saponins. In addition, the genomic information of this important herbal plant is unavailable. Principal findings: A total of 58,580,566 clean reads were obtained, which were assembled into 34,053 unigenes, with an average length of 936 bp and N50 of 1,661 bp by analyzing the transcriptome data of P. grandiflorum. Among these 34,053 unigenes, 22,409 unigenes (65.80%) were annotated based on the information available from public databases, including Nr, NCBI, Swiss-Prot, KOG, and KEGG. Furthermore, 21 candidate cytochrome P450 genes and 17 candidate UDP-glycosyltransferase genes most likely involved in triterpenoid saponins biosynthesis pathway were discovered from the transcriptome sequencing of P. grandiflorum. In addition, 10,626 SSRs were identified based on the transcriptome data, which would provide abundant candidates of molecular markers for genetic diversity and genetic map for this medicinal plant. Conclusion: The genomic data obtained from P. grandiflorum, especially the identification of putative genes involved in triterpenoid saponins biosynthesis pathway, will facilitate our understanding of the biosynthesis of triterpenoid saponins at molecular level.


INTRODUCTION
Platycodon grandiflorum (Jacq.) A. DC. is a perennial flowering plant of the Campanulaceae family and the only species of the genus Platycodon. It is a well-known medicinal plant in China and other East Asian countries and has been traditionally used as a medicine and food additive for various respiratory diseases, including bronchitis, asthma, tonsillitis, pulmonary tuberculosis and other inflammatory diseases (Takagi and Lee, 1972;Kim et al., 1995;Shin and Lee, 2002). Oleanane-type triterpenoid saponins are the main chemical components of P. grandiflorum, mainly including platycodin D, D2, D3, deapioplatycodin D, D2, polygalacin D and platyconic acid A (Kim J.W. et al., 2013). In addition to their natural effects, these triterpenoid saponins have various pharmacological activities, such as anti-inflammatory, anti-cancer, immune enhancing effects and preventing chemicals-induced hepatotoxicity (Lee et al., 2004(Lee et al., , 2008Kim et al., 2008Kim et al., , 2012aKhanal et al., 2009). Especially, chemical investigation of P. grandiflorum has revealed that platycodin D is the most abundant and the main bioactive component (Shin et al., 2009;Xie et al., 2009;Kim et al., 2012b).
Triterpenoid saponins are a group of mostly studied compounds in plants, and their biosynthesis has been extensively studied and described (Haralampidis et al., 2002;Yendo et al., 2010;Augustin et al., 2011;Moses et al., 2014a). The direct precursor of triterpenoid saponins is 2, 3-oxidosqualene which is synthesized via the mevalonic acid (MVA) pathway (Haralampidis et al., 2002). Three key enzymes are involved in the biosynthesis of these saponins: oxidosqualene cyclases (OSCs), cytochrome P450 monooxygenases (P450s) and uridine diphosphate-dependent glycosyltransferases (UGTs; Figure 1, Supplementary Table S4). The most important progress in the biosynthesis of triterpenoid saponins is achieved in Panax species (Araliaceae family), which contains a special group of triterpenoid saponins, i.e., ginsenosides. Three P450s in Panax ginseng have been functionally characterized, they are protopanaxadiol synthase (PPDS, CYP716A47), which catalyzes the conversion of dammarenediol-II to protopanaxadiol (Han et al., 2011), protopanaxatriol synthase (PPTS, CYP716A53v2) catalyzing the conversion of protopanaxadiol to protopanaxatriol (Han et al., 2012), and β-A28O (CYP716A52v2) catalyzing the conversion of β-amyrin to oleanolic acid (Han et al., 2013). Recently, two UGTs (PgUGT74AE2 and PgUGT94Q2) have also been characterized in P. ginseng which are involved in the biosynthesis of ginsenoside Rg3 and Rd (Jung et al., 2014). Even though the biosynthesis of some ginsenosides or their aglycones have been well-documented and can be conducted in a yeast fermentation system (Dai et al., 2014;Jung et al., 2014), the biosynthesis of triterpenoid saponins in different plant species is far from conclusive.
Despite many genes encoding enzymes involved in the biosynthesis of the triterpenoid saponins have been identified from Panax species (Sun et al., 2010;Chen et al., 2011;Luo et al., 2011;Li et al., 2013), information about those genes in P. grandiflorum is still lacking (Kim Y.K. et al., 2013). Although the pharmacological activity of platycodin D has been investigated (Kim et al., 2012a,b;Hwang et al., 2013;Li et al., 2014), a complete biosynthesis pathway of platycodin D has not been elucidated, especially the last two steps. At present, the genomes or transcripts of about 46 species of medicinal plants have been sequenced, which will lead to an efficient way of deciphering novel gene functions involved in specific metabolic pathways in medicinal plants (Misra, 2014). Characterization of these novel genes will be useful for investigating the synthesis of platycodins in P. grandiflorum. The objective of the present study was to characterize the transcriptome of P. grandiflorum using Illumina HiSeq TM 2000 sequencing platform in order to uncover the candidate genes encoding enzymes involved in the triterpene saponin biosynthetic pathway, especially in oleananetype saponins biosynthesis, and to screen molecular markers of SSRs for facilitation the marker-assisted breeding of this species.

Illumina Sequencing and De Novo Assembly
The root tissue of P. grandiflorum was used for transcriptome sequencing and analysis because roots have traditionally been used for medicinal purpose. A cDNA library was constructed from total RNA of P. grandiflorum roots, and sequenced using Illumina paired-end sequencing technology. After removal of adaptor sequences, ambiguous reads and low-quality reads (Q20 < 20), a total of 58,580,566 clean reads were obtained. The Q20 percentage (sequencing error rate < 1%) and GC percentage were 97.04 and 45.51%, respectively. An overview of the sequencing and assembly statistics is shown in Table 1. The high quality reads obtained in this study have been deposited in the NCBI SRA database (accession number: SRA226668).
All the clean reads (58,580,566) were de novo assembled using the Trinity program into 50,408 transcripts consisting of 55,568,306 bp. The size of the transcripts ranged from 201 to 15,684 bp, with an average length of 1,102 bp and N50 length of 1,796 bp. Among these transcripts, 20,939 (41.54%) were longer than 1000 bp, and 19,808 (39.30%) were shorter than 500 bp (Figure 2). Using paired-end joining and gap-filling methods, these contigs were further assembled into 34,053 unigenes with an average length of 936 bp and an N50 length of 1,661 bp. There were 11,291 unigenes (33.16%) longer than 1,000 bp, and 4,202 unigenes (12.34%) longer than 2,000 bp (Figure 2).

Gene Ontology Classification
A total of 16,677 unigenes were characterized using GO analysis based on Nr annotation, including biological process, cellular component, and molecular function. There were 31,810 unigenes were grouped under cellular component, 21,705 unigenes under molecular function, 44,810 unigenes under biological process. Under the cellular component category, the majority of unigenes were involved in cell (6,586 unigenes, 20.29%) and cell part (6,579 unigenes, 20.27%). For the biological process class, the cellular process (10,127 unigenes, 22.50%) and metabolic process (9,737 unigenes, 21.63%) were the most abundant classes. In the molecular function category, binding (9,999, 46.07%) and catalytic activities (8,438, 38.88%) were predominant (Figure 3).

Candidate Genes Encoding Enzymes Involved in Triterpenoid Saponin Biosynthesis
The transcripts encoding all the known enzymes involved in triterpenoid saponin biosynthesis were discovered in this Illumina dataset, including AACT, HMGS, HMGR, MVK, PMK, MVD, GGPPS, FPPS, IPPI, SS, SE, β-AS, and β-A28O (Table 3). These findings were in accordance with the fact that P. grandiflorum contains high contents of oleanane-type saponins. Platycodin D is the main triterpenoid saponin in P. grandiflorum, the β-AS (seven unigenes) and β-A28O (one unigenes) were the key enzymes in the biosynthesis of platycodin D. Functional characterization of these unigenes will help us to understand the molecular mechanism of the biosynthesis of oleanane-type saponins in P. grandiflorum.
Uridine diphosphate-dependent glycosyltransferases catalyze the glucosylation of C3-and C28-carboxyl for the biosynthesis of triterpenoid saponins in P. grandiflorum (Figure 1). In the present study, 106 unigenes encoding UGTs were obtained (Supplementary Table S2), the phylogenetic relationship between these UGTs and characterized UGTs from other plants is depicted in Figure 7. Two unigenes (comp18634 c0 and comp20876 c0) were highly homologous to Barbarea vulgaris UGT73C11 and UGT73C10, which catalyze sapogenin 3-O-glucosylation (Augustin et al., 2012),  suggesting that both of them have the same function in P. grandiflorum. Two unigenes (comp18634 c0 and comp20876 c0) were closely related to Saponaria vaccaria UGT74M1, which is a triterpene carboxylic acid glucosyltransferase (Meesapyodsuk et al., 2007), suggesting that these two unigenes might catalyze the glucosylation of C28-carboxyl for the biosynthesis of triterpenoid saponins. Further studies are required to characterize functionally the aforementioned four unigenes in the biosynthesis of triterpenoid saponins in P. grandiflorum.

Tissue-Specific Expression of Genes Involved in the Biosynthesis of Triterpenoid Saponins
The qPCR analysis was used to investigate the tissue-specific expression patterns of 19 unigenes related to the triterpenoid saponin biosynthesis in this species. The expression pattern of these genes is shown in Figure 8. The unigenes encoding AACT, HMGS, MVK, PMK, MVD, FPPS, and SS were expressed at much higher level in leaves than in roots, young stems, and flowers (P < 0.05). The HMGR, IPPI, and SE genes showed very high expression in the flower tissue (P < 0.05). All genes mentioned above play a role in upstream biochemical reactions of the triterpenoid saponin pathway, and showed high expression at mRNA level in leaves and flowers, indicating that leaves are the factories for synthesizing the precursors of triterpenoid saponins. A high expression of β-A28O was observed in young stems (P < 0.05), but PD accumulated mainly in roots, indicating that young stems were the modification site of triterpenoid saponins before storage. UGT1 and UGT5 were expressed at much higher level in roots than in other tissues (P < 0.05), whereas the expression level of UGT1 and UGT2 was higher in P. grandiflorum as compared to that of

CONCLUSION
Transcriptome sequencing of P. grandiflorum was performed for the first time using Illumina next-generation sequencing technologies and a total of 34,053 unigenes were obtained. Particularly, 19 unigenes involved in the biosynthesis of triterpenoid saponins were identified, the expression of which was in a tissue-specific manner. These findings will not only provide valuable information for our complete understanding of the biosynthesis pathway of triterpenoid saponins in P. grandiflorum, but also provide opportunities for the de novo production of active ingredients by engineering microorganisms. Furthermore, this study will also contribute to the improvements on this species through marker-assisted breeding or genetic engineering.

Ethics Statement
No specific permits were required for the described field studies. No specific permissions were required for these locations and activities. The location was not privately owned or protected in any way and the field studies did not involve endangered or protected species.

Plant Materials
Two-years-old P. grandiflorum plants were collected from Jianchuan County, Yunnan province, southwest of China (Latitude: 26 • 16 13 N, Longitude: 99 • 32 4 E, Altitude: 2900 m). After morphological and molecular identification according to the reference (Kim et al., 2012a), the root tissues were collected, frozen immediately in liquid nitrogen, and stored at −80 • C until use.

RNA Library Preparation and Sequencing
Total RNA was extracted from roots by using Trizol reagent (Invitrogen), following by purification with RNeasy MiniElute Cleanup Kit (Qiagen) according to the manufacture's protocol. For mRNA library construction and deep sequencing, at least 20 µg of total RNA samples were prepared by using the NEBNext R Ultra TM RNA Library Prep Kit for Illumina sequencing on Hiseq 2000 platform at Novogene Bioinformatics Technology, Co. Ltd., (Beijing, China). The high quality reads obtained in this study have been deposited in the NCBI SRA database.

Transcriptome Data Processing and Assembly
The raw data processing was the same as described previously (Zhang et al., 2015). In brief, raw reads with adaptors and unknown nucleotides above 5% or those that were of low quality (containing more than 50% bases with Q-value ≤ 20) were firstly removed to obtain clean reads using a custom Perl script. Then the clean reads were de novo assembled using Trinity program (K-mer = 25, group pairsdistance = 300) with default parameters (Grabherr et al., 2011). Firstly, clean reads with a certain length of overlap were combined to form longer fragments without N, which were called contigs. These clean reads were then mapped back to the corresponding contigs with paired-end reads to detect contigs from the same transcript as well as the distances between contigs, and their paired-end information was also used to fill gaps or extend the sequences. Finally, these resultant sequences were clustered to remove redundant sequences using the TIGR gene Indices clustering tools (TGICL) to form longer sequences without N and cannot be extended on either end. Such sequences are defined as unigenes.

Functional Annotation and Predicted CDS
Functional annotations were performed as described previously (Zhang et al., 2015). Briefly, functional annotations were performed by sequence comparison with public databases, including the NCBI non-redundant nucleotide database 1 , nonredundant protein database, Swiss-Prot database 2 and the KOG  (Conesa et al., 2005) to perform GO annotation of unigenes. After achieving GO annotation for every unigene, WEGO (Ye et al., 2006) software was used to perform GO classification and draw GO tree. Moreover, the conserved domains/families of the assembled unigenes encoding proteins were searched against the Pfam database (version 26.0; Finn et al., 2014) using Pfam_Scan script.
The CDS for unigene was predicted by BlastX and ESTscan. The unigene sequences were searched against the Nr, KOG, KEGG, and Swiss-Prot protein databases using BLASTX (e-value < 10-5). Unigenes aligned to a higher priority database would not be aligned to lower priority database. The best alignment results were used to determine the sequence direction of unigenes. When a unigene could not be aligned to any database, ESTScan (Iseli et al., 1999) program was used to predict coding regions and determine sequence direction.

EST-SSR Detection and Primer Design
Potential SSR markers were detected among the 34,053 unigenes using the MISA tool 4 as described previously (Jiang et al., 2014). We searched for SSRs with motifs ranging from monoto hexa-nucleotides in size. The minimum of repeat units were set as follows: 10 repeat units for mono-nucleotide, six for di-nucleotides, and five for tri-, tetra-, penta-, and hexanucleotides. Primer pairs were designed using Primer3 5 with default parameters.

Phylogenetic Analysis
Phylogenetic analysis was performed based on the deduced amino acid sequences of CYP450 and UGT from P. grandiflorum and other plants. All of the deduced amino acid sequences were aligned with Clustal X with a gap opening penalty of 10, a gap extension penalty of 0.1, a delay divergent cutoff of 25%, and the other default parameters as described previously (Jiang et al., 2014). The evolutionary distances were computed using MEGA5.10 with the Poisson correction method. For the phylogenetic analysis, a neighbor-joining tree was constructed using MEGA5.0. Bootstrap values obtained after 1000 replications are indicated on the branches. The scale represents 0.1 amino acid substitutions per site.

Quantitative Real-Time PCR (qPCR) Analysis
Nineteen unigenes with potential roles in ginsenoside biosynthesis were chosen for validation using qPCR with gene specific primers designed with Primer3 software, as described previously (Zhang et al., 2015). All the primer sequences used for the qPCR analysis are shown in Supplementary Table S3. Total RNA from different organs (roots, stems, leaves, and flowers) of P. grandiflorum were extracted individually using Trizol Kit (Promega, USA) following the manufacturer's protocol. Subsequently, RNA was treated with 4 × g DNA wiperMix at 42 • C for 2 min to remove DNA. The purified RNA (1 µg) was reverse transcribed to cDNA using HiScript QRT SuperMix for qPCR (Vazyme, Nanjing, China). The qPCR reactions were performed in a 20 µl volume composed of 2 µl of cDNA, 0.4 µl of each primer, and 10 µl 2 × SYBR Green Master mix (TaKaRa) in Roche LightCycler 2.0 system (Roche Applied Science, Branford, CT, USA). 574 PCR amplifications were performed under the following conditions: 30 s at 94 • C, followed by 45 cycles of 94 • C for 20 s, 55 • C for 20 s, and 72 • C for 30 s. Three technical replications were performed for all qPCRs. The PMK gene, which was found in our transcriptome database, was chosen as 5 http://bioinfo.ut.ee/primer3-0.4.0/primer3/ reference control for normalization after the expression of three reference genes (actin, GAPDH, and PMK) was compared in different tissues. The relative changes in gene expression levels were calculated using the 2 − Ct method. For a given gene, the relative expression level was expressed as mean ± standard deviation (SD) of three determinations after normalization with the mRNA level of reference gene PMK. One way ANOVA with Tukey's test was used to compare the difference in the mean expression level of a given gene among different organs. P ≤ 0.05 was considered statistically significant.

AUTHOR CONTRIBUTIONS
This study was conceived by G-HZ and S-CY. The plant material preparation were carried out by M-RH and J-HS.
Z-JG, J-JZ, and WZ analyzed the RNA-Seq data. C-HM and G-HZ drafted the manuscript. J-WC and C-HM revised the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGMENT
This work was funded by the project of young and middle-aged talent of Yunnan province (Grant No. 2014HB011).