Genome-Wide Transcriptional Excavation of Dipsacus asperoides Unmasked both Cryptic Asperosaponin Biosynthetic Genes and SSR Markers

Background: Dipsacus asperoides is a traditional Chinese medicinal crop. The root is generally used as a medicine and is frequently prescribed by Chinese doctors for the treatment of back pain, limb paralysis, flutter trauma, tendon injuries, and fractures. With the rapid development of bioinformatics, research has been focused on this species at the gene or molecular level. For purpose of fleshing out genome information about D. asperoides, in this paper we conducted transcriptome analysis of this species. Principal Findings: To date, many genes encoding enzymes involved in the biosynthesis of triterpenoid saponins in D.asperoides have not been elucidated. Illumina paired-end sequencing was employed to probe D. asperoides's various enzymes associated with the relevant mesostate. A total of 30, 832,805 clean reads and de novo spliced 43,243 unigenes were obtained. Of all unigenes, only 8.27% (3578) were successfully annotated in total of seven public databases: Nr, Nt, Swiss-Prot, GO, KOG, KEGG, and Pfam, which might be attributed to the poor studies on D. asperoides. The candidate genes encoding enzymes involved in triterpenoid saponin biosynthesis were identified and experimentally verified by reverse transcription qPCR, encompassing nine cytochrome P450s and 17 UDP-glucosyltransferases. Specifically, unearthly putative genes involved in the glycosylation of hederagenin were acquired. Simultaneously, 4490 SSRs from 43,243 examined sequences were determined via bioinformatics analysis. Conclusion: This study represents the first report on the use of the Illumina sequence platform on this crop at the transcriptome level. Our findings of candidate genes encoding enzymes involved in Dipsacus saponin VI biosynthes is provide novel information in efforts to further understand the triterpenoid metabolic pathway on this species. The initial genetics resources in this study will contribute significantly to the genetic breeding program of D. asperoides, and are beneficial for clinical diagnosis and treatment.


INTRODUCTION
Dipsacus asperoides is a traditional Chinese medicinal crop of the Dipsacaceae family. Its root (Supplementary File S1) gets its name "Dipsaci radix" from the capacity to heal broken bone, and is considered "top grade" as early as Shen Nong's Herbal Classic. More detailed, related information can be found in the Pharmacopoeia of the People's Republic of China (Beijing, 2010, Vol I, p. 309). Later generations often use it as a prescription with other compatible medicines. It has been used clinically for the treatment of osteoporosis, lassitude in loin and legs, fractures, abortion and dysmenorrheal diseases, as well as Alzheimer's disease and cancer (Zhang et al., , 2003Corcelle et al., 2006;Wong et al., 2007;Zhu et al., 2009;Seifert-Klauss and Prior, 2010;Niu et al., 2015). Thus, it is a promising medicinal plant.
Over recent decades, the demand for D. asperoides has continued to rise. D. asperoides is a perennial herb (Supplementary File S2), and there is an increasing disparity between its long growth cycle and excessive harvesting of wild populations. To alleviate this conflict, on the one hand, it is imminent to focus on breeding improved varieties at molecular level in order to respond to a variety of coercive influences; on the other hand, it is also urgently needed to produce the bioactive ingredient via genetic engineering to meet the ever-growing demand for this herbal alternative.
Currently, although next generation sequencing (NGS; Suter et al., 2015) has been very broadly applied to RNA-Seq in a large number of plant species, de novo (Moreton et al., 2015) transcriptome sequencing still has not utilized the reference approach to excavate sufficient useful genomic information (Grabherr et al., 2011). This is due to the absence of reference genomes for non-model plants. In this study, we made some improvements by assembling together all the clean reads into a transcriptome and used as reference sequences in the followup analysis. As expected, RNA-Seq results gave many clues concerning genetic and molecular marker information for D. asperoides. As the first report on whole genome sequencing of D. asperoides, the sequencing results will significantly help the analysis of a genome-wide study in this species.
Although it has been over thousand years since D. asperoides was used as an important common traditional Chinese medicine, its various active ingredients have still not been elucidated. In Europe, North Africa and Asia, D. asperoides is one kind of widely distributed herb. In China, Yunnan, Sichuan, Hubei and Hunan Provinces are the main origins of D. asperoides. Existing studies show that D. asperoides has a variety of pharmacological activities, which are primarily attributed to saponin compounds, the active components in root of D. asperoides. Triterpenoid saponins (Supplementary File S3) (Liu et al., 2011) such as HN saponin F, Dipsacus saponin VI and Macranthoidin A are the principal active ingredients in this plant. Akebiasaponin D (i.e., Dipsacus saponin VI) is an important quality indicator of D. asperoides (Liu et al., 2010). However, the metabolic pathways of this compound are still unknown (even though extensive research on this compound has been reported).
The purpose of this study was to examine the transcriptome of D. asperoides using Illumina second-generation sequencing platforms (Fu et al., 2013), as well as to mine all genes encoding enzymes involved in biosynthetic pathways of Dipsacus saponin VI. The proposed synthetic routes were shown in Figure 1. NGS (Strickler et al., 2012) technologies allow us to dissect the entire transcriptome of specific species without model plants (Grabherr et al., 2011), and subsequently enable us to access information about biological pathways and disease mechanisms. This information also included gene function, single nucleotide polymorphisms (SNP; Somers et al., 2003) calling, Simple Sequence Repeat (SSR; Ramsay et al., 2000) markers of one species, and so on. The present study will contribute to the improvement of genetic diversity in germplasm resources of D. asperoides and the pharmacological biosynthesis of the active components of this plant in vitro.

Reads Acquirement and De novo Assembly
Based on cDNA library construction, Illumina Genome Analyzer IIx produced 30,832,805 clean reads with 97.28% of Q20 percentage (Cock et al., 2010). Thus, of all clean reads, the proportion with 99% correctly identification accounted for 97.28%-an ideal sequencing result. All sequencing reads were entered in the NCBI web site and could be accessed with the short read archive number of SRA269859. Using Trinity software (Grabherr et al., 2011), the clean reads were then assembled into 73,036 contigs (Seong et al., 2015) with total length of 59,560,527 bp. The lengths of all contigs covered a range of 201-7591 bp, with a mean length of 815 bp and a N50 size (Earl et al., 2011) of 1262 bp. All of the above contigs were assembled into 43,243 unigenes with a total length of 31,420,741 bp. The range of the lengths of all genes was similar to the contigs, with a mean length of 727 bp and a N50 size of 1212 bp. All relevant Illumina paired-end sequencing and assembly data are summarized in Table 1.
All unigenes were compared with the whole set of public protein databases using BLASTX (E-value from 1e-10 to 1e-3), i.e., Nr (NCBI non-redundant protein sequences), Nt (NCBI non-redundant nucleotide sequences), Swiss-Prot (A manually annotated and reviewed protein sequence database), KOG (Eukaryotic Ortholog Groups), and GO (Gene Ontology; Mudado Mde and Ortega, 2006). The open reading frame (ORF; Doerks et al., 2002) of each gene was logically generated after successfully compared. Otherwise, Expressed Sequence Tags (EST; Pashley et al., 2006) were used for coding sequence (CDS) prediction to acquire the amino acid sequence of the ORF. Approximately 64.32% of the 43,243 genes possessed their own CDS (27,813). The lengths of all CDS covered a range of 42-7146 bp. The size distributions among Contigs, Unigenes and CDS are given in Figure 2. These results demonstrated that, even though this non-model plant has not been energetically annotated, we still might quickly and economically perform the transcriptome analysis via a NGS approach.

Functional Annotation
Annotation percentages of D. asperoides unigenes after being compared with the public databases are summarized in Table 2. FIGURE 1 | Putative Dipsacus saponin VI biosynthesis pathway in D. asperoides. Enzymes found in earlier studies are surrounded by a black box; new putative enzymes are surrounded by a red box. Enzymes involved in the pathways are: AACT, acetyl-CoA acetyltransferase; HMGS, hydroxymethylglutaryl-CoA synthase; HMGR, hydroxyl methylglutaryl-CoA reductase; MVK, mevalonate kinase; PMK, phosphomevalonate kinase; MVD, mevalonate diphosphate decarboxylase; GPPS, geranylgeranyl pyrophosphate synthase; FPPS, farnesyl diphosphate synthase; IPPI, isopentenyl diphosphate isomerase; SS, squalene synthase; SE, squalene epoxidase; β-AS, β-amyrin synthase; and β-A28O, β-amyrin 28-monooxigenase. The overlapping parts and exclusive sections of 43,243 integrity unigenes among the four databases (Nr, Swiss-Prot, GO, and KOG) are shown in Figure 3. There were 6098, 1, 1394, and 11 unigenes annotated exclusively in these four databases, respectively. 3578 unigenes were annotated in every of the seven public databases. When comparing all unigenes to those from earlier studied species (Vitis vinifera, Ricinus communis, Populus trichocarpa and Glycine max), 12,304 (28.45%), 3203 (7.41%), 3550 (8.21%), and 1797 (4.16%) annotated unigenes with high similarity were obtained, respectively. Thus, the D. asperoides genome was closer to V. vinifera than the other species from earlier studies. Notably, we also found a positive correlation between the length of unigenes and the hit percentage to sequences from the Nr and Swiss-Prot databases-longer unigenes possessed a greater proportion of hits (Supplementary File S4). In the Nr database, unigenes >1000 bp had higher E-values (1e-100) than those <500 bp (1e-50), showing a higher hit percentage: 20,257 (73.54%) unigenes possessed identity of ≥60% with proteins. In the Swiss-Prot database, 12,930(66.43%) unigenes had identity ≥50%. E-values (Lolkema and Slotboom, 2005) represent the possibility that the similarity between the other sequences and the target sequence is larger than that of the sequence in random case. So the lower the score, the better the result.

Gene Ontology (GO) Classification
GO is a classification system to comprehensive descript genes and their products. So far, three independent ontology databases have been built up: biological process (BP), molecular function (MF) and cell component (CC) in GO database. The three ontologies below still possess independently different sub levels, down to form a ontology tree structure. Based on the results of Nr and the Pfam protein database annotations, 20,145 unigenes were successfully hit in one or more ontologies. The detailed descriptions were that 51,266 unigenes were packaged into BP, 36,716 into CC and 25,491 into MF (Figure 4). In a total of 55 ontologies in the three categories, the GO classification distributed all the unigenes to their own functional categories with a high degree of identity. Cellular process (11,987, 47.02%) was the largest group belonging to BP. Most unigenes in CC and MF were allocated to the cell (7517, 29.49%) and the binding (11,787, 46.24%) ontologies, respectively. Without being limited to the above mentioned three categories, all these gene functional annotations will contribute to further exploration of relevant biological genetic information of D. asperoides.

Functional Classification by KOG
Eukaryotic Ortholog Groups (KOG) annotation system is based on the lineal homologous relationship of the eukaryotic genes in NCBI. KOG which combine with the evolutionary relationships of homologous genes from different species identify and divide those genes into different orthologs (Horiike et al., 2016) clusters. KOG currently have 4852 orthologs. Different genes hold same function if they belong to the same ortholog cluster, naturally, the other members from the same KOG ortholog can directly inherit the functional annotation. Overall, 10,056 of 43,243 unigenes matched to KOG database were clustered into 26 functional orthologs ( Figure 5). The largest ortholog was assigned to the general function prediction only cluster (1629 unigenes, 16.10%), followed by posttranslational modification, protein turnover and chaperones (1396, 13.88%), signal transduction mechanisms (932, 9.27%), translation, ribosomal structure, and biogenesis (764, 7.60%).

KEGG Classification
Kyoto Encyclopedia of Genes and Genomes (KEGG) can systematic analyze for the metabolic pathways of gene products and compounds in cells as well as the functions of these gene products. After being subjected to KO annotations, genes could be sorted according to their participation in the KEGG metabolic pathways, and therefore, the function of genome or transcription of newly sequenced species can be definitely clarified. KEGG pathway database is a cross network diagram collecting various metabolic pathways. The transcripts acquired in this study were aligned with the KEGG database and classified into five branches: Cellular Processes (A), Environmental Information Processing (B), Genetic Information Processing (C), Metabolism (D), and Organismal Systems (E  To more thoroughly explore the genes involved in metabolic pathways, the unigenes were mapped to the KEGG metabolism pathway (D). As described above, 4376 genes participated in a total of 130 metabolism pathways of the KEGG database (Evalue ≤ 1e-5). Among these genes, carbohydrate metabolism (829, 18.94%) was ranked as the first sub-branch in terms of gene numbers (Figure 6A), followed by overview (574, 13.12%) and amino acid metabolism (548, 12.52%), energy metabolism (533, 12.18%), metabolism of other amino acids (241, 5.51%) (Supplementary File S7). Taking into account asperosaponin VI is a kind of triterpenoids from the secondary metabolites, we found, the other secondary metabolites mapped to 15 pathways and metabolism of terpenoids and polyketides discriminatively assigned to 12 pathways covered the equally abundant unigenes (202, 4.62%). Undoubtedly, in the two pathways, those unigenes encoding special proteins may be related to the synthesis of Dipsacus saponin VI.
To further explore the clues, we conducted in-depth analysis to two sub-branches, the biosynthesis of other secondary metabolites and metabolism of terpenoids and polyketides. In the former, Phenylpropanoid biosynthesis pathway (ID: ko00940) containing 128 genes was the dominant group, followed by flavonoid biosynthesis (24, ID: ko00941), tropane, piperidine, and pyridine alkaloid biosynthesis (23, ID: ko00960), stilbenoid, diaryl heptanoid, and gingerol biosynthesis (21, ID: ko00945). In these comparison results, the relevant information about those putative unigenes involved in asporasoponin VI synthesis were still not plainly found, we speculated that the reason may be associated with the complexity of asporasoponin VI precursor.
Most specifically in the metabolism of terpenoids and polyketides, we conducted deep search. The most extensive coverage was terpenoid backbone biosynthesis (73, ID: ko00900; Figure 6B). Carotenoid biosynthesis was the second-largest group (25, ID: ko00906). The third was sesquiterpenoid and triterpenoid biosynthesis (20, ID: ko00909), followed by diterpenoid biosynthesis (18, ID: ko00904), limonene and pinene degradation (18, ID: ko00903), monoterpenoid biosynthesis (11, ID: ko00902). Obviously, clusters 1st and 3rd played major roles in biosynthesis of the main bioactive triterpenoid compounds in D. asperoides. This result will be helpful to identify and clarify the function of related enzymes encoded by those genes in the synthesis pathway of asporasoponin VI.

Candidate Genes Involved in Triterpenoid Synthesis
Because triterpenoid saponins are the main active medicinal ingredients of D. asperoides, we expanded deep-mining to locate clearly and definitely related candidate genes along the chain of saponin synthesis. Combining the earlier reports on triterpenoid saponin profiles with our study on Illumina sequence analysis of D. asperoides, all unigenes encoding enzymes involved in this kind of triterpenoid biosynthesis were revealed.
In addition, even though a one-step reaction catalyzing the synthesis from β-amyrin to oleanolic acid by β-amyrin 28monooxigenase was initially reported by previous studies, it was revealed that the said reaction is not just one step (Pollier and Goossens, 2012) after all. As oleanolic acid is the precursor for synthesis of the triterpenoid saponins of D. asperoides, it is necessary to refine this reaction so that the oleanolic acid profile can be utilized more quickly and easily to produce beneficial drugs for human beings. Therefore, we carried out the related comprehensive survey (Huang et al., 2012;Pollier and Goossens, 2012) and successfully aligned the transcriptome dataset of D. asperoides with the four reported reference genes involved in oleanolic acid biosynthesis in NCBI using BLAST (Supplementary File S8). These reference genes (CrCYP716AL1, MtCYP716A12, VvCYP716A17, and VvCYP716A15) share the unique EC number of 1.14.13.cn and each of them can all form one of the same three enzymes of β-amyrin 28monooxygenase (β-A28O), erythrodiol dehydrogenase (ED), and oleanolic aldehyde 28-monooxygenase (OA28-O; Fukushima et al., 2011). Six candidate genes had high identities of ≥46% with the four reference genes, so they phylogenetically close to the CYP716 family. Scilicet, formation of oleanolic acid can be accomplished with a three-step reaction on the basis of βamyrin catalyzed by β-A28O, ED, OA28-O. In this process, β-A28O catalyzes the conversion of β-amyrin to erythrodiol, then oleanolic aldehyde and oleanolate can be yielded via catalysis of ED and OA28-O, respectively. The above statements exhibited the vertical pathway associated with the formation and conversion of the oleanolic acid. The primary metabolic genes encoding all the enzymes, both known and unknown, involved in the intermediate metabolites up to hederagenin were matched to the corresponding genes from D. asperoides. Those candidate CYP450 genes together with the above six genes were used to construct a phylogenetic relationship with reported CYP450 genes from typical plants (Figure 7). Because nearly all genes from the hederagenin pathway had more than one peer in the transcript dataset, we presumed that D. asperoides experienced genome replication incidents during anagenesis. Most of all candidate unigenes exhibited high identity to Medicago truncatula or other genuine herbaceous plant genes, with >80% homology at the protein level, thus these genes remained highly conserved during evolution.
Since Dipsacus saponin VI is the primary active ingredient in D. asperoides, there must exist glycosylases catalyzing glycosylations (Shi et al., 2011) substituted hydroxyl groups of hederagenin by one of four compounds: glucose from UDP-D-glucose, arabinose from UDP-L-arabinose, rhamonose from UDP-L-rhamonose, and xylose from UDP-D-xylose. Obviously, it is rather hederagenin-type (Sarikahya and Kirmizigül, 2012) substrates than oleanane-type  triterpenoid substrates which are bonded hydroxyl groups on the kind of carbon skeleton precursors (Liu et al., 2011). We screened the RNA-Seq dataset, and found new genes participating in glycosylation of hederagenin. These new unigenes encode UDP-glucosyltransferase (18), arabisyltransferase (1), xylosyltransferase (13), and rhamsyltransferase (4) (data not shown). As these genes encoding triterpenoid conversion enzymes have not been previously reported, their bioactive verification will be performed in the near future.
NCBI or Swiss-Prot databases (Figure 8). The alignment results demonstrated a close evolutionary relationship due to the high degree of similarity. The majority of these unigenes had nearly 100% orthology with reference genes, accordingly, we could infer that the candidate unigenes might have similar functions to the reference genes.
It is worth noting that triterpenoid biosynthesis also includes multiple, parallel sub-pathways, which is caused by the fact that C-3 and C-28 of hederagenin can accept different glycones as substrates. Consequently, triterpenoid biosynthesis seems to more resemble a complex grid than a linear route.

Verification and Analysis of Differentially Expressed Putative Genes Involved in Dipsacus Saponin VI Synthesis
In order to confirm the Dipsacus saponin VI related genes actually enrich in the roots, the expression level of 26 related genes in the root, stem and leave of D. asperoides were analyzed by reverse transcription quantitative real-time PCR (RT-qPCR; Derveaux et al., 2010 ; Figure 9). All the PCR reactions with the designed primers produced specific single bands corresponding to each calculated size in agarose gel electrophoresis (data not shown). In this study, we set the gene expression level in leaves of each plant as 1. These gene expression levels in roots or stems were employed to compare with those in leaves from the same plant. The results showed that these genes exhibited different expression levels in the root, stem, and leaves. Quite a number of these genes were highly expressed in the roots. While the rest of these unigenes exhibited a lower expression level in roots, with highly expressed in stems or leaves, we speculated that might be associated with synthesis of rare saponins. Although the detailed information regarding other special saponins biosynthesis were lacking, not accidentally, the genes were supposed to be abundant in stems and leaves in order to synthesize the special sponins.
In the first nine genes (i.e. deduced CYPs in Dipsacus saponin VI synthesis), five unigenes showed the significantly highest expression level in roots rather than stems and leaves from one plant. Remarkably, the multiples covered a range from several fold to one hundred fold. Likewise, in the last 17 genes (i.e. putative UGTs in Dipsacus saponin VI synthesis), eight genes manifested similar results. Therefore, our results were not unexpected, and indicated that roots may be the main organ for synthesizing Dipsacus saponin VI. Clearly, these enriched genes exhibit higher activities in roots. Despite of the lack of information in Dipsacus saponin VI synthesis currently, we still could deduce that, these enriched unigenes may encode key enzymes for the biosynthesis of Dipsacus saponin VI. The results of qRT-PCR expression analysis will help further study the mechanism of Dipsacus saponin VI biosynthesis.

Mining and Identification of SSR Marker
With the development of bioinformatics, SSRs (or microsatellites) have been increasingly utilized in molecular marker screening and genetic breeding studies (Kantartzi et al., 2009). To better understand whether both the genetic diversity and genetic information are closely associated with the biological properties of D. asperoides, we sought SSR motifs (Cardle et al., 2000) in all of the unigenes using the MISA search tool. SSR mining in the transcriptome resulted in a total of 4490 SSRs   Table 4. Mono-type motifs were the highest proportion of all motif types (1541, 34.32%), closely followed by tri-(1383, 30.80%), di-(1377, 30.67%), tetra-(151, 3.36%), penta-(24, 0.53%), and hexa-types motif (14, 0.31%). Statistically, 1380 dominant tandems occurred in tri-type motifs with 5-8 times of repeats, followed by 1261 (28.08%) in mono-type with 9-12 times, 1159 (25.81%) in di-type with 5-8 times, and the fewer tandem repeats were not shown. Specifically, 422 (9.40%) T-motif continuing 10 times was the most frequently distributed in the transcriptome. Additionally, the most frequent tri-nucleotide was five GGAs; the most common tetra-nucleotide was nine TTTC units. The fragments of cascaded penta-and hexa-nucleotide types were rarely found, even once (penta-nucleotide repeated 9-12 times). The overall distribution of identified SSRs is shown in Table 5.
All of these widely distributed markers will be effective resources for studies at the genetic level and vigorous contribution to genetic diversity in germplasm resources.

CONCLUSIONS
In the first attempt to reveal biological information on D. asperoides using the Illumina GAII transcriptome technology, we obtained 43,243 unigenes of high quality, as shown by 97.28% Q20 bases, although without a reference genome sequence. All these unigenes were successfully annotated in public databases. This result presages that mysterious ingredients of many medicinal plants will no longer unfathomable (Rowlands et al., 2015), with the sequencing results of Salvia miltiorrhiza, Catharanthus roseus, and other plants (Fulcher and Riha, 2015) being successively reported.
Our study found a large number of candidate genes potentially involved in Dipsacus saponin VI biosynthesis and a total of 3648 SSRs markers. The experimental validation of the putative unigenes exhibited their ample accumulation in roots of D. asperoides. These data constitute a new valuable resource for the studies on major metabolism, especially triterpenoid biosynthesis pathway, as well as marker development. These initial bioinformatics will likewise contribute to and promote future research in genetic breeding and synthetic biology (Escobar-Zepeda et al., 2015) on D. asperoides.

Ethics Statement
No specific approvals were required for the described studies. No special permissions were demanded for the locations and activities. The field studies did not involve endangered or protected species.

Plant Materials and Total RNA Extraction
D. asperoides was cultivated in farmland in Xiangtu township Madden town, Jiangtou village, Dali Bai minority Autonomous Prefecture, Jianchuan County (Zhilei Co. Ltd; East longitude 99 • 32 ′ 05 ′′ , north latitude 26 • 16 ′ 11 ′′ , altitude 2459 meters). Specimens from five different tissues of the D. asperoides, including young roots, tender shoots, young leaves, mature leaves, stems, flower buds, were obtained from ∼3-year-old entire plants, thereby ensuring the probability of covering all the genome. The specimens, acquired from local authorities, were promptly frozen in liquid nitrogen and stored at −80 • C until RNA extraction.
Total RNA was extracted with RN12-PLANTeasy Reagent Kit (Aidlab Biotechnologies Co. Ltd., Beijing, China) and purified utilizing a RN14-RNAclean Reagent Kit (Aidlab) according to the manufacturer's instructions. The precise concentration of RNA was measured using Qubit R 2.0 Fluorometers (Life Technologies, USA; Simbolo et al., 2013). RNA integrity and quality were assessed using the RNA Nano 6000 Assay Kit of an Agilent Bioanalyzer 2100 (Agilent Technologies, USA) and agarose gel electrophoresis, respectively. Equal amounts of total RNA from each tissue were pooled together for cDNA library preparation.

cDNA Library Preparation and Sequencing
The poly (A) + RNA was enriched from 20 µg of total RNA pool using magnetic beads cohering Oligo (dT) 25 . Fragmentation buffer was then added to mRNA samples to break mRNA into short segments. The fragmented mRNA samples were used as templates to synthesize the single-strand cDNA with six base random primers (random hexamers). Double-strand cDNA was synthesized by sequentially adding buffer, dNTP, DNA polymerase I, and RNase H, followed by purification using AMPure XP beads. Purified double-strand cDNA was endrepaired, A-tailed. Fragment size selection was performed in AMPure XP beads. Finally, PCR amplification was performed and PCR products were purified in AMPure XP beads to obtain the desired library.
The library was preliminary quantified immediately via Qubit R 2.0 Fluorometer (Callaway et al., 2014) and diluted to 1.5 ng/µl, and then the insert size (150-200 bp) of library was detected using the Agilent 2100. If the result was consistent with expectations then the effective concentration (>2 nM) of the library was accurately quantified using Real-time Quantitative PCR. Ultimately, the high quality cDNA libraries were entirely pooled for Illumina paired-end sequencing platform at Novogene Co.Ltd. (Beijing, China). The accession number of the originally high quality reads in the NCBI SRA database is SRA2 69859.

Reads Dissection and Assembly
The reads that either contained adapters, or had the Ns proportion (N represents a base of information cannot be determined) of ≥10% of the entire reads, or were low-quality reads (The bases number Qs ≤ 5 accounting for more than 50% of the entire reads.) were all removed using Trinity software (Haas et al., 2013) with the parameter K-mer = 25 (Kim et al., 2014). Raw reads were joined together into a clean read. Trinity software was employed to perform de novo assemblies of all clean reads. The clean reads containing overlapping areas between them could be assembled into a contig. Via the re-alignment between reads and contigs, different contigs from the same transcript as well as the distance between these contigs could be determined by paired-end reads. Ultimately, the longest unigenes were obtained by connecting and extending the contigs from the same transcript in both ends using paired-end reads. In due course, contigs and unigenes were combined for statistical analysis.

Function Annotation and Predicted CDSs
Unigenes were aligned against a full set of public protein databases using BLASTX (Altschul et al., 1997) with E-value threshold of 1e-10 to 1e-3: the Nr (NCBI non-redundant), Nt (NCBI nucleotide sequences), Swiss-Prot (Pruitt et al., 2005), Pfam (Protein family; Finn et al., 2008), Kyoto Encyclopedia of Genes and Genomes (KEGG; Kanehisa et al., 2004Kanehisa et al., , 2006, and gene ontology (GO) databases (Harris et al., 2004). The software and methods of functional annotation used for alignment of unigenes to each database were as follows: NCBI blast 2.2.28+ in Nr, Nt, Swiss-Prot, and KOG; HMMER 3.0 package in Pfam; KEGG Automatic Annotation Server in KEGG; and Blast2 GO v2.5 and the writing script in GO functional annotation according to molecular function, biological process and cellular component ontologies. By comparing with the KEGG (Kanehisa et al., 2004) database, we acquired the biological functions and the pathway annotations of the genes.
To further predict and rank possible functions, we attentively analyzed the result of the above comparison. If the comparison was successful, the ORF information box containing the CDS (Hsu and Chen, 2012) of this transcript could be extracted from the results, then this conserved domain could be translated into an amino acid sequence (from 5 ′ to 3 ′ ) in accordance with the standard codon table; if the unigene was not mapped to the above protein databases, the CDS could be predicted by Estscan (Iseli et al., 1999;3.0.3 version) software, thereby obtaining the nucleic acid or amino acid sequences of this conserved domain.

Phylogenetic Analysis
In connection with the deduced amino acid sequences of CYPs and UGTs from D. asperoides and other plants, we performed phylogenetic analysis. Clustal X (Selvaraj and Sarma, 2008; using default parameters) was applied to compare all of the above amino acid sequences. Afterwards, phylogenetic distances were analyzed utilizing MEGA5.10 (Tamura et al., 2011) with the parameters set as follows: neighbor-joining in the statistical method, bootstrap method for phylogeny test with 500 replications, uniform rate among sites, complete deletion for gaps/missing data treatment, amino acid for substitution type and the Poisson model. A phylogenetic tree was logically built up. The bootstrap values among the above mentioned amino acid sequences after 500 replications were displayed above each bifurcation. The default distance scale length used was 0.2 amino acid substitutions per site.

Differentially Expression of Putative Genes Involved in Dipsacus Saponin VI Synthesis
Twenty-six genes with assumed activity in Dipsacus saponin VI synthesis were selected for verification using reverse transcription qPCR. All specific primers of these unigenes used for the RT-qPCR assay were designed with primer premier software (Qin et al., 2009;version 6.0) and listed in Supplementary File S9. Total RNA was extracted individually from young roots, stems and young leafs of 1-year-old plant using a modified CTAB method (Zhang et al., 2012) and DNA was removed with RNA purification kit (Tiangen, China). No more than one microgram of total RNA was reverse transcribed to cDNA in total volume of 20 µL in the presence of oligo (dT) 18 primer according to the protocol of Thermo. The standard curve for each gene was portrayed by real-time PCR with gradient diluted cDNA. The quantitative reaction was performed in 20 µL, containing 10 µL 2 × SYBR Green Master Mix (Toyobo, Japan), 500 nM final concentration of each primer and 1 µL 10-fold diluted cDNA template.
The PCR reactions were run in Roche LightCycler R 96 System using the following program: 95 • C for 60 s and 40 cycles of 95 • C for 15 s, annealing and extension at 60 • C for 60 s. Immediately, the specificity of the individual PCR amplification was validated with Melt curve protocol after the PCR reaction and agarose gel electrophoresis. Three technical replications per three different organs respectively from different plants were performed. Expression levels of four reference genes (actin, GAPDH, 18S, and β-tubulin) in the three organs were analyzed to the degree of conservatism and actin gene was logically granted as an internal control for normalization. Quantifying the relative expression of the genes in three different organs was calculated using the delta-delta Ct method as described by Livak and Schmittgen (Livak and Schmittgen, 2001). All data were represented as the mean ± SD after normalization.

Mining and Identification of SSR Marker
Utilizing MIcroSatellite (MISA, http://pgrc.ipk-gatersleben.de/ misa/; Thiel et al., 2003) software, the cryptic SSRs containing motifs ranging from mono-to hexa-nucleotides in size were mined from the 43,243 examined sequences. The minimum of contiguous repeat units were arranged as follows: nine for mono-nucleotides and five for di-, tri-, tetra-, penta-, and hexa-nucleotides. After determining the SSR markers, Primer3 (Untergasser et al., 2012) software (2.3.5 version, the default parameters) was utilized to design primer-pairs flanking each SSR.

AUTHOR CONTRIBUTIONS
Conceived this research: SY, JW. Analyzed the data: GZ, JW, ZG. Contributed plant materials: YL, MH, JC, QH. Wrote the paper: JW, YL. All authors approved the manuscript.

ACKNOWLEDGMENTS
We acknowledge the financial supports for this work from the Science and Technology Plan Projects of Yunnan Province (Grant No. KX141140 and No. 2014RE022) and the project of young and middle-aged talent of Yunnan province (Grant No. 2014HB011). Funders were not involved in conceiving the study, data acquisition, determination to issue or manuscript preparation. Simultaneously, we acknowledge the Novogene Bioinformatics Technology Co. Ltd. serving sequencing services. We are also extremely grateful to the editors for seriously reviewing and providing constructive suggestions on our manuscript.