Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. Our study provides the largest genetic resource that will lead to drug development and pave the way in deciphering various secondary metabolite biosynthetic pathways in P. amarus, especially those conferring the medicinal attributes of this potent herb.

Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) "unitranscript" sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. Our study provides the largest genetic resource that will lead to drug development and pave the way in deciphering various secondary metabolite biosynthetic pathways in P. amarus, especially those conferring the medicinal attributes of this potent herb.

INTRODUCTION
P. amarus Schum. and Thonn., a member of family Euphorbiaceae is used in the traditional system of medicine like Ayurveda for over the centuries because of its rich medicinal values and ethnomedicinal importance. P. amarus Schum. and Thonn. is a small, erect annual herb whose stem has a green, smooth capsule, and grows up to 10-50 cm high. Over the last few decades, P. amarus has gained global recognition for its medicinal properties after several studies that were conducted to understand its therapeutic potential, yielding exciting results. However, studies about the DNA or protein sequences of this species are very limited. P. amarus, popularly known as "bhuiamlaki" is distributed worldwide. In Spain, this plant is best known by the common name "chanca piedra, " which means stone-breaker. The most significant hepatoprotective role of P. amarus has long been reported (Thyagarajan et al., 1988;Blumberg et al., 1990). The genus Phyllanthus has prospective beneficial therapeutic actions in the management of hepatitis B, nefrolitiase, and in painful disorders (Calixto et al., 1998). Recent studies have also reported the hepatoprotection property of P. amarus (Chirdchupunseree and Pramyothin, 2010;Krithika et al., 2014). In addition to this, P. amarus has also shown to exhibit antioxidant (Harish and Shivanandappa, 2006), antitumor and anticarcinogenic activities (Rajeshkumar et al., 2002). Besides, anti-allodynic and antioedematogenic properties, as well as antimicrobial potentiality (Mazumder et al., 2006) of this herb, have also been reported. Report of α-amylase inhibitory properties of P. amarus in treating diabetes has also been shown (Ali et al., 2006). The wide variety of secondary metabolites, that attribute to these medicinal properties are present mainly in the leaves and include lignans mainly phyllanthin and hypophyllanthin (Chopra et al., 1956;Rao and Bramley, 1971;Somanabandhu et al., 1993) besides nirtetralin, niranthin, diarylbutane, nyrphyllin, and a neolignan, phyllnirurin; geraniin and flavonoids like quercetin, astralgin, quercetrin, isoquercetin, and rutin (Umezawa, 2003;Kassuya et al., 2006;Leite et al., 2006). It also contains minor compounds like hydrolysable tannins like phyllanthusiin D (Foo and Wong, 1992), amariin, amarulone (Foo, 1993), amarinic acid (Foo, 1995) and alkaloids like entnorsecurinine, isobubbialine, and epibubbialine (Houghton et al., 1996).
A number of reports addressed the genetic diversity of P. amarus for application in the cultivar identification using PCR and sequencing based techniques viz. RFLP, RAPD, ISSR, SCAR, and AFLP (Jain et al., 2003;Senapati et al., 2011;Bandyopadhyay and Raychaudhuri, 2013). Despite its global medicinal importance genomic sequence resources available for P. amarus are extremely scarce. As of July 2015, only 119 ESTs, 105 genome survey sequences, and 188 nucleotide sequences are available at the National Center for Biotechnology Information (NCBI) database. Out of the 119 ESTs, 57 sequences were reported in our previous study on P. amarus leaves (Chattopadhyay and Bose Mazumdar Ghosh, 2014). In view of this extremely limited genome sequence, an in-depth study of transcriptome might facilitate the analysis of functional genes and thereby unravel the transcripts involved in several biological processes and mainly help understand the various metabolic pathways involved in the phytotherapeutic attributes of P. amarus.
The recent emergence of the next-generation sequencing (NGS) technology has made the rapid transcriptome sequencing more feasible. Previous studies have shown that development of RNA sequencing (RNA-seq) methodology has facilitated the analysis of transcriptomes of a number of models as well as nonmodel crop and medicinal plants (Nakasugi et al., 2013;Lehnert and Walbot, 2014;Rastogi et al., 2014). The main advantage of RNA-seq compared with the whole genome sequencing is that only transcribed regions of the genome are analyzed in the former. It is among the most popular techniques of NGS and this methodology still remains the golden standard for both coding and non-coding gene annotation. RNA-seq method offers a comprehensive and integrated view of the transcriptome revealing SNPs, novel transcribed regions, as well as the precise location of transcription boundaries (Wilhelm et al., 2010). Furthermore, RNA-seq data with NGS technologies help in assessing the process of different forms of alternative splicing from both plant and mammalian genomes as well (Rogers et al., 2012). The approach of eukaryotic transcriptome analysis is expected to get highly altered by the advanced RNA-seq technology.
In the present study, an attempt has been made to annotate and analyze the leaf transcriptome of P. amarus since the vast array of secondary metabolites are present substantially in the leaf tissues. We performed de novo transcriptome sequencing using the Illumina Miseq platform as prior genome information on P. amarus is unavailable. Here for RNA-seq analysis, we chose a Miseq platform because compared to other Illumina platforms the longer length of the sequencing reads generated from the Miseq platform considerably enhances the accuracy of the subsequent de novo assembly, besides being a rapid and cost-effective platform for transcriptome assembly and analyses. To the best of our knowledge, this is the first report of de novo sequencing and transcriptome analysis of P. amarus which will serve for the discovery of different genes involved in various metabolic pathways, especially the putative members of medicinally important secondary metabolites biosynthetic pathways and also help in the development of molecular markers like Simple Sequence Repeat (SSR) for enhancing the medicinal traits of this herb. Shibpur, Howrah, as PA202 were chosen for our study. Leaf samples from young, healthy plants were collected. RNA was extracted separately from leaf samples of the two samplings using "Roche High Pure RNA Isolation Kit, " (Product no.11828665001). The purity and concentration of each RNA sample were checked by using the Agilent 2100 bioanalyzer (Agilent Technologies, USA) before proceeding to further downstream analyses. Library preparation was performed from 1 µg of total RNA, using Illumina's "TruSeq R RNA Sample Preparation v2 Guide" (Part # 15026495 Rev. F March 2014).

Illumina Sequencing and Quality Control
Illumina MiSeq system was used for sequencing the P. amarus leaf transcriptome library using Sequencing by Synthesis (SBS) technology. The library has been sequenced following manufacturer's instructions using the MiSeq Reagent Kit v2 (Part # 15034097 Rev. B). The base-calling pipeline of Illumina's MiSeq version was MiSeq Control Software 2.2.0-RTA 1.17.28.0-CASAVA-1.8.2 which was used to generate paired-end and single-end data in FastQ format. Low-quality bases result in misassemblies by interfering in the assembly process. Hence, quality filter is the first and foremost requisite for all downstream computational analyses and results interpretation . So additional quality control of raw data using FastQC was performed. The reads were preprocessed using Trimmomatic and SeqPrep software to obtain clean paired-end and single-end MiSeq data in a FastQ format which was also subjected to quality control using FastQC. The high quality, filtered reads were used for downstream analyses.

De novo Assembly and Clustering
Transcriptome de novo assembly was carried out on three levels using both paired and unpaired high-quality reads as inputs. Velvet (version 1.2.09) and Oases (version 1.2.09) were used for first level assembly and the clean reads were split into different "k-mers" from kmer27-kmer63. The transcripts obtained from all the "kmers" were merged and assembled at level 2 using Velvet and Oases. This level 2 assembled transcript was further assembled and clustered using CD-HIT (version 4.5.4) to remove redundant transcripts (Li and Godzik, 2006). This level 3 assembly and clustering represented the final dataset of clustered non-redundant (nr) unique sequences ("unitranscripts") for P. amarus leaf transcriptome.

Gene, Pathway Annotation, Classification
Functional annotations were performed by sequence comparison with public databases. For sequence similarity search, the annotation of unitranscripts was performed by BLASTX (Altschul et al., 1997) at NCBI using green plants (taxid: 33090) of nr protein database. The BLASTX results were imported to Blast2GO suite (Conesa and Götz, 2008) for mapping and retrieving GO terms to the assembled sequences, and further annotated with unique enzyme codes (EC) and KEGG maps (http://www.genome.jp/kegg) (Kanehisa and Goto, 2000;Kanehisa et al., 2014). GO terms are dynamic-structured, precisely defined controlled vocabulary that can be employed to describe functions of genes and gene products. These retrieved GO terms were allocated to query sequences and the extensive groups of genes present in P. amarus leaf transcriptome were classified into three categories -biological process, molecular function, and cellular component. Then WEGO (Ye et al., 2006) tool (http://wego.genomics.org.cn) was used to functionally classify GO terms and graphically represent the unitranscript functions at the macro level. Further BLASTX against the Clusters of Orthologous Groups database (http://www.ncbi.nlm. nih.gov/COG/) was performed to predict and classify functions of the assembled unitranscripts, using Autofact (http://megasun. bch.umontreal.ca/Software/AutoFACT.htm) tool (Koski et al., 2005).

Comparison of P. amarus Assembled Data with Closely Related Species
Comparison of P. amarus assembled unitranscripts was carried out with the most closely related species that were obtained after the functional annotation of the leaf transcriptome assembled data at NCBI green plants nr protein database. Comparison with the closely related species were performed using TRAPID analysis (Van Bel et al., 2013) with similarity search E-value 10e-5. Populus trichocarpa was the most closely related species followed by Ricinus communis. A comparison between both the species were performed and the latter being a member of the same spurge family, Euphorbiaceae to which P. amarus belongs.

Identification of Transcription Factor Families and Its Domain Architecture
For identification of transcription factor (TF) families and domain mapping represented in P. amarus leaf transcriptome, the representative unitranscripts were enquired against the TF protein sequences at Plant TF database (PlnTFDB; http://plntfdb. bio.uni-potsdam.de/v3.0/downloads.php) by BLASTX with an E-value cutoff 1E-06.

Identification of Simple Sequence Repeats (SSRs)
SSRs were detected using MIcroSAtellite Identification Tool (MISA). For estimation of mRNA or unitranscript abundance of the major identified secondary metabolic pathway genes in the present study, FPKM values were determined. FPKM values for the unitranscripts were determined using the formula FPKM = (10 9 × C)/(N × L), where C = Number of reads mapped to a unitranscript; N = Total mapped reads in the experiment; L = unitranscript length in base pairs. Further, RT-PCR enables the detection and identification of target mRNA transcripts. Hence, to validate our dataset, some of the assembled P. amarus unitranscripts that share sequence similarity to various secondary metabolite biosynthetic pathway genes and related TFs as identified revealing putative information of P. amarus leaf transcriptome were selected for performing RT-PCR. All primers for RT-PCR of selected secondary metabolite pathway genes were designed from the final assembled and clustered nr unique sequences (Supplementary File 1). The housekeeping gene actin was used as a control. Actin primers were designed (Acc No.: X63603) and the primer sequences were 5 ′ -CGCGAAAAGATGACTCAAATC and 5 ′ -AGATCCTTTCTGATATCCACG-3 ′ . The RT-PCR products were electrophoresed on 1.2% agarose gel containing ethidium bromide.

De novo Assembly and Clustering of P. amarus leaf Transcriptome
The Illumina Miseq platform generated a total of 18,060,997 raw reads for P. amarus leaf transcriptome that accounted for approximately 9 Gbases of sequence data. The raw data was also deposited in the National Center for Biotechnology Information's (NCBI) Short Read Archive (SRA) database under the accession number PRJNA248079. Raw reads were further subjected to quality control (FastQC). After quality and adaptor trimming using Trimmomatic 0.30 tool, 14,608,389 high quality paired reads, 2,291,081 (unpaired Reads R1) and 371,454 (unpaired Reads R2) were retained. All these filtered paired and unpaired reads were used in the transcriptome assembly. The summary of the filtration of total raw reads generated after RNA-seq and used further for transcriptome assembly is illustrated in Supplementary Figure 1. Filtered paired and unpaired reads were assembled using Velvet (version 1.2.09) and Oases (version 1.2.09) and the clean reads obtained were split into different k-mers from kmer27-kmer63. As a function of k-mer various output parameters were analyzed in our level 1 assembly. These parameters included-total number of transcripts, transcripts with length 100 bp and above, N50 length, longest transcript length, and average transcript length. The number of clean transcripts obtained in each kmer along with its total length, the average size and N50 value are summarized in Supplementary  Table S1. For accuracy in P. amarus leaf transcriptome de novo assembly, we assembled and further merged the transcripts from kmer27-kmer63 using Velvet and Oases with long read option to obtain 360,405 transcripts in level 2 single merged assembly. To remove redundancy of the merged assembled transcripts, we used CD-HIT (version 4.5.4) to merge the level 2 assembled sequences further. Merging of the assembled transcripts resulted in 85,927 unitranscripts with maximum and minimum read lengths being 13,600 and 200 bp respectively, with an average size of the assembled unitranscripts being 1548 bp which indicated an increased coverage as well as the depth of our sequencing data. The outline of the level 2 and 3 assemblies has been summarized in Supplementary Table S2. The sequence length along with BLASTX hit and e-value distribution of P. amarus unitranscripts have been shown in Figure 1.

Similarity Search and Functional Annotation of P. amarus Leaf Transcriptome
P. amarus being a non-model plant of medicinal value without any prior genome information, sequence similarity search and comparison for the assembled unitranscripts of P. amarus leaf transcriptome was carried out by BLASTX against the green plants of nr protein database at NCBI, with an Evalue cut off 1E-06 (Supplementary File 2). The BLASTX hit results showed that about 60.58% of the annotated descriptions were uninformative (e.g., "unknown, " "unnamed, " "putative, " or "hypothetical" protein) as a result of inadequate P. amarus genome information. Our dataset showed that the percentage of uninformative BLASTX hit results were nearly similar to endangered medicinal plant species Chlorophytum borivilianum (Kalra et al., 2013). Also, 15,265 out of 85,927, i.e., 17.76% unitranscripts were without any hits in plant nr database. In our dataset unitranscripts showed significant similarity to Populus trichocarpa, followed by Ricinus communis, Theobroma cacao and so forth ( Figure 1B). BLAST2GO suite was then used for functional annotation using the BLASTX results. Graphical representation of different levels of annotations of P. amarus unitranscripts by BLAST2GO showing mapping against different databases (UniprotKB, TAIR, etc.), annotation score distribution of assembled unitranscripts, sequence similarity distribution, distribution of annotated sequences with length, GO level distribution of annotated unitranscript sequences have been shown in Figure 2. Blast2GO is a suitable tool for plant genomics research, especially for the large-scale functional annotation and data mining of novel sequence data of non-model species. BLASTX results were used for mapping to retrieve GO terms and further annotate to retrieve the EC (EC number). To reveal molecular interaction network and metabolic pathways, KEGG pathway annotation for the assembled unitranscripts of P. amarus leaf transcriptome was performed by mapping the sequences obtained from BLAST2GO to the contents of the KEGG metabolic pathway database. Annotation summary of the assembled P. amarus unitranscripts has been specified in Supplementary Table S3.

Sequence Similarity and Comparison of P. amarus Data with Related Species
Functional annotation of the assembled unitranscripts of P. amarus leaf transcriptome at NCBI green plants nr protein database showed high similarity to Populus trichocarpa and Ricinus communis. So comparison of the assembled data was performed with both these species. Both the species are known to possess medicinal values like anti-inflammatory, analgesic being common to both besides other properties and also R. communis belong to the same spurge family, Euphorbiaceae like P. amarus, as already mentioned. So we aimed to compare our sequenced assembled data with both the species using TRAPID analysis, which aids in the generation of detailed gene catalogs, especially for non-model species. Out of the total 85,927 unitranscripts, 71,896 (83.7%) unitranscripts showed similarity to P. trichocarpa while that of 71,358 (83%) unitranscripts

Functional Classifications by Gene Ontology (GO)
The GO database is a significant web resource in the bioinformatics field. GO provides a set of dynamic, controlled and structured vocabularies for describing the roles of genes and their products in any organism (Ashburner et al., 2000). The three categories of the GO database are-biological process, molecular function and cellular component. P. amarus unitranscripts with nr annotations were functionally annotated with "GO terms" by BLAST2GO suite. Further WEGO software was used for the GO functional classification of the assembled P. amarus unitranscripts at the macro level.
A total of 20,582 P. amarus unitranscripts were assigned to 69,394 GO terms and one unitranscript was assigned more than one GO term. The majority of GO terms was assigned to molecular function (29,125, 41.97%), followed by biological process (27,577, 39.74%) and cellular component (12,692, 18.29%) was the least.
Regarding the cellular component ontology, "cell" (GO: 0005623), "cell part" (GO: 0044464), and "organelle" (GO: 0043226) were the most representative category. Under molecular function ontology, results showed a high percentage of genes from "binding" (GO: 0005488) and "catalytic activity" (GO: 0003824). Some percentage of genes were also involved in "antioxidant activity" (GO: 0016209) in the molecular function ontology as well. Moreover, biological process ontology contained mainly genes involved in "metabolic process" (GO: 0008152), "cellular process" (GO: 0009987). Figure 3 shows the categorization of P. amarus unitranscripts into three main ontologies and 47 sub-groups.

Pathway Analysis by KEGG
Biological pathway studies play a key role in gaining insight into the advanced studies of genomics. KEGG is a highly integrated database providing information of the biological systems and their relationships at the molecular, cellular and organism levels, particularly via the KEGG pathway maps (Kanehisa et al., 2008). KEGG pathway annotations and EC were generated (Supplementary File 4) from the assembled unitranscripts of P. amarus leaf transcriptome that were mapped with GO terms. In total, 4,697 P. amarus unitranscripts were assigned to 134 KEGG maps and 583 EC and these EC were then used to retrieve and color the KEGG pathway maps to represent the putatively identified genes involved in several metabolic pathways. Interestingly, in our dataset it was seen that more than one unique sequence of P. amarus leaf transcriptome was annotated as the same enzyme. Enzymes encoded by annotated unitranscripts were grouped into the 5 major pathways in the KEGG pathway database (Figure 5A)-"metabolism" (9323 unitranscripts), "genetic information processing" (78), "environmental information processing" (191), "organismal systems" (303), "human diseases" (2). Apparently "metabolism" being one of the most significant and the most highly represented category in our study led to the in-depth analysis of this and has been represented in Figure 5B. In our dataset, it was seen that the maximum number of unitranscripts fell under the "carbohydrate metabolism" (2424 unitranscripts) followed by "amino acid  metabolism" (1740 unitranscripts), "lipid metabolism" (1143 unitranscripts), "nucleotide metabolism" (637 unitranscripts). Nucleotide metabolism plays a vital role in plants for metabolism and development like other organisms. Besides, although lipid metabolism closely relates to oil plants mostly, and since P. amarus is a plant of therapeutic importance, lipid metabolism is also associated with plants in general (Mazliak, 1973). A number of P. amarus unique sequences (1332 unitranscripts) were involved in secondary metabolism as well (Supplementary Table S5). Out of all secondary metabolite pathways, "flavonoid biosynthesis" pathway was shown to be encoded by the highest number of P. amarus assembled unitranscripts (134 unitranscripts) followed by "phenylpropanoid biosynthesis" (125 unitranscripts). The entire functional KEGG pathway categorization of P. amarus leaf transcriptome unitranscripts have been shown in Supplementary Table S6.

Lignan Biosynthetic Genes
Phenylpropanoids which comprise a large group of plant-based natural compounds is derived from phenylalanine (Michal, 1999). Phenylpropanoid biosynthesis pathway starts with the formation of cinnamic acid from phenylalanine, which leads the formation of cinnamoyl-CoA, p-coumaroyl-CoA, feruloyl-CoA, and sinapoyl-CoA. These CoA-activated compounds are starting metabolites for the synthesis of lignans, flavonoids, flavones, and flavonols, etc. In the present study KEGG analyses of P. amarus leaf transcriptome sequences revealed the presence of 11 genes involved in the biosynthesis of different compounds of phenylpropanoid pathway (Figure 6).
The major lignans reported in P. amarus-phyllanthin and hypophyllanthin are known to possess significant hepatoprotective and antioxidant properties. But the exact biosynthetic pathway leading to the formation of phyllanthin and hypophyllanthin is still under investigation. The structural similarity between the skeleton of secoisolariciresinol and phyllanthin is suggestive of the derivation of phyllanthin/hypophyllanthin from secoisolariciresinol. The presence of pinoresinol/lariciresinol reductase gene was indicated due to the presence of (+) secoisolariciresinol in species of Phyllanthus (Umezawa et al., 1997). In our leaf transcriptomic data, one unique sequence (unitranscript 77577) was assigned the pinoresinol reductase activity gene ontology term (GO: 0010283) after mapping the assembled P. amarus unitranscripts into BLAST2GO suite. Another six unique sequences (unitranscripts 63241; 63242; 63243; 63244; 63245; and 63246) encoding phenylcoumaran benzylic ether reductase (PCBER) -like protein were annotated the lignan biosynthetic process GO term (GO: 0009807). Besides, PCBER has also shown to have high sequence similarity with PLR, i.e., pinoresinol/lariciresinol reductase (Gang et al., 1999;Vander Mijnsbrugge et al., 2000), showing the active involvement of P. amarus leaves in lignans biosynthesis and thus complementing its phytotherapeutic properties.
Recent reports show naringenin to possess neuroprotective effect against Parkinson's disease-related pathology (Lou et al., 2014), iron-induced neurotoxicity and protection of ocular ischemic diseases (Kara et al., 2014). Garbanzol has been shown as an antimutagenic flavonoid (Park et al., 2004). The flavanonol dihydrofisetin (also known as fustin), a type of flavonoid showed protective effects on neuronal cell death . Galangin, another bioflavonoid suggested as a potential candidate for further development of new drugs against Alzheimer's disease (Guo et al., 2010) also has anticancer (Zhang et al., 2013) and hepatoprotective properties (Wang et al., 2013). Butin inhibit aromatase in the human body (Park et al., 2014). Butein suppresses breast and lung cancer (Cho et al., 2014;Seo and Jeong, 2014). Another important flavonoid reported in P. amarus is quercetin, widely accepted as a potent antioxidant also shows anticarcinogenic and hepatoprotective (Ji et al., 2014) activities. Both quercetin and myricetin possess antimicrobial properties (Rashed et al., 2014). Further the well-recognized flavonoid of P. amarus, viz. kaempferol has antioxidant, hepatoprotective and anticancer effects (Huang et al., 2014;Shakya et al., 2014;Dang et al., 2015). The flavone luteolin too has anticancer potential (Lim do et al., 2007). Recent studies have also reported that luteolin along with quercetin reduces high blood cholesterol levels in vivo (Nekohashi et al., 2014). Astragalin is another flavone known to possess antioxidant effects as well (Choi et al., 2013).The pathway genes of all these flavonoids were reported in our putative leaf transcriptome dataset of P. amarus.

Discovery of Transcripts Encoding Transcription Factors and Their Domain Architecture
TFs play key roles in controlling gene expression. In plants, TFs have been employed to manipulate various types of metabolic, developmental and stress response pathways. Further TFs are also known to regulate secondary metabolism in plants at the gene and protein levels as well (Vom Endt et al., 2002). TFs that are known to regulate plant secondary metabolism include R2R3-MYB, basic helix-loop-helix (bHLH) proteins like CrMYC2, AP2/ERF family proteins, WRKY, NAC, DOF, bZIP, HD-ZIP, and TFIIIA zinc finger TFs (Bhattacharyya et al., 2013). A total of 16,344 P. amarus unitranscripts could be annotated at the Plant TF database (PlnTFDB; http://plntfdb.bio.uni-potsdam.de/v3.0/ downloads.php) (Pérez-Rodríguez et al., 2009) and categorized into 59 TF categories (Figure 10). The unitranscripts with their detailed TF protein identities and their corresponding domain annotations have been shown in Supplementary File 5. Among the annotated unitranscripts, notable unitranscripts identified and related to secondary metabolism were AP2-EREBP, NAC, bHLH, MYB, or MYB related, bZIP, mTERF, WRKY, zf-HD, C2C2-CO-like, and C2C2-Dof. Similar results were also obtained in our previous study of de novo transcriptome assembly of endangered medicinal herb Podophyllum hexandrum Royle whose extract podophyllotoxin is used for production of anticancer drugs (Bhattacharyya et al., 2013). Our present study has shown that 303 and 737 unitranscripts have encoded for MYB and MYB related TFs respectively. MYB TFs that regulate the phenylpropanoid biosynthetic pathway and also identified in several plant species, mostly include the R2R3-MYB TFs (Hichri et al., 2011;Bhattacharyya et al., 2013). Besides, R2R3-MYB TF has also been shown as a flavonolspecific regulator of phenylpropanoid biosynthesis in Arabidopsis thaliana (Mehrtens et al., 2005). Previous reports also show that in the grapevine phenylpropanoid pathway (Deluc et al., 2006) and its major branch viz. flavonoid biosynthetic pathway in Prunus persica and Epimedium sagittatum are both regulated by R2R3 MYB TFs Ravaglia et al., 2013). Besides this, 750 unitranscripts coding for bHLH TFs have been identified in the present study. bHLH TFs like MYB also regulate the flavonoid biosynthetic pathway in plants. . Besides, we identified 1 Dof TF family in our present study. Dof TFs besides regulating diverse biological processes like carbon and nitrogen assimilation, dormancy, seed maturation and germination, phytochrome signaling, salicylic acid response, guard cell-specific gene expression, photoperiodic flowering, etc. have also been reported to influence phenylpropanoid metabolism in an environmental and tissuespecific manner (Skirycz et al., 2007;Gupta et al., 2015). Another TF family viz. WRKY has also been encoded by 541 unique sequences. WRKY proteins amongst its diverse functions are also involved in the biosynthesis of secondary metabolites (Eulgem et al., 2000). Taken together, TFs identified here can be evaluated further, especially those related to a wide array of secondary metabolites biosynthesis in this potential medicinal herb.

In silico SSR Mining and Discovery
Microsatellites, or SSRs, are tandemly repeated short DNA sequence motifs ranging from 1 to 6 base pairs extensively distributed in eukaryotes, including the plants, animals and microorganisms, as well as in some prokaryotes (Morgante et al., 2002). SSRs have become one of the most widely used informative molecular markers because it is easy to detect and further used in several applications, including genetic diversity, evolution, genome mapping, marker assisted selection, and breeding studies. Out of 85,927 sequences that were examined by MISA tool, a total of 65,273 SSRs was identified out of which 29,652 were present in compound formation. Of the examined sequences 28,304 contained SSRs with 42% harboring more than one SSR. Statistical analysis of SSRs identified in our study has been presented in Table 3.
Experimentally confirmed data of gene expression provide a preferable understanding of the function and regulation of the genes. Hence, we performed RT-PCR of the above mentioned unitranscripts (Figure 12) identified in our putative leaf transcriptome dataset to confirm the gene expression of these unitranscripts along with actin gene used as a control in the leaf tissue samples of P. amarus. Our results suggested that the threelevel de novo assembly and annotation data of our present study can be used in genomics study and practical experiments on P. amarus in future. CONCLUSION P. amarus transcriptome research lags that of other plants of medicinal importance. To facilitate molecular research in P. amarus we characterized the leaf transcriptome since the vast repertoire of secondary metabolites are mainly present in the leaf tissues of this herb. Our data on leaf transcriptome analysis using the Illumina MiSeq platform have instigated to the identification of a large number of transcripts, TFs, molecular/SSR markers involved in diverse processes, functions, metabolic pathways together with the transcripts involved in a number of secondary metabolic pathways, specially attributing to the phytomedicinal significance of this herb. To be more precise, the putative transcriptome information explored in our data revealing the various lignans, flavonoids, terpenoids, alkaloids and other secondary metabolites biosynthesis pathway genes adds a copious amount of information to P. amarus database and also paves the way for functional and comparative genomic studies of this highly promising medicinal plant in future. Further RT-PCR results showing expression of the few selected unitranscripts, involved in various classes of secondary metabolites synthesis, also confirmed the reliability and accuracy of our P. amarus leaf transcriptome assembly. This is the first report on a detailed study of P. amarus leaf transcriptome that was done to provide an important resource for future studies on "bhuiamalaki" thereby greatly facilitating research on non-model plants in plant biology.

AUTHOR CONTRIBUTIONS
AB and SC designed the experiment. AB carried out the experimental work, analyzed the data and drafted the manuscript. SC supervised the analysis of NGS data and prepared the final manuscript.