De novo Transcriptome Analysis Revealed Genes Involved in Flavonoid and Vitamin C Biosynthesis in Phyllanthus emblica (L.)

Phyllanthus emblica is an affluent source of various therapeutic components. A few of them like vitamin C and flavonoids are predominant bioactive compounds that are being used in immense pharmacological applications. In-spite of numerous applications, the genomic information of this plant was limited to a few expressed sequence tags (ESTs) in DNA databases. Herein, we developed in-depth transcriptome information of P. emblica using Illumina Hiseq 2000 platform and characterized. A total of 31,285,965 high-quality reads were assembled into 91,288 contigs with the N50 value 358. Out of them, 47,267 contigs were functionally annotated using BLASTX search against NCBI-non-redundant (NR) protein database. Further, 31,366 contigs showed similarity with various gene ontology (GO) terms, and 1299 were related to different enzymes and biosynthetic pathways. We identified the transcripts related to each gene involved in flavonoid and vitamin C biosynthesis. Several cytochrome P450s (CYPs) and glucosyltransferases (GTs) genes involved in flavonoid biosynthesis and various other metabolic pathways were also documented. Further, 6510 transcription factors and 4420 EST derived simple sequence repeat (SSR) markers were also predicted. The present study enlightened various characteristic features of P. emblica genome, and provided an important resource for future molecular and functional genomics studies.

The flavonoids are diverse class of secondary metabolites that have pivotal role in plant growth, development and defense mechanism (Dixon and Steele, 1999;Winkel-Shirley, 2001). They play critical role in the production of plant pigments and involved in various other activities including UV protection and pathogen defense, along with their nutraceutical value in the human diet (Winkel-Shirley, 2001). The structural and regulatory genes involved in flavonoid biosynthesis have been extensively characterized in various plants for their spatial and temporal regulation (Boss et al., 1996;Ban et al., 2007;Singh et al., 2008;Niu et al., 2010).
Plants are rich source of water soluble vitamin C, which plays diverse role in various biological functions in plants and humans as well. In plants, it is involved in biosynthesis of ethylene, gibberellins, and plant pigments, cell growth regulation, acts as an enzyme cofactor in photosynthesis and various other vital functions (Smirnoff and Wheeler, 2000). Further, it is essential in ameliorating the harmful effects of reactive oxygen species derived from chloroplast in photosynthetic eukaryotes (Wheeler et al., 1998). The human beings are largely dependent upon plants for their regular uptake of vitamin C due to the lack of an enzyme gulonolactone oxidase involved in final step of vitamin C biosynthesis. It acts as a cofactor for enzymes involved in the post-translational hydroxylation of collagen, carnitine biosynthesis, involved in conversion of the neurotransmitter dopamine to norepinephrine, and in tyrosine metabolism (Diliberto and Daniels, 1991). It also plays vital role in regulation of iron uptake, cardiovascular functions, maintenance of cartilage, and wound healing.
The high throughput transcriptome sequencing and analyses have become a versatile method for gene discovery and expression profiling in recent years (Kalra et al., 2013;Chen et al., 2015). The Illumina sequencing technology has proven to be an exceptionally successful in a wide variety of whole-transcriptome investigations, particularly for the characterization of non-model organisms where reference genome is not available (Wilhelm et al., 2008;Tang et al., 2011;Chen et al., 2015;Kumar S. et al., 2015). Several computational tools for de novo assembly of short read sequence data and identification of genes involved in various metabolic pathways have also been demonstrated (Pertea et al., 2003;Zerbino and Birney, 2008;Grabherr et al., 2011;Fu et al., 2012).
Molecular insights into the medicinal plants have gained attention in recent years. The availability of genomic and transcriptomic data of such plants has been comprehensively reviewed by Misra (2014). Despite of high medicinal value, the genomic information of P. emblica is still very limited. To the best of our knowledge, only 71 ESTs were available in the National Center for Biotechnology Information (NCBI) database before the start of this work. The inadequate genomic/transcriptomic data was a major bottleneck in understanding various molecular mechanisms and biosynthetic pathways including flavonoids and vitamin C biosynthesis in P. emblica. Earlier, we developed a method for RNA isolation  and cloned flavonoid 3 hydroxylase (F3H) gene of flavonoid biosynthetic pathway using primers based PCR approach from this plant . However, we could not get complete information about biosynthetic pathways and other molecular details using that approach. Therefore, to gain further insight into metabolic and molecular networks of this plant, the de novo transcriptome study was initiated with foremost emphasis to investigate the candidate genes involved in flavonoids and vitamin C biosynthesis.

Plant Material, RNA Isolation, and Transcriptome Sequencing
Young leaves from the top aerial part of tree at the edge of branchlets (Supplementary Figure S1) and full bloom flowers were harvested from approximately 10-year-old healthy plant of P. emblica growing under natural environmental conditions in the botanical garden of the Panjab University, Chandigarh, India. Samples were harvested in early morning of November month, snap frozen in liquid nitrogen, and stored at −80 • C till further use. Total RNA was isolated using the method described by Kumar and Singh (2012), followed by RNA purification and on column DNase I digestion using miRNA Easy kit (Qiagen, Germany). The cDNA library was prepared using TruSeq TM RNA Sample preparation kit (Illumina, USA) at Microarray core facility, Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA, followed by 50 cycled single end library sequencing on Illumina Hiseq 2000 sequencing platform.

De novo Assembly and Sequence Clustering
Computational analysis was carried out on HP workstation with eight cores, 2.27 GHz Intel Xeon processor with 16 GB RAM. Data was filtered to remove adapter sequences by using the fastx_clipper tool of the FASTX Toolkit (www.hannonlab.cshl.edu/fastxtoolkit) with exact matching of target sequence. Reads passing phred quality scores ≥20 (an error probability of 0.01) were filtered out, and unambiguous sequences ("N") were trimmed. The de novo assembly of filtered reads was performed using a short read assembler program, VELVET (Version 0.7.55) (Zerbino and Birney, 2008) followed by OASES program (Version 0.1.11) (Schulz et al., 2012) with different k-mer hash length.
After assembly, the clustering tool CD-HIT-EST was used to cluster nearly identical (>99%) transcripts. The longest sequence within each cluster was extracted. The clustering process was supplemented with TGICL-CAP3 clustering program and the clustered contigs and singletons were merged to get final transcript assembly. Statistical parameters such as total transcripts, average size of transcript, transcripts having length ≥1000 bp etc. were used to assess assembly quality. In order to assess the reliability of assembly, assembled sequences were further validated using previously characterized P. emblica gene sequences available at NCBI Genbank database. BLASTN analysis was performed for each reported sequence against the set of assembled sequences at e-value 10 −5 .

Sequence Annotation and Classification
The contigs and singletons were annotated using FastAnnotator (bioinfo.cgu.edu.tw/fastannotator_release). It utilizes Blast2GO, PRIAM, and RPS-BLAST to assign Gene ontology (GO) term, enzyme commission (EC) codes and protein domains. BLASTX was run against NCBI non-redundant (NR) protein database. The query sequences were assigned with a cut-off e-value of 10 −5 . Assembled transcripts were also searched against plant transcription factor database (PlnTFDB; http://plntfdb.bio.uni-potsdam.de/v3.0/downloads.php) for the identification of transcription factor families. Genes involved in flavonoid and vitamin C biosynthesis were sorted out and analyzed by BLASTX analysis.

Transcriptome Sequencing and de novo Assembly
The high-throughput transcriptome sequencing of P. emblica using Illumina Hiseq 2000 (Illumina, USA) produced 32,382,864 single end reads. Since the 3 ′ end of reads are more prone to error, two bases from 3 ′ end were excluded after quality assessment. The 48 bases high quality reads were used for further analysis. After adapter trimming and removal of unambiguous sequences, 31,285,965 reads were assembled into 134,205 unique sequences and 89,242 singletons ( Table 1). The de novo assembly was optimized at different k-mer lengths, in which k-mer 33 was found best with 358 bp N50 value. The longest and average contig length was 5418 and 278 bp, respectively (Table 1). Although, 100-199 bp contigs were abundant, yet 11,074 contigs had sequence length ≥500 bp. Contigs with ≤100 bp length were discarded from further analysis. The accuracy of assembled sequences was validated by aligning them with the ESTs of P. emblica available at NCBI database. Significant hits were observed with ESTs with an average 82.72% similarity.

Functional Annotation and Classification
It was crucial to gather in-depth functional information to understand various metabolic processes. Therefore, the transcripts sequences were used for BLAST search (e-value 1e −05 ) against gene ontology (GO) and enzyme classification (EC) databases (Supplementary Table S2). An overview of annotated transcripts against GO, EC, and protein domains databases is shown in Figure 2.

GO Annotation
GO analysis provides functional classification of gene, which defines the properties of genes and their products. GO has three ontologies; molecular function, cellular components, and biological processes. A total of 31,366 transcripts were annotated using GO database. In biological process category of GO ( Figure 3A), oxidation-reduction process (GO: 0055114), serine family amino acid metabolic process (GO: 0009069), protein phosphorylation (GO: 0006468), regulation of transcription (GO: 0006355), and proteolysis (GO:0006508) were highly represented FIGURE 1 | Similarity distribution with different plant species using the NR protein database (with an e ≤ 1e −5 ).

EC Classification
The EC annotation was obtained for 1299 transcript sequences.
Top 20 abundant enzymes predicted in P. emblica transcriptome are shown in Figure 4. Non-specific serine/threonine protein kinase enzyme family (368 transcripts) members were present in high numbers. Due to the lack of tyrosine kinase receptors, tyrosine phosphorylation is less common in plants as compared to serine-threonine kinases (Mano et al., 2005). Although a number of dual-specificity kinases have been found in plants systems, but none of the true protein-tyrosine kinases (PTK) have been reported, except two PTKs being predicted in A. thaliana (Rudrabhatla et al., 2006;Miranda-Saavedra and Barton, 2007). Around 17% of transcripts were characterized as 2alkenal reductase (AER, EC 1.3.1.7.4). AER plays central role in the detoxification of reactive carbonyls. The reduction of α, β unsaturated bonds present in reactive carbonyls is carried out by AER and are involved in anti-oxidative defense in plants (Luan, 2002). AER plays an important role in oxidation-reduction processes, amino acid transport, and response to various stress conditions, hence the putative role of AER in P. emblica could be response to oxidative stress. About 7% transcripts of P. emblica represented Ubiquitin-protein ligase (EC 6.3.2.19). They are involved in the regulation of various metabolic processes e.g., vegetative growth control mediated by hormones, plant reproduction, stress tolerance, and DNA repair (Mazzucotelli et al., 2006). E3 ubiquitin-ligases are also known to regulate signaling pathways initiated by ABA induced stress (Lyzenga et al., 2011).

Analysis of the Transcripts Encoding Transcription Factors (TFs)
Transcripts encoding TFs were identified by sequence comparison to known TF gene families. In total, 6510 (7.10%) putative transcripts showing similarity to TF genes were identified in P. emblica (Supplementary Table S3). These included TF families like C3H, PHD, FAR1, MADS, SET, SNF2, MYB, bHLH etc ( Figure 5). Out of them, C3H (601 transcripts), PHD (447 transcripts), and FAR1 (375 transcripts) were most abundant, and GNAT and ABI3VP1 were least represented. Several C3H proteins have been reported to participate in developmental responses and hormonal pathways. Thus, C3H is expected to play significant role in stress responses and various metabolic processes in P. emblica, also. The flavonoid pathway genes are mainly regulated at transcription level (Winkel-Shirley, 2001). Several TFs regulating flavonoid biosynthesis have been identified in plants (Davies and Schwinn, 2003;Allan et al., 2008;Palapol et al., 2009;Niu et al., 2010). The R2R3-MYB and bHLH TFs form the MBW complex with the WD40 proteins, which may regulate the transcription of various genes of flavonoid biosynthetic pathway in plants. This regulation is via specific binding to motifs in the promoter region (Hernandez et al., 2004;Hartmann et al., 2005;Dare et al., 2008). The basic helix loop-helix (bHLH) family of TFs is one of the most abundant and highly conserved in plant kingdoms. These bind the E-box (CANNTG), although most plant bHLHs specifically recognize the so-called G-box (CACGTTG). The bHLH proteins are known to regulate biological processes i.e., light signaling, hormone signaling, wound and drought stress response, organ, and tissue development. The R2R3-MYB and bHLH TFs responsible for the anthocyanin accumulation have been well characterized (Ban et al., 2007;Espley et al., 2007). In grapes R2R3-MYB TF, VvMYBA activates the UDP-glucose:flavonoid-3-O-glycosyltranferase (UFGT) gene, which plays a key role in color development (white and red) in grape skin (Boss et al., 1996;Kobayashi et al., 2002). PpMYB10 of peach bind to dihydroflavonol 4-reductase (DFR) promoter and activated the anthocyanin biosynthesis in tobacco and Arabidopsis (Lin-Wang et al., 2010).
In Arabidopsis, three closely related MYBs, AtMYB11, AtMYB12, and AtMYB111 regulate AtFLS1 and other steps for production of flavonol glucosides Stracke et al., 2007). These genes share significant similarity and form subgroup 7 of R2R3-MYB gene family. Due to the functional similarity amongst MYB11, MYB12, and MYB111, they show target specificity for flavonoid biosynthetic pathway genes such as CHS, CHI, F3H, and FLS1. The FLS is regulated by light and UV exposure via activation of MYB TFs in grapes and maize, respectively (Czemmel et al., 2009;Ferreyra et al., 2012). At least four MYB TFs (VvMYB5a, VvMYB5b, VvMYBPA1, and VvMYBPA2) are reported in grapes that regulate key steps of the flavonoid pathway. They affect accumulation of proanthocyanidins in leaves, flowers and in early berry development, before the véraison stage (Terrier et al., 2009).
These annotations were useful in predicting the molecular functions of unigenes and constructing the metabolic pathways in P. emblica. The knowledge of flavonoid biosynthetic pathways along with the regulatory TFs (such as MYB, bHLH, and WD40type) rendered metabolic engineering much simplified for the production of various essential metabolites.
Isoflavonoids are an important subclass of flavonoids being involved in plant defense and nodulation (Dixon and Steele, 1999). Certain transcripts associated with isoflavonoid metabolism such as isoflavone 7-O-methyltransferase, 2-hydroxy isoflavanone synthase, isoflavone 2 ′ -hydroxylase, isoflavone 4 ′ -Omethyltransferase, and isoflavone reductase were also present in our data.
The P450 monooxygenases are heme protein-dependent mixed-function oxidase systems. They utilize NADPH/NADH to reduce atmospheric dioxygen and yield an organic substrate along with a water molecule. They are involved in several processes such as hydroxylation, dealkylation, dimerizations, epoxidation, isomerization, deamination etc. Schuler and Werck-Reichhart (2003).
Activities of P450s are categorized into two classes-first exists in biosynthetic pathways and second in detoxification pathways (Schuler, 1996;Chapple, 1998). They play important role during synthesis of lignin, flavonoids, coumarins, sinapoyl esters, isoflavonoids, hydroxamic acids, glucosinolates, terpenes, gibberellins, brassinosteroids, auxin, and oxygenated fatty acids. In addition to these known biosynthetic activities, plant P450s are also responsible for catabolizing a range of endogenous and toxic exogenous compounds encountered in the environment such as herbicides, insecticides, and some pollutants (Werck-Reichhart et al., 2000;Harvey et al., 2002).
Several unigenes encoding different types of glycosyltransferases were also found in our dataset (Supplementary Table S6). Out of them, UFGTs supposed to be involved in flavonoid biosynthesis included flavonoid glucosyltransferase, UDP-glucose:isoflavone 7-O-glucosyltransferase, tetrahydroxychalcone glucosyltransferase, and anthocyanidin 3-O-glucosyltransferase.
The glycosylation of anthocyanidin resulted into increase in stability, decrease in reactivity, and change in its spectral features; otherwise anthocyanidins are highly reactive and inherently unstable. Major increase in anthocyanin accumulation is attributed to higher mRNA levels of UFGT, CHS, and F3H genes (Prior and Wu, 2006;Petrussa et al., 2013). UFGT is the final gene in anthocyanin pathway and play vital role in anthocyanin biosynthesis and accumulation. The UFGT mediated transfer of the glucosyl moiety from UDP-glucose to hydroxyl groups of anthocyanidins is shown crucial for their stability and solubility. The expression of UFGT is reported to be controlled by various transcription factors like MYB10, MYB123, and bHLH3 (Ravaglia et al., 2013).
Vitamin C (L-Ascorbic Acid) Biosynthesis P. emblica contains high levels of vitamin C, therefore transcriptome analysis is an indispensable tool to investigate the genes involved in its biosynthesis. Transcripts related to each gene involved in biosynthesis of L-Ascorbic Acid (AsA) were identified in the current study (Figure 7). The number of contigs matching with the pathway genes varies from 2 for Lgalactose dehydrogenase to more than 20 for Hexokinase and Table S7). AsA biosynthesis in plants occurs through various routes but activation of each pathway is dependent upon species and developmental stages of the plant. The first proposed pathway for AsA biosynthesis proceeds through GDP-D-mannose and Lgalactose (Wheeler et al., 1998). In 1963, it was demonstrated in strawberry fruits, where 6-carbon labeled molecule D-glucose converted into AsA without cleavage of the carbon chain. Later on, it was postulated that L-galactono-1, 4-lactone is the immediate precursor for AsA. D-Glucose is the main precursor for ascorbic acid biosynthesis, which gets converted into D-glucose 6-phosphate by the enzyme hexokinase (EC 2.7.1.1), a component of glycolysis. The glucose ring is phosphorylated by addition of a phosphate group derived from ATP. D-Glucose-6-phosphate isomerized into D-fructose-6-Phosphate with the action of phosphoglucose isomerase (EC 5.3.1.9, 364 unigenes) by rearrangements of carbon oxygen bond to transform 6-C ring into 5-C ring. It further converted mannose-6-phosphate by phosphomannose isomerase (EC 5.3.1.8,5 unigenes). Subsequently phosphomannomutase (EC 5.4.2.8,3 unigenes) transfers the phosphate group from 6-carbon position to 1carbon position to produce D-mannose-1-phosphate. Then, GDP-D-mannose pyrophosphorylase (EC 2.7.7.22,19 unigenes) adds one glycosyl unit to produce GDP-D-mannose, which epimerizes at 3 ′ and 5 ′ positions to yield GDP-L-galactose by enzyme GDP-mannose-3 ′ , 5 ′ -epimerase (EC 5.1.3.18,23 unigenes). GDP-mannose pyrophosphorylase is reported to have significant role in regulation of AsA biosynthesis. In fact, a correlation between mRNA levels of GDP-mannose pyrophosphorylase and AsA levels has been documented in many plant species (Badejo et al., 2007(Badejo et al., , 2009.
The enzyme GDP-mannose 3 ′ , 5 ′ -epimerase is responsible for the catalytic conversion of GDP-D-mannose into GDP-Lgalactose. A novel compound, GDP-L-gulose is also produced by the 5 ′ -epimerization of GDP-D-mannose. Therefore, it was postulated that the GDP-mannose-3 ′ , 5 ′ -epimerase enzyme catalyses distinct epimerization reactions depending upon the molecular form of enzyme leading to the formation of either GDP-L-galactose or GDP-L-gulose (Wolucka and Van Montagu, 2003). Here, we found a new branch point of this pathway in plants and a connecting link with the pathway operating in animals. Cats and dogs can synthesize their own vitamin C unlike humans, because human cells cannot perform the conversion of 1-gulono-1,4-lactone into ascorbic acid, which is catalyzed by the enzyme gulonolactone oxidase. It was observed that the gene gulonolactone oxidase is present in humans, but it is a non-functional pseudogene because of accumulation of several mutations over the time (De Tullio, 2010).
Additionally, di-nucleotide to tetra-nucleotide repeats were predicted in various other genes. Therefore, in future, research on various isoforms and activities of genes involved in flavonoid and vitamin C biosynthesis may be helpful in explaining, why P. emblica has such a high amount of flavonoids and vitamin C?

DATA ARCHIVING STATEMENT
Transcriptome data of this study can be accessed at NCBI SRA database under SRA ID SRP075209 (Bioproject PRJNA313483).

ACKNOWLEDGMENTS
We are thankful to Dr. Brian Dalley, Director, Microarray core facility, Huntsman Cancer Institute, University of Utah, Salt Lake City, USA for mRNA library preparation and high-throughput sequencing. AK and SK are thankful to CSIR, India for awarding junior and senior research fellowship. Authors are thankful to Dr. Shikha Kalra and Dr. Dharam Singh for critically reading the manuscript and giving their valuable suggestions.

SUPPLEMENTARY MATERIAL
The Supplementary Figure S2 | Size distribution of the contigs obtained from de novo assembly of high quality clean reads.
Supplementary Figure S3 | E-value distribution of the BLAST hits for unigenes of P. emblica (E-value cut off 1e −5 ).
Supplementary Figure S4 | Conserved domains distribution of the of the best BLAST hits for each unigene.
Supplementary Figure S5 | Similarity distribution of the best BLAST hits for unigenes.
Supplementary Table S1 | Sequences of assembled transcripts of P. emblica. Supplementary