Transcriptome analysis of Ginkgo biloba kernels

Ginkgo biloba is a dioecious species native to China with medicinally and phylogenetically important characteristics; however, genomic resources for this species are limited. In this study, we performed the first transcriptome sequencing for Ginkgo kernels at five time points using Illumina paired-end sequencing. Approximately 25.08-Gb clean reads were obtained, and 68,547 unigenes with an average length of 870 bp were generated by de novo assembly. Of these unigenes, 29,987 (43.74%) were annotated in publicly available plant protein database. A total of 3,869 genes were identified as significantly differentially expressed, and enrichment analysis was conducted at different time points. Furthermore, metabolic pathway analysis revealed that 66 unigenes were responsible for terpenoid backbone biosynthesis, with up to 12 up-regulated unigenes involved in the biosynthesis of ginkgolide and bilobalide. Differential gene expression analysis together with real-time PCR experiments indicated that the synthesis of bilobalide may have interfered with the ginkgolide synthesis process in the kernel. These data can remarkably expand the existing transcriptome resources of Ginkgo, and provide a valuable platform to reveal more on developmental and metabolic mechanisms of this species.


Introduction
Ginkgo biloba L. is a long-lived dioecious plant. The species is endemic in China and is the only known member of the Ginkgopsida. Modern classification based on phylogeny recognizes the species in a separate order Ginkgoales. All living gymnosperms were divided into four orders, Cycadales, Gingoales, Gnetales, and Pinales/Coniferales. As one of the most ancient seed plants, the earliest Ginkgo-like trees can be traced back to the early Permian period, approximately 280 million years ago (Zhou and Zheng, 2003;Gong, 2008). Ginkgo trees flourished in the Jurassic and Cretaceous periods of the Mesozoic era, playing important roles in the biosphere. Since its unique morphology and physiology characteristics, together with many debates on recent phylogenetic studies of seed plants (Wu et al., 2013), they played important roles in the research of evolutionary relationships.
Apart from its special evolutionary status, Ginkgo biloba also possesses considerable medical value. Ginkgo biloba extract (GBE) is among the best-selling herbal remedies in America and many European countries (Kressmann et al., 2002;Yilmaz et al., 2010). It has been used to treat cardiovascular and cerebrovascular diseases (Lin et al., 2002). A standardized preparation of GBE, EGB761, is used to treat thrombosis, inflammation, cardiovascular conditions, and Alzheimer's disease (Oken et al., 1998;Sasaki et al., 2002;Rodríguez et al., 2007). The components of GBE are very complex, and its major medical ingredients are flavonoids and terpenoids (Horáková et al., 2003;van Beek, 2005). These terpenoids primarily consist of ginkgolide and bilobalide, both of which are rare terpene trilactone compounds.
As the case for all other isoprenoid compounds, the syntheses of ginkgolide and bilobalide involve three similar steps: (1) the formation of isopentenyl diphosphate (IPP) and its isomer dimethylallyl diphosphate (DMAPP); (2) the repetitive condensation of IPP and DMAPP to their precursors farnesyl diphosphate (FPP) and geranylgeranyl diphosphate (GGPP); and (3) the late cyclization and oxidation of these compounds as catalyzed by terpenoid synthesis and cytochrome P450 -dependent monooxygenases. After these common steps, ginkgolide and bilobalide are synthesized through the methylerythritol 4-phosphate (MEP) and mevalonic acid (MVA) pathways, respectively (Rohmer, 1999;Kim et al., 2006b). It is suggested that bilobalide is a sesquiterpene which contains three lactones units (Sabater-Jara et al., 2013), and the specific biosynthetic mechanism of bilobalide in G. biloba remains unclear, although many genes coding for key enzymes involved in the MEP pathway have been cloned. The gene coding for 3-hydroxy-3-methylglutaryl coenzyme A (HMG-CoA), which is related to the first committed step of bilobalide synthesis, has been cloned and identified (Shen et al., 2006).
Though ginkgolides are present in both leaves and roots, the sites of terpene trilactone biosynthesis remain unclear. Recently, several functional genomics studies have been conducted on G. biloba (Wang et al., 2010;Lin et al., 2011Lin et al., , 2012, and these previous studies have mostly focused on leaves rather than kernels. However, the ginkgo leaves contain only small amounts of ginkgolides (Nakanishi, 1967) and people have been eating G. biloba kernels directly for 1000s of years, believing it beneficial for health. According to many researches, genes coding for enzymes of the MEP pathway involved in IPP and DMAPP are highly expressed predominantly in the roots (Gao et al., 2006;Kim et al., 2006a,b;Shen et al., 2006). Ginkgolides would be synthesized in the roots and then transported to the stems and leaves (Kim et al., 2012). As a sesquiterpene, detailed evidence of bilobalide formation is still lacking. In this study, we attempted to reveal the synthesis or regulative mechanism of ginkgolide and bilobalide in Ginkgo kernels by Illumina paired-end sequencing and bioinformatics analysis and we identified several useful differentially expressed genes (DEGs) involved in both pathways.

Plant Materials
Seeds from a mature Ginkgo tree growing at Nanjing Forestry University were collected at different developmental time points in 2013 (8 July: Gb_Seed1; 5 August: Gb_Seed2; 2 September: Gb_Seed3; 20 November: Gb_Seed4; 2 December: Gb_Seed5), after the removal of the testae and then quickly frozen in liquid nitrogen and stored at −80 • C until RNA extraction was conducted.

RNA Extraction and Library Preparation
Total RNA extraction was performed according to the manufacturer's instructions using the RNeasy Plant Mini Kit (Qiagen, Hilde, Germany). Before extraction, RNA degradation and contamination were monitored on 1% agarose gels. RNA purity was checked using a NanoPhotometer R spectrophotometer (Implen, Westlake Village, CA, USA). RNA concentration was measured using the Qubit R RNA Assay Kit in a Qubit R 2.0 Flurometer (Life Technologies, Carlsbad, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit in an Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). A total of 3 μg RNA was used as an input per sample. Sequencing libraries were generated using a NEBNext R Ultra TM RNA Library Prep Kit for Illumina R (NEB, USA) following the manufacturer's instructions, and index codes were added to attribute sequences to each sample. Clustering of the index-coded samples was performed on a cBot Cluster Generation System using the TruSeq PE Cluster Kit v3-cBot-HS (Illumia) according to the manufacturer's instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq 2000 platform, and paired-end reads were generated.

Data Processing, Assembly, and Annotation
Raw data (raw reads) in FASTQ format were first processed using in-house Perl scripts. In this step, clean data (clean reads) were obtained by removing adapter-containing, poly-N, and lowquality reads from the raw data. At the same time, the Q20, Q30, GC content, and sequence duplication level of the clean data were calculated. All downstream analyses were based on high-quality, clean data. The left files (read1 files) from all libraries/samples were pooled into a single left.fq file, and the right files (read2 files) were pooled into a single right.fq file. Transcriptome assembly was accomplished from the left.fq and right.fq using Trinity (Grabherr et al., 2011), with min_kmer_cov set to 2 by default and all other parameters set as their defaults. The transcripts of each gene with the longest length were selected as unigenes.
All assembled unigenes were searched against the Nr (NCBI non-redundant protein sequences) database using the BLAST algorithm to study the functions of mRNA and identified homologues of genes with known functions. The best gene ontology (GO) terms acquired were searched against the Nr database using Blast2GO and self-written programs (Götz et al., 2008). The assembled unigenes were also searched against the NT (NCBI nucleotide sequences), Pfam (Protein family), KOG (euKaryotic Ortholog Groups)/COG (Clusters of Orthologous Groups of proteins), Swiss-Prot (a manually annotated and reviewed protein sequence database), and KO [KEGG (Kyoto Encyclopedia of Genes and Genomes) ORTHOLOG] databases to find and predict functional classifications and molecular pathways.

Differential Expression Analysis
The transcriptome based on the Trinity assembly was set as the reference sequence (ref) for the analysis of gene expression levels. The clean reads of each sample were mapped to the ref, and the number of read counts mapped to each gene was calculated. For  this process, we used RSEM software (Li and Dewey, 2011), and the bowtie parameter was set at mismatch 2. The read counts were then transformed with the FPKM method (expected number of fragments per kilobase of transcript sequence per millions of base pairs sequenced), and the impacts of sequencing depth and gene length on fragment counts were also considered (Trapnell et al., 2010). The input data of the DEGs were based on the read counts. In this study, we selected expressed genes of GB_Seed1 as the reference and all the DEGs were generated based on the comparison with it. Statistical treatment was first performed on the read counts using the trimmed mean of M-values method, then differential expression in the digital gene expression data was analyzed using a model based on the negative binomial distribution with the DEGseq (2010) R package. The p-value was adjusted using the q-value (Storey and Tibshirani, 2003). A q-value <0.005 and a log2 fold change >1 were set as the thresholds for significantly differential expression (Anders and Huber, 2010). The identified DEGs were clustered using k-means method and then were used for further GO and KO enrichment analysis. GO enrichment analysis of the DEGs was implemented by the GOseq R packages, based on the Wallenius non-central hypergeometric distribution (Young et al., 2010), which can adjust for gene length bias. We used KOBAS (Mao et al., 2005) software to test the statistical enrichment of DEGs in the KEGG pathways.

Real-time Quantitative PCR Analyses
Three genes involving in the terpene lactones synthesis pathway (Gene ID: comp27939_c0, comp36057_c0, comp56812_c0) were analyzed using qRT-PCR. New plant materials from the individuals were used for the RNA extraction for the qRT-PCR, and three biological replicates were made. Gene-specific primers were designed according to the reference unigene sequences using online primer-design software (http://www.genscript. com.cn/technology-support/on-line-tools). A HiScriptTM Q RT SuperMix for qRT-PCR (Vazyme, Nanjing, China) was used to synthesize the cDNAs and real-time quantification was performed using a ABI Viia7 Real-Time PCR system and the EvaGreen 2X qPCR Master Mix (Vazyme, Nanjing, China). The PCR cycling was denatured using a program of 95 • C for 10 min, and 40 cycles of 95 • C for 15 s and 60 • C for 60 s. The18S gene in G. biloba was selected as the reference gene.

Sequencing and Assembly
Illumina sequencing data from G. biloba kernels at different time points were deposited in the NCBI SRA database under accession number SRP062414. The results of the highthroughput sequencing were designated as raw data or raw reads, which were stored in FASTQ format (fq) and contained sequence and quality information. In this study, a total of 25.08 Gb of sequencing data were generated, comprising 185,075,486 raw reads and 167,215,904 clean reads. The base average error rate was 0.08%, and the average Q20 and Q30 values were 95.24 and 87.12%, respectively. The average GC content was 45.50% (Table 1).
A total of 112,946 transcripts were generated; the shortest was 201 bp, and the longest was 18,087 bp. The average length was 1262 bp, and the N50 was 2269 bp. A total of 68,547 unigenes were assembled, for which the shortest and longest lengths were the same as those of the transcripts. Among the unigenes, 38,932 (56.80%) were 200-500 bp, 11,658 (17.01%) were 0.5-1 kbp, 9893 (14.43) were 1-2 kb, and the remaining 8064 (11.76%) were >2 kb ( Table 2).

Functional Annotation and Classification
All of the 68,547 assembled unigenes were searched against the Nr, NT, KO, KOG/COG, and Swiss-Prot databases using the BLAST algorithm (E-value < 1E-5) and annotated accordingly (Supplementary File S1). Of these unigenes, 26,334 had significant similarity to entries in the Nr database, accounting for 38.41% of the total. A total of 20,331 (29.65%) unigenes were annotated as unknown. The numbers of sequences that were highly similar to entries in the NT, KO, PFAM, GO, and KOG/COG databases were 9721 (14.18%), 5164 (7.53%), 20,197 (29.46%), 21,925 (31.98%), and 10,268 (14.97%), respectively, and 43.74% of the unigenes were annotated in at least one database ( Table 3). Of unigenes that were not annotated in any of the databases, more than 1,000 unigenes didn't have open reading frames.

Differential Expression Analysis of Kernels
The transcriptome assembled by Trinity was employed as the ref, and clean reads of each sample were mapped to this ref. Over 85% of the sequences per sample could be mapped ( Table 4). In total, 3869 DEGs were identified and the DEG clustering figure using k-means method is contained in the Supplementary File S2. Between GB_Seed2 and GB_Seed1, 707 DEGs were identified, including 575 up-regulated and 132 downregulated DEGs (Figure 1). A total of 1485 DEGs were identified between GB_Seed3 and GB_Seed1, including 1184 up-regulated and 301 down-regulated DEGs. A total of 2442 DEGs were identified between GB_Seed 4 and GB_Seed 1, including 1962 up-regulated and 480 down-regulated DEGs. There were 2698 DEGs between GB_Seed5 and GB_Seed1, 2213 of which were up-regulated and 485 of which were down-regulated. In all four comparison groups (GB_Seed2 vs. GB_Seed1, GB_Seed3 vs. GB_Seed1 and etc.), 202 DEGs in common were identified (Figure 2).
After screening for DEGs, studying their GO distribution can elucidate the functional differences between different samples. To obtain a better understanding of the DEGs, we not only performed enrichment analysis (GO enrichment and KEGG enrichment) on all DEGs (Figure 3), but also divided them into up-regulated and down-regulated groups (Supplemetary File S3). When the corrected p-value of one function was less than 0.05, we regarded that function as enriched. The results showed that between GB_Seed2 and GB_Seed1, the genes involved in "photosynthesis, light harvesting" and "generation of precursor metabolites and energy" had the highest degree of enrichment; both of these classifications FIGURE 1 | Differential expressed genesduring kernel developments. The horizontal axis represents fold changes of gene expression, and the vertical axis represents the statistically significance level. The red dots mean significantly up-regulated genes and the green dots represent significantly down-regulated genes.
pathways were "photosynthesis-antenna proteins" and "fatty acid biosynthesis, " both of which involved down-regulated DEGs. Between GB_Seed4 and GB_Seed1, although no pathway met the q-value requirement, the most DEGs were involved in "metabolic pathways" (q-value = 0.22) and the "biosynthesis of secondary metabolites" (q-value = 0.09). Both up-regulated and downregulated DEGs were related to secondary metabolites, and the majority of them were up-regulated in this comparison. Between GB_Seed5 and GB_Seed1, large amounts of genes were differentially expressed (Figure 4). The most significantly enriched pathways were "metabolic pathways" and "biosynthesis of secondary metabolites." Up-regulated DEGs were dominant in the "biosynthesis of secondary metabolites." Although most DEGs in the "metabolic pathways" were up-regulated, this enrichment was not significant. Other significantly downregulated DEGs were involved in "photosynthesis-antenna proteins" and "diterpenoid biosynthesis." Significantly upregulated DEGs were involved in the "plant hormone signal transduction, " "phenylpropanoid biosynthesis, " and "glutathione metabolism" pathways.

DEGs in the MEP/MVA Pathways
A total of 66 unigenes were involved in terpenoid backbone biosynthesis. Comparing to the first time point, a total of 12 unigenes were differentially expressed in the fifth period. Since the number of DEGs between GB_Seed5 and GB_Seed1 was the largest, we paid more attention to the KEGG pathway annotations between these two time points (Figure 5). Between GB_Seed4 and GB_Seed1, 11 unigenes were differentially expressed. Most of the DEGs were similar with that between GB_Seed5 and GB_Seed1 and the fold changes in this comparison were relatively lower. Between GB_Seed3 and GB_Seed1, two unigenes (comp38586_c0 and comp40873_c0) were identified up-regulated and between GB_Seed2 and GB_Seed1, one unigene (comp38586_c0) was identified upregulated.

Verification by Real-time Quantitative PCR
To confirm the reliability of the transcriptome data, the transcriptome level of 3 unigenes were examined by real-time quantitative PCR (Figure 6). The three unigenes showed expression patterns which were consistent with the transcriptome data.

Discussion
Over the past few years, next-generation sequencing (NGS) technology has brought profound changes to genomics and genetics, and sequencing rates have become continually faster while costs have continually decreased (Mardis, 2008). Among representative NGS platforms, the Illumina HiSeq 2000 platform is widely used for deep sequencing due to its high throughput and favorable cost performance. Expressed sequence tag (EST) sequencing is a useful method to understand molecular mechanisms underlying physiological and morphological traits. Sequencing technology was first developed in 1977 but was problematic due to its high labor and cost requirements. After years of improvement, NGS technology methods have improved the speeds, convenience, and cost of sequencing. Because the Illumina Hiseq2000 platform has the best cost performance among NGS methods, it has been widely used for the deep sequencing of model and non-model organisms (De Donato et al., 2013). In this study, over 185 million raw reads and 167 million clean reads, representing 25.08 Gb, were obtained using a HiSeq2000 machine. After filtering, the high-quality sequences were assembled using Trinity, with a minimum length cutoff of 200 bp. A total of 112,946 singletons and 68,547 unigenes were generated. The average length of these unigenes was 870 bp; the shortest was 201 bp, and the longest was 18,087 bp.
Based on this data, we conducted the first sequencing of G. biloba kernels at different time points, and we attempted to reveal the biosynthetic mechanisms of ginkgolide and bilobalide in these kernels. These data will contribute to the preservation of Ginkgo germplasm resources and further functional genomics analyses in this species. A total of 3869 unigenes were identified, most of which were annotated by the GO and KEGG databases. As we manually divided the seeds into five time points, the DEGs at different stages provide effective references for understanding the details of gene expression and regulation over the growth of G. biloba seeds.
A previous 454 GX FLX sequencing study of Gingko biloba leaves in Lin et al. (2011) generated a total of 64,057 ESTs and 22,304 unigenes, with an average length of 402 bp. Nine putative transcripts involved in MEP pathway and four putative transcripts involved in MVA pathway were identified in that study and in our research, these related unigenes were identified as well. Since most previous studies focused on G. biloba leaves together with the putative transcripts identification, the kernels transcriptome was sequenced for the first time and the results would benefit further study of functional genomics and developmental biology. By Illumina sequencing in our study, a total of 29,987 (43.74%) unigenes were annotated in at least one database, while 20,331 unigenes were unknown. Of these unigenes, 5164 were annotated in the KO database, covering 31 categories. In Lin and his colleagues' work, 64.5% of the unigenes were annotated and the percentage was higher than that in our research. These un-annotated unigenes may represent more unique functional genes in kernels. The discrepancy about unigene annotation would result from the biological tissue differences. Besides, much more transcripts were obtained in our study than that of Lin and his colleagues' , and more rare transcripts would be sequenced.
We mainly focused on the biosynthesis of the common precursor GGPP, considering the complex mechanisms of diterpenoid and sesquiterpenoid synthesis. Comparing to the first time point, no unigenes concerning in the MEP and MVA pathways was significantly down-regulated. It is possible that ginkgolide and bilobalide were both generated rapidly during the seed maturation process. Differential expression analysis revealed that many genes involved in the MEP pathway toward GGPP were up-regulated, such as 1-deoxy-D-xylulose-5-phosphate reductoisomerase, 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase, and 1-hydroxy-2-methyl-2-butenyl 4-diphosphate and FPP synthases. Most of these genes were expressed at the fourth and the fifth time points, while fewer expressed genes involved in the MVA pathway were identified.
According to the qRT-PCR results, two unigenes (comp36057_c0 and comp_27939c0), encoding GPS and DXR enzymes respectively, were both enzymes in the MEP pathway. These two unigenes maintained relatively stable levels at first, and then they were up-regulated significantly at the last two periods and finally reached the peak at the fifth time point. One another unigene (comp56812_c0) encoded HMGR, the enzyme involved in the MVA pathway. The expression levels reached a peak at the fourth period and then suddenly dropped to a much lower level at the last time point, although still higher than that at the first time point. Is it possible that ginkgolide biosynthesis would occur earlier than bilobalide synthesis in Ginkgo kernels? Would it be supposed that two synthesis pathways compete for the utilization of their common precursors? The validation to these hypothesizes requires much following work, and some key genes involved in the two pathways are being functionally analyzed in our laboratory, providing a foundation for the further study of ginkgolide and bilobalide biosynthesis in Gingko kernels.