Transcriptome Analysis Identifies Candidate Genes Related to Triacylglycerol and Pigment Biosynthesis and Photoperiodic Flowering in the Ornamental and Oil-Producing Plant, Camellia reticulata (Theaceae)

Camellia reticulata, which is native to Southwest China, is famous for its ornamental flowers and high-quality seed oil. However, the lack of genomic information for this species has largely hampered our understanding of its key pathways related to oil production, photoperiodic flowering process and pigment biosynthesis. Here, we first sequenced and characterized the transcriptome of a diploid C. reticulata in an attempt to identify genes potentially involved in triacylglycerol biosynthesis (TAGBS), photoperiodic flowering, flavonoid biosynthesis (FlaBS), carotenoid biosynthesis (CrtBS) pathways. De novo assembly of the transcriptome provided a catalog of 141,460 unigenes with a total length of ~96.1 million nucleotides (Mnt) and an N50 of 1080 nt. Of them, 22,229 unigenes were defined as differentially expressed genes (DEGs) across five sequenced tissues. A large number of annotated genes in C. reticulata were found to have been duplicated, and differential expression patterns of these duplicated genes were commonly observed across tissues, such as the differential expression of SOC1_a, SOC1_b, and SOC1_c in the photoperiodic flowering pathway. Up-regulation of SAD_a and FATA genes and down-regulation of FAD2_a gene in the TAGBS pathway in seeds may be relevant to the ratio of monounsaturated fatty acid (MUFAs) to polyunsaturated fatty acid (PUFAs) in seed oil. MYBF1, a transcription regulator gene of the FlaBS pathway, was found with great sequence variation and alteration of expression patterns, probably resulting in functionally evolutionary differentiation in C. reticulata. MYBA1_a and some anthocyanin-specific biosynthetic genes in the FlaBS pathway were highly expressed in both flower buds and flowers, suggesting important roles of anthocyanin biosynthesis in flower development. Besides, a total of 40,823 expressed sequence tag simple sequence repeats (EST-SSRs) were identified in the C. reticulata transcriptome, providing valuable marker resources for further basic and applied researches on this economically important Camellia plant.


INTRODUCTION
Camellia reticulata Lindley, one of the most famous plants in the family Theaceae , is an evergreen flowering tree or shrub naturally distributed in Southwest China (Ming et al., 2000). It primarily occurs in mountainous regions with poor soil conditions and a long dry season due to the subtropical plateau monsoon climates. Natural populations with different polyploidy levels (i.e., 2n = 2x, 4x, 6x; x = 15) of C. reticulata resulted from natural hybridization and polyploidization were found across Southwest China (Liu and Gu, 2011), which form a greatly valuable gene pool to be explored for Camellia breeding programs in the future. Compared with the common camellia oil plant, C. oleifera, C. reticulata is very welcome in this region because of its higher economic returns especially in terms of seed size, seed yield and oil content (Liu and Ma, 2010). It is well recognized that the oil extracted from Camellia seeds is similar to that from well-known olive in FA composition, making it a valuable edible oil source for daily consumption. Indeed, the camellia oil contains a high content of unsaturated FAs (UFAs), comprising 60-80% oleic acid and 5-10% linoleic acid (Liu and Ma, 2010;Ma et al., 2011) and nearly undetectable content of very long-chain FAs (VLCFA, >C 20 ) (Ma et al., 2011). The ratio of monounsaturated FAs (MUFAs) to saturated FAs (SFAs) in the camellia oil is close to the optimal ratio following the Somopoulos' "Omega Diet" (Simopoulos and Robinson, 1999). These features are considered having balanced and healthy effects in reducing the risk of obesity, cancer, and heart disease. Researches in the rapeseed and Arabidopsis indicated that genes involved in triacylglycerol (TAG) biosynthesis (TAGBS) pathway are very much related to the oil composition and yield (Baud and Lepiniec, 2010). The TAGBS pathway mainly includes two conceptually simplified systems: the biosynthesis and modification of fatty acids (FAs), and TAG assembly (Baud and Lepiniec, 2010;Xu et al., 2011). FAs are de novo synthesized in the plastids with acetyl-CoA as a common precursor, and exported toward the cytosolic compartment as FA-CoA esters, which mainly contains FAs with double bonds of less than two and carbons of no longer than 18. FA modification generates a variety of FAs and involves enzymes contributing to FA elongation (>C 20 ) and polyunsaturated FA (PUFA) biosynthesis, such as BETA-KETOACYL -[ACYL-CARRIER PROTEIN] REDUCTASE (KAR) , and BETA-KETOACYL-COA SYNTHASE (KCS) (Joubes et al., 2008), FATTY-ACID DESATURASE (FAD) (Okuley et al., 1994;Ma and Browse, 2006), and PHOSPHATIDYLCHOLINE:DIACYLGLYCEROL CHOLINEPHOSPHOTRANSFERASE (PDCT) (Lu et al., 2009). The TAG assembly occurs in the endoplasmic reticulum and involves four consecutive enzymatic reactions catalyzed by GLYCEROL-3-PHOSPHATE ACYLTRANSFERASE (GPAT), LYSOPHOSPHOLIPID ACYLTRANSFERASE (LPAT), PHOSPHATIDIC ACID PHOSPHATASE (PAP), and DIACYLGLYCEROL ACYLTRANSFERASE (DGAT) (Baud and Lepiniec, 2010). Although the majority of candidate genes in TAGBS pathway were identified in the C. oleifera transcriptome using the 454 sequencing platform (Xia et al., 2014), their expression patterns are still largely unexplored in other Camellia species such as C. reticulata. The regulation of unique FA composition of camellia oil requires to be further studied, which will generate useful information to guide Camellia improvement programs.
C. reticulata is also a well-known camellia flower, which has been cultivated as a popular gardening plant in its indigenous region for at least 1300 years (Yu and Bruce, 1980). This ornamental woody plant is notable for large flowers, brilliant colors, various cultivars, and long florescence (Yu and Bruce, 1980;Liu and Gu, 2011). Extensive research efforts have advanced our knowledge regarding the genetic basis of photoperiodic flowering process in higher plants, particularly in A. thaliana for example. The photoperiodic flowering pathway overlaps with the circadian rhythm network, which acts as an endogenous timing system, enabling plants to promote flowering in response to photoperiod (Mas, 2005;Andres and Coupland, 2012). The photoperiod and irradiance are mainly perceived by mature leaves and mediated by many genes, mainly including FLAVIN-BINDING KELCH REPEAT F BOX PROTEIN (FKF1) (Imaizumi et al., 2005), GIGANTEA (GI) (Fowler et al., 1999), and CONSTANS (CO) (Putterill et al., 1995;Valverde et al., 2004). CO as the key regulator in the photoperiodic flowering pathway actively promotes the transcription of FLOWERING LOCUS T (FT) gene under long-day conditions (Putterill et al., 1995;Valverde et al., 2004). The FT protein, also known as florigen, is believed to move to the shoot apical meristem where the flowering process occurs (Amasino and Michaels, 2010). It has been proposed that FT interacts with FLOWERING LOCUS D (FD) to form a FT-FD complex in the meristem (Wigge et al., 2005). The FT-FD complex promotes the transcription activation of SUPPRESSOR OF OVEREXPRESSION OF CONSTANS (SOC1) and a floral meristem-identity gene AP1APETALA1 (AP1) (Putterill et al., 1995;Wigge et al., 2005). SOC1 encodes a MADS box transcription factor that activates another floral meristem identity gene LEAFY (LFY) (Putterill et al., 1995;Samach et al., 2000). LFY and APETALA1 (AP1) subsequently induce flower development at the anlagen of shoot apical meristem according to the ABC model (Jack, 2004;Andres and Coupland, 2012). A recent transcriptome sequencing and analysis of the summer-flowering C. azalea has provided insights into its floral bud development, which characterized some genes involved in the photoperiodic flowering pathway (Fan et al., 2015). More information about this pathway from other Camellia plants will facilitate horticulturists to breed new cultivars with a wide-range blooming periods.
Flavonoids and carotenoids, two major groups of colorful pigments, are supposed to be involved in flower coloration in Camellia species (Nishimoto et al., 2004;Li et al., 2009). The pathways of flavonoid biosynthesis (FlaBS) and carotenoids biosynthesis (CrtBS) have been clearly uncovered in A. thaliana and some other plant species (Cunningham and Gantt, 1998;Winkel-Shirley, 2001;Cazzonelli and Pogson, 2010). Flavonoids are derived from phenylalanine. For the first steps, PHENYLALANINE AMMONIA-LYASE (PAL) catalyzes the phenylalanine into cinnamate, and subsequently, CINNAMATE 4-HYDROXYLASE (C4L) and 4-COUMARATE-COA LIGASE (4CL) catalyze the conversion of cinnamate to p-coumaroyl-CoA. Then, CHALCONE SYNTHASE (CHS) as a rate-limiting enzyme converts the p-coumaroyl-CoA into chalcone (Dao et al., 2011). Flavones, flavonols, anthocyanins and proanthocyanidins (PAs) are synthesized from the common chalcone precursor along the FlaBS pathway. Evidences indicated that FlaBS pathway genes are largely regulated at the transcription level by a complex of transcription factors (TFs) including R2R3 MYB TFs (Czemmel et al., 2012) and basic helix-loop-helix (bHLH) TFs (Nesi et al., 2000). For example, the MYBA1 in Vitis vinifera is a R2R3 MYB TF specifically controlling the expression of UFGT, which encodes an enzyme responsible for the anthocyanin biosynthesis and able to determine the grape berry colors . Carotenoids are a group of isoprenoid molecules with characteristic color in the yellow to red range (Cazzonelli and Pogson, 2010). Carotenoids are synthesized from the five-carbon building blocks isopentenyl diphosphate (IPP) and its double-bond isomer dimethylallyl diphosphate (DMAPP), both of which are produced by the plastid-localized methylerythritol phosphate pathway (Phillips et al., 2008). Three IPP molecules are added to DMAPP by GERANYLGERANYL DIPHOSPHATE SYNTHASE (GGPS) to generate geranylgeranyl diphosphate (GGPP), a common precursor for the biosynthesis of carotenoids and several other groups of plastidic isoprenoids (Hirschberg, 2001). Although genes involved in these pathways have been extensively studied in other plants, more studies on these networks are needed in Camellia species. Transcriptome sequencing uncovered the candidate genes in the FlaBS pathway in Camellia, such as C. sinensis (Shi et al., 2011), C. taliensis (Zhang et al., 2015), and C. chekiangoleosa (Wang et al., 2014). However, there is still a lack of comprehensive expression profiling of the FlaBS genes. In particular, there is so far no information for the CrtBS pathway in Camellia plants.
The economic importance of Camellia species is largely due to the demand for young leaves (e.g., tea leaves from C. sinensis), ornamental flowers, and seed oil. Genes that are involved in TAGBS, FlaBS and CrtBS pathways and photoperiodic flowering are supposed to be highly relevant to the above-mentioned agricultural and horticultural traits. While comprehensive analysis on gene networks related to photoperiodic flowering and CrtBS is so far absent in Camellia plants, genes involved in TAGBS and FlaBS were reported in C. sinensis (Shi et al., 2011), C. chekiangoleosa (Wang et al., 2014), and C. oleifera (Xia et al., 2014). But, they are still far away from being well characterized in Camellia because of scarce information of gene expression profiles. Indeed, although RNA sequencing (RNA-Seq) datasets have been reported in several Camellia species, large-scale gene expression profiles were only available in C. sinensis to identify genes activated during cold acclimation , because other studies are unable to provide a broad expression analysis because of the limitation of sequencing strategies, such as 454 sequencing Wang et al., 2014;Xia et al., 2014) or Illumina sequencing with a single library (Shi et al., 2011). Compared with the other Camellia species, very few genomic resources are available for C. reticulata. Nevertheless, a full catalog of genes and their expression profiles in C. reticulata are needed to better understand the above mentioned biological pathways.
Here, Illumina RNA-Seq was employed to de novo sequence the transcriptome of the diploid C. reticulata. The transcriptomes from the five tissues under normal development conditions were separately sequenced and compared to present a global survey of expression profiles as well as differential expression analysis in C. reticulata. A large set of unigenes was obtained to identify the majority of genes related to TAGBS, FlaBS, CrtBS, and photoperiodic flowering pathways. The characterization of EST-SSRs (expressed sequence tag, simple sequence repeat) from the transcriptome of C. reticulata has expanded valuable marker resources in breeding programs of the camellia community.

Plant Material, RNA Isolation and Illumina Sequencing
The wild C. reticulata plant YB1-2, which was previously characterized as a diploid (Xia et al., 1994;Liu and Gu, 2011;Huang et al., 2013), was collected from Sichuan Province, China (Supplementary Table S1). A total of five tissues, including leaf buds, mature leaves, flower buds, flowers and immature fruits, were used for transcriptome sequencing and quantitative reverse transcription polymerase chain reaction (qRT-PCR) validation; blackening seeds were only used for qRT-PCR (Supplementary Figure S1). Samples were immediately frozen in liquid nitrogen and stored at −80 • C. Total RNA of each sample was isolated using a modified CTAB (cetyl trimethylammonium bromide) method (Li L. et al., 2008). RNA samples were treated with RNase-free DNase I (Takara) to avoid DNA contamination. RNA quality was assessed using an Agilent 2100 BioAnalyzer.
For each RNA sample, Poly-A mRNA was enriched and then used to prepare a 350-bp paired-end cDNA library (2 × 100 nt) according to the Illumina protocol. Paired-end sequencing was performed on the Illumina HiSeq ™ 2000 platform.

Sequence Data Processing and de novo Assembly
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/ fastqc/) was used to evaluate and visualize the sequence quality before and after trimming process. The trimming process was performed with Trimmomatic (Lohse et al., 2012), sequentially removing adapters, the first seven bases at the 5 ′ end of the reads due to sequence bias (Hansen et al., 2010), low quality bases (phred score < 20), and short reads (<50 nt). The clean reads were de novo assembled using Trinity software (Grabherr et al., 2011), with parameter settings of "-group_pairs_distance 450 -min_contig_length 200 -CPU 4" and default options otherwise.

Post-Assembly Processing
Three rounds of filtering process were performed to construct a high-quality nuclear transcriptome assembly. First, putative protein-coding sequences most likely from plant epiphytes and pathogens were identified and discarded using a previously described taxonomy-based method (Krasileva et al., 2013). Next, a second round of filtering process was performed using DeconSeq (Schmieder and Edwards, 2011) against a local database constructed from genomic sequences of candidate contaminant species (i.e., human, bacteria, virus, fungi, and top 20 epiphyte or pathogen species inferred from the taxonomybased filtering step). Finally, mitochondrial and plastidial transcripts were identified and removed by searching against publicly available mitochondrion and plastid genomes of dicotyledonous plants.
Then the filtered sequences were subjected to cluster analysis using CD-HIT-EST (Huang et al., 2010) with the following parameters: -r 1 and -c 0.9. The longest sequence of each cluster was used as a representative sequence (i.e., unigene).

Sequence Annotation
Homology searches were performed against the UniProtKB/Swiss-Prot (Swiss-Prot), UniProtKB/TrEMBL (TrEMBL), NCBI (National Center for Biotechnology Information) non-redundant protein database (NR), A. thaliana proteins database (TAIR10), and V. vinifera proteins database (IGGP 12X) using a BLASTx procedure with an e-value threshold of 10 −3 . For each unigene, the best hit against these databases was used to predict sequence orientation, coding DNA sequence (CDS), and polypeptide sequence with GeneWise software (Birney and Durbin, 2000). The predicted peptide sequences were analyzed to assign corresponding protein domains and families by using the pfam_scan tool against the Pfam database (Punta et al., 2012). Gene ontology (GO) were obtained based on the NR database annotations using Blast2GO software (version 2.6.5) (Conesa et al., 2005), and then WEGO software (Ye et al., 2006) was used to obtain GO functional classifications. The unigenes were also submitted to the online KEGG (Kyoto Encyclopedia of Genes and Genomes) Automatic Annotation Server (KAAS) to obtain enzyme commission (EC) numbers and associated KEGG orthology (KO) identifiers that are directly linked to objects in the KEGG pathway map (Moriya et al., 2007). The KAAS annotation was performed with single-directional best hit method using angiosperm species data sets as reference. We focused on unigenes assigned to four pathways: photoperiodic flowering, TAGBS, FlaBS, and CrtBS. Finally, we performed a manual curation for these unigenes.

Transcript Abundance and Expression-Based Analysis
Paired end reads were aligned to the assembly by Bowtie (Langmead and Salzberg, 2012), and the resulting alignments were used to estimate expression abundances in FPKM (expected number of fragments per kilobase of transcript sequence per millions base pairs sequenced) by RSEM (Li and Dewey, 2011). The Bioconductor tool edgeR (Robinson et al., 2010) was employed to calculate the expression abundance fold change based on pairwise comparisons of normalized FPKM among five sequenced tissues using exact statistical method with default settings. Differentially expressed genes (DEGs) were defined with a threshold of fold change ≥ 4 and false discovery rate (FDR) ≤ 0.001. Enrichment analyses of DE gene sets in KEGG pathways or the GO database were performed using the online tool KOBAS (KEGG Orthology Based Annotation System, http://kobas.cbi. pku.edu.cn/) (Xie et al., 2011) with the expressed genes (FPKM ≥ 1) as background (corrected p ≤ 0.05).

qRT-PCR Analysis
Primers used for qRT-PCR assays were designed with the Primer Express software (version 3.0, Applied Biosystems), and they amplified PCR products varied from 80 to 159 bp (Supplementary Table S2). The housekeeping gene ELONGATION FACTOR 1 ALPHA (EF1A) was used as internal reference control for normalization (Nicot et al., 2005). qRT-PCR assays were performed as described previously (Xu et al., 2011). The relative expression of the genes was calculated using the 2 − Ct method.

EST-SSR Identification and Polymorphism Survey
To increase confidence, only unigenes with length ≥500 nt were used for EST-SSR identification. They were searched by using the MISA program (Thiel et al., 2003), mining EST-SSRs with 2-to 6-nt motifs and a minimum length of 12 nt.
Twenty EST-SSRs were chosen for polymorphism survey in 24 individual C. reticulata plants (Supplementary Tables S1, S3). PCR primer pairs were designed using Invitrogen Vector NTI (version 10) with standard criteria. DNA extraction was performed as described previously (Doyle and Doyle, 1987). Standard PCR amplifications and electrophoresis were performed as described previously . The number of alleles per locus (NA), effective number of alleles (NE), Shannon's diversity index (I), observed heterozygosity (Ho), and expected heterozygosity (He) were estimated with POPGENE software (version 1.32) (Yeh et al., 1999). The polymorphic information content (PIC) value was calculated according to a previously described formula (Botstein et al., 1980).

RESULTS AND DISCUSSION
Sequencing, de novo Assembly and Assessment of the C. reticulata Transcriptome Distinct cDNA libraries of leaf buds, mature leaves, flower buds, flowers, and immature fruits (Supplementary Figure S1) from a wild diploid C. reticulata plant were separately sequenced with Illumina HiSeq ™ 2000, generating a total of 394.9 million 100 nt paired-end raw reads. After a stringent trimming process, about 311.3 million clean reads (78.8% of total raw data) with 26.2 billion nucleotides in total were retained ( Table 1). The high-quality reads were de novo assembled into contigs with Trinity, which has been shown to be the best single k-mer assembler for de novo assembly from RNA-Seq short reads (Grabherr et al., 2011;Zhao et al., 2011). As direct output from the assembly procedure often includes contaminants and redundant sequences, assembled sequences were subjected to do a post-assembly processing to obtain a high-quality nuclear transcriptome (see Material and methods). As a result, an assembly of 232,428 contigs with a total size ∼186.4 million nucleotides (Mnt) was established for C. reticulata. After clustering analysis, 141,460 unigenes with a total size ∼96.1 Mnt were generated ( Table 1). The unigenes ranged from 200 to 9,880 nt, with a mean length of 679 nt and an N50 length of 1080 nt ( Table 1 and Figure 1A). Though a systematic evaluation standard is not yet available for transcriptome assembly of non-model organisms (Martin and Wang, 2011), we tried to overview the quality of our assembly with several broadly used parameters and by comparing it to independent databases from closely related species. Reads mapping showed that the proportion of reads assembled was 84.5% (Table 1), which is a comparable alignment rate to that of other de novo assemblies. 86.3% of the mapped paired-end reads aligned concordantly, showing good physical evidence of sequence contiguity. The unigenes was compared to available ESTs and protein sequences of Camellia plants from NCBI. Of 111,905 non-redundant Camellia ESTs, 96,092 (85.9%) ESTs were represented in our assembly (Megablast, E = 10 −9 ), among which 72,924 (75.9%) ESTs were matched with more than 80% identity and 80% coverage (Supplementary Figure S2). Of 256 non-redundant Camellia proteins (≥300 AA), 255 (99.6%) matched to at least one assembled unigene (BLASTx, E = 10 −9 ), among which 215 (84.3%) were represented with ≥80% coverage. These results demonstrated that the assembly successfully constructed a large number of homologous transcripts with desirable lengths. Furthermore, chimeric assembly level was inspected using the top longest assembled sequences (Van Belleghem et al., 2012). All of the 10 longest unigenes (≥7400 nt) in our assembly positively matched with long gene sequences in public databases without obvious chimeric evidence, and 9 of them were confirmed to have good alignment confidence (70% identity and 80% coverage) at both the cDNA and protein levels (Supplementary Table S4). These assessments indicate that the C. reticulata transcriptome constructed herein possessed desirable completeness, accuracy, and contiguity.
To facilitate the access and utilization of the C. reticulata transcriptome data, the sequences of raw reads and unigenes were deposited in the NCBI (BioProject ID PRJNA297756). The full list of transcript sequences is available upon request.

Functional Annotation and Gene Ontology Classification
To annotate the transcriptome with putative functions, similarity searches for each unigene were performed using BLASTx against five public databases: Swiss-Prot, TrEMBL, NR, TAIR10 and the Vitis vinifera protein database. With an E-value threshold of 10 −3 , 69,922 (49.4% of the total) unigenes were significantly matched with known proteins in at least one of the searched databases ( Table 2 and Supplementary Table S5), and 42,042 (29.7% of the total) unigenes had significant hits with proteins presented in all of the five databases ( Table 2). All of the 69,922 unigenes matched to known proteins were considered as putative protein-coding sequences. The length distribution of these putative proteincoding sequences is shown in Figure 1A, showing that the longer sequence would have a higher probability of being matched.
For the NR annotations, the top-hit species distribution is shown in Figure 1B. We found that 36.75% of the mapped unigenes had the highest matches to genes from V. vinifera, followed by Theobroma cacao (10.02%) and Solanum lycopersicum (6.85%) ( Figure 1B). Overall, 54.29% of the matched sequences were mapped with similarity >80% (Figure 1C), and 76.01% of the matched sequences showed strong homology with an E < 10 −100 ( Figure 1D). Notably,   Table 2), suggesting that the hypothesis of genome duplication events in C. sinensis probably might occur in C. reticulata (Shi et al., 2011). Sequence orientation, CDS, and polypeptide sequence were predicted using GeneWise software (Birney and Durbin, 2000). For each putative polypeptide sequence, its functional domain and associated family was inferred by pfam_scan analysis against the Pfam database. The top 10 abundant families and domains are listed in Supplementary Table S6. The "pentatricopeptide repeat" (PF01535), which is thought to be involved in organelle biogenesis (Lurin et al., 2004), was the most abundant protein family. The "protein kinase domain" (PF00069), which functions in a process called phosphorylation, was the most abundant protein domain. The GO classification system that describes gene function into three major categories (biological processes, molecular functions, and cellular components) and additional subcategories was also applied to putative gene functions. Overall, 39,301 unigenes were assigned to 131,466 GO terms (3736 unique GO terms; Supplementary Table S7) that were further classified into 47 subcategories (Supplementary Figure S3). The most abundant GO subcategories for biological processes, molecular functions, and cellular components were metabolic process (GO:0008152), binding (GO:0005488), and cell (GO:0005623), respectively.
Out of 20 EST-SSR primer pairs chosen for polymorphism survey, 18 were successfully amplified in 24 C. reticulata accessions (Supplementary Tables S1, S3 and Supplementary Figure S4A). Among the 18 working primer pairs, 17 successfully amplified PCR products with the expected sizes, and one pair (SSR11) generated a larger PCR product (∼270 bp) than expected size of 159 bp, suggesting that there may be a short intron within the amplicon. Our polymorphism survey showed that seven primers pairs could amplify polymorphism bands among the 24 C. reticulata representative samples (Supplementary Figures S4B-H). The PIC values varied from 0.33 to 0.69 with an average of 0.50, and the NA is 3.57 (Table 4). These results clearly demonstrated that EST sequences derived from RNA-Seq are effective and valuable resources to develop polymorphic SSR markers, which are essential in various applications such as population genetic structuring and linkage mapping.

Transcriptome Expression Profiling and Differential Expression Analysis
Expression levels of unigenes were determined by aligning the RNA-Seq reads from each library to the assembly. The default empirical abundance threshold of 1 FPKM was used to evaluate whether a gene was positively expressed (Vogel and Marcotte, 2012;Gonzalez-Porta et al., 2013). As a result, 94,450 unigenes were positively expressed in at least one of the libraries. Among them, 17,956 unigenes were expressed in all five tissues, and 8,837, 13,963, 3,265, 10,540, and 7,094 unigenes were specifically expressed in leaf buds, mature leaves, flower buds, flowers, and immature fruits, respectively (Supplementary Figure S5A). Based on expression levels, the sequence depths of expressed unigenes varied across a broad range. The top 1% of abundantly expressed unigenes accounted for 60% of the overall abundance value, while the top 10% abundantly expressed unigenes were 86% of the overall abundance value (Supplementary Figure S5B).
A total of 22,229 unigenes were defined as DEGs (fold change = 4 and FDR = 0.001); and 18,639 (83.85%) of these were putative protein-coding genes, indicating specific enrichment of functional genes in this dataset. All of the DEGs were assigned to five groups (DELB, DEML, DEFB, DEFL, and DEFR) according to the tissue identity where their highest expression occurred (Supplementary Table S9). As a result, Group DEML was assigned with the maximum number of DEGs (5,451), followed by Group DELB (5,073), Group DEFL (4,599), Group DEFR (3,756), and Group DEFB (3,350) (Supplementary Table S9). KEGG pathway or GO enrichment analysis of the DE unigenes in each group indicated that unigenes preferentially expressed in a tissue are highly related to the specific functions of that tissue (Supplementary Table S10). For example, Group DEML expressed most preferentially in mature leaves enriched genes involved in "photosynthesis, " "carbon fixation, " "fructose, mannose and nitrogen metabolism, " "porphyrin and chlorophyll metabolism, " as well as "carotenoid biosynthesis." Group DEFB and DEFL showed preferential expression of genes involved in different functional processes, which are related to flower development in different stages (flower buds and flowers), such  as "plant hormone signal transduction, " "pigment accumulation" (GO:0043476), "carotenoid biosynthesis" and "phenylpropanoid biosynthesis." Genes from group DEFR were highly expressed in immature fruits, and included genes functioned in "flavonoid biosynthesis, " "phenylpropanoid biosynthesis, " "flavone and flavonol biosynthesis, " and "biosynthesis pathways of secondary metabolites." Collectively, the DEGs are able to mirror physiological differences among the five tissues. These results will facilitate the identification of genes that function in specific physiological programs.

qRT-PCR Validation
To evaluate the reliability of RNA-Seq analysis, 22 gene homologs related to TAGBS and FlaBS pathways were selected for qRT-PCR test in six tissues (leaf buds, mature leaves, flower buds, flowers, immature fruits, and blackening seeds) (Supplementary Figure S1 and Supplementary Table S1). In general, our qRT-PCR results show a high degree of consistency with the RNA-Seq results (Figures 2, 3). Expression patterns of nine (40.9%) cases ( Figures 2J,K, 3A-E,H,J) fit well with the RNA-Seq results across all five tissues. Ten (45.5%) cases ( Figures 2B,C,E-G,I, 3F,G,I,K) had almost similar expression patterns but with very small partial inconsistences compared to the RNA-Seq results. It is rational and acceptable that there are certain differences in direct comparisons between the RNA-Seq and qRT-PCR results due to different normalization methods, bias in library preparation in RNAseq, and other technical biases (Bustin, 2002;Li et al., 2010;Zheng et al., 2011).

Triacylglycerol Biosynthesis Pathway in C. reticulata
Based on KEGG pathway annotation of the C. reticulata transcriptome, we totally identified 93 unigenes for the TAGBS pathway (Supplementary Table S11; Figure 4). Most of the genes involved in TAGBS pathway were present in the assembly, except for two KCS genes, FATTY ACID ELONGATION 1 (FAE1, namely KCS18) and KCS9 (Supplementary Table S11). We failed to identified FAE1 and KCS9 in the traditional camellia oil plant C. oleifera, based on previously published transcriptome dataset (Xia et al., 2014). FAE1 is involved in the elongation of C 18 to C 22 FAs (Kunst et al., 1992), while KCS9 is involved in the elongation of C 22 to C 24 FAs (Kim et al., 2013). Arabidopsis fae1 mutant had a deficiency in producing very long-chain FAs (VLCFA) in seed oil (Kunst et al., 1992). Thus, the absence of FAE1 might be an important factor contributing to the camellia oil with very low level of VLCFA. The genes of this pathway showed high similarities to Arabidopsis with an average identity of 70% at amino acid sequence level, suggesting their potentially functional conservation. Many of these genes had multiple copies in C. reticulata compared with Arabidopsis (63 duplicated unigenes matched to 24 Arabidopsis unique homolog genes), such as STEAROYL-ACP DESATURASE (SAD, 3 unigenes), FAD2 (2 unigenes), and DGAT2 (2 unigenes). This phenomenon was also found in other Camellia species including C. sinensis, C. taliensis, and C. oleifera (data not shown). The results seemingly support the hypothesis that Camellia species might share whole genome duplication events during the course of evolution before their speciation (Shi et al., 2011). Gene expression profiles of the TAGBS pathway were shown in Figure 5A. For all of the 93 TAGBS unigenes, 42 were defined as DEGs (fold change = 4, FDR = 0.001). There are 29 (69%) DEGs up-regulated in flowers or flower buds, and some of them are FA desaturase homologs, including SAD_a, FAD2_a, FAD2_b, FAD3, FAD8_b, and FAD8_c. These results indicate that a variety of fatty acids could be demanded for flower development. FA desaturase genes are often related to the modification of membrane fluidity in response of cold hardiness (Kodama et al., 1994;Matteucci et al., 2011;Wang et al., 2013). The flowers of C. reticulata often bloom in winter when the temperatures are quite low (Supplementary Figures S6E,F, S7). Thus, high expression of these desaturase genes may explain the freezing tolerance of C. reticulata flowers and flower buds in winter.
Camellia oil contains 60-80% oleic acid and 5-10% linoleic acid. It is well characterized in Arabidopsis that plastid-localized enzymes were responsible for oleic acid biosynthesis with 18:0-ACP as precursor: SAD catalyzes the conversion of stearoyl-ACP to oleoyol-ACP (Shanklin and Somerville, 1991), while FATTY ACYL-ACP THIOESTERASE A (FATA) plays a key role in the formation of free oleic acid (Hellyer et al., 1992;Moreno-Perez et al., 2012). FAD2 encodes an oleate desaturase that required for PUFA biosynthesis in endoplasmic reticulum (Okuley et al., 1994). In this study, FATA gene was observed with only one copy, while SAD and FAD2 were observed with 3 and 2 copies, respectively (Supplementary Table S11). Different expression patterns were observed among the copies of SAD and FAD2. FAD2_a was up-regulated in young fruit, while FAD2_b was down-regulated in young fruit ( Figure 5A). Clustering analysis of gene expression profiles shows that, SAD_a and FATA were coexpressed genes in the same subcluster, while SAD_b and SAD_c were in different subclusters ( Figure 5A). Using qRT-PCR, we also investigated the expression levels of several TAGBS-related genes in ripening seeds. While FAD2_a had a lower expression level in seeds than in mature leaves (Figure 2E), SAD_a and FATA had several fold higher expression level in seeds than in mature leaves (Figures 2D,H). These observations suggest that SAD_a and FATA may be key genes responsible for MUFA production in C. reticualta seeds. Up-regulation of SAD_a gene and downregulation of FAD2_a gene may be an efficient way to control the production ratio of MUFAs to PUFAs. Previously, Xia et al. (2014) found that parallel evolution of FAD2 genes may occur between Camellia species and olive, which might result in the similar FA composition in their seed oil. Moreover, FAD2 RNAi experiment in Camelina sativa showed evidence that FAD2 is an efficient target for genetic manipulation toward lowering the PUFA content and increasing the oleic acid content (Nguyen et al., 2013).Our data here show that the transcription levels of SAD, FAD2, and FATA were highly regulated in seeds of C. reticulata. Their expression patterns may well explain why the FA composition of camellia oil was dominated by oleic acid but with relative low PUFA levels.

Photoperiodic Flowering Pathway in C. reticulata
To better understand the flowering transition process in C. reticulata, we conducted a field study to record the timing series of major vegetative and reproductive events over an entire year (Supplementary Figures S6A-L). Our observation showed that the flowering transition of C. reticulata often occurs at the apex of branch around mid-April when the day length is more than 12 h in Sichuan, China (Supplementary Figures S6A, S7). This result suggests that the photoperiodic flowering pathway may determine flowering time control of C. reticulata.  Supplementary Table S13. LB, leaf buds; ML, mature leaves; FB, flower buds; FL, flowers; FR, immature fruits; SE, seeds.
In the model plant, A. thaliana, genes involved in photoperiodic flowering pathway were well characterized (Andres and Coupland, 2012). Homology searches successfully identified 50 unigenes potentially involved in this pathway (Supplementary Table S12). Of them, 36 (72%) candidate genes were represented with full-length CDSs, such as CRYPTOCHROMEs (CRYs), LATE ELONGATED HYPOCOTYL (LHY), TIMING OF CAB EXPRESSION 1 (TOC1), FKF1, GI, SOC1, and FT. However, CO and its transcription repressor CYCLING-DOF-FACTOR-1 (CDF1) were absent in the assembly. CO was expressed at an extremely low level in Arabidopsis and was only actively detected in leaves or shoots under long-day condition (Putterill et al., 1995). Thus, the possibility cannot be excluded that the unique expression pattern of CO led to the lack of itself in our results. Alternatively, we found several homologs of the CO gene family (e.g., COL2, COL4, COL5, and COL9; Supplementary Table S12). Similar to genes in TAGBS pathway, many genes involved in the photoperiodic flowering pathway appeared to have multiple copies in C. reticulata. For instances, FD was observed with 4 copies, and both SOC1 and AP1 were observed with three copies (Supplementary Table S12). Based on sequence comparisons, the candidate genes in this pathway showed moderate similarity to Arabidopsis with an average identity of 62% at amino acid sequence level. However, great sequence differences were found in ELF3_a, ELF3_b, and LHY with identities lower than 37%. The complex evolutionary process may bring about sequence variations and functional differentiation of these genes between C. reticulata and Arabidopsis.
The gene expression patterns of the photoperiodic flowering pathway genes across the tested tissues were shown in Figure 5B. A total of 20 putative genes were identified as DEGs, including FT, SOC1 (SOC1_a and SOC1_c), LFY (LFY_a and LFY_b), and AP1 (AP1_a, AP1_b, and AP1_c) (Figure 5B). The photoperiod and irradiance are mainly perceived by mature leaves. Nevertheless, many of these DEGs were not specially expressed in mature leaves. For examples, SOC1_a was upregulated in mature leaves and leaf buds, and FT was upregulated in mature leaves and young fruits ( Figure 5B). These results suggest that these genes may play important roles in multiple developmental processes. The duplicated genes in this pathway demonstrate differential expression patterns across these tissues. For example, while SOC1_a was upregulated in mature leaves and leaf buds, SOC1_b and SOC1_c were actively expressed in leaf buds and flowers ( Figure 5B). Many studies suggested that the alteration of spatiotemporal expression is an important indicator of functional divergence in duplicated genes (Ganko et al., 2007). Thus, these duplicated FIGURE 4 | Putative genes related to TAG biosynthesis in C. reticulata. The numbers in brackets following each gene name indicate the number of unigenes annotated to that gene. For each gene abbreviation, its full name and other information can be referenced in Supplementary Table S11. genes may play diverse roles in multiple developmental processes.

Flavonoid Biosynthesis Pathway in C. reticulata
Flavonoids include the four major subgroups of flavones, flavonols, anthocyanins and PAs that play vital roles in flower coloration and many other plant physiological processes. Homology searches identified 56 unigenes as candidate genes that are related to the FlaBS pathway (Supplementary Table S13, Figure 6A). Most of these genes had multiple copies in C. reticulata compared to Arabidopsis, except for six genes (ACC_Ho, FNSII, UGT75B2, AOMT, MYBF1, and F3 ′ 5 ′ H; Supplementary Table S13). Expression profiles for these genes were shown in Figure 5C. Overall, 89.3% (50/56) of these putative genes were defined as DEGs, indicating that most of these genes were extensively regulated at the level of transcription.
In grape, VviMYB5 (MYB5a and MYB5b) (Deluc et al., 2006(Deluc et al., , 2008, VviMYBF1 (Czemmel et al., 2009), VviMYBPA1 , and VviMYBA1 (Kobayashi et al., 2004) were characterized as R2R3 MYB TF genes that play major roles in the production control of total flavonoids, flavonol, PAs, and anthocyanin pigments, respectively. The homologs of these TF genes were also identified in our analysis. Neither MYB5a nor MYB5b in C. reticulata is differentially expressed across tissues (Figure 5C), which is consistent to their putative roles as general regulators of flavonoid precursors (Deluc et al., 2006(Deluc et al., , 2008. VviMYBF1 in grape has two functional MYB domains and serves as a specific regulator in expression control of FLAVONOL  SYNTHASE (FLS) gene to regulate the flavonol production (Czemmel et al., 2009). The C. reticulata MYBF1 was found with only a single MYB domain (Supplementary Figure S8), making great difference with its homolog VviMYBF1. Furthermore, none of five copies of FLS seems to be correlated with MYBF1 at expression level, which is inconsistent with previous observation that MYBF1 is a positive regulator of FLS gene (Czemmel et al., 2009). Thus, MYBF1 in C. reticulata might have evolved a novel function, which was preferentially expressed in seeds ( Figure 3J).
Two copies of MYBPA1 were observed in the C. reticulata transcriptome, MYBPA1_a was a bit more conserved than MYBPA1_b compared to VviMYBPA1 (Supplementary Table S13). While MYBPA1_b showed specific expression in young fruits, MYBPA1_a were highly expressed in leaf buds, mature leaves and young fruits. ANTHOCYANIDIN REDUCTASE (ANR) and LEUCOANTHOCYANIDIN REDUCTASE (LAR) are known as two key enzymes in the biosynthesis of PAs. While ANR catalyzes the formation of cis-flavan-3-ol (an oligomeric form of PAs) from anthocyanidin (Xie et al., 2003), LAR catalyzes the formation of tran-flavan-3-ol (another oligomeric form of PAs) from leucoanthocyanidin (Mauge et al., 2010). Previous report showed that VviMYBPA1 positively activated the expression of ANR and LAR to specifically control PAs production . In this study, the PA-specific biosynthetic gene ANR had two copies (ANR_a and ANR_b) in the assembly, and both of them had highly similar expression patterns with MYBPA1_a across tissues (Figures 3G,I, 4C). Among three copies of LAR, only LAR_a appeared high similar expression patterns with MYBPA1_a, which was also validated by qRT-PCR (Figures 3E,I). The expression patterns of LAR_a, ANR_a, and ANR_b correlated with that of MYBPA1_a suggest that the regulation of C. reticulata MYBPA1_a may be similar to its homologs in grape.
In this study, two copies of MYBA1 were identified in the C. reticulata transcriptome. MYBA1_a was more conserved than MYBA1_b compared to VviMYBA1 (Supplementary Table S13). MYBA1_a was identified as a DEG, which was highly expressed in flower buds, flowers and young fruits but lowly expressed in leaf buds and mature leaves (Figures 3K, 5C). Clustering analysis of gene expression profiles showed that MYBA1_a shared the same subcluster with several anthocyanin-specific biosynthetic genes, including ANTHOCYANIDIN 3-O-GLUCOSYLTRANSFERASE (UF3GT_a and UF3GT_c) and ANTHOCYANIDIN 5-O-GLUCOSYLTRANSFERASE (UF5GT_b). MYBA1_a and these anthocyanin-specific biosynthetic genes were highly expressed in both flower buds and flowers (Figures 3H,K, 5C), indicating their important roles of anthocyanin biosynthesis in flower developments.

Carotenoid Biosynthesis Pathway in C. reticulata
Similar to flavonoids, carotenoids are a diverse group of colorful plant pigments. They play vital roles in many essential physiological processes such as the photosynthesis, flower coloration, and production of phytohormones (Cazzonelli and Pogson, 2010). In the assembly, we successfully identified 33 unigenes potentially involved in the CrtBS pathway ( Figure 6B and Supplementary Table S14). Different from the FlaBS pathway, most (70%) of these genes in this pathway appeared a single copy. Seven genes had multiple copies, including GERANYLGERANYL REDUCTASE ( Table S14). The expression profiles of these CrtBS-related genes are shown in Figure 5D. Most (25/33) of these genes were defined as DEGs, and the majority of these DEGs were up-regulated in mature leaves or flowers ( Figure 5D). Leaf tissues typically accumulate lutein, beta-carotene, violaxanthin, and neoxanthin for the photosynthesis, antenna assembly, and photoprotection (Cazzonelli and Pogson, 2010). Thus, genes related to the synthesis of these compounds are reasonably up-regulated in mature leafs, such as GGR_a, LUTEIN1 (LUT1), LUT2, and LYCOPENE CYCLASE (LYC) (Figure 5D). PHYTOENE SYNTHASE (PSY), which encodes a rate-limiting enzyme that functions in the first committed step of the CrtBS pathway (Cazzonelli and Pogson, 2010), was markedly up-regulated in both flower buds and flowers ( Figure 5D). Clustering analysis of gene expression profiles showed that NEOXANTHIN SYNTHASE (NXS) and NCED3_c had similar expression patterns with PSY ( Figure 5D). In abscisic acid (ABA) biosynthesis, the NXS enzyme catalyzes the conversion of violaxanthin into neoxanthin (Al-Babili et al., 2000), and the NCED enzyme subsequently catalyzes the cleavage of neoxanthin, which represents the first committed step of ABA biosynthesis (Qin and Zeevaart, 1999). Previously, ZEP and ABA2 in Arabidopsis were also identified as ABA biosynthetic genes (Gonzalez-Guzman et al., 2002). Figure 5D show that ABA2_b, NCED4, NCED3a, ZEP_a were actively expressed in the camellia flowers. Collectively, ABA biosynthesis genes may play important roles in C. reticulata flowers for developmental processes.
Carotenoids provide flowers with distinct colors, ranging from yellow to orange or red. In yellow-flowered Camellia species, such as C. chrysantha, carotenoids were reported to contribute to yellow color of its petal (Hwang et al., 1992;Nishimoto et al., 2004). Red-flowered Camellia species, such as C. reticulata, are reported to have high amounts of anthocyanins but few carotenoids (Li J. B. et al., 2008). As illustrated in Figure 5D, several genes directly related to the formation of pigment compounds (e.g., lutein, zeaxanthin, carotene), such as LUT1, LUT2, LUT5, and LYC, were expressed at relatively low levels in flower buds and flowers of C. reticulata. Low expression level of these genes in flower may account for a low level of carotenoid-based color compounds in camellia flower. These data also supports that it is the flavonoid biosynthesis pathway rather than carotenoids biosynthesis pathway playing dominant roles in C. reticulata flower coloration. Activation of these genes that involved in the synthesis of carotenoid-based color compounds might be useful in genetic manipulations of flower color in Camellia species in future.

CONCLUSIONS
We first report the de novo assembly of the C. reticulata transcriptome. De novo assembly has provided a catalog of 141,460 unigenes with a total length of ∼96.1 million nucleotides and an N50 of 1080 nt. Systematic evaluation indicated a good quality of the transcriptome assembly, which is suitable for further studies such as EST-SSRs mining and pathway analysis. We identified the majority of candidate genes related to TAGBS, FlaBS, CrtBS, and photoperiodic flowering pathways. The results also showed that FA desaturase genes were highly regulated, which may be related to cold hardiness response and FA composition control of camellia seed oil. Both the flaBS and CrtBS pathways were extensively regulated in C. reticulata. The former mainly participated in the biosynthesis and metabolism of proanthocyanidins, flavonols and anthocyanin pigments for flower color, while the latter play essential role in ABA synthesis rather than color maintenance. These useful resources will further enhance basic and applied researches on this economically important Camellia plant.

AUTHOR CONTRIBUTIONS
LG designed the research. QY, HH, YT performed necessary experiments. QY, HH, YT, and EX performed bioinformatics analyses and interpreted the results. QY, HH, and LG wrote the paper. All authors read and approved the final manuscript.  Supplementary Figure S2 | Identity and coverage distribution of the best BLAST hits for each unigene against the Camellia EST database. We used megaBLAST (E ≤ 10 −9 ) to search against the local Camellia EST database and extracted the best hit of each sequence for analysis. The green line indicates identity distribution of the best match, while the red line shows the coverage distribution of the best match.

ACKNOWLEDGMENTS
Supplementary Figure S3 | GO Classification of the C. reticulata transcriptome. GO terms assigned to each unigenes by BLAST2GO are summarized into three main GO categories (biological process, cellular component, molecular function) and 47 subcategories using the web-based tool WEGO.
Supplementary Figure S4