Analysis of Arabidopsis floral transcriptome: detection of new florally expressed genes and expansion of Brassicaceae-specific gene families

The flower is essential for sexual reproduction of flowering plants and has been extensively studied. However, it is still not clear how many genes are expressed in the flower. Here, we performed RNA-seq analysis as a highly sensitive approach to investigate the Arabidopsis floral transcriptome at three developmental stages. We provide evidence that at least 23, 961 genes are active in the Arabidopsis flower, including 8512 genes that have not been reported as florally expressed previously. We compared gene expression at different stages and found that many genes encoding transcription factors are preferentially expressed in early flower development. Other genes with expression at distinct developmental stages included DUF577 in meiotic cells and DUF220, DUF1216, and Oleosin in stage 12 flowers. DUF1216 and DUF577 are Brassicaceae specific, and together with other families experienced expansion within the Brassicaceae lineage, suggesting novel/greater roles in Brassicaceae floral development than other plants. The large dataset from this study can serve as a resource for expression analysis of genes involved in flower development in Arabidopsis and for comparison with other species. Together, this work provides clues regarding molecular networks underlying flower development.


INTRODUCTION
Flower is one of the most complex structures of the angiosperms (flowering plants), and is thought to make great contribution to sexual reproduction in either developmental or evolutionary aspects (Alvarez-Buylla et al., 2010). The basic floral architecture is highly conserved among the core eudicots, including Arabidopsis thaliana, which is an important model plant for studying flower development. Over the past three decades, extensive molecular genetic analyses have identified a large number of key floral regulators controlling flower development (O'Maoileidigh et al., 2014), making it one of the best-understood aspects of plant development. However, the present knowledge in understanding gene regulatory network in flower development is incomplete, such as information on genes with low expression levels.
Genome-wide approaches have become valuable tools in characterizing gene expression and in elucidating the genetic networks of flower development at a global level. In the past, large-scale analyses of transcript enrichment among Arabidopsis floral organs largely depends on hybridization, such as cDNA and oligonucleotide arrays (Hennig et al., 2004;Wellmer et al., 2004Wellmer et al., , 2006Zhang et al., 2005;Alves-Ferreira et al., 2007;Benedito et al., 2008) and represents a major step in the spatial characterization of floral transcriptome, resulting in identifying many genes important for flower development (Alvarez-Buylla et al., 2010;Irish, 2010). However, array analyses and other hybridizationbased approaches have several limitations, including knowledge of genes for probe design, non-specific hybridization, and difficulty in detecting low level expression (Marioni et al., 2008). On the other hand, more recently developed RNA sequencing (RNA-seq) technologies can overcome such limitations of hybridization-based approaches and other conventional largescale gene expression analysis methods (Marioni et al., 2008;Xiong et al., 2010). It also has great sensitivity, allowing the detection of transcripts with lower expression levels, such as those of many transcription factors (Marioni et al., 2008;Chen et al., 2010). In the last few years, RNA-seq has been extensively applied in the characterization of transcriptome regarding developmental stage, organ, even specific cell types or single cell level, from yeast to human, including several plant species (Jiao et al., 2009;Zhang et al., 2010;Yang et al., 2011). To date, RNA-seq has been used for cell-specific analysis of actively translated mRNAs associated with polyribosomes in developing flowers, providing insights and resources to further study flower development (Jiao and Meyerowitz, 2010).
To further explore the Arabidopsis flower transcriptome, we employed RNA-seq for three developmental periods. We detected 8512 additional genes that are not present on previously used microarray experiments, and provide evidence that at least 23,961 genes are truly expressed in the Arabidopsis flower. We also identified differentially and specifically expressed genes and gene families during flower development.

SEQUENCING DATASETS
The inflorescent meristem (IM), stage 1-9 flowers (F1-9) and stage 12 flowers (F12) samples for RNA-seq were collected in our lab, and the three samples were subjected to 50 bp single-end sequencing on a SOLiD 3 platform; details for the methods were recently described in a study for alternative splicing (Wang et al., 2014a). All sequenced short reads were submitted to NCBI Short Read Archive under accession number SRP035230. The datasets for seeding and stage 4 flowers were from previously studies, which generated 36-bp and 42-bp long reads, respectively, using the Illumina genome analyzer (Filichkin et al., 2010;Jiao and Meyerowitz, 2010). The meiocyte datasets were from our previous study, which included two runs (36 and 50 bp) using Life Technologies' SOLiD sequencing platform (Yang et al., 2011).

ALIGNMENT OF SEQUENCING READS
Sequence reads from the three sample plus the three floral samples were mapped using PerM (Chen et al., 2009) to the Arabidopsis genome (release 9) from the Arabidopsis Information Resource (TAIR) database (TAIR9; www.arabidopsis.org) allowing 5, 4, and 3 mismatches per 50, 42, and 36-bp read, respectively.

DIGITAL GENE EXPRESSION AND EXPRESSION ARRAYS
For the RNA-seq experiments, we used at least 10 reads mapped to a gene as the threshold for being expressed. The raw digital gene expression counts were normalized using the reads per kilo-base of mRNA length per million of mapped reads (RPKM) method. The equation was used: Where C is the uniquely mapped counts determined from mapping results, L is the length of the cDNA for the longest splice variant for a particular gene model and N is the total reads that were mapped to the genome. Log2-transformation of this normalized value was performed as in other analyses.
To test differential expression with mapping data DEGseq was used . Fisher's Exact Test (P < 0.01) method was selected. Microarray results were obtained from a previous study (Zhang et al., 2005). The Microarray experiments have a background value, which was 5 (log value of base 2) as previously described (Zhang et al., 2005) for the evaluation of "expressed" or "unexpressed" genes. Identification of differentially expressed genes according to the microarray data also used the Fisher's Exact Test method.

Z-SCORE
Calculation of the Z-score was based on the log2-transformed RPKM-normalized transcript levels as follows: X is the RPKM of a gene for a specific tissue/developmental stage. μ is the mean RPKM of a gene across all tissues/developmental stages and σ is the RPKM standard deviation of a gene across all tissues/developmental stages. All calculations and plotting were performed by Perl and excel, respectively.

GENE FAMILY AND FUNCTIONAL ANNOTATION
The protein domain annotations were obtained from the Pfam database (http://pfam.sanger.ac.uk) (Punta et al., 2011). Arabidopsis protein sequences were then searched against protein family models in the Pfam-A database, resulting in 21102 Arabidopsis proteins identified as having at least one Pfam domain. Transcription factor family annotations were from The Database of Arabidopsis Transcription Factors (http://datf.cbi. pku.edu.cn/) (Guo et al., 2005), which contains 1922 transcription factors in Arabidopsis. Gene ontology (GO) enrichment analysis was performed with the agriGO browser (http://bioinfo. cau.edu.cn/agriGO/) (Du et al., 2010) using Singular Enrichment Analysis.
Multiple sequence alignment was performed in MUSCLE (http://www.drive5.com/muscle/) using the default parameters. Maximum likelihood (ML) trees were constructed by FastTree (www.microbesonline.org/fasttree) with the approximate likelihood ratio test method.

GLOBAL GENE EXPRESSION OF FLOWER TRANSCRIPTOMES IN ARABIDOPSIS
To obtain more insights about the overall transcriptome landscape during flower development, we analyzed RNA-seq datasets of the Arabidopsis flower at three developmental stages recently generated in our laboratory; these datasets were analyzed for alternative splicing in a separate study (Wang et al., 2014a): inflorescent meristem (IM), stage 1-9 flowers (F1-9) and stage 12 flowers (F12), detecting 21,181 (IM), 22,137 (F1-9), and 22,827 (F12) reliably expressed genes (Table S1, Figure S1). A recent report summarized that a total of 126 Arabidopsis genes have been demonstrated genetically to have a role during flower development (Alvarez-Buylla et al., 2010), 122 of which were also detected as expressed in our dataset (Table S2), indicating that our data were very reliable, and can be used for further analysis. To compare gene expression during flower development, besides the three datasets described above, we also included data for Arabidopsis male meiocytes that we had generated previously (Yang et al., 2011), and two other public datasets of Arabidopsis seedlings and stage 4 flowers (Filichkin et al., 2010;Jiao and Meyerowitz, 2010), the latter of which were from isolated polysomic.
To further explore how many genes are truly expressed in Arabidopsis flower, we searched The Arabidopsis Information Resource (TAIR) database and obtained a total of 24,570 genes, which are supported by at least one EST. Then, we searched the present tiling array database to find available probes for 30,228 genes. 4734 genes were found to be tiling array-specific compared with ESTs and RNA-seq data. Among them, 2634 and 276 are transposons and pseudogenes, respectively. The other 1824 genes seem to be expressed at very low levels. The average value of the 4734 genes is 5.4, which is regarded as a threshold in this study for the evaluation of "expressed" or "unexpressed" genes. Based on this criterion, we believe that tiling array can detect at least 27, 617 genes. As described previously, RNA-seq detected 24,769 genes in flowers. Comparison of the detected genes among EST, tiling array and RNA-seq found that 22,440 genes were detected by three data sets and 1521 genes were detected by RNA-seq and either ESTs/tiling array ( Figure 1A). The results suggest that at least 23,961 genes are reliably detected as expressed in the Arabidopsis flower. In addition, 621 genes were only detected by RNA-seq; most of these are low abundance genes that are nearly undetectable by arrays and the others are likely to be stage-specific genes.
Characterization of stage or cell-specific genes provides a foundation for unraveling their molecular mechanisms. Previous studies in multiple plants demonstrated that each stage or tissue has specific transcripts (Jiao et al., 2009;Jiao and Meyerowitz, 2010;Yang et al., 2011;Liu et al., 2014). To better establish the genome-wide gene expression pattern of flower development, we conducted a Z-score analysis to assess the extent of differential gene expression for florally expressed genes. Results showed that the Z-score distribution of gene expression in F12 was dramatically different from that for early flower development (F1-9, F4, and IM) ( Figure 1C), suggesting that nearly mature flowers requires many more specifically or differentially expressed genes than early flowers. In contrast, Z-score distributions were very similar between F1-9, F4, and IM ( Figure 1D), further supporting the idea that the developmental programs of these stage/organ are similar.

DETECTION OF EXPRESSION OF 8512 GENES IN THE ARABIDOPSIS FLOWER NOT REPORTED FROM MICROARRAY ANALYSIS
We first compared the F1-9 and F12 RNA-seq data with the Affymetrix ATH1 array data at similar stages (Zhang et al., 2005). We compared the number of sequencing reads mapped to each gene with the corresponding (normalized) absolute intensities from the array (Figures 2A,B), and found that the correlations between the two platforms were high, with Spearman correlation coefficients of 0.82 (F1-9; Figure 2A) and 0.80 (F12; Figure 2B). Thus, comparison of RNA-seq and microarray identified 15,180 overlapping genes with relatively high expression levels ( Figure 2C), covering 96% of genes detected using microarray. In addition, our RNA-seq identified additional 8512 genes that were undetected by microarray (Zhang et al., 2005), whose expression levels were obviously lower; the average expression level is 56.12 (F1-9) and 57.82 (F12) RPKM (Figure S2b), compared to the average expression level about 250 for the 15,180 genes. Also, the curvature of the comparison toward the microarray axis suggested that microarray possibly underestimated the expression level of genes relative to RNA-seq (Figures 2A,B). Together, these results suggest that RNA-seq has a great advantage over microarray in detecting low-abundance transcripts, consistent with previous reports (Marioni et al., 2008;Yang et al., 2011).
To investigate further regarding the 8512 genes, we analyzed the enrichment of protein family (PFAM) domains (gene families) among these genes, and identified several enriched gene families that were not reported previously as enriched, including F-box, NB-ARC, C1_3, PPR, LRR_1, Myb, bHLH, and AP2 gene families (Table 1). Previously, many F-box genes were reported as unexpressed or undetectable by microarray analysis (Schmid   , 2005), further suggesting that microarray is not as sensitive as RNA-seq for detecting low-abundance transcripts. In addition, we also detected some enriched gene families that belongs to the highly expressed genes; for instance, Plant self-incompatibility response (SCRL) and S locus-related glycoprotein 1 binding pollen coat protein (SLR1-BP) are specifically enriched in F12 ( Table 1 and Table S3), suggesting a potential role at this stage. In contrast, 559 genes detected by microarray were not found by RNA-seq, possibly due to difference in growth conditions. We further employed a widely used Fisher's Exact Test method to identify differentially expressed genes (DEGs) between F1-9 and F12 in RNA-seq and microarray data. Altogether, 7605 and 5327 DEGs were identified in each dataset. Among them, 3422 genes were detected by both platforms (Figure 2D), 1272 and 84 DEGs were only detected by RNA-seq and microarray, respectively (Figure 2D), and consistent with the fact that RNA-seq is more sensitive for detection and comparison of gene expression. Taken together, these results indicate that deep sequencing can greatly increase the sensitivity of transcriptome analysis.

IDENTIFICATION OF STAGE-DIFFERENTIALLY EXPRESSED GENES DURING FLOWER DEVELOPMENT
Floral organ identity and cell fate determination are highly regulated by the temporal and spatial gene expression, with each organ or cell type having distinct transcriptomes (Jiao et al., 2009;Yang et al., 2011;Wang et al., 2014b). To investigate DEGs between one of the floral stages with seedlings, we compared the flower transcriptomes of IM, F1-9, or F12 with that of seedlings, and identified IM with 9793 DEGs, F1-9 with 9583 DEGs, and F12 with 9340 DEGs (Table 2). Furthermore, the intersection between these three sets contained 6193 genes (Figure 1B), indicating these three samples are quite similar regarding differentially gene expression compared with seedlings. GO annotation showed that these genes were enriched for categories such as "histone modification" and "methylation" (p = 3.2E-11 and 4.7E-9), suggesting To further examine the combined set (13,628 genes) of the above floral DEGs, we compared these genes with 4505 genes identified as potential targets of the SEP3 and/or AP1 proteins by ChIP-seq (Immink et al., 2009;Kaufmann et al., 2010). The results showed that 2506 genes overlapped between the floral DEGs and the SEP3/AP1 targets with significance (Fisher's test, p = 7.03e-08). It is likely that some of these genes are involved in the regulation of flower development, but the role of these genes in flower development needs to be determined using molecular genetic analyses.
We then analyzed the enrichment of protein domains as defined in the PFAM database, and found several enriched domains (P ≤ 0.01; Table 3), including ATPase, Helicase_C, DEAD box, WD40, SET, and PHD domains, suggesting that chromatin associated transcriptional regulation might be one of the major features underlying flower development. In addition, proteins with "UCH," "hydrolase," and "IQ" domains were also significantly over-represented, although their functions in flower development are largely unknown. Interestingly, we also identified "PPR," "Mito_carr" and "Miro" domains as significantly enriched; members of these genes are involved in gene expression and other functions in mitochondria and plastid, suggesting that such organellar functions might be important for flower development.

DISTINCT ENRICHMENT OF TRANSCRIPTION FACTORS IN EARLY FLOWER DEVELOPMENT
Identification of transcription factors (TFs) expressed in a specificstage provides a foundation for understanding the transcriptional regulatory networks underlying the development, structure and function of thestage. To investigatethe expressed TFs during flower development, we examined the TFs among IM, F1-9, and F12 and identified a total of 1667 transcription factors, 927 of which showed differential expression compared with the seedling (Figure 3A). Among the 927 TFs, 70% showed highest expression in IM (designated as D1), whereas 14 and 16% showed highest expression in F1-9 (designated as D2) and F12 (designated as D3), respectively ( Figure 3B).  Figure 3B). For example, homeobox genes encode transcription factors that contain a classic DNA binding domain with about 60 amino acids and regulate gene expression via Polycombdependent modulation of chromatin structure, thereby controlling development in animals, fungi and plants (Zhong and Holland, 2011). Several known members of HB ( Figure 3C) identified in IM support that early floral development requires active HB genes, consistent with the finding that epigenetic reprogramming of gene expression is important for the establishment of initial floral identity (Mukherjee et al., 2009). The co-expressed pattern between HB genes and chromatin factors in IM is in agreement with previous studies that a number of floral genes with similar expression patterns and/or associated with each other regulate the expression of downstream genes to ensure proper flower development (Kaufmann et al., 2009(Kaufmann et al., , 2010Deng et al., 2011).
D3 was enriched in MADS, MYB, AP2, C2H2, C2C2-CO-like, NAC, AUX-IAA-ARF, and bHLH families ( Figure 3C). Previous studies showed that auxin-dependent transcriptional regulation requires the auxin/indole-3-acetic acid (Aux/IAA) and auxin response factor (ARF) families of TFs and formation of Aux/IAA-ARFs heterodimers represses auxin signaling (Reed, 2001), which has been demonstrated to participate in pollen development, pollination and fertilization (Sundberg and Ostergaard, 2009), as well as female gametophyte specification (Pagnussat et al., 2009). Indeed, our data identified several known and unknown ARFs and IAAs factors in G3, suggesting that the Aux/IAA-ARF regulatory pathway is vital for late reproductive development. However, the function of other enriched TFs in flowers is still largely unknown. Together, these results demonstrate that flower development at different stages requires common and distinct transcription factor families.

IDENTIFICATION OF SPECIFIC GENE FAMILIES AT DISTINCT STAGES OF FLOWER DEVELOPMENT
We sought to identify stage-specific genes, which were defined as those genes that were differentially expressed (>4-fold change) at one stage over all other stages studied here using DEG seq. The largest numbers of stage specific genes were identified in the seedling, F12 and meiocytes (1083, 552, and 652 genes, respectively; Table 4). Given the lack of correlation in overall gene expression between the floral transcriptome (F12) and the other stages sampled (Figure 2C), it was not surprising to identify this stage as having the largest number of organ-specific genes. These genes are strong candidates for determining the specific functional components of the nearly mature flower.
Interestingly, the F1-9 flower-specific genes with 8-fold changes had 26 genes, including 9 transposons and 5 snoRNAs (Table S4), consistent with the previous finding that transposons and small RNAs were enriched among genes expressed in male meiocytes Yang et al., 2011). There are also 12 coding genes, one of which (AT5G09780) codes for a transcription factor of the B3 family and two (AT1G48700 and AT4G03050) are for iron binding proteins.
For meiocyte-specific genes, 424 genes were found with ∼8fold changes and showed enrichment for genes in an insertion of mitochondrial origin on chromosome II, as supported by similar preferential expression in meiocytes reported previously   . The enriched genes also included 45 mitochondrial and 28 chloroplast genes, respectively. Moreover, in addition to previously reported gene families (Yang et al., 2011), we also detected several other enriched gene families, such as Oxidored_q1, Oxidored_q2, Oxidored_q3, Oxidored_q4, and NADHdh. In particular, most members of the DUF577 family showed specific expression in meiocytes (Table S5). To further investigate this gene family, we performed phylogenetic analyses  Table S5. www.frontiersin.org January 2015 | Volume 5 | Article 802 | 9 of this gene family with members from several representative plant species, including Arabidopsis lyrata, Eutrema salsugineum, Brassica rapa. As shown in Figure 4A, this family can be divided into seven subfamilies, designated as A1-A7. The tree supported that this gene family have experienced expansion and origin in Brassicaceae ( Figure 4A). The A6 and A7 subfamilies only included Arabidopsis lyrata and Arabidopsis thaliana, suggesting an expansion that occurred since the divergence of Arabidopsis and other Brassicaceae species. Besides, functions of the DUF577 and other enriched family genes in meiosis need to be tested. Similarly for F12, the enriched families included DUF1216, Oleosin and DUF220. Most members of the three gene families had specific expression in F12; these gene families contain lineages that originated and expanded within Brassicaceae (Table S5 and Figures 4, 5). Phylogenetic analyses of the DUF1216 family suggested that this gene family is specific to Brassicaceae, without homologs in other plants, and experienced gene duplication during Brassicaceae history ( Figure 4B). Interestingly, the N-terminal region of DUF1216 proteins had putative signal peptides with similar sequences, according to the SignalP prediction (http://www.cbs.dtu.dk/services/SignalP/). The predicted signal peptide contains a large number of hydrophobic amino acids, a conserved basic amino acid and a conserved cysteine at the ninth position ( Figure 4C). At5g07750 of the Oleosin family was reported to have experienced positive selection (Schein et al., 2004). However, expression of each of three tandem duplicated genes in the Oleosin family (AT5G07510: 10081.76, AT5G07550: 29536.45, AT5G07560: 10942.38) had extraordinarily high levels of more than RPKM of 10, 000, suggesting that such high expression levels are important for F12 for later functions. Analysis of the Arabidopsis genome indicates that tandem duplication contributed to the expansion of DUF1216 in Brassicaceae, as well as the expansion of the Oleosin, DUF220 and DUF577 families ( Figure 5C). This pattern is different to those of SET, JmjC, and Rhomboid gene families, which are more likely to be retained after whole genome duplication events (Zhou and Ma, 2008;Zhang and Ma, 2012;.

CONCLUSIONS
The analysis of Arabidopsis floral transcriptome datasets presented here provides a valuable resource of candidate genes for further studies to understand the flower development program. We provided evidence for at least 23,961 genes that are expressed in the Arabidopsis flower. Compared with seedling, over 10,000 DEGs were identified, revealing novel and different molecular characteristics in the developing flower such as regulatory genes, genes for high-energy production, and transposable elements. These results showed that flower development at different stages requires common and distinct transcription factor families. The gene expression in F12 was dramatically different from that for early flower development (F1-9, F4, and IM).
In addition to identifying floral developmental gene candidates, we found many genes or gene families specifically expressed at one stage. Many transposable element genes, at least 45 mitochondrial and 28 chloroplast genes showed specific expression in meiocytes. The SCRL, SLR1-BP, DUF1216, Oleosin, and DUF220 gene families showed specific expression in F12 and DUF577 genes were detected to have specific expression meiosis. These specifically expressed genes have functions that are closely related to reproductive development, showed that mature flowers require many more specifically or differentially expressed genes than early flowers. These gene families expanded dramatically within the Brassicaceae lineage, suggesting novel functions that are possibly important for the origin and evolution of Brassicaceae. This dataset can be useful for discovering functional genes at different stages of the flower development and provide clues for the molecular and regulatory relationships between different stages.

ACKNOWLEDGMENTS
We would like to thank Qi Li for comments on the manuscript and helpful discussions. This work was supported by the National Natural Science Foundation of China (91131007) and Chinese Ministry of Science and Technology (2011CB944600). Liangsheng Zhang was supported by funds from Tongji University (2013KJ052).

Figure S2 | Comparison between transcriptomes from RNA-Seq and microarray.
Table S1 | The expression of all genes in six samples.