A comprehensive set of transcript sequences of the heavy metal hyperaccumulator Noccaea caerulescens

Noccaea caerulescens is an extremophile plant species belonging to the Brassicaceae family. It has adapted to grow on soils containing high, normally toxic, concentrations of metals such as nickel, zinc, and cadmium. Next to being extremely tolerant to these metals, it is one of the few species known to hyperaccumulate these metals to extremely high concentrations in their aboveground biomass. In order to provide additional molecular resources for this model metal hyperaccumulator species to study and understand the mechanism of adaptation to heavy metal exposure, we aimed to provide a comprehensive database of transcript sequences for N. caerulescens. In this study, 23,830 transcript sequences (isotigs) with an average length of 1025 bp were determined for roots, shoots and inflorescences of N. caerulescens accession “Ganges” by Roche GS-FLEX 454 pyrosequencing. These isotigs were grouped into 20,378 isogroups, representing potential genes. This is a large expansion of the existing N. caerulescens transcriptome set consisting of 3705 unigenes. When translated and compared to a Brassicaceae proteome set, 22,232 (93.2%) of the N. caerulescens isotigs (corresponding to 19,191 isogroups) had a significant match and could be annotated accordingly. Of the remaining sequences, 98 isotigs resembled non-plant sequences and 1386 had no significant similarity to any sequence in the GenBank database. Among the annotated set there were many isotigs with similarity to metal homeostasis genes or genes for glucosinolate biosynthesis. Only for transcripts similar to Metallothionein3 (MT3), clear evidence for an additional copy was found. This comprehensive set of transcripts is expected to further contribute to the discovery of mechanisms used by N. caerulescens to adapt to heavy metal exposure.


INTRODUCTION
Noccaea caerulescens (J. & C. Presl) F. K. Mey., formerly named Thlaspi caerulescens, is an outstanding model plant species to study heavy metal hyperaccumulation (Assunção et al., 2003;Peer et al., 2003;Milner and Kochian, 2008). It is one of the few plant species of which genotypes are known that are adapted to grow on soil containing high levels of zinc (Zn), cadmium (Cd), nickel (Ni), and/or lead (Pb) (Mohtadi et al., 2012). Not only are these genotypes extremely tolerant to the different heavy metals they are exposed to, but N. caerulescens can also hyperaccumulate these metals to high concentrations in shoots. Accumulations of Zn to 30,000 µg g −1 , Cd to 2700 µg g −1 , and Ni to 4000 µg g −1 levels are reported on a shoot dry weight (dw) base. These are two orders of magnitude higher than other, non-accumulating, species generally accumulate (Reeves and Brooks, 1983;McGrath et al., 1993;Brown et al., 1995;Lombi et al., 2000). Next to N. caerulescens, Arabidopsis halleri is developed as model metal hyperaccumulator (Meyer and Verbruggen, 2012). This species is also hypertolerant to Zn and Cd, and a strong Zn hyperaccumulator, but less of a Cd hyperaccumulator and not known to be adapted to Ni. Unlike N. caerulescens it is self-incompatible, which complicates genetic analysis, but it is much closer to the general plant model species Arabidopsis thaliana, with which is shares more sequence synteny. Adaptation to heavy metals seems to be more commonly occurring in the Brassicaceae family than in other families, with other Noccaea species known to hyperaccumulate Ni and Zn, and one, N. praecox, also hyperaccumulating Cd, as recently reviewed by Koch and German (2013). Other genera with metal hyperaccumulating species are Alyssum and Strepthanthus (both mostly Ni adapted). Thus, it is fortunate that the first plant species for which genomic data became available, A. thaliana, belongs to the Brassicaceae family, as this triggered further interest in generating genome sequence information from several members of this family (http://www.brassica.info/ resource/sequencing/bmap.php). For neither A. halleri nor N. caerulescens the genome sequence has been determined, which is why most of the gene expression research on these species so far has relied on heterologous micro-array analysis using the available A. thaliana micro-arrays (Becher et al., 2004;Weber et al., 2004;Hammond et al., 2006;Talke et al., 2006;Van De Mortel et al., 2006. This revealed that both species seem to have evolved similar strategies for dealing with the high metal exposure, typically by modifying the expression of several genes normally involved in Zn and Fe mineral homeostasis. A striking example is the copy number expansion of the HMA4 gene, observed in both species, which increased expression of the gene when compared to non-accumulating species (Hanikenne et al., 2008;Ó Lochlainn et al., 2011;Craciun et al., 2012).
One reason to investigate the remarkable metal adaptation properties of N. caerulescens is that these extremophile plants are interesting target species to develop for metal phytoextraction purposes, in which plants are used to remediate soils contaminated with toxic metals (Peer et al., 2006;Anjum, 2012). Another reason is that the rare extremophile nature of metal hyperaccumulators, to have adapted to otherwise hostile environments, makes them interesting models for plant evolutionary genomics studies (Hanikenne and Nouet, 2011). For both purposes, unraveling the evolutionary and physiological mechanisms that allowed their adaptation, at the molecular level, will be needed. The main approaches that have been followed so far in identifying genes involved in metal adaptation involve comparisons of hyperaccumulating and related non-hyperaccumulating plant species, either by using genetic crosses, or by transcriptomics or proteomics comparisons (reviewed by Krämer et al., 2007;Hassan and Aarts, 2011;Lin and Aarts, 2012). However, so far, a comprehensive set of genome sequences of a metal hyperaccumulator species is not available, which is seriously limiting the progress in molecular analysis of metal adaptation.
Molecular analysis of N. caerulescens has been largely performed based on its close relationship to the well-known model species A. thaliana, allowing the use of molecular tools and genomic databases developed for this species. Although the lineages of both species separated probably some 20 Mya (Clauss and Koch, 2006), there still is substantial genome sequence similarity between these species, estimated at 87-88% sequence identity in intergenic transcribed spacer regions and 88.5% sequence identity in transcribed regions (Peer et al., 2003;Rigola et al., 2006). This high level of conservation was sufficient to use heterologous, A. thaliana, micro-arrays for comparative transcriptome analyses (Hammond et al., 2006;Van De Mortel et al., 2006. The first attempt to obtain sequence information of N. caerulescens was performed by Rigola et al. (2006), who generated an Expressed Sequence Tag (EST) database of little over 3700 transcript sequences. This resource has been used to generate a cDNA-based micro-array, which has been used for transcript profiling of N. caerulescens, but due to the limited number of probes on the array, the information that could be gathered was limited (Plessl et al., 2010). Since then, the rapid development of high-throughput sequencing techniques has made whole genome and transcriptome sequencing a lot more efficient and affordable (Morozova et al., 2009). Recently, SOLiD technology has been applied for comparative root transcriptomics of three different N. caerulescens accessions (Halimaa et al., 2014). The 454/Roche pyrosequencing method takes advantages of long read lengths and a fast running time (Metzker, 2010;Niedringhaus et al., 2011) and is a suitable method for whole transcriptome sequencing, when the main purpose is to obtain a comprehensive set of transcript sequences of substantial length to be used as a reference database for future transcriptome profiling studies.
Here we present such a comprehensive set of N. caerulescens transcripts, which largely exceeds the previous dataset in the number of identified genes (Rigola et al., 2006). We identified many transcripts involved in mineral accumulation and homeostasis, which may be relevant for metal hyperaccumulation and hypertolerance. In addition we listed genes involved in the biosynthesis of glucosinolates. These are secondary metabolites conferring resistance to herbivores, especially prominent in Brassicaceae (Bones and Rossiter, 1996). This new transcript sequence information will facilitate the further analysis of the extremophile traits of N. caerulescens and is expected to contribute to similar studies in related metal hyperaccumulators, such as A. halleri, and also less studied species like the Zn/Cd hyperaccumulator Noccaea praecox or the Ni-hyperaccumulator Noccaea goesingense (Koch and German, 2013). It will also contribute to more efficient functional studies of heavy metal related genes, to establish their role in metal adaptation or for possible applications in metal phytoextraction, or genes related to synthesis of glucosinolates.

PLANT MATERIALS AND RNA PREPARATION
An inbred line of N. caerulescens accession "Ganges" (kindly obtained from Dr. Henk Schat, Free University, Amsterdam, The Netherlands) was used. Roots and shoots were collected separately from 5-week-old plants, grown in half-strength Hoagland solutions (Assunção et al., 2001) containing 10 µM ZnSO 4 , in a climate controlled growth chamber (set at 20/15 • C day/night temperature; 70% relative humidity; 12 h day time). Inflorescences were collected from 22-week-old plants grown in soil. RNA of these three plant parts was extracted separately by using the RNeasy® Plant Mini kit (Qiagen, cat. no. 74904) with on-column RNase-Free DNase set digestions (Qiagen,cat. no. 79254). The RNA concentrations were quantified using the highly selective Qubit™ RNA BR Assay kit (Invitrogen™, cat. no. Q10210) with a Qubit® 2.0 Fluorometer.

cDNA LIBRARY CONSTRUCTION
Similar amounts of RNA from roots, shoots, and flowers were pooled and used for preparation of a normalized and random-primed cDNA library for Roche/454 sequencing (Vertis Biotechnologie AG). From the total RNA sample, poly(A) + RNA was isolated, and used for cDNA synthesis. First strand cDNA synthesis was primed using random hexamer primers. To the 5 and 3 ends of the cDNA, Roche/454 adapters A and B, as provided by the manufacturer, were ligated and cDNA was finally amplified using 12 PCR cycles and proofreading DNA polymerase. Normalization was performed by denaturation and reassociation of cDNA. Re-associated double-stranded cDNA was separated from remaining (normalized) single-stranded cDNA (ss-cDNA) over a hydroxylapatite column. After separation, ss-DNA was PCR-amplified using six PCR cycles. Finally cDNA in the size range of 500-850 bp was eluted from a preparative agarose gel. The final normalized cDNA library contains double stranded fragments of between 500 and 850 bp, consisting of the following sequence structure: 5 -454-Adapter A (CCA-TCT-CAT-CCC-TGC-GTG-TCT-CCG-ACT-CAG), 5 -barcode  (CACACG), 5 adapter (GAC-CTT-GGC-TGT-CAC-TCA-GTT),  cDNA insert (400-750 bp), 3 adapter (TCG-CAG-TGA-GTG-ACA-GGC-CA), 3 -454-Adapter B (CTG-AGA-CTG-CCA-AGG-CAC-ACA-GGG-GAT-AGG).

DE NOVO SEQUENCING AND ASSEMBLY OF N. CAERULESCENS TRANSCRIPT SEQUENCES
Prior to Roche/454 sequencing, cDNA library molecules were clonally amplified using a "two copies per bead" ratio for one Large Volume Emulsion PCR, following the manufacturer's protocol (Roche, Genome Sequencer FLX Titanium Series). Two million DNA carrying beads (from a 23% enrichment) were loaded on half a picotiter plate equivalent, divided over two regions. Sequencing was done on a 454 GS FLX Titanium instrument using XLR70 chemistry and 200 flow cycles. The raw reads were assembled after adaptor trimming using Newbler (454 life Sciences Corporation) version v 2.6.

SEQUENCE ANNOTATION AND CHARACTERIZATION
The predicted proteomes and genome assemblies of A. thaliana, Arabidopsis lyrata, Brassica rapa, Capsella rubella, and Thellungiella halophila, recently suggested to be named Eutrema halophila (Koch and German, 2013), were downloaded from the phytozome repository version 9.1 (Goodstein et al., 2012). The protein function annotation of A. thaliana version TAIR10.0 was obtained from the TAIR website (www.arabidopsis.org). The gene ontology (GO) annotation for A. thaliana was downloaded from the GO webpage (www.geneontology.org). The protein annotations (including GO-terms) of the four other species were downloaded from the phytozome repository. Pathway and Enzyme Code annotations for A. thaliana and B. rapa were downloaded from the PlantCyc database (www.plantcyc.org) version 8.0. The annotated GO terms were summarized using the plant GOSlim Set of the GOSlim viewer at the AgBase website (http://www.agbase.msstate.edu/cgi-bin/tools/goslimviewer_select. pl) (McCarthy et al., 2006).
The predicted Brassicaceae proteins were clustered into groups of orthologous proteins by first performing all 25 possible pairwise similarity searches between the five predicted proteomes using BlastP version 2.26 (Altschul et al., 1990). Pairwise orthologs were determined for all 10 possible species-pairs using Inparanoid version 4.1 (Remm et al., 2001). Finally, multispecies-ortholog clusters were built by feeding the Inparanoid output files to the multiparanoid Perl-script (Alexeyenko et al., 2006). Whenever possible, the annotation of orthologous protein clusters was inherited from A. thaliana or B. rapa members (in that order). Clusters without A. thaliana or B. rapa members remained un-annotated.
The transcriptome of N. caerulescens was first searched against the predicted proteomes using BlastX with an e-value cut-off of 10 −5 . Sequences for which the best hit was a member of an annotated cluster inherited the annotation from that cluster. The remaining sequences inherited their annotation from their best hit. Sequences without a significant match against the predicted Brassica proteomes were searched (BlastX with e-value cut-off 10 −5 ) against the non-redundant protein database (NR) downloaded from the NCBI ftp-site (ftp://ftp.ncbi.nlm.nih.gov/). Proteins without BlastX hits were searched against the genomes of the Brassica species using BLAT (Kent, 2002) version 35. BLAT hits were filtered by requiring 90% of the N. caerulescens sequence to be aligned with an average identity of at least 85%. Finally sequences were searched against the NR nucleotide database, downloaded from the NCBI ftp-site (BlastN with e-value cut-off 10 −5 ). Blast hits were only accepted if the corresponding HSPs had ≥85% sequence identity and covered ≥90% of the query isotigs.
Protein alignments were constructed using ClustalW2 (Larkin et al., 2007) and regions of low alignment quality were removed by hand using JalView (Waterhouse et al., 2009). Maximum likelihood trees were constructed using PhyML (the following command line parameters were used: -c 4 -m LG -o lr -v e -a e -f e -b 100) (Guindon et al., 2010). In brief, the trees were constructed using the LG amino acid substitution model (Le and Gascuel, 2008) with four relative substitution rate categories. The proportion of invariable sites and equilibrium amino-acid frequencies were estimated from the data. The alpha parameter was estimated by maximizing the likelihood of the phylogeny. Branching patterns were validated using bootstrap-analysis; 100 bootstrap-samples were generated for the ML-trees and 1000 for the NJ trees. Trees were displayed using the ETE2 python package (Huerta-Cepas et al., 2010) and dendroscope (Huson and Scornavacca, 2012). To investigate the presence of gene duplications of metal homeostasis related genes, sequence sets were created for each gene family implicated in metal homeostasis by performing blast searches against the proteomes of the five Brassicaceae reference species (with e-value cutoff 10 −5 ). The exonerate program was used to rapidly compare transcript sequences to genome sequences (Slater and Birney, 2005). Neighbor-joining trees were subsequently constructed for all alignments using ClustalW2. This Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GASZ00000000. The version described in this paper is the first version, GASZ01000000. Raw sequence data have been deposited in the NCBI short read archive under the accession SRX456668.

DE NOVO SEQUENCING AND ASSEMBLY OF N. CAERULESCENS ISOTIGS
To maximize the transcript diversity, a pool of RNA from roots, shoots, and inflorescences of N. caerulescens was made. A cDNA library synthesized from this RNA pool was used for sequencing. In total 834,911 raw reads with an average length of 401 nucleotides (nt) were obtained and subsequently assembled using Newbler. During the assembly process, two initial contigs can be joined together into an isotig when raw reads are found that map to the ends of both contigs. The resulting transcriptome assembly (Table 1) consisted of 26,785 contigs the vast majority of which were further assembled into a set of 23,836 putative transcripts (isotigs) with an average sequence length of 1025 base pairs (bp). Isotigs sharing at least one contig are grouped into isogroups. Isotigs are likely to be transcript isoforms from the same gene (Isogroup). The isotigs in this study were grouped into 20,378 isogroups. In total 2326 isogroups (11.4%) consisted of more than one isotig. The distributions of the isotigs sizes and the number of isotigs per isogroup are provided in Supplemental Figure S1.

ANNOTATION AND FUNCTIONAL CHARACTERIZATION OF N. CAERULESCENS GENES
N. caerulescens is a member of the Brassicaceae family, which is why we used the annotation of five other Brassicaceae species of which full genome sequence is available (A. thaliana, A. lyrata, B. rapa, C. rubella, and T. halophila) as a reference for annotation. The predicted proteomes of these five reference species were first clustered in 24,172 groups of orthologous proteins, which we called the Brassicaceae proteome (see Materials and Methods). In total 23,691 of these clusters were annotated following the A. thaliana or B. rapa annotation. We then compared the N. caerulescens isotig DNA sequences against the Brassicaceae proteome and listed the best match (Summarized in Figure 1A; full list in Supplemental Table S1). A total of 22,232 (93.2%) N. caerulescens isotigs (19191 isogroups) had a significant match with the Brassicaceae proteome and we annotated these sequences according to their best match. We also determined the best match for each N. caerulescens sequence in each of the five reference species ( Figure 1B). Based on this analysis, we estimate that between 40 and 60% of all N. caerulescens protein encoding genes are represented in this transcriptome set.
Even though five closely related species were used as reference, still 1604 isotigs were found without any significant hit in the Brassicaceae proteome data set. These isotigs were therefore compared with the NCBI NR protein database. The search revealed that, although not detected in our initial search, 33 isotigs did

FIGURE 1 | Classification of Noccaea caerulescens isotigs according to best-hits. (A)
The number of best-hits of N. caerulescens isotigs to the Brassicaceae proteome (comprising those of Thellungiella (Eutrema) halophila, Arabidopsis thaliana, Arabidopsis lyrata, Brassica rapa, and Capsella rubella) are listed. The 1604 isotigs without any significant hit to Brassicaceae proteome were further compared to the NCBI NR protein database, identifying best-hit similarities to 33 proteins of plant origin and 98 of non-plant origin. The non-hit isotigs were further matched to the available genome sequences of the listed Brassicaceae reference species, identifying another 74 best-hit similarities. Finally, a set of 1399 isotigs remained for which no protein similarity could be found. (B) Venn diagram showing the number of genes in each of the five reference species that correspond to the best-hit in that species for each N. caerulescens isotig sequence.
have a significant match to a plant protein, including 16 isotigs that strongly resembled an A. thaliana protein and seven that resembled proteins from other Brassicaceae (Supplemental Table  S2). Thirteen of the 33 isotigs corresponded to (retro)transposons and nearly all of the remaining ones encoded for hypothetical or unknown proteins. Next to these, an additional 98 isotigs showed only significant similarity to non-plant sequences (Supplemental Table S3). Based on the high similarity to genes of organisms often found to be associated with plants (fungi, bacteria, viruses, etc.), these are most likely reflecting such associations, rather than actual N. caerulescens genes. The remaining sequences with no hit when compared to the NR protein database, were compared against the genome sequences of the five Brassicaceae reference species. This revealed an additional 74 sequences with significant DNA matches to one or more of the reference genomes. Still, there are 1399 isotigs that had no hit to Brassica proteins and proteins in the NR database (Supplemental Table S4). When these were compared to the NR DNA database, an additional 13 isotigs were found to be similar to DNA entries in the database. Only two of these were plant genes. Five isotigs showed similarity to A. thaliana genomic sequences, one to plant mitochondrial DNA and the remaining five appeared to correspond to microbial sequences.
The list of isotigs with best-hit matches to the Brassicaceae proteome (Supplemental Table S1) has been used to obtain an estimate of transcript length coverage of the N. caerulescens isotigs (Supplemental Figure S2). This shows that around half of the transcripts cover over 80% of the protein sequence of their Brassicaceae best-hit match. In addition, we compared the full list of N. caerulescens isotigs to the previously obtained set of N. caerulescens ESTs (Rigola et al., 2006) using BlastN (≥95% identity and ≥60% coverage of shorter sequences). This retrieved 3773 of the 4289 sequences (88%). Sequence similarities between both datasets were generally around 98%, suggesting SNP frequencies of around 2 per 100 bp.

GENE ONTOLOGY (GO) ANNOTATION OF N. CAERULESCENS ISOGROUPS
Subsequently, we used the N. caerulescens isogroup sequences to categorize putative genes according to GO, by assigning them to three categories of GO terms: Biological Process, Cellular Compartment, and Molecular Function (see Materials and Methods) (Supplemental Table S5). The annotated GO terms were further classified with the plant set of the GO Slim Viewer (McCarthy et al., 2006) (summarized in Supplemental Figure S3, full list in Supplemental Table S6).
Within Biological Process, two major categories are cellular process (25.1%) and metabolic process (19.3%), followed by developmental process (12.3%), and response to stress/stimulus (10.4%). Genes involved in cellular ion homeostasis (involving "cations" and "metal," but also specifically zinc, copper, calcium, iron, manganese, potassium, and phosphate) (0.1%), are likely to be crucial for the regulation of the heavy metal balance in N. caerulescens (Supplemental Figure S3), in addition to genes involved in transport (2.7%), especially ion transport (0.5%), and signal transduction (1.6%). Within Cellular Compartment, the largest category is cytoplasmic component (50.7%), followed by membrane apparatus (12%), and then intracellular components (10.7%), extracellular components (2.1%), and cell wall (0.9%). Genes encoding proteins localized to the plasma membrane (5%) and the vacuolar membrane (0.6%) will comprise metal transporter genes that are expected to be involved in metal uptake and sequestration (Supplemental Figure S3). Within Molecular Function, the largest category is binding activity (37.1%) comprising ion binding (4%) as an important sub-category for metal hyperaccumulation traits. Furthermore, the genes categorized for transporter activity (8.1%) are expected to play an important role in metal uptake and metal transportation. To see if this is different from other species, we also compared N. caerulescens and A. thaliana regarding the GO terms with the highest percentages of N. caerulescens isogroup counts (Supplemental Figure S3). This made clear that there are only minor differences between both species.
A total of 3051 isotigs (2610 isogroups) could be assigned to 352 PlantCyc pathways (Supplemental Table S7). The 26 most represented pathways (around 20 or more isogroups) are shown in Supplemental Figure S4. The top six pathways are triacylglycerol degradation, homogalacturonan degradation, glycolysis II (from fructose-6P), tRNA charging, aerobic respiration (alternative oxidase pathway), and betanidin degradation, which have more than 50 isogroups found in the pathway.

TRANSCRIPTS RELATED TO THE RESPONSE TO HEAVY METAL EXPOSURE
The metal hyperaccumulation and hypertolerance properties of N. caerulescens partly rely on a number of genes known in other species, mainly A. thaliana, to be involved in mineral homeostasis. A list of such genes, including 87 isogroups, is shown in Table 1. The list includes metal transporter genes belonging to the ZRT/IRT-like Protein (ZIP) gene family (9 isogroups), the Natural Resistance-Associated Macrophage Protein (NRAMP) family (8 isogroups), the Heavy Metal ATPase (HMA) family (9 isogroups), the Metal Tolerance Protein (MTP) family (7 isogroups), and the Calcium exchanger (CAX) family (9 isogroups). In addition, we listed genes belonging to the Plant Cadmium Resistance (PCR) gene family (2 isogroups), the Pleiotropic Drug Resistance protein (PDR) family (14 isogroups), and Plant Defensin (PDF) family (8 isogroups). Furthermore, transcripts related to metal chelator and metal chelator transporter functions were listed, such as genes of the Nicotianamine Synthase (NAS) family (4 isogroups), the Phytochelatin Synthase (PCS) family, (2 isogroups), the Metallothionein (MT) family (4 isogroups), the Yellow Stripe Like protein (YSL) family, (6 isogroups), the Zinc Induced Facilitator (ZIF) family (5 isogroups), and the Multidrug Resistance-associated Protein/ ATP-binding cassette transporter ABCC type (MRP/ABCC) family (26 isogroups). Nota bene we did not include the large families of genes encoding heavy metal-associated isoprenylated plant proteins (HIPP) or heavy metal-associated plant proteins (HPP), of which some members are recently suggested to be relevant for response to Cd (De Abreu-Neto et al., 2013).
For two larger gene families, the ZIP and MTP families, we performed a phylogenetic analysis (Figures 3A,B) to confirm that the Blast analysis (    we created for each class multiple sequence alignments with sequences from the five reference species that were similar to the N. caerulescens isogroup sequences (see Materials and Methods). Neighbor-joining trees were constructed from these alignments and inspected manually for putative N. caerulescens specific duplications. Only in the tree of the MT3 class we identified a putative N. caerulescens specific duplication. In order to obtain additional support for a recent gene duplication we identified the MT3 orthologs of the five Brassicaceae reference species and constructed a phylogenetic tree of them ( Figure 3C). This confirmed that all inspected genomes had only one copy of the MT3 gene.

GLUCOSINOLATE BIOSYNTHESIS TRANSCRIPTS IN N. CAERULESCENS
Next to the mineral homeostasis genes, we also examined the occurrence of transcripts representing genes in the glucosinolate biosynthesis pathway. This pathway is generally prominent in Brassicaceae as it generates glucosinolates that upon interaction with myrosinase enzymes release toxic isothiocyanates, thiocyanates, or nitrils, which display strong anti-feeding properties against herbivores (Bones and Rossiter, 1996). Most of the genes in this pathway and their roles have been identified (Sønderby et al., 2010b;Wang et al., 2011). Important genes are a series of MYB and MYC genes encoding the transcription factors controlling aliphatic and indolic glucosinolate biosynthesis (Gigolashvili et al., 2007;Sønderby et al., 2010a;Schweizer et al., 2013), next to genes involved in core structure formation, side-chain elongation, secondary modification and co-substrate pathways (as summarized by Sønderby et al., 2010b) and more recently identified genes involved in secondary modification of indolic glucosinolates and genes encoding glucosinolate transporters (Pfalz et al., 2011;Nour-Eldin et al., 2012). N. caerulescens plants originating from metalliferous soils were found to contain lower levels of glucosinolates than plants originating from non-metalliferous soils (Noret et al., 2007), which is why we were interested to see if the glucosinolate genes known from A. thaliana were also present in the N. caerulescens transcriptome. Supplemental Table  S8 shows the list of genes involved in the glucosinolate biosynthesis pathway, which we classified based on their presumed function as transcription factor genes, genes involved in side-chain elongation, core structure formation and secondary modification or genes of a co-substrate pathway (according to Wang et al., 2011). Of the 61 genes in the list, 40 were found to be expressed in the N. caerulescens transcriptome.

DISCUSSION
N. caerulescens is one of the two heavy metal hyperaccumulator plant model species, together with A. halleri, that are studied in detail to understand their adaptation to growing on soil containing extremely high concentrations of Zn and Cd (Verbruggen et al., 2013). As part of this adaptation, they not only tolerate exposure to normally lethal concentrations of both metals, but also hyperaccumulate them in their shoots, probably as protection against herbivory and microbial infection (Hoerger et al., 2013). N. caerulescens is the only species that also hyperaccumulates Ni and Pb. The development of new and cost-effective gene expression analysis methods such as RNA-Seq prompted us to generate a comprehensive reference transcriptome dataset for N. caerulescens that will facilitate and support such gene expression analyses, without having to rely on heterologous comparisons. This data set replaces a previous data set we made, comprising only 4289 transcript sequences, representing 3709 unigenes, of accession "La Calamine" (Rigola et al., 2006). We now assembled 23,836 isotigs, further condensed into 20,378 isogroups of accession "Ganges," which is a better Cd-hyperaccumulator than "La Calamine." When assuming that isogroups will best represent genes, we have expanded the available transcriptome sequence information, in terms of sequenced genes, by almost 5.5-fold.
The use of isotigs, rather than contigs for listing transcripts, will account for small sequence differences between contigs corresponding to the same transcript. Such can be caused by sequence errors, allelic differences, or differences between recently duplicated gene copies. The accession "Ganges" was used for transcriptome sequencing. This accession had been inbred for at least seven generations, and as N. caerulescens is a self-compatible, readily self-fertilizing species, we expect that only few sequence differences between contigs will be due to allelic variation and that most of the differences between contigs and isotigs will be due to sequencing errors. The difference between isotigs and isogroups is mainly accounted for by differential or alternative splicing or generation of alternative transcripts from the same gene. The 454 sequence technology that was used generates relatively large reads and facilitates easier assembly compared to that of the much shorter reads created by the Illumina sequencing technology. However, despite the long reads, for about 10-15% of the isogroups more than one isotig was assembled, reflecting the occurrence of alternative transcripts (Supplemental Table S1).
When comparing the N. caerulescens transcriptome to that of other diploid Brassicaceae species for which transcriptome sequence is available, such as Thlaspi arvense (Dorn et al., 2013) or Thellungiella salsuginea (Eutrema salsugineum) (Lee et al., 2013), the N. caerulescens transcriptome set is relatively small. The draft transcriptome of T. arvense consists of 33,873 contigs, representing 25,232 Brassicaceae genes, and that of T. salsuginea comprises 42,810 unigenes,corresponding to 24,457 A. thaliana peptides. The 20,378 N. caerulescens isogroups corresponded to 19,191 protein matches in the Brassicaceae reference proteome set. The main reason for these differences will be the representation of more tissues or organs and conditions in the other sequenced libraries. For N. caerulescens, only roots, leaves and flowers of plants grown under normal Zn supply conditions were sampled, while material from more organs or conditions were sampled for the other two species. The different sequencing approaches may also account for the differences. T. arvense was sequenced using Illumina technology, which generates many more, but shorter, reads, while for T. salsuginea both a normalized and non-normalized library were sequenced, using 454 technology, and in the final assembly, previously generated EST sequences present in the NCBI GenBank database were included.
T. salsuginea is predicted to contain 26,521 protein encoding genes (Yang et al., 2013), of which 18,970 were found to be represented in the 42,810 unigene set (Lee et al., 2013). Similar gene numbers are also found for Leavenworthia alabamica (30,343 genes) and Sisymbrium irio (28,917 genes) (Haudry et al., 2013). These three species have a genome size comparable to the genome size of N. caerulescens, which is expected to be 310-330 Mb as based on 2C DNA content (Peer et al., 2003). Assuming that N. caerulescens has around 26,000-30,000 genes, this means around 60% of its transcriptome is covered in the current dataset.
N. caerulescens belongs to the Coluteocarpeae tribe of the Brassicaceae that consists mainly of Noccaea species, which, when tested, are all found to be metal hyperaccumulators (Koch and German, 2013). Some of the other species in this tribe, e.g., Raparia bulbosa and Thlaspiceras oxyceras, are also known to accumulate metals. This tribe belongs to the expanded lineage II of the Brassicaceae, to which also the Brassica and Thellungiella (Eutrema) genera belong. This fits well with the observation that most N. caerulescens transcript sequences find their best BLASTX hit with T. halophila proteins (Supplemental Table S1). Only 98 sequences appeared to be originating from non-plant organisms, potentially corresponding to organisms living in association with the N. caerulescens plants we used to generate the cDNA libraries.
Considering that for the T. arvense transcriptome set, 424 out of 33,873 contigs showed similarity to fungal genes, the contribution of such genes to the N. caerulescens dataset is modest. Upon comparing the N. caerulescens isotigs to existing sequence information available in the NCBI GenBank, 1386 isotigs (5.8%) were found to have no significant similarity to any other sequence in database (Supplemental Table S4). Potentially these could be genes unique to N. caerulescens; however, they may also represent miRNA precursors that are much less conserved and hard to detect using Blast analysis. Of course it is possible that these sequences represent genomic DNA sequences (not very likely considering the use of DNAses in the cDNA library construction), or transcripts of plant-associated organisms that have not been sequenced yet. In the previous N. caerulescens transcriptome dataset we generated, we found 8% of the unigenes to show no similarity to any other entry in the NCBI database (Rigola et al., 2006). Comparison of the "no-hit" sequences to the N. caerulescens whole genome sequence, when available, will be needed to clarify if these are from N. caerulescens.
Isogroups rather than isotigs were used for GO annotation and analysis to avoid overrepresentation of certain GO terms due to alternative transcripts generating many isotigs for the same gene.
Sometimes the isogrouping appears to be too strict, which has forced transcripts from different paralogs into one isogroup, as was found for some of the mineral homeostasis related genes ( Table 2). The GO analysis (Figure 2, Supplemental Figure S3 and Supplemental Tables S5 and S6) showed that N. caerulescens isogroups were largely GO-annotated in comparable GO-class representations as for A. thaliana.
Many mineral homeostasis genes were identified among the N. caerulescens isotigs ( Table 2). Although we identified genes for all relevant gene families, not all genes for each family were found to have N. caerulescens counterparts, representing their potential orthologs. For instance, for the ZIP gene family of plasma membrane metal importers, no N. caerulescens transcript was found for IRT2 and ZIP3, 5, 7, 8, and 12. Transcripts for these genes have been found by Halimaa et al. (2014), but except for the ZIP5 ortholog, these are expressed at low levels in GA roots. Not finding a potential ortholog for ZIP5 is remarkable, since transcripts for these genes were also found previously (Rigola et al., 2006) and at least the ortholog of ZIP5 was expressed in roots and shoots (Wu et al., 2009). Also for the MTP family of vacuolar metal importers we did not detect transcripts similar to all A. thaliana MTP genes, but most of them have been found by Halimaa et al. (2014). Phylogenetic analysis of the ZIP and MTP isotigs showed that those were classified correctly (Figure 3). For the HMA family of plasma membrane metal exporter genes, we initially did not find a potential ortholog of the NcHMA3 gene (Ueno et al., 2011). However, upon re-examining the isogroups showing similarity to HMA proteins, isogroup02580, which was found to be very similar to a B. rapa protein annotated as HMA2, was actually highly similar (∼90% DNA identity) to A. halleri and A. thaliana HMA3 genes.
Only for the MT3 gene we could identify an additional copy in N. caerulescens compared to the single copies found in related species ( Figure 3C). This gene has been implicated in Cu homeostasis, and expression in a southern France accession ("St. Felix," which is close to the origin of "Ganges") is much higher than in the two other tested accessions and appears to be constitutive, rather than induced by Cu (Roosens et al., 2004). Such could well be the consequence of an additional copy, which would mean the additional copy is accession, rather than species specific, and is worth further investigations. Not finding copy number expansion for other genes is remarkable, as for several of the other genes listed in Table 2, such as HMA3 and HMA4, multiple copies have already been reported in other accessions, including "Ganges" (Ó Lochlainn et al., 2011;Ueno et al., 2011;Craciun et al., 2012;Iqbal et al., 2013). However, for HMA4, for which cDNA sequences of "Ganges" are available, the different cDNA copies are very similar in sequence (Iqbal et al., 2013), only differing in a few bp and three InDels, which will not be distinguished from sequence differences due to sequencing errors by the assembly software we used. This is likely to be the case for more recently duplicated gene copies. Detailed analysis of copy number variation will be much less ambiguous upon availability of a whole genome sequence, where non-encoding sequence can be taken into account to distinguish duplicated copies.
When examining genes involved in glucosinolate biosynthesis, we identified isotigs corresponding to all transcription factor genes involved in aliphatic glucosinolates, but not for two of the MYB genes involved in indolic glucosinolates (MYB34 and MYB122) (Supplemental Table S8). In contrast, hardly any of the aliphatic side-chain elongation genes as well as the two CYP79F genes involved in aliphatic core structure formation was represented in the N. caerulescens isogroup list. Of course, not finding them in the isotig list does not mean these genes are not expressed, but it is remarkable that expression of several of the structural genes involved in aliphatic glucosinolate biosynthesis is so low that apparently these genes are more likely to missed by 454 sequencing than the other genes. Low expression of glucosinolate genes would be in line with a previous report that especially metallicolous accessions from the south of France, where also "Ganges" is  Table 2, and compared at amino acid sequence level to A. thaliana genes to indicate the most likely orthologs. When all proteins listed in originating from, are low in total glucosinolate levels (Noret et al., 2007). The main reason for studying N. caerulescens is to learn more on its extraordinary capacities to tolerate exposure to high concentrations of heavy metals and accumulate these to extremely high levels in the leaves. The transcriptome sequence as such will not tell us that much on which genes will be relevant for these traits, but it can be used as an excellent reference for RNA-Seq studies to determine gene expression in organs of different accessions, exposed to a range of metal concentrations. There are many different N. caerulescens populations in Europe, which can differ substantially in their ability to tolerate and accumulate metals, as well as differ in their metal preferences. Many populations grow on non-metallicolous soils, which are not enriched in metals. At those sites, they are likely to accumulate mainly Zn. When exposed to Zn, Ni, and Cd they can hyperaccumulate them, but as they are generally not metal tolerant, plants will rapidly die upon exposure. Other natural populations grow on (ultramafic) serpentine outcrops, which are generally rich in metals, but not Zn. At those sites, they often hyperaccumulate Ni, and when exposed to Zn or Cd, will also hyperaccumulate these metals in their shoot. However, they are often only tolerant to Ni and will suffer or die from exposure to Cd. Finally, there are several local populations spread over different sites in Europe that have been heavily contaminated by Zn and Cd, and often also Pb (Mohtadi et al., 2012). These calamine populations are Zn and often Cd hyperaccumulating and are also tolerant to these metals. When exposed to Ni, they will also hyperaccumulate it, but they are less tolerant to it than populations from serpentine sites (Assunção et al., 2003). Especially populations found on metal contaminated sites in the Cevennes region, around the town of Ganges in the south of France, are particularly good at hyperaccumulating Cd. Most of the phenotypic differences related to metal tolerance and accumulation between populations are due to genetic differences, often reflected in differences in gene expression (Van De Mortel et al., 2008). With the N. caerulescens transcriptome dataset we generated, it will be much easier to study such differences. Also, when the whole genome sequence of N. caerulescens will be determined, the transcriptome data can be used to annotate predicted genes and delineate potential transcripts of such genes. This will be useful to determine the function of these genes. Finally, since the previous transcriptome dataset was obtained from accession "La Calamine," and the new one from "Ganges," the comparison of transcripts from the same genes revealed a SNP frequency of around 2 per 100 bp. Although the presence of sequence errors in both sets will have inflated this number, it will be straightforward to convert these differences into genetic markers that can be used in mapping quantitative trait loci in segregating populations of N. caerulescens (Assunção et al., 2006;Deniau et al., 2006).