Insights into the Sesquiterpenoid Pathway by Metabolic Profiling and De novo Transcriptome Assembly of Stem-Chicory (Cichorium intybus Cultigroup “Catalogna”)

Stem-chicory of the “Catalogna” group is a vegetable consumed for bitter-flavored stems. Type and levels of bitter sesquiterpene lactones (STLs) participate in conferring bitterness in vegetables. The content of lactucin—and lactucopocrin-like STLs was higher in “Molfettese” than “Galatina” landrace stalks, regardless of the cultivation sites, consistently with bitterness scores and gustative differences. The “Galatina” transcriptome assembly resulted in 58,872 unigenes, 77% of which were annotated, paving the way to molecular investigation of the STL pathway. Comparative transcriptome analysis allowed the identification of 69,352 SNPs and of 1640 differentially expressed genes that maintained the pattern independently of the site. Enrichment analyses revealed that 4 out of 29 unigenes were up-regulated in “Molfettese” vs “Galatina” within the sesquiterpenoid pathway. The expression of two germacrene A -synthase (GAS) and one -oxidase (GAO) genes of the costunolide branch correlated positively with the contents of lactucin-like molecules, supporting that STL biosynthesis regulation occurs at the transcriptional level. Finally, 46 genes encoding transcription factors (TFs) maintained a differential expression pattern between the two varieties regardless of the growth site; correlation analyses among TFs, GAS, GAO gene expressions and STLs contents suggest that one MYB and one bHLH may act in the pathway.


INTRODUCTION
Chicory (Cichorium intybus L.) is cultivated worldwide to produce food, coffee surrogates, forages pharmaceuticals, and healthcare compounds (Street et al., 2013). Genetic diversity analysis supported the three-cluster structure of C. intybus cultivated germplasm (Kiers et al., 2000;Raulier et al., 2016): witloof, root and leaf chicory groups. The latter embraces "Radicchio, " "Sugarloaf " and "Catalogna" sub-groups. Several "Catalogna" cultivars/landraces are cultivated in Italy for both leaves and stems. These latter are appreciated for the bitter and crispy taste. Botanically, they bear a receptacle made of outer whorls of leaves (runcinated-pinnatifid type, large mid-rib) and an inner bulk of inflorescence stems (syn. flower stalks, turions). These stalks (cut at various lengths) are mostly eaten raw (sliced into curly and crunchy strips, a.k.a. "puntarelle") or cooked. Stem vegetables are novel products moving from a niche to a global market, showing potential use in the minimally or fully processed food chain (Renna et al., 2014).
A major aim of this work was to characterize the factors that contribute to bitterness at the metabolic and genetic levels by comparing "Galatina" (Gal) and "Molfettese" (Mol) landraces. The STL quantification revealed a higher content of lactucins and lactupicrins in the latter compared to the former, independently of the growing sites. Due to the lack of comprehensive genomic information on C. intybus, a reference Gal transcriptome for the "Catalogna" group was created. Digital gene expression (DGE) targeted to stems at the commercial maturity revealed those differentially expressed genes (DEGs), which maintained the patterns in Mol and Gal landraces irrespective of the cultivation area. Focusing on transcriptomic differences of the STL pathway, the Mol genotype was enriched of upregulated genes-two germacrene A-synthase and one oxidase (GAS and GAO)-acting upstream the route. The GAS and GAO transcription levels correlated positively with the contents of 11(S), 13-dihydrolactucin and 11(s), 13-dihydro-8-deoxylactucin, supporting that STL biosynthesis regulation occurs at the transcriptional level. Consequently, the GAS and GAO higher expression levels of Mol vs. Gal may account for higher contents of STLs, supporting that these genes may be good expression markers for bitterness selection. Correlation analyses among the expression levels of GAS, GAO, and transcription factor (TF) genes and STLs contents pointed at MYB and bHLH as the best TF candidates in the STL biosynthesis regulation.

Plant Materials and Sampling
The "Galatina" (Gal) and "Molfettese" (Mol) landraces of stem chicory (Cichorium intybus L. "Catalogna" group) were previously described, including botanical classification, phenotypical traits, site coordinates, and cultivation parameters (D'Acunzo et al., 2016). In the current work, the landraces were grown both on local private farms in Apulia (Molfetta, southern Italy) and in the Enza Zaden fields (Tarquinia, Viterbo, Lazio, Italy). Plants were sown in August 2012 in both locations and transplanted after 30 days. Plant density was 8.3 plants/m 2 and similar growing techniques were applied in both growing areas. As for Lazio, harvesting was on the 08/01/2013 and 28/01/2013 for Mol and Gal, respectively; the average temperatures 1 week before harvesting were 9.0 ± 0.9 • C and 6.9 ± 1.4 • C and the average temperature during the 1-28/1/2013 period was 8.2 ± 2.2 • C (www.idrografico.roma.it/annali). As for Apulia, harvesting occurred on the 14/01/2013 and 25/01/2013 for Mol and Gal, respectively; the average temperatures 1 week before harvesting were 9.4 ± 0.1 • C and 9.5 ± 0.3 • C, and the average temperature of the 1-25/1/2013 lapse was 9.7 ± 0.2 • C (www.agrometeopuglia.it). Harvested plants were brought to laboratories, selected for comparable weights (Lazio: 905 ± 230gr and 850 ± 201gr for Gal and Mol; Apulia: 820 ± 162gr and 940 ± 128gr for Gal and Mol; non-significant differences were scored by ANOVA) and processed. In the experiments of STL lactones quantification, transcriptional and allelic variation analyses, marketable stems were removed from rosettes (n = 15) of each landrace. A replicate batch consisted of 10 homogeneous stems (mean length 11.5 ± 1.5 cm, mean diameter of the median section: 2.7 ± 0.3 cm). These were immediately frozen in liquid nitrogen and stored at −80 • C for RNA isolation or they were lyophilized at −50 • C for 72 h (laboratory freeze dryer with stoppering tray dryer, FreeZone R , Labconco Corp., Kansas City, MO, USA) and stored at −20 • C for HPLC analyses. Three biological replicates were used in all the experiments.

Sesquiterpene Lactones Quantification
In order to quantify the total amount of STL (free and bound fractions), the samples were prepared by adopting both cellulase hydrolytic treatment (Tamaki et al., 1995) and ultrasound assisted extraction (UAE). As regards the enzymatic procedure, the lyophilized sample (2 g) was added to 50 mL of methanol/water solution (80:20, v/v) plus 2% of formic acid and 3 mL of santonin solution (101.7 µg/mL) as internal standard, and shaken (1000 g/min, for 15 min, at 80 • C; F80 Digit, Falc Instruments s.r.l., Italy). The supernatant was collected and the pellet underwent two additional extractions as described above. The final extract (about 150 mL) was dried under vacuum, redissolved in methanol/dichloromethane (1:7, v/v) solution, and loaded onto a solid phase extraction (SPE) column. The elution was carried out with 6 mL of a dichloromethane/ethyl acetate (3:2 v/v) solution; the eluted fractions were pooled and then vacuum-dried. The eluted fraction was re-dissolved in 1 mL of a cellulase enzyme solution (10 mg of cellulase/mL of water) and then incubated at 37 • C for 2 h with stirring. The solvent was evaporated and then made up to 500 µL. As for the ultrasound assisted extraction, the same protocol as described above was adopted. Differently, the SPE eluted fraction was sonicated at 50 KHz for 30 min (37 • C) by using an ultrasound bath (Labsonic LBS1-3, Falc Instruments s.r.l., Italy). The purified samples were added with methanol (4 mL) and the STL identification was performed by an HPLC apparatus Finnigan (Thermo Electron Corporation, San Jose, California), equipped with quaternary pump, DAD detector, and a C18 Kinetex column (250 × 4.60 mm, 5 µm). The mobile phases A and B were, respectively, methanol/water 14:86 and 64:36 (v/v). The gradients were 0-20 min, 100-58% A; 20-30 min, 58% A; 30-45 min, 58-0% A; 45-50 min, 0% A; 50-52 min, 0-100% A; 52-62 min, 100% A. The flow was at 0.5 mL/min and the injection volume was 80 µL. STL peaks were monitored at 260 nm. Following an identical preliminary extraction protocol, UAE lead to equivalent amounts of STL as compared to the cellulase treatment (Table S1).

RNA Isolation and Sequencing
For transcriptome reference assembly, Gal plants at the transplant (n = 5, 3-4 true leaves) and commercial maturation (n = 5) stages were used; distinct tissues (apices, stems, leaves and roots) were sampled and ground in liquid nitrogen. 500 mg of each tissues was used to isolate total RNA by TRIzol reagent (Invitrogen) followed by an additional purification step using RNAeasy separation columns (RNAeasy kit, Qiagen); yields were estimated by electrophoretic and spectrophotometric analyses (NanoDrop ND-1000; Thermo Scientific Inc.), and RNA integrity number (RIN > 7) was verified using BioAnalyzer 2100 (Agilent Technologies Inc). cDNA libraries were prepared using TruSeq RNA-seq sample preparation kit (Illumina) and sequenced in 100 bp paired-end mode using an Illumina HiSeq2000 (IGA Technology Services, Udine, Italy).
As for NGS transcriptional analyses and SNP mining, sizecomparable stems (n = 10) at harvesting time were sampled from Gal (n = 5) and Mol (n = 5) grown in each location. Total RNA was isolated, quantified and controlled for yield and integrity as described above. Illumina Truseq cDNA libraries were prepared and sequenced in 50 bp single-end on Illumina HiSeq2000 platform. For a given genotype, three biological replicates for each growing area were generated. RNAseq data sets have been stored in the National Center for Biotechnology Information database (NCBI, www.ncbi.nlm.nih.gov) under the BioProject accession number PRJNA328202.

Transcriptome Assembly
Paired-end Illumina raw reads were filtered to remove adapter contaminations and low-quality reads using Trimmomatic v0.32 (Bolger et al., 2014). The high-quality reads were assembled by using both a one-step (de novo) and two-steps (ESTbased backbone construction followed by a de novo assembly) approaches. In the one-step approach the cleaned reads were assembled using Trinity (Grabherr et al., 2011) with default parameters. In the two-steps approach, a collection of 53,973 Cichorium intybus EST was retrieved from the NCBI database. To create a unigene dataset, the EST were cleaned (trimming of vector tag, low quality stretches and repetitive element) and assembled using the EGassembler pipeline (Masoudi-Nejad et al., 2006) with the default parameters. The pipeline produced 26,085 unigenes (7199 contigs and 18,886 singletons). The filtered reads were mapped on the resulting unigenes by using Bowtie2 (Langmead and Salzberg, 2012). The unmapped reads were recovered and used in iterative contig extension processes using SeqMan Pro (DNASTAR. Madison, WI). The reads that did not extended contig lengths were de novo assembled Velvet/Oasis programs (Zerbino and Birney, 2008;Schulz et al., 2012) with a k-mer size of 25. Redundancy between one-and two-steps output was removed by TGICL-CAP3 (Pertea et al., 2003) using overlaping stretches 200 bp-long and minimal identity of 97%. The genes/isoforms clustering was performed using cd-hit-est from the CD-HIT package (Li and Godzik, 2006) with sequence identity threshold of 97%. The longest transcripts were selected as representative for each cluster.

Polymorphisms Calling
BWA (Li and Durbin, 2009), Picard tools (http://picard.sourceforge.net), SAMtools  and the BCFtools utilities were used to align the reads of Gal and Mol to the reference transcriptome, mark duplicated reads, compute the genotype likelihoods and call the variable positions, respectively. In order to provide a set of reliable SNPs useful in robust genotyping assays, the following filtering criteria were imposed: (a) quality score ("QUAL") ≥ 30 (99.9% base call accuracy); (b) at least 10 high-quality reads ("DP4") supporting the nucleotide differences; (c) SNP within homopolymer stretches of length ≥ 5 bp were excluded; (d) genotype quality score ("GQ") ≥ 50. The MISA perl script (http://pgrc.ipk-gatersleben.de/misa) was used for identification of potential simple sequence repeats (SSRs). Units with one to six nucleotides and a minimum repetition of twelve units for mono-nucleotides, six for di-nucleotides, five for tri-, tetra-, penta-and hexa-nucelotides were considered in the analysis.

Gene Expression Analyses
The raw single-end reads were trimmed as described above. For a given genotype, the cleaned reads were mapped on the reference assembly using BWA (Li and Durbin, 2009) and SAMtool pipeline, and read count for each transcript was scored in each replicate. The digital gene expression (DGE) levels were calculated and expressed as RPKM (reads per kilobases of transcript sequence per million of mapped reads) values.

Real Time Quantitative PCR (qPCR)
Total RNA derived from a pool (n = 5) of stems was isolated by the RNeasy Plant Mini Kit (Qiagen), DNase treated (RQ1, Promega), and 1 µg was reverse-transcribed at 55 • C by SuperscriptIII (Life Technologies). The cDNA (100 ng) was amplified by Eco Real-Time PCR System (Illumina) using 1x Quantimix easy master mix (Biotools) and 0.3 µM of each primer in a 10 µl final volume. The triplicate reaction conditions were as follows: 95 • C for 10 min, 45 cycles at 95 • C for 15 s, 60 • C for 15 s, and 72 • C for 40 s. Primer specificity was checked by melting curve analysis and by agarose gel electrophoresis. Three technical replicates and three independent biological experiments were performed for each sample. The expression levels of the target unigenes were normalized with the reference genes ACT (Maroufi et al., 2010) by the Q-Gene program (Muller et al., 2002). Primers were designed using Primer3 software (Untergasser et al., 2012) and they are listed in Table S2.

Statistical Analyses
The analysis of variance (ANOVA) was applied to STL content variation in landraces grown in Apulia and Lazio cultivation sites, followed by Duncan Multiple Range Test. All procedures, General Linear Model and means separation, were carried out by Statistical Analysis System program (SAS software, Version 9.1, Cary, NC, USA). The principal component analysis (PCA) allowed a visual overview about spatial distribution and grouping among genotypes and growing sites; it was based on mean centered and standardized data (unit variance scaled) and results were shown as bi-plots of scores (treatments) and loadings (variables) plots (XLStat Pro, Addinsoft, Paris, France). Differential expression analysis was performed using the Bioconductor edgeR package (Robinson et al., 2010). All samples were normalized by trimmed mean of M values (TMM). Unigenes with at least 1 read per million in at least 3 samples were retained and a false discovery rate (FDR) value ≤ 0.05 and an absolute value of log 2 fold change ratio (log 2 FC) ≥ 1 were set as the thresholds for the significance of the gene expression difference. The hypergeometric test (Pang et al., 2013) was used for GO and KEGG enrichment analyses of differentially expressed genes with the whole transcriptome set as background. The significant GO and KEGG terms were identified after multiple testing adjustments with an FDR ≤ 0.05. Correlations analyses were performed in R 3.2.2 (R Core Team, 2015) with the rcorr function of the Hmisc package (Harrell, 2016), whilst corrplot function (Wei, 2013) was used to produce the correlation matrix.

RESULTS
Stems of "Molfettese" Contain More STL than "Galatina" Independently of Cultivation Area The major STLs lactucin (Lc), 8-deoxylactucin (dLc), lactucopicrin (Lp) and the respective dihydro-derivatives, 1,3-dihydrolactucin (DHLc), 11(s), 13-dihydro-8-deoxylactucin (DHdLc), 11(s), 13-dihydrolactucopicrin (DHdLp) were quantified in stems of Mol and Gal landraces. Overall, the total STL content was significantly higher in Mol than Gal stems (84.9 ± 5.0 vs. 55.4 ± 3.0 mg kg −1 dry matter) as well as that of total lactucin-like forms (LcTOT, 66.2 ± 6.7 vs. 37.7 ± 4.5) independently of the growth site ( Table 1, values in the "Both" line). The total lactucopicrin-like compounds showed comparable levels in both landraces (LpTOT, 18.6 ± 2.0 vs. 17.7 ± 2.5). The mean contents of DHLc and DHdLc were 2-fold higher in Mol than Gal (26.6 ± 2.8 vs. 11.1 ± 1.4 and 29.7 ± 2.8 vs. 12.2 ± 1.5, respectively). Vice versa, the Lc and dLc contents were slightly but significantly higher in Gal than Mol (6.4 ± 0.8 vs. 5.1 ± 0.8 and 8.0 ± 0.9 vs. 4.9 ± 2.5). Finally, the Lp amount was higher in Mol than Gal stems, whilst that of DHLp was comparable in the two genotypes. The environment change per se did not cause any significant variation of all compounds in both landraces ( Table 1, values in the "A" and "L" lines). However, genotype-environment interactions were observed for the content of DHdLc, which decreased in Mol and was unvaried in Gal, and for Lc and Lp, which differed between Mol and Gal, only in one of the two sites (Lazio for Lc and Apulia for Lp). The principal component analysis (PCA, Figure 1) of STL data produced the two principal components PC1 and PC2 explaining, respectively, 80.67 and 19.30% of the total variance. The PCA fully separated Gal from Mol genotypes, respectively, located on the left and right side of PC1. Negative values of PC1 correlated with DHLp, Lc and dLc whilst the positive ones showed high correlation with the other compounds. Referring to PC2, Gal from Lazio (L) and Mol from Apulia (A) were in the upper quadrants, mainly positively correlated to the highest content of DHLp in Gal/L and of Lp and total STL in Mol/A ( Table 1).

Transcriptome Assembly
Cichorium intybus RNA-seq libraries were prepared from the apical tips, stems, leaves and roots sampled at transplant and harvest stages of the Gal landrace ( Table 2). The Illumina HiSeq2000 sequencing yielded approximately 164 million raw (100 bp) paired-end reads. The reads were processed to remove low-quality and adapter sequences, and ca. 97.7% of them ( Table 2) were used as common dataset to follow two workflows, named "one-step" and "two-step" assembly strategies (Table 3 and Figure S1). In the former, the high-quality reads were assembled de novo by Trinity (Grabherr et al., 2011) into 80,303 contigs with a N50 and mean lengths of 1472 and 1,130.9 bp, respectively. The two-step strategy first included a templatebased assembly and then a de novo one. Briefly, a non-redundant set of 26,085 unique sequences was generated by EGassembler pipeline (Masoudi-Nejad et al., 2006) using all the C. intybus public ESTs available at current date. Subsequently, the filtered reads were mapped on the unigenes EST set of 26,085 resulting into 11,153 read-supported EST fragments. These were used as input for iterative contig extension process using SeqMan Pro (DNAStar) which could raise the unigene mean length from ca. 760 to 1248 bp. Finally, the unmapped reads were retrieved by Bowtie2 (Langmead and Salzberg, 2012) and de novo assembled by Velvet/Oasis (Zerbino and Birney, 2008;Schulz et al., 2012) into 35,091 contigs. The output from both one-and two-step approaches were merged to obtain the final reference assembly consisting of 79,716 transcripts (N50 = 1545 bp and average contig length = 1230 bp) clustered into 58,872 isoform groups (unigenes). The unigenes set had a N50 and mean lengths of 1574 and 1220 bp, respectively (Table 3).

Annotation and Function Classification
In order to widen the information on the newly assembled transcriptome, sequence similarity searches were performed against eight databases (see Material and Methods, and  Figure 2A, and Figure S2). The "metabolic" and "cellular processes" were the most abundant categories (13,025 and 12,146 unigenes) within the "biological process;" "cell" and "cell part" top ranked (>10,500 unigenes) in the "cellular component, " while "binding" and "catalitic activity" included the highest unigene numbers (14,198 and 11,628) in the "molecular function" ontology. As for KEGG (Table 4 and Figure 2B), 7393 unigenes (12.6%) were mapped into 5 main categories and 130 metabolic maps. Most of the unigenes fell into the "metabolism" cluster (10,634; 82.7% of the mapped unigenes) followed by  the genetic information processing, "environmental information processing, " "cellular processes" and "organismal systems" (11.9, 2.9, 1.5, and 0.9%, respectively). Within the "metabolisms, " those of "carbohydrate" (2416; 18.8%), "nucleotide" (1718; 13.4%) and "cofactors and vitamins" (1648; 12.8%) contained the highest number of unigenes. Based on the KOG database ( Table 4 and Figure 2C), 11,650 unigenes (19.8%) belonged to 25 functional categories and the "general functional prediction only" (2309 unigenes; 17.6%), "post-translational modification, protein turnover, chaperones" (1341; 10.2%), and "signal transduction mechanisms" (1226; 9.4%), were the largest ones. InterProScan was used for structural annotation of the deduced products. The Frontiers in Plant Science | www.frontiersin.org FIGURE 2 | Continued classification into EuKaryotic Orthologous Groups (KOG). A, RNA processing and modification; B, Chromatin structure and dynamics; C, Energy production and conversion; D, Cell cycle control, cell division, chromosome partitioning; E, Amino acid transport and metabolism; F, Nucleotide transport and metabolism; G, Carbohydrate transport and metabolism; H, Coenzyme transport and metabolism; I, Lipid transport and metabolism; J, Translation, ribosomal structure and biogenesis; K, Transcription; L, Replication, recombination and repair; M, Cell wall/membrane/envelope biogenesis; N, Cell motility; O, Post-translational modification, protein turnover, chaperones; P, Inorganic ion transport and metabolism; Q, Secondary metabolites biosynthesis, transport and catabolism; R, General function prediction only; S, Function unknown; T, Signal transduction mechanisms; U, Intracellular trafficking, secretion, and vesicular transport; V, Defense mechanisms; Y, Nuclear structure; Z, Cytoskeleton. (D) Comparison of unigene length between annotated and non-annotated unigenes. The percentage of non-annotated transcripts (light gray bars) was high in the small-sized unigene category and progressively dropped along with the transcript length increase. analysis used 15 databases and assigned 237,796 annotations to 31,516 unigenes (53.5%, in Table 4). The Interpro accessions (IPR , Table S4) were 20,642; 8935 putative proteins could be grouped into 2596 families, while 16,354 showed known domains, 1850 harbored repeats and 3308 hosted functional sites. Finally, the total unigenes of the C. intybus Gal transcriptome with at least one annotation signature were 45,572 (77.4%, in Table 4) and showed average length of 1372 bp; non-annotated unigenes were 13,302 (22.6%) and mostly of short size (ca. 700 bp; Figure 2D).

Digital Gene Expression and Functional Classification of Differentially Expressed Genes
Taken that Mol contained more STL than Gal, we first performed DGE profiling on the two genotypes within the same environment (inter-landrace comparison); subsequently, we selected the genes that maintained the relative expression pattern (independently of the area) for further characterization. 12,274 DEGs were identified (sum of up-and down-regulated unigenes bracketed in Figure 3); 6346 and 2294 DEGs were specific for Apulia and Lazio shires, respectively. Moreover, 1640 DEGs (961 down-plus 679 up-regulated unigenes, bolded in Figure 3) maintained the relative expression pattern independently from the cultivation zone, while 177 unigenes showed opposite trends from one site to the other. The transcriptome variation within a given genotype following cultivation site change is reported in Figure S3. KEGG (Table 5) and GO (Tables S5,  S6) enrichment analyses were performed to functionally classify the 1640 DEGs. The former revealed that Mol up-and downregulated genes were significantly over-represented in 9 and 4 pathways, respectively. Four up-regulated DEGs fell in the sesquiterpenoid and triterpenoid biosynthesis ( Table 6).

Dissections of Putative Genes Related to Sesquiterpenoid and Triterpenoid Pathway
Sesquiterpene lactones (STL) precursors belong to the sesquiterpenoid and triterpenoid pathway (STP); therefore, the latter was further characterized by identifying 29 unigenes putatively encoding 16 distinct enzymes ( Table 6 and Figure  S4). Four unigenes were below the transcription threshold (RPKM 0-0.1), 13 showed a low expression (RPKM 0.1-3), 1 was moderately (RPKM 3-8) and 11 were highly (RPKM >8) expressed in the stems of both landraces ( Table 6). DGE patterns were checked by qPCR assays ( Figure S5) performed on 15 unigenes of the STP, and a significant positive correlation was found (Figure 4). Consequently, the DEG analysis (Table S7)  identified 4 unigenes (gray-shaded in Table 6) which maintained a significantly higher expression in Mol than Gal regardless of growth sites. These genes were the germacrene A synthase (GAS; Ci_contig62597 and Ci_contig62598), germacrene A oxidase (GAO; Ci_contig7113) and β-amyrin synthase (LUP4; Ci_contig3360). GAS and GAO are two key enzymes in the synthesis of germacrene-type STLs and act consecutively in the upstream steps that generate the costunolide (Figure 5). Within the STL branch, 8 sequences were found in the Gal transcriptome, consisting of 6 GAS transcripts (Table S7), 1 GAO and 1 costunolide synthase (COS) mRNAs. The GAS, GAO, and COS proteins shared significant identities with several Asteraceae orthologues (Table S7); the sequence variability among the 6 GAS ranged from 53.1 to 88.6% ( Figure S6). The Asteraceae GAS phylogenetic tree ( Figure 6A) placed Ci_contig62598 and Ci_contig7229 products in the clades of C. intybus short and long variants, respectively. The Ci_contig62597 derived protein belonged to the L. sativa LTC2 group, while the Ci_contig29105 and Ci_contig73080 formed a clade per se, sharing 88.6% sequence identity ( Figure S6). The Asteraceae GAO phylogenetic tree ( Figure 6B) assigned the Ci_contig7113 product in the branch of C. intybus GAO with which shared 100% identity (not shown).

Correlation Analyses among GAS, GAO, Putative Transcription Factors, and STLs
We approached the search for transcription factor genes (TFs) involved in the STL pathway. Globally, 2680 gene deduced products could be ascribed to 57 TF families according to PlantTFDB rules (Jin et al., 2014) and ca. 55% belonged to bHLH (258 proteins), WRKY (188) Fourteen genes belonging to the sesquiterpenoid and triterpenoid pathway (Table S2) were arbitrarily chosen and their expression was monitored in Mol and Gal landraces in Apulia and Lazio areas. For a given unigene, RNAseq fold change refers to the ratios of RPKM values of Mol to Gal, whilst qPCR fold change is the relative expression of Mol normalized to those of Gal. R 2 , coefficient of determination; ***, significant correlation at P < 0.001.

Genotype Differentiation by Gene Polymorphisms Mining
Polymorphism identification from transcriptome sequences is useful to score gene functional variations in species without a sequenced genome. Consequently, SSR and SNP were mined and a specific focus was on the STL pathway genes. Overall, 11,672 putative SSRs were found in 9826 unigenes and 1525 of them contained more than one microsatellite (Table S8). Excluded mononucleotides, the di-and tri-nucleotide repeats were the most abundant (respectively 52.8 and 42.4% out of 6946 SSR); the AG/CT and ATC/ATG were the most frequent motifs ( Table 7). The mapping of Mol reads vs. the Gal reference transcriptome scored 67,265 SNPs in 15,248 unigenes and ca. 68% were heterozygous (Table S9); the nucleotide substitutions ( Figure S7) included 61.8% transition (C/T vs. A/G, 32.8 vs. 29%) and 38.2% transvertion events (A/T was the most frequent). The average SNP frequency occurred at 1/1068 bp, with a mean of 4.4 per unigene. Eleven out of the 29 unigenes of the sesquiterpenoid pathway (Table S9) bore SNPs between Gal and Mol landraces and 5 genes included homozygous SNPs. Referring to DEGs, β-amyrin synthase (LUP4; Ci_contig3360) showed 7 SNPs, while GAS (Ci_contig62598) contained one heterozygous SNP, which produced a synonym substitution (Tyrosin).

DISCUSSION
In this work, the combination of a de novo transcriptome assembly, transcript and metabolite profiling was used to achieve insights in the genetic pathway of sesquiterpenes and, more generally, valuable genetic tools using two stem-chicory "Catalogna" landraces. As for total STL extraction procedure, cellulase treatment, and ultrasound assisted extraction were compared. Sonication can disrupt cell walls causing the release of cell contents (Toma et al., 2001) and it was effective as much as cellulose hydrolysis for yielding both free and bound fractions in chicory stems (Table S1). The metabolic characterization pointed that total STL content of stems was ca. 50-fold lower than that reported for "Catalogna" leaves (GEVES) 1 . Typically, STL have been either undetected or found in traces in stalks (Chou and Mullin, 1993;Douglas et al., 2004;Eljounaidi et al., 2015). Mol stems contained more total STL than those of Gal and differences were ascribed mostly to the genotype diversity and poorly to the environment changes; genotype-environment (GxE) interactions affected the contents of Lc, DHLc, DHdLc, and Lp variation ( Table 1 and Table S1). Genotype effects were also observed in chicory forage, because cultivars with higher STL than others maintained the characteristics regardless of the growth sites (Foster et al., 2006). Chicory studies also report STL content variation with organ developmental stage, cultivation and geolocalisation that account for GxE effects (Peters et al., 1997;Foster et al., 2006;Ramirez et al., 2013;Chen et al., 2014). The higher content of DHLc, DHdLc and Lp in Mol vs. Gal may account for bitterness differences. The conversion of STL amounts into bitterness degrees (van Beek et al., 1990) indicates that Mol has higher scores than Gal (Table S10) and that the difference was mostly due to total Lc-like compounds. Simple gustative tests further assessed that Mol stems were more bitter than Gal ones (χ 2 = 56.89; p < 0.001; Table S11). Consistently, the contents of Lc-like compounds showed positive correlation with bitterness in leaves of C. intybus sugarloaf, witloof and radicchio (Price et al., 1990;Peters and van Amerongen, 1998;Poli et al., 2002) and DHLc contents strongly correlated with root bitterness (Hance et al., 2007). Lp is one of the most bitter among chicory STLs (van Beek et al., 1990) and Lp-like molecules are strongly related to bitterness in other Asteraceae species, such as lettuce (L. sativa) and endive (C. endive) leaves (Seo et al., 2009;D'Antuono et al., 2016). Finally, a recent survey on endives proposes that bitterness depends on the balance between STL and phenolic contents and that the different compounds within these two categories can affect the trait both individually and in a complex manner (D'Antuono et al., 2016). Known that total phenolic contents are comparable in Gal and Mol stems (Renna et al., 2014), analytic studies are envisaged in addressing bitterness in these novel products.
Comparative analysis of Gal and Mol transcriptomes scored that ca. 2.8% unigenes (1640 out of 58,872 total unigenes) maintained differential expression pattern between the two genotypes in both Apulia and Lazio sites. This suggested that environment minimally or equally affected this gene pool, which may represent a source of transcriptional markers. Within this pool, four up-regulated DEGs ( Table 5) fell in the sesquiterpenoid (2 GAS and 1 GAO) and triterpenoid (β-amyrin synthase) biosynthesis (Table S7). The diversity in the functional amino acid stretches among the 6 GAS deduced proteins ( Figure  S6) suggests that they may have diversified catalytic functions with substrate specificity. The two GAS enzymes encoded by the up-regulated DEGs are similar but not identical, and ascribed to the C. intybus short form (Figure 6 and    Correlogram representing Pearson's correlation coefficient (r) between content of STLs and gene expression abundances of both biosynthesis genes and putative TFs. Heat map is used to indicate the strength of correlation between the variables with ordering determined by hierarchical clustering. Red and blue indicate negative and positive correlations, respectively. Pearson's correlation coefficient values were reported into the colored squares. *, **, *** = significant at P ≤ 0. 05,0.01,and 0.001,respectively. Lc,lactucin;DHLc,11(S),dLc,DHdLc,11(S),Lp,lactucopicrin;DHLp,11(s), 13-dihydrolactucopicrin; STLtot, total STL content; LcTOT, total lactucin-like STL; LpTOT, total lactucopicrin-like STLs.

S6
). Both GAS genes (Ci_contig62597 and Ci_contig62598) are 2-fold more expressed in Mol than Gal. Similarly, the GAO gene transcription is twice higher in Mol than Gal and the deduced protein corresponds to that of C. intybus available in public databases. Significant positive correlation occurred between GAS and GAO transcriptions and the contents of Lclike (DHLc, DHdLc). These results corroborate the finding that terpene accumulation goes in parallel with the expressions of respective synthase genes (Nagegowda, 2010) and are consistent with the correlation of artichoke GAS and GAO transcription levels with the cyanopicrin-STL abundance (Eljounaidi et al., 2015). Regarding the COS gene (costunolide is a down-stream precursor of STL), it is worth noting that the expression was more abundant in Mol than Gal in Apulia; the trend was maintained in Lazio, but not at significant level ( Table 6 and Figure S5). The STL genetic pathway per se needs investigation in plants, namely, the biosynthesis genes downstream the COS (leading to Lc-and Lp-like compounds) have been unknown as well as the routes of catabolism and transport. Moreover, the knowledge on the TFs involved in sesqui-and triterpenoids biosynthesis is still fragmentary (Yamada and Sato, 2013). In this work, 46 TF genes conserved the differential expression between the two genotypes. Among these, some showed strong positive (MYB, Ci_contig49541) and negative (bHLH, Ci_contig48487) correlations with both the biosynthesis genes (GAS and GAO) and total STLs (three-way relationship). Given that MYB factors and bHLH members control sesquiterpenes synthesis in plants acting on terpene synthase genes (Hong et al., 2012;Reeves et al., 2012;Lu et al., 2013), it is proposed that the abovementioned genes may represent TF that regulate the stemchicory STL pathway. However, some TF showed significant strong correlations in two-way relationships. For instance, the Ci_contig18404 (ERF) showed a very strong negative correlation (r < −0.8) with GAO and total STLs, but not with GAS. Similarly, the Ci_contig45269 (WRKY) had a strong positive correlation with GAO but not with GAS or total STL. Consequently, these TFs might take part in STL metabolism of C. intybus through specific routes or indirectly, consistently with their role in regulating STL biosynthesis in other species (Yamada and Sato, 2013). Moreover, the future availability of GAS and GAO genomic sequences will allow prediction of motifs targeted by these candidate genes and pave the way for functional study experiments.
Cichorium intybus genome is esteemed ca. 1400 MB (De Simone et al., 1997), hence transcriptome sequencing was convenient to widen resources aimed to gene discovery, expression profiling, and diversity analysis and to marker production for breeding. There have been two Illuminabased transcriptomes of C. intybus, which derived from seedlings (Hodgins et al., 2014). The 'Galatina' reference transcriptome enriches the number of those within the leafy group-"Catalogna, " Witloof (Hodgins et al., 2014) and Radicchio-and widens the investigation spectrum because it is based on several mature and young vegetative tissues. In the absence of a genome sequence, it is recommended that the transcriptome construction of non-model species is achieved through the joining of reference-guided and de novo transcriptome assemblies (Ockendon et al., 2016) and/or the combinatorial use of different assemblers (Nakasugi et al., 2014). The assembly used in this work employed the one-and twostep strategies ( Figure S1), which produced ca. 58,000 unigenes comparably to other C. intybus transcriptomes (Hodgins et al., 2014) but with nearly doubled length (1230 vs. 635-684 bp). Moreover, the Gal transcriptome contained 77.4% of annotated unigenes of average length of 1372 bp, consistently with features of other Asteraceae transcriptomes (Wang et al., 2013;Jung et al., 2014;Peng et al., 2014). The Gal transcriptome provided over 11,000 putative SSR markers; more than 15,000 unigenes differed between Gal and Mol for over 20,000 homozygous SNPs. The SSR and SNP validation by wide screening on different populations was beyond the scope of this work. However, the filtering criteria for polymorphism mining (Kumar et al., 2012) provide info to create SNP or SSR markers targeting alleles with variants in coding sequences. These marker types are useful to detect a causative mutation (Field and Wills, 1996) and are highly transferable across species (Varshney et al., 2005). The SNP frequency was 1/1068 bp suggesting that the two landraces are related. The polymorphisms events are expected to increase by screening a higher number of "Catalogna" landraces. As for SNPs in the STL pathway genes and the respective putative biological function, only a synonymous substitution was found in the Ci_contig62598 (GAS) differentiating Gal and Mol unigenes. This implies that no protein mutation occurs that might explain the STL content differences. These latter are more likely due to a diversified gene expression regulation (e.g., residing in gene promoter regions). Nonetheless, the heterozygous SNP provides a tool for genetic mapping of this key gene in the stem-chicory populations.
Finally, the landrace comparative approach and data mining of transcriptome and metabolic variations were efficient to discover genes involved in STL pathway as a precious source to comprehend regulation of bitter taste in this vegetables and support plant breeding for product quality.

AUTHOR CONTRIBUTIONS
DG was responsible for research costs and guided the work design and manuscript writing. GT carried out transcriptome assembly, differential gene expression analysis, polymorphism mining, qPCR validation, statistics and writing. GM contributed to transcriptome assembly and software usage. MG and MR performed statistical analysis on sesquiterpenes contents and gustative test. MG, GA and AS produced plant materials. GCT carried out the sesquiterpenes quantification. CN, GF, ED, MI performed sampling, phenotyping and nucleic acid extractions. All the authors reviewed, edited and approved the manuscript.