Identification of Abietane-Type Diterpenoids and Phenolic Acids Biosynthesis Genes in Salvia apiana Jepson Through Full-Length Transcriptomic and Metabolomic Profiling

Salvia apiana (S. apiana) Jepson is a medicinal plant that is frequently used by the Chumash Indians in southern California as a diaphoretic, calmative, diuretic, or antimicrobial agent. Abietane-type diterpenoids (ATDs) and phenolic acids (PAs) are the main bioactive ingredients in S. apiana. However, few studies have looked into the biosynthesis of ATDs and PAs in S. apiana. In this study, using metabolic profiling focused on the ATDs and PAs in the roots and leaves of S. apiana, we found a distinctive metabolic feature with all-around accumulation of ATDs, but absence of salvianolic acid B. To identify the candidate genes involved in these biosynthesis pathways, full-length transcriptome was performed by PacBio single-molecule real-time (SMRT) sequencing. A total of 50 and 40 unigenes were predicted to be involved in ATDs and PAs biosynthesis, respectively. Further transcriptional profile using Illumina HiSeq sequencing showed that the transcriptional variations of these pathways were consistent with the accumulation patterns of corresponding metabolites. A plant kingdom-wide phylogenetic analysis of cytochromes (CYPs) identified two CYP76AK and two CYP76AH subfamily genes that might contribute for the specific ATDs biosynthesis in S. apiana. We also noticed that the clade VII laccase gene family was significantly expanded in Salvia miltiorrhiza compared with that of S. apiana, indicating their involvements in the formation of salvianolic acid B. In conclusion, our results will enable the further understanding of ATDs and PAs biosynthesis in S. apiana and Salvia genus.


INTRODUCTION
Salvia apiana (S. apiana) Jepson, commonly known as white sage or bee sage, belongs to the genus Salvia L. (Lamiaceae: Nepetoideae: Mentheae: Salviinae). S. apiana is specific for California's and Baja California's flora (Will and Claßen-Bockhoff, 2017). It is widely distributed in the California Floristic Province and forms the chaparral and desert sage community together with other eighteen members of Salvia sections such as Echinosphace and Audibertia (Walker et al., 2015). Therefore, its taxonomical position is quite essential for understanding the phylogenetic relationship and origin of genus Salvia in the New World. However, the available nuclear genes in S. apiana are very limited, which hamper its phylogenetic studies based on molecular data.
The application history of S. apiana is unique due to its distribution area. It is widely used by the Chumash Indians as diaphoretic, calmative, diuretic, and antimicrobial agent, as well as a burning sage used in religious practices (called khapshikh or xapcix) (Walker et al., 2015;Krol et al., 2021). In modern times, S. apiana is demonstrated to have pharmacological activities, including antimicrobial, anti-inflammatory, gamma-aminobutyric acid (GABA)ergic, analgesic, antioxidant, cytotoxic, and antitumor activities, owing to its special chemical constitution (Khan et al., 2016;Saeed et al., 2016;Srivedavyasasri et al., 2017;Afonso et al., 2019;Krol et al., 2021). Previous phytochemical studies on S. apiana have identified substantial amounts of essential oil, accompanied by a variety of triterpenes, C23 terpenoids, diterpenes, flavonoids, and phenolic acids (PAs) (Krol et al., 2021). A distinctive metabolic feature of S. apiana is that it can produce abietanetype diterpenoids (ATDs) in the aerial part of the plant, while the other American Salvia species can only produce clerodane derivatives (Dentali, 1991;Bisio et al., 2019;Krol et al., 2021). Carnosic acid and its derivatives, such as 16-hydroxycarnosic acid, rosmanol, and 16-hydroxycarnosol, etc., constituted the majority of this chemical group (Luis et al., 1996). Tanshinones, such as tanshinones IIA, tanshinones B, and cryptotanshinone, represent another group of ATDs, are usually regarded as the feature compounds in the East Asia taxa of Salvia genus (Guo et al., 2016;Ma et al., 2021). Interestingly, cryptotanshinone was also found in the root of S. apiana (González et al., 1992), indicating the presence of other tanshinones chemicals and a much more complex metabolic composition of diterpenes in S. apiana as well.
Owing to the restricted distribution range and uncontrolled harvesting, S. apiana is now at high risk of rapid decline or extinction (Adlof, 2015). Therefore, studies concerning botany, cultivation, cell culture, and metabolic engineering of S. apiana are necessary for efficient propagation of natural resources and production of bioactive metabolites, which largely rely on the understanding of metabolic pathways of these bioactive compounds and the genetic resource at both the transcriptomic and genomic levels (D'Amelia et al., 2017). To date, studies of S. apiana have mainly focused on its chemical compositions and biological activities of metabolites. However, little about the genetics of the plant was concerned, which has resulted in the lack of knowledge about the biosynthesis of several pharmacologically active chemical groups in S. apiana.
Here, we used single-molecule real-time (SMRT) sequencing on the PacBio Sequel platform to obtain full-length transcriptome information of S. apiana. A plant metabolomics and chemometrics analysis based on ultra-high-performance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UHPLC-Q-TOF-MS/MS) was further applied to characterize the metabolic profiling and screen the chemical markers corresponding to underground and aerial parts of S. apiana. Based on the metabolic profiling of ATDs and PAs, the transcriptional patterns of genes in diterpenoids and phenolic acid biosynthesis pathways were elucidated in S. apiana using Illumina sequencing.

Full-Length Sequencing and Annotation of S. apiana Transcriptome
The full-length transcriptome of S. apiana was obtained using PacBio SMRT sequencing. A total of 36,953,062 subreads (39.30 Gb) were produced, with the average length of 1,040 bp ( Table 1). Circular consensus long-read sequencing was further performed to generate highly accurate long high-fidelity (HiFi) reads to improve the quality of the subreads. As a result, a total of 310,713 circular consensus sequences (CCSs) were generated, with an average length of 1,317 bp and the N50 of 1,565 bp ( Table 1). The full-length non-chimeric (FLNC) reads were identified by screening the coexistence of 5′-primers, 3′-primers, and poly-A tails, which resulted in 189,999 FLNC reads (61.15% of CCS) with an average length of 894 bp and the N50 of 1,255 bp ( Table 1). After clustering the redundant data, 110,985 final consensus reads were obtained, among which 44,169 (39.80%) consensus reads were over 1,000 bp, 8,522 (19.29%) consensus reads were over 2,000 bp, and 652 (7.65%) consensus reads were over 3,000 bp in length (Figure 2).
Gene annotation was carried out by searching against six protein databases [nonredundant (NR), SwissProt, Pfam, Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), and Clusters of Orthologous Groups of proteins (COG)] based on sequence similarities. A total of 61,659 (79.04% of 78,009) transcripts were mapped to at least one database, with NR having the highest number of hits (61,469 transcripts, 78.80%, Table 1 and Figure 2A). Among these transcripts, 54,710 (70.13%) transcripts were linked to the GO categories, including molecular functions (MFs), cellular components (CCs), and biological processes (BPs) ( Figure 2B), with CCs as the majority of the GO terms. By the KEGG, 31,950 transcripts were further assigned into 135 biological pathways ( Figure 2C and Supplementary Table 4). Notably, a total of 581 transcripts were sitting in the "metabolism of terpenoids and polyketides" pathway, which would help to reveal the ATDs biosynthesis pathways in S. apiana in the future.

Phylogenetic Analysis of S. apiana
To determine the phylogenetic position of S. apiana within the genus Salvia, orthologous genes from other available Salvia genomes [Salvia splendens (S. splendens), S. miltiorrhiza, and Rosmarinus officinalis], with Nepeta cataria (Nepeta, Lamiaceae) as an outgroup (Dong et al., 2018;Bornowski et al., 2020;Lichman et al., 2020b;Song et al., 2020), were annotated. In total, 130 single-copy genes were selected and used to reconstruct a maximum likelihood phylogenetic tree (Figure 3). As expected, S. apiana was phylogenetically categorized into the same taxa as S. splendens in America, which might diverge from S. miltiorrhiza, a species of East Asia taxa, at about 28.24 Mya.

Transcriptome Profiling of S. apiana Leaves and Roots
To identify differentially expressed genes (DEGs) in leaves and roots of S. apiana, RNA sequencing (RNA-seq) was performed on the Illumina NovaSeq 6000 platform. Three samples of each organ were sequenced, which produced 42.97 Gb clean data in total (Supplementary Table 5). After assembly, unigenes derived from Illumina sequencing were mapped to full-length transcriptome datasets with a match percentage ranging from 71.81 to 79.01%. Using DEGs analyses, we showed that the transcriptional levels of 15,152 and 14,954 transcripts were significantly higher in roots and leaves, respectively ( Figure 4A).
To figure out the most significant functional difference between roots and leaves, the GO enrichment of DEGs was further performed ( Figure 4B). As a result, metabolic process and cellular process were the most enriched terms in the category of molecular function. In the category of cellular components, most DEGs were enriched in membrane part and cell part. In the category of cellular components, catalytic activity and binding were the most different terms. We noticed that in terms of metabolic process and catalytic activity, these were massively enriched, suggesting that the secondary metabolic pathways in roots and leaves were significantly different. Consistent with the physiological function differences between the aerial and underground parts of plants, photosynthesis, primary metabolic, and plant hormones were found as the most different pathways by the KEGG pathway analysis. In the secondary metabolic pathways, phenylpropanoids biosynthesis was the most significant pathway, showing different transcriptional levels of lignins, flavonoids, and phenolic acids. The variation in diterpenoids pathways was not substantial, which could be attributed to the fact that different metabolites share similar upstream pathways ( Figure 4C).

Transcriptional Variation in Abietane-Type Diterpenoids and Phenolic Acids Biosynthetic Pathways
As the KEGG pathway analysis showed above, metabolism of terpenoids and polyketides (KEGG Ortholog 09109) represented as one of the most enriched pathways with 581 unigenes (Supplementary Table 6). According to the elucidated ATDs   pathway in other Salvia plants, including S. miltiorrhiza, S. fruticose, and S. officinalis (Guo et al., 2013;Ignea et al., 2016;Scheler et al., 2016), we predicted the ATD biosynthesis pathway in S. apiana ( Figure 5A). A total of 50 unigenes were supposed to be involved in the ATDs biosynthesis and could be considered as four modules. The geranylgeranyl diphosphate (GGPP) synthesis via the methylerythritol 4-phosphate (MEP) and mevalonate (MVA) pathways, along with the general diterpenoids catalytic steps, provided the common precursor (ferruginol) for ATDs biosynthesis. Then, the pathway was split into the biosynthesis of carnosic acids and tanshinones catalyzed by different CYP76AHs and CYP76AKs, respectively. The transcriptional level of these genes was further profiled by DEG analysis (Figure 5B). Most genes in MVA pathway showed high transcriptional levels in root, indicating that they play a prominent role in the production of the terpenoids precursor, isopentenyl diphosphate (IPP). However, the genes in MEP pathway and ferruginol synthesis did not show significant transcription trend. In addition, most CYP76AH, CYP76AK, and 2-ODD genes, which were specific for ATDs biosynthesis, showed higher transcriptional level in root, indicating the high accumulation of tanshinones. There was also one CYP76AK gene (transcript_30977) expressed highly in leaf, which might contribute to the carnosic acids biosynthesis. Taken together, the transcriptional profile of ATDs pathway was basically consistent with its metabolic properties in roots and leaves of S. apiana. Additionally, forty unignes might be involved in the biosynthesis of PAs were annotated (Figure 6A), most of which were highly expressed in root, consisting with the high accumulation of phenolic acids in roots ( Figure 6B). Given that different ATDs and PAs are distributed in roots and leaves, common genes involved in both the metabolic pathways, especially their upstream genes, were found to be highly expressed in both the organs ( Figure 4A). Genes related to the metabolic branches, on the other hand, were often organ specific. For example, genes in tanshinones biosynthesis (CPS, CYP76AHs, CYP76AKs, CYP71D375s, and 2-ODDs) were mostly highly expressed in roots.

Phylogeny of Cytochrome and Laccase Families
The functional specificity of Lamiaceae cytochromes (CYPs) determines the particular products of ATDs and PAs in Salvia plants (Hansen et al., 2021). Therefore, a total of 132 CYP genes were discovered in S. apiana by Pfam annotation and used for phylogenetic tree reconstruction, together with 578 CYPs from other plants (Figure 7A). Among these, CYP98A and CYP76 families were grabbed for detailed analysis. CYP98As are usually related to phenolic acids biosynthesis, while CYP76AKs and CYP76AHs are unique in Lamiaceae, which regulate the specific ATDs biosynthesis. In S. apiana, we found six homologous of CYP98A, two homologous each of CYP76AK and CYP76AH, respectively. We further studied the phylogeny of all the CYPs from S. apiana, which were classified into 26 families among six clans (Figure 7B), with 71 clan representing the richest families and CYP71 as the largest family. Interestingly, although S. apiana had more abundant ATD metabolites than that in other Salvia plants, CYP76AK and CYP76AH subfamilies did not seem to have experienced gene expansion, suggesting a functional diverse of these CYPs. On the other hand, we found that CYP98A subfamily might expanded, consisting with more complex composition of PAs in S. apiana. Apart from CYP98A-mediated hydroxylation, polymerization of phenolic polymers promotes the metabolic variety of phenolic acids, which were previously supposed to be catalyzed by the laccase family in Salvia plants. However, the specific laccase remains to be identified (Di et al., 2013;Li et al., 2019;Zheng et al., 2021). Metabolic profiling revealed the absence of these phenolic polymers, implying the lack or mutation of corresponding laccases in S. apiana. Thus, the phylogeny of laccase family was examined in both the S. apiana (27 members) and S. miltiorrhiza (29 members) ( Figure 7C). All the laccases were classified into seven clades, but clade VI was absent in S. apiana. We also found that the clade VII in S. miltiorrhiza showed significant gene expansion. Taken together, these extra laccase genes might contribute to the formation of phenolic polymers such as salvianolic acid B in S. miltiorrhiza.

DISCUSSION
S. apiana, known as white sage, is of ritual meaning and great medicinal value in Southern California historically. However, previous studies on S. apiana were mainly focused on the pharmaceutical perspective as phytochemistry and biological activities. Here, based on the establishment of full-length transcriptome of S. apiana, we were able to investigate its taxonomical position, chemical specificities, as well as genetic characteristics.
The origins of Salvia taxa in the New World, as well as their dispersal, are still debated (Will and Claßen-Bockhoff, 2017). According to the phylogenetic relationship of S. apiana, S. splendens, and S. miltiorrhiza, we estimated that the divergence between America Salvia taxa and North Asia taxa was at about 28.24 Mya. This was a little earlier than the previous estimation (17.6 Mya) according to the phylogeny between Salvia bowleyana (S. bowleyana), S. splendens, and S. miltiorrhiza (Zheng et al., 2021). However, both results suggested that the two lineages were diverged during late Oligocene or early Miocene, which is consisted with previous studies of the Salvia lineages (Kriebel et al., 2019;Krol et al., 2021). However, S. apiana had not been considered yet, which added nuclear gene information for North America taxa.
In contrast to ATDs, phenolic polymers as salvianolic acid B were not produced by S. apiana, indicating the lack of corresponding catalytic enzymes (Di et al., 2013;Hou et al., 2013). The biosynthesis of phenolic acids is composed of the general phenylpropane pathway, tyrosine-derived pathway, and additional enzymes (Petersen and Simmonds, 2003). A BAHD acyltransferase member [rosmarinic acid synthase (RAS)] and CYP98A subfamily were reported to contribute to the specific PAs biosynthesis in Lamiaceae by producing rosmarinic acid. Inferring from the indirect evidences, laccase family might catalyze further polymerization (Di et al., 2013;Li et al., 2019;Zheng et al., 2021;Zhou et al., 2021); however, it still needs to be proved. The absence of salvianolic acid B in S. apiana implies the lose function of laccase family, which can be considered as a natural mutant. When comparing the laccase family between S. apiana and S. miltiorrhiza, we discovered that clade VII has expanded in S. miltiorrhiza, which included the previously predictions according to mutant experiment of SmLAC7 and SmLAC20. It shows that the range of candidate genes can be effectively narrowed through the comparison of laccase families between Salvia species.

CONCLUSION
In summary, we reported the first full-length transcriptome of medicinal plant S. apiana. The distribution and contents of ATDs and PAs exhibited tissue-specific patterns (leaf and root) in S. apiana by UPLC-MS analysis. Such tissue-specific accumulation of ATDs and PAs was further verified with genes in ATDs and PAs biosynthetic pathways through systematic metabolomic and transcriptomic analyses. Moreover, phylogenetic analysis identified five candidate genes for ATDs synthesis; further functional studies of these genes are needed to verify their roles in ATDs biosynthesis. Together, this study provides novel insights into the molecular basis for the biosynthesis of ATDs and PAs in S. apiana and serves as an approach with which to analyze fulllength transcriptome data and the biosynthesis process in other Salvia plants.

Plant Materials
Salvia apiana was planted in a greenhouse with 25 • C and humidity of 40-70% at Shanghai University of Traditional Chinese Medicine for 6 months (May 2020 to November 2020). The fresh leaves and roots of S. apiana were harvested at vegetative stage for both the metabolic and transcriptomic analyses.

Ultra-High-Performance Liquid Chromatography Coupled With Quadrupole Time-of-Flight Mass Spectrometry Analysis for the Non-targeted Metabolomics Study
Each of 8 plants was divided into roots and leaves separately, represented as eight biological replicates. All the samples were crushed into homogeneous powder. Ten milligram prepared powder was extracted with 1 ml of 70% methanol (v/v) containing warfarin (5 µg/ml) as the internal reference for 1 h in an ultrasonic bath (53 kHz, 350 W) at 4 • C. After centrifuged at 12,000 g under 4 • C for 30 min, the supernatant was used for UHPLC-Q-TOF-MS analysis.
To acquire mass spectrometry data, a Xevo G2-XS with an electrospray ionization source equipped with an electrospray ionization source was used. Mass spectrometry was carried out in both the positive ion and negative ion modes under 30 V cone voltage, with the capillary voltage of 3 kV (positive ion mode) or 2.5 kV (negative ion mode), respectively. The desolvation temperature was set at 450 • C with the desolvation gas flow of 600 l/h and the source temperature was set at 150 • C with a cone gas flow of 50 l/h. All the data were collected in MS E mode, with the parameters as follows: MS E range 50-1,200 m/z, MS E low energy 6 eV, and MS E high energy 15-30 eV. To calibrate the instrument, a sodium formate solution (0.5 mM) was used. Leucine enkephalin was continuously acquired and used as an external standard for mass correction. All the data were viewed in MassLynx version 4.2.

Processing of Metabolomics Data
The metabolomics data was first converted to the Analysis Base File (ABF) converter format and imported into the MS-DIAL version 4.60 software (Tsugawa et al., 2015). The parameter settings were as follows: MS 1 and MS 2 tolerance were 0.01 and 0.02 Da, respectively. MS 1 mass range and MS/MS mass range were set between 100 and 1,000 with the MS/MS abundance cutoff at 800 amplitudes. Retention time range was set between 1 and 28 min with the retention time tolerance at 0.15 min.  (Fraisier-Vannier et al., 2020) with the following parameters: minimum blank ratio at 0.8; relative mass defect between 50 and 7,000 with the maximum mass difference at 0.005 Da; and maximum relative SD at 30 with the maximum retention time difference at 0.03 min. To generate the final peak table for metabolite identification, the highest intensity in each cluster was retained.

Metabolite Identification and Chemometric Analysis
MS-FINDER was applied to identify the structures of unknown metabolites with reference databases such as UNPD, Northern African Natural Products Database (NANPDB), Human Metabolome Database (HMDB), PlantCyc, KNApSAcK, and LipidMaps (Tsugawa et al., 2016). The structures of compounds were confirmed by matching with theoretical data or public databases.
To obtain normalized metabolomics data, the raw data was processed in MS-DIAL with warfarin (5 µg/ml) as internal standard in each sample. Normalized metabolomics data was directly imported into SIMCA-P 14.1 (Umetrics AB, Umea, Sweden). After Pareto scaling, the orthogonal partial leastsquares discriminant analysis (OPLS-DA) was applied to determine the supervised pattern recognition among the defined groups. To determine the unsupervised pattern recognition of leaves and roots, the principal component analysis (PCA) was performed. Only metabolites with variable importance in projection (VIP) value larger than 1.5 obtained from the OPLS-DA model were considered as the potential chemical markers with high contribution.

Ribonucleic Acid Extraction, Quality Assessment, and Quantification
Total RNA was extracted using the TransZol Plus RNA Kit (TransGen, China) according to the manufacturer's protocol. RNA integrity was determined by the 2100 Bioanalyzer (Agilent Technologies Incorporation, Santa Clara, California, USA) and the OD260/280 values were determined with the ND-2000 (NanoDrop Technologies, Wilmington, Delaware, USA). Only high-quality RNA samples were preserved for later Iso-Seq library construction.

PacBio Library Construction and Sequencing
The PacBio libraries were prepared separately using mixed messenger RNA (mRNA) from three leaf or root samples. The Iso-Seq library was prepared using the Clontech SMARTer PCR cDNA Synthesis Kit (Takara Bio, Mountain View, California, USA) and the BluePippin Size Selection System protocol as described by Pacific Biosciences (PN 100-092-800-03). Then, sequencing was performed on a PacBio Sequel platform (Pacific Biosciences).

Illumina Complementary DNA Library Construction and Next-Generation Sequencing
The complementary DNA (cDNA) library for next-generation sequencing (NGS) was prepared by using the Illumina TruSeq TM RNA Sample Preparation Kit (Illumina, San Diego, California, USA) according to the protocol provided by the manufacturer. Sequencing was performed on the Illumina NovaSeq 6000 Sequencer (Illumina, San Diego, California, USA) by Shanghai Majorbio Biopharm Biotechnology Corporation Ltd. (Shanghai, China).

Sequence Assembly and Gene Annotation
The raw paired-end reads were trimmed and quality controlled by fastp (https://github.com/OpenGene/fastp) with default parameters. Then, clean data from the samples of S. apiana was de-novo assembly into contigs using Trinity assembler (Grabherr et al., 2011).
To annotate the gene function, all the assembled transcripts were searched against the following databases: National Center for Biotechnology Information (NCBI) non-redundant (NR) protein sequences database; protein family (Pfam); Clusters of Orthologous Groups of proteins (KOG/COG); a manually annotated and reviewed protein sequence database (Swiss-Prot); KEGG Ortholog database (KO); and the Gene Ontology (GO). BLASTX was used for NR, KOG, Swiss-Prot, and the KEGG database analysis with the E-value set at 1e-5 (Camacho et al., 2009). The Pfam search results were generated using HMMER version 3.1b2 (with E < 1e-5).

Phylogenetic Tree Construction and Divergence Time Estimation
A single-copy orthologous nuclear gene set for Lamiaceae was used to construct the phylogenetic tree (Mint Evolutionary Genomics Consortium, 2018). Phylogenetic relationship among these 5 plant species was resolved using the RAxML package (version 8.1.13) (Stamatakis, 2014). Their divergence times were estimated by the program MCMCtree in PAML (version 3.15) (http://abacus.gene.ucl.ac.uk/software/paml.html).
Multiple sequence alignments of CYP targets were performed using the ClustalW program and phylogenetic trees were constructed by MEGA version 7.0. The maximum likelihood statistical method was used to calculate the phylogenetic tree, with 1,000 bootstrap replications.

Differential Expression Analysis and Functional Enrichment
The expression level of each transcript was analyzed by using the transcripts per million reads (TPM) method in RSEM software (Li and Dewey, 2011). The analysis of differential expressed genes (DEGs) was performed by using DESeq2 (Love et al., 2014) with p-adjust <0.05 and |log2FC| ≧ 1. In order to identify DEGs, the GO and the KEGG enrichment analyses were performed by Goatools (https://github.com/tanghaibao/Goatools) and KOBAS (Xie et al., 2011), respectively, with significantly enriched in the GO terms and metabolic pathways at Bonferroni-corrected P ≤ 0.05.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://ngdc.cncb.ac. cn/, CRA006765 and CRA006779.

AUTHOR CONTRIBUTIONS
WC and QL were the leading investigators of this research program. QL designed the experiments. JH performed most of experiments and analyzed the data. Other authors assisted in experiments and discussed the results. JH and FL wrote the manuscript. All authors contributed to the article and approved the submitted version.