The Conservation of Chloroplast Genome Structure and Improved Resolution of Infrafamilial Relationships of Crassulaceae

Crassulaceae are the largest family in the angiosperm order Saxifragales. Species of this family are characterized by succulent leaves and a unique photosynthetic pathway known as Crassulacean acid metabolism (CAM). Although the inter- and intrageneric relationships have been extensively studied over the last few decades, the infrafamilial relationships of Crassulaceae remain partially obscured. Here, we report nine newly sequenced chloroplast genomes, which comprise several key lineages of Crassulaceae. Our comparative analyses and positive selection analyses of Crassulaceae species indicate that the overall gene organization and function of the chloroplast genome are highly conserved across the family. No positively selected gene was statistically supported in Crassulaceae lineage using likelihood ratio test (LRT) based on branch-site models. Among the three subfamilies of Crassulaceae, our phylogenetic analyses of chloroplast protein-coding genes support Crassuloideae as sister to Kalanchoideae plus Sempervivoideae. Furthermore, within Sempervivoideae, our analyses unambiguously resolved five clades that are successively sister lineages, i.e., Telephium clade, Sempervivum clade, Aeonium clade, Leucosedum clade, and Acre clade. Overall, this study enhances our understanding of the infrafamilial relationships and the conservation of chloroplast genomes within Crassulaceae.


INTRODUCTION
The angiosperm family Crassulaceae, also known as the stonecrop family, belongs to the order Saxifragales and include 34 genera and approximately 1,400 species (APG IV, 2016;Messerschmid et al., 2020), which are predominantly perennial herbs, subshrubs, or shrubs (Thiede and Eggli, 2007). The species of Crassulaceae primarily occur in (semi-)arid and mountainous habitats of the temperate and subtropical areas (van Ham and Hart, 1998), and are distributed worldwide with centers of diversity in Mexico, southern Africa, Macaronesia, and the Himalayas (Mort et al., 2001). Physiologically, Crassulaceae are characterized by the Crassulacean acid metabolism (CAM) photosynthetic pathway (Gontcharova and Gontcharov, 2009), which achieves a higher level of water-use efficiency than either the C 3 or C 4 pathway in water-limited environment (Nobel, 1991). A recent study has identified the crown group of Crassulaceae as one of the 30 core diversification shifts across the angiosperm phylogeny (Magallón et al., 2019). This increased diversification rate may be well associated with the CAM pathway that has been recognized as an evolutionary key innovation (Pilon-Smits et al., 1996;Quezada and Gianoli, 2011;Silvestro et al., 2014).
Chloroplasts are semi-autonomous replication organelles, originating in endosymbiosis between cyanobacteria and nonphotosynthetic host, and play crucial roles in photosynthesis and physiology of plants (Gao et al., 2019a;Huo et al., 2019;Li et al., 2021). The chloroplast genome is mostly a typical quadripartite circular DNA genome comprising a small single-copy (SSC), a large single-copy (LSC), and two inverted repeats (IRs) (Hu et al., 2016;Chang et al., 2020). With the development of next−generation sequencing technology, chloroplast genomes have been proven to be powerful tools for resolving vague relationships of many complicated lineages due to many advantages, such as uniparental inheritance, the relatively conserved structure, low recombination and substitution rate (Goremykin et al., 2003(Goremykin et al., , 2004Twyford and Ness, 2017;Li et al., 2020a). Although the structure and sequence of plastomes are relatively conservative, variations in structure, size, and evolutionary rates of genes have been found in many studies, which in some cases signify phylogenetical information and adaptation to environment (Chumley et al., 2006;Hu et al., 2015;Piot et al., 2018;Gao et al., 2019a;Gruzdev et al., 2019;Shrestha et al., 2019;Zang et al., 2019).
Here, we sequenced fourteen fully annotated chloroplast genomes including nine species of Crassulaceae and five species of Haloragaceae. By taking advantage of the data already available, we analyzed a total of 33 chloroplast genomes and aimed to (i) assess the structural characteristics of the chloroplast genome in a comparative framework, (ii) improve the resolution of the infrafamilial relationships within Crassulaceae, and (iii) investigate the adaptive evolution by selective pressures analysis of protein-coding genes in Crassulaceae.

Taxon Sampling, DNA Extraction, and Sequencing
The aim of our taxon sampling was to try to obtain chloroplast genome sequences for at least one representative of each well-supported major clades. The final taxon sampling contained a total of 26 Crassulaceae species (Table 1), representing all three subfamilies sensu Thiede and Eggli (2007), and all five tribes sensu Thiede and Eggli (2007) or five out of the six major clades sensu Messerschmid et al. (2020) within Sempervivoideae. Of these complete chloroplast genomes, nine were newly generated in this study, i.e., Aeonium arboreum (L.) Webb and Berthel., Cotyledon tomentosa Harv., Crassula perforata Thunb., Graptopetalum amethystinum (Rose) E. Walther, Kalanchoe fedtschenkoi Raym.-Hamet and H. Perrier, Orostachys fimbriata (Turcz.) A. Berger, Species with newly sequenced chloroplast genomes are marked with the asterisks. LSC, large single-copy region; SSC, small single-copy region; IR, inverted repeat region; PCG, protein-coding gene.
The DNA materials of four species (Haloragis aspera, Haloragis erecta, Glischrocaryon aureum and Glischrocaryon glandulosum) were provided by DNA Bank of Royal Botanic Gardens, Kew 1 . Fresh leaves of the other ten species were collected from the field and preserved with silica gel. The total genomic DNA was extracted using a modified CTAB method (Allen et al., 2006). For each species, one paired-end library with an insertion size of ∼350 base pairs (bp) was prepared from the total genomic DNA using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, MA, United States), which was then sequenced using the HiSeq 2500 System (Illumina, Inc., CA, United States) to obtain paired 150bp reads. Briefly, (i) the genomic DNA was sonicated using the S220 Focused-ultrasonicator (Covaris, MA, United States), (ii) the fragmented DNA was end-repaired, dA-tailed, adapter ligated, and subjected to 10-12 cycles of PCR amplification, and (iii) the quality of each library was assessed using the 2100 Bioanalyzer system (Agilent, CA, United States).

Chloroplast Genome Assembly and Annotation
The raw Illumina reads were first filtered to remove pairedend reads if either of the reads contained (i) adapter sequences, (ii) more than 10% of N bases, and (iii) more than 50% of bases with a Phred quality score less than five. The filtered reads were then assembled using NOVOPlasty version 2.7.2 (Dierckxsens et al., 2017), and the complete chloroplast genome of Sedum sarmentosum Bunge (Dong et al., 2013) was used as the reference genome. These assemblies were manually inspected using Geneious version 11.0.3 (Kearse et al., 2012).
The assembled chloroplast genomes were annotated using Plann version 1.1 (Huang and Cronk, 2015), and the positions of exons and introns were inspected and adjusted using Sequin version 15.50. In addition, the circular maps of the chloroplast genomes were drawn using OGDRAW version 1.2 (Lohse et al., 2013), and all annotated chloroplast genomes were deposited in GenBank (Sayers et al., 2020).

Comparative Analysis of Chloroplast Genomes
For the 26 chloroplast genomes of Crassulaceae, the complete nucleotide sequences were compared using the glocal alignment algorithm Shuffle-LAGAN (Brudno et al., 2003) as implemented in the program Mvista 2 (Frazer et al., 2004). Here, Rhodiola rosea L. was chosen as the reference to evaluate gene content variation following Zhao et al. (2020).
To better determine whether any specific pattern of structural variation exists at the family level, the chloroplast genome of Rhodiola rosea was used as the representative owing to the highly conserved gene content and gene order within Crassulaceae (see section "Results"), and compared with that of Penthorum chinense using progressiveMauve (Darling et al., 2010) as implemented in the software package Mauve version 2.3.1 (Darling et al., 2004). Following Firetti et al. (2017), one of the inverted repeat (IR) regions was manually removed prior to the alignment.
To further identify the hypervariable regions, coding and noncoding regions of the 26 chloroplast genomes were first extracted using PhyloSuite version 1.2.1  and aligned separately using MAFFT version 7.427 (Katoh and Standley, 2013) with default parameters. The nucleotide diversity (π) was then estimated separately for coding and non-coding regions using DnaSP version 6.0 (Rozas et al., 2017).
Moreover, since the size variation of the chloroplast genome may be attributed to the expansion or contraction of the IR region, the boundaries between the IR and single-copy regions were identified using IRscope 3 (Amiryousefi et al., 2018) and manually inspected using Geneious.

Phylogenetic Analysis
The nucleotide sequences of the chloroplast protein-coding genes were aligned separately using MAFFT and then concatenated into a supermatrix using PhyloSuite. The optimal partitioning scheme and models of DNA sequence evolution were determined using the relaxed hierarchical clustering algorithm (Lanfear et al., 2014) as implemented in PartitionFinder v2.1.1 (Lanfear et al., 2017).
Phylogenetic relationships were inferred for 26 species of Crassulaceae (i.e., the 33-taxon supermatrix) using both Bayesian inference (BI) and maximum likelihood (ML) methods. For the BI method, four parallel Markov chain Monte Carlo (MCMC) runs were performed using MrBayes version 3.2.7 (Ronquist et al., 2012). The supermatrix was partitioned based on the optimal scheme determined by PartitionFinder, and the bestfitting substitution model was specified as prior for each partition with model parameters unlinked across partitions. A total of 1,000,000 generations were run with sampling every 500 generations, and the first 25% of samples were discarded as burn-in. Convergence of runs was assumed when the average standard deviation of split frequencies dropped below 0.01. The best-scoring ML tree was inferred using RAxML version 8.2.11 (Stamatakis, 2014) with the GTRGAMMAX model for each partition, and branch support was assessed using the rapid bootstrap algorithm (Stamatakis et al., 2008) with 1,000 replicates.
In order to test the potential effect of uneven taxon sampling from each of the genera, we further subsampled the 26 species of Crassulaceae down to a single species as the representative of each genus. Phylogenetic relationships were then estimated from the 20-taxon supermatrix as described above.
FIGURE 1 | Circular gene map of the Crassulaceae chloroplast genome. The genes labeled inside and outside of the circle are transcribed in clockwise and counterclockwise directions, respectively. The inner circle shows the quadripartite structure, with two IR regions (IRa and IRb) separating the large single-copy (LSC) and small single-copy (SSC) regions. The gray ring marks the GC-content with the inner circle indicating a 50% threshold.

Positive Selection Analysis
The likelihood ratio test (LRT) and Bayes empirical Bayes (BEB) based on modified branch-site model Zhang et al., 2005;Yang and Dos, 2011) were used to identify positively selected genes. Since the topological structures of phylogenetic trees constructed by ML and BI methods were congruent, ML tree was used to positive selection analysis. The amino acid sequences of the sixty-seven common protein-coding genes were aligned using MAFFT and converted into nucleotide alignments using PAL2NAL version 14 (Suyama et al., 2006). The nucleotide alignments were trimmed to obtain the final alignments for positive selection analysis by trimAL version 1.4 (Capellagutiérrez et al., 2009). The branch-site model was performed by codeml program in PAML version 4.9 (Yang, 2007). The branch-site test of positive selection was run with the ω of foreground lineage fixed to 1 (fix_omega = 1) for the null hypothesis and estimated (fix_omega = 0) for the alternative hypothesis. The LRT values at df = 1 were calculated by Chi Square test in PAML, and genes with p < 0.05 were treated as candidate positives. Finally, BEB was used to identify those positively selected codon sites.

Characteristics of Chloroplast Genomes in Crassulaceae
After quality control and pre-processing, at least four gigabases (Gb) of whole-genome sequencing data were obtained for each of the nine species (Table 1). These clean reads were assembled into high-quality chloroplast genomes using a reference-guided approach, and the resulting coverage ranged from 1,115 × (i.e., Cotyledon tomentosa) to 12,687 × (i.e., Kalanchoe fedtschenkoi). All these newly assembled chloroplast genomes exhibited a typical quadripartite structure, with two IR regions (i.e., IRa and IRb) separating the LSC and SSC regions (Figure 1).
The structure of the chloroplast genome appeared to be largely conserved across the family ( Table 1). For each of the 26 Crassulaceae species, the size of the chloroplast genome varied from 145,737 bp (i.e., Crassula perforata) to 151,773 bp (i.e., Sinocrassula densirosulata), and the overall GC-content ranged from 37.6% (i.e., Sempervivum tectorum) to 38.2% (i.e., Cotyledon tomentosa). In addition, the total number of annotated genes in each of these chloroplast genomes ranged from 131 (i.e., Sedum emarginatum Migo) to 134 (i.e., Sedum lineare Thunb.), and all these chloroplast genomes possessed 37 tRNA and four rRNA genes.
Using Rhodiola rosea as the reference, the analysis of mVISTA showed high similarity in gene content and gene order among the 26 chloroplast genomes, and further indicated a fairly high sequence similarity, especially in the coding regions (Supplementary Figure 2). This observation was corroborated by our analysis of nucleotide diversity (Figure 2). The nucleotide diversity in the coding regions ranged from 0 to 0.0794, with an average of 0.0230, which was significantly lower than that in the non-coding regions (0-0.1614, 0.0647; p-value < 2.2 × 10 −16 , Welch's t-test). Here, the five coding regions with the highest nucleotide diversity were matK, ycf1, ndhF, rpl22, and rpl32, and the corresponding non-coding regions were trnH-psbA, trnG-trnR, rpl32-trnL, rps16-trnQ, and ccsA-ndhD. Furthermore, no evidence of genomic rearrangement was found in the chloroplast genome of Crassulaceae, when compared with that of Penthorum chinense using progressiveMauve (Supplementary Figure 3).
The boundaries of the LSC, SSC, and IR regions were highly consistent within the family, and no obvious expansion or contraction of the IR region was detected in the 26 chloroplast genomes (Supplementary Figure 4). Here, trnH was shown to be the first gene in the LSC region at the junction between IRa and LSC (i.e., IRa/LSC). At the other end of the LSC region, the junction LSC/IRb was identified as located within the rps19 gene, which gave rise to a truncated copy of the rps19 gene in the IRa region. For both ends of the SSC region, the junctions IRb/SSC and SSC/IRa were found to be located within ndhF and ycf1 gene, respectively. As a consequence, a truncated copy of the ycf1 gene was retained in the IRb region.

Phylogenetic Relationships of Crassulaceae
The 33-taxon supermatrix contained a total of 79 genes and 70,905 sites, and the amount of missing data (including gaps and undetermined characters) was 4.5%. Phylogenetic analyses of the 33-taxon supermatrix using ML and BI methods yielded an identical topology (Figure 3), and all relationships were strongly supported by both methods, i.e., ≥85 ML bootstrap percentage (BP) and ≥0.99 Bayesian posterior probability (PP).
All 26 Crassulaceae species formed a monophyletic group, and were divided into three subclades corresponding to the three subfamilies sensu Thiede and Eggli (2007), i.e., Crassuloideae, Kalanchoideae, and Sempervivoideae (Figure 3). When rooted with Haloragaceae s.l., Crassuloideae (represented by Crassula perforata) were resolved as sister to Kalanchoideae plus Sempervivoideae. Within Kalanchoideae, the two sampled species of Kalanchoe formed a monophyletic group that was sister to Cotyledon tomentosa. Within Sempervivoideae, our sampled species fell into five clades, i.e., Acre clade, Aeonium clade, Leucosedum clade, Sempervivum clade, and Telephium clade (Figure 3). Here, the Telephium clade was established as sister to the rest of Sempervivoideae, and further split into two lineages. One lineage comprised two sampled species of Phedimus and four sampled species of Rhodiola, which formed two reciprocally monophyletic groups, and the other contained the sampled species of Hylotelephium, Orostachys, Sinocrassula, and Umbilicus. Importantly, Umbilicus rupestris (Salisb.) Dandy was recovered as sister to a clade consisting of Hylotelephium, Orostachys, and Sinocrassula. The Sempervivum clade (represented by Sempervivum tectorum) and the Aeonium clade (represented by Aeonium arboreum) were placed as successive sister lineages to the Leucosedum clade [represented by Rosularia alpestris (Kar. and Kir.) Boriss.] plus the Acre clade. Furthermore, the five sampled species of Sedum in the Acre clade were recovered as a paraphyletic group, with Graptopetalum amethystinum and Pachyphytum compactum nested within them. Importantly, except the non-monophyly of Sedum, the same set of intergeneric relationships was recovered from the 20taxon supermatrix (Supplementary Figure 5), suggesting that our results should be robust to taxon sampling.

Positive Selection Analysis
A total number of sixty-seven common genes were subjected to positive selection analyses ( Table 2). The LRTs with p-value > 0.05 suggested that there was no statistical support for positive selection in any genes ( Table 2), although the BEB approach identified fourteen genes (atpB, ndhE, ndhJ, petA, psaC, psaJ, psbB, psbD, psbN, rpoC2, rps15, rps3, rps7, and ycf4) with relatively high posterior probabilities of codon sites.

DISCUSSION
In this study, we report nine newly sequenced complete chloroplast genomes of Crassulaceae. Our comparative analyses indicate that the overall gene organization of the chloroplast genome is highly conserved across all 26 Crassulaceae species  Bold types are genes with positively selected sites identified by BEB with relatively high posterior probabilities.
investigated here. In addition, the results of IRscope analysis reveal no obvious expansion or contraction of the chloroplast IR region. Furthermore, although a recent study has identified a unique 4-kb inversion in the chloroplast genome of the outgroup species Myriophyllum spicatum (Liao et al., 2020), our analyses show a high degree of similarity in chloroplast gene order between Crassulaceae and the outgroup species Penthorum chinense.
Previous studies have demonstrated that the size variation of the angiosperm chloroplast genome is primarily due to variation in the IR region, intergenic region, and gene copy number (Zheng et al., 2017;Amiryousefi et al., 2018;Bedoya et al., 2019). The results of mVISTA analysis suggest that hypervariable intergenic regions in the LSC region, such as petA-psbJ, psbM-trnD, psbZ-trnG, rps16-trnQ, and trnE-trnT ( Supplementary  Figure 2), contribute most to the chloroplast genome size variation within Crassulaceae. Moreover, phylogenetic studies of Crassulaceae have relied primarily on a limited set of chloroplast markers (e.g., matK, rps16, and trnL-trnF). Our comparative analyses, however, have shown that these chloroplast markers appear to be relatively low in nucleotide diversity (Figure 2), which may be partially responsible for the lack of phylogenetic resolution within Sempervivoideae (Mort et al., 2001;Messerschmid et al., 2020). Thus, to achieve better phylogenetic resolution, future studies of Crassulaceae should focus on molecular markers from more variable regions of the chloroplast genome, such as ccsA-ndhD, rps16-trnQ, rpl32-trnL, and trnH-psbA (Prince, 2015).
With increased taxon sampling of key lineages of Crassulaceae, our phylogenetic analyses of chloroplast genome sequences have substantially improved the phylogenetic resolution, and provided robust inference of the infrafamilial relationships. Among the three subfamilies sensu Thiede and Eggli (2007), our phylogenetic analyses have confirmed the monophyly of Kalanchoideae and of Sempervivoideae, and suggested that Crassuloideae are sister to Kalanchoideae plus Sempervivoideae, corroborating previous studies based on DNA sequences and restriction site variation (e.g., van Ham and Hart, 1998;Mort et al., 2001Mort et al., , 2009Bruyns et al., 2019;Folk et al., 2019). In addition, the sister relationship between Kalanchoideae and Sempervivoideae is supported by two putative morphological synapomorphies, i.e., leaves with a single apical or subapical hydathode and seeds with costate testa (Thiede and Eggli, 2007).
Numerous studies have attempted to identify major lineages within Sempervivoideae (e.g., van Ham and Hart, 1998;Mort et al., 2001;Mayuzumi and Ohba, 2004;Nikulin et al., 2016;Folk et al., 2019;Messerschmid et al., 2020), but relationships among these major lineages remain uncertain (Supplementary Figure 1). Our analyses have revealed five clades that are successively sister lineages, i.e., Telephium clade, Sempervivum clade, Aeonium clade, Leucosedum clade, and Acre clade. These clades correspond to five out of the six major clades sensu Messerschmid et al. (2020). Of these six major clades, the lone exception here is the Petrosedum clade sensu Messerschmid et al. (2020), for which no chloroplast genome sequence is currently available. The Telephium clade identified here is equivalent to tribes Telephieae plus Umbiliceae sensu Thiede and Eggli (2007). Although it was first proposed by van Ham and Hart (1998), the monophyly of the Telephium clade has only recently been confirmed (i.e., ≥85 BP) by phylogenetic analyses of chloroplast genome sequences (Kim and Kim, 2020) and 301 low-copy nuclear genes (Folk et al., 2019). Our results add further evidence to support the recognition of the Telephium clade. In addition, Umbilicus rupestris is strongly supported by our phylogenetic analyses as sister to the tribe Telephieae (i.e., our sampled species of Hylotelephium, Orostachys, and Sinocrassula), thus corroborating recent findings (Folk et al., 2019;Kim and Kim, 2020;Zhao et al., 2020) and highlighting the paraphyly of the tribe Umbiliceae (i.e., our sampled species of Phedimus, Rhodiola, and Umbilicus). Furthermore, the Sempervivum and Aeonium clades identified here correspond to tribes Semperviveae and Aeonieae sensu Thiede and Eggli (2007), respectively, and the Leucosedum and Acre clades identified here together constitute tribe Sedeae sensu Thiede and Eggli (2007).
It has been indicated that the genes with positive selection played key parts in the adaptation to diverse environments (Moseley et al., 2018;Gao et al., 2019b;Li et al., 2020b). However, no positive selection was statistically supported among sixtyseven chloroplast protein-coding genes in sampled Crassulaceae. This result indicated these protein-coding genes are under strong structural and functional constraints. The absence of positive selection in most protein-coding genes of chloroplast genome was also found in Euonymus (Li et al., 2021) and Quercus (Yin et al., 2018).
In conclusion, by expanding the number of informative molecular characters, we have further improved the resolution of the phylogenetic relationships among major lineages within Crassulaceae, which will facilitate the identification of nonmolecular synapomorphies. However, additional sampling of key lineages (e.g., Petrosedum clade) is required to fully resolve the infrafamilial relationships. Furthermore, the boundaries of some of the traditional genera in Crassulaceae remain poorly defined. For example, Sedum, the largest genus with approximately 470 species, has been shown to be highly polyphyletic, and its intrageneric relationships remain largely unresolved (Messerschmid et al., 2020). Thus, chloroplast phylogenomics will continue to enhance our understanding of the evolutionary history of Crassulaceae in the future.

DATA AVAILABILITY STATEMENT
All annotated chloroplast genomes have been deposited in GenBank (https://www.ncbi.nlm.nih.gov/genbank/), and accession numbers are provided in Table 1.