The Complete Chloroplast Genome of Chinese Bayberry (Morella rubra, Myricaceae): Implications for Understanding the Evolution of Fagales

Liu, Lu-Xian; Li, Rui; Worth, James R. P.; Li, Xian; Li, Pan; Cameron, Kenneth M.; Fu, Cheng-Xin

doi:10.3389/fpls.2017.00968

ORIGINAL RESEARCH article

Front. Plant Sci., 30 June 2017

Sec. Plant Genetics and Genomics

Volume 8 - 2017 | https://doi.org/10.3389/fpls.2017.00968

The Complete Chloroplast Genome of Chinese Bayberry (Morella rubra, Myricaceae): Implications for Understanding the Evolution of Fagales

Lu-Xian Liu^1,2

Rui Li³

James R. P. Worth⁴

Xian Li⁵

Pan Li^2*

Kenneth M. Cameron⁶

Cheng-Xin Fu²

¹Laboratory of Plant Germplasm and Genetic Engineering, College of Life Sciences, Henan University, Kaifeng, China
²Key Laboratory of Conservation Biology for Endangered Wildlife of the Ministry of Education, College of Life Sciences, Zhejiang University, Hangzhou, China
³Food Inspection and Testing Institute of Henan Province, Zhengzhou, China
⁴Department of Forest Molecular Genetics and Biotechnology, Forestry and Forest Products Research Institute, Ibaraki, Japan
⁵Zhejiang Provincial Key Laboratory of Horticultural Plant Integrative Biology, Zhejiang University, Hangzhou, China
⁶Department of Botany, University of Wisconsin, Madison, WI, United States

Morella rubra (Myricaceae), also known as Chinese bayberry, is an economically important, subtropical, evergreen fruit tree. The phylogenetic placement of Myricaceae within Fagales and the origin of Chinese bayberry’s domestication are still unresolved. In this study, we report the chloroplast (cp) genome of M. rubra and take advantage of several previously reported chloroplast genomes from related taxa to examine patterns of evolution in Fagales. The cp genomes of three M. rubra individuals were 159,478, 159,568, and 159.586 bp in length, respectively, comprising a pair of inverted repeat (IR) regions (26,014–26,069 bp) separated by a large single-copy (LSC) region (88,683–88,809 bp) and a small single-copy (SSC) region (18,676–18,767 bp). Each cp genome encodes the same 111 unique genes, consisting of 77 different protein-coding genes, 30 transfer RNA genes and four ribosomal RNA genes, with 18 duplicated in the IRs. Comparative analysis of chloroplast genomes from four representative Fagales families revealed the loss of infA and the pseudogenization of ycf15 in all analyzed species, and rpl22 has been pseudogenized in M. rubra and Castanea mollissima, but not in Juglans regia or Ostrya rehderiana. The genome size variations are detected mainly due to the length of intergenic spacers rather than gene loss, gene pseudogenization, IR expansion or contraction. The phylogenetic relationships yielded by the complete genome sequences strongly support the placement of Myricaceae as sister to Juglandaceae. Furthermore, seven cpDNA markers (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) with relatively high levels of variation and variable cpSSR loci were identified within M. rubra, which will be useful in future research characterizing the population genetics of M. rubra and investigating the origin of domesticated Chinese bayberry.

Introduction

Chloroplasts (cp) are essential organelles in plant cells for photosynthesis and perform other functions comprising synthesizing starch, fatty acids, pigments and amino acids (Neuhaus and Emes, 2000). Typically, the sizes of chloroplast genomes and their gene arrangement in angiosperms are highly conserved and usually have a circular structure ranging from 120 to 160 kb, with two copies of inverted repeats (IR) region separated by a large single-copy (LSC) region and a small single-copy (SSC) region (Palmer, 1991; Raubeson and Jansen, 2005). Chloroplast genomes generally contain 110–130 distinct genes and these genes exhibit a highly conserved gene order with a majority of which (∼79) encoding proteins that are mostly involved in photosynthesis, whereas the rest of the genes encode approximately 30 transfer RNA (tRNA)s and four ribosomal RNA (rRNA)s (Jansen et al., 2005).

Compared with nuclear and mitochondrial genomes, chloroplast genomes are largely conserved in term of gene content, organization and structure (Raubeson and Jansen, 2005), and the nucleotide substitution rate of chloroplast genes is higher than that of mitochondrial genes, but lower than that of nuclear genes (Wolfe et al., 1987; Drouin et al., 2008). However, evolutionary events such as mutations, duplications, losses and rearrangements of genes have been reported in a number of studies (Lee et al., 2007; Dong et al., 2013; Choi et al., 2016). Due to its relatively small size, simple structure and conserved gene content, the chloroplast genome has been used as ideal research model for evolutionary and comparative genomic studies (Dong et al., 2013). In recent years, comparative studies of chloroplast genomes have been applied to a number of focal species (Young et al., 2011), genera (Greiner et al., 2008a,b), or plant families (Daniell et al., 2006). At higher taxonomic levels, comparative analyses of chloroplast genomes are useful for phylogenetic studies (Moore et al., 2007; Moore et al., 2010), as well as for understanding the genome evolution relating genome size variations, gene and intron losses and nucleotide substitutions. Moreover, chloroplasts have their own independent genome encoding an array of specific proteins, and the nature of non-recombinant and uniparental inheritance makes it a primarily useful tool in genomics and evolutionary research (Cho et al., 2015). Single nucleotide polymorphsims (SNPs) and indels, resulting from translocations, inversions, copy number variation of tandem repeats and rearrangements, are suitable for applying to phylogeny reconstruction (De Las Rivas et al., 2002), DNA barcoding (Hollingsworth et al., 2011), as well as investigating the geographic origin of some important domesticated crops (Arroyo-Garcia et al., 2006; Londo et al., 2006; Delplancke et al., 2013).

In this study, we analyzed the chloroplast genome of Morella rubra Lour. (Myricaceae), also known as Chinese bayberry, which is one of the most popular and valuable fruits in eastern China because of its appealing color, texture, delicious taste and nutritional value (Cheng et al., 2015). From the whole family Myricaceae, M. rubra is the only species to be domesticated as a fruit crop (Lu and Bornstein, 1999). Due to its long cultivation history (>2000 years) in China, as many as 305 accessions have been recorded, of which 268 have been named as cultivars (Zhang and Miao, 1999; Zhang et al., 2009). Wild populations of M. rubra, which are important germplasm resources for Chinese bayberry breeding, are distributed in the subtropical evergreen forests in China, Japan, South Korea and Philippines. Despite the economic importance of Chinese bayberry, its population genetics and domestication origin are still unclear. In fact, even the phylogenetic placement of Morella within Myricaceae, and the family within the order Fagales, remains ambiguous. This is one of the most economically and ecologically important flowering plant orders since it contains a number of domesticated nut and timber species, as well as dominant forest tree species (e.g., chestnut, walnut, hickory, oak, southern beech, birch).

Before 1990, Fagales was generally considered to contain only two families: Betulaceae and Fagaceae (Takhtajan, 1980; Cronquist, 1988). However, several large-scale phylogenetic analyses using DNA sequences (Chase et al., 1993; Soltis et al., 2000; Chen et al., 2016) and cpDNA restriction sites (Manos et al., 1993) have provided evidence for the monophyly of an expanded Fagales, which now comprises seven families: Nothofagaceae, Fagaceae, Myricaceae, Juglandaceae (including Rhoipteleaceae), Casuarinaceae, Ticodendraceae, and Betulaceae (APG III, 2009; APG IV, 2016). Most of the relationships within Fagales are well resolved, but the position of Myricaceae still remains uncertain. For example, some studies placed Myricaceae as sister to (Casuarinaceae + (Ticodendraceae + Betulaceae)) (Manos and Steele, 1997, matK/matK + rbcL; Cook and Crisp, 2005; Sauquet et al., 2012; Xiang et al., 2014; Sun et al., 2016), whereas others supported a sister relationship between Myricaceae and Juglandaceae (Li et al., 2004; Soltis et al., 2007; Larson-Johnson, 2016). Still others found that Myricaceae is sister to all Fagales except Nothofagaceae and Fagaceae (Manos and Steele, 1997, rbcL; Li et al., 2002). Thus, previous studies appear to have been based on insufficient information and thus could not fully resolve the phylogenetic position of Myricaceae.

Here, three individuals of M. rubra (Myricaceae) were selected for complete chloroplast genome sequencing. By comparing these three chloroplast genomes to each other and to previously published chloroplast genomes from other taxa in Fagales, we aim to: (1) characterize and compare the cp genomes among select representatives of Fagales in order to gain insights into evolutionary patterns within the order; (2) resolve the phylogenetic position of Myricaceae; (3) screen and identify appropriate markers of the M. rubra genome for future studies on population genetics and domestication origin.

Materials and Methods

DNA Sequencing and Genome Assembly

Total genomic DNA was isolated from silica-dried leaves of three wild M. rubra plants collected in Guangdong (GZMZ), Fujian (FJZS), and Yunnan (YNML) using a modified CTAB method (Li et al., 2013). The high molecular weight DNA was sheared using a Covaris S220-DNA Sonicator (Covaris, INC., Woburn, MA, United States), yielding fragments of ≤800 bp in length. The quality of fragmentation was checked on an Agilent Bioanalyzer 2100 (Agilent Technologies). Short-insert (500 bp) paired-end libraries were generated by using Genomic DNA Sample Prep Kit (Illumina) according to the manufacturer’s protocol and then sequenced using an Illumina HiSeq 2500 (Beijing Genomics Institute, Shenzhen, China). Resulting sequence fragments were screened by quality in order to remove low-quality sequences (Phred score <30, 0.001 probability error), and all remaining high quality sequences were assembled into contigs using the CLC de novo assembler beta 4.06 (CLC Inc., Rarhus, Denmark) with parameters as follows: minimum contig length of 200, deletion and insertion costs of 3, mismatch cost of 2, bubble size of 98, length fraction, and similarity fraction of 0.9. We obtained the principal contigs representing the chloroplast genome from the total assembled contigs using a BLAST (NCBI BLAST v2.2.31) search with the cp genome sequence of J. regia (GenBank accession number: KT870116) as a reference sequence (Peng et al., 2015). The representative chloroplast sequence contigs were ordered and oriented according to the reference chloroplast genome, and the complete chloroplast sequence of M. rubra was constructed by connecting overlapping terminal sequences.

Genome Annotation and Molecular Marker Identification

The cp genomes of M. rubra were annotated through the online program Dual Organellar Genome Annotator (DOGMA; Wyman et al., 2004). Initial annotation, putative starts, stops, and intron positions were determined according to comparisons with homologous genes of J. regia and Castanea mollissima (GenBank accession number: HQ336406) cp genomes using Geneious v9.0.5 software (Biomatters, Auckland, New Zealand). In addition, all of the identified tRNA genes were further verified by using the corresponding structures predicted by tRNAscan-SE version 1.21 (Schattner et al., 2005) with default settings. The cp genome map of M. rubra was constructed utilizing the OGDRAW program (Lohse et al., 2013).

The three completed chloroplast genome sequences of M. rubra were aligned using MAFFT (Katoh et al., 2002). In order to screen various polymorphic regions among individuals of M. rubra (i.e., below the species level), the average number of nucleotide differences (K) and total number of mutations (Eta) were determined to analyze nucleotide diversity (Pi) using DnaSP v5.0 (Librado and Rozas, 2009).

Repeat Structure and Sequence Analysis

We used the online REPuter software to visualize and locate forward, palindrome, reverse and complement sequences with a minimum repeat size of 30 bp and a sequence identity greater than 90% (Kurtz and Schleiermacher, 1999).

Microsatellite (mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats) detection was performed using Msatcommander v0.8.2 (Faircloth, 2008). We applied a threshold nine, five, five, three, three, and three repeat units for mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs, respectively.

Comparative Chloroplast Genomic Analysis

We downloaded Castanea mollissima, Juglans regia, and Ostrya rehderiana (GenBank accession number: KT454094) chloroplast genome sequences from GenBank, in order to compare the overall similarities among different chloroplast genomes in Fagales. Pairwise alignments among four Fagales cp genomes were implemented in the mVISTA program with LAGAN mode (Frazer et al., 2004) using the annotation of Cucumis sativus (Cucurbitaceae, Cucurbitales; GenBank accession number: DQ865976) as the reference.

Synonymous (K_S) and Non-synonymous (K_A) Substitution Rates Analysis

The DnaSP v5.0 (Librado and Rozas, 2009) software was employed to analyze the relative rates of sequence divergence in the four Fagales species and the reference sequence. In order to analyze synonymous (K_S) and non-synonymous (K_A) substitution rates, we extracted the same individual functional protein-coding exons and aligned separately using Geneious v9.0.5. Genes with the same functions were grouped and analyses were carried out on (1) datasets corresponding to those with the same functions, i.e., for atp, pet, ndh, psa, psb, rpl, rpo, and rps; (2) datasets corresponding to singular genes, i.e., for cemA, matK, ccsA, clpP, rbcL, and ycf1; and (3) concatenated common protein-coding genes, except for pseudogenes or lost genes from any species.

Phylogeny Inference

The complete chloroplast genome sequences of eight species from Fagales (10 accessions) were used for phylogenetic analysis, including representatives of five genera of Fagaceae, one genus of Betulaceae, one genus of Juglandaceae, and the three newly sequenced individuals of M. rubra used to represent Myricaceae (Supplementary Table S1). Two species from Cucurbitales (Corynocarpus laevigata and Cucumis sativus) were chosen as outgroup taxa to orient the Fagales tree. In order to investigate the utility of different regions, the phylogeny was inferred using two datasets: (1) the complete chloroplast genome sequences; and (2) a set of 69 protein-coding genes shared by the chloroplast genomes of the 12 accessions. All the gaps were excluded after alignment in both analyses.

All phylogenetic analyses were conducted using maximum-likelihood (ML) and Bayesian inference (BI) methods. ML analyses were implemented in RAxML-HPC v8.1.11 on the CIPRES cluster¹ (Miller et al., 2010) using the best-fit nucleotide substitution model (GTR+I+G) determined from jModelTest v2.1.4 (Posada, 2008) for the cp genome dataset and a partitioned model for protein-coding regions. BI analyses were performed in MrBayes v3.2.3 (Ronquist and Huelsenbeck, 2003) using the same model selection criteria for both data sets. Two independent parallel runs of four Metropolis-coupled Monte Carlo Markov Chains (MCMCs) were run with trees sampling every 1000 generations for five million total generations.

Results and Discussion

Genome Content and Organization in M. rubra

We generated a total of 8.5 million paired-end (PE) reads (200 million nucleotides) for M. rubra-GZMZ, and then trimmed and assembled them using the CLC genome assembler pipeline (CLC Bio, Aarhus, Denmark). A total of 290,501 PE reads were concordantly mapped to the final assembly and the mapped cp contigs were selected to merge for constructing a complete M. rubra-GZMZ cp genome map using BLAST (NCBI BLAST v2.2.31). Four initial contigs (contigs 16, 39, 79, and 883 respectively) were selected to generate the M. rubra-GZMZ cp genome sequence with no gaps and no Ns. The cp genome sequence was registered into GenBank with the accession number KY476637.

The complete chloroplast genome of M. rubra-GZMZ is 159,478 bp in length and shares the common feature of comprising two copies of IR (26,014 bp each) that divide the genome into two single-copy regions (LSC 88,683 bp; SSC 18,767 bp; Figure 1). The overall GC content of the total length, LSC, SSC, and IR regions is 36.1, 33.8, 29.2, and 42.6%, respectively. Coding regions (91,795 bp), comprising protein-coding genes (79,949 bp), tRNA genes (2,798 bp) and rRNA genes (9,048 bp) account for 57.56% of the genome, whereas non-coding regions (67,683 bp), including intergenic spaces (49,558 bp) and introns (18,125 bp) account for the remaining 42.44% of the genome.

FIGURE 1

FIGURE 1. Chloroplast genome map of Morella rubra (Myricaceae). The genes inside and outside of the circle are transcribed in the counterclockwise and clockwise directions, respectively. Genes belonging to different functional groups are shown in different colors. The thick lines indicate the extent of the inverted repeats (IR_A and IR_B) that separate the genomes into small single-copy (SSC) and large single-copy (LSC) regions.

Within the chloroplast genome of M. rubra there are in total 111 genes, including 77 protein-coding genes, 30 tRNA genes, four rRNA genes and 18 duplicated genes (Figure 1 and Tables 1, 2). Among the 111 unique genes, 15 contain one intron (six tRNA genes and nine protein-coding genes) and three (rps12, clpP, and ycf3) contain two introns. The 5′-end exon of the rps12 gene is located in the LSC region, and the intron and 3′-end exon of the gene are situated in the IR region. In addition to the GZMZ accession, we also sequenced the complete cp genomes of M. rubra-FJZS (GenBank accession number: KY476636) and M. rubra-YNML (GenBank accession number: KY476635). These are 159,568 and 159,586 bp in size, respectively, and the genome content and organization of them is nearly the same as the cp genome of M. rubra-GZMZ (Figure 1 and Table 1).

TABLE 1

TABLE 1. Comparative analysis of the chloroplast genomes among four families of Fagales, including three different accessions of Morella rubra (Myricaceae) sequenced for this study.

TABLE 2

TABLE 2. List of genes present in the M. rubra chloroplast genome.

Genome Organization of Fagales

The chloroplast genome organization is rather conserved within Fagales (Figure 2). We did not detect either translocations or inversions among any of the compared genomes. The IR region in these species is more conserved than the LSC and SSC regions, consistent with other angiosperms (Dong et al., 2013;Lu R. et al., 2016). Variations were detected with the following factors: genome size, gene losses, the pseudogenization of protein-coding genes, and IR expansion and contraction.

FIGURE 2

FIGURE 2. Identity plot comparing the chloroplast genomes of four Fagales families using Cucumis sativus as the reference sequence. The vertical scale indicates the percentage of identity, ranging from 50 to 100%. The horizontal axis indicates the coordinates within the chloroplast genome. Genome regions are color codes as protein-coding, rRNA, tRNA, intron, and conserved non-coding sequences (CNS).

Genome Size

Among the representative Fagales species, O.rehderiana exhibits the smallest genome size after comparing with the other three chloroplast genomes. The genome of Castanea mollissima (160,799 bp) is approximately 1.45 kb larger than that of O. rehderiana, 1.32 kb larger than that of M. rubra, and 0.26 kb larger than that of J. regia, as well as it is 5.28 kb larger than that of Cucumis sativus, an outgroup species. The detected sequence length difference is predominantly attributable to the variation in the length of the non-coding regions, especially in terms of intergenic spacer size (Table 1). The M. rubra-GZMZ genome exhibits the smallest non-coding region among the six analyzed chloroplast genomes.

Gene Loss

A single gene, infA, has been lost from all the four analyzed chloroplast genomes. After comparisons with the chloroplast genomes of other Fagales species, this gene also appears to be missing in Castanea pumila (GenBank accession number: KM360048) and Trigonobalanus doichangensis (GenBank accession number: NC023959), although it is present in Quercus edithiae (GenBank accession number: KU382355), Q. rubra (GenBank accession number: NC020152), Castanopsis echinocarpa (GenBank accession number: NC023801), Lithocarpus balansae (GenBank accession number: NC026577), Q. aliena (Lu S. et al., 2016), Q. spinosa (GenBank accession number: NC026907), Q. aquifolioides (GenBank accession number: NC026913), and Q. baronii (GenBank accession number: NC029490). InfA gene was thought to have functions as a translation initiation factor, which assists in the assembly of the translation initiation complex (Wicke et al., 2011). This gene is also possibly transferred to the nucleus and loss of which appears to have independently occurred multiple times during the evolution of land plants (Millen et al., 2001). Dong et al. (2013) reported the two genes including infA and rpl32 had been lost from the chloroplast genome of Paeonia obovata. Therefore, the loss of infA does not represent a unique phenomenon in some species of Fagales.

Gene Pseudogenization

ycf15 has been pseudogenized in all four representatives of Fagales, and rpl22 has been pseudogenized in M. rubra and Castanea mollissima but not in J. regia and O. rehderiana. The ycf15 gene, which has been paid great attention to its function by previous workers (Raubeson et al., 2007; Shi et al., 2013), is located immediately downstream of the ycf2 gene (Dong et al., 2013). Some studies have shown that the ycf15 gene is potentially functional (Shinozaki et al., 1986), but the validity of ycf15 as a protein-coding gene in angiosperms has long been questioned (Tangphatsornruang et al., 2011). The ycf15 presents a pseudogene in all the sequenced chloroplast genome of Fagales except Q. rubra. In Fagales, rpl22 appears as a pseudogene in Myricaceae and Fagaceae because there remain some internal stop codons within the coding region, and not to be pseudogenized in Juglandaceae and Betulaceae. Jansen et al. (2011) reported that rpl22 has been transferred to the nucleus in Fagaceae, whether the rpl22 gene has been transferred to the nucleus in Myricaceae remains to be investigated.

IR Expansion and Contraction

The expansions and contractions of the IR regions and the single-copy (SC) boundary regions often results in genome size variations among various plant lineages (Wang et al., 2008), and may reflect phylogenetic history. For this reason, we paid careful attention to the exact IR/SC border positions and their adjacent genes among the four Fagales species chloroplast genomes that we studied in detail (Figure 3). The ycf1 gene spanned the SSC/IR_A region and the pseudogene fragment of ψycf1 varies from 1058 to 1158 bp. The ndhF gene is separated from ψycf1 by spacers except in Castanea mollissima which does not contain a spacer (53 bp in M. rubra, 104 bp in J. regia and 165 bp in O. rehderiana) but shares some nucleotides (6 bp) with the ycf1 pseudogene in our outgroup taxon, Cucumis sativus. The trnH-GUG gene is generally located downstream of the IR_A/LSC border, and this gene is separated from the IR_B/LSC border by a spacers varies from 8 to 47 bp. However, the rps19 gene does not extend to the IR region among the sampled representatives of Fagales. Thus, the rps19 pseudogene is not observed in Fagales. Although there are expansions and/or contractions of the IR regions detected among the sampled representatives of Fagales, they contribute little to the overall size variations in the chloroplast genomes of these plants.

FIGURE 3

FIGURE 3. Comparison of junction positions between the single copy and IR regions among four Fagales genomes and Cucumis sativus.

Repeat Sequence Analysis and Molecular Marker Identification

Repeat motifs are thought to play an important role in phylogenetic studies and are very useful in the analysis of genome rearrangement (Cavalier-Smith, 2002; Nie et al., 2012). In the chloroplast genome of M. rubra-GZMZ, 39 pairs of repeats (30 bp or longer) containing 22 palindromic repeats, 15 forward repeats, one complement repeat and one reverse repeat were detected using the program REPuter (Kurtz and Schleiermacher, 1999) (Figure 4A). Among these repeats, 33 are 30–40 bp long, four repeats are 41 bp long, one repeat is 44 bp long and one repeat is 57 bp long (Figure 4B). Most of these repeats (53.8%) are distributed in non-coding regions (Table 3), whereas some are found in genes such as ycf1, ycf2, ycf3, psaB, and pasA. Further information about the repeat motifs of M. rubra-FJZS and M. rubra-YNML can be found in Supplementary Tables S2, S3.

FIGURE 4

FIGURE 4. Analysis of repeated sequences in the three M. rubra chloroplast genomes. (A) Frequency of repeat types. (B) Frequency of repeats by length.

TABLE 3

TABLE 3. Repeated sequences in the M. rubra-GZMZ chloroplast genome.

Simple sequence repeats (SSR), also known as microsatellites, are widely distributed over the genome (Chen et al., 2015) and have a high degree of polymorphism (Weber, 1990). As a result, SSRs are widely used as a molecular marker for breeding (Rafalski and Tingey, 1993), population genetics (Perdereau et al., 2014), genetic linkage map construction, and gene mapping (Pugh et al., 2004). In the current study, the distribution, type and presence of microsatellites were studied among the cp genomes of three M. rubra accessions. We did this, in part, because we are interested in developing markers that may be useful in future studies that will address intraspecific variation among natural populations and cultivars of M. rubra across East Asia. A total of 155 perfect microsatellites were identified in the M. rubra-GZMZ cp genome. Among them, 118 were located in the LSC regions, whereas 16 and 21 were found in the IR and SSC regions, respectively (Figure 5A). In addition, 22 SSRs were found in the protein-coding regions, 16 were in the introns and 117 were in intergenic spacers of the M. rubra-GZMZ cp genome (Figure 5B). The distribution and type of microsatellites of M. rubra-FJZS and M. rubra-YNML is shown in Supplementary Figure S1. Among these SSRs, 131 are mononucleotides, 18 are dinucleotides, five are tetranucleotides, and one is a pentanucleotide (Figure 5C). Trinucleotide SSRs are not found in M. rubra-GZMZ or M. rubra-YNML but were detected in M. rubra-FJZS. A majority of the mononucleotides (98.47%) are composed of A/T and most of the dinucleotides (88.89%) are composed of AT/TA (Figure 5C). These results are consistent with the contention that cp SSRs are generally composed of short polyA or polyT repeats (Kuang et al., 2011; Chen et al., 2015). The higher A/T content in cp SSRs also contributes to a bias in base composition, resulting in A/T enrichment (63.9%) in the M. rubra-GZMZ cp genome.

FIGURE 5

FIGURE 5. The distribution, type, and presence of simple sequence repeats (SSRs) in the cp genome of M. rubra. (A) Presence of SSRs in the LSC, SSC, and IR regions (M. rubra-GZMZ). (B) Presence of SSRs in the protein-coding regions, intergenic spacers and introns of LSC, SSC, and IR regions (M. rubra-GZMZ). (C) Presence of polymers in the cp genome of M. rubra.

The coding genes, non-coding regions and intron regions were compared among the three individuals of M. rubra divergence hotspots. We generated 90 loci (28 coding genes, 52 intergenic spacers, and 10 intron regions) with more than 200 bp in length from three M. rubra individuals and the nucleotide variability (Pi) values calculated with the DnaSP v5.0 software.

Among the values received from the three individuals of M. rubra (M. rubra-GZMZ, M. rubra-FJZS, and M. rubra-YNML) ranged from 0.00029 (ycf2 gene) to 0.01867 (psbA-trnK region) (Figure 6). The IR region is much more conserved than the LSC and SSC regions, and the lower sequence divergence observed in the IRs compared to the SSC or LSC regions for Morella species and other angiosperms is likely due to copy correction between IR sequences by gene conversion (Khakhlova and Bock, 2006; Lu R. et al., 2016). Seven of these variable loci, including trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron, showed high levels of variation. Five of them (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, and petD-rpoA) are located in the LSC, whereas two (ndhE-ndhG and ndhA intron) are in the SSC region (Figure 6).

FIGURE 6

FIGURE 6. Comparative analysis of the nucleotide variability (Pi) values among three M. rubra individuals.

All seven of these variable loci (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) show great potential as highly informative phylogenetic markers in M. rubra. The results presented here will be helpful to the study on the domestication origin of Chinese bayberry in the future.

Synonymous (K_S) and Non-synonymous (K_A) Substitution Rate Analysis

Nonsynonymous (K_A) and synonymous (K_S) substitutions and their ratio (K_A/K_S) are important to indicate the rates of evolution and natural selection (Yang and Nielsen, 2000). Synonymous nucleotide substitutions have occurred more frequently than nonsynonymous substitutions, and the K_A/K_S value is usually less than one in most protein-coding regions (Makalowski and Boguski, 1998). In this study, these parameters were compared among the protein-coding chloroplast genes of the four-representative species of Fagales to investigate genome evolution, with the cp genome of Cucumis sativus as a reference (Table 4). The K_A values of the four-representative species ranged from 0.0879 to 0.0962, as well as the K_S values ranged from 0.01489 to 0.01605. Both the K_A and K_S values consistently indicated that Castanea mollissima has evolved a little rapidly than the other three species in Fagales. The K_A/K_S values of these Fagales species are less than 1, providing the evidence of purifying selection on the chloroplast protein-coding genes of Fagales species.

TABLE 4

TABLE 4. Substitution rates of 75 protein-coding genes in four Fagales chloroplast genomes.

Variations in evolutionary rates can be related to the function of genes and genome structure (Chang et al., 2006; Jansen et al., 2007; Dong et al., 2013). In Fagales species, the four-sampled genome structure are quite conserved, without any remarkable restructuring being detected. Comparing with the outgroup Cucumis sativus, the K_A (F = 293.17, P < 0.001) and K_S (F = 245.86, P < 0.001) values shown differ significantly among gene groups classified according to gene functions (Figure 7). The psb, pet, and rbcL genes show the lowest K_A values, while the ycf1 gene exhibits the highest K_A values. Moreover, the psa gene shows the lowest K_S values, whereas ccsA gene exhibits the highest K_S values. According to the K_A/K_S values, we found that the psa, rpo, atp, clpP, and ycf1 genes are under positive selection in Fagales.

FIGURE 7

FIGURE 7. Non-synonymous substitution (K_A), synonymous substitution (K_S), and K_A/K_S values for individual Fagales genes and groups of genes. The rpl22 is not included in the rpl group due to its pseudogenization in some species.

Phylogeny Inference

Relationships within Fagales are fairly well resolved in previously published studies, but the position of Myricaceae still remains somewhat uncertain (Manos and Steele, 1997; Cook and Crisp, 2005; Li et al., 2016). Most of these earlier studies have used sequences from only one or more chloroplast loci. In the present study, we explored two datasets: the complete chloroplast genome and a restricted matrix of 69 commonly shared protein-coding genes to perform phylogenetic analysis. For the analysis with the complete chloroplast genome data, the tree topologies from both the ML and the Bayesian analysis were found to be consistent with each other (Figure 8). All the analyzed families within Fagales have MLBS = 100%. Fagaceae are sister to the remaining Fagales (MLBS = 100%), followed by Betulaceae, which are subsequently sister to the remainder of the Fagales, with full support (MLBS = 100%). The remaining two families, Juglandaceae and Myricaceae, form one clade with BS = 100%, as well as the three Myricaceae individuals forming one clade with MLBS = 100%. The relationships among them are identical with the system of classification proposed by APG III (APG III, 2009).

FIGURE 8

FIGURE 8. Phylogenetic tree reconstruction of Fagales using maximum likelihood (ML) based on whole chloroplast genome sequences. Relative branch lengths are indicated. Numbers above the lines represent ML bootstrap values / BI posterior probability. The hyphen indicates that a ML bootstrap <50%. A phylogenetic tree resulting from analysis of 69 protein-coding genes was fully congruent with this topology.

Most phylogenomic studies have not used entire plastome sequences, but rather have used a subset of common protein coding genes (Jansen et al., 2007; Moore et al., 2010; Xi et al., 2012). In this study, the tree topologies inferred from ML and BI using a restricted cp gene matrix were consistent with the trees inferred from the whole cp genome data (Supplementary Figure S2), but the support values for some nodes in the phylogenetic trees were lower. In this study, we proved that complete chloroplast DNA sequences were more effective than common protein coding genes for the phylogenetic reconstruction of Fagales, as evaluated by higher bootstrap values and posterior probabilities. Therefore, we suggest that complete chloroplast genomes should be used more regularly for inferring the backbone relationships among other ordinal clades of angiosperms, as well as for resolving the phylogenetic position of various questionable lineages.

Conclusion

The complete chloroplast genome sequence of M. rubra, was determined using Illumina next-generation DNA sequencing technology. This is the first chloroplast genome sequenced in the Myricaceae family. The chloroplast genome of M. rubra shows a very similar size and organization comparing with the other sequenced angiosperms. The chloroplast genomes of Fagales species have experienced evolution at the gene level, rather than the genome level, because no significant structural changes are detected among their genomes. In addition, the examined genomes differ in size, and the detected genome size variations are mainly due to the length of intergenic spacers, instead of gene losses, gene pseudogenization, IR expansion or contraction. Inferred phylogenetic relationships based on the compete genome sequences from representatives of Fagales strongly support the placement of Myricaceae as sister to Juglandaceae. Furthermore, seven variable regions (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) and variable cpSSR loci identified among multiple individuals of M. rubra will be useful in future studies characterizing the population genetics of this species and investigating the domestication origin of Chinese bayberry.

Author Contributions

LL, PL, CF, and XL conceived the ideas; LL and JW contributed to the sampling; LL performed the experiment; LL and RL analyzed the data. The manuscript was written by LL, PL, and KC.

Funding

This work was supported by the Special Fund for Agro-scientific Research in the Public Interest (Grant No. 201203089), and by the NSFC-NSF Dimensions of Biodiversity program (Grant No. 31461123001), the International Cooperation and Exchange of the National Natural Science Foundation of China (Grant Nos. 31511140095, 31561143015), the National Natural Science Foundation of China (Grant No. 31500184).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgment

We would like to thank Yong-Hua Zhang and Nan Chen for their great help in collecting plant materials.

Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2017.00968/full#supplementary-material

FIGURE S1 | The distribution and presence of simple sequence repeats (SSRs) in the cp genome of M.rubra-FJZS (A) and M. rubra-YNML (B).

FIGURE S2 | Phylogenetic tree reconstruction of 12 taxa of Fagales using maximum likelihood (ML) based on the 69 shared protein-coding genes sequences.

Footnotes

^ http://www.phylo.org/

References

APG III (2009). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Bot. J. Linn. Soc. 161, 105–121. doi: 10.1016/j.jep.2015.05.035

PubMed Abstract | CrossRef Full Text | Google Scholar

APG IV (2016). An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG IV. Bot. J. Linn. Soc. 181, 1–20. doi: 10.1016/j.jep.2015.05.035

PubMed Abstract | CrossRef Full Text | Google Scholar

Arroyo-Garcia, R., Ruiz-Garcia, L., Bolling, L., Ocete, R., Lopez, M., Arnold, C., et al. (2006). Multiple origins of cultivated grapevine (Vitis vinifera L. ssp. sativa) based on chloroplast DNA polymorphisms. Mol. Ecol. 15, 3707–3714. doi: 10.1111/j.1365-294X.2006.03049.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Cavalier-Smith, T. (2002). Chloroplast evolution: secondary symbiogenesis and multiple losses. Curr. Biol. 12, R62–R64. doi: 10.1016/S0960-9822(01)00675-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Chang, C. C., Lin, H. C., Lin, I. P., Chow, T. Y., Chen, H. H., Chen, W. H., et al. (2006). The chloroplast genome of Phalaenopsis aphrodite (Orchidaceae): comparative analysis of evolutionary rate with that of grasses and its phylogenetic implications. Mol. Biol. Evol. 23, 279–291. doi: 10.1093/molbev/msj029

PubMed Abstract | CrossRef Full Text | Google Scholar

Chase, M. W., Soltis, D. E., Olmstead, R. G., Morgan, D., Les, D. H., Mishler, B. D., et al. (1993). Phylogenetics of seed plants: an analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80, 528–580. doi: 10.2307/2399846

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, J. H., Hao, Z. D., Xu, H. B., Yang, L. M., Liu, G. X., Sheng, Y., et al. (2015). The complete chloroplast genome sequence of the relict woody plant Metasequoia glyptostroboides Hu et Cheng. Front. Plant Sci. 6:447. doi: 10.3389/fpls.2015.00447

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, Z. D., Yang, T., Lin, L., Lu, L. M., Li, H. L., Sun, M., et al. (2016). Tree of life for the genera of Chinese vascular plants. J. Syst. Evol. 54, 277–306. doi: 10.1186/s12862-015-0405-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Cheng, B. B., Zheng, Y. Q., and Sun, Q. W. (2015). Genetic diversity and population structure of Taxus cuspidata in the Changbai Mountains assessed by chloroplast DNA sequences and microsatellite markers. Biochem. Syst. Ecol. 63, 157–164. doi: 10.1016/j.bse.2015.10.009

The Complete Chloroplast Genome of Chinese Bayberry (Morella rubra, Myricaceae): Implications for Understanding the Evolution of Fagales

Introduction

Materials and Methods

DNA Sequencing and Genome Assembly

Genome Annotation and Molecular Marker Identification

Repeat Structure and Sequence Analysis

Comparative Chloroplast Genomic Analysis

Synonymous (KS) and Non-synonymous (KA) Substitution Rates Analysis

Phylogeny Inference

Results and Discussion

Genome Content and Organization in M. rubra

Genome Organization of Fagales

Genome Size

Gene Loss

Gene Pseudogenization

IR Expansion and Contraction

Repeat Sequence Analysis and Molecular Marker Identification

Synonymous (KS) and Non-synonymous (KA) Substitution Rate Analysis

Phylogeny Inference

Conclusion

Author Contributions

Funding

Conflict of Interest Statement

Acknowledgment

Supplementary Material

Footnotes

References

Synonymous (K_S) and Non-synonymous (K_A) Substitution Rates Analysis

Synonymous (K_S) and Non-synonymous (K_A) Substitution Rate Analysis