Comparative analysis of complete chloroplast genome sequences of two tropical trees Machilus yunnanensis and Machilus balansae in the family Lauraceae

Machilus is a large (c. 100 sp.) genus of trees in the family Lauraceae, distributed in tropical and subtropical East Asia. Both molecular species identification and phylogenetic studies of this morphologically uniform genus have been constrained by insufficient variable sites among frequently used biomarkers. To better understand the mutation patterns in the chloroplast genome of Machilus, the complete plastomes of two species were sequenced. The plastomes of Machilus yunnanensis and M. balansae were 152, 622 and 152, 721 bp, respectively. Seven highly variable regions between the two Machilus species were identified and 297 mutation events, including one micro-inversion in the ccsA-ndhD region, 65 indels, and 231 substitutions, were accurately located. Thirty-six microsatellite sites were found for use in species identification and 95 single-nucleotide changes were identified in gene coding regions.


Introduction
The genus Machilus in the family Lauraceae includes nearly 100 tree species distributed in tropical and subtropical East and South Asia, with most species in China (Tang et al., 2010). Machilus species are known for their high-quality wood. Although the relationships among the genera traditionally recognized within the monophyletic 'Persea group' are still unclear, Machilus (with two misplaced species of Phoebe) forms a distinct, monophyletic, Asian clade (Chen et al., 2009;Rohwer et al., 2009;Li et al., 2011). At the species level, however, the reported nuclear ITS and LEAFY intron II sequences failed to resolve the phylogenetic and species identification problems in this genus. Rohwer et al. (2009) suggest that the extremely low genetic variation among species of Machilus could be explained by recent species differentiation and/or a greatly decreased substitution rate within the genus. Besides nuclear sequences, 14 chloroplast genomic markers (matK, trnK, accD, ndhJ, psbC-trnS, rpoB, rpoC1, trnD-trnT2, trnH-trnK, psbA-trnH, psbB-psbH, trnS-trnG, rpoB-trnC, and trnS-trnfM) also failed to resolve either the phylogenetic problems within the Persea group or species delineation within Machilus (Rohwer, 2000;Rohwer and Rudolph, 2005;Rohwer et al., 2009;Li et al., 2011). All of these results showed very little variation in those chloroplast genomic markers. This raises the question, are there any useful sequences for the phylogenetic classification of Machilus species in the chloroplast genome?
The chloroplast genome is more conserved than the nuclear genome in plants, but many mutation events in the chloroplast DNA sequence have been identified, including indels, substitutions, and inversions (Ingvarsson et al., 2003). At a high taxonomic level, a 22 kb DNA inversion event was used to confirm that the Barnadesioideae is the most basal lineage in the Asteraceae (Jansen and Palmer, 1987), and three DNA inversion events composed a nested set of phylogenetic characters to clarify the close relationship between the Poaceae and Joinvilleaceae (Doyle et al., 1992). At a low taxonomic level in ginseng, the DNA polymorphism rates of indels and SNPs between Panax ginseng and P. notoginseng were 0.40% (Dong et al., 2014), and 0.20% among four chloroplast genomes of different Chinese ginseng strains (Zhao et al., 2015). In rice, the DNA polymorphism rate of indels and SNPs between Oryza sativa and O. nivara were 0.02% (Masood et al., 2004), and 0.07% between O. sativa indica and O. sativa japonica (Tang et al., 2004). All of these results show that variable characters exist among the chloroplast genomes at the species level.
Here, two species of Machilus (Lauraceae) were selected to determine the entire chloroplast genome sequences. Machilus yunnanensis Lecomte is distributed at high altitudes in Yunnan, Sichuan, and Tibet of SW China (Wei and Werff, 2008), while M. balansae (Airy Shaw) F. N. Wei and S. C. Tang occurs mainly at low elevations in North Vietnam (Tang et al., 2010). By comparing these two complete chloroplast genomes we will try to answer the following questions: (1) What is the size range of chloroplast genomes in Machilus? (2) Which types of mutation events occurred in chloroplast genomes of Machilus?
(3) Is there any highly variable region in the chloroplast genomes of Machilus? Comparisons were also made with the recently published chloroplast genome of Cinnamomum kanehirae (Wu et al., 2015).

DNA Extraction and Sequencing
We collected young leaves of M. yunnanensis and M. balansae from single seedlings growing in the nursery of the Xishuangbanna Tropical Botanical Garden (XTBG) on May 20, 2014. We also collected fruiting branches of both mother trees (Supplementary Figure S1) and compared them with the types to confirm their identifications (Supplementary Figure S2). Genomic DNA was extracted from 1 g fresh leaves using the mCTAB method (Li et al., 2013). Both genomes were sequenced following Dong et al. (2013), and their 138 pair specific primers were used to bridge gaps in the plastomes.

Chloroplast Genome Assembling and Annotation
Sanger sequence reads were proofread and assembled with Sequencher 4.10 (http://www.genecodes.com). All of the genes encoding proteins, transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs) were annotated on Machilus plastomes using the Dual Organellar Genome Annotator (DOGMA) software (Wyman et al., 2004). To further verify the identified tRNA genes, the tRNAscan-SE 1.21 program was used to predict their corresponding structures (Schattner et al., 2005). The genome map of M. yunnanensis and M. balansae was drawn by GenomeVx (Conant and Wolfe, 2008).

Sliding Window Analysis of the Plastomes
After alignment using Clustal X 1.83 (Aiyar, 2000), the sequences were manually adjusted with Bioedit software (http://www.mbio. ncsu.edu/bioedit/bioedit.html). Further, we conducted a sliding window analysis to evaluate the variability (Pi) all over the plastomes in DnaSP version 5 software (Librado and Rozas, 2009). The window length was set to 600 base pairs and the step size was set as 200 base pairs.

Mutation Events Analysis
To identify the microstructural mutations between M. yunnanensis and M. balansae, the two aligned sequences were further analyzed using DnaSP version 5 (Librado and Rozas, 2009) and MEGA version 5 (Tamura et al., 2011). Indel and SNP events were counted and positioned in the plastome using an R program. For the SSRs search, the minimum repeat unit was limited to eight for mononucleotides and four for dinucleotides. For non-SSR indel and SNP detection, the plastome of M. yunnanensis was used as a reference to determine the insertion or deletion events and transition (Ts) or transversion (Tv) events. In addition, the SNPs in the exon of the plastome were further classified into synonymous (S) and non-synonymous (N) substitutions. The gene classification was according to Chang et al. (2006).

Size, Gene Content, and Organization of M. yunnanensis and M. balansae Plastomes
The chloroplast genome of M. yunnanensis (deposited in GenBank: KT348516), with a length of 152, 622 bp, was 99 bp smaller than that of M. balansae (deposited in GenBank: KT348517) (Figure 1), and 78 bp smaller than that of Cinnamomum kanehirae (152,700 bp, GenBank accession No. KR014245) (Wu et al., 2015). All three are smaller than the genome of Calycanthus fertilis (153, 337 bp, GenBank accession No. NC_004993) in the Calycanthaceae, which is in the same order as the Lauraceae (Goremykin et al., 2003).  Table 1). Both contain 113 different functional genes, including 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes (Supplementary Table S1). The gene map is shown in Figure 1. Among the functional genes, twelve protein-coding genes and  Table S1).

Divergence Hotspots of M. yunnanensis, M. balansae, and Cinnamomum kanehirae Plastomes
To elucidate the level of sequence divergence, the nucleotide variability (Pi) values within 600 bp in both chloroplast genomes of M. yunnanensis, M. balansae, and Cinnamomum kanehirae were calculated with DnaSP 5.0 software. Between two Machilus species these values varied from 0 to 0.01333 with a mean of 0.00154, indicating that the differences between the genomes were small. However, seven highly variable loci including the second intron of clpP, ndhF-rpl32, trnQ-psbI, rps8-rpl14, ycf2, rpl32-trnL, and ycf1 were precisely located (Figure 2A). All of these regions had much higher values than other regions (Pi > 0.008). Three of these loci lie in the LSC region, three in the SSC region, and one in the IR region. Among them, the introns of clpP, rpl32-trnL, and ycf1 loci have been reported before as highly variable regions in seed plants (Dong et al., 2012), while ndhF-rpl32, trnQ-psbI, rps8-rpl14, and ycf2 loci seem to be especially variable in Machilus. Among Cinnamomum kanehirae and the two Machilus species the Pi values varied from 0 to 0.02444, indicating that the differences between the two genera were larger than those between congeneric species. Three of the seven loci, including ndhF-rpl32, rpl32-trnL, and ycf1, were particularly highly variable between Cinnamomum and Machilus species (Pi > 0.015; Figure 2B).

Number and Forms of Microstructural Mutations
Indel markers display no ambiguity in complex mutation patterns with the advantages of low cost and high precision. To detect more variable sites in Machilus, indel mutations between the chloroplast genomes of M. yunnanensis and M. balansae were compared. There are 52 indels in gene spacer regions, 12 indels in introns, and one in the exon of ndhF. Further, these indels were classified into 36 simple sequences repeat (SSR) indels (Table 2) and 29 non-SSR indels ( Table 3). For the SSR indels, there are 34 single nucleotide repeats A/T ranged from 8 to 19 bp, one 9 bp single nucleotide repeat C in the petA-psbJ gene spacer region, and one double nucleotide repeat AT with 5 and 7 bp in the petN-psbM region ( Table 2). The sizes of most non-SSR indels ranged from 1 to 7 bp, while that within the ndhF-rpl32, trnH-psbA, and psbM-trnD gene spacer sequences were 94 bp, 22 bp, and 21 bp, respectively ( Table 3). Of all of these indel events, 83.08% of sites were in the LSC region while 3.07% of sites were in the IR regions. In addition, one micro-inversion event with five bases was detected in the ccsA-ndhD gene spacer region (Supplementary Figure S3).

Numbers and Pattern of SNP Mutations
SNP markers are the most abundant type of mutations, but have never been screened in Machilus species. We detected 95 SNPs, including 48 Ts and 47 Tv, in gene coding regions (Figure 3) and 136 SNPs, including 58 Ts and 78 Tv, in non-coding regions (Supplementary Table S2). The Tv to Ts ratio was 1: 0.74. Among the Tv, 14 were Tv between T and A, 9 were Tv between C and G, and the other 101 were related to GC content changes. Among substitution events in the gene coding regions, nonsynonymous and synonymous substitutions shared 48 and 47 of 95 sites in the entire plastomes, and 23 of 79 genes had nonsynonymous substitution sites. However, two genes psaA and ycf1 had more non-synonymous than synonymous substitutions sites, suggesting these two genes had a relatively high evolution rate ( Table 4). In all of these substitution events, 74.03% of SNP sites  were in the LSC region while 1.73% of SNP sites were in the IR regions.

Discussion
This study produced two complete chloroplast genomes for species in the Lauraceae, which comprises nearly 3500 species in over 50 genera worldwide. For species identification and population structure analysis in this family, the rapidly developed molecular markers such as indels and SNPs have been proved to have significant potential. Global alignment of 13 Gossypium plastomes (Malvaceae) indicated that the total number of SNPs varied from 6 to 1000 and the number of indels ranged from 3 to 178, which supported that the plastome divergence was approximate 0.00159 to 0.00454 within allotetraploids of Gossypium (Xu et al., 2012). Plastome comparative analysis of five Camellia species identified 15 molecular markers with over 1.5% sequence divergences, which were used to promote the further phylogenetic analysis and species identification of Camellia species (Huang et al., 2014). The indel and SNP variable sites of plastomes of 12 Triticeae species were used to estimate that barley diverged from rye and wheat around 8.5 million years ago and rye diverged from Triticum aestivum around 3.5 million years ago (Middleton et al., 2014). These results show that molecular markers including indels and SNPs are useful tools in research.
Out of the 65 indel markers between M. yunnanensis and M. balansae plastomes, the largest indel is located within the intergenic sequence ndhF-rpl32 (94 bp) in the SSC region. Another two large indels were found within trnH-psbA (22 bp) and psbM-trnD (21 bp) in the LSC region. Previous work in other plants has identified large indels in intergenic spacers, such as FIGURE 3 | The patterns of nucleotide substitutions among M. yunnanensis and M. balansae plastomes. The patterns were divided into six types as indicated by the six non-strand-specific base-substitution types (i.e., numbers of considered G to A and C to T sites for each respective set of associated mutation types). The plastome of M. yunnanensis was used as a reference.   ndhF-rpl32, rpoB-trnC, trnE-trnT, rpl32-trnL, trnQ-rps16, and protein coding genes, such as accD, rpl20, ycf1, and ycf15 (Shaw et al., 2007;Nashima et al., 2015). Most of these large indel events occurred in single copy regions but not IR regions. The 36 SSR loci identified in this study may be useful in population and evolutionary studies as well, as they were in Panax ginseng (Kim and Lee, 2004a), Cucumis sativus , Vigna radiata (Tangphatsornruang et al., 2010), and Pyrus pyrifolia (Terakami et al., 2012). In addition, one micro-inversion of five nucleotides was detected between M. yunnanensis and M. balansae plastomes, which indicate that differences in micro-inversion events could exist between Machilus species as previously report in Solanum species (Kim and Lee, 2004b;Gargano et al., 2012). Besides the indel markers, 231 SNP markers were detected between M. yunnanensis and M. balansae plastomes, which indicated that the nucleotide substitution events in the chloroplast genome of Machilus species are more than that between species of rice and less than species of ginseng, potato, and orange. Comparative analysis of genomes found 159 SNP sites between two chloroplast genomes of O. sativa and O. nivara (Masood et al., 2004), 464 between plastomes of P. ginseng and P. notoginseng (Dong et al., 2014), 591 between plastomes of Solanum tuberosum and S. bulbocastanum (Chung et al., 2006), and 330 between plastomes of Citrus sinensis and C. aurantiifolia (Su et al., 2014). For the 95 SNP markers in gene coding regions, non-synonymous and synonymous substitutions shared similar numbers of 48 and 47 sites in the entire Machilus plastomes, implying that constraint mechanisms of substitution existed. Under the constraint background, photosynthetic metabolism genes atpE and ndhF and gene expression genes rpoC2 and rpl2 shared the extra synonymous substitution sites which are equal with the non-synonymous substitution sites of ycf1, ycf2, and photosynthetic apparatus gene psaA.
The Indel and SNP mutation events in the genome were not random but clustered as "hotspots" (Shaw et al., 2007;Worberg et al., 2007). Such mutational dynamics created the highly variable regions in the genome. In M. yunnanensis and M. balansae plastomes, we identified seven highly variable loci including the second intron of clpP, ndhF-rpl32, trnQ-psbI, rps8-rpl14, ycf2, rpl32-trnL, and ycf1. Three of the seven, including ndhF-rpl32, rpl32-trnL, and ycf1, were particularly highly variable between Machilus and Cinnamomum plastomes. The second intron of clpP, rpl32-trnL, and ycf1 were the focus of previous analyses investigating sequence variation in seed plants (Dong et al., 2012(Dong et al., , 2015Sarkinen and George, 2013). The ycf2 and ndhF-rpl32 loci have also been widely used for phylogenetic studies (Shaw et al., 2007;Huang et al., 2010). Here, two rarely reported highly variable loci rps8-rpl14 and trnQ-psbI were present in Machilus plastomes. In contrast, none of the 14 regions in the chloroplast genome used previously for phylogenetic analysis were found to be variable (Rohwer et al., 2009;Li et al., 2011). All of these seven highly variable regions and the indel or SNP markers are better to use for phylogenetic studies at the species level in Machilus. We encourage researchers working on the Lauraceae family to use the seven highly variable regions identified in this study for phylogenetic analysis.