Comparative Analysis of Six Lagerstroemia Complete Chloroplast Genomes

Crape myrtles are economically important ornamental trees of the genus Lagerstroemia L. (Lythraceae), with a distribution from tropical to northern temperate zones. Phylogenetically, they belong to the rosids, a large subclade of the eudicots that contains more than 25% of all angiosperms. They commonly bloom from summer until fall and are of significant value in urban landscaping and environmental protection. Morphological traits are shared among Lagerstroemia species to a certain extent and are also influenced by environmental conditions and developmental stage. Thus, classification of Lagerstroemia plants at the species and cultivar levels remains challenging. Chloroplast (cp) genome sequences have proven to be an informative and valuable source of cp DNA markers for evaluating genetic diversity. In this study, the complete cp genomes of three Lagerstroemia species were newly sequenced, and three other published cp genome sequences of Lagerstroemia were retrieved for comparative analyses, in order to better understand the application value of the genetic information in the cp genomes. The six cp genomes ranged from 152,049 bp (L. subcostata) to 152,526 bp (L. speciosa) in length. We analyzed nucleotide substitutions, insertions/deletions, and simple sequence repeats in the cp genomes, and identified 12 relatively highly variable regions that may provide plastid markers for further taxonomic, phylogenetic, and population genetic studies in Lagerstroemia. The phylogenetic relationships of the Lagerstroemia taxa inferred from the cp genome datasets received high support, indicating that cp genome data may be useful in resolving relationships in this genus.


INTRODUCTION
Several major subclades (i.e., Rosids, Asterids, Saxifragales, Santalales, and Caryophyllales) are recognized phylogenetically within the eudicot clade of angiosperms, which comprises ∼75% of all flowering plant species. Among these subclades, the rosids form a large monophyletic group containing more than 25% of all angiosperms.
Lagerstroemia is placed phylogenetically in the Lythraceae (within the Myrtales Rchb.) of the rosids among the core eudicots. Lagerstroemia, one of the 25 genera in the family Lythraceae, comprises about 56 species worldwide, with a distribution from the tropical to northern temperate zones (Qin and Shirley, 2007; APG III, 2009; Su et al., 2014).
Crape myrtles produce abundant large and beautiful panicles with charming flowers commonly lasting for about 3 months or more across summer and autumn seasons (Qin and Shirley, 2007). Their leaves can clean the air by absorbing smoke and dust. They are well-known excellent ornamental trees for city gardening and environmental protection. Their cultivation has a history of at least 1500 years in China. At present, more than 500 cultivars have been bred in the world. They have significant value in horticultural and landscaping application (Huang et al., 2013a,b,c).
Phylogenetic relationships within Lythraceae have been investigated using morphology and DNA evidence from the rbcL gene, the trnL-F region, and the psaA-ycf3 intergenic spacer of the cp genome, and the internal transcribed spacer (ITS) of the nuclear genome (Huang and Shi, 2002; Graham et al., 2005). The four DNA markers (rbcL, matK, trnH-psbA, and ITS) can only support plant identification at or above the species level, with limited or no resolution among closely related species and/or cultivars (Xiang et al., 2011; Suo et al., 2012, 2015, 2016). Because morphological traits are shared to some extent among species and cultivars, the lack of morphological and DNA markers has severely hindered the evaluation of genetic diversity in Lagerstroemia germplasm resources. Genetic information from comparative genomics for research on genetic diversity and phylogeny in Lagerstroemia is limited (Pounders et al., 2007; Wang et al., 2011; Suo et al., 2012, 2015, 2016; He et al., 2014; Gu et al., 2016a,b).
Chloroplasts are key organelles in plants for photosynthesis and other biochemical pathways such as the biosynthesis of starch, fatty acids, pigments, and amino acids (Dong et al., 2013, 2016; Raman and Park, 2016). The chloroplast (cp) genome, one of the three genomes in plant cells (the other two being the nuclear and mitochondrial genomes), is uniparentally inherited and has a highly conserved circular DNA arrangement ranging from 115 to 165 kb. Complete cp genome sequences have been widely accepted as an informative and valuable data source for understanding evolutionary biology because of their relatively stable genome structure, gene content, and gene order (Dong et al., 2012, 2013, 2014, 2016; Suo et al., 2012, 2015, 2016; Curci et al., 2015; Downie and Jansen, 2015; Song et al., 2015). As complete cp genome sequences accumulate, a comparative study of chloroplast genomes from Lagerstroemia plants can improve our evaluation of the application value of the cp genomes.
In this study, we report three newly sequenced complete cp genomes from Lagerstroemia (two species and one cultivar) and comparative genomic analyses with three other published cp genome sequences of the genus downloaded from the National Center for Biotechnology Information (NCBI) organelle genome database (https://www.ncbi.nlm.nih.gov), focusing on organization, gene content, patterns of nucleotide substitutions, and simple sequence repeats (SSRs). The aims of our study are: (i) to deepen our understanding of the genetic and evolutionary significance of structural diversity in the cp genomes, (ii) to improve our understanding of the application value of the complete cp genomes of Lagerstroemia, and (iii) to provide genetic resources for future research in this genus.

Plant Materials and DNA Extraction
Fresh leaves were collected from trees of Lagerstroemia subcostata and L. indica "Lüzhao Hongdie" growing in the Beijing Botanical Garden (39°48′ N, 116°28′ E, altitude 76 m) of the Chinese Academy of Sciences, and from trees of L. speciosa growing in the Xishuangbanna Tropical Botanical Garden (21°41′ N, 101°25′ E, altitude 570 m) of the Chinese Academy of Sciences. The fresh leaves from each accession were immediately dried with silica gel for subsequent DNA extraction. Total genomic DNA was extracted from each sample using the Plant Genomic DNA Kit (DP305) from Tiangen Biotech (Beijing) Co., Ltd., China.

Chloroplast Genome Sequencing, Assembling, and Annotation
The Lagerstroemia cp genomes were sequenced using the short-range PCR (polymerase chain reaction) method reported by Dong et al. (2012, 2013). The PCR protocol was as follows: preheating at 94 °C for 4.5 min; 34 cycles of denaturation at 94 °C for 50 s, annealing at 55 °C for 40 s, and elongation at 72 °C for 1.5 min; followed by a final extension at 72 °C for 8 min. PCR amplification was performed in an Applied Biosystems Veriti™ 96-Well Thermal Cycler (Model#: 9902, made in Singapore). The amplified DNA fragments were sent to Shanghai Majorbio Bio-Pharm Technology Co., Ltd (Beijing) for Sanger sequencing in both the forward and reverse directions using a 3730xl DNA analyzer (Applied Biosystems, Foster City, CA, USA). DNA regions that contained poly structures or were difficult to amplify were sequenced again using newly designed primers to confirm reliable, high-quality sequencing results.
The cp DNA sequences were manually checked and assembled using Sequencher (v4.6), and cp genome annotation was performed using the Dual Organellar Genome Annotator (DOGMA; Wyman et al., 2004). BLASTX and BLASTN searches were employed to accurately annotate the protein-coding genes and to identify the locations of the ribosomal RNA (rRNA) and transfer RNA (tRNA) genes. Gene annotation information from closely related plant species was also used for confirmation when the boundaries of exons or introns could not be precisely determined because of the limited power of BLAST in cp genome annotation. The cp genome map was drawn using GenomeVx (Conant and Wolfe, 2008; Figure 1). The cp genome sequences have been deposited in GenBank under the following accession numbers: KF572028 for L. indica "Lüzhao Hongdie," KF572029 for L. subcostata, and KX572149 for L. speciosa. The cp genome sequences of L. fauriei (KT358807), L. indica (KX263727), and L. guilinensis (KU885923) were downloaded from GenBank (https://www.ncbi.nlm.nih.gov).

Simple Sequence Repeat Analysis
The Perl script MISA (Thiel et al., 2003) was used to search for simple sequence repeat (SSR, or microsatellite) loci in the cp genomes. The minimum repeat numbers (thresholds) were 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide SSRs, respectively. All repeats found were manually verified, and redundant results were removed.
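To make the thresholds concrete, the following Python sketch scans a sequence for perfect repeats using the same minimum repeat numbers. It is a simplified illustration, not the MISA script itself: the function names (find_ssrs, is_primitive), the left-to-right greedy scan, and the rule of skipping motifs that are themselves repeats of a shorter unit are all assumptions made for this example, and compound SSRs are not handled.

```python
# Minimum repeat counts per motif length, taken from the thresholds in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples, 1-based inclusive.

    Hypothetical helper for illustration; it only finds perfect tandem repeats.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            # Skip ambiguous bases and motifs reducible to a shorter unit.
            if not set(motif) <= set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            # Count how many times the motif repeats in tandem from position i.
            reps = 1
            while seq[i + reps * unit_len:i + (reps + 1) * unit_len] == motif:
                reps += 1
            if reps >= min_rep:
                hits.append((i + 1, i + reps * unit_len, motif, reps))
                i += reps * unit_len  # jump past the reported repeat
            else:
                i += 1
    return sorted(hits)
```

For instance, calling find_ssrs on a toy sequence built as "CG" + "A" * 12 + "CGC" + "AT" * 6 + "GGC" reports one mono-nucleotide locus (A, 12 repeats, positions 3 to 14) and one di-nucleotide locus (AT, 6 repeats, positions 18 to 29). Real output would still need the manual verification and redundancy removal described above.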

Chloroplast Genome Analysis by Sliding Window
These cp genome sequences were aligned using MAFFT (Katoh and Standley, 2013) and were manually adjusted using Se-Al 2.0 (Rambaut, 1996). We used two data sets (the sequence alignment of all the six complete Lagerstroemia cp genomes and the sequence alignment of five Lagerstroemia cp genomes excluding L. speciosa) for sliding window analysis, because of the high divergence of L. speciosa from the other five cp genomes (Figure 2). Sliding window analysis was conducted to generate nucleotide diversity (Pi) of the cp genome using DnaSP (DNA Sequences Polymorphism version 5.10.01) software (Librado and Rozas, 2009). The step size was set to 200 bp, with a 600 bp window length.
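The sliding-window calculation can be sketched in a few lines of Python. This is an illustrative approximation rather than the DnaSP implementation: it assumes uppercase, equal-length aligned sequences, skips every alignment column that contains a gap or ambiguous base, computes Pi as the mean number of pairwise differences per retained site, and reports each window by a 0-based midpoint. The function names and the gap-handling rule are assumptions for this example.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Pi: mean pairwise differences per site over gap-free, unambiguous columns."""
    if len(aligned) < 2:
        raise ValueError("need at least two aligned sequences")
    # Keep only columns in which every sequence has A, C, G, or T (assumption).
    cols = [col for col in zip(*aligned) if all(base in "ACGT" for base in col)]
    if not cols:
        return 0.0
    n_pairs = len(aligned) * (len(aligned) - 1) / 2
    diffs = sum(sum(a != b for a, b in combinations(col, 2)) for col in cols)
    return diffs / n_pairs / len(cols)


def sliding_pi(aligned, window=600, step=200):
    """Return (window midpoint, Pi) pairs; defaults follow the settings in the text."""
    length = len(aligned[0])
    results = []
    for start in range(0, length - window + 1, step):
        block = [seq[start:start + window] for seq in aligned]
        results.append((start + window // 2, nucleotide_diversity(block)))
    return results
```

Plotting the second element of each pair against the midpoint gives a diversity profile of the kind shown in Figure 2; peaks in such a profile are candidates for highly variable regions.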

Sequence Divergence Analysis
The alignment of the six complete Lagerstroemia cp genome sequences was visualized using the mVISTA program in Shuffle-LAGAN mode (Frazer et al., 2004) to show inter- and intra-specific variation (Figure 3). Variable and parsimony-informative sites across the complete cp genomes and within the large single copy (LSC), small single copy (SSC), and inverted repeat (IR) regions of the six cp genomes were calculated using MEGA 6.0 (Tamura et al., 2013). Insertions/deletions (indels) were manually detected using DnaSP. To estimate selection pressures, nonsynonymous (dN) and synonymous (dS) substitution rates of the combined sequences of 79 protein coding genes were calculated using the yn00 program in PAML (Yang, 2007).
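As an illustration of the site categories counted above, the Python sketch below classifies alignment columns using the standard definitions: a site is variable if it shows at least two different nucleotides, and parsimony-informative if at least two nucleotides each occur in at least two sequences. It is not the MEGA implementation; the function name and the choice to skip any column containing a gap or ambiguous base are assumptions for this example.

```python
from collections import Counter


def classify_sites(aligned):
    """Return (variable, parsimony_informative) site counts for an alignment.

    Assumes uppercase, equal-length sequences; columns containing anything
    other than A, C, G, or T are ignored (an assumption for this sketch).
    """
    variable = 0
    informative = 0
    for col in zip(*aligned):
        if any(base not in "ACGT" for base in col):
            continue
        counts = Counter(col)
        if len(counts) > 1:
            variable += 1
            # Informative: at least two states, each present in >= 2 sequences.
            if sum(1 for c in counts.values() if c >= 2) >= 2:
                informative += 1
    return variable, informative
```

Applied separately to the LSC, SSC, and IR portions of an alignment, a function like this would yield per-region counts comparable in kind to those reported in Table 4.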

Phylogenetic Analysis
Phylogenetic analysis was conducted using the complete chloroplast genome sequences of the six Lagerstroemia taxa mentioned above, with one Onagraceae species (Oenothera argillicola, 165,061 bp, GenBank accession No. EU262887) that was used as an outgroup (Figure 4).
Maximum parsimony (MP) analyses were conducted using PAUP v4b10 (Swofford, 2003). All characters were equally weighted, gaps were treated as missing, and character states were treated as unordered. A heuristic search was performed with the MULPARS option, tree bisection-reconnection (TBR) branch swapping, and random stepwise addition with 1,000 replications. The maximum likelihood (ML) analyses were performed using RAxML 8.0 (Stamatakis, 2006). For the ML analyses, the suggested best-fit model, general time reversible (GTR) + G, was used throughout, with 1,000 bootstrap replicates.
Bayesian inference (BI) was performed with MrBayes v3.2 (Ronquist et al., 2012). The Markov chain Monte Carlo (MCMC) analysis was run for 2 × 5,000,000 generations. Trees were sampled every 1,000 generations, with the first 25% discarded as burn-in. The remaining trees were used to build a 50% majority-rule consensus tree. Stationarity was considered to be reached when the average standard deviation of split frequencies remained below 0.001.

Chloroplast Genome Organization of the Lagerstroemia Taxa
The nucleotide sequences of the six Lagerstroemia cp genomes ranged from 152,049 bp (L. subcostata) to 152,526 bp (L. speciosa) in length (Figure 1 and Table 1). The six genomes differ only slightly in length (by no more than 477 bp; Table 1). The average GC content was 37.59%, and GC content was almost identical among the six complete Lagerstroemia cp genomes. When duplicated genes in the IR regions were counted only once, each of the six Lagerstroemia cp genomes harbored the same 112 different genes in the same order, including 78 protein-coding, 4 rRNA, and 30 tRNA genes (Figure 1, Table 1, and Table S1). The gene organization, gene order, and GC content were nearly identical among the six genomes and similar to those of other higher plants (Figure 1). The overall genomic structure, including gene number and gene order, was well conserved.
FIGURE 3 | Identity plot comparing the chloroplast genomes of six Lagerstroemia taxa using L. indica "Lüzhao Hongdie" as a reference sequence. The vertical scale indicates the percentage of identity, ranging from 50 to 100%. The horizontal axis indicates the coordinates within the chloroplast genome. Genome regions are color coded as protein-coding, rRNA, tRNA, intron, and conserved non-coding sequences (CNS).
Although cp genomes are highly conserved in terms of genomic structure and size, shifts in the IR/SC junction positions caused by expansion and contraction of the IR/SC boundary regions are usually considered a primary mechanism generating length variation in higher plant cp genomes (Kim and Lee, 2005; Asaf et al., 2016; Dong et al., 2016; Yang et al., 2016; Zhang et al., 2016). In this study, however, no change in IR/SC junction position was observed among the six cp genomes. This indicates that the IR/SC junction is relatively conserved in Lagerstroemia in comparison with other plant groups, such as Quercus (Yang et al., 2016) and Epimedium. Further study sampling more species of the genus across the world is necessary for confirmation.
The rpl2 intron loss was observed in the three newly sequenced Lagerstroemia cp genomes in this study. The occurrence of rpl2 intron loss in Lagerstroemia was considered to be one of the important evolutionary events in the Lythraceae of the rosids. It was inferred to occur after the divergence of the Lythraceae from the Onagraceae, but prior to the divergence of the Lythraceae genera (Gu et al., 2016a).

SSR Analysis of the Lagerstroemia cp Genomes
Simple sequence repeats (SSRs) in the cp genome can be highly variable at the intra-specific level, and are therefore often used as genetic markers in population genetics and evolutionary studies (Dong et al., 2013, 2016; Kaur et al., 2015; Suo et al., 2016; Yang et al., 2016). We analyzed the SSRs in the cp genomes (Tables 2, 3; Tables S2, S3). The lengths of SSRs ranged from 10 to 15 bp. Comparative analysis of the six Lagerstroemia cp genome sequences detected five categories of SSRs in total (mono-, di-, tri-, tetra-, and penta-nucleotide repeats; Tables 2, 3; Tables S2, S3). Fifty-four SSRs (19.64%) were located in intron regions. The distribution of SSRs varies significantly among the four regions in each of the six Lagerstroemia cp genomes, which is consistent with previous reports (Dong et al., 2016; Yang et al., 2016). Among the 148 homopolymer SSRs of the six Lagerstroemia cp genomes, 141 (95.27%) are of the A/T type, distributed mostly in intergenic (90 A/T loci, 63.83%) and LSC (102 A/T loci, 72.34%) regions (Tables S2, S3). In Nicotiana otophora, all mono-nucleotide repeats (100%) are composed of A/T (Asaf et al., 2016). In the five Epimedium cp genomes, mono-nucleotide SSRs were the most abundant category, at up to 72.76%, and A/T repeat units accounted for 80.17% of the homopolymer SSRs. Our results are consistent with the observation that the occurrence of transversion substitutions is correlated to some extent with high A/T content regions of the cp genome (Morton and Clegg, 1995; Morton et al., 1997).
In this study, no variation was detected in the repeat number of the penta-nucleotide category, and only minor variation was observed in the repeat number of the tri-nucleotide category among species and/or cultivars. The repeat numbers of the mono-, di-, and tetra-nucleotide categories varied significantly among the six cp genomes. The mono-nucleotide category is the dominant source of variation, especially between cultivars rather than between species, e.g., with 29 in L. indica "Lüzhao Hongdie" and 18 in L. indica (Tables 2, 3; Tables S2, S3).
In the five Epimedium cp genomes, the 116 detected SSR loci were mainly located in intergenic spacers (IGS, 62.07%), followed by introns (23.28%) and CDS (13.79%) regions. These proportions are similar to our results. It was observed that 16 SSRs were located in 10 protein-coding genes (rpoC2, rpoB, psbC, psaA, psbF, ycf1, ycf2, rpl32, ndhE, and ndhH) of the five Epimedium cp genomes. Therefore, the evidence suggests that the occurrence and genetic variation of SSRs in genes (such as ycf1) may have phylogenetic significance. This is worth further study in the future.
The preference for SSRs to occur in intergenic or genic regions was observed to differ between plant families and among samples/taxa within a family. The cp SSRs of the six Lagerstroemia taxa showed abundant variation, and are useful for detecting genetic polymorphisms at the population, intraspecific, and cultivar levels, as well as for comparing more distant phylogenetic relationships among Lagerstroemia species.

Genome Sequence Divergence among the Lagerstroemia Species/Cultivars
We used mVISTA to perform a sequence identity analysis, with L. indica "Lüzhao Hongdie" as a reference (Figure 3).
The alignment revealed high sequence similarity across the cp genomes, which suggests that they are highly conserved. Non-coding and SC regions exhibit higher divergence levels than coding and IR regions, respectively.
The LSC and SSC regions contributed 150 and 55 informative base sites, respectively, while the IR regions contributed only 15 informative sites (Table 4). The SSC region showed the highest nucleotide diversity (0.00639), followed by the LSC region (0.00345) and the IR region (0.00175; Table 4). Lagerstroemia speciosa presented the highest numbers of nucleotide substitutions and insertions/deletions (indels) among the six Lagerstroemia taxa, while the nucleotide diversity and the numbers of nucleotide substitutions and indels were smallest at the cultivar level (Tables 4, 5).
Pairwise substitution rates (dN/dS) between the Lagerstroemia cp genomes were calculated based on the 78 protein-coding gene sequences (Table 6). The numbers of nucleotide substitutions and indels varied from 29 to 315 and from 24 to 1,089, respectively (Table 5). dN was always lower than dS. The dN/dS ratio ranged from 0.1688 to 0.6081. The highest dN/dS ratio occurred between L. indica and L. guilinensis. The lowest dN/dS ratio occurred between L. indica and L. indica "Lüzhao Hongdie" (Table 6). All dN/dS ratios in our study are below 1, suggesting that these gene regions may be under negative (purifying) selection.
We chose 12 relatively highly variable regions, comprising 2 gene regions and 10 intergenic regions, that might be undergoing more rapid nucleotide substitution at the species and cultivar levels, as potential molecular markers for phylogenetic analyses and plant identification in Lagerstroemia (Figure 2, Table 7). They are trnK-rps16, trnS-trnG, trnG-trnR-atpA, trnE-trnT, rbcL-accD, psbL-psbF-psbE, trnP-psaJ-rpl33, rrn16-trnI, ccsA, ndhG-ndhI, rps15-ycf1, and ycf1. Primers for these regions are shown in Table 7. Yang et al. (2016) identified the five most variable coding regions and the 14 most variable non-coding regions as potential molecular markers for Quercus germplasm resources; these coincide with the variable regions found in Lagerstroemia, except for trnE-trnT, psbL-psbF-psbE, trnP-psaJ-rpl33, ndhG-ndhI, and rps15-ycf1. Further study is expected to apply these cp DNA markers in surveying Lagerstroemia germplasm resources worldwide.

Phylogenetic Analysis
Phylogenetic analyses using cp genome sequences have resolved numerous lineages within the flowering plants (Jansen et al., 2007; Moore et al., 2007). The cp DNA regions atpF-atpH, matK, psbK-psbI, rbcL, and trnH-psbA have been recommended and used as species-level barcodes with great success (Suo et al., 2012, 2015; Dong et al., 2015, 2016). However, these five cp DNA markers are not powerful enough when closely related species or cultivars are considered. Therefore, comparative genomic studies of more complete cp genome sequences have become necessary. In this study, all six Lagerstroemia taxa were completely discriminated with high bootstrap support based on each of the four DNA sequence alignment data sets (whole cp genome sequences, coding regions, non-coding regions, and the concatenated 12 highly variable regions) using maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) methods (Figure 4). L. guilinensis, L. indica "Lüzhao Hongdie," and L. indica showed a very close genetic relationship. The six taxa were separated into three evolutionary branches. The branch including L. subcostata and L. fauriei was sister to the branch containing L. guilinensis, L. indica "Lüzhao Hongdie," and L. indica. L. speciosa was placed at the basal position and showed a large divergence from the other five Lagerstroemia taxa. The non-coding data set gave better resolution than each of the other three data sets. Similar resolution can be obtained at lower cost using the data set from the 12 highly variable cp regions.

CONCLUSIONS
This study reports a comparative analysis of six Lagerstroemia cp genome sequences with detailed gene annotation. The six cp genomes are similar in structure and show a high degree of synteny in gene order. No change in IR/SC junction position was observed among the six cp genomes, indicating that the IR/SC junction is relatively conserved in Lagerstroemia in comparison with other plant groups, such as Quercus and Epimedium. Further study sampling more species is necessary to confirm this across the whole genus. Twelve cp DNA markers were developed from the relatively highly variable regions. All six Lagerstroemia taxa were completely discriminated with high bootstrap support based on each of the four DNA sequence alignment data sets (whole cp genome sequences, coding regions, non-coding regions, and the 12 highly variable regions) using maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) methods. The non-coding data set gave better resolution than each of the other three data sets, with no significant difference among the analytic methods. Similar resolution can be obtained at lower cost using the data set from the 12 highly variable regions. The six taxa were separated into three evolutionary branches. The branch including L. subcostata and L. fauriei is sister to the branch formed by L. guilinensis, L. indica "Lüzhao Hongdie," and L. indica. L. speciosa alone was placed at the basal position and showed a large divergence from the other five Lagerstroemia taxa. The data presented here will facilitate the understanding of the evolutionary history of crape myrtles. These findings provide an informative and valuable genetic resource for Lagerstroemia germplasm, supporting species identification, taxonomy, and phylogenetic reconstruction in the genus.

AUTHOR CONTRIBUTIONS
CX performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. WD conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper. WL, YL, XX conceived and designed the experiments, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper. JS, KH contributed reagents/materials/analysis tools, reviewed drafts of the paper. XJ wrote the paper, reviewed drafts of the paper. ZS conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, reviewed drafts of the paper.