The Complete Chloroplast Genome of Wild Rice (Oryza minuta) and Its Comparison to Related Species

Oryza minuta, a tetraploid wild relative of cultivated rice (family Poaceae), possesses a BBCC genome and contains genes that confer resistance to bacterial blight (BB) and white-backed (WBPH) and brown (BPH) plant hoppers. Based on the importance of this wild species, this study aimed to understand the phylogenetic relationships of O. minuta with other Oryza species through an in-depth analysis of the composition and diversity of the chloroplast (cp) genome. The analysis revealed a cp genome size of 135,094 bp with a typical quadripartite structure and consisting of a pair of inverted repeats separated by small and large single copies, 139 representative genes, and 419 randomly distributed microsatellites. The genomic organization, gene order, GC content and codon usage are similar to those of typical angiosperm cp genomes. Approximately 30 forward, 28 tandem and 20 palindromic repeats were detected in the O. minuta cp genome. Comparison of the complete O. minuta cp genome with another eleven Oryza species showed a high degree of sequence similarity and relatively high divergence of intergenic spacers. Phylogenetic analyses were conducted based on the complete genome sequence, 65 shared genes and matK gene showed same topologies and O. minuta forms a single clade with parental O. punctata. Thus, the complete O. minuta cp genome provides interesting insights and valuable information that can be used to identify related species and reconstruct its phylogeny.


INTRODUCTION
The angiosperm chloroplast (cp) is a uniparentally inherited and stable structure. Accordingly, it is considered to be an informative and valuable resource for phylogenetic analysis in plants at multiple taxonomic levels (Nadachowska-Brzyska et al., 2015) compared to mitochondrial genomes (Timmis et al., 2004). Most cp genomes range from 120 to 210 kb and have a quadripartite structure that is typically composed of a small single-copy region (SSC), a large single-copy region (LSC) and a pair of inverted repeats (IRs) (Yurina and Odintsova, 1998;Wang et al., 2015). In most cases, differences in the length of the IRs determine length differences of the cp genome (Chang et al., 2006;Guisinger et al., 2011).
Previously, phylogenetic analyses have been based on sequencing one or a few loci from the chloroplast. Due to the availability of complete chloroplast sequences in public databases and advances in next-generation sequencing techniques, analyses based on the entire chloroplast genome are achievable and yield higher quality and more valuable information, which could reveal detailed insight into genomic organization (Martin et al., 2005). Indeed, examining the entire cp genome can resolve previously ambiguous phylogenetic relationships among species Moore et al., 2010). Due to availability of highthroughput sequencing technology as well as the comparatively small size and structural similarity of cp genomes, hundreds of sequencing projects in terrestrial plants have recently been reported (Wu, 2016b).
Rice is an important cereal crop that provides essential food and energy for more than half of the world's population. In addition, rice is considered a model crop for studies on cereal genomics. Two species of the genus Oryza (O. sativa, and O. glaberrima) are cultivated, though there are more than 20 wild species (Evenson and Gollin, 1997;Sang and Ge, 2007). Different species are categorized into 10 genome types, six are diploid (AA, BB, CC, EE, FF, and GG) (2n = 2x = 24) and the other four are allelotetraploid (BBCC, CCDD, HHJJ, and HHKK) (2n = 4x = 28) (Ge et al., 1999). About one half of the species in Oryza genus are allotetraploids that originated through interspecific hyberdization and genome doubling (Vaughan, 1989;Bao and Ge, 2008;Jacquemin et al., 2013). Rice (O. sativa) with an AA genome type, is one of the most important species, and it is further divided into the subspecies japonica and indica, which are distributed globally (Chang, 1976;Wambugu et al., 2015).
Because of the importance of Oryza as a major food crop, great attention has been given to understanding the genetic makeup and phylogeny of this genus, both within the genus and species (Guo and Ge, 2005). In plants, sequencing functional genes in cpDNA (chloroplast DNA) is helpful for resolving issues related to molecular taxonomy and phylogenetic reconstruction Moore et al., 2010;Wu and Ge, 2012), and such approaches can yield vast benefits in plant breeding and conservation strategies. Currently, 10 cp genomes belonging to Oryzeae have been published (Waters et al., 2012;Brozynska et al., 2014). Some wild Oryza species are better able than cultivated Oryza species to resist biotic and abiotic stresses and attack from insect pests. Thus, cultivated species can be improved through introgression of resistance genes from wild species (Heinrichs et al., 1985). For example, resistance traits from wild O. minuta, a tetraploid wild relative of cultivated rice, have been reported. O. minuta has a BBCC genome type and exhibits significant potential to resist blast blight, bacterial blight (BB), and white-backed plant hopper (WBPH) and brown plant hopper (BPH) diseases (Vaughan, 1994). Such diseases are damaging to the growth and yield of cultivated rice. In addition, stress tolerance genes from O. minuta have been successfully transferred to cultivated rice through introgression (Amante-Bordeos et al., 1992;Rahman et al., 2009). Overall, wild species such as O. minuta possess valuable genetic diversity that can contribute greatly to improving the growth and yield of various crops (Amante-Bordeos et al., 1992). To identify desirable genes and ensure effective conservation, it is essential to analyze phylogenetic and evolutionary relationships among species (Guo et al., 2013). Previously, it was reported that O. minuta was originated from allopolyploidization of O. officinalis (paternal) and O. punctate (meternal) (Ammiraju et al., 2010;Zou et al., 2015) In this study, we assembled for the first time the complete chloroplast genome sequence of O. minuta, and performed detailed phylogenetic analyses on the basis of complete cp genome and 65 shared genes. The complete cp genome of O. minuta, in conjunction with previously reported cp genome sequences, will improve our understanding of O. minuta and the evolutionary history of genus Oryza.

MATERIALS AND METHODS
In this study, a standard protocol for DNA extraction was used as described in detailed by Sierro et al. (2014). The extracted DNA was sequenced using an Illumina HiSeq-2000 (Illumina, San Diego, CA, USA) platform at Macrogen (Macrogen, Seoul, Korea), and the O. minuta cp genome was obtained by de novo assembly of the entire genome sequence via a bioinformatics pipeline (http://phyzen.com). A 400-bp paired-end library was produced according to the Illumina PE standard protocol, generating 28,110,596 bp of total reads with a 120-bp average read length. Raw reads with Phred scores of 20 or less were removed from the total PE reads using the CLC-quality trim tool, and de novo assembly was conducted on trimmed reads using CLC Genomics Workbench v7.0 (CLC Bio, Aarhus, Denmark) with parameters of minimum (200 to 600 bp) autonomously controlled overlap size. All contigs were then mapped and assembled against the reference cp genomes of O. officinalis and O. punctata by following a previously described method (Wu, 2016a,b). Primers were designed (Table S1) to test for correct sequence assembly. PCR amplification was performed in a total volume of 20 µl containing 1× reaction buffer, 0.4 µl dNTPs (10 mM), 0.1 µl Taq (Solg TM h-Taq DNA Polymerase), 1 µl (10 pm/µl) primers, and 1 µl (10 ng/µl) DNA. The PCR program consisted of initial denaturation at 95 • C for 5 min followed by 35 cycles of 95 • C for 30 s, 65 • C for 20 s and 72 • C for 30 s, with a final extension step at 72 • C for 5 min. After incorporation of the sequencing results, the finished cp genome was applied as a reference to map previously obtained short reads to refine the assembly based on maximum sequence coverage.

Genome Annotation and Sequence Architecture
The program DOGMA was used to annotate the O. minuta cp genome (Wyman et al., 2004). The annotation results were checked manually, and codon positions were adjusted by  (Schattner et al., 2005) with the default settings. OGDRAW (Lohse et al., 2007) was applied to illustrate the structural features of the O. minuta cp genome. To examine deviations in synonymous codon usage by avoiding the influence of amino acid composition, the relative synonymous codon usage (RSCU) was determined using MEGA 6 software (Kumar et al., 2008). mVISTA software was used in the Shuffle-LAGAN mode to compare the complete variation in the O. minuta cp genome with eleven other cp genomes using the O. minuta annotation as a reference (Frazer et al., 2004).

Characterization of Repeat Sequences and SSRs
We employed REPuter to identify repeat sequences, including palindromic, reverse, and direct repeats, within the cp genome (Kurtz et al., 2001). The following settings for repeat identification were used: (1) Hamming distance of 3; (2) 90% or greater sequence identity; (3) a minimum repeat size of 30 bp. Phobos version 3.3.12 (Leese et al., 2008) was used to detect (SSRs) within the cp genome, with the search parameters set at ten repeat units ≥10 for mononucleotides, eight repeat units ≥8 for dinucleotides, four repeat units ≥4 for trinucleotides and tetranucleotides, and three repeat units ≥3 for pentanucleotide and hexanucleotide SSRs. Tandem repeats in the O. minuta cp genome were identified using Tandem Repeats Finder version 4.07 b (Benson, 1999) with the default settings.

Sequence Divergence and Phylogenetic Analysis
Complete cp genomes as well as a separate partition using only 65 shared genes were employed to analyze the average pairwise sequence divergence for 11  (Katoh and Standley, 2013) with the default parameters. Kimura's two-parameter (K2P) model was selected to calculate pairwise sequence divergences (Kimura, 1980). To resolve the O. minuta phylogenetic position within the rice tribe (Oryzeae), 13 published cp genomes were downloaded from the NCBI database for analyses. First, multiple alignments were performed using the complete cp genomes based on the conserved structure and gene order of the chloroplast genomes (Wicke et al., 2011). Four methods were employed to construct phylogenetic trees, including Bayesian inference (BI) implemented with MrBayes 3.12 (Ronquist and Huelsenbeck, 2003), maximum parsimony (MP) with PAUP 4.0 (Swofford, 1993), and maximum likelihood (ML) and neighbor-joining (NJ) with MEGA 6 (Kumar et al., 2008) using described settings (Wu et al., 2015;Asaf et al., 2016a). In the second phylogenetic analysis, 65 shared genes from the cp genomes of 12 Oryza species and two Zizania outgroup species were aligned in ClustalX using the default settings, followed by manual adjustment to preserve reading frames. The above four phylogenetic-inference methods were used to infer trees from the 65 concatenated genes using the same settings (Wu et al., 2015;Asaf et al., 2016a).

Chloroplast Genome Organization of O. minuta
The O. minuta cp genome was assembled by mapping all Illumina reads to the draft cp genome sequence using CLC Genomics Workbench v7.0. A total of 1,577,251 reads were obtained, with an average length of 120 bp, for 504.211X coverage of the cp genome. The consensus sequence for a specific position was generated by assembling reads mapped with at least 875 reads per position and was used to construct the complete sequence of the O. minuta cp genome. The complete O. minuta cp genome is 135,094 bp in size (GenBank: KU179220), which is similar to the already reported cp genome sizes of related Oryza species and is within the range of other angiosperms (Yang et al., 2010). The cp genome possesses a typical quadripartite structure, which includes a pair of inverted repeats (IRa and IRb 20,836 bp) and separate SSC (12,446 bp) and LSC (80,974 bp) regions (Table 1, Figure 1). The GC content (39%) of the O. minuta cp genome is very similar to that of other Oryza species cp genomes ( Table 1) (Wu et al., 2015). However, the GC content is unequally distributed in the O. minuta cp genome: it is highest in the IR regions (44.3%), moderate in the LSC regions (37.1%) and lowest in the SSC regions (33.3%). This high IR GC percentage is due to the presence of eight ribosomal RNA (rRNA) sequences in these regions. These results are similar to a previously reported high GC percentage in IR regions (Qian et al., 2013). A total of 139 genes were found in the O. minuta cp genome, of which 110 are unique, including 91 protein-coding genes, 40 tRNA genes, and 8 rRNA genes (Figure 1, Table 2). Of these, 11 protein-coding, four rRNA, and eight tRNA genes are duplicated in the IR regions. The LSC region comprises 62 protein-coding
Protein, rRNAs, and tRNAs are encoded by 45.1, 6.83, and 2.2% of the entire cp genome, respectively, and the remaining 45.8% is composed of non-coding regions ( Table 3). The total protein-coding sequences (CDSs) are 60,948 bp in length and consist of 91 genes encoding 20,354 codons (Tables 1, 4). The O. minuta cp genome codon usage frequency was determined based on tRNA and protein-coding gene sequences (Table 5). Leucine (10.7%) and cysteine (1.2%) are the maximum and minimum commonly encoded amino acids, and isoleucine, serine, glycine, arginine and alanine are encoded by 7.9, 7.5, 7.4, 6.5, and 6.1% of CDSs, respectively ( Figure S1). Similar ratios for amino acids are present in previously reported cp genomes (Qian et al., 2013;Chen et al., 2015). Among these, the maximum and minimum codons used are ATT (820), encoding isoleucine, and TTG and ATT (1, 1), encoding methionine. The AT content is 52.5, 60.0, and 68.7% at the 1st, 2nd, and 3rd codon positions, respectively, within CDS regions ( Table 4). The preference for a high AT content at the 3rd codon position is similar to the A and T concentrations reported in various terrestrial plant cp genomes (Morton, 1998;Nie et al., 2012;Qian et al., 2013). In total, 42.65 and 57% of all types of preferred synonymous codons (RSCU>1) ending with A and U and C and G, respectively, were found. Non-preferred synonymous codons (RSCU<1) are 42.40 and 57.50% for C and G and A and U. Usage of the start codon AUG and UGG, the latter encoding tryptophan, has no bias (RSCU = 1) ( Table 5).

Repeat Analysis
Repeat sequences, which play a role in genome rearrangements, are very helpful in phylogenetic studies (Cavalier-Smith, 2002;Nie et al., 2012). Furthermore, analyses of various cp genomes revealed that repeat sequences are essential to induce indels and substitutions (Yi et al., 2013). Repeat analysis of the O. minuta cp genome showed 20 palindromic repeats, 30 forward repeats, and 28 tandem repeats (Figure 2A). Among these, 17 forward repeats are 30-44 bp in length, with only three tandem repeats of the same length and 18 15-29 bp in length (Figures 2A-D).
Similarly, 11 palindromic repeats are 30-44 bp, and 6 repeats are 45-59 bp in length ( Figure 2B) (Figure 2A). This suggests that O. minuta is more similar to O. barthii and O. officinalis in terms of repeats. Approximately 29.4% of these repeats are distributed in protein-coding regions. Previous reports suggest that sequence variation and genome rearrangement occur due to the slipped-strand mispairing and improper recombination of these repeat sequences (Cavalier-Smith, 2002;Asano et al., 2004;Timme et al., 2007). Furthermore, the presence of these repeats indicates that the locus is a crucial hotspot for genome reconfiguration (Gao et al., 2009;Nie et al., 2012). Additionally, these repeats are an informative source for developing genetic markers for phylogenetic and population studies (Nie et al., 2012).

SSR Analysis
Simple sequence repeats (SSRs), or microsatellites, are repeating sequences of typically 1-6 bp that are distributed throughout the genome. In this study, we detected perfect SSRs in O. minuta together with 11 other Oryza species cp genomes ( Figure 3A). Certain parameters were set because SSRs of 10 bp or longer are prone to slipped-strand mispairing, which is believed to be the main mechanism for SSR polymorphisms (Rose and Falush, 1998;Raubeson et al., 2007;Huotari and Korpelainen, 2012). A  total of 419 perfect microsatellites were found in the O. minuta cp genome (Figure 3A). Similarly,418,413,416,416,419,420,419,419,421,429,and 422 Figure 3A). In O. minuta, most mononucleotide SSRs are A (97%) and T (2.12.30%) motifs, with the majority of dinucleotide SSRs being A/G (47.05%) and A/T (38.60%) motifs ( Figure 3B). Approximately 62% of SSRs are located in non-coding regions; approximately 4.3% are present in rRNA sequences and 2.3% in tRNA genes ( Figure 3C).
Further analysis revealed that approximately 66.82% of SSRs occur in the LSC region, whereas 24.34 and 8.83% were found in IR and SSC regions, respectively ( Figure 3D). These results are similar to previous reports that SSRs are unevenly distributed in cp genomes, and the findings might provide more information for selecting effective molecular markers for detecting intra-and interspecific polymorphisms (Powell et al., 1995a,b;Provan et al., 1997;Pauwels et al., 2012). Furthermore, most mononucleotides and dinucleotides are composed of A and T, which may contribute to bias in base composition, consistent with other cp genomes . Our findings are comparable to previous reports that SSRs found in cp genome are generally composed of polythymine (polyT) or polyadenine (polyA) repeats and infrequently contain tandem cytosine (C) and guanine (G) repeats (Kuang et al., 2011). Therefore, these SSRs identified contribute to the AT richness of the O. minuta cp genome, as previously reported for various species (Kuang et al., 2011;Chen et al., 2015).    (Table S3). Furthermore, the O. minuta cp genome has a gene content and organization that are similar to other Oryza species and members of Poaceae (Wicke et al., 2011); however, as for other grasses, it lacks a ycf1 gene, and the accD gene is a truncated pseudogene. Because these genes are essential for the survival of photosynthetic plants (Drescher et al., 2000;Kode et al., 2005), they were most likely functionally transferred to the nucleus or functionally replaced by a eukaryotic gene, as observed for the accD plastid gene in other plant families (Babiychuk et al., 2011;Rousseau-Gueutin et al., 2011). Pairwise cp genomic alignment between O. minuta and the 11 other genomes showed a high degree of synteny. The O. minuta cp genome annotation was used as a reference for plotting the overall sequence identity of the cp genomes of the 11 Oryza species in mVISTA (Figure 4), and the results revealed high sequence identity with all 11 Oryza species. However, except for O. australiensis, relatively lower identity was also observed with these species in various comparable genomic regions, particularly the rps3, rpl22, rpl23, rpl2, and rps19 regions (Figure 4). In addition, the LSC and SSC regions show less similarity than the two IR regions in all Oryza species. In addition, non-coding regions exhibit greater divergence than coding regions. These highly divergent regions include rbcL, rps16-trnQ, trnfM-trnM, psbM-petN, rpoC2, atpI-atpH, ndhA rpl33, petA-psbJ, ccsA, ndhF-rpl32, and ycf3. Similar results related to these genes were also reported by Qian et al. (2013). Our results also confirm similar differences among various coding regions in the analyzed species, as suggested by Kumar et al. (2009).

Structural and Sequence Comparisons of cp Genomes in Oryza
We compared the cp genomes and calculated the average pairwise sequence divergence among the 12 species (Table S4). Of these, the O. minuta genome has 0.005 average sequence divergence, and high divergence was found for O. australiensis (0.00725); O. officinalis has the lowest average sequence divergence (0.0044). Furthermore, the twelve most divergent genes among these genomes are petG, matK, infA, ccsA, rpoC2, clcP, psbE, rbcL, psbN, rps18, rpl36, and ndhF. The highest average sequence distance was found for rpoC2 (0.01983), followed by petG (0.0154) (Figure 5). Both these genes are located in LSC regions and display a trend toward more rapid evolution.

IR Contraction and Expansion
Expansion and contraction at the borders of IR regions are the main reason for size variations in the cp genome and play a vital role in its evolution Wang et al., 2008;Yang et al., 2010Yang et al., , 2014 (Figure 6). Despite the similar length of the O. minuta IR region with the other eleven Oryza species, from 20,836 bp to 20,840 bp, some IR expansion and contraction was observed. J LA is located between rps19 and psbA, and variation in distances between rps19 and J LA range from 40 to 49 bp across all species; the distance in O. minuta is 46 bp. The distance between psbA and J LA is 81 bp in O. minuta, which is similar to the other genomes (81 bp). The distance between rpl22 and J LB varies from 23 bp to 29 bp. In O. minuta, 1-bp variations exist in the J SA border region compared to the other cp genomes. The ndhH gene traverses the SSC and IRa regions, with approximately 164 bp located in the IR region for O. minuta. Furthermore, there are 16-bp variations observed compared with O. officinalis for ndhF, ndhH and rps15 in the SSC and IRb regions, located 41 bp, 164 bp and 302 bp from the J SB and J SA border regions, respectively.

Phylogenetic Analysis
The Oryza genus is composed of 23 species distributed in different regions of America, Africa, Asia, and Australia (Ge et al., 1999). Continued efforts have expanded our ability to differentiate among and to understand the genomic structure and phylogenetic relationships of rice species (Khush, 1997).
Taxonomy and phylogeny of the rice genus have been extensively investigated at genus level (Ge et al., 1999;Zhu and Ge, 2005;Jacquemin et al., 2013). Previous evolutionary relationships among different rice genomes and species were estimated by nuclear and chloroplast DNA restriction fragment-length polymorphisms (Ge et al., 1999;Zou et al., 2015), but complete genome sequencing provides more detailed insight (Wambugu et al., 2015;Wu et al., 2015;Asaf et al., 2016b). In this regard, O. minuta has been poorly investigated. In this study, the phylogenetic position of O. minuta within Oryza was established by utilizing complete cp genomes and 65 shared genes among 12 Oryza members (Figures 7A,B). Two species, Zizania aquatic and Zizania latifolia were set as outgroups. Phylogenetic analysis using Bayesian inference (BI), maximum parsimony (MP), maximum likelihood (ML) and neighborjoining (NJ) methods were performed. The results showed same phylogenetic signals for the complete cp genomes and 65 shared genes of O. minuta. The complete genome sequences (Table S5) and 65 shared genes (Tables S3, S6) from all species generated phylogenetic trees with same topologies (Figures 7A,B). In these phylogenetic trees based on the entire genome data set and 65 shared genes, O. minuta formed a single clade with O. punctata, with high BI and bootstrap support using four different methods (Figures 7A,B). Furthermore, the tree topology confirmed the FIGURE 7 | Phylogenetic trees were constructed for 14 species from the rice tribe using different methods, and two Bayesian trees are shown for data sets of the entire genome sequence and 65 shared genes. (A) The entire genome sequence data set (B). The data set of 65 shared genes. Each data set was used with four different methods, Bayesian inference (BI), maximum parsimony (MP), maximum likelihood (ML) and neighbor-joining (NJ). Numbers above the branches are the posterior probabilities of BI and bootstrap values of MP, ML, and NJ, respectively. Stars represent positions for O. minuta (KU179220) in the two trees. relationship inferred from the phylogenetic work conducted by Ge et al. (1999) and Zou et al. (2015). This position of O. minuta confirms the previously published phylogeny described by Ge et al. (1999). Ge et al. (1999) reported that O. minuta BBCC shares a clade with O. punctata BB with regard to Adh1, whereas it forms a clade with O. officinalis CC in the Adh2 phylogenetic analysis. Similar resuls was suggested by Zou et al. (2015), whereby phylogenetic analysis of the four nuclear loci and three meternally interited chloroplast fragments from different Oryza species grouped O. minuta in a clade with maternal parent O. punctata BB (Zou et al., 2015). As the phylogenetic tree based on the matK gene represents the maternal genealogy of rice species, which can offer an opportunity to identify maternal parents of allotetraploid species, we performed an additional phylogenetic analysis of O. minuta using the matK gene from related species (Figure S2). The results revealed a single clade for O. minuta with parental O. punctata. Similar results was also suggested by Ge et al. (1999), whereby phylogenetic analysis of the matK gene from different Oryza species grouped O. minuta in a clade with the maternal parent O. punctata BB instead of O. officinalis CC. Furthermore, the result suggests that there is no conflict between the entire genome data set and 65 shared genes of these cp genomes.

CONCLUSION
This study reports the first complete chloroplast genome sequence of O. minuta (135,094 bp). The structure and organization of this genome is very similar to previously reported cp genomes from the tribe Oryzeae. The location and distribution of repeat sequences was detected, and sequence divergences among cp genomes and 65 shared genes were identified with related species. No major structural rearrangement of Oryza species cp genomes was observed. Phylogenetic analyses showed that data sets based on the entire genome and 65 shared genes generate trees with same topologies regarding the placement of O. minuta. These findings provide a valuable analysis of the complete cp genome of O. minuta, which can be used to identify species and clarify taxonomic questions.

AUTHOR CONTRIBUTIONS
All authors listed, have made substantial, direct and intellectual contribution to the work, and approved it for publication.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2017. 00304/full#supplementary-material     Figure S1 | Amino acid frequencies in O. minuta cp protein-coding sequences. The frequencies of amino acids were calculated for all 97 protein-coding genes from start to stop codon. Figure S2 | A phylogenetic tree was constructed based on 14 species from the rice tribe using different methods. matK gene sequence data were used with maximum parsimony (MP), maximum likelihood (ML) and neighbor-joining (NJ) approaches. Numbers above the branches are the bootstrap values of MP, ML, and NJ, respectively. Stars represent position for O. minuta (KU179220).