Variation and Evolution of the Whole Chloroplast Genomes of Fragaria spp. (Rosaceae)

Species identification is vital for protecting species diversity and selecting high-quality germplasm resources. Wild Fragaria spp. comprise rich and excellent germplasm resources; however, the variation and evolution of the whole chloroplast (cp) genomes in the genus Fragaria have been ignored. In the present study, 27 complete chloroplast genomes of 11 wild Fragaria species were sequenced using the Illumina platform. Then, the variation among complete cp genomes of Fragaria was analyzed, and phylogenetic relationships were reconstructed from those genome sequences. There was an overall high similarity of sequences, with some divergence. According to analysis with mVISTA, non-coding regions were more variable than coding regions. Inverted repeats (IRs) were observed to contract or expand to different degrees, which resulted in different sizes of cp genomes. Additionally, five variable loci, trnS-trnG, trnR-atpA, trnC-petN, rbcL-accD, and psbE-petL, were identified that could be used to develop DNA barcoding for identification of Fragaria species. Phylogenetic analyses based on the whole cp genomes supported clustering all species into two groups (A and B). Group A species were mainly distributed in western China, while group B contained several species from Europe and Americas. These results support allopolyploid origins of the octoploid species F. chiloensis and F. virginiana and the tetraploid species F. moupinensis and F. tibetica. The complete cp genomes of these Fragaria spp. provide valuable information for selecting high-quality Fragaria germplasm resources in the future.

As well as being an important plastid functioning in photosynthesis and carbon fixation (Neuhaus and Emes, 2000), the chloroplast (cp) contains a genome that can provide valuable information for taxonomic and phylogenic purposes, with the key advantages of its relatively small size, more conservative structure, and low nucleotide substitution rate (Palmer, 1985;Wolfe et al., 1987;Jansen et al., 2005;Wei et al., 2006;Wicke et al., 2011;Terakami et al., 2012;Liu et al., 2019). Furthermore, the cp genome is haploid and uniparentally inherited, so it is helpful for tracing source populations and conducting phylogenetic studies to resolve complex evolutionary relationships (Jansen et al., 2007;Parks et al., 2009;Ruhsam et al., 2015). To date, phylogenetic relationships based on complete cp genomes of many angiosperms have been fully studied in many genera Zhang et al., 2016;Zhao et al., 2018). Njuguna et al. (2013) systematically studied the phylogenetic relationships of Fragaria based on cp genomes, but there were only 10 sequences assembled from total genomic data, and the coverage of PCR amplified sequences ranged from 60 to 90%, which may mean the assembled chloroplast sequences were incomplete. To the best of our knowledge, there is only one study that has focused on the molecular phylogenetic analysis of Fragaria genus based on whole complete chloroplast genome sequences (Sun et al., 2021), which revealed that 20 Fragaria species were clustered into northern group (eight species), southern group (11 species), and an oldest extant species (one species) based on whole cp genomes. However, this previous study focused on molecular clock analysis and ignored the variation and evolution of whole cp genomes of Fragaria, including, for example, the contraction and expansion of inverted repeats (IRs) regions (Plunkett and Downie, 2000;Kode et al., 2005;Ma et al., 2014;Lei et al., 2016).
In addition, although some Fragaria species can be identified by their morphological characteristics, some species that have similar morphological structures, such as F. mandshurica and F. orientalis (Staudt, 2003;Lei et al., 2017), are likely to be inaccurately identified if morphological indexes are not collected . Furthermore, Yang et al. (2020) found that most wild Fragaria in Yunnan were difficult to accurately identify without flowers and fruits of collected species. In total, owing to the stage of species development and subjective factors (such as differences in personal knowledge and experience), traditional morphology classification is often inconsistent and unreliable, which can also affect the results of species identification . Over the years, many molecular analyses have provided insights into the species taxonomy and identification. Currently, DNA barcoding such as rbcL, matK, psbA-trnH and ITS sequences (Potter et al., 2007;Fazekas et al., 2008;Chen et al., 2013;Xin et al., 2013;Tegally et al., 2019;Phi et al., 2020;Islam et al., 2021) has been used for species identification. In Fragaria, the study of DNA barcoding dates back to the phylogenetic analysis conducted by Potter et al. (2000) using ITS and trnL-trnF sequences, but the authors were unable to completely distinguish among species in this genus. Njuguna and Bassil (2011) analyzed psbA-trnH and ITS sequences and reported that the two DNA barcoding sequences are not suitable for species identification in Fragaria. Moreover, more and more studies demonstrated that the four universal barcoding sequences were problematic with low bootstrap support and inability to distinguish between species in land plants (Xin et al., 2013;Tegally et al., 2019;Phi et al., 2020;Islam et al., 2021). Therefore, it is urgent to excavate superior DNA barcoding for special land plants, including Fragaria species, utilizing complete cp genomes, which contain more important variation information for taxonomic and phylogenic purposes (Huang et al., 2014).
In the present study, we sequenced 27 complete cp genomes of 11 wild Fragaria species from different collection sites in China and downloaded the whole cp genome sequences of seven more Fragaria species. This research had the following objectives: (1) to describe the characteristics of Fragaria cp genomes; (2) to infer the phylogenetic relationships among Fragaria spp.; (3) to detect the variations among Fragaria cp genomes and infer the evolution of the whole cp genomes of Fragaria species; and (4) to provide candidate DNA barcodes for Fragaria species identification. These results provide new insights into the interspecies relationships and evolution of Fragaria spp. as well as basic reference material for the application of Fragaria germplasm resources.

Plant Material Collection and Genome Data Sources
Twenty-seven individuals belonging to 11 Fragaria species were included in the present study, as summarized in and de novo assembly was performed according to the following processes. The other seven species listed by Sun et al. (2021) were excluded for duplication of our own sequencing species.

Genome Sequencing and Assembly
Fresh and clean leaves of each sampled species were collected and frozen in liquid nitrogen immediately. The samples were used to extract the total DNA by the modified CTAB method (Doyle and Doyle, 1987). Then, paired-end sequencing (Insert size: 350 bp) was performed using the Illumina Novaseq 6000 platform (Illumina, San Diego, CA, United States). Raw reads were filtered to obtain high-quality clean data. Then, de novo assembly was performed with the GetOrganelle software package (Jin et al., 2020).

Chloroplast Genome Annotation
Thirty cp genomes were annotated using the online GeSeq tool (Tyagi et al., 2020)

Hypervariable Site Identification
All 18 sequences were aligned using the Geneious prime 2021.1.1 plugin MAFFT v.7.450 (Katoh and Standley, 2013), and the alignment was manually adjusted. Then, we performed a sliding window analysis using DnaSP v6 software (Rozas et al., 2017;Jeon and Kim, 2019) to analyze nucleotide diversity (π) in order to detect hypervariable sites among Fragaria cp genomes. The window length was set to 600 bp, and the step size was set to 200 bp (Sun et al., 2021).

Comparative Analyses
The detailed comparisons of LSC, SSC, and IR boundaries in 19 Fragaria complete cp genomes are shown in Figure 2, revealing differences at boundary regions. There were five genes around borders of these regions, including rps19, rpl2, ycf1, ndhF, and trnH, in which rpl2 and ycf1 had two copies.  (Figure 2). Annotation of the F. gracilis cp genome was used for plotting the total sequence identity of the 18 Fragaria cp genomes in mVISTA (Figure 3). Overall, there was a high similarity among Fragaria species, except F. chinensis_1, F. viridis, and F. orientalis. Furthermore, these results showed non-coding regions are more divergent than coding regions. In our research, the most divergent coding regions in the Fragaria cp genomes analyzed were rpoC2, psbJ, ycf1, and ndhA. In addition, the representative most divergent non-coding regions were rps16-trnQ, trnR-atpA, petN-psbM, trnT-trnL, ndhC-trnV, petA-psbJ, trnP-psaJ, rpl32-trnL, and rps15-ycf1. Nucleotide diversity (π) was calculated to evaluate the sequence variability level in Fragaria cp genomes using DnaSP 6.0 software (Figure 4). The values ranged from 0 to 0.01016, and the average value was 0.00167. Four more polymorphic regions (π > 0.007) were identified, including trnS-trnG, trnR-atpA, trnC-petN, rbcL-accD, and psbE-petL, in which rbcL-accD had the highest π value, and all of these regions were located in the LSC region.

Phylogenetic Analysis
The GTR + I + G model was selected as the best-fit evolutionary model by using Modeltest 3.7. The phylogenetic tree was constructed using 34 complete cp sequences of Fragaria species, with P. fruticosa and D. saviczii as outgroups. As shown in Figure 5, Fragaria species can be clustered into two groups, A and B, with 100% bootstrap support. Group A included two subgroups, A1 (F. chinensis and F. daltoniana) and A2. Subgroup A2 contained six Fragaria species (F. nubicola, F. pentaphylla, F. corymbosa, F. moupinensis, F. gracilis and F. tibetica) that are mainly distributed in western China. Group B was composed of the remaining species, which included F. nilgerrensis, F. mandshurica, F. viridis, F. orientalis, F. moschata, F. × ananassa, F. chiloensis, F. virginiana, F. vesca ssp. vesca and F. vesca ssp. bracteata. These latter species are from Europe and America, apart from F. nilgerrensis, which is mainly distributed in southeast Asia, F. mandshurica, which is distributed in northeast China, and F. orientalis, which is mainly found in northern Asia.

Variations and Evolution of Whole Cp
Genomes of Fragaria spp.
In this study, 27 cp genomes of Fragaria species were sequenced and found to range in size from 155,479 to 155,832 bp, which falls within the cp genome size range for angiosperms but tends to be smaller than the cp genomes of other Rosaceae species (Palmer, 1985;Cheng et al., 2017). LSC regions showed the most difference in size, ranging from 85,471 to 85,726 bp. Additionally, the inferred structures and gene contents were in accordance with previous research (Sun et al., 2021).
Overall, Fragaria cp genomes were highly conservative, both in sequence and structure. Analysis with mVISTA showed that there is high similarity among Fragaria species apart from F. chinensis_1, F. viridis, and F. orientalis. We also observed that most variable regions were located in LSC, and non-coding regions were more variable than coding regions. This is a common phenomenon in the cp genomes of most angiosperms (Nazareno et al., 2015;Cheng et al., 2017;Asaf et al., 2018;Tyagi et al., 2020). Additionally, some of the most divergent regions of ycf1, rps16-trnQ, petN-psbM, and rpl32-trnL, as shown in Figure 5, were consistent with previous research (Cheng et al., 2017), indicating that these regions indeed evolve rapidly in Fragaria. FIGURE 1 | Gene map of the Fragaria chloroplast genome. Genes drawn inside the circle are transcribed clockwise, and those outside are transcribed counter clockwise. Color coding is used to show genes belonging to different functional groups. The inner circle indicates the range of the long single-copy region (LSC), short single-copy region (SSC), and inverted repeat regions (IRs) and also shows a GC content graph of the genome. In the GC content graph, the dark gray lines indicate GC content, while light gray lines indicate the AT content of each locus. The size differences among cp genomes in angiosperms may be caused by both the contraction and expansion of IR regions Zhao et al., 2015Zhao et al., , 2018. To elucidate this mechanism in Fragaria, we compared IR/SC boundaries of Fragaria cp genomes. In general, the distribution of border genes is conserved, but the distances between genes and the borders do differ somewhat. The distances between genes and IR/SC borders of F. virginiana, F. orientalis, and F. × ananassa are in accordance with the prior report by Cheng et al. (2017). Fragaria mandshurica, F. viridis, F. moschata, F. × ananassa, F. vesca ssp. vesca and F. vesca ssp. bracteata showed the same gap between rps19, rpl2, ycf1, trnH, and IR/SC junctions, which may explain why these cp genomes are more conserved. Additionally, some species also exhibited the same distances between the rps19 and LSC/IRa border, including F. pentaphylla, F. moupinensis, and F. tibetica. These results also indicated a low level of molecular divergence in the genus Fragaria.
Additionally, rpl2 and rps19 differed in their distances from the LSC/IRa border, which may be owing to IR contraction and expansion. Compared with F. × ananassa, the IR regions of F. pentaphylla, F. daltoniana, F. nubicola, F. chinensis, F. corymbosa, F. moupinensis, F. orientalis, F. gracilis, F. tibetica, F. virginiana, and F. chiloensis expanded to different degrees. Therefore, IR regions of these species are longer than that of F. × ananassa. For the other species in the family Rosaceae, such as Malus (Terakami et al., 2012), Prinsepia (Wang et al., 2013), Pyrus (Li et al., 2018), and Prunus , rps19 crossed the LSC region and IR region. Additionally, rps19 extended to the IRa region, resulting in the presence of ѱrps19 having the same length within IRb region. Notably, we found that rps19 was located inside the LSC region. The contraction of rps19 inside the LSC region would result in the size of the Fragaria cp genomes being smaller than that of other species in Rosaceae.

Phylogenetic Analysis
The similar morphology of most Fragaria spp. and the geographical overlap in ranges may lead to taxonomic confusion among collected specimens (Johnson et al., 2014) and finally result in misjudgment of phylogenetic relationship. To explore the evolutionary relationships among Fragaria species, we constructed a phylogenetic tree of Fragaria species based on whole cp genomes with more than 99% bootstrap support across all nodes. Our results showed high consistency with previous results, especially for species clustering in group B (Njuguna et al., 2013;Sun et al., 2021). Additionally, our results include multiple individuals of each species from different collection sites, which increases the reliability of the phylogenetic analysis.
The phylogenetic analysis showed that F. × ananassa and two octoploids, F. chiloensis and F. virginiana, formed a group, which was in accordance with the inferred origin of F. × ananassa from a hybridization between F. virginiana and F. chiloensis (Staudt, 1962(Staudt, , 1989. In addition, the two octoploids were considered to share a common maternal ancestor that may be F. vesca or F. mandshurica (Harrison et al., 1997;Rousseau-Gueutina et al., 2009;Dimeglio et al., 2014). Fortunately, we observed that F. vesca ssp. bracteata is evolutionarily closely related to the two octoploids. Similar results were also presented by Sun et al. (2021). Additionally, Njuguna et al. (2013) and Edger et al. (2019) suggested that F. vesca ssp. bracteata was probably the last diploid progenitor to the octoploid species. Thus, the conclusion that F. vesca ssp. bracteata is an ancestor of octoploid species was strengthened. However, the other ancestors of the octoploid species remain uncertain.
The evolutionary relationships of F. nubicola, F. pentaphylla, F. chinensis, F. daltoniana, F. corymbosa, F. moupinensis, F. gracilis,  psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, pabL, psbM, psbN, psbT -Gueutina et al., 2009;Sun et al., 2021). These species are mainly distributed in Western China (Lei et al., 2017) and, in our results, they were clustered into group A. In group A2, diploid F. pentaphylla was sister to the tetraploids F. moupinensis and F. tibetica with 99.9% bootstrap support (Figure 5), which was consistent with previous research (Njuguna et al., 2013;Sun et al., 2021). In addition, F. pentaphylla had been suggested to be the diploid ancestor to F. moupinensis (Rousseau-Gueutina et al., 2009;Kamneva et al., 2017). Lei et al. (2017) hypothesized that F. tibetica is a descendant of F. pentaphylla based on their runner branching and number of leaflets. Thus, it can be hypothesized that the tetraploid species F. moupinensis and F. tibetica may share the same female parent of F. pentaphylla, which is supported by their more similar morphological characteristics (Lei et al., 2017) and overlapping distribution in Southwestern China (Staudt, 1989;Johnson et al., 2014;Lei et al., 2017). In our study, we revealed a sister relationship between F. corymbosa and F. gracilis, which was consistent with Rousseau-Gueutina et al. (2009). And F. corymbosa and F. gracilis have some similar morphological characteristics, such as runners are filiform and monopodial, petioles and peduncles have spreading hairs, fruits are red and tasteless, calyx is reflexed, etc. (Lei et al., 2017). So our results strengthened the point that F. corymbosa and F. gracilis may have the same ancestor (Rousseau-Gueutina et al., 2009). In addition, F. corymbosa and F. gracilis may be the descendent of F. chinensis (Staudt, 2009;Yang and Davis, 2017). Notably, in this study, all accessions of F. chinensis and F. daltoniana were clustered into group A1, in contrast with the findings of previous studies (Yang and Davis, 2017;Sun et al., 2021). Therefore, future research should explore the relationship among F. chinensis, F. corymbosa, and F. daltoniana. In the future, multiple molecular markers, including cp and nuclear sequence data from more samples from different geographical populations, should be combined with geographical distribution data to analyse ancestral state reconstruction to clarify their phylogenetic ancestor relationships among F. moupinensis and F. tibetica; F. chinensis, F. corymbosa, and F. daltoniana; F. vesca ssp. bracteata and the other octoploid species.
Chloroplast capture is an important process of plant evolution (Okuyama et al., 2005). Hybridization and repeated backcross, the cytoplasm of one species is replaced by the cytoplasm of another species through gene flow infiltration, so that the genetic components of the species not only have nuclear genome components inherited from parents, but also capture new chloroplast gene components (Fehrer et al., 2007). More and more studies have proved the phenomenon of organelle DNA introgression (Du et al., 2011), and the phenomenon of chloroplast introgression between plant species has also been observed in previous studies on hazelnut (Hu et al., 2020). In this study, the phylogeographical relationships among Fragaria species were not declared for the lack of geographical population collections.  3 | Visualized alignment of the Fragaria chloroplast genome sequences with annotations, using mVISTA. Each horizontal row shows the graph for the pairwise sequence identity with the Fragaria gracilis chloroplast genome sequence. The x-axis represents the base sequence of the alignment, and the y-axis represents the pairwise percent identity ranging from 50 to 100%. Gray arrows indicate the position and direction of each gene. Red indicates non-coding sequences (CNS); blue indicates the exons of protein-coding genes (exons); lime green indicates ribosomal RNA (rRNA) or transfer RNA (tRNA) sequences.
However, clear geographical patterns (Western China, Southwestern China, and Northeastern China) have been clearly inferred in phylogenic tree figure. Chloroplast capture could be another explanation why the chloroplast genome analysis does not appear to reflect the species phylogeographical relationship (Tsitrone et al., 2003). Further research should be conducted by combining multiple molecular tools (e.g., nuclear DNA sequences) together with more comprehensive sampling to clarify if chloroplast capture does occur in Fragaria genus.

Candidate Barcoding Sequences for Fragaria
Species identification based on morphology is affected by season, environment, and human factors, which may cause results to be unreliable . In recent years, DNA barcoding has been widely used to promote accurate species identification owing to its clear advantages (Tegally et al., 2019;Phi et al., 2020;Islam et al., 2021). The ideal DNA barcode would be a single locus that could be universally amplified and sequenced across a broad range of taxa and provide sufficient variation to reliably distinguish among closely related species (Song et al., 2017). Many introns, coding regions, and intergenic regions, such as trnL-trnF (Potter et al., 2000), accD-psaI , ycf1-ndhF (Amar, 2020), matK, and trnK (Hilu et al., 2008), have been used as barcodes for constructing phylogenetic relationships. In this study, with the threshold of nucleotide diversity as 0.007, the five intergenic regions, trnS-trnG, trnR-atpA, trnC-petN, rbcL-accD, and psbE-petL were found to be the most divergent and provide potential information for species identification and phylogenetic analyses of Fragaria. Among them, three intergenic regions, including trnS-atpA, rbcL-accD, and psbE-petL were also suggested in Sun et al. (2021). However, FIGURE 5 | A maximum likelihood phylogenetic tree was reconstructed based on 34 Fragaria cp genomes. Potentilla fruticosa and Drymocallis saviczii were used as outgroups, and −2x, −4x, and −8x represent the different ploidies of Fragaria spp. Nodes marked with capital letters are discussed in the text. the threshold of nucleotide diversity used in Sun et al. (2021) was only 0.006, resulting in a low nucleotide diversity of the selected candidate barcoding sequences. In addition, the amplified fragments of the selected intergenic regions trnS-atpA more than 2,000 bp in length would reduce the success of the sequencing. In our study, trnS-trnG and trnR-atpA are located in trnS-atpA regions and the length of these two regions are about 700 bp, which will result in the high success of sequencing. Therefore, trnS-trnG and trnR-atpA are more suitable than trnS-atpA to be the potential candidate barcoding sequence.

CONCLUSION
This study provides 27 complete cp genome sequences of 11 wild Fragaria species. Comparative analysis of cp genomes of Fragaria species revealed that their genome structure is highly conserved. However, IR expansion or contraction was observed among different Fragaria cp genomes, resulting in cp genomes of different sizes. Five identified highly variable gene regions (trnS-trnG, trnR-atpA, trnC-petN, rbcL-accD, and psbE-petL) showed strong potential for species identification and phylogenetic relationship construction in the genus Fragaria. Phylogenetic analysis indicated that F. vesca ssp. bracteata may be one of the progenitor species of octoploids. Similarly, we hypothesize that F. pentaphylla is one of the progenitors of F. corymbosa and F. tibetica. The analysis of multiple molecular markers combined with morphological characters would be helpful for future research to test this hypothesis.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository and accession number(s) can be found below: NCBI repository, accession numbers MZ851747 and MZ851773.

AUTHOR CONTRIBUTIONS
JL designed the research. CL, CC, YT, ZS, and MJ performed the research and analyzed the data. CL wrote the first draft of the manuscript. All authors commented on previous versions of the manuscript. All authors contributed to the article and approved the submitted version.