The History and Diversity of Rice Domestication as Resolved From 1464 Complete Plastid Genomes

The plastid is an essential organelle in autotrophic plant cells, descending from free-living cyanobacteria and acquired by early eukaryotic cells through endosymbiosis roughly one billion years ago. It contained a streamlined genome (plastome) that is uniparentally inherited and non-recombinant, which makes it an ideal tool for resolving the origin and diversity of plant species and populations. In the present study, a large dataset was amassed by de novo assembling plastomes from 295 common wild rice (Oryza rufipogon Griff.) and 1135 Asian cultivated rice (Oryza sativa L.) accessions, supplemented with 34 plastomes from other Oryza species. From this dataset, the phylogenetic relationships and biogeographic history of O. rufipogon and O. sativa were reconstructed. Our results revealed two major maternal lineages across the two species, which further diverged into nine well supported genetic clusters. Among them, the Or-wj-I/II/III and Or-wi-I/II genetic clusters were shared with cultivated (percentage for each cluster ranging 54.9%∼99.3%) and wild rice accessions. Molecular dating, phylogeographic analyses and reconstruction of population historical dynamics indicated an earlier origin of the Or-wj-I/II genetic clusters from East Asian with at least two population expansions, and later origins of other genetic clusters from multiple regions with one or more population expansions. These results supported a single origin of japonica rice (mainly in Or-wj-I/II) and multiple origins of indica rice (in all five clusters) for the history of rice domestication. The massive plastomic data set presented here provides an important resource for understanding the history and evolution of rice domestication as well as a genomic resources for use in future breeding and conservation efforts.


INTRODUCTION
The domestication and improvement of wild plant species forms the basis of our current global food system (Hancock, 2004). Among plant lineages that have been domesticated to feed humans directly, Asian cultivated rice (Oryza sativa L.) is one of the most important given that it provides roughly 20% of the calories consumed by humans worldwide (Smith, 1998). Like other core staple food crops, documenting and analyzing wild and cultivated genetic diversity is the basis for increasing yield and the development of insect/drought/disease resistant varieties (Brown, 1987;Valliyodan et al., 2017;Wang H. et al., 2018;Vikas et al., 2020). Currently, the genetic diversity (especially in the wild) of the crop species upon which humans depend for survival are profoundly threatened from forces such as habitat loss, climate change, regional conflict, human population growth, and anthropogenic vectoring of pathogens (Wilkes and Williams, 1983;Smith et al., 2006;Ford-Lloyd et al., 2011;Hoban et al., 2020). Thus, it is increasingly necessary to improve our understanding of rice domestication and genetic diversity.
With the advent of high-throughput DNA sequencing technology many genetic studies of crop origins have focused on allelic variation from whole nuclear genomes (Rojas-Barrera et al., 2019;Razifard et al., 2020) or domestication genes (Scott et al., 2019;Liu et al., 2020;Lu et al., 2020) with limited use of whole plastid genomes for inferring crop origins. However organellar genomes (in particular plastids) possess several advantageous qualities in the inference of crop origin and diversity including high copy number, which is useful in degraded archeological finds (Schlumbaum et al., 2008;Wagner et al., 2018), uniparental inheritance resulting in more recent inferences for coalescence times (Palumbi et al., 2001), lack of recombination reducing problems associated with introgression and incomplete lineage sorting (Hu et al., 2016), and a variable mutation rate which is useful for inferences at multiple taxonomic levels (Pollmann et al., 2005). There are numerous well-known examples using organellar DNA in studies of origin, dispersal, and diversity, like the resolution of our own human history using mitochondrial DNA (Ballinger et al., 1992;Forster et al., 2001;Tanaka et al., 2004) and the use of plastid DNA as a standard phylogenetic marker in the Angiosperm Phylogeny Group (APG) classification of flowering plant orders and families (Byng et al., 2016). Therefore, pangenomic approaches to plastomic DNA analyses are expected to provide useful insights into the origin, and diversity of numerous domesticated crop species.
As a staple food crop which feeds over 3.5 billion people on the planet, domestication of cultivated rice ranks as one of the most important developments in the history of humankind. Accordingly, the origins and demographic history of Asian cultivated rice (O. sativa) have been frequently studied using a wide range of analytical approaches mainly from the fields of genetics and archeology. Yet debates regarding the exact origin and timing of domestication persist (Londo et al., 2006;Caicedo et al., 2007;Fuller et al., 2009;Molina et al., 2011;Huang et al., 2012;Civan et al., 2015;Choi et al., 2017), especially regarding the number and source of origins with some researchers suggesting a single origin from one population while others proposed multiple distinct origins from different populations (Sang and Ge, 2007;Gross and Zhao, 2014;Civan et al., 2019). The singleorigin hypotheses primarily focus on domestication genes (such as the non-shattering gene sh4) proposing that a desirable set of alleles occurred once, was selected for and that subsequent introgression with wild gene pools (while maintaining strong selection for the domestication genes) accounts for the current high diversity found in domesticated rice (Li C. B. et al., 2006;Sweeney et al., 2006;Vaughan et al., 2008). In contrast, the multiorigin hypotheses posit that distinct lineages of wild rice (Oryza rufipogon) in multiple locations were the source of different domesticated lineages (e.g., indica and japonica) and that the multi-origin sourcing accounts for most of the current diversity in domesticated rice rather than amalgamative introgression (Gross and Zhao, 2014).
In the present study, we constructed a large plastome data set containing 1445 complete de novo assembled genomes from wild and domesticated accessions in order to address the main outstanding questions in rice domestication: (1) Is the current diversity of domesticated rice derived from a single or multiple independent lineages? and (2) whether a common geographic origin or multiple geographic origins existed for the lineages? We employed numerous analytical approaches from population genetics and phylogeography to test these hypotheses. The results from these analyses were interpreted in light of previous work on rice history such that an updated model based on matrilineal diversity could be presented for this essential food crop. Our results from this large dataset of complete plastid genomes for wild and domesticated rice resolves distinct patterns in the evolution and domestication of different rice lineages. As with similar studies on the history and diversity of crop species the results herein will be directly applicable to crop improvement as well as in situ and ex situ conservation efforts.

Raw Dataset
Whole genome sequencing (WGS) data for O. rufipogon and O. sativa accessions were obtained from the EMBL database uploaded from previous studies (Huang et al., 2012;Wang W. S. et al., 2018) Figure 1). The samples were first selected randomly from the original datasets and then adjusted to ensure that they were distributed evenly according to geographic distribution. The accessions of O. sativa were classified into five categories including indica, japonica (including tropical japonica, typical japonica, and temperate japonica), aus, aromatic, and intermediate based on the identification included in the uploaded information (Wang W. S. et al., 2018). Nineteen published plastomes of Oryza species were downloaded from the NCBI database (Supplementary Table 2).

The de novo Assembly and Annotation of Complete Plastid Genomes
The raw sequencing data were processed using FastQC v0.11.5 and NGSQCToolkit v2.3 software to control sequence quality, and were then filtered using BWA and SAMtools (Li and Durbin, 2010) software to extract plastid-origin reads that were properly paired to the reference plastid genomes from Oryza (Supplementary Table 3). The de novo assembly of complete plastid genomes were conducted using SPAdes (Bankevich et al., 2012). Several strategies were applied in the de novo assembly pipeline: (1) The problem of excessive variation introduced by using highly variable reference sets was reduced by using a narrower set of reference plastomes only from Oryza and (2) the F 12 option in SAMtools was used to bait target reads of plastomes from the WGS dataset. The obtained plastid contigs were further visualized and scaffolded to complete circular genomes in Bandage (Wick et al., 2015). The read coverage depth of each contig was used to identify and remove the mis-assembled contigs derived from nuclear or mitochondrial genomes, which were usually lower than one-fifth the depth of plastid contigs. Other optimized options were also applied as reported in our previous study (He et al., 2020).
Summary statistics were calculated to evaluate the quality of the assembled plastid genomes by using QUAST (Gurevich et al., 2013). Gene annotation was performed using Geseq (Tillich et al., 2017) and manually modified if necessary. Gene function annotation of nucleotide variations were performed with ANNOVAR software (Wang et al., 2010).

Haplotype and Genetic Diversity Analyses
Population structure of all rice accessions were inferred based on plastomic single nucleotide variants (SNVs) by using Admixture software (Alexander et al., 2009), with 5 replications for different number of potential sub-populations (K) ranging from 1 to 25. The runs at each K showing the minimum cross-validation (CV) errors were used as optimal results to estimate the number of sub-populations. For the optimal K, the run with the lowest CV error was used to assign accessions with membership probability ≥0.65 to a sub-population (genetic cluster). Accessions with membership probability <0.65 were assigned to an unclassified group. Genealogical relationships of the identified haplotypes were inferred using a median joining method in Popart v1.7 (Leigh and Bryant, 2015) with annotations to the output figure completed in Adobe Illustrator software (Adobe Systems Incorporated, United States). Haplotype diversity, evolutionary distances based on the Tajima-Nei model, and population differentiation (Fst) were calculated for each group of haplotypes based on plastomic SNVs by using DnaSP v6.0 (Rozas et al., 2017) and MEGA7 (Kumar et al., 2016). The variational autoencoder plotting was calculated with popvae software (Battey et al., 2021). The principal coordinates analysis (PCoA) and multidimensional scaling (MDS) were conducted in TASSEL 5 (Bradbury et al., 2007). Nucleotide diversity (π) within and between different genetic clusters was calculated and tested using the pairwise differences method in Arlequin 3.5 (Excoffier and Lischer, 2010).

Phylogenetic Analysis and Molecular Dating
To perform the phylogenetic analysis, complete sequences of plastid genomes were aligned in MAFFT (Katoh and Standley, 2013) and manually adjusted in MEGA7 (Kumar et al., 2016). Phylogenetic analyses were conducted using maximum likelihood methods in IQ-Tree (Nguyen et al., 2014), and the Bayes information criterion was used to determine the best-fit model for nucleotide substitution. Branch support was determined with ultrafast bootstrap of 2000 replicates. The final trees were then plotted using the online tool in the Interactive Tree of Life. 1 For molecular dating, a set of representative haplotypes from major clades in the phylogenetic tree (consistent with the genetic clusters) was used for analyses. The smaller data set was used in order to simplify analyses and avoid zero-length branches from including repeat or highly similar haplotypes (Hou et al., 2014). A relaxed lognormal molecular clock with Yule priors was chosen to estimate the divergence time of lineages as implemented in BEAST v1.8.4 (Drummond et al., 2012). Three calibration points, from previous studies 2 were employed: the divergence between O. rufipogon and Oryza punctata [5.7 mya 95% highest posterior density (HPD): 2.6-8.9], the crown age of O. rufipogon (2.32 mya HPD: 2-2.93), and the divergence time between O. rufipogon and Oryza glaberrima and its wild relative Oryza barthii (estimated at c. 0.83 mya HPD: 0.51-1.11). Following the suggestion of Ho (2007), we assigned a normally distributed prior for the three calibrations, in which standard deviations of 2, 0.3, and 0.2 were used, respectively.
The Markov Chain Monte Carlo procedure was run for 100 million generations, with sampling every 10,000 generations. Tracer v.1.7.1 3 was used to assess appropriate burn-in and the adequate effective sample size values (>200). A burn-in of 25% was applied, and the maximum clade credibility (MCC) tree with the mean ages and 95% HPD intervals on nodes was produced in TreeAnnotator v1.8.4 (part of the BEAST package) and visualized in FigTree v.1.4.2. 4

Biogeographic Analyses and Historical Population Dynamics
For the ancestral range reconstruction, seven geographical areas based on the regions of endemism of rice were defined (Takhtajan, 1986;Wu and Wu, 1996). R package BioGeoBEARS (Matzke, 2013) was used to compare biogeographical models and estimate ancestral ranges of the Oryza populations using the MCMC tree from BEAST. Three biogeographical models were compared using a ML framework: Dispersal-Extinction-Cladogenesis (DEC) model (Ree and Smith, 2008), dispersalvicariance analysis (DIVA) (Ronquist, 1997), and BayArea (Landis et al., 2013), with each model run with and without the + J parameter. The fit for the different models was assessed using AIC and AICc scores. The maximum range size was set to four, as no extant species occurs in more than four biogeographical regions. To test the robustness of the ancestral range reconstruction, we also used statistical dispersal-vicariance analysis (S-DIVA) (Yu et al., 2010) and DEC (Ree and Smith, 2008) in RASP v3.2 (Yu et al., 2015). MaxEnt species distribution modeling was used to predict past and current climatically suitable areas for different genetic clusters of wild rice (Phillips et al., 2017). Nineteen bioclimatic variables with the 2.5 arc-minute spatial resolution were downloaded from WorldClim 1.4 5 (Hijmans et al., 2005). The neutrality test and mismatch distribution for plastomic variations of different groups were conducted using DnaSP v6.0 (Rozas et al., 2017). The reconstruction of population demography was estimated with stairway plot based on folded site frequency spectrum (SFS) as described in Liu and Fu (2015). The assumed mutation rate (µ) of plastomes in Oryza species was defined as 1.5 × 10 −10 mutations per site per generation which was calculated in BEAST v1.8.4 (Drummond et al., 2012).

De novo Assembly of Rice Plastomes Based on Whole Genome Sequencing Data
Based on the WGS data derived from the EMBL database (Huang et al., 2012;Wang W. S. et al., 2018), de novo assembly of 1445 plastid genomes (also referred to as plastomes) was completed from 1135 O. sativa and 295 O. rufipogon accessions and 15 in other rice species with AA genomes in Oryza (Table 1 and  Supplementary Table 1). A total length of 7.0-224.8 Mb of reads was yielded per plastome when the original WGS data was mapped to the reference genome, with read coverage averaging 222.6×. The de novo assemblies of each plastome yielded from 3 to 67 contigs, with an average of 7.1, for each accession, which were further extended and connected according to Kmer depth and De Bruijn paths among them. Accordingly, a total of 1445 high-quality complete circular genomes were assembled for each of the 1445 rice accessions. The obtained plastomes had an average length of 134,527.3 bp, and varied from 134,518 to 134,695 bp, with an overall average GC content of 39.0%, varying from 39.0 to 39.5%. A typical quadripartite structure was resolved among all of the genomes and included two inverted repeats (IRs) with an average length of 20,802.7 bp and an average GC content of 44.3%, separated by a large single copy region (LSC) with an average length of 80,575.7 bp and an average GC content of 37.1%, and a small single copy region (LSC) with an average length of 12,346.2 bp and an average GC content of 33.3%. The highest GC content was observed in the IRs, followed by the LSC and SSC, suggesting that different functions (e.g., gene regulation; Smarda et al., 2014) and/or evolutionary biases exist between these regions.

Maternal Phylogeny and Population Structure Reveal Distinct Genetic Clusters in Wild and Cultivated Rice
To investigate the plastomic relationships among all rice accessions, phylogenetic and population structure analyses were conducted, that consistently resulted in the resolution of multiple distinct genetic clusters with high support among both wild and cultivated accessions (Figure 1 and Supplementary Table 5).
A total of 12 genetic clusters and 1 unclassified genetic cluster for all accessions were obtained from their phylogenetic relationship and the best partition of K = 12 as calculated in ADMIXTURE, although a secondary optimal partition of K = 17 was also observed (Figures 1A,B and upplementary Figure 3). Two major maternal lineages were observed across all the O. rufipogon and O. sativa accessions, which could be further resolved into 4 O. rufipogon-wild-type genetic clusters (Or-w-I, Or-w-II, Orw-III, and Or-w-IV), and 2 O. rufipogon-wild-and-cultivatedtype clusters (Or-wj and Or-wi). The cultivated accessions were separated into Or-wj and Or-wi with multiple well-supported genetic clusters within each. Three O. rufipogon-wild-to-japonica (Or-wj) genetic clusters were resolved. The genetic clusters Orwj-I and Or-wj-II mainly consisted of japonica cultivated varieties with some other cultivars, while Or-wj-III contained most of the aromatic rice cultivars that have often been considered as a related ecotype of japonica rice (Figure 1). Two O. rufipogon-wild-toindica (Or-wi) genetic clusters were resolved and referred to as Or-wi-I and Or-wi-II. The Or-wi-I genetic cluster consisted of mainly indica rice cultivars and the aus rice ecotype, while the Or-wi-II genetic cluster contained indica cultivated rice and a high percentage (45.1%) of wild rice accessions (Figures 1, 2 and Supplementary Table 5). Almost all the wild rice accessions fell into one of four genetic clusters Or-w-I, Or-w-II, Or-w-III, or Or-w-IV, among which the Or-w-I was most closely related to the Or-wj genetic clusters and the Or-w-II was closest to the Or-wi genetic clusters ( Figure 1A).  (Figure 1). The vast majority of cultivated rice accessions (1114/1137) were clustered in either the Or-wj or Or-wi groups. Notably, 23 domesticated accessions clustered into various wild genetic clusters indicating a more distant maternal relationship between them (Figures 1, 2).

Nucleotide Divergence of Plastomes Within and Between Different Genetic Clusters
To further assess the nucleotide divergence of different clusters, a haplotype network of plastomes was constructed for all the accessions of wild and cultivated rice species from this study (Figure 2). In total 266 haplotypes were obtained from 1464 accessions which were resolved into 12 well supported genetic clusters and 1 mixed genetic cluster, which were classified as described above. Haplotype diversity in this study exhibited a similar pattern to other studies in grain crops where a broader diversity is found in wild haplotypes than in cultivated haplotypes. That is, clusters containing wild accessions exhibited a greater number of differences between haplotypes within the cluster, whereas haplotypes in the cultivated genetic clusters were separated only by a single or few steps (Figure 2). The wild rice (O. rufipogon) genetic clusters (Or-w groups) showed higher haplotype diversity (as a function of accessions per haplotype and multi-step branching within each genetic cluster) with 81 haplotypes in total and 1-11 accessions per haplotype. While cultivated rice genetic clusters (Or-wi and Orwj groups) had 149 haplotypes in total with 1-301 accessions per haplotype. The nine most common haplotypes (>21 accessions) contained 77.3% of all the accessions in the cultivated rice genetic clusters. The cultivated clusters showed very low nucleotide diversity (π) with a range from 0.00047 to 0.0015 which is lower than wild genetic clusters ranging from 0.0035 to 0.014 (Figure 3 and Supplementary Figure 4). Similarly, the pairwise genetic distances within Or-wj (4.0-7.3) and Or-wi (4.2) were much lower than those of Or-w genetic clusters (31.4-41.9). This narrowing of genetic diversity is common among domesticated lineages wherein artificial selection and demographic bottlenecking can induce such patterns. Group and individual based calculations provided further evidence for a multiple origin with long-term artificial selection scenario for the domestication of rice. On average, each Or-wj accession differed from an Or-wi accession by 31.6 SNVs with a range of 29.4-34.0, which is roughly 8-fold more than detected between accessions within Or-wj (4.0, 6.0, and 7.3) or Or-wi (4.2) genetic clusters (Figure 3). When using F st (calculated from SNVs) to assess divergence between different clusters, the Or-wj genetic clusters all showed relatively large population differentiation to Or-wi genetic clusters with a range of F st from 0.946 to 0.979 at a significance level of p < 0.01. These values are larger than the range of F st between either the Orwj genetic clusters and Or-w-I (F st = 0.859-0.914; p < 0.01) or between Or-wi genetic clusters and Or-w-II (F st = 0.675, 0.863; p < 0.01). Additionally, the Or-wj genetic clusters showed the least genetic distance from Or-w-I (K = 24.3-26.3) while Or-wi genetic clusters were the least distant from Or-w-II (K = 18.2, 20.7). The variational autoencoder estimation (VAE), principal component analysis (PCA), and MDS plot were conducted to further investigate the genetic relationships of all the accessions based on plastomic SNVs. From all multidimensional individual based analyses, the accessions in Or-wj and Or-wi genetic clusters were grouped together in a similar manner to the other analyses and resolved closest to Or-w-I and Or-w-II, respectively (Figure 4 and Supplementary Figure 5). All of these results consistently resolve a pattern wherein the two genetic clusters containing mainly cultivated accessions were selected from two distinct wild lineages, thus supporting a multiple maternal genetic origin scenario for the history of rice domestication.

Molecular Dating, Phylogeography, and Historical Demography Reconstructions for the Origin and Timing of Rice Domestication
To infer the historical dynamics of origin and evolution for wild and cultivated rice, a total of 61 representative plastid haplotypes from the cultivated and wild rice genetic clusters were employed in molecular dating and historical biogeographic reconstruction (Figure 5 and Supplementary Figures 6, 7).  genetic clusters Or-wj-I (99.3% were cultivated accessions) and Or-wj-II (95.6%) and a wild-cultivated cluster Or-wj-III (61.7%) 0.43 mya (Supplementary Table 5 and Figures 5B,C). The Orwi genetic clusters have a stem age of 0.86 mya and further diverged into a mainly cultivated genetic cluster Or-w-I (97.3% were cultivated accessions) and a wild-cultivated cluster Or-w-II (54.9%) 0.26 mya (Supplementary Table 5 and Figures 5B,C). From the phylogeographic analyses the best supported scenario for Or-wj groups is an independent single origin of Or-wj-I/II from East Asia and multiple origins of Or-wj-III (Supplementary Table 6 and Figure 5). For the Or-wi genetic clusters the best supported origin was from East Asia with dispersal and secondary diversification in Southeast and South Asia (Figure 5 and Supplementary Figures 7, 8). These results revealed a predomestication landscape of the divergency of different maternal lineages in wild and cultivated rice.
From neutrality (Supplementary Table 7) and mismatch distribution tests (Supplementary Figure 9) all Or-wj and Or-wi genetic clusters were inferred to have undergone relatively recent population expansion which could be related to recent artificial selection associated with domestication compared to natural radiations among the wild progenitor lineages. Reconstruction of past effective population sizes and population historical distribution among the different groups suggests that population FIGURE 2 | Haplotype network of plastomes in cultivated and wild rice species. Different cultivars/species were denoted in the haplotype network, and the haplotypes without a subgroup label belonged to the O-unc genetic cluster. The cultivars/species were designated for each accession as in Figure 1.
FIGURE 3 | Genetic distances and population differentiation within and between different genetic clusters. Below diagonal: Pairwise genetic distances (K); diagonal: the nucleotide diversity (π) for each group; above diagonal: The population differentiation coefficient (Fst) between groups. expansion occurred in different regions at different times between the genetic clusters (Figure 6). While analysis of all the Orwj genetic clusters indicated a two-step population expansion pattern, the increases were not synchronous across genetic clusters with the first rapid increase in effective population occurring around 50 kya in Or-wj-II and Or-wj-III whereas in Or-wj-I this step occurs at about 15 kya. Similarly, the second rapid increases in effective population size between the Or-wj genetic clusters were asynchronous in all instances. Between the two Or-wi genetic clusters, inferred rapid expansion events were asynchronous in number, timing, and extent (Figure 6 and  Supplementary Figure 8). Nearly all of the recently inferred expansion events appear to have occurred between 5.0 and 10.3 kya, except for Or-wj-I which was estimated to be more recent (2.0-3.0 kya). Modeling for climatically suitable areas of different genetic clusters suggested that all Or-wj and Or-wi clusters underwent geographic expansions both during periods of LGM to mid-Holocene and mid-Holocene to the present except for Or-wi-I ( Figure 6C). The Or-wj-I/II were inferred to have a first dramatic expansion in East Asia and secondary expansion mainly in South Asia, while the other clusters appeared to expand in multiple regions in most periods. In total these results from different analytical approaches are consistent with the expectation of a multiple-origin scenario.

Large Scale de novo Assemblies of Rice Plastid Genomes
Plastid has been a part of plant cells for roughly one billion years and remains the essential compartment for harvesting light energy in the cells of eukaryotic photosynthetic organisms. The plastid genome (plastome) possesses a number of special characteristics such as lack of recombination, conserved gene order and content, and differential rates of mutation (Supplementary Figure 2). These characteristics make plastomes ideally suited in the study of evolution and historical biogeography in plants (Gitzendanner et al., 2018), intergenomic transfer between organelles and nuclear genomes (Ma et al., 2020), plastid genetic engineering (South et al., 2019), and breeding and conservation of important crop germplasm (Tonti-Filippini et al., 2017;Gao et al., 2019). However, among the ∼35,000 cultivated species (Khoshbakht and Hammer, 2008), fewer than 80 complete plastid genomes are available in the NCBI database. Although the One Thousand Plants Transcriptome Project (1KP) 6 , as well as other recent efforts have contributed over 1,000 complete or nearly complete plastid genomes to global databases, most of these are from plants that are not of economic importance (Gitzendanner et al., 2018;Leebens-Mack et al., 2019;Li H. T. et al., 2019). Additionally, several large-scale datasets analyzing plastomic variations of wild and cultivated rice accessions were recently published, however these studies relied on a mapping-reads-to-a-reference method instead of de novo assembly and complete genome scaffolding (Civan et al., 2019;Moner et al., 2020). As such these analyses may have omitted regions of greater dissimilarity as these methods do not perform for longer genomic regions of this type. To date, there remains a lack of de novo assemblies for complete plastid genomes available in public databases that are sampled at the population level for important crop species.
In the present study, to conduct large scale de novo assemblies of well over 1000 plastid genomes, we developed an accurate and effective pipeline for auto-assembly of rice plastid genomes from high throughout WGS data. Several strategies 6 https://db.cngb.org/onekp/ were employed to obtain the optimal result of complete highquality plastid genomes. The strategies included: (1) using only 102 plastomes in the Oryza genus as reference, instead of using a collection of dozens of distantly related species for reference which is the common practice; (2) applying the F 12 option when using SAMtools software to bait target reads of plastomes from the WGS dataset which resulted in improved baiting efficiency of plastomic reads from the WGS dataset and reduced non-target reads. Assembly parameters were also optimized according to the methods previously reported in He et al. (2020). Using this pipeline, we were able to assembly a total of 1445 completed plastomes of wild and cultivated rice accessions, with a high average depth of read coverage (222.6×). In order to confirm the accuracy of our assembly pipeline, two previously published plastomes for rice, cultivars Nipponbare (GU592207.1, AY522330.1) and TN1 (NC_031333.1) were selected to compare with the plastomes of the two accessions generated in this study (CX140 and CX162), respectively. Both genome sequences showed identical alignments with no nucleotide changes between them, further confirming the high reliability of the assembled plastome sequences generated by our pipeline. The results of this effort established a large set of 1445 complete plastomes including 1135 plastomes of Asian cultivated rice varieties and 310 plastomes of wild rice accessions. This dataset will be a very useful resource for future rice research on molecular diversity and phylogeny, plastid genetic engineering in rice, marker assisted breeding, and germplasm preservation efforts.

Multiple Primary Origins With Ongoing Selection and Admixture Generate Secondary Origins Resulting in a High Diversity of Cultivated Rice Haplotypes
While much work has been dedicated to understanding the origin and evolution of domesticated rice the issue of multiple versus single origins is a topic of ongoing debate. Most recent studies have concluded that a multiple origin scenario is most likely however the number of domestication events wherein key grain domestication loci were acquired remains unresolved (Huang et al., 2012;Civan et al., 2015Civan et al., , 2019Choi et al., 2017;Choi and Purugganan, 2018;Civan and Brown, 2018;Wang W. S. et al., 2018;Moner et al., 2020). The two major type of cultivated rice, indica and japonica, originated from different ancestral wild rice populations, with gene flows observed from japonica to indica as well as two relatively small ecotypes, aus and aromatic rice. The plastome data in this study provides more strong evidence for multiple origins of Asian cultivated rice and further resolution of the maternal relationships between its different subspecies/groups (Supplementary Figure 9). The cultivated rice and its wild relatives both derived from two major maternal lineages which could be traced back to a most recent common ancestor before 1.31 mya (Figure 5B). Extensive and directional introgressions were indicated between different cultivated groups according to polyphyletic cultivar designations to different plastid lineages, e.g., a mainly japonica (61.9%) type of genetic cluster Or-wj-II contained a large proportion of indica  accessions (26.3%); more than half (58.2%) of aromatic rice accessions shared a common maternal lineage in the genetic cluster Or-wj-III, in which 31.6% accessions were indica or aus rice (Figures 1, 2 and Supplementary Table 5). Those results suggested abundant introgressions from japonica and aromatic rice to indica and aus rice cultivars. Furthermore, when comparing the different cultivated genetic clusters by using demographic and phylogeographic reconstruction the patterns appear to vary among the groups suggesting differences in geographic origin, intensity targets of selection, as well as timing of the initiation of these steps. It should be noted that the process of domestication by relaxing natural selection and increasing effective population sizes might have biased divergence time estimates. From the stairway plots ( Figure 6) the inferred early diverging domesticated genetic clusters Orwj-III and Or-wi-II had secondary rapid increases in effective population size starting around 15 and 7 kya, respectively. These times correspond roughly with dates inferred from other studies for when rice domestication could have reasonably started in China for japonica rice and in India for indica rice (Gross and Zhao, 2014). The older increases inferred in effective population size may have been the result of population responses to climatic changes related to past glacial events. Because the increases in effective population size are inferred to have occurred earlier in japonica genetic clusters, a scenario in which Sh4 and other genes evolved once in East Asian japonica and were introgressed into indica lines is not unreasonable but also not the only explanation for this pattern. Evidence for introgression is present in this plastome dataset in the form of polyphyletic cultivar designations to different plastid lineages.
While plastid genomes are not generally considered to possess genes involved with the domestication syndrome the patterns found in plastomic data nonetheless provide invaluable data for reconstructing the domestication history of a crop species. For instance, the loss of nucleotide diversity is a common feature of grain domestication (Haudry et al., 2007;Gross and Olsen, 2010;Avni et al., 2017) and was evident in our plastid data as well. While nucleotide diversity was low in the domesticated groups compared to the wild groups, haplotype diversity was higher (Hd: Or-wj-I = 0.744, Or-wj-II = 0.887, Or-wj-III = 0.801; Supplementary Table 8). However, the high haplotype diversity in cultivated groups is mainly the result of many single step changes from one or two highly abundant haplotypes possibly reflecting the relaxation of natural selection and correlated increase in effective population sizes among domesticated lineages (Petit and Barbadilla, 2009;Liu et al., 2019). This contrasts with the wild genetic clusters wherein many multiple step branching events typify the diversity of haplotypes (Figure 4 and Supplementary Figure 10). Similar patterns between progenitor and derived domesticated lineages have been described for numerous cultivated and domesticated species (Bjornerfeldt et al., 2006;Miller and Gross, 2011;Tembrock et al., 2017;Alves-Pereira et al., 2018). Because plastids are not thought to be a target of selection during domestication and are mostly uniparentally inherited, they are considered superb markers to avoid the interruption of artificial intervention to the genomic mutation rate in resolving the geographic origins and timing of domestication in crop species. While these and similar characteristics do make plastid data well suited for resolving questions of crop origin such data are also not expected to be entirely neutral during domestication as plastids are important in numerous organismic processes such as environmental adaptation, growth rate, and cytonuclear incompatibility during reproduction (Allen, 2005;Greiner et al., 2011;Roux et al., 2016;Bogdanova, 2019;Flood et al., 2020). By combining plastome data with data from nuclear sequences, phenotype, and crop characteristics the degree to which plastid genes are involved in the process of domestication and improvement can be more effectively assessed.

Cultivar Determination and Plastome Based Population Structure in Rice
Thousands (more than 40,000) of rice cultivars have been described since the beginning of documentation of rice cultivation (Li R. et al., 2019). Distinctive morphological traits and growth characteristics have been the primary means to recognize rice cultivars (Fujita et al., 1984;Li R. et al., 2019). More recently large genetic and genomic data sets have been employed to characterize rice diversity and understand the genetic underpinnings of desired traits (Kovi et al., 2011;Huang et al., 2013;Choi and Purugganan, 2018).
In our analysis of population structure using plastomic data most of the japonica (90.8%) and aromatic (86.6%) cultivars (determined a prior by original authors using cultivar traits) were present in the Or-wj genetic clusters. While the indica (67.0%) and aus (63.5%) cultivars (also determined a priori by original authors using cultivar traits) were mainly present in the Or-wi genetic clusters. From this it appears that japonica and aromatic cultivars are more closely associated with the Or-wj genetic clusters than the indica and aus cultivars are associated with the Or-wi genetic clusters. Such a pattern might be the result of biased maternal introgression from the Or-wj genetic clusters into indicia and aus cultivars. However, the inferred introgression of plastomes from the Or-wj genetic cluster into indica and aus varieties do not appear to have resulted in morphological changes that made them unrecognizable as indicia or aus cultivars. That said the influence of the Or-wj plastomes on the indica and aus traits might be minimal or they could be expressed in traits typically unrelated to cultivar recognition such as changes to photosynthetic rate.
The higher rate of inferred plastomic introgression from the Or-wj into the indica and aus cultivars raise several questions. For instance, is this pattern of asymmetrical introgression evidence of the transfer of the sh4 and other domestication genes from gene pools containing the desired traits to gene pools without the desired traits as proposed in Choi and Purugganan (2018)? Has unidirectional cytoplasmic incompatibility resulted in higher rates of Or-wj plastomes in indica and aus cultivars? Do Or-wj plastomes provide desired traits not found in Or-wi plastomes to indica and aus cultivars? Obviously further work is needed to better understand how plastids are involved in the domestication process generally and in rice specifically. With datasets such as those presented here these questions are now possible to investigate more thoroughly. The outcome of such studies could be directly applicable in the development of new rice cultivars through plastomic marker assisted breeding and editing of key plastid genes. In addition, geographic locations with high plastomic haplotype diversity can be identified and prioritized for conservation of valuable wild rice germplasm such that they can be preserved for use in future rice breeding efforts.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ebi.ac.uk/ ena, PRJEB43684 and https://ngdc.cncb.ac.cn/, PRJCA006880.

AUTHOR CONTRIBUTIONS
WH, ZW, DJ, and LT conceived and designed the study. WH, CC, KX, JW, and PZ collected the public data and performed the data analysis. WH, CC, and KX wrote the manuscript. ZW, LT, and DJ revised the manuscript. All authors have read and agreed to the published version of the manuscript and reviewed the manuscript.