The complete chloroplast genome provides insight into the evolution and polymorphism of Panax ginseng

Panax ginseng C.A. Meyer (P. ginseng) is an important medicinal plant and is often used in traditional Chinese medicine. With next generation sequencing (NGS) technology, we determined the complete chloroplast genome sequences for four Chinese P. ginseng strains, which are Damaya (DMY), Ermaya (EMY), Gaolishen (GLS), and Yeshanshen (YSS). The total chloroplast genome sequence length for DMY, EMY, and GLS was 156,354 bp, while that for YSS was 156,355 bp. Comparative genomic analysis of the chloroplast genome sequences indicate that gene content, GC content, and gene order in DMY are quite similar to its relative species, and nucleotide sequence diversity of inverted repeat region (IR) is lower than that of its counterparts, large single copy region (LSC) and small single copy region (SSC). A comparison among these four P. ginseng strains revealed that the chloroplast genome sequences of DMY, EMY, and GLS were identical and YSS had a 1-bp insertion at base 5472. To further study the heterogeneity in chloroplast genome during domestication, high-resolution reads were mapped to the genome sequences to investigate the differences at the minor allele level; 208 minor allele sites with minor allele frequencies (MAF) of ≥0.05 were identified. The polymorphism site numbers per kb of chloroplast genome sequence for DMY, EMY, GLS, and YSS were 0.74, 0.59, 0.97, and 1.23, respectively. All the minor allele sites located in LSC and IR regions, and the four strains showed the same variation types (substitution base or indel) at all identified polymorphism sites. Comparison results of heterogeneity in the chloroplast genome sequences showed that the minor allele sites on the chloroplast genome were undergoing purifying selection to adapt to changing environment during domestication process. A study of P. ginseng chloroplast genome with particular focus on minor allele sites would aid in investigating the dynamics on the chloroplast genomes and different P. ginseng strains typing.


INTRODUCTION
The chloroplast, an important plastid, plays an essential role in plant cell functions, including photosynthesis and carbon fixation (Neuhaus and Emes, 2000). In angiosperms, the chloroplast has a conserved quadripartite structure composed of two copies of inverted repeat (IR), one large single copy (LSC), and one small single copy (SSC) (Palmer, 1985). Comparative analysis indicates that most chloroplast genomes are highly conserved at the gene level (Jansen et al., 2005). Compared with the nuclear genome, the chloroplast genome has many other advantages, such as haploid and maternal inheritance (Birky, 2001). Since decoding the first chloroplast genome sequence (Nicotiana tabacum, Shinozaki et al., 1986) in 1986, over 200 complete chloroplast genomes have been sequenced and deposited in NCBI Organelle Genome Resources database. With the development of DNA sequencing technology, the chloroplast genomes have been widely used for plant identification and resolving phylogenetic relationships (Jansen et al., 2007;Moore et al., 2007Moore et al., , 2010Parks et al., 2009).
Panax ginseng, a member of Araliaceae family, is native to China, Korea, and Russia. For thousands of years, traditional Chinese medicine has relied on P. ginseng to restore and enhance human health, and P. ginseng has been used as a tonic, stimulant, and agent to foster fatigue and stress-resistance for more than 2000 years. More recently, the pharmaceutical effects of P. ginseng have been shown by a host of studies, and P. ginseng has become one of the most well-known medicinal plants worldwide (Pazyar et al., 2012;Zheng et al., 2012;Gao et al., 2013).
Traditional methods to identify wild ginseng from cultivars are based on phenotypic observations, while morphological characteristics are often affected by environmental and developmental factors. Particularly during the early developmental stages, the seeds and seedlings of different ginseng cultivars are extremely morphologically similar; rendering their differentiation is quite difficult and sometimes impossible. Furthermore, morphological characteristics cannot be used for screening large numbers of ginseng samples. Therefore, a simple method of DNA analysis, rather than the traditional authentication methods, is clearly desirable. Lee et al. studied the phylogeny of Panax with chloroplast trnC-trnD intergenic region sequences (Lee and Wen, 2004). In 2004, the first P. ginseng chloroplast genome sequence, P. schinseng Nees (156,318 bp, Genbank accession number: NC_006290), was reported by Kim et al. (Kim and Lee, 2004). The general features of P. schinseng Nees chloroplast genome and genome structure dynamics compared to other chloroplast genome sequences have been well described. However, the data were insufficient for studying the genome dynamics and evolution of P. ginseng chloroplast. Herein, we present four complete Chinese P. ginseng chloroplast genome sequences based on next generation sequencing (NGS) technology. Comparative analyses of the chloroplast genomes between P. ginseng strains and other plants were conducted. At the same time, high resolution of genome sequences has made it easy to investigate genetic variations in the chloroplast genome sequence that have occurred over the course of domestication.

ETHICS STATEMENT
All plant samples (four kinds of Chinese ginseng cultivars) in current study were collected from E-Mu ginseng experimental base (N 43 • 47 45.72 , E128 • 6 7.88 , 46 m above sea level), Changchun University of Chinese Medicine with permission. Ginseng in current research does not involve endangered or protected species.

PLANT DNA EXTRACTION AND SEQUENCING
Four P. ginseng samples, including three domestic ginseng strains (DMY, EMY, and GLS) and one wild ginseng strain (YSS), were all collected from E-Mu ginseng experimental base. Morphological differences among these four strains mainly focus on rhizome, lateral root and so on, and the detailed differences were introduced in Supplementary Table S1. After cleaning, the fresh roots were frozen in liquid nitrogen and stored at −80 • C until further processing. Changchun University of Chinese Medicine confirmed the identification of the four ginseng samples through morphology. The total mixed genomic DNA was extracted from fresh roots by using a Plant Genomic DNA Kit (Tiangen Biotech Co., China) following the manufacturer's instructions.
For each sample, pair-end reads with different insertion lengths were sequenced by Illumina HiSeq2000 system. Additionally, DMY was also sequenced with Roche/454 GS-FLX (Titanium) pyrosequencing machine for single-end fragment reads. The detailed sequencing information for four strains was listed in Supplementary Table S2.

CHLOROPLAST GENOME ASSEMBLY AND MINOR ALLELE ANALYSIS
All raw HiSeq reads of four ginseng strains were filtered by quality control protocol with the following five rules: (i) there is no "N" in the entire read, (ii) the percentage of bases with quality <10 is 0, (iii) the percentage of bases with quality <15 is ≤3%, (iv) the percentage of bases with quality <20 is ≤5%, and (v) the average quality of all bases is ≥30. Only high quality reads satisfying all five rules were used in genome assembly and SNP calling (high quality reads information were listed Supplementary Table S2).
For DMY, all reads from 454 GS FLX sequences were directly assembled using Newbler (overlapMinMatchLength = 40 and overlapMinMatchIdentity = 90). According to Zhang's assembly strategy , we used BLAT (-minIdentity = 90) to search large contigs (length ≥ 1 kb) with the conserved chloroplast genes (both identity and coverage on gene nucleotide sequence are no less than 90%). Those large contigs, which contain two or more conserved chloroplast genes, were selected as seed contigs in bb.454contignet (Iorizzo et al., 2012) script (setlowlimit to 10). The chloroplast contig graph was also drawn with bb.454contignet, and low coverage contig branches (possibly including the DNA transferred from chloroplast to mitochondrial or nuclear genomes) were excluded manually . Subsequently, SSC, LSC, and IR were connected into a circle. High-quality pair-end reads from three different insertion length libraries were employed to verify the contig connections and whether each base on the circle genome was supported by the primary allele base by using BWA (Li and Durbin, 2010) with default parameters and SAMTools (-Q = 20, -m = 5 on mpileup) (Li et al., 2009).
For EMY, GLS, and YSS, high-quality pair-end reads with different insertion lengths were BLAT-searched (set -minIdentity to 70 in BLAT) against all 207 chloroplast genomes (206 were downloaded from NCBI FTP, the other one is DMY chloroplast genome, and the accession numbers were listed in Supplementary  Table S3). The matched reads (coverage and identity on each reads are no less than 70%) were considered as chloroplast-related reads, and 150,000 pairs of high-quality and chloroplast-related pair-end reads were selected randomly for assembly from different insertion length libraries respectively. Moreover, Velvet (Zerbino and Birney, 2008) was employed to assemble the reads into scaffolds (hash_length values are set to 63, 63, and 69 in velveth for EMY, GLS, and YSS respectively). For each of the other three ginseng strains, the order of scaffolds was determined by mapping scaffolds to DMY chloroplast genome with BLAT program (set -minIdentity to 90 in BLAT), and gaps were filled by fishing reads from high-quality pair-end reads with Gapcloser and then manually checked. The verification method for the chloroplast genome sequences of EMY, GLS, and YSS were the same as that for DMY.
In minor allele identification section, all high quality pair-end reads with an inserting length of 500 bp from all four strains were mapped onto DMY chloroplast genome using BWA with default value. The program rmdup in SAMTools was employed to remove duplicated reads and SNP calling. Finally, only alleles with MAF ≥0.05 were retained as minor allele candidates (De et al., 2008;Skoglund and Jakobsson, 2011;Iorizzo et al., 2013).

CHLOROPLAST GENE ANNOTATION
Protein-coding genes in ginseng chloroplast genome were annotated by DOGMA (set protein identity to 50, and hits number to 10) (Wyman et al., 2004); start and stop sites of these annotated protein-coding genes were corrected manually. rRNA was detected by aligning rRNA sequences from other chloroplast genomes to ginseng chloroplast genome sequence using BLAT with global coverage and identify ≥90%, and tRNA was identified by tRNAscan-SE (Lowe and Eddy, 1997) with default parameters. The gene map of ginseng DMY chloroplast was drawn by OGDRAW (Lohse et al., 2007).

IDENTIFICATION OF REPEAT SEQUENCES
All repeat sequences were identified using microsatellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa/misa. html), and each repeat sequence was ≥10 bp. Repeat sequences whose repeating sequence units were arranged from 2-6 bp and repeated not less than three times were considered as SSRs. Repeat sequences with lengths ≥20 bp were considered as large repeat sequences.

PHYLOGENETIC ANALYSIS
A phylogeny constructed for ginseng and other plants from Asterids was based on the whole genome sequences of LSC, SSC, and IR regions, and 52 protein-coding gene sequences. As to protein-coding sequences, each gene was aligned using Clustal W with default parameters (Goujon et al., 2010), and the aligned sequences from all genomes were connected in order of their sequence. MAFFT (-maxiterate = 10) (Katoh and Toh, 2010) was used to align the sequences from LSC, SSC, and IR, and poor alignment regions were manually checked and adjusted. The phylogenetic trees were constructed using maximum likelihood (ML) algorithm in MEGA5 (Tamura et al., 2011) with default parameters, and Spinacia oleracea (Order: Caryophyllales) and Vitis vinfera (Clade: Rosid) were used as outgroups.
Phylogeny among the four newly sequenced ginseng chloroplasts and P. schinseng Nees chloroplast were constructed on the basis of the genome sequences and MAFs in the polymorphism sites. Because clear differences were observed between P. schinseng Nees chloroplast and the newly sequenced chloroplasts, P. schinseng Nees chloroplast was used as an outgroup directly. According to UPGMA algorithm, the evolutionary relationship among the four newly sequenced chloroplasts was calculated based on MAF of 208 polymorphism sites.

GENOME STRUCTURE ANALYSIS AND NUCLEOTIDE DIVERSITY
When comparing CDS, introns, and intergenic regions in LSC, SSC, and IR regions between ginseng and the other three closely related chloroplast genomes, the entire genome sequences were divided into series of small fragments by those exons of proteincoding genes. The fragment flanked by the same exons or genes would be taken as orthologous fragment. Orthologous fragments between ginseng and the other three closely related chloroplast genomes were aligned respectively, and then aligned sequences were orderly connected together according their regions. Nucleotide diversity and Ka/Ks in the different regions were calculated using DnaSP (version 5.10) (Librado and Rozas, 2009) with default parameters. The Ka/Ks value for each single protein-coding gene was calculated using the KaKs_caculator (-m MA -c 11) (Zhang et al., 2006).

Genome sequencing and assembly
The length for the whole chloroplast genome sequences for DMY, EMY, and GLS was 156,354 bp, while that for YSS was 156,355 bp. In DMY chloroplast genome sequence, the total length for LSC, SSC, and IR regions were 86,129 bp, 18,077 bp, and 26,074 bp, respectively. The accession numbers in GenBank for the four chloroplast genomes are KC686331 (DMY), KC686332 (EMY), KC686333 (GLS), and KF431956 (YSS). The clean reads for these newly sequenced chloroplast genomes were deposited in NCBI, and the accession number for SRA datasets are SRR1251992 (DMY), SRR1252006 (EMY), SRR1252007 (GLS), and SRR1252008 (YSS).

Genome annotation
In the determined nucleotide sequences of the chloroplast genome, DMY, EMY, and GLS chloroplast genome sequences were identical. Compared with these three chloroplast genome sequences, YSS chloroplast genome had only one base-pair insertion at 5472th base, which is located in the intron of rps16 in LSC region. Therefore, DMY chloroplast genome was used as the representative of these newly sequenced P. ginseng chloroplast genomes for further comparative analyses.
In DMY chloroplast genome, 128 functional genes were identified, including 86 protein-coding genes, 34 tRNA genes, and 8 rRNA genes (Table 1), and the gene map for DMY chloroplast genome was shown as Figure 1. Among 86 protein-coding genes, 74 were single copy genes, 12 were duplicates, and 13 genes were composed of two or more exons. For instance, the rps12 gene had three exons, which located in LSC region (one exon) and IR region (the other two). Among 42 RNA genes, 20 were unique and 22 were duplicates. 7 of 34 tRNA genes had one intron. A total of 26,162 codons were present in protein-coding genes of DMY chloroplast genome (Supplementary Table S4). Generally the nucleotide sequence of protein-coding gene begins with a start codon ATG, while there were some exceptions consist of the following: rps19 began with GTG and ndhD began with ACG, which was also found in the chloroplast genomes of other plants, such as Phoenix dactylifera L (Yang et al., 2010); the ycf15 gene in Eleutherococcus senticosus began with ATG , while that in P. ginseng began with GTG, with the first nucleotide changing from A to G.

Simple sequence repeats in DMY chloroplast genome
Simple sequence repeats (SSRs), which are also known as microsatellites and short tandem repeats (STRs), present high diversity in copy numbers. SSRs are important genetic molecular markers for population genetics (Doorduin et al., 2011;He et al., 2012) and are widely used for plant typing (Xue et al., 2012;Yang et al., 2011). With the microsatellite identification tool (MISA), 30 SSRs were detected in P. ginseng DMY chloroplast genome ( Table 2), including 18 homopolymers, 1 dipolymer, 8 tetrapolymers, 2 pentapolymers, and 1 hexapolymer. In 19 homopolymers and dipolymers, 15 SSRs were only composed of A or T bases. In the other 11 SSRs, more than half of the bases were composed of A or T bases. Therefore, SSRs in DMY chloroplasts are ATrich. Among these SSRs, 10 were located in intergenic regions, 2 in rRNA gene (rrn23), and 8 in protein-coding genes (rpoA, rpoB, rpoC2, atpB, psbM, and 3 in ycf1). SSRs with the length of repeat unit not less than two bases were not identified in the previously reported P. schinseng Nees chloroplast genome (Kim and Lee, 2004). Comparing with P. ginseng DMY chloroplast genome, copy number variations in P. schinseng Nees chloroplast genome have introduced some difference in homopolymer In addition to SSR sequences, repeats with lengths ≥20 bp were designated as large repeat sequences in DMY chloroplast, and 5 large repeat sequences were detected ( Table 3). Two of them were located in intergenic regions. The other three were located in protein-coding regions, one in the ycf1 gene and two in the ycf2 gene. Because ycf2 is located in IR region, each ycf2 is comprised of one large repeat sequence only, which has also been reported in Sesamum indicum chloroplast genome .

Phylogenetic analysis of P. ginseng chloroplast genome
To study the phylogenetic relationships of ginseng, 31 plants from Asterids clade were employed, and Spinacia oleracea and Vitis vinifera were incorporated as the outgroups (Supplementary  Table S5). LSC, SSC, IR, and protein-coding regions were employed to construct the phylogenetic tree respectively. As to protein-coding region, 55 conserved protein-coding genes were employed (Supplementary Table S6), and the results were shown in Figure 2. The results for LSC, SSC, and IR regions were presented in Supplementary Figure S1A-1C. The phylogenetic relationship from LSC, SSC, and protein-coding regions were almost identical, and the topological relation also agreed with the taxonomy of core eudicots (Figure 2 and Supplementary Figure S1). The phylogenetic relationships based on IR region showed high similarities with the LSC, SSC and protein-coding regions' results, while Ipomoea purpurea from Solanales and Trachelium caeruleum from Asterales did not located in their original branches as their taxonomy but clustered into a new branch, which may be caused by frequent recombination and expansion/contraction in IR region (Wicke et al., 2011). Among these four phylogenetic trees, P. ginseng DMY and other plants in Apiales, including P. schinseng Nees, Eleutherococcus senticosus (E. senticosus), Anthriscus cerefolium (A. cerefolium), and Daucus carota (D. carota), had stable, topological relationships. It is apparent that the conserved protein-coding genes of the plant chloroplast genome can adequately trace the phylogenetic relationships of core eudicot plants, which have been proven by a series of other works (Wu et al., 2007;Lin et al., 2010;Liu et al., 2013;Qian et al., 2013). Meanwhile, the LSC and SSC sequences of chloroplasts could also be used for phylogenetic analyses of core eudicot plants.

Dynamics in IR region of P. ginseng chloroplast genome
The contraction and expansion at the borders of IR regions were common evolutionary events, which primarily contributed to the observed variation in the size of chloroplast genomes ( Goulding et al., 1996;Wang et al., 2008). To elucidate this phenomenon and mechanism in Apiales, P. ginseng DMY chloroplast genome was used as the representative for comparison to the other three closely related chloroplast genomes, namely, those of E. senticosus, A. cerefolium, and D. carota. Detailed comparisons of LSC, SSC, and IR boundaries in these four representatives from Apiales were shown in Figure 3. The rps19 and ycf1 genes crossed the LSC/IRa and SSC/IRb boundaries, respectively. At the Gene name in the bracket indicates that this SSR located on the gene.

Selection pressure on P. ginseng chloroplast genome
To explore the selection pressures of P. ginseng chloroplast genome, DMY chloroplast genome was compared to the other three chloroplast genomes in Apiales (E. senticosus, A. cerefolium, and D. carota; Table 4). Furthermore, to clarify the variations in each part of chloroplast genomes, the sequences of LSC, SSC, and IR regions were divided into three functional regions, including a protein-coding region (CDS), intron region, and intergenic spacers (IGS) region. The polymorphic site numbers between P. ginseng and E. senticosus (1721) were less than those between P. ginseng and D. carota (9916) or A. cerefolium (9504). Nucleotide differences between P. ginseng and D. carota or A. cerefolium were significantly higher than those between P. ginseng and E. senticosus for all three functional regions. Thus, we can infer that P. ginseng is more closely related to E. senticosus than the other two, which is also reflected in the phylogenetic trees (Figure 2 and Supplementary Figure S1). From Table 4, it is apparent that SSC region has the highest variation ratio for all three functional regions, and IR region has the lowest variation ratio, which was the same as the chloroplast genome of Sesamum indicum L Zhang et al., 2013). The Ka/Ks values of IR region were higher than those of LSC and SSC regions. Moreover, this trend was consistent with all comparison results of between P. ginseng and D. carota  ycf15 was 50 (Ka = 0.0039, Ks = 7.7e-5, p-value = 0). It's obvious that those exceptionally high Ka/Ks value were mainly caused by extremely low Ks value, which could not well reflect the real selection pressure on these genes. Even though, we could infer that ycf2 and ycf15 suffered positive selection. ycf2 and ycf15 have been predicted in many chloroplast genomes by several previous studies, and experimental evidences have proved that they are indeed functional (Drescher et al., 2000;Dong et al., 2013;Shi et al., 2013). Anja Drescher and et al have proved that ycf2 played an essential role in cell survival in tobacco chloroplast (Drescher et al., 2000), and ycf15 was also found transcribed but not spliced in spinach (Schmitz-Linneweber et al., 2001).
To make more comprehensive investigation on selection pressure of all chloroplast genes, we have calculated the Ka/Ks value for all chloroplast genes between DMY and the three closely related species (E. senticosus, A. cerefolium, and D. carota). From Supplementary Table S7, we found that, when comparing the genes between DMY and E. senticosus, many genes (such as atpE, atpF and so on) have high Ka/Ks values, which were also introduced by extremely low Ks value. When comparing DMY with the three closely related-species, besides these genes with low Ks introduced high Ka/Ks values, ycf1, ycf2, and ycf15 indeed suffered positive selection with Ka/Ks value no less than 0.5, which were also found in tobacco and Sedum sarmentosum chloroplast genomes (Drescher et al., 2000;Dong et al., 2013). Furthermore, Kapralov MV and Filatov DA found that rbcL evolved under strong positive selection in Schiedea and plants from Amaranthaceae family (Kapralov and Filatov, 2006;Kapralov et al., 2012), while rbcL did not suffered obviously positive selection in P. ginseng, which indicates that gene may suffers different selection pressure in different plants.

Comparisons of the five P. ginseng chloroplast genome sequences
Genetic variations in chloroplast genome sequences usually have significant value in population genetic analyses and plant domestications (Tang et al., 2004;Kawakami et al., 2007;Yang et al., 2013). To explore the evolution of P. ginseng, comparative analyses among the four newly sequenced chloroplast genomes and P. schinseng Nees chloroplast genome were performed. DMY, EMY, and GLS had identical chloroplast genome sequences, and the YSS chloroplast genome sequence had a 1 bp insertion located at base 5472 reference to DMY chloroplast genome sequence. We also compared the previously reported P. schinseng Nees chloroplast genome with P. ginseng DMY chloroplast genome. P. schinseng Nees chloroplast genome was sequenced by ABI 377 with an average coverage of 4.7×, and P. ginseng DMY chloroplast genome was sequenced by HiSeq2000 with an average coverage over 1000× and Roche/454 GS-FLX (Titanium) with an average coverage of 217×. Five sites on P. schinseng Nees chloroplast genome sequence (accession number in GenBank: NC_006290) were undefined, four were marked as R (purine, A or G) and one was Y (pyrimidine, T or C). All five sites were excluded from the differentiated sites, because those bases in DMY at the same site were also a purine or pyrimidine. Excluding the undefined sites, a total of 172 variant sites, including 88 indels and 84 substitutions, were found (Supplementary Table S8). According to the annotation of DMY chloroplast genome, 30 of 84 substitution sites were located in intergenic region, 16 of them were located in intron region, 2 were located in rRNA gene rrn23, 1 was located in tRNA gene trnV-UAC, and the remaining 35 were located in proteincoding genes (7 for ycf1; 5 for rpoC2; 4 for accD and psbD; 2 for atpB and rpoB; and 1 for rps2, rps12, lhbA, ndhD, psaA, psaB, psaC, atpF, atpI, clpP, and atpA). Regarding 88 indels, 51 were located in intergenic region, 35 were located in intron region, and 2 were located in tRNA genes (trnC-GCA and trnS-UGA). Further analyzing the relationship between repeat regions of these indel sites, the result showed that 21 sites were located in the short single nucleotide acid repeat regions, with a repeat unit occurrence of five or more times. Thus, the different origins of P. schinseng Nees and P. ginseng DMY could account for the differences in their chloroplast genome sequences, and other reasons require further investigation.

Minor allele site polymorphisms and evolution of P. ginseng chloroplast genome
Minor allele frequency (MAF) refers to the frequency of the least common allele occurs in a given population and is usually used for conducting evolutionary analyses of genetic markers (Zhang et al., 2012a;McPherson et al., 2013). Diversity in the MAF has been widely employed for evolutionary and genetic analyses between domestic and wild species in both plants and animals (Vasemagi et al., 2012;Alhaddad et al., 2013;Iorizzo et al., 2013;Petersen et al., 2013). There are a large number of chloroplast genomes in per plant cell (Pyke, 1999). In current study, the DNA sample for each strain was extracted from numerous cells. This means that we took all these chloroplast genomes from the same individual as a chloroplast population for each strain. In the assembly section, the assembled genome sequences substantially represent major alleles in the population. To study the heterogeneity of chloroplast genomes and the dynamics of minor alleles over the course of P. ginseng domestication, all minor allele sites for DMY, EMY, GLS, and YSS were identified with high-resolution reads. After mapping the high-quality reads of all four strains in DMY chloroplast genome respectively and filtering minor alleles with an MAF value of <0.01 (Kolz et al., 2009;Liu et al., 2012), a total of 312 polymorphism sites were detected (Figure 4). To avoid interference from sequencing errors and DNA migration from chloroplast genome to mitochondria and nuclear genomes (Thorsness and Weber, 1996;Birky, 2001;Zhang et al., 2012b), and to restrict analyses to common alleles (De et al., 2008;Skoglund and Jakobsson, 2011;Iorizzo et al., 2013), minor alleles sites with an MAF value of ≥0.05 were used as credible minor alleles, and the cutoff line was shown as a dashed horizontal line in Figure 4. With this MAF value, 208 minor allele sites were identified from all four strains (Supplementary Table  S9). And the major types in variation of all these common SNP sites were immobile in every strain. Excluding the variant between YSS and the other three strains at base 5472, there were 5 deletions (3 for A, 1 for C, and 1 for CG) and 202 substitutions, including 41 transitions (transitions of A-G occurred 11 times, all other types transitions occurred 10 times). Among 161 transversion events, the occurrences A-C and T-G accounted for the majority part, 68 and 52 times, respectively. 136 of 208 minor alleles were located in LSC region, and the other 72 were located in IR region. According to the annotation of DMY chloroplast genome, 52 were located in intergenic region, 35 were in intron region, 1 was in tRNA gene trnD-GUC, and the other 120 were in protein-coding region (73 non-synonymous and 47 synonymous substitutions). For the protein-coding genes in LSC region, the details for minor allele number in each gene were shown as follows: 13 in rbcL; 6 in psaB; 6 in psbC; 4 in psbD, psaA, atpB, and rpl22; 3 in rps8; 2 in psbA, ycf3, petD, rpoA, rpl14, and rps3; and 1 in matK, infA, and rpl16 for each. For the protein-coding genes in IR region, the number of minor alleles would be doubled because there are two copies of IR (for each copy, 20 in ycf2, 4 in rp12, 3 in ycf15 and ndhB, and 1 in rps19 and rpl23).
For all 208 chloroplast minor allele sites with MAF ≥0.05, DMY, EMY, GLS,and YSS covered 116,92,152,and 193, respectively. The polymorphism site numbers per kb of the chloroplast genome sequences for these four strains were 0.74, 0.59, 0.97, and 1.23, respectively. The minor allele overlaps among DMY, EMY, GLS, and YSS was 76 ( Figure 5). All these 76 common minor allele sites were shared by all four strains; 29 of them were located in protein-coding region (14 non-synonymous and 15 synonymous substitutions). Additionally, there were 30 specific minor allele sites in YSS, with 23 in protein-coding region (15 nonsynonymous and 8 synonymous substitutions). In DMY, there were also 11 specific minor allele sits, of which 6 were in rbcL. Those specific minor allele sites in YSS and DMY could be used as genetic markers to discriminate them from others. However, the specific minor allele sites in GLS and EMY were very few, so it was very difficult to choose applicable minor allele sites. When the threshold for MAF value was increased to 0.1, the total number of minor alleles decreased to 88, among which DMY, EMY, GLS, and YSS covered 1, 11, 35, and 86, respectively. Seen from the minor allele analysis result, the wild-type P. ginseng YSS is very different from the other three cultivated strains.
At the same time, from Figure 4, we found that those minor alleles on chloroplast genomes were exposed to purifying selection due to domestication. Based on the information of minor allele sites and the differences of genome sequences, an evolutionary relationship among the four newly sequenced strains and the