Identification of Sesame Genomic Variations from Genome Comparison of Landrace and Variety

Sesame (Sesamum indicum L.) is one of the main oilseed crops, providing vegetable oil and protein to human. Landrace is the gene source of variety, carrying many desire alleles for genetic improvement. Despite the importance of sesame landrace, genome of sesame landrace remains unexplored and genomic variations between landrace and variety still is not clear. To identify the genomic variations between sesame landrace and variety, two representative sesame landrace accessions, “Baizhima” and “Mishuozhima,” were selected and re-sequenced. The genome sequencing and de novo assembling of the two sesame landraces resulted in draft genomes of 267 Mb and 254 Mb, respectively, with the contig N50 more than 47 kb. Totally, 1,332,025 SNPs and 506,245 InDels were identified from the genome of “Baizhima” and “Mishuozhima” by comparison of the genome of a variety “Zhongzhi13.” Among the genomic variations, 70,018 SNPs and 8311 InDels were located in the coding regions of genes. Genomic variations may contribute to variation of sesame agronomic traits such as flowering time, plant height, and oil content. The identified genomic variations were successfully used in the QTL mapping and the black pigment synthesis gene, PPO, was found to be the candidate gene of sesame seed coat color. The comprehensively compared genomes of sesame landrace and modern variety produced massive useful genomic information, constituting a powerful tool to support genetic research, and molecular breeding of sesame.


INTRODUCTION
Sesame (Sesamum indicum L.) is one of the main oilseed crops, providing vegetable oil and protein to human (Weiss, 2000;Wei et al., 2013). It is also one of the most ancient oilseed crops, starting its domestication from the wild progenitor S. malabaricum in the Indian subcontinent about 5000 years ago (Fuller, 2003;Pathak et al., 2015). Its high oil quality and benefit to human health has earned it the poetic label "queen of oilseeds" (Johnson et al., 1979). Because of the popular of sesame among consumers, the market demand for sesame seeds is growing rapidly and the planting area of sesame has almost doubled in the last fifty years (Faostat, 2015).
A widely grown sesame variety in China, "Zhongzhi13, " has been de novo sequenced recently (Wang et al., 2014b), which had opened the door of genome research and genomic selection breeding of sesame. The genome sequencing revealed that sesame has a small diploid genome, which was estimated to be 350 Mb. In total, 274 Mb draft genome of "Zhongzhi13" was successfully assembled and 27,148 protein-coding genes were predicted. Although the genome sequencing of "Zhongzhi13" had been completed, the lack of enough genomic information still limited the genetic improvement of sesame. Comprehensive and exhaustive identification of the genomic variations in sesame genome are urgent for this important oilseed crops.
Because of the bottleneck occurs during the artificial, landraces are considered to carry many agriculturally desirable alleles that are not contained in varieties (Doebley et al., 2006;Cavanagh et al., 2013). Landraces could provide useful genes for genetic improvement of crops. The agriculturally valuable alleles identified in landraces genome can be easily transferred to commercial varieties by introgression breeding (Zeven, 1998;Zamir, 2001). A lot of superior alleles from landrace have been discovered and used in the crop breeding. For example, the superior allele of the rice "green revolution" gene sd1 was identified from a rice landrace "Deo-geo-woo-gen" (Spielmeyer et al., 2002). One allele of GS2 in the rice landrace "Baodali" could increase 40% thousand seed weight and 14% yield of plant . And an allele of qHSR1 in the maize landrace "Ji1037" showed high resistance of the systemic disease heat smut (Zuo et al., 2015).
As a worldwide grown crop, the landraces of sesame have diverged to various genotypes to adapt different environment and climate. Previous studies revealed that sesame landraces had a large scale of diversity in both phenotype and genotype Yue et al., 2013;Wang et al., 2014a;Wei et al., 2015). And multiple loci that related to several agronomic traits had been identified in sesame landraces, including oil and protein content , plant height (Ding et al., 2013), disease resistance , drought tolerance , waterlogging tolerance , seed coat color . These research implied that large number of desirable alleles were carried in sesame landraces and largely needed to be discovered. However, genome of sesame landraces still have not been explored and genomic variation have not been detected.
To investigate the genome of sesame landraces, two landraces ("Baizhima" and "Mishuozhima") from South China and East China, respectively, were selected and de novo sequenced in the present research. Previous genetic research showed these two landraces had far relationship to "Zhongzhi 13" (Wang et al., 2014a;Wei et al., 2014). Each landrace was sequenced approximately 70-fold coverage of the predicted sesame genome. We compared the genomes of landraces and variety and discovered a large number of SNPs and InDels in the genomes. Large-effect SNPs and InDels were detected in some genes related to important agronomic traits of sesame. Identification of genomic variations in sesame landraces will provide massive valuable information for genetic research and molecular breeding of sesame in future.

Sampling and Genome Sequencing
Two representative sesame landraces (Table 1 and Figure 1), "Baizhima" and "Mishuozhima, " preserved at the China National Gene Bank, were selected and self-pollinated for six generation. Thirty-two SSR markers (two primers in each linkage group of sesame genome) were selected to detect the heterozygosity of the two accessions (Table S1). The samples were planted in Wuhan in the summer of 2014 and fresh leaves were collected from 4 weeks seedlings. Total genomic DNA was extracted by using the DNeasy Plant Mini Kit (Qiagen, USA). Library construction was done as described (Huang et al., 2009). A total of 49 Gb high-quality reads were generated using Illumina Hiseq2500 platform, representing a raw coverage of approximately 70-fold coverage of the genome for each landrace, using an estimated genome size of 350 Mb for sesame genome (Wang et al., 2014b). The DNA sequencing data were deposited in the European Nucleotide Archive (http://www. ebi.ac.uk/ena/data/view/PRJEB8078).

Sequence Assembling and Comparison
The genomes of two landrace were assembled by the SOAPdenovo2 package (version 2.04) (Luo et al., 2012). The reads were split into k-mers (k was 61 and 55 for "Baizhima" and "Mishuozhima, " respectively) from the short-insert size libraries. The gaps in the contigs were excluded by Gapcloser (version 1.12). The N50 length of the final assembly was calculated without all small contigs of < 200 bp. Contig numbers of different length were counted by in house Perl script. MUMmer was used in the anchoring of the assemblies to the sesame reference genome (Kurtz et al., 2004), and the diffseq program in the EMBOSS package was used in the calling of sequence variants (Rice et al., 2000). The reference genome "Zhongzhi13, " mRNA, protein sequences, repeat sequences, transposons types, and GFF file were downloaded from the Sinbase database (Wang et al., 2014b). Transposons in "Baizhima" and "Mishuozhima" were identified by the software RepeatMasker (Version 3.3.0, http://repeatmasker.org) with the default parameters. The repeat sequence library used in the transposons identification was downloaded from http://www.girinst.org. Homologous sequence used in the transposons alignment including LG1 (11920217 to 12241908) in "Zhongzhi13, " contig tarseq_824_000001_1 of "Baizhima" genome and contig tarseq_2333_000001, tarseq_2333_000002, tarseq_6067_000001, tarseq_749_000001, tarseq_749_000002 in "Mishuozhima" genome. Transposons types in "Baizhima" and "Mishuozhima" genome were annotated by aligning with the reference genome. Circos (Krzywinski et al., 2009) was used to construct the diagram of genome variations of all 16 linage groups in sesame. Phylogenetic tree of the three sesame accessions was constructed by the software PHYLIP (Felsenstein, 1989). Genome of Utricularia gibba (Ibarra-Laclette et al., 2013) was used as out-group.
Gene structure of the two genomes was predicted by the software FGeneSH (Salamov and Solovyev, 2000). Arabidopsis genome and gene annotation were download from the TAIR database (www.arabidopsis.org). All genes in "Zhongzhi13"  genomes were aligned with Arabidopsis genes by BLAST (Altschul et al., 1997) and gene function was annotated by the Arabidopsis genes (e < 10 −5 and protein sequence identity rate more than 70%). Then the best matched Arabidopsis genes were used for the annotation of sesame genes. Genes in "Baizhima" and "Mishuozhima" were aligned with that from "Zhongzhi13" genome and annotated by Arabidopsis genes. SNPs and InDels in coding regions were identified by self-customized Perl script from the genome comparison files generated by the diffseq program. Synonymous and nonsynonymous SNPs were also districted by self-customized Perl script. And dN/dS rate of each gene was calculated from the number of synonymous and non-synonymous SNPs. KEGG annotation of the genes with dN/dS ratio > 1 was performed with the annotation of the reference genome. The KEGG pathway of the genes were clustered on the KEGG website (http://www.kegg.jp/). The potential effects of the SNPs and InDels were predicted based on the annotation of "Zhongzhi13" genome with the GFF files. The de novo assemblies and genome-wide analysis of all the coding variants are available at the Sesame Haplotype Map Project database (http://www.ncgr.ac.cn/SesameHapMap).

QTL Mapping
Recombinant inbred lines (RILs) were developed from a cross between "Zhongzhi13" and "Mishuozhima" followed by selffertilization to F6, containing 550 individuals. The population was developed in the experimental field at Oil Crops Research Institute in Wuhan, Hubei Province. For genotyping, total genomic DNA was isolated from leaf tissues using the DNeasy Plant Mini Kit (Qiagen, USA). About four thousand SSR markers were designed from the sesame genome . All of them were used in the screening of the parent DNA and 400 polymorphism SSR primers were selected. Then the polymorphism SSRs were used in the QTL mapping for all F6 RILs. Six QTLs of sesame coat color were identified. To fine mapping the major QTL, 80 SSRs in the QTL region were developed and screened the RILs (Table S2). Seed coat color of the mature seeds was recorded for RILs. The color was recognized by CR-400 Chroma Meter (Konica Minolta, Japan) and recorded as standard color evaluation formula (L * a * b * ).

qRT-PCR
Developing seeds of "Zhongzhi13" and "Mishuozhima" were collected from the capsules at 5, 8, 11, 14, 17, and 20 days after flowering. All seeds were carefully collected on the ice and put into liquid nitrogen immediately. Total RNAs of the seeds were extracted by the EASYspin Plus kit (Aidlab, China) following the manufacturer's instructions. Any remaining DNA in the RNA was treated with DNaseI. The RNA was reverse transcribed by HiScript II 1st Strand cDNA Synthesis kit (Vazyme, China) with oligo (dT23) primer. Gene-specific primers and probes were designed for all candidate genes. The primers and probes used in the gene amplification were listed in Table S3. The qRT-PCR experiments were performed with Premix Ex Taq TM (Takara, Japan) on the CFX384 Real-Time System (Bio-Rad, USA). The actin7 gene (SIN_1006268) was used as an internal control.

De novo Assembly of Sesame Landrace Genomes
Two landraces (Table 1), "Baizhima" and "Mishuozhima, " which from South China and East China were selected and de novo sequenced. These two landraces were quite different from the variety "Zhongzhi 13" in several important agronomic traits, such as plant height, branchiness, seed size, seed number per capsule, disease resistance, and flowering date (Figure 1). "Zhongzhi13" is a high yielding variety, the major characters including unbranched, three capsules per axil, high plant height, and white seed. "Mishuozhima" is a typical semi-dwarf germplasm with black seed. "Baizhima" is a branched and late flowering germplasm with small white seed. Both SSR and SNP analysis showed significant genetic differentiation of the three sesame accessions (Wang et al., 2014a;Wei et al., 2014). Although sesame is mainly self-pollinated, out-crossing rate in natural environment was between 4 and 23% (Pathirana, 1994;Sun et al., 2015). These two landraces were strictly maintained by self-pollination for six generation to obtain highly homologous genomes. Heterozygosity of the samples were detected by 32 SSR markers (Table S1). No heterozygous locus was identified in both two landraces, indicating highly homozygous of the samples. Genomic DNA was extracted from seeding of 4 weeks for each landrace. The DNA libraries were constructed with an insert size of 300 bp and sequenced by the Illumina Hiseq2500 platform. Approximately 2.5 Gb data were generated for each sample, covering 70-fold genome size of sesame. SOAPdenovo was used to assemble the genomes (Luo et al., 2012), resulting in a draft genome of 267 and 254 Mb for "Baizhima" and "Mishuozhima, " respectively. The draft genome size was similar to that of "Zhongzhi13" (274 Mb; Wang et al., 2014b). In total, 73,395 and 49,167 contigs were assembled for "Baizhima" and "Mishuozhima" (Table S4 and Supplementary Figure S1), with the contig N50 size of 47.2 and 47.8 kb, respectively. These contigs were aligned with the reference genome "Zhongzhi13" using MUMmer software (Kurtz et al., 2004). Most sequence of "Baizhima" and "Mishuozhima" genome (82 and 86%, respectively) could be identified in the reference genome except the repetitive sequences.
Gene structure in "Baizhima" and "Mishuozhima" was predicted and the genes were compared with that from the reference genome (Supplementary Figure S2). In total, 77% of the coding genes (19,263 and 19,221 of "Baizhima" and "Mishuozhima, " respectively) were annotated for each genome of "Baizhima" and "Mishuozhima" (Table S5). Among them, 17,029 genes were Arabidopsis-homologous genes, which might be conservative genes.
Phylogenetic tree of the three genomes was constructed by PHYLIP (Felsenstein, 1989) and Utricularia gibba genome was used as out-group (Supplementary Figure S3). U. gibba, the tiny genome size plant, was reported to be the closest species of sesame in all sequenced species (Ibarra-Laclette et al., 2013). The phylogenetic analysis clearly showed that "Baizhima" and "Mishuozhima" has a closer relationship than "Zhongzhi 13, " indicating that plenty of variations might be discovered from genome comparison of the variety and landraces.

Landscape of Genome Variations between Sesame Landrace and Variety
All contig of "Baizhima" and "Mishuozhima" genomes were aligned against the reference genome, "Zhongzhi13." A large number of genome variations were identified, mainly including SNPs, InDels and transposons (Figure 2). Totally, we identified 1,332,025 SNPs, varying from 1.15 SNPs to 8.65 SNPs in 1 kb among the 16 linkage groups of the two landrace genome ( Table 2). In average, it is about 4.9 SNPs in 1 kb of the two genomes. Approximately 46% (609,178) of the SNPs were shared by the two landrace genomes ( Figure 3A). The 12 nucleotide substitution types ranged from 34,549 to 162,568 (Table 3 and Supplementary Figure S4A). Among the them, G:A, A:G, C:T, and T:C substitutions were the most common SNPs, in line with the result in the previous research of sesame (Wang et al., 2014a).
In total, 506,245 InDels (including structural variations) were detected in the two genomes, containing 203,271 shared InDels ( Figure 3B). Distribution of InDels was also not parallel in the LGs, the highest InDel rate (LG5, 3.20 InDels in 1 kb) was six times of the lowest one (LG12, 0.50 InDels in 1 kb) ( Table 2). As expect, small InDels were dominant of all InDels and single-base and two-base InDels were the most common InDels (Table 3 and Supplementary Figure S4B).
Among all SNPs and InDels, more than 90% were found to locate in intergenic and intronic regions of sesame genome ( Table 4). Large percent of genome variations in intergenic and intronic regions of genome was also reported in other species, such as rice, maize, soybean, grape, and tomato (Chia et al., 2012;Huang et al., 2012;Di Genova et al., 2014;Lin et al., 2014;Zhou et al., 2015). Since variations of coding regions usually generate functional change of proteins and might affect plant growth and development.
We noticed that almost all kind of SNPs and InDels in "Baizhima" genome were more than that in "Mishuozhima" genome (Supplementary Figure S4), implying that "Mishuozhima" might have a closer relationship of "Zhongzhi13." This result was in accordance with the geographic origin and phenotypes of the three accessions. "Mishuozhima" and "Zhongzhi13" are cultivated in the regions of similar latitude in China. Both "Mishuozhima" and "Zhongzhi13" are unbranched, early flowering and have three capsules per axil, while "Baizhima" is much branched, late flowering and have only one capsule per axil.
Transposable elements are one of the most common genomic variations in plants (Lisch, 2013). To compare transposable elements in the genomes of landrace and variety, the longest contig (tarseq_824_000001) in LG1 of "Baizhima" genome was selected and transposons were analyzed in the homologs region of the three genomes (Supplementary Figure S5). In this region, we found that long terminal repeats, including Gypsy-like and Copia retrotransposon elements were the most common transposons (Table S6).

Alteration of Coding DNA Sequence in Sesame Landrace Genomes
Genes were announced in the two genomes and compared with the reference genome. 70,018 SNPs (51,326 in "Baizhima" and 51,550 in "Mishuozhima") and 8311 InDels (5594 in "Baizhima" and 5897 in "Mishuozhima") were identified in coding regions of all genes (Supplementary Figure S6). Effects of the SNPs in coding regions were investigated and 33,385 synonymous SNPs LGs LG length and 36,633 non-synonymous SNPs in 9263 genes (34% of all sesame genes) were identified. Non-synonymous SNPs existed in 7404 genes (5577 in "Baizhima" and 5553 in "Mishuozhima"), indicating potential functional divergence of these genes. We calculated the non-synonymous-to-synonymous ratio of the SNP data set. Compared with the ratio for total sesame genes (1.14 in "Baizhima" and 1.12 in "Mishuozhima"), the non-synonymous-to-synonymous ratio (dN/dS) for the Arabidopsis-homologous genes (0.90 in "Baizhima" and 0.91 in "Mishuozhima") was significantly lower, indicating strong functional conservation of the gene set in evolution. Among them, 106 well-characterized genes related to flowering and 67 genes involved in lipid metabolism pathway were focused on specially (Table S7), which showed lower non-synonymousto-synonymous ratio (0.57 and 0.39, respectively). The result suggested that lipid metabolism genes were more conservative than flowering genes in sesame.
Based on dN/dS > 1, we identified 2274 genes might be positive selected genes (Table S8). Among them, 886 genes had more than 10 SNPs (Table S9), indicating rapid divergence in protein coding regions. These genes included plenty of lipid metabolism genes, disease resistant genes, and flowering related genes. Among them, disease resistance genes were overrepresented, suggesting the genes were undergoing strong natural and artificial selections. Similar phenomenon was also observed in the genome comparison of varieties or landraces in rice, maize, and grape (Springer et al., 2009;Li S. C. et al., 2012;Di Genova et al., 2014). KEGG pathway enrichment analysis of these genes were performed and these genes were founded to be involved in most KEGG pathways (Supplementary Figure S7), indicating that most pathways in sesame might have been selected during the domestication and artificial selection. The result showed that these genes were enriched in the carbohydrate metabolism and signal transduction pathways. Since both carbohydrate metabolism and signal transduction pathways are associated with plant abiotic stress such as salt, drought, and cold (Zhu, 2002;Baena-Gonzalez et al., 2007), the enrichment of the genes in these pathways might reveal the positive selection of abiotic stress tolerance for sesame.
Large-effect genomic variations which leading the loss function of genes and might significantly affect the agronomic traits were detected in "Baizhima" and "Mishuozhima" genomes. The large-effect SNPs causing premature to stop codon, start codon and stop codon changes were detected in 1276 genes in the genome of landraces (Table S10). While the large-effect InDels including frameshift InDels and InDels in start codon and stop codon were identified in 1932 genes in "Baizhima" and "Mishuozhima" genomes (Table S11). In total, 2627 large-effect SNPs were found in "Baizhima" and "Mishuozhima" genomes. Large-effect SNPs detected in "Mishuozhima" genome (1652) was much more than that in "Baizhima" genome (975). And the type of large-effect SNPs were also quite different between "Baizhima" and "Mishuozhima" genomes (Supplementary Figure S8). SNPs causing premature sites to be stop codons were dominant in "Baizhima" genome while SNPs which leaded start codon to be non-start codon was the most common ones in "Mishuozhima" genome. Large-effect genomic variations were found in the functional genes that related to disease resistant, flowering, plant development and lipid metabolism (Table S12). The identified large-effect SNP and InDels in crops can facilitate discovery  and cloning of candidate genes related to the domestication and improvement (Yu, 2013).

Genomic Variations of the Gene Related to Agronomic Traits
As a short day plant, sesame usually flowers early in short daylight condition. However, sesame has been brought worldwide and flowers in extremely different day-light conditions. The flowering mechanism of the photoperiod way of sesame is largely remaining explored. Flowering date of "Zhongzhi13, " "Baizhima, " and "Mishuozhima" showed greatly diverse (about 45, 60, and 38 days in the summer of Wuhan in China). Here, we showed comparison of several photoperiod genes in the three sesame cultivars (Figure 4). It is clearly revealed that multiple polymorphisms existed in the photoperiod genes, including phyC, ZTL, FLF, GI, and COL (Putterill et al., 1995;Fowler et al., 1999;Somers et al., 2000;Wang and Deng, 2004;Song et al., 2009). GI is the key gene which receives signals from the circadian clock and transfers the information to COL (Putterill et al., 1995). One loss-of-function insertion in GI was found in "Zhongzhi13" and "Mishuozhima, " resulting unfunctional GI in the two accessions. The loss-of-function protein caused invalid of circadian clock system in sesame and flowering promoting of "Zhongzhi13" and "Mishuozhima" in long-day condition. The highly diversity of the photoperiod genes in sesame provides convincing explanation to the flowering date divergence of sesame. And the natural and artificial selection of the photoperiod genes spurred sesame expanding globally after its domestication in India (Bedigian, 2003(Bedigian, , 2010. Plant height decides the plant architecture and contributes to the sesame yield (Wang et al., 2016). Because high-stem varieties tends to lodging at the ripening stage, decreasing plant height of modern varieties is one of the major targets in sesame breeding. "Mishuozhima" is a typical dwarf landrace. Its plant height is about half of that of "Zhongzhi13" in different environment (Figure 1). Interestingly, a frameshift deletion was found in GA20ox1 in "Mishuozhima" genome, but was not detected in "Baizhima" and "Zhongzhi13" genomes. GA20ox1 is the famous "Green revolution" gene which leading the decrease of plant height and substantial increase of yield in rice (Spielmeyer et al., 2002). Mutations in GA20ox1 also cause semi-dwarf phenotypes in Arabidopsis and barley Barboza et al., 2013). This loss-of-function mutation in GA20OX1 might be the key reason of the dwarf phenotype of "Mishuozhima." Oil content of sesame has been significantly selected such that varieties produce much more fatty acid than landraces. Oil content of "Zhongzhi13" (58%) is significantly higher than that in "Baizhima" (48%) and "Mishuozhima" (50%). Lipid metabolism genes in sesame were annotated based on homology analysis of the genes in acyl lipid pathways in Arabidopsis. Of these genes, large effect genes were found in LTP, WRI1, ACNA, and LOX2. Copy number of LTP and LOX2 had been reported intensely affect oil content of sesame (Wang et al., 2014b). The genomewide association study revealed that ACNA was associated with the content of stearic acid and peanut acid in sesame (Wei et al., 2015). And mutation of WRI1 was found to cause an 80% reduction in seed oil content of Arabidopsis . These genes might have been artificial selected in the sesame breeding and contributed to the improvement of seed oil concentration.

Validation and Utilization of the Genomic Variations
To validate if the genomic variation could be used in the genetic research, we used it in the QTL mapping of sesame seed coat color. Seed coat color is one of the most significant characters FIGURE 4 | Variations in photoperiod genes of sesame. In phyC, ZTL, ELF, and COL, amino acid changes resulting from SNPs and InDels are shown. In the GI gene, a 1 bp insertion was found in "Zhongzhi13" and "Mishuozhima." of sesame. The color of mature sesame seeds is diverse, varying from black, intermediate colors to white. Compared with white seeds, black sesame seeds usually have less oil but more protein and lignin . In East Asia, black sesame seeds and its products are more popular among the consumers. Seed coat color was reported to be related to biochemical properties, antioxidant content, and disease resistance (Shahidi et al., 2006;El-Hamid Sayid El-Bramawy et al., 2008;Kanu, 2011). Previous studies indicated that sesame seed coat was controlled by one or two genes (Gutierrez et al., 1994;Zhang et al., 2013).
As mentioned above, "Zhongzhi13" is a white seed variety which "Mishuozhima" is a black seed landrace. To identify the major QTLs that control this trait, we previously constructed a F6 recombinant inbred lines (RILs) population derived from the cross between "Zhongzhi13" (white seed) as the recurrent parent and "Mishuozhima" (black seed) as the donor parent. One major QTL, which was accounted for 66.5% of the phenotypic variations, was detected between the markers ZMM2905 and ZMM2239 (527 kb) on LG4 (Figure 5). To fine mapping the locus and clone the gene for seed coat color, 80 SSR markers were designed based on the genome sequence of the two parents and were used to screen the population in the QTL region (Table S2). The candidate block of the QTL was sharped into 42 kb, between ZMM4056 to ZMM4067. Six candidate genes were identified in this region.
Sequence of candidate genes in "Zhongzhi13" and "Mishuozhima" was aligned. Frameshift InDels were found in one gene (SIN_1016759) of "Zhongzhi13, " suggesting SIN_1016759 might be the key gene related sesame seed coat color. Then qRT-PCR was performed to investigate the gene expression of the candidate genes. Gene specific primers and probes were designed and used in the qRT-PCR (Table  S3). Developing seeds from 5 to 20 days of "Zhongzhi13" and "Mishuozhima" were collected and expression of the genes was detected. The qRT-PCR results showed that all candidate genes did not expressed in all seeds of "Zhongzhi13" and only SIN_1016759 expressed in the seeds from 11 to 20 days of "Mishuozhima." Homology analysis revealed that SIN_1016759 was homologous gene of PPO, which encodes polyphenol oxidase and produce black pigments through the browning reaction in plant (Mayer, 2006). Therefore, we concluded that SIN_1016759 (PPO) might be the candidate gene that controlling seed coat color in sesame.

DISCUSSION
In this work, we obtained the draft genome of two representative landraces, "Baizhima" and "Mishuozhima, " based on next generation sequencing technologies and de novo assembly. Genome of the landraces offers an opportunity to comprehensively investigate the genome variations in the landraces of sesame at the nucleotide resolution. After a whole genome comparison of the landrace genomes and the reference genome, we succeeded to identify the first comprehensive catalog of SNPs and InDels among the samples, containing more than 1,800,000 variants. These genome variations are a source of genetic variability, generating new allelic variants during natural or artificial selection. Although the large-effect InDels in PPO genes have been validated, it should be noticed that the SNPs and InDels detected still needed further experimental validation.
This study confirms that a single genome of variety does not adequately represent the diversity contained within sesame. The availability of the "Baizhima" and "Mishuozhima" genome would help to identify variations in gene and nucleotide level. Because of the genetic bottleneck, genes might be lost during the domestication (Sang and Ge, 2013). Identification of new genes in multiple genomes is possible. New genes have been identified in different genome of rice and grape (Di Genova et al., 2014;Wang et al., 2015). Some lost genes might be identified in the sesame landrace genomes. And novel molecular markers could be designed for sesame genetic research and breeding. Rare SNPs and InDels are useful for DNA marker-assisted improvement and providing new information about how variation is distributed across the genomes (Ohmido et al., 2000). Benefiting from the high genome coverage sequencing, large sample of variation in low-copy have been identified.
The variable distribution of SNPs and InDels rates in sesame landraces genome has been reported in several other plant genomes (Lai et al., 2010;Arai-Kichise et al., 2011;Evans et al., 2013;Yadav et al., 2015). Some of the variations might come from the natural selection during domestication. It is thought that sesame were domesticated 5000 years ago, then moved along trade routes to quite different environment and underlined natural selection to adapted the conditions (Bedigian, 2003;Fuller, 2003). In addition, the regions of high SNPs and InDels density could correspond to regions containing genes under divergent selection, or recombination hot spots that result in elevated localized mutation rates (Lercher and Hurst, 2002). Genes for disease resistance or adaptation to the environment are often associated with recombination hot spots and highly various of the genes provide a selective advantage for crops (Lai et al., 2010;Evans et al., 2013). Alternatively, low density of SNPs and InDels in the linkage group regions indicates a large reduction of genetic variation and strong selective sweeps might occur in these regions. Selective sweep usually results from artificial selection of significant superior alleles for crops and rapid fixing of the superior alleles in modern varieties . For sesame breeders, knowledge of the nature of diversity in genomes, and the presence of low genetic diversity regions, could contribute to the design of breeding approaches that accelerate crop improvement.
To meet the quickly increasing labor cost of sesame cultivation and continuously increasing consumption of sesame, there is an enhancing effort to create mechanized cultivation varieties with super high yield. "Zhongzhi13" is the most widely grown varieties in Yangtze River Basin in the last 10 years (Wang et al., 2014b). But the high plant height and long mature period have prevented it to be a suitable variety for mechanized harvesting. Thus, plant height reducing, flowering early and yield increasing are important targets of the sesame genetic improvement. "Mishuozhima" is a typical semi-dwarf landrace with early flowering date. "Baizhima" is a much branched landrace with multi seeds in the capsules. Both of them could provide valuable alleles for genetic improvement of sesame varieties. Other superior traits that might confer advantages to "Zhongzhi13" of the two landraces include strong disease resistance, high sesamin content, and black coat of seed. Linkage mapping of several important traits have been carried out in the superior QTL seeking of sesame landraces Wu et al., 2014). However, plenty candidate genes were located in the QTL region and it is difficult to determine which one is the key gene. The identified SNPs and InDels in "Baizhima" and "Mishuozhima" genome could be used as molecular markers and help to improve the saturation of the genomic region where a QTL has been identified, which should reduce substantially the list of candidate genes in the QTLs. In addition, the availability of SNPs and InDels could provide valuable information to gene functional analysis and help in the identification of candidates genes related to traits of interest (Pan et al., 2008).
Seed coat color has been domesticated and selected significantly in sesame. Seed coat color of sesame in wild sesame is black and was domesticated to be light colors in landraces. Then in the modern varieties, more than 98% was white seed (Wei et al., 2015). It is reported that sesame seed coat color was controlled by several major QTLs Wang et al., 2016). In fact, besides the major QTL which PPO located, three other QTLs of sesame coat color were identified. Previous research indicated that seed coat color was affected by various pigments including flavonols, proanthocyanidin, and other phenolic relatives, like lignin and melanin (Yu, 2013). And the genes involved in flavonoid metabolism were cloned from Arabidopsis (Saito et al., 2013), rapeseed , grape (Malacarne et al., 2016), and soybean (Yang et al., 2010). Thus, genes involved in the flavonoid synthesis might be the candidate genes of sesame coat color in the other identified QTLs. And the genome variations detected in these QTLs will provide valuable clues for future fine mapping and cloning of the genes of sesame seed coat color. Seed coat color significantly associated with oil content in sesame (Wei et al., 2015). The sesame seeds with light seed coat color tend to have higher oil content. Similar phenomenon is also found in the rapeseed seeds (Tang et al., 1997). BrTT8 had been identified as the functional gene of yellow seed coat of rapeseed and the homolog gene of BrTT8 in Arabidopsis, TT8, was reported to inhibit seed fatty acid accumulation by targeting several seed development regulators Chen et al., 2014). Therefore, we concluded that seed coat genes might affect the lipid synthesis of sesame and the cloning of seed coat gene might provide new strategy of oil content improvement for oilseed crops.
A large number of large-effect SNPs and InDels have been identified in the genes that related to plant development, disease resistant, flowering, and lipid metabolism. The identified largeeffect genomic variations in sesame can be used in discovery and cloning of candidate genes related to the significant agronomic traits. And the variations in these genes could be easily converted into effective selection markers and used in the molecular marker assisted selection breeding of sesame.

CONCLUSION
De novo genome sequencing of two representative sesame landraces, "Baizhima" and "Mishuozhima" had been completed and draft genome with contig N50 47 kb were obtained. Genome comparison of the landrace and modern variety genomes identified more than 1,800,000 SNPs and InDels, which were the source of agronomic traits variations in sesame. A large number of genomic variations related to important traits were also detected in the landrace genomes, which will provide valuable gene resources to sesame breeding. And the availability of the SNPs and InDels should accelerate the QTL mapping and gene cloning of sesame. Molecular markers developed from the SNPs and InDels could become a powerful tool in molecular assisted breeding. All of these will largely improve the efficiency of genetic improvement in sesame.