Genome-Wide Identification of Sequence Variations and SSR Marker Development in the Munake Grape Cultivar

The Munake grape cultivar produces uniquely flavored and high-quality fruits. Despite the numerous beneficial agronomic traits of Munake, there are few genetic resources available for this cultivar. To address this knowledge gap, the entire genome was sequenced using whole-genome sequencing approaches and compared with a Vitis vinifera L. reference genome. This study describes more than 3 million single nucleotide polymorphism (SNP), 300,000 insertion and deletion (InDel), 14,000 structural variation (SV), and 80,000 simple sequence repeat (SSR) markers (one SSR per 4.23 kb), as well as their primers. Among the SSRs, 44 SSR primer pairs were randomly selected and validated by polymerase chain reaction (PCR), allowing discrimination between the different Munake cultivar genotypes. The genetic data provided allow a deeper understanding of Munake cultivar genomic sequence and contribute to better knowledge of the genetic basis behind its key agronomic traits.


INTRODUCTION
In recent years, genomic studies on grapevine cultivars have provided a significant quantity of genomic information, including evolution and domestication processes (Zhou et al., 2019), molecular markers and transferability (Li et al., 2017;Zou et al., 2020), sex determination (Massonnet et al., 2020), and population genetics (Zhou et al., 2019). Furthermore, V. vinifera cultivar genomic information has been utilized as evidence to support ancestral hexaploidization in major angiosperm phyla. This information has been applied both to scientific research and biotechnological development of new grapevine cultivars. China is one of the largest grapeproducing countries in the world, with a cultivation area of 809,600 hm 2 that mainly produces table grapes. There are many table grape, high quality grapevine cultivars with a long history of cultivation in China, in particular the Munake cultivar, which is native to Xinjiang Province (Zhang et al., 2010). This cultivar germinates in early April, blossoms in mid-May, and ripens in early October. It produces uniquely flavored, thin-skinned, sweet fruits with a crispy texture (Pang et al., 2015;Keriman et al., 2016;Xu et al., 2018;Supplementary Figure 1), with a berry weight of 5 to 10 g and grape cluster weight from 600 to 1,000 g.
Several unsuccessful attempts have been made to introduce the Munake cultivar to other Chinese provinces. With the exception of a few areas in the Hexi Corridor of Gansu province, it is mainly cultivated and narrowly distributed around the Atushi area of Xinjiang province. This is because the Munake cultivar requires particular weather conditions to grow, such as an average temperature of 12 to 13 • C, a frost-free period of more than 220 days, an effective accumulated temperature greater than 4,200 • C, and less than 120 mm rainfall per year (Keriman et al., 2016). Despite its numerous, desirable agronomic traits, this cultivar is not widely grown (Biagini et al., 2016;Zhang et al., 2016), creating challenges associated with its storage and transport (Yang et al., 2009;Zhang et al., 2016;Xu et al., 2018). Breeding programs thus aim to transfer the beneficial agronomic traits of the Munake cultivar into other hardier cultivars, or breed a more climatically versatile Munake cultivar to overcome the storage and transport problems. However, there are currently few genetic resources available for this cultivar, which hinders such breeding programs.
The completion of genome sequencing of the Munake cultivar is necessary for comparative and functional genomics studies of table grapes, and the resulting genetic and genomic data will be a valuable resource for Munake cultivar breeding programs Karastan et al., 2018;Minio et al., 2019). This study describes whole-genome sequencing of the Munake cultivar and compares it with a V. vinifera L. reference genome (Jaillon et al., 2007). A significant number of effective DNA markers were developed for exploring agronomic trait-related genes by comparing the sequences obtained with the reference genome of V. vinifera, as well as some specific simple sequence repeat (SSR) markers. The SSR markers have the advantages of accurate and rapid cultivar detection even with the existence of co-dominance, multiple alleles, high polymorphism, and variability, and hence can be used for species identification, establishment of wellknown varieties of germplasm resources, and determination of differences between varieties (Patzak et al., 2019;Prysiazhniuk et al., 2019). The knowledge gained will create a large tag library and the foundation for future genetic studies, enabling molecular breeding and species resource protection of the Munake cultivar and its relatives.

Sample Preparation, DNA Isolation, and Genome Sequencing
A Munake cultivar sample was selected from its primary production area of Artux City, Xinjiang, China. Genomic DNA was extracted from young leaves using the modified cetyl trimethylammonium bromide method (Doyle and Doyle, 1990). The extracted DNA was randomly iron fragmented at 95 • C, and 350-bp fragments were gel-purified. These fragments were used to construct DNA libraries by cutting the genomic DNA using a Covaris M220 Focused-ultrasonicator. The HiSeq X-Ten platform (Illumina, Inc., San Diego, CA, United States) was used for adaptor ligation and DNA cluster preparation. Only high-quality data were used for mapping by filtering the lowquality reads (single reads <20 bp) 1 , in order to reduce the error probability in the mapping process. The clean reads were used for subsequent analyses and were stored in the National Center for Biotechnology Information database under the BioProject accession number PRJNA632683.

Mapping Reads to the Referenced Genome
The Burrows-Wheeler alignment (BWA) software was used to align the paired-end (PE) sequences of the Munake cultivar to V. vinifera reference genomes 2 based on default parameters (Li and Durbin, 2010; Table 1). Reads that could be aligned to more than one position in the referenced genome were filtered (Subbaiyan et al., 2012). The alignment results were used to calculate the average sequencing depth and coverage of mapped reads (Li et al., 2012). Single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), and structural variation (SV) polymorphisms were detected in the mapped reads.

Detection of SNPs, InDels, and SVs
Variants were identified using the Unified Genotyper application of the Genome Analysis Toolkit (DePristo et al., 2011), programmed to detect SNPs and InDels. Based on the nucleotide substitutions, the SNPs detected were classified as transitions (Ti) (C/T and G/A) or transversions (Tv) (C/G, T/A, A/C, and G/T) ( Table 2). SAM files generated by BWA were converted to BAM format using SAMtools (Li et al., 2009), duplicates were marked first, and local realignment and base recalibration were completed using the Picard tool to ensure correction of the clean reads. The Haplotype Caller method was then used to detect SNPs and InDels. All results were shown as gVCF files, which were prepared for joint-genotype analysis and reads filtering. The filtered parameters were: varFilter -w 5 -W 10 (vcfutils. pl), QUAL < 30, QD < 2.0, MQ < 40, and FS > 60.0 (Fisher test); the other parameters were executed using the default values.
SV detection was performed using the Breakdancer Max. pl software with default parameters (Chen et al., 2009). To obtain reliable SVs, the detected SVs must be returned to the PE alignments between the Munake cultivar and the reference genome. Six SV types were detected: insertion (INS), deletion (DEL), interchromosomal translocation (CTX), deletion including insertion, intrachromosomal translocation (ITX), and inversion (INV).
The positions of small InDels in the Munake cultivar were also located using the SnpEff software, and described using the same main polymorphism types as annotated for SNPs, except that the InDels in the CDS were separated into Start_Lost, Frame_Shift, Codon_Deletion, Exon_Deleted, Codon_Insertion, Codon_Change_Plus_Codon_Deletion, Codon_Change_Plus_Codon_Insertion, Stop_Gained, and Stop_Lost. The genic SVs were classified as either CDS, untranslated regions (UTR), or introns, according to their localization (Birney et al., 2004). The gene ontology (GO)/Pfam annotation data were further used to functionally annotate each gene, including non-synonymous SNPs −10 to 10-bp in length.

Designation of SSR Markers
In the Microsatellite identification tool 3 , clean reads were used to identify and localize microsatellite motifs. Only the SSRs, including mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs with numbers of uninterrupted repeat units exceeding 10, 7, 6, 5, and 4 were selected. The selected SSR loci for developing genetic markers should include a perfect repeat motif and two unique flanking sequences with 150 bp on each side of the repeat (Varshney et al., 2005). Primer 3 software 4 was used to design unique flanking sequences with three parameters: (1) primer length ranging from 20 to 26 bases with an optimal size of 23 bp; (2) melting temperature between 58 and 63 • C, with an optimum annealing temperature of 6 0 • C; (3) a guanine-cytosine content of 40-60%, with an optimum content of 50%. To validate the new SSRs, a few primer pairs were chosen for polymerase chain reaction (PCR) amplification.
A DNA Engine thermal cycler (Bio-Rad, Hercules, CA, United States) was used for amplification with the following program: 94 • C for 4 min; 35 cycles of 94 • C for 45 s, 52 • C for 1 min, and 72 • C for 45 s; and a final extension at 72 • C for 10 min. The PCR products were checked by electrophoresis in 1.5% agarose gels containing 1:20 GoldView in our lab and analyzed with fluorescent primers on ABI instrumentation (Shanghai SBS, Biotech Ltd., China), and photographed with a Photoprint 215 SD (Vilber Lourmat Co., Marne la Vallée, France).
PCR products were sequenced to obtain the fluorescence absorption peak pattern and peak value. GenALEx 6.5 was used to calculate the number of alleles, effective allele number, observed heterozygosity (Het), expected heterozygosity, Shannon's index, primer polymorphism information at the locus level, and polymorphic information content (PIC) (Mahmood et al., 2019).

Mapping of Re-sequencing Reads to the Reference Genome
A total of 111,846,066 short reads of 111 nucleotides (1.12 × 10 7 reads) were generated in this study. 1.06 × 10 8 short reads were successfully matched to V. vinifera (accession GCF_000003745.3 at the International Grape Genome Program and GenBank 2 ). The average sequencing depth was approximately 22.71× coverage of the reference genome, and the resulting consensus sequences covered approximately 94.79% of the reference genome ( Table 1). Among them, 5.59 × 10 7 PE reads and 5.01 × 10 7 single-end (SE) reads were mapped to chromosomes corresponding to 111.8 Mb of the reference genome (Table 1 and Figure 1). Although the percentage of sequenced reads from the Munake cultivar aligned with the reference genome was above 90%, the PE reads aligned with V. vinifera exceeded the SE reads. Furthermore, a total of 9.5 × 10 6 reads were uniquely (85.4%) mapped to the reference genome, and the remaining 1.63 × 10 7 reads were mapped to multiple locations (Table 1).

Detection and Distribution of Variation
A total of 3,628,662 SNPs were detected, including 2,446,163 Ti, 1,182,499 Tv, and a 64.1% Het-ratio (Heterozygosity/homozygosity (Homo) × 100%). A total of 26,802 InDels of <10-bp in length were detected. This included 9,551 InDels in the CDS, and the main concentration of InDels were −1 and 1-bp in length both genome-wide and in the CDS. The proportion of SNPs is shown in Figure 1.
Among the variations detected, the highest numbers of SNPs (n = 397,613), InDels (n = 81,801), and SVs (n = 1,473) were observed in chromosome 14. Contrarily, the lowest numbers of   Table 2). The density of SNPs in the different chromosomes is shown in Figure 2. Chromosome 14 had the highest density of SNPs (3,251.7 SNPs per Mb), and chromosome 10 had the lowest SNP density (1,014.9 SNPs per Mb) ( Table 2). The distribution of polymorphisms was uneven within chromosomes. All chromosomes comprised a mixture of dense and sparse SNP regions (Figure 2). The distribution patterns of InDels and SVs densities were similar to those of SNPs ( Table 2). The proportion of Ti was larger than that of Tv, with a Ti/Tv ratio of 2.06. A total of 772,885 InDels were detected, with the Genome-Deletion type (52.3%) more abundant than the Genome-Insertion type (47.7%), and the genome Het number (66.5%) was significantly higher than the genome Homo number (33.5%) (Figure 3).

Annotation and Effect of SNPs
The ratio of non-synonymous-to-synonymous SNPs (1.01) was calculated based on all gene models. There were 6,828 InDels and 21,153 SNPs annotated according to the two gene functional databases (GO and Cluster of Orthologous Groups of proteins).
Using the GO database (Figure 4), the SNPs were classified as those involved in biological processes, cellular components, or molecular functions. The majority of SNPs (n = 3,067) appeared to be involved in a biological process, including reproduction, metabolic processes, cellular processes, and response to a stimulus. Many SNPs (n = 2,353) annotated as involved in a molecular function were detected, mainly annotated as having catalytic activity and involved in molecular binding. There were only 558 SNPs classified as being associated with a cellular component, in which the gene participates as a component of the cell, cell membrane, or organelle.
According to the Cluster of Orthologous Groups of proteins database (Figure 5), the gene annotations were mainly concentrated on general function prediction only (n = 2,510), gene transcription (n = 1,287), and replication, recombination, and repair (n = 1,202). By comparing the SNPs in the Kyoto Encyclopedia of Genes and Genomes database (Figure 6), 2,733 were annotated, mainly referencing the plant hormone signal transduction (n = 146), RNA transport (n = 136), and oxidative phosphorylation (n = 125).

Development and Potential Applications of the New SSR Markers
A total of 87,872 SSRs with a genome frequency of one SSR per 4.32 kb were identified. Di-nucleotide repeats were the most abundant type followed by tri-nucleotide repeats. AG/CT was the most frequent di-nucleotide motif, while AAG/CTT was the most abundant (5.40%) tri-nucleotide motif.
Forty-four pairs of primers were designed based on the sequences flanking the SSRs to validate the markers by the PCR experiments (Supplementary Table S1), and clear bands were produced (Figure 7). Using 12 of the 44 pairs of SSR primers to analyze the genetic diversity of 57 Munake cultivar samples, a total of 168 alleles were detected with an average of 44 for each pair of primers. In this study, only MNG15, MNG18, and MNG1314 showed low variation (PIC < 0.25), MNG6666 and MNG9999 had medium variation (0.25 < PIC < 0.5), while in MNG03, MNG10, MNG14, MNG23, MNG26, MNG29, and MNG35 the degree of variation was high (PIC > 0.5). The PIC    of MNG29 was the highest, reaching 0.842 ( Table 3). The SSR markers have the advantages of codominance, multiple alleles, high polymorphism and variability, accurate and rapid detection, which can be used for species identification, establishment of well-known varieties germplasm resources and determination of differences between varieties.

DISCUSSION
There are numerous genomic resources available for grapevines. Publication of the V. vinifera genome accelerated the genetic research of this valuable crop, including analyses of its genetic structure, construction of genetic maps, assessment of genetic diversity, detection of genotype/phenotype associations, and marker-assisted breeding (Chen et al., 2015;Mercati et al., 2016;Money et al., 2017). Due to its clonal mode of propagation and genomic complexity, there was limited genomic information available for grapevines before the publication of the V. vinifera reference genome (Lijavetzky et al., 2007;Guo et al., 2015;Xu et al., 2016. Massonnet et al. (2020) studied the structure and evolution of the sex-determining region in Vitis species and reported an improved, chromosome-scale Cabernet Sauvignon genome sequence. Zou et al. (2020) added shotgun genome sequences from 40 grapevine accessions to enable the identification of conserved core PCR primer binding sites flanking polymorphic haplotypes with high information content. From these target regions, Zou et al. developed 2,000 rhAmpSeq markers as a PCR multiplex and validated the panel in four biparental populations, spanning the diversity of the Vitis genus and showing transferability increases to 91.9%. Munake grape, as a specific cultivars with high superior for table (Wu et al., 2011), were still kept in nothing referring its genomic studies (Guo et al., 2014;Ma et al., 2018;Xu et al., 2018). Its germplasm reservation and other applications were depending on the traditional agriculture as far (Keriman et al., 2016), the availability of its whole genome sequence can allow a positional selection of DNA fragments to be re-sequenced, enhancing the usefulness of the discovered and efficient implementation on its SNPs and other information (Magris et al., 2019). Multiple whole-genome sequencing projects of grapevines have contributed to the efficient implementation of SNP discovery in key cultivars (Tanksley et al., 1981;Berry et al., 1992;Wang et al., 2015), such as the Franco-Italian sequencing project and the IASMA sequencing project, both of which focused on the Pinot Noir (PN40024) cultivar (Roach et al., 2018). However, there have been few such projects conducted in China (Bai et al., 2013;Mu et al., 2018). SSR markers have been applied to all Chinese grapevine landraces, which confirmed that the Munake cultivar was clustered in the Oriental cultivars group (Li et al., 2017), which includes the most ancient Chinese accessions. However, the origin of the Munake cultivar still requires further evidence. Here, the whole-genome sequence of the Munake cultivar was described with more than 3 million SNPs, more than 300,000 InDels, and more than 14,000 SVs reported. Li et al. (2017) analyzed 61 Chinese grapevine cultivars and 33 foreign grapevine cultivars using nine SSR markers and found that Munake and Lvmunage cultivars distributed in the Xinjiang Province (China) are the same cultivar. The present study also described more than 80,000 SSR markers. Of these, 44 SSR pairs were randomly selected and validated by PCR; all tested SSR pairs were able to discriminate between Munake cultivar genotypes.
The SNP and InDel molecular markers are useful alternatives to SSR markers in high density marker studies, such as quantitative trait locus identification, genetic map construction, and fine genetic mapping (Song et al., 2015;Nicolas et al., 2016). The statistical findings of the detected SNPs and InDels showed that there are more than 3 million SNPs, 300,000 InDels, and 10,000 SVs ( Table 2). These variations constitute useful genomic resources for future studies of genetic differentiation. SSR markers were widely and randomly distributed throughout the genome, presenting several advantages (i.e., co-dominance, hypervariability, polymorphism, and ease and reliability of scoring) (Figure 7). SNP markers serve as effective markers for high-throughput mapping and for studying complex genetic traits. Our study contributes a considerable amount of genomic resources, including SNPs, InDels, SVs and genomic SSRs for the Munake grape, which all identification in our study can offer some genetic methods for the geneticists and breeders. The improvement application of the Munake grape can be facilitated on the utilization of this genomic information in linkage mapping, comparative genomics and molecular breeding based on the considerable efforts in future.

CONCLUSION
The Munake grape cultivar is a prime example of a species that is critically endemic because of invasive cultivar, diseases and habitat degradation. As a typical late-matured and unique tasted variety, its genetic analyses are an essential prerequisite for implementing effective management strategies. The provided genomic resources (SNPs, InDels, SVs, and genomic SSRs) for the Munake cultivar in here will be useful for geneticists and grape breeders to construct linkage mapping, apply genomic comparisons, and assist in molecular breeding for grape, and will expedite plant breeding programs of the Munake cultivar. In its exist populations, despite the threats the species faces, with adequate management, it is still possible to prevent its genetic resources in future.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://www.ncbi.nlm. nih.gov/, PRJNA632683.