A Partially Phase-Separated Genome Sequence Assembly of the Vitis Rootstock ‘Börner’ (Vitis riparia × Vitis cinerea) and Its Exploitation for Marker Development and Targeted Mapping

Grapevine breeding has become highly relevant due to upcoming challenges like climate change, a decrease in the number of available fungicides, increasing public concern about plant protection, and the demand for a sustainable production. Downy mildew caused by Plasmopara viticola is one of the most devastating diseases worldwide of cultivated Vitis vinifera. In modern breeding programs, therefore, genetic marker technologies and genomic data are used to develop new cultivars with defined and stacked resistance loci. Potential sources of resistance are wild species of American or Asian origin. The interspecific hybrid of Vitis riparia Gm 183 x Vitis cinerea Arnold, available as the rootstock cultivar ‘Börner,’ carries several relevant resistance loci. We applied next-generation sequencing to enable the reliable identification of simple sequence repeats (SSR), and we also generated a draft genome sequence assembly of ‘Börner’ to access genome-wide sequence variations in a comprehensive and highly reliable way. These data were used to cover the ‘Börner’ genome with genetic marker positions. A subset of these marker positions was used for targeted mapping of the P. viticola resistance locus, Rpv14, to validate the marker position list. Based on the reference genome sequence PN40024, the position of this resistance locus can be narrowed down to less than 0.5 Mbp on chromosome 5.


INTRODUCTION
Downy mildew of grapevine is caused by the oomycete Plasmopara viticola. It is a serious threat for viticulture especially in humid and warm environments. Intensive chemical plant protection, frequently through the use of copper, is necessary to protect grapevines under disease-promoting conditions. Resistance breeding allows for the selection of new cultivars that require a reduced number of chemical treatments. Genetic analyses of resistance loci have promoted such breeding initiatives that are today oriented towards stacking (combining) of several resistance loci (Töpfer and Eibach, 2016). In the recent past, therefore, a number of resistance loci for downy mildew have been identified, and these have been localized within grapevine genomes. Around two dozen loci designated Rpv (Resistance Plasmopara viticola) are listed in the Vitis International Variety Catalogue [VIVC, www.vivc.de;and (Hausmann et al., 2019)]. Some of them are of elite genetic background, such as Rpv1 (Merdinoglu et al., 2003), Rpv3 (Welter et al., 2007;Bellin et al., 2009), Rpv10 (Schwander et al., 2012), and Rpv12 (Venuti et al., 2013), and these can be used in breeding programs for selection of new cultivars. Other loci are less well characterized and are essentially known only from wild species (Marguerit et al., 2009;Blasi et al., 2011;Ochssner et al., 2016). To allow their application, these loci need to be introgressed into a favorable Vitis vinifera genetic background. Such an introgression can be accelerated considerably by using marker-assisted selection or marker-assisted back crossing (Herzog et al., 2013). To start such an approach, a locus should be well characterized, and sufficient closely linked markers should be available. Within plant breeding programs, simple sequence repeats (SSRs) and single nucleotide variant (SNVs) markers are widely applied. The discovery and development of this marker types by using NGS data has been successfully demonstrated for various plant species (Taheri et al., 2018).
One of the genetically useful genotypes as a source for resistance is the rootstock cultivar 'Börner'. It is an interspecific hybrid of V. riparia Gm 183 x V. cinerea Arnold; 'Börner' is thus derived from individuals of two Vitis species native to North America that carry several resistances. Among these are resistances to phylloxera [Rdv1 (Zhang et al., 2009)], black rot [Rgb1 and Rgb2 (Rex et al., 2014)], and to downy mildew [Rpv14 (Ochssner et al., 2016)]. To support the molecular analysis of these and other loci, we generated a first draft genome sequence assembly of the highly heterozygous 'Börner' genome. We produced comprehensive lists of SSR and SNV marker positions based on NGS data from 'Börner,' and we validated these marker positions randomly as well as by using Rpv14 on chromosome 5 (chr.5) as an example target.
The first grapevine reference genome sequence was generated based on Sanger sequencing technology using the 'Pinot noir' inbreeding line PN40024, which is estimated to be homozygous for 93% of the genome, thus enabling a successful assembly [(Jaillon et al., 2007), current version 12x.v2]. However, PN40024 suffers from inbreeding depression, and selfing grapevine for homozygosity is a difficult and a time-consuming process. Up until 2018, other available grapevine genome sequences were derived from heterozygous genotypes (Velasco et al., 2007;Adam-Blondon et al., 2011;Di Genova et al., 2014), and this posed a challenging sequencing and assembly problem that is not yet fully solved. In 2007, it was impossible to achieve a high-quality genome assembly for 'Pinot noir' due to its heterozygosity since merged allelic sequences made it impossible to separate both haplophases. As NGS techniques developed progressively, the number of sequenced plant genomes has rapidly increased, breaching the 100 mark some time ago (Michael and Jackson, 2013;VanBuren et al., 2015). Only a few genome sequences are based purely on the Sanger sequencing technique (e.g., Arabidopsis and grapevine). A larger number was generated based on mixed data types, including Sanger,454,and Illumina (e.g.,potato,sugar beet,and Brassica rapa) or only Illumina-derived sequences (e.g., eggplant and Norway spruce). The quality of the latter, generated by early NGS technologies, does not reach the quality of the older BAC-based Sanger assemblies despite higher read data coverage (Alkan et al., 2011;Burton et al., 2013). Although long reads became the best option for generating high-quality genome sequences (Li et al., 2018), the high throughput and low costs of Illumina sequencing has made it economical to sequence plenty of genomes at least to a draft state or to re-sequence individual genotypes. A major problem and current challenge in plant genome sequence assembly is resolving the repeat structures and proper continuous separation of haplophases within heterozygous or polyploid plants (VanBuren et al., 2015). Repeat sequences are very common in plant genomes, including repeats that are more than 10 kbp in length. Therefore, resolving plant genome complexity with short NGS reads, mainly 100-300 nt in length, is a big challenge. Such assemblies result in collapsed sequences and breakdown at the borders of repetitive regions, and this causes shortening of contigs. Some improvements were reached by using a variation of paired-end libraries (VanBuren et al., 2015) combined with mate-pair libraries. Long-insert libraries were of particular value to bridge such gaps up to about 12 kbp.
Here, we utilized sequencing data mainly generated by 454 and Illumina technologies to generate a draft assembly of the 'Börner' genome sequence with the main goal of allowing easy access to markers. The approach was complemented by sequencing BAC ends from a library created from 'Börner' DNA. We hypothesized that the interspecific nature of 'Börner' and the genetic distance of its two wild ancestor species, V. riparia and V. cinerea, could be helpful to dissect and assign the haplophases of 'Börner' in its parental genomes even with Illumina reads. Having at least to a certain extent phased contigs at hand, several downstream applications become feasible. The alignment of 'Börner' contigs to the reference PN40024 sequence revealed many polymorphic sites, including those at SSR positions, as being sources of potential high value for, to give an example, the purpose of grapevine resistance breeding. The detection of single SNVs or groups of nearby SNVs within a multiple alignment of two 'Börner' contigs and the PN40024 reference sequence are both considered to be useful in a bulked segregant analysis (BSA). The identification of markers linked to loci or genes causing disease resistance by applying BSA has been shown in plants (Michelmore et al., 1991). Pooling equal amounts of DNA is a practical way to reduce the cost of large-scale association studies to identify susceptibility loci for diseases (Sham et al., 2002;Zou et al., 2016). The pooling allows us to measure allele frequencies in groups of individuals by using far fewer genotyping assays than required when genotyping all individuals. Here, we demonstrate the power of heterozygous positions within one species to physical narrow down a candidate locus for downy mildew resistance (Rpv14).

BAC Library
To construct a BAC library of the grapevine cv. 'Börner' young frozen leaf material was sent to Bio S&T Inc. (Montreal, Canada) as a service provider for the generation of BAC libraries. The DNA was inserted into the vector pIndigoBAC-5 (Epicentre, Madison, USA; ENA/GenBank accession number EU140754) cut with HindIII. Competent cells of Escherichia coli strain DH10B were transformed with the ligation mix and plated on LB medium supplemented with 12.5 µg/mL chloramphenicol (CM), 40 µg/mL X-GAL and 0.2 mM IPTG. White colonies were picked and transferred in 384-well microtiter plates filled with LB freezing medium (Zimmer and Verrinder Gibbins, 1997) containing CM. Altogether, 159 microtiter plates with 384 wells were generated, and these were duplicated and stored at -80°C. The insert size was estimated by pulsed-field gel electrophoresis to be on average 93 kbp. The library coverage is almost 12 haploid genome equivalents based on a genome size of about 500 Mbp (Lodhi and Reisch, 1995).

Processing of BAC End Sequences (BES)
Trace files were processed by the base calling program phred (www.phrap.org, version 0.071220.c). Vector sequences were trimmed using the program blastall (Altschul et al., 1990) in version 2.2.24 with blast parameters set to the ones suggested by NCBI (www.ncbi.nlm.nih.gov/tools/vecscreen/about/). Lowquality sequences were masked by applying a sliding window approach (size: 10 bp, minimal average score > 13), and the longest unmasked subsequence was taken as a high-quality sequence if the length was ≥ 80 bp. Slippage reads were removed as described before (Dohm et al., 2012). The dataset was filtered for contamination with E. coli sequences (blastall matches with an e-value ≤ 1e-50). Afterwards, the sequences were singularized per BAC end. A total of 69,444 high-quality 'Börner' BES were submitted to ENA/GenBank and can be accessed under accession numbers KG622866 -KG692309.

NGS Sequencing Library Preparation
'Börner' DNA, extracted from young leaf material with a commercial kit (Qiagen) according to the instructions of the manufacturer, was sequenced using 454 GS-FLX Titanium (Roche), IonTorrent PGM (life technologies), Illumina GAIIx, MiSeq, and HiSeq1500 instrumentation. The preparation of Illumina paired-end (PE) and single-end (SE) sequencing libraries was performed according to the Illumina TruSeq DNA Sample Preparation v2 Guide. DNA was fragmented by nebulization. After end repair and A-tailing, individual indexed adaptors were ligated to the DNA fragments, thus allowing for multiplexed sequencing. The adaptor-ligated fragments were size selected on a 2% low melt agarose gel to a size of 300-600 bp. After enrichment PCR of fragments that carry adaptors on both ends, the final libraries were quantified with PicoGreen. The average fragment size of each library was determined on a BioAnalyzer High-Sensitivity DNA chip. Libraries were sequenced either 100 nt on an SE run using one lane of a GAIIx run (ENA/SRA accession numbers ERR2729619 and ERR2729620) or sequenced 2 x 100 nt PE on GAIIx (ERR2729739, ERR2729740, ERR2729741, ERR2729743, and ERR2729744) or HiSeq1500 (ERR2729742) sequencers. A mate-pair library was constructed according to the Illumina Mate Pair Sample Preparation v2 Guide. DNA was fragmented to a size of 3-4 kbp by using Hydroshear. Fragment ends were repaired and biotinylated. Afterwards, fragments of 3.5 kbp size were selected by gel electrophoresis. The resulting molecules were circularized. Removal of linear fragments was achieved by DNA exonuclease treatment. The circular molecules were randomly fragmented by nebulization. The biotinylated fragments were purified using streptavidin-coated magnetic beads. Several steps were carried out as described for the PE library construction. Finally, the library was sequenced in pairedend modus using an Illumina MiSeq generating 2 x 36 nt PE reads (ERR2729745).
GS-FLX sequencing libraries (454) were prepared from 10 micrograms of DNA according to the GS-FLX-Titanium Rapid Library Preparation Method Manual. DNA was fragmented by nebulization. Purified DNA fragments were subjected to end repair and subsequently to adaptor ligation. Large DNA fragments were selected via AMPure Beads. After emPCR and enrichment of DNA carrying beads, sequencing was performed on several runs on a Roche/FLX sequencing instrument using Titanium XLR70 sequencing kits (ERR2728011-ERR2718029). Mate-pair libraries (ERR2728288-ERR2728292) with varying insert length (multi-span PE reads, 3 kbp, and 8 kbp) were prepared from 15-20 micrograms of genomic DNA according to GS-FLX-Titanium manual method and using the GS FLX Titanium Paired End Adaptors kit and sequenced on a Roche/ FLX sequencer.
The library construction for IonTorrent PGM (life technologies) sequencing was performed according to the Ion Xpress™ Plus gDNA and Amplicon Library Preparation guide for a 200 bp library. DNA was fragmented by nebulization. The ends were repaired and ligated to adapters. Size selection of DNA molecules was performed via E-Gel ® SizeSelect™ Agarose Gel to an average insert size of 350 bp. After enrichment of the template carrying Ion Sphere particles on an Ion OneTouch, sequencing was performed using a 318 chip on an Ion Personal Genome Machine™ system (ERR2729812).
In addition to 'Börner' we used the V. vinifera cultivar 'Pinot noir' for DNA extraction, library construction, and Illumina HiSeq1500 sequencing (2 x 100 nt PE) according to the workflow described above. The read data were submitted to the ENA study PRJEB36326 with the read accessions ERR3834961 and ERR3834962).

NGS Read Pre-Processing
Illumina reads were quality trimmed and adapter clipped using Trimmomatic (Bolger et al., 2014) with all available adapter sequences from Illumina. Reads from the 454-and PGM/Ion Torrent-sequencing platform were trimmed with the CLC Genomics Workbench using the trim points of the SFF-files. Reads shorter than 36 nt were discarded. After pre-processing, 75.62 Gbp raw data (=90.9% of the raw bases) were available for de novo assembly.

Assembly, Post-Processing, and Validation
A de novo hybrid assembly of the 'Börner' genome was generated with the CLC Genomics Workbench (v. 8.0) using the certain parameters: length fraction 0.5, similarity fraction 0.8, word size 20, and bubble size 50. For the re-mapping of the reads to the contigs that the CLC assembler carries out, the filtering thresholds were optimized to a length fraction of 0.8 (default: 0.5) and a similarity fraction of 0.95 (default: 0.9). From 75.62 Gbp of read data, representing nearly 80-fold coverage for each of the two 'Börner' haplophases, 338,484 contigs with a total length of 652.17 Mbp were calculated. A total of 91.3% of the reads could be mapped back to the assembly, indicating a high probability that the sequence data was integrated correctly into the contigs. This raw assembly was designated BoeWGS0.8.1 (Supplementary Table 2).
The assembly metadata include a value for the number of reads that cover a given position and averages of these values for each contig. The value, which is essentially the total length of aligned reads divided by the assembly size, is referred to below as "alignment depth" (AD) and was used to deduce if a given contig represents a single haplophase or contains merged sequence information from both haplophases.
To test the completeness of the assembly we mapped the 69,444 'Börner' BES on BoeWGS0.8.1. Mapping was done with the program BLAT (Kent, 2002). The majority (43,008 BES = 61.9%) mapped with 99% identity or better and for a length fraction of 0.5. About 4.5% BES matched the end of contigs with between 20 to 50% of their length, but their mapping was not unique to one contig within the assembly. A small proportion of these (733 BES) produced multiple matches, suggesting that these contain mainly repetitive sequences. To avoid connecting contigs from different phases, we refrained from scaffolding since we had no information about the phase from which the BACs/ BES were derived.
The raw assembly BoeWGS0.8.1 was checked for contaminations by searching against a custom database with sequences of human, yeast, mold, several fungal and viral pathogens, and insects, such as aphids and mites, with blast+ (e-value <=1e-50) (Camacho et al., 2009). A BLAST (Altschul et al., 1990) run querying GenBank's non-redundant protein database (nr) completed the filtering process by using the same cut off criteria. Contigs matching to contaminants were also characterized by very low AD (1-20x). The removal of PhiX as well as sequences resembling cloning vectors caused a reduction of 213,834 bp and a total removal of three contigs. Plastid sequences from V. vinifera were detected in 67 contigs and removed as well. In addition, contigs smaller than 500 bp were discarded. Although nearly 30% of all initial contigs were rejected by this filter, the total assembly size was only reduced by about 5%. A total of 238,193 contigs remained after the different filters and were included in the final assembly, which was designated BoeWGS1.0 (Biosample PRJEB28084/http:// www.ebi.ac.uk/ena/data/view/PRJEB28084). The results were evaluated for standard assembly statistics parameters (Supplementary Table 2).
The BoeWGS1.0 contigs were aligned to the reference genome sequence PN40024 12x.v2 (https://urgi.versailles.inra. fr/download/vitis/12Xv2_grapevine_genome_assembly.fa.gz) with blast+ with the dust filter turned off and using an e-value cutoff of 1e-50 and a culling limit of 1. As a result of this analysis, each mapping contig from BoeWGS1.0 receives location information derived from PN40024. If exactly two BoeWGS1.0 contigs were assigned to one locus of the reference sequence, we took this as indication that the assembly separated the two 'Börner' haplophases. We have referred to these contigs, which may contain allelic sequence information, as "contig pairs".
BWA-MEM (Li, 2013) was applied to map all PE Illumina reads to the final assembly sequence using the -M option to mark short splits as secondary. The resulting mapping was subjected to inspection by REAPR (Hunt et al., 2013) to assess the assembly quality based on coverage, distance of PE reads, and orientation of PE reads.
BUSCO v3 (Simão et al., 2015) was deployed to analyze the recovery of the 'embryophyta odb9' benchmarking sequences in the assembly. This analysis was performed in the 'genome' mode using an e-value cutoff of 1e-3 and considering up to three candidate regions per BUSCO hit.

Calculation of Heterozygosity
Jellyfish v2.2.4 (Marcais and Kingsford, 2011) was applied on the reads of ERR2729742 to calculate histograms for k-mer lengths 19, 21, 23, and 25. The results were individually subjected to GenomeScope (Vurture et al., 2017) using default parameters for calculation of the heterozygosity. The average of all four resulting predictions was calculated to avoid k-mer length dependency of the final result. For comparison, we included Illumina read data from 'Pinot noir' (V. vinifera). As for the analysis of 'Börner,' about 40 Gbp Illumina PE data were used and the analyses were performed with the same parameters.

Genome-Wide Variant Detection
A strict mapping of 245,325,650 'Börner' PE reads (2x100 nt) on the PN40024 12x.v2 reference sequence with an average coverage of 50x formed the basis for the variant detection. The mapping was done with a CLC Genomics Workbench toolkit, the similarity fraction set to 0.95 and the length fraction to 0.9. Non-specific matches were discarded. Variants [single nucleotide variants (SNVs), insertions and deletions (InDels), as well as multi nucleotide variations (MNVs)] were detected with CLC's Probabilistic Variant Caller using default parameters. Only biallelic SNVs with a read coverage between 10 and 90 were kept for downstream analyses, whereas tri-allelic and poly-allelic variants, InDels, and MNVs were ignored.

Simple Sequence Repeat Detection
SSR detection was done with MISA [MIcroSAtellite identification tool; http://pgrc.ipk-gatersleben.de/misa/misa. html, (Thiel et al., 2003)] using default parameters. Possible SSR marker positions were evaluated by comparison of the candidate SSRs plus adjacent +/-200 bases from BoeWGS1.0 with the SSRs of the PN40024 12x.v2 reference sequence. The comparison was performed with blastn with dust filter turned off and 1e-50 as e-value cutoff. SSR candidates where a BoeWGS1.0 contig pair matched one coordinate in the reference sequence and differences did not exceed 40 bp (identity >=90%) were classified as useful for SSR marker development. The SSR candidates were named "SSR [chr#]_[contig-ID]" with a trailing number if more than one SSR candidate was detected for one contig pair.

SSR Marker Validation
For the validation of heterozygous SSRs within the BoeWGS1.0 assembly and variants in relation to the reference sequence, 53 SSR candidate positions were randomly selected for SSR marker validation. The SSRs from different genomic regions were selected with a minimum of two nucleotides and of six contiguous repeat units in the reference sequence. Primer3 (Untergasser et al., 2012) was used to design the amplimers (two primers directing towards each other based on the source sequence) fitting to the BoeWGS1.0 contig sequences as well as to the reference sequence PN40024 (Supplementary Table 1). The expected amplicon size of the markers was set from 150 to 400 bp and the primer size from 18 to 27 nt. DNA from 'Börner' and V3125 (see below) was used to check the polymorphism of the SSR marker sequences by PCR and gel electrophoresis on a 3% agarose gel. A few PCR products with expected smaller sequence variations were applied to Sanger sequencing and analyzed at the bp level by multiple sequence alignments. Verified heterozygous markers either in 'Börner' and/or in relation to V3125 were analyzed by fragment analysis using DNA of the crossing parents (V3125 and 'Börner') as well as the parents of 'Börner' as described before (Ochssner et al., 2016)

Locus-Guided Variant Calling and Marker Generation
The BoeWGS1.0 contig mapping on the PN40024 12x.v2 reference sequence was evaluated for a genomic region on chr.5. The BLAST results from the post-processing analysis described above were used. Regions where a multiple alignment composed of exactly two BoeWGS1.0 contigs (i.e., contig pairs) and parts of chr.5 from the PN40024 12x.v2 reference sequence was built were considered as genome regions representing both haplophases if the contig pairs selected showed the expected AD of about 75x (75 ± 25).
Each contig pair was aligned with edialign (from the EMBOSS suite, version 6.2) using default parameters. Alignments without N-stretches and with lengths of at least 500 bp were subjected to SNV detection. The list of contig pairs used for SNV detection on chr.5 is available as Supplementary Table 3. The downstream analysis focused on grouped SNVs between BoeWGS1.0 contig pairs that are bi-allelic after comparison with the PN40024 reference sequence.

Amplimer Design for the Rpv14 Locus
BoeWGS1.0 contig pairs on chr.5 were used for SNV-based marker development addressing the Rpv14 locus. Amplimers were designed using Primer3 (Untergasser et al., 2012) preferably for contig pair alignments with a rate between 0.4 and 1 SNVs per 100 bp (Supplementary Table 1 and Supplementary Table  3). We targeted an expected amplicon size between 400 and 530 bp. Deduced amplimers were tested to be unique in BoeWGS1.0 and PN40024 12x.v2.

Plant Material, Bulk Set Up, DNA Extraction, and PCR
A biparental F1 mapping population comprising 202 individuals of the grapevine breeding line V3125 ('Schiava Grossa' x 'Riesling') and 'Börner' (V. riparia Geisenheim 183 x V. cinerea Arnold), cultivated in the field at the JKI Institute for Grapevine Breeding Geilweilerhof, Siebeldingen, Germany, has been phenotyped three times for downy mildew leaf resistance as described previously (Ochssner et al., 2016). For SNV genotyping by amplicon sequencing, genomic DNA was extracted from the parents V3125, and 'Börner' and selected F1 genotypes of the mapping population. Building bulks of DNA from all the individual samples and typing one SNV marker at a time can save valuable template DNA and has been successfully utilized in microsatellite markers (Barcellos et al., 1997) and SNPs (Shaw et al., 1998). Two bulks with 10 individuals each were generated from the mapping population. The first (R) and second (S) bulks encompass resistant and susceptible genotypes, respectively.
Genomic DNA was extracted from young leaf material with a commercial kits (Qiagen) according to the instructions of the manufacturer. DNA samples were quantified using the Qubit™ dsDNA BR Assay Kit with the Qubit ® 2.0 Fluorometer (Invitrogen, Life Technologies, Darmstadt, Germany).
Genomic DNA from the parents of the mapping population V3125 x 'Börner', as well as from the two bulks, was used as template in 25 µl PCR reactions with the two marker-specific primers and 1 ng DNA per individual under standard conditions. The amplicons obtained were purified and sequenced from both directions using the ABI Prism BigDye Terminator chemistry on an ABI Prism 3730 sequencer (Applied Biosystems).

Determination of SNV/Allele Ratios
For the estimation of allele ratios, only bi-allelic BoeWGS1.0 SNVs that differentiate the two 'Börner' phases were considered, i.e., SNVs between BoeWGS1.0 contig pairs. Since only bi-allelic SNVs were considered, these positions are often homozygous in the V3125 genotype for one of the two 'Börner' alleles. Exceptions to these criteria (V3125 displays identical SNV heterozygosity to 'Börner') were excluded from the analysis. As a result, the allele frequency contributed from the 'Börner' parent has to be 50% and is 100% for the V3125 parent when counting the variant that is present in V3125. Following these assumptions, the expected frequency for each single allele in a bulk of the unselected F1 individuals of this mapping population is 75%. We have estimated the SNV frequencies within the parental genotypes in order to confirm the determination and to gain hints for problematic amplimers that amplify the haplophases with a bias for one of the two.
The estimation of the SNV frequencies among the pooled DNA samples (Pool R = resistant; Pool S = susceptible) and comparison with the parental lines was carried out by using the tool QVSanalyzer (Carr et al., 2009). The Sanger sequence trace files for determination of the relative proportions of two sequence variants were analyzed per batch. The generated output files contain details of the examined sequence variant ratios for individual samples as well as summary statistics. The area below the peak at the position of the targeted SNV was calculated and set into context with the surrounding peaks for each sample and the corresponding trace file. For graphical presentation of the results, the ratios were converted into percent values for SNV or allele frequency. SNVs considered in the analysis are listed in Supplementary Table 3).

A Draft Assembly of the 'Börner' Genome Sequence
The data for the 'Börner' genome sequence were obtained by whole genome shotgun (WGS) sequencing. The assembly, designated BoeWGS1.0, has a total size of 618.3 Mbp, and the N50 sequence size was 4,255 bp (Supplementary Table 2). We evaluated the average AD for the contigs in the assembly and classified the contigs with respect to the expected depth of aligned reads for paired contigs (separated haplophases) and contigs with merged sequences (both haplophases combined into one contig, see Supplementary Figure 1). A significant proportion of the contigs showed an AD between 50 and 100 (75 ± 25), which can be considered as the expected value for a haplophase-specific contig. The range has been deduced from the total amount of reads (75 Gbp), 2 x 500 Mbp for the expected fully diploid sequence length, and an interval of ±33%.
REAPR flagged a total of 14,007 regions (5.9%) as erroneous. Fragment coverage distribution errors within a contig accounted for 4,250 cases (1.8%), while the same error type over a gap added 5,674 cases (2.4%). The remaining cases were contributed by low coverage within a contig or across a gap with 3,723 (1.6%) and 360 (0.2%) cases, respectively. In summary, the assessment by REAPR indicated a short-read assembly from heterozygous material of acceptable to good quality. The heterozygosity of 'Börner' was calculated to be about 3.1% independent of the kmer length used in the analysis. This is a quite high value in comparison to the V. vinifera variety 'Pinot noir' for which we calculated a heterozygosity of 1.5% regardless of k-mer length (Supplementary Table 4).
BUSCO, i.e., the Benchmarking of Universal Single-Copy Orthologs, revealed the presence of 55.9% of all 1,440 benchmarking genes from the reference set for embryophyta, 42% of the total 1,440 as single copy and 13.3% as duplicated. In addition, fragments of 19.1% benchmarking genes were detected. Only 25% of the benchmarking genes were not detected in the assembly. It should be noted that the duplicated BUSCO genes can be explained by detection of two allelic versions in BoeWGS1.0 contig pairs. Taken together, the BoeWGS1.0 assembly represents a typical short-read assembly of a heterozygous genotype. Since we wanted to address marker development, we optimized the assembly for phase separation and not for continuity.

BoeWGS1.0 Contigs Representing Different Haplophases
The BoeWGS1.0 contigs were mapped against the V. vinifera reference genome sequence PN40024 12x.v2. About 210,444 (88.3%) could be mapped successfully with at least 30% of their length (https://doi.org/10.4119/unibi/2938185). The ratio of the total length of the mapped BoeWGS1.0 contigs to the length of the reference chromosomes is 1.14 on average (1.02 to 1.31; Supplementary Table 5), indicating uniform assembly quality and an overall homogeneous synteny between 'Börner' and PN40024. The median of the average contig ADs for all 19 chromosome mappings was 86.3 (Supplementary Table 5). Conspicuously, it is only the genetically unassigned chromosome "Ukn," or "unknown," where the mapped BoeWGS1.0 contigs have an AD of 248 on average. The remaining, unmapped contigs contain sequences that are too diverse from the PN40024 reference sequence to be mapped.
More than one quarter (29.2%; 142 Mbp) of the PN40024 reference sequence was covered by BoeWGS1.0 contig pairs, again suggesting that the two 'Börner' haplophases were partially separated in the assembly. Most of these contigs show on average an AD in the expected range. Another quarter of the reference (28.7%; 139 Mbp) was covered by one BoeWGS1.0 contig, and these contigs often (but not always) display higher AD values. In these cases, either the two 'Börner' haplophases were merged into one contig during assembly, or only one of the two haplophases displayed sufficient similarity to become mapped.
Due to repetitive sequences, a small fraction of the PN40024 reference sequence (3.7%; 18 Mbp) was matched by more than two contigs. Nearly 38.5% (187 Mbp) of the reference sequence (including 3% N-stretches) was not covered by any stringently matching BoeWGS1.0 contig (Figure 1).

Homozygous and Heterozygous Variant Frequencies
Genomic variants including SNVs contribute not only to intraspecific diversity but they are also candidates for valuable molecular markers. By mapping 'Börner' NGS reads against the reference sequence and subsequent variant calling, almost 5 million highly reliable bi-allelic SNVs were detected (https:// doi.org/10.4119/unibi/2938178 and https://doi.org/10.4119/ unibi/2938180).
Half of the 4,996,490 SNVs are heterozygous SNVs (2,536,406), which means one of the BoeWGS1.0 contigs shows the same nucleotide as the reference at a given position. The other half (2,460,084) are homozygous SNVs, describing variants only existing between PN40024 and BoeWGS1.0. The frequency is on average 1 per 77 bp (Supplementary Table 6), ranging from 1/70 to 1/82 bp for chr.3, chr.8, and chr.16. In addition to SNVs, MNVs and small InDels were called, dropping the overall variant frequency slightly to 1 variant per 68 bp.

SSR Detection for Marker Development
Simple sequence repeat (SSR) markers are often used to study molecular diversity or heterozygosity as well as for genetic mapping in grapevine. By comparing SSR positions in the grapevine reference genome sequence with BoeWGS1.0, a total of 10,820 putative SSR marker positions ("candidates") with different unit sizes (two to six) and repeat numbers (up to 27) were deduced ( Table 1). Regions of the reference matched by a BoeWGS1.0 contig pair were exploited. Out of the 10,820 positions, 12% (1,313) were monomorphic and 38% (4,110) were tri-allelic in an alignment of the three sequences. The remaining 50% were either bi-allelic or showed more than one motif in the SSR region. The more than 4,000 tri-allelic SSR candidates are the most valuable ones with regard to genetic mapping in a broad range of accessions (Supplementary Table  7). For various reasons, not all of them are new; for example, candidate SSR chr1_1203 corresponds to the established marker GF01-03, SSR chr1_1083 to marker GF01-21, and SSR chr1_1248 to marker GF01-22 . However, the identification of already existing markers convincingly indicates that our candidate list is valid.
A set of 45 tri-allelic and eight bi-allelic SSR candidates with a unit size of two or three randomly selected from all  chromosomes was used to validate the power and reliability of SSR candidate prediction. The bi-allelic candidates were included to estimate the chance of false negative predictions. In this subset of eight candidates, we tested six candidates; the reference matched only one contig of BoeWGS1.0, and two candidates were not found on a contig pair but shared a unit size and number. Out of the total set of 53 SSR sequences, 38 could be verified by either PCR and subsequent agarose gelelectrophoresis, Sanger sequencing, or fragment analysis (Supplementary Table 8). The 15 negative cases were caused by unspecific primers that amplified various fragments instead of allelic amplicons. We did not optimize the amplimers empirically in a second round of amplimer design. As expected, 'Börner' amplicons with predicted small sequence variations between the alleles were analyzed more successful by Sanger sequencing than those with larger sequence variations. In six out of eight cases, the predicted bi-allelic SSR sequences were verified as monomorphic within 'Börner'. Analysis of the sequenced nucleotides before and after the SSR region confirmed the assumption that these genome regions were less heterozygous than others (or monomorphic), and they were therefore only represented by a single contig in the BoeWGS1.0 assembly. For the SSR chr1_43023 candidate, the new marker GF01-59 was established, although for this candidate only one BoeWGS1.0 contig matched to the reference. Both Sanger sequencing and fragment analysis confirmed the existence of two 'Börner' alleles for GF01-59.
Sanger sequencing of amplicons from SSR candidate loci with larger differences (e.g., in unit size, unit number, etc.) was less successful. In addition, we observed in these cases higher heterozygosity between the alleles even outside the SSR itself. However, validation by agarose gel-electrophoresis worked out smoothly for these SSR candidates.
Since gel-electrophoresis did not allow us to verify exact allele sizes, these were determined by fragment analysis in 14 cases (Supplementary Table 8). Fragment analysis was performed with DNA of the exact parental lines of 'Börner' and V3125 in addition to 'Börner' itself. For 12 out of 14 SSR candidates different fragment sizes for the two 'Börner' alleles were detected that fit well to the predictions for both 'Börner' alleles. The validated SSR candidates-for which we generated complete documentation-received marker designations. Seven of these SSR markers turned out to be highly or fully informative because they discriminate all four alleles of the biparental cross V3125 x 'Börner'. One marker (GF01-60, derived from candidate SSR chr1_100137_2) was scored as monomorphic in this analysis, likely due to the small sequence difference of 2 bp in fragment size. In fact, exactly the 2 bp difference in fragment length were properly detected by Sanger sequencing, these but are surely below the reliable resolution for a marker assay based on fragment analysis. The last of the 14 SSR markers, GF06-19 (SSR chr6_800_1), did not show any scoreable fragment for 'Börner,' but amplified a single scoreable amplicon in V3125. Taken together, every method seems to have its limitations, but the different validation approaches demonstrated the high reliability and usability of our prediction approach for new SSR markers. Further investigation of the highly informative markers in the grapevine varieties 'Schiava Grossa', 'Riesling', 'Villard Blanc', 'Calardis Musque' and 'Pinot noir', as well as in the parents of 'Börner' showed a transferability of about 70 to 80%. Most of the assayed SSR positions were heterozygous within and between the different cultivars.

Targeted Mapping of the Downy Mildew Resistance Rpv14 by Bulk Segregant Analysis
To assist the introgression of the downy mildew locus Rpv14, which has been mapped in 'Börner' to the lower arm of chr.5 (Ochssner et al., 2016), into V. vinifera, closely linked markers are required. For SNV marker-based association studies for Rpv14, 25 amplimers were designed targeting the lower arm of chr.5 of 'Börner', and 17 of those turned out to be functional. As controls, two additional amplimers were designed, one at the top of chr.5 opposed to Rpv14 and one on chr.14. All primers were designed based on BoeWGS1.0 contigs and SNV data (see Methods). The 76 amplicons obtained for 19 markers and 4 template DNAs (susceptible V3125, resistant donor 'Börner', and the F1 bulks S and R) were sequenced, and the trace files were subjected to peak area determination and evaluation.
All predicted SNVs within or between the BoeWGS1.0 contigs could be verified. In one case, we observed a tri-allelic SNV and removed the respective data points from further analyses. For SNVs from physically distinct contigs, the peak ratios or SNV frequencies varied as initially expected for a bulked segregant analysis. By ordering BoeWGS1.0 contig pairs along the chr.5 of the reference genome sequence and allocating the corresponding SNVs or allele frequencies to these contigs, the changes in the frequencies of the pools are obviously correlating with the candidate locus ( Figure 2). For the control regions on chr.14 and at the top of chr.5, we hypothesized no selective pressure and therefore expected an allele frequency for the bulks of 75%. The observed frequencies on chr.14 fulfilled the expectation. The frequencies in the north of chr.5 in the R bulk (55-60%) differ to some extent from the expectation where, as in the S bulk, the SNV frequency is as expected. This may be due to the low number of genotypes within the bulks.
Even if only one out of the ten genotypes has no recombination between the target locus and the terminal part of the chromosome, the estimated frequency value changes theoretically by almost 10% of the expected value. Within the region of the candidate locus between approximately 19 and 21 Mbp, the resistant bulk has a nearly uniform estimated SNV frequency of 50%. This indicates that all resistant individuals in the bulk carry the same allele of the resistant 'Börner' parent (inherited from V. cinerea) (Ochssner et al., 2016). Since there are only a limited number of recombination events possible, the large interval is not surprising. At the same time, the SNV frequency in the susceptible bulk converges (continuously) to about 95%. Our results indicate that the variants of 'Börner', located between 19.67 Mbp and 20.6 Mbp in the corresponding reference sequence PN40024 are highly and specifically linked to the resistant trait. The center of the candidate locus with the highest linkage, showing SNV frequencies over 85% for the S bulk is an interval of 330 kbp between 20.31 and 20.6 Mbp. This is the region flanked by the BoeWGS1.0 contigs c35489 and c244127 on one side, and c12059 and c12060 on the other (NCBI accessions CCJE01069526.1, CCJE01173868.1, CCJE01026433.1 and CCJE01045859.1).

Genome Sequencing
For the development of molecular markers and to gain knowledge about genome regions as well as loci relevant for grapevine breeding, we analyzed the genome of the resistant interspecific hybrid 'Börner' (V. riparia x V. cinerea) by WGS sequencing. The draft assembly BoeWGS1.0 covers a total of 618 Mbp, which corresponds to 62% of the expected 1 Gbp long diploid 'Börner' genome. Assessment by REAPR revealed some critical regions that could be caused by the incompletely resolved haplophases. Mapping of reads to at least in parts very similar sequences, such as separated alleles, still poses a challenge. However, most of the contigs pass the read mapping-based assembly evaluation.
A major challenge during the de novo assembly of this dataset was to optimize the separation of sequences into contigs derived from both haplophases for subsequent generation of highly informative SSR and SNV marker assays. Different evaluations for the success of this separation result in somehow related-but in detail different-numbers. BUSCO finds about 14% duplicated benchmarking genes. Some genes in the reference set are,  however, not detected at all (BUSCO completeness is about 75%). This is probably an underestimation for the contig level since the relatively small contigs complicate gene calling. The total length of the assembly is about 120 Mbp longer than the haploid Vitis genome sequence, indicating that 240 Mbp of the 618 Mbp represent sequences from separated haplophases. The analyses of contig mappings against the reference genome sequence PN40024 indicates that 142 Mbp of the reference are covered by two BoeWGS1.0 contigs. Taken together, we assumed that the BoeWGS1.0 assembly contains phase-separated sequence information for one quarter of the 'Börner' genome. Note that this information is not "phased throughout" since, even if a contig pair is detected, we lack information about which of the two belongs to the V. cinerea and which belongs to the V. riparia haplophase. Anyway, since the regions for which separation worked are distributed throughout the genome, the data are a very good source for marker development.
Although the separation of sequences from both haplophases is great for molecular breeding applications and several genomewide investigations, the short-read-derived BoeWGS1.0 assembly is still a draft genome sequence. The short contig length significantly limits downstream analyses like gene prediction or approaches to detect genomic rearrangements. The main reason for the fragmentation of the BoeWGS1.0 assembly are regions where the ancestor genomes of V. cinerea and V. riparia are quite similar. In fact, the frequency of SNVs and MNVs detected after read mapping between the two haplophases of 'Börner' is in the same range as between 'Börner' and PN40024. However, it should be considered that read mapping requires quite similar sequences to be specific.
Currently, substantially longer read lengths (mean >20 kbp) than those that were used here are provided by third-generation single-molecule sequencing technologies like those offered by Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT). The long reads can be used to solve the problem of bridging repeat sequences or transposable elements. The first de novo assembly of domesticated grapevine using PacBio sequencing was from V. vinifera cv. 'Cabernet Sauvignon' (Chin et al., 2016) followed recently by the diploid assembly of the cultivar 'Chardonnay' (Roach et al., 2018;Zhou et al., 2019). The first and only wild Vitis genome sequence so far comes from the rootstock Vitis riparia 'Gloire de Montpellier' and reached considerable continuity by incorporating PacBio and 10X Chromium (Girollet et al., 2019). However, assembling a highly heterozygous genotype, accurate phasing, and aligning of Vitis haplotypes is still challenging. With regard to the data on hand and the complexity of this assembly, the application of dedicated scaffolding tools like scaffoldScaffolder (Bodily et al., 2015) or SSPACE (Boetzer et al., 2011) comes at a high risk of generating erroneous connections between contigs. Therefore, we calculated the BoeWGS1.0 assembly with very stringent parameters using the CLC Genomics Workbench toolkit, which contains an implementation of a de Bruijn Graph assembler, and refrained from scaffolding.
In addition to the room for improvement for the assembly parameters, there is a significant fraction of the BoeWGS1.0 genome sequence that is currently not able to align, and it is thus not assignable to the reference genome sequence. This genome fraction likely holds sequence information underlying important traits. Chromosomal anchoring of these contigs is a task that could be realized by the use of the provided SSR or SNV marker sequences of this study in an existing F1 mapping population (V3125 x 'Börner') . Finally, the generation and incorporation of long read data for improvement of the 'Börner' genome sequence is an option for future studies. We expect that a long-read assembly with good continuity would also allow to study 'Börner' sequences that display high divergence from the reference.

Deduction of SSR Sequences for Marker Development
The detected polymorphic SSR candidate loci are a very valuable resource for genome-wide marker development and genotyping. SSR markers are still the most abundant molecular marker type used for mapping in Vitis species and marker-assisted selection in grapevine breeding programs. Because of their very polymorphic nature, SSRs often allow us to clearly distinguish between more than two alleles. This is highly necessary when using F1 mapping populations derived from a cross of highly heterozygous parents, when following an allele in a phylogenetic tree, or when identifying accessions (Cipriani et al., 2011). All these conditions fit to grapevine.
In this study, we were able to reliably predict thousands of candidates for SSR markers. Furthermore, we validated high prediction reliability, which is essentially limited only by the design of specific primers, and demonstrated their applicability in Vitis genotyping approaches. The prediction schema for selection of promising SSR candidates that relied on aligning contigs representing separated haplophases instead of large amounts of sequence reads turned out to be very successful. We were able to reliably detect, predict, and select for longer sequence variations caused by the variable number of repeated units. This is favorable for marker detection by fragment analysis as the current predominant method used for SSRbased genotyping.

Heterozygosity of 'Börner'
For V. riparia (cv. 'Gloire de Montpellier'), the mean distance between heterozygous SNVs is 217 bp (Girollet et al., 2019). For the heterozygous 'Pinot noir,' a frequency of 1 nucleotide variant per 100 bp and 1 InDel per 0.45 kbp was described (Velasco et al., 2007). The overall variant frequency that we obtained between the 'Börner' genome sequence and the reference PN40024 was 1 variant per 68 bp. This number is likely to be a significant underestimation because of the stringent read mapping parameters and the narrow confines for the coverage filter. A calculation of heterozygosity using GenomeScope indicated a heterozygosity of 3.1% for 'Börner' and 1.5% for 'Pinot noir' as an example for a prominent V. vinifera cultivar. This result gives additional indication for the highly heterozygous nature of the 'Börner' genome compared to V. vinifera varieties.

Application of Phased Contig Data for Targeted Mapping of the Downy Mildew Resistance Rpv14 of Vitis
Single SNVs or groups of nearby SNVs within the multiple alignments of two 'Börner' contig pairs and the PN40024 reference sequence were shown to be useful for the molecular identification and localization of a quantitative trait locus. The highly reliable SNVs detected in our study can function as molecular markers in a bulked segregant analysis as shown for Rpv14 on chr.5 in our analysis. Rpv14 was selected as an example for proof of concept because the alternative Rdv1, which causes phylloxera resistance, has already been restricted to an area of about 350 kbp-a value deduced from the physical distance of the published flanking genetic markers (Zhang et al., 2009;Hausmann et al., 2014) relative to the PN40024 reference sequence. The other option, namely black rot resistance mediated by Rgb1/Rgb2, was not selected because there are at least two segregating loci (Rex et al., 2014). We were able to physically reduce the size of the Rpv14 target region to less than 500 kbp. In relation to the results from Ochssner et al. (Ochssner et al., 2016), the center of the target region is narrowed down from the north and the south with the center being still at 20.06 Mbp. One limitation for further reduction of the target region was the number of F1 individuals in our R and S pools or BSA bulks. A recent review on BSA (Zou et al., 2016) mentioned an optimal bulk size for monogenic traits of 10 to 20% of the segregating population for each pool or "tail" of the trait value distribution. However, a BSA study in rice (Wambugu et al., 2018) was successful with 10 or fewer individuals per pool. In addition, equal size of the contrasting pools is considered to be important. In our study, the pool size of 10 resulted from the number of F1 genotypes with reliable and consistent phenotypic scoring results within the population. Obviously, this limits the resolution since only a relatively small number of recombination events were evaluated.
Further research should focus on the identification of new F1 individuals with additional recombinations in the target region. This will probably require enlargement of the number of F1 individuals in the mapping population and additional phenotyping for downy mildew resistance. Subsequently, the work presented here allows for quick access to additional markers, which can be used to determine the recombination points that then can be correlated with resistance genes from candidate gene predictions.

AUTHOR CONTRIBUTIONS
DH and LH conceived and planned the experiments. DH, PV, and LH designed and performed the experiments. TR, DH, and BP calculated the data. RT, BW, and DH conceived the original idea and supervised the project. DH, RT, and BW wrote the manuscript with input from all authors. All authors interpreted and discussed results.

FUNDING
The project was supported by the German Ministry of Education and Science (BMBF) grant FKZ 0315460A and 0315460B (Acronym: GrapeReSeq) to BW and RT, respectively. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.