Identification and Validation of a Core Single-Nucleotide Polymorphism Marker Set for Genetic Diversity Assessment, Fingerprinting Identification, and Core Collection Development in Bottle Gourd

Germplasm collections are indispensable resources for the mining of important genes and variety improvement. To preserve and utilize germplasm collections in bottle gourd, we identified and validated a highly informative core single-nucleotide polymorphism (SNP) marker set from 1,100 SNPs. This marker set consisted of 22 uniformly distributed core SNPs with abundant polymorphisms, which were established to have strong representativeness and discriminatory power based on analyses of 206 bottle gourd germplasm collections and a multiparent advanced generation inter-cross (MAGIC) population. The core SNP markers were used to assess genetic diversity and population structure, and to fingerprint important accessions, which could provide an optimized procedure for seed authentication. Furthermore, using the core SNP marker set, we developed an accessible core population of 150 accessions that represents 100% of the genetic variation in bottle gourds. This core population will make an important contribution to the preservation and utilization of bottle gourd germplasm collections, cultivar identification, and marker-assisted breeding.

Currently, an increasing number of homogeneous new bottle gourd varieties are being released onto the market, and intellectual property disputes could emerge in cases where different species have been assigned the same name or where different names have been assigned a single species. Traditional identification and registration of germplasm collections or varieties are primarily dependent on field planting, which is both time-consuming and inefficient. Moreover, plant traits are often influenced by environmental variation (Jamali et al., 2019). Therefore, a precise, rapid, convenient, and cost-effective procedure, e.g., using molecular markers to predict traits (Jones and Mackay, 2015), is urgently needed to resolve intellectual property disputes and for cultivar improvement.
Molecular DNA markers are being used in an increasing number of crops for seed authentication, genetic diversity analysis, DNA fingerprint construction, and core collection development (Lv et al., 2012; Zhang et al., 2012; Hao et al., 2016; Yang et al., 2019). Several molecular markers have been developed for bottle gourd, including random amplified polymorphic DNA (RAPD), simple sequence repeat (SSR), and insertion-deletion (InDel) markers (Morimoto et al., 2005; Xu et al., 2011; Sarao et al., 2014; Wu et al., 2017). In addition, 3,226 single-nucleotide polymorphisms (SNPs) were identified by restriction-site associated DNA sequencing (RAD-Seq) genotyping of a natural population, and it was suggested that two sub-gene pools (Sub R and Sub L) were associated with fruit shape (Xu et al., 2014). SNP markers are suitable for high-throughput genotyping, due to their unique characteristics, including wide distribution, high density, and good stability (Su et al., 2018; Wang et al., 2018; Li et al., 2019; Wu et al., 2021). In this regard, Kompetitive allele-specific PCR (KASP) is a flexible, user-friendly SNP genotyping system that has been used for SNP genotyping in wheat, common bean, Brassica rapa, and cowpea (Allen et al., 2010; Cortés et al., 2011; Su et al., 2018; Li et al., 2019; Wu et al., 2021). The selection and development of a core SNP marker set with a high-throughput SNP genotyping platform, which can be used for rapid assessment and fingerprinting of germplasm collections, is essential for marker-assisted breeding and cultivar identification in bottle gourd.
Although originating from Africa, the bottle gourd was in use by humans in East Asia, the Americas, Europe, and the South Pacific (Erickson et al., 2005; Schlumbaum and Vandorpe, 2012; Kistler et al., 2014). Bottle gourd populations exhibit a tremendous diversity in fruit shape (Heiser, 1979; Morimoto and Mvere, 2004; Xu et al., 2014, 2021), and populations are consistently grouped by fruit shape rather than by geographical origin (Xu et al., 2011; Mladenovic et al., 2012; Yildiz et al., 2015). Bottle gourd germplasm is preserved in several seed banks and is used in research by institutions across the world (Morimoto et al., 2005; Achigan-Dako et al., 2008; Gurcan et al., 2015; Xu et al., 2021).
With regard to the preservation and utilization of germplasm collections in plant breeding, Frankel (1984) was the first to propose the concept of core collections. A core collection is a subset of the total germplasm collection that is designed to represent a substantial proportion of the overall genetic diversity of the collection as a whole (Brown, 1989). Subsequently, core collections have been established in a range of plants, including wheat (Balfourier et al., 2007), rice (Zhang et al., 2011), and barley (Muñoz-Amatriaín et al., 2014). In addition, publicly accessible core collections capturing a broad range of genetic variation have been developed for cucumber, a cucurbit crop (Lv et al., 2012; Wang et al., 2018). However, to date, no core collection has been developed to preserve and utilize bottle gourd germplasm.
In this study, based on the re-sequencing of 20 representative bottle gourds, we selected and identified a core SNP marker set consisting of 22 core SNPs with abundant polymorphisms evenly distributed throughout the bottle gourd genome. To demonstrate the representativeness of the 22 core SNPs, we evaluated the polymorphism and discrimination of the core SNP set using 206 bottle gourd collections and a multiparent advanced generation inter-cross (MAGIC) population. Using this core SNP marker set, we assessed the genetic diversity and population structure of bottle gourd collections, fingerprinted bottle gourd germplasm collections and commercial cultivars, performed an optimized procedure for seed authentication, and developed an accessible core population. The results thus obtained will play a vital role in the preservation and utilization of bottle gourd germplasm collections, as well as in cultivar identification.

Plant Materials and DNA Extraction
In this study, we utilized a total of 206 bottle gourd germplasm collections for core SNP marker set screening, all of which are inbred lines (Supplementary Table 1). Twenty bottle gourd germplasm collections with diverse agronomic traits and genetic backgrounds were selected as representatives and were employed to re-sequence the whole bottle gourd genome for SNP discovery and selection. A total of 377 elite lines of a bottle gourd MAGIC population (Supplementary Figure 1) were utilized to assess the potential polymorphism of the core SNP markers developed in our study.
All accessions were grown in a growth incubator at 28°C/22°C with a 16 h light/8 h dark regime, and for each accession, young leaves from three independent individuals were collected for genomic DNA extraction using the cetyltrimethylammonium bromide (CTAB) method (Maguire et al., 1994). DNA quality was verified by electrophoresis on a 0.8% agarose gel, and the DNA concentrations were quantified using a NanoDrop 2000 UV spectrophotometer (Thermo Fisher Scientific, Waltham, MA, United States) and adjusted to a concentration of 20 ng/µL with sterile water.

Single-Nucleotide Polymorphism Discovery and Selection
As a reference for SNP discovery, we used the Chinese landrace bottle gourd "HZ gourd" (Xu et al., 2021). Based on the SNP polymorphism information content (PIC) values of the genotypes of the 20 bottle gourd representatives that were resequenced in a recently published study (Xu et al., 2021), we selected SNPs for identification by KASP assays on another 22 distantly related accessions that were selected from the 206 bottle gourd germplasm collections. After eliminating low-quality and low-discriminatory SNPs, high-quality SNPs were selected for core SNP marker screening of the 206 bottle gourd germplasm collections. Core SNP markers were examined using previously published protocols (Yang et al., 2019), based on the even distribution of SNPs per chromosome and the principle of a minimum number of SNPs representing the maximum genetic diversity.

Data Analysis
PowerMarker software (http://statgen.ncsu.edu/powermarker/) was used to calculate genetic diversity and the PIC. GenAlEx 6.5 (Peakall, 2012) was used to estimate minor allele frequency (MAF) and observed heterozygosity. Tassel 5.1 (Bradbury et al., 2007) was used to perform principal component analysis (PCA), and MEGA 5 (Tamura et al., 2011) was used to construct a neighbor-joining (NJ) tree based on Nei's standard genetic distance (Nei, 1978). STRUCTURE V2.3 was used to analyze population structure (Pritchard et al., 2000; Falush et al., 2003). STRUCTURE HARVESTER (Earl and Vonholdt, 2012) was used to determine the most likely K value based on the ΔK method (Evanno et al., 2005). An online barcode generator (http://barcode.tec-it.com/barcode-generator.aspx) was used to convert each SNP fingerprint into a barcode.

Core Collection Development
The selection of core collections was carried out using Core Hunter (Thachuk et al., 2009), an algorithm for sampling genetic resources based on multiple genetic measures. Using Core Hunter, the Shannon-Weaver diversity index (I), Nei's gene diversity index (H), and PIC were calculated and compared between the core and whole collections.

Selection of High-Quality Single-Nucleotide Polymorphisms
The re-sequencing of the 20 bottle gourd representatives (Xu et al., 2021) generated a total of 1,843,914 SNPs. After filtering based on the criteria of a minor allele frequency of >5% and missing data rate of <5%, we obtained 723,946 filtered SNPs. Based on the SNP PIC values and the distribution of the resequenced bottle gourds, 1,100 SNPs (100 SNPs per chromosome and as evenly distributed as possible) were finally selected for identification by KASP assays. Twenty-two distantly related bottle gourd germplasm collections from among the 206 assessed collections were used to screen high-quality SNPs from the 1,100 SNPs using KASP assays. Single-nucleotide polymorphisms could be called for AA, BB, and AB genotypes (Figure 1). SNPs for which data points could not be clearly called as AA, BB, or AB genotypes (Figure 1A), or for which no polymorphism was identified (Figure 1B), were deemed to be low-quality or low-discriminatory SNPs. For high-quality SNPs, discrimination between the two homozygous genotypes (AA and BB) and the heterozygous genotype (AB) in the 22 bottle gourd germplasm collections was relatively straightforward (Figures 1C,D). Finally, 93 high-quality SNPs were selected and used to screen the 206 bottle gourd germplasm collections to identify potential core SNP marker sets.
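The filtering step described above (MAF >5% and missing data rate <5%) can be sketched in Python as follows. This is a minimal illustration, assuming genotype calls coded as 'AA', 'AB', 'BB', or None for a missing call; the function names and the data layout are hypothetical and do not reproduce the pipeline actually used for the re-sequencing data.

```python
from typing import Dict, List, Optional

# Hypothetical layout: SNP identifier -> one genotype call per accession.
Genotypes = Dict[str, List[Optional[str]]]


def allele_a_frequency(calls: List[Optional[str]]) -> float:
    """Frequency of allele A among called (non-missing) diploid genotypes."""
    called = [g for g in calls if g is not None]
    return sum(g.count("A") for g in called) / (2 * len(called))


def filter_snps(genotypes: Genotypes,
                min_maf: float = 0.05,
                max_missing: float = 0.05) -> List[str]:
    """Keep SNPs with MAF > min_maf and missing rate < max_missing."""
    kept = []
    for snp_id, calls in genotypes.items():
        n_missing = sum(g is None for g in calls)
        if n_missing == len(calls):
            continue  # nothing was called, so the SNP cannot be evaluated
        p = allele_a_frequency(calls)
        maf = min(p, 1.0 - p)
        if maf > min_maf and n_missing / len(calls) < max_missing:
            kept.append(snp_id)
    return kept


demo = {
    "snp1": ["AA"] * 10 + ["BB"] * 10,              # polymorphic, complete
    "snp2": ["AA"] * 20,                            # monomorphic
    "snp3": ["AA"] * 9 + ["BB"] * 9 + [None] * 2,   # 10% missing
}
print(filter_snps(demo))  # ['snp1']
```

With only 20 re-sequenced accessions, a missing rate strictly below 5% is met only by SNPs called in every accession, since a single missing call already gives 1/20 = 5%.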

Identification of Candidate Core Single-Nucleotide Polymorphisms
To identify the candidate core SNPs, the 206 bottle gourd germplasm collections were utilized for KASP assays with the 93 high-quality SNPs. A core SNP marker set was selected by considering the physical position, PIC, MAF, observed heterozygosity, and missing values among all 206 genotypes. The core marker set comprised 22 SNP markers, with two markers per chromosome (Figure 2 and Table 1). The saturation curve presented in Figure 3 shows that 22 core SNP markers could distinguish all 206 bottle gourds. The PIC for the 22 core SNP markers across all examined germplasm collections ranged from 0.137 to 0.499, with an average value of 0.390. For 17 of the 22  Table 1.
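The idea behind a saturation curve, that is, how many accessions can be told apart as markers are added one at a time, can be illustrated with the sketch below. It assumes a greedy ordering in which each step adds the marker that separates the most accessions; this ordering, the function names, and the toy data are assumptions for illustration and are not the published selection protocol.

```python
from typing import Dict, List, Tuple


def count_profiles(profiles: Dict[str, List[str]], marker_idx: List[int]) -> int:
    """Number of distinct multi-locus genotypes over the given marker columns."""
    return len({tuple(calls[i] for i in marker_idx) for calls in profiles.values()})


def saturation_curve(profiles: Dict[str, List[str]],
                     markers: List[str]) -> List[Tuple[str, int]]:
    """Greedily add markers; report distinguishable accessions after each step."""
    remaining = list(range(len(markers)))
    chosen: List[int] = []
    curve = []
    while remaining:
        # Pick the marker that yields the most distinct profiles so far.
        best = max(remaining, key=lambda i: count_profiles(profiles, chosen + [i]))
        chosen.append(best)
        remaining.remove(best)
        curve.append((markers[best], count_profiles(profiles, chosen)))
    return curve


toy = {
    "acc1": ["AA", "AA", "AA"],
    "acc2": ["AA", "BB", "AA"],
    "acc3": ["BB", "AA", "AA"],
    "acc4": ["BB", "BB", "AA"],
}
print(saturation_curve(toy, ["m1", "m2", "m3"]))
# [('m1', 2), ('m2', 4), ('m3', 4)]
```

The curve saturates once the count equals the number of accessions; in the toy data this happens after two markers, and the uninformative third marker adds nothing.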

Polymorphism and Discriminatory Capacity of the Core Single-Nucleotide Polymorphism Set
We initially evaluated the polymorphism of the core SNP set using KASP assays in a MAGIC population, which had been constructed from eight genetically diverse elite parents and consisted of 377 recombinant inbred lines. Data obtained for individuals from these 377 lines were used to calculate the PIC, MAF, observed heterozygosity, and missing values for each core SNP marker. The PIC values of the 22 core SNP markers across the 377 individuals ranged from 0.12 to 0.50, with an average of 0.38. Sixteen SNP markers had PIC values >0.3, and the MAF values ranged from 0.07 to 0.45, with an average of 0.30. The observed heterozygosity for each core SNP marker was ≤0.10, with an average of 0.04. For all core SNP markers, the missing values comprised ≤0.06 of the data points (Table 2). Collectively, these results indicate that the core SNP markers were highly polymorphic.
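The per-marker statistics reported above can be illustrated with a minimal Python sketch. It assumes genotype calls coded as 'AA', 'AB', 'BB', or None for a failed call; the function name `marker_stats` and this coding are illustrative and are not taken from PowerMarker or GenAlEx. Both gene diversity (1 − Σp²) and the PIC of Botstein et al. (1980) are returned, because the text does not state which formula underlies the reported PIC values.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def marker_stats(genotypes: Sequence[Optional[str]]) -> Dict[str, float]:
    """Summarise one biallelic SNP scored as 'AA', 'AB', 'BB' or None (missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this marker")
    counts = Counter(called)
    n = len(called)
    # Each diploid call contributes two alleles.
    freq_a = (2 * counts["AA"] + counts["AB"]) / (2 * n)
    freq_b = 1.0 - freq_a
    gene_diversity = 1.0 - freq_a ** 2 - freq_b ** 2
    # Botstein-type PIC; with two alleles the cross term is 2 * p^2 * q^2.
    pic = gene_diversity - 2.0 * freq_a ** 2 * freq_b ** 2
    return {
        "maf": min(freq_a, freq_b),
        "observed_het": counts["AB"] / n,
        "gene_diversity": gene_diversity,
        "pic": pic,
        "missing_rate": 1.0 - n / len(genotypes),
    }


example = ["AA"] * 4 + ["BB"] * 4 + ["AB"] * 2
stats = marker_stats(example)
print(stats["maf"], stats["observed_het"], stats["gene_diversity"], stats["pic"])
# 0.5 0.2 0.5 0.375
```

For a biallelic marker, gene diversity is bounded by 0.5 and the Botstein-type PIC by 0.375, both reached at equal allele frequencies.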
We further evaluated the discriminatory power of the core SNP set by analyzing the genetic structure of the 206 bottle gourd germplasm collections using the core SNP set and the aforementioned 93 high-quality SNPs. Assessment of the relationships among the 206 bottle gourd germplasm collections using STRUCTURE showed that the best K value was 2, which divided the 206 bottle gourd germplasm collections into two groups when using either the 93 high-quality SNPs or the core SNP marker set (Figures 5A,B). The clustering results obtained for the core SNP marker set and 93 high-quality SNPs differed only with respect to 14 of the collections (Supplementary Table 3). PCA and the UPGMA dendrogram exhibited similar results when using the two different SNP marker sets (Figures 5C,D). The aforementioned results thus indicate that the core SNP marker set had strong representativeness and discriminatory power equal to the 93 well-selected high-quality SNPs.
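A comparison of group assignments between two marker sets, such as the one summarized above, can be sketched as follows. The sketch assumes that each accession has already been given a hard group label (for example, by taking the largest STRUCTURE membership coefficient, which is an assumption here), and it allows for the fact that group labels from two independent runs are arbitrary. All names and data are hypothetical.

```python
from itertools import permutations
from typing import Dict, List


def discordant_accessions(a: Dict[str, int], b: Dict[str, int]) -> List[str]:
    """Accessions grouped differently by two clusterings.

    Labels in `b` are relabelled in every possible way and the relabelling
    that agrees best with `a` is used, so that swapped labels are not
    counted as disagreements.
    """
    labels_a = sorted(set(a.values()))
    labels_b = sorted(set(b.values()))
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must use the same number of groups")
    shared = sorted(set(a) & set(b))
    best = None
    for perm in permutations(labels_a):
        mapping = dict(zip(labels_b, perm))
        mismatches = [acc for acc in shared if mapping[b[acc]] != a[acc]]
        if best is None or len(mismatches) < len(best):
            best = mismatches
    return best


full_set = {"x1": 1, "x2": 1, "x3": 2, "x4": 2}   # e.g. from all markers
core_set = {"x1": 2, "x2": 2, "x3": 1, "x4": 2}   # e.g. from the core markers
print(discordant_accessions(full_set, core_set))  # ['x4']
```

Trying every relabelling is practical only for a small number of groups, which is the case for K = 2.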

Applications of the Core Single-Nucleotide Polymorphism Marker Set
Molecular fingerprinting of bottle gourd germplasm collections or commercial cultivars is a potential practical application of the newly developed core SNP marker set. In this study, 206 bottle gourd germplasm collections were fingerprinted using the 22 core SNP markers (Table 3 and Supplementary Table 4), which highlighted the efficiency and accuracy of genotype discrimination using these markers. Furthermore, representative cultivars (hybrids) currently on the market were genotyped using the 22 core SNP markers and assigned unique barcodes and QR codes (Table 4), which indicates the potential contribution of these markers in resolving intellectual property disputes caused by the use of the same name for different species or different names for the same species. Moreover, the SNP fingerprints provided a precise, rapid, convenient, and cost-effective KASP genotyping procedure for determining bottle gourd seed purity.
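How a reference fingerprint might be used to estimate seed purity can be sketched as below. The sketch assumes that a seed counts as true-to-type when every successfully called marker matches the reference fingerprint of the cultivar and that missing calls are uninformative; these rules, the function names, and the data are assumptions for illustration rather than the procedure used in this study.

```python
from typing import Optional, Sequence


def fingerprint_string(calls: Sequence[str]) -> str:
    """Concatenate per-marker genotype calls into one fingerprint string."""
    return "".join(calls)


def seed_purity(reference: Sequence[str],
                seeds: Sequence[Sequence[Optional[str]]]) -> float:
    """Fraction of tested seeds whose called genotypes all match the reference."""
    if not seeds:
        raise ValueError("no seeds were genotyped")

    def matches(seed: Sequence[Optional[str]]) -> bool:
        if len(seed) != len(reference):
            raise ValueError("seed and reference must cover the same markers")
        # Missing calls (None) are skipped rather than counted as mismatches.
        return all(call is None or call == ref for call, ref in zip(seed, reference))

    return sum(matches(seed) for seed in seeds) / len(seeds)


hybrid = ["AB", "AA", "BB"]            # hypothetical three-marker fingerprint
tested = [
    ["AB", "AA", "BB"],                # true-to-type
    ["AB", None, "BB"],                # one failed call, otherwise matching
    ["AA", "AA", "BB"],                # off-type at the first marker
    ["AB", "AA", "BB"],                # true-to-type
]
print(fingerprint_string(hybrid))      # ABAABB
print(seed_purity(hybrid, tested))     # 0.75
```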

Development of a Core Collection
To provide a subset of representative germplasm collections for the selection of parents in hybrid combinations in bottle gourd breeding or related basic studies, we further developed a core collection of bottle gourd germplasm collections using the 22 core SNP markers and Core Hunter software (Thachuk et al., 2009). The core collection included 102 representative bottle gourd germplasm collections, which accounted for approximately 50% of the total number of germplasm collections while retaining 100% allele coverage (marked in red in Supplementary Table 1). Two indices were used to measure the average genetic distance of the core collection population: the modified Rogers distance (MR) and the Cavalli-Sforza and Edwards distance (CE), with values of 0.4424 and 0.44427, respectively. Furthermore, three genetic diversity indices were calculated: Shannon's diversity index (SH), expected heterozygosity (HE), and PIC, with values of 2.9714, 0.425644, and 0.332121, respectively (Table 5). Additionally, the results obtained from PCA of the 102 germplasm collections in the core collection were largely consistent with those obtained for the original collection (Figure 6). Therefore, the core collection is representative of the genetic diversity of the original collection. Considering suitable size, geographical distribution, phenotype, and unique agronomic traits, an additional 48 germplasm collections were added to the initial core collection, giving a final core collection containing 150 bottle gourd germplasm collections.
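Two of the quantities used above to judge a core collection, allele coverage and the modified Rogers distance, can be illustrated with the following sketch. It assumes genotypes coded as 'AA', 'AB', 'BB', or None, treats each accession's allele frequency at a locus as 1, 0.5, or 0, and averages the distance over all pairs; the code does not call Core Hunter, and how Core Hunter aggregates distances internally is not reproduced here.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Iterable, List, Optional

Profile = List[Optional[str]]  # one call per marker: 'AA', 'AB', 'BB' or None


def allele_set(profiles: Iterable[Profile]) -> set:
    """All (marker index, allele) pairs observed in a group of accessions."""
    return {
        (i, allele)
        for calls in profiles
        for i, call in enumerate(calls)
        if call is not None
        for allele in call
    }


def allele_coverage(collection: Dict[str, Profile], core_ids: List[str]) -> float:
    """Share of the alleles in the whole collection that the core retains."""
    whole = allele_set(collection.values())
    core = allele_set(collection[acc] for acc in core_ids)
    return len(core) / len(whole)


def modified_rogers(p: Profile, q: Profile) -> float:
    """Modified Rogers distance between two accessions over shared loci."""
    total, loci = 0.0, 0
    for a, b in zip(p, q):
        if a is None or b is None:
            continue
        fa, fb = a.count("A") / 2, b.count("A") / 2
        # freq(B) = 1 - freq(A), so both allele terms contribute equally.
        total += 2 * (fa - fb) ** 2
        loci += 1
    if loci == 0:
        raise ValueError("no locus is called in both accessions")
    return sqrt(total / (2 * loci))


def mean_pairwise_distance(collection: Dict[str, Profile], ids: List[str]) -> float:
    """Average modified Rogers distance over all pairs of the given accessions."""
    pairs = list(combinations(ids, 2))
    return sum(modified_rogers(collection[i], collection[j]) for i, j in pairs) / len(pairs)


toy = {"g1": ["AA", "AA"], "g2": ["BB", "BB"], "g3": ["AB", "AA"]}
print(allele_coverage(toy, ["g1", "g2"]))   # 1.0
print(allele_coverage(toy, ["g1"]))         # 0.5
print(modified_rogers(toy["g1"], toy["g2"]))  # 1.0
```

In the toy data, two of the three accessions already carry every allele, which mirrors how a core of roughly half the accessions can still reach 100% allele coverage.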

DISCUSSION
A range of DNA molecular markers, including RAPDs, SSRs, InDels, and SNPs, have been used for germplasm characterization (Chen and Sullivan, 2003; Lv et al., 2012; Hao et al., 2016; Yang et al., 2019). Among these, SNPs, with their unique characteristics of wide distribution, high density, and good stability, combined with a cost-effective, user-friendly SNP genotyping platform (KASP), have become a popular marker type for germplasm characterization and cultivar fingerprinting (Allen et al., 2010; Cortés et al., 2011; Su et al., 2018; Wang et al., 2018; Li et al., 2019; Wu et al., 2021). However, although SNP markers, which are suitable for large-scale high-throughput screening of multiple samples and sites, are expected to supersede RAPD, SSR, and InDel markers, which involve complex processes, SNPs combined with KASP are yet to be utilized for large-scale germplasm characterization in bottle gourd. In this study, using the high-throughput SNP genotyping platform, we selected and developed a core set of SNP markers from an initial 1,100 SNPs identified by the re-sequencing of 20 bottle gourd representatives. The representativeness and discriminatory power of the core SNP marker set were evaluated using 206 bottle gourd germplasm collections and a MAGIC population (Figure 4 and Table 2). We found that the core SNP marker set had strong representativeness and discriminatory power equal to that of 93 high-quality SNPs (Figure 5). The use of fewer markers is more convenient for identifying varieties or fingerprinting cultivars than using large numbers of markers. Different subsets of markers show different identification rates; for example, in cultivated pumpkin, subsets of 24 and 12 SNP markers identified only 24.2% and 4.9% of accessions, respectively (Nguyen et al., 2020). The core SNPs were identified to represent the greatest possible genetic diversity using the minimum number of SNPs.
For example, in non-heading Chinese cabbage, 50 core SNPs were found to provide adequate information for genetic identification (Li et al., 2019). A core set of 16 SSRs has been shown to be sufficient to identify 382 cucumber varieties and establish DNA fingerprints (Yang et al., 2019). Similarly, in cowpea, 50 informative core SNPs were shown to be strongly representative of the 51,128 SNPs available to analyze genetic dissimilarity in this species. In this study, a saturation curve revealed that the 22 highly polymorphic and uniformly distributed core SNP markers distinguished 100% of the 206 bottle gourds (Figure 3), thereby indicating that these 22 core SNP markers were sufficiently discriminatory for the identification of bottle gourd germplasm.
The verification of seed authenticity and purity is of particular importance for seed producers and farmers (Gao et al., 2012). Similar genetic backgrounds often make it difficult to morphologically identify species using low-efficiency and time-consuming field planting. Moreover, morphological characteristics are often influenced by the environment and are therefore not suitable for the current rapid inspection demands (Tian et al., 2015). A potential application of the core SNP marker set is the molecular fingerprinting of bottle gourd germplasm collections or commercial cultivars to preserve and utilize germplasm collections and determine the authenticity and purity of cultivars. We fingerprinted 206 bottle gourd germplasm collections and representative bottle gourd commercial hybrids with unique barcodes and QR codes (Tables 3, 4) and developed optional primers for determining bottle gourd seed purity. Owing to the biallelic nature of SNP markers, each marker can distinguish three genotypes (AA, AB, and BB). The maximum number of individuals distinguishable using the set of 22 core SNP markers selected in this study is, in theory, 3^22 = 31,381,059,609. Therefore, it is feasible to construct a fingerprint database of bottle gourd germplasm collections or main commercial varieties using the 22 core SNP markers.
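The theoretical capacity quoted above follows from simple counting, which the short sketch below reproduces. The function names are illustrative, and the calculation assumes that markers are independent and that all genotype classes are equally available, which real populations will not satisfy.

```python
def max_profiles(n_markers: int, genotypes_per_marker: int = 3) -> int:
    """Upper bound on distinct multi-locus profiles for independent markers."""
    return genotypes_per_marker ** n_markers


def min_markers(n_accessions: int, genotypes_per_marker: int = 3) -> int:
    """Smallest marker count whose upper bound reaches n_accessions."""
    n = 0
    while genotypes_per_marker ** n < n_accessions:
        n += 1
    return n


print(max_profiles(22))      # 31381059609
print(max_profiles(22, 2))   # 4194304
print(min_markers(206))      # 5
```

The second line gives the bound when only the two homozygous genotypes occur, as would be expected for fully inbred lines; it is still several orders of magnitude larger than the 206 accessions examined.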
The construction of a core collection will substantially improve the efficiency of germplasm collection management and utilization. Core collections established using molecular markers are not readily affected by environmental or other external factors, and hence, several core collections have been constructed using various DNA molecular markers (Balfourier et al., 2007; Zhang et al., 2011; Muñoz-Amatriaín et al., 2014). In this study, we developed a core collection of 102 accessions that represent 100% of the bottle gourd collections in China (Table 5 and Figure 6). Previous studies have revealed weak population stratification and low diversity in bottle gourd germplasm collections, which are generally independent of the site of collection (Yetişir et al., 2008; Xu et al., 2011, 2014, 2021). Taking into consideration the factors of suitable size, phenotype, and unique agronomic traits, we augmented our original core collection with an additional 48 inbred lines, thereby establishing a final core collection containing 150 bottle gourd inbred lines. To the best of our knowledge, this study represents the first effort to preserve and utilize germplasm collections of bottle gourd, and the core collection thus developed will contribute substantially to future bottle gourd breeding and research. Accordingly, we believe that the genomes of the accessions selected for the core collection should be re-sequenced to provide a valuable resource for future breeding and scientific studies.
In summary, based on the re-sequenced genomic data from 20 bottle gourd germplasm collections, we identified and validated a core set of 22 representative SNPs, which exhibited abundant polymorphisms and were evenly distributed throughout the bottle gourd genome. Using this core SNP marker set, we assessed the genetic diversity and population structure of bottle gourd collections, fingerprinted bottle gourd germplasm collections and commercial cultivars, performed an optimized procedure for seed authentication, and developed an accessible core population. Our findings will provide a valuable basis for the future preservation and utilization of bottle gourd germplasm collections and also contribute to cultivar identification, which will enable the resolution of commercial disputes and protect the rights of breeders.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
YW and GL conceived the research. YW, XaW, YL, ZF, ZM, JW, XnW, BW, and ZL performed the experiments. XaW provided the mutant material. YL, ZF, ZM, JW, and XnW provided technical assistance. YW analyzed the data and wrote the manuscript. All authors contributed to the article and approved the submitted version.