Genetic Diversity and Population Structure of Schima superba From Southern China

The tree Schima superba is important for afforestation and fire prevention in southern China. The wood of this tree can also be used for furniture and buildings. However, the lack of genetic background and genomic information for this species has lowered wood yield speed and quality improvement. Here, we aimed to discover genome-wide single nucleotide polymorphisms (SNPs) in 302 S. superba germplasms collected from southern China and to use these SNPs to investigate the population structure. Using genotyping by sequencing, a total of 785 high-quality SNP markers (minor allele frequency [MAF] ≥ 0.05) were identified from 302 accessions collected from seven geographical locations. Population structure analyses and principal coordinate analyses (PCoAs) indicated that these germplasm resources can be clearly separated into different populations. The S. superba accessions originating from Yunnan (YN) and Guangxi (GX) fell into the same population, separate from the accessions originating from Guangdong (GD), which indicated that these two regions should be regarded as major provenances of this species. In addition, two independent core germplasm sets with abundant genetic polymorphisms were constructed to support the breeding work. The identification of SNP markers, analyses of population genetics, and construction of core germplasm sets will greatly promote the molecular breeding work of S. superba.


INTRODUCTION
Schima superba is a precious and cultivated evergreen broad-leaved tree that is naturally distributed in southern China . This species is valued commercially for its timber because the hardness of its wood is higher than most tree species, which is suitable for furniture and buildings . In addition, this species is used as fire breaks and thus helps to protect forests from fires Zhang et al., 2013;Yang et al., 2017a,b). Finally, it is important for afforestation projects in southern China because it has superior growing speed in this region and has a long history of cultivation, which makes it suitable for carbon fixation on a large scale Xu et al., 2018). Therefore, S. superba plays important roles in forestry products and ecological protection systems. Germplasm resources are the most important basis for carrying out molecular breeding work and obtaining filial generations with superior phenotypes. Due to the long breeding cycle of woody trees using a conventional method, it is urgent to carry out the latest technical measures to accelerate the breeding work (Isik, 2014). Furthermore, the wide geographical distribution of S. superba in southern China laid a good foundation to carry out the relevant study.
To better characterize the current level of available S. superba diversity for future breeding efforts, large-scale population genetics analyses, using advanced molecular genotyping strategies, are needed. Marker-assisted selection (MAS) is a useful tool that accelerates the breeding cycle and has thus been used for many decades (Bernardo, 2008). Therefore, a large scale of molecular markers are needed to better realize molecular breeding. Across all types of molecular markers, single nucleotide polymorphisms (SNPs) are the most widely used, due to their ubiquity, uniform distribution, and high heritability (Verma et al., 2015;Luo et al., 2019). As next-generation sequencing (NGS) technologies have advanced, it has become easier to obtain large numbers of SNPs for modern breeding programs (Elshire et al., 2011;Manimekalai et al., 2020). Due to the advantages of low cost and high efficiency in mining SNPs, a method called genotyping by sequencing (GBS) has been successfully applied for population structure and population genetic diversity analyses in a wide range of crop plants and trees, such as Arachis hypogaea (Zheng et al., 2018), Rubus (Ryu et al., 2018), Fraxinus pennsylvanica (Wu et al., 2019), Picea (Elleouet and Aitken, 2019;Haselhorst et al., 2019), Camellia sinensis (Niu et al., 2019), Eleaeis guineensis (Babu et al., 2019), Pinus tabuliformis (Xia et al., 2018), and Quercus suber (Pina-Martins et al., 2019). Until now, large-scale genome-wide SNP identification and population genetics have not been reported in S. superba, except for a genetic map construction using a full-sib family . Therefore, a small number of related studies and a lack of the molecular genetic background of germplasm resources hindered the development of S. superba molecular breeding work. The use of SNPs in germplasm resource populations will aid in thorough genomic assessments of population structure and inform MAS strategies for S. superba.
Here, NGS technologies were employed to confirm the genotypes of 302 S. superba accessions originating from Guangdong province, Yunnan province, and Guangxi Zhuang Autonomous Region in southern China. A set of high-quality SNPs were identified from the S. superba genome and used to confirm the SNP genotypes of these germplasm resources. Then, investigation and analyses of population genetic diversity, population structure, genetic relationship, genetic distance, and construction of core germplasm sets were conducted to better understand these resources. Our results not only describe the genetic diversity and population structure of collected and conserved S. superba accessions, the construction of a core germplasm set that can effectively promote crossbreeding, and the formation of high-quality offspring but also provide a framework for future breeding programs.

Plant Material
The 302 superior trees of S. superba were collected from Guangdong (GD), Guangxi (GX), and Yunnan (YN) provinces

DNA Extraction and GBS
Genomic DNA was extracted from a 0.5 g leaf sample from each accession using the DP320 DNAsecure Plant Kits (Tiangen, China), following the manufacturer's protocols. The concentrations of 302 extracted DNA samples were measured using a NanoDrop-2000 spectrophotometer (Thermo Fisher Scientific, USA). Each quantified DNA sample was diluted to 20 µg·µl −1 and stored at −20 • C until use. Individual GBS libraries for each DNA sample were prepared using the restriction enzymes Hae III and EcoR I as described (Elshire et al., 2011). The purified samples were amplified using PCR, with the Phusion Master Mix (NEB) universal primer and the index primer. The PCR products were purified using Agencourt AMPure XP (Beckman) and pooled and then run on 2% agarose gels. DNA fragments 375-400 bp long (such as, indexes and adaptors) were isolated using a Gel Extraction Kit (Qiagen). The isolated DNA fragments were purified using Agencourt AMPure XP (Beckman) and then diluted for sequencing. Paired-end (PE) 150 bp sequencing was performed using the NovaSeq 6000 platform (Illumina, San Diego, CA, USA).

Genotyping and Quality Control (QC)
Sequences were associated with samples based on the samplespecific barcodes. To ensure that reads were reliable and without artificial bias, the raw data were processed using a series of quality control (QC) procedures implemented by in-house C scripts. The QC scripts removed all reads with ≥10% unidentified nucleotides (N); all reads with >50% bases having Phred quality <5; and all reads with >10 nt aligned to the adapter, allowing ≤10% mismatches. The samples with the most tags (SS222 and SS234) were used as reference genomes (Paris et al., 2017). GATK and SAMtools were used for SNP calling (Li et al., 2009;McKenna et al., 2010). SNPs with integrity <0.8 and minor allele frequency (MAF) ≤ 0.01 were discarded.

Population Structure
Population structure was estimated using a Bayesian Markov Chain Monte Carlo (MCMC) model implemented in Structure v2.3.4 (Pritchard et al., 2000). The admixture model without the LOCPRIOR option was used in this study. Furthermore, 10 replicate runs were performed for each value of K between 1 and 10, based on the number of sampled populations. Burn-in and MCMC replicates were both set to 100,000 for each run. The most probable value of K was determined with Structure Harvester (Earl and VonHoldt, 2012), using the log probability of the data [LnP(D)] and delta K ( K), based on the rate of change in [LnP(D)] between successive K-values. In addition, ADMIXTURE and fastSTRUCTURE were also used to conduct the K values of population structure. Different numbers of clusters (K = 1-10) for the dataset were tested using 10 replicate runs per K (Alexander et al., 2009;Raj et al., 2014). The optimal K value was assessed by cross-validation error with the lowest value and marginal likelihood with the highest value. GenAlEx v6.503 was used to calculate the Nei's genetic distances between pairs of accessions (Peakall and Smouse, 2012). Principal coordinate analysis (PCoA) was performed based on the genetic distances. An unrooted phylogenetic tree (i.e., we did not assume an evolutionary hierarchy) was constructed using the maximum likelihood method in MEGA 6.0 with 1,000 bootstrap replicates (Tamura et al., 2013).

Analysis of Molecular Variance (AMOVA) and Genetic Diversity
Analysis of molecular variance (AMOVA) was performed and calculated using seven geographical subpopulations in GenAlEx v6.503 (Peakall and Smouse, 2012). Based on the results of AMOVA, the fixation index (Fst) and haploid number of migrants (Nm) within the population were obtained. Fst measures the amount of genetic variance that can be explained by population structure based on Wright's F-statistics (Wright, 1965): an Fst value of 0 indicates no differentiation among subpopulations, while a value of 1 indicates complete differentiation. Several other genetic indices were calculated for each subpopulation using GenAlEx v6.503, such as the number of loci with private alleles, the number of different alleles (Na), the number of effective alleles (Ne), Shannon's information index (I), observed heterozygosity (Ho), and expected heterozygosity (He).

Construction of Core Germplasm Set
Core Hunter 3 was used to select core germplasm set based mainly on the genetic distance metrics calculated by SNP molecular markers (De Beukelaer et al., 2018). The most important standard for filtering the core germplasms is to keep the highest average genetic distance to maintain the genetic diversity of core germplasm set. A comparative PCoA was conducted for two core germplasm sets with/without consideration of geographical information.
Average numbers of 5,516,738 raw reads and 5,516,633 clean reads were obtained from 302 S. superba accessions sequenced and genotyped using GBS (Supplementary Table S1). The average coverage was 86.18% with an average number of 4,758,675 clean reads, the ratio of >Q30 nucleotide was 87.99%, the average GC content ratio was 37.08%, and the average tag sequencing depth was 12.95× (Supplementary Table S1). Then, 1,085,853 unfiltered SNPs were identified from these sequenced tags. After QC and SNP filtering (MAF ≥ 0.01), a total of 10,661 SNPs were obtained across the S. superba genomes. Characteristics of the identified SNPs were described in Table 1. Transition SNPs (7,191 SNPs; 67.45%) were two times as common as transversion SNPs (3,470 SNPs; 32.55%). A/G transitions were the most common SNP type (34.28%), while G/C transversions were the least common SNP type (5.78%). The two types of transitions occurred with similar frequency (A/G: 34.28% and C/T: 33.17%). The most common transversion was A/T (9.80%), followed by A/C (8.61%), G/T (8.36%), and G/C (5.78%). In the next step, the SNP markers were further filtered by MAF ≥ 0.05. A total of 785 high-quality SNP markers were obtained and used for the following bioinformatics analyses. The values of He, polymorphic information content (PIC), and MAF of the filtered SNPs (MAF ≥0.05) were calculated and displayed in  The capital letter on the left side of the SNP variation type represents the major allele and the capital letter on the right side of the SNP variation type represents the minor allele.
SNPs) with an average of 0.201, and the MAF values varied from 0.05 (386 SNPs) to 0.5 (44 SNPs) with an average of 0.156 (Figure 2).

Population Structure
The most likely number of populations (K) was estimated based on the genome-wide genotypic data for the 302 S. superba accessions using Structure Harvester, ADMIXTURE, and fastSTRUCTURE. The results of Structure Harvester showed that the Delta K value was highest at K = 2 and rapidly declined from 3 to 10 (Figures 3A,B), suggesting that these accessions should be clustered into two populations. However, the ADMIXTURE cross-validation error was lowest at K = 3 (Supplementary Figure S1A), indicating the best K of three. fastSTRUCTURE marginal likelihood values fluctuated somewhat, and the highest marginal likelihood was observed at K = 2 (Supplementary Figure S1B).
Considering that most of the samples were collected from GD (91.39%) and the major geographical distance between GD and other two regions, population of the total collections conforms to the characteristics of geographical distribution. According to the results of population structure in Figure 3C, one common population on the left side consisted of accessions collected from GX and YN in the K values between 2 and 4. In addition, the result of sample composition in this population was identical among different structure analyses. According to the seven subpopulations determined by geographical locations, PCoA plot was drawn based on the pairwise genetic distance matrix for 302 S. superba accessions ( Figure 4A). Population 1 in the plot was similar to the common population produced by population structure analysis, comprising 26 accessions originating from YN/GX. Most accessions in GD were clustering into a major group, with the exception of several samples, especially SS149 and SS150 from NG. The clustering result of population structure was consistent with the PCoA plot and also closely related to the geographical distributions. For example, the result that accessions in YN clustered with GX were consistent with the geographically adjacent relationship. However, samples from GD cannot be divided into different populations.

Genetic Differentiation and Relationships Among Populations
Analysis of molecular variance of the seven geographical subpopulations found that 7.73% of the total variation was among subpopulations, while 92.27% of the total variation was within subpopulations ( Table 2). In addition, Nei's genetic distance analysis indicated that Nm was high (2.985) among these subpopulations. We also found a substantial genetic divergence between YN/GX and other subpopulations (Supplementary   The phylogenetic tree of all 302 S. superba accessions supported the monophyly of Population 1 in the structure analysis and PCoA: all 26 accessions in this population were recovered in the same clade ( Figure 4A). However, the remaining accessions cannot be divided into different populations. To clarify the genetic relationship among all the collected accessions, a phylogenetic tree using the maximum likelihood method was constructed and indicated genetic information (Figure 4B). In addition, phylogenetic analyses based on the sample geographic distribution information suggested that the accessions in GX and YN were separated from GD (Supplementary Figure S2).

Construction of Core Germplasm Set
Due to the potential redundancy of genetic information in the total collections, the construction of a compressed core germplasm set is an important work in modern plant breeding work. To keep genetic diversity of the core germplasm set, average genetic distance is the most important standard in selecting members. By using Core Hunter, two core germplasm sets containing 60 accessions, with/without considering the geographical location, were constructed ( Figure 5). Comparative PCoA plots of the two core sets displayed major difference in member compositions of the seven subpopulations. The core set without considering the geographical location showed unequal distributions in seven subpopulations. For instance, only one individual originated from YN/GX and NEG, respectively, which may lead to a loss of genetic information in this region. The core set, considering the geographical location, avoided imbalance in geographical distributions. Detailed compositions of two core sets are listed in Supplementary Tables S6, S7.

DISCUSSION
Due to the importance of S. superba in terms of wood production and forest fire protection, it is urgent to carry out a molecular breeding strategy to obtain superior individuals. Therefore, GBS technology was employed to study the molecular genetics and evolution of 302 S. superba germplasm resources collected from southern China. The genotypic data of these accessions were used for analyses of genetic diversity, population structure, etc. These analyses are important for the characterization of evolutionary history of this species, as well as the genetic relationships among these populations or individuals (Luo et al., 2019). Across the SNPs identified in 302 S. superba accessions, transitions were more frequent than transversions, consistent with findings in several other plant species (Morton et al., 2006;Park et al., 2010;Huang et al., 2013). This suggests that transitions are better tolerated than transversions in the expansion and diversification process of this species (Guo et al., 2017).
Population structure reflects genetic diversity within a species and supports genome-wide studies of the associations between molecular markers and traits (Sonah et al., 2015;Pais et al., 2020). The 302 accessions used in this study originated from three major geographical regions: GX, YN, and GD. However, it was unclear whether population structure would reflect these geographical distinctions. Our various population structure analyses indicated that the 302 accessions are possibly clustered into two, three, or four populations. However, because 91.39% of the accessions originated from GD and because larger numbers of populations may also help to maintain population diversity during breeding work (Chen et al., 2017;Alemu et al., 2020), we chose to classify the accessions into seven subpopulations according to the geographical distribution. Notably, in all analyses, the accessions from GX and YN were recovered in the same cluster (although these populations were only represented by a few accessions each). This might suggest that genetic exchange occurred between these adjacent geographical regions. Thus, our results suggest that the S. superba accessions from GX and YN may represent superior genotypes for inclusion in future breeding programs (Zhang et al., 2006;Zhou et al., 2006). However, except for the above sample collection regions, there are also several geographical distribution regions in southern China, which are also important provenances of S. superba. Therefore, a more extensive collection of S. superba germplasm resources is especially necessary to increase the scale of number and genetic diversity of this species.
The population structure of S. superba in GD was similar to that of Pinus massoniana, another important forestry species in southern China (Bai et al., 2019). That is, in both species, genetic distances among accessions did not completely correlate with geographic distance in GD. Consistent with our genetic analysis results, the phenotypic variation of S. superba in the GD province was shown to be high, reflecting abundant genetic diversity (Yan et al., 2013;Wang et al., 2018). Similarly, previous studies have shown that the genetic diversity of S. superba was high in Jianou and Longquan (Yang et al., 2016). Due to the long history of S. superba cultivation and afforestation in these regions, such high levels of genetic diversity and admixture are not surprising.
The fixation index is an important measure of population differentiation (Luo et al., 2019). Here, Fst values indicated that YN/GX was genetically divergent from the other subpopulations; conversely, Fst values were low among the six subpopulations in GD, suggesting that these subpopulations were closely genetically related. Thus, YN-GX accessions and GD accessions may represent two independent sources of S. superba diversity. However, most of the total variation among accessions was within subpopulations (92%), while only 8% of the total variation was among subpopulations. In addition, the Nm of the seven subpopulations was high (2.925), indicating a high rate of genetic exchange and gene flow among subpopulations (Eltaher et al., 2018;Luo et al., 2019), particularly in six subpopulations in GD. This may explain why the different software programs we used (Structure, ADMIXTURE, and fastSTRUCTURE) returned different values of K for the 302 accessions.
Genetic indices, such as He, reflect genetic diversity within subpopulations (Emanuelli et al., 2013). He values for WG, NWG, CG, NG, NEG, and EG were similar (0.196-0.218), but the He value for YN/GX was much lower (0.104). This might suggest that YN/GX was less diverse than the other subpopulations. However, the low He value might also merely reflect the lower number of samples in YN/GX, as He depends on both the number of alleles and the abundance of alleles in a population (Luo et al., 2019). Thus, in analyses of population diversity, it is important to consider not only the geographical origin but also the number of samples in each region. Therefore, sample numbers of the collected regions also need to be increased to obtain more accurate genetic parameters. Nevertheless, the genetic diversity analyses in this study will support the generation of new strategies for S. superba breeding.
Open pollination and hybridization have been common phenomena throughout evolutionary history (Payseur and Rieseberg, 2016). It has also been used in hybrid progeny acquisition in forestry breeding. While obtaining superior phenotypic offspring of a specific economic trait, a small number of germplasm resources with high quality is usually necessary for the breeding work. Therefore, core germplasm resources are extremely important for hybridization programs and germplasm creation (Khaing et al., 2013;Xu et al., 2017;Pereira-Lorenzo et al., 2018). In this study, we constructed two sets containing 60 S. superba core germplasms, which potentially provide a simplified and compressed genetic resource for the current resources of S. superba. These selected germplasm resources cover a wide genetic and geographic range and therefore should be suitable for diversity generation through hybridization (Bai et al., 2019). Although few accessions were collected from GX and YN, the high Fst values associated with this population indicated importance of the S. superba accessions from these regions. However, more GX and YN S. superba accessions should be collected to enhance the overall genetic diversity of our germplasm collection. Concurrently, the importance of accessions from GD should be somewhat deprecated in breeding works and germplasm resource collection to avoid decreasing phenotypic and genetic diversity. Finally, further genetic information, for example, from plastid, transcriptome sequencing, and whole-genome scales, is required for further investigations of the genetic relationships among S. superba germplasms . An improved understanding of the population structure and genetic diversity of S. superba accessions will facilitate and guide molecular breeding works in the future. In addition, genome-wide association studies (GWASs) of specific economic traits are very important in MAS, whether they use GBS-GWAS or whole genome-GWAS. It is also required in S. superba molecular work in the future.

CONCLUSION
In this study, a total of 785 SNPs (MAF ≥ 0.05) were identified and used to explore the genetic diversity and population structure of 302 S. superba accessions from southern China. Based on the SNP markers, the S. superba accessions were divided into two major populations by structure and PCoA. Accessions collected from GD differed noticeably from those collected from YN and GX. AMOVA indicated that 92.27% of the variance appeared within, not among, seven subpopulations separated by geographical location. Our results provided important genetic data for S. superba germplasm resources. The construction of two germplasm sets considering the geographical distribution and genetic diversity information will facilitate and guide molecular breeding work. To improve the genetic diversity of current resources, more geographical distribution regions should be considered in the collection of germplasm resources in the future. In addition, GWAS and MAS are necessary for the selection of individuals with superior phenotypes.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: NCBI [accession: PRJNA674494].

AUTHOR CONTRIBUTIONS
QB designed the project and drafted the manuscript. YC and HL originally collected the germplasm resources of Schima superba. BH, YC, DL, and QZ assisted in the analytical work. QB and YW reviewed the manuscript. All authors read and approved the final manuscript.