Genome-Wide Divergence and Linkage Disequilibrium Analyses for Capsicum baccatum Revealed by Genome-Anchored Single Nucleotide Polymorphisms

Principal component analysis (PCA) with 36,621 polymorphic genome-anchored single nucleotide polymorphisms (SNPs) identified collectively for Capsicum annuum and Capsicum baccatum was used to characterize population structure and species domestication of these two important incompatible cultivated pepper species. Estimated mean nucleotide diversity (π) and Tajima's D across various chromosomes revealed biased distribution toward negative values on all chromosomes (except for chromosome 4) in cultivated C. baccatum, indicating a population bottleneck during domestication of C. baccatum. In contrast, C. annuum chromosomes showed positive π and Tajima's D on all chromosomes except chromosome 8, which may be because of domestication at multiple sites contributing to wider genetic diversity. For C. baccatum, 13,129 SNPs were available, with minor allele frequency (MAF) ≥0.05; PCA of the SNPs revealed 283 C. baccatum accessions grouped into 3 distinct clusters, for strong population structure. The fixation index (FST) between domesticated C. annuum and C. baccatum was 0.78, which indicates genome-wide divergence. We conducted extensive linkage disequilibrium (LD) analysis of C. baccatum var. pendulum cultivars on all adjacent SNP pairs within a chromosome to identify regions of high and low LD interspersed with a genome-wide average LD block size of 99.1 kb. We characterized 1742 haplotypes containing 4420 SNPs (range 9–2 SNPs per haplotype). Genome-wide association study (GWAS) of peduncle length, a trait that differentiates wild and domesticated C. baccatum types, revealed 36 significantly associated genome-wide SNPs. Population structure, identity by state (IBS) and LD patterns across the genome will be of potential use for future GWAS of economically important traits in C. baccatum peppers.

Pepper germplasm is a valuable resource for investigating the still-unresolved question of whether similar domestication related changes occurred independently to result in parallel or convergent evolution in the domestication syndrome (Pickersgill, 2007). Because C. annuum and C. baccatum are sexually incompatible, the question cannot be resolved by crossing these genetically isolated domesticated peppers. However, genomic tools offer a plethora of opportunities to compare domestication footprints and determine whether complementary or different loci are involved (Pickersgill, 2007). C. baccatum var. pendulum is known for great variability in fruit quality traits, yield, pathogen resistance, and bioactive compounds (Yoon et al., 2006;Rodríguez-Burruezo et al., 2009;Do Rêgo et al., 2009;Eggink et al., 2014). Conventional plant breeding programs require costly investments in time, labor and land to develop improved cultivars; the application of genomic tools combined with nextgeneration sequencing could accelerate the genetic improvement of peppers. The use of C. baccatum and C. annuum species in interspecific breeding programs has been limited because of post-fertilization barriers.
Genotyping by sequencing (GBS) is a reduced representation method, which utilizes next-generation sequencing to develop genome-wide single nucleotide polymorphisms (SNPs). SNPs generated by GBS have been successfully deployed for genetic diversity analysis and Genome-wide association studies (GWAS) in several crops (Poland and Rife, 2012;Narum et al., 2013;Liu et al., 2014;Nimmakayala et al., 2014Nimmakayala et al., , 2016Guajardo et al., 2015;Otto et al., 2016). Increased marker density across the chromosomes facilitates to estimate genome-wide nonrandom association of allelic states across the chromosomes, which is known as Linkage disequilibrium (LD; Mackay and Powell, 2007;Reddy et al., 2014;Baird, 2015;Wang et al., 2015;Zanke et al., 2015). GWAS models are to scan genomewide LD blocks to identify causal locus for trait of the interest, while involving population structure and identity by state (IBS) matrices as the cofactors to reduce spurious associations due to confounding effects of population stratification and polygenic background (Rafalski, 2010;Stich and Melchinger, 2010;Newell et al., 2011). The availability of genome-wide (SNPs) affords new opportunities in the current study to better resolve C. baccatum population structure, LD and diversity and dissect the population demographic history across the genome by comparison with another domesticated species, C. annuum. In addition, we utilized population structure analyses for a genomewide association study (GWAS) of peduncle length, an important domestication trait.

Germplasm
A representative sample of 377 pepper accessions (283 C. baccatum and 94 diverse C. annuum accessions) collected from 32 countries across the world were obtained from the USDA-ARS, Germplasm Resource Information Network, Plant Genetic Resources Conservation Unit, Griffin, GA and World Vegetable Center (AVRDC, Shanhua, Taiwan) ( Table S1). The C. annuum collection was comprised of 90 domesticated cultivars and 4 wild accessions. The C. baccatum collection had 218 lines of C. baccatum var. pendulum and 17 wild accessions (C. baccatum var. baccatum). Peduncle length (cm) was measured for 5 plants each of 217 accessions belonging to C. baccatum var. pendulum grown in a greenhouse in three replications.

Genotyping by Sequencing (GBS)
Genomic DNA was isolated from the seedlings using the DNeasy plant mini kit (QIAGEN, Germany), and GBS was as described (Elshire et al., 2011). DNA was treated with the restriction enzyme ApeKI, a type II restriction endonuclease, barcoded by accession, and sequenced on an Illumina HiSeq 2500 as described (Elshire et al., 2011). SNPs were identified using the TASSEL-GBS Discovery/Production pipeline (https://bitbucket. org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline). Chromosomal assignment and position on the physical map of various SNPs were deduced from the C. annuum whole genome sequence (WGS) draft at http://peppergenome.snu.ac.kr. SNPs were designated by chromosome number and position (e.g., S10_172735351, which indicates an SNP located at position 172735351 on chromosome 10).

Genome-wide Divergence and Population Structure Analysis
Genetic diversity values were calculated by a neighbor-joining algorithm using TASSEL 5. In a second approach, we utilized IBS and principle component analysis (PCA) with the SNP & Variation Suite (SVS v8.1.5) (Golden Helix, Inc., Bozeman, MT, USA; www.goldenhelix.com). Observed nucleotide diversity (π) and Tajima's D were estimated by using TASSEL v5.0 with a sliding-window approach as described (Korneliussen et al., 2013). The fixation index (F ST ) was estimated on the basis of the Wright F statistic (Weir and Cockerham, 1984) with use of SVS v8.1.5.

Characterization of Linkage Disequilibrium (LD)
For GBS data, we considered only SNPs successfully mapped to the C. annuum WGS draft, because knowing the chromosome location of SNPs helps prevent spurious LD and thereby unreliable association mapping. Mapped SNPs were further filtered by call rate >90%. Before studying LD decay, haplotype blocks were calculated for all markers by using the default settings in SVS v8.1.5. Adjacent and pairwise measurements of LD for GBS data were calculated separately for SNPs in each chromosome. For computing LD, we used the expectationmaximization (EM) algorithm (Dempster et al., 1977) as an iterative technique for obtaining maximum likelihood estimates of sample haplotype frequencies.

GWAS Mapping
The PC matrix was constructed with the program "EIGENSTRAT" (http://genetics.med.harvard.edu/reich/ Reich_Lab/) and the PCA correction technique; the method of stratification was as described (Price et al., 2006). IBS was calculated as described (Purcell et al., 2007). GWAS involved a single-locus mixed linear model (SLMM), a method that uses a forward and backward stepwise approach to select markers as fixed-effects covariates in the model (Segura et al., 2012), and implemented in SVS v8.1.5. We used a PC matrix to correct for population stratification and an IBS matrix to correct for a polygenic background. Manhattan plots for associated SNPs were visualized by using GenomeBrowse v1.0 (Golden Helix, Inc.). The SNP P-values from GWAS underwent false discovery rate (FDR) analysis (Storey, 2002).

Population Stratification
We used PCA of the 36,621 SNPs identified from C. baccatum and C. annuum with MAF ≥0.05 to characterize domesticated and wild C. annuum and C. baccatum peppers. PCA with first and second eigen vectors that explained 80% of the total variation produced two clusters of C. baccatum and C. annuum accessions (Figure 1). Tepin and Tepin Guatemala, two wild peppers belonging to C. annuum var. glabriusculum that are native to southern North America and northern South America, were close to CB-77, a wild C. baccatum pepper. Similarly, three other wild C. baccatum peppers, CB-93, CB-92, and CB-40, were intermediate between the major C. annuum and C. baccatum clusters. A third cluster comprised the remaining wild, semi-domesticated and crown shaped fruit type C. baccatum accessions that grouped with the domesticated large-fruited C. baccatum peppers. A separate PCA with 13,129 SNPs that were polymorphic for C. baccatum accessions resolved the population structure comprised by this group of C. baccatum accessions. This PCA identified 283 C. baccatum accessions in 3 distinct clusters (Figure 2). The middle cluster (cluster II) was parallel to the C. annuum cluster, and the wild species Tepin, Tepin Guatemala, CB-77, CB-93, CB-92, and CB-40 were found in the middle, which indicates intercrossing between wild C. annuum and C. baccatum peppers while or before domestication. PCA placement of various accessions are noted in Tables S2, S3.

Fixation Index (F ST ) Distribution to Locate Positive Selection Footprints
F ST was estimated with 95% confidence intervals between wild and domesticated C. annuum and C. baccatum. The F ST between wild (C. annuum + C. baccatum) and domesticated (C. annuum + C. baccatum) accessions was 0.09 and 0.05, respectively. The F ST between domesticated C. annuum and C. baccatum was 0.78, which indicates genome-wide divergence. The F ST between wild C. baccatum and wild C. annuum was 0.66. Crownshaped fruited C. baccatum types are unique for this species  Table S4). Based on F ST values, peaks on chromosomes 1, 2, 3, 4, 5, 6, and 9 in the Manhattan plot might be the regions of positive selection and important for improvement. Because of the strong population structure, we assessed patterns of variation separately for each group of domesticated accessions from the respective species when making inferences about the evolutionary dynamics of domestication. Crop domestication is often associated with "population bottlenecks" because of the limited number of founding individuals experiencing domestication events. These bottlenecks may be evident in pepper when comparing diversity between cultivated forms of C. annuum and C. baccatum. We estimated nucleotide diversity (π) and Tajima's D across various chromosomes to understand genome-wide bottleneck effects. The frequency of segregating SNPs as reflected by various chromosomal measures of mean π and Tajima's D is presented in Figure 4. For cultivated C. baccatum, chromosome 4 was positive for π and Tajima's D which indicates accumulation of rapid mutations on this chromosome. The remaining chromosomes were negative or nearly negative for Tajima's D, which indicates bottlenecks in domestication. In contrast, C. annuum chromosomes were positive for Tajima's D on all chromosomes except chromosome 8, which indicates differential evolution after the domestication or the influence of diverse breeding.

LD Analysis for C. baccatum
We conducted an extensive LD analysis on the entire dataset of C. baccatum accessions on all adjacent marker pairs within a chromosome or within a haplotype block. Haplotype distribution is important to understand patterns of genetic variation of C. baccatum gene pools and has a wide range of applications. The 2 major processes that shape haplotype structure are the domestication process and breeding history. We used "minimize historical recombination, " a block-defining algorithm developed by Gabriel et al. (2002). The upper confidence boundary was set to 0.98 and the lower boundary to 0.70. SNPs with MAF <0.05 were omitted. Maximum block length was set to 160 kb. The expectation maximization (EM) algorithm was used for haplotype estimation, with convergence tolerance 0.0001, and frequency threshold 0.01. Maximum EM iterations were set to 50. We identified 1742 haplotypes containing 4420 SNPs, with a range of 9-2 SNPs per haplotype ( Table S5). The results provided values for both the EM algorithm (Dempster et al., 1977) and composite haplotype method (CHM; Weir and Cockerham, 1996). Squared-allele frequency correlations (r 2 ) and LD estimate (D ′ ) for the EM and CHM methods are in Table S6. We created LD plots by using marker-pair associations Frontiers in Plant Science | www.frontiersin.org   Table S4. of adjacent SNPs within a chromosome, within a haplotype block, and within genes (Figure 5). The length of individual LD blocks varied among chromosomes, with regions of high and low LD interspersed ( Table 2). The genome-wide average LD block was 99.1 kb. The largest LD block, of 13,021 kb, was on chromosome 11. Pairwise LD was estimated by r 2 and we compared the pattern of decay at different levels. With pairwise analysis considering adjacent SNPs across chromosomes, most SNP associations were within 50 kb (Figure 5). The second analysis based on adjacent SNPs within haplotypes revealed most associations within 20 kb ( Figure S1). The third analysis of SNPs located in genes revealed most associations within 5 kb ( Figure S2).

GWAS for Peduncle Length
Peduncle length is the prime differentiating trait between wild and domesticated forms of C. baccatum. Mean peduncle lengths for respective accessions are listed in Table S7. The cultivated form of C. baccatum, var. pendulum, is named based on the epithet related to pendant fruits. In our GWAS, 36 SNPs located on chromosomes 1, 2, 3, 4, 6, 7, 8, 9, 10, and 11 were identified as significantly associated with peduncle length and cumulatively explained 21% of the total variation (Figure 6). Four SNPs located in the intergenic space between the oxidoreductase family protein/arogenate dehydrogenase on chromosome 7 explained 10.6% of the total variation. Chromosome number, map position, P-value, regression beta,  FDR correction, variance explained, call rate, and minor/major allele frequencies for all significantly associated SNPs are in Table S8.

Candidate Gene Selection
The predicted gene set from the annotated C. annuum cv. CM334 reference genome (Kim et al., 2014) was used to characterize the genes containing SNPs or nearby SNPs. Eleven candidate genes containing SNPs in exons or promoters were significantly associated with peduncle length, and 12 more SNPs in introns or intergenic regions of candidate genes were proposed. GWAS details and strengths of association of SNPs are in Table S8. Details of annotation for various associated SNPs, their location in various genes and type of mutation (synonymous or non-synonymous) are in Table 3.

DISCUSSION
The cultivated pepper species, C. baccatum, known as aji or Peruvian hot pepper, is a valuable source of novel genes that has not yet been analyzed for genome-wide diversity and population structure (Albrecht et al., 2012). Our genomewide diversity analysis showed that many domesticated C. baccatum var. pendulum from western Bolivia/Peru and eastern Brazil/Paraguay cluster with most wild-type C. baccatum var. baccatum, suggesting that they may be the ancestral cluster. The flow of the river Rio Mizque from the south to join the Amazon is through lowland tropical Bolivia and the Amazon Basin and thus includes both the range of the C. baccatum group and a portion of the range of the C. annuum group (Eshbaugh, 1980). McLeod et al. (1982) suggested that the white-flowered ancestor migrated to dry areas of southern Bolivia, to produce the C. baccatum group, and the wild form in the wetter Amazon basin developed into the wild progenitor for C. annuum.
Our comparative divergence analysis across the chromosomes for C. annuum and C. baccatum revealed that chromosome 4 of C. baccatum had a unique divergence history, and for C. annuum, chromosome 8 showed a differential evolution when comparing mean π and Tajima's D for various chromosomes. In addition, biased distribution of Tajima's D toward negative values on all chromosomes (except chromosome 4) in cultivated C. baccatum indicates a population bottleneck during domestication or through the breeding histories, or the speciation of C. baccatum might have occurred with relatively narrow genetic diversity. In contrast, C. annuum chromosomes showed positive Tajima's D on all chromosomes except chromosome 8, which indicates that speciation or domestication of C. annuum might have occurred at multiple sites, contributing to wider genetic diversity as discussed by Kraft et al. (2014). Subsequent spread of C. annuum cultivars across the world and exposure to diverse breeding programs or selection in conjunction with diverse ecological adaptation might explain such rapid population size expansion and recovery from the bottleneck effects. The genome size of C. annuum types was estimated to be 3691 Mbp and C. baccatum was 4048 Mbp, which indicates wide divergence between these 2 cultivated pepper genomes (Belletti et al., 1998). Tang et al. (2006) concluded that unusually divergent genomic regions between closely related rice species are informative about species incompatibility or reproductive barriers resulting in partial fertility. Similar to the current findings, several reports implicated newly recruited polymorphisms as causing highly divergent genomic regions that may control traits associated with reproductive incompatibility or ecological adaptation (Wu, 2001;Wu and Ting, 2004).
Current advances in genome sequencing for identifying genome-wide SNPs and mapping them to WGS drafts allowed for scanning of LD decay across the genome. LD, the non-random association of alleles at different loci and germplasm panels that represent genome-wide cultivar diversity (power of association panel), plays an integral role in GWAS and determines the density of SNPs required for GWAS (Flint-Garcia et al., 2003;Nicolas et al., 2016). Low to moderate LD (decay within 100 kb) such as that observed for the C. baccatum panel in our study must utilize high SNP density (Kovi et al., 2015). In this study, we noted the highest LD for chromosome 11. One explanation for such variable LD is the "Bulmer effect, " whereby high LD regions are generally associated with selective sweeps harboring important genes underlying domestication (Bulmer, 1971;Kovi et al., 2015). The stochastic process that generates LD during selective sweeps is because of a spontaneous mutation leading to an advantageous effect or LD decays with recombination with a diverse haplotype and further segregation (Baird, 2015).

GWAS for Peduncle Length
Wild C. baccatum has a relatively restricted distribution confined to southern Peru, Bolivia, and southern Brazil (Eshbaugh, 1970). C. baccatum var. pendulum is a widely distributed cultivated plant found throughout western South America and now spreading worldwide (Eshbaugh, 1970). Wild C. baccatum has red, erect, and non-persistent fruits, and C. baccatum var. pendulum has red, orange, yellow, green, or brown fruits that are pendant and persistent. Because the peduncle is the most differentiating trait between domesticated and wild C. baccatum species, we performed GWAS for peduncle length. We associated 36 SNPs with the trait peduncle. Four of these SNPs clustered with candidate genes on chromosome 7. Annotation for some of these FIGURE 6 | Manhattan plot of the genome-wide association study for peduncle length in C. baccatum var. pendulum. (A) Range of observed peduncle length. (B) Chromosome coordinates are on the X-axis, with the negative log-10 of the association P-value for each SNP on the Y-axis. High negative log-10 indicates strong association with the trait. Histograms show effects of significantly associated SNPs for peduncle length. (C) Four SNPs located in the intergenic space between the oxidoreductase family protein/arogenate dehydrogenase on chromosome 7 that explained 10.6% of the total variation for peduncle length. associated SNP-containing sequences revealed their location in various genes, so these genes might play a role in peduncle length, peduncle architecture and C. baccatum domestication.
Length of peduncle is determined by the cell number or cell size, although it is indirectly regulated by hormones and multiple pathways. Kinases play important roles in plant growth and development. Peduncle associated SNPs in the current study were located in leucine-rich repeat receptor like kinases (LRR-RLKs), serine/threonine protein kinase, ABC transporter gene and RING finger protein, which may play important roles in growth and development as well as cell wall integrity and elongation as has been shown in other plants (Lally et al., 2001;Arunyawat et al., 2007;Guo et al., 2009;Gish and Clark, 2011;Ghosh et al., 2013). Plant cell walls contain a glycoprotein component rich in the otherwise rare amino acid hydroxyproline and accumulation of this amino acid was positively correlated with cell elongation in pea epicotyls (Flint-Garcia et al., 2003). In the current study, we also associated a marker S11_725918 on GABA (γ-aminobutyric acid), a ubiquitous non-protein amino acid. An Arabidopsis GABA gene mutant pop2 exhibited defects in hypocotyl cell elongation and pollen-tube elongation via influence on cell-wallrelated genes (Bulmer, 1971).
Our study describes the utility of SNPs generated by GBS for genome-wide divergence and LD patterns between C. annuum and C. baccatum. Mapping all the SNPs to the C. annuum reference genome helped to identify homologous SNPs between the two incompatible cultivated pepper genomes, which was further useful to reduce ascertainment bias, so this SNP set was useful in estimating genome-wide population differentiation and allele sharing between the two genomes. Furthermore, the SNPs anchored to the C. annuum genome may not be in the same order in the C. baccatum genome because some genomic regions may not be co-linear to the C. annuum genome because of genome rearrangements. In a comparison of C. baccatum and C. annuum linkage maps, Lee et al. (2016) identified two major reciprocal translocations between chromosomes 3 and 5 and between chromosomes 3 and 9, as well as translocations between chromosomes 1 and 8. Such uncertain positions of SNPs can be corrected only when the whole genomesequence is available for C. baccatum genome. This SNP panel and the results pertaining to population structure, IBS and LD decay analyses will facilitate routine use of GWAS for identification of genes associated with various economically important traits in Peruvian peppers. Our identification of SNPs associated with fruit peduncle length demonstrates opportunities for utilization of GWAS in crop improvement.

AUTHOR CONTRIBUTIONS
UR, PN, JS, GH, and AE designed the study and drafted the manuscript. PN, VA, JD, and BD conducted peduncle phenotyping. PN, VA, AA, JD, and BD extracted DNA and assisted to generate genome-wide SNPs. DC provided whole genome sequence draft and mapped SNPs to the genome. UR, PN, CR, TS, AA, and VA performed population structure and GWAS analysis.

ACKNOWLEDGMENTS
The study received funding from USDA-NIFA (2010-02419 and 2012-02617), NIH Grant P20RR016477 to the West Virginia IDeA Network for Biomedical Research Funding, Raman postdoctoral fellowship to CR by University Grants Commission, Government of India and the Gus R. Douglass Institute (graduate research assistantship to AV and BD). DC was supported by the Agricultural Genome Center of Next-Generation Biogreen 21 Program (PJ011275).

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpls.2016. 01646/full#supplementary-material Table S1 | List of Capsicum annuum and C. baccatum pepper accessions used in the current study.