Evaluation of the minimum sampling design for population genomic and microsatellite studies. An analysis based on wild maize

Massive parallel sequencing is revolutionizing the field of molecular ecology by allowing to understand better the evolutionary history of populations and species, and to detect genomic regions that could be under selection. However, the needed economic and computational resources generate a tradeoff between the amount of loci that can be obtained and the number of populations or individuals that can be sequenced. In this work, we analyzed and compared two extensive genomic and one large microsatellite datasets consisting of empirical data. We generated different subsampling designs by changing the number of loci, individuals, populations and individuals per population to test for deviations in classic population genetics parameters (HS, FIS, FST) and landscape genetic tests (isolation by distance and environment, central abundance hypothesis). We also tested the effect of sampling different number of populations in the detection of outlier SNPs. We found that the microsatellite dataset is very sensitive to the number of individuals sampled when obtaining summary statistics. FIS was particularly sensitive to a low sampling of individuals in the genomic and microsatellite datasets. For the genomic datasets, we found that as long as many populations are sampled, few individuals and loci are needed. For all datasets we found that increasing the number of population sampled is important to obtain precise landscape genetic estimates. Finally, we corroborated that outlier tests are sensitive to the number of populations sampled. We conclude by proposing different sampling designs depending on the objectives.

populations in the detection of outlier SNPs. We found that the microsatellite dataset is very 25 sensitive to the number of individuals sampled when obtaining summary statistics. FIS was 26 particularly sensitive to a low sampling of individuals in the genomic and microsatellite 27 datasets. For the genomic datasets, we found that as long as many populations are sampled, 28 few individuals and loci are needed. For all datasets we found that increasing the number of 29 population sampled is important to obtain precise landscape genetic estimates. Finally, we  To be able to compare deviations obtained from MPS and microsatellite markers, 157 which have been used in many studies (Table 1)   They are composed of many individuals per population (between 9 and 26 individuals per 171 population) and contain a large number of SNPs or microsatellite markers, distributed along 172 the 10 chromosomes of teosinte. Therefore, we defined these datasets as the samples 173 representing the most accurate data (i.e., the "real" data for the purpose of this work). We  We used adegenet (Jombart, 2008) and hierfstat (Goudet 2005)  statistic, we obtained the mean value across loci for each population. 195 We tested patterns of isolation by distance (IBD) and isolation by environment (IBE) 196 using multiple regressions of distances matrices (MRM function from the ecodist package) to 197 test the association between pairwise genetic distance (FST) as a response variable and the 198 environmental and geographic distances as predictive variables. We performed 1,000 199 permutations to test for significance between variables. The advantage of MRM tests is that it 200 allows testing simultaneously both the environmental and geographic distances, and 201 determines the relative contribution of each variable (Lichstein, 2007).   population. We used R custom scripts (available as supporting information-function 214 num_locs) to extract data from the entire sets: for the DTS dataset 100, 1000 and 5,000 random SNPs; for the 50K dataset 100, 1000 or 15,000 random SNPs; and for the 216 microsatellite dataset 5, 10 and 15 random markers. For the 50K and microsatellite datasets, 217 we also tested the effect of sampling different estimates of individuals per populations. We 218 extracted randomly for each population 3, 6 and 9 individuals (available as supporting 219 information-function num_inds()). This was not performed on the DTS dataset, because it is 220 based on pooled DNA.  To test the effect of the number of populations in the estimation of the parameters described 232 above (Hs, FIS, FST, IBE, IBD, CAH associations), we performed random sampling designs.

233
For the genomic datasets, we sampled 5, 10, 20, 30 and 40 random populations from the 49 234 (50K) and 47 (DTS) populations described above (Supporting Information Figure S1). For the calculated the estimates described above, and for summary statistics we estimated the mean

270
Summary statistics for the entire datasets 271 We considered the entire datasets (50K, DTS and microsatellite) as those revealing the "real" 272 or most accurate patterns of genetic diversity across teosinte populations. Table 2 Table S1). 276 We found striking differences between the datasets for the estimated mean across

282
We were not able to calculate FIS for the DTS dataset (pooled DNA), but we also found 283 differences between the estimated mean using the 50K (FIS=0.065) and microsatellite datasets 284 (FIS=0.19).

285
In contrast to the summary statistics, for the three datasets we found similar patterns of  Table 2). For the three datasets, we observed that patterns of IBD and IBE were positive and that the values were stronger for 288 IBD than IBE. Finally, for the three datasets, we observed negative associations between 289 genetic diversity and the distance to the geographic and niche centroids ( Figure 3, Table 2).  Table S1, S2). While the ranges 302 of the maximum and minimum values were close to the "real" estimates for the 50K dataset, 303 many of the comparisons between subsamples of 3, 6 and 9 individuals and the real data set 304 were nonetheless significantly different (Supporting Information Table S2-S4).  Fig. S2-S4). Importantly, we found that the variance across 316 replicates was higher for the microsatellite dataset, followed by the DTS dataset and finally 317 when sampling each subspecies using the 50K dataset (Supporting Information Fig. S2-S4).

318
Even though decreasing the number of loci increased the variance, it was interesting to 319 note that estimated distributions fell close to the estimates for "real" datasets (Supporting 320 Information Table S1), and that the distribution of different estimates fell within the range of

328
The only test that generated strong deviations with respect to the "real" datasets (even 329 if the ranges were not very large) was the association between HS and the geographic centroid, 330 where we found that sampling fewer DTS and 50K SNPs reduced the association (β got closer 331 to 0). Importantly, sampling 1,000 or 15,000 of the 50K SNPs; 1,000 or 5,000 of the DTS Varying the number of sampled populations 336 Varying the number of populations generated similar mean values to those found for the 337 "real" datasets. However, for the three datasets, sampling 5 and 10 populations generated  Table   341 S1), but remained high for the microsatellite dataset. For patterns of IBD and IBE, we also found that sampling fewer than 10 populations 348 generated in some replicates incorrect association estimates. The real value showed positive 349 associations (Table 2), but for 5 and 10 sampled populations we found that up to 18% and 7% 350 of sample replicates generated negative associations, respectively (Supporting Information 351 Table S5; see changes in signs in Supporting Information Table S1). For tests of association 352 between HS and ecological variables, this was even more sensitive for the genomic datasets, 353 since we found up to 4.6% of positive estimates (when the entire dataset was negative) for 30 354 populations when testing association between Hs and the distance to the niche centroid. 355 However, we found that the DTS was less sensitive to deviation for associations between 356 ecological and genetic variables (Figure 3).

357
Finally, we found that the microsatellite dataset was more sensitive for isolation  Table S5).    Figure 5a shows that for 5 and 10 sampled populations, the maximum FST for a locus found 373 by bayescenv across replicates was significantly higher than for the rest of the sampling 374 designs. We also tested the number of candidate SNPs across replicates. We found that more 375 candidate SNPs when sampling a higher number of populations ( Figure 5b). Interestingly, we 376 found that few outlier SNPs were shared between replicates and between sample designs 377 sampling designs (Supporting Information Table S6).  Comparing datasets 391 We found that the most evident differences between datasets were associated to HS, FST and  Even though we found differences between datasets, it was interesting to note that for 405 tests of isolation, and tests associated to the central abundance hypothesis, very similar 406 patterns were found with the three datasets. In all cases, we found the same associations for the summary statistics (Table 2). This is an interesting observation, and suggests that for these 408 types of analyses any genomic or microsatellite dataset might be useful, as long as enough 409 individuals, loci and populations are sampled (Table 3, See below).

410
Finally, when we compared the sampling designs with their respective "real" datasets, 411 we found that the DTS dataset generated fewer significant deviations for sampling designs 412 than the 50K datasets (Supporting Information    Table 1). We did not test for a combination of sampling at the same time   445 We were only able to compare the effect of sampling different number of individuals for the 446 50K and microsatellite datasets. In general we found that for the 50K and microsatellite 447 datasets many comparisons, especially for summary statistics, between the "real" dataset and 448 the sampling designs were significantly different (Supporting Information Table S4).

449
However, for the 50K dataset we found that the majority of summary statistics 450 distributions (except FIS) had a low variance that fell within the distribution of sampling 20-451 30 populations (Figures 1-3, Supporting Information Fig. S2-S4), indicating that sampling 452 few individuals would not deviate summary statistic estimates. For the microsatellite dataset, 453 we found that sampling 3-6 individuals generated important deviations for FIS and FST estimates (Figure 1), and larger variance than the 50K dataset across replicates in all tests. 455 Importantly, we found that sampling few individuals did not produce important deviations for 456 tests of patterns of isolation, especially when using MRM tests (Figure 2). 457 For the 50K and microsatellite datasets, sampling fewer individuals increased the 458 variation across sampling, but more importantly underestimated the FIS inbreeding estimation 459 (Figure 1b, Supporting Information Fig. S2). In contrast to other summary statistics, FIS was   (Table 3). In fact we found that it is more  Table 1). 479 We tested the effect of randomly sampling different number of populations for all datasets. 480 Sampling above 10 populations did not generate significant differences between sampling 481 designs and the "real" sample for the three datasets (Supporting Information Table S4).

482
However, we found that the number of populations was strongly associated with the accuracy 483 of the mean estimates across replicates (Figures 1-3 Table S1). Sampling different number of populations with the microsatellite dataset generated 485 a lower variance across replicates than the genomic datasets when estimating patterns of 486 isolation using the MRM test (Figure 2a,b).  Table 1). In fact we found that it is more convenient 500 to sample more populations with few individuals than few populations with many individuals 501 (Figure 4, Supporting Information Fig. S5).

504
An important advantage of MPS is that it allows detecting candidate loci under selection.

536
Depending on the objectives, and the amount of data that can be produced using genomic 537 platforms, we propose some suggestions for sampling design that could be considered 538 according to our results (Table 3). It is important to consider that these recommendations are datasets. If testing local adaptation is not a priority, then sampling fewer populations, but with 546 many individuals (>20) might be more important, and with special focus on populations that 547 could be at risk.

548
When testing summary statistics associated to demographic history, genetic structure, 549 environmental associations, and landscape genomics, we found that the number of SNPs, and number of individuals did not produce strong deviations in comparison to the real data when 551 testing the genomic datasets (even if some summary statistics were significantly different to 552 the real data). However, we found that sampling few populations could create important 553 deviations relative to the "real" summary statistics. In some replicates, we even found 554 incorrect estimates of parameters (Figure 2, 3, 4; Supporting Information Fig. S2-S4; Table   555 S5). These results corroborate the findings from other studies performed with microsatellites 556 and using bio-informatic simulations (See summary and references in Table 1). Considering 557 these results, we propose that if detecting local adaptation is not an objective and FIS is not 558 being measured, it is more important to sample many populations (~30) even if few 559 individuals per population are considered ( Figure 4) and few SNPs are obtained. Interestingly, 560 we also found that DTS was less sensitive than the 50K dataset for different sampling designs 561 (fewer significant deviations from real data, Supporting Information Table S4)

564
For studies whose aim is to identify candidate SNPs, we found that the number of     Very sensitive (>30-40 populations) As many SNPs as possible are needed to differentiate outlier loci, also to increase the probability of finding loci within selective regions. Increase as much as possible the number of populations, covering the largest geographic and environmental distribution. A possibility is to use pooled-sample DNA.  Table S2. FIS, was not possible to obtain for the DTS dataset because it is based on pooled data.  Table S2.  Table S2.

FIGURE 4
Trade off between the number of individuals and the number of populations sampled for all summary statistics using the 50K dataset. We tested the effect of sampling more individuals in few populations and few individuals in many populations.  Table S2. F IS, was not possible to obtain for the DTS dataset because it is based on pooled data.