Genetic Diversity and Connectivity in Maurolicus muelleri in the Bay of Biscay Inferred from Thousands of SNP Markers

Mesopelagic fish are largely abundant poorly studied fish that are still intact, but which, due to their potentially great added value, will be imminently exploited by humans. Therefore, studies that provide information to anticipate the anthropogenic impact on this important resource are urgently needed. In particular, knowledge about their connectivity, potential adaptation and resilience are needed. This information can be obtained through the analysis of genome-wide markers which are now relatively easily and cost-efficiently discovered thanks to high-throughput sequencing technologies. Here, we have generated thousands of SNP markers in Maurolicus muelleri, based on the restriction-site associated DNA sequencing method, and preformed population connectivity and genetic diversity analyses in a subset of samples collected from the Bay of Biscay. Our study proves the method valid for obtaining genome-wide markers in this species and provides the first insights into the population genomics of M. muelleri. Importantly, the genomic resources developed here are made available for future studies and set the basics for additional endeavors on this issue.


INTRODUCTION
The lack of physical barriers in the ocean and large populations sizes generally results in low genetic differentiation in marine fishes (Ward et al., 1994;Bradbury et al., 2008), which renders the task of inferring demographic patterns in this environment particularly difficult. Yet, over the last few years, an increasing number of studies based on high-throughput genetic data have provided evidences of fine scale population differentiation (e.g., Hess et al., 2013;Benestan et al., 2015;Rodríguez-Ezpeleta et al., 2016), challenging previous assumptions based on traditional methods (Hauser and Carvalho, 2008). In particular, the advent of the restriction site associated DNA sequencing (RAD-seq) (Baird et al., 2008) and related methods (Davey et al., 2011) has revolutionized the field of marine conservation genomics . This approach consists on subsampling putative homologous regions from the genome in several individuals with the aim of identifying and genotyping single nucleotide polymorphisms (SNPs). Interestingly, the method can be applied to organisms for which no prior genomic resources are available, and it is suitable to study both, neutral population structure and local adaptation (Allendorf et al., 2010).
In the context of marine management, RAD-seq is particularly relevant for poorly studied widely distributed species, as it can provide quick estimates of genetic diversity, population connectivity and adaptation more cost-effectively than relying on genome or transcriptome sequencing, or on non-sequencing based microsatellite or SNP typing.
Among the poorly studied marine organisms are mesopelagic fish, i.e., those inhabiting the deep scattering layer, that represent the largest fish biomass in the ocean (Kaartvedt et al., 2012;Irigoien et al., 2014) and play important roles in the marine ecosystem and global biogeochemical cycles (Klevjer et al., 2016). The large abundance of these fish make them particularly attractive for exploitation (St. John et al., 2016). Yet, the mesopelagic ecosystem is still largely unknown (Sutton, 2013), and the potential effects of newly introduced anthropogenic pressures in this realm should be anticipated so that sustainable management strategies for this valuable resource can be developed (St. John et al., 2016). Thus, there is an urgent need for establishing focused scientific surveys, developing appropriate sampling gear and generating additional biological data to booster knowledge in these species. In particular, developing genomic tools is foremost, as they can provide insights to understand population persistence, productivity and response to exploitation (Palumbi, 2003;Cowen et al., 2006).
Several studies have focused on deciphering species boundaries and population connectivity within mesopelagic fishes using genetic variants, but they are all based on allozymes (Gunawickrama and Naevdal, 2001;Kristoffersen and Salvanes, 2009), single mitochondrial markers (Kojima et al., 2009;Habib et al., 2012;Gordeeva, 2014) or a few microsatellites (Gordeeva, 2011;Van de Putte et al., 2012). To our knowledge, no studies based on genomic-wide SNP data have been published. Yet, simulated and empirical data based evidences support that high-throughput SNP data analyses provide more accurate population structure inferences than single or a few polymorphic marker based analyses (Haasl and Payseur, 2011;Rašić et al., 2014;Benestan et al., 2015;Szulkin et al., 2016). Here, in a first attempt to introduce genome-wide SNP discovery and genotyping to study mesopelagic fishes, we have focused on the Mueller's pearlside, Maurolicus muelleri.
Although initially thought to be distributed worldwide, previous reports of M. muelleri in the Northeast Atlantic (Gjøsaetr and Kawaguchi, 1980), the South Atlantic (Hulley and Prosch, 1987), the Southeast Pacific (Robertson, 1976), and the coasts of Japan (Okiyama, 1971) are now believed to belong to different species of the genus Maurolicus. The genus was split into fifteen species based on morphological characters and geographic distribution (Parin and Kobyliansky, 1993), but recent studies have suggested an overestimate of species diversity in Maurolicus (Kim et al., 2008;Rees et al., 2017), illustrating a need for more in-depth phylogeographical analyses. In response to these needs, we have discovered and genotyped thousands of SNP markers in M. muelleri, which constitute the first genomic resource of a mesopelagic fish.

Tissue Sampling and DNA Extraction
Specimens of M. muelleri were collected at several coastal locations of southern Bay of Biscay (Figure 1) through pelagic trawling on board the R/V Emma Bardán and Ramón Margalef in September 2016. Catches were collected between 106 and 212 m and during daytime. Fish where frozen at −20 • C until DNA extraction. Genomic DNA was extracted from about 20 mg of tissue using the Wizard R Genomic DNA Purification kit (Promega, Fitchburg, WI, United States) following manufacturer's instructions for "Isolating Genomic DNA from Tissue Culture Cells and Animal Tissue." Extracted DNA was suspended in Milli-Q water and concentration was determined with the Quant-iT dsDNA HS assay kit using a Qubit R 2.0 Fluorometer (Life Technologies). DNA integrity was assessed by electrophoresis, migrating about 100 ng of GelRed TM -stained DNA on an agarose 1.0% gel.

RAD-Seq Library Preparation and Data Analysis
Restriction-site-associated DNA libraries of 94 individuals (10 to 20 per catch) were prepared following the methods of Etter et al. (2011). Briefly, starting DNA (ranging from 300 to 500 ng) was digested with the SbfI restriction enzyme and ligated to modified Illumina P1 adapters containing 5 bp unique barcodes. Pools of 30 or 32 individuals were sheared using the Covaris R M220 Focused-ultrasonicator TM Instrument (Life Technologies) and size selected to 300-500 pb by cutting agarose migrated DNA. After Illumina P2 adaptor (including 5 pb index) ligation, each library was amplified using 14 PCR cycles. The three pools, each with one unique Illumina index, were combined and pairedend sequenced (100 pb) on an Illumina HiSeq2000. Generated RAD-tags were analyzed using Stacks version 1.44 (Catchen et al., 2013b). Quality filtering and demultiplexing was performed with process_radtags with default parameters and removing the last 10 bases of the end of the reads. Putative orthologous tags (stacks) per individual were assembled using ustacks with a minimum depth of coverage required to create a stack (m) of 3 and a maximum nucleotide mismatches (M) allowed between stacks of 2, 4, or 6. Catalogs of RAD loci were assembled using cstacks with number of mismatches allowed between sample tags when generating the catalog (n) of 3, 6, and 9 for M-values of 2, 4, and 6, respectively. Matches of individual RAD loci to the catalog were searched using sstacks. RAD loci found in at least 75% of the individuals were selected using populations and used to calculate (nucleotide diversity -π, minor allele frequency -MAF, expected heterozygosity -H e , expected homozygosity -H o , and inbreeding coefficient -F IS ). Using PLINK version 1.07 (Purcell et al., 2007), SNPs with MAF larger than 0.05 and a genotyping rate larger than 0.9 were selected for population structure analyses. The obtained genotype dataset was exported to Structure and Genepop formats using PGDSpider version 2.0.8.3 (Lischer and Excoffier, 2012).

Genetic Diversity and Population Structure
F ST values per group pairs were calculated following the Weir and Cockerham (1984) formulation as implemented in Genepop 4.3 (Rousset, 2008). Principal component analyses (PCA) were performed with the R package adegenet (Jombart and Ahmed, 2011) without any a priori grouping assumption. The percentage of appurtenance of each individual to each of the K (ranging from 1 to 4) hypothetical ancestral was calculated with the Bayesian clustering approach implemented in STRUCTURE (Pritchard et al., 2000) without any prior population assignment, based on the admixture model and a burn-in period of 100,000 iterations followed by 300,000 iterations from which estimates were obtained. Ten replicates for each K-value were performed and analyzed with CLUMPP (Jakobsson and Rosenberg, 2007) to identify common modes. Results were plotted using DISTRUCT (Rosenberg, 2004), and best K was identified according to the Evanno method (Evanno et al., 2005) as implemented in StructureHarvester (Earl and vonHoldt, 2012).

RAD-Seq Genotyping
The number of RAD-seq reads passing quality filters per individual included in the final analyses ranges from 326 824 to 2 116 073 with an average of 974 008. The number of RAD-tags   (Figure 2A). Average coverage per individual ranges from 6 to 20× (with an average of 11×). Number of RAD-tags per individual increases with number of reads used for stack formation and does not reach a maximum value even when more than 100K reads are used ( Figure 2B). This places the number of SbfI cut sites in M. muelleri above 32 000, which is larger than in other teleost species (Catchen Above twenty thousand RAD-tags are present in at least 75% of the individuals, and comprise more than two million sites of which more than a thousand are variable ( Table 1). The number of tags present in more than 75% of the individuals increases with increasing M because more common loci can be found when these are composed by more alleles. Consequently, larger M-values results in more alleles and thus more variable positions (Catchen et al., 2013b;Rodríguez-Ezpeleta et al., 2016). Selecting only those positions with MAF larger than 0.05 and applying a stringent filtering of a minimum of 90% of the individuals being genotyped per SNP, results in three genotype datasets of 1 625, 2 350, and 2 409 SNPs for M-values 2, 4, and 6, respectively.

Genetic Diversity
The number of polymorphic sites contained in the tags present in at least 75% off the individuals is 6.5, 8.3, and 9.4% for M-values of 2, 4, and 6, respectively. When calculated considering all individuals a single group, overall nucleotide diversity (π), inbreeding coefficient (F IS ), and expected (H e ) and observed heterozygosity (H o ) values are congruent across the three parameter sets used. Expectedly, using only variable sites or selecting those with at least 0.05 MAF increases nucleotide diversity and both, expected and observed heterozygosity ( Table 2). For all set of parameters and position subsets used, observed heterozygosity is lower than expected both overall and when calculated per catch. Per catch, values of π and H e are similar, whereas values of H o and F IS differ (Figure 3). In general, catches with low observed heterozygosity are also those with highest inbreeding coefficient (9018, 9204, 9201) and vice versa (hauls 9007, 9219, 9234), although this does not hold for catches 9012 and 9029 with present large H o, but medium F Is values with respect to the others. Similar to other studies (Rodríguez-Ezpeleta et al., 2016), we find that, for the four variables, absolute values depend on the set of parameters used for SNP discovery, but that relative values among groups are maintained.

Population Structure
In the Bayesian population structure analyses (Figure 4), all individuals display admixed representation of each of the hypothetical ancestral population, suggesting genetic connectivity within the area of study. Interestingly, catch 9018 shows an ancestry pattern that differs from the rest of the catches, yet, this difference is not visible in the PCA plots (Figure 5), where no differentiation among groups can be observed. In accordance with the hypothesis of high connectivity within the area of study, F ST values between pairs of populations are low (Table 3), although, consistent with the pattern observed in the structure analyses, pairs including catch 9018 are those with highest F ST values.

DISCUSSION
Restriction-site associated DNA sequencing constitutes an unprecedented opportunity for performing demographic inferences in species for which no prior genetic resources are available (Corander et al., 2013;Hess et al., 2013). Here, FIGURE 4 | Graphical representation of the Bayesian clustering approach for the best K-value obtained for M-values 2, 4, or 6; each bar represents an individual and each color, its inferred membership to each K potential ancestral populations. Pie charts represent per catch averaged proportion of assignment to each potential ancestral population. For each M-value, only the best value of K is shown.
we have discovered and genotyped thousands of RAD-seq derived SNP markers in M. muelleri in a first attempt to use genome wide data to study diversity and connectivity in this mesopelagic species, which is particularly relevant in view of the imminent exploitation of this till now pristine marine resource.
We have demonstrated that the Sbf I restriction enzyme is as a good candidate for RAD-sequencing based SNP discovery in M. muelleri. Interestingly, we found that, although the average number of RAD-tags per individual is similar to that found in other species (Amores et al., 2011;Catchen et al., 2013b;Puebla et al., 2014;Diaz-Arce et al., 2016;Rodríguez-Ezpeleta et al., 2016), the number of RAD-tags obtained per individual increases with the number of sequencing reads produced; this suggests that most likely the genome of M. muelleri could have as many as 50 000 SbfI cut sites, which is higher than most fish species studied so far. The presence of a large number of cut sites, and therefore of loci, could suggest lower coverage per locus. Yet, although lower coverage than in other species is obtained, all individuals had more than 6× coverage and average was of 11×. Multiplexing less individuals per lane or sequencing the same pool in more than one lane would increase coverage accordingly. Yet, the large number of loci shared among 75% of the individuals and the above one thousand SNP markers obtained after filtering suggest that this increased sequencing would not be necessary. The three M parameter values tested provided congruent outcomes; here, only those values corresponding to M = 2 are shown.
In our analyses, observed heterozygosity is lower than expected. This could be indicator of (i) population differentiation within the area of study, referred to as the Wahlund effect (Wahlund, 1928), (ii) preferential mating with close relatives or inbreeding (Wright, 1921) or (iii) non-random sampling of members from a limited number of families (Robertson, 1965). Due to the large biomass of mesopelagic fish (Irigoien et al., 2014), it would seem unlikely that our samples, composed by a few individuals per location, contain relatives. Similarly, preferential mating with close relatives or inbreeding would be difficult also to explain. Thus, the natural explanation for the low observed heterozygosity obtained would be population differentiation within the area of study. Yet, we do not find clear evidences of population stratification except for catch 9018, which appears genetically differentiated in the STRUCTURE plots, but not in the PCA.
The fact that when separating individuals into catches, observed heterozygosity is also lower than expected, suggests that one of the three possible explanations should be also acting at the catch level. In this sense, the possibilities of inbreeding and non-random sampling members of the same family should also be considered. Indeed, the fact that size distributions are quite homogeneous within catches (Figure 1), supports previous observations that individuals are grouped according to their age (Staby et al., 2011), and claims for more studies linking the reproductive behavior of the species and the genetic diversity results obtained here, which should be further confirmed with additional data. Similarly, no explanation to the different genetic diversity estimates obtained per catch could be found. These differences could be simply due to a low sample size per catch or have a more biologically meaningful information, and thus, more analyses are needed to confirm either one.
Overall, we have validated the RAD-seq based SNP discovery method using the SbfI restriction enzyme for M. muelleri, which has resulted in the first genomic resources for this species that are moreover made available for future studies (raw sequences are available at NCBI SRA Bioproject PRJNA417219). This data will contribute to shed light on phylogeographic patterns within the species (Rodríguez-Ezpeleta et al., 2016), detect genetic adaptation (Van Wyngaarden et al., 2017), and to resolve the phylogeny within the genus for species delimitation (Diaz-Arce et al., 2016).

AUTHOR CONTRIBUTIONS
NR-E designed the study, performed the analyses, interpreted the data, and wrote the manuscript. PÁ collected the samples, contributed to the interpretation of the data and revised the manuscript. XI contributed to the interpretation of the data and revised the manuscript.

FUNDING
This project was supported by the Department of Agriculture, Fisheries and Food of the Basque Country.