Selection Signatures of Pacific White Shrimp Litopenaeus vannamei Revealed by Whole-Genome Resequencing Analysis

Wang, Hao; Teng, Mingxuan; Liu, Pingping; Zhao, Mingyang; Wang, Shi; Hu, Jingjie; Bao, Zhenmin; Zeng, Qifan

doi:10.3389/fmars.2022.844597

ORIGINAL RESEARCH article

Front. Mar. Sci., 24 March 2022

Sec. Marine Fisheries, Aquaculture and Living Resources

Volume 9 - 2022 | https://doi.org/10.3389/fmars.2022.844597

This article is part of the Research Topic: Advances in the Biology, Aquaculture, and Conservation of Threatened Marine Species and their Application in Human Health and Nutrition

Hao Wang^1†

Mingxuan Teng^1†

Pingping Liu^1,2

Mingyang Zhao^2

Shi Wang^1,2,3

Jingjie Hu^1,2

Zhenmin Bao^1,2,4

Qifan Zeng^1,2,4*

¹MOE Key Laboratory of Marine Genetics and Breeding, College of Marine Life Sciences, Ocean University of China, Qingdao, China
²Key Laboratory of Tropical Aquatic Germplasm of Hainan Province, Sanya Ocean Institute, Ocean University of China, Sanya, China
³Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
⁴Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China

The Pacific white shrimp Litopenaeus vannamei is among the most commercially important aquatic species in the world. Over the last four decades, intensive breeding of L. vannamei has generated multiple strains with improved production and performance traits. However, signatures of domestication and artificial selection across the L. vannamei genome remain largely unexplored. In the present study, we conducted whole-genome resequencing of 180 Pacific white shrimps from two artificially selected breeds and the broodstocks of four market-leading companies. A total of 37 million single nucleotide polymorphisms (SNPs) were identified, with an average density of 22.5 SNPs/kb across the genome. Ancestry estimation, principal component analysis, and phylogenetic inference all revealed clear stratification among the six breeds. We evaluated the linkage disequilibrium (LD) decay in each breed and identified the genetic variations driven by selection. Pairwise comparison of the fixation index (F_st) and nucleotide diversity (θ_π) allowed us to identify the genomic regions under selective sweep in each breed. Functional enrichment analysis revealed that genes within these regions are mainly involved in the cellular macromolecule metabolic process, proteolysis, structural molecule activity, structural constituent of ribosome, and response to stimulus. The genome-wide SNP datasets provide valuable information for germplasm resource assessment and genome-assisted breeding of Pacific white shrimps, and also shed light on the genetic effects and genomic signatures of selective breeding.

Introduction

Aquaculture produces almost half of the seafood consumed by humans and is one of the fastest-growing food production sectors in the world (FAO, 2020). To match the ever-increasing food demands of the growing population, aquaculture production should increase fivefold in the next three decades (Costello et al., 2020). The recent annual global production of farmed shrimps reached more than 7.7 million tons, representing a value of over 33 billion US dollars (FAO, 2020). Benefitting from technological innovation and policy reforms, shrimp farming has developed rapidly from traditional, small-scale activities into a global industry, holding great promise for enhancing the contribution of aquaculture production to food supply (Lotz, 1992; Briggs et al., 2005).

Pacific white shrimp (Litopenaeus vannamei) is the most commercially important shrimp species. The annual global yield of L. vannamei reached 4.4 million tons with a production value of about 26.7 billion USD, accounting for 80% of the total cultured shrimp production (FAO, 2020). The success of Pacific white shrimp aquaculture is largely attributed to a series of breeding programs since the 1970s (Lotz, 1992; FAO, 2011). Genetic improvements in performance traits and disease resistance have achieved remarkable progress over the last decade (Argue et al., 2002; Campos-Montes et al., 2012; Montaldo et al., 2013; Lillehammer et al., 2020). Several specific pathogen-free (SPF) L. vannamei breeds, with superior health and efficiency adapted to distinct farming conditions, have been cultivated and shipped worldwide (Briggs et al., 2005; Fletcher, 2020; Ren, 2020). Benefitting from the rapid development of high-throughput sequencing and genotyping technologies, several causative genes responsible for phenotypic variations of L. vannamei have been reported (Wang et al., 2019; Zhang X. et al., 2019; Lyu et al., 2021). For instance, genes encoding class C scavenger receptor (SRC), deoxycytidylate deaminase (dCMPD), and non-receptor protein tyrosine kinase (NPTK) were identified as potential genes related to the growth rate (Wang et al., 2019; Lyu et al., 2021). However, the genomic signatures underlying artificial selection remain largely unexplored. A high-resolution map of genomic variation is essential for a better understanding of genetic diversity, population structure, and genomic features during generations of selection.

In the present study, to profile the genomic signatures of selection in L. vannamei, we performed a whole-genome resequencing analysis of shrimps from two artificially selected breeds (Renhai No. 1 and Kehai No. 1) and the broodstocks of four market-leading companies (Benchmark Genetics, Charoen Pokphand, Shrimp Improvement Systems, and Top Aquaculture Technology). The genome-wide single nucleotide polymorphism (SNP) datasets revealed the population structure and genetic effects during the breeding process, providing valuable information for germplasm resource assessment and genome-assisted breeding of Pacific white shrimps.

Materials and Methods

Sampling and DNA Extraction

A total of 180 samples from the L. vannamei broodstock of Renhai No. 1 (RH), Kehai No. 1 (KH), Benchmark Genetics (BMK), Charoen Pokphand (CP), Shrimp Improvement Systems (SIS), and Top Aquaculture Technology (TA) were collected from Hairen Aquatic Seed Industry Technology Co., Ltd (Hebei, China) (Supplementary Table 1). Muscle tissue from each individual was collected for DNA extraction and genome resequencing. Genomic DNA was extracted using the TIANamp Marine Animal DNA Kit (TIANGEN Biotech Co., Ltd., Beijing, China). Paired-end sequencing libraries, with an insert size of 250 to 350 bp, were constructed with the VAHTS Universal Plus DNA Library Prep Kit for MGI (Vazyme Biotech Co., Ltd., Nanjing, China) in their lab, and sequenced on an MGI DNBSEQ-T7 system (BGI Genomics Co., Ltd., Shenzhen, China). All raw sequencing data were subjected to quality control. Low-quality bases and reads were trimmed by Trimmomatic with the parameters “LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36” (Bolger et al., 2014).
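As an illustration of what the trimming parameters above request, the following Python sketch applies the four steps to a list of per-base Phred scores. It is a simplified model, not Trimmomatic's actual implementation: the function name `trim_read` and its argument names are hypothetical, and the sliding-window step simply cuts the read at the start of the first window whose mean quality falls below the threshold.

```python
def trim_read(quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Return the (start, end) slice to keep, or None if the read is discarded.

    quals is a list of per-base Phred quality scores (hypothetical input format).
    """
    start, end = 0, len(quals)
    # LEADING:3 - remove bases with quality below 3 from the 5' end
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING:3 - remove bases with quality below 3 from the 3' end
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW:4:15 - scan 5'->3' and cut at the first 4-base window
    # whose mean quality is below 15 (simplified cut position)
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # MINLEN:36 - discard reads shorter than 36 bases after trimming
    if end - start < min_len:
        return None
    return start, end
```

For example, a read with two very low-quality leading bases, 50 high-quality bases, and a low-quality tail keeps only the high-quality core, while a read of 20 bases is dropped by the length filter.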

Sequencing and Genotyping

The reference genome was downloaded from the National Center for Biotechnology Information (NCBI) under accession GCA_003789085.1 (Zhang X. et al., 2019). We anchored the contigs into scaffolds under the guidance of genetic linkage maps (Yu et al., 2015; Jones et al., 2017), and all trimmed reads were aligned to this genomic assembly using BWA (version 2.3.4.1) with the default parameters (Li and Durbin, 2009). The Binary Alignment Map (BAM) files were processed with samtools (v0.1.19) for reference index building, format conversion, and read sorting (Li et al., 2009). Picard (v1.92) and sambamba (v0.8.0) were used to assign read group information (Tarasov et al., 2015). PCR and optical duplicates were marked and filtered using GATK (v4.1). Variants and haplotypes were identified using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode (Mckenna et al., 2010). GenotypeGVCFs was subsequently used for joint genotyping.

To keep the most reliable SNPs for subsequent analysis, variant sites were marked and filtered with vcffilter according to the following criteria: low genotype quality (GQ < 20); low quality by depth score (QD < 10); high Fisher strand score (FS > 10); low mapping quality (MQ < 40); low read position rank-sum score (ReadPosRankSum < 8); high strand odds ratio (SOR > 4); and low mapping quality rank-sum score (MQRankSum < 12.5). SNP loci with multiple alleles, low minor allele frequency (MAF < 0.05), or missing genotypes (max-missing < 1) were also removed using VCFtools (v0.1.16) (Danecek et al., 2011). The potential effect of each SNP was annotated using SnpEff (Cingolani et al., 2012).
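The site-level part of this filtering (minor allele frequency and missing genotypes) can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not the VCFtools implementation. It assumes a hypothetical encoding in which each genotype is the count of alternate alleles (0, 1, or 2) and `None` marks a missing call; the function names are invented for the example.

```python
def minor_allele_frequency(genotypes):
    """MAF of a biallelic site from alternate-allele counts (0/1/2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_site(genotypes, min_maf=0.05, allow_missing=False):
    """Keep a site only if no genotype is missing and MAF >= min_maf."""
    if not allow_missing and any(g is None for g in genotypes):
        return False
    return minor_allele_frequency(genotypes) >= min_maf
```

With these defaults, a site carried by a single heterozygote among 20 individuals (MAF 0.025) is removed, as is any site with at least one missing genotype.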

Population Structure Analysis

Principal component analysis (PCA) was performed using PLINK (Purcell et al., 2007). Ancestry estimation and population structure were analyzed using ADMIXTURE (Alexander et al., 2009). The pairwise fixation index (F_st) was calculated by VCFtools with the parameters --fst-window-size 50000 and --fst-window-step 10000 (Danecek et al., 2011). The genomic observed heterozygosity, expected heterozygosity, and inbreeding coefficient were evaluated by VCFtools with the parameter --het (Danecek et al., 2011). For the phylogenetic analysis, representative SNP markers were extracted using VCFtools with the parameter --thin 10000 to mitigate the effects of LD, and homozygous sites were removed from the alignment file. Maximum likelihood phylogenetic inference was carried out by IQ-TREE (Nguyen et al., 2015).
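To make the per-individual inbreeding coefficient concrete, the sketch below computes a method-of-moments estimate of the form F = (O − E) / (N − E), where O is the observed number of homozygous sites, E the number expected under Hardy-Weinberg equilibrium, and N the number of sites. It is a simplified illustration: the small-sample correction that tools may apply to E is omitted, and the function and argument names are hypothetical.

```python
def inbreeding_coefficient(individual_genotypes, alt_freqs):
    """Method-of-moments F for one individual.

    individual_genotypes: alternate-allele counts (0/1/2), one per site.
    alt_freqs: population alternate-allele frequency at each of the same sites.
    """
    n_sites = len(individual_genotypes)
    # Observed homozygous sites: genotype 0 (ref/ref) or 2 (alt/alt)
    observed_hom = sum(1 for g in individual_genotypes if g != 1)
    # Expected homozygosity per site under Hardy-Weinberg: 1 - 2pq
    expected_hom = sum(1 - 2 * p * (1 - p) for p in alt_freqs)
    return (observed_hom - expected_hom) / (n_sites - expected_hom)


def observed_heterozygosity(individual_genotypes):
    """Fraction of sites at which the individual is heterozygous."""
    return sum(1 for g in individual_genotypes if g == 1) / len(individual_genotypes)
```

An individual whose homozygosity matches the Hardy-Weinberg expectation gets F = 0, a fully homozygous individual gets F = 1, and an excess of heterozygous sites gives a negative F.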

SNPs carrying an allele found in only one population were defined as population-specific SNPs (ps-SNPs). If the frequency of the population-specific allele was greater than 50% in that population, the SNPs were defined as common population-specific SNPs (cps-SNPs). The LD decay was estimated for each population using PopLDdecay (Zhang C. et al., 2019). Genomic inbreeding coefficients F_ROH were calculated according to the following equation, with a minimum window of 100 kb (McQuillan et al., 2008):

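F_{ROH} = \frac{\sum L_{ROH}}{L_{Auto}}

where \sum L_{ROH} is the total length of the runs of homozygosity (ROH) in an individual and L_{Auto} is the length of the autosomal genome.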
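The F_ROH formula reduces to a sum of segment lengths divided by the genome length. The following Python sketch shows the calculation under the assumption that the 100 kb minimum applies to the length of each ROH segment; the function name, the `(start, end)` tuple format, and the example coordinates are hypothetical.

```python
def f_roh(roh_segments, autosome_length, min_length=100_000):
    """Fraction of the autosomal genome covered by runs of homozygosity.

    roh_segments: iterable of (start, end) coordinates in bp for one individual.
    Segments shorter than min_length are ignored (assumed reading of the
    100 kb minimum stated in the text).
    """
    total = sum(end - start for start, end in roh_segments
                if end - start >= min_length)
    return total / autosome_length


# Hypothetical example: three segments on a 10 Mb genome; the 50 kb one is skipped
segments = [(0, 150_000), (200_000, 250_000), (1_000_000, 1_350_000)]
print(f_roh(segments, 10_000_000))
```

Here the two retained segments cover 500 kb of 10 Mb, so the individual's F_ROH is 5%.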

Selective Sweep Analysis

The pairwise F_st and the nucleotide diversity (θ_π) ratio were calculated between breeds using a 50 kb window and a 10 kb step. Genomic regions showing divergent spectra of genetic polymorphism were identified as potentially under selection. Empirical cut-offs were set as the top 2% of F_st values and the top 5% largest or smallest θ_π ratios (Li L. et al., 2018; Wang et al., 2021). Candidate genes under selection were defined as those lying within or overlapping the regions showing signals of selection. Candidate genes were characterized by Gene Ontology (GO) enrichment analysis implemented in EnrichPipeline, based on Fisher’s exact test (Huang et al., 2009). GO terms with a P-value < 0.05 were considered significantly enriched.
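The window scheme and empirical cut-offs described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, window values are assumed to be precomputed, and requiring a window to pass both the F_st and the θ_π ratio cut-off is an assumption about how the two criteria were combined.

```python
def sliding_windows(chrom_length, size=50_000, step=10_000):
    """Yield (start, end) windows of `size` bp advancing by `step` bp."""
    start = 0
    while start < chrom_length:
        yield start, min(start + size, chrom_length)
        start += step


def empirical_cutoff(values, fraction, upper=True):
    """Value bounding roughly the most extreme `fraction` of observations."""
    ordered = sorted(values, reverse=upper)
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1]


def candidate_windows(fst, pi_ratio, fst_top=0.02, pi_tail=0.05):
    """Indices of windows in the top F_st tail and in either pi-ratio tail."""
    fst_cut = empirical_cutoff(fst, fst_top, upper=True)
    pi_high = empirical_cutoff(pi_ratio, pi_tail, upper=True)
    pi_low = empirical_cutoff(pi_ratio, pi_tail, upper=False)
    return [i for i, (f, r) in enumerate(zip(fst, pi_ratio))
            if f >= fst_cut and (r >= pi_high or r <= pi_low)]
```

Because the cut-offs are empirical percentiles, a fixed share of windows is always flagged, so the resulting regions are candidates that still need functional support.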

Results

SNP Identification and Annotation

An average of 164 million 300 bp paired-end reads were generated for each library, representing ∼18.7× coverage. After quality control, approximately 80% of reads could be aligned to the reference genome (∼12.34× mean unique coverage) and used for variant calling. A total of 37,308,098 SNPs were identified in the six populations. The number of SNPs per chromosome ranged from 182,599 to 1,274,977, and the density ranged from 20.60 to 24.58 SNPs/kb. Of these, 4,509,526 reliable SNPs were kept after the removal of low-quality samples and of SNPs located within repetitive regions or having missing genotypes. The transition/transversion ratio in this biallelic SNP set was found to be 1.63, with 1,117,185,030 transitions and 72,199,828 transversions. The SNPs were unevenly distributed across the genome, with 10.56% (1,460,366) located in intergenic regions, 55.43% (7,662,564) located in introns, and only 2.283% (238,482) located in exons (Table 1). Among the SNPs assigned a functional class, 73.89% (221,021) were silent mutations, whereas the remaining 25.82% (77,219) and 0.30% (886) could result in missense and nonsense mutations, respectively (Table 1).
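The functional class percentages can be checked directly from the reported counts. The short Python calculation below assumes that the percentages are taken relative to the sum of the silent, missense, and nonsense counts, which is an inference from the numbers rather than something stated in the text.

```python
# Counts of SNPs per functional class as reported in the text
counts = {"silent": 221_021, "missense": 77_219, "nonsense": 886}

# Assumed denominator: the sum of the three functional classes
total = sum(counts.values())
percentages = {name: 100 * n / total for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

Under this assumption, the computed shares agree with the reported values to within about 0.01 percentage points.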

Table 1. SNP annotation by genomic region, impact and function class.

Population Structure

The population structure was inferred based on the genome-wide biallelic SNPs. As shown by the PCA, the first three components cumulatively explained 24.56% of the variance (Figure 1A and Supplementary Figures 1, 2). Samples from the six populations were grouped into three major clusters, which largely reflected the sampling sources. Clear stratification was observed among BMK, SIS, and the Asian breeds (RH, KH, TA, and CP). Although RH, KH, TA, and CP were distributed closely on the first two components, they possessed distinct population structures (Figure 1 and Supplementary Figure 2). The pairwise comparison revealed that the F_st values among the six populations ranged from 0.0005 to 0.13415, and that BMK had the largest genetic distances from the others. KH and TA shared a relatively higher similarity in genetic structure (Figure 1B). Populations defined by estimated genetic ancestries also confirmed the distribution of samples, with six ancestral populations inferred as the optimal scenario by cross-validation (Figure 1C and Supplementary Figure 3). The phylogenetic relationships were reconstructed via maximum likelihood inference. Except for two TA samples that clustered with CP, samples from each population formed monophyletic clades, which is consistent with the population structure inferred by PCA and ancestry estimation (Figure 1D).

Figure 1. Population structure analysis of the six populations by principal component analysis (PCA) (A), pairwise F_st distance (B), ancestry estimation (C), and phylogenetic inference (D).

Genome Selection Signatures

As shown in Figure 2A, population-specific SNPs were identified in each breed. BMK and SIS possessed 83,273 and 15,980 cps-SNPs, respectively, far more than the other four breeds. RH, KH, TA, and CP had only 3,153, 1,889, 485, and 836 cps-SNPs, respectively (Table 2). We evaluated the genetic diversity by the observed heterozygosity, expected heterozygosity, and inbreeding coefficient (Supplementary Data Sheet 1). The average genomic observed heterozygosity was 0.122, 0.137, 0.144, 0.147, 0.149, and 0.156 in BMK, CP, TA, RH, KH, and SIS, respectively. The extent of genome-wide LD decay was measured against the physical distance in each population (Figure 2B). The Asian breeds (RH, KH, TA, and CP) exhibited an overall lower level of LD than the others. Their LD levels (r²) fell below 0.1 at less than 15 kb, whereas the corresponding physical distance exceeded 70 kb in BMK and SIS. Analysis of runs of homozygosity (ROH) revealed that BMK and CP showed higher levels of inbreeding than the others (Figure 2C and Supplementary Figure 4). Homozygous segments larger than 1 Mb were mostly identified in these two populations. Although the genome-wide distribution of ROH was distinct in each population, peaks in specific regions of chromosomes 10, 15, and 42 were identified in every population (Figure 2D).
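The ps-SNP and cps-SNP counts above follow from the definitions given in Materials and Methods. A minimal Python sketch of that classification is shown below. It assumes a hypothetical input in which the frequency of one allele is known for each population, and the function name and the example frequencies are illustrative only; the same check can be applied to either allele of a biallelic site.

```python
def classify_snp(allele_freq_by_pop, common_threshold=0.5):
    """Classify one allele as population-specific.

    allele_freq_by_pop: dict mapping population name to the allele's frequency.
    Returns (population, "ps-SNP" or "cps-SNP") if the allele occurs in exactly
    one population, otherwise None.
    """
    carriers = [pop for pop, freq in allele_freq_by_pop.items() if freq > 0]
    if len(carriers) != 1:
        return None  # absent everywhere or shared by several populations
    pop = carriers[0]
    # "Common" means a frequency above 50% in the carrying population
    if allele_freq_by_pop[pop] > common_threshold:
        return pop, "cps-SNP"
    return pop, "ps-SNP"


# Hypothetical frequencies for one allele across the six populations
freqs = {"RH": 0.0, "KH": 0.0, "BMK": 0.7, "CP": 0.0, "SIS": 0.0, "TA": 0.0}
print(classify_snp(freqs))
```

An allele that reaches 70% in BMK and is absent elsewhere is labeled a cps-SNP for BMK, whereas an allele present in two populations is not population-specific at all.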

Figure 2. (A) cps-SNPs among the six populations. (B) Linkage disequilibrium (LD) decay at a distance of 0–300 kb. (C) Box plot of the F_ROH statistic. (D) Genome-wide distribution of ROHs in the six populations.

Table 2. Summary of genetic statistics for the six populations.

Genes Under the Selective Sweep

The pairwise F_st and θ_π ratio were calculated to scan for genomic regions with genetic signals of divergence. Windows with the top 2% of F_st values and the top 5% largest or smallest θ_π ratios were identified as regions potentially under selection (Figure 3A and Supplementary Figure 5). Genome-wide selective sweeps in BMK are illustrated in Figure 3A. We detected 271, 440, 257, 269, 431, and 429 candidate genes under strong selection from 166,258 genomic windows in RH, BMK, KH, TA, SIS, and CP, respectively (Supplementary Data Sheet 2). Interestingly, 206 genes were under selection in more than one population (Supplementary Data Sheet 2). Functional enrichment analysis of these candidate genes yielded 631 significantly enriched categories (Supplementary Data Sheet 3). For example, genes under selection in BMK were enriched in response to stimulus, carbohydrate transport, and several biological processes (Figure 3B). Genes under selection in RH were mainly enriched in trialkylsulfonium hydrolase activity, protein modification by small protein conjugation, and phospholipid binding. Genes under selection in KH were mainly enriched in the structural constituent of ribosome, structural molecule activity, and cellular macromolecule metabolic process. Genes under selection in TA were enriched in pyruvate kinase activity, peptidase inhibitor activity, and ATP metabolic process. Genes under selection in SIS were enriched in the ribonuclease H2 complex, GTP cyclohydrolase I activity, and cellular macromolecule metabolic process. Genes under selection in CP were mainly enriched in structural molecule activity, regulation of ATPase activity, and 5′-deoxynucleotidase activity (see Supplementary Data Sheet 3 for details).

Notably, B3GT5, CDC42, PXDN, and several other genes that play important roles in cellular macromolecule metabolic process, proteolysis, and response to stimulus were identified under selection in more than one population (Figure 4).
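The enrichment results above rest on Fisher's exact test, which for over-representation amounts to a hypergeometric tail probability. The Python sketch below shows that calculation with the standard library only. It is not the EnrichPipeline implementation: the one-sided form of the test, the function name, and the argument names are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation of a GO term.

    k: candidate genes annotated with the term
    n: total candidate genes
    K: background genes annotated with the term
    N: total background genes
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    denominator = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., upper annotated genes
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / denominator


# Hypothetical toy example: all 5 candidates carry a term held by 5 of 10 genes
print(enrichment_p_value(5, 5, 5, 10))
```

A term would be called enriched when this probability falls below the chosen threshold, here P < 0.05 as stated in Materials and Methods.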

Figure 3. Selective sweep and Gene Ontology (GO) enrichment analysis of the BMK samples. (A) Distribution of θ_π ratios (θ_π reference population/θ_π selective breed) and F_st values calculated in 50 kb windows with a 10 kb step. Windows with the 5% largest or smallest θ_π ratios and the top 2% of F_st values are highlighted as regions under selection. (B) Enriched GO terms of genes under selection.

Figure 4. Enriched GO terms of genes under selection that are shared among the breeds.

Discussion

Genetic breeding of L. vannamei has achieved remarkable progress, and generations of artificial selection have produced multiple novel breeds with significantly improved growth and adaptation to modern farming conditions (Argue et al., 2002; Campos-Montes et al., 2012; Montaldo et al., 2013; Lillehammer et al., 2020). Because exploring the full range of genetic diversity has practical implications for the management of germplasm resources and breeding programs, several studies have been carried out to decipher the spectrum of genetic polymorphism (Castillo-Juarez et al., 2015; Garcia et al., 2021). However, these studies were mainly based on a limited number of markers obtained by targeted or reduced-representation genotyping (Li and Wu, 2003; Perez-Enriquez et al., 2009, 2018b; Garcia et al., 2021; Prithvisagar et al., 2021). The genome-wide landscape of variation shaped by artificial selection remains largely unexplored. In this study, we conducted whole-genome resequencing of L. vannamei and investigated the distribution of SNP sites among six populations. In total, 37 million SNPs were identified at an average density of 22.5 SNPs/kb across the genome, of which 4.5 million high-confidence SNPs were retained after filtering. Population structures inferred by PCA, ancestry estimation, and phylogenetic analysis provided strong evidence for interpopulation stratification. Considering that the founders of L. vannamei breeds were originally introduced from native populations along the Pacific coast from Mexico to northern Peru in the late 1970s and 1980s (Wyban and Sweeney, 1991; Rosenberry, 2000), we suggest that genetic drift and artificial selection could both have contributed to the differentiation.

The genomic heterozygosity of the six populations ranges from 0.122 to 0.156. BMK and CP exhibit the lowest levels of heterozygosity and high inbreeding coefficients, which is consistent with the ROH analysis. The observed heterozygosity levels are lower than previous estimates (De Freitas and Galetti, 2005; Artiles et al., 2011; Rezaee et al., 2016; Perez-Enriquez et al., 2018a). The decay of LD provides important information about the historical recombination of a population and is widely used to understand evolutionary and demographic processes (Slatkin, 2008). Our results revealed that the overall LD levels of the four Asian populations are comparable to those of a breeding population in Ecuador (Garcia et al., 2021), while long-range LD was stronger in BMK and SIS. Apart from biases in sample size and relatedness among individuals, extended LD can be attributed to demographic and biological factors such as admixture, mutation, and fluctuation of the effective population size (Qanbari, 2019). As the natural distribution of L. vannamei is restricted to the Americas, we presume that the long-range LD observed in the American strains may have been introduced by the recurrent inclusion of wild stocks, which did not occur in the Asian populations (Moss et al., 2012). This phenomenon was also reported in salmonids, where admixture is the major factor contributing to long-range LD (Odegard et al., 2014; Barria et al., 2018; Vallejo et al., 2018).
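For readers less familiar with the LD statistic discussed here, the following Python sketch computes r² between two biallelic loci and reads off the distance at which a decay curve drops below a threshold such as the 0.1 used in the Results. It assumes phased haplotypes coded as 0/1, which is a simplification of what tools such as PopLDdecay handle. The function names and the example decay curve are hypothetical and do not reproduce the study's data.

```python
def r_squared(hap_a, hap_b):
    """LD r^2 between two biallelic loci from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both loci
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # at least one locus is monomorphic
    return d * d / denom


def decay_distance(distances_kb, mean_r2, threshold=0.1):
    """First distance at which mean r^2 falls below threshold, else None."""
    for dist, r2 in zip(distances_kb, mean_r2):
        if r2 < threshold:
            return dist
    return None


# Hypothetical decay curve, not taken from the study
print(decay_distance([1, 5, 15, 70], [0.4, 0.2, 0.09, 0.05]))
```

Two loci whose alleles always co-occur give r² = 1, and independent loci give r² = 0, so a slower fall of the curve with distance indicates longer shared haplotype blocks.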

The F_st values and θ_π ratios have been broadly used to identify genetic differentiation and genome-wide selective sweeps (Barreiro et al., 2008; Li L. et al., 2018). In the present study, we revealed numerous breeding signatures reflecting the complex genomic architectures of six breeds with distinct genetic backgrounds. A series of positively selected genes, such as B1GT3, C1GLT, NU133, PXDN, SOCS7, and UBP8, were identified in more than one breed. Previous studies in arthropods have revealed that many of these genes are involved in important physiological and biochemical functions. For example, PXDN, SCAP (SREBF chaperone), SOCS7, and TOLL8 are related to antiviral or antibacterial immunity in Pacific white shrimps and other crustaceans (Du et al., 2013; Wang et al., 2016; Li H. et al., 2018; Aweya et al., 2020). CDC42 plays an important role in reactive oxygen species (ROS) production and apoptosis via the mitogen-activated protein kinase (MAPK) pathway in L. vannamei (Peng et al., 2015). As selective breeding of these populations has been carried out for generations, the identified genes under selection seem to coincide with the improved performance traits. Previous studies revealed that targeted genotyping of trait-linked markers can improve the accuracy of breeding value estimation (Li et al., 2017). Thus, we expect that these loci could serve as important markers for genetic improvement via targeted genotyping.

Conclusion

We conducted a population genetic study on six L. vannamei breeds by whole-genome resequencing and identified over 37 million SNPs. Population structure revealed by PCA, ancestry estimation, and phylogenetic analysis confirmed the stratification among these populations. Selective sweep analysis detected 206 genes under selection in more than one population. Among them, several candidate genes act as key regulators in cellular macromolecule metabolic process, proteolysis, and response to stimulus, which could be responsible for the improved economic traits and adaptation to modern aquaculture. This study not only provides valuable information for germplasm resource assessment and genome-assisted breeding of L. vannamei, but also sheds light on the genetic effects and genomic signatures of selective breeding.

Data Availability Statement

The data presented in the study are deposited in the repository of Aquaculture Molecular Breeding Platform (AMBP, http://mgb.qnlm.ac).

Author Contributions

ZB, QZ, JH, and SW designed the experiments. MT, MZ, and PL collected the samples. HW, MT, and PL performed the experiments. HW, MT, and MZ analyzed the data. HW and QZ wrote the manuscript. All authors have read and approved the final manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

We would like to thank graduate students Jing Liu, Baojun Zhao, Hongyu Lv, and Yantong Cai for their assistance in sample processing. We acknowledge the grant support from the Project of Sanya Yazhouwan Science and Technology City Management Foundation (SKJC-KJ-2019KY01), National Key Research and Development Program of China (2021YFD1200805), Major Science and Technology Program of Hainan Province (ZDKJ2021017), and Agricultural Variety Improvement Project of Shandong Province (2019LZGC014).

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmars.2022.844597/full#supplementary-material

References

Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664. doi: 10.1101/gr.094052.109

Argue, B. J., Arce, S. M., Lotz, J. M., and Moss, S. M. (2002). Selective breeding of Pacific white shrimp (Litopenaeus vannamei) for growth and resistance to Taura Syndrome Virus. Aquaculture 204, 447–460.