Identification of Growth-Associated Genes by Genome-Wide Association Study and Their Potential Application in the Breeding of Pacific White Shrimp (Litopenaeus vannamei)

The Pacific white shrimp (Litopenaeus vannamei) is the most widely cultured shrimp in the world. A great attention has been paid to improve its body weight (BW) at harvest through genetic selection for decades. Genome-wide association study (GWAS) is a tool to dissect the genetic basis of the traits. In this study, a GWAS approach was conducted to find genes related to BW through genotyping 94,113 single nucleotide polymorphisms (SNPs) in 200 individuals from a breeding population. Four BW-related SNPs located in LG19 and LG39 were identified. Through further candidate gene association analysis, the SNPs in two candidate genes, deoxycytidylate deaminase and non-receptor protein tyrosine kinase, were found to be related with the body weight of the shrimp. Marker-assisted best linear unbiased prediction (MA-BLUP) based on the SNPs in these two genes was used to estimate the breeding values, and the result showed that the highest prediction accuracy of MA-BLUP was increased by 9.4% than traditional BLUP. These results will provide useful information for the marker-assisted breeding in L. vannamei.


INTRODUCTION
The Pacific white shrimp, Litopenaeus vannamei, is the most widely cultured shrimp all over the world (FAO, 2019). Due to its extremely high industry value, genetic improvement by selection breeding has been performed for decades (Argue et al., 2002;Sui et al., 2016). Growth trait is considered as the most important economic trait in L. vannamei and has been taken as the main target trait in most breeding programs (Argue et al., 2002;Sui et al., 2016).
As a typical quantitative trait with medium to high heritability (Castillo-Juárez et al., 2007;Sui et al., 2016;Nolasco-Alzaga et al., 2018), improvement of the growth trait was performed mainly by best linear unbiased prediction (BLUP) (Castillo-Juárez et al., 2007;Sui et al., 2016;Nolasco-Alzaga et al., 2018), which is essentially based on an "infinitesimal model" (Mather, 1941). This model assumes that the quantitative trait is controlled by an infinite number of gene loci that are evenly distributed across the genome and each locus has an infinitely small effect (Fisher, 1918;Bulmer, 1971). However, substantial advances in molecular genetics and genomic research provided us an opportunity to capture the quantitative trait loci (QTLs) or genes that have definite effects on growth.
An essential method for identifying QTLs is genetic linkage mapping. Several studies based on this method have identified QTLs for growth trait in L. vannamei, and no related genes were found due to the limitation of marker density (Andriantahina et al., 2013;Yu et al., 2015). Besides, QTLs detected in a mapping population with a biparental family are usually difficult to be transferred to large, more complex breeding groups (Takeda and Matsuoka, 2008). With the development of high-throughput genotyping platforms for single nucleotide polymorphisms (SNPs), genome-wide association study (GWAS) has become a more powerful tool to dissect the genetic basis underlying the traits. It utilizes high-density SNPs distributed along the genome to determine the QTLs by statistical associations between SNP genotypes and particular phenotypes. GWAS has been successfully carried out in fish, referring to many traits, such as filet texture, fat content (Sodeland et al., 2013), growth (Gonzalez-Pena et al., 2016;Li et al., 2017), sexual maturation (Gutierrez et al., 2015), disease resistance (Liu et al., 2015;Zhou et al., 2017;Rochus et al., 2018;Tan et al., 2018), low oxygen tolerance (Zhong et al., 2017), and fatty acid composition (Horn et al., 2020). As for crustaceans, reports to use GWAS to discover SNPs or genes related to interesting traits are relatively less. Based on the restriction-site-associated DNA (RAD) sequencing data in a biparental family, GWAS analysis for growth trait was performed and two growth-related candidate genes protein kinase C delta type and ras-related protein Rap-2a were identified in L. vannamei . In another GWAS analysis, 23,049 SNPs were obtained in a population of 200 L. vannamei using 2b-RAD sequencing, and class C scavenger receptor was identified as a growth-related candidate gene . Then, as the genome was assembled, another growth-related candidate gene, MMD2, was found based on the correlation analysis results in Wang et al. (2019) and further validated (Wang et al., 2020). Although these researches illustrated the power of identifying QTLs or genes using GWAS in L. vannamei, low marker density and simple population structure restrict their applied values.
With the accomplishment of the whole genome sequencing in L. vannamei , performing GWAS analysis and screening trait-related genes with reference genome and highdensity SNPs become possible. In the present study, the 2b-RAD sequencing data were mapped to the published genome sequence, and the SNPs were called based on the reference sequence, then GWAS for body weight (BW) was performed. Candidate growthassociated genes were further validated in other populations. Moreover, a trial to apply marker-assisted selection (MAS) was conducted. These results will provide useful information for marker-assisted selection in L. vannamei.

Identification of Growth-Associated Markers by GWAS
The experimental population used in GWAS was the same as described previously . Briefly, the GWAS population which was named as population 1, consisting of 13 full-sib families, was bred in Guangtai Marine Breeding Company in Hainan Province of China in 2015. Each family was raised separately for seedlings. After grown to 3 cm, 50 individuals per family were randomly selected, mixed, and cultured in a 10m 2 pond. When they reached the market size, 200 shrimp were selected randomly to be taken as GWAS population and the BW of each shrimp was measured. The 2b-RAD library sequencing data of these 200 individuals have been published previously , and 23,049 SNPs were obtained in these researches. As the reference genome has been finished and published recently , the sequencing data were mapped to the genome sequence, and the SNPs were called based on the reference sequence. After quality control filtering by minor allele frequency at 0.05 and SNP call rate at 0.9, 94,113 SNPs were used in subsequent GWAS.
Considering a possible stratification due to multiple origins of the shrimp, GWAS was performed using the egscore function in GenABEL package implemented in R environment (Aulchenko et al., 2007). In this method, classic principal component analysis (PCA) of the genomic kinship matrix was used to evaluate the population stratification (Price et al., 2006). Then, genome-wide significance was assessed by the Bonferroni method, in which the conventional p value was divided by the number of tests performed (SNPs tested). The model used for GWAS was as follows: where y is the vector of phenotypic records (BW), b is the vector of fixed effects including sex and the first four principal components, X is the incidence matrices for fixed effects in b, s is the effect of the SNP genotype, W is the design vector for s, and e is the vector of random residuals.

Validation of Growth-Associated SNPs by the General Linear Model
After GWAS, another independent population bred in 2017 which was named as population 2 in this study was used to validate the identified growth-associated SNPs. Population 2 contains 303 individuals from multiple families with similar day age. The top 50 candidate SNPs with the smallest p values obtained in GWAS were genotyped in population 2 by the target sequencing method (Shanghai Biowing Applied Biotechnology Co., Ltd., Shanghai, China). The general linear model (GLM) was used to verify associations between them and BW in population 2. The GLM was as follows: where y is the vector of phenotypic records (BW), b 1 is the vector of sex fixed effects, X 1 is incidence matrix for sex fixed effect in b, s is the effect of the SNP genotype, W is the design vector for s, and e is the vector of random residuals. The GLM analyses were executed using the package stats in R.

Annotation of Significant SNPs
After validation, the upstream 10 kb and downstream 10 kb around the significant SNPs were blasted to L. vannamei genome 1 and the nearby genes were identified. The 5 untranslated regions (UTRs) and open reading frames (ORFs) of growth-associated candidate genes were predicted by the Expert Protein Analysis System (EXPASY 2 ).

Identification of Significant SNPs in Candidate Genes by the Mixed Linear Model
To identify SNPs associated with BW located in two candidate genes, the population bred in 2018 which was named as population 3 was used to genotyped SNPs located in 5 UTR of the candidate genes. Population 3 contained 321 individuals from more than 50 full-sib families with the same day age. Mixed linear model (MLM) analysis was applied to eliminate the effects of population structure. Twelve microsatellite markers (Table 1) were used to genotype this population and their mothers to reconstruct pedigree in the COLONY (Jones and Wang, 2010). The restricted maximum-likelihood method was carried out in ASReml (Gilmour et al., 2012) to solve the following MLM: where y is the vector of phenotypic records (BW); b 1 is the vector of sex fixed effects; X 1 is the design matrices for b; u is the random polygenic additive genetic effect except for target SNP, assumed to be ∼ N(0, Zσ 2 u ), where σ 2 u is the polygenic additive variance; Z is the additive genetic relationship matrix for u; s is the fixed effect of the SNP genotype; W is the design vector for s; and e is the vector of random residuals.
After MLM analysis, linkage disequilibrium (LD) analyses were performed for BW-associated SNPs using the package LDheatmap in R, and additive (a), dominance (d), and substitution (α) effects of each BW-associated SNP were calculated as follows (Falconer and Mackay, 1996): where AA, BB, and AB were the least squares means of BW of three genotypes, and p and q are the allele frequencies of A and B.

Comparison Between BLUP and Marker-Assisted BLUP
After obtaining SNPs located in candidate growth genes, the methods of MAS were further explored. We calculated the estimated breeding values (EBVs) based on marker-assisted BLUP (MA-BLUP). The EBV of an individual based on MA-BLUP is the sum of polygenic effect (û) and SNP genotype effects (ŝ.) which were obtained by solving MLM equations: For comparison, EBVs from traditional BLUP model were also obtained using a similar MLM but without SNP effects. Tenfold cross-validation was used to compare the accuracies of BLUP and MA-BLUP. The GLMs were also included in the comparison, which represented selection accuracies based on significant markers. The phenotypes under GLMs were predicted by fixed effects of sex and SNP genotype, which were the same as those used in the validation of growth-associated SNPs. Crossvalidation was performed in population 3. For each tenfold cross-validation, the dataset was randomly divided into 10 equal zed subsets. Nine subsets were used as training data and the other one is used as prediction data. Cross-validation was repeated 500 times. Accuracy was represented by average Pearson correlation between observed phenotypes and predicted phenotypes in the prediction dataset.

Genome-Wide Association Analysis
According to the reference genome information , a total of 83,015 SNPs were distributed along the 44 linkage groups (LGs) with one SNP per 0.05 cM on average, while the other 11,098 SNPs were not assigned to any LGs. The Manhattan plot of the GWAS result is shown in Figure 1, in which the unassigned SNPs are shown in the rightmost column. We selected the 50 SNPs with the smallest p values for further verification. Their p value ranged from 1.42E-05 to 1.71E-04. A total of eight significant SNPs were located in LG19, and four and three were located in LG11 and LG15, respectively. The other 24 SNPs were located in 17 different LGs. Therefore, potential trait-associated SNPs showed a wide distribution in genome. More details about these 50 SNPs are shown in Table 2.

Validation of Growth-Associated SNPs by GLM
Eight candidate SNPs with small p values in GWAS were amplified and genotyped in population 2 by using the target sequencing approach. Their primer sequences are shown in Table 3. As there is a high ratio of repeat sequences and insertion-deletion mutations, the other 42 candidate SNPs were not genotyped. The GLM analysis results are listed in Table 4. Minor allele frequencies in all except for SNP555_246199 were larger than 0.10. The results of GLM analyses indicated that four of them were significantly correlated with BW after Bonferroni correction.

Annotation of Significant SNPs
The SNPs showing significant correlations with BW after Bonferroni correction were further analyzed and annotated. The  flanking genes around the significant SNPs are listed in Table 5.
Four growth-associated SNPs were found to be located near the genes. There were two unknown genes around the most significant SNP3085_190487, and the second most significant SNP1678_860109 was located in intron 2 of an unknown gene. The other two candidate growth-associated genes near the SNP3406_465799 and SNP2570_399791 were predicted as deoxycytidylate deaminase (dCMPD) and non-receptor protein tyrosine kinase (NPTK). Distances from growth-associated SNPs to candidate genes ranged from 2.6 to 10.0k. Particularly, there were seven transcript variants for L. vannamei NPTK and two for dCMPD in the NCBI GenBank. By comparing sequences of transcripts, we found that the sequence variation among different transcripts existed in 5 UTR for both genes.

Validation of Growth-Associated SNPs in Candidate Genes by MLM
In this study, the two annotated candidate genes were further analyzed in population 3 to identify the SNPs located in genes. Polymorphisms around 5 UTR of these two genes were detected by PCR amplification and Sanger sequencing. Two pairs of primers were designed to amplify 5 UTR for each gene, and the primer sequences are shown in Table 6.
The two numbers separated by underscore in SNP ID represent the ID of the scaffold and position of SNP in this scaffold (bp). For sequences of scaffolds, see http://www.shrimpbase.net/vannamei.html. The asterisk (*) indicates LGs of these SNPs were unknown. High polymorphisms were detected in amplified fragments of dCMPD and NPTK. Roughly, there were more than 45 SNPs and 5 indels in the amplified fragments of dCMPD and more than 32 SNPs and 2 indels in those of NPTK. However, most SNPs could be genotyped only in a few samples by Sanger sequencing due to the existence of the indels. Finally, 12 SNPs in NPTK and 8 SNPs in dCMPD with MAF > 0.01 were used to detect their associations with BW. The p values for each SNP are shown in Table 7. One SNP within intron 1 of NPTK (NPTK-278) and two SNPs located about 100 bp upstream of exon 1 of dCMPD (dCMPD-118 and dCMPD-129) were found significantly correlated with BW (p < 0.01).
Linkage disequilibrium analyses among these significant SNPs showed that dCMPD-118 and dCMPD-129 were in strong LD (Table 8), which indicates an overlap exists between their effects on BW. Therefore, the SNP with a larger effect, dCMPD-118, was included for further analysis. The least squares mean, additive, dominance, and substitution effects of NPTK-278 and dCMPD-118 are shown in Table 9. An overdominance effect was found in dCMPD.

Comparison Between BLUP and MA-BLUP
Estimated breeding valuess predicted from MA-BLUP and traditional BLUP were compared. For MA-BLUP analysis, we The asterisk (*) indicates a significant correlation with BW after Bonferroni correction (p < 6.25E-03). +10.0 k calculated EBVs using three models, two containing the fixed effect of NPTK-278 or dCMPD-118, respectively, and the other one containing both. In particular, the interaction effect between NPTK-278 and dCMPD-118 was not significant, so it was not considered. The correlations among different EBVs are shown in Table 10. The Pearson and Spearman coefficients between different models were high correlated.
The prediction accuracies of all the models including GLMs are listed in Table 11. MA-BLUP containing fixed effects of NPTK-278 and dCMPD-118 (model 4) has the highest accuracy, and those containing one SNP fixed effect (model 2 and model 3) The numbers in SNP ID represent locations in amplified fragments. The asterisk (*) indicates a significant correlation with BW (p < 0.01). are also larger than the traditional BLUP animal model (model 1). The predicted phenotypes by GLMs (models 5-7) showed weak or no correlation with the observed phenotypes.

DISCUSSION
Identification of trait-associated genetic markers or candidate genes is very important both for understanding the molecular mechanisms of phenotypic variation and for breeding applications. In aquaculture species, GWAS analysis was regarded as a useful tool for correlating genetic and phenotypic variations (Abdelrahman et al., 2017). In this study, we present a GWAS analysis to correlate genetic and growth phenotypic variations in L. vannamei. Although the results showed that no SNP reached genome-wide significance, a set of SNPs with small p values was still worthy of further validation and analysis. This is because the use of more samples and markers would probably increase the confidence levels, theoretically allowing the suggested QTL to be detected at a statistically significant level. Considering that Bonferroni correction was overly restrictive, we selected 50 SNPs with the smallest p values as the candidate SNPs for further validation. The association analyses between these SNPs and BW were performed in another population to avoid type I errors. These 50 SNPs were distributed widely in a genome which confirmed that growth trait has a polygenic architecture affected by multiple loci with small effects (Falconer and Mackay, 1996). Further association analysis results proved that some of the most significant SNPs were worthy to be reanalyzed although they did not reach genome-wide significance in GWAS.
After validation of the GWAS result, two candidate growthassociated genes were found, which were dCMPD and NPTK. dCMPD is an enzyme which catalyzes the deamination of dCMP to dUMP, thus providing nucleotide substrate for thymidylate synthase (Maley and Maley, 1959). Previous studies have  indicated its potential importance in DNA replication. Its activity is elevated in such rapidly dividing tissues as embryo (Maley and Maley, 1959), regenerating liver and rat hepatomas (Maley and Maley, 1961). A research on HeLa cell also found that dCMPD showed the highest expression in late S phase and its expression declines in the following G2 phase (Gelbard et al., 1969). There was also a report about the role of dCMPD in growth, and inhibition of dCMPD could result in inhibition of growth of chick embryo (Roth and Buccino, 1963). Protein tyrosine kinases (PTKs) are enzymes catalyzing the transfer of the gamma-phosphate group of ATP to the hydroxyl groups of specific tyrosine residues in peptides. PTKs are primarily involved in the regulation of cellular functions that are directly related to the multicellular status of the organism, such as growth, differentiation, and cell-cell and cell-extracellular matrix interactions (Robinson et al., 2000). PTKs are divided into two groups according to the presence of transmembrane and extracellular domains, and some lacking these domains are referred to as NPTK (Robinson et al., 2000). So far, there was no report about the function of dCMPD and NPTK in the growth of L. vannamei or other crustaceans. In L. vannamei, there were two transcript variants for dCMPD and seven for NPTK in NCBI, and sequence variation among different transcripts mainly existed in 5 UTR for both genes. Besides, considering complex traits are mainly driven by noncoding variants (Pickrell, 2014;Li et al., 2016), polymorphisms around 5 UTR of these two genes were identified, and their associations with phenotypic variations were detected using MLM analysis. Compared with GLM analysis, the advantage of MLM was accounting for genetic relationships which could avoid confounding effects due to population structure (Kennedy et al., 1992). One SNP in each gene was found significantly correlated with BW. These results showed that dCMPD and NPTK were growth-associated candidate genes in L. vannamei and this deserves further research. However, due to the possible presence of pleiotropy, how allelic frequency changes of these SNPs affect the other important economic traits also needs to be further studied.
Prior to this study, some BW-related genes have been identified in L. vannamei, such as myostatin (Lee et al., 2015), protein kinase C delta type, ras-related protein Rap-2a , class C scavenger receptor , and MMD2 (Wang et al., 2020). These findings suggested that growth trait has a polygenic architecture and provides the possibility of application of MAS for growth in L. vannamei. In this paper, we did not try to uncover the biological functions and mechanisms of the effects on phenotypic variations of these two genes, but to explore the methods on how to apply them in breeding.
For aquaculture species, the primary purpose of GWAS was to identify trait-related genes or markers to apply MAS. However, there has been very few MAS application cases in commercial aquaculture breeding. The biggest obstacle is that for economy traits, almost all of which are complex traits, even the most important loci in the genome have small effects, especially for growth traits. Therefore, the effects of associated genes or markers should be taken as complements to rather than a replacement of polygenic effects (additive genetic effects) in breeding. For the reasons above, MA-BLUP was developed (Fernando and Grossman, 1989) and previous studies showed that MA-BLUP could increase the accuracies of genetic evaluations compared with traditional BLUP (Van Arendonk et al., 1994;Zhang et al., 2003), especially on the condition that marker effect was relatively large (Lande and Thompson, 1990;Villanueva et al., 2005). In this study, there were certain differences between EBVs based on MA-BLUP and BLUP (Table 10). Cross-validation revealed that the performance of MA-BLUP was better than BLUP, and MA-BLUP with two SNP effects was better than with one SNP effect. By using the proposed MA-BLUP method, the prediction accuracy was increased by 9.4%. Besides, the prediction accuracies of GLMs consisting of only fixed effects of SNP were much lower than those of the MA-BLUP or BLUP. This suggested that selection only based on trait-related markers was inefficient.
Since the population for GWAS in this study was relatively small and its genetic background was narrow, the power of GWAS might be weakened. Therefore, we further analyzed the significant SNPs and the candidate genes in the other populations step by step and identified three body weight-related SNPs in two candidate genes. The roles of the two candidate growth genes should be further investigated to enrich our understanding of the genetic basis of growth trait in shrimp. Meanwhile, we give an example of applying the trait-associated SNPs in genetic selection by the MA-BLUP method. The prediction accuracy of MA-BLUP based on significant SNPs was higher than that of traditional BLUP. Foreseeably, as more molecular markers associated with different economic traits were identified, MA-BLUP will have a wide application prospect in the selective breeding of shrimp.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

AUTHOR CONTRIBUTIONS
DL and YY conducted the experiment and data processing. JX and FL conceived and supervised the project. QW and XZ contributed to statistical analysis. ZL and QZ participated in the extraction of genomic DNA. DL, YY, and FL prepared the manuscript. All authors have read and approved the manuscript.