Comparative genomic analysis of three salmonid species identifies functional candidate genes involved in resistance to the intracellular bacteria Piscirickettsia salmonis

Piscirickettsia salmonis is the etiological agent of Salmon Rickettsial Syndrome (SRS), and is responsible for considerable economic losses in salmon aquaculture. The bacteria affect coho salmon (CS) (Oncorhynchus kisutch), Atlantic salmon (AS) (Salmo salar) and rainbow trout (RT) (Oncorhynchus mykiss) in several countries, including: Norway, Canada, Scotland, Ireland and Chile. We used Bayesian genome-wide association (GWAS) analyses to investigate the genetic architecture of resistance to P. salmonis in farmed populations of these species. Resistance to SRS was defined as the number of days to death (DD) and as binary survival (BS). A total of 828 CS, 2,130 RT and 2,601 AS individuals were phenotyped and then genotyped using ddRAD sequencing, 57K SNP Affymetrix® Axiom® and 50K Affymetrix® Axiom® SNP panels, respectively. Both trait of SRS resistance in CS and RT, appeared to be under oligogenic control. In AS there was evidence of polygenic control of SRS resistance. To identify candidate genes associated with resistance, we applied a comparative genomics approach in which we systematically explored the complete set of genes adjacent to SNPs which explained more than 1% of the genetic variance of resistance in each salmonid species (533 genes in total). Thus, genes were classified based on the following criteria: i) shared function of their protein domains among species, ii) shared orthology among species, iii) proximity to the SNP explaining the highest proportion of the genetic variance and, iv) presence in more than one genomic region explaining more than 1% of the genetic variance within species. Our results allowed us to identify 120 candidate genes belonging to at least one of the four criteria described above. Of these, 21 of them were part of at least two of the criteria defined above and are suggested to be strong functional candidates influencing P. salmonis resistance. These genes are related to diverse biological processes, such as: kinase activity, GTP hydrolysis, helicase activity, lipid metabolism, cytoskeletal dynamics, inflammation and innate immune response, which seem essential in the host response against P. salmonis infection. These results provide fundamental knowledge on the potential functional genes underpinning resistance against P. salmonis in three salmonid species.

Currently, the advancement of molecular technologies has allowed the generation of 82 dense marker panels for salmonid species (Houston et (Hjeltnes, 2018). Interestingly, Moen et al., (2015) mapped the QTL to a region containing an 93 epithelial cadherin (cdh1) gene encoding a protein that binds to IPNV, indicating that the 94 protein is part of the machinery used by the virus for host internalization. 95 P. salmonis resistance has been suggested to be polygenic, with many loci explaining 96 a small amount of the total genetic variance (Barría et al., 2018a;Correa et al., 2015), 97 suggesting that the implementation of genomic selection (GS) is the most appropriate strategy 98 to accelerate the genetic progress for this trait. Methods which can model all available SNPs 99 simultaneously, including Bayesian regression methods (Fernando & Garrick 2013), appear to 100 be better for estimating marker effects than conventional methods of modeling each SNP 101 RT individuals were genotyped using the commercial 57K Affymetrix® Axiom® SNP 156 array, developed by the National Center of Cool and Cold Water Aquaculture at the USDA 157 (Palti et al., 2015). SNPs were filtered with the following QC parameters: HWE p < 1x10-6, 158 MAF < 0.05, and SNP call rate < 0.95. Individuals with call rates lower than 0.95 were also 159 removed. 160 The 50K SNP Affymetrix® Axiom® array used to genotype AS, was developed by 161 Universidad de Chile and Aquainnovo (Correa et al., 2015;Yañez et al., 2016). These markers 162 were selected from a 200K array, as described in detail in Correa et al. (2015). Genotypes were 163 quality-controlled using the following criteria: HWE p < 1x10-6, MAF < 0.05, SNP and 164 samples were discarded when the genotype rate was < 0.95. where, y is the vector of phenotypic records (DD or BS); X and Z are the incidence matrix of 174 fixed effects and polygenic effect, respectively; b is the vector of fixed effects (tank and body 175 weight); u is the random vector of polygenic effects of all individuals in the pedigree; * is the 176 vector of the genotypes for the i th SNP for each animal; * is the random allele substitution 177 effect of the i th SNP; * is an indicator variable (0, 1) sampled from a binomial distribution with 178 parameters determined such that π value of 0.99; and e is a vector of residual effects.
The prior assumption is that SNP effects have independent and identical mixture 180 distributions, where each SNP has a point mass at zero (with probability π) and a univariate 181 Gaussian distribution (with probability 1-π) with a mean equal to zero and variance equal to 182 2 3 , having in turn a scaled inverse 3 prior, with 2 = 4 and 7 = 10 degrees of freedom (df) 183 and scale parameter, respectively (Fernando and Garrick, 2013) where * and * are the allele frequencies for the i-th SNP, * is the estimated additive effect of 194 the i-th SNP on the phenotype and E 3 is the estimate of the polygenic variance (Lee et al., 195 2013). 196 The association between the SNPs and the phenotypes was assessed using the 197 proportion of the genetic variance explained by each marker. To be inclusive regarding the 198 genomic regions to be compared across the three species, we selected each of the regions 199 explaining at least 1% of the genetic variance for the trait in each species. 200 The heritability values were calculated as: 201 where, L M is the total additive genetic variance, estimated as the sum of additive marker 203 (2 2 3 ∑ * * ) and the polygenic pedigree based ( P 3 ) additive genetic variance. 204 205

Comparative genomic analysis 206
Initially, sequence homology between chromosomes containing regions with SNPs 207 explaining more than 1% of the genetic variance were compared. Synteny among these 208 chromosomes was identified by using Symap (Soderlund et al. 2011). The relationship between 209 the chromosomes from CS, AS and RT and the association between SNPs and resistance to P. 210 salmonis (Manhattan plot) was plotted using Circos (Krzywinski 2009 Subsequently, we retrieved the sequences of all the genes (and their protein products) adjacent 217 to each SNP within a window of 1 Mb (500 Kb downstream and 500 Kb upstream to the 218 associated SNP). We then used this information to apply the following criteria in order to 219 classify and prioritize functional candidate genes by comparing the genomic regions involved 220 in P. salmonis defined as DD and BS within and among the three species: Orthologous and paralogous genes among species were identified using the 226 ProteinOrtho tool (Lechner et al., 2011). Multi-directional alignments were performed using the full-length sequences among complete sets of proteins 228 encoded in each of the three species to obtain orthologous groups, with a 35% 229 threshold for identity and similarity (Group B); 230 iii) The complete set of genes within 1 Mb windows adjacent to SNPs explaining 231 the highest proportion of the genetic variation for each trait (leader SNP) were 232 recovered and classified as high priority genes (Group C); and 233 iv) The complete set of genes located at the intersection of more than one 1 Mb 234 windows within a species were also identified and considered as high priority 235 genes (Group D). 236

Challenge test and genetic parameters 239
There was considerable phenotypic variation for P. salmonis resistance across fish species 240 (Figure 1). The average cumulative mortality for different families ranged from 5% to 81%, 241 8.3% to 73.7% and 8% to 100% for CS, AS and RT, respectively. This result suggests that the 242 phenotypic variation for this trait could be related with the genetic background on each species. 243 Estimated heritabilities for P. salmonis resistance were significant for the three species, 244 indicating the feasibility to improve the trait by means of artificial selection ( to high percentage of genetic variance for P. salmonis resistance. In contrast, for AS, a large 254 number of SNPs with small effect were found, and the percentage of genetic variance explained 255 by a single marker was not higher than 5% (Figure 2  Interestingly, the P-loop containing nucleoside triphosphate hydrolase superfamily (also 286 known as P-loop_NTPase) contained the largest group of proteins for both traits, and at least 287 one representative protein from each salmonid species belonged to this superfamily. Thirty-288 one of the proteins identified in this study are part of this superfamily, including some GTPases, 289 kinesin and myosin proteins and ATP-dependent RNA helicases (Supplementary Table S1, 290 sheet: P-loop NTPases (Group_A)). 291 To complement these analyses, we looked for orthologous proteins through multi-292 directional alignments using full-length sequences of the complete set of proteins for each 293 species (Group B). Only five groups of orthologous genes were identified in at least two 294 species, highlighting three non-receptor tyrosine-protein kinases (nr-TPK) with representative 295 genes in the three species for DD and in two species for BS. In addition, for DD, two ATP-296 dependent RNA helicases (DDX) and two Ras-related proteins (RAB) were identified in CS 297 and RT, while two FYVE, RhoGEF/PH domain-containing proteins (FGD) were identified in 298 RT and AS. For BS, two fatty acid-binding proteins (L-FABP) and two ankyrin repeat domain-299 containing proteins were identified in CS and RT (Supplementary Table S1, sheet: 300 Orthologous genes (Group_B)). The proteins nr-TPK, DDX and L-FABP are also encoded by genes adjacent to SNPs that explained the highest proportion for the genetic variance (leader 302

SNP) for both trait definitions (Group C). 303
Group C contained other genes (n=42) that encoded proteins such as myosin-IIIb 304 (MYO3B), ATP-dependent RNA helicase (TDRD9), kinesin protein (KIF15) and kinesin 305 protein (KIF2C) that are also included into the P-loop_NTPases superfamily, as well as 306 members of the orthologous groups such as fatty acid-binding proteins (FABP). Other genes 307 encoding proteins classically associated to immune response such as tripartite motif-containing 308 protein 35 (TRIM35) and lysozyme C II (LYZ) are also part of this group. A complete list of 309 these genes and proteins is in Supplementary Table S1, sheet: Adjacent to leader SNP 310

(Group_C). 311
Group D was composed of genes (n=58) located adjacent to more than one SNP 312 simultaneously (within overlapped windows). Among them, we identified GTPase IMAP 313 family member 4 (GIMAP4), GTPase IMAP family member 8 (GIMAP8), NLR family CARD 314 domain-containing protein 3 (NLRC3), ADP-ribosylation factor protein 5B (ARL5B), voltage-315 dependent L-type calcium channel subunit beta-2 (CACNB2) and heparan sulfate glucosamine 316 3-O-sulfotransferase 3A1 (HS3ST3A1), all of which are also P-loop NTPases. In addition, we 317 identified histidine triad nucleotide-binding protein 1 (HINT1), that is also adjacent to the 318 leader SNP for DD in AS, and other genes associated with immune response such as collectin-319 12 (COL12), macrophage mannose receptor 1 (MRC1) and tapasin-related protein (TAPBPL). 320 A complete list of these genes and proteins can be found in Supplementary Table S1, sheet: 321 Genes overlapped windows (Group_D). Additionally, the gene that codes for NACHT, LRR 322 and PYD domains-containing protein 12 (NLRP12) was found in groups A, C and D. 323 We identified several candidate genes associated with P. salmonis resistance (n=120) 324 which were present in at least one of the groups described previously. These genes are 325 associated with the following biological processes: dependent on kinase activity, GTP hydrolysis, helicase activity, lipid metabolism, cytoskeletal dynamics and inflammation. In 327 order to rank the genes, we scored them based on the counting of each of them across following 328 categories: i) species (CS, RT and AS); ii) trait definitions (DD and BS); and iii) groups (A, B, 329 C and D), thus the maximum score for one particular gene was equal to 9. The prioritized 330 functional candidate genes based on the score described above are shown in Table 2  The comparative genomic strategy used in this study allowed us to identify groups of 336 homologous superfamilies and orthologous genes common to more than one species of 337 salmonids among genes adjacent to SNPs that explain more than 1% of the genetic variance 338 for P. salmonis resistance. To our knowledge, this is the first study which aims at identifying 339 and prioritizing functional candidate genes involved in the differential response against families of homologous proteins in the complete set of proteins adjacent to all the SNPs that 395 explained more than 1% of the genetic variance, without searching for a specific function. The 396 identification of genes directly associated with the innate immune response, after applying all 397 the classification criteria, such as lysozyme C II, macrophage mannose receptor 1, collectin-12 398 and tapasin-related protein, suggests that our strategy was successful in finding strong 399 functional candidate genes involved in resistance to P. salmonis. Interestingly, around one hundred genes not classically associated with the immune system were also identified; from 401 which seventeen of them were part of at least two of the groups described previously and hence 402 are considered strong candidates for being responsible on trait variation (Table 2). 403 Previously, lysozymes have primarily been described as having a bacteriolytic activity 404 against gram-positive bacteria; however, the expression of lysozyme C II has been shown to 405 be induced in a resistant rainbow trout line in response to Another set of candidate genes for SRS resistance in the three salmonid species studied 423 are a cluster of cytosolic non-receptor tyrosine-protein kinases (nRTKs). These proteins are a 424 subgroup of the tyrosine kinase family, enzymes that phosphorylate tyrosine residues of 425 proteins, and regulate many cellular functions, such as: cell growth and survival, apoptosis, cell 426 adhesion, cytoskeleton remodeling and differentiation (Neet and Hunter, 1996). Although these 427 genes are not classically related to the response to pathogens, it has been described that the 428 interaction of T-and B-cell antigen receptors with some non-receptor tyrosine protein kinases 429 is critical to the activation of lymphocytes by an antigen (Sefton and Taddie, 1994) (Ching et al., 2007). Since it has been described that the infective process of P. salmonis 460 depends on the exploitation of the actin monomers (Ramírez et al., 2015), the identification in 461 this study of candidate genes that encode for cytoskeletal motor proteins (two kinesins and a 462 myosin), highlights their relevance not only for the reorganization of the cytoskeleton, but also 463 for its motility and involvement in the development of the infection (Hoyt et al., 1997). 464 Remarkably, two other candidate proteins associated with SRS resistance are also members of 465 the GTPase superfamily, the GTPases of the immunity-associated proteins (GIMAPs) 4 and 8. 466 This is a family of proteins abundantly expressed in lymphocytes and whose function is to 467 contribute in the regulation of apoptosis and the maintenance of T-cell numbers in the organism 468 (Yano et al., 2014). 469 Another group of orthologous genes code for ATP-dependent RNA helicases DDX24 470 in CS and DDX47 in RT for DD. The ATP-dependent RNA helicase DDX family, also known 471 as DEAD-box helicases, is required for different cellular processes such as transcription, pre-472 mRNA processing, ribosome biogenesis, nuclear mRNA export, translation initiation, RNA 473 turnover and organelle function. The protein structure is very similar to viral RNA helicases 474 and to DNA helicases, which suggests that the fundamental activities of these enzymes are similar (Rocak and Linder, 2004). Viruses also utilize RNA helicases at various stages of their 476 life cycle. Many viruses carry their own helicases to assist with the synthesis of their genome, 477 but others synthesize their genome within the cell nucleus which tends to exploit cellular 478 helicases and thus do not encode their own. We also identified the ATP-dependent RNA 479 helicase TDRD9, which has not been directly implicated in infection, but was differentially 480 Genes coding NACHT, LRR and PYD domains-containing protein 12 (NLRP12); 494 NACHT, LRR and CARD domains-containing protein 3 (NLRC3); voltage-dependent L-type 495 calcium channel subunit beta-2 (CACNB2); heparan sulfate glucosamine 3-O-sulfotransferase 496 3A1 (HS3ST3A1) and histidine triad nucleotide-binding protein 1 (HINT1) were also selected 497 as candidate genes for SRS resistance. NLRP12 and NLRC3 are two cytosolic proteins that 498 share two functional domains (NACHT and LRR). NLRP12 was one of the best ranked genes, 499 adjacent to the leader SNP and adjacent to more than one SNP simultaneously for DD in AS.
This protein functions as an attenuating factor of inflammation in monocytes, by negative 501 regulation of the NF-κB activation (Fata et al., 2013). In murine macrophages, a significant 502 expression increase has been shown in cells infected with the intracellular parasite Leishmania 503 major compared to non-infected macrophages (Fata et al., 2013). NLRC3 is also a negative 504 regulator of the innate immune response mediated by the inhibition of Toll-like receptor (TLR)-505 dependent activation of the transcription factor NF-κB (Schneider et al., 2012). The presence 506 of these genes suggests that the control of the inflammatory reaction in response to P. salmonis 507 infection could be essential to combat SRS. 508 To the best of our knowledge, this is the first time that functional candidate genes 516 underpinning resistance to P. salmonis are proposed based on a comparative genomics 517 approach comparing GWAS results for the same trait in different fish genus/species. We 518 hypothesize that variations in the sequences of these genes could play important roles in the 519 host response to P. salmonis infection, which could be tested through new genetic approaches 520 such as gene editing using CRISPR-Cas9, and utilized through genomic selection or more 521 traditional selection practices. All this information together can be used to generate better 522 control and treatment measures for one of the most important bacterial disease affecting salmon 523 aquaculture. 524

Conclusions 526
Although P. salmonis resistance has previously been described as polygenic trait, our 527 comparative genomics approach based on GWAS results for the same trait in different 528 salmonid species allowed us to identify around one hundred candidate genes that may explain 529 resistance to P. salmonis. Of these, 21 are suggested to be strong functional candidates 530 influencing the trait. These genes are associated with multiple biological processes, including: 531 dependent on kinase activity, GTP hydrolysis, helicase activity, lipid metabolism, cytoskeletal 532 dynamics, inflammation and the innate immune response. We hypothesize that variations in 533 the sequences of these genes could play an important role in the expression and/or activity of 534