GENETIC DISSECTION OF IMPORTANT TRAITS IN AQUACULTURE: GENOME-SCALE TOOLS DEVELOPMENT, TRAIT LOCALIZATION AND REGULATORY MECHANISM EXPLORATION

EDITED BY: Peng Xu, Lior David, Paulino Martínez and Gen Hua Yue

PUBLISHED IN: Frontiers in Genetics

### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-914-4 DOI 10.3389/978-2-88963-914-4

# GENETIC DISSECTION OF IMPORTANT TRAITS IN AQUACULTURE: GENOME-SCALE TOOLS DEVELOPMENT, TRAIT LOCALIZATION AND REGULATORY MECHANISM EXPLORATION

Topic Editors: Peng Xu, Xiamen University, China; Lior David, Hebrew University of Jerusalem, Israel; Paulino Martínez, University of Santiago de Compostela, Spain; Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

Citation: Xu, P., David, L., Martínez, P., Yue, G. H., eds. (2020). Genetic Dissection of Important Traits in Aquaculture: Genome-scale Tools Development, Trait Localization and Regulatory Mechanism Exploration. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-914-4

# Table of Contents

*Editorial: Genetic Dissection of Important Traits in Aquaculture: Genome-Scale Tools Development, Trait Localization and Regulatory Mechanism Exploration*

Peng Xu, Lior David, Paulino Martínez and Gen Hua Yue


Paolo Ronza, Diego Robledo, Roberto Bermúdez, Ana Paula Losada, Belén G. Pardo, Paulino Martínez and María Isabel Quiroga


Sheng Liu, Li Li, Jie Meng, Kai Song, Baoyu Huang, Wei Wang and Guofan Zhang

*Transcriptome Analysis Reveals Common and Differential Response to Low Temperature Exposure Between Tolerant and Sensitive Blue Tilapia (*Oreochromis aureus*)*

Tali Nitzan, Fotini Kokou, Adi Doron-Faigenboim, Tatiana Slosman, Jakob Biran, Itzhak Mizrahi, Tatyana Zak, Ayana Benet and Avner Cnaani

*Genomic, Transcriptomic, and Epigenomic Features Differentiate Genes That Are Relevant for Muscular Polyunsaturated Fatty Acids in the Common Carp*

Hanyuan Zhang, Peng Xu, Yanliang Jiang, Zixia Zhao, Jianxin Feng, Ruyu Tai, Chuanju Dong and Jian Xu

*Transcriptome Profile Analysis on Ovarian Tissues of Autotetraploid Fish and Diploid Red Crucian Carp*

Yude Wang, Minghe Zhang, Qinbo Qin, Yajun Peng, Xu Huang, Chongqing Wang, Liu Cao, Wuhui Li, Min Tao, Chun Zhang and Shaojun Liu

*Transcriptome Analysis Identified Genes for Growth and Omega-3/-6 Ratio in Saline Tilapia*

Grace Lin, Natascha M. Thevasagayam, Z. Y. Wan, B. Q. Ye and Gen Hua Yue


Mathieu Besson, François Allal, Béatrice Chatain, Alain Vergnet, Frédéric Clota and Marc Vandeputte

*Chromosome-Level Assembly of the Chinese Seabass (*Lateolabrax maculatus*) Genome*

Baohua Chen, Yun Li, Wenzhu Peng, Zhixiong Zhou, Yue Shi, Fei Pu, Xuan Luo, Lin Chen and Peng Xu

*The Impact of Chronic Heat Stress on the Growth, Survival, Feeding, and Differential Gene Expression in the Sea Urchin* Strongylocentrotus intermedius

Yaoyao Zhan, Jiaxiang Li, Jingxian Sun, Weijie Zhang, Yingying Li, Donyao Cui, Wanbin Hu and Yaqing Chang

*High-Density Genetic Linkage Maps Provide Novel Insights Into ZW/ZZ Sex Determination System and Growth Performance in Mud Crab (Scylla paramamosain)*

Khor Waiho, Xi Shi, Hanafiah Fazhan, Shengkang Li, Yueling Zhang, Huaiping Zheng, Wenhua Liu, Shaobin Fang, Mhd Ikhwanuddin and Hongyu Ma

*A High-Density Genetic Linkage Map and QTL Mapping for Sex in Black Tiger Shrimp (*Penaeus monodon*)*

Liang Guo, Yu-Hui Xu, Nan Zhang, Fa-Lin Zhou, Jian-Hua Huang, Bao-Suo Liu, Shi-Gui Jiang and Dian-Chang Zhang


Tao Liu, Xumin Wang, Guoliang Wang, Shangang Jia, Guiming Liu, Guangle Shan, Shan Chi, Jing Zhang, Yahui Yu, Ting Xue and Jun Yu

*Genome-Wide Association Study Identifies Genomic Loci Affecting Filet Firmness and Protein Content in Rainbow Trout*

Ali Ali, Rafet Al-Tobasei, Daniela Lourenco, Tim Leeds, Brett Kenney and Mohamed Salem


Christos Palaiokostas, Tomas Vesely, Martin Kocour, Martin Prchal, Dagmar Pokorova, Veronika Piackova, Lubomir Pojezdal and Ross D. Houston


Elena Sarropoulou, Elizabet Kaitetzidou, Nikos Papandroulakis, Aleka Tsalafouta and Michalis Pavlidis

*Identification of Single Nucleotide Polymorphisms Related to the Resistance Against Acute Hepatopancreatic Necrosis Disease in the Pacific White Shrimp* Litopenaeus vannamei *by Target Sequencing Approach*

Qian Zhang, Yang Yu, Quanchao Wang, Fei Liu, Zheng Luo, Chengsong Zhang, Xiaojun Zhang, Hao Huang, Jianhai Xiang and Fuhua Li


Yunji Xiu, Guangpeng Jiang, Shun Zhou, Jing Diao, Hongjun Liu, Baofeng Su and Chao Li


Peilin Cheng, Yu Huang, Hao Du, Chuangju Li, Yunyun Lv, Rui Ruan, Huan Ye, Chao Bian, Xinxin You, Junmin Xu, Xufang Liang, Qiong Shi and Qiwei Wei

*Differences in DNA Methylation Between Disease-Resistant and Disease-Susceptible Chinese Tongue Sole (*Cynoglossus semilaevis*) Families*

Yunji Xiu, Changwei Shao, Ying Zhu, Yangzhen Li, Tian Gan, Wenteng Xu, Francesc Piferrer and Songlin Chen


# Editorial: Genetic Dissection of Important Traits in Aquaculture: Genome-Scale Tools Development, Trait Localization and Regulatory Mechanism Exploration

### Peng Xu<sup>1,2</sup>\*, Lior David<sup>3</sup>, Paulino Martínez<sup>4</sup> and Gen Hua Yue<sup>5</sup>

<sup>1</sup> Fujian Key Laboratory of Genetics and Breeding of Marine Organisms, College of Ocean and Earth Sciences, Xiamen University, Xiamen, China, <sup>2</sup> State Key Laboratory of Large Yellow Croaker Breeding, Ningde Fufa Fisheries Company Limited, Ningde, China, <sup>3</sup> Department of Animal Sciences, RH Smith Faculty of Agriculture, Food and Environment, The Hebrew University of Jerusalem, Rehovot, Israel, <sup>4</sup> Departamento de Xenética, Universidade de Santiago de Compostela, Lugo, Spain, <sup>5</sup> Temasek Life Sciences Laboratory, National University of Singapore, Singapore, Singapore

### Keywords: aquaculture, genomics, genetic breeding, GWAS, QTL, traits

**Editorial on the Research Topic**

### **Genetic Dissection of Important Traits in Aquaculture: Genome-Scale Tools Development, Trait Localization and Regulatory Mechanism Exploration**

### Edited and reviewed by:

Johann Sölkner, University of Natural Resources and Life Sciences Vienna, Austria

> \*Correspondence: Peng Xu xupeng77@xmu.edu.cn

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 11 May 2020 Accepted: 27 May 2020 Published: 25 June 2020

### Citation:

Xu P, David L, Martínez P and Yue GH (2020) Editorial: Genetic Dissection of Important Traits in Aquaculture: Genome-Scale Tools Development, Trait Localization and Regulatory Mechanism Exploration. Front. Genet. 11:642. doi: 10.3389/fgene.2020.00642

After vigorous growth over recent decades, the aquaculture industry reached a key milestone in 2014, when its contribution to the supply of fish for human consumption overtook that of wild-caught fish for the first time. Fast-developing aquaculture has the potential to feed 9.7 billion people by 2050 in a context of climate change, economic and financial uncertainty, and growing competition for natural resources (FAO, 2016). However, many challenges remain for the fast and sustainable development of aquaculture. These include generalizing the use of genetically improved stocks; coping with climate change, environmental stressors, and emerging pathogens and diseases; and improving feed conversion rate, growth rate, and resilience. Genetic improvement and germplasm enhancement have proven to be efficient and cost-effective approaches for taking aquaculture production to the next level.

Important progress has been made over the past decades in using genetic markers to support breeding schemes, but the flourishing applications of next-generation genome sequencing technologies now enable a further and bigger leap ahead. The genomes of more and more aquaculture species have been or are being sequenced, which facilitates the rapid development of genome-scale technologies and tools. Genomic tools and resources, combined with new and more sophisticated bioinformatic tools, are now available for many major aquaculture species, including reference genome sequences and their annotations, genome-wide polymorphic markers and genotyping platforms, high-density and high-resolution linkage maps, transcriptomic resources, and, more recently, breakthrough techniques for understanding the regulatory mechanisms underlying gene expression. Genome-scale fine mapping and genetic regulation of important performance traits, such as disease resistance, growth rate, sex determination, and tolerance to various environmental stressors, have been studied to better understand the underlying regulatory mechanisms. Selective breeding programs supported by genome information have been initiated and are ready to be applied in many key aquaculture species.

It is our great pleasure to present this Research Topic to its readers. The Research Topic, which focuses on the genetic dissection of important traits in aquaculture, collects 36 articles, including two review articles and 34 original research articles, from a total of 305 authors. The contributions cover diverse aquaculture species of finfish, shellfish, shrimps, crabs, and algae, and represent the latest progress in the genetic dissection of economically important traits with genome-scale tools and technologies. Herein we group these articles into topics to be highlighted.

Developing genome resources and genome-scale genetic tools constitutes a fundamental step for genome-wide genetic analysis. In the past decade, reference genomes of aquaculture species have been assembled quickly, thanks to advances in genome sequencing technologies and assembly algorithms as well as increased computational power. Four new reference genomes are reported in this issue, including a chromosome-level reference genome of Chinese seabass (Lateolabrax maculatus) (Chen et al.) and three draft genomes of sterlet sturgeon (Acipenser ruthenus) (Cheng et al.), Kanglang white minnow (Anabarilius grahami) (Jiang et al.), and a brown alga (Saccharina japonica) (Liu T. et al.). Linkage maps are the traditional genetic tool for trait dissection, and the construction of high-density chromosome-scale maps is addressed in several contributions. Three high-density linkage maps, for mud crab (Scylla paramamosain) (Waiho et al.), black tiger shrimp (Penaeus monodon) (Guo et al.), and channel catfish (Ictalurus punctatus) (Zhang S. et al.), were constructed using restriction site-associated DNA sequencing (RAD-seq) or derivative methodologies. Sex- and growth-related traits are mapped on these three new linkage maps. The growing genome resources and tools will facilitate genetic studies of important traits and pave the way for genomic selection in these aquaculture species.

Comparative transcriptomics is an effective approach for comparing gene expression profiles between samples that are challenged with different conditions or exhibit distinct phenotypes, providing insights into the molecular functions and biological processes underlying target traits. Comparative transcriptomic studies in tilapia, Chinese seabass, crucian carp (Carassius carassius), X-ray tetra (Pristella maxillaris), tongue sole (Cynoglossus semilaevis), sea urchin (Strongylocentrotus intermedius), and Pacific oyster (Crassostrea gigas), either treated at different temperatures and salinities or presenting significant differences in growth performance or ploidy, are reported in this issue (Bian et al.; Hu et al.; Lin et al.; Liu J. et al.; Nitzan et al.; Tian et al.; Wang Q. et al.; Zhan et al.; Zhang F. et al.). Additionally, comparative genomics and phylogenetic analysis of immune-related genes aid the understanding of the structure and function of the immune system of Senegalese sole (Solea senegalensis) (García-Angulo et al.). Non-coding RNAs play important roles in transcriptional and translational regulation. sncRNA, circRNA and miRNA were investigated in European sea bass (Dicentrarchus labrax) (Sarropoulou et al.) and turbot (Scophthalmus maximus) (Xiu, Jiang et al.), providing insights into the regulatory mechanisms in these two species. DNA methylation, the most widely studied and best-understood epigenetic modification, has been reported to play crucial roles in gene regulation processes, such as those related to sexual development. Piferrer et al. discuss the model of Conserved Epigenetic Regulation of Sex (CERS) and the use of CERS to make testable predictions on how sex is epigenetically controlled and to better understand sexual development, primarily in fish. DNA methylation profiles of Chinese tongue sole that are resistant or susceptible to Vibrio harveyi infection are also compared, and the results highlight that artificial selection for disease resistance may change methylation levels in important immune-related genes (Xiu, Shao et al.).

Many economically important traits in aquaculture are quantitative traits. Quantitative trait loci (QTL) mapping and genome-wide association studies (GWAS) are primary approaches to dissect the architecture of such traits, and to identify genomic regions, the underlying genes and eventually causative mutations that contribute to trait variation. Recent advances in high-throughput genotyping technologies facilitate more accurate trait dissection in many aquaculture species using GWAS and QTL mapping. Here we collect a number of articles under this topic, representing over one third of all contributions. Sex determination loci and candidate genes are identified in channel catfish (Zhang S. et al.) and black tiger shrimp (Guo et al.) via QTL mapping with high-density genetic linkage maps. Disease resistance is critical for aquaculture species, and resistance traits against sea lice, myxozoan parasite Enteromyxum scophthalmi and acute hepatopancreatic necrosis disease (AHPND) in Atlantic salmon (Salmo salar) (Robledo et al.), turbot (Ronza et al.), and Pacific white shrimp (Litopenaeus vannamei) (Wang, Zhang et al.) are respectively reported. GWAS on very diverse traits including growth, body shape, feed conversion ratio (FCR), filet quality, polyunsaturated fatty acids (PUFAs) content and glycogen content are also reported, indicating that this is a key and active research field in aquaculture (Ali et al.; Besson et al.; Kyriakis et al.; Liu S. et al.; Waiho et al.; Wang Q. et al.; Zhang H. et al.; Zhang Q. et al.; Vallejos-Vidal et al.).

Research communities in some aquaculture species have developed high density SNP genotyping arrays, which expedite population genomic studies and germplasm evaluation of those species. Barria et al. report their population genomic structure and genome-wide linkage disequilibrium analysis in three Chilean commercial populations of Atlantic salmon with different origins using a 159K SNP genotyping array. Xu et al. conduct a population genomic analysis to determine the genetic architecture of 2,198 individuals in 14 common carp populations worldwide using a 250K SNP genotyping array. Yoshida et al. genotype three farmed Nile tilapia (Oreochromis niloticus) populations in Latin America using a 50K SNP panel, and population genetic analysis revealed short-range LD decay for three populations.

Abundant genome resources, well-established genotyping technologies and well-characterized germplasms have boosted applications of genomic selection in aquaculture species, as demonstrated in this issue by Palaiokostas et al. on breeding for KHV resistance in common carp. Flourishing applications of genomic selection are expected in more and more farmed species in the foreseeable future. In addition to genomic selection, Wang, Yang et al. highlight their findings on generating goldfish-like fish via interspecific hybridization of female koi carp × male blunt snout bream and indicate the potential to form new species.

Overall, advances in genomic technologies and their rapid application are accelerating the genetic dissection of important traits in many aquaculture species. A better understanding of the genetic basis and gene regulation of economically important traits will further expedite genetic improvement using diverse approaches, and ultimately ensure the fast and sustainable growth of aquaculture industries globally.

We hope the aquaculture community will find this Research Topic to be an informative and useful collection of articles. As editors of this topic, we would like to thank the authors for contributing novel knowledge to this topic. We are grateful to all referees for their careful evaluation of the papers sent to them. Appreciation is also expressed to the numerous colleagues who responded to the call for papers, but whose interests could not be accommodated within the confines of this Research Topic. Finally, we are glad to acknowledge Frontiers in Genetics for supporting this Research Topic.

# AUTHOR CONTRIBUTIONS

PX prepared the draft editorial. LD, PM, and GY revised the manuscript. All authors contributed to the article and approved the submitted version.

# REFERENCES

FAO (2016). The State of World Fisheries and Aquaculture 2016. FAO.

**Conflict of Interest:** PX was employed by the company Ningde Fufa Fisheries Company Limited.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Xu, David, Martínez and Yue. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Formation of the Goldfish-Like Fish Derived From Hybridization of Female Koi Carp × Male Blunt Snout Bream

Yude Wang†, Conghui Yang†, Kaikun Luo†, Minghe Zhang, Qinbo Qin, Yangyang Huo, Jia Song, Min Tao, Chun Zhang and Shaojun Liu\*

State Key Laboratory of Developmental Biology of Freshwater Fish, College of Life Sciences, Hunan Normal University, Changsha, China

Goldfish (Carassius auratus var., GF; 2n = 100) is the most popular ornamental fish in the world. It is assumed that GF evolved from red crucian carp (C. auratus red var., RCC; 2n = 100). However, this hypothesis lacks direct evidence. Furthermore, our knowledge of the role of hybridization in the formation of new species is sparse. In this study, goldfish-like fish with twin tails (GF-L; 2n = 100) were produced by self-mating red crucian carp-like fish (RCC-L; 2n = 100) derived from the distant crossing of koi carp (Cyprinus carpio haematopterus, KOC; 2n = 100; ♀) with blunt snout bream (Megalobrama amblycephala, BSB; 2n = 48; ♂). The phenotypes and genotypes of GF-L and RCC-L were very similar to those of GF and RCC, respectively. Microsatellite DNA and 5S rDNA analyses revealed that GF-L and RCC-L were closely related to GF and RCC, respectively. The presence of a twin tail in GF-L was related to a base mutation in chordinA from G in RCC-L to T in GF-L, indicating that the lineage of RCC-L and GF-L can be used to study gene variation and function. The sequences of 5S rDNA in GF-L and RCC-L were mapped to the genomes of CC and BSB, which revealed that the average similarities of both GF-L and RCC-L to CC were clearly higher than those to BSB, supporting the conclusion that the genomes of both RCC-L and GF-L were mainly inherited from KOC. GF-L and RCC-L were homodiploids that were mainly derived from the genome of KOC with some DNA fragments from BSB. The reproductive traits of GF-L and RCC-L were quite different from those of their parents, but were the same as those of GF and RCC. RCC-L easily diversified into GF-L, suggesting that RCC and GF evolved within the same period in their evolutionary pathway. This study provides direct evidence of the KOC–RCC–GF evolutionary pathway triggered by distant hybridization, which is of considerable significance for evolutionary biology and genetic breeding.

Keywords: distant hybridization, crucian carp, goldfish, microsatellite DNA, 5S rDNA

# INTRODUCTION

Goldfish (Carassius auratus var., GF; 2n = 100) and red crucian carp (C. auratus red var., RCC; 2n = 100) are the most prevalent ornamental fish in the world, and they belong to Cyprinidae (family), Cyprininae (subfamily), and Carassius (genus) (Luo et al., 1999; Wang et al., 2014). GF and RCC are considered varieties of crucian carp (Carassius carassius).

### Edited by:

Peng Xu, Xiamen University, China

# Reviewed by:

Chuanju Dong, Henan Normal University, China; Pinghui Feng, University of Southern California, United States

### \*Correspondence:

Shaojun Liu lsj@hunnu.edu.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 08 July 2018 Accepted: 14 September 2018 Published: 10 October 2018

### Citation:

Wang Y, Yang C, Luo K, Zhang M, Qin Q, Huo Y, Song J, Tao M, Zhang C and Liu S (2018) The Formation of the Goldfish-Like Fish Derived From Hybridization of Female Koi Carp × Male Blunt Snout Bream. Front. Genet. 9:437. doi: 10.3389/fgene.2018.00437

An obvious difference between GF and RCC is that GF has a distinct split double tail (twin tail), whereas RCC does not. Although some studies have suggested that GF evolved from crucian carp (Podlesnykh et al., 2015), direct evidence of its evolutionary pathway is lacking. Hybridization promotes species formation and the adaptive radiation of animals and plants (Mallet, 2007). In plants, some homodiploid hybrid species have been reported, e.g., in Helianthus (Rieseberg et al., 1995, 2003; Ungerer et al., 1998), Vigna (Takahashi et al., 2015), Iris (Arnold et al., 2012), and Pinus (Mao and Wang, 2011). There have been few reports on the formation of homoploid hybrids in animals; one example is the formation of a homodiploid crucian carp (Wang et al., 2017). Furthermore, our knowledge of the role of hybridization in the formation of new animal species is sparse.

In the catalog, the subfamily Cyprininae contains only two kinds of species: Cyprinus carpio and C. auratus, which belong to the genera Cyprinus and Carassius, respectively. What is the relationship between these two kinds of species? Both GF and RCC are varieties of C. auratus, and most individuals of these varieties are characterized by red or colorful bodies. Koi carp (Cyprinus carpio haematopterus, KOC; 2n = 100) is a variety of Cyprinus carpio, and most of its individuals are also characterized by red or colorful bodies. Based on their close status in the catalog and the similar body colors among RCC, GF, and KOC, it is possible that GF originated from RCC or KOC by distant hybridization. Blunt snout bream (Megalobrama amblycephala, BSB; 2n = 48) is a suitable species to cross with KOC. BSB belongs to Cyprinidae (family), Cultrinae (subfamily), and Megalobrama (genus). Compared with KOC, BSB possesses a different chromosome number (2n = 48), a different body color (gray) and the same age of sexual maturity (2 years). In this study, we cross female KOC with male BSB and obtain red crucian carp-like fish (RCC-L) and goldfish-like fish (GF-L), which are homodiploids mainly derived from the genome of KOC with some DNA fragments from BSB, showing the potential of interspecific hybridization to produce new homoploid species in fish.

# MATERIALS AND METHODS

# Ethics Statement

The procedures were conducted in accordance with the approved guidelines. Experimental fish individuals were housed in open pools (0.067 ha) with suitable pH (7.0–8.5), water temperature (22–24°C), dissolved oxygen content (5.0–8.0 mg/L) and adequate forage at the State Key Laboratory of Developmental Biology of Freshwater Fish, Hunan Normal University, China. The fish used as the samples were anesthetized with 100 mg/L MS-222 (Sigma-Aldrich, St. Louis, MO, United States) before dissection.

# Animals and Crossing Procedure

All samples were cultured at the State Key Laboratory of Developmental Biology of Freshwater Fish, Hunan Normal University, China. The female and male of KOC and BSB reached sexual maturity at 2 years, while the female and male of RCC and GF reached sexual maturity at 1 year. During the reproductive season (April–July) in 2015–2017, 20 mature females and 20 mature males of KOC and BSB were selected as the maternal and paternal parents, respectively. The crosses were performed in two groups: in the first group, KOC and BSB were used as the maternal and paternal parents, respectively; and in the second group, the maternal and paternal parents were reversed. The mature eggs were fertilized with semen, and the embryos were developed in culture dishes at a water temperature of 18–23°C. In the first group, the KOC (♀) × BSB (♂) cross resulted in two types of offspring: red crucian carp-like fish (RCC-L) and gynogenetic koi carp (GKOC). In the second group, the cross of BSB (♀) × KOC (♂) did not produce any living progeny.

In April, 2016, the male and female RCC-L that reached sexual maturity at 1 year were mated to produce the second generation. In the second generation, there were two types of offspring: red crucian carp-like (RCC-L-F2) and goldfish-like fish (GF-L) with split double tails.

In December, 2017, the eggs and the white semen were stripped from the female and male of GF-L, respectively, and they were fertilized to form GF-L-F2.

The entire crossing procedure is shown in **Figure 1**. For each cross, 5,000 embryos were selected at random to determine the fertilization (number of embryos at the gastrula stage/number of eggs × 100%), hatching (number of hatched fry/number of eggs × 100%), and survival (number of adults/number of eggs × 100%) rates. Simultaneously, self-mating of KOC and of BSB was performed as a control. The hatched fry were transferred to a pond for further culture.
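
The three rates defined above share one form, a stage count divided by the number of eggs and multiplied by 100. The following is a minimal sketch of that arithmetic; the function names and the counts in the example are hypothetical and are not data from this study.

```python
def rate(stage_count: int, eggs: int) -> float:
    """Percentage of eggs that reached a given stage."""
    if eggs <= 0:
        raise ValueError("number of eggs must be positive")
    return stage_count / eggs * 100.0


def cross_rates(eggs: int, gastrula: int, hatched: int, adults: int) -> dict:
    """Fertilization, hatching and survival rates as defined in the text."""
    return {
        "fertilization": rate(gastrula, eggs),  # embryos at gastrula stage / eggs
        "hatching": rate(hatched, eggs),        # hatched fry / eggs
        "survival": rate(adults, eggs),         # adults / eggs
    }


# Hypothetical counts for one cross (illustration only, not measured values).
example = cross_rates(eggs=5000, gastrula=4000, hatched=3000, adults=2500)
```

All three rates use the number of eggs as the denominator, so they are directly comparable across crosses.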

# Measurement of Morphological Traits

We randomly selected 60 1-year-old fish from each group (KOC, BSB, RCC-L, GF-L, RCC, and GF) for morphological examination. We measured whole length (WL), body length (BL), body height (BH), head length (HL), head height (HH), caudal peduncle length (CPL), and caudal peduncle height (CPH) of each fish (accurate to 0.1 cm). These values were then used to calculate the following ratios: BL/WL, BH/BL, HL/BL, HH/HL, CPH/CPL, and HH/BH. In addition, we recorded the number of lateral line scales, the number of scale rows above and below the lateral line, and the number of dorsal, anal, and pelvic fin rays. We used analysis of variance (ANOVA) (Osterlind et al., 2001) and multiple comparison tests (LSD method) (Williams and Abdi, 2010) to test for differences in each trait among the six types of fishes using SPSS Statistics 19.0 (IBM Corp., NY, United States). The values of the independent variables are expressed as the mean ± SD (Nigam and Turner, 1995).
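
The ratio calculation and the group comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the study used SPSS, whereas the sketch below computes the six ratios, the mean ± SD (assumed here to be the sample SD), and a one-way ANOVA F statistic by hand. It does not compute p-values or the LSD multiple comparisons, and all function names are hypothetical.

```python
from statistics import mean, stdev

# The six ratios named in the text, as (numerator, denominator) trait codes.
RATIOS = [("BL", "WL"), ("BH", "BL"), ("HL", "BL"),
          ("HH", "HL"), ("CPH", "CPL"), ("HH", "BH")]


def morph_ratios(fish: dict) -> dict:
    """Compute the six morphological ratios for one fish (lengths in cm)."""
    return {f"{num}/{den}": fish[num] / fish[den] for num, den in RATIOS}


def mean_sd(values: list) -> tuple:
    """Mean and sample standard deviation (n - 1 denominator, an assumption)."""
    return mean(values), stdev(values)


def anova_f(groups: list) -> float:
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice one list of ratio values per fish type (six lists here) would be passed to `anova_f` for each trait.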

# Preparation of Chromosome Spreads

To determine ploidy, chromosome preparation was carried out on the kidney tissues of 10 KOC, 10 BSB, 10 RCC-L, 10 GF-L, 10 RCC, and 10 GF at 1 year of age according to the procedures reported by Liu et al. (2001). We photographed 200 metaphase spreads from each sample to determine the chromosome number. Good-quality metaphase spreads were photographed and used for analysis of karyotypes. The chromosomal metaphase spreads were examined under an oil lens at a magnification of 3330×.

Chromosomes were classified on the basis of their long-arm to short-arm ratios according to the reported standards (Levan et al., 1964).
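
As an illustration of arm-ratio classification, the sketch below assigns each chromosome to a class from its long-arm to short-arm ratio and tallies a karyotype formula. The class boundaries (1.7, 3.0 and 7.0) are the ones commonly attributed to Levan et al. (1964); the text above does not state them, so they are an assumption here, as are the function names.

```python
from collections import Counter


def classify_chromosome(long_arm: float, short_arm: float) -> str:
    """Classify a chromosome by arm ratio (long arm / short arm).

    Assumed boundaries: m <= 1.7 < sm <= 3.0 < st <= 7.0 < t.
    """
    long_arm, short_arm = max(long_arm, short_arm), min(long_arm, short_arm)
    if short_arm < 0 or long_arm <= 0:
        raise ValueError("arm lengths must be non-negative and not both zero")
    if short_arm == 0:
        return "t"  # no measurable short arm: telocentric
    ratio = long_arm / short_arm
    if ratio <= 1.7:
        return "m"   # metacentric
    if ratio <= 3.0:
        return "sm"  # submetacentric
    if ratio <= 7.0:
        return "st"  # subtelocentric
    return "t"       # telocentric


def karyotype_formula(arm_pairs: list) -> dict:
    """Count chromosomes per class from (long_arm, short_arm) measurements."""
    return dict(Counter(classify_chromosome(l, s) for l, s in arm_pairs))
```

Applied to all chromosomes measured in one good-quality metaphase spread, `karyotype_formula` gives the per-class counts from which a karyotype formula is written.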

# Microsatellite DNA Cloning and Sequencing

Total genomic DNA was isolated from whole blood collected from the caudal vein of 15 KOC, 15 BSB, 15 RCC-L, 15 RCC, 15 GF-L, and 15 GF using a standard phenol-chloroform procedure (Sambrook et al., 1989). DNA concentration and quality were assessed using agarose gel electrophoresis.

Three primer pairs (MFW1-F: 5′-AGCGGAACTCACTAAAC-3′, MFW1-R: 5′-ACAGGCTTCCAGTAAAA-3′, MFW2-F: 5′-TTCATATCTCAGTGGCTT-3′, MFW2-R: 5′-ATCATTTATTCTTGTGGT-3′, MFW3-F: 5′-AGACAGCACTATCATTCC-3′, and MFW3-R: 5′-CCTAACATAAATAAACCCA-3′) were designed for the flanking regions of repeated (CA)n dinucleotide microsatellites based on the RCC genome (Liu et al., 2016). The microsatellite loci were amplified and sequencing was performed as described by Liu et al. (2010). The genetic similarity was calculated as described by Nei and Li (1979).
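The Nei and Li (1979) similarity between two samples is S = 2N<sub>xy</sub>/(N<sub>x</sub> + N<sub>y</sub>), where N<sub>xy</sub> is the number of shared bands and N<sub>x</sub> and N<sub>y</sub> are the band counts of each sample. A minimal sketch, assuming bands are scored as sets of fragment sizes; the band sizes shown are hypothetical.

```python
def nei_li_similarity(bands_x: set[int], bands_y: set[int]) -> float:
    """Nei-Li band-sharing coefficient: 2 * shared / (n_x + n_y)."""
    total = len(bands_x) + len(bands_y)
    if total == 0:
        raise ValueError("at least one sample must have a scored band")
    shared = len(bands_x & bands_y)
    return 2.0 * shared / total


# Hypothetical band sizes (bp) for two individuals, for illustration only.
sample_a = {130, 150, 180, 210}
sample_b = {150, 180, 210, 240}
print(f"similarity = {nei_li_similarity(sample_a, sample_b):.2%}")
```

In practice, bands of nearly equal size would need to be binned before comparison; that step is omitted here.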

# 5S rDNA, chordinA Cloning and Sequencing


One pair of primers (5SF: 5′-GCTATGCCCGATCTCGTCTGA-3′ and 5SR: 5′-CAGGTTGGTATGGCCGTAAGC-3′) (Sajdak et al., 1998) was designed and synthesized to amplify the 5S rDNA repeats directly from 10 KOC, 10 BSB, 10 RCC-L, 10 GF-L, 10 RCC, and 10 GF by PCR. One pair of primers (chordinA-F: 5′-TAACGCACAGATGCAGACGTGTG-3′ and chordinA-R: 5′-TGCTGTTCTCCTCAGAGCTGATGTAGG-3′) was designed and synthesized to amplify the chordin sequence directly from 10 RCC-L, 10 GF-L, 10 RCC, and 10 GF by PCR.

The PCR reactions and sequencing were performed as described by Qin et al. (2010) and Abe et al. (2014), respectively. Sequences were analyzed using BioEdit software (BioEdit version 7.0) (Hall, 1999).

# Mapping 5S rDNA to the Reference Genome

The genomes of CC, BSB, and RCC and their annotations were used as references for analyses of 5S rDNA obtained in this study. The above genomes were downloaded from the following websites:


We used BLASTN (E-value ≤ 10<sup>−5</sup>) to compare the sequences of 5S rDNA in RCC-L (203, 340, and 479 bp) and GF-L (168, 203, 340, and 495 bp) to the corresponding sequences of the genomes of CC, BSB, and RCC, respectively. Then we obtained the nucleotide similarities between the sequences of the above 5S rDNA and those from each of the genomes of CC, BSB, and RCC.

# Phylogenetic Analysis

Using Mega 5.1 (Tamura et al., 2011), the derived 5S rDNA coding gene sequences (120 bp) of these fragments were aligned from KOC, BSB, RCC-L, nature crucian carp (NCC), GF-L, RCC, and GF. Regions of sequences which were difficult to align were removed from the alignment. Gaps were also removed from the alignment. The maximum likelihood method implemented in the online software RAxML (Stamatakis, 2015) was used to construct a phylogenetic tree.
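The gap-removal step described above can be sketched as dropping every alignment column in which any sequence carries a gap. This is an illustration of the idea only, not the procedure used in Mega; the function name and the toy sequences are hypothetical.

```python
def drop_gap_columns(aligned: dict[str, str], gap: str = "-") -> dict[str, str]:
    """Remove every column that contains a gap in at least one sequence."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_columns = lengths.pop()
    # Keep only the column indices where no sequence has a gap character.
    keep = [
        i for i in range(n_columns)
        if all(seq[i] != gap for seq in aligned.values())
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}


# Toy alignment with hypothetical sequences.
toy = {"seq1": "ACG-T", "seq2": "AC-TT", "seq3": "ACGTT"}
print(drop_gap_columns(toy))
```

After this step all sequences have equal length and contain no gap characters, which is the form expected by most tree-building programs.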

# Observation of Gonadal Structure

To observe the gonadal structure, we selected 10 10-month-old individuals of both RCC-L and GF-L. The gonads were fixed in Bouin's solution for 24 h (Bancroft and Gamble, 2008; Ganjali and Ganjali, 2013), dehydrated using an ethanol gradient, and cleared in xylene. The gonadal sections were embedded in paraffin, cut at 7 µm, and stained with hematoxylin and eosin. The microstructure was observed and photographed using a Pixera Pro 600ES (Pixera Corporation, Santa Clara, CA, United States). We identified the gonadal development stages based on the standards for cyprinid fish (Liu, 1993).

# RESULTS

# The Formation of RCC-L and GF-L

The crossing procedure to produce RCC-L and GF-L was outlined in **Figure 1**. The first generation of KOC (♀) × BSB (♂) consisted of 99% RCC-L and 1% GKOC. The self-mating of RCC-L produced 98% RCC-L-F<sub>2</sub> and 2% GF-L with twin tails. The self-mating of GF-L produced the next generation, GF-L-F<sub>2</sub>, with twin tails.

# Fertilization, Hatching, and Survival Rates

The fertilized eggs of KOC (♀) × BSB (♂) showed high fertilization (90.5%) and hatching (80.3%) rates, but a low survival rate (35.6%). The self-mating of KOC resulted in a 95.6% fertilization rate, 85.3% hatching rate, and 80.7% survival rate, and the self-mating of BSB resulted in a 92.9% fertilization rate, 88.2% hatching rate, and 73.4% survival rate. In addition, the fertilization, hatching, and survival rates of RCC-L self-mating were 92.3, 85.8, and 76.3%, respectively.

# Morphological Traits and Feeding Habits

The morphological traits of KOC (**Figure 1A**), BSB (**Figure 1B**), RCC-L (**Figure 1C**), GKOC (**Figure 1D**), RCC-L-F<sub>2</sub> (**Figure 1E**), GF-L (**Figure 1F**), GF-L-F<sub>2</sub> (**Figure 1G**), and GF (**Figure 1H**) were shown in **Figure 1**. RCC-L and GF-L both exhibited broad phenotypic diversity. The individuals were generally distinguished from KOC by their body colors and shapes. One of the most recognizable features of GF-L was the bifurcated tail.

**Table 1** presented the trait values for KOC, BSB, RCC-L, GF-L, RCC, and GF. Regarding the measured traits, RCC-L and their progeny had HH/BH values between and significantly different from those of KOC and BSB. In addition, RCC-L and their progeny had HL/BL values significantly greater than those of KOC and BSB and BL/WL values significantly lower (P < 0.05) than those of KOC and BSB. The HH/HL value in RCC-L was lower (P < 0.05) than that in either KOC or BSB and was markedly higher (P < 0.05) than that in GF-L or KOC or BSB. RCC-L exhibited a BH/BL value similar to that of BSB but different from that of KOC. The BH/BL in GF-L was higher (P < 0.05) than that in KOC or BSB. The CPH/CPL value of RCC-L was between that of KOC and that of BSB and markedly different from both, whereas CPH/CPL in GF-L was lower than that in KOC or BSB. RCC-L and RCC had similar CPH/CPL values. The HH/HL value of GF-L was significantly higher (P < 0.05) than that of GF. In other measurable traits (BL/WL, BH/BL, CPH/CPL, and HH/BH), there was no significant difference (P > 0.05) between GF and GF-L.

TABLE 1 | The phenotypes, including the measurable traits (the average ratios of body length to whole length (BL/WL), body height to body length (BH/BL), head length to body length (HL/BL), head height to head length (HH/HL), caudal peduncle height to caudal peduncle length (CPH/CPL), and head height to body height (HH/BH)) and the countable traits (number of lateral scales, number of dorsal fins, number of abdominal fins, and number of anal fins), in RCC-L, their progeny, and their parents.

TABLE 2 | Chromosome numbers in KOC, BSB, RCC-L, GF-L, RCC, and GF.

Regarding the countable traits, all values (i.e., number of lateral scales, number of upper lateral scales, number of lower lateral scales, number of abdominal fins, and number of anal fins) except the number of dorsal fins in RCC-L and GF-L were significantly lower than those in KOC and BSB (P < 0.05). For the number of dorsal fins, RCC-L and GF-L had values intermediate between those of KOC and BSB. RCC and RCC-L presented no significant differences (P > 0.05). No countable trait differed significantly (P > 0.05) between GF-L and GF.

Regarding feeding habits, RCC-L, RCC, GF-L, and GF were herbivorous, similar to BSB.

# Chromosome Numbers and Karyotypes

**Table 2** presented the distribution of chromosome number in KOC, BSB, RCC-L, GF-L, RCC, and GF. Among KOC, 91.0% of the chromosomal metaphase spreads exhibited 100 chromosomes (**Table 2**), indicating that KOC was diploid with 100 chromosomes (**Figure 2A**) with a karyotype of 22m + 34sm + 22st + 22t (**Figure 3A**) (m, chromosomes with the centromere in the median region; sm, submedian region; st, subterminal region; t, terminal region). Among BSB, 88.0% of the spreads exhibited 48 chromosomes (**Table 2**), indicating that BSB was diploid with 48 chromosomes and a karyotype of 18m + 22sm + 8st (**Figure 3B**). A large pair of submetacentric chromosomes was observed in BSB, which was used as a chromosomal marker to identify this species (**Figure 2B**). Among KOC chromosomes, there was no large submetacentric chromosome. Among RCC-L, 87.5% of the chromosomal metaphase spreads had 100 chromosomes (**Figure 2C**) with a karyotype of 22m + 34sm + 22st + 22t (**Figure 3C**), indicating that RCC-L was diploid. Among GF-L, 90.0% of the chromosomal metaphase spreads had 100 chromosomes (**Figure 2D**) with a karyotype of 22m + 34sm + 22st + 22t (**Figure 3D**), indicating that GF-L was diploid. Among GF, 90.0% of the metaphases had 100 chromosomes (**Figure 2E**). Among RCC, 92.5% of the metaphases had 100 chromosomes (**Figure 2F**). Unlike BSB, RCC-L and GF-L exhibited no large submetacentric chromosome. The above results indicated that the typical number of chromosomes in RCC-L, RCC, GF-L, and GF was 100.

# Microsatellite DNA

Three pairs of microsatellite primers (MFW1, MFW2, and MFW3) were used to analyze the genomic traits in RCC-L, GF-L, KOC, BSB, RCC, and GF. With the MFW1 primers, only one band with 150 bp was amplified in RCC-L, whereas two bands with 150 and 130 bp were amplified in RCC (**Supplementary Figure S1**), suggesting that RCC-L and RCC can be identified by these primers.

With the MFW2 primers, KOC and BSB yielded different microsatellite DNA patterns (**Figure 4**). RCC-L exhibited some DNA fragments similar to those of KOC (**Figure 4**, black arrow), suggesting that RCC-L inherited those DNA fragments from KOC. Furthermore, RCC-L had some DNA fragments (**Figure 4**, red arrow) similar to those presented by BSB, showing that RCC-L also inherited some DNA fragments from BSB. Interestingly, a new DNA fragment (**Figure 4**, blue arrow) that was not observed in either KOC or BSB was observed in both RCC-L and GF-L, suggesting DNA variation in RCC-L that was passed from RCC-L to GF-L.

With the MFW3 primers, the genotypic similarity of RCC-L and RCC was 95.00%, whereas the genotypic similarity of GF and GF-L was 98.30%, showing that RCC-L and RCC, as well as GF and GF-L, had high similarity.

# 5S rDNA and chordinA

Several DNA fragments were amplified from KOC, BSB, RCC-L, GF-L, RCC, and GF using the 5S rDNA primer pair. These PCR fragments generated distinct agarose gel electrophoresis band patterns. There were two fragments (approximately 200 and 400 bp) in KOC (MH909573 and MH909574), two fragments (approximately 180 and 360 bp) in BSB (GQ485554 and KT824058.1), three fragments (approximately 200, 340, and 500 bp) in RCC-L, four fragments (approximately 160, 200, 340, and 500 bp) in GF-L, three fragments (approximately 200, 340, and 500 bp) in RCC (GQ485555, GQ485556, and GQ485557), and four fragments (approximately 160, 200, 340, and 500 bp) in GF (GU188688, GU188687, GU188689, and GU188690) (**Figure 5**). Based on the BLASTN analyses, all fragments from KOC, BSB, RCC-L, GF-L, GF, and RCC were confirmed as 5S rDNA repeat units (**Table 3**).

The sequences of 5S rDNA units cloned in this study contained a coding region (5′-99 bp and 3′-21 bp) and a mid-region consisting of distinct NTS sequences. In BSB, only monomeric 5S rDNA (designated class I: 188 bp) was characterized by one NTS type (designated NTS-I: 68 bp). In KOC, only monomeric 5S rDNA (designated class II: 203 bp) was characterized by one NTS type (designated NTS-II: 83 bp). In RCC-L, there were three monomeric 5S rDNA classes (designated class II: 203 bp; class III: 340 bp; and class IV: 495 bp) that were characterized by three NTS types (designated NTS-II: 83 bp, NTS-III: 220 bp, and NTS-IV: 375 bp). In GF-L and GF, there were four monomeric 5S rDNA classes (designated class V: 168 bp; class II: 203 bp; class III: 340 bp; and class IV: 495 bp) (**Supplementary Figure S2**) which were characterized by four NTS types (designated NTS-V: 48 bp; NTS-II: 83 bp; NTS-III: 220 bp; and NTS-IV: 375 bp) (**Supplementary Figure S3**). In RCC, there were also three monomeric 5S rDNA classes (class II, class III, and class IV), which had three NTS sequences (NTS-II, NTS-III, and NTS-IV), respectively.
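Each class length above equals the coding region (99 + 21 = 120 bp) plus the length of its NTS. The short check below confirms this arithmetic for all five classes; the variable names are illustrative, and the lengths are those stated in the preceding paragraph.

```python
# Coding region: 99 bp at the 5' end plus 21 bp at the 3' end of each unit.
CODING_BP = 99 + 21

# NTS lengths (bp) and total unit lengths (bp) as stated in the text.
nts_bp = {"I": 68, "II": 83, "III": 220, "IV": 375, "V": 48}
class_bp = {"I": 188, "II": 203, "III": 340, "IV": 495, "V": 168}

for label, nts_length in nts_bp.items():
    expected = CODING_BP + nts_length
    status = "ok" if expected == class_bp[label] else "MISMATCH"
    print(f"class {label}: {CODING_BP} + {nts_length} = {expected} bp ({status})")
```

The classes therefore differ only in NTS length, while the 120 bp coding region is shared.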

KOC, RCC-L, GF-L, RCC, and GF all had 203 bp DNA fragments in 5S rDNA. This fragment exhibited high similarities among the different kinds of fishes. The similarities between KOC and RCC-L, KOC and GF-L, KOC and RCC, and KOC and GF were 83.70, 84.20, 84.25, and 85.20%, respectively. The similarities between RCC-L and GF-L, RCC-L and RCC, and RCC-L and GF were 92.10, 93.50, and 95.50%, respectively. The similarities between GF-L and RCC, and GF-L and GF were 92.60 and 93.50%, respectively. The similarity between RCC and GF was 96.00%, the highest among all comparisons (**Supplementary Figure S4** and **Table 4**).

Comparative analyses of the NTS sequences indicated several base substitutions or insertions-deletions between RCC-L and RCC. The NTS-I sequences of RCC-L and RCC were highly similar (with 97.5% average similarity). The NTS-II sequence of RCC-L showed an average 90.4% similarity to that of RCC. The sequence comparison of NTS-III between RCC-L and RCC indicated 93.05% identity. The sequence comparisons of RCC-L and RCC among classes II, III, and IV revealed 99.5% identity for class II, 91.4% identity for class III, and 91.9% identity for class IV, revealing that the sequences of those DNA fragments in RCC-L were highly homologous to those of RCC (**Supplementary Figure S5**).

The 5S rDNA coding regions (CDS) of KOC, BSB, RCC-L, GF-L, GF, and RCC exhibited similarities of 97.5, 97.5, 97.5, 96.6, and 95.0%, respectively. The sequence comparison of 5S rDNA CDS between RCC-L and RCC resulted in 98.3% identity, suggesting that RCC-L and RCC were derived from similar parents. The sequence comparison of 5S rDNA CDS between GF-L and GF resulted in 97.5% identity, showing that GF-L and GF were also derived from similar parents. The sequence comparison of 5S rDNA CDS among GF-L, KOC, BSB, RCC-L, and RCC presented a 91.7% identity between GF-L and KOC, a 90.9% identity between GF-L and BSB, a 92.5% identity between GF-L and RCC-L, and a 92.5% identity between GF-L and RCC (**Table 5** and **Supplementary Figure S6**).

The sequences of chordinA in GF-L, GF, RCC-L, and RCC were compared (MH898971, MH898974, MH898972, and MH898970). The base at position 320 was T in GF-L and GF, whereas it was G in RCC-L and RCC (**Figure 6**). This mutation (G-T) showed that RCC-L and GF-L formed an excellent fish lineage for studying gene variation and function. The present results were in accordance with a previous study in which this base mutation (G-T) was found to possibly contribute to the occurrence of a twin tail in GF (Abe et al., 2014, 2016). In addition, compared with RCC, we found some base mutations (position 137: C-A; position 140: A-G; position 294: C-T) in the RCC-L sequence, indicating that there was variation in the RCC-L genome (**Figure 6**).
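Position-wise differences of this kind can be listed by walking two aligned sequences in parallel and reporting 1-based positions where the bases differ. The sketch below uses short hypothetical sequences, not the chordinA sequences of this study, and the function name is illustrative.

```python
def base_differences(reference: str, query: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, query base) for each mismatch."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, query_base)
        for position, (ref_base, query_base) in enumerate(zip(reference, query), start=1)
        if ref_base != query_base
    ]


# Hypothetical aligned fragments (illustrative only).
reference_fragment = "ACGTGACCTA"
query_fragment = "ACTTGACCTG"
for position, ref_base, query_base in base_differences(reference_fragment, query_fragment):
    print(f"position {position}: {ref_base}-{query_base}")
```

Applied to full-length aligned sequences, the same loop would report substitutions in the notation used above (for example, position followed by the two bases).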

# The Sequences of 5S rDNA in RCC-L and GF-L Aligned With the Genomes of Related Species

The sequences of 5S rDNA in RCC-L (203, 340, and 479 bp) and GF-L (168, 203, 340, and 495 bp) (MH898963, MH898964, MH898965, MH898966, MH898967, MH898968, and MH898969) were mapped to the corresponding sequences in the CC, BSB, and RCC genomes as references, respectively. The results were shown in **Table 6**.

As for RCC-L, CC, and BSB, the nucleotide similarities of the sequences of 5S rDNA (203, 340, and 479 bp) of RCC-L to CC (genome) were 98.03, 99.41, and 19.42%, respectively, whereas those similarities of RCC-L to BSB (genome) were 48.28, 27.94, and 19.42%, respectively, showing that the average similarity (72.29%) of RCC-L to CC was obviously higher than that (31.88%) of RCC-L to BSB. Because KOC is a variety of CC, we conclude that the similarity of RCC-L to KOC is higher than that of RCC-L to BSB.
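The averages quoted above are consistent with an unweighted arithmetic mean of the per-fragment similarities. The sketch below recomputes them from the values given in the text; treating the average as unweighted is an assumption, since the averaging method is not stated.

```python
from statistics import fmean

# Per-fragment similarities (%) of RCC-L 5S rDNA (203, 340, and 479 bp), from the text.
rcc_l_to_cc = [98.03, 99.41, 19.42]
rcc_l_to_bsb = [48.28, 27.94, 19.42]


def average_similarity(values: list[float]) -> float:
    """Unweighted mean of percentage similarities."""
    if not values:
        raise ValueError("at least one similarity value is required")
    return fmean(values)


print(f"RCC-L to CC genome:  {average_similarity(rcc_l_to_cc):.2f}%")
print(f"RCC-L to BSB genome: {average_similarity(rcc_l_to_bsb):.2f}%")
```

Because the fragments differ greatly in length, a length-weighted mean would give different values; the unweighted form is used here only because it matches the reported 72.29% and 31.88%.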

For GF-L, CC, and BSB, the nucleotide similarities of the sequences of 5S rDNA (168, 203, 340, and 495 bp) of GF-L to CC (genome) were 56.55, 78.82, 99.41, and 19.19%, respectively, whereas those similarities of GF-L to BSB (genome) were 57.74, 37.44, 29.12, and 20.00%, respectively, indicating that the average similarity (63.67%) of GF-L to CC was obviously higher than that (36.08%) of GF-L to BSB. Because KOC is a variety of CC, we conclude that the similarity of GF-L to KOC is higher than that of GF-L to BSB.

Regarding RCC-L and RCC, the nucleotide similarities of the sequences of 5S rDNA (203, 340, and 479 bp) of RCC-L to RCC (genome) were 100.00, 91.12, and 85.80%, respectively, whereas those similarities of RCC (5S rDNA) to RCC (genome) were 100.00, 100.00, and 100.00%, respectively, showing genomic DNA variation in RCC-L.

FIGURE 5 | DNA bands (5S rDNA) amplified by the primer pair 5SF-5SR in BSB, KOC, RCC-L, RCC, GF, and GF-L. Lane 1, two DNA fragments (approximately 200 and 360 bp) found in BSB. Lane 2, two DNA fragments (approximately 200 and 400 bp) found in KOC. Lane 3, three DNA fragments (approximately 200, 340, and 500 bp) found in RCC-L. Lane 4, three DNA fragments (approximately 200, 340, and 500 bp) found in RCC. Lane 5, four DNA fragments (approximately 160, 200, 340, and 500 bp) found in GF-L. Lane 6, four DNA fragments (approximately 160, 200, 340, and 500 bp) found in GF. M represents DNA ladder markers (100 bp increments).

TABLE 3 | Results of 5S rDNA DNA fragments by PCR and sequenced clone number.




TABLE 4 | The percentages of nucleotide identity of 5S rDNA (class II) sequences in KOC, RCC-L, GF-L, RCC, and GF.


TABLE 5 | The percentages of nucleotide identity of 5S rDNA (coding region) sequences in KOC, BSB, RCC-L, GF-L, RCC, and GF.


Regarding GF-L, GF, and RCC, the nucleotide similarities of the sequences of 5S rDNA (168, 203, 340, and 495 bp) of GF-L to RCC (genome) were 100.00, 100.00, 100.00, and 85.45%, whereas those similarities of GF to RCC (genome) were 100.00, 100.00, 94.12, and 86.00%, respectively, showing the average similarity (96.36%) of GF-L to RCC was almost equal to that (95.03%) of GF to RCC.

The map of relationships between the 5S rDNA sequences and the corresponding sequences in the genomes of CC, BSB, and RCC as references was shown in **Supplementary Figure S7**.

# Phylogenetic Relationships

Using the NJ method in Mega software, the phylogenetic tree of GF-L, GF, RCC-L, NCC, RCC, KOC, and BSB was constructed. The largest tree span appeared between GF-L and BSB, and the smallest tree span between GF-L and GF. GF-L and GF formed a sister group. The tree distance between GF and KOC was smaller than that between GF and BSB (**Figure 7**).

# Gonadal Microstructure of KOC, BSB, RCC-L, and GF-L

Two-year-old BSB and 2-year-old KOC were able to produce normal mature gametes (**Figures 8A,B**; Liu et al., 2013; Wen et al., 2013). Moreover, 1-year-old RCC-L and 1-year-old GF-L were able to produce normal mature gametes. We stripped white semen from 10-month-old male RCC-L and GF-L and mature ova from 10-month-old female RCC-L and GF-L. In the testes of 1-year-old RCC-L and GF-L, we observed numerous mature spermatozoa, spermatids, and spermatogonia in the seminiferous tubules (**Figures 8C,E**). Observation of the gonadal tissue sections revealed that the ovaries of 8-month-old RCC-L and GF-L females were at stages III and IV, indicating that RCC-L and GF-L were fertile (**Figures 8D,F**).

# DISCUSSION

# Origin of Goldfish

Extensive comparative studies of GF and crucian carp found that they not only exhibited similar phenotypes and produced fertile hybrids (Fu, 2016), but also shared the same embryonic developmental processes and chromosome number (2n = 100) (Changcheng, 1988; Tsai et al., 2013). GF and crucian carp were generally believed to be closely related, and were classified within the same species, but belonged to different varieties. Based on many biochemical and molecular phylogenetic analyses, including isozyme amplification, muscle protein electrophoresis, serotype identification, RAPD, and mitochondrial DNA analyses (Komiyama et al., 2009), it was concluded that GF evolved from crucian carp. However, direct evidence is lacking.

In this study, the distant hybridization of KOC (2n = 100, ♀) and BSB (2n = 48, ♂) produced RCC-L (2n = 100) in F<sub>1</sub>; subsequent self-mating of RCC-L produced 2% GF-L (2n = 100) with double tailfins; and self-mating of GF-L generated GF-L-F<sub>2</sub>. These results provided clear evidence for the pathway of GF formation: KOC (a variety of CC)–color crucian carp–GF (**Figure 1**).

GF-L and RCC-L were shown to be homodiploids mainly derived from the genome of KOC with some DNA fragments from BSB (**Figures 1**–**7**; **Tables 1**–**5**). GF-L and RCC-L presented obviously different traits from KOC and BSB (**Table 1**). For example, in terms of phenotypes, GF-L and RCC-L had obviously different HH/BH, HL/BL, BL/WL, and HH/HL values, and different numbers of lateral scales, abdominal fins, and anal fins from their parents. In terms of reproductive traits, GF-L and RCC-L had a different age of sexual maturity (1 year) from that (2 years) of KOC and BSB (**Figure 8**), further indicating that GF-L and RCC-L were potentially new species with the same chromosomal number (2n = 100) as their maternal parent (KOC), but with different phenotypes and genotypes from their parents.

In terms of genotypes, GF-L and RCC-L showed different microsatellite DNA patterns and different 5S rDNA sequences from those of KOC and BSB (**Figure 4**, **Table 4** and **Supplementary Figure S4**), suggesting that DNA variation occurred in GF-L and RCC-L. The presence of multiple copies of 5S rDNA, which was probably due to gene conversion resulting from the parental genome (Holliday, 1964; Sun et al., 1989; Martins and Galetti, 1999), provided further evidence for the DNA variation occurring in GF-L and RCC-L.

TABLE 6 | The percentages of nucleotide identity of 5S rDNA sequences in RCC-L and GF-L compared with the genomes of related species.

By comparing the chordinA sequences in GF-L, GF, RCC-L, and RCC, we found that the base at position 320 in GF-L and GF was T, whereas that in RCC-L and RCC was G (**Figure 6**). This mutation (G-T) showed that RCC-L and GF-L formed an excellent fish lineage for studying gene variation and function. The present results were in accordance with a previous study in which this base mutation (G-T) was found to possibly contribute to the occurrence of twin tails in GF (Abe et al., 2014, 2016). The mitochondrial genome of RCC-L also presented a large number of variations (unpublished data).

KOC, BSB, RCC-L, NCC, GF-L, RCC, and GF. The numbers at the branch nodes indicate the bootstrap percentage.

The results of mapping the sequences of 5S rDNA in GF-L and RCC-L to each of the genomes of CC and BSB as references provided further evidence that RCC-L and GF-L were derived from both KOC and BSB. KOC is a variety of CC. The genome of KOC is always the same as that of CC. The average similarity of each of GF-L and RCC-L to CC was obviously higher than that to BSB, supporting the view that the genomes of both RCC-L and GF-L were mainly inherited from KOC, but with some DNA fragments from BSB.

spermatogonia. (D) Mature ovary of RCC-L including II-phase, III-phase, and IV-phase oocytes. (E) Mature testis of GF-L containing mature spermatozoa, spermatids and spermatogonia. (F) Mature ovary of GF-L including II-phase, III-phase, and IV-phase oocytes. Bar = 20 µm.

The comparative analyses of the phenotypes and genotypes, as well as the reproductive traits between GF-L and GF, and between RCC-L and RCC, indicated that GF-L was very similar to GF, and RCC-L was very similar to RCC. For example, the morphological characteristics of GF-L and GF showed no significant difference (P > 0.05) in BL/WL, BH/BL, and CPH/CPL. The morphological characteristics of RCC-L and RCC showed no significant difference (P > 0.05) in BL/WL, BH/BL, HL/BL, HH/BH, and the number of lower lateral scales (**Table 1**). Regarding the genotypes, the similarities regarding the sequences of microsatellite DNA between GF-L and GF, and between RCC-L and RCC, were 98.30 and 95.00%, respectively, indicating that their similarities in genotypes were very high. In addition, the chromosomal numbers in GF-L, GF, RCC-L, and RCC were all 100 (**Figure 2**). For the reproductive traits, the age of sexual maturity was 1 year in GF-L, GF, RCC-L, and RCC (**Figure 8**).

The analyses of the phylogenetic tree based on the 5S rDNA sequences showed that GF-L and GF were located in the same group and were close to RCC-L and RCC (**Figure 7**), providing further evidence that the pathway of RCC-GF existed. In addition, GF-L, GF, RCC-L, and RCC were closer to KOC than to BSB (**Figure 7**), supporting the existence of a KOC-RCC-GF pathway.

Although most of the characteristics of RCC-L were similar to those of RCC, some differences were found between them. For instance, RCC-L presented unique microsatellite bands which were not found in its parents or RCC (**Supplementary Figure S1**). The results of mapping the sequences of the 5S rDNA of RCC-L to the RCC genome showed genomic variation in RCC-L (**Table 6** and **Supplementary Table S1**). These results indicated that genomic incompatibilities and genomic shock arose from distant hybridization and resulted in genomic DNA changes in RCC-L. These genomic variations might explain why RCC-L could easily produce GF-L with many phenotypic changes, including twin tails, whereas it was difficult for RCC to produce GF. RCC-L had been subjected to genomic incompatibilities and genomic shock due to distant hybridization and was in a "plastic" stage that was prone to produce genomic variations and novel traits.

Based on the presence of GF-L derived from RCC-L self-mating, we concluded that GF was probably derived from RCC self-mating. Despite the low frequency (2%) of the formation of GF-L, we established persistent RCC-L, GF-L, and GF-L-F<sub>2</sub> lineages as neodiploid populations, providing new evidence regarding the origins of GF via the KOC–RCC–GF pathway and indicating that interspecific hybridization has the potential to form new species, which is important for species evolution research.

# REFERENCES


# Significance of GF-L

As a new type of goldfish-like fish, GF-L and GF-L-F<sub>2</sub> presented very beautiful phenotypes, especially those with twin tails and white bodies accompanied by red spots (**Figure 1**). These phenotypes were quite different from those of any other GF, indicating that the GF-L lineage had great potential in the ornamental market. In addition, GF-L possessed greater genomic DNA variations, which could easily result in phenotypic changes. GF-L has been used as a new fish resource to cross with other GFs to produce a series of new types of GFs with beautiful phenotypes. The formation of GF-L was very important to both evolutionary biology and fish genetic breeding.

# AUTHOR CONTRIBUTIONS

SL conceived and designed the study. YW and CY contributed to the experimental work, performed most of the statistical analyses, and wrote the manuscript. QQ, JS, and MZ designed the primers and performed the bioinformatics analyses. KL and YH collected the experimental materials. MT and CZ collected the photographs. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 31430088 and 31730098), the Earmarked Fund for China Agriculture Research System (Grant No. CARS-45), Hunan Provincial Natural Science and Technology Major Project (Grant No. 2017NK1031), the Cooperative Innovation Center of Engineering, the Key Research and Development Program of Hunan Province (Grant No. 2018NK2072), and New Products for Developmental Biology of Hunan Province (Grant No. 20134486).

# ACKNOWLEDGMENTS

We would like to sincerely thank the researchers who helped with this study: Shi Wang, Xu Huang, Chongqing Wang, Jun Xiao, Wuhui Li, Li Ren, Rurong Zhao, Lu Zhao, Juan Liu, and Dengke Li.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00437/full#supplementary-material

serial gene duplication, sub-functionalization, and selection. Sci. Rep. 6:26838. doi: 10.1038/srep26838



subfamilies. J. Exp. Zool. B Mol. Dev. Evol. 314, 403–411. doi: 10.1002/jez.b.21346


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Wang, Yang, Luo, Zhang, Qin, Huo, Song, Tao, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome Assembly for a Yunnan-Guizhou Plateau "3E" Fish, Anabarilius grahami (Regan), and Its Evolutionary and Genetic Applications

Wansheng Jiang<sup>1</sup>†, Ying Qiu<sup>2,3</sup>†, Xiaofu Pan<sup>1</sup>†, Yuanwei Zhang<sup>1</sup>, Xiaoai Wang<sup>1</sup>, Yunyun Lv<sup>2,3</sup>, Chao Bian<sup>2,3</sup>, Jia Li<sup>3</sup>, Xinxin You<sup>2,3</sup>, Jieming Chen<sup>2,3</sup>, Kunfeng Yang<sup>1</sup>, Jinlong Yang<sup>4</sup>, Chao Sun<sup>1</sup>, Qian Liu<sup>1</sup>, Le Cheng<sup>4</sup>\*, Junxing Yang<sup>1</sup>\* and Qiong Shi<sup>2,3</sup>\*

<sup>1</sup> State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China, <sup>2</sup> BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, China, <sup>3</sup> Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China, <sup>4</sup> BGI-Yunnan, BGI-Shenzhen, Kunming, China

# Edited by:

Lior David, Hebrew University of Jerusalem, Israel

### Reviewed by:

Chuanju Dong, Henan Normal University, China; Jie Mei, Huazhong Agricultural University, China

### \*Correspondence:

Le Cheng, chengle@genomics.cn; Junxing Yang, yangjx@mail.kiz.ac.cn; Qiong Shi, shiqiong@genomics.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 26 July 2018 Accepted: 21 November 2018 Published: 04 December 2018

### Citation:

Jiang W, Qiu Y, Pan X, Zhang Y, Wang X, Lv Y, Bian C, Li J, You X, Chen J, Yang K, Yang J, Sun C, Liu Q, Cheng L, Yang J and Shi Q (2018) Genome Assembly for a Yunnan-Guizhou Plateau "3E" Fish, Anabarilius grahami (Regan), and Its Evolutionary and Genetic Applications. Front. Genet. 9:614. doi: 10.3389/fgene.2018.00614

A Yunnan-Guizhou Plateau fish, the Kanglang white minnow (Anabarilius grahami), is a typical "3E" (Endangered, Endemic, and Economic) species in China. Its distribution is limited to Fuxian Lake, the nation's second deepest lake, where it has significant local economic value but a drastically declining wild population. This species has been evaluated as VU (Vulnerable) in the China Species Red List. As one of the "Four Famous Fish" in Yunnan province, it has been artificially bred since 2003. Artificial breeding has not only re-established its wild natural populations through reintroduction of the breeding stocks, but also brought wide and popular utilization of this species to the local fish farms. A. grahami has become one of the main native aquaculture species in Yunnan province, and its artificial production has grown steadily each year. To promote the conservation and sustainable utilization of this fish, we initiated its whole genome sequencing project using an Illumina HiSeq 2500 platform. The assembled genome size of A. grahami is 1.006 Gb, accounting for 98.63% of the estimated genome size (1.020 Gb), with contig N50 and scaffold N50 values of 26.4 kb and 4.41 Mb, respectively. Approximately 50.38% of the genome was repetitive. A total of 25,520 protein-coding genes were subsequently predicted. A phylogenetic tree based on 4,580 single-copy genes from A. grahami and 18 other cyprinids revealed three well-supported subclades within the Cyprinidae. This is the first inter-subfamily relationship of cyprinids at the genome level, providing a simple yet useful framework for understanding the traditional but popular subfamily classification systems. Interestingly, a further population demography analysis of A. grahami uncovered a historical relationship between this fish and Fuxian Lake, suggesting that range expansion or shrinkage of the habitat has had a remarkable impact on the population size of endemic plateau fishes. Additionally, a total of 33,836 simple sequence repeat (SSR) markers were identified, and 11 loci were evaluated for a preliminary genetic diversity analysis in this study, thus providing another useful genetic resource for studying this "3E" species.

Keywords: genome sequencing, population history, SSR, plateau fish, Cyprinidae

# INTRODUCTION

fgene-09-00614 December 1, 2018 Time: 14:1 # 2

The Yunnan-Guizhou Plateau (or Yungui Plateau) is a highland region primarily located in the Yunnan and Guizhou provinces in the southwest part of China. This mountain area harbors large numbers of plants and terrestrial vertebrates, and contains 4 of the 10 hotspot ecoregions in China (the Xishuangbanna area, and the Hengduan, Wumeng, and Wuling mountains, Tang et al., 2006). It also holds an abundance of aquatic species, as it encompasses the headwaters of many of the great rivers in Asia that originate on the Qinghai-Tibet Plateau (e.g., the Salween, Mekong and Yangtze rivers). As a consequence, Yunnan province possesses the greatest diversity of fishes in China, accounting for 40% of the nation's freshwater fish species (Chen, 2013).

Most of the native fishes in Yunnan province are locally endemic. The Kanglang white minnow (Anabarilius grahami) is but one example. It is a cyprinoid fish with a distribution restricted to Fuxian Lake, a typical Yunnan-Guizhou plateau lake and also the second deepest lake in China. The species is one of the "Four Famous Fish" in Yunnan and has a special value and popularity. Although it is a small-sized fish, it has long been the major economic fish species in Fuxian Lake, accounting for 70–80% of the natural fishery production before the 1990s (Li et al., 2003a). This fish is historically famous because of its good taste and flavor, attributed to its special muscle nutrient composition (Deng et al., 2013), as well as some folkloric medicinal functions and its place in fishing cultures. Along with the long-term formation of Fuxian Lake, A. grahami has acquired many special biological characters that are thought to result from adaptation of the fish to the lake (Yang, 1992). For instance, because of the limited food resources in the oligotrophic Fuxian Lake, it has a very low absolute fecundity (number of mature eggs: 2,175–3,840) relative to its sister species, Anabarilius andersoni (13,971–15,770), in the adjacent Xinyun Lake (Yang, 1992). As a way of compensating, it has a long annual breeding period from March to October, and shows unusual spawning behaviors, such as a temporally regular interval (ca. 7 days) between two sequential spawnings (Yang, 1992; Ma et al., 2008). This may also be an adaptation to the limited spawning sites, which are only available at some cave or hill springs around Fuxian Lake. In addition, the larvae and adults of A. grahami occupy distinct habitats, with the larvae and juveniles occurring in the shallow coastal regions and the adults in the middle and upper layers of open water (Yang, 1992; Ma et al., 2008). This seems to be a response to the limited food resources in the whole lake (Yang, 1994). 
As the second deepest lake of China, Fuxian Lake has a relatively broad niche in terms of water depth. Correspondingly, the adults of A. grahami can frequently be found in water depths down to 20 m, and may occasionally be seen as deep as 50 m (Yang, 1992). Thus A. grahami is unusual among species of the Cultrinae, because most of them are thought to live in the upper to middle levels (probably less than 5 m) of shallow lakes or rivers (Chen, 1998). The spatial dichotomy strategy of A. grahami might also be a crucial reason enabling it to maintain the largest natural fish stocks in Fuxian Lake. However, all of these interesting biological explanations for A. grahami are hypothetical, and have not been empirically explored.

In recent decades, however, the wild population of A. grahami has decreased sharply. The decline was triggered by the introduction in 1982 of the exotic icefish, Neosalanx taihuensis. The annual production of A. grahami declined from about 400 tons before the 1990s, to 10.4 tons in 1998, and finally to less than 1 ton in the early 2000s; meanwhile, the annual production of N. taihuensis has increased since the 1990s, from about 200 tons during the early colonized years (1986–1990) to an average of 1,554 tons during 1990–2004 (Xiong et al., 2006). The population decline of A. grahami has been ascribed to competitive disadvantage, because the exotic N. taihuensis and the endemic A. grahami have significant food and space overlaps (Qin et al., 2007). However, other anthropogenic causes, such as overfishing of A. grahami, destruction of the spawning sites, and the collateral damage from catching N. taihuensis, should also be considered (Li et al., 2003a). At the same time, the low fecundity of A. grahami itself (Yang, 1992) might also make it vulnerable in the changing environment. The drastic population decline of A. grahami shifted it from an abundant economic species to an endangered fish. This valuable fish was evaluated as VU (vulnerable) in the China Species Red List in 2004 (Wang and Xie, 2004) and 2015 (Jiang et al., 2016), and is listed among the threatened fishes of the world (Liu et al., 2009). Fortunately, artificial breeding was achieved in 2003 (Li et al., 2003b), and reintroduction of the breeding stocks has become almost the only way to re-establish its wild populations. However, the adaptability and sustainability of the re-established wild population, as well as the current genetic diversity (after serious population fluctuation), are unexplored areas that await evaluation; the lack of effective genetic markers might be one reason for this situation. On the other hand, artificial breeding has also created the opportunity for aquaculture utilization of this valuable species. 
Although the artificial cultivation is still a small-scale operation, the annual production has gradually increased since 2005, and reached about 15 tons in 2014 from the fish farms around Fuxian Lake (Li, 2015). At present, A. grahami is one of the main native aquaculture species in Yunnan province, and the utilization of this species in local aquaculture has shown steady growth each year.

Anabarilius grahami is a typical species with "3E" (Endangered, Endemic, and Economic) status and priorities. We therefore initiated the whole genome sequencing (WGS) project of this valuable species. The WGS should support many biological and conservation enquiries, and also provide extensive opportunities for the utilization of this species in aquaculture. Based on the WGS information, we also aimed to carry out three evolutionary and genetic applications in this study: (1) reconstruction of the inter-subfamily phylogenetic relationships within the Cyprinidae from a genomic view, (2) reconstruction of the demographic history of A. grahami along with the formation of Fuxian Lake, and (3) development of large numbers of simple sequence repeat (SSR) markers for the future genetic evaluation of this "3E" plateau fish species.

# MATERIALS AND METHODS


# Sample Preparation and Genome Sequencing

Samples of A. grahami were collected from artificially cultivated stocks in the Endangered Fish Conservation Center (EFCC) of the Kunming Institute of Zoology, Chinese Academy of Sciences (KIZ), Kunming, China. The research protocol and treatment of experimental fishes were reviewed and approved by the internal review board of KIZ (approval ID: 2015-SMKX026).

Genomic DNA was extracted from a pool of muscle tissue from two individuals. Three short paired-end (200, 500, and 800 bp) and four long paired-end (2, 5, 10, and 20 kb, respectively) sequencing libraries were constructed with the standard protocol provided by Illumina (San Diego, United States), and then sequenced on an Illumina Hiseq2500 platform. Low-quality and duplicated reads were filtered out through SOAPfilter (v2.2) software (Li R. et al., 2009).

For transcriptome-based prediction, RNA was extracted from four tissues (brain, liver, gonad and muscle) of the same two individuals. All the libraries were prepared using the Illumina TruSeq RNA sample preparation kit (San Diego, United States) and then sequenced by Illumina Hiseq4000.

# Genome Assembly

The genome size was estimated using the 17-mer depth frequency distribution (Liu et al., 2013) as follows: G (genome size) = k-mer\_number/k-mer\_depth, where k-mer\_number is the total number of k-mers and k-mer\_depth is the depth at the main peak of the frequency distribution. The clean reads were used to construct contigs and initial scaffolds with the assembler Platanus (v1.2.4, Kajitani et al., 2014) using default parameters. Subsequently, intra-scaffold gaps were filled using the reads of the short-insert libraries with GapCloser 1.12 (Li R. et al., 2009). BUSCO (Benchmarking Universal Single-Copy Orthologs; v3.0.2, Simao et al., 2015) was employed to evaluate the completeness of the genome assembly.
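The k-mer formula above can be illustrated with a minimal Python sketch. It assumes the k-mer counts are already available as a depth histogram (depth mapped to the number of distinct k-mers seen at that depth); the function name `estimate_genome_size` and the `min_depth` cutoff used to ignore low-depth, likely erroneous k-mers are illustrative assumptions, not part of the published pipeline.

```python
def estimate_genome_size(kmer_histogram, min_depth=5):
    """Estimate genome size as k-mer_number / k-mer_depth.

    kmer_histogram: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: hypothetical cutoff below which k-mers are treated as errors.
    """
    # Keep only depths at or above the (assumed) error cutoff.
    usable = {d: c for d, c in kmer_histogram.items() if d >= min_depth}
    # Total number of k-mer occurrences (k-mer_number).
    kmer_number = sum(depth * count for depth, count in usable.items())
    # Depth at the main peak of the distribution (k-mer_depth).
    kmer_depth = max(usable, key=usable.get)
    return kmer_number / kmer_depth, kmer_depth


# Toy histogram with an error peak at depth 1 and a main peak at depth 40.
toy = {1: 1000, 38: 10, 39: 20, 40: 50, 41: 20, 42: 10}
size, peak = estimate_genome_size(toy)
print(size, peak)
```

As a rough cross-check of the reported values, a main peak at 40× together with an estimated genome size of 1.020 Gb would correspond to about 4.08 × 10<sup>10</sup> usable 17-mers.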

# Genome Annotation

We identified repetitive sequences using the following pipeline. First, Tandem Repeats Finder (v4.07, Benson, 1999) was used to search for tandem repeats in the genome assembly. Subsequently, we combined both homology-based and de novo predictions to identify transposable elements (TEs). We utilized RepeatMasker (v1.323, Tarailo-Graovac and Chen, 2009) to detect known TEs against the Repbase TE library (release 21.01, Jurka et al., 2005) and RepeatProteinMask (v2.1) to identify TE-related proteins. We then used LTR\_FINDER (Xu and Wang, 2007) and RepeatModeler (v1.73, Abrusan et al., 2009) to construct a de novo repeat library with default parameters. Finally, we employed RepeatMasker (Tarailo-Graovac and Chen, 2009) to identify known and novel TEs against the Repbase TE library and the de novo repeat library.

We combined de novo, homology-based and transcriptome-based prediction methods to predict protein-coding genes. For the de novo prediction, AUGUSTUS 3.0.1 (Stanke et al., 2006) and GenScan 1.0 (Burge and Karlin, 1997) were employed to predict gene structures in the repeat-masked genome assembly. For the homology-based prediction, the reference protein sequences were from five fishes, including zebrafish (Danio rerio, Howe et al., 2013), medaka (Oryzias latipes, Kasahara et al., 2007), a Chinese cavefish (Sinocyclocheilus grahami, Yang et al., 2016), grass carp (Ctenopharyngodon idella, Wang et al., 2015) and common carp (Cyprinus carpio, Xu et al., 2014). These downloaded protein sequences were mapped onto the assembled genome using tBlastn (v22.19, Mount, 2007) with an E-value threshold of 1e−5. Genewise (v2.2.0) was employed to predict gene structures. For the transcriptome-based prediction, the RNA-Seq data were aligned to the genome assembly using TopHat (v2.0, Trapnell et al., 2009), and transcript structures were predicted with Cufflinks (Trapnell et al., 2010). Finally, all gene models from the above three methods were integrated to form a comprehensive and non-redundant gene set using GLEAN (Elsik et al., 2007).

# Functional Assignment

All protein sequences from the GLEAN results were aligned to the TrEMBL and SwissProt databases (Boeckmann et al., 2003) using BlastP at E-value ≤ 1e−5. The gene pathways were mapped to the KEGG database (Kanehisa and Goto, 2000). We also used the InterProScan software (Hunter et al., 2009) to annotate the protein sequences by searching publicly available databases including Pfam (Finn et al., 2014), PRINTS (Attwood, 2002), PANTHER (Thomas et al., 2003), ProDom (Bru et al., 2005) and SMART (Letunic et al., 2004). In summary, approximately 87.80% of the genes were supported by at least one functional assignment from the public databases (TrEMBL, SwissProt, KEGG and InterPro).

# Phylogenetic Analysis

To understand the phylogenetic position of A. grahami within the Cyprinidae, we reconstructed a phylogenetic tree with A. grahami and 18 other cyprinids, using channel catfish (Ictalurus punctatus) as the outgroup. These selected species covered 11 of the 12 broadly recognized subfamilies in Cyprinidae (Chen, 1998); no data are available yet for the remaining subfamily, Gobiobotinae. Among these cyprinid species (**Table 1**), if whole-genome gene sets were available, we directly adopted them; if only transcriptome data were available, we downloaded the submitted reads from NCBI and assembled them de novo into gene sets. Generally, each single-copy gene in diploid species would have two corresponding copies in tetraploid genomes. We therefore randomly separated the two copies into two gene sets and then combined each of the gene sets in tetraploid species with the single gene set in diploid species to produce two final single-copy datasets (dataset I and II).
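The random separation of tetraploid gene copies into two datasets can be sketched as follows. This is only an illustration of the idea described above: the input layout (a dictionary of gene families, each mapping species to a list of gene identifiers), the function name `split_tetraploid_copies`, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_tetraploid_copies(families, seed=0):
    """Build two single-copy datasets from gene families.

    families: {family_id: {species: [gene_id, ...]}}; diploid species are
    expected to contribute one gene and tetraploid species two genes.
    Families containing any other copy number are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    dataset_1, dataset_2 = {}, {}
    for family_id, members in families.items():
        picks_1, picks_2 = {}, {}
        for species, genes in members.items():
            if len(genes) == 1:
                # Diploid: the single copy goes into both datasets.
                picks_1[species] = picks_2[species] = genes[0]
            elif len(genes) == 2:
                # Tetraploid: assign the two copies at random.
                first, second = rng.sample(genes, 2)
                picks_1[species], picks_2[species] = first, second
            else:
                break  # unexpected copy number, drop this family
        else:
            dataset_1[family_id] = picks_1
            dataset_2[family_id] = picks_2
    return dataset_1, dataset_2


# Hypothetical gene identifiers for one family.
example = {"fam1": {"zebrafish": ["dr1"], "common_carp": ["cc1a", "cc1b"]}}
print(split_tetraploid_copies(example))
```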

In datasets I and II, 229 single-copy families including 4,580 single-copy genes were collected, and the alignments yielded 247,500 and 256,839 sites, respectively. These two datasets were subsequently employed to construct phylogenetic trees using both the maximum likelihood (ML) method in PhyML (v3.0, Guindon and Gascuel, 2003) and the Bayesian inference (BI) method in MrBayes (v3.1, Ronquist and Huelsenbeck, 2003).


TABLE 1 | Fish species selected for the phylogenetic analysis of Cyprinidae in the present study.

<sup>∗</sup>The classification of subfamily was adopted from a previous report (Chen, 1998). #The accession numbers included the NCBI BioProject ID for the genome data and Sequence Read Archive (SRA) Run ID for the transcriptome data. Please note that the ID highlighted in bold referred to the genome data of A. grahami that we assembled in this study.

# Heterozygous SNP Calling and Demographic History

First, we identified heterozygous single-nucleotide polymorphisms (SNPs) in the A. grahami genome. We mapped reads from the 500-bp insert library against our assembled genome with BWA (v0.7.12-r1039, Li and Durbin, 2009). The SNPs were called by SAMtools (v0.1.19, Li H. et al., 2009) and filtered by read depth across the genome. In total, 1,733,343 heterozygous sites were identified, and the diploid consensus genome sequences were generated from these SNPs. Second, the distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual was used to infer the history of change in population size. We applied the pairwise sequentially Markovian coalescent (PSMC) model (Li and Durbin, 2011) to the heterozygous sites of the A. grahami genome with a putative generation time (g = 2 years) and mutation rate (µ = 3.51 × 10<sup>−9</sup> per year per nucleotide, Graur and Li, 2000) to estimate historical effective population sizes over a range from 10<sup>4</sup> to 10<sup>7</sup> years ago. Finally, we used gnuplot4.4 (Janert, 2010) to draw a curve for the reconstructed population history.
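To make the role of g and µ concrete, the sketch below shows how scaled PSMC output is conventionally converted to years and effective population size, following the scaling described in the PSMC documentation (N0 = θ0 / (4µs), time in generations = 2·N0·t, Ne = N0·λ). Several points are assumptions for illustration: the per-year mutation rate is converted to a per-generation rate by multiplying by g, the bin size s = 100 is the PSMC default rather than a value stated in this study, and the θ0, t and λ values in the usage example are invented.

```python
def scale_psmc(theta0, t_k, lambda_k, mu_per_year=3.51e-9,
               gen_years=2.0, bin_size=100):
    """Convert scaled PSMC output to years and effective population sizes.

    theta0: scaled mutation rate per bin reported by PSMC.
    t_k, lambda_k: scaled time points and relative population sizes.
    mu_per_year, gen_years: values quoted in the text (per-year rate, g).
    bin_size: assumed PSMC input bin size (default of 100 bp).
    """
    mu_per_generation = mu_per_year * gen_years  # assumed conversion
    n0 = theta0 / (4.0 * mu_per_generation * bin_size)
    years = [2.0 * n0 * t * gen_years for t in t_k]
    ne = [n0 * lam for lam in lambda_k]
    return years, ne


# Invented scaled values, for illustration only.
years, ne = scale_psmc(theta0=0.02808, t_k=[0.0, 1.5], lambda_k=[1.0, 2.0])
print(years, ne)
```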

# SSR Searching and Identification

We searched for SSR loci with motifs ranging from di- to hexanucleotides in the assembled genome scaffolds of A. grahami. Our mining criteria included (i) scaffolds extracted with the length ≥1 kb and the average sequence coverage >20×; (ii) SSR identified from the selected scaffolds using MISA script<sup>1</sup> with default settings at (2/6) (3/5) (4/5) (5/5) (6/5), and >100 bp between two SSRs; (iii) repeat motifs and the 200-bp flanking sequences used for Blastn search against the genome sequence with E-value ≤ 1e−5; (iv) SSR developed through filtering with >90% identity and >85% alignment length of the flanking sequences; (v) final SSR loci identified as candidates for marker development with single hit when mapped back to the genome.
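The minimum-repeat thresholds in criterion (ii), i.e., at least 6 repeats for di-nucleotide motifs and at least 5 repeats for tri- to hexa-nucleotide motifs, can be illustrated with a small regular-expression search. This is a simplified stand-in for the MISA script, not a reimplementation of it: the function name `find_ssrs` is hypothetical, motifs that are themselves repeats of a shorter unit are skipped as a simplification, and the spacing rule between neighboring SSRs, the flanking-sequence Blastn filter, and the single-hit check in criteria (iii) to (v) are not covered.

```python
import re

# Motif length -> minimum number of tandem repeats, as in (2/6) (3/5) (4/5) (5/5) (6/5).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g., "AA", "ATAT").
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            repeats = (match.end() - match.start()) // unit_len
            hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)


toy_scaffold = "GGC" + "AT" * 7 + "CCGTA" + "AAG" * 5 + "TT"
print(find_ssrs(toy_scaffold))
```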

# SSR Evaluation and Genetic Diversity Analysis

Three steps were conducted to evaluate the efficiency of SSR development in this study, and a preliminary genetic diversity of A. grahami was also assessed based on the final optimized SSR markers. We named the three steps polymerase chain reaction (PCR), polymorphism, and parameter evaluation, respectively. First, we selected a random set of 50 SSR loci for primer design using PRIMER3 (Koressaar and Remm, 2007), with expected PCR products ranging from 100 to 200 bp. Amplification effectiveness was tested on two geographically separated individuals of A. grahami. Second, we chose the loci with correct and bright electrophoretic bands for a polymorphism evaluation, which was carried out on 7 populations with three individuals in each population. SSR loci without any polymorphism among all 21 samples were discarded. Third, we filtered the SSR loci by evaluating the parameters that could affect the reliability of SSR analysis. This evaluation included null allele detection and linkage disequilibrium tests, based on the genotyping data matrix from four populations (30 samples in each population). These four populations, named EFCC1, EFCC2, Huoyanshan and Luchong, respectively, were artificially preserved populations from three different fish breeding farms. They were the main sources for artificial reintroduction each year to the current wild population in Fuxian Lake.

<sup>1</sup>http://pgrc.ipk-gatersleben.de/misa/misa.html

The PCRs in the first step were carried out in 12.5-µL reaction volumes using the following amplification profile: 4 min at 94°C; 35 cycles of 30 s at 94°C, 35 s at 57°C, and 40 s at 72°C; followed by a final extension step of 10 min at 72°C. The PCR procedures in the second and third steps were performed under the same conditions as those in the first step, except that fluorescently labeled reverse primers (6-FAM, HEX) were used instead of the regular primers, and a 1:(40–100) dilution of the first PCR product was used as the DNA template, with the dilution chosen according to the brightness of the electrophoretic bands relative to the standard DNA marker. All PCR products were then genotyped on an ABI 3730xl genetic analyzer with Gene-Scan LIZ-500 (Applied Biosystems, United States) as the internal size standard, and scored with GeneMarker (SoftGenetics, United States). Genotyping errors associated with SSR analysis, such as stutter bands, large allele dropout and null alleles, were detected using MICRO-CHECKER (v2.2.3, Van Oosterhout et al., 2004). CERVUS (v3.0.7, Kalinowski et al., 2007) was employed to find matching pairs of genotypes and calculate the basic genetic parameters, including number of alleles (Na), polymorphism information content (PIC), the observed and expected heterozygosities (Ho and He), and null allele frequencies. The inbreeding coefficient (Fis), deviations from Hardy-Weinberg equilibrium (HWE), and linkage disequilibrium tests were calculated with GENEPOP (v4.7.0, Rousset, 2008).
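For readers unfamiliar with these parameters, the following sketch computes Na, Ho, He, PIC and Fis for a single locus from diploid genotypes using the textbook formulas (He = 1 − Σp², PIC = 1 − Σp² − Σ<sub>i&lt;j</sub> 2p<sub>i</sub>²p<sub>j</sub>², Fis = 1 − Ho/He). It is only an illustration: CERVUS and GENEPOP may apply sample-size corrections and different estimators, so their output need not match these simple values, and the function name `locus_stats` and the allele sizes in the example are hypothetical.

```python
from collections import Counter


def locus_stats(genotypes):
    """Basic diversity statistics for one SSR locus.

    genotypes: list of (allele_a, allele_b) tuples, one per individual.
    Uses uncorrected textbook formulas (an assumption; see lead-in).
    """
    n = len(genotypes)
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [count / (2 * n) for count in allele_counts.values()]
    # Observed heterozygosity: fraction of heterozygous individuals.
    ho = sum(a != b for a, b in genotypes) / n
    sum_p2 = sum(p * p for p in freqs)
    # Expected heterozygosity under Hardy-Weinberg proportions.
    he = 1.0 - sum_p2
    # Polymorphism information content.
    cross = sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                for i in range(len(freqs))
                for j in range(i + 1, len(freqs)))
    pic = 1.0 - sum_p2 - cross
    # Simple inbreeding coefficient; negative when Ho exceeds He.
    fis = 1.0 - ho / he if he > 0 else float("nan")
    return {"Na": len(allele_counts), "Ho": ho, "He": he, "PIC": pic, "Fis": fis}


# Four heterozygous individuals with hypothetical allele sizes 100 and 102.
print(locus_stats([(100, 102)] * 4))
```

In this toy case every individual is heterozygous, so Ho exceeds He and Fis is negative, which is the same direction of effect as a heterozygote excess.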

# RESULTS

# Summary of Genome Assembly and Annotation for A. grahami

TABLE 2 | Summary of the genome assembly and annotation for A. grahami.

A total of 279.6 Gb of raw data were generated by sequencing seven libraries on the Illumina HiSeq 2500 platform (**Supplementary Table S1**). The k-mer depth distribution showed a main peak at 40× (**Figure 1A**), and the genome size of A. grahami was therefore estimated to be 1.020 Gb (**Table 2**). In addition, a minor curve at the right tail indicated a low level of possible repetitive sequences (**Figure 1A**). After filtering low-quality reads, 188.9 Gb of clean reads were assembled using Platanus (**Supplementary Table S2**). The final assembled genome size of A. grahami is 1.006 Gb, accounting for 98.63% of the estimated genome size (1.020 Gb). The assembled contig number is 250,527 and the scaffold number is 178,229, with contig N50 and scaffold N50 values of 26.4 kb and 4.41 Mb, respectively (**Table 2**). The scaffold N50 of A. grahami is greater than those of all cyprinids with published genomes except grass carp, and also greater than those of most non-cyprinid teleosts (see more details in **Supplementary Table S3**).
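For reference, the N50 statistic quoted for contigs and scaffolds is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that definition follows; the function name `n50` and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty set of lengths")


# Toy example: total length 32, and the two longest sequences (8 + 8) reach half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```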

Using BUSCO software, we chose the single-copy orthologs (N = 4,584) obtained from the phylogenetic analysis to assess the completeness of our genome assembly. The result showed that 93.2% of BUSCO genes were complete, comprising 89.6% single-copy and 3.6% duplicated BUSCOs; 4.0% were fragmented BUSCOs, and 2.8% were missing. These data confirmed that our assembled genome was of comparatively high quality and completeness.

The genome comprised approximately 50.38% repetitive sequences (**Table 2**), which was comparable to the repeat content (52.2%) of the zebrafish genome (Howe et al., 2013, **Supplementary Table S4**). Additionally, the most abundant type of TE was class II DNA transposon (31.37%; **Supplementary Table S5**).

The number of predicted genes in A. grahami, reciprocally homologous to five representative fish genomes (D. rerio, O. latipes, S. grahami, C. idella and C. carpio), was more than 25,000 (**Figure 1B**). With a combination of de novo, homology-based and transcriptome-based annotation methods, we finally predicted a total of 25,520 protein-coding genes from the present A. grahami genome assembly, and 22,406 (87.80%) genes matched entries in the public databases (TrEMBL, SwissProt, KEGG and InterPro, **Table 2** and **Figure 1C**). The total number of protein-coding genes identified in A. grahami (25,520) was similar to that of sequenced diploid cyprinids, such as zebrafish (26,000, Howe et al., 2013) and grass carp (27,263, Wang et al., 2015), and approximately half that of tetraploid cyprinids, such as common carp (52,610, Xu et al., 2014) and the golden-line barbel fish (42,109, Yang et al., 2016). These data provided evidence to support the diploid nature of A. grahami from a genomic view.

# Phylogenetic Position and Population History of A. grahami

Based on two datasets (dataset I and II) and two methods (ML and BI), four phylogenetic trees (ML-I, ML-II, BI-I and BI-II) were obtained. All four trees revealed an identical topology for the 19 species of the Cyprinidae involved in the study, representing 11 of the 12 recognized subfamilies (Chen, 1998; **Figure 2**). Within this group, the closest relative of A. grahami is Culter ilishaeformis; both of them belong to the subfamily Cultrinae. The 11 subfamilies were all recovered as monophyletic groups except Leuciscinae, in which Tinca tinca was not nested with C. idella, but had a closer relationship with the species representing the Gobioninae and the Acheilognathinae.

Three major subclades (Clade I, II and III) were recovered in the Cyprinidae with strong supporting values (**Figure 2**). In summary, the Clade I represented the subfamily Danioninae, which was resolved as the basal-most subfamily within the Cyprinidae. The Clade II was recovered in a relationship of [Labeoninae, (Schizothoracinae, (Cyprininae, Barbinae))], and the Clade III was recovered in a relationship of [(Tinca, (Acheilognathinae, Gobioninae)), ((Leuciscinae, Hypophthalmichthyinae), (Xenocyprinae, Cultrinae))].

Using the heterozygous SNPs from the genome data of A. grahami, we reconstructed the population demography based on the PSMC model. As shown in **Figure 3**, the population of A. grahami remained relatively stable in size for a long time (0.6–3 Ma), then increased from 0.6 Ma, reached a peak at about 0.03–0.04 Ma, and declined thereafter.

# SSR Identification, Evaluation and Application

A flowchart depicting the process used for SSR marker identification, evaluation and application is presented in **Figure 4**. In brief, a total of 144,693 SSRs were developed using criteria (i) to (iv), and 33,836 were identified as final SSR loci after (v) (**Supplementary Table S6**). The numbers of both the developed SSR loci (144,693) and the finally identified loci (33,836) gradually decreased from di- to hexa-nucleotide motifs, while the di- plus tri-nucleotide SSRs accounted for over 98% of all the final identified SSR loci (**Supplementary Table S7**). Of the 50 randomly selected SSR loci, 47 loci (94%) successfully yielded PCR products with a single band of the expected size (Step I: PCR evaluation). Using 27 SSR loci (22 di-, 4 tri-, and 1 tetra-nucleotide SSRs) in seven different populations (n = 3 in each population), polymorphism was detected in only 17 of the 22 di-nucleotide loci (Step II: polymorphism evaluation), which were thus retained for the next step. After excluding three loci with detection of null alleles (using MICRO-CHECKER), two loci with null allele frequency greater than 0.2 (using CERVUS), and two loci involved in linkage disequilibrium (using GENEPOP, **Supplementary Table S8**), only 11 loci were finally retained (Step III: parameter evaluation). These 11 optimized SSR markers (**Supplementary Table S9**) were then used for a subsequent genetic diversity analysis.

Basic genetic parameters of four different populations of A. grahami based on the 11 SSR markers are summarized in **Table 3** (more details in **Supplementary Table S10**). In all four populations, the mean Ho (0.391–0.467) was higher than the mean He (0.354–0.411). The PIC values decreased in the order EFCC2 > Huoyanshan > EFCC2 > Luchong, and the average value was about 0.3 among all four populations, which indicated moderate polymorphism in A. grahami. The majority of the Fis values were negative, indicating that the inbreeding level was relatively low. Based on the 11 SSR markers, significant deviation from HWE was observed only in the EFCC2 population (P < 0.05); heterozygote excess may have contributed to this deviation, as Ho was higher than He in this population.

# DISCUSSION

# The WGS of A. grahami Provides a Useful Genetic Resource

If the initial discovery of a species can be treated as the first milestone in enabling people to know it, the WGS of a species would be another landmark that promotes further applications. The advent of next-generation sequencing (NGS) has revolutionized genomics research by enabling the sequencing of entire genomes with ever-increasing throughput and ever-decreasing cost (Van Dijk et al., 2014). This revolution has not only radically changed the paradigm of biological research, shifting it to a genome-wide scale, but also opened up a new age across the biological sciences (Koboldt et al., 2013). Since the first completion of the human genome sequence in 2004, many WGS projects have been launched, such as the Genome 10K Project (David et al., 2009), involving the sequencing of thousands or even millions of genomes (Van Dijk et al., 2014). The WGS is the basic genetic heritage of a species; it ushers in a new era of biological investigation for each newly sequenced species, touching nearly every aspect of biological enquiry (David et al., 2009).

Fishes account for over one-half of the world's living species of vertebrates, exhibiting an incomparable diversity in their morphology, physiology, behavior, and ecological adaptations (Nelson et al., 2016). Fishes are also important food sources for humans, comprising 49.8 million tonnes of products, with an estimated first-sale value of US \$99.2 billion in 2014 (FAO, 2016). NGS-based WGS brings new opportunities to fish research and utilization; however, the current WGS projects on fishes do not approach their diversity and application needs. Up to June 2018, published genome data were available for only 60 fish species (Hughes et al., 2018). These sequenced species are predominantly economically important fishes, such as Atlantic salmon (Davidson et al., 2010), common carp (Xu et al., 2014), and channel catfish (Liu et al., 2016); other sequenced species are either model organisms, including zebrafish (Howe et al., 2013) and medaka (Kasahara et al., 2007), or evolutionarily interesting nodes, such as the coelacanth (Amemiya et al., 2013) and cavefishes (McGaugh et al., 2014).

In this study, we reported the WGS of a Yunnan-Guizhou plateau "3E" fish, A. grahami, which is a typical species with endangered, endemic, and economic status and priorities. The corresponding genome assembly was evaluated to be of good quality (**Figure 1** and **Table 2**), and it is expected to provide a useful genetic resource for further studies of this valuable fish.

# Inter-Subfamily Phylogenetic Relationships Within the Cyprinidae

In this study, we reconstructed for the first time the phylogenetic relationships within the Cyprinidae from a genomic viewpoint, combining the genomic data of A. grahami obtained here with genomic and transcriptomic data of 18 other cyprinids downloaded from NCBI (**Table 1**). Cypriniformes is the largest monophyletic group of freshwater fishes in the world, with 4,000+ species recognized and 2,000+ species still awaiting description (Mayden et al., 2009; Stout et al., 2016). Cyprinidae contains the vast majority of taxa in the Cypriniformes, and it is also the largest family of freshwater fishes on Earth (Nelson et al., 2016). Classification of subfamilies can facilitate taxonomic, evolutionary and many other studies of this large group; however, the recognition of the subfamilies remains controversial in spite of some systematic studies. With 4,000+ recognized species, the ambition to reconstruct a tree of life at the species level is largely impractical; however, a phylogeny-based subfamily classification could provide a simple but useful taxonomic system for broader studies.


TABLE 3 | The average genetic parameters at 11 SSR loci of A. grahami in four different populations (n = 30 per population).

Na, number of alleles per locus; Ho, observed heterozygosity; He, expected heterozygosity; PIC, polymorphism information content; Fis, inbreeding coefficient. <sup>∗</sup> indicated the probability of significant deviation from HWE based on all 11 SSR loci.

A putative subfamily classification system, including 2 series and 10 subfamilies based on skeletal characters, has been proposed (Chen et al., 1984). It was a fundamental framework for most of the ensuing taxonomic literature about Cyprinidae, such as the books "Fauna Sinica, Osteichthyes, Cypriniformes II & III" (Chen, 1998) and "Fishes of the World" (5th ed.), as well as some previous editions (Nelson et al., 2016). The previous classification (Chen et al., 1984) has been updated to a 12-subfamily system for the Cyprinidae, namely, Danioninae, Leuciscinae, Cultrinae, Xenocyprinae, Hypophthalmichthyinae, Gobioninae, Gobiobotinae, Acheilognathinae, Barbinae, Labeoninae, Schizothoracinae, and Cyprininae (Chen, 1998). This 12-subfamily classification has become one of the most useful and popular systems for subsequent studies (Chen, 2013), and due to its popularity, the inter-subfamily relationships under this classification system have also been tested by some molecular phylogenetic studies, mainly based on PCR-targeted DNA sequences (Wang et al., 2007, 2012).

The phylogenetic relationship in this study revealed three well-supported subclades of Cyprinidae (**Figure 2**). The subfamily Danioninae (herein Clade I) was resolved as the basal-most subfamily within the Cyprinidae, which is consistent with some previous molecular phylogenetic studies (Gilles et al., 2001; Wang et al., 2007) but disagrees with some others (Chen and Mayden, 2009; Wang et al., 2012). Morphologically, Danioninae is a large assemblage containing mostly taxa unaccommodated by the other subfamilies (Wang et al., 2007). The sister group relationship of Clade II and III, in line with most of the previous studies based on PCR-targeted DNA sequences, supported two well-accepted major lineages within Cyprinidae, namely, barbeled cyprinines (herein Clade II) and (usually) non-barbeled leuciscines (herein Clade III, Wang et al., 2012). It was also largely consistent with the two-series classification, the fundamental framework of Chen et al. (1984) based on skeletal characters, except for the position of Tinca. Clade II was recovered with a relationship of [Labeoninae, (Schizothoracinae, (Cyprininae, Barbinae))], which was largely consistent with most previous studies based on more species but shorter sequences (Wang et al., 2007, 2012), and it has now been suggested that this subclade be named Cyprininae (see review in Yang et al., 2015). Clade III comprised the species usually called "the Endemic Clade of East Asian Cyprinidae" (Tao et al., 2010), even though the inter-group relationships were controversial based on previous PCR-targeted DNA sequences (Wang et al., 2007, 2012). Based on genome-level sequences used in this study, Clade III was recovered in a relationship of [(Tinca, (Acheilognathinae, Gobioninae)), ((Leuciscinae, Hypophthalmichthyinae), (Xenocyprinae, Cultrinae))]. 
Two sister group relationships among Clade III, the Acheilognathinae + Gobioninae, and the Xenocyprinae + Cultrinae, were broadly consistent with most of the previous studies; however, recovering Tinca as the sister group of other Leuciscinae from some other studies (Wang et al., 2007, 2012; Stout et al., 2016) was not supported in this study. Tinca has long been treated as Incertae sedis from both morphological and molecular studies (Wang et al., 2007). Due to its controversial phylogenetic position, the monotypic genus Tinca has been frequently suggested to be an independent subfamily as Tincinae (Wang et al., 2012; Stout et al., 2016).

The classification of subfamilies in the Cyprinidae and the subgroups embodied in each subfamily have varied among different studies, which has been inevitable in the progress toward the ultimate tree of life among 4,000+ cyprinids. During this process, many taxonomic levels, such as series, lineages, subfamilies, and tribes, were proposed to designate newly recognized groups (Yang et al., 2015; Stout et al., 2016); however, these complicated terms make the phylogenetic relationships of Cyprinidae inaccessible for most people without in-depth knowledge of this group. The phylogenetic relationship revealed in this study (**Figure 2**), in spite of the limited number of species included, is expected to provide a simple but useful framework of the inter-subfamily phylogeny of Cyprinidae.

# Historical Relationship Between A. grahami and Its Habitat Fuxian Lake

As one of the Yunnan-Guizhou plateau lakes, Fuxian Lake is the sole habitat of A. grahami. Interestingly, the species exhibits many special biological characters, which were believed to be a result of adaptation along with the long-term formation of Fuxian Lake (Yang, 1992). Fuxian Lake, similar to most of the other Yunnan-Guizhou plateau lakes, is a kind of rift lake that formed and evolved under long, periodic and complex tectonic events during the rising of Qinghai-Tibet plateau (Zhu et al., 1989).

From the evidence of lake sediments, we know that Fuxian Lake was formed by fault-subsidence tectonics in the late Tertiary, and then developed from a pond into a basin from the late Pliocene onward (ca. 3.0–3.4 Ma). It experienced a large paleo-Fuxian Lake period in the late Pleistocene to Holocene (ca. 0.126–0.012 Ma), when the surface area was approximately 1.6-fold greater and the surface elevation was 30–40 m higher than those of the present lake (Zhu et al., 1989). Afterward, the lake body sank rapidly and the surrounding mountains gradually lifted, which finally shaped Fuxian Lake into the second deepest lake in China, with a maximum depth of over 150 m and an average depth of about 87 m. Along with the process of deepening, Fuxian Lake has also been undergoing a copiotrophic to oligotrophic transformation (Yang, 1994). In summary, there are three periods in the development of Fuxian Lake (**Figure 3**): (I) a lake formation period since the late Pliocene (ca. 3 Ma), (II) a large lake period since the late Pleistocene (ca. 0.1 Ma), and (III) a deepening period accompanied by oligotrophic development since the early Holocene (ca. 0.012 Ma).

Interestingly, the population demography of A. grahami matched well with the three periods in the development of Fuxian Lake (**Figure 3**). The population of A. grahami remained relatively stable during the early period (0.6–3 Ma), which would reflect the long duration of lake formation since the late Pliocene (ca. 3 Ma, Period I). During this time, the ancestors of A. grahami colonized the lake and shifted gradually from lotic to lentic habitats. The population increase of A. grahami since 0.6 Ma was possibly a response to the expansion of Fuxian Lake. When the fish reached its maximal population size (0.03–0.04 Ma), the lake also had its largest ponding area (ca. 0.1 Ma, Period II). We detected a similar pattern in an adjacent Yunnan-Guizhou plateau lake, Dianchi Lake, where the endemic fish S. grahami exhibited a noteworthy population expansion congruent with a period when the paleo-Dianchi Lake had a three times larger area (Yang et al., 2016). Considering the similar patterns of A. grahami in Fuxian Lake and S. grahami in Dianchi Lake, range expansion appears to have served as a crucial factor in increasing the population sizes of plateau endemic fishes, and vice versa. In A. grahami, the later shrinking and deepening of Fuxian Lake (ca. 0.012 Ma, Period III) might be the key reason for its population decline after the maximal population size (0.03–0.04 Ma). The oligotrophication that accompanied the deepening of Fuxian Lake would, synchronously and substantially, have accelerated the subsequent population decline.

# SSR Development and Utilization for Genetic Diversity Analysis

Molecular markers have been widely used to study the genetic diversity of a species. Because they are abundantly polymorphic, selectively neutral, highly repeatable, and can be genotyped unambiguously, SSRs are among the most useful molecular markers and can easily be explored and applied in this post-genomic era. Compared with the traditional construction of enriched libraries, which is expensive, time-consuming and labor-intensive, identifying SSR markers from high-throughput sequencing data is much faster and more cost-effective (Liu et al., 2017). The identification of SSR markers provides valuable resources for further studies of each newly sequenced taxon (Stoll et al., 2017).

In this study, we identified 33,836 SSR loci of A. grahami after a genomic search under five criteria, which can serve as an SSR resource pool for studies on this species (**Supplementary Table S6**). We designed a three-step approach, namely, PCR, polymorphism and parameter evaluation (**Figure 4**), to assess this SSR resource pool by randomly selecting 50 SSR loci for primer design and marker screening. After the three steps of evaluation and filtering, we retained 11 optimized SSR markers that can be used for a preliminary genetic diversity analysis (**Supplementary Tables S8**, **S9**). The polymorphism information content (PIC) of each marker usually reflects the general diversity in the genetic analysis of a species. According to the PIC values of the 11 SSR markers in four artificially cultivated populations (30 samples in each population), the average PIC value was 0.322 (**Table 3**), which indicated that the markers, and hence the general genetic diversity of A. grahami, were reasonably informative (Botstein et al., 1980).
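For readers who want to reproduce a PIC value, the statistic of Botstein et al. (1980) for a marker with allele frequencies p<sub>i</sub> is PIC = 1 − Σp<sub>i</sub><sup>2</sup> − Σ<sub>i&lt;j</sub> 2p<sub>i</sub><sup>2</sup>p<sub>j</sub><sup>2</sup>. The following is a minimal Python sketch of that formula; the function name `pic` and the example allele frequencies are hypothetical illustrations and are not data or software from this study.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content (Botstein et al., 1980).

    allele_freqs: allele frequencies at one locus (should sum to 1).
    """
    # Expected homozygosity term: sum of squared allele frequencies.
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction for matings between identical heterozygotes,
    # which are uninformative: sum over i < j of 2 * p_i^2 * p_j^2.
    uninformative = sum(
        2.0 * (p * p) * (q * q) for p, q in combinations(allele_freqs, 2)
    )
    return 1.0 - homozygosity - uninformative


# Hypothetical loci (illustrative frequencies only).
biallelic = pic([0.5, 0.5])          # 1 - 0.5 - 0.125 = 0.375
monomorphic = pic([1.0])             # no polymorphism, PIC = 0
four_alleles = pic([0.25, 0.25, 0.25, 0.25])
```

Under the commonly used thresholds from the same reference, a marker with PIC above 0.5 is considered highly informative, one between 0.25 and 0.5 reasonably informative, and one below 0.25 only slightly informative, which is how an average of 0.322 is interpreted here.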

Maintenance of genetic diversity is the major objective of most conservation and utilization projects, so that populations can face future environmental challenges and respond to long-term selection, either natural or artificial, for traits of economic and cultural interest (Sharma et al., 2016). From the perspective of conservation, reintroduction is the most popular technique for re-establishing populations of endangered species within their historic range (IUCN, 1998). However, the success of such projects largely depends on the corresponding long-term management of genetic diversity, population structure, levels of inbreeding and other relevant parameters (Tollington et al., 2013). For A. grahami, an endangered fish that has undergone a drastic population decline in recent decades, reintroduction to Fuxian Lake has become the major way to re-establish its wild populations. Therefore, the artificially cultivated populations from fish breeding farms have been the main sources for the present and future wild populations. Based on the four artificially cultivated populations, we found that the general genetic diversity of A. grahami was moderate, and the inbreeding level within each of the four populations was relatively low (**Table 3**). This suggests that the present genetic diversity of A. grahami is not necessarily a cause for pessimism; however, a complete picture of its genetic diversity and population structure based on broader sample coverage has yet to be uncovered.

In summary, besides the newly assembled genome resource, the 33,836 identified SSR loci provide another useful genetic resource for long-term explorations of this "3E" species. In particular, the 11 optimized SSR loci screened in this study will provide practical genetic tools for near-term genetic and conservation studies.

# DATA AVAILABILITY

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession RJVU00000000 with a BioProject ID of PRJNA477399. The version described in this paper is version RJVU01000000.

# AUTHOR CONTRIBUTIONS

JXY, WJ, QS, and LC conceived the project and designed the scientific objectives. WJ, XP, YZ, XW, KY, CS, and QL collected and prepared the fish samples. YQ, YL, CB, JL, XY, JC, and JLY conducted bioinformatics analysis. WJ and YZ performed the SSR development and experiments. WJ, YQ, and XP prepared the manuscript. QS, JXY, and LC revised the manuscript. All authors have read and approved the final manuscript.

# FUNDING

This work was supported by the Innovation and Enhancement Program (2016AB024), the Basic Research Program (2018FB047 and 2016FA044), and the construction and people program (2015DA008 and 2014HB053) of the Yunnan Provincial Science and Technology Department; and by the National Natural Science Foundation of China (31672282 and U1702233).

# ACKNOWLEDGMENTS

We would like to thank Prof. Richard Winterbottom (Royal Ontario Museum, Toronto, ON, Canada) for reviewing and revising the writing of this paper; we also thank Mr. Zaiyun Li and Mr. Yapeng Zhao for their assistance in sample collection.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00614/full#supplementary-material

# REFERENCES

in paternity assignment. Mol. Ecol. 16, 1099–1106. doi: 10.1111/j.1365-294X.2007.03089.x

recombination activating gene 2 sequences. Mol. Phylogenet. Evol. 42, 157–170. doi: 10.1016/j.ympev.2006.06.014


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jiang, Qiu, Pan, Zhang, Wang, Lv, Bian, Li, You, Chen, Yang, Yang, Sun, Liu, Cheng, Yang and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Population Genomic Structure and Genome-Wide Linkage Disequilibrium in Farmed Atlantic Salmon (Salmo salar L.) Using Dense SNP Genotypes

Agustin Barria<sup>1</sup>†, Maria E. López<sup>1</sup>†, Grazyella Yoshida<sup>2</sup>, Roberto Carvalheiro<sup>2</sup>, Jean P. Lhorente<sup>3</sup> and José M. Yáñez<sup>1,3,4</sup>\*

<sup>1</sup> Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile, La Pintana, Chile, <sup>2</sup> Faculdade de Ciências Agrárias e Veterinárias, Universidade Estadual Paulista Júlio de Mesquita Filho, Jaboticabal, Brazil, <sup>3</sup> Benchmark Genetic S.A., Puerto Montt, Chile, <sup>4</sup> Nucleo Milenio INVASAL, Concepción, Chile

### Edited by:

Paulino Martínez, University of Santiago de Compostela, Spain

### Reviewed by:

Silvia Teresa Rodriguez Ramilo, Institut National de la Recherche Agronomique (INRA), France

Roger Luis Vallejo, United States Department of Agriculture, United States

### \*Correspondence:

José M. Yáñez jmayanez@uchile.cl †These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 01 September 2018 Accepted: 30 November 2018 Published: 14 December 2018

### Citation:

Barria A, López ME, Yoshida G, Carvalheiro R, Lhorente JP and Yáñez JM (2018) Population Genomic Structure and Genome-Wide Linkage Disequilibrium in Farmed Atlantic Salmon (Salmo salar L.) Using Dense SNP Genotypes. Front. Genet. 9:649. doi: 10.3389/fgene.2018.00649

Chilean farmed Atlantic salmon (Salmo salar) populations were established with individuals of both European and North American origins. These populations are expected to be highly genetically differentiated due to their evolutionary history and poor gene flow between ancestral populations from different continents. The extent and decay of linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs) affect the implementation of genome-wide association studies and genomic selection and provide relevant information about demographic processes of fish populations. We assessed the population structure and characterized the extent and decay of LD in three Chilean commercial populations of Atlantic salmon of North American (NAM), Scottish (SCO), and Norwegian (NOR) origin. A total of 123 animals were genotyped using a 159 K SNP Axiom® myDesign™ Genotyping Array. A total of 32 K SNP markers, representing the SNPs common to the three populations after quality control, were used. The first two principal components explained 78.9% of the genetic variation, clearly discriminating between populations of North American and European origin, and also between the European populations. NAM had the lowest effective population size, followed by SCO and NOR. Large differences in LD decay were observed between populations of North American and European origin. LD decayed to an r<sup>2</sup> of 0.2 for marker pairs separated by approximately 7,800, 64, and 50 kb in the NAM, SCO, and NOR populations, respectively. In this study we show that this SNP panel can be used to detect associations between markers and traits of interest and also to capture high-resolution information for genome-enabled predictions. We also suggest that similar prediction accuracies could be achieved with a smaller SNP data set for the NAM population than for the populations of European origin, which would need a higher-density SNP array.

Keywords: linkage disequilibrium, Salmo salar, selective breeding, GWAS, population structure

# BACKGROUND

Atlantic salmon (Salmo salar) is one of the farmed fish species with the highest commercial value in aquaculture (FAO, 2016a). Chile is the second largest producer, generating nearly 532,000 tons of this species in 2016 (FAO, 2016b). All of the Atlantic salmon populations farmed in Chile were introduced from three main geographical origins: (i) North America, (ii) Scotland, and (iii) Norway. These populations also represent the main origins of cultured Atlantic salmon worldwide. Breeding programs for Atlantic salmon were first established in Norway during the early 1970s (Gjedrem et al., 2012). Since then, there has been increasing interest in implementing genetic improvement programs for salmon in the most important producer countries, including Australia, Chile, Iceland, Ireland, Scotland and Norway. The main traits included in the breeding objectives for Atlantic salmon are growth, disease resistance, carcass quality and age at sexual maturation (Rye et al., 2010).

Recent advances in next-generation sequencing and high-throughput genotyping technologies have allowed the development of valuable genomic resources in aquaculture species (Yáñez et al., 2015). For instance, dense single nucleotide polymorphism (SNP) panels have been developed for Atlantic salmon (Houston et al., 2014; Yáñez et al., 2016). Genetic evaluations for traits that are difficult to measure in selection candidates, such as disease resistance and carcass quality traits, can be more accurate when integrating genome-wide SNP information, in what has been called genomic selection (Meuwissen et al., 2001; Sonesson and Meuwissen, 2009). Genomic selection exploits the linkage disequilibrium (LD) that exists between SNPs and quantitative trait loci (QTL) or causative mutations that are involved in the variation of the trait (Goddard and Hayes, 2009), increasing the accuracy of genome-enabled estimated breeding values (GEBVs) in farmed salmon species (Ødegård et al., 2014; Tsai et al., 2016; Bangera et al., 2017; Vallejo et al., 2017; Yoshida et al., 2018; Barria et al., 2018b). Furthermore, association mapping through genome-wide association studies (GWAS) is a useful approach to detect genomic regions and genes involved in economically important traits for salmon aquaculture, and it also relies on LD between the QTL and SNP markers. Thus, an adequate SNP density is required to ensure that all QTL are in LD with a marker (Flint-Garcia et al., 2003).

In addition, knowledge of the extent and pattern of LD can help to explore the different evolutionary forces that may affect certain regions of the genome (Ardlie et al., 2002). Because it is affected by population growth, genetic drift, admixture or migration, population structure, variable recombination rates and artificial/natural selection, LD can vary among populations and loci (Ardlie et al., 2002). Different measures of LD between two loci have been proposed; among them, the absolute value of D' (also called Lewontin's D') and r<sup>2</sup> are the most widely used. D' = 1 indicates no recombination between loci and complete LD, while values less than 1 indicate that loci have been separated by recombination. D' is overestimated when sample sizes are small or minor allele frequencies are low, so high values of D' can be obtained even when markers are in linkage equilibrium (Ardlie et al., 2002). For this reason, r<sup>2</sup>, the squared correlation between alleles at two loci, is the most accepted measure for comparing and quantifying LD (Pritchard and Przeworski, 2001).
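For reference, the standard textbook definitions of these measures can be written as follows for two biallelic loci with allele frequencies p<sub>A</sub> and p<sub>B</sub> and haplotype frequency p<sub>AB</sub>. The notation is introduced here for illustration and is not taken from the article.

$$D = p_{AB} - p_{A}p_{B}, \qquad |D'| = \frac{|D|}{D_{\max}}, \qquad r^{2} = \frac{D^{2}}{p_{A}(1-p_{A})\,p_{B}(1-p_{B})}$$

$$D_{\max} = \begin{cases} \min\{p_{A}(1-p_{B}),\ (1-p_{A})p_{B}\} & \text{if } D > 0 \\ \min\{p_{A}p_{B},\ (1-p_{A})(1-p_{B})\} & \text{if } D < 0 \end{cases}$$

As a hypothetical worked example, take p<sub>A</sub> = 0.05, p<sub>B</sub> = 0.5 and p<sub>AB</sub> = 0.05. Then D = 0.05 − 0.025 = 0.025 and D<sub>max</sub> = min{0.025, 0.475} = 0.025, so |D'| = 1, whereas r<sup>2</sup> = 0.025<sup>2</sup> / (0.05 × 0.95 × 0.5 × 0.5) ≈ 0.053. A rare allele can therefore give complete LD by D' while the two loci are only weakly correlated, which illustrates why r<sup>2</sup> is preferred when marker frequencies differ.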

To date, several studies have been performed to determine the levels and extent of LD in livestock species such as dairy cattle (Khatkar et al., 2008; Bohmanova et al., 2010), beef cattle (McKay et al., 2007; Lu et al., 2012; Espigolan et al., 2013; Porto-Neto et al., 2014), pigs (Badke et al., 2012; Ai et al., 2013), goats (Mdladla et al., 2016; Visser et al., 2016), and sheep (Prieur et al., 2017). Moreover, some studies have related patterns of LD to genomic regions subjected to selection in domestic species (Prasad et al., 2008). Recent studies have also aimed at characterizing the levels of LD in farmed aquaculture species, such as rainbow trout (Rexroad and Vallejo, 2009; Vallejo et al., 2018), coho salmon (Barria et al., 2018a), and Atlantic salmon (Kijas et al., 2017). However, until now there have been no comprehensive studies characterizing and comparing the levels and extent of LD in commercial Atlantic salmon populations that include the three main geographical origins. The goals of this study were to (a) assess the levels of LD in farmed Atlantic salmon populations with three different geographical origins (i.e., Canada, Scotland, and Norway); (b) calculate the effective population size for each breeding population; and (c) estimate the population structure and genetic admixture of each population.

# MATERIALS AND METHODS

# Populations and Samples

The current study comprises 123 Atlantic salmon individuals from three different commercial populations cultivated in the south of Chile, which have different geographical origins. These fish were obtained from Chilean farmed populations, which originated from imported stocks. The Norwegian population (NOR) comprised 43 fish belonging to a breeding population derived from the Mowi strain, which is the oldest farmed population established in Norway. This strain was established in the late 1960s using fish from west coast rivers in Norway, River Bolstad in the Vosso watercourse, River Årøy and the Maurangerfjord area (Verspoor et al., 2007). Ova of this strain were introduced into Ireland from 1982 to 1986 (Norris et al., 1999) and from there, they were imported to Chile for farming purposes in the 1990s (Solar, 2009). Since 1997 this population has been selected for rapid growth in Chile (Yáñez et al., 2013, 2014; Correa et al., 2015, 2017). A second population of 43 fish of Scottish origin (SCO) comprised samples from a strain derived from fish from Loch Lochy, located on the west coast of Scotland. Fish of this strain are described as a stock with rapid growth potential and a high early maturation (grilsing) rate (Johnston et al., 2000). During the 1980s, eggs from the Scottish population were introduced to Chile to establish an aquaculture broodstock. The third population used in this study comprised 37 fish of North American (NAM) origin, belonging to a domestic strain established in the 1950s using ova from the Gaspé Bay (QC, Canada). It is presumed that fish of this strain were transferred to and kept at an aquaculture hatchery located in the state of Washington, United States, for two generations. Fertilized eggs of this strain were introduced from Washington to Chile between 1996 and 1998 (López et al., 2018). Since their introduction in Chile, these lines have been maintained separately and no crosses between them have been performed. The mean relatedness among individuals was estimated within each population using Plink v1.90 (Purcell et al., 2007). The estimated values were 0.18 (0.19), 0.05 (0.08), and 0.05 (0.06) for the NAM, SCO, and NOR populations, respectively.

# Genotyping

Fin clip samples from individuals from the three populations were obtained for genomic DNA extraction and subsequent genotyping. Genotyping was carried out using a 200 K Affymetrix Axiom® myDesign Custom Array as described by Yáñez et al. (2016). This dense SNP array contains 151,509 polymorphic SNPs with unique positions that are evenly distributed across the genome. A total of 2,302 (1.6%) SNPs were discarded prior to analysis due to their unknown chromosomal location on the S. salar reference genome (Yáñez et al., 2016). Initial quality control of genotypes was performed using Axiom Genotyping Console (AGT, Affymetrix) and SNPolisher for R, according to the Best Practices procedures indicated by the array manufacturer<sup>1</sup>. Further quality control (QC) was performed using PLINK software v1.09, separately for each population. SNPs with a minor allele frequency (MAF) lower than 5%, a significant deviation from Hardy–Weinberg Equilibrium (HWE) (p < 1e-6), or a call rate lower than 95% were excluded. Samples with more than 5% missing genotypes were also excluded. All subsequent analyses were done using the SNPs common to the three populations after QC.
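The MAF and call-rate filters described above can be illustrated with a short Python sketch. This is not the PLINK implementation used in the study: it assumes a hypothetical samples × SNPs matrix of genotypes coded 0/1/2 with `np.nan` for missing calls, the function name `snp_qc_mask` is invented for illustration, and the HWE test is omitted for brevity.

```python
import numpy as np


def snp_qc_mask(genotypes, min_maf=0.05, min_call_rate=0.95):
    """Return a boolean mask of SNPs passing call-rate and MAF filters.

    genotypes: array (n_samples, n_snps), coded 0/1/2, np.nan = missing.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    call_rate = n_called / g.shape[0]
    # Frequency of the counted allele among called genotypes
    # (each diploid genotype carries two alleles).
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.nansum(g, axis=0) / (2.0 * n_called)
    maf = np.minimum(freq, 1.0 - freq)
    # SNPs with no calls give maf = nan, which fails the comparison.
    return (call_rate >= min_call_rate) & (maf >= min_maf)


# Hypothetical toy data: 4 samples x 3 SNPs.
toy = np.array([
    [0, 0, 1],
    [1, 0, np.nan],
    [2, 0, 1],
    [1, 0, 1],
])
mask = snp_qc_mask(toy)  # SNP 0 passes; SNP 1 is monomorphic; SNP 2 has call rate 0.75
```

Applying such a mask separately to each population and then intersecting the three masks would give the set of SNPs common to all populations, which is the design choice the text describes.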

# Population Structure and Genetic Admixture Analysis

To investigate genetic structure among populations, we performed a principal component analysis (PCA) implemented in PLINK v1.09. The first two principal components were plotted along two axes in R. Additionally, we used the hierarchical Bayesian model implemented in the STRUCTURE software, with a burn-in of 20,000 iterations followed by 50,000 iterations, and three replicates. Subsequently, we computed the posterior probability of each K value according to Pritchard et al. (2000), to choose the best K assuming a uniform prior on K between 1 and 10.
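Under a uniform prior on K, the posterior probability of each K is proportional to exp(ln Pr(X|K)), where ln Pr(X|K) is the estimated log probability of the data reported for each run. A minimal Python sketch of this normalization is given below; the function name `posterior_of_k` and the log-probability values are hypothetical and are not results from this study.

```python
import math


def posterior_of_k(log_prob_by_k):
    """Posterior Pr(K | X) under a uniform prior on K.

    log_prob_by_k: dict mapping K -> estimated ln Pr(X | K).
    """
    # Subtract the maximum before exponentiating to avoid underflow,
    # since the log probabilities are typically large negative numbers.
    top = max(log_prob_by_k.values())
    weights = {k: math.exp(v - top) for k, v in log_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical ln Pr(X | K) values for K = 1..3 (illustration only).
example = posterior_of_k({1: -4356.0, 2: -3983.0, 3: -3982.0})
best_k = max(example, key=example.get)
```

Because the log probabilities usually differ by many units between values of K, the posterior tends to concentrate almost entirely on one or two values, so replicate runs are useful for checking that the ranking is stable.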

# Estimation of LD

Linkage disequilibrium measured as Pearson's squared correlation coefficient (r<sup>2</sup>) was chosen over |D'| to quantify the LD between each pair of molecular markers. This statistic is less sensitive to bias caused by differences in allelic frequencies (Ardlie et al., 2002), is more appropriate for biallelic markers (Zhao et al., 2005) and can be used to compare the results with previous studies in salmonid species and other domestic animals. Genotypes were coded as 2, 1, and 0 according to the number of non-reference alleles. The pairwise LD as r<sup>2</sup> was calculated for each population and within chromosomes with Plink v1.09, using the formula proposed by Hill and Robertson (1968). SNP pairs were grouped into bins of 100 kb according to their pairwise physical distance. The extent and decay of LD were visualized by plotting the average r<sup>2</sup> within each bin from 0 up to 10 Mb, using R software (R Core Team, 2016).
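The binning step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the PLINK/R workflow of the study: r<sup>2</sup> is approximated by the squared Pearson correlation of 0/1/2 genotype codes, the input is a hypothetical samples × SNPs matrix for a single chromosome with no missing or monomorphic markers, and the function name `ld_decay` is invented here.

```python
import numpy as np


def ld_decay(genotypes, positions_bp, bin_size_bp=100_000, max_dist_bp=10_000_000):
    """Mean genotype-based r^2 per physical-distance bin.

    genotypes: array (n_samples, n_snps), coded 0/1/2, one chromosome.
    positions_bp: SNP positions in base pairs.
    Returns an array whose k-th entry is the mean r^2 of SNP pairs with
    distance in [k * bin_size_bp, (k + 1) * bin_size_bp); empty bins are nan.
    """
    g = np.asarray(genotypes, dtype=float)
    pos = np.asarray(positions_bp)
    # Squared Pearson correlation between every pair of SNP columns.
    r2 = np.corrcoef(g, rowvar=False) ** 2
    i, j = np.triu_indices(pos.size, k=1)
    dist = np.abs(pos[j] - pos[i])
    keep = dist < max_dist_bp
    bins = (dist[keep] // bin_size_bp).astype(int)
    sums = np.bincount(bins, weights=r2[i, j][keep])
    counts = np.bincount(bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        return sums / counts


# Hypothetical toy data: 4 samples x 3 SNPs.
toy_geno = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [2, 2, 1],
    [1, 1, 2],
])
toy_pos = np.array([0, 50_000, 250_000])
decay = ld_decay(toy_geno, toy_pos)
```

In the toy data the first two SNPs are identical and 50 kb apart, so the first bin has a mean r<sup>2</sup> of 1, while the third SNP is uncorrelated with both and falls in the third bin with a mean r<sup>2</sup> of 0.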

# Effective Population Size

Historical effective population size (N<sub>e</sub>) was estimated using SNeP v1.1 (Barbato et al., 2015). SNeP estimates N<sub>e</sub> from LD data through the following formula proposed by Corbin et al. (2012):

$$N_{t} = \frac{1}{4f(c_{t})}\left(\frac{1}{E[r_{\rm adj}^{2} \mid c_{t}]} - \alpha\right)$$

where N<sub>t</sub> is the effective population size t generations ago; c<sub>t</sub> is the recombination rate t generations ago, which is proportional to the physical distance between SNP markers; r<sup>2</sup><sub>adj</sub> is the estimated LD adjusted for sample size; and α is the adjustment for mutation rate. As proposed by Tenesa et al. (2007), we used α = 2, considering that mutation does occur. The minimum and maximum distances between SNPs used for N<sub>e</sub> estimation were 0 and 5 Mb, respectively. Data were grouped in 30 distance bins of 50 kb each. Finally, N<sub>e</sub> was estimated from the r<sup>2</sup> values calculated for the mean distance of each distance bin. Considering the relatively small number of SNPs per chromosome, the estimated N<sub>e</sub> per chromosome was calculated using the harmonic mean (Alvarenga et al., 2018). Contemporary effective population size for each population was estimated using NeEstimator v2.01 (Do et al., 2014). Briefly, estimation was based on the LD method, with a critical value (Pcrit) of 0.05 and a non-random mating model.
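As an illustration of how the LD-based estimator of Corbin et al. (2012) can be evaluated, the following Python sketch uses the form N<sub>t</sub> = (1 / (4 f(c<sub>t</sub>))) (1 / E[r<sup>2</sup><sub>adj</sub> | c<sub>t</sub>] − α). It is not the SNeP code. It assumes f(c) = c, a sample-size correction r<sup>2</sup><sub>adj</sub> = r<sup>2</sup> − 1/(βn) with β = 1 for unphased and β = 2 for phased data, and the usual approximation t = 1/(2c) for the number of generations in the past; all function names and input values are hypothetical.

```python
def adjust_r2(r2, n_samples, phased=False):
    """Sample-size adjustment of r^2 (assumed form: r^2 - 1 / (beta * n))."""
    beta = 2 if phased else 1
    return r2 - 1.0 / (beta * n_samples)


def ne_from_ld(r2_adj, c, alpha=2.0):
    """Effective population size from mean adjusted r^2 at recombination rate c.

    c is in Morgans; alpha = 2 corresponds to accounting for mutation.
    Assumes f(c) = c.
    """
    return (1.0 / (4.0 * c)) * (1.0 / r2_adj - alpha)


def generations_ago(c):
    """Approximate number of generations in the past reflected by LD at rate c."""
    return 1.0 / (2.0 * c)


# Hypothetical example: a distance bin centred on 0.5 Mb, assuming
# 1 cM per Mb (c = 0.005 Morgans) and a mean adjusted r^2 of 0.2.
ne = ne_from_ld(0.2, 0.005)      # (1 / 0.02) * (5 - 2) = 150
t = generations_ago(0.005)       # 100 generations ago
```

Because c appears in the denominator, closely spaced marker pairs inform about the distant past and widely spaced pairs about recent generations, which is why the distance bins map onto a time series of N<sub>e</sub> values.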

# RESULTS

# SNP Quality Control

No animals from the three populations were removed after quality control, giving genotype data from 123 individuals (37, 43, and 43 from NAM, SCO, and NOR, respectively). A total of 40,316 (27.02%), 113,282 (75.92%), and 136,446 (91.46%) SNP markers passed the QC criteria for the NAM, SCO and NOR populations, respectively. The number of filtered SNPs differed markedly between the populations of North American and European origin. In the NAM population, 106 K SNPs were excluded because of low MAF, representing 70% of the total available markers in the array. The markers excluded by MAF in the SCO and NOR populations reached 23 and 7.8%, respectively. The number of fish genotyped from each population, the number of SNPs excluded by HWE and MAF, and the final number of SNPs per population are summarized in **Table 1**. After QC, a total of 31,978 SNPs common to the three populations were identified. These 32 K SNPs were used for all subsequent analyses.

<sup>1</sup>http://media.affymetrix.com/support/downloads/manuals/axiom\_best\_practice\_supplement\_user\_guide.pdf

TABLE 1 | Summary of results from quality control of SNPs for each farmed population genotyped with the 200K SNP array.


<sup>1</sup>Geographical origin of each farmed population: NAM, North America; SCO, Scotland; NOR, Norway.

<sup>2</sup>Number of genotyped samples.

<sup>3</sup>Number of excluded SNPs by the genotype call rate <0.95.

<sup>4</sup>Number of excluded SNPs by Hardy–Weinberg Equilibrium.

<sup>5</sup>Number of excluded SNPs with minor allele frequency <0.05.

<sup>6</sup>Number of SNPs retained for each population.

TABLE 2 | Estimated chromosome length and average linkage disequilibrium values for three Chilean farmed populations of Atlantic salmon with Norwegian (NOR), Scottish (SCO), and North American (NAM) origin.


SD in parenthesis. <sup>a</sup>Chromosome; <sup>b</sup>effective population size.

# Summary Statistics for Each Population

Summary statistics of each chromosome's length, average r<sup>2</sup> and N<sub>e</sub> estimates among SNPs for each chromosome and population are shown in **Table 2**. The markers spanned 2,218.6 Mb of the Atlantic salmon genome, encompassing 70% of the total sequence length (assuming a S. salar genome size of 2.96 Gb based on the last assembly GCA\_000233375.4). Average r<sup>2</sup> between adjacent SNPs reached up to 0.26 ± 0.28 in the NAM population. These values were higher than for the SCO and NOR populations (0.11 ± 0.14 and 0.07 ± 0.10, respectively).

The average LD, measured as r<sup>2</sup>, between adjacent markers across the 29 chromosomes ranged from 0.16 to 0.35, 0.08 to 0.14, and 0.05 to 0.09 in the NAM, SCO, and NOR populations, respectively (**Table 2**). These results indicate that average levels of LD among syntenic SNPs are considerably lower in both populations of European origin than in the population of North American origin. Also, LD in the SCO population is slightly higher than in the NOR population. For each population, effective population size by chromosome was calculated up to 180 generations ago, except for Ssa08, Ssa26, and Ssa28, for which N<sub>e</sub> was estimated up to 55 generations ago. Estimates were lowest for all chromosomes in the NAM population, while the highest values were estimated in the population of Norwegian origin (**Table 2**).

The 32 K common markers are uniformly distributed along the 29 chromosomes, with an average SNP density per chromosome per Mb ranging from 8.39 to 20.55, with a mean of 14.10 ± 3.14 (**Supplementary Table S1**). The three populations showed similar mean MAFs of 0.26 ± 0.13, 0.29 ± 0.13, and 0.32 ± 0.12 for the NAM, SCO, and NOR populations, respectively. The mean MAF per chromosome ranged from 0.22 to 0.29 in the NAM population. For the populations of European origin, the MAF ranged from 0.24 to 0.31 and from 0.31 to 0.34 for the SCO and NOR populations, respectively. The proportion of loci with MAF higher than 0.20 ranged from 0.20 to 0.34 across the three populations. For loci with MAF between 0.05 and 0.09, the proportion reached up to 0.13, 0.09, and 0.04 for the NAM, SCO, and NOR populations, respectively (**Figure 1**).

# Population Structure

Principal components 1 and 2 together accounted for 78.9% of the total genetic variation (**Figure 2**). These components clearly revealed three different clusters, corresponding to the Atlantic salmon of North American (NAM), Scottish (SCO), and Norwegian (NOR) origin. The first principal component discriminated between the populations of North American and European origin and accounted for 55.2% of the total variation. The second principal component accounted for 23.7% of the total variance and divided the two European populations into two clusters, corresponding to the Scottish and Norwegian populations, respectively. According to the STRUCTURE analysis, using the 31,978 SNPs common to the three populations, we obtained a best K = 8 by computing the posterior probabilities of each K. The NAM population presented the highest level of admixture, while SCO presented the lowest (**Figure 3**). STRUCTURE results for K values from 2 to 10 are presented in **Supplementary Figure S1**, while posterior probabilities are shown in **Supplementary Table S2**.

# Linkage Disequilibrium Decay

The LD decay was estimated for each population as a function of physical distance. SNP pairs were sorted into 100-kb bins based on the distance between pairs. Average r<sup>2</sup> values were estimated for each bin. As estimated in other domestic animals (Badke et al., 2012; Makina et al., 2015; Kijas et al., 2017; Barria et al., 2018a), genome-wide average LD declines with increasing physical distance between markers. **Figure 4** shows an overview of the decay of r<sup>2</sup> as a function of distance for each population. A slow decay was observed for the NAM population, while the decay was faster in both populations of European origin. The average distance at which the LD value reached 0.2 varied between populations. For the NAM population the distance reached ∼7,800 kb. For the SCO and NOR populations, the distance decreased drastically, to ∼64 and ∼50 kb, respectively. Average r<sup>2</sup> for the first bin and for distances of 0.5, 1.0, 5.0, and 10.0 Mb is shown in **Table 3**. Mean r<sup>2</sup> within the first bin was largest for the NAM population (r<sup>2</sup> = 0.62), followed by the SCO and NOR populations (r<sup>2</sup> = 0.36 and 0.35, respectively). Average r<sup>2</sup> for SNP pairs with a mean distance of 1.0 Mb was 0.31, 0.13, and 0.08 for the NAM, SCO, and NOR populations, respectively. These values decreased to 0.19, 0.08, and 0.06 when the average distance between SNPs reached 10.0 Mb for NAM, SCO, and NOR, respectively.

FIGURE 2 | Genetic differentiation of Atlantic salmon populations revealed by principal component analysis. Principal component analysis for three Chilean Atlantic salmon breeding populations with different geographical origins: North American (NAM), Scottish (SCO), and Norwegian (NOR).

# Effective Population Size

Estimated effective population size differed among populations. **Figure 5** shows the historical trends in N<sub>e</sub> up to 85 (**Figure 5A**) and 1,516 (**Figure 5B**) generations ago, respectively. Within this range of generations, Atlantic salmon of North American origin had the smallest N<sub>e</sub>, followed by the SCO and NOR populations. These N<sub>e</sub> values ranged from 15 to 574; 44 to 1,346; and from 72 to 1,325 for the NAM, SCO, and NOR populations, respectively. Contemporary N<sub>e</sub> estimations based on LD reached up to 7.5, 107, and 160 for the NAM, SCO, and NOR populations, respectively.
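To show how LD at a given linkage distance translates into an estimate of N<sub>e</sub> at a given time in the past, the sketch below uses the widely cited approximation E[r<sup>2</sup>] ≈ 1/(1 + 4N<sub>e</sub>c) attributed to Sved (1971), with t ≈ 1/(2c) generations. This is an assumption made for illustration and is not necessarily the estimator behind the values reported here; the function names and input values are hypothetical, and converting physical distance to Morgans requires a recombination rate that is not given in this section.

```python
def ne_from_r2(mean_r2, distance_morgans):
    """N_e = (1 / (4c)) * (1 / r^2 - 1), with c the linkage distance in Morgans."""
    if not 0.0 < mean_r2 < 1.0:
        raise ValueError("mean_r2 must lie strictly between 0 and 1")
    if distance_morgans <= 0.0:
        raise ValueError("distance_morgans must be positive")
    return (1.0 / (4.0 * distance_morgans)) * (1.0 / mean_r2 - 1.0)

def generations_ago(distance_morgans):
    """Approximate number of generations in the past, t = 1 / (2c)."""
    return 1.0 / (2.0 * distance_morgans)

# Hypothetical input: mean r^2 of 0.2 for SNP pairs 0.01 Morgans apart.
ne = ne_from_r2(0.2, 0.01)
t = generations_ago(0.01)
```

Under this approximation, tightly linked SNP pairs inform on N<sub>e</sub> in the distant past, whereas loosely linked pairs inform on recent generations, which is why a single LD decay curve yields a historical trend.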

# DISCUSSION

The study of extent and decay of whole-genome LD can aid in the understanding of demographic processes experienced by populations. Processes such as founder effect, admixture and genetic drift in conjunction with recombination and mutations are key elements determining LD. Similarly, other factors that affect LD include inbreeding, admixture and selection (Gaut and Long, 2003); which has resulted in studies aimed at estimating LD variation between populations (Ai et al., 2013; Yang et al., 2014; Al-mamun et al., 2015; Mdladla et al., 2016).

This is the first study aimed at characterizing the decay and extent of LD in three different Atlantic salmon breeding populations established in Chile, representing the three main geographical origins of this cultured species in Chile and worldwide. All three populations have been subjected to artificial selection for growth-related traits. The individuals used in this study were selected so that they shared no common ancestors for three generations back, to avoid the inflated LD estimates that are likely to arise from high kinship (Gutierrez et al., 2015).

The results presented here indicate differences in the average level of LD across the genome between these Atlantic salmon populations. Large differences between North American and European populations were expected, considering that they belong to two different lineages probably separated by more than 1,000,000 years (Rougemont and Bernatchez, 2018). The long-range LD found in the NAM population is likely a consequence of admixture, as has been suggested in Atlantic salmon (Ødegård et al., 2014; Rougemont and Bernatchez, 2018) and rainbow trout (Vallejo et al., 2018) populations. However, it could also reflect demographic processes during strain formation, as well as demographic events in its wild progenitors. It is well known that North American populations of Atlantic salmon have lower genetic diversity than European populations (Bourret et al., 2013; Makinen et al., 2014), as migrations probably favored only a few individuals colonizing North America, reducing effective population sizes and causing a major effect of genetic drift (Rougemont and Bernatchez, 2018) on patterns and degree of LD, especially in closely linked loci (Kruglyak, 1999). Additionally, artificial selection has probably been stronger in the NAM population than in SCO, so the higher level of LD in the former may also result from the breeding process. On the other hand, the lowest level of LD, found in NOR, is consistent with its more diverse origin (Norris et al., 1999). The NOR population was established in the early 1970s using fish from several rivers on the west coast of Norway, which probably favored greater genetic diversity than in NAM and SCO.

Recent events of admixture decrease the short-range LD present in original populations (Ødegård et al., 2014) and haplotypes with high LD levels are shorter in highly admixed populations (Toosi et al., 2010). Admixture can also generate long-range LD, which could be captured by lower density SNP panels (Ødegård et al., 2014; Vallejo et al., 2018). Conversely, the highest overall levels of LD present in the SCO and NAM populations may reflect the unique origin of these populations, without recent introgression of different genetic material, and a small effective population size, as has been shown in Tasmanian Atlantic salmon (Kijas et al., 2017) and Pacific salmon (Barria et al., 2018a). On the other hand, we suggest that the admixed origin of the NOR population could have caused an elevated extent of LD, but that more than four decades of domestication and artificial selection have been enough to break down the initial pattern and levels of LD in this population.

Our results suggest that the effect of these demographic features is more extreme in the NAM population, which has the highest level of LD of the three populations analyzed. In general terms, LD varied moderately between chromosomes in the NAM population, suggesting a variation in autosomal recombination rate which could be associated with genetic drift or artificial selection (Arias et al., 2009). Population analysis using STRUCTURE yielded the best K = 8, showing the highest level of admixture in NAM, which agrees with the hypothesis that the American population of Atlantic salmon was founded by multiple European sources (Rougemont and Bernatchez, 2018). In addition, the PCA shows clear genetic differentiation among the three populations, with PC1 separating NAM from the European populations. This confirms the great divergence between European and North American Atlantic salmon and agrees with the STRUCTURE results, which show a clear differentiation between the NAM and both European populations.

The lower levels of SNP variability observed in the NAM population were expected, considering that North American salmon populations have lower genetic diversity than European populations (Bourret et al., 2013; Makinen et al., 2014). This may also be attributed to the ascertainment bias caused by the prioritization of SNP markers segregating in NOR and SCO populations in the design of the SNP array used in the present study (Yáñez et al., 2016). A similar situation has been observed when evaluating the performance of SNP panels that were designed to account for the variability in European populations of Atlantic salmon and were then applied to Tasmanian farmed Atlantic salmon populations of North American origin (Dominik et al., 2010; Kijas et al., 2017). To reduce this bias in the genetic differentiation analysis, we used a common subset of approximately 32 K SNPs. However, this does not ensure better estimates of genetic diversity; consequently, these results should be interpreted with caution, as suggested by Rougemont and Bernatchez (2018). Inaccurate estimates of genetic parameters due to ascertainment bias could be avoided by using an SNP array developed specifically with North American Atlantic salmon samples, which would provide more information about local variation.

It has been suggested that the minimum number of individuals needed for accurate LD estimations using r<sup>2</sup> ranges between 55 and 75 individuals, increasing to more than 400 for |D'| (Khatkar et al., 2008; Bohmanova et al., 2010). However, an accurate estimation of LD measured as r<sup>2</sup> has been obtained in Pacific salmon using 62 individuals (Barria et al., 2018a). Because of the relatively small sample size of each population (37, 43, and 43 for the NAM, SCO, and NOR populations, respectively), we measured LD decay as r<sup>2</sup> instead of |D'|. Furthermore, estimates of r<sup>2</sup> are less susceptible to overestimation and are more useful for predicting the power of association mapping (Ardlie et al., 2002; Bohmanova et al., 2010). A significant linear association between chromosome length and LD (as r<sup>2</sup>) has previously been reported in Nellore cattle (Espigolan et al., 2013). We found a significant association between these variables only in the NOR population (p < 0.05). Like Bohmanova et al. (2010), we found no association in the SCO and NAM populations, which could be due to lower marker density (data not shown).
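The difference between the two LD measures can be made concrete with their standard definitions for two biallelic loci: D = p<sub>AB</sub> − p<sub>A</sub>p<sub>B</sub>, r<sup>2</sup> = D<sup>2</sup> / [p<sub>A</sub>(1 − p<sub>A</sub>)p<sub>B</sub>(1 − p<sub>B</sub>)], and |D'| = |D| / D<sub>max</sub>. The sketch below is only an illustration; the function name and the haplotype frequencies are hypothetical and are not data from this study.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, r2, abs_d_prime) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of alleles A and B (strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    abs_d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, r2, abs_d_prime

# Hypothetical case: allele B is only ever observed together with allele A,
# so one of the four possible haplotypes is missing.
d, r2, abs_d_prime = pairwise_ld(p_ab=0.2, p_a=0.5, p_b=0.2)
```

In this toy case |D'| equals 1 while r<sup>2</sup> is only 0.25. Whenever one haplotype is absent from the sample, |D'| reaches its maximum, which is one reason it tends to be inflated in small samples and why r<sup>2</sup> was preferred here.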

TABLE 3 | Mean linkage disequilibrium (r<sup>2</sup>) at different distances in three Chilean farmed populations of Atlantic salmon with North American (NAM), Scottish (SCO), and Norwegian (NOR) origin.


The current results compared the LD decay between Chilean Atlantic salmon breeding populations originating from different geographic regions. The SNP panel used in the current study has one SNP every 14 kb (Yáñez et al., 2016). Based on an r<sup>2</sup> threshold value of 0.2, as suggested by Meuwissen et al. (2001), which is reached at a minimum marker distance of 50 kb, this panel can be used to detect associations between markers and traits of interest and also to capture high-resolution information for genomic predictions.
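The argument above is a simple comparison between marker spacing and the extent of useful LD, which can be checked with the figures quoted in the text. The variable names are illustrative, and the calculation assumes evenly spaced markers, which a real array only approximates.

```python
# Values quoted in the text; the variable names are illustrative.
snp_spacing_kb = 14
# Distance (kb) at which mean r^2 reaches the 0.2 threshold in each population.
ld_extent_kb = {"NAM": 7800, "SCO": 64, "NOR": 50}

# Number of marker intervals that fit inside the LD extent of each population.
intervals = {pop: extent / snp_spacing_kb for pop, extent in ld_extent_kb.items()}

# The panel is dense enough when at least one interval fits within the extent.
covered = {pop: n >= 1 for pop, n in intervals.items()}
```

Even in the NOR population, where LD decays fastest, roughly 3.6 marker intervals fit within the 50 kb over which mean r<sup>2</sup> stays at or above 0.2.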

# CONCLUSION

The current study reveals different LD decay among three Atlantic salmon farmed populations. The highest extent of LD was estimated for the NAM population, followed by the SCO and NOR populations. A lower level of LD in NOR was consistent with its population history. Specifically, this population comes from a farmed strain established with samples from several rivers in Norway. Therefore, subsequent genetic bottlenecks associated with strain formation have been less severe than in the other two populations used in this study, which were established using fish from only one location. Also, the highest level of LD and lowest N<sub>e</sub> observed in NAM are consistent with the hypothesis that American salmon colonization from European fish favored only a few individuals. The high long-range LD in NAM indicates the feasibility of achieving better prediction accuracies in this population with a smaller SNP data set than in European populations.

# DATA AVAILABILITY


Raw genotype data for each population is available from the online digital repository Figshare, accession number doi: 10.6084/m9.figshare.7144631.

# ETHICS STATEMENT

The sampling protocol was previously approved by the Comité de Bioética Animal, Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile (certificate N° 29–2014).

# AUTHOR CONTRIBUTIONS

AB performed LD and N<sub>e</sub> analyses, and wrote the initial version of the manuscript. ML performed the population structure analysis and the first quality control of genomic data, and contributed to the discussion and writing. GY contributed to the LD analysis and discussion. RC and JL contributed to the analysis and discussion. JY and JL conceived and designed the study. JY supervised the work of AB and contributed to the analysis, discussion, and writing. All authors have reviewed and approved the manuscript.

# FUNDING

This work was conceived within the framework of the grant FONDEF NEWTON-PICARTE (IT14I10100), funded by CONICYT (Government of Chile). This work has been partially supported by Núcleo Milenio INVASAL from Iniciativa Científica Milenio (Ministerio de Economía, Fomento y Turismo, Gobierno de Chile).

# ACKNOWLEDGMENTS

AB and ML acknowledge the National Commission for Scientific and Technological Research (CONICYT) for funding through the National Ph.D. funding program. We thank Cristian Araneda for providing computational capacity support. We also acknowledge the Associate Editor and the two reviewers for their constructive comments and suggestions on the manuscript. JY is supported by Núcleo Milenio INVASAL funded by Chile's government program, Iniciativa Científica Milenio from Ministerio de Economía, Fomento y Turismo.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00649/full#supplementary-material




salmon (Salmo salar): validation in wild and farmed American and European populations. Mol. Ecol. Resour. 16, 1002–1011. doi: 10.1111/1755-0998.12503


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Barria, López, Yoshida, Carvalheiro, Lhorente and Yáñez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Integrating Genomic and Morphological Approaches in Fish Pathology Research: The Case of Turbot (Scophthalmus maximus) Enteromyxosis

Paolo Ronza<sup>1</sup>\*, Diego Robledo<sup>2</sup>, Roberto Bermúdez<sup>1</sup>, Ana Paula Losada<sup>1</sup>, Belén G. Pardo<sup>3</sup>, Paulino Martínez<sup>3</sup> and María Isabel Quiroga<sup>1</sup>

<sup>1</sup> Departamento de Anatomía, Producción Animal y Ciencias Clínicas Veterinarias, Universidade de Santiago de Compostela, Lugo, Spain, <sup>2</sup> Royal (Dick) School of Veterinary Studies, The Roslin Institute, The University of Edinburgh, Midlothian, United Kingdom, <sup>3</sup> Departamento de Zoología, Genética y Antropología Física, Universidade de Santiago de Compostela, Lugo, Spain

### Edited by:

Nguyen Hong Nguyen, University of the Sunshine Coast, Australia

### Reviewed by:

Timothy D. Leeds, United States Department of Agriculture, United States

Lill-Heidi Johansen, Fisheries and Aquaculture Research (Nofima), Norway

> \*Correspondence: Paolo Ronza paolo.ronza@usc.es

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 01 August 2018 Accepted: 16 January 2019 Published: 31 January 2019

### Citation:

Ronza P, Robledo D, Bermúdez R, Losada AP, Pardo BG, Martínez P and Quiroga MI (2019) Integrating Genomic and Morphological Approaches in Fish Pathology Research: The Case of Turbot (Scophthalmus maximus) Enteromyxosis. Front. Genet. 10:26. doi: 10.3389/fgene.2019.00026

Enteromyxosis, caused by Enteromyxum scophthalmi, is one of the most devastating diseases stemming from myxozoan parasites in turbot (Scophthalmus maximus L.), being a limiting factor for its production. The disease develops as a cachectic syndrome, associated with catarrhal enteritis and leukocytic depletion, with morbidity and mortality rates usually reaching 100%. To date, no effective treatment exists and several aspects of its pathogenesis remain unknown. The gross and microscopic lesions associated with enteromyxosis have been thoroughly described, and several morphopathological studies have been carried out to elucidate the mechanisms of this host-parasite interaction. More recently, efforts have been focused on a multidisciplinary approach, combining histopathology and transcriptome analysis, which has provided significant advances in the understanding of the pathogenesis of this parasitosis. RNA-Seq technology was applied at early and advanced stages of the disease to fish that had been histologically evaluated and classified according to the degree of their lesions. In the same way, the transcriptomic data were analyzed in relation to the morphopathological picture and the course of the disease. In this paper, a comprehensive review of turbot enteromyxosis is presented, starting from the disease description up to the most novel information extracted by an integrated approach on the infection mechanisms and host response. Further, we discuss ongoing strategies toward a full understanding of host-pathogen interaction and the identification of suitable biomarkers for early diagnosis and disease management strategies.

Keywords: Scophthalmus maximus, Myxozoa, pathogenesis, histopathology, transcriptomics

# INTRODUCTION

Turbot (Scophthalmus maximus L.) is a flatfish species naturally distributed throughout the European coast, from the Baltic and the Atlantic Ocean up to the Black Sea, being scarce in the Mediterranean Sea (Prado et al., 2018). Fish are important for the human diet, being a good source of high-quality proteins, vitamins, and other essential nutrients, including n-3 polyunsaturated fatty acids (PUFAs) and trace minerals. Flatfish are a group of great commercial value, considered as low-fat fish (2–4% fat) with a firm, white, mild-tasting flesh, highly accepted by consumers (Cerdá and Manchado, 2013; Dong et al., 2018). The reduction of captures caused by fisheries' exhaustion has promoted flatfish aquaculture mainly in Europe and Asia, with turbot and Japanese flounder Paralichthys olivaceus as the dominant species (Food and Agriculture Organization [FAO], 2016). It is a fast-growing industry, where the high appreciation by the market allows higher prices, which compensate for the greater production costs of flatfish due to their land-based aquaculture systems (Cerdá and Manchado, 2013; Robledo et al., 2017a). Turbot, in particular, is a high-value species that is much favored in many market segments such as white tablecloth restaurants (Bjørndal and Øiestad, 2010). The aquaculture production of this species started in the late 1970s and has experienced an important increase in the last decade. In the European Union (EU) more than 10,000 tons of turbot were produced in 2016, mostly in Spain (>70% of EU production), and in particular in Galicia (NW Spain, 99% of Spanish production; Apromar Asociación Empresarial de Acuicultura de España, 2017). Worldwide aquaculture production of turbot rose above 65,000 tons in 2015, mostly due to its quick expansion in PR China (Martínez et al., 2016), where the species was introduced in the 1990s (Lei and Liu, 2010). As for most aquaculture species, and despite being mainly produced in land-based facilities, pathogens represent the most important threat to the sustainability of turbot aquaculture. Although there has been significant progress with the development of some effective treatments and vaccines or the identification of major genomic regions associated with pathogen resistance (Martínez et al., 2016), diseases represent the main challenge that turbot farming will face in the near future.

Bacterial diseases, such as tenacibaculosis by Tenacibaculum maritimum, vibriosis by Vibrio anguillarum, edwardsiellosis by Edwardsiella tarda, and aeromoniasis (furunculosis) by Aeromonas salmonicida subsp. salmonicida, are among the most common causes of economic losses in the aquaculture industry. Vaccination is routinely used for tenacibaculosis and vibriosis in turbot, although sometimes the complementary use of antibiotics is necessary (Avendaño-Herrera et al., 2006). In contrast, in the case of aeromoniasis and edwardsiellosis the development of successful vaccines is still under investigation (Castro et al., 2008; Coscelli et al., 2015) and several outbreaks have recently been reported in turbot farms (Lillehaug et al., 2003; Padrós et al., 2006; Qin et al., 2014). On the other hand, there is currently no straightforward solution to tackle parasitic diseases, especially those produced by endoparasites, and they represent one of the most important threats for the turbot industry. Philasterides dicentrarchi, the causative agent of scuticociliatosis, has been involved in severe mortality episodes in farmed turbot (Iglesias et al., 2001) and, although some encouraging results have been achieved with experimental vaccines (Sanmartín et al., 2008; Palenzuela et al., 2009b), the high variability among parasite strains and the changes in surface antigens over the course of the infection have precluded a general protection. Current efforts are focused on obtaining more resistant or tolerant broodstock, and for that the genomic architecture of resistance to this parasite is being evaluated in the framework of the Fishboost EU project (FP7/2007–2013, ref. 613611).

Similarly, enteromyxosis, caused by the myxozoan parasite Enteromyxum scophthalmi, represents a major challenge to turbot production, with morbidity and mortality rates usually approaching 100% (Branson et al., 1999; Redondo et al., 2004; Quiroga et al., 2006). Despite efforts to find an effective treatment against this disease, mainly testing coccidiostatic drugs alone or in combination with antibiotics (Bermúdez et al., 2006a; Palenzuela et al., 2009a), there are still no available therapeutic measures. Current control measures are basically preventive, focused on improving husbandry strategies. Proper treatment of incoming and effluent water (ozone, UV, and filtration) is essential, and periodic epidemiological surveys have been suggested for early detection of the infection. Once the presence of E. scophthalmi is detected at farm facilities, the only available option is culling of the units where the infection was detected, followed by their disinfection, to minimize losses (Quiroga et al., 2006; Sitjà-Bobadilla and Palenzuela, 2012). Enteromyxosis affects turbot weighing over 50 g, with the highest prevalence observed in the range between 201 and 300 g. Mortality can be low at the beginning of the outbreak if older fish are infected first, but it increases exponentially as younger fish are progressively affected. A mortality of 100% is often observed within a few weeks, particularly at summer temperatures (water temperature >14°C), which have been related to a faster progress of the disease (Quiroga et al., 2006; Sitjà-Bobadilla and Palenzuela, 2012). On the other hand, turbot enteromyxosis is generally characterized by a long pre-patent period. The first clinical signs appear and the parasite is detected several weeks after the exposure (Redondo et al., 2004; Quiroga et al., 2006).
Some evidence of resistance to this parasite has been reported; the origin of turbot was identified as a risk factor (Quiroga et al., 2006), and cases of fish showing protective acquired immunity after surviving the infection have been described (Sitjà-Bobadilla et al., 2004, 2007). Nonetheless, heritability for resistance to this parasite and genetic correlations with other traits have not been reported yet, and therefore, estimating these parameters for turbot enteromyxosis should be a priority to decide the best strategy for genetic breeding programs.

Transcriptome analysis is widely used as a powerful tool to gain a better understanding of the underlying pathways controlling disease progression in hosts (Sudhagar et al., 2018). A proper understanding of host-pathogen interaction is critical to devise successful disease prevention strategies, and the study of gene expression profiles is key to achieving this goal. The field of transcriptomics has constantly evolved since the first studies performed using hybridization-based microarray technology, from full cDNA probes to short and more precise oligo probes. In particular, oligo-microarrays have been employed in turbot to study the genetic response to aeromoniasis (Millán et al., 2011) and scuticociliatosis (Pardo et al., 2012). Nevertheless, microarray technology presents some limitations, such as the requirement of prior knowledge of gene sequences from the organism of interest, the restricted dynamic range, the low accuracy for gene families caused by cross-hybridization (particularly important in teleosts due to their specific whole genome duplication), and finally the difficulty of identifying alternative splice variants, which are essential for understanding the expression profiles of the different isoforms obtained from single genes (Wang et al., 2009).

RNA sequencing (RNA-Seq) is an evolving technology that uses next-generation sequencing (NGS) to obtain transcriptome profiles. It emerged as a rapid and effective approach for genome survey, and massive functional gene and molecular marker identification (Hrdlickova et al., 2017). In recent years, the application of RNA-Seq to biological investigations is revolutionizing the outlook and accelerating the knowledge of the eukaryotic transcriptome. In the field of pathology, RNA-Seq analysis during host–pathogen interaction allows us to deeply explore the mechanisms of infection and the defense strategies of the host, providing valuable information for developing effective targeted control and therapeutic measures. This technology has recently been employed in several investigations on fish diseases (Sudhagar et al., 2018), including vibriosis by V. anguillarum in turbot (Gao et al., 2016).

On the other hand, there is a paucity of works, especially in fish pathology, where transcriptomic analysis is combined with a morphopathological approach. The great amount of data generated by RNA-Seq is best exploited in an interdisciplinary approach, combining the essential expertise in bioinformatics and genetics, with a proper immunological, microbiological and pathological point of view. Particularly, tissue-based works aimed to investigate disease pathogenesis require specialists, from the experimental design, the quality assessment and characterization of the specimens to the contextual interpretation of the results in relation to the tissue and the disease under study (Berman et al., 2012).

In this sense, the assembly and annotation of the turbot genome (Figueras et al., 2016) has represented a landmark for the application of RNA-Seq technologies, facilitating the mapping and annotation of sequencing data, which translates into more accurate and comprehensive results. A new refined version of the turbot genome has recently been released and made available at the NCBI genome database (Maroso et al., 2018; GCA\_003186165.1). In the case of turbot enteromyxosis, RNA-Seq analysis was applied to get insights into the pathogenesis of this threatening disease, selecting the specimens based on a histological evaluation and grading of the lesions observed after an experimental infection, and analyzing the data in an integrated framework which considered the evolution of the parasitosis as evidenced by tissue lesion (Robledo et al., 2014; Ronza et al., 2016).

In this paper, a thorough review of turbot enteromyxosis is presented, discussing the recent advances in host-parasite interaction obtained by integrating the application of genomics and morphopathological techniques.

# TURBOT ENTEROMYXOSIS

# Disease Description and Gross Pathology

The occurrence of an emaciative condition in farmed turbot was increasingly reported in NW Spain in the 1990s. The first studies promptly associated the disease with the presence of a myxozoan parasite in the gastrointestinal tract of the affected fish (Branson et al., 1999). The genus Enteromyxum was then proposed by Palenzuela et al. (2002) as a result of the study of the causal agent involved in the disease, and the parasite was named E. scophthalmi. The new genus was supported by phylogenetic analysis using ribosomal RNA, and the two species previously known as Myxidium leei and M. fugu were included in this genus (Palenzuela et al., 2002; Yanagida et al., 2004). E. scophthalmi, E. leei, and E. fugu are still the only three known species of this genus of myxosporean parasites (Sitjà-Bobadilla and Palenzuela, 2012).

Although the presence of an intermediate invertebrate host is hypothesized for all Myxozoa, which usually present a diphasic life cycle alternating between invertebrate (actinospore phase) and vertebrate (myxospore phase) hosts (Kent et al., 2001; Lom and Dykova, 2006), this intermediate host is still to be discovered for Enteromyxum spp. On the other hand, it has been shown that the three species present direct fish-to-fish transmission of the vegetative stages or trophozoites (Diamant, 1997; Redondo et al., 2002; Yasuda et al., 2002).

Enteromyxum fugu does not represent a relevant threat for its host, tiger puffer (Takifugu rubripes) (Tun et al., 2002; Yanagida et al., 2006), while E. leei and E. scophthalmi have a great impact on marine aquaculture. The infection is associated with a cachectic syndrome producing high mortality and deterioration of performance indicators, causing important economic losses (Sitjà-Bobadilla and Palenzuela, 2012). E. leei presents a wide geographical distribution and host range (Padrós et al., 2001; Diamant et al., 2006; Rigos and Katharios, 2010; Katharios et al., 2011, 2014; Sitjà-Bobadilla and Palenzuela, 2012), although its virulence varies depending on the infected species. The disease shows a severe clinical picture and high mortality in some cases, such as for Diplodus puntazzo or Takifugu rubripes (Yanagida et al., 2006; Álvarez-Pellitero et al., 2008), but it can also develop as a chronic condition, with progressive emaciation and low mortality of diseased fish, as observed in gilthead seabream Sparus aurata (Fleurance et al., 2008; Sitjà-Bobadilla et al., 2008).

Similarly, turbot enteromyxosis caused by E. scophthalmi is clinically characterized by a cachectic syndrome, with anorexia, weight loss, and lethargy as the main signs (Branson et al., 1999; Sitjà-Bobadilla et al., 2006; Bermúdez et al., 2010; Sitjà-Bobadilla and Palenzuela, 2012). A decrease in hematocrit values, consistent with anemia, has also been reported (Bermúdez, 2003; Sitjà-Bobadilla and Palenzuela, 2012). The fecal–oral route, through the ingestion of the infective stages present in the stools of diseased fish, is thought to be the main route of entry of the myxozoan (Redondo et al., 2004). This mode of transmission favors the rapid spread of enteromyxosis among the productive units: infected turbot are considered the primary source of transmission and high-density culture represents a significant risk factor (Redondo et al., 2002; Quiroga et al., 2006).

The experimental transmission of enteromyxosis by waterborne contamination from the effluent of a tank containing infected fish or by cohabitation of infected and test fish are likely the ways that best reproduce the situation of spontaneous infections among cultured turbot (Redondo et al., 2002, 2004; Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006; Losada et al., 2012). On the other hand, these are slower and more heterogeneous infection models than experimental per os transmission; the parasite is first detected by histology from 20 days post-exposure onwards in infections by effluent or cohabitation, while it can be observed as early as 7 days after experimental per os inoculation (Redondo et al., 2004; Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006; Losada et al., 2014a). Experimental infection by the oral route is considered the most effective way of infecting turbot, with more homogeneous prevalence rates and lesions (Redondo et al., 2002, 2004), allowing the selection of an adequate number of specimens with analogous lesions for case-control studies involving gene expression and/or immunohistochemical marker analyses (Ronza et al., 2015a). In all the challenges described, regardless of the experimental infection method employed, prevalence and mortality rates often reached 100%, reflecting the high susceptibility of turbot to the infection also observed in spontaneous outbreaks (Redondo et al., 2002, 2004; Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006; Losada et al., 2014a).

The disease presents a chronic course, and the progressive emaciation is externally reflected by enophthalmos and conspicuous head bony ridges, due to muscle atrophy (Branson et al., 1999; García, 1999; Bermúdez et al., 2010; Losada et al., 2014a). For this reason, it was initially named as "sunken head" syndrome (**Figures 1A,B**). Ascites and dilated, congestive or even hemorrhagic alimentary canal, containing a seromucous liquid, are often reported at necropsy (Branson et al., 1999; García, 1999; Bermúdez et al., 2010; Losada et al., 2014a). Pale appearance of other organs and/or splenomegaly was sporadically described, but often there are no significant macroscopic lesions outside the gastrointestinal tract (Branson et al., 1999; García, 1999; Bermúdez et al., 2010).

# Histopathology

The gastrointestinal tract, where the trophozoites of E. scophthalmi develop in the lining epithelium, shows the most characteristic microscopic lesions. The disease is typically defined by a picture of catarrhal gastroenteritis, whose severity increases throughout the infection and in most cases leads to the death of the fish. Typically, infection begins in the pyloric caeca and anterior intestine, extending up- and backward through the alimentary canal, leading to the colonization of the entire gut, from the esophagus to the anus (Redondo et al., 2002, 2004; Bermúdez et al., 2010; Losada et al., 2012). Bermúdez et al. (2010) performed a comprehensive histopathological study of enteromyxosis, analyzing naturally and experimentally infected turbot at different stages of the disease. They proposed a histological grading of enteromyxosis based on the lesional pattern and parasitic load observed (Bermúdez et al., 2010).

In slight infection (**Figure 1C**), most of the intestinal folds do not show significant alterations or histologically visible parasites. Early developmental stages of E. scophthalmi are sporadically observed at the base of the lining epithelium, sometimes associated with slight infiltration of mononuclear immune cells in this site and/or in the lamina propria-submucosa (Bermúdez et al., 2010). The histological detection of early parasitic stages is difficult, as they are small, rounded, basophilic structures, easily confused with apoptotic cells (Redondo et al., 2004; Bermúdez et al., 2010; Losada et al., 2014a). An increased density of mucous and rodlet cells has also been described at this stage (Bermúdez et al., 2010).

Moderate infection is characterized by a notable increase in parasitic load; different developmental stages of E. scophthalmi can be observed throughout the alimentary canal, although they are still more frequent in the pyloric caeca and anterior intestine. The glandular epithelium of the stomach can also be affected. The inflammatory infiltration is evident, although not always related to a high number of parasites. Infiltrates are mainly composed of intraepithelial lymphocytes and mixed inflammatory cells in the lamina propria-submucosa, where the presence of melanomacrophage aggregates has also been occasionally reported (García, 1999; Sitjà-Bobadilla et al., 2006; Bermúdez et al., 2010; Losada et al., 2014b). At this stage the normal architecture of the gut starts to show pathological changes, seen as a scalloped shape of the lining epithelium (Bermúdez et al., 2010; Losada et al., 2014a).

The lesions extend to most of the digestive tract in severe infection (**Figure 1D**), when the presence of the Myxozoa is widespread in all the gut regions. The epithelium is often detached from the basal lamina, showing a variable degree of desquamation, and even a total absence of epithelium can be observed in the most serious cases (Branson et al., 1999; García, 1999; Redondo et al., 2004; Bermúdez et al., 2010; Losada et al., 2014a). The enterocytes show severe alterations, such as necrotic or apoptotic features, vacuolated cytoplasm and fragmented nucleus; the apical brush border and cell-cell junctions are often lost. Groups of apoptotic desquamated cells still associated to parasitic forms are often described (Redondo et al., 2004; Bermúdez et al., 2010; Losada et al., 2014a). Different degrees of inflammatory infiltration were also reported in most gut regions, often severe, and leukocytes with apoptotic features are often detected among those constituting the infiltrates (Sitjà-Bobadilla et al., 2006; Bermúdez et al., 2010; Losada et al., 2014a). The activation of tissue repair processes has also been documented, observing areas of re-epithelialization constituted by squamous or low cubic cells (García, 1999; Bermúdez et al., 2010).

In other organs, the most characteristic lesion is lymphohematopoietic depletion, observed in spleen and kidney (**Figures 1E,F**). This lesion is always reported in turbot at advanced enteromyxosis stages (Bermúdez et al., 2006b, 2010; Sitjà-Bobadilla et al., 2006; Losada et al., 2014a). Further, an increased presence of apoptotic cells and changes in the density and morphology of the melanomacrophage centers have also been documented in the same organs (Bermúdez et al., 2006b, 2010; Sitjà-Bobadilla et al., 2006; Ronza et al., 2013a). E. scophthalmi has occasionally been detected in locations other than the gastrointestinal tract, such as skin and gills (other possible routes of entry), blood (possible dissemination route) and lymphohematopoietic organs, sometimes engulfed by macrophages (Redondo et al., 2002, 2004; Sitjà-Bobadilla et al., 2006; Bermúdez et al., 2010; Estensoro et al., 2014). There are also anecdotal descriptions of the presence of the parasite in bile ducts, pancreas and muscle, in cases of severe infection with an extremely high parasite load (García, 1999; Redondo et al., 2004; Bermúdez et al., 2010). The extraintestinal localization of E. scophthalmi is usually not associated with histological alterations.

# Host-Parasite Interaction: Morphopathological Studies

Until recently, host-parasite interaction in turbot enteromyxosis was mainly analyzed through morphological techniques. Light and electron microscopy were used to study the location of the parasite and the associated lesions. E. scophthalmi colonizes the digestive tract by invading the lining epithelium, where it localizes between the mucosal cells, establishing connections with them through cytoplasmic projections and cell-cell junctions. These structures are thought to be related to mechanisms for attachment, communication and nutrition of the trophozoites (Redondo et al., 2003b; Bermúdez et al., 2010). Intracellular early developmental stages of E. scophthalmi have occasionally been described (Palenzuela et al., 2002; Redondo et al., 2003b, 2004). In vitro assays using intestinal explants showed the capability of the myxosporean to invade the epithelium through either its apical or its basal surface (Redondo et al., 2004).

Lectin histochemistry and immunohistochemistry have also been widely employed to gain deeper insight into the turbot-E. scophthalmi interaction. The role of carbohydrate-lectin interactions in the adhesion and penetration of the parasite in turbot epithelium was demonstrated by combining the use of intestinal explants and lectin histochemistry. N-acetylgalactosamine, galactose and mannose/glucose residues were identified as the main carbohydrate terminals in the parasite membrane involved in the recognition mechanisms, and the corresponding binding lectins showed an inhibitory effect on its adhesion and penetration (Redondo et al., 2008; Redondo and Álvarez-Pellitero, 2010a,b).

Immunohistochemistry has often been employed as an important complement for the histopathological evaluation of parasitized fish. The presence of the parasite in the intestinal mucosa was associated to the progressive alteration of the lining epithelium, which compromises the proper intestinal barrier function and has been related to disorders in osmoregulation and nutrient absorption. These mechanisms, along with anorexia, would predispose to the wasting syndrome typical of enteromyxosis (Sitjà-Bobadilla and Palenzuela, 2012). In turbot enteromyxosis, the loss of cell-cell junctions in the intestinal mucosa was observed by transmission electron microscopy (Bermúdez et al., 2010) and supported by the immunohistochemical alteration of the expression pattern of several cell junction proteins (Ronza et al., 2013b). Moreover, the observation of increased apoptosis in the gut of diseased fish was confirmed by immunostaining for active caspase-3 (**Figure 2A**), a crucial mediator of programmed cell death. Apoptotic cells were observed in the lining epithelium and intestinal lumen, often engulfing parasitic structures (Losada et al., 2014a). It has been suggested that this could be beneficial for the host to reduce the parasite load, but E. scophthalmi might also take advantage of being eliminated with apoptotic cell remnants to better survive in the water and find a new host (Redondo et al., 2003a; Bermúdez et al., 2010). Bermúdez et al. (2010) also suggested that the apoptosis could be a consequence of the loss of anchorage of the epithelial cells to the extracellular matrix, a mechanism known as anoikis.

Turbot immune response was also primarily investigated by using immunohistochemistry. Immunoreactivity to inducible nitric oxide synthase (iNOS), an important mediator of innate immune response, was notably increased in the gastrointestinal tract of parasitized turbot (**Figure 2B**), where immune cells, mucous cells and the epithelium itself were labeled. As well, an enhanced number of iNOS-positive cells was found in kidney and spleen of infected fish (Losada et al., 2012). The hypothesis about the relationship of the intestinal lesions with an exacerbated local inflammatory reaction was supported by this study and the investigations on turbot neuroendocrine system (NES). NES plays a key role in the digestive function and alimentary behavior, but it is also involved in the coordination of the immune response through its interactions with the immune system (Palmer and Greenwood-Van Meerveld, 2001). A large set of NES hormones/transmitters was studied in infected and non-infected fish by immunohistochemistry, finding an increased presence of molecules that boost the immune response in the intestine of diseased fish (Bermúdez et al., 2007; Losada et al., 2014b). The implications of the inflammatory reaction in the pathogenesis of diseases associated to catarrhal enteritis are well documented in mammals (Peterson and Artis, 2014; Kamekura et al., 2015; Williams et al., 2015).

The adaptive immune response was also investigated by an immunohistochemical technique targeting turbot IgM (**Figures 2C,E**). Immunoreactive cells were numerous in spleen and kidney at 20–40 days post-infection (dpi) but decreased in advanced stages of the disease. On the other hand, the number of IgM-positive cells in the gastrointestinal tract increased during the infection until 76 dpi, possibly migrating from the lymphohematopoietic organs (Bermúdez et al., 2006b). Sitjà-Bobadilla et al. (2004) also demonstrated using ELISA that turbot produces specific anti-E. scophthalmi antibodies, which in some cases showed a protective effect (Sitjà-Bobadilla et al., 2007); still, most evidence indicated that humoral immunity is delayed and ineffective in turbot against enteromyxosis (Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006).

The leucocytic depletion often reported in advanced enteromyxosis could be an important factor underlying immunodepression and/or failure in the connection between innate and adaptive response (Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006). This lesion, observed in spleen and kidney, has been related to the increased apoptosis of immune cells in those organs and in the gastrointestinal tract, where the exacerbated need for cell migration from the lymphohematopoietic organs would contribute to cause the depletion (Bermúdez et al., 2006b, 2010; Sitjà-Bobadilla et al., 2006; Losada et al., 2014a).

Immunohistochemistry was also employed in combination with quantitative PCR (qPCR) to investigate the role of tumor necrosis factor-alpha (TNFα) in the disease (**Figures 2D,F**), an approach that allowed the simultaneous study of gene expression and protein in situ visualization on the same specimens (Ronza et al., 2015a). TNFα is a cytokine involved in a broad spectrum of cellular and organismal responses (Goetz et al., 2004; Hehlgans and Pfeffer, 2005; Parameswaran and Patial, 2010). Its main function as a potent pro-inflammatory mediator was demonstrated in teleost species, and there are many reports on the modulation of TNFα under pathological conditions (Montes et al., 2010; Schwenteit et al., 2013; Ma et al., 2014; Pennacchi et al., 2014). An immunohistochemical technique was set up in turbot tissues (Ronza et al., 2015b), which was employed along with TNFα expression analysis by qPCR on healthy and E. scophthalmi-infected fish (Ronza et al., 2015a). An increased number of immunoreactive cells and up-regulation of TNFα were reported in the spleen and kidney of turbot with moderate infection, demonstrating the involvement of the cytokine in triggering the immune response against E. scophthalmi. At the intestinal level, a progressive increase of immunoreactive cells was noticed with the progress of the disease, many of which constituted the inflammatory infiltrates in the lamina propria-submucosa. Nevertheless, this increment in labeled cells did not correspond to a significant up-regulation of TNFα in the intestine, suggesting the recruitment of leukocytes with a preformed intracellular pool of the cytokine from the lymphohematopoietic organs (Ronza et al., 2015a). TNFα was demonstrated to induce the production of nitric oxide in turbot (Ordás et al., 2007), and, concomitantly, iNOS immunohistochemical expression was increased in the gut of E. scophthalmi-infected fish (Losada et al., 2012), indicating a possible relation between the two inflammatory mediators during the disease. The prolonged exposure to inflammation of the gastrointestinal tract could explain the development of the typical intestinal lesions of enteromyxosis, in accordance with what has been reported in different mammalian diseases (Panaro et al., 2007; Chokshi et al., 2008; Bienvenu et al., 2010; Watson and Hughes, 2012; Leppkes et al., 2014).

FIGURE 2 | ... propria-submucosa, as well as an elevated parasitic load in the epithelium. (D) Immunoreactivity to tumor necrosis factor-alpha (TNFα) of several cells constituting the inflammatory infiltrate in the gut of an E. scophthalmi-infected turbot. Two parasites (arrowheads) are recognizable in the lining epithelium. (E) Turbot kidney showing scattered IgM<sup>+</sup> cells in the lymphohematopoietic interstitial tissue of the organ. (F) Kidney of turbot with advanced enteromyxosis showing immunostaining to TNFα of some cells of the intertubular parenchyma, which suffered a remarkable cell depletion associated with dilatation of renal tubules (asterisks).

# Pathogenesis Studies Integrating Morphological and Genomics Approaches

Transcriptional profiling is a powerful tool for the identification of genes and pathways involved in host-pathogen interaction and it is acquiring a pivotal role for understanding the pathogenesis of diseases of fish and shellfish (Qian et al., 2014; Sitjà-Bobadilla et al., 2015; Valenzuela-Miranda et al., 2015; Sudhagar et al., 2018). Particularly, RNA-Seq has emerged as the technology of choice for transcriptomic studies (Wang et al., 2009; Qian et al., 2014) due to its high sensitivity and specificity, and its ability to identify new genes, rare transcripts, alternative splice isoforms, and novel SNPs to be used for association studies (Marioni et al., 2008; Morozova et al., 2009; Nielsen et al., 2011). In turbot enteromyxosis, RNA-Seq analysis was applied to get insights into the early (Ronza et al., 2016) and late (Robledo et al., 2014) stages of the disease. The two studies employed a similar approach: after experimental infection, tissue samples at the same time point (24 dpi for early infection and 42 dpi for late infection) were taken for the application of histological and transcriptomic techniques; after a histopathological evaluation and classification of the specimens, infected fish showing similar lesions and their respective controls were chosen for RNA-Seq analysis. In both cases the transcriptomic study was carried out on pyloric caeca, the intestinal region where the infection usually starts (Redondo et al., 2004), and spleen and kidney, the two major lymphohematopoietic organs (**Table 1**). A meticulous histopathological evaluation of the fish was a valuable tool for obtaining accurate and useful transcriptomic data, while reducing the biological noise and accordingly the number of samples to be tested, and all in all, it was essential for a better understanding of host-parasite interaction and pathogenesis studies.
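The specimen selection step described above can be illustrated with a minimal sketch. It assumes each fish is recorded with an infection status, a sampling time point and a histological grade. The `Specimen` record, the grade labels and the `select_for_rnaseq` helper are hypothetical names introduced here for illustration only, and are not taken from the cited studies.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical grade labels; "none" stands for fish without lesions or parasites.
GRADES = ("none", "slight", "moderate", "severe")


@dataclass(frozen=True)
class Specimen:
    fish_id: str    # hypothetical identifier
    infected: bool  # True for challenged fish, False for controls
    dpi: int        # sampling time point, days post-infection
    grade: str      # histological grade assigned after evaluation


def select_for_rnaseq(
    specimens: List[Specimen], dpi: int, target_grade: str
) -> Tuple[List[Specimen], List[Specimen]]:
    """Pick infected fish with the same grade and controls from the same time point."""
    if target_grade not in GRADES:
        raise ValueError(f"unknown grade: {target_grade}")
    infected = [
        s for s in specimens
        if s.infected and s.dpi == dpi and s.grade == target_grade
    ]
    controls = [
        s for s in specimens
        if not s.infected and s.dpi == dpi and s.grade == "none"
    ]
    return infected, controls


if __name__ == "__main__":
    # Made-up records, for illustration only.
    pool = [
        Specimen("A", True, 24, "slight"),
        Specimen("B", True, 24, "moderate"),
        Specimen("C", False, 24, "none"),
        Specimen("D", False, 42, "none"),
    ]
    cases, controls = select_for_rnaseq(pool, dpi=24, target_grade="slight")
    print([s.fish_id for s in cases], [s.fish_id for s in controls])
```

Filtering on a single grade and a single time point mirrors the rationale given in the text: homogeneous lesions within the infected group reduce biological noise and therefore the number of samples needed.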

### Advanced Infection

RNA sequencing analysis of severely infected turbot indicated that an exacerbated local inflammatory response is implied in the development of the intestinal lesions. Several proinflammatory genes were found up-regulated, while various genes related to antioxidant defense were down-regulated (Robledo et al., 2014). Oxidative stress linked to prolonged inflammation plays a major role in the pathogenesis of gastrointestinal diseases (Bhattacharyya et al., 2014). The transcriptomic profiling of pyloric caeca also showed the up-regulation of different proapoptotic genes, including caspase-3 (Robledo et al., 2014), in accordance with the immunohistochemical results reported for this protein by Losada et al. (2014a). These authors also described immunoreactive apoptotic cells among those constituting the inflammatory infiltrates at intestinal level, suggesting local immune cell death as a possible reason for an increased cell demand from the lymphohematopoietic organs responsible for the observed leucocytic depletion (Losada et al., 2014a). The involvement of a systemic response during enteromyxosis has been widely demonstrated, by hematological and serological studies (Sitjà-Bobadilla et al., 2006), as well as by immunohistochemistry for IgM (Bermúdez et al., 2006b), iNOS (Losada et al., 2012), and TNFα (Ronza et al., 2015a). Nevertheless, it was highlighted by several authors that at advanced stages of the disease the immune response appears depressed, showing lymphocytopenia (Sitjà-Bobadilla et al., 2006) and decreased numbers of IgM and TNFα immunoreactive cells (Bermúdez et al., 2006b; Ronza et al., 2015a), which has been related to the cell depletion suffered by kidney and spleen. 
RNA-Seq analysis revealed that not only were numerous genes related to adaptive immunity [e.g., immunoglobulin light chain, immunoglobulin mu heavy; V(D)J recombination-activating 1; T-cell surface glycoprotein CD4, T-cell receptor beta chain] down-regulated in spleen and kidney (**Figure 3**), but so were many genes involved in the coordination between innate and adaptive immunity, such as those related to antigen-presenting cells, Th17 lymphocytes and interferons (Robledo et al., 2014). Those results support the hypothesis of a failure in the development of a coordinated immune response of turbot against the disease, where the leucocytic depletion of the lymphohematopoietic organs possibly plays an important role.

Regarding the causes underlying the pathogenesis of the cell depletion, in addition to the exacerbated leukocyte recruitment to the intestine (Bermúdez et al., 2006b; Losada et al., 2014a), the cell death affecting spleen and kidney during the infection has been proposed (Bermúdez et al., 2006b; Sitjà-Bobadilla et al., 2006), but the transcriptomic profiling did not confirm these hypotheses, and a balance between cell death/survival signals was essentially detected in both organs (Robledo et al., 2014).

TABLE 1 | RNA-Seq statistics for the three studied organs in Enteromyxum scophthalmi-infected turbot and their respective controls at 24 and 42 days post-infection (dpi).

On the other hand, the transcriptomic analysis highlighted that the spleen and kidney showed down-regulation of genes related to erythropoiesis (Robledo et al., 2014; **Figure 3**), a finding in accordance with previous observations of the anemic status of fish suffering severe enteromyxosis (Bermúdez, 2003; Sitjà-Bobadilla and Palenzuela, 2012). The concurrent down-regulation of ferritin, an important iron-storage protein, and up-regulation of hepcidin, a major regulator of iron metabolism involved in iron sequestration during infections, pointed toward a reduced availability of this element during infection (Robledo et al., 2014). Hepcidin acts as an acute-phase protein during infection, reducing iron absorption in the intestine and promoting iron sequestration in macrophages, thus limiting its availability for hemoglobin synthesis in maturing erythrocytes. This mechanism is considered responsible for the so-called "anemia of chronic disease" (Ganz, 2002, 2011). Additionally, it is well known that turbot shows anorexia and severe intestinal lesions at this stage of the disease (Bermúdez et al., 2010; Sitjà-Bobadilla and Palenzuela, 2012), which might affect intestinal iron absorption. Furthermore, in mammals it has been shown that the signaling pathways activated in chronic inflammation affect hematopoiesis (Schuettpelz and Link, 2013), and TNFα, overrepresented in diseased turbot (Ronza et al., 2015a), is thought to have a main role as positive or negative regulator of lymphohematopoiesis (Schuettpelz and Link, 2013; Waters et al., 2013). These mechanisms deserve further attention as possibly implicated in the cell depletion of lymphohematopoietic organs during enteromyxosis. Interestingly, TNFα was not up-regulated in the spleen or kidney of gilthead sea bream parasitized by E. leei (Sitjà-Bobadilla et al., 2008; Pérez-Cordón et al., 2014), and cell depletion of these organs during the disease is not described in this species. Nonetheless, E. leei-infected gilthead sea bream also showed an intense local response in the gastrointestinal tract associated with up-regulation of TNFα (Davey et al., 2011; Pérez-Cordón et al., 2014). The different severity of the intestinal lesions in these two species, and consequently of the disease course, might be explained by an efficient activation of anti-inflammatory mechanisms in sea bream (Sitjà-Bobadilla et al., 2008; Davey et al., 2011; Pérez-Cordón et al., 2014), which appeared either to fail or to be absent in turbot based on RNA-Seq data (Robledo et al., 2014).

The observation of clinical signs characteristic of a cachectic syndrome (anorexia, weight loss and muscle atrophy) is a common feature of Enteromyxum-infected fish (Sitjà-Bobadilla and Palenzuela, 2012). The interaction between the immune response and the NES, through the action of intestinal peptides, has been investigated as an underlying pathogenic mechanism (Bermúdez et al., 2007; Estensoro et al., 2009, 2011; Losada et al., 2014b). Proinflammatory molecules have demonstrated effects as mediators of cachexia in mammals, modulating the production of hormones and neuromodulators, which finally alter the metabolism and feeding behavior causing anorexia, weight loss and tissue wasting (Morley et al., 2006; Tizard, 2008; Grossberg et al., 2010; Freeman, 2012). In severely infected turbot, the transcriptomic profile of the intestine showed, in addition to inflammation, a modulated expression of genes encoding orexigenic and anorexigenic neuropeptides, indicative of the development of anorexia (Robledo et al., 2014). Further, the tissue wasting associated to cachectic syndromes was reflected by a wide down-regulation of genes related to structural proteins in kidney, spleen and pyloric caeca. The anorexic status of diseased fish along with an impaired nutrient absorption caused by the intestinal lesions have been related to reduced synthesis of structural proteins (Robledo et al., 2014), as observed in other species (Wykes et al., 1996; Lenaerts et al., 2006).

### Early Infection

When RNA-Seq was applied on the same three organs to study the early phase of the disease (Ronza et al., 2016), as expected, a remarkable difference in the number of differentially expressed genes (DEGs) was found in comparison with the advanced stage. In the latter case 1,316 (kidney), 1,377 (spleen), and 3,022 (pyloric caeca) DEGs were found (Robledo et al., 2014), while the numbers were 287, 211, and 187, respectively, in slightly infected turbot (Ronza et al., 2016). These fish were selected for showing incipient signs of the disease, with very subtle histological lesions, such as a slight mononuclear infiltration, and the presence of early developmental stages of E. scophthalmi assessed by immunohistochemistry (Ronza et al., 2016).
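The size of this difference can be made explicit with a little arithmetic. The sketch below only restates the DEG counts quoted above and computes the advanced-to-early ratio for each organ; the dictionary layout and the `fold_increase` helper are illustrative assumptions, not part of either study.

```python
# DEG counts as quoted in the text (Robledo et al., 2014; Ronza et al., 2016).
deg_counts = {
    "kidney": {"early": 287, "advanced": 1316},
    "spleen": {"early": 211, "advanced": 1377},
    "pyloric caeca": {"early": 187, "advanced": 3022},
}


def fold_increase(counts):
    """Return the advanced/early DEG ratio per organ, rounded to one decimal."""
    return {
        organ: round(c["advanced"] / c["early"], 1)
        for organ, c in counts.items()
    }


if __name__ == "__main__":
    for organ, ratio in fold_increase(deg_counts).items():
        print(f"{organ}: about {ratio}x more DEGs at the advanced stage")
```

Under these figures the increase is roughly 4.6-fold in kidney, 6.5-fold in spleen and 16.2-fold in pyloric caeca, so the largest relative change occurs in the intestinal region where the infection usually starts.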

Differentially expressed genes in common between the two analyzed time points were detected, such as the up-regulation of CD209 in pyloric caeca (Robledo et al., 2014; Ronza et al., 2016). The protein encoded is a C-type lectin receptor, a surface antigen characteristic of dendritic cells acting in pathogen recognition (Osorio and Reis e Sousa, 2011; Shao et al., 2015). RNA-Seq analysis of severely diseased fish also revealed DEGs related to lectin complement pathway and other C-type lectins (Robledo et al., 2014), supporting the previously hypothesized role of lectins in recognizing E. scophthalmi (Redondo et al., 2008; Redondo and Álvarez-Pellitero, 2010b).

Mechanisms underlying the adhesion and penetration in the intestinal epithelium by E. scophthalmi, a key factor for the course of the infection, were suggested by the modulation of a set of genes at the early stages. These included the cytoskeletal remodeling of host enterocytes through genes acting in the c-Jun N-terminal protein kinases (JNK) pathway, and the alteration of the physiological renewal of the intestinal lining epithelium by inhibiting apoptosis and cell proliferation (Ronza et al., 2016; **Figure 4**). Other pathogens show similar strategies, such as the intestinal parasite Cryptosporidium parvum, which induces a biphasic modulation of apoptosis consisting of early inhibition and late promotion of cell death (Liu et al., 2009). In enteromyxosis, the inhibition of apoptosis and cell proliferation at early stages would favor the epithelial invasion and proliferation of the parasite (Ronza et al., 2016), while the apoptotic cell death in advanced infection (Losada et al., 2014a; Robledo et al., 2014) could facilitate the dispersion of E. scophthalmi and even its survival in the aquatic milieu surrounded by cell remnants (Redondo et al., 2003a; Bermúdez et al., 2010).

Host-pathogen interactions are also characterized by strategies of the parasite to evade the host immune response. Different parasites, including other myxozoans affecting fish (Lom and Dykova, 2006; Sitjà-Bobadilla et al., 2015), benefit from an intracellular localization. This is still a controversial aspect in the case of E. scophthalmi, although there are some reports about early (unicellular) developmental stages inside host cells (Palenzuela et al., 2002; Redondo et al., 2002, 2004). The transcriptomic analysis of the incipient phase of the infection pointed toward a possible intracellular localization of the parasite, given the evidence of activation of the RIG-I-like receptors (RLRs) pathway found in pyloric caeca (Ronza et al., 2016). This pathway triggers the innate immune response against intracellular pathogens mediated by interferons, inducing the expression of interferon-stimulated genes (ISGs) (Dixit and Kagan, 2013; Nie et al., 2015). A transcriptomic profile consistent with RLRs pathway activation was also observed in spleen and kidney, along with evidence of T cell activation, also related to an interferon-mediated response against intracellular antigens (Ronza et al., 2016). These results suggest that at least some parasite stages enter the bloodstream and reach distant organs such as kidney and spleen, in accordance with previous observations (Redondo et al., 2003b, 2004).

Interestingly, the opposite expression pattern was found at the advanced stage of enteromyxosis, with down-regulation of interferon-related pathways in the three studied organs (Robledo et al., 2014; **Figure 5**). This result suggests that the immune response to E. scophthalmi is elicited differently during the two stages of infection, perhaps depending on a change in the localization of the parasite during the infection. On the other hand, it might be indicative of the inability to develop an effective immune response, possibly parasite-induced as an immune evasion mechanism, often described in mammalian viral diseases (Gale and Sen, 2009; Song et al., 2013; Taylor and Mossman, 2013). The role of interferon-mediated immune response in parasitic disease has also been highlighted in several investigations in both fish and mammals (Álvarez-Pellitero, 2008; McCall and Sauerwein, 2010; Beiting, 2014; Sitjà-Bobadilla et al., 2015). The inhibition of this response has been related to susceptibility to amoebic gill disease in salmon (Young et al., 2008), while the up-regulation of ISGs has been related with resistance to E. leei in gilthead sea bream (Davey et al., 2011).

In the case of turbot, a correlation might exist between the wide down-regulation of interferon-mediated response during the advanced stage of enteromyxosis and the high susceptibility of this species to the disease.

Other changes in gene expression indicative of inhibition of the immune response were found at early stages of the infection (Ronza et al., 2016). Although the IFN-mediated response was up-regulated in the three organs, SOCS1, a recognized inhibitor of interferon-mediated pathways (Skjesol et al., 2014), was up-regulated in kidney. Similarly, CD2, encoding a surface antigen of T and NK cells, was down-regulated in kidney and pyloric caeca (Ronza et al., 2016). The down-regulation of CD2 was found associated with parasitization by Leishmania donovani in humans, being related to disorders in T cell function (Bimal et al., 2008). Hence, although there is strong evidence of activation of certain mechanisms for parasite elimination in early enteromyxosis (such as the IFN-mediated response), these might be counteracted, thus failing to stop disease progression (Ronza et al., 2016). In the same way, RNA-Seq analysis also suggested that the acute-phase response is targeted by immune evasion strategies of E. scophthalmi (Ronza et al., 2016). The acute-phase response is an evolutionarily conserved immune mechanism consisting of the adaptive regulation of the synthesis and blood circulation of different proteins in response to most forms of inflammation, infection and tissue injury (Bayne and Gerwick, 2001). This response was found activated during infection in several teleost species, including in cases of parasitization (Khoo et al., 2012; Kovacevic et al., 2015). At the early stage of enteromyxosis, several genes encoding acute-phase proteins (APPs) showed a decreased expression, such as complement components, antiproteases and proteins acting in iron metabolism. The complement and coagulation cascades pathway was significantly down-regulated in the spleen (Ronza et al., 2016). 
The complement system and iron metabolism are known to be possible targets of pathogen evasion strategies (Zipfel et al., 2007; Ben-Othman et al., 2014; Leon-Sicairos et al., 2015), and the inhibition of antiproteases may contribute to the virulence of the parasite (Armstrong, 2006; Gomez et al., 2014; Sitjà-Bobadilla et al., 2015). These findings point toward the interference of E. scophthalmi with the innate immune response of turbot, which would favor the proliferation and dissemination of the parasitic forms in the gastrointestinal tract. The existence of mechanisms of immune silencing is in accordance with the long pre-patent period shown by enteromyxosis. In the study by Ronza et al. (2016), the authors reported that the expression profile of one of the infected fish clustered with the control group for the three studied organs. This fish, although histologically evaluated as early infected considering the absence of lesions and low parasitic load, actually showed more mature and spreading stages of the parasite than the other two infected fish at that stage, suggesting a slightly more advanced infection. This might indicate that the host response is quenched after the early host-parasite interaction (Ronza et al., 2016).

# CONCLUSION

The integrated evaluation of histopathological and transcriptomic information to investigate the pathogenesis of enteromyxosis in turbot provided a comprehensive interpretation of parasite-host interaction (**Figure 6**) that will aid the development of control measures against this threatening disease. E. scophthalmi might benefit from immune evasion strategies to circumvent the turbot immune response at early stages of the infection and reach its target, the gastrointestinal tract. At this level, the remodeling of the host cell cytoskeleton and inhibition of epithelial renewal appear to facilitate the invasion and colonization of the myxozoan parasite. Parasite proliferation might be favored by the silencing of the host immune response, until the parasitic load is increased and the intestinal lesions become serious. The consequent triggering of the inflammatory response would be exacerbated and dysfunctional, failing to coordinate innate and adaptive responses, and lacking an efficient activation of protective anti-inflammatory mechanisms. In this way, the turbot immune response would not be able to stop the infection, and instead would contribute to the development of the characteristic clinical signs and lesions associated with the disease.

# FUTURE PERSPECTIVES

Myxozoan parasitization in fish is a threat to aquaculture production and there is a need for a more comprehensive understanding of host-parasite interaction, including the entry routes, recognition mechanisms, host immune response and immune evasion strategies (Gomez et al., 2014; Sitjà-Bobadilla et al., 2015). A better characterization and knowledge of the life cycle of these parasites is essential for understanding their epidemiology. In this sense, the inability to culture myxozoans in vitro is a major hindrance to obtaining a pure source of parasites for experimentation and vaccination, and to their proper genetic characterization. Although several methods have been explored to culture E. scophthalmi in vitro, including the use of a turbot cell line, none has been successful (Redondo et al., 2003a). The E. leei transcriptome has been recently characterized using the intestine of heavily infected fish as source of the parasite (trophozoites) (Shpirer et al., 2014; Chang et al., 2015), and a similar strategy has been recently employed in E. scophthalmi with promising results (Palenzuela et al., unpublished data).

Regarding host-parasite interactions in enteromyxosis, the early phase of the infection, which results in epithelial invasion and colonization, is incompletely understood (Sitjà-Bobadilla and Palenzuela, 2012). The application of transcriptomic analysis in turbot enteromyxosis has provided novel and intriguing information on this issue. The identification of the molecular factors and the characterization of the pathways behind disease pathogenesis, clarifying their roles in resistance/susceptibility to the disease, will contribute to the development of more precise predictive tools, control and therapeutic strategies, and in turn, new relevant information for breeding programs.

Given the long pre-patent phase of this disease, early detection tools are a major goal for enteromyxosis control. Therefore, delivering suitable biomarkers for early diagnosis would have a direct impact on disease management strategies. Further, early detection of the parasite would also enable early treatment of the parasitosis, and, in this regard, understanding the early phase of the infection could provide molecular targets for drug development to disrupt parasite invasion. In this sense, the determination of IFNs blood levels or different APPs has been widely used as a diagnostic and prognostic marker for several diseases (Bauer et al., 2006; Schrödl et al., 2016). This is an area that needs to be explored in fish pathology, in combination with high-throughput blood gene expression profiling, which is being successfully applied in human and veterinary medicine to identify markers of health status and infectious and non-infectious diseases (Chaussabel, 2015). Additionally, microRNAs (miRNAs) are emerging as biomarkers for different diseases with a promising diagnostic potential (Wang et al., 2016). A first description of the turbot miRNAs repertoire has been recently published (Robledo et al., 2017b), and new advances on this issue are expected, including the possible applications to disease management.

The integrated analysis of gene expression profiles with the histological lesions associated with disease progression, and the characterization of relevant gene products in the tissue context by other methods (immunological techniques, proteomic analyses), is a promising approach for a comprehensive evaluation of the consistency and significance of the transcriptomic results. Key genes related to response to enteromyxosis identified by RNA-Seq can represent useful intermediate phenotypic markers for direct and more accurate estimations of resistance, which could speed up selective breeding programs. In that sense, despite the hints of potential genetic variation for resistance to E. scophthalmi, future quantitative genetics and animal breeding studies are critical to assess this aspect and its correlation with other productive traits in turbot, and, consequently, its potential for selection. Understanding the mechanisms of host-parasite interaction will contribute to selection endeavors through the identification of genes underlying possible QTL and the prioritization of functional markers for genomic selection, potentially leading to an increase in the efficiency and accuracy of selection.

Markers related to the immune system should clearly receive special attention, as they reflect the immunological status of the animal and might indicate the specific traits related to disease resistance/susceptibility. There are some examples of immune indicators that have been used in selective breeding programs for different fish diseases (Das and Sahoo, 2014). Signaling pathways involving IFN-related genes seem to be relevant in turbot early response to enteromyxosis, and the complement system and APP were suggested targets of parasite immune system evasion strategies. Since the liver is the main producer of complement components and APP, the investigation of the hepatic gene expression profiles would help to clarify its role. This strategy can be combined with the search of single nucleotide polymorphisms (SNPs) associated with disease resistance, either on identified candidate genes or detected through genome-wide association analysis (GWAS) using high throughput genotyping approaches (genotyping-by-sequencing or SNP arrays). The investigation of SNP variants has been recently applied in turbot in combination with RNA-Seq to another main target trait for aquaculture production, growth rate (Robledo et al., 2017c).

Finally, all functional and genetic information gained on the progress of enteromyxosis in turbot can feed genome editing approaches in the future. Key genes or genome regions can be modified to (a) confirm their involvement in the parasitosis and (b) develop more resistant animals. While genome editing is obviously a powerful tool and will be key to validate and exploit functional results in the future, this technique has not been tested in turbot and will require an important research effort (and societal changes) before its application in aquaculture.

Nowadays, research in fish pathology benefits from novel and powerful genomic technologies that can be combined with morphopathological approaches to advance in the understanding of diseases' pathogenic mechanisms. This multidisciplinary approach will be essential for deciphering the physiopathological relevance of the observations and, together with the development of complementary genetic studies, will allow us to acquire new powerful tools for controlling diseases that compromise aquaculture production.

# AUTHOR CONTRIBUTIONS

PM and MQ conceived the review and participated in the reviewing and critical analysis of the manuscript. PR wrote the original draft and generated part of the figures. DR, BP, RB, and AL participated in writing, preparation of artwork, reviewing and critical analysis of the manuscript.

# FUNDING

This work was supported by the European Regional Development Fund (ERDF) together with the Spanish Ministry of Economy, Industry and Competitiveness through the projects AGL2015–67039–C3–1–R and AGL2015–67039–C3–3–R. DR was supported by a Newton International Fellowship from The Royal Society (NF160037).

# REFERENCES







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ronza, Robledo, Bermúdez, Losada, Pardo, Martínez and Quiroga. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: Integrating Genomic and Morphological Approaches in Fish Pathology Research: The Case of Turbot (Scophthalmus Maximus) Enteromyxosis

Paolo Ronza<sup>1</sup>\*, Diego Robledo<sup>2</sup>, Roberto Bermúdez<sup>1</sup>, Ana Paula Losada<sup>1</sup>, Belén G. Pardo<sup>3</sup>, Paulino Martínez<sup>3</sup> and María Isabel Quiroga<sup>1</sup>

<sup>1</sup> Departamento de Anatomía, Producción Animal y Ciencias Clínicas Veterinarias, Universidade de Santiago de Compostela, Lugo, Spain, <sup>2</sup> Royal (Dick) School of Veterinary Studies, The Roslin Institute, The University of Edinburgh, Midlothian, United Kingdom, <sup>3</sup> Departamento de Zoología, Genética y Antropología Física, Universidade de Santiago de Compostela, Lugo, Spain

Keywords: Scophthalmus maximus, Myxozoa, pathogenesis, histopathology, transcriptomics

### Approved by:

Frontiers in Genetics Editorial Office, Frontiers Media SA, Switzerland

### \*Correspondence:

Paolo Ronza paolo.ronza@usc.es

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 27 February 2019 Accepted: 28 February 2019 Published: 20 March 2019

### Citation:

Ronza P, Robledo D, Bermúdez R, Losada AP, Pardo BG, Martínez P and Quiroga MI (2019) Corrigendum: Integrating Genomic and Morphological Approaches in Fish Pathology Research: The Case of Turbot (Scophthalmus Maximus) Enteromyxosis. Front. Genet. 10:225. doi: 10.3389/fgene.2019.00225

**A Corrigendum on**

### **Integrating Genomic and Morphological Approaches in Fish Pathology Research: The Case of Turbot (Scophthalmus maximus) Enteromyxosis**

by Ronza, P., Robledo, D., Bermúdez, R., Losada, A. P., Pardo, B. G., Martínez, P., et al. (2019). Front. Genet. 10:26. doi: 10.3389/fgene.2019.00026

In the original article, we neglected to include the funder "European Regional Development Fund (ERDF)," which, together with the Spanish Ministry of Economy, Industry and Competitiveness, co-funded the projects that supported our work (AGL2015–67039–C3–1–R and AGL2015–67039–C3–3–R).

The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated.

Copyright © 2019 Ronza, Robledo, Bermúdez, Losada, Pardo, Martínez and Quiroga. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Discovery and Functional Annotation of Quantitative Trait Loci Affecting Resistance to Sea Lice in Atlantic Salmon

Diego Robledo<sup>1</sup>\*, Alejandro P. Gutiérrez<sup>1</sup>, Agustín Barría<sup>2</sup>, Jean P. Lhorente<sup>3</sup>, Ross D. Houston<sup>1</sup>\*† and José M. Yáñez<sup>2,4</sup>\*†

<sup>1</sup> The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, United Kingdom, <sup>2</sup> Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile, Santiago, Chile, <sup>3</sup> Benchmark Genetics Chile, Puerto Montt, Chile, <sup>4</sup> Núcleo Milenio INVASAL, Concepción, Chile

### Edited by:

Peng Xu, Xiamen University, China

### Reviewed by:

Maria Saura, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Spain Elizabeth Grace Boulding, University of Guelph, Canada

### \*Correspondence:

Diego Robledo diego.robledo@roslin.ed.ac.uk Ross D. Houston ross.houston@roslin.ed.ac.uk José M. Yáñez jmayanez@uchile.cl †These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 26 October 2018 Accepted: 23 January 2019 Published: 08 February 2019

### Citation:

Robledo D, Gutiérrez AP, Barría A, Lhorente JP, Houston RD and Yáñez JM (2019) Discovery and Functional Annotation of Quantitative Trait Loci Affecting Resistance to Sea Lice in Atlantic Salmon. Front. Genet. 10:56. doi: 10.3389/fgene.2019.00056

Sea lice (Caligus rogercresseyi) are ectoparasitic copepods which have a large negative economic and welfare impact in Atlantic salmon (Salmo salar) aquaculture, particularly in Chile. A multi-faceted prevention and control strategy is required to tackle lice, and selective breeding contributes via cumulative improvement of host resistance to the parasite. While host resistance has been shown to be heritable, little is yet known about the individual loci that contribute to this resistance, the potential underlying genes, and their mechanisms of action. In this study we took a multifaceted approach to identify and characterize quantitative trait loci (QTL) affecting host resistance in a population of 2,688 Caligus-challenged Atlantic salmon post-smolts from a commercial breeding program. We used low and medium density genotyping with imputation to collect genome-wide SNP marker data for all animals. Moderate heritability estimates of 0.28 and 0.24 were obtained for lice density (as a measure of host resistance) and growth during infestation, respectively. Three QTL explaining between 7 and 13% of the genetic variation in resistance to sea lice (as represented by the traits of lice density) were detected on chromosomes 3, 18, and 21. Characterisation of these QTL regions was undertaken using RNA sequencing and pooled whole genome sequencing data. This resulted in the identification of a shortlist of potential underlying causative genes, and candidate functional mutations for further study. For example, candidates within the chromosome 3 QTL include a putative premature stop mutation in TOB1 (an antiproliferative transcription factor involved in T cell regulation) and an uncharacterized protein which showed significant differential allelic expression (implying the existence of a cis-acting regulatory mutation). While host resistance to sea lice is polygenic in nature, the results of this study highlight significant QTL regions together explaining between 7 and 13% of the heritability of the trait. Future investigation of these QTL may enable improved knowledge of the functional mechanisms of host resistance to sea lice, and incorporation of functional variants to improve genomic selection accuracy.

Keywords: Caligus rogercresseyi, Salmo salar, aquaculture, disease, parasite, GWAS, heritability, imputation

# INTRODUCTION


Sea lice are a major concern for salmon aquaculture worldwide, in particular Lepeophtheirus salmonis in the Northern Hemisphere and Caligus rogercresseyi in the Southern Hemisphere. These copepods attach to the skin and feed on the mucus and blood of several species of salmonid fish. Parasitized fish display reduced growth rate and increased occurrence of secondary infections (Jónsdóttir et al., 1992). In addition to a significant negative impact on salmonid health and welfare, lice prevention and treatment costs are a large economic burden for salmonid aquaculture, with global losses of over \$430M per year (Costello, 2009). Current control strategies include, for example, feed supplements, cleaner fish, tailored cage design, or "lice-zapping" lasers (Aaen et al., 2015), but these multifaceted strategies are only partially effective. Expensive and potentially environmentally damaging chemicals and treatments are still frequently required to control sea lice populations, which are becoming resistant to common delousing drugs (Bravo et al., 2008; Aaen et al., 2015). Therefore, despite these extensive control efforts, sea lice remain a significant threat to salmon welfare and aquaculture sustainability, and incur further indirect costs via their negative impact on public opinion of aquaculture (Jackson et al., 2017).

Selective breeding can contribute to sea lice prevention via harnessing naturally occurring genetic variation within commercial salmon stocks to identify the most resistant individuals. The identification of selection candidates can be enabled either by pedigree or genomic based approaches, the latter via genomic selection (Meuwissen et al., 2001). Moderate genetic variation in resistance to sea lice exists in Atlantic salmon, with heritabilities typically ranging between 0.1 and 0.3 for both the North Atlantic sea louse (Lepeophtheirus salmonis; Kolstad et al., 2005; Gjerde et al., 2011; Ødegård et al., 2014; Gharbi et al., 2015; Tsai et al., 2016), and the Pacific sea louse (Caligus rogercresseyi; Lhorente et al., 2012; Yáñez et al., 2014; Correa et al., 2017b); and tests of genomic selection approaches for resistance to both lice species have shown substantially higher prediction accuracies than "traditional" pedigree-based approaches (Ødegård et al., 2014; Gharbi et al., 2015; Tsai et al., 2016; Correa et al., 2017a).

Genomic selection is now routinely applied in Atlantic salmon breeding programs for the genetic improvement of several traits (Houston, 2017). While it offers notable benefits in terms of selection accuracy, it has limitations such as the significant cost (via the need for high volume genotyping using SNP arrays), and the limited accuracy when the reference and selection candidate populations are not closely related (e.g., Daetwyler et al., 2012; Tsai et al., 2016). Discovering the causative genetic polymorphisms underlying phenotypic variation in complex traits is a fundamental goal of genetic research. Further, identifying these causative variants would also facilitate more effective genomic selection, potentially via cheaper genotyping strategies, increased genetic gain each generation, and improved persistency of prediction accuracy across generations and populations. Approaches that incorporate biological knowledge into selection models have shown an increase in genomic prediction accuracy, especially across distantly related populations (MacLeod et al., 2016). Further, knowledge of causative variants offers future possibilities of harnessing genome editing approaches; for example, simulations have shown that even a small number of edits per generation can rapidly increase the frequency of favorable alleles and expedite genetic gain (Jenko et al., 2015). However, finding causative mutations within QTL regions is very challenging, with few success stories in farm animals to date. QTL regions tend to cover large segments of chromosomes, and typically contain many variants in linkage disequilibrium that show approximately equal association with the trait. Functional genome annotation data can be applied to prioritize variants within these regions, markedly reducing the number of mutations to investigate by narrowing them to known transcribed and regulatory regions, and, although largely lacking for aquaculture species to date, are currently being developed for Atlantic salmon and other salmonid species (Macqueen et al., 2017). Further understanding of the function of genetic variants, for example through expression QTL studies, can also identify causal associations between genotype and phenotype (Lappalainen, 2015), as previously shown for example for bovine milk composition (Littlejohn et al., 2016) or adiposity in chicken and mice (Roux et al., 2015). While challenging, shortlisting and identification of causative genes and variants impacting on disease resistance in salmon would have positive implications for both selective breeding and fundamental understanding of the host-pathogen interaction.

In the current study, a large population of farmed Atlantic salmon of Chilean origin was challenged with sea lice and high density SNP genotype data were collected. The overall aim of the study was to detect and annotate QTL affecting host resistance to sea lice in farmed Atlantic salmon, with a view to identifying putative causative genes and variants. The specific objectives were to (a) estimate the genetic variance of sea lice resistance in our population, (b) dissect the genetic architecture of the trait, and (c) explore the genomic basis of the detected QTL using transcriptomics and whole-genome sequencing data.

# MATERIALS AND METHODS

# Disease Challenge

A total of 2,668 Atlantic salmon (Salmo salar) Passive Integrated Transponder (PIT)-tagged post-smolts (average weight 122 ± 40 grams, measured for weight and length prior to the start of the challenge) from 104 families from the breeding population of AquaInnovo (Salmones Chaicas, Xth Region, Chile) were experimentally challenged with Caligus rogercresseyi (chalimus II-III). Animals were distributed in three tanks, with 24 to 27 fish of each family in every tank. Infestation with the parasite was carried out by depositing 50 copepods per fish in the tank and stopping the water flow for 6 h during infestation; thereafter, water flow was gradually restored, reaching its normal level two days later. During this process oxygen saturation was maintained at 90–110%, and oxygen and temperature were constantly monitored. Eight days after the infestation fish were sedated before being carefully and individually removed from the tanks. The number of sea lice attached to each fish was counted from head to tail. At this stage, each fish was also measured for weight and length, PIT-tags were read, and fin-clips collected for DNA extraction. Log-transformed lice density was estimated as $\log_e\left(LC / BW_{ini}^{2/3}\right)$, where $LC$ is the number of lice counted on the fish, $BW_{ini}$ is the initial body weight prior to the challenge, and $BW_{ini}^{2/3}$ is an approximation of the surface of the skin of each fish (Ødegård et al., 2014). Both raw sea lice counts and sea lice density (lice counts normalized for fish size) were used as proxies of fish resistance to sea lice. Growth during infestation was calculated as $[(BW_{end} - BW_{ini})/BW_{ini}] \times 100$, where $BW_{ini}$ and $BW_{end}$ are the weight of the fish at the start and at the end of the trial, respectively; the same formula was used for length.
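The two phenotype definitions above can be illustrated with a minimal sketch. It assumes that lice density is the natural logarithm of the lice count divided by initial body weight raised to the power 2/3; the function names are hypothetical and are not part of the study's analysis pipeline.

```python
import math


def log_lice_density(lice_count: int, bw_ini: float) -> float:
    """Natural log of lice count divided by BWini^(2/3), a proxy for skin surface."""
    if lice_count <= 0 or bw_ini <= 0:
        # The log transform is undefined for zero counts; how such fish were
        # handled is not stated in the text, so this sketch simply rejects them.
        raise ValueError("lice_count and bw_ini must be positive")
    return math.log(lice_count / bw_ini ** (2.0 / 3.0))


def percent_gain(initial: float, final: float) -> float:
    """Relative gain during infestation, as a percentage of the initial value."""
    return (final - initial) / initial * 100.0


# Illustrative inputs only: 38 lice on a 122 g fish that reaches 143 g.
density = log_lice_density(38, 122.0)
gain = percent_gain(122.0, 143.0)
```

Dividing by the 2/3 power of body weight rescales counts by an approximate surface area, so that larger fish are not scored as more susceptible simply because they offer more skin for attachment.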

# Genotyping

DNA was extracted from fin-clips from challenged fish using a commercial kit (Wizard® Genomic DNA Purification Kit, Promega), following the manufacturer's instructions. All samples were genotyped with a panel of 968 SNPs (**Supplementary Table S1**) chosen as a subset of the SNPs from a medium density SNP array (Yáñez et al., 2016) using Kompetitive Allele Specific PCR (KASP) assays (LGC Ltd., Teddington, United Kingdom). Different low density panels were tested in fish of the same population previously genotyped with the medium density SNP array (Yáñez et al., 2016), and the position of the low density panel markers was selected to maximize imputation accuracy, enriching the extremes of the chromosomes, where linkage disequilibrium decreases (Yoshida et al., 2018). A population containing full-siblings of the challenged animals had previously been genotyped with a SNP panel of 45,819 SNPs (n = 1,056, Correa et al., 2015; Yáñez et al., 2016), and the experimental lice-challenged population was imputed to ∼46 K SNPs using FImpute v.2.2 (Sargolzaei et al., 2014). Imputation accuracy was estimated by 10-fold cross validation, masking 10% of the 1,056 genotyped full-sibs to the 968 SNP panel, performing imputation, and then assessing the correlation between the true genotypes and the imputed genotypes. All imputed SNPs showing imputation accuracy below 80% were discarded, and the remaining imputed SNPs had a mean imputation accuracy of 95%. Genotypes were further filtered and removed according to the following criteria: SNP call-rate < 0.9, individual call-rate < 0.9, FDR corrected p-value for high individual heterozygosity < 0.05 (removing fish with an excess of heterozygote genotypes), identity-by-state > 0.95 (both individuals removed), Hardy-Weinberg equilibrium p-value < 10<sup>−6</sup>, minor allele frequency < 0.01. After filtering, 38,028 markers and 2,345 fish remained for association analysis.
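Two of the per-SNP filters listed above (call rate and minor allele frequency) can be sketched as follows. This is only an illustration under stated assumptions: genotypes are coded as 0, 1 or 2 copies of the alternative allele with `None` for a missing call, and the function names are hypothetical. The heterozygosity, identity-by-state and Hardy-Weinberg filters are not covered.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of the alternative allele; None = missing call.
Genotype = Optional[int]


def snp_call_rate(calls: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at this SNP."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Frequency of the rarer allele among the non-missing genotypes."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1.0 - alt_freq)


def passes_qc(calls: List[Genotype],
              min_call_rate: float = 0.9,
              min_maf: float = 0.01) -> bool:
    """Keep a SNP unless its call rate is < 0.9 or its MAF is < 0.01."""
    return (snp_call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)
```

The same call-rate logic can be applied across SNPs within an individual to implement the individual call-rate threshold.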

# Estimation of Genetic Parameters

Variance components and heritabilities were estimated by ASReml 4.1 (Gilmour et al., 2014) fitting the following linear mixed model:

$$y = \mu + Xb + Za + e\tag{1}$$

where **y** is a vector of observed phenotypes (lice number, lice density, initial weight, initial length and weight and length gain during infestation), µ is the overall mean of phenotype records, **b** is the vector of fixed effects of tank (as factor with 3 levels) and initial body weight (as a covariate; except when initial weight or initial length were the observed phenotypes), **a** is a vector of random additive genetic effects of the animal, distributed as $\sim N(0, \mathbf{G}\sigma_a^2)$, where $\sigma_a^2$ is the additive (genetic) variance and **G** is the genomic relationship matrix. **X** and **Z** are the corresponding incidence matrices for fixed and additive effects, respectively, and **e** is a vector of residuals. The best model was determined comparing the log-likelihood of models with different fixed effects and covariates. Phenotypic sex was not significant for any of the traits. **G** was calculated using the GenABEL R package (Aulchenko et al., 2007) to obtain the kinship matrix using the method of Amin et al. (2007), which was multiplied by a factor of 2 and inverted using a standard R function. Genetic correlations were estimated using bivariate analysis implemented in ASReml 4.1 (Gilmour et al., 2014) fitting the same fixed and random effects described in the univariate linear mixed model described above.
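Once the variance components of model (1) have been estimated, the heritability follows directly from them. The sketch below assumes the usual narrow-sense definition (additive variance divided by the sum of additive and residual variance), which the text does not spell out; the function name and the example values are illustrative and are not ASReml output.

```python
def heritability(var_additive: float, var_residual: float) -> float:
    """Narrow-sense heritability: additive variance over total phenotypic variance.

    Assumes the phenotypic variance is the sum of the additive and residual
    components, as in a model with a single random animal effect.
    """
    total = var_additive + var_residual
    if var_additive < 0 or var_residual < 0 or total == 0:
        raise ValueError("variance components must be non-negative and not both zero")
    return var_additive / total


# Hypothetical variance components, chosen only for illustration.
h2 = heritability(0.28, 0.72)
```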

# Single-SNP Genome-Wide Association Study

The single-SNP GWAS was performed using the GenABEL R package (Aulchenko et al., 2007) by applying the mmscore function (Chen and Abecasis, 2007), which accounts for the relatedness between individuals applied through the genomic kinship matrix. Significance thresholds were calculated using a Bonferroni correction where genome-wide significance was defined as 0.05 divided by number of SNPs (Duggal et al., 2008) and suggestive as one false positive per genome scan (1/number SNPs). The proportion of variance explained by significant SNPs was calculated as Chi-square / (N-2 + Chi-square), where N is the sample size and Chi-square is the result of the Chi-square test of the mmscore function (Chen and Abecasis, 2007).
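The two significance thresholds and the variance-explained formula described above reduce to simple arithmetic. The sketch below uses hypothetical function names; the SNP count of 38,028 is the number of markers retained after filtering.

```python
from typing import Tuple


def bonferroni_thresholds(n_snps: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Return (genome-wide, suggestive) p-value thresholds.

    Genome-wide: alpha divided by the number of SNPs.
    Suggestive: one expected false positive per genome scan (1 / number of SNPs).
    """
    return alpha / n_snps, 1.0 / n_snps


def variance_explained(chi_square: float, n_samples: int) -> float:
    """Proportion of variance explained: Chi-square / (N - 2 + Chi-square)."""
    return chi_square / (n_samples - 2 + chi_square)


genome_wide, suggestive = bonferroni_thresholds(38028)
```

With 38,028 markers the genome-wide threshold is roughly 1.3 × 10<sup>−6</sup> and the suggestive threshold roughly 2.6 × 10<sup>−5</sup>.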

# Regional Heritability Mapping

A regional heritability mapping (RHM) analysis (Nagamine et al., 2012; Uemoto et al., 2013) was performed where the genome was divided into overlapping regions of 50 consecutive SNPs (according to the most recent Atlantic salmon reference genome assembly GCA\_000233375.4) and a step-size of 25 SNPs (two consecutive SNP windows share 25 SNPs) using Dissect v.1.12.0 (Canela-Xandri et al., 2015). The significance of the regional heritability for each window was evaluated using a log likelihood ratio test statistic (LRT) comparing the global model fitting all markers with the model only fitting SNPs in a specific genomic region.
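The windowing scheme used for RHM (50 consecutive SNPs, step of 25) can be sketched as below. This is not the Dissect implementation: the function name is hypothetical, and the handling of the last, shorter window at the end of a chromosome is an assumption made for illustration.

```python
from typing import List, Tuple


def snp_windows(n_snps: int, size: int = 50, step: int = 25) -> List[Tuple[int, int]]:
    """Half-open (start, end) index ranges of overlapping SNP windows.

    Consecutive windows share size - step SNPs. The final window is truncated
    at the end of the chromosome (an assumption of this sketch).
    """
    windows: List[Tuple[int, int]] = []
    start = 0
    while start < n_snps:
        end = min(start + size, n_snps)
        windows.append((start, end))
        if end == n_snps:
            break  # the chromosome end has been reached
        start += step
    return windows


# A chromosome with 120 ordered SNPs yields four overlapping windows.
example = snp_windows(120)
```

Each window would then be fitted as a separate random effect alongside the genome-wide relationship matrix, and the two nested models compared with the likelihood ratio test.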

# Whole-Genome Sequencing

Genomic DNA from a pool of 50 fish with high sea lice counts (Mean = 66.8 ± 21.5) and a pool of 50 fish with low sea lice counts (Mean = 20.4 ± 5.3) were sequenced in five lanes of HiSeq 4000 as 150 bp PE reads. Family structure was similar in both pools, with 34 different families and a maximum of two fish per family. The quality of the sequencing output was assessed using FastQC v.0.11.5<sup>1</sup>. Quality filtering and removal of residual adaptor sequences was conducted on read pairs using Trimmomatic v.0.32 (Bolger et al., 2004). Specifically, Illumina adaptors were clipped from the reads, leading and trailing bases with a Phred score less than 20 were removed, and the read trimmed if the sliding window average Phred score over four bases was less than 20. Only reads where both pairs were longer than 36 bp postfiltering were retained. Filtered reads were mapped to the most recent Atlantic salmon genome assembly (ICSASG\_v2; Genbank accession GCF\_000233375.1; Lien et al., 2016) using Burrows-Wheeler aligner v.0.7.8 BWA-MEM algorithm (Li, 2013). Pileup files describing the base-pair information at each genomic position were generated from the alignment files using the mpileup function of samtools v1.4 (Li et al., 2009), discarding those aligned reads with a mapping quality < 30 and those bases with a Phred score < 30. Synchronized files containing read counts for every allele variant in every position of the genome were obtained using Popoolation2 v1.201 (Kofler et al., 2011). A read depth ≥ 10 and a minimum of three reads of the minor allele were required for SNP calling.

<sup>1</sup>https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
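The SNP-calling rule stated above (read depth ≥ 10 and at least three reads supporting the minor allele) can be expressed as a small predicate on per-position allele counts. The sketch assumes that the "minor allele" is the second most frequent allele at the position and that counts arrive as a base-to-count mapping; the function name is hypothetical and this is not Popoolation2 code.

```python
from typing import Dict


def is_called_snp(allele_counts: Dict[str, int],
                  min_depth: int = 10,
                  min_minor_reads: int = 3) -> bool:
    """Apply the depth and minor-allele read thresholds at one position."""
    # Observed alleles, ordered from most to least supported.
    counts = sorted((c for c in allele_counts.values() if c > 0), reverse=True)
    depth = sum(counts)
    if depth < min_depth or len(counts) < 2:
        return False  # too shallow, or only one allele observed
    # Assumption: the minor allele is the second most frequent allele.
    return counts[1] >= min_minor_reads
```

For example, a position with counts `{"A": 9, "C": 3}` passes both thresholds, whereas `{"A": 20, "G": 2}` fails on minor-allele support despite its greater depth.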

# Differential Allelic Expression

Since complex traits can be influenced by causative mutations in regulatory regions that impact gene expression (Keane et al., 2011; Albert and Kruglyak, 2015), the sequence data from a previous RNA-Seq study on the skin of a subset of animals taken from this sea lice infected population (Robledo et al., 2018a) were used to investigate allele-specific expression (differential abundance of the allelic copies of a transcript) to reveal putative nearby cis-acting regulatory polymorphisms. Alignment files were produced using STAR v.2.5.2b (Dobin et al., 2013; detailed protocol can be found in Robledo et al., 2018a) and used for SNP identification and genotype calling with samtools v1.4 (Li et al., 2009). Reads with mapping quality < 30 and bases with Phred quality scores < 30 were excluded. A read depth ≥ 10 and ≥ 3 reads for the alternative allele were required to call a SNP. Read counts were obtained for each allele in heterozygous loci and a binomial test was performed to assess the significance of the allelic differences using the R package AllelicImbalance (Gådin et al., 2015).
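The binomial test on allele read counts at a heterozygous site can be illustrated from first principles. The sketch below is not the AllelicImbalance API: it assumes a null hypothesis of balanced expression (each read equally likely to carry either allele) and a two-sided exact test that sums the probabilities of all outcomes no more likely than the observed one. The function name is hypothetical.

```python
from math import comb


def allelic_imbalance_p(ref_reads: int, alt_reads: int) -> float:
    """Two-sided exact binomial p-value under a 50:50 null.

    Integer arithmetic on binomial coefficients is used throughout, so ties
    between equally likely outcomes are handled exactly.
    """
    n = ref_reads + alt_reads
    if ref_reads < 0 or alt_reads < 0 or n == 0:
        raise ValueError("read counts must be non-negative and not both zero")
    # Under p = 0.5 the probability of k reference reads is comb(n, k) / 2**n.
    weights = [comb(n, k) for k in range(n + 1)]
    observed = weights[ref_reads]
    tail = sum(w for w in weights if w <= observed)
    return tail / 2 ** n


balanced = allelic_imbalance_p(5, 5)
skewed = allelic_imbalance_p(10, 0)
```

A small p-value indicates that one allelic copy of the transcript is over-represented, which is consistent with a nearby cis-acting regulatory variant. In practice, p-values across many sites would also need correction for multiple testing.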

# RESULTS

# Disease Challenge, Genotyping and Imputation

Eight days after the start of the challenge, the average lice burden per fish across the challenged population was 38 ± 16. The average weight was 122 ± 40 grams prior to the start of the trial and 143 ± 49 grams after the challenge. All samples were genotyped using a low-density SNP panel (968 SNPs). The average r<sup>2</sup> (squared correlation coefficient between alleles at different loci, Hill and Robertson, 1968) between consecutive markers in the low density panel was 0.06, and the average physical distance was 438 Kb. Fifty samples were genotyped for less than 90% of the SNPs and were therefore removed from subsequent analyses. The remaining samples were imputed to high-density from a population of 1,056 salmon that included full siblings of the challenge population, which had previously been genotyped for 45K SNPs (subset of Yáñez et al., 2016 selected as described in Correa et al., 2015). After removing SNPs showing low imputation accuracy (< 80%), a total of 39,416 SNPs remained with an average imputation accuracy (as assessed by cross-validation) of ∼95%. After further call rate, minor allele frequency, heterozygosity, identity-by-descent and Hardy-Weinberg filters, 38,028 markers and 2,345 fish remained for downstream analyses.

# Genetic Parameters

Heritabilities and genetic correlations of different traits related to sea lice load, growth and growth during infestation are shown in **Table 1**. The estimated heritability for sea lice load was 0.29 ± 0.04, and the number of sea lice attached to each fish showed positive genetic correlation with both initial weight (0.47 ± 0.07) and initial length (0.42 ± 0.08). However, sea lice density (h<sup>2</sup> = 0.28 ± 0.04) was independent of the size of the fish, which implies that these positive genetic correlations are due to the fact that larger fish tend to have more lice. Initial weight and length showed significant heritabilities as expected, and growth during infestation also presented a moderate heritability, especially weight (h<sup>2</sup> = 0.24 ± 0.04). Surprisingly, weight gain during infestation showed a positive genetic correlation with sea lice density and sea lice counts, albeit with a high standard error (0.25 ± 0.12 and 0.27 ± 0.12, respectively).

# Genome-Wide Association

The genetic architectures for the traits of sea lice density (**Figure 1**) and growth gain during infestation (**Supplementary Figure S1**) were studied using two different methods. The single SNP GWAS for sea lice density revealed three (imputed) SNPs reaching genome-wide significance in the distal part of chromosome 3 (**Figure 1A**), each estimated to explain 3.61–4.14 % of the genetic variation. Additional SNPs showed suggestive association with sea lice density in chromosome 5, chromosome 9, and chromosome 18. Regional heritability analyses using 50 SNP windows confirmed the QTL in chromosomes 3 and 18, both estimated to explain ∼7.5 % of the genetic variation in sea lice density (**Figure 1B**). The RHM approach detected an additional QTL not found in the single SNP GWAS, on chromosome 21, explaining close to 10 % of the genetic variation in sea lice density. The QTL regions in chromosomes 3, 18, and 21 were further refined, adding and removing SNPs until the window explaining the most genetic variation for each QTL was found. These QTL were narrowed to regions of 3–5 Mb and were each estimated to explain between 7.8 and 13.4 % of the genetic variation in sea lice density, accounting in total for almost 30% of the genetic variance (**Table 2**) assuming additive effects of the QTL.
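
Two quantities in this section lend themselves to a short illustration: the partition of the marker map into windows of 50 consecutive SNPs for the regional heritability analysis, and the share of additive genetic variance attributable to a single SNP. The Python sketch below is an assumption-laden illustration rather than the software used in the study. The `step` parameter (whether windows overlap) is hypothetical, and the single-SNP formula 2p(1 − p)a²/V<sub>A</sub> assumes Hardy-Weinberg proportions and a purely additive effect.

```python
def snp_windows(n_snps, size=50, step=50):
    """Return (start, end) index pairs (end exclusive) for windows of consecutive SNPs.

    The final window may be shorter than `size`. Setting `step` below `size`
    gives overlapping windows (an assumption; the text only states the size).
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [(start, min(start + size, n_snps)) for start in range(0, n_snps, step)]


def snp_variance_fraction(allele_freq, allele_effect, var_additive):
    """Fraction of additive genetic variance explained by one biallelic SNP.

    Uses 2 * p * (1 - p) * a^2 / V_A, assuming Hardy-Weinberg proportions and
    an additive allele substitution effect `a`.
    """
    p = allele_freq
    return 2 * p * (1 - p) * allele_effect ** 2 / var_additive
```

For example, `snp_windows(120)` yields three windows, the last covering only SNPs 100 to 119.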

# QTL Characterization

The three sea lice density QTL regions were then further interrogated to identify and characterize potential causative genes and variants. The Atlantic salmon genome annotation (Lien et al., 2016), the results of a previous RNA-Seq study comparing lice-attachment sites and healthy skin (Robledo et al., 2018a), and SNP variants obtained from the WGS of pools of fish with high and low number of lice were combined to obtain a holistic view of these regions (**Figure 2** and **Supplementary Figures S1**, **S2**). The QTL regions were all relatively large, and contained a large number of SNPs and genes. The whole-genome re-sequencing data led to the identification of 16K–22K putative SNPs in each of the QTL regions, but fewer than a thousand were located in genic regions in each of them. Surprisingly, the number of mutations that were predicted to have a moderate or large functional effect was relatively high, especially in chromosome 3 where a total of 213 non-synonymous SNPs were detected, along with 5 premature stop, 1 stop lost and 12 start gain mutations. The equivalent numbers were more modest for chromosomes 18 and 21, with 37 and 13 non-synonymous mutations, respectively, but still too high to single out high priority candidate causative variants using variant effect prediction data alone. The three QTL regions also had a relatively high number of genes (83, 36, and 11 genes in the QTL regions of chromosomes 3, 18 and 21, respectively). Therefore, to shortlist candidate genes and variants, a combination of differential expression between resistant and susceptible fish, variant effect prediction, and a literature search relating to the function of the genes and their potential role in host response to parasitic infection were used.

Heritabilities are in bold and genetic correlations are in normal font.

and nominal significance (black). (B) RHM results showing the percentage of additive genetic variance explained by each genomic region, represented by 50 consecutive SNPs.

according to the RNA-Seq. Bar color represents the expression level of the gene (lighter = less expressed), and the annotation of the gene is presented in a label on the top of the graph. Genic SNPs detected by WGS are shown in between, those with putatively more severe effects are located toward the top of the figure.

TABLE 2 | Details of Sea lice resistance QTL.

Positions and details of the QTL detected by RHM. Chromosome, start and end boundaries of the QTL region, number of SNPs in the QTL region, and percentage of genetic variance of sea lice density explained by the QTL are shown.

The clearest candidate gene in chromosome 3 was arguably TOB1, where a premature stop mutation was detected. This transcription factor negatively regulates cell proliferation (Matsuda et al., 1996), including that of T-cells (Baranzini, 2014). In our study, it was highly expressed in the skin according to RNA-Seq data, and its expression was significantly lower in lice attachment regions of the skin (Robledo et al., 2018a). For chromosome 21, serine/threonine-protein kinase 17B (STK17B) showed the highest fold change between lice-attachment and healthy skin and carried a missense mutation; this gene has been connected to apoptosis and T-cell regulation, and T-cells of STK17B-deficient mice are hypersensitive to stimulation (Honey, 2005). Previous studies comparing the immune response of resistant and susceptible salmonid species have linked Th2 type responses to sea lice resistance (Braden et al., 2015), which is consistent with these two genes potentially having a functional role relating to the resistance QTL. Chromosome 18 does not contain any clear candidate genes, but from a literature search alone, the most plausible gene is probably heme binding protein 2 (HEBP2). Reducing iron availability has been suggested as a possible mechanism of resistance to sea lice (Fast et al., 2007; Sutherland et al., 2014) and Piscirickettsia salmonis (Pulgar et al., 2015) in Atlantic salmon.

A differential allelic expression analysis was performed to screen for potential cis-acting regulatory mutations affecting genes in the QTL regions. An uncharacterized gene (XP\_014049605.1) showed clear signs of differential allelic expression (P = 0.00081, **Figure 3**) in chromosome 3 at 8.1 Mb, less than 200 Kb away from the significant GWAS SNPs. This gene is also highly expressed in the skin of lice-infected salmon (Robledo et al., 2018a), and similar proteins are found in other salmonid and teleost species (NCBI's non-redundant protein database).
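
The idea behind a differential allelic expression screen can be sketched as a test of whether the two alleles of a heterozygous SNP contribute equally to the RNA-Seq reads. The Python example below uses an exact two-sided binomial test against a 50:50 expectation. This is one common choice and an assumption here: the section does not state which test produced the reported P value, and the function name and read counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial P value for allelic imbalance at one heterozygous SNP.

    Sums the probability of every outcome that is no more likely than the
    observed reference-allele read count under the expected fraction.
    """
    n = ref_reads + alt_reads
    p = expected_ref_fraction
    probs = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # small relative tolerance so that symmetric outcomes are not lost to rounding
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)
```

In practice, counts from several heterozygous individuals would be combined and the resulting P values corrected for the number of genes tested; neither step is shown here.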

# DISCUSSION

In the current study we aimed to use a combination of GWAS, RNA-Seq, whole genome resequencing, and functional annotation approaches to identify and characterize QTL influencing host resistance to sea lice. Heritability estimates for sea lice density were similar to or higher than those of previous studies on C. rogercresseyi-challenged Atlantic salmon. Lhorente et al. (2012, 2014) obtained pedigree-estimated heritabilities of 0.17–0.34, while estimates for a related sea lice challenged population were 0.10–0.11 with both pedigree and molecular information (Yáñez et al., 2014; Correa et al., 2017a,b). This heritability is also consistent with heritability estimates for salmon challenged with L. salmonis (0.2–0.3; Kolstad et al., 2005; Gjerde et al., 2011; Gharbi et al., 2015; Tsai et al., 2016), and similar to heritabilities for resistance to other ectoparasites affecting Atlantic salmon such as Gyrodactylus salaris (0.32; Salte et al., 2010) and Neoparamoeba perurans (Amoebic gill disease; 0.23–0.48; Taylor et al., 2009; Robledo et al., 2018a).

Previous studies on the architecture of resistance to C. rogercresseyi had revealed just one significant SNP in chromosome 21 at ∼6.5 Mb (Correa et al., 2017b). While no significant SNPs were found in chromosome 21 in our study using single SNP GWAS, the regional heritability analysis did highlight the nearby region between 8 and 10.5 Mb as explaining over 13% of the genetic variation of the trait. Our RHM analysis also identified two additional QTL explaining a significant amount of the genetic variance, but only one of them was detected in our single SNP GWAS. RHM analyses and other similar approaches use the information of several consecutive SNPs, increasing the statistical power and reliability of association mapping (Riggio and Pong-Wong, 2014; Shirali et al., 2016), which consequently should result in higher repeatability and concordance between genetic association studies. Accordingly, in our study the RHM analysis arguably located the previously detected QTL (Correa et al., 2017b), while the single SNP GWAS seemingly failed to do so.

Discovering the causal variants underlying QTL is a very challenging task, and as a result very few causative variants have been identified to date. The first problem lies with the large regions that have to be investigated, since narrowing the QTL is extremely difficult due to reduced recombination and high linkage disequilibrium along large regions of the genome (e.g., Tsai et al., 2016). Further, despite the simplicity of identifying most or all variants within a region using WGS, prioritizing those variants is challenging with the current status of annotation of the Atlantic salmon genome, which has 48,775 protein coding genes and 97,546 mRNAs (Lien et al., 2016). Putative non-synonymous and even premature stop mutations appear relatively frequently, probably indicating a significant proportion of pseudogenes, and therefore hindering our ability to prioritize functional mutations. Further, in complex traits, a high proportion of causative mutations are located in regulatory elements (Keane et al., 2011; Albert and Kruglyak, 2015), which are difficult to evaluate without comprehensive genome annotation using assays that can identify such regions. In this sense, the outputs of the Functional Annotation of All Salmonid Genomes (FAASG; Macqueen et al., 2017) initiative should contribute to prioritization of intergenic SNPs through the characterization of functional regulatory elements in salmonid genomes. As a complementary approach, differential allelic expression (DAE) and expression QTL (eQTL) analyses can be an effective route to identify functional candidates (Gamazon et al., 2018). The caveat of DAE and eQTL is that gene expression is quite commonly restricted to specific tissues. A GTEx-like project (GTEx Consortium, 2017) in salmonids could also facilitate the discovery of functional variants underlying QTL.

Despite these limitations, two genes were identified that are strong candidates for the QTLs in chromosomes 3 and 21, TOB1 and STK17B, respectively. Coho salmon, a salmonid species considered resistant to sea lice, shows pronounced epithelial hyperplasia and cellular infiltration two days after sea lice attachment, and wound healing combined with a strong Th2 immune response has been suggested as the mechanism of resistance (Braden et al., 2015). TOB1 and STK17B have been previously associated with cell proliferation and T cell regulation. TOB1 is an antiproliferative protein which is ubiquitously expressed in several species (Baranzini, 2014), and inhibits T cell proliferation in humans (Tzachanis et al., 2001). TOB1 downregulation in response to sea lice attachment suggests that this gene plays a relevant role in the Atlantic salmon response to the parasite, and the detected putative premature stop codon mutation may be concordant with faster wound healing and T cell proliferation. STK17B, also known as DRAK2, has also been connected to T cell function (Honey, 2005; Gatzka et al., 2009) and to proliferation in cancer (Yang et al., 2012; Lan et al., 2018). STK17B carries a non-synonymous mutation and shows marked up-regulation in response to sea lice in salmon skin. In addition to these two strong functional candidates, the allelic differential expression analysis also revealed an uncharacterized protein regulated by cis-polymorphisms in the QTL region in chromosome 3. These three genes and their mutations deserve further attention in follow-up studies aimed at increasing resistance to sea lice in Atlantic salmon. Such follow-up studies could include further functional annotation of QTL regions using chromatin accessibility assays to identify genomic regions potentially impacting on the binding of transcription factors or enhancers. Genome editing approaches (such as CRISPR-Cas9) could be applied to test hypotheses relating to modification of gene function or expression caused by coding or putative cis-acting regulatory variants in cell culture, or ultimately to perform targeted perturbation of the QTL regions and assess the consequences on host resistance to sea lice in vivo.

# CONCLUSION

Host resistance to sea lice in this Chilean commercial population is moderately heritable (h <sup>2</sup> = 0.28) and shows a polygenic architecture, albeit with at least three QTL of moderate effect on chromosomes 3, 18, and 21 (7.8 to 13.4% of the genetic variation). Growth during infestation also has a significant genetic component (h <sup>2</sup> = 0.24), and its genetic architecture is clearly polygenic, with QTL of small effect distributed along many genomic regions. The three QTL affecting lice density were further investigated by integrating RNA-Seq and WGS data, together with a literature search. A putative premature stop codon within TOB1, an anti-proliferative protein, seems a plausible candidate to explain the QTL in chromosome 3. Alternatively, an uncharacterized protein in the same QTL region displayed differential allelic expression and may form a suitable target for further functional studies. STK17B, functionally connected to proliferation and T cell function, is a plausible candidate for the QTL in chromosome 21. It is evident that, even when all variants in a QTL region are discovered, shortlisting and prioritizing the potential causative variants underlying QTL is challenging. However, the impending availability of more complete functional genome annotation and eQTL data is likely to assist this process, thereby helping to elucidate the functional genetic basis of complex traits in aquaculture species.

# DATA AVAILABILITY STATEMENT

The RNA-Seq raw reads have been deposited in NCBI's Sequence Read Archive (SRA) under Accession No. SRP100978, and the results have been published in Robledo et al. (2018b). The WGS raw reads have been deposited in NCBI's Sequence Read Archive (SRA) under Accession No. SRP106943. The imputed genotypes and corresponding SNP positions are available in **Supplementary Data Sheet 1** (compressed file, GenABEL.ped and .map files), and phenotypes of the challenged animals are available in **Supplementary Table 2**.

# ETHICS STATEMENT

The lice challenge experiments were performed under local and national regulatory systems and were approved by the Animal Bioethics Committee of the Faculty of Veterinary and Animal Sciences of the University of Chile (Santiago, Chile), Certificate No 01-2016, which based its decision on the Council for International Organizations of Medical Sciences (CIOMS) standards, in accordance with the Chilean standard NCh-324-2011.

# AUTHOR CONTRIBUTIONS

RH, JY, and DR were responsible for the concept and design of this work and drafted the manuscript. JP was responsible for the disease challenge experiment. AB managed the collection of the samples. AG performed the molecular biology experiments. DR performed bioinformatic and statistical analyses. All authors read and approved the final manuscript.

# FUNDING

This work was supported by an RCUK-CONICYT grant (BB/N024044/1) and Institute Strategic Funding Grants to The Roslin Institute (BBS/E/D/20002172, BBS/E/D/30002275, and BBS/E/D/10002070). Edinburgh Genomics is partly supported through core grants from NERC (R8/H10/56), MRC (MR/K001744/1), and BBSRC (BB/J004243/1). DR was supported by a Newton International Fellowship of the Royal Society (NF160037).

# ACKNOWLEDGMENTS

We would like to acknowledge the contribution of Benchmark Genetics Chile and Salmones Chaicas in providing the biological material and phenotypic records of the experimental challenges. We would also like to thank Benchmark Genetics Chile for providing high-density genotypes of full-sibs of our experimental population for imputation. JY would like to acknowledge the support from Núcleo Milenio INVASAL from Iniciativa Científica Milenio (Ministerio de Economía, Fomento y Turismo, Gobierno de Chile).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00056/full#supplementary-material

FIGURE S1 | QTL region in chromosome 3. Bars represent the log2 fold change between healthy and sea lice attachment skin for every gene in the QTL region according to the RNA-seq. Bar colour represents the expression level of the gene (lighter = less expressed), and the annotation of the gene is presented in a label on the top of the graph. Genic SNPs detected by WGS are shown in between, those with putatively more severe effects are located towards the top of the figure.

FIGURE S2 | QTL region in chromosome 18. Bars represent the log2 fold change between healthy and sea lice attachment skin for every gene in the QTL region according to the RNA-seq. Bar colour represents the expression level of the gene (lighter = less expressed), and the annotation of the gene is presented in a label on the top of the graph. Genic SNPs detected by WGS are shown in between, those with putatively more severe effects are located towards the top of the figure.

TABLE S1 | Markers in the low density SNP panel.

TABLE S2 | Phenotypes of the challenged animals.

# REFERENCES


with pleiotropic effects on bovine milk composition. Sci. Rep. 6:25376. doi: 10.1038/srep25376


chum Oncorhynchus keta and pink salmon O. gorbuscha during infections with salmon lice Lepeophtheirus salmonis. BMC Genomics 15:200. doi: 10.1186/1471- 2164-15-200


**Conflict of Interest Statement:** JY and JL were supported by Benchmark Genetics Chile.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Robledo, Gutiérrez, Barría, Lhorente, Houston and Yáñez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Association and Functional Analyses Revealed That PPP1R3B Plays an Important Role in the Regulation of Glycogen Content in the Pacific Oyster Crassostrea gigas

Sheng Liu<sup>1,2,3</sup>, Li Li<sup>1,4,5,6</sup>\*, Jie Meng<sup>1,4,5,6</sup>, Kai Song<sup>1,4,5,6</sup>, Baoyu Huang<sup>1,2,5,6</sup>, Wei Wang<sup>1,2,5,6</sup> and Guofan Zhang<sup>1,2,5,6</sup>\*

<sup>1</sup> Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China, <sup>2</sup> Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China, <sup>3</sup> University of Chinese Academy of Sciences, Beijing, China, <sup>4</sup> Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China, <sup>5</sup> National and Local Joint Engineering Laboratory of Ecological Mariculture, Qingdao, China, <sup>6</sup> Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China

Edited by: Peng Xu, Xiamen University, China

# Reviewed by:

Weiwei You, Xiamen University, China Jun Hong Xia, Sun Yat-sen University, China

### \*Correspondence:

Li Li lili@qdio.ac.cn Guofan Zhang gzhang@qdio.ac.cn

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 05 December 2018 Accepted: 30 January 2019 Published: 14 February 2019

### Citation:

Liu S, Li L, Meng J, Song K, Huang B, Wang W and Zhang G (2019) Association and Functional Analyses Revealed That PPP1R3B Plays an Important Role in the Regulation of Glycogen Content in the Pacific Oyster Crassostrea gigas. Front. Genet. 10:106. doi: 10.3389/fgene.2019.00106

The Pacific oyster (Crassostrea gigas) is one of the most important aquaculture species worldwide. Glycogen contributes greatly to the special taste and creamy white color of oysters. Previous genome-wide association studies (GWAS) identified several single nucleotide polymorphism (SNP) sites that were strongly related to glycogen content. Genes within 100 kb upstream and downstream of the associated SNPs were screened. One gene annotated as protein phosphatase 1 regulatory subunit 3B (PPP1R3B), which can promote glycogen synthesis together with protein phosphatase 1 catalytic subunit (PPP1C) in mammals, was selected as a candidate gene in this study. First, full-length CgPPP1R3B was cloned and its function was characterized. The gene expression profiles of CgPPP1R3B in different tissues and seasons showed a close relationship to glycogen content. RNA interference (RNAi) experiments of this gene in vivo showed that decreased CgPPP1R3B levels resulted in lower glycogen contents in the experimental group than in the control group. Co-immunoprecipitation (Co-IP) and yeast two-hybrid (Y2H) assays indicated that CgPPP1R3B can interact with CgPPP1C, glycogen synthase (CgGS) and glycogen phosphorylase (CgGP), thus participating in glycogen metabolism. Co-sedimentation analysis in vitro demonstrated that the CgPPP1R3B protein can bind to glycogen molecules directly, and these results indicated the conserved function of the CgPPP1R3B protein compared to that of mammals. In addition, thirteen SNPs were precisely mapped in this gene. Ten of the thirteen SNPs were confirmed to be significantly (p < 0.05) related to glycogen content in an independent wild population (n = 288). The CgPPP1R3B levels in oysters with high glycogen content were significantly higher than those of oysters with low glycogen content, and gene expression levels were significantly associated with various genotypes of four associated SNPs (p < 0.05). The data indicated that the associated SNPs may control glycogen content by regulating CgPPP1R3B expression. These results suggest that CgPPP1R3B is an important gene for glycogen metabolic regulation and that the associated SNPs of this gene are potential markers for oyster molecular breeding for increased glycogen content.

Keywords: oyster, glycogen content, protein phosphatase 1 regulatory subunit 3B (PPP1R3B), gene function analyses, associated SNPs

# INTRODUCTION

fgene-10-00106 February 13, 2019 Time: 19:20 # 2

The Pacific oyster (Crassostrea gigas) is one of the most vital aquaculture species worldwide. Glycogen, a stored form of glucose, is a very important quality trait for oysters because it is responsible for the creamy white color and special taste of the oyster. In addition, glycogen is related to reproductive success and stress tolerance (Whyte and Englar, 1982; Li L. et al., 2018). The heritability of glycogen content in oysters was 0.29 ± 0.02, indicating that it is genetically controlled (Liu et al., 2019). At the same time, glycogen content is a quantitative trait with high variation among individuals and may be affected by many genes and pathways. Hence, genetic improvement of glycogen content is necessary and feasible. Glycogen content is a carcass trait that cannot be measured without killing the animal. Indirect selective breeding for glycogen content through morphological traits seems impossible because of the weak correlation between these characteristics, as shown in several studies (Li et al., 2017; Li C. et al., 2018). Thus, molecular breeding is indispensable for the genetic improvement of glycogen content, which requires elucidation of the underlying genetic basis and the identification of genetic variations.

Glycogen metabolism and physiology in oysters have been studied extensively. Berthelin et al. (2000a,b) studied the isolation, distribution and physiological properties of glycogen storage cells. Several key genes in the glycogen metabolic pathway, such as glycogen synthase (GS), glycogen phosphorylase (GP) (Bacca et al., 2005), phosphoglucomutase (PGM) (Tanguy et al., 2006), glycogenin (Li et al., 2017) and the regulatory gene GSK3β (Zeng et al., 2013), have been cloned. Glycogen content shows seasonal fluctuations along with gonad development, and the gonads and labial palps have relatively high glycogen levels (Berthelin et al., 2001). Glycogen-related genes show expression variations in different seasons as well as tissues. Furthermore, elucidation of genetic variations of glycogen metabolic genes and their relationships with glycogen content is needed for molecular breeding. Association studies, including genome-wide association studies (GWAS) and candidate gene association analysis, are widely utilized for complex trait dissection and marker screening in crops and livestock (Li et al., 2013). Although single nucleotide polymorphisms (SNPs) have been identified in large numbers and at high frequency in the oyster genome by transcriptome and whole genome re-sequencing (Wang J.F. et al., 2015; Li L. et al., 2018), few SNPs associated with glycogen metabolic genes have been reported in oysters. Candidate gene association studies in oysters identified several SNPs in or near GS, GP and glycogen debranching enzyme (GDE) that were significantly associated with oyster glycogen content (She et al., 2015; Liu et al., 2017).

Genome-wide association studies can systematically uncover markers and genes related to target traits; the approach is powerful and informative and is beneficial for molecular breeding (McCarthy et al., 2008; Manning et al., 2012; Li et al., 2013; Gutierrez et al., 2018a). GWAS for growth, sex determination, shell color and disease resistance have been conducted in several aquatic animals, such as carp (Zhou et al., 2018), Atlantic salmon (Correa et al., 2017; Covelo-Soto et al., 2017), rainbow trout (Gonzalez-Pena et al., 2016), Yesso scallop (Zhao et al., 2017) and Pacific oyster (Gutierrez et al., 2018a,b). A previous GWAS for nutritional quality traits using 427 oysters was conducted by our research group (Meng et al., unpublished) and revealed three genomic regions, including more than 100 SNPs, related to glycogen content. Genes within 100 kb upstream and downstream of the associated SNPs were screened (**Supplementary Table S1**). One gene annotated as protein phosphatase 1 regulatory subunit 3B (PPP1R3B), with six associated SNPs in or near the gene, was selected as a candidate gene related to glycogen content for further study based on its known function in mammals. PPP1R3B, a glycogen-targeting subunit, forms a holoenzyme together with PPP1C and is believed to be involved in the dephosphorylation that regulates GS and GP in mammals (Newgard et al., 2000; Stender et al., 2018). Through this dephosphorylation, GS is converted from an inactive form to an active form, while GP is converted from an active form to an inactive form. Consequently, increased glycogen accumulation and inhibited glycogen breakdown are observed. To date, the function of invertebrate PPP1R3B remains unclear. Thus, CgPPP1R3B function and genetic variations need to be elucidated.
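
The candidate-gene screen described above (genes within 100 kb upstream and downstream of an associated SNP) amounts to an interval-overlap query. The Python sketch below illustrates the idea under stated assumptions: gene records are hypothetical `(name, start, end)` tuples on the same scaffold as the SNP, coordinates are in base pairs, and a gene counts as a hit if any part of it overlaps the flanking window.

```python
def genes_near_snp(snp_pos, genes, flank=100_000):
    """Return names of genes overlapping [snp_pos - flank, snp_pos + flank].

    `genes` is an iterable of (name, start, end) tuples located on the same
    scaffold as the SNP (an assumption of this sketch).
    """
    lo, hi = snp_pos - flank, snp_pos + flank
    return [name for name, start, end in genes if start <= hi and end >= lo]


# Hypothetical annotation, for illustration only
example_genes = [
    ("geneA", 10_000, 20_000),
    ("geneB", 250_000, 260_000),
    ("geneC", 400_000, 410_000),
]
hits = genes_near_snp(150_000, example_genes)
```

With the hypothetical coordinates above, only `geneB` touches the window around position 150,000.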

In this study, we integrated forward and reverse genetic methods to help elucidate the function of this gene and its value in molecular breeding for glycogen content. We precisely mapped the glycogen-associated SNPs in CgPPP1R3B by cloning and validated these SNPs in an independent population. RNA interference (RNAi), overexpression of target genes, and protein interactions with known glycogen metabolic genes were assessed to observe the corresponding changes in glycogen content (reverse genetic methods). We measured gene expression among different genotypes of SNPs and among individuals with different glycogen contents (forward genetic method). We focused on functional studies of the CgPPP1R3B gene in regulating glycogen metabolism as well as genetic variations in or near CgPPP1R3B and their potential for improving oyster glycogen content. This study not only uncovered the genetic basis underlying glycogen content but also developed markers that could be used in molecular breeding for glycogen content in the Pacific oyster.

## MATERIALS AND METHODS

## Ethics Statement

The Pacific oyster is broadly distributed in the intertidal or sub-tidal areas of the marine environment. The oysters used in this study were collected from wild populations or from a cultured population derived from artificial breeding by the authors. This study was approved by the Animal Care and Use Committee of the Institute of Oceanology, Chinese Academy of Sciences.

# Animals, Sampling and Glycogen Measurement

The oysters used in this study were cultured in lantern nets suspended from a maritime longline and were brought from the sea to the laboratory before sampling. Four groups of oysters were used. The first group was sampled in October from a 16-month-old cultured population (shell height = 80.4 ± 1.7 mm) and was used to determine gene expression profiles in different tissues, including gonad, labial palp, gill, mantle, visceral mass and adductor muscle. The second group was used to analyze gene expression profiles in different seasons: gonads of 15 adult cultured oysters (18–27 months old, shell height = 84.4–100 mm) were sampled in January, April, July and October. The third group, consisting of 6-month-old cultured oysters (shell height = 71.0 ± 1.4 mm), was used for the RNAi experiments. The fourth group was used for association analysis: a wild population (n = 288) of spat was collected in July, separated into single individuals and cultured in lantern nets at the same density to eliminate possible environmental effects. Once these oysters reached commercial size (18 months old, shell height = 87.4 ± 0.8 mm), they were sampled the following February, when glycogen content is relatively high and stable; the adductor muscle was taken for DNA extraction and the remaining flesh for RNA extraction and glycogen measurement. Gonad developmental stage was determined based on experience and seawater temperature records according to Lango-Reynoso et al. (2006), and the oysters used in each experiment were at the same gonad developmental stage.

For glycogen measurement in the oysters described above, the corresponding tissues or the flesh were ground to a powder in liquid nitrogen with a mortar and then freeze-dried for 48 h. Approximately 0.1 g of dried powder was used, and glycogen content was determined by near-infrared reflectance spectroscopy, which is high-throughput and more accurate than the traditional method (Wang W.J. et al., 2015).

# Gene Cloning and Bioinformatics Analysis

Full-length CgPPP1R3B and CgPPP1C were cloned by rapid amplification of cDNA ends (RACE). All the primers used in this study are listed in **Supplementary Table S2**. Open Reading Frame Finder<sup>1</sup> was used to analyze coding sequences and the deduced polypeptides they encode. The UniProt database<sup>2</sup> was used to predict protein domains. Protein sequences from different species were downloaded from NCBI<sup>3</sup>. A phylogenetic tree was constructed with the neighbor-joining algorithm using the program MEGA (Version 6.0). The reliability of the branching was tested using bootstrap resampling (1000 pseudo-replicates). Multiple alignments were completed with DNAMAN (Version 9).

# Gene Expression Profile Detection of Different Tissues and Seasons

CgPPP1R3B expression levels in six tissues (gonad, labial palp, gill, mantle, visceral mass and adductor muscle) in October (n = 15 for each tissue) and different seasons (n = 15) for gonads were determined by real-time PCR (RT-PCR). Total RNA was isolated using an RNAprep Kit (Tiangen, Beijing) according to the manufacturer's instructions. The RNA integrity and concentration were tested by 1% agarose gel electrophoresis and NanoDrop 2000 spectrophotometry, respectively. cDNA was synthesized using a Prime Script RT Kit (TaKaRa, Dalian). RT-PCR was performed on a 7500 Fast Real-Time PCR System (ABI, United States) using a SYBR Green Master Mix kit (TaKaRa). The primers used for the RT-PCR analysis are listed in **Supplementary Table S2**. The elongation factor (EF) gene was chosen as an internal control, and each result represents the mean of three replicates.
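
The normalization to an internal control described above can be illustrated with the widely used 2<sup>−ΔCt</sup> and 2<sup>−ΔΔCt</sup> calculations. This is a sketch under assumptions: the section does not state which quantification method was applied, the calculation assumes roughly 100% amplification efficiency for both genes, and the function names and Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(target_ct, reference_ct):
    """Expression of the target relative to the reference gene, 2 ** -(delta Ct).

    Each argument is a list of technical-replicate Ct values (e.g., three
    replicates); delta Ct = mean Ct(target) - mean Ct(reference).
    """
    delta_ct = mean(target_ct) - mean(reference_ct)
    return 2.0 ** (-delta_ct)


def fold_change(sample_target, sample_ref, calibrator_target, calibrator_ref):
    """Fold change of a sample versus a calibrator sample, 2 ** -(delta delta Ct)."""
    return (relative_expression(sample_target, sample_ref)
            / relative_expression(calibrator_target, calibrator_ref))


# Hypothetical Ct values: target gene versus the EF internal control
gonad = relative_expression([24.0, 24.0, 24.0], [20.0, 20.0, 20.0])
```

A target gene that crosses the threshold four cycles later than the reference gives a relative expression of 2<sup>−4</sup>, as in the hypothetical `gonad` value above.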

# Plasmid Construction, Cell Culture, and Transfection

For the generation of tagged protein for further functional studies, the open reading frame (ORF) regions of CgPPP1R3B, CgPPP1C, CgGS, and CgGP were amplified using Phusion High-Fidelity DNA polymerase (Thermo) with specific primers (**Supplementary Table S2**). pCMV-Myc (Clontech, United States), pEGFP-N1 (Clontech), and pCMS-EGFP-FLAG plasmids (constructed by our lab) and pET-32a (Biomed, Beijing) were digested with EcoRI, XhoI, XhoI, and EcoRI (New England Biolabs, United States), respectively. The purified PCR products were fused with the purified digested plasmids using the Ligation-Free Cloning System (Applied Biological Materials, Inc., Canada). Escherichia coli Trans T1 cells (TransGen, China) were transformed with the fusion mixture and cultured in LB agar plates overnight, and the grown colonies were tested by colony PCR. One clone was confirmed by Sanger sequencing, and the corresponding plasmids were extracted from overnight cultured bacteria using an endo-free plasmid extraction kit (Tiangen).

For co-immunoprecipitation (Co-IP) and the CgPPP1R3B protein overexpression assays, HEK293T cells (ATCC) were cultured in Dulbecco's modified Eagle's medium (high glucose) (HyClone). HeLa cells (ATCC, United States) were cultured in modified Roswell Park Memorial Institute (RPMI)-1640 medium (HyClone, United States) for subcellular assays. Both types

<sup>1</sup>http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi

<sup>2</sup>https://www.uniprot.org

<sup>3</sup>http://www.ncbi.nlm.nih.gov/guide/proteins/

of media were supplemented with 10% fetal bovine serum (HyClone) and 1× penicillin-streptomycin solution (Solarbio, China). Cells were grown in an atmosphere of 5% CO<sub>2</sub> at 37 °C. Plasmids were transfected into HeLa or HEK293T cells using Lipofectamine 3000 reagent (Life Technologies, United States) according to the manufacturer's instructions.

# CgPPP1R3B RNAi in vivo and Overexpression in HEK293 Cells


To further elucidate the role of CgPPP1R3B in glycogen metabolism, we performed CgPPP1R3B RNAi experiments. RNAi was conducted with small interfering RNA (siRNA) synthesized by Shanghai GenePharma. The siRNA sequences are shown in **Supplementary Table S2**. This experiment was carried out in January, which is thought to be the glycogen accumulation stage (Berthelin et al., 2000b). Oysters were anesthetized in a magnesium chloride solution (500 g MgCl<sub>2</sub> + 5 L seawater + 5 L freshwater) according to Suquet et al. (2009). One hundred and eighty oysters were then randomly divided into three groups of 60 individuals. 100 µl of siRNA (100 ng/µl), neutral control (NC) RNA (100 ng/µl), or H<sub>2</sub>O (blank control, BC) was injected into the adductor muscle; the animals were injected twice, on days 0 and 10. After injection, a small amount of gonad and labial palp was sampled for RNA extraction and subsequent expression analysis of CgPPP1R3B, CgGS and CgGP by RT-PCR, and the remaining flesh was sampled for glycogen content measurement. Tissues were sampled over 18 days, and details of the sampling are shown in **Supplementary Table S3**. Overexpression of CgPPP1R3B was conducted in HEK293 cells, which were cultured in 5-cm diameter Petri dishes (Corning, United States). CgPPP1R3B-Myc, CgPPP1R3B-FLAG, and CgPPP1R3B-EGFP were transfected, and cells were harvested 24 h later for glycogen measurement with a kit according to the manufacturer's instructions (Jiancheng, Nanjing). The glycogen content represents the mean of three independent replicates.

# Protein–Protein Interaction Methods

To analyze the interactions of CgPPP1R3B with its catalytic subunit CgPPP1C and with CgGS and CgGP, Co-IP and yeast two-hybrid (Y2H) assays were carried out.

# Co-IP Assays

HEK293T cells were divided between two or more Petri dishes (10 cm diameter, Corning, United States) and cultured for 24 h. Fusion pCMV-Myc/pEGFP-N1 plasmids were cotransfected with vectors expressing FLAG-tagged fusion proteins or empty FLAG vector (control). After 24 h, cells were harvested, and proteins were extracted by cell lysis buffer (Beyotime). Control (input) samples were prepared from the cell lysate, and the remaining lysates were mixed with the anti-FLAG M2 magnetic beads (Sigma, United States) by gently mixing with a roller at 4 °C for 1–2 h. The beads were washed twice with PBS and then washed twice with cell lysis buffer. Input and Co-IP samples were mixed with 2× protein SDS-PAGE loading buffer (TaKaRa) and then denatured at 100 °C for 5 min. Proteins were analyzed by SDS-PAGE and then by Western blotting using anti-Myc/GFP and anti-FLAG antibodies (Sigma).

# Y2H Assays

Y2H assays were performed to detect interactions between proteins using the Clontech Matchmaker Gold Yeast Two-Hybrid System (TaKaRa). The fusion plasmids pGADT7 (AD vector) and pGBKT7 (BD vector) were transformed into Y187 and Gold yeast competent cells, respectively. Y187 cells were cultured on selective plates with synthetically defined (SD) medium lacking leucine (SD/-Leu), whereas Gold cells were cultured on plates lacking tryptophan (SD/-Trp). After 48–96 h, yeast were grown on SD/-Leu and SD/-Trp medium and then hybridized in 2× yeast extract peptone dextrose-adenine (YPDA) medium and selected by a double dropout (SD/-Leu/-Trp) plate. Interactions between proteins were detected based on the ability of the hybridized yeast to grow on quadruple dropout (SD/-Ade/-His/-Leu/-Trp) medium supplemented with X-α-Gal and aureobasidin A (TaKaRa).

# Glycogen-Protein Co-sedimentation Assay

For analysis of whether the CgPPP1R3B protein could bind to the glycogen molecule, glycogen-protein co-sedimentation assays were performed. Prior to the co-sedimentation assay, recombinant pET-32a/CgPPP1R3B proteins (6x-His Tag) were produced in E. coli. Plasmid-ligated pET-32a was transformed into the E. coli BL21 (DE3) strain (Vazyme, Nanjing) and cultured on the corresponding LB agar plate. A single bacterial colony was selected and cultured overnight at 37 °C in 10 ml LB medium (supplemented with 100 µg/ml ampicillin). In another conical flask with 200 ml LB medium, 4 ml of the abovementioned bacteria was inoculated and cultured to OD600 = 0.6. Protein production was induced by the addition of 1 mM isopropyl β-D-thiogalactoside for 16 h in a 16 °C incubator. Bacterial cells were collected by centrifugation, and the pellets were resuspended in PBS. Then, resuspended bacteria were ultrasonicated and centrifuged for 5 min at 12,000 g. The supernatant was used for purification of recombinant protein using a His-tag Protein Purification Kit (Beyotime, Shanghai).

Glycogen-protein co-sedimentation assays were performed as described previously (Kerekes et al., 2014). A total of 400 µl of 0.5 mg/ml recombinant pET-32a/PPP1R3B was incubated with 670 µl of 17.6% (mass/volume) glycogen solution (from oyster, Yuanye) and 1600 µl of buffer for 2 h at 4 °C; the buffer contained Tris-HCl (pH 7.5), mercaptoethanol, PMSF, EDTA, and protease inhibitor cocktail (Roche), with concentrations of 10, 1, 1, 1, and 1 mM, respectively. After incubation, 4 ml of extra buffer was added, and the solution was layered over 6.7 ml of 0.25 M sucrose in a 12.5 ml centrifuge tube. The tube was centrifuged for 90 min at room temperature at 25,700 rpm (100,000 g) in a Beckman XL-80K centrifuge, after which the supernatant and pellet fractions were separated. The empty pET-32a vector was used as a negative control and was treated in the same way. After the addition of 2× SDS buffer, the samples were denatured for 5 min at 100 °C and then examined by SDS-PAGE and Western blotting with an anti-His antibody (TransGen, China).
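The pairing of rotor speed and relative centrifugal force quoted for the co-sedimentation step (25,700 rpm, 100,000 g) can be checked with the standard conversion RCF = 1.118 × 10<sup>−5</sup> × r × rpm<sup>2</sup>, where r is the rotor radius in centimetres. The rotor radius is not given in the text, so the minimal sketch below only solves for the radius implied by the two reported values; the function names are illustrative and not part of the original protocol.

```python
# Standard conversion between rotor speed and relative centrifugal force (RCF):
#   RCF = 1.118e-5 * radius_cm * rpm**2
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (in multiples of g) at a given speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def radius_from_rcf(rcf, rpm):
    """Rotor radius (cm) implied by a reported RCF and rotor speed."""
    return rcf / (RCF_CONSTANT * rpm ** 2)


# Values reported in the text; the radius itself is an inferred quantity.
implied_radius_cm = radius_from_rcf(100_000, 25_700)
print(f"Implied rotor radius: {implied_radius_cm:.1f} cm")
```

An implied radius of roughly 13.5 cm lies within the range typical of swinging-bucket ultracentrifuge rotors, so the two reported values are mutually consistent.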

# Subcellular Localization Assay


For detection of the subcellular localization of CgPPP1R3B, CgPPP1C, CgGS, and CgGP, HeLa cells were transfected with CgPPP1R3B-EGFP, CgPPP1C-EGFP, CgGS-EGFP, CgGP-EGFP or pEGFP-N1, and nuclei were counterstained with Hoechst 33342 (blue) (Invitrogen, United States). The membrane was counterstained with Alexa Fluor 594 (red) (Invitrogen, United States). After the nuclei and membrane were stained for 10 min at 37 °C, cells were washed twice with PBS and then observed using laser scanning confocal microscopy (Carl Zeiss, Germany).

# Genotyping of Candidate SNPs

Prior to the association analysis, genotyping of the candidate SNPs was performed. Genomic DNA of oysters was extracted from the adductor muscle by sodium laurate as described previously (Cai et al., 2014) with a simple modification. The integrity and concentration of the DNA were tested by 1% agarose gel electrophoresis and NanoDrop 2000 spectrophotometry, respectively. All SNPs used in this study were derived from whole-genome resequencing of 427 individuals (Li L. et al., 2018).

Thirteen SNPs, including six associated SNPs (63644, 63965, 64380, 64742, 67094, and 67096) from the previous GWAS results (Meng, unpublished) and seven other newly identified SNPs (63878, 68341, 68496, 71564, 72487, 77716, and 78604), were genotyped by the SNaPshot method on a 3730xL DNA Analyzer (Applied Biosystems) in the wild population (n = 288).

# Association Studies Among Genotypes, CgPPP1R3B Expression and Glycogen Content

To determine the association between SNP genotypes and glycogen content, a linear regression model was fitted in SPSS 22.0 (dominant allele homozygotes were coded as 0, heterozygotes as 1, and recessive allele homozygotes as 2). Linkage disequilibrium among SNPs was analyzed with the online tool at http://analysis.bio-x.cn/myAnalysis.php (Shi and He, 2005). The association between SNP genotypes and gene expression levels was determined as described above. RNA for the association studies was extracted from the same homogenized flesh used for glycogen measurement, and the expression levels of individuals (n = 210) with various genotypes were determined by RT-PCR. CgPPP1R3B expression in high-glycogen (n = 20) and low-glycogen (n = 20) individuals was also compared to assess the relationship between expression and glycogen content.
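The additive genotype coding described above (0, 1, 2) can be illustrated with a minimal least-squares sketch. This is not the SPSS procedure used in the study: it is a plain-Python ordinary least-squares fit, the genotype codes and glycogen values are hypothetical, and the p-value calculation is omitted. It shows only how the slope and R<sup>2</sup> arise from the coded genotypes.

```python
def additive_regression(genotype_codes, phenotypes):
    """Ordinary least-squares fit of phenotype on genotype code (0, 1, 2).

    Returns (slope, intercept, r_squared); r_squared is the share of
    phenotypic variance explained by the single marker.
    """
    n = len(genotype_codes)
    mean_x = sum(genotype_codes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_codes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype_codes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: three individuals per genotype class, glycogen in % dry weight.
codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
glycogen = [40.0, 41.0, 39.0, 37.0, 38.0, 36.0, 34.0, 35.0, 33.0]

slope, intercept, r_squared = additive_regression(codes, glycogen)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.2f}")
```

In this toy data set each additional copy of the coded allele lowers glycogen content by three percentage points. A strictly additive coding of this kind cannot capture a heterozygote advantage.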

# Data Analysis

Glycogen content is shown as the mean ± standard error of the mean (SEM). Gene expression levels were calculated by the 2<sup>−ΔΔCT</sup> method (Livak and Schmittgen, 2001) and are also shown as the mean ± SEM. For the RNAi study, one-way ANOVA followed by a post hoc multiple comparison (Duncan) was used to test the significance of differences in expression levels or glycogen content among the three groups. Student's t-test, performed in Microsoft Excel, was used to compare the experimental group with the control in the CgPPP1R3B overexpression study. In the association study, a linear regression model was fitted in SPSS 22.0 as mentioned above; the p-value indicates the significance of the association, and R<sup>2</sup> of the model represents the phenotypic variation explained (PVE). Student's t-test in Microsoft Excel was also used to compare expression levels among individuals of different genotypes and between the low- and high-glycogen groups.
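The relative quantification and the SEM reported throughout can be sketched as follows. The Ct values are hypothetical, the function names are illustrative, and the reference gene is assumed to be the EF internal control named earlier. The calculation itself follows the published 2<sup>−ΔΔCT</sup> formula, in which the target Ct is first normalized to the reference gene and then to a calibrator sample.

```python
import math
import statistics


def fold_change(ct_target_sample, ct_ref_sample,
                ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the 2^(-ddCt) method."""
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2 ** (-delta_delta_ct)


def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical Ct values: target crosses threshold two cycles earlier in the
# sample than in the calibrator, with identical reference-gene Ct values.
example_fold = fold_change(24.0, 18.0, 26.0, 18.0)
example_mean, example_sem = mean_and_sem([1.0, 2.0, 3.0])
print(example_fold, example_mean, round(example_sem, 3))
```

A ΔΔCT of −2 corresponds to a fourfold higher expression in the sample than in the calibrator, because each PCR cycle is assumed to double the amount of product.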

# RESULTS

# Gene Functional Studies

## Gene Cloning, Phylogenetic Tree Construction and Expression Profile

A full-length cDNA of 1717 bp was isolated from oyster cDNA and designated CgPPP1R3B. CgPPP1R3B encodes a 351 amino acid polypeptide (**Supplementary Figure S1**), with a predicted molecular weight of 39.66 kDa and a predicted carbohydrate binding type-21 (CBM21) domain. Another full-length cDNA was 1171 bp and designated CgPPP1C. CgPPP1C encodes a 328 amino acid polypeptide (**Supplementary Figure S2**), with a predicted molecular weight of 37.29 kDa, and UniProt blast analysis predicted a PP2A domain (protein phosphatase 2A homologs, catalytic domain).

The phylogenetic tree was constructed based on the protein sequences of PPP1R3B and PPP1C from human, mouse, amphibian, reptile, fishes, oyster, and other shellfish (**Figure 1**). The phylogenetic tree and multiple alignment analysis suggest that CgPPP1R3B and CgPPP1C are orthologs of invertebrate and vertebrate PPP1R3B and PPP1C proteins, respectively. The PPP1R3B sequence varied greatly among species; the similarity of the oyster to mouse protein is 40%, which was relatively low, as shown in **Supplementary Figure S3**. The CgPPP1C sequence was strongly conserved, with a similarity of more than 90%, as indicated by the multiple alignment analysis shown in **Supplementary Figure S4**.

CgPPP1R3B expression in different tissues in October showed high levels in the gonad, labial palp and mantle, which have relatively high glycogen contents (**Supplementary Figures S5A,B**). In different seasons, the expression levels showed similar trends to the glycogen contents. In winter and autumn, the CgPPP1R3B transcript levels and glycogen contents were higher than those in spring and summer (**Supplementary Figure S5C**).

## Knockdown of CgPPP1R3B by RNAi Decreased the Glycogen Content

The gene expression level of the RNAi group was lower than that of the NC and BC groups on most days. Knockdown of CgPPP1R3B expression resulted in a strong interference effect 3–4 days post-injection (**Figure 2A**). These results demonstrated that suppression of CgPPP1R3B slightly decreased CgGS and increased CgGP expression on days 4 and 5 (**Figures 2B,C**) and decreased glycogen content on the 18th day post-first injection (**Figure 2D**).

# Overexpression of PPP1R3B Increased Glycogen Content in HEK293 Cells

CgPPP1R3B was overexpressed in HEK293 cells, and the results showed that glycogen content was significantly higher in the overexpression group than in the control group (vector), and no differences among different vectors were found in improving glycogen content (**Figure 2E**).

# CgPPP1R3B Interacted With CgPPP1C, CgGS, CgGP and Glycogen Molecules

We conducted Y2H and Co-IP assays to detect the interaction of CgPPP1R3B with its catalytic subunit CgPPP1C (**Figures 3A,B**). The CgPPP1C binding site was the KRVSF (133–137 aa) motif, as shown by the truncated protein analyses (**Supplementary Figure S6A**) and by site-directed mutagenesis of V135E in CgPPP1R3B, which abrogated this interaction (**Supplementary Figure S6B**).

Based on a previous study in mammals, we carried out assays to determine whether CgPPP1R3B interacts with key enzymes involved in glycogen metabolism, such as CgGS and CgGP. The results showed that CgPPP1R3B can interact with CgGS in Y2H and Co-IP assays (**Figures 3C,D**); similar results were obtained for CgGP in Y2H assays (**Figure 3E**). The interaction domain of CgPPP1R3B with CgGS was not determined because none of the truncated proteins could interact with CgGS (**Supplementary Figure S6C**). The interaction domain of CgPPP1R3B with CgGP may be the C-terminal of the protein, as indicated by the truncated protein experiments shown in **Supplementary Figure S6D**. The co-sedimentation of the CgPPP1R3B protein with glycogen molecules showed that

FIGURE 2 | CgPPP1R3B RNAi in vivo and overexpression in HEK293T cells. (A) CgPPP1R3B expression after RNAi, by days post first injection (n = 5); oysters were injected twice, on day 0 (first) and day 10 (second). The asterisk indicates a significant difference (p < 0.05) between the RNAi and neutral control groups. (B) CgGS and (C) CgGP expression on days 3–5 post first injection (n = 5). (D) Glycogen content on day 18 post first injection (n = 5); different letters ("a" and "b") indicate a significant difference (p < 0.05), and glycogen content is given as a percentage of flesh dry weight. (E) Fold change of glycogen content in HEK293T cells overexpressing CgPPP1R3B in three kinds of vector (n = 3); double asterisks indicate that glycogen levels in the overexpression group are higher than in the vector control (p < 0.01). Bars in (A–E) represent the mean ± SEM (standard error of the mean).

they can bind each other in vitro, while the pET-32a vector-expressed tag protein did not bind to the glycogen molecules, as shown in **Figure 4**. Thus, a large complex of CgPPP1R3B, CgPPP1C, CgGS, and CgGP that targets glycogen molecules can form (**Figure 5**).

## Subcellular Localization

Finally, to directly assess the subcellular localization of CgPPP1R3B, CgPPP1C, CgGS and CgGP, we performed fluorescence localization analyses. The fluorescent signals representing the CgPPP1R3B fusion proteins were mainly located in the cytoplasm, and CgGS appeared to aggregate together in the cytoplasm. In contrast, CgPPP1C and CgGP were distributed in a dispersed manner in both the cytoplasm and the nucleus (**Supplementary Figure S7**).

# Association Studies of SNPs in CgPPP1R3B

# Glycogen Measurement and Genotyping of SNPs in the Wild Population

Glycogen content in the wild population ranged from 14.9 to 49.4% dry weight, with a mean value of 37.5 ± 0.3%, which followed the normal distribution (p > 0.05). There was no subpopulation structure of this wild population based on genomic kinship matrix and principal component analysis (unpublished data).

FIGURE 4 | CgPPP1R3B protein binds directly to the oyster glycogen molecule in vitro. Input refers to the sample before centrifugation, used as a control. After centrifugation, the sediment at the bottom of the tube was designated the pellet. Input, supernatant, and pellet were tested by Western blotting with an anti-His antibody. St represents protein molecular mass standards; molecular masses are given in kDa.

We mapped the CgPPP1R3B cDNA sequence to the genome; CgPPP1R3B spanned from scaffold 1243–64080 to scaffold 1243–79594. The full-length DNA of this gene was 15515 bp, which is much longer than predicted, with three exons and two introns, as shown in **Supplementary Table S4**. The precise locations of the 13 SNPs in the genome are presented in **Figure 6**. Three SNPs were located in the 5′ flanking region (SNPs 63644, 63878, and 63965), one in the 5′ UTR (SNP 68341), seven in the introns (SNPs 64380, 64742, 67094, 67096, 71564, 72487, and 77716) and two in the CDS region (SNPs 68496 and 78604, synonymous mutations). Six of the SNPs were T/C transitions, four were A/G transitions, two were A/C transversions, and one was a T/G transversion. The minor allele frequency (MAF) exceeded 0.05 for most of the SNPs, ranging from 0.09 to 0.47; the exception was SNP 68496, with a MAF of 0.02. Genotype and allele frequencies of the 13 SNPs in the wild population are shown in **Supplementary Table S5**.
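As a minimal sketch of how allele frequencies and the MAF follow from genotype counts, the example below uses the SNP 63644 genotype counts reported for the expression subset in this article (TT = 58, TC = 52, CC = 100; n = 210). These are not the counts for the full wild population of 288, which are given only in Supplementary Table S5, and the function names are illustrative.

```python
def allele_frequencies(n_hom_first, n_het, n_hom_second):
    """Frequencies of the two alleles of a biallelic SNP from genotype counts."""
    total_alleles = 2 * (n_hom_first + n_het + n_hom_second)
    freq_first = (2 * n_hom_first + n_het) / total_alleles
    return freq_first, 1.0 - freq_first


def minor_allele_frequency(n_hom_first, n_het, n_hom_second):
    """The smaller of the two allele frequencies."""
    return min(allele_frequencies(n_hom_first, n_het, n_hom_second))


# SNP 63644 in the expression subset: TT = 58, TC = 52, CC = 100.
freq_t, freq_c = allele_frequencies(58, 52, 100)
maf = minor_allele_frequency(58, 52, 100)
print(f"T = {freq_t:.2f}, C = {freq_c:.2f}, MAF = {maf:.2f}")
```

In this subset T is the minor allele, with a frequency of 0.40.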

## Association Between SNPs and Glycogen Content

The associations between the genotypes of the 13 SNPs and glycogen content are shown in **Table 1**. SNP 63644 (TT > TC > CC), SNP 63965 (CC > TC > TT), SNP 63878 (TT > TC > CC), SNP 64742 (TT > TC > CC), SNP 67094 (AG > GG > AA), SNP 67096 (AG > GG > AA), SNP 68341 (TT > TC > CC), SNP 71564 (CC > TC > TT), SNP 72487 (AA > AC > CC), and SNP 78604 (TT > TG > GG) were significantly related to glycogen content (p < 0.05). Individual SNPs explained 1.4–6.3% of the phenotypic variation (PVE). The minor allele was the favorable allele for 63644 (T), 63878 (T), 63965 (C), 64742 (T), 68341 (T), 71564 (C), 72487 (A), and 78604 (T). For 67094 and 67096, individuals carrying the favorable allele had increased glycogen levels, and the heterozygotes had higher glycogen contents than either homozygote. For the other significant SNPs, glycogen content followed the order favorable allele homozygotes > heterozygotes > unfavorable allele homozygotes.

### Relationship Between CgPPP1R3B Expression Levels and Glycogen Content

High-glycogen individuals (n = 20) and low-glycogen individuals (n = 20), whose contents were 45% ± 0.29% and 26% ± 0.96%, respectively, were significantly different (p < 0.001). The gene expression of CgPPP1R3B in high-glycogen individuals was significantly higher than that in low-glycogen individuals (p < 0.001) (**Figure 7**).
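The size of the difference between the two glycogen groups can be sanity-checked from the summary statistics alone. This is a hedged re-computation, not the authors' Excel analysis. It assumes the reported ± values are SEMs, as stated in the Data Analysis section, and uses the fact that with equal group sizes the pooled-variance Student's t statistic reduces to the difference in means divided by the square root of the summed squared SEMs.

```python
import math


def t_from_summary(mean_a, sem_a, mean_b, sem_b):
    """Two-sample t statistic from group means and SEMs (equal group sizes)."""
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


# Reported glycogen content (% dry weight): high group 45 ± 0.29, low group 26 ± 0.96.
t_statistic = t_from_summary(45.0, 0.29, 26.0, 0.96)
degrees_of_freedom = 20 + 20 - 2
print(f"t = {t_statistic:.2f} with {degrees_of_freedom} degrees of freedom")
```

A t statistic of this magnitude with 38 degrees of freedom is consistent with the reported p < 0.001.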

## Association Between SNPs and CgPPP1R3B Expression Levels

Analyses of the relationships between gene expression levels and the 13 SNP loci showed that genotypes at four SNPs (63644, 63965, 63878, and 68341) were significantly related to CgPPP1R3B expression levels (**Figure 8**). For SNP 63644, the CgPPP1R3B expression level of favorable allele homozygotes (TT, n = 58) was significantly higher than that of heterozygotes (TC, n = 52) and unfavorable allele homozygotes (CC, n = 100). There was no difference between heterozygotes and unfavorable allele homozygotes. For SNP 63965, expression levels differed between favorable allele homozygotes (CC, n = 37) and either heterozygotes (TC, n = 55) or unfavorable allele homozygotes (TT, n = 115). There were no significant differences in transcript levels between heterozygotes and unfavorable allele homozygotes. For SNP 63878, expression levels differed significantly among favorable allele homozygotes (TT, n = 7), heterozygotes (TC, n = 33) and unfavorable allele homozygotes (CC, n = 170) (i.e., TT > CC, TC > CC). For SNP 68341, the difference between the dominant allele homozygotes (TT, n = 9) and heterozygotes (TC, n = 80) was significant; it was similar between dominant allele homozygotes and recessive allele homozygotes (CC, n = 121) (i.e., TT > CC, TC > CC).

## Effect of Genotypic Combination

The 10 associated SNPs were used to characterize the linkage disequilibrium (LD) pattern. The results showed that the LD block was quite short, and only SNP 67094 and SNP 67096 were located in the same block; other glycogen-associated SNPs were not in the same LD block (**Supplementary Figure S8**) and showed independent assortment. The genotypic combination of

FIGURE 6 | Gene structure of CgPPP1R3B and locations of the glycogen content-associated SNPs in this gene in the Pacific oyster. From left to right (5′–3′), red dots represent SNPs from the GWAS results (63644, 63965, 64380, 64742, 67094, and 67096), and black dots represent newly identified SNPs (63878, 68341, 68496, 71564, 72487, 77716, and 78604). The blue box represents the 5′ UTR, the yellow box the CDS, and the red box the 3′ UTR; green lines are introns.

most significantly associated SNPs (63644, 63965, and 63878) showed that the glycogen content was 42.77 ± 0.91, which was 14.1% higher than the average content. Individuals with the most favorable combinations accounted for 4.9% of the total wild population. For individuals with unfavorable combinations, glycogen content was 35.93 ± 0.56, which was 4.2% lower than the average. Not all of the combinations are shown in **Table 2** because there are many combinations. Individuals with favorable allele combinations had higher glycogen content than those with a single favorable allele.
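The percentage differences quoted for the genotype combinations can be reproduced from the group means and the population mean. The sketch below uses the rounded population mean of 37.5% reported earlier in the Results, so the last digit may differ slightly from values computed from unrounded data; the function name is illustrative.

```python
def percent_difference(group_mean, population_mean):
    """Relative difference of a group mean from the population mean, in percent."""
    return (group_mean - population_mean) / population_mean * 100


POPULATION_MEAN = 37.5  # % dry weight, rounded value reported for the wild population

favorable_gain = percent_difference(42.77, POPULATION_MEAN)
unfavorable_loss = percent_difference(35.93, POPULATION_MEAN)
print(f"favorable: {favorable_gain:+.1f}%, unfavorable: {unfavorable_loss:+.1f}%")
```

These are relative changes with respect to the population mean, not differences in percentage points of dry weight.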



In the allele column, T > C means that T was the favorable allele for glycogen content. Genotyping was completed with the SNaPshot method. Glycogen content is shown as a percentage of flesh dry weight. PVE is the phenotypic variation explained (%). # means that the SNP was also significantly associated with glycogen content in the GWAS population (Meng et al., unpublished). Three significance levels were used: <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01, <sup>∗∗∗</sup>p < 0.001.

# DISCUSSION

Based on previous GWAS results, candidate genes and SNPs were acquired. We explored gene functions and validated associated SNPs in an independent wild population. We adopted both forward genetics and reverse genetics methods to further elucidate CgPPP1R3B function in regulating glycogen content and to identify associated variations that may be useful in molecular breeding.

# Identification of PPP1R3B Function

CgPPP1R3B encodes a longer protein than most of its orthologs in invertebrates and vertebrates. Multiple alignment showed that the CgPPP1R3B protein had relatively low similarity (40%) to that of humans (**Supplementary Figure S3**), whereas for rabbit and mouse, the homologies were 70–85% (Newgard et al., 2000). CgPPP1C encoded a strongly conserved protein with a similar length to that of other animals (**Supplementary Figure S4**), with more than 90% similarity. The high similarity suggested the conserved function in different taxa of animals. Several glycogen metabolic genes were cloned in the Pacific oyster, and most of the identified glycogen metabolism and corresponding genes show more than 60% similarity to protein sequences of other species, thus indicating a relatively conserved pathway (Bacca et al., 2005; Hamano et al., 2005; Tanguy et al., 2006; Hanquet et al., 2011; Zeng et al., 2013; Li et al., 2017).

For the gene expression profiles, CgPPP1R3B expression increased in tissues with high glycogen contents in October, consistent with the high expression of CgGS observed by Bacca et al. (2005). These results are consistent with the finding that storage cells are mainly distributed in the labial palp, mantle and gonadal area, as shown by periodic acid–Schiff staining and ultrastructural characteristics of isolated storage cells (Berthelin et al., 2000a,b). Glycogen is mainly stored to meet the energy demands of reproduction, and the results also demonstrated that CgPPP1R3B expression and glycogen content were relatively high in the degenerating gametes stage (autumn) and early gametogenesis (winter) and decreased accordingly in the gamete growth stage (spring) and the mature and spawning stages (summer). These findings were supported by the corresponding experiments (Berthelin et al., 2001).

The protein interaction results showed that CgPPP1R3B could interact with CgPPP1C, and the interaction motif was proven to be KRVSF (133–137 aa), which is consistent with the binding site of the glycogen targeting subunit (GTS) in mammals (Armstrong et al., 1998). The site-directed mutagenesis of the consensus sequence (V135E, KR**V**SF to KR**E**SF) blocked CgPPP1C binding, which was consistent with the point mutation results for another GTS (Fong et al., 2000). Four types of GTS have been identified in mammals, which share the PP1C and glycogen binding function, with G<sub>M</sub> mainly distributed in muscle, PPP1R3B (GL) mainly expressed in liver, and PPP1R6 and protein targeting to glycogen (PTG) distributed widely, which reflect tissue type-dependent glycogen metabolism (Printen et al., 1997; Newgard et al., 2000). We screened the oyster genome protein dataset (Zhang et al., 2012) by the KRVSF/KRVVF/RRVSF/LRVRF motif (Cohen, 2002), and only one GTS of CgPPP1C was found (data not shown). These results may indicate that GTS is not specific in oysters, suggesting the important role of CgPPP1R3B in glycogen metabolic regulation.
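A motif screen of the kind described above can be sketched with a regular expression over protein sequences. The actual screening tool is not specified in the text, so this is only an assumed, minimal approach. The toy sequence below is hypothetical and constructed so that the motif falls at residues 133–137, mirroring the position reported for CgPPP1R3B.

```python
import re

# The four PP1-binding motif variants named in the text.
PP1_BINDING_MOTIF = re.compile(r"KRVSF|KRVVF|RRVSF|LRVRF")


def find_motifs(protein_sequence):
    """Return (start, end, motif) for each match, using 1-based residue positions."""
    return [
        (match.start() + 1, match.end(), match.group())
        for match in PP1_BINDING_MOTIF.finditer(protein_sequence.upper())
    ]


# Hypothetical sequence: 132 filler residues, the motif, then 20 more residues.
toy_sequence = "A" * 132 + "KRVSF" + "G" * 20
hits = find_motifs(toy_sequence)
print(hits)
```

Applied to every record of a protein FASTA file, such a scan yields candidates only; a five-residue motif is expected to occur by chance in unrelated proteins, so further evidence is needed before a hit is called a GTS.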

CgPPP1R3B interacted with the key glycogen metabolic enzymes CgGS and CgGP in this study. PTG could bind to the GS protein, and the binding site was shown to be in the C-terminus in mammalian cells (Printen et al., 1997; Fong et al., 2000). For PPP1R3B, researchers observed neither an interaction nor a binding site with GS (Armstrong et al., 1998; Newgard et al., 2000). In this study, we confirmed the interaction but did not find the binding site, which may indicate that the whole protein is needed for GS binding; perhaps misfolding of the truncated proteins impaired binding (Fong et al., 2000). Moreover, GP interacts with CgPPP1R3B, but the interaction is not very strong, as shown by the Y2H assay. Initial studies reported that the GP binding site lies in the C-terminal 16 amino acids of PPP1R3B (GL) (Armstrong et al., 1998). The weak interaction is consistent with the observation that the C-terminal 16 amino acids of CgPPP1R3B showed very low similarity to those of the human, mouse and zebrafish orthologs (**Supplementary Figure S3**). Furthermore, co-sedimentation assays showed that CgPPP1R3B could interact with the oyster glycogen molecule in vitro. The CBM21 conserved domain in the middle of the protein is thought to be responsible for binding. This domain shared high similarity with those from different species (Printen et al., 1997; Armstrong et al., 1998; Fong et al., 2000).

Overexpression of GM, PTG, and PPP1R3B in cell lines and intact cells all increased the corresponding enzyme activity and glycogen levels. Overexpression of PTG in the liver of normal rats demonstrated that glycogen levels in fasted or fed PTG-overexpressing animals were 70% higher than those in fed controls (O'Doherty et al., 2000). Stender et al. (2018) found that overexpression of human PPP1R3B in HEK293 cells increased the glycogen level almost 20-fold. Because of the lack of mature mollusk cell lines, we overexpressed the CgPPP1R3B protein in HEK293 cells, and nearly twofold higher glycogen levels were

TABLE 2 | Favorable genotype combinations of three SNPs (63644, 63965, and 63878) in the Pacific oyster.


Improvement means extent of glycogen content of genotype combination higher/lower than the total average. Count means number of individuals with the combination, and the percentage means the proportion of the individuals in the wild population.

observed in overexpression cells than in control cells. Due to the high conservation of PP1C protein, PP1C binding and glycogen binding sites, we concluded that CgPPP1R3B functioned well in HEK293 cells and thus accelerated glycogen accumulation. Transgenic mice with liver-specific deletion of PPP1R3B showed significantly reduced glycogen synthase protein levels and substantially decreased total hepatic glycogen content (Mehta et al., 2017). We found that RNAi of CgPPP1R3B significantly reduced the glycogen content compared with that of the control (p < 0.05). Given that glycogen metabolism in oyster showed seasonal variation (Bacca et al., 2005), we hypothesized that the regulation of oyster glycogen metabolism has a long-term regulation cycle compared to the quick fluctuation period for mammals (Stender et al., 2018), which may also explain why the decrease in glycogen level by RNAi was not as strong as that by transgenic deletion of the gene in mouse or human cells.

The subcellular locations of CgPPP1R3B, CgGS, CgPPP1C, and CgGP may imply that they participate in glycogen metabolism in the cytoplasm. Binding of GTS to glycogen metabolic enzymes may occur in the cytosol, followed by translocation of the formed complex to the glycogen particle (Newgard et al., 2000). Likely because of a redistribution of PPP1C and GS to glycogen particles, overexpression of PPP1R3B in cells strongly improves glycogen levels (Mehta et al., 2017).

Thus, CgPPP1R3B could interact with CgPPP1C, CgGS, and CgGP and target the glycogen molecule, forming a complex (Newgard et al., 2000), and CgPPP1R3B served as a molecular scaffold to connect the corresponding proteins. Moreover, PPP1C cannot interact with CgGS and CgGP directly (data not shown), which highlights the importance of PPP1R3B in this metabolic regulatory mechanism (**Figure 5**). CgPPP1C may activate CgGS and inactivate CgGP, thereby improving glycogen levels.

# Association of SNPs With Glycogen Content

Compared to growth traits, which have been relatively well studied and for which molecular breeding is practical (Gutierrez et al., 2018b), nutritional quality traits have rarely been used for genetic modification (Hollenbeck and Johnston, 2018). To date, fewer than ten glycogen-associated SNPs have been identified in oysters. Candidate gene association studies showed that six SNPs in the coding region of CgGS were significantly associated with glycogen content (Liu et al., 2017), as were two SNPs in Cg\_GD1 (glycogen debranching enzyme) and one SNP in Cg-GP (glycogen phosphorylase) (She et al., 2015). A high-density genetic map revealed a QTL explaining 8.6% of glycogen phenotypic variation, and a new gene annotated as a zinc finger protein may participate in glycogen metabolism (Li C. et al., 2018). Much of the phenotypic variation remains to be explained.

In this study, 10 of 13 SNPs were confirmed to be associated with glycogen content in a wild population, eight of which were also associated with glycogen content in the GWAS population (**Table 1**). SNPs from this study accounted for 1.4–6.3% of the PVE, which is comparable to the GWAS results for host resistance to Ostreid herpesvirus in oyster, in which the top 10 markers explained between 1.9 and 4.7% (Gutierrez et al., 2018a), as well as to the GWAS results for growth traits in juvenile farmed Atlantic salmon, in which the top SNP explained ∼7% of the additive genetic variation. However, the use of selected fast-growing lines may contribute to the high PVE of 7–52% reported for growth traits in common carp (Su et al., 2018). Moreover, minor alleles were the favorable alleles for 8 of the 10 SNPs, indicating that more attention should be paid to minor alleles. Zhou et al. (2018) found that the MAF of the most significant SNP (carp159317) was significantly different between individuals with scattered scale patterns and individuals with normal scale patterns in the Yellow River carp (0.49 and 0, respectively). Holborn et al. (2018) reported that one favorable allele for bacterial kidney disease resistance was the minor allele (MAF = 0.078) in a North American Atlantic salmon population.

Associated SNPs were not in the same LD block, and the combination of favorable genotypes showed potential for selective breeding for high glycogen content using these markers. On the one hand, we can genotype the oysters with these markers before spawning and select individuals with favorable combinations as parents, which may increase the glycogen content by approximately 14.2% on average. On the other hand, individuals with favorable genotypic combinations comprised 4.9% of the wild population, indicating it is feasible to conduct molecular breeding to improve glycogen content using this favorable genotypic combination.
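The genotype-based parent screening described above can be sketched in a few lines of Python. This is only an illustration: the marker names, favorable alleles, and genotypes are hypothetical placeholders rather than data from this study, and requiring homozygosity for the favorable allele at every marker is an assumption made for the example.

```python
# Hypothetical favorable allele per marker (placeholders, not study data).
FAVORABLE = {"snp_a": "T", "snp_b": "G", "snp_c": "A"}


def carries_favorable_combination(genotype, favorable=FAVORABLE):
    """True if the individual is homozygous for the favorable allele at every marker.

    Requiring homozygosity is an assumption made for this sketch.
    """
    return all(genotype.get(marker) == allele * 2
               for marker, allele in favorable.items())


def select_parents(candidates, favorable=FAVORABLE):
    """Return the IDs of candidates that carry the favorable genotype combination."""
    return [cid for cid, genotype in candidates.items()
            if carries_favorable_combination(genotype, favorable)]


# Toy genotypes for three candidate broodstock oysters (hypothetical).
candidates = {
    "oyster_1": {"snp_a": "TT", "snp_b": "GG", "snp_c": "AA"},
    "oyster_2": {"snp_a": "TC", "snp_b": "GG", "snp_c": "AA"},
    "oyster_3": {"snp_a": "TT", "snp_b": "GG", "snp_c": "AG"},
}
selected = select_parents(candidates)
print(selected)
```

In practice the favorable genotype at each marker would be taken from the association results, and a less strict rule (for example, counting favorable alleles) may be preferable when few individuals carry the full combination.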

# Regulatory Mechanism of SNP-Mediated Glycogen Content

These associated SNPs were distributed in different regions of the CgPPP1R3B gene, which suggests different mechanisms of glycogen content regulation. Among the 10 significant SNPs, three were located in the 5′ flanking region, 100–500 bp upstream of the transcription start site (TSS). These SNPs lie in the promoter region and may influence transcription. This hypothesis was supported by the fact that different genotypes of SNPs 63644, 63965, and 63878 showed different expression levels. To date, few reports have focused on oyster gene promoter regions and their functions. In other invertebrates, such as fruit flies, most promoters lie within a 1-kb region, and some of them may affect the TSS (Schor et al., 2017). Different genotypes of such genetic variants may therefore have distinct regulatory effects, consistent with the in vivo expression differences (Cannavò et al., 2017). Moreover, one of the 10 associated SNPs was located in the 5′UTR, and the expression level of CgPPP1R3B differed among its genotypes. Genetic variations in the 5′UTR may affect RNA stability and translation efficiency, as confirmed in humans and plants (Zou et al., 2003; Dahlqvist et al., 2010). Furthermore, one associated synonymous SNP (78604) was located in the CDS. Two synonymous SNPs in the serum amyloid A gene were associated with Vibrio resistance in the clam Meretrix meretrix (Zou and Liu, 2015). Komar (2007) reported that naturally occurring synonymous SNPs can affect in vivo protein folding, and that a combination of three SNPs in one gene altered glycoprotein activity. Finally, five SNPs were located in introns. An intronic SNP was found to be associated with resistance to sea lice in Atlantic salmon (Correa et al., 2017); two intronic SNPs in the serum amyloid A gene of the clam M. meretrix were significantly associated with Vibrio resistance, and one intronic SNP was associated with growth-related traits in candidate gene association studies (Zou and Liu, 2015). A 311-bp deletion spanning intron 10 and exon 11 of fgfr1A was shown to be the causal variant responsible for the abnormal scattered-scale pattern in the Yellow River carp (Zhou et al., 2018). Several intronic SNPs were confirmed to be associated with glucose metabolism in humans (Bouatia-Naji et al., 2008). Intronic SNPs may affect mRNA processing, but the precise mechanism requires further elucidation.

# CONCLUSION

The CgPPP1R3B protein can bind CgPPP1C, CgGS, CgGP and glycogen molecules, thus participating in glycogen metabolic regulation. Associated SNPs may increase the transcription level of CgPPP1R3B, thereby increasing the glycogen content of oyster individuals. These associated SNPs or favorable genotypic combinations provide potential markers for marker-assisted selection programs for high-glycogen oyster breeding.

# AUTHOR CONTRIBUTIONS

LL and GZ conceived and designed the study. SL and WW collected the experimental materials. SL and BH contributed to the experimental work. SL, JM, and KS performed most of the statistical analyses. SL and LL wrote and polished the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (Grant No. 31530079) and the Earmarked Fund for China Agriculture Research System (Grant No. CARS-49).

# ACKNOWLEDGMENTS

We would like to sincerely thank Dr. Kokai from the University of Debrecen, Hungary, for kindly providing the details of the co-sedimentation experiment. We thank Dr. Weijun Wang and Dr. Jianmin Yang for their help with glycogen content measurement using the NIRS model. Other researchers who contributed to this study are Youli Liu, Ruihui Shi, and other members of the lab.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00106/full#supplementary-material





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Li, Meng, Song, Huang, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptome Analysis Reveals Common and Differential Response to Low Temperature Exposure Between Tolerant and Sensitive Blue Tilapia (Oreochromis aureus)

Tali Nitzan<sup>1</sup>, Fotini Kokou<sup>1,2</sup>, Adi Doron-Faigenboim<sup>1</sup>, Tatiana Slosman<sup>1</sup>, Jakob Biran<sup>1</sup>, Itzhak Mizrahi<sup>2</sup>, Tatyana Zak<sup>3</sup>, Ayana Benet<sup>3</sup> and Avner Cnaani<sup>1</sup>\*

<sup>1</sup> Institute of Animal Science, Agricultural Research Organization, Rishon LeZion, Israel, <sup>2</sup> Department of Life Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel, <sup>3</sup> The Aquaculture Research Station, Ministry of Agriculture and Rural Development, Dor, Israel

### Edited by:

Paulino Martínez, University of Santiago de Compostela, Spain

### Reviewed by:

Manuel Manchado, Andalusian Institute for Research and Training in Agriculture, Fisheries, Food and Ecological Production (IFAPA), Spain

Smaragda Tsairidou, University of Edinburgh, United Kingdom

> \*Correspondence: Avner Cnaani avnerc@agri.gov.il

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 08 November 2018 Accepted: 29 January 2019 Published: 26 February 2019

### Citation:

Nitzan T, Kokou F, Doron-Faigenboim A, Slosman T, Biran J, Mizrahi I, Zak T, Benet A and Cnaani A (2019) Transcriptome Analysis Reveals Common and Differential Response to Low Temperature Exposure Between Tolerant and Sensitive Blue Tilapia (Oreochromis aureus). Front. Genet. 10:100. doi: 10.3389/fgene.2019.00100

Tilapias are very important to the world's aquaculture. As befits fish of tropical origin, their distribution and culture practices are highly affected by low temperatures. In this study, we used genetic and genomic methodologies to reveal pathways involved in the response and tolerance of blue tilapia (Oreochromis aureus) to low temperature stress. Cold tolerance was characterized in 66 families of blue tilapia. Fish from cold-tolerant and cold-sensitive families were sampled at 24 and 12°C, and the transcriptional responses to low-temperature exposure were measured in the gills and liver by high-throughput mRNA sequencing. Four hundred and ninety-four genes displayed a similar temperature-dependent expression in both tolerant and sensitive fish and in the two tissues, representing the core molecular response to low temperature exposure. KEGG pathway analysis of these genes revealed down-regulation of focal adhesion and other cell-extracellular matrix (ECM) interactions, and up-regulation of proteasome and various intra-cellular proteolytic activities. Differential responses between cold-tolerant and cold-sensitive fish were found in genes and pathways that were up-regulated in one group and down-regulated in the other. This reverse response was characterized by genes involved in metabolic pathways such as glycolysis/gluconeogenesis in the gills and biosynthesis of amino acids in the liver, with low temperature down-regulation in tolerant fish and up-regulation in sensitive fish.

Keywords: carbon metabolism, cold tolerance, Oreochromis, selective breeding, tilapia, transcriptome

# INTRODUCTION

Environmental stressors disrupt homeostasis and are harmful to the physiological function of organisms. Environmental temperature is one of the main factors that drove a wide array of evolutionary adaptations, including the division of animals into homeotherms (endotherms), like mammals and birds, and poikilotherms (ectotherms), like reptiles, amphibians, and fish. Fish species inhabit waters of a wide range of temperatures, from below 0°C in the Antarctic Ocean to above 40°C in East African lakes (DeVries and Wohlschlag, 1969; Reite et al., 1974). However, each fish species can survive within a limited range of temperatures, and the response to fluctuations in environmental temperature is a crucial factor of its fitness and survival (Schulte et al., 2011). There are various biological components and pathways that respond to changes in the environmental temperature. These include alteration of enzymatic activity and efficiency, membrane permeability, gas solubility, particle diffusion rates, and notably, metabolic rate (Battersby and Moyes, 1998; Itoi et al., 2003).

Information about the genes and biological pathways that affect the ability of fish to acclimate and function in a wide range of environmental and body temperatures is scarce. Several studies on different fish species used transcriptomic approaches to gain a broad view of the genes involved in the response and acclimation to low temperatures (Gracey et al., 2004; Long et al., 2012, 2013; Mininni et al., 2014; Hu et al., 2016). These studies pointed to different tissues involved in the physiological response and acclimation to low temperature stress, sometimes with tissue-specific responses. In addition, certain pathways, such as mitochondrial function, lipid and carbohydrate metabolism, anti-oxidant response, apoptosis, RNA processing, and protein catabolism, were found to have temperature-dependent regulation.

Tilapiine fishes of the family Cichlidae originate from the tropical and subtropical parts of Africa, with colonization into the Middle East through the Great Rift Valley (Trewavas, 1983). During the twentieth century, tilapias were introduced to Asia and to South and North America, and they are now highly important in global aquaculture production (FAO data at: www.fao.org/fishery/culturedspecies/Oreochromis\_niloticus/en). Reflecting their tropical origin, the optimal temperature for growth of most tilapiine species lies within the range of 20–30°C, and reproduction and feeding are usually suppressed at temperatures below 20°C. Variation in the lower lethal temperature has been observed among different tilapiine species (Wohlfarth and Hulata, 1983), with blue tilapia (Oreochromis aureus) being one of the most cold-tolerant species (Cnaani et al., 2000). Within-species variation in cold tolerance has been attributed to acclimation, physiological stage, environmental factors, and genetic effects.

Thermal tolerance is a quantitative trait of considerable economic importance in several fish species, including tilapia. Linkage analyses using microsatellite markers identified QTL of minor effect (Perry et al., 2001; Cnaani et al., 2003). Several studies attempted to track the inheritance pattern and genetic basis of cold tolerance in tilapia (Cnaani et al., 2000, 2003; Charo-Karisa et al., 2005; Thodesen et al., 2013; Nitzan et al., 2016); however, the mechanisms that underlie the within-species variation in cold tolerance remain unknown. Identification of such pathways can open new directions to improve brood-stocks and provide insights into the nature of environmental tolerance and adaptation.

In this work, we used a population with distinct tolerant and sensitive families, obtained through selective breeding, to characterize within-species variation. We then performed transcriptome analysis to compare the gills and liver transcriptome responses to low temperature challenges between cold tolerant and sensitive blue tilapia. We characterized the gene expression pattern which is the general temperature-dependent response common among different fish and tissues, as well as the tolerance-based differences between the cold tolerant and sensitive fish.

# MATERIALS AND METHODS

# Animals and Experimental Conditions

The fish used in this study were from an Israeli strain of O. aureus that is under an ongoing selective breeding program at the Dor Aquaculture Research Station (Zak et al., 2014). Spawning was conducted by pair mating to obtain 66 families. Offspring of each family were grown in individual tanks until they were marked with a family-specific code using sub-dermally injected dyes, when the fish were about 4 months old and weighed ∼40 g. To determine a cold-tolerance value for each family, 10–15 fish from each family underwent a cold-challenge experiment similar to that described in our previous study (Nitzan et al., 2016). The 20 families with the highest mean survival days were considered to have high cold tolerance, and the 20 families with the lowest mean survival days were considered to have low cold tolerance (**Figure 1**). Siblings of the challenged fish, from three families with high cold tolerance and three families with low cold tolerance, were kept in a common 1,000 L tank at 24°C. After 1 week, half of the fish were transferred to a 600 L tank in a temperature-controlled room (with a thermostat inside the tank), where the water temperature was reduced from 24°C at a rate of 1°C/day and then maintained at 12°C for 2 days. The fish were not fed throughout the challenge; dissolved oxygen levels were kept above 90% saturation, and ammonia and nitrite levels were not detectable.
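The temperature-reduction regime can be written out as a simple daily schedule. The following Python sketch is illustrative only: the function name and the convention that day 0 is the starting temperature, with the two hold days counted after the first day at 12°C, are assumptions rather than details given in the protocol.

```python
def cold_challenge_schedule(start_c=24.0, target_c=12.0,
                            rate_c_per_day=1.0, hold_days=2):
    """Return the water temperature (°C) for each day of the challenge.

    Day 0 is the starting temperature; the temperature then drops by
    `rate_c_per_day` until `target_c` is reached, after which it is held
    for `hold_days` further days (an assumed reading of the protocol).
    """
    temps = [start_c]
    while temps[-1] > target_c:
        # Never overshoot the target temperature.
        temps.append(max(target_c, temps[-1] - rate_c_per_day))
    temps.extend([target_c] * hold_days)
    return temps


schedule = cold_challenge_schedule()
for day, temp in enumerate(schedule):
    print(f"day {day:2d}: {temp:.0f} °C")
```

Under these assumptions the water reaches 12°C on day 12 and the exposure ends on day 14.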

This study was approved by the Agricultural Research Organization Committee for Ethics in Using Experimental Animals and was carried out in compliance with the current laws governing biological research in Israel (Approval number: 146/09IL).

FIGURE 1 | Phenotypic distribution of cold tolerance, as measured in median survival day, of 66 blue tilapia families that were challenged under a declining temperature regime (n = 10–15 fish/family). Sensitive families are highlighted in purple and tolerant families in green. Families used for transcriptome sequencing are marked with stripes.

# Tissue Collection and RNA Extraction

Seven fish from each family were sampled at 24°C, and seven at 12°C. Gill and liver samples were taken and kept in RNAlater© buffer (Qiagen, Hilden, Germany) at −20°C until use. mRNA was extracted from the tissue samples using TRIzol® reagent (Thermo Fisher Scientific, Waltham, MA, United States), and purified to remove DNA contamination using the TURBO DNA-free™ kit (Invitrogen, Carlsbad, CA, United States). RNA concentration and quality were determined using an Epoch Microplate Spectrophotometer (BioTek, Winooski, VT).

# Transcriptome Sequencing

For each temperature, 2 µg RNA from gill and liver samples of three cold-tolerant (Family 480) and three cold-sensitive (Family 740) fish were sent on dry ice to the Technion Genome Center (Haifa, Israel). Twenty-four libraries were prepared and sequenced on three lanes of an Illumina HiSeq 2500 device.
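The count of 24 libraries follows from the factorial design (2 temperatures × 2 tissues × 2 families × 3 fish). A minimal Python sketch that enumerates this design is shown below; the field names and replicate numbering are illustrative assumptions and do not correspond to the sample labels used in the SRA submission.

```python
from itertools import product

temperatures_c = (24, 12)
tissues = ("gills", "liver")
families = {"480": "tolerant", "740": "sensitive"}
fish_per_group = 3

# One library per temperature x tissue x family x fish combination.
libraries = [
    {
        "temperature_c": temp,
        "tissue": tissue,
        "family": family,
        "phenotype": families[family],
        "replicate": rep,
    }
    for temp, tissue, family, rep in product(
        temperatures_c, tissues, families, range(1, fish_per_group + 1)
    )
]

print(len(libraries))
```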

# Quantitative Real-Time PCR

Specific primers for quantitative real-time PCR (qPCR) were designed for eight genes that were found to have temperature-dependent expression in the Next-Generation Sequencing (NGS) analysis, within a pathway that was differentially expressed between cold-tolerant and cold-sensitive fish: aldoaa, gpib, pfkma, pgam2, gpia, ldha, tbiB, and pgam1a (primer sequences are listed in **Table S1**). mRNA from the gills of nine cold-tolerant and nine cold-sensitive fish (three fish from each family) was reverse transcribed using the Verso cDNA kit (Thermo Fisher Scientific). Amplification reactions were performed using ABsolute Blue SYBR Green ROX mix (Thermo Fisher Scientific) in a 10 µl reaction volume, with primers at 700 nM, on a Rotor-Gene Q real-time PCR instrument (Qiagen). All reactions were performed as follows: 95°C for 15 min, followed by 35 cycles of 95°C for 15 s, 58–60°C for 30 s, and 72°C for 15 s. Relative expression was calculated using the ΔΔCt method, with the geometric mean of Elongation Factor 1 (EF-1) and β-actin as the reference. Significance of differential expression between temperatures was analyzed using the Wilcoxon 2-sample non-parametric test.
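The ΔΔCt calculation with two reference genes can be sketched as follows. This is a minimal illustration, assuming 100% amplification efficiency (a doubling per cycle), in which case the geometric mean of the two reference-gene quantities corresponds to the arithmetic mean of their Ct values. The function names and Ct values are hypothetical and are not taken from the study.

```python
def delta_ct(ct_target, ct_references):
    """Ct of the target gene minus the mean Ct of the reference genes.

    With an assumed efficiency of 2, averaging Ct values is equivalent to
    taking the geometric mean of the reference-gene quantities.
    """
    reference_ct = sum(ct_references) / len(ct_references)
    return ct_target - reference_ct


def relative_expression(ct_target_treated, ct_refs_treated,
                        ct_target_control, ct_refs_control):
    """Fold change of the treated sample relative to the control: 2^(-ddCt)."""
    ddct = (delta_ct(ct_target_treated, ct_refs_treated)
            - delta_ct(ct_target_control, ct_refs_control))
    return 2.0 ** (-ddct)


# Hypothetical Ct values: target gene at 12 °C vs. 24 °C, with two reference genes.
fold = relative_expression(
    ct_target_treated=24.0, ct_refs_treated=[18.0, 20.0],
    ct_target_control=26.0, ct_refs_control=[18.0, 20.0],
)
print(fold)
```

In this toy example the target gene amplifies two cycles earlier at 12°C while the reference genes are unchanged, so the fold change is 2² = 4.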

# Bioinformatic Analyses

The tilapia Illumina sequences were analyzed as previously described (Ronkin et al., 2015), using the TopHat program to map the reads to the Nile tilapia reference transcriptome (ftp://ftp.ncbi.nlm.nih.gov/genomes/Oreochromis\_niloticus/), and the Cufflinks and Cuffdiff software tools (Trapnell et al., 2012) for transcript quantification and differential expression analysis between the two temperatures. Transcripts with at least a 2-fold change in response to temperature (at q < 0.05, with FDR adjustment of p-values) were regarded as significantly up- or down-regulated. Functional annotation of tilapia transcripts with significant temperature-dependent expression was extended using PANTHER (http://www.pantherdb.org/), based on gene ontology (GO) categories assigned to the human or zebrafish orthologs. The Database for Annotation, Visualization and Integrated Discovery (DAVID) web software (http://david.abcc.ncifcrf.gov/home.jsp) was used for the functional analysis of KEGG biological pathway enrichment (pathways were regarded as significantly up- or down-regulated at an adjusted p-value of 0.05). The Adonis implementation of Permanova (vegan R package; Anderson, 2001) was used to compare groups in the clustering analysis, using the Jaccard distance matrix (presence/absence of genes).
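The filtering rule for differentially expressed transcripts (at least a 2-fold change and q < 0.05) can be illustrated with a short Python sketch. It is not a re-implementation of Cuffdiff: the Benjamini-Hochberg procedure is used here as an assumed stand-in for the FDR adjustment, and the gene names, expression values, and p-values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def call_degs(records, min_fold=2.0, max_q=0.05):
    """Classify genes as 'up' or 'down' at 12 °C relative to 24 °C.

    `records` holds (gene, expression_24c, expression_12c, p_value) tuples;
    expression values must be positive in this simplified sketch.
    """
    q_values = benjamini_hochberg([rec[3] for rec in records])
    calls = {}
    for (gene, warm, cold, _), q in zip(records, q_values):
        log2_fc = math.log2(cold / warm)
        if q < max_q and abs(log2_fc) >= math.log2(min_fold):
            calls[gene] = "up" if log2_fc > 0 else "down"
    return calls


# Hypothetical expression values and p-values for four genes.
records = [
    ("gene_a", 10.0, 40.0, 0.001),   # 4-fold up, significant
    ("gene_b", 50.0, 10.0, 0.004),   # 5-fold down, significant
    ("gene_c", 20.0, 25.0, 0.0005),  # significant but below 2-fold
    ("gene_d", 30.0, 90.0, 0.2),     # 3-fold up but not significant
]
calls = call_degs(records)
print(calls)
```

Only gene_a and gene_b pass both criteria, which shows that the fold-change and significance thresholds act as independent filters.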

# RESULTS

# Cold Tolerance Variation

Low temperature exposure of the 66 challenged families, which were phenotyped for their mean survival day, resulted in individual fish mortality ranging from day 4 to day 38, and a normally distributed family mean survival day ranging from 22 to 36 days (**Figure 1**). Based on this distribution, we chose three sensitive and three tolerant families for the transcriptome analysis: Families 720, 730, and 740 had an average survival of 25.6, 22.1, and 24.2 days under the temperature reduction regime, respectively, and were thus considered cold-sensitive. Families 430, 460, and 480 had an average survival of 33.9, 34.2, and 34.1 days, respectively, and were thus considered cold-tolerant.
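Ranking families by mean survival day and taking the two tails of the distribution can be sketched as below. The family labels and survival days are hypothetical toy values, and `n_extreme` is set to 1 for the example, whereas the study used the 20 highest and 20 lowest of 66 families.

```python
from statistics import mean


def classify_families(survival_days, n_extreme=20):
    """Rank families by mean survival day and return the two extremes.

    `survival_days` maps a family ID to the survival days of its challenged fish.
    """
    means = {family: mean(days) for family, days in survival_days.items()}
    ranked = sorted(means, key=means.get)   # lowest mean survival first
    n = min(n_extreme, len(ranked) // 2)    # avoid overlapping tails
    return {"sensitive": ranked[:n], "tolerant": ranked[-n:], "means": means}


# Hypothetical survival days for four families (toy data).
survival_days = {
    "fam_A": [22, 25, 24],
    "fam_B": [34, 35, 33],
    "fam_C": [28, 30, 29],
    "fam_D": [31, 27, 30],
}
groups = classify_families(survival_days, n_extreme=1)
print(groups["sensitive"], groups["tolerant"])
```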

# Transcriptome Sequencing

Raw RNA-Seq sequences were deposited in the SRA database (accession numbers SRR7976381 to SRR7976404 under project PRJNA419688). Overall, 496,843,855 reads were obtained from 24 libraries. After quality control, 441,882,583 clean reads (89%) remained for further analysis. Of these clean reads, 70.6% were mapped to the tilapia reference genome, similar to the mapping rate in previous tilapia transcriptome analyses (Ronkin et al., 2015; Tao et al., 2018). Information concerning the libraries is listed in **Table S2**.
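The read statistics above can be checked with a few lines of arithmetic. The totals and the 70.6% mapping rate are taken from the text; the per-library average assumes an even split across the 24 libraries, which is a simplification, since the actual per-library counts are given in Table S2.

```python
raw_reads = 496_843_855
clean_reads = 441_882_583
n_libraries = 24
mapping_rate = 0.706  # fraction of clean reads mapped, as reported

clean_fraction = clean_reads / raw_reads
mean_clean_reads_per_library = clean_reads / n_libraries  # assumes an even split
approx_mapped_reads = clean_reads * mapping_rate

print(f"clean reads retained: {clean_fraction:.1%}")
print(f"mean clean reads per library: {mean_clean_reads_per_library:,.0f}")
print(f"approximate mapped reads: {approx_mapped_reads:,.0f}")
```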

# The General Transcriptome Response of Blue Tilapia to Temperature Decline

Concerning the number of genes expressed (FDR < 0.05 and 2-fold expression difference), Permanova analysis (**Table S3A**) and clustering using non-metric multidimensional scaling (NMDS) (**Figure 2A**) showed that tissue and temperature were the major factors affecting the variance in our dataset. To separate out the tissue effect, which appeared to be the largest, we performed Permanova and NMDS analyses for each tissue separately. Our results showed that temperature, rather than genetic line, was the main factor affecting gene expression, with significant clustering according to temperature for both the gills (F<sub>Temperature</sub> = 3.53, P = 0.002) and the liver (F<sub>Temperature</sub> = 4.39, P = 0.005) (**Figures 2B,C** and **Tables S3B, S3C**).
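The Jaccard distance underlying the Permanova and NMDS analyses compares samples by the presence or absence of genes. A minimal Python sketch of the standard definition follows; the sample names and gene sets are hypothetical, and the sketch does not reproduce the vegan implementation used in the study.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of gene identifiers."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical gene sets detected in three samples.
samples = {
    "gills_24C": {"g1", "g2", "g3"},
    "gills_12C": {"g2", "g3", "g4"},
    "liver_24C": {"g5", "g6"},
}

# Pairwise distance matrix stored as a dictionary keyed by sample pairs.
distances = {
    (s1, s2): jaccard_distance(samples[s1], samples[s2])
    for s1, s2 in combinations(samples, 2)
}
for pair, dist in distances.items():
    print(pair, round(dist, 2))
```

The two gill samples share two of four genes (distance 0.5), whereas the liver sample shares none with either (distance 1.0).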

Exposure to cold temperature led to a total of 2,696 and 3,721 temperature-dependent differentially expressed genes (DEGs) in the gills of families 480 (cold-tolerant) and 740 (cold-sensitive), respectively. Additionally, 3,714 and 4,114 temperature-dependent DEGs were found in the liver of families 480 and 740, respectively. In both organs, more genes were up- and down-regulated in the sensitive fish than in the tolerant fish: 38% more in the gills and 11% more in the liver (**Figure 3A**).
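The reported excess of DEGs in the sensitive family follows directly from the counts above, as this short Python check shows (the dictionary layout is an illustrative choice).

```python
# Temperature-dependent DEG counts reported for families 480 and 740.
deg_counts = {
    "gills": {"tolerant": 2696, "sensitive": 3721},
    "liver": {"tolerant": 3714, "sensitive": 4114},
}


def percent_more(sensitive, tolerant):
    """Percentage by which the sensitive count exceeds the tolerant count."""
    return 100.0 * (sensitive - tolerant) / tolerant


excess = {
    tissue: round(percent_more(counts["sensitive"], counts["tolerant"]))
    for tissue, counts in deg_counts.items()
}
print(excess)
```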

More specifically, in the gills, 1,590 temperature-dependent DEGs (657 up-regulated and 933 down-regulated) were common between the sensitive and tolerant families. In the liver, there were 2,186 temperature-dependent DEGs (1,104 up-regulated and 1,082 down-regulated) that were common between the sensitive and tolerant families. Of these, 494 genes (314 up-regulated and 180 down-regulated) were shared between the two tissues (**Figure S1**). These genes represent the core response of blue tilapia to low temperature exposure (list of these genes in **Table S4**). KEGG pathway analysis of these genes revealed significant (P < 0.05) down-regulation of focal adhesion and other cell-ECM interactions, as well as up-regulation of proteasome, various intra-cellular protein processing activities, RNA transport and degradation (**Table 1**).
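The core response can be thought of as a nested intersection: genes regulated in the same direction in both families within each tissue, and then shared between tissues. The Python sketch below illustrates this logic with hypothetical gene calls and checks that the reported up- and down-regulated counts add up to the stated totals. Requiring the same direction of change in every group is an assumption based on the up/down breakdown given in the text.

```python
def shared_degs(calls_a, calls_b):
    """Genes present in both call sets with the same direction of change."""
    return {gene: direction for gene, direction in calls_a.items()
            if calls_b.get(gene) == direction}


# Hypothetical DEG calls ('up'/'down' at 12 °C) per tissue and family.
gills_tolerant = {"g1": "up", "g2": "down", "g3": "up"}
gills_sensitive = {"g1": "up", "g2": "down", "g3": "down", "g4": "up"}
liver_tolerant = {"g1": "up", "g2": "up", "g5": "down"}
liver_sensitive = {"g1": "up", "g2": "up", "g5": "down"}

gills_common = shared_degs(gills_tolerant, gills_sensitive)
liver_common = shared_degs(liver_tolerant, liver_sensitive)
core_response = shared_degs(gills_common, liver_common)
print(core_response)

# Consistency check of the counts reported in the text: (up, down, total).
reported = {
    "gills_common": (657, 933, 1590),
    "liver_common": (1104, 1082, 2186),
    "core": (314, 180, 494),
}
counts_consistent = all(up + down == total
                        for up, down, total in reported.values())
print(counts_consistent)
```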

# Differential Response Between Cold Tolerant and Sensitive Fish

There were more genes with temperature-dependent expression in the sensitive fish than in the tolerant (**Figure 3A**). This effect was similar for temperature-dependent enriched KEGG pathways (**Figure 3B**). Other than the difference in the number of genes and pathways with temperature-dependent differential expression, KEGG analysis revealed an opposite regulation of biological pathways between the cold-sensitive and cold-tolerant fish, with low temperature down-regulation in tolerant fish and up-regulation in sensitive fish (**Figure 4**, marked in black box). This opposite response was noted for genes involved in metabolic pathways: glycolysis/gluconeogenesis in the gills and biosynthesis of amino acids in the liver (**Figure 5**). It is worth noting that no specific gene showed an opposite response between the cold-sensitive and cold-tolerant fish; rather, different genes within the pathway (sometimes paralogous genes) were either up- or down-regulated.

TABLE 1 | Temperature-dependent KEGG pathways observed in more than one family or tissue (↑ for up-regulation and ↓ for down-regulation with P-value for each one).

Gills and liver of tolerant fish are highlighted in green and of sensitive fish in purple. Common and differential pathways are in bold letters. Shared pathways between all samples (the core response) are highlighted in red for up-regulation and blue for down-regulation. Pathways with reverse response between sensitive and tolerant fish are highlighted in orange.

# Validation of Transcriptome Sequencing Results Using qPCR

Target genes for qPCR analysis were selected from the genes related to the glycolysis/gluconeogenesis pathway that had opposite responses between cold-tolerant and cold-sensitive fish. Gill samples of nine fish from three tolerant families and nine fish from three sensitive families were analyzed. For all tested genes, the direction of change in expression was concordant between the transcriptome sequencing data and the qPCR analysis (**Figure 6**). Transcript levels at 12°C differed significantly from those at 24°C for gpia, ldha, tbiB, and pgam1a (P < 0.001) and for aldoaa and pgam2 (P = 0.02), but not for gpib and pfkma (P = 0.1).
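For readers unfamiliar with the Wilcoxon 2-sample (rank-sum) test used for these comparisons, the following Python sketch computes an exact two-sided p-value by enumerating all possible group assignments. It is a didactic illustration under stated assumptions: the software and exact variant used by the authors are not specified in the text, and the expression values are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def exact_rank_sum_p(x, y):
    """Exact two-sided p-value for the Wilcoxon rank-sum statistic of x vs. y."""
    ranks = midranks(list(x) + list(y))
    n_x = len(x)
    observed = sum(ranks[:n_x])
    expected = n_x * (len(ranks) + 1) / 2
    deviation = abs(observed - expected)
    total = extreme = 0
    # Enumerate every way of assigning n_x of the pooled ranks to group x.
    for combo in combinations(ranks, n_x):
        total += 1
        if abs(sum(combo) - expected) >= deviation - 1e-9:
            extreme += 1
    return extreme / total


# Hypothetical relative expression values at the two temperatures.
expr_24c = [1.0, 1.1, 0.9, 1.2]
expr_12c = [2.0, 2.4, 1.9, 2.2]
p_value = exact_rank_sum_p(expr_24c, expr_12c)
print(round(p_value, 4))
```

With two fully separated groups of four, only 2 of the 70 possible assignments are as extreme as the observed one, giving p = 2/70. For two groups of nine the enumeration covers 48,620 assignments, which is still fast.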

# DISCUSSION

Ectothermic animals, like fish, go through major physiological responses when they acclimate to different temperatures. While the emergence of genomic technologies led to several studies that characterized genes and pathways involved in fish thermal acclimation (Gracey et al., 2004; Long et al., 2012, 2013; Mininni et al., 2014; Hu et al., 2016), there is still a considerable lack of knowledge in this area. Moreover, although it is well known that there is within-species phenotypic variation in thermal tolerance (Tave et al., 1990; Thoa et al., 2014), the physiological basis of this variation has not been investigated so far. In the current study, we used a set-up in which phenotypes were characterized at the family level, enabling comparative transcriptome characterization of fish with the same phenotype, either tolerant or sensitive, at different temperatures. This system allowed for retrieval of the response to temperature decline at the gene-expression level.

FIGURE 4 | Up- and down-regulated fold changes belonging to different KEGG pathways in the gills and liver of cold tolerant and sensitive fish.

FIGURE 5 | KEGG pathways with differential temperature-dependent regulation between cold sensitive and cold tolerant blue tilapia, glycolysis/gluconeogenesis in the gills (A) and biosynthesis of amino-acids in the liver (B). Up-regulated genes in these pathways were found only in the cold sensitive fish and are marked in red, whereas down-regulated genes in these pathways were found only in the cold tolerant fish and are marked in blue.

As only one tolerant and one sensitive family were used for the transcriptome sequencing, we examined fish from four additional families, two tolerant and two sensitive, to verify that the observed patterns are general and that the detected differences are not family-specific. By comparing the transcriptional responses to a low-temperature challenge in the gills and liver of cold-tolerant and cold-sensitive blue tilapia, we identified a response common to all the fish that were analyzed. We therefore considered it the primary response of blue tilapia to low temperature exposure. KEGG analysis of the DEGs showed down-regulation of genes involved in cellular interactions with the surrounding environment, adjacent cells and the ECM. In contrast, up-regulated genes are involved in intracellular processes, such as protein processing and RNA transport and degradation. These findings are in line with previous transcriptomic analyses that used microarrays to study the response to low temperatures in fish. Analysis of seven common carp (Cyprinus carpio) tissues demonstrated common activation of genes involved in ubiquitin-dependent protein catabolism and proteasome function, a response that was not expected (Gracey et al., 2004). Analysis of the gilthead sea bream (Sparus aurata) liver transcriptome also revealed broad activation of genes involved in RNA processing, protein catabolism and folding (Mininni et al., 2014). Overall, it can be concluded that at the cellular level, the primary response of fish to low temperature stress is an increase in intracellular processes and a decrease in extracellular processes. This might reflect an adjustment process of the cell with a concomitant reduction in organismal activity.

Most of the variation found in our transcriptome sequences can be attributed to differences between tissues and temperatures. Despite that, our subsequent analyses showed that a relatively small number of genes belonging to specific pathways form the basis of the difference between tolerant and sensitive fish. The analysis of the transcriptomic response to a decreasing temperature regime highlighted some differences between the cold-tolerant and cold-sensitive fish. Sensitive fish showed a wider range of up-regulated genes and pathways than tolerant fish in response to cold exposure, suggesting a higher investment of physiological resources by cold-sensitive fish during acclimation to low temperatures. The increase in carbon metabolism observed in the sensitive fish might reflect this higher investment. Furthermore, in their study on common carp, Gracey et al. (2004) found an opposite response of carbohydrate metabolism pathways between tissues of the same fish and assumed a probable shift of energy to organs with higher demands.

In this study, we focused on the gene expression level and identified divergence between the cold-sensitive and cold-tolerant blue tilapia in specific pathways. Two carbon metabolism pathways, glycolysis/gluconeogenesis in the gills and biosynthesis of amino acids in the liver, were found to be differentially regulated in response to low temperature exposure. These pathways play a key role in supplying the organism's energetic demands. While such reverse expression of these pathways between tolerant and sensitive fish was not known before, we have previously demonstrated that increased expression of the atp6 gene (mitochondrial ATP synthase) in response to temperature reduction was inversely correlated with the level of cold tolerance (Nitzan et al., 2016). This observation further supports the current results, which show that energy-related pathways are an important component of the cold tolerance trait.

A negative correlation between metabolic rate and tolerance to environmental stress is well known in farmed animals, with cattle and poultry strains with high temperature tolerance having lower growth rates and lower values for other productivity traits (West, 2003; Druyan et al., 2012). However, these are endothermic animals that need to maintain a constant body temperature, while in fish, such a negative correlation is less intuitive. Several studies on different fish species described the relationship between metabolism and the response to cold temperatures (Guderley and St-Pierre, 2002; Schulte, 2015), but as far as we know, the relationship between metabolism and tolerance of low temperatures has not yet been clearly explained in fish. Our results, showing lower expression of energy-related metabolic pathways in cold-tolerant fish, suggest that energy balance has a key role in the ability of fish to make the physiological adjustments to a cold environment.

In this work, we used the power of family-level phenotypes, obtained through selective breeding, to characterize the transcriptome response to low temperature exposure in tolerant and sensitive fish. We revealed pathways that constitute the core cellular response as well as pathways that differ between cold-tolerant and cold-sensitive fish and might be the basis of within-species variance. Understanding the regulation of these pathways should be key to improving cold tolerance in tilapia, as the ability to control or affect pathways that are the basis of phenotypic variation bears greater potential for aquaculture than the detection of polymorphisms in specific genes, each with a minor effect. This study demonstrates the opportunities in using a defined genetic structure for experiments aimed at characterizing the physiology of traits. The presented data can be important for our understanding of this economically important trait in cultured fish, as well as for designing research aimed at improving tilapia's cold tolerance through genetic or physiological manipulations of key pathways.

# AUTHOR CONTRIBUTIONS

AC conceived and designed the experiments. TZ and AB bred and selected the fish. TS, TN, and AC challenged the fish and sampled tissues. TN performed the RNA analyses. TN, AD-F, JB, FK, and AC analyzed the results. AC and IM secured funding and supervised the project. TN, FK, and AC wrote the manuscript.

# FUNDING

This research was supported by research grants 863-0045 and 356-0698 from the Chief Scientist of the Ministry of Agriculture and Rural Development.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00100/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Nitzan, Kokou, Doron-Faigenboim, Slosman, Biran, Mizrahi, Zak, Benet and Cnaani. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomic, Transcriptomic, and Epigenomic Features Differentiate Genes That Are Relevant for Muscular Polyunsaturated Fatty Acids in the Common Carp

Hanyuan Zhang<sup>1</sup>, Peng Xu<sup>2</sup>, Yanliang Jiang<sup>1</sup>, Zixia Zhao<sup>1</sup>, Jianxin Feng<sup>3</sup>, Ruyu Tai<sup>1</sup>, Chuanju Dong<sup>4</sup> and Jian Xu<sup>1</sup>\*

<sup>1</sup> Key Laboratory of Aquatic Genomics, Ministry of Agriculture, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, Beijing, China, <sup>2</sup> Fujian Collaborative Innovation Center for Exploitation and Utilization of Marine Biological Resources, Xiamen University, Xiamen, China, <sup>3</sup> Henan Academy of Fishery Science, Zhengzhou, China, <sup>4</sup> College of Fishery, Henan Normal University, Xinxiang, China

### Edited by:

Hooman Moghadam, SalmoBreed AS, Norway

### Reviewed by:

Jie Mei, Huazhong Agricultural University, China

Kieran G. Meade, Teagasc, The Irish Agriculture and Food Development Authority, Ireland

Tereza Manousaki, Hellenic Centre for Marine Research (HCMR), Greece

> \*Correspondence: Jian Xu xuj@cafs.ac.cn

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 29 September 2018 Accepted: 27 February 2019 Published: 15 March 2019

### Citation:

Zhang H, Xu P, Jiang Y, Zhao Z, Feng J, Tai R, Dong C and Xu J (2019) Genomic, Transcriptomic, and Epigenomic Features Differentiate Genes That Are Relevant for Muscular Polyunsaturated Fatty Acids in the Common Carp. Front. Genet. 10:217. doi: 10.3389/fgene.2019.00217

Polyunsaturated fatty acids (PUFAs) are a set of important nutrients that mainly include arachidonic acid (ARA4), docosahexaenoic acid (DHA), eicosapentaenoic acid (EPA), and α-linolenic acid (ALA). Recently, fish-derived PUFAs have been associated with cardiovascular health, fetal development, and improvement of brain functions. Studies have shown that fish muscular tissues are rich in PUFAs, which are influenced by various factors, including genetic variations, regulatory profiles, and methylation status of desaturase genes during fatty acid desaturation and elongation processes. However, the genetic mechanism and the pathways involved in fatty acid metabolism in fishes remain unclear. The overall aim of this study was to assess differences in gene expression responses among fishes with different fatty acid levels. To achieve this goal, we conducted genome-wide association analysis (GWAS) using a 250K SNP array in a population of 203 samples of common carp (Cyprinus carpio) and identified nine SNPs and 15 genes associated with muscular PUFA content. Then, RNA-Seq and whole genome bisulfite sequencing (WGBS) of different groups with high and low EPA, DHA, ARA4, and ALA contents in muscle, liver and brain tissues were conducted, resulting in 6,750 differentially expressed genes and 5,631 genes with differentially methylated promoters. Gene ontology and KEGG pathway enrichment analyses of RNA-Seq and WGBS results identified enriched pathways for fatty acid metabolism, which included the adipocytokine signaling pathway, ARA4 and linoleic acid metabolism pathway, and insulin signaling pathway. Integrated analysis indicated significant correlations between gene expression and methylation status among groups with high and low PUFA contents in muscular tissues. Taken together, these multi-level results uncovered candidate genes and pathways that are associated with fatty acid metabolism and paved the way for further genomic selection and carp breeding for PUFA traits.

Keywords: common carp, polyunsaturated fatty acids, GWAS, transcriptome, methylation

# INTRODUCTION


Aquatic products contribute to a large part of our daily diet, which not only offer high-quality protein but also contain abundant long-chain polyunsaturated fatty acids (PUFA), vitamins, and mineral substances. The fat contents (FAs) of most fishes range from 1 to 4% (Iverson et al., 2002). In general, fish flesh has lower FA but higher PUFA content compared to livestock and poultry meat (Rymer and Givens, 2005; Wood et al., 2008; Henriques et al., 2014; Geay et al., 2015). Omega-6 and omega-3 PUFAs are essential fatty acids (EFAs) that play critical roles in maintaining cell membrane structure. However, as bioactive lipid mediators, these are not synthesized by the human body (Calder, 2013). Long-chain PUFAs, particularly eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA), positively influence retinal development, neurological development, and cardiovascular system maintenance (Innis, 2008; Guesnet and Alessandri, 2011; Mozaffarian and Wu, 2011). PUFAs can effectively reduce the risk of heart diseases (e.g., coronary heart disease and heart rate variability) (Vedtofte et al., 2011; Enns et al., 2014), prevent the increase of blood viscosity (Zanetti et al., 2015), and act as an anti-glioma agent that promotes the regression and apoptosis of tumors by acting as an intracellular signaling molecule (Witte and Hardman, 2015). Another two PUFAs, namely, α-linolenic acid (ALA) and arachidonic acid (ARA4), have also been reported to be beneficial for human development and health by preventing cardiovascular disease and promoting embryonic and brain development (Davis-Bruno and Tassinari, 2011; Pan et al., 2012).

Fatty acids in fishes originate from either of two sources: synthesis in vivo from non-lipid carbon sources or uptake from dietary lipids. Therefore, muscular PUFA content can be influenced by dietary intake and endogenous metabolism (Davidson, 2013). Various studies have investigated the mechanism underlying PUFA synthesis and degradation, including associated genetic variants, gene expression, and epigenetics (Berton et al., 2016; Chen et al., 2018). The pathways for the synthesis and modification of PUFAs include a variety of enzyme systems. Genetic variants in fatty acid-synthesizing enzymes substantially influence muscular fatty acid levels (Foster, 2012; Beld et al., 2015). For example, delta-5 desaturase (FADS1) and delta-6 desaturase (FADS2) are two enzymes involved in fatty acid metabolism (Ralston et al., 2015). Two haplotypes including the fads1 and fads2 genes have been associated with differences in the synthesis of long-chain PUFAs (Ameur et al., 2012). During fatty acid synthesis, acetyl-CoA serves as the initial specific substrate for the acetyl-CoA carboxylase. Modifications in the microsomal glycerol-3-phosphate pathway in fishes as well as in mammals involve the incorporation of PUFAs into phospholipids and triacylglycerols (Murphy, 2013). During desaturation and elongation processes in freshwater fishes, omega-9 fatty acids could be synthesized by Δ9 desaturase, whereas omega-3 (n-3) and omega-6 (n-6) fatty acids could not be produced due to the lack of the Δ12 and Δ15 desaturase enzymes.

Although PUFAs have a series of important functions, current understanding of the molecular mechanisms related to fish fatty acid metabolism remains limited. QTL mapping and genome-wide association analysis (GWAS) on muscular FA have been reported in the Atlantic salmon, common carp (Cyprinus carpio), and large yellow croaker (Derayat et al., 2007; Sodeland et al., 2013; Kuang et al., 2015; Xiao et al., 2016; Zheng et al., 2016). A previous genome-wide scan identified multiple QTLs located on LG6 and LG23 that influence n-3 PUFA content in the muscle tissues of Asian seabass (Xia et al., 2014). Fatty acid desaturases and elongases (e.g., Δ5 and Δ6 desaturases) of several freshwater and marine fishes have been cloned (Agaba et al., 2004; Zheng et al., 2016). These enzymes play crucial roles in the biosynthesis of the long-chain C20/22 PUFAs from shorter-chain C18 PUFAs. This biosynthesis involves a series of complex reactions, including desaturation, elongation, and peroxisomal β-oxidation, to convert ALA to the n-3 fatty acids EPA and DHA (Kjaer et al., 2016). A previous study demonstrated that there are two alternative pathways for DHA biosynthesis in teleosts (Oboh et al., 2017). One is the Sprecher pathway, which desaturates 24:5(n-3) to 24:6(n-3) via the Δ6 desaturase Fads2, and the other is the Δ4 pathway, which relies on a Δ4 desaturase Fads2. However, the molecular genetic mechanisms of fatty acid-related traits, particularly those involved in important PUFA (e.g., DHA, EPA) pathways, remain unclear.

The common carp has a long breeding history as an important worldwide cultured fish (Bostock et al., 2010). In 2016, the global production of C. carpio reached 4.56 million tons, while 3.50 million tons were cultured in China<sup>1</sup>. Dozens of C. carpio strains and populations have been cultured, including Yellow River carp, Songpu mirror carp, and Hebao carp. The diverse populations of C. carpio exhibit phenotypic variations in body color, growth, scales, and muscular fatty acid contents. The muscular PUFA content of common carp has been a research topic of interest based on the need to increase aquaculture quality instead of quantity. Based on its economic and scientific importance, genomic resources of C. carpio have been developed and extensively utilized in recent years. The transcriptome and genome assembly of C. carpio have been reported, paving the way for more in-depth investigations (Ji et al., 2012; Xu J. et al., 2012; Xu P. et al., 2014). A high-throughput carp SNP array with 250K SNPs was developed, thereby offering a powerful tool for genetic association studies (Xu J. et al., 2014). Several QTL and GWAS analyses of C. carpio were conducted using the Carp 250K SNP array, including for traits of growth and FA (Peng et al., 2016; Zheng et al., 2016). High-throughput sequencing techniques facilitated multi-omics studies involving genome resequencing, genome methylation, and transcriptome analysis of important traits of C. carpio (Jiang et al., 2014; Wang et al., 2014a,b). Gene expression profiles are largely regulated by DNA methylation through transcriptional regulation and chromatin remodeling. The genomic regions with differential methylation levels, known as differentially methylated regions (DMRs), represent the most active regions that may be related to transcriptional regulation. A handful of integrated studies on the expression and epigenetics of various species have been reported. He et al. (2013) investigated transcriptomic and epigenomic variations in maize hybrids and concluded that similar mechanisms may account for the genome-wide epigenetic regulation of gene activity and transposon stability in different organs. Zhang et al. (2015) conducted integrative analysis of transcriptomic and epigenomic data to reveal regulatory patterns for bone mineral density variations, showing consistent association evidence from both mRNA/miRNA expression and methylation data. A recent study conducted comparative transcriptomic and DNA methylation analyses of the color trait in Crucian carp and identified several pigmentation-related pathways (Zhang et al., 2017). The above studies indicate the power of multi-omics data analysis. Here, we report our multi-level research on muscle PUFA content traits in common carp using GWAS, RNA-Seq, and methylation analyses. Dozens of fatty acid metabolism-associated candidate genes and pathways were identified by the integration of associated genes, differentially expressed genes (DEGs), and genes in differentially methylated regions (DMRs). This study takes a further step toward understanding the mechanism of muscular PUFA content and presents potential applications in breeding.

<sup>1</sup>http://www.fao.org

# MATERIALS AND METHODS

# Ethics Statement

This study was conducted in accordance with the recommendations of the Care and Use of Animals for Scientific Purposes established by the Animal Care and Use Committee of the Chinese Academy of Fishery Sciences (ACUC-CAFS). The protocol was approved by the ACUC-CAFS. Before the blood and tissue samples were collected, all fishes were euthanized in MS222 solution.

# Sample Collection and Phenotypic Measurements

A total of 203 Yellow River 2-year-old carp were randomly selected from a large population that was cultured at the Henan Fishery Research Institute, Henan, China. From each sample, 1 mL of blood was collected in lysis buffer for DNA extraction and genotyping. More than 100 g of dorsal muscle tissue were collected and cryopreserved on dry ice, and then stored at −80°C until fatty acid content determination. More than 5 g of muscle, liver, and brain tissues were acquired and preserved in RNALater (Qiagen, Hilden, Germany) at −80°C for RNA extraction and sequencing. Approximately 20 g of muscle sample per fish were used in measuring total fat and fatty acid contents (N = 12), which were conducted by the Agricultural Products Safety and Quality Supervision Inspection Center in Zhengzhou, Henan according to national standards for food safety. Correlation analysis among traits was conducted using IBM SPSS Statistics 19.0 software, and the correlation matrix plot was drawn using the R package ggplot2<sup>2</sup>.

# DNA Extraction, Genotyping, and Quality Control

Genomic DNA was extracted from whole blood using a DNeasy 96 Blood & Tissue Kit (Qiagen, Shanghai, China) following the manufacturer's protocol. The extracted DNA was quantified using a NanoDrop-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, United States). The integrity of DNA was examined on a 1.5% agarose gel by electrophoresis. The final DNA concentration was diluted to 50 ng/µL for genotyping. The total amount of qualified genomic DNA for whole genome sequencing was 2 µg per sample. The common carp 250K SNP array was developed in a previous study using the Affymetrix Axiom genotyping technology (Xu J. et al., 2014). Genotyping was performed by GeneSeek (Lincoln, Nebraska, United States). After genotyping, PLINK v1.9<sup>3</sup> was used for quality control (Chang et al., 2015). SNPs with low call rate (<95%) or low minor allele frequency (MAF < 5%) were excluded, and samples with <90% genotyping rates were filtered out.

# Genome-Wide Association Analysis

Genome-wide association analysis to screen for genes related to fat and fatty acid content traits (total fat, 12 individual fatty acids, and nine classified fatty acid groups) was performed on the genotyping data using TASSEL version 5.0<sup>4</sup> (Bradbury et al., 2007). The model "PCA" was used to create a q-matrix, and then the option "wMLM" was used to perform association analysis. The genome-wide significant P-value threshold was adjusted based on Bonferroni correction, and the suggestive threshold was set to 3.445 × 10<sup>−5</sup>. Associated SNP loci for four PUFA (ALA, ARA4, DHA, and EPA) content values were selected with a corrected P value <0.05. Gene annotation was performed on the 10-kb flanking regions of the associated SNP loci. The Manhattan plots and Q-Q plots were generated using the qqman package from the Comprehensive R Archive Network<sup>5</sup>.
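The thresholds can be reproduced with simple arithmetic. The sketch below assumes that the number of independent tests equals the 29,026 tag SNPs reported in the Results, that the significant threshold is 0.05 divided by that number, and that the suggestive threshold is 1 divided by that number. The text does not state these formulas explicitly, but the resulting values are consistent with the reported 1.72 × 10<sup>−6</sup> and 3.445 × 10<sup>−5</sup>. The function name is illustrative.

```python
def bonferroni_thresholds(n_tests, alpha=0.05):
    """Return (significant, suggestive) P-value thresholds.

    significant = alpha / n_tests (Bonferroni correction)
    suggestive  = 1 / n_tests (assumed convention: one expected false
                  positive per genome scan)
    """
    return alpha / n_tests, 1.0 / n_tests


# 29,026 tag SNPs retained after LD pruning (value taken from the Results).
significant, suggestive = bonferroni_thresholds(29026)
print(f"significant: {significant:.3e}, suggestive: {suggestive:.3e}")
```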

# RNA Sequencing and Differential Gene Expression Analysis

Three tissues (brain, liver, and muscle) of 20 fishes were included in RNA sequencing based on four traits, namely, ALA, ARA4, DHA, and EPA content. From these fishes, four were selected with high content and four were selected with low content in all four traits. Total RNA was extracted from the brain, liver, and muscle tissues using an RNeasy kit (Qiagen, Shanghai, China) following manufacturer's instructions. The integrity and size distribution of all samples were assessed using a Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, United States). A complementary DNA (cDNA) library was constructed, and high-throughput sequencing was performed using an Illumina HiSeq2500 Sequencing System with paired-end 2 × 150 nucleotide reads (Illumina, San Diego, CA, United States).

<sup>2</sup>https://cran.r-project.org/web/packages/ggplot2/

<sup>3</sup>https://www.cog-genomics.org/plink2

<sup>4</sup>http://www.maizegenetics.net/tassel

<sup>5</sup>http://cran.r-project.org/package=qqman

Low-quality reads and residual adapter sequences in the FASTQ files were filtered and trimmed using Trimmomatic v.0.32 (Bolger et al., 2014). Reads were trimmed when the average Phred score was <20 across four bases in the sliding window. After filtering, reads shorter than 36 bp and unpaired (single-end) reads were removed. Bowtie2-build indexer (Bowtie2 v.2.3.4.2) (Langmead and Salzberg, 2012) was used to build a Bowtie index from the common carp genome assembly. The filtered reads were aligned to the genome sequence using Tophat2 (Kim et al., 2013). Samtools (Li et al., 2009) was used to index the Tophat2 output bam files, and Cufflinks (Trapnell et al., 2010, 2014) was used to assemble the reconstructed transcripts from the aligned reads using genome and annotation files. These assembled transcript structures were merged into one single dataset with Cuffmerge. Differential expression and regulation at the gene and transcript levels were identified using Cuffdiff. The volcano plots showing gene expression differences were constructed using the ggplot2 package. Genes with a Benjamini-Hochberg FDR-corrected P value (q value) <0.05 and an absolute log<sub>2</sub> fold-change (FC) value >1 were considered differentially expressed. The heatmaps of the four PUFAs in three tissues were generated from the FPKM values of differentially expressed genes using the pheatmap package. Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed using DAVID Bioinformatics Resources 6.8 Tools<sup>6</sup> (Huang et al., 2009). Pathways with at least five differentially expressed genes assigned and q value <0.05 were considered enriched. The GO and KEGG plots were constructed using the R package clusterProfiler (Yu et al., 2012).
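The differential-expression criteria above (q value <0.05 and absolute log<sub>2</sub> FC >1) can be sketched as follows. This is not the Cuffdiff implementation; it is a plain-Python illustration of the Benjamini-Hochberg adjustment followed by the two cut-offs, and the function names and the input format (gene name, log<sub>2</sub> FC, raw P value) are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def call_degs(genes, q_max=0.05, min_abs_log2fc=1.0):
    """Select differentially expressed genes.

    genes: list of (name, log2_fold_change, raw_p_value) tuples
           (hypothetical input format).
    """
    q_values = benjamini_hochberg([p for _, _, p in genes])
    return [
        name
        for (name, log2fc, _), q in zip(genes, q_values)
        if q < q_max and abs(log2fc) > min_abs_log2fc
    ]
```

With the default arguments, a gene is retained only if it passes both the FDR and the fold-change filter, which mirrors the rule used for the volcano plots.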

# Whole Genome Bisulfite Sequencing and Differential Methylation Analysis

Genomic DNA was extracted from the muscle tissues of the same fishes used in the RNA-Seq study. Genomic DNA was treated using sodium bisulfite, which converts unmethylated cytosine to uracil, which is then read as thymine (Wang et al., 2006). Whole genome bisulfite sequencing (WGBS) was performed using an Illumina HiSeq2500 sequencer with 150-bp paired-end sequencing (Illumina, San Diego, CA, United States). The software swDMR<sup>7</sup> was used to comprehensively analyze the DMRs from methylation sequencing profiles by a sliding window approach (Wang et al., 2015). The input files were prepared using the WGBS data aligner Bismark (Krueger and Andrews, 2011). The DMR detection and annotation procedures were performed as described below. First, a sliding window with 1,000-bp window size and 100-bp step size was chosen for scanning methylation rates. Second, windows with a Benjamini-Hochberg FDR-corrected P value (q value) <0.1 and an absolute log<sub>2</sub> FC value >1 were considered potential DMRs. Third, two potential DMRs were merged when their distance was less than the threshold. The merged DMRs were tested by the previous steps to guarantee the significance level. This extension step was repeated until the P value was >0.1. The newly extended potential DMRs were considered candidate DMRs. Finally, candidate DMRs in the promoter regions were selected for enrichment and correlation analyses. GO and KEGG enrichment analyses were performed using DAVID Bioinformatics Resources 6.8 Tools<sup>6</sup> (Huang et al., 2009). The GO and KEGG plots for candidate genes in the DMPs were constructed using the R package clusterProfiler (Yu et al., 2012). The DGE and DMP results were correlated using the R package ggplot2 (Ginestet, 2011), and transcriptional factor binding site (TFBS) prediction was conducted using AnimalTFDB 3.0 Tools<sup>8</sup> (Hu et al., 2018).

<sup>6</sup>https://david.ncifcrf.gov/
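The sliding-window scan and the merging step described above can be sketched in Python. This is not the swDMR code: the statistical test and the re-testing of merged regions are omitted, the function names are illustrative, and `max_gap` is a hypothetical parameter standing in for the merging threshold, whose value is not given in the text.

```python
def sliding_windows(seq_len, window=1000, step=100):
    """Yield half-open (start, end) windows along a sequence of length seq_len."""
    last_start = max(seq_len - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, min(start + window, seq_len)


def merge_regions(regions, max_gap):
    """Merge (start, end) regions whose gap is smaller than max_gap.

    Overlapping windows have a negative gap and are therefore always merged.
    """
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

In a full pipeline, each window would first be tested for differential methylation, only the windows passing the q-value and fold-change filters would be passed to `merge_regions`, and each merged region would be tested again.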

# RESULTS

# Genotyping and Phenotyping of 203 Accessions of C. carpio

On the basis of our previous work (Xu J. et al., 2014; Xu P. et al., 2014), we randomly collected 203 samples from a cultivated population of C. carpio. After DNA extraction and SNP genotyping using the Carp 250K SNP array, raw genotype data for 250,000 SNPs in 203 samples were generated. A total of 193 samples with 108,684 polymorphic SNPs passed the quality control threshold, and 29,026 tag SNPs were chosen by the pruning method (SNPs with LD R<sup>2</sup> > 0.9 were filtered out) for further association analysis.

TABLE 1 | Summary of fat, nine classified fatty acids, and 12 single fatty acids.

<sup>7</sup>http://sourceforge.net/projects/swdmr/

<sup>8</sup>http://bioinfo.life.hust.edu.cn/AnimalTFDB/#!/tfbs\_predict

We analyzed total FA and 12 individual fatty acids, and nine classified groups of fatty acids were also calculated (**Table 1**). **Figure 1A** shows that oleic acid and linoleic acid comprise 60% of all fatty acids, and the contents of the PUFAs (ALA, ARA4, DHA, and EPA) are relatively low. The 22 phenotypes were also analyzed to uncover potential relationships among phenotypes, and the correlation matrix is shown in **Figure 1B** and **Supplementary Table S1**. We found that ALA, ARA4, DHA, and EPA contents were significantly correlated with each other, possibly suggesting shared or related genes regulating these traits. The correlations between ALA and the other three fatty acids (ARA4, DHA, EPA) were moderate (r = 0.398, 0.434, 0.306, respectively; p < 10<sup>−8</sup>, p < 10<sup>−9</sup>, p < 10<sup>−4</sup>, respectively), whereas the measurements of ARA4, DHA, and EPA were highly correlated (ARA4 vs. DHA: r = 0.916, p < 10<sup>−76</sup>; ARA4 vs. EPA: r = 0.705, p < 10<sup>−29</sup>; DHA vs. EPA: r = 0.684, p < 10<sup>−27</sup>).

# Genome-Wide Association Analysis and Gene Annotation of Identified SNPs

We identified nine SNPs (7, 8, 2, and 1 SNP for ARA4, DHA, EPA, and ALA, respectively) that achieved the suggestive significance threshold (P < 3.445 × 10<sup>−5</sup>), of which 4 SNPs (4, 3, 0, and 0 SNPs for ARA4, DHA, EPA, and ALA, respectively) surpassed the significance line (P < 1.72 × 10<sup>−6</sup>) for the content of four fatty acids. To identify additional potentially associated SNPs, we further investigated the results of the DHA\_EPA group, which contained two associated SNPs and six suggestive SNPs (**Figure 2** and **Supplementary Table S2**). The Manhattan plots and Q-Q plots are shown in **Figure 2** for traits with associated SNPs (ARA4, DHA, and DHA\_EPA), in which the Q-Q plots (genomic inflation factor ≈1) indicated the reliability of the experimental design and data analysis. Genes within the 10-kb regions of the associated and suggestive SNPs were annotated using the common carp genome (Xu P. et al., 2014), and a total of 15 genes were identified (six genes for associated SNPs, 12 genes for suggestive SNPs).

# Transcriptomic Profiling of Samples With Divergent PUFA Contents

We assessed the fatty acid content of 20 new individuals and selected eight showing extreme PUFA content (H: high content group; L: low content group). The ARA4, EPA, DHA, and ALA contents of the 20 individuals are shown in **Supplementary Table S3**. Three types of tissues (brain, liver, and muscle) were collected from eight individuals, and we used Illumina high-throughput sequencing to generate mRNA transcriptomes. The overall quality of the extracted RNA from 24 tissue samples is summarized in **Supplementary Table S4**. After sequencing, approximately 176 Gb of raw data were generated from 24 tissue samples, and 173.1 Gb clean data were used for further differential gene expression (DGE) analysis (**Supplementary Table S5**). Using the Bowtie software, 76.1% of the clean reads could be mapped to the genome, and the DGE of each trait between the high-content and low-content groups was calculated using the Cufflinks software.

As shown in the volcano plots in **Figure 3**, a large number of genes were identified to be differentially expressed in the brain, with counts of 6,191 genes, 916 genes, 1,010 genes, and 36 genes for ARA4, DHA, EPA, and ALA, respectively (**Figure 3** and **Supplementary Table S6**). We further conducted GO and KEGG enrichment analyses to focus on the most important pathways (**Figure 3** and **Supplementary Tables S6–S8**). Several important lipid and fatty acid metabolism pathways were enriched, including "adipocytokine signaling pathway," "ARA4 metabolism," and "glycerolipid metabolism." Key genes within these enriched pathways included cd36, npy, and lepr. Interestingly, several development and metabolism-related pathways were also enriched (**Figures 3C,D**), such as "Wnt signaling pathway," "Notch signaling pathway," and the "insulin signaling pathway," indicating a possible relationship between growth and muscular fatty acid. Pathways of amino acid metabolism (especially branched-chain amino acids) were also enriched, such as "valine, leucine, and isoleucine degradation," indicating related pathways between amino acid and fatty acid metabolism processes.

In liver tissues, 113, 299, 228, and 60 genes were identified to be differentially regulated in the four groups, respectively (**Supplementary Figure S1** and **Supplementary Tables S9–S11**). Similar to the results of the brain tissues, pathways related to growth and amino acid metabolism were also enriched in the liver tissues such as the "insulin signaling pathway," "glycine, serine, and threonine metabolism," and the "Wnt signaling pathway," including key genes such as prkaa2 and gsk-3β. In the muscles, there were 23, 15, 38, and 4 genes that were differentially expressed in the four groups, respectively (**Supplementary Figure S2** and **Supplementary Table S12**). Volcano plots and Venn diagrams showed a limited number of genes that were differentially expressed in the muscle tissues, thus no GO terms or KEGG pathways were enriched. **Supplementary Figure S3** shows the results of cluster analysis of the four traits in three types of tissues.

# Patterns of Methylation Variations Among Groups With Distinct PUFA Contents

To assess the global trends of epigenetic variations in different groups, we performed genome-wide pairwise comparisons of each epigenetic modification between the H and L groups. Muscle tissues were collected from the eight samples used in transcriptome analysis, and DNA was extracted for further WGBS. A total of 541.3 Gb of raw data (3,608,912,206 raw reads) were acquired (around 45× coverage for each sample), and 535.1 Gb of clean data were used for alignment to the carp genome (**Figure 4A** and **Supplementary Table S13**).
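As a quick consistency check on the sequencing figures above, the raw data volume can be recomputed from the read count and the 150-bp read length given in the Methods. The genome size implied by the stated coverage of about 45× per sample is a derived estimate, not a value reported in the text, and the function names are illustrative.

```python
def total_bases(n_reads, read_length):
    """Total sequenced bases for a given number of reads."""
    return n_reads * read_length


def implied_genome_size(total_bp, n_samples, coverage):
    """Genome size implied by an even split across samples at a given coverage."""
    return total_bp / n_samples / coverage


raw_bp = total_bases(3_608_912_206, 150)        # raw reads x 150-bp read length
genome_bp = implied_genome_size(raw_bp, 8, 45)  # eight samples, ~45x each
print(f"raw data: {raw_bp / 1e9:.1f} Gb, implied genome size: {genome_bp / 1e9:.2f} Gb")
```

The recomputed raw volume agrees with the reported 541.3 Gb, and the implied genome size is roughly 1.5 Gb under the assumption that the eight samples were sequenced to equal depth.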


Following standard pipelines (see Methods), DMRs were identified (**Figure 4B**). The most abundant DMRs were located in intergenic repeat regions, followed by introns, exons, and promoter regions (**Supplementary Table S14**). Because DMRs in the promoter regions (differentially methylated promoters, DMPs) play vital roles in the regulation of transcription (Haque et al., 2016), we selected DMPs for further functional enrichment analysis (**Supplementary Tables S15, S16**). GO enrichment analysis identified various terms, including "embryonic organ development" and "embryonic organ morphogenesis," which were related to growth and development (**Figure 4C**). More direct evidence was identified in the KEGG enrichment (**Figure 4D**): pathways shared by the four traits, namely, "insulin signaling pathway," "glycerophospholipid metabolism," and "ether lipid metabolism," and fatty acid metabolism-related pathways, including "ARA4 metabolism," "linoleic acid metabolism," "biosynthesis of unsaturated fatty acids," and "adipocytokine signaling pathway." Furthermore, "steroid hormone biosynthesis" was enriched, indicating the role of fatty acids in hormone metabolism.

# Integrated Analysis of DGE and DMP Results

Although DGE and DMP analyses identified several genes and pathways, the correlation between the two regulatory levels remains unclear. Methylation levels in the promoter regions significantly affect the transcription of downstream genes; therefore, it is necessary to conduct a correlation analysis of genes shared between the DGE and DMP results. Due to the limited number of genes shared between the liver/muscle DGE and DMP results, we focused on the correlation between brain DGE log2(FC) values and DMP gene methylation differences (**Supplementary Table S17**). **Figure 5** shows the results of Pearson correlation analysis, which indicated a significant linear correlation for the ARA4, DHA, and EPA traits. The genes shown in **Figure 5** were classified into two categories, namely, positively correlated (blue dots) and negatively correlated (red dots). For the ARA4, DHA, and EPA traits, 457, 35, and 59 genes were included in the correlation analysis. The R<sup>2</sup> values of the 221 positively correlated genes and 236 negatively correlated genes in ARA4 trait analysis were 0.68 and 0.55, respectively. For the DHA trait, the R<sup>2</sup> values of 21 positively correlated genes and 14 negatively correlated genes were 0.44 and 0.60, respectively. For the EPA trait, R<sup>2</sup> values of 46 positively correlated genes and 13 negatively correlated genes were 0.51 and 0.77, respectively. The DGE and DMP results of the liver and muscle tissues were also integrated (**Supplementary Table S17**), which identified fewer genes compared to those in the brain. To find more evidence supporting the correlation between gene expression and genome methylation, TFBS prediction of the promoters of selected genes from integrated analyses was performed (**Supplementary Table S18**).
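The per-category R<sup>2</sup> values can be illustrated with a short Python sketch. The text does not state how genes were assigned to the positively and negatively correlated categories. The sketch assumes one plausible reading: a gene is "positive" when its expression log<sub>2</sub>(FC) and its promoter methylation difference have the same sign, and "negative" otherwise. The function names and the input format are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared_by_category(pairs):
    """Split (log2fc, methylation_diff) pairs by sign agreement; return R^2 per group.

    Assumed rule: same sign -> "positive", opposite sign -> "negative".
    Each group needs at least two points with non-zero variance.
    """
    groups = {"positive": [], "negative": []}
    for log2fc, meth_diff in pairs:
        key = "positive" if log2fc * meth_diff > 0 else "negative"
        groups[key].append((log2fc, meth_diff))
    result = {}
    for key, points in groups.items():
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        result[key] = pearson_r(xs, ys) ** 2
    return result
```

Splitting by sign agreement before fitting is a design choice that guarantees each group has a consistent slope direction, so the R<sup>2</sup> values describe the strength of the linear relationship within each category, not across all genes.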

# DISCUSSION

# GWAS of 203 Individuals and Gene Annotation

Among the SNPs identified in GWAS, the most significant SNP, carp222748, was associated with ARA4 content in C. carpio and is located within the coding region of the duox2 gene. The Duox2 (dual oxidase 2) protein, which is encoded by duox2, generates hydrogen peroxide, which is required for the activity of thyroid peroxidase and plays a role in thyroid hormone synthesis. Interestingly, another gene, trh (pro-thyrotropin-releasing hormone), was identified downstream of a suggestive SNP, carp215797, and the Trh protein stimulates the release of thyrotropin. Thyroid hormone status affects metabolic pathways of ARA4 in mice and humans (Fonseca et al., 2014; Yao et al., 2015), and thus it is possible that a similar genetic mechanism may be utilized in ARA4 metabolism in C. carpio. Another associated SNP, carp152877, was found to be associated with the ARA4, DHA, and DHA\_EPA traits and is located downstream of the ormdl3 gene. The Ormdl3 (ORM1-like protein 3) protein negatively regulates sphingolipid synthesis and may be indirectly involved in endoplasmic reticulum-mediated calcium ion signaling (Breslow et al., 2010). As the genes discovered from GWAS analysis may not be enough to illustrate the underlying mechanism of muscular PUFA metabolism, we sought to uncover more evidence by investigating the gene expression and epigenetic statuses of the collected samples.

FIGURE 3 | Volcano plot, Venn diagram, Gene Ontology, and KEGG enrichment for differentially expressed genes in brain tissues. (A) Volcano plots for DEGs in the brain tissues for four traits. Red dots indicate upregulated genes; blue dots indicate downregulated genes. (B) Venn diagram of DEGs in brain tissues showing the number of shared and unique genes for each trait. (C) Gene Ontology enrichment of DEGs in brain tissues. The size of the circles represents gene numbers in each term; colors represent minus logarithms of adjusted P values. (D) KEGG enrichment of DEGs in brain tissue. Size of circles represents gene numbers in each pathway; colors represent minus logarithms of adjusted P values.

FIGURE 4 | Sequencing depth distribution, DMR classification, and enrichment of genes within DMP regions. (A) Accumulative fraction against sequencing depth. Curve lines with different colors represent eight samples used for WGBS, and the Y-axis represents the cumulative ratio of genomic regions mapped by reads. (B) Counts for each DMR classification in different genomic regions for each trait. (C) Gene Ontology enrichment of genes downstream of DMPs. The size of the circles represents the number of genes in each term; colors represent minus logarithms of adjusted P values. (D) KEGG enrichment of genes downstream of DMPs. The size of the circles represents the number of genes in each pathway; colors represent minus logarithms of adjusted P values.

# GO and KEGG Analysis Based on PUFA RNA-Seq Data

GO and KEGG enrichment analyses of the DGE results identified a handful of genes involved in enriched pathways related to fatty acid metabolism. CD36 has been reported as a high-affinity receptor for long-chain fatty acid (FA) uptake, in addition to contributing to lipid accumulation and FA-initiated signaling (Pepino et al., 2014). Neuropeptide Y (NPY), an orexigenic hypothalamic neuropeptide released by the neurons of the arcuate nuclei, influences foraging behavior and food intake in mammals and indirectly affects fat accumulation (Levin et al., 2013). Leptin is an adipocytokine that regulates energy intake and expenditure through interactions with the leptin receptor (LEPR). LEPR has been associated with intramuscular fat and FA accumulation in Duroc pigs (Ros-Freixedes et al., 2016). We presume that these genes may have similar functions in C. carpio. Insulin has been reported to promote development and fatty acid biosynthesis through the conversion of glucose into triglyceride in liver, fat, and muscle cells (Laron, 2008; Xu X. et al., 2012; Pramfalk et al., 2016).

Among the potentially associated genes identified in liver tissue pathways, protein kinase α2 (PRKAA2) participates in lipid metabolism and energy homeostasis. A recent study on mice showed that lipopolysaccharide could significantly inhibit PRKAA2 expression (Tzanavari et al., 2016). In the Wnt and insulin signaling pathways, gsk-3β was enriched in the liver tissues for DHA and EPA. GSK-3β acts as an upstream regulator of the ACSL family and lipid accumulation in hepatocytes (Chang et al., 2011). Previous studies have shown that high levels of amino acids upregulate hepatic fatty acid biosynthetic gene expression in trout hepatocytes (Dai et al., 2015). Despite intragroup variations in each group due to limited samples, distinct expression patterns were observed between the H (high) and L (low) groups. Interestingly, several genes identified in GWAS were also differentially expressed in three tissues, including ormdl3, trh, and nptn. This suggests potential correlations between SNP frequency and relative gene expression. Although the sample sizes of the GWAS and RNA-Seq analyses were both too small for a comprehensive exploration of all trait-related genes, the results remain informative for further analyses with larger samples.

# Predicting Fatty Acid Metabolism-Related Genes With DNA Methylation Data

The core genes within DMP-enriched pathways included gpd1, acsl1, fads2, elovl6, and the pla2 family (pla2g15, pla2g12b, and pla2g6), some of which have already been reported to be involved in PUFA metabolism in various species. For example, acyl-coenzyme A (CoA) synthetase 1 (ACSL1) is a well-studied obesogenic gene that is involved in fatty acid metabolism. The expression of ACSL1 has been reported to be associated with high caloric food intake in mice and humans (Joseph et al., 2015). In the biosynthesis of PUFA, Δ6 desaturase (FADS2) has been demonstrated to be an important indicator that catalyzes the first desaturation step and influences PUFA synthesis capability in fish. The expression of fads2 is regulated by dietary fatty acid profiles in Japanese seabass and is significantly negatively correlated with CpG methylation rates in the fads2 gene promoter (Xu H. et al., 2014). The elongation of very long chain fatty acid 6 (ELOVL6) mainly catalyzes the elongation of long chain SFAs and MUFAs (C14, C16, and C18), and it is also engaged in balancing the overall fatty acid composition in mammals (Corominas et al., 2015).

TABLE 2 | Summary of genes identified in three types of analyses.

# Integrated Analysis of DGE and DMP Results

Taken together, the significant linear correlation between expression and methylation indicates that epigenetic effects act on transcription in addition to genetic factors such as genome variations. GO and KEGG enrichment further confirmed the importance of these correlated genes, and vital pathways were enriched, including "adipocytokine signaling pathway," "glycerophospholipid metabolism," "Notch signaling pathway," "Wnt signaling pathway," "ion binding," and "regulation of growth." Genes enriched in these pathways included agrp, ctbp2, and the previously discussed acsl1, gpd1, and pla2g15. AGRP has been reported to play roles similar to those of NPY in food intake and high-fat diet preference in mice and bats, and the upregulation of the agrp gene enhances diet preference and fat gain (Levin et al., 2013). The ctbp2 gene was downregulated and highly methylated in the high-PUFA content group, which agrees with previous findings that overexpression of C-terminal-binding protein 2 (CTBP2) suppresses lipid accumulation and hepatic glucose uptake (Liu et al., 2017). Fewer correlated genes were identified from the liver and muscle data, partly because of the limited number of samples. The acsl5 gene showed higher expression levels and lower methylation levels in the promoter region in the H group compared to the L group in relation to muscular ALA content. ACSL5 has been reported to act like other ACSL family members and to engage in lipid biosynthesis and fatty acid degradation (Bowman et al., 2016). In addition, the Fad and Elovl enzymes encoded by the fad and elovl genes have been shown to possess all the activities required for, and to jointly play an important role in, the biosynthesis of DHA from C<sub>18</sub> PUFA in rabbitfish, as well as in other teleosts (Monroig et al., 2012). In mammals and teleost fish, Elovl2, Elovl4, and Elovl5 have been cloned and functionally characterized separately as crucial elongation enzymes in PUFA biosynthesis (Agaba et al., 2005; Jakobsson et al., 2006; Morais et al., 2009; Monroig et al., 2010). It has been reported that the capability for PUFA biosynthesis varies among teleost fish, with alternative pathways arising during evolution (Castro et al., 2016). Therefore, we assume that elovl6, which was enriched in DMP pathways, encodes an Elovl6 enzyme with similar functions. Fads2 enzymes with Δ4, Δ6, and Δ8 desaturase activities also form a crucial part of PUFA biosynthesis (Li et al., 2010; Oboh et al., 2017). After integration of the three types of analyses, the important genes identified in this study were summarized in **Table 2**, which provides a full view of genes supported by multiple lines of evidence. We used muscle tissue for WGBS and methylation analysis and assumed that it could represent the genome methylation of an individual. However, slight differences in methylation among tissues cannot be ignored. Investigations of tissue-specific methylation sites using multi-omics analyses and of the correlation between DGE and DMP in the same tissue are thus warranted. Our results indicate that fatty acid metabolism-related genes are associated with growth-, hormone-, and even amino acid-related genes, which regulate muscular PUFA content.
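The linear correlation between expression and methylation mentioned at the start of this section can be illustrated with a Pearson coefficient. The following is a minimal sketch, not the authors' pipeline: the function name `pearson_r` and the example values are hypothetical, and the choice of Pearson's r is an assumption, since the text only states that the correlation was linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-gene values, for illustration only (not data from the study):
# promoter methylation difference (H minus L) and log2 expression fold change.
methylation_diff = [1.0, 2.0, 3.0, 4.0]
expression_lfc = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(methylation_diff, expression_lfc)
print(f"r = {r:.2f}")
```

A negative coefficient in such a comparison would correspond to the pattern described for ctbp2 and acsl5, where higher promoter methylation accompanies lower expression.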

# CONCLUSION

This study investigated the associated genes and the divergence of transcriptomic and epigenomic variations in C. carpio samples with distinct muscular PUFA contents. The phenotypic correlation and GWAS results indicated dozens of potential genes that are associated with PUFA metabolism. Further investigation of DGE patterns of three tissues and gene profiles with DMPs identified important genes that enriched PUFA synthesis- and degradation-related pathways. Integrated analysis showed significant correlations between DGE and DMP and enrichment of correlated genes involved in vital pathways related to fatty acid desaturation, elongation, and transportation. Confirmation of these results using integrated genomic, transcriptomic, and epigenomic profiling with more extensive sequencing of larger samples is warranted. This study may also facilitate the applications of combined genome selection and molecular-associated breeding for multiple traits.

# DATA AVAILABILITY

The sequencing datasets of all samples have been deposited at NCBI (PRJNA493161).

# AUTHOR CONTRIBUTIONS

JX initiated and coordinated the research project. HZ and JX conceived and conducted the analysis and drafted the manuscript. PX engaged in sample collection and genotyping analysis. YJ and ZZ assisted in transcriptome and methylation data analysis. JF, RT, and CD took part in trait measurement, tissue manipulation, and enrichment analysis. All the authors read and approved the final manuscript.

# FUNDING

The Central Public-interest Scientific Institution Basal Research Fund, CAFS (Nos. 2016HY-ZD0302, 2016GH02, and 2016HY-JC0301), the National Natural Science Foundation of China (Nos. 31502151 and 31422057), the National High-Technology Research and Development Program of China (2011AA100401), and the National Infrastructure of Fishery Germplasm Resources of China (No. 2018DKA30470) supported this study.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00217/full#supplementary-material






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhang, Xu, Jiang, Zhao, Feng, Tai, Dong and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptome Profile Analysis on Ovarian Tissues of Autotetraploid Fish and Diploid Red Crucian Carp

Yude Wang† , Minghe Zhang† , Qinbo Qin† , Yajun Peng† , Xu Huang, Chongqing Wang, Liu Cao, Wuhui Li, Min Tao, Chun Zhang and Shaojun Liu\*

State Key Laboratory of Developmental Biology of Freshwater Fish, College of Life Sciences, Hunan Normal University, Changsha, China

Polyploidization can significantly alter the size of animal gametes. Autotetraploid fish (RRRR, 4nRR = 200) (4nRR) possessing four sets of chromosomes were derived from whole-genome duplication in red crucian carp (RR, 2n = 100) (RCC). The diploid eggs of the 4nRR fish were significantly larger than the eggs of RCC. To explore the differences between the ovaries of these two ploidies of fishes at the molecular level, we compared the ovary transcriptome profiles of 4nRR fish and RCC and identified differentially expressed genes (DEGs). A total of 19,015 unigenes were differentially expressed between 4nRR fish and RCC, including 12,591 upregulated and 6,424 downregulated unigenes in 4nRR fish. Functional analyses revealed that eight genes (CDKL1, AHCY, ARHGEF3, TGFβ, WNT11, CYP27A, GDF7, and CKB) were involved in the regulation of cell proliferation, cell division, gene transcription, ovary development and energy metabolism, suggesting that these eight genes were related to egg size in 4nRR fish and RCC. We validated the expression levels of these eight DEGs in 4nRR fish and RCC using quantitative PCR. The study results provided insights into the regulatory mechanisms underlying the differences in crucian carp egg sizes.

Keywords: red crucian carp, autotetraploid fish, ovarian tissues, egg size, transcriptome

# INTRODUCTION

Polyploidy is a very common phenomenon. In vertebrate evolution, polyploidy is considered to have led to the evolution of more complex forms of life by providing the opportunity for new functions to evolve (Ohno, 1970; Epstein, 1971). Polyploidy, including allopolyploidy and autopolyploidy, is both widespread and evolutionarily important (Van de Peer et al., 2017). Allopolyploids contain genomes from distinct taxa, while autopolyploids are formed by genomes from the same species (Van Drunen and Husband, 2018).

Phenotypic changes induced by chromosome duplications have been reported since the early 20th century (Stebbins, 1947). A well-known effect of polyploidy in plants and animals is cell enlargement (Knight and Beaulieu, 2008), but less evident effects can also occur (Maherali et al., 2009). In plants, for example, polyploidy often modifies physiological traits such as transpiration, and rates of photosynthesis or growth (Levin, 2002). Following such changes in physiology, shifts in ecological tolerance have been demonstrated for some taxa (Levin, 2002). Polyploidy can also induce phenotypic modifications in reproductive traits, but surprisingly, these effects have received less attention. Sometimes, polyploids have reproductive organs that are larger than those of their diploid counterparts (Robertson et al., 2011).

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Jian Xu, Chinese Academy of Fishery Sciences (CAFS), China Jun Hong Xia, Sun Yat-sen University, China

### \*Correspondence:

Shaojun Liu
lsj@hunnu.edu.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 30 October 2018 Accepted: 26 February 2019 Published: 19 March 2019

### Citation:

Wang Y, Zhang M, Qin Q, Peng Y, Huang X, Wang C, Cao L, Li W, Tao M, Zhang C and Liu S (2019) Transcriptome Profile Analysis on Ovarian Tissues of Autotetraploid Fish and Diploid Red Crucian Carp. Front. Genet. 10:208. doi: 10.3389/fgene.2019.00208


Following their instantaneous multiplication in DNA content, polyploids can experience processes that either expand or shrink their genomes (Leitch et al., 2008). This increase in DNA has great potential to induce phenotypic variation (Chen, 2007). The relationships between genome size and phenotypic traits have been discussed in comparative studies at broad phylogenetic levels (Muenzbergova, 2009), but few studies have analyzed whether and how genome size or polyploidy can modify phenotypic traits at the microevolutionary scale (Lavergne et al., 2009). In fish, polyploidization can markedly alter egg size. For example, allotetraploid hybrids of red crucian carp × common carp can produce diploid eggs that are clearly larger than those of their parents (Liu et al., 2004, 2007). Forés et al. (1990) studied egg activity in Scophthalmus maximus and found that the eggs reached a higher fertilization ratio when the egg diameter was 0.9–1.1 mm, but a lower fertilization ratio when the diameter was 1.1–1.2 mm. Thus, egg diameter is an important parameter that can reflect egg quality. Autotetraploid fish (4nRR) (Qin et al., 2014) were derived from genome duplication in RCC. Autopolyploids often differ ecologically and phenotypically from their low-ploidy parents (Husband et al., 2016), but because studies are commonly performed on long-established cytotypes (Comai, 2005), it is unclear whether the differences are due to instantaneous changes associated with the whole-genome duplication (WGD) event or to divergence through selection after the fact (Weiss-Schneeweiss et al., 2013). In this process, we found that the egg diameters of 4nRR fish are larger than those of RCC. Meanwhile, through self-crossing RCC and 4nRR fish, we found that the fertilization ratio of RCC (96.70%) was higher than that of 4nRR fish (65.36%). In livestock and wildlife, egg quality is highly variable and is affected by a number of factors, including egg size (Chapman et al., 2014). Egg size plays an important role in the heredity and reproduction of fish.

In this study, we examined the transcriptomes of mature ovarian tissues from 4nRR fish and RCC using RNA-seq. The purposes of this research were to expand the genetic resources available for crucian carp, analyze differentially expressed genes (DEGs) between 4nRR fish and RCC, and identify genes related to egg diameter. Overall, our results provide genomic information that is valuable for understanding the molecular mechanism of ovarian development in 4nRR fish and RCC. In addition, this study helps establish a foundation for research on polyploid evolution and molecular breeding in crucian carp and other closely related species.

# MATERIALS AND METHODS

# Ethics Statement

Fish researchers were certified under a professional training course for laboratory animal practitioners held by the Institute of Experimental Animals, Hunan Province, China (Certificate No. 4263). All fish were euthanized using 2-phenoxyethanol (Sigma, United States) before dissection. This study was carried out in accordance with the recommendations of the Administration of Affairs Concerning Experimental Animals for the Science and Technology Bureau of China. The protocol was approved by the Administration of Affairs Concerning Experimental Animals for the Science and Technology Bureau of China.

# Sample Collection and Preparation

One-year-old female 4nRR fish and RCC were obtained from the State Key Laboratory of Developmental Biology of Freshwater Fish, Hunan Normal University, China. The ploidy status of the 4nRR fish and RCC was tested by flow cytometry as described by Qin et al. (2014). Three one-year-old mature female 4nRR fish and three RCC were chosen. Ovarian tissues were removed from the 4nRR fish and RCC after euthanasia using 2-phenoxyethanol (Sigma, United States). In this experiment, 4nRR fish were used as the treatment group, while RCC were used as the control group. The ovarian tissues of the 4nRR fish and RCC were then divided into three parts: the first part was used to measure egg diameters to test differences in egg size between 4nRR fish and RCC using multiple-contingency-table analyses (Sokal and Rohlf, 1981); the second part was fixed in 4% paraformaldehyde solution for histological observation as described in Cao and Wang (2009); the third part was promptly frozen in liquid nitrogen, stored at −80 °C, and then used for RNA-Seq and real-time quantitative PCR (qPCR) analysis. Total RNA was extracted from 4nRR fish and RCC ovarian tissues using a Total RNA Kit II (TaKaRa, China) according to the manufacturer's instructions. For each ploidy, equal amounts of RNA from the three fish were pooled to provide templates for construction of the RNA-Seq library (**Supplementary Figure 1**).

# Measurement of the Size of the Eggs and the Histology Observation of Ovary

Ten female 4nRR fish and ten female RCC were sorted into two groups producing "high-quality" or "low-quality" eggs as described by Chapman et al. (2014). The diameters of 167 4nRR fish eggs and 167 RCC eggs were measured by a Vernier caliper. We used analyses of variance (ANOVA) (Osterlind et al., 2001) and multiple comparison tests (LSD method) (Williams and Abdi, 2010) to test for differences in egg size between 4nRR fish and RCC using SPSS Statistics 21.0. The values of the independent variables were expressed as the mean ± SD. The gonads of 4nRR fish and RCC were fixed in Bouin's solution for the preparation of tissue sections. The paraffin-embedded sections were cut and stained with hematoxylin and eosin. Gonadal structure was observed with a light microscope and photographed with a Pixera Pro 600ES.
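The Results report the egg-size comparison as a t statistic, so a two-sample comparison of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the SPSS procedure the authors ran: it uses Welch's unequal-variance t-test, the function name `welch_t` is hypothetical, and the diameters are made-up values rather than measurements from this study.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb                          # squared standard error
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical diameters, for illustration only (not data from this study).
group_a = [1.0, 1.2, 1.1, 0.9, 1.3]
group_b = [1.6, 1.8, 1.7, 1.5, 1.9]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The sign of t depends only on which group is listed first; a negative value means the first group has the smaller mean.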

# RNA Sequencing Library Construction and Illumina Sequencing

The cDNA library was constructed using high-quality RNA. Poly(A) mRNA was separated using oligo-dT beads (Qiagen, Dusseldorf, Germany). Fragmentation buffer was added to break the mRNA into short fragments. Random hexamer-primed reverse transcription was used for first-strand cDNA synthesis. Second-strand cDNA synthesis was subsequently performed using DNA polymerase I and endonuclease. A quick PCR extraction kit was used to purify the cDNA fragments. These purified cDNA fragments were rinsed with EB buffer for end repair and poly(A) addition and then ligated to sequencing barcodes. Fragments of a size suitable for sequencing were isolated from the gels and enriched by PCR amplification to construct the final cDNA library. The cDNA library was sequenced on the Illumina sequencing platform (Illumina HiSeq™ 2500) using paired-end technology in a single run by Novogene Technologies (Beijing, China). The Illumina GA processing pipeline was used to analyze the images and for base calling.

# De novo Assembly and Functional Annotation

Raw reads were filtered using FastQC software (Babraham Bioinformatics) (Davis et al., 2013) to obtain paired-end clean reads. All clean reads were assembled using Trinity software (Grabherr et al., 2011) with the following parameters: (1) minimum assembled contig length to report = 100 bp; (2) maximum length expected between fragment pairs = 250 bp; and (3) count for K-mers to be assembled by Inchworm = 25. After assembly, contigs longer than 200 bp were used for analysis. The contigs were connected to obtain sequences that could not be extended further at either end, and the sequences of the unigenes were generated. The unigenes were further spliced and assembled to acquire maximum-length non-redundant unigenes using TGICL clustering software (J. Craig Venter Institute, Rockville, MD, United States). Finally, Blastx was used to compare the unigenes, based on an E-value < 10<sup>−5</sup> (Altschul et al., 1997), with the non-redundant protein (Nr), SwissProt, Kyoto Encyclopedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases (E-value < 10<sup>−3</sup>). Gene ontology (GO) annotation of the unigenes was completed using Blast2GO based on the results from the NCBI Nr database annotation. Blastn was used to align the unigenes to the Nr database and search for proteins with the highest sequence similarity to the given unigenes, accompanied by their protein functional annotations. A heat map that grouped genes according to FPKM values was generated in Cluster 3.0 (De Hoon et al., 2004).
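The E-value filtering step described above can be sketched as a small parser for BLAST tabular output. This is an illustrative sketch only: it assumes the default 12-column tabular layout (`-outfmt 6`), in which the E-value is the eleventh column, and the function name `best_hits` and all identifiers in the example records are hypothetical.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the lowest-E-value hit per query from BLAST tabular (outfmt 6) lines."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # column 11 of the default outfmt 6 layout
        if evalue >= max_evalue:
            continue  # hit does not pass the E-value threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

# Hypothetical records for illustration; identifiers are made up.
example = [
    "unigene_1\tprotA\t95.0\t200\t10\t0\t1\t600\t1\t200\t1e-50\t400",
    "unigene_1\tprotB\t80.0\t150\t30\t0\t1\t450\t1\t150\t1e-20\t200",
    "unigene_2\tprotC\t40.0\t60\t36\t0\t1\t180\t1\t60\t0.01\t35",
]
hits = best_hits(example)
print(hits)
```

Passing a different `max_evalue` (for example `1e-3`) applies the looser threshold that the text also mentions.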

# Identification of Differentially Expressed Genes (DEGs)

FIGURE 1 | Gonadal structure of ovarian tissues in (A) red crucian carp and (B) autotetraploid crucian carp. The eggs produced by (C) red crucian carp and (D) autotetraploid fish.

The mapped reads were normalized according to the FPKM for each unigene, which facilitated comparison of unigene expression between the 4nRR and RCC fish (McCarthy and Smyth, 2009). The DEGs were identified with the DEGseq package (Wang et al., 2009) by applying the MA-plot-based method with a random sampling model. DEGs between 4nRR and RCC fish were selected based on the following filter criteria: (1) false discovery rate (FDR) < 0.05; and (2) |log2(foldchange)| > 1 (Storey and Tibshirani, 2003; Lv et al., 2013).
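The normalization and the two filter criteria above can be sketched in a few lines. This is a simplified illustration, not the DEGseq implementation: the function names, the pseudocount used to avoid division by zero, and the example counts are all assumptions, and the FDR value is taken as given rather than computed.

```python
import math

def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def is_deg(fpkm_treatment, fpkm_control, fdr,
           fdr_cutoff=0.05, lfc_cutoff=1.0, pseudocount=1e-3):
    """Apply the two filters: FDR < cutoff and |log2 fold change| > cutoff.

    The pseudocount is an assumption added here so that a zero FPKM in
    either group does not cause a division or logarithm error.
    """
    lfc = math.log2((fpkm_treatment + pseudocount) / (fpkm_control + pseudocount))
    passes = (fdr < fdr_cutoff) and (abs(lfc) > lfc_cutoff)
    return passes, lfc

# Hypothetical unigene: 1,000 bp long, 10 million mapped fragments per library.
expr_4n = fpkm(500, 1000, 10_000_000)   # treatment group (4nRR)
expr_rcc = fpkm(100, 1000, 10_000_000)  # control group (RCC)
flag, lfc = is_deg(expr_4n, expr_rcc, fdr=0.01)
print(f"FPKM 4nRR = {expr_4n:.1f}, FPKM RCC = {expr_rcc:.1f}, "
      f"log2FC = {lfc:.2f}, DEG = {flag}")
```

With RCC as the control group, a positive log2 fold change corresponds to upregulation in 4nRR fish.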

# Validation of RNA-Seq Results by qPCR

To verify the reliability of the RNA-seq results, eight DEGs (CDKL1, CKB, AHCY, ARHGEF3, TGFβ, SCP1, WNT11 and CYP27A) involved in the development of ovarian tissues were selected for validation using quantitative real-time PCR (qPCR) on a Prism 7500 Sequence Detection System (Applied Biosystems, United States) with a miScript SYBR Green PCR Kit (Qiagen, Germany). The reaction mixture (10 µL) comprised 2.5 µL cDNA (1:3 dilution), 5 µL SYBR Premix Ex Taq™ II (TaKaRa), 0.5 µL specific forward primer, 0.5 µL reverse primer, and 1.5 µL water. Real-time PCR was performed on biological replicates in triplicate. The amplification conditions were as follows: (1) 50 °C for 5 min, (2) 95 °C for 10 min and (3) 40 cycles at 95 °C for 15 s, followed by 60 °C for 45 s. The average threshold cycle (Ct) was calculated for 4nRR fish and RCC, and relative expression was determined using the 2<sup>−ΔΔCt</sup> method (Pfaffl, 2001) with β-actin as the reference gene. Finally, a melting curve analysis was completed to validate the specific generation of the expected products.
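The relative quantification used for qPCR validation reduces to a short calculation on four mean Ct values. The sketch below assumes equal amplification efficiency for the target and reference genes (the usual assumption behind the 2<sup>−ΔΔCt</sup> formula); the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) calculation.

    Each argument is a mean threshold cycle (Ct); the reference gene
    (beta-actin in this study) normalizes for the input amount.
    """
    delta_sample = ct_target_sample - ct_ref_sample      # dCt in the sample
    delta_control = ct_target_control - ct_ref_control   # dCt in the control
    delta_delta = delta_sample - delta_control           # ddCt
    return 2 ** -delta_delta

# Hypothetical mean Ct values, for illustration only.
fold = relative_expression(ct_target_sample=22.0, ct_ref_sample=18.0,
                           ct_target_control=24.0, ct_ref_control=18.0)
print(f"fold change = {fold:.1f}")
```

A fold change above 1 means the target gene is more highly expressed in the sample (here, 4nRR) than in the control (RCC) after normalization.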

# RESULTS

# Comparison of Egg Size

One-year-old 4nRR and RCC fish were used in this research. The ovaries of one-year-old 4nRR and RCC fish were well developed and contained stage II, III, and IV oocytes. Furthermore, large numbers of eggs were stripped from both one-year-old 4nRR fish and RCC. These results showed that 4nRR and RCC fish had reached sexual maturity by one year of age (**Figures 1A,B**). The average egg diameters of the RCC and 4nRR fish were 13.67 and 17.71 mm, respectively (**Table 1**). Eggs from 4nRR fish were significantly larger than those from RCC fish (**Figures 1C,D**) (t = −33.370, p < 0.05).

# Sequencing, de novo Assembly and Functional Annotation

RNA-seq (Feng et al., 2012) was conducted on 4nRR and RCC fish ovarian tissue. A total of 118.1 million 150 bp paired-end reads were generated. After removing low-quality reads and short read sequences, a total of 108.1 million clean reads (91.54%) were obtained (**Supplementary Table 1**), and these reads were used for the following analyses. Ovarian tissues from RCC and 4nRR fish were used to generate 212,573 transcripts and 149,851 unigenes.

TABLE 1 | Comparison of mature egg diameters between RCC and 4nRR fish (ten fish of each type were randomly selected).


Means in same column with different superscripts were very significantly different (t = –33.370, p < 0.01).

The N50 values of the transcripts and unigenes were 1,525 and 996 bp, respectively. A summary of the assembly data is shown in **Supplementary Table 2**. The length distributions of the transcripts and unigenes are shown in **Supplementary Figure 2**. Approximately 91.9% of the unigenes (149,861) were annotated by Blastx and Blastn against seven databases (GO, KO, KOG, NR, NT, PFAM, and SwissProt) with a threshold of 10<sup>−5</sup> (Altschul et al., 1997; Lv et al., 2013). Among these unigenes, 38,140, 26,510, 21,296, 51,507, 135,474, 36,236, and 40,008 were identified in the GO, KO, KOG, NR, NT, PFAM and SwissProt databases, respectively (**Supplementary Figure 3**). Clean RNA sequencing reads were deposited in the NCBI Sequence Read Archive (SRA) under accession numbers SAMN07418623 and SAMN07418624<sup>1</sup>.
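For readers unfamiliar with the N50 statistic reported above, it is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch follows; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical contig lengths in bp, for illustration only.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

In the example, the total is 20 bp; the two longest sequences (6 and 5 bp) already cover 11 bp, so the N50 is 5.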

# The Differentially Expressed Genes Between the Two Kinds of Crucian Carp

A total of 19,015 unigenes were differentially expressed between the RCC and 4nRR fish. In total, 12,591 unigenes were upregulated in 4nRR fish, while 6,424 unigenes were downregulated in 4nRR fish compared with RCC. Some upregulated genes in 4nRR fish, such as vitellogenin (Vtg), Meiotic nuclear division 5 homolog B (Md5b), Mediator of RNA polymerase II transcription subunit 25 (Mpts), Transcription factor TFIIIB component (Tfc), Cell division cycle-associated protein 3 (Cdc3), S-phase kinase-associated protein (Skp1), Bcl-2-related ovarian killer protein homolog A (Bokp), Ovarian cystatin (Oct), Dynein regulatory complex protein 1 (Drc1) and Cyclin-dependent kinase-like 1 (CDKL1) (**Table 2**), were mainly involved in the regulation of cell proliferation and cell division, gene transcription, ovary development and energy metabolism, showing that these genes might be related to egg diameter in crucian carp.

TABLE 2 | Summary of the 10 DEGs related to egg diameter.

<sup>∗</sup>FDR: False discovery rate, which was used to determine the p-value threshold in multiple tests.

<sup>1</sup>https://www.ncbi.nlm.nih.gov/biosample?Db=biosample&DbFrom=bioproject&Cmd=Link&LinkName=bioproject\_biosample&LinkReadableName=BioSample&ordinalpos=1&IdsFromResult=395975

# Analysis of Functional Enrichment

Mapping all the DEGs to terms in the GO database enabled the annotation of 149,862 unigenes, of which 65,580, 43,934, and 111,357 unigenes could be grouped into the cellular component, molecular function and biological process categories, respectively. In the cellular component category, cell (12,494, 32.76%), cell part (12,494, 32.76%), organelle (8,160, 21.39%), membrane (7,697, 20.18%), and macromolecular complex (7,636, 20.02%) represented the majority. Binding (21,572, 56.56%), catalytic activity (13,639, 35.76%), transporter activity (2,760, 7.24%), molecular transducer activity (1,776, 4.65%) and nucleic acid binding transcription factor activity (1,464, 3.84%) showed a higher proportion in the classification of molecular functions. Additionally, cellular process (21,702, 56.90%), single-organism process (17,647, 46.27%), metabolic process (17,595, 46.13%), biological regulation (10,164, 26.64%), regulation of biological process (9,741, 25.54%) and localization (6,333, 16.61%) represented the majority of the biological process categories (**Figure 2**).

A total of 21,304 unigenes were assigned to COG classifications (**Figure 3**). Among the 24 KOG categories, the top 10 categories were as follows: (1) signal transduction mechanisms (4,678, 21.96%), (2) general function prediction (4,427, 20.18%), (3) post-translational modification, protein turn-over, and chaperones (2,064, 9.69%), (4) transcription (1,355, 6.36%), (5) intracellular trafficking, secretion, and vesicular transport (1,210, 5.68%), (6) cytoskeleton (1,129, 5.30%), (7) function unknown (1,054, 4.95%), (8) inorganic ion transport and metabolism (955, 4.48%), (9) translation, ribosomal structure and biogenesis (882, 4.14%) and (10) lipid transport and metabolism (719, 3.37%). KEGG pathway annotation enabled us to assign the 10,023 DEGs to 232 pathways. In the enrichment analysis, the first ten enriched pathways included Ribosome (Ko1400), Fatty acid metabolism (Ko8746), Cell division (Ko04110), Oocyte meiosis (Ko04114), p53 signaling pathway (Ko04115), Focal adhesion (Ko04510), Adherens junction (Ko04520), Signaling pathways regulating pluripotency of stem cells (Ko04550), Regulation of autophagy (Ko04140) and Lysosome (Ko04142). These enriched pathways had functions in cell proliferation, steroidogenesis activity, receptor binding, and energy metabolism, which might indicate the differences in the developmental process of ovarian tissues between RCC and 4nRR fish.

Using log ratio values, we performed hierarchical clustering of 16,581 DEGs based on their expression. Expression levels during the stages of ovarian development were divided into 24 categories based on K-means clustering. Detailed expression profile clusters between 4nRR and RCC fish are shown in **Supplementary Figure 4**. The expression patterns not only indicate the diverse and complex interactions among genes, but also suggest that unigenes with similar expression patterns may have similar functions in the development of ovary.
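The K-means grouping of expression profiles mentioned above can be illustrated with a plain implementation of Lloyd's algorithm. This is a sketch under stated assumptions and not the software the authors used: the initial centroids are supplied by the caller to keep the result deterministic, the function names are hypothetical, and the two-dimensional profiles are made-up values.

```python
def assign(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])))

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[assign(p, centroids)].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for c, members in zip(centroids, clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [assign(p, centroids) for p in points]
    return centroids, labels

# Hypothetical log-ratio profiles (one tuple per gene), for illustration only.
profiles = [(2.0, 2.2), (1.8, 2.0), (-1.5, -1.7), (-1.9, -1.3)]
centroids, labels = kmeans(profiles, initial_centroids=[profiles[0], profiles[2]])
print(labels)
```

In practice the number of clusters (24 in this study) and the initialization strategy are choices made by the analyst, and different initializations can yield different groupings.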

# Validation of Differentially Expressed Genes by qPCR

Quantitative real-time PCR was performed on 8 selected genes (cyclin-dependent kinase-like [CDKL1], Creatine kinase B-type [CKB], adenosylhomocysteinase [AHCY], Rho guanine nucleotide exchange factor (GEF) 3 [ARHGEF3], transforming growth factor beta [TGFβ], growth/differentiation factor 7 [GDF7], protein Wnt-11 [WNT11], and vitamin D3-25 hydroxylase [CYP27A]). The qPCR results were compared with the RNA-seq expression profiles (**Supplementary Tables 3, 4** and **Figure 4**). The expression patterns of the eight genes by qPCR ranged from significantly different to similar to those indicated by the RNA-seq analysis (**Figure 5**).

# DISCUSSION

# Significance of Polyploidization

Polyploidization of chromosomes was thought to be one of the most important mechanisms in species evolution (Masterson, 1994). Polyploidization is a major factor that drives plant genome evolution (Stupar et al., 2007) and fish evolution (Finn and Kristoffersen, 2007). Polyploidization not only significantly shaped the genomes but also affected other genetic aspects including gene expression (Cheung et al., 2009). Polyploids may contain genomes from different parental species (allopolyploidy) (Wang et al., 2017) or multiple sets of the same genome (autopolyploidy). Many studies have revealed that polyploid genomes undergo major chromosomal, genomic, and genetic changes (Doyle et al., 2008; Buggs et al., 2011, 2012; Ainouche et al., 2012). Despite the great progress in clarifying the genomic and transcriptomic changes that accompany polyploidization, few studies have explicitly correlated these changes with phenotype alterations (Gaeta et al., 2007). The changes in the characteristics of polyploids were mainly caused by differences in gene expression (Stupar et al., 2007; Chen et al., 2010), and thus, RNA-seq technologies can now be used in a high-throughput manner to investigate such phenotypic changes (Cui et al., 2013; Qiao et al., 2013; Zhang et al., 2017). Here, we showed that autotetraploidization causes increased egg size in 4nRR fish compared to RCC fish. We established a 4nRR fish lineage to better understand the genetic impact imposed by autopolyploidization. The 4nRR fish were derived from a whole genome duplication of RCC and possessed four sets of chromosomes derived from RCC (Qin et al., 2014). However, phenotypic changes were present in the 4nRR fish, including increased blood cell and germ cell sizes compared with RCC fish. Notably, the phenotypic and molecular data reported here were due to autopolyploidy rather than cultivar influence, as similar effects on the RCC and 4nRR fish cultivars were found.

# Significance of Egg Size Study

Autopolyploidy is traditionally considered to cause reduced fertility or sterility compared with diploid progenitors (Cifuentes et al., 2013). However, recent research showed that 4nRR fish can produce unreduced diploid eggs and showed dual reproductive modes of sexual reproduction and gynogenesis (Qin et al., 2015). In this research, the histological features of the gonads revealed that the 4nRR and RCC fish both possessed normal gonadal structure and could reach maturation. In the breeding seasons, large numbers of eggs were harvested from one-year-old 4nRR and RCC fish. These results showed that autotetraploidization did not cause reduced fertility or sterility. Previous studies suggested that polyploid formation could induce various types of genomic changes (Wang et al., 2017). Comparative analysis based on egg size measurements revealed that the average diameter of diploid eggs from 4nRR fish was 17.71 mm, which was significantly larger than that of haploid eggs from RCC (13.67 mm), suggesting that genetic factors were likely to be the cause of this difference in ovary development and egg diameter. In mature ovaries, the increased oocyte volume was mainly due to the incorporation of vitellogenin (Santos et al., 2007; Schilling et al., 2015). This process requires a range of enzymes to provide hormonal and energy support for the synthesis and breakdown of vitellogenin (Williams et al., 2014). We found that the egg diameters of 4nRR fish were clearly larger than those of RCC fish. Developing oocytes were thought to be largely non-transcribed and to serve as a repository for specific maternal RNA, proteins and other molecules important for fertilization, initiation of zygotic development, and transition to embryonic gene expression (Santos et al., 2007; Reading et al., 2013; Chapman et al., 2014). Through self-mating experiments in 4nRR and RCC fish, we found the fertilization rate of 4nRR fish to be lower than that of RCC. These results suggest that variation in fish egg size is associated with polyploidization.
Among the 19,051 DEGs identified in this study, most of the key genes were involved in protein processing, fat and energy metabolism, cytoskeleton, steroidogenesis activities and cell division.

Cluster analysis of the genes differentially expressed between 4nRR and RCC fish identified a list of genes, of which 12,591 were more highly expressed in 4nRR fish and 6,424 were more highly expressed in RCC fish. With reference to the relevant literature (Heringstad et al., 2000; Dubrac et al., 2005; Menges et al., 2005; Knoll-Gellida et al., 2006; Santos et al., 2007), we screened 8 key genes (CDKL1, AHCY, ARHGEF3, TGFβ, WNT11, CYP27A, GDF7 and CKB) related to egg development. Compared with RCC, these eight genes showed marked up-regulation in 4nRR fish, which might account for the differences in egg diameter between RCC and 4nRR fish. CDKL1 is a member of the cyclin-dependent kinase-like (CDKL) protein family, a group of serine/threonine kinases (Santos et al., 2007). The cyclin-dependent kinase CDKL1 controls the cell cycle, a process best understood in the model organism Saccharomyces cerevisiae. AHCY (S-adenosylhomocysteine hydrolase) is a cellular enzyme that cells rely on for replication (Heringstad et al., 2000). ARHGEF3 is a regulator of small GTPases that mediate signal transduction (Mullin et al., 2008) and is related to energy metabolism. Transforming growth factor β (TGFβ) and its signaling effectors act as key determinants of carcinoma cell behaviors and play a key role in steroid hormone and vitellogenin synthesis during ovary development (Knoll-Gellida et al., 2006). WNT11 regulates cell fate and patterning during embryogenesis. In many different tissues, CYP27A plays an important role in cholesterol, bile acid and fatty acid metabolism (Dubrac et al., 2005). In our previous study, clear expression differences in gnrh2, gthb and gthr were found in the 4nRR fish (Qin et al., 2018). Altogether, our results provide a foundation for the further characterization of gene expression in 4nRR and RCC fish with respect to egg size.

# DATA AVAILABILITY

The datasets generated for this study can be found in National Center for Biotechnology Information, SAMN07418623 and SAMN07418624.

# AUTHOR CONTRIBUTIONS

SL and QQ conceived and designed the study. YW and YP contributed to the experimental work and wrote the manuscript. MZ, XH, and LC performed most of the statistical analyses. YW and WL designed the primers and performed the bioinformatics analyses. MT and CZ collected the photographs. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 31430088, 31730098), the Earmarked Fund for China Agriculture Research System (Grant No. CARS-45), Hunan Provincial Natural Science and Technology Major Project (Grant No. 2017NK1031), the Cooperative Innovation Center of Engineering, the Key Research and Development Program of Hunan Province (Grant No. 2018NK2072) and New Products for Developmental Biology of Hunan Province (Grant No. 20134486).

# ACKNOWLEDGMENTS

We would like to sincerely thank Yuwei Zhou, Jun Wang, Dengke Li, and Kejie Chen.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00208/full#supplementary-material

# REFERENCES



Epstein, C. J. (1971). Evolution by gene duplication. Am. J. Hum. Genet. 23:541.



with potato autopolyploidization. Genetics 176, 2055–2067. doi: 10.1534/ genetics.107.074286


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Zhang, Qin, Peng, Huang, Wang, Cao, Li, Tao, Zhang and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Transcriptome Analysis Identified Genes for Growth and Omega-3/-6 Ratio in Saline Tilapia

Grace Lin<sup>1,2</sup>, Natascha M. Thevasagayam<sup>1</sup>, Z. Y. Wan<sup>1,2</sup>, B. Q. Ye<sup>1</sup> and Gen Hua Yue<sup>1,2,3</sup>\*

<sup>1</sup> Temasek Life Sciences Laboratory, National University of Singapore, Singapore, Singapore, <sup>2</sup> School of Biological Sciences, Nanyang Technological University, Singapore, Singapore, <sup>3</sup> Department of Biological Sciences, National University of Singapore, Singapore, Singapore

Growth and omega-3/-6 ratio are important traits in aquaculture. The mechanisms underlying quick growth and high omega-3/-6 ratio in fish are not fully understood. Tilapia meat suffers from a bad reputation due to its low omega-3/-6 ratio. To facilitate the improvement of these traits and to understand more about the mechanisms underlying quick growth and high omega-3/-6 ratio, we conducted transcriptome analysis in the muscle and liver of fast- and slow-growing hybrid saline tilapia generated by crossing Mozambique tilapia and red tilapia. A transcriptome with an average contig length of 963 bp was generated by using 486.65 million clean 100 bp paired-end reads. A total of 42,699 annotated unique sequences with an average length of 3.4 kb were obtained. Differentially expressed genes (DEGs) in the muscle and liver were identified between fast- and slow-growing tilapia. Pathway analysis classified these genes into many pathways. Ten genes from the DEGs, including foxK1, sparc, smad3, usp38, crot, fdps, sqlea, cyp7b1, impa1, and gss, were located within QTL for growth and omega-3 content that were previously detected in tilapia, suggesting that these ten genes could be important candidate genes for growth and omega-3 fatty acid content. Analysis of SNPs in introns 1 and 2 of foxK1 revealed that the SNPs were significantly associated with growth and omega-3/-6 ratio. This study lays the groundwork for further investigation of the molecular mechanisms underlying the phenotypic variation of these two traits and provides SNPs for selecting these traits at the fingerling stage.

Keywords: tilapia, growth, meat quality, RNA, gene

### Edited by:

Farai Catherine Muchadeyi, Agricultural Research Council of South Africa (ARC-SA), South Africa

### Reviewed by:

Jian Xu, Chinese Academy of Fishery Sciences (CAFS), China Jun Hong Xia, Sun Yat-sen University, China

\*Correspondence:

Gen Hua Yue genhua@tll.org.sg orcid.org/0000-0002-3537-2248

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 30 August 2018 Accepted: 05 March 2019 Published: 20 March 2019

### Citation:

Lin G, Thevasagayam NM, Wan ZY, Ye BQ and Yue GH (2019) Transcriptome Analysis Identified Genes for Growth and Omega-3/-6 Ratio in Saline Tilapia. Front. Genet. 10:244. doi: 10.3389/fgene.2019.00244

# INTRODUCTION

Candidate genes are important for understanding phenotypic variations (Yue, 2014). Candidate gene association studies examine whether the genetic variations in a candidate gene are significantly associated with important traits. Some candidate genes were found to be associated with important traits in aquaculture species (Han et al., 2017; Wang Q. et al., 2017; Wang Y. et al., 2017; Wei et al., 2018). However, the success of candidate gene association studies is heavily dependent on assumptions underlying the selection of genes to be studied, and the number of candidate genes was limited due to the lack of genomic resources in most aquaculture species before 2008 (Yue, 2014). Recently, due to the rapid development of next-generation sequencing technologies (Schuster, 2007), transcriptome analysis (Wang et al., 2009) has become a powerful tool for identifying candidate genes related to important traits (Qian et al., 2014). RNA-seq is one of the important tools for analysis of transcriptomes (Wang et al., 2009). In fishes, through RNA-seq, many candidate genes have been identified for some important traits (Qian et al., 2014). QTL mapping is another way to identify genes related to important traits (Collins, 1995; Yue, 2014). By combining transcriptome analysis and QTL mapping, it is possible to identify causative genes underlying important traits (Yue, 2014; Yue and Wang, 2017). Knowledge of the relationship between genotypes and phenotypes will enhance our understanding of phenotypic variations and help identify DNA markers for marker-assisted selection (MAS) to accelerate genetic improvement (Liu, 2008; Yue, 2014).

Tilapia is a common name for over 70 species of fishes that belong to the tilapiine tribe, in the family Cichlidae (Webster and Lim, 2006). Tilapia has become the second most important group of farmed fish in the world with a production of more than 5.3 million tons in 2014 (Food and Agriculture Organization, 2014). The main cultured species globally is the Nile tilapia (Oreochromis niloticus). For Nile tilapia, many genetic resources, including genome sequences (Brawand et al., 2014; Xia et al., 2015; Wan et al., 2016) and linkage maps (Kocher et al., 1998; Liu et al., 2013), have been generated in the past 30 years. These resources have been used to map QTL for important traits (Cnaani et al., 2003; Liu et al., 2014; Palaiokostas et al., 2015; Gu et al., 2018), which paved the way for MAS to accelerate genetic improvement. However, the limited sources of freshwater in the world restrict the expansion of aquaculture of freshwater Nile tilapia. In addition, the low omega-3 content in Nile tilapia is a major health concern for tilapia consumers (Young, 2009). Mozambique tilapia can adapt to full seawater and produces high omega-3 (Young, 2009), but grows much slower than Nile tilapia (Webster and Lim, 2006). Hybrid tilapia (i.e., saline tilapia) produced by crossing Nile and Mozambique tilapia is able to adapt to full seawater and grows quicker than Mozambique tilapia, but still slower than Nile tilapia (Liu et al., 2013). Therefore, it is essential to increase the growth of saline tilapia to make it economically viable. However, for saline tilapia, genome resources are limited, and not much is known about genes controlling growth and fatty acid contents.

The aim of this study was to identify candidate genes for growth and fatty acid contents in saline tilapia, by RNA-seq in combination with previous QTL mapping (Liu et al., 2013, 2014; Lin et al., 2016, 2018), to accelerate the genetic improvement of these traits, and to understand more about the molecular mechanism underlying the phenotypic variations in growth and fatty acid contents. We generated transcriptomes in the muscle and liver of fast- and slow-growing hybrid tilapia, and identified differentially expressed genes located in QTL for growth and omega-3 content (Liu et al., 2014; Lin et al., 2016, 2018). Our study paves the way for detailed analysis of the functions of these genes to understand more about the mechanism underlying growth and fatty acid synthesis. We cloned the full-length cDNA of the foxK1 gene, identified two SNPs in introns 1 and 2 of the gene and found significant associations between the SNPs and growth and omega-3/-6 ratio. Therefore, the SNPs could be useful in selecting superior tilapia at the fingerling stage.

# MATERIALS AND METHODS

# Tilapia, Tissue Samples, and RNA Extraction

At 110 dph, four fast-growing and four slow-growing individuals were randomly selected from the F<sub>2</sub> family of tilapia, which comprised 522 individuals and was used previously for QTL mapping for growth and omega-3 contents (Lin et al., 2018). Briefly, an F<sub>1</sub> family was generated by crossing one fast-growing red tilapia female and one slow-growing Mozambique tilapia male. An F<sub>2</sub> family was produced by crossing one F<sub>1</sub> male and one F<sub>1</sub> female, which were randomly selected from the F<sub>1</sub> family. The F<sub>2</sub> family consisted of 522 offspring and was cultured under normal culture conditions as detailed in our previous paper (Lin et al., 2018). All F<sub>2</sub> fish were raised under the same culture conditions and feeding scheme. Fish were dissected, and tissue samples from muscle (M), liver (L), brain (B), gills (G), and intestine (I) were collected, frozen in liquid nitrogen and stored at −80°C for downstream analyses. Total RNA was extracted using the RNeasy® Mini kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The concentration of the RNA was then determined with an Agilent 2100 Bioanalyzer (Agilent, Singapore). A total of 40 total RNA samples (i.e., (4 fast-growing fish + 4 slow-growing fish) × 5 tissues per fish) were generated. For each tissue in each group (fast- and slow-growing), equal amounts of RNA from each of the four fish were pooled for cDNA library construction. Hence, there were 10 pooled samples, named MFast, LFast, BFast, GFast, IFast, MSlow, LSlow, BSlow, GSlow, and ISlow.
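The sampling and pooling design described above can be written out as a short enumeration. This is only an illustrative sketch: the function names are hypothetical and not part of the study, and the pool labels simply concatenate the tissue letter with the growth group.

```python
from itertools import product

# Tissue codes and growth groups as given in the text.
TISSUES = {"M": "muscle", "L": "liver", "B": "brain", "G": "gills", "I": "intestine"}
GROUPS = ("Fast", "Slow")
FISH_PER_GROUP = 4


def pooled_sample_names():
    """One pooled library per tissue and growth group (hypothetical helper)."""
    return [f"{tissue}{group}" for group, tissue in product(GROUPS, TISSUES)]


def total_rna_samples():
    """Individual RNA samples before pooling: fish x tissues."""
    return len(GROUPS) * FISH_PER_GROUP * len(TISSUES)


if __name__ == "__main__":
    print(pooled_sample_names())
    print(total_rna_samples())
```

Each pooled label stands for equal amounts of RNA from the four fish of that group, so individual variation within a group is not visible in the sequencing data.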

# Construction of 10 cDNA Libraries for RNA-Seq

One microgram of total RNA from each pooled sample was used for library construction with the TruSeq™ RNA sample prep kit (Illumina, San Diego, CA, United States). In brief, the total RNA was treated with DNase to remove genomic DNA contamination, and poly(A) mRNA was subsequently enriched using magnetic beads coated with oligo-dT. The enriched mRNA was fragmented and reverse-transcribed with random hexamer primers to produce second-strand cDNA. The second-strand cDNA was purified and ligated with Illumina sequencing indexes and adapters. The cDNA was size-selected and enriched to generate the final sequencing libraries. Sequencing of these libraries was then performed on the Illumina HiSeq™ 2000 (Illumina, San Diego, CA, United States) to produce paired-end (2 × 101 bp) sequence reads. Ten sequencing libraries were constructed.

# Processing of Sequencing Reads and de novo Assembly of Reference Transcriptome

The Illumina short raw reads were processed by NGS QC toolkit (Patel and Jain, 2012) to remove adapters, low-quality reads (Q < 20) and unpaired reads. The filtered high-quality reads were de novo assembled using Trinity (version 20140717) (Grabherr et al., 2011; Haas et al., 2013). As Nile tilapia is closely related to the hybrid tilapia in this study, the de novo assembly was merged with 45,440 annotated Nile tilapia mRNA sequences from NCBI RefSeq database using CAP3 software (Huang and Madan, 1999) to maximize the chance of representing as many transcripts as possible. The merged transcriptome was BLASTN-searched against the Nile tilapia mRNA to assign gene descriptions, by using an e-value cut-off of 1 × 10<sup>−6</sup> and selecting the best hit. The reference transcriptome used for subsequent analyses comprised the set of unique BLASTN-annotated sequences, where a single sequence was selected as a representative for each corresponding Nile tilapia BLASTN match, as well as the remaining unannotated sequences. A BLASTX-search against the NCBI RefSeq protein database was also carried out for the unannotated sequences to supplement the annotated dataset.
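The best-hit selection with an e-value cut-off can be illustrated with a small parser. This is a sketch under stated assumptions: it assumes BLAST results in the standard 12-column tabular layout (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), and it assumes that "best hit" means the highest bit score per query. The text does not specify either detail, and the function name and identifiers are hypothetical.

```python
E_VALUE_CUTOFF = 1e-6  # threshold stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} keeping one hit per query.

    Assumption: hits are ranked by bit score, and hits with an e-value
    above the cutoff are discarded.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Hypothetical contig and reference identifiers for illustration only.
    example = [
        "contig1\trefA\t95.0\t400\t20\t0\t1\t400\t1\t400\t1e-40\t700",
        "contig1\trefB\t99.0\t500\t5\t0\t1\t500\t1\t500\t1e-80\t900",
        "contig2\trefC\t80.0\t60\t12\t0\t1\t60\t1\t60\t0.01\t40",
    ]
    print(best_hits(example))
```

Contigs left without any hit after this filter would correspond to the unannotated sequences that were passed on to the BLASTX search.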

# Identification of Differentially Expressed Transcripts (DETs) in the Muscle and Liver

The filtered reads from the muscle and liver samples (MFast, LFast, MSlow, and LSlow) were aligned to the reference transcriptome using the CLC Genomics Workbench (version 7.0) "Map reads to reference" tool, with thresholds set at 95% length fraction and 95% similarity fraction, and with default costs for mismatches, insertions and deletions. The BAM files of the mapping were then imported into Partek Genomics Suite (version 6.6) for differential gene expression analysis. Following mRNA quantification to obtain read counts and reads per exon kilobase per million mapped reads (RPKM), differential expression analysis was performed for pair-wise tissues using ANOVA. As there were no replicates in this study, the p-value was adapted from the mRNA quantification step where the algorithm estimates a p-value by assuming that all the samples are replicated and that the transcripts are evenly distributed across all samples. The p-value was subsequently corrected using the conservative Bonferroni method (Noble, 2009; Haynes, 2013) available in the software, to decrease the probability of false positives being detected, correcting for family-wise error rates since pooled samples from the same family are used in this study. The differential expression analysis was first narrowed down to transcripts with BLASTN or BLASTX annotations. Secondly, the global RPKM values were examined to filter off very lowly expressed transcripts, with RPKM >1 set as the minimum threshold for expression (**Figure 2**). Annotated transcripts were considered differentially expressed if the corrected p-value <0.001 and fold change ≥2.
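The normalization and filtering thresholds described above can be sketched in a few lines. This is not the Partek implementation: the Bonferroni step is the generic "multiply by the number of tests" correction, the function names are hypothetical, and the expression filter assumes that a transcript passes when the higher of the two RPKM values exceeds 1, which the text does not state explicitly. The fold-change criterion is applied here in both directions (up or down).

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def bonferroni(p_values):
    """Generic Bonferroni correction: p * number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def is_differentially_expressed(rpkm_fast, rpkm_slow, corrected_p,
                                min_rpkm=1.0, max_p=0.001, min_fold=2.0):
    """Apply the three thresholds from the text to one annotated transcript."""
    high = max(rpkm_fast, rpkm_slow)
    low = min(rpkm_fast, rpkm_slow)
    if high <= min_rpkm:       # assumption: filter on the higher value
        return False
    if corrected_p >= max_p:   # corrected p-value must be < 0.001
        return False
    fold = float("inf") if low == 0 else high / low
    return fold >= min_fold    # fold change must be >= 2


if __name__ == "__main__":
    value = rpkm(read_count=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
    print(value)
    print(is_differentially_expressed(10.0, 4.0, 1e-5))
```

Because the pooled design has no biological replicates, the p-values entering such a filter depend entirely on the assumptions of the quantification step, so the thresholds are best read as a ranking device rather than a formal test.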

# Functional Classification and Pathway Analysis of DETs

To understand the possible roles of the differentially expressed transcripts (DETs) in conferring the body size of the fish and the lipid content of the meat, the annotation of gene ontology (GO) terms was carried out using Blast2GO (Götz et al., 2008) based on the GO annotations of the corresponding homologs in the NCBI RefSeq database, allowing for an overview of the functional classes. These DETs were then mapped onto KEGG pathways using KAAS webserver (Moriya et al., 2007) to allow us to have an overview of the network of pathways that were involved in conferring the differing body size and lipid content in meat between the fast and slow growing fishes. From the assigned KEGG pathways, 18 and 21 pathways of interest were selected for growth and lipid biosynthesis, respectively.

# qRT-PCR for the Validation of the Expressions of DETs Identified With RNA-Seq

Reverse transcription was carried out using iScript™ reverse transcription supermix (Bio-Rad Laboratories, Hercules, CA, United States). The single-strand cDNA was diluted five times and kept at −20°C for further analyses. A total of 21 primer pairs (17 up-regulated and 4 down-regulated) (**Supplementary Table S1**) were designed using Primer-BLAST software (Ye et al., 2012) for both muscle and liver transcripts that were differentially expressed between the fast- and slow-growing fish. Quantitative PCR (qPCR) was carried out on a BioRad iQ5 (Bio-Rad Laboratories, Hercules, United States) using SYBR® Green as the fluorescent dye. qPCR reactions were carried out in triplicate for each tissue of four individuals, each 10 µL reaction volume containing 2 µL (10 ng) of 10 times diluted sscDNA, 0.2 µL of each primer at a concentration of 10 µM, 5 µL of 2× master mix from KAPA SYBR® FAST qPCR Kits (Life Technologies, Carlsbad, CA, United States) and 2.6 µL of sterile water. Raw data were converted to cycle threshold (Ct) values using the software provided with the BioRad iQ5 (Bio-Rad Laboratories, Hercules, United States). Relative gene expression was quantified using the 2<sup>−ΔΔCt</sup> method (Livak and Schmittgen, 2001), with beta actin as the housekeeping gene for normalization. A paired t-test was used to calculate p-values for comparing the differential expression of genes between fast- and slow-growing fish. The fold change of each gene was the ratio of expression in fast-growing fish to that in slow-growing fish, and all results were compared with the RNA sequencing fold change data. The correlation between the RNA-seq and qRT-PCR data was calculated using Microsoft Excel.
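The relative quantification of Livak and Schmittgen (2001) can be illustrated as follows. This is a minimal sketch: the function name and the Ct values are hypothetical, technical triplicates are averaged before the calculation, and the slow-growing group is used as the calibrator so that the result is a fast/slow fold change.

```python
from statistics import mean


def fold_change_ddct(ct_target_fast, ct_ref_fast, ct_target_slow, ct_ref_slow):
    """Fold change (fast / slow) by the 2^(-ddCt) method.

    Each argument is a list of technical-replicate Ct values; the reference
    gene is the housekeeping gene used for normalization.
    """
    d_ct_fast = mean(ct_target_fast) - mean(ct_ref_fast)  # dCt, fast group
    d_ct_slow = mean(ct_target_slow) - mean(ct_ref_slow)  # dCt, calibrator
    dd_ct = d_ct_fast - d_ct_slow
    return 2.0 ** (-dd_ct)


if __name__ == "__main__":
    # Hypothetical Ct triplicates, not data from the study.
    print(fold_change_ddct(
        ct_target_fast=[22.0, 22.0, 22.0],
        ct_ref_fast=[18.0, 18.0, 18.0],
        ct_target_slow=[24.0, 24.0, 24.0],
        ct_ref_slow=[18.0, 18.0, 18.0],
    ))
```

A value above 1 indicates higher expression in the fast-growing group, and the method assumes roughly equal amplification efficiency for the target and reference genes.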

# Mapping DETs in Previously Mapped QTL for Growth and Omega-3 Content

From the QTL identified for growth and fatty acid contents, focus was placed on three growth traits, body weight (BW), total length (TL) and body thickness (BT), and on omega-3 fatty acid contents (Lin et al., 2018). To locate the genomic regions of the flanking markers where these QTL were mapped, the microsatellite or SNP marker sequences were obtained from the NCBI database and from the NGS data generated during SNP marker discovery. The genomic locations of the markers and DEGs were identified using the BLASTN program against the Nile tilapia genome (Brawand et al., 2014). DEGs that lie within or in close proximity to the identified QTL regions were selected for further study.
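Selecting DEGs that lie within or near a QTL interval amounts to a simple interval-distance check once both have genomic coordinates. The sketch below assumes coordinates on a shared reference and uses hypothetical names, positions and a hypothetical distance threshold; the study does not report a fixed cut-off.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    linkage_group: str
    start: int  # bp, inclusive
    end: int    # bp, inclusive


def distance_bp(gene: Interval, qtl: Interval) -> Optional[int]:
    """Gap in bp between a gene and a QTL interval; 0 if they overlap.

    Returns None when the two features are on different linkage groups.
    """
    if gene.linkage_group != qtl.linkage_group:
        return None
    if gene.end < qtl.start:
        return qtl.start - gene.end
    if gene.start > qtl.end:
        return gene.start - qtl.end
    return 0


def is_candidate(gene: Interval, qtl: Interval, max_distance_bp: int) -> bool:
    """True if the gene is inside the QTL or within max_distance_bp of it."""
    gap = distance_bp(gene, qtl)
    return gap is not None and gap <= max_distance_bp


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    qtl = Interval("LG6", 1_000_000, 2_000_000)
    gene = Interval("LG6", 3_000_000, 3_020_000)
    print(distance_bp(gene, qtl), is_candidate(gene, qtl, 1_800_000))
```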

# Identification of SNPs in Introns 1 and 2 of the foxK1 Gene and Analysis of Their Association With Growth and Omega-3/-6 Ratio

The full-length cDNA of the foxK1 gene was derived from GenBank (XM\_019360121). The genomic DNA sequence was obtained by BLAST-searching the cDNA against the whole genome sequence of tilapia in Ensembl Release 91<sup>1</sup>. The exons and introns were identified by aligning the full-length cDNA and genomic DNA sequence using Sequencher v5.0 (GeneCodes, CA, United States). Two pairs of primers (HySNPFoxK1-1FR and HySNPFoxK1-2FR) (**Supplementary Table S1**) were designed for PCR to identify SNPs in the 5′ UTR-intron 1 and exon 2-intron 2 of the foxK1 gene. The PCR reactions consisted of the following components: 10 ng of genomic DNA, 1× PCR buffer with 1.5 mM MgCl<sub>2</sub>, 0.2 µM of each primer, 50 µM of each dNTP and 1 unit of Taq DNA Polymerase (Fermentas, PA, United States). PCR was conducted for each individual with the following program: an initial denaturation at 95°C, followed by 34 cycles of 94°C for 30 s, annealing at 55–58°C for 30 s and 72°C for 30 s, and a final extension at 72°C for 5 min. After PCR, the products were checked on 2% agarose gels. Single-band PCR products were used for Sanger sequencing with either a forward or reverse primer and the BigDye® Terminator Sequencing Kit (Applied Biosystems, Foster City, CA, United States). The sequencing PCR products were then sequenced using an ABI 3730xl DNA Analyzer (Applied Biosystems, Foster City, CA, United States) following the standard protocol of BigDye sequencing. After sequencing, SNPs were identified using the software Sequencher (GeneCodes, MA, United States). The genotypes at each SNP locus were exported to an Excel file for further analysis of their associations with the traits.

The mapping family, containing 522 F<sub>2</sub> individuals and used for QTL mapping (Lin et al., 2018), was used for the association studies here. The associations between SNPs and traits were determined using ANOVA available in JMP 8.0 software (SAS, NC, United States).
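The association test described above compares trait values among genotype classes at a SNP. The following sketch computes the one-way ANOVA F statistic from first principles; it is not the JMP procedure, the function name and the genotype and trait values are hypothetical, and a p-value would additionally require the F distribution (for example via a statistics package).

```python
from statistics import mean


def one_way_anova_f(trait_by_genotype):
    """One-way ANOVA F statistic for trait values grouped by genotype.

    trait_by_genotype maps a genotype label (e.g. "AA") to a list of trait
    values. Returns (F, df_between, df_within).
    """
    groups = [values for values in trait_by_genotype.values() if values]
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)

    # Variation explained by genotype class means.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation within genotype classes.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical body weights (g) for three genotype classes.
    example = {"AA": [1.0, 2.0, 3.0], "AG": [2.0, 3.0, 4.0], "GG": [5.0, 6.0, 7.0]}
    print(one_way_anova_f(example))
```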

# Analysis of the Expression of the foxK1 Gene Using qRT-PCR

Total RNA from muscle and liver of eight individuals at the age of 110 days post hatch was extracted and reverse transcribed to single-strand cDNA, following the instructions of the cDNA synthesis kit (Sigma, CA, United States). One primer pair (FoxK1-F and FoxK1-R) (**Supplementary Table S1**) was designed using Primer-BLAST software for qPCR. The qPCR was carried out in triplicate. The expression of the foxK1 gene was examined by quantitative real-time PCR (qRT-PCR) on an iQ5 RT-PCR machine (Bio-Rad, CA, United States). The elongation factor 1-alpha (EF1α) gene (see primers in **Supplementary Table S1**) was used as an internal control. PCR amplification was carried out in a total volume of 20 µl containing 1× Maxima™ SYBR Green qPCR Master Mix (Fermentas, PA, United States), 0.25 µM of each primer and 10 ng template cDNA. The PCR program included a single cycle of 10 min at 95°C followed by 40 cycles of 15 s at 95°C, 30 s at 55°C, and 20 s at 72°C. To confirm the specificity of the amplification, a melting-curve analysis was conducted after the completion of the qRT-PCR. The expression level of the foxK1 gene was analyzed using the 2<sup>−ΔΔCT</sup> method (Livak and Schmittgen, 2001).

# RESULTS

# Summary of RNA-Seq

The sequencing of ten libraries from the five tissues of fast- and slow-growing tilapias produced 492,458,402 paired-end (PE) reads of length 100 bp (**Table 1**). After quality-trimming and filtering of low-quality reads, 486,651,674 high-quality PE trimmed reads (98.88%) were used for de novo assembly. In order to obtain a reference transcriptome of saline tilapia, the filtered reads were assembled. Contigs were then merged with 45,440 mRNA sequences of the Nile tilapia available in GenBank. A total of 328,078 contigs with an average length of 963 bp and 42,699 annotated unique sequences (average length of 3.4 kb) were obtained. As the aim of the study was to identify candidate genes for growth and fatty acid contents that are associated with muscle growth and lipid biosynthesis in the liver, detailed analyses were conducted using only the muscle and liver transcriptome data.

# Differentially Expressed Transcripts (DETs) Between the Muscle and Liver of Fast- and Slow-Growing Fish

Annotated transcripts were used in the differential expression analysis. Following a RPKM cutoff of >1, and thresholds for differential expression set as Bonferroni-corrected p-value <0.001 and fold change ≥2, 2,236 transcripts in muscle and 3,020 in liver were differentially expressed between the fast and slow-growing fish. 427 and 1,809 transcripts were, respectively, up and downregulated in the muscle, whereas the corresponding numbers for the liver were 1,808 and 1,212 transcripts (**Supplementary Table S2**). Our data analyses identified 14 DETs that overlapped between muscle and liver datasets with either expressions that agree in both tissues (i.e., both up/down regulated) or contrasting expression (i.e., up regulated in muscle and down regulated in liver) (**Supplementary Table S2**).
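The comparison of DETs shared between the two tissues can be expressed with set operations. In this sketch each tissue is represented by a mapping from transcript identifier to log2 fold change (fast/slow), where a positive value means up-regulation in fast-growing fish; the identifiers, values and function name are hypothetical.

```python
def compare_tissues(muscle_dets, liver_dets):
    """Split transcripts shared by both tissues into concordant and contrasting.

    Each argument maps transcript id -> log2 fold change (fast / slow).
    Concordant: same direction in both tissues. Contrasting: opposite.
    """
    shared = muscle_dets.keys() & liver_dets.keys()
    concordant = {t for t in shared
                  if (muscle_dets[t] > 0) == (liver_dets[t] > 0)}
    contrasting = shared - concordant
    return concordant, contrasting


if __name__ == "__main__":
    # Hypothetical transcripts for illustration only.
    muscle = {"t1": 1.5, "t2": -2.0, "t3": 3.0}
    liver = {"t1": 2.0, "t2": 1.2, "t4": -1.0}
    print(compare_tissues(muscle, liver))
```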

# Functional Classification of DETs

A total of 957 (42.7%) and 1726 (57.2%) DETs were assigned 2088 and 3374 GO terms in the muscle and liver, respectively, covering all three domains: biological process (BP), molecular function (MF) and cellular component (CC). In most of the cases, one transcript was assigned to many terms and these sequences were further characterized into primary subcategories.

In the muscle, the GO terms with the most assigned genes, in the domain of CC, were cell parts (GO:0044464) with 60 transcripts, intracellular (GO:0005622) with 40 transcripts and intracellular part (GO:0044424) with 34 transcripts. For BP, 40 transcripts were assigned to primary metabolic process (GO:0044238), 39 to cellular metabolic process (GO:0044237), and 34 to macromolecule metabolic process (GO:0043170) and lastly, for MF, 18 transcripts were involved in nucleotide binding (GO:0000166), 16 in ion binding (GO:0043167) and 15 in hydrolase activity (GO:0016787) (**Figure 1**). In the liver, the most assigned GO terms in the CC domain were cell parts (GO:0044464) with 56 transcripts, intracellular (GO:0005622) with 36 transcripts and membrane (GO:0016020) with 33 transcripts. For BP, 40 transcripts were assigned to primary metabolic process (GO:0044238), 39 to cellular metabolic process (GO:0044237), and 29 to macromolecule metabolic process (GO:0043170) and lastly, for MF, 19 transcripts were involved in nucleotide binding (GO:0000166), 16 in hydrolase activity (GO:0016787) and 15 in ion binding (GO:0043167) (**Figure 1**).

<sup>1</sup>https://asia.ensembl.org/index.html

<sup>∗</sup>Tissues include dorsal muscle (M) and Liver (L); Subscript Fast – fast growing fish and Slow – slow growing fish.

A total of 844 and 1234 DETs were mapped onto KEGG in the muscle and liver, respectively. In each of the two tissues, the annotation to KEGG via KAAS revealed diverse, but related top five subclass pathways: "metabolic pathways," "endocytosis," "ubiquitin mediated proteolysis," "PI3K-Akt signaling pathway" and "protein processing in the endoplasmic reticulum" in the muscle and, "metabolic pathways," "biosynthesis of secondary metabolites," "biosynthesis of antibiotics," "microbial metabolism in diverse environments," and "endocytosis" in the liver (**Figure 2**).

# Validation of DEGs Using qRT-PCR

For validation of the expression profiles of differentially expressed genes (DEGs) in the RNA-sequencing data, primers for 21 DEGs were used for qRT-PCR. The fold changes (ratio of RPKM in fast-growing fish to RPKM in slow-growing fish) in the RNA-seq and qPCR data were significantly correlated (r = 0.885, P < 0.0001) across the 21 genes tested, indicating that the expression profiling of the DEGs by RNA-seq is reliable and accurate.
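As an illustration of the comparison above, the following sketch computes a Pearson correlation coefficient between two lists of fold changes. The function name and the example values are hypothetical and are not the study's data. Whether the fold changes were log-transformed before correlation is not stated in the text, so the sketch works on whatever values are passed in.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Made-up fold changes for five genes (illustration only).
rnaseq_fc = [2.1, 0.4, 3.5, 0.3, 5.0]
qpcr_fc = [1.8, 0.6, 2.9, 0.5, 4.1]
print(round(pearson_r(rnaseq_fc, qpcr_fc), 3))
```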

# Locating DEGs in Muscle and Liver to Previously Mapped QTL for Growth and Fatty Acids

We focused on the DEGs located in QTL for growth and omega-3 content identified in a previous study (Lin et al., 2018), to facilitate the identification of candidate genes and polymorphisms that underlie the phenotypic variation in growth and fatty acid content, especially omega-3 fatty acids, in the population. Ten growth and/or lipid biosynthesis related DEGs were located in reasonably close proximity to the identified QTL for growth (77 kb–1.8 Mb) and omega-3 fatty acid content (<400 kb) (**Table 2**). For growth-related QTL, four genes were located within or close to the QTL regions on LG 2, 6 and 7. One of the four genes, forkhead box K1 (foxK1), was downregulated in fast-growing fish and was located approximately 1 Mb downstream of the QTL for both BW and TL on LG 6, while ubiquitin specific peptidase 38 (usp38) was located inside the detected QTL for BT on the same LG and was also downregulated in fast-growing fish. Two other genes, secreted protein acidic and cysteine rich (sparc) and smad family member 3 (smad3), which were up- and downregulated in fast-growing fish muscle, respectively, were located approximately 77 and 300 kb downstream of QTL identified for BW and TL, respectively. For omega-3 fatty acid content QTL, three, two and one genes were located within or in close proximity to the QTL regions identified on LG 11, 18, and 20, respectively. On LG 11, for the QTL associated with EPA, two genes, peroxisomal carnitine O-octanoyltransferase-like (crot), located within the QTL region, and farnesyl pyrophosphate synthase-like (fdps), located 600 kb upstream of the QTL region, were up- and downregulated in fast-growing fish liver, respectively. For the QTL associated with DHA on LG 11, one gene, squalene epoxidase (sqlea), located approximately 14 kb upstream of the QTL region, was downregulated in fast-growing fish. On LG 18, for the QTL associated with ALA and EPA, two genes, cytochrome P450-family 7-subfamily B-polypeptide 1-like (cyp7b1) and inositol monophosphatase 1-like (impa1), were located within and 140 kb upstream of the QTL region, respectively, and were down- and upregulated in fast-growing fish liver, respectively. On LG 20, also for a QTL associated with ALA and EPA, one gene, glutathione synthetase-like (gss), was located approximately 23 kb upstream of the QTL region and was downregulated in fast-growing fish liver.

# Expressions of 10 Candidate Genes in the Muscle and Liver of Fast- and Slow-Growing Saline Tilapia

The expression levels (**Table 3**) of 10 candidate genes (sparc, foxK1, smad3, usp38, crot, fdps, sqlea, cyp7b1, impa1, and gss) located in previously mapped QTL for growth or omega-3 content were examined in the muscle and liver of fast- and slow-growing saline tilapia using qRT-PCR. All 10 genes showed differential expression in at least one tissue between fast- and slow-growing tilapia (**Table 3**). For example, the expression level of the foxK1 gene in the liver was much higher in fast-growing than in slow-growing tilapia (16.60 ± 1.16 vs. 1.65 ± 0.48, P < 0.001), while the expression level in muscle was higher in slow-growing than in fast-growing tilapia (5.07 ± 0.69 vs. 1.51 ± 0.15, P < 0.05).

# SNPs in the foxK1 Gene and Their Associations With Growth and Omega-3 Content

The full-length cDNA sequence of the foxK1 gene was 4444 bp, including an ORF of 2019 bp encoding 672 amino acids, a 3′ UTR of 1907 bp and a 5′ UTR of 518 bp. The genomic sequence of the gene was 21699 bp, containing nine exons and eight introns. The two SNPs (SNP1 and SNP2, see **Supplementary Figure 1**) were located in introns 1 and 2 of the gene and were inherited together; therefore, analyzing the association of one SNP suffices. An ANOVA showed significant associations between the SNP and body weight, as well as omega-3/-6 ratio (**Table 4**). The fish with genotype GG grew much faster than those with genotype GC (body weight at 6 months post hatch: 304.0 ± 69.9 g vs. 250.6 ± 66.9 g, P < 0.0001). Similarly, the fish with genotypes CC and GG showed a higher omega-3/-6 ratio than those with genotype GC (omega-3/-6 ratio: 1.58 ± 0.16, 1.68 ± 0.27 vs. 1.48 ± 0.14, P < 0.0001).
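The reported cDNA structure can be checked with simple arithmetic. The sketch below is only a consistency check under the assumption that the stated ORF length includes the stop codon; the function name is hypothetical.

```python
def encoded_amino_acids(total_bp, utr5_bp, orf_bp, utr3_bp):
    """Check that UTRs and ORF add up to the cDNA length and return
    the number of encoded amino acids (stop codon excluded)."""
    if utr5_bp + orf_bp + utr3_bp != total_bp:
        raise ValueError("5' UTR + ORF + 3' UTR does not equal cDNA length")
    if orf_bp % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    # Assumption: the ORF length counts the stop codon, which encodes no residue.
    return orf_bp // 3 - 1


# Values reported for foxK1: 518 + 2019 + 1907 = 4444 bp.
print(encoded_amino_acids(4444, 518, 2019, 1907))  # 672
```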

# DISCUSSION

In this study, we combined RNA-seq with a previous QTL mapping study (Lin et al., 2018) to identify candidate genes and improve our understanding of the molecular mechanisms underlying growth and lipid biosynthesis in saline tilapia. A reference transcriptome of the hybrid saline tilapia was generated. The number of transcripts (262,282) and average contig size (846 bp) were fairly close to those of the Asian sea bass genome (267,616 contigs, 979 bp) assembled using multiple platforms (Thevasagayam et al., 2015) and those of blunt snout bream (253,439 contigs, 998 bp) (Tran et al., 2015). On the other hand, the number of contigs was much higher than in the Mozambique tilapia (196,178) (Böhne et al., 2014) and the red tilapia (160,762) (Zhu et al., 2016), while the average contig length of 846 bp was higher than that of the Mozambique tilapia (645 bp) (Böhne et al., 2014) and lower than that of the red tilapia (1120.61 bp) (Zhu et al., 2016). These differences among assemblies may be due to differences in the species/strains used, sequencing platforms, assembly algorithms and reference transcriptomes.

Global gene expression and variation in transcription activity occur across the genome, and the number and level of transcript isoforms are not always known (Costa et al., 2010; Ozsolak and Milos, 2011; Pelechano et al., 2013). Therefore, in our study, annotation was carried out against nucleotide and protein databases (BLASTN and BLASTX), bringing the number of annotated transcripts to 42,699, which is more than the total number of mRNA transcripts in the NCBI database for both Nile and Mozambique tilapia. This suggests that our reference transcriptome is more comprehensive. It is also possible that the hybridization between two species generated some new genes (Liu et al., 2016).

For identifying differentially expressed genes (DEGs) by RNA-seq, different statistical thresholds can be used. A stricter threshold may miss some DEGs, whereas a more lenient threshold may select too many genes. In this study, we set the fold-change threshold at twofold, which is at the lower end of the range. Therefore, we identified many DETs: 2,236 in the muscle and 3,020 in the liver. Surprisingly, the majority of the transcripts were down regulated in both tissues in the fast-growing fish. About 42.7 and 57.2% of the differentially expressed transcripts in the muscle and liver, respectively, were assigned GO terms. Additionally, the DETs identified for both muscle and liver had diverse cellular functions and were mainly in pathways of metabolic processes, suggesting complex crosstalk between the muscle and liver in the regulation of metabolic processes for the growth and development of the fish. The Hippo pathway, which has been reported to promote cell death and differentiation and inhibit cell proliferation (Zhao et al., 2011; Yu and Guan, 2013), showed decreased expression in fast-growing fish in our study. As fishes have indeterminate growth, this finding may be suggestive of a positive transcriptional regulation of cell growth in fast-growing fishes. However, the crosstalk of the Hippo pathway with other signaling pathways warrants further investigation on how its down-regulation may confer superior growth in our fast-growing fishes. Surprisingly, the mTOR pathway, a positive regulator of growth known to promote anabolic processes, including biosynthesis of proteins, lipids, and organelles, and to limit catabolic processes such as autophagy (Laplante and Sabatini, 2009), was generally down-regulated in our fast-growing fish. This is in contrast with findings from studies conducted in rainbow trout, where slow-growing fish had suppressed mTOR signaling (Danzmann et al., 2016). The reason for this discrepancy remains to be further studied.

TABLE 2 | Ten candidate genes for growth and fatty acids which were identified by combining RNA-seq and QTL mapping, in fast- and slow-growing saline tilapia.

<sup>∗</sup>Bonferroni correction was applied and all differentially expressed genes were statistically significant (P < 0.001). <sup>∗</sup>The genomic ranges and QTL effects can be found in our previous papers (Lin et al., 2016, 2018).

For the differentially expressed genes (DEGs) in the liver, pathways associated with fatty acid (FA) content, such as the biosynthesis of unsaturated FA, FA elongation, and alpha linolenic acid and linolenic acid metabolism, were generally upregulated in the fast-growing fish. This contrasts with salinity-challenged tilapia, in which osmotic stress was reported to decrease these FA pathways in order to cope with stress (Xu et al., 2015). In addition, the main catalyst of FA degradation, carnitine O-palmitoyltransferase 2 (cpt2), which was previously found to be affected by dietary changes in Atlantic salmon and predominates in red muscle (Frøyland et al., 1998), was slightly downregulated. To date, there have been few studies on the function of cpt2 in lipid biosynthesis in the liver of fishes; however, accelerated growth has been reported in European sea bass (Santulli and D'Amelio, 1986) and African catfish (Torreele et al., 2007) fed a diet supplemented with carnitine. Genes in the peroxisome proliferator-activated receptor (PPAR) signaling pathway, such as fatty acid desaturase 2 (fads2), retinoid X receptor (rxr) and fatty acid binding proteins (fabp), known to play a major role in lipogenesis and the biosynthesis of fatty acids (Nakamura et al., 2004), were generally upregulated. This could most probably explain the higher fatty acid content in fast-growing saline tilapia in our study, since lipogenesis and biosynthesis of fatty acids are upregulated in these fish. To improve our understanding of the pathways and genes associated with growth and biosynthesis of fatty acids in saline tilapia, further enrichment analyses using GSEA (Subramanian et al., 2005) and DAVID (Huang et al., 2008) should be conducted using the well annotated and assembled tilapia genome.

TABLE 3 | Average expression levels and standard deviations of 10 candidate genes detected by qRT-PCR.

These 10 genes were identified by combining RNA-seq and QTL mapping, in fast- and slow-growing saline tilapia. One-way ANOVA was conducted on differential gene expression between fast- and slow-growing fishes. Significant differences in gene expression between the fast- and slow-growing fishes (p-value <0.05) are in bold.

TABLE 4 | Associations between SNPs in the FoxK1 gene and traits (mean ± SD) in saline tilapia.

<sup>a</sup>P < 0.0001; <sup>b</sup>P = 0.376 > 0.05; <sup>c</sup>P = 0.017 < 0.05; and <sup>d</sup>P < 0.0001.

In this study, ten DEGs from the muscle and liver were located inside or within close proximity of QTL regions for growth and fatty acid content identified in our previous studies (Liu et al., 2014; Lin et al., 2016, 2018). The 10 candidate genes in QTL are sparc, foxK1, smad3, usp38, crot, fdps, sqlea, cyp7b1, impa1, and gss. Although there is some information about their functions in model organisms and humans, little is known about their roles in growth and fatty acid synthesis and storage in food fish. Therefore, it is essential to study their functions in food fish. We selected the foxK1 gene from the 10 candidate genes to study some of its functions and found that it was significantly differentially expressed in both muscle and liver. Previous studies in mice reported that foxK1 promotes cell proliferation through (i) translational repression of forkhead box protein O4 (foxo4); (ii) inhibition of myocyte differentiation through myocyte enhancer factor 2 (mef2) (Shi et al., 2012); (iii) repression of autophagy of muscle cells through the restriction of acetylation of histone H4; and (iv) expression of critical autophagy genes (Bowman et al., 2014). It is interesting to note that the expression of foxK1-alpha agreed with the differential gene expression in the muscle between fast- and slow-growing fish in this study. Our association study showed that the SNPs in the gene were significantly associated with growth and omega-3/-6 ratio, suggesting that the SNPs may be useful for selecting, at the fingerling stage, fish that grow fast and produce a higher omega-3/-6 ratio. However, it is not known whether the association is due to the function of the gene and/or a linked QTL for these traits. Further studies on the role of foxK1 in lipid biosynthesis in the liver of fishes would be interesting, since it was upregulated in the liver of fast-growing fish in our study. CRISPR-Cas9 technology, a novel technology for genome editing (Sander and Joung, 2014), could be used to knock out these genes to investigate their roles in conferring phenotypic variation in the saline tilapia.

# CONCLUSION

By combining the analysis of transcriptomes in the liver and muscle of fast- and slow-growing saline tilapia conducted here with previous studies on QTL mapping for growth and omega-3 content (Liu et al., 2014; Lin et al., 2016, 2018), we identified 10 candidate genes for these traits. An association study confirmed that two SNPs in intron 2 of one of these candidate genes, foxK1, were significantly associated with growth and omega-3/-6 ratio. The SNPs associated with these traits may be useful in MAS to accelerate the genetic improvement of these traits. The transcriptomes and the 10 candidate genes provide an important resource for further understanding the molecular mechanisms underlying phenotypic variation.

# DATA AVAILABILITY


The raw reads generated for this study have been deposited in BioProject Accession: PRJDB7318. The accession numbers for the RNA-seq data are SAMD00137288 (Big Brains), SAMD00137287 (Big Gills), SAMD00137286 (Big Intestine), SAMD00137285 (Big Liver), SAMD00137284 (Big Muscle), SAMD00137283 (Small Brains), SAMD00137282 (Small Gills), SAMD00137281 (Small Intestine), SAMD00137280 (Small Liver), and SAMD00137279 (Small Muscle).

# ETHICS STATEMENT

All handling of tilapia was conducted in accordance with the guidelines on the care and use of animals for scientific purposes set up by the Institutional Animal Care and Use Committee (IACUC) of the Temasek Life Sciences Laboratory, Singapore. The IACUC has approved this study within the project "Breeding of Tilapia" [approval number TLL (F)-12-004].

# AUTHOR CONTRIBUTIONS

GY initiated and coordinated the research project for GL. GL conceived and conducted the analysis. GL and GY drafted and finalized the manuscript. NT, ZW, and BY assisted with experiments, data analysis, and manuscript preparation.

# FUNDING

This research was supported by the National Research Foundation, Prime Minister's Office, Singapore, under its Competitive Research Program (CRP Award No. NRF-CRP7-2010-01) and the internal fund of Temasek Life Sciences Laboratory, Singapore.

# ACKNOWLEDGMENTS

We thank Professors L. Orban, VCL Lin, and M. Featherstone for co-supervising the Ph.D. study of GL.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00244/full#supplementary-material

FIGURE S1 | Two SNPs (SNP1 and SNP2) identified in FoxK1 gene of saline tilapia.

TABLE S1 | Primers used in the current study.

TABLE S2 | Differentially expressed transcripts in the muscle and liver, respectively, between fast- and slow-growing tilapia.

# REFERENCES

activities in Atlantic salmon (Salmo salar). Lipids 33, 923–930. doi: 10.1007/s11745-998-0289-4


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer JHX declared a past co-authorship with one of the authors GY to the handling Editor.

Copyright © 2019 Lin, Thevasagayam, Wan, Ye and Yue. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Construction of a High-Density Linkage Map and QTL Fine Mapping for Growth- and Sex-Related Traits in Channel Catfish (Ictalurus punctatus)

Shiyong Zhang<sup>1,2,3†</sup>, Xinhui Zhang<sup>4†</sup>, Xiaohui Chen<sup>2,3</sup>\*, Tengfei Xu<sup>4</sup>, Minghua Wang<sup>2,3</sup>, Qin Qin<sup>2,3</sup>, Liqiang Zhong<sup>2,3</sup>, Hucheng Jiang<sup>2,3</sup>, Xiaohua Zhu<sup>2</sup>, Hongyan Liu<sup>2</sup>, Junjie Shao<sup>2</sup>, Zhifei Zhu<sup>5</sup>, Qiong Shi<sup>1,4</sup>\*, Wenji Bian<sup>2,3</sup>\* and Xinxin You<sup>1,4</sup>\*

### Edited by:

Peng Xu, Xiamen University, China

### Reviewed by:

Yun Li, Ocean University of China, China

Costas S. Tsigenopoulos, Hellenic Centre for Marine Research (HCMR), Greece

### \*Correspondence:

Xiaohui Chen, cxiaohui416@hotmail.com; Qiong Shi, shiqiong@genomics.cn; Wenji Bian, js6060@sina.com; Xinxin You, youxinxin@genomics.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 20 November 2018; Accepted: 06 March 2019; Published: 26 March 2019

### Citation:

Zhang S, Zhang X, Chen X, Xu T, Wang M, Qin Q, Zhong L, Jiang H, Zhu X, Liu H, Shao J, Zhu Z, Shi Q, Bian W and You X (2019) Construction of a High-Density Linkage Map and QTL Fine Mapping for Growth- and Sex-Related Traits in Channel Catfish (Ictalurus punctatus). Front. Genet. 10:251. doi: 10.3389/fgene.2019.00251

<sup>1</sup> BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, China, <sup>2</sup> National Genetic Breeding Center of Channel Catfish, Freshwater Fisheries Research Institute of Jiangsu Province, Nanjing, China, <sup>3</sup> The Jiangsu Provincial Platform for Conservation and Utilization of Agricultural Germplasm, Nanjing, China, <sup>4</sup> Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, Beijing Genomics Institute, Shenzhen, China, <sup>5</sup> BGI-Zhenjiang Institute of Hydrobiology, Zhenjiang, China

A high-density genetic linkage map is of particular importance for the fine mapping of important economic traits and for whole genome assembly in aquaculture species. The channel catfish (Ictalurus punctatus), a species native to North America, is one of the most important commercial freshwater fish in the world. Outside of the United States, China has become the major producer and consumer of channel catfish after experiencing rapid development in the past three decades. In this study, based on restriction site associated DNA sequencing (RAD-seq), a high-density genetic linkage map of channel catfish was constructed by using single nucleotide polymorphisms (SNPs) in an F<sub>1</sub> family composed of 156 offspring and their two parents. A total of 4,768 SNPs were assigned to 29 linkage groups (LGs), and the length of the linkage map reached 2,480.25 centiMorgans (cM) with an average distance of 0.55 cM between loci. Based on this genetic linkage map, 223 genomic scaffolds were anchored to the 29 LGs of channel catfish, and a total length of 704.66 Mb was assembled. Quantitative trait locus (QTL) mapping and genome-wide association analysis identified 10 QTLs for sex-related traits and six QTLs for growth-related traits at LG17 and LG28, respectively. Candidate genes associated with sex dimorphism, including spata2, spata5, sf3, zbtb38, and fox, were identified within QTL intervals on LG17. A sex-linked marker with simple sequence repeats (SSR) in the zbtb38 gene on LG17 was validated for practical verification of sex in the channel catfish. Thus, LG17 was considered a sex-related LG. Potential growth-related genes were also identified, including important regulators such as megf9, npffr1, and gas1. In summary, we constructed a high-density genetic linkage map and developed a sex-linked marker in channel catfish, which are important genetic resources for future marker-assisted selection (MAS) of this economically important teleost.

Keywords: channel catfish, linkage map, quantitative trait locus (QTL), growth-related genes, sex-related marker

# INTRODUCTION


Genetic-map construction is a critically important tool for further genomic studies, as well as for genetic breeding of economically important aquatic species. It has been employed for genome assembly (Jiao et al., 2014), comparative genome analysis (Xiao et al., 2015; Zhu et al., 2015; Peng et al., 2016), and quantitative trait locus (QTL) identification for important economic traits (Liu F. et al., 2013; Zhang et al., 2018). To construct a genetic linkage map, it is necessary to develop a large number of molecular markers in the examined families. Most of the early genetic linkage maps used amplified fragment length polymorphism (AFLP) and simple sequence repeats (SSR), but these maps had few molecular markers and low density (Zhang et al., 2011), which limited QTL identification and related research.

With the rapid development of next-generation sequencing (NGS), an increasing number of methodologies have been applied for cost-effective development and genotyping of thousands of single nucleotide polymorphisms (SNPs) in nonmodel animals, such as genome resequencing (Lijavetzky et al., 2007), transcriptome sequencing (Triwitayakorn et al., 2011; Xiao et al., 2015), genotyping-by-sequencing (GBS) (Poland et al., 2012), restriction site associated DNA sequencing (RAD-seq), and specific-locus amplified fragment (SLAF) sequencing (Sun X. et al., 2013). At present, the RAD-seq technology is a popular tool for establishment of high-density genetic linkage maps in many aquaculture species, such as Zhikong scallop (Chlamys farreri) (Jiao et al., 2014), mandarin fish (Siniperca chuatsi) (Lijavetzky et al., 2007), tilapia (Oreochromis niloticus L.) (Palaiokostas et al., 2013b), Asian seabass (Lates calcarifer) (Wang L. et al., 2015), and Chinese mitten crab (Eriocheir sinensis) (Cui et al., 2015).

High-quality genetic linkage maps can locate QTLs on corresponding genomes and facilitate marker-assisted selection (MAS) and breeding in many economically important aquaculture species. Sex is one of the most basic characteristics of organisms. Many species of teleost fish have sexually dimorphic growth patterns (Tong and Sun, 2015) with significant growth differences between male and female individuals, such as yellow catfish (Pelteobagrus fulvidraco) (Liu H. et al., 2013), Japanese flounder (Paralichthys olivaceus) (Van Ooijen, 2011), half-smooth tongue sole (Cynoglossus semilaevis) (Song et al., 2012), and Atlantic halibut (Hippoglossus hippoglossus) (Palaiokostas et al., 2013a). Growth is also one of the most important economic traits for aquaculture fish species and was reported to be controlled by multi-gene and environmental effects (Feng et al., 2018), with extensive studies in many aquaculture fish species, such as rainbow trout (Sundin et al., 2005), salmon (Salmo salar) (Norman et al., 2012), and common carp (Cyprinus carpio) (Peng et al., 2016). In addition to QTL for growth and sex, QTL for stress responses, disease resistance, and cold tolerance have been mapped in other fish species (Ozaki et al., 2001; Cnaani et al., 2003; Fuji et al., 2006; Tripathi et al., 2009).

Traditional strategies for genetic improvement of growth-related traits have mainly relied on phenotypic data, which increased the time and cost of breeding. However, MAS using marker-linked QTLs has accelerated genetic improvement with high accuracy of selection. Because genetic linkage maps and QTLs allow the identification of molecular markers or candidate genes associated with traits (Mackay et al., 2009; Feng et al., 2018), they have become important MAS breeding tools in recent years.

As an important aquaculture species, channel catfish (Ictalurus punctatus) is popular worldwide. In its native United States, it accounts for more than 60% of the annual aquaculture production (Liu, 2011). Since its introduction to China in 1984, it has been promoted to many provinces in China, with an annual production of more than 200,000 tons. In the past decades, several linkage map and QTL studies (Waldbieser et al., 2001; Liu et al., 2003; Kucuktas et al., 2009; Li et al., 2015; Zeng et al., 2017) have been carried out to facilitate channel catfish genetic improvement and breeding programs. Previous reports confirmed that male channel catfish generally grow faster than females under the same culturing conditions (Beaver et al., 1966). Therefore, all-male monosex channel catfish have important economic value for aquaculture. A sex-linked marker for American strains of channel catfish was identified and the electrophoretic bands of PCR products were characterized (Ninwichian et al., 2012); however, the DNA variations of this marker need to be further characterized for validation in Chinese strains. Meanwhile, a MAS breeding program for channel catfish is needed to improve the targeted economic traits. In this study, RAD-seq was employed to construct a high-density genetic linkage map, which was useful for subsequent construction of chromosome maps and identification of candidate sex-related and growth-related genes. Our present work confirms that a high-density genetic linkage map can provide a powerful tool for QTL fine mapping and genome-wide association study of economic traits.

# MATERIALS AND METHODS

# Sample Collection and DNA Isolation

An F<sub>1</sub> full-sib family of channel catfish was generated at the National Genetics Breeding Center of Channel Catfish in Nanjing, Jiangsu Province, China, in June 2015. Fertilized eggs were hatched in slow-flowing water (23–27°C) in separate tanks. After approximately 1 week, a random collection of approximately 5,000 larvae was stocked in separate larvae-culture tanks (3.0 m × 1.0 m × 0.5 m). The larvae were fed zooplankton for the first 10 days and formula feed thereafter. After 20 days, a random collection of approximately 1,000 larvae was transferred to an outdoor pond (about 667 square meters) for further culturing. In December 2016, these offspring, at the age of 18 months, were measured for growth-related traits including body height (BH), body length (BL), body weight (BW), and body width (WD), and the sexes of these F<sub>1</sub> individuals were identified at the same time.

After investigating the relationships among the growth-related traits by calculating Pearson correlation coefficients in R 3.3.1 (Baayen, 2008), we randomly selected 156 healthy individuals for sample collection. Fin clips of the parents and muscle tissues of the offspring were collected in absolute ethanol and stored in a −20°C freezer before use. Genomic DNA was extracted using the established phenol-chloroform protocol (Taggart et al., 1992). DNA quality was evaluated with a Qubit Fluorometer (Invitrogen, United States) and by electrophoresis on a 0.6% agarose gel. All experiments were performed in accordance with the Regulations of the Animal Ethics Committee and were approved by the Institutional Review Board on Bioethics and Biosafety of Freshwater Fisheries Research Institute of Jiangsu Province (No. FT 18134).

# Construction and Sequencing of RAD Libraries

DNA from all 158 individuals (156 offspring and two parents) was used to construct RAD libraries, which were prepared following a previously published protocol (Baird et al., 2008). In brief, each enzyme reaction system (30 µL), containing 1 µg of genomic DNA and 15 U of EcoRI (15 U/µL, with the restriction site 5′-G^AATTC-3′) (Thermo Scientific, Waltham, MA, United States), was incubated at 65°C for 10 min. Barcode adapters with a sample-specific nucleotide code were designed following the standard Illumina adapter design flow. A unique barcode adapter (10 µmol) for each DNA sample was added to the individual reaction system. Twelve DNA samples were pooled per tube. To obtain more sequencing data for the two parents, each parent was pooled in triplicate, giving 162 DNA samples for pooling. Fourteen pools were collected and fragments in the size range of 300–500 bp were selected. The fourteen libraries were independently sequenced using 150-bp paired-end sequencing on an Illumina HiSeq X Ten platform (Illumina, San Diego, CA, United States).

# Filtering of Raw Data and Splitting of Barcode Reads

After removal of sequencing adapters, low-quality reads (more than five positions with a quality value less than 20, or more than 3% unknown nucleotides) were filtered out using the SOAPnuke software (Chen et al., 2018). The remaining high-quality reads were used for subsequent analysis. Clean reads from the same library were separated with a Perl script on the basis of their individual barcodes. Reads with wrong barcodes (not matching any expected barcode) were discarded.
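The filtering and barcode-splitting logic described above can be sketched as follows. This is a simplified illustration, not the SOAPnuke implementation or the authors' Perl script. It assumes exact barcode matching, barcodes of equal length at the 5′ end of each read, and per-base quality values already decoded to integers; all function names and example reads are hypothetical.

```python
def passes_quality(seq, quals, min_q=20, max_low_q=5, max_n_frac=0.03):
    """Keep a read unless it has more than five bases with quality < 20
    or more than 3% unknown (N) nucleotides."""
    if not seq:
        return False
    low_q = sum(1 for q in quals if q < min_q)
    n_frac = seq.upper().count("N") / len(seq)
    return low_q <= max_low_q and n_frac <= max_n_frac


def demultiplex(reads, barcode_to_sample):
    """Split (seq, quals) reads by their leading barcode.

    Returns (reads per sample with the barcode trimmed, number of reads
    whose barcode matched no expected barcode).
    """
    # Assumption: every barcode has the same length.
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    discarded = 0
    for seq, quals in reads:
        sample = barcode_to_sample.get(seq[:barcode_len])
        if sample is None:
            discarded += 1
            continue
        by_sample[sample].append((seq[barcode_len:], quals[barcode_len:]))
    return by_sample, discarded


# Hypothetical barcodes and reads for illustration.
barcodes = {"ACGT": "offspring_001", "TTGG": "offspring_002"}
reads = [
    ("ACGTAATTCGGA", [30] * 12),
    ("TTGGAATTCCCA", [30] * 12),
    ("GGGGAATTCTTT", [30] * 12),  # unexpected barcode, discarded
]
clean = [r for r in reads if passes_quality(*r)]
per_sample, n_discarded = demultiplex(clean, barcodes)
print({s: len(v) for s, v in per_sample.items()}, n_discarded)
```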

# SNP Discovery and Genotyping

Genome-wide SNPs were identified using a stringent SNP discovery filtering method within the software SOAPsnp (Li et al., 2009a). To obtain specific SNPs of the parents and the offspring, we employed SOAP2.22 (Li et al., 2009b) to align the high-quality paired-end RAD reads to the previously published channel catfish reference genome (Chen et al., 2016). Based on the SOAP alignment results, we used SOAPsnp v1.05 to call SNPs. For quality control, we applied several criteria to filter SNPs: (1) nucleotide quality more than 20; (2) depth between 3 and 300; (3) removal of reads that mapped to multiple sites; and (4) at least one heterozygous parent. The identified SNP loci were finally separated into three segregation patterns: lm × ll and nn × np (1:1), and hk × hk (1:2:1).
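The filtering criteria and segregation patterns above can be illustrated with a short sketch. It is not SOAPsnp code. It assumes biallelic SNPs, genotypes written as two-letter strings, inclusive depth bounds, and that the first parent is the one coded as lm (or nn); the function names are hypothetical.

```python
# Expected offspring genotype ratios for each segregation pattern.
EXPECTED_RATIO = {"lmxll": (1, 1), "nnxnp": (1, 1), "hkxhk": (1, 2, 1)}


def passes_snp_filters(quality, depth, n_hits,
                       min_q=20, min_depth=3, max_depth=300):
    """Criteria (1)-(3): quality above 20, depth within 3-300 (assumed
    inclusive) and a uniquely mapped position."""
    return quality > min_q and min_depth <= depth <= max_depth and n_hits == 1


def segregation_type(parent1, parent2):
    """Classify a biallelic SNP from parental genotypes such as 'AG', 'AA'.

    Returns 'lmxll', 'nnxnp', 'hkxhk', or None when neither parent is
    heterozygous or the genotypes are not biallelic (criterion 4).
    """
    het1 = parent1[0] != parent1[1]
    het2 = parent2[0] != parent2[1]
    if het1 and het2:
        return "hkxhk" if set(parent1) == set(parent2) else None
    if het1 and set(parent2) <= set(parent1):
        return "lmxll"
    if het2 and set(parent1) <= set(parent2):
        return "nnxnp"
    return None


for p1, p2 in [("AG", "AA"), ("CC", "CT"), ("AG", "GA"), ("AA", "GG")]:
    print(p1, p2, segregation_type(p1, p2))
```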

# Construction of the High-Density Genetic Map

Linkage groups (LGs) were assigned using JoinMap4.1 software (Ooijen, 2006) under the CP algorithm, and Lep-Map (Rastas et al., 2013) was used to construct the genetic map. First, the Lep-Map Filtering module was used to filter out markers by comparing the offspring genotype distribution with the expected Mendelian proportions (segregation distortion test). The default value of the data tolerance (P-value = 0.01) was used to discard markers showing significant segregation distortion (χ² test, P < 0.01). Subsequently, the SeparateChromosomes module was used to assign markers to LGs on the basis of logarithm of odds (LOD) scores for the recombination fraction. LOD scores from 5 to 15, in increments of 1, were tested for linkage grouping. The final LOD score (15) was selected because the number of LGs matched the number of chromosomes of channel catfish and the assembled chromosomes showed a 1:1 synteny relationship with the Liu et al. (2016) map. Finally, the OrderMarkers module was used to order markers within each LG, and the Kosambi mapping function was used to convert recombination frequencies into map distances in centiMorgans (cM) (Kosambi, 2016).
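Two of the steps above are simple enough to sketch: the χ² test for segregation distortion and the Kosambi mapping function, d = 25 ln[(1 + 2r)/(1 − 2r)] cM for a recombination fraction r. The code below is an illustration in Python, not Lep-Map's internal implementation; the function names are hypothetical, and the critical values are the standard χ² values at P = 0.01 for 1 and 2 degrees of freedom.

```python
import math

# Chi-square critical values at P = 0.01 for 1 and 2 degrees of freedom.
CHI2_CRIT_P01 = {1: 6.635, 2: 9.210}

def chi_square(observed, expected_ratio):
    """Pearson chi-square statistic of genotype counts against a Mendelian ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def is_distorted(observed, expected_ratio):
    """True if the marker deviates from the expected ratio at P < 0.01."""
    df = len(observed) - 1
    return chi_square(observed, expected_ratio) > CHI2_CRIT_P01[df]

def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))
```

For example, a 1:1 marker with offspring counts of 100 and 56 (out of 156) would be flagged as distorted, whereas counts of 78 and 78 would not.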

# QTL Mapping for Growth- and Sex-Related Traits

We employed MapQTL6.0 (Van Ooijen, 2011) to perform QTL analyses using the multiple QTL model (MQM). A regression algorithm was used for mapping quantitative trait loci in line crosses. The threshold for QTL significance was determined using a genome-wide permutation test with 200 iterations, and cofactors for MQM analyses were automatically selected with a p-value of 0.02. Significant LOD thresholds were calculated by permutation test of α < 0.05 and n = 1,000 for significant linkages. The software also calculated the phenotypic variation explained by growth- and sex-related QTLs. These markers were mapped on the channel catfish genome (Chen et al., 2016), and upstream and downstream coding genes were identified and subsequently searched against the database of non-redundant protein sequences (Nr, cut-off value of 1e-10) at the National Center for Biotechnology Information (NCBI) using BLASTx to predict the functions of these genes.
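The logic of a genome-wide permutation threshold can be sketched as follows. This is an illustration only: MapQTL computes LOD scores, whereas the sketch uses a toy single-marker statistic (absolute difference in phenotype means between two genotype classes), and all function names and parameters are hypothetical.

```python
import random

def mean_difference(phenotypes, genotypes):
    """Toy single-marker statistic: |mean(genotype 1) - mean(genotype 0)|."""
    g0 = [p for p, g in zip(phenotypes, genotypes) if g == 0]
    g1 = [p for p, g in zip(phenotypes, genotypes) if g == 1]
    if not g0 or not g1:
        return 0.0
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_threshold(phenotypes, markers, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide threshold: the (1 - alpha) quantile of the maximum statistic
    over all markers, computed on phenotype-shuffled data."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype association
        maxima.append(max(mean_difference(shuffled, m) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]
```

Taking the maximum over all markers in each permutation is what makes the threshold genome-wide rather than per-marker.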

# Genome Scaffold Assembly, Synteny Analysis, and Identification of Potential Sex-Related and Growth-Related Genes

Single nucleotide polymorphisms in the genetic linkage map were used to assemble pseudo-chromosomes. To increase the accuracy of the pseudo-chromosome assembly, we required at least two SNPs in each scaffold, selected using custom Perl scripts. Based on genetic distances between SNP markers, we determined the position and orientation of each scaffold and anchored these scaffolds to construct pseudo-chromosomes. To perform the genome synteny analysis, genome sequences of zebrafish (Danio rerio) and channel catfish (Liu et al., 2016) were downloaded from the NCBI. Genome-wide alignments were performed using lastz software (Kurtz et al., 2004), and the best homologous segments were selected using Perl scripts. The final genomic synteny was visualized using the Circos software (Krzywinski et al., 2009).
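The anchoring step can be illustrated with a short Python sketch. The authors' custom Perl scripts are not described in detail, so the logic here is an assumption: a scaffold is oriented by comparing the genetic positions of its first and last SNP in physical order, and scaffolds are ordered by their mean genetic position. The function name and tuple layout are hypothetical.

```python
def anchor_scaffolds(markers):
    """Order and orient scaffolds within one linkage group.

    markers : list of (scaffold, physical_bp, genetic_cm) tuples.
    Returns a list of (scaffold, orientation) ordered along the linkage group,
    where orientation is "+", "-", or "?" when it cannot be inferred.
    Scaffolds with fewer than two SNPs are skipped, as in the text.
    """
    by_scaffold = {}
    for scaffold, bp, cm in markers:
        by_scaffold.setdefault(scaffold, []).append((bp, cm))

    placed = []
    for scaffold, points in by_scaffold.items():
        if len(points) < 2:
            continue
        points.sort()  # sort SNPs by physical position on the scaffold
        first_cm, last_cm = points[0][1], points[-1][1]
        if last_cm > first_cm:
            orientation = "+"
        elif last_cm < first_cm:
            orientation = "-"
        else:
            orientation = "?"
        mean_cm = sum(cm for _, cm in points) / len(points)
        placed.append((mean_cm, scaffold, orientation))

    placed.sort()  # order scaffolds by mean genetic position
    return [(scaffold, orientation) for _, scaffold, orientation in placed]
```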

# Identification, Verification, and Localization of a Sex-Specific Marker

According to a previous report (Ninwichian et al., 2012), we obtained a 192-bp sex-related sequence of channel catfish by Sanger sequencing. The specific primers, SexF (5′-TGAATGTGAGACTAACAGGAG-3′) and SexR (5′-ACATCGCTTTGAGAAGCTGCT-3′), were designed based on flanking sequences of this sex-linked marker using Primer3 (Koressaar and Remm, 2007) software. The forward primer was labeled with the fluorescent dye 5′ 6-FAM by Sangon Biotech Co. Ltd. (Shanghai, China). Subsequently, the designed specific primers were used for PCR amplification in 43 male and 53 female channel catfish individuals from two breeding populations. Each PCR reaction was done in a 20-µL volume containing 1 µL of genomic DNA (30–50 ng), 1 µL of forward primer and reverse primer (1.0 pmol/L), 10 µL 2 × Taq PCR MasterMix [0.1 U Taq polymerase µL<sup>−1</sup>, 5.0 × 10<sup>−4</sup> mol/L of each dNTP, 2.0 × 10<sup>−2</sup> mol/L Tris-HCl (pH 8.3), 0.1 mol/L KCl, 3.0 × 10<sup>−3</sup> mol/L MgCl2; Vazyme, Nanjing, China], and sufficient ddH2O. Touchdown PCR was initiated at 94°C for 30 s. The annealing temperature then decreased from 60°C to 50°C at a rate of 1°C per cycle, followed by 20 cycles of 94°C for 30 s, 55°C for 30 s, and 72°C for 30 s, and a final extension at 72°C for 10 min. For fragment length analysis, PCR products were genotyped on an ABI PRISM 3730XL DNA Sequencer (Applied Biosystems, Foster City, CA, United States) with GS500 marker as an internal size standard. The allelic sizes were determined using GeneMarker version 1.5 (SoftGenetics LLC, State College, PA, United States).

# RESULTS

# Characteristics of the Growth- and Sex-Related Traits

Individuals in the mapping family had an average BH of 6.69 ± 0.78 cm, an average BL of 30.05 ± 2.93 cm, an average BW of 601.76 ± 187.53 g, and an average WD of 5.18 ± 0.59 cm. The growth-related traits were strongly correlated with each other (r = 0.8292–0.9312, P-value < 0.001 for all). The highest correlation (0.9312) was observed between BW and BL (**Table 1**). In the mapping population, 74 and 82 individuals were identified as males and females, respectively, giving a sex ratio of 1:1.11. Growth statistics showed that males had larger BL, BH, WD, and BW than females (**Table 2**), indicating that males grow faster than females under the same culturing conditions.

# Summary of the RAD Libraries

A total of 14 RAD-seq libraries from the two parents and their 156 offspring were sequenced on an Illumina HiSeq X Ten platform, generating about 1.59 billion 150-bp paired-end reads. The raw sequencing data have been deposited in CNSA (CNGB Nucleotide Sequence Archive)<sup>1</sup> under project no. CNP0000229. After filtering of low-quality reads, 1.42 billion clean reads remained. The number of clean reads per offspring individual averaged 8.66 million, while the female and male parental data contained 28.76 and 25.64 million clean reads, respectively.

# SNP Calling and Construction of the High-Resolution Genetic Linkage Map

A total of 1,367,192 SNPs were identified across all individuals using SOAPsnp, of which 10,661 passed the filtering criteria (see section "Materials and Methods" for details). These SNPs were classified into three categories: paternal heterozygous (lm × ll, 5,617 SNPs), maternal heterozygous (nn × np, 4,915 SNPs), and heterozygous in both (hk × hk, 129 SNPs). Among them, 4,768 SNPs were consistent with the Mendelian segregation pattern and were used for subsequent linkage map construction with a pseudo-testcross strategy (Shao et al., 2015).

These SNPs were finally grouped into 29 LGs (**Table 3** and **Figure 1**), which is consistent with the reported haploid chromosome number of the channel catfish (Liu et al., 2016). The genetic linkage map spanned 2,480.25 cM with an average SNP interval of 0.55 cM. The genetic length of each LG ranged from 28.71 (LG27) to 141.1 cM (LG22), with an average SNP distance of 0.21–1.01 cM (**Table 3** and **Figure 1**).

<sup>1</sup>https://db.cngb.org/cnsa/

TABLE 1 | Pearson correlation coefficients for all pairwise combinations of the four examined traits.


TABLE 2 | Significant growth difference between female and male channel catfish.


<sup>∗</sup>Significant difference; ∗∗very significantly different.

TABLE 3 | Characteristics of genetic linkage map and anchoring scaffolds of channel catfish.


The anchored genes and scaffolds are from the genome assembly (Chen et al., 2016).

# Fine QTL Mapping for Growth- and Sex-Related Traits

In this study, four growth-related traits (BW, BL, BH, and WD) were measured. Six QTLs associated with growth-related traits were identified on LG28, all within the narrow span of 32.53–45.29 cM (**Table 4**). Among these QTLs, the highest LOD value of 4.29 (**Figure 2A**) was located at 37.46 cM near the marker Scaffold53-3609071, which accounts for 11.8% of the phenotypic variation. No major locus explaining > 20% of the total variation was detected among these growth-related QTLs.

Meanwhile, 10 significant QTLs for sex determination were detected on LG17 of the channel catfish using permutation tests (P < 0.02, LOD > 3.3). Among these QTLs, the highest LOD value of 31.01 (**Figure 2B**) was located at 53.39 cM near the marker Scaffold55-2157602, which explained 59.5% of the phenotypic variation.

# Chromosomal Assembly and Comparative Genome Analysis

Twenty-nine pseudo-chromosomes (Chr) of channel catfish with a total length of 704.66 Mb were assembled, comprising 83.39% of the assembled scaffold sequences (Chen et al., 2016) and 18,161 of the 21,556 genes. The average pseudo-chromosome length was 24.29 Mb, with seven scaffolds on average. The smallest pseudo-chromosome was chr16, with 7.23 Mb in one scaffold, and the largest was chr13, with 40.9 Mb in 10 scaffolds.
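The summary figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below only recombines numbers reported in the text; the variable names are hypothetical, and the implied assembly size is a derived value rather than one stated in this paragraph.

```python
anchored_mb = 704.66        # total length of the 29 pseudo-chromosomes (Mb)
anchored_fraction = 0.8339  # share of the assembled scaffold sequence
anchored_genes = 18_161     # genes located on pseudo-chromosomes
total_genes = 21_556        # genes in the whole assembly
n_chromosomes = 29

# Mean pseudo-chromosome length, share of anchored genes, implied assembly size.
mean_length_mb = anchored_mb / n_chromosomes
gene_fraction = anchored_genes / total_genes
implied_assembly_mb = anchored_mb / anchored_fraction

print(f"mean pseudo-chromosome length: {mean_length_mb:.2f} Mb")
print(f"anchored genes: {gene_fraction:.2%}")
print(f"implied assembly size: {implied_assembly_mb:.0f} Mb")
```

The mean length comes out at about 24.3 Mb, in line with the reported 24.29 Mb, and roughly 84% of the genes are anchored.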

There were 16,197 synteny blocks between the assembled genomes of channel catfish and zebrafish, and 15 of the 29 LGs of channel catfish had relatively conserved collinear blocks on zebrafish chromosomes (**Figure 3A**). The total number of synteny blocks between our channel catfish assembly (Chen et al., 2016) and the assembly published by Liu et al. (2016) was 55,525, and all chromosomes showed a 1:1 synteny relationship (**Figure 3B**). **Figure 4** summarizes the distribution of SNPs, genes, and GC content in 100-kb genomic intervals, and the interchromosomal relationships of our assembled channel catfish pseudo-chromosomes.

GW, growth; Expl, percentage of explained phenotypic variation.

# Potential Candidate Genes for Sex Dimorphism and Growth-Related Traits

To further identify potential genes underlying sex dimorphism, we used BLASTX to search gene sequences from the QTL regions on LG17 against the NCBI Nr database. In total, 23 sex-related genes were predicted, which were previously reported to be involved in spermatogenesis, gonad sex determination, and testicular determination (**Table 5**). These genes included spermatogenesis-associated protein 2 (spata2), spata5, splicing factor 3 (sf3), and forkhead box (fox). Five sex-related genes, namely Wilms tumor protein 1-interacting protein (wt1), spata2, probable ATP-dependent DNA helicase DDX11 (ddx11), zinc finger homeobox protein 3-like isoform X1 (zfhox3), and forkhead box protein F1 (foxf1), were located near four sex-related QTLs (qSEX\_6, qSEX\_7, qSEX\_8, and qSEX\_10) (**Figure 5A**).

We screened our reference genome and collected protein-coding genes from the growth-related QTL regions. Three growth-related genes, namely multiple epidermal growth factor-like domains protein 9 (megf9), neuropeptide FF receptor 1 (npffr1), and growth arrest-specific protein 1 (gas1), were found in two (qGW\_1 and qGW\_3) of the six growth-related QTL regions (**Table 5**). Among these genes, npffr1 is a receptor of neuropeptide FF and RFamide-related peptide (rfrp), which are involved in the control of feeding behavior in both certain invertebrates and vertebrates (Dockray, 2004; Bechtold and Luckman, 2007).

These candidate genes on LG17 and LG28 may be involved in the genetic control of sex- and growth-related traits. Their detailed functions merit further investigation for genetic breeding of channel catfish.

# A Sex-Linked Marker Was Verified in the Channel Catfish

According to a previously reported molecular marker (Ninwichian et al., 2012), we obtained a 192-bp fragment by Sanger sequencing. Subsequently, we used the clean reads of the samples in this study to perform multiple sequence alignments with this sex-linked sequence. We observed that the sex-linked locus consists of three types of allele in the SSR marker: no deletion (allele G1), a 3-bp deletion [one (TAA) repeat; allele G2], and a 6-bp deletion [two (TAA) repeats; allele G3] (**Figures 5C,D**).

TABLE 5 | Annotation of growth- and sex-related candidate genes in the genome of channel catfish.

The 6-bp deletion allele was present in all male individuals but in none of the females (**Figures 5C,D**), and it could therefore distinguish male from female individuals. This sex-linked SSR marker was located on LG17 of channel catfish, within the non-coding region of zinc finger and BTB domain-containing protein 38 (zbtb38), which is approximately 25 kb in length and consists of six exons and five introns (**Figure 5B**). Furthermore, to verify the sex-linked SSR marker, we designed fluorescent primers to perform PCR amplification in 43 male and 53 female catfish individuals from two other breeding populations. The PCR results matched the phenotypes with 100% overall accuracy.
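The decision rule implied by this marker, and the accuracy check against phenotypic sex, can be written as a short Python sketch. The allele labels G1, G2, and G3 follow the text; the function names and the example genotypes in the usage note are hypothetical and are not data from the study.

```python
def predict_sex(alleles):
    """Predict sex from the SSR alleles of one fish.

    alleles : iterable of allele labels ("G1", "G2", "G3") for the two copies.
    The 6-bp deletion allele (G3) was found in all males and in no females.
    """
    return "male" if "G3" in alleles else "female"

def accuracy(genotypes, phenotypes):
    """Fraction of fish whose predicted sex matches the recorded phenotype."""
    hits = sum(predict_sex(g) == p for g, p in zip(genotypes, phenotypes))
    return hits / len(phenotypes)
```

With made-up genotypes such as `[("G1", "G3"), ("G1", "G2")]` and phenotypes `["male", "female"]`, `accuracy` returns 1.0, the analogue of the 100% agreement reported for the 96 validation fish.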

# DISCUSSION

A genetic linkage map can provide important genomic information and allow for exploration of QTL, which can be used to maximize the selection of target traits in breeding animals. Availability of a large number of genetic markers is essential for constructing a good genetic linkage map and for QTL mapping of available genetic traits. High-density linkage maps and growth/sex-related QTLs have been analyzed using RAD sequencing in several aquatic animals, including genetic linkage map construction in orange-spotted grouper (Epinephelus coioides) (4,608 SNPs) (You et al., 2013), and genetic maps and sex-/growth-related QTL in turbot (Scophthalmus maximus) (6,647 SNPs) (Wang W. et al., 2015), blunt snout bream (Megalobrama amblycephala) (14,648 SNPs) (Wan et al., 2017), mandarin fish (S. chuatsi) (3,283 SNPs) (Sun et al., 2017), common carp (C.c. haematopterus) (7,820 SNPs and 295 SSRs) (Feng et al., 2018), and pompano (Trachinotus blochii) (12,358 SNPs) (Zhang et al., 2018).

In the present study for channel catfish, we employed RAD sequencing to construct a high-density genetic linkage map with 4,768 SNPs, with a total length of 2,480.25 cM and an average SNP distance of 0.55 cM. Using the markers to target specific scaffolds from a previous study (Chen et al., 2016), we anchored a total of 223 scaffolds in the channel catfish genome assembly to 29 LGs. Approximately 704.66 Mb (83.39%) of the assembled genome sequences were assigned to the 29 LGs, with identification of 18,161 genes (84.25%). Interestingly, in previous work (Chen et al., 2016), a female individual collected from a local breeding stock in China was used for genome sequencing, and the assembled genome size was 845 Mb. However, in another published study of the channel catfish genome using a doubled haploid female individual from the United States (Liu et al., 2016), a 783-Mb genome was assembled. This marked difference in genome size may result from the different sources of the sequenced samples.

Many species of teleost fish have sexually dimorphic growth patterns (Mei and Gui, 2015), and there are significant growth differences between males and females (Gui and Zhu, 2012; Liu F. et al., 2013). Therefore, in some species, production of monosex populations is economically desirable. In this study, we observed that males of the Chinese channel catfish differed significantly from females in BW and BH (P-value < 0.05, **Table 2**), and males exhibited much faster growth than females under the same culturing conditions. This sexual dimorphism has also been reported in other fish species, such as yellow catfish (P. fulvidraco) (Liu H. et al., 2013) and Nile tilapia (O. niloticus) (Lee et al., 2011). The high-density genetic linkage map generated in the present study provides useful data for QTL fine mapping of economically important traits (especially growth- and sex-related traits) in channel catfish.

Growth and sex are the most important traits for cultured fish species. Based on established genetic linkage maps, researchers have determined many practical QTLs for sex/growth traits. For example, in mandarin fish (S. chuatsi), one significant QTL for sex determination was identified on LG23; all female fish were heterozygous at the r1\_33008 marker and all males were homozygous, so this sex-specific marker can be used to identify male and female individuals of mandarin fish; meanwhile, 11 significant QTLs for growth traits were also detected on four LGs (Sun et al., 2017). Similarly, in our present study, 10 significant QTLs associated with the sex-related trait at LOD ≥ 3.3 were detected on LG17, explaining 12.3–59.5% of the phenotypic variation in channel catfish. Similar findings have been reported in other fish species, such as Atlantic halibut (H. hippoglossus) (Palaiokostas et al., 2013a) and gilthead sea bream (Sparus aurata) (Loukovitis et al., 2011).

In contrast, sex-related QTLs of some other fish species were found to be distributed on different LGs. In the blunt snout bream (M. amblycephala), three QTLs related to gonad development were detected on LG13, LG12, and LG1 (Wan et al., 2017). In tilapia, sex-linked QTLs were detected on at least three LGs (LG1, LG3, and LG23; Lee et al., 2003; Cnaani et al., 2004; Eshel et al., 2012). These results suggest the involvement of multiple chromosomes or LGs in sex determination and support polygenic sex determination in fishes. Sex determination of channel catfish is probably male-dominant (XX/XY), because the sex ratios of offspring in both interspecific hybridization (I. punctatus ♀ × Ictalurus furcatus ♂) and intraspecific hybridization of channel catfish are close to 1:1, and the offspring in gynogenetic families were all females (Goudie et al., 1995). As in other fish species, the channel catfish sex chromosomes (X and Y) are difficult to distinguish with current karyotype analysis technologies. Therefore, identification of sex-specific markers and QTLs is a necessary prerequisite for uncovering sex determination mechanisms and associated genes, as well as for carrying out sex-transformation and sex-control tests.

Sex-specific markers have been identified in more than 20 economically important aquaculture species to date using AFLP and NGS strategies. In the present study, 23 genes related to sex development were detected on LG17, such as spata (related to sperm production; Sun F. et al., 2013; Chen et al., 2015), wt1 (early establishment of the gonad; Lin et al., 2017), and foxl1 (sex hormone regulation; Hu et al., 2017). This result is similar to that in tilapia, in which 51 sex-determination genes were annotated in the sex region on scaffold 101 (Eshel et al., 2012). In our current study, all sex-related QTLs were located on LG17, suggesting that a single chromosome may be involved in sex determination (**Figure 3**). To validate a previously reported sex-specific marker in channel catfish, we extracted the male-specific tags present in all male individuals but in none of the females. Finally, we obtained one SSR marker with a 6-bp deletion in the zbtb38 gene that was present in all males. Results from fluorescent PCR-capillary gel electrophoresis also confirmed the male-specific SSR marker. Likewise, sex-specific SSR markers have been reported in half-smooth tongue sole (C. semilaevis) (Chen et al., 2012) and kiwifruit (Actinidia chinensis Planchon) (Zhang et al., 2015), and they have been used to distinguish males and females in practical breeding programs.

The good resolution and high density of our genetic linkage map provide effective support for QTL mapping of economically important traits, as well as for genome assembly. QTL fine mapping and positional cloning of candidate genes have been an efficient approach for breeding programs in aquaculture animals, especially for investigation of quantitative traits (Xiao et al., 2015; Yu et al., 2015; Peng et al., 2016). Interestingly, in our present study, six QTLs associated with growth traits (BL, BH, BW, and WD) clustered within a narrow linkage span (32.53–45.29 cM) of LG28. We screened the reference genome (Chen et al., 2016) and identified three protein-coding genes from this growth-related QTL region, providing potential tools for molecular breeding of new varieties with superior growth.

# CONCLUSION

In the present study, we employed RAD sequencing to construct a high-density genetic linkage map with 4,768 SNPs for the channel catfish. Ten sex-related QTLs were detected on LG17, on which 23 genes related to sex development were annotated, such as spata, wt1, and foxl1. Six QTLs for growth were detected on LG28, and three growth-related genes were identified within the QTL intervals. A previously reported sex-linked marker was confirmed on LG17, which can effectively identify male and female individuals of channel catfish from different genetic resources. In summary, we provide a valuable genetic resource for future molecular breeding of this economically important fish species.

# DATA AVAILABILITY


The datasets generated for this study can be found in CNGB Nucleotide Sequence Archive (https://db.cngb.org/cnsa/), CNP0000229.

# AUTHOR CONTRIBUTIONS

XY, SZ, QS, WB, and XC conceived the ideas and designed the investigations. XZ, TX, and SZ analyzed the data. SZ, MW, QQ, LZ, and HJ collected and processed the samples. HL, JS, and ZZ performed the experiments. XZ, SZ, and XY wrote the manuscript. XY and QS revised the manuscript. All authors read and approved the final manuscript for publication.

# FUNDING

This project was supported by Major Project for New Cultivar Breeding of Jiangsu Province (No. PZCZ201741), China Agriculture Research System (No. CARS-46), Jiangsu Fisheries Research System (No. JFRS-05), Research Fund for the 333 Project of Jiangsu Province (No. BRA2018377), Shenzhen Special Program for Development of Strategic Emerging and Future Industries (No. 20170428173357698), and Shenzhen Special Program for Development of Emerging Strategic Industries (No. JSGG20170412153411369).



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling Editor declared a past co-authorship with one of the authors XY.

Copyright © 2019 Zhang, Zhang, Chen, Xu, Wang, Qin, Zhong, Jiang, Zhu, Liu, Shao, Zhu, Shi, Bian and You. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Combining Individual Phenotypes of Feed Intake With Genomic Data to Improve Feed Efficiency in Sea Bass

Mathieu Besson<sup>1,2</sup>, François Allal<sup>2</sup>, Béatrice Chatain<sup>2</sup>, Alain Vergnet<sup>2</sup>, Frédéric Clota<sup>1,2</sup> and Marc Vandeputte<sup>1,2</sup>\*

<sup>1</sup> GABI, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France, <sup>2</sup> MARBEC, Univ Montpellier, CNRS, Ifremer, IRD, Palavas-les-Flots, France

### Edited by:

Lior David, Hebrew University of Jerusalem, Israel

### Reviewed by:

Allan Schinckel, Purdue University, United States

Jesús Fernández, Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Spain

\*Correspondence: Marc Vandeputte, marc.vandeputte@inra.fr

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 14 November 2018 Accepted: 27 February 2019 Published: 29 March 2019

### Citation:

Besson M, Allal F, Chatain B, Vergnet A, Clota F and Vandeputte M (2019) Combining Individual Phenotypes of Feed Intake With Genomic Data to Improve Feed Efficiency in Sea Bass. Front. Genet. 10:219. doi: 10.3389/fgene.2019.00219

Measuring individual feed intake of fish in farms is complex and precludes selective breeding for feed conversion ratio (FCR). Here, we estimated the individual FCR of 588 sea bass using individual rearing under restricted feeding. These fish were also phenotyped for their weight loss at fasting and muscle fat content, two traits possibly linked to FCR. The 588 fish were derived from a full factorial mating between parental lines divergently selected for high (F+) or low (F–) weight loss at fasting. The pedigree was known back to the great grand-parents. A subset of 400 offspring and their ancestors were genotyped for 1,110 SNPs, which allowed us to calculate the genomic heritability of traits. Individual FCR and growth rate in aquarium were both heritable (genomic h² = 0.47 and 0.76, respectively) and strongly genetically correlated (−0.98), meaning that, under restricted feeding, faster growing fish were more efficient. FCR and growth rate in aquariums were also significantly better for fish with both parents from F– (1.38), worse for fish with two parents F+ (1.51) and intermediate for crossbred fish (F+/F– or F–/F+ at 1.46). Muscle fat content was positively genetically correlated to growth rate in aquarium and during fasting. Thus, selecting for higher growth rate in aquarium, lower weight loss during fasting and fatter fish could improve FCR in aquarium. Improving these traits would also improve FCR of fish in normal group rearing conditions, as we showed experimentally that groups composed of fish with good individual FCR were significantly more efficient. The FCR of groups was also better when the fish composing the groups had, on average, lower estimated breeding values for growth rate during fasting (losing less weight). Thus, improving FCR in aquarium and weight loss during fasting is promising to improve FCR of fish in groups, but a selection response experiment needs to be done. Finally, we showed that the reliability of estimated breeding values was higher (from +10% up to +125%) with a genomic-based BLUP model than with a traditional pedigree-based BLUP, showing that genomic data would enhance the accuracy of the prediction of EBV of selection candidates.

Keywords: aquaculture, feed conversion ratio, fine phenotyping, genomic selection, individual feed intake, restricted feeding, selective breeding

# INTRODUCTION

Improving feed conversion ratio (FCR) is crucial to enhance the sustainability of fish production, as feed is a major economic and environmental cost of fish production (Besson et al., 2017). Improving FCR by selective breeding has already been achieved in terrestrial livestock species by selecting directly on improved growth rate (Knap and Kause, 2018). Faster growing animals are expected to be more efficient, as their maintenance cost is proportionally lower than that of slowly growing animals. In terrestrial livestock, many studies found that the genetic correlation between growth rate and feed intake was negative, meaning that improving growth rate generates a correlated decrease of FCR (Knap and Kause, 2018). In fish, however, the genetic correlation between growth rate and FCR is still uncertain. Some studies show no correlation while others show a negative correlation (De Verdal et al., 2017; Knap and Kause, 2018). Hence, 64–100% of the genetic variance of FCR is expected not to be explained by growth rate (Knap and Kause, 2018). This large uncertainty on the correlation between growth rate and FCR in fish, and hence on the feasibility of improving FCR through selection for faster growth, is mainly due to the difficulties of measuring individual feed intake.

In livestock, individual feed intake has been recorded using electronic feeders (e.g., Gilbert et al., 2007). This kind of device gives a single animal at a time access to the feed, which is located in a closed containment, and then associates the animal with its feed intake using its ear-tag. Using such devices to estimate individual feed intake of fish reared in groups is not possible because of several issues, such as the reluctance of fish to enter a closed containment or the difficulty of ensuring that the fish eat the entire ration distributed. A first solution to overcome these issues was to estimate feed intake at the scale of a tank composed of fish from the same family. With this method, feed intake is measured at the group level by measuring the amount of feed distributed and by collecting uneaten pellets (e.g., Kolstad et al., 2004). Using separately reared families makes it possible to estimate the genetic variability of FCR and then proceed to between-family selection. Family measurements, however, do not enable the estimation of within-family variation, resulting in overestimation of genetic parameters (Doupé and Lymbery, 2003). Within-family variation can be estimated by measuring feed intake at the individual fish level using feed pellets containing radio-opaque glass beads. After a meal, fish are anesthetized and X-rayed, and the pellets in the gastrointestinal tract are counted on the radiograph (Kause et al., 2006, 2016). This technique is highly accurate for estimating the feed intake of a single meal, but it is laborious, as many records on each fish are needed to take into account the variation of feed intake across meals. Consequently, easier ways to access individual feed intake are needed.

Measuring individual feed intake directly on individual fish over a long period of time would be the best solution to estimate accurate genetic parameters of feed intake or FCR and then develop efficient breeding programs. This was the aim of the present research. To reach this objective, we built an experimental rearing facility of 200 aquariums where fish were reared in isolation and where individual feed intakes were measured accurately over a long period of time. In addition, we chose to feed the fish a restricted ration because, in studies in rabbits and pigs (Nguyen et al., 2005; Drouilhet et al., 2016), selecting faster growing animals under restricted feeding improved FCR as a correlated response. This is because, under restricted feeding, the animals that grow faster are de facto the most efficient. The estimation of feed intake in these conditions also enables the calculation of an accurate estimate of individual FCR. Nevertheless, the individual measurement of FCR in aquariums remains laborious and cannot be made on all selection candidates of a fish breeding program, but rather on a (relatively) limited number of sibs. Thus, we genotyped the fish phenotyped for individual FCR in aquariums with a custom SNP chip, to test whether genomic information would enhance the estimation of the breeding values of selection candidates. In genomic selection (GS), genotypes and phenotypes of sibs are used in the prediction equations of the genomic estimated breeding values (GEBVs) of the selection candidates, which are only genotyped. In aquaculture, several studies have shown the higher performance of GS in terms of genetic gain for traits such as growth (Tsai et al., 2015) or disease resistance (Bangera et al., 2017; Vallejo et al., 2017).
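The link between growth and efficiency under a restricted ration can be made concrete with a small Python sketch. It assumes the usual definition of FCR as feed consumed divided by body weight gained over the same period, which is not spelled out in this section; the function name and the example numbers in the usage note are hypothetical.

```python
def feed_conversion_ratio(feed_intake_g, initial_weight_g, final_weight_g):
    """Individual FCR = feed consumed / body weight gained over the same period."""
    gain = final_weight_g - initial_weight_g
    if gain <= 0:
        raise ValueError("FCR is undefined when the fish did not gain weight")
    return feed_intake_g / gain
```

If two fish both receive 14 g of feed and one gains 10 g while the other gains 12 g, their FCRs are 1.4 and about 1.17: with the ration fixed, the faster growing fish is necessarily the more efficient one.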

Furthermore, to ensure enough phenotypic variability in the traits measured in aquarium conditions, we used fish divergently selected for their weight loss during fasting to generate our experimental fish population. Weight loss during fasting was shown to be correlated to FCR in rainbow trout (Grima et al., 2008) and to residual feed intake (another estimate of feed efficiency) in sea bass (Grima et al., 2010a,b). Weight loss during fasting is, indeed, supposed to be linked to FCR because, during fasting, fish use their stored energy to cover maintenance costs. Hence, weight loss during fasting is an indicator of maintenance metabolic rate, and selecting for fish with lower weight loss during fasting should, theoretically, reduce FCR due to lower maintenance needs. Furthermore, in the pig industry, Knap and Wang (2012) reported positive correlations between back fat depth and FCR, meaning that selection for leaner pigs led to an improvement of FCR. This is because the deposition of fat is less efficient in terms of energy used per unit of wet weight gain than the deposition of protein. Thus, fat content and weight loss at fasting seem promising for the genetic improvement of individual FCR. Consequently, in addition to the individual measurement of FCR in aquarium, we also phenotyped the fish for their fat content and their weight loss during fasting in order to identify more traits explaining part of the genetic variation of FCR.

Finally, knowing the individual phenotypes of these fish for FCR, we set up a validation experiment where we tested if the FCR of groups of fish could be explained by their individual performances in aquarium and/or by their weight loss during fasting and their fat content. Here, our objectives were (1) to investigate if selection for three indirect traits, weight loss during fasting, fat content, and FCR in aquarium under restricted feeding, could explain the FCR performance of groups and (2) to test if genomic information could improve the estimation of genetic parameters and breeding values for FCR-related traits.

# MATERIALS AND METHODS

# Origin of the Fish

Generation 0: The animals of G0 were caught from the wild in the West Mediterranean (Gulf of Lions).

Generation 1: Forty-one sires and 8 dams randomly chosen from G0 were mated in a full factorial mating design to create the G1 generation (Grima et al., 2010b). The G1 individuals (1,912 fish) were phenotyped for their growth performance during two fasting periods of 3 weeks following normal feeding periods of 3 weeks. The trait measured was the average thermal growth coefficient (TGC) from the two periods, corrected for the effects of initial weight and initial TGC (FDcorr) (Grima et al., 2010b).

Generation 2: Broodstock fish were selected from the 1,912 candidates of G1 based on their phenotypes for FDcorr (mass selection) to create generation G2. Twenty sires and 5 dams with the lowest FDcorr (losing more weight during fasting) were mated in a full factorial mating design to create the F+ line. In parallel, 20 sires and 5 dams with the highest FDcorr (losing less weight during fasting) were mated in a full factorial mating design to create the F– line (Daulé et al., 2014). The average selection differential was +1.49 phenotypic standard deviations (σ<sub>P</sub>) in FDcorr for the F– dams, +2.25 σ<sub>P</sub> for the F– sires, −1.81 σ<sub>P</sub> for the F+ dams and −1.74 σ<sub>P</sub> for the F+ sires. A total of 1,037 individuals of G2 generated from these matings were phenotyped for FDcorr during three feed deprivation periods of 3 weeks.

Generation 3: Two G1 dams (one from the F+ line and one from the F– line) were mated with 30 G2 sires (15 from F+ and 15 from F–) in a full factorial mating design. We had to pick females from G1 because there were no females from G2 ready to spawn at the time of the mating. Both sires and dams were chosen based on their FDcorr phenotypes. The selection differential was −1.70 σ<sub>P</sub> for the F+ dam and +1.20 σ<sub>P</sub> for the F– dam, while the selection differential was −1.82 σ<sub>P</sub> for the F+ sires and +1.49 σ<sub>P</sub> for the F– sires.
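As an illustration of how a standardized selection differential of the kind reported above can be obtained, the following minimal sketch expresses the difference between the mean of the selected parents and the mean of all candidates in phenotypic standard deviations. The function name and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def standardized_selection_differential(candidates, selected):
    """Return (mean of selected - mean of candidates) / SD of candidates.

    `candidates` holds the phenotypes of all selection candidates and
    `selected` the phenotypes of the animals kept as parents.
    """
    sd = stdev(candidates)  # sample standard deviation of the candidates
    return (mean(selected) - mean(candidates)) / sd


# Hypothetical phenotypes (arbitrary units), for illustration only
candidates = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5]
high_line = [1.1, 1.5]    # parents with the highest values
low_line = [-1.2, -0.8]   # parents with the lowest values

print(standardized_selection_differential(candidates, high_line))
print(standardized_selection_differential(candidates, low_line))
```

A positive value indicates parents above the candidate mean and a negative value parents below it, which matches the sign convention used for the F– and F+ lines.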

# Initial Growing Period

After artificial fertilization, all G3 families were mixed and kept in 2 replicate tanks at 48 h post-fertilization. At 100 days post-fertilization (dpf), fish were transferred to two 1.5 m<sup>3</sup> fiberglass tanks. At 185 dpf (mean weight = 13.1 g), 660 fish from one tank were individually tagged with a passive integrated transponder. At 276 dpf (mean weight = 33.22 g), 350 fish from the second tank were individually tagged and mixed with the previously tagged fish. At tagging (in total 1,010 G3 fish tagged), a piece of fin from each fish was collected for DNA extraction for parentage assignment and genotyping.

# Phenotyping

The G3 fish went through several phenotyping experiments that are described below and summarized in **Figure 1**.

### Individual Feed Efficiency in Aquarium Under Restricted Feeding

Two hundred 10 l aquariums were used, in a recirculation system where natural salinity sea water was kept at 21°C. Individuals from G3 were first reared in groups of five fish in each aquarium, to enable adaptation of the fish to their new environment. After 14 days, they were weighed and randomly split into individual aquariums. After 14 days of acclimation in isolation, the fish were weighed again in a "go, no go" step. The fish that lost weight during this period were removed. The remaining ones were kept in aquariums for 2 more periods of 14 days. In total, a "successful" fish stayed 56 days in aquarium and was weighed 4 times (**Figure 2**) before being replaced by another fish in the aquarium. For the first batch, the fish started at 199 dpf while, for the last batch, they started at 324 dpf. Individual BW at each measurement was used to calculate the individual feeding ration for the following period. This ration (1.3% BW/day) was half the standard ration (2.6% BW/day) given by the feed manufacturer (see feed composition in **Supplementary Table 1**). We chose a high level of restriction because, in aquarium conditions, fish do not express their full feed intake potential: pre-tests showed that their ad libitum feed intake in aquariums was lower than in normal rearing conditions. Fish were fed automatically once a day in the morning (9.00 a.m.) with the whole daily ration. Every afternoon, uneaten pellets were counted in each aquarium and then removed. The number of pellets was then converted to grams (1 pellet ≈ 0.00925 g). Among the 831 G3 fish tested in aquariums, 185 fish did not pass the "go, no go" step. Thus, 646 fish were evaluated for individual feed intake over 2 periods of 14 days. For those 646 fish, we calculated their cumulated FCR (noted as FCR\_aquarium) using their cumulated weight gain (BWG = BW<sub>3</sub> − BW<sub>1</sub>) and cumulated feed intake (cumFI = FI<sub>1</sub> + FI<sub>2</sub>). We excluded fish with aberrant performances: 6 fish with negative FCR\_aquarium and 52 fish with FCR\_aquarium higher than 2.60. Applying these thresholds, we kept 588 fish with data available for FCR\_aquarium, DGC\_aquarium and DFC\_aquarium, calculated as follows.

$$\text{FCR\\_aquarium} = \frac{\text{cumFI}}{\text{BWG}} \tag{1}$$

$$\text{DGC\\_aquarium} = \frac{\text{BW}\_3^{\frac{1}{3}} - \text{BW}\_1^{\frac{1}{3}}}{28} \times 100 \tag{2}$$

$$\text{DFC\\_aquarium} = \frac{\text{cumFI}^{\frac{1}{3}} - \text{BW}\_1^{\frac{1}{3}}}{28} \times 100 \tag{3}$$

Where DGC is the daily growth coefficient, DFC is the daily feed intake coefficient calculated following Janssen et al. (2017), BWG is the weight gain during the 2 isolation periods in aquarium and 28 is the duration in days between the measurements of BW<sub>1</sub> and BW<sub>3</sub>. All traits were log transformed to enhance homogeneity of variance and to linearize the relationships between the traits.
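The calculations described in this subsection can be sketched as follows, under stated assumptions. FCR is taken in its conventional sense of feed consumed per unit of weight gain, so that a lower value is better. The function names are hypothetical, the body weights and pellet counts are invented for illustration, and the feed offered in a period is approximated as the daily ration times 14 days.

```python
import math

PELLET_WEIGHT_G = 0.00925  # approximate weight of one pellet (from the text)
RATION_RATE = 0.013        # restricted ration: 1.3% of body weight per day
FCR_MAX = 2.60             # upper exclusion threshold used in the text


def period_feed_intake(bw_start_g, uneaten_pellets, days=14):
    """Feed eaten over one period: feed offered minus uneaten pellets (g)."""
    offered = RATION_RATE * bw_start_g * days
    return offered - uneaten_pellets * PELLET_WEIGHT_G


def fcr_aquarium(bw1, bw3, cum_fi):
    """Feed conversion ratio: cumulated feed intake per unit of weight gain."""
    gain = bw3 - bw1
    if gain == 0:
        return math.inf  # no gain: FCR is undefined, treated as aberrant
    return cum_fi / gain


def dgc_aquarium(bw1, bw3, days=28):
    """Daily growth coefficient between BW1 and BW3 (Eq. 2)."""
    return (bw3 ** (1 / 3) - bw1 ** (1 / 3)) / days * 100


def keep_fish(fcr):
    """Apply the exclusion rule: drop negative FCR and FCR above 2.60."""
    return 0 <= fcr <= FCR_MAX


# Hypothetical fish: body weights (g) at the start of each isolation period
bw1, bw2, bw3 = 30.0, 33.0, 36.0
cum_fi = period_feed_intake(bw1, uneaten_pellets=20) + period_feed_intake(bw2, uneaten_pellets=0)
fcr = fcr_aquarium(bw1, bw3, cum_fi)
if keep_fish(fcr):
    print(math.log(fcr), math.log(dgc_aquarium(bw1, bw3)))
```

The final line mirrors the log transformation applied to the traits before analysis.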

### Weight Loss at Fasting

At 570 dpf (177 days after the last fish ended phenotyping in aquarium), 764 fish previously tested in aquariums were phenotyped for their tolerance to fasting in a 5 m<sup>3</sup> fiberglass tank. The tolerance to fasting was calculated as the average (negative) daily growth coefficient over two fasting periods of 3 weeks (DGC\_fasting). These two fasting periods (fasting\_P1 and fasting\_P2) were separated by a period of 3 weeks of refeeding, similar to Grima et al. (2010b), where fish were fed ad libitum using a self-feeder with a standard commercial diet.

FIGURE 1 | Summary of the experiments realized on G3 fish at different ages (days post-fertilization). For each experimental period we listed the phenotypes available and the number of individuals phenotyped (between brackets). FCR refers to feed conversion ratio, DGC to daily growth coefficient, DFC to daily feed intake coefficient and muscle\_fat refers to muscle fat content.

### Feed Efficiency of Groups

To test the link between individual feed efficiency and group feed efficiency, the 588 fish phenotyped in aquariums were split into groups according to their individual performances in aquariums as follows:


Thus, 12 groups of individually tested fish were constituted, with one "high FCR" and one "low FCR" sub-group for each of the 6 categories of relative feed intake (**Figure 3**). In addition, we formed four more groups of 44 fish with the fish that lost weight during the first period of rearing in isolation in aquariums. These four groups were made to test if the non-acclimation to individual rearing could be linked to group FCR. In total, 764 fish were stocked in 16 tanks of 2 m² covered by opaque plastic curtains to avoid disturbance. Fish were fed once a day ad libitum using an automatic feeder delivering the daily ration in 20 portions over 6 h and 20 min. The frequency of distribution was every 5 min for the first 5 portions, every 10 min for the next 5 portions, then every 20 min for the next 5 portions and finally, every 30 min for the final 5 portions. The feeders were filled with a known amount of pellets. Uneaten pellets were collected in the fecal trap of each tank. Every day, at the end of the feeding period, each fecal trap was checked. If pellets were found, it meant that the fish of the tank had reached ad libitum. If a fecal trap was empty or only a few pellets were present, an additional portion was given to the tank by activating the feeder manually. Additional portions were then given every 30 min until pellets were present in the fecal trap, meaning that ad libitum was reached. Uneaten pellets of all tanks were then collected, photographed and counted using ImageJ (Abràmoff et al., 2004). The group FCR experiment lasted for 4 periods of 3 weeks from 441 to 525 dpf (starting 48 days after the last fish ended phenotyping in aquarium); the first 3 weeks were considered an acclimation period followed by three testing periods (group\_P1, group\_P2, group\_P3) where daily group feed intake was recorded. Feed intake was therefore also available for a period of 9 weeks (group\_full). At the beginning of the first period and at the end of each of the three testing periods, all fish were weighed in order to estimate the weight gain of the groups. Finally, using the daily feed intake of the groups and the body weight gain of the groups over the periods of 3 weeks, the average FCR of each group (FCR\_group) was estimated for each period.

### Fat Content

We measured the dorsal muscle fat content of the fish using indirect ultrasonic measurement (Distell Fish Fatmeter, FM 692) according to the method described by Douirin et al. (1998). Briefly, after a fish had been anesthetized and weighed, the Fatmeter was applied on each side of the fish. We only measured once on each side of the fish because they were not big enough to permit several measurements on each side. Thus, fat measurements were the average of two measurements. Four fat measurements took place during the fasting experiment, after each period of 3 weeks at 591, 612, 633, and 654 dpf. Thus, from the fasting experiment, we calculated the average fat content after feeding periods (muscle\_fat). Muscle\_fat measurements were log transformed to reduce heteroscedasticity.

# Genetic Analysis

### Genotyping and Parentage Assignment

The 50 grandparents from G1 and the 49 great-grandparents from G0 were genotyped with an iSelect Custom Infinium Illumina® European sea bass array of 2,722 SNPs (Faggion et al., 2018). Then, with a similar array of 3,987 SNPs, we genotyped:


This second array uses the same markers as the original 2,722 SNP array plus 1,265 duplicated markers that were ineffective on the original array due to bioinformatics problems that occurred during probe design. Once the animals were genotyped, the first step to create the SNP dataset used in our genomic analysis was to apply classic quality control, discarding all SNPs with a minor allele frequency (MAF) below 5% or a call rate below 90% in the G3 animals. This quality control resulted in keeping 2,100 SNPs for G3 individuals and their parents. From these 2,100 SNPs, we discarded the original version of all duplicated markers, which resulted in keeping 1,923 SNP markers for G3 individuals and their parents. Then, we kept only the markers that were in common between both chips, representing 1,110 SNPs. Finally, we discarded all animals for which the call rate (number of SNPs genotyped over the number of SNPs on the array) was lower than 90%, indicating a potential quality issue with the DNA sample. This resulted in keeping 5 individuals of G0 (out of 49), 52 individuals of G1 (out of 52), 30 fish of G2 (out of 30), and 399 fish of G3 (out of 400).
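The marker and sample filters described above can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1, or 2 copies of one allele, with `None` for a missing call, stored as one list per animal; the function names and the toy matrix are hypothetical and do not reproduce the actual pipeline used in the study.

```python
def snp_call_rate(column):
    """Proportion of animals with a non-missing genotype at one SNP."""
    return sum(g is not None for g in column) / len(column)


def snp_maf(column):
    """Minor allele frequency at one SNP, ignoring missing genotypes."""
    called = [g for g in column if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def filter_snps(genotypes, min_maf=0.05, min_call_rate=0.90):
    """Return indices of SNPs passing both the MAF and call rate thresholds."""
    keep = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        if snp_call_rate(column) >= min_call_rate and snp_maf(column) >= min_maf:
            keep.append(j)
    return keep


def filter_animals(genotypes, min_call_rate=0.90):
    """Return indices of animals whose own call rate reaches the threshold."""
    return [
        i for i, row in enumerate(genotypes)
        if sum(g is not None for g in row) / len(row) >= min_call_rate
    ]


# Toy matrix: 4 animals x 4 SNPs (illustration only)
toy = [
    [0, 0, None, 1],
    [1, 0, 2, 1],
    [2, 0, None, 0],
    [1, 0, 2, 2],
]
print(filter_snps(toy))     # SNP 1 is monomorphic, SNP 2 has a low call rate
print(filter_animals(toy))  # animals 0 and 2 each miss one call out of four
```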

The pedigree of G1 and G2 fish was previously retrieved by Grima et al. (2010a) and Daulé et al. (2014) using microsatellite markers and VITASSIGN, an exclusion-based parentage assignment software (Vandeputte et al., 2006). We also used VITASSIGN to retrieve the pedigree of all the 764 fish of G3 that were tested in the group experiment. Among those fish, 463 fish (out of 466 genotyped on the SNP array) were correctly assigned considering a percentage of mismatches lower than 2% (99.3% success). Then, among the remaining 298 fish typed for 12 microsatellite markers (188 fish with valid phenotypes in aquariums and 110 not phenotyped in aquarium), 286 fish could be assigned (96% success) to a single parental pair.

## Breeding Value Estimation

Variance components and estimated breeding values for all traits were computed based on multivariate linear mixed animal models. In these multivariate models, we always included DGC\_fasting as a "reference" trait because DGC\_fasting was measured on all G1, G2, and G3 fish, even those not selected to create the next generation(s). This allowed us to account for the selection process applied to weight loss at fasting in the estimation of variance components. Thus, 1,278 fish of G1, 1,029 fish of G2, and 701 fish of G3, all with DGC\_fasting phenotypes and pedigree, were included in all models. For DGC\_fasting, the linear model included the fixed effect of the generation, as DGC\_fasting was measured at different ages in different conditions across generations. Then, in the multivariate models, the other traits included were the traits only measured on G3 fish (e.g., DGC\_aquarium, or fat\_fasting). The models were fitted by restricted maximum likelihood in AIREMLF90 (Misztal et al., 2002) to compute the classical heritabilities using pedigree and the genomic heritabilities using genomic information. The breeding values were also computed with classical pedigree-based BLUP (PBLUP) and single-step GBLUP (ssGBLUP) using the genomic relationship matrix. The conventional pedigree-based EBVs were estimated using the following model:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$

Where **y** is the vector of phenotypes, **b** is the vector of fixed effects (batch, rack, line, and column for the phenotypes measured in aquariums) and **X** an appropriate incidence matrix, **u** is the vector of random additive genetic animal effects, **Z** the appropriate incidence matrix and **e** is the vector of random residual effects. The additive (animal) genetic effects were assumed to follow N(**0**, **V** ⊗ **A**), with **V** the genetic (co)variance matrix between traits and **A** the numerator relationship matrix relating all animals in the pedigree, while the residual effects were assumed to follow N(**0**, **R** ⊗ **I**), with **R** the residual (co)variance matrix between traits and **I** an appropriate identity matrix. The SNP-based EBV (GEBV) was estimated using single-step GBLUP (ssGBLUP) combining pedigree, genomic and phenotypic information (Legarra et al., 2014). In ssGBLUP, the relationships between non-genotyped fish are based on the numerator relationship matrix (**A** matrix) derived from the pedigree, while the relationships between fish with genotypes are based on the genomic relationship matrix described by VanRaden (2008) (**G** matrix). Apart from that, the general model (y = Xb + Zu + e) remains the same as in PBLUP.
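To make the **G** matrix more concrete, the sketch below builds a genomic relationship matrix in the way commonly attributed to the first method of VanRaden (2008): genotypes are centered by twice the allele frequency and the cross-product is scaled by 2Σp(1 − p). Which variant the software applied here, and how allele frequencies were obtained, is not stated in the text; this sketch assumes complete 0/1/2 genotypes and frequencies estimated from the genotyped animals themselves. The function name and toy data are hypothetical.

```python
def vanraden_g(genotypes):
    """Genomic relationship matrix G = ZZ' / (2 * sum(p * (1 - p))).

    `genotypes` is a list of rows (one per animal) of 0/1/2 allele counts
    with no missing values. Allele frequencies are estimated from these
    same animals, which is an assumption of this sketch.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    # Center each genotype by its expected value 2p
    z = [[row[j] - 2 * freqs[j] for j in range(m)] for row in genotypes]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [
        [sum(z[a][j] * z[b][j] for j in range(m)) / scale for b in range(n)]
        for a in range(n)
    ]


# Toy data: 3 animals x 2 SNPs (illustration only)
toy_genotypes = [
    [0, 2],
    [2, 0],
    [1, 1],
]
for row in vanraden_g(toy_genotypes):
    print(row)
```

In ssGBLUP this matrix is combined with the pedigree-based **A** matrix so that genotyped and non-genotyped animals can be evaluated together; that blending step is not shown here.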

### Cross Validation Scheme to Test Predictive Abilities

The predictive abilities of the different models described above (PBLUP and ssGBLUP), depending on the number of fish phenotyped, were assessed using a cross validation scheme. The model tested was a bivariate model including the phenotypes of 3,008 fish over 3 generations for DGC\_fasting and the phenotypes of 588 fish of G3 for log(DGC\_aquarium). We tested log(DGC\_aquarium) as this trait was considered an adequate variable describing feed efficiency (see Results). We included the generation as fixed effect for DGC\_fasting, while we used the batch, the rack and the column as fixed effects for log(DGC\_aquarium). The cross validation procedure followed three steps:



In addition, to reduce stochastic effects, we replicated the cross validation scheme 300 times for each number of individuals in the training and validation groups. Thus, we could calculate the average r<sup>2</sup><sub>EBV,y</sub> and the average Spearman rank correlation over the 300 repetitions for each of the five sizes of training population (40, 120, 120, 280, and 360 fish). The average r<sup>2</sup><sub>EBV,y</sub> was used to estimate the reliability (R<sub>EBV,BV</sub>) of PBLUP and ssGBLUP models using the following formula as in Bangera et al. (2017):

$$R\_{EBV,BV} = \frac{\overline{r\_{EBV,y}^2}}{h^2} \tag{4}$$

Where r<sup>2</sup><sub>EBV,y</sub> is the average over 300 repetitions of the squared correlation between the predicted EBV or GEBV for log(DGC\_aquarium) of all the fish in the validation group and their recorded log(DGC\_aquarium), corrected for fixed effects, and h² is the heritability of log(DGC\_aquarium) estimated using pedigree including all fish with phenotypes (h² = 0.39).
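Equation (4) can be illustrated with a short sketch that averages the squared correlation between predicted breeding values and corrected phenotypes over replicates and divides by the heritability. The function names and the toy replicates are hypothetical; in the study each replicate would come from one random split into training and validation groups.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reliability(replicates, h2):
    """Mean squared correlation over replicates, divided by heritability.

    `replicates` is a list of (predicted_ebv, corrected_phenotype) pairs,
    one pair of sequences per cross validation replicate (Eq. 4).
    """
    r2 = [pearson(ebv, y) ** 2 for ebv, y in replicates]
    return mean(r2) / h2


# Two hypothetical replicates with four validation fish each
toy_replicates = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),
    ([0.5, 0.1, 0.9, 0.3], [0.4, 0.3, 0.6, 0.1]),
]
print(reliability(toy_replicates, h2=0.39))
```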

## Statistical Analysis

In this section, all complete models are described. All linear models were analyzed using the lm procedure of R (R Development Core Team, 2008).

### Individual Performances in Aquarium

The individual performances in aquarium were studied using the following linear model:

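$$\text{y}\_{\text{ijklmnop}} = \mu + \text{batch}\_{\text{j}} + \text{rack}\_{\text{k}} + \text{line}\_{\text{l}} + \text{column}\_{\text{m}} + \text{sire\\_origin}\_{\text{n}} + \text{dam\\_origin}\_{\text{o}} + \text{sire\\_dam\\_origin}\_{\text{no}} + \text{e}\_{\text{ijklmnop}}$$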

Where y<sub>ijklmnop</sub> is the (log transformed) trait of interest (FCR\_aquarium, DGC\_aquarium or DFC\_aquarium), µ is the overall mean, and batch<sub>j</sub> is the fixed effect of the batch in which the fish was phenotyped in aquariums (1–10). rack<sub>k</sub>, line<sub>l</sub>, and column<sub>m</sub> are the fixed effects of the physical position of the aquarium in which the fish was phenotyped. There were 4 racks of 50 tanks in 5 lines and 10 columns. sire\_origin<sub>n</sub> and dam\_origin<sub>o</sub> are the fixed effects of the line of origin of sires and dams with 2 levels each (F+ and F–). sire\_dam\_origin<sub>no</sub> is the interaction between sire and dam origins. Finally, e<sub>ijklmnop</sub> is the random residual effect. From this complete model, we used the boot.stepAIC function in R to find out which fixed effects had to be included in the model for the genomic analysis for the different phenotypes observed. The boot.stepAIC function looks for the model with the lowest Akaike information criterion (AIC) (Austin and Tu, 2004).

### Weight Loss at Fasting

First, the effect of parental origin and fat content on weight loss at fasting of G3 fish was analyzed using the following linear model:

$$\begin{aligned} \text{DGC\\_fasting}\_{\text{ijk}} &= \mu + \beta \,\text{muscle\\_fat}\_{\text{k}} + \text{sire\\_origin}\_{\text{i}} \\ &+ \text{ dam\\_origin}\_{\text{j}} + \text{sire\\_dam\\_origin}\_{\text{ij}} + \text{ e}\_{\text{ijk}} \end{aligned}$$

Where µ is the overall mean, muscle\_fat<sub>k</sub> is the covariate describing the effect of muscle fat content measured before fasting on individual k, sire\_origin<sub>i</sub> is the fixed effect of sire origin (F+ or F–), dam\_origin<sub>j</sub> is the fixed effect of dam origin (F+ or F–), sire\_dam\_origin<sub>ij</sub> is the interaction between sire and dam origins and e<sub>ijk</sub> is the random residual effect.

Second, we tested if the selection process over 2.5 generations for lower or higher weight loss during fasting was efficient by comparing the genomic estimated breeding value (GEBV) of fish from the different parental origins, F+/F+ (both parents of F+ line), F+/F– and F–/F+ (one parent of F+ and one parent from F–) and F–/F– (both parents of F– line) within G2 or G3 fish using the following models:

$$\text{GEBV\\_DGC\\_fasting}\_{\text{ij}} = \mu + \text{line\\_origin}\_{\text{i}} + \text{e}\_{\text{ij}}$$

Where µ is the overall mean, line\_origin<sub>i</sub> is the effect of the line of origin i of fish j, with two levels (F+/F+ and F–/F–) for G2 fish and four levels in G3 (F+/F+, F–/F+, F+/F–, and F–/F–), and e<sub>ij</sub> is the random residual effect. GEBVs of DGC\_fasting of G2 and G3 fish were calculated using the single-step GBLUP procedure.

### Feed Efficiency of Groups

The aim of the experiment in groups was to test whether individual performances measured in aquarium and during fasting could predict the performance of fish in groups. As a first analysis, therefore, we applied a paired sample t-test to compare FCR\_group between tanks of fish with higher or lower relative weight gain within each group of relative feed intake (6 groups). Then, we tested if the average phenotypic performances in aquarium of the fish composing a tank could predict the FCR of the tank (FCR\_group). To do so, we computed the group mean performance to obtain avg\_DGC\_aquarium and avg\_DFC\_aquarium and, for each period of 3 weeks (group\_P1, group\_P2, and group\_P3) and for the combined period of 9 weeks (group\_full), we tested the following model:

$$\text{FCR\\_group}\_{\text{i}} = \mu + \beta\_1 \,\text{avg\\_DGC\\_aquarium}\_{\text{i}} + \beta\_2 \,\text{avg\\_DFC\\_aquarium}\_{\text{i}} + \text{e}\_{\text{i}}$$

Where µ is the overall mean, avg\_DGC\_aquarium<sub>i</sub> and avg\_DFC\_aquarium<sub>i</sub> are the averages over tank i of the DGC\_aquarium and DFC\_aquarium of every fish in tank i, and e<sub>i</sub> is the random residual effect. For this analysis we could test the FCR\_group of 12 tanks out of 16 because 4 tanks were composed of fish that were not phenotyped in aquariums.

Then, we tested if the average GEBV of DGC\_fasting of the fish composing a tank could predict the FCR of the tank (FCR\_group); GEBVs are described in the Genetic Analysis section. To do so, we computed the group mean GEBV to obtain avg\_GEBV\_DGC\_fasting and, for each period of 3 weeks (group\_P1, group\_P2, and group\_P3) and for the combined period of 9 weeks (group\_full), we tested the following model:

$$\text{FCR\\_group}\_{\text{i}} = \mu + \beta\_1 \,\text{avg\\_GEBV\\_DGC\\_fasting}\_{\text{i}} + \,\text{e}\_{\text{i}}$$

For this analysis we could test the FCR\_group of the 16 tanks for group\_P1 and group\_P2 and of 14 tanks for group\_P3 and group\_full. Data were available for only 14 tanks during group\_P3 because, owing to an operating error, wasted feed could not be collected for two tanks on one day.
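The single-covariate model above amounts to an ordinary least squares regression of tank FCR on the tank mean GEBV. A minimal sketch, with a hypothetical function name and invented tank values, is given below; it returns only the intercept and slope, whereas the analysis in R also provides the significance test of the slope.

```python
from statistics import mean


def simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Hypothetical tank means (illustration only)
avg_gebv_dgc_fasting = [-0.04, -0.02, -0.01, 0.00, 0.02, 0.03]
fcr_group = [1.45, 1.50, 1.49, 1.55, 1.58, 1.62]

intercept, slope = simple_regression(avg_gebv_dgc_fasting, fcr_group)
print(intercept, slope)
```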

# RESULTS

# Feed Efficiency in Aquariums

From the phenotyping experiment in aquarium, we could measure significant phenotypic variances for FCR\_aquarium, DGC\_aquarium and DFC\_aquarium (**Table 1**). There was a significant effect of sire origin on the three traits, log(FCR\_aquarium) [F(1, 553) = 11.64, p < 0.0001], log(DGC\_aquarium) [F(1, 553) = 13.97, p < 0.0001] and log(DFC\_aquarium) [F(1, 553) = 8.31, p = 0.0042]. The effect of the origin of the dam was also significant for log(FCR\_aquarium) [F(1, 553) = 4.86, p = 0.028]. The interaction effect between sire and dam origin was not significant for any trait (**Table 2**). This means that the fish with two F– parents were more efficient, grew faster and ate more than fish with only one F– parent (dam or sire), or with two F+ parents. In addition, the three traits measured in aquariums displayed large phenotypic correlations [r = −0.78 between log(FCR\_aquarium) and log(DGC\_aquarium); r = 0.83 between log(DGC\_aquarium) and log(DFC\_aquarium)] and a moderate phenotypic correlation between log(FCR\_aquarium) and log(DFC\_aquarium) (r = −0.38, **Figure 4**). These results show that, in aquariums under restricted feeding conditions, the fish that grow faster have a better (lower) FCR. The three main traits measured in aquariums were heritable, and the heritability estimate was greater when using genomic data than when using pedigree only (**Tables 3**, **4**). Furthermore, the three traits were all strongly genetically correlated, both with pedigree and with genomic data but again the

### TABLE 1 | Overview of phenotypic results in aquariums.


TABLE 2 | Least square means (±s.e.) of individual performance of G3 fish in aquariums [log(FCR\_aquarium), log(DGC\_aquarium), and log(DFC\_aquarium)] as function of the line of origin of sires and dams.



TABLE 3 | Genetic parameters of traits measured in aquariums calculated using pedigree information.


Heritability (±s.e.) on the diagonal and genetic correlations (±s.e.) above the diagonal.

TABLE 4 | Genetic parameters of traits measured in aquariums calculated using genomic data.


Heritabilities (±s.e.) are on the diagonal and genetic correlations are above the diagonal.

ssGBLUP was always more predictive than PBLUP and that including more individuals in the training group increased the predictive ability of ssGBLUP more than the predictive ability of PBLUP (**Figure 5**). Furthermore, in PBLUP models, the Spearman rank correlation remained nearly constant, from 0.19 (0.005) when 40 fish were in the training group to 0.20 (0.008) when 360 were in the training group, whereas the Spearman rank correlation increased in ssGBLUP from 0.20 (0.002) to 0.36 (0.007).

# Weight Loss at Fasting

In total, 701 G3 fish were phenotyped for their tolerance to fasting, calculated as the average (negative) daily growth coefficient over two consecutive feed deprivation periods (DGC\_fasting). **Figure 6** is a boxplot of DGC\_fasting for each generation as a function of parental origin. In G2 fish, we observed a significant divergence in phenotypes. This significant divergence in DGC\_fasting was also observed in the next generation G3 between all parental origins. The differences in the average DGC\_fasting between generations are explained by the fact that the experiments were done separately and at different ages for the different generations. Furthermore, within G3 fish we showed that DGC\_fasting was significantly affected by the fat content before fasting periods, muscle\_fat [F(1, 693) = 6.10, p = 0.013], the dam origin, F+ or F– [F(1, 693) = 13.38, p = 0.0002], and the interaction between dam origin and muscle\_fat [F(1, 693) = 6.02, p = 0.014] (more information in **Supplementary Table 2**). The genomic heritability of DGC\_fasting was moderate and the heritability of log(muscle\_fat) was high (**Table 5**). Also, log(muscle\_fat) was positively genetically correlated with DGC\_fasting (0.34 ± 0.12). In addition, the GEBVs were significantly different between parental origins in G2 [F(1, 1,027) = 497.9, p < 0.0001] and G3 [F(3, 747) = 238.1, p < 0.0001] (**Figure 7**). In G2, fish from the F–/F– parents had higher GEBV for DGC\_fasting than fish with F+/F+ parents. This trend was confirmed in G3 fish, where there were differences in GEBV of DGC\_fasting between the divergent lines F–/F– and F+/F+, between F–/F– and hybrid lines (F–/F+ and F+/F–), and between F+/F+ and hybrid lines (F–/F+ and F+/F–), but there were no differences between hybrid lines F+/F– and F–/F+ (Tukey LSD test, **Supplementary Table 3**).

TABLE 5 | Genetic parameters of traits measured during feed deprivation periods [DGC\_fasting and log(muscle\_fat)] calculated using genomic data.

The heritabilities (±s.e.) are on the diagonal and the genetic correlations (±s.e.) are above the diagonal.

# Feed Efficiency of Groups

The paired sample t-test showed that there were significant differences in FCR\_group between tanks composed of fish with higher relative weight gain (thus lower individual FCR) and tanks composed of fish with lower relative weight gain (thus higher individual FCR) during periods group\_P1 [t(1, 5) = −3.94, p < 0.005], group\_P2 [t(1, 5) = −2.35, p < 0.033] and for the combined group\_full [t(1, 5) = −3.1, p < 0.014]. However, in group\_P3, differences in FCR\_group were not significant (**Figure 8**). Additionally, there was a significant effect of log(avg\_DGC\_aquarium) and log(avg\_DFC\_aquarium) on FCR\_group during group\_P1, group\_P2, and group\_full (**Supplementary Table 4**). More particularly, the tanks composed of fish with higher log(DGC\_aquarium) and lower log(DFC\_aquarium) were more efficient in these periods. Finally, there was a significant effect of avg\_GEBV\_DGC\_fasting on FCR\_group during group\_P1, group\_P2, and group\_full (**Supplementary Table 5**). The tanks composed of fish with lower GEBV of weight loss during fasting were more efficient (lower FCR) in these periods.

# Genomic Correlations Between Growth Rates and Fat Content Across Experiments

The genetic correlations between DGC\_aquarium and the growth rates measured in the fasting and in the group experiments were not significantly different from zero (**Table 6**). In contrast, there were moderate negative genetic correlations between DGC\_fasting and DGC\_group (**Table 6**). Additionally, there were moderate to high positive genetic correlations between the different DGC and fat measured before fasting periods, which had a high heritability. More particularly, the genetic correlation was 0.39 between log(DGC\_aquarium) and log(muscle\_fat) (**Table 6**).

TABLE 6 | Genetic parameters of log(muscle\_fat) and daily growth coefficient (DGC) measured during different experiments: in aquarium (DGC\_aquarium), during the fasting experiment (DGC\_fasting), and during the experiment in groups (DGC\_group).


The heritabilities (±s.e.) are on the diagonal and the genetic correlations (±s.e.) are above the diagonal. These genetic parameters were calculated using genomic data. The values in bold are considered significant as their standard error is lower than half of the value.

# DISCUSSION

Isolating fish to estimate individual feed efficiency was first done by Silverstein (2006) on 55 rainbow trout. The method we present is, however, the first to estimate individual FCR of a large number of fish (588), which allowed estimating genetic parameters of individual FCR. With this method based on individual estimation of feed intake and weight gain under restricted feeding in a 200-aquarium facility, we found phenotypic variability in FCR (FCR\_aquarium), in daily growth coefficient (DGC\_aquarium) and in daily feed intake coefficient (DFC\_aquarium). The phenotypic coefficient of variation (CV) for FCR\_aquarium (21%) was close to that observed by De Verdal et al. (2017), who estimated a CV of 23.4% for individual FCR in Nile tilapia with video observation over a period of 10 days. Silverstein (2006) also showed phenotypic variability of residual feed intake (RFI, a trait related to feed efficiency) in isolated rainbow trout. This is an encouraging result toward possible genetic improvement of FCR in sea bass, as our results also showed that FCR\_aquarium, DGC\_aquarium and DFC\_aquarium were all heritable (genomic heritability estimates of 0.47, 0.76, and 0.57, respectively). Additionally, in these conditions of restricted feeding, FCR\_aquarium and DGC\_aquarium were strongly phenotypically and genetically correlated (r<sub>p</sub> = −0.78 and r<sub>g</sub> = −0.98). A high negative phenotypic correlation under restricted feeding was already observed by Silverstein (2006) in trout (r<sub>p</sub> = −0.57 between RFI and growth). Moreover, such a high negative genetic correlation was similar to the estimates obtained in pigs with −0.94 (Nguyen and McPhee, 2005) and in rabbits with −1.00 (Drouilhet et al., 2013), also evaluated in restricted feeding conditions. The heritability estimated for FCR using pedigree (0.25) was also similar to the heritability of FCR measured in the pig (0.16) and rabbit (0.23) studies. However, the heritability of growth rate was higher (0.39 with pedigree) compared to the pig study (0.16) and the rabbit study (0.22), but such high heritability for growth rate is common in fish and especially sea bass (e.g., 0.43 in Vandeputte et al., 2014).

In pigs and rabbits, the results were used to set up a selection procedure based on selecting faster-growing animals under restricted feeding. Both studies showed that such selection improved FCR in the following generations. In rabbits, FCR was reduced by 0.2 (from 2.82 to 2.63) after 9 generations (Drouilhet et al., 2016). In pigs, the EBV of FCR was reduced by 0.2 after 4 generations (Nguyen et al., 2005). Thus, selecting for higher growth rate under restricted feeding is an efficient way to reduce FCR in terrestrial livestock. Drouilhet et al. (2016) even showed that the correlated response obtained by this method was similar to the correlated response obtained from selection on residual feed intake under ad libitum feeding. This is also supported by similar heritability estimates for FCR under the two feeding regimes, restricted and ad libitum, in pigs and rabbits (Hermesch, 2004; Drouilhet et al., 2013). Thus, the prospect of improving FCR by selection is also promising for sea bass, using individual measurement of growth rate in isolation under restricted feeding. Nevertheless, rearing sea bass in isolation does not reflect commercial rearing conditions. To address this issue, we carried out a validation experiment in which we recorded the FCR of groups of fish in tanks. Each tank held several fish previously phenotyped in individual aquariums. The effects of DGC\_aquarium and DFC\_aquarium on the group FCR were significant for two of the three 3-week periods investigated and for the overall period of 9 weeks. These results support the hypothesis that DGC\_aquarium under restricted feeding is a usable proxy of the FCR of fish in groups fed ad libitum, which is the standard rearing procedure. Silverstein (2006) also showed that individual and group performances were correlated in trout. This suggests that selecting for higher DGC\_aquarium and lower DFC\_aquarium would improve the FCR performance of fish in groups.
However, a selection response experiment is needed to validate this point. The correlated response in FCR obtained by Nguyen et al. (2005) and Drouilhet et al. (2016) after selection on growth rate under restricted feeding was due to an increase in the body weight of the animals. In pigs, this increase in body weight was paired with a decrease in feed intake (Nguyen et al., 2005). The authors suggested that selection under restricted feeding would, therefore, increase the partitioning of energy toward growth and decrease the partitioning of energy toward maintenance. Cameron et al. (1994) also suggested that restricted feeding may select animals with higher partitioning of energy toward protein deposition rather than toward fat deposition. However, Kanis (1990) suggested that energy partitioning toward protein deposition was negatively associated with feed intake capacity. Hence, a selection procedure involving restricted feeding may not select the animals with the maximal growth and therefore the maximal protein deposition rate. This could explain the low and non-significant genetic correlation of 0.13 between DGC\_aquarium and DGC\_group. The best overall selection objective may therefore require a little more emphasis on growth rate in group conditions.

Our results on fat content, however, do not follow this hypothesis. Muscle fat content was indeed positively genetically correlated to DGC\_aquarium. Additionally, muscle fat content was also positively correlated to DGC\_fasting, which is a proxy of metabolic rate and feed efficiency: we showed that the most efficient fish in aquariums came from parents with lower weight loss during fasting (F– line) and that the tanks with the best FCR had lower average GEBV for DGC\_fasting. This means that the fish that lost less weight during fasting and the fish that were more efficient in aquariums were genetically fatter. A potential explanation for these results is that fish tolerant to fasting express a more reactive behavior with lower swimming activity. Physical activity may be linked to the energy required for maintenance, as observed in mice by Mousel et al. (2001), who found that mice selected for low heat loss (a measure of metabolic activity) had lower locomotor activity. This potential link between lower activity and tolerance to fasting could cause the higher fat content observed in our fish. This is supported by Simpkins et al. (2003), who showed that, during fasting, the fat content of rainbow trout was significantly higher in sedentary fish than in active fish. Hence, the fish losing less weight during fasting would, in fact, have lower metabolic activity, causing their higher fat content and their better feed efficiency in aquariums and in groups.

These positive genetic correlations between fat content and weight loss at fasting follow earlier results of Grima et al. (2010b). These results contradict the commonly accepted theory that more efficient animals are leaner because protein deposition requires less energy than fat deposition per unit of wet weight gain (Knap and Kause, 2018). This theory is supported in fish by several studies on trout (Quillet et al., 2007; Kause et al., 2016), a species selected for several generations for higher growth rate. Yet, we know that selection for growth rate tends to generate more proactive animals (Sundström et al., 2004; Huntingford and Adams, 2005) that display more aggressive behavior and higher exploratory capacities than wild or unselected animals, among which bigger animals tend to be shy and reactive (Adriaenssens and Johnsson, 2010; Ferrari et al., 2016). In our study, the sea bass were potentially more reactive than commercial populations selected for growth, as they had been selected for only 3 generations (they had wild great-grandparents), and only on their weight loss during fasting, not on growth rate. Thus, from our results, DGC\_aquarium, DGC\_fasting, and muscle\_fat could potentially be used in an index to select genetically superior animals for better feed efficiency. However, the direction in which muscle\_fat should be improved remains uncertain, and the relationship between muscle\_fat, DGC\_aquarium, and DGC\_fasting needs to be verified in current commercially selected populations.

Even though this new phenotyping method gave us essential information for the genetic improvement of FCR in sea bass, it is also tedious and time-consuming. Over 6 months, we could only phenotype 588 fish. This small number of phenotyped fish resulted in a relatively low reliability of our genetic models. For instance, using 340 fish genotyped and phenotyped for DGC\_aquarium to predict the performance of the 80 remaining fish (80–20% ratio), we reached a reliability of only 0.33 using ssGBLUP. This reliability was, however, much higher than that obtained with a PBLUP model using the same data (0.15). This indicates that genomic data would be essential to improve the prediction of EBV in selection candidates when only a relatively small number of fish are phenotyped for DGC\_aquarium. Our reliability estimate with ssGBLUP was slightly lower than that of Bangera et al. (2017) for disease resistance in Atlantic salmon. In that study, the reliability of GEBV calculated with ssGBLUP for resistance to salmon rickettsial syndrome was about 0.41 when using 80% of the phenotyped and genotyped fish to predict the remaining 20%. Our results also showed that the reliability could be increased by phenotyping more fish, as we did not reach a plateau when increasing the number of fish in the training group. However, these reliability results must be interpreted with care, as the formula used to calculate the reliability is an approximation of the accuracy (Gunia et al., 2014). In order to estimate the true reliability of GEBV, the G3 fish phenotyped for DGC\_aquarium and genotyped could be crossed to generate a G4 in a future experiment. Then, by phenotyping several G4 fish for DGC\_aquarium, we could estimate a proxy of the true breeding value of the G3 fish. Finally, the GEBV calculated previously could be correlated to these true breeding values to obtain a better estimate of the accuracy of the ssGBLUP model. Such a procedure has been implemented in rainbow trout for resistance to bacterial cold water disease (Vallejo et al., 2017), and showed that the predictive ability of genomic predictions was twice as high as that of traditional pedigree BLUP. This confirms the importance of genomic data for the genetic improvement of traits that are difficult to record, such as disease resistance and FCR.

An important aspect for the practical applicability of this method is therefore its cost-benefit ratio. Based on the present experiment, the cost of this selection method applied to 588 fish was about 50 k€: 26 k€ for phenotyping and 24 k€ for genotyping (60 € per fish for 400 G3 fish). While the cost of genotyping tends to decrease, the phenotyping cost remains substantial considering manpower (≈14 k€, 930 h at 15 €/h over 6 months) and infrastructure (12 k€ per year for 200 aquariums and their recirculating system). The genetic gain for FCR obtained from this method has yet to be demonstrated with a response to selection experiment, but we can roughly estimate the response that could be achieved. The difference in group FCR between the best and the worst fish was about 2% (**Figure 8**). With an estimated heritability of 0.75 for DGC\_aquarium, the potential gain per generation could be 1.5%. In an integrated fish farm (producing its own juveniles) that produces 3,000 tons of fish per year, feed consumption is about 4,500 tons per year (FCR = 1.5), for a total cost of 6,750 k€ per year. A gain of 1.5% would therefore save the company about 100 k€ per year, which is more than the cost of selection.
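The arithmetic behind this rough estimate can be retraced in a few lines. The sketch below only recombines the figures quoted in the paragraph above; the variable names are illustrative, and treating the saving as the feed bill multiplied by the relative FCR gain is a simplifying assumption, not a result of the study.

```python
# Back-of-the-envelope check of the cost-benefit figures quoted in the text.
# Every number comes from the paragraph above; names are illustrative only.

production_t = 3_000        # tons of fish produced per year
fcr = 1.5                   # feed conversion ratio (kg feed per kg gain)
feed_cost_keur = 6_750      # annual feed bill in k EUR

feed_t = production_t * fcr                       # tons of feed per year
feed_price_eur_per_kg = feed_cost_keur / feed_t   # k EUR per ton equals EUR per kg

# Rough response: ~2% difference in group FCR scaled by the heritability
# used in the text for DGC_aquarium (an approximation, not a prediction).
group_fcr_difference = 0.02
heritability = 0.75
gain_per_generation = group_fcr_difference * heritability

# Assumption: the feed bill shrinks in proportion to the FCR gain.
annual_saving_keur = feed_cost_keur * gain_per_generation

# Cost side, in k EUR.
manpower_keur = 930 * 15 / 1_000          # 930 h at 15 EUR/h
infrastructure_keur = 12                  # 200 aquariums, per year
phenotyping_keur = 26                     # figure quoted in the text
genotyping_keur = 400 * 60 / 1_000        # 400 G3 fish at 60 EUR each
selection_cost_keur = phenotyping_keur + genotyping_keur

print(f"feed used: {feed_t:.0f} t/y at {feed_price_eur_per_kg:.2f} EUR/kg")
print(f"gain per generation: {gain_per_generation:.1%}")
print(f"saving: {annual_saving_keur:.2f} k EUR/y vs cost {selection_cost_keur:.0f} k EUR")
```

Under these assumptions the saving comes out slightly above 100 k€ per year against a selection cost of about 50 k€, which matches the orders of magnitude given in the text.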

Despite the potential economic gain, we could only phenotype juveniles of about 25 g, to fit the size of the aquariums (10 L). However, the trait we wish to improve is FCR at commercial size, because the animals consume larger amounts of feed in the later stages of production, which further increases the interest of improving FCR at these late stages. Based on our results, we cannot tell whether the most efficient fish when weighing 25 g will also be the most efficient at commercial size (450 g). Nevertheless, the feed efficiency of groups was measured when fish weighed from 105 g to 200 g (on average), and the link between DGC\_aquarium and FCR\_group suggests that the most efficient fish early in life (in aquariums) tend to stay the most efficient later in life (in groups). Ideally, individual FCR should be measured a second time later in the life of the fish. However, previous experiments have already revealed that bigger sea bass do not feed as easily as juveniles, or even do not feed at all, when isolated (Ferrari et al., 2015). Therefore, measuring growth rate in aquariums at commercial size is a priori not feasible in sea bass. One way to overcome this issue would be to find traits that integrate the efficiency of the animal over its entire life. This could be done using mechanistic animal growth models. Such models aim to describe the growth of an individual based on underlying biological parameters used to estimate energy uptake, storage, and utilization. These biological parameters can be, for instance, routine metabolic rate or allocation to soma. Each of them could then be estimated for each individual by optimizing the model's parameters. In our case, the optimization would be done by fitting the predicted weight and the predicted feed intake of an individual to its weight measured throughout its life and its feed intake measured as a juvenile in an aquarium. With this approach, we may be able to highlight potential genetic variation within a population and to find heritable model parameters potentially related to feed efficiency over an entire life. Such parameters could then be improved with a breeding program. A similar approach has been presented by Doeschl-Wilson et al. (2007), who used model inversion to obtain estimates of the phenotypic and genetic components of the biological traits in a mechanistic growth model for pigs. The results of that study suggest that such mechanistic growth models can be useful to animal breeding through the introduction of new biological traits that are less influenced by environmental factors than the phenotypic traits currently used and that are valid throughout the life of the individuals.

# DATA AVAILABILITY

The raw data supporting the conclusions of this manuscript can be found on SEANOE (https://doi.org/10.17882/58267).

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of Directive 2010-63-EU on the protection of animals used for scientific purposes. The protocols were approved by C2EA−36 (Comité d'éthique en expérimentation animale Languedoc-Roussillon) under authorizations APAFIS#1362-2015071718471856\_v4 and APAFIS#9877-2017042614262200\_v2.

# AUTHOR CONTRIBUTIONS

MB, FA, AV, BC, and MV designed the animal experiments. MB, AV, and FC performed the animal experiment. MB, FA, and MV performed the analysis. MB wrote the manuscript. MV and FA revised the manuscript. All authors read and approved the manuscript.

# FUNDING

This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 654008 (EMBRIC).

# ACKNOWLEDGMENTS

The authors are grateful to Sébastien Ferrari who performed the pre-tests of individual rearing in the aquarium facility.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00219/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Besson, Allal, Chatain, Vergnet, Clota and Vandeputte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chromosome-Level Assembly of the Chinese Seabass (Lateolabrax maculatus) Genome

Baohua Chen1,2, Yun Li <sup>3</sup> , Wenzhu Peng1,4, Zhixiong Zhou1,4, Yue Shi 1,4, Fei Pu1,2 , Xuan Luo1,4, Lin Chen1,4 and Peng Xu2,4,5 \*

*<sup>1</sup> State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen, China, <sup>2</sup> Shenzhen Research Institute of Xiamen University, Shenzhen, China, <sup>3</sup> The Key Laboratory of Mariculture, Ministry of Education, Ocean University of China, Qingdao, China, <sup>4</sup> State-Province Joint Engineer Laboratory of Marine Bioproducts and Technology, College of Ocean and Earth Sciences, Xiamen University, Xiamen, China, <sup>5</sup> Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China*

Keywords: Lateolabrax maculatus, seabass, genome assembly, Hi-C, teleost genome

# INTRODUCTION

### Edited by:

*Hans Cheng, U.S. National Poultry Research Center (ARS-USDA), United States*

### Reviewed by:

*Geoff Waldbieser, Warmwater Aquaculture Research Unit (ARS-USDA), United States Yniv Palti, Cool and Cold Water Aquaculture Research (ARS-USDA), United States*

> \*Correspondence: *Peng Xu xupeng77@xmu.edu.cn*

### Specialty section:

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

Received: *03 December 2018* Accepted: *12 March 2019* Published: *04 April 2019*

### Citation:

*Chen B, Li Y, Peng W, Zhou Z, Shi Y, Pu F, Luo X, Chen L and Xu P (2019) Chromosome-Level Assembly of the Chinese Seabass (Lateolabrax maculatus) Genome. Front. Genet. 10:275. doi: 10.3389/fgene.2019.00275*

The Chinese seabass (Lateolabrax maculatus), which inhabits inshore rocky reefs and estuaries and tolerates a broad range of salinity, is a euryhaline teleost native to the marginal seas of the Northwest Pacific Ocean (**Figure 1A**). It belongs to the genus Lateolabrax and was first described as a geographic population of the Japanese seabass Lateolabrax japonicus. It was recently re-described as a species distinct from L. japonicus, based on differences in morphological traits and on molecular phylogenies (Liu et al., 2006; An et al., 2014). In contrast with the geographically restricted L. japonicus, L. maculatus is broadly and continuously distributed along the coasts of China and the Indo-China Peninsula (Yokogawa and Seki, 1995). The northernmost wild habitat of L. maculatus lies at about 41° N in the temperate Bohai Gulf, and the southernmost reaches at least 20° N in the tropical Beibu Gulf; the sea surface temperature difference between the two frequently reaches 18 °C in winter. These very different environments make the L. maculatus populations of the Bohai Gulf and the Beibu Gulf divergent in genetic structure and in phenotypes such as life history, behavior, and breeding season, providing a feasible fish model for population genetic studies in a continuous marginal sea (Zhao et al., 2018). In addition, L. maculatus is recognized as one of the most important mariculture fish in China, with an annual production of over 120,000 tons. Recently, a reference genome of L. maculatus derived from the northern population in the Bohai Gulf has been reported (Shao et al., 2018). Herein, we report a chromosome-level genome assembly of L. maculatus from the southern population in the subtropical region, which provides an important resource not only for basic ecological and population genetic studies but also for the upcoming breeding program of Chinese seabass.

# DATA

A whole genome shotgun (WGS) strategy was employed in this project. After removal of redundant and low-quality reads, a total of 112.76 Gb (188.87X) of clean WGS reads were obtained, including 49.63, 31.89, 19.69, and 11.66 Gb from the 250 bp, 2 Kbp, 5 Kbp, and 10 Kbp libraries, respectively, for genome size estimation, de novo contig assembly, and primary scaffolding. High-throughput chromosome conformation capture (Hi-C) sequencing was performed for the construction of chromosome-level scaffolds. A total of 159.54 Gb of paired-end Hi-C reads were generated, with an average sequencing coverage of 267.06X (**Table 1**).

**FIGURE 1** | […] negative strand (blue), and scaffolds which comprised the chromosome (adjacent contigs on a chromosome are painted in different colors). (C) Divergence distribution of TEs in the *L. maculatus* genome. (D) A Venn diagram indicating the number of genes predicted by three different approaches. (E) A Venn diagram showing orthologous gene families across five fish genomes. (F) Evolutionary relationships among eight species.

Based on all the reads mentioned above, we assembled de novo a draft L. maculatus genome of 597.39 Mb in 1,639 scaffolds, with a contig N50 of 182.31 kb and a scaffold N50 of 2.79 Mb. After integrating the scaffolds with the Hi-C contact map, we obtained 24 chromosomes constructed from 419 scaffolds (25.56% of all scaffolds) with a total length of 586.03 Mb (98.10% of the total scaffold length) (**Table 1** and **Figure 1B**). This new reference genome is a substantial improvement on the previous reference genome of the northern population, which has a contig N50 of 31 Kb, a scaffold N50 of 1,040 Kb, and a chromosome integration rate of 77.68% (Shao et al., 2018).
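For readers less familiar with the N50 statistic used above, the following minimal sketch shows one common way to compute it and re-derives the two integration percentages from the counts quoted in the text. The function name and the toy lengths are illustrative assumptions; this is not the software used for the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths): total = 300, half = 150.
# 80 + 70 = 150 reaches the halfway point, so the N50 is 70.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)

# Integration rates recomputed from the figures quoted in the text.
anchored_scaffold_fraction = 419 / 1_639        # scaffolds placed on chromosomes
anchored_length_fraction = 586.03 / 597.39      # Mb anchored / Mb assembled

print(toy_n50)
print(f"{anchored_scaffold_fraction:.2%} of scaffolds, "
      f"{anchored_length_fraction:.2%} of assembled length")
```

The contrast between the two fractions illustrates why a small share of scaffolds can still carry almost the whole assembly: the anchored scaffolds are the long ones.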

A total of 105.5 Mb (∼17.66% of L. maculatus genome) were identified as repetitive elements in the L. maculatus genome, including 6.09% of DNA transposons, 4.99% of long interspersed nuclear elements (LINEs), and 2.31% of long terminal repeats (LTRs) (**Figure 1C**).



Gene structure prediction identified 23,657 protein-coding genes, of which 22,509 genes can be annotated against at least one database (**Figure 1D**), and 1,734 candidate non-coding RNAs, including 676, 644, 99, and 315 miRNA, tRNA, rRNA, and snRNA genes, respectively (**Table 1**).

To evaluate the accuracy of the genome assembly, we mapped the Illumina short reads used for assembly back to the genome and identified 904,102 heterozygous SNPs and 12,050 false homozygous SNPs, accounting for 0.1557 and 0.0004% of the reference genome, respectively. These homozygous SNPs are considered false because, at these sites, the short reads carry a single allele (homozygous in the short-read data) that differs from the reference genome. The low rate of false SNPs suggests that the genome assembly is highly accurate.

The completeness and connectivity of this assembly were assessed using both the Core Eukaryotic Genes Mapping Approach (CEGMA) and the Benchmarking Universal Single-Copy Orthologs (BUSCO) approach. Two hundred and thirty-five of the complete set of 248 Core Eukaryotic Genes (CEGs; 94.76%) were covered by the assembly, and 818 of the 843 searched BUSCOs (97.03%) were completely assembled in the draft genome, suggesting a high level of completeness and connectivity of the de novo assembly (**Table 1**).

For better use of this dataset, the evolutionary position of L. maculatus was assessed based on single-copy genes of L. maculatus and seven related species (T. rubripes, G. aculeatus, O. latipes, D. rerio, O. niloticus, L. calcarifer, and D. labrax).

The protein sequences were downloaded from the Ensembl Core database (release 90). After removing protein sequences shorter than 50 amino acids, the set of 245,644 consensus protein sequences of the seven teleosts and the Chinese seabass L. maculatus was used to construct gene families. As a result, a total of 20,788 OrthoMCL families were built (**Figure 1E**), and 667 single-copy ortholog protein families present in a 1:1:1 manner in all eight teleost species were used for phylogenetic analysis (**Figure 1F**).
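The selection of single-copy (1:1:1) families described above can be illustrated with a short sketch. It assumes that the gene families have already been parsed into a dictionary mapping a family identifier to a list of (species, gene) pairs; this data layout, the function name, and the toy identifiers are hypothetical and do not reproduce the OrthoMCL output format or the authors' scripts.

```python
from collections import Counter


def single_copy_families(families, species):
    """Keep families with exactly one gene from every species in `species`
    and no genes from any other species.

    families: dict mapping family id -> list of (species, gene_id) tuples.
    Returns a dict mapping family id -> {species: gene_id}.
    """
    wanted = set(species)
    kept = {}
    for family_id, members in families.items():
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            kept[family_id] = dict(members)
    return kept


# Toy input with three species instead of eight (hypothetical identifiers).
toy_species = ["L_maculatus", "D_labrax", "D_rerio"]
toy_families = {
    "fam1": [("L_maculatus", "g1"), ("D_labrax", "g2"), ("D_rerio", "g3")],
    "fam2": [("L_maculatus", "g4"), ("D_labrax", "g5"), ("D_labrax", "g6"),
             ("D_rerio", "g7")],                      # duplicated in D_labrax
    "fam3": [("L_maculatus", "g8"), ("D_rerio", "g9")],  # missing D_labrax
}
toy_single_copy = single_copy_families(toy_families, toy_species)
print(sorted(toy_single_copy))
```

Only `fam1` survives in the toy input, because the other two families either contain a duplicate or lack one species.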

# MATERIALS AND METHODS

# Sample Collection, Library Construction, and Sequencing

A wild adult female Chinese seabass was collected in Xiamen Bay, Fujian, China, and a blood sample was taken. The body weight and total length of this fish were 524.6 g and 34.5 cm, respectively. Total DNA and RNA were extracted for whole genome sequencing and whole transcriptome sequencing following our previous studies (Jiang et al., 2014; Peng et al., 2016). Four whole-genome shotgun sequencing libraries were prepared with insert sizes ranging from 250 bp to 10 Kbp (250 bp, 2 Kbp, 5 Kbp, 10 Kbp). The 250 bp paired-end library was constructed for de novo contig assembly and the other three mate-pair libraries were constructed for scaffolding contigs. Before sequencing, a quality control step was performed on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) by evaluating the distribution of fragment lengths. The libraries were then sequenced on the Illumina HiSeq 2500 platform with a read length of 2 × 150 bp.

High-throughput chromosome conformation capture (Hi-C) was performed in parallel with the Illumina sequencing. Samples collected from muscle tissue were snap frozen in liquid nitrogen for 30 min and then stored at −80 °C until DNA extraction. First, the DNA was fixed with formaldehyde to maintain its conformation. It was then digested with the MboI restriction enzyme and repaired with biotinylated residues to form blunt-end fragments. After in situ ligation of these fragments, the DNA was reverse-crosslinked and purified. Before sequencing, end repair, adaptor ligation, and polymerase chain reaction were performed successively. Finally, the Hi-C libraries were sequenced on the Illumina HiSeq 2500 platform with a read length of 2 × 150 bp.

# Genome Assembly

An Illumina read pair was filtered out if either read of the pair met any of the following criteria: it contained adaptor sequences; the proportion of uncertain bases (represented by "N") exceeded 10%; or the proportion of low-quality bases (Q < 5) exceeded 50%. After this strict filtering, all clean Illumina reads were used to generate 17-mers with a sliding-window method. There are 4<sup>17</sup> possible distinct 17-mers. After calculating the depth distribution of these 17-mers using Jellyfish (Marcais and Kingsford, 2011) (version 2.2.5), we estimated the genome size using the Lander/Waterman equations:

$$C_{\text{base}} = C_{17\text{-mer}} \times \frac{L}{L - 17 + 1} \tag{1}$$

$$G_{\text{est}} = \frac{N_{17\text{-mer}}}{C_{17\text{-mer}}} = \frac{N_{\text{base}}}{C_{\text{base}}} \tag{2}$$

In these equations, L is the read length (150 for Illumina reads); N<sub>base</sub> and N<sub>17-mer</sub> are the counts of bases and 17-mers; C<sub>base</sub> and C<sub>17-mer</sub> are the expected coverage depths of bases and 17-mers; and G<sub>est</sub> is the estimated genome size. The genome size of Lateolabrax maculatus was thus estimated to be 641.02 Mb, which is similar to that of Asian seabass (Lates calcarifer, 668.5 Mb) (Vij et al., 2016) and European seabass (Dicentrarchus labrax, 675 Mb) (Tine et al., 2014).
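Equations (1) and (2) can be sketched in a few lines of Python. The example assumes that a k-mer depth histogram (depth mapped to the number of distinct k-mers at that depth, as a k-mer counter would report) is already available as a dictionary; the function names, the `min_depth` cutoff used to skip low-depth error k-mers, and the toy numbers are assumptions made for illustration, not the procedure or values used in this study.

```python
def peak_depth(histogram, min_depth=3):
    """Return the depth with the most distinct k-mers, ignoring depths below
    `min_depth` (a hypothetical cutoff for sequencing-error k-mers)."""
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no depth at or above min_depth")
    return max(candidates, key=candidates.get)


def base_depth(kmer_depth, read_length, k=17):
    """Equation (1): convert k-mer depth to base depth."""
    return kmer_depth * read_length / (read_length - k + 1)


def estimate_genome_size(n_kmers, kmer_depth):
    """Equation (2): total number of k-mers divided by expected k-mer depth."""
    return n_kmers / kmer_depth


# Toy histogram: the error k-mers at depth 1-2 are skipped, the main peak is 13.
toy_histogram = {1: 5_000, 2: 900, 12: 300, 13: 800, 14: 350}
toy_peak = peak_depth(toy_histogram)

# Idealized check: 100 reads of 150 bp drawn from a 1,000 bp genome.
reads, read_length, k = 100, 150, 17
n_bases = reads * read_length                 # 15,000 bases
n_kmers = reads * (read_length - k + 1)       # 13,400 17-mers
c_kmer = n_kmers / 1_000                      # expected 17-mer depth
c_base = base_depth(c_kmer, read_length, k)   # expected base depth

size_from_kmers = estimate_genome_size(n_kmers, c_kmer)
size_from_bases = n_bases / c_base            # second form of equation (2)
print(toy_peak, round(c_base, 2), round(size_from_kmers), round(size_from_bases))
```

In the idealized case both forms of equation (2) return the same genome size, which is the consistency the two equations are meant to express.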

For de novo genome assembly, high-quality reads from the short-insert library (250 bp) were assembled using SOAPdenovo2 (Luo et al., 2012) with optimized parameters to build initial contigs. Long-insert reads were then mapped onto the de novo assembled contigs for scaffolding (2, 5, and 10 Kbp, in turn). GapCloser (Luo et al., 2012) was then used to close the gaps in scaffolds using the paired-end reads for which one end mapped uniquely to a contig and the other was located within a gap.

To obtain a chromosome-level genome assembly, Hi-C reads were filtered in the same way as the short-insert library reads and subsequently mapped to the de novo assembled scaffolds using bwa (Li and Durbin, 2009) (version 0.7.17) with default parameters, in order to construct contacts among scaffolds. The resulting BAM files, which contain the Hi-C read-pair linkage information, went through another round of filtering in which reads located further than 500 bp from the nearest restriction enzyme site were removed. LACHESIS (Korbel and Lee, 2013) (version 2e27abb) was then used for chromosome-level scaffolding by clustering, ordering, and orienting the de novo assembled scaffolds based on the genomic proximity information carried by the Hi-C read pairs. In these steps, all parameters were left at their defaults except CLUSTER\_N, CLUSTER\_MIN\_RE\_SITES, and ORDER\_MIN\_N\_RES\_IN\_SHREDS, which were set to 24, 80, and 10, respectively. Note that the parameter CLUSTER\_N specifies the number of chromosomes.
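The distance-based filtering step can be sketched as follows. MboI recognizes the sequence GATC; everything else in this example is an assumption made for illustration, including the function names, the use of the read's mapped start coordinate to measure the distance, and the rule that both ends of a pair must pass. It is not the filter implemented in the tools cited above.

```python
import bisect


def site_positions(sequence, site="GATC"):
    """Return the sorted 0-based start positions of `site` in `sequence`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def distance_to_nearest_site(position, sites):
    """Distance from `position` to the closest entry of the sorted list `sites`."""
    if not sites:
        return float("inf")
    i = bisect.bisect_left(sites, position)
    neighbours = sites[max(i - 1, 0):i + 1]   # the site before and the site after
    return min(abs(position - s) for s in neighbours)


def keep_pair(pos1, sites1, pos2, sites2, max_distance=500):
    """Keep a Hi-C pair only if both ends lie within `max_distance` of a site
    on their respective scaffolds (assumed rule)."""
    return (distance_to_nearest_site(pos1, sites1) <= max_distance
            and distance_to_nearest_site(pos2, sites2) <= max_distance)


# Toy scaffold with a single MboI site at position 1,000.
toy_scaffold = "A" * 1_000 + "GATC" + "A" * 1_000
toy_sites = site_positions(toy_scaffold)

print(toy_sites)
print(keep_pair(600, toy_sites, 1_500, toy_sites))   # both ends within 500 bp
print(keep_pair(400, toy_sites, 1_500, toy_sites))   # first end is 600 bp away
```

Pre-computing and sorting the site positions once per scaffold keeps each lookup logarithmic, which matters when hundreds of millions of read pairs are screened.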

Both karyotype analysis and the recently published genome assembly for Chinese seabass (spotted seabass) indicate that this species has 24 chromosomes (Sola et al., 1993; Shao et al., 2018). In addition, the genetic maps of two other species in Perciformes, European seabass (Dicentrarchus labrax) and Asian seabass (Lates calcarifer), both contain 24 linkage groups (Wang et al., 2011; Tine et al., 2014).

# Repetitive Elements Characterization

We employed two approaches to detect repeat sequences in the L. maculatus genome. First, we used Tandem Repeats Finder (Benson, 1999) (version 4.04), Piler (Edgar and Myers, 2005) (version 1.0), LTR\_FINDER (Xu and Wang, 2007) (version 1.0.2), RepeatModeler (Tarailo-Graovac and Chen, 2009) (version 1.04), and RepeatScout (Price et al., 2005) (version 1.0.2) in parallel to detect various kinds of repeat sequences in the L. maculatus genome. The results were then combined into a single de novo repeat sequence library by Uclust (Edgar, 2010) (version 1.2.22q). Subsequently, the whole library was annotated using RepeatMasker (Tarailo-Graovac and Chen, 2009) (version 3.2.9) based on Repbase TE (Jurka et al., 2005) (version 14.04) to discriminate between known and novel transposable elements (TEs). In the second approach, the assembled genome sequences were searched against Repbase TE (Jurka et al., 2005) (version 14.04) using RepeatProteinMask (Tarailo-Graovac and Chen, 2009) (version 3.2.2), a Perl script included in RepeatMasker, to detect TE proteins in the L. maculatus genome. The results of the two approaches were combined and redundancy was removed to obtain the final set of repetitive elements.

# Gene Structure Prediction

To obtain a fully annotated L. maculatus genome, three different approaches were employed to predict protein-coding genes. Ab initio gene prediction was performed on the repeat-masked L. maculatus genome assembly using Augustus (Stanke and Morgenstern, 2005) (version 2.5.5), GlimmerHMM (Majoros et al., 2004) (version 3.0.1), SNAP (Korf, 2004) (version 1.0), Geneid (Parra et al., 2000) (version 1.4.4), and GenScan (Burge and Karlin, 1997) (version 1.0). Furthermore, homology-based prediction was performed using downloaded protein sequences of closely related teleosts including Takifugu rubripes (Aparicio et al., 2002), Gasterosteus aculeatus (Jones et al., 2012), Oryzias latipes (Kasahara et al., 2007), Danio rerio (Howe et al., 2013), Oreochromis niloticus (Brawand et al., 2014), Lates calcarifer (Vij et al., 2016), Larimichthys crocea (Ao et al., 2015), and Cynoglossus semilaevis (Chen et al., 2014). These protein sequences were mapped onto the assembly using blat (Kent, 2002) (version 35) with an e-value ≤1e-5. GeneWise (Birney et al., 2004) (version 2.2.0) was then employed to align the homologs in the L. maculatus genome against those of the other species for gene structure prediction. In addition, we applied transcriptome-based prediction using RNA-seq datasets from a pooled cDNA library of 12 tissues from the same fish that was used for whole genome sequencing. The RNA-seq reads were mapped onto the genome assembly using TopHat (Trapnell et al., 2009) (version 1.2). The structures of all transcribed genes were predicted by Cufflinks (Trapnell et al., 2010) (version 2.2.1) with default parameters. The gene sets predicted by the three approaches were then integrated into a non-redundant gene set using EvidenceModeler (Haas et al., 2008) (version 1.1.0). PASA (Haas et al., 2003) (version 2.0.2) was then used to annotate the gene structures. To identify candidate non-coding RNA (ncRNA) genes, we aligned the repeat-masked genome sequences against the Rfam database (Burge et al., 2013) (version 11.0) using BLASTN to search for homologs.

# Functional Annotation of Genes

Genes identified by structure prediction were subsequently functionally annotated by BLAST searches against the NCBI nr and SwissProt protein databases. The unidirectional best hit of each L. maculatus gene was assigned as its homolog after discarding alignments with an E-value >1 × 10<sup>−5</sup>. Gene ontology (GO) annotations were assigned using the InterProScan program (version 5.26) (Quevillon et al., 2005). KEGG annotation was performed against the KEGG database using the KEGG Automatic Annotation Server (KAAS) (Moriya et al., 2007).
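The unidirectional best-hit assignment can be sketched as follows, assuming the usual convention that alignments with an E-value above 1 × 10⁻⁵ are discarded and that the BLAST results are in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The function name, the tie-breaking rule on bit score, and the toy identifiers are illustrative assumptions rather than the authors' pipeline.

```python
def best_hits(rows, max_evalue=1e-5):
    """Return {query: subject} with the lowest-E-value hit per query.

    rows: iterable of tab-separated lines in 12-column BLAST tabular format.
    Hits with an E-value above `max_evalue` are discarded; ties on E-value
    are broken by the higher bit score (assumed rule).
    """
    best = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return {query: hit[0] for query, hit in best.items()}


# Toy tabular output with hypothetical gene and protein identifiers.
toy_rows = [
    "gene1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene1\tprotB\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t420",
    "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    "gene3\tprotD\t60.0\t120\t48\t0\t1\t120\t1\t120\t1e-20\t110",
    "gene3\tprotE\t61.0\t120\t47\t0\t1\t120\t1\t120\t1e-20\t115",
]
toy_assignment = best_hits(toy_rows)
print(toy_assignment)
```

In the toy data, `gene2` receives no homolog because its only hit is too weak, and `gene3` is assigned to the hit with the higher bit score among two equal E-values.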

# The Completeness and Accuracy of the Assembly

The completeness and accuracy of the assembly were further assessed. We mapped the Illumina short reads that were used for genome assembly back to the assembly using bwa (Li and Durbin, 2009) (version 0.7.17-r1188). The BAM files containing the mapping information were then piled up using samtools (Li et al., 2009) (version 1.8) to identify SNPs, using thresholds of read depth >10 and quality score >20. The assembly completeness was then evaluated with the Core Eukaryotic Genes Mapping Approach (CEGMA) (Parra et al., 2007) (version 2.3) and the Benchmarking Universal Single-Copy Orthologs (BUSCO) software (Simao et al., 2015) (version 1.22) using the vertebrate-specific database (vertebrata\_odb9).

# Ortholog Analysis

Single-copy genes in L. maculatus and related species were identified based on gene families constructed from protein sequences of all species using OrthoMCL (Li et al., 2003) and BLASTP with default parameters. Because no CDS sequences corresponding to the Asian seabass proteins used in the gene family analysis were available, the Asian seabass transcripts were translated into proteins using ORFinder (version 0.4.1). Single-copy ortholog proteins were aligned by MUSCLE (Edgar, 2004) (version 3.8.31). Subsequently, all obtained alignments were converted to their corresponding coding DNA sequences using an in-house Python script.

A combined "supergene" was constructed from all the translated coding DNA alignments for minimum evolution (ME) phylogenetic tree construction using MEGA (Kumar et al., 2016) (Version 7.0.26).

# ETHICS STATEMENT

This study was approved by the Animal Care and Use Committee, College of Ocean and Earth Sciences, Xiamen University. The methods were carried out in accordance with approved guidelines.

# AUTHOR CONTRIBUTIONS

PX conceived the study. BC, WP, and YL performed bioinformatics analysis. YL, FP, YS, and XL collected samples. ZZ and LC extracted DNA and RNA. ZZ and YS performed the quality control. BC and PX wrote the manuscript. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the Knowledge Innovation Program of Shenzhen City (Fundamental Research, Free Exploration, No. JCYJ20170818142601870) and Fundamental Research Funds for the Central Universities, Xiamen University (No. 20720160110).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer GW declared a past co-authorship with one of the authors YL to the handling editor.

Copyright © 2019 Chen, Li, Peng, Zhou, Shi, Pu, Luo, Chen and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Impact of Chronic Heat Stress on the Growth, Survival, Feeding, and Differential Gene Expression in the Sea Urchin Strongylocentrotus intermedius

Yaoyao Zhan, Jiaxiang Li, Jingxian Sun, Weijie Zhang, Yingying Li, Donyao Cui, Wanbin Hu and Yaqing Chang\*

Key Laboratory of Mariculture & Stock Enhancement in North China's Sea, Ministry of Agriculture and Rural Affairs, Dalian Ocean University, Dalian, China

Edited by:

Peng Xu, Xiamen University, China

### Reviewed by:

Yang Yu, Institute of Oceanology (CAS), China Jie Mei, Huazhong Agricultural University, China

> \*Correspondence: Yaqing Chang yqkeylab@hotmail.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 12 December 2018 Accepted: 19 March 2019 Published: 04 April 2019

### Citation:

Zhan Y, Li J, Sun J, Zhang W, Li Y, Cui D, Hu W and Chang Y (2019) The Impact of Chronic Heat Stress on the Growth, Survival, Feeding, and Differential Gene Expression in the Sea Urchin Strongylocentrotus intermedius. Front. Genet. 10:301. doi: 10.3389/fgene.2019.00301

To explore the impact of chronic heat stress on commercial echinoderms, the present study assessed the effects of chronic high temperature on the growth, survival, feeding, and differential gene expression in the sea urchin Strongylocentrotus intermedius cultured in the northern Yellow Sea in China. One suitable seawater condition (20°C) and one laboratory-controlled high temperature condition (25°C) were set up. After 28 days of incubation, our results showed that: (1) the specific growth, survival, and ingestion rates of S. intermedius reared under high temperature (25°C) decreased compared to those reared under optimal temperature (20°C) conditions; (2) comparative transcriptome analysis identified 2,125 differentially expressed genes (DEGs) in S. intermedius reared under high temperature (25°C) compared to those subjected to the optimal temperature condition (20°C), which included 1,015 upregulated and 1,110 downregulated genes. The accuracy of the transcriptome profiles was verified by quantitative real-time PCR (qRT-PCR). Further Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses revealed that these DEGs were mainly enriched in the functional categories of ribosome, protein processing in endoplasmic reticulum, and prion diseases. A total of 732 temperature-induced expressed genes, such as ATP5, heat shock protein 70, and heat shock protein 90, were identified as candidates that were closely correlated with heat resistance in S. intermedius. Differentially expressed transcription factors (TFs), such as AP-1, Fos, CREB, and ZNF, were also identified as potential regulators of the molecular network associated with responses to heat stress in sea urchins.
Observations in the present study provide additional information that improves our understanding of the molecular mechanism of temperate echinoid species in response to heat stress, as well as a theoretical basis for the molecular-assisted breeding of heat-resistant sea urchins.

Keywords: heat stress, Strongylocentrotus intermedius, growth, survival, feeding, transcriptome

# INTRODUCTION


Seawater temperature has been proven to be a major environmental factor affecting echinoderms from the biomacromolecular to ecological levels. With climate-induced ocean warming, extensive efforts have been made in studying the impact of elevated seawater temperatures on echinoderms. Laboratory-based studies have demonstrated that increased seawater temperature affects early development, survival, growth, metabolism, immunity, behavior, and gene expression profiles in echinoderms. For example, with the elevation of near-future seawater temperature, fertilization and early development in the sea urchin Heliocidaris erythrogramma would be compromised (Byrne et al., 2009). Increased seawater temperatures can reduce both the specific growth rate (SGR) and contents of highly unsaturated fatty acids (HUFAs) in juvenile sea cucumber (Yu et al., 2016). It has been demonstrated that the parental effect of long acclimatization could increase thermal tolerance in juvenile sea cucumber Apostichopus japonicus (Wang et al., 2015). The existence of species-specific innate immune response variations was investigated in the tropical subtidal sea urchin Lytechinus variegatus and the intertidal sea urchin Echinometra lucunter while coping with rising sea temperatures (Branco et al., 2013). In addition, negative effects of elevated seawater temperature on covering and righting behaviors were observed in the sea urchins L. variegatus and Strongylocentrotus intermedius (Brothers and Mcclintock, 2015; Zhang et al., 2017). A comparative transcriptome study indicated alterations in gene expression profiles under mild, chronic increases in temperature stress in embryos of the sea urchin Strongylocentrotus purpuratus (Runcie et al., 2012).
A recent study also showed that the response of juvenile sea urchin Loxechinus albus to acute increases in sea temperature is an integrated differential gene regulatory network that includes heat-shock, membrane potential, and detoxification (Vergara-Amado et al., 2017).

The temperate edible sea urchin S. intermedius is naturally distributed along the intertidal and subtidal rocky bottom of Hokkaido, Japan, the Korean Peninsula, and the Russian Far East (Chang et al., 2004; Lawrence, 2013). This species has an average lifespan of 8–10 years, and the sexual maturity age is 1.5–2 years. The thermal tolerance of this species is from −1 to 23°C (Chang et al., 2004), and the suitable sea temperature range for the growth of this species is 15–20°C. In 1989, this species was introduced from Japan to north China by the Dalian Ocean University, and artificial breeding was subsequently performed. To date, S. intermedius has been the predominant commercially valuable sea urchin species widely cultivated along the coastal areas of the north Yellow Sea in China (Chang et al., 2004). Due to global ocean warming, sea water temperatures in the north Yellow Sea in China have often been higher than 25°C (the lethal limit of S. intermedius) in the summer in the past few years (Zeng et al., 2006), resulting in the massive death of cultured S. intermedius. The sustainable development of S. intermedius farming and industry, therefore, is under serious threat. Our previous study demonstrated the existence of genotype by temperature interactions (GEI) in the survival rate (SR) in the selection of S. intermedius (Chang et al., 2016); however, the response to heat stress, especially the corresponding gene expression mechanism in S. intermedius, remains unclear.

In the present study, we investigated the impact of high water temperatures on the growth, survival, and feeding of S. intermedius. Then, we identified candidate genes that were closely correlated to heat tolerance in S. intermedius by comparative transcriptome analysis between suitable (20°C, as control) and high temperature (25°C) seawater conditions. A gene regulatory network related to heat tolerance was also predicted by setting up relationships between candidate genes and differentially expressed transcription factors (TFs). The findings of this study enrich our knowledge of the molecular responses of sea urchins to heat stress, and provide candidate genes that can serve as molecular markers that could potentially be used in the selective breeding of heat-tolerant S. intermedius.

# MATERIALS AND METHODS

# Sea Urchins and Treatments

A total of 360 S. intermedius (average test diameter: 10 ± 0.1 mm) were transported from Dalian Haibao Fisheries Company to the Key Laboratory of Mariculture & Stock Enhancement in the North China's Sea, Ministry of Agriculture and Rural Affairs at the Dalian Ocean University in August 2015. All of the sea urchins were kept in ∼60-L recirculating sea water tanks; each tank was fitted with an automatic temperature control and monitoring system (Dalian Huixin Titanium Equipment Development Co., Ltd., Liaoning, China). Seawater was sand filtered and continuously aerated. The animals were kept under natural light. All of the specimens were fed kelp (Laminaria japonica). Sea urchins were acclimated to default laboratory conditions [18 ± 0.5°C and 31.22 ± 0.14 practical salinity units (PSU)] for 1 week prior to experimentation. The experiments were conducted between November 2015 and January 2016.

All of the sea urchins were dried with a paper towel and weighed on a digital balance (0.01 g sensitivity; AL204; Mettler Toledo, Shanghai, China) to obtain initial mass (W0). We then randomly divided the sea urchins into six groups of 60 specimens each (three replicates for each temperature). Each group was housed in a separate tank. To reach the desired temperatures (20 and 25°C), we removed half of the seawater from each tank every day, and replaced it with seawater at a different temperature. We adjusted the temperature of the new seawater so that the temperature of the entire tank did not increase by more than 1°C per day; this was based on a previous study on S. intermedius (Lawrence et al., 2009) and on field survey data of the coastal waters of the Yellow Sea (Zhang et al., 2016).

We monitored the temperature in each tank using the automatic temperature control system and a water quality monitor (A329 Portable Meter; Thermo Scientific Orion Star, Beijing, China).

# Growth, Survival, and Ingestion in Each Treatment

Before the experiment, S. intermedius individuals in each treatment were dried with a paper towel and weighed on a digital balance (0.01 g sensitivity; AL204; Mettler Toledo, Shanghai, China) to obtain initial average mass (W0). The specific growth rate (SGR), survival rate (S), average food consumption (FC) of individual, and daily feeding rate (FR) were calculated using the following formulae:

$$
\text{SGR}\ (\%\cdot\text{day}^{-1}) = 100 \times \frac{\ln W_t - \ln W_0}{t}
$$

$$
S\ (\%) = 100 \times \frac{N_t}{N_0}
$$

$$
\text{FC}\ (\text{g}\cdot\text{individual}^{-1}) = \frac{TB_t - RB_t}{N_t}
$$

$$
\text{FR}\ (\%\cdot\text{day}^{-1}) = 100 \times \frac{\sum \left(TB_t/N_t - RB_t/N_t\right)}{(W_t + W_0)/2 \times t}
$$

where $W_t$ is the average body weight (g) of live S. intermedius on day t; $W_0$ is the initial average body weight (g); t is the duration of the experiment; $N_0$ is the initial number of live S. intermedius; $N_t$ is the number of live S. intermedius on day t; $TB_t$ is the bait supplied on day t; and $RB_t$ is the total amount of remaining bait on day t (Qin et al., 2011; Chang et al., 2016).
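The four indices above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: the function names are hypothetical, and the summation in FR is assumed to run over the individual feeding intervals, each recorded as a (TB, RB, N) tuple.

```python
import math

def specific_growth_rate(w0, wt, days):
    """SGR (% per day) = 100 * (ln Wt - ln W0) / t."""
    return 100.0 * (math.log(wt) - math.log(w0)) / days

def survival_rate(n0, nt):
    """S (%) = 100 * Nt / N0."""
    return 100.0 * nt / n0

def food_consumption(bait_supplied, bait_remaining, nt):
    """FC (g per individual) = (TBt - RBt) / Nt."""
    return (bait_supplied - bait_remaining) / nt

def feeding_rate(bait_records, w0, wt, days):
    """FR (% per day); bait_records is a list of (TB, RB, N) tuples.

    Assumption: the sum runs over feeding intervals within the experiment.
    """
    eaten_per_individual = sum((tb - rb) / n for tb, rb, n in bait_records)
    mean_weight = (wt + w0) / 2.0
    return 100.0 * eaten_per_individual / (mean_weight * days)
```

For example, `survival_rate(60, 40)` corresponds to a tank that started with 60 animals and ended with 40.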

# Sample Collection for RNA-Seq

With a test diameter of around 10 mm, most of the sea urchins sampled in this study were too small to have developed gonads. In addition, it is generally difficult to dissect or obtain enough tissue, such as tube feet, coelomocytes, and perioral membranes, for RNA-seq library construction and subsequent validation.

The intestines are important organs for nutrient intake and stress defense in sea urchins, and they were the only tissue that could be sampled under a dissection microscope in this study. Therefore, we opted to use the intestines to perform transcriptome analysis. At the end of the experimental period, the numbers of living S. intermedius individuals in the three high temperature group (25°C) replicate tanks were 40 (replicate 1#), 31 (replicate 2#), and 34 (replicate 3#). To ensure that there were three independent samples for the transcriptome validation experiment, we randomly selected 20 S. intermedius specimens from each replicate tank for intestinal RNA extraction. For RNA sequencing library construction, the intestines of 20 living S. intermedius specimens from replicate 1# of the high temperature group (25°C) were carefully removed and pooled (designated as Si\_TT2\_1), and 20 intestines of living S. intermedius specimens from replicate 2# and replicate 3# (10 of each) of the high temperature group (25°C) were carefully removed and pooled (designated as Si\_TT2\_2). Two optimal temperature (20°C) sample pools (as control) were constructed using the same procedure and named Si\_TT0\_1 and Si\_TT0\_2. All of the pooled samples were stored at −80°C until RNA extraction.

# RNA Extraction and Sequencing

Total RNA was extracted from each pooled sample by using TRIzol (Ambion, United States) following the manufacturer's instructions. Total RNA quantity and integrity were assessed by 1% agarose gel electrophoresis and the RNA Nano 6000 assay kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, United States).

A total of 2 µg of high-quality RNA per pooled sample was used for RNA sample preparation. All high-quality RNA samples were sent to BGI Co., Ltd. (Beijing, China). Sequencing libraries were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, United States) according to the manufacturer's recommendations, and index codes were added to sequences to distinguish one sample from another. The Agilent Bioanalyzer 2100 system was employed to assess the quality of each RNA library. The index-coded samples were clustered on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumina) according to the manufacturer's instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq platform and paired-end reads were generated (Hiseq 4000, 101PE).

# Transcriptome Assembly and Annotation

Clean data (clean reads) were obtained from the raw data (raw reads) by removing adaptors, reads with poly-N, and low-quality reads. All of the clean reads were then submitted to the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) Sequence Database (Accession Number PRJNA508827). The percentage of bases with a Phred value > 20 (Q20), the percentage of bases with a Phred value > 30 (Q30), and the content of bases G and C (GC-content) were calculated. The high-quality clean data were used for subsequent analyses. Clean reads were assembled into a reference transcriptome using Trinity (Grabherr et al., 2011) with min\_kmer\_cov set to 2 by default, and all of the other parameters set to default. The clean data were mapped back onto the assembled transcriptome, and the read count for each gene was obtained from the mapping results by RSEM (v1.2.12). BUSCO v.3.0.2 was used to assess the completeness of the gene assembly (Simão et al., 2015).
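As an illustration of the read-quality statistics mentioned above, the following sketch computes Q20, Q30, and GC-content from (sequence, quality string) pairs. It is not the pipeline used in the study; the function name is hypothetical, and Phred+33 quality encoding is assumed.

```python
def quality_summary(records):
    """Return Q20, Q30 and GC percentages for (sequence, quality) pairs.

    Assumption: quality strings use Phred+33 encoding.
    """
    bases = q20 = q30 = gc = 0
    for seq, qual in records:
        for base, qchar in zip(seq.upper(), qual):
            phred = ord(qchar) - 33
            bases += 1
            q20 += phred > 20   # bases with a Phred value > 20
            q30 += phred > 30   # bases with a Phred value > 30
            gc += base in "GC"  # G or C bases
    if bases == 0:
        raise ValueError("no bases supplied")
    return {
        "Q20": 100.0 * q20 / bases,
        "Q30": 100.0 * q30 / bases,
        "GC": 100.0 * gc / bases,
    }
```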

Transcriptome annotation was performed using Basic Local Alignment Search Tool (BLAST) searches against the NCBI non-redundant (Nr) database, NCBI nucleotide sequences (Nt), Swiss-Prot, InterPro, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Clusters of Orthologous Groups (COG). We employed Blast2GO with Nr annotation for Gene Ontology (GO) annotation, and InterProScan5 for InterPro annotation.

# Single Nucleotide Polymorphism (SNP) and Simple Sequence Repeat (SSR) Identification

Single nucleotide polymorphisms (SNPs) and SSRs at the transcriptome level were identified using GATK3 software (v3.4) (Mckenna et al., 2010) and MISA (microsatellite)<sup>1</sup>, respectively. The parameters for SNP identification were MQ < 20.0 and QD < 2.0. SSR identification criteria in the MISA script (unit size–minimum number of repeats) were 1–12, 2–6, 3–5, 4–5, 5–4, and 6–4.

<sup>1</sup>http://pgrc.ipk-gatersleben.de/misa/misa.html
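The SSR criteria above can be illustrated with a regular-expression scan. This is a simplified sketch in the spirit of MISA, not the MISA script itself: the function names are hypothetical, and compound microsatellites are not handled.

```python
import re

# Unit size -> minimum number of repeats, as listed in the text.
MIN_REPEATS = {1: 12, 2: 6, 3: 5, 4: 5, 5: 4, 6: 4}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "AA" x 6, which is already reported as "A" x 12.
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)
```

The primitive-motif check prevents a homopolymer run from also being counted as a dinucleotide or trinucleotide repeat.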

# Differentially Expressed Gene (DEG) Analysis


Gene expression levels were calculated as previously described (Li and Dewey, 2011) with RSEM (v1.2.12). Differential expression analysis between optimal temperature (20°C; Si\_TT0) and high temperature (25°C; Si\_TT2) was performed using the NOISeq R package (v3.1). NOISeq provides statistical routines for determining differential expression in digital gene expression data using a model based on a noise distribution (Tarazona et al., 2011). Genes with an absolute fold-change ≥ 2.0 and probability ≥ 0.8, as determined by NOISeq, were considered differentially expressed.
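The thresholding step can be sketched as follows. The probability values are assumed to have already been computed by NOISeq; this sketch only applies the two cutoffs stated above, and the function names and record layout are hypothetical.

```python
import math

def log2_fold_change(control, treatment):
    """log2(treatment / control), with explicit handling of zero expression."""
    if control == 0 and treatment == 0:
        return 0.0
    if control == 0:
        return math.inf
    if treatment == 0:
        return -math.inf
    return math.log2(treatment / control)

def call_degs(records, min_fold=2.0, min_prob=0.8):
    """records: iterable of (gene_id, control_expr, treatment_expr, probability).

    Returns (gene_id, direction) for genes passing both thresholds.
    """
    cutoff = math.log2(min_fold)
    degs = []
    for gene_id, control, treatment, prob in records:
        lfc = log2_fold_change(control, treatment)
        if abs(lfc) >= cutoff and prob >= min_prob:
            degs.append((gene_id, "up" if lfc > 0 else "down"))
    return degs
```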

# GO and KEGG Pathway Enrichment Analyses

Differentially expressed genes were classified based on GO and KEGG functional annotation. GO and pathway functional enrichment analyses were performed using phyper, as implemented in R (v3.1). P-values were calculated using the hypergeometric test:

$$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}}.$$

To ensure relatively precise results, we calculated the false discovery rate (FDR) for each p-value. In general, the terms in which FDR was not larger than 0.001 were defined as significant. Enriched cluster analysis of candidate DEGs was performed using the R package (v3.1).
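A minimal sketch of this enrichment test is given below. The symbols are not defined in the text, so the usual reading is assumed here: N is the number of annotated background genes, n the number of DEGs among them, M the number of background genes annotated to a term, and m the number of DEGs annotated to that term. The FDR method is also not stated; the Benjamini-Hochberg procedure is used purely as an assumption.

```python
from math import comb

def enrichment_p(background, degs, term_genes, term_degs):
    """Upper-tail hypergeometric P(X >= m) with N=background, n=degs,
    M=term_genes, m=term_degs.

    Summing the upper tail directly is algebraically equal to
    1 - sum_{i=0}^{m-1}(...) and avoids cancellation for small p-values.
    """
    total = comb(background, degs)
    upper = sum(
        comb(term_genes, i) * comb(background - term_genes, degs - i)
        for i in range(term_degs, min(term_genes, degs) + 1)
    )
    return upper / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    count = len(pvalues)
    order = sorted(range(count), key=lambda i: pvalues[i])
    adjusted = [0.0] * count
    running_min = 1.0
    for rank in range(count, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * count / rank)
        adjusted[idx] = running_min
    return adjusted
```

Terms would then be called significant when the adjusted value does not exceed the 0.001 cutoff given above.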

# qRT-PCR Validation

Annotated DEGs were validated using quantitative real-time reverse transcription polymerase chain reaction (qRT-PCR). Twelve DEG candidates were randomly selected (including eight upregulated and four downregulated DEGs) from the 2,125 DEGs. The DEG input RNA was used as template for cDNA synthesis. cDNA was synthesized using the PrimeScript™ RT reagent Kit (TaKaRa, Japan). The cytochrome b (Cytb) gene was used as internal control (Yang et al., 2010). Primers used in qRT-PCR analyses were designed by Primer Premier 5.0 (**Table 1**). qRT-PCR was performed in a total volume of 16 µL, which consisted of 2 µL of the cDNA template, 8 µL of 2× SYBR Green Master mix (TaKaRa, Japan), 0.3 µL of ROX reference dye II, 4.5 µL of PCR-grade water, and 0.6 µL (10 mM) of each primer. The running program was set as follows: 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and an annealing temperature of 56°C for 32 s. At the end of the reaction, PCR melting curve analysis was conducted to confirm single PCR products. The relative expression level of each candidate DEG was determined using the comparative 2<sup>−ΔΔCt</sup> method (Livak and Schmittgen, 2001). The formula was as follows:

$$
\Delta\Delta\text{Ct} = [\text{Ct (sample)} - \text{Ct (internal reference)}] - [\text{Ct (control)} - \text{Ct (internal reference)}].
$$
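The comparative Ct calculation can be written as a short function. This is an illustrative sketch with hypothetical argument names; it assumes the internal reference Ct is measured separately in the sample and in the control.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the comparative 2^(-ddCt) method."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCt of treated sample
    delta_control = ct_target_control - ct_ref_control   # dCt of control
    ddct = delta_sample - delta_control
    return 2.0 ** (-ddct)
```

A value above 1 indicates higher expression in the sample than in the control, and a value below 1 indicates lower expression.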


TABLE 1 | Primers used in verification of RNA-Seq results by qRT-PCR.


# Protein–Protein Interaction (PPI) Analysis

We used Blastx v2.2.28 (Zhang et al., 2000) with an e-value cutoff of 1e−10 to align the S. intermedius DEG sequences with the protein sequences from the sea urchin S. purpuratus. A PPI network was built using STRING<sup>2</sup> (Szklarczyk et al., 2011) based on the PPI network of S. purpuratus. We used Cytoscape v3.5.1 (Shannon et al., 2003) to visualize the PPI network.

# Data Analysis

All of the data were expressed as the mean ± standard deviation (SD). All of the statistical analyses were performed with SPSS 16.0 (IBM, Shanghai, China). We first confirmed that our data were normally distributed and homogeneous with the Shapiro–Wilk test and Levene's test, respectively. We then compared differences in survival rate and SGR among treatments with one-way ANOVA (factor: temperature). We considered p < 0.05 as statistically significant and p < 0.01 as statistically extremely significant. Significant differences between pairs of treatments were identified with Duncan's multiple range tests.

# RESULTS

# The Impact of High Seawater Temperature on Growth, Survival, and Feeding of S. intermedius

During the 30-day incubation, differences in S. intermedius growth, survival, and feeding among all of the treatments were compared and analyzed statistically. S. intermedius reared at both 20 and 25°C exhibited an increase in SGR (**Figure 1A**), but the SGRs of S. intermedius reared at 25°C were significantly lower compared to those reared at 20°C. During the first 14 days of incubation, no significant difference in survival was observed between 20 and 25°C treatments, whereas survival rates decreased after 14 days of incubation at 25°C (**Figure 1B**). Among all of the experimental groups, food consumption increased in a time-dependent manner, whereas that in S. intermedius reared at 25°C was relatively lower compared to those reared at 20°C (**Figure 1C**). In addition, significantly reduced feeding rates were observed at 25°C compared to 20°C (**Figure 1D**).

<sup>2</sup>http://string-db.org/

# RNA Sequencing, Transcriptome Assembly, and Annotation

Four RNA sequencing (RNA-Seq) libraries (Si\_TT0\_1, Si\_TT0\_2, Si\_TT2\_1, and Si\_TT2\_2) from the intestines of S. intermedius cultured at 20 and 25◦C were constructed and subsequently sequenced on an Illumina Hiseq 4000 platform. Approximately 45.24–45.25 million raw reads were obtained from each pooled sample (**Figure 2A**). After trimming, 44.61–44.73 million clean reads were obtained from each sample (**Figure 2A**). The Q20 range of all of the samples was 96.85–97.02%, and the Q30 range of all samples was 92.56–92.88%. The average GC content was 37.16 ± 0.14% (**Figure 2A**).

The transcriptome of S. intermedius was de novo assembled using the Trinity software with min\_kmer\_cov set to 2 by default, and all of the other parameters set to default. Completeness of the de novo assembly was assessed with BUSCO using a eukaryotic database<sup>3</sup> of 303 genes. The assembled genes were deemed 82.5% complete by BUSCO (61.4% as single genes and 21.1% as duplicated genes). In the intestines of S. intermedius cultured at 20°C (Si\_TT0), a total of 188,430 transcripts were obtained, with an average length of 636 bp and N50 length of 1,089 bp, and approximately 102,528 unigenes with a mean length of 721 bp and N50 length of 1,179 bp were also generated (**Figure 2A**). As for S. intermedius cultured at 25°C (Si\_TT2), the intestine transcriptome analysis indicated that a total of 218,307 transcripts were obtained, with an average length of 678.5 bp and N50 length of 1201.5 bp, and approximately 113,170 unigenes with a mean length of 777 bp and N50 length of 1313.5 bp were generated as well (**Figure 2A**). In summary, approximately 46.33% of the transcripts in Si\_TT0 and approximately 53.67% of the transcripts in Si\_TT2 were successfully mapped back to the de novo transcriptome assembly, respectively (**Figure 2A**). Pearson's correlation coefficients of FPKM distribution among the two biological replicates indicated the reproducibility of RNA-seq data (see **Figure 2B**). Annotation of assembled transcriptome was performed for comprehensive functional annotation of each unigene. After alignment, 65,349 unigenes were annotated in at least one of seven databases (**Figure 2C**).

<sup>3</sup>http://busco.ezlab.org/datasets/eukaryota\_odb9.tar.gz
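For readers unfamiliar with the N50 statistic reported above, the following sketch shows one common way to compute it from a list of transcript lengths: N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The function name is hypothetical, and assemblers may differ slightly in how ties are handled.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop once the longest sequences cover half of all bases.
        if 2 * running >= total:
            return length
```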

# Identification of SNPs and SSRs

A total of 890,154 SNPs and 20,239 SSRs were identified from the assembled transcriptome. Among the SNPs, the percentages of transitions (Ts) and transversions (Tv) were 58.5 and 41.5%, respectively (**Figure 3A**). Ts was higher than Tv among the four libraries, and the Ts/Tv ratio was 117:83. The most common transitions were A–G (262,240, 29.46%) and C–T (258,526, 29.04%), and the predominant transversion type was A–T (128,089, 14.39%). Among the SSRs, the AG/CT repeat was the most abundant type of motif. The proportions of repeat motifs, in decreasing order, were as follows: dimers (8,754, 43.25%) > monomers (5,573, 27.54%) > trimers (4,925, 24.33%) > pentamers (451, 2.22%) > tetramers (380, 1.88%) > hexamers (156, 0.77%) (**Figure 3B**).
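The transition/transversion classification behind these percentages can be sketched as follows. A substitution is a transition when both alleles are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise. The function names are hypothetical and the sketch is not the code used in the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_snp(ref, alt):
    """Return 'Ts' for a transition and 'Tv' for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "Ts" if same_class else "Tv"

def ts_percentage(snps):
    """Percentage of transitions among (ref, alt) pairs."""
    calls = [classify_snp(ref, alt) for ref, alt in snps]
    return 100.0 * calls.count("Ts") / len(calls)
```

As a consistency check on the reported numbers, the two transition counts in the text (262,240 + 258,526 out of 890,154 SNPs) give about 58.5%, which matches the stated Ts percentage.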

# Analysis of Differentially Expressed Genes (DEGs)

After removing the duplicated unigenes, approximately 59,846 genes expressed in the intestines of S. intermedius cultured at 20◦C (Si\_TT0) and 82,676 genes expressed in the intestines of S. intermedius cultured at 25◦C (Si\_TT2) were identified by Trinity with default parameters. DEGs between Si\_TT0 (as control) and Si\_TT2 were identified with a fold change ≥ 2.00 and probability ≥ 0.8. A total of 2,125 DEGs were identified, which included 1,015 upregulated and 1,110 downregulated unigenes as compared to those expressed in Si\_TT0. Of the 2,125 DEGs, 732 genes were expressed specifically in Si\_TT2, and 919 genes were expressed specifically in Si\_TT0. qRT-PCR analysis indicated that the expression trends of 12 randomly selected DEGs (8 upregulated and 4 downregulated) were correlated well with those obtained in RNA-seq analysis, indicating the reliability and accuracy of the RNA-seq data obtained in this study (see **Figure 4**).

All of the identified DEGs were then annotated with GO terms. Approximately 79, 78, and 78 DEGs were clustered in categories related to metabolic process, cell, and cell parts, respectively (see **Figure 5A**). The 2,125 identified DEGs were enriched in 247 pathways by KEGG analysis. The ribosome pathway was the most enriched, followed by the protein processing in endoplasmic reticulum, prion disease, and protein export pathways (see **Figure 5B**). Several DEGs related to growth (**Supplementary Table S1**), energy metabolism (**Supplementary Table S2**), heat shock responses (**Supplementary Table S3**), and immune response (**Supplementary Table S4**) were identified.

To further elucidate interaction relationships, PPI prediction analysis was conducted. After examining PPI networks, 173 proteins were found to be extremely well connected. In the 25°C treatments, 127 proteins were upregulated and 45 proteins were downregulated compared to the 20°C treatments (**Figure 6A**). These well-connected proteins included E3 ubiquitin-protein ligase MIB2 (GenBank Accession Number gi| 390364157), with 58 connections (**Figure 6B**), and heat shock protein 90 (GenBank Acc. No. gi| 390340697), with 29 connections (**Figure 6C**).

Transcription factors are also key regulators involved in heat-stress response in organisms. We identified 458 potential TFs from 25 TF families in S. intermedius cultured at 20 and 25°C. The identified TFs most commonly belonged to the following families: Cys2His2 protein (C2H2; 231 TFs; 50.44%), LIM domain-containing protein (LIM; 44 TFs; 9.61%), basic helix-loop-helix protein (bHLH; 32 TFs; 6.99%), CCCH-type zinc finger protein (C3H; 22 TFs; 4.80%), and the basic leucine zipper protein (bZIP; 8 TFs; 1.75%) (**Figure 7A**). In the 25°C treatments, 40 TFs were upregulated and 11 TFs were downregulated compared to the 20°C treatments (**Figure 7B** and **Supplementary Table S5**).

# DISCUSSION

In the present study, we first investigated the impact of high water temperature on the growth, survival, and feeding behavior of S. intermedius. As we expected, high seawater temperature imparted significantly negative effects on the survival, growth, and feeding of S. intermedius. These results support the opinion that the growth, survival and feeding of sea urchins are sensitive to temperature (Siikavuopio et al., 2008; Pearce et al., 2010; Onitsuka et al., 2013). Moreover, we found that the lower SGRs of S. intermedius suffering from high seawater temperature stress were due to a decrease in feeding rates in this study, which agrees with the findings of Yu et al. (2016), who reported that high seawater temperature reduces the feeding and SGRs of sea cucumber A. japonicus. The observed decrease in food consumption and feeding rate not only explains the slow growth of S. intermedius, but also reflects to some extent the decrease in energy budget of S. intermedius under high temperature stress. This observation is also consistent with the results of Watts et al. (2011).

FIGURE 5 | Functional annotation of DEGs of Si\_TT2 vs. Si\_TT0 in S. intermedius. (A) The most enriched GO terms of DEGs of S. intermedius Si\_TT0 and S. intermedius Si\_TT2. The Y-axis represents the categories of annotated DEGs, and the X-axis represents the number of DEGs. (B) The top 20 enriched KEGG terms of DEGs of Si\_TT2 vs. Si\_TT0 in S. intermedius. The Y-axis represents the KEGG pathway, and the X-axis represents the enrichment factor. Dot size indicates the number of DEGs in the pathway. Dot color corresponds to different Q-values.

FIGURE 6 | Differentially expressed genes interactive network prediction. (A) Interactive network prediction of all DEGs. (B) Interactive network prediction of MIB2. (C) Interactive network prediction of heat shock protein 90.

Organisms undergo metabolic adjustments in the presence of environmental stimuli. Moreover, cellular stress responses (CSRs) can also reflect the responses of an individual to environmental fluctuations (such as chronic high seawater temperature stress in this study) at the cellular and tissue levels. We therefore subsequently investigated the molecular mechanisms underlying the high temperature-driven decrease in growth, survival, and feeding in S. intermedius by constructing four high-quality RNA-seq libraries and performing comparative transcriptome analysis to identify gene candidates associated with high-temperature responses.

In terms of the decreased energy budget, our data indicated that the expression of ATP5 [δ or the oligomycin sensitivity-conferring protein (OSCP)] mRNA significantly decreased in S. intermedius cultured at 25°C compared to those incubated at 20°C. ATP5 is one of the subunits of ATP synthase, and silencing the expression of ATP5 can affect the activity of ATP synthase and further block the electron transfer chain in organisms. Our data suggest that high temperature stress can block ATP synthesis and affect the energy charge by decreasing atp5 expression in sea urchins. A decrease in ATP5 expression can alter total cellular ATP levels and impair growth in plants (Robison et al., 2009), which bears some resemblance to our observations. However, we could not find any studies involving echinoderm ATP5 with which to compare our results. In addition to a reduction in energy, another common strategy for organisms to adapt to environmental stress (e.g., warming in this study) involves decreasing energetically expensive metabolic processes to extend their duration of tolerance (Pörtner and Farrell, 2008). Histones are dynamic proteins that can undergo multiple types of post-translational modifications and regulate gene expression depending on the metabolic state of the cell. Padilla-Gamino et al. (2013) reported that the downregulation of histone-encoding genes is a principal transcriptional response accompanying metabolic depression in S. purpuratus larvae cultured at higher temperature (18°C), and that histones possibly act as metabolic sensors. We also observed a reduction in the expression of histone-encoding genes, such as Unigene17018\_All (fold change: −2.42), Unigene9687\_All (fold change: −7.70), and Unigene13915\_All (fold change: −8.64). These findings support the hypothesis that sea urchins exposed to higher temperature consume energy supplies more rapidly than those cultured at optimal temperatures by reducing key metabolites for histone-modifying enzymes.
In addition, another strategy for organisms to adapt environmental stress is minimizing body size or delaying growth. In this study, we found a significant reduction in expression of transforming growth factor beta (TGF-β) transcripts. TGF-β acts as a cytokine by imparting immunoregulatory effects, including lymphocyte proliferation, cytokine responsiveness, or cytokine expression (Ruscetti and Palladino, 1991). In echinoderms, TGF-β plays an important role in the symmetrical growth of sea urchin embryos and the biomineralization of larval skeletogenesis (Zito et al., 2003). In this study, we postulate that the reduced growth under high temperature stress in S. intermedius might be due to alterations in TGF signals. Further studies should be conducted to clarify the mechanisms underlying how high-temperature influences the TGF signal pathway.

Heat-shock proteins (HSPs) are molecular chaperones with multiple functions, including stress resistance and adaptation to environmental changes in various species (Currie, 2011). The upregulation of Hsps is one of the ubiquitous mechanisms by which marine organisms cope with thermal stress (Osovitz and Hofmann, 2005). Our data also identified altered mRNA expression levels of some Hsps. Runcie et al. (2012) demonstrated that higher temperatures (18°C vs. 12°C) can cause mild embryonic developmental stress and increase both hsp70 and hsp90 expression in embryos of the sea urchin S. purpuratus. Vergara-Amado et al. (2017) reported that transient warmer temperature treatments (18°C vs. 14°C) induce the upregulation of hsp70 and hsp90 in juveniles of the sea urchin L. albus. As expected, we found upregulation of hsp70 and hsp90 in adult S. intermedius after application of chronic high-temperature stress. Hsp70 and Hsp90 have both been implicated in the proteasomal degradation of chaperoned client proteins (Kiang and Tsokos, 1998; Pratt, 1998). Hsp90, in particular, has been characterized as the driver for Hsp-mediated proteasomal degradation. Hsp90 is a protective chaperone when in complex with p50/immunophilin, p23, and ATP, but drives client proteins to poly-ubiquitination and proteasomal degradation when ATP-depleted and bound in complex with Hsp70 (Doong et al., 2003). Interestingly, the present study observed that mind bomb-2 (mib2) was downregulated under high temperature (25°C) conditions compared with the optimal temperature (20°C). MIB2 has E3 ubiquitin-protein ligase activity and can promote the ubiquitination and endocytosis of its client protein (Koo et al., 2005). PPI prediction indicated the upregulation of Hsp90 and the downregulation of E3 ubiquitin-protein ligase MIB2 under high temperature (25°C) conditions compared with the optimal temperature (20°C) (**Figure 6B**). This observation suggests that high temperature-induced proteasomal degradation or apoptosis might mainly rely on the Hsp-mediated proteasomal degradation pathway rather than on a MIB2-mediated pathway. In addition, the cytokine-like function of Hsp70 has been well documented in several studies (Zhang et al., 2011; Yang et al., 2016; Ying et al., 2016). Combined with the findings of the present study, these observations support the hypothesis that Hsp70 and Hsp90 regulate not only proteasomal degradation but also immune responses when sea urchins are subjected to high-temperature stress (Ying et al., 2016). Since enhanced hsp70 and hsp90 transcripts can be detected from embryos to adults in sea urchins during high temperature stress, regardless of whether such exposure is chronic or transient, we propose that the hsp70 and hsp90 genes be considered in selective breeding or assisted breeding of high temperature-resistant sea urchins.

Significant upregulation of the glutathione S-transferase (GST) gene was also observed in the present study. GST is a phase II detoxification isozyme that catalyzes the conjugation of glutathione with both xenobiotics and endogenous substrates. GST activity has long been utilized as a bioindicator of environmental contamination in coastal regions (Cunha et al., 2005). Field studies have shown that the maximal GST activity in sea urchins and mussels can be observed in the summer (Moreira and Guilhermino, 2005). However, no study has measured echinoderm GST levels under thermal stress, and thus we were unable to clarify the relationship between GST activity and high temperature stress in sea urchins.

Transcription factors (TFs) and cis-acting elements are conserved mechanisms that regulate gene transcription (Murray et al., 1988). Among the differentially expressed TFs, we observed significant upregulation of the mRNA expression of some multifunctional TFs in sea urchins in response to high temperature stress, including activator protein-1 (AP-1), FOS, cAMP-response element-binding protein (CREB), and the zinc finger (ZNF) proteins. AP-1 and FOS are members of the basic leucine zipper protein (bZIP) family. Activated AP-1 has been demonstrated to be a stress-responsive TF and plays a key role in responding to environmental stimuli by regulating various immune signal transduction pathways, such as the Toll-like receptor (TLR), tumor necrosis factor alpha (TNF-α), and mitogen-activated protein kinase (MAPK) pathways, in marine organisms (Qu et al., 2015; Zhan et al., 2018). Fos proteins are a key part of the AP-1 complex and can regulate a wide range of biological processes (Hirayama et al., 2005). Production of many immune-related molecules (antioxidant enzymes, chemokines, and interleukin) requires Fos expression (Rimoldi et al., 2009). Additionally, it has been shown that Fos cooperates with Notch to regulate cell fate specification of intermediate precursors during Caenorhabditis elegans development (Oommen and Newman, 2007). CREB is a multifunctional TF that regulates various signal transduction pathways. In Drosophila, CREB and Hsp70 can additively suppress polyglutamine-mediated toxicity (Iijima-Ando et al., 2005). In cultured rat primary hippocampal neuron cells, CREB has been demonstrated to be activated by the Hsp90/Akt signal pathway (Cen et al., 2006). Combined with our comparative expression data on hsp70 and hsp90, we postulate that the CREB-mediated gene regulation network of higher thermal tolerance in S. intermedius might be closely correlated with HSPs. Further studies should be conducted to confirm this hypothesis.

ZNFs can be found in all eukaryotes and act as TFs that play critical roles in responding to environmental stimuli such as biometals (Villalpando et al., 2017). In plants, ZNFs have been identified as heat response-related gene candidates (Wang et al., 2010; Yan et al., 2016; Fang et al., 2017). The present study has shown that ZNFs are involved in echinoderm heat responses, and more studies should be conducted to elucidate the ZNF-regulated genes and signal transduction pathways during chronic heat stress in echinoderms.

Moreover, the adjustment of gene structure is the ultimate mechanism for organisms to adapt to long-term stress and maintain population size. Our comparative transcriptomic data also indicated altered SNPs and SSRs. These results will facilitate studies on the genetic structure, population geography, and ecology of sea urchins. Further screening of these SNPs and SSRs may assist in the identification of more valuable heat-resistant genetic markers that may be utilized in the selective breeding of heat-resistant sea urchins.

However, significant expression alterations of some transient thermal-tolerance candidate genes, such as cytochrome P450 (CYP450) and Na<sup>+</sup>/K<sup>+</sup> ATPase (Vergara-Amado et al., 2017), were not observed in the present study. This suggests that high temperature-induced molecular responses in the same species depend on individual developmental stages and the duration of stress.

# CONCLUSION

The present study has demonstrated that chronically high seawater temperatures negatively influence the growth, survival, and feeding of the sea urchin S. intermedius. Gene candidates (e.g., HSPs, cytokines, and TFs) that were closely correlated with thermal resistance and adaptation were identified by comparative transcriptomic analysis. In summary, our results provide insight into the genes and regulatory networks involved in chronic thermal stress in S. intermedius, as well as enriching transcriptomic and genetic resources for sea urchins and other invertebrates. More studies on the molecular events involved in the thermal resistance and adaption mechanisms for S. intermedius should be conducted to better understand the impact of chronically high temperature stress on sea urchin physiology and ecology.

# DATA AVAILABILITY

The datasets generated for this study can be found in NCBI, PRJNA508827.

# AUTHOR CONTRIBUTIONS

YC and YZ conceived and designed the experiments. JL, JS, WZ, YL, DC, and WH performed the experiments. JL, YZ, and WZ analyzed the data. YZ and JL wrote the manuscript. All authors read and approved the manuscript.

# FUNDING

This study was supported by the National Natural Science Foundation of China (31672652) and the grant for Chinese Outstanding Talents in Agricultural Scientific Research (for YC).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00301/full#supplementary-material

# REFERENCES

FOS/AP-1 in zebrafish controls CRY-1a and WEE-1. Proc. Natl. Acad. Sci. U.S.A. 102, 10194–10199. doi: 10.1073/pnas.0502610102

and accelerates recovery in grapevine leaves. BMC Plant Biol. 10:34. doi: 10.1186/1471-2229-10-34


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Zhan, Li, Sun, Zhang, Li, Cui, Hu and Chang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# High-Density Genetic Linkage Maps Provide Novel Insights Into ZW/ZZ Sex Determination System and Growth Performance in Mud Crab (Scylla paramamosain)

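Khor Waiho<sup>1,2,3†</sup>, Xi Shi<sup>1,2†</sup>, Hanafiah Fazhan<sup>1,2</sup>, Shengkang Li<sup>1,2</sup>, Yueling Zhang<sup>1,2</sup>, Huaiping Zheng<sup>1,2</sup>, Wenhua Liu<sup>1,2</sup>, Shaobin Fang<sup>1,2</sup>, Mhd Ikhwanuddin<sup>2,4</sup> and Hongyu Ma<sup>1,2,3</sup>\*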

<sup>1</sup> Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou, China, <sup>2</sup> STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou, China, <sup>3</sup> Laboratory for Marine Fisheries Science and Food Production Processes, Pilot National Laboratory for Marine Science and Technology (Qingdao), Qingdao, China, <sup>4</sup> Institute of Tropical Aquaculture, Universiti Malaysia Terengganu, Kuala Terengganu, Malaysia

## Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Shikai Liu, Ocean University of China, China

Le Wang, Temasek Life Sciences Laboratory, Singapore

\*Correspondence:

Hongyu Ma, mahy@stu.edu.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 27 October 2018 Accepted: 19 March 2019 Published: 05 April 2019

### Citation:

Waiho K, Shi X, Fazhan H, Li S, Zhang Y, Zheng H, Liu W, Fang S, Ikhwanuddin M and Ma H (2019) High-Density Genetic Linkage Maps Provide Novel Insights Into ZW/ZZ Sex Determination System and Growth Performance in Mud Crab (Scylla paramamosain). Front. Genet. 10:298. doi: 10.3389/fgene.2019.00298

Mud crab, Scylla paramamosain, is one of the most important crustacean species in global aquaculture. To determine the genetic basis of sex and growth-related traits in S. paramamosain, a high-density genetic linkage map with 16,701 single nucleotide polymorphisms (SNPs) was constructed using SLAF-seq and a full-sib family. The consensus map has 49 linkage groups, spanning 5,996.66 cM with an average marker interval of 0.81 cM. A total of 516 SNP markers, including 8 female-specific SNPs, segregated in two quantitative trait loci (QTLs) for phenotypic sex and were located on LG32. The presence of female-specific SNP markers only on the female linkage map, their segregation patterns and the lower female:male recombination rate strongly suggest a ZW/ZZ sex determination system in S. paramamosain. The QTLs of most (90%) growth-related traits were found within a small interval (25.18–33.74 cM) on LG46, highlighting the potential involvement of LG46 in growth. Four markers on LG46 were significantly associated with 10–16 growth-related traits. BW was only associated with marker 3846. Based on the annotation of transcriptome data, 11 and 2 candidate genes were identified within the QTL regions of sex and growth-related traits, respectively. The newly constructed high-density genetic linkage map with sex-specific SNPs, and the identified QTLs of sex- and growth-related traits, serve as a valuable genetic resource and a solid foundation for marker-assisted selection and genetic improvement of crustaceans.

Keywords: genetic linkage map, sex-specific SNP, QTL, association analysis, sex determination system, Scylla paramamosain

# INTRODUCTION

Mud crab, Scylla paramamosain (Crustacea: Decapoda: Brachyura), is naturally distributed along the coasts of Asia together with other Scylla species (Waiho et al., 2016a, 2018; Fazhan et al., 2017a) and is the dominant mud crab species in the Mekong Delta, Vietnam (Le Vay et al., 2001), the southern part of Japan (Ogawa et al., 2012) and China (Ma et al., 2012). It is one of the main aquaculture invertebrate species in China, with a total production of more than 145,000 tons in 2016 (Department of Agriculture of China, 2017). Its high market value is attributed to its delicate meat, high nutritional value, ease of capture and hardy nature (Waiho et al., 2017). Thus, the development of genetic breeding programs, including marker- and gene-assisted selection, is urgently needed to ensure the sustainability and genetic variability of cultured mud crab. Attempts have been made to identify the correlation between some economically important growth traits of S. paramamosain, such as body size and weight (Jiang et al., 2014), and transcriptome-derived microsatellite markers (Ma et al., 2014). These serve as foundations for further selection of genomic loci or genes related to the traits of interest through the construction of genetic maps and are of utmost importance and relevance to the aquaculture of this species.

Mud crabs exhibit significant sexual dimorphism, with females displaying a higher growth rate and greater body weight compared to their male counterparts of the same size (Jiang et al., 2014; Waiho et al., 2016b). In addition, coupled with the gravid status of the mature females, the commercial value is substantially higher in females than in males. Therefore, it is of great interest to consider the possibility of mud crab monosex culture in the near future. The first step toward monosex culture is to understand the sex determination system of the species. Unlike those of vertebrates, crustacean sex determination systems are more diverse and plastic, influenced by both genetic and environmental factors (Ford, 2008). In crabs, both XX–XY and ZZ–ZW sex determination systems have been reported and are species-specific (Niiyama, 1938; Lécher et al., 1995; Cui et al., 2015). Based on our previous study using single-nucleotide polymorphisms (SNPs), that of S. paramamosain is postulated to be ZZ–ZW (Shi et al., 2018). Based on their transcriptomic profiles, some sex-related genes such as vasa, Dmrt, FEM1, and Wnt6 were found to be differentially expressed in the gonad of S. paramamosain (Gao et al., 2014). Recently, we have also uncovered 147 gonadal differentially expressed long non-coding RNAs (lncRNAs), nine of which showed regulation toward eight sex-related genes in S. paramamosain (Yang et al., 2017). The genome organization of sex chromosomes in S. paramamosain and the exact gene localization, however, remain unclear. The screening of sex-associated SNP markers will hasten the development of all-female S. paramamosain culture and contribute significantly to the understanding of the mud crab sex determination mechanism.

An accurate and comprehensive genetic linkage map is the cornerstone for genomic and genetic studies, as well as genetic breeding of a species. It aids in the elucidation of genomic characteristics, provides an excellent framework for quantitative trait locus (QTL) localization, facilitates both marker- and gene-assisted selection, and enables comparative genome analysis between species (Yu et al., 2015; Peng et al., 2016). For example, sex-determination and/or growth-related traits were successfully mapped and studied in several aquaculture fish species, including the common carp (Peng et al., 2016), turbot (Taboada et al., 2014; Wang W. et al., 2015), blunt snout bream (Wan et al., 2017), bighead carp (Fu et al., 2016), Asian seabass (Wang L. et al., 2015), mandarin fish (Sun et al., 2017) and tilapia (Liu et al., 2013; Palaiokostas et al., 2015a). In decapod crustaceans, however, progress on the construction of high-density linkage maps is slow and difficult due to their high number of chromosomes. To date, SNP-based high-density linkage maps with thousands of markers and average marker distances of less than 1 cM are reported for only four decapod species (Baranski et al., 2014; Yu et al., 2015), of which two are portunid crabs: the Chinese mitten crab, Eriocheir sinensis (Cui et al., 2015; Qiu et al., 2017) and the swimming crab, Portunus trituberculatus (Lv et al., 2017).

Recently, with the advent of next-generation sequencing (NGS) technologies, SNPs may be mined using the specific-locus amplified fragment sequencing (SLAF-seq) method, a modification of the commonly used restriction site-associated DNA sequencing (RAD-seq) method (Sun et al., 2013). This method involves size selection of restriction fragments to ensure even distribution and exclude repeats. SLAF-seq is gaining attention and has been used in plants (Xu et al., 2015; Luo et al., 2016; Zhou et al., 2017) and animals (Wang W.H. et al., 2015; Lv et al., 2017; Qiu et al., 2017) alike, especially those without a reference genome, to construct high-density genetic maps, as it is more efficient and cost-effective than the previous RAD-seq method (Qiu et al., 2017). The first genetic linkage map for S. paramamosain was constructed using microsatellite and amplified fragment length polymorphism (AFLP) markers (Ma et al., 2016). The resolution of the resulting map, however, is low (only 50% coverage of the estimated genome), with a mean marker interval of 18.68 cM and only 212 mapped markers, thus limiting its application in further QTL localization and genome assembly (Lv et al., 2017).

Herein, we randomly selected 129 G1 offspring from one full-sib family of S. paramamosain for SNP mining and genotyping using the SLAF-seq method. A high-density genetic linkage map (0.81 cM average marker interval) of S. paramamosain with 16,701 SNP markers was successfully constructed, spanning a total of 5,996.66 cM in 49 linkage groups (LGs). The inclusion of 8 female-specific SNP markers enabled the identification of two QTL regions on LG32 that were linked with sex determination. Based on the measurements of growth-related traits, 27 QTLs of growth-related traits were also identified, and SNP markers associated with growth-related traits were detected as well. Moreover, 11 and 2 candidate genes were identified within the QTL regions of sex and growth-related traits, respectively. This study provides novel insights into the sex determination system and growth performance of mud crab and other related crustacean species.

# MATERIALS AND METHODS

# Ethics Statement

The animal experimental procedures used in this study were approved and conducted in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals outlined by the Institutional Animal Care and Use Ethics Committee of Shantou University and the National Institutes of Health guide for the care and use of Laboratory animals (NIH Publications No. 8023, revised 1978).

# Mapping Population Collection and DNA Extraction

The mud crab S. paramamosain is a common aquaculture species in southeastern coastal areas of China. The parents were cultured and mated in a pond, and the full-sib G1 family was produced in a hatchery located at Raoping, China. The offspring were artificially reared with commercial feed and low-value fishes to maturity in the same pond. A total of 129 progenies (63 males; 66 females) were randomly collected 4 months post-hatch for linkage mapping analysis and growth trait measurement. The measured growth traits include: carapace length (CL), carapace width (CW), internal carapace width (ICW), carapace frontal width (CFW), abdomen width (AW), body height (BH), carapace width at spine 8 (CWS8), distance between frontal median spine (DFMS), distance between frontal lateral spine (DFLS), distance between lateral spine 1 (DLS1), distance between lateral spine 2 (DLS2), fixed finger length of the cheliped (FFLC), fixed finger width of the cheliped (FFWC), fixed finger height of the cheliped (FFHC), meropodite length of pereiopod 1 (MLP1), meropodite length of pereiopod 2 (MLP2), meropodite length of pereiopod 3 (MLP3), dactyl length of pereiopod 4 (DLP4), dactyl width of pereiopod 4 (DWP4), and body weight (BW) (Ma et al., 2013; Fazhan et al., 2017b). The 19 morphological traits were measured to the nearest 0.01 mm using a standard Vernier caliper. BW was measured to an accuracy of 0.01 g with a digital electronic balance. The sex of each crab was determined based on its gonad morphology after dissection (Quinitio et al., 2007; Waiho et al., 2017). Genomic DNA from muscle tissues of the right cheliped of the maternal parent and the 129 progenies was extracted using a conventional CTAB DNA extraction method. The quantity and quality of the extracted DNA were checked using a Nanodrop 1000 spectrophotometer (Thermo Scientific, Wilmington, DE, United States) and agarose gel electrophoresis (1% concentration), respectively, before storing at −80°C until further analysis.

# SLAF-Seq Library Construction and High-Throughput Sequencing

Specific-locus amplified fragment sequencing (SLAF-seq) libraries were constructed and sequenced based on the method of Sun et al. (2013) with slight modification. Preliminary marker discovery was simulated in silico based on the reference genome of E. sinensis. Restriction endonucleases HaeIII and Hpy166II (New England Biolabs, NEB) were selected and used to digest the genomic DNA of the S. paramamosain G1 population. Subsequently, a single nucleotide (A) overhang was added to the digested fragments using Klenow Fragment (3′ → 5′ exo−) (NEB) and dATP. Both steps were incubated at 37°C. Duplex tag-labeled sequencing adapters (PAGE-purified, Life Technologies, United States) were then ligated to the A-tailed fragments using T4 DNA ligase. The diluted restriction-ligation DNA products were then subjected to polymerase chain reaction (PCR) using Q5® High-Fidelity DNA Polymerase and specific primers (forward: 5′-AATGATACGGCGACCACCGA-3′; reverse: 5′-CAAGCAGAAGACGGCATACG-3′) (Life Technologies) and subsequently purified using Agencourt AMPure XP beads (Beckman Coulter, High Wycombe, United Kingdom). Purified products were then pooled and separated by 2% agarose gel electrophoresis. Fragments in the range of 314–414 bp (with indexes and adaptors) were excised and purified using a QIAquick gel extraction kit (Qiagen, Germany). After purification, paired-end 125 bp sequencing was performed on the Illumina HiSeq 2500 platform (Illumina Inc., CA, United States). Asian rice Oryza sativa japonica (genome size 382 M) was used as a control and subjected to the same sequencing procedure to assess the accuracy of library construction.
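The in silico double digestion used to choose the enzyme pair can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names are hypothetical, the recognition sites (HaeIII, GG^CC; Hpy166II, GTN^NAC) are taken from standard enzyme references rather than from this article, and terminal fragments are kept for simplicity even though a real SLAF prediction would require a cut site at both ends. The size window is passed in by the user because the 314–414 bp range quoted above includes indexes and adaptors.

```python
import re

# Recognition pattern and cut offset for each (blunt-cutting) enzyme.
# Assumed from standard enzyme references, not stated in the article.
ENZYMES = {
    "HaeIII": ("GGCC", 2),                 # GG^CC
    "Hpy166II": ("GT[ACGT][ACGT]AC", 3),   # GTN^NAC
}


def cut_positions(seq, enzymes=ENZYMES):
    """Return sorted cut coordinates for an uppercase DNA sequence."""
    positions = set()
    for pattern, offset in enzymes.values():
        # A lookahead lets overlapping recognition sites be detected.
        for match in re.finditer(f"(?=({pattern}))", seq):
            positions.add(match.start() + offset)
    return sorted(positions)


def digest(seq, enzymes=ENZYMES):
    """Split the sequence at every cut position (terminal pieces included)."""
    cuts = [0] + cut_positions(seq, enzymes) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


def count_in_range(fragments, lo, hi):
    """Count fragments whose length falls inside the selected size window."""
    return sum(lo <= len(fragment) <= hi for fragment in fragments)


if __name__ == "__main__":
    toy = "AAAAGGCCTTTTTGTAAACAAAA"   # toy sequence, not real crab DNA
    fragments = digest(toy)
    print([len(f) for f in fragments], count_in_range(fragments, 7, 10))
```

Running candidate enzyme pairs through such a function on a reference genome and comparing the number of fragments in the target window is one simple way to rank enzyme combinations before library construction.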

# Sequencing Data Grouping and Genotyping

The discovery and genotyping of SLAF markers were conducted based on Sun et al. (2013). Raw reads were sorted to each progeny according to the duplex barcode sequences after the removal of low-quality reads (reads with a quality score of less than 20e, where e represents the base sequencing error rate). Next, barcodes and terminal 5-bp positions were trimmed from each raw read. Sequences that were mapped to the same position with high similarity (>95% identity) were defined as a SLAF locus (Zhang et al., 2015). SNP loci of each SLAF locus were detected between parents using the Genome Analysis Toolkit (GATK). To ensure accuracy, SNP calling was also carried out using SAMtools. Only variants called by both algorithms (GATK and SAMtools) were considered as SNPs. Further, GATK with default parameters was used for the removal of duplicated reads, realignment of reads around insertions/deletions, and recalibration of base quality. Variants with a quality by depth (QD) score of <2.0, mapping quality (MQ) of <40 and Fisher Strand (FS) score of >60 were filtered out. All polymorphic SNPs were genotyped for consistency with the parental and offspring SNP loci. Genotype scoring based on a Bayesian approach was performed to ensure genotyping quality (Sun et al., 2013). To identify polymorphic SNPs, firstly, SNPs with an average depth of less than 10× in the parent and 4× in the offspring were filtered out. Next, only SNPs with at least 90% frequency among all offspring were selected. Lastly, markers with significant segregation distortion (based on the Chi-square test, P < 0.05) were excluded from map construction but added later as accessory markers. Based on their SNP genotypes, five segregation patterns (ab × cd, ef × eg, hk × hk, lm × ll, nn × np) were used to construct the full-sib family linkage map. The paternal genotype was deduced based on the maternal and offspring genotypes. Further, 13 SNP markers were selected for validation of the accuracy of genotyping (**Supplementary Table S1**).
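The Chi-square screen for segregation distortion described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the expected genotype ratios (1:1 for lm × ll and nn × np, 1:2:1 for hk × hk, 1:1:1:1 for ef × eg and ab × cd) are the standard Mendelian expectations for a full-sib cross, which the article does not list explicitly. The chi-square tail probability is computed with the standard library only.

```python
import math

# Expected genotype-class ratios per segregation pattern (assumed standard
# Mendelian expectations for an outbred full-sib family).
EXPECTED_RATIOS = {
    "nnxnp": (1, 1),
    "lmxll": (1, 1),
    "hkxhk": (1, 2, 1),
    "efxeg": (1, 1, 1, 1),
    "abxcd": (1, 1, 1, 1),
}


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    while k < df:
        # Recurrence Q(k + 2) = Q(k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def segregation_test(counts, pattern, alpha=0.05):
    """Return (chi-square statistic, P value, distorted?) for one marker."""
    ratios = EXPECTED_RATIOS[pattern]
    if len(counts) != len(ratios):
        raise ValueError("one count per expected genotype class is required")
    n = sum(counts)
    if n == 0:
        raise ValueError("no genotyped offspring")
    expected = [n * r / sum(ratios) for r in ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
    p_value = chi2_sf(chi2, df=len(ratios) - 1)
    return chi2, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical genotype counts among 129 offspring for an nn x np marker.
    print(segregation_test((65, 64), "nnxnp"))
    print(segregation_test((90, 39), "nnxnp"))
```

Markers flagged as distorted would be set aside during initial map construction and, as in the text, added back later as accessory markers.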

# Linkage Map Construction

Eight female-specific SNP markers were added (**Supplementary Table S2**) in addition to the high-quality SNP markers generated. Marker loci were first partitioned into LGs. The robustness of the markers in each LG was validated by filtering out markers with modified logarithm of odds (MLOD) scores of less than 6.

Additionally, 8 sex-specific SNP markers that were heterozygous in female S. paramamosain were added to the selected markers for genetic linkage map construction. The HighMap strategy was employed for the construction of a high-density and high-quality map (Liu et al., 2014). Firstly, linkage phases were inferred based on recombinant frequencies and LOD scores estimated by two-point analysis. Then, marker ordering was conducted by combining the enhanced Gibbs sampling, spatial sampling and simulated annealing algorithms (GSS) (Jansen et al., 2001; Van Ooijen, 2011). After several cycles, a stable map order was obtained. A subset of currently unmapped markers was then added to the previous sample with a decreased sample radius for subsequent mapping. The process was repeated until all markers were mapped accordingly. The SMOOTH strategy and the k-nearest neighbor algorithm were used to correct errors based on the parental contribution of genotypes and to impute missing genotypes, respectively (van Os et al., 2005; Huang et al., 2012). Skewed markers were subsequently inserted into the linkage map via the multipoint method of maximum likelihood. The sex-specific maps were constructed based on heterozygous markers in either parent, whereas the consensus map was built by integrating the maps of both parents via anchor markers. The Kosambi mapping function was then used to estimate map distances (Kosambi, 1943). The expected genome size (G<sub>e</sub>) was calculated using the formula:

$$G_{e} = (G_{e1} + G_{e2})/2$$

$$\text{with } G_{e1} = \sum \left( LG_{OL} + 2s \right), \quad G_{e2} = \sum \left[ LG_{OL} \times \left( (m+1)/(m-1) \right) \right]$$

where LG<sub>OL</sub> is the observed length of each linkage group, s represents the average marker interval and m is the number of markers in each linkage group (Chakravarti et al., 1991; Ma et al., 2016). The estimated genome coverage was then calculated as the percentage of observed genome size divided by G<sub>e</sub> (Liao et al., 2007; Jones et al., 2013; Ma et al., 2016). The ratio of the female:male recombination rate was calculated using two methods, (1) based on the full length of each LG, and (2) based on the length of shared markers between the female and male linkage maps (specifically for LG32).
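The genome size and coverage formulas above translate directly into a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the per-LG lengths and marker counts in the example are toy values, and the average marker interval s is supplied by the caller because the article does not spell out how it was computed.

```python
def expected_genome_size(lg_lengths, marker_counts, s):
    """Average of the two estimators G_e1 and G_e2 defined in the text.

    lg_lengths    observed length of each linkage group (cM)
    marker_counts number of markers m in each linkage group
    s             average marker interval (cM)
    """
    if len(lg_lengths) != len(marker_counts):
        raise ValueError("one marker count per linkage group is required")
    if any(m < 2 for m in marker_counts):
        raise ValueError("each linkage group needs at least two markers")
    # G_e1: add 2s to every linkage group to account for the chromosome ends.
    ge1 = sum(length + 2 * s for length in lg_lengths)
    # G_e2: scale every linkage group by (m + 1) / (m - 1).
    ge2 = sum(length * (m + 1) / (m - 1)
              for length, m in zip(lg_lengths, marker_counts))
    return (ge1 + ge2) / 2


def genome_coverage(observed_length, expected_length):
    """Observed map length as a percentage of the expected genome size."""
    return 100.0 * observed_length / expected_length


if __name__ == "__main__":
    # Toy map with two linkage groups (hypothetical values).
    ge = expected_genome_size([100.0, 50.0], [11, 6], s=10.0)
    print(ge, genome_coverage(150.0, ge))
```

As a consistency check against the Results, if s is taken as the reported 0.81 cM, the first estimator for the 49-LG consensus map is 5,996.66 + 2 × 0.81 × 49 = 6,076.04 cM, close to the reported expected size of 6,076.28 cM, and 5,996.66/6,076.28 reproduces the reported coverage of 98.69%.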

# QTL Analysis and Candidate Genes Identification

The QTL analysis was conducted using the R/QTL package (Broman and Sen, 2009) to link phenotypic trait measurements with genotypic data in an attempt to uncover the genetic basis of variation in the measured traits (Kearsey, 1998). The phenotypic sex was treated as a binary trait (0 for females and 1 for males). Composite Interval Mapping (CIM) was performed for each trait. The LOD threshold for each data set was acquired based on a permutation test (1,000 permutations, P < 0.05). Candidate genes were then identified based on the obtained QTLs of sex and growth-related traits. In brief, SNPs on QTLs were grouped as SLAF markers based on the initial SLAF sequencing. They were then compared to the assembled and annotated gonadal transcriptome sequences of mature S. paramamosain (GenBank ID: SRR5387739 and SRR5387741) by NCBI<sup>1</sup> Blast+. The parameters were set as e-value < 1e−05, identities > 90% and sequence length of alignment > 80 bp. The candidate genes were obtained according to the annotation of the transcriptome sequence.
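The logic of a permutation-based LOD threshold can be shown with a small, self-contained Python sketch. It deliberately swaps in a simple single-marker scan (a genotype-class-means model against a single-mean model) for the composite interval mapping performed in R/QTL, so it illustrates only the permutation step; all function names and the simulated data are hypothetical.

```python
import math
import random


def lod_score(genotypes, phenotypes):
    """Single-marker LOD: genotype-class means versus one overall mean."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    classes = {}
    for g, y in zip(genotypes, phenotypes):
        classes.setdefault(g, []).append(y)
    rss_alt = 0.0
    for ys in classes.values():
        class_mean = sum(ys) / len(ys)
        rss_alt += sum((y - class_mean) ** 2 for y in ys)
    if rss_null == 0.0:
        return 0.0        # no phenotypic variation to explain
    if rss_alt == 0.0:
        return math.inf   # genotype explains the trait completely
    return max(0.0, (n / 2.0) * math.log10(rss_null / rss_alt))


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=1):
    """Genome-wide LOD threshold from the maximum LOD of each permutation."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # breaks any genotype-phenotype association
        maxima.append(max(lod_score(m, shuffled) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return maxima[index]


if __name__ == "__main__":
    rng = random.Random(42)
    n = 129   # family size used in the study; the data below are simulated
    linked = [i % 2 for i in range(n)]
    markers = [linked] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(9)]
    trait = [1.5 * g + rng.gauss(0.0, 1.0) for g in linked]
    print(permutation_threshold(markers, trait), lod_score(linked, trait))
```

Shuffling the phenotypes while keeping the genotype matrix intact preserves the correlation among markers, which is why the maximum LOD per permutation, rather than a per-marker value, is used to set the genome-wide threshold.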

# Statistical Analyses

All statistical analyses were conducted using IBM SPSS Statistics ver. 20 and Microsoft Excel 2016. Pearson's correlation test was conducted between every pair of traits to determine their strength of association. A generalized linear model (GLM) was used to evaluate the association between markers of different genotypes and the expected growth-related traits derived from QTL analysis, with the genotype as the explanatory variable and growth-related traits as dependent variables. The Student–Newman–Keuls (SNK) method was subsequently applied if significant differences among genotypes occurred. Results were considered statistically significant at P < 0.05.

# RESULTS

# SLAF Sequencing Summary

Before the construction of the sequencing library, restriction enzymes were chosen based on the predicted number of SNPs and the length of the produced fragments. With an insert size range of 314–414 bp, the combination of restriction enzymes HaeIII and Hpy166II was expected to produce the highest number of SNP markers (227,798) and achieved a digestion efficiency of 93.49% in the control sample. Library construction and sequencing of the parent and 129 progenies using HaeIII and Hpy166II generated 731.37 M high-quality paired-end reads, with a Q30 percentage of 93.92% and a GC percentage of 41.78% (**Table 1**). Of the five major patterns that could be used in linkage map construction, nn × np was the major pattern (43.44%). In contrast, ab × cd accounted for only 0.01% of the total SNP number (**Supplementary Figure S1**). After filtering out markers with MLOD values of less than 6, 16,693 out of 17,246 markers were selected for subsequent genetic linkage map construction (**Table 1**).

# Construction of Genetic Linkage Maps

<sup>1</sup> ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

TABLE 1 | SLAF-seq data statistics in Scylla paramamosain.

Based on a pseudo-testcross strategy, a genetic linkage map containing 49 LGs with 16,701 markers was constructed with a high individual integrity value of 99.95% (**Table 2**). The total lengths of the male and female linkage maps were 5,877.71 and 5,790.08 cM, respectively. These two sex-specific linkage maps were integrated into a sex-averaged map (**Figure 1**) that spanned 5,996.66 cM with an average marker interval of 0.81 cM (**Table 2**). Detailed information on the female, male, and sex-averaged genetic linkage maps is shown in **Supplementary Table S3**. The estimated genome size was 6,004.58 cM for the male map, 5,907.38 cM for the female map, and 6,076.28 cM for the sex-averaged map. Based on the ratio of the observed to the estimated sizes, the genome coverage of the male, female, and sex-averaged maps was 97.89, 98.01, and 98.69%, respectively.
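The genome coverage values reported for the three maps follow from the ratio of observed to estimated map length. A short check using only the lengths given in the text (the function and variable names are illustrative):

```python
# Observed and estimated map lengths (cM), as reported in the text.
maps = {
    "male": (5877.71, 6004.58),
    "female": (5790.08, 5907.38),
    "sex-averaged": (5996.66, 6076.28),
}


def genome_coverage(observed_cm, estimated_cm):
    """Return genome coverage (%) as observed / estimated map length."""
    return 100.0 * observed_cm / estimated_cm


coverage = {
    name: round(genome_coverage(observed, estimated), 2)
    for name, (observed, estimated) in maps.items()
}
print(coverage)
```

The rounded results match the reported coverage of 97.89% (male), 98.01% (female), and 98.69% (sex-averaged).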

# Segregation Distortion

Of the 16,701 markers mapped on the genetic linkage map, only 187 were skewed (**Supplementary Figure S2**), a rate of just 1.12%. These skewed markers were distributed across 26 of the 49 LGs, with LG34 and LG5 exhibiting the highest numbers of skewed markers (28 and 25, respectively). Including LG34 and LG5, only 14.29% of the LGs had more than 10 skewed markers. The number of skewed markers in these 26 linkage groups ranged from 1 to 28.
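Skewed markers are commonly identified with a chi-square goodness-of-fit test against the expected Mendelian ratio. This section does not state which test was applied, so the sketch below is an assumption: it tests a 1:1 ratio, as expected for nn × np markers, and uses hypothetical genotype counts.

```python
import math


def chi_square_1to1(count_a, count_b):
    """Chi-square goodness-of-fit test against a 1:1 segregation ratio (df = 1).

    Returns (chi2, p_value).
    """
    total = count_a + count_b
    expected = total / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom the upper-tail p-value is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts among 129 progeny.
chi2_ok, p_ok = chi_square_1to1(70, 59)        # close to 1:1
chi2_skew, p_skew = chi_square_1to1(90, 39)    # strongly skewed

print(round(chi2_ok, 3), p_ok > 0.05)
print(round(chi2_skew, 3), p_skew < 0.05)

# Proportion of skewed markers reported in the text: 187 of 16,701.
skew_rate = round(100 * 187 / 16701, 2)
print(skew_rate)
```

A marker would be flagged as skewed when its p-value falls below the chosen significance level; markers with other expected ratios (for example 1:2:1 for hk × hk) need more degrees of freedom and a general chi-square distribution function.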

# QTL Mapping for Sex and Growth-Related Traits

TABLE 2 | Summary of Scylla paramamosain linkage maps.

Quantitative trait loci for the phenotypic sex trait were found exclusively on LG32, with 516 markers distributed in two QTL regions that covered approximately 86.72% of the 168.88 cM LG length (**Figure 2** and **Table 3**). Specifically, the 8 added female-specific SNP markers showed complete linkage with phenotypic sex and were located in the region of 109.70–123.43 cM of the sex-averaged linkage map and 79.69–87.55 cM of the female linkage map, and were absent from the male linkage map (**Supplementary Tables S5**, **S6**); the average proportion of phenotypic variation explained by these 8 female-specific SNP markers was more than 99% (**Supplementary Table S4**). Hence, the presence of the 8 female-specific SNP markers in the female but not the male linkage map strongly favors female over male heterogamety in S. paramamosain, i.e., a ZW/ZZ sex determination system. Further, the observed segregation patterns of the female-specific SNP markers were compared with those expected under the assumption of a ZW/ZZ or an XY/XX system (**Table 4**). All 8 female-specific SNP markers exhibited segregation pattern 1, expected under female heterogamety, and none segregated according to patterns 6, 7, or 8, expected under male heterogamety. Additionally, 7 estimated markers sharing the same location as the 8 female-specific SNP markers on the linkage map were identified, and when tabulated according to their segregation patterns, 6 showed pattern 1 (**Table 4**). This further highlights this region of LG32 as a possible sex determination region. The male-skewed female:male recombination rate ratios of all markers (recombination ratio = 0.61) (**Table 5**) and of shared markers (recombination ratio = 0.64) (not shown) on LG32 reflect the lower recombination rate in females. Thus, coupled with the linkage data and the segregation patterns observed in the female-specific SNP markers, the skewed recombination rate ratios firmly establish female heterogamety in S. paramamosain.
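Under female heterogamety, a W-linked allele passes from the mother to every daughter and to no son, which is what complete linkage with phenotypic sex means in practice. The sketch below illustrates that check with hypothetical offspring; the function name and data are assumptions, and the sex coding (0 = female, 1 = male) follows the Methods.

```python
def is_female_specific(allele_present, sex_codes):
    """Return True if the allele is carried by every female and by no male.

    allele_present: list of bool, one entry per offspring.
    sex_codes: list of int, 0 for female and 1 for male.
    """
    if len(allele_present) != len(sex_codes):
        raise ValueError("inputs must have the same length")
    return all(
        present == (sex == 0)
        for present, sex in zip(allele_present, sex_codes)
    )


# Hypothetical offspring: three females (0) followed by three males (1).
sex = [0, 0, 0, 1, 1, 1]
w_linked = [True, True, True, False, False, False]   # tracks sex perfectly
autosomal = [True, False, True, True, False, False]  # does not track sex

print(is_female_specific(w_linked, sex))
print(is_female_specific(autosomal, sex))
```

A marker passing this check in all progeny is consistent with a ZW/ZZ system, whereas an allele found in every male and no female would instead point to male heterogamety.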

All measured growth traits and BW of the 129 progeny (**Supplementary Table S7**) were significantly correlated with one another (P < 0.001) (**Supplementary Table S8**). Using Composite Interval Mapping, a total of 27 significant QTLs for growth-related traits were detected (**Table 3**). Of the 20 growth traits, QTLs for 18 traits, including economically important traits such as CL, CW, and BW, were found on LG46, with most SNPs located in the narrow region of 25.18–33.74 cM (**Table 3** and **Supplementary Table S4**). Thus, this region was postulated to be the candidate genomic region involved in the growth regulation of S. paramamosain. The PVE values of all growth-related traits ranged from 5.8 to 11.95%, with maximum PVE values lying between 8.9 and 15.8% (**Table 3**). Interestingly, the QTLs for CL, CW, ICW, CFW, AW, BH, and CWS8 shared the same 43 SNPs on LG46 from 26.39 to 33.74 cM, highlighting the relatedness among these growth-related traits. The full list of markers corresponding to each trait is available in **Supplementary Table S4**. Looking specifically at the economically important traits, the QTLs of CL and CW both occupied the same region, 26.39 to 33.74 cM, whereas that of BW was at 30.06 cM on LG46 (**Supplementary Table S4** and **Figure 3**). The QTLs of these three economically important traits had high maximum PVE values of 14.1, 14.4, and 15.8% for CL, CW, and BW, respectively (**Table 3**), and 14 markers were shared among them (**Supplementary Table S4**).

Of the 166 SNP markers within the QTLs of sex and growth-related traits, 23 (16 of which were the same marker on different QTLs) were successfully annotated against the transcriptome assembly of S. paramamosain using Blast+ (**Table 6** and **Supplementary Table S9**). Eleven candidate genes were identified from a single sex QTL on LG32 (qSEX\_32-b), including 26S proteasome non-ATPase regulatory subunit 3 (PSMD3), RNA polymerase II subunit A C-terminal domain phosphatase (CTDP1), low density lipoprotein receptor-related protein 2 (LRP2) and protein FAM126B (FAM126B). Owing to the overlapping QTL regions on LG46 for most growth-related traits, the same SNP marker was present in the QTLs of 16 growth-related traits (**Supplementary Table S9**). This marker (marker 2388) showed significant similarity with a unigene annotated as multidrug resistance-associated protein 4 (MRP4) and cystic fibrosis transmembrane conductance regulator (CFTR).

# Association Between SNP Markers and Growth-Related Traits

Of the 95 SNP markers distributed in the 27 QTLs of growth-related traits, more than half (67) showed significant association with at least one growth-related trait, and all 20 growth-related traits had at least one associated marker (**Supplementary Table S10**). Further, four markers (23029, 3846, 7391, and 8848) were significantly associated with 10–16 growth-related traits (P < 0.05) (**Supplementary Table S10**). Interestingly, at three markers (i.e., 23029, 3846, and 7391), individuals with genotype hk were larger in terms of CL and CW than those with genotypes hh and kk. A similar pattern was also observed in the association between marker 3846 and BW, where hk-genotype individuals were significantly heavier than those of the other two genotypes. The higher average values of CL and CW in individuals with genotype hk at markers 3846 and 7391 indicate that genotype hk at these two markers is better for selecting CL and CW than genotype hk at markers 23029 and 8848. Unlike markers of segregation type hk × hk, those of types ll × lm and nn × np did not show any significant association with the economically important growth traits, i.e., CL, CW, and BW (**Supplementary Table S10**).
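With genotype as the only categorical explanatory variable and a normally distributed response, a linear model of this kind reduces to a one-way ANOVA. The sketch below computes the F statistic for one marker and one trait under that assumption; the function name and the carapace-width values are hypothetical, and the p-value and the SNK post hoc step are omitted.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of values."""
    values = [v for group in groups.values() for v in group]
    n_total = len(values)
    k = len(groups)
    grand_mean = sum(values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for group in groups.values():
        group_mean = sum(group) / len(group)
        # Variation of the group mean around the grand mean.
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        # Variation of individuals around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in group)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical carapace widths (mm) grouped by genotype at one marker.
cw_by_genotype = {
    "hh": [80, 82, 84],
    "hk": [90, 92, 94],
    "kk": [81, 83, 85],
}
f_stat, df_b, df_w = one_way_anova_f(cw_by_genotype)
print(round(f_stat, 2), df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that mean trait values differ among genotypes, which is the condition under which a post hoc comparison such as SNK would follow.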

# DISCUSSION

# SLAF-Seq in Scylla paramamosain

SLAF-seq is an improved version of the common RAD sequencing, a reduced-representation sequencing technology that enables large-scale genotyping and SNP calling by sampling genome-wide restriction enzyme loci via next-generation sequencing (Sun et al., 2013). The use of restriction enzymes to generate large numbers of DNA fragments is the basis of RAD sequencing and its derivatives, including SLAF-seq. In the present study, the number of SNPs obtained after sequencing was in line with the in silico digestion estimate (275,876 SNPs were obtained, and 227,798 SNPs were estimated), indicating high consistency between simulated and actual digestion with the predicted restriction enzymes, and the suitability of HaeIII and Hpy166II for SLAF-seq of S. paramamosain.

TABLE 3 | Quantitative trait loci (QTLs) for sex and growth-related traits of Scylla paramamosain.


TABLE 4 | Expected and observed segregation patterns for female-specific SNP markers (n = 8) and estimated markers sharing the same loci with female-specific SNP markers (n = 7) under the assumption of either a ZW/ZZ or a XY/XX sex determination system (according to Staelens et al., 2008).


ZW/ZZ, patterns 1–5; XY/XX, patterns 6–10; NI, not informative; Segregation patterns unique for the two sex determination systems are in italics; 'A' and 'a' are symbols and do not represent allelic dominance. One estimated marker (marker 19218) exhibited pattern 4.

TABLE 5 | The SNP marker numbers and recombination rates (all markers) of each LG of Scylla paramamosain.




Unigene ID, the ID of unigene from the gonadal transcriptome data of mature S. paramamosain (GenBank ID: SRR5387739 and SRR5387741).

Of the 275,876 SNPs, a total of 17,246 were polymorphic in the mapping population. The high sequencing depth in the parents (102×) and offspring (average 22.65×) supported the accuracy of SLAF-seq in the current study. Owing to its ability to yield large numbers of high-quality SNPs, SLAF-seq has been used in organisms with and without a reference genome sequence (Sun et al., 2013), including other brachyuran species such as E. sinensis (Qiu et al., 2017) and P. trituberculatus (Lv et al., 2017).

A high proportion (96.79%; 16,693 out of 17,246) of the polymorphic SNP markers was used in the S. paramamosain linkage map construction. The high number of polymorphic markers suggests high heterozygosity and complexity of the S. paramamosain genome. After the addition of the 8 female-specific SNP markers, all 16,701 SNPs (100%) were successfully assigned to the linkage map.

Crustacean genomes are highly repetitive (Song et al., 2016; Yuan et al., 2017; Zhang et al., 2019). Thus, removing reads with multiple targets in the genome is beneficial for increasing the quality of a genetic linkage map. In our preliminary genome survey of S. paramamosain, the proportion of repetitive sequences in the genome was 60.8% (unpublished data). Once a draft genome of S. paramamosain becomes available, we expect to be able to enhance the quality of the current genetic linkage map by predicting the targets of reads and removing multiple-target reads.

# Linkage Mapping

The constructed high-density genetic linkage map of S. paramamosain, based on 16,701 SNP markers, has 49 LGs and spans a total distance of 5,996.66 cM. It is the second genetic linkage map, and the first high-quality one, for S. paramamosain. Compared with the first genetic linkage map, which was 2,746.4 cM in length and constructed from a combination of 60 microsatellites and 152 AFLP markers (Ma et al., 2016), the current genetic linkage map is of higher density and comprises almost 78-fold more markers. The greater map length found in our study is expected, as an increase in marker density (Ball et al., 2010) increases the power to detect recombination (markers are more likely to be mapped into regions previously lacking markers), and the map lengthens accordingly as more recombination events are recognized. Low marker densities may lead to underestimation of map length (Slate, 2008). Unlike current high-density linkage maps, previous generations of linkage maps were constructed with low numbers of markers, resulting in smaller genome coverage. Similar increases in map length (almost doubled) due to increased marker density have also been reported in the Chinese mitten crab (Cui et al., 2015; Qiu et al., 2016, 2017) and turbot (Ruan et al., 2010; Bouza et al., 2012; Wang W. et al., 2015). Another reason might be related to the number of chromosomes. The number of chromosomes in S. paramamosain is high (n = 49) (Chen et al., 2004), and with the low number of markers used in the previous linkage map construction (Ma et al., 2016), some chromosomes might not have been mapped, resulting in missing chromosomal information and consequently smaller genome coverage. The high genome coverage (>97%) and small average marker interval (0.81 cM) reflect the high density and quality of the constructed genetic linkage map (Lv et al., 2017; Qiu et al., 2017). The much higher resolution of the SNP-based linkage map facilitates more accurate and detailed QTL mapping, provides more anchor points for whole-genome sequence assembly, and serves as a new chromosome framework for comparative genomic studies with other closely related organisms.

The number of LGs in our genetic linkage map is consistent with the reported number of haploid chromosomes of S. paramamosain (n = 49) (Chen et al., 2004) (note that although Chen et al. (2004) described the investigated species as Scylla serrata, it should be S. paramamosain because of a nomenclature error). The higher number of LGs (n = 65) in the previously constructed genetic linkage map (Ma et al., 2016) might be due to weak linkage between markers. Additionally, the lower marker density (only 212 markers) used in the previous study may be another reason for the larger number of LGs. When marker numbers are low, large gaps exist between markers, and thus more linkage groups are predicted (Da Costa E Silva et al., 2007; Da Costa et al., 2012; Tao et al., 2017). A similar reduction in the estimated number of LGs between first- and second-generation linkage maps has also been observed in other organisms, such as the common carp Cyprinus carpio [from 64 LGs (Cheng et al., 2010) to 50 LGs (Peng et al., 2016)] and pear Pyrus spp. [from 18 LGs (Iketani et al., 2001) to 17 LGs (Wu et al., 2014)]. Thus, this improved linkage map, with 49 LGs and 16,701 markers, is presently the densest mud crab linkage map.

# Segregation Distortion

Segregation distortion is the deviation of the segregation ratio of a locus from the expected Mendelian ratio. Such distortion is common in linkage mapping studies and genome analysis, and the proportion of skewed markers varies among species (Wang et al., 2016). The percentage of skewed markers found in this study (1.12%), however, is considerably lower than that of the previous study (32.10%). The skewed markers that do occur in our linkage map could be due to several internal molecular factors, including zygotic viability selection, gene duplication, transposable elements, and unusual meiotic segregation distortion. Additionally, segregation distortion has also been reported to be caused by factors such as small population size and the type of genotyping markers used, as in the case of our previous study (Ma et al., 2016). The genetic linkage maps of E. sinensis had a skewed marker percentage of 15.72% (Qiu et al., 2017) when constructed from a combination of SNP and SSR markers, and 16.76% when genotyped using only SSR markers (Qiu et al., 2016). The high degree of linkage (average = 98.88%) between adjacent markers in each LG suggests that segregation distortion does not substantially affect QTL mapping; instead, incorporating these skewed markers during the construction of genetic linkage maps could enhance genome coverage and improve the detection of linked QTLs (Xu, 2008; Qiu et al., 2017).

# QTL for Sex

Sex determination is an essential part of reproduction and holds significant importance in genome evolution (Cui et al., 2015). The sex determination system in crustaceans is controversial: early karyotyping studies suggested an XY/XX sex determination system (Niiyama, 1938; Lécher et al., 1995), but recent analyses of genetic linkage maps revealed a ZW/ZZ sex determination system in the Chinese mitten crab E. sinensis (Cui et al., 2015), brine shrimp Artemia franciscana (De Vos et al., 2013) and black tiger shrimp Penaeus monodon (Staelens et al., 2008). Coupled with the recent discovery of several female-specific SNP markers in S. paramamosain (Shi et al., 2018), we believe that the female heterogametic system (ZW/ZZ) is one of the sex determination systems in crustaceans. In the current study, 516 markers for the phenotypic sex trait were mapped to a single LG, LG32 (168.88 cM). The high coverage of these markers (approximately 86.72%) on LG32 suggests that the sex determination system of S. paramamosain is polygenic but that the sex-determining QTLs are located on the same LG, unlike in some fish species, in which sex-determining QTLs are spread over several LGs (Palaiokostas et al., 2015b). The presence of female-specific SNP markers exclusively on the female linkage map, their near 100% PVE values, their segregation patterns that comply with those of a ZW/ZZ system (Staelens et al., 2008), and the lower recombination rate in females on LG32 strongly imply that sex determination in S. paramamosain follows a ZW/ZZ system, with LG32 as the putative sex chromosome. These findings serve as a solid foundation for future sex manipulation of S. paramamosain. Future studies involving triploidy induction by retention of the second polar body are recommended to investigate the tendency toward feminization in S. paramamosain larvae and to further validate the suggested ZW/ZZ system (Sellars et al., 2010); the sex ratio of induced triploids is expected to be skewed toward females. A similar triploidy induction method was successfully applied to validate the suggested ZW/ZZ sex determination system in E. sinensis (Cui et al., 2015). The 6 estimated markers that shared the same loci with the female-specific SNP markers and exhibited a similar female heterogamety pattern should be further investigated as well. These female-specific markers could be useful in future genetic analysis of the sex chromosome of S. paramamosain and other Scylla species (Jairin et al., 2013). With the completion of S. paramamosain whole-genome sequencing in the future, this region containing female-specific SNP markers should provide useful insights into the genes involved in the sex determination mechanism of this economically important crustacean species.

To further identify potential genes within the QTLs for sex, we compared the markers within the two sex QTLs with the transcriptome data of S. paramamosain. Markers from only one QTL (qSEX\_32-b) were successfully annotated. Among the sex-related candidate genes, the 26S proteasome non-ATPase regulatory subunit 3 (PSMD3) was identified via marker1101. The 26S proteasome is an essential egg coat lysin found in sperm that enables sperm penetration through the egg's vitelline coat (Sutovsky, 2011). The low density lipoprotein receptor-related protein 2 (LRP2) gene was found via marker131703. LRP2 is postulated to be involved in the development of reproductive organs by regulating the uptake of androgens and estrogens bound to sex-steroid binding globulin in reproductive tissues (Willnow et al., 2007).

# QTL for Growth-Related Traits

Quantitative trait loci mapping of growth-related traits has been conducted in various aquaculture species, including fish (Fu et al., 2016; Peng et al., 2016; Pang et al., 2017; Sun et al., 2017), shrimps (Andriantahina et al., 2013; Baranski et al., 2014), crabs (Cui et al., 2015; Hui et al., 2017; Lv et al., 2017) and bivalve mollusks (Wang et al., 2016; Nie et al., 2017). The high-density linkage maps constructed in the current study serve as a powerful tool for accurate QTL mapping, allowing identification of QTL locations and marker sequences, both of which are essential for the genetic improvement of selected traits in aquaculture (Andriantahina et al., 2013). This study is the first reported attempt at QTL mapping of growth-related traits in S. paramamosain. The measured growth traits of S. paramamosain were significantly correlated with one another, and the QTL regions of almost all growth traits (90%), including some of high economic value such as CL, CW, and BW, were located on LG46. The distribution of almost all growth traits on one LG reflects the tight linkage of these traits, whereas the location of their QTLs within a small interval of 25.18 to 33.74 cM indicates that these traits may be regulated by the same genes or by genes occupying the same or nearby genetic positions (Andriantahina et al., 2013; Lv et al., 2017). This strongly indicates that LG46 serves as a major chromosome involved in growth regulation of S. paramamosain. The QTL clustering of almost all growth traits also explains the positive correlation among the measurements of the various growth traits. Future decoding of the full genome of S. paramamosain will allow the discovery of potential candidate genes underlying the QTLs of growth-related traits found in this study, especially those on LG46. Among the markers found in the QTLs of growth-related traits, only marker2388 on LG46 was successfully annotated, to two genes: multidrug resistance-associated protein 4 (MRP4) and cystic fibrosis transmembrane conductance regulator (CFTR). Interestingly, MRP4 is known to be involved in the regulation of prostaglandin transport across cell membranes (Reid et al., 2003). Prostaglandins have been shown to affect molting in crustaceans, with higher prostaglandin levels resulting in a shorter molt cycle duration in Penaeus esculentus (Koskela et al., 1992). This suggests that MRP4 may indirectly regulate the growth of mud crab by modulating prostaglandin levels, and thus molting.

# Association Analysis Between SNP Markers and Growth-Related Traits

The high number (67 out of 95) of SNP markers significantly associated with growth-related traits in the GLM analysis is consistent with the results obtained via QTL analysis. The non-association of some SNP markers with their respective traits is expected, as the calculated PVE values for all markers ranged only between 5.8 and 11.95%. Two markers (i.e., 3846 and 7391) were significantly associated with the highest number of growth-related traits (16 out of 20), while seven markers were associated with more than two growth-related traits (**Supplementary Table S10**). Meanwhile, the DLP4 trait was significantly linked to 38 markers. This pattern, in which one marker was associated with several traits and several markers were simultaneously associated with one trait, indicates that one SNP marker might be involved in the regulation of several growth-related traits and that several SNP markers might jointly control the same trait in S. paramamosain. Similar observations were reported in previous association studies of transcriptome-derived microsatellite markers and growth performance in S. paramamosain (Ma et al., 2014) and in other aquaculture species, such as Asian seabass Lates calcarifer (Xu et al., 2006), large yellow croaker Larimichthys crocea (Xiao et al., 2016) and Pacific oyster Crassostrea gigas (Wang and Li, 2017). The polymorphisms and their potential regulatory effects on growth-related traits observed in this study highlight the involvement of potential genes in growth regulation of S. paramamosain. Thus, further study to identify these candidate genes based on the current QTL analysis data is recommended. Additionally, the replicability and correlation of the four markers associated with economically important growth traits across families, populations and generations should be investigated, as genes are known to segregate and/or recombine over generations (Tizaoui and Kchouk, 2012). Based on our results, individuals with genotype hk are better candidates for future breeding programs owing to their higher values of economically important growth traits, and the markers corresponding to each targeted trait could be used to select S. paramamosain with higher growth performance.

# Applications of High-Density Linkage Maps and Growth-Related Traits QTLs in Genomics, Genetics, and Breeding

The constructed high-density linkage map of S. paramamosain, with a total length of 5,996.66 cM and an average marker interval of 0.81 cM, serves as a solid foundation for future genome sequencing, sequence assembly and marker-assisted selection of economically important traits. The putative ZW/ZZ sex determination system of S. paramamosain uncovered in this study contributes significantly toward the understanding of the sex determination mechanism in decapod crustaceans and facilitates the future establishment of mono-female culture populations. Future research on the QTLs of growth-related traits, especially the QTL regions of CL, CW, and BW found in this study, is expected to improve the breeding and aquaculture of S. paramamosain. In addition to promoting genetic breeding and stock enhancement and preventing inbreeding in the aquaculture sector, the linkage map constructed in the present study and the available SNP markers are also beneficial to population studies of wild mud crabs, including parentage assignment and population structure analysis (Smith et al., 2007; Panetto et al., 2017). Further, given the limited number of genetic markers available for other Scylla species, the large number of SNP markers described in this study could potentially be amplified in other closely related species.


# AUTHOR CONTRIBUTIONS

HM conceived and designed the research. HM, XS, and KW performed the research. HM, KW, XS, and SF analyzed the data. HM contributed reagents and materials. KW wrote the manuscript. SL, YZ, HZ, WL, and MI provided substantial comments and revised the manuscript. All authors read and approved the final version of the manuscript.

# FUNDING

This study was funded by the National Key Research & Development Program of China (No. 2018YFD0900201), the National Natural Science Foundation of China (No. 31772837), the National Program for Support of Top-Notch Young Professionals, the Science and Technology Project of Shantou City (2016-44), the Program of Ocean and Fishery Department of Guangdong Province (SDYY-2018-11), the STU Scientific Research Foundation for Talents (No. NTF17006), the "Sail Plan" Program for the Introduction of Outstanding Talents of Guangdong Province, China, the Niche Research Grant Scheme (NRGS) (Vot. No. 53131) by the Malaysia's Ministry of Higher Education, and the Program for Innovation and Enhancement of School of Department of Education of Guangdong Province (No. 2017KCXTD014).

# ACKNOWLEDGMENTS

We are grateful to Qingyang Wu, Huaqiang Tan, Yin Zhang, Zhuofang Xie, and Mengyun Guan for their assistance in sample collection and laboratory analyses.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00298/full#supplementary-material







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Waiho, Shi, Fazhan, Li, Zhang, Zheng, Liu, Fang, Ikhwanuddin and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A High-Density Genetic Linkage Map and QTL Mapping for Sex in Black Tiger Shrimp (Penaeus monodon)

Liang Guo<sup>1,2†</sup>, Yu-Hui Xu<sup>3†</sup>, Nan Zhang<sup>1,2</sup>, Fa-Lin Zhou<sup>1,2</sup>, Jian-Hua Huang<sup>1,2</sup>, Bao-Suo Liu<sup>1,2</sup>, Shi-Gui Jiang<sup>1,2</sup> and Dian-Chang Zhang<sup>1,2</sup>\*

<sup>1</sup> Key Laboratory of South China Sea Fishery Resources Exploitation and Utilization, Ministry of Agriculture and Rural Affairs, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, China, <sup>2</sup> Guangdong Provincial Engineer Technology Research Center of Marine Biological Seed Industry, Guangzhou, China, <sup>3</sup> Biomarker Technologies Corporation, Beijing, China

### Edited by:

Paulino Martínez, University of Santiago de Compostela, Spain

### Reviewed by:

Jun Hong Xia, Sun Yat-sen University, China Shikai Liu, Ocean University of China, China Xinxin You, Beijing Genomics Institute (BGI), China

\*Correspondence:

Dian-Chang Zhang zhangdch@scsfri.ac.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 25 October 2018 Accepted: 26 March 2019 Published: 09 April 2019

### Citation:

Guo L, Xu Y-H, Zhang N, Zhou F-L, Huang J-H, Liu B-S, Jiang S-G and Zhang D-C (2019) A High-Density Genetic Linkage Map and QTL Mapping for Sex in Black Tiger Shrimp (Penaeus monodon). Front. Genet. 10:326. doi: 10.3389/fgene.2019.00326

The black tiger shrimp, Penaeus monodon, is important in both fishery and aquaculture and is the second-most widely cultured shrimp species in the world. However, the current strains cannot meet market needs in various culture environments, and genome resources for P. monodon are still lacking. Restriction-site associated DNA sequencing (RADseq) has been widely used in genetic linkage map construction and in quantitative trait loci (QTL) mapping. We constructed a high-density genetic linkage map with RADseq in a full-sib family. This map contained 6524 single nucleotide polymorphisms (SNPs) and 2208 unique loci. The total length was 3275.4 cM, and the genetic distance was estimated to be 1.1 Mb/cM. The sex trait is a dichotomous phenotype, and the same interval was detected as a QTL using QTL mapping and genome-wide association analysis. The most significant locus explained 77.4% of the phenotypic variance. The sex locus was speculated to be the same in this species based on sequence alignments in the Mozambique, India, and Hawaii populations. The constructed genetic linkage map provides a valuable resource for QTL mapping, genome assembly, and genome comparison for shrimp. The demonstrated common sex locus is a step closer to locating the underlying gene.

### Keywords: Penaeus monodon, genetic linkage map, sex, QTL mapping, RADseq

# INTRODUCTION

The black tiger shrimp, Penaeus monodon, is naturally distributed in the Indo-West Pacific region and is cultured in much of this region (Motoh, 1985). It is commercially important in both capture fisheries and aquaculture (Brackishwater Aquaculture Information System, 1988; FAO, 2018), and is the second-most widely cultured shrimp species after the Pacific white shrimp (FAO, 2018). Substantial efforts have been made to improve the quality of the breeding strains. In China, the strains "Nanhai No. 1" and "Nanhai No. 2" target growth, and survival rates are being improved through selective breeding and cross-breeding. In India, breeding for disease resistance has been performed (Robinson et al., 2014). Even so, the cultured strains still cannot meet market needs, especially under severe disease outbreaks and negative influences on the expansion of aquaculture production (Robinson et al., 2014; FAO, 2018). Thus, it is necessary to develop genome resources for breeding to achieve sustainable aquaculture.

This species has 44 chromosomes, based on karyotype analysis (Kong, 1993; You et al., 2010). Different kinds of markers, including amplified fragment length polymorphism (AFLP), simple sequence repeat (SSR), and single nucleotide polymorphism (SNP) markers, have been used to construct genetic linkage maps for this species (Tassanakajon et al., 2002; Wilson et al., 2002; Wuthisuthimethavee et al., 2005; You et al., 2010; Baranski et al., 2014; Robinson et al., 2014). The highest-density map was constructed with 3959 coding SNPs (cSNPs) genotyped with an Illumina iSelect genotyping array; it contains 2170 unique loci, and the flanking sequences have also been released (Baranski et al., 2014). This map was constructed using samples collected from the coast of India, so it is referred to as the India map in this study. Recently, restriction-site associated DNA sequencing (RADseq) has been widely used to construct high-density genetic linkage maps (Robledo et al., 2018), including those of the Pacific white shrimp (Yu et al., 2015) and Kuruma prawn (Lu et al., 2016). With RADseq, a high-density genetic linkage map can be readily achieved, which is important for locating the functional genes underlying traits, assembling genome sequences, and comparing chromosomal evolution (Zhao et al., 2013). For traits related to growth (Sraphet, 2004) and disease resistance (Robinson et al., 2014), quantitative trait loci (QTL) mapping has been performed in this species. These studies could provide clues for genome dissection and marker-assisted breeding.

The mechanisms of sex determination are diverse, especially with respect to the master sex-determining gene. Master sex-determining genes have been confirmed in several fish (Kikuchi and Hamaguchi, 2013) and insects (Geuverink and Beukeboom, 2014). Progress in Decapoda lags behind, despite the high economic importance of this order (Chandler et al., 2017). The black tiger shrimp is gonochoristic; the female reaches a relatively large size, and size dimorphism appears in the late developmental stage (Primavera et al., 1998). Thus, studying sex determination in this species could deepen the understanding of its mechanism in invertebrates and facilitate its potential use in production (Martínez et al., 2014). Heteromorphic sex chromosomes have not been observed (You et al., 2010), which may hint that the sex chromosomes are at an initial stage (Sember et al., 2018). In such cases, sex loci are detected mainly through QTL mapping, as in common carp (Peng et al., 2016; Feng et al., 2018), yellow drum (Qiu et al., 2018), Nile tilapia (Eshel et al., 2012; Palaiokostas et al., 2013), and the tiger pufferfish (Kamiya et al., 2012). Two independent studies, one using AFLP (Staelens et al., 2008) and the other using cSNPs (Robinson et al., 2014), each located only one sex QTL in the black tiger shrimp, which suggests that sex is mainly determined by a single genetic factor (Martínez et al., 2014). Although these two studies published a closely linked sex segment and genetic map-associated sequences, respectively, the two reported sex loci could not be compared with each other because they lacked common sequences to anchor the markers. The former sex-linked segment came from Moana Technologies in Hawaii, and the latter came from the Indian population.
A genetic analysis of the black tiger shrimp showed significant genetic differentiation among individuals residing at the peripheries of the Indian and Pacific Ocean distribution range, which is supported by both microsatellite and mtDNA data. The Pacific Ocean population is differentiated from the Indian Ocean population, and the India and Mozambique populations were significantly differentiated (Fst = 0.065, p-value < 0.05) (Waqairatu et al., 2012). The master sex-determining gene has been reported to vary among strains or populations in insects (Biedler and Tu, 2016) and fish (Wilson et al., 2014), but there is no report on variation in master sex-determining genes among differentiated shrimp populations.

We conducted this study to deepen the knowledge of sex determination in the black tiger shrimp. First, we constructed a high-density genetic linkage map using RADseq and located the sex QTL using QTL mapping and a family-based genome-wide association study (GWAS) in the Mozambique population. We detected only one sex QTL, which is consistent with the conclusion that sex is mainly determined by a single genetic factor, and the segregation pattern supports the WZ–ZZ chromosomal system. Moreover, we compared the published sex-linked segment, the India sex QTL, and the Mozambique sex QTL. All three loci were located in the same region, which hints that the sex determination region of the black tiger shrimp may be the same in different populations.

# MATERIALS AND METHODS

# Ethics Statement

All experiments in this study were approved by the Animal Care and Use Committee of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, and were performed according to the regulations and guidelines that were established by this committee.

# Sample Collection

The full-sib black tiger shrimp family used for QTL mapping was an F2 population. The F0 population was collected from the Mozambique Channel, with permission obtained in accordance with the national guidelines, and cultured at the Shenzhen Experiment Base of the South China Sea Fisheries Research Institute. The shrimp of the first filial generation were artificially inseminated and tagged. One full-sib family, including the F1 parents and the F2 offspring, was randomly selected for genetic linkage map construction and QTL mapping. Broodstock mating, culturing, and larval rearing have been described previously (Sun et al., 2015). At approximately 60 days, the growth traits, including carapace length (CL), body length (BL), and body weight (BW), as well as sex, were recorded according to previous descriptions (Motoh, 1985; Sun et al., 2015), and abdominal muscles from the two parents and the offspring were preserved in 95% alcohol for genotyping. The influence of sex on the growth traits was assessed with the Kruskal–Wallis test using Minitab 17<sup>1</sup>.

<sup>1</sup>http://www.minitab.com

# High-Throughput Sequencing and Genetic Linkage Map Construction

The SLAF-seq method (Sun et al., 2013) was used to survey the genome. The enzymes EcoRI (G|AATTC), NlaIII (CATG|), and MseI (T|TAA) and an effective library length of 380–430 bp were used to enrich the sequences, and read pairs with a read length of 100 bp were sequenced on the Illumina HiSeq 2500 system (Illumina, Inc., San Diego, CA, United States). Library construction and sequencing were conducted at Biomarker Technologies Corporation (Beijing, China).

The raw reads were checked using FastQC v0.11.5 (Andrews, 2015) and filtered using Trimmomatic v0.36 (Bolger et al., 2014). The reads were first trimmed to 50 bp in length, and then bases with a quality lower than 20 were cut off at the start and the end. The reads were also scanned with a four-base-wide sliding window and an average base quality threshold of 15. Finally, reads shorter than 40 bp were dropped. The clean reads were mapped to the reference assembly (unpublished) with BWA-backtrack (Li and Durbin, 2009). The reference assembly was assembled from Illumina short reads; the contig N50 and scaffold N50 were 10 and 383 kbp, respectively. The properly mapped primary read pairs with an insert size of 100–700 bp were selected for the next step. The alignments were piled up with a minimum base quality of 20 and a minimum mapping quality of 20 using SAMtools v1.5 (Li et al., 2009). The genetic map was constructed with Lep-MAP3 (Rastas, 2017), in which the genotype likelihoods were calculated with the script pileup2posterior.awk according to the description for RADseq (Yang et al., 2016). The markers were filtered as follows: (1) coverage in the parents should be between 10 and 200, based on the expected normal distribution (Davey et al., 2013) (**Figure 1**); (2) the number of missing individuals should be no more than 10; and (3) the inheritance should agree with Mendel's law of segregation (p-value = 0.01). The parentage relationship was checked by calculating identity by descent. The linkage groups were assigned with an LOD score limit of 15 and a minimum marker number of 9, using markers informative in both parents. Other single markers were assigned with an LOD score limit of 5. The marker order in each linkage group was taken from the best-scoring of 10 independent runs. Recombination fractions were converted to genetic distances using the Kosambi mapping function.
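The three marker filters and the Kosambi conversion described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study (filtering and mapping were done with Lep-MAP3); the function names, the data layout, and the interpretation of the segregation criterion as "discard markers with a chi-square p-value below 0.01" are assumptions made for illustration only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms for df = 1 or 2)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")

def segregation_p(counts, expected_ratio):
    """Chi-square goodness-of-fit p-value of offspring genotype counts
    against a Mendelian ratio such as (1, 1) or (1, 2, 1)."""
    n = sum(counts)
    total = sum(expected_ratio)
    chi2 = 0.0
    for observed, part in zip(counts, expected_ratio):
        expected = n * part / total
        chi2 += (observed - expected) ** 2 / expected
    return chi2_sf(chi2, df=len(counts) - 1)

def keep_marker(parent_depths, n_missing, counts, expected_ratio,
                min_depth=10, max_depth=200, max_missing=10, alpha=0.01):
    """Apply the three filters: parental coverage, missing individuals,
    and agreement with Mendelian segregation (assumed: keep if p >= alpha)."""
    depth_ok = all(min_depth <= d <= max_depth for d in parent_depths)
    missing_ok = n_missing <= max_missing
    mendel_ok = segregation_p(counts, expected_ratio) >= alpha
    return depth_ok and missing_ok and mendel_ok

def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# Hypothetical markers (all counts are invented for illustration).
print(keep_marker((35, 42), 2, (50, 46), (1, 1)))         # balanced 1:1 marker
print(keep_marker((35, 42), 2, (80, 16), (1, 1)))         # strongly distorted marker
print(keep_marker((35, 42), 2, (25, 50, 23), (1, 2, 1)))  # 1:2:1 marker
print(round(kosambi_cm(0.10), 2))
```

The closed-form tail probabilities keep the sketch free of external dependencies; a general implementation would use a statistics library for arbitrary degrees of freedom.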

The average interval was calculated as the accumulated length of the linkage groups divided by the number of loci minus the number of linkage groups (Ren et al., 2016). The expected genome length was calculated as the sum of the expected lengths of all linkage groups, where the expected length of each linkage group is its own length plus two times the average interval (Fishman et al., 2001). The genome coverage was estimated as the accumulated map length divided by the expected genome length.
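As a worked check of these definitions, the short Python sketch below applies them to the map summary reported in the Results (3275.4 cM, 2208 unique loci, 44 linkage groups). The function name is hypothetical, and coverage is taken here as the observed map length divided by the expected genome length, which is the reading that reproduces the reported 96.1%.

```python
def map_statistics(total_length_cm, n_loci, n_groups):
    """Return (average interval, expected genome length, coverage).

    average interval = total length / (loci - linkage groups)
    expected length  = total length + 2 * average interval per linkage group
    coverage         = total length / expected length
    """
    average_interval = total_length_cm / (n_loci - n_groups)
    expected_length = total_length_cm + 2.0 * average_interval * n_groups
    coverage = total_length_cm / expected_length
    return average_interval, expected_length, coverage

avg, expected, cov = map_statistics(3275.4, 2208, 44)
print(f"{avg:.2f} cM, {expected:.1f} cM, {cov:.1%}")
# prints: 1.51 cM, 3408.6 cM, 96.1%
```

Adding two average intervals per linkage group accounts for the unsampled chromosome ends beyond the terminal markers.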

# Sex QTL Detection

The phased genotypic data were exported from Lep-MAP3 (Rastas, 2017). The QTL were detected with MapQTL 6 (Ooijen et al., 2009). Potential QTL were first detected using interval mapping, and the significance threshold of the LOD score was determined using a permutation test with a P-value of 0.05 and 10,000 permutations. Then, the SNPs closest to the significant QTL were taken as cofactors to narrow the interval in subsequent MQM mapping. As an alternative method of QTL detection, an association analysis was performed using the GWAF package (Chen and Yang, 2010), which was designed for family data. The genotypes of the filtered SNPs were directly transformed to fit this program. The kinship was calculated according to the pedigree. The association between sex and genotypes was tested using logistic regression (Chen and Yang, 2010). Bonferroni correction was performed to control the false-positive rate. The candidate genes and the annotation from the sex QTL interval were obtained from our genome program (unpublished).
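The Bonferroni step can be illustrated with a few lines of Python. The sketch assumes a family-wise error rate of 0.05 divided by the 6524 mapped SNPs reported in the Results; this combination is an inference rather than a stated setting, but it reproduces the GWAS threshold of 7.66 × 10<sup>−6</sup> given there. The function names and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def significant_snps(p_values, alpha=0.05):
    """Return the names of SNPs whose p-value is below the corrected threshold.
    p_values: dict mapping SNP name -> p-value; every entry counts as one test."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(name for name, p in p_values.items() if p < threshold)

threshold = bonferroni_threshold(0.05, 6524)
print(f"{threshold:.2e}")
# prints: 7.66e-06

# Invented p-values for three SNPs; only the first passes 0.05 / 3.
print(significant_snps({"snp_a": 1e-9, "snp_b": 0.02, "snp_c": 0.4}))
```

Bonferroni correction is conservative when markers are in linkage, as they are within a single family, so it guards against false positives at some cost in power.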

# Sex Loci Comparison

The relationship and consensus among the sex-linked segment (Staelens et al., 2008), the India sex QTL, and the Mozambique sex QTL were assessed. The mRNA sequences within the India sex QTL interval were downloaded and concatenated according to their order on the India map (Robinson et al., 2014). The scaffolds within the Mozambique sex QTL interval in this study were likewise concatenated according to their positions on the Mozambique map. The synteny was constructed using the LAST program (Kielbasa et al., 2011), which can find homologous sequences while taking reverse complements and large gaps into consideration. The sex-linked segment (Staelens et al., 2008) was used as a query to search against the genome assembly using BLAST (Altschul et al., 1997).

# QTL Validation

The significant sites in the Mozambique sex QTL interval were validated in another population. One hundred individuals were randomly collected from our breeding population, which is an admixed population. The DNA was extracted using a HiPure Tissue and Blood DNA Kit (Magen, Guangzhou, China), and its quality was tested with 1% agarose gel electrophoresis. Primers were designed using Primer-BLAST to cover the SNPs. PCRs were performed using a PCR amplification kit (PrimeSTAR® HS, Takara, Dalian, China) with a program of 5 min at 94°C; 35 cycles of 45 s at 94°C, 45 s at 60°C, and 45 s at 72°C; and 10 min at 72°C. Finally, the PCR products were genotyped and sequenced on a 3130xl capillary DNA analyzer (Applied Biosystems, Foster City, CA, United States); the allele sizes were analyzed using GeneMapper version 4.0 (Applied Biosystems, Foster City, CA, United States), and the sequences were viewed using the SeqMan software package (Lasergene Version 7.1; DNA Star Inc., Madison, WI, United States). The sites that contained the target SNPs in the mapping population were genotyped in the validation population as SNPs or indels (insertions or deletions). A genotypic (2 df) test was performed to test the relationship between genotype and sex (Purcell et al., 2007).
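A genotypic test with 2 degrees of freedom amounts to a Pearson chi-square test on a 2 × 3 table of sex by genotype class. The Python sketch below is a plain reimplementation for illustration, not the software cited above; the function name and the genotype counts are hypothetical. For 2 degrees of freedom, the chi-square tail probability has the closed form exp(−χ²/2), so no statistics library is needed.

```python
import math

def genotypic_test(table):
    """Pearson chi-square test on a 2 x 3 table.

    table: two rows (e.g., females and males), each with counts for the
    three genotype classes. Returns (chi2, p_value) with 2 degrees of freedom.
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2 x 3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("empty rows or genotype classes are not handled in this sketch")
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with df = 2.
    return chi2, math.exp(-chi2 / 2.0)

# Invented counts for 100 individuals: rows are females and males,
# columns are the three genotype classes at one site.
chi2, p = genotypic_test([[5, 40, 3], [45, 5, 2]])
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

If one genotype class is absent in the sample, the table reduces to 2 × 2 and the test has 1 degree of freedom, which this sketch deliberately rejects rather than handling silently.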

# RESULTS

# Phenotyping

A full-sib family consisting of two parents and 98 offspring (**Supplementary Table S1**) was sampled. The traits CL, BL, and BW were 20.1 ± 2.9 mm, 71.1 ± 9.3 mm, and 5.1 ± 1.7 g (mean ± SD), respectively, and only BW fit the normal distribution (p-value > 0.05). These three traits were significantly correlated with each other (p-value < 0.01), and Pearson's correlation coefficients were 0.83, 0.83, and 0.95 for the trait pairs CL–BL, CL–BW, and BL–BW, respectively. Among the offspring, there were 52 males and 46 females. There was no significant difference in these three growth traits between the sexes (p-value > 0.05).

# Genetic Linkage Map Construction

After filtering, 93.61% of the bases had a quality above 30. Each offspring yielded 680 ± 395 (mean ± SD) thousand read pairs; one parent yielded 20 million read pairs and the other yielded 18 million read pairs. Only 64.0% of the read pairs were properly mapped as primary alignments, among which only 47.8% had a mapping quality above 20 and were used in later steps. After strict filtering, 6821 SNPs were selected to construct the genetic map.

According to the marker-based relatedness, all the offspring were assigned to the targeted family. A sex-averaged consensus genetic map was constructed, in which 6524 SNPs located on 2354 scaffolds (**Supplementary Tables S2**, **S3**) were assigned to 44 linkage groups (**Figure 2** and **Table 1**), containing 2208 unique loci. The consensus map was 3275.4 cM in length. The average interval between loci was 1.51 cM. This map was estimated to cover 96.1% of the genome. Given a genome size of 2.47 Gb (C-value: 2.53) (Gregory, 2018), the ratio of physical to genetic distance was estimated as 1.1 Mb/cM. The length of each linkage group ranged from 9.7 to 175.0 cM, and the number of unique loci varied from 8 to 161. The scaffolds of the genome assembly were anchored to the genetic linkage map. After removing the scaffolds that were anchored to more than one linkage group and those supported by only one SNP, 3202 SNPs remained on the genetic linkage map.
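The genome size quoted above follows from the C-value through the standard conversion of roughly 0.978 Gb per picogram of DNA. The sketch below shows the arithmetic; that the authors used exactly this factor is an assumption, and the function name is hypothetical.

```python
PG_TO_GB = 0.978  # approximate conversion: 1 pg of DNA is about 0.978 x 10^9 bp

def c_value_to_gb(c_value_pg):
    """Convert a haploid DNA content (C-value, in pg) to genome size in Gb."""
    return c_value_pg * PG_TO_GB

print(f"{c_value_to_gb(2.53):.2f} Gb")
# prints: 2.47 Gb
```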

# Sex QTL Detection and Validation

The QTL were detected using both QTL mapping and GWAS. For the trait sex, only one locus on linkage group 23 was genome-wide significant in both methods, and the two intervals overlapped (**Figure 3**). The genome-wide significance thresholds were an LOD score of 5.1 in QTL mapping and a p-value of 7.66 × 10<sup>−6</sup> in GWAS. The features of this QTL in QTL mapping are described below. The most strongly associated locus explained 77.4% of the phenotypic variance. The interval (where the LOD score exceeded the threshold) ranged from 53.49 to 92.45 cM on linkage group 23, and the peak was located at 74.45 cM. MQM confirmed that this interval contained only one QTL. To confirm the result, primers were designed to validate these sites in another randomly collected breeding population. Finally, five sites, X262, X881, X5302, X5303, and X748, were successfully genotyped and were significantly associated with sex, with p-values of 2.49 × 10<sup>−19</sup>, 1.30 × 10<sup>−10</sup>, 1.11 × 10<sup>−12</sup>, 2.83 × 10<sup>−12</sup>, and 7.99 × 10<sup>−8</sup>, respectively (**Table 2**).

### TABLE 1 | Summary of the genetic linkage map.

Seventeen mRNA sequences (19.6 kbp) and 42 scaffolds (20.6 Mbp) located in the India and Mozambique sex QTL intervals, respectively, were concatenated. They were confirmed to belong to the same locus, with four common gene segments detected based on the synteny (**Figure 4**). The sex-linked segment from the Hawaii population hit scaffold 000006388 with an e-value of 1 × 10<sup>−130</sup>. This scaffold was anchored at 75.47 cM in the Mozambique sex QTL. In the sex QTL interval, 29 genes (**Supplementary Table S4**) were located on the scaffolds that contain the SNPs with the highest LOD in QTL mapping.

# DISCUSSION

The black tiger shrimp is an important species in aquaculture and in fishery. In this study, we constructed a genetic linkage map using RADseq and preliminarily located the sex QTL.

The genome assembly is the foundation of structural and functional genomics. With the wide application of next-generation sequencing technology, genome assemblies and genetic linkage maps have rapidly accumulated (Lehmann et al., 2018; Robledo et al., 2018). In fish, 27 chromosome-scale genome assemblies have been published, and the contig N50 in Nile tilapia, orange clownfish, and Asian seabass is over 1 Mbp (Lehmann et al., 2018). However, short reads are of limited use for complex genomes with a large size and a high proportion of repetitive sequences. For example, in the Pacific white shrimp, repetitive sequences are estimated to account for 79.37% of the genome (Yu et al., 2015) and the C-value is 2.50 (Chow, 1990). The published genome assemblies in shrimp, including those of cherry shrimp (Kenny et al., 2014), Pacific white shrimp (Yu et al., 2015), Kuruma prawn, and black tiger shrimp (Yuan et al., 2018), were assembled using short reads and are far from complete compared with those of fish (Lehmann et al., 2018). The scaffold/contig N50 values of these four genome assemblies are all less than 1 kbp (Kenny et al., 2014; Yu et al., 2015; Yuan et al., 2018). In our study, the quality of the genome assembly used is comparable to that of the published shrimp genome assemblies. Only approximately 60% of the RADseq reads were properly mapped to the genome assembly, which was presumably caused mainly by the poor quality of the genome assembly and the large amount of repetitive sequences. This incompleteness makes fine mapping and genome comparison difficult. Fortunately, long-read sequencing technologies, such as PacBio SMRT, are now in use. The combination of these technologies is expected to improve the quality of complex genome assemblies (Lehmann et al., 2018).

FIGURE 3 | Illustration of the QTL for the trait sex. GWAS (A) and QTL mapping (B) were performed to locate the sex QTL. The overlapping interval from the two methods indicates that this genome-wide significant locus is the only interval in which the sex QTL is located. Family structure can confound the results of GWAS; the QQ plot (C) for the GWAS shows that the result is statistically significant after effective correction for family structure. In (A,B), the length of each linkage group is plotted as the genetic distance on the x-axis. The genome-wide significant points in (C) are highlighted in light green.

<sup>a</sup>Indels exist beside the sites X262 and X881; the genotypes are shown as amplicon sizes. These two sites were also confirmed from the consensus sequences obtained by direct sequencing.

A genetic linkage map is useful in genome assembly, genome comparison, and QTL mapping. The number of markers on a map is determined primarily by the technology and secondarily by the experimental procedure. With the advantage of RADseq, recently published maps contain many more SNPs (Peng et al., 2016); for example, the map for blunt snout bream contains 14,648 SNPs (Wan et al., 2017), and the map for Pacific white shrimp contains 6359 markers. In our study, the number of unique loci was 2208, which is smaller than the 4693 markers on the Pacific white shrimp map (Yu et al., 2015). The smaller number of unique loci on our map is presumably caused by less input data for each individual, fewer offspring, and shorter reads. Compared with the Pacific white shrimp study, the data volume and the number of offspring in our study are both approximately half. Relative to the large number of SNPs, the number of individuals used for recently published maps has been small. In the blunt snout bream map, the number of unique loci was 5676, which means that, on average, nearly three SNPs were located on each unique locus. More individuals would provide extra information on crossovers, improve the resolution of the map, and benefit the improvement of the reference assembly.

By comparing the genetic distance and unique markers in each linkage group, the density of the Mozambique map in this study is at the same level as that of the India map, with approximately 2200 unique loci each (Baranski et al., 2014). In general, different genetic maps are integrated using common markers (Holtz et al., 2017). With the reference assembly as an intermediary, genetic maps constructed from different kinds of markers can be compared and integrated (Tang et al., 2015; Sutherland et al., 2016). However, the quality of our reference assembly is poor, and only about half of the sequences from the India map can be downloaded from the database; as a result, the correspondence of linkage groups between the India map and the Mozambique map could not be established because of inadequate common scaffolds between the maps. For the India map, only 2114 mRNA sequences, assigned to 1422 unique loci, can be downloaded from GenBank. These mRNA sequences were mapped to the reference assembly (unpublished) with the LAST program (Kielbasa et al., 2011). The correspondence of the linkage groups between these two maps was rebuilt with ALLMAPS (Tang et al., 2015). However, only 164 scaffolds were supported by at least two mRNA sequences, and only 50 scaffolds were supported by both maps, accounting for only 0.5% of the scaffolds by number.

FIGURE 4 | Synteny between the sequences located in the sex QTL intervals of the India and Mozambique maps. The synteny between the sequences that are located in the same interval from the two independent studies hints that the sex determination region of black tiger shrimp in different populations may be the same.

Various experiments have confirmed that the sex of the black tiger shrimp is determined by a WZ–ZZ chromosomal system (Benzie et al., 2001; Li et al., 2003; Staelens et al., 2008; Robinson et al., 2014). Our result also supports this conclusion. Even though an approximately equal number of SNPs in the sex QTL interval favor heterozygosity in either sex, the phased genotypes favor female over male heterogamety. At the site X262, 93 out of 98 individuals support the WZ–ZZ system, in which the segregation pattern in the female parent is associated with sex determination. Sex determination in shrimp remains unclear, as no master sex-determining gene has been confirmed (Chandler et al., 2017). The master sex-determining gene has been reported to vary among strains or populations in insects (Biedler and Tu, 2016) and in fish (Wilson et al., 2014). Two independent studies each identified only one interval containing the sex QTL in the black tiger shrimp (Staelens et al., 2008; Robinson et al., 2014). In this study, we compared the previous sex loci with our sex QTL, and the sequence alignment supports that these three loci are the same locus. The present-day black tiger shrimp is thought to have ancestral origins in the Gondwana supercontinent (Waqairatu et al., 2012) and to have evolved through continental drift, ice age events, and environmental adaptation. Our evidence hints that the sex QTL may be the same across this species and provides a foundation for mapping the master sex-determining gene in the future. We found the gene SOX2 located in the sex QTL interval; it has been reported to be necessary for the normal development and function of the hypothalamo-pituitary and reproductive axes in humans and in mice (Kelberman et al., 2006), and is also thought to be essential for spermatogenesis and testis development in Chlamys farreri (Liang et al., 2019). However, the actual sex-determining gene is far from being discovered: sex-determining factors vary in gene and form, such as a piRNA in silkworms and alternative splicing in Drosophila melanogaster (Biedler and Tu, 2016), and SOX2 is located far from the supposed master sex-determining gene.

One of the applications of genetic markers is sex identification, especially in fish (Mei and Gui, 2015). We confirmed that the previously published sex-linked segment could also be used in our population. However, the segregation pattern can only be tested using polyacrylamide gel electrophoresis or capillary electrophoresis, which is time-consuming and costly. A specific marker that can be tested by agarose gel electrophoresis needs to be developed in the future.

# CONCLUSION

Restriction-site associated DNA sequencing was applied to construct a high-density genetic linkage map for the black tiger shrimp in this study, and our result supports the WZ–ZZ system. The sex QTL was located, and the evidence indicates that this locus is the same in the Mozambique, India, and Hawaii populations.

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Animal Care and Use Committee of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences. The protocol was approved by the Animal Care and Use Committee of South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences.

# REFERENCES


# AUTHOR CONTRIBUTIONS

D-CZ conceived and designed this work. Y-HX and NZ executed the experiments. LG analyzed the data and wrote the manuscript. F-LZ, J-HH, B-SL, and S-GJ helped in the execution of some experiments. All authors discussed the results of the manuscript, reviewed the manuscript, and read and approved the final manuscript.

# FUNDING

The study was supported by China Agriculture Research System-48 (CARS-48), China-ASEAN Maritime Cooperation Fund (00-201620821), National Science and Technology Infrastructure Platform (2018DKA30470), and Guangdong Oceanic and Fisheries Project of China (A201701A01).

# ACKNOWLEDGMENTS

We thank the breeding group in Shenzhen Experiment Base of South China Sea Fisheries Research Institute for their assistance in the shrimp cultivation and in the preparation of tissue samples. We also thank Li-Shi Yang, Yun-Dong Li, Zhi-Kang Huang, Ming-Ge Zhuang, Meng-Ke Shi, and Hong-Di Fan for their assistance in result improvement.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00326/full#supplementary-material

TABLE S1 | The pedigree information for the mapping family.

TABLE S2 | The genotypes in Hapmap format for the 6524 SNPs.

TABLE S3 | The SNPs in the constructed genetic map. The flanking sequences are provided.

TABLE S4 | The genes located in the scaffolds containing the SNPs with the highest LOD in QTL mapping.




(Penaeus monodon) using microsatellite and AFLP markers. Anim. Genet. 41, 365–376. doi: 10.1111/j.1365-2052.2009.02014.x


**Conflict of Interest Statement:** Y-HX was employed by the company Biomarker Technologies Corporation.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Guo, Xu, Zhang, Zhou, Huang, Liu, Jiang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Sterility of Allotriploid Fish and Fertility of Female Autotriploid Fish

Fangzhou Hu†, Jingjing Fan†, Qinbo Qin†, Yangyang Huo, Yude Wang, Chang Wu, Qingfeng Liu, Wuhui Li, Xuan Chen, Liu Cao, Min Tao, Shi Wang, Rurong Zhao, Kaikun Luo and Shaojun Liu\*

State Key Laboratory of Developmental Biology of Freshwater Fish, College of Life Science, Hunan Normal University, Changsha, China

Based on the formation of an autotetraploid fish line (4nAUT, 4n = 200; F2–F11) derived from the distant hybridization of female Carassius auratus red var. (RCC, 2n = 100) × male Megalobrama amblycephala (BSB, 2n = 48), we produced autotriploid hybrids (3nAUT) by crossing females of RCC with males of 4nAUT and allotriploid hybrids (3nALT) by crossing females of Cyprinus carpio (CC, 2n = 100) with males of 4nAUT. The aim of this study was to comparatively investigate the reproductive characteristics of 3nALT and 3nAUT. We investigated morphological traits, chromosomal numbers, DNA content and gonadal development in 3nAUT and 3nALT. The results indicated that both 3nAUT and 3nALT possessed 150 chromosomes and were triploid hybrids. The females and males of 3nALT and the males of 3nAUT had abnormal gonadal development and could not generate mature eggs or sperm, but the females of 3nAUT had normal gonadal development and generated mature eggs at 2 years old. The females of 3nAUT generated eggs of different sizes, which were fertilized with haploid sperm from RCC and formed viable diploid, triploid, and tetraploid offspring. The formation of these two kinds of triploid hybrids provides an ideal model for studying the reproductive traits of triploid hybrids, which is of great value in animal genetics and reproductive biology.

Keywords: distant hybridization, autotriploid, allotriploid, fertility, sterility

# INTRODUCTION

Polyploids are organisms that normally have three or more chromosome sets. Polyploidy is common in plants, and studies have shown that all angiosperms are ancient polyploids (Otto, 2007). As research continues, increasing evidence has shown that polyploids are also widespread in animals and are mainly concentrated in amphibians, reptiles, and fishes (Mable, 2004; Gregory and Mable, 2005; Wertheim et al., 2013). Polyploids can be divided into autopolyploids and allopolyploids according to their origin of chromosome doubling. Allopolyploids possess a combination of chromosomes from two or more different species, while autopolyploids possess multiple chromosome sets mainly derived from a single taxon.

In fish, which are lower vertebrates, chromosomes display plasticity, and polyploids are thus produced more easily (Liu, 2010). Triploid fish arise spontaneously in both wild and cultured populations and can be induced via physical or chemical methods. The artificial induction of triploid fish is mainly used to improve traits associated with sexual maturation, such as higher growth rates, stronger disease resistance, and better organoleptic properties (Cuñado et al., 2002; Cal et al., 2005; Poontawee et al., 2007; Maxime, 2008; Werner et al., 2008; Chen et al., 2009; Kavumpurath and Pandian, 2010).

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Olga V. Anatskaya, Institute of Cytology (RAS), Russia Zhiyi Bai, Shanghai Ocean University, China

\*Correspondence:

Shaojun Liu lsj@hunnu.edu.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 22 October 2018 Accepted: 09 April 2019 Published: 26 April 2019

### Citation:

Hu F, Fan J, Qin Q, Huo Y, Wang Y, Wu C, Liu Q, Li W, Chen X, Cao L, Tao M, Wang S, Zhao R, Luo K and Liu S (2019) The Sterility of Allotriploid Fish and Fertility of Female Autotriploid Fish. Front. Genet. 10:377. doi: 10.3389/fgene.2019.00377

Fish eggs are released at the metaphase stage of meiosis II. Further resumption of meiosis II of the eggs is induced by the entry of the spermatozoon (Colas and Dubé, 1998). Thus, physical or chemical treatments applied during meiosis II can prevent the extrusion of the second polar body while allowing chromosomal division, thus producing triploids (Piferrer et al., 2009). Generally, physical and chemical treatments are successfully used to induce triploidy in many fishes (Chourrout, 1984, 1988; Haffray et al., 2007; Xu et al., 2010). However, the survival of triploids is usually very low due to the physical and chemical treatments damaging the fertilized eggs. Triploid fish can also be mass produced using indirect methods based on distant hybridization (Biradar and Rayburn, 1993; Bullini, 1994; Mallet, 2007; Liu, 2010; Hu et al., 2018). In our previous study, both females and males of fertile allotetraploid fish (4n = 200) were produced by crossing female red crucian carp and male common carp (Liu et al., 2001). Sterile triploids have been produced at a large scale by crossing allotetraploid and diploid fish (Liu et al., 2001; Chen et al., 2009).

Additionally, in our previous study, we successfully produced both females and males of fertile allotetraploid hybrids (F1, 4n = 148) in the first generation of Carassius auratus red var. (2n = 100) × Megalobrama amblycephala (2n = 48) (Liu et al., 2007). Owing to abnormal chromosome behavior during meiosis in the F1 hybrids, autodiploid sperm and autodiploid ova were produced and fertilized each other, resulting in the formation of autotetraploid F2. Surprisingly, the females and males of the autotetraploids could produce diploid eggs and diploid spermatozoa, respectively. These diploid gametes could be fertilized to form the next generation of autotetraploid fish. The F2–F11 generations of the autotetraploid stocks have been established in succession (Qin et al., 2014). In the present study, based on the formation of 4nAUT, we successfully obtained autotriploid hybrids (3nAUT) and allotriploid hybrids (3nALT) by crossing female RCC × male 4nAUT (F10) and female Cyprinus carpio (CC, 2n = 100) × male 4nAUT (F10), respectively. Furthermore, we investigated important biological traits of 3nAUT and 3nALT, including morphological traits, chromosomal numbers, DNA content and gonadal development. This study is of importance for fish genetic breeding and fish reproductive biology.

# MATERIALS AND METHODS

# Animals and Crosses

RCC, CC, 4nAUT (F10), 3nAUT, and 3nALT were obtained from the Protection Station of Polyploid Fish at Hunan Normal University. During the reproductive seasons (from April to June each year), 20 mature female RCC and CC and male 4nAUT (F10) were chosen as the parents. The crossings were performed in two groups. In the first group, RCC was used as the maternal line and 4nAUT was used as the paternal line. In the second group, CC was used as the maternal line and 4nAUT was used as the paternal line. Mature eggs of RCC (or CC) were fertilized with sperm of 4nAUT, and the embryos developed in culture dishes at a water temperature of 20–22 °C. In each group, 5000 embryos were taken at random for the examination of fertilization rate (number of embryos at the gastrula stage/number of eggs) and the hatching rate (number of hatched fry/number of eggs). The hatched fry were transferred to a pond for further culture.
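The two rates defined above are simple proportions of the number of eggs examined. A minimal sketch follows, assuming the rates are reported as percentages; the function names and the counts in the usage lines are hypothetical and are not data from the study.

```python
def rate_percent(count: int, total_eggs: int) -> float:
    """Return count / total_eggs expressed as a percentage."""
    if total_eggs <= 0:
        raise ValueError("total_eggs must be positive")
    if not 0 <= count <= total_eggs:
        raise ValueError("count must lie between 0 and total_eggs")
    return 100.0 * count / total_eggs


def fertilization_rate(gastrula_embryos: int, total_eggs: int) -> float:
    """Embryos reaching the gastrula stage per egg examined, in percent."""
    return rate_percent(gastrula_embryos, total_eggs)


def hatching_rate(hatched_fry: int, total_eggs: int) -> float:
    """Hatched fry per egg examined, in percent."""
    return rate_percent(hatched_fry, total_eggs)


# Hypothetical counts for illustration only (not taken from Table 1).
print(fertilization_rate(4800, 5000))
print(hatching_rate(4500, 5000))
```

Both rates use the number of eggs as the denominator, so the hatching rate can never exceed the fertilization rate for the same batch.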



# Morphological Traits


At 1 year of age, 20 RCC, 20 CC, 20 4nAUT, 20 3nAUT, and 20 3nALT were randomly selected for morphological examination following the methods described in a previous study (Hu et al., 2012). For both measurable and countable data, we used the software SPSS 22.0 to analyze the covariance of the data between hybrid offspring and their parents.

# Measurement of DNA Content

The DNA content of erythrocytes of RCC, CC, 4nAUT and their hybrid offspring was measured using a flow cytometer (cell counter analyzer, Partec). Approximately 0.5–1 ml of red blood cells was collected from the caudal vein of the above fish into syringes containing 100–200 units of sodium heparin. The blood samples were treated following the method described in a previous paper (Liu et al., 2001). The DNA content of each sample was measured under the same conditions. To test whether the ratios of the DNA content of the polyploid hybrids to the sum of that of RCC (CC) and 4nAUT deviated from the expected ratio values, the χ<sup>2</sup> test with Yates' correction was used.
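The text does not spell out how the test was set up, so the following is only a generic sketch of a Yates-corrected χ<sup>2</sup> statistic for observed versus expected values, with a p-value for one degree of freedom. The function names and the example numbers are hypothetical and are not the authors' data or code.

```python
import math


def yates_chi_square(observed, expected):
    """Chi-square statistic with Yates' continuity correction.

    Each term is (|O - E| - 0.5)^2 / E; the corrected difference is
    floored at zero so that the correction never inflates the statistic.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum(
        max(0.0, abs(o - e) - 0.5) ** 2 / e
        for o, e in zip(observed, expected)
    )


def p_value_df1(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical two-category example (illustration only).
chi2 = yates_chi_square([45, 55], [50, 50])
print(chi2, p_value_df1(chi2))
```

A p-value above 0.05 would be read, as in the text, as no significant deviation from the expected ratio.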

# Preparation of Chromosome Spreads

To determine ploidy, chromosomal preparations were performed from peripheral blood cell cultures of 20 3nAUT and 20 3nALT at 1 year of age. The chromosomes were prepared in accordance with a previous study (Liu et al., 2001). First, about 0.1 ml of blood was collected from each sample using a syringe soaked with 0.1% sodium heparin and cultured in nutrient solution at 25.5 °C and 5% CO<sub>2</sub> for 72 h; colchicine was added 3.5 h before harvest. Cells were harvested by centrifugation, followed by hypotonic treatment with 0.075 M KCl at 26 °C for 30 min, then fixed in methanol–acetic acid (3:1, v/v) with three changes. Cells were dropped onto cold slides, air-dried and stained for 30 min in 4% Giemsa solution. The shape and number of chromosomes were analyzed under a microscope. In total, 100 metaphase spreads (50 metaphase spreads for each sample) of chromosomes were analyzed.

# Gonadal Structures and Gamete Phenotypes

At ages of 1 and 2 years, 50 3nAUT and 50 3nALT individuals were randomly sampled for examination of gonad development via histological sectioning. The gonads were fixed in Bouin's solution, embedded in paraffin, sectioned, and stained with hematoxylin and eosin. Gonadal structures were observed and photographed with a Pixera Pro 600ES digital camera (Nikon, Japan). The gonadal stages were classified in accordance with a prior standard series for cyprinid fish (Sun et al., 2003). In addition, at 2 years old, the mature eggs or water-like semen were squeezed out from the females and males of 3nAUT, respectively. The mature eggs and semen were collected for morphological examination.

# Egg Ploidy Detection

The 2-year-old female 3nAUT could produce different-sized eggs. To determine egg ploidy, mature eggs were fertilized with haploid sperm of RCC, and viable offspring were produced. The ploidy of these hybrid offspring was determined by flow cytometric analysis of DNA content in erythrocytes.

# RESULTS

# The Formation of Two Triploid Hybrids

During the reproductive season (from April to June), 3nAUT (**Figure 1D**) were produced by crossing female RCC (**Figure 1A**) and male 4nAUT (**Figure 1C**). 3nALT (**Figure 1E**) were produced by crossing female CC (**Figure 1B**) and male 4nAUT (**Figure 1C**). A high fertilization rate (>96.4%) and hatch rate (>86.7%) were observed in both groups (**Table 1**).

# Phenotypes of Hybrids and Their Parents

The phenotypes of RCC, 4nAUT, CC, 3nALT, and 3nAUT are illustrated in **Figure 1**. The countable and measurable traits of RCC, 4nAUT, CC, 3nALT, and 3nAUT are shown in **Tables 2**, **3**. Several morphological differences were detected between 3nALT and their parents and between 3nAUT and their parents (**Tables 2**, **3**). In addition, the main morphological differences between 3nALT and 3nAUT are that 3nALT have two pairs of barbels and 33–34 lateral scales, whereas 3nAUT have no barbels and 30–31 lateral scales (**Figure 1** and **Table 2**).

TABLE 2 | Comparison of the countable traits between the hybrid offspring and their parents.


TABLE 3 | Comparison of the measurable traits between the hybrid offspring and their parents.

Values in the same column with letter a, b, c, d, e for each species show significant differences with RCC, CC, 4nAUT, 3nAUT, and 3nALT (P < 0.05).

# DNA Content of Two Triploid Hybrids and Their Parents

The DNA content of the parents RCC, CC and 4nAUT was used as the control (**Figure 2** and **Table 4**). The results of the comparisons of DNA content between hybrids and their parents are shown in **Table 4**. The mean DNA content of 3nALT and 3nAUT was equal (P > 0.05) to the sum of half that of the maternal parent (RCC or CC) and half that of 4nAUT, indicating that they were triploids (**Figure 2**).

# Chromosome Number of Two Triploid Hybrids and Their Parents

Chromosomes were counted in 10 metaphase spreads for each sample of RCC, CC, 4nAUT, 3nALT, and 3nAUT (**Figure 3** and **Table 5**). For RCC, 92.5% of chromosomal metaphases possessed 100 chromosomes, indicating that they were diploids with 100 chromosomes (2n = 100) (**Figure 3A** and **Table 5**). For CC, 95.5% of chromosomal metaphases had 100 chromosomes, indicating they were diploids with 100 chromosomes (2n = 100) (**Figure 3B** and **Table 5**). For 4nAUT, 81.0% of chromosomal metaphases had 200 chromosomes, indicating they were tetraploid with 200 chromosomes (4n = 200) (**Figure 3C** and **Table 5**). For 3nAUT, 88.0% of chromosomal metaphases had 150 chromosomes, indicating they were triploid with 150 chromosomes (3n = 150) (**Figure 3D** and **Table 5**). For 3nALT, 89.5% of chromosomal metaphases had 150 chromosomes, indicating they were triploid with 150 chromosomes (3n = 150) (**Figure 3E** and **Table 5**).
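The triploid count of 150 follows from adding one reduced gamete from each parent. The sketch below assumes that both parents contribute normally reduced gametes, that is, a haploid egg from the diploid mother and a diploid sperm from 4nAUT; the somatic numbers are those given in the text, while the function names are illustrative.

```python
def reduced_gamete(somatic_chromosomes: int) -> int:
    """Chromosome number of a gamete carrying half of the somatic complement."""
    if somatic_chromosomes <= 0 or somatic_chromosomes % 2:
        raise ValueError("somatic chromosome number must be a positive even integer")
    return somatic_chromosomes // 2


def expected_offspring(maternal_somatic: int, paternal_somatic: int) -> int:
    """Expected somatic chromosome number when both gametes are reduced."""
    return reduced_gamete(maternal_somatic) + reduced_gamete(paternal_somatic)


RCC = 100    # 2n = 100
CC = 100     # 2n = 100
AUT4N = 200  # 4n = 200

print(expected_offspring(RCC, AUT4N))  # 3nAUT
print(expected_offspring(CC, AUT4N))   # 3nALT
```

Each cross combines 50 maternal and 100 paternal chromosomes, which matches the modal count of 150 reported for both hybrids.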

# Fertility of the Two Types of Triploid Hybrids

The ovaries of 1-year-old RCC developed well and contained stages II, III, and IV oocytes (**Figure 4A**).


<sup>a</sup>The observed ratio was not significantly different (P > 0.05) from the expected ratio.

The testes of 1-year-old RCC contained many lobules in which there were many mature spermatozoa and spermatids (**Figure 4F**).

The ovaries of 1-year-old 3nALT contained many oogonium-like cells but very few ova at stage II (**Figure 4B**). In the testes of 1-year-old 3nALT, some spermatogonia developed into primary spermatocytes (**Figure 4G**). In the ovaries of 2-year-old 3nALT, the oogonium-like cells were disintegrating (**Figure 4E**). In 2-year-old male 3nALT, a number of empty seminiferous tubules lacking secondary spermatocytes or sperm were observed in the testes (**Figure 4J**). In the reproductive season, no milt or eggs were stripped out from the 2-year old males and females of 3nALT. These results suggest that 3nALT were sterile.

The ovaries of 1-year-old 3nAUT were partially developed. Many oogonia proliferated massively, with a few having developed into stage II oocytes (**Figure 4C**). In the testes of 1-year-old 3nAUT, some spermatogonia developed into primary spermatocytes (**Figure 4H**), but no semen could be squeezed out of the testes. The ovaries of 2-year-old 3nAUT developed well and contained stages II, III, and IV oocytes (**Figure 4D**). The testes of 2-year-old 3nAUT contained blunt spermatid-like cells, many spermatogonia with heteromorphous and cavitate nuclei, and a few sperm that lacked tails or nuclei (**Figure 4I**). In the reproductive season, water-like semen and different sizes of eggs were collected from 2-year-old males and females of 3nAUT, respectively (**Figure 5**).

# Egg Ploidy

The ploidy levels of the offspring of female 3nAUT × male RCC were confirmed by measuring DNA content. The results show that diploid, triploid, and tetraploid hybrids were successfully obtained by crossing female 3nAUT and male RCC (**Figure 6**). These results indicate that female 3nAUT produce eggs of at least three different ploidy levels, including haploid, diploid, and triploid eggs.
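The inference from offspring ploidy to egg ploidy is a subtraction of the paternal contribution. A minimal sketch follows, assuming that every RCC sperm is haploid, as stated in the Methods; the names are illustrative.

```python
SPERM_PLOIDY = 1  # haploid RCC sperm, as stated in the text


def egg_ploidy(offspring_ploidy: int, sperm_ploidy: int = SPERM_PLOIDY) -> int:
    """Infer egg ploidy as offspring ploidy minus the sperm contribution."""
    egg = offspring_ploidy - sperm_ploidy
    if egg < 1:
        raise ValueError("offspring ploidy must exceed sperm ploidy")
    return egg


# Offspring ploidy classes reported for female 3nAUT x male RCC.
observed_offspring = {"diploid": 2, "triploid": 3, "tetraploid": 4}
inferred_eggs = {name: egg_ploidy(p) for name, p in observed_offspring.items()}
print(inferred_eggs)
```

Diploid, triploid, and tetraploid offspring therefore correspond to haploid, diploid, and triploid eggs, respectively.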

# DISCUSSION

Distant hybridization is an important means of fish genetic breeding and is also an effective way to produce polyploid offspring. In our previous study, autotetraploid hybrid lines were established from the distant hybridization of red crucian carp × blunt snout bream (Megalobrama amblycephala) (Liu et al., 2007; Qin et al., 2014). In the present study, 3nAUT and 3nALT were produced by crossing female RCC × male 4nAUT and female Cyprinus carpio (CC, 2n = 100) × male 4nAUT, respectively (**Figure 1**).

Distant hybridization is a useful strategy to produce hybrid offspring with altered genotypes (Bullini, 1994; Liu et al., 2007; Hu et al., 2018). Compared with their parents, obvious differences were found in 3nALT and 3nAUT in the measurable and countable data, indicating an effect of distant hybridization (**Figure 1**). Additionally, most of the countable and measurable traits were significantly different between 3nAUT and 3nALT (P < 0.05) (**Tables 2**, **3**). It was easy to distinguish 3nAUT and 3nALT, as 3nALT have two short barbels and 3nAUT have no barbels (**Figure 1**).

TABLE 5 | Examination of chromosome number in 3nAUT, 3nALT and their parents.

Examining the DNA content is a rapid and simple method of determining the ploidy of samples. Counting the chromosomal number is a direct and accurate method. In this study, the ploidy levels of 3nAUT and 3nALT were confirmed by measuring DNA content (**Figure 2** and **Table 4**) and counting chromosomal number (**Figure 3** and **Table 5**). All of the above results were in agreement that both 3nAUT and 3nALT were triploid hybrids.

In aquaculture, induced triploidy is mainly used for the production of sterile fish. According to traditional concepts, triploid fish usually have disordered meiosis, which can lead to low fertility or complete infertility (Vrijenhoek, 1994, 2006; Liu et al., 2000; Yin et al., 2000; Peter et al., 2010). In allotriploid fish, functional sterility may reflect genomic imbalances due to the presence of an extra set of chromosomes (Krisfalusi et al., 2000). Infertile allotriploid fish have been reported in some studies (Liu et al., 2007; He et al., 2013; Xiao et al., 2014; Hu et al., 2018). In autotriploid fish, meiosis is seriously impaired because three sets of homologous chromosomes cannot pair correctly during the zygotene stage of prophase I (Carrasco et al., 1998; Cuñado et al., 2002).

FIGURE 4 | Gonadal development. (A) Histological section of ovary of 1-year-old RCC. (B) Histological section of ovary of 1-year-old 3nALT. (C) Histological section of ovary of 1-year-old 3nAUT. (D) Histological section of ovary of 2-year-old 3nAUT. (E) Histological section of ovary of 2-year-old 3nALT. (F) Histological section of testis of 1-year-old RCC. (G) Histological section of testis of 1-year-old 3nAUT. (H) Histological section of testis of 1-year-old 3nALT. (I) Histological section of testis of 2-year-old 3nAUT. (J) Histological section of testis of 2-year-old 3nALT.



In the present study, the gonadal development of both 3nAUT and 3nALT was examined in histological sections. The results show that 3nALT were sterile and that their gonadal development was abnormal (**Figure 4**). Male 3nAUT also could not produce normal sperm and were sterile, but female 3nAUT were fertile and could produce different-sized eggs during the reproductive season. Han et al. (2010) reported infertility of female triploid rainbow trout caused by developmental abortion of oocytes and that oogonia formed cytocysts before the prophase oocytes. A similar result was found in female triploid yellowtail tetra (Ferreira et al., 2017). In contrast, Gomelsky et al. (2016) described a fertile triploid female koi, which could produce aneuploid eggs. Similarly, in our study, female 3nAUT could produce eggs with at least three different ploidy levels. Further study will be needed to determine whether there are aneuploid eggs.

In general, half-reduced gametes are produced by meiosis in animals. For example, diploid fish usually produce haploid gametes. However, there are some reports of the production of unreduced gametes generated by hybrids. For example, female triploid loach produced haploid and triploid eggs (Zhang et al., 1998). The female and male diploid hybrids of koi Cyprinus carpio × goldfish Carassius auratus produced unreduced diploid eggs and diploid sperm, respectively (Delomas et al., 2017). In our previous study, we found that the female allotetraploid hybrids produced diploid and tetraploid eggs (Liu et al., 2007; Hu et al., 2018). The formation of these unreduced gametes may be related to premeiotic endoreduplication, endomitosis or fusion of germ cells (Allen and Stanley, 1978; Yoshikawa et al., 2008; Liu, 2010). In this study, the 3N eggs produced by 3nAUT may be due to premeiotic endoreduplication of oogonia, thereby forming 6N oogonia and ultimately forming 3N eggs. Additionally, the formation of 1N and 2N eggs may relate to abnormal behavior of chromosomes during meiosis in female 3nAUT. This phenomenon is most common in autotriploid plants (Lange and Wagenvoort, 1973; Del Bosco et al., 2007). Interestingly, though they had the same parents, female 3nAUT were fertile, while males were sterile. This difference may exist because some genes that control meiosis have sex-specific expression (Kaul and Murthy, 1985; Maceira et al., 1992).

In summary, the formation of both 3nALT and 3nAUT is of great value in aquaculture and fisheries. The sterility of 3nALT ensures that they are unable to mate with wild fish, which would play an important role in protecting wild fish resources. In addition, 3nAUT can be used as a model to study the reproductive patterns of polyploid progeny of distant hybridization, and fertile female 3nAUT also provide a special resource for fish breeding.

# ETHICS STATEMENT

All the fish were cultured in ponds at the Protection Station of Polyploid Fish, Hunan Normal University. Fish treatments were performed according to the Care and Use of Agricultural Animals in Agricultural Research and Teaching. This study was approved by the Science and Technology Bureau of China. Approval from the Department of Wildlife Administration was not required for the experiments conducted in this study.

Before dissection, fish were deeply anesthetized with 100 mg/L MS-222 (Sigma-Aldrich).

# AUTHOR CONTRIBUTIONS

SL contributed to the conception and design of the study. FH, WL, QL, XC, YW, LC, and YH performed the experimental work. SL and FH participated in drafting the manuscript. JF, CW, QQ, and MT analyzed the data. SW, RZ, and KL participated in interpretation and discussion of the results. All authors read and approved the final manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (Grant Nos. 31802286, 31430088, and 31730098), the China Postdoctoral Science Foundation (Grant No. 2018M642986), the earmarked fund for China Agriculture Research System (Grant No. CARS-45), the Hunan Provincial Natural Science and Technology Major Project (Grant No. 2017NK1031), the Cooperative Innovation Center of Engineering and New Products for Developmental Biology of Hunan Province (Grant No. 20134486), and the Key Research and Development Program of Hunan Province (Grant No. 2018NK2072).

# REFERENCES

(F2) koi Cyprinus carpio × goldfish Carassius auratus hybrids. J. Fish Biol. 90, 80–92. doi: 10.1111/jfb.13157



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Hu, Fan, Qin, Huo, Wang, Wu, Liu, Li, Chen, Cao, Tao, Wang, Zhao, Luo and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Evolution of Complex Thallus Alga: Genome Sequencing of Saccharina japonica

Tao Liu<sup>1,2</sup>\*†, Xumin Wang<sup>2</sup>†, Guoliang Wang<sup>3,4</sup>†, Shangang Jia<sup>5</sup>\*, Guiming Liu<sup>6</sup>, Guangle Shan<sup>4</sup>, Shan Chi<sup>1,7</sup>, Jing Zhang<sup>8</sup>, Yahui Yu<sup>1</sup>, Ting Xue<sup>9</sup>\* and Jun Yu<sup>3,4</sup>\*

<sup>1</sup> College of Marine Life Science, Ocean University of China, Qingdao, China, <sup>2</sup> College of Life Sciences, Yantai University, Yantai, China, <sup>3</sup> CAS Key Laboratory of Genome Sciences and Information, Beijing Key Laboratory of Genome and Precision Medicine Technologies, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China, <sup>4</sup> University of Chinese Academy of Sciences, Beijing, China, <sup>5</sup> College of Grassland Science and Technology, China Agricultural University, Beijing, China, <sup>6</sup> Beijing Agro-Biotechnology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China, <sup>7</sup> Qingdao Haida Blue Tek Biotechnology Co., Ltd, Qingdao, China, <sup>8</sup> College of Biological Engineering, Qilu University of Technology, Shandong Academy of Sciences, Jinan, China, <sup>9</sup> The Public Service Platform for Industrialization Development Technology of Marine Biological Medicine and Product of State Oceanic Administration, College of Life Sciences, Fujian Normal University, Fuzhou, China

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Chaotian Xie, Jimei University, China Jian Xu, Chinese Academy of Fishery Sciences (CAFS), China

### \*Correspondence:

Tao Liu liutao@ouc.edu.cn Shangang Jia jsg200830@163.com Ting Xue xueting@fjnu.edu.cn Jun Yu junyu@big.ac.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 12 December 2018 Accepted: 09 April 2019 Published: 02 May 2019

### Citation:

Liu T, Wang X, Wang G, Jia S, Liu G, Shan G, Chi S, Zhang J, Yu Y, Xue T and Yu J (2019) Evolution of Complex Thallus Alga: Genome Sequencing of Saccharina japonica. Front. Genet. 10:378. doi: 10.3389/fgene.2019.00378

Saccharina, one of the most important genera of brown algae (Phaeophyceae) with a multicellular thallus, has a remarkable evolutionary history and accounts for most of the economic marine aquaculture production worldwide. Here, we present the 580.5 million base pair genome sequence of Saccharina japonica, whose current assembly contains 35,725 protein-coding genes. In a comparative analysis with Ectocarpus siliculosus, integrated virus sequences revealed evolutionary footprints in the two genomes, which derive from their common ancestry and subsequent genomic rearrangements. Furthermore, gene expansion was found to be an important strategy for functional evolution, especially with regard to extracellular components, stress-related genes, and vanadium-dependent haloperoxidases, and we propose the hypothesis that gene duplication events were the main driving force in the evolution from multicellular filamentous algae to thallus algae. The sequenced Saccharina genome paves the way for further molecular studies and is useful for genome-assisted breeding of S. japonica and other related algae species.

Keywords: Saccharina japonica, genome sequencing, virus genome, phylogenetic analysis, extracellular components, halogen biosynthesis

# INTRODUCTION

Brown algae are a large group of multicellular algae whose large body size gives them a huge biomass that dominates cool temperate intertidal and subtidal waters. Brown algae are fundamentally different from green and red algae: green and red algae acquired plastids from cyanobacteria during primary endosymbiosis, while brown algae acquired theirs through secondary endosymbiosis (Valentin and Zetsche, 1990). Therefore, brown algae may have acquired both cyanobacterial genes via EGT (Endosymbiotic Gene Transfer) and eukaryotic sequences from the nucleus of the red algal endosymbiont (Lane and Archibald, 2008), which makes the genome more complicated to interpret. Knowledge of the brown algal genome is thus crucial for understanding its evolutionary path. There are currently approximately 1500–2000 species of brown algae worldwide. The morphology of brown algae ranges from slender filaments (e.g., Ectocarpus siliculosus) to giant thalli (e.g., Saccharina japonica). Meanwhile, brown algae exhibit a diverse range of life cycles; for example, the haploid–diploid cycle of the genus Saccharina is quite different from that of its close relative, the genus Ectocarpus, which lacks the parenchyma stage. This may indicate key adaptive events in their evolution. One important question is which genes or evolutionary events underlie the structural evolution from filamentous brown algae (Ectocarpus) to heteromorphic haploid–diploid algae (Saccharina).

In addition, the genus Saccharina, one of the most important genera of brown algae, was recently separated from the genus Laminaria based on molecular evidence from nuclear, plastid, and mitochondrial genome sequences (Draisma et al., 2001; Yoon et al., 2001; Erting et al., 2004; Lane et al., 2006). Many Saccharina species constitute the marine forests in Asian coastal areas; they are primary producers in the marine ecosystem, a traditional dietary staple, and of high industrial value (Zemke-White and Ohno, 1999). They are a potential source of renewable energy (Demirbas, 2010), as well as of polysaccharides, e.g., laminarans, alginic acids, and fucoidans (Vishchuk et al., 2011). S. japonica is not only a common seafood in China and many other countries, but has also been documented as a drug in traditional Chinese medicine for over a thousand years, being rich in polysaccharides, e.g., alginate, fucoidin, fucoidan galactosan sulfate and alginic acids, mannitol and trace elements. With the development of modern science and technology, the medical applications of S. japonica have been gradually revealed, such as its capacity to regulate blood lipids, blood sugar, and blood pressure, and its anticoagulant, antioxidant, anti-tumor and anti-radiation activities (Xue et al., 2001; Zhao et al., 2004, 2012; Kim et al., 2006; Wang et al., 2008, 2012; Huang et al., 2010; Mizuta and Yasui, 2010; Vishchuk et al., 2012).

A previous transcriptome study of S. japonica facilitated the understanding of its genomic background and detailed the vanadium-dependent haloperoxidase family (Liang et al., 2014). Previous studies have also described the biosynthesis pathways of important cellular components (alginate and fucoidan) (Chi et al., 2018a) and the complex halogen metabolism of Saccharina (Ye et al., 2015). The previous genome sequencing of S. japonica produced an assembly of 13,327 scaffolds (537 Mb), which covered 98.5% of the estimated genome, and predicted 18,733 protein-coding genes (Ye et al., 2015). It focused on the evolutionary adaptation and the functional diversification of the polysaccharide biosynthesis and iodine concentration mechanisms of S. japonica. However, a better genome assembly is necessary to provide a more complete understanding of S. japonica and to further reveal the evolutionary significance of the multicellular parenchymatous thallus. In particular, Hi-C (high-throughput chromosome conformation capture) data can be used to cluster scaffolds into chromosomes by exploiting chromatin interaction data (Korbel and Lee, 2013). A complete genome assembly of S. japonica will further facilitate research on the evolution of tissue differentiation and heteromorphic alternation of generations. Comparative genomic analysis will give deeper insight into gene family duplication, differential expansion/contraction and the evolution of conserved non-coding sequences. Meanwhile, a detailed investigation of the S. japonica genome will accelerate the elucidation of the mannitol, laminarin, alginate and trehalose metabolic pathways.

Here, we surveyed the genome of S. japonica using high-throughput sequencing. We present a detailed analysis of the updated genome data of S. japonica, which will pave the way for its genetic research and breeding. We report a preliminary analysis of its genome organization and gene content, including gene annotations, GC content, repeat elements and SSRs. Phylogenetic analysis, motif analysis and the exon-intron organization of some important genes were also investigated. We believe the S. japonica genome generated in this study will enhance both fundamental and applied research in related areas.

# MATERIALS AND METHODS

# Sample Collection and DNA Preparation

A thermotolerant and high-yielding Saccharina cultivar "Rongfu" was selected as the genomic DNA source for whole genome sequencing. Samples were collected from Rongcheng, Shandong Province, P. R. China, and were provided by the Culture Collection of Seaweed at the Ocean University of China. Genomic DNA was extracted from fresh sporophytes with an improved CTAB method (Guillemaut and Maréchal-Drouard, 1992).

# Genome Sequencing and Assembly

We constructed three paired-end libraries and three mate-paired libraries according to Illumina standard operating procedure. Sequencing of each library was performed on an Illumina HiSeq 2000 instrument to produce the raw data. We then filtered out low-quality and short reads to obtain a set of usable reads.

We then assembled the reads into contigs using SOAPdenovo with varying parameters, and mate-paired relationships between the reads were used to construct scaffolds. The genome assembly was improved by exploiting chromatin interaction data from Hi-C and grouping contigs into pseudomolecules/chromosomes. Chromatin was digested with the restriction enzyme MboI and ligated in situ after biotinylation. DNA was extracted and sheared before end repair. DNA fragments were enriched through the biotin–streptavidin interaction and subjected to HiSeq sequencing with a paired-end read length of 150 bp. After trimming and quality control, paired-end reads were aligned to the genome assembly, and reads with more than one hit were discarded. Reads without the MboI restriction site were filtered out. Paired-end reads were then screened for valid pairs, i.e., those whose mates aligned to two different enzyme fragments and whose estimated insert size met expectations, using HiC-Pro v2.7.8 (Servant et al., 2015). Based on the relationships among valid reads, scaffolds/contigs were clustered, ordered and oriented into 31 pseudomolecules/chromosomes by LACHESIS (Burton et al., 2013).
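The core of the valid-pair criterion, that the two mates fall on different restriction fragments, can be illustrated with a short sketch. This is a deliberately simplified stand-in and not HiC-Pro's implementation: it ignores the insert-size check, strand orientation, and the other pair classes that HiC-Pro reports. All function names and the toy sequence are hypothetical; only the MboI recognition sequence (GATC) is a known fact.

```python
from bisect import bisect_right

MBOI_SITE = "GATC"  # MboI recognition sequence


def cut_positions(seq: str, site: str = MBOI_SITE) -> list:
    """Return sorted 0-based start positions of every occurrence of the site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def fragment_index(pos: int, cuts: list) -> int:
    """Index of the restriction fragment that contains position pos."""
    return bisect_right(cuts, pos)


def is_valid_pair(mate1, mate2, cuts_by_contig) -> bool:
    """A pair counts as valid here if its mates lie on different fragments.

    Each mate is a (contig_name, position) tuple. Mates on different
    contigs are on different fragments by definition.
    """
    (contig1, pos1), (contig2, pos2) = mate1, mate2
    if contig1 != contig2:
        return True
    cuts = cuts_by_contig[contig1]
    return fragment_index(pos1, cuts) != fragment_index(pos2, cuts)


# Toy contig with two MboI sites (illustration only).
toy = {"ctg1": "AAAAGATCTTTTGATCCCCC"}
cuts_by_contig = {name: cut_positions(seq) for name, seq in toy.items()}
print(cuts_by_contig["ctg1"])
print(is_valid_pair(("ctg1", 1), ("ctg1", 2), cuts_by_contig))
print(is_valid_pair(("ctg1", 1), ("ctg1", 18), cuts_by_contig))
```

Pairs whose mates map within one fragment carry no ligation information, which is why only the remaining pairs are passed on for clustering.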

# Repeat Analysis and Genome Annotation

Both homology-based and de novo prediction analyses were used to identify the repeat content in the Saccharina genome. For the homology-based analysis, we used Repbase (version 20140131) to perform a TE search with RepeatMasker (open-4.0) and the NCBI RMBlast search engine. For the de novo prediction analysis, we used RepeatModeler to construct a TE library, and classified elements within the library using a homology search against Repbase and TEclass.

Approaches including homology detection, expression-evidence-based prediction and ab initio gene prediction were used for gene model construction. To identify homology patterns in S. japonica, a BLASTX search was performed against the NCBI non-redundant protein database with an E-value < 10<sup>−5</sup>. For expression evidence, published ESTs, transcripts and RNA-seq datasets from the OneKP database<sup>1</sup> were aligned to the genome. After a variety of programs had been evaluated and compared, AUGUSTUS was used for ab initio gene prediction. Gene model parameters for the programs were trained from long transcripts and known Saccharina genes processed by PASA. All predictions were then combined into consensus gene structures using EVM and optimized by manual correction.

Functional classification of Gene Ontology of the genes was performed with InterProScan (Zdobnov and Apweiler, 2001). The EuKaryotic Orthologous Groups (KOG) classification was performed against KOG database (Tatusov et al., 2003). Pathway analyses were performed using the Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation service KAAS (Kanehisa et al., 2004).

# Gene Family Analysis

Related protein sequences were downloaded from NCBI, and local BLAST was used to obtain full-length sequences in S. japonica, with the tBLASTn module of WU-BLAST 2.0 and an E-value cut-off of 1 × 10<sup>−5</sup>. Nucleotide sequences were translated into amino acid sequences using MEGA with the standard genetic code and then aligned (Tamura et al., 2011). NCBI BLAST was run on the protein sequence sets of S. japonica, E. siliculosus, Aureococcus anophagefferens, Nannochloropsis gaditana, Phaeodactylum tricornutum, and Thalassiosira pseudonana, which were downloaded from NCBI GenBank, and the proteins were then clustered into orthogroups using OrthoMCL (Li et al., 2003). The S. japonica assembly was compared to the available E. siliculosus genome<sup>2</sup> using lastz (Frith et al., 2010) with the --chain and --gapped parameters, and the alignments were then plotted for synteny analysis.

# Phylogeny Analysis

In phylogenetic analysis, full amino acid sequences of housekeeping genes from archaea, proteobacteria, cyanobacteria, tracheophytes, and algae were aligned using MEGA6 (Tamura et al., 2013) and edited manually. MrBayes v3.1.2 (Ronquist and Huelsenbeck, 2003) was used to investigate evolutionary relationships based on amino acid sequences. Bayesian analysis was performed with two independent runs of four Markov chains each (using default heating values), which were run for 500,000 generations until the average standard deviation of split frequencies was below 0.01 (Posada and Crandall, 1998). Trees were sampled every 100 generations, with the first 25% of trees discarded as burn-in. The remaining trees were used to build a 50% majority-rule consensus tree with posterior probability values. FigTree v1.3.1<sup>3</sup> was used for displaying phylogenetic trees.

<sup>1</sup>https://db.cngb.org/onekp/

<sup>2</sup>http://bioinformatics.psb.ugent.be/genomes/view/Ectocarpus-siliculosus
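The burn-in and consensus steps described in the phylogeny paragraph can be sketched as follows. This is an illustration of the idea only and not MrBayes code: trees are represented as sets of clades (each clade a frozenset of taxon names), the toy trees are invented, and the count of 5,000 samples ignores the starting tree that a sampler may also record.

```python
from collections import Counter


def discard_burnin(samples, burnin_fraction=0.25):
    """Drop the first burnin_fraction of the sampled trees."""
    n_burn = int(len(samples) * burnin_fraction)
    return samples[n_burn:]


def clade_support(trees):
    """Fraction of trees containing each clade (posterior probability estimate)."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}


def majority_rule(trees, threshold=0.5):
    """Keep clades present in more than `threshold` of the trees."""
    return {c: p for c, p in clade_support(trees).items() if p > threshold}


# 500,000 generations sampled every 100 generations, 25% burn-in.
n_samples = 500_000 // 100
retained = discard_burnin(list(range(n_samples)))
print(n_samples, len(retained))

# Toy posterior sample of four trees (hypothetical taxa A-D).
AB, CD, AC = frozenset("AB"), frozenset("CD"), frozenset("AC")
toy_trees = [{AB}, {AB, CD}, {AB}, {AC}]
print(majority_rule(toy_trees))
```

Under these settings 3,750 of 5,000 sampled trees per run would contribute to the consensus, and each clade's support is simply its frequency among them.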

# RESULTS AND DISCUSSION

# Genome Assembly

The genome sequence was assembled using 14.86 and 15.06 Gb of mate-pair sequences from libraries with 3 and 5 kb inserts, respectively, plus 46.54 Gb of paired-end reads from small-insert libraries (**Supplementary Table S1**). The assembled nuclear genome of S. japonica contains 418,683 contigs and 236,802 scaffolds, with a total length of 580.5 Mb (**Table 1** and **Supplementary Table S2**), an N50 of 13,636,083 bp, and a GC content of 48.72%. There are 1,602 scaffolds longer than 100 kb, 5,257 longer than 20 kb, and 11,156 longer than 10 kb. We further improved the assembly using Hi-C data. After trimming, a total of 237,795,214 paired-end reads were obtained, and 105,783,030 paired-end reads were determined to be valid; these aligned to two different enzyme fragments, and the estimated insert size met expectations. Based on the relationships among valid reads, 46,865 scaffolds/contigs were ordered, oriented, and clustered into 31 pseudomolecules/chromosomes (**Supplementary Table S3**), which account for 517,689,860 bp, about 89.19% of the whole genome. This updated genome assembly provides a better resource than the previous assembly (Ye et al., 2015).

<sup>3</sup>http://tree.bio.ed.ac.uk/software/figtree/

TABLE 1 | Genome statistics of S. japonica and E. siliculosus.
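The N50 reported for the assembly is a length-weighted median of sequence lengths. A minimal Python sketch of the usual definition is given below for readers unfamiliar with the statistic; the toy scaffold lengths are hypothetical, and the assembler used in this study may compute the value with slightly different conventions.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical scaffold lengths (arbitrary units); the total is 300, so half is 150.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints: 70
```

Because the two longest toy scaffolds (80 + 70) already reach half of the total, the N50 is 70; a few very long sequences can therefore dominate the statistic.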

A combination of expert and automatic annotation predicted 35,725 gene models, 23,930 of which overlap with those of E. siliculosus (**Supplementary Figure S1**). Genes are rich in introns (4.63 per gene on average). Analysis of GO, KOG, and KEGG annotations revealed similar distributions in S. japonica and E. siliculosus (**Supplementary Figures S2**–**S4**). A genome-wide gene family clustering analysis was conducted using the protein sequences of S. japonica and E. siliculosus. In total, there were 16,954 genes in 9,658 orthogroups in S. japonica and 13,063 genes in 9,615 orthogroups in E. siliculosus, with 18,771 and 3,471 genes unassigned, respectively (**Table 2**). We extended our ortholog search to more species, i.e., A. anophagefferens, N. gaditana, P. tricornutum, and T. pseudonana, and found that the number of genes in orthogroups was highest in S. japonica, suggesting that gene duplication was favored during its genome evolution (**Table 2**).

# Integration of Virus Genome in Co-ancestry Period of Laminariales and Ectocarpales

Laminariales and Ectocarpales algae are phylogenetically closely related. Analysis of the E. siliculosus genome sequence identified the presence of E. siliculosus virus-1 (EsV-1) (Cock et al., 2010). EsV-1 infects only free-swimming gametes or spores. The viral DNA integrates into the cellular genome after infection and then spreads, through mitosis, to all cells of the developing host (Bräutigam et al., 1995; Lee et al., 1998). Viral DNA has been detected in the genomes of some species of Ectocarpus and Feldmannia (Meints et al., 2008; Cock et al., 2010). In our analysis, we found genomic fragments of the phaeovirus EsV-1 integrated in the S. japonica genome, which were also found in the E. siliculosus genome (**Figure 1**). This result suggests that the virus integration may have happened before the divergence of Laminariales and Ectocarpales. After BLAST searches against EsV-1 sequences, we also found that the integrated virus sequences differ considerably between the two genomes: the E. siliculosus genome contains large-fragment clusters (310,438 bp of the 313,838 bp EsV-1 genome, 98.92%, including orthologs of 173 of the 231 EsV-1 genes), whereas in S. japonica the orthologous regions consist of more fragments and fewer large segments (8,256 bp). The integration would therefore have occurred before the differentiation of Laminariales and Ectocarpales, with independent evolutionary events in each lineage afterward. This indicates that the mechanisms of genome evolution differ between Laminariales and Ectocarpales, which may represent a potential driving force in the evolution of advanced brown algae. There may have been large-scale rearrangements in Laminariales genomes during the evolution from filaments to the thallus, accompanied by the loss of some genes.

# Repeats, Introns and Non-coding Regions

We downloaded the genome sequence of the brown alga E. siliculosus (Michel et al., 2010) and conducted a synteny analysis. In total, the conserved regions covered 23,193,396 bases in S. japonica and 15,108,600 bases in E. siliculosus, and no significant duplication of large fragments was observed (**Supplementary Figure S5**). We found that 43.12% of the S. japonica genome assembly could be attributed to repeat sequences, compared with only 22.7% of the reported 214 Mb genome of E. siliculosus, a multicellular filamentous brown alga. LTRs, LINEs, SINEs, and simple repeats are the major components of the repeat sequences (**Supplementary Figure S6**).

S. japonica and E. siliculosus genomes.

In total, 98,386 SSRs conforming to the definitions (i.e., unit length/minimum number of repeats of 1/20, 2/10, 3/7, 4/5, 5/4, and 6/4) were recovered from 15,173 sequences (6.28% of the total sequences) in the S. japonica genome. Given a S. japonica genome size of 580.5 Mb, the frequency of occurrence of these SSRs was estimated to be one SSR per 5.9 kb. Compound SSRs accounted for 17.75%, and the average SSR density was 169 SSRs/Mb. Di-, tri-, and tetra-nucleotide SSRs accounted for 4.51, 46.67, and 17.21% of the identified SSRs, respectively.
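The SSR definitions listed for this analysis (unit length/minimum number of repeats of 1/20, 2/10, 3/7, 4/5, 5/4, and 6/4) can be expressed as a small pattern search. The Python sketch below is only an illustration of those thresholds on perfect repeats; it is not the tool used in this study, it does not detect compound SSRs, and the function name and toy sequence are hypothetical. The last lines recompute the genome-wide density from the two figures reported in the text (98,386 SSRs and 580.5 Mb).

```python
import re

# Minimum repeat counts per motif length, as listed in the text
# (unit length -> minimum number of repeats).
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 4, 6: 4}


def find_ssrs(sequence):
    """Yield (start, motif, repeat_count) for perfect SSRs in a DNA string."""
    sequence = sequence.upper()
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AGAG" is "AG" twice),
            # so each repeat is reported once under its shortest unit.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            yield match.start(), motif, len(match.group(0)) // unit_len


# Hypothetical sequence with a 12-unit AG repeat and a 7-unit ACT repeat.
toy = "TTT" + "AG" * 12 + "CCC" + "ACT" * 7 + "G"
print(list(find_ssrs(toy)))

# Density check using the totals reported in the text.
ssr_count = 98_386
genome_mb = 580.5
density_per_mb = ssr_count / genome_mb      # SSRs per Mb
spacing_kb = genome_mb * 1000 / ssr_count   # average kb per SSR
print(round(density_per_mb), round(spacing_kb, 1))
```

With the reported totals, the density works out to about 169 SSRs per Mb, which corresponds to an average spacing of about 5.9 kb between SSRs.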

Among the dinucleotide SSRs, the AG/CT pattern was the most frequent, representing 41.06% of dinucleotide repeat units, followed by AT/AT (35.74%) and AC/GT (23.09%) (**Supplementary Figure S7**). AGC/CTG was the most abundant trinucleotide motif (34.11%), and ACT/AGT was second at 23.77%. Among the tetra-nucleotide SSRs, ACAG/ATGT (18.20%) was the most common and ACCCG/CGGGT (13.57%) was the most common hexamer (5.00%).

Genome analysis of S. japonica predicted 35,725 genes with an average intron length of 1,203 bp and 4.63 introns per gene. Introns constitute 29% of the whole genome, which is lower than in E. siliculosus (37.3%). A total of 15,820 simple sequence repeats (SSRs) were identified. Among the SSRs, the trinucleotide and dinucleotide repeat types are the most abundant, which is similar to other published algal genome data.

# Housekeeping Genes

As in diverse archaea, bacteria, tracheophytes, and algal species, a wealth of candidate housekeeping gene sequences, such as actin, α-tubulin (TUA), β-tubulin (TUB), elongation factor 1-alpha (EF1-α), and glyceraldehyde 3-phosphate dehydrogenase (GAPDH), were detected in the genome of S. japonica.


Therefore, we built phylogenetic trees, based on the Bayesian method, that display the relationships of full amino acid sequences of housekeeping genes from archaea, proteobacteria, cyanobacteria, tracheophytes, and algae (only representative candidates are included to save space). In the consensus tree of four housekeeping genes (EF1α, GAPDH, TUA, and TUB), almost all the Phaeophyceae algae (including S. japonica) formed a well-supported clade with oomycetes or protists, which indicates that these genes originated from the endosymbiotic host (**Figure 2**). The remaining actin genes from brown algae have a complex evolutionary history. Our phylogenetic analysis shows that they may have arisen from multiple origins. Three copies of S. japonica actin cluster into separate clades: one is related to oomycetes (e.g., S. japonica 1), and the others group with cyanobacteria and bacteria (e.g., S. japonica 2, 3). Therefore, S. japonica may have acquired actin from different ancestors: S. japonica 1 has an endosymbiotic host origin, whereas S. japonica 2 and 3 either have a cyanobacterial origin through endosymbiotic gene transfer (EGT) or were acquired from non-cyanobacterial proteobacteria via horizontal gene transfer (HGT).

# Gene Expansion

During the evolution from single-celled to multicellular organisms, housekeeping genes and genes related to signal transduction and cell junction pathways play an important role (**Supplementary Figures S2**, **S3**), whereas genes and pathways related to extracellular components may be more important in the transition from multicellular filaments to a multicellular thallus, owing to gene expansion (**Table 3**).

In the phylum Heterokontophyta, the orders Laminariales and Ectocarpales are very closely related, which is consistent with phylogenetic analysis of chloroplast genomes. In the present study, the genome size of S. japonica is almost 3 times that of E. siliculosus, and the total gene number is nearly 2.1 times that of E. siliculosus (**Tables 2**, **3**). S. japonica and E. siliculosus genes were searched against the KEGG database, and comparison of the annotated genes revealed a similar distribution of gene types among most categories, indicating that the increase in organismal complexity is not associated with pathway changes. The similarity of orthologous genes was also compared, which suggests that the large increase in the number of S. japonica genes may depend on gene duplication leading to multiple copies. Interestingly, analysis of gene duplication indicated a significant gain of genes associated with the metabolism of cell wall components, such as the alginate and cellulose synthesis pathways. For instance, the Saccharina genome contains 14 candidate cellulose synthase and cellulose synthase-like genes, nearly 1.6 times as many as Ectocarpus (**Table 3**). The modifier gene MC5E in the alginate synthesis pathway reached 84 copies in the Saccharina genome, about 3.2 times as many as in Ectocarpus. The number of genes encoding glycosyltransferases (GTs) and glycosyl hydrolases (GHs), which are likely involved in cell wall polysaccharide metabolism, has also expanded more than twofold in the Saccharina genome. The same is true of protein kinases. These genes include membrane-spanning receptor kinases, which may play key roles in developmental processes such as differentiation (De Smet et al., 2009). This indicates that gene expansion may have contributed substantially to the evolution from multicellular filamentous brown algae to algae with a complex thallus.

Saccharina japonica is the most efficient iodine accumulator among all living organisms, owing to the activities of vanadium-dependent haloperoxidases (vHPOs) (Leblanc et al., 2006). The halogen accumulation level of foliaceous S. japonica is much higher than that of filamentous E. siliculosus, and its halogen metabolism ability is significantly enhanced. In the genome of S. japonica, we identified 89 vHPOs, including 21 vanadium-dependent bromoperoxidases (vBPOs) and 68 vanadium-dependent iodoperoxidases (vIPOs). In contrast, previous genomic studies of E. siliculosus and S. japonica predicted one (**Table 4**) and 76 vHPO genes (Ye et al., 2015) involved in halogen metabolism, respectively. In addition, 32 dehalogenation-related genes were predicted. Brown algae usually have active halogen metabolism, and many halogen-related genes were found in the E. siliculosus and Chondrus crispus genomes (Saenko et al., 1978; Leblanc et al., 2006). Hypohalous acids and organo-halogenated compounds produced by halogenation are thought to participate in algal defense reactions. The large size of these halogen-related gene families is a result of specific evolutionary adaptation to the marine environment, and these families may be involved in various defense mechanisms and may have yielded a richer set of secondary metabolites during the evolution from small to large algae.


TABLE 6 | Gene number of the calcium-based signaling system in sequenced Saccharina variety "Rongfu," S. japonica and E. siliculosus.


Plants respond to various biotic or abiotic stimuli through different mechanisms. Heat shock proteins (Hsps) are essential components of plant tolerance mechanisms under various abiotic stresses. The function of the calcium-based intracellular signaling system is to couple extracellular stimuli with their specific intracellular responses (Edel et al., 2017). Three major elements play a role in the generation and translation of a stimulus-induced Ca<sup>2+</sup> signal: influx, efflux, and decoding (Edel et al., 2017). Compared with E. siliculosus, S. japonica experienced a gene expansion in the Hsp family. The number of Hsp genes, including Hsp20, Hsp33, Hsp40, Hsp70, and Hsp90, in S. japonica is nearly 1.5 times that in E. siliculosus (**Table 5**). In the genome of S. japonica, we identified three protein families for Ca<sup>2+</sup> influx, two for Ca<sup>2+</sup> efflux, and two for calcium decoding (**Table 6**). The numbers of TPC, MCA, and CaM genes were higher than in E. siliculosus (**Table 6**). These gene expansion events may have contributed to the differentiation and evolution from filamentous Ectocarpus to the complexly organized Saccharina. In addition, we found more gene copies of Hsp40, Hsp70, TPC, MCA, and CPK in our S. japonica "Rongfu" genome than in the previous S. japonica genome (Ye et al., 2015). This indicates that the high-temperature-resistant Saccharina variety "Rongfu" possesses more genes of the Hsp and calcium-based signaling systems, which might be a result of artificial selection.

The above results indicate that the expansion of the Saccharina genome was mainly due to gene family expansion, especially in families contributing to cell wall and halogen biosynthesis (**Tables 3–6**). The differentiation and evolution from filamentous Ectocarpus to the complexly organized Saccharina may be related to these gene expansion events.

However, we did not discover large-scale duplication of housekeeping, signal-transduction-related, or cell-communication-related genes. For example, mannitol represents up to 15–26% of dry weight as one of the primary photosynthetic products and storage compounds in Laminariales, and its biosynthesis involves two major enzymes, mannitol-1-P dehydrogenase (M1PDH) and mannitol-1-phosphatase (M1Pase) (Chi et al., 2018b). Two M1PDH unigenes (M1PDH1 and M1PDH2) and two M1Pase homologs (M1Pase1 and M1Pase2) were found in the S. japonica genome, whereas there are three M1PDH unigenes and two M1Pase copies in E. siliculosus (**Table 4**). Meanwhile, compared with E. siliculosus, S. japonica did not experience a significant family expansion in alginate and fucoidan biosynthesis, which involves 6–8 genes/families (Chi et al., 2018a). Mannose-6-phosphate isomerase (MPI), phosphomannomutase (PMM), and mannose-1-phosphate guanylyltransferase (MPG) are involved in converting fructose-6-phosphate into GDP-mannose. GDP-mannose may be used in two biosynthetic pathways: alginate biosynthesis, which involves GDP-mannose/UDP-glucose 6-dehydrogenase (GMD/UGD), mannuronan synthase (MS), and MC5E; and fucoidan biosynthesis, which involves GDP-mannose 4,6-dehydratase (GM46D) and GDP-fucose synthetase (GFS), among others. S. japonica contained fewer gene copies of MPI, PMM/PGM, and GMD than E. siliculosus (**Table 4**).

In addition, we found more gene copies in our S. japonica genome than in the previous S. japonica genomes (Ye et al., 2015), for example, in the vHPO gene family. This indicates that Saccharina varieties possess different gene clusters, which might be a result of artificial selection and which provide important resources for breeding algae with high alginate and iodine content.

# Data Availability

The raw data are deposited in NCBI under BioProject accession PRJNA280905 and Short Read Archive accession SRP057092, which includes four runs: SRR1972526, SRR1972528, SRR1972529, and SRR1972530.

# CONCLUSION

The large alga S. japonica has a large, complex thallus and possesses a large genome with extensive gene expansion. In spite of the high similarity in gene composition and classification, the genomes of E. siliculosus and S. japonica differ in non-coding regions, repeat sequences, intron length, and gene number. In particular, the number of genes related to extracellular components and halogen biosynthesis in S. japonica is significantly higher than in E. siliculosus, which may be the main driving force for the evolution from filament to thallus. In addition, the integration of a viral genome into the S. japonica and E. siliculosus genomes during their co-ancestry period further demonstrates their close genetic relationship and points to genome rearrangements and gene duplication events after their differentiation.

# AUTHOR CONTRIBUTIONS

TL, TX, XW, and JY designed the study. SC, JZ, and YY maintained and prepared the plant materials. GL and GS prepared the sequencing libraries and conducted the sequencing. TX conducted the Hi-C sequencing. GW, SJ, and XW assembled the draft sequence, and conducted the analysis. GW and SC drafted the manuscript. SJ modified the manuscript. All authors reviewed and approved the final manuscript.

# FUNDING

This study was funded by the National Natural Science Foundation of China (41376143); Science and Technology Major Project of Fujian Province (2019NZ08003); Leading Talents Program in Taishan Industry of Shandong Province; the seed industry innovation and industrialization project of Fujian Province (2017FJSCZY01); the 13th Five-Year Plan for the Marine Innovation and Economic Development Demonstration Projects (FZHJ1); and China Agriculture Research System-50.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00378/full#supplementary-material

# REFERENCES




**Conflict of Interest Statement:** SC was employed by Qingdao Haida Blue Tek Biotechnology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Wang, Wang, Jia, Liu, Shan, Chi, Zhang, Yu, Xue and Yu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome-Wide Association Study Identifies Genomic Loci Affecting Filet Firmness and Protein Content in Rainbow Trout

Ali Ali<sup>1</sup>, Rafet Al-Tobasei<sup>2,3</sup>, Daniela Lourenco<sup>4</sup>, Tim Leeds<sup>5</sup>, Brett Kenney<sup>6</sup> and Mohamed Salem<sup>1,2</sup>\*

<sup>1</sup> Department of Biology and Molecular Biosciences Program, Middle Tennessee State University, Murfreesboro, TN, United States, <sup>2</sup> Computational Science Program, Middle Tennessee State University, Murfreesboro, TN, United States, <sup>3</sup> Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, United States, <sup>4</sup> Department of Animal and Dairy Science, University of Georgia, Athens, GA, United States, <sup>5</sup> National Center for Cool and Cold Water Aquaculture, Agricultural Research Service, United States Department of Agriculture, Kearneysville, WV, United States, <sup>6</sup> Division of Animal and Nutritional Sciences, West Virginia University, Morgantown, WV, United States

### Edited by:

Peng Xu, Xiamen University, China

### Reviewed by:

Jun Hong Xia, Sun Yat-sen University, China

Zhe Zhang, South China Agricultural University, China

> \*Correspondence: Mohamed Salem mohamed.salem@mtsu.edu

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 11 February 2019 Accepted: 10 April 2019 Published: 03 May 2019

### Citation:

Ali A, Al-Tobasei R, Lourenco D, Leeds T, Kenney B and Salem M (2019) Genome-Wide Association Study Identifies Genomic Loci Affecting Filet Firmness and Protein Content in Rainbow Trout. Front. Genet. 10:386. doi: 10.3389/fgene.2019.00386

Filet quality traits determine consumer satisfaction and affect profitability of the aquaculture industry. Soft flesh is a criterion for fish filet downgrades, resulting in loss of value. Filet firmness is influenced by many factors, including rate of protein turnover. A 50K transcribed gene SNP chip was used to genotype 789 rainbow trout, from two consecutive generations, produced in the USDA/NCCCWA selective breeding program. Weighted single-step GBLUP (WssGBLUP) was used to perform genome-wide association (GWA) analyses to identify quantitative trait loci affecting filet firmness and protein content. Applying genomic sliding windows of 50 adjacent SNPs, 212 and 225 SNPs were associated with genetic variation in filet shear force and protein content, respectively. Four common SNPs in the ryanodine receptor 3 gene (RYR3) affected the aforementioned filet traits; this association suggests common mechanisms underlying filet shear force and protein content. Genes harboring SNPs were mostly involved in calcium homeostasis, proteolytic activities, transcriptional regulation, chromatin remodeling, and apoptotic processes. RYR3 harbored the highest number of SNPs (n = 32) affecting genetic variation in shear force (2.29%) and protein content (4.97%). Additionally, based on single-marker analysis, a SNP in RYR3 ranked at the top of all SNPs associated with variation in shear force. Our data suggest a role for RYR3 in muscle firmness that may be considered for genomic- and marker-assisted selection in breeding programs of rainbow trout.

### Keywords: trout, muscle, firmness, softness, protein, GWAS, WssGBLUP, QTL

# INTRODUCTION

Aquaculture continues to experience rapid growth worldwide. However, for a sustainable industry, there is a need to produce fish filets with consistent quality and high value. Consumer attitude toward fish is influenced by nutritional and sensory attributes, including filet firmness (Bonneau and Lebret, 2010). Firmness is one of the most important quality attributes that determines consumer satisfaction with the product, and it is affected by many intrinsic and extrinsic factors (Destefanis et al., 2008). These factors include prerigor muscle processing, production and storage temperature, chilling protocols, genotype, handling stress, collagen content, extent of proteolysis, and the proximate composition of muscle (Castañeda et al., 2005; Bahuaud et al., 2010; Grześ et al., 2017). Filet softness shares common causes with gaping but should not be confused with it; gaping results from tearing of the connective tissue between muscle layers and weakening of the interface between the myotome and the myosepta, causing slits in the filet (Jacobsen et al., 2017). Previous studies in farmed European whitefish showed that filet firmness is a heritable trait (0.30 ± 0.09), whereas gaping seems not to be heritable (Kause et al., 2011). Gaping is affected by a range of perimortem harvest and handling factors and postmortem handling practices. In other words, there is a great opportunity for uncontrolled, random variation that makes elucidation of the genetic control of gaping a challenge. Loss of filet firmness and gaping contribute to downgrading during the secondary processing of filets, causing economic loss for the industry (Torgersen et al., 2014; Jacobsen et al., 2017). Increased stress has been reported as a major cause of gaping and filet softness (Jacobsen et al., 2017). In pigs, heat stress leads to the development of pale, soft, exudative (PSE) meat (Strasburg and Chiang, 2009), which is associated with defective Ca<sup>2+</sup> regulation. Despite a well-developed understanding of meat tenderization, which has been studied for decades in mammals, the need exists for genetic markers of the fish "gaping" and filet softness phenotypes (Ouali et al., 2013).

Connective tissue, muscle fiber density, muscle fiber type, postmortem metabolism, and postmortem autolysis are inherent factors affecting muscle texture. Proteolytic degradation of connective tissue, myofibrils, extracellular matrix, and cell membrane constituents contributes to postmortem softening (Torgersen et al., 2014). Protein content is relatively constant in fish; however, it may vary due to seasonal changes and physiological factors (Delbarre-Ladrat et al., 2006; Belitz et al., 2009). For instance, carbohydrate content and metabolism affect postmortem changes in protein content. Glycolysis determines the rate and extent of pH decline, which affects proteolysis and the water-binding ability of the tissue. In turn, proteolysis and water-binding ability influence firmness of porcine muscle (Grześ et al., 2017). However, the pH decline in fish is small due to the low glycogen content of the muscle (Belitz et al., 2009). There is general agreement that tenderization is enzymatic in nature and may begin with the onset of apoptosis, followed by proteolysis (Ouali et al., 2013). Enzymatic degradation of key structural proteins that maintain myofibril integrity leads to postmortem tenderization. Calpains, cathepsins, the proteasome, and matrix metalloproteases may act in synergy, affected by pH, sarcoplasmic calcium, osmotic pressure, and oxidative processes, to degrade the proteins (Delbarre-Ladrat et al., 2006). Increased stress, glycogenolysis, glycolysis, and pH decline (Thomas et al., 2005) in the perimortem period are associated with increased activity of cathepsin L, which degrades collagen and leads to filet softening. However, protein isoforms of fish may react differently from those of mammalian species because filet storage temperatures are much closer to the temperature optima of proteases, glycolytic enzymes, and pyruvate dehydrogenase, to name a few possibilities. Firmness of salmon muscle has been previously attributed to efficient aerobic metabolism and degradation of damaged/misfolded proteins (Torgersen et al., 2014). In addition, atrophying muscle from sexually mature rainbow trout was softer than that of sterile fish (Paneru et al., 2018). Transcriptomic profiling of the atrophying muscle revealed differential expression of genes related to protein ubiquitination, autophagy, extracellular matrix, myofibrillar proteins, and collagen, collectively called "the rainbow trout muscle degradome" (Paneru et al., 2018). Further, profiling the muscle transcriptome of fish families exhibiting divergent filet firmness revealed a network of protein-coding and non-coding genes related to lysosomal and proteolytic activities (Paneru et al., 2017; Ali et al., 2018). Understanding the mechanism underlying filet firmness will help evaluate the postmortem changes affecting filet quality and facilitate selective breeding decisions.

Traditional genetic improvement programs have used statistical analyses of phenotypes and pedigree information to identify animals with elite genetic merit (Dang et al., 2014). Genetic selection has been introduced in rainbow trout to improve filet quality (Kause et al., 2007; Hu et al., 2013). Selection programs for fish, including rainbow trout, have focused on growth rate and filet quality traits; however, little attention has been paid to filet texture (Bahuaud et al., 2010). Selection on fat content improved color and filet texture (Florence et al., 2015), feed conversion ratio (FCR), and protein-retention efficiency (Kause et al., 2016). Five generations of family-based selection have been carried out at the USDA National Center for Cool and Cold Water Aquaculture (NCCCWA), yielding a genetic gain of ∼10% in body weight per generation (Leeds et al., 2016). Firmness is measured postmortem; thus, the trait cannot be measured directly on breeding candidates. Only family-specific estimated breeding values (EBVs) are used for breeding candidates in traditional breeding programs. Genomic selection will allow further within-family selection for filet firmness traits and thus is anticipated to increase the accuracy of genetic predictions and the selection response. Understanding the genetic architecture of filet phenotypic traits and developing genetically improved strains will improve aquaculture industry profitability and consumer satisfaction (Ali et al., 2018).

Genome-wide association (GWA) analysis compares allele frequencies at candidate loci with respect to the studied trait and takes advantage of linkage disequilibrium (LD) between SNP markers and trait loci (Schielzeth and Husby, 2014). GWA analyses have been used extensively in mammals, including humans, to investigate the association of variants with complex phenotypic traits and diseases (Hindorff et al., 2009). A limited number of GWA analyses have been conducted in fish, including Atlantic salmon (Tsai et al., 2015), catfish (Geng et al., 2016), orange-spotted grouper (Yu et al., 2018), and rainbow trout (Gonzalez-Pena et al., 2016; Salem et al., 2018). The traits studied in fish include growth (Tsai et al., 2015; Yu et al., 2018), disease resistance (Palti et al., 2015), head size (Geng et al., 2016), heat stress (Jin et al., 2017), low oxygen tolerance (Zhong et al., 2017), and muscle yield (Gonzalez-Pena et al., 2016; Salem et al., 2018). In rainbow trout, GWA analysis revealed quantitative trait loci (QTL) associated with filet yield and disease resistance (Liu et al., 2015; Palti et al., 2015; Gonzalez-Pena et al., 2016). No GWA studies have been conducted in fish to identify the genetic architecture of filet firmness. However, several GWA studies in cattle and pigs revealed genetic factors affecting meat tenderness. Calpain 1 and calpastatin are among the genes that harbor genetic variants associated with meat tenderness in cattle (Ramayo-Caldas et al., 2016).

A 50K transcribed gene SNP chip with an average of 1 SNP per 42.7 kb was recently developed for rainbow trout. About 21K SNPs showing potential association with important traits, including fish growth, muscle yield/quality, and filet softness, were used to build the chip. In addition, 29K SNPs were added to the chip following a strategy of 2 SNPs per gene to randomize the SNP distribution. The recent release of the rainbow trout genome (GenBank assembly accession GCA\_002163495, RefSeq assembly accession GCF\_002163495) helped in assigning SNPs to chromosomes. Recently, the chip was successfully used to identify several QTL markers associated with muscle yield (Salem et al., 2018). The objective of the current study was to explore the genetic architecture of one of the most important muscle quality attributes, filet firmness, in relation to protein content, and to identify QTL associated with these traits in a rainbow trout population developed by the USDA/NCCCWA selective breeding program.

# MATERIALS AND METHODS

# Ethics Statement

Institutional Animal Care and Use Committee of the United States Department of Agriculture, National Center for Cool and Cold Water Aquaculture (Leetown, WV, United States) specifically reviewed and approved all husbandry practices used in this study (IACUC approval #056).

# Fish Population, Tissue Sampling, and Phenotypic Traits

Fish population and tissue sampling were previously described in detail (Al-Tobasei et al., 2017). Briefly, diploid females from a growth-selected line at NCCCWA were used to carry out the GWA analysis. This selective breeding program was initiated in 2004 and has gone through 5 generations of selection (Leeds et al., 2016). Third- and fourth-generation fish (year class, YC, 2010 and YC 2012) were used for GWA analysis. Phenotypic data were collected from 789 fish representing 98 families from YC 2010 and 99 families from YC 2012. Over a 6-week period, full-sib families were produced from single-sire × single-dam matings. Eggs were reared in spring water and incubated at 7–13°C to hatch all families within a 3-week period. Each family was reared in a separate 200-L tank at ∼12.5°C to retain pedigree information and was fed a commercial fishmeal-based diet (Zeigler Bros Inc., Gardners, PA, United States). At ∼5 months post-hatch, fish were tagged with a passive integrated transponder (Avid Identification Systems Inc., Norco, CA, United States) and reared together in 800-L communal tanks supplied with partially recirculated spring water at ∼13°C until ∼13 months post-hatch. Fish were fed a commercial fishmeal-based diet. The feeding schedule was previously described (Hinshaw, 1999). Fish did not receive feed for 5 days prior to harvest to facilitate processing.

Whole body weight (WBW) was measured in fish belonging to each family and families were sorted according to their WBW. The 2nd or 3rd fish from each family was selected for muscle sampling to keep the distribution of WBW consistently adjusted around the median of each family. For each harvest year, selected fish were randomly assigned to one of five harvest groups (∼100 fish each) allowing one fish per family per harvest group. The five groups were sampled in five consecutive weeks (one group/week) each YC. Fish from the YC 2010 were harvested between 410 and 437 days post-hatch (mean body weight = 985 g; SD = 239 g), whereas those from YC 2012 were harvested between 446 and 481 days post-hatch (mean body weight = 1,803 g; SD = 305 g). Muscle shear force and protein content showed low regression coefficient (R<sup>2</sup>) values of 0.05 and 0.04 with body weight, respectively. Fish were euthanized with a lethal dose of tricaine methane sulfonate (Tricaine-S, Western Chemical, Ferndale, WA, United States), harvested, and eviscerated. Head-on gutted carcasses were packed in ice, transported to the West Virginia University Muscle Foods Processing Laboratory (Morgantown, WV, United States), and stored overnight. Carcasses were manually processed into trimmed, skinless filets (Salem et al., 2013). Shear force of a 4 × 8 cm section of cooked filet was assessed using a five-blade, Allo-Kramer shear cell attached to a Texture Analyzer (Model TA-HDi®; Texture Technologies Corp., Scarsdale, NY, United States), equipped with a 50 kg load cell; tests were performed at a crosshead speed of 127 mm/min (Aussanasuwannakul Kenney et al., 2010). Texture Expert Exceed software (version 2.60; Stable Micro Systems Ltd., Surrey, United Kingdom) was used to record and analyze force-deformation graphs. Peak shear force (g/g sample) was recorded. All cooked texture evaluations were performed approximately 48 h post-harvest.
Details of the proximate analyses, including crude protein, were previously described (Manor et al., 2015). Crude protein analysis was achieved using AOAC-approved methods (AOAC 2000). Percent Kjeldahl nitrogen (Kjeltec™ 2300; Foss North America; Eden Prairie, MN, United States) was converted into crude protein using 6.25 as the conversion factor. The pedigree-based heritabilities (h<sup>2</sup><sub>ped</sub>) for protein content and shear force were estimated according to Zaitlen et al. (2013).
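The nitrogen-to-protein conversion described above is a single multiplication. The following is a minimal sketch of it; the function name and the example nitrogen reading are hypothetical and are not taken from the study data.

```python
# Nitrogen-to-protein conversion factor stated in the text.
NITROGEN_TO_PROTEIN = 6.25


def crude_protein_percent(kjeldahl_nitrogen_percent: float) -> float:
    """Convert percent Kjeldahl nitrogen into percent crude protein."""
    if kjeldahl_nitrogen_percent < 0:
        raise ValueError("nitrogen percentage cannot be negative")
    return kjeldahl_nitrogen_percent * NITROGEN_TO_PROTEIN


if __name__ == "__main__":
    # Hypothetical reading of 3.2% nitrogen (not a value from the study).
    print(crude_protein_percent(3.2))
```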

# SNP Genotyping and Quality Control

Genotyping was done using a 50K transcribed gene SNP-chip that we recently described and utilized in identifying QTL affecting filet yield (Salem et al., 2018). The source of all SNPs used to build the SNP chip was described in our previous publication (Al-Tobasei et al., 2017). In brief, the array has about 21K SNPs showing potential allelic imbalances with fish body weight, muscle yield, fat content, shear force, whiteness index, and susceptibility to Bacterial Cold Water Disease (BCWD) as we previously described (Al-Tobasei et al., 2017; Salem et al., 2018). In addition, ∼5K non-synonymous SNPs and more SNPs were added to the chip to include at least 2 SNPs per SNP-harboring gene. The SNP chip includes a total of 50,006 SNPs.

As described before, a total of 1,728 fish were used to assess the quality of this Affymetrix SNP chip. Genotyped fish were obtained from the NCCCWA growth- and BCWD-selection lines (Salem et al., 2018). The SNP chip and sample metrics were calculated. Quality control (QC) and filtration of samples/genotypes were performed using the Affymetrix SNPolisher software at the default parameters (Liu et al., 2015). A call rate of 0.97 and a Dish QC (DQC) threshold of 0.82 were applied to filter out genotyped samples. For this study, 789 fish genotyped by the SNP chip had available phenotypic data for filet shear force and protein content. All genotypic data passed the QC. Those fish were used for the current GWA analyses.

# Fifty-SNP Window GWA Analysis

Genome-wide association analysis was performed using the Weighted single-step GBLUP (WssGBLUP) as we previously described (Salem et al., 2018). In brief, WssGBLUP allows use of genotyped and ungenotyped animals. WssGBLUP integrates phenotypic data, genotype and pedigree information in a combined analysis using the following mixed model for single trait analysis:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}_1\mathbf{a} + \mathbf{Z}_2\mathbf{w} + \mathbf{e}$$

where **y** is the vector of the phenotypes, **b** is the vector of fixed effects including harvest group and hatch year, **a** is the vector of additive direct genetic effects (i.e., animal effect), **w** is the vector of random family effect, and **e** is the residual error. The matrices **X**, **Z**<sub>1</sub>, and **Z**<sub>2</sub> are incidence matrices for the effects contained in **b**, **a**, and **w**, respectively. The model combines all the relationship information (based on pedigree and genotypes) into a single matrix (**H**<sup>−1</sup>).

$$\mathbf{H}^{-1} = \mathbf{A}^{-1} + \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{G}^{-1} - \mathbf{A}_{22}^{-1} \end{bmatrix}$$

where **H**<sup>−1</sup> is the inverse of the realized relationship matrix (**H**), **A**<sup>−1</sup> is the inverse of the relationship matrix based on pedigree information, **A**<sub>22</sub><sup>−1</sup> is the inverse of the pedigree relationship matrix for genotyped animals only, and **G**<sup>−1</sup> is the inverse of the genomic relationship matrix. The random family effect is uncorrelated and just accounts for the fact that the animals within the same family were raised in a common environment, and the covariance structure is given by **I**σ<sup>2</sup><sub>w</sub>, where **I** is an identity matrix and σ<sup>2</sup><sub>w</sub> is the family variance.
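To make the block structure of the combined relationship matrix concrete, the sketch below assembles the inverse of **H** from already-inverted inputs. It is an illustration only, not the BLUPF90 implementation: the function name, the list-of-lists matrix representation, and the toy values are assumptions, and the genotyped animals are located by index rather than being required to sit in the lower-right block.

```python
def assemble_h_inverse(a_inv, g_inv, a22_inv, genotyped_idx):
    """Return H^-1 = A^-1 + [[0, 0], [0, G^-1 - A22^-1]].

    a_inv:          n x n inverse pedigree relationship matrix (list of lists).
    g_inv, a22_inv: m x m inverses restricted to the genotyped animals.
    genotyped_idx:  positions of the m genotyped animals within a_inv.
    """
    m = len(genotyped_idx)
    if len(g_inv) != m or len(a22_inv) != m:
        raise ValueError("G^-1 and A22^-1 must match the genotyped animals")
    h_inv = [row[:] for row in a_inv]  # copy so the input stays untouched
    for r, i in enumerate(genotyped_idx):
        for c, j in enumerate(genotyped_idx):
            # Only the genotyped block receives the genomic correction.
            h_inv[i][j] += g_inv[r][c] - a22_inv[r][c]
    return h_inv


if __name__ == "__main__":
    # Toy numbers (hypothetical): animal 0 is ungenotyped, animals 1 and 2 are genotyped.
    a_inv = [[2.0, -1.0, -1.0], [-1.0, 1.5, 0.5], [-1.0, 0.5, 1.5]]
    g_inv = [[1.2, -0.1], [-0.1, 1.1]]
    a22_inv = [[1.0, 0.0], [0.0, 1.0]]
    for row in assemble_h_inverse(a_inv, g_inv, a22_inv, [1, 2]):
        print(row)
```

Rows and columns of ungenotyped animals are left exactly as in the pedigree-based inverse, which is how the single-step approach lets genotyped and ungenotyped animals enter one analysis.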

AIREMLF90 (Misztal et al., 2018) was used to estimate the variance components for the additive direct genetic effect, random family effect, and residuals. The inbreeding value was previously calculated from pedigree data of 63,808 fish from five consecutive generations in the NCCCWA breeding program using INBUPGF90 (Misztal et al., 2002; Salem et al., 2018). QC of genomic data was performed using PREGSF90 (Misztal et al., 2014) with the following settings: MAF > 0.05, call rate > 0.90, and HWE < 0.15. In total, 35,322 SNPs (70.6%) passed QC.
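The three SNP-level filters can be illustrated with a short sketch. It is not the PREGSF90 code: genotypes are assumed to be coded as 0/1/2 copies of one allele with `None` for missing calls, and the HWE criterion is assumed to be the absolute difference between observed and expected heterozygote frequencies. All names are hypothetical.

```python
def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.90, max_hwe_dev=0.15):
    """Apply MAF, call-rate and HWE-deviation filters to one SNP.

    genotypes: list of 0, 1, 2 (allele counts) or None (missing call).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # nothing was called for this SNP
    call_rate = len(called) / len(genotypes)
    p = sum(called) / (2 * len(called))  # frequency of the counted allele
    maf = min(p, 1 - p)
    obs_het = sum(1 for g in called if g == 1) / len(called)
    # Assumed HWE statistic: |observed - expected heterozygote frequency|.
    hwe_dev = abs(obs_het - 2 * p * (1 - p))
    return maf > min_maf and call_rate > min_call_rate and hwe_dev < max_hwe_dev


if __name__ == "__main__":
    # Toy genotype vectors (hypothetical), one list per SNP.
    panel = {
        "snp_ok": [0, 1, 2, 1, 0, 1, 2, 1, 1, 0],
        "snp_monomorphic": [0] * 10,
        "snp_low_call": [0, 1, 2, 1, 0, 1, 2, 1, None, None],
    }
    kept = [name for name, g in panel.items() if snp_passes_qc(g)]
    print(kept)
```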

In WssGBLUP analysis, two iterations were used. All SNPs were assigned the same weight during the first iteration (i.e., weight = 1.0). For the second iteration, weights were calculated according to the SNP effects (û) assessed in the first iteration as û<sup>2</sup> · 2p(1 − p), where p represents the current allele frequency. Three steps were performed in each iteration: (1) weights were assigned to the SNPs; (2) genomic estimated breeding values (GEBV) were computed using BLUPF90 based on **H**<sup>−1</sup> (Misztal et al., 2002); (3) SNP effects and weights were calculated using POSTGSF90 (Misztal et al., 2002) based on sliding variance windows of 50 adjacent SNPs. Since the SNPs in the chip were not evenly distributed over the whole genome, the window size used for the current analysis was based on a specific number of adjacent SNPs (n = 50 SNPs) instead of physical size (e.g., specific number of nucleotides). A Manhattan plot showing the proportion of additive genetic variance explained by the 50 SNP windows was generated in R using the qqman package (Turner, 2014).
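The weighting rule and the sliding-window summary can be sketched as follows. This is a simplified illustration rather than the POSTGSF90 computation: it treats SNP contributions as additive and independent, so the window share is just the sum of per-SNP weights divided by the total. Function names and toy values are hypothetical.

```python
def snp_weights(effects, freqs):
    """Weight per SNP: u_hat^2 * 2p(1 - p), as described in the text."""
    return [u * u * 2.0 * p * (1.0 - p) for u, p in zip(effects, freqs)]


def window_variance_share(weights, window=50):
    """Share of the summed SNP variance held by each window of adjacent SNPs.

    Simplifying assumption: SNP contributions are additive and independent.
    Returns one value per window start position (empty if too few SNPs).
    """
    total = sum(weights)
    if total == 0 or len(weights) < window:
        return []
    current = sum(weights[:window])
    shares = [current / total]
    for i in range(window, len(weights)):
        # Slide by one SNP: add the entering SNP, drop the leaving one.
        current += weights[i] - weights[i - window]
        shares.append(current / total)
    return shares


if __name__ == "__main__":
    # Toy effects and allele frequencies (hypothetical), with a window of 2 SNPs.
    w = snp_weights([0.1, 0.2, 0.0, 0.3], [0.5, 0.5, 0.5, 0.5])
    print(window_variance_share(w, window=2))
```

Defining windows by SNP count rather than base pairs, as the text does, keeps every window equally populated even where the chip covers the genome unevenly.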

# Single Marker GWA Analysis

Single marker association analysis was conducted using PLINK (Purcell et al., 2007). The phenotypic data were checked for normality using the Kolmogorov–Smirnov and Shapiro–Wilk tests to make sure that the studied phenotypes were normally distributed and met the assumptions of linear model analysis in PLINK (Purcell et al., 2007). For single marker association analysis, the linear model included multiple covariates and accounted for population structure. To control the global inflation of the test statistic, the first five principal components (PCs) were used as covariates in the model. The Wald test, using the --assoc command, was applied to the quantitative traits in order to retrieve the R-squared values of association.
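As an illustration of what the R-squared of a single-marker test measures, the sketch below regresses a phenotype on allele dosage for one SNP. It is deliberately simplified: it omits the principal-component covariates used in the study and does not reproduce PLINK's implementation. Names and toy data are hypothetical.

```python
def single_marker_regression(dosages, phenotypes):
    """Simple linear regression of phenotype on allele dosage (0/1/2).

    Returns (beta, r_squared). Covariates are not modelled in this sketch.
    """
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching vectors with at least 3 observations")
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in dosage or phenotype")
    beta = sxy / sxx                   # change in phenotype per allele copy
    r_squared = sxy * sxy / (sxx * syy)  # fraction of phenotypic variance explained
    return beta, r_squared


if __name__ == "__main__":
    # Toy data (hypothetical): three fish with dosages 0, 1 and 2.
    print(single_marker_regression([0, 1, 2], [1.0, 3.0, 2.0]))
```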

# Gene Annotation and Enrichment Analysis

To retrieve SNP annotations, the SNP bed file was intersected with the rainbow trout genome gff/gtf file using Bedtools as described before (Quinlan and Hall, 2010; Salem et al., 2018). SNPs located within each gene were classified as genic whereas SNPs located outside the body of the gene were classified as intergenic. Genic SNPs were subsequently classified as CDS, intronic, 5′UTR or 3′UTR SNPs. SNPs within long non-coding RNAs (lncRNAs) were determined using a gtf file of our previously published lncRNA reference assembly (Al-Tobasei et al., 2016). SNP-harboring genes were uploaded to the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 (Huang da et al., 2009a,b) to perform gene enrichment analysis (Fisher Exact < 0.05).
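The genic versus intergenic split amounts to an interval lookup. The sketch below shows the idea in plain Python; it is not a replacement for Bedtools, and the gene names and coordinates are hypothetical. It assumes all positions have already been converted to 1-based inclusive coordinates (the GFF convention), whereas BED files are 0-based and half-open.

```python
def classify_snps(snps, genes):
    """Label each SNP as genic or intergenic.

    snps:  list of (chrom, pos) tuples, 1-based positions.
    genes: list of (chrom, start, end, name) tuples, 1-based inclusive.
    Returns a list of (label, [overlapping gene names]) in SNP order.
    """
    labels = []
    for chrom, pos in snps:
        hits = [
            name
            for g_chrom, start, end, name in genes
            if g_chrom == chrom and start <= pos <= end
        ]
        labels.append(("genic", hits) if hits else ("intergenic", []))
    return labels


if __name__ == "__main__":
    # Hypothetical gene model and SNP positions.
    genes = [("chr13", 100, 200, "geneA")]
    snps = [("chr13", 150), ("chr13", 250), ("chr4", 150)]
    print(classify_snps(snps, genes))
```

The linear scan is fine for a handful of records; an analysis over tens of thousands of SNPs would normally rely on sorted intervals or an interval tree, which is what dedicated tools provide.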

# RESULTS AND DISCUSSION

Soft flesh is a major criterion for downgrading fish fillets, resulting in loss of value (Michie, 2001). Post-mortem muscle softness is correlated with proteolytic degradation of extracellular matrix and cell membrane components (Bahuaud et al., 2010; Martinez et al., 2011). The fish population used for the current GWA analysis had an average shear force of 475.7 ± 83.47 (g/g) and crude protein% = 20.64 ± 0.62. For the current GWA analysis, phenotypic variations in shear force and protein are shown in **Figure 1**. The estimated heritabilities were 0.33 ± 0.07 and 0.27 ± 0.06 for shear force and protein content, respectively (Zaitlen et al., 2013). Previous studies showed a significant correlation between changes in protein content and meat tenderness (Grześ et al., 2017). Consistently, our data showed a significant correlation between protein content and shear force (R<sup>2</sup> = 0.18; p-value < 0.0001). Therefore, we used a 50K SNP chip to perform GWA analyses to identify QTL associated with both traits based on 50 SNP sliding windows using WssGBLUP and single-marker association analyses using PLINK. The chip contains SNPs potentially associated with muscle quality traits including filet softness as we previously described (Al-Tobasei et al., 2017; Salem et al., 2018). However, we did not include any fish used in building the SNP-chip for GWA analysis in this study.

# QTL Affecting Filet Shear Force Using WssGBLUP

The WssGBLUP-based GWA analysis identified a total of 212 SNPs affecting the additive genetic variance for shear force. These SNPs were located within 95 genes coding for proteins and 4 lncRNAs with 20 SNPs in intergenic regions. SNPs were included in windows explaining at least 2% (arbitrary value) of the additive genetic variance for shear force (**Supplementary Table S1**). Genomic loci that harbor SNPs were clustered on 6 chromosomes (4, 7, 8, 10, 13, and 28) (**Figure 2**). Chromosome 13 had the most significant peaks affecting shear force (6.91%) (**Supplementary Table S1** and **Figure 2**) and the highest number of SNPs (n = 83) in windows explaining additive genetic variance for shear force (**Supplementary Table S1**). Many of the SNPs (n = 80) were located within the 3′UTR of their genes, suggesting a role for these SNPs in microRNA-mediated post-transcriptional regulation of gene expression. Among those 80 SNPs, 32 SNPs created or deleted binding sites for 56 microRNAs (**Supplementary Table S2**). All QTL associated with genetic variance of shear force are listed in **Supplementary Table S1**. To gain insights into the biological significance of the identified QTL, we annotated the SNP-harboring genes followed by gene enrichment analysis. Functional annotation showed that SNP-harboring genes were involved in calcium binding/metabolism, proteolytic activities, apoptotic process, and cellular adhesion and junction (**Tables 1**, **2**). Enriched terms included calcium channel complex, smooth endoplasmic reticulum, ryanodine-sensitive calcium-release channel activity, calcium ion binding, and Z disk (**Supplementary Table S3**).

### SNPs in Genes Affecting Ca<sup>2+</sup> Homeostasis

Ten genes necessary for calcium metabolism harbored 47 SNPs affecting the genetic variation in shear force (**Table 1**). Ryanodine receptor 3 (RYR3; member of the sarcoplasmic reticulum calcium release channel) had 17 SNPs located on chromosomes 4 and 8, suggesting an important role for calcium in regulating shear force.

Two SNPs were non-synonymous, and one of these SNPs exists in the third structural repeat that is conserved in all RYR isoforms; it is located in the N-terminal part of the cytoplasmic region of the RYRs. Several studies reported a correlation between development of pale, soft and exudative (PSE) meat and abnormality in the calcium release mechanism of porcine skeletal muscle as a result of a point mutation in the porcine RYR1 that led to a substitution of cysteine for arginine (Arg615Cys) (Fujii et al., 1991). Poor regulation of the mutant channel led to accumulation of sarcoplasmic calcium and, accordingly, development of PSE meat (MacLennan and Phillips, 1992). Breeding strategies were initiated to eliminate this mutation from pig populations.

Unlike in mammals where RYR1 is the main isoform expressed in skeletal muscle, fish co-express RYR3 (Murayama and Kurebayashi, 2011). Absence of RYR1 in fish causes slow swimming, weak contractions and reduced Ca<sup>2+</sup> transients (Hirata et al., 2007). On the other hand, RYR3 knock-down led to a reduction in the formation of anatomical structures called the parajunctional feet (PJF), which are located on the sides of the SR junctional cisternae (Perni et al., 2015). Reduction of the PJF was coupled with a reduction in the SR Ca<sup>2+</sup> flux that causes the Ca<sup>2+</sup> sparks reported in fish muscle. However, the muscle fibers looked structurally normal and the swimming behavior was not affected (Perni et al., 2015). Association of RYR1 and RYR3 mRNA expression levels with filet water holding capacity was reported in the Nile tilapia under pre-slaughter stress (Goes et al., 2015). Impaired Ca<sup>2+</sup> handling was reported in the muscles of hatchery-reared salmon compared to that of wild fish (Anttila et al., 2008). Levels of RYR were greatly reduced in the muscles of the hatchery-reared salmon. Similar differences were seen in the oxidative capacity of muscles. This impairment was suggested to contribute to the lower swimming capacity of the reared fish (Anttila et al., 2008).

Chromosome 13 had 15 SNPs in Ca<sup>2+</sup> homeostasis-relevant genes located in top windows affecting the genetic variability in shear force (**Table 1**). Nucleobindin 1 is a multidomain calcium-binding protein of unclear physiological and biochemical functions (Kapoor et al., 2010) and harbored 2 SNPs within the 3′UTR representing the highest peak in this QTL. The second gene on this chromosome, myosin-binding protein C, fast (MyBP-C), encompassed 4 SNPs. MyBP-C sensitizes the actin thin filaments to Ca<sup>2+</sup> (Lin et al., 2018). MyBP-C gene knockout in mice leads to muscle hypertrophy and impaired contractile function. The third gene, protein kinase C and casein kinase substrate in neurons protein 3 (PACSIN 3), had 4 SNPs. PACSIN 3 has been primarily identified in muscle and lung (Bai et al., 2012). PACSIN 3 is known to modulate the subcellular localization of TRPV4 (Cuajungco et al., 2006), which regulates Ca<sup>2+</sup> homeostasis and cytoskeletal remodeling (Ryskamp et al., 2016). Coronin-1 represents the fourth gene and had two SNPs. It mediates Ca<sup>2+</sup> mobilization from the intracellular stores (Mueller et al., 2008). The fifth gene, myosin regulatory light chain 2 (MYL2), had three SNPs within the 3′UTR. MYL2 is a calcium-binding chain known to be associated with meat tenderness (Rosa et al., 2018).

Eight SNPs were also identified in 3 genes necessary for calcium metabolism on chromosome 10 (**Table 1**). Plastin-3 (PLS3) had five SNPs in windows explaining up to 3.83% of the additive genetic variance in shear force. PLS3 functions as a protective modifier of spinal muscular atrophy in a Ca<sup>2+</sup>-dependent manner (Lyon et al., 2014). A single SNP was identified in a gene that codes for TBC1 domain family member 8B (TBC1D8) and has GO terms belonging to calcium ion binding. Galectin-9 (Gal-9) harbored two SNPs within windows explaining ∼3.37% of the additive genetic variance in shear force. Gal-9 induces apoptosis via the Ca<sup>2+</sup>-calpain-caspase-1 pathway (Kashio et al., 2003).

Chromosome 28 had a single gene, matrix metalloproteinase-14 (MMP14), that had 7 SNPs explaining at least 2.0% of the additive genetic variance (**Table 1**). MMP14 has a Ca<sup>2+</sup>-dependent catalytic MP domain that degrades extracellular matrix proteins such as collagen (Tallant et al., 2010). Our recent studies showed that MMP9 was downregulated in trout families of high shear force, suggesting a role for the matrix metalloproteinase family in regulating filet firmness in fish (Ali et al., 2018). In addition, transcripts of stanniocalcin (STC), the main regulatory hormone of Ca<sup>2+</sup> homeostasis in fish (Verma and Alim, 2014), were overexpressed in trout families with high shear force (Ali et al., 2018). The relationship between calcium and protein content in dystrophic muscle has been attributed to decreased functionality of the sarcoplasmic reticulum to sequester calcium ions (Kameyama and Etlinger, 1979). Together, our results indicate a major role of Ca<sup>2+</sup> homeostasis in determining fish filet firmness.

### SNPs in Genes Affecting Proteolysis

Six SNP-harboring genes involved in proteolytic/catabolic and apoptotic processes were identified on chromosomes 10, 13, and 28 (**Table 2**). Chromosome 10 had a gene that codes for Gal-9, which is known to induce apoptosis (Kashio et al., 2003). Chromosome 13 had four genes harboring SNPs within top windows affecting the additive genetic variance in shear force. First, tripartite motif-containing protein 16, affecting 5.47% of the additive genetic variance, promotes apoptosis by modulating caspase-2 activity. Second, branched-chain-amino-acid aminotransferase (cytosolic) had a single 3′UTR SNP. This enzyme catalyzes the first reaction in the catabolism of the most hydrophobic branched chain amino acids (leucine, isoleucine, and valine), which play important roles in determining the structure of globular proteins, in addition to the interaction of transmembrane domains with the phospholipid layer (Blomstrand et al., 2006). Third, potassium voltage-gated channel, subfamily A member 1 harbored a single synonymous SNP. Voltage-dependent potassium channels mediate transmembrane potassium transport and are involved in the proteolytic system that causes postmortem tenderization (Mateescu et al., 2017). The fourth gene in the list codes for granulins and had 2 SNPs. Granulins have possible critical lysosomal functions, and their loss is an initiating factor in lysosomal dysfunction (Holler et al., 2017). In addition, chromosome 28 had four SNPs in a gene coding for apoptotic chromatin condensation inducer in the nucleus (ACIN1) (**Table 2**). ACIN1 belongs to the prominent canonical apoptosis signaling pathway (Schrötter et al., 2012).

TABLE 1 | SNP markers in genomic sliding windows of 50 SNPs explaining at least 2% of additive genetic variance in shear force and involved in calcium homeostasis.

A color gradient on the left indicates differences in additive genetic variance explained by windows containing the representative SNP marker (green is the highest and red is the lowest). SNPs are sorted according to their chromosome positions.

TABLE 2 | SNP markers in genomic sliding windows of 50 SNPs explaining at least 2% of additive genetic variance in shear force and involved in proteolytic, apoptotic process, tight junction, and focal adhesion.

A color gradient on the left indicates differences in additive genetic variance explained by windows containing the representative SNP marker (green is the highest and red is the lowest). SNPs are sorted according to their chromosome positions.

Two SNP-harboring genes were mapped to the autophagy pathway: immunoglobulin-binding protein 1 (IGBP1) and zinc finger FYVE domain-containing protein 1 (ZFYVE1). ZFYVE1 has been used as a marker of omegasomes, which exist only during autophagosome formation (Zientara-Rytter and Subramani, 2016). Three SNPs spanning two genes coding for coronin-1A and charged multivesicular body protein 1b were mapped to the endosomal/phagosomal pathway. Previous studies support the presence of phagocytic activities in postmortem muscle to eliminate extracellular material (Ouali et al., 2013).

### SNPs in Genes Affecting Cell Adhesion

Genes involved in focal adhesion and cell junction were previously reported to be associated with meat tenderness (Fonseca et al., 2017). Five SNPs spanning two genes on chromosome 13 were mapped to the focal adhesion pathway (**Table 2**). These genes code for myosin regulatory light chain 2 and serine/threonine-protein phosphatase alpha-2. In addition, 2 SNPs were identified spanning two genes involved in the tight junction pathway (**Table 2**). The two genes are located on chromosomes 7 and 13, and code for Na(+)/H(+) exchange regulatory cofactor NHE-RF1 and actin-related protein 3. Cerebellin-1 on chromosome 28 had a single SNP in a window explaining 2.25% of the additive genetic variance (**Table 2**). Functional annotation analysis showed that cerebellin-1 has GO terms belonging to heterophilic cell-cell adhesion via plasma membrane cell adhesion molecules. The list also includes a SNP in a gene on chromosome 13 that codes for claudin-4 (**Table 2**). This SNP creates a binding site for mir-10c-5p, which showed differential expression in association with shear force in trout families of YC 2010 (Paneru et al., 2017). Members of the claudin family are major integral membrane proteins existing at tight junctions, and they have Ca<sup>2+</sup>-independent cell-adhesion activity (Kubota et al., 1999).

# QTL Affecting Protein Content Using WssGBLUP

In total, 225 SNPs affecting the genetic variation in muscle protein content were identified; 202 genic and 23 intergenic SNPs (**Supplementary Table S4**). Each SNP was in a window explaining at least 2% of the additive genetic variance for the protein content. The genomic loci that harbor SNPs were clustered on five chromosomes (1, 3, 4, 7, and 11) (**Figure 3**). Chromosomes 4 and 1 harbored 50 SNPs located within top windows affecting the genetic variability (variance > 4.0%) in protein content of the muscle (**Supplementary Table S4**). Similar to shear force, 40% of the SNPs were located within the 3′UTR. Thirteen SNPs created/deleted target sites for 16 microRNAs (**Supplementary Table S5**). SNPs associated with genetic variation in crude protein content are listed in **Supplementary Table S4**. Functional annotation followed by gene enrichment analysis was performed to functionally characterize the SNP-harboring genes. Functional annotation showed that SNP-harboring genes were mainly involved in apoptotic process, proteolysis, lysosomal activities, cell proliferation, transcription, and methylation (**Tables 3**, **4**). Enriched terms included muscle contraction, transcription, regulation of transcription, and chromatin remodeling (**Supplementary Table S6**).

### SNPs in Genes Affecting Apoptosis

Thirteen SNPs were identified spanning seven genes on chromosomes 4 and 7 that are engaged in apoptotic process (**Table 3**). Actin, alpha harbored two SNPs in windows that explained the highest genetic variability (4.62%) in this category. Alpha actin was previously suggested as a genetic marker for apoptosis (Ouali et al., 2013). SNW domain-containing protein 1 (SNW1) harbored 4 SNPs in windows explaining up to 3.53% of the additive genetic variance. Depletion of SNW1 or its associating proteins induced apoptotic processes in cancer cells (Sato et al., 2015). Three SNPs were identified in RNA-binding protein 25 (RBM25) and Bcl-2-like protein 1 (BCL2L1). RBM25 is involved in apoptotic cell death by regulating BCL2L1 expression (Zhou et al., 2008). Two SNPs were identified in RHOB, which is known to positively regulate apoptotic process (Srougi and Burridge, 2011). A single 3′UTR SNP was identified in a gene coding for protein snail homolog Snai. Snai1-expressing cells resist apoptosis triggered by proapoptotic stimuli (Olmeda et al., 2007). Another 3′UTR SNP was also identified in a gene coding for cell death activator CIDE-3. This gene has a role in the execution phase of apoptosis (Liang et al., 2003).

TABLE 3 | SNP markers in genomic sliding windows of 50 SNPs explaining at least 2% of additive genetic variance in protein content and involved in proteolytic and apoptotic processes.

A color gradient on the left indicates differences in additive genetic variance explained by windows containing the representative SNP marker (green is the highest and red is the lowest). SNPs are sorted according to their chromosome positions.

### SNPs in Genes Affecting Proteolysis

Ten genes with proteolytic activities were identified that affected genetic variability in protein content (**Table 3**). A single SNP located in the gene coding for short transient receptor potential channel 4-associated protein (TRPC4AP) followed by a SNP in 26S protease regulatory subunit 4 (PSMC1) came at the top of this group. TRPC4AP is involved in ubiquitination and destruction of the Myc protein (Choi et al., 2010), which controls cell proliferation and growth (Bernard and Eilers, 2006). PSMC1 is a component of the 26S proteasome that maintains protein homeostasis through ubiquitin-mediated degradation of damaged and misfolded proteins (Kanayama et al., 1992). NEDD8 ultimate buster 1 (NUB1) and inactive serine protease 35 (PRSS35) each had a single SNP. NUB1 positively regulates the proteasomal ubiquitin-dependent protein catabolic process (Schmidtke et al., 2006), whereas the proteolytic activities of the serine protease PRSS35 have not been characterized yet (Diao et al., 2013). Plectin had nine SNPs. In humans, mutations of the plectin gene cause muscular dystrophy (Natsuga et al., 2010). The list also includes two mitochondrial genes, encoding 2-oxoisovalerate dehydrogenase subunit beta and aspartate aminotransferase, involved in amino acid catabolism (Schiele et al., 1989; Nobukuni et al., 1991). Of note, three genes on chromosome 4 were involved in lysosomal activities: V-type proton ATPase subunit D, Rho-related GTP-binding protein RhoB (RHOB), and lysosomal-associated transmembrane protein 4A (LAPTM4A). V-type proton ATPase subunit D had 4 SNPs in windows explaining up to 3.30% of the genetic variation in crude protein content. The vacuolar (H+)-ATPases acidify the intracellular compartments and play an important role in protein degradation (Toei et al., 2010). RHOB is involved in trafficking epidermal growth factor (EGF) receptor from late endosomes to lysosomes (Gampel et al., 1999). Three SNPs were identified in LAPTM4A. The function of this gene is unclear.

TABLE 4 | SNP markers in genomic sliding windows of 50 SNPs explaining at least 2% of genetic variance in protein content and involved in calcium metabolism, cell proliferation, and transcriptional/chromatin regulations.

A color gradient on the left indicates differences in additive genetic variance explained by windows containing the representative SNP marker (green is the highest and red is the lowest). SNPs are sorted according to their chromosome positions.

### SNPs in Genes Affecting Ca<sup>2+</sup> Homeostasis

We identified 28 SNPs, within 5 genes on chromosomes 1, 3, and 4, that are involved in calcium homeostasis (**Table 4**). Interestingly, RYR3 harbored ∼68% of those SNPs, and four of these SNPs also affected genetic variability in shear force. This result suggests a major role for RYR3 in regulating protein content and shear force in rainbow trout. SNPs of RYR3 were ranked first in this category and were located within windows explaining up to 4.97% of the additive genetic variance in protein content. A single SNP was identified within a gene that codes for reticulocalbin 2 (RCN2). Previous studies showed that RCN2 binds calcium and is localized in the endoplasmic reticulum. RCN2 was upregulated in hepatocellular carcinoma patients and its homozygous deletion in mice was lethal (Wang et al., 2017). In addition, there were three 3′UTR SNPs within the calmodulin (CaM) gene. CaM codes for a calcium-binding protein known to regulate RYR activity through direct binding to a CaM-binding domain of RYR (Oo et al., 2015). In addition, two genes coding for inhibitor of Bruton tyrosine kinase (Btk) and protein FAM26E (CALHM1) harbored 5 SNPs on chromosome 3. Btk plays a role in releasing sequestered Ca<sup>2+</sup> to the cytosol (Liu et al., 2001), whereas CALHM1 detects the extracellular Ca<sup>2+</sup> level and plays a role in Ca<sup>2+</sup> homeostasis (Ma et al., 2012). These results suggest a significant role of the genes involved in Ca<sup>2+</sup> handling (release and re-sequestration). In mammals, proteolysis by calcium-dependent proteases (calpains) in the early postmortem period greatly affects muscle texture and meat tenderization (Koohmaraie, 1992; Duckett et al., 2000). We previously showed that calpains are elevated and calpastatin is reduced during starvation-induced muscle degradation (Salem et al., 2005a, 2007) and calpastatin expression is associated with rainbow trout muscle growth (Salem et al., 2005b). Further studies are warranted to investigate the role of postmortem autolysis caused by the calpain system in regulating protein content and shear force in rainbow trout.

### SNPs in Genes Affecting Transcriptional Process and Cell Proliferation

Genes involved in transcription and cell proliferation were identified (**Table 4**). The majority of SNP-harboring genes were involved in transcription. Sixty-six SNPs were identified in 26 genes located, mainly, on chromosomes 4 and 7. Four SNPs in a gene that codes for poly(A) polymerase beta were identified in windows explaining the highest genetic variability (4.63%) in this category.

Additionally, twelve SNPs located on six genes involved in cell proliferation were identified. Three SNPs on two genes that code for myocyte-specific enhancer factor 2A (MEF2) and RCN2 were ranked at the top of this group. MEF2 plays diverse roles in muscle to control myogenesis (Black and Olson, 1998).

### SNPs in Genes Affecting Histone Modifications

Twelve SNPs in six genes involved in epigenetic transcriptional regulation were also identified on chromosomes 1, 3, 4, and 7 (**Table 4**). Histone-lysine N-methyltransferase KMT5B (KMT5B) had a single SNP located in a window explaining the maximum variance in protein content in this group (3.97%). KMT5B is a histone methyltransferase that trimethylates 'Lys-20' of histone H4 (a tag for epigenetic transcriptional repression) and plays a role in myogenesis (Neguembor et al., 2013). Four SNPs in a gene that codes for SNW domain-containing protein 1 were identified. This protein positively regulates histone H3-K4 methylation (Brès et al., 2009). A single SNP was identified in ribosomal oxygenase 1, which functions as a histone lysine demethylase and a ribosomal histidine hydroxylase, and contributes to MYC-induced transcriptional activation (Eilbracht et al., 2004; Suzuki et al., 2007; Ge et al., 2012). Two SNPs were identified in a gene coding for host cell factor 1 (HCF-1). In humans, the cell-proliferation factor HCF-1 tethers Sin3 histone deacetylase and Set1/Ash2 histone H3-K4 methyltransferase (H3K4me) complexes that are involved in repression and activation of transcription, respectively (Wysocka et al., 2003). The list includes two other genes that harbored four SNPs on chromosome 3: transcription and mRNA export factor ENY2-2 and ubiquinone biosynthesis O-methyltransferase, mitochondrial.

Taken together, our results suggest that calcium homeostasis, more likely through RYR3, and transcriptional/chromatin regulators have major roles in regulating genetic variability in muscle protein content.

FIGURE 4 | Manhattan plot showing single SNP markers associated with variations in shear force. Blue and red horizontal lines represent suggestive and significance threshold p-values of 1e-05 and 2.01e-06, respectively.


SNPs are sorted according to their R<sup>2</sup> values.

# Single Marker Association Analyses

In addition to WssGBLUP, we used the general linear regression model available in PLINK, which allows for multiple covariates (Purcell et al., 2007), to identify single SNP markers on the chip associated with phenotypic variation in shear force and protein content. In this study, PLINK identified 11 significant SNPs with potential impact on the shear force (Bonferroni-corrected p < 2.01E-06; **Figure 4** and **Table 5**).
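A Bonferroni-corrected threshold is simply the family-wise error rate divided by the number of tests. The sketch below shows that arithmetic. The text reports only the resulting threshold (2.01E-06); the significance level of 0.05 and the test count used in the example are assumptions. Under that assumed level, the threshold would correspond to roughly 24,900 tests, which is a back-calculation rather than a figure reported in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("number of tests must be positive")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True when a single-marker p-value clears the Bonferroni threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


if __name__ == "__main__":
    # Assumed inputs: alpha = 0.05 and 24,876 tests (back-calculated, not reported).
    print(bonferroni_threshold(0.05, 24876))
    print(is_significant(1e-7, 0.05, 24876))
```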

Most of the significant SNPs were located on chromosome 5 (n = 5) and chromosome 7 (n = 4). However, the most significant SNP, explaining 3.4% of the phenotypic variability in shear force, was located on chromosome 8 in a gene coding for RYR3. This result was in agreement with the WssGBLUP 50 SNP-window analysis and suggests an essential role for RYR3 in regulating filet firmness in trout. Cytochrome c oxidase subunit 6C-1 (COX6C1), 14-3-3B1 protein, and rho GTPase-activating protein 15 (ARHGAP15) were ranked next to RYR3 in impacting phenotypic variability in shear force. COX6 was rapidly degraded under endoplasmic reticulum stress conditions induced by Ca<sup>2+</sup> depletion (Hong et al., 2016) and upregulated in rainbow trout families of high shear force (Ali et al., 2018). 14-3-3B1 protein has been reported to be involved in apoptotic process (Rodrigues et al., 2017). Previous studies elucidated the involvement of 14-3-3 proteins in meat tenderness (Rodrigues et al., 2017). Overexpression of ARHGAP15 increases actin stress fibers and cell contraction (Seoh et al., 2003). The ARHGAP15 SNP was in strong LD (D' = 1) with two SNPs located in COX6C and 14-3-3B1 protein. In addition to 14-3-3B1 protein, a gene coding for disabled homolog 2 (DAB2) was also involved in apoptotic process (Prunier and Howe, 2005). Two SNP-harboring genes, phosphatidylinositol glycan anchor biosynthesis class U (PIGU) and annexin A13, were involved in lipid metabolism. PIGU has functions in lipid metabolism including membrane lipid biosynthesis. This gene exhibited differential expression in porcine muscles divergent for intramuscular fat, which correlates positively with meat tenderness (Hamill et al., 2013). Annexins are Ca<sup>2+</sup>-dependent phospholipid-binding proteins that have an important role in the cell cycle and apoptosis (Mirsaeidi et al., 2016). The list also includes a cell adhesion receptor, nicotinamide riboside kinase 2, that modulates myogenic differentiation (Li et al., 1999). Single-marker analysis did not identify SNPs in significant association with variation in protein content.

Altogether, results obtained from the single-SNP analyses provided additional evidence of a role for RYR3 in regulating phenotypic variability in filet firmness. Single-marker analysis also highlighted a role for a few more genes in filet firmness. However, estimating the effect of each SNP individually does not allow the detection of small joint effects of multiple SNPs. This may explain the inconsistency in the significant peaks between the single-marker analysis and the WssGBLUP approach. Several studies indicated that SNP-joint analysis is more successful than single-SNP analysis in GWA studies of complex traits (Fridley and Biernacka, 2011; Lu et al., 2015). Therefore, the WssGBLUP approach is assumed to be more effective in dissecting the genetic architecture of the studied traits and providing putative markers that can be used for selection purposes.

# CONCLUSION

The current GWA analyses identified novel genomic loci with a role in regulating muscle firmness and protein content. These genomic loci code for proteins involved in calcium homeostasis, transcriptional and chromatin regulation, cell adhesion, protein synthesis/degradation, and apoptotic processes. The top windows affecting the additive genetic variance in protein content and shear force appeared on chromosomes 4 and 13, respectively. RYR3 was the major gene harboring the largest number of SNPs located within windows affecting the additive genetic variance in shear force and protein content. Abnormal calcium homeostasis in muscle cells accelerates postmortem protein degradation and meat softening (Barbut et al., 2008). The current study revealed that WssGBLUP, using windows of 50 adjacent SNPs, provided putative markers that could be used to estimate breeding values for firmness and protein content.

# DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the **Supplementary Files**. The genotypes (.ped and .map files) and phenotypes are available in **Supplementary Data Sheet S1**. Lists of all SNPs affecting the additive genetic variances are provided in **Supplementary Tables S7**, **S8**.

# REFERENCES

Al Tobasei, R., Palti, Y., and Wiens, G. D. (2017). "Identification of SNPs with allelic imbalances in rainbow trout genetic lines showing different susceptibility to infection with Flavobacterium psychrophilum," in Proceedings of the PAG-XXV Plant & Animal Genomes Conference, (San Diego, CA).

# AUTHOR CONTRIBUTIONS

MS, TL, and BK conceived and designed the experiments. RA-T, MS, TL, and BK performed the experiments. RA-T, AA, DL, BK, and MS analyzed the data. AA and MS wrote the manuscript. All authors reviewed and approved the publication.

# FUNDING

This study was supported by a competitive grant No. 2014-67015-21602 from the United States Department of Agriculture, National Institute of Food and Agriculture (MS). The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the funding agents. RA-T trainee's projects are supported by Grant Number T32HL072757 from the National Heart, Lung, and Blood Institute.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00386/full#supplementary-material

TABLE S1 | List of all SNPs affecting >2% of the additive genetic variance in shear force.

TABLE S2 | Created/deleted miRNA targets as a consequence of SNPs affecting genetic variance in shear force.

TABLE S3 | DAVID functional annotation for genes harboring SNPs affecting the additive genetic variance in shear force.

TABLE S4 | List of all SNPs affecting >2% of the additive genetic variance in protein content.

TABLE S5 | Created/deleted miRNA targets as a consequence of SNPs affecting genetic variance in protein content.

TABLE S6 | DAVID functional annotation for genes harboring SNPs affecting the additive genetic variance in shear force.

TABLE S7 | List of all SNPs affecting the additive genetic variance in shear force.

TABLE S8 | List of all SNPs affecting the additive genetic variance in protein content.

DATA SHEET S1 | The genotypes (ped and map files) and phenotypes files used in GWA analysis.


in a special type of synchronously replicating chromatin. Mol. Biol. Cell 15, 1816–1832.


fgene-10-00386 May 3, 2019 Time: 10:1 # 15


subunit of an ion channel that mediates extracellular Ca2+ regulation of neuronal excitability. Proc. Natl. Acad. Sci. U.S.A. 109, E1963–E1971. doi: 10.1073/pnas.1204023109



associated with growth and muscle quality traits in rainbow trout. Sci. Rep. 7:9078. doi: 10.1038/s41598-017-09515-4


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ali, Al-Tobasei, Lourenco, Leeds, Kenney and Salem. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Transcriptome Profiling Insights the Feature of Sex Reversal Induced by High Temperature in Tongue Sole Cynoglossus semilaevis

Jinxiang Liu<sup>1,2</sup>, Xiaobing Liu<sup>1</sup>, Chaofan Jin<sup>1</sup>, Xinxin Du<sup>1</sup>, Yan He<sup>1,2</sup> and Quanqi Zhang<sup>1,2</sup>\*

<sup>1</sup> Key Laboratory of Marine Genetics and Breeding, Ministry of Education, Ocean University of China, Qingdao, China, <sup>2</sup> Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China

Sex reversal induced by temperature change is a common feature in fish. Usually, the sex ratio shift occurs when temperature deviates too much from normal during embryogenesis or sex differentiation stages. Despite decades of work, the mechanism by which temperature acts during early development and sex reversal remains unclear. In this study, we used Chinese tongue sole as a model to identify gonad transcriptomic features and epigenetic mechanisms involved in temperature-induced masculinization. Some genetic females reversed to pseudomales after high-temperature treatment, which caused the sex ratio imbalance. RNA-seq data showed that the expression profiles of females and males were significantly different, and a set of genes showed sexually dimorphic expression. The general transcriptomic features of pseudomales were similar to those of males, but genes involved in spermatogenesis and energy metabolism were differentially expressed. In gonads, the methylation level of the cyp19a1a promoter was higher in males and pseudomales than in females. Furthermore, high-temperature treatment increased the cyp19a1a promoter methylation levels of females. We observed a significant negative correlation between methylation levels and expression of cyp19a1a. An in vitro study showed that the CpG within the cAMP response element (CRE) of the cyp19a1a promoter was hypermethylated, and that DNA methylation decreased the basal and forskolin-induced activities of the cyp19a1a promoter. These results suggest that epigenetic change, i.e., DNA methylation, which regulates the expression of cyp19a1a, might be the mechanism for temperature-induced masculinization in tongue sole. It may be a common mechanism in teleosts in which sex reversal can be induced by temperature.

Keywords: high-temperature treatment, RNA-seq, cyp19a1a, DNA methylation, Cynoglossus semilaevis

### Edited by:

Paulino Martínez, University of Santiago de Compostela, Spain

### Reviewed by:

Gustavo M. Somoza, CONICET Institute of Biotechnological Research (IIB-INTECH), Argentina

Laia Ribas, Institute of Marine Sciences (ICM), Spain

> \*Correspondence: Quanqi Zhang qzhang@ouc.edu.cn

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 16 November 2018 Accepted: 13 May 2019 Published: 29 May 2019

### Citation:

Liu J, Liu X, Jin C, Du X, He Y and Zhang Q (2019) Transcriptome Profiling Insights the Feature of Sex Reversal Induced by High Temperature in Tongue Sole Cynoglossus semilaevis. Front. Genet. 10:522. doi: 10.3389/fgene.2019.00522

# INTRODUCTION

Sex determination is highly diverse in teleosts. Three main types of primary sex determination have been described in gonochoristic species: genotypic sex determination (GSD), temperature-dependent sex determination (TSD) and a combination of both (GSD + TSD) (Ospina-Álvarez and Piferrer, 2008; Yamamoto et al., 2014). The sex of fish shows considerable plasticity during development. In addition to genetic information, environmental factors such as temperature can influence sex determination. Apart from fish, irreversible determination of gonadal sex by temperature has been well established in reptiles and amphibians (Sarre et al., 2011; Flament, 2016). Since it was first described in Menidia menidia (Conover and Kynard, 1981), this phenomenon has been widely observed in fish: the sex ratio becomes unbalanced if the fish experience high temperature during the thermosensitive period (TSP). The imbalance of the sex ratio is caused by sex reversal. Usually, the response can be divided into three types: (1) high temperature is positively correlated with the proportion of males; (2) high temperature induces females and is negatively correlated with the proportion of males; and (3) both low and high temperatures increase the proportion of males (Baroiller and D'Cotta, 2001; Devlin and Nagahama, 2002; Ospina-Álvarez and Piferrer, 2008).

To elucidate the molecular mechanism of temperature effects (TE), a series of studies has been carried out. Steroid hormones, glucocorticoids, and epigenetic modification have been reported to be related to sex reversal and to play critical roles during sex differentiation in TSD (Hattori et al., 2009; Lance, 2009; Nakamura, 2010; Yoshinaga et al., 2010; Navarro-Martín et al., 2011; Fernandino et al., 2012, 2013; Kitano et al., 2012; Piferrer, 2013; Zhang et al., 2013). In addition, it was discovered that intron retention in the JARID2 and JMJD3 genes of Pogona vitticeps could mediate the sex-reversed female phenotype (Deveson et al., 2017). The androgen-to-estrogen ratio determines whether an undifferentiated gonad differentiates into a testis or an ovary in non-mammalian vertebrates (Simpson et al., 1994). Regulation of this steroid ratio depends on the activity of gonadal aromatase, the product of cyp19a1a, which irreversibly converts androgens into estrogens (Simpson et al., 1994). In reptiles, up-regulating or down-regulating cyp19a1a can alter the gonad phenotype. The expression level of gonadal cyp19a1a is associated with TSD in Trachemys scripta and Alligator mississippiensis (Pieau and Dorizzi, 2004; Matsumoto et al., 2016). In teleosts, high-temperature-induced masculinization has been shown to be related to the methylation level of the cyp19a1a promoter in Dicentrarchus labrax. Methylation of the promoter region can suppress the binding of transcription factors (SF-1, FOXL2, and CREB) to their corresponding sites, resulting in altered expression (Navarro-Martín et al., 2011; Zhang et al., 2013). Similar conclusions were reached in Oreochromis niloticus and Oncorhynchus mykiss (Valdivia et al., 2014; Wang et al., 2017). Meanwhile, FOXL2 and SOX9, which show dimorphic DNA methylation patterns, have also been considered candidate genes in A. mississippiensis and Paralichthys olivaceus (Parrott et al., 2014; Si et al., 2016).
Other factors have also been suggested to play a role in GSD + TE, such as heat shock proteins (HSPs), transient receptor potential channels (TRPs), cold inducible RNA binding proteins (CIRBPs), and microRNAs (Kohno et al., 2010; Rhen and Schroeder, 2010; Bizuayehu et al., 2015; Czerwinski et al., 2016; Schroeder et al., 2016).

The effect of temperature on sex differentiation can be profound and far-reaching, and comprehensive studies are needed to fully understand the molecular mechanisms. Chinese tongue sole, Cynoglossus semilaevis, is a teleost with GSD + TSD sex determination and ZZ/ZW sex chromosomes (Zhou et al., 2005). Female-specific DNA sequences have been identified in C. semilaevis and can be used to distinguish genetic females from males (Wang et al., 2009, 2013). Therefore, C. semilaevis is a uniquely powerful model for exploring molecular events associated with GSD + TSD. Previous BS-seq and RNA-seq studies reported that epigenetic modification is involved in sex reversal of C. semilaevis, and transgenerational epigenetic inheritance was observed in the offspring of sex-reversed individuals (Chen et al., 2014; Shao et al., 2014). We aimed to identify genes related to sex differentiation, explore the relationship between expression level and methylation, and analyze whether methylation could regulate transcription factor binding. In this study, genetic females that reversed to phenotypic males are defined as pseudomales. These pseudomales were identified in the high temperature treatment group using female-specific markers. RNA-seq was performed on the gonads of females, males, and pseudomales. The global expression profiles were investigated, and candidate genes involved in gonad development were identified. The methylation patterns of the putative genes were also analyzed. The interaction between the upstream regulatory sequence and the corresponding transcription factors was verified using a dual-luciferase reporter system. These findings help us to understand the genetic and epigenetic programming driving vertebrate GSD + TE and provide insight for future investigations aimed at clarifying the mechanisms controlling sex differentiation and sex reversal.

# MATERIALS AND METHODS

# Fish Rearing and Temperature Treatment

Fish and embryos were collected from Yellow Sea Aquatic Product Co. Ltd., Shandong, China. Embryos were incubated at 20°C, the natural temperature for C. semilaevis spawning, fertilization and hatching. For this study, a batch of embryos collected from three pairs of parents was used. After hatching, the fry were reared at ambient temperature (20–22°C). Juveniles at 25 days post fertilization (dpf), with a total length (TL) of 13 ± 2 mm, were separated into two groups. One group (n = 3000) was reared at ambient temperature throughout the TSP as the control group (low temperature group, LT). The other group (n = 3000) was exposed to 28°C during the entire TSP as the high temperature group (HT). The temperature was increased to 28°C at a rate of 0.5°C/day and then maintained for 100 days, until 125 dpf (**Figure 1**). The water was then returned to ambient temperature and followed natural fluctuations until the end of the study, when the fish were 300 days old. The proportions of phenotypic males and females in the LT and HT groups were determined by gonad biopsy and confirmed by histological sectioning. Among the phenotypic males, genetic males and pseudomales were distinguished using female-specific markers (Wang et al., 2009). The survival rate was also calculated for both groups.

# Sample Collection and Gonadal Histology


At 300 dpf, fish were sacrificed and gonadal samples were collected. For each fish, one gonad was processed for histological identification of phenotypic sex and DNA/RNA extraction. Gonads were fixed in 4% PFA in PBS, embedded in paraffin, cut at 7 µm thickness and stained with haematoxylin-eosin. The other gonad was snap-frozen in liquid nitrogen and stored at −80°C for RNA-seq analysis. Muscle tissues were collected to extract DNA for individual sexing and methylation analysis. The methylation level of muscle was used as the control.

# RNA Isolation, cDNA Library Construction, and Illumina Sequencing

Gonads from nine individuals, comprising three biological replicates each of females (FO), males (MT), and pseudomales (PMT), were selected for RNA-seq analysis. Total RNA was extracted using Trizol Reagent (Invitrogen, Carlsbad, CA, United States) according to the manufacturer's protocol, treated with RNase-free DNase I (TaKaRa, Dalian, China) to degrade genomic DNA, and then frozen at −80°C. RNAclean Kit was applied to remove proteins. RNA quality and quantity were evaluated by 1.5% agarose gel electrophoresis and by spectrophotometry using a NanoPhotometer Pearl (Implen GmbH, Munich, Germany) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, United States).

The nine RNA-seq libraries were constructed with Illumina TruSeq RNA Sample Prep Kit (Illumina, San Diego, CA, United States) in accordance with the manufacturer's instruction. Then the libraries were subjected to paired-end sequencing of 125 bp on the Illumina HiSeq 2000.

# Data Processing and Bioinformatics Analysis

Raw reads generated from the Illumina sequencing platform were cleaned by removing adaptors and low quality sequences using FastQC. The cleaned reads of each sample were mapped to the reference genome (Chen et al., 2014) by TopHat with default parameters (Kim and Salzberg, 2011). Then the mapping files were analyzed using Cufflinks to assemble the reads into transcripts for each dataset (Roberts et al., 2011). Complete transcripts were obtained by merging the assemblies of all datasets using Cuffmerge.

# Identification of Differentially Expressed Genes and Functional Enrichment Analysis

All the expressed genes were aligned to databases for homology annotation, including the non-redundant protein database (NR), Swiss-Prot, Gene Ontology (GO), euKaryotic Orthologous Groups (KOG), and Kyoto Encyclopedia of Genes and Genomes (KEGG), by BlastX with an e-value cutoff of 1e-5 (Kanehisa et al., 2008).

FPKM values, calculated by Cuffdiff (Trapnell et al., 2012), were used to select the differentially expressed genes (DEGs). To identify the DEGs among female, male and pseudomale gonads, we set the following standards: genes with an adjusted log2FoldChange ≥ 2 or ≤ −2, and P < 0.01, were considered DEGs. The DEGs were then subjected to GO term and KEGG pathway enrichment analysis using DAVID (Huang et al., 2008). Global similarities and differences among the expression profiles of all individuals were visualized by principal component analysis (PCA), MA plot and heatmap. These analyses were completed with R packages.
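The selection rule above can be sketched as a simple filter. This is a minimal illustration of the stated cutoffs (absolute log2 fold change of at least 2 and P < 0.01), not the authors' Cuffdiff workflow; the record type, function name, and example values are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class GeneResult(NamedTuple):
    gene: str
    log2_fold_change: float
    p_value: float


def select_degs(results: Iterable[GeneResult],
                min_abs_log2fc: float = 2.0,
                max_p: float = 0.01) -> List[str]:
    """Return genes passing both the fold-change and the P-value cutoff."""
    return [
        r.gene
        for r in results
        if abs(r.log2_fold_change) >= min_abs_log2fc and r.p_value < max_p
    ]


# Hypothetical values for illustration only; not data from the study.
example = [
    GeneResult("geneA", 3.1, 0.001),   # passes both cutoffs
    GeneResult("geneB", -2.4, 0.004),  # passes (down-regulated)
    GeneResult("geneC", 1.5, 0.0001),  # fold change too small
    GeneResult("geneD", -2.8, 0.03),   # P-value too large
]
print(select_degs(example))
```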

# qRT-PCR Validation

fgene-10-00522 May 27, 2019 Time: 14:37 # 4

A total of ten DEGs (Sox9, GATA4, Dmrt1, AMH, HSD11b2, cyp19a1a, esr1, topaz1, GATA6, Sox3) were selected for qRT-PCR validation. Specific primer pairs were designed by IDT. qRT-PCR was performed in a 20 µL reaction containing 10 ng of template cDNA and SYBR qPCR SuperMix (Novoprotein, Shanghai, China) on a LightCycler 480 (Roche, Forrentrasse, Switzerland), with pre-incubation at 95°C for 5 min followed by 45 cycles of 95°C for 15 s and 60°C for 45 s. The relative quantities of the target genes, expressed as fold variation over GAPDH, were calculated using the 2<sup>−ΔΔCt</sup> comparative Ct method. qRT-PCR data were statistically analyzed using one-way ANOVA followed by the LSD test in SPSS 20.0. P < 0.05 indicated statistical significance.
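For readers unfamiliar with the comparative Ct method, the calculation can be sketched as follows. The function name, the choice of calibrator sample, and the Ct values are illustrative assumptions; only the use of GAPDH as the reference gene comes from the text.

```python
def relative_expression(ct_target_sample: float,
                        ct_ref_sample: float,
                        ct_target_calibrator: float,
                        ct_ref_calibrator: float) -> float:
    """Fold change of a target gene by the comparative Ct (2^-ddCt) method.

    The reference gene (GAPDH in the text) normalizes each sample;
    the calibrator is the group the fold change is expressed against.
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target amplifies two cycles earlier in the
# sample than in the calibrator, at equal reference-gene Ct.
fold = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold)
```

A difference of two cycles corresponds to a fourfold difference in relative quantity, assuming roughly 100% amplification efficiency.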

# Methylation Levels Measured by Bisulfite-Mediated Genomic Sequencing

Methylation sites in the promoter were predicted and BSP primers were designed using MethPrimer. Genomic DNA was extracted from gonad and muscle tissues of females, males and pseudomales (six individuals each). DNA samples from the same tissue of the same sex were pooled, and the pooled DNA was bisulfite-modified using the EZ DNA Methylation-Gold Kit (ZYMO Research). The primers M-cyp19a1a-Fw1/Rv1 and M-cyp19a1a-Fw2/Rv2 were used for methylation-specific PCR. Eight positive clones were sequenced for each group. Site-specific methylation measurements were analyzed using BiQ-Analyzer.
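The way clone-level bisulfite data translate into methylation levels can be sketched as below. This is an illustration of the arithmetic only, not the BiQ-Analyzer procedure; the encoding ("1" for a methylated CpG, "0" for an unmethylated one), the function name, and the clone patterns are hypothetical.

```python
from typing import List, Sequence, Tuple


def methylation_levels(clones: Sequence[str]) -> Tuple[List[float], float]:
    """Per-CpG and overall methylation fractions from sequenced clones.

    Each clone is a string with one character per CpG site:
    "1" = methylated (C retained after bisulfite), "0" = unmethylated.
    """
    if not clones:
        raise ValueError("at least one clone is required")
    n_sites = len(clones[0])
    if any(len(c) != n_sites for c in clones):
        raise ValueError("all clones must cover the same CpG sites")
    per_site = [
        sum(c[i] == "1" for c in clones) / len(clones) for i in range(n_sites)
    ]
    overall = sum(c.count("1") for c in clones) / (n_sites * len(clones))
    return per_site, overall


# Hypothetical patterns: eight clones covering six CpG sites.
clones = ["111011", "110011", "111111", "101011",
          "111010", "011011", "111011", "110111"]
per_site, overall = methylation_levels(clones)
print(per_site)
print(round(overall, 3))
```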

# Cyp19a1a-Luc Reporter Vector Construct and in vitro Methylation

A pGL3-Cyp19a1a-Luc reporter vector was constructed by inserting the cyp19a1a promoter fragment into the pGL3-basic vector (Promega, Madison, WI, United States) between the SacI and XhoI sites. The promoter was a 1969 bp fragment amplified from genomic DNA with primers pGL3-cyp19a1a-Fw/Rv (**Supplementary Table S3**). The pGL3-cyp19a1a promoter vector was cytosine-methylated using M.SssI methylase (Thermo Fisher Scientific, MA, United States) according to the manufacturer's instructions, yielding M-cyp19a1a-Luc. M.SssI methylates all cytosine residues within the double-stranded dinucleotide recognition sequence (CpG). The methylation status of the vector was checked with HhaI, which digests only unmethylated DNA.

# Transfection and Luciferase Reporter Gene Assay

The HEK 293T cell line was used for transfection with unmethylated and methylated plasmids. Before the experiment, a total of 5 × 10<sup>5</sup> cells were seeded into 24-well plates and cultured for 24 h. The plasmids were then transfected into HEK 293T cells using Lipofectamine™ 3000 Transfection Reagent (Thermo Fisher Scientific, MA, United States) according to the manufacturer's instructions. At 48 h after transfection, cells were washed with PBS and analyzed for Luc activity using the luciferase assay system (Promega, Madison, WI, United States). Forskolin (5 µM), a cAMP activator that acts through the CREB site, was added 10 h before the end of cell culture.

# RESULTS

# Sexual Ratio Changes After High Temperature Treatment

The survival rates and the proportions of females and males were determined in both the LT and HT groups after treatment. No difference in survival rate was found between the two groups (χ<sup>2</sup> = 0.190, P = 0.663); the survival rates were 62.97 and 59.80% in the LT and HT groups, respectively (**Figure 2E**). The proportions of females and males, as determined by biopsy and gonad tissue sectioning, were 56.58 and 43.53% in the LT group (**Figures 2A–C**) and 36.51 and 63.49% in the HT group (**Figure 2D**), respectively. The proportion of males increased significantly, by about 20 percentage points, after high temperature treatment during the TSP (χ<sup>2</sup> = 7.624, P = 0.006). These data indicated that masculinization was induced in genetic females following high temperature treatment.
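The reported P values can be checked against the χ<sup>2</sup> statistics with a short sketch. It assumes one degree of freedom, as for a 2 × 2 table of group by outcome (the text does not state the degrees of freedom), and the 2 × 2 helper is shown with made-up counts because the underlying counts are not given here.

```python
import math


def chi2_p_value_df1(chi2_stat: float) -> float:
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_stat / 2.0))


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Statistics reported in the text (1 degree of freedom is an assumption).
print(round(chi2_p_value_df1(0.190), 3))  # survival rate comparison
print(round(chi2_p_value_df1(7.624), 3))  # sex ratio comparison

# Made-up counts, only to show how the statistic itself is formed.
print(round(chi2_2x2(10, 20, 20, 10), 3))
```

Under the one-degree-of-freedom assumption, both statistics reproduce the P values given in the text.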

# RNA Sequencing

The genotypic and phenotypic sex of these individuals was determined using molecular markers and tissue sections (**Figure 2F**). A total of nine cDNA libraries were sequenced on the Illumina platform, generating 655,677,682 raw reads, encompassing about 30 Gb of sequence. The valid ratio and GC content of each cDNA library are shown in **Table 1**. Approximately 80.1% of reads exhibited significant hits to the genome. The transcriptome data obtained from the samples have been deposited in the NCBI SRA under accession number PRJNA480118 (SAMN09628942, SAMN09628943, SAMN09628989, SAMN09628990, SAMN09628991, SAMN09628992, SAMN09628993, SAMN09628994, and SAMN09628995).

# Differential Expression and Functional Enrichment Analysis

Principal component analysis (PCA) was conducted to detect global similarities and differences in expression profiles among FO, MT, and PMT. The ovary (FO) replicates clustered closely in one region, whereas the testis (MT and PMT) replicates clustered in another, with the MT and PMT replicates grouping together (**Figure 3B**). These results demonstrated that the expression patterns of phenotypic females and males were significantly different, whereas the expression profiles of males and pseudomales were similar. Although females and pseudomales share the same genotype, their expression profiles were quite different. Conversely, males and pseudomales possess different sex chromosomes, yet their expression patterns were similar.

Among the DEGs, 5851 genes were significantly differentially expressed in FO vs. MT, and 5611 genes were differentially expressed in FO vs. PMT. Between MT and PMT, only 426 genes were identified as DEGs (**Figure 3C** and **Supplementary Table S1**). Regarding the functions of the DEGs, a large number of genes related to gonad development and sex differentiation were identified, including Dmrt1, Dmrt3, HSD3b1, AMH, HSD3b7, esr1, SOX9, GATA4, GATA6, cyp19a, and AMHR2 (**Table 1**). A heatmap of hierarchical clustering of the DEGs was generated to visualize the expression patterns. The profile of phenotypic females was clearly different from that of all phenotypic males, and the expression pattern of pseudomales was close to that of males (**Figure 3A**).

TABLE 1 | Summary statistics of gonad transcriptome sequencing data.

After filtration, the DEGs were subjected to GO and KEGG enrichment analyses. All the DEGs were mapped to GO terms and compared with the background of the whole transcriptome. They were significantly enriched in several GO terms in the biological process, cellular component and molecular function categories (**Supplementary Table S2**). The enrichment results were as follows: (1) In the DEGs of FO vs. MT, terms related to sexual differentiation and the regulation of reproduction were enriched, including sperm motility, 3-beta-hydroxy-delta5-steroid dehydrogenase activity and steroid hormone receptor activity. In addition, terms related to immune response were enriched (**Figure 4A**). (2) In FO vs. PMT, terms related to steroid hormones and helicases were detected, such as steroid hormone receptor activity and helicase activity (**Figure 4B**); these are also involved in reproduction and in sexual differentiation and development. (3) In MT vs. PMT, terms related to reproduction and to testis generation and development were detected, comprising male gamete generation, spermatogenesis, spermatid development, spermatid differentiation, and sterol transport (**Figure 4C**). Surprisingly, GO terms related to energy metabolism were also enriched, including UTP metabolic process, CTP metabolic process, CTP biosynthetic process and GTP metabolic process (**Figure 4C** and **Supplementary Table S2**). These terms are involved in meiosis and gamete generation, and may influence sperm activity. KEGG pathway enrichment analysis was also performed, and a total of 44 KEGG terms were significantly enriched. The enriched signal pathways were similar in FO vs. MT and FO vs. PMT, including ribosome biogenesis, cell adhesion, and metabolism and biosynthesis.

Only one signal pathway involved in metabolism, the phosphatidylinositol signaling system, was enriched (**Figure 5**).
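Enrichment of a GO term or KEGG pathway among DEGs, relative to the whole-transcriptome background, is commonly assessed with a hypergeometric (one-sided Fisher) test. The sketch below shows that plain test as an illustration only; the enrichment tool used in the study reports its own statistic, and the function name and gene counts here are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int,
                           n_term_background: int,
                           n_degs: int,
                           n_term_degs: int) -> float:
    """P(X >= n_term_degs) when n_degs genes are drawn without replacement
    from a background in which n_term_background genes carry the term."""
    upper = min(n_term_background, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_term_background, k)
        * comb(n_background - n_term_background, n_degs - k)
        for k in range(n_term_degs, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 20 background genes, 5 annotated to the term,
# 4 DEGs of which 2 are annotated to the term.
p = hypergeom_enrichment_p(20, 5, 4, 2)
print(round(p, 4))
```

In practice the resulting P values would also be corrected for the number of terms tested.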

# Identification of Genes Involved in Sexual Differentiation and Gonad Development

To identify genes involved in reproduction, including gonad development, gametogenesis and steroid biosynthesis in C. semilaevis, three strategies were used. (1) Sex-related genes were retrieved from the enriched GO terms related to reproduction and steroids. (2) The DEGs were filtered using a set of keywords reported in other teleosts, including gonad, sex, oocyte, meiosis, steroid, reproduction, and morphogenesis (Fan et al., 2014; Shao et al., 2014; Robledo et al., 2015). (3) Some genes were chosen from sex-related KEGG pathways. Using these strategies, a set of potential candidate genes was obtained, and qRT-PCR validation was conducted (**Table 2**). Additionally, the DEGs of MT vs. PMT were analyzed independently, and genes involved in spermatogenesis, gamete generation and development, and energy metabolism were selected (**Table 3**).

## qRT-PCR Validation

Ten DEGs (Sox9, GATA4, Dmrt1, AMH, HSD11b2, cyp19a1a, esr1, topaz1, GATA6, Sox3) associated with gonad development or steroid biosynthesis were selected for qRT-PCR validation. All the genes displayed consistent expression patterns in qRT-PCR and RNA-seq (**Figure 6**). Pearson correlation analysis showed a significant correlation between the qRT-PCR and RNA-seq data (R = 0.394, P = 0.031), supporting the reliability of the RNA-seq results.
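The correlation reported above can be related to its test statistic with a small sketch. The number of paired observations is not stated in the text; n = 30 (ten genes in three groups) is an assumption made here purely for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two samples of equal length, n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Assumption: n = 30 paired values (10 genes x 3 groups); not stated in the text.
print(round(t_statistic(0.394, 30), 3))
```

The resulting t value would then be compared with a t distribution on n − 2 degrees of freedom to obtain the P value.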

# Sex-Specific Methylation Levels of Gonadal cyp19a1a Promoter

The DEG analysis, the qRT-PCR validation, and the results of a previous study (Shao et al., 2014) indicated that cyp19a1a plays an essential role in sex differentiation and temperature-induced sex reversal in C. semilaevis. Proper expression of cyp19a1a is considered essential for maintaining the ratio of androgen to estrogen. This balance might be disrupted by changes in cyp19a1a expression caused by abnormal environmental temperature. Epigenetic modification is considered one of the factors that might affect the cyp19a1a expression level.

To test this hypothesis, DNA methylation of the cyp19a1a promoter in gonad and muscle was examined. The CpG dinucleotides within ∼2000 bp upstream of the transcription start site were selected; they fell into two clusters: 10 CpGs in the distal promoter region (−1857 to −1718, designated region I) and 6 CpGs in the proximal promoter region (−357 to −220, designated region II). No difference in methylation level was detected in muscle tissue among females, males and pseudomales (**Figure 7B**). In the gonads, however, significantly higher methylation levels were observed in male and pseudomale testes than in female ovaries (**Figure 7A**). Notably, high temperature-induced sex reversal from females to pseudomales was accompanied by a significant elevation of the methylation level of the gonadal cyp19a1a promoter.

To investigate whether promoter methylation regulates the expression of cyp19a1a, qRT-PCR was performed in gonads of the three groups. The expression levels in females from the LT and HT groups were similar and were significantly higher (P < 0.05) than those in males (both LT and HT groups) and pseudomales. No difference in expression was observed between males and pseudomales (P > 0.05) (**Figure 7C**). Based on the methylation and expression data, we conclude that the expression level of cyp19a1a was strongly negatively correlated with the promoter methylation level in gonads.

However, this was not the case in muscle tissue, where cyp19a1a was expressed only at a basal level. The average methylation levels of the cyp19a1a promoter were similarly high in all groups, regardless of temperature treatment (**Figure 7B**). Two-way ANOVA showed no significant differences among the three groups in cyp19a1a promoter methylation level with respect to temperature treatment (P > 0.05) or sex (P > 0.05), and no significant interaction between the two factors (P > 0.05).

# DNA Methylation Inhibits cAMP-Stimulated cyp19a1a Promoter Activity in vitro

Transcription factor binding sites in the cyp19a1a promoter were predicted using MatInspector. Two binding sites for CREB were found at the CpGs at positions −1818 and −226, respectively (**Figure 8B**). The in vitro study demonstrated that methylation decreased the activity of the cyp19a1a promoter. The activity of the unmethylated promoter was significantly induced by forskolin stimulation. In contrast, no significant change was observed for the methylated promoter (**Figure 8A**).

### TABLE 2 | DEGs associated with sex differentiation and gonad development in FO vs. MT and FO vs. PMT.

# DISCUSSION

Since the initial discovery of vertebrate GSD + TE, the mechanism by which temperature exerts its influence on sex determination has been extensively investigated (Ferguson and Joanen, 1982; Rhen and Schroeder, 2010; Czerwinski et al., 2016; Schroeder et al., 2016; Yatsu et al., 2016). Sex reversal can be induced when the temperature reaches a threshold, which changes the sex ratio. The sex of embryos, larvae or juveniles can be reversed completely or partly at a threshold temperature in reptiles and teleosts (Ferguson and Joanen, 1982; Strüssmann et al., 1996; Hattori et al., 2013; Czerwinski et al., 2016). Pseudomales have the same chromosome complement as females, but a completely different phenotype (Hu et al., 2014). Usually, temperature exerts its influence during the TSP of embryo, larva or juvenile development, when individuals remain sexually flexible (Navarro-Martín et al., 2011; Holleley et al., 2015).

TABLE 3 | DEGs associated with spermatogenesis and energy metabolism in MT vs. PMT.

In this study, C. semilaevis, a temperature-sensitive teleost, was used. We demonstrated a significant imbalance in sex ratio and survival rate after high temperature treatment during the TSP. The proportion of males was about 20% higher in the HT group, indicating that masculinization was induced by high temperature. This phenomenon has also been reported in M. menidia, D. labrax, O. niloticus, and O. mykiss (Conover and Kynard, 1981; Navarro-Martín et al., 2011; Valdivia et al., 2014; Wang et al., 2017). Interestingly, only some of the females were susceptible to sex reversal. In a recent study, a SNP (A/T) of FBXL17 had a large controlling effect on sex reversal in C. semilaevis: individuals with the ZAW genotype never reversed into phenotypic males, while those with the ZTW genotype sometimes underwent sex reversal (Jiang and Li, 2017). Based on these results, we speculated that some mutation might cause females to be sensitive to temperature, and that sex reversal emerged when the temperature exceeded the threshold.

To explore the expression profiles of females, males and pseudomales, RNA-seq was performed. Many GO terms involved in reproduction and steroid biosynthesis were identified by DEG and enrichment analysis. Interestingly, some GO terms related to immune responses were also enriched. Similar results were reported in Pogona vitticeps, in which the expression levels of prominent immune genes were significantly lower in pseudomales than in females and males. Further, canonical stress-related GO terms were enriched, including defense response and response to biotic stimuli (Deveson et al., 2017). The immune system is known to be intertwined with stress. Meanwhile, evidence shows that stress and sex are connected in vertebrates. In Amphiprion akallopisos and Odontesthes bonariensis, cortisol was considered the regulator of sex change in response to environmental or social stress (Hattori et al., 2009; Yoshinaga et al., 2010; Fernandino et al., 2012, 2013; Kitano et al., 2012; Todd et al., 2016). In reptiles, POMC and corticosterone-mediated stress was observed in sex-reversed individuals (Deveson et al., 2017). In birds and rats, elevated maternal corticosterone and ACTH skewed the sex ratio of offspring (Barbazanges et al., 1996; Pike and Petrie, 2006). In humans, evidence indicates that maternal stress could increase circulating corticosterone and affect the neuroendocrine system. These stresses had long-lasting effects on offspring morphology, behavior, physiology, and phenotype, which could cause an imbalance in the sex ratio (Obel et al., 2007; Navara, 2010; Schnettler and Klüsener, 2014; van den Heuvel et al., 2018). Based on these studies, we speculated that C. semilaevis larvae were stressed by high temperature and that the immune response was activated. These responses then influenced the endocrine system, causing up-regulation or down-regulation of cortisol. The biosynthesis and secretion of steroids were disrupted, which led to sex reversal under the stress of high temperature treatment. The evidence so far is not adequate, so the interaction between stress and the endocrine system, and the specific mechanism, need further study.

A body of evidence has been obtained for environmental influences on phenotypic plasticity in vertebrates mediated by epigenetic mechanisms, such as DNA methylation and histone deacetylation (Reik et al., 2001; Jaenisch and Bird, 2003). Epigenetic regulation can inhibit or stimulate gene transcription, which alters gene expression from the same genetic blueprint and thus affects development and differentiation (Rottach et al., 2009). In previous studies, whole-genome methylation was found to be involved in high temperature-induced sex reversal in C. semilaevis, and the methylation modifications in sex-reversed males were inherited. In addition, the dosage of a Z chromosomal region was related to sex reversal in C. semilaevis (Chen et al., 2014; Shao et al., 2014). However, higher levels of cyp19a1a methylation were found together with higher levels of cyp19a1a gene expression (Shao et al., 2014). In the present study, we found that the methylation level of the C. semilaevis cyp19a1a promoter was significantly higher in males than in females. Importantly, the methylation profiles of pseudomales were similar to those of males, but clearly different from those of females, although pseudomales had the same genotype (ZW) as females. Based on the methylation and expression data, we concluded that the expression level of cyp19a1a was strongly negatively correlated with the promoter methylation level in ovaries and testes. In Oryzias latipes, DNA methylation of the cyp19a1a promoter was reported to be related to sex differentiation (Contractor et al., 2004). Methylation levels in males were twice those in females in D. labrax gonads (Navarro-Martín et al., 2011). Similar findings were also reported in O. niloticus and P. olivaceus (Fan et al., 2017; Wang et al., 2017). Cyp19a1a plays an important role in sex differentiation by regulating estrogen synthesis. In C. semilaevis, females and pseudomales had the same genetic background (ZW), but different DNA methylation and expression levels of cyp19a1a. Epigenetic modification caused by high temperature might alter the topology of DNA and block the binding of transcription factors, which could change the expression of cyp19a1a. The in vitro study demonstrated that methylation of the −1818 and −226 sites in the cyp19a1a promoter inhibited the binding of transcription factor CREB and suppressed the promoter activity, which could regulate the expression level of cyp19a1a. Thus, our results showed that epigenetic modification, most likely DNA methylation, regulated the expression of gonadal cyp19a1a, which in turn mediated sex differentiation.

to HEK 293T cells. Luciferase activities were measured 48 h after transfection. Fold change was calculated, and the cyp19a1a-Luc group was used as control. (A) The CREB-mediated stimulation of cyp19a1a promoter activity by forskolin in HEK 293T cells. (B) The location of CG sites and CREB sites in the two approximate clusters of the cyp19a1a promoter. Data are shown as mean ± SD (n = 3). ∗∗P < 0.01 indicates a significant difference.

Interestingly, many DEGs between males and pseudomales were enriched in GO terms involved in spermatogenesis, including spermatogenesis, male genitalia development, male gamete generation, spermatid development, and spermatid differentiation. Both males and pseudomales generate sperm, but the process seemed to be significantly different. In males, only Z-type sperm are generated, whereas both Z-type and W-type sperm are theoretically generated in pseudomales. The generation of different types of sperm might influence spermatogenesis and spermatid differentiation and development. Surprisingly, GO terms related to energy metabolism, such as UTP, GTP, and CTP biosynthetic and metabolic processes, were also enriched in DEGs between males and pseudomales. Energy metabolism could affect sperm vitality. The results implied that the quality of sperm generated by males and pseudomales might be significantly different. In theory, super female (WW) individuals could be generated when W-type sperm fertilize W-type eggs. However, super females were never observed at the larval stage among the offspring of pseudomales in our lab (unpublished data). These lines of evidence suggest that W-type sperm generated by pseudomales might have weak vitality. Pseudomales might be unable to generate functional W-type sperm, or WW embryos might not develop normally into larvae (**Figure 9**).
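The theoretical expectation behind this argument can be made explicit with a small enumeration. The sketch below is illustrative only: it assumes that each parent contributes its gamete types with equal probability, and the function name `offspring_ratios` is hypothetical rather than part of any analysis in this study.

```python
from collections import Counter
from itertools import product


def offspring_ratios(sire_gametes, dam_gametes):
    """Enumerate equally likely sperm x egg pairings and return genotype frequencies."""
    counts = Counter(
        "".join(sorted(pair, reverse=True))  # write Z before W, e.g. "ZW"
        for pair in product(sire_gametes, dam_gametes)
    )
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# Pseudomale (ZW) x female (ZW), assuming Z-type and W-type sperm are equally functional.
expected = offspring_ratios("ZW", "ZW")

# Same cross, assuming W-type sperm never contribute to viable offspring.
without_w_sperm = offspring_ratios("Z", "ZW")
```

Under the first assumption a quarter of the offspring would be WW; under the second none would be, which matches the absence of super females described above. The enumeration cannot distinguish non-functional W-type sperm from early lethality of WW embryos.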

# REFERENCES


# CONCLUSION

In conclusion, we demonstrated that high temperature could induce masculinization in C. semilaevis. The expression patterns of pseudomales were similar to those of males, but genes involved in spermatogenesis and energy metabolism were differentially expressed. In addition, high-temperature treatment could change the epigenetic modification of the cyp19a1a promoter, increasing its DNA methylation level in pseudomales and thereby decreasing cyp19a1a expression. There was a negative correlation between the methylation level and the expression of cyp19a1a. Thus, epigenetic regulation of cyp19a1a might play an essential role in the sex reversal induced by high temperature in C. semilaevis.

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Administration of Affairs Concerning Experimental Animals. The protocol was approved by the College of Marine Life, Ocean University of China.

# AUTHOR CONTRIBUTIONS

JL, XL, CJ, and XD performed the experiments and analyzed the data. JL and YH prepared the figures and wrote the manuscript. QZ designed the experiments.

# FUNDING

This research was supported by the National Natural Science Foundation of China (31802327), the National Key Research and Development Program of China (2018YFD0900601), and the China Postdoctoral Science Foundation (2017M622282).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00522/full#supplementary-material

TABLE S1 | DEGs in FO vs. MT, FO vs. PMT, and MT vs. PMT.

TABLE S2 | GO terms in biological process, cellular component, and molecular function levels.

TABLE S3 | List of primer sequences used in the study.


Conover, D. O., and Kynard, B. E. (1981). Environmental sex determination: interaction of temperature and genotype in a fish. Science 213, 577–579.


induced Nile tilapia masculinization. J. Therm. Biol. 69, 76–84. doi: 10.1016/j.jtherbio.2017.06.006


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Liu, Jin, Du, He and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# A Novel Candidate Gene Associated With Body Weight in the Pacific White Shrimp Litopenaeus vannamei

Quanchao Wang<sup>1†</sup>, Yang Yu<sup>1,2†</sup>, Qian Zhang<sup>1,3</sup>, Xiaojun Zhang<sup>1,2</sup>, Jianbo Yuan<sup>1,2</sup>, Hao Huang<sup>4</sup>, Jianhai Xiang<sup>1,2,5</sup> and Fuhua Li<sup>1,2,5</sup>\*

<sup>1</sup> Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China, <sup>2</sup> Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China, <sup>3</sup> University of Chinese Academy of Sciences, Beijing, China, <sup>4</sup> Hainan Grand Suntop Ocean Breeding Co., Ltd., Wenchang, China, <sup>5</sup> Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Zhe Zhang, South China Agricultural University, China Qi Cun Zhou, Ningbo University, China

### \*Correspondence:

Fuhua Li fhli@qdio.ac.cn

†These authors have contributed equally to this work

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 04 November 2018 Accepted: 13 May 2019 Published: 31 May 2019

### Citation:

Wang Q, Yu Y, Zhang Q, Zhang X, Yuan J, Huang H, Xiang J and Li F (2019) A Novel Candidate Gene Associated With Body Weight in the Pacific White Shrimp Litopenaeus vannamei. Front. Genet. 10:520. doi: 10.3389/fgene.2019.00520

Improvement of growth traits is always a focus of selective breeding programs for the Pacific white shrimp Litopenaeus vannamei (L. vannamei). Identification of growth-related genes or markers can contribute to the application of modern breeding technologies, and thus accelerate the genetic improvement of growth traits. The aim of this study was to identify the genes and molecular markers associated with the growth traits of L. vannamei. A population of 200 individuals was genotyped using the 2b-RAD technique for genome-wide linkage disequilibrium (LD) analysis and a genome-wide association study (GWAS). The results showed that LD decayed fast in the studied population, which suggests that it is feasible to fine map growth-related genes with GWAS in L. vannamei. One gene, designated LvSRC and encoding the class C scavenger receptor (SRC), was identified as a growth-related candidate gene by GWAS. Further targeted sequencing of the candidate gene in another population of 322 shrimp revealed that several non-synonymous mutations within LvSRC were significantly associated with body weight (P < 0.01), and the most significant marker (SRC\_24) located in the candidate gene could explain 13% of the phenotypic variance. The current results provide not only molecular markers for genetic improvement in L. vannamei, but also new insights into the growth regulation mechanism in penaeid shrimp.

Keywords: penaeid shrimp, growth traits, GWAS, candidate gene, class C scavenger receptor

# INTRODUCTION

Litopenaeus vannamei (L. vannamei), one of the most economically important marine aquaculture species, plays an important role in meeting the growing demand for high-quality animal protein. It is estimated that L. vannamei provides approximately 70% of total shrimp production worldwide (Li et al., 2018). The continuous development of the shrimp industry drives the genetic improvement of important economic traits. During the past decade, great efforts have been made to improve key economic traits, including growth traits and disease resistance (Argue et al., 2002; Huang et al., 2011; Andriantahina et al., 2013). Among these traits, growth is always the focus of breeders because it directly contributes to shrimp production.

In general, growth traits such as body weight present moderate to high heritability (Sui et al., 2016; Nolasco-Alzaga et al., 2017), and the genetic gain per generation has reached 10.7% (Andriantahina et al., 2012), which is higher than that of farmed terrestrial species. At present, broodstock with high and stable growth performance are urgently needed to meet the requirements of the shrimp culture industry. Modern molecular breeding technologies, including marker assisted selection (MAS), gene assisted selection (GAS), and gene editing, are promising methods for accelerating the genetic improvement of growth traits (Gjedrem and Baranski, 2010). To date, several growth-related genes involved in molting and muscle development, such as molt-inhibiting hormone (MIH), crustacean hyperglycaemic hormone (CHH), ecdysteroid receptor (EcR), actin and myostatin differential factor 11 (MSTN), have been identified (Li et al., 2011; Jung et al., 2013). However, growth traits are likely to be highly polygenic, and the underlying physiological bases may involve complex regulatory networks of many interacting genes with different effects. Although QTL mapping analysis of growth traits has been conducted in L. vannamei (Yu et al., 2015), the limited number of markers makes fine mapping of QTLs difficult. Hence, new methods are urgently required to localize the major genes or markers related to growth traits in shrimp.

Genome-wide association studies (GWAS) have been successfully performed to identify genes participating in the regulation of complex traits in humans (Mccarthy et al., 2008), livestock (Zhang et al., 2013), and crops (Huang and Han, 2014). Recently, with the development of high-throughput sequencing technologies and the successive decoding of aquatic animal genomes, GWAS has become a powerful tool for analyzing the genetic basis of complex traits, and candidate genes associated with growth traits or disease resistance have been reported in a number of aquatic animals, including Atlantic salmon (Sodeland et al., 2013; Gutierrez et al., 2015; Correa et al., 2017), rainbow trout (Vallejo et al., 2014; Gonzalez-Pena et al., 2016), and catfish (Geng et al., 2015; Jin et al., 2016). However, no such study has been reported in L. vannamei. In the present study, we aimed to identify growth-related loci or genes in a diverse population by using GWAS integrated with a candidate gene association study, and to provide convincing evidence for revealing the molecular mechanism of growth traits in L. vannamei.

TABLE 1 | Numbers of SNPs and the average distances between adjacent SNP pairs for each chromosome.

Chr, chromosome; No, number.

# MATERIALS AND METHODS

# Animals and Genotyping

Two populations, designated A16 and B2016\_13, were used in this study. Both populations were created and cultured at Guangtai Marine Breeding Company in Hainan province, China. Population A16 was established in 2015 as previously described (Wang et al., 2017). Briefly, it was composed of 200 individuals from 13 full-sib families (offspring of 13 dams and 13 sires). Each full-sib family was cultured separately in a 5 m<sup>2</sup> tank until body length reached 3 cm, and then 50 individuals from each family were transferred to a 10 m<sup>2</sup> pond for culture. At harvest, 200 individuals were randomly collected for phenotyping and genotyping. Population B2016\_13 was constructed in 2016; individuals from multiple full-sib families were mixed after spawning, and a total of 322 individuals were collected and phenotyped. The sex of all individuals from the two populations was determined using a sex-associated marker (Yu et al., 2017). The average body weight was 5.56 ± 2.16 g for the A16 population and 9.51 ± 3.30 g for the B2016\_13 population.

Total DNA of each sample was extracted from shrimp muscle using the Plant Genomic DNA Kit (TIANGEN, Beijing, China) according to the manufacturer's instructions. The purity and integrity of the extracted DNA were determined using a NanoDrop 1000 Spectrophotometer (NanoDrop, Wilmington, DE, United States) and electrophoresis on a 1% agarose gel. Qualified genomic DNA was stored at −20°C.

All individuals from the A16 population were genotyped genome-wide using the 2b-RAD method (Wang et al., 2012), which was carried out by OE Biotech Company (OE Biotech, Shanghai, China). The reference genome was de novo assembled using the reads from the 10 individuals with high sequencing depth, and each individual was genotyped using the RADtyping program (Fu et al., 2013). Shrimp from the B2016\_13 population were genotyped at the targeted loci of the candidate genes by PCR-based sequencing.

TABLE 2 | Summary of the first twenty significant SNPs associated with body weight by GWAS.

Chr, chromosome; MAF, minor allele frequency; and UN, no chromosome was assigned for the marker.

TABLE 3 | The primers designed for the identification of SNP in the coding region of LvSRC.


Ta, the optimal annealing temperature.

# Genome-Wide Linkage Disequilibrium Analysis

The physical positions of SNPs were identified by aligning the 2b-RAD markers to the assembled reference genome of L. vannamei with BLAST (Zhang et al., 2019). LD was estimated from the SNP genotypes and physical position information. The squared correlation of allele frequencies (r<sup>2</sup>) was used as the measure of LD (Hill, 1974). The r<sup>2</sup> between each pair of SNPs on the same chromosome was calculated using the "genetics" package in R (R Core Team, 2018). The decay of r<sup>2</sup> with distance was fitted using the expected value of r<sup>2</sup> under drift-recombination equilibrium, as previously implemented (Remington et al., 2001; Marroni et al., 2011).
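The two quantities described above can be sketched in a few lines. This is a simplified illustration, not the procedure used in this study: `genotype_r2` computes the squared Pearson correlation of 0/1/2 genotype codes, which only approximates the haplotype-based estimate of the R "genetics" package, and `expected_r2` uses the drift-recombination form commonly attributed to Hill and Weir, which is assumed here because the text does not spell out the equation. All function and parameter names are hypothetical.

```python
def genotype_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts.

    Individuals with a missing call (None) at either SNP are skipped.
    A monomorphic SNP has zero variance and would raise ZeroDivisionError.
    """
    pairs = [(a, b) for a, b in zip(dosage_a, dosage_b)
             if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov * cov / (var_a * var_b)


def expected_r2(distance_bp, rho_per_bp, n_chromosomes):
    """Expected r^2 at a given distance under drift-recombination equilibrium.

    C = rho_per_bp * distance_bp is the scaled recombination parameter and
    n_chromosomes is the number of sampled chromosomes (assumed form).
    """
    c = rho_per_bp * distance_bp
    base = (10 + c) / ((2 + c) * (11 + c))
    correction = 1 + (3 + c) * (12 + 12 * c + c * c) / (
        n_chromosomes * (2 + c) * (11 + c))
    return base * correction
```

In practice `rho_per_bp` would be estimated by non-linear least squares against the observed pairwise r<sup>2</sup> values; that fitting step is omitted from the sketch.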

# Genome-Wide Association Study

The genome-wide association study for body weight was performed using the egscore function in the R package GenABEL (Aulchenko et al., 2007). Potential bias in association caused by hidden population stratification was corrected using principal components (PCs) of the genomic kinship matrix (Price et al., 2006). After inspecting the eigenvalues of the kinship matrix, the first four PCs were selected to adjust the genotypes and phenotypes. Sex was included as a fixed factor. Because adjusting with PCs did not remove all population stratification, a further genomic control correction of the obtained P values was performed using the inflation factor. Considering the small sample size and the sparse marker density, the threshold for genome-wide significance was set at P = 0.01 (−log<sub>10</sub>P = 2).
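The genomic control step can be illustrated as follows. This is a minimal sketch of the general technique rather than the GenABEL implementation: it assumes 1-df test statistics, estimates the inflation factor as the ratio of the observed median statistic to the theoretical median, and never deflates statistics when that ratio is below 1 (a common convention, assumed here). The function names are hypothetical.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    return _STD_NORMAL.inv_cdf(p / 2) ** 2


def chi2_to_p(chi2):
    """Upper-tail probability of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(chi2 / 2))


def genomic_control(p_values):
    """Return (inflation factor, corrected P values)."""
    stats = [p_to_chi2(p) for p in p_values]
    lam = max(median(stats) / CHI2_1DF_MEDIAN, 1.0)  # assumed: no deflation
    return lam, [chi2_to_p(s / lam) for s in stats]


def is_significant(p, threshold=0.01):
    """Apply the threshold used in this study, P < 0.01 (-log10 P > 2)."""
    return p < threshold
```

An inflation factor close to 1 indicates that the PC adjustment has already removed most stratification; larger values shrink every test statistic by the same factor before the threshold is applied.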

# Candidate Genes Study

The sequences of SNPs associated with body weight were compared by BLAST against the genome sequence of L. vannamei (Zhang et al., 2019). Given the rapid LD decay (**Figure 1**), genes within 18 kb upstream and downstream of the significant SNPs were considered candidate genes. SNPs in the coding regions of candidate genes were then detected by PCR-based sequencing. The non-synonymous SNPs were genotyped in all individuals from the B2016\_13 population and tested for association with body weight. The association test was performed using a linear model in R, with sex as a covariate. According to the principle of variance decomposition in linear models (Ho and Lin, 2003), the proportion of phenotypic variance (Var) explained by a SNP significantly associated with the body weight of L. vannamei was calculated as follows:

$$Var = \frac{SS_R}{SS_R + SS_S + SS_E} \times 100\%$$

where SS<sub>R</sub> is the sum of squares attributable to the SNP, SS<sub>S</sub> is the sum of squares attributable to sex, and SS<sub>E</sub> is the residual sum of squares.
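A worked sketch of this decomposition is given below. It rests on several assumptions not stated in the text: the SNP is coded additively as 0/1/2, sex is coded 0/1, and the sums of squares are sequential (sex entered before the SNP), as R's `anova()` would report for `lm(weight ~ sex + snp)`. The helper names are hypothetical and the least-squares solver is deliberately minimal.

```python
def _solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def _rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = _solve(xtx, xty)
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def variance_explained(weight, sex, snp):
    """Percentage of phenotypic variance explained by the SNP (Var in the formula)."""
    rss_null = _rss([], weight)            # intercept only
    rss_sex = _rss([sex], weight)          # intercept + sex
    ss_e = _rss([sex, snp], weight)        # residual of the full model
    ss_s = rss_null - rss_sex              # sum of squares due to sex
    ss_r = rss_sex - ss_e                  # sum of squares due to the SNP
    return 100.0 * ss_r / (ss_r + ss_s + ss_e)
```

With sequential sums of squares the denominator equals the total sum of squares, so Var is the share of total phenotypic variation attributed to the SNP after sex has been accounted for; the value can change if the terms are entered in a different order.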

# RESULTS

# Genome-Wide Linkage Disequilibrium

A total of 23,049 single nucleotide polymorphism (SNP) markers were obtained after quality control, in which SNPs with a missing rate above 5% across samples or a minor allele frequency below 0.05 were removed. By aligning these markers to the assembled reference genome of L. vannamei, 13,814 SNPs were successfully mapped onto chromosomes. These SNPs were located on 44 chromosomes (Chrs), with a median distance between adjacent markers of 226.12 kb and an average of 314 SNP markers per chromosome (**Table 1**). The number of SNPs varied among Chrs, from 43 on chr14 to 761 on chr1. The average distance between adjacent SNP pairs within a Chr also differed, ranging from 91.45 kb on chr15 to 1431.73 kb on chr39. A total of 2,579,595 SNP pairs were used to calculate r<sup>2</sup>. The r<sup>2</sup> values are plotted against distance in **Figure 1A**. The overall LD (r<sup>2</sup>) across the genome between all paired SNPs was 0.06, and only a few values (0.4%) exceeded r<sup>2</sup> = 0.6. A rapid decay of LD is shown in **Figure 1B**, where r<sup>2</sup> decreased to 0.2 at a SNP marker interval of 18 kb.
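The quality control criteria stated above amount to a simple per-marker filter. The following sketch assumes genotypes coded as 0/1/2 copies of one allele with `None` for missing calls; the function name `passes_qc` and its parameter names are hypothetical and are not taken from the genotyping pipeline.

```python
def passes_qc(calls, max_missing=0.05, min_maf=0.05):
    """Return True if a SNP is kept: missing rate <= 5% and MAF >= 0.05.

    calls: per-individual genotypes coded 0/1/2, or None when missing.
    """
    observed = [g for g in calls if g is not None]
    if not observed:
        return False
    missing_rate = (len(calls) - len(observed)) / len(calls)
    allele_freq = sum(observed) / (2 * len(observed))
    maf = min(allele_freq, 1 - allele_freq)
    return missing_rate <= max_missing and maf >= min_maf


def filter_snps(genotypes_by_snp):
    """Keep only the SNPs (dict of name -> calls) that pass both criteria."""
    return {name: calls for name, calls in genotypes_by_snp.items()
            if passes_qc(calls)}
```

Applied to a 200-individual panel, a marker would be dropped if more than 10 calls were missing or if the rarer allele appeared in fewer than 5% of the called allele copies.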

# Genome-Wide Association Study

The Manhattan plot of all SNPs is shown in **Figure 2**. A total of 226 SNPs significantly associated with body weight were identified at a threshold of P < 0.01 (−log<sub>10</sub>P > 2). Among these 226 significant SNPs, 84 are currently unassigned to chromosomes, and the remaining 142 were mapped to 39 chromosomes. Given the large number of significant markers, the twenty most significant markers were used for subsequent analysis. Of these, 12 SNPs are currently unassigned to chromosomes, and the remaining 8 were mapped to 6 chromosomes (**Table 2**). Gene annotation showed that only the marker ref-613798-25 was located in the coding region of a gene, which encodes the class C scavenger receptor (SRC). Therefore, this gene, referred to hereafter as LvSRC, was considered the most likely candidate gene for body weight in L. vannamei.

# Candidate Gene Association Study

Ten PCR primers (**Table 3**) were developed from the targeted genome sequences of LvSRC, each used to amplify a specific locus. A total of 29 SNPs (including ref-613798-25) were identified in the coding region of LvSRC (**Figure 3**). Among these, 20 SNPs are synonymous mutations (**Supplementary Table S1**) and 9 are non-synonymous mutations (**Table 4**). All of the non-synonymous mutations were examined for association with body weight in the B2016\_13 population. The results showed that 7 SNPs were significantly associated (P < 0.01) with body weight. SRC\_24 contributed most significantly to the trait and explained 13% of the phenotypic variance, followed by SRC\_13 (6%), SRC\_15 (6%), SRC\_7 (6%), SRC\_14 (6%), ref-613798-25 (4%), and SRC\_27 (4%).

Var, the ratio of phenotypic variance explained by the SNP; NA, null value.

# DISCUSSION

To our knowledge, this is the first report on LD patterns and GWAS in L. vannamei. Overall, LD decays rapidly in this population, which suggests that fine mapping of growth-related genes with GWAS is feasible. However, high-density markers will be required to increase the power of GWAS. An average r<sup>2</sup> greater than 0.2 has been proposed as a desirable requirement for GWAS in previous studies (Meuwissen et al., 2001; Mckay et al., 2007). Considering a genome length of 2.64 Gb in L. vannamei (Yu et al., 2015), ∼150 K fully informative markers would be needed to meet this requirement at an average resolution of 18 kb.
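The marker requirement follows from dividing the genome length by the target resolution, assuming evenly spaced, fully informative markers:

$$N_{markers} \approx \frac{2.64 \times 10^{9}\ \text{bp}}{18 \times 10^{3}\ \text{bp}} \approx 1.47 \times 10^{5}$$

This is roughly six times the 23,049 markers retained after quality control in the present study.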

Although only ∼23 K markers were used for GWAS in this study, a large number of markers significantly associated with body weight were identified (P < 0.01). This result is consistent with the previous suggestion that shrimp growth is highly polygenic and regulated by complex networks of many interacting genes (Moss and Moss, 2009). The LvSRC gene identified here may be one of those interacting genes and may play an important role in the regulation of shrimp growth.

Scavenger receptors (SRs) comprise a large family of structurally diverse transmembrane cell surface glycoproteins, classified into nine heterogeneous subclasses (A-I) according to their multidomain structures (Canton et al., 2013). As one member of the SRs, SRC has been identified in only a few invertebrates, including Drosophila melanogaster, Aedes aegypti, and Marsupenaeus japonicus. Moreover, previous studies have revealed only the function of SRC in immunological processes (Rämet et al., 2001; Yang et al., 2016, 2017), and whether SRC participates in growth regulation remains largely unknown. Although the SR family encompasses a wide range of molecules with little structural homology (Canton et al., 2013), almost all of them have been characterized in vertebrates by the common ability to bind modified low density lipoproteins (LDLs), such as oxidized LDL (OxLDL) and acetylated LDL (AcLDL). Therefore, SRs can play a central role in lipid metabolism. A similar function of SRs has also been revealed in invertebrates. For example, in Macrobrachium nipponense, the expression of the gene encoding the class B SR can be regulated by dietary lipid sources including soybean and linseed oils (Ding et al., 2016). It is therefore plausible that SRC is related to the body weight of shrimp through participation in lipid metabolism.

The significant SNPs in the coding region of LvSRC, especially the marker SRC\_24, could be promising candidates for marker assisted breeding of growth traits in L. vannamei. Nevertheless, it remains uncertain which mutations within the LvSRC gene are the causative loci associated with shrimp growth. Gene editing technology will therefore be a powerful tool for determining the causative locus in the future. In addition, the phenotypic variation of complex traits can be affected by mutations in non-coding regions of genes, including untranslated regions (Si et al., 2016) and promoter regions (Wang et al., 2016). Whether the causative loci are located in the non-coding region of LvSRC should therefore be investigated further.

In addition, a number of significant markers from GWAS could not be annotated. There may be two reasons for this. First, parts of the reference genome were not fully assembled, which makes gene annotation difficult. Second, the candidate gene region was determined based on the average LD decay rate in this study; however, LD decay might differ considerably among genomic regions (Lu et al., 2012; Kawakami et al., 2017). Therefore, more growth-related genes are likely to be revealed in the future as genome information increases and LD decay is surveyed in detail across genomic regions.

# REFERENCES

# CONCLUSION

In this study, LD decayed rapidly in the studied population, with average r<sup>2</sup> falling to 0.2 at 18 kb, which suggests that it is feasible to fine map growth-related genes using this population. By using GWAS integrated with a candidate gene association study, LvSRC was shown to be associated with growth traits. This result not only provides molecular markers that may help accelerate the genetic improvement of penaeid shrimp, but also provides new insights into the regulatory mechanism of shrimp growth. Further studies are needed to fine map the causative mutation in LvSRC and investigate its regulatory mechanism in shrimp growth.

# AUTHOR CONTRIBUTIONS

QW and YY conducted the experiment and data processing. JX and FL conceived and supervised the project. QZ, XZ, and JY helped prepare the genomic DNA for SNP genotyping. HH prepared and cultured the experimental animals. QW, YY, and FL wrote the manuscript. All authors have read and approved the manuscript.

# FUNDING

This work was supported by National Key R&D Program of China (2018YFD0901301), National Natural Science Foundation of China (31830100), and China Agriculture Research System-48 (CARS-48).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00520/full#supplementary-material


Andriantahina, F., Liu, X., Huang, H., and Xiang, J. (2012). Response to selection, heritability and genetic correlations between body weight and body size in Pacific white shrimp, Litopenaeus vannamei. Chin. J. Oceanol. Limnol. 30, 200–205. doi: 10.1007/s00343-012-1066-2

taura syndrome virus. Aquaculture 204, 447–460. doi: 10.1016/s0044-8486(01) 00830-4



Zhang, X., Yuan, J., Sun, Y., Li, S., Gao, Y., Yu, Y., et al. (2019). Penaeid shrimp genome provides insights into benthic adaptation and frequent molting. Nat. Commun. 10:356. doi: 10.1038/s41467-018- 08197-4

**Conflict of Interest Statement:** HH was employed by the company Hainan Grand Suntop Ocean Breeding Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wang, Yu, Zhang, Zhang, Yuan, Huang, Xiang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome and Phylogenetic Analysis of Genes Involved in the Immune System of Solea senegalensis – Potential Applications in Aquaculture

Aglaya García-Angulo<sup>1</sup> , Manuel A. Merlo<sup>1</sup> , María E. Rodríguez<sup>1</sup> , Silvia Portela-Bens<sup>1</sup> , Thomas Liehr<sup>2</sup> and Laureana Rebordinos<sup>1</sup> \*

<sup>1</sup> Área de Genética, Facultad de Ciencias del Mar y Ambientales, Universidad de Cádiz, Cádiz, Spain, <sup>2</sup> Institute of Human Genetics, Jena University Hospital, Friedrich Schiller University Jena, Jena, Germany

### Edited by:

Paulino Martínez, University of Santiago de Compostela, Spain

### Reviewed by:

Filippo Biscarini, Italian National Research Council (CNR), Italy Shaojun Liu, Hunan Normal University, China Ricardo Utsunomia, São Paulo State University, Brazil

> \*Correspondence: Laureana Rebordinos laureana.rebordinos@uca.es

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 09 January 2019 Accepted: 14 May 2019 Published: 11 June 2019

### Citation:

García-Angulo A, Merlo MA, Rodríguez ME, Portela-Bens S, Liehr T and Rebordinos L (2019) Genome and Phylogenetic Analysis of Genes Involved in the Immune System of Solea senegalensis – Potential Applications in Aquaculture. Front. Genet. 10:529. doi: 10.3389/fgene.2019.00529

Global aquaculture production continues to increase rapidly. One of the most important marine fish species currently cultivated in Southern Europe is Solea senegalensis, whose production reached more than 300 Tn in 2017. In the present work, 14 Bacterial Artificial Chromosome (BAC) clones containing candidate genes involved in the immune system (b2m, il10, tlr3, tap1, tnfα, tlr8, trim25, lysg, irf5, hmgb2, calr, trim16, and mx) were examined and compared with other species using multicolor Fluorescence in situ Hybridization (mFISH), massive sequencing and bioinformatic analysis to determine the genomic surroundings and syntenic chromosomal conservation of the genomic region contained in each BAC clone. The mFISH showed that the groups of genes hmgb2-trim25-irf5-b2m; tlr3-lysg; tnfα-tap1; and il10-mx-trim16 were co-localized on the same chromosomes. Synteny results suggested that the studied BACs are placed in a smaller number of chromosomes in S. senegalensis than in other species. Phylogenetic analyses suggested that the evolutionary rate of the immune system genes studied is similar among the taxa studied, given that the clustering obtained was in accordance with the accepted phylogenetic relationships among these species. This study contributes to a better understanding of the structure and function of the immune system of the Senegalese sole, which is essential for the development of new technologies and products to improve fish health and productivity.

Keywords: Solea senegalensis, bacterial artificial chromosome, immune system, aquaculture, syntenic conservation

# INTRODUCTION

Global aquaculture production continues to increase rapidly, yet only a small proportion of the aquatic animals and plants being produced are obtained from managed breeding and improvement programs (MacKenzie and Jentoft, 2016). However, the accelerated growth of aquaculture has resulted in adverse effects to the natural environment and to human health. This concern is illustrated by the widespread and, in some cases, unrestricted use of prophylactic antibiotics in this industry, with the objective of preventing bacterial infections resulting from sanitary shortcomings in fish rearing. This practice has resulted in the emergence of antibiotic-resistant bacteria in aquaculture environments, the increase of antibiotic resistance in fish pathogens, the transfer of these resistance determinants to bacteria affecting land animals and to human pathogens, and alterations of the bacterial microbiome in both sediments and water column (Han et al., 2017). All of these are serious and undesirable outcomes.

A viable alternative for avoiding chemicals and preventing economic impact is the administration of immunostimulants, prebiotics and probiotics, which act to reinforce the innate immune system of the farmed fish (Nayak, 2010). A selection program is also an important tool for optimizing the immune capacity of the stocks, developing new technologies and products to improve productivity and to overcome the misuse of antibiotics. Thus, it is essential to understand in depth the structure and function of the fish immune system. However, few studies deal with the immune system in commercially important fish species.

Solea senegalensis is a flatfish species belonging to the order Pleuronectiformes, which comprises about 570 species. It is distributed along the northwestern coast of Africa, as far north as the southwestern coast of the Iberian Peninsula, including the Mediterranean Sea (Díaz-Ferguson et al., 2007). Commercial production of S. senegalensis started in the early 1980s and this species is considered a promising candidate for the diversification of aquaculture (Chairi et al., 2010; Padrós et al., 2011). In the last 10 years, the production of S. senegalensis in Spain has increased from 32 to 747.15 Tn, which illustrates the rapid growth of interest in production of the species (FishStat, FAO, 2019).

Several studies have already been carried out to improve the production of the Senegalese sole. The high mortality rates at different phases of production and the high incidence of diseases, particularly pasteurellosis and flexibacteriosis, have been critical in recognizing the need for better production methods for the sole (Morais et al., 2014). A comprehensive study of the genes involved in disease resistance should greatly facilitate the solution of these problems.

The elaboration of an integrated genetic map would provide complete information about the localization and structure of genes of interest. This information could be used for comparative genomics purposes, and would constitute the scientific basis for developing improvement programs. In the case of S. senegalensis, genome mapping has been carried out in recent years: markers such as the minor and major ribosomal genes and other repetitive sequences were first localized using FISH techniques (Cross et al., 2006; Manchado et al., 2006). The elaboration of a BAC library in S. senegalensis has allowed researchers to localize single-copy genes (Ponce et al., 2011), and to integrate the cytogenetic map with the physical map obtained by BAC sequencing (García-Cegarra et al., 2013; Merlo et al., 2017; Portela-Bens et al., 2017). In addition, linkage maps were also created in S. senegalensis (Molina-Luzón et al., 2014) and in the closely related species Solea solea (Diopere et al., 2014). A preliminary draft genome for a S. senegalensis female has been published and consisted of 34176 scaffolds with an N50 of 85 kb. Furthermore, this draft genome contained 209 out of the 274 ultra-conserved core eukaryotic genes, with a completeness of 84.3% and an average number of orthologues of 1.31, considering the number of eukaryotic genes found in the scaffolds (Manchado et al., 2016).

In S. senegalensis, the expression of various genes related to the immune system has been examined, including hepcidin, lysozyme g-type and the TNF gene family (Salas-Leiton et al., 2010, 2012; Núñez-Díaz et al., 2016). An exhaustive expression analysis of genes relevant for the immune system was also undertaken in the closely related species S. solea (Ferraresso et al., 2016). However, knowledge of the gene structure, genomic characterization and localization of immune-related genes is limited. Studies of this kind have been carried out only with the g-type lysozyme (Ponce et al., 2011), myxovirus resistance protein 1, immunoglobulin superfamily member 9b, and semaphorin 7a (García-Cegarra et al., 2013).

In this work, the localization and the genomic organization of 14 BAC clones containing immune-related genes were assessed. Seven out of the 14 BAC clones contain well-known immune-related genes, such as the g-type lysozyme (lysg), myxovirus resistance 1 (mx1), toll-like receptors 3 and 8 (tlr3 and tlr8), beta-2-microglobulin (b2m), interferon regulatory factor 5 (irf5) and tumor necrosis factor α (tnfα). Another four BAC clones were chosen for their relationship with the immune system as reported in the literature, such as antigen peptide transporter 1 (tap1) (Pinto et al., 2011), interleukin-10 (il10) (Zou et al., 2003), and two BAC clones with calreticulin (calr) (Wang et al., 2018). The remaining three were anonymous BAC clones that, in the sequencing and annotation process, were found to contain immune-related genes, such as tripartite motif-containing proteins 16 and 25 (trim16 and trim25) and high mobility group protein B2 (hmgb2). The genes studied belong to both the innate and acquired immune system. The objective of this study was to carry out an analysis of micro-synteny, comparative mapping and phylogenetics between S. senegalensis and relevant aquaculture species. This will deepen knowledge of the genome structure and the evolutionary trends of the immune system within flatfish species. The results would facilitate future work related to quantitative trait loci (QTL) and gene expression.

# MATERIALS AND METHODS

# PCR Screening of the Solea senegalensis BAC Library

A 4D-PCR methodology (Asakawa et al., 1997) was carried out to find and isolate clones bearing targeted gene sequences from a BAC library previously constructed in S. senegalensis (García-Cegarra et al., 2013). Fourteen BAC clones containing immune-related genes were isolated (**Table 1**). The thirteen candidate genes used to isolate BACs were lysozyme g type (lysg), calreticulin (calr), myxovirus resistance 1 (mx1), toll-like receptors 3 and 8 (tlr3 and tlr8), interferon regulatory factor 5 (irf5), beta-2 microglobulin (b2m), antigen peptide transporter 1 (tap1), interleukin-10 (il10), tumor necrosis factor α (tnfα), tripartite motif-containing protein 25 (trim25), high mobility group protein B2 (hmgb2) and tripartite motif-containing protein 16 (trim16). The PCR conditions were the same as those described in García-Cegarra et al. (2013). BACs are named after the candidate gene they harbor.

The immune-related genes are underlined.

# BAC Clone Sequencing and Annotation

Positive BAC clones were isolated using the Large Construct Kit (Qiagen, Hilden, Germany), and sent for sequencing using the Illumina sequencing platform (Illumina, San Diego, CA, United States). The sequences were generated on the MiSeq equipment, with a configuration of 300 cycles of paired-end reads (Lifesequencing S.L., Valencia). The reads were de novo assembled using SPAdes software version 3.11.1. The functional and structural annotation of the gene sequences identified in each BAC clone was carried out in a semiautomated process. Proteins and expressed sequence tags (ESTs) from S. senegalensis and related species were compared. The homologous sequences obtained were used to obtain the best predictions for gene annotation. Finally, all the information available was used to create plausible models and, when possible, functional information was added. Using the Apollo genome editor (Lewis et al., 2002), Signal map software (Roche Applied Science, Penzberg, Germany), and Geneious R11 (Kearse et al., 2012), the results were individually completed and adjusted in the final editing process of the annotation. All BAC clones have been deposited in the GenBank database under the accession numbers AC278047 to AC278120. The structure of some of the genes was compared with those of seven other representative fish species, i.e., Danio rerio (zebrafish), Oreochromis niloticus (tilapia), Gasterosteus aculeatus (stickleback), Seriola dumerili (greater amberjack), Seriola lalandi dorsalis (yellowtail amberjack), Scophthalmus maximus (turbot), and Cynoglossus semilaevis (tongue sole).

Cross-species genome comparisons were carried out at two levels. At the first level, a micro-synteny study was performed using the ENSEMBL database and the NCBI platform. The order of the contigs within each BAC clone of S. senegalensis was estimated using the information provided by these programs. The seven species used for this comparison were the same as those listed above. All BAC clones were analyzed with the exception of the mx BAC clone, where only one gene was found, and the trim16 BAC clone. In the schematic figures each gene is represented by a different color; white color indicates a gene that is different from that found in the Senegalese sole.

At the second level of comparison, a synteny analysis was performed using the CIRCOS software (Krzywinski et al., 2009). The five species used in this analysis were C. semilaevis, S. maximus, O. niloticus, G. aculeatus and D. rerio. The flatfish Paralichthys olivaceus was not included because its genome assembly is not at chromosome level. These sequences are available in the ENSEMBL database and the NCBI platform. In the case of S. dumerili and S. lalandi dorsalis, the synteny analysis could not be done because the complete genomes were not available in these databases. In the figures, the locations of the genes that make up the BAC clones of S. senegalensis are compared with their locations in each of the other species studied, so that the relationship between the chromosomes of the two species is displayed. Each BAC clone is represented by a different color so that co-localizations and reorganizations of genes can be better observed.

# Cytogenetic Mapping

Chromosome preparations were obtained according to García-Angulo et al. (2018). To prepare FISH probes, BAC clones were grown on LB containing chloramphenicol, at 37 °C, overnight. BAC-DNA was extracted using the BACMAX™ DNA purification kit (Epicentre Biotechnologies, Madison, United States), following the manufacturer's instructions. Insert presence was evaluated by digestion with EcoRI and agarose gel electrophoresis (0.8%). Probes were amplified by DOP-PCR and then labeled by a conventional PCR using four different fluorochromes, i.e., Texas Red (Life Technologies, Carlsbad, California, United States), Spectrum Orange, Fluorescein isothiocyanate (FITC) (Abbott Molecular/ENZO, Illinois, United States), and diethylaminocoumarin (DEAC) (Vysis, Downers Grove, United States), using the protocol described in Liehr (2009).

Chromosome preparations were pre-treated with pepsin solution at 37 °C and fixed with paraformaldehyde solution. Finally, preparations were dehydrated with ethanol in a concentration series of 70%, 90%, and 100%, and air-dried. Hybridization and post-hybridization treatments were performed according to Portela-Bens et al. (2017).

Slides were visualized with a fluorescence microscope (Olympus BX51 and/or Zeiss Axioplan using software of MetaSystems, Altlussheim, Germany) equipped with a digital CCD camera (Olympus DP70) to capture the images.

# Phylogenetic Analysis


Before concatenation, the sections with the highest homogeneity in each gene were selected, and the degree of substitution saturation was examined in each gene using saturation plots of transitions (s) and transversions (v), as implemented in the DAMBE6 software (Xia, 2017). The distance model used was GTR. Saturation is inferred when the index of substitution saturation (ISS) is either larger or not significantly smaller than the critical value (ISS.C). Finally, the genes chosen for phylogenetic analysis were those representative of different immune pathways, present in a wide range of species, and showing no significant substitution saturation. Under these requirements, up to five immune-system genes (tlr3, tlr8, nlrc3, calr, ikbke) were concatenated to perform the phylogenetic analysis. Thirty-four species were included to generate the phylogenetic tree: twenty-two fish species, including the target species S. senegalensis; ten mammal species; one reptile species; and, additionally, Latimeria chalumnae, which was included as an outgroup to root the tree. The sequences were aligned using the MAFFT program (Katoh and Toh, 2008) following an iterative method of 100 iterations. The final alignment consisted of a total of 7806 positions, of which 2354 were for tlr3, 1725 for tlr8, 1772 for nlrc3, 1284 for calr and 671 for ikbke. The PhyML 3.0 program (Guindon et al., 2010) was used to determine the best-fit phylogenetic model and then to run the model. The resulting best-fit model was the Generalized Time-Reversible (GTR) model (Tavaré, 1986), considering the proportion of invariable sites (+I) and gamma distribution (+G). The statistic used for model selection was the Akaike information criterion (AIC), the value of which was 255435.93458, with a log-likelihood (LnL) of -127384.510441. Branch support was tested by the fast likelihood-based method using aLRT SH-like (Anisimova et al., 2011). Finally, the tree was edited in the MEGA7 program (Kumar et al., 2016).
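The gene-selection and concatenation steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the 0.05 significance threshold and the toy sequences are assumptions made for illustration, and only the five partition lengths come from the text.

```python
# Per-gene alignment lengths reported in the text (alignment positions).
PARTITION_LENGTHS = {"tlr3": 2354, "tlr8": 1725, "nlrc3": 1772,
                     "calr": 1284, "ikbke": 671}


def passes_saturation_test(iss, iss_c, p_value, alpha=0.05):
    """Return True when ISS is significantly smaller than ISS.C.

    Saturation is inferred when ISS is larger than, or not significantly
    smaller than, ISS.C. The alpha threshold is an assumed value.
    """
    return iss < iss_c and p_value < alpha


def concatenate(alignments, gene_order):
    """Concatenate per-gene alignments into one supermatrix.

    alignments maps gene -> {species: aligned sequence}. Species missing
    from a gene are padded with gaps. Returns (supermatrix, partitions),
    where partitions gives 1-based inclusive coordinates for each gene.
    """
    species = sorted({sp for gene in gene_order for sp in alignments[gene]})
    pieces = {sp: [] for sp in species}
    partitions = {}
    start = 1
    for gene in gene_order:
        block = alignments[gene]
        length = len(next(iter(block.values())))
        if any(len(seq) != length for seq in block.values()):
            raise ValueError(f"{gene}: sequences differ in length")
        for sp in species:
            pieces[sp].append(block.get(sp, "-" * length))
        partitions[gene] = (start, start + length - 1)
        start += length
    supermatrix = {sp: "".join(parts) for sp, parts in pieces.items()}
    return supermatrix, partitions


# Toy data (hypothetical sequences, not from the study).
toy = {
    "geneA": {"sp1": "ACGT", "sp2": "ACGA"},
    "geneB": {"sp1": "GG", "sp3": "GC"},
}
matrix, parts = concatenate(toy, ["geneA", "geneB"])
total_positions = sum(PARTITION_LENGTHS.values())
```

Summing the reported partition lengths reproduces the 7806 positions of the final alignment, which is a quick consistency check on the concatenation.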

# RESULTS

# BAC Clone Sequencing and Annotation

Of the 14 BAC clones analyzed, nine BAC clones were sequenced with a total of 80 genes annotated, and the other five BAC clones had been sequenced previously (Ponce et al., 2011; García-Cegarra et al., 2013; Merlo et al., 2017; García-Angulo et al., 2018). In total, 109 genes were annotated and 24 of the 109 genes (22.02%) were found to be related to the immune system (**Table 1**).

# Cytogenetic Mapping

Using mFISH, the 14 BAC clones were localized on six different chromosome pairs (**Figures 1**, **2**). BAC clones hmgb2, b2m, irf5, and trim25 co-localized in a metacentric chromosome pair. The BAC clones tlr3 and lysg co-localized in one acrocentric chromosome pair. The BAC clones tap1 and tnfα co-localized in a second acrocentric chromosome pair; and, lastly, the BAC clones il10, mx and trim16 co-localized in a third acrocentric chromosome pair. In contrast, the BAC clone calr showed a signal on the largest metacentric chromosome, different from that in which the genes hmgb2, b2m, irf5, and trim25 were co-located. The BAC clone tlr8 showed a signal in two different chromosome pairs: a stronger signal in a submetacentric pair and a weaker signal in an acrocentric chromosome pair. Only the most intense signal was considered.
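As a small bookkeeping sketch, the mFISH assignments above can be tabulated and grouped by chromosome pair. The pair labels are descriptive placeholders chosen here, not official chromosome numbers, and the two calr BAC clones are assumed to share a single entry.

```python
from collections import defaultdict

# BAC clone -> chromosome pair, transcribed from the description above.
# Labels such as "acrocentric_1" are placeholders (assumption).
BAC_TO_PAIR = {
    "hmgb2": "metacentric",
    "b2m": "metacentric",
    "irf5": "metacentric",
    "trim25": "metacentric",
    "tlr3": "acrocentric_1",
    "lysg": "acrocentric_1",
    "tap1": "acrocentric_2",
    "tnfa": "acrocentric_2",
    "il10": "acrocentric_3",
    "mx": "acrocentric_3",
    "trim16": "acrocentric_3",
    "calr": "largest_metacentric",
    "tlr8": "submetacentric",  # only the stronger signal is counted
}


def group_by_pair(bac_to_pair):
    """Invert the mapping: chromosome pair -> sorted list of BAC clones."""
    groups = defaultdict(list)
    for bac, pair in bac_to_pair.items():
        groups[pair].append(bac)
    return {pair: sorted(bacs) for pair, bacs in groups.items()}


groups = group_by_pair(BAC_TO_PAIR)
# Pairs that carry more than one BAC clone, i.e. co-localized groups.
colocalized = {pair: bacs for pair, bacs in groups.items() if len(bacs) > 1}

for pair, bacs in groups.items():
    print(f"{pair}: {', '.join(bacs)}")
```

Grouping this way gives six chromosome pairs, four of which carry more than one BAC clone, in line with the counts stated in the text.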

# Comparative Mapping

The genes annotated in Senegalese sole were found in eleven chromosomes in C. semilaevis, S. maximus and O. niloticus, in nineteen chromosomes in D. rerio, and in eleven chromosomes and five scaffolds in G. aculeatus (**Figures 3**, **4** and **Supplementary Figures S1–S3**). The closest species was C. semilaevis: 96.36% of the genes of S. senegalensis were found in seven chromosomes of C. semilaevis (**Figure 3**). Moreover, D. rerio was the species with the largest number of rearrangements (**Supplementary Figure S1**).

Considering the comparison with C. semilaevis, large clusters of genes were conserved with S. senegalensis. Hence, chromosome 2 of Senegalese sole seems to correspond mainly with chromosomes 1 and 8 of C. semilaevis; chromosome 4 with chromosomes 8 and 16; and chromosome 19 with chromosomes 11 and 20 (**Figure 3**). S. maximus also presented large conserved regions with respect to S. senegalensis; chromosomes 1, 13 and 19 of Senegalese sole seem to correspond to chromosomes 7, 8 and 6 of S. maximus, respectively (**Figure 4**). The comparison with O. niloticus, G. aculeatus and D. rerio showed more gene rearrangements (**Supplementary Figures S1–S3**).

Several gene co-localizations observed in S. senegalensis also appeared in other species, as in the cases of il10-mx-trim16 in S. maximus and O. niloticus, tlr7-tlr8 in all species, tlr3-lysg in C. semilaevis, S. maximus and G. aculeatus, tnfα-tap1 in C. semilaevis, O. niloticus and G. aculeatus, trim25-hmgb2 in C. semilaevis and G. aculeatus, b2m-irf5 in C. semilaevis and D. rerio, b2ml-irf5 in O. niloticus, and hmgb2-irf5 in S. maximus.
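The co-localizations listed above can be tallied per species with a short sketch. It assumes that "all species" refers to the five species used in the synteny analysis; the function names are illustrative only.

```python
# The five species of the synteny analysis (assumed meaning of "all species").
SPECIES = ["C. semilaevis", "S. maximus", "O. niloticus",
           "G. aculeatus", "D. rerio"]

# Gene group -> species in which the group is also co-localized,
# transcribed from the paragraph above.
COLOCALIZED_IN = {
    "il10-mx-trim16": {"S. maximus", "O. niloticus"},
    "tlr7-tlr8": set(SPECIES),
    "tlr3-lysg": {"C. semilaevis", "S. maximus", "G. aculeatus"},
    "tnfa-tap1": {"C. semilaevis", "O. niloticus", "G. aculeatus"},
    "trim25-hmgb2": {"C. semilaevis", "G. aculeatus"},
    "b2m-irf5": {"C. semilaevis", "D. rerio"},
    "b2ml-irf5": {"O. niloticus"},
    "hmgb2-irf5": {"S. maximus"},
}


def shared_groups_per_species(colocalized_in, species):
    """Return species -> sorted list of gene groups co-localized in it."""
    return {
        sp: sorted(g for g, present in colocalized_in.items() if sp in present)
        for sp in species
    }


def universally_conserved(colocalized_in, species):
    """Return the gene groups co-localized in every listed species."""
    wanted = set(species)
    return sorted(g for g, present in colocalized_in.items()
                  if wanted <= present)


shared = shared_groups_per_species(COLOCALIZED_IN, SPECIES)
conserved_everywhere = universally_conserved(COLOCALIZED_IN, SPECIES)

for sp in SPECIES:
    print(f"{sp}: {len(shared[sp])} shared groups")
```

Under these assumptions, C. semilaevis shares the most groups with S. senegalensis and D. rerio the fewest, which agrees with the ordering of similarity described in the comparative mapping results.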

# Micro-Synteny

The micro-synteny analysis showed that many candidate genes have conserved genomic surroundings, and that, among the genomic regions analyzed, C. semilaevis is the species with the greatest homology. The regions surrounding genes il10, tlr3, tlr8, nlrc3, and calr were highly conserved in all species (**Figures 5a,b,k,e,i**). The micro-synteny of BAC clone b2m showed that gene b2m presented one paralog gene (b2ml) in all the species analyzed apart from D. rerio. Curiously, in O. niloticus the paralog gene was present but not the candidate gene (**Figure 5f**). The largest number of gene rearrangements within the same genomic structure was observed in the micro-synteny of BAC clone hmgb2 (**Figure 5c**).

In some of the BAC clones analyzed, other genes whose function is involved in the immune system were also found. In the BAC clone il10, the genes il19 and ikbke appeared in the same region (**Figure 5a**). In the BAC clone tap1, the gene hla-drb1 was found (**Figure 5g**). In the BAC clone tnfα, the gene pycard was observed (**Figure 5j**). In the BAC clone tlr8, the genes tlr7 and nlrc3 appeared (**Figures 5e,k**). In the BAC clone lysg, the gene pgrp2 was found (**Figure 5d**). In the BAC clone b2m, the genes b2m, b2ml and irak3 appeared (**Figure 5f**). Lastly, in the BAC clone trim25, genes c1q and btr12 were found (**Figure 5h**). These genomic architectures were not found in all species: only the group of genes il10-il19-ikbke (**Figure 5a**) and the tandem tlr8-tlr7 were preserved among all the species analyzed (**Figure 5k**).

FIGURE 1 | Results of mFISH of the BACs isolated with the following candidate genes: (a) tlr8 (blue), lysg (green), tap1 (pink), tnfα (orange); (b) irf5 (blue), calr (green), b2m (red); (c) trim16 (green), mx (orange), calr (pink); (d) irf5 (blue), tlr3 (green), tnfα (orange), mx (pink); (e) hmgb2 (blue), lysg (green), b2m (pink); (f) trim25 (blue), tnfα (green), tap1 (orange), b2m (pink); (g) il10 (blue), tnfα (green), tap1 (orange), b2m (pink). In those cases where two or more probes co-localize in one chromosome, a diagrammatic representation is included.

# Phylogenetic Analysis

To carry out the phylogenetic analysis, up to five candidate genes were concatenated (**Supplementary Data Sheet S1**). The resulting alignment had 7806 positions, where 5808 and 4975 were variable and parsimony-informative, respectively. The nucleotide frequencies were similar: f(A) = 0.26617; f(C) = 0.27001; f(G) = 0.24585; f(T) = 0.21796; and GTR relative rate parameters were A↔C 1.30911; A↔G 3.60083; A↔T 1.25710; C↔G 1.05237; C↔T 4.73877; G↔T 1.00000, with the proportion of invariable sites at 0.191. Results of substitution saturation tests (Xia et al., 2003) for each gene and for the concatenated alignment did not show any significant saturation, since ISS indices were lower than ISS.C values in all cases (**Supplementary Data Sheet S2**). The phylogenetic tree showed a good resolution and robust branch support. The phylogeny clearly separated two main clusters: ray-finned fishes and tetrapods (**Figure 6**). The cluster of tetrapods was divided into mammals and reptiles, and the cluster of ray-finned fishes was divided into Holostei and Teleostei. Within the group of teleosts, S. senegalensis appeared included in a subgroup together with the species C. semilaevis, P. olivaceus, S. maximus, Lates calcarifer and S. dumerili. All these species belong to the Carangaria group.
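The reported base frequencies and relative rates are enough to write down the instantaneous rate matrix of the GTR model. The sketch below uses the standard GTR construction (off-diagonal entry = exchangeability × frequency of the target base), scaled so that the expected substitution rate is 1. It ignores the +I and +G components, and the function names are illustrative rather than taken from PhyML.

```python
BASES = "ACGT"

# Estimates reported in the text.
FREQS = {"A": 0.26617, "C": 0.27001, "G": 0.24585, "T": 0.21796}
RATES = {
    ("A", "C"): 1.30911, ("A", "G"): 3.60083, ("A", "T"): 1.25710,
    ("C", "G"): 1.05237, ("C", "T"): 4.73877, ("G", "T"): 1.00000,
}


def exchangeability(x, y):
    """Symmetric relative rate between bases x and y."""
    return RATES[(x, y)] if (x, y) in RATES else RATES[(y, x)]


def gtr_rate_matrix(freqs, normalize=True):
    """Build the GTR rate matrix Q as a nested dict q[from][to]."""
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = exchangeability(x, y) * freqs[y]
        # Diagonal entries make every row sum to zero.
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    if normalize:
        # Scale so that the expected rate at equilibrium equals 1.
        mu = -sum(freqs[x] * q[x][x] for x in BASES)
        for x in BASES:
            for y in BASES:
                q[x][y] /= mu
    return q


Q = gtr_rate_matrix(FREQS)
for x in BASES:
    print(x, " ".join(f"{Q[x][y]: .4f}" for y in BASES))
```

In the resulting matrix the two transition types (A↔G and C↔T) have the largest off-diagonal entries, reflecting the higher relative rates reported for them.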

# DISCUSSION

# Annotation of Immune-Related Genes

From the BAC library of S. senegalensis, 16 genes related to the immune system have been obtained. These genes, together with those already annotated, make a total of 24 relevant genes (**Table 1**). The use of BAC libraries has proven to be helpful for characterizing the genome in different species, including the fish species Larimichthys crocea (Ao et al., 2015), S. maximus (Taboada et al., 2014), Ictalurus punctatus (Xu et al., 2007), Oncorhynchus mykiss (Palti et al., 2009), and bivalve species (Zhang et al., 2011; Cross et al., 2018).

Seventeen of 24 of the annotated genes related to the immune system are part of the NF-κB signaling pathway or JAK-STAT signaling pathway, two routes for the immune response. Up to twelve genes have been annotated for the first pathway, including the genes tlr, nlrc3, lysg, pgrp2, il, tnfα, irak3, ikbke, and pycard.

Toll-like receptors (tlr) are a class of pattern recognition receptors (PRR) and their function is to recognize microbial pathogens. In fish species, the tlr genes exhibit distinctive features and large diversity; these differences are probably derived from the diverse evolutionary history of this group and the distinctive environments that they occupy (Palti, 2011). This feature makes genes of this kind good candidates for improving the immune response of aquaculture stocks. The tlr8 gene was found to be composed of only one exon in S. senegalensis, as also occurs in C. semilaevis and G. aculeatus. However, in S. maximus, S. dumerili and D. rerio, tlr8 is composed of two exons and one intron, which could indicate a process of intron loss in different lineages during teleost evolution. Another class of PRR, the gene nlrc3, has been annotated within the same BAC clone as genes tlr7 and tlr8. Whereas TLR proteins are extracellular PRRs that recognize extracellular PAMPs, NLR proteins are intracellular PRRs that recognize intracellular PAMPs (Sha et al., 2009). Two additional PRRs, the genes lysg and pgrp2, were annotated in another BAC clone. Lysozyme is a conserved molecule in teleosts (Sun et al., 2006) and is an important enzyme of the innate immune response to bacterial infection. Many factors, such as stress and infection, sexual maturity, nutrition, toxic substances, and others, have been studied in relation to the activity levels of lysozyme in fish (Saurabh and Sahoo, 2008). The PGRP2 protein, like lysozyme, also recognizes and hydrolyzes the peptidoglycan layer (Choi et al., 2019).

BAC clone nlrc3; (f) BAC clone b2m; (g) BAC clone tap1; (h) BAC clone trim25; (i) BAC clone calr; (j) BAC clone tnfα; (k) BAC clone tlr8.

The remaining genes of the NF-κB signaling pathway do not participate as PRRs. Instead, they function downstream in the pathway. The genes il10 and il19 encode two members of the IL-10 cytokine family, which includes IL-10, IL-19, IL-20, IL-22, IL-24, and IL-26 (Lutfalla et al., 2003). The IL-10 cytokine plays an important role as an anti-inflammatory agent in the innate and adaptive immune system, and IL-19 is a pro-inflammatory cytokine of the innate immune system (Hofmann et al., 2012). The il19 gene of S. senegalensis showed a conserved structure across different teleost species, since it is structured in 5 exons and 4 introns; the exception is G. aculeatus, which shows an additional exon. Tnfα, like il10 and il19, encodes a cytokine that is secreted by activated immune-related cells upon induction by various pathogens, whether parasitic, bacterial or viral (Salazar-Mather and Hokeness, 2006). The irak3 gene encodes an interleukin-1 receptor-associated kinase, which, in the orange-spotted grouper (Epinephelus coioides), has been shown to induce NF-κB activation through TLR signaling (Li et al., 2018). It has been reported in humans that ikbke and pycard genes act as positive and negative regulators, respectively, of NF-κB activation. In addition, ikbke also participates in the JAK-STAT signaling pathway (Sarkar et al., 2006; Liu et al., 2019).

Aside from ikbke, another five annotated genes participate in the JAK-STAT signaling pathway of the immune response. Interferons (IFN) are another type of cytokine involved in key aspects of the host defense mechanisms (Reboul et al., 1999). Interferon regulatory factors (IRF) were originally identified as transcription factors participating in the regulation of interferon expression (Mamane et al., 1999). It is known that, in C. semilaevis, the irf5 gene may play a role in the immune defense, primarily against intracellular pathogens (Zhang et al., 2015). The irf5 gene structure, composed of 8 exons and 7 introns, was conserved across representative species of the Pleuronectiformes order and in O. niloticus and G. aculeatus. Nonetheless, the two representative species of the Seriola genus showed a derived structure composed of 6 exons and 5 introns, but more analyses need to be done in order to ascertain whether this structure is plesiomorphic within the Carangidae family. D. rerio also showed a derived structure composed of 9 exons and 8 introns, which could be representative of such an ancient lineage of teleosts. Trim16, trim25 and btr12 genes present orthologs in mammals, but btr genes come from the trim39 gene of mammals. Both trim16 and trim25 are considered the genes from which the so-called fintrims, specific to teleosts, diverged (van der Aa et al., 2009). These three genes belong to the class IV subgroup of TRIM proteins, which are involved in antiviral immunity of the IFN signaling cascade (Boudinot et al., 2011). Two different types of Mx were observed in the European seabass (Dicentrarchus labrax), and both showed antiviral activity, but with different intensity and spatial and temporal patterns (Novel et al., 2013).

Calreticulin is a calcium-binding protein with an important role in the assembly and expression of Major Histocompatibility Complex (MHC) class I molecules (Raghavan et al., 2013). Two calreticulin types that may function as a PRR have been described in the flatfish C. semilaevis, thus demonstrating the antiviral and antibacterial activity of that molecule (Wang et al., 2018).

Regarding the MHC molecules, four additional genes have been annotated. MHC molecules are members of the immunoglobulin superfamily that present pathogen peptides of infected cells and thus initiate the generation of adaptive immunity to pathogens (Zhu et al., 2013). The b2m gene encodes the beta subunit of the MHC class I, and different paralog copies have been observed in several fish species (Kondo et al., 2010; Sun et al., 2015). Two types of b2m genes have been found in S. senegalensis, adjacent to each other. This can also be observed in genome databases of fish species, in which the two adjacent copies are annotated as b2m and b2ml. The two genes presented different gene structures in S. senegalensis: in b2m, two exons and one intron were identified, and in b2ml, 3 exons and 2 introns. The structure of b2ml was conserved across the other representative teleost species, but not the structure of the b2m gene, which is composed of 3 exons and 2 introns in those species. An expression analysis would be needed to determine whether this gene is undergoing a pseudogenization process. The tap1 gene plays an essential role in the MHC class I antigen presentation pathway, transporting peptides from the cytosol to the lumen of the endoplasmic reticulum (ER), where the peptides are loaded onto MHC class I (Pinto et al., 2011). In the same BAC clone where the tap1 gene is located, another MHC-related gene, hla-drb1, was found, which encodes the beta subunit of the MHC class II. In primates, five different families of MHC class II genes, including the DR family, have been described (Satta et al., 1996).

The c1q gene belongs to the C1 component of the complement system pathway, and binds immunoglobulins attached to pathogen surfaces; it subsequently activates a complement cascade that culminates in the elimination of the infectious agents (Chen et al., 2018).

The hmgb2 gene belongs to a gene family that encodes nonhistone chromosomal proteins, and the HMGB2 protein has been demonstrated to display an antibacterial activity in fish due to its ability to bind pathogen DNA (Wang et al., 2019). The hmgb2 gene has 4 exons and 3 introns in S. senegalensis, a structure that is also observed in the other representative teleost species.

# Structural Genomics of the Immune System

The 24 annotated genes related to the immune system are distributed in 28.571% of the chromosome complement of S. senegalensis. This was the lowest value in comparison with those of C. semilaevis (33.333%), O. niloticus (40.909%), S. maximus (45.455%), G. aculeatus (47.619%), and D. rerio (52%). These data clearly show a grouping tendency of immune system genes in the two Soleoidei species, and in S. senegalensis in particular. Such grouping could be a consequence of the reduction and compaction trend observed in the Pleuronectiformes evolution (Cerdá and Manchado, 2013). Moreover, more than 50% of the genes studied fell into two chromosome pairs, which could reflect a certain degree of chromosome specialization in order to facilitate the immune response of the organism. Altogether, non-random proximity patterns could be formed in a way that provides functional advantages in the genomic architecture (Cremer and Cremer, 2010). A study carried out in 2008 indicated that the selection of groups of genes of the immune system could have been an important factor that affected the reorganization of the vertebrate genome (Makino and McLysaght, 2008). However, further analysis including new immune-related genes will be required to prove this hypothesis.
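As a check on the first of these figures, the percentage can be reproduced from the six chromosome pairs reported in the Conclusion, assuming a haploid complement of 21 chromosome pairs (2n = 42) for S. senegalensis; the karyotype is not stated in this section and is taken here as an assumption:

$$\frac{6}{21} \times 100 \approx 28.571\%$$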

The hybridizations agree with the results obtained for the BAC clones tlr8, calr, lysg and mx that had already been located (Ponce et al., 2011; García-Cegarra et al., 2013; Merlo et al., 2017; García-Angulo et al., 2018). In our study, candidate genes tlr3 and tlr8 were found in different chromosomes. This result has also been observed in all the species studied and in other Pleuronectiformes species such as P. olivaceus (Hwang et al., 2011). The tlr7 and tlr8 locus is highly conserved in vertebrates and these genes are located together in the chromosomes of mammals, birds and fishes (Rauta et al., 2014).

The gene nlrc3 was located within the BAC clone tlr8, but, in all the species analyzed, the gene nlrc3 was found in a chromosome different from the genes tlr7 and tlr8. However, in Tetraodontiformes species (Tetraodon nigroviridis and Takifugu rubripes) the genes tlr7 and nlrc3 were in the same chromosome, which may be due to the compact structure of the genome in the Tetraodontiformes species, since these species have genomes that are among the most compact known in vertebrates (Jaillon et al., 2004). As mentioned before, this same trend could have taken place in the Pleuronectiformes evolution.

As can be deduced from the fish genome databases, the gene tlr3 is linked with genes lysg and pgrp2, and this linkage is conserved throughout many bony fishes, thus representing a genomic cluster that evolves together. As discussed before, the three genes share similar functions, so the linkage could represent an advantage for a more effective immune response. It has been postulated that a conserved group of genes could indicate a functional cluster (Overbeek et al., 1999). The calreticulin BAC clone appears as a region strongly conserved in all the species studied at the level of macro-synteny and micro-synteny, so it could represent another functional cluster. This same situation has been observed with the genes belonging to BAC clone il10, which show a highly conserved synteny in all the species analyzed, including the il19 gene and, in some species, the mx gene. The linkage between il10 and il19 has been described in other fish species (Lutfalla et al., 2003). Another example of conserved linkage among fish species is that of the tap1 and tnfα genes. In S. senegalensis the irf5 gene co-localized with the genes b2m, trim25, and hmgb2 in one metacentric chromosome. Interestingly, this result was not observed in any other of the species studied, although a tendency for these genes to co-localize in pairs was observed.

The results also show that large parts of the genomic regions tend to be conserved in the species most closely related to S. senegalensis, such as the Pleuronectiformes C. semilaevis and S. maximus, with C. semilaevis being the species that presents the most homology between the genomic regions analyzed. Regardless of the genetic distance between the species, a large conserved region of genes present in chromosome 2 of S. senegalensis was found in only one chromosome in the rest of the species analyzed. An exception was observed in stickleback; this is because part of the stickleback genome is assembled at the scaffold level.

The results of micro-synteny revealed a paralog of b2m in most of the species analyzed. It is clear that gene duplication and subsequent diversification have played a major role in the evolution of diversity in molecules of the MHC (Andersson, 1996). In studies carried out in bivalve species such as Crassostrea gigas, the expansion of several gene families related to defense pathways, including protein folding, oxidation and antioxidation, apoptosis and immune responses, has been observed (Wang et al., 2012).

# Immune System Phylogenetics

The phylogeny results clearly separated two main clusters: ray-finned fishes and tetrapods. The coelacanth was used to root the tree and, as expected, the genetic sequence analyzed is closer to the tetrapods than to the actinopterygian fish. The innate immune system is phylogenetically older than the adaptive system, and it is found, in some form, in all multicellular organisms, whereas the adaptive system is found in all vertebrates except jawless animals (Zarkadis et al., 2001). The concatenated gene sequences used in the phylogenetic analysis belong to both the innate and adaptive immune systems, so it was not possible to set an invertebrate as an outgroup species. However, several studies indicate that the coelacanth provides the ideal outgroup sequence against which tetrapod genomes can be measured (Noonan et al., 2004).

The cluster of tetrapods is divided into mammals and reptiles; the cluster of ray-finned fishes is divided into holostei and teleostei. Ray-finned fishes (Actinopterygii) diverged from the lineage of lobe-finned fishes (Sarcopterygii) about 450 million years ago (Christoffels et al., 2004). The actinopterygians, in turn, are divided into two main groups: holostei, the only representative of which in this tree is Lepisosteus oculatus; and teleostei, the group in which the other fish analyzed are included. The lineage of L. oculatus appeared in the Paleozoic era, and the first fossil record dates from the Late Permian period. The genome of L. oculatus has a very low evolution rate and it was sequenced to help connect teleost biomedicine to human biology because its lineage represents the unduplicated sister group of teleosts (Anderson et al., 2012).

As these results show, within the clade of the teleosts, the phylogenetic tree is divided into three main groups. In the first group (superorder Ostariophysi), Astyanax mexicanus and D. rerio belong to the cohort Otomorpha and are separated from the cohort Euteleosteomorpha (the remaining ray-finned fishes). In the second group (superorder Protacanthopterygii), Esox lucius belongs to the order Esociformes and Salmo salar belongs to the order Salmoniformes. The third group (superorder Acanthopterygii) is composed of several subgroups: Ovalentaria, Carangaria, and Eupercaria.

Similar clustering, with some exceptions, is observed in a phylogeny based on concatenated sequences related to sex determination and reproduction (Portela-Bens et al., 2017). However, other phylogenies made with one immune system-related sequence show unexpected fish-species groupings, like c1q (Zeng et al., 2015), and hmgb2 (Wang et al., 2019). It was established that, for phylogenetics, the concatenation approach yields more accurate trees, even when the different concatenated sequences evolve with different substitution patterns (Gadagkar et al., 2005). The result presented here represents a novel phylogeny based on the concatenation of several immune-related genes of fishes. Moreover, the fish immune system has contributed significantly to a better understanding of the evolutionary history of the immune system (Rauta et al., 2012).

# CONCLUSION

Based on the results obtained, it can be concluded that the immune system genes studied tend to be grouped together in the genome of S. senegalensis, as the 24 immune-related genes annotated were located in only six chromosome pairs. The second conclusion is that S. senegalensis, and the Soleoidei suborder in general, show a higher degree of grouping in the immune-related genes, which could represent an evolutionary advantage. In addition, it seems that large parts of these genomic regions tend to be conserved in the species most closely related to S. senegalensis, particularly the Pleuronectiformes C. semilaevis and S. maximus; and that the rate of variability of the immune system genes studied is not high.

# ETHICS STATEMENT

The experimental procedures were carried out in accordance with the recommendations of the University of Cádiz (Spain) for the use of laboratory animals and the Guidelines of the European Union Council (86/609/EU).

# AUTHOR CONTRIBUTIONS

AG-A and TL carried out the multiple FISH. AG-A drafted the manuscript. MR and SP-B isolated the BACs and carried out the bioinformatic analysis. MM constructed the phylogenetic tree and helped with the discussion and drafting. LR conceived and coordinated the study, participated in its design, discussed the results and corrected the manuscript. All authors read and approved the manuscript.

# FUNDING

This study has been supported by the Spanish Ministerio de Ciencia e Innovación MICINN-FEDER (AGL2014-51860-C2-1-P and RTI2018-096847-B-C21) and the Instituto de Investigaciones Marinas (INMAR-UCA). AG-A received a fellowship from the UCA.

# ACKNOWLEDGMENTS


We express our gratitude to the Ensembl team and the Genomicus Browser curators for providing a browser and database that greatly facilitated the use and analyses of the BACs used in this work. We also thank Dr. Manuel Manchado of the IFAPA (El Toruño, Cádiz, Spain) for sharing the BAC library, Mr. Emilio García for technical support with the cultures and chromosomal preparations, and Dr. Alberto Arias and Mr. Royston Snart for a critical reading of the manuscript.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00529/full#supplementary-material

FIGURE S1 | Circos analysis in the species D. rerio. On the left side the distribution of the BAC clones of Senegalese sole distributed in chromosomes can be observed. Each BAC clone is represented in a different color. The genes found by annotation are indicated within each BAC clone and their corresponding localization in the D. rerio chromosomes are denoted by crossing lines. The BAC clones analyzed are given in Table 1.

FIGURE S2 | Circos analysis in the species O. niloticus. On the left side the distribution of the BAC clones of Senegalese sole distributed in chromosomes can be observed. Each BAC clone is represented in a different color. The genes found by annotation are indicated within each BAC clone and their corresponding localization in the O. niloticus chromosomes are denoted by crossing lines. The clones analyzed are given in Table 1.

FIGURE S3 | Circos analysis in the species G. aculeatus. On the left side the distribution of the BAC clones of Senegalese sole distributed in chromosomes can be observed. Each BAC clone is represented in a different color. The genes found by annotation are indicated within each BAC clone and their corresponding localization in the G. aculeatus chromosomes are denoted by crossing lines. The BAC clones analyzed are given in Table 1.

DATA SHEET S1 | Sequences used for the construction of the phylogenetic tree.

DATA SHEET S2 | Saturation indices and plots with transitions (s) and transversions (v) for each gene and in the concatenated sequence.

# REFERENCES

duplication early during the evolution of ray-finned fishes. Mol. Biol. Evol. 21, 1146–1151. doi: 10.1093/molbev/msh114


Ferraresso, S., Bonaldo, A., Parma, L., Buonocore, F., Scapigliati, G., Gatta, P. P., et al. (2016). Ontogenetic onset of immune-relevant genes in the common sole (Solea solea). Fish Shellfish Immunol. 57, 278–292. doi: 10.1016/j.fsi.2016.08.044




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 García-Angulo, Merlo, Rodríguez, Portela-Bens, Liehr and Rebordinos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Optimizing Genomic Prediction of Host Resistance to Koi Herpesvirus Disease in Carp

Christos Palaiokostas<sup>1,2</sup>\*, Tomas Vesely<sup>3</sup>, Martin Kocour<sup>4</sup>, Martin Prchal<sup>4</sup>, Dagmar Pokorova<sup>3</sup>, Veronika Piackova<sup>4</sup>, Lubomir Pojezdal<sup>3</sup> and Ross D. Houston<sup>1</sup>\*

<sup>1</sup> Royal (Dick) School of Veterinary Studies, The Roslin Institute, The University of Edinburgh, Midlothian, United Kingdom, <sup>2</sup> Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden, <sup>3</sup> Veterinary Research Institute, Brno, Czechia, <sup>4</sup> Faculty of Fisheries and Protection of Waters, South Bohemian Research Centre of Aquaculture and Biodiversity of Hydrocenoses, University of South Bohemia in České Budějovice, Vodňany, Czechia

### Edited by:

Johann Sölkner, University of Natural Resources and Life Sciences Vienna, Austria

### Reviewed by:

Wanchang Zhang, Nanchang University, China; Paulino Martínez, University of Santiago de Compostela, Spain

### \*Correspondence:

Christos Palaiokostas, christos.palaiokostas@slu.se; Ross D. Houston, ross.houston@roslin.ed.ac.uk

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 03 December 2018; Accepted: 22 May 2019; Published: 12 June 2019

### Citation:

Palaiokostas C, Vesely T, Kocour M, Prchal M, Pokorova D, Piackova V, Pojezdal L and Houston RD (2019) Optimizing Genomic Prediction of Host Resistance to Koi Herpesvirus Disease in Carp. Front. Genet. 10:543. doi: 10.3389/fgene.2019.00543

Genomic selection (GS) is increasingly applied in breeding programs of major aquaculture species, enabling improved prediction accuracy and genetic gain compared to pedigree-based approaches. Koi Herpesvirus disease (KHVD) is notifiable by the World Organization for Animal Health and the European Union, causing major economic losses to carp production. GS has potential to breed carp with improved resistance to KHVD, thereby contributing to disease control. In the current study, Restriction-site Associated DNA sequencing (RAD-seq) was applied on a population of 1,425 common carp juveniles which had been challenged with Koi herpes virus, followed by sampling of survivors and mortalities. GS was tested on a wide range of scenarios by varying both SNP densities and the genetic relationships between training and validation sets. The accuracy of correctly identifying KHVD resistant animals using GS was between 8 and 18% higher than pedigree best linear unbiased predictor (pBLUP) depending on the tested scenario. Furthermore, minor decreases in prediction accuracy were observed with decreased SNP density. However, the genetic relationship between the training and validation sets was a key factor in the efficacy of genomic prediction of KHVD resistance in carp, with substantially lower prediction accuracy when the training and validation sets did not contain close relatives.

### Keywords: KHVD, carp, RAD-seq, genomic selection, aquaculture breeding

# INTRODUCTION

Genomic selection (GS) has become a cornerstone of genetic improvement in both plant and livestock breeding, enabling improved prediction accuracy, control of inbreeding, and (in some cases) reduction in generation interval compared to traditional pedigree-based approaches (Meuwissen et al., 2016; Hickey et al., 2017). The landmark paper of Meuwissen et al. (2001) highlighted the concept of breeding value prediction based on the joint merit of all markers distributed throughout the genome, and the advent of high-throughput DNA sequencing and development of SNP arrays in the subsequent decade made this concept a practical reality. While the application of genomics in aquaculture breeding has traditionally lagged behind the plant and terrestrial livestock sector, it is gaining momentum with reference genome assemblies and SNP arrays now available for most of the key aquaculture species (Robledo et al., 2017; Yue and Wang, 2017). Both simulation and empirical studies suggest that considerable improvement in breeding value prediction accuracy is plausible, even with relatively modest SNP marker densities (Sonesson and Meuwissen, 2009; Lillehammer et al., 2013; Ødegård et al., 2014; Tsai et al., 2015; Correa et al., 2017; Vallejo et al., 2017, 2018; Robledo et al., 2018).

Infectious diseases present a major and persistent threat to sustainable aquaculture production, and breeding for improved host resistance is an increasingly important component of mitigation (Houston et al., 2017). Common carp (Cyprinus carpio) is one of the world's most important freshwater aquaculture species, particularly in Asia and Europe. However, koi herpesvirus disease (KHVD), also known as Cyprinid herpesvirus-3 (CyHV-3) disease, is a major threat to carp farming and is listed as a notifiable disease by the European Union (Taylor et al., 2010) and the World Organization for Animal Health (OIE, 2018). Encouragingly, resistance to KHVD has been shown to be a highly heritable trait with estimates ranging between 0.50 and 0.79 (Ødegård et al., 2010; Palaiokostas et al., 2018a). The potential of selective breeding for improved KHVD resistance in carp (utilizing information from challenge trials) has been illustrated by several studies which demonstrated large variation in survival both between families (Dixon et al., 2009; Tadmor-Levi et al., 2017) and between strains (Shapira et al., 2005; Piačková et al., 2013). Further, a significant QTL associated with resistance to KHVD has been identified (Palaiokostas et al., 2018a). Nevertheless, the potential of GS for improving KHVD resistance in carp has not yet been studied.

While SNP arrays are available for several aquaculture species, and are commonly used in some of the most advanced commercial breeding programs (e.g., Atlantic salmon), they tend to be relatively expensive and can suffer from ascertainment bias (Robledo et al., 2017). Genotyping-by-sequencing technologies, such as RAD-seq (Baird et al., 2008) and its subsequent variants, have also been effective in studying complex traits such as disease resistance in aquaculture species, and in testing GS (Vallejo et al., 2016; Barría et al., 2018; Palaiokostas et al., 2018b; Aslam et al., 2018). Disease resistance is particularly amenable to GS because it typically cannot be recorded on the selection candidates themselves (Yáñez et al., 2014) and is instead measured on their close relatives (e.g., full siblings) in aquaculture breeding programs (Gjedrem and Rye, 2016). While effective, current GS methods in aquaculture have two main limitations: (i) genotyping is typically expensive, partly because of the high-density marker genotyping, and (ii) the accuracy of prediction drops rapidly when the genetic relationship between the training and validation populations decreases (e.g., Tsai et al., 2016).

Family-based breeding programs are at a formative stage in common carp, including a program focused on the Amur mirror carp breed in Europe (Prchal et al., 2018a,b), where improvement of disease resistance is a major breeding goal. The main aim of the current study was to investigate the potential of GS to predict host resistance to KHVD in common carp using genome-wide SNP markers generated by RAD sequencing. An additional aim was to investigate the importance of SNP marker density in genomic prediction accuracy, with a view to future low-density SNP panels for cost-effective GS. Finally, the impact of genetic relationship between the training and validation sets was assessed by comparing prediction accuracy in groups of closely and distantly related fish.

# MATERIALS AND METHODS

# Population Origin and Disease Challenge

The origin of the samples and the details of the disease challenge experiment have been fully described previously (Palaiokostas et al., 2018a). In brief, the study was performed on a population of Amur mirror carp that was created at the University of South Bohemia in České Budějovice, Czech Republic in May 2014 using an artificial insemination method (Vandeputte et al., 2004). The population was the result of four factorial crosses of five dams × ten sires (20 dams and 40 sires in total). A cohabitation KHV challenge was performed on randomly sampled progeny of these crosses. Mortality of individual fish was recorded for a period of 35 days post infection (dpi), by which stage the mortality level had returned to baseline. In total, phenotypic records regarding survival/mortality were documented for 1,425 animals. Presence of KHV in a sample of dead fish (n = 100) was confirmed by PCR according to guidelines by the Centre for Environment, Fisheries and Aquaculture Science, United Kingdom (Cefas) (Pokorova et al., 2010). The entire experiment was conducted in accordance with the law on the protection of animals against cruelty (Act no. 246/1992 Coll. of the Czech Republic) upon its approval by the Institutional Animal Care and Use Committee (IACUC).

# RAD Sequencing and Parentage Assignment

The RAD library preparation protocol followed the methodology originally described in Baird et al. (2008), presented in detail in Palaiokostas et al. (2018c). In brief, RAD libraries were sequenced by BMR Genomics (Padova, Italy) in fourteen lanes of an Illumina NextSeq 500, using 75 base paired-end reads (v2 chemistry). Reads missing the restriction site, reads with ambiguous barcodes and PCR duplicates were identified and discarded using the Stacks v2.0 software (Catchen et al., 2011). The remaining reads were aligned to the common carp reference genome assembly version GCA\_000951615.2 (Xu et al., 2014) using bowtie2 (Langmead and Salzberg, 2012). Uniquely aligned reads were retained for downstream analysis. The aligned reads were sorted into RAD loci and SNPs were identified from both P1 and P2 reads using the Stacks software v2.0 (Catchen et al., 2011). In contrast to our previous study (Palaiokostas et al., 2018b), variant calling in Stacks v2.0 and above utilizes information from both P1 and P2 ends, whereas prior versions used only P1 ends. SNPs were detected using gstacks (--var-alpha 0.001 --gt-alpha 0.001 --min-mapq 40). Only a single SNP from each individual RAD locus was considered for downstream analysis to minimize the possibility of genotypic errors. SNPs with minor allele frequency (MAF) below 0.05 or with more than 25% missing data were discarded. The R package hsphase (Ferdosi et al., 2014) was used for parentage assignment allowing for a maximum genotyping error of 2%. The aligned reads in the format of bam files were deposited in the National Center for Biotechnology Information (NCBI) repository under project ID PRJNA414021.
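The MAF and missing-data filter described above can be illustrated with a minimal sketch. This is not the pipeline used in the study; the function name `filter_snps`, the 0/1/2 dosage coding with `None` for missing calls, and the handling of values exactly at the thresholds are assumptions made for illustration.

```python
def filter_snps(genotypes, min_maf=0.05, max_missing=0.25):
    """Return indices of SNPs (columns) passing the MAF and missingness filters.

    genotypes: list of rows, one per animal; each entry is an allele dosage
    (0, 1 or 2) or None for a missing call (hypothetical coding).
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    kept = []
    for j in range(n_snps):
        calls = [row[j] for row in genotypes if row[j] is not None]
        missing_rate = 1.0 - len(calls) / n_animals
        if not calls or missing_rate > max_missing:
            continue  # too many missing genotypes for this SNP
        freq = sum(calls) / (2.0 * len(calls))  # two alleles per diploid animal
        maf = min(freq, 1.0 - freq)
        if maf >= min_maf:
            kept.append(j)
    return kept


# Toy example: 4 animals x 4 SNPs.
toy = [
    [0, 0, 2, None],
    [1, 0, 2, None],
    [2, 0, 2, 1],
    [1, 0, 0, 1],
]
print(filter_snps(toy))  # SNP 1 is monomorphic, SNP 3 has 50% missing calls
```

The same function can be reused with a higher `min_maf` to derive reduced-density panels.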

# Genomic Prediction Models

Overall binary survival (0 = dead, 1 = alive) was used as the phenotype to assess the potential of GS for improved resistance to KHVD in common carp. Several commonly used GS models were tested on the data using the R package BGLR for binary traits (Pérez and de los Campos, 2014): specifically rrBLUP, BayesA, BayesB (Meuwissen et al., 2001) and BayesC (Habier et al., 2011). In addition, pedigree-based BLUP (Henderson, 1975) was evaluated using the same software. The general form of the fitted models was:

$$\mathbf{l} = \mathbf{X}\mathbf{b} + \mathbf{Z}\alpha + \mathbf{e},\tag{1}$$

where **l** is the vector of latent variables, **b** is the vector of the fixed effects (intercept, standard length), **X** is the incidence matrix relating phenotypes with the fixed effects, **Z** the incidence matrix relating the underlying liability with the genotypes, α the vector of SNP effects using the corresponding prior distribution for each of the aforementioned Bayesian models and **e** the vector of residuals. The parameters of each model were estimated by Markov chain Monte Carlo (MCMC) using Gibbs sampling (110,000 iterations; burn-in: 10,000; thin: 10). Convergence of the resulting posterior distributions was assessed both visually (inspecting the resulting MCMC plots) and analytically using the R package coda v0.19-1 software (Plummer et al., 2006).
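To make the latent-variable formulation more concrete, the sketch below simulates binary survival from Eq. 1 and counts the MCMC samples retained under the stated chain settings. It is an illustration only: a threshold at zero with unit residual variance is assumed, the function names are hypothetical, and the exact way BGLR counts retained samples may differ slightly.

```python
import random


def simulate_survival(X, b, Z, alpha, seed=1):
    """Simulate l = Xb + Z*alpha + e with e ~ N(0, 1); survival = 1 if l > 0."""
    rng = random.Random(seed)
    liabilities, survival = [], []
    for x_row, z_row in zip(X, Z):
        fixed = sum(x * bk for x, bk in zip(x_row, b))          # Xb
        genetic = sum(z * ak for z, ak in zip(z_row, alpha))    # Z*alpha
        l_i = fixed + genetic + rng.gauss(0.0, 1.0)             # add residual
        liabilities.append(l_i)
        survival.append(1 if l_i > 0 else 0)                    # 0 = dead, 1 = alive
    return liabilities, survival


def retained_samples(n_iter, burn_in, thin):
    """Number of posterior samples kept after burn-in and thinning."""
    return (n_iter - burn_in) // thin


# Chain settings quoted in the text: 110,000 iterations, burn-in 10,000, thin 10.
print(retained_samples(110_000, 10_000, 10))
```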

# Prediction Metrics for KHVD Resistance

The prediction performance of the utilized models was tested using the following metrics:


The prediction accuracy was approximated as:

$$r = \mathrm{cor}(\mathbf{GEBV}, \mathbf{y})/h,\tag{2}$$

where **y** is the vector of recorded phenotypes, **(G)EBV** is the vector of (genomic) estimated breeding values, cor denotes their correlation, and h is the square root of the heritability (h<sup>2</sup> = 0.50, estimated using the genomic relationship matrix as described in Palaiokostas et al., 2018a).
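A minimal sketch of Eq. 2 follows, assuming that the term in parentheses denotes the Pearson correlation between the (G)EBV and the recorded phenotypes in the validation set. The function names are hypothetical and the toy values are not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prediction_accuracy(ebv, phenotypes, h2=0.50):
    """Correlation between (G)EBV and phenotype, scaled by h = sqrt(h2)."""
    return pearson(ebv, phenotypes) / math.sqrt(h2)


# Toy values (illustrative only): masked phenotypes and predicted breeding values.
y = [0, 1, 0, 1, 1, 0]
gebv = [-0.4, 0.3, 0.1, 0.6, -0.1, -0.5]
print(round(prediction_accuracy(gebv, y), 3))
```

Because the correlation is divided by h, the scaled value can exceed 1 in small or idealized samples; it approximates, rather than equals, the correlation with the true breeding value.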

Receiver operator characteristic curves were used to assess the efficacy of classifying the animals as resistant or susceptible, using either the pedigree- or the genomic-based models. The area under the curve (AUC) metric (Hanley and McNeil, 1982; Wray et al., 2010) was used to interpret the performance of the genomic prediction models, with values of 1 representing the perfect classifier.
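The AUC can be interpreted as the probability that a randomly chosen survivor receives a higher predicted value than a randomly chosen mortality. The sketch below computes it directly from that definition; it is an illustrative implementation with a hypothetical function name, not the software used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    scores: predicted values, e.g. (G)EBV; labels: 1 = survivor, 0 = mortality.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one animal in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Survivors scored 0.8 and 0.4, mortalities 0.6 and 0.2: three of four pairs are ranked correctly.
print(auc([0.8, 0.4, 0.6, 0.2], [1, 1, 0, 0]))  # 0.75
```

A value of 0.5 corresponds to a classifier that ranks animals no better than chance.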

# Genomic Prediction With Varying SNP Densities

Genomic prediction models were applied using datasets of varying SNP density using either MAF or linkage disequilibrium (LD) values as thresholds for filtering. In particular, to obtain the reduced density SNP panels for genomic prediction, a strategy of retaining SNPs surpassing a sequentially increased MAF threshold was applied, as described in Robledo et al. (2018). These MAF thresholds were 0.1 (3,993 SNPs), 0.25 (1,619 SNPs) and 0.35 (802 SNPs).

In addition, reduced density SNP datasets were obtained by applying filtering based on LD values. LD amongst SNP pairs was calculated using SNPrune (Calus and Vandenplas, 2018). Thereafter, only SNP pairs below a sequentially increased LD value were retained. The LD thresholds were 0.15 (1,006 SNPs), 0.25 (2,895 SNPs), 0.35 (5,118 SNPs).
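The idea of thinning SNPs by an LD threshold can be sketched with a simple greedy procedure. This is a stand-in for illustration and not the SNPrune algorithm: it uses the squared genotype correlation as the LD measure, keeps the first SNP of any highly correlated pair, and assumes complete, polymorphic genotypes. All names are hypothetical.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two genotype columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy / math.sqrt(sxx * syy)) ** 2


def greedy_ld_prune(genotypes, r2_threshold):
    """Keep a SNP only if its r^2 with every already-kept SNP is below the threshold."""
    n_snps = len(genotypes[0])
    columns = [[row[j] for row in genotypes] for j in range(n_snps)]
    kept = []
    for j in range(n_snps):
        if all(r_squared(columns[j], columns[k]) < r2_threshold for k in kept):
            kept.append(j)
    return kept


# Toy example: SNPs 0 and 1 are identical; SNP 2 is weakly correlated with SNP 0.
toy = [
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
]
print(greedy_ld_prune(toy, 0.35))  # [0, 2]
print(greedy_ld_prune(toy, 0.15))  # [0]
```

As in the text, a higher LD threshold retains more SNPs.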

Five-fold cross-validation was performed for all the density varying SNP datasets in order to test the efficiency of correctly classifying animals in the validation set as resistant or susceptible. The dataset was randomly split into sequential training (n = 1008) and validation sets (n = 251). The number of resistant and susceptible animals in each validation set was proportional to the overall survival of the challenged population. In the validation sets, the phenotypes of the animals were masked, and their (genomic) estimated breeding values – (G)EBV – were estimated based on the prediction model derived from the training set. This cross-validation procedure was repeated five times to minimize potential bias.
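The fold construction described above, with survivors and mortalities kept in proportion within each validation set, can be sketched as follows. The function name and the survivor count in the example (428 of 1,259 animals) are hypothetical and used only to show the resulting fold sizes.

```python
import random


def stratified_folds(survival, n_folds=5, seed=0):
    """Assign each animal to a fold so that every fold mirrors overall survival."""
    rng = random.Random(seed)
    fold_of = [None] * len(survival)
    for status in sorted(set(survival)):
        idx = [i for i, s in enumerate(survival) if s == status]
        rng.shuffle(idx)                      # random order within each class
        for rank, i in enumerate(idx):
            fold_of[i] = rank % n_folds       # deal animals out fold by fold
    return fold_of


# Hypothetical phenotypes: 428 survivors (1) and 831 mortalities (0).
survival = [1] * 428 + [0] * 831
folds = stratified_folds(survival)
for k in range(5):
    validation = [i for i, f in enumerate(folds) if f == k]
    training = [i for i, f in enumerate(folds) if f != k]
    print(k, len(training), len(validation))
```

Repeating the procedure with different seeds gives the replicated cross-validation; phenotypes of the validation animals would be masked before fitting the model on the training animals.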

# Testing the Impact of Genetic Relationship on Genomic Prediction

Four different scenarios were tested for evaluating the impact of genetic relationships between training and validation sets. In scenario 1 (S1), the formation of training and validation sets required the existence of full siblings in both sets for each family. For scenario 2 (S2), the formation of validation and training sets allowed the existence of only half siblings between the two sets (and no full siblings). In both S1 and S2 the cross-validation procedure was repeated five times in order to reduce potential bias, while the size of the validation set was 290 animals in each replicate. In scenario 3 (S3), the genomic prediction models were tested by sequentially assigning each of the four factorial crosses (mean = 315 animals; sd = 81 animals) as a validation set, using the remaining three as a training set. This approach resulted in relatively unrelated training and validation sets, since it avoided the inclusion of full/half sibs in both the training and the validation sets. The genomic prediction models were tested on the dataset comprising the full SNP data. Since pedigree information was not available for prior generations, pBLUP could not be used for obtaining meaningful predictions across the factorial cross groups. Finally, scenario 4 (S4) was performed as a control where no restrictions were applied in the formation of training and validation sets (i.e., they were taken at random). Cross-validation in S4 was performed five times with the size of the validation sets being set to 290 animals. The S4 scenario was in fact similar to the approaches tested in the previous section regarding varying SNP densities, with the only difference being the size of the validation set. The full SNP dataset was used for all the tested scenarios.
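Scenario S3 amounts to a leave-one-group-out design in which each factorial cross serves once as the validation set. A minimal sketch, with a hypothetical function name and toy cross labels, is:

```python
def leave_one_cross_out(cross_ids):
    """Yield (cross, training_indices, validation_indices) for each factorial cross."""
    for cross in sorted(set(cross_ids)):
        validation = [i for i, c in enumerate(cross_ids) if c == cross]
        training = [i for i, c in enumerate(cross_ids) if c != cross]
        yield cross, training, validation


# Toy labels: the factorial cross (1 to 4) from which each animal originates.
cross_ids = [1, 1, 2, 3, 3, 3, 4]
for cross, training, validation in leave_one_cross_out(cross_ids):
    print(cross, training, validation)
```

Because parents are not shared between factorial crosses, no full or half sibs of a validation animal appear in the corresponding training set.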

# RESULTS

# Disease Challenge


The mean weight of the genotyped carp juveniles was 16.3 g (SD 4.6) and the mean standard length (SL) was 77 mm (SD 7.1). Mortalities began at 12 dpi, reaching a maximum daily rate between 21 and 24 dpi (98 – 130 mortalities per day) and decreasing thereafter (**Supplementary File S2**). Observed mortalities displayed typical KHVD symptoms (weakness, lethargy, loss of equilibrium, erratic swimming, sunken eyes, excessive mucous production, discoloration, and hemorrhagic lesions on the skin and gills).

# RAD Sequencing and Parentage Assignment

2.8 billion paired-end reads were uniquely aligned to the common carp genome assembly (GenBank assembly accession GCA\_000951615.2) representing approximately 82% of reads passing initial quality filters (missing restriction site, ambiguous barcodes and PCR duplicates). Approximately 5% of those reads had a mapping quality below 40 and were discarded. In total 397,047 putative RAD loci were identified with a mean coverage of 21X (SD = 7.6, min = 1.3X, max = 58.5X). 15,615 SNPs found in more than 75% of the genotyped animals and with a MAF above 0.05 were retained for downstream analysis (**Supplementary File S1**).

The carp progeny were assigned to unique parental pairs allowing for a maximum genotypic error rate of 2%. In total 1,259 offspring were uniquely assigned (**Supplementary File S3**), comprising 195 full-sib families (40 sires, 20 dams) ranging from 1 to 21 animals per family with a mean size of 6 (SD 4). The individual dam contribution to the population ranged from 9 to 99 animals with a mean of 61 (SD 23), while the sire contribution ranged from 7 to 53 animals with a mean of 30 (SD 12). In addition, the mean weight and length per full-sib family were approximately 16 (SD 2.8) g and 76 (SD 4.5) mm respectively. Finally, mean survival per full sib family was 34% (**Figure 1**).

# Impact of SNP Density on Genomic Prediction

Datasets of varying genotyping density comprised 15,615 SNPs (D1; full dataset; **Supplementary File S4**) and, when MAF was used as the filtering criterion, 3,993 (D2; MAF 0.1), 1,619 (D3; MAF 0.25) and 802 (D4; MAF 0.35) SNPs. The accuracy of genomic prediction of breeding values was assessed and compared to prediction using a pedigree-based approach. Prediction accuracy with pBLUP was 0.49, compared to 0.53 – 0.54 for the genomic prediction models applied using D1 (**Table 1**). Prediction accuracies for D2 ranged between 0.52 and 0.53, while in the case of D3 and D4 prediction accuracy for all genomic models was 0.49 and 0.46 respectively (**Figure 2A**). Following estimation of ROC curves, the genomic models for D1 had a maximum AUC estimate of 0.74 as opposed to 0.71 using pBLUP. AUC for D2 was 0.73 for all genomic models. In the case of D3 and D4 the AUC for all genomic models was 0.71 and 0.70 respectively.

TABLE 1 | Mean survival accuracy for D1<sup>1</sup> (5-fold cross validation; 5 replicates).

<sup>1</sup>15,615 SNPs.

Regarding the reduced density SNP datasets obtained using LD pruning, the number of SNPs in the sets with the LD thresholds of 0.15, 0.25, and 0.35 were 1,006 (LD1), 2,895 (LD2) and 5,118 (LD3) respectively. The genomic prediction accuracy obtained for LD1 was very slightly higher than pBLUP using the BayesB and BayesC models (<1% increase), while the AUC was the same. In the case of rrBLUP and BayesA for the same SNP dataset the estimates were 2 and 1% lower compared to pBLUP for accuracy and AUC respectively. Using datasets of higher SNP density resulted in the increase of both the accuracy and the AUC metrics as observed previously for the reduced density datasets filtered by MAF. In particular, accuracy for LD2 and LD3 ranged between 0.52 and 0.54 and AUC between 0.72 and 0.74 (**Figure 2B**), which were very similar to the accuracy and AUC values obtained for the full SNP dataset (15,615 SNPs).
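
To make the effect of the LD threshold concrete, the following sketch prunes SNPs greedily using the squared correlation (r²) between genotype vectors. It is a simplified stand-in rather than the procedure used in the study: it compares every SNP against all retained SNPs instead of working in sliding windows, assumes complete 0/1/2 genotypes, and uses hypothetical names and toy data.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, threshold):
    """Keep a SNP only if its r^2 with every SNP kept so far is <= threshold.

    snps: list of (name, genotypes) pairs in map order.
    """
    kept = []
    for name, geno in snps:
        if all(r_squared(geno, kept_geno) <= threshold for _, kept_geno in kept):
            kept.append((name, geno))
    return [name for name, _ in kept]


toy_snps = [
    ("A", [0, 1, 2, 0, 1, 2]),
    ("B", [0, 1, 2, 0, 1, 2]),  # identical to A, r^2 = 1
    ("C", [2, 0, 1, 1, 2, 0]),  # r^2 with A = 0.25
]
print(ld_prune(toy_snps, 0.35))
print(ld_prune(toy_snps, 0.15))
```

A stricter (lower) threshold removes more SNPs, which is consistent with the LD1 set being the smallest of the three.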

# Impact of Genetic Relationship on Genomic Prediction

For scenario S1, where all animals in the validation set had full sibs in the training set, the genomic prediction accuracy was approximately 0.56, which was marginally higher (∼ 4% increase) than the random allocation of animals into training and validation sets described above. In S2, where the design of the validation set allowed the inclusion of only corresponding half sibs in the training and validation set, the genomic prediction accuracy fell to ∼ 0.53. In S3, where the training and validation sets were set up to correspond to separate factorial crosses, the mean accuracy for the genomic models was markedly lower, and ranged between 0.16 and 0.20. Finally, in the scenario where training and validation sets were set up without any restrictions, such that close relatives are likely to be included in both sets, accuracy ranged between 0.52 and 0.54 for the genomic prediction models and was 0.49 for pBLUP (**Table 2**).

The obtained AUC values from the ROC curves were 0.74 (BayesB; **Figure 3**) and 0.72 for S1 and S2 for the genomic prediction models, while the corresponding AUC values from pBLUP were 0.72 and 0.69 respectively. For S3 the estimated AUC values for the genomic models were again substantially lower and ranged between 0.57 and 0.58. In S4, where no restrictions were applied regarding the inclusion of full/half sibs on both training and validation sets, the AUC values were between 0.72 and 0.74, comparable to S1 and S2 (**Table 2**).

# DISCUSSION

In the current study, genotyping by sequencing was applied to study genomic prediction of resistance to KHVD in carp, including testing the impact of SNP marker density and genetic relationship between training and validation sets. While genomic data in the form of genetic markers can be a valuable addition to selective breeding for disease resistance, the methods of applying the data depend on the underlying genetic architecture of the trait. In the case of major QTL such as resistance to Infectious Pancreatic Necrosis in salmon (Houston et al., 2008; Moen et al., 2009), it may be most effective to use QTL-targeted marker-assisted selection, and in the case of polygenic traits GS is likely to be preferable. In our previous study we identified a QTL associated with KHVD resistance in common carp located on chromosome 33 (Palaiokostas et al., 2018a). However, this QTL accounted for approximately 7% of the genetic variation in the trait, highlighting that multiple additional loci are involved. Further, using genomic prediction models that incorporate variable selection – i.e., allow for the existence of QTL of large effect – did not result in significant improvement in prediction accuracy compared to ridge regression BLUP, which supports the involvement of many genomic regions in the trait (Meuwissen et al., 2001; Kizilkaya et al., 2010; Habier et al., 2013).

Since genotyping cost is generally related to SNP marker density, determining the lowest SNP density that retains maximum genomic prediction accuracy is a logical goal. In the current study, reducing SNP density from 15,615 to 2,895 resulted in minor decreases in prediction accuracy, with 1,000–1,600 SNPs giving approximately the same accuracy as pBLUP. Furthermore, the LD-pruned dataset of approximately 5,000 SNPs resulted in the same prediction accuracy performance as the full dataset (15,615 SNPs). A more drastic impact of genetic relationship between training and validation sets on prediction accuracy was observed. The highest prediction efficiency was observed in scenario S1 where animals in the validation set had full siblings in the training set. Prediction efficiency decreased 6–8% in the scenario allowing for only the inclusion of half-siblings (and no full siblings) in the training and validation sets but was still comparable to the results when the sets were established at random. Interestingly, the impact of the lower genetic relationships on pBLUP accuracy was greater, and it dropped by approximately 16% between S1 and S2. This may indicate that genomic prediction models have the potential to utilize more distant relationships than pBLUP, especially in the current set up where there was only a two generation pedigree. Furthermore, when the training set comprised three of the factorial cross groups and the validation set comprised the fourth, thus resulting in no shared full/half sibs between the two sets, the accuracy dropped sharply to 0.16–0.17 (15,615 SNPs).
The decrease in prediction accuracy with more distant relationships is to be expected; thus close relationships between training and validation sets are a necessary prerequisite for successfully implementing GS (Meuwissen et al., 2013). This highlights the importance of obtaining genotype and phenotype records on close relatives of selection candidates in future carp breeding programs using genomic (and pedigree) selection.

FIGURE 2 | Relative accuracy of genomic prediction models compared to pedigree BLUP for varying SNP densities. (A) SNP filtering based on minor allele frequency and (B) SNP filtering based on linkage disequilibrium.



<sup>1</sup>Accuracy; <sup>2</sup>Area under curve. Genetic relationships for the 4 tested scenarios were for S1: Full sibs on the training set for all animals of the validation set (n = 290), for S2: Half sibs on the training set for all animals in the validation set (n = 290), for S3: cross validation performed for each of the breeding crosses, with no full/half sibs on the training set for any of the animals in the validation set (n = 315) and for S4: No restrictions applied (n = 290) for genetic relationship between training and validation set.

Testing genomic prediction on binary traits such as survival presents a challenge in defining a suitable test metric for selecting the best performing model, especially when survival deviates significantly from 50%. Solely relying on correlation-derived accuracy for model assessment in this case could result in suboptimal selection decisions. Suitable metrics for evaluating prediction efficiency in binary traits, and thus selecting the best performing models for estimating breeding values, include the AUC from ROC curves.

The AUC values provide a commonly used metric for assessing the prediction efficacy of binary classifiers, taking into consideration both the rate of false positives and false negatives, with a value of one indicating 100% successful classification. This approach has been routinely used to test the efficacy of prediction models in disease resistance studies in humans (Wray et al., 2010), livestock (Tsairidou et al., 2014) and aquaculture (Palaiokostas et al., 2018b), amongst others. In the current study, genomic prediction using the marker density scenarios of ∼ 3,000 SNPs and above resulted in a slight improvement (∼ 4%) of AUC compared to pBLUP. Performing predictions using approximately 1,000 SNPs resulted in the same AUC value (0.71) as pBLUP, while using approximately 800 SNPs gave an estimated AUC value of 0.70, which is slightly inferior. A gradual decrease was observed in the estimated AUC values for the scenarios of varying genetic relationship, as was also the case for the prediction accuracy metric. As expected, the highest values were obtained in the scenario of highest relationships between training and validation sets (S1). The most striking effect of genetic relationships between these sets, however, was observed in the scenario where the training and validation sets were set up to be most distantly related, where the estimated AUC values ranged between 0.56 and 0.57, which are substantially lower than in all other tested scenarios, but still useful.
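
For readers less familiar with the metric, the AUC can be computed directly from its rank interpretation: it equals the probability that a randomly chosen survivor receives a higher predicted value than a randomly chosen mortality, with ties counted as one half. The sketch below uses that definition; the function name, the 1 = survived coding and the toy values are assumptions made for illustration, not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: predicted values (e.g., estimated breeding values)
    labels: 1 for the positive class (assumed here: survived), 0 otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one animal in each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(perfect, partial)
```

Under this reading, an AUC of 0.5 corresponds to random ranking, which gives context to values of 0.56–0.57 in the most distantly related scenario.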

In summary, the results from the current study demonstrate that GS was more efficient than pBLUP in predicting KHVD resistance in carp. The consistency of improvement in prediction accuracy versus pedigree-based accuracy across multiple scenarios highlights flexibility and robustness to different approaches, and it may allow circumvention of limitations posed by incomplete pedigree records. Of major importance is the fact that relatively low density marker panels could be of value for genomic prediction without loss of accuracy. However, close relationships between training and validation sets are key, with substantial loss of prediction accuracy in the scenario where the sets were relatively unrelated. Pedigree-based prediction was also efficient in scenarios with recorded relationships between training and validation sets, possibly partly because KHVD resistance is a highly heritable trait (h<sup>2</sup> = 0.5–0.79), but genetic markers were required to assign the pedigree in the factorial crosses. Future studies testing the efficiency of single-step BLUP approaches (Aguilar et al., 2011; Legarra et al., 2014) could potentially prove beneficial by allowing genomic predictions based on larger datasets (only a portion of the dataset would be genotyped, thus reducing costs). Overall, our results help inform the use of genetic markers in carp breeding to enable improvement of disease resistance, with downstream benefits of helping prevent KHVD outbreaks in carp aquaculture.

# ETHICS STATEMENT

The entire experiment was conducted in accordance with the law on the protection of animals against cruelty (Act No. 246/1992 Coll. of the Czech Republic) upon its approval by Institutional Animal Care and Use Committee (IACUC).

# AUTHOR CONTRIBUTIONS

TV, MK, MP, VP, and RH conceived the study and contributed to designing the experimental structure. TV, DP, and LP carried out the challenge experiment. CP carried out DNA extractions, RAD library preparation, and sequence data processing. CP and RH carried out parentage assignment and the quantitative genetic analyses. All authors contributed to drafting the manuscript.

# FUNDING

The authors are supported by funding from the European Union's Seventh Framework Programme (FP7 2007–2013) under grant agreement no. 613611 (FISHBOOST). CP and RH gratefully acknowledge Institute Strategic Funding Grants to The Roslin Institute (Grant Nos. BBS/E/D/20002172, BBS/E/D/30002275, and BBS/E/D/10002070). MK and MP were also supported by project Biodiverzita (CZ.02.1.01/0.0/0.0/16\_025/0007370). VP was also supported by project PROFISH (CZ.02.1.01/0.0/0.0/16\_019/0000869), both under the Ministry of Education, Youth and Sports of the Czech Republic. TV, DP, and LP were also supported by Ministry of Agriculture of the Czech Republic (Project MZE-RO 0518).

# REFERENCES

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00543/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer PM declared a past co-authorship with several of the authors CP and RH to the handling Editor.

Copyright © 2019 Palaiokostas, Vesely, Kocour, Prchal, Pokorova, Piackova, Pojezdal and Houston. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Transcriptome Analysis Reveals Molecular Basis Underlying Fast Growth of the Selectively Bred Pacific Oyster, *Crassostrea gigas*

### *Fuqiang Zhang<sup>1</sup>, Boyang Hu<sup>1</sup>, Huiru Fu<sup>1</sup>, Zexin Jiao<sup>1</sup>, Qi Li<sup>1,2</sup>\*, and Shikai Liu<sup>1,2</sup>\**

*<sup>1</sup> Key Laboratory of Mariculture, Ministry of Education, and College of Fisheries, Ocean University of China, Qingdao, China, <sup>2</sup> Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China*

### *Edited by:*

*Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore*

### *Reviewed by:*

*Jian Xu, Key Laboratory of Aquatic Genomics, Chinese Academy of Fishery Sciences, China Chuanju Dong, Henan Normal University, China*

### *\*Correspondence:*

*Shikai Liu liushk@ouc.edu.cn Qi Li qili66@ouc.edu.cn*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 28 February 2019 Accepted: 11 June 2019 Published: 28 June 2019*

### *Citation:*

*Zhang F, Hu B, Fu H, Jiao Z, Li Q and Liu S (2019) Comparative Transcriptome Analysis Reveals Molecular Basis Underlying Fast Growth of the Selectively Bred Pacific Oyster, Crassostrea gigas. Front. Genet. 10:610. doi: 10.3389/fgene.2019.00610*

Fast growth is one of the most desired traits in all food animals because it affects the profitability of animal production. The Pacific oyster, *Crassostrea gigas*, is an important aquaculture shellfish around the world with the largest annual production. Growth of the Pacific oyster has been greatly improved by artificial selection breeding, but the molecular mechanisms underlying growth remain poorly understood, which limits the molecular integrative breeding of fast growth with other superior traits. In this study, comparative transcriptome analyses between the fast-growing selectively bred Pacific oyster and unselected wild Pacific oysters were conducted by RNA-Seq. A total of 1,303 protein-coding genes differentially expressed between fast-growing oysters and wild controls were identified, of which 888 genes were expressed at higher levels in the fast-growing oysters. Functional analysis of the differentially expressed genes (DEGs) indicated that genes involved in microtubule motor activity and biosynthesis of nucleotides and proteins are potentially important for growth in the oyster. Positive selection analysis of genes at the transcriptome level showed that a significant number of ribosomal protein genes had undergone positive selection during the artificial selection breeding process. These results also indicated the importance of protein biosynthesis and metabolism for the growth of oysters. The alternative splicing (AS) of genes was also compared between the two groups of oysters. A total of 3,230 differential alternative splicing (DAS) events were identified, involving 1,818 genes. These DAS genes were associated with specific functional pathways related to growth, such as "long-term potentiation," "salivary secretion," and "phosphatidylinositol signaling system."
The findings of this study will be valuable resources for future investigation to unravel molecular mechanisms underlying growth regulation in the oyster and other marine invertebrates and to provide solid support for breeding application to integrate fast growth with other superior traits in the Pacific oyster.

Keywords: Pacific oyster, RNA-Seq, growth, DEGs, Ka/Ks, alternative splicing

# INTRODUCTION

Growth is one of the most important traits related to fitness and production for any organism. Traits associated with fast growth have been one of the major breeding goals to enhance the profitability of production for all food animals. For aquaculture species, growth rate is especially important because aquaculture takes place in highly variable water environments, and improvement of growth can not only reduce the input cost but also decrease the risk of economic loss by shortening culture time. Therefore, genetic breeding of fast-growing varieties or strains has been extensively conducted in aquaculture species.

Growth is a complex trait in marine mollusks, which is genetically controlled but affected by environmental variables such as temperature and food availability (Tamayo et al., 2011; Gosling, 2015). Quantitative genetic analyses revealed that growth rate has a significant genetic component with a medium-to-high heritability in marine mollusks (e.g., Kong et al., 2015). Growth rate has been positively correlated with the degree of heterozygosity of enzyme-coding genes (Szulkin et al., 2010). From a physiological perspective, more heterozygous individuals have been shown to grow faster and to be characterized by higher levels of metabolism and protein turnover (Bayne and Hawkins, 1997). Identification of basic elements of endocrine and regulatory networks of vertebrates indicated that a similar system could exist in marine mollusks. Growth rate of abalones has been associated with three neuropeptides secreted in neural ganglia (York et al., 2012), supporting the neural control of growth. Many studies in mollusks have also reported the association of genetic mechanisms within insulin-related peptide genes with differences in growth (Kellner-Cousin et al., 1994; Gricourt et al., 2003; Cong et al., 2013; Feng et al., 2014; Alarcon-Matus et al., 2015). Polymorphisms in the genes coding for enzymes responsible for nutrient acquisition, such as amylases and glycogen synthase, have also been identified (Bacca et al., 2005; Prudence et al., 2006). Although many genetic factors associated with growth have been proposed in a number of marine mollusks, critical molecular mechanisms underlying growth remain largely unexplored.

The Pacific oyster (*Crassostrea gigas*) originated in the Northwest Pacific and has been introduced to many countries around the world for aquaculture purposes (Troost, 2010). It is now one of the most widely cultivated shellfish species worldwide, with global production reaching ~0.6 million tons in 2016 (FAO, 2018). Because of its great economic value, a number of selective breeding programs based on family and mass selection have been conducted over the years (Langdon et al., 2003; Evans and Langdon, 2006; Dégremont et al., 2010; Li et al., 2011; de Melo et al., 2016). Starting in 2006, we conducted a selective breeding program of the Pacific oyster in China, using oysters collected from three wild populations in Rushan (China), Miyagi (Japan), and Busan (South Korea). Significant improvement in growth has been achieved after generations of artificial selection (Li et al., 2011), as reported in other breeding programs (e.g., Langdon et al., 2003; Evans and Langdon, 2006). In 2013, the selectively bred fast-growing oyster strain from our breeding program was certified as an oyster variety by the National Commission for the Examination and Approval of Aquatic Original and Improved Species, Ministry of Agriculture of China.

The fast-growing oyster variety provides us with a good research model for studies of the growth trait. Physiological energetics analysis of the fast-growing oysters suggested that the selectively bred oysters had a higher energy gain than unselected oysters, while the basal metabolic rate between them was not significantly different. Therefore, fast-growing oysters possess a superior energy budget for growth (Zhang et al., 2018). Gene-associated single nucleotide polymorphism (SNP) markers were developed for association analysis of the markers with growth traits, allowing identification of a number of SNP markers with allele frequencies showing a significant difference between fast-growing oysters and unselected commercial control oysters (Wang and Li, 2017). With the fast-growing oysters as the research material, genome-wide analysis of genetic markers and genes would warrant a fine-scale genetic dissection of the growth trait in oysters.

Molecular genetic approaches have been rapidly developed in recent years, allowing for identification of genetic markers and genes that are associated with production traits in oysters. Genetic linkage maps of the Pacific oyster have been constructed based on a variety of molecular markers, including amplified fragment length polymorphism (AFLP) markers or combinations of AFLP with microsatellite markers (Li and Guo, 2004; Guo et al., 2012), microsatellite markers (Li et al., 2003; Hubert and Hedgecock, 2004; Hubert et al., 2009; Plough and Hedgecock, 2011), SNPs (Sauvage et al., 2007; Wang et al., 2015; Qi et al., 2017), and a combination of microsatellite markers with SNPs (Sauvage et al., 2010; Zhong et al., 2014; Hedgecock et al., 2015). Based on these linkage maps, quantitative trait locus (QTL) mapping studies have been performed to examine the genetic basis of growth-related traits in the Pacific oyster (Prudence et al., 2006; Hedgecock et al., 2007a; Guo et al., 2012; Wang et al., 2016; Li et al., 2018). In most of these studies, numerous QTLs associated with growth traits were reported, indicating that growth in oysters is a highly polygenic trait (Qin et al., 2012; Gutierrez et al., 2018). However, the identified growth-related QTLs could only explain a limited portion of the phenotypic variation in the Pacific oyster. As more studies are conducted, it is increasingly recognized that integrative analysis of the genetic findings with genomics is required to unravel the molecular mechanisms behind complex traits such as growth.

The rapidly developed high-throughput sequencing technologies have dramatically boosted genomics research and enabled genetic analysis of traits at the whole-genome level. RNA-Seq (high-throughput sequencing of RNA) is now an effective tool for transcriptome-level analysis of gene expression related to production and performance traits. A large number of RNA-Seq studies have been conducted in the Pacific oyster for analyses of various traits including salinity stress (Zhao et al., 2012; Meng et al., 2013), shell colors (Feng et al., 2015), virus infection (He et al., 2015), heat stress (Yang et al., 2017), and sex determination (Yue et al., 2018). Transcriptome sequencing approaches have been applied to understand growth-related traits in the Pacific oyster (Hedgecock et al., 2007b) and other species (e.g., Guan et al., 2017). In the Pacific oyster, massively parallel signature sequencing (MPSS) was used to generate expressed sequence tags for investigation of genetic causes of heterosis, and it indicated that ribosomal proteins involved in protein metabolism could play a critical role in growth (Hedgecock et al., 2007b). However, the low throughput of the MPSS technique could not provide deep genome coverage data for comprehensive analysis of the molecular mechanisms underlying growth. In this study, toward understanding the molecular basis for fast growth in the selectively bred Pacific oyster, we compared the transcriptomes of the selectively bred oysters and unselected wild oysters by deep RNA-Seq. The transcriptional differences were analyzed from several aspects, including gene expression, gene positive selection, and gene alternative splicing. The results will be valuable for future efforts toward understanding the molecular mechanisms underlying growth regulation and molecular breeding in the Pacific oyster.

# MATERIALS AND METHODS

# Ethics Statement

The experiments in this study were conducted according to institutional and national guidelines. No endangered or protected species was involved in the experiments of the study. No specific permission was required for the location of the culture experiment.

# Experimental Animals

Samples of the fast-growing oysters used in this study were from the oyster breeding program conducted by our research group (Li et al., 2011). Briefly, the selectively bred line of the Pacific oyster for fast growth was first developed in 2007, using a breeding base population constructed with the wild oysters collected from Rushan Bay (36.8°N, 121.6°E), Shandong, China, in 2006. Thereafter, this line was successively selected for fast growth annually and had undergone 10 successive generations of mass selection up to 2017. In June 2017, a total of 110 one-year-old fast-growing Pacific oysters from the selectively bred variety "Haida No. 1" and 120 unselected wild individuals were simultaneously strip-spawned and separately cultured in two 24-m<sup>3</sup> concrete tanks in a hatchery in Laizhou (37.3°N, 119.9°E), Yantai, China. The larvae were reared according to the procedures reported in a previous study (Li et al., 2011). The same rearing procedure and feeding were applied to the two tanks. When the spat attached to the collectors (scallop shell) reached 2–3 mm in shell height, they were transferred to Sanggouwan Bay in Rongcheng (37.1°N, 122.5°E, Shandong, China) for marine culture.

# Growth Measurement and Sampling

In December 2017, 100 6-month-old Pacific oysters were collected from each of the two populations (hereafter referred to as "breed" and "wild" groups) for use in this study. Shell height, shell length, shell width, and total weight of each individual were measured. Nine individuals from each of the two groups were randomly chosen for tissue collection. Equal amounts of mantle tissue were dissected from each of three oysters and pooled into one sample, creating three biological replicates for each of the "breed" and "wild" groups. Tissues were flash frozen in liquid nitrogen and then transferred to −80°C until used for RNA extraction.

# RNA Extraction, Library Construction, and Sequencing

Total RNA was extracted using TRIzol reagent (Invitrogen) according to the manufacturer's instructions. The RNA quality was confirmed by running 1% agarose gel electrophoresis. RNA concentration and purity were measured using NanoDrop (Thermo Fisher Scientific), and the RNA integrity number (RIN) was assessed using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies).

Six sequencing libraries were constructed using NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) following manufacturer's protocols. The index codes were added to attribute sequences to each sample. Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. First-strand cDNA was synthesized using random hexamer primer and M-MuLV Reverse Transcriptase (RNase H-). Subsequently, second-strand cDNA was synthesized using DNA polymerase I and RNase H. After purification, end repair, adenylation of 3′ ends of DNA fragments, and adaptor ligation, cDNA fragments of 250–300 bp were selected using AMPure XP beads and enriched by PCR. Library quality was evaluated on the Agilent Bioanalyzer 2100 system. The index-coded samples were clustered on the cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumina) according to the manufacturer's instructions. After cluster generation, the libraries were sequenced using the Illumina HiSeq 2500 platform for 150-bp paired-end reads.

# Read Mapping and Differential Expression Analysis

Raw reads in fastq format generated from Illumina sequencing were assessed by FastQC. Clean reads were obtained by trimming reads containing adapter, reads containing poly-N, and reads with low sequencing quality. The downstream analyses were based on the high-quality clean reads. The reference oyster genome (Zhang et al., 2012) was first indexed (Li et al., 2009), and then the paired-end clean reads were aligned to the indexed reference genome using Hisat2 (v2.0.4) (Kim et al., 2015). Hisat2 was selected as the mapping tool because it can generate a database of splicing junctions based on the gene model annotation file and thus provide better mapping results than do other non-splicing mapping tools. The counts of reads mapped to each gene were obtained using HTSeq (v0.9.1) (Anders et al., 2014). The fragments per kilobase of transcript per million mapped reads (FPKM) of each gene was then determined based on the length of the gene and counts of reads mapped to the gene (Trapnell et al., 2010).
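
The FPKM normalization mentioned above follows directly from its name: the count for a gene is scaled by gene length in kilobases and by sequencing depth in millions of mapped fragments. The sketch below shows the arithmetic only; the function name and the example numbers are hypothetical and do not come from the study.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Equivalent to count / (length_bp / 1e3) / (total_mapped / 1e6).
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp long, 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
print(example_fpkm)
```

Because both length and depth appear in the denominator, a gene twice as long with the same count receives half the FPKM.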

Differential expression analysis of the two groups ("breed" vs "wild") was performed with the R package DESeq (1.18.0) (Anders and Huber, 2010), using a model based on the negative binomial distribution to calculate the *P*-value. The resulting *P*-values were adjusted using the Benjamini and Hochberg approach for controlling the false discovery rate. Genes with an adjusted *P*-value < 0.05 and fold-change > 1.5 were determined as differentially expressed genes (DEGs). Volcano plot was drawn using R scripts to exhibit the overall distribution of DEGs. Gene ontology (GO) enrichment analysis was conducted using the R package GOseq to study the distribution of DEGs in gene ontology in order to clarify the biological meaning as indicated in terms of gene function (Young et al., 2010). GO terms with corrected *P*-value of less than 0.05 were considered as significantly enriched with DEGs. KEGG pathway analysis was conducted to understand high-level functions and utilities of the biological system from molecular-level information (Kanehisa et al., 2007). The statistical enrichment of DEGs in KEGG pathways was tested using KOBAS (2.0) software (Mao et al., 2005), and multiple-testing-corrected *P*-value of less than 0.05 was regarded as significantly enriched in the pathway.
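
The DEG criteria described above (Benjamini and Hochberg adjusted *P*-value < 0.05 and fold-change > 1.5) can be sketched in a few lines of Python. This is not the DESeq implementation: the raw *P*-values are taken as given, the fold-change threshold is assumed to apply in both directions on the log2 scale, and the function names and toy gene list are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_degs(genes, padj_cutoff=0.05, fold_change_cutoff=1.5):
    """genes: list of (name, raw_pvalue, log2_fold_change) tuples.

    Assumption: a gene passes if it changes by more than the cutoff
    in either direction, i.e. |log2FC| > log2(cutoff).
    """
    padj = benjamini_hochberg([p for _, p, _ in genes])
    min_abs_log2fc = math.log2(fold_change_cutoff)
    return [
        name
        for (name, _, log2fc), q in zip(genes, padj)
        if q < padj_cutoff and abs(log2fc) > min_abs_log2fc
    ]


toy_genes = [
    ("g1", 0.01, 1.2),
    ("g2", 0.04, 0.3),
    ("g3", 0.03, -0.9),
    ("g4", 0.005, 0.1),
]
toy_padj = benjamini_hochberg([p for _, p, _ in toy_genes])
toy_degs = call_degs(toy_genes)
print(toy_degs)
```

In the toy list, `g2` and `g4` are significant after adjustment but fall below the fold-change threshold, so only `g1` and `g3` are reported.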

# Quantitative Real-Time PCR Validation

To validate the results of RNA-Seq, 12 differentially expressed genes were selected for quantitative real-time PCR (qRT-PCR) analysis. The RNA samples used for the qRT-PCR assay were the same as those used for RNA-Seq. The cDNA was synthesized for qRT-PCR using the PrimeScript™ RT Reagent Kit with gDNA Eraser (TaKaRa, Dalian, China). Specific primers for qRT-PCR were designed according to the reference sequences using Primer Premier 5.0 (**Supplementary Table 1**). The *Eukaryotic elongation factor 1* (*eEF-1*) gene was used as an endogenous control to normalize gene expression by real-time PCR (Renault et al., 2011). The amplification was performed on the LightCycler 480 real-time PCR instrument (Roche Diagnostics, Burgess Hill, UK) using SYBR® Premix Ex Taq™ (TaKaRa). Cycling parameters were 95°C for 5 min and then 40 cycles of 95°C for 5 s, 58°C for 30 s, and 72°C for 30 s. Melting curve analysis of the PCR products was performed to ensure specific amplification. Relative gene expression levels were calculated by the 2<sup>−ΔΔCt</sup> method (Schmittgen and Livak, 2008). Data were analyzed by *t*-test using SPSS 18.0, and a *P*-value < 0.05 was considered statistically significant.
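
As a worked illustration of the 2<sup>−ΔΔCt</sup> calculation, the sketch below normalizes the target gene to the reference gene within each group and then compares the two groups. The Ct values and the function name are hypothetical, and the method assumes roughly equal amplification efficiency for both genes.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference), computed within each group
    ddCt = dCt(test group) - dCt(control group)
    """
    delta_test = ct_target_test - ct_ref_test
    delta_control = ct_target_control - ct_ref_control
    return 2 ** -(delta_test - delta_control)


# Hypothetical Ct values: test group ("breed") vs control group ("wild"),
# with a reference gene such as eEF-1 at Ct 18 in both groups.
fold = relative_expression(22.0, 18.0, 24.0, 18.0)
print(fold)
```

Here ΔCt is 4 in the test group and 6 in the control group, so ΔΔCt is −2 and the target is estimated to be four times more abundant in the test group.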

# Alternative Splicing Analysis

Alternative splicing (AS) of genes creates multiple mRNA transcripts from one gene, resulting in tremendous proteomic complexity in higher eukaryotes (Keren et al., 2010; Nilsen and Graveley, 2010). AS events were analyzed using the software rMATS (v3.2.5) (Shen et al., 2014a). The AS events were divided into five categories: skipped exon (SE), alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), mutually exclusive exons (MXE), and retained intron (RI). The expression of each type of AS event was then calculated. Differential alternative splicing (DAS) events were determined from the two-group RNA-Seq data with replicates. A false discovery rate (FDR) < 0.05 was used as the screening criterion for DAS events. As in the analysis of DEGs, GO and KEGG enrichment analyses were also conducted for the DAS genes.

# Transcriptome *De Novo* Assembly and Annotation

*De novo* assembly of transcriptome was carried out using Trinity (Grabherr et al., 2011) with parameters set as default, followed by mapping cleaned reads to the *de novo* assembled transcript sequences using RSEM software (Li and Dewey, 2011). The assembled transcripts were annotated based on seven public databases, including the NCBI non-redundant protein sequences (Nr) database, NCBI non-redundant nucleotide sequences (Nt) database, Protein family (Pfam) database, euKaryotic Ortholog Groups (KOG) database, Swiss-Prot database, KEGG Ortholog (KO) database, and Gene Ontology (GO) database. For the genes that had multiple assembled transcript sequences, the longest transcript was chosen to represent the gene that is referred to as unigene. Coding sequences (CDSs) were predicted by matching unigenes to the Nr database and Swiss-Prot database by BLASTX.
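The unigene definition used here (the longest assembled transcript per gene) can be illustrated with a short sketch. The input format, a list of gene identifier, transcript identifier, and sequence tuples, is an assumption made for illustration; the actual gene-to-transcript grouping would come from the assembler output.

```python
def select_unigenes(transcripts):
    """Pick the longest transcript per gene.

    transcripts: iterable of (gene_id, transcript_id, sequence) tuples.
    Returns a dict mapping gene_id to (transcript_id, sequence).
    """
    unigenes = {}
    for gene_id, transcript_id, sequence in transcripts:
        best = unigenes.get(gene_id)
        # Strictly-longer comparison keeps the first transcript on ties.
        if best is None or len(sequence) > len(best[1]):
            unigenes[gene_id] = (transcript_id, sequence)
    return unigenes
```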

# Analysis of Positively Selected Genes during Artificial Selection

Putative orthologs between the two groups ("breed" vs "wild") of the Pacific oysters were identified using a BLAST-based approach. The CDSs were first extracted from unigenes, then self-to-self BLASTP was conducted for all amino acid sequences with a cut-off *E*-value of 1E−5, and finally, orthologous pairs were constructed from the BLASTP results with OrthoMCL (v2.0.3) (Li et al., 2003) with default settings. The ratio of the number of nonsynonymous substitutions per nonsynonymous site (Ka) to the number of synonymous substitutions per synonymous site (Ks) was used to test for positive selection. Ka/Ks calculation was performed with the PAML package (Yang, 2007) with default settings. Orthologs with Ks > 0.1 were excluded from further analysis to avoid potential paralogs (Elmer et al., 2010). A Ka/Ks ratio greater than 1 usually indicates genes evolving under positive selection (divergent), while a Ka/Ks ratio less than 0.1 indicates genes under strong purifying selection (conserved).
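The filtering and interpretation rules in this paragraph can be summarized as a small classifier. This is a sketch with hypothetical names; it takes Ka and Ks as already estimated (for example by PAML) and does not compute them. Excluding pairs with Ks of zero, for which the ratio is undefined, is an assumption added here.

```python
def classify_orthologs(pairs, ks_max=0.1):
    """Sort ortholog pairs by Ka/Ks.

    pairs: iterable of (pair_id, ka, ks) tuples.
    Pairs with Ks above ks_max are treated as potential paralogs.
    """
    result = {"excluded": [], "positive": [], "conserved": [], "other": []}
    for pair_id, ka, ks in pairs:
        # Assumption: a zero Ks leaves the ratio undefined, so the pair is dropped.
        if ks <= 0 or ks > ks_max:
            result["excluded"].append(pair_id)
            continue
        ratio = ka / ks
        if ratio > 1:
            result["positive"].append(pair_id)
        elif ratio < 0.1:
            result["conserved"].append(pair_id)
        else:
            result["other"].append(pair_id)
    return result
```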

# RESULTS

# Growth Comparison

Growth of the "breed" and "wild" oysters was compared at 6 months of age, as shown in **Table 1**. The "breed" oysters showed a significant growth advantage over the unselected "wild" oysters in all quantified growth-related traits, including shell height, shell length, shell width, and body weight. In addition, the growth-related traits of "breed" oysters were relatively uniform, as indicated by the smaller variation in phenotypic traits (**Table 1**).

TABLE 1 | Growth comparison between "breed" and "wild" populations of the Pacific oysters.


*\*\*means the difference is significant at the 0.01 level (P < 0.01). Values are means ± SD, n = 100.*

# Transcriptome Sequencing and Mapping

A total of 300.9 million clean reads were obtained after trimming over 307 million 150-bp paired-end raw reads, with Q20 varying from 97.2% to 97.7%. The total bases of clean reads generated from each sample ranged from 7.0 to 8.2 Gb, which is about 15× of the oyster genome size. For the six samples, 79.3–82.2% of the total clean reads were aligned to the genome, of which 71.0–73.1% had a unique alignment and 8.1–9.1% had multiple alignment positions on the genome (**Table 2**). The abundance of transcript sequences for all gene models (35,362) was normalized and calculated by FPKM method using uniquely mapped reads. Nearly half (39.2–47.7%) of the genes were considered not to be expressed or expressed at very low levels (0 < FPKM < 1), and less than 4% (3.4–3.7%) were highly expressed (FPKM > 60). The correlation of gene expression among biological replicates was reasonably high with Pearson's *R*<sup>2</sup> values greater than 0.8 for all samples (**Supplementary Figure 1**).
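As a reminder of what the FPKM values above represent, the normalization can be written out directly. The function names and the label for the intermediate bin are illustrative; the thresholds of 1 and 60 are the ones used in this paragraph.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def expression_bin(value):
    """Bin an FPKM value using the cut-offs quoted in the text."""
    if value < 1:
        return "low_or_not_expressed"
    if value > 60:
        return "high"
    return "intermediate"
```

For instance, 500 fragments on a 2,000-bp gene model in a library of 25 million uniquely mapped fragments correspond to an FPKM of 10.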

# Analysis of Differentially Expressed Genes

A total of 1,303 differentially expressed genes (DEGs) were identified between the "breed" and "wild" Pacific oysters, of which 888 genes were expressed at higher levels in "breed" oysters while 415 genes were expressed at higher levels in the unselected "wild" oysters (**Figure 1** and **Supplementary Table 2**). The number of genes expressed at higher levels in "breed" oysters is significantly larger than that in the "wild" oysters.

Twelve DEGs were selected for qRT-PCR validation, and the results were compared with those from RNA-Seq data analysis. The results showed that expression levels of most genes detected by qRT-PCR were consistent with the results as determined based on RNA-Seq analysis, with the exception of *MFAP4*, which showed a similar expression pattern but a significantly different degree of fold change between RNA-Seq and qRT-PCR (**Figure 2**).

To further understand the biological meanings of these DEGs, gene ontology (GO) term enrichment analysis (*P* ≤ 0.05) was performed. For the 888 genes expressed at higher levels in "breed" oysters, the most significantly enriched GO terms were "microtubule-based movement" in the biological process (BP), "microtubule motor activity" in the molecular function (MF), and "dynein complex" in the cellular component (CC) (**Figure 3A** and **Supplementary Table 3**). Microtubule-related terms were therefore the most enriched in each of the three GO categories among the DEGs expressed at higher levels in the fast-growing "breed" oysters. Other significantly enriched GO terms associated with microtubules or cell movement include "movement of cell or subcellular component," "microtubule-associated complex," "motor activity," and "microtubule-based process" (**Figure 3A**). A total of 42 microtubule-related genes expressed at higher levels in the "breed" group were identified. For example, *C1ql4* (LOC105334943) showed 16.7-fold, *DNAH5* (LOC105330782) showed 3.5-fold, and *KIF12* (LOC105329973) showed 2.8-fold higher expression in "breed" than in "wild" oysters (**Supplementary Table 4**). In addition, genes involved in the biosynthesis and metabolism of nucleotide compounds (GTP, UTP, and CTP), ribonucleotide (pyrimidine ribonucleotide), nucleoside triphosphate (pyrimidine nucleoside triphosphate and pyrimidine ribonucleoside triphosphate), and nucleoside (pyrimidine nucleoside and pyrimidine ribonucleoside) were also highly enriched in the DEGs that were expressed at higher levels in the "breed" oysters (**Figure 3A**). A total of four DEGs were involved in these pathways, including *NME5* (2.2-fold), *NME7* (2.3-fold), *CiIC3* (2.8-fold), and LOC105346007 (2.4-fold), suggesting that activation of cell movement, microtubule, dynein, and nucleoside compound-related genes may be associated with growth of the Pacific oyster.
In addition, a total of 258 DEGs expressed at higher levels in the "breed" oysters were enriched in "protein binding" pathway, with the number of enriched genes far more than that of other pathways (**Figure 3A**).

For the DEGs that were expressed at higher levels in "wild" oysters, the significantly enriched GO terms include "chitin metabolic process," "glucosamine-containing compound metabolic process," "chitin binding," "amino sugar metabolic process," and "aminoglycan metabolic process" (**Figure 3B**), of which a total of 11 genes were involved, including *Col14a1*, *CHIA*, *EXT*, *ITIH3*, and other seven uncharacterized genes (**Supplementary Table 5**).

TABLE 2 | Summary of RNA sequencing data and statistics of read mapping to the Pacific oyster genome assembly.


KEGG enrichment analysis of these DEGs was performed to further determine the metabolic processes and signal transduction pathways involved. The results revealed that the DEGs were significantly enriched in 25 pathways, such as "phototransduction," "long-term potentiation," "vascular smooth muscle contraction," "calcium signaling pathway," "phosphatidylinositol signaling system," "gastric acid secretion," "salivary secretion," "adrenergic signaling in cardiomyocytes," and "ABC transporters" (**Figure 4** and **Supplementary Table 6**). A total of 23 genes with known functions associated with growth regulation were found in these significantly enriched KEGG pathways, and these growth-related DEGs were categorized into different gene families (**Supplementary Table 7**), of which 15 genes were associated with the calcium signaling pathway (i.e., *CaM*, *CML*, *CAMK*, *CALCRL*, and *SLC8A*) and two genes were related to actin activity (i.e., *Actin1* and *Actin2*).
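Pathway enrichment of this kind is commonly assessed with a one-sided hypergeometric test: given a background of annotated genes, how likely is it to see at least the observed number of DEGs in a pathway by chance? The sketch below shows that calculation with hypothetical names. Whether it matches the exact test configuration used in KOBAS for this study is an assumption, and the resulting *P*-values would still need multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, deg_count, deg_in_pathway):
    """Upper-tail hypergeometric probability P(X >= deg_in_pathway).

    total_genes:    annotated background genes
    pathway_genes:  background genes annotated to the pathway
    deg_count:      DEGs drawn from the background
    deg_in_pathway: DEGs that fall in the pathway
    """
    upper = min(pathway_genes, deg_count)
    denom = comb(total_genes, deg_count)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, deg_count - k)
        for k in range(deg_in_pathway, upper + 1)
    )
    return tail / denom
```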

# Analysis of Alternative Splicing

A total of 22,573 AS events were identified from 8,176 genes in all six samples, indicating that nearly 24.3% of multi-exonic genes were alternatively spliced. The AS events were categorized into five types, of which SE and RI events were the most and least frequent, accounting for 75.2% (16,974) and 1.1% (246), respectively (**Table 3** and **Supplementary Figure 2**). The results are consistent with previous studies in animals (Wang et al., 2008) but contrast with those reported in plants (Marquez et al., 2012; Shen et al., 2014b). To investigate the potential effects of AS on cellular processes related to growth, we identified a total of 3,230 differential alternative splicing (DAS) events from 1,818 genes between "breed" and "wild" oysters. To determine the association of the DAS events with gene expression, the DAS genes were compared with the DEGs. Only a small subset of DAS genes (175 genes, 9.6%) was differentially expressed between the two groups (**Supplementary Figure 2**).

Based on GO and KEGG analysis, we found that the DAS genes were significantly enriched in five specific functional pathways, among which "long-term potentiation," "phosphatidylinositol signaling system," "salivary secretion," and "ABC transporters" were also identified as significantly enriched pathways in the DEG analysis (**Figure 5**). A total of 68 genes were involved in these enriched pathways, some of which were from the same gene families, such as calmodulin (CaM, three genes), 1-phosphatidylinositol 4,5-bisphosphate phosphodiesterase (PLC, five genes), inositol 1,4,5-trisphosphate receptor (ITPR, four genes), diacylglycerol kinase (DGK, five genes), and ATP-binding cassette (ABC) transporters (13 genes) (**Supplementary Table 8**), among which calmodulins were also identified as DEGs in the differential expression analysis described above. Notably, a total of 13 genes belonging to the ABC transporter gene family were enriched in this analysis, indicating a potentially critical role of ABC transporters in the growth of the oysters.

# Identification of Positively Selected Genes During Artificial Selection

In order to identify the positively selected genes during artificial selection of the "breed" oysters, we performed *de novo* transcriptome assemblies with the RNA-Seq data generated for "breed" and "wild" oysters, respectively. This yielded a total of 273,500 transcript sequences for the "breed" oyster group and 241,347 transcript sequences for the "wild" oyster group.



*\*The total gene rate is higher than 100% because one gene can experience two or more AS events. Abbreviations: SE, skipped exon; A5SS, alternative 5*′ *splice site; A3SS, alternative 3*′ *splice site; MXE, mutually exclusive exons; RI, retained intron.*

By choosing the longest transcript to represent the gene when the genes had multiple transcript sequences assembled, the transcriptome assembly provided 194,978 unigenes for "breed" and 172,863 unigenes for "wild," respectively (**Supplementary Table 9**). Annotation of the unigenes against the public databases including Nr, Nt, KOG, KO, Swiss-Prot, GO, and Pfam provided a total of 70,419 (36.1%) unigenes from "breed" and 67,173 (38.9%) unigenes from "wild" (**Supplementary Table 10**).

A total of 5,453 pairs of putative orthologs were identified between "breed" and "wild" oysters, of which 3,328 ortholog pairs with all nonsynonymous substitutions and synonymous substitutions were used for calculating Ka/Ks ratios, and the results revealed that 1,198 pairs had Ks > 0.1 that were determined as potential paralogs. After removal of potential paralogs, 2,130 pairs of orthologs were finalized with mean Ka of 0.0097, mean Ks of 0.0304, and mean Ka/Ks ratio of 0.359. A total of 589 ortholog pairs with a Ka/Ks ratio > 1 were identified (**Figure 6**), which might have experienced or be experiencing positive selection during artificial selection.

KEGG pathway analysis of the 589 positively selected genes showed that genes related to ribosomal proteins were greatly divergent between the "breed" and "wild" Pacific oysters. These ribosomal protein-related genes include *RP-L24e*, *RPL24*, *RPS18*, *RP-S3Ae*, *RPS3A*, *RP-L21e*, *RPL21*, *RP-L30*, *MRPL30*, *rpmD*, *RP-L36E*, *RPL36*, *RP-S28e*, and *RPS28* (**Figure 7**).

[Figure caption, truncated] … indicate orthologous gene pairs identified with Ka/Ks ratio > 1, while the dots between black and gray lines indicate orthologous gene pairs identified with Ka/Ks ratio 0.5–1.

# DISCUSSION

Growth is a complex trait that involves a variety of cellular processes and is regulated by multiple biological pathways. Growth rate is heavily affected by environmental variables, especially in aquatic animals inhabiting highly variable water environments. Fast-growing varieties generated by selective breeding provide good material with a similar genetic background but contrasting phenotypes for the genetic dissection of growth traits. We initiated a selective breeding program for the Pacific oyster in 2006. By 2017, the selectively bred lines had undergone 10 successive generations of intensive artificial selection for fast growth. A substantial enhancement of growth has been achieved, as indicated by growth trial experiments, while the effects of artificial selection on the Pacific oyster genome remain unexplored. In this study, we used the selectively bred fast-growing oysters as research material to investigate the molecular basis of growth in the Pacific oyster.

We performed a comparative transcriptome analysis of the fast-growing selectively bred oysters and the unselected wild oysters. We identified a total of 1,303 protein-coding genes that were differentially expressed (DEGs) between fast-growing oysters and wild controls. Functional analysis of the DEGs showed that microtubule, cell movement, and nucleotide compound-related genes were expressed at significantly higher levels in the fast-growing oysters. Microtubules are reported to be essential for proper cell division and cell expansion (Maiato and Sunkel, 2004; Bichet et al., 2008; Jiang et al., 2015). The microtubule-associated proteins (*dynein* and *kinesin*), as well as many other microtubule-related proteins (*C1ql4*, *Cas8*, and *Ift46*), were found to be expressed at higher levels in the selectively bred Pacific oysters. The higher expression of kinesin genes such as *Kif9* and *Kif12* in the fast-growing oysters is consistent with observations in previous studies that the expression of *Kif9* and *Kif12* was positively correlated with cell division and cell growth (Gong et al., 2009; Andrieu et al., 2012). These results indicate that microtubule- and cell-movement-related genes probably play critical roles in growth regulation in the Pacific oyster.

Cell movement plays an important role in the growth and development of organisms, participating in embryonic development and wound healing. Cell movement is driven by the physical forces generated by the cytoskeleton (composed of microfilaments, intermediate filaments, and microtubules) and requires the participation of many other proteins (Ananthakrishnan and Ehrlicher, 2007). The organization, dynamics, and transport processes of the cytoskeleton involve three types of molecular motors: myosin (which transports cargo along actin filaments) and kinesin and dynein (which transport cargo along microtubules) (Reddy and Day, 2001). As revealed in this study, along with the enhanced expression of *kinesin* and *dynein*, myosin genes, including *Myo3a* and *Myo3b*, were also expressed at higher levels in the fast-growing selectively bred Pacific oysters. Together, the higher expression of cytoskeleton- and cell-movement-related genes in the selectively bred oysters indicates that enhanced division and movement of cells is probably associated with the fast growth of the Pacific oyster.

The higher expression of genes associated with nucleotide compounds (GTP, UTP, and CTP), pyrimidine ribonucleotide, and nucleoside (pyrimidine nucleoside and pyrimidine ribonucleoside) in the fast-growing oysters suggests the involvement of nucleotide biosynthesis and metabolism in growth regulation. Nucleotides carry packets of chemical energy in the form of the nucleoside triphosphates (ATP, GTP, CTP, and UTP) and play an important role in metabolism at the cellular level, such as the synthesis of amino acids and proteins, movement of the cell and cell parts, and division of the cell (Pedley and Benkovic, 2017). The enhanced expression of nucleotide metabolism-related genes may therefore contribute to increasing the efficiency of protein synthesis and cell division for enhanced growth performance.

Gene functional annotation analysis showed that both differentially expressed genes and alternatively spliced genes were significantly enriched in the long-term potentiation, phosphatidylinositol signaling system, ABC transporters, and salivary secretion pathways. In the long-term potentiation pathway, the differentially expressed genes are mainly calmodulin kinase and its regulators. When long-term potentiation increases, the binding efficiency of Ca<sup>2+</sup> to calmodulin increases, raising the levels of CaMK II and CaMK IV. Then, EPK is activated, which promotes increased synthesis of synapse growth protein (Silva et al., 1992; Strack et al., 1997). In this process, calcium ions and calmodulin play critical regulatory roles. Calcium ions carry out their functions by binding to specific calcium receptors or calcium-binding proteins (CaBPs). Genes associated with calcium ion regulation were found to be differentially expressed between the "breed" and "wild" groups of the Pacific oysters (**Supplementary Table 7**). For example, calmodulin-related genes such as *CaM* (LOC105328007), *CML12* (LOC105319978), *CAMK* (LOC105335050), and *SLC8A* (LOC105340116) were expressed 4.5-, 3.2-, 3.0-, and 3.1-fold higher in the "breed" than in the "wild" oysters, respectively. In contrast, *CALCRL* (LOC105320473) was expressed 3.3-fold higher in the unselected "wild" oysters (**Supplementary Table 7**). In the phosphatidylinositol signaling pathway, external signaling molecules bind to G protein-coupled receptors (GPCRs) to activate phospholipase C (PLC), which decomposes PIP2 into IP3 and DG and finally activates protein kinase C (PKC) to generate cellular responses, including cell secretion, cell proliferation, and differentiation.

Besides the altered expression patterns of genes between "breed" and "wild" oysters, the effects of artificial selection on these protein coding genes are also of importance. Enrichment analysis of positively selected genes between the "breed" and "wild" groups of the Pacific oysters showed that genes related to the biosynthesis of ribosomal proteins were significantly divergent during the artificial selection process. Ribosomal proteins are crucial for the growth and development of the organisms (Xie et al., 2009; Baloglu et al., 2015). In the larval Pacific oysters, a previous study reported that ribosomal protein-related genes were essentially involved in growth heterosis (Hedgecock et al., 2007b). The divergence of ribosomal protein genes may be associated with differential efficiency of transcription and protein biosynthesis, eventually resulting in growth phenotypic difference between "breed" and "wild" oysters. However, this observation requires future investigation.

# CONCLUSION

To unravel the molecular basis for fast growth of the selectively bred Pacific oyster, we performed a comparative transcriptome analysis of the fast-growing "breed" and the unselected "wild" Pacific oysters in terms of gene expression, alternative splicing, and molecular evolution. The most significant outcome is the identification of potential growth-related genes in the Pacific oyster. Further functional analysis revealed that genes involved in microtubule motor activity and in the biosynthesis of nucleotides and proteins are likely to be important for oyster growth. Transcriptome-wide analysis of positively selected genes highlighted ribosomal protein genes, which further suggests that protein biosynthesis may be a key biological process underlying the growth difference between selectively bred oysters and unselected wild oysters. This study provides valuable resources for further investigations of growth regulation mechanisms and will support breeding applications that integrate fast growth with other superior traits in the Pacific oyster.

# DATA AVAILABILITY STATEMENT

The Pacific oyster reference genome and gene model annotation files used in this study were downloaded from the NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Crassostrea\_gigas). All raw RNA-Seq data have been deposited in the NCBI Sequence Read Archive with BioProject accession no. PRJNA524442 (sequence accessions: SRR9089186-SRR9089191).

# AUTHOR CONTRIBUTIONS

SL conceived and designed the study. FZ, BH, and HF collected the samples and executed the experiments. FZ, BH, HF, ZJ, and SL analyzed the data. FZ drafted the manuscript, and SL revised the manuscript. QL provided reagents and materials and supervised the study. All authors have read and approved the final version of the manuscript.

# FUNDING

This study was supported by the grants from National Natural Science Foundation of China (31741122 and 31802293), Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology (2017-2A04), China Postdoctoral Science Foundation (2017M622283), and the Fundamental Research Funds for the Central Universities (201812013).

# ACKNOWLEDGMENTS

We are grateful to Chengxun Xu, Ziqiang Han, and the oyster breeding team of our research group for their assistance in maintaining and collecting the samples used in this work.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00610/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Zhang, Hu, Fu, Jiao, Li and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Patterns of Geographical and Potential Adaptive Divergence in the Genome of the Common Carp (*Cyprinus carpio*)

*Jian Xu1†\*, Yanliang Jiang1†, Zixia Zhao1, Hanyuan Zhang1, Wenzhu Peng2, Jianxin Feng3, Chuanju Dong4, Baohua Chen2, Ruyu Tai1 and Peng Xu2,5\**

*1 Key Laboratory of Aquatic Genomics, Ministry of Agriculture, CAFS Key Laboratory of Aquatic Genomics and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, Beijing, China, 2 State Key Laboratory of Marine Environmental Science, College of Ocean and Earth Sciences, Xiamen University, Xiamen, China, 3 Henan Academy of Fishery Science, Zhengzhou, China, 4 College of Fishery, Henan Normal University, Xinxiang, China, 5 Laboratory for Marine Biology and Biotechnology, Pilot National Laboratory for Marine Science and Technology, Qingdao, China*

### *Edited by:*

*Paulino Martínez, University of Santiago de Compostela, Spain*

### *Reviewed by:*

*Shaojun Liu, Hunan Normal University, China Fabyano Fonseca Silva, Universidade Federal de Viçosa, Brazil*

### *\*Correspondence:*

*Jian Xu xuj@cafs.ac.cn Peng Xu xupeng77@xmu.edu.cn*

*†These authors have contributed equally to this work.*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 13 December 2018 Accepted: 24 June 2019 Published: 12 July 2019*

### *Citation:*

*Xu J, Jiang Y, Zhao Z, Zhang H, Peng W, Feng J, Dong C, Chen B, Tai R and Xu P (2019) Patterns of Geographical and Potential Adaptive Divergence in the Genome of the Common Carp (Cyprinus carpio). Front. Genet. 10:660. doi: 10.3389/fgene.2019.00660*

The common carp, *Cyprinus carpio*, is a cyprinid fish species cultured in Europe and Asia. It accounts for >70% of freshwater aquaculture production worldwide. We conducted a population genomics analysis on *C. carpio* using high-throughput SNP genotyping of 2,198 individuals from 14 populations worldwide to determine the genetic architecture of common carp populations and the genetic bases for environmental adaptation. Structure analyses including phylogeny and principal component analysis were also conducted, showing distinct geographical patterns in European and Asian populations. The linkage disequilibrium block average lengths of the 14 populations ranged from 3.94 kb to 36.67 kb. Genes within selective sweep regions were identified by genome scanning among the different populations, including *gdf6a*, *bmpr1b*, and *opsin5*. Gene Ontology and KEGG enrichment analyses revealed potential trait-related loci and genes associated with body shape, scaling patterns, and skin color. This population genomics analysis may provide valuable clues for future genome-assisted breeding of *C. carpio*.

Keywords: common carp, population genomics, linkage disequilibrium, haplotype, selective sweep

# INTRODUCTION

The common carp, *Cyprinus carpio*, is one of the most important cyprinid species due to its food value and complex paleotetraploidized genome. It is cultured in >100 countries, and the annual global production of *C. carpio* is >4.56 million metric tons. This is approximately 10% of the global freshwater aquaculture production (FAO Fisheries and Aquaculture Department; Bostock et al., 2010). Common carp provide high-value protein as food, and some strains, such as koi, are popular ornamental fish. Common carp have been cultured for several thousand years. Domesticated common carp differ from their wild ancestor in morphological, behavioral, physiological, and reproductive traits (Balon, 1995). For example, wild carp have an elongated body with full scale cover, while domesticated carp usually have a much deeper body with four scale patterns: 1) leather, with no scales; 2) line, with large scales along the lateral line; 3) mirror, with a small number of large scattered scales; and 4) fully scaled. Domesticated carp are generally more capable of coping with extreme environments than their wild ancestor (Balon, 1995). Genetic evidence indicates that all common carp populations originate from two ancestral forms of wild carp, the European subspecies (*C. c. carpio*) and the East Asian subspecies (*C. c. haematopterus*) (Chistiakov and Voronova, 2009). The validity of a third subspecies, *C. c. rubrofuscus*, is questionable; it may have diverged from *C. c. haematopterus* (Zhou et al., 2004; Wang et al., 2010; Kohlmann and Kersten, 2013).

During its domestication, common carp has been introduced into many areas. Common carp ancestors have been subjected to genetic interventions and to natural and artificial selection. These factors, combined with the accumulation of mutations and long-term geographical isolation, have produced many varieties of common carp with distinct skin color, body shape, scale pattern, body size, and stress tolerance. Human transport of carp to different geographical locations has generated high levels of gene flow (Wang et al., 2010). Hybrid breeding of carp in China has been common over the last 50 years and has resulted in many varieties or strains, such as Jian carp (JIAN). Multiple rounds of hybridization and genetic introgression were employed during hybrid breeding (Dong et al., 2015). Consequently, the genetic backgrounds of most common carp populations are unknown, especially when breeding history is inadequately recorded or missing. Various genetic tools and molecular markers have been developed and used for studying the phylogenetic relationships among populations and the genetic architecture of populations. These include random amplified polymorphic DNA, amplified fragment length polymorphism, restriction fragment length polymorphism, mitochondrial DNA, and microsatellites (Bartfai et al., 2003; Zhou et al., 2003; Mabuchi et al., 2008; Cheng et al., 2010). However, due to the limited resolution of these genetic markers, many phylogenetic relationships remain uncertain. For example, the origin of and relationships among Hebao carp (HB), Xingguo Red carp (XG), Songpu carp (SP), Oujiang color carp (OUJ), and Koi carp (KOI) are controversial (Balon, 1995; Froufe et al., 2002; Wang and Li, 2004a; Wang and Li, 2004b). Previous studies reported different phylogenetic patterns using both nuclear genome (Xu et al., 2014) and mitochondrial genome sequences (Dong et al., 2015; Liu et al., 2019); however, because of the limited number of samples, these results could not be robustly validated.
In addition, the genomic basis of local adaptation shaped by natural selection is still largely unknown. Selective sweep analysis is an effective approach for identifying trait-related genes under natural selection or domestication. Xu et al. reported selective sweeps in the HB population compared with the SP population and identified *fgfr1a1* in the selected regions (Xu et al., 2014). Another study, on alkaline adaptation in Amur ide, also used the selective sweep method and revealed dozens of ion transport-related genes involved in osmoregulation and pH regulation (Xu et al., 2017). Given the high genetic diversity of the populations, larger samples would make the identification of genes in selective sweep regions more convincing.

With the rapid development of sequencing technologies, high-throughput genetic markers, such as single-nucleotide polymorphisms (SNPs), have been used in population genetics. Many studies have demonstrated that SNP arrays can improve the resolution of the differentiation of genetic stocks (Perez-Enriquez et al., 2018; Torati et al., 2019). SNP assays are a useful tool for studying population structure and the effects of natural and artificial selection at the genome scale. For example, the Atlantic salmon SNP array that contained 6,176 informative SNPs was used to genotype 38 anadromous and freshwater wild populations (Bourret et al., 2013). The data illustrated the genetic architecture in salmon and showed the adaptive divergence of SNP allele frequencies across populations and among regional groups. Bradbury et al. applied an SNP array to Atlantic cod and showed an association between SNP allele frequencies and water temperatures across the species range (Bradbury et al., 2011). Jones et al. developed an SNP array to study geographic patterns of genetic variation in stickleback. Substantial genetic variation was found in 34 populations, with predominant patterns reflecting demographic history and geographic structure. Genome regions contributing to the evolution of marine–freshwater or benthic–limnetic species pairs were identified (Jones et al., 2012).

Many SNP markers have been identified from common carp (Kongchum et al., 2010; Xu et al., 2014), and a high-throughput 250 K common carp SNP array has been developed (Xu et al., 2014). The entire genome sequence of common carp was published in 2014 (Xu et al., 2014). In the present study, genome-wide SNP genotyping was conducted to determine the genetic architecture of common carp populations and the genetic bases for environmental adaptation. A total of 2,198 samples were successfully genotyped with high quality (see Materials and Methods). These samples belonged to 14 different populations, including Yellow River carp (YR), HB, Xingguo carp (XG), OUJ, KOI, Qingshuijiang carp (QSJ), JIAN, Songhe carp (SH), SP, Heilongjiang carp (HLJ), Danube carp (DANU), Szarvas 22 carp (SZ), Tisza carp (TZ), and a population from the USA (AME). YR are mainly cultured along the Yellow River basin of China; HB, XG, and OUJ are mainly cultured in the south of China; SH, SP, and HLJ are mainly cultured in the north of China; and QSJ is mainly cultured in the southwest of China. DANU, SZ, and TZ were collected from Europe, while AME was collected from Alabama in the USA. The phenotypic traits of these common carp populations differ from each other. Most common carp are black or gray, but HB and XG are red. KOI and OUJ have various skin colors and patterns, including white, black, red, yellow, blue, and cream. This study examined the molecular mechanisms underlying trait differences among these common carp populations.

# MATERIALS AND METHODS

# Sample Collection

The 14 populations of *C. carpio* (2,198 individuals) were randomly collected across Europe, North America, and China. DANU, TZ, and Szarvas 22 (SZ) were collected from the carp live gene bank of the Research Institute for Fisheries, Aquaculture and Irrigation of Hungary (HAKI). North American carp (AME) were collected from the Chattahoochee River in Alabama in the USA. Ten other populations were sampled from China, namely, the YR from Zhengzhou of Henan Province, HB from Wuyuan of Jiangxi Province, XG from Xingguo of Jiangxi Province, OUJ from Oujiang of Zhejiang Province, KOI from Beijing, QSJ from Guiyang of Guizhou Province, JIAN from Wuxi of Jiangsu Province, SH and SP from Harbin of Heilongjiang Province, and HLJ from Mudanjiang of Heilongjiang Province. The numbers of samples from each population are shown in **Table S1**.

# DNA Extraction, Genotyping, and Quality Control

Genomic DNA was extracted from blood or fin samples using a DNeasy 96 Blood & Tissue Kit (Qiagen, Shanghai, China) following the manufacturer's protocol. Extracted DNA was quantified by a Nanodrop-1000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA). DNA integrity was examined on a 1.0% agarose gel by electrophoresis. DNA was diluted to a final concentration of 50 ng/μl, and 2 μg per sample was used for genotyping. The common carp 250-K SNP array was developed using Affymetrix Axiom genotyping technology. Genotyping was performed by GeneSeek (Lincoln, Nebraska, USA). After genotyping, PLINK v1.9 software (https://www.cog-genomics.org/plink2) was used for quality control (Chang et al., 2015). SNPs with low call rate (< 95%) or low minor allele frequency (MAF < 5%) were excluded, and samples with <90% genotyping rate were filtered out. The filtered genotype file was uploaded into the European Nucleotide Archive database (https://www.ebi.ac.uk/ena/data/view/PRJEB33066).
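The three quality-control thresholds above can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study: the function name `qc_filter` and the genotype encoding (allele counts 0/1/2 with NaN for missing calls) are assumptions made for illustration, and the sketch evaluates all filters on the unfiltered matrix rather than sequentially.

```python
import numpy as np

def qc_filter(genotypes, snp_call_rate=0.95, min_maf=0.05, sample_call_rate=0.90):
    """Return boolean masks (keep_samples, keep_snps).

    genotypes: samples x SNPs array of allele counts (0, 1, 2), NaN = missing.
    Hypothetical helper; thresholds default to those stated in the text.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # Fraction of samples with a genotype call at each SNP.
    snp_cr = called.mean(axis=0)

    # Frequency of the counted allele among called genotypes, then fold to MAF.
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.nansum(g, axis=0) / (2 * called.sum(axis=0))
        maf = np.minimum(freq, 1 - freq)
        keep_snps = (snp_cr >= snp_call_rate) & (maf >= min_maf)

    # Fraction of SNPs with a genotype call in each sample.
    keep_samples = called.mean(axis=1) >= sample_call_rate
    return keep_samples, keep_snps
```

A SNP with no calls at all yields an undefined MAF and is dropped, since the comparison with NaN is false.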

# Phylogeny, Principal Component Analysis, and Genetic Structure

A maximum-likelihood tree was constructed by RAxML with 1,000 bootstraps (Stamatakis, 2014), and the tree was displayed with iTOL software (http://itol.embl.de/upload.cgi). Principal component analysis (PCA) was conducted using GATK software (Van der Auwera et al., 2013), and all of the SNPs were used to investigate the population structure using Structure 2.3.1 software with 2,000 iterations and the MCMC model (Falush et al., 2007). The optimal *K* value was selected by Delta *K* method (Evanno et al., 2005). The resulting structure matrix was plotted using StructurePlot 2.0 software (Ramasamy et al., 2014).
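As an illustration of the Delta *K* criterion, the sketch below computes Delta *K* from ln P(D) values of replicate STRUCTURE runs. It assumes that several replicate runs per *K* are available (the number of replicates is not stated here), uses the mean ln P(D) per *K* for the second-order difference, and the function name `delta_k` and the input layout are hypothetical.

```python
import statistics

def delta_k(lnp):
    """Delta K per K from replicate ln P(D) values.

    lnp: dict mapping K -> list of ln P(D) values from replicate runs
         (assumed layout; at least two replicates per K).
    Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), with L the mean ln P(D).
    """
    mean = {k: statistics.mean(v) for k, v in lnp.items()}
    out = {}
    for k in sorted(lnp):
        # Delta K is only defined where both neighbouring K values were run.
        if k - 1 in lnp and k + 1 in lnp:
            sd = statistics.stdev(lnp[k])
            if sd > 0:
                out[k] = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1]) / sd
    return out

# Usage with made-up values: the K with the largest Delta K is selected.
example = {1: [-1000, -1002], 2: [-900, -902], 3: [-890, -892], 4: [-889, -891]}
best_k = max(delta_k(example), key=delta_k(example).get)
```

Delta *K* cannot be evaluated for the smallest and largest *K* tested, so the tested range should extend beyond the expected optimum on both sides.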

# LD Decay and Haplotype Construction

Linkage disequilibrium (LD) decays for the main populations and all of the samples were calculated within a range of 50 kb using PLINK (Chang et al., 2015). The average *R*<sup>2</sup> value of each 1 kb region was calculated (Zhou et al., 2016; Xu et al., 2017). All of the *R*<sup>2</sup> values were then plotted against the physical distances of SNPs in units of kb. Haplotype blocks in different populations were identified by PLINK software using the "--blocks" parameter.
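The binning step can be sketched as follows. This is an illustrative approximation rather than the PLINK procedure: *R*<sup>2</sup> is taken as the squared Pearson correlation of allele counts between two SNPs on the same linkage group, genotypes are assumed to be complete (no missing calls), and the function name `ld_decay` is hypothetical.

```python
import numpy as np
from collections import defaultdict

def ld_decay(genotypes, positions, max_dist=50_000, bin_size=1_000):
    """Mean r^2 per distance bin for SNP pairs on one linkage group.

    genotypes: samples x SNPs array of allele counts (0, 1, 2), no missing data.
    positions: physical positions (bp) of the SNPs, same order as the columns.
    Returns {bin_start_bp: mean r^2}.
    """
    g = np.asarray(genotypes, dtype=float)
    pos = [int(p) for p in positions]
    sums, counts = defaultdict(float), defaultdict(int)
    polymorphic = g.std(axis=0) > 0  # correlation is undefined for fixed SNPs

    for i in range(g.shape[1]):
        for j in range(i + 1, g.shape[1]):
            dist = abs(pos[j] - pos[i])
            if dist > max_dist or not (polymorphic[i] and polymorphic[j]):
                continue
            r = np.corrcoef(g[:, i], g[:, j])[0, 1]
            b = dist // bin_size
            sums[b] += r * r
            counts[b] += 1

    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}
```

Plotting the returned bin starts (converted to kb) against the mean values gives a decay curve of the kind described above. The pairwise loop is quadratic in the number of SNPs, so it is only suitable for small illustrative inputs.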

# Calculation of **π** Ratio, Fst, Tajima's *D*, and Identification of Selective Signatures

We calculated the π distribution for each linkage group using a sliding window method in Vcftools. The window width was set to 100 kb, and the step size was 100 kb. The π values from the main populations were compared, and the ratios were sorted. Fst and Tajima's *D* values were also calculated using Vcftools with the parameters "--weir-fst-pop" and "--TajimaD," respectively. We identified the regions with the 5% highest π ratios and the regions with the 5% highest Fst values. Together with regions identified on the basis of the above two thresholds, genes within selective sweeps were annotated using GOEAST for Gene Ontology (GO) (Young et al., 2010) and the DAVID software (Dennis et al., 2003) for KEGG pathway analysis. Scatter plots of π and Fst values were generated using the ggplot2 package of the Comprehensive R Archive Network (http://cran.r-project.org/package=ggplot2).
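The combination of the two empirical thresholds can be sketched in Python as below. The sketch assumes that per-window π values for the two populations and per-window Fst values have already been exported (for example, from Vcftools output) as aligned arrays, and that windows with π = 0 in the target population have been removed beforehand; the function name `sweep_windows` is hypothetical.

```python
import numpy as np

def sweep_windows(pi_reference, pi_target, fst, top=0.05):
    """Indices of windows in the top fraction for both the pi ratio and Fst.

    pi_reference, pi_target, fst: aligned per-window arrays (assumed inputs).
    The pi ratio is pi_reference / pi_target, so high values indicate
    reduced diversity in the target population.
    Returns (window_indices, pi_ratio_cutoff, fst_cutoff).
    """
    ratio = np.asarray(pi_reference, dtype=float) / np.asarray(pi_target, dtype=float)
    fst = np.asarray(fst, dtype=float)

    # Empirical cutoffs: the (1 - top) quantile of each genome-wide distribution.
    ratio_cut = np.quantile(ratio, 1 - top)
    fst_cut = np.quantile(fst, 1 - top)

    both = (ratio >= ratio_cut) & (fst >= fst_cut)
    return np.flatnonzero(both), ratio_cut, fst_cut
```

Genes overlapping the returned windows would then be the candidates passed on to GO and KEGG annotation.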

# RESULTS AND DISCUSSION

# Sample Collection and Genotyping

A total of 14 populations of *C. carpio* were collected from 13 locations in China, Hungary, and the USA (**Figure 1**). In addition to the geographic divergences of these populations, several populations were included due to their special biological features, such as scale pattern (SP population), red body color (HB population, XG population, and QH group in the OUJ population), and purse-like body shape (HB population). After DNA extraction and SNP genotyping using Carp 250-K SNP array, a raw genotype database of 222,694 SNPs for 2,198 samples was generated. A total of 2,198 samples with 134,719 polymorphic SNPs passed the quality control threshold and were used for further analysis. Sample information for each population is shown in **Table S1**.

# Phylogeny, PCA, and Population Structure

To investigate the divergence of the representative *C. carpio* populations from different locations, we constructed a phylogenetic tree using the whole genotyping dataset (**Figure 2A**). The Asian and European populations formed two distinct clades, while the AME population grouped with the European clade. The SP population constituted the major part of the European clade as it was bred from mirror carp originally introduced from Europe in the 1950s. A similar result was indicated by PCA, which showed subgroups within both the Asian and the European populations (**Figure 2B**). In the Asian populations, YR formed a tight cluster and other populations (JIAN, HB, XG, KOI, OUJ, and QSJ) formed another subgroup. Three subgroups could be identified in the European cluster. The first subgroup contained only SP samples, which were closely grouped. The second subgroup included three Hungarian populations (DANU, TZ, and SZ), the SP population, and the HLJ population, indicating a close relationship among SP, HLJ, and Hungarian carp. A small number of SP samples and all of the AME samples formed the third subgroup, showing that the common carp in the USA might have originated from European populations. We analyzed the population structure using the Bayesian clustering program STRUCTURE. Since the Delta *K* value calculated from the ln likelihood was highest for the model *K* = 5, we show the clusters for *K* = 5 in **Figure 2C**. The Asian populations were separated into two subgroups, similar to the PCA result, and the European populations showed shared common ancestry. Within the Asian populations, obvious genetic admixture was observed in the YR population and other populations, except for KOI and OUJ. The KOI population has had a highly inbred history to maintain the purity of the genetic component, and the OUJ population also showed a relatively pure genetic structure due to its habitats in the isolated mountainous areas in Zhejiang, China. This result is in accordance with our previous study (Xu et al., 2014), but there were slight differences in the shared genetic components in all of the populations.

FIGURE 2 | Phylogeny, principal component analysis (PCA), and genetic structure of 14 populations of *C. carpio*. (A) A maximum-likelihood phylogenetic tree of 14 populations of common carp generated on the basis of polymorphic single-nucleotide polymorphisms (SNPs). Population abbreviations: SP, Songpu; DANU, Danube; SZ, Szarvas; TZ, Tisza; AME, North American; YR, Yellow River; HLJ, Heilongjiang; OUJ, Oujiang color; HB, Hebao; XG, Xingguo; KOI, Koi; SH, Songhe; JIAN, Jian; QSJ, Qingshuijiang. (B) PCA of *C. carpio* populations. (C) The population structure of common carp populations. Each color represents one ancestral population; each individual is represented by a vertical bar, and the length of each colored segment in each vertical bar represents the proportion contributed by ancestral populations. *K* = 5 was used for analysis with the highest Delta *K* value.

# LD Decay and Haplotype Construction

Based on the genome assembly and SNP array for *C. carpio*, the LD was investigated for seven main populations that contained an adequate number of samples. The *R*<sup>2</sup> value for each pair of SNPs was calculated using PLINK software, and the raw data were classified by distance ranges. LD decayed with increasing distance between SNPs, and different populations showed distinct LD decay patterns (**Figure 3A**). KOI and OUJ had significantly higher *R*<sup>2</sup> values than the five other populations, which was consistent with the population structure results. Haplotypes were constructed for all samples and seven populations using PLINK. The distribution of haplotype block lengths was plotted with the ggplot2 package, and OUJ, SP, XG, and YR showed longer blocks (**Figure 3B** and **Table S2**). Haplotypes are useful in GWAS analysis and provide more SNP genotyping information through imputation.

# Genome-wide Selective Sweep Analysis

*C. carpio* is a genetically diverse species that has adapted to a variety of environments in Eurasia and has been domesticated for more than 2,000 years (Xu et al., 2014). *C. carpio* has been bred into numerous strains, generating distinct phenotypes in body color, scale pattern, and body shape. These characters are partially attributable to genome diversity due to environmental adaptation.

The genetic diversity in certain genome regions might be reduced due to natural selection. To identify the genome regions under selective pressure in populations with distinct biological features, we scanned the genome-wide variations and allele frequency spectra of the 134,719 SNPs. The areas of comparison included scales, body shape, and body color. The π ratios of three groups (πScaled carp/SP, πYR/HB, and πQH/FY) were calculated using a 200-kb sliding-window approach with Vcftools software. In comparison to the Scaled carp in Asia (including YR, HB, XG, OUJ, QSJ, KOI, and JIAN), the SP population had distinct genetic diversities across the whole genome (**Figure 4A**). We identified 321 significant windows corresponding to 64.2 Mb in size (top 5%, empirical π ratios ≥ 23.17), which included 289 candidate genes based on the π ratio analysis. To validate the genome regions under strong selective sweeps in the SP population, the genome regions with Fst greater than 0.3995 (top 5%) were also identified, corresponding to 64.2 Mb and 285 candidate genes. A total of 100 candidate genes shared by both the π ratio and Fst analysis were identified as potentially affected genes under selective sweeps (**Figure 4A**, **Table S3**).

paired SNPs, and the *Y*-axis represents mean *R*<sup>2</sup> of the SNP pairs within each distance region. (B) Haplotype distribution of all of the samples and seven populations. The *X*-axis represents the lengths (kb) of haplotype blocks, and the *Y*-axis represents the density (percentage in all of the blocks) of each block with certain lengths.

The results suggest that the genomes of the SP population have been significantly altered by the environment, resulting in a scaleless pattern. The fish scale is an epidermal appendage and an important protective tissue. All of the sampled populations were fully scaled except for the SP carp, which lacks scales. Through selective sweep analysis comparing SP carp with the other fully scaled carps, the significant genome regions were identified (**Table S3**). Several target genes that might be related to the scale pattern were also found (**Figure 4B**). Growth differentiation factor 6a (*gdf6a*), also named cartilage derived morphogenetic protein 2, is a member of the BMP family. The expression of the *gdf6a* gene has been detected in both fetal and post-natal cartilaginous tissues involved in the development of long bones. The *gdf6a* gene may play a role in suppression of ossification (Wei et al., 2016). Insulin-like growth factor-binding protein 5 (*igfbp5*) is a member of the IGFBP family that can either inhibit or stimulate the growth-promoting effects of the IGFs in cell culture. It is involved in the Ras/p38 MAPK signaling pathway in regulating cell proliferation and apoptosis (Yang et al., 2018). To study these candidate genes and their potential functions, GO and KEGG analyses were performed on the candidate genes, offering insight into the genetic evolution and adaptive mechanisms of the SP population (**Tables S4** and **S9**).

We also investigated selective sweeps in two further comparisons, YR vs HB and QH vs FY, which differ in body shape and body color, respectively. YR and HB were compared because they represent the typical populations in central China and southern China, respectively. QH and FY were compared because their genome backgrounds are highly similar, which might make it easier to screen out potential genes associated with body color. We identified a total of 321 significant windows corresponding to 64.2 Mb in size for each comparison (top 5%, empirical π ratios ≥ 1.3121 and ≥ 1.9341, respectively), which included 293 and 287 candidate genes, respectively. To further validate the genome regions under strong selective sweeps in the HB or FY population, the genome regions with Fst greater than 0.2588 or 0.1622 (top 5%) were also identified, including 278 and 292 candidate genes, respectively. A total of 38 and 65 candidate genes shared by both the π ratio and Fst analysis were recognized, respectively, as genes potentially affected under selective sweeps (**Figures S1A** and **S2A**, **Tables S5** and **S7**).

The purse-like shape of the HB population was probably due to extensive growth of muscle or bones compared to the YR population, and several genes (*trhr* and *bmpr1b*) were screened out for their potential functions in bone and muscle development (**Figure S1B**). Previous genome-wide association and replication studies identified *trhr* as a gene associated with lean body mass (Liu et al., 2009). The gene *bmpr1b* is engaged in the regulation of skeletal development through interactions with FGFR families (Qi et al., 2014). GO enrichment analysis was also performed on the candidate genes, offering insight into the genetic evolution and adaptive mechanisms of the HB population (**Tables S6** and **S9**). GO terms including thyrotropin-releasing hormone receptor activity, Wnt-activated receptor activity, and transforming growth factor beta-activated receptor activity were enriched, providing clues for more detailed analysis.

Coloration is an important trait for common carp, especially for ornamental strains, since it is often a criterion for visually determining quality and market value. OUJ carp is a famous ornamental farmed fish, which has four distinct color patterns, namely, whole white (FY), whole red (QH), white with scattered big black spots, and red with scattered big black spots. FY and QH were compared, and several potential target genes were identified (**Figure S2B**). The keratinocyte growth factor (fgf7/kgf) can promote melanosome transfer and act on recipient keratinocytes through stimulation of the phagocytic process. Fgf7 affects keratinocytes derived from skin of different colors (Cardinali et al., 2008). Another gene in selective sweep regions, *opsin5*, has been reported to be involved in phototransduction and to regulate seasonal changes in color perception (Shimmura et al., 2017). It was identified as an ultraviolet (UV)-sensitive pigment of the retina and other photosensitive organs in birds (Ohuchi et al., 2012). GO and KEGG analyses were also performed on the candidate genes and offered insight into the genetic evolution and adaptive mechanisms of the SP population (**Tables S8** and **S9**). Several significant pathways were enriched, including dopamine receptor signaling pathway and dopamine neurotransmitter receptor activity. This indicated the importance of dopamine-related networks in body color determination.

The representative GO terms and pathways enriched in these comparisons were selected, and the Fst values of genes in these pathways were significantly higher than the whole-genome level (**Figure S3**).

# CONCLUSIONS

We investigated the genomic divergence among various populations of *C. carpio*. Distinct differences in genetic components were identified between Asian and European populations. The haplotypes of each population could benefit research on trait associations. Genome scanning among different populations identified hundreds of genes within selective sweep regions, including *gdf6a*, *bmpr1b*, and *opsin5*. This study comprehensively revealed the genetic structure of global populations of *C. carpio*, and the potential trait-related genes could be valuable for genome-assisted breeding of *C. carpio*.

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the care and use of animals for scientific purposes set up by the Animal Care and Use Committee of Chinese Academy of Fishery Sciences (ACUC-CAFS). The protocol was approved by the ACUC-CAFS. Before the blood or fin samples were collected, all of the fishes were euthanized in MS222 solution.

# AUTHOR CONTRIBUTIONS

PX initiated and coordinated the research project. JX and YJ conducted the analysis and drafted the manuscript. ZZ, HZ, and JF engaged in sample collection and genotyping analysis. CD, WP, BC, and RT took part in enrichment analysis. All authors read and approved the final manuscript.

# FUNDING

This work was supported by Central Public-Interest Scientific Institution Basal Research Fund, CAFS (No. 2016GH02, No. 2016HY-JC0301, and No. 2016HY-ZD0302), the National Natural Science Foundation of China (No. 31502151 and No. 31422057), the National High-Technology Research and Development Program of China (2011AA100401), the National Key Research and Development Program (2018YFD0900102), and the National Infrastructure of Fishery Germplasm Resources of China (No. 2018DKA30470).

# ACKNOWLEDGMENTS

We thank Mr. Xianhu Zheng, Mr. Youyi Kuang, Mr. Xiaowen Sun, Mr. Chenghui Wang, and Mr. Zhanjiang Liu for their assistance with sample collection.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00660/full#supplementary-material

SUPPLEMENTARY TABLES 1–9 | Summary of genotyping data, haplotype blocks, and selective sweep results.

SUPPLEMENTARY FIGURE 1 | π, Fst distribution, and selected genes in selective sweep analysis between YR and HB populations. (A) Distribution of π and Fst values in the comparison between YR and HB populations. The *X*-axis represents the π ratio values of all sliding windows, and the *Y*-axis represents the Fst values of all sliding-windows. Red dots represent windows that passed thresholds of both π and Fst. (B) π ratio and Fst values within windows neighboring selected genes. The solid blue line represents the π ratio, and the dashed blue line represents the π ratio threshold. The solid red line represents the Fst, and the dashed red line represents the Fst threshold.

SUPPLEMENTARY FIGURE 2 | π, Fst distribution, and selected genes in selective sweep analysis between QH and FY samples. (A) Distribution of π and Fst values in the comparison between QH and FY samples. The *X*-axis represents the π ratio values of all sliding windows, and the *Y*-axis represents the Fst values of all sliding windows. Red dots represent windows that passed thresholds of both π and Fst. (B) π ratio and Fst values within windows neighboring selected genes. The solid blue line represents the π ratio, and the dashed blue line represents the π ratio threshold. The solid red line represents the Fst, and the dashed red line represents the Fst threshold.

SUPPLEMENTARY FIGURE 3 | Box plot of the Fst differences between representative GO terms and the whole genome. Boxes denote the values between the 25th and 75th percentiles and the black transverse line inside the box denotes the median. Black vertical lines denote the values within 1.5 times of quartile values. Outliers are shown as black dots. (A) Scaled vs SP. (B) YR vs HB. (C) QH vs FY.

# REFERENCES


common carp varieties (*Cyprinus carpino* L.). *Meta Gene* 19, 82–90. doi: 10.1016/j.mgene.2018.11.001


**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor is currently co-organizing a Research Topic with one of the authors PX, and confirms the absence of any other collaboration.

*Copyright © 2019 Xu, Jiang, Zhao, Zhang, Peng, Feng, Dong, Chen, Tai and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Inventory of European Sea Bass (*Dicentrarchus labrax*) sncRNAs Vital During Early Teleost Development

*Elena Sarropoulou<sup>1</sup>\*, Elizabet Kaitetzidou<sup>1</sup>, Nikos Papandroulakis<sup>1</sup>, Aleka Tsalafouta<sup>2</sup> and Michalis Pavlidis<sup>2</sup>*

*<sup>1</sup> Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Center for Marine Research, Heraklion, Greece, <sup>2</sup> Department of Biology, University of Crete, Heraklion, Greece*

### *Edited by:*

*Paulino Martínez, University of Santiago de Compostela, Spain*

### *Reviewed by:*

*Diego Robledo, University of Edinburgh, United Kingdom Laia Ribas, Superior Council of Scientific Investigations, Spain*

> *\*Correspondence: Elena Sarropoulou sarris@hcmr.gr*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 28 February 2019 Accepted: 21 June 2019 Published: 25 July 2019*

### *Citation:*

*Sarropoulou E, Kaitetzidou E, Papandroulakis N, Tsalafouta A and Pavlidis M (2019) Inventory of European Sea Bass (Dicentrarchus labrax) sncRNAs Vital During Early Teleost Development. Front. Genet. 10:657. doi: 10.3389/fgene.2019.00657*

During early animal ontogenesis, a plethora of small non-coding RNAs (sncRNAs) are highly expressed and have been shown to be involved in several regulatory pathways vital to proper development. The rapid advancements in sequencing and computing methodologies in the last decade have paved the way for the production of sequencing data in a broad range of organisms, including teleost species. Consequently, this has led to the discovery of sncRNAs as well as the potentially novel roles of sncRNA in gene regulation. Among the several classes of sncRNAs, microRNAs (miRNAs) have, in particular, been shown to play a key role in development. The present work aims to identify the miRNAs that play important roles during early European sea bass (*Dicentrarchus labrax*) development. The European sea bass is a species of high commercial impact in European and especially Mediterranean aquaculture. This study reports, for the first time, the identification and characterization of small RNAs that play a part in the 10 developmental stages (from morula to all fins) of the European sea bass. For these 10 developmental stages, more than 135 million reads generated by next-generation sequencing were either retrieved from publicly available databases or newly generated. The analysis resulted in about 2,000 sample grouped reads, and their subsequent annotation revealed that the majority of transcripts belonged to the class of miRNAs followed by small nuclear RNAs and small nucleolar RNAs. The analysis of small RNA expression among the developmental stages under study revealed that miRNAs are active throughout development, with the main activity occurring after the earlier stages (morula and 50% epiboly) and at the later stages (first feeding, flexion, and all fins). Furthermore, investigating miRNAs exclusively expressed in one of the stages revealed five miRNAs with a higher abundance only in the morula stage (miR-155, miR-430a, d1, d2, and miR-458), indicating possible key roles of those miRNAs in further embryonic development. An additional target search showed putative miRNA-mRNA interactions with possible direct and indirect regulatory functions of the identified miRNAs.

Keywords: development, RNAseq, sncRNA, microRNA, teleosts, functional genomics

# INTRODUCTION

Non-coding RNAs (ncRNAs) are involved in several different regulatory pathways. In the last decade, their importance has been demonstrated in a broad range of organisms, including teleost species (Herkenhoff et al., 2018). Several types of non-coding regulatory RNAs have been identified, chief among them being long ncRNAs and small ncRNAs (sncRNAs). The major classes of sncRNAs are short interfering RNAs, piwi-interacting RNAs, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), and microRNAs (miRNAs; Labbé et al., 2017). The majority of studies have investigated the role of miRNAs, a highly conserved class of small regulatory RNAs known to act at the translational level mainly by repressing protein production (Gavery and Roberts, 2017). The rapid advancements in sequencing and computing methodologies have significantly enhanced the discovery of new miRNAs not only in humans and model species, such as the zebrafish (*Danio rerio*; Mishima, 2012), but also in non-model species, such as the Atlantic halibut (*Hippoglossus hippoglossus*; Bizuayehu et al., 2012a) and the catfish (*Ictalurus punctatus*; Barozai, 2012). Besides their functional importance in diverse biological processes, it has also been shown that miRNAs are essential for vertebrate development (Wienholds and Plasterk, 2005). In teleosts, the importance of miRNAs in modulating gene expression during development has been reported in various studies evaluating both model and non-model fish species. The first miRNAs with a regulatory function in fish development were described in zebrafish (Kloosterman et al., 2004). The dynamics of miRNA expression during early ontogeny in non-model fish species, such as the Atlantic halibut (Bizuayehu et al., 2012b), turbot (*Scophthalmus maximus*; Robledo et al., 2017), and Senegalese sole (*Solea senegalensis*; Campos et al., 2014), have also been reported.

The European sea bass (*Dicentrarchus labrax*) is a species of high commercial impact whose industrial production is steadily growing. Efforts have also been made to intensify fish cultivation and target faster growth rates and better feed conversion ratios. Consequently, several objectives such as the larval survival rate, alternative feeds, and disease resistance are of significant importance to the industry. The embryonic and larval stages are part of the most important periods to ensure high performance and superior quality in the following developmental phases of the life cycle, particularly for fish in captivity (Varsamos et al., 2006; Pittman et al., 2013; Tsalafouta et al., 2015). Over the last decade, these economic interests have directed increased research efforts in the rearing of European sea bass, including the significant enrichment of the molecular toolbox for the study of this species. Today, besides several transcriptome datasets (Sarropoulou et al., 2009; Sarropoulou et al., 2012; Pinto et al., 2017), the whole genome sequence of the European sea bass (Kuhl et al., 2010; Tine et al., 2014) as well as single nucleotide polymorphism markers, genetic linkage maps (Chistiakov et al., 2008; Guyon et al., 2010), and radiation hybrid maps (Guyon et al., 2010) are available. However, scarce information has been published concerning ncRNAs (Kaitetzidou et al., 2015).

The present work aims to identify sncRNAs and their targets that play an important role in the development of the European sea bass. Teleost development can be viewed as a sequence of ongoing morphological changes, whereby the embryonic and larval stages are considered to be the most significant time points in the life cycle of marine fish. In the natural environment, the European sea bass lives in the marine environment during its embryonic and larval phases, whereas as a juvenile it migrates to coastal zones, estuaries, and lagoons. The European sea bass is therefore considered a euryhaline species with reportedly high adaptation processes during its early life phases. To detect most of the sncRNAs that are important to European sea bass development and to obtain a list of unique miRNAs, this study analyzed the small RNA transcripts of early development (as described by Kaitetzidou et al., 2015) along with newly generated sequencing reads from three additional later stages. It reports for the first time the identification and characterization of European sea bass small RNAs during development (from morula to all fins); using a target search, it also reveals putative miRNA–mRNA interactions that suggest possible direct and indirect regulatory functions of the selected miRNAs, which show differential expression (DE) during development.

# MATERIALS AND METHODS

All experiments/methods in the present study were performed in accordance with the approved guidelines and regulations from the Hellenic Center for Marine Research (HCMR) Institutional Animal Care and Use Committee following the three Rs (Replacement, Reduction, and Refinement) guiding principles for more ethical use of animals in testing, which was first described by Russell and Burch in 1959 (EU Directive 2010/63). These principles are now followed in many testing establishments worldwide before the initiation of experiments.

# Sampling and RNA Extraction

For the generation of new data from three later developmental stages, samples were collected from i) the first feeding stage (FF), when the mouth opens, the pectoral fins have been developed, and the yolk sac has been completely absorbed; ii) the flexion (FLX) stage, when the notochord flexion has been completed; and iii) the all-fins stage (FINS), when all fins have been established (Pavlidis et al., 2011). All samplings were carried out at the installations of the Institute of Marine Biology, Biotechnology, and Aquaculture (IMBBC), HCMR (Heraklion, Crete, Greece). For each of the newly sampled stages, i.e., FF, FLX, and FINS, three biological replicates were sampled, flash frozen in liquid nitrogen, and transferred to a −80°C ultra-low freezer until miRNA library preparation. For miRNA library construction and sequencing, the same method as described by Kaitetzidou et al. (2015) was followed. In brief, total RNA was extracted from all developmental stages using Nucleospin miRNA Kit for the isolation of small RNA (sncRNA) and mRNA (Macherey-Nagel GmbH & Co. KG, Duren, Germany) according to the manufacturer's instructions. Larvae were disrupted with mortar and pestle in liquid nitrogen and homogenized in lysis buffer by passing lysate through a 23-gauge (0.64 mm) needle five times. The quantity of RNA was estimated with NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE, USA) and the quality was further evaluated by agarose (1%) gel electrophoresis and Agilent 2100 Bioanalyzer using RNA Nano Bioanalysis chip.

# miRNA Library Preparation and Sequencing

For all samples, miRNA libraries were prepared according to the manufacturer's instructions and single-end sequencing was carried out on the Illumina HiSeq 2000 platform at the Cornell University Core Laboratories Center. The use of multiplex identifier tags for each library allowed the samples to be pooled and run in one HiSeq 2000 lane. The quality of all reads was assessed by running FastQC version 0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc).

# Sequence Reads Analysis

Raw sequencing data from seven early developmental stages, i.e., morula (M), 50% epiboly (E), late gastrulation-organogenesis (GO), organogenesis (O), late organogenesis (LO), hatching (HA), and 24 h post-hatching (24 hph), were retrieved from Kaitetzidou et al. (2015). Sequence reads from a total of 10 developmental stages, i.e., reads obtained from Kaitetzidou et al. (2015) as well as the newly generated reads from the FF, FLX, and FINS stages, were quality and adapter trimmed using Trimmomatic version 0.30 (Bolger et al., 2014) and imported into CLC Genomics Workbench version 10.1 (CLC Bio, Aarhus, Denmark). The annotation of unique sncRNAs was performed against Gasterosteus\_aculeatus.BROADS1.ncrna as well as against available miRNA annotations of teleosts, humans, and mice in miRBase (Griffiths-Jones et al., 2008; *Cyprinus carpio*, *D. rerio*, *Fugu rubripes*, *I. punctatus*, *Oryzias latipes*, *Petromyzon marinus*, *Salmo salar*, *Tetraodon nigroviridis*, *Gallus gallus*, *H. hippoglossus*, *Homo sapiens*, *Mus musculus*, and *Paralichthys olivaceus*).

# Differential Expression

For DE analysis, transcript counting was carried out with CLC Genomics Workbench 10.1.1 (CLC Bio; https://www.qiagenbioinformatics.com/). Only transcripts with more than five reads before normalization were considered for DE analysis. For DE sequencing reads that could not be annotated through the Gasterosteus\_aculeatus.BROADS1.ncrna database, an additional annotation attempt was made by submitting them to publicly available databases of the National Center for Biotechnology Information (NCBI; non-redundant nucleotide and expressed sequence tags). Transcripts per million normalized count values were transformed into a decimal logarithmic scale. Obtained small RNA reads were mapped to the *D. labrax* genome for validation purposes. DE was assessed based on the empirical analysis method (EDGE test) provided within the CLC Genomics Workbench with default parameters. Transcripts were considered significantly DE if false discovery rate (FDR) values were below 0.05 and the log2 fold change was > 2. Transcripts found in only one of the stages, with zero counts in all other stages, were considered stage-specific miRNAs.
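The two selection rules stated above (the DE cutoffs and the stage-specific criterion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CLC Genomics Workbench procedure: the function names and the toy counts are hypothetical, and applying the fold-change cutoff to the absolute value is an assumption, since the text does not say whether the threshold is signed.

```python
def is_de(fdr, log2_fc, fdr_cutoff=0.05, lfc_cutoff=2.0):
    """Apply the reported cutoffs: FDR < 0.05 and log2 fold change > 2.

    Assumption: the fold-change cutoff is applied to the absolute value,
    so that both up- and down-regulated transcripts can pass.
    """
    return fdr < fdr_cutoff and abs(log2_fc) > lfc_cutoff


def stage_specific(counts_by_stage):
    """Return {transcript: stage} for transcripts with non-zero counts
    in exactly one stage and zero counts in every other stage."""
    transcripts = set()
    for counts in counts_by_stage.values():
        transcripts.update(counts)
    result = {}
    for transcript in transcripts:
        expressed = [stage for stage, counts in counts_by_stage.items()
                     if counts.get(transcript, 0) > 0]
        if len(expressed) == 1:
            result[transcript] = expressed[0]
    return result


# Toy counts (hypothetical values, for illustration only).
toy_counts = {
    "M":  {"miR-430a": 120, "miR-21": 3},
    "E":  {"miR-430a": 0,   "miR-21": 10},
    "FF": {"miR-430a": 0,   "miR-21": 400},
}
specific = stage_specific(toy_counts)
```

In this toy input only `miR-430a` is reported as stage-specific (for stage `M`), because `miR-21` has non-zero counts in more than one stage.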

# miRNA Target Search

For target search, the coding sequence information (mRNA) for the European sea bass was extracted from the NCBI nucleotide databases and open reading frame prediction was carried out using the Transdecoder version 5.5.0 software program (Haas et al., 2013). 3′ untranslated regions were extracted with ExUTR version 0.1.0, and RNAhybrid version 2.1.2 (Krüger and Rehmsmeier, 2006) as well as the miRanda web tool (Enright et al., 2003) were used for target prediction with the default parameters and an energy threshold of mfe < −30 kcal. The complete workflow is illustrated in **Figure 1**.

# Gene Ontology (GO) Classification and Enrichment Analysis of Target Genes

GO terms were retrieved using the PANTHER classification system (http://www.pantherdb.org/). The chosen analysis type was "Statistical overrepresentation test" and the annotation data set the PANTHER GO-Slim Biological Process data set (PANTHER version 14.1 Released 2019-03-12). The analyzed list consisted of the zebrafish target genes for miR-430 and miR-21 retrieved from TargetScanFish (TargetScanFish, release 6.2, June 2012; Ulitsky et al., 2012). The reference gene list for the test consisted of all genes in the PANTHER database for zebrafish. The test type was Fisher's exact test with FDR correction. The statistically overrepresented GO terms of miR-430 and miR-21 target genes were illustrated using PANTHER Overlaid Area Chart of Difference (observed vs. expected).
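The statistical idea behind an overrepresentation test of this kind can be sketched as follows. This is a minimal, standard-library-only illustration and not PANTHER's implementation: it assumes a one-sided Fisher's exact (hypergeometric) test for enrichment and a Benjamini-Hochberg adjustment for the FDR correction, and the function and argument names are hypothetical.

```python
from math import comb


def fisher_overrep_p(k, n, K, N):
    """One-sided Fisher's exact P-value for overrepresentation.

    k: target genes annotated with the GO term
    n: total target genes in the analyzed list
    K: reference genes annotated with the GO term
    N: total genes in the reference list
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A GO term would then be called overrepresented when its adjusted P-value falls below the chosen FDR level; the observed versus expected counts plotted by PANTHER correspond to `k` and `n * K / N` in this notation.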

# RESULTS

# Small RNA Libraries from Developmental Stages of the European Sea Bass

For samples FF, FLX, and FINS, new miRNA libraries were generated. RNA quality control on a Bioanalyzer RNA Nano chip before library construction showed high-quality total RNA with an RNA integrity number (RIN) > 8 (**Supplemental Figure S1**). Only samples passing this evaluation step were used for next-generation sequencing (NGS). For each sample after trimming, an average of about 7 million reads were obtained and about 40% were annotated. Annotated read counts for each stage were grouped ("sample grouped") and resulted in an average of 1,380 annotated reads per stage. An overview of sequencing read numbers is listed in **Table 1**. Sample grouped reads of all stages together amounted to 2,115 reads, of which more than 50% had sequence lengths of 20 to 24 bp (**Figure 2A**). The total number of small RNAs (sample grouped) successfully mapped onto the European sea bass genome amounted to 1,524 reads (~72%; **Supplemental Table S1**), with 1,169 (77%) reads corresponding to miRNA (**Figure 2B**). Raw sequencing reads were submitted to the SRA database of the NCBI under accession numbers PRJNA269278 and PRJNA369460.

# Differentially Expressed miRNAs During Sea Bass Development

Pair-wise DE analysis was carried out and transcripts with FDR values < 0.05 and log2fold change > 2 were considered as DE, resulting in a total of 1,157 annotated reads in at least one comparison among all the 10 stages under study (**Supplemental Table S2**). The majority of transcripts were annotated as miRNA (59%) followed by snRNA (25%) and snoRNA (10%; **Figure 3**). Principal component analysis (PCA) showed that earlier stages separated themselves clearly from the later stages: the FLX and FINS stages (**Figure 4A**). A similar pattern was obtained by hierarchical clustering (**Figure 4B**). According to the PCA plot and the heatmap, the M stage was selected as the reference stage to generate a Venn diagram (**Figure 5**). Furthermore, the comparisons of stages GO, O, LO, HA, and 24 hph to stage M were combined in one list as well as the comparison of stage M to stages FLX and FINS. The Venn diagram showed that the majority of transcripts exclusively expressed belonged to the FLX and FINS stages (263, 26%), with 62% of them classified as snRNA and 24% as miRNA. The list comprising stages GO, O, LO, HA, and 24 hph had 40 (3.9%) unique transcripts (36 miRNAs and four snoRNAs), the FF had 13 (1.3%), and the E stage included only 4 (0.4%) miRNA.

FIGURE 2 | (A) Read length distribution of all sequencing reads after trimming and sample grouping. *Y*-axes show the percentage of read counts at a specific read length, whereas *X*-axes show read lengths. (B) Successfully mapped small RNA on the available European sea bass (*D. labrax*) genome. The outer ring illustrates the chromosome number, whereas the inner rings present the relative amount of mapped miRNAs.
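The Venn-diagram grouping described in this section, in which the DE transcripts of several comparisons against stage M are treated as lists and the transcripts exclusive to each list are counted, amounts to simple set arithmetic. The sketch below is illustrative only: the list names and transcript identifiers are hypothetical placeholders rather than data from the study.

```python
def exclusive_members(lists):
    """For each named list, return the members found in no other list."""
    exclusive = {}
    for name, members in lists.items():
        others = set()
        for other_name, other_members in lists.items():
            if other_name != name:
                others |= set(other_members)
        exclusive[name] = set(members) - others
    return exclusive


# Hypothetical DE transcript lists, one per comparison against stage M.
de_lists = {
    "M_vs_E": {"t1", "t2"},
    "M_vs_GO_to_24hph": {"t2", "t3"},
    "M_vs_FF": {"t4"},
    "M_vs_FLX_FINS": {"t3", "t5", "t6"},
}
only_in = exclusive_members(de_lists)
counts = {name: len(members) for name, members in only_in.items()}
```

The size of each exclusive set corresponds to the number shown in the non-overlapping region of the respective Venn diagram field.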

# Stage-Specific miRNAs

Transcripts found in only one of the stages, with zero counts in all others, were considered stage-specific miRNAs. In the present work, no stage-specific miRNAs were detected for stages E, FLX, FINS, and HA. For the first stage, the M stage, five miRNAs were found, among which three belonged to the miR-430 family. The last studied stage, the FINS stage, also did not show the presence of any uniquely expressed miRNA. Nevertheless, one of the highly abundant small RNAs at this stage was miR-462. Among the few miRNAs regulated between FLX and FINS stages (a total of 9 miRNAs) were hs-miR-7641-1 and hs-miR-7641-2, which were found to be more abundant at the FLX stage.

# Target Search

miRNA target search in European sea bass, applying positive hybridization scores with an energy threshold < −30.0, was carried out at the gene level for the 14 stage-specific miRNAs (transcript found only in one stage and zero counts in the other stages). The putative targets for the European sea bass, as identified by RNAhybrid, are listed in **Supplemental Table S3**. miRNA target search in zebrafish with miR-430 (stage-specific miRNA) and miR-21 (up-regulated in FF) resulted in specific GO terms illustrated in **Figure 6** and listed in **Supplemental Table S4**.

# DISCUSSION

# Deep Sequencing of European Sea Bass Developmental Stages

Today, NGS has made it possible to assess not only gene expression but also the abundance of small RNA within a given sample. Here, we analyzed more than 135 million reads obtained from 10 developmental stages of the European sea bass comprising NGS data from a previously published work (Kaitetzidou et al., 2015) as well as from the present study. Read numbers varied across the samples, but the numbers of annotated reads, which were "sample grouped," were alike (**Table 1**), pinpointing the fact that sufficient reads for each sample were obtained. The majority (>50%) of the trimmed and sample grouped reads were 20 to 24 bp long (**Figure 2A**). Including the transcripts read lengths of 15 to 19 bp as well as 25 to 29 bp resulted in more than 70% of the sample grouped reads, demonstrating that the libraries were significantly enriched with miRNAs.

first feeding (FF) stage; and (4) combined list of the comparisons of M to stages flexion (FLX) and all-fins (FINS).

# Small RNA Analysis

Two main approaches exist for analyzing the small RNA reads obtained by NGS. Either the different types of small RNAs in the data are counted directly, or the reads are first mapped to an appropriately annotated reference genome and then submitted to publicly available databases for annotation purposes. The first approach does not require an annotated genome for mapping, and small RNAs not mapping to the reference genome (due to gaps for instance) can still be measured. In the present study, the first approach was implemented to detect all possible small RNAs obtained by NGS. Nevertheless, with the aim of locating the obtained small RNAs within the genome, reads were mapped to the European sea bass genome. Notably, the majority of miRNAs mapped to four linkage groups (LG5, LG14, LG15, and LG25; **Figure 2B**). Other studies, in about 20 species, have also investigated the detailed chromosome-specific location of miRNA genes. The authors of these studies showed that, in all species, certain chromosomes accumulated a higher number of miRNAs and that miRNAs involved in specific diseases also accumulated in a specific location (e.g., Ghorai and Ghosh, 2014). The study by Ghorai and Ghosh (2014) also included three teleost species: the zebrafish, medaka (*O. latipes*), and tetraodon (*T. nigroviridis*). Although miRNAs clearly accumulated in specific chromosomes in zebrafish and medaka, in tetraodon, no clear preference was shown. Nevertheless, the authors listed the four chromosomes containing the highest numbers of miRNA genes, which were chromosomes 7, 4, 14, and 17 for medaka and chromosomes 2, 10, 9, and 13 for tetraodon (chromosome numbers ordered by the highest miRNA counts found). Medaka chromosome 7 is the homolog group to tetraodon 9 (Sarropoulou et al., 2008) and corresponds to the European sea bass LG25, which also showed an abundance of miRNA in the present study (**Figure 2B**). The European sea bass LG14 corresponds to medaka chromosome 14 (third highest number of miRNA in medaka) and LG15 to medaka chromosome 21 (fifth highest number of miRNA in medaka) and tetraodon 2 (highest number of miRNA in tetraodon). Of the four LGs with enriched miRNA counts (LG5, LG14, LG15 and LG25), only European sea bass LG5 did not match with any of the chromosomes with enriched miRNA counts in medaka and tetraodon. In humans, it has been shown that clustered miRNAs are involved in specific pathways or, in particular, cell functions and that clustered miRNAs may be maintained through evolution (Guo et al., 2014). In teleosts, this aspect has not yet been investigated and may be the objective of future studies.

# Differentially Expressed miRNAs During Sea Bass Development

DE miRNAs showed that the investigated earlier stages (i.e., M and E) and the latest stages (i.e., FLX and FINS) separated themselves clearly from the other stages (**Figure 4A**). Therefore, a first comparison to the M stage was carried out, grouping together stages GO, O, LO, HA, and 24 hph as well as stages FLX and FINS (**Figure 5**).

# Between Morula and 50% Epiboly

A total of 220 transcripts were found to be DE between the two earliest stages studied within the present work. Of the 220 transcripts, 165 (75%) were annotated as miRNAs, and 70 miRNAs were found in higher abundance in the E stage. The other 95 miRNAs were detected to be more abundant in the M stage. Four transcripts were exclusively DE in the M vs. E comparison (**Figure 5**). All of them were found in higher abundance in the M stage and appeared to belong to the miR-30a family (miR-30a-1, miR-30a-2, miR-30a-3, and miR-30a-4). For comparison, in *Xenopus*, studies have shown that pri-miRNAs of miR-30a are active only during maternal stages and that no zygotic transcription could be detected within the studied stages (Nepal et al., 2015). Higher miR-30a abundance during the blastula period in *Xenopus* was also shown to be linked to early neural crest development (Ward et al., 2018). In contrast, the highest miRNA fold change between the M and E stages was found for miR-196b with 0 and 1.111 copy numbers, respectively. In zebrafish, research has shown that miR-196 is involved in the regulation of axial patterning and pectoral appendage initiation and that miR-196b first appears at the bud stage (He et al., 2011), i.e., later than it was detected in the present study.

# Between Morula, 50% Epiboly and Gastrulation-Organogenesis, Organogenesis, Late Organogenesis, Hatching, and 24 Hours Post-Hatching

According to the PCA plot, the stages from GO to 24 hph are grouped closely together in comparison to the other stages. Therefore, in the Venn diagram, DE transcripts between those stages and the M stage were considered as one list (**Figure 5**). Here, a total of 40 transcripts, comprising 36 miRNAs and 4 snoRNAs, were found to be exclusively DE. The fact that 90% of the transcripts found exclusively in the comparison of the M stage to the GO through 24 hph stages are annotated as miRNA may indicate that miRNAs are more active during early development. Similar findings were described in zebrafish, where an increase of miRNA expression was observed as early development proceeded (Yao et al., 2014).

# Between Morula and First Feeding Stage

Likewise, 13 transcripts (6 miRNAs, 2 snoRNAs, and 5 snRNAs) were found to be exclusively present in the FF stage. Five of the six identified miRNAs, namely miR-4792, let-7b-1, let-7b-2, miR-155-1, and miR-155-2, showed higher abundance in the M stage than in the FF stage (**Figure 5**). Whereas let-7 is a well-studied miRNA belonging to the founding miRNAs with possible roles in growth development (Kloosterman et al., 2004; Zhao et al., 2017), both miR-155 and miR-4792 have been examined less. The remaining miRNA, miR-21 (ENSGACT00000028044.1), showed higher expression levels in the FF stage. It is also among the first miRNAs to have been identified (Kim, 2005) and has been found to be involved in various biological processes, including development (Kumarswamy et al., 2011). In zebrafish, miR-21 expression was found at very early developmental stages (Chen et al., 2005), whereas, in rainbow trout, miR-21 was suggested to play an important role in degrading maternally inherited mRNAs (Ramachandra et al., 2008). Enrichment analysis of target genes in zebrafish, identified *via* TargetScanFish (Ulitsky et al., 2012), revealed a high number of genes classified in the biological process termed as "tube development" (GO:0035295; **Figure 6**). The formation of tubes such as epithelial and endothelial tubes is of importance for the transport of gases, liquids, and cells. In tilapia (*Oreochromis niloticus*), it has been shown that miR-21 is involved in the modulation of alkalinity stress (Zhao et al., 2016). During European sea bass development, stress, as indicated by water cortisol measurement, was first detected at the FF stage (Tsalafouta et al., 2015). The present finding may indicate that miR-21 in the European sea bass is also involved in mechanisms related to the modulation of stress.

# Between Morula and Flexion/All Fins

The comparison between M and FLX/FINS stages revealed the highest number (263) of exclusively found transcripts. Notably, of the 263 transcripts, only 62 (24%) were classified as miRNA, whereas 164 (62%) reads were annotated as snRNA; all of them were found in higher copy numbers in the M stage. snRNA molecules are known to be an abundant component of eukaryotic cells (Valadkhan and Gunawardane, 2013), their importance for development having already been demonstrated in *Xenopus* nearly four decades ago (Forbes et al., 1983). The exclusively high abundance of snRNA, as illustrated in **Figure 5**, may indicate the importance of studying snRNAs during development in the near future; however, it is beyond the scope of the present work.

# Stage-Specific miRNAs and Their Targets

Annotated miRNAs detected only in one stage and not in the others were found for the M, GO, O, LO, 24 hph, and FF stages. Unique miRNAs found in only one stage and their targets in the European sea bass are listed in **Supplemental Table S3**. For the first stage of the present study (i.e., the M stage), five miRNAs were found; of these, three belonged to the miR-430 family. In zebrafish, it has been shown that the miR-430 family comprises 72 members and that they target a large number of maternal mRNAs (Giraldez et al., 2006). The results of miR-430 target search using TargetScanFish in the present study are illustrated in **Figure 6**. Among the significantly enriched biological process categories were the GO terms "establishment or maintenance of cell polarity," "protein targeting," "response to growth factor," "cellular response to growth factor stimulus," "response to endogenous stimulus," and "cellular response to endogenous stimulus." It has been documented in zebrafish that miR-430a directs cell division, which in turn leads to neural tube development (Takacs and Giraldez, 2016); this is also consistent with the findings of this study. Concerning miR-430d, two clusters were found to comprise the mature miR-430d sequence. These clusters are located in different regions of the genome and appear to fold into the stem-loop typical of miRNAs (**Supplemental Figure S2**).

The FINS stage, determined by the end of metamorphosis and the start of squamation, did not show the presence of any unique DE miRNA. Nevertheless, one of the highly abundant small RNAs at this stage, compared to the others, was miR-462, which has been linked to growth and muscle development in the blunt snout bream (*Megalobrama amblycephala*; Yi et al., 2013). The authors found miR-462 to have a higher count number in small-sized fish than in bigger fish. However, in both cases, a high count number was reported. Among the few miRNAs being regulated between stages FLX and FINS (9), hs-miR-7641-1 and hs-miR-7641-2 were found to be more abundant at the FLX stage. miR-7641 in humans is known as a regulator of ribosomal proteins (Reza et al., 2017). For humans, about 3,500 targets have been identified (TargetScanHuman, release 7.2, March 2018). The respective search in zebrafish did not result in any match. Target search using the European sea bass transcriptome identified the gene paralemmin-1 as a putative target for miR-7641-1/miR-7641-2. Paralemmin-1 is also among the 3,500 targets identified for humans and is known to play an important role in filopodia induction and spine maturation (Arstikaitis et al., 2008). In contrast, hhi-miR-7641 (*H. hippoglossus*) was detected in all stages, except in GO and O, with the highest expression of hhi-miR-7641 found at the 24 hph stage.

In conclusion, investigating the miRNA repertoire during the early development of a commercially important fish species such as the European sea bass may contribute to a better understanding of regulatory processes during early ontogenesis. In the present study, we identified 2,115 ncRNA transcripts in the European sea bass; of these, 684 were annotated as miRNA. Distinct ncRNA expression profiles of 10 developmental stages were recognized and stage-specific miRNA were identified. Putative targets were also detected to provide first insights into miRNA involvement during early development.

# DATA AVAILABILITY

The datasets generated for this study can be found in NCBI SRA database, PRJNA269278 and PRJNA369460.

# ETHICS STATEMENT

All experiments/methods in the present study were performed in accordance with the approved guidelines and regulations from the HCMR institutional animal care and use committee following the three Rs (Replacement, Reduction, Refinement) guiding principles for more ethical use of animals in testing, first described by Russell and Burch in 1959 (EU Directive 2010/63). These principles are now followed in many testing establishments worldwide prior to initiation of experiments.

# AUTHOR CONTRIBUTIONS

ES conceived the study, drafted the manuscript, and performed the bioinformatics analyses. EK carried out RNA extractions and validation and helped in the miRNA work. AT and NP conceived the sampling and collected them for further analysis. MP initiated the study, critically revised the manuscript, and approved the final version.

# FUNDING

Financial support for this study has been provided by the Ministry of Education and Religious Affairs, under the Call "ARISTEIA I" of the National Strategic Reference Framework 2007–2013 (ANnOTATE), co-funded by the EU and the Hellenic Republic through the European Social Fund, and by the European Union Seventh Framework Program (FP7 2010–2014) under the grant agreement no. 265957 (CopeWell).

# ACKNOWLEDGMENTS

Financial support for this study was provided by the Ministry of Education and Religious Affairs, under the Call "ARISTEIA I" of the National Strategic Reference Framework 2007–2013 (ANnOTATE), co-funded by the European Union and the Hellenic Republic through the European Social Fund as well as the European Union Seventh Framework Program (FP7 2010–2014) under grant agreement no. 265957 (CopeWell). We would further like to thank the Cornell University Core Laboratories Center for sequencing provision and the Informatics Group of the IMBBC for computational support. We acknowledge support of this work by the project "Centre for the study and sustainable exploitation of Marine Biological Resources (CMBR)" (MIS 5002670), which is implemented under the Action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00657/full#supplementary-material

SUPPLEMENTAL FIGURE S1 | Total RNA extraction was evaluated on an RNA Nano Bioanalysis chip (Agilent) as well as by gel electrophoresis. Only samples passing this evaluation step as presented here were used for NGS. (A) Total RNA. (B) Small RNA.

SUPPLEMENTAL FIGURE S2 | miR-430d sequences along with their identified stem-loop.

SUPPLEMENTAL TABLE S1 | Identified small RNAs mapped onto the European sea bass genome with threshold e-value of < 0.00005.

SUPPLEMENTAL TABLE S2 | Normalized mean read counts of significant DE transcripts.

SUPPLEMENTAL TABLE S3 | miRNAs found in only one stage and zero counts in the other stages along with their putative target genes, annotation, and accession number.

SUPPLEMENTAL TABLE S4 | GO terms of putative target genes regulated by miR-21 and miR-430.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Sarropoulou, Kaitetzidou, Papandroulakis, Tsalafouta and Pavlidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Identification of Single Nucleotide Polymorphisms Related to the Resistance Against Acute Hepatopancreatic Necrosis Disease in the Pacific White Shrimp *Litopenaeus vannamei* by Target Sequencing Approach

### *Edited by:*

*Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore*

### *Reviewed by:*

*Jian Xu, Chinese Academy of Fishery Sciences (CAFS), China Baoqing Ye, Temasek Life Sciences Laboratory, Singapore*

### *\*Correspondence:*

*Fuhua Li fhli@qdio.ac.cn*

*†These authors have contributed equally to this work.*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 28 February 2019 Accepted: 03 July 2019 Published: 02 August 2019*

### *Citation:*

*Zhang Q, Yu Y, Wang Q, Liu F, Luo Z, Zhang C, Zhang X, Huang H, Xiang J and Li F (2019) Identification of Single Nucleotide Polymorphisms Related to the Resistance Against Acute Hepatopancreatic Necrosis Disease in the Pacific White Shrimp Litopenaeus vannamei by Target Sequencing Approach. Front. Genet. 10:700. doi: 10.3389/fgene.2019.00700*

*Qian Zhang1,2,3†, Yang Yu1,2†, Quanchao Wang1, Fei Liu1, Zheng Luo1,3, Chengsong Zhang1,2, Xiaojun Zhang1,2, Hao Huang4, Jianhai Xiang1,2 and Fuhua Li1,2,5\**

*1 Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China, 2 Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China, 3 University of Chinese Academy of Sciences, Beijing, China, 4 Hainan Grand Suntop Ocean Breeding Co., Ltd., Wenchang, China, 5 Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China*

Acute hepatopancreatic necrosis disease (AHPND) is a major bacterial disease in Pacific white shrimp *Litopenaeus vannamei* farming, which is caused by *Vibrio parahaemolyticus*. AHPND has led to a significant reduction of shrimp output since its outbreak. Selective breeding of disease-resistant broodstock is regarded as a key strategy in solving the disease problem. Understanding the relationship between genetic variance and AHPND resistance is the basis for marker-assisted selection in shrimp. The purpose of this study was to identify single nucleotide polymorphisms (SNPs) associated with the resistance against AHPND in *L. vannamei*. In this work, two independent populations were used for *V. parahaemolyticus* challenge and the resistant or susceptible shrimp were evaluated according to the survival time after *Vibrio* infection. The above two populations were genotyped separately by a SNP panel designed based on the target sequencing platform using a pooling strategy. The SNP panel contained 508 amplicons from DNA fragments distributed evenly along the genome and some immune-related genes of *L. vannamei*. By analyzing the allele frequency in the resistant and susceptible groups, 30 SNPs were found to be significantly associated with the resistance of the shrimp against *V. parahaemolyticus* infection (false discovery rate corrected at *P* < 0.05). Three SNPs were further validated by individual genotyping in all samples of population 1. Our study illustrated that target sequencing and pooling sequencing were effective in identifying the markers associated with economic traits, and the SNPs identified in this study could be used as molecular markers for breeding disease-resistant shrimp.

Keywords: single nucleotide polymorphism, target sequencing, association analysis, disease resistance, penaeid shrimp


# INTRODUCTION

The Pacific white shrimp *Litopenaeus vannamei*, which is naturally distributed along the Pacific coasts of Central and South America, has become the primary cultivated shrimp species in many regions of the world (Lu et al., 2017). Since it was introduced into China in the 1980s, it has become the most dominant aquaculture shrimp species in China (Lu et al., 2016; Wang et al., 2017). In recent years, due to intensive culture and environmental deterioration, infectious diseases caused by viruses and bacteria have led to serious production loss (Wang et al., 2015). Acute hepatopancreatic necrosis disease (AHPND), also called early mortality syndrome, is a devastating disease that usually occurs within 35 days after stocking in cultivation ponds, with mortality reaching as high as 40% to 100% (Joshi et al., 2014; Lun et al., 2018). During disease outbreaks, the shrimp show clinical signs of atrophied hepatopancreas and empty midgut (Nunan et al., 2014). It was reported that *Vibrio parahaemolyticus* carrying the *Photorhabdus* insect-related (Pir*vp*) binary toxin was the pathogenic agent for AHPND (Lee et al., 2015).

Many immune-related genes were found to be involved in the defense of shrimp against *V. parahaemolyticus* infection. The Rho signaling pathway was suggested to be helpful for AHPND pathogenesis in shrimp through transcriptome analysis (Ng et al., 2018). Infection of *V. parahaemolyticus* could strongly activate the genes involved in cell growth and anti-apoptosis (Zheng et al., 2018). In addition, the expression of genes encoding immune effectors, including lectins and antimicrobial peptides (AMPs), was significantly up-regulated after *V. parahaemolyticus* challenge (Soonthornchai et al., 2016; Qi et al., 2017; Maralit et al., 2018; Qin et al., 2018). Although investigation on the mechanism of AHPND outbreak in shrimp has been conducted, there is still no effective strategy to control this disease.

Breeding disease-resistant broodstock is regarded as an efficient approach (Moss et al., 2012). Compared to the traditional breeding technique, marker-assisted selection (MAS) is more efficient, time-saving, and free from environmental impact (Lu et al., 2018; Wang et al., 2017). Previous studies indicated that gene polymorphism was closely associated with economic traits (Wang et al., 2016a; Li et al., 2017). Immune-related genes were usually considered as optimal candidates for the selection of markers associated with resistance to pathogens (Guo et al., 2013; Hao et al., 2015). Single nucleotide polymorphisms (SNPs) are widely used markers for MAS due to their high polymorphism and abundance in the genome. They have already been used in aquatic species to facilitate selective breeding and speed up the discovery of genes related to economic traits, such as high growth (Lv et al., 2015; Tsai et al., 2015), disease resistance (Yue et al., 2012), precocity (Xu et al., 2016), and body conformation (Geng et al., 2017).

Over the past years, molecular genetic studies on the penaeid shrimp *L. vannamei* have made great progress, such as the identification of high-throughput SNPs from the transcriptome (Yu et al., 2014) and the construction of high-density linkage maps (Yu et al., 2015; Jones et al., 2017). These studies provided fundamental information to identify quantitative trait genes and quantitative trait nucleotides (QTNs) for the AHPND-resistant trait. QTNs have been identified in *L. vannamei* for several traits, such as growth-related traits (Andriantahina et al., 2013), ammonia tolerance (Lu et al., 2018), and WSSV resistance (Liu et al., 2014a). Screening for SNPs or genes associated with AHPND is the basic work needed to reveal the molecular mechanism of shrimp defense against bacterial infection, and it will also benefit MAS or gene-assisted selection in shrimp. However, markers related to the AHPND-resistant trait of shrimp have seldom been reported until now. In the present study, genetic association analysis using genotyping by target sequencing and bulked segregant analysis was performed to screen for disease resistance-related SNPs. The bulks were built by selection of extreme samples from populations (Zou et al., 2016). Consequently, several SNPs or genes were identified to be related to the resistance of the shrimp against *V. parahaemolyticus*. The results provide new insights into the molecular basis of disease defense and may accelerate the breeding of shrimp with disease resistance.

# MATERIALS AND METHODS

# Experimental Shrimp

Healthy shrimp were obtained from Hainan Grand Suntop Ocean Breeding Co., Ltd. (Wenchang, China) and maintained in filtered seawater at a temperature of 26 ± 1°C with continuous aeration. Before the experiment, the shrimp were kept in the aquarium for 3 days to acclimate them to the laboratory conditions. The shrimp used in this study were the progeny of a cross between two breeding lines, each of which had undergone 7 years of artificial selection. The progenies produced by the cross of the breeding lines were genetically similar. Although the two populations consisted of multiple families, their genetic backgrounds were very similar.

# *V. parahaemolyticus* Challenge Tests

*V. parahaemolyticus* was isolated from diseased shrimp. The *Vibrio* strain was cultured in tryptic soy broth liquid medium containing 2.0% sodium chloride. The cultured bacteria were further verified by positive polymerase chain reaction (PCR) amplification of the *PirAvp* and *PirBvp* genes (Han et al., 2015). Bacterial titer was counted with a hemocytometer under a light microscope. The infection dose was set at 2.5 × 10⁴ colony-forming units (CFU)/g (body weight). A total of 236 shrimp with a body weight of 3.98 ± 1.9 g and a body length of 6.89 ± 1.35 cm were used as the first population. About 10 μl of bacterial suspension in phosphate-buffered saline (PBS) containing 1.0 × 10⁵ CFU of *V. parahaemolyticus* was injected into each shrimp at the site between the IV and V abdominal segments. A total of 270 shrimp with a body weight of 5.25 ± 2.9 g and a body length of 7.45 ± 1.56 cm were used as the second population, and each shrimp was injected with 1.3 × 10⁵ CFU of pathogenic bacteria. At the same time, 60 shrimp were set aside as the control group and injected with an equal volume of sterile PBS. Mortality was recorded for 8 days. The dead shrimp were stored at −80°C until DNA extraction. Susceptible and resistant phenotypes were evaluated according to the survival time after challenge (Sawayama et al., 2017). In each population, 60 samples that died at an early time [16 h postinfection (hpi)] after injection of *Vibrio* were taken as the highly susceptible group, and 60 samples that survived the infection or died at a later time [6 days postinfection (dpi)] were selected as the resistant group.
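The per-shrimp doses quoted above are consistent with the weight-normalized dose multiplied by the mean body weight of each population. The short sketch below only reproduces that arithmetic; the function name is illustrative and not taken from the study.

```python
# Per-shrimp dose implied by a weight-normalized dose (CFU per g body weight).
DOSE_PER_GRAM = 2.5e4  # CFU/g, as stated in the text


def dose_per_shrimp(mean_body_weight_g: float,
                    dose_per_gram: float = DOSE_PER_GRAM) -> float:
    """Return the CFU injected into a shrimp of the given body weight."""
    return mean_body_weight_g * dose_per_gram


pop1_dose = dose_per_shrimp(3.98)  # mean body weight of population 1 (g)
pop2_dose = dose_per_shrimp(5.25)  # mean body weight of population 2 (g)

print(f"population 1: {pop1_dose:.2e} CFU per shrimp")
print(f"population 2: {pop2_dose:.2e} CFU per shrimp")
```

Rounded to two significant figures, these values correspond to the 1.0 × 10⁵ and 1.3 × 10⁵ CFU reported for the two populations.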

# Loads of *V. parahaemolyticus in Vivo*

To explore the dynamic change of *V. parahaemolyticus* in shrimp, 150 healthy shrimp with a body weight of 5.15 ± 1.6 g were injected with 2.5 × 10⁴ CFU/g (body weight). To determine the pathogenic bacteria count in surviving shrimp, six shrimp were randomly chosen from the tank at 0, 1, 2, 3, 8, and 12 dpi. Various tissues were collected from each shrimp, including hepatopancreas, gill, and muscle. Three repetitions were used each time, and two shrimp were mixed together in each repetition. Tissues were dissected aseptically and then homogenized in PBS. Serial 10-fold dilutions of the supernatant with PBS were produced, and 100-μl solution was inoculated onto thiosulfate citrate bile salts sucrose (TCBS) agar. Plates were incubated at 30°C for 18 h. Colonies were counted to estimate the number of viable bacteria per pool of shrimp tissue (Sotorodriguez et al., 2015).
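A plate count is usually converted to a tissue load by scaling for the dilution, the plated volume, the homogenate volume, and the tissue mass. The sketch below illustrates that conversion under stated assumptions: only the 100-μl plated volume and the 10-fold dilution series come from the text, whereas the homogenate volume, tissue mass, and colony count are hypothetical values chosen for the example.

```python
def cfu_per_gram(colonies: int,
                 dilution_factor: float,
                 plated_volume_ml: float,
                 homogenate_volume_ml: float,
                 tissue_mass_g: float) -> float:
    """Estimate viable bacteria per gram of tissue from one plate count.

    dilution_factor is the total fold dilution of the plated sample
    (for example, 100 for the second step of a 10-fold series).
    """
    if plated_volume_ml <= 0 or tissue_mass_g <= 0:
        raise ValueError("volume and mass must be positive")
    cfu_per_ml_homogenate = colonies * dilution_factor / plated_volume_ml
    return cfu_per_ml_homogenate * homogenate_volume_ml / tissue_mass_g


# Hypothetical example: 50 colonies on the 10^2 dilution plate,
# 0.1 ml plated (as in the text), 1.0 ml homogenate, 0.5 g tissue.
load = cfu_per_gram(colonies=50, dilution_factor=100,
                    plated_volume_ml=0.1,
                    homogenate_volume_ml=1.0, tissue_mass_g=0.5)
print(f"{load:.1e} CFU/g")
```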

# SNP Genotyping Based on Next-Generation Target Sequencing

DNA was extracted from the muscle of shrimp with the Tiangen Plant Genomic DNA Extraction Kit (Tiangen, Beijing, China) according to the manufacturer's instructions. DNA concentration was assessed using a NanoDrop 2000 spectrophotometer (Thermo Scientific, USA). A SNP genotyping panel based on the AmpliSeq™ method was designed. This panel contained 508 amplicons including 60 immune-related genes with 148 amplicons, and the other 360 amplicons were selected with even distribution along the shrimp genome (almost 10 cM per marker along the genome). The primer sequences of the amplicons are displayed in **Supplemental Table S1**.

For each population, two DNA pools were prepared from susceptible shrimp and two DNA pools from shrimp resistant to *V. parahaemolyticus*. Every DNA pool was generated by combining an equal amount of DNA from 30 susceptible or resistant shrimp. These DNA pools were amplified, and next-generation sequencing libraries were constructed using the DNA Seq Library Preparation Kit for Amplicon Sequencing-Illumina Compatible (Gnomegen Technologies, San Diego, CA, USA). The amplicons were sequenced using an Illumina HiSeq 2500. The raw reads were filtered and then mapped to the target sequences using BWA (version 0.7.12; Li, 2011). SNPs were genotyped using the Genome Analysis Toolkit. The read depth of each allele in the pools was extracted from the VCF file and used to estimate the allele frequency in each pool.
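The last step, estimating a pool's allele frequency from allele read depths, can be sketched as follows. This is a minimal illustration rather than the authors' script: it assumes the depths are available as a comma-separated "ref,alt" string of the kind commonly stored in an `AD`-style VCF field, and the depth values are hypothetical.

```python
from typing import Tuple


def parse_allele_depths(ad_field: str) -> Tuple[int, int]:
    """Split a 'ref,alt' depth string into integer read counts."""
    ref, alt = ad_field.split(",")[:2]
    return int(ref), int(alt)


def allele_frequency(ref_depth: int, alt_depth: int) -> float:
    """Frequency of the reference allele in a pool, from read depths."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_depth / total


# Hypothetical depths for one SNP in two replicate susceptible pools.
pool_depths = {"P1S1": "120,80", "P1S2": "90,110"}
freqs = {name: allele_frequency(*parse_allele_depths(ad))
         for name, ad in pool_depths.items()}
mean_freq = sum(freqs.values()) / len(freqs)
print(freqs, round(mean_freq, 3))
```

Because each pool combines equal amounts of DNA from 30 individuals, the read-depth ratio serves as a proxy for the allele frequency among those individuals; it is an estimate and depends on even amplification across alleles.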

# Association Analysis and Candidate Gene Identification

The difference in allele frequency between the susceptible and resistant groups was analyzed using the χ² test. The analysis was performed in the combined population (population 1 + population 2), and *P* values were corrected using the false discovery rate (FDR) according to the method described by Benjamini and Hochberg (1995). The threshold of adjusted *P* was set to 0.05 (5% FDR correction). To ensure that the favorable alleles were consistent in the two populations, SNPs whose allele frequency differences pointed in opposite directions in the two populations were filtered out. Statistical tests were performed using the R statistical software (R Core Team, 2018). Screened SNPs were mapped to the genome of *L. vannamei* (Zhang et al., 2019), and the genes around the SNPs were examined for candidate genes according to their locations and functions.
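The three steps described above (a χ² test per SNP, Benjamini-Hochberg adjustment, and a direction-consistency filter) can be sketched in a few lines. The study used R; the Python version below is only an illustration with standard-library code. The text does not state how the 2 × 2 allele count tables were built, so the counts here are hypothetical, and the test is written without a continuity correction as an assumption.

```python
import math
from typing import List, Tuple


def chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 df, no continuity correction)
    for the table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / margins
    # Survival function of a chi-square variable with 1 df.
    return stat, math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return FDR-adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest P to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def same_direction(diff_pop1: float, diff_pop2: float) -> bool:
    """True if the susceptible-minus-resistant frequency difference
    has the same sign in both populations."""
    return diff_pop1 * diff_pop2 > 0


# Hypothetical allele counts: rows = susceptible/resistant, cols = two alleles.
stat, p = chi2_2x2(10, 20, 20, 10)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(round(stat, 3), round(p, 4), adj)
```

A SNP would be retained in this sketch only if its adjusted *P* is below 0.05 and `same_direction` returns `True` for the two populations.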

# Validation of the Pooling Genotyping Result

To further validate the pooling genotyping result, three SNPs (ALF6-1\_\_22575\_510\_57, ALF6-1\_\_22575\_510\_224, and Marker2060\_197) were genotyped individually in all samples of population 1. Primers were designed based on the flanking sequences of the SNPs (**Table 1**). The PCR program was as follows: 1 cycle of denaturation at 94°C for 5 min; 35 cycles of denaturation at 94°C for 30 s, annealing at 58°C for 30 s, and extension at 72°C for 45 s; followed by a final extension at 72°C for 7 min. The confirmed PCR products were sequenced by Sanger sequencing at Tsingke Biotech (Qingdao, China). The genotype of each sample was determined based on the sequencing chromatograms.


# Transcription Analysis on Two Candidate Genes

Based on the annotation of SNPs located in the genes, two candidate genes, including anti-lipopolysaccharide factor (*LvALF6*) and phosphatidylinositol 3-kinase (PI3K) regulatory subunit α isoform X2 (*LvPI3K*), were selected for further analysis. Their expression patterns in the shrimp during *V. parahaemolyticus* infection were analyzed. Healthy shrimp with a body weight of 8.39 ± 2.2 g and a body length of 8.89 ± 0.78 cm were injected with 1 × 10⁵ CFU of *V. parahaemolyticus*. The hepatopancreas, hemocytes, and gill of shrimp were collected separately at 1, 6, 24, 48, and 72 hpi. Nine shrimp were collected at each time point, and three individuals were pooled as one sample. The detailed steps of total RNA extraction and cDNA synthesis were the same as described previously by Wang et al. (2016b). The expression patterns in different tissues were detected by SYBR Green-based quantitative real-time PCR with the primers shown in **Table 1**. The program was as follows: 95°C for 2 min followed by 40 cycles of 95°C for 15 s, 56°C for 15 s, and 72°C for 20 s, and a melting curve analysis was added to the end of each PCR.

# RESULTS

# AHPND Challenge Test

The cumulative mortality of the two populations was 52.12% and 86.29%, respectively (**Figure 1**). In population 1, 60 samples that died earlier were selected as the susceptible group, whereas 60 surviving samples were selected as the resistant group. Similarly, 60 shrimp that died first were collected from population 2, whereas 37 surviving shrimp and 23 shrimp that died later were selected as the resistant group. No mortality was observed in the PBS group. The loads of *V. parahaemolyticus* in shrimp at different infection stages are shown in **Table 2**. No *V. parahaemolyticus* was detected in the uninfected shrimp. The number of bacteria in shrimp tissues increased gradually after injection. In hepatopancreas and muscle, the bacterial loads reached 10⁶ and 10⁵ CFU/g at 2 dpi, respectively. The density of bacteria in the gill reached the peak with a bacterial load of 10³ CFU/g. At 2 dpi, the bacterial loads decreased slightly. Afterward, they decreased continually in the surviving shrimp and reached 10² to 10³ CFU/g in the end.

# SNPs Associated With AHPND Resistance

A total of 1566 SNPs were kept after filtering out SNPs with a minor allele frequency of less than 0.05. The raw data were submitted to the SRA database with accession numbers from SRR9016244 to SRR9016251. Through association analysis on the two populations, 40 significantly different SNPs were identified. After filtering out 10 false-positive sites that showed opposite favorable alleles in two separate populations (**Figure 2**), 30 SNPs were retained as significant markers (**Table 3**). The Manhattan plot of -log10(original *P*) and the Q–Q plot were supplied in the supplementary file (**Supplemental Figures S1** and **S2**). Among these 30 markers, 4 SNPs were located in previously selected immune genes and 26 SNPs were located in the fragment distributed along the genome.


TABLE 2 | Loads of *V. parahaemolyticus* in hepatopancreas, gill, and muscle of shrimp.

As several SNPs were located in the same fragment, a total of 17 independent fragments were found to be associated with the resistance. **Figure 3** shows the -log10(adjusted *P*) of the three SNPs with the most significant difference, namely loci Marker15416\_294, Marker8720\_486, and Marker1077\_61, located in the fragments Marker15416, Marker8720, and Marker1077, respectively.

To further validate the association result, three SNPs were genotyped by Sanger sequencing in all individuals of population 1. The genotype at each SNP site was determined based on the sequencing chromatograms, which are shown in **Supplemental Figure S3**. For ALF6-1\_\_22575\_510\_57, 219 samples of population 1 were successfully genotyped. The numbers of A/A, A/T, and T/T genotypes were 65, 58, and 96, respectively (**Table 4**). The A/A genotype accounted for 37.5% of the susceptible group, which was higher than that in the resistant group (20.2%). The T/T genotype accounted for 39.2% of the susceptible group, which was lower than that in the resistant group (49.5%). The χ² test showed that shrimp carrying allele T were more resistant than those carrying allele A (*P =* 0.0204). Another SNP in the same gene, ALF6-1\_\_22575\_510\_224, also showed a significant difference in the 219 individuals. The percentage of G in the susceptible group was lower than that in the resistant group (*P =* 0.0293). For Marker2060\_197, T was the favorable allele (*P =* 0.031).
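Individual genotype counts convert directly into allele frequencies, which is the quantity the pooled sequencing estimates. As a small worked example, the sketch below uses the genotype counts reported above for ALF6-1\_\_22575\_510\_57 across the 219 genotyped samples of population 1; the function name is illustrative.

```python
def allele_frequencies(n_hom1: int, n_het: int, n_hom2: int):
    """Allele frequencies at a biallelic SNP from diploid genotype counts.

    Each homozygote carries two copies of its allele and each
    heterozygote carries one copy of both alleles.
    """
    chromosomes = 2 * (n_hom1 + n_het + n_hom2)
    freq1 = (2 * n_hom1 + n_het) / chromosomes
    freq2 = (2 * n_hom2 + n_het) / chromosomes
    return freq1, freq2


# Genotype counts from the text: A/A = 65, A/T = 58, T/T = 96 (n = 219).
freq_a, freq_t = allele_frequencies(65, 58, 96)
print(f"A: {freq_a:.3f}, T: {freq_t:.3f}")
```

The same calculation applied separately to the susceptible and resistant groups gives the per-group allele frequencies that can be compared with the pooled estimates.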

# Genomic Regions Associated With AHPND

The corresponding flanking sequences of the above 30 SNPs with significant difference were obtained by blasting the SNP sequences against the assembled reference genome (**Supplemental Table S2**). Among these SNPs, only 10 markers were located in areas with annotated genes. The surrounding candidate genes are listed in **Table 5**. SNPs ALF6-1\_\_22575\_510\_57 and ALF6-1\_\_22575\_510\_224 were both located in *LvALF6*. The SNP Unigene19157\_All\_\_1806\_348\_223 was located in the gene *LvPI3K*. In addition, Marker66\_168 and Marker66\_291 were near the ubiquitin carboxyl-terminal hydrolase. Some other genes, such as low molecular weight phosphotyrosine protein phosphatase-like and calpain-B-like, were also found as candidate genes.

TABLE 3 | χ² test and allele frequency distribution of SNP markers with significant difference (FDR-corrected *P* < 0.05) in two populations.

*Ref, allele frequency refers to the allele shown in the "Ref"; P1S1 and P1S2, allele frequency in susceptible bulks of population 1; P2S1 and P2S2, allele frequency in susceptible bulks of population 2; P1R1 and P1R2, allele frequency in resistant bulks of population 1; P2R1 and P2R2, allele frequency in resistant bulks of population 2.*

August 2019 | Volume 10 | Article 700 Zhang et al.

FIGURE 3 | -log10(adjusted *P*) of three SNPs, including loci Marker15416\_294, Marker8720\_486, and Marker1077\_61 located in the fragments Marker15416 (A), Marker8720 (B), and Marker1077 (C), respectively.

TABLE 4 | Distribution of the three markers in the susceptible and resistant groups of population 1.

*\*P < 0.05, significant difference of the genotype distributions between these two groups.*

TABLE 5 | Information of associated regions and candidate gene identification.

*\*LG, linkage group.*

According to the gene annotation and association *P* value, we focused on two candidate genes. One was *LvALF6*, where two SNPs with significant difference were located. The other was *LvPI3K*, which contained one SNP with significant difference designated as Unigene19157\_All\_\_1806\_348\_223 (**Table 3**). For SNP Unigene19157\_All\_\_1806\_348\_223, the average frequency of the C allele was 36% in the susceptible group and 20% in the resistant group. Thus, the C allele was identified as a deleterious allele for disease resistance. We further analyzed the expression profiles of the two candidate genes after *Vibrio* challenge (**Figure 4**). The expression level of *LvALF6* in hepatopancreas and gill was significantly up-regulated after infection (**Figure 4A**, **B**), whereas in hemocytes it was up-regulated at 6 hpi and then down-regulated at the late infection stage (**Figure 4C**). The expression level of the *LvPI3K* regulatory subunit in hepatopancreas was significantly down-regulated at the early infection stage (**Figure 4D**), whereas it was markedly up-regulated in gill at 48 and 72 hpi (**Figure 4E**). *LvPI3K* showed no expression difference in hemocytes after infection (**Figure 4F**).

## DISCUSSION

In recent years, AHPND has been considered one of the main problems hindering the rapid development of the shrimp aquaculture industry. However, no effective method has been reported to control this disease until now. It is widely believed that selection for a disease-resistant host is one of the valid control measures to prevent disease outbreaks in shrimp farming (Cock et al., 2009). A good example is the selective breeding of Taura syndrome virus-resistant lines in *L. vannamei* (Argue et al., 2002; Moss et al., 2012). Recently, Ge et al. (2018) reported a three-generation selective breeding of ridgetail prawn (*Exopalaemon carinicauda*) to improve resistance to AHPND infection. However, no research about the selection of AHPND-resistant broodstock has been reported for *L. vannamei* until now. Currently, MAS is known as a method that can greatly accelerate the process of breeding. In red sea bream, it was suggested that the RSIVD-resistant trait was controlled by one major quantitative trait locus that could be useful for MAS (Sawayama et al., 2017). Seven SNP markers were identified as markers for the selection of *V. parahaemolyticus* infection resistance in clam (*Meretrix meretrix*; Nie et al., 2015). Compared with other species, there are few studies about molecular markers associated with AHPND in shrimp. Thus, screening of SNP markers associated with disease resistance is the first step toward MAS.

In this study, we carried out the *V. parahaemolyticus* challenge experiment to obtain AHPND-resistant and AHPND-susceptible populations. Most of the mortality occurred within the first 48 hpi, whereas there was no mortality in the control shrimp. By investigating the dynamic change of pathogenic bacteria *in vivo*, healthy shrimp were confirmed to be *V. parahaemolyticus* free. The amount of *V. parahaemolyticus* in infected shrimp was higher than that in surviving shrimp, and *V. parahaemolyticus* density was rapidly increased at 1 or 2 dpi, which was in accordance with the highest mortality rate during this time. In the terminal stage of the disease, a decrease of *V. parahaemolyticus* counts was observed due to the immune system mechanisms (Khimmakthong and Sukkarun, 2017).

The approaches for discovering SNPs include whole genome sequencing, large-scale amplicon sequencing, transcriptome sequencing, gene-enriched genome sequencing, and so on (Henry et al., 2012). Here, SNP screening and genotyping were performed using a targeted sequencing panel. It contained genome-wide distributed fragments and immune-related genes, which include immune effector factors or immune signaling pathways such as the JAK/STAT, Toll, and IMD pathways. The target sequencing approach based on multiplex PCR is an efficient approach to analyze genetic variation in specific genomic regions. It is cost-effective, flexible, high throughput, and suitable for non-model species. We applied this method for genotyping genomic regions of interest using a pooling genotyping strategy in *L. vannamei*. Subsequently, the individual genotyping results showed that pooled sequencing using the designed amplicon panel displayed high sensitivity in detecting allele frequency.

To reduce false positives, two populations were designed to provide validation for each other. By comparing the allele frequency between the two populations, SNPs that showed opposite favorable alleles in the two separate populations were removed. Among the 30 identified associated markers, 14 SNPs were located in LG33. In total, eight, four, and two SNPs with significant difference were found in the fragments Marker1077, Marker4976, and Marker66, respectively. We speculate that most of these markers were closely linked. However, these regions have no complete genome annotation. This may result from two reasons: one is that these markers might be closely linked to a candidate gene nearby, and the other is that the genome assembly might not be complete; therefore, no candidate genes were identified in this region. With the improvement of the genome assembly, these genomic regions deserve further analysis. In total, 10 markers were located in gene regions. Several SNPs showing the most significant difference, including loci Marker15416\_294, Marker8720\_486, and Marker1077\_61, were not annotated yet. However, this does not affect the usefulness of these SNPs as molecular markers for breeding disease-resistant broodstock.

Among these annotated genes, ALF is a kind of AMP with broad-spectrum activities against bacterial pathogens. A number of ALFs were identified and characterized in *E. carinicauda* (Lv et al., 2017) and Chinese shrimp *Fenneropenaeus chinensis* (Li et al., 2015). Different ALFs exhibited diverse antibacterial and antiviral activities in *E. carinicauda* (Lv et al., 2018) and *F. chinensis* (Yang et al., 2015). In our previous studies, mutations in the *LvALF* gene were regarded as important immune-related factors in the WSSV resistance of *L. vannamei* (Liu et al., 2014a; Liu et al., 2014b). Several polymorphisms of *PtALF* in the swimming crab (*Portunus trituberculatus*) were reported to be associated with resistance and susceptibility to *Vibrio alginolyticus* (Li et al., 2013). In this study, the expression level of *LvALF6* was markedly up-regulated in the shrimp infected by bacteria compared with those in the control group. Therefore, the data suggest that *LvALF6* plays an important role in the disease resistance of shrimp.

In addition, SNPs in *LvPI3K* were also identified to be associated with *V. parahaemolyticus* resistance. The PI3K-Akt pathway is an important intracellular signaling pathway involved in various cellular functions, including anti-apoptosis, protein synthesis, glucose metabolism, and cell cycling (Ruan et al., 2014). Several key molecules in this pathway have been reported to be associated with the invasion of some viruses in shrimp (Su et al., 2014; Zhang et al., 2016). Previous studies have reported that *V. alginolyticus* challenge induced an up-regulation of *LvPI3K* expression (Kong et al., 2018), which illustrated that *LvPI3K* might play an important role in *Vibrio* infection.

In summary, we analyzed 1566 SNPs distributed along the genome and in candidate genes and identified 30 SNPs associated with shrimp resistance to AHPND. The results showed that the target sequencing approach is a useful method for genotyping genomic regions of interest and that the pooling genotyping method is more cost-effective than sequencing each individual separately while still giving accurate and convincing results.

The method established in this study and the identified SNPs could be used for the selective breeding of AHPND-resistant broodstock in *L. vannamei*. The candidate disease-resistant genes will provide information for dissecting the resistance response of *L. vannamei* against *V. parahaemolyticus*.

# DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and the supplementary files.

# AUTHOR CONTRIBUTIONS

QZ and YY conducted the experiment and data processing. JX and FHL conceived and supervised the project. QW contributed to statistical analysis. HH and CZ prepared and cultured the experimental animals. ZL participated in the extraction of genomic DNA. FL contributed to prepare *V. parahaemolyticus*. QZ, YY, and FHL prepared the manuscript. All authors have read and approved the manuscript.

# FUNDING

This work was supported by the National Natural Science Foundation of China (31830100), the National Key R&D Program of China (2018YFD0901301 and 2018YFD0900103), and China Agriculture Research System-48.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00700/full#supplementary-material

SUPPLEMENTAL FIGURE S1 | Manhattan plot of -log10(original *P*).

SUPPLEMENTAL FIGURE S2 | Q-Q plot for *P* values.

SUPPLEMENTAL FIGURE S3 | Chromatograms of different individuals with different allelic variants at the SNP loci (arrows).




**Conflict of Interest Statement:** Author HH was employed by the company Hainan Grand Suntop Ocean Breeding Co., Ltd., Wenchang, China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Zhang, Yu, Wang, Liu, Luo, Zhang, Zhang, Huang, Xiang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Morphological Characteristics and Comparative Transcriptome Analysis of Three Different Phenotypes of *Pristella maxillaris*

*Fangfang Bian1†, Xuefen Yang1†, Zhijie Ou1,2, Junzhi Luo1, Bozhen Tan1, Mingrui Yuan1, Tiansheng Chen1,3‡\* and Ruibin Yang1‡\**

*1 Key Laboratory of Freshwater Animal Breeding, Ministry of Agriculture, College of Fisheries, Huazhong Agricultural University, Wuhan, China, 2 Department of Fisheries, Guangdong Maoming Agriculture & Forestry Technical College, Maoming, China, 3 Collaborative Innovation Center for Efficient and Health Production of Fisheries in Hunan Province, Changde, China*

### *Edited by:*

*Peng Xu, Xiamen University, China*

### *Reviewed by:*

*Yamei Xiao, Hunan Normal University, China Deshou Wang, Southwest University, China*

### *\*Correspondence:*

*Tiansheng Chen tiansheng.chen@mail.hzau.edu.cn Ruibin Yang rbyang@mail.hzau.edu.cn*

*†These authors have contributed equally to this work.*

### *‡ORCID:*

*Tiansheng Chen orcid.org/0000-0003-4763-2307 Ruibin Yang orcid.org/0000-0003-0888-6347*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 28 February 2019 Accepted: 03 July 2019 Published: 02 August 2019*

### *Citation:*

*Bian F, Yang X, Ou Z, Luo J, Tan B, Yuan M, Chen T and Yang R (2019) Morphological Characteristics and Comparative Transcriptome Analysis of Three Different Phenotypes of Pristella maxillaris. Front. Genet. 10:698. doi: 10.3389/fgene.2019.00698*

*Pristella maxillaris* is known as the X-ray fish because of its translucent body. However, the morphological characteristics and the molecular regulatory mechanisms of these translucent bodies are still unknown. In this study, the following three phenotypes, a black-and-gray body color or wild-type (WT), a silvery-white body color defined as mutant I (MU1), and a fully transparent body with a visible visceral mass named as mutant II (MU2), were investigated to analyze their chromatophores and molecular mechanisms. Histological assessment showed that the variety and distribution of pigment cells differed significantly among the three phenotypes of *P. maxillaris*. Three types of chromatophores (melanophores, iridophores, and xanthophores) were observed in the WT, whereas MU1 fish were deficient in melanophores, and MU2 fish lacked melanophores and iridophores. Transcriptome sequencing of the skin and peritoneal tissues of *P. maxillaris* identified a total of 166,089 unigenes. After comparing intergroup gene expression levels, more than 3,000 unigenes with significantly differential expression levels were identified among the three strains. Functional annotation and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of the differentially expressed genes (DEGs) identified a number of candidate melanophore and iridophore genes that influence body color. Some DEGs that were identified using transcriptome analysis were confirmed by quantitative real-time PCR. This study serves as a global survey of the morphological characteristics and molecular mechanism of different body colors observed in *P. maxillaris* and thus provides a valuable theoretical foundation for the molecular regulation of the transparent phenotype.

Keywords: *Pristella maxillaris*, RNA-seq, melanophores, iridophores, molecular mechanism

# INTRODUCTION

As one of the most diverse phenotypic traits under strong selection pressure in many organisms, coloration plays numerous adaptive functions such as predator deterrence, species recognition, and even protecting the organism from solar ultraviolet radiation damage (Lowe and Goodman-Lowe, 1996; Parichy, 2006; Roberts et al., 2009; Zhang et al., 2015). Skin coloration can be influenced by many factors, such as genetics, diet, and general health (Wang et al., 2014; Zhang et al., 2015). Nevertheless, genetics remains the major determining factor (Braasch et al., 2007). Most animals have different body colors that are mainly determined by diverse pigments synthesized by chromatophores or pigment cells. Chromatophores are cells that are specialized in the storage and/or synthesis of light-absorbing pigments or light-reflecting structures (Bagnara et al., 2007; Leclercq et al., 2010; Henning et al., 2013). Teleost fishes have more than six types of pigment cells (Michiels et al., 2008; Wucherer and Michiels, 2012; Goda et al., 2013; Wucherer and Michiels, 2014; Schartl et al., 2016). Zhang et al. (2015) found that the differences in two color patterns of the crimson snapper (*Lutjanus erythropterus*) primarily depended on the density and distribution of pigment cells; in black skin, melanophores are the major pigment cells, and in red skin, iridophores and xanthophores are the major pigment cells. In adult red crucian carp (*Carassius auratus*, red var.), body color undergoes a gray-to-red change, which is due to alterations in the number of skin melanophores (Zhang et al., 2017). In addition, several studies have reported that many fish species can change body transparency based on the differentiation and development of chromatophores (Nilsson Skold et al., 2010; Krauss et al., 2013; Franco-Belussi et al., 2016). 
Mutants of some model fish, such as medaka and zebrafish, exhibit a transparent phenotype caused by changes in the expression levels of some genes (Parichy, 2006; White et al., 2008; Krauss et al., 2013; Kimura et al., 2017). The differentiation and development of pigment cells are strictly regulated by genes (Roberts et al., 2009; Dooley et al., 2013), whereas the genetics behind natural color morph variants in fish remains largely unknown.

RNA-Seq analysis is a convenient and widely used method to investigate gene expression patterns in organisms. To date, several studies have revealed gene expression profiles that are responsible for different color patterns in freshwater fish. The striped pattern of the zebrafish (*Danio rerio*) is generated by self-organizing mechanisms that require interactions among three different types of pigment cells (Irion et al., 2016). In addition, transcriptome analyses of different colored varieties of the common carp (*Cyprinus carpio* var. *color*) (Wang et al., 2014), crimson snapper (Zhang et al., 2015), Midas cichlid (*Amphilophus citrinellus*) (Higdon et al., 2013), and red crucian carp (Zhu et al., 2018) were performed to understand the genetic basis of coloration. Signaling pathways, such as the Wnt/β-catenin (wingless-type MMTV integration site family), tyrosinase synthesis, MAPK (mitogen-activated protein kinase), and cAMP (cyclic adenosine monophosphate) pathways, have been shown to be conserved pathways related to melanophore development in vertebrates (Jiang et al., 2014; Wang et al., 2014). Several studies have investigated the regulatory mechanism of melanophore development, and many pigment-related genes have been identified in mice and fish (Newton et al., 2005; Hoekstra et al., 2006; Fang et al., 2018). However, only a few studies have investigated the role of iridophores (Ng et al., 2009; Higdon et al., 2013) and xanthophores (Sefc et al., 2014) in body coloration, and their detailed molecular mechanisms remain less well understood.

*Pristella maxillaris*, also known as the X-ray fish, is a warm water fish belonging to the family Characidae and the order Characiformes. It is a widely distributed and adaptable fish found in the Amazon and Orinoco basins, as well as in coastal rivers in the Guianas. Due to its translucent body color, *P. maxillaris* is a valuable ornamental fish with a large market. Morphological diversification of *P. maxillaris* has produced many kinds of transparent body mutations, leading to coloration ranging from a black-gray body color to entirely transparent. There are three typical phenotypes: wild-type (WT), which has a black-and-gray body color with black spots on the trailing edge and fin of the operculum; mutant I (MU1), which has a silvery-white body color; and mutant II (MU2), which is fully transparent and has clearly visible visceral tissues. To date, most studies on *P. maxillaris* have mainly focused on its growth and development (Yu et al., 2018). However, investigations on body transparency mutations and the underlying molecular mechanism in *P. maxillaris* have not yet been conducted. To better understand how cells and genetic factors alter body transparency, we utilized stereomicroscopy to observe the differences in chromatophores in WT fish and two different mutants, namely, MU1 and MU2. RNA-Seq was conducted on samples from the three phenotypes to compare their gene expression profiles. In particular, the signaling pathways and candidate genes whose mutations are responsible for differences in body transparency were also examined and quantified. The purpose of this study was to provide a global survey of the morphological characteristics and molecular mechanism of the different body colors in *P. maxillaris*, as well as to generate a theoretical foundation for the molecular regulation of the transparent phenotype.

# MATERIALS AND METHODS

# Ethics Statement

No specific permissions were required for the use of *P. maxillaris* collected for this study in China. All the experimental procedures involving fish were approved by the Institutional Animal Care and Use Committee of Huazhong Agricultural University.

# Samples for Microscopy and Transcriptome Analysis

Samples from fish exhibiting three *P. maxillaris* phenotypes (WT, MU1, and MU2) were collected from the Flower and Bird Market in Wuhan, Hubei, China. Prior to the experiments, the fish were kept in laboratory aquariums under a 14:10 h light/dark cycle at 24 ± 2°C for 2 weeks to acclimate them to the experimental conditions. The fish were anesthetized in well-aerated water containing 100 mg/L tricaine methanesulfonate (MS-222) and then immediately euthanized. Six adult individuals of each *P. maxillaris* phenotype (average length, 3.5 ± 0.3 cm) were selected. Fresh pieces of the operculum lining, peritoneum, and skin were surgically excised and temporarily mounted for light microscopic observation (OLYMPUS SZX16). In addition, fish exhibiting the three phenotypes were anesthetized and fixed for 24 h in formalin, and areas of approximately 1 cm² were cut from the skin and peritoneal tissues for paraffin sectioning. The types and distributions of pigment cells in fish exhibiting the three phenotypes were observed under a microscope (Imager A2). Pigment cell types can be readily identified by their colors and shapes using microscopic and histological methods described in the literature. Melanophores are black/gray and stellate; xanthophores are yellow, orange, or red; and iridophores appear white, blue, or purple-red (Kelsh, 2004; Darias et al., 2013; Zhang et al., 2015; Zhang et al., 2017).

Additional skin and peritoneal tissues from different phenotypes were collected to extract total RNA, and we pooled the skin and peritoneal tissues from multiple individuals of each phenotype of *P. maxillaris*. All fresh tissue samples were frozen immediately in liquid nitrogen and then stored at −80°C before RNA isolation.

# RNA Extraction

Total RNA was obtained from the mixed samples of skin and peritoneum from fish exhibiting the three different phenotypes of *P. maxillaris* using RNAiso Plus Reagent (TaKaRa, China) according to the manufacturer's protocol. RNA concentration was measured with a Qubit® RNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). The RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) and gel electrophoresis were used to assess the quantity and quality of the total RNA.

mRNA was purified from the total RNA using poly-T oligo-attached magnetic beads (NEB, USA). Fragmentation was carried out using divalent cations under elevated temperature in NEBNext First-Strand Synthesis Reaction Buffer (5×). First-strand cDNA was synthesized using random hexameric primers and M-MuLV reverse transcriptase (RNase H). Second-strand cDNA synthesis was subsequently performed using DNA polymerase I and RNase H. The remaining overhangs were converted into blunt ends *via* exonuclease/polymerase activities. After adenylation of the 3' ends of the DNA fragments, NEBNext Adaptor with a hairpin loop structure was ligated to prepare for hybridization. cDNA fragments 250 to 300 bp long were selected as templates. An Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and an AMPure XP real-time PCR system (Beckman Coulter, Beverly, USA) were used to quantify and qualify the sample library.

# Sequencing, Assembly, and Annotation

Transcriptome sequencing was conducted on an Illumina HiSeq 2000 RNA-Seq platform (Illumina, San Diego, CA, USA). Clean reads were acquired after removing reads with adapters, reads with more than 5% unknown nucleotides, and reads in which more than 20% of the bases were of low quality (base quality ≤ 10). Trinity was used to conduct the *de novo* assembly of the transcriptome (Grabherr et al., 2011). Contigs (longer fragments without N) were obtained by combining overlapping reads. Then, different contigs were connected to obtain sequences that could not be extended on either end, which were defined as unigenes. The assembled sequences were compared against the NCBI nonredundant (Nr) protein database, Swiss-Prot, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Clusters of Orthologous Groups (COG) database using BLASTX with an E-value of 1 × 10⁻⁵. The directions of the contig sequences were based on the best alignment results. A combination of BLAST, Blast2GO, KEGG, and the GO database was used for functional annotation. BLASTX alignment (E-value < 1 × 10⁻⁵) with the Nt, Nr, KEGG, Swiss-Prot, and COG databases was conducted to obtain the associated gene name and gene ontology (GO) term accession number, and GO analysis was performed with WEGO software (Ye et al., 2006).
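
The read-cleaning criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name `is_clean_read`, the `has_adapter` flag (adapter detection is assumed to be done elsewhere), and the representation of base qualities as a list of integer Phred scores are all assumptions made for illustration.

```python
def is_clean_read(seq, quals, has_adapter=False,
                  max_n_frac=0.05, low_q=10, max_low_q_frac=0.20):
    """Return True if a read passes the three filters described in the text.

    seq         -- read sequence as a string
    quals       -- per-base Phred quality scores (assumed to be integers)
    has_adapter -- hypothetical flag set by an upstream adapter detector
    """
    if has_adapter:
        return False                      # reads with adapters are removed
    n = len(seq)
    if n == 0:
        return False                      # nothing to evaluate
    # Fraction of unknown nucleotides (N); more than 5% fails.
    n_frac = sum(base in "Nn" for base in seq) / n
    if n_frac > max_n_frac:
        return False
    # Fraction of low-quality bases (quality <= 10); more than 20% fails.
    low_frac = sum(q <= low_q for q in quals) / n
    return low_frac <= max_low_q_frac


# Hypothetical 100-bp reads
good = is_clean_read("ACGT" * 25, [30] * 100)
too_many_n = is_clean_read("N" * 6 + "A" * 94, [30] * 100)
```

Keeping the thresholds as keyword arguments makes it easy to see how the number of clean reads would change under stricter or looser settings.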

# Differential Gene Expression Analysis

Differential expression analysis between each pair of samples was performed using the DEGseq R package (Robinson et al., 2010). P values were adjusted to q values (Shannon et al., 2003). The threshold for significantly differential expression was set at a q value < 0.005 and |log2(fold change)| > 1. Based on the hypergeometric distribution model, GO and KEGG enrichment analyses were conducted on the differentially expressed genes (DEGs). GO enrichment analysis of the DEGs was implemented with the GOseq R package, which is based on the Wallenius noncentral hypergeometric distribution (Young et al., 2010) and can adjust for gene length bias in DEGs. KEGG is a database resource used to understand high-level functions and utilities of biological systems such as the cell, organism, and ecosystem from information at the molecular level, especially large-scale molecular data sets generated by genome sequencing and other high-throughput experimental technologies (http://www.genome.jp/kegg/). We used KOBAS software to test the statistical enrichment of DEGs in KEGG pathways (Mao et al., 2005).
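
As a small illustration of the significance threshold stated above (q value < 0.005 and |log2(fold change)| > 1), the following Python sketch labels a gene as upregulated, downregulated, or not significant. It only applies the cutoffs; it does not reproduce the DEGseq statistics, and the function name `classify_deg` and the example gene names and values are hypothetical.

```python
import math


def classify_deg(fold_change, q_value, q_cutoff=0.005, lfc_cutoff=1.0):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    fold_change -- expression ratio between two samples (must be > 0)
    q_value     -- adjusted P value for the comparison
    """
    log2_fc = math.log2(fold_change)
    # Both conditions are strict inequalities, as in the text.
    if q_value < q_cutoff and abs(log2_fc) > lfc_cutoff:
        return "up" if log2_fc > 0 else "down"
    return "ns"


# Hypothetical genes: (fold change, q value)
genes = {
    "geneA": (4.0, 0.001),    # passes both cutoffs, higher in numerator sample
    "geneB": (0.2, 0.0001),   # passes both cutoffs, lower in numerator sample
    "geneC": (2.0, 0.001),    # log2 fold change is exactly 1, so it fails
    "geneD": (8.0, 0.01),     # large change but q value too high
}
labels = {name: classify_deg(fc, q) for name, (fc, q) in genes.items()}
```

Note that a gene with exactly a twofold change is excluded, because the criterion is a strict inequality.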

# Quantitative Real-Time PCR Validation

We randomly selected several genes to validate the transcriptome data by qRT-PCR, with *gapdh* as an internal control. First-strand cDNA was obtained from the total RNA using random primers and MMLV reverse transcriptase (Promega, Madison, WI, USA). Primers (listed in **Table S1**) were designed using Beacon software. The qRT-PCR was performed with SYBR Green PCR Super Mix (Thermo Scientific, Wilmington, DE, USA) and the CFX96 real-time PCR detection system (Bio-Rad, Hercules, CA, USA). PCR was performed in a 10-µl reaction volume containing 0.5 µl of each primer (5 µM), 0.5 µl cDNA, 5 µl SYBR Green Super Mix, and 3.5 µl ddH2O. The PCR program was as follows: 95°C for 7 min, followed by 40 cycles of 95°C for 10 s, 55°C for 15 s, and 72°C for 15 s. Three technical replicates and three biological replicates of each sample were run along with the internal control gene. Differences in the expression levels among the WT, MU1, and MU2 fish were assessed after first normalizing expression levels to those of *gapdh*, followed by log transformation.
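
The normalization step can be sketched as follows. The text states only that expression was normalized to *gapdh* and then log-transformed; the 2^(−ΔCt) formula used here, which assumes roughly 100% amplification efficiency, is a common convention and an assumption on our part, as are the function name and the Ct values.

```python
import math
from statistics import mean


def log2_relative_expression(ct_target, ct_reference):
    """Log2 expression of a target gene relative to a reference gene.

    ct_target    -- Ct values of the target gene (technical replicates)
    ct_reference -- Ct values of the reference gene, e.g. gapdh
    Assumes perfect doubling per cycle, so relative expression = 2 ** (-dCt).
    """
    delta_ct = mean(ct_target) - mean(ct_reference)
    relative = 2 ** (-delta_ct)           # expression normalized to the reference
    return math.log2(relative)            # log transformation


# Hypothetical Ct values from three technical replicates
value = log2_relative_expression([24.0, 24.2, 23.8], [18.0, 18.1, 17.9])
```

Under this convention the log2 value equals −ΔCt, so a target that crosses the threshold six cycles later than *gapdh* has a log2 relative expression of about −6.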

# RESULTS

# Differences in Chromatophores Among Three Different Phenotypes of *P. maxillaris*

Chromatophores are mainly responsible for the generation of body color and can be further subdivided based on differences in body color. The types and distribution of pigment cells significantly differed among fish exhibiting the different body color phenotypes upon morphological observation. The WT phenotype was much more common than the MU1 and MU2 phenotypes, and WT individuals showed a black and gray body color (**Figure 1 A1**). The MU1 fish were translucent and showed a silvery-white body color (**Figure 1 A2**), and the MU2 individuals were completely transparent with clearly visible gill filaments and visceral tissues (**Figure 1 A3**).

Pigment cell types are easily identified by their colors and shapes. Three types of pigment cells, i.e., melanophores, xanthophores, and iridophores, were observed in the WT operculum lining, skin, and peritoneum (**Figures 1 B1, C1, D1**). In the MU1 fish, however, melanophores were missing from the operculum lining, skin, and peritoneal tissues, which contained only two types of pigment cells, iridophores and xanthophores (**Figures 1 B2, C2, D2**). Lacking both melanophores and iridophores, the MU2 individuals were fully transparent (**Figures 1 B3, C3, D3**). Histological observation was used to compare the skin and peritoneum of fish exhibiting the three phenotypes of *P. maxillaris*. WT fish had many melanophores and iridophores in their skin and peritoneum (**Figures 1 E1, F1**). In contrast, MU1 fish had many iridophores in their skin and peritoneum but no melanophores (**Figures 1 E2, F2**). Neither melanophores nor iridophores were observed in MU2 fish (**Figures 1 E3, F3**).

FIGURE 1 | Morphological and histological observation of different body color phenotypes in *Pristella maxillaris*. The top row of fish: wild-type (WT, A1), Mutant I (MU1, A2) and Mutant II (MU2, A3). B-D: Morphological observation of pigment cells in tissues by temporary mount. Operculum lining (B1, B2, B3); skin (C1, C2, C3); peritoneum (D1, D2, D3). E-F: Histological observation of pigment cells in skin and peritoneal tissues by paraffin section. Skin (E1, E2, E3); peritoneum (F1, F2, F3); melanophores (white arrow); iridophores (black arrow); xanthophores (red arrow). E, epidermis; SS, stratum spongiosum; SC, stratum compactum; M, muscle; P, peritoneum. Scale bar = 100 μm (B-D) or 10 μm (E-F).

# Sequencing and Assembly of the *P. maxillaris* Transcriptome

To better understand the genetics of the translucent body phenotypes, we conducted a comparative transcriptomic analysis among the three different phenotypes of *P. maxillaris* (WT, MU1, and MU2) using next-generation sequencing. After filtering the raw reads, 72.49 million, 71.70 million, and 74.89 million clean reads were generated from the skin and peritoneal tissues of WT, MU1, and MU2 fish, respectively. The detailed sequencing results are summarized in **Table 1**.

Transcriptome assemblies obtained from the three different transcriptome libraries were pooled and used to assemble full-length transcripts without reference genomes by Trinity software. After the elimination of redundant transcripts, 166,089 unigenes were acquired that ranged from 201 to 50,278 base pairs (bp) in length with a mean length of 1,293 bp and an N50 of 2,018 bp (**Supplementary Table S2**). In addition, the size distribution of the transcripts and unigenes is presented in **Figure 2A**. The unigenes provided the basis for the gene expression analysis in the skin and peritoneal tissues from fish exhibiting the three phenotypes of *P. maxillaris*.
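
For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows one common way to compute it from a list of sequence lengths: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the total assembled bases. The function name and the toy lengths are illustrative only and are not taken from this data set.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:             # half of all bases are now covered
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: total = 54 bp, half = 27 bp; 10 + 9 + 8 reaches 27
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because long sequences contribute more bases, the N50 is typically larger than the mean length, which is consistent with the values reported here (N50 of 2,018 bp versus a mean of 1,293 bp).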

# Annotation and Functional Classification

To identify functional information about the assembled unigenes, all 166,089 unigene sequences were used to search against four public databases: the NCBI non-redundant protein (Nr) database, euKaryotic Ortholog Groups (KOG), the Gene Ontology (GO) database, and the Protein family (Pfam) database. The annotation results are shown in a Venn diagram (**Figure 2B**). In total, 94,111 (56.61%), 130,854 (78.78%), 35,703 (21.49%), 78,431 (47.22%), and 77,763 (46.82%) of the unigenes were identified from the Nr, Nt, KOG, KEGG, and Pfam databases, respectively (**Supplementary Table S3**). Furthermore, 139,108 (83.75%) unigenes were simultaneously annotated in more than one database. Analysis of the BLASTX top-hit species distribution showed that 58,760 (35.20%) unigenes were similar to the *Astyanax mexicanus* sequence, 10,476 (6.31%) were similar to the *Danio rerio* sequence, 3,037 (1.83%) were similar to the *Clupea harengus* sequence, and 2,216 (1.33%) were similar to the *Oncorhynchus mykiss* sequence (**Figure 3**).

TABLE 1 | Summary statistics of transcriptome sequencing for three different phenotypes of *Pristella maxillaris.*

# Recognition of DEGs in Three Different *P. maxillaris* Phenotypes

To reveal differences in the chromatophores of the skin and peritoneal tissues in *P. maxillaris* with different phenotypes, we performed a comparative analysis of the three transcriptomes. Based on criteria in which a q value < 0.005 and |Log2(fold change)| > 1 indicate a DEG, we identified 3,808 DEGs between MU1 and WT fish, of which 1,698 were upregulated and 2,110 were downregulated. We also identified 4,699 DEGs between MU2 and WT fish, including 1,859 upregulated genes and 2,840 downregulated genes. In addition, 3,109 DEGs were detected between MU1 and MU2 fish, of which 1,661 were upregulated and 1,448 were downregulated (**Figure 4**).

# Functional Enrichment of Differentially Expressed Genes

Through further analysis of GO term enrichment and KEGG pathways, all the DEGs were classified into different gene ontologies and pathways. After GO annotation, the DEGs between the MU1 and WT, MU2 and WT, and MU1 and MU2 fish were classified into 58 GO terms, 60 GO terms, and 61 GO terms, respectively. Most DEGs were enriched in the following pigmentation-related terms: melanosome, pigment catabolic process, tyrosine biosynthetic process, tRNA (guanine) methyltransferase activity, pigment metabolic process, tyrosine metabolic process, calcium ion transport, purine nucleoside metabolic process, activation of MAPK activity, purine ribonucleotide binding, regulation of Wnt signaling pathway, tricarboxylic acid cycle, and purine-containing compound biosynthetic process. The DEGs were classified into biological processes, cellular components, and molecular functions (as shown in **Figure 5**). Cellular process and metabolic process were the two largest categories within the biological processes; binding and catalytic activity were the two largest molecular function categories; and cell and intracellular were the most abundant categories for the cellular components.

The DEGs in the skin and peritoneal tissues in three different phenotypes of *P. maxillaris* were annotated in the KEGG database. The DEGs between the MU1 and WT, MU2 and WT, and MU1 and MU2 fish participated in 20 pathways (**Table 2**) and were significantly enriched. Among the DEGs between the MU1 and WT fish involved in these 20 pathways, most of the DEGs involved in DNA replication, mismatch repair, nucleotide excision repair, oxidative phosphorylation, and the citrate cycle were downregulated in the MU1 fish compared with their expression in the WT fish (**Table 2a**). At the same time, some of the DEGs involved in tyrosine metabolism and melanogenesis were upregulated in the WT fish compared to their expression in the MU1 fish. The DEGs between the MU2 and WT fish were significantly enriched in some metabolic pathways, including purine metabolism, nucleotide excision repair, the pentose phosphate pathway, glycolysis/gluconeogenesis, mismatch repair, oxidative phosphorylation, tyrosine metabolism, and the melanogenesis pathway (**Table 2b**). In addition, some DEGs involved in purine metabolism, the pentose phosphate pathway, glycolysis/gluconeogenesis, and tyrosine metabolism were downregulated in the MU2 fish compared with the WT fish. The ECM receptor interaction, protein digestion and absorption, proteasome, glycolysis/gluconeogenesis, DNA replication, and purine metabolism terms were significantly enriched among the DEGs between the MU1 and MU2 fish (**Table 2c**). Meanwhile, some DEGs involved in glycolysis/gluconeogenesis, DNA replication, and purine metabolism were upregulated in the MU1 fish compared with the MU2 fish.
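
The hypergeometric model that the Methods cite as the basis for pathway enrichment can be sketched in a few lines of Python. This is a plain one-sided hypergeometric test, not the KOBAS or GOseq implementation (GOseq additionally corrects for gene length bias), and the function name, argument names, and example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, deg_count, deg_in_pathway):
    """One-sided enrichment P value: P(X >= deg_in_pathway).

    total_genes    -- number of annotated background genes
    pathway_genes  -- background genes that belong to the pathway
    deg_count      -- number of DEGs (drawn without replacement)
    deg_in_pathway -- DEGs that belong to the pathway
    """
    upper = min(pathway_genes, deg_count)
    denom = comb(total_genes, deg_count)
    # Sum the probabilities of observing deg_in_pathway or more pathway genes.
    numer = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, deg_count - k)
        for k in range(deg_in_pathway, upper + 1)
    )
    return numer / denom


# Hypothetical counts: 10 background genes, 5 in the pathway, 5 DEGs, all 5 in the pathway
p_small = hypergeom_enrichment_p(10, 5, 5, 5)
```

In practice the resulting P values would also need correction for testing many pathways at once, which this sketch omits.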

# Candidate Genes Related to Chromatophores

According to the zebrafish Ensembl database (http://asia.ensembl.org/Danio_rerio/Info/Index), 97 genes were annotated in the pigmentation category. After a BLAST search with the 97 pigmentation-related genes, approximately 40 melanophore- and iridophore-related genes were detected in the *P. maxillaris* skin and peritoneal tissue. Considering the FPKM (expected number of fragments per kilobase of transcript sequence per million base pairs sequenced) of these genes, we found 14 genes enriched in the tyrosine metabolism and melanogenesis pathways that were expressed at a significantly higher level in WT fish than in MU1 and MU2 fish, and nine genes with significantly higher expression in WT fish (**Table 3**). Among the DEGs, the *protein Wnt-8a* (*wnt8*), *frizzled 2* (*fzd2*), *agouti-signaling protein* (*asip*), *cyclic AMP-responsive element-binding protein 3-like protein 4* (*creb*), and *dual specificity mitogen-activated protein kinase 2-like* (*map2k2*) genes were found to be the most highly expressed in the WT fish, followed by the *alcohol dehydrogenase 6-like* (*adh6*), *glutathione S-transferase* (*gst*), *guanine nucleotide-binding protein* (*gnai*), and *calmodulin-like* (*cam*) genes.
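
The FPKM measure used above can be made concrete with a short Python sketch. It follows the conventional definition (fragments per kilobase of transcript per million mapped fragments); whether the values in this study were computed in exactly this way is an assumption, and the function name and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments              -- fragments mapped to the transcript
    transcript_length_bp   -- transcript length in base pairs
    total_mapped_fragments -- all mapped fragments in the library
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical unigene: 500 fragments, 2,000 bp long, 25 million mapped fragments
example = fpkm(500, 2000, 25_000_000)
```

Normalizing by both transcript length and library size is what allows expression levels to be compared across unigenes of different lengths and across the three libraries.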

Through GO term and KEGG pathway analyses of the significant DEGs, a total of 26 DEGs were found to be involved in glycolysis/gluconeogenesis, purine metabolism, and the pentose phosphate pathway, which play an important role in iridophore development. Fifteen of the 26 genes are detailed in **Table 4**. Interestingly, seven crucial genes were identified in these pathways, including *trifunctional purine biosynthetic protein adenosine-3* (*gart*), *hypoxanthine guanine phosphoribosyl transferase* (*hprt*), *beta-enolase* (*eno*), *nucleoside diphosphate kinase-like* (*ndk*), *guanylate kinase isoform X1* (*guk1a*), *phosphoglycerate mutase 1-like* (*pgam1*), and *6-phosphofructokinase, muscle type* (*pfka*), which were significantly upregulated in the WT and MU1 fish but significantly downregulated in the MU2 fish (*p* value ≤ 0.005). In contrast, the expression of the *bifunctional purine biosynthesis protein* (*pur9*) and *L-lactate dehydrogenase B-A chain* (*ldh*) genes did not differ among the three different translucent body phenotypes.

The GO term enrichment and KEGG pathway analyses of the DEGs identified candidate genes that could regulate melanophore or iridophore development. We also analyzed the mRNA expression of 17 candidate genes (**Figure 6**). Some genes, such as *wnt8*, *fzd2*, *map2k2*, *cam*, *creb*, and *gst*, were significantly upregulated in WT fish compared with the MU1 and MU2 fish. However, the mRNA expression levels of some iridophore-related genes (*pfka*, *eno*, *gart*, and *hprt*) were markedly lower in the MU2 fish than in the WT and MU1 fish.

# Confirmation of DEGs Identified with RNA-Seq by Quantitative Real-Time PCR

To test the DEGs identified by comparative transcriptomic analysis, we selected nine genes from the three comparative groups and the *gapdh* gene for qRT-PCR confirmation. The quantitative real-time PCR (qRT-PCR) expression patterns of six of the nine randomly selected DEGs related to pigment biosynthesis agreed with the results from RNA-Seq analysis; the exceptions were the *pk* (*pyruvate kinase*), *cam*, and *transketolase-like protein 2* (*tktl*) genes (**Figure 7**). Therefore, the expression patterns of the selected genes determined by qRT-PCR were largely in accordance with the RNA-Seq data. Combining the qRT-PCR and RNA-Seq results, we found that melanin-related genes were more highly expressed in the skin and peritoneal tissues of WT fish than in mutant fish, whereas guanine-related genes were more highly expressed in the WT and MU1 fish than in the MU2 fish. In MU2 fish, both the melanin-related and guanine-related genes were expressed at lower levels than in the WT and MU1 fish.

# DISCUSSION

Animal coloration plays an important role in ecological interactions, species recognition, and even protecting the organism from ultraviolet radiation damage (Muske and Fernald, 1987; Lowe and Goodman-Lowe, 1996; Fujimura et al., 2009; Sefc et al., 2014). Diverse body coloration is mainly controlled by the development and location of pigment cells. The variety and number of pigment cells affect animal body transparency (Nilsson Skold et al., 2010; Croucher et al., 2013; Nilsson Skold et al., 2013). In this study, we observed the morphologies of transparent body parts from three different phenotypes of *P. maxillaris* and revealed significant differences in the types and distributions of pigment cells by microscopic observation, as described in the literature (Kelsh, 2004; Darias et al., 2013; Zhang et al., 2015; Zhang et al., 2017). Moreover, we found that changes in the type and number of pigment cells led to different phenotypes and increased the transparency of the *P. maxillaris* body. Extensive research has been performed on this topic, and many species of fish have been shown to change their internal color through responsive peritoneal chromatophores, with the degree of this response correlated with the level of body transparency (Parichy, 2007; Nilsson Skold et al., 2010). Meanwhile, Krauss et al. (2013) indicated that the inner organs can be observed through the skin when iridophores are lost. Our results suggested that the loss of melanophores and iridophores resulted in the change of body color from gray to transparent during *P. maxillaris* reproduction. In this study, xanthophores were not observed in fish of the different phenotypes by the histological method. The reason may be that the inclusions of xanthophores are fat-soluble carotenoids and water-soluble pteridines (Hirata et al., 2003), which dissolve easily during dehydration and repeated washing. In future studies, we will try other methods to observe the xanthophores.

### TABLE 2 | Continued

*\*DEGs, differentially expressed genes, which were identified by the DEGseq package. DEGs between the two samples were selected with the following filter criteria: log2 transcript abundance ratio ≥ 1 and FDR (false discovery rate) ≤ 0.001.*

TABLE 3 | KEGG pathway analysis of positively selected genes involved in melanophores in *P. maxillaris*.

TABLE 4 | KEGG pathway analysis of positively selected genes involved in iridophores metabolism in *P. maxillaris*.

Genetic factors, which are the major determinants of animal body color, influence the production and distribution of pigment cells. In recent years, the mechanism of body color formation in fish has received considerable attention, including transcriptome analyses of model or economically important fish such as zebrafish (Irion et al., 2016), crucian carp (Zhang et al., 2017), and the common carp (Wang et al., 2014). In this study, we used Illumina sequencing technology to examine the skin and peritoneal tissues from fish exhibiting three phenotypes of *P. maxillaris* at the transcriptome level and found many DEGs associated with pigmentation. The DEGs identified among the three phenotypes could help us understand the molecular mechanism and provide valuable genetic information for exploring pigmentation in the future.

The GO enrichment analysis of the DEGs revealed that variations in pigmentation are related to cellular components and biological processes. Most of the clustered groups of DEGs were consistent with those identified in previous works with fish such as zebrafish (Higdon et al., 2013), Midas cichlids (Henning et al., 2013), crucian carp (Zhang et al., 2017), and common carp (Jiang et al., 2014; Li et al., 2015). Interestingly, we found that most of the genes downregulated in the MU2 fish compared to their expression in the WT fish were enriched in GO terms related to the pigment metabolic process, including the tyrosine metabolic process, the activation of MAPK activity, tyrosine 3-monooxygenase activity, the pigment catabolic process, the purine-containing compound biosynthetic process, tRNA (guanine) methyltransferase activity, and the purine nucleobase biosynthetic process. In addition, most of the genes downregulated in the MU1 fish compared with their expression in the WT fish were enriched in the tyrosine metabolic process, activation of MAPK activity, tyrosine 3-monooxygenase activity, and pigment biosynthetic process GO terms.

The KEGG pathway analysis showed that some DEGs were associated with pigmentation-related pathways. In our study, some DEGs between the MU1 and WT fish were enriched in the tyrosine metabolism, melanogenesis, cAMP signaling, and Wnt or MAPK signaling pathways. Both the cAMP and MAPK signaling pathways are involved in melanophore development in vertebrates (Zhang et al., 2015; Zhang et al., 2017). The DEGs in *P. maxillaris* were likely involved in melanin synthesis. Meanwhile, we found that some DEGs between the MU2 and WT fish were enriched in the glycolysis/gluconeogenesis, purine metabolism, and pentose phosphate pathway terms. The identification of genes enriched in these pigmentation-related terms and pathways is informative, and these genes are worth further study.

In this study, by comparing known pigmentation genes with the genes identified in the current transcriptome data, we found many pigmentation genes and pathways in *P. maxillaris*. The putative genes and pathways involved in the three body transparency phenotypes that are related to the pigmentation process are shown in **Figure 8**. We found that the mRNA expression levels of *wnt8*, *fzd2*, *map2k2*, *creb*, *asip*, and *cam* were downregulated in the skin and peritoneal tissues of MU2 fish compared to the WT fish. Several studies have reported that the Wnt signaling pathway participates in melanogenesis in teleost fishes, as well as in mammals (Fujimura et al., 2009; Xing et al., 2011; Zhang et al., 2015). *Wnt8*, a noncanonical Wnt protein family gene, was found in the matrix and precortical cells in the hair follicles of mice (Yamaguchi et al., 1999; Croucher et al., 2013). Interestingly, *wnt8* was expressed at lower levels in the MU1 and MU2 fish than in the WT fish. Wnt8 can bind to Fzd2 to promote the production of guanine-binding protein (Go/Gq), which in turn promotes the expression of β-catenin, thereby inducing the expression of *mitf* (*melanocyte inducing transcription factor*). *mitf* is a key regulatory gene in the melanophore lineage (Levy et al., 2006; Zeng et al., 2015). Some transcription factors, such as β-catenin and sox10 (*SRY-box containing gene 10*), have been reported to act on the promoter region of *mitf* and thereby promote its expression (Sakai et al., 1997; Zhang et al., 2017). In addition, *mitf* directly regulates the expression of multiple genes (*tyr*, *tyrp1*, and *dct* [*dopachrome tautomerase*]) that are necessary for the survival and proliferation of melanophores (Opdecamp et al., 1997; Cheli et al., 2010) and are responsible for the synthesis of melanin (Li et al., 2012). These results suggest that *wnt8*, *fzd2*, and *β-catenin* might play important roles in the body transparency phenotypes of *P. maxillaris*.

We also found that the *map2k2* and *cam* genes were significantly upregulated in the WT fish compared to the MU1 and MU2 fish. *Cam* is activated by cytoplasmic Ca²⁺, which is released from the endoplasmic reticulum to assist protein kinase C (Ma et al., 2014). Additionally, protein kinase C can enhance the promotion of melanin synthesis by protein kinase A by upregulating the *mitf* gene (Park et al., 2006). Another gene, *map2k2*, encodes an important enzyme in the MAPK signaling pathway that can activate *mitf*, increase *mitf* expression, and then stimulate the synthesis of melanin (Levy et al., 2006). In addition, the mRNA expression level of *asip* decreased as the *P. maxillaris* body color changed from gray to transparent. Asip is an endogenous antagonist of alpha-melanocyte stimulating hormone (α-MSH). α-MSH causes an increase in tyrosinase activity and can activate the *melanocortin 1 receptor* (*mc1r*), a key gene in melanogenesis in animals, resulting in increased cAMP levels. Consequently, the melanin biosynthesis process is triggered (Voisey et al., 2001; Henning et al., 2010; Zhang et al., 2017). Conversely, Asip can block melanin synthesis by competing with α-MSH for binding to Mc1r (Sakai et al., 1997; Zhang et al., 2017). Histological assessment revealed that MU1 and MU2 fish do not possess melanophores. In addition, the results suggested that *asip* might not inhibit melanin synthesis in MU1 and MU2 fish. We also found that the *creb* gene was significantly upregulated in the WT fish compared to its expression in MU1 and MU2 fish. It has been reported that *mc1r* activates *creb*, and that its cascade involves the upregulation of *mitf*, which binds and activates melanogenic gene promoters to increase their expression, resulting in increased melanin synthesis (Busca and Ballotti, 2000). This again supports the view that *creb* might play a key role in melanin production.

We also discovered some DEGs, in both the MU2 versus WT and the MU1 versus MU2 comparisons, that are involved in purine metabolism, glycolysis/gluconeogenesis, the citric acid cycle, and the pentose phosphate pathway, such as the β-*eno*, *gart*, *aldo* (*fructose-bisphosphate aldolase C-B-like*), *ldh*, and *hprt* genes. These genes exhibited significantly lower expression in MU2 fish compared to the WT and MU1 fish, followed by *guk1a* and *pfka*, which implied the participation of these pathways in body coloration in MU2 fish. Fish skin and other tissues contain stacks of guanine plates in iridophores (Hirata et al., 2003; Hirata et al., 2005; Failde et al., 2014), and glycolysis and the citrate cycle pathway were found to be key participants in extensive guanine synthesis (Higdon et al., 2013; Irion et al., 2016). In our microscopic observations, many iridophores were observed in the WT and MU1 fish, but fish with the MU2 mutation did not harbor iridophores. Thus, the increased expression of genes within these pathways might be in accordance with the increased requirement of guanine for the reflective iridophore pigment in the skin and peritoneal tissues of WT and MU1 fish.

In the purine metabolism pathway, the *gart* and *phosphoribosylaminoimidazole-succinocarboxamide synthase* (*paics*) genes combine into a complex that is involved in the synthesis of inosine monophosphate, a precursor of the purine nucleotides adenosine monophosphate and guanosine monophosphate. Some studies have shown that guanosine monophosphate synthase increases the number of iridophores (Ng et al., 2009). In addition, Higdon et al. (2013) illustrated that specific enzymes such as *aldo*, *eno*, *pgam1*, and *gart* could regulate guanine synthesis. These results revealed the conservation of pigmentation genes across various species in terms of their sequences and functions. However, further investigations are still needed to determine how these genes work together to regulate guanine synthesis in iridophores.

To test the reliability of the RNA-Seq data, nine genes were randomly selected for qRT-PCR, including *chaperonin containing tcp1 subunit 3* (*cct3*), *solute carrier family 25 member 33* (*slc25a33*), *glutathione s-transferase* (*gst*), and *map2k2*. The expression patterns of these pigment-specific genes determined by qRT-PCR coincided with the results of the RNA-Seq analysis, except for the *pk*, *cam*, and *tktl* genes. The expression levels measured using the two methods were broadly consistent, indicating the reliability of our transcriptome data. We found that body coloration differs among varieties, as does the distribution of chromatophores at the cellular level. Therefore, an increase in body transparency might be caused by the absence of melanophores and iridophores in *P. maxillaris*. Moreover, after analyzing the WT and two mutant transcriptomes, we found that differentially expressed candidate pigmentation genes were mainly enriched in pathways related to melanin and guanine synthesis. However, further work is still needed to determine how these pathways and genes regulate the development of melanophores and iridophores in the body transparency phenotypes of *P. maxillaris*.

In addition, we found that most DEGs were enriched in ribosome-related pathways in the skin and peritoneal tissues of *P. maxillaris* exhibiting different phenotypes, which indicated that ribosomes might play an important role in fish body color formation. Higdon et al. (2013) found that four of the five most highly expressed genes in the transcriptome of zebrafish pigment cells encoded ribosomal proteins. A similar finding was also reported in the transcriptome analysis of sheep (*Ovis aries*) skin (Fan et al., 2013). Some studies have shown that high expression levels of ribosomal protein-related genes are correlated with black coat color in mice (Skarnes et al., 2011). Combined with the transcriptome data, we noted that the ribosomal protein genes might be involved in the formation of body coloration in *P. maxillaris*. However, further studies are needed to elucidate their exact function. We also discovered that some DEGs are involved in nucleotide excision repair, mismatch repair, oxidative phosphorylation, and systemic lupus erythematosus signaling pathways. These genes were significantly downregulated in the MU1 and MU2 fish compared with their expression in WT fish, which might be related to the absence of melanocytes and iridophores. Some studies have indicated that melanin from melanocytes not only scatters and absorbs UV as a physical barrier but also protects other epidermal cells by transferring melanin and reducing DNA damage (Kobayashi et al., 1998; Brenner and Hearing, 2008). In addition, the scattered reflectors and the arbitrary orientations of iridophores reflect all wavelengths of light (Grether et al., 2004).

Taken together, we observed significant differences in the types and distribution of pigment cells in three different phenotypes of *P. maxillaris* and elucidated the potential genes and signaling pathways involved in body transparency.

# DATA AVAILABILITY

This manuscript contains previously unpublished data. The data are available under the BioProject accession number PRJNA525550 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA525550).

# AUTHOR CONTRIBUTIONS

FB, XY, RY, and TC contributed to the study design, the major acquisition, analysis, and interpretation of data, and drafting/revising the article. FB, ZO, and JL performed most of the laboratory work, and BT and MY assisted. FB contributed to the analysis of the data and wrote the manuscript. All authors read and approved the final manuscript.

# FUNDING

This study was supported by the Huazhong Agricultural University Scientific & Technological Self-Innovation Foundation (2662018PY083), the National Natural Science Foundation of China (31771648), and the Finance Special Fund of the Ministry of Agriculture of China (Fisheries resources and environment survey in the key water areas of Tibet).

# ACKNOWLEDGMENTS

We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00698/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Bian, Yang, Ou, Luo, Tan, Yuan, Chen and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Scanning of Genetic Variants and Genetic Mapping of Phenotypic Traits in Gilthead Sea Bream Through ddRAD Sequencing

*Dimitrios Kyriakis1,2,3, Alexandros Kanterakis2, Tereza Manousaki3, Alexandros Tsakogiannis3, Michalis Tsagris4, Ioannis Tsamardinos5, Leonidas Papaharisis6, Dimitris Chatziplis7, George Potamias2 and Costas S. Tsigenopoulos3\**

*1 School of Medicine, University of Crete, Heraklion, Greece, 2 Foundation for Research and Technology–Hellas (FORTH), Heraklion, Greece, 3 Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Center for Marine Research (HCMR) Crete, Greece, 4 Department of Economics, University of Crete, Gallos Campus, Rethymnon, Greece, 5 Department of Computer Science, University of Crete, Voutes Campus, Heraklion, Greece, 6 Nireus Aquaculture SA, Koropi, Greece, 7 Department of Agriculture Technology, Alexander Technological Education Institute of Thessaloniki, Thessaloniki, Greece*

### *Edited by:*

*Lior David, Hebrew University of Jerusalem, Israel*

### *Reviewed by:*

*Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore; Diego Robledo, University of Edinburgh, United Kingdom; Gonzalo Martinez-Rodriguez, Institute of Marine Sciences of Andalusia (ICMAN), Spain*

> *\*Correspondence: Costas S. Tsigenopoulos tsigeno@hcmr.gr*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 19 October 2018 Accepted: 27 June 2019 Published: 06 August 2019*

### *Citation:*

*Kyriakis D, Kanterakis A, Manousaki T, Tsakogiannis A, Tsagris M, Tsamardinos I, Papaharisis L, Chatziplis D, Potamias G and Tsigenopoulos CS (2019) Scanning of Genetic Variants and Genetic Mapping of Phenotypic Traits in Gilthead Sea Bream Through ddRAD Sequencing. Front. Genet. 10:675. doi: 10.3389/fgene.2019.00675*

Gilthead sea bream (*Sparus aurata*) is a teleost of considerable economic importance in Southern European aquaculture. The aquaculture industry shows a growing interest in the application of genetic methods that can locate phenotype–genotype associations with high economic impact. Through selective breeding, the aquaculture industry can exploit this information to maximize the financial yield. Here, we present a Genome Wide Association Study (GWAS) of 112 samples belonging to seven different sea bream families collected from a Greek commercial aquaculture company. Through double digest Restriction-site Associated DNA (ddRAD) sequencing, we generated a per-sample genetic profile consisting of 2,258 high-quality Single Nucleotide Polymorphisms (SNPs). These profiles were tested for association with four phenotypes of major financial importance: Fat, Weight, Tag Weight, and the Length to Width ratio. We applied two methods of association analysis. The first is the typical single-SNP to phenotype test, and the second is a feature selection (FS) method through two novel algorithms that are employed for the first time in aquaculture genomics and produce groups of multiple SNPs associated with a phenotype. In total, we identified 9 single SNPs and 6 groups of SNPs associated with weight-related phenotypes (Weight and Tag Weight), 2 groups associated with Fat, and 16 groups associated with the Length to Width ratio. Six identified loci (Chr4:23265532, Chr6:12617755, Chr8:11613979, Chr13:1098152, Chr15:3260819, and Chr22:14483563) were present in genes associated with growth in other teleosts or even mammals, such as semaphorin-3A and neurotrophin-3. These loci are strong candidates for future studies that will help us unveil the genetic mechanisms underlying growth and improve sea bream aquaculture productivity by providing genomic anchors for selection programs.

### Keywords: aquaculture, *Sparus aurata*, double digest restriction-site associated DNA, Genome Wide Association Study, feature selection

**Abbreviations:** GWAS, Genome Wide Association Study; ddRAD, double digest Restriction-site Associated DNA; SNPs, Single Nucleotide Polymorphisms; FS, Feature Selection; MAS, Marker Assisted Selection; GS, Genomic Selection; QTL, Quantitative trait locus; BIC, Bayesian Information Criterion; LD, Linkage Disequilibrium; QQ plot, Quantile–Quantile plot; SES, Statistically Equivalent Signature; OMP, Orthogonal Matching Pursuit; MB, Markov Blanket; CV, Cross-Validation; PCA, Principal Component Analysis.

# INTRODUCTION

The gilthead sea bream, *Sparus aurata* (Linnaeus, 1758), is a teleost fish of great economic importance for the Mediterranean aquaculture industry (Tsigenopoulos et al., 2014). It ranks first among aquacultured species in the South Mediterranean, with a total production of 160,563 tons in 2016 (FEAP, 2017). One of the top interests of the aquaculture industry is the genetic improvement of the stocks to maximize the efficiency of the production and the product quality (Fernandes et al., 2017). Coupled with this concern, various areas of sea bream biology are being explored, such as nutrition requirements (Silva-Merrero et al., 2017; Guardiola et al., 2018), immune responses (Antonopoulou et al., 2017; Bahi et al., 2018; Tapia-Paniagua et al., 2018), skeletal development (Negrín Báez et al., 2015; Vélez et al., 2018), reproduction, and broodstock management (Loukovitis et al., 2011). Recently, the genome of sea bream has been sequenced and analysed, offering a backbone for conducting genomic analyses on the species (Pauletto et al., 2018).

One of the main avenues to genetically improve the cultured stock is to identify associations between genetic variants and traits of interest, such as growth, disease resistance, and fat content. Genome Wide Association Studies (GWAS) offer a way to accomplish this by comparing the genotypes of individuals with varying phenotypes for a specific trait of interest. GWAS have boosted the field of human genetics as well as plant and livestock breeding (Geng et al., 2017), leading to higher selection accuracies in animal breeding programmes, which in turn lead to lower costs and greater yield (Geng et al., 2017). To conduct a GWAS experiment in nonmodel species, genome-wide sampling of genetic variants is required. Double digest Restriction-site Associated DNA (ddRAD) sequencing yields thousands of polymorphic loci that require sophisticated strategies for data analysis (Catchen, 2013) and is widely used for GWAS (Baird et al., 2008; Etter et al., 2011; Anderson et al., 2012; Palaiokostas et al., 2013). It is well known that biological datasets are susceptible to the curse of dimensionality (Lie, 2014; Stephens et al., 2015). Various methods, such as feature selection, have been developed to address this problem (Tsagris et al., 2018a). Feature selection (FS) is used to identify the important, predictive genetic variants by removing the noise propagated by redundant features, i.e., markers that have the same genotypic profile across all samples. Several FS algorithms have been developed, such as that of Fontanarosa and Dai (2011), Orthogonal Matching Pursuit (OMP) (Cai and Wang, 2011), and Statistically Equivalent Signature (SES) (Lagani et al., 2017), which differ mainly in their approach to discovering associations and in their computational efficiency.

In aquaculture breeding programs, these feature markers can be used for marker assisted selection (MAS) (Yue, 2014). However, genome-wide variants can also be used to directly evaluate breeders, the so-called genomic selection (GS) method (Yue, 2014). Genomic selection is a breeding value estimation methodology that aims to increase the rate of genetic gain, leading to improvement of certain phenotypes *via* genetic marker utilization (Heffner et al., 2011; Lorenz et al., 2011; Yue, 2014; Khatkar, 2017). Genetic markers associated with production traits are used to predict breeding values with high accuracy (Goddard and Hayes, 2007; Sonesson and Meuwissen, 2009; Gutierrez et al., 2015; Wang et al., 2017). Although the high availability of genetic markers (i.e., SNP markers) can improve the accuracy of breeding value estimation through the use of a genomic relationship matrix (i.e., GBLUP), genetic markers that are also associated with production traits could further increase that accuracy and, moreover, allow for the inclusion of alternative models of inheritance, rather than only additive ones, in the genetic evaluation procedures. Genomic selection based on specific traits such as fat, weight, and disease resistance can have great effects on the productivity and profitability of several aquaculture species (Yue, 2014).

In this study, we sought to identify genetic markers associated with important phenotypes in sea bream. We used ddRAD sequencing to identify and genotype genome-wide single nucleotide polymorphisms (SNPs) in multiple sea bream families. We performed both GWAS and FS to test the association between combinations of loci and the phenotypes of fat, weight, tag weight, and length/width. Finally, genomic prediction of the phenotypes was tested using the selected polymorphisms to evaluate its potential in selection for improved phenotypic traits, such as weight, in sea bream. Our ultimate goal was to construct a signature, that is, a combination of genetic markers, that will help maximize sea bream aquaculture efficiency by improving the selected phenotypic traits.

# MATERIALS AND METHODS

# Sample Collection

The fish used in this study were a subset of a larger experiment with progeny from 66 male and 35 female brooders constituting 73 different full sib families from the breeding program of a commercial aquaculture company (Nireus Aquaculture S.A.). From those 73 full sib families, 14 families originating from 13 males and 11 females were selected (selective genotyping), based on their within-family variation of bodyweight at harvest, for genotyping with microsatellite markers in order to perform a QTL confirmation experiment (Chatziplis et al. 2018, in preparation). Seven male and six female brooders with 105 progeny in total, constituting six full sib families and one maternal half sib family (10 progeny on average per family), were used for ddRAD library preparation and sequencing. These seven families were those exhibiting the greatest family variation of bodyweight at harvest out of 14 total families included in the QTL verification experiment (Chatziplis et al. 2018, in preparation). All progeny were reared in commercial conditions, and after PIT tagging, they were transferred to sea cages at 220 Days Post Hatching (DPH) for the growth period. For all progeny, the weight at tagging (g) (205 DPH), weight at harvest (g) (750 DPH), percentage (%) of fat at harvest (as measured in terms of body electrical conductivity, 692 Distell) as described by Besson et al. (2019), the total length at harvest (cm) (750 DPH), and the width at harvest (cm) (750 DPH) were measured.

# Library Preparation and Sequencing

DNA library preparation and sequencing were performed individually for each sample. DNA was extracted using a modified salt-based extraction protocol based on Miller et al. (1988) and treated with RNase to remove residual RNA. Genomic DNA was eluted in 5 mmol/L Tris, pH 8.5, and stored at 4°C. Each sample was quantified by spectrophotometry (Nanodrop 1000–Thermo Fisher Scientific) and its quality was assessed by 0.7% agarose gel electrophoresis. To build the ddRAD library, we used the protocol described by Manousaki et al. (2016), with some minor modifications. Briefly, each of 144 DNA samples (13 parents in triplicates and 105 offspring; 21 ng DNA per sample) was separately but simultaneously digested by two high-fidelity restriction enzymes (RE): SbfI (CCTGCA|GG recognition site) and SphI (GCATG|C recognition site), both sourced from New England Biolabs (NEB), UK. Digestions were incubated at 37°C for 90 min, using 10 U of each enzyme per microgram DNA in 1× CutSmart Buffer (NEB), in a 6 µl total reaction volume. The reactions were slowly cooled to room temperature, and 3 µl of a premade adapter mix was added to the digested DNA and incubated at room temperature for 10 min. This adapter mix contained individual-specific combinations of P1 (SbfI-compatible) and P2 (SphI-compatible) adapters at 6 and 72 nM concentrations, respectively, in 1× reaction buffer 2 (NEB). The ratio of P1 to P2 adapter (1:12) was selected to reflect the relative abundance of SbfI and SphI cut sites present. The P1 and P2 adapters included an inline five- or seven-base barcode for sample identification. Ligations were carried out over 3 h at 22°C by addition of a further 3 µl of a ligation mix comprising 4 mM rATP (Promega, UK) and 2000 cohesive-end units of T4 ligase (NEB) in 1× CutSmart buffer (NEB). The ligated samples were pooled together, and the single pool was column-purified (MinElute PCR Purification Kit, Qiagen, UK) and eluted in 70 µl EB buffer (Qiagen, UK).
The size selection was performed by agarose gel separation, keeping the fragments between 400 and 700 bp. Following gel purification (MinElute Gel Extraction Kit, Qiagen, UK), the eluted size-selected template DNA (68 µl in EB buffer) was PCR amplified (15 cycles PCR; 32 separate 12.5 µl reactions, each with 1 µl template DNA) using a high-fidelity polymerase (Q5 Hot Start High-Fidelity DNA Polymerase, NEB). The PCR reactions were combined (400 µl total) and column-purified (MinElute PCR Purification Kit). The 57 µl eluate, in EB buffer, was then subjected to a further size-selection clean-up using an equal volume of AMPure magnetic beads (Perkin-Elmer, UK) to maximize removal of small fragments. The final libraries were eluted in 24 µl EB buffer. Lastly, the ddRAD libraries were sequenced in one HiSeq 2500 lane (2×125 bp reads).

# Raw Read Quality Control and Demultiplexing

We used FastQC v.0.11.5 to assess the quality of the raw sequence data (Andrews and Babraham Bioinformatics Group, 2010). To recover the reads belonging to each individual, we then cleaned and demultiplexed the raw data using the process\_radtags program from STACKS v.1.46 (Catchen, 2013). In this step, the -c parameter was used to remove reads with an uncalled base, the -q parameter was used to discard low-quality sequencing reads (Phred score below 20) based on the scores in the FASTQ files (Catchen, 2013), and the -t parameter was set to 100 to truncate the final read length to 100 bp.
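The three cleaning rules above can be illustrated with a small Python sketch. It is not the STACKS implementation: the function name `clean_read`, the Phred+33 encoding, and the use of a whole-read mean quality (rather than the windowed check a dedicated tool may apply) are simplifying assumptions made only for illustration.

```python
def clean_read(seq, qual, min_quality=20, length=100, offset=33):
    """Return the truncated (sequence, quality) pair, or None if rejected.

    Hypothetical helper: mimics the spirit of the -c, -q and -t options
    described in the text, under simplifying assumptions.
    """
    # Rule 1 (-c): reject reads that contain an uncalled base.
    if "N" in seq.upper():
        return None
    # Rule 2 (-q): reject reads whose mean Phred score is below the cutoff.
    # Assumption: quality characters use the Phred+33 encoding.
    scores = [ord(ch) - offset for ch in qual]
    if sum(scores) / len(scores) < min_quality:
        return None
    # Rule 3 (-t): truncate the surviving read to a fixed length.
    return seq[:length], qual[:length]


good = clean_read("ACGT" * 30, "I" * 120)      # high quality, 120 bp
uncalled = clean_read("ACNT" * 30, "I" * 120)  # contains N
poor = clean_read("ACGT" * 30, "#" * 120)      # Phred score 2 everywhere
```

Here `good` is kept and cut to 100 bp, while `uncalled` and `poor` are rejected.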

# Data Alignment Against Sea Bream Reference Genome

The annotated reference genome of gilthead sea bream was provided by the Hellenic Centre for Marine Research (H.C.M.R.) (Accession Numbers: SRR6244977-SRR6244982) (Pauletto et al., 2018). To align our samples to the reference genome, we used Bowtie2 v.2.3.0 (Langmead and Salzberg, 2012) with the following parameters: --end-to-end --sensitive --no-unal. Then, we removed multi-aligned reads, reads with >3 mismatches, and reads with mapping quality lower than 20 with Samtools (Li et al., 2009).
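As a rough illustration of the post-alignment filters, the sketch below applies them to individual SAM records in Python. The function `keep_alignment` is hypothetical and is not the Samtools command line used in the study. It assumes the Bowtie2 optional fields `XM:i` (number of mismatches), `AS:i` (alignment score) and `XS:i` (score of the best alternative alignment), and it treats a read as multi-aligned when `XS` is at least as good as `AS`, which is only one possible definition.

```python
def keep_alignment(sam_line, min_mapq=20, max_mismatches=3):
    """Decide whether one SAM record passes the three filters (sketch)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    # Optional fields look like TAG:TYPE:VALUE and start at column 12.
    tags = {}
    for item in fields[11:]:
        name, _type, value = item.split(":", 2)
        tags[name] = value
    if flag & 4:                       # unmapped read
        return False
    if mapq < min_mapq:                # mapping quality below 20
        return False
    if int(tags.get("XM", 0)) > max_mismatches:
        return False                   # more than three mismatches
    # Assumption: an alternative hit scoring as well as the reported one
    # marks the read as multi-aligned.
    if "XS" in tags and int(tags["XS"]) >= int(tags.get("AS", 0)):
        return False
    return True


def make_record(mapq, *tags):
    """Build a toy SAM line for demonstration purposes only."""
    core = ["read1", "0", "chr1", "100", str(mapq), "100M", "*", "0", "0",
            "ACGT", "IIII"]
    return "\t".join(core + list(tags))
```

For example, `keep_alignment(make_record(42, "AS:i:-5", "XM:i:1"))` keeps the read, whereas a record with `XM:i:4` or with mapping quality 10 is dropped.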

# Genotyping RAD Alleles

Genotypes of each sample were constructed using the STACKS pipeline (Catchen, 2013). For each individual, the pstacks program was used to build the RAD loci based on the alignment to the reference genome, setting the minimum depth of coverage required to create a stack (-m) to 3 (default) (Paris et al., 2017). Then, a catalogue of loci was constructed from the parental reads only, using the cstacks program with default parameters. To match the data of each offspring separately against the respective catalogue, we used the sstacks program with the --aligned parameter. Finally, to retrieve the VCF file with the genotypes, we used the populations program.

## Kinship

To check family relationships and detect possible pedigree errors, we used KING v.2.1 (Manichaikul et al., 2010). Kinship coefficients were estimated by KING, setting the --degree parameter equal to 10. The kinship coefficient is a measure of relatedness between two individuals; 1 means monozygotic twins, 0 means unrelated (Manichaikul et al., 2010). Finally, to examine the genetic distances among the studied individuals, we performed a Principal Component Analysis (PCA) and hierarchical clustering using Euclidean distance. PCA and hierarchical clustering were implemented in R using the prcomp and hclust functions, respectively.
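To make the clustering step concrete, the following Python sketch groups individuals by the Euclidean distance between their genotype vectors using agglomerative clustering. It is an illustration only, not the R code used in the study: the genotype coding (0/1/2 copies of an allele), the toy data, and the choice of complete linkage are assumptions (complete linkage is the default method of R's `hclust`, but the text does not state which linkage was used).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the largest pairwise distance between members
    (complete linkage). Returns lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy genotypes for four individuals at four SNPs (hypothetical data).
genotypes = [
    [0, 0, 1, 2],  # individual 0
    [0, 0, 1, 1],  # individual 1
    [2, 2, 0, 0],  # individual 2
    [2, 1, 0, 0],  # individual 3
]
families = complete_linkage(genotypes, n_clusters=2)
```

In this toy case, individuals 0 and 1 end up in one cluster and individuals 2 and 3 in the other. An individual that clusters away from its recorded family would be flagged as a possible pedigree error.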

# Linear Mixed Models

To fit the mixed model for every phenotype, we used the command lmer from the lme4 R package (Bates et al., 2014). Random effects were fitted for each family to control for the correlation within the families. In mathematical notation, the linear mixed model is written as

$$y_i = a + \tau_i + \sum_{j=1}^{p} \beta_j X_j + e_i \tag{1}$$

where i = 1,…,K, with K denoting the number of families, and $y_i$ is the vector of measurements of the i-th family, containing $n_i$ measurements, with $n = \sum_{i=1}^{K} n_i$ the overall sample size. The term $a$ is the overall constant term. The term $\tau_i$ is the random effect of the i-th family, i.e., the deviation of the i-th family from the overall constant $a$. The term $\beta_j$ is the fixed regression coefficient of the variable $X_j$, and $e_i$ is the vector of residuals of the i-th family. The model has two sources of variation: one stemming from the residuals and one stemming from the repeated measurements, $e_{ij} \sim N(0, \sigma_e^2)$ and $\tau_i \sim N(0, \sigma_\tau^2)$, respectively. Residuals represent elements of variation unexplained by the fitted model. Since this is a form of error, the same general assumptions apply to the group of residuals that we typically use for errors in general: one expects them to be normal and approximately independently distributed with a mean of 0 and some constant variance (Bates et al., 2014). To compare two linear mixed models, we used the Bayesian information criterion (BIC). BIC is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based on the log-likelihood function and takes into account the number of estimated parameters. When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in over-fitting (Vrieze, 2012). BIC attempts to resolve this problem by introducing a penalty term for the number of parameters in the model.
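For reference, the criterion described above has the following standard form. The symbols $d$ (number of estimated parameters) and $\hat{L}$ (maximized value of the likelihood) are introduced here for illustration and do not appear in the original text; $n$ is the overall sample size as in Equation 1.

$$\mathrm{BIC} = d \ln(n) - 2 \ln(\hat{L})$$

The first term is the penalty: each additional parameter raises the BIC by $\ln(n)$, so an extra parameter is only worthwhile if it increases $2\ln(\hat{L})$ by more than that amount.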

## Genome Wide Association Study

A typical GWAS analysis tests for variant significance in a set of independent samples. The most common source of sample dependence is family relationships, and our study is based on a family-designed cohort. For this reason, we applied a family-based test for variant significance: we used lmer to fit a linear mixed model for each phenotype, with family id as a random effect. To correct for multiple testing, we set the significance threshold to $10^{-4}$, which is the typical significance level α = 0.05 divided by the number of independent SNPs (497) based on linkage disequilibrium (LD) (Johnson et al., 2010; Clarke et al., 2011). We used PLINK v.1.90 to perform the LD-based pruning (--indep-pairwise 50 5 0.05) (Purcell et al., 2007). Finally, we presented the distribution of the p-values across the genome in Manhattan plots, and we tested for possible p-value inflation through Quantile–quantile (QQ) plots. For these plots, we used the GWASTools (Gogarten et al., 2012) library in R (scripts available upon request).
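The threshold arithmetic can be checked with a few lines of Python. The values 0.05 and 497 come from the text; the variable and function names are illustrative.

```python
alpha = 0.05               # nominal significance level
n_independent_snps = 497   # LD-independent SNPs reported in the text

# Bonferroni-style correction: divide alpha by the number of independent tests.
threshold = alpha / n_independent_snps


def is_significant(p_value):
    """True when a single-SNP p-value passes the corrected threshold."""
    return p_value < threshold


print(f"corrected threshold = {threshold:.2e}")
```

The result is approximately 1.0 × 10⁻⁴, matching the threshold quoted above.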

# Feature Selection

The typical GWAS pipeline reveals individual SNPs that are associated with a specific phenotype. One limitation of this pipeline is that it cannot produce signatures that contain combinations of variants. This problem is commonly referred to as SNP to SNP interaction induction (Balliu and Zaitlen, 2016). The large number of tested genotypes in a typical GWAS experiment makes the exhaustive computation of variant combinations prohibitive. Also, the burden of multiple testing increases with the number of combined variants. This means that a SNP–SNP interaction must be of extreme significance in order to be detected by a method that tests all possible combinations of variants. To tackle this problem, we employed a different approach. We considered SNPs as variables that describe a certain phenotype. We then applied methods that seek the optimum subset of variables with which we can construct a predictive model for a trait of interest (e.g., Weight). This approach is called Variable Selection, or Feature Selection (FS). Solving the FS problem has numerous advantages (Tsamardinos and Aliferis, 2003). Features in biology (e.g., SNPs and gene expressions) are commonly expensive to measure, store, and process (Stephens et al., 2015). By reducing the number of measurable markers-loci via FS, one can reduce this cost. A high-quality FS algorithm improves the predictive performance of the resulting model by removing the noise propagated by redundant features. For our study, we used two different FS algorithms: the first is the Statistically Equivalent Signature (SES) algorithm, and the second is the Orthogonal Matching Pursuit (OMP) algorithm.

### The Statistically Equivalent Signature Algorithm

Commonly, FS algorithms aim to find a single group of features that has the highest predictive power. In contrast, the SES algorithm introduced by Lagani et al. (2017) attempts to identify multiple signatures (feature subsets) whose performances are statistically equivalent. SES produces several signatures of the same size and predictive power regardless of the limited sample size or high collinearity of the data (Statnikov and Aliferis, 2010). It performs multiple hypothesis tests for each feature, conditioning on subsets of the selected features. For each feature, the maximum p-value of these tests is retained, and the feature with the minimum p-value is selected. This heuristic has been proved to control the False Discovery Rate (Tsamardinos and Brown, 2008). SES is specially engineered for small sample sizes and eliminates the need for Bonferroni correction and/or FDR filtering (Lagani et al., 2017). Here, we used an adaptation of the SES algorithm that accommodates repeated measurements (Tsagris et al., 2018a). The SES algorithm is influenced by the principles of constraint-based learning of Bayesian networks (Lagani et al., 2017). Bayesian networks are directed acyclic graphs that represent the dependency relationships between variables in a dataset. An edge A → B in a Bayesian graph represents the conditional dependence of variable B on variable A. There is a theoretical connection between SES and the Bayesian (causal) network that best describes the data at hand (Tsamardinos and Aliferis, 2003). Following the Bayesian networks terminology, the Markov Blanket (MB) of a variable or node A in a Bayesian network is the set of nodes ∂A composed of A's parents (direct causes), its children (direct effects), and its children's other parents (other direct causes of A's direct effects). Every set of nodes in the network is conditionally independent of A when conditioned on the Markov blanket of the node A (∂A as described in formula 2). Thus, the Markov blanket of a node contains the only knowledge needed to predict the behavior of that node.

$$\Pr\left(A \, \middle| \, \partial A, B\right) = \Pr\left(A \, \middle| \, \partial A\right) \tag{2}$$
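The definition of the Markov blanket can be illustrated with a short Python sketch that reads it off a directed acyclic graph. The edge list, the node names, and the function `markov_blanket` are hypothetical and serve only to show how parents, children, and the children's other parents are collected; this is not part of the SES implementation.

```python
def markov_blanket(edges, node):
    """Return the Markov blanket of `node` in a DAG.

    `edges` is a list of (parent, child) pairs, i.e. directed edges.
    """
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    # Other parents of the node's children (sometimes called spouses).
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses


# Toy network: P -> A -> C <- S, C -> D, and an unrelated edge U -> V.
toy_edges = [("P", "A"), ("A", "C"), ("S", "C"), ("C", "D"), ("U", "V")]
blanket_of_a = markov_blanket(toy_edges, "A")
```

For node A the blanket consists of its parent P, its child C, and C's other parent S. The grandchild D and the unrelated nodes U and V are excluded, since they carry no additional information about A once the blanket is known.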

### Orthogonal Matching Pursuit Algorithm

Orthogonal Matching Pursuit is an iterative algorithm. At each iteration, it selects the column-marker of the SNP data matrix, which has the greatest correlation with the current residuals (Cai and Wang, 2011). OMP updates the residuals by projecting the observation onto the linear subspace spanned by the columns that have already been selected and then proceeds to the next iteration. No column is selected twice because the residuals are orthogonal to all the selected columns. The algorithm stops when a criterion is satisfied. We have used its generalized form, gOMP, whose stopping criterion is based upon the difference of the BIC score between two successive models. If the difference is lower than a predefined threshold, the algorithm stops. The major advantage of OMP compared with other alternative methods is its simplicity and fast implementation (Cai and Wang, 2011).
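The iteration described above can be sketched in plain Python as follows. This is a minimal illustration and not the gOMP implementation used in the study: the function names, the Gaussian form of the BIC, the threshold value `bic_tol`, and the toy data are all assumptions. The projection onto the selected columns is done with Gram-Schmidt orthogonalisation, which keeps the residual orthogonal to every column already chosen.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def bic(residual, n_selected):
    """Gaussian BIC up to a constant; +1 counts the intercept."""
    n = len(residual)
    rss = max(dot(residual, residual), 1e-12)  # guard against log(0)
    return n * math.log(rss / n) + (n_selected + 1) * math.log(n)


def omp_select(columns, y, bic_tol=2.0):
    """Greedy OMP: return indices of selected columns, in selection order."""
    cols = [center(c) for c in columns]
    residual = center(y)
    basis, selected = [], []          # orthonormal basis of chosen columns
    current_bic = bic(residual, 0)
    while len(selected) < len(cols):
        # Pick the column most correlated with the current residual.
        best_j, best_score = None, 0.0
        for j, col in enumerate(cols):
            norm = math.sqrt(dot(col, col))
            if j in selected or norm == 0.0:
                continue
            score = abs(dot(col, residual)) / norm
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break
        # Orthogonalise the chosen column against the current basis.
        q = cols[best_j][:]
        for b in basis:
            coef = dot(q, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < 1e-10:            # column adds no new direction
            break
        q = [qi / q_norm for qi in q]
        coef = dot(residual, q)
        new_residual = [r - coef * qi for r, qi in zip(residual, q)]
        new_bic = bic(new_residual, len(selected) + 1)
        # Stop when the BIC improvement falls below the threshold.
        if current_bic - new_bic < bic_tol:
            break
        basis.append(q)
        selected.append(best_j)
        residual, current_bic = new_residual, new_bic
    return selected


# Toy data: the phenotype depends on SNP 0 and SNP 2 but not on SNP 1.
snp0 = [0, 0, 0, 0, 1, 1, 1, 1]
snp1 = [0, 0, 1, 1, 0, 0, 1, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]
phenotype = [10 + 2 * a - 3 * c for a, c in zip(snp0, snp2)]
chosen = omp_select([snp0, snp1, snp2], phenotype)
```

On this toy example the two informative markers are selected and the uninformative one is left out, because adding it does not lower the BIC.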

# Model Selection Through Cross Validation

The selection of the appropriate algorithm for each dataset is a challenging task. Commonly, k-fold cross-validation (CV) is used to identify the algorithm that best fits the examined dataset. Cross-validation is a model validation technique for assessing the results of a model. It is commonly used for estimating how accurately a predictive model performs on unseen data samples. The standard approach to a prediction problem, where a dataset of known data is given, is to split the samples into k folds and, each time, use k−1 folds as the training dataset and the remaining fold as the test dataset ("unknown data"). The goal of cross-validation is to estimate the expected level of fit of a model to a dataset that is independent of the data used to train the model. This approach limits problems like over-fitting and gives an insight into how the model will generalize to an independent dataset (Tibshirani and Tibshirani, 2009). To compare the algorithms and select the best model (including algorithm and parameters), we performed cross-validation by using all but one sample as the training set and the remaining sample as the test set, iterating over all samples, the so-called Leave-One-Out cross-validation method.

The different models were assessed based on the sum of errors when assuming that the "unknown data" belong to each family (Equation 3). The model with the lowest mean sum of errors is selected as the best model (Equation 4).

$$\mathrm{ErrOB} = \sum_{i=1}^{m} E\left(y_{i(n_i+1)} - x_{i(n_i+1)}^{T} \hat{\beta} - z_{i(n_i+1)}^{T} \hat{b}_i\right)^2 / m, \tag{3}$$

where $y_{i(n_i+1)}$, $x_{i(n_i+1)}$ and $z_{i(n_i+1)}$ are, respectively, the outcome and predictors of the new observation in cluster i, and $\hat{\beta}$ and $\hat{b}_i$ are, respectively, the estimates of $\beta$ and $b_i$ based on all the training data. This can be estimated by the leave-one-out cross validation,

$$\mathrm{LOOCV} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left(y_{ij} - x_{ij}^{T} \hat{\beta}^{[i,j]} - z_{ij}^{T} \hat{b}_i^{[i,j]}\right)^2 / N, \tag{4}$$

where $\hat{\beta}^{[i,j]}$ and $\hat{b}_i^{[i,j]}$ are, respectively, the estimates of $\beta$ and $b_i$ based on the training data without subject j in cluster i (Fang, 2011).
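The leave-one-out procedure can be sketched in Python for a much simpler model than the mixed model used here. The example below assumes an ordinary least-squares line with a single predictor and no random effects; the function names and the toy data are hypothetical. It shows only the mechanics: each observation is held out in turn, the model is refitted on the rest, and the squared prediction errors are averaged.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def loocv_mse(xs, ys):
    """Mean squared prediction error under leave-one-out cross-validation."""
    total = 0.0
    for i in range(len(xs)):
        # Refit the model without observation i ...
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        # ... and score it on the held-out observation only.
        prediction = intercept + slope * xs[i]
        total += (ys[i] - prediction) ** 2
    return total / len(xs)


# Toy data: genotype codes and two phenotypes (hypothetical values).
genotype = [0, 1, 2, 0, 1, 2]
linear_trait = [3 + 2 * g for g in genotype]   # exactly linear in genotype
noisy_trait = [3.0, 6.0, 6.5, 2.0, 4.0, 8.0]   # linear trend plus noise
error_linear = loocv_mse(genotype, linear_trait)
error_noisy = loocv_mse(genotype, noisy_trait)
```

A model that captures the trait well gives a small leave-one-out error, so candidate models can be ranked by this quantity, which is the role Equation 4 plays in the analysis.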

# Selected SNP Annotation

To identify potential genes that might be affected by the retrieved SNPs, we searched the reference genome and classified the SNPs into those falling within a genic region (located within an annotated gene or in a window of 10 kb upstream or downstream of it) and those that do not. If these regions were described as conserved in the genome browser of gilthead sea bream (http://biocluster.her.hcmr.gr/myGenomeBrowser?search=1&portalname=Saurata\_v1) in any of the following species: Stickleback, Asian sea bass, Medaka, Asian swamp eel and Amazon molly, they were considered as conserved.

# RESULTS

## Genotyping RAD Alleles

Illumina sequencing yielded 559,191,588 raw reads. Following quality control, we filtered out ~15.2% due to ambiguous barcodes, ~2.9% due to low quality, and 1% due to the lack of restriction sites. The rest were successfully assigned to individuals (**Supplementary Table 1** with number of reads per individual). After demultiplexing, the high-quality reads of each sample were aligned against the reference genome. In total, 93% of the reads were mapped. Downstream filtering discarded multi-aligned reads (~8%) and reads with more than three mismatches (~2.96%), finally keeping 351,781,485 reads for analysis. This resulted in an average coverage of 188.25×. Although we did not experiment with greater values of m and used the default value proposed by STACKS, the sequencing effort was enough to reach 188.25× coverage on average (s.e. ± 9.68) for the loci in our study. However, it has been suggested that moderate values of m (3–6) (Paris et al., 2017) might not have any effect on the mean coverage of the reconstructed loci in a teleost species. The ddRAD catalogue built from all parental samples consisted of 15,233 SNPs. The ddRAD protocol used here has been applied in other sparids as well (Manousaki et al., 2016; Manousaki et al. unpublished data). In all cases, the number of produced SNPs was in the range of 5,000–10,000 per individual (Manousaki et al., 2016; Palaiokostas et al., 2018). In this study, and in accordance with this protocol, the SNP catalogue was built using solely parental data. Thus, the number of discovered SNPs is within the expected range given the ddRAD protocol followed. Variants with allele frequency lower than 0.05 (n = 2,065) were filtered out. From the remaining 13,168, we filtered out the SNPs with call rate lower than 90% (n = 7,882). From the remaining 5,286 SNPs, 3,028 had at least one missing value and 2,258 had no missing values.
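The two SNP filters can be illustrated with the following Python sketch. It is not the pipeline used in the study: the function `filter_snps`, the genotype coding (0/1/2 copies of the alternative allele, `None` for a missing call), the interpretation of "allele frequency" as minor allele frequency, and the toy data are assumptions.

```python
def filter_snps(genotypes, min_maf=0.05, min_call_rate=0.90):
    """Return the ids of SNPs passing the frequency and call-rate filters.

    `genotypes` maps a SNP id to a list with one entry per individual:
    0, 1 or 2 alternative alleles, or None for a missing genotype.
    """
    kept = []
    for snp_id, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        call_rate = len(called) / len(calls)
        if not called or call_rate < min_call_rate:
            continue                      # too many missing genotypes
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1 - alt_freq)
        if maf < min_maf:
            continue                      # allele too rare to be informative
        kept.append(snp_id)
    return kept


# Toy data for ten individuals (hypothetical SNP ids and genotypes).
toy_genotypes = {
    "snp_a": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],        # passes both filters
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic
    "snp_c": [1, None, None, 2, 1, 0, 1, 1, 2, 0],  # call rate 80%
}
passing = filter_snps(toy_genotypes)
```

Only `snp_a` survives in this toy case. The counts reported above are internally consistent: 15,233 − 2,065 = 13,168, 13,168 − 7,882 = 5,286, and 3,028 + 2,258 = 5,286.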

## Kinship Assignment

To verify the family identity of the studied individuals, we used three different methods: King kinship, Principal Component Analysis (PCA), and hierarchical clustering (**Supplementary Figure 1**). All three gave similar results and confirmed the family ID recorded at tagging, except for two samples: one was placed in a different family (sample 133, which was identified as a member of Family 2 instead of Family 3) and one was not placed in any family (sample 882). These two samples were discarded and not included in downstream analyses.

## Association Analysis Through GWAS

The results of the GWAS tests between all SNPs and the four phenotypes are shown in **Table 1**. In total, we found five SNPs associated with Weight, four SNPs with Tag Weight, and none with Fat or Length/Width. **Figure 1** shows the phenotype distribution, Manhattan plot, and QQ-plot for each phenotype. For illustration purposes, the Manhattan plot was built with variants whose ordered positions on the reference genome are known. The Manhattan plot for the variants on scaffolds whose exact position in the genome is unknown is given in **Supplementary Figure 2**. The QQ-plot of Weight revealed a systematic inflation of the observed p-values, possibly because families were selected so as to maximize the weight variation within the cohort. Across weight and tag weight, we identified nine associated SNPs in total (**Table 1**). Five SNPs associated with weight at harvest were retrieved from the typical GWAS analysis. The first was found on chromosome 1 (chr1:16636968) in the "ethanolamine phosphate cytidylyltransferase-like" gene and the second (chr6:12617755) in a conserved region upstream of the "myosin-7-like" gene. The third (chr16:2232897) was located on two overlapping genes, "acetylserotonin O-methyltransferase-like" and "LBH-like isoform X1". The other two SNPs were also found on chromosome 1: the fourth (chr1:6970078) was located downstream of "lymphoid enhancer-binding factor 1" and the fifth (chr1:20827142) upstream of "mucin-5AC-like isoform X1" (**Table 1**). Finally, four SNPs (on chromosomes 2, 13, and 22) were associated with weight at tagging. Two were found in the "RNA-binding 27 isoform X1" gene (chr13:20975921, chr13:20975924), the third upstream of the "Tetratricopeptide repeat 36" gene (chr2:2623351), and the fourth upstream of the "tectonin beta-propeller repeat-containing 2" gene (chr22:18343985).
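The inflation visible in a QQ-plot is commonly summarized by the genomic inflation factor, the ratio of the median observed test statistic to its expected median under the null. The sketch below shows one way to compute it from a list of p-values. It is an illustration only: this quantity is not reported in the study, and the function names are hypothetical.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 d.f.


def chi2_from_p(p):
    """Convert a two-sided p-value (0 < p <= 1) to the equivalent 1-d.f. chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2)  # lower-tail quantile; squaring removes the sign
    return z * z


def genomic_inflation(p_values):
    """Median observed chi-square divided by the expected median under the null."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic the null; squaring them mimics inflated signals.
null_like = [(i - 0.5) / 999 for i in range(1, 1000)]
inflated = [p * p for p in null_like]
```

Values close to 1 indicate well-calibrated p-values, whereas values clearly above 1 point to inflation from sources such as family structure or, as suggested above, the way families were selected.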

## Association Analysis Through FS

Feature selection (FS) methods generate groups of SNPs that are associated with a phenotype en masse. Therefore, FS is a valuable family of methods for association analysis. We performed FS with 10 models (8 variants of SES and 2 variants of OMP), and from each model, we extracted the median squared error as an evaluation metric (**Figure 2**). All OMP models were inferior to SES. The best models for Fat and Weight were constructed by the SES algorithm (significance threshold equal to 0.01; conditioning set size equal to three). The best model for Tag weight and Length/Width ratio prediction was constructed from variables retrieved by SES with conditioning set size equal to two. The selected features of the best model for each phenotype are presented in **Tables 2**–**5**. SES produced different combinations of SNPs (signatures) that have the same predictive strength for each of the examined traits. In **Tables 2**–**5**, we show one of these combinations, while the rest are given in **Supplementary Tables 2–5**. Finally, the effects of all selected SES SNPs (17 in total, of which 6 were also found by GWAS) for all traits are presented in **Figures 3**–**6**.
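The model comparison described above ranks candidate models by their median squared error. A minimal sketch of that ranking step follows, assuming that each model has already produced predictions for the same set of held-out individuals. The model labels, the weights, and the held-out setup are hypothetical and do not come from the study.

```python
from statistics import median


def median_squared_error(observed, predicted):
    """Median of squared residuals; less sensitive to outliers than the mean."""
    return median((o - p) ** 2 for o, p in zip(observed, predicted))


def best_model(observed, predictions_by_model):
    """Return (name, score) of the model with the lowest median squared error."""
    scores = {
        name: median_squared_error(observed, pred)
        for name, pred in predictions_by_model.items()
    }
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical harvest weights (g) and predictions from two made-up models.
observed = [780.0, 820.0, 805.0, 760.0, 835.0]
predictions = {
    "SES_a0.01_k3": [790.0, 815.0, 800.0, 770.0, 830.0],
    "OMP_bic2": [760.0, 850.0, 790.0, 800.0, 810.0],
}
```

Using the median rather than the mean keeps a single badly predicted individual from dominating the comparison, which matters with small cohorts.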

### Selected SNPs for Fat Content

The SES algorithm (threshold 0.01) retrieved three SNPs associated with fat content (%) at harvest, two of which were located within or proximal to an annotated gene (**Table 2**). The first annotated SNP is located within the "telomeres 1 (POT1)" gene (chromosome 8), a region found to be conserved in other species as well (Medaka, Asian swamp eel, Asian sea bass). The second SNP was located within the "Rho family GTP-binding" gene (chr13:1098152). When the significance threshold was relaxed to 0.05, the number of SNPs increased to six (**Table 2**).

### Selected SNPs for Weight at Harvest

Four selected variables associated with weight at harvest (800 g average weight at harvest) were retrieved from the SES algorithm with conditioning set size equal to three. The first was found on chromosome 1 (chr1:16636968) in the "ethanolamine phosphate cytidylyltransferase-like" gene, the second (chr6:12617755) in a conserved region upstream of the "myosin-7-like" gene, the third (chr8:11613979) in the "semaphorin-3A" gene (conserved in Asian sea bass and Asian swamp eel) and upstream of the "Piccolo" gene, and the fourth (chr16:2232897) on two overlapping genes, "acetylserotonin O-methyltransferase-like" and "LBH-like isoform X1". When the significance threshold was relaxed to 0.05, four SNPs were added to the signatures, retrieving two more annotated genes (**Table 3**).

TABLE 1 | Selected SNPs from GWAS analysis using linear mixed models, with significance threshold equal to 10<sup>−4</sup>.


thresholds as stop criterion (Threshold = 2 or 4 units in BIC score).



TABLE 3 | Selected SNPs from SES algorithm with significance threshold equal to 0.05 (best method based on median squared error).


TABLE 4 | Selected SNPs from SES algorithm with significance threshold equal to 0.05 (best method based on median squared error score).


TABLE 5 | Selected SNPs from SES algorithm with significance threshold equal to 0.05 (best method based on median squared error score).


FIGURE 3 | The effect of each of the selected SES SNPs associated with fat content. (A-C) Boxplots of selected SNPs. (A) chr8:1385781, (B) chr13:1098152, (C) chr21:19924408.

### Selected SNPs for Weight at Tagging

Five SNPs were associated with Tag Weight, as retrieved from the SES algorithm (**Table 4**). The first was found in the "RNA-binding 27 isoform X1" gene (chr13:20975921), the second upstream of the "Tetratricopeptide repeat 36" gene (chr2:2623351), the third in the "DNA repair RAD50" gene (chr13:20883924), the fourth upstream of the "tectonin beta-propeller repeat-containing 2" gene (chr22:18343985), and the fifth (scaffold4139:36071) was not in an annotated region. When the significance threshold was relaxed to 0.05, four annotated SNPs were added to the discovered signatures (**Table 4**).

### Selected SNPs for Length/Width Phenotype

Finally, five SNPs were associated with Length/Width ratio (at 750 DPH), as retrieved from the SES algorithm (**Table 5**). The first SNP (chr6:23799286) was located in the "phosphatase 1 regulatory subunit 3D-like" gene. The second SNP (chr16:2232897) was located in two genes, "acetylserotonin O-methyltransferase-like" and "LBH-like isoform X1". The third SNP (chr13:9665394) was located in "ATP-dependent RNA helicase DHX33," the next one in "A-kinase anchor 9 isoform X3," and the last one (scaffold13177:8369) downstream of "phosphatase 1 regulatory subunit 3C".

# DISCUSSION

Here, we present a family-based approach for the discovery of genetic variants that are significantly associated with a set of phenotypes of economic importance for the farmed gilthead sea bream. The application of these methods to seven families, each measured for four phenotypes, revealed several genetic signatures that may be used for genomic selection. Various QTL affecting growth, morphology, and stress-related traits have been detected using microsatellite markers in gilthead sea bream (Boulton et al., 2011; Loukovitis et al., 2011; Loukovitis et al., 2012; Loukovitis et al., 2013). Some of those QTL have been verified in genetically unrelated populations (Loukovitis et al., 2016). However, no association study using SNP markers was available for production traits in sea bream, except that of Palaiokostas et al. (2016) on pasteurellosis. Our study fills this gap, enabling for the first time a genomic scan for SNPs that are linked to important traits. We applied two intrinsically different methods. The first is a typical GWA study that examines variants independently, and the second is a family of methods (SES and OMP) that generates signatures with multiple variants.

In the GWA analysis, 497 independent SNPs remained after LD-pruning. LD-pruning is expected to drastically reduce the number of SNPs. Studies have shown that a strict LD filter like the one we applied has minimal effect on the predictive accuracy of the remaining SNPs (Palaiokostas et al., 2019). In general, we noticed a concordance between the SNPs discovered by GWAS and SES. Both methods include tests for SNP–phenotype statistical association, whereas OMP conducts residual-based tests for SNP association. The SES algorithm attempts to identify specific sets of SNPs that model a specific phenotype, whereas the typical GWAS pipeline reveals statistical associations. One interpretation of the SNPs that were detected by GWAS but not by SES is that they do not have a direct effect; in other words, their effect can be eliminated by conditioning on the SNPs that SES revealed. For example, two SNPs that the typical GWAS identified as associated with weight at tagging (chr13:20975921, chr13:20975924) were marked by SES as equivalent. SES was built upon the MMPC algorithm (Tsamardinos et al., 2003). The difference between the two algorithms is that MMPC does not return multiple solutions. MMPC was shown to achieve excellent false positive rates (Aliferis et al., 2010). From a biological perspective, multiple equivalent signatures may arise from redundant mechanisms, for example, genes performing identical tasks within the cell. For instance, Ein-Dor et al. (2005) demonstrated that multiple, equivalent prognostic signatures for breast cancer can be extracted just by analyzing the same dataset with a different partition into training and test sets, showing the existence of several loci that are practically interchangeable in terms of predictive power.
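The conditioning argument above, in which a SNP is marginally associated with a phenotype but loses its association once another SNP is taken into account, can be illustrated with a partial correlation. This is a simplified stand-in and not the conditional independence test implemented in SES. The genotypes and phenotype values are constructed by hand so that the tag SNP carries no information beyond the direct SNP, and all names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hand-built toy data for 8 individuals (not from the study).
direct_snp = [0, 0, 0, 0, 2, 2, 2, 2]            # SNP with a direct effect
tag_snp = [0, 0, 0, 2, 2, 2, 2, 0]               # imperfect proxy of direct_snp
weight = [101, 99, 100, 100, 121, 119, 120, 120]  # 100 + 10 * direct_snp + small noise

marginal = pearson(tag_snp, weight)                             # clearly non-zero
conditional = partial_correlation(tag_snp, weight, direct_snp)  # vanishes
```

In this toy case a single-SNP scan would flag both SNPs, while a method that conditions on `direct_snp` would drop `tag_snp`, which mirrors the interpretation given for the SNPs found by GWAS but not by SES.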
SES was tested against LASSO (Lagani et al., 2017) with continuous, binary, and survival target variables, resulting in SES outperforming the LASSO algorithm (Groll and Tutz, 2014) both in predictive performance and computational efficiency. Overall, SES seems to perform well on smaller datasets, while OMP is known to perform better on larger datasets (Tsagris et al., 2018b). A known limitation of every GWA study is that the power to detect small QTL effects is limited by the number of samples. An underpowered GWA study may fail to detect some associations, whereas the detected signals might be inaccurate in terms of location and/or biological interpretation. The sample size of our study (N = 103) might indeed produce some artifacts of this kind. Nevertheless, the analysis pipeline that we applied (SES) is specially tailored for small or moderate sample sizes in order to detect statistically significant QTLs. We anticipate that a future study with a greater sample size will refine our findings and might locate additional important QTLs. Our findings highlight novel SNPs found within or close to coding genes that are significantly associated with our focal traits of interest in sea bream. However, several of those genes have been linked with such traits in other species as well. Multiple interesting genes were associated with fat content. For example, one SNP locus is linked with the gene Rho-GTP binding, which is involved in adipogenesis in mice (Sordella et al., 2003). This gene and its regulator (p190-B RhoGAP) seem to have a key role in the outcome of the differentiation of mesenchymal stem cells to either adipocytes or myocytes (Sordella et al., 2003). Another SNP associated with fat was located on neurotrophin-3 (NT-3), a gene with well-recognized effects on peripheral nerve and Schwann cells, promoting axonal regeneration and associated myelination (Yalvac et al., 2018). NT-3 increases muscle fiber diameter in the neurogenic muscle through direct activation of the mTOR pathway, and the fiber size increase is more prominent for fast-twitch glycolytic fibers. Thus, fat content seems to be influenced greatly by a few genes with a well-known role in adipogenesis.

Regarding the loci associated with weight and tag weight, we identified 15 genes in total. Interestingly, although those two traits represent the same trait at different stages, we found no gene associated with both. There are several possible reasons for this result. One may be the low power of the experiment and the differences in the variation of fish weight at different ages. Another may be that different genes affect growth at different stages of development. A third may be that gene action is not only additive and that epistatic effects exist. In any case, all these scenarios should be further investigated in a more powerful experiment. Our analysis revealed SNPs close to very important genes with a well-known role in weight gain or loss, such as the Follistatin, myosin-7, and semaphorin (SEMA3A) genes. Follistatin binds and inhibits the activity of several TGF-β family members in mice (Lee and McPherron, 2001). Strikingly, follistatin knockout mice have reduced muscle mass at birth, underlining the importance of this gene in muscle growth (Lee and McPherron, 2001). Apart from Follistatin, the significant association with Myosin, an actin-based motor molecule with ATPase activity essential for muscle contraction, shows the importance of the regulation of muscle growth-related genes for weight. The third gene, semaphorin, is significantly associated with both weight and length/width. The SEMA3A gene is involved in synapse development, underlining the importance of genes regulating the nervous system for length. In addition, the same SNP, which is located on SEMA3A, was directly upstream of the Piccolo gene. Piccolo plays a role in regulating the pool of neurotransmitter-filled synaptic vesicles present at synapses. Mice lacking Piccolo are viable; nevertheless, the mutants display abnormalities, including reduced postnatal viability and body weight (Mukherjee et al., 2010). Another associated gene, ethanolamine phosphate cytidylyltransferase, plays a role in lipid metabolism. Finally, EXT1 is a gene regulating important developmental pathways such as hedgehog (Siekmann and Brand, 2005).

An annotated reference genome for this species has recently been published by the Hellenic Centre for Marine Research (H.C.M.R.) (Pauletto et al., 2018) and is also available on the Genome Browser<sup>1</sup>. To our knowledge, this analysis is the first to use this genome as a reference for read alignment and variant calling. Moreover, a literature review did not reveal any study examining the same collection of traits in this species. Consequently, we cannot for the moment provide a comparative analysis with other studies. Studies on related species include those of Yoshida et al. (2019), which examines weight in Nile tilapia, Nguyen et al. (2018), which examines weight in Yellowtail Kingfish, and Yu et al. (2018), which examines weight and total length in Epinephelus coioides. Although our study does not share any gene with these studies, it is interesting that these studies also share no genes among themselves. This suggests high genetic variability in these traits across species, as well as the need for future studies with larger sample sizes and better coverage that can provide additional insights into the common genetic content of aquacultured species.

# CONCLUSION

In this study, we employed two different approaches to identify variants associated with growth-related phenotypic traits. Our selected panel, combined with rigorous bioinformatic analyses, revealed the most significant SNP loci on the sea bream genome. The discovered candidates are located in the proximity of genes with known involvement in processes related to growth. The combination of these novel loci may lead to the selection of brooders based on specific genetic signatures and can have a great effect on the efficiency of aquaculture. Moreover, these results could be used to verify or reject putative QTL identified in previous studies and to fine-map identified QTL in the same population using other types of genetic markers (Chatziplis et al., 2018, in preparation). Following this step, the use of these variants independently as individual SNPs (or SNP haplotypes) and/or in combination with other marker information in a MAS program could be a form of direct application in the aquaculture breeding industry. When denser SNP markers become available for the species (i.e., a SNP chip) and more families from more populations are genotyped (i.e., increase LD), the application of Genomic Selection will be more feasible and cost effective in terms of any selection accuracy benefits. Nevertheless, our study presents, in a small-scale example, the feasibility of GS application as well as the availability of the tools necessary before its application (i.e., GWAS using SNP markers) in an important Mediterranean aquaculture species such as gilthead sea bream.

<sup>1</sup> http://biocluster.her.hcmr.gr/myGenomeBrowser?portalname=Saurata\_v1

# ETHICS STATEMENT

Animal welfare was achieved according to the "Guidelines for the treatment of animals in behavioural research and teaching" (Guidelines for the treatment of animals in behavioural research and teaching, 1997) (see also Tsakogiannis et al., 2018). All fish utilized in the study were kept in facilities registered and authorized to maintain and perform animal experiments; rearing and sampling followed the guidelines of the Directive 2010/63/EU for the protection of animals used for experimental and other scientific purposes (Official Journal L276/33) (EU, 2010. Directive 2010/63/EU of the European Parliament and the Council of 22 September 2010 on the protection of animals used for scientific purposes. Official Journal of the European Union L 276/33, Animal protection.). In addition, experimental sampling protocols were approved by the IMBBC's aquaculture department committee, and methods were in accordance with relevant guidelines and regulations approved by the Hellenic Ministry of Rural Development and Food and the Regional Directorate of Veterinary Medicine for certified experimental installations (EL 91-BIO-04) and experimental animal breeding (AQUALABS, EL 91-BIO-03). Laboratory personnel include technicians accredited by the Federation for Laboratory Animal Science Associations (FELASA).

# AUTHOR CONTRIBUTIONS

CT, GP, TM, AK, and DK conceived and designed the study. LP, DC, and CT designed and performed the family selection. AT performed the DNA extraction and ddRAD library preparation. DK performed the bioinformatic analyses with guidance from AK and TM. DK performed the statistical analyses with guidance from MT and IT. DK wrote the first draft of the manuscript. MT, AT, DC, AK, and TM wrote sections of the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.

# FUNDING

Financial support for this study has been provided by the General Secretariat for Research and Technology (GSRT), Ministry of Education and Religious Affairs, under the National Programme for Competitiveness & Entrepreneurship (EPAN II) funded by National sources and the European Regional Development Fund for the gilthead sea bream. This research was supported in part through computational resources provided by IMBBC (Institute of Marine Biology, Biotechnology, and Aquaculture) of the HCMR (Hellenic Centre for Marine Research). Funding for establishing the IMBBC HPC has been received by the MARBIGEN (EU Regpot) project, LifeWatchGreece RI, and the CMBR (Centre for the study and sustainable exploitation of Marine Biological Resources) RI.

# REFERENCES


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00675/full#supplementary-material


**Conflict of Interest Statement:** Author LP was employed by company Nireus Aquaculture SA, Greece. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Kyriakis, Kanterakis, Manousaki, Tsakogiannis, Tsagris, Tsamardinos, Papaharisis, Chatziplis, Potamias and Tsigenopoulos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Identification of Potential Immune-Related circRNA–miRNA–mRNA Regulatory Network in Intestine of *Paralichthys olivaceus* During *Edwardsiella tarda* Infection

*Yunji Xiu1,2, Guangpeng Jiang1, Shun Zhou1, Jing Diao2, Hongjun Liu2, Baofeng Su3 and Chao Li1\**

*1 School of Marine Science and Engineering, Qingdao Agricultural University, Qingdao, China, 2 Shandong Key Laboratory of Disease Control in Mariculture, Marine Biology Institute of Shandong Province, Qingdao, China, 3 School of Fisheries, Aquaculture and Aquatic Sciences, Auburn University, Auburn, AL, United States*

### *Edited by:*

*Peng Xu, Xiamen University, China*

### *Reviewed by:*

*Yanliang Jiang, Chinese Academy of Sciences, China Xiao-Qin Xia, Chinese Academy of Sciences, China*

> *\*Correspondence: Chao Li leoochao@163.com*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 09 December 2018 Accepted: 11 July 2019 Published: 14 August 2019*

### *Citation:*

*Xiu Y, Jiang G, Zhou S, Diao J, Liu H, Su B and Li C (2019) Identification of Potential Immune-Related circRNA–miRNA–mRNA Regulatory Network in Intestine of Paralichthys olivaceus During Edwardsiella tarda Infection. Front. Genet. 10:731. doi: 10.3389/fgene.2019.00731*

Olive flounder (*Paralichthys olivaceus*) is an economically important flatfish in Japan, Korea, and China, but its production has been greatly threatened by disease outbreaks. In this research, we aimed to explore the immune response mechanism of *P. olivaceus* against *Edwardsiella tarda* infection by profiling the expression of circRNA, miRNA, and mRNA by RNA-seq and constructing a circRNA–miRNA–mRNA regulatory network. Illumina sequencing of samples from normal control (H0), 2 h (H2), 8 h (H8), and 12 h (H12) postchallenge was conducted. Differentially expressed (DE) circRNAs (DE–circRNAs), miRNAs (DE–miRNAs), and mRNAs [differentially expressed genes (DEGs)] between challenge and control groups were identified, resulting in a total of 62 DE–circRNAs, 39 DE–miRNAs, and 3,011 DEGs. Based on the differential expression results, miRNA target interactions (circRNA–miRNA pairs and miRNA–mRNA pairs) were predicted by MiRanda software. Once these pairs were combined, a preliminary circRNA–miRNA–mRNA network was generated with 198 circRNA–miRNA edges and 3,873 miRNA–mRNA edges, including 44 DE–circRNAs, 32 DE–miRNAs, and 1,774 DEGs. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis was performed to evaluate the function of the DEGs in this network, and we focused on two important intestinal immune pathways (herpes simplex infection and intestinal immune network for IgA production) that showed statistical significance between the challenge and control groups. Furthermore, three critical DEGs (nectin2, MHC II α-chain, and MHC II β-chain) were identified and mapped onto the preliminary circRNA–miRNA–mRNA network, and new circRNA–miRNA–mRNA regulatory networks were constructed. In conclusion, we identified, for the first time, a circRNA–miRNA–mRNA network in *P. olivaceus* during the pathogenesis of *E. tarda* infection and provided valuable resources for further analyses of the molecular mechanisms and signaling networks.

Keywords: circRNA, miRNA, mRNA, circRNA–miRNA–mRNA network, immune response, *Paralichthys olivaceus*, *Edwardsiella tarda*

# INTRODUCTION

Circular RNAs (circRNAs), identified from RNA viruses in the 1970s (Sanger et al., 1976), were initially treated as viral genomes or by-products of rare mis-splicing, and thus, they have long been thought to be nonfunctional (Capel et al., 1993). CircRNAs are generated during the process of back-splicing and can be grouped into four categories: circular exonic RNAs (ecircRNAs), circular intronic RNAs (ciRNAs), exon–intron circRNAs (eiciRNAs), and intergenic circRNAs (Qu et al., 2015; Sonja and Sabine, 2015; Li et al., 2017a). During back-splicing, a downstream 5′ splice donor is joined with an upstream 3′ splice acceptor, and the resulting RNA circle is ligated by a 3′–5′ phosphodiester bond at the junction site (Lasda and Parker, 2014; Chen, 2016; Wilusz, 2018). Back-splicing is catalyzed by the canonical spliceosome machinery and modulated by both intronic complementary sequences and RNA binding proteins (Li et al., 2018b). Recent advances in high-throughput sequencing technology have enabled the identification of large numbers of circRNAs in many organisms, including plants, animals, humans, fungi, and protists (Li et al., 2018b). Emerging research suggests that some circRNAs are critical in many physiological and pathological conditions (Li et al., 2018b). For example, expression profiles and knockout experiments have implicated circRNAs in neuronal function (Rybak-Wolf et al., 2015; Piwecka and Glazar, 2017) and testes development (Capel et al., 1993; Hansen et al., 2013). In addition, more and more circRNAs have been found to be associated with human diseases, such as cancers (Conn et al., 2015; Guarnerio et al., 2016), Alzheimer's disease (Lukiw, 2013), neuronal diseases (Errichelli et al., 2017), and others. The most recent findings also reveal that some circRNAs are involved in innate immune responses (Chen et al., 2017b; Li et al., 2017b; Wang et al., 2017). Collectively, considerable evidence indicates that circRNAs are not simply accidental by-products but represent an essential part of the noncoding RNA families.

Although research on circRNAs is still in its infancy, it is becoming apparent that circRNAs play their regulatory roles through distinct mechanisms. First, circRNAs function as miRNA sponges through abundant binding sites for microRNAs and thereby modulate the activity of miRNAs on their target genes (Hyeon Ho et al., 2009). Remarkably, some circRNAs are strongly associated with cancer progression by competing with miRNAs to influence the expression of target genes that are involved in biological processes such as tumor cell proliferation, apoptosis, invasion, and migration (Zhong et al., 2018). Apart from acting as miRNA sponges, circRNAs exert multiple functions by affecting the splicing of their linear mRNA counterparts, regulating the transcription of their parental genes, interacting with associated proteins and protein-coding genes, and generating pseudogenes (Li et al., 2018b).

Olive flounder (*P. olivaceus*) is an economically important flatfish that has been widely cultured in Japan, Korea, and China. The production of *P. olivaceus* has been greatly threatened by outbreaks of bacterial, viral, and parasitic diseases (Kim et al., 2010). *Edwardsiella tarda*, associated with hemorrhagic septicemia of freshwater and marine fish, can also cause extensive economic losses to the *P. olivaceus* aquaculture industry (Mohanty and Sahoo, 2007; Xu and Zhang, 2014). It has been reported that *E. tarda* is an important zoonotic and intestinal pathogen and that the intestine is likely the main route of entry into the host (Li et al., 2012; Wang et al., 2012). Therefore, besides serving as the prime site for the absorption of nutrients, the intestine represents one of the first-line defense systems (Lauriano et al., 2019). It has been confirmed that intestinal hypo-immunity of fish favors *E. tarda* infection (Liu et al., 2014). Teleost fish possess a diffuse mucosa-associated immune system in the intestine where B cells act as one of the main responders (Parra et al., 2016). Moreover, IgT+ B cells represent the predominant mucosal B-cell subset, and the accumulation of IgT+ B cells has been detected in trout intestine after infection (Yu and Kong, 2018). Immunoglobulins produced by these B cells constitute a critical line of defense, which prevents the entrance of pathogens and commensal bacteria into the epithelium (Parra et al., 2016).

It is therefore vital to understand, and ultimately apply, the immune mechanisms of *P. olivaceus* against pathogen infection. Over the years, considerable effort has been devoted to exploring the immune mechanism of *P. olivaceus* at the molecular level (Hwang et al., 2018; Ma et al., 2018; Wu et al., 2018), including a few studies on non-coding RNA (Zhang et al., 2014). However, there has been no report on the roles of circRNAs during the immune process of *P. olivaceus*. In fish, only a few circRNA studies have been published, covering half-smooth tongue sole (*Cynoglossus semilaevis*) (Li et al., 2018a), large yellow croaker (*Larimichthys crocea*) (Xu et al., 2017), zebrafish (*Danio rerio*) (Shen et al., 2017), coelacanth (Anne et al., 2014), and grass carp (*Ctenopharyngodon idella*) (He et al., 2017). Interestingly, the most recent findings reveal that circRNAs also take part in immune regulation and viral infection (Wang et al., 2017). It has been shown that *in vitro* synthesized circRNAs activate RIG-I-mediated innate immune responses, which provide protection against viral infection (Chen et al., 2017b). In addition, the immune factors NF90/NF110 modulate circRNA biosynthesis and suppress viral infection by interacting with viral mRNAs (Li et al., 2017b). Moreover, it has been speculated that a circRNA–miRNA–mRNA network may be present in grass carp infected with grass carp reovirus (GCRV), which provides new insights into the immune mechanism of grass carp against GCRV (He et al., 2017). However, several intriguing questions remain to be clarified: 1) Are circRNAs involved in antibacterial immune responses? 2) How do circRNAs contribute to antibacterial immune responses?

In this study, we examined the interactions of circRNAs, miRNAs, and mRNAs of *P. olivaceus* during the pathogenesis of *Edwardsiella tarda* infection by high-throughput sequencing. We screened and identified differentially expressed circRNAs, miRNAs, and mRNAs; predicted the potential circRNA–miRNA–mRNA network; analyzed their significantly enriched pathways; and, for the first time, highlighted their implications in antibacterial immunity.
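The idea of a circRNA–miRNA–mRNA network can be made concrete by joining the two kinds of predicted pairs on their shared miRNA. The sketch below is only an illustration of that join under simplified assumptions: the identifiers are hypothetical, the edge lists stand in for target-prediction output, and no expression-based filtering is applied.

```python
from collections import defaultdict


def build_triples(circ_mirna_edges, mirna_mrna_edges):
    """Join circRNA-miRNA and miRNA-mRNA edges on the shared miRNA."""
    targets = defaultdict(set)
    for mirna, mrna in mirna_mrna_edges:
        targets[mirna].add(mrna)
    return sorted(
        (circ, mirna, mrna)
        for circ, mirna in set(circ_mirna_edges)
        for mrna in targets.get(mirna, ())
    )


# Hypothetical predicted pairs (identifiers are invented).
circ_edges = [("circ_1", "miR_a"), ("circ_2", "miR_a"), ("circ_2", "miR_b")]
mrna_edges = [("miR_a", "gene_x"), ("miR_a", "gene_y"), ("miR_c", "gene_z")]

triples = build_triples(circ_edges, mrna_edges)
```

A miRNA that appears in only one of the two edge lists (here `miR_b` and `miR_c`) contributes no circRNA–miRNA–mRNA path, which is why a combined network can contain fewer molecules than the individual pair lists.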

# MATERIALS AND METHODS

## The Experimental Fish and Ethical Statement

Healthy olive flounders were obtained from Huanghai Aquaculture Company (Haiyang, Shandong, China). The fish were acclimatized in a recirculating water system (temperature 20 ± 1°C) for 1 week before processing, during which they were fed twice a day with commercial diet. In order to make sure that all the experimental fish were healthy, the olive flounders were monitored every day; after 1-week acclimation, they were randomly sampled for bacteriological examination. This study was carried out in accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health, Qingdao Agricultural University. The protocol was approved by the Committee on the Ethics of Animal Experiments of Qingdao Agricultural University IACUC (Institutional Animal Care and Use Committee).

# Bacteria Challenge and Sample Collection

The *E. tarda* strain was isolated from diseased olive flounders and maintained in our laboratory. Before the challenge experiment, *E. tarda* was incubated in Luria broth (LB) medium at 28°C to mid-logarithmic stage, and its concentration was determined by the colony-forming unit (CFU) method. Overall, the experiment included four time points, and each time point contained three biological replicates (three fish per biological replicate). Before *E. tarda* infection, nine fish were immersed in sterilized media, and their posterior intestines were sampled as the normal control group, designated H0 (H0\_1, H0\_2, and H0\_3). During challenge, the experimental groups were immersed in the bacterial solution at a final concentration of 6 × 10⁷ CFU/ml for 2 h and then returned to the recirculating water system. At 2, 8, and 12 h post-treatment, nine fish were collected at each time point, and their posterior intestines were sampled for sequencing. The samples were designated H2 (H2\_1, H2\_2, and H2\_3), H8 (H8\_1, H8\_2, and H8\_3), and H12 (H12\_1, H12\_2, and H12\_3).

# Histopathological Analysis

To observe histopathological changes in the intestine of *E. tarda*-infected *P. olivaceus*, we took posterior intestines from nine fish at each time point to prepare pathological sections. Tissue samples were fixed in 4% paraformaldehyde in phosphate-buffered saline (PBS) and then processed through the following steps: dehydrated in graded ethanol, cleared in xylene, embedded in paraffin, cut into 5-μm sections, and stained with hematoxylin and eosin (H&E) for examination by light microscopy (Licata et al., 2018). The height of the mucosal folds and the thickness of the lamina propria, inner circular muscular layer, and outer longitudinal muscle were measured and analyzed. The mean ± standard error of the mean (SEM) of each structure was compared among all samples using analysis of variance with Tukey LSD (SAS 9.4) at the significance level *p* < 0.05.

# RNA Isolation, Library Construction, and Sequencing

Total RNA from samples was extracted by using the TRIzol reagent (Invitrogen, USA). The purity of total RNA was checked by using the NanoPhotometer® spectrophotometer, its concentration was checked by using Qubit® RNA Assay Kit in Qubit® 2.0 Fluorometer, and its integrity was checked by Bioanalyzer 2100 system.

At different time points before (H0) and after (H2, H8, and H12) *E. tarda* infection, intestine tissues from nine olive flounders were collected at each time point and used for circRNA sequencing. Three replicate samples were processed for each time point, and a total of 12 libraries were sequenced. To construct the circRNA or mRNA libraries, 5 μg of RNA was prepared for each sample. The Epicentre Ribo-Zero™ rRNA Removal Kit (Epicentre, USA) was then used to remove rRNA, and ethanol precipitation was applied to clean up the rRNA-free residue. Subsequently, for the circRNA library, linear RNA was digested with 3 U of RNase R (Epicentre, USA) per microgram of RNA, whereas the mRNA library received no RNase R treatment; this was the only difference between circRNA and mRNA library construction. The sequencing libraries were generated using the NEBNext® Ultra™ Directional RNA Library Prep Kit for Illumina® (NEB, USA) according to the manufacturer's protocol. For circRNAs, mRNAs, and miRNAs, library construction and sequencing were performed by Novogene Corporation (China) as previously reported (Lu et al., 2016). Because the Ribo-Zero library contained both mRNA and lncRNA, and lncRNA was outside the scope of this research, we set up a series of strict screening conditions to identify and remove lncRNA according to its structural and functional characteristics. The lncRNA screening process was as follows: 1) Select transcripts with exon number ≥2. 2) Select transcripts with length > 200 bp. 3) Screen transcripts that overlap with the annotated exon area. 4) Use Cuffquant to calculate the expression of each transcript, and select transcripts with fragments per kilobase of transcript per million mapped reads (FPKM) ≥0.5. 5) Use CNCI (coding–non-coding index) (v2), CPC (Coding Potential Calculator) (0.9-r2), Pfam Scan (v1.3), and PhyloCSF (phylogenetic codon substitution frequency) (v20121028) to predict the coding potential of transcripts, and select as lncRNA the transcripts that all four tools agree have no coding potential.
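The five screening steps above can be sketched as a single filter. This is a minimal illustration, not the pipeline actually used: the `Transcript` fields and tool labels are hypothetical names, and step 3 is interpreted here (as an assumption) as discarding transcripts that overlap annotated exons.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the four coding-potential predictors named in the text.
TOOLS = ("CNCI", "CPC", "PfamScan", "PhyloCSF")


@dataclass
class Transcript:
    name: str
    exon_count: int
    length_bp: int
    overlaps_annotated_exon: bool
    fpkm: float
    # True means the tool predicts coding potential for this transcript.
    coding_calls: dict = field(default_factory=dict)


def is_candidate_lncrna(t: Transcript) -> bool:
    """Apply the five screening steps in order; any failure rejects the transcript."""
    if t.exon_count < 2:            # step 1: exon number >= 2
        return False
    if t.length_bp <= 200:          # step 2: length > 200 bp
        return False
    if t.overlaps_annotated_exon:   # step 3: ASSUMPTION, overlap means discard
        return False
    if t.fpkm < 0.5:                # step 4: FPKM >= 0.5
        return False
    # step 5: all four tools must agree that there is no coding potential
    return not any(t.coding_calls.get(tool, True) for tool in TOOLS)
```

A transcript with a missing tool call is treated as coding (and therefore not classified as lncRNA), which keeps the intersection rule conservative.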

# Data Analysis

Raw data (raw reads) were first processed through in-house Perl scripts (for circRNAs and mRNAs) or custom Perl and Python scripts (for miRNAs). Specifically, reads containing adapters, reads containing poly-N, and low-quality reads were removed from the raw data. Then, the Q20, Q30, and GC contents of the clean data were calculated. All downstream analyses were based on the clean data.
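For orientation, Q20, Q30, and GC content can be computed as in the sketch below. It assumes Phred+33 quality encoding and reads already parsed into `(sequence, quality_string)` pairs; the function name is illustrative and this is not the in-house script referred to above.

```python
def read_quality_summary(reads, phred_offset=33):
    """Return Q20, Q30 and GC content as percentages of all bases.

    reads: iterable of (sequence, quality_string) pairs.
    phred_offset: ASCII offset of the quality encoding (33 is assumed here).
    """
    total = q20 = q30 = gc = 0
    for seq, qual in reads:
        for base, qchar in zip(seq, qual):
            score = ord(qchar) - phred_offset
            total += 1
            q20 += score >= 20      # base call error rate <= 1%
            q30 += score >= 30      # base call error rate <= 0.1%
            gc += base in "GCgc"
    if total == 0:
        raise ValueError("no bases supplied")
    return {
        "Q20": 100.0 * q20 / total,
        "Q30": 100.0 * q30 / total,
        "GC": 100.0 * gc / total,
    }
```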

Reference genome and gene model annotation files were downloaded directly from the genome website (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/970/005/GCF\_001970005.1\_Flounder\_ref\_guided\_V1.0/). The index of the reference genome was built using bowtie2 (v2.2.8), and clean reads were aligned to the reference genome using Bowtie or HISAT2 (Langmead and Pop, 2009; Langmead and Salzberg, 2012).

The circRNAs were detected and identified using find\_circ (Memczak et al., 2013) and CIRI2 (Gao et al., 2018). The workflow of find\_circ was as follows: 1) reads that aligned contiguously to the genome were filtered out, and spliced reads were retained; 2) the terminal parts of each candidate read were mapped to the genome to find unique anchor positions; 3) candidate circRNAs were confirmed when the 3′ end of the anchor sequence aligned upstream of the 5′ end of the anchor sequence and the inferred breakpoint was flanked by GU/AG splice signals. The circos figures were constructed using Circos software.

Mapped small RNA tags were used to search for known miRNAs. miRBase 20.0 was used as the reference; modified miRDeep2 software (Friedländer et al., 2012) and srna-tools-cli were used to obtain the potential miRNAs and draw their secondary structures. Custom scripts were used to obtain the miRNA counts as well as the base bias at the first position of identified miRNAs of a given length and at each position of all identified miRNAs (Jian et al., 2018). Because the hairpin structure of miRNA precursors can be used to predict novel miRNAs, the available software miREvo (Wen, 2012) and miRDeep2 (Friedländer et al., 2012) were integrated to predict novel miRNAs by exploring the secondary structure, the Dicer cleavage site, and the minimum free energy of the small RNA tags left unannotated in the former steps. Meanwhile, custom scripts were used to obtain the identified miRNA counts as well as the base bias at the first position of miRNAs of a given length and at each position of all identified miRNAs (Fan et al., 2018).

For transcriptome assembly, the mapped reads of each sample were assembled by StringTie (v1.3.1) (Pertea et al., 2016) in a reference-based approach. StringTie uses a novel network flow algorithm as well as an optional *de novo* assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus.
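The find\_circ confirmation rule in step 3 above (anchors aligned in reversed order, breakpoint flanked by GU/AG) can be illustrated with a small predicate. This is a simplified sketch for plus-strand coordinates only; the function and parameter names are hypothetical, and it is not the find\_circ implementation.

```python
def is_backsplice_candidate(head_anchor_pos, tail_anchor_pos,
                            donor_dinucleotide, acceptor_dinucleotide):
    """Check the two conditions described for a circRNA junction read.

    head_anchor_pos: genomic start of the anchor taken from the 5' end of the read.
    tail_anchor_pos: genomic start of the anchor taken from the 3' end of the read.
    donor_dinucleotide / acceptor_dinucleotide: the two bases flanking the
        inferred breakpoint on the intron side (plus strand assumed).
    """
    # Linear splicing keeps the anchors in read order; a back-splice reverses
    # them, so the 3' anchor maps upstream of the 5' anchor.
    reversed_order = tail_anchor_pos < head_anchor_pos
    # GU in RNA corresponds to GT in the genomic DNA sequence.
    donor = donor_dinucleotide.upper().replace("U", "T")
    acceptor = acceptor_dinucleotide.upper().replace("U", "T")
    return reversed_order and donor == "GT" and acceptor == "AG"
```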

# Differential Expression Analysis, Enrichment Analysis, and circRNA–miRNA–mRNA Network Analysis

Differential expression analysis between two groups was performed using the DESeq R package (1.8.3). The *p*-value was adjusted using the Benjamini and Hochberg method. Corrected *p*-value of 0.05 was set as the threshold for significantly differential expression by default.
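The Benjamini and Hochberg adjustment mentioned above can be written out in a few lines. The sketch below is a plain-Python illustration of the procedure, not the DESeq code path; the function name is chosen for this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A gene would then be called differentially expressed when its adjusted value falls below the 0.05 threshold stated above.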

Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on significantly differentially expressed genes, including host genes of differentially expressed circRNAs and the target gene candidates of differentially expressed miRNAs. GO enrichment analysis was implemented with the GOseq R package, in which gene length bias was corrected (Young et al., 2010). GO terms with a corrected *p*-value of less than 0.05 were considered significantly enriched in differentially expressed genes. KEGG is a database resource for understanding high-level functions and utilities of biological systems, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies (Kanehisa et al., 2008) (http://www.genome.jp/kegg/). We used KOBAS software to test the statistical enrichment of these genes in KEGG pathways (Mao et al., 2005).
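As a rough illustration of what an enrichment *p*-value measures, the sketch below uses a plain hypergeometric upper tail. This is a simplification under stated assumptions: it omits the gene-length correction that GOseq applies and is not claimed to match the exact configuration of KOBAS; all names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_term, n_de, n_de_in_term):
    """P(X >= n_de_in_term) when n_de genes are drawn without replacement
    from n_background genes, of which n_in_term are annotated to the term."""
    upper = min(n_in_term, n_de)
    total = comb(n_background, n_de)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_de - k)
        for k in range(n_de_in_term, upper + 1)
    )
    return tail / total
```

The resulting *p*-values across all terms or pathways would still need multiple-testing correction before applying the 0.05 cutoff.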

The circRNA–miRNA–mRNA network was developed based on possible functional relationships between DE–circRNAs, DE–miRNAs, and differentially expressed genes (DEGs). First, the target circRNAs of DE–miRNAs were predicted by scanning for conserved miRNA target sites with MiRanda (Enright et al., 2003); the predicted target circRNAs were then intersected with the DE–circRNAs, and the circRNA–miRNA regulation network was constructed. Second, the target mRNAs of DE–miRNAs were predicted by scanning for conserved miRNA target sites with MiRanda; the predicted target mRNAs were then intersected with the DEGs, and the miRNA–mRNA regulation network was constructed. Finally, the circRNA–miRNA–mRNA network was generated by combining the circRNA–miRNA network and the miRNA–mRNA network with Cytoscape 3.6.1 software (Su et al., 2014), and only networks following the expression trend of "up–down–up" or "down–up–down" were selected for further research. In summary, the construction of the circRNA–miRNA–mRNA network followed these principles: circRNAs served as bait, miRNAs served as the core, and mRNAs served as targets.
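The combination step and the "up–down–up" / "down–up–down" filter described above can be sketched as a join on the shared miRNA. The pair lists and the `direction` mapping are hypothetical inputs for illustration; the actual network was assembled in Cytoscape.

```python
ALLOWED_TRENDS = {("up", "down", "up"), ("down", "up", "down")}


def build_cerna_triples(circ_mirna_pairs, mirna_mrna_pairs, direction):
    """Join circRNA-miRNA and miRNA-mRNA pairs on the shared miRNA.

    circ_mirna_pairs: iterable of (circRNA_id, miRNA_id).
    mirna_mrna_pairs: iterable of (miRNA_id, mRNA_id).
    direction: dict mapping every id to "up" or "down" (its regulation).
    Only triples whose trend is up-down-up or down-up-down are kept.
    """
    targets = {}
    for mirna, mrna in mirna_mrna_pairs:
        targets.setdefault(mirna, []).append(mrna)

    triples = []
    for circ, mirna in circ_mirna_pairs:
        for mrna in targets.get(mirna, []):
            trend = (direction[circ], direction[mirna], direction[mrna])
            if trend in ALLOWED_TRENDS:
                triples.append((circ, mirna, mrna))
    return triples
```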

# Confirmation of the Expression Level of circRNAs, miRNAs, and mRNAs

To validate the reliability of the data obtained from Illumina sequencing, real-time quantitative reverse transcription polymerase chain reaction (qRT-PCR) and Sanger sequencing were conducted. To confirm the expression pattern of differentially expressed circRNAs, six circRNAs were randomly selected for qRT-PCR and Sanger sequencing. Primer Premier 5 software was used to design their divergent primers. In general, the divergent primers were designed to span the circRNA backsplice junction, and a fragment of 80 to 150 bp was expected. Total RNA was extracted, digested using RNase-Free DNase (Promega), and then purified. A total of 1 μg of purified RNA was used to prepare first-strand cDNA with random 6-mer primers and the PrimeScript 1st strand cDNA Synthesis Kit (Takara, Japan). qRT-PCR was carried out on a CFX 96 real-time PCR system (Bio-Rad, Hercules, CA, USA). Each 10-μl qRT-PCR mixture contained 4.2 μl of ddH₂O, 5 μl of 2× SYBR Green master mix (Aidlab), 0.4 μl of template, and 0.2 μl each of forward and reverse primers. The EF1α gene was used as an internal control. For each sample, three replicates were included. The program for qRT-PCR was as follows: 95°C for 3 min, followed by 40 cycles of 95°C for 10 s and 60°C for 34 s. The relative expression level was calculated using the Pfaffl method (Pfaffl, 2001). Data are shown as means ± SE of three replicates. To further verify that the amplified products derived from circRNAs rather than their linear counterparts, the qRT-PCR products were cloned into the pEASY-T1 cloning vector (TransGen, Beijing, China) and then sequenced by Tsingke Company (Tsingke, Qingdao, China).
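For reference, the Pfaffl ratio divides the efficiency-corrected change of the target by that of the reference gene (here EF1α). The sketch below shows the arithmetic only; the amplification efficiencies are not reported in the text, so the values passed in the usage line are assumptions (2.0 means perfect doubling per cycle).

```python
def pfaffl_ratio(e_target, ct_target_control, ct_target_sample,
                 e_ref, ct_ref_control, ct_ref_sample):
    """Relative expression of a target gene in a sample versus the control.

    e_target, e_ref: amplification efficiencies (2.0 = 100% efficiency).
    ct_*: threshold cycles measured in the control (e.g. H0) and the sample.
    """
    target_change = e_target ** (ct_target_control - ct_target_sample)
    ref_change = e_ref ** (ct_ref_control - ct_ref_sample)
    return target_change / ref_change


# Hypothetical Ct values: target drops by two cycles, reference is unchanged.
fold_change = pfaffl_ratio(2.0, 25.0, 23.0, 2.0, 20.0, 20.0)
```

When both efficiencies equal 2, the expression reduces to the familiar 2^(−ΔΔCt) form.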

Meanwhile, to further confirm the expression levels of miRNAs and mRNAs, 8 miRNAs and 10 mRNAs were randomly selected for qRT-PCR experiments. For miRNA analysis, small RNA (<200 nt) was harvested using the miRcute miRNA isolation kit (Tiangen Biotech). The amplification reactions were carried out using the miRcute miRNA qPCR detection kit (Tiangen Biotech) under the following conditions: 95°C for 15 min, followed by 40 cycles of two steps (95°C for 5 s and 60°C for 30 s), in triplicate wells for each sample. The relative expression level of each miRNA was normalized to 5S rRNA expression. For mRNA analysis, the reaction mixtures and the qRT-PCR program were the same as for circRNAs, and the EF1α gene was again used as an internal control for the normalization of gene expression.

# RESULTS

# Histopathological Description and Cytokine Expression Analysis

As shown in **Figure 1A**, the healthy intestine sample contained tunica mucosa, submucosa, mucosal folds, circular muscular layer, and longitudinal muscular layer. The mucosal folds were lined with abundant columnar epithelial cells and goblet cells, and all the cells appeared to be interrelated and uniformly arranged. For H2, in early infection, hyperplasia of the intestinal mucosa was observed. Meanwhile, the blood vessels and lymph vessels were enlarged, which was associated with the increasing thickness of the lamina propria (**Supplementary Table 1**). However, the intestine structure was still intact; the height of the mucosal folds decreased (**Supplementary Table 1**), but most of the mucosal folds did not show significant lesions. The columnar epithelial cells and goblet cells were still tightly arranged (**Figure 1B**). For H8, cellular swelling and hydropic change were observed in the intestinal epithelium; some intestinal epithelial cells were shedding, and some were disrupted. Inflammatory cells had infiltrated the connective tissue. Simultaneously, cells in the blood vessels and lymph vessels became more numerous and more distinct. The intestine structure was severely injured (**Figure 1C**). For H12, the infection led to destructive damage to the intestine structures. The mucosal folds suffered further damage, and fragmentation was observed. The blood vessels were also severely damaged, and only a few could be observed. Necrosis was seen in the mucosa, submucosa, and muscle layers of the intestinal wall (**Figure 1D**). The lamina propria was significantly thicker at 12 h post-challenge (**Supplementary Table 1**; **Figure 1D**).

To clarify the biological relevance of the time course, the expression of key cytokines for enteritis was checked by qRT-PCR. The results showed that *E. tarda* could 1) up-regulate the mRNA levels of the intestinal pro-inflammatory cytokines interleukin-1β (*IL-1β*), *IL-6*, *IL-8*, *IL-16*, *IL-17D*, tumor necrosis factor α (*TNF-α*), and granulocyte colony-stimulating factor (*G-CSF*); and 2) down-regulate the mRNA level of the anti-inflammatory cytokine *IL-10* (**Figure 2**). The primers used for qRT-PCR are shown in **Supplementary Table 2**.

CE, columnar epithelial cell; CM, circular muscularis; GC, goblet cell; LM, longitudinal muscularis; S, submucosa; TM, tunica mucosa. The scale bar in (A) is 100 µm. The scale bar in (B), (C), and (D) is 50 µm.

# Overview of circRNA Sequencing Data

As shown in **Table 1**, raw reads, clean reads, clean bases, and Q20, Q30, and GC (guanine and cytosine) contents were determined for each library. All libraries yielded ≥12.25 Gb of clean bases, with Q20 ≥ 96.99%, Q30 ≥ 92.32%, and an error rate ≤ 0.02. Therefore, all libraries were suitable for further study. These data were deposited in the NCBI database under BioProject number PRJNA511138.

Clean reads from the 12 libraries were used to identify circRNAs. After a series of selection steps, 5,478 novel circRNAs were obtained and named novel\_circ\_0000001 to novel\_circ\_0005478 (**Supplementary Table 3**). No circRNA had previously been reported in olive flounder, so all the identified circRNAs were novel. A size distribution analysis revealed that the length of circRNAs ranged from 150 to 71,793 bp, but most (74.82%) were ≤5,000 bp (**Figure 3A**). Most of the circRNAs were exonic or intronic (**Figure 3B**).

To identify circRNAs that potentially participated in *E. tarda* infection, their expression profiles were examined at 0, 2, 8, and 12 h post-infection. As shown in **Supplementary Table 4**, a total of 34, 20, and 8 differentially expressed circRNAs (DE–circRNAs) were observed at 2, 8, and 12 h relative to the 0-h control, respectively. A heatmap shows the 62 DE–circRNAs (**Figure 3C**), and a Venn diagram reveals that one circRNA (novel\_circ\_0005065) was differentially expressed in all three comparisons (**Figure 3D**).

# Overview of miRNA Sequencing Data

Meanwhile, to study the miRNA profile of *P. olivaceus* after *E. tarda* infection, four sRNA libraries (i.e., H0, H2, H8, and H12) were also constructed and sequenced. Altogether, 33,957,978, 33,067,562, 34,149,754, and 33,590,933 raw reads were acquired from H0, H2, H8, and H12, respectively. These data were deposited in the NCBI database under BioProject number PRJNA510916. After the low-quality reads, adaptor sequences, and reads with sequences < 1 or >35 nt were filtered out, all sRNAs were obtained (**Table 2**). As shown in **Figure 4A**, the majority of the sRNAs from the four libraries ranged from 20 to 23 nt, with a peak at 21 nt. Furthermore, the sRNAs of all samples were mapped to the reference genome. The results showed that most of the sRNAs (>89%) could be mapped onto the olive flounder genome (**Table 2**). After the Rfam and genome databases were searched, other non-coding RNAs (rRNA, tRNA, snRNA, and snoRNA) (**Supplementary Table 5**) and repeat sequences (**Supplementary Table 6**) were annotated.

TABLE 2 | Information list of miRNA sequencing data.

After a series of selections, a total of 303 miRNAs were identified, including 33 known miRNAs and 270 putative novel miRNAs (**Table 3**). The expression levels of miRNAs were calculated based on the read count and were subsequently normalized to TPM (number of transcripts per million clean tags). Among the 303 miRNAs, approximately 22% were expressed at a high level (TPM interval > 60), whereas approximately 25% showed a much lower expression level (0 < TPM interval < 0.1) (**Supplementary Table 7**). Specifically, the novel miRNAs were mostly expressed at a lower level, and the known miRNAs were mostly expressed at a higher level (**Supplementary Table 8**). When the groups were compared, the overall expression pattern of miRNAs among the four groups was highly consistent (**Figure 4B**). Compared with the uninfected group, 39 miRNAs showed significantly differential expression (*p* < 0.05), including 26, 7, and 6 DE–miRNAs in the H2 vs H0, H8 vs H0, and H12 vs H0 comparisons, respectively (**Figure 4C**). Unsupervised hierarchical clustering revealed that all samples clustered according to their respective groups, indicating that the miRNA expression signatures could differentiate the challenge groups from the normal group (**Figure 4D**). A Venn diagram revealed that some DE–miRNAs were differentially expressed in two or three comparisons (**Figure 4E**).
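The TPM normalization described above scales each miRNA's read count to a library of one million tags. A minimal sketch follows; as an assumption, the library total is taken to be the sum of the counts passed in, and the function name is illustrative.

```python
def normalize_to_tpm(read_counts):
    """Scale raw miRNA read counts to transcripts per million (TPM).

    read_counts: dict mapping miRNA name to its raw read count in one library.
    ASSUMPTION: the library total is the sum of the supplied counts.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("library has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in read_counts.items()}
```

With such values in hand, the expression intervals quoted above (for example TPM > 60) become simple threshold checks per miRNA.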

# Overview of mRNA Sequencing Data

To identify the expression levels of *P. olivaceus* mRNAs, 12 cDNA libraries (H0, H2, H8, and H12) were constructed and sequenced. After low-quality reads and sequences containing Ns were removed, 287,208,722, 294,346,456, 254,332,520, and 329,139,946 clean reads were obtained from the H0, H2, H8, and H12 libraries, respectively. These data were deposited in the NCBI database under BioProject number PRJNA510440. Furthermore, 84.91%, 82.57%, 84.45%, and 84.83% of the clean reads from the H0, H2, H8, and H12 libraries were uniquely mapped to the *P. olivaceus* genome (**Table 4**).

TABLE 3 | Information of identified miRNAs.

In comparison with the H0 library, 2,100, 400, and 511 genes were identified as DEGs in the H2, H8, and H12 libraries, respectively (**Figure 5A**, **Supplementary Table 9**). As shown in the Venn diagram, some DEGs were differentially expressed in two or three comparisons (**Figure 5B**). Considering that circRNAs can regulate the transcription of their parental genes, we took the intersection of the parental genes of DE–circRNAs and the DEGs. As shown in **Supplementary Table 9**, 13 DEGs served as parental genes of 13 DE–circRNAs. To further explore the functions of the DEGs in response to *E. tarda* infection, GO and KEGG enrichment analyses were conducted. For GO analysis, the dominant functions in each of the three main categories were metabolic process (GO:0008152) in the biological process (BP) category, cell (GO:0005623) in the cellular component (CC) category, and structural constituent of ribosome (GO:0003735) in the molecular function (MF) category (**Figure 5C**). In addition, the DEGs were aligned against the KEGG pathway database to identify pathways that were responsive to *E. tarda* infection. As shown in **Figure 5D**, the ribosome, proteasome, oxidative phosphorylation, and spliceosome pathways were the most strongly activated.

# Construction of the Potential circRNA–miRNA Network

circRNAs can serve as competing endogenous RNAs (ceRNAs) of miRNAs and thereby regulate the expression of the corresponding genes. To construct the circRNA–miRNA regulation network, the MiRanda software was used to predict the relationships between DE–circRNAs and DE–miRNAs. The circRNA–miRNA network contained 325 circRNA–miRNA pairs, including 51 circRNAs and 32 miRNAs (**Supplementary Table 10**). As shown in **Figure 6A**, some circRNAs were predicted to bind several miRNAs; for example, novel\_circ\_0003296 could bind 20 miRNAs. Meanwhile, some miRNAs could also bind several circRNAs; for example, novel\_51 could link 30 circRNAs (**Figure 6B**). Considering the importance of hub nodes in a network, we employed the MCODE approach (Bader and Hogue, 2003), which was originally developed for protein–protein interaction (PPI) networks, to screen hubs from the network. With k-core = 2, one subnetwork with 11 nodes and 18 edges was identified, including four circRNAs (novel\_circ\_0002248, novel\_circ\_0002267, novel\_circ\_0001856, and novel\_circ\_0000799) and seven miRNAs (novel\_149, novel\_154, novel\_171, novel\_204, novel\_272, pol-miR-144-5p, and pol-miR-182-5p) (**Figure 6C**).

FIGURE 5 | Analysis of mRNA sequencing data. (A) Volcano plots for differential expression genes (DEGs) between H2, H8, H12, and H0, respectively. (B) Venn diagrams display the distribution of the 2,531 unique DEGs between H2, H8, H12, and H0, respectively. (C) Gene ontology (GO) analysis for all of DEGs between H2, H8, H12, and H0. *X*-axis represents the number of genes. The *Y*-axis on the left represents the GO term, and the *Y*-axis on the right represents GO type. The green column indicates the biological process, the red column the cellular component, and the gray column the molecular function. (D) Statistics of pathways enrichment of all DEGs between H2, H8, H12, and H0. Colors of the points refer to the *q*-value of the respective signaling pathway. Size of the point refers to the number of genes within each pathway.

# Construction of the Potential miRNA–mRNA Network

To construct the miRNA–mRNA regulation network, the MiRanda software was used to predict the potential target genes of the 39 DE–miRNAs, and the predicted target genes were then intersected with the DEGs. In total, we obtained 3,873 possible miRNA–mRNA target pairs across the three comparisons. As shown in **Supplementary Table 11**, every miRNA had more than one intersected DEG. Five miRNAs, novel\_51 (degree = 510), novel\_144 (degree = 417), novel\_171 (degree = 294), pol-miR-144-3p (degree = 218), and novel\_318 (degree = 207), had the most target genes. Moreover, many mRNAs were associated with more than one miRNA, such as serpin H1 (gene\_id, 109634651; transcript\_id, XM\_020095324.1), which was targeted by novel\_265, novel\_495, pol-miR-144-3p, novel\_204, novel\_17, novel\_59, pol-miR-206-3p, novel\_3, pol-let-7a-5p, novel\_127, novel\_154, and pol-miR-144-3p.

# Construction of the Potential circRNA–miRNA–mRNA Network

Based on the differential expression results, circRNA–miRNA pairs and miRNA–mRNA pairs were predicted with the MiRanda software. The circRNA–miRNA–mRNA network was then generated by combining the circRNA–miRNA pairs and the miRNA–mRNA pairs (**Figure 7**). This network contained 198 circRNA–miRNA pairs and 3,873 miRNA–mRNA pairs, including 44 circRNAs, 32 miRNAs, and 1,774 mRNAs (**Supplementary Table 12**). Among the 198 circRNA–miRNA pairs, four pairs existed in multiple comparison groups: novel\_circ\_0005065-novel\_171 existed in all three comparisons, novel\_circ\_0003068-novel\_51 and novel\_circ\_0003068-novel\_144 both existed in the H2 vs H0 and H12 vs H0 comparisons, and novel\_circ\_0005065-pol-miR-144-5p existed in the H2 vs H0 and H8 vs H0 comparisons. Among the 3,873 miRNA–mRNA pairs, 178 pairs recurred in multiple comparison groups; for example, the novel\_171-109646742, novel\_171-109646311, and novel\_171-109644261 pairs existed in all three comparisons.

Next, GO and KEGG analyses were performed to evaluate the function of the DEGs in the network. GO analysis revealed 107, 53, and 96 significantly enriched GO terms (*p* < 0.05) in the biological process, cellular component, and molecular function categories, respectively. These results suggested that some of the DEGs might play important biological roles in the response of olive flounder to *E. tarda* infection. As shown in **Figure 8A**, the enriched GO terms were mainly involved in biological processes (e.g., metabolic process, organic substance metabolic process, and primary metabolic process), cellular components (e.g., cell, cell part, and intracellular), and molecular functions (e.g., catalytic activity, organic cyclic compound binding, and heterocyclic compound binding).

KEGG analysis revealed 13 significantly enriched terms (*p* < 0.05), including metabolic pathways; ribosome; oxidative phosphorylation; spliceosome; herpes simplex infection; carbon metabolism; RNA transport; proteasome; RNA degradation; fatty acid degradation; pyruvate metabolism; valine, leucine, and isoleucine degradation; and drug metabolism (**Figure 8B**). Of those, the herpes simplex infection pathway attracted considerable attention due to its involvement in the immune response. A total of 32 DEGs were involved in the herpes simplex infection pathway, including 109626283, 109623691, 109627599, 109644197, 109647155, 109641940, 109625845, 109637327, 109624406, 109633274, 109643961, 109636767, 109628267, 109643520, 109639858, 109644261, 109625570, 109634833, 109633363, 109643253, 109643252, 109637639, 109626354, 109631327, 109642261, 109641879, 109629246, 109641908, 109629344, 109646115, 109645569, and 109633948. In addition to the herpes simplex infection pathway, some other immune-related pathways (*p* > 0.05) were also identified, for example, RIG-I-like receptor signaling pathway, *Salmonella* infection, apoptosis, intestinal immune network for IgA production, regulation of autophagy, toll-like receptor signaling pathway, endocytosis, phagosome, lysosome, and mitogen-activated protein kinase signaling pathway. Of those, the intestinal immune network for IgA production (*p* < 0.05 in the H8 vs H0 comparison) is well known for its ability to generate large amounts of noninflammatory immunoglobulin A (IgA) antibodies that serve as the first line of defense against microorganisms. Three DEGs were involved in the intestinal immune network for IgA production: 109643253, 109643252, and 109633940.

# qRT-PCR Verification of Selected circRNAs, miRNAs, and mRNAs

To further confirm the expression level of circRNAs obtained by Illumina sequencing, six DE–circRNAs (novel\_circ\_0001462, novel\_circ\_0002610, novel\_circ\_0002746, novel\_circ\_0003643, novel\_circ\_0003068, and novel\_circ\_0002248) were selected for qRT-PCR. Divergent primers were designed for each selected circRNA, and EF1α was used as internal control (**Supplementary Table 13**). Specifically, divergent primers could only amplify circular RNA forms, but not genomic DNA or linearized mRNAs, and Sanger sequencing further confirmed the amplified products to be circRNAs (**Figure 9A**). Their relative expression level at different time points (H2, H8, and H12) was compared with that of H0. As shown in **Figure 9B**, most of the qRT-PCR results were consistent with those of Illumina sequencing. Therefore, the Sanger sequencing and qRT-PCR results confirmed the reliability and accuracy of the circRNA sequencing data.

To further confirm the expression level of miRNAs obtained by Illumina sequencing, eight DE–miRNAs (pol-miR-144-3p, pol-miR-182-5p, novel\_318, novel\_171, novel\_561, novel\_154, novel\_272, and novel\_54) were selected for qRT-PCR. Specific primers were designed for each selected miRNA, and 5S rRNA was used as the internal control (**Supplementary Table 14**). As shown in **Figure 10**, the quantitative assay and the high-throughput sequencing analysis of the eight miRNAs were similar in terms of fold change and significance of differential expression. Although there were a few differences in the fold change of expression, the direction of change was identical.

To validate the expression patterns of the mRNAs, qRT-PCR was used to detect the expression of 10 randomly selected DEGs (XM\_020102825, XM\_020094518, XM\_020094521, XM\_020094517, XM\_020091231, XM\_020092535, XM\_020113897, XM\_020112506, XM\_020093123, and XM\_020106476). Specific primers were designed for each selected mRNA, and EF1α was again used as the internal control (**Supplementary Table 15**). As shown in **Figure 11**, all 10 randomly selected DEGs showed a similar expression pattern between qRT-PCR and Illumina sequencing, although there were slight differences in the fold change.

# DISCUSSION

Histopathological evaluation and cytokine expression analysis of the posterior intestine of *P. olivaceus* helped in understanding and determining the pathogenesis of *E. tarda*. This research described the morphological changes and cytokine expression changes of the posterior intestine of olive flounder at different time points after *E. tarda* infection. As the infection developed, the intestinal mucosal structures showed pathological changes, such as cellular swelling, thickening of the lamina propria, shedding of the epithelial cells, and fragmentation of the mucosal folds. Furthermore, the structures of the posterior intestine involved in immunity, especially adaptive immunity, showed an increased number of inflammatory cells (predominantly lymphocytes) and goblet cells at late infection (i.e., 8 and 12 h post *E. tarda* infection), which was consistent with the qRT-PCR results of cytokine expression. A significantly thicker lamina propria, a structure involved in immunity, was observed at 8 and 12 h post-infection, indicating that the immune system was actively engaged during that period. Pathogenesis studies that integrate morphological histology and next-generation sequencing would support a better understanding of the intestinal immune response in the course of host–bacterial interaction.

FIGURE 10 | Confirmation of miRNAs by qRT-PCR analysis. qRT-PCR analysis result (orange) was compared with data obtained from Illumina sequencing (crimson). The expression rates of qRT-PCR between H2, H8, H12, and H0 samples are shown by fold change. The data present the mean ± standard error (SE) derived from triplicate experiments. The relative expression levels by high-throughput sequencing analysis are represented by $2^{\log_2(\mathrm{treatment}/\mathrm{control})}$.

A growing body of research has shown that circRNAs, acting as miRNA sponges, counteract miRNAs and ultimately mediate the expression of mRNAs (Hansen et al., 2013). Recently, circRNA–miRNA–mRNA regulatory networks have been increasingly demonstrated in different kinds of diseases (Zheng et al., 2016; Chen et al., 2017a). In this research, we constructed a global circRNA–miRNA–mRNA network based on predicted circRNA–miRNA and miRNA–mRNA pairs in the pathogenesis of *E. tarda* in olive flounder. The integrated circRNA–miRNA–mRNA network consisted of 44 circRNAs, 32 miRNAs, and 1,774 mRNAs. GO and KEGG analyses were performed to evaluate the function of the DEGs in the network. KEGG analysis revealed that two important intestinal immune pathways (herpes simplex infection pathway and intestinal immune network for IgA production pathway) differed significantly between the challenge and control groups.

Humans are the natural host of herpes simplex virus (HSV), of which HSV-1 resides in more than 60% of the world's population and causes peripheral disease (Farooq and Deepak, 2012). HSV-1 and HSV-2 initiate the infection process in epithelial cells of mucosal surfaces. Binding of glycoproteins gB and gC of HSV-1 to heparan sulfate proteoglycans on the host cell surface allows attachment of the viral glycoproteins gB, gD, and gL to host cellular receptors, such as nectin1, herpes virus entry mediator, or 3-*O*-sulfated HS, for membrane fusion and viral entry (Agelidis and Shukla, 2015; Menendez and Carr, 2017). In addition, nectin2 also functions as a receptor for HSV-2, although its binding to gD is weak. Consistent with these conclusions, a CHO cell line expressing hNectin2 was susceptible to HSV-2 infection (Fujimoto et al., 2016). We also found nectin2 (mRNA id, XM\_020093024.1; gene id, 109633274) in the current circRNA–miRNA–mRNA network, and it was down-regulated in the H2 versus H0 comparison. As with HSV, infection here began at mucosal surfaces: the challenged fish were immersed in the bacterial solution, so that their mucosal surfaces (gill, skin, and gastrointestinal tract) became important sites of bacterial exposure and colonization. Therefore, we concluded that nectin2 of *P. olivaceus* may also be a receptor of *E. tarda*, and that the down-regulation of nectin2 is an active host response for self-protection. We then found several circRNA–miRNA–mRNA networks in which nectin2 is located, including 1) novel\_circ\_0002248/novel\_circ\_0002610/novel\_circ\_0003068/novel\_circ\_0004671-novel\_144-nectin2 and 2) novel\_circ\_0000686/novel\_circ\_0002248/novel\_circ\_0005065-novel\_149-nectin2. In subsequent studies, we will focus on the function of nectin2 and its circRNA–miRNA–mRNA regulatory network.

The intestine is one of the main mucosa-associated lymphoid tissues (MALT) of teleosts (Salinas, 2015). One striking feature of intestinal immunity is its ability to generate large amounts of noninflammatory immunoglobulin A (IgA), which promotes immune exclusion by entrapping dietary antigens and microorganisms in the mucus and neutralizes toxins and pathogenic microbes (Sidonia and Tasuku, 2003). In this research, three up-regulated DEGs were involved in the intestinal immune network for IgA production pathway, including one MAP3K14 (mRNA id, XM\_020094132.1; gene id, 109633940) and two components of MHC II (mRNA id, XM\_020108298.1 and XM\_020108299.1; gene id, 109643252 and 109643253). MHC II proteins present antigens to CD4+ T cells and interact with the T-cell receptor, followed by T-cell activation and cytokine secretion in immune responses (Wang et al., 1999; Bénichou and Benmerah, 2003). In addition, mice deficient in MHC class II expression (C2d mice) cannot produce IgA antibody either against orally administered protein antigens or against antigens from a protozoan parasite that colonizes the small intestine (Snider et al., 1999). Therefore, we concluded that MHC II may play important roles in the defense of *P. olivaceus* against *E. tarda* infection. We then found several circRNA–miRNA–mRNA networks in which MHC II is located, including 1) novel\_circ\_0001006/novel\_circ\_0002863/novel\_circ\_0002865/novel\_circ\_0003059/novel\_circ\_0003296/novel\_circ\_0003912/novel\_circ\_0004144/novel\_circ\_0004162/novel\_circ\_0004346/novel\_circ\_0004959-novel\_171-109643253, 2) novel\_circ\_0000115/novel\_circ\_0000682/novel\_circ\_0000799/novel\_circ\_0002614/novel\_circ\_0002708/novel\_circ\_0002863/novel\_circ\_0002865/novel\_circ\_0002983/novel\_circ\_0003059/novel\_circ\_0004144/novel\_circ\_0004516/novel\_circ\_0004517/novel\_circ\_0004959-novel\_51-109643252, and 3) novel\_circ\_0000799/novel\_circ\_0002614/novel\_circ\_0003059/novel\_circ\_0003912-novel\_205-109643252. These MHC II-associated regulatory networks deserve further attention.

In conclusion, by employing Illumina sequencing, bioinformatics, and qRT-PCR technologies, we constructed circRNA–miRNA–mRNA networks and found two important intestinal immune pathways (herpes simplex infection pathway and intestinal immune network for IgA production pathway). In addition, three critical DEGs (nectin2, MHC II α-chain, and MHC II β-chain) were identified, and their circRNA–miRNA–mRNA networks were also constructed and discussed. Our study provides novel insight into the immune response of *P. olivaceus* to *E. tarda* infection from the circRNA–miRNA–mRNA perspective. Future research should investigate the specific mechanisms of action in the pathology of *E. tarda*.

# AUTHOR CONTRIBUTIONS

YX: constructed the circRNA–miRNA–mRNA network and wrote this paper; GJ: raised *Paralichthys olivaceus* and conducted the bacteria challenge experiment; SZ, JD, and HL: analyzed the sequencing results of circRNA, miRNA, and mRNA; BS: performed the histopathological analysis on the intestine tissues and revised the manuscript; CL: conceived and designed the research, and revised the manuscript.

# FUNDING

This work was supported by Key Research and Invention program in Shandong Province (2017GHY215004); Key Research and Development Program of Shandong Province (2016GSF115026); the Open Fund of Shandong Key Laboratory of Disease Control in Mariculture (KF201804); Natural Science Foundation of Shandong Province (Grant No. ZR2019BC009); Advanced Talents Foundation of QAU (Grant No. 6651118016); Fish Innovation Team of Shandong Agriculture Research System (SDAIT-12-06); Major Agricultural Applied Technological Innovation Projects of Shandong Province; First-class Fishery Discipline programme in Shandong Province; Aquatic Animal Immunologic Agents Engineering Research Center of Shandong Province; and Graduate Innovation Program of Qingdao Agricultural University.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00731/full#supplementary-material

# REFERENCES

the resistance against herpes simplex virus type 2 infection in transfected cells. *Acta Virol.* 60, 41–48. doi: 10.4149/av\_2016\_01\_41


channel catfish, *Ictalurus punctatus. Fish Shellfish Immunol.* 32, 816–827. doi: 10.1016/j.fsi.2012.02.004


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Xiu, Jiang, Zhou, Diao, Liu, Su and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Genome-Wide Patterns of Population Structure and Linkage Disequilibrium in Farmed Nile Tilapia (*Oreochromis niloticus*)

*Grazyella M. Yoshida1,2, Agustín Barria1, Katharina Correa2, Giovanna Cáceres1, Ana Jedlicki1, María I. Cadiz1, Jean P. Lhorente2 and José M. Yáñez1,2,3\**

*1 Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile, Santiago, Chile, 2 Benchmark Genetics Chile, Puerto Montt, Chile, 3 Nucleo Milenio INVASAL, Concepción, Chile*

### *Edited by:*

*Lior David, Hebrew University of Jerusalem, Israel*

### *Reviewed by:*

*Rajesh Joshi, Genomar Genetics AS, Norway Solomon Antwi Boison, Marine Harvest (Norway), Norway*

> *\*Correspondence: José Manuel Yáñez jmayanez@uchile.cl*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 13 January 2019 Accepted: 16 July 2019 Published: 04 September 2019*

### *Citation:*

*Yoshida GM, Barria A, Correa K, Cáceres G, Jedlicki A, Cadiz MI, Lhorente JP and Yáñez JM (2019) Genome-Wide Patterns of Population Structure and Linkage Disequilibrium in Farmed Nile Tilapia (Oreochromis niloticus). Front. Genet. 10:745. doi: 10.3389/fgene.2019.00745*

Nile tilapia (*Oreochromis niloticus*) is one of the most produced farmed fish in the world and represents an important source of protein for human consumption. Farmed Nile tilapia populations are increasingly based on genetically improved stocks, which have been established from admixed populations. To date, there is scarce information about the population genomics of farmed Nile tilapia assessed by dense single nucleotide polymorphism (SNP) panels. The patterns of linkage disequilibrium (LD) may affect the success of genome-wide association studies (GWAS) and genomic selection (GS), and also provide key information about the demographic history of farmed Nile tilapia populations. The objectives of this study were to provide further knowledge about the population structure and LD patterns, and to estimate the effective population size (*Ne*), for three farmed Nile tilapia populations, one from Brazil (POP A) and two from Costa Rica (POP B and POP C). A total of 55 individuals from each population were genotyped using a 50K SNP panel selected from a whole-genome sequencing (WGS) experiment. The first two principal components explained about 20% of the total variation and clearly differentiated the three populations. Population genetic structure analysis showed evidence of admixture, especially for POP C. The contemporary *Ne* estimated from LD values ranged from 78 to 159. No differences were observed in the LD decay among populations, with a rapid decrease of *r*² with increasing inter-marker distance. Average *r*² between adjacent SNP pairs ranged from 0.19 to 0.03 for both POP A and C, and from 0.20 to 0.03 for POP B. Based on the number of independent chromosome segments in the Nile tilapia genome, at least 9.4K, 7.6K, and 4.6K SNPs for POP A, POP B, and POP C, respectively, are required for the implementation of GS in the present farmed Nile tilapia populations.

Keywords: effective population size, LD decay, linkage disequilibrium, *Oreochromis niloticus,* population structure

# INTRODUCTION

Nile tilapia (*Oreochromis niloticus*) is one of the most important farmed fish species worldwide (FAO, 2018). Breeding programs established since the 1990s have played a key role in improving commercially important traits and expanding Nile tilapia farming. The Genetically Improved Farmed Tilapia (GIFT) is the most widespread tilapia breeding strain (Lim and Webster, 2006) and has been introduced to several countries in Asia, Africa and Latin America (Gupta and Acosta, 2004). The genetic base of GIFT was established from eight African and Asian populations, and after six generations of selection, the genetic gains ranged from 10 to 15% per generation for growth-related traits (Eknath et al., 1993), providing evidence that selective breeding using phenotype and pedigree information can achieve high and constant genetic gains (Gjedrem and Rye, 2018).

The recent development of dense SNP panels for Nile tilapia (Joshi et al., 2018; Yáñez et al., 2019) will provide new opportunities for uncovering the genetic basis of important commercial traits, especially traits that are difficult or expensive to measure in selection candidates. As has been demonstrated for different traits in salmonid species, the incorporation of genomic evaluations in breeding programs is expected to increase the accuracy of breeding values compared to pedigree-based methods (Tsai et al., 2016; Bangera et al., 2017; Correa et al., 2017; Sae-Lim et al., 2017; Yoshida et al., 2017; Barria et al., 2018b; Vallejo et al., 2018; Yoshida et al., 2019a).

Genomic studies exploit the linkage disequilibrium (LD) between SNPs and quantitative trait loci (QTL) or causative mutations. Thus, knowing the extent and decay of LD within a population is important to determine the number of markers required for successful association mapping and genomic prediction (de Roos et al., 2008; Khatkar et al., 2008; Porto-Neto et al., 2014; Brito et al., 2015). When LD levels within a population are low, a higher marker density is required to capture the genetic variation across the genome (Khatkar et al., 2008). In addition, LD patterns provide relevant information about past demographic events, including the response to both natural and artificial selection (Slatkin, 2008). LD estimates throughout the genome therefore reflect the population history and provide insight into the breeding system and patterns of geographic subdivision, which can be explored to study the degree of diversity in different populations.

To date, the most widely used measures of LD between two loci are Pearson's squared correlation coefficient (*r*²) and Lewontin's D' (commonly named D'). Values lower than 1 for D' indicate separation of loci due to recombination, while D' = 1 indicates complete LD between loci, i.e., no recombination. However, this parameter is highly influenced by allele frequency and sample size. Thus, high D' estimates are possible even when loci are in linkage equilibrium (Ardlie et al., 2002). Therefore, LD measured as *r*² between two loci is suggested as the most suitable measurement for SNP data (Pritchard and Przeworski, 2001).
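The difference between the two measures can be illustrated with a minimal sketch. It assumes phased haplotype counts for two biallelic loci are available (genotype-based software typically estimates these quantities from unphased data instead); the function name and the example counts are hypothetical and are not taken from the study.

```python
def ld_from_haplotype_counts(n_AB, n_Ab, n_aB, n_ab):
    """Return (r2, d_prime) for two biallelic loci from phased haplotype counts.

    Illustrative only: assumes haplotypes are known, which is not the case
    for raw SNP-chip genotypes.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n   # frequency of allele B at the second locus
    p_AB = n_AB / n           # frequency of the A-B haplotype
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")

    d = p_AB - p_A * p_B      # raw disequilibrium coefficient D
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))

    # D' rescales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    return r2, d_prime


# Balanced allele frequencies: r2 is about 0.36 and D' about 0.60.
balanced = ld_from_haplotype_counts(40, 10, 10, 40)
# One haplotype class absent and a rare allele: D' reaches 1 while r2 stays near 0.11.
rare_allele = ld_from_haplotype_counts(10, 0, 40, 50)
```

In the second example one haplotype class is missing, so D' equals 1 even though the loci are only weakly correlated, which is the sensitivity to allele frequency that motivates the use of *r*² here.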

LD patterns have been widely studied in different livestock species, such as sheep (Prieur et al., 2017), goats (Mdladla et al., 2016), pigs (Ai et al., 2013), beef (Espigolan et al., 2013; Porto-Neto et al., 2014) and dairy cattle (Bohmanova et al., 2010). In aquaculture, recent studies have aimed at characterizing the extent and decay of LD in farmed species, such as Pacific white shrimp (Jones et al., 2017), Pacific oyster (Zhong et al., 2017), rainbow trout (Rexroad and Vallejo, 2009; Vallejo et al., 2018), coho salmon (Barria et al., 2018a) and Atlantic salmon (Hayes et al., 2006; Gutierrez et al., 2015; Kijas et al., 2016; Barria et al., 2018c). However, to date there is scarce information about population genomic structure and LD in farmed Nile tilapia assessed with dense SNP panels. The assessment of LD patterns in Nile tilapia is still limited to a few studies that used either a small number of markers (14 microsatellites) (Sukmanomon et al., 2012) or a small number of individuals (4 to 23 samples) (Hong Xia et al., 2015). Recently, the construction of a dense linkage map for Nile tilapia suggested a sigmoid recombination profile in most linkage groups (LG), with higher recombination rates in the middle and lower recombination rates at the ends of the LGs (Joshi et al., 2018). These patterns are consistent with the high LD levels found at the ends of almost all chromosomes in a hybrid Nile tilapia population (Conte et al., 2019). The objectives of the present study were to i) estimate the population structure and genetic differentiation, ii) assess the genome-wide levels of LD and iii) determine the effective population size of three Nile tilapia breeding populations established in Latin America.

# METHODS

# Populations

Samples were obtained from three different commercial breeding populations established in Latin America, which originated from admixed stocks imported from Asia and genetically improved for growth rate for more than 20 generations. Individuals from population A (POP A) belong to the AquaAmerica (Brazil) breeding population, where the animals are evaluated in cage-based production systems and have been artificially selected for three generations for improved growth rate, using daily weight gain as the selection criterion. This population was imported from GIFT Malaysia in 2005 for breeding and farming purposes. Individuals from population B (POP B) and C (POP C) were obtained from Aquacorporación Internacional (Costa Rica) and correspond to fish from the seventh and eighth generation, respectively, of selection for improved growth-related traits (body weight at 400 g as the selection criterion) under pond production systems. The POP B breeding population is a mixture of the GIFT strain (8th generation), POP C and the wild strains from Egypt and Kenya used to generate the GIFT strain. The POP C breeding population represents a combination of genetic material from Israel, Singapore, Taiwan and Thailand. Therefore, the three breeding populations are considered recently admixed populations, which are directly or indirectly related to the GIFT strain. Based on the overall identical-by-descent (IBD) alleles, average relatedness between individuals within each population was estimated using Plink v1.90 (Purcell et al., 2007), through the --genome option.

# Genotyping

The genotypes were selected from a whole-genome sequencing experiment aimed at designing a 50K SNP Illumina BeadChip, which is described in detail by Yáñez et al. (2019). Briefly, caudal fin clips were sampled from 59, 126 and 141 individuals belonging to POP A, POP B and POP C, respectively. Genomic DNA was purified from all the samples using the DNeasy Blood & Tissue Kit (QIAGEN) according to the manufacturer's protocol (http://www.bea.ki.se/documents/EN-DNeasy%20handbook.pdf). Whole-genome sequencing was performed by multiplexing four bar-coded samples per lane of 100 bp paired-end reads on the Illumina HiSeq 2500 machine. The sequences were trimmed and aligned against the genome assembly O\_niloticus\_UMD\_NMBU (Conte et al., 2019). About 36 million polymorphic sites were discovered after variant calling using the Genome Analysis Toolkit GATK (McKenna et al., 2010). A list of 50K SNPs was selected based on quality of genotype and site, number of missing values, minor allele frequency (MAF), unique position in the genome, and even distribution across the genome, as described by Yáñez et al. (2019). Genotype quality control (QC) was performed within each population separately, excluding SNPs with MAF lower than 5%, Hardy–Weinberg Equilibrium P-value < 1e−06, and missing genotypes higher than 70%. Animals with a genotype call rate below 95% were discarded. Subsequent analyses were done using the markers common to the three populations after QC (**Table 1**). Using the --genome function from Plink, animals from POP B and POP C with the highest identity by descent (IBD) were excluded (Gutierrez et al., 2015), to obtain a similar sample size among populations.

## Population Structure

We investigated population differentiation by calculating the pairwise Weir and Cockerham's Fst estimator (Weir and Cockerham, 1984) across all loci among populations, using the VCFTools software (Danecek et al., 2011). We used the software Plink v1.90 (Purcell et al., 2007) to calculate observed (Ho) and expected (He) heterozygosity for each of the three populations and to assess genetic differentiation through principal component analysis (PCA). The first two principal components were plotted along two axes using R scripts (R Core Team, 2016). Additionally, the population structure was examined using a hierarchical Bayesian model implemented in the STRUCTURE software v.2.3.4 (Pritchard et al., 2000). We used three replicates for each K value from 1 to 12, with a burn-in of 20,000 iterations followed by 50,000 MCMC iterations. To choose the best K value, we computed the posterior probability of each K as suggested by Pritchard et al. (2000).
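The observed and expected heterozygosity mentioned above can be sketched as follows. This is a simplified illustration, not the routine used in the study: it assumes genotypes coded as 0, 1 or 2 copies of one allele (with `None` for missing calls), computes He as 2pq per SNP without any small-sample correction, and averages across SNPs. The function name and toy data are hypothetical.

```python
def heterozygosity(genotype_matrix):
    """Return (mean Ho, mean He) across SNPs.

    genotype_matrix: list of SNPs, each a list of genotypes coded 0/1/2
    (copies of one allele); None marks a missing genotype.
    """
    ho_values, he_values = [], []
    for snp in genotype_matrix:
        called = [g for g in snp if g is not None]
        if not called:
            continue  # skip SNPs with no called genotypes
        n = len(called)
        ho = sum(1 for g in called if g == 1) / n  # observed heterozygotes
        p = sum(called) / (2 * n)                  # frequency of the counted allele
        he = 2 * p * (1 - p)                       # Hardy-Weinberg expectation
        ho_values.append(ho)
        he_values.append(he)
    if not ho_values:
        raise ValueError("no SNP with called genotypes")
    return sum(ho_values) / len(ho_values), sum(he_values) / len(he_values)


# Toy data: two SNPs, four or five animals each.
toy = [[0, 1, 1, 2], [0, 0, 0, 2, None]]
mean_ho, mean_he = heterozygosity(toy)
```

A mean Ho below the mean He, as in this toy example, is the kind of heterozygote deficit that is interpreted later in terms of demographic history.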

# Estimation of Linkage Disequilibrium and Effective Population Size

TABLE 1 | Summary of results from quality control of SNPs for each farmed Nile tilapia population.

*\*To select common markers among populations.*

*\*\*From a total of 46,334 SNPs.*

We used Pearson's squared correlation coefficient (*r*²) to estimate the LD between each pair of markers. We used Plink v1.90 with the parameters --ld-window-kb 10000, --ld-window 99999, and --ld-window-r2 set to zero to calculate the LD between all pairs of SNPs on each chromosome. Based on the physical distance between the SNPs of each pair, we grouped all pairwise combinations into bins of 100 kb. The extent and decay of LD for each population were visualized by plotting the average *r*² within each bin, spanning a physical distance from 0 to 10 Mb. We used the software SNeP v1.1 (Barbato et al., 2015) to estimate the historical effective population size (*Ne*). Considering the LD within each population, *Ne* was estimated using the following equation proposed by Corbin et al. (2012):

$$N_{et} = \frac{1}{4f(c_t)} \left( \frac{1}{E\left[ r_{\rm adj}^2 \mid c_t \right]} - \alpha \right)$$

where $N_{et}$ is the effective population size *t* generations ago; $E[r_{\rm adj}^2 \mid c_t]$ is the expected LD corrected for sample size ($r_{\rm adj}^2 = r^2 - 1/\text{sample size}$), conditional on the markers being the appropriate distance apart given *t* and the mapping function $f(c_t)$; and α is the adjustment for mutation rate (α = 2 indicates the presence of mutation). The number and size of bins were left at their default values (30 and 50 Kb, respectively). Because of the relatively small number of SNPs per chromosome, *Ne* per chromosome was calculated using the harmonic mean (Alvarenga et al., 2018). Using the LD method, we calculated the contemporary population size with the software NeEstimator v2.01 (Do et al., 2014), with a nonrandom mating model and a critical value of 0.05. Additionally, we fitted a linear regression model to the historical values of *Ne* to calculate the contemporary *Ne*.
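A minimal sketch of this kind of LD-based estimate is given below. It assumes the bracketed term enters linearly, as in Corbin et al. (2012), that is, N = (1 / (4 f(c))) (1 / r²adj − α). It further assumes, purely for illustration, that the mapping function is f(c) = c, that linkage distance is approximated as physical distance times a uniform recombination rate of 1e-8 Morgans per bp, and that the corresponding time point is t = 1 / (2c) generations. None of these settings, nor the function names, are taken from the study.

```python
from statistics import harmonic_mean


def ne_from_ld(mean_r2, sample_size, distance_bp, alpha=2.0, rec_rate_per_bp=1e-8):
    """Return (Ne, generations_ago) for one distance bin.

    Assumptions (illustrative): f(c) = c, uniform recombination rate,
    and t = 1 / (2c).
    """
    c = distance_bp * rec_rate_per_bp       # linkage distance in Morgans
    r2_adj = mean_r2 - 1.0 / sample_size    # sample-size correction of r2
    if r2_adj <= 0:
        raise ValueError("adjusted r2 must be positive")
    ne = (1.0 / (4.0 * c)) * (1.0 / r2_adj - alpha)
    generations_ago = 1.0 / (2.0 * c)
    return ne, generations_ago


# Hypothetical bin: mean r2 of 0.05 among 55 animals at 1 Mb spacing.
ne_example, t_example = ne_from_ld(0.05, 55, 1_000_000)

# Combining per-bin or per-chromosome estimates with a harmonic mean,
# which gives more weight to small values than an arithmetic mean does.
combined_ne = harmonic_mean([100.0, 200.0])
```

Under these assumptions, short marker distances inform about *Ne* many generations ago and long distances about the recent past, which is why the estimates are reported as a trajectory over generations.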

The effective number of chromosome segments (Me) was estimated using the following formula proposed by Daetwyler et al. (2010):

$$M_e = 4N_eL$$

where *Ne* is the effective population size and L is the length of the Nile tilapia genome in Morgans.
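As a worked illustration of this formula, the sketch below evaluates Me for the three contemporary *Ne* values reported in this study (159, 128 and 78). The genome length is not stated in the text; the value of 14.8 Morgans used here is an assumption, chosen only because it roughly reproduces the marker counts quoted in the abstract, and should not be read as the value used by the authors.

```python
def effective_segments(ne, genome_length_morgans):
    """Effective number of chromosome segments, Me = 4 * Ne * L."""
    return 4.0 * ne * genome_length_morgans


ASSUMED_GENOME_LENGTH_MORGANS = 14.8  # assumption, not stated in the source

# Contemporary Ne values reported for the three populations.
contemporary_ne = {"POP A": 159, "POP B": 128, "POP C": 78}
segments = {
    pop: effective_segments(ne, ASSUMED_GENOME_LENGTH_MORGANS)
    for pop, ne in contemporary_ne.items()
}
for pop, me in segments.items():
    print(f"{pop}: Me = {me:.0f}")
```

Because Me scales linearly with *Ne*, the population with the smallest effective size needs the fewest markers to tag its independent segments.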

## RESULTS

## Quality Control

Out of the initial 46,334 markers, a total of 33,236 markers were shared among the three populations after applying the QC criteria. The MAF < 0.05 filter excluded the largest number of SNPs in all populations (ranging from ~3K to ~9K) (**Table 1**). After QC, all three populations showed a similar mean MAF value of 0.26 ± 0.13 and a similar proportion of SNPs in each MAF class (**Figure 1**). The lowest (~0.13) and highest (~0.25) proportions of SNPs were observed in the MAF classes ranging from 0.05 to 0.09 and from 0.10 to 0.19, respectively.

For downstream analysis, we selected 55 animals from each population based on identity-by-descent (IBD) analysis. We discarded a total of 4, 71 and 86 animals from POP A, POP B and POP C, respectively. Thus, the average relatedness within populations was 0.00 ± 0.01.

# Population Structure

When the first two eigenvectors were plotted, the three populations were clearly stratified. The first two principal components together accounted for 20.0% of the genetic variation and revealed the different populations (**Figure 2**). PC1 differentiated POP B and C (Costa Rica) from POP A (Brazil) and accounted for 11.3% of the total genetic variation. The second principal component explained 8.7% of the total variance and separated the populations from Costa Rica (POP B and C) into two different clusters. To assess the genetic diversity within populations, we calculated the observed/expected heterozygosity ratio (Ho/He). We found values of 0.23/0.34, 0.26/0.35 and 0.26/0.36 for POP A, POP B and POP C, respectively. Similar levels of genetic differentiation were found between POP A and POP B and between POP A and POP C (Fst = 0.072 ± 0.11 and Fst = 0.070 ± 0.10), whereas a lower Fst value was observed between POP B and POP C (Fst = 0.056 ± 0.09).

In the admixture analysis, the posterior probability (Pr) of the admixture model fitted to the data was computed using K values from 1 to 12 (**Supplementary Table 1**). After several MCMC runs for each K value (Pritchard et al., 2000), the best result was obtained with K = 11. These results indicated that the three populations share large proportions of their genomes with each other, indicating a high level of admixture and a diverse genetic composition (**Figure 3**). STRUCTURE results for K values from 2 to 12 are presented in **Supplementary File 1**, while posterior probabilities are shown in **Supplementary Table 1**.

# Estimation of Linkage Disequilibrium and Effective Population Size

FIGURE 3 | Admixture clustering of the three Nile tilapia populations for K = 11. The animals are grouped by population and each individual is represented by a vertical bar. The gradient black lines delineate the different populations under study and each color represents a different cluster, from 2 to 12 (C02 to C12).

TABLE 2 | Number of SNPs, chromosome linkage group (LG), size in megabases (Mb), average linkage disequilibrium (*r*²) ± standard deviation (SD) and effective population size (*Ne*) values for three Nile tilapia farmed populations.

The overall mean LD between marker pairs, measured as *r*², was similar among populations, with values of 0.06 ± 0.10 for the three populations studied (**Table 2**). In general, the average LD among chromosomes ranged from 0.04 to 0.08 for all populations (**Table 2**). From 1 to 10,000 Kb, the average *r*² decreased with increasing physical distance between markers, from 0.19 to 0.03 for both POP A and C, and from 0.20 to 0.03 for POP B. The average LD decayed to less than 0.05 within 5 Mb (**Figure 4**), and this rate of decrease was very similar across all chromosomes for the three populations (**Supplementary Files 2** to **4**). In addition, SNP pairs with *r*² > 0.80 were plotted for each chromosome (**Supplementary Files 5** to **7**), which suggested that for some chromosomes (e.g., LG01, LG2, LG19 and LG23) the highest *r*² values were at both chromosome ends in the three studied populations.
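The LD decay summarized above rests on averaging pairwise *r*² values within 100 kb distance bins up to 10 Mb. A minimal sketch of that binning step is shown below; it assumes the pairwise values are already available as (position 1, position 2, *r*²) tuples for a single chromosome, and the function name and example values are hypothetical.

```python
from collections import defaultdict


def ld_decay(pairs, bin_size_bp=100_000, max_dist_bp=10_000_000):
    """Average r2 per distance bin.

    pairs: iterable of (pos_a, pos_b, r2) for SNP pairs on one chromosome.
    Returns a dict mapping the start of each bin (in bp) to the mean r2.
    """
    n_bins = max_dist_bp // bin_size_bp
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos_a, pos_b, r2 in pairs:
        dist = abs(pos_b - pos_a)
        if dist > max_dist_bp:
            continue  # pair lies outside the window being summarized
        # Pairs exactly at the upper limit are placed in the last bin.
        index = min(dist // bin_size_bp, n_bins - 1)
        sums[index] += r2
        counts[index] += 1
    return {index * bin_size_bp: sums[index] / counts[index] for index in sorted(sums)}


# Hypothetical pairwise values (positions in bp).
example_pairs = [
    (100, 50_100, 0.20),
    (0, 90_000, 0.10),
    (0, 150_000, 0.04),
    (0, 20_000_000, 0.90),  # beyond 10 Mb, ignored
]
decay = ld_decay(example_pairs)
```

Plotting the returned bin starts against the mean *r*² values gives a decay curve of the kind shown in **Figure 4**.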

**Figure 5** shows the historical *Ne* from 1,105 to 5 generations ago. The *Ne* values were lower in the recent past than in the distant past. The values calculated at five generations ago were 93, 90 and 78 for POP A, POP B and POP C, respectively. The harmonic means of *Ne* from five to 1,105 generations ago were 196, 199 and 181 for POP A, POP B and POP C, respectively. In addition, *Ne* varied among chromosomes, ranging from 127 to 255 (**Table 2**). Recent *Ne* values calculated from LD were 159, 128 and 78 for POP A, POP B and POP C, respectively, whereas the regression on historical *Ne* resulted in contemporary *Ne* values of 111, 121 and 106 for POP A, B, and C, respectively. Based on the effective number of chromosome segments, the minimum number of markers for a high-power genomic analysis should be at least 9,400, 7,600, and 4,600 for POP A, POP B, and POP C, respectively.

# DISCUSSION

# Genomic Population Structure

In the PCA, the first two principal components explained about 20% of the total genetic variation for the populations studied and clearly revealed three different clusters, corresponding to the three populations present in the dataset (**Figure 2**). In addition, the low value of Ho relative to He suggests a loss of genetic diversity due to founder effects or small effective population size.

The admixture results provided evidence of a recent mixture of different strains that formed highly admixed populations. Although the PCA shows three distinct populations, the admixture analysis suggested that the three Nile tilapia populations studied are related through their common GIFT origin. The genetic differentiation among populations may have been partly generated by genetic drift or founder effect events, which can have a pronounced effect on allele frequencies (Allendorf and Phelps, 1980). Furthermore, the three populations have undergone artificial selection for the improvement of growth-related traits in different geographic locations, exposing the populations to distinct environmental conditions and production systems. This is especially apparent in the comparison between the populations from Brazil and Costa Rica. POP A from Brazil is evaluated in a cage-based production system during the autumn and winter seasons, whereas both POP B and C, from Costa Rica, are evaluated in pond-based conditions during the winter and spring seasons. Furthermore, environmental conditions such as temperature and rainfall differ between the countries. In Brazil, the temperature ranges from 10 to 29°C during the year and the rainfall period coincides with the spring and summer seasons, whereas in Costa Rica the wet periods coincide with autumn and winter, and the temperature is rarely lower than 22°C.

# Linkage Disequilibrium and Effective Population Size

Evaluating the whole-genome LD within populations may help to understand the different demographic processes experienced by these populations. These processes include admixture, mutation, founder effect, inbreeding and selection (Gaut and Long, 2003). This is the first study aimed at estimating the extent and decay of LD in farmed Nile tilapia populations established in Latin America (specifically, Brazil and Costa Rica) and artificially selected for growth-related traits. Measures to reduce bias included the removal of animals with high IBD, as described in the Methods, which also resulted in a similar number of animals from each population. Similarly, alleles at high frequency result in less biased estimates of LD (Espigolan et al., 2013). In the present study, only a small proportion of SNPs (<13%) had MAF lower than 0.10 and IBD values were low, which supports an accurate estimation of LD.

Accurate LD estimates depend on several factors, including sample size and relatedness among individuals. In the current work we used 55 animals from each population, as suggested by Bohmanova et al. (2010) and Khatkar et al. (2008). Furthermore, we used *r*² as a measure of LD instead of |D'| to avoid the likely overestimation of LD due to this sample size (Khatkar et al., 2008).

We updated the order and positions of the SNPs on the 50K Illumina BeadChip panel (Yáñez et al., 2019) to the most recent Nile tilapia genome reference (O\_niloticus\_UMD\_NMBU, Conte et al., 2019) to obtain more accurate intermarker distances. However, on chromosomes LG13 and LG19 we observed a cluster of *r*² values > 0.40 for SNP pairs at large distances (>7 Mb; **Supplementary Files 2** to **4**), whereas LD is expected to decline as the physical distance between markers increases. Incorrect SNP positions on the reference genome or errors in the reference genome assembly may explain these anomalies. Our study revealed that the LD level declined to 0.05 at an intermarker distance of 5 Mb and that the decay patterns were similar between populations (**Figure 4**). A previous study by Hong Xia et al. (2015) reported similar LD patterns for GIFT tilapia stocks collected from South Africa, Singapore and China. Using microsatellites, Sukmanomon et al. (2012) estimated a mean LD, expressed as the disequilibrium coefficient (D'), of 0.05 for a GIFT population originating from the Philippines. In contrast, Conte et al. (2019) and Joshi et al. (2018), using dense marker panels (>40 K), reported higher LD values at the ends of LGs and low values in the middle, supported by the identification of a sigmoidal pattern of recombination in most of the chromosomes, with high and low recombination rates in the middle and at both chromosome extremes, respectively. We found similar LD patterns in some chromosomes for the three farmed Nile tilapia populations from Latin America studied here; nevertheless, we found a smaller number of marker pairs in high LD compared to Conte et al. (2019).

Because of differences between genomes, the quality control applied and population structure, comparing LD between species is not strictly appropriate; however, we used references from other farmed fish species because of the limited information available for this kind of study in tilapia. The tilapia populations seem to present weaker short-range LD than other farmed fish populations (Gutierrez et al., 2015; Kijas et al., 2016; Barria et al., 2018a; Barria et al., 2018c; Vallejo et al., 2018). A likely explanation is the diverse origin of the base population used to form the Nile tilapia populations studied here, as has also been suggested for a Chilean farmed Atlantic salmon population of Norwegian origin (Barria et al., 2018c). In salmonids, admixture has been suggested as a major factor contributing to long-range LD (Ødegård et al., 2014; Barria et al., 2018c; Vallejo et al., 2018). Our results suggest recent admixture in the three studied populations, with introgression of multiple strains of different origins. However, this admixture process has not resulted in long-range LD, suggesting that other biological and demographic processes, including recombination rates and effective population size, also shape the current levels of LD in POP A, B and C.

Linkage disequilibrium at short distances is a function of effective population size many generations ago, whereas LD at long distances reflects recent population history. The LD estimates at short and long distances show a similar pattern for the three Nile tilapia populations (**Figure 4**). These results are reflected in only slight differences among populations in *Ne*, both many generations ago and in the recent past (**Figure 5**). However, a continuous reduction in *Ne* was observed over the previous 1,105 generations (**Figure 5**). The three populations in this study have been under artificial selection for several generations. The reduction of *Ne* can be an indicator of selection and has been suggested as an important cause of increased LD (Pritchard and Przeworski, 2001). The use of a common GIFT strain as the genetic basis of POP A, B, and C, together with similar demographic processes among them (recent admixture and selection), may have resulted in the similar patterns of LD and historical *Ne*. Among the chromosomes, the highest mean LD values (ranging from 0.04 to 0.09) and the lowest effective population sizes (<161) were found for LG7, LG13 and LG19 (**Table 2**). The variation in autosomal recombination rates among chromosomes (Conte et al., 2019) leads to diversity in the pattern of LD in different genomic regions. In addition, differences in LD can be attributed to the number of markers analyzed per chromosome, their MAF values and the effect of artificial selection across the genome (López et al., 2015).

The contemporary *Ne* estimated using both the NeEstimator v2.01 software (Do et al., 2014) and the regression of historical *Ne* values showed the same pattern of *Ne* expansion. The most likely explanation for the increasing *Ne* in recent generations is the recent establishment of these composite populations through the hybridization of different Nile tilapia strains 5 to 10 generations back (Cáceres et al., 2019). Moreover, the selection and mating methods for these populations are based on optimizing the contributions of parents to progeny, which minimizes the average co-ancestry among progeny, reduces the inbreeding level (Meuwissen, 1997; Kinghorn, 1998), and maximizes the effective population size (Caballero and Toro, 2000). Previously, a similar value of *Ne* was estimated using pedigree information from a GIFT population from Malaysia (*Ne* = 88) (Ponzoni et al., 2010). Some authors suggest keeping *Ne* values between 50 and 200 to ensure genetic variability and diversity in a long-term breeding population (Smitherman and Tave, 1987). In contrast to the results found here, a smaller *Ne* was found for farmed rainbow trout (Vallejo et al., 2018) and Atlantic salmon with North American and European origins (Kijas et al., 2016; Barria et al., 2018c).

In summary, within tilapia populations, the LD values were very low even at short distances (*r2* = 0.15 for markers spaced at 20–80 Kb). Similar values were found in humans (Reich et al., 2001; Ardlie et al., 2002), coho salmon (Barria et al., 2018a), some breeds of cattle (de Roos et al., 2008; Khatkar et al., 2008; Yurchenko et al., 2018), sheep (Alvarenga et al., 2018) and goats (Brito et al., 2015).

## Practical Implications

The LD results have several implications for the future implementation of genomic tools in the current farmed Nile tilapia populations. Both GWAS and genomic selection depend on the extent of LD, which determines the number of SNPs necessary to capture the variance of the causative mutation (Flint-Garcia et al., 2003) and to achieve a certain accuracy of genomic estimated breeding values (Meuwissen et al., 2001). Meuwissen (2009) suggested that, to achieve accuracies of genomic estimated breeding values (GEBV) ranging from 0.88 to 0.93 using unrelated individuals, it is necessary to have *2NeL* individuals and *10NeL* markers, where *L* is the length of the genome in Morgans. In our study, the contemporary *Ne* is 159, 128 and 78 for POP A, POP B and POP C, respectively, and the length of the genome is 14.8 Morgans (Joshi et al., 2018; Conte et al., 2019). Thus, approximately 11,500 to 23,500 markers would be required for unrelated Nile tilapia populations. In contrast, Goddard (2009) suggested that the accuracy of genomic prediction is highly dependent on the effective number of chromosome segments (*Me* = *4NeL*).

Assuming a number of independent, biallelic and additive QTL affecting the trait, a smaller number of markers would be needed to achieve high accuracy. Thus, the minimum number of markers for a high-power genomic analysis should be at least 9,400, 7,600, and 4,600 for POP A, POP B, and POP C, respectively. Although these numbers are slightly lower than those suggested by Vallejo et al. (2018) and Barria et al. (2018a) for rainbow trout and coho salmon, respectively, alternative methods are necessary for cost-efficient genomic application in tilapia breeding programs.
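The marker numbers quoted above follow directly from the *2NeL*, *10NeL* and *4NeL* expressions. The short sketch below reproduces that arithmetic with the *Ne* values and genome length given in the text; the function and variable names are illustrative and not part of any published software.

```python
GENOME_LENGTH_MORGANS = 14.8  # L, as stated in the text
CONTEMPORARY_NE = {"POP A": 159, "POP B": 128, "POP C": 78}  # Ne, as stated in the text


def meuwissen_requirements(ne, length_morgans=GENOME_LENGTH_MORGANS):
    """Return (individuals, markers) as 2*Ne*L and 10*Ne*L (Meuwissen, 2009)."""
    return 2 * ne * length_morgans, 10 * ne * length_morgans


def independent_segments(ne, length_morgans=GENOME_LENGTH_MORGANS):
    """Return the effective number of chromosome segments Me = 4*Ne*L (Goddard, 2009)."""
    return 4 * ne * length_morgans


if __name__ == "__main__":
    for population, ne in CONTEMPORARY_NE.items():
        individuals, markers = meuwissen_requirements(ne)
        segments = independent_segments(ne)
        print(f"{population}: 2NeL = {individuals:,.0f}, "
              f"10NeL = {markers:,.0f}, Me = 4NeL = {segments:,.0f}")
```

Rounding the *10NeL* values for POP C and POP A gives the range of roughly 11,500 to 23,500 markers, and rounding the *4NeL* values gives the 9,400, 7,600 and 4,600 figures.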

A recent study tested different marker densities and imputed genotypes to assess genomic prediction accuracies in a farmed Nile tilapia population. The prediction accuracy using genomic information outperformed the estimated breeding values using the classical pedigree-based best linear unbiased prediction, even using a very low-density panel (0.5K) for growth and fillet yield (Yoshida et al., 2019b). In addition, the high values of imputation accuracy (>0.90) were not affected by the linkage disequilibrium pattern, probably due to the family-based population structure and high relatedness among animals, suggesting that genomic information may be cost-effectively included in Nile tilapia breeding programs.

# CONCLUSIONS

The current study revealed similar short-range LD decay for three farmed Nile tilapia populations. The PCA suggested three distinct populations and the admixture analysis confirmed that these three populations are highly admixed. Based on the number of independent chromosome segments, at least 9.4, 7.6, and 4.6 K SNPs for POP A, B, and C, respectively, might be required to implement genomic prediction in the current Nile tilapia populations, whereas for GWAS more markers may be necessary to achieve higher power and greater precision for QTL detection.

# ETHICS STATEMENT

The sampling protocol was previously approved by The Comité de Bioética Animal, Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile (certificate N° 18179-VET-UCH).

# AUTHOR CONTRIBUTIONS

GY performed the analysis and wrote the initial version of the manuscript. AB contributed with discussion and writing. GC, MC and AJ performed DNA extraction. KC and JL contributed with study design. JY conceived and designed the study; contributed to the analysis, discussion and writing. All authors have reviewed and approved the manuscript.

# FUNDING

This work has been funded by Corfo (project number 14EIAT-28667).

# ACKNOWLEDGMENTS

The authors are grateful to Aquacorporación Internacional and AquaAmerica for providing the Nile tilapia samples. We would like to thank José Soto and Diego Salas from Aquacorporación Internacional and Natalí Kunita and Gabriel Rizzato from AquaAmerica for their kind contribution with Nile tilapia samples from Costa Rica and Brazil, respectively.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00745/full#supplementary-material


**Conflict of Interest Statement:** GY, JPL and KC were hired by a commercial institution (Benchmark Genetics Chile) during the period of the study. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Yoshida, Barria, Correa, Cáceres, Jedlicki, Cadiz, Lhorente and Yáñez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Draft Genome and Complete *Hox*-Cluster Characterization of the Sterlet (*Acipenser ruthenus*)

*Peilin Cheng1,2†, Yu Huang1,3,4†, Hao Du1, Chuangju Li1, Yunyun Lv3,4, Rui Ruan1, Huan Ye1, Chao Bian3, Xinxin You3, Junmin Xu3,5, Xufang Liang2, Qiong Shi3,4\* and Qiwei Wei1\**

### *Edited by:*

*Lior David, Hebrew University of Jerusalem, Israel*

### *Reviewed by:*

*Ron Dirks, ZF-screens BV, Netherlands Jie Mei, Huazhong Agricultural University, China László Orbán, University of Pannonia, Hungary*

### *\*Correspondence:*

*Qiong Shi shiqiong@genomics.cn Qiwei Wei weiqw@yfi.ac.cn*

*†These authors contributed equally to this work*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 12 October 2018 Accepted: 23 July 2019 Published: 05 September 2019*

### *Citation:*

*Cheng P, Huang Y, Du H, Li C, Lv Y, Ruan R, Ye H, Bian C, You X, Xu J, Liang X, Shi Q and Wei Q (2019) Draft Genome and Complete Hox-Cluster Characterization of the Sterlet (Acipenser ruthenus). Front. Genet. 10:776. doi: 10.3389/fgene.2019.00776*

*1 Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture of China, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Wuhan, China, 2 College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University, Wuhan, China, 3 Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, Academy of Marine Sciences, BGI Marine, Shenzhen, China, 4 BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, China, 5 School of Veterinary Medicine, Rakuno Gakuen University, Ebetsu, Japan*

Background: Sturgeons (Chondrostei: Acipenseridae) are a group of "living fossil" fishes at a basal position among Actinopteri. They have raised great public interest due to their special evolutionary position, species conservation challenges, as well as their highly prized eggs (caviar). The sterlet, *Acipenser ruthenus*, is a relatively small-sized sturgeon that is widely distributed in both Europe and Asia. In this study, we performed whole genome sequencing, *de novo* assembly and gene annotation of the sterlet to construct its draft genome.

Findings: We finally obtained a 1.83-Gb genome assembly (BUSCO completeness of 81.6%) from a total of 316.8-Gb raw reads generated by an Illumina Hiseq 2500 platform. The scaffold N50 and contig N50 values reached 191.06 and 18.88 kb, respectively. The sterlet genome was predicted to be comprised of 42.84% repeated sequences and to contain 22,184 protein-coding genes, of which 21,112 (95.17%) have been functionally annotated with at least one hit in public databases. A genetic phylogeny demonstrated that the sterlet is situated in the basal position among ray-finned fishes and 4dTv analysis estimated that a recent whole genome duplication occurred 21.3 million years ago. Moreover, seven *Hox* clusters carrying 68 *Hox* genes were characterized in the sterlet. Phylogeny of *Hox*A clusters in the sterlet and American paddlefish divided these sturgeons into two groups, confirming the independence of each lineage-specific genome duplication in Acipenseridae and Polyodontidae.

Conclusions: This draft genome makes up for the lack of genomic and molecular data of the sterlet and its *Hox* clusters. It also provides a genetic basis for further investigation of lineage-specific genome duplication and the early evolution of ray-finned fishes.

Keywords: sterlet, sturgeon, genome, *hox*, lineage-specific whole genome duplication

# INTRODUCTION

Sturgeons (Acipenseridae, Acipenseriformes) have long been considered an interesting group of fishes due to their commercial value and conservation challenges (Wei et al., 2011). They have also drawn attention because they occupy a basal position on the phylogenetic tree of ray-finned fishes. It is estimated that the origin of sturgeons dates back to approximately 350 million years ago (Mya), which is even earlier than the origins of Holostei (bowfin and gars) and Teleostei (teleosts) (Hughes et al., 2018). Therefore, sturgeons did not experience the teleost-specific genome duplication (TGD) event that happened around 320 Mya (Jaillon et al., 2004). However, there is clear evidence from molecular markers, chromosome numbers and inferred ploidy levels that they have experienced their own lineage-specific polyploidizations with one or more rounds of genome duplication (GD; Crow et al., 2012), resulting in complex genome structures and the widest range of chromosome numbers among all vertebrates (Havelka et al., 2016). Nevertheless, little is known about Acipenseridae-specific GD and its consequences due to a lack of sturgeon genome sequences.

This special whole genome duplication (WGD) event has also provided new genetic material to generate phenotypic diversity among sturgeons. However, sturgeons have quite limited species diversity together with exceedingly fast overall rates of body size evolution, serving as an interesting exception to the phenotypic 'evolvability' hypothesis (Rabosky et al., 2013). As one of the earliest evolved groups among ray-finned fishes, sturgeons still retain many shark-like features such as a cartilaginous skeleton and heterocercal tail, and the extant species look conspicuously similar to their fossil counterparts, suggesting that there has been little body-shape evolution (Rabosky et al., 2013). Therefore, sturgeons represent an ideal evolutionary group to investigate the complicated relationship between phenotypes and the polyploid genomes caused by WGD. Meanwhile, *Hox* genes, which encode a distinct class of transcription factors associated with axial patterning and appendage development, are often among the first genes examined to understand the evolution of vertebrate body plans and novelty (Amemiya et al., 2010; Crow et al., 2012).

The sterlet (*Acipenser ruthenus*, Linnaeus, 1758) is a famous representative of sturgeon species, well-known for its relatively small body size and wide distribution in comparison to other sturgeons. Composed of 120 chromosomes, the sterlet genome has both diploid and tetraploid chromosome segments (Romanenko et al., 2015); however, various chromosomes are unequally involved in the multiple interchromosomal rearrangements after the GD event (Andreyushkova et al., 2017). In this study, we performed whole genome sequencing of the sterlet and generated a draft genome assembly of a sturgeon for the first time. We also constructed a fossil-calibrated phylogenetic tree, estimated the occurrence time of the sturgeon-specific GD (although it is unclear how many members in this family have experienced such an independent lineage-specific GD, considering that this is the first sturgeon with public genome sequences) and retrieved the complete *Hox* clusters to preliminarily reveal the early evolutionary history of ray-finned fishes.

# Value of the Data


# MATERIALS AND METHODS

# Sample Processing

The sequenced sterlet (an immature juvenile, about 2.5 years old, 56.8 cm in length, weighing 0.8 kg) was artificially cultured at Taihu Station, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, China. First, we obtained 10 mL of blood from the caudal vertebral vessels (without sacrificing the fish), but the sample was only sufficient for transcriptome sequencing. Subsequently, we had to anesthetize and sacrifice the fish to collect 30 g of skeletal muscle in order to obtain enough DNA for genome sequencing. All vouchers were deposited in China National GeneBank with accession numbers of WH20161125002-MU (muscle) and -BL (blood). All experiments were carried out in accordance with the guidelines of the Animal Ethics Committee of Yangtze River Fisheries Research Institute of Chinese Academy of Fishery Sciences (No. YFI-01).

# Genome Sequencing and Assembly

We applied whole-genome shotgun sequencing to generate short paired-end reads (125 or 150 bp) by constructing a series of short-insert (270, 500, and 800 bp) or long-insert (2, 5, 10, and 20 kb) libraries (**Supplementary Figure 1**) and sequencing on a Hiseq 2500 platform (Illumina, San Diego, CA, USA). Raw reads were subsequently pre-processed by SOAPfilter software (Luo et al., 2012) to trim five bases at the 5' end of all reads and to discard the low-quality reads (quality value <20) and those reads with many non-sequenced bases (N > 10). Subsequently, the 17-mer depth frequency distribution method was employed to estimate the genome size of the sterlet using data from short-insert libraries according to the following formula: genome size = total number of k-mers/peak value of k-mer frequency distribution (Li et al., 2010). Clean reads from all seven libraries were assembled into contigs and scaffolds using SOAPdenovo v2.04 (Luo et al., 2012) with optimized parameters (pregraph -K 41 -d 1; contig -M 3; scaff -F; others as the default). Finally, gaps in the scaffolds were successively filled by using Kgf and GapCloser (Luo et al., 2012) with clean reads from short-insert libraries. Completeness of the final genome assembly and the entire gene set was assessed by BUSCO (Simão et al., 2015).
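The k-mer formula above can be sketched in a few lines. The code below is a toy, in-memory illustration and not the software used in this study: real read sets need a dedicated k-mer counter, and the function names and the `min_depth` cut-off used to skip error k-mers are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer occurrence across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Genome size = total number of k-mers / peak depth of the k-mer distribution.

    kmer_counts maps each distinct k-mer to its depth. Depths below
    min_depth are ignored when locating the peak, because k-mers seen
    only once mostly come from sequencing errors (an assumed cut-off).
    """
    # Histogram: depth -> number of distinct k-mers observed at that depth
    histogram = Counter(kmer_counts.values())
    candidates = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth
```

For instance, 100 distinct k-mers each seen 10 times plus 30 singleton error k-mers give 1,030 k-mers in total and a peak depth of 10, hence an estimate of 103 positions.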

# Repeat-Sequence Prediction and Gene Annotation

A *de novo* repeat library for the sterlet was constructed by a combination of RepeatModeler v1.05 (RepeatModeler, RRID: SCR\_015027) and LTR\_FINDER v1.0.6 (Xu and Wang, 2007). Known and *de novo* transposable elements (TEs) in the assembled genome were identified by RepeatMasker v4.0.6 (RepeatMasker, RRID: SCR\_012954) using both the RepBase v21.01 (Jurka et al., 2005) and the *de novo* repeat library. RepeatProteinMask v3.3.0 (Chen, 2004) was then used to identify the TE-relevant proteins. Meanwhile, tandem repeats were predicted by using Tandem Repeats Finder (TRF) v4.07b (Benson, 1999), and Tandem Repeats Analysis Program (Sobreira et al., 2006) was used to select candidate microsatellite markers from the TRF output.

Gene models in the sterlet genome were predicted by an integrated strategy of three methods. For homology annotation, we downloaded published protein sequences of ten representative vertebrates including zebrafish (*Danio rerio*), spotted gar (*Lepisosteus oculatus*), elephant shark (*Callorhinchus milii*), sea lamprey (*Petromyzon marinus*), medaka (*Oryzias latipes*), Nile tilapia (*Oreochromis niloticus*), three-spined stickleback (*Gasterosteus aculeatus*), Atlantic cod (*Gadus morhua*), fugu (*Takifugu rubripes*) and spotted green pufferfish (*Tetraodon nigroviridis*), and aligned them against the assembly of the sterlet genome using BLAST (Altschul et al., 1990) with tblastn mode and an e-value of 1e-5. SOLAR (Yu et al., 2006) was subsequently employed to select the best hit for each alignment. For *ab initio* prediction, the sterlet genome assembly was masked according to the previously identified repeated sequences and was then scanned using AUGUSTUS v3.2.3 (Stanke et al., 2006) and GENSCAN v1.0 (Burge and Karlin, 1997) to predict gene structures. For transcriptome-based annotation, we sequenced a blood transcriptome on a Hiseq X10 platform (Illumina), mapped the reads to the genome scaffolds using TopHat v2.0.13 (Trapnell et al., 2009) and assembled them into transcripts using Cufflinks v2.2.1 (Trapnell et al., 2010). Finally, all predicted genes from these three methods were merged and filtered by GLEAN v1.1 (Elsik et al., 2007) to create a consensus gene set.

Gene functional annotation of the sterlet genome was first performed by aligning all the protein sequences produced by GLEAN against public databases including Swiss-Prot, TrEMBL (Boeckmann et al., 2003) and KEGG (Kanehisa et al., 2016) using BLASTP v2.3.0+ (Altschul et al., 1990) with an e-value of 1e-5. Subsequently, motifs and domains were annotated using InterProScan (Hunter et al., 2008) by searching PANTHER (Thomas et al., 2003), Pfam (Finn et al., 2013), PRINTS (Attwood, 2002), ProDom (Bru et al., 2005) and SMART (Letunic et al., 2004) databases. Finally, InterProScan (Hunter et al., 2008) was applied to assign Gene Ontology (GO) terms and conduct a GO enrichment analysis (Ashburner et al., 2000).

# Fossil-Calibrated Phylogenetic Analysis

To perform a phylogenetic analysis of the sterlet, we obtained the predicted coding sequences (CDS) from the sterlet and 14 other vertebrates, including Asian arowana (*Scleropages formosus*), coelacanth (*Latimeria menadoensis*), common carp (*Cyprinus carpio*) and Atlantic salmon (*Salmo salar*) as well as the ten species used for homology gene annotation, and used the sea lamprey as the outgroup. BLAST with blastp mode and an e-value of 1e-5 was used to build the super similarity matrix, followed by OrthoMCL (Li et al., 2003) to distinguish gene families. One-to-one orthologues were identified by Markov Chain Clustering (MCL) and were aligned by MUSCLE v3.7 (Edgar, 2004). The first nucleotide of each codon was chosen to construct a Maximum-likelihood (ML) tree using PhyML v3.0 (Guindon et al., 2010) with gamma distribution across aligned sites and HKY85 substitution model. Branch supports were evaluated by approximate likelihood ratio test (aLRT). Meanwhile, we also conducted Bayesian inference (BI) independently using MrBayes v3.2.2 (Ronquist et al., 2012) to confirm the topology deduced from ML. In total, we ran 100,000 generations and sampled every 100 generations. The initial 20% of the samples were regarded as unreliable and were discarded. The rest of the samples were used to estimate the branch supports. The divergence time of the sterlet from other vertebrates was estimated by Bayesian method using MCMCtree in PAML v4.9 (Yang, 2007) with two fossil calibrations, which are *Latimeria* (Sarcopterygii, 408.0 ~ 427.9 Mya) and *Danio* (Teleostei, 151.2 ~ 252.7 Mya; Hughes et al., 2018).
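As an illustration of the codon-position step described above, the sketch below extracts the first nucleotide of each codon from codon-aligned CDS. Joining the orthologues into one sequence per species is an assumption made for the example, since the text does not state how the per-gene alignments were combined, and both function names are hypothetical.

```python
def first_codon_positions(aligned_cds):
    """Return the first nucleotide of every codon in a codon-aligned CDS."""
    if len(aligned_cds) % 3 != 0:
        raise ValueError("aligned CDS length must be a multiple of 3")
    return aligned_cds[0::3]


def concatenate_first_positions(alignments):
    """Join first-codon-position sites across orthologue alignments.

    alignments: list of dicts mapping species name -> codon-aligned CDS.
    Only species present in every alignment are kept. Concatenation
    itself is an assumption of this sketch, not a documented step.
    """
    if not alignments:
        return {}
    shared = set(alignments[0])
    for alignment in alignments[1:]:
        shared &= set(alignment)
    return {
        species: "".join(first_codon_positions(a[species]) for a in alignments)
        for species in sorted(shared)
    }
```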

# 4dTv Analysis to Determine the Sturgeon-Specific Genome Duplication

We performed 4-fold degenerate third-codon transversion (4dTv) analysis to test the sturgeon-specific GD by comparing the sterlet genome to the Asian arowana genome. Protein sequences from the two genomes were first aligned using all-to-all BLAST with blastp mode and an e-value of 1e-5. Subsequently, syntenic regions between sterlet-sterlet, arowana-arowana and sterlet-arowana were identified by MCscan v0.8 (Wang et al., 2012) with default parameters. Homologous protein sequences from these syntenic regions were retrieved and converted to CDS for alignment by MUSCLE (Edgar, 2004). Lastly, 4dTv values were calculated and corrected with the HKY model in PAML package (Yang, 2007).
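The quantity behind a 4dTv analysis can be sketched as follows: among aligned codon pairs that share a four-fold degenerate prefix under the standard genetic code, count the fraction whose third positions differ by a transversion. This is a minimal, uncorrected illustration and not the PAML-based calculation used in this study; the HKY correction is omitted and the function name `raw_4dtv` is hypothetical.

```python
# Codon prefixes whose third position is four-fold degenerate (standard genetic code)
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a, cds_b):
    """Return the uncorrected 4dTv distance between two codon-aligned CDS."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    sites = 0
    transversions = 0
    for start in range(0, len(cds_a), 3):
        codon_a = cds_a[start:start + 3].upper()
        codon_b = cds_b[start:start + 3].upper()
        # Keep only codon pairs sharing the same four-fold degenerate prefix
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        third_a, third_b = codon_a[2], codon_b[2]
        # Skip gaps and ambiguous bases at the third position
        if third_a not in "ACGT" or third_b not in "ACGT":
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa
        if (third_a in PURINES) != (third_b in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")
```

Plotting the distribution of such values over all syntenic gene pairs is what produces the peaks interpreted as duplication or speciation events.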

# *Hox*-Cluster Identification and Phylogenetic Analysis

Reference protein sequences of complete *Hox*A cluster and partial *Hox*D cluster of American paddlefish (*Polyodon spathula*) (Crow et al., 2012) were downloaded from National Center of Biotechnology Information (NCBI). Sequences of four complete *Hox* clusters of the Indonesian coelacanth (Amemiya et al., 2010) and spotted gar (Braasch et al., 2015) were downloaded from Ensembl. The protein sequences were first aligned to the sterlet genome assembly by BLAST (Altschul et al., 1990) with tblastn mode and the hit sequences were further analyzed by Exonerate software (Slater and Birney, 2005) to extract exons. *Hox* gene order and synteny were finally determined by aligning back to the genome assembly and the best hits were selected by SOLAR (Yu et al., 2006). The *Hox*A clusters from the sterlet and paddlefish, as well as *HoxA9* genes from ten vertebrates were separately aligned with MEGA v7.0.26 (Kumar et al., 2016) followed by construction of a ML phylogenetic tree.

# RESULTS AND DISCUSSION

# Summary of the Genome Sequencing and Assembly

We generated 316.8 Gb of paired-end raw reads (**Supplementary Table 1**) to assemble the draft genome of the sterlet. After filtering low-quality sequences, the remaining clean reads amounted to about 248.4 Gb (**Supplementary Table 1**). The haploid genome size of the sterlet was estimated (**Supplementary Figure 2**) by a k-mer analysis (Li et al., 2010). Using all the clean reads, we produced a final genome assembly of 1.83 Gb, which is quite close to the 1.87 Gb previously reported by flow cytometry (Birstein et al., 1993). The draft assembly had a contig N50 of 18.88 kb and a scaffold N50 of 191.06 kb (**Table 1**).
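For readers unfamiliar with the N50 statistic reported above, the sketch below shows the usual definition: the length of the shortest sequence among the longest sequences that together cover at least half of the assembly. The function name is illustrative and the lengths in the example are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is N50
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up lengths, in kb
    print(n50([5, 4, 3, 2, 1]))
```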

Accordingly, the genome sequencing depth for the sterlet reached 132-fold based on the final 1.83-Gb assembly, and as much as 87.19% of the bases had a sequencing depth of over 20-fold (**Supplementary Figure 3**). The total completeness of the assembly was estimated to be 81.6% by evaluation with BUSCO, including 51.9% complete and single-copy BUSCOs and another 29.7% duplicated BUSCOs. A total of 4,584 genes were searched and 302 (6.6%) of them were fragmented BUSCOs (**Supplementary Table 2**). Together with the homogeneous GC distribution of the scaffolds (**Supplementary Figure 4**), these results led us to conclude that our draft assembly of the sterlet genome was suitable for further analyses.

# A Relatively High Content of Repetitive Elements

We performed repeat annotation and identified a total of 784 Mb (42.84%) of repeated sequences in the sterlet genome assembly, including 726 Mb (39.68%) of transposable elements (TEs) and 79 Mb (4.34%) of tandem repeats (**Supplementary Table 3**). These data are consistent with the dominant sub-peak ideally located at 2-fold the position of the main k-mer peak (**Supplementary Figure 2**). This repeat content was higher than those of the majority of the published fish genomes, which usually contain no more than 40% repeats (Yuan et al., 2018). Interestingly, more class I (28.95%) than class II (14.93%) TEs were found in the sterlet genome (**Supplementary Table 4**), which resembled the pattern of cartilaginous species (Yuan et al., 2018). In addition, as a potamodromous species dwelling mainly in freshwater, the sterlet had a relatively high DNA/TcMar-Tc1 proportion (16.58% for 130 Mb) but a relatively low microsatellite proportion (2.10% for 16 Mb) (**Supplementary Table 5**), a pattern typical of freshwater species (**Supplementary Figure 5;** Yuan et al., 2018).

Furthermore, we identified 318 copies of *Tana1*, a new putative active *Tc1*-like transposable element (Pujolar et al., 2013) that was not included in the repeat annotation library (Romanenko et al., 2015). Our results showed that 299 of the predicted *Tana1* copies contain full-length transposases. Interestingly, the majority of these *Tana1* copies did not have internal stop codon(s) as determined in a previous study (Pujolar et al., 2013), suggesting that this element is more likely to be active. The 299 complete *Tana1* genes were from 250 different scaffolds, with an average of 1.19 genes in each scaffold. Sequences and gene locations of the identified *Tana1* are publicly available in figshare with an accession ID of doi: 10.6084/m9.figshare.8289881.

We then calculated the number of repeats that were co-localized with the protein coding genes after gene annotation to estimate their potential functions. Our results showed that a total of 34,987 repeats (14.23 Mb in length, accounting for 1.82% of all repeats) were co-localized with 10,460 protein coding genes, among which LINE/CR1, DNA/TcMar-Tc1 and LINE/L2 were the most abundant types (**Supplementary Data Sheet 2**). The GO enrichment analysis revealed that these repeats were enriched into 52 terms. Cellular process, binding, single-organism process, metabolic process and biological regulation were the top five enriched ones (**Supplementary Figure 6**), indicating that these repeats may participate in such biological processes.

However, the distribution and location of these repeats and annotated genes on chromosomes are still awaiting identification with assistance of on-going PacBio sequencing. It seems that repetitive DNA sequences have a tendency to cluster in specific regions, such as in pericentromeric, centromeric and telomeric regions (Biltueva et al., 2017). The potential roles of repetitive sequences in chromosomal rearrangements will also be much clearer, once a chromosome-level genome assembly is available for the sterlet.

# Statistics of Gene Annotation and Phylogenetic Analysis

After masking the abundant repeats in scaffolds, we annotated 22,184 protein-coding genes with an average gene length of 21 kb using a combined strategy of *ab initio*, homology-based and transcriptome-based annotation. This predicted gene number for the sterlet genome seems to be lower than expected, possibly due to missing data and many gaps in the draft assembly. In addition, the repetitive sequences and complex polyploidy (Romanenko et al., 2015) make it more difficult to produce a fine assembly and to predict a complete gene set. Our BUSCO analysis of the gene set showed that complete and fragmented BUSCOs accounted for 73.2% of the searched genes, and 26.8% were missing BUSCOs (**Supplementary Table 2**); we therefore infer that the total gene number of the sterlet could reach 28,136 (with the addition of the missing BUSCOs), which is more than that of a diploid fish but less than a tetraploid species when taking the partial tetraploidy into consideration. Statistics of the gene list are provided in **Supplementary Table 6**. Length distributions of the predicted genes, CDS, exons and introns were comparable to those of spotted gar, elephant shark and many other fishes (**Supplementary Figure 7**). Of all these genes, a total of 21,112 genes (95.17%) were functionally annotated in at least one public database (find more details in **Supplementary Table 7**).

Afterwards, the predicted CDS along with whole-genome CDS from the other 14 examined vertebrates were clustered into gene families to determine 198 single-copy consensus orthologues from these genomes (**Supplementary Table 8**; **Supplementary Figure 8**), which were used to generate the phylogenetic topology by ML (**Supplementary Figure 9**) or BI (**Supplementary Figure 10**). The two methods produced identical phylogenetic topologies with high branch support values, suggesting that this phylogenetic hypothesis was well supported (**Figure 1A**). Our tree confirms the results of others (Hughes et al., 2018; Peng et al., 2007) that the sterlet is located at a basal position of Actinopterygii, as a sister group to all other ray-finned fishes. This phylogeny, based on numerous single-copy genes, thus confirms the very basal position of the sterlet reported by other studies. Fossil calibrations date the origin of the sterlet back to 358 Mya (**Figure 1A**), with a 95% confidence interval of 316~394 Mya (**Supplementary Figure 11**). These data are consistent with our previous comprehensive phylogeny analysis (Hughes et al., 2018), and most interestingly, this date is extremely close to the Late Devonian Extinction that happened around 358.9 Mya (McGhee et al., 1984).

# Identification of an Independent WGD Event that Occurred Recently in the Sterlet

Sturgeons did not experience the TGD event (Ravi and Venkatesh, 2018), but there is clear evidence of a sturgeon-specific GD event (Havelka et al., 2016). In order to identify this lineage-specific GD in the sterlet, we performed a 4dTv analysis along with Asian arowana (Bian et al., 2016), which had experienced the TGD event around 320 Mya (Jaillon et al., 2004). Our analysis displayed distinct peaks in each group of sterlet-sterlet (sturgeon-specific GD), arowana-arowana (TGD) and sterlet-arowana (speciation event), and the synonymous transversion rates (Ks values) were estimated to be 0.03 and 0.45 in the sterlet and Asian arowana, respectively (**Figure 1B**). Hence, the sturgeon-specific GD was estimated to have occurred about 21.3 Mya ([320 Mya/0.45]\*0.03), long after the evolutionary splitting between the sturgeon and paddlefish (184 Mya; Peng et al., 2007). It therefore appears that sturgeons (Acipenseridae) and paddlefish (Polyodontidae) experienced polyploidization events independently.
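The dating arithmetic in this paragraph can be reproduced with a short sketch. It assumes, as the calculation in the text does, that 4dTv distance accumulates at one constant rate in both lineages; the function and variable names are illustrative and not part of the original analysis.

```python
def date_duplication(ref_age_mya, ref_peak, target_peak):
    """Scale a calibrated event age by the ratio of two 4dTv peak positions.

    ref_age_mya : age of the calibration event (here the TGD, 320 Mya)
    ref_peak    : 4dTv peak of the calibration event (arowana-arowana, 0.45)
    target_peak : 4dTv peak of the event to date (sterlet-sterlet, 0.03)
    """
    rate = ref_age_mya / ref_peak  # Mya per unit of 4dTv distance
    return rate * target_peak


sterlet_gd_mya = date_duplication(320.0, 0.45, 0.03)
print(round(sterlet_gd_mya, 1))  # 21.3
```

Because the estimate scales linearly with the peak position, a small shift in the sterlet peak (0.03) changes the inferred date proportionally, so the value is best read as an approximation.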

# Characterization of the Complete *Hox* Clusters

To provide additional insights into polyploidy of the genome at the gene level after the sturgeon-specific GD event, we investigated *Hox* gene clusters in the sterlet genome. We identified seven *Hox* clusters including 68 *Hox* genes (60 intact and 8 partial/pseudo genes) in the draft assembly (**Figure 1C**, **Supplementary Data Sheet 3**). This *Hox* complement appears to be a consequence of the sturgeon-specific GD, since only four *Hox* clusters were identified in sea lamprey (43 genes), elephant shark (47 genes) and spotted gar (43 genes; Venkatesh et al., 2014). Interestingly, the possible absence of a whole *Hox*C cluster in the sterlet is similar to that in some diploid teleosts such as fugu, medaka and stickleback (Pascual-Anaya et al., 2013). Furthermore, our *Hox*A-based genealogy showed that, contrary to the *Hox* pattern in teleosts after TGD (**Supplementary Figure 12**), *Hox*A clusters from the sterlet and paddlefish formed two separate groups (**Supplementary Figure 12**), which indicates that *Hox* genes duplicated independently after the divergence of the two families. This confirms the independence of lineage-specific GDs in the sterlet and paddlefish, which is consistent with our above-mentioned prediction by 4dTv.

However, whether this WGD is sturgeon-specific or shared by all members of the Acipenseridae family awaits genome sequencing of more sturgeon species. Furthermore, the present research on a complete gene-chromosome pattern of the sterlet genome is still preliminary, but this work and a previous report of sequencing 15 chromosome-specific libraries (Andreyushkova et al., 2017) provide some novel insights. We attempted to map our assembly to the spotted gar chromosomes, but the results were difficult to interpret, possibly due to the non-full-length assembly of our current draft genome, the great complexity of the sterlet chromosomes, and high sequence divergence between the two fish species. Therefore, based on our current knowledge of the sterlet genome (Romanenko et al., 2015; Andreyushkova et al., 2017), a chromosome-level assembly needs to be generated, with the assistance of long-read sequencing and chromatin conformation capture technology, for a better understanding of the complicated structure and evolutionary pattern of the sterlet genome.

FIGURE 1 | The sterlet takes the most basal position at the phylogeny of ray-finned fish and evolved seven *Hox* clusters after the lineage-specific whole genome duplication. (A) The fossil-calibrated phylogenetic tree of 15 examined vertebrates including the sterlet. The phylogenetic topology was deduced by both the ML and BI methods. TGD, teleost-specific GD; CGD, carp GD; SaGD, salmonid GD. (B) A 4dTv comparison between Asian arowana and the sterlet. (C) Presence of *Hox* clusters in elephant shark (Venkatesh et al., 2014), sterlet (this study), spotted gar (Braasch et al., 2015), zebrafish (Bian et al., 2016), Atlantic salmon (Lien et al., 2016) and fugu (Bian et al., 2016). Each black line refers to a *Hox* cluster. Solid circles represent complete *Hox*A (green), *Hox*B (pink), *Hox*C (blue) and *Hox*D (orange) genes, while hollow circles stand for pseudo or partial genes. Paralogs generated by TGD were labeled with a and b, whereas paralogs produced by lineage-specific GD were named by α and β.

# DATA AVAILABILITY

The datasets generated for this study can be found in the NCBI with accession PRJNA491785, SRR8371834 ~ SRR837184.

# ETHICS STATEMENT

All experiments in the present study were carried out in accordance with the guidelines of the Animal Ethics Committee of Yangtze River Fisheries Research Institute of Chinese Academy of Fishery Sciences (No. YFI-01).

# AUTHOR CONTRIBUTIONS

QW, HD, CL, JX, and QS conceived and designed the project. YH, PC, YL, and CB analyzed the data. CL, RR, HY, and XY collected and processed the samples. PC, YH, and QS wrote the manuscript. QS, XL, and QW revised the manuscript. All authors have read and approved the final manuscript and declared no competing interests.

# FUNDING

The study was supported by the National Natural Science Foundation of China (grant number NSFC 31772854), China Postdoctoral Science Foundation (grant number 2017M622560), the National Program on Key Basic Research Project (973 Program, 2015CB15072), Hubei Postdoctoral Innovation Post Project (No. 2017C08), Shenzhen Special Program for Development of Emerging Strategic Industries (No. JSGG20170412153411369) and Office of Fisheries Supervision and Management for the Yangtze River Basin, MARA, PRC (No. 171821301354051046).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00776/full#supplementary-material.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer JM declared a shared affiliation, with no collaboration, with several of the authors, PC, XL, to the handling editor at the time of review.

*Copyright © 2019 Cheng, Huang, Du, Li, Lv, Ruan, Ye, Bian, You, Xu, Liang, Shi and Wei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Differences in DNA Methylation Between Disease-Resistant and Disease-Susceptible Chinese Tongue Sole *(Cynoglossus semilaevis)* Families

*Yunji Xiu1,2,3†, Changwei Shao1,2†, Ying Zhu1,2†, Yangzhen Li1,2, Tian Gan1,2, Wenteng Xu1,2, Francesc Piferrer 4 and Songlin Chen1,2\**

*1 Key Lab of Sustainable Development of Marine Fisheries, Ministry of Agriculture; Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, China, 2 Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China, 3 School of Marine Science and Engineering, Qingdao Agricultural University, Qingdao, China, 4 Institut de Ciències del Mar (ICM), Spanish National Research Council (CSIC), Barcelona, Spain*

### *Edited by:*

*Paulino Martínez, University of Santiago de Compostela, Spain*

### *Reviewed by:*

*Jun Hong Xia, Sun Yat-sen University, China Paloma Morán, University of Vigo, Spain*

> *\*Correspondence: Songlin Chen chensl@ysfri.ac.cn*

*†These authors have contributed equally to this work*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 22 February 2019 Accepted: 14 August 2019 Published: 13 September 2019*

### *Citation:*

*Xiu Y, Shao C, Zhu Y, Li Y, Gan T, Xu W, Piferrer F and Chen S (2019) Differences in DNA Methylation Between Disease-Resistant and Disease-Susceptible Chinese Tongue Sole (Cynoglossus semilaevis) Families. Front. Genet. 10:847. doi: 10.3389/fgene.2019.00847*

DNA methylation, the most widely studied and best-understood epigenetic modification, has been reported to play crucial roles in diverse processes. Although it has been found that DNA methylation can modulate the expression of immune-related genes in teleosts, a systematic analysis of epigenetic regulation of teleost immunity has rarely been performed. In this research, we employed whole-genome bisulfite sequencing to investigate the genome-wide DNA methylation profiles in selected disease-resistant *Cynoglossus semilaevis* (DR-CS, family 14L006) and disease-susceptible *C. semilaevis* (DS-CS, family 14L104) against *Vibrio harveyi* infection. The results showed that following selective breeding, DR-CS had higher DNA methylation levels and different DNA methylation patterns, with 3,311 differentially methylated regions and 6,456 differentially methylated genes. Combining these data with the corresponding transcriptome data, we identified several immune-related genes that exhibited differential expression levels that were modulated by DNA methylation. Specifically, DNA methylation of tumor necrosis factor–like and lipopolysaccharide-binding protein-like was significantly correlated with their expression and significantly contributed to the disease resistance of the selected *C. semilaevis* family. In conclusion, we suggest that artificial selection for disease resistance in Chinese tongue sole causes changes in DNA methylation levels in important immune-related genes and that these epigenetic changes are potentially involved in multiple immune responses in Chinese tongue sole.

Keywords: DNA methylation, whole-genome bisulfite sequencing, *Cynoglossus semilaevis*, disease-resistant, disease-susceptible

# INTRODUCTION

Epigenetic modifications, which are influenced by external environmental factors and predetermined inherited programs, induce changes in gene activity without altering the underlying DNA sequence (Jablonka and Lamb, 2002; Long et al., 2014). DNA methylation, the most widely studied and best-understood type of epigenetic modification, has been reported to play crucial roles in diverse processes such as X chromosome inactivation, embryogenesis, genomic imprinting, transposon silencing, and the onset of diseases (Yang et al., 2016). DNA methylation is performed by a set of enzymes called DNA methyltransferases, in which a methyl group (CH3) is added to position 5 of the pyrimidine ring of a cytosine (5mC) (Bestor, 1990; Goll and Bestor, 2005). In most animals, the majority of methylated cytosines occur at CpG dinucleotides, whereas in plants and fungi, a large fraction of DNA methylation also occurs at CHG or CHH (where H = A, C, or T) (Feng et al., 2010; Junhyun et al., 2015).

In recent years, whole-genome analysis of DNA methylation has become an effective approach for researching human diseases (Zong et al., 2015) and will provide potential theoretical support and new targets for our study. DNA methylation is also increasingly recognized as prominent in diverse immune processes (Teitell and Richardson, 2003). Epigenetic changes, which mainly alter DNA methylation profiles, have been implicated in various types of cancers. Hypermethylation of CpG islands in the promoter region results in transcriptional silencing of tumor suppressor genes, whereas hypomethylation leads to oncogene activation in many cancers (Baylin and Jones, 2011; Esteller, 2011; Rodríguez-Paredes and Esteller, 2011; Chatterjee and Vinson, 2012). Methylation has important functions in tumor initiation and progression, and changes in methylation have been used as potential biomarkers for the early detection of cancers (Meng et al., 2012; Farkas et al., 2013; Guo et al., 2015; Vedeld et al., 2015).

In fish, genome-wide DNA methylation studies have been conducted to uncover the epigenetic effects on muscular polyunsaturated fatty acid metabolism of the common carp (*Cyprinus carpio*) (Zhang et al., 2019a), skin color variations in crucian carp [*Carassius carassius* L. (Zhang et al., 2017b) and *C. carpio* (Li et al., 2015)], behavioral effects in zebrafish (*Danio rerio*) (Olsvik et al., 2019), the evolution of heteromorphic sex chromosomes of three-spine stickleback (*Gasterosteus aculeatus*) (Metzger and Schulte, 2018), sexual dimorphism of hybrid tilapia (*Oreochromis* spp.) (Wan et al., 2016), sex determination of *C. semilaevis* (Shao et al., 2014), growth of large yellow croaker (*Larimichthys crocea*) (Zhang et al., 2019b), and thermal acclimation of *G. aculeatus* (Metzger and Schulte, 2017), among others. DNA methylation also functions in fish immune responses against diseases. For example, the loss of the regulator ubiquitin-like protein containing PHD and RING finger domains 1 (uhrf1) leads to tumor necrosis factor α (*TNF*-α) promoter hypomethylation and *TNF*-α activation. The changes in *TNF*-α expression promote cell shedding, a rapid loss of intestinal barrier function, and the recruitment of a series of immune cells (Marjoram et al., 2015). It has also been demonstrated that DNA methylation in *Ctenopharyngodon idella* was highly correlated with resistance against grass carp reovirus, probably due to the negative modulation of antiviral transcription (Shang et al., 2015; Shang et al., 2016).

Chinese tongue sole (*Cynoglossus semilaevis*) is a commercially valuable flatfish in China. However, the development of *C. semilaevis* aquaculture has been severely threatened by outbreaks of several bacterial and viral diseases (Zhang et al., 2015). Fortunately, the publication of the *C. semilaevis* genome has laid a very important foundation for genome-wide methylation research (Chen et al., 2014). In previous studies, disease-resistant families against *Vibrio harveyi* infections were developed and bred. Subsequent challenge experiments found these fish had significantly higher survival (Chen et al., 2010). It was verified that different *C. semilaevis* families (resistant families vs. nonresistant families) showed obvious genetic variations in MHC IIB, which is a candidate molecular marker for resistance/susceptibility to various diseases (Niu et al., 2015). Further study is required to establish whether fish immunity is influenced by DNA methylation and to determine which genes have a role in this process. Here, we addressed these questions using a comprehensive analysis of the whole-genome DNA methylome and transcriptome of Chinese tongue sole immune tissues (liver, spleen, and kidney), which were compared between disease-resistant and disease-susceptible families. Our study revealed that significant differences exist between these two families and that DNA (de)methylation processes may play critical roles in certain immune response pathways.

# MATERIALS AND METHODS

# Ethics Statement

The collection and handling of the animals in the study was approved by the Chinese Academy of Fishery Sciences' animal care and use committee, and all experimental animal protocols were carried out in accordance with the guidelines for the care and use of laboratory animals at the Chinese Academy of Fishery Sciences.

# Sample Collection

Disease-resistant and disease-susceptible families of Chinese tongue sole were established by our research group from 2014 to 2015. The family establishment method is described in Chen et al. (2010). Briefly, cultured male and female populations were used as basic populations. All of the established families (1 year old) were chosen for challenge experiments against *V. harveyi*, which showed that the 14L006 family had strong disease resistance, with a final survival rate of 93.46%; the 14L104 family had low disease resistance, with a final survival rate of 9.15%. We also assessed some growth-related traits, and the *T* test showed no significant differences between the 14L006 and 14L104 families in body weight (*P* = 0.140) or length (*P* = 0.704). For each family (14L006 and 14L104), three nonchallenged fish were anesthetized by using MS-222. Then, three immune-related tissues, including the liver, spleen, and kidney of each fish, were isolated and stored at −80°C until DNA or RNA extraction. The experimental fish were approximately 1.5 years old with an average length of 22.7 ± 3.2 cm and an average weight of 81.8 ± 5.2 g. The fish were acclimatized for 7 days before the experiments. The animals were collected and handled in accordance with the guidelines for the care and use of laboratory animals at the Chinese Academy of Fishery Sciences.

# Whole-Genome Bisulfite Sequencing

Genomic DNA was extracted from each tissue using a DNeasy Blood & Tissue Kit (Qiagen GmbH, Hilden, Germany) according to the manufacturer's recommendations. DNA purity was monitored on agarose gels. The DNA from nine immune-related tissues of the same family was pooled equally. DNA libraries were prepared following a previously described method (Zhang et al., 2017a). For library construction, a total of 5.2 µg mixed DNA spiked with 26 ng lambda DNA was fragmented into 200 to 300 bp by sonication, followed by terminal repairing and adenylation–ligation. Then, sonicated DNAs from different samples were ligated with different cytosine-methylated barcodes. The DNA fragments were treated twice with bisulfite using the EZ DNA Methylation-Gold™ Kit (Zymo Research), and the resulting single-strand DNA fragments were amplified using the KAPA HiFi HotStart Uracil + ReadyMix (2X) (Kapa Biosystems, Wilmington, MA, USA). The concentration of the library was quantified by using a Qubit® 2.0 Fluorometer. Then, an Agilent Bioanalyzer 2100 system was applied to assess the insert size. The barcode-ligated samples were clustered by a cBot Cluster Generation System *via* TruSeq PE Cluster Kit v3-cBot-HS, followed by sequencing on an Illumina HiSeq 2500 platform (Novogene Bioinformatics Institute, Beijing, China). Finally, 100-bp paired-end reads were generated after image analysis and base calling with the standard Illumina pipeline.

# Data Analysis

Read sequences produced by the Illumina pipeline in FASTQ format were first preprocessed using in-house Perl scripts with the following steps: (1) remove reads with adaptor; (2) remove reads with the percentage of N (unknown bases) larger than 10%; and (3) remove reads with low quality (PHRED ≤5, percentage of low-quality bases ≥50%). The remaining reads that passed the filters were called clean reads, and all of the subsequent analyses were based on them. The clean reads and reference genome were transformed into bisulfite-converted sequences (C-to-T and G-to-A converted). Then, Bismark software (0.16.3) (Krueger and Andrews, 2011) and the aligner engine of bowtie2 (2.2.5) (Langmead and Salzberg, 2012) were used to perform the alignment of the converted clean reads to the *C. semilaevis* reference genome with the following set of parameters: --score\_min L,0,-0.2 -X 700 --dovetail. The clean reads that produced a unique best alignment from the two alignment processes (original top and bottom strand) were then compared to the normal genomic sequence, and the methylation state of all cytosine positions in the read was inferred by Bismark (bismark\_methylation\_extractor) with the parameters --multicore 4 --paired-end --no\_overlap --ignore 5 --ignore\_r2 5. The sequencing depth and coverage were summarized using reads deduplicated by Bismark (deduplicate\_bismark) with the parameters --paired --samtools\_path. The methylation extractor results were transformed into bigWig format for visualization using the IGV browser. The sodium bisulfite nonconversion rate was calculated as the percentage of cytosines sequenced at cytosine reference positions in the lambda genome.
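The three read-filtering rules listed above can be sketched as a single predicate. This is an illustration only, not the in-house Perl script: the function name, the simple substring test for adaptor contamination, and the example adaptor argument are assumptions made for the sketch.

```python
def is_clean(seq, quals, adapter=None,
             max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Return True if a read passes the three filters described in the text.

    seq     : read bases as a string
    quals   : per-base PHRED scores (integers), same length as seq
    adapter : hypothetical adaptor sequence; exact substring match is an
              assumption, real adaptor trimming is usually more tolerant
    """
    if not seq or len(seq) != len(quals):
        return False
    # (1) remove reads containing adaptor sequence
    if adapter and adapter in seq:
        return False
    # (2) remove reads in which more than 10% of bases are N
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # (3) remove reads in which >= 50% of bases have PHRED <= 5
    low = sum(1 for q in quals if q <= low_q)
    if low / len(quals) >= max_low_frac:
        return False
    return True


print(is_clean("ACGTACGTAC", [30] * 10))            # True
print(is_clean("NNACGTACGT", [30] * 10))            # False (20% N)
print(is_clean("ACGTACGTAC", [2] * 5 + [30] * 5))   # False (50% low quality)
```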

A window size *w* = 3,000 bp and step size of 600 bp (Smallwood et al., 2014) were selected, and the sum of the methylated and unmethylated read counts in each window was calculated. The methylation level (ML) for each CpG site shows the fraction of methylated Cs (mC) and is defined by the following equation: ML = reads(mC)/reads(mC+umC), where umC are the nonmethylated Cs. The calculated ML was further corrected with the bisulfite nonconversion rate as described (Lister et al., 2013).
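As a minimal sketch of these quantities, the code below computes the per-site ML, the lambda-based nonconversion rate, and windowed MLs with the 3,000-bp window and 600-bp step. The correction formula `(ML - r) / (1 - r)` is an assumption (a commonly used form); the source only states that the correction follows Lister et al. (2013). All function names and the toy site counts are hypothetical.

```python
def methylation_level(mc, umc):
    """ML = reads(mC) / reads(mC + umC); None when the site has no coverage."""
    total = mc + umc
    return mc / total if total else None


def nonconversion_rate(lambda_c, lambda_t):
    """Fraction of lambda cytosine positions still read as C after bisulfite."""
    return lambda_c / (lambda_c + lambda_t)


def corrected_ml(ml, r):
    """ASSUMED correction form: subtract the nonconversion rate r and rescale."""
    return max(0.0, (ml - r) / (1.0 - r))


def window_levels(sites, start, end, window=3000, step=600):
    """Sum methylated/unmethylated counts per sliding window, then take ML.

    sites : dict mapping position -> (mC reads, umC reads)
    Returns a list of (window_start, ML or None).
    """
    out = []
    for w_start in range(start, end - window + 1, step):
        mc = umc = 0
        for pos, (m, u) in sites.items():
            if w_start <= pos < w_start + window:
                mc += m
                umc += u
        out.append((w_start, methylation_level(mc, umc)))
    return out


toy_sites = {100: (8, 2), 700: (1, 9), 3500: (5, 5)}  # hypothetical counts
windows = window_levels(toy_sites, 0, 4200)
print(windows)
```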

# Differentially Methylated Region Analysis

Differentially methylated regions (DMRs) were identified using the swDMR software (https://sourceforge.net/projects/swdmr/), which uses a sliding-window approach. The window was set to 1,000 bp and step length to 100 bp, and only windows with at least 10 informative CpGs were considered. Fisher's exact test was used to detect the DMRs. Windows in which a greater than twofold ML change was identified with an adjusted *P* < 0.05 were considered DMRs. Differentially methylated genes (DMGs) were defined as genes containing DMRs in any part of the gene features, where putative promoter regions were designated from −2 kb to the transcription start site (TSS).
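The per-window decision rule can be illustrated as follows. This is not the swDMR implementation: it is a self-contained sketch with a pure-Python two-sided Fisher's exact test, it applies the threshold to the raw *P* value, and it omits the multiple-testing adjustment mentioned in the text. Function names and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are as likely
    as, or less likely than, the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def is_candidate_dmr(mc1, umc1, mc2, umc2, n_cpg,
                     min_cpg=10, fold=2.0, alpha=0.05):
    """Apply the window criteria: >= 10 CpGs, > 2-fold ML change, P < 0.05."""
    if n_cpg < min_cpg or (mc1 + umc1) == 0 or (mc2 + umc2) == 0:
        return False
    ml1, ml2 = mc1 / (mc1 + umc1), mc2 / (mc2 + umc2)
    ml_lo, ml_hi = min(ml1, ml2), max(ml1, ml2)
    if ml_hi == 0:
        return False
    if ml_lo > 0 and ml_hi / ml_lo <= fold:
        return False
    # NOTE: raw P value; the adjustment used in the study is not modelled here
    return fisher_exact_two_sided(mc1, umc1, mc2, umc2) < alpha


print(is_candidate_dmr(90, 10, 30, 70, n_cpg=12))  # True
```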

To check the reliability of the whole-genome bisulfite sequencing (WGBS), DMRs located in genes such as gramd1b, KCNH4, plekhg5, TNF-like, and lipopolysaccharide (LPS)–binding protein-like (LBP-like) were selected for verification of the WGBS data by bisulfite polymerase chain reaction (BS-PCR) analysis for DNA methylation. DNA samples from nine immune-related tissues of the same family were pooled equally, and then the mixed DNA was sodium bisulfite modified using a kit following the manufacturer's instructions. Amplification primers for BS-PCR were designed using MethPrimer design software (http://www.urogene.org/methprimer/) (**Supplementary Table S1**). Polymerase chain reaction was performed with TaKaRa EpiTaq HS (Takara, Japan) following the manufacturer's instructions. The PCR was performed in a volume of 25 µL, containing 2.5 µL 10× EpiTaq PCR buffer, 2.5 µL MgCl2 (25 mM), 3 µL of dNTP Mixture (2.5 mM for each dNTP), 1 µL of each forward and reverse primer (10 µM), 0.15 µL EpiTaq HS DNA Polymerase, 1 µL bisulfite-treated genomic DNA, and 13.85 µL ddH2O. The PCR amplification conditions were as follows: denaturation at 98°C for 3 min, then 35 cycles of 98°C for 10 s, 55°C for 30 s, and 72°C for 30 s, and a final extension at 72°C for 7 min. The amplified products were purified and cloned into the pEASY-T1 vector, and at least 10 clones per fish, tissue, and family were randomly selected for sequencing.
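As a quick arithmetic check, the listed component volumes do add up to the stated 25 µL reaction. The sketch below also scales the recipe to a master mix; the `master_mix` helper and its 10% overage default are illustrative assumptions, not part of the described protocol.

```python
# Per-reaction volumes in microlitres, as listed in the text
BS_PCR_MIX = {
    "10x EpiTaq PCR buffer": 2.5,
    "MgCl2 (25 mM)": 2.5,
    "dNTP mixture (2.5 mM each)": 3.0,
    "forward primer (10 uM)": 1.0,
    "reverse primer (10 uM)": 1.0,
    "EpiTaq HS DNA polymerase": 0.15,
    "bisulfite-treated genomic DNA": 1.0,
    "ddH2O": 13.85,
}

total_volume = round(sum(BS_PCR_MIX.values()), 2)
print(total_volume)  # 25.0


def master_mix(recipe, n_reactions, overage=0.10):
    """Scale each component for n reactions plus a pipetting overage (assumed)."""
    factor = n_reactions * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in recipe.items()}
```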

# RNA-Seq

Total RNA from liver, spleen, and kidney samples was extracted using an EasyPure RNA Kit (TransGen, Beijing, China) according to the manufacturer's instructions. The integrity and quality of the total RNA were determined using an Agilent 2100 NanoDrop and agarose gel electrophoresis. For the RNA-seq libraries, RNA from nine immune-related tissues of the same family was pooled equally. A total of 3 µg RNA per family was used as input material for the RNA sample. Sequencing libraries were generated using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) following the manufacturer's recommendations. Fragments per kilobase of exon per million fragments (FPKM) of each gene were calculated based on the length of the gene and read count mapped to this gene. Prior to differential gene expression analysis, for each sequenced library, the read counts were adjusted by the edgeR program package through one scaling-normalized factor. Differential expression analysis of two conditions was performed using the DEGSeq R package (1.20.0). Transcripts with a *P* < 0.05 were assigned as significantly differentially expressed.
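For reference, the standard FPKM definition can be written in a few lines. This is a generic sketch of the usual formula rather than the exact pipeline used in the study, and the example numbers are hypothetical.

```python
def fpkm(fragments_on_gene, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments.

    FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments)
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragments_on_gene * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 1,000 fragments on a 2-kb gene in a 10-million library
print(fpkm(1000, 2000, 10_000_000))  # 50.0
```

Dividing by both gene length and library size makes values comparable across genes of different length and across libraries of different depth, which is why the text mentions both quantities.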

To check the reliability of the RNA-seq results, some of the differentially expressed genes (DEGs) were randomly selected for quantitative reverse transcriptase (RT)–PCR verification, including thsd7b, plce1, ddr2, c7h6orf58, TNF-like, and LBP-like. The amplification primers for qRT-PCR are shown in **Supplementary Table S1**. RNA from nine immune-related tissues of the same family was pooled equally, and then cDNA was synthesized with PrimeScript™ II 1st Strand cDNA Synthesis Kit (Takara). The qRT-PCR was performed using SYBR® Premix Ex Taq™ II (Tli RNase H Plus) according to the manufacturer's instructions. The PCR program was 95°C for 30 s, followed by 40 cycles of 95°C for 5 s and 60°C for 30 s. All samples were run in triplicate. The relative expression levels were calculated according to the 2^(−ΔΔCt) method. Statistical significance was determined by one-way analysis of variance. The significance was set at *P* < 0.05.
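The relative-expression calculation can be sketched as below. The text does not name the reference gene or state how triplicates were combined, so averaging the Ct values and the variable names are assumptions; the Ct values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method.

    Each argument is a list of replicate Ct values (e.g. three runs).
    'ref' is an internal reference gene (ASSUMED; not named in the text).
    """
    d_ct_sample = mean(ct_target_sample) - mean(ct_ref_sample)
    d_ct_control = mean(ct_target_control) - mean(ct_ref_control)
    dd_ct = d_ct_sample - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical triplicates: target gene in one family relative to the other
fold = relative_expression([22.1, 21.9, 22.0], [18.0, 18.0, 18.0],
                           [24.0, 24.1, 23.9], [18.0, 18.0, 18.0])
print(round(fold, 2))  # 4.0
```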

## Luciferase Reporter Assay

Lipopolysaccharide-binding protein-like promoter regions (−3056 to −1657) containing DMRs were amplified and cloned into the pGL3-basic vector (**Supplementary Table S1**). Whole reporter-gene plasmids were methylated *in vitro* using M.*Sss*I methylase (New England BioLabs) following the manufacturer's protocol. Successful vector methylation was checked by analyzing the band patterns by gel electrophoresis after digestion of the purified plasmids with the *Hpa*II enzyme, which digests only unmethylated DNA.

HEK293T cells were cultured in Dulbecco modified Eagle medium (DMEM) supplemented with 10% (vol/vol) heat-inactivated fetal bovine serum and antibiotics (100 IU/mL penicillin and 100 µg/mL streptomycin). HEK293T cells were grown at 37°C and supplied with 5% CO2. Following overnight culturing in 24-well plates, the cells were transfected using Lipofectamine 2000 (Invitrogen). For transfections, 400 ng/well of methylated or unmethylated plasmids was used, and 40 ng/well pRL-TK plasmid served as an internal control. After 48 h, cells were lysed, and luciferase activity was measured with a dual luciferase reporter gene assay kit (Beyotime) following the manufacturer's instructions. The assay was repeated three times.
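Dual-luciferase readings are typically normalized to the internal control before comparing constructs. The sketch below assumes that convention (firefly signal from the pGL3 construct divided by Renilla signal from pRL-TK, then expressed relative to the unmethylated plasmid); the text does not spell out the normalization, and the readings are hypothetical.

```python
def mean_ratio(wells):
    """Mean firefly/Renilla ratio over replicate wells.

    wells : list of (firefly_signal, renilla_signal) tuples
    """
    return sum(f / r for f, r in wells) / len(wells)


def relative_promoter_activity(methylated_wells, unmethylated_wells):
    """Normalized activity of the methylated construct relative to unmethylated."""
    return mean_ratio(methylated_wells) / mean_ratio(unmethylated_wells)


# Hypothetical luminometer readings from three repeats
unmethylated = [(1000.0, 100.0), (1200.0, 100.0), (800.0, 100.0)]
methylated = [(200.0, 100.0), (300.0, 100.0), (100.0, 100.0)]
print(relative_promoter_activity(methylated, unmethylated))  # 0.2
```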

# RESULTS

## Disease-Associated Methylation Profiles

To obtain the DNA MLs of disease-resistant (DR-CS) and disease-susceptible (DS-CS) families at base-pair resolution, we performed WGBS. For DR-CS and DS-CS, 15.52 and 13.36 Gb of clean bases were produced, with 78.39% and 72.78%, respectively, of genomic cytosines (Cs) being covered by at least five unique reads (**Supplementary Table S2**). These data were deposited in the NCBI SRA database with the accession numbers SRR8447884 and SRR8447885. Among the detected methylation sites, the mC percentage of reference genomic cytosines was 4.44% and 3.38% for DR-CS and DS-CS, respectively; mCG sites accounted for 43.12% and 32.82% of DR-CS and DS-CS; and mCHG and mCHH sites accounted for 0.12% and 0.09% of DR-CS and DS-CS (**Supplementary Table S3**). Among the mC sites we identified, more than 97% were in the mCG dinucleotide context, while non-CG MLs were very low (mCHG 0.6% and mCHH 1.7%) (**Supplementary Table S4**; **Figure 1A**). The methylation status of mCGs in various genomic elements was analyzed, showing that mCGs had slightly lower DNA MLs in promoter elements than in exons and introns; however, mCHG and mCHH MLs were higher in introns (**Figure 1B**; **Supplementary Figure S1**). To obtain an overview of the detected DNA methylation, we examined chromosome-wide MLs, which consists of 20 euchromatin and 2 heterochromatin. Circos analysis visualizing data and information in a circular layout showed that DR-CS and DS-CS had relatively similar mC levels and mC densities in the same chromosome (**Supplementary Figure S2**). A relatively high methylation density of the W chromosome (NC\_024327.1) in both DR-CS and DS-CS was identified, which was probably associated with its high repeat content (**Supplementary Figure S3**). The site preference of CHG and CHH was analyzed, and logo plots showed that they were highly correlated with sequence context, with CAG/CTG the most frequent CHG (**Figure 1C**).

# DMR Between DR-CS and DS-CS

A total of 3,311 DMRs were detected between the two families, including 2,959 hyper-DMRs (hypermethylation in DR-CS compared with DS-CS) and 352 hypo-DMRs (hypomethylation in DR-CS compared with DS-CS) (*P* < 0.05, **Supplementary Table S5**). We identified 6,456 DMGs that harbored DMRs in their promoter, exon, or intron regions, including 4,504 DMGs containing DMRs in their promoter regions (**Supplementary Table S5**). Gene ontology (GO) and directed acyclic graph (DAG) enrichment analyses showed that DMGs were overrepresented for "positive regulation of mating type–specific transcription DNA-templated" among the following biological processes: nitrogen compound metabolic process, gene expression, cellular nitrogen compound metabolic process, cellular aromatic compound metabolic process, nucleobase-containing compound metabolic process, RNA metabolic process, and organic cyclic compound metabolic process (**Supplementary Figure S4**). Differentially methylated genes were also enriched in nucleic acid binding under molecular function, including heterocyclic compound binding and organic cyclic compound binding (**Supplementary Figure S5**). This pattern appears to be related to the regulation of transcription, as positive regulation of mating type–specific transcription and nucleic acid binding are both part of the transcription process.

Cluster analysis of the DMRs was conducted using a heat map, in which two significantly different regions (region1 and region2) were found between the DR-CS and DS-CS families (**Figure 1D**).

DMR-associated genes. The *x* axis is the rich factor, and the *y* axis is the KEGG pathway classification.

Gene ontology enrichment analysis showed that region1 was clustered into cellular response to stimulus and response to stimulus in biological process (**Supplementary Figure S6**), while region2 was clustered into interleukin 4 (IL-4) receptor binding and growth factor receptor binding (**Supplementary Figure S7**). KEGG pathway analysis identified four significantly enriched (*P* < 0.05) and immune-related biological pathways, including the RIG-I–like receptor signaling pathway, mTOR signaling pathway, apoptosis, and MAPK signaling pathway (**Figure 1E**).

To further validate the technical reproducibility of our results, we randomly selected three genes that harbored DMRs (gramd1b, KCNH4, and plekhg5) to perform bisulfite sequencing experiments on the same samples. We observed good consistency between the WGBS and bisulfite sequencing results by Pearson correlation analysis (*r* = 1.000, *P* = 0.002) (**Supplementary Table S1**, **Supplementary Figure S8**).

# Correlations Between Methylation and Gene Expression

To identify the DEGs between the two families and assess the relationship between DNA methylation and gene expression, we measured gene expression profiles by RNA sequencing using the same tissues used for the DNA methylation studies. A total of 117,880,006 (59,143,966 in DR-CS and 58,736,040 in DS-CS) clean reads with a Q20 percentage of 96% were generated and used for the subsequent analysis. In total, transcripts from 467 DEGs were identified, with 239 of them being significantly upregulated and 228 of them being downregulated in the DR-CS compared with the DS-CS family (**Figure 2A**). To assign functional information to the transcripts, upregulated or downregulated DEGs were selected for annotation. Gene ontology annotation indicated that seven downregulated DEGs were annotated to immune response (GO:0006955) (**Figure 2B**), which may provide an explanation for the changes experienced in the key components of the defense mechanism. KEGG pathway analysis identified two significantly enriched (*P* < 0.05) and immune-related biological pathways, including the NOD-like receptor signaling pathway and Toll-like receptor signaling pathway. The transcriptome data were deposited in the NCBI SRA database with accession numbers SRR9009084 and SRR9009085.

FIGURE 2 | Integrated analysis of the genome-wide DNA methylation and gene expression profiles. (A) The number of differentially expressed genes (DEGs) identified by comparing DR-CS and DS-CS. The red dots represent the 239 upregulated genes in DR-CS compared with DS-CS. The green dots represent the 228 downregulated genes in DR-CS compared with DS-CS. (B) Gene ontology enrichment analysis of the downregulated DEGs. The GO enrichment analysis results for the differentially expressed genes are classified into the following three categories: biological process (green histogram), cellular component (orange histogram), and molecular function (blue histogram). The *x* axis is the corresponding number of genes, and the *y* axis is the gene ontology (GO) gene function classification. \* indicates significant enrichment (*P* adjusted value < 0.05). (C) Methylation levels in CG, CHG, and CHH contexts of differentially expressed genes from DR-CS. The DEGs were divided into the following four groups according to their expression levels: none (FPKM < 1), low (1 < FPKM < Q1), medium (Q1 < FPKM < Q3), high (FPKM > Q3). (D) Methylation levels in CG, CHG, and CHH contexts of differentially expressed genes from DS-CS. The DEGs were divided into the following four groups according to their expression level: none (FPKM < 1), low (1 < FPKM < Q1), medium (Q1 < FPKM < Q3), high (FPKM > Q3).

To further validate the technical reproducibility of our results, we randomly selected four DEGs (thsd7b, plce1, ddr2, and c7h6orf58) to perform qRT-PCR experiments on the same samples. Most of the qRT-PCR results were consistent with the transcriptome data, and Pearson correlation analysis identified that RNA-seq showed a moderate correlation with qRT-PCR (*r* = 0.447, *P* = 0.553) (**Supplementary Table S1**, **Supplementary Figure S9**).

To examine the relationships between expression and methylation, genes were sorted into four groups according to their expression levels. We found a general trend of negative associations between expression levels and MLs for CG contexts, but there was no distinct trend for the CHH and CHG contexts (**Figures 2C**, **D**). By comparing the lists of DMGs and DEGs, we identified 59 DEGs with statistically significant methylation variations, including 52 hypermethylated DMRs (DR-CS compared with DS-CS) and 7 hypomethylated DMRs (DR-CS compared with DS-CS). KEGG analysis found three immune-related genes, including TNF-like (XM\_008324037.1), dual specificity phosphatase 2 (dusp2, XM\_008336172.1), and Toll-like receptor 5 (TLR5, XM\_008313329.1), which participate in the Toll-like receptor signaling pathway, NOD-like receptor signaling pathway, RIG-I–like receptor signaling pathway, mTOR signaling pathway, MAPK signaling pathway, adipocytokine signaling pathway, transforming growth factor β signaling pathway, apoptosis, cytokine–cytokine receptor interaction, and herpes simplex infection.
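The four-way grouping by expression level described above (and defined in the Figure 2 legend) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, the quartiles are assumed to be computed over expressed genes only (FPKM ≥ 1), and values falling exactly on Q1 or Q3 are assigned to the medium group, since the legend does not say how boundary values were handled.

```python
from statistics import quantiles


def expression_group(fpkm, q1, q3):
    """Return the expression class for one gene: none, low, medium, or high."""
    if fpkm < 1:
        return "none"
    if fpkm < q1:
        return "low"
    if fpkm <= q3:
        # Assumption: boundary values (FPKM == Q1 or Q3) count as medium.
        return "medium"
    return "high"


def group_genes(fpkm_by_gene):
    """Map each gene ID to its expression class.

    Assumption: Q1 and Q3 are taken over genes with FPKM >= 1 only.
    """
    expressed = [v for v in fpkm_by_gene.values() if v >= 1]
    q1, _median, q3 = quantiles(expressed, n=4)
    return {gene: expression_group(v, q1, q3) for gene, v in fpkm_by_gene.items()}


# Hypothetical FPKM values, for illustration only (not data from the study).
example = {"a": 0.5, "b": 2, "c": 4, "d": 6, "e": 8, "f": 10, "g": 12, "h": 14}
groups = group_genes(example)
```

Mean methylation levels per context (CG, CHG, CHH) could then be averaged within each group to look for the trend reported for the CG context.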

# TNF-Like as an Epigenetic Target Contributes to Disease Resistance

The TNF-like gene was selected for further analysis of the relationship between aberrant DNA methylation and mRNA transcription. Based on BS-PCR analysis, significant differences in DNA methylation were confirmed between the two families (*P* < 0.05) (**Figure 3A**). Furthermore, the DR-CS family showed significantly higher DNA MLs compared with the DS-CS family. The results indicated that 13 of 14 CpG sites present in the TNF-like gene DMR had significantly higher MLs in DR-CS compared with the DS-CS family (*P* < 0.05). The TNF-like mRNA levels were quantified by qRT-PCR in matched tissue samples, and there was significantly higher expression in DS-CS than in the DR-CS family (*P* < 0.05, **Figure 3B**).

represent DMRs. Yellow vertical lines indicate the methylation level of cytosines identified by WGBS. The DMRs were confirmed by BS-PCR, and the filled or open circles indicate methylated or unmethylated CpG sites, respectively. Each row represents one sequenced clone. (B) The relative expression of TNF-like in the DR-CS and DS-CS immune tissues. Different letters a and b indicate significant differences (*P* < 0.05). (C) DNA methylation profiles for tumor necrosis factor (TNF)–like in the liver, spleen and kidney from DR-CS. The filled or open circles indicate methylated or unmethylated CpG sites, respectively, and each row represents one sequenced clone. (D) The relative expression of TNF-like in the liver, spleen, and kidney from DR-CS. Different letters a, b, and c indicate significant differences (*P* < 0.05).

To further determine whether methylation changes in TNF-like affected gene expression, the correlations between DNA methylation and gene expression levels in different tissues (liver, spleen, and kidney) from DR-CS were analyzed. Methylation patterns for these 14 CpG sites for each tissue are shown in **Figure 3C**. The MLs of the CpG1 and CpG12 sites were essentially uniform across the tissues. The highest expression of TNF-like was observed in the liver, with significantly lower levels in the spleen (*P* < 0.05) and the lowest levels in the kidney (*P* < 0.05) (**Figure 3D**). The correlation analysis showed that there was a significant negative correlation between TNF-like MLs and mRNA expression (Pearson *r* = −0.997, *P* < 0.05). Specifically, the MLs of the CpG1 site (Pearson *r* = −0.995, *P* < 0.05) and CpG12 site (Pearson *r* = −0.996, *P* < 0.05) were significantly negatively correlated with mRNA expression.
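The correlation between methylation level and expression across tissues reported above is a standard Pearson coefficient. The sketch below shows how such a coefficient is computed; the function name and the three tissue values are hypothetical and are not taken from the study. With only three tissues, the coefficient rests on three points.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical mean methylation level and relative expression for
# liver, spleen, and kidney (illustrative values only).
methylation = [0.20, 0.50, 0.80]
expression = [3.0, 2.0, 1.0]
r = pearson_r(methylation, expression)
```

A value of `r` close to −1 corresponds to the strong negative association described in the text: tissues with higher methylation show lower expression.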

# LBP-Like as an Epigenetic Target Contributes to Disease Resistance

Differences in DNA MLs between DR-CS and DS-CS for the LBP-like DMR were of similar magnitude to those observed in the WGBS results. By alignment and statistical analysis, MLs of the CpG sites located at 1906984–1907244 were found to be significantly higher in the DS-CS family than in the DR-CS family (*P* < 0.05). All of these 27 CpG sites except CpG2 and CpG23 showed significant differences in MLs between the DR-CS and DS-CS families (*P* < 0.05) (**Figure 4A**).

To further assess whether the methylation of the CpG loci could act as regulatory elements for gene expression, LBP-like mRNA expression from the DR-CS and DS-CS families was examined by qRT-PCR. Although the results were different from the results obtained with RNA-seq, qRT-PCR showed that the average mRNA expression of LBP-like was significantly higher in the DR-CS family (*P* < 0.05, **Figure 4B**), indicating a negative correlation with CpG loci ML.

To study the effects of DNA methylation on LBP-like promoter activity, pGL3-LBP plasmids were methylated by *M.Sss*I methylase *in vitro*. Subsequently, the methylated or unmethylated plasmids were transfected into HEK293T cells, and promoter activity was compared with the negative control (pGL3-Basic plasmid). The results indicated that methylation of the pGL3-LBP plasmids led to a significant repression of promoter activity (*P* < 0.05) (**Figure 4C**).

# DISCUSSION

In this study, we provide the first comprehensive investigation of the DNA methylation and transcriptome relationships underlying the differences between *V. harveyi*-resistant and -susceptible Chinese tongue sole, thus providing insights into the role of epigenetics in the regulation of bony fish immunity. The two families used in this study were derived from long-term selection, and there may be significant differences in other phenotypic traits besides disease resistance, such as fatty acid metabolism, skin color variations, thermal acclimation, and sex ratio. Although in this research we focused on the relationship between methylation and disease resistance, we cannot exclude that differences in DNA methylation between the two families may also be related to other traits.

# DNA Methylation Patterns

Bacterial challenge experiments demonstrated that there were significant differences in survival of *V. harveyi* infection between the DR-CS and DS-CS families. In this study, WGBS was used to explore whether there were differences in DNA methylation between these two families. The results showed that both the mC percentage and the mCG percentage were higher in the DR-CS family than in the DS-CS family. This indicates that DNA MLs were modified during the selective breeding of *C. semilaevis*, implying that DNA methylation may play important roles in regulating functional gene expression associated with resistance traits. Similarly, in the selective breeding of scallop (*Patinopecten yessoensis*), the rate of methylation of polymorphic fragments in "Yubei" was higher than that in the control group (Wu et al., 2015). Genomic DNA MLs are altered by cold stress and inherited across multiple generations in Nile tilapia (Zhu et al., 2013). The average MLs of mCGs (43.12% and 32.82%) in this study were lower than in maternal zebrafish (70%–95%) (Potok et al., 2013) or gonads from *C. semilaevis* (86% average) (Shao et al., 2014). Differences in the MLs of these species may be attributed to the different materials used in the experiments.

Next, the methylation patterns in various functional elements, such as the promoter, exon, and intron, were checked. The results showed that the unmethylated CpG sites were highly enriched in the promoter regions and that DNA MLs were dramatically decreased at the TSS (**Figure 1B**). The pattern of DNA methylation was similar to that observed in zebrafish (Jiang et al., 2013; Potok et al., 2013) and mammals (Molaro et al., 2011). This result is consistent with the "CG content rule" that regions with a high CpG ratio of observed over expected (obs/exp) are unmethylated (Potok et al., 2013). mCHG and mCHH MLs also decreased significantly toward the TSS, which occurs in human embryonic stem cells (Lister et al., 2009). We analyzed the surrounding DNA motifs for their CHG and CHH methylation patterns; the results showed that CHG MLs were higher than CHH levels, and we found more in CAG than in CTG (**Figure 1C**) contexts, which is similar to what has been observed in humans (Lister et al., 2009).
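The observed over expected (obs/exp) CpG ratio referred to above is commonly computed as (number of CpG dinucleotides × sequence length) / (number of C × number of G). The sketch below illustrates this widely used definition; the function name and the example sequences are hypothetical, and the source does not state which exact formula or window size was used.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA sequence.

    Uses the common definition: (CpG count * length) / (C count * G count).
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cpg * len(seq) / (n_c * n_g)


# Hypothetical sequences, for illustration only.
cpg_rich = cpg_obs_exp("CGCGCGCG")   # CpG-dense stretch
cpg_free = cpg_obs_exp("ATATATAT")   # no C or G at all
```

Under the "CG content rule" cited in the text, regions with a high ratio, such as many promoters, would be expected to be largely unmethylated.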

Using WGBS, 3311 regions were identified with significant methylation differences between the DR-CS and DS-CS families. In total, 65% of the DMRs were found within the promoter regions, while 56% and 61% of the DMRs were found within the exon and intron regions, respectively. Overall, these DMRs were widely distributed along the genome. It is widely accepted that DNA methylation has active roles in gene regulation (Li et al., 1992; Li et al., 1993; Shao et al., 2014; Sallustio et al., 2019). Gene ontology and DAG enrichment analysis implied that DNA methylation is involved in the transcription process by regulating a series of biological processes. We observed a remarkable methylation contrast in the immune-related biological pathways between the DR-CS and DS-CS families. In general, our study suggested that the genome-wide methylation patterns in *C. semilaevis* changed after selective breeding in the generation of the DR-CS families.

# Different Transcription Patterns

KEGG pathway analysis identified two significantly enriched (*P* < 0.05) biological pathways related to the immune system, including the NOD-like receptor signaling pathway and the Toll-like receptor signaling pathway. Interestingly, most of the DEGs in the abovementioned signaling pathways had lower expression levels in the DR-CS compared to the DS-CS family, including TNF-α, IL-1β, IL-6, IL-12 and IL-8. Following a previous study in *C. semilaevis*, we compared the HOSG group, which shows obvious symptoms of infection, versus the NOSG group, which shows no obvious symptoms of infection. After challenge with *Vibrio anguillarum*, genes related to the NOD-like receptor signaling and the Toll-like receptor signaling pathways showed differential expression. In the HOSG group, several acute-phase proteins, such as IL-6, IL-1β, ferritin, and HSPs, were significantly upregulated (Zhang et al., 2015). The systemic immune response induced by noninfectious agents is called the systemic inflammatory response syndrome (SIRS), and the infection-induced systemic immune response is called sepsis. The host inflammatory responses are similar between SIRS and sepsis and may lead to multiple-organ dysfunction syndrome and ultimately death (Castellheim et al., 2009). We speculate that excessive inflammatory factors in the DS-CS family may affect organ function and contribute to their lower survival rate.

assays for methylated or unmethylated recombinant plasmids. The *x* axis represents different recombinant plasmids, and the *y* axis represents the relative luciferase activity. Different letters a and b indicate significant differences (*P* < 0.05).

According to previous reports, hypermethylation at the promoter is often associated with gene repression (Li and Zhang, 2014), and the methylation location in intragenic regions is often influenced by the active expression of nearby genes and the regulation of alternative splice variants (Park et al., 2011; Farkas et al., 2013). However, not all genes conform to these rules (Zhang et al., 2017a). Interestingly, in some conditions, such as in human dendritic cells, gene activation precedes DNA demethylation in response to infection (Pacis et al., 2019). In our study, highly expressed genes correlated with lower DNA MLs in different genomic features, and a general trend of negative associations between expression and MLs for CG contexts was established (**Figure 2C, D**).

We identified 59 DEGs with statistically significant methylation variations, including 35 highly expressed genes (DR-CS compared with DS-CS) and 24 genes with lower expression (DR-CS compared with DS-CS), by comparing DMG and DEG gene lists. A close relationship was observed between these DMGs and disease resistance, and KEGG analysis found three immune-related genes. We selected the TNF-like and LBP-like genes for further research.

BLAST analysis showed that TNF-α has been characterized in several bony fish. It was found that TNF-α increases the susceptibility of zebrafish to viral (spring viremia of carp virus) and bacterial infections (*Streptococcus iniae*) (Roca et al., 2008). Similarly, TNF-α is poorly upregulated by immune challenge *in vitro* and *in vivo* in mammals (Laing et al., 2001; García-Castillo et al., 2002), and it weakly induces chemotaxis, respiratory burst, and phagocytosis and showed no response in macrophages (Zou et al., 2003; Grayfer et al., 2008). Based on these results, we speculated that the higher expression of TNF-like contributed to the susceptibility of DS-CS to *V. harveyi*. Furthermore, our correlation analysis between DNA methylation and gene expression of TNF-like showed that hypermethylation of the TNF-like promoter was associated with low expression of TNF-like mRNA. Previous research in zebrafish has demonstrated that loss of epigenetic regulators, such as ubiquitin-like protein containing PHD and RING finger domains 1 (uhrf1), reduces TNF-α promoter methylation in intestinal epithelial cells (IECs). Interestingly, the increased expression of TNF-α in IECs results in shedding and apoptosis, immune cell recruitment, and barrier dysfunction (Marjoram et al., 2015). These results suggest that the TNF-α promoter in DR-CS tends to be hypermethylated, coinciding with its downregulation, which may decrease the susceptibility of *C. semilaevis* to *V. harveyi* infections by protecting epithelial cells from damage. Additionally, the DNA methylation patterns of *C. semilaevis* were modified during the course of selective breeding to create the DR-CS family.

In our research, LBP-like showed hypomethylation and higher expression levels in DR-CS. In addition, a luciferase reporter assay indicated that methylation of the LBP-like promoter led to a significant repression of transcriptional activity (*P* < 0.05). The cDNA and amino acid sequence (accession no. XP\_008316264) of *C. semilaevis* LBP-like have been identified, characterized, and named *C. semilaevis* bactericidal/permeability-increasing protein (*CsBPI*). Recombinant CsBPI (rCsBPI) was able to bind to a number of Gram-negative bacteria, leading to bacterial death through membrane permeabilization and structural destruction (Sun and Sun, 2016). Furthermore, rCsBPI can enhance the resistance of tongue sole against bacterial as well as viral infection (Sun and Sun, 2016). In mice, LBP was found to be essential for the rapid induction of an inflammatory response by small amounts of LPS or Gram-negative bacteria and for survival of intraperitoneal *Salmonella* infections (Jack et al., 1997). Further experiments have indicated that LBP primarily acts as an LPS transporter to CD14 (Wright et al., 1990; Hailman et al., 1994; Schumann and Latz, 2000) and the Toll-like receptor complex (Kopp and Medzhitov, 1999; Means et al., 2000). In our study, LBP-like showed higher expression in the DR-CS family, which suggests important roles in the immune response. Overall, we speculate that LBP-like DNA methylation is modified during selective breeding and mediates epigenetic regulatory mechanisms.

# DATA AVAILABILITY

The WGBS data were deposited in the NCBI SRA database under accession numbers SRR8447884 and SRR8447885. The transcriptome data were deposited in the NCBI SRA database under accession numbers SRR9009084 and SRR9009085.

# AUTHOR CONTRIBUTIONS

SC obtained and designed the project. YX performed the experiments; YX and YZ wrote the manuscript; SC and YL instructed, organized and constructed disease-resistant and disease-susceptible families; TG and WX sampled the tissues; FP revised the manuscript; CS and SC designed the experiments and revised the manuscript.

# FUNDING

This work was supported by the National Nature Science Foundation (31530078, 31461163005), the Taishan Scholar Project Fund of Shandong, China, the Applied Basic Research Project of Qingdao City (16-5-1-52-jch), the Natural Science Foundation of Shandong Province (ZR2019BC009), the Advanced Talents Foundation of QAU (6651118016), and the "First Class Fishery Discipline" programme in Shandong Province.

# ACKNOWLEDGMENTS

We would like to thank Yingming Yang and Zhongkai Cui at Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, for sample collection.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00847/full#supplementary-material

SUPPLEMENTARY FIGURE S1 | DNA methylation levels of mCG, mCHG and mCHH in functional regions of the genome. The blue, green and red features represent the promoter (the 2 kb region upstream of the TSS), exon and intron functional regions, respectively.

SUPPLEMENTARY FIGURE S2 | The circos plot representing the mC level, gene number density and mC density of each chromosome in (A) DR-CS and (B) DS-CS. The red, blue and purple lines represent the CG, CHG and CHH context, respectively.

SUPPLEMENTARY FIGURE S3 | The circos plot representing the mC density of each chromosome in (A) DR-CS and (B) DS-CS. The mC density heat map from green to red indicates the density of methylation from low to high. The W chromosome number is NC\_024327.1.

SUPPLEMENTARY FIGURE S4 | DAG visualization of the hierarchical structure of the biological process. Children that represent a more specific instance of a parent term have an 'is a' relationship to the parent. The darker the color of the node, the higher its number of Blast hits and annotation score. All nodes contain the hit annotation scores in numbers.

SUPPLEMENTARY FIGURE S5 | DAG visualization of the hierarchical structure of the molecular function. Children that represent a more specific instance of a parent term have an 'is a' relationship to the parent. The darker the color of the node, the higher its number of Blast hits and annotation score. All nodes contain the hit annotation scores in numbers.

SUPPLEMENTARY FIGURE S6 | GO enrichment analysis of the DMGs from region 1. The GO enrichment analysis results for the differentially expressed genes were classified into the following three categories: biological process (BP), cellular component (CC) and molecular function (MF). The x-axis represents the GO gene function classification, and the y-axis represents the corresponding number of genes.

SUPPLEMENTARY FIGURE S7 | GO enrichment analysis of the DMGs from region 2. The GO enrichment analysis results for the differentially expressed genes were classified into the following three categories: biological process (BP), cellular component (CC) and molecular function (MF). The x-axis represents the GO gene function classification, and the y-axis represents the corresponding number of genes.

SUPPLEMENTARY FIGURE S8 | Sodium bisulfite clone sequencing results for the (A) gramd1b, (B) KCNH4 and (C) plekhg5 DMRs. Filled and open circles indicate methylated and unmethylated CpG sites, respectively, and each row represents one sequenced clone. At least 10 clones were sequenced for DR-CS or DS-CS.

SUPPLEMENTARY FIGURE S9 | The relative expression of thsd7b, plce1, ddr2 and c7h6orf58 in the immune tissues from DR-CS (black bars) and DS-CS (gray bars). Asterisks indicate significant differences (*P* < 0.05) between DR-CS and DS-CS.

SUPPLEMENTARY TABLE S1 | Primers used in this research.

SUPPLEMENTARY TABLE S2 | Statistics of WGBS-seq for each sample.

SUPPLEMENTARY TABLE S3 | Percentage of different contexts.

SUPPLEMENTARY TABLE S4 | Percentage of mC in the mCG, mCHG, and mCHH contexts.

SUPPLEMENTARY TABLE S5 | The DMR location and ID.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Xiu, Shao, Zhu, Li, Gan, Xu, Piferrer and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Model of the Conserved Epigenetic Regulation of Sex

*Francesc Piferrer\*, Dafni Anastasiadi, Alejandro Valdivieso, Núria Sánchez-Baizán, Javier Moraleda-Prados and Laia Ribas*

*Institut de Ciències del Mar (ICM), Spanish National Research Council (CSIC), Barcelona, Spain*

### *Edited by:*

*Peng Xu, Xiamen University, China*

### *Reviewed by:*

*Zhigang Shen, Huazhong Agricultural University, China Eveline M. Ibeagha-Awemu, Agriculture and Agri-Food Canada (AAFC), Canada Tao Zhou, Auburn University, United States*

> *\*Correspondence: Francesc Piferrer piferrer@icm.csic.es*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 27 March 2019 Accepted: 16 August 2019 Published: 26 September 2019*

### *Citation:*

*Piferrer F, Anastasiadi D, Valdivieso A, Sánchez-Baizán N, Moraleda-Prados J and Ribas L (2019) The Model of the Conserved Epigenetic Regulation of Sex. Front. Genet. 10:857. doi: 10.3389/fgene.2019.00857*

Epigenetics integrates genomic and environmental information to produce a given phenotype. Here, the model of Conserved Epigenetic Regulation of Sex (CERS) is discussed. This model is based on our knowledge of genes involved in sexual development and of epigenetic regulation of gene expression activation and silencing. This model was recently postulated to be applied to the sexual development of fish, and it states that epigenetic and gene expression patterns are more associated with the development of a particular gonadal phenotype, e.g., testis differentiation, rather than with the intrinsic or extrinsic causes that lead to the development of this phenotype. This requires the existence of genes with different epigenetic modifications, for example, changes in DNA methylation levels associated with the development of a particular sex. Focusing on DNA methylation, the identification of CpGs, the methylation of which is linked to sex, constitutes the basis for the identification of Essential Epigenetic Marks (EEM). EEMs are defined as the number and identity of informative epigenetic marks that are strictly necessary, albeit perhaps not sufficient, to bring about a specific, measurable, phenotype of interest. Here, we provide a summary of the genes where DNA methylation has been investigated so far, focusing on fish. We found that *cyp19a1a* and *dmrt1*, two key genes for ovary and testis development, respectively, consistently show an inverse relationship between their DNA methylation and expression levels, thus following CERS predictions. However, in *foxl2a*, a pro-female gene, and *amh*, a pro-male gene, such a relationship is not clear. The available data of other genes related to sexual development such as *sox9*, *gsdf*, and *amhr2* are also discussed.
Next, we discuss the use of CERS to make testable predictions of how sex is epigenetically regulated and to better understand sexual development, as well as the use of EEMs as tools for the diagnosis and prognosis of sex. We argue that CERS can aid in focusing research on the epigenetic regulation of sexual development not only in fish but also in vertebrates in general, particularly in reptiles with temperature sex-determination, and can be the basis for possible practical applications including sex control in aquaculture and also in conservation biology.

Keywords: conserved epigenetic regulation of sex, essential epigenetic marks, DNA methylation, sex determination, sex differentiation, sex control, environmental sex determination

# INTRODUCTION

# Background on Epigenetics

The origin of the term "epigenetics" and its implications are continuously subjected to revision. Here, we will use the definition proposed by Deans and Maggert (2015): "the study of phenomena and mechanisms that cause chromosome-bound, heritable changes to gene expression that are not dependent on changes to DNA sequence." These epigenetic changes or epimutations can be inherited not only during mitosis from mother to daughter cells but also through meiosis from parents to offspring (Dupont et al., 2009). Epigenetics has emerged as a powerful discipline in the study of the integration of genomic and environmental information to bring about a specific phenotype (Turner, 2009; Vogt, 2017).

Fish sex is remarkably plastic when compared with the situation in other vertebrates since it can be determined genetically, environmentally, or by a combination of both types of influences (see Wang et al., 2019 and articles therein). Fish present three major sexual patterns: gonochorism, hermaphroditism, and unisexuality. Thus, the phenotypic sex is, in many fish, a clear example of phenotypic plasticity not only because, in hermaphrodites, the same genotype is capable of producing two different phenotypes but also because, under certain environmental conditions, e.g., unusually warm temperatures, some gonochoristic species may develop a phenotypic sex different from its genotypic sex (Ospina-Alvarez and Piferrer, 2008; Baroiller and D'Cotta, 2016; Ribas et al., 2017).

During sexual differentiation, cells of the germ and somatic lines acquire identity and, in this process, changes in gene expression patterns play a central role. Thus, sexual differentiation involves a certain antagonism between male and female pathways as well as multiple feedback loops that reinforce the effects of the primary effector, be it genetic or environmental (Munger and Capel, 2012). Gene networks involved in testis or ovarian differentiation consist of genes the expression of which is activated or suppressed in a tight spatial and temporal fashion (Capel, 2017). We now know that in this type of regulation, epigenetic mechanisms such as DNA methylation, histone modification, and noncoding RNAs (Berger et al., 2009) play a role, and hence, in recent years, the contribution of epigenetics to sex determination and differentiation across taxa has emerged (reviewed in Piferrer, 2013). In the rest of this paper, we will use the term "sexual development" when collectively referring to sex determination and sex differentiation.

# The Model of Conserved Epigenetic Regulation of Sex

Recently, the concept of Essential Epigenetic Marks (EEM), defined as "the number and identity of informative epigenetic marks that are strictly necessary, albeit perhaps not sufficient, to bring about a specific, measurable, phenotype of interest," was proposed (Piferrer, 2019). The model of Conserved Epigenetic Regulation of Sex (CERS) was also proposed (Piferrer, 2019) in regards to the regulation of gene expression during the emergence of the sexual phenotype. This model is based on the assumptions that there are "pro-male" and "pro-female" genes and that there is an inverse relationship between epigenetic silencing and expression of the genes. The terms "pro-male" and "pro-female" genes refer to the exclusive or preferential expression of these genes in one sex rather than in the other. Specifically, the model applies to sex differentiation in gonochoristic species and sex change in hermaphroditic species regardless of the underlying sex-determining mechanism. The CERS model postulates that, for a given sex-related gene, the association between DNA methylation and expression levels with a particular gonadal phenotype is stronger than the means by which this phenotype is obtained (Piferrer, 2019). This implies that, in females, DNA methylation of pro-female genes will be low while expression of these genes will be high and that, in contrast, DNA methylation of pro-male genes will be high while their expression will be low. Conversely, in males, DNA methylation of pro-male genes will be low, while expression of these genes will be high and, in contrast, DNA methylation of pro-female genes will be high, while their expression will be low. Notice that "low" and "high" rather than absolute values indicate values of one sex relative to the other sex. 
The regulation of gene expression levels by changes in DNA methylation constitutes one of the main molecular mechanisms of CERS (the other two would be regulation of gene expression by histone modifications or variants and abundance and activity of miRNAs).
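The predictions of the CERS model stated above can be written down as a small lookup and a consistency check. This is only an illustrative sketch: the function names, the category labels, and the numeric values are hypothetical, and "low" and "high" are relative to the other sex, as the text specifies.

```python
def cers_prediction(gene_bias, sex):
    """Predicted relative DNA methylation and expression under CERS.

    gene_bias: "pro-male" or "pro-female"; sex: "male" or "female".
    Levels are relative to the opposite sex, not absolute values.
    """
    if gene_bias not in ("pro-male", "pro-female"):
        raise ValueError(f"unknown gene bias: {gene_bias!r}")
    if sex not in ("male", "female"):
        raise ValueError(f"unknown sex: {sex!r}")
    # A gene "matches" when its bias corresponds to the sex being examined.
    matches = (gene_bias == "pro-male") == (sex == "male")
    return {
        "methylation": "low" if matches else "high",
        "expression": "high" if matches else "low",
    }


def follows_cers(gene_bias, meth_male, meth_female, expr_male, expr_female):
    """Check whether observed male/female values agree with the CERS prediction."""
    if gene_bias == "pro-male":
        return meth_male < meth_female and expr_male > expr_female
    if gene_bias == "pro-female":
        return meth_female < meth_male and expr_female > expr_male
    raise ValueError(f"unknown gene bias: {gene_bias!r}")


# Hypothetical values for a pro-female gene (illustrative only):
# lower methylation and higher expression in females than in males.
ok = follows_cers("pro-female", meth_male=0.8, meth_female=0.3,
                  expr_male=1.0, expr_female=5.0)
```

A check of this kind makes the model testable gene by gene: a sex-related gene whose methylation and expression differ between sexes in the same direction would not follow the prediction.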

Regarding the causation of differential methylation levels of "pro-male" and "pro-female" genes, currently, there is debate on whether epigenetic changes are a cause or a consequence of changes in gene expression (probably both things are correct). Allele-specific effects have been found in the half-smooth tongue sole, *Cynoglossus semilaevis*, neomales (ZW females sex reversed into males) with Z chromosomes inherited from high-temperature-exposed sires (Shao et al., 2014). In the European sea bass, *Dicentrarchus labrax*, we found genes with methylation levels that resembled those of oocytes, while other genes had methylation levels resembling those of the sperm, suggesting female- and male-specific inheritance, respectively (Anastasiadi et al., 2018b).

Testis development, at least in fish, where sex can be labile, can be achieved as a consequence of normal male sex differentiation, protogynous sex change, or as masculinization induced by high temperature, stress, aromatase inhibitors, or androgens (Blazquez et al., 2001; Navarro-Martin et al., 2009; Piferrer, 2019). Conversely, ovarian development can be achieved as a consequence of normal female sex differentiation, protandrous sex change, or feminization induced by estrogens or endocrine disrupting chemicals. The model is called conserved because the underlying mechanisms are thought to be shared across species even if they have different reproductive strategies (**Figure 1**). It should be noted that DNA methylation patterns may differ depending on the cell type within the same gonad. Thus, DNA methylation values reported until now in the gonads represent the combined values of the different cell types.

In the first inception of this model, the following aspects were discussed (Piferrer, 2019): 1) What species are more fruitful to study and why; 2) Which are the best developmental stages to target; 3) Whether there are other organs than the gonads worth targeting; 4) The links with ecotoxicology; and 5) The added comparative value of these studies. In this review, the concept of CERS will be further developed. Thus, here we will: 1) Discuss some general considerations about epigenetic marks to put CERS and the concept of EEM in a broader perspective; 2) Since, in the last 2–3 years, several studies have provided information on DNA methylation levels, and given the extraordinary diversity of fishes, attempt to summarize the available data on the epigenetic regulation of sex and hence test CERS. This will allow drawing conclusions that can be used not only to establish an appropriate framework but also to help focus future studies; and 3) Suggest that the CERS model can also be applied to other vertebrates regardless of the sex-determining system, whether it is genetic or environmental. In fact, even in plants, there is evidence of the involvement of epigenetic regulatory mechanisms in sex determination. This is the case of the *Populus balsamifera* tree, where the *pbrr9* gene showed sex-specific patterns of DNA methylation (mostly male-biased) in the putative promoter and in the first intron (Bräutigam et al., 2017).

epigenetic silencing, and the right half to gene expression levels. White and gray squares indicate lower and higher levels, respectively, of epigenetic silencing and gene expression. Boxed text indicates possible different means to arrive at a given phenotype. There might be other means. AI, aromatase inhibitor; EDC, endocrine disrupting chemical; Hi. Temp., high temperature. Figure modified from Piferrer (2019), with permission.

# EPIGENETIC BIOMARKERS

## General Concepts

Biomarkers have been developed mostly in the context of human health (e.g., Liu et al., 2019). According to the Biomarkers Definitions Working Group, a biomarker is defined as "a characteristic that is objectively measured and evaluated as indicator of normal biological processes, pathogenic processes or pharmacologic responses to a therapeutic intervention" (Atkinson et al., 2001). Biomarkers can be proteins, levels of mRNA transcripts, or epigenetic modifications and can mainly be used for diagnosis and prognosis, e.g., to predict responses to therapy in cancer (Costa-Pinheiro et al., 2015; Prensner et al., 2012). Proper biomarkers have to be harmless and characterized by high sensitivity, specificity, and reproducibility (Atkinson et al., 2001; Costa-Pinheiro et al., 2015; García-Giménez et al., 2017).

Also in the context of human health, epigenetic alterations including DNA methylation, histone modifications, and noncoding RNAs have been suggested as good candidates for becoming cancer biomarkers because they can be stable, frequent, abundant, and accessible (Costa-Pinheiro et al., 2015). Nevertheless, the most frequently studied epigenetic modification as potential biomarker is DNA methylation, mainly because of its stability and relative ease of measurement by the available technologies (Bock, 2009; Van Neste et al., 2012). Thus, DNA methylation biomarkers are thought to be extremely promising in the context of human health (Van Neste et al., 2012; Costa-Pinheiro et al., 2015). However, other biomarkers such as microRNAs have been identified also as good candidates for human diseases (Navickas et al., 2016) and, to a lesser extent, as an aid in animal breeding programs (Ibeagha-Awemu and Zhao, 2015).

## Biomarker Development

A systematic approach to develop epigenetic biomarkers based on DNA methylation was suggested by Bock (2009) in the context of clinical applications, in which different steps have to be completed. Here, we modify this approach for the development of epigenetic biomarkers to test the CERS (**Figure 2**). In the first step, a whole-genome or genome-wide method should be used in order to simultaneously assess hundreds or thousands of candidate sites. For DNA methylation biomarkers, whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS) (Gu et al., 2011), or bisulfite RADseq (Trucchi et al., 2016) could be employed. These techniques allow measurement of the actual DNA methylation levels of cytosines located in a CpG context in vertebrate genomes and should lead to the identification of candidate EEMs. These can include differentially methylated cytosines (DMCs) or differentially methylated regions (DMRs) between sexes. In the second step, selected biomarkers are tested using targeted approaches in a large number of independent samples. Here, appropriate approaches include, but are not limited to, multiplex bisulfite sequencing (MBS) (Masser et al., 2013; Anastasiadi et al., 2018b), enrichment bisulfite sequencing (Diep et al., 2012; Paul et al., 2014), pyrosequencing, or mass spectrometric analysis of DNA methylation (Coolen et al., 2007; Bock et al., 2016). Computational, statistical, and machine learning procedures involving regression (best subsets regression, penalized regression, principal components-based regression analysis) or classification analysis should be used. In the third step, from all those EEMs that are strongly correlated with the trait of interest, a handful that allow an optimal trait association and/or prediction are validated and a targeted assay is developed (array-type, MeDIP-qPCR, or MBS) (Bøvelstad et al., 2007; James et al., 2013; Anastasiadi et al., 2018b).
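As a rough illustration of the first, screening step, the sketch below ranks candidate CpGs by the difference in mean methylation between sexes. It is a deliberately simplified stand-in, not the procedure used in the cited studies: real analyses rely on dedicated statistical tests with correction for multiple testing. The function name, the `min_diff` threshold, and the example data are all hypothetical.

```python
from statistics import mean

# Hypothetical screening helper: rank CpGs by between-sex methylation difference.

def rank_candidate_cpgs(females, males, min_diff=0.2):
    """Return (cpg_id, male_minus_female_difference) pairs, largest effect first.

    females, males: dicts mapping a CpG identifier to a list of methylation
    fractions (0-1), one value per individual.
    min_diff: minimum absolute difference in mean methylation to keep a CpG
    (an arbitrary illustrative threshold).
    """
    candidates = []
    # Only CpGs measured in both sexes can be compared.
    for cpg in females.keys() & males.keys():
        diff = mean(males[cpg]) - mean(females[cpg])
        if abs(diff) >= min_diff:
            candidates.append((cpg, diff))
    return sorted(candidates, key=lambda item: abs(item[1]), reverse=True)


# Made-up methylation fractions for three CpGs in two fish per sex.
females = {"cpg_a": [0.1, 0.2], "cpg_b": [0.5, 0.5], "cpg_c": [0.9, 0.8]}
males = {"cpg_a": [0.8, 0.9], "cpg_b": [0.55, 0.5], "cpg_c": [0.4, 0.5]}
for cpg, diff in rank_candidate_cpgs(females, males):
    print(cpg, round(diff, 2))
```

The candidates retained here would correspond to the sites carried forward to the second, targeted step in a larger set of independent samples.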

Switching the perspective from clinical research to ecology and animal production, biomarkers in vertebrates have been used as indicators of environmental pollution (Monserrat et al., 2007) and animal health, including endocrine, immune, nutritional, and metabolic processes (Warne et al., 2015). Epigenetic biomarkers have already been used to predict age and sex in vertebrates. Thus, after the discovery of an epigenetic clock in humans, i.e., a panel of DNA methylation biomarkers diagnostic of biological age (Horvath, 2013), epigenetic clocks have been constructed in other vertebrates, such as mice, *Mus musculus* (Han et al., 2018), chimpanzees, *Pan troglodytes* (Horvath, 2013), humpback whales, *Megaptera novaeangliae* (Polanowski et al., 2014), a long-lived seabird, *Ardenna tenuirostris* (De Paoli‐Iseppi et al., 2019), and the European sea bass (Anastasiadi et al., 2019).

idea of steps to epigenetic biomarker development was laid down by Bock (2009).

## Development of Biomarkers of Sex

In livestock and animal production, epigenetic biomarkers have recently been suggested as candidates with extreme potential to predict the phenotypic outcome, as well as to improve production traits (Ibeagha-Awemu and Zhao, 2015; Moghadam et al., 2015). This need was first described in a report by the Food and Agriculture Organization of the United Nations in 2015 stating that knowledge of epigenetics will offer new opportunities for animal breeding (Scherf and Pilling, 2015). In fish, using the European sea bass as a model, a carefully selected panel of CpGs in three genes constitutes an example of EEMs that were capable of predicting the sex phenotype of the gonad with ~90% accuracy (Anastasiadi et al., 2018b). To our knowledge, this is so far the first and only method to predict sex based on EEMs. Currently, sex prediction using EEMs is lethal and is not cost-effective. However, we are testing the possible existence of correlations between DNA methylation in predictor CpGs in the gonads and equally predictive CpGs in other tissues. In addition, biomarker development itself involves, in the last step, the use of CpGs in an array-type approach or in multiplexing (MBS) that, along with the continued decrease in next-generation sequencing costs, should make the cost of screening per sample affordable.

# TESTING THE MODEL OF THE CONSERVED EPIGENETIC REGULATION OF SEX

Epigenetic regulation of gene expression is involved in the sexual development of gonochoristic fish with different types of sex-determining mechanisms, as well as in driving the process of sex change in different types of hermaphrodites.

Here, we searched the published literature in fish and collected information on the DNA methylation of genes related to sexual development. WGBS was used in the half-smooth tongue sole (Shao et al., 2014), while MBS was used in the European sea bass (Anastasiadi et al., 2018b). These are the exceptions because in the rest of the studies carried out so far, which concern around 15 different species, just one or two genes have been analyzed in each case (**Table 1**). DNA methylation at a single CpG is of a binary nature, since a given CpG can be either methylated or unmethylated. However, mean percent DNA methylation can, theoretically, take any value between 0 and 100%. This applies regardless of whether one considers the promoter or the first intron (Anastasiadi et al., 2018a) or other genomic features in a predefined window of a given length. Information drawn from the primary literature shows that DNA methylation levels are more or less evenly distributed across five arbitrarily defined methylation classes (0–20%, 21–40%, 41–60%, 61–80%, and 81–100%), perhaps with a higher preponderance in the 0–20% class, regardless of other considerations such as method of analysis, targeted genomic feature, sex, species, etc. Thus, these preliminary data indicate that there are no preferred or typical DNA methylation values for the sex-related genes as a whole (**Figure 3A**). Again, it should be remembered that DNA methylation values represent the combined values resulting from the different cell types making up the gonads. Thus, the correlation with gene expression, if present, should take this into account.
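The distinction between binary calls at a single CpG and a continuous mean percentage, together with the five methylation classes mentioned above, can be sketched as follows. The function names are hypothetical, and the handling of non-integer values that fall between class limits (e.g., 20.5%) is an assumption, since the text gives only integer boundaries.

```python
# Hypothetical helpers; class labels follow the five arbitrary classes in the text.

def mean_percent_methylation(calls):
    """Mean percent methylation from binary calls (1 = methylated, 0 = unmethylated).

    Each call can come from one read at one CpG, so the mean over many reads
    and/or CpGs can take any value between 0 and 100.
    """
    calls = list(calls)
    if not calls:
        raise ValueError("no methylation calls provided")
    return 100.0 * sum(calls) / len(calls)


def methylation_class(percent):
    """Assign a mean percent methylation to one of five classes.

    Assumption: upper limits are inclusive, so 20.0 falls in '0-20%' and any
    value above 20 up to 40 falls in '21-40%'.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    for upper, label in ((20, "0-20%"), (40, "21-40%"), (60, "41-60%"), (80, "61-80%")):
        if percent <= upper:
            return label
    return "81-100%"


calls = [1, 0, 0, 1]  # made-up calls for illustration
percent = mean_percent_methylation(calls)
print(percent, methylation_class(percent))
```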

Gonadal aromatase (*cyp19a1a*) was the first gene shown to be under epigenetic regulation during sexual development in a vertebrate, the European sea bass (Navarro-Martín et al., 2011). This is not surprising because it is the only steroidogenic enzyme responsible for the balance between androgens and estrogens and because estrogens are needed for ovarian differentiation in all nonmammalian vertebrates (Guiguen et al., 2010). Since then, the DNA methylation of only a few genes has been studied in more than two species: *cyp19a1a* just cited earlier (studied in 10 species), doublesex- and mab-3-related transcription factor 1 (*dmrt1*) (6 species), anti-Müllerian hormone or Müllerian-inhibiting hormone (*amh*) (4 species), and the member of the winged helix/forkhead group (*foxl2a*) (3 species) (**Figure 3B**). From the analysis of the published data and our own unpublished data, we found that mean DNA methylation levels of *cyp19a1a* were typically <50% in ovaries (mean: 46.2%, sd: 15.98) and >75% in testes (mean: 77.0%, sd: 24.89) (*t*-test: -4.0439; df = 28, *p* = 0.00037) in a fairly consistent manner across species (see list of species in **Table 1**). This finding was in accordance with the constitutive higher expression of *cyp19a1a* in ovaries when compared with testes (Piferrer and Blázquez, 2005; Guiguen et al., 2010). Likewise, mean DNA methylation levels of *dmrt1* were ~30% in ovaries (mean: 32.54%, sd: 15.98) and <10% in testes (mean: 5.54%, sd: 5.21) (*t*-test: 4.54; df = 14, *p* = 0.00046), also in accordance with the higher constitutive expression of *dmrt1* in testes when compared with ovaries (Herpin and Schartl, 2011) (**Figure 3B**). Therefore, these two important genes for sex differentiation, which have been used as sex markers in some fish species, e.g., turbot, *Scophthalmus maximus* (Ribas et al., 2016), and medaka, *Oryzias latipes* (Herpin and Schartl, 2011), do indeed conform to the CERS predictions, since there is an inverse relationship between DNA methylation and gene expression with clear sex-specific differences.
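For readers who wish to run the same kind of comparison on their own data, a minimal two-sample *t*-test is sketched below. The text does not state which variant of the test was used; this sketch assumes the classical pooled-variance (Student's) test, whose degrees of freedom are n1 + n2 - 2. The function name and the example values are hypothetical and are not the data behind Figure 3B. The *p*-value is not computed here, since it requires the *t* distribution, which is not available in the Python standard library.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical helper: pooled-variance (Student's) two-sample t statistic.

def student_t(sample_a, sample_b):
    """Return (t, df) for two independent samples, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    if pooled_var == 0:
        raise ValueError("both samples have zero variance")
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, df


# Made-up mean percent methylation values, one per datapoint.
ovaries = [30, 40, 50, 60]
testes = [70, 80, 90, 100]
t, df = student_t(ovaries, testes)
print(round(t, 3), df)
```

A negative *t* here simply means that the first group (ovaries) has the lower mean, matching the sign convention of the *cyp19a1a* comparison above.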

This inverse relationship does not seem apparent when two other well-known genes with sex-biased expression in fish are considered: *amh* and *foxl2a* (**Figure 3B**). *Amh* is a member of the TGF-β superfamily of growth and differentiation factors involved in sex differentiation from mammals to fish (Piferrer and Guiguen, 2008). Relatively low and equal levels of *amh* expression are detected in gonads prior to the appearance of sex-specific differences. However, once sex differentiation is underway, higher *amh* levels are typically associated with testis differentiation in several species analyzed (reviewed in Pfennig et al., 2015). Here, we found that mean DNA methylation levels of *amh* were 54.05% (sd: 25.26) in ovaries and 80.24% (sd: 12.74) in testes, a difference that did not reach statistical significance with the data available so far (*t*-test: -2.07; df = 8, *p* = 0.07211). In the same way, *foxl2a* is expressed at higher levels in the ovary when compared with the testis (reviewed in Bertho et al., 2016), like *cyp19a1a*. In fact, *foxl2a* is one of the earliest transcriptional activators of *cyp19a1a*, with which it co-localizes in the granulosa cells (Wang et al., 2004). However, DNA methylation levels were clearly not different (mean = 3.08% and sd = 3.88 in ovaries and mean = 2.59% and sd = 3.3 in testes) (*t*-test: 0.1656; df = 4, *p* = 0.8765). Therefore, unlike *cyp19a1a* and *dmrt1*, and with the information available so far, data suggest that *amh* and *foxl2a* do not seem to conform to CERS predictions or that, in these genes, the relationship between DNA methylation and gene expression is positive (**Figure 3B**), although, clearly, further research is needed.

TABLE 1 | Studies involving fish where DNA methylation of genes associated with sexual development has been measured

*(\*) This is from a multiplex bisulfite sequencing analysis with a larger panel of genes. Here, a subset of the most sex-related genes is shown.*

*(\*\*) This subset of genes showed differential methylation level between ovaries and testes and are taken from Supplementary Table 8 in Shao et al. (2014), where whole-genome bisulfite sequencing was used.*

There are other genes related to sex differentiation to different degrees for which it may be premature to attempt any sort of generalization. These genes include *amhr2*, *cyp11a*, *hsd3b2*, *nr3c1*, *sox9*, *vasa*, and *gsdf* (**Figure 3C**). Here, it is worth noting that allelic diversification of *amhr2* in *Takifugu rubripes* results in a dominant master sex-determining gene, while allelic diversification of *gsdf* has given rise to the sex-determining gene in some fish species, including *Oryzias luzonensis* and *Anoplopoma fimbria* (reviewed in Piferrer, 2018; Guiguen et al., 2019). DNA methylation levels of *amhr2* in the European sea bass were ~50% without sex-related differences (Anastasiadi et al., 2018b), while in the half-smooth tongue sole DNA methylation levels are higher in females (Shao et al., 2014). Similarly, in the latter species, the only species where *gsdf* DNA methylation values have been determined, these values are clearly lower in males, in accordance with the higher expression of *gsdf* in males (Shao et al., 2014) (**Figure 3C**).

Except in the half-smooth tongue sole (Shao et al., 2014), where WGBS was used, in the rest of the studies reported in **Table 1** and used to draw **Figure 3**, targeted approaches were utilized to query the DNA methylation status of the target genes. For these studies, an average of ~9 late juvenile or adult fish per sex was used. Typically, amplicons span ~450 bp and usually include ~15 CpGs located around the transcription start site, although the latter figure may vary considerably among species. It is interesting to note that while sex-specific differences involve changes in DNA methylation of several CpGs in some genes, in other genes they involve only a small number of CpGs (**Figure 4**). A more comprehensive picture will emerge when genome-wide DNA methylation techniques such as WGBS or RRBS are employed in lieu of the targeted approaches used so far in most studies.
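When designing a targeted assay of the kind described above, a first practical check is how many CpGs a candidate amplicon contains. The sketch below simply locates CpG dinucleotides on the given strand of a sequence; the function name and the toy sequence are hypothetical.

```python
# Hypothetical helper: locate CpG dinucleotides in an amplicon sequence.

def cpg_positions(sequence):
    """Return the 0-based positions of the C in every CpG on the given strand."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


amplicon = "ACGTCGcgAA"  # toy sequence; real amplicons span ~450 bp
positions = cpg_positions(amplicon)
print(len(positions), positions)
```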

For the rest of the genes, *cyp11a*, *hsd3b2*, and *nr3c1*, there are only preliminary data gathered in our lab with the European sea bass (Anastasiadi et al., unpublished) and zebrafish (*Danio rerio*) (Valdivieso et al., unpublished; Moraleda-Prados et al., unpublished). In the European sea bass, methylation values of *hsd3b2* are higher in females. This is in agreement with the expression of this gene that is male-skewed in the developing gonads of Nile tilapia, *Oreochromis niloticus* (Ijiri et al., 2008). On the other hand, DNA methylation values of *nr3c1* and *sox9* were quite different between the two species.

FIGURE 3 | DNA methylation levels of some key genes (see Table 1) involved in sexual development. (A) Histogram of overall methylation levels for the genes discussed in this paper. Frequency refers to number of DNA methylation values obtained by combining published data and unpublished research performed in our lab. (B) Boxplot of DNA methylation levels of *cyp19a1a* and *dmrt1*, which conform to CERS postulates (left side), and *amh* and *foxl2*, which do not conform to CERS postulates (right side). The lower and upper hinges display the distribution of values between the first and third quartiles, the upper whisker extends to the maximum value up to 1.5 \* interquartile range (IQR), the lower whisker extends to the minimum value up to 1.5 \* IQR, while the black line indicates the median of the distribution. One outlier outside the end of the whiskers has been excluded. Numbers between parentheses indicate number of datapoints/species. If the first number is bigger than the second, it indicates that there are species for which there is more than one datapoint. Significant differences were assessed with the *t*-test. \*\*\**P* < 0.001; ns, not significant. (C) DNA methylation levels of *amhr2, cyp11a, hsd3b2, nr3c1, sox9, vasa,* and *gsdf* in different species. For easier visualization, lines connect datapoints of the same species. In all genes except *gsdf*, there are data for at least two different species. In addition, in (B and C), data are also color-coded according to sex.

In many species, sex determination has an environmental component. Hence, it is worth mentioning that an environmental factor such as temperature or population density may be connected to sex through epigenetic mechanisms. DNA methylation changes in sex-related genes are the type of epigenetic modification most commonly studied so far. Temperature can affect DNA methylation of many genes, as shown by MeDIP-seq in the Nile tilapia (Sun et al., 2016), although the exact mechanism is not yet known. In the European sea bass, elevated temperature induces hypermethylation in the promoter of *cyp19a1a*, and this prevents the binding of *cyp19a1a* transcriptional activators such as *sf1* and *foxl2a* (Navarro-Martín et al., 2011). Other epigenetic modifications can also be involved in the connection between environmental factors and sex. Thus, temperature increases the transcription of lysine-specific demethylase 6B (*kdm6b*), a chromatin modifier gene in the red-eared slider turtle, *Trachemys scripta*. *Kdm6b* eliminates the trimethylation of H3K27 in the promoter of *dmrt1*, leading to upregulation of its expression and male development (Ge et al., 2018; Georges and Holleley, 2018).

We would like to mention three considerations for further testing the CERS model. First, what species are worth testing? Obviously, fish, due to their great diversity of sexual systems and sex-determining systems, which can vary even in closely related species. Reptiles can also provide very relevant information. Many reptiles possess temperature-dependent sex determination and thus offer the opportunity to test whether DNA methylation in key genes correlates with gene expression and phenotypic sex under different incubation temperatures during the thermosensitive period. Thus, in the red-eared slider turtle, *cyp19a1* DNA methylation levels conformed to CERS predictions (Matsumoto et al., 2013). The same is true in the alligator, *Alligator mississippiensis*, for *cyp19a1* and *sox9* (Parrott et al., 2014) and in the sea turtle, *Lepidochelys olivacea*, for *sox9* (Venegas et al., 2016). In birds and mammals, sexual development is strongly canalized (Capel, 2017), and therefore there is little or no room for sexual plasticity. Nevertheless, in such canalized systems, it would also be interesting to determine to what extent DNA methylation of key genes correlates with expression and whether this is established before the completion of gonadal differentiation. In any case, and regardless of the species of choice, testing the role of epigenetic regulation on the expression of key sex-related genes during the process of sex differentiation should involve, in our opinion, the analysis of at least three different time points. The first one, ideally, should be prior to any morphological sign of sex differentiation, the second around the middle of the process, and the third towards the end or after the completion of sex differentiation.

Second, what other genes can be targeted? In our view, the genes to be tested should include at least those that do or do not consistently follow the predictions of the CERS model, namely, *cyp19a1a*, *dmrt1*, *amh*, and *foxl2a*, as shown in this paper. However, other genes with known functions in sexual development in vertebrates, including mammals, birds, and reptiles, should also be studied. We propose here a list of some of the most relevant genes found in the literature (**Table 2**). Information on DNA methylation of additional genes during gonadal differentiation and any possible sex-related differences will help to better understand the epigenetic regulation of sexual development.

TABLE 2 | Genes related to sexual development in mammals, birds, reptiles, and fish (Ge et al., 2018; Capel, 2017; Valenzuela et al., 2019; Todd et al., 2019) whose epigenetic regulation would be worth studying

*(\*)Genes for which there are data on DNA methylation during sex differentiation, as detailed in* Table 1.

Third, what other approaches can be used? Gene-editing techniques such as CRISPR/Cas9 or the more recently developed technique to edit the methylome in the mammalian genome by Liu et al. (2016b) can be very useful. To date, knockout mutants of sex-related genes in fish have been mostly developed for some model species, e.g., in zebrafish: *cyp19a1a* (Lau et al., 2016), *amh*, and *dmrt1* (Lin et al., 2017), and in medaka: estrogen receptor 1 (*esr1*) (Tohyama et al., 2017), *gnrh* family genes (Marvel et al., 2018), and *cyp19a1a* (Nakamoto et al., 2018). Lau et al. (2016) found that all knockout mutants of *cyp19a1a* were males, supporting the view that aromatase plays an essential role in ovarian differentiation and development. In contrast, Lin et al. (2017) found that *dmrt1* and *amh* knockout zebrafish mutants displayed female-biased sex ratios, but the development of abnormal testes was still possible. *Dmrt1* was suggested to be necessary for the maintenance, self-renewal, and differentiation of male germ cells, and *amh* was proposed to control the balance between proliferation and differentiation of these cells. Therefore, it would be interesting to analyze the DNA methylation of *dmrt1* and other male-biased genes in *amh* knockout mutants and, vice versa, the DNA methylation of *amh* and other male-biased genes of the network in *dmrt1* knockout mutants.

# GAPS IN KNOWLEDGE AND FUTURE PROSPECTS

There are some aspects worth discussing regarding future studies of the involvement of DNA methylation in the regulation of sexual development. First, one aspect concerns the genomic feature on which one should focus when the goal is to associate DNA methylation with gene expression levels. Determination of the expression should be accurate and consistent for each gene assuming that the method of measurement, e.g., qPCR, is properly employed according to the appropriate standards (primers, reference genes, etc.). In contrast, DNA methylation levels can vary across different genomic features of the same gene. In most studies, the promoter region has been typically targeted. However, methylation of other genomic regions has been found to be equally or even better associated with gene silencing. Indeed, it was shown that methylation of the first exon is tightly linked to transcriptional silencing (Brenet et al., 2011). Furthermore, in a systematic study aimed at addressing this question, it was found that methylation of the first intron, more than that of the promoter and the first exon, is tightly related to gene silencing. This seems to be conserved across vertebrate species since it was observed in fish (Japanese puffer, *Takifugu rubripes*, and the European sea bass), frog (*Xenopus*), and humans (Anastasiadi et al., 2018a). Thus, for the epigenetic regulation of sex, as well as for the development of sex-related biomarkers, it is better to focus around the transcription start site and to prioritize the CpGs localized in the first intron, first exon, and promoter regions, in the order mentioned. Furthermore, gene expression can also be positively correlated with tissue-specific DNA methylation, and this should be kept in mind (Lokk et al., 2014; Wan et al., 2015; Anastasiadi et al., 2018a).
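The prioritization suggested above (first intron, then first exon, then promoter) can be expressed as a simple ordering rule. In the sketch below, the feature labels, the tuple layout, and the use of distance to the transcription start site as a tie-breaker are all assumptions made for illustration; only the order of the three genomic features comes from the text.

```python
# Priority order taken from the text; everything else is hypothetical.
FEATURE_PRIORITY = {"first_intron": 0, "first_exon": 1, "promoter": 2}


def prioritize_cpgs(cpgs):
    """Order CpGs by genomic feature, then by proximity to the TSS.

    cpgs: list of (cpg_id, feature, distance_to_tss_bp) tuples. Features not
    listed in FEATURE_PRIORITY are placed last.
    """
    lowest = len(FEATURE_PRIORITY)
    return sorted(cpgs, key=lambda c: (FEATURE_PRIORITY.get(c[1], lowest), abs(c[2])))


# Made-up annotation for five CpGs of one gene.
cpgs = [
    ("c1", "promoter", -200),
    ("c2", "first_intron", 350),
    ("c3", "first_exon", 40),
    ("c4", "gene_body", 5000),
    ("c5", "first_intron", 120),
]
print([cpg_id for cpg_id, _, _ in prioritize_cpgs(cpgs)])
```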

Another aspect concerns the possible effect of genetic variation on DNA methylation levels and how to account for it in the data analysis (Lea et al., 2017; Anastasiadi et al., 2018b). This is related to the number of samples to be analyzed per treatment in studies of DNA methylation, which has been discussed elsewhere (Bock et al., 2016). Also, it would be desirable to overcome the noise induced by the cell heterogeneity of the gonadal tissue. In this regard, recent technological advances allow determination of the epigenome of single cells (Farlik et al., 2015). Efforts toward this type of measurement would definitely help in obtaining more robust measurements of DNA methylation.

Furthermore, DNA methylation and gene expression levels discussed throughout this paper refer to the gonads. DNA methylation is known to be tissue-specific. However, it cannot be ruled out that the methylation patterns of the gonads could be replicated in other tissues. This could be the case of tissues involved in the control of reproduction (e.g., the hypothalamus) or that present sex dimorphism (e.g., secondary sexual characters) because they are under the control of hormonal steroids. To the best of our knowledge, this information does not yet exist despite increasing evidence of sex-related differences in DNA methylation for many genes in nonreproductive tissues, such as the muscle or the liver (Davegårdh et al., 2019; Grimm et al., 2019).

Another major challenge will be to determine the sexual phenotype just by the DNA methylation levels of selected EEMs before it can be determined by other means (e.g., by analyzing transcriptomic or histological changes). This would be achievable if it were demonstrated that the epigenetic modifications precede changes in gene expression. In this case, the EEMs and the CERS model can be foreseen as having potentially useful applications. For example, a defined set of EEMs could be used to predict the sexual phenotype in species with marked sexual growth dimorphism (Parker, 1992; Wang et al., 2019). EEMs could allow prediction of the sex ratio in a subsample of a clutch before gonadal differentiation. This would aid in stock management and in the selection of future broodstock. The same principle could be applied in ornamental fish culture, where the secondary sexual characteristics of males usually make them more desirable than females (Piferrer and Lim, 1997). Another case would be to aid in the selection of broodstock fish with a certain epigenetic profile that is suitable to withstand, for example, a masculinizing environment due to elevated density or temperature. In reptiles, the use of EEMs combined with temperature manipulations could aid research toward understanding the underlying molecular mechanism of temperature-dependent sex determination.

Finally, epigenetic modifications can recapitulate past environmental influences (Turner, 2009; Vogt, 2017). Taking advantage of this, EEMs could help to determine whether animals in the wild were exposed to altered environmental conditions such as, for example, pollutants or elevated temperatures. These EEMs could therefore be useful in conservation programs aimed at determining the environmental hazards to which natural populations may have been previously exposed. As an example along these lines, Guillette et al. (2016) identified epigenetic biomarkers to assess the environmental exposures and health impacts on populations of alligators from lakes contaminated with endocrine-disrupting compounds. The effects of endocrine-disrupting compounds on DNA methylation in the field of aquatic toxicology and biodiversity conservation have recently been reviewed by Tubbs and McDonough (2018). This approach would allow determining whether a wild population was subjected to a sex-altering condition in the past. To our knowledge, this type of application has not been fully developed to date, but several studies have started to identify biomarkers with this aim in mind.

# CONCLUSIONS


# DATA AVAILABILITY

All datasets analyzed for this study are included in the manuscript and the supplementary files.

# AUTHOR CONTRIBUTIONS

FP conceived the study, supervised the collection of data, coined the concepts of EEM and CERS, and wrote the paper. DA developed the technique of MBS and wrote the paper. AV, NS-B, JM-P, and LR collected data on different species and wrote the paper. All authors approved the final version of the manuscript.

# FUNDING

This study was supported by the Spanish Ministry of Science grants AGL2016–787107-R "Epimark" to FP and AGL2015-73864-JIN "Ambisex" to LR. DA was supported by an Epimark contract, AV and NS-B were supported by Spanish government scholarships (BES-2014-069051 and BES-2017-079744, respectively); LR and JM-P were supported by Ambisex contracts.

# REFERENCES

# ACKNOWLEDGMENTS

We would like to thank the editors, Drs. Peng Xu, Lior David, Paulino Martínez, and Gen Hua Yue, for allowing us to prepare this paper.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Piferrer, Anastasiadi, Valdivieso, Sánchez-Baizán, Moraleda-Prados and Ribas. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Characterization of Full-Length Transcriptome Sequences and Splice Variants of Lateolabrax maculatus by Single-Molecule Long-Read Sequencing and Their Involvement in Salinity Regulation

*Yuan Tian, Haishen Wen, Xin Qi, Xiaoyan Zhang, Shikai Liu, Bingyu Li, Yalong Sun, Jifang Li, Feng He, Wenzhao Yang and Yun Li\**

Key Laboratory of Mariculture, Ministry of Education, Ocean University of China, Qingdao, China

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Jun Hong Xia, Sun Yat-sen University, China Chen Jiang, Dalian Ocean University, China

> \*Correspondence: Yun Li yunli0116@ouc.edu.cn

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 20 March 2019 Accepted: 17 October 2019 Published: 15 November 2019

### Citation:

Tian Y, Wen H, Qi X, Zhang X, Liu S, Li B, Sun Y, Li J, He F, Yang W and Li Y (2019) Characterization of Full-Length Transcriptome Sequences and Splice Variants of Lateolabrax maculatus by Single-Molecule Long-Read Sequencing and Their Involvement in Salinity Regulation. Front. Genet. 10:1126. doi: 10.3389/fgene.2019.01126

Transcriptome complexity plays crucial roles in regulating the biological functions of eukaryotes. In addition to functional genes, alternative splicing and fusion transcripts greatly expand transcriptome diversity. In this study, we applied PacBio single-molecule long-read sequencing technology to unveil the whole transcriptome landscape of Lateolabrax maculatus. We obtained 28,809 high-quality non-redundant transcripts, including 18,280 novel isoforms covering 8,961 annotated gene loci within the current reference genome and 3,172 novel isoforms from novel gene loci. A total of 10,249 alternative splicing (AS) events were detected, and intron retention was the predominant AS event. In addition, 1,359 alternative polyadenylation events, 3,112 lncRNAs, 29,609 SSRs, 365 fusion transcripts, and 1,194 transcription factors were identified in this study. Furthermore, we combined RNA-Seq analysis with the Iso-Seq results to investigate the salinity regulation mechanism at the transcript level. A total of 518 transcripts were differentially expressed, which were further divided into 8 functional groups. Notably, transcripts from the same genes exhibited similar or opposite expression patterns. Our study provides a comprehensive view of the transcriptome complexity in L. maculatus, which significantly improves current gene models. Moreover, the diversity of the expression patterns of transcripts may enhance the understanding of the salinity regulatory mechanism in L. maculatus and other euryhaline teleosts.

Keywords: Lateolabrax maculatus, Iso-Seq, full-length transcripts, alternative splicing, isoform, salinity regulation

# INTRODUCTION

With the development of high-throughput sequencing of the transcriptome, biologists have begun to pay more attention to multiple post-transcriptional processes of precursor messenger RNA (pre-mRNA). Alternative splicing (AS), a key post-transcriptional processing step of pre-mRNA, is prevalent in most eukaryotic organisms (Barbazuk et al., 2008; Pan et al., 2008; Kornblihtt et al., 2013; Yi et al., 2018; Zhang et al., 2019a) and makes an important contribution to the enhancement of the functional complexity of the transcriptome (Graveley, 2001; Reddy et al., 2013; Abdel-Ghany et al., 2016). Transcriptome complexity plays an important role in increasing the coding capacity of genes, generating proteome diversity, and regulating gene expression and cellular physiological and developmental processes (Lareau et al., 2004; Abdel-Ghany et al., 2016; Wang et al., 2016). It has been shown that over 90% of multi-exonic genes in human (*Homo sapiens*) (Pan et al., 2008), 46% in fruit fly (*Drosophila melanogaster*) (Hansen et al., 2009), and 61% in the model plant thale cress (*Arabidopsis thaliana*) are alternatively spliced (Marquez et al., 2012). Although the functional significance of most spliced isoforms has yet to be fully elucidated, several studies suggest that AS is a profound regulatory process involved in organismal function. For instance, in thale cress, zinc-induced facilitator-like 1 can produce two spliced isoforms, one that regulates stomatal movement and another that influences cellular auxin transport (Reddy et al., 2013). Similarly, the *Bcl-x* gene in fruit fly yields two different isoforms, one of which inhibits apoptosis, while the other activates apoptosis (Chang et al., 2004). In addition, alternative polyadenylation (APA), another post-transcriptional regulatory event in which RNA molecules with different 3' ends originate from distinct polyadenylation sites of a single gene, is emerging as a mechanism widely used to regulate gene expression (Chen et al., 2017b). APA events may alter sequence elements and/or the coding capacity of transcripts, and could be considered a mechanism that adds another layer to the regulation of transcriptome diversity (Shen et al., 2011; Abdel-Ghany et al., 2016; Ha et al., 2018).

However, due to technical limitations, information on these post-transcriptional regulatory events remains limited. Although data from short-read sequencing have accumulated over recent years, it remains an immense challenge to obtain full-length (FL) sequences for each RNA because of difficulties in short read-based assembly, which limits the identification and prediction of post-transcriptional events (Wang et al., 2016; Chen et al., 2017a). In the last few years, Pacific Biosciences (PacBio) single-molecule real-time sequencing has been introduced (Rhoads and Au, 2015). The PacBio isoform sequencing (Iso-Seq) platform can directly produce FL transcripts without an assembly process, providing superior evidence for comprehensive analysis of the splice isoforms of each gene and improving the annotation of existing gene models (Tilgner et al., 2014; Gordon et al., 2015; Wang et al., 2016). Recently, Iso-Seq has led to the discovery of thousands of novel genes and alternatively spliced isoforms in human (Au et al., 2013), mouse (*Mus musculus*) (Karlsson and Linnarsson, 2017), rabbit (*Oryctolagus cuniculus*) (Chen et al., 2017a), sorghum (*Sorghum bicolor*) (Abdel-Ghany et al., 2016), and maize (*Zea mays*) (Wang et al., 2016). These findings indicate that Iso-Seq is sensitive in detecting FL transcripts and serves as a valuable resource for transcriptome complexity research. In addition, studies in aspen (*Populus tremuloides*) (Chao et al., 2019), strawberry (*Fragaria vesca*) (Li et al., 2017), and pig (*Sus scrofa*) (Li et al., 2018) also provide strong evidence that Iso-Seq can complement short-read sequencing in cataloguing and quantifying eukaryotic transcripts.

Spotted sea bass (*Lateolabrax maculatus*) is a euryhaline teleost fish naturally distributed in the northwestern Pacific Ocean, especially along the Chinese coast, reaching south to the borders of Vietnam and north to Korea and Japan (Zhang et al., 2001; Tseng and Hwang, 2008; Seo et al., 2016). It is considered one of the most popular economic fishes because of its high nutritive value and pleasant taste. Since the release of the draft reference genome of *L. maculatus* (Shao et al., 2018; Chen et al., 2019), more functional genes have been discovered. However, most of the existing gene models are derived from *in silico* prediction and lack reliable annotation of alternative isoforms and untranslated regions, which prevents accurate evaluation of transcriptome complexity (Chen et al., 2017a). Hence, our study is crucial for facilitating biological research on *L. maculatus*.

Salinity represents a major abiotic stress and critical environmental factor that directly affects the survival, growth, development, reproduction, and physiological functions of all aquatic organisms (Kultz et al., 2013; Kultz, 2015). *L. maculatus*, a typical euryhaline fish, is capable of inhabiting freshwater, brackish water, seawater, and hypersaline water (Kim et al., 1998). It has been documented that *L. maculatus* can tolerate a considerable range of external salt concentrations (0–45 ppt) and maintain constant internal osmotic homeostasis (Zhang et al., 2019b). Hence, it provides an excellent model with which to identify and characterize osmoregulatory mechanisms. In spotted sea bass, RNA-Seq analysis has been performed to identify hundreds of genes involved in salinity adaptation and osmoregulation (Zhang et al., 2017). In addition, previous RNA-Seq studies in other aquaculture fish species, including Asian seabass (*Lates calcarifer*), striped catfish (*Pangasianodon hypophthalmus*), Mozambique tilapia (*Oreochromis mossambicus*), and Nile tilapia (*Oreochromis niloticus*), have identified several differentially expressed genes in response to distinct salinity concentrations (Xia et al., 2013; Thanh et al., 2014; Ronkin et al., 2015), which were considered candidate osmoregulatory genes. However, due to technological limitations, RNA-Seq lacks the ability to accurately quantify transcripts or isoforms (Steijger et al., 2013). In this study, we applied Iso-Seq to uncover post-transcriptional regulatory events in *L. maculatus* and combined it with gill RNA-Seq to investigate salinity regulation at the transcript level. This is the first application of Iso-Seq in aquaculture teleosts. It provides the first comprehensive view of transcriptome complexity in *L. maculatus* and characterizes differentially expressed transcripts (DETs) involved in osmoregulatory mechanisms, which refines the annotation of the reference genome and serves as a valuable reference for future Iso-Seq studies.

# MATERIALS AND METHODS

# Ethics Statement

All experiments involving animals were conducted according to the guidelines and approved by the respective Animal Research and Ethics Committees of Ocean University of China (Permit Number: 20141201). The field studies did not involve any endangered or protected species.

# Fish Sample Collection for Iso-Seq

Three *L. maculatus* adults (body length: 44.92 ± 4.63 cm, body weight: 551.23 ± 79.84 g) were obtained from Kiaochow Bay of the Yellow Sea, China. The fish were anesthetized with MS-222 and rapidly dissected to collect 13 tissues: brain, hypophysis, gill, heart, liver, stomach, intestine, kidney, spleen, gonad, muscle, fin, and skin. These tissues were immediately frozen in liquid nitrogen and stored at −80°C until RNA extraction.

# RNA Extraction

Total RNA was extracted using TRIzol reagent (Invitrogen, CA, USA) according to the manufacturer's instructions and digested with RNase-free DNase I (Takara, Shiga, Japan) to remove genomic DNA contamination. The reagents and instruments involved in this experiment were treated with 0.1% (vol/vol) diethylpyrocarbonate (DEPC) to maintain RNase-free conditions. The concentration and integrity of RNA were monitored using a NanoDrop ND-1000 (NanoDrop Technologies, DE, USA) and 1% agarose gel electrophoresis, respectively. An Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) was used to assess the quality of the extracted RNA. All RNA samples from the three *L. maculatus* individuals were pooled in equal amounts for the subsequent PacBio Iso-Seq.

# Iso-Seq Library Construction and Sequencing

According to the Iso-Seq protocol, 1 μg of total RNA was reverse-transcribed to generate full-length cDNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, CA, USA). The cDNA was then amplified using the Advantage 2 PCR Kit (Clontech, CA, USA), and PCR products were purified with AMPure PB beads (Beckman Coulter, CA, USA). Purification was followed by size selection using the BluePippin™ Size Selection System (Sage Science, MA, USA) into the following bins: 1–2, 2–3, and 3–6 kb. The three libraries were then constructed using the SMRTbell Template Prep Kit (Pacific Biosciences, CA, USA). Before sequencing, the quality of the libraries was assessed with the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) and a Qubit fluorometer 2.0 (Life Technologies, CA, USA). Libraries were prepared for sequencing by annealing a sequencing primer and adding polymerase to the primer-annealed template. The polymerase-bound template was bound to MagBeads, and a total of 6 SMRT cells were sequenced on the PacBio RS II platform using P6-C4 chemistry (2 cells per library).

# PacBio Long-Read Processing

PacBio polymerase reads were processed into error-corrected reads of insert (ROIs) using SMRT Analysis v2.3 (https://www.pacb.com/products-and-services/analytical-software) with min-Full Pass ≥ 0 and min-Predicted Accuracy > 75%. After ROIs < 50 bp were discarded, the remaining ROIs were classified into full-length non-chimeric (FLNC) and non-full-length (NFL) reads based on the presence of the poly(A) tail signal and the 5' and 3' cDNA primers. FLNC reads were clustered into consensus sequences using the Iterative Clustering for Error Correction (ICE) algorithm (https://www.pacb.com/products-and-services/analytical-software). Combined with NFL reads, consensus sequences were then polished in clusters using Quiver (Chin et al., 2013). Based on the criterion of post-correction accuracy > 99%, consensus sequences were divided into high-quality and low-quality sequences. To improve the accuracy of consensus sequences, low-quality sequences were corrected with the Illumina clean reads using Proovread v2.13.13 with default parameters (Thomas et al., 2014). Consensus sequences were mapped to the reference genome of *L. maculatus* (NCBI BioProject ID: PRJNA407434) using the Genomic Mapping and Alignment Program (GMAP) (Wu and Watanabe, 2005). Mapped sequences were further collapsed using the pbtranscript-ToFU package (http://github.com/PacificBiosciences/cDNA\_primer/) with min-coverage = 85% and min-identity = 90% to generate non-redundant transcripts.
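The read-classification rule described above can be illustrated with a minimal Python sketch. It is not the SMRT Analysis implementation: the primer sequences, the poly(A) run length, and the function name `classify_roi` are hypothetical placeholders, exact string matching stands in for primer alignment, and chimera detection is omitted, so the sketch only separates full-length candidates from non-full-length reads.

```python
# Hypothetical primer sequences, for illustration only (assumption).
FIVE_PRIME = "AAGCAGTGGTATCAACGCAGAGTAC"
THREE_PRIME = "GTACTCTGCGTTGATACCACTGCTT"
MIN_ROI_LEN = 50   # ROIs shorter than 50 bp are discarded (from the text)
MIN_POLYA = 20     # assumed minimum length of the poly(A) run


def classify_roi(seq: str) -> str:
    """Return 'discard', 'FL' (full-length candidate) or 'NFL' for one ROI."""
    if len(seq) < MIN_ROI_LEN:
        return "discard"
    has_5p = seq.startswith(FIVE_PRIME)
    has_3p = seq.endswith(THREE_PRIME)
    # Look for the poly(A) tail immediately upstream of the 3' primer.
    body = seq[: len(seq) - len(THREE_PRIME)] if has_3p else seq
    has_polya = body.endswith("A" * MIN_POLYA)
    return "FL" if (has_5p and has_3p and has_polya) else "NFL"
```

A read is reported as a full-length candidate only when all three signals are present; any read that lacks one of them falls into the NFL class, which in the pipeline above is still used for polishing.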

# Alternative Splicing (AS) Analysis

AStalavista v3.2 software with default parameters was employed to determine the AS events in the non-redundant transcripts obtained above (Foissac and Sammeth, 2007). The AS events were further classified into five major types following the rules in a previous publication (Wang et al., 2016), namely intron retention, exon skipping, alternative 3' splice site, mutually exclusive exon, and alternative 5' splice site.

# Alternative Polyadenylation (APA) Identification

In our study, FLNC reads were selected to identify APA sites using the Transcriptome Analysis Pipeline for Isoform Sequencing (TAPIS pipeline v1.2.1, default parameters) (Abdel-Ghany et al., 2016). A qualified APA event for a gene had to be supported by at least two FLNC reads aligned to the gene locus.
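The read-support filter can be sketched as follows. This is an illustrative simplification rather than the TAPIS algorithm: it assumes that each FLNC read has already been reduced to a `(gene_id, poly(A) position)` pair, applies the two-read threshold per site (one possible reading of the criterion), and does not merge nearby read ends into a single site as a real pipeline would. The function names are hypothetical.

```python
from collections import Counter, defaultdict

MIN_SUPPORT = 2  # at least two FLNC reads (from the text)


def call_polya_sites(read_ends):
    """read_ends: iterable of (gene_id, position) pairs, one per FLNC read.

    Returns {gene_id: sorted list of poly(A) positions with enough support}.
    """
    support = Counter(read_ends)
    sites = defaultdict(list)
    for (gene, pos), n_reads in support.items():
        if n_reads >= MIN_SUPPORT:
            sites[gene].append(pos)
    return {gene: sorted(positions) for gene, positions in sites.items()}


def apa_genes(sites):
    """Genes with two or more supported poly(A) sites, i.e. APA candidates."""
    return sorted(gene for gene, positions in sites.items() if len(positions) >= 2)
```

Genes with exactly one supported site correspond to the single-poly(A) class, and genes with two or more supported sites to the APA class.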

# Long Non-Coding RNA (LncRNA) and Simple Sequence Repeat (SSR) Analysis

Four computational approaches, namely the Coding-Non-Coding-Index (CNCI, v2), Coding Potential Calculator (CPC, v1), Coding Potential Assessment Tool (CPAT, v1.2), and Pfam (v1.5), were combined to identify non-protein-coding RNA candidates from the non-redundant transcripts. Transcripts longer than 200 bp and with more than two exons were selected as lncRNA candidates and further screened using CPC, CNCI, CPAT, and Pfam, which can distinguish protein-coding genes from non-coding genes. The relationship between lncRNAs and target genes was predicted based on their position (< 100 kb upstream or downstream) and base complementarity using the lncTar target gene prediction tool (v1.0) with default parameters (Li et al., 2015).
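The two-stage screen described above (a structural filter followed by the intersection of four coding-potential tools) can be expressed as a short sketch. The `Transcript` class and the function name are hypothetical, and the per-tool results are assumed to be available as sets of transcript IDs called non-coding; running CNCI, CPC, CPAT, and Pfam themselves is outside the scope of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    tid: str
    length: int    # transcript length in bp
    n_exons: int   # number of exons


def lncrna_candidates(transcripts, noncoding_calls):
    """Apply the length/exon filter, then intersect the tool calls.

    noncoding_calls: {tool_name: set of transcript IDs called non-coding}.
    """
    # Structural filter as stated in the text: > 200 bp and more than two exons.
    passed = {t.tid for t in transcripts if t.length > 200 and t.n_exons > 2}
    # A transcript is kept only if every tool calls it non-coding.
    for called in noncoding_calls.values():
        passed &= called
    return passed
```

Taking the intersection is the conservative choice: a single tool predicting coding potential is enough to exclude a transcript.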

Simple sequence repeats (SSRs) were identified using the Microsatellite identification tool (MISA, v1.0) with default parameters (Beier et al., 2017). Only non-redundant transcripts ≥ 500 bp in size were selected for SSR detection. A total of seven SSR types were identified, namely mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide SSRs and compound SSRs.
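To illustrate what perfect-SSR detection involves, the sketch below scans a sequence with regular expressions. It is not MISA: the minimum repeat numbers in `MIN_REPEATS` are assumed, MISA-style thresholds rather than values stated in this paper, compound SSRs are not handled, and the function names are hypothetical.

```python
import re

# Assumed minimum repeat counts per unit size (not taken from the paper).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_len=500):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Sequences shorter than min_len are skipped, mirroring the 500 bp cutoff.
    """
    if len(seq) < min_len:
        return []
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # avoid reporting 'AA' x 6 as a di-SSR
                hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)
```

The `is_primitive` check prevents a mononucleotide run from being reported a second time as a di- or tri-nucleotide repeat.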

# Fusion Transcripts and Transcription Factors (TFs) Detection

Consensus sequences from PacBio Iso-Seq were selected for fusion transcript identification. A fusion transcript is a chimeric RNA encoded by a single fusion gene or by two different genes that are subsequently joined by trans-splicing. The criteria used to identify candidate fusion transcripts were as follows: (A) the transcript maps to two or more loci; (B) the minimum coverage for each locus is 5% and the minimum coverage in bp is ≥ 1 bp; (C) the total coverage is ≥ 95%; and (D) the distance between the loci is at least 10 kb (Wang et al., 2016; Li et al., 2018).
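Criteria (A) to (D) translate directly into a predicate over the alignments of one consensus sequence. The sketch below is an illustration under stated assumptions: the `Alignment` record and function name are hypothetical, per-locus coverages are assumed to refer to non-overlapping parts of the transcript (so they can be summed), and loci on different chromosomes are treated as satisfying the distance criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    chrom: str
    start: int         # genomic start, 0-based
    end: int           # genomic end, exclusive
    query_cov: float   # fraction of the transcript covered by this locus


def is_fusion_candidate(alns, min_locus_cov=0.05, min_total_cov=0.95,
                        min_dist=10_000):
    # (A) the transcript must map to two or more loci
    if len(alns) < 2:
        return False
    # (B) each locus covers at least 5% of the transcript and at least 1 bp
    if any(a.query_cov < min_locus_cov or a.end - a.start < 1 for a in alns):
        return False
    # (C) the loci together cover at least 95% of the transcript
    if sum(a.query_cov for a in alns) < min_total_cov:
        return False
    # (D) loci on the same chromosome must be at least 10 kb apart
    ordered = sorted(alns, key=lambda a: (a.chrom, a.start))
    for left, right in zip(ordered, ordered[1:]):
        if left.chrom == right.chrom and right.start - left.end < min_dist:
            return False
    return True
```

The distance check guards against calling an ordinary long intron, or a locally mis-split alignment, a fusion event.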

Animal TFDB 3.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/) was set as the reference transcription factor database. HMMER 3.0 was used to identify TFs and assign transcripts to different families (Eddy, 2009).

# Functional Annotation

The non-redundant transcripts were aligned against several protein and nucleotide databases, including Clusters of Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), conserved Protein families or domains (Pfam), Swissprot, NCBI non-redundant proteins (NR), and non-redundant nucleotide (NT) databases, using BLASTX (v2.2.26) with a cutoff E-value ≤ 1e-5 (Camacho et al., 2009).

# Fish and Experimental Treatments for RNA-Seq

*L. maculatus* adults (body length: 21.92 ± 3.17 cm, body weight: 158.23 ± 18.77 g) were acquired from Shuangying Aquaculture Company (Dongying, Shandong Province, China) and acclimated for a week. Water temperature (13.5–14.5°C), pH (7.8–8.15), salinity (30 ppt), and DO (6.7–7.5 mg/L) were kept stable during the acclimation. After acclimation, 90 individuals were randomly divided into two groups, freshwater (FW, 0 ppt) and seawater (SW, 30 ppt), in triplicate tanks at a density of 15 individuals per tank. After 30 days of rearing, three individuals per tank were anesthetized with MS-222 and rapidly sampled for gill tissues, which were frozen in liquid nitrogen and stored at −80°C until RNA extraction.

# RNA-Seq Library Construction and Sequencing

Total RNA was extracted using the TRIzol method described above (see the section *RNA Extraction*). Equal amounts of RNA from the gill tissues of three individuals (500 ng per individual) from the same tank were pooled as one sample to minimize the variation among individuals. A total of 6 libraries (3 replicate samples × 2 treatment groups) were constructed using the TruSeq™ RNA Sample Prep Kit (Illumina, CA, USA). The libraries were sequenced on the Illumina HiSeq 4000 platform, and 150 bp paired-end raw reads were generated. The raw reads were then processed using Trimmomatic software (Bolger et al., 2014), and clean reads were obtained for the following analysis.

# Differentially Expressed Transcript (DETs) Analysis

To estimate the expression level of transcripts, the Iso-Seq database was added to the genome database to construct a new database with a modified general feature format (GFF) file. Then, the clean reads from the above RNA-Seq were mapped to the new database using STAR software (v2.5.3) (Dobin et al., 2013). The Cuffquant and Cuffnorm modules of the Cufflinks program (v2.2.1) were used to quantify transcript abundance based on the mapped results (Trapnell et al., 2010). A transcript was defined as expressed when all of its splice junctions were supported by clean reads. Reads mapped to multiple isoforms of the same gene were distributed according to the uniquely mapped reads by the Cuffquant and Cuffnorm modules. The mapped reads were counted and subsequently normalized to fragments per kilobase of transcript per million fragments mapped (FPKM) as the expression value. Differential expression analysis between the FW and SW environments was performed using the DESeq R package (1.10.1). The *P*-values were adjusted using Benjamini and Hochberg's approach for controlling the false discovery rate (FDR). FDR < 0.05 and |fold change| ≥ 2 were set as the thresholds for significantly differentially expressed transcripts (DETs).
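The final thresholding step can be sketched in plain Python. The example implements the Benjamini-Hochberg adjustment and the two cutoffs named above. It does not reproduce the DESeq statistical model; the *P*-values are taken as given, and the function names and the pseudocount used to avoid division by zero are assumptions made for illustration.

```python
import math

PSEUDOCOUNT = 0.01  # assumed small constant to avoid division by zero


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_dets(records, fdr_cutoff=0.05, min_fold=2.0):
    """records: list of (transcript_id, mean_fpkm_fw, mean_fpkm_sw, pvalue).

    Returns {transcript_id: 'up' or 'down'} for SW relative to FW.
    """
    fdr = benjamini_hochberg([r[3] for r in records])
    dets = {}
    for (tid, fw, sw, _p), q in zip(records, fdr):
        fold = (sw + PSEUDOCOUNT) / (fw + PSEUDOCOUNT)
        if q < fdr_cutoff and abs(math.log2(fold)) >= math.log2(min_fold):
            dets[tid] = "up" if fold > 1 else "down"
    return dets
```

Working with the absolute log2 fold change treats a twofold increase and a twofold decrease symmetrically, which is what the |fold change| ≥ 2 criterion implies.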

# Validation Experiments

For RT-PCR validation of AS events, fusion transcripts, and novel transcripts, total RNA was reverse-transcribed to cDNA using the PrimeScript RT reagent kit (Takara, Shiga, Japan) following the manufacturer's instructions. Tenfold-diluted cDNA served as the template, and the Fastpfu reagent kit (TransGen, Beijing, China) was used for RT-PCR amplification. Transcript-specific primers were designed to span the predicted splicing events using Primer 5 software (**Supplementary Table 1**). PCR conditions were 5 min at 94°C, followed by 35 cycles of 94°C for 30 s, Tm for 30 s, and 72°C for a period that depended on the product size, and a final extension at 72°C for 10 min. PCR products were visualized on a 1% agarose gel stained with GelStain (TransGen, Beijing, China). For the APA validation, 1 μg RNA was used to synthesize first-strand cDNA using the SMART™ RACE cDNA Amplification Kit (Clontech, California, USA). Gene-specific primers (**Supplementary Table 1**) were designed for the 3' rapid amplification of cDNA ends (3' RACE). PCRs were performed using Taq DNA Polymerase (Clontech, California, USA) under the following touchdown cycling conditions: denaturation at 94°C for 3 min, followed by 20 cycles of 94°C for 15 s, annealing at a temperature decreasing from 60 to 50°C by 0.5°C each cycle, and 72°C for 40 s, with a final extension at 72°C for 10 min. PCR products were also visualized on a 1% agarose gel stained with GelStain (TransGen, Beijing, China). Finally, the products were purified, subcloned into the T1 vector, propagated in *Escherichia coli* DH5α, and sequenced by the Sanger method.

Quantitative real-time PCR (qPCR) analysis was employed to verify differentially expressed transcripts. Total RNA was isolated from the gill tissues of fish exposed to freshwater and seawater in the salinity challenge experiment described above. cDNA was synthesized using the PrimeScript RT reagent kit (TaKaRa, Shiga, Japan). All transcript-specific primers for qPCR were designed using Primer 5 software and are listed in **Supplementary Table 2**. The SYBR Premix Ex Taq kit was used for qPCR (Takara, Shiga, Japan). Each PCR reaction consisted of 2 μl cDNA, 10 μl SYBR Premix Ex Taq, 0.4 μl each of the forward and reverse primers, 0.4 μl ROX Reference Dye, and 6.8 μl ddH2O, for a final volume of 20 μl. qPCR was performed on an Applied Biosystems 7300 machine (Applied Biosystems, CA, USA) under the following conditions: 95°C for 30 s and 40 cycles of 95°C for 5 s, 55°C for 30 s, and 72°C for 30 s. The relative expression levels of transcripts were normalized to 18S ribosomal RNA, and the 2^−ΔΔCT method was used for subsequent analysis. The correlation coefficient between the RNA-Seq-based expression changes of differentially expressed transcripts and the qPCR results was determined with SPSS 17.0 software (Bryman and Cramer, 2011). One-way ANOVA followed by Duncan's multiple range test was conducted to identify significant differences at *P <* 0.05.
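The 2^−ΔΔCT calculation mentioned above is short enough to write out. In this minimal sketch, the target CT is first normalized to the reference gene (18S rRNA in this study) within each condition, and the treated sample is then compared with the control. The function and argument names are illustrative, and the usual assumption of roughly 100% amplification efficiency applies.

```python
def relative_expression(ct_target_treat, ct_ref_treat,
                        ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of a target transcript by the 2^-ddCT method."""
    delta_treat = ct_target_treat - ct_ref_treat   # dCT in the treated sample
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl      # dCT in the control sample
    delta_delta = delta_treat - delta_ctrl         # ddCT
    return 2.0 ** (-delta_delta)
```

A value above 1 indicates higher expression in the treated sample than in the control, and a value below 1 indicates lower expression.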

# RESULTS

# PacBio Iso-Seq and Bioinformatic Analysis

In total, six SMRT cells covering three size-fractionated libraries (1–2, 2–3, and 3–6 kb) were used for Iso-Seq, yielding 13.42 Gb of clean data. The bioinformatic analysis pipeline for our Iso-Seq data is outlined in **Figure 1A**. In detail, 363,371 ROIs were retained after filtering, and the mean length was 3,120 bp (**Table 1**). The density plot of ROI lengths showed three obvious peaks, consistent with the sizes of the three libraries (**Supplementary Figure 1**). ROIs were further classified into FLNC and NFL reads based on the presence of the 5' primer, 3' primer, and poly(A) tail. A total of 39.79%, 44.88%, and 44.77% of ROIs qualified as FLNC reads in the 1–2, 2–3, and 3–6 kb libraries (**Table 1**), respectively, with an average FLNC ratio of 42.5% (**Figure 1B**). ICE was applied for sequence clustering, yielding 60,573 consensus sequences (**Table 1**). Combined with NFL reads, these consensus sequences were corrected by Quiver. A total of 68.92% (41,744) of the sequences were defined as high-quality sequences. The remaining 18,829 consensus sequences were defined as low-quality sequences, which were subsequently corrected with Illumina clean reads. Finally, these consensus sequences were collapsed by the ToFU process, yielding 28,809 non-redundant transcripts retained for the following study (**Table 1**). The reference genome and Iso-Seq data information is shown in **Figure 2**.

# Transcripts and Alternative Splicing (AS) Events

A total of 28,809 non-redundant transcripts were compared against the *L. maculatus* reference genome. In total, 88.9% of non-redundant transcripts (25,637) were aligned to 12,477 annotated gene loci (**Figure 3A**), covering 52.7% of the *L. maculatus* genome loci (23,657). Based on the splice sites of the genome and the structures of transcripts, the 25,637 transcripts annotated in the genome were further classified into two groups as follows (**Figure 3A**): 1) known isoforms (7,357, 25.5%) sharing the same splice sites with the existing *L. maculatus* gene models; and 2) novel isoforms (18,280, 63.5%) that share at least one splice site with existing *L. maculatus* gene models but differ in other splice sites. Typical examples are shown in **Supplementary Figure 2A**. The remaining 3,172 transcripts (11.1%) were absent from any annotated gene locus in the *L. maculatus* genome and were identified as novel isoforms from novel genes. The 3,172 novel isoforms were clustered into 2,580 gene loci defined as novel genes (**Figure 3B**, **Supplementary Figure 2B**). To further investigate their homology and annotation, these novel gene loci were aligned against the Swiss-Prot and NR databases. A total of 24.96% of novel gene loci (644) were annotated in the Swiss-Prot protein database, and 43.64% of novel loci (1,126) were annotated in the NR database, indicating their homology to genes in other species. The remaining genes absent from the databases are likely species-specific genes in *L. maculatus*. Four novel isoforms from novel genes were randomly selected for validation by RT-PCR (**Supplementary Figure 3A**).
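The three-way classification above can be pictured as a comparison of splice-junction sets. The sketch below is a simplified illustration, not the pipeline used in the study: the function name is hypothetical, each transcript is represented only by its set of `(donor, acceptor)` coordinates, and a transcript sharing no splice site with any annotated model is labelled a novel-gene candidate, whereas the study made that call from overlap with annotated gene loci.

```python
def classify_isoform(junctions, annotated_models):
    """Classify one transcript against the annotated models at a locus.

    junctions: set of (donor, acceptor) splice-site coordinates.
    annotated_models: list of junction sets, one per annotated transcript.
    """
    # Known isoform: identical splice sites to an existing gene model.
    if any(junctions == model for model in annotated_models):
        return "known isoform"
    query_sites = {site for junction in junctions for site in junction}
    annotated_sites = {site for model in annotated_models
                       for junction in model for site in junction}
    # Novel isoform: at least one shared splice site, but not all.
    if query_sites & annotated_sites:
        return "novel isoform"
    # Simplification: no shared splice site is treated as a novel gene.
    return "novel gene candidate"
```

Comparing individual splice sites rather than whole junctions lets a transcript that uses an annotated donor with a new acceptor still count as a novel isoform of a known gene.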

A total of 15,057 gene loci (12,477 existing gene loci and 2,580 novel gene loci) were identified in our Iso-Seq data, of which 6,396 (42.5%) were found to generate at least two different isoforms (**Figure 3C**). Notably, 476 out of 15,057 (3.2%) genes produced more than 5 isoforms per gene locus, generating a total of 6,087 unique isoforms that accounted for 21% of the total Iso-Seq transcripts. To investigate the potential function of the genes with numerous isoforms, KEGG pathway enrichment analysis was performed for the 847 genes harboring more than 4 isoforms. The results revealed that the most enriched pathways were related to the phagosome, apoptosis, and the AGE-RAGE signaling pathway (**Supplementary Figure 4**).

During AS events, splice sites are used with greater or lesser frequency to produce messages that differ in their exon content and structure (Liu et al., 2017). Although this happens frequently, only a few of the AS events have been reported in aquaculture species. In this study, a total of 10,249 AS events were detected from the Iso-Seq database and further classified into five main types (**Figures 2E** and **3D, E**, **Supplementary Figures 5A**–**E**). Strikingly, intron retention (39.9%, 4,089) was the most enriched type of AS event, and exon skipping (27.99%, 2,869) was the second most prevalent AS event. The number of the two AS types accounted for more than half (67.86%) of the total AS events in *L. maculatus*.

To verify the accuracy of isoforms identified by Iso-Seq, 10 genes with predicted AS events were randomly selected, and the existence and size of isoforms were validated by RT-PCR. Primers were designed in the overlapping regions of various transcripts derived from the same gene. The experimental results demonstrated that the amplified product sizes were consistent with predicted target fragments by Iso-Seq, confirming the credibility of our Iso-Seq data (**Figure 4**).

# Alternative Polyadenylation (APA) Events

In our Iso-Seq data, the TAPIS pipeline was used to detect APA events in *L. maculatus*. Qualified gene loci for APA had to be supported by at least two aligned FLNC reads. Of the 6,506 detected genes with evidence of a poly(A) site, 5,147 genes (79.11%) were found to contain a single poly(A) site (**Figures 2F** and **5A**). The remaining 1,359 (20.89%) genes contained two or more detected poly(A) sites, and 14 genes were predicted to generate more than 5 poly(A) sites. For example, the transcript structure of the *haptoglobin* gene, which contained several distinct poly(A) sites, is illustrated in **Figure 5B**. Additionally, a gene with APA events was randomly selected for validation by 3' RACE and Sanger sequencing (**Supplementary Figure 3B**).

### TABLE 1 | Statistics of Iso-Seq data in L. maculatus.

# Long Non-Coding RNA (LncRNA)

In our Iso-Seq database, a total of 3,112 lncRNAs were ultimately identified by intersection analysis of four computational approaches, namely CNCI, CPC, CPAT, and Pfam (**Figures 2G** and **6A**). Based on their biogenesis positions relative to protein-coding genes in the *L. maculatus* genome, 2,734 (87.85%) lncRNAs were further divided into four categories: 22.69% (706) were generated from intergenic regions (lincRNA), 20.18% (628) from intronic regions (intronic-lncRNA), 10.32% (321) from the antisense strand (antisense-lncRNA), and 34.67% (1,079) from the sense strand (sense-lncRNA) (**Figure 6B**). The relationship between lncRNAs and target genes was predicted based on their position (< 100 kb upstream or downstream) and base complementarity. A total of 13,566 protein-coding gene loci were found within 100 kb upstream or downstream of 3,007 lncRNAs. In total, 909 lncRNAs were found to have a base-pairing interaction with 14,080 mRNAs.

# Simple Sequence Repeat (SSR)

Of the 21,432 selected novel transcripts (novel isoforms from known gene loci and novel transcripts from novel gene loci), a total of 13,450 transcripts were found to contain 29,609 SSR motifs (**Supplementary Table 3**). Additionally, more than half of the SSR-containing transcripts (7,401, 55.03%) contained at least 2 SSR motifs, and 5,019 SSR motifs (16.95%) were classified as compound repeats. Of the detected SSR motifs, the mono-nucleotide motif (240.30/Mb) had the highest density, followed by di- (82.00/Mb), compound (69.10/Mb), tri- (48.66/Mb), tetra- (5.40/Mb), penta- (0.87/Mb), and hexa- (0.45/Mb) nucleotide motifs (**Figure 6C**).

# Fusion Transcripts and Transcription Factors (TFs)

Fusion transcripts, such as chimeric mRNA transcripts, result from either trans-splicing of distinct genes or aberrant chromosomal translocations (Wang et al., 2016). In our Iso-Seq dataset, a total of 365 fusion transcripts were identified, and their chromosomal distribution is shown in **Figure 2H**. Among them, 43 fusion transcripts were intrachromosomal, while the others (322) were interchromosomal. The results of GO enrichment analysis showed that the fusion transcripts were primarily (top six) associated with cell (GO:0005623), cell part (GO:0044464), catalytic activity (GO:0003824), binding (GO:0005488), cellular process (GO:0009987), and single-organism process (GO:0044699) (**Supplementary Figure 6**). Two fusion transcripts (PB.19, PB.130) were randomly selected and verified by RT-PCR (**Supplementary Figure 3C**).

In our Iso-Seq data, a total of 1,194 TF transcripts generated from 723 TF genes were identified, and their detailed information is shown in **Supplementary Table 4**. Based on the Animal TFDB 3.0 database classification, these TFs belong to more than 52 families. This is the first extensive identification of TFs using a transcriptome dataset in *L. maculatus*, which provides a useful foundation for future TF studies.

# Differentially Expressed Transcripts (DETs) in Response to FW and SW Environment

To capture transcript-level expression differences in response to different salinity environments, the Illumina RNA-Seq data from gill tissue were aligned to the refined database, which combined the reference genome and the Iso-Seq database, for quantification. In total, 265.90 million clean reads were mapped to the new database. Using the thresholds described in the Methods (FDR < 0.05 and |fold change| ≥ 2), a total of 518 DETs covering 497 gene loci were identified, of which 264 transcripts were up-regulated and 254 transcripts were down-regulated in the SW relative to the FW group (**Supplementary Table 5**). The distribution of DETs is illustrated in **Supplementary Figure 7**.

These candidate DETs were classified into eight functional groups based on the combination of GO and KEGG annotation, enrichment analysis, and published literature: energy metabolism, immune response, molecule and ion transport and metabolism, protein biosynthesis, protein degradation, RNA processing and modification, signal transduction, and structure reorganization (**Supplementary Table 6**). These results indicated that transcripts showed different expression patterns in response to FW and SW environments. As shown in **Supplementary Figure 8**, exon skipping was the most frequent AS type among these DETs, accounting for 35.37% (208) of events, followed by intron retention (32.48%, 191); these proportions differed from those observed in the full Iso-Seq results.
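As a quick consistency check on the percentages above, the two reported counts can be converted back to shares of a common total. The total of 588 AS events is not stated in the text; it is back-calculated here from the reported count and percentage pairs and should be read as an inference.

```python
# Reported AS event counts among DETs (from the text above).
reported_counts = {"exon skipping": 208, "intron retention": 191}

# Inferred, not stated in the text: the total that reproduces both percentages.
IMPLIED_TOTAL_EVENTS = 588


def percent(count, total):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


for event_type, count in reported_counts.items():
    print(event_type, percent(count, IMPLIED_TOTAL_EVENTS))
```

Both shares round to the published values with this total, which suggests the percentages refer to AS events detected in DETs rather than to the 518 DETs themselves.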

To verify the accuracy of the expression patterns of DETs by Iso-Seq, we randomly selected nine transcripts derived from four genes for qPCR validation (**Figure 7**). The experimental results demonstrated that the expression patterns were consistent with our analysis results, confirming the credibility and accuracy of DETs results.

# Genes With Distinct Differentially Expressed Transcripts (DETs)

In our study, a total of 518 DETs were generated from 497 genes, indicating that some genes produced two or more DETs. A total of 17 genes were found to generate several splice variants of which two or more were differentially expressed. Based on their expression patterns, two situations were observed for these genes (**Table 2**). In the first, DETs from the same gene locus exhibited similar expression trends. For example, all three transcripts of the *cysteine dioxygenase type 1* (*cdo1*) gene were significantly down-regulated in the SW relative to the FW group, and a similar pattern was found for the transcripts of the *sodium/potassium-transporting ATPase subunit beta-233* (*nkabeta233*) gene. In the second, DETs of the same gene exhibited opposite expression trends (seven genes; their structures are illustrated schematically in **Supplementary Figure 9**). For example, *2,4-dienoyl-CoA reductase, mitochondrial* (*decr1*), which is involved in the decomposition of polyunsaturated fatty acids, was alternatively spliced to generate two transcripts with opposite expression patterns. Similar events were also found in *6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase 3* (*pfkfb3*) and *prostaglandin D2 receptor 2* (*ptgdr2*). These results suggest that spliced transcripts of the same gene may be involved in diverse physiological functions.
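The two situations described above amount to grouping DETs by gene and comparing their directions of change. The sketch below shows one way to do that; the input layout and transcript IDs are hypothetical, and only the directions for *cdo1* (all down in SW) and *decr1* (opposite trends) follow the text.

```python
from collections import defaultdict


def group_directions(dets):
    """Map each gene to the list of directions of its DETs."""
    directions = defaultdict(list)
    for gene, _transcript_id, direction in dets:
        directions[gene].append(direction)
    return directions


def classify_genes(dets):
    """Label each gene by whether its DETs agree in direction."""
    labels = {}
    for gene, dirs in group_directions(dets).items():
        if len(dirs) < 2:
            labels[gene] = "single DET"
        elif len(set(dirs)) == 1:
            labels[gene] = "same trend"
        else:
            labels[gene] = "opposite trends"
    return labels


# Toy input: (gene, hypothetical transcript ID, direction in SW relative to FW).
toy_dets = [
    ("cdo1", "cdo1.t1", "down"),
    ("cdo1", "cdo1.t2", "down"),
    ("cdo1", "cdo1.t3", "down"),
    ("decr1", "decr1.t1", "up"),
    ("decr1", "decr1.t2", "down"),
    ("geneX", "geneX.t1", "up"),  # placeholder gene with a single DET
]
print(classify_genes(toy_dets))
```

Genes labeled "opposite trends" are the candidates for isoform-specific regulation, which gene-level quantification would average out.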

# Differentially Expressed Transcription Factors

Of the 518 DETs, 17 were identified as TFs belonging to eight families: C2H2-ZF, bHLH, ETS, Fork head, HMG, Homeobox, MBD, and RHD (**Supplementary Table 7**). Both the C2H2-ZF and bHLH TF families have well-characterized roles in stress responses (Fujita et al., 2006; Terova et al., 2008; Steinberg, 2012). Of the 17 differentially expressed TFs, 14 were up-regulated in the SW relative to the FW group, suggesting that TFs may play important roles in the response to hypertonic stress and enhance the salt tolerance of *L. maculatus*.

# DISCUSSION

In our study, we employed PacBio Iso-Seq to uncover the complexity of the *L. maculatus* transcriptome, providing the first comprehensive view of splice variants in aquaculture teleosts. Using Iso-Seq, 2,580 novel genes were discovered, accounting for more than 10% of the total number of genes in the *L. maculatus* genome. The most striking gene was *immunoglobulin heavy chain*, with 279 unique transcripts, even more than the 247 splicing variants of *neurexin-1-alpha* in mouse (Barbara et al., 2014). This suggests that Iso-Seq is advantageous for identifying novel gene loci and detecting alternative transcripts, which is consistent with a previous study. In addition, post-transcriptional events (AS and APA), lncRNAs, and fusion transcripts were predicted to improve the understanding of the transcriptome complexity of *L. maculatus*. These results are a valuable resource for further analysis of post-transcriptional events and for refining the annotation of the *L. maculatus* reference genome.

Over the past decade, it has been shown that AS is a major mechanism for enhancing transcriptome and proteome diversity (Keren et al., 2010). A given AS event is the outcome of cooperative or antagonistic interactions between RNA *cis*-elements and splicing factors (Black, 2003; Matlin et al., 2005), including members of the serine-arginine-rich protein family (Fu, 1995), members of the heterogeneous nuclear ribonucleoprotein family (Krecic and Swanson, 1999), and other specific proteins (Underwood et al., 2005). Accumulating evidence indicates that numerous stimuli, such as growth factors, cytokines, and stress, alter the choice of splice sites and produce multiple transcripts (Barbazuk et al., 2008). Multiple transcripts in teleosts could promote tolerance to stresses (Xia et al., 2017; Tan et al., 2018). In our study, several typical splicing factors, including *serine/arginine-rich splicing factor 1*, *serine/arginine-rich splicing factor 7*, *RNA-binding protein 5*, *RNA-binding protein 33*, *RNA-binding protein 39*, and *RNA-binding protein 47*, were differentially expressed under salinity challenge, indicating that splicing factors and AS events in *L. maculatus* can be activated by salinity stimuli. Stress-induced AS events could increase tolerance to stress by two different mechanisms. (1) Stress-induced AS events could generate aberrant transcripts with splicing errors, which would be removed by nonsense-mediated mRNA decay (Wollerton et al., 2004; Chang et al., 2007). This mechanism could weaken the function of the corresponding genes by decreasing the abundance of functional transcripts (Maquat, 2004; Chang et al., 2014; Cui et al., 2014). (2) AS transcripts could encode unique proteins, often with alterations in localization, activity, and function (Wang et al., 2008; Kalam et al., 2017). Moreover, changes in biological function and regulation of expression abundance are largely independent processes that increase organismal tolerance to stress.

For example, in humans, the Na+/K+/2Cl- cotransporter (*nkcc2*) is a key regulator of salt and water homeostasis in the kidney. At least three *nkcc2* transcripts are generated *via* differential splicing of exon 4 (Schiessl and Castrop, 2015). Exon 4 encodes the second transmembrane domain, which is crucially involved in Cl- binding (Haas and Mcmanus, 1983; Schiessl and Castrop, 2015). As a result, these *nkcc2* transcripts differ markedly in their ion affinities and transport characteristics (Haas and Mcmanus, 1983; Schiessl and Castrop, 2015). Differential *nkcc2* splicing is needed for enhanced ion reabsorption during a salt-restricted diet, even without changes in total *nkcc2* abundance (Schiessl and Castrop, 2015). In our study, three *nkcc2* transcripts were also identified in *L. maculatus*, and their splicing modes are shown in **Supplementary Figure 5A**. These transcripts were generated by splicing modes similar to those in humans: five exons were lost between exons 2 and 8, and an intron was retained between exons 23 and 24. These splicing modes may also change the transmembrane domain of *nkcc2* and influence its ion affinities. However, their specific physiological functions require further study. In addition, the expression abundance of AS transcripts can be regulated to cope with stress, which has been widely reported in previous transcriptomic studies (Xia et al., 2017). In our study, a total of 518 DETs were identified in response to different salinity environments. However, the functional significance of most spliced transcripts in teleosts is yet to be fully elucidated. Their gene functions are therefore discussed below.

In response to cell shrinkage and swelling caused by salinity stress, fishes need to cope with salt depletion or gain and with water loss or gain. In the molecule and ion transport and metabolism group, the expression of *sodium/potassium-transporting ATPase subunit alpha-2* and *beta-233* transcripts was significantly up-regulated in the FW environment. Both genes are members of the sodium/potassium-transporting ATPase family, which plays an important role in providing a driving force for ion transport to maintain cell osmotic balance and volume in euryhaline teleosts, such as Senegal sole (*Solea senegalensis*) (Skou and Esmann, 1992; Feng et al., 2002; Armesto et al., 2014). *Solute carrier family 4 a1* (*slc4a1*) is generally accepted to be a bifunctional protein with both Cl-/HCO3- exchange and Cl-/taurine channel functions (Romero et al., 2013). It has been proposed that hypotonic stress induces taurine movements *via* an anion channel that depends on or is controlled by *slc4a1* (Fiévet et al., 1995). Consistent with previous studies, our data indicated that the expression of *slc4a1* was up-regulated in the FW environment in response to swelling stress. In coping with hypertonic stress, the *adenosylhomocysteinase 2* gene could reduce the apparent affinity for intracellular Mg2+ in the inhibition of *slc4a1* currents, which explains the high expression level of *adenosylhomocysteinase 2* in the SW environment (Soichiro and Toru, 2014).

### TABLE 2 | List of genes with distinct differentially expressed transcripts (DETs) in SW relative to FW environment.

An adequate and timely energy supply is a prerequisite for the enzymes and transporters used in iono- and osmoregulatory processes (Tseng and Hwang, 2008). The oxidation of glucose and fatty acids is the major source of energy for organisms (Lavrentyev et al., 2004). In the energy metabolism group, transcripts of *6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase 3* and *2,4-dienoyl-CoA reductase* were differentially expressed in response to salinity stress. *6-phosphofructo-2-kinase/fructose-2,6-bisphosphatase 3* plays a role in maintaining elevated fructose-2,6-bisphosphate levels; fructose-2,6-bisphosphate is considered the major regulator of carbon flux through glycolysis (Sakakibara et al., 1997; Chesney et al., 1999; Alexander et al., 2002; Kawaguchi et al., 2015). *2,4-dienoyl-CoA reductase* encodes an essential enzyme that participates in the beta-oxidation and metabolism of polyunsaturated fatty enoyl-CoA esters (Gurvitz et al., 1999).

Adaptive and acclimatory responses of fish to salinity stress depend on efficient mechanisms of osmosensing and osmotic stress signaling (Kultz, 2015). Instead of directly coupling osmosensors to osmotic effector proteins, large-scale osmoregulatory mechanisms operate by linking molecular osmosensors to cell signaling pathways that initiate adaptive reactions (Evans, 2010). In our study, several DETs were involved in typical signal transduction, such as *mitogen-activated protein kinase kinase kinase 14*, *tyrosine-protein kinase Fyn-like*, *rho GTPase-activating protein 35*, *tyrosine kinase 2*, and *serine/threonine-protein kinase Sgk2*. They may integrate and amplify signals from osmosensors to activate appropriate downstream targets mediating physiological acclimation (Kültz, 2010; Zhang et al., 2017).

In addition, one of the *heat shock 70 kDa protein* transcripts was differentially expressed in the SW group. The heat shock 70 kDa protein, a chaperone, is pivotal in maintaining protein homeostasis by interacting with stress-denatured proteins to prevent their aggregation and misfolding (Parsell and Lindquist, 1993). In the protein degradation group, many DETs were involved in ubiquitination. Ubiquitin acts in cells as a covalent modifier of proteins for functionalization and degradation, a process that depends on ubiquitin ligases (Lyu et al., 2018). E3 ubiquitin ligases are the final enzymes in the ubiquitin-proteasome pathway, regulating protein degradation, cell growth, and apoptosis in response to environmental changes (Mani and Gelmann, 2005; Sardella and Kultz, 2009; Li et al., 2014).

Cytoskeletal organization is notably affected by perturbations in cell volume, and cytoskeletal proteins have therefore been considered putative osmosensors. Correspondingly, several DETs were found to encode structural components of the cytoskeleton, such as *cuticle protein*, *filamin-B*, and *beta tubulin*. In addition, previous reports demonstrate that salinity can enhance the abundance of innate immune defense proteins, and that chronic salinity stressors can stimulate the proliferation and antimicrobial functions of innate immune cells, as well as the release of pro-inflammatory cytokines, in several euryhaline and stenohaline fish species (Cuesta et al., 2005; Delamare-Deboutteville et al., 2006; Jiang et al., 2008; Schmitz et al., 2016). In *L. maculatus*, several transcripts encoding immune-related proteins were also differentially expressed, such as *classical MHC class I molecule alpha-chain*, *tumor necrosis factor receptor superfamily member 6B*, *IgGFc-binding protein-like*, and *leucine-rich-repeat-containing protein C3*.

Accumulating evidence indicates that TFs are also crucial in mediating organismal adaptation to salinity stress by activating or suppressing downstream genes in the pathway (Fujita et al., 2006; Nie et al., 2019). TFs are themselves greatly affected by AS events, and TF transcripts with an alternative function are often low in abundance. One interesting example is the sex determination mechanism of the fruit fly. *Sex-lethal*, acting as a master regulatory switch in female flies, plays a key role in orchestrating the changes in gene expression responsible for all aspects of sexual determination (Förch and Valcárcel, 2003). Although *Sex-lethal* transcripts are present in both sexes, the *Sex-lethal* protein is produced only in female flies. This results from a critical difference between the transcripts in the two sexes: exon 3, which contains in-frame stop codons, is included in males and skipped in females (Salz et al., 1989; Bell et al., 1991). However, little is known about similar mechanisms in the response to salinity. In our study, a total of 1,194 TF transcripts from 723 genes were identified in *L. maculatus*, suggesting that AS events are common in the TF genes of this species. Whether these TF transcripts differ in biological function remains an open question that will require further research. Additionally, a total of 17 TF transcripts were differentially expressed after salinity change, including members of C2H2-ZF, bHLH, ETS, and other families. Previous studies have demonstrated that C2H2-ZF (Steinberg, 2012), bHLH (Terova et al., 2008; Liu et al., 2009), Homeobox (Nie et al., 2019), RHD (Carlsen et al., 2004), and ETS (Wasylyk et al., 1998) TFs could be crucial in increasing stress tolerance through signal transduction or pathway modulation. MBD TFs are mainly involved in the cytosine methylation of nuclear DNA (Nan et al., 1998), and HMG proteins are ubiquitous nuclear proteins that bind to DNA and nucleosomes and induce structural changes in the chromatin fiber (Hock et al., 2007). In our study, the 17 differentially expressed TFs point to important roles for these factors in the response to salinity change in *L. maculatus*.

Post-transcriptional regulatory mechanisms, including AS, APA, and fusion transcripts, make essential contributions to the regulation of physiological functions in aquaculture species. For example, AS events have been studied in Pacific oyster (*Crassostrea gigas*) (Huang et al., 2016), Nile tilapia (Xia et al., 2017), and channel catfish (*Ictalurus punctatus*) (Tan et al., 2018) using Illumina RNA-Seq datasets. However, these RNA-Seq projects obtained transcripts only from short-read assembly, which limits the accuracy with which post-transcriptional events can be identified. Our study is the first to apply Iso-Seq to an aquaculture teleost; it detected numerous full-length transcripts and characterized many post-transcriptional regulatory events in *L. maculatus*. It provides a paradigm for future transcriptome-wide studies of post-transcriptional regulation in aquaculture species. In addition, this study investigated the DETs of euryhaline *L. maculatus* in response to different salinity environments. It extends previous gene-level transcriptome studies (Zhang et al., 2017) and should help to unveil the molecular mechanisms by which fishes cope with salinity stress.

# CONCLUSION

In our study, we applied PacBio Iso-Seq to generate a new set of transcriptomic data for *L. maculatus*: 28,809 non-redundant transcripts, 10,249 AS events, 1,359 APA events, 3,112 lncRNAs, 29,609 SSRs, 365 fusion transcripts, and 1,194 TF transcripts. This is the first time that Iso-Seq has been applied to unveil transcriptome complexity in an aquaculture teleost. To investigate transcripts involved in salinity regulation in *L. maculatus*, RNA-Seq data were combined with the Iso-Seq results, identifying 518 DETs between the FW and SW environments. Notably, transcripts from the same gene may exhibit similar or opposite expression patterns. In addition, the expression level of 14 TFs was significantly up-regulated in the SW environment, implying roles in the response to hypertonic stress. Our study not only improves current gene models of *L. maculatus*, but also enhances the understanding of salinity regulatory mechanisms in euryhaline teleosts.

# DATA AVAILABILITY STATEMENT

The raw sequences from our study have been submitted to the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under the accession numbers PRJNA515783 (BioProject ID of Iso-Seq) and PRJNA515986 (BioProject ID of RNA-Seq). The reference genome of *L. maculatus* was downloaded from NCBI under the accession number PRJNA407434 (BioProject ID).

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of "Animal Research and Ethics Committees of Ocean University of China (Permit Number: 20141201)." The protocol was approved by the "Animal Research and Ethics Committees of Ocean University of China."

# AUTHOR CONTRIBUTIONS

YL and YT conceived the study. YT, XQ, XZ, and SL performed bioinformatics analysis. YL provided funding support. WY collected samples and extracted RNA. HW, JL, and FH administered the project. BL and YS verified the sequencing results. All authors read and approved the final manuscript.

# FUNDING

This study was supported by National Natural Science Foundation of China (No.31602147), National Key R&D Program of China (No.2018YFD0900101), and China Agriculture Research System (No. CARS-47).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01126/full#supplementary-material

SUPPLEMENTARY FIGURE 1 | ROI length distribution of three size bins.

SUPPLEMENTARY FIGURE 2 | Examples of structures showing genes with different types of isoforms. The blue bars indicate the annotated gene model in the *L. maculatus* genome. In (A), the orange bars indicate the isoform structures detected by Iso-Seq. Lines represent introns, and arrows indicate the orientation of transcription. In (B), a novel gene and its transcripts are absent from the genome annotation.

SUPPLEMENTARY FIGURE 3 | Examples of validation experiments for (A) novel isoforms, (B) an APA event, and (C) fusion transcripts.

SUPPLEMENTARY FIGURE 4 | KEGG analysis of the genes with more than four isoforms. (A) KEGG pathway annotation; (B) Statistics of pathway enrichment.

SUPPLEMENTARY FIGURE 5 | Examples of five types of alternative splicing events detected by Iso-Seq. The blue bars indicate the annotated gene model in the *L. maculatus* genome, and the dark green bars indicate the transcript structures detected by Iso-Seq. (A) Example of intron retention events in evm.TU.scaffold\_294.3; (B) Example of exon skipping events in evm.TU.scaffold\_229.19; (C) Example of alternative 3' splice site events in evm.TU.scaffold\_13.295; (D) Example of alternative 5' splice site events in evm.TU.scaffold\_81.120; (E) Example of mutually exclusive exon events in evm.TU.scaffold\_98.14.

SUPPLEMENTARY FIGURE 6 | Histogram of gene ontology classifications of *L. maculatus* fusion transcripts.

SUPPLEMENTARY FIGURE 7 | Volcano plot showing the DETs between the FW and SW treatment groups. The horizontal axis is the log2 fold change in the SW relative to the FW group. The vertical axis is the -log10 false discovery rate. Green dots represent transcripts significantly down-regulated in the SW relative to the FW group, and red dots represent significantly up-regulated transcripts. Black dots represent transcripts without a significant expression difference between the two groups.

SUPPLEMENTARY FIGURE 8 | Pie chart showing the frequencies of five types of alternative splicing events in DETs.

SUPPLEMENTARY FIGURE 9 | The transcript structures of DETs with opposite expression patterns generated from seven genes. Blue bars indicate the annotated gene model in the *L. maculatus* genome, and orange bars indicate the transcripts detected by Iso-Seq. DETs are marked with red dashed rectangles.

# REFERENCES

Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. *Annu. Rev. Biochem.* 72, 291–336. doi: 10.1146/annurev.biochem.72.121801.161720


involved in the innate immune system. *Fish Shellfish Immunol.* 87, 346–359. doi: 10.1016/j.fsi.2019.01.023


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Tian, Wen, Qi, Zhang, Liu, Li, Sun, Li, He, Yang and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Single-Nucleotide Polymorphisms (SNP) Mining and Their Effect on the Tridimensional Protein Structure Prediction in a Set of Immunity-Related Expressed Sequence Tags (EST) in Atlantic Salmon (Salmo salar)

### Edited by:

Gen Hua Yue, Temasek Life Sciences Laboratory, Singapore

### Reviewed by:

Vladimir M. Milenkovic, University Medical Center Regensburg, Germany

Zituo Yang, National University of Singapore, Singapore

### \*Correspondence:

Mónica Imarai, monica.imarai@usach.cl

Felipe E. Reyes-López, Felipe.Reyes@uab.cat

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 04 December 2018 Accepted: 24 December 2019 Published: 27 February 2020

### Citation:

Vallejos-Vidal E, Reyes-Cerpa S, Rivas-Pardo JA, Maisey K, Yáñez JM, Valenzuela H, Cea PA, Castro-Fernandez V, Tort L, Sandino AM, Imarai M and Reyes-López FE (2020) Single-Nucleotide Polymorphisms (SNP) Mining and Their Effect on the Tridimensional Protein Structure Prediction in a Set of Immunity-Related Expressed Sequence Tags (EST) in Atlantic Salmon (Salmo salar). Front. Genet. 10:1406. doi: 10.3389/fgene.2019.01406

Eva Vallejos-Vidal<sup>1</sup>, Sebastián Reyes-Cerpa<sup>2,3</sup>, Jaime Andrés Rivas-Pardo<sup>2,3</sup>, Kevin Maisey<sup>4</sup>, José M. Yáñez<sup>5</sup>, Hector Valenzuela<sup>4</sup>, Pablo A. Cea<sup>6</sup>, Victor Castro-Fernandez<sup>6</sup>, Lluis Tort<sup>1</sup>, Ana M. Sandino<sup>4</sup>, Mónica Imarai<sup>4</sup>\* and Felipe E. Reyes-López<sup>1</sup>\*

<sup>1</sup> Department of Cell Biology, Physiology and Immunology, Faculty of Biosciences, Universitat Autònoma de Barcelona, Barcelona, Spain, <sup>2</sup> Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago, Chile, <sup>3</sup> Escuela de Biotecnología, Facultad de Ciencias, Universidad Mayor, Santiago, Chile, <sup>4</sup> Centro de Biotecnología Acuícola, Departamento de Biología, Facultad de Química y Biología, Universidad de Santiago de Chile, Santiago, Chile, <sup>5</sup> Facultad de Ciencias Veterinarias y Pecuarias, Universidad de Chile, Santiago, Chile, <sup>6</sup> Facultad de Ciencias, Universidad de Chile, Santiago, Chile

Single-nucleotide polymorphisms (SNPs) are single genetic code variations considered one of the most common forms of nucleotide modification. Such SNPs can be located in genes associated with the immune response and may therefore have direct implications for the phenotype of susceptibility to infections affecting the productive sector. In this study, a set of immune-related genes (cc motif chemokine 19 precursor [ccl19], integrin β2 [itb2, also named cd18], glutathione transferase omega-1 [gsto-1], heat shock 70 kDa protein [hsp70], and major histocompatibility complex class I [mhc-I]) was analyzed to identify SNPs by data mining. These genes were chosen based on their previously reported expression in the infectious pancreatic necrosis virus (IPNV)-infected Atlantic salmon phenotype. The available EST sequences for these genes were obtained from the Unigene database. Twenty-eight SNPs were found in the genes evaluated, most of them transition base changes. The effect of the SNPs located in the 5'-untranslated region (UTR) or 3'-UTR on transcription factor binding sites and alternative splicing regulatory motifs was assessed and ranked with a low-to-medium predicted FASTSNP risk score. Synonymous SNPs were found in itb2 (c.2275G > A), gsto-1 (c.558G > A), and hsp70 (c.1950C > T), with a low predicted FASTSNP risk score. The difference in the relative synonymous codon usage (RSCU) value between the variant codon and the wild-type codon (ΔRSCU) was negative for one SNP (hsp70 c.1950C > T) and positive for two (itb2 c.2275G > A; gsto-1 c.558G > A), suggesting that these synonymous SNPs (sSNPs) may be associated with differences in the local rate of elongation. Nonsynonymous SNPs (nsSNPs) in the gsto-1 translatable gene region were ranked, using the SIFT and PolyPhen web tools, with the second highest (c.205A > G; c.484T > C) and the highest (c.499T > C; c.769A > C) possible predicted risk scores. Using homology modeling to predict the effect of these nonsynonymous SNPs, the most relevant changes for gsto-1 were observed for the nsSNPs c.205A > G, c.484T > C, and c.769A > C. Molecular dynamics simulations were used to analyze whether these GSTO-1 variants differ significantly in their conformational dynamics; the results suggest that these SNPs could have allosteric effects that modulate catalysis. Altogether, these results suggest that the candidate SNPs identified may play a potentially crucial role in the immune response of Atlantic salmon.

Keywords: single-nucleotide polymorphism, immune response, synonymous SNP, nonsynonymous SNP, homology modeling, 3D protein structure, molecular dynamics simulation, Salmo salar

# INTRODUCTION

Genetic variation occurs within and among populations, leading to polymorphisms that can be associated with a genetic trait or with a phenotype expressed in the presence of an environmental stimulus (Brookes, 1999; Rebbeck et al., 2004; Hirschhorn and Daly, 2005). A single-nucleotide polymorphism (SNP) is a variation at a single position of the genetic code. Although multiallelic SNPs do exist, SNPs are usually biallelic (two alternative bases occur) and must reach a minimum frequency (>1%) in the population (Wang et al., 1998). SNPs are the most common form of variation in the genome and are extensively used to study genetic differences between individuals and populations. They may change the genomic sequence in coding (exon), intergenic, or noncoding (intron) regions (Dijk et al., 2014; Ahmad et al., 2018).
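The definition above (usually two alleles, each above a minimum frequency) can be turned into a simple check on one aligned position. The following is a minimal sketch and not the procedure used in this study: it assumes the bases observed at a position are given as a plain string, and it applies the 1% population threshold directly to those observations, which is a simplification. It also labels a substitution as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_type(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def call_site(column, min_freq=0.01):
    """Summarize the alleles seen at one aligned position.

    `column` is a string of observed bases; characters other than A, C, G, T
    (gaps, N) are ignored. An allele is kept only if its frequency exceeds
    `min_freq` (assumption: the >1% rule is applied to the observed bases).
    """
    counts = Counter(b for b in column.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {"alleles": {}, "is_snp": False, "is_biallelic": False}
    alleles = {b: n / total for b, n in counts.most_common() if n / total > min_freq}
    return {
        "alleles": alleles,
        "is_snp": len(alleles) >= 2,
        "is_biallelic": len(alleles) == 2,
    }


site = call_site("A" * 95 + "G" * 5)
print(site["is_snp"], substitution_type("A", "G"))
```

With EST data, read depth at a position is often low, so a real pipeline would normally also require a minimum number of supporting sequences per allele before accepting a candidate.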

SNPs are considered the most useful biomarkers for disease diagnosis or prognosis because of their high frequency, ease of analysis, low genotyping costs, and the possibility of carrying out association studies with statistical and bioinformatics tools (Srinivasan et al., 2016). SNPs have thus gained importance as major drivers of disease-association studies in recent years. In mammals, the past decade has seen enormous progress in identifying hundreds of thousands of SNPs and their associations with complex clinical conditions and phenotypic traits linked to hundreds of common diseases (Welter et al., 2014; Wijmenga and Zhernakova, 2018).

Furthermore, SNPs may also have a great influence on the immune response to pathogenic challenges and on disease outcome, contributing to the range of susceptibility to infections among individuals. Thus, a SNP may have a protective role, or may influence the rate of disease progression or even the type of cellular immune response evoked by pathogens (Hill, 2001; Skevaki et al., 2015). In this regard, polymorphisms in several immune-related genes have been associated with susceptibility to infections, including pattern recognition receptors (prr) and downstream signaling molecules (Skevaki et al., 2015), mannose-binding lectin 2 (mbl2) and toll-interleukin 1 receptor domain containing adaptor protein (tirap) (Gowin et al., 2018), c-c chemokine receptor type 5 (ccr5) (Martin et al., 1998; Salkowitz et al., 2003; Fellay et al., 2009; Chapman and Hill, 2012), interleukin 6 (il-6) (Zhang et al., 2012), and il-22 (Zhang et al., 2011), among others.

In aquaculture species, SNPs are especially important because they may be associated with phenotypic traits that have economic implications. This information has a direct impact on the accuracy of selection for these traits, improving the rate of genetic gain and production efficiency (Hayes et al., 2007b). The ubiquity of SNPs across the genomes examined to date has allowed their use as markers for a wide range of applications, including quantitative trait locus (QTL) mapping, pedigree analysis, association studies, and population genetics, among others. Whereas the effect of gene mutations in mammals has been well documented, such information in teleost species is still limited. However, several efforts have been made to provide information on the consequences of genetic alterations in immune response-related genes that may influence susceptibility to diseases in fish (Kongchum et al., 2011). In this context, SNP variations were found in il-1b of Cyprinus pellegrini and C. carpio that can be helpful in understanding differential resistance to koi herpesvirus (KHV) and Aeromonas hydrophila, respectively (Jia et al., 2015; Wenne, 2018). In addition, three SNPs identified in the leukocyte cell-derived chemotaxin-2 (lect2) gene were associated with resistance to big belly disease in Lates calcarifer (Fu et al., 2014a; Wenne, 2018). Three SNPs in the mast cell protease 8 (mcp-8) gene were also significantly associated with resistance of tilapia to Streptococcus agalactiae (Fu et al., 2014b; Wenne, 2018).

In Atlantic salmon, some studies have reported SNPs associated with a resistance genotype against infectious pancreatic necrosis (IPN) virus (IPNV). IPN is a highly transmissible disease with worldwide distribution that occurs both at the initial stage of rearing in freshwater and in postsmolts in seawater (Bruno, 2004). Importantly, asymptomatic carriers (McAllister et al., 1993), establishment of viral persistence (Reyes-Cerpa et al., 2014), IPN-resistance phenotype (Reyes-López et al., 2015), and vertical transmission through eggs (Bootland et al., 1991) have been reported. In this matter, several segregating SNP markers linked to a major QTL associated with resistance against infectious pancreatic necrosis virus (IPNV) in Atlantic salmon from an Scotland commercial breeding program have been reported (Houston et al., 2012; Wenne, 2018). Similarly, a QTL in Norwegian salmon has been employed in marker-assisted selection in breeding companies from Norway and Scotland, which resulted in 75% reduction in the number of IPN-outbreaks in the salmon farming industry. This QTL has been located on the SNP-based linkage map and identified as the epithelial cadherin (cdh1-1) gene with a functional involvement in viral attachment and entry of IPNV (Moen et al., 2009; Moen et al., 2015; Wenne, 2018). Based on these reports, it seems that the strategy to detect SNPs in immune-related genes could provide a set of candidate polymorphisms that could explain the correlation between the pathogen and the disease phenotype. Particularly, the relevance to identify SNPs in the coding sequence of immunity-related genes could explain directly (causal) the variability upon a specific phenotype evaluated (Carlson et al., 2003; Cheng et al., 2004; Hirschhorn and Daly, 2005). In this context, most of the knowledge about fish immune response is based on large-scale expressed sequence tag (EST) sequencing that has helped to identify immune-related genes in teleosts. 
EST sequencing of tissues that play a central role in the immune response helps to detect gene sequences directly related to host defense functions. A set of immune-related genes from splenic leukocytes of IPNV-infected Atlantic salmon has been identified using EST analysis (Cepeda et al., 2011; Cepeda et al., 2012). Importantly, some of these genes were differentially expressed when the IPN-susceptible and IPN-resistant phenotypes were compared (Reyes-López et al., 2015). Thus, searching for SNPs in these immune-related genes using a data mining strategy may provide a set of candidate polymorphisms that could help to link the possible causes of this differential expression pattern with the variability of the IPN phenotype. Hayes et al. (2007b) described for the first time an in silico detection of 2,507 putative SNPs in Atlantic salmon from the alignment of 100,866 ESTs. Despite the large number of SNPs identified, none was reported in a gene directly associated with immune function.

Therefore, the aim of this study was the in silico identification of SNPs by data mining in a set of immune-related genes whose expression has been previously reported in response to IPNV infection in Atlantic salmon. For this purpose, a set of immune-related genes (cc motif chemokine 19 precursor [ccl19], integrin β2 [itb2, also named cd18], glutathione transferase omega-1 [gsto-1], heat shock 70 kDa protein [hsp70], major histocompatibility complex class I [mhc-I]) were selected as targets for the in silico SNP search, using as template the EST sequences for these genes obtained from the Unigene database. Based on their position in the nucleotide sequence, the SNPs were located in the 5'/3'-UTR or in the translated region. Nucleotide variations located in the translated region were classified as synonymous (sSNPs) or nonsynonymous (nsSNPs) according to the change produced in the predicted amino acid sequence. For sSNPs, a codon usage analysis was conducted; for nsSNPs, a homology modeling analysis was carried out to evaluate whether they could affect the predicted tridimensional protein structure. In addition, for those nsSNPs in which homology modeling showed a significant change in the three-dimensional protein structure, the behavior of the variants over time was analyzed by molecular dynamics (MD) simulation. This study provides a set of candidate SNPs that may help to determine potential correlations between the immunity gene expression pattern in Atlantic salmon and their response against the pathogens to which they are exposed under aquaculture conditions.

# MATERIAL AND METHODS

# Selection of Genes Modulated in Response to IPNV and Sequence Cluster Collection

The immune-related genes analyzed (cc motif chemokine 19 precursor [ccl19], integrin β2 [itb2], glutathione transferase omega-1 [gsto-1], heat shock 70 kDa protein [hsp70], major histocompatibility complex class I [mhc-I]) were selected because their expression was previously reported in splenic leukocytes isolated from IPNV-infected Atlantic salmon and was differentially modulated between the IPN-susceptible and IPN-resistant phenotypes (Reyes-López et al., 2015). The EST sequences for these candidate genes were downloaded from the Unigene database (NCBI). Details of all the EST sequences analyzed in this study to identify SNPs are given in Supplementary Tables 1–5.

# Data Mining for SNP Identification

SNP identification by data mining was carried out to identify possible functional effects in sequences related to defense and immune response in Atlantic salmon challenged with IPNV. All the sequences included in the analysis were first preprocessed to remove any vector sequences or repetitive elements using Cross-Match (Ewing and Green, 1998) and RepeatMasker (Tarailo-Graovac and Chen, 2009), respectively. The search for SNPs was based on a multiple sequence alignment of all the sequences collected for each evaluated gene, using the HaploSNPer web-based tool (Tang et al., 2008). From the alignment, each nucleotide variant detected was considered a putative SNP and was defined according to its nucleotide position in the gene sequence. The putative SNPs were then filtered to exclude possible sequencing errors and noninformative polymorphisms (variants with a frequency lower than 1%). The SNPs that passed this filter were taken as the most probable or reliable SNPs. The nucleotide variations were described according to the nomenclature suggested by den Dunnen and Antonarakis (2001).
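The filtering logic described above can be illustrated with a minimal Python sketch. This is not the HaploSNPer algorithm used in the study; it is a simplified column-wise scan of an already aligned set of sequences. The function name `call_putative_snps`, the toy alignment, and the `min_count` threshold (which discards alleles seen only once as likely sequencing errors) are assumptions introduced for illustration; only the 1% frequency cutoff comes from the text.

```python
from collections import Counter


def call_putative_snps(aligned_seqs, min_freq=0.01, min_count=2):
    """Return putative SNPs from equal-length, pre-aligned sequences.

    A column is reported when at least two different bases each reach
    both min_freq (fraction of informative reads) and min_count (reads).
    min_count is an illustrative assumption, not a value from the study.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    snps = []
    for pos in range(length):
        column = [s[pos].upper() for s in aligned_seqs]
        # Ignore gaps and ambiguous bases when counting alleles.
        counts = Counter(b for b in column if b in "ACGT")
        depth = sum(counts.values())
        if depth == 0:
            continue
        kept = {b: n for b, n in counts.items()
                if n >= min_count and n / depth >= min_freq}
        if len(kept) >= 2:
            # Positions are reported 1-based, as in sequence nomenclature.
            snps.append((pos + 1, dict(sorted(kept.items()))))
    return snps


# Hypothetical toy alignment: position 3 is polymorphic (G/A),
# position 8 carries a single-read variant that is filtered out.
alignment = [
    "ACGTACGT",
    "ACGTACGT",
    "ACATACGT",
    "ACATACGA",
    "ACGTACGT",
]
print(call_putative_snps(alignment))  # prints [(3, {'A': 2, 'G': 3})]
```

In practice the thresholds would need to be tuned to the depth of each EST cluster, since a 1% frequency cutoff alone cannot exclude singletons when fewer than 100 sequences are aligned.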

# Effect of SNPs on Gene Function

To determine the SNP location within the gene region (5'-UTR, coding region, 3'-UTR), the nucleotide sequence was obtained from the Nucleotide database (NCBI) based on its Unigene annotation and compared by alignment with the BioEdit sequence alignment editor (version 7.0.5.3). In addition, the nucleotide sequence was used as template to obtain the predicted amino acid sequence with the ORF Finder tool (NCBI), and the annotation of the unigene sequence analyzed was confirmed by protein BLAST (NCBI).

The functional impact of each SNP was assessed according to the gene region (5'-UTR, coding region, 3'-UTR) in which the nucleotide variant was located. For SNPs located in noncoding regions (5'-UTR, 3'-UTR), the predicted effect of the nucleotide variant on motifs associated with transcription factor binding sites was evaluated with the TFSearch webtool (http://diyhpl.us/~bryan/irc/protocol-online/protocol-cache/TFSEARCH.html) (Heinemeyer et al., 1998). Furthermore, the predicted SNP effect on possible exon splicing enhancers [ESEfinder (http://krainer01.cshl.edu/cgi-bin/tools/ESE3/esefinder.cgi?process=home) (Cartegni et al., 2003); RESCUE-ESE (http://hollywood.mit.edu/burgelab/rescue-ese/) (Fairbrother et al., 2004)] and exon splicing silencers [FAS-ESS (http://genes.mit.edu/fas-ess/) (Wang et al., 2004)] was also evaluated. Based on these results, the SNP functional effect was predicted according to FASTSNP in order to assign a FASTSNP score (Yuan et al., 2006). A FASTSNP score between 0 and 5 was assigned to each individual SNP evaluated (0 representing the minimum and 5 the maximum functional SNP effect). For SNPs located in the coding region, the nucleotide variant was first evaluated individually with the ORF Finder tool (NCBI), and the predicted unigene amino acid sequence was compared with that of the SNP-containing unigene sequence using the BioEdit sequence alignment editor (version 7.0.5.3). For nucleotide variations located in untranslated regions (5'-UTR; 3'-UTR) and those that did not change the predicted amino acid sequence (sSNP; synonymous amino acid change), the SNP effect was evaluated with the FASTSNP decision tree in order to assign a FASTSNP score (Yuan et al., 2006). In addition, a codon usage analysis was performed for each sSNP identified according to Sharp and Li (1987).
The difference in the relative synonymous codon usage (RSCU) value between the variant codon and the wild-type codon (ΔRSCU = RSCU mutant − RSCU wild-type) was calculated for all the SNPs identified based on the codon usage database (http://www.kazusa.or.jp/codon) for Salmo salar. For nucleotide variants that changed the predicted amino acid sequence (nsSNP; nonsynonymous amino acid change), the impact of the nucleotide variation was analyzed by combining the scores obtained from Sorting Intolerant From Tolerant (SIFT) (applied to human variant databases) (http://sift.bii.a-star.edu.sg/) (Ng and Henikoff, 2003) and the POLYPHEN web-based resource (Flanagan et al., 2010). The results were ranked according to the protocol established by Bhatti et al. (2006), with some modifications to homogenize the significance of all the scores obtained for the SNPs in this study (lower score = minimum effect; higher score = maximum effect). Thus, the SIFT scores were ranked from I (tolerated) to IV (intolerant), whereas the POLYPHEN scores were ranked from A (benign) to E (probably damaging). The combination of the SIFT and POLYPHEN analyses gives a score ranked from 1 (minimum effect) to 4 (maximum effect). The implications of the nonsynonymous SNPs at the protein tertiary structure level were also evaluated in silico through homology modeling analysis.
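As a worked illustration of the codon usage step, the sketch below computes RSCU for one synonymous codon family as the observed count of a codon divided by the count expected if all synonymous codons were used equally, and then the difference between a variant and a wild-type codon. The function names and the codon counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the Salmo salar codon usage table.

```python
def rscu(codon_counts, synonymous_codons):
    """RSCU for each codon in a single synonymous family.

    RSCU = observed count / (family total / number of synonymous codons).
    """
    total = sum(codon_counts[c] for c in synonymous_codons)
    if total == 0:
        raise ValueError("no observations for this synonymous family")
    n = len(synonymous_codons)
    return {c: codon_counts[c] * n / total for c in synonymous_codons}


def delta_rscu(codon_counts, synonymous_codons, wild_type, mutant):
    """Difference RSCU(mutant) - RSCU(wild type) for a synonymous SNP."""
    values = rscu(codon_counts, synonymous_codons)
    return values[mutant] - values[wild_type]


# Hypothetical counts for the four alanine codons (illustration only).
ala_codons = ["GCA", "GCC", "GCG", "GCT"]
ala_counts = {"GCA": 10, "GCC": 30, "GCG": 5, "GCT": 15}

values = rscu(ala_counts, ala_codons)
print(values["GCC"], values["GCT"])                      # prints 2.0 1.0
print(delta_rscu(ala_counts, ala_codons, "GCT", "GCC"))  # prints 1.0
```

A positive difference means the variant codon is used more often than the wild-type codon within its family, and a negative difference means it is used less often.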

# Homology Modeling

To evaluate whether the nonsynonymous SNPs reported in this study have an effect on the predicted tridimensional protein structure, a homology modeling analysis was carried out. For this, the nucleotide sequence was used as target to obtain the predicted tridimensional protein structure with the CPHModels-3.0 webserver (Nielsen et al., 2010). The alignment with the highest bit score was chosen as template for GSTO-1 (PDB ID: 1EEM; score = 278; E-value = 4e-75). The predicted protein structures for CCL19 (PDB ID: 2HCI; score = 55; E-value = 2e-08), ITB2 (PDB ID: 2KCN; score = 42; E-value = 2e-04), HSP70 (PDB ID: 1YUW; score = 1036; E-value = 0), and MHC class I (PDB ID: 1KTL; score = 182; E-value = 4e-46) were also obtained. The tridimensional protein structure of each PDB ID match was then used as template for the homology modeling of the above-mentioned gene sequences using the MODELLER webtool (Swiss-Model) (Sali and Blundell, 1993). The stereochemical quality of the modeled tridimensional protein structures was evaluated with the Procheck webtool (Swiss-Model) (Laskowski et al., 1996).

# All-Atom Explicit Solvent MD Analysis

The protonation state of residues at physiological conditions (pH = 7) was assigned using the Propka software included in the Maestro Suite (Olsson et al., 2011). The models were solvated in a TIP3P truncated octahedron extending 12 Å beyond the protein surface, including 4 Cl⁻ ions to maintain the net charge neutrality of the system. Parameters for the calculations were derived from the AMBER ff14SB force field (Maier et al., 2015). Energy minimization was carried out in four stages, each consisting of 5,000 steps of steepest descent followed by 5,000 steps of conjugate gradient. In the first stage, the solute was fixed with a positional restraint of 500 kcal/mol·Å², and only the accommodation of the solvent was allowed. Then, the hydrogen atoms of the protein were freed from the restraint to allow their relaxation. In the third stage, the minimization was done imposing a lighter restraint of 10 kcal/mol·Å² on the heavy atoms of the protein. Lastly, a minimization without restraints was conducted. After that, the systems were equilibrated under NVT conditions, heating from 0 K to 298.15 K over 200 ps and maintaining the final temperature for 100 ps, using the Langevin thermostat with a collision frequency of 2 ps⁻¹. Then, the systems were equilibrated for 10 ns under NPT conditions at 298.15 K and 1 atmosphere using the weak-coupling Berendsen barostat. Three production runs of 100 ns under NPT conditions with random velocity seeds were performed for each protein system. All the MD calculations were performed with periodic boundary conditions, a time step of 2 fs, a 10 Å direct space cutoff for PME, and hydrogen atoms constrained with the SHAKE algorithm. All the simulations were performed using Amber18 with GPU acceleration (Salomon-Ferrer et al., 2013) and the trajectories were analyzed using cpptraj (Roe and Cheatham, 2013).
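To make the size of the protocol concrete, the following Python sketch converts the stated durations into integration steps at the 2 fs time step and tallies the minimization steps and production sampling per protein system. It is plain arithmetic on the numbers given above and does not reproduce any Amber input file; the helper name `n_steps` and the dictionary labels are assumptions made for illustration.

```python
def n_steps(duration_ps, timestep_fs=2.0):
    """Number of integration steps needed to cover duration_ps at timestep_fs."""
    return round(duration_ps * 1000.0 / timestep_fs)


# Durations taken from the protocol described above, in picoseconds.
stages_ps = {
    "NVT heating, 0 K to 298.15 K": 200,
    "NVT hold at 298.15 K": 100,
    "NPT equilibration": 10_000,             # 10 ns
    "NPT production, one replica": 100_000,  # 100 ns
}
md_steps = {name: n_steps(ps) for name, ps in stages_ps.items()}

# Four minimization stages, each with 5,000 steepest-descent steps
# followed by 5,000 conjugate-gradient steps.
minimization_steps = 4 * (5_000 + 5_000)

# Three production replicas of 100 ns for each protein system.
production_ns_per_system = 3 * 100

for name, steps in md_steps.items():
    print(f"{name}: {steps:,} steps")
print(f"minimization: {minimization_steps:,} steps")
print(f"production sampling per system: {production_ns_per_system} ns")
```

Each 100 ns replica therefore corresponds to 50 million integration steps, and each protein system accumulates 300 ns of production sampling.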

# RESULTS

# SNP Identification

A set of immune-related genes (gsto-1, ccl19, itb2, hsp70, mhc-I) were chosen based on their previously reported role in the immune response of Atlantic salmon (Cepeda et al., 2011; Cepeda et al., 2012; Reyes-López et al., 2015). SNPs were identified in these genes by data mining analysis. A total of 310 EST sequences obtained from the Unigene database (NCBI) were analyzed to identify SNPs in ccl19, itb2, gsto-1, hsp70, and mhc-I. Twenty-eight SNPs were found as the most probable or reliable nucleotide variants, comprising 18 transitions (7 A > G; 11 C > T) and 10 transversions (7 A > C; 1 A > T; 1 C > G; 1 G > T) (Table 1). Of these, 3 SNPs were found in ccl19, 2 in itb2, 15 in gsto-1, 4 in hsp70, and 4 in mhc-I (Table 1).
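The transition/transversion breakdown reported above can be checked with a short Python sketch. A substitution is a transition when both bases are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise. The substitution counts are the ones given in the text; the function and variable names are introduced here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt):
    """Classify a single-base substitution as a transition or transversion."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Substitution counts as reported in the text (Table 1 summary).
snp_counts = {
    ("A", "G"): 7, ("C", "T"): 11,
    ("A", "C"): 7, ("A", "T"): 1, ("C", "G"): 1, ("G", "T"): 1,
}

totals = {"transition": 0, "transversion": 0}
for (ref, alt), n in snp_counts.items():
    totals[substitution_class(ref, alt)] += n

ts_tv_ratio = totals["transition"] / totals["transversion"]
print(totals)       # prints {'transition': 18, 'transversion': 10}
print(ts_tv_ratio)  # prints 1.8
```

The resulting ratio of 1.8 reflects the excess of transitions over transversions in this SNP set.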

# SNP Predicted Functionality

Of the nucleotide variations identified, 21.43% were found in the 5'-UTR and 53.57% in the 3'-UTR. Within the untranslated regions, mhc-I showed nucleotide modifications only in the 5'-UTR. In the case of the ccl19, itb2, and hsp70 genes, the modifications were found in the 3'-UTR (Table 2). Only in gsto-1 were SNPs found in both the 5'- and 3'-UTR. To evaluate the effect of the SNPs located in the UTRs, the FASTSNP decision tree was used to assign a FASTSNP risk score to each of the nucleotide variations found. The results showed that some SNPs in the 5'-UTR (gsto-1: c.48C > T, c.89A > G; mhc-I: c.17G > T) and the 3'-UTR (ccl19: c.933T > C, c.996C > T; gsto-1: c.1068A > C, c.1177A > C; hsp70: c.2102C > T, c.2139A > G) obtained the minimum FASTSNP score (FASTSNP score = 0), indicating that these nucleotide variations did not have any predicted consequence. However, the majority of the SNPs detected in these regions had a FASTSNP score of 1–3, which indicates a low to medium impact (Table 2). Based on these functional scores, the results suggest that the nucleotide variations evaluated could be involved in the regulation of the immune-related genes analyzed.

Of the total number of SNPs found, 25% of the nucleotide substitutions were located in the coding region. At the functional level, three of them (10.71%) were identified as synonymous SNPs, since no variation in the predicted amino acid sequence was found for itb2 (c.2275G > A), gsto-1 (c.558G > A), and hsp70 (c.1950C > T). Thus, a low-risk ranking was assigned (FASTSNP score = 1) (Table 2). In terms of codon usage, a positive ΔRSCU value was observed for itb2 (c.2275G > A) and gsto-1 (c.558G > A) (Table 3). By contrast, hsp70 (c.1950C > T) showed a negative ΔRSCU value. These results suggest that these sSNPs may affect the local rate of translation elongation. The remaining 14.29% of the total nucleotide variations were identified in the coding region as nonsynonymous SNPs. Importantly, all of these nucleotide variations were found in the gsto-1 sequence (Table 4). Two of these variations (c.205A > G, c.484T > C) obtained the highest risk score (SIFT+POLYPHEN score = 5) for nonsynonymous SNPs; in the predicted amino acid sequence they replace serine (polar, uncharged R group) with glycine (nonpolar, flexible, and the smallest R group; S26G) and with proline (nonpolar, cyclic, and rigid R group; S119P), respectively. The other two nucleotide substitutions (c.499T > C, c.769A > C) were ranked in the next risk score level (SIFT+POLYPHEN score = 4), representing in the predicted amino acid sequence a change of tyrosine (polar and aromatic R group) to histidine (polar and sometimes positively charged imidazole R group; Y124H), and of threonine (polar, uncharged R group) to proline (T214P), respectively (Table 4). Taken together, these results indicate that the SNPs found in gsto-1 could have a relevant impact at the functional and structural protein level.
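The regional percentages quoted in this section can be reconciled with the 28 SNPs through simple arithmetic, sketched below in Python. The counts for the 5'-UTR (6) and 3'-UTR (15) are back-calculated here from the reported percentages and are therefore an assumption of this sketch; the synonymous (3) and nonsynonymous (4) counts follow from the variants listed in the text.

```python
# SNP counts per region. The UTR counts are inferred from the reported
# percentages (21.43% and 53.57% of 28) and are an assumption of this sketch.
region_counts = {
    "5'-UTR": 6,
    "3'-UTR": 15,
    "coding, synonymous": 3,
    "coding, nonsynonymous": 4,
}

total_snps = sum(region_counts.values())
region_percent = {
    region: round(100 * n / total_snps, 2)
    for region, n in region_counts.items()
}
coding_percent = (region_percent["coding, synonymous"]
                  + region_percent["coding, nonsynonymous"])

for region, pct in region_percent.items():
    print(f"{region}: {region_counts[region]} SNPs ({pct}%)")
print(f"coding region total: {coding_percent}%")
```

Under these counts the four categories add up to the 28 SNPs reported, and the two coding categories together give the 25% stated for the coding region.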

# Homology Modeling and All-Atom Explicit Solvent MD Analysis

To evaluate whether the SIFT+POLYPHEN score predicted for the nonsynonymous nucleotide variations in gsto-1 translates into an impact at the protein structure level, homology modeling was performed. To do this, the predicted tridimensional protein structure was obtained, and the effect of each nucleotide variation determined for gsto-1 was analyzed individually on it.

Using the CPHModels-3.0 webserver, the highest-scoring alignment (score = 278; E-value = 4e-75) obtained for the S. salar gsto-1 predicted amino acid sequence was with Homo sapiens glutathione S-transferase omega 1 (GSTO-1; PDB ID: 1EEM), which was therefore chosen as the best template structure.

TABLE 1 | Summary of single-nucleotide polymorphisms (SNPs) identified in a set of immune-related genes expressed in Salmo salar.

The UniGene ID, number of ESTs included in the SNP identification for each gene, and the transition (A > G, C > T) and transversion (A > C, A > T, C > G, G > T) nucleotide substitutions are indicated.

TABLE 2 | Evaluation of nucleotide variations located in untranslated regions (5'-UTR; 3'-UTR) and synonymous single-nucleotide polymorphisms (SNPs) in the translated (t) region of immune-related genes expressed in Salmo salar.

Bold and underlined letters represent the nucleotide variation (transcription factor binding site and alternative splicing regulatory motif). The arrow behind the sequence in the predicted transcription factor binding site column shows the 5' > 3' sense and the motif composition. In the alternative splicing regulatory motif column, normal and italic styles indicate the presence of a predicted exon splicing enhancer (ESE) or silencer (ESS), respectively. The FASTSNP score was assigned to each SNP (0: minimum risk score; 5: maximum risk score). NF, not found. ND, not determined.

Bold letters represent the nucleotide variation in the codon for each synonymous single-nucleotide polymorphism (sSNP). RSCU and ΔRSCU values were calculated (ΔRSCU = RSCU mutant − RSCU wild type).

The SIFT and POLYPHEN scores are indicated, from which a risk of 1–4 is obtained to assess whether the amino acid change produces a variation at the protein structural level, with 1 being the higher-impact risk. S, serine; G, glycine; P, proline; Y, tyrosine; H, histidine; T, threonine.

The overall comparison between the human GSTO-1 structure (Figure 1A) and the modeled Atlantic salmon GSTO-1 (Figure 1B) showed high similarity, indicating that the predicted salmon protein model does not present serious conflicts with the structural restraints dictated by the template. Only some local differences were found when comparing in detail the human and salmon protein models of GSTO-1 at the helices α4a, α4b, α7, and α8 (Figures 1B, C). To further validate the model, a stereochemical evaluation was performed by generating a Ramachandran plot to assess the distribution of the Phi and Psi dihedral angles. The stereochemical quality of the modeled tridimensional protein structure showed that the amino acids of the predicted sGSTO-1 structure were found mainly within the most favored (89.4%) and the additional allowed (10.1%) energy regions, whereas only 0.5% were in the disallowed regions (Figure 1D). This result indicates the good quality of the predicted sGSTO-1 structure obtained.

The effect of each gsto-1 nonsynonymous nucleotide variation detected was evaluated individually by homology modeling of S. salar GSTO-1. The model of the variant c.205A > G (S26G in the predicted amino acid sequence; sGSTO-1 S26G) suggests that glycine would increase the conformational freedom of the β2 sheet, making it shorter (Figures 2A–C). The sGSTO-1 S26G protein structure retained the good stereochemical quality of the model, with only a slight increase in the number of amino acids in the most favored energy regions (90.3%) compared to the predicted sGSTO-1 structure (Figure 2D).

The SNP c.484T > C (responsible for the S119P modification in the predicted amino acid sequence; sGSTO-1 S119P) is located in the α4 helix kink region (Figure 3). The model with the S119P substitution suggests that proline would make the helix kink longer (Figures 3B, C). As a consequence, the α4 helix may no longer be composed of the α4a and α4b segments but instead split into two independent α-helices. Compared to the predicted sGSTO-1 structure, the stereochemical quality of sGSTO-1 S119P showed a slight decrease in the number of amino acids in the most favored energy regions (87.9%) and an increase in those located in the additional allowed regions (11.6%) (Figure 3D).

On the other hand, no appreciable variations in the predicted tridimensional S. salar GSTO-1 structure were observed for the SNP c.499T > C (Y124H substitution; sGSTO-1 Y124H) (Figure 4). By contrast, in the case of the SNP c.769A > C (responsible for the T214P modification in the predicted amino acid sequence; sGSTO-1 T214P), a change in the tridimensional conformation affecting the C-terminal region is observed in the α7 helix of the model (Figure 5). Specifically, the amino acid substitution makes the α7 helix shorter, thus lengthening the α7 helix kink (Figures 5B, C). This region is of special relevance in the protein structure because the residues of the C-terminal end are involved in the formation of the H site, a binding site that accommodates a hydrophobic motif adjacent to the glutathione binding site (Board et al., 2000). The stereochemical quality of sGSTO-1 T214P was similar to that of sGSTO-1 (89.9% of the amino acids in the most favored regions) (Figure 5D). Thus, the analysis of the predicted salmon GSTO-1 sequence suggests a high impact of T214P on protein functionality.

The predicted protein structures for S. salar CCL19 (PDB ID: 2HCI; score = 55; E-value = 2e-08), ITB2 (PDB ID: 2KCN; score = 42; E-value = 2e-04), HSP70 (PDB ID: 1YUW; score = 1036; E-value = 0), and MHC class I (PDB ID: 1KTL; score = 182; E-value = 4e-46) are also shown (Supplementary Figures 1–4).

Since homology modeling provides only limited information about side-chain orientations, the three GSTO-1 variants whose structures appeared most affected (S26G; S119P; T214P) were further studied by all-atom MD simulations. Because the homology models showed good stereochemical quality, they were used as initial coordinates for the MD simulations. Figure 6 shows the global structure of the GSTO-1 protein and highlights the location of each mutation evaluated. Analysis of three replicas of 100 ns of simulation for each protein model (GSTO-1 wild-type, S26G, S119P, and T214P) showed that the root mean square deviation (RMSD) of the Cα atoms as a function of time remained below 3 Å, suggesting that the models were stable over the time windows explored (Supplementary Figure 5).

FIGURE 1 | Homology modeling for predicted S. salar GSTO-1 based on H. sapiens GSTO-1 structure. (A) Tridimensional H. sapiens GSTO-1 (hGSTO-1) structure. (B) Predicted tridimensional S. salar GSTO-1 (sGSTO-1) structure. The differences in the secondary structure of sGSTO-1 compared to hGSTO-1 are shown (arrow lines). (C) hGSTO-1 and sGSTO-1 overlay. (D) Ramachandran plot for the predicted sGSTO-1 structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p), and disallowed regions (GLU83) is indicated. In the protein structures, α-helices (α1–α8) and β-sheets (β1–4) are indicated. The reduced glutathione (GSH) molecule is represented.

To assess whether the GSTO-1 variants differ significantly in their conformational dynamics, we calculated the root mean square fluctuation (RMSF) over the whole trajectories (300 ns) and compared them with the wild-type trajectories. Importantly, even though the GSTO-1 variants are located far apart within the protein structure, all of them seem to produce similar effects on conformational flexibility (Figure 7). Compared to the wild-type GSTO-1 (Figure 7A), in the three sGSTO-1 variants explored [S26G (Figure 7B), S119P (Figure 7C), and T214P (Figure 7D)] the α10 helix located at the C-terminus of the protein is notably more rigid. Moreover, the loop that connects the helices α6 and α7 is markedly more rigid in sGSTO-1 S26G and S119P than in the wild-type, whereas in the sGSTO-1 T214P variant this loop is only slightly more rigid than in the wild-type GSTO-1. The detailed fluctuation profile for each GSTO-1 variant compared to the WT is shown in Supplementary Figure 6. Altogether, the differences observed in the local flexibility of secondary structure elements suggest that the SNP variations detected have an impact on the GSTO-1 protein structure and, in consequence, could impair protein functionality. Nevertheless, further studies are needed to unravel the precise biophysical consequences of these GSTO-1 variants for the function of the enzyme and their consequences at the physiological level.
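For readers unfamiliar with the two trajectory metrics used in this section, the following Python sketch shows how RMSD per frame and RMSF per atom are defined. The study computed these quantities with cpptraj; this stand-alone version is only an illustration, it assumes that all frames have already been superposed onto the reference structure, and the function names and the two-atom toy trajectory are hypothetical.

```python
import math


def rmsd_per_frame(traj, reference):
    """RMSD of each frame against a reference structure.

    traj is a list of frames, each a list of (x, y, z) atom coordinates.
    Frames are assumed to be pre-aligned to the reference.
    """
    result = []
    for frame in traj:
        sq_dists = [
            sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
            for atom, ref_atom in zip(frame, reference)
        ]
        result.append(math.sqrt(sum(sq_dists) / len(sq_dists)))
    return result


def rmsf_per_atom(traj):
    """RMSF of each atom around its time-averaged position."""
    n_frames = len(traj)
    n_atoms = len(traj[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in traj) / n_frames for k in range(3)]
        msf = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in traj
        ) / n_frames
        result.append(math.sqrt(msf))
    return result


# Toy trajectory: atom 0 is static, atom 1 moves along x between frames.
toy_traj = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(rmsd_per_frame(toy_traj, toy_traj[0]))
print(rmsf_per_atom(toy_traj))  # prints [0.0, 1.0]
```

RMSD summarizes how far the whole structure drifts from its starting coordinates over time, whereas RMSF reports which individual positions are flexible or rigid, which is why the latter is used to compare local flexibility between variants and the wild-type.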

# DISCUSSION

In this study, we identified a set of SNPs located in a group of immune-related genes (ccl19, itb2, gsto-1, hsp70, mhc-I) with differential gene expression in Salmo salar challenged with IPNV (Cepeda et al., 2011; Cepeda et al., 2012; Reyes-López et al., 2015). These nucleotide variations were located in both untranslated regions (5'-UTR; 3'-UTR). In the translated region, synonymous (itb2, gsto-1, hsp70) and nonsynonymous (gsto-1) SNPs were found. The potential impact of the nonsynonymous SNPs was also evaluated based on the predicted tridimensional S. salar GSTO-1 structure obtained through homology modeling. To our knowledge, there are no previous studies in Atlantic salmon in which the functional effect of SNPs has been evaluated at the level of the predicted tridimensional protein structure. These results open the possibility that these candidate SNPs found in genes with immune function could be associated with the inter-individual variability in gene expression in Atlantic salmon and in their response against the pathogens to which they are exposed under aquaculture rearing conditions.

FIGURE 2 | Evaluation of single-nucleotide polymorphism (SNP) c.205A > G effect on the predicted S. salar GSTO-1 (sGSTO-1) structure by homology modeling. (A) Predicted tridimensional sGSTO-1 including the S26G substitution (sGSTO-1 S26G). (B) sGSTO-1 and sGSTO-1 S26G overlay. The punctual amino acid substitution (dotted arrow line) and the region of the secondary structure affected (β2 sheet, circle) are indicated. (C) Enlarged view of the sGSTO-1 S26G substitution shown in (B). sGSTO-1 (transparent) and sGSTO-1 (colored) are shown. The reduced glutathione (GSH) molecule is represented. (D) Ramachandran plot for the predicted sGSTO-1 S26G structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p), and disallowed regions (GLU83, ASP97) is indicated.

The interest in identifying SNPs stems from the large amount of information available in model organisms relating these nucleotide variations to disease susceptibility (Mooney et al., 2010), which may allow the identification of candidate genes as targets for therapies (Voisey and Morris, 2008). In salmon, an association between polymorphism and susceptibility has been suggested (Correa et al., 2017; Holborn et al., 2018). For this reason, the identification of polymorphisms, including SNPs, in genes of immunological relevance may help to establish associations between polymorphism and susceptibility to diseases affecting the productive sector. In this way, the identification of SNPs and their analysis with different bioinformatics resources (including those intended for the evaluation of risk scores and for estimating their consequences on the three-dimensional protein structure predicted from the amino acid sequence) appears to be an interesting approach for selecting more accurately, from a set of identified SNPs, those candidates with a predicted impact.

Different strategies have been used to identify SNPs, including small-scale strategies [single-strand conformation polymorphism (SSCP), heteroduplex analyses, random shotgun, polymerase chain reaction (PCR) product sequencing, EST analysis] (Liu and Cordes, 2004) and large-scale strategies [next generation sequencing (NGS) and whole genome sequencing] (Abdelrahman et al., 2017; Kumar and Kocour, 2017). In particular, SNP identification based on EST sequence analysis is not new, although it has not been extensively explored. In this context, the EST sequences available in data repositories make it possible to exploit the redundancy of these collections, allowing transcripts of the same gene from multiple individuals to be analyzed in order to identify potential true SNPs while minimizing sequencing errors (Hayes et al., 2007a). Such a volume of information is an excellent resource for the search for candidate polymorphisms using bioinformatics approaches (Irizarry et al., 2000; Guryev et al., 2004). In Atlantic salmon, previous studies have identified SNPs based on EST data (Hayes et al., 2007b), with a high percentage (72.4%) of the putative SNPs being successfully validated (Hayes et al., 2007a). Altogether, searching for SNPs in the EST database of Atlantic salmon could contribute to the identification of a set of candidate SNPs with a high success rate, and also makes it possible to extract the maximum benefit from the large amount of (often little exploited) information available in data repositories. The use of integrated bioinformatics resources to evaluate the potential consequences of those SNPs (in terms of risk factor and predicted three-dimensional protein structure) could favor the selection of certain candidate SNPs for validation (both genetic and functional) and could shed light on their effects in a physiological context.

FIGURE 3 | Evaluation of single-nucleotide polymorphism (SNP) c.484T > C effect on the predicted S. salar GSTO-1 (sGSTO-1) structure by homology modeling. (A) Predicted tridimensional sGSTO-1 including the S119P substitution (sGSTO-1 S119P). (B) sGSTO-1 and sGSTO-1 S119P overlay. The punctual amino acid substitution (dotted arrow line) and the structural region (α4 helix kink, circle) are indicated. (C) Enlarged view of the sGSTO-1 S119P substitution shown in (B). sGSTO-1 (transparent) and sGSTO-1 (colored) are shown. (D) Ramachandran plot for the predicted sGSTO-1 S119P structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p), and disallowed regions (GLU83) is indicated.

In agreement with Hayes et al. (2007b), our results showed a higher number of transition (A > G, C > T) than transversion (A > C, A > T, C > G, G > T) nucleotide substitutions. At the same time, our results showed that 25% of the nucleotide variants corresponded to synonymous and nonsynonymous modifications. Importantly, a previous study reported that 23.8% of the SNPs found carried these functional variations (Hayes et al., 2007b). These observations suggest that the methodology used in this study can identify a set of candidate SNPs using in silico approaches based on information currently available in public databases.

Previous reports have assessed the use of the bioinformatics tools used in our current study (Mohamed et al., 2008; Capasso et al., 2009; Liu et al., 2009; Wei et al., 2010; Rajasekaran and Sethumadhavan, 2010). Several reported have used FASTSNP as a tool to evaluate the SNP significance in genes associated to disease susceptibility in mammals (Ali Mohamoud et al., 2014; Panda and Suresh, 2014; Phani et al., 2014; Liu et al., 2016; Yu et al., 2017). For SNP identification, we used the FASTSNP decision tree to evaluate whether the nucleotide variation on noncoding region for ccl19, itb2, gsto-1, hsp70, and mhc-I could have an predictive effect over gene processes including transcription binding sites or splicing regulation (Yuan et al., 2006). Most of these nucleotide substitutions were ranked with a

FIGURE 4 | Evaluation of single-nucleotide polymorphism (SNP) c.499T > C effect on the predicted S. salar GSTO-1 (sGSTO-1) structure by homology modeling. (A) Predicted tridimensional sGSTO-1 including the Y124H substitution (sGSTO-1 Y124H). (B) sGSTO-1 and sGSTO-1 Y124H overlay. The point amino acid substitution (dotted arrow line) is indicated. (C) Enlarged view of the sGSTO-1 Y124H substitution shown in (B). sGSTO-1 (transparent), sGSTO-1 (colored), and the reduced glutathione (GSH) molecule are shown. (D) Ramachandran plot for the predicted sGSTO-1 Y124H structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p), and disallowed regions (GLU83, ASP97) is represented.

score 1-3, assigned according to the predicted change in transcription factor binding sites. Furthermore, changes affecting predicted exonic splicing enhancer or exonic splicing silencer motifs may also compromise gene function (Fairbrother et al., 2002; Cartegni et al., 2003; Fairbrother et al., 2004; Wang et al., 2004). Importantly, alternative splicing of immune-related genes in Atlantic salmon has previously been proposed in studies using IPNV (Maisey et al., 2011), suggesting that SNPs affecting these regulatory processes could play a relevant role in the host immune response. Alterations in the immune response associated with SNPs located in untranslated regions have been reported in mammals, including ccl19 (Muinos-Gimeno et al., 2009; Elton et al., 2010; Cai et al., 2014), itb2 (Bachmann et al., 1997; Mueller et al., 2004; Zhao et al., 2013), and mhc-I (Cree et al., 2010). By contrast, in teleosts there is a clear lack of information about polymorphisms located in the coding region of these genes, which have only been reported for hsp70 (Miichthys miiuy) and particularly for mhc-I of Atlantic salmon (Grimholt et al., 2002; Grimholt et al., 2003; Miller et al., 2004; Glover et al., 2007; Wynne et al., 2007), probably because of its relevant role in antigen presentation and the subsequent activation of the immune response. Regarding the modulation of these genes in Atlantic salmon against pathogens, all the genes evaluated in the current study have been identified and annotated from a splenic leukocyte cDNA library obtained from IPNV-infected Atlantic salmon (Cepeda et al., 2011; Cepeda et al., 2012). In addition, the expression of ccl19, itb2, hsp70, and mhc-I was modulated in the head kidney of Atlantic salmon full-sibling families (Reyes-López et al., 2015).
In the case of ccl19, its expression is upregulated in trout head kidney leucocytes infected with IPNV (Montero et al., 2009), suggesting a role in inflammation and in the immune response mediated by the recruitment of lymphocytes (Reyes-López et al., 2015). In the same study, we detected a differential expression pattern of ccl19 and itb2 (cd18) between IPN-susceptible and IPN-resistant families: in the susceptible families, expression rose abruptly at 1 day post-infection (dpi) and then dropped drastically at 5 dpi, whereas expression in the IPN-resistant families remained constant (Reyes-López et al., 2015). In the case of hsp70 and mhc-I a similar upregulation was

FIGURE 5 | Evaluation of single-nucleotide polymorphism (SNP) c.769A > C effect on the predicted S. salar GSTO-1 (sGSTO-1) structure by homology modeling. (A) Predicted tridimensional sGSTO-1 including the T214P substitution (sGSTO-1 T214P). (B) sGSTO-1 and sGSTO-1 T214P overlay. The point amino acid substitution (dotted arrow line) and the region of the secondary structure affected (α7 helix, circle) are indicated. (C) Enlarged view of the sGSTO-1 T214P substitution shown in (B). (D) Ramachandran plot for the predicted sGSTO-1 structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p), and disallowed regions (GLU83) is indicated. In the protein structures, α-helices (α1-α8) and β-sheets (β1-4) are indicated. The reduced glutathione (GSH) molecule is represented.

observed in both IPNV-susceptible and IPNV-resistant families, but differences were found at 5 dpi, when expression dropped to basal levels in susceptible families but not in resistant ones, suggesting that impaired antigen presentation can contribute to the IPN-susceptible phenotype of salmon (Reyes-López et al., 2015). Further studies are needed to elucidate whether there is a correlation between the candidate SNPs found in the SNP mining analysis of the current study and IPN resistance.

Synonymous SNPs (sSNPs) have often been called "silent" SNPs because they do not change the amino acid composition of the protein, owing to the degeneracy of the genetic code (more than one codon encodes the same amino acid). In recent years, sSNPs have attracted more attention in mammals, since these silent mutations appear to be linked to a long list of diseases (Sauna and Kimchi-Sarfaty, 2011). However, in fish no attention has been paid to these mutations so far. In humans, sSNPs can lead to disease by different mechanisms: disrupting splicing signals (Parmley and Hurst, 2007); altering regulatory binding sites (Stergachis et al., 2013); and changing the secondary structure of the mRNA, thereby affecting protein expression (Brest et al., 2011). It has also been described that a change in codon usage may compromise translation elongation by introducing a rarer codon (associated with a negative ΔRSCU value). This leads to the use of a less abundant tRNA, which may decrease the local elongation rate and, in consequence, lower protein synthesis levels and/or cause protein misfolding. By contrast, the opposite effect may take place for nucleotide variants with a positive ΔRSCU value (Sauna and Kimchi-Sarfaty, 2011). In our study, one negative ΔRSCU value was assigned (hsp70 c.1950C > T), and two positive ΔRSCU values (itb2 c.2275G > A; gsto-1 c.558G > A). These values suggest that these sSNPs may be associated with differences in the local elongation rate. Nonetheless, it is not clear whether these nucleotide variants cause a change at the physiological level (Sauna and Kimchi-Sarfaty, 2011). Further studies are needed to elucidate the physiological consequences of these polymorphisms.
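The sign convention for the codon usage change can be made concrete with a short sketch. It assumes the usual definition of relative synonymous codon usage (the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally) and takes the change as the value for the variant codon minus the value for the reference codon, so that a shift to a rarer codon gives a negative result. The codon counts below are invented for illustration and do not come from the salmon data; the function names are likewise hypothetical.

```python
def rscu(codon_counts):
    """RSCU for one synonymous codon family.

    codon_counts maps every synonymous codon of a single amino acid
    to its observed count in a reference set of coding sequences.
    """
    total = sum(codon_counts.values())
    if total == 0:
        raise ValueError("codon family has no observations")
    # Count expected for each codon under equal usage of all synonyms.
    expected = total / len(codon_counts)
    return {codon: n / expected for codon, n in codon_counts.items()}


def delta_rscu(codon_counts, ref_codon, alt_codon):
    """RSCU of the variant codon minus RSCU of the reference codon."""
    values = rscu(codon_counts)
    return values[alt_codon] - values[ref_codon]


# Hypothetical counts for the four glycine codons (illustration only).
gly_counts = {"GGA": 10, "GGC": 40, "GGG": 20, "GGT": 10}

print(rscu(gly_counts))                      # GGC is the preferred codon here
print(delta_rscu(gly_counts, "GGC", "GGT"))  # -1.5: shift to a rarer codon
print(delta_rscu(gly_counts, "GGT", "GGC"))  # 1.5: shift to a more frequent codon
```

Under these assumptions, a negative value flags a synonymous change toward a less frequently used codon and a positive value flags the opposite, which matches the interpretation given in the text.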

We obtained the score for the nonsynonymous SNPs based on the cutoffs for SIFT and POLYPHEN because they have previously been deemed appropriate for nucleotide variations that may play a role in infective processes in mammals (Xi et al., 2004; Bhatti et al., 2006). The ranking scheme used in our study aims to evaluate in silico the nucleotide

FIGURE 6 | Structure of the GSTO-1 homology model with labeled secondary structure elements. Colored spheres show the position of each point variant: red (S26G), yellow (S119P), blue (T214P).

type (WT), and the variants S26G, S119P, and T214P. The radius of the secondary protein structure and the gradient color (from 0.5 to 4.5 Å) are functions of the RMSF values calculated. Arrowheads highlight the most affected protein regions: the helix α10 (black) and the loop between helices α6 and α7 (white).

variations in sequences of interest in order to provide semiquantitative information, and is primarily intended for use in the absence of biochemical characterization. The nonsynonymous SNPs were found only in the gsto-1 gene sequence. GSTO-1 is an enzyme involved in the biotransformation of compounds including toxic substances and oxidative stress products, the transport of ligands, and the regulation of signaling pathways (Burmeister et al., 2008). The GSTO-1 enzyme is responsible for catalyzing the reaction that results from glutathione conjugates generated by GSH in response to damage signals, thereby influencing the ratio between reduced and oxidized glutathione (GSH/GSSG) and modulating the expression of cytokines related to the Th1 immune response (Dobashi et al., 2001); it thus plays a key role in modulating the host immune response. To date, very few studies have identified the effect of SNPs in the gsto-1 sequence (He et al., 2013; Zmorzynski et al., 2015), including in teleost species (Liao et al., 2017). The SIFT and POLYPHEN analysis of the four gsto-1 nonsynonymous SNPs found showed two of them with the second-highest (c.205A > G; c.484T > C) and two with the highest (c.499T > C; c.769A > C) possible score. These results indicate that the tridimensional structure of the protein would be affected.
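The distinction between synonymous and nonsynonymous SNPs that underlies this scoring can be sketched by translating a codon before and after a single-base change with the standard genetic code. The reference codons used below are assumptions chosen only to reproduce substitution types reported in the text (for instance, a T > C change turning tyrosine into histidine); the actual salmon codons are not given in this section, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_consequence(ref_codon, position, alt_base):
    """Classify a single-base change inside one codon.

    position is 1-based within the codon (1, 2, or 3).
    Returns (reference amino acid, variant amino acid, class).
    """
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2, or 3")
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position - 1] + alt_base.upper() + ref_codon[position:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind


# Hypothetical codons, for illustration only.
print(codon_consequence("TAC", 1, "C"))  # ('Y', 'H', 'nonsynonymous')
print(codon_consequence("ACC", 1, "C"))  # ('T', 'P', 'nonsynonymous')
print(codon_consequence("GGC", 3, "T"))  # ('G', 'G', 'synonymous')
```

Tools such as SIFT and POLYPHEN then go further than this classification by estimating how damaging a given amino acid replacement is likely to be; that scoring is not reproduced here.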

Despite the relevance of SNP identification and the progress made, there is a clear lack of studies proposing predictive models that could help to visualize the consequences of SNPs located in the translated region and ranked with a high-risk score. In fact, one limitation of these tools is that they offer no reliable proposal about the changes that could take place in the protein structure. Hence, it is not possible to estimate at the structural level whether the nucleotide modifications with a high impact score affect regions essential for protein function, such as the active site or other fundamental interactions involved in stabilizing its structure. To carry out this kind of analysis, it is necessary to characterize the mutant in vitro and determine the protein structure by, for instance, X-ray crystallography. Regrettably, knowledge about tridimensional protein structures in salmon is still very limited, with no structural information available for GSTO-1. Therefore, in order to evaluate the effect of nonsynonymous SNPs at the protein structure level, it is strictly necessary to generate a predicted tridimensional protein structure using bioinformatics tools. Thus, homology modeling could be used to predict the tridimensional structure of the protein and, with it, to establish a relationship between the structural changes provoked by nonsynonymous SNPs and their relevance (SIFT+POLYPHEN score).

Although there are no antecedents in fish, the effect of SNPs on protein structure has been evaluated in genes of immunologic interest (Jang et al., 2017; Majumdar et al., 2017; Melzer and Palanisamy, 2018), including the human glutathione-S-transferase superfamily (Kitteringham et al., 2007). The structure of GSTO-1 is composed mainly of an N-terminal domain of the thioredoxin type, which consists of four central β-sheets surrounded at each end by α-helices, and a C-terminal end made mainly of α-helices (Board et al., 2000). The latter helices fold over the N-terminal domain, generating a network of hydrogen bonds that defines a continuous surface (β/α structure) (Board et al., 2000). This β/α structure could be destabilized by the SNP c.205A > G (S26 amino acid substitution) located in the β2 sheet, which

could introduce structural conformational freedom both in the β1 sheet and in the adjacent one (β2 sheet). In addition, the N-terminal end contains the glutathione binding site, also called the G-site, in which cysteine 32 (located at the N-terminal end of the α1 helix) forms a disulfide bond in the presence of reduced glutathione (GSH) (Rossjohn et al., 1998). It has been described in other GST proteins that an H-bond between the GSH sulfur and the OH of a tyrosine or serine stabilizes the GSH thiolate anion (Kortemme and Creighton, 1995). Mutations of these serine or tyrosine sites lead to substantial or complete protein inactivation (Stenberg et al., 1991; Board et al., 1995). Based on the predicted tridimensional S. salar GSTO-1 structure, S26 is located in the proximity of the G-site. Taken together, the S26G substitution could affect protein function, although more studies are needed to establish the consequence of this candidate nonsynonymous SNP at the protein level.

The predicted protein structure for sGSTO-1 S119P showed a modification in the kink region of the α4 helix, rendering it longer than in the wild type. This is probably because the proline substitution prevents formation of the H-bond network within the alpha helix. A previous report described that proline modifies protein secondary structure, suggesting that the Ser119Pro replacement probably interferes with α4 helix formation (Chiu et al., 2013). On the other hand, despite the high SIFT+POLYPHEN score obtained for the SNP c.499T > C, no appreciable changes in secondary structure were observed in the predicted sGSTO-1 Y124H model. However, the physicochemical change may have impacts that are not easily predicted from a structural point of view.

The homology model for the SNP c.769A > C (sGSTO-1 T214P) showed a change in the tridimensional conformation at the C-terminal end of the α7 helix of the protein. The α7 helix is directly involved in the H-site of the protein, which has a hydrophobic motif and is adjacent to the G-site. This site is composed of both the N-terminal and the C-terminal ends and has the putative function of binding GSH or other target molecules. In sGSTO-1, threonine 214 (hydrophilic in nature) forms part of a hydrophobic motif and contributes to the formation of H-bonds, thus allowing a greater variety of substrates to bind (Board et al., 2000). Therefore, the T214P substitution could decrease the efficiency of the protein, since an exclusively apolar character narrows the range of substrates that can bind to this active site.

The SNPs with a relevant impact on the predicted GSTO-1 structure (S26G; S119P; T214P) were also assessed by MD simulations. The results show that these variations affect the dynamics of common regions within the GSTO-1 structure, making the overall structure of the enzyme more rigid than that of the wild type. Of the three variants assayed, only T214P is located directly in the main affected helix, while the other two (S26G; S119P) are located closer to the N-terminal region. Since the protein structure is a complex network of covalent and noncovalent interactions, the MD results imply that the local effect of each of these substitutions is transmitted throughout the protein structure and affects regions that are likely to be decisive for the stability or the catalytic capabilities of the enzyme. Since conformational dynamics are an essential aspect of protein function, these differences probably affect the functioning of sGSTO-1 and, consequently, the fitness of the organism (Bhabha et al., 2013). Enhanced rigidity is a trait commonly associated with more stable proteins (Radestock and Gohlke, 2011). Therefore, it is possible that these SNPs alter the stability of GSTO-1. On the other hand, although these variations are not located at the binding site of the enzyme, these SNPs could have allosteric effects that modulate catalysis. This is because conformational dynamics can play a role in multiple aspects of enzymatic activity, such as the accessibility of the substrate to the active site, the product release rate, and the probability of finding catalytically active populations within the conformational ensemble (Henzler-Wildman et al., 2007; Kokkinidis et al., 2012).
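The rigidity comparison rests on per-residue root mean square fluctuation (RMSF), which can be sketched as follows. This is a minimal illustration assuming the standard definition (the square root of the time-averaged squared displacement of each atom from its mean position) and frames that have already been superimposed; the trajectories are synthetic random data, not output from the simulations reported here, and the helper names are hypothetical.

```python
import math
import random


def rmsf(trajectory):
    """Per-atom RMSF for a trajectory given as frames -> atoms -> (x, y, z).

    Frames are assumed to be already superimposed on a common reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for i in range(n_atoms):
        # Time-averaged position of atom i.
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames
                for k in range(3)]
        # Mean squared displacement of atom i from its average position.
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


def synthetic_trajectory(n_frames, n_atoms, sigma, seed):
    """Hypothetical trajectory: atom i jitters around (i, 0, 0) with spread sigma."""
    rng = random.Random(seed)
    return [
        [(i + rng.gauss(0, sigma), rng.gauss(0, sigma), rng.gauss(0, sigma))
         for i in range(n_atoms)]
        for _ in range(n_frames)
    ]


# Synthetic stand-ins for a flexible "wild type" and a more rigid "variant".
wild_type = rmsf(synthetic_trajectory(200, 20, sigma=1.0, seed=1))
variant = rmsf(synthetic_trajectory(200, 20, sigma=0.4, seed=2))
mean_wt = sum(wild_type) / len(wild_type)
mean_var = sum(variant) / len(variant)
print(f"mean RMSF wild type: {mean_wt:.2f}, variant: {mean_var:.2f}")
print("variant more rigid:", mean_var < mean_wt)
```

In practice the per-residue profile, rather than its mean, is what localizes the affected regions; a lower RMSF across a helix or loop in a variant relative to the wild type is the kind of signal interpreted above as increased rigidity.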

Altogether, and based on the modeled sGSTO-1 structure, most of the candidate nonsynonymous SNPs identified showed changes in the secondary structure dynamics. Therefore, the strategy of evaluating nonsynonymous SNPs based on homology modeling and all-atom MD simulations provides additional evidence that may help to rank them, in order to subsequently validate those that are most relevant according to their structural effect. Further studies should focus on the in vitro characterization of these enzymes to generate a complete biophysical picture of the effects of these SNPs in these genes associated with immune function.

# AUTHOR CONTRIBUTIONS

The conceptualization of the study was performed by EV-V, MI, and FER-L. The methodology was originally proposed by EV-V, KM, AMS, MI, and FER-L. The SNP search and identification was carried out by EV-V, SR-C, JY, and FER-L. The codon usage analysis was carried out by EV-V, HV, and FER-L. The homology modeling was conducted by EV-V, JAR-P, KM, and FER-L. The molecular dynamics simulations were performed by JAR-P, PC, VC-F, and FER-L. EV-V, SR-C, JAR-P, KM, JY, HV, PC, VC-F, LT, AMS, MI, and FER-L participated actively in the data analysis and interpretation. EV-V, SR-C, and FER-L wrote the original draft. All the authors corrected, read, and approved the final manuscript.

# FUNDING

This study was supported by INNOVA-CORFO (No. 09MCSS-6691 and 09MCSS-6698), FONDECYT (No. 1161015; 11150807; 11180705; 11181133), DICYT-USACH, VRIDEI-USACH (USA1899 VRIDEI 021943IB-PAP), and Universidad Mayor startup funds (No. OI101205; SR-C). The authors also acknowledge the grants from CONICYT-BCH (Chile) Postdoctoral fellowship (No. 74170091; EV-V), International postdoctoral stay 2019 (Universidad de Chile, UCH1566; EV-V) and VRIDEI-USACH (FR-L). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01406/full#supplementary-material

FIGURE S1 | Homology modeling for predicted S. salar CCL19 based on tridimensional H. sapiens CCL19 structure. (A) Tridimensional H. sapiens CCL19 (hCCL19) structure. (B) Predicted tridimensional S. salar CCL19 (sCCL19) structure. (C) hCCL19 and sCCL19 overlay. (D) Ramachandran plot for the predicted sCCL19 structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p) is indicated.

FIGURE S2 | Homology modeling for predicted S. salar ITB2 based on tridimensional H. sapiens ITB2 structure. (A) Tridimensional H. sapiens ITB2 (hITB2) structure. (B) Predicted tridimensional S. salar ITB2 (sITB2) structure. (C) hITB2 and sITB2 overlay. (D) Ramachandran plot for the predicted sITB2 structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l, p), generously allowed regions (~a,~b,~l,~p) is represented.

FIGURE S3 | Homology modeling for predicted S. salar HSP70 based on tridimensional H. sapiens HSP70 structure. (A) Tridimensional H. sapiens HSP70 (hHSP70) structure. (B) Predicted tridimensional S. salar HSP70 (sHSP70) structure. (C) hHSP70 and sHSP70 overlay. (D) Ramachandran plot for the predicted sHSP70 structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p) and disallowed regions (GLN83 and ALA540) is represented.

FIGURE S4 | Homology modeling for predicted S. salar MHC class I based on tridimensional H. sapiens MHC class I structure. (A) Tridimensional H. sapiens MHC class I (hMHC class I) structure. (B) Predicted tridimensional S. salar MHC class I (sMHC class I) structure. (C) hMHC class I and sMHC class I overlay. (D) Ramachandran plot for the predicted sMHC class I structure. The amino acid distribution into most favored regions (A,B,L), additional allowed regions (a,b,l,p), generously allowed regions (~a,~b,~l,~p) and disallowed regions (ASP39 and ILE40) is represented.

FIGURE S5 | Molecular dynamics simulation for the wild-type GSTO-1 (A) and the variants S26G (B), S119P (C), and T214P (D). The root mean square deviation (RMSD) of the Cα atoms over the explored time window of 100 ns is shown for each replicate.

FIGURE S6 | Root mean square fluctuation (RMSF) profiles for each mutant analyzed (S26G; S119P; T214P) compared to the GSTO-1 wild-type.

# REFERENCES

population screenings are predicted to impact protein function. Genomics 83, 970–979. doi: 10.1016/j.ygeno.2003.12.016


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Vallejos-Vidal, Reyes-Cerpa, Rivas-Pardo, Maisey, Yáñez, Valenzuela, Cea, Castro-Fernandez, Tort, Sandino, Imarai and Reyes-López. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
