Data Report ARTICLE
Draft Genome and Complete Hox-Cluster Characterization of the Sterlet Sturgeon (Acipenser ruthenus)
- 1Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture of China, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, China
- 2College of Fisheries, Chinese Perch Research Center, Huazhong Agricultural University, China
- 3Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, China
- 4BGI Education Center, University of Chinese Academy of Sciences, China
- 5School of Veterinary Medicine, Rakuno Gakuen University, Japan
Background: Sturgeons (Chondrostei: Acipenseridae) are a group of “living fossil” fishes at a basal position among Actinopteri. They have attracted great public interest because of their special evolutionary position, their conservation challenges and their highly prized eggs (caviar). The sterlet sturgeon (Acipenser ruthenus Linnaeus, 1758) is a relatively small-sized sturgeon that is widely distributed in both Europe and Asia. In this study, we performed whole genome sequencing, de novo assembly and gene annotation of the sterlet to construct its draft genome. Findings: We obtained a 1.83-Gb genome assembly (BUSCO completeness of 81.6%) from a total of 316.8 Gb of raw reads generated on an Illumina Hiseq 2500 platform. The scaffold N50 and contig N50 values reached 191.06 kb and 18.88 kb, respectively. The sterlet genome is predicted to comprise 42.84% repeated sequences and to contain 22,222 protein-coding genes, of which 21,127 (95.07%) have been functionally annotated with at least one hit in public databases. Genome-level phylogeny supports a basal position of the sterlet among ray-finned fishes, and 4dTv analysis estimates that a quite recent whole genome duplication occurred about 21.3 million years ago. Moreover, seven Hox clusters carrying 68 Hox genes were characterized in the sterlet. Phylogeny of HoxA clusters in sterlet and American paddlefish divided the clusters of these two species into two groups, supporting the independence of the lineage-specific genome duplications in Acipenseridae and Polyodontidae. Conclusions: This draft genome makes up for the lack of genomic and molecular data on the sterlet and its Hox clusters. It also provides a genetic basis for further investigation of lineage-specific genome duplication and the early evolution of ray-finned fishes.
Sturgeons (Acipenseridae, Acipenseriformes) have long been an interesting group of fish because of their commercial value and conservation challenges (Wei et al., 2011). They also draw wide attention because of their basal position on the phylogenetic tree of ray-finned fishes. The origin of sturgeons is estimated to date back to approximately 350 million years ago (Mya), even earlier than the origin of Holostei (bowfin and gars) and Teleostei (teleosts) (Hughes et al., 2018). Therefore, they did not experience the teleost-specific genome duplication (TGD) event that happened around 320 Mya (Jaillon et al., 2004). However, there is clear evidence from molecular markers, chromosome numbers and inferred ploidy levels that they have experienced one or more rounds of their own lineage-specific polyploidization (Crow et al., 2012), resulting in complex genome structures and the widest range of chromosome numbers among all vertebrates (Havelka et al., 2016). Little is known about the Acipenseridae-specific genome duplication (GD) and its consequences because of the lack of sturgeon genome sequences.
This special whole genome duplication (WGD) event also provides new genetic material to generate phenotypic diversity in sturgeons. However, sturgeons have quite limited species diversity but exceedingly fast overall rates of body size evolution, serving as an interesting exception to the phenotypic ‘evolvability’ hypothesis (Rabosky et al., 2013). As one of the earliest evolved groups of ray-finned fishes, sturgeons still retain many shark-like features such as a cartilaginous skeleton and a heterocercal tail, and the extant species look very similar to their fossil counterparts, suggesting a low rate of body-shape evolution (Rabosky et al., 2013). Therefore, sturgeons are a valuable model for investigating the complicated relationship between phenotype and the polyploid genome caused by WGD. Meanwhile, Hox genes, which encode a distinct class of transcription factors associated with axial patterning and appendage development, are often among the first genes examined to understand their roles in the evolution of vertebrate body plans and novelty (Amemiya et al., 2010; Crow et al., 2012). The sterlet sturgeon (Acipenser ruthenus Linnaeus, 1758) is a well-known representative of the sturgeons, notable for its relatively small body size and wide distribution in comparison to other sturgeons. It is considered vulnerable by the IUCN but has been successfully bred artificially. In this study, we performed whole genome sequencing of the sterlet and generated a draft genome assembly for a sturgeon for the first time. We also constructed a fossil-calibrated phylogenetic tree, estimated the occurrence time of the sterlet-specific GD and retrieved the complete Hox clusters to provide a preliminary view of the early evolutionary history of ray-finned fishes.
VALUE OF THE DATA
This is the first genome report of a sturgeon. The sterlet sturgeon genome assembly is 1.83 Gb in size with a scaffold N50 of 191.06 kb. Our draft assembly contains 784 Mb of repeats (42.84% of the genome) and 22,222 protein-coding genes.
The time-calibrated phylogenetic tree showed a most basal position of sterlet in Actinopterygii (ray-finned fishes) at the genome level and dated the origin of sterlet back to 358 Mya, extremely close to the Late Devonian Extinction that happened around 358.9 Mya.
4dTv analysis shows that the sterlet-specific genome duplication event happened quite recently, about 21.3 Mya. This is close to the estimated occurrence time (42 Mya) of the paddlefish-specific genome duplication event, even though the two whole-genome duplication events were independent.
Seven Hox clusters including 68 Hox genes were identified in the sterlet genome. Phylogeny of HoxA clusters of sterlet and American paddlefish divided the clusters of these two species into two groups, suggesting that the WGD events happened independently in the two lineages.
MATERIALS AND METHODS
The sequenced sterlet sturgeon (about 2.5 years old, 0.80 kg, and 56.8 cm in length) was artificially cultured at Taihu Station, Yangtze River Fisheries Research Institute, Chinese Academy of Fishery Sciences, China. About 30 g of skeletal muscle was dissected from the anesthetized fish for genomic DNA isolation. Another 10 mL of blood was collected from the caudal vertebral vessels for RNA extraction and transcriptome sequencing. All vouchers were deposited in China National GeneBank with accession numbers of WH20161125002-MU (muscle) and -BL (blood). All experiments were carried out in accordance with the guidelines of the Animal Ethics Committee of Yangtze River Fisheries Research Institute of Chinese Academy of Fishery Sciences (No. YFI-01).
Genome sequencing and assembly
We applied a whole-genome shotgun sequencing strategy to generate short paired-end reads (125 or 150 bp) by constructing a series of short-insert (270, 500 and 800 bp) and long-insert (2, 5, 10 and 20 kb) libraries and sequencing them on a Hiseq 2500 platform (Illumina, San Diego, USA). Raw reads were pre-processed with SOAPfilter software (Luo et al., 2012) to trim 5 bases at the 5' end of all reads and to discard low-quality reads (quality value <20) and reads with N bases (number of non-sequenced bases >10). Subsequently, the 17-mer depth frequency distribution method was employed to estimate the genome size of the sterlet sturgeon using data from the short-insert libraries, according to the formula: genome size = total number of k-mers / peak value of the k-mer frequency distribution (Li et al., 2010). All clean reads were assembled into contigs and scaffolds using SOAPdenovo v2.04 (Luo et al., 2012) with optimized parameters (pregraph -K 41 -d 1; contig -M 3; scaff -F; others as the default). Finally, gaps in the scaffolds were successively filled using Kgf and GapCloser (Luo et al., 2012) with clean reads from the short-insert libraries. Completeness of the final assembly was assessed by BUSCO (Simão et al., 2015).
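The genome size formula above can be illustrated with a minimal sketch. It assumes the k-mer counts are available as a depth histogram; the function name, the `min_depth` cutoff used to skip the low-depth error peak, and the toy numbers are all hypothetical and are not taken from the sterlet data or from the software used in this study.

```python
def estimate_genome_size(kmer_histogram, min_depth=5):
    """Estimate genome size as total k-mers / peak k-mer depth.

    kmer_histogram maps a k-mer depth (times a k-mer was observed)
    to the number of distinct k-mers observed at that depth.
    min_depth is an assumed cutoff that skips the low-depth error peak.
    """
    # Keep only depths at or above the cutoff so that error k-mers
    # (seen once or twice) do not dominate the peak search.
    usable = {d: n for d, n in kmer_histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mer depths at or above min_depth")
    # Peak depth: the depth with the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer count: each distinct k-mer weighted by its depth.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical numbers, not the sterlet data).
toy_histogram = {1: 900, 2: 300, 18: 40, 19: 70, 20: 100, 21: 70, 22: 40}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

On real data the histogram would come from a k-mer counter run on the short-insert reads, and the choice of cutoff can noticeably shift the estimate.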
Repeat sequence prediction and gene annotation
A de novo repeat library for the sterlet was constructed by combining RepeatModeler v1.05 (RepeatModeler, RRID:SCR_015027) and LTR_FINDER v1.0.6 (Xu et al., 2007). Known and de novo transposable elements (TEs) in the assembled genome were identified by RepeatMasker v4.0.6 (RepeatMasker, RRID:SCR_012954) using both RepBase v21.01 (Jurka et al., 2005) and the de novo repeat library. RepeatProteinMask v3.3.0 (Chen, 2004) was then used to identify TE-related proteins. Meanwhile, tandem repeats were predicted using Tandem Repeats Finder (TRF) v4.07b (Benson et al., 1999), and Tandem Repeats Analysis Program (Sobreira et al., 2006) was used to select candidate microsatellite markers from the TRF output.
Gene models in the sterlet genome were predicted by an integrated strategy of three methods. For homology annotation, we downloaded published protein sequences of ten representative vertebrates including zebrafish (Danio rerio; Howe et al., 2013), spotted gar (Lepisosteus oculatus; Braasch et al., 2015), elephant shark (Callorhinchus milii; Venkatesh et al., 2014), sea lamprey (Petromyzon marinus; Smith et al., 2013), medaka (Oryzias latipes; Kasahara et al., 2007), Nile tilapia (Oreochromis niloticus; Brawand et al., 2014), three-spined stickleback (Gasterosteus aculeatus; Jones et al., 2012), Atlantic cod (Gadus morhua; Star et al., 2011), fugu (Takifugu rubripes; Aparicio et al., 2002) and spotted green pufferfish (Tetraodon nigroviridis; Jaillon et al., 2004), and aligned them against the sterlet assembly using BLAST (Altschul et al., 1990) in tblastn mode with an e-value of 1e-5. SOLAR (Yu et al., 2006) was subsequently employed to select the best hit of each alignment. For ab initio prediction, the sterlet genome assembly was masked according to the previously identified repeat sequences and then scanned using AUGUSTUS v3.2.3 (Stanke et al., 2006) and GENSCAN v1.0 (Burge and Karlin, 1997) to predict gene structures. For transcriptome-based annotation, we sequenced a blood transcriptome on a Hiseq X10 platform (Illumina), mapped the reads to the genome scaffolds using TopHat v2.0.13 (Trapnell et al., 2009) and assembled them into transcripts using Cufflinks v2.2.1 (Trapnell et al., 2010). Finally, all predicted genes from these three methods were merged and filtered by GLEAN v1.1 (Elsik et al., 2007) to create a consensus gene set.
Functional annotation of the sterlet genes was first performed by aligning all the protein sequences produced by GLEAN against public databases including Swiss-Prot, TrEMBL (Boeckmann et al., 2003) and KEGG (Kanehisa et al., 2016) using BLASTP v2.3.0+ (Altschul et al., 1990) with an e-value of 1e-5. Subsequently, motifs and domains were annotated using InterProScan (Hunter et al., 2008) by searching the PANTHER (Thomas et al., 2003), Pfam (Finn et al., 2013), PRINTS (Attwood, 2002), ProDom (Bru et al., 2005) and SMART (Letunic et al., 2004) databases. Finally, InterProScan (Hunter et al., 2008) was applied to assign Gene Ontology (GO) terms (Ashburner et al., 2000).
Fossil-calibrated phylogeny analysis
To carry out a phylogenetic analysis of the sterlet, we obtained the predicted coding sequences (CDS) from sterlet and 14 other vertebrates, including Asian arowana (Scleropages formosus; Bian et al., 2016), coelacanth (Latimeria menadoensis; Amemiya et al., 2013), common carp (Cyprinus carpio; Xu et al., 2014) and Atlantic salmon (Salmo salar; Lien et al., 2016) as well as the ten species used for homology gene annotation, with sea lamprey as the outgroup. BLAST in blastp mode with an e-value of 1e-5 was used to build the similarity matrix, followed by OrthoMCL (Li et al., 2003) to distinguish gene families. One-to-one orthologues were identified by Markov Chain Clustering (MCL) and aligned by MUSCLE v3.7 (Edgar, 2004). The first nucleotide of each codon was chosen to construct a maximum-likelihood (ML) tree using PhyML v3.0 (Guindon et al., 2010) with a gamma distribution across aligned sites and the HKY85 substitution model. Branch supports were evaluated by the approximate likelihood ratio test (aLRT), and the deduced topology was tested by Bayesian inference (BI) using MrBayes v3.2.2 (Ronquist et al., 2012). Divergence time of the sterlet from other vertebrates was estimated by a Bayesian method using MCMCtree in PAML v4.9 (Yang, 2007) with two fossil calibrations, Latimeria (Sarcopterygii, 408 Mya ~ 427.9 Mya) and Danio (Teleostei, 151.2 Mya ~ 252.7 Mya) (Hughes et al., 2018).
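The extraction of first codon positions mentioned above can be sketched as follows. This is an illustration only: it assumes in-frame, gap-consistent CDS alignments, and it assumes that the per-gene alignments are concatenated into one row per species before tree building, which the text does not state explicitly. Function names and toy sequences are hypothetical.

```python
def first_codon_positions(cds_alignment):
    """Return the first nucleotide of every codon from an aligned CDS."""
    if len(cds_alignment) % 3 != 0:
        raise ValueError("aligned CDS length must be a multiple of 3")
    # Slice positions 0, 3, 6, ... which are the first codon positions.
    return cds_alignment[0::3]


def concatenate_first_positions(alignments_by_gene):
    """Build one concatenated row per species from per-gene alignments.

    alignments_by_gene is a list of dicts {species: aligned CDS},
    one dict per one-to-one orthologue; every dict must contain
    the same species.
    """
    species = sorted(alignments_by_gene[0])
    supermatrix = {name: "" for name in species}
    for gene in alignments_by_gene:
        for name in species:
            supermatrix[name] += first_codon_positions(gene[name])
    return supermatrix


# Toy alignments for two orthologues (hypothetical sequences).
toy_genes = [
    {"sterlet": "ATGGCC", "gar": "ATGGCA"},
    {"sterlet": "CTGAAA", "gar": "TTGAAG"},
]
matrix = concatenate_first_positions(toy_genes)
for name, row in matrix.items():
    print(name, row)
```

The resulting rows would then be written in whatever alignment format the tree-building program expects.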
4dTv analysis to check the sterlet-specific GD
Four-fold degenerate third-codon transversion (4dTv) analysis was carried out to test the sterlet-specific GD by comparison with the Asian arowana genome. Proteins from the two genomes were first aligned using all-to-all BLAST in blastp mode with an e-value of 1e-5. Subsequently, syntenic regions for sterlet-sterlet, arowana-arowana and sterlet-arowana comparisons were identified by MCscan v0.8 (Wang et al., 2012) with default parameters. Homologous protein sequences from these syntenic regions were retrieved and converted to CDS for alignment by MUSCLE (Edgar, 2004). Lastly, 4dTv values were calculated and corrected with the HKY model in the PAML package (Yang, 2007).
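As a rough illustration of the quantity being measured, the sketch below computes an uncorrected 4dTv value for one pair of aligned coding sequences. It assumes the standard genetic code and in-frame alignments, and it deliberately omits the HKY correction applied in this study; the function names and toy sequences are hypothetical.

```python
# Codon prefixes (first two bases) whose third position is four-fold
# degenerate under the standard genetic code.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(a, b):
    """True when one base is a purine and the other a pyrimidine."""
    return (a in PURINES and b in PYRIMIDINES) or (
        a in PYRIMIDINES and b in PURINES
    )


def raw_4dtv(cds_a, cds_b):
    """Uncorrected 4dTv: transversions / four-fold degenerate sites."""
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be aligned and in frame")
    sites = 0
    transversions = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Count a site only if both codons share a four-fold prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        # Skip gaps and ambiguous bases at the third position.
        if codon_a[2] not in "ACGT" or codon_b[2] not in "ACGT":
            continue
        sites += 1
        if is_transversion(codon_a[2], codon_b[2]):
            transversions += 1
    return transversions / sites if sites else None


# Toy pair: GCT/GCA is a transversion, CTA/CTG a transition,
# GGA/GGA is unchanged, ATG is not four-fold degenerate.
value = raw_4dtv("GCTCTAGGAATG", "GCACTGGGAATG")
print(f"raw 4dTv = {value:.3f}")
```

In a full analysis this value would be computed for every syntenic gene pair and the distribution of corrected values inspected for peaks.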
Hox-cluster identification and phylogenetic analysis
Reference protein sequences of the complete HoxA cluster and the partial HoxD cluster of American paddlefish (Polyodon spathula) (Crow et al., 2012) were downloaded from the National Center for Biotechnology Information (NCBI). Sequences of four complete Hox clusters of the Indonesian coelacanth (Amemiya et al., 2010) and spotted gar (Braasch et al., 2015) were downloaded from Ensembl. The protein sequences were first aligned to the sterlet genome assembly by BLAST (Altschul et al., 1990) in tblastn mode, and the hit sequences were further analyzed with exonerate (Slater and Birney, 2005) to extract exons. Hox gene order and synteny were finally determined by aligning back to the genome assembly, and the best hits were selected by SOLAR (Yu et al., 2006). The HoxA clusters from sterlet and paddlefish and the HoxA9 genes from ten vertebrates were separately aligned with MEGA v7.0.26 (Kumar et al., 2016), followed by construction of an ML phylogenetic tree.
RESULTS AND DISCUSSION
Summary of the genome sequencing and assembly
We generated 316.8 Gb of paired-end short reads (Supplementary Table 1) to assemble the draft genome of the sterlet sturgeon. After filtering low-quality sequences, the remaining clean reads amounted to about 240.9 Gb (Supplementary Table 1). The haploid genome size of the sterlet was estimated to be about 2.00 Gb (Supplementary Figure 1) by k-mer analysis (Li et al., 2010), which is quite close to the 1.87 Gb previously reported by flow cytometry (Birstein et al., 1993). Using all the clean reads, we produced a final genome assembly of 1.83 Gb, slightly smaller than the estimate. The draft assembly had a contig N50 of 18.88 kb and a scaffold N50 of 191.06 kb (Supplementary Table 2).
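For readers unfamiliar with the N50 statistic reported here, the following minimal sketch shows one common way to compute it from a list of contig or scaffold lengths. The function name and toy lengths are hypothetical; the reported values come from the assembly itself, not from this code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy scaffold lengths (hypothetical): total 300, half 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50 =", n50(toy_lengths))
```

Applied separately to contig lengths and scaffold lengths, the same function yields the contig N50 and the scaffold N50.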
Accordingly, the genome sequencing depth for the sterlet reached 132-fold based on the final 1.83-Gb assembly, and as much as 87.19% of the bases had a sequencing depth of over 20-fold (Supplementary Figure 2). The total completeness of the assembly was estimated to be 81.6% by BUSCO, including 51.9% complete and single-copy BUSCOs and another 29.7% complete and duplicated BUSCOs. A total of 4,584 genes were searched and 302 (6.6%) of them were fragmented BUSCOs. Together with the homogeneous GC distribution of the scaffolds (Supplementary Figure 3), these results indicate that our draft assembly of the sterlet genome is suitable for further analyses.
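The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only numbers reported in the text and assumes that the depth was computed as clean bases divided by assembly size; the variable names are illustrative.

```python
# Values reported in the text.
clean_reads_gb = 240.9      # clean read data after filtering
assembly_gb = 1.83          # final assembly size
single_copy_pct = 51.9      # complete and single-copy BUSCOs
duplicated_pct = 29.7       # complete and duplicated BUSCOs
fragmented_count = 302      # fragmented BUSCOs
searched_count = 4584       # total BUSCO genes searched

# Sequencing depth, assuming depth = clean bases / assembly size.
depth = clean_reads_gb / assembly_gb

# Total BUSCO completeness is the sum of the two complete classes.
complete_pct = single_copy_pct + duplicated_pct

# Fragmented BUSCOs as a percentage of all searched genes.
fragmented_pct = fragmented_count / searched_count * 100

print(f"depth = {depth:.1f}-fold")
print(f"complete BUSCOs = {complete_pct:.1f}%")
print(f"fragmented BUSCOs = {fragmented_pct:.1f}%")
```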
A relatively high content of repetitive elements
We performed repeat annotation and identified a total of 784 Mb (42.84%) of repeat sequences in the sterlet genome, including 726 Mb (39.68%) of transposable elements (TEs) and 79 Mb (4.34%) of tandem repeats (Supplementary Table 3). These data are consistent with the dominant sub-peak located at 2-fold the position of the main k-mer peak (Supplementary Figure 1). This repeat content is higher than that of the majority of published fish genomes, which usually contain no more than 40% repeats (Yuan et al., 2018). Interestingly, more class I (28.95%) than class II (14.93%) TEs were found in the sterlet genome (Supplementary Table 4), resembling the pattern of cartilaginous species (Yuan et al., 2018). In addition, as a potamodromous species dwelling mainly in freshwater, the sterlet had a relatively high DNA/TcMar-Tc1 proportion of 16.58% (130 Mb) but a relatively low microsatellite proportion of 2.10% (16 Mb) (Supplementary Table 5), a pattern typical of freshwater species (Supplementary Figure 4, data from Yuan et al., 2018). A closer look at each TE type in the sterlet may yield further insights.
Data of gene annotation and phylogenetic analysis
After masking the abundant repeats in scaffolds, we annotated 22,222 protein-coding genes (Supplementary Table 6) with an average mRNA length of 21 kb using a combined strategy of ab initio, homology-based and transcriptome-based annotation. Length distributions of the predicted genes, CDS, exons and introns were comparable to those of spotted gar, elephant shark and many other fishes (Supplementary Figure 5). Of all these genes, a total of 21,127 (95.07%) were functionally annotated in at least one public database (Supplementary Table 7).
Afterwards, we used these predicted CDS sequences along with whole-genome CDS from the 14 other selected vertebrates to perform a phylogenetic analysis, and a total of 198 single-copy orthologues from each genome (Supplementary Table 8; Supplementary Figure 6) were obtained to generate the phylogenetic topology by ML (Supplementary Figure 7) or BI (Supplementary Figure 8). The two methods produced identical phylogenetic topologies with high support values, suggesting high confidence in the deduced tree (Figure 1A). It shows that the sterlet is located at the base of Actinopterygii, serving as the sister group to all other ray-finned fishes. Therefore, this genome-level phylogeny of the sterlet confirms its very basal position as reported by other studies (Hughes et al., 2018; Peng et al., 2007). Fossil calibrations date the origin of the sterlet back to 358 Mya (Figure 1A), with a 95% confidence interval of 316~394 Mya (Supplementary Figure 9). These data are consistent with our previous comprehensive phylogenetic analysis (Hughes et al., 2018), and most interestingly, the date is extremely close to the Late Devonian Extinction that happened around 358.9 Mya (McGhee et al., 1984).
Identification of an independent WGD event that happened recently in the sterlet
Sturgeons did not experience the TGD event shared by most fishes (Ravi and Venkatesh, 2018), but there is clear evidence of a sturgeon-specific GD event (Havelka et al., 2016). In order to identify this lineage-specific GD in the sterlet, we performed a 4dTv analysis along with the Asian arowana (Bian et al., 2016), which had experienced the TGD event around 320 Mya (Jaillon et al., 2004). The analysis displays distinct peaks for each comparison of sterlet-sterlet (sterlet-specific GD), arowana-arowana (TGD) and sterlet-arowana (speciation event), and the synonymous transversion rates (Ks values) were estimated to be 0.03 and 0.45 in the sterlet sturgeon and the Asian arowana, respectively (Figure 1B). Hence, the sterlet-specific GD was estimated to have occurred about 21.3 Mya (320 Mya/0.45 × 0.03), long after the split between sturgeons and paddlefish (184 Mya; Peng et al., 2007). It seems that sturgeons (Acipenseridae) and paddlefish (Polyodontidae) experienced polyploidization events independently.
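The dating arithmetic in this paragraph can be written out as a short sketch. It assumes, as the calculation in the text implies, that the peak value scales linearly with time and that the arowana TGD peak at 320 Mya serves as the calibration; the function and parameter names are hypothetical.

```python
def date_duplication(calibration_age_mya, calibration_peak, target_peak):
    """Scale a known event age by the ratio of two peak values.

    Assumes a constant substitution rate, so that age is
    proportional to the peak value.
    """
    # Age per unit of peak value, derived from the calibration event.
    age_per_unit = calibration_age_mya / calibration_peak
    return age_per_unit * target_peak


# Values reported in the text: TGD at 320 Mya with a peak of 0.45
# in arowana, and a peak of 0.03 for the sterlet-specific GD.
sterlet_gd_mya = date_duplication(320, 0.45, 0.03)
print(f"sterlet-specific GD = {sterlet_gd_mya:.1f} Mya")
```

Because the estimate rests on a single calibration point and a constant-rate assumption, it is best read as an approximate date.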
Characterization of the complete Hox clusters
To provide further insights into the nature of the polyploid genome at the gene level after the sterlet-specific GD event, we investigated Hox gene clusters in the sterlet genome. We identified seven Hox clusters including 68 Hox genes (60 intact and 8 partial or pseudogenes) in the draft assembly (Figure 1C, Supplementary Table 9). This Hox complement seems to be a consequence of the sterlet-specific GD, since only four Hox clusters were found in sea lamprey (43 genes), elephant shark (47 genes) and spotted gar (43 genes; Venkatesh et al., 2014). Interestingly, the absence of a whole HoxC cluster in the sterlet is similar to that in some diploid teleosts such as fugu, medaka and stickleback (Pascual-Anaya et al., 2013), suggesting a trend of losing a HoxC cluster after both TGD and lineage-specific GD events. Furthermore, our HoxA-based genealogy showed that, contrary to the Hox pattern in teleosts after TGD, HoxA clusters from sterlet and paddlefish formed two separate groups (Supplementary Figure 10), indicating that Hox genes duplicated independently after the divergence of the two families. This confirms the independence of the lineage-specific GDs in the sterlet and paddlefish, which is consistent with our above-mentioned inference from 4dTv.
Sturgeons are a group of basal ray-finned fishes that underwent their own lineage-specific GD rather than the TGD shared by teleosts. This special phylogenetic position and mode of genome evolution, as well as their economic importance and conservation challenges, prompted us to generate this whole genome assembly of the sterlet. Our draft genome assembly is 1.83 Gb, with 42.84% repeated sequences and 22,222 protein-coding genes. The time-calibrated phylogenetic tree showed a most basal position of the sterlet among all ray-finned fishes, with an origin dating back to about 358 Mya. Our 4dTv analysis suggests that the sterlet-specific GD occurred quite recently, at about 21.3 Mya. We also observed a nearly doubled number of Hox clusters and genes in the sterlet, with seven Hox clusters carrying 68 Hox genes. Our data support the conclusion that the sterlet-specific GD and the paddlefish-specific GD occurred independently. However, whether this WGD is sterlet-specific or shared by all members of the family Acipenseridae remains to be answered by genome sequencing of more sturgeon species.
Raw reads, draft assembly and gene annotation of the whole genome sequencing along with the transcriptome raw reads were deposited at NCBI with accession ID of BioProject PRJNA491785.
Q.W., H.D., C.L., J.X. and Q.S. conceived and designed the project. Y.H., P.C., Y.L. and C.B. analyzed the data. C.L., R.R., H.Y. and X.Y. collected and processed the samples. P.C., Y.H. and Q.S. wrote the manuscript. Q.S. and Q.W. revised the manuscript. All authors have read and approved the final manuscript and declared no competing interests.
The study was supported by the National Natural Science Foundation of China (grant number NSFC 31772854), China Postdoctoral Science Foundation (grant number 2017M622560), the National Program on Key Basic Research Project (973 Program, 2015CB15072), Hubei Postdoctoral Innovation Post Project (No. 2017C08), Shenzhen Special Program for Development of Emerging Strategic Industries (No. JSGG20170412153411369), and Office of Fisheries Supervision and Management for the Yangtze River Basin, MARA, PRC (No. 171821301354051046).
Keywords: sterlet, sturgeon, Genome, hox, lineage-specific whole genome duplication
Received: 12 Oct 2018;
Accepted: 23 Jul 2019.
Copyright: © 2019 Cheng, Huang, Du, Li, Lv, Ruan, Ye, Bian, You, Xu, Liang, Shi and Wei. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Dr. Qiong Shi, BGI Academy of Marine Sciences, Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, Shenzhen, China, email@example.com
Prof. Qiwei Wei, Chinese Academy of Fishery Sciences, Key Laboratory of Freshwater Biodiversity Conservation, Ministry of Agriculture of China, Yangtze River Fisheries Research Institute, Wuhan, China, firstname.lastname@example.org