Draft Assembled Genome of Walleye Pollock (Gadus chalcogrammus)

1 Biotechnology Research Division, National Institute of Fisheries Science, Busan, South Korea, 2 D.iF Inc., Yongin-si 16954, Gyeonggi-do, South Korea, 3 Research and Development Center, Insilicogen Inc., Yongin-si 16954, Gyeonggi-do, South Korea, 4 Department of Biological Sciences, Sungkyunkwan University, Suwon, South Korea, 5 Aquaculture Industry Research Division, East Sea Fisheries Research Institute, National Institute of Fisheries Science, Gangneung, South Korea


INTRODUCTION
Seafood-based diets are common in major populations worldwide, and overexploitation can drive species toward extinction. Global warming and contamination of coastal surface waters alter the marine environment and disrupt food chains. South Korea has a prevalent seafood culture and is one of the largest seafood importers and exporters in the world. The country continually invests in aquaculture infrastructure to meet food requirements, increase production, and reduce wild capture to preserve the marine ecosystem. It also focuses on species that are not yet adapted to artificial aquaculture systems. In this study, we sequenced the genome of Gadus chalcogrammus (walleye pollock), a cold-water species with a deep-sea habitat (200-1,200 m depth) that requires temperatures of 1-10 °C to survive (Bang et al., 2018). It is the second most commonly consumed fish in Korea, and is used worldwide in foods such as surimi and roe (Anvari et al., 2018). Walleye pollock dominated the seafood market until the 1990s, but in the 2000s its market collapsed because of overfishing and the rise in sea-surface temperatures, which affected the cod ecosystem (Hwang et al., 2019; Kangsu et al., 2020). The decline in production led to the mislabeling of other fish as walleye pollock. To control this malpractice, various molecular authentication systems, such as polymerase chain reaction (PCR) and other marker kits, were introduced (Noh et al., 2019). Artificial insemination was also explored as a way to circumvent unfavorable natural conditions and increase production in natural and aquaculture systems (Joo-Young and O-Nam, 2017). Various initiatives have attempted to bring this fish into aquaculture, but a reference genome to support genomic selection is still missing.
Only the mitochondrial genomes (Carr and Dawn Marshall, 2008; Sim et al., 2018) and partially assembled contigs are available for this fish, along with a few transcriptomes deposited in the National Center for Biotechnology Information (NCBI) database. Additionally, in the genus Gadus, only the genome of Gadus morhua is publicly available.

Significance of the Data
This Gadus chalcogrammus genome is another reference for molecular studies in the Gadus genus. It will be a valuable resource to conduct comparative analyses within the Gadus genus, and enhance the genomic selection process in molecular-assisted breeding.

Sample Collection and Genomic DNA Extraction
A single female fish (93 g) was obtained from the East Sea Fisheries Research Institute in March 2018, and maintained at 8 ± 0.5 °C in aerated seawater. The abdominal muscle tissues were sampled aseptically and stored in liquid nitrogen for genomic DNA extraction. The complete experimental procedure from DNA isolation to sequencing was conducted by DNA Link, South Korea (www.dnalink.com), in accordance with the product protocol.

Genomic DNA Library Preparation and Sequencing
Concentrated genomic DNA (gDNA) (24 µg) was prepared from the samples using the DNeasy Animal Mini Kit (Qiagen, Hilden, Germany). The isolated gDNA was quantified using an ND-1000 spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA) and a Qubit fluorometer. The gDNA samples were then subjected to the following steps: fragmentation using the g-TUBE (Covaris, Woburn, MA, USA) to obtain >20-kb fragments; small-fragment filtration using 0.45X AMPure (Beckman Coulter, Brea, CA, USA); fragment end repair by ExoVII treatment; ligation of blunt adapters to the double-stranded DNA fragments; attachment of primer and polymerase to SMRTbell templates (Template Prep Kit 1.0); and addition of magnetic beads. Impurities were washed out carefully using 1.0X AMPure, and only the double-stranded DNA fragments with blunt adapters were sequenced with P6-C4 chemistry (DNA Sequencing Reagent 4.0) on the Pacific Biosciences (PacBio) sequencing platform, capturing one 240-minute movie per SMRT cell. The isolated gDNA was also used to prepare sequencing libraries following stranded Illumina paired-end (PE) protocols (Illumina, San Diego, CA, USA). The fragmented libraries were subjected to size selection and sequenced on the Illumina HiSeq 2000 platform (Illumina).

De novo Genome Assembly
Complete sequence reads were error-corrected using SMRT Analysis v2.3, and imported into FALCON, a diploid-aware hierarchical genome assembler, to construct contigs from the long PacBio reads (Chin et al., 2016). The assembled contigs were then polished using the Quiver consensus method to reduce base-calling errors (Chin et al., 2016). Finally, the assembled and polished contigs were assessed for genome completeness using BUSCO v5.0 (Simão et al., 2015) with the actinopterygii_odb10 and vertebrata_odb10 reference datasets. The quality of the assembly was assessed by mapping the short reads to the draft with BWA v0.7.15 (Li and Durbin, 2010) (Supplementary Figure 2).
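The completeness assessment above is usually read from the one-line score string that BUSCO prints in its short summary. The following is a minimal sketch of how such a line could be parsed for downstream tabulation. It assumes the common `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; the function name `parse_busco_line` and the example values are hypothetical and are not results from this study.

```python
import re

# Score string in the style of a BUSCO short summary.
# ASSUMPTION: these numbers are invented for illustration only.
EXAMPLE = "C:95.0%[S:93.0%,D:2.0%],F:2.0%,M:3.0%,n:1000"

# C = complete, S = single-copy, D = duplicated,
# F = fragmented, M = missing, n = number of BUSCO groups searched.
PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_line(line):
    """Return the BUSCO percentages as floats and n as an int."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no BUSCO score string found")
    scores = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    scores["n"] = int(match.group("n"))
    return scores


scores = parse_busco_line(EXAMPLE)
print(scores)
```

Complete BUSCOs are the sum of single-copy and duplicated ones, so checking that S + D equals C is a quick sanity test on a parsed line.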

Gene Prediction and Annotation
The genes in the G. chalcogrammus draft genome were predicted using an in-house gene prediction pipeline comprising three modules: an evidence-based gene modeler, an ab initio gene modeler, and a consensus gene modeler. Functional annotation was then performed on the consensus genes. Initially, transcriptomes sequenced by two methods (Illumina [156.9 Gb] and Iso-Seq; Table 4) were used for prediction. The transcript models and predicted models from the evidence-based and ab initio gene modelers were then passed to the consensus gene modeler to produce the final gene and transcript models. The consensus transcripts were functionally annotated against biological databases (NCBI-NR, Swiss-Prot, Gene Ontology, and KEGG Pathway) using OmicsBox v1.2 (Götz et al., 2008).

Preliminary Analysis Report
The genome size of G. chalcogrammus was initially estimated to be 683.61 Mb (Figure 1A) from 42 Gb of short-read sequences (Table 1A, Supplementary Table 2), and 629.66 Mb of representative contigs were assembled from 97 Gb of error-corrected long-read sequences (Supplementary Tables 1, 3). The contigs were then assembled into 116 scaffolds in the reference draft genome (Table 1B). The N50 of the assembled genome was 27,035,343 bases, and 245 Mb (38.89%) of the assembled contigs were covered by repeats, among which long terminal repeat (LTR) elements dominated (34%). In total, 23,353 genes were predicted from the genome, with an average size of 9,261.51 bases and a BUSCO completeness score of 90.4% (Table 1C). Homologous sequences were found in GenBank for 19,760 (84.61%) genes, and 17,259 (73.90%) genes had Gene Ontology descriptions (Table 1D). The first genome published for the Gadus genus was that of G. morhua (gadMor1) in 2011, an 832-Mb genome with a contig N50 of 2.3 kb (scaffold N50: 0.14 Mb) (Star et al., 2011). An improved version of the same genome (gadMor2) was published in 2017 with a contig N50 of 116 kb (scaffold N50: 1.15 Mb) (Tørresen et al., 2017), and the third NCBI version (gadMor3) was 669 Mb with a contig N50 of 1.01 Mb (scaffold N50: 28.7 Mb) and 23 chromosomes. The gadMor3 genome was used as a reference to scaffold the contigs (N50: 3.6 Mb) with the RaGOO method (Alonge et al., 2019), and 167 scaffolds were obtained with an N50 of 27.03 Mb and 23 chromosomes. The complete workflow used in this study is illustrated in Supplementary Figure 1. Overall, the assembly showed a marked improvement in contiguity (Figures 1B-F) and BUSCO completeness score (Table 1B). However, there is a conflict in chromosome number: G. morhua has 23 chromosomes, whereas G. chalcogrammus has 22 chromosomes (Supplementary Table 5). Because the contigs scaffolded well onto all 23 G. morhua chromosomes, this discrepancy will be addressed in a future version of this genome assembly (Ishii and Yabu, 1985).
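For readers less familiar with the statistics reported above, the sketch below shows how an N50 value is commonly computed from a list of sequence lengths and how an approximate sequencing depth follows from read yield and genome size. It is an illustration under stated assumptions, not the procedure used in this study: the function names `n50` and `mean_depth` and the toy length list are hypothetical, and the depth figures assume that Gb and Mb denote 10^9 and 10^6 bases.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("empty length list")


def mean_depth(total_bases, genome_size):
    """Approximate mean sequencing depth (fold coverage)."""
    return total_bases / genome_size


# Toy contig lengths (hypothetical): total 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)  # 80 + 70 reaches 150, so the N50 is 70

# Read yields and estimated genome size quoted in the text above.
# ASSUMPTION: Gb = 1e9 bases and Mb = 1e6 bases.
short_depth = mean_depth(42e9, 683.61e6)
long_depth = mean_depth(97e9, 683.61e6)
print(toy_n50, round(short_depth, 1), round(long_depth, 1))
```

Under these assumptions the quoted yields correspond to roughly 61-fold short-read and 142-fold long-read depth relative to the estimated genome size.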

Dataset Information to the User
The complete sequences generated in this study were deposited in the NCBI Sequence Read Archive under accession no. PRJNA736536. The assembled contigs and the annotation files (CDS, gff, repeats, and proteins) are available in the Figshare repository (https://figshare.com/s/2ff9e3a49a07c990a400) with all of the annotation details in a Readme file. The contig assembly of this draft genome was submitted to the NCBI Assembly database under accession no. JAHRIL000000000.
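As an example of how the deposited annotation could be summarized, the sketch below counts gene features and computes their mean length from GFF3-formatted text. It assumes that genes are recorded with the feature type `gene` in the third column and uses a small invented sample; the actual layout of the Figshare files has not been verified here, and the function name `gene_length_stats` is hypothetical.

```python
import io

# Tiny invented GFF3 sample (tab-separated, 9 columns per feature line).
# ASSUMPTION: the deposited file marks genes with type "gene" in column 3.
SAMPLE_GFF = """\
##gff-version 3
contig1\tsketch\tgene\t101\t1100\t.\t+\t.\tID=gene1
contig1\tsketch\tmRNA\t101\t1100\t.\t+\t.\tID=mrna1;Parent=gene1
contig1\tsketch\tgene\t2001\t2500\t.\t-\t.\tID=gene2
contig2\tsketch\tgene\t1\t3000\t.\t+\t.\tID=gene3
"""


def gene_length_stats(handle):
    """Return (gene count, mean gene length) for a GFF3 file handle."""
    lengths = []
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep only well-formed gene features
        start, end = int(cols[3]), int(cols[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    if not lengths:
        return 0, 0.0
    return len(lengths), sum(lengths) / len(lengths)


count, mean_len = gene_length_stats(io.StringIO(SAMPLE_GFF))
print(count, mean_len)
```

To run this on a downloaded annotation, the `io.StringIO` object can be replaced with an opened file handle.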

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number (