Data Report ARTICLE
Whole genome assembly of the snout otter clam, Lutraria rhynchaena, using Nanopore and Illumina data, benchmarked against bivalve genome assemblies
- 1Other, Vietnam
- 2Centre for Integrative Ecology, Deakin University, Australia
- 3Deakin Genomics Centre, School of Life and Environmental Sciences, Deakin University, Australia
- 4Hanoi National University of Education, Vietnam
- 5Hung Vuong University, Vietnam
Production of cultured bivalve molluscs was 17.1 million tons in 2016 accounting for 21.4% of global aquaculture production (FAO, 2018). The lack of genomic resources coupled with limited understanding of the molecular basis of gene expression and phenotypic variation have limited advances in aquaculture-based productivity of marine bivalves. Understanding the molecular basis of phenotypic variation and gene function is therefore important for selective breeding programs for traits such as increased growth and disease resistance. Similarly, whole genome assembles support GWAS studies to identify trait-specific loci and for genomic-based selective breeding. To this end, whole-genome sequencing has been conducted on several commercial bivalve species, including the edible oysters Crassostrea virginica (Gómez-Chiarri et al., 2015), Crassostrea gigas (Gerdol et al., 2015), pearl oysters (Takeuchi et al., 2012) and clams (Mun et al., 2017). However, in general, genomic data for bivalve molluscs, which includes a taxonomically diverse group of species, is sparse (Takeuchi et al., 2012; Zhang et al., 2012; Murgarella et al., 2016; Du et al., 2017; Li et al., 2017; Mun et al., 2017; Sun et al., 2017; Uliano-Silva et al., 2017; Wang et al., 2017; Li et al., 2018; Powell et al., 2018; Renaut et al., 2018; Bai et al., 2019). In this study we present the first genomic resources for a species of clam from the superfamily Mactroidea and for a Vietnamese shellfish species and generate a draft reference genome to form the basis of on-going selective breeding studies. This study also demonstrates the efficacy of using Oxford Nanopore Technology (ONT) reads to scaffold a bivalve genome assembly and shows the value of these relatively inexpensive long reads for spanning large repetitive regions and overcoming complex assembly issues caused by high heterozygosity, which typically confounds short read only assemblies. The quality of our assembly is also benchmarked against other bivalves genome assemblies and we present an initial phylogenomic analysis for the class Bivalvia, which illustrates the value and potential of the increasing number of high quality genomic data sets for phylogenetics.
2. Materials and Methods
2.1 Extraction, library preparation and sequencing
Muscle tissue was collected from a sample obtained from a snout otter clam farm in Van Dong district, Quang Ninh province in Vietnam. For short read sequencing of the clam genome, genomic DNA (gDNA) was extracted according to Sokolov’s improved method (Sokolov, 2000). Two sequencing libraries were prepared. These include a PCR-free library prepared using NuGen Celero DNA-Seq Library Preparation Kit (Tecan Genomics, San Carlos, CA) according to manufacturer’s instruction and a PCR-based library prepared with NEBNext Ultra DNA Library Preparation Kit (New England Biolabs, Ipwich, MA). On the other hand, transcriptomic RNA was extracted from the digestive gland tissue using the Zymo Quick-RNA Miniprep kit, followed by RNA library constructed using the Nugen Universal Plus mRNA-Seq Kit (Tecan Genomics, San Carlos, CA) according to manufacturer’s instruction. Both DNA and RNA libraries were sequenced on an Illumina NovaSeq6000 platform at the Deakin Genomics Centre with 2 × 150bp configuration. For Nanopore long read sequencing, gDNA was extracted from the muscle tissue of the same individual using the column-based Zymo Quick DNA miniprep plus. Approximately 1 µg of the purified gDNA was used as the input for library preparation using the LSK109 kit followed by sequencing on a FLO-MIN106 revD SpotON R9.4 Flow Cell for 48 hours.
2.2 Genome size estimation
Short reads generated by the Illumina NovaSeq SGS platform were preprocessed for genome size estimation. The fastp v0.19.4 tool (Chen et al., 2018) was used to trim polyG tails from the 3’ end of short reads (--poly_g_min_len 1), followed by adapter and quality trimming with Trimmomatic v0.36 (Bolger et al., 2014) (ILLUMINACLIP:2:30:10, AVGQUAL:20). Reads of mitochondrial origin were removed via alignment against the Lutraria rhynchaena mitogenome (Gan et al., 2016) using bowtie2 v184.108.40.206 (default parameters) (Langmead and Salzberg, 2012). Jellyfish v2.2.6 (Marçais and Kingsford, 2011) was used to count the frequency of 19-, 21- and 25-mers in the preprocessed reads and finally, these k-mer histograms were uploaded to the GenomeScope webserver (max kmer coverage disabled) (Vurture et al., 2017) for an estimation of the L. rhynchaena haploid genome size.
2.3 De novo hybrid assembly of the snout otter clam genome
Long reads generated by the ONT MinION sequencing device were basecalled and trimmed (adapter and quality) with Guppy v3.0.3 (high accuracy mode, min_qscore 7), which is available via the ONT community site (https://community.nanoporetech.com). A hybrid de novo assembly approach was performed with the MaSuRCA v3.2.8 assembler (USE_LINKING_MATES = 0, cgwErrorRate = 0.15, KMER_COUNT_THRESHOLD = 1) (Zimin et al., 2013), using the previously trimmed long ONT reads and polyG-trimmed short Illumina reads as input. MaSuRCA runs hybrid assembly of Illumina and Nanopore data. Briefly, the assembler first reduces high coverage Illumina reads into longer super-reads, then align these to long Nanopore reads to build even longer mega-reads (Zimin et al., 2017). These are then assembled by the CABOG assembler within MaSuRCA. Additionally, the version of MaSuRCA used in this study uses the high coverage of long high error Nanopore reads to build consensus sequences for regions not captured by short accurate Illumina reads. As a result, this approach benefits from the accuracy of Illumina reads and the length of Nanopore reads. Completeness of the assembly was assessed with BUSCO v3.0.2 (Simão et al., 2015), searching the assembly against single-copy orthologs from the Metazoan dataset (metazoa_odb9).
As bivalve genomes have been previously shown to be highly heterozygous (Zhang et al., 2012; Wang et al., 2017; Powell et al., 2018), an observation also apparent from the bimodal distribution of k-mer profiles obtained from the GenomeScope analysis in this study (Supplementary Figure 1), it was necessary to optimise the assembly by removing haplotigs or duplications in the haploid representation of this genome. This was achieved by aligning the long reads back to the assembly with minimap2 v2.15 (-x map_ont, --secondary=no) (Li, 2018) and passing the alignment through the Purge Haplotigs pipeline v1.0.4 (-l 5, -m 20, -h 150) (Roach et al., 2018) to remove artefactual scaffolds.
2.4 Estimation of heterozygosity and repeat content
Presence of single nucleotide polymorphisms (SNP) in the snout otter clam genome was estimated by aligning short reads to the final curated assembly with bowtie2 v220.127.116.11 (Chen et al., 2018) and called with the mpileup (--max-depth 200) and call (-mv) commands within bcftools v1.9 (Li, 2011), retaining high quality calls with sufficient read depth (QUAL ≥ 20, DP ≥ 10, AF ≥ 0.25). Repeat families within the final assembly were identified de novo with RepeatModeler v1.0.11 (default parameters) (Smit and Hubley, 2019), which uses both RECON v1.08 (Bao and Eddy, 2002) and RepeatScout v1.0.5 (Price et al., 2005) to search for repeats within a given assembly. From this, a set of consensus sequences for each identified family was then used to mask repeats within the assembly with RepeatMasker v4.0.9 (-e rmblast) (Smit et al., 2018).
2.5 Benchmarking against other bivalve genomes
The assembly sizes, scaffold N50 lengths and other information of the L. rhynchaena assembly were compared against 13 other published bivalve genomes available representing 6 orders and 8 families. These comprise: Argopecten purpuratus (Li et al., 2018), Bathymodiolus platifrons (Sun et al., 2017), Chlamys farreri (Li et al., 2017), Crassostrea gigas (Zhang et al., 2012), Limnoperna fortunei (Uliano-Silva et al., 2017), Modiolus philippinarum (Sun et al., 2017), Mytillus galloprovincialis (Murgarella et al., 2016), Patinopecten yessoensis (Wang et al., 2017), Pinctada fucata (Takeuchi et al., 2012; Du et al., 2017), Ruditapes philippinarum (Mun et al., 2017), Saccostrea glomerata (Powell et al., 2018), Scapharca broughtonii (Bai et al., 2019) and Venustaconcha ellipsiformis (Renaut et al., 2018). Assessments of repeat content and BUSCO completeness (metazoa_odb9) were repeated as above for these 13 bivalve genomes.
2.6 Transcriptome assembly
One RNA-seq library was sequenced and a transcriptome was generated to assist in gene prediction. Paired-end, strand-specific reads were polyG-, adapter- and quality-trimmed with the same tools and parameters as was outlined for genomic reads. The resulting reads were assembled de novo with the Trinity v2.8.5 assembler (--SS_lib_type FR) (Grabherr et al., 2011). Short reads were subsequently mapped back to the assembly with bowtie2 v18.104.22.168 (Langmead and Salzberg, 2012). Additionally, GMAP v2019-06-10 (default parameters) (Wu and Watanabe, 2005) was used to align the assembled transcriptome to the current genome assembly.
2.7 Gene prediction and annotation
Gene prediction was carried out with the MAKER v2.31.10 annotation pipeline (Holt and Yandell, 2011). Providing this pipeline with the previously identified repeat families, the assembled transcript sequences and a set of protein sequences from the 13 published bivalve genomes (Supplementary Table 1) (Takeuchi et al., 2012; Zhang et al., 2012; Murgarella et al., 2016; Du et al., 2017; Li et al., 2017; Mun et al., 2017; Sun et al., 2017; Uliano-Silva et al., 2017; Wang et al., 2017; Li et al., 2018; Powell et al., 2018; Renaut et al., 2018; Bai et al., 2019), MAKER identifies repeats and subsequently aligns transcripts and proteins to the genome to produce initial gene models and sequences in its first iteration (est2genome=1, protein2genome=1). These gene models were used to train two ab initio gene prediction softwares AUGUSTUS (using BUSCO with --long and metazoa_odb9) (Stanke et al., 2006) and SNAP (-categorize 1000, -export 1000) (Korf, 2004), followed by a second iteration of MAKER incorporating gene models generated from ab initio predictors. Gene models were again retrained with a second iteration of AUGUSTUS and SNAP before MAKER was repeated for a third iteration.
Sequences with Annotation Edit Distance (AED) values ≤ 0.5 were retained. AED values are evidence based (Eilbeck et al., 2009), thus a small value suggests a lesser degree of difference between the predicted gene and the protein and/or transcript evidence used during prediction. These protein sequences were further aligned against proteins in UniProtKB (Swiss-Prot and TrEMBL) (Consortium, 2018) for homology using DIAMOND v0.9.24.125 (--max_target_seqs 1, --evalue 1e-10) (Buchfink et al., 2014) and searched for protein domains and signatures with InterProScan v5.36-75.0 (-t p) (Jones et al., 2014).
2.8 Maximum likelihood and Bayesian phylogenetic analyses
BUSCO-predicted protein-coding genes were used to construct a multi-gene supermatrix in order to infer the phylogenetic positions of L. rhynchaena and other available published bivalve genomes (Supplementary Table 1), with the exception of Mytilus galloprovincialis that was excluded from this analysis since its assembly reported substantially lower BUSCO completeness. The owl limpet (Lottia gigantea), two-spot octopus (Octopus bimaculoides) and fruit fly (Drosophila melanogaster) were included as outgroup species. Each orthologous group was aligned with MAFFT v7.394 (default parameters) (Katoh and Standley, 2013) and trimmed with Gblocks v0.91b (default parameters) (Castresana, 2000), allowing no gaps in the alignments. Only orthologous protein groups with 100% representation (i.e. contains proteins from all 13 bivalves and 3 outgroups) were incorporated to form the final supermatrix. This supermatrix (22,668 amino acids, 129 orthologous groups), partitioned by proteins (-spp), was supplied to IQ-TREE v1.5.5 (-bb 1000, -alrt 1000, -m TESTMERGE) (Nguyen et al., 2014) for model testing and to infer phylogenetic relationships using maximum likelihood (ML) with nodal support assessed using Ultrafast Bootstrap support (UFBoot) (Minh et al., 2013) and SH-aLRT (Guindon et al., 2010) values. The same supermatrix was used in ExaBayes v1.5 (Aberer et al., 2014) to infer a Bayesian tree (BI) from four independent runs of 2 million generations. Discarding 25% of initial samples as burn-in, convergence was determined when the average standard deviation of split frequencies (asdsf) value falls below 1%.
3. Results and Discussion
3.1 Genome size and sequencing coverage
This study generated a large volume of genomic data that includes 122 Gbp of short paired-end reads (150 bp) and 14 Gbp of long reads (average: 4960 bp, longest: 58,804 bp, shortest: 200 bp) (Supplementary Table 2). In addition, 34 Gbp of transcriptomic reads were generated to facilitate gene prediction. Raw Illumina and Nanopore reads generated in this study are available in the Sequence Read Archive (SRP201027) under BioProject PRJNA548223.
Estimations based on 19-, 21- and 25-mers reported a range of 545 to 547 Mbp sizes for the snout otter clam genome (Supplementary Data 1), histograms are provided in Supplementary Data 2. Based on this estimate, this study generated short and long read data with sequencing depths of over 200× and 25×, respectively. The haploid genome size for the snout otter clam is at the lower end of a large range of estimated genome sizes for the species of the family Mactridae and also within the phylum Bivalvia more generally (Supplementary Data 3).
3.2 Genome and genes of the snout otter clam
The snout otter genome joins the blood clam (Scapharca broughtonii) (Bai et al., 2019) as the only two bivalve genome assemblies to incorporate the use of long ONT reads and has generated the largest volume of ONT data for a bivalve species to date. As the first study to use only Illumina paired-end reads and ONT long reads, this study also demonstrates the efficacy of ONT reads as a cost-effective option to achieve a high quality and contiguous genome assembly. The hybrid assembly approach generated a draft genome of 1,502 scaffolds (N50 length: 1.84 Mbp) and a total assembly size of 586.5 Mbp (Table 1), slightly exceeding the previously-estimated genome size. Completeness of the assembly reported a high overall completeness of 95.9%. However, 2.9% metazoan BUSCOs detected in this assessment occurred in duplicates within the assembly, likely to be a result of high heterozygosity within the genome resulting in duplicated copies within the assembly. These duplicated regions can cause issues in downstream analyses such as variant discovery and therefore was removed from the assembly before its use for other applications. This produced a final curated assembly of 544 Mbp contained in 622 scaffolds (N50 length: 2.14 Mbp) (Table 1), to which 95.6% of Illumina short reads were successfully aligned using bowtie2 v22.214.171.124 (Langmead and Salzberg, 2012). BUSCO reports a similar overall completeness of 95.8% but with 1.5% BUSCOs detected in duplicates, half of that reported prior to the haplotig purging strategy. Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession VIBL00000000 under BioProject PRJNA548223.
Variant calling analysis reports a total 4,903,576 SNPs, which translates into a heterozygosity estimate of 0.90%. In contrast, heterozygosity was estimated at 1.60% was estimated by the kmer-based approach with GenomeScope (Supplementary Figure 1). Since variant calling methods are typically more conservative than kmer-based ones, the former approach may have excluded some true positives and, therefore, results in the observed discrepancy in estimations. Nevertheless, heterozygosity rates estimated by both methods (0.90 to 1.60%) is within the range of values that have been reported for other bivalves (0.51 to 2.02%). Additionally, repeat modelling and masking of the genome masked 29.4% of the L. rhynchaena assembly.
This study also generated 34.6 Gbp of short paired-end, strand specific reads. Transcriptome assembly produced a set of 295,234 transcripts containing 83.9% complete BUSCO genes. Subsequently, 96.4% of short RNA reads were mapped back to this set of assembled transcripts and of these, 79% were aligned to the genome in a splice-aware manner. Both alignment rates indicate that the quality set of transcripts is sufficient to improve the gene prediction. Based on hints and evidence from the clam transcripts and protein sequences from other bivalves, gene prediction from the assembled genome resulted in a final set of 26,380 protein-coding genes (AED ≤ 0.5), 89.8% of which was functionally annotated (i.e. a protein would have at least one associated functional annotation) (Table 1). The assembled transcriptome, predicted protein-coding genes and annotation information are available as Supplementary Data 4 and Supplementary Data 5.
3.3 Comparison against other bivalves
A summary of assembly sizes, scaffold N50 lengths, sequencing technologies and other information for all 13 assemblies is available in Supplementary Table 2 while a comparison of repeat content, genome and assembly sizes and BUSCO completeness for all 14 bivalve genomes is visualised in Figure 1A. The assembly for Mytilus galloprovincialis has a high level of missing BUSCO genes (14.8% complete, 29.6% fragmented, 55.6% missing) whereas the Venustaconcha ellipsiformis assembly contained relatively more fragmented BUSCO genes (22.8%). This potentially points to limitations of the data types used, since the former was assembled with only short Illumina paired-end reads, while the latter had both paired-end (PE) and mate-pair (MP) reads but was assisted by only 0.3× of PacBio long reads. While three other assemblies also had only PE and MP reads, most of the remaining assemblies employed the use of other scaffolding strategies including a greater volume of long PacBio reads, fosmid and bacterial artificial chromosome (BAC) libraries (see Supplementary Table 2 for details). Furthermore, there is a large discrepancy in estimated genome size and assembly size for Ruditapes philippinarum (Figure 1A), where the reported assembly is almost twice the size of the expected genome size. For this assembly, a high proportion of duplicated BUSCO genes was also detected (19.1%), highlighting the importance of haplotig purging within bivalve assemblies to remove paralogous scaffolds. However, this step should be executed with caution as there is also the potential to “over-purge” and exclude actual parts of the genome.
3.4 Phylogeny of Bivalvia based on nuclear loci
Rooted with the fruit fly, phylogenies inferred by both methods showed strong UFBoot, SH-aLRT and Bayesian posterior probability (BPP) support for most nodes, the exception being the sister relationship between bivalve orders Mytilida and Pectinida (Figure 1B). The snout otter clam, Lutraria rhynchaena, is the only representative for Mactridae and forms a sister relationship with Ruditapes philippinarum from the Veneridae, both belonging to the order Venerida. This initial phylogenomic analysis for the Bivalvia contains species from the three subclasses Imparidentia, Palaeoheterodonta and Pteriomorpha. All bivalve species are placed within groups consistent with their taxonomic classification at the family and ordinal levels and observed relationships among these three subclasses are consistent with findings from recent studies based on Sanger and transcriptome sequencing (Kocot et al., 2011; González Vanessa et al., 2015; Combosch et al., 2017; Lemer et al., 2019). While these studies include a larger range of taxa from more bivalve subclasses, data matrices used to infer phylogeny are often gappy with a higher level of missing data (~ 16 to 75%). On the other hand, this study successfully used information from a substantial number of genes (> 100) across the available taxa with minimal gaps. Nevertheless, this approach is currently limited due to scarcity in whole genome resources for bivalve species generally and therefore is missing representatives from important bivalve subclasses such as Protobranchia, Archiheterodonta and Anomalodesmata.
This initial phylogenomic analysis for the Bivalvia, which illustrates the value and potential of this approach as a greater number of high quality genomic data sets become available for phylogenetic study and resolution of basal relationships of what has been a challenging group (Plazzi et al., 2011; Sharma et al., 2012) Multiple sequence alignment and the resulting Newick tree are available as Supplementary Data 6.
We present the first draft genome of the snout otter clam (Lutraria rhynchaena) based on relatively large volumes of short and long genomic reads from Illumina and ONT platforms, respectively, providing sufficient sequencing depth to facilitate the generation of one of the best quality genome assemblies for a bivalve mollusc. The use of long ONT reads in the hybrid assembly process presents an effective yet economical approach to achieving a good quality assembly and highlights the importance of long reads in spanning large repeat regions and resolving other complex regions that arises from the high levels of heterozygosity. This highly contiguous and complete assembly makes this draft genome an important and valuable resource to support ongoing genomic and molecular-based breeding studies for aquaculture. In addition, a transcriptomic data set was generated and assembled to support more refined gene prediction, further adding to the currently scarce transcriptomic resources available for the Mactridae family. Ultimately, we expect results of this study to be used as valuable genomic references for a range of genetic, genomic, phylogenetic and population studies of the snout otter clam and other bivalve species.
Keywords: snout otter clam, Genome, hybrid assembly, Aquaculture, Illumina, Oxford nanopore
Received: 23 Aug 2019;
Accepted: 22 Oct 2019.
Copyright: © 2019 Thai, Lee, Gan, Austin, Croft, Trieu and Tan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Dr. Mun Hua Tan, Centre for Integrative Ecology, Deakin University, Victoria, Victoria, Australia, firstname.lastname@example.org