Data Report ARTICLE
Draft genome assembly and annotation of the Gila topminnow Poeciliopsis occidentalis
- 1Texas A&M University, United States
- 2Physiological Chemistry, Biocenter, University of Wuerzburg, Germany
- 3INRA Mathématiques et Informatique Appliquées de Toulouse (MIAT), France
- 4INRA UMR1388 Génétique Physiologie et Systèmes d'Elevage, France
- 5INSERM US09 BioCampus Montpellier, France
- 6Department of Biology, College of Science, Texas A&M University College Station, United States
- 7Department of Physiological Chemistry, Biocenter, University of Würzburg, Germany
- 8Howard Hughes Medical Institute (HHMI), United States
- 9Harvard University, United States
- 10California State University, Monterey Bay, United States
- 11INRA Laboratoire de Physiologie et Génomique des Poissons (LPGP), France
- 12Hagler Institute for Advanced Study, Texas A&M University, United States
The freshwater livebearing fish genus Poeciliopsis (Poeciliidae) constitutes a valuable research system for questions within the field of evolutionary ecology, including life history evolution (e.g. multiple origins of placentas), intergenomic conflict, evolution of sex (with the existence of several asexual hybrid biotypes), and biogeography (reviewed in Mateos et al., 2019). Despite its importance, a robust phylogenetic framework and genomic resources are lacking for this taxon. Herein, we report the first whole genome draft sequence of a member of this genus: Poeciliopsis occidentalis Baird and Girard (1853), the Gila topminnow. Poeciliopsis occidentalis, along with its sister lineage Poeciliopsis sonoriensis Girard (1859) (the Yaqui topminnow), are currently considered separate species (Miller et al., 2005). They are distributed in Mexico and the United States, where they are listed (as subspecies of Poeciliopsis occidentalis sensu lato; Sonoran topminnow, “guatopote de Sonora” in Spanish) as threatened and endangered, respectively. P. occidentalis s.l. has several interesting biological features, whose study would benefit from annotated genomes. First, it has an intermediate level of placentation (matrotrophy index) within a clade (i.e., Leptorhaphis group) that contains members with higher (i.e., P. prolifica) and lower (i.e., P. infans) matrotrophy indices (Reznick et al., 2002). Secondly, it is the sexual host of the oldest known asexual hybrid biotype of the genus Poeciliopsis (i.e., the hybridogen Poeciliopsis monacha-occidentalis; Quattro et al., 1992). Moreover, P. occidentalis s.l. has an unresolved phylogenetic position, possibly due to incomplete lineage sorting and/or reticulation (Mateos et al., 2019). In addition, the taxonomy and status of evolutionary significant units (ESUs) within P. occidentalis s.l. are controversial, as additional ESUs have been proposed (Vrijenhoek et al., 1985; Hedrick and Hurt, 2012). The genome sequence of P. occidentalis will thus be a valuable resource for macroevolutionary and molecular evolution studies of the genus, as well as for phylogeographic and conservation genetics research. In the work presented herein, we used the “linked-reads” Chromium System (10x Genomics, Pleasanton, CA, USA) to sequence, assemble and annotate a draft genome of P. occidentalis. The resulting assembly had a contig and scaffold N50 of 0.103 and 1.540 Mbp respectively.
Source of specimens
The specimen used for the genome assembly was a snap-frozen (sampled in the early 2000’s, stored at -80°C) lab-bred male of Poeciliopsis occidentalis s.s. (sample ID MS-8/9/10 AV76-7; strain originally collected at Oquitoa, Rio Altar, Concepcion drainage, Sonora, Mexico, 1976; permits 13 and 4962 from Departamento de Pesca). Genomic DNA for 10X Genomics Chromium libraries (average fragment size range 50–120 kb) was extracted from the whole body (excluding gut) using a conventional phenol/chloroform method (Sambrook et al., 1989). The specimens used for RNA-sequencing (used as transcriptomic evidence for genome annotation) included: (1) a snap-frozen (gut removed) field-collected male Poeciliopsis sonoriensis ID MVH99-3, lower Yaqui, Sonora, Mexico, 1999; permit 020299-213-03 from SEMARNAP; and (2) snap-frozen individual tissue samples provided from the captive stock populations for P. occidentalis s.s. from Cienega Creek housed at Arizona State University. Total RNA (RIN > 7) was extracted from the P. sonoriensis individual with TRIzol Reagent (Thermo Fisher Scientific, Waltham, USA) according to the supplier’s recommendation; and from the P. occidentalis s.s. tissues using the Qiagen RNeasy Plus Mini Kit (Qiagen, Germantown, MD, USA).
Libraries construction and sequencing
High molecular weight (HMW) genomic DNA was quantified using microfluorimetry (Qubit High sensitivity dsDNA kit, Invitrogen, Carlsbad, CA) and diluted to 0.8 ng/µl. After denaturation, diluted single stranded DNA was processed using the Chromium Genome Library Kit & Gel Bead Kit v2 and Chromium Controller and Next GEM Accessory Kit (10x Genomics, Pleasanton, California) following the manufacturer’s instructions. Briefly, high molecular weight DNA in a master mix was combined with a library of Genome Gel Beads and partitioning oil to create Gel Bead-In-EMulsions (GEMs) in a microfluidic Genome Chip on the Chromium Controller. Emulsion was then recovered from the microfluidic chip and underwent an isothermal incubation. This isothermal incubation leads to the dissolution of the Genome Gel Bead, releasing primers containing an Illumina R1 sequence (Read 1 sequencing primer), a 16 bp 10x Barcode (specific to each Genome Gel Bead), and a 6 bp random primer sequence. Those 6 bp random primer sequences hybridize on the HMW DNA and the isothermal incubation produces barcoded fragments ranging from a few to several hundred base pairs. After incubation, the GEMs were broken and the pooled fractions were recovered. The pool of barcoded DNA fragments was repaired and adenylated on their 3’ ends. 10x Genomics adapters were ligated to the ends of each fragment. The ligated fragments underwent an 8-cycle PCR, which enabled the indexing of the library. The final library was verified on a fragment analyzer and quantified by qPCR (Light Cycler 480, Roche Applied Science, Penzberg, Germany). The library was sequenced on a full paired end 2x150nt lane on a Hiseq2500 (Illumina, San Diego, CA, USA) for a total of 235 million sequences.
The RNA library for P. sonoriensis was prepared with the KAPA mRNA HyperPrep Kit (KAPABiosystems, Wilmington, MA, USA) following the manufacturer’s instructions at the Bauer core at Harvard University. One ug of total RNA, quantified with the 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA), was used as input. The library was amplified for 10 cycles and purified using KAPA Pure Beads (KAPABiosystems). Library size distribution and quality were examined with Agilent High Sensitivity D1000 ScreenTape assay. Libraries were sequenced at the Harvard Bauer Core on a NextSeq 500 machine (Illumina) to collect paired-end 75 bp reads. The RNA libraries for all P. occidentalis s.s. samples were prepared by removing rRNA with the Epicentre Ribo-Zero rRNA removal kit (Illumina), and processed into a sequencing library using standard library prep methods for ABI SOLiD sequencing using the Total RNA-Seq Kit (ThermoFisher Scientific, Waltham, MA, USA). Libraries were then sequenced at the University of Connecticut Center for Genome Innovation on a SOLiD 5500xl (ThermoFisher Scientific) to collect paired-end 60-bp reads.
The linked reads were assembled with Supernova (Weisenfeld et al., 2017) version 1.0 (“supernova1_complete”) and twice with version 2.0.0, the first with all reads (“supernova2_complete”) and the second with the –maxreads parameter set to use only reads corresponding to a 56X genome coverage (“supernova2_reduced), following the software best practices. The assembly metrics were calculated using the assemblathon_stats.pl script (Bradnam, 2012). To obtain k-mer spectrum graphs (shown in Supporting Fig. S1), k-mers were counted with Jellyfish mer counter v.2.1.1 (Marcais and Kingsford, 2011; parameters: count -C -m 21 -s 100M). The K-mer Analysis Toolkit v.2.4.1 (KAT; Mapleson et al., 2017) was used to compare k-mers between raw reads and assembly (kat comp; parameters: -m 21) and to draw plots (kat plot spectra-cn; parameters: -w 30 -l 20 -x 100 --dpi 300). The assembly gene content was assessed with BUSCO version 3.0.2 (Waterhouse et al., 2017) using actinopterygii_odb9 as reference data set.
The three Poeciliopsis occidentalis genome assemblies gave different statistics (Supporting Table S1). Depending on the software version and method the P. occidentalis genome assembly was comprised of 7,753, 7,444 and 15,410 scaffolds corresponding to 19,800, 15,621 and 23,570 contigs having a scaffold N50 of 2.77, 0.99 and 1.54 Mb, and scaffold L50 of 60, 178 and 103, respectively. The total assembly lengths are 613, 608 and 725 Mb; i.e., within the range of the Feulgen densitometry estimated size of 680 Mb for this species (Cimino, 1974). The unknown base assembly fraction is low, corresponding to 3.26, 1.52 and 1.62 % of the scaffold lengths, respectively. Because of the closest length to genome size estimation, the supernova2_reduced assembly (i.e., 724 Mb) was chosen for annotation.
The three assemblies harbor 94, 91.5 and 92% of complete BUSCO genes, respectively (Supporting Table S2). The fragmented genes correspond to 5.6, 4.0 and 5.2% and the missing genes to 2.9, 2.0 and 2.8%, respectively. This validation was performed using a large set that included 4,584 Actinopterygii genes.
The assembly was compared to the chromosomal-scale assembly of the Southern platyfish, Xiphophorus maculatus (NCBI GCF_002775205.1), a phylogenetically related species, using D-GENIES (Cabanettes and Klopp, 2018). The dot plot between the Poeciliopsis occidentalis and Xiphophorus maculatus assemblies (Supporting Fig. S2) shows a good correspondence between long scaffolds and chromosomes. Only 12.1% of the Gila topminnow assembly did not align to the Southern platyfish chromosomes. The alignment identities are distributed as follows: 17.14% above zero and lower than 25%; 70.39% between 25 and 50%; 0.36% between 50 and 75%; and 0.01% over 75%.
Gene prediction was achieved with a pipeline that collects and synthesizes evidence from genes and intergenic regions, repeat identification, homology annotation, RNA-seq read mapping, and de novo gene prediction. For each homology annotation, transcript evidence and ab-initio gene model annotation, two independent threads were run. Before starting the annotation process, we used the Actinopterygii odb9 database of BUSCO (Simao et al., 2015) to train AUGUSTUS 3.2.3(Stanke et al., 2006).
Repeat elements were first identified de novo using RepeatModeler (Smit et al., 2013-2015) (http://www.repeatmasker.org/). The result, along with a fish-specific repeat database (unpublished) and the database from Shao et al. (2018), was used as a custom library for RepeatMasker to identify repeats in a comprehensive way. The repeats from known families were masked (replaced with N) from the genome assembly, and their location was collected as intergenic evidence.
Next we collected gene evidence from homology alignments. Proteins from Swiss-Prot (www.uniprot.org) and 13 Ensembl genomes (version 95, http://www.ensembl.org): human (Homo sapiens), mouse (Mus musculus), coelacanth (Latimeria chalumnae), spotted gar (Lepisosteus oculatus), zebrafish (Danio rerio), cod (Gadus morhua), tilapia (Oreochromis niloticus), medaka (Oryzias latipes), platyfish (Xiphophorus maculatus), fugu (Takifugu rubripes), tetraodon (Tetraodon nigrovirdis), stickleback (Gasterosteus aculeatus) and sea lamprey (P. marinus) were collected and processed using CD-HIT to form 544,476 non-redundant proteins. They were aligned to the genome assembly with exonerate2.2.0 (Slater and Birney, 2005) https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate and Genewise2-2-0 (Birney et al., 2004) independently.
To collect gene evidence from RNA-seq data, we used two independent threads: one with Tophat to map reads and Cufflinks 2.1.1 (Trapnell et al., 2012) to form gene models; the other one with HISAT2 2.1.0 (Kim et al., 2015) and Trinity 2.4.0 to assemble transcripts guided by the genome, and PASA 2.2.0 to align transcripts and form gene models (Haas et al., 2003; Haas et al., 2013).
For de novo annotation, we used SNAP 2006-07-28 (Korf, 2004) (http://korflab.ucdavis.edu) and GeneMark-ES (Ter-Hovhannisyan et al., 2008) independently. Confirmed by all lines of evidence described above, a group of high-quality gene models were selected using EVidenceModeler1.1.1 (Haas et al., 2008). They were used to train AUGUSTUS again. Then the species-specifically trained AUGUSTUS was run to predict genes in the assembly, taking as hint all those intergenic and gene evidences collected above. Finally, low-quality genes, which had low confidence score and no BLAST hit to Swiss-Prot (www.uniprot.org), were removed.
The repeat elements, identified using RepeatModeler and RepeatMasker, account for 19.98% (144Mbp) of the assembly. DNA elements account for 44.69%; SINE, 0.95%; LINE, 11.36%; and LTR elements for 4.10% of the repetitive fraction of the genome (Fig. 1). The majority of the represented TE landscape is composed of DNA elements. This makes it distinct from more ancient fishes. In the elephant shark (Callorhinchus milii), TEs are mostly composed of SINE and LINE elements. The Gila topminnow genome shares with other poeciliid genomes (i.e., southern platyfish, guppy, and Amazon molly) an ancient wave of transposable element (TE) expansion. Nonetheless, the more recent TE expansion that is typically seen in poeciliids appears to have started relatively more recently in the Gila topminnow, where its peak in the most recent elements implies that this expansion is probably ongoing. Detailed studies on TE expression and activity in the Gila topminnow will yield insights into how TEs shape the evolution of their host genomes.
In total, 41,501 genes were predicted of which 10,625 were removed because they were deemed of low-quality [i.e., no hit in BUSCO, Swissprot and Pfam database, failing to present an intact structure (both start and stop codon predicted) and with Augustus score <80). Among the 30,976 retained genes, 27,947 (90.22%) were annotated with start and stop codons, 28,141 (90.84%) have BLAST hit to database Swiss-Prot (www.uniprot.org), 24,912 (80.42%) were suggested by InterProScan (http://www.ebi.ac.uk/interpro/interproscan.html) to contain functional protein domains, and 28,722 (92.72%) were supported by RNA-seq reads. The mean coding sequence length in the retained genes is 1,463 bp and the longest is 54,050 bp. A quality assessment by BUSCO analysis revealed 95.4% complete conserved Actinopterygii genes, 3.6% fragmented and only 1.0% missing genes (Supporting Table S1). The annotation process thus increased the quality of all BUSCO parameters evaluated in this assembly.
Orthology relationships between genes of P. occidentalis and other fish were inferred based on sequence similarity and species phylogeny. First, the annotated genomes of Poecilia formosa (Amazon molly), Xiphophorus maculatus (Platyfish), Gambusia affinis (Western mosquitofish), Fundulus heteroclitus (Mummichog), Kryptolebias marmoratus (Mangrove rivulus), Oryzias latipes (Japanese medaka), Takifugu rubripes (Fugu), Gasterosteus aculeatus (Stickleback) and Lepisosteus oculatus (Spotted gar) were downloaded from Ensemble (http://www.ensembl.org/info/data/ftp/index.html), and of Poecilia reticulata (Guppy) from NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=Poecilia%20reticulata). Second, an all-vs-all blast was performed among the protein sequences of these fish and P. occidentalis (Camacho et al., 2009). Based on the BLAST raw score, H-score, defined as score (Gene1Gene2)/max (score (Gene1Gene1), score (Gene2Gene2)) (Cho et al., 2013), was calculated to evaluate the sequence similarity between any of two genes. Then with H-scores and L. oculatus set as the outgroup, genes were clustered into groups using Hcluster_sg (Ruan et al., 2007). Third, for each group, a gene tree was reconstructed in the guild of species tree using TreeBeST (Ponting, 2007). Finally, according to the tree, genes were assigned as “XtoX” orthologs to each other (X refers to a positive integer).
We collected 251,212 genes from the genomes listed above and clustered them into 20823 groups based on sequence similarity. Among them 11,639 were shared by P. occidentalis, T. rubripes, X. maculatus, G. aculeatus and O. latipes (Supporting Fig. S3). 5276 groups contained only one gene from each of the fish. These genes were identified as one-to-one orthologs and were used to construct the phylogenomic tree using RaxML (Fig. 2), as follows. One-to-one orthologs were identified during the orthology assignment after genes were clustered into groups. Protein sequences of each ortholog were then aligned among species using MAFFT (Nakamura et al., 2018). Alignment regions with bad quality were trimmed using trimAI (Capella-Gutiérrez et al., 2009). After trimming, the alignments of orthologs were concatenated into a massive alignment. Based on the massive alignment and with L. oculatus set as the outgroup, RaxML was used to reconstruct the phylogeny (Stamatakis, 2014). Clade support was assessed by means of a bootstrap analysis (100 replicates). Inferred relationships among all taxa were as expected (e.g. Betancur et al., 2013; Reznick et al., 2017; Bragança et al., 2018). As observed in Fig. 2, five genes are absent from all Poeciliid branches and thus likely lost in their common ancestor (Solute carrier family 27 member 6; zgc:55888; Leucine-rich repeat neuronal protein 1-like; TGF-beta receptor type-2-like; the other one not characterized). Two genes (both not characterized; “InPoecil” Fig. 2) are present in all poeciliid branches and absent in all others; thus, likely gained in the common ancestor of poeciliids.
Assembly and annotation of mitochondrial genomes from P. occidentalis and P. sonoriensis
A BLAST search using the cytochrome b sequence of P. occidentalis was used to search among the assembled contigs to identify the mitochondrial genome. A contig 16,912 bp long was recovered and annotated using the MitoFish webserver (Iwasaki et al., 2013; Sato et al., 2018). The ND2 gene, which spanned the ends of the contig, was corrected with Sanger sequences obtained from PCR products of the same specimen, leading to a shorter contig of 16,772 bp, which was circularized and annotated (NCBI Acc. No. MK860198) based on the Poeciliopsis occidentalis mitochondrial genome available at NCBI (Acc. No. KP013108). We also assembled the entire mitochondrial genome of P. sonoriensis (Acc. No. MK860197) by mapping the RNAseq reads from this species to the mitochondrial genome of P. occidentalis. Comparison of the three mitochondrial genomes revealed 0.23% divergence (uncorrected p) between the two samples of P. occidentalis s.s., and 0.94–0.99% divergence between P. occidentalis and P. sonoriensis.
We confirm the utility of 10X Genomics technology and the Supernova assembler to generate an assembly with high contiguity and high quality (e.g. Ozerov et al., 2018 and references therein). Our results demonstrate the utility of long-term frozen material for this purpose. The scaffold N50 above 1 Mb that we obtained is in the range of the best assemblies based exclusively on Illumina (i.e., short reads), but only those that have employed jumping libraries (also known as long-insert paired-end reads or mate pair libraries) (e.g. Liu et al., 2017; Schartl et al., 2019). As the first published genome assembly for the genus Poeciliopsis, we expect it will serve as a valuable resource for research in phylogenomics, enabling the generation of a robust framework for macroevolutionary questions, including speciation, hybridization, and adaptation. Furthermore, this resource is expected to facilitate research aimed at taxonomic delimitation and conservation genetics of this endangered taxon.
Keywords: supernova genome assembly, Genome annotation, transposable elements (TE), Topminnow, Mitochondrial Genome
Received: 22 Jun 2019;
Accepted: 09 Oct 2019.
Copyright: © 2019 Mateos, Kang, Klopp, Parrinello, Garcia, Schumer, Jue, Guiguen and Schartl. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Mx. Mariana Mateos, Texas A&M University, College Station, United States, firstname.lastname@example.org
Mx. Manfred Schartl, Department of Physiological Chemistry, Biocenter, University of Würzburg, Würzburg, 97074, Bavaria, Germany, email@example.com