Draft Genome Assembly and Annotation of the Gila Topminnow Poeciliopsis occidentalis

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Draft genome assembly and annotation of the gila topminnow Poeciliopsis occidentalis Mariana Mateos, Du Kang, Christophe Klopp, Hugues Parrinello, Mateo Garcia-Olazabal, Molly Schumer, Nathaniel K. Jue, Yann Guiguen, Manfred Schartl


INTRODUCTION
The freshwater livebearing fish genus Poeciliopsis (Poeciliidae) constitutes a valuable research system for questions within the field of evolutionary ecology, including life history evolution (e.g., multiple origins of placentas), intergenomic conflict, evolution of sex (with the existence of several asexual hybrid biotypes), and biogeography (reviewed in Mateos et al., 2019). Despite its importance, a robust phylogenetic framework, and genomic resources are lacking for this taxon. Herein, we report the first whole genome draft sequence of a member of this genus: Poeciliopsis occidentalis Baird and Girard (1853), the Gila topminnow. Poeciliopsis occidentalis, along with its sister lineage P. sonoriensis Girard (1859) (the Yaqui topminnow), are currently considered separate species (Miller et al., 2005). They are distributed in Mexico and the United States, where they are listed (as subspecies of P. occidentalis sensu lato; Sonoran topminnow, "guatopote de Sonora" in Spanish) as threatened and endangered, respectively. P. occidentalis s.l. has several interesting biological features, whose study would benefit from annotated genomes. First, it has an intermediate level of placentation (matrotrophy index) within a clade (i.e., Leptorhaphis group) that contains members with higher (i.e., P. prolifica) and lower (i.e., P. infans) matrotrophy indices (Reznick et al., 2002). Secondly, it is the sexual host of the oldest known asexual hybrid biotype of the genus Poeciliopsis (i.e., the hybridogen Poeciliopsis monacha-occidentalis; Quattro et al., 1992). Moreover, P. occidentalis s.l. has an unresolved phylogenetic position, possibly due to incomplete lineage sorting, and/or reticulation (Mateos et al., 2019). In addition, the taxonomy and status of evolutionary significant units (ESUs) within P. occidentalis s.l. are controversial, as additional ESUs have been proposed (Vrijenhoek et al., 1985;Hedrick and Hurt, 2012). The genome sequence of P. occidentalis will thus be a valuable resource for macroevolutionary and molecular evolution studies of the genus, as well as for phylogeographic and conservation genetics research. In the work presented herein, we used the "linked-reads" Chromium System (10x Genomics, Pleasanton, CA, USA) to sequence, assemble, and annotate a draft genome of P. occidentalis. The resulting assembly had a contig and scaffold N50 of 0.103 and 1.540 Mbp, respectively.

LIBRARIES CONSTRUCTION AND SEQUENCING
High molecular weight (HMW) genomic DNA was quantified using microfluorimetry (Qubit High sensitivity dsDNA kit, Invitrogen, Carlsbad, CA) and diluted to 0.8 ng/µl. After denaturation, diluted single stranded DNA was processed using the Chromium Genome Library Kit & Gel Bead Kit v2 and Chromium Controller and Next GEM Accessory Kit (10x Genomics, Pleasanton, California) following the manufacturer's instructions. Briefly, high molecular weight DNA in a master mix was combined with a library of Genome Gel Beads and partitioning oil to create Gel Bead-In-EMulsions (GEMs) in a microfluidic Genome Chip on the Chromium Controller. Emulsion was then recovered from the microfluidic chip and underwent an isothermal incubation. This isothermal incubation leads to the dissolution of the Genome Gel Bead, releasing primers containing an Illumina R1 sequence (Read 1 sequencing primer), a 16 bp 10x Barcode (specific to each Genome Gel Bead), and a 6 bp random primer sequence. Those 6 bp random primer sequences hybridize on the HMW DNA and the isothermal incubation produces barcoded fragments ranging from a few to several hundred base pairs. After incubation, the GEMs were broken and the pooled fractions were recovered. The pool of barcoded DNA fragments was repaired and adenylated on their 3 ′ ends. 10x Genomics adapters were ligated to the ends of each fragment. The ligated fragments underwent an 8-cycle PCR, which enabled the indexing of the library. The final library was verified on a fragment analyzer and quantified by qPCR (Light Cycler 480, Roche Applied Science, Penzberg, Germany). The library was sequenced on a full paired end 2 × 150 nt lane on a Hiseq2500 (Illumina, San Diego, CA, USA) for a total of 235 million sequences.
The RNA library for P. sonoriensis was prepared with the KAPA mRNA HyperPrep Kit (KAPABiosystems, Wilmington, MA, USA) following the manufacturer's instructions at the Bauer core at Harvard University. One microgram of total RNA, quantified with the 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA), was used as input. The library was amplified for 10 cycles and purified using KAPA Pure Beads (KAPABiosystems). Library size distribution and quality were examined with Agilent High Sensitivity D1000 ScreenTape assay. Libraries were sequenced at the Harvard Bauer Core on a NextSeq 500 machine (Illumina) to collect paired-end 75 bp reads. The RNA libraries for all P. occidentalis s.s. samples were prepared by removing rRNA with the Epicenter Ribo-Zero rRNA removal kit (Illumina), and processed into a sequencing library using standard library prep methods for ABI SOLiD sequencing using the Total RNA-Seq Kit (ThermoFisher Scientific, Waltham, MA, USA). Libraries were then sequenced at the University of Connecticut Center for Genome Innovation on a SOLiD 5500xl (ThermoFisher Scientific) to collect paired-end 60-bp reads.
The three P. occidentalis genome assemblies gave different statistics (Supporting Table S1). Depending on the software version and method the P. occidentalis genome assembly was comprised of 7,753, 7,444, and 15,410 scaffolds corresponding to 19,800, 15,621, and 23,570 contigs having a scaffold N50 of 2.77, 0.99, and 1.54 Mb, and scaffold L50 of 60, 178, and 103, respectively. The total assembly lengths are 613, 608, and 725 Mb; i.e., within the range of the Feulgen densitometry estimated size of 680 Mb for this species (Cimino, 1974). The unknown base assembly fraction is low, corresponding to 3.26, 1.52, and 1.62% of the scaffold lengths, respectively. Because of the closest length to genome size estimation, the supernova2_reduced assembly (i.e., 725 Mb) was chosen for annotation.
The three assemblies harbor 94, 91.5, and 92% of complete BUSCO genes, respectively (Supporting Table S1). The Frontiers in Ecology and Evolution | www.frontiersin.org 2 October 2019 | Volume 7 | Article 404 fragmented genes correspond to 5.6, 4.0, and 5.2% and the missing genes to 2.9, 2.0, and 2.8%, respectively. This validation was performed using a large set that included 4,584 Actinopterygii genes. The assembly was compared to the chromosomal-scale assembly of the Southern platyfish, Xiphophorus maculatus (NCBI GCF_002775205.1), a phylogenetically related species, using D-GENIES (Cabanettes and Klopp, 2018). The dot plot between the P. occidentalis and Xiphophorus maculatus assemblies (Supporting Figure S2) shows a good correspondence between long scaffolds and chromosomes. Only 12.1% of the Gila topminnow assembly did not align to the Southern platyfish chromosomes. The alignment identities are distributed as follows: 17.14% above zero and lower than 25%; 70.39% between 25 and 50%; 0.36% between 50 and 75%; and 0.01% over 75%.

ANNOTATION
Gene prediction was achieved with a pipeline that collects and synthesizes evidence from genes and intergenic regions, repeat identification, homology annotation, RNA-seq read mapping, and de novo gene prediction. For each homology annotation, transcript evidence and ab-initio gene model annotation, two independent threads were run. Before starting the annotation process, we used the Actinopterygii odb9 database of BUSCO (Simao et al., 2015) to train AUGUSTUS 3.2.3 (Stanke et al., 2006).
Repeat elements were first identified de novo using RepeatModeler (Smit et al., 2013(Smit et al., -2015 (http://www. repeatmasker.org/). The result, along with a fish-specific repeat database (unpublished) and the database from Shao et al. (2018), was used as a custom library for RepeatMasker to identify repeats in a comprehensive way. The repeats from known families were masked (replaced with N) from the genome assembly, and their location was collected as intergenic evidence.
To collect gene evidence from RNA-seq data, we used two independent threads: one with Tophat to map reads and Cufflinks 2.1.1 (Trapnell et al., 2012) to form gene models; the other one with HISAT2 2.1.0 (Kim et al., 2015) and Trinity 2.4.0 to assemble transcripts guided by the genome, and PASA 2.2.0 to align transcripts and form gene models (Haas et al., 2003(Haas et al., , 2013. For de novo annotation, we used SNAP 2006-07-28 (Korf, 2004) (http://korflab.ucdavis.edu) and GeneMark-ES (Ter-Hovhannisyan et al., 2008) independently. Confirmed by all lines of evidence described above, a group of high-quality gene models were selected using EVidenceModeler1.1.1 (Haas et al., 2008). They were used to train AUGUSTUS again. Then the species-specifically trained AUGUSTUS was run to predict genes in the assembly, taking as hint all those intergenic and gene evidences collected above. Finally, low-quality genes, which had low confidence score and no BLAST hit to Swiss-Prot (www. uniprot.org), were removed.
The repeat elements, identified using RepeatModeler and RepeatMasker, account for 19.98% (144 Mbp) of the assembly. DNA elements account for 44.69%; SINE, 0.95%; LINE, 11.36%; and LTR elements for 4.10% of the repetitive fraction of the genome (Figure 1). The majority of the represented TE landscape is composed of DNA elements. This makes it distinct from more ancient fishes. In the elephant shark (Callorhinchus milii), TEs are mostly composed of SINE and LINE elements. The Gila topminnow genome shares with other poeciliid genomes (i.e., southern platyfish, guppy, and Amazon molly) an ancient wave of transposable element (TE) expansion. Nonetheless, the more recent TE expansion that is typically seen in poeciliids appears to have started relatively more recently in the Gila topminnow, where its peak in the most recent elements implies that this expansion is probably ongoing. Detailed studies on TE expression and activity in the Gila topminnow will yield insights into how TEs shape the evolution of their host genomes.
In total, 41,501 genes were predicted of which 10,625 were removed because they were deemed of low-quality [i.e., no hit in BUSCO, Swissprot, and Pfam database, failing to present an intact structure (both start and stop codon predicted) and with Augustus score <80). Among the 30,976 retained genes, 27,947 (90.22%) were annotated with start and stop codons, 28,141 (90.84%) have BLAST hit to database Swiss-Prot (www.uniprot.org), 24,912 (80.42%) were suggested by InterProScan (http://www.ebi.ac.uk/interpro/interproscan.html) to contain functional protein domains, and 28,722 (92.72%) were supported by RNA-seq reads. The mean coding sequence length in the retained genes is 1,463 bp and the longest is 54,050 bp. A quality assessment by BUSCO analysis revealed 95.4% complete conserved Actinopterygii genes, 3.6% fragmented and only 1.0% missing genes (Supporting Table S1). The annotation process thus increased the quality of all BUSCO parameters evaluated in this assembly.
We collected 251,212 genes from the genomes listed above and clustered them into 20,823 groups based on sequence similarity. Among them 11,639 were shared by P. occidentalis, T. rubripes, X. maculatus, G. aculeatus, and O. latipes (Supporting Figure S3). Five thousand two hundred seventy-six groups contained only one gene from each of the fish. These genes were identified as one-to-one orthologs and were used to construct the phylogenomic tree using RaxML (Figure 2), as follows. One-toone orthologs were identified during the orthology assignment after genes were clustered into groups. Protein sequences of Frontiers in Ecology and Evolution | www.frontiersin.org 4 October 2019 | Volume 7 | Article 404 each ortholog were then aligned among species using MAFFT (Nakamura et al., 2018). Alignment regions with bad quality were trimmed using trimAI (Capella-Gutiérrez et al., 2009). After trimming, the alignments of orthologs were concatenated into a massive alignment. Based on the massive alignment and with L. oculatus set as the outgroup, RaxML was used to reconstruct the phylogeny (Stamatakis, 2014). Clade support was assessed by means of a bootstrap analysis (100 replicates). Inferred relationships among all taxa were as expected (e.g., Betancur et al., 2013;Reznick et al., 2017;Bragança et al., 2018). As observed in Figure 2, five genes are absent from all Poeciliid branches and thus likely lost in their common ancestor (Solute carrier family 27 member 6; zgc:55888; Leucine-rich repeat neuronal protein 1-like; TGF-beta receptor type-2-like; the other one not characterized). Two genes (both not characterized; "InPoecil" Figure 2) are present in all poeciliid branches and absent in all others; thus, likely gained in the common ancestor of poeciliids.

ASSEMBLY AND ANNOTATION OF MITOCHONDRIAL GENOMES FROM P. OCCIDENTALIS AND P. SONORIENSIS
A BLAST search using the cytochrome b sequence of P. occidentalis was used to search among the assembled contigs to identify the mitochondrial genome. A contig 16,912 bp long was recovered and annotated using the MitoFish webserver (Iwasaki et al., 2013;Sato et al., 2018). The ND2 gene, which spanned the ends of the contig, was corrected with Sanger sequences obtained from PCR products of the same specimen, leading to a shorter contig of 16,772 bp, which was circularized and annotated (NCBI Acc. No. MK860198) based on the P. occidentalis mitochondrial genome available at NCBI (Acc. No. KP013108). We also assembled the entire mitochondrial genome of P. sonoriensis (Acc. No. MK860197) by mapping the RNAseq reads from this species to the mitochondrial genome of P. occidentalis. Comparison of the three mitochondrial genomes revealed 0.23% divergence (uncorrected p) between the two samples of P. occidentalis s.s., and 0.94-0.99% divergence between P. occidentalis s.s. and P. sonoriensis.

CONCLUSION
We confirm the utility of 10X Genomics technology and the Supernova assembler to generate an assembly with high contiguity and high quality (e.g., Ozerov et al., 2018 and references therein). Our results demonstrate the utility of longterm frozen material for this purpose. The scaffold N50 above 1 Mb that we obtained is in the range of the best assemblies based exclusively on Illumina (i.e., short reads), but only those that have employed jumping libraries (also known as long-insert pairedend reads or mate pair libraries) (e.g., Liu et al., 2017;Schartl et al., 2019). As the first published genome assembly for the genus Poeciliopsis, we expect it will serve as a valuable resource for research in phylogenomics, enabling the generation of a robust framework for macroevolutionary questions, including speciation, hybridization, and adaptation. Furthermore, this resource is expected to facilitate research aimed at taxonomic delimitation and conservation genetics of this endangered taxon.
FIGURE 2 | Phylogenetic relationships and gene repertoire of P. occidentalis with other fish species. (A) A phylogenetic tree reconstructed using RaxML based on 5,276 one-to-one orthologs, the numbers on the nodes refer to the support value calculated from 100 bootstraps. (B) A bar chart revealing the percentage of orthologous genes of different types. "1:1" refers to universal single-copy genes; "X:X" orthologs exist in all species but not always as single copy; "NoPoecil," orthologs exist in none of the Poeciliid branches (Poecilia reticulata, Poecilia formosa, P. occidentalis, Xiphophorus maculatus, and Gambusia affinis) but in at least two non-Poeciliid branches; "InTeleo," orthologs exist in at least two Poeciliids but none of non-Poeciliid branches; "Homology," orthologs exist in both Poeciliids and non-Poeciliids but are missing from at least one branch; "LinSpe," lineage-specific genes where no ortholog was found in other branches.

ETHICS STATEMENT
The animal study was reviewed and approved by the University of Connecticut's Institutional Animal Care and Use Committee for a subset of the specimens. The other specimens have been frozen for 18-20 years at institutions that did not have an animal ethics committee at the time.

AUTHOR CONTRIBUTIONS
MM and MScha designed the study, supervised all steps of the project, analyzed the data, and drafted the manuscript. MM provided the biological material. HP performed the genome sequencing. CK assembled the genome. YG and MScha analyzed the genome data. DK did the genome annotation. MG-O and MScha isolated and processed the HMW DNA and the RNA. MSchu performed the RNA-seq from the whole fish. NJ provided RNA-seq read data from ovary and embryo. All authors were involved in the manuscript writing and editing.