Construction of a Full-Length Transcriptome Resource for the Chinese Sucker (Myxocyprinus asiaticus), a Rare Protected Fish, Based on Isoform Sequencing (Iso-Seq)

Citation: Ge H, Zhang H, Zhao Q, Li F, Gu H, Liu S, Yang H, Li Y and Wang Z (2021) Construction of a Full-Length Transcriptome Resource for the Chinese Sucker (Myxocyprinus asiaticus), a Rare Protected Fish, Based on Isoform Sequencing (Iso-Seq). Front. Mar. Sci. 8:699504. doi: 10.3389/fmars.2021.699504 Construction of a Full-Length Transcriptome Resource for the Chinese Sucker (Myxocyprinus asiaticus), a Rare Protected Fish, Based on Isoform Sequencing (Iso-Seq)


BACKGROUND
The Chinese sucker (Myxocyprinus asiaticus), which belongs to the genus Myxocyprinus in the family Catostomidae, is the only Catostomidae species in China and the only endemic Catostomidae species in Asia. It is distributed mainly in the Yangtze River. The Chinese sucker, a unique and rare freshwater species in China, has been listed as a second-class national key protected animal and is nicknamed the Asian Mermaid. The Chinese sucker is brightly colored and has a unique body shape and delicious, nutritious flesh, giving it certain value as an ornamental and edible fish (Huang et al., 2015). It is a valuable aquaculture fish (Yuan et al., 2011;Xu et al., 2013).
To date, most studies on the Chinese sucker have focused on several topics, including its resource status and artificial breeding (Songliang et al., 2002;Adeoba et al., 2019), its non-specific immunity (Zhang et al., 2009(Zhang et al., , 2021, its genetic diversity and mitochondrial genome (Chen, 2013;Liu et al., 2018), and its feed application (Guo et al., 2016;Jianhua et al., 2016;Li et al., , 2019Lu et al., 2020). As the conservation value of this species has become increasingly prominent, related research has gradually expanded. Due to the importance of germplasm resource protection, extensive research and development plans have been launched. The goals of these plans are to better understand Chinese sucker biology and to use basic knowledge to improve resource conservation and artificial breeding programs in order to achieve long-term maintenance and utilization of the germplasm resource. The ultimate goal is to realize the sustainable development of Chinese suckers.
In this study, we applied the isoform sequencing (Iso-Seq) technique of Pacific Biosciences (PacBio) for long reads to generate the first full-length transcriptome assembly of the Chinese sucker. The PacBio sequencing platform is an ideal method for constructing reference transcriptome components without reference genome sequences (Dong et al., 2015;Kuo et al., 2017;Workman et al., 2018). Full-length mRNA sequences from five major organs were obtained using PacBio's circular consensus sequencing. Due to the lack of high-quality genome sequence data for the Chinese sucker, the high-quality reference transcriptome assembly obtained in this study will be an important resource for further transcriptome analyses of Chinese suckers under various conditions.

Sample Collection
Five organs were collected from healthy 1-year-old Chinese suckers. The animal welfare and laboratory procedures for this study were in accordance with the relevant ethical requirements of the Key Laboratory of Freshwater Fish Reproduction and Development (Ministry of Education). The following organs were dissected and immediately frozen in liquid nitrogen: the brain, gills, liver, skin, and muscles. We mixed the frozen ground tissue samples together, then extracted RNA from that. All samples were stored at −80 • C before RNA extraction.

Library Construction and Single-Molecule Real-Time (SMRT) Sequencing
Total RNA was extracted by grinding tissue in TRIzol reagent (Life Technologies) on dry ice and processed following the protocol provided by the manufacturer. The integrity of the RNA was determined with an Agilent 2100 Bioanalyzer and agarose gel electrophoresis. The purity and concentration of the RNA were determined with a NanoDrop microspectrophotometer (Thermo Fisher). mRNA was enriched with oligo (dT) magnetic beads. Then, the enriched mRNA was reverse-transcribed into cDNA using a Clontech SMARTer PCR cDNA Synthesis Kit. The PCR cycling parameters were optimized to determine the best number of amplification cycles for the downstream large-scale PCR procedures. Then, double-stranded cDNA was generated with the determined number of cycles. In addition, cDNA molecules >5 kb in size were selected using a BluePippin TM Size-Selection System and mixed equally with non-size-selected cDNA. Then, large-scale PCR was performed for the next step: SMRTbell library construction. The cDNA was subjected to DNA damage repair and end repair and was ligated to sequencing adapters. The SMRTbell template was annealed to a sequencing primer, bound to polymerase, and sequenced on the PacBio Sequel II platform by Gene Denovo Biotechnology Co. (Guangzhou, China).

Data Processing
The raw sequencing reads of the cDNA libraries were analyzed by using the SMRT Link (V9.0.0) pipeline. First, high-quality circular consensus sequences (CCSs, HiFi reads) were extracted from the BAM subread file using the CCS function with the following parameters: min_predicted_accuracy 0.8, min_length 50, min_passes 1. We analyzed the integrity of transcripts on the basis of by whether the CCS reads contained 5 ′ primers, 3 ′ primers and polyA structures. Sequences containing all three structures were considered full-length sequences (fulllength reads, FL reads). Afterwards, the primers, barcodes, polyA tails and concatemers of full passes were removed to obtain the full-length nonchimeric (FLNC) reads. The FLNC reads were clustered to generate the entire isoform. The Cluster function (which use minimap2 for clustering, and use transcripts as reference) was used to cluster the consensus sequences for similar FLNC reads with the following parameters: hq_quiver_min_accuracy 0.99, qv_trim_5p 100, and qv_trim_3p 30, use the Quiver algorithm for correction. Then, we used the Cluster Database at High Identity with Tolerance (CD-HIT) program (v4.6.7) to further correct the consensus sequences with the following parameters: -c 0.99, -T 6, -G 0, -aL 0.90, -AL 100, -aS 0.99, and -AS 30. We used BUSCO for assembly access According to the results, high-quality isoforms (prediction accuracy ≥ 0.99) were used for further analysis (Supplementary Figure 1).
The open reading frames (ORFs) of the isoform sequences were detected by using ANGEL software (version 2.4) with the default parameters to obtain the coding sequences (CDSs), protein sequences, and UTR sequences (Shimizu et al., 2006). Protein domain prediction was performed by aligning the protein sequences of the isoforms to the Pfam database (v26.0) with the Pfam_Scan program (v1.3) and to the SMART database (updated 06/08/2012) with HMMER (v3.0). The parameters -E 0.00001 and -domE 0.00001 were used to obtain protein domain annotations (Eddy, 2011;Letunic et al., 2012;Finn et al., 2015). The protein-coding sequences of the isoforms were aligned with hmmscan (v3.1b2) with the default parameters against TFdb (v2.0) to predict the transcription factor (TF) families .
Potential transmembrane helices in the proteins were predicted with TMHMM Server 2.0 (Möller et al., 2001).
The presence and locations of signal peptide cleavage sites in the amino acid sequences were predicted with SignalP 4.1 Server (Nielsen, 2017). Mucin-type GalNAc O-glycosylation sites in mammalian proteins were predicted with the NetOGlyc 4.0 Server (Steentoft et al., 2013). Arginine and lysine propeptide cleavage sites in eukaryotic protein sequences were predicted with ProP 1.0 Server (Duckert et al., 2004). ca ta ly tic a ct iv ity m o le cu la r fu n ct io n re g u la to r tr a n sp o rt e r a ct iv ity n u cl e ic a ci d b in d in g tr a n sc ri p tio n fa ct o r a ct iv ity m o le cu la r tr a n sd u ce r a ct iv ity si g n a l tr a n sd u ce r a ct iv ity tr a n sc ri p tio n fa ct o r a ct iv ity , p ro te in b in d in g st ru ct u ra l m o le cu le a ct iv ity tr a n sl a tio n re g u la to r a ct iv ity a n tio xi d a n t a ct iv ity e le ct ro n ca rr ie r a ct iv ity The interruptions parameter indicates that if the distance between two simple sequence repeats (SSRs) is shorter than 100 bp, the SSRs will be considered as one SSR.

Gene Structure Prediction
Based on the MISA results, Primer 1.1.4 was used to design primer pairs in the flanking regions of the SSRs for subsequent validation (Rozen and Skaletsky, 2000). CPC (v0.92r2) was used to assess the protein-coding potential of transcripts without annotations according to the default parameters in order to identify potential long non-coding RNAs (lncRNAs), CPC reference database is made of uniport sequences which come from Swiss-prot database. LncRNA analysis was performed on the full-length transcript sequences that were not annotated to the four major databases. The RNAs predicted as "non-coding" by both software programs were considered the lncRNAs. To better annotate lncRNAs at the evolutionary level, the software Infernal (v1.1.2) (http://eddylab.org/infernal/) was used for sequence alignment with the default parameters (Nawrocki and Eddy, 2013). The lncRNAs were classified by their secondary structures and sequence conservation.
To analyze the alternative splicing (AS) events of the transcript isoforms, the COding GENome reconstruction Tool (Cogent, v3.3) was initially used with the default parameters to divide the transcripts into gene families based on k-mer similarity and to reconstruct each family into a coding reference genome based on a De Bruijn graph (Li et al., 2017). Then, the SUPPA (2.2) tool was used to analyze AS events of isoforms using the default parameters (Alamancos et al., 2015).

Data Summary
A total of 71.58 Gb of nucleotide data was obtained from Iso-Seq using the PacBio SMRT Sequencing method. The raw sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) database with the accession number PRJNA718002. After initial quality control involving removal of the adaptor reads and subreads <50 bp in length, a total of 33,957,771 reads (71.58 Gb of nucleotides) were generated (Supplementary Figure 2a), with an average length and N50 of 2,263 and 2,684 bp, respectively. Thereafter, all subreads were used for CCS analysis. A total of 997,309 CCSs were produced (Supplementary Figure 2b), with an average length of 2,805 bp. FLNC read clustering was performed using Minimap2 for error correction of the third-generation sequencing data; 58,375 high-quality (HQ) isoforms (prediction accuracy ≥ 0.99) were obtained for subsequent analysis, and 1,731 low-quality (LQ) isoforms were obtained (prediction accuracy < 0.99) (Supplementary Figure 2c). CD-HIT was used to cluster redundant sequences with >99% similarity. Ultimately, 54,470 full-length non-redundant transcripts were obtained (Supplementary Figure 2d), with a maximum length of 14,296 bp and an N50 length of 3,406 bp (Table 1). Using the SMRT Sequencing pipeline, the number of nonredundant full-length transcripts was significantly reduced. BUSCO evaluation indicates suitable sequence completeness (Supplementary Figure 2e).
A total of 357 pathways were derived from the KEGG database, and a total of 51,894 transcripts were annotated. These pathways include 6 primary categories with 46 secondary subcategories. The most prominent subcategory was "metabolism."

Gene Structure
AS of RNA is widespread in biology. AS occurs after mRNA transcription but before template DNA is formed and produces multiple proteins from a single protein-coding gene. In total, we identified 9,499 AS events (Supplementary Figure 3a). Since no reference genome was available, for these data, only 1,855 events were classified into 6 types of AS events (Supplementary Figure 3b). The most prominent type of AS was intron retention (RI, 1382).
To study genetic diversity, assess quality and promote genetic research, the SSRs identified in the Iso-Seq library were analyzed. In addition, a total of 21,428 SSRs were detected in 13,714 transcripts (Supplementary Figure 3c). Of these SSRs, 14,028 (65.14%) were dinucleotide repeats, most of which consisted of 4-7 repeated sequences; 3,977 SSRs were dinucleotide repeats with 8-11 repeats. In addition, 3,919 SSRs were trinucleotide repeats with 4-7 repeats (Supplementary Figure 3d).

LncRNA Prediction, Coding Sequence, and TF Analyses
In this study, Iso-Seq was also used to sequence lncRNAs with polyA ends. Due to the lack of a genome, the exons of each lncRNA were not evaluated, and the expression densities of lncRNAs and protein-coding RNAs were analyzed at the same time (Supplementary Figure 4b).
The protein-encoding transcripts identified from the mRNA transcripts were submitted to ANGEL to predict the CDSs. A total of 52,465 (96.32%) protein-coding transcripts were predicted (Supplementary Figure 4a). Most of the CDSs were shorter than 2,500 bp.
Hmmsearch was used to predict the TFs among all single genes, and 3,316 TF transcripts were identified. The four most common families were the zf-C2H2 (858), TF_bZIP (266), bHLH (254), and homeobox (186) families (Supplementary Figure 4c). The results provide convenient data for further elucidation of transcriptional regulation.
In this study, we used the SMRT Sequencing method to sequence the full-length transcriptome and obtained 54,470 non-redundant full-length transcripts. A total of 52,178 (95.79%) transcripts were annotated into four databases. In addition, 2,155 (3.96%) transcripts were predicted to be lncRNAs, and 3,316 (6.09%) were predicted to encode TFs. A total of 9,499 AS events and 21,428 SSRs were detected. Thus, we obtained the full-length transcriptome of Myxocyprinus asiaticus and genetic information via analysis of known data. Our findings significantly increase the genetic information on Myxocyprinus asiaticus and will aid future research on the functions of genes related to Chinese sucker development and reproduction as well as the evolution of the Chinese sucker.

Reuse Potential
Here, we report the first full-length transcriptome of the Chinese sucker, which was obtained using PacBio long-read SMRT Sequencing technology. The Iso-Seq-based full-length transcriptome assembly reported here will greatly benefit future transcriptome analysis and genome sequencing of the Chinese sucker and closely related species.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: NCBI [accession: PRJNA718002].

ETHICS STATEMENT
The animal study was reviewed and approved by Key laboratory of Freshwater Fish Reproduction and Development (Ministry of Education), Key Laboratory of Aquatic Science of Chongqing, Southwest University, Chongqing, China. Written informed consent was obtained from the owners for the participation of their animals in this study.

AUTHOR CONTRIBUTIONS
HGe and HZ conceived and designed the experiment and drafted the manuscript. SL and HY raised the fish. HGe and HGu dissected the fish and collected fish tissue samples. HZ and QZ completed the bioinformatics analysis. HGe and FL provided the tables and figures. ZW and YL revised the manuscript. All authors read and approved the final manuscript.