Full-Length Transcriptome Sequencing From the Longest-Lived Freshwater Bony Fish of the World: Bigmouth Buffalo (Ictiobus Cyprinellus)

Citation: Ge H, Zhang H, Yang L, Wang H, Tu L, Jiang Z, Zheng J, Chen B, Chen J, Li Y and Wang Z (2021) Full-Length Transcriptome Sequencing From the Longest-Lived Freshwater Bony Fish of the World: Bigmouth Buffalo (Ictiobus Cyprinellus). Front. Mar. Sci. 8:736188. doi: 10.3389/fmars.2021.736188 Full-Length Transcriptome Sequencing From the Longest-Lived Freshwater Bony Fish of the World: Bigmouth Buffalo (Ictiobus Cyprinellus)

As the longest-lived freshwater bony fish identified to date, the research value of bigmouth buffalo is becoming increasingly evident, and related research will gradually intensify.
In this study, we applied long-read pacific bioscience (PacBio) isomer sequencing (ISO-seq) to produce the first full-length transcriptome assembly for bigmouth buffalo. The use of the PacBio sequencing platform, especially without reference genome sequence, is an ideal method for constructing reference transcriptome assembly (Dong et al., 2015;Kuo et al., 2017;Workman et al., 2018). We used the PacBio platform to process full-length mRNA of five major organs, consistent cycle sequencing (CCS). Since bigmouth buffalo lacks high-quality draft genome sequences, we believed that our data of highquality transcriptome reference sequence could be helpful for transcriptome analysis in the future under a variety of conditions.

DATA DESCRIPTION Sample Collection and RNA Preparation
The bigmouth buffalo (Ictiobus cyprinellus) used in this experiment was taken from the Key Laboratory of Freshwater Fish Resources and Reproductive Under the Breeding Environment of Development of the Ministry of Education, 1 year old, sex unknown, physical health, no disease, and no damage. In this study, the animal welfare protocols and experimental procedures applied were carried out in accordance with the recommendations of animal research ethics guidelines, and the program complied with the relevant ethical regulations of the Key Laboratory of Freshwater Fish Resources and Reproductive Development of the Ministry of Education. Samples of five organs, namely, gills, skin, brain, liver, and muscle, were frozen in liquid nitrogen before storage at −80 • C immediately after dissection. We mixed frozen tissue samples of five organs together and extracted RNA from them.

Library Construction and SMRT Sequencing
The tissue was ground in TRIzol reagent (Life Technologies, Thermo Fisher Scientific, Shanghai, China) on dry ice to extract the total RNA, following the protocol provided by the manufacturer. Then, we used Agilent 2100 Bioanalyzer (Agilent, Beijing, China) to evaluate the RNA integrity. A NanoDrop microspectrophotometer (Thermo Fisher, Thermo Fisher Scientific, Shanghai, China) was used to determine the purity and concentration of RNA. mRNA was enriched using Oligo (dT) magnetic beads (NEB). We used Clontech SMARTer PCR cDNA Synthesis Kit (Takara Bio, Takara Biomedical Technology, Beijing, China) to reverse transcribe mRNA into cDNA. PCR cycle optimization was used to determine the optimal amplification cycle number for the downstream largescale PCR. Then, the optimized cycle number was adopted to generate double-stranded cDNA. In addition, >5 kb size selection was performed using the BluePippin TM Size-Selection System, and the size-selected cDNA was mixed in equal amounts with the non-size-selected cDNA. Then, large-scale PCR was performed for subsequent construction of the SMRTbell library. Then, DNA damage repair, end repair, and ligation with sequencing adapters of the cDNAs were processed. The SMRT sequencing was performed on the PacBio Sequel II platform (PacBio) (Supplementary Figure 1A).

Generation of Full-Length Transcriptomes
The SMRT Link (version 9.0.0) pipeline was used to process the raw sequencing data. First of all, the CCS function was used to extract high-quality circular consensus sequences (CCS, HiFi reads) from the subread BAM file. The sequences containing structures of 5 ′ primers, 3 ′ primers, and polyA were considered as full-length sequences (full-length reads, FL reads). Full-length non-chimeric (FLNC) reads were formed by removing primers, barcodes, trimmed polyA tails, and concatemers of full passes and then used to generate complete isoforms by cluster. Similar FLNC reads were obtained using the cluster function to cluster the consistent sequences (which use minimap2 mapping to transcripts). Then, we used CD-HIT (version 4.6.7) to further correct the consistent sequences. Finally, BUSCO4 (https:// busco.ezlab.org/) was used for assembly access. The high-quality isoforms according to these results (prediction accuracy ≥ 0.99) were used for our annotation of the transcriptome.

Predicting Gene Structure of Isoforms
For further information, we processed the predicted coding sequences (CDSs), alternative splicing (AS) isoforms, simple sequence repeats (SSRs), long non-coding RNAs (lncRNAs), and transcription factors (TFs). Open reading frames (ORFs) were detected by applying ANGEL software 1 (version 2.4) to the isoform sequences to predict the coding sequence regions (Shimizu et al., 2006).
To analyze the AS events of the isoforms from our data, the COding GENome reconstruction Tool Cogent (version 3.3) was used for the purpose to divide the transcripts into gene families, which use k-mer similarity algorithm based on a De Bruijn graph and then reconstruct families to produce coding reference genome. Then, we further analyzed AS events using the SUPPA program 2 (2.2) among the isoforms. The MIcroSAtellite server (http://pgrc.ipk-gatersleben.de/misa/) (MISA, version 1.0) was used for microsatellite annotation of the whole transcriptome. We used Primer 1.1.4 program to design primer pairs in the flanking regions of SSRs for subsequent validation from MISA results (Rozen and Skaletsky, 2000).
We used CNCI 3 (version 2.0) and CPC 4 (version 0.92r2) programs to access protein-coding potential of transcripts without annotations for potential long non-coding RNAs CPC reference database use UniProt sequences in SWISS-PROT database. LncRNA analysis used full-length transcripts apart from the four major databases. Results predicted by both software programs were considered to represent "noncoding" sequences and were taken as the final lncRNA results. We used Infernal (version 1.1.2) software (http://eddylab.org/ infernal/) for sequence alignment (Nawrocki and Eddy, 2013). LncRNAs were classified according to sequence conservation and secondary structures. The animal protein-coding sequences of isoforms were aligned by using hmmscan (version 3.1b2) (http://hmmer.org/download.html) against TFdb 5 (version 2.0) to predict TF families .
2 Leveraging transcript quantification for fast computation of alternative splicing profiles. 3 Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts. 4 CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. 5 Huazhong University of Science and Technology. AnimalTFDB 2.0: a resource for expression, prediction and functional study of animal transcription factors.

Quality Control of the Full-Length Transcriptomes
We obtained a total of 38,548,713 reads (76.18 Gb nucleotides) (Supplementary Figure 2A), with 2,121 bp of average length and 2,581 bp of N50 length. Then the data were deposited in the NCBI SRA database with project number PRJNA718003. All the subreads were further analyzed into CCS. A total of 1,063,453 CCSs were generated with an average length of 2,805 bp. FLNC reads used Minimap2 error correction to polish the third-generation sequence data into clusters. We generated 57,678 high-quality isoforms (HQ isoforms, prediction accuracy ≥ 0.99) for further analysis, and 1,415 low-quality sequences (low-quality isoforms, LQ isoforms, prediction accuracy < 0.99) were obtained but not used in subsequent analysis. Later, we used CD-HIT-EST to process FLNC reads into clusters to remove redundant sequences, which was defined as the sequence similarity >99%. Finally, 54,319 full-length nonredundant transcripts were generated, with the features, i.e., 14,591 bp of maximum length and 3,308 bp of N50 sequence length (Supplementary Figure 2D). After the SMRT sequencing pipeline processing, we used BUSCO 6 to evaluate the sequence completeness, and the result shows that the number of nonredundant sequence we generated was significantly reduced compared with the previous data (Supplementary Figure 2E).

Prediction of Coding Sequences and Functional Annotation
Full-length non-redundant transcripts were aligned against four databases using BLASTx. A total of 52,443 (96.54%) transcripts were annotated, and 40,067 (73.76%) transcripts were annotated in all four databases. The prediction and functional annotation of the encoded transcripts resulted in the annotation of 52,415 (96.49%), 52,182 (96.07%), 48,376 (89.06%), and 40,150 (73.92%) transcripts in the non-redundant (NR) database, KEGG database, SWISS-PROT, and KOG database, respectively (Figure 1A), as full-length transcripts.

Gene Structure Predictions
Protein-encoding transcripts were identified from the mRNA transcripts used for the ANGEL prediction of CDSs. A total of 52,726 (97.07%) transcripts were predicted (Supplementary Figure 3E and Table 1). Most of these CDSs were shorter than 2,500 bp. A total of 766 domains from a total of 25,024 isoforms were aligned via protein domain prediction.
RNA AS is a widespread biological phenomenon. It occurs after mRNA transcription before the formation of template DNA, and these events helped single protein-coding genes to form multiple proteins and increased biodiversity. We obtained 9,707 AS events at last (Supplementary Figure 3A), due to the lack of the reference genome of bigmouth buffalo, only 1,892 events were classified into 7 types of AS events (Supplementary Figure 3B) and the most identified event was reserved introns (RIs, 1,382).
We used high-quality isoforms, promoted genetic research to study genetic diversity, and analyzed SSRs identified in the ISO-seq library. For complex SSRs, a total of 20,684 SSRs were detected from 13,272 transcripts (Supplementary Figure 3C). The results showed that the dinucleotide repeats were the prominent part, with the number of 9,471 (65.14%), and most of them showed 4-7 duplicates. In the total of our results, the higher annotation number were dinucleotide repeats with 8-11 duplicates and trinucleotide repeats with 4-7 duplicates (Supplementary Figure 3D).
Long non-coding RNA was also analyzed using our data, which possess the feature of polyA ends. Since the bigmouth buffalo reference genome was not available, the exons of lncRNA had not been evaluated, but the figure shows the different annotation results from two software (Supplementary Figure 3F). We used Hmm search to predict the TFs from all single protein-coding genes, and we obtained 3,316 TF transcripts. The first four common families were the zf-C2H2 (748), bHLH (271), TF_bZIP (255), and homeobox (253) families (Supplementary Figure 3G). The results may facilitate the further elucidation of transcriptional regulation.

Reuse Potential
We reported for the first time the full-length transcriptome of bigmouth buffalo obtained using the PacBio SMRT platform. Our data of assembly and annotation results based on the ISO-seq technology could be helpful for further relative research on this species and could be of great benefit to future transcriptome analysis and genome sequencing of the bigmouth buffalo and its relatives.

DATA AVAILABILITY STATEMENT
The raw sequencing data and files from the gene abundance analysis conducted in this study were deposited in the NCBI Sequence Read Archive (SRA) with accession number PRJNA718003.

ETHICS STATEMENT
The animal study was reviewed and approved by the Key Laboratory of Freshwater Fish Resources and Reproductive Development of the Ministry of Education.