Isoform Sequencing Based Transcriptome Resource for Flathead Grey Mullet (Mugil cephalus)

Mugil cephalus, commonly known as flathead grey mullet, is an important candidate brackish water food fish species of herbivorous nature positioned at lower trophic levels of the food chain. The biology, genetics, ecology, and fishing aspects of grey mullet were thoroughly reviewed by Whitfield et al. (2012). In the Indian context, though captive broodstock of grey mullet was developed about two decades back (Abraham et al., 2000), large-scale successful induced breeding trials were realized only recently (Sukumaran et al., 2021). Further, the reproductive period of captive grey mullet is reported to be asynchronous between the Eastern and Western Coast stocks of India. The peak reproductive periods are November and June–August for Eastern and Western grey mullet captive stocks, respectively (Sukumaran et al., 2021). A year-round commercial seed production with captive broodstock could not be realized till date due to constraints in captive maturation and seasonal spawning behavior of the species. The availability of genomic and transcriptomic resources would lead to a better understanding of the regulatory pathways involved in captive maturation and breeding of grey mullet. Such information would also facilitate the conduct of functional studies and derive meaningful inferences out of them for the improvement of grey mullet. Conventionally, transcriptome resources are established by sequencing RNA on an Illumina sequencer wherein the short RNA sequence reads are assembled to generate full-length transcripts. With the development of Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) sequencing, the full-length isoform-level transcript sequences could be generated directly without the necessity of assembly. Recently, the PacBio Isoform Sequencing (IsoSeq) based full-length transcript resources were made available for candidate brackish water aquaculture species (Zhang et al., 2019; Katneni et al., 2020; Pootakham et al., 2020) with potential applications to an understanding of economic traits and role in genome annotation. Furthermore, the IsoSeq transcripts can be easily annotated (Zeng et al., 2018) and would help in the discovery of splice sites and structural annotation of the genome. For grey mullet, transcript information based on short-read sequence datasets is available (Byadgi et al., 2016; Dor et al., 2020), but no full-length transcript resource is available. So far, a


INTRODUCTION
Mugil cephalus, commonly known as flathead grey mullet, is an important candidate brackish water food fish species of herbivorous nature positioned at lower trophic levels of the food chain. The biology, genetics, ecology, and fishing aspects of grey mullet were thoroughly reviewed by Whitfield et al. (2012). In the Indian context, though captive broodstock of grey mullet was developed about two decades back (Abraham et al., 2000), large-scale successful induced breeding trials were realized only recently (Sukumaran et al., 2021). Further, the reproductive period of captive grey mullet is reported to be asynchronous between the Eastern and Western Coast stocks of India. The peak reproductive periods are November and June-August for Eastern and Western grey mullet captive stocks, respectively (Sukumaran et al., 2021). A year-round commercial seed production with captive broodstock could not be realized till date due to constraints in captive maturation and seasonal spawning behavior of the species. The availability of genomic and transcriptomic resources would lead to a better understanding of the regulatory pathways involved in captive maturation and breeding of grey mullet. Such information would also facilitate the conduct of functional studies and derive meaningful inferences out of them for the improvement of grey mullet.
Conventionally, transcriptome resources are established by sequencing RNA on an Illumina sequencer wherein the short RNA sequence reads are assembled to generate full-length transcripts. With the development of Pacific Biosciences (PacBio) Single Molecule Real-Time (SMRT) sequencing, the full-length isoform-level transcript sequences could be generated directly without the necessity of assembly. Recently, the PacBio Isoform Sequencing (IsoSeq) based full-length transcript resources were made available for candidate brackish water aquaculture species (Zhang et al., 2019;Katneni et al., 2020;Pootakham et al., 2020) with potential applications to an understanding of economic traits and role in genome annotation. Furthermore, the IsoSeq transcripts can be easily annotated (Zeng et al., 2018) and would help in the discovery of splice sites and structural annotation of the genome.
For grey mullet, transcript information based on short-read sequence datasets is available (Byadgi et al., 2016;Dor et al., 2020), but no full-length transcript resource is available. So far, a full genome is available for only the red lip mullet (Liyanage et al., 2019;Zhao et al., 2021) among fishes in the family Mugilidae. For the first time, we report here a transcriptome resource for grey mullet based on the PacBio IsoSeq reads using nine different adult tissues and five different developmental stages.

Sample Collection and RNA Extraction
About 9 tissues (muscle, gills, liver, kidney, spleen, heart, stomach, intestine, and gonad) from adult grey mullet and whole specimens from five early developmental stages, including eggs, 3/12/24/55day old larvae, were collected from the Muttukadu experimental station. The samples were collected following the guidelines approved by the Institutional Animal Ethics Committee of ICAR-Central Institute of Brackishwater Aquaculture (CIBA/ IAEC/2021-19). The species identity of the adult specimen is confirmed with the partial sequence of the barcoding gene, cytochrome C oxidase I (GenBank accession: MW584357). All the samples were flash frozen in liquid nitrogen and stored at a temperature of −80°C. The total RNA was extracted using the conventional TRIzol method and its quantity was measured using a Qubit 3.0 fluorometer (Thermofisher #Q33238, Thermofisher Scientific, Massachusetts, USA) using an RNA HS assay kit (Thermofisher #Q32851, Thermofisher Scientific, Massachusetts, USA). The integrity of RNA was evaluated on a 1% agarose gel and an Agilent 2100 Bioanalyzer (Agilent Technologies, California, USA).

Illumina Library Preparation and Sequencing
To generate Illumina compatible RNAseq libraries, 1 ug of total RNA was used for Illumina library construction according to the instructions of the manufacturer using the NEBNext II RNA Library Prep Kit for Illumina ® (New England Biolabs Inc., Massachusetts, USA). The libraries were quantified with a Qubit 3.0 fluorometer (Thermofisher Scientific, Massachusetts, USA) using a DNA HS assay kit (Thermofisher Scientific, Massachusetts, USA). The insert size of the libraries was estimated by querying on Tapestation 4150 using highsensitivity D1000 screentapes (Agilent Technologies, California, USA) following the protocol of the manufacturer. Sequencing of libraries was carried out on an Illumina NovaSeq 6000, S4 Flow Cell (2 × 150 bp read lengths). 1, 2

Pacbio IsoSeq Library Preparation and Sequencing
The total RNA of the same 9 tissues and 5 developmental stages was pooled in equimolar concentration and subjected to cDNA synthesis and amplification using the NEBNext ® Single Cell/Low Input cDNA Synthesis & Amplification Module (New England Biolabs Inc., Massachusetts, USA) in conjunction with the Iso-Seq Express Oligo Kit (Pacific Biosciences, California, USA). The Pronex beads (Promega, Wisconsin, USA) were used for the purification of the cDNA before amplification and later for size selection of the amplified product. The library was constructed using the SMRTbell Express template Preparation Kit 2.0 (Pacific Biosciences, California, USA) as per the protocol of the manufacturers. The library was purified using Pronex beads (Promega, Wisconsin, USA) and the library size was assessed using a Bioanalyzer (Agilent Technologies, California, USA). About 70 pM of the library was loaded onto 8M SMRTcell and sequenced in the PacBio Sequel II system in CCS/HiFi mode.

RNAseq Based De Novo Transcript Assembly
Good quality RNAseq reads after trimming for poor quality bases and adapters using trimmomatic v0.39 (Bolger et al., 2014), were analyzed in Trinity v2.12.0 (Grabherr et al., 2011) to generate a de novo transcript assembly containing 1,051,143 transcripts. Then the transcripts were processed with TransDecoder v5.5.0 1 to predict the coding regions in which only one open reading frame (ORF) per transcript that is longer than 100 amino acids was considered further. The selected transcripts were clustered with a similarity threshold of 95% in CD-HIT-EST v4.6 (Li and Godzik, 2006) with default parameters to obtain a final set of 214,899 non-redundant transcripts.

PacBio IsoSeq Based Full-Length Transcripts
Pacbio Isoform sequencing has resulted in total data of about 118.23 Gb, comprising 37.56 million subreads ( Table 1). The Iso-Seq3 pipeline from SMRTLink v10.1 was used for data analysis. The Circular Consensus Sequence (CCS) reads were obtained from total data by calling ccs step with parameters as minimum passes = 3 and minimum quality = 0.99, which were further demultiplexed using lima with parameters peek-guess, dump-clips, and dump-removed. The full length non-concatamer sequences (FLNC) were obtained by calling the isoseq3 refine step with parameters require-polya and minrq = 1. The FLNC reads were further clustered to obtain 87,286 highquality isoforms with an N50 length of 3,736 bp. Then, Mash v2.2 (Ondov et al., 2019) was used to screen for any possible contaminants with the refseq genomes sketch. Thereafter, the redundant transcript models were collapsed using cDNA_Cupcake v. 28.0.0 2 (MIN_ALN_COVERAGE = 0.85; MIN_ALN_IDENTITY  (Omicsbox, 2022) against eukaryote_odb10 (10 September 2020) and actinopterygii_odb10 (5 August 2020) orthologous databases (Kriventseva et al., 2019) respectively (Supplementary Figure 1).

PacBio IsoSeq Transcripts Annotation
The full-length coding transcript sequences were subjected to homology-based annotation using blastx (Altschul et al., 1990) against the Actinopterygii (txid7898) cohort of non-redundant protein databases. The protein domain and orthology-based annotation was performed using the Interprosan and EggNOG mapper modules of OmicsBox v2.0.36 (Omicsbox, 2022). These annotations were merged and the final gene ontology based annotation and enzyme code mapping were obtained using OmicsBox v2.0.36 (Omicsbox, 2022 Figure 3). Enzyme codes were obtained for 9,055 (37.57%) of the functionally annotated transcripts. Transferases and hydrolases were the most dominant enzyme classes expressed (Supplementary Figure 4). Gene ontology (Supplementary Figure 5) revealed that the most expressed GO categories were cellular protein metabolic processes (Biological processes), metal ion binding (Molecular function), and intracellular membrane-bound organelles (Cellular components). Based on the InterProScan search, the major domains, families, repeats and sites in the M. cephalus transcriptome were immunoglobulin-like domain, P-loop containing nucleoside triphosphate hydrolase, leucine-rich repeats, and IQ motif, EF-hand binding site, respectively ( Supplementary  Figures 6-9). For about 69.5% of the transcripts, GO annotations were obtained through EggNOG mapper search while assigning 12.48, 43.95, and 11.51% of COG categories to information storage and processing, cellular processes and signaling, and metabolism, respectively. The SSR analysis of the IsoSeq transcripts using the MISA v1.0 tool identified 47,394 SSRs (Supplementary Figure 10). The monomeric SSRs were most abundant (55.76%), followed by trimeric SSRs (22.92%).

Isoform Diversity
The analysis of full-length transcripts in SQANTI 3 v. 4.2 (Tardaguila et al., 2018) indicated that the 24,102 unique gene models in the M. cephalus transcriptome produce 66,538 unique isoforms of different categories ( Figure 1A). Of these, 15.8% (10,525) of isoforms were categorized as Full Splice Match (FSM) as these match with the reference transcript at all the splice junctions while 14.39% (9,576) were classified as Incomplete Splice Match (ISM) as these match with consecutive but not all splice junctions as that of the reference transcripts. About 44.34% of isoforms, were categorized as novel transcripts coded by known genes. Here, a few of them were Novel in Catalog (NIC) isoforms which have either known splice junctions in different combinations or novel splice junctions from known donors and acceptors. A majority of the novel transcripts were categorized as Novel Not in Catalog (NNC), which used alternative donors/ acceptors. Another category of novel transcripts that deviated from the splicing pattern of the reference transcripts was intergenic (14.09%), where the entire transcript locus is outside the boundary of a reference gene. Other isoform categories detected were Genic genomic (3.91%; isoforms with partial exon and intron/intergenic overlap of a reference gene) and Fusion (6.38%; isoform spanning over two reference genes). About 60% of the genes were observed to produce only one isoform, and the rest of the genes produced more than one isoform, of which 4.88% (1,176) had more than 10 detected isoforms ( Figure 1B). A higher proportion of genes with more isoforms indicated a high degree of transcriptome complexity of the M. cephalus transcriptome. In addition, out of the splice junctions detected in M. cephalus genes, 96.78% (183,204) are canonical with the standard GT-AG (also GC-AG and AT-AC) intron flanking sequence at splice sites while 3.22% (6,088) are non-canonical, having rare intron splice sites (Supplementary Table 1). The splice junctions detected for all of the FSM, ISM, and NIC isoforms were canonical, while 17.5% of NNC isoforms were found to have non-canonical splice junctions (Supplementary Figure 11). We have also verified the splice junctions detected in IsoSeq transcripts with the help of RNAseq reads generated in this study. The splice junctions of more than 99% of FSM, ISM, and NIC isoforms were supported by RNAseq reads, while the splice junctions of only 79.6% of NNC isoforms had support from RNAseq reads (Supplementary Figure 12). The unverified splice junctions could be either false or the result of tissue-specific gene expression. In addition, we have used SUPPA2 (Trincado et al., 2018) with default parameters to estimate the possible alternative splicing (AS) events in the M.cephalus transcriptome. We have identified 25,983 AS events in 5,903 genes, of which 18.51% are A3/A5 (alternative 3′ and 5′ splice sites), 42.64% are AF/AL (alternative first and last exons), 14.73% are SE (skipping exon), 23.09% are RI (retained intron), and 1.04% are MX (mutually exclusive exon) events ( Figure 1C).

Rarefaction Curve Analysis
A rarefaction curve (RC) analysis was performed to assess the sufficiency of sequencing toward the discovery of genes and isoforms. The RC analyses were executed with the collapsed transcripts by following the cDNA_Cupcake scripts 2 , subsample.py (min_fl_count = 2; step = 1,000) and subsample_with_category.py. While the former script was used to subsample at the gene and isoform levels, the latter script was used to sample various isoformlevel categories. For M. cephalus IsoSeq data, it was observed that the discovery of isoforms required a higher sequence depth than the discovery of genes to reach saturation (Supplementary Figure 13). Among the various isoform categories identified by Squanti3, the NNC type of isoform needed more sequencing depth to reach saturation ( Figure 1D) than any other category of isoform. The tissue-specific expression and low abundance nature of NNC isoforms would explain the necessity of a higher sequence depth to detect them. Overall, the RC analyses indicated the sufficiency of the sequencing data generated in this study toward the detection of genes and isoforms.

RE-USE POTENTIAL
For the first time, an isoform-level, full-length transcriptome resource was generated for grey mullet based on the Pacbio Iso-Sequencing using nine different adult tissues and five different developmental stages. The full-length transcripts are of great value for annotation of the M. cephalus genome, the conduct of functional studies, and to support annotations of other mullet fish genomes. The transcript assembly has the potential to contribute to a better understanding of the regulatory pathways involved in the captive maturation and breeding of grey mullet.

ETHICS STATEMENT
The animal study was reviewed and approved by the Institutional Animal Ethics Committee of ICAR-Central Institute of Brackishwater Aquaculture (CIBA/IAEC/2021-19).

AUTHOR CONTRIBUTIONS
MS, MK, and JJ conceived and designed the study and experiments. JA and KS collected the samples. SP, KK, AJ, and VK performed bioinformatics analysis. AJ and VK wrote manuscript with inputs from all coauthors. All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

FUNDING
The authors acknowledge the funding support from the ICAR-CRP on Genomics project, "Genomic resources for augmentation of economic traits in Indian white shrimp Penaeus indicus and whole genome sequencing of brackishwater aquaculture candidate species".

ACKNOWLEDGMENTS
The authors thank the Director of ICAR-CIBA for providing the necessary infrastructure and support for the study.