A Novel Full-Length Transcriptome Resource for Black Tiger Shrimp (Penaeus monodon) Developed Using Isoform Sequencing (Iso-Seq)

Citation: Pootakham W, Uengwetwanit T, Sonthirod C, Sittikankaew K and Karoonuthaisiri N (2020) A Novel Full-Length Transcriptome Resource for Black Tiger Shrimp (Penaeus monodon) Developed Using Isoform Sequencing (Iso-Seq). Front. Mar. Sci. 7:172. doi: 10.3389/fmars.2020.00172 A Novel Full-Length Transcriptome Resource for Black Tiger Shrimp (Penaeus monodon) Developed Using Isoform Sequencing (Iso-Seq)


BACKGROUND
The black tiger shrimp, Penaeus monodon, is one of the major cultured penaeid shrimp species, contributing 9% of total crustacean production (FAO, 2018). With its economic and nutritional importance, extensive research and development programs have been initiated to gain better understanding of various biological processes in black tiger shrimp and to apply this basic knowledge to improve farming practices and advance selective breeding programs with an ultimate goal of sustainable shrimp farming industry. Although several transcriptome studies in shrimp have been performed using expressed sequence tags (ESTs) (Lehnert et al., 1999;Supungul et al., 2002;Tassanakajon et al., 2006), microarray technology (Karoonuthaisiri et al., 2009;Wongsurawat et al., 2010;Leelatanawit et al., 2011;Uawisetwathana et al., 2011), and second-generation sequencing (SGS) technologies (Rotllant et al., 2015;Nguyen et al., 2016;Huerlimann et al., 2018;Uengwetwanit et al., 2018), the efforts were significantly impeded by the lack of high-quality reference genome assembly in this organism. It is also extremely challenging to obtain a complete, full-length transcriptome assembly from short-read RNA sequencing (RNAseq) data given the high contents of repetitive elements, which have been estimated to be 46.68-51.18% (Huang et al., 2011;Yuan et al., 2018).
In this study, we applied long-read Pacific Biosciences (PacBio) isoform sequencing (Iso-seq) to generate the first full-length transcriptome assembly for black tiger shrimp. Full-length mRNA sequences from nine major organs and hemocytes were obtained using PacBio's circular consensus sequencing (CCS) technology. PacBio sequencing platform enables full-length transcripts from their 5 ′ ends to poly(A) tails to be captured in single long reads, making this an ideal approach for constructing a reference transcriptome assembly without reference genome sequences (Dong et al., 2015;Kuo et al., 2017;Workman et al., 2018). The ability to capture full-length transcripts also facilitates the discovery of novel isoforms and alternative transcripts that vary with developmental stages (Thatcher et al., 2016), cell types (Swarup et al., 2016) and stress conditions (Yan et al., 2012;Liu et al., 2016). Without a high-quality genome sequence draft, we believe that this high-quality reference transcriptome assembly will be a valuable resource for transcriptome profiling studies under various conditions in black tiger shrimp.

Sample Collection
A total of nine organs and hemocytes were harvested from two male and two female 4-month-old juvenile black tiger shrimps (Yui Gung Oog, Prachuap Khiri Khan, Thailand). The shrimps were specific pathogen free: they were tested for White Spot Syndrome Virus (WSSV), Yellowhead disease, Taura Syndrome (TS) virus and acute hepatopancreatic necrosis disease (AHPND). All tissue samples (except ovary and testis) were pooled from four individuals while the ovary and testis tissues were pooled from two individuals. This study was performed in accordance with the recommendations of Animal Research Ethics Guidelines, and the protocol was approved by the National Center of Genetic Engineering and Biotechnology Animal Research Ethics Committee (approval number BT-IACUC-RF01-10-01). The following organs were dissected and immediately frozen in liquid nitrogen: gill, heart, hepatopancreas, intestine, ovary, pleopod, stomach, testis and thoracic ganglia. Haemolymph was centrifuged at 3,000 rpm at 4 • C for 5 min to pellet the hemocytes, which were then frozen in liquid nitrogen. All samples were stored at −80 • C until RNA extraction.

RNA Extraction and PacBio Sequencing
Total RNA was extracted from the frozen tissues using TRI REAGENT according to the manufacturer's instruction (Molecular Research Center, USA). Contaminated genomic DNA was removed by DNase I treatment at 0.5 U/µg total RNA at 37 • C for 30 min (Promega, USA). One ug of DNasetreated RNA sample from each tissue was used for cDNA synthesis. SMRTbell library construction was carried out using the SMRTbell Template Prep Kit 1.0-SPv3 protocol, following the manufacturer's instruction (Pacific Biosciences, Menlo Park, USA). Barcodes were used to keep track of the tissue origin of the samples. The libraries were run on the PacBio Sequel System using the sequencing chemistry version 3.0 and 10-h movie times (the sequencing was outsourced to NovogenAIT, Singapore and Macrogen, Korea).

Transcriptome Assembly and Annotation
SMRTlink version 6.0 (Pacific Biosciences, Menlo Park, CA, USA) was used to filter and process raw subreads, using the Iso-seq3 pipeline to obtain highly accurate long reads. Subreads shorter than 50 nt (default parameter) or with the ReadScore lower than 0.8 were removed prior to generating the circular consensus sequence (CCS) reads using the following parameters: minPasses = 1 and minPredictedAccuracy = 0.9. Full-length non-chimeric (FLNC) or non full-length (NFL) reads were classified based on the presence or absence of 5 ′ primer, 3 ′ primer and the poly(A) tail. FLNC reads were clustered using the Iso-seq3 cluster module in the SMRTlink to generate consensus sequences, which were subsequently polished with the Arrow software (https://www.pacb.com/wp-content/ uploads/SMRT_Tools_Reference_Guide_v600.pdf) using NFL reads. We obtained a total of 1,604,098 reads of insert (from all tissues combined), ranging from 50 to 14,973 nt with an average read length of 2,942 and an N50 of 3,242 nt. Of those reads of insert, 82,441 were high-quality FLNC transcripts (Table 1).
To exclude contaminating sequences from other organisms, we aligned FLNC transcripts to virus, bacteria, fungi and human sequences (from the NCBI RefSeq database release number 97) using BLASTN and identified 22 FLNC transcripts that exhibited over 90% identity to those sequences with at least 90% sequence coverage. Those were removed prior to downstream analyses. We subsequently used CPAT version 1.2.4 (Wang et al., 2013) to filter out 5,627 long non-coding RNA (lncRNA) and clustered the remaining transcript sequences at 99% identity using UCLUST (Edgar, 2010) to obtain a set of 22,418 non-redundant sequences in the final transcriptome assembly. Compared to our fulllength transcriptome assembly, the previously reported assembly derived from Illumina short reads contained a greater number of transcripts (236,085 sequences; TRA: GGLH00000000) with considerably shorter average read length (956 nt) and N50 length (1,431 nt) (Huerlimann et al., 2018).
The annotations of this reference assembly, including gene ontology (GO) and eukaryotic orthologous groups (KOG) assignments, were carried out using Blast2GO version 5.2 (Götz et al., 2008) (BioBam, Valencia, Spain; Supplementary Table 1). Approximately 94% of sequences in the assembly were annotated, and 53% and 83% had been assigned GO and KOG annotations, respectively ( Table 1). The highest number of transcripts sequenced was affiliated with cellular metabolic process, organic substance metabolic process, primary metabolic process and nitrogen compound metabolic process ( Figure 1A). We also investigated the completeness of the transcriptome assembly by assessing the coverage of the benchmarking universal single-copy orthologs (BUSCO) using the eukaryotic gene set (Simão et al., 2015). Of the 429 conserved eukaryotic genes, 80.86% were identified as "complete" in the assembly and  an additional 2.64% were partially assembled. Our assembly had a much higher proportion of "missing" orthologs (16.5%) compared to the assembly reported by Huerlimann et al. (2018), which was 98% complete based on BUSCO analysis. The fact that our shrimp samples were from a single stage (4-month old) while those used in Huerlimann et al. (2018) were from multiple stages and that some of the tissues sequenced in Huerlimann et al. (2018) were not included in our study could attribute to the missing orthologs in the BUSCO assessment. Most of the conserved orthologs that were missing in our assembly were those involved in the DNA replication processes. RNA samples from nine different tissue types and hemocytes were sequenced to generate the reference transcriptome assembly. We kept track of the origin of individual full-length transcript sequence and classified each of them as follows: "tissue-specific" when it is present in only one tissue type, "ubiquitous" when it is present in all tissue types investigated and "common" when it is present in between two to nine tissue types studied. Of the 22,418 non-redundant transcript sequences in the assembly, 462 transcripts were present in all tissues examined (Supplementary Table 2). The number of nonredundant transcripts detected in each tissue type is displayed in Figure 1B and Supplementary Table 2. Hepatopancreas (26.7%) and testis (26.0%) tissues had the highest proportions of tissue-specific transcripts while heart tissues had the lowest proportion of tissue-specific transcripts (11.1%; Figure 1B and Supplementary Table 2). Interestingly, between 54 and 65% of the transcripts detected in each individual tissue type were also present in the stomach. On contrary, only 17-30% of the transcripts detected in each individual tissue type were also present in either hepatopancreas or ovary tissues (Supplementary Table 2).
The ability to capture full-length transcripts using Iso-seq technology provided a unique opportunity to detect alternative splicing events in different tissue types. To identify candidate transcripts with multiple isoforms, we examined clusters of FLNC transcripts containing sequences of multiple lengths. Examples of transcripts with different isoforms are shown in Supplementary Figure 1. A transcript annotated as T-cell immunomodulatory protein had two alternative 5 ′ splice sites and displayed a hemocyte-specific isoform with a truncated 3 ′ end, which likely represented an exon-skipping event. Another example was the thoracic ganglia-specific isoform (annotated as protein lines-like) appeared to skip an exon near the 5 ′ end of the transcript. Complementary information from genome sequences is required to map the exon-intron junctions and determine if the alternative splicing events observed are exon skipping or intron retention. PacBio technology eliminates the need to assemble transcripts from short-read data and allows (long) full-length transcripts to be sequenced in single-molecule reads, providing direct evidence for alternatively spliced isoforms.

Re-use Potential
This study reports the first full-length transcriptome assembly for P. mondon, utilizing PacBio long-read single molecule real-time (SMRT) sequencing technology to obtain full-length transcript sequences. Future transcriptome profiling studies and genome sequencing projects in P. monodon and closely related species will greatly benefit from the full-length Iso-seq based transcriptome assembly reported here and the previously reported Illuminabased transcriptome assembly.

DATA AVAILABILITY STATEMENT
The transcriptome assembly is available from NCBI GenBank database under BioProject ID PRJNA602748, and the transcriptome annotations are provided in Supplementary Table 1. AUTHOR CONTRIBUTIONS NK, TU, and WP conceived and designed the experiment. KS collected shrimp samples and carried out the RNA extraction. CS and TU performed bioinformatics analyses. WP and NK wrote the manuscript. All authors read and approved the final manuscript.

FUNDING
This work was supported by the National Center for Genetic Engineering and Biotechnology (RI grant number P1851724 and Platform grant number P1651718), National Science and Technology Development Agency (grant number P1950419), and partially supported by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement no. 734486.