Methodologies for Transcript Profiling Using Long-Read Technologies

RNA sequencing using next-generation sequencing technologies (NGS) is currently the standard approach for gene expression profiling, particularly for large-scale high-throughput studies. NGS technologies comprise high throughput, cost efficient short-read RNA-Seq, while emerging single molecule, long-read RNA-Seq technologies have enabled new approaches to study the transcriptome and its function. The emerging single molecule, long-read technologies are currently commercially available by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), while new methodologies based on short-read sequencing approaches are also being developed in order to provide long range single molecule level information—for example, the ones represented by the 10x Genomics linked read methodology. The shift toward long-read sequencing technologies for transcriptome characterization is based on current increases in throughput and decreases in cost, making these attractive for de novo transcriptome assembly, isoform expression quantification, and in-depth RNA species analysis. These types of analyses were challenging with standard short sequencing approaches, due to the complex nature of the transcriptome, which consists of variable lengths of transcripts and multiple alternatively spliced isoforms for most genes, as well as the high sequence similarity of highly abundant species of RNA, such as rRNAs. Here we aim to focus on single molecule level sequencing technologies and single-cell technologies that, combined with perturbation tools, allow the analysis of complete RNA species, whether short or long, at high resolution. In parallel, these tools have opened new ways in understanding gene functions at the tissue, network, and pathway levels, as well as their detailed functional characterization. Analysis of the epi-transcriptome, including RNA methylation and modification and the effects of such modifications on biological systems is now enabled through direct RNA sequencing instead of classical indirect approaches. However, many difficulties and challenges remain, such as methodologies to generate full-length RNA or cDNA libraries from all different species of RNAs, not only poly-A containing transcripts, and the identification of allele-specific transcripts due to current error rates of single molecule technologies, while the bioinformatics analysis on long-read data for accurate identification of 5′ and 3′ UTRs is still in development.

1. Applications of long read sequencing technologies 1.1.
Identification of gene fusions, integration with the underlying genomic rearrangements and molecular monitoring of disease progression with long read sequencing Gene fusions are one of the main class of somatic events in cancer genomes that occur when structural chromosomal rearrangements (translocations, inversions, interstitial deletions) brings two separate genes together into a new functioning gene. The first gene fusion (BCR-ABL1) was discovered >50 years ago in chronic myeloid leukemia (Nowell and Hungerford, 1961). Since then and with the advancement of new technologies, several others have been discovered in different type of cancers, including carcinomas of the thyroid, salivary gland, prostate, lung, breast, head and neck, brain, skin, gastrointestinal tract, and kidney (Kumar-Sinha et al., 2015). During these years, the technologies for detection of gene fusions have evolved to include DNA (DNA-Seq) and RNA (RNA-Seq) approaches (Parker and Zhang, 2013). The latter has been widely used for transcriptome profiling and detection of expressed chimeric transcripts (likely driver events) from non-expressed passenger events in diverse solid tumors (Maher et al., 2009). A recent systematic analysis of TCGA RNA-Seq data for detection of gene fusions on 9624 tumors across 33 cancer types revealed a total of 25664 fusions with a 63% validation rate . Moreover, Lee et al ) developed a database of fusion genes that contains 33316 gene pairs by compiling data from well-known public resources (e.g. COSMIC, OMIM, etc.), published papers using text mining approach and systematic computational analysis of RNA-Seq data such as TCGA. Although these studies give us a rough estimation on gene fusion frequency in tumors, most of them are computationally predicted from short-read RNA-Seq data and remain to be confirmed by more sensitive and accurate experimental or sequencing approaches.
Recently, Nattestad et al. combined genomic structural variant discovery from long-read PacBio genomic sequencing with Iso-Seq PacBio full-length cDNA transcriptome sequencing (Nattestad et al., 2018). As the length of the transcripts expressed from the fused genes is unknown, the Iso-Seq approach was used to have a comprehensive picture of cDNA molecules of different sizes. The authors observed both the genomic rearrangements as well as the fused genes that are produced from the closed proximity of the genes in the rearranged genomic fragments (Nattestad et al., 2018). To be confident about the fused regions only the full-length High-Quality Quivered consensus sequences were used supported by at least 5 full length Iso-Seq reads (Nattestad et al., 2018). The authors discovered three gene fusions at the cDNA level and their corresponding genomic rearrangements that can take place in two or three steps (Nattestad et al., 2018). Gene fusions in other highly rearranged cancer genomes could reveal other instances of a complex type of variation. In general, it is very advantageous to combine long read genomic sequencing with long read RNA sequencing as both approaches validate each other (Nattestad et al., 2018).
RNA-seq has been used for molecular monitoring of patients during disease progression. For instance, the BCR-ABL fused gene is a known tyrosine kinase responsible for the chronic myeloid leukemia progression. Tyrosine kinase inhibitors (TKI) halt the progression of the disease until resistant clones appear whose blasts will take over the bone marrow. Early detection of resistant clones is important, as bone marrow depletion or a different pharmacological scheme can be used. Preferably, mutations in both the regulatory and kinase domains as well as co-existing mutations should be detected as early as possible, prior to an expansion of resistant clones. In addition to point mutations, the BCR-ABL1 protein can be affected by alterations in splicing where whole exons, or smaller parts of exons, are included or skipped from the main transcript. Full length transcripts have been used to characterize biologically important compound mutations (Cavelier et al., 2015), along with splice isoforms. The ability to differentiate between independent and compound mutations is a major advantage of this technology. For example, compound mutations have been associated with moderate and high-level resistance to chemotherapeutic drugs ponatinib and rebastinib, respectively (Zabriskie et al., 2014), while the individual mutants are instead sensitive to these substances.
In the example mentioned above, a 10,000X coverage of BCR-ABL1 for each of the presented clinical samples was achieved, leading to increased sensitivity and ability to detect resistant clones months earlier than after applying Sanger sequencing (Cavelier et al., 2015). With the ability to sequence the whole transcript, the method permits us to exclude other BCR-ABL1 mutations as responsible factors for the observed TKI resistance. Furthermore, rare mutations within the regulatory domain of ABL1 have also been reported to lead to TKI resistance in patients without kinase domain mutations (Sherbenou et al., 2010). These rare mutations can be detected and associated with the outcome of the disease.
Gene fusion detection has also been performed on the nanopore platform (Jeck et al., 2019). The authors showed that once the high-quality nanopore reads are used, the platform can detect quickly with high sensitivity and specificity gene fusion events derived from targeted amplification of specific loci even when the tumor fraction is 5-10% (Jeck et al., 2019). The fused genes BCR-ABL1, PAX5-AUTS2, KMT2A-MLLT3, HAS2-PLAG1, EWSR1-FLI1, NCALD-PLAG1, ETV6-NTRK3, SLC34A2 -ROS1, RARA-PML, EWSR1-FLI1, CBFB-MYH11 present in the acute promyelocytic leukemia clinical samples were detected on the ONT platform (Jeck et al., 2019). The fused genes EML4-ALK, KIF5B-RET were also identified from cDNA amplicons produced from clinical samples of lung adenocarcinoma patients (Suzuki et al., 2017). Similar to the mentioned PacBio studies, the authors phased compound mutations in the EGFR gene (Suzuki et al., 2017). They also phased Single Nucleotide Polymorphisms (SNPs) with fused transcripts for example in the case of the EFHD1-UBR3 fusion (Suzuki et al., 2017).

1.2.
Long read sequencing unveils mechanistic aspects of mRNA splicing Long read sequencing can be used to assess mRNA processing events. Mechanisms of mRNA splicing can easily be explored. In case of the mRNA splicing, comparison of the unspliced nascent mRNAs versus spliced ones can be used to derive in vivo splicing kinetics (Oesterreich et al., 2016) . For example, although it is known that transcription and splicing can occur simultaneously and influence one another, knowledge of the rate of splicing in vivo was missing. A technique was initially developed that permitted the cDNA synthesis of unprocessed mRNA molecules based on their properties as presented in Supplementary Figure 1. The technique was named long-read SMIT-seq (Supplementary Figure 7). The authors used PacBio to examine both unspliced and spliced reads of the same gene and figured out that splicing starts when Pol II is 26 nt downstream of a 3'SS and splicing ends when Pol II is 129 nt downstream of the 3'SS and on average splicing is complete within 1.4 seconds after 3' SS synthesis, at least an order of magnitude faster than previous estimates (Oesterreich et al., 2016). This indicates that Pol II and the spliceosome are physically closer during splicing catalysis than previously anticipated and splicing catalysis starts as soon as the 3'SS exits Pol II. The authors concluded that the mechanistic implications of these findings shed new light on the physical and temporal relationship between transcription and splicing catalysis, the relative speed with which the spliceosome assembles, and how opportunities for regulation arise in vivo. In another study the authors used the same technique to assess directly the order of intron removal as can be examined in nascent RNAs from multi-intron transcripts in the fission yeast Schizosaccharomyces pombe relative to the progression of transcription which is marked by nascent RNA 3′ ends (Herzel et al., 2018). The authors observed fully spliced, fully unspliced, and partially spliced nascent RNAs, noticing that ordered intron removal is not strictly enforced (Herzel et al., 2018). The high proportion of multi-intron transcripts that were either fully spliced or fully unspliced, suggested that splicing of any given intron is dependent on the splicing status of other introns in the transcript (Herzel et al., 2018).
The ability to sequence both processed and unprocessed full-length transcripts can give clues to the biogenesis of the processed transcripts. For example, in the case of insect mitochondrial gene transcription it was possible to propose a biogenesis model of mature mRNA and tRNA molecules from polycistronic RNAs. In the single-celled alga Euglena gracilis, the authors sequenced both processed and unprocessed transcripts in order to identify the order of removal of conventional and nonconventional introns for a subset of genes that have both types (Guminska et al., 2018). Based on the abundance of the different intermediates between the unprocessed and the fully processed transcripts, the authors concluded that the conventional introns are removed before the nonconventional introns, that the nonconventional intron removal is quicker than the conventional intron one and that the conventional intron removal does not strictly proceeds collinearly with the progress of transcription but all possible combinations of occurrence of conventional introns can be noticed in the sequenced transcripts (Guminska et al., 2018). Additionally, they noticed that there is a lag in the processing time between conventional and non-conventional intron removal indicating that the conventional intron removal might take place co-transcriptionally whereas the non-conventional intron removal might take place post-transcriptionally (Guminska et al., 2018).

Long read sequencing unveils mechanistic aspects of mRNA degradation
Canonical mRNA decay involves deadenylation followed by 3′-to-5′ exonucleolysis by the exosome and decapping followed by 5′-to-3′ exonucleolysis by XRN1 (Schoenberg and Maquat, 2012). The 5' degradation pattern is conceptually appealing because the mRNA that undergoes 5′ -to-3′ decay will not be able to reinitiate translation while the last protein molecule that it produces will be full length. Much less clear, however, is what happens to translated mRNAs when degradation occurs in the 3′ -to-5′ direction. The authors (Ibrahim et al., 2018) Figure 1). The authors found that the capped RNAs, are co-translationally degraded from the 3′ end because these fragments were found in translating ribosomes. This indicates that canonical human mRNAs undergo repeated co-translational and ribosome-phased endonucleolytic cuts at the exit site of the mRNA ribosome channel, in a process called "ribothrypsis" (Ibrahim et al., 2018). We note here that the authors based their conclusions on a modification of the Akron-SMRT method, the Akron3-seq method which is similar to the Akron-SMRT but permits the sequencing of libraries on a short-read sequencing platform. The authors argued that due to the high sequencing depth of the short-read sequencing platforms, the Akron3-seq method permitted them a whole-transcriptome analysis, in contrast to the Akron-SMRT, which interrogated a few targeted genes.

Allele specific sequencing
The unambiguous mapping of long reads permit the identification of allele specific expression (Sun et al., 2018). The authors investigated the effects of cis-regulatory divergence on RNA decay in an F1 hybrid mouse and quantified allele-specific differences in RNA decay rates. They identified 621 genes exhibiting significant cis-divergence, contributed by genetic variants affecting microRNA binding and RNA secondary structures, which are features that affect RNA splicing, RNA polyadenylation, RNA stability and RNA degradation (Bevilacqua et al., 2016;Duchaine and Fabian, 2019). The authors used PacBio to validate the observations based on the Illumina data. Using the PacBio RS system, they deep-sequenced the RT-PCR products using primers targeted at the regions with no sequence variants between the two alleles. The longer read length allowed the assignment of the PacBio reads to the parental alleles by utilizing CCS reads without any ambiguity. Allelic ratios of the read counts were then compared between the two time points. The allelic decay rates were closely correlated with those determined using the Illumina approach.
In the case of the ONT platform and the direct RNA sequencing platform, allele specific expression detection has been explored from raw read data (Workman et al., 2018b). Due to the high error rate of the reads, allele specific analysis was only possible in well characterized genomes like the GM12878. Raw reads that contain at least two variants known to be heterozygous in GM12878 were annotated as maternal, paternal or unassigned. To reduce the chances of a false positive from nanopore sequencing errors as well as to avoid positions with systematic errors, the authors examined only genes where, greater than 75% of the reads that contain at least two variants, agreed on the parental allele of origin. The authors found 492 genes that showed allele specific gene expression as well as 34 genes with allele specific isoform expression out of 2,917 genes with at least 10 haplotype informative reads. These numbers are likely an underestimation of the true number of allele specific transcripts, as only deeper sequencing will allow the analysis of low abundance transcripts.
Although not directly relevant to allele specific sequencing but given that identification of SNPs is important before phasing, the long-read sequencing platforms have been used to identify SNPs corresponding to RNA editing events. By comparing mitochondrial genomic DNA assemblies and the corresponding PacBio generated mitochondrial transcripts, RNA editing was observed in select species of Leucaena (Kovar et al., 2018).

Transcriptome annotation in the presence or absence of a reference genome
The long-read sequencing technologies have been extensively used in characterizing genes and isoforms across multiple organisms. As no assembly is required, but rather multiple rounds of error correction, the gene and isoform models are reliable enough and can be used without the necessity the corresponding genomes to be either well characterized or even present. Novel genes, novel isoforms, corrected gene and isoform models, novel alternative splicing sites and transcription start sites as well as alternative polyadenylation sites are reported in all these studies. The long-read RNA-seq methodologies have been used across different organisms and some examples are presented in Supplementary Table 2. Databases have been generated, hosting Iso-Seq data along with an in-depth analysis of the datasets and visualization of the full-length transcript isoforms (Xie et al., 2018).
Studies that use the PacBio Iso-Seq approach to characterize de novo transcripts are usually correcting and expanding on a combination of Illumina short-read RNA-seq assemblies and predicted gene models. Novel genes are frequently reported. For example, in Oryctolagus cuniculus long read RNA-seq, 23% of genic loci have not been annotated before . Similarly, in the Sorghum bicolor transcriptome over 2,100 novel genes are reported (Abdel-Ghany et al., 2016).
The main advantage of the long-read platforms is the accurate characterization of isoforms without employing any statistical prediction approaches. In the short-read sequencing platforms although most transcript reconstruction algorithms achieve good performance at assigning the corresponding exons to the individual genes, accuracy is a lot lower for assembly of complete transcripts (Steijger et al., 2013). Indeed, in Oryctolagus cuniculus long read RNA-seq, 66% of the observed isoforms were not annotated before  whereas in the Sorghum bicolor transcriptome the corresponding number is 40.7% (Abdel-Ghany et al., 2016).

Accurate determination of Transcription Start Sites and Transcription End Sites
An advantage of the long-read platforms is the accurate determination of alternative Transcription Start Sites (TSS) or alternative Termination End Sites (TES). In the standard short-read Illumina RNA-seq the coverage at the end of the individual molecules drops. The result of this is that the TSS and the TES of low abundant transcripts is not demarcated properly. The exact position of the TES is more problematic than the TSS as due to the paired end nature of sequencing of the cDNA fragments, there is no guarantee that all the read up to the polyA tail will be sequenced. Nevertheless, there are techniques that can target specifically only the 5' end (NanoCAGE (Salimullah et al., 2011)) or the 3' end (PAS-seq (Shepard et al., 2011)) of the cDNA molecules.
In the Oryctolagus cuniculus long read RNA-seq, 11,184 alternative polyadenylation sites were detected in 3,492 genes  whereas in the Sorghum bicolor long read RNA-seq 11,013 polyadenylation sites were detected in 14,550 genes (Abdel-Ghany et al., 2016). In moso bamboo long read RNA-seq 25,069 polyadenylation sites were detected in 11,450 genes and 6,311 of these genes had more than one polyadenylation site .
In another study the authors identified the 5′ and 3′ ends of bacterial operons by specifically sequencing full length unprocessed polycistronic transcripts through a methodology called SMRT-Cappable-seq that enriches full-length primary transcripts with triphosphorylated 5' ends and hydroxylated 3' ends (Yan et al., 2018) (Supplementary Figure 1). For 1/3 of the detected operons the authors revised the genes involved by adding at least one more gene (Yan et al., 2018).

Error correcting long read sequencing data for genome annotation
Ideally, characterizing the transcriptome with a long-read sequencing platform is the optimal approach. Data from short-read sequencing constitute often part of the analysis pipeline, as longread sequencing does not reach enough depth for accurate consensus based basecalling of low abundance transcripts. In this case the long reads are error corrected with the short reads. For example, in Oryctolagus cuniculus long read RNA-seq, 86.2% of the constructed isoform sequences contained erroneous fragments that were corrected with the short read sequences .
Error correction of long reads can be performed with either short read sequencing reads like the FMLRC method , the HySeMaFi method (Ning et al., 2017) or with a reference genome like in the case of TranscriptClean (Wyman and Mortazavi, 2019). Short-and long-read cDNA sequencing data can be combined together with protein evidence, and ab initio prediction algorithms to generate accurate genome annotations as has been implement in the LoReAn package (Cook et al., 2019).
Additionally, feature classification can be used to increase the confidence in the identified isoforms. In this case, in order to assert that the exon structure is the correct one, machine learning approaches can be used that can classify the non-highly confident isoforms and discriminate artifacts from true novel transcripts based on a set of features. For example, in the SQANTI software (Tardaguila et al., 2018) a Random Forest classifier employs different features to classify the transcripts into highly confident ones, based on both short read and long read sequencing data.
In another case the PLEX (Li et al., 2014) software has been used to classify transcripts generated with the error prone long read technologies into potential long non-coding RNAs or messenger RNAs in the absence of genomic sequences or annotations. In this case a support vector machine algorithm has been trained on calibrated k-mer frequencies derived from known long non-coding RNAs and messenger RNAs (Li et al., 2014). Additionally, the CPAT tool (Wang et al., 2013) uses a logistic regression model, trained with four difference sequence features, to classify coding and noncoding transcripts.

1.8.
Complementing short read data gene model assemblies with long read data Gene models assembled from short-read data can be validated with transcripts sequenced on longread sequencing platforms. In one study the authors identified, in the marine syllid polychaete Odontosyllis undecimdonta, four transcripts from Illumina assembled transcript models as well as from ONT MinION cDNA transcripts, that matched Mass Spectrometry data corresponding to potential luciferase coding genes (Schultz et al., 2018). Through cloning, protein expression, protein purification and luciferin assays they validated that only two of them were coding for functional luciferase proteins (Schultz et al., 2018). In another study, long read cDNA sequencing was used to validate protein isoforms of industrially relevant hydrolases that were identified through haplotype reconstruction in the samples of a metatranscriptome from a rumen microbial community (Nicholls et al., 2017).

Long read sequencing of highly similar genes
Short-read RNA-seq of paralogous genes makes both their de-novo assembly and their unique genomic assignment, problematic. On the contrary, long read RNA-seq of paralogous genes permits their assignment in the correct genomic positions. For example, in Oryctolagus cuniculus the gene structure of nine paralogous genes of the Major Histocompatibility Complex (MHC) were well recovered by PacBio transcripts in comparison with reference annotation . Assembled transcripts from short reads by genome-guided methods (Cufflinks (Trapnell et al., 2010)) performed adequately in eight of these MHC paralogous genes to identify the gene structure but some of the observed PacBio isoforms were missing . On the contrary, the assembled transcripts of the MHC locus using de novo approach were poorly assembled showing fragmented and confusing gene models . The PacBio CCS reads have been used to characterize novel alleles from full-length MHC class I cDNA sequences expressed by a cohort of cynomolgus macaques (Westbrook et al., 2015;Karl et al., 2017). A similar work with PacBio CCS cDNA reads was done to characterize novel KIR alleles expressed by a cohort of 30 cynomolgus macaques (Prall et al., 2017).
In another study, the authors used nanopore cDNA sequencing to find the expression levels of the different alleles expressed from the paralogous loci of the NOTCH2NL gene (Fiddes et al., 2018) using a targeted cDNA capturing approach. The NOTCH2NL gene is expressed in radial glial and is responsible for delaying differentiation of neuronal progenitors into cortical neurons (Fiddes et al., 2018). Each of these alleles can produce distinct protein or protein-abundance variants with a different functional effect on the activation of the NOTCH2 receptor (Fiddes et al., 2018). Similarly, PacBio was also used to assess the expression levels of paralogous genes (>99% identity between them) from 19 gene families that are present in human-specific segmental duplications and that are having a brain specific gene expression pattern . The IsoCon (Sahlin et al., 2018) software was developed to assign separate transcripts into putative gene copies and to derive copy-specific exon sequences and splice variants from multigene families where the individual gene copies can have in some cases a high sequence identity (up to 99.99%). In this situation teasing out sequencing errors from true variants can be difficult and the use of a reference genome for correction might not be effective as the variability of gene copies might not be reliably captured by the reference (Sahlin et al., 2018).

Correcting misannotation with long-read sequencing data
The long-reads can also overlap exons from adjacent annotated genes. This can either reflect misannotations or represent read-through transcripts. For example, 3.4% of the unique transcripts identified in the Sorghum bicolor long-read RNA-seq overlap two adjacent annotated genes (Abdel-Ghany et al., 2016). We can assume that the misannotations can derive from exonic regions of low abundant genes which are partially covered from short-reads or from long genes whose partial assembly from short-read data failed to produce the full-length transcript. These gene fragments can be directly connected with just a few long-read sequenced molecules because there is no stochastic sampling of the cDNA fragments as the cDNA molecules are full length. In the case of the HSV-1 (Tombacz et al., 2017;Depledge et al., 2018), the long-read RNA-seq revealed a greater complexity of the viral transcriptome as the authors observed spliced HSV-1 read-through transcripts that encode a cryptic class of novel protein fusions which provide evidence for disruption of transcription termination for a number of viral transcription units. In another study, the authors used Iso-Seq to improve the gene annotation of Anopheles stephensi and identified 6 trans-splicing events (Jiang et al., 2017).
1.11. Characterization of long non-coding RNAs and anti-sense transcripts with longread sequencing LncRNA gene annotations remain incomplete and poorly characterized (Derrien et al., 2012) due to their low steady state levels (Derrien et al., 2012) which results in partial gene structures and lack of terminal exons or splice junctions between adjacent exons, as the sequencing coverage is usually low (Steijger et al., 2013). Accurate gene annotation requires the use of high-confidence transcriptomic evidence, such as sequencing of full-length cDNA or large cDNA fragments derived from 5' and 3' RACE (Lagarde et al., 2016). SMRT sequencing has been used in combination with targeted RNA capture (RNA Capture Long Seq) (Lagarde et al., 2017). This method captures full length unfragmented cDNAs and enriches the low abundant full-length transcripts of long-noncoding RNAs. In another study, novel transcripts and long non-coding RNAs were also observed in PacBio data of human mitochondria (Gao et al., 2018b).
Long genes are more difficult to reconstruct using transcriptome assembly by short-read technologies. In the maize long-read RNA-seq, 867 novel high-confidence lncRNAs were identified that had a much longer mean length than those identified by Illumina short-read sequencing assembly (Wang et al., 2016). Additionally, as the transcripts are sequenced full length, the long-read platforms permit the unambiguous assignment of noncoding status. For example in the Red Clover transcriptome characterization, the authors identified 4,333 lncRNAs compared with 11 lncRNAs present on the reference genome (Chao et al., 2018b).
In another study the authors used PacBio SMRT sequencing to identify the exon structure of natural antisense transcripts in the moso bamboo seedlings in the presence or absence of exogenous gibberellins treatment . These transcripts were further validated with strand specific short-read RNA-seq .
1.12. Gene expression quantification and differential expression using long-read sequencing data Short-read RNA-seq can be used not only to polish the long-read RNA-seq but also to complement it. For example, the short-read data can be used to quantify differentially expressed exons and genes between the different conditions in a given sample. Afterwards, the long-reads can be used to identify the precise differential expressed isoform models for the genes that showed the differential splicing or the differential expression pattern. This approach was followed in order to find differentially expressed isoforms in Arabidopsis thaliana in response to 6h and 48h of abscisic acid (ABA) treatment (Zhu et al., 2017).
A few studies have performed differential gene expression analysis with the long-read sequencing data. In one study the authors performed differential gene expression analysis based on ONT direct RNA sequencing data comparing the two metabolic stages of diauxic growth of Saccharomyces cerevisiae, namely the respiro-fermentative growth on glucose and the oxidative growth on ethanol (Jenjaroenpun et al., 2018). Bayega et al. (Bayega et al., 2018) performed differential gene expression analysis based on cDNA sequencing, between the 1 hour interval stages of the first 6 hours of the Olive fruit fly (Bactrocera oleae) embryonic development. Differential expression from only the long-read data has also been performed for example to identify differentially expressed genes in response to drought treatment in Sorghum bicolor (Abdel-Ghany et al., 2016).
In another study the authors sequenced full length cDNA from Chronic Lymphocytic Leukemia samples with and without mutations of the splicing factor SF3B1 . As the mutated splicing factor is responsible for aberrant splicing patterns, the authors identified that the mutant splicing factor SF3B1 K700E induces alternative upstream 3' splice sites on certain transcripts as well as a reduction of isoforms that showed intron retention .
Iso-Seq has been used to characterize the different isoforms in comparative transcriptome analyses of human and rhesus macaque cerebellum which led to the identification of lineage specific isoforms in humans . PacBio was also used to quantify the abundance of L1 retrotransposon elements, in comparison to the abundance of L1-related sequences that are cotranscribed within genes, through a 5' RACE assay (Deininger et al., 2017).
Finally, the PRAPI pipeline can be used to characterize alternative transcription initiation, alternative splicing, alternative cleavage and polyadenylation from long-read sequencing data or from a combination of long-read and short-read sequencing data (Gao et al., 2018c). It can also be used to annotate antisense transcripts, novel genes, to correct mis-annotated ones as well as to perform differential expression analysis of all the previous categories across different samples (Gao et al., 2018c).

Targeted long read sequencing
The least expensive way of performing full length cDNA sequencing is through targeted fulllength amplification of the cDNA locus of interest using long amplification PCR. For example, on the PacBio platform, the human FMR1 (Pretto et al., 2015) and the Neurexin-1-alpha (Schreiner et al., 2014;Treutlein et al., 2014) cDNAs have been specifically targeted for SMRT sequencing.
In another study, cDNA amplicons of gluten genes were individually barcoded, pooled and sequenced with SMRT sequencing . This permitted the characterization and phylogenetic reconstruction of gluten gene families across 10 wheat cultivars, even without any draft genome sequence available .
Several eukaryotic genes can encode hundreds to thousands of isoforms. This can be accomplished through a combination of constitutive exons and exon clusters. The exons within each cluster are spliced in a mutually exclusive manner leading to a large number of alternatively spliced isoforms (Yang et al., 2011). Assembling these isoforms from short read data is usually unsuccessful as the variable exon clusters are located far from one another on the mRNA sequence and the exons within each cluster can be up to 80% identical to one another at the nucleotide level (Bolisetty et al., 2015). In Drosophila (Brown et al., 2014), example of genes with complex isoforms are the Dscam1, MRP, Mhc, and Rdl genes. For these Drosophila genes, the ONT MinION platform has been used to characterize the type and abundance of the complex isoforms (Bolisetty et al., 2015). In human, examples of genes with complex isoforms are the Voltage Gated Calcium Channels (VGCCs). Alternative spliced isoforms encode functionally different VGCCs (Raj and Blencowe, 2015) and therefore only full length cDNA sequencing can identify the entire protein coding sequence of these isoforms. One of these genes is the CACNA1C gene, that consists of at least 50 known exons. The ONT MinION was used to sequence long cDNA amplicons and the authors identified 90 isoforms from which 83 were novel .
SMRT sequencing has been used to sequence a FANCB aberrant transcript containing exon 3 duplication which is predicted to introduce a stop codon in the FANCB protein and was identified in a Fanconi anemia patient with a mild phenotype (Asur et al., 2018).
A similar cDNA amplicon study with the ONT MinION was performed with the human PKD1 gene (Lea et al., 2018), a gene responsible for the major form of autosomal dominant polycystic kidney disease. For this gene, the authors observed that the gene is differentially spliced over introns 21 and 22 and by applying long read sequencing were able to study the different isoforms resulting from combinations of exons 20-24 (Lea et al., 2018). The same platform was used to sequence cDNA amplicons from the ABCA7 gene whose Premature Termination Codon (PTC) mutations have been identified as an intermediate-to-high penetrant risk factor for late-onset Alzheimer's disease (De Roeck et al., 2017). These PTC mutations can frequently lead to the nonsense mediated decay of the isoforms that carry the exon with the corresponding mutation (Hillman et al., 2004). The authors screened early onset Alzheimer disease patients for mutations on the ABCA7 gene in order to identify the exons that carry novel PTC mutations and identified isoforms that exclude the exon with the PTC mutation and thus rescue the transcript from nonsense mediated decay (De Roeck et al., 2017). They also measured the varied expression levels of the transcripts that carry exons with the PTC mutation and identified the sequence of truncated proteins (De Roeck et al., 2017).
In another study the authors used ONT MinION to sequence long range PCR generated cDNA amplicons (~6kb) of the BRCA1 transcripts in order to resolve the exon structure of the whole transcript which enabled them to predict in-frame and out-of-frame coding events (de Jong et al., 2017). This is important for interpreting the clinical significance of spliceogenic variants and to identify mRNA splicing changes that are expected to disrupt protein function either through truncation or in-frame deletion of important regions of the encoded proteins (de Jong et al., 2017).

Characterization of the transcriptome of RNA and DNA viruses
In general the viral transcriptomes are usually highly complex showing embedded RNAs, complex transcripts (Prazsak et al., 2018), bicistronic and polycistronic transcripts, transcript isoforms as well as an extended meshwork of overlaps between the transcripts (Tombacz et al., 2016;Moldovan et al., 2017b;Prazsak et al., 2018). Compared to the short-read sequencing platforms, the long-read sequencing ones are the ideal technology to characterize in greater detail and accuracy all this complexity which can provide useful information to target transcript candidates for controlling the viral replication and propagation.
Indeed, all the available SMRT and nanopore library preparation methods have been used to characterize the viral RNA transcriptomes. For example, the non-amplified and amplified SMRT methods, the PacBio Iso-Seq protocol, the Nanopore full-length cDNA-sequencing, the Nanopore direct RNA-sequencing, as well as the Nanopore cDNA-sequencing on 5′ Cap-selected samples for both oligo (d)T and random primed reverse transcribed samples have been used either exclusively or in combination  to characterize the viral transcripts. Transcriptomes from a large number of viruses have been extensively characterized and some examples are presented in Supplementary Table 2.
The transcriptome of the viruses has been characterized through either targeted cDNA amplification of the viral transcripts or through RNA-seq or direct RNA-seq of the infected viral host transcripts for either the lytic or the latent phase of the viral life cycle. In all the cases except the targeted amplification, the sequenced reads are aligning in both the host genome and the genome of the virus if known. The majority of studies identified novel RNA molecules, alternatively transcribed and processed transcripts, coding and non-coding RNAs as well as they permitted to distinguish between transcript isoforms including splice and length variants. In a lot of cases the studies are uncovering a very complex transcriptome. For example, the Vaccinia virus (VACV) has a unique form of gene regulation where the transcriptional overlaps generated by the read-through mechanisms are very frequent. Additionally, the transcription patterns of VACV genes exhibit an increased stochasticity, which includes a large number of transcriptional start sites (TSSs) and transcription end sites (TESs) even within the open reading frames (ORFs) . Therefore, the use of full-length sequencing methods is important to identify the transcript ends with relative base pair precision . These alternative TESs can potentially influence RNA metabolism by allowing or preventing the binding of microRNAs to the viral RNA (Barth et al., 2008) and by facilitating or preventing deadenylation (Dickson et al., 2012).
Long read RNA-seq has also been used to characterize polycistronic mRNAs in the Herpesvirus HSV-1 (Tombacz et al., 2017;Depledge et al., 2018). These viral polycistronic transcripts are different from the prokaryotic polycistronic transcripts because the polycistronic transcripts overlap with each other and the same viral gene can be expressed from more than one polycistronic transcripts. These overlapping polycistronic architecture can be achieved by a common poly-A signal and varying transcription start sites that are controlled by distinct promoters. The authors (Tombacz et al., 2017;Depledge et al., 2018) report new transcription initiation sites that produce mRNAs encoding novel or alternative ORFs. Transcription initiation sites are critical for productive gene expression as their location relative to the translation initiation site determines the length and composition of the 5′ UTR of mRNAs, which can have profound effects on translation efficiency.
Multi-platform data integration is also important as the same sample can result in different read length distributions with the longest reads derived from direct RNA sequencing followed by the PacBio Sequel and ONT platforms depending on the library preparation methods (Boldogkői et al., 2018b). The method can also be used to infer the different viral clones present in the genome of the host cell. For example, for the Porcine Endogenous Retrovirus (PERV) the authors uncovered several intronic regions in the transcriptome data after mapping them back on the genome . The authors confirmed that these introns were not alternative spliced products but rather deleted genomic segments indicating the existence of at least four different PERV clones in the genome of the host PK-15 cells . For the Varicella Zoster Virus (VZV), the authors defined a new class of transcripts the nroRNA that are transcripts encoded by the genomic region located in close vicinity to the viral replication origin which suggests an interference between the replication and transcription machineries (Prazsak et al., 2018). A similar case was observed in Pseudorabies Virus (PRV) after profiling the transcriptome with PacBio (Tombacz et al., 2015). For the VZV virus, the complex meshwork of transcriptional read-throughs can regulate gene expression through a transcriptional interference mechanism (Boldogkoi, 2012;Prazsak et al., 2018). Additionally, the authors detected RNA editing in a novel non-coding RNA molecule from the nanopore sequenced reads (Prazsak et al., 2018). Furthermore, in the VZV virus, additional to the lytic phase transcriptome, the latent phase transcriptome was characterized, and it was showed that the three genes that are expressed during the latent phase can occasionally form a single transcriptional unit (Prazsak et al., 2018). ONT and PacBio-based studies have also detected a number of embedded transcripts with in-frame truncated ORFs in several herpesviruses (Prazsak et al., 2018) coding for potentially truncated polypeptides.
In the case of the VZV virus, four complex transcripts, which are multigenic RNAs that contain one or more genes in opposite  orientations, have been characterized (Prazsak et al., 2018).
The cDNA fragments derived from 5' rapid amplification of cDNA ends (5'-RACE) from the transcriptome of the Mouse Papillomavirus type 1 have been sequenced with the PacBio Iso-Seq method in order to map the viral transcription start sites (Xue et al., 2017). In another cases, the PacBio Iso-Seq was used to sequence cDNAs derived from total RNA of Porcine Circovirus type 1 infected cells that was reversed transcribed with either oligo (dT) primers or random hexamer primers (Moldovan et al., 2017a).
Lastly for a more general review of the application of long read sequencing technologies on the transcriptomic analysis of complex viral genomes see Depledge et al.(Depledge et al., 2019)

Characterizing the genome of RNA viruses
Although not a transcriptome, the genome of RNA viruses can be sequenced with the same methods. From relative pure samples, RNA/cDNA hybrids have been sequenced at a reasonable depth, without amplification, on the nanopore platform (Kilianski et al., 2016). For the RNA viruses if the titer is low or enough coverage is required for strain typing and variant identification an enrichment method needs to be followed. In this case either an amplicon or a non-amplicon based approach can be used. The non-amplicon based approaches are using RNA baits on streptavidin beads and are supposed to improve library complexity and uniformity of the sample, thus aiding in the detection of single-nucleotide variants. This method was used for enrichment of the Influenza virus A genome cDNA from the cDNAs of the MDCK cell line (Eckert et al., 2016). In another study, cDNA amplicons from the Zika Virus (ZIKV) were sequenced on the ONT MinION in order to identify the origin and epidemic history of ZIKV in Brazil . Similarly, the ONT MinION sequencing of multiplexed cDNA amplicons, has been used to measure the diversity of the West Nile virus (Grubaugh et al., 2019) and the defective viral emergence in the Flock House virus (Jaworski and Routh, 2017). MinION has been used to sequence the genome of multiple RNA viruses and some examples are presented in Supplementary  Table 2.
Metagenomic sequencing allows for identification of multiple RNA pathogens within a sample in a non-targeted and unbiased approach. With this approach the chikungunya and the dengue viruses have been identified in total RNA extracted from patient samples and permitted the identification of co-infections (Kafetzopoulou et al., 2018) as well as it has the potential to identify novel pathogens. PacBio was used to sequence the genome of the Hepatitis C virus at near full length which permitted the identification of viral quasispecies in clinical samples after phasing the observed variants (Bull et al., 2016). As a starting material either an enriched polyA+ fraction is used or total RNA with the equivalent amount of polyA+ RNA (1 -5% of the total RNA is polyA+). If the reverse transcription of the full-length cDNA synthesis protocol is followed with cDNA amplification, then ~1 -6 ngs polyA+ RNA are used as a starting material or the equivalent 50 -300 ngs of total RNA. If no cDNA amplification is followed, then ~100 ng polyA+ RNA are needed. Yes. Depending on the sample preparation and flow cell loading procedure followed, the length distribution of the sequenced reads does not match the length distribution of the original solution of the cDNA molecules. For example, if the MagAttract loading procedure is followed then cDNA molecules equal or less than 600 bp are not sequenced.
Additionally, if the Iso-Seq sample preparation method is followed then the cDNA is first size fractionated and then the fraction(s) with the enriched sizes of interest are sequenced.   Table 2. Examples of organisms whose transcriptome or genome was sequenced on a long-read sequencing platform.
Supplementary Figure 1 (previous page). Sequence characteristics of RNA molecules or RNA fragments and their full-length cDNA synthesis methods. No RNA or cDNA fragmentation methods are followed in any of the presented cDNA synthesis methods. The dotted cyan lines correspond to RNA fragments produced from the processed eukaryotic mRNA transcripts of from the prokaryotic transcripts. The beginning and end of the dotted purple arrows indicate the same molecules at different places on the graph. The yellow, dark blue, grey short arrows indicate enzymatic reactions responsible for the addition of adaptors sequences on the RNA molecules/fragments from the truncated T4 RNA ligase 2, T4 RNA ligase 1 and RtcB RNA ligase respectively as indicated in the boxes. The black short arrows correspond to enzymatic reactions that are affecting the chemical moieties on the 5' end of RNA molecules/RNA fragments through the action of the Alkaline Phosphatase, Tobacco Acid Pyrophosphatase or T4 polynucleotide kinase as indicated inside the respective boxes. The dotted red lines indicate potential degradation of the fragments, present at the beginning of the arrow, through the action of the Terminator 5´-Phosphate-Dependent Exonuclease. In case that an "enzymatic reaction set" is not part of an already published protocol mentioned in the manuscript, this set is characterized as "Potential protocol". The methods presented here are from the following papers (Peach et al., 2015;Oesterreich et al., 2016;Ibrahim et al., 2018;Yan et al., 2018). The picture corresponds to the consensus accuracy recovered from the SMRT sequencing data. As the raw error rates between the Nanopore cDNA and PacBio cDNA platforms are not significantly different, the base call accuracy versus coverage plot should be approximately similar. We note here that the indicated improvement in the base calling accuracy is due to either the pile up of individually sequenced reads on the same locus or the pile up of the subreads from the PacBio Circular Consensus sequencing and by extrapolation from the R2C2 (Volden et al., 2018) or from the INC-seq (Li et al., 2016) nanopore based methods. The improvement in accuracy does not correspond to joint basecalling methods such as the ONT 2D sequencing protocol or the ONT 1D 2 sequencing protocol. The picture was produced after merging the reported consensus accuracy metrics for the PacBio data from similar plots publicly available on the PacBio website (PacBio_website _1, 2013;PacBio_website _2, 2015). The amplified cDNA sequencing runs are presented as "cDNA_X" and the direct RNA sequencing runs are presented as "dRNA_X" where "X" is the number of a specific sample. The RNA from five different samples (mouse embryonic stem cells) were sequenced on the MinION with both an amplified cDNA and a direct RNA protocol. cDNA and direct RNA runs with the same "X" number correspond to the same sample. For the cDNA runs the sequencing chemistry used is the "SQK-PCS109" whereas for the direct RNA runs it is the "SQK-RNA001". The nanopore version used is the "v9.4.1". In the SQK-PCS109 chemistry the attachment of the sequencing adaptors on the cDNA molecules is performed through a click chemistry reaction.

Supplementary
Supplementary Figure 6. Biological problems addressed with the long-read sequencing platforms relevant to RNA sequencing. The colored boxes present the problem addressed. As indicated in the bottom legend, the different colors correspond to the platform used to address the biological question. The red horizontal lines and mixed red/green horizontal lines correspond to fully processed and unprocessed mRNA molecules respectively. Figure 7 (previous page). Identification of processing patterns with longread sequencing. The long read SMIT-seq method (Oesterreich et al., 2016) is presented here. The aim of this method is to sequence unprocessed and partially processed non poly-adenylated transcripts. The protocol has the following steps as presented in the figure. Initially, cell lysis is followed by chromatin isolation. The nascent RNA is extracted from the chromatin fraction with the acidic Phenol/Chloroform method followed by DNAse I treatment. A 3' end ligation is performed with a truncated T4 RNA ligase 2 that adds a 3′ end linker (orange box) to the 3' end of all the RNA molecules. Non poly-adenylated RNA is obtained after depletion of polyadenylated RNA with oligo-dT coated beads. Then, rRNA removal is performed. The nascent polyA minus RNA is reversed transcribed with a custom primer hybridizing on the linker (cyan arrow). Double-stranded DNA is generated with gene-specific primers (pink arrow) in order to increase sequence read counts for a specific set of genes. Low PCR cycles cDNA amplification and long read sequencing is followed at the end.

Supplementary
Supplementary Figure 8 (previous page). Identification of degradation patterns with longread sequencing. The long-read Akron-SMRT method (Ibrahim et al., 2018) is presented here. The aim of this method is to sequence degradation fragments of RNA molecules and define the exact position of the endonucleolytic cut on the 5' end fragments of the RNA molecules. The method has been used to profile the degradation patterns of the mRNA molecules occurred during their translation on the polysomes. The protocol has the following steps presented in the figure.
Total RNA is extracted from the polysome fraction with Trizol followed by rRNA depletion, snRNA depletion and short fragment removal (<200bp). To sequence the degradation fragments derived from the 5' end of the mRNA molecules, the RNA is treated with Terminator exonuclease to remove non-capped, 5′-P-bearing RNAs. The 3' end of the remaining mRNA fragments is then ligated to a 5'-P and 3′-biotinylated RNA adaptor (green box) with a T4 RNA ligase 1, followed by streptavidin pulldown. cDNA synthesis is performed with a primer that binds on the added adaptor sequence (cyan arrow). Double-stranded DNA is generated with gene-specific primers (pink arrow) in order to increase sequence read counts for a specific set of genes. A universal primer (yellow arrow) is used for cDNA amplification followed by long read sequencing.