Depletion of Hemoglobin Transcripts and Long-Read Sequencing Improves the Transcriptome Annotation of the Polar Bear (Ursus maritimus)

Transcriptome studies evaluating whole blood and tissues are often confounded by overrepresentation of highly abundant transcripts. These abundant transcripts are problematic, as they compete with and prevent the detection of rare RNA transcripts, obscuring their biological importance. This issue is more pronounced when using long-read sequencing technologies for isoform-level transcriptome analysis, as they have relatively lower throughput compared to short-read sequencers. As a result, long-read based transcriptome analysis is prohibitively expensive for non-model organisms. While there are off-the-shelf kits available for select model organisms capable of depleting highly abundant transcripts for alpha (HBA) and beta (HBB) hemoglobin, they are unsuitable for non-model organisms. To address this, we have adapted the recent CRISPR/Cas9-based depletion method (depletion of abundant sequences by hybridization) for long-read full-length cDNA sequencing approaches that we call Long-DASH. Using a recombinant Cas9 protein with appropriate guide RNAs, full-length hemoglobin transcripts can be depleted in vitro prior to performing any short- and long-read sequencing library preparations. Using this method, we sequenced depleted full-length cDNA in parallel using both our Oxford Nanopore Technology (ONT) based R2C2 long-read approach, as well as the Illumina short-read based Smart-seq2 approach. To showcase this, we have applied our methods to create an isoform-level transcriptome from whole blood samples derived from three polar bears (Ursus maritimus). Using Long-DASH, we succeeded in depleting hemoglobin transcripts and generated deep Smart-seq2 Illumina datasets and 3.8 million R2C2 full-length cDNA consensus reads. Applying Long-DASH with our isoform identification pipeline, Mandalorion, we discovered ∼6,000 high-confidence isoforms and a number of novel genes. This indicates that there is a high diversity of gene isoforms within U. maritimus not yet reported. This reproducible and straightforward approach has not only improved the polar bear transcriptome annotations but will serve as the foundation for future efforts to investigate transcriptional dynamics within the 19 polar bear subpopulations around the Arctic.

Transcriptome studies evaluating whole blood and tissues are often confounded by overrepresentation of highly abundant transcripts. These abundant transcripts are problematic, as they compete with and prevent the detection of rare RNA transcripts, obscuring their biological importance. This issue is more pronounced when using longread sequencing technologies for isoform-level transcriptome analysis, as they have relatively lower throughput compared to short-read sequencers. As a result, long-read based transcriptome analysis is prohibitively expensive for non-model organisms. While there are off-the-shelf kits available for select model organisms capable of depleting highly abundant transcripts for alpha (HBA) and beta (HBB) hemoglobin, they are unsuitable for non-model organisms. To address this, we have adapted the recent CRISPR/Cas9-based depletion method (depletion of abundant sequences by hybridization) for long-read fulllength cDNA sequencing approaches that we call Long-DASH. Using a recombinant Cas9 protein with appropriate guide RNAs, full-length hemoglobin transcripts can be depleted in vitro prior to performing any short-and long-read sequencing library preparations. Using this method, we sequenced depleted full-length cDNA in parallel using both our Oxford Nanopore Technology (ONT) based R2C2 long-read approach, as well as the Illumina short-read based Smart-seq2 approach. To showcase this, we have applied our methods to create an isoform-level transcriptome from whole blood samples derived from three polar bears (Ursus maritimus). Using Long-DASH, we succeeded in depleting hemoglobin transcripts and generated deep Smart-seq2 Illumina datasets and 3.8 million R2C2 full-length cDNA consensus reads. Applying Long-DASH with our isoform identification pipeline, Mandalorion, we discovered ~6,000 high-confidence isoforms and a number of novel genes. This indicates that there is a high diversity of gene isoforms within U. maritimus not yet reported. This reproducible and straightforward approach INTRODUCTION Accurate isoform-level differential expression analysis of transcriptomes is essential for interpreting gene regulation under different biological, environmental, or physiological conditions. RNA transcript isoforms-which are often unique among different cell types, tissues, developmental stages, and organisms (Kalsotra et al., 2008;Wang et al., 2008;Zhang et al., 2016)-are defined by the use of alternative transcription start sites (TSSs), polyA sites, and splice sites. Use of alternative isoforms is highly regulated and thought to contribute to cellular and organismal diversification within higher eukaryotes (Graveley, 2001), adaptation, and speciation (Harr and Turner, 2010;Shi et al., 2012) and can also reflect certain disease states (Busslinger et al., 1981;Andreadis, 2005;Ilagan et al., 2015).
To perform this type of differential expression analysis on the isoform level requires both short-and long-read sequencing technology. Short-read RNA-seq provides the read depth necessary for gene expression quantification but requires accurate and exhaustive isoform-level transcriptome annotations for its analysis. However, existing transcriptome annotations of non-model organisms are often incomplete or inaccurate (Ungaro et al., 2017) because they cannot rely on labor-intensive efforts like GENCODE, which are working to exhaustively annotate the isoform-level transcriptomes of human and mouse. While short-read RNA-seq data can itself be used for transcriptome annotation, it fails at annotating transcriptomes on the isoform-level because it cannot recapitulate full-length transcripts. This inability to define full-length transcripts is due to the fragmentation of RNA, or their cDNA copies, prior to sequencing, making it difficult to computationally re-assemble reliably (Grabherr et al., 2011;Bankevich et al., 2012;Pertea et al., 2015). To provide an accurate isoform-level transcriptome annotation for non-model organisms, long-read sequencing technology is required to sequence full-length cDNA molecules.
The ability to perform combined short-and long-read transcriptome analysis on non-model organisms is further complicated by sample availability. In contrast to the organs and tissues of model organisms which can be easily acquired, availability of samples from non-model organisms is often more limited. In rare circumstances, sampling can be performed through fat-and muscle-tissue biopsies (Khudyakov et al., 2017), but the current gold standard still relies on whole blood RNA samples, especially for large non-model organisms (Du et al., 2015). This is particularly true for protected and endangered species (Huang et al., 2016;Hernández-Fernández et al., 2017). While whole blood samples can be easily acquired and provide a wealth of information regarding physiological or disease states in surrounding tissues (Liew et al., 2006), polyadenylated RNA extracted from whole blood can be comprised of >50% hemoglobin transcripts (Mastrokolias et al., 2012;Shin et al., 2014). In any high-throughput sequencing-based assay, these highly abundant transcripts will compete for a limited number of sequencing reads and, as a result, will be sequenced over and over again without generating any new information. This would waste valuable reads which could otherwise detect less abundant transcripts.
Currently, there is no approach to deplete hemoglobin transcripts from whole blood RNA while enabling downstream analysis of the depleted RNA/cDNA with both short-and long-read sequencing. Commercially available hemoglobin depletion kits-including GLOBINclear (Ambion) or Ribo-Zero (Illumina)-are specifically designed for human samples and rely on hemoglobin RNA pull-down methods (Field et al., 2007). Even if they would succeed in depleting hemoglobin from nonmodel organism samples, which is far from guaranteed (Choi et al., 2014), these pull-down approaches use harsh conditions and high temperatures during long incubation steps which contribute to RNA fragmentation and introduce unwanted technical variability (Debey et al., 2004). While fragmented RNA is suitable as input into short-read RNA-seq, it is not suitable for long-read full-length cDNA sequencing.
To perform a comprehensive analysis of non-model organism transcriptomes from whole-blood with short-and long-read technologies, we require a new approach that can deplete highly abundant transcripts like hemoglobin from whole-blood samples of a wide range of organisms without fragmenting transcripts. To this end, we chose to adapt the powerful, recently published DASH (depletion of abundant sequences by hybridization) (Gu et al., 2016) method which utilizes a recombinant Cas9 to perform in vitro depletion using sequence-specific sgRNAs. Our adapted method which we will refer to as Long-DASH also takes advantage of the CRISPR/Cas9 system to selectively deplete hemoglobin HBA and HBB transcripts but targets full-length cDNA instead of fragmented short-read Illumina sequencing libraries like regular DASH. By depleting full-length cDNA prior to any library preparation, this allows the user the choice to use any short-or long-read sequencing platform.
As a proof of concept, we evaluated three hemoglobin-depleted and non-depleted polar bear whole blood transcriptomes using our ONT-based R2C2 (Volden et al., 2018) full-length cDNA sequencing method and an Illumina-based modified Smart-seq2 method. We found that by incorporating Long-DASH, we successfully depleted hemoglobin transcripts without nonspecifically affecting the rest of the cDNA pool. Finally, we generated ~3.8 million ONT-based R2C2 consensus reads, dramatically refining the polar bear transcriptome annotations.
has not only improved the polar bear transcriptome annotations but will serve as the foundation for future efforts to investigate transcriptional dynamics within the 19 polar bear subpopulations around the Arctic.

Sample Collection/RNA Extraction From Whole Blood
Permits for field operations and animal care were provided by the Government of Greenland (permit numbers 2015-110281 and 2017-5446). Polar bear whole blood samples were collected in PAXgene Blood RNA Tubes (PreAnalytiX GmbH, BD Biosciences, Mississauga, ON, Canada). Total RNA was isolated from whole blood (2.5 ml) thawed at room temperature for 2 h prior to using the PAXGene RNA Extraction Kit (Qiagen, Chatsworth, CA, USA) according to manufacturer's protocol. All samples were DNAse (QIAGEN) treated and eluted in 50 μl. The RNA yield and purity were accessed using a NanoDrop 8000 UV Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). RNA quantities ranged from 110 to 310 ng/μl, and the A260/280 ratio values were ≥ 2.0. Human whole blood RNA was purchased from Zyagen Labs (NC1453913).

Full-Length cDNA Generation
RNA was reverse transcribed (RT) using SMARTScribe Reverse Transcriptase (Clonetech). We generated full-length cDNA using a modified Smart-seq2 approach . During the RT reaction, a template-switch oligo and an oligodT primer were used to select for polyA+ RNA ( Table S2). The RT reaction was performed in 10 μl reactions with an input of 70 ng of RNA and took place at 42°C for 1 h. After cDNA synthesis, 1 μl of 1:10 dilutions of RNAse A (Thermo Fisher) and Lambda Exonuclease (NEB) were added and incubated at 37°C for 30 min. Following the incubation, an amplification step was performed in 25-μl final volumes using KAPA HiFi ReadyMix 2X (Kapa Biosystems) containing 1 μl of the ISPCR primer (10 μM) primer. Samples were incubated at 95°C for 3 min, followed by 12 cycles of (98°C for 20 s, 67°C for 15 s, and 72°C for 4 min), with a final extension of 72°C for 5 min. Samples were purified using Agencourt AMPure XP SPRI beads (Beckman Coulter) and eluted at 25 ml. The final cDNA product was then visualized on an agarose gel to confirm distribution (Figure 2).

In Vitro Preparation of CRISPR/Cas9
SpCas9-2xNLS was purified based on the protocol described in (Jinek et al., 2012). Briefly, a plasmid encoding His 6 MBP-SpCas9-2xNLS (Addgene plasmid #69090) was transformed into Rosetta2(DE3) Escherichia coli cells. Cultures were grown at 37°C in 2YT medium with shaking until they reached an OD 600 of ~0.6, and then placed on ice for 5 min before adding IPTG to a final concentration of 0.25 mM; cultures were then grown overnight at 18°C with shaking. Cell pellets were harvested by centrifugation, and then lysed in an AVESTIN cell extruder in Ni-A buffer (20 mM Tris pH 8.0, 500 mM NaCl, 5% vol/vol glycerol, 25 mM imidazole) with EDTA-free protease inhibitors (Pierce). Clarified supernatants were purified by gravity column on Ni-NTA agarose (QIAGEN) using Ni-A buffer to load and wash, and Ni-B buffer (20 mM Tris pH 8.0, 500 mM NaCl, 5% vol/vol glycerol, 250 mM imidazole) to elute. Peak fractions were concentrated in an Amicon Ultra spin concentrator with a 30-kDa molecular weight cut-off at 4°C, and then loaded onto a 50-ml HiPrep Desalting Column (GE Healthcare) preequilibrated with 17% IEX-B (IEX-A buffer: 20 mM HEPES pH 7.5, 150 mM KCl, 5% vol/vol glycerol; IEX-B 20 mM HEPES pH 7.5, 1 M KCl, 5% vol/vol glycerol). The flow through was then loaded onto a 2-ml HiTrap SP column (GE Healthcare) in 17% IEX-B buffer. After thoroughly washing the column in 17% IEX-B, the protein was eluted with a linear gradient from 17% to 50% IEX-B. Peak fractions were pooled and loaded onto a Superdex 200 16/60 column (GE Healthcare) pre-equilibrated in 20 mM HEPES pH 7.5, 150 mM KCl, 1 mM DTT, and 10% vol/vol glycerol. Peak fractions were concentrated in an Amicon Ultra spin concentrator with a 30-kDa molecular weight cut-off at 4°C until a concentration of 40 μM, which was estimated using the calculated molar extinction coefficient of 120,575 M −1 cm −1 . The protein was aliquoted into small volumes (10 μL), quick frozen in liquid nitrogen, and stored at −80°C.

sgRNA Design and Construction
Other studies have shown that sgRNAs designed between 17 and 20 bp showed increased efficacy (Fu et al., 2014;Ren et al., 2014). As a result, the sgRNAs were designed between 17 and 20 bp in length. sgRNAs were designed to target hemoglobin transcripts in human and polar bear. A multi-sequence alignment was performed on the human and polar bear annotated HBA and HBB gene transcripts to find conserved regions using the Clustal Omega tool (Sievers et al., 2011;McWilliam et al., 2013;Li et al., 2015) ( Figure S3). Regions with high homology were chosen for sgRNA design. sgRNAs that did not share complete homology were designed to contain degenerate bases to ensure compatibility across species using the same sgRNA ( Figure S4). sgRNA specificity was determined by using BLAST (Altschul et al., 1990). One sgRNA was designed even though the N-GG (PAM motif) had been lost in the human but was still kept in the pool for the polar bear depletion ( Figure S3). A total of 16 sgRNAs were designed to target HBA and HBB hemoglobin transcripts. The target oligos were then constructed into sgRNAs as previously described (Ren et al., 2014). Single-stranded oligos were designed to contain a T7 promoter attached to each sgRNA sequence (IDT) followed by the first 22 bases of the tracrRNA sequence ( Figure S3). The complementary tracrRNA and singlestranded oligo were annealed and extended to form a dsDNA product containing the T7-sgRNA and tracrRNA templates. The template was then used for in vitro transcription using the HiScribe T7 High Yield RNA Synthesis Kit (NEB). The in vitro transcription reaction was carried out at 37°C for 16 h. The in vitro transcribed RNA was then purified using MEGAclear Transcription Clean-Up Kit (Invitrogen). The final sgRNA product was then checked for purity and quantified using NanoDrop 8000 UV Spectrophotometer (Thermo Fisher). All sgRNAs were then pooled at equal molar concentrations and stored in single-use aliquots at −80°C.

CRISPR/Cas9 Treatment
Since it has been predicted that human whole blood samples can contain up to ~50-80% of hemoglobin transcripts of the total sample (Field et al., 2007;Mastrokolias et al., 2012), we calculated the ratio of sgRNA and Cas9 molar amounts to sample based upon this assumption. According to the DASH protocol, it was determined that 150-fold of Cas9 and 1,500-fold of sgRNA should be sufficient (Gu et al., 2016). All cDNA samples were quantified by qubit using the dsDNA HS Assay Kit (Thermo Fisher) to calculate the molar amounts. To calculate the expected molar amounts we use the following formula:

anscript in bp)
Once the molar amounts were determined, the ribonucleoprotein (RNP) complex was formed by adding the 150fold Cas9 and 1,500-fold sgRNA excess amount with 1.0 μl of the 10X Cas9 Buffer (final concentration 50 mM Tris pH 8.0, 100 mM NaCl, 10 mM MgCl 2 , and 1 mM TCEP) and incubated for 25°C for 10 min. Following the 25°C incubation, the calculated cDNA amount was then added (final volume of 10 μl) and incubated at 37°C for 4 h to overnight. After the Cas9 depletion, 1 μl of proteinase K and RNAse A were added to inactivate the Cas9 and remove excess sgRNAs from the reaction and incubated at 37°C for 15 min and 95°C for 15 min. It is critical that the proteinase K is deactivated properly as the samples are immediately used for amplification. Treated samples were PCR amplified [95°C for 3 min, followed by 13 cycles of (98°C for 20 s, 67°C for 15 s, and 72°C for 4 min) followed by a final extension of 72°C for 5 min]. PCR was performed using KAPA HiFi ReadyMix 2X (Kapa Biosystems) and 1 μl of the (10 μM) ISPCR primer. The amplified product was then purified using SPRI beads to remove everything below 500 bp. Selecting against cDNA below 500 bp ensured that all cut hemoglobin products were removed before making the Tn5 libraries. The depleted cDNA product was visualized on a 1-2% agarose gel to confirm depletion. Once confirmed, the depleted cDNA product was then prepped for either Illumina or Nanopore sequencing.

R2C2 Library Preparation and ONT Sequencing
To prepare R2C2 libraries, ~30 ng of the depleted cDNA was used. The R2C2 libraries were made as previously described (Volden et al., 2018). Briefly, an equal concentration of splint to cDNA were combined [30 ng of depleted cDNA and 30 ng of our (~200 bp) DNA splint]. The full-length cDNA was then circularized using the 2X NEBuilder HiFi DNA Assembly Mix (NEB). The reaction took place at 50°C for 1 h per manufacturer protocol. Once the full-length cDNA was circularized, linear ssDNA and dsDNA were digested by adding 3 μl each of Lambda Exonuclease, Exonuclease I, and Exonuclease III (all NEB) and incubated at 37°C overnight. We performed the longer incubation overnight to ensure complete digestion. After digestion, the sample was further purified using SPRI beads and eluted in 30 μl of ultrapure water. Thirty microliters of sample was then split into three reactions containing 10 μl each for the Phi29 amplification. The Phi29 amplification took place in a reaction volume of 50 μl containing 5 μl of 10X Buffer, 2.5 μl of 10 μM each dNTPs, 2.5 μl of random hexamers (10 μM), 29 μl of ultrapure water, and 1 μl of Phi29 polymerase. The Phi29 reactions were incubated at 30°C for 16 h and 65°C for 15 min and held at 4°C. All three samples were pooled together, and ultrapure water was added to make up the final volume to 300 μl. The product was purified using SPRI beads with a 1:0.5 sampleto-bead ratio. This ratio was chosen as it removed all fragments <2,000 kb. The sample was then eluted in 90 μl of ultrapure H20, 10 μl of NEB2 Buffer (NEB), and 3 μl of T7 endonuclease (NEB) and incubated at 37°C for 2 h to ensure complete debranching of the Phi29 product. The eluted sample was again purified using SPRI beads with a 1:0.5 sample-to-bead ratio. The product was eluted in 30 μl and quantified using Qubit dsDNA HS Kit (Thermo Fisher). The length distribution was verified on a 1% agarose gel prior to performing the ONT library prep.
For the library preparation, ~1-2 μg of the final R2C2 product was converted into a ONT compatible library using the SQK-LSK109 kit according to ONT instructions with minor modifications. First, during the end repair and A-tailing reaction, we performed incubations for 30 min each at 20°C and 65°C instead of 5 min each. Second, we adjusted the ligation reaction time to 30 min at room temperature instead of 10 min per the protocol. We also found that loading between ~200 and 500 ng of the final library onto the flowcell was the most optimal. Loading more library resulted in severe loss in throughput as can be seen for the R2C2 runs PB3_depleted_R1 and PB19_depleted_R1 (Table S2). R2C2 libraries were sequenced on a MinION device using the 48-h sequencing protocol using the FLO-Min106 R9.4 Rev D chemistry flowcells. All reads were basecalled with Albacore v2.1.3.

Smart-seq2 Library Preparation and Illumina Sequencing
Illumina libraries of the depleted and non-depleted samples were prepared using a tagmentation-based method using our own Tn5 (Picelli et al., 2014). The Tn5 enzyme was custom loaded with Tn5ME-A/R and Tn5ME-B/R adapters ( Table S2). The Tn5 reaction contained 5 μl of the full-length cDNA product, 1 μl of the loaded Tn5 enzyme, 10 μl of ultrapure water, and 4 μl of the 5X TAPS-PEG buffer and incubated at 55°C for 7 min. After incubation, 5 μl of 0.2% of sodium dodecyl sulfate (SDS) was added to the product to inactivate the Tn5 enzyme. Due to the Tn5-generating gaps, 5 μl of the Tn5 product had to be nick translated at 72°C for 5 min. The Tn5 product was then amplified using KAPA HiFi Polymerase (KAPA) with 10 cycles of PCR using (98°C for 10 s, 63°C for 30 s, 72°C for 2 min) with a final extension at 72°C for 5 min. The final reaction volume was 25 μl and contained 0.5 μl KAPA HiFi Polymerase (KAPA), 5 μl of 5X Buffer, 0.8 μl of dNTPs (10 mM each), 11.7 μl of ultrapure water, 5 μl of the nick-translated product, and 1 μl each of Nextera_Primer_A and Nextera_Primer_B primers ( Table S2). The amplified Tn5 libraries were then size selected from 300 to 800 bp on a 2% EX E-gel (Thermo Fisher) and purified using QIAquick Gel Extraction Kit (Qiagen). The libraries were then pooled at equal concentrations and ran on a HiSeq X 2×151 bp run.

R2C2 Read Processing and Isoform Analysis
R2C2 consensus reads were generated from raw reads using the C3POa pipeline (https://github.com/rvolden/C3POa). C3POa identifies subreads in the raw reads and uses poaV2 (Lee et al., 2002) and racon (Vaser et al., 2017) to determine a more accurate consensus of these subreads. The consensus reads were then aligned to the polar bear genome (Liu et al., 2014) using minimap2 (Li, 2018) using standard setting and the -ax splice'flag. The resulting SAM files are converted to PSL files using SAMtools (Li et al., 2009) and jvarkit SamToPsl utility (Lindenbaum, 2015).
The resulting PSL, SAM, and FASTA files of all depleted samples were merged and used as input into the Mandalorion (https://github.com/rvolden/Mandalorion-Episode-II) pipeline to determine isoforms. To accommodate issues regarding RNA degradation and genomic DNA contamination, we integrated two new optional filter into Mandalorion. We implemented the filtering of isoforms that are entirely contained within one other isoform, which indicates degraded input RNA molecules, and the filtering of unspliced isoforms which might stem from DNA contamination.
Accuracy of R2C2 reads and Mandalorion isoforms were determined using alignments in SAM format containing MD strings and a custom script that calculates

Smart-seq2 Read Processing
Paired FASTQ files were downloaded from BaseSpace and aligned to the polar bear genome using STAR with standard settings. The STAR index for the polar bear genome was built without a transcriptome reference because the GFF file provided by (Liu et al., 2014) did not conform to GFF standard (no "exon" features) and could therefore not be used. Read alignments in ordered BAM format were converted to PSL as described above.

Hemoglobin and Gene Expression Quantification
Hemoglobin content was determined through a kmer-based counting method using a custom script. In short, all possible 10nt kmers were extracted from the sequence of hemoglobin HBA and HBB transcripts. The presence of these kmers were then determined in each R2C2 or Smart-seq2 read from depleted and undepleted samples. Cut-offs for read assignments to hemoglobin were then determined by also analyzing R2C2 and Illumina reads of the GM12878 cell line which does not express hemoglobin. Gene expression was determined using Smart-seq2 (Illumina) read alignments in PSL format and a custom script. Reads aligning to hemoglobin loci were not counted toward total aligned reads in the RPM calculations.

Long-DASH Depletes Hemoglobin Transcripts From Full-Length cDNA
We used a modified Smart-seq2 protocol (Picelli et al., 2014;Cole et al., 2018;Volden et al., 2018) to reverse transcribe and amplify full-length cDNA from 70 ng of whole blood RNA of three polar bears (PB3, PB19, PB21). We then performed a targeted depletion of hemoglobin transcripts by incubating the full-length cDNA with Cas9 protein loaded with 16 guide RNAs (sgRNAs) specific to hemoglobin transcripts-eight sgRNAs targeting the HBA transcripts and eight sgRNAs targeting the HBB transcripts. The sgRNAs were selected to deplete hemoglobin transcripts from human and polar bear samples. The sgRNAs were chosen based upon sequence homology between these two species to eventually allow the removal of abundant of hemoglobin transcripts in whole blood from both human and polar bear samples using the same sgRNAs (Field et al., 2007) ( Figure S1). These 16 sgRNA probes we designed may allow for the depletion of samples of other vertebrates although sequence similarity should be checked before this is attempted.
The depletion process using the Cas9 system should cut the ~700-800 bp transcripts at different sites allowing us to do two things. First, we can re-amplify the sample, thereby only enriching for full-length molecules since the cut cDNA molecules no longer contain two priming sites required for exponential amplification during PCR amplification ( Figure  1). Second, we can remove the cut transcripts by performing a SPRI-bead-based size selection whereby only transcripts > 500 bp are retained. Indeed, prior to any depletion, we observed very strong bands located at ~700-800 bp in our agarose gels indicating the presence of a substantial amount of HBA and HBB hemoglobin transcripts (Figure 2). After depletion, reamplification, and size selection, the full-length cDNA product was visualized again to reveal the removal of the putative hemoglobin bands (Figure 2). After hemoglobin depletion is confirmed, the cDNA is ready to be converted into ONT-and Illumina-based libraries, with each protocol using the same input cDNA (Figure 1).

Long-DASH Is Compatible With Smart-Seq2 Library Preparation and Does not Distort cDNA Composition
Next, we aimed to validate whether Long-DASH truly depletes hemoglobin transcripts in the cDNA pool and can be used for Illumina's short-read RNA-seq sequencing platform. To show this, we prepared independent Tn5-based Smart-seq2 sequencing libraries for each depleted and undepleted cDNA pool. We then sequenced the Smart-seq2 libraries on a multiplexed Illumina HiSeq X 2 × 151 bp run. We generated ~20 million reads for depleted and ~60 million reads for undepleted samples. By sequencing the undepleted samples deeper, we reasoned that the non-hemoglobin genes should receive equivalent read coverage in depleted and undepleted samples. This allowed us to perform a side-by-side comparison of the depleted and non-depleted samples to ensure no off-target effects.
First, we analyzed the resulting sequencing data using a custom kmer-based approach to estimate the number of reads originating from hemoglobin transcripts. In the undepleted cDNA pools, 48-68% of reads were scored as originating from hemoglobin transcripts. In depleted samples, this was reduced to 1-4% reads ( Figure 3A). As a consequence, at the same read depth, reads per million (RPM) values for non-hemoglobin genes increased by a factor of 3 on average.
Second, to show that the depletion of hemoglobin transcripts did not distort the rest of the cDNA pool, we aligned the reads to the polar bear genome and quantified the expression of all previously annotated genes. We observed that when reads aligning to the hemoglobin loci were included in the analysis, the reads aligning to the few hemoglobin loci in our undepleted samples skewed the RPM calculations. By sequencing undepleted samples to great depth, this allowed us to exclude hemoglobin from quantification of gene expression while matching nonhemoglobin read depth of depleted samples. This analysis showed that the overall gene expression patterns were not dramatically distorted between depleted and undepleted samples. The three polar bear samples showed a Pearson r-value of 0.97-0.98 ( Figure  3B) when the gene expression values of depleted and undepleted samples were compared, and reads aligning to hemoglobin loci were discarded.
Next, we checked for genes whose expression was systematically affected by depletion. No genes were downregulated more than FIGURE 1 | Schematic of Long-DASH, (A) Whole blood RNA is extracted, and full-length cDNA is generated with the first half of the Smart-seq2 protocol. The cDNA is then depleted of hemoglobin transcripts using the recombinant S. pyogenes Cas9 protein bound to hemoglobin-specific sgRNA, which cuts hemoglobin cDNA molecules by introducing doublestrand breaks (△) in a sequence-specific manner. The cut molecules can no longer be exponentially amplified with PCR, so a subsequent PCR step is performed to enrich for complete non-hemoglobin cDNA molecules. The resulting hemoglobin-depleted cDNA pool is then sequenced using the ONTbased R2C2 library prep and the Illumina-based Smart-seq2 library prep. 4-fold in all three polar bear samples suggesting that there were no strong systematic off-target effects using the Cas9-based depletion. We did, however, find 151 genes out of ~12,000 expressed genes to be upregulated by at least fourfold in all three polar bears suggesting that Cas9-based depletion and subsequent second PCR amplification have had a systematic impact on a number of genes. We then investigated whether this effect would affect differential expression analysis between depleted samples. To this end, we calculated gene expression differences for each pair of polar bears twice, once pre-and once post-depletion. We then compare the pre-and post-depletion gene expression differences and found that, while depletion does introduce differences in the upregulated genes, these effects appear to be small, random in direction, and similar to a random selection of genes with similar expression levels ( Figure S2).
Overall, this indicates that the depletion of hemoglobin from full-length cDNA pools was successful, thereby freeing up the vast majority of sequencing reads to analyze the rest of the polar bear transcriptome. Although, the data suggests that a number of genes were systematically affected by depletion and additional PCR steps, further experiments including several technical replicates should enable differential expression analysis between depleted samples.

Long-DASH Is Compatible With Full-Length cDNA Sequencing Methods
Having established the compatibility of Long-DASH with the short-read RNA-seq assay, we investigated whether we could generate a long-read data set from the depleted cDNA using our R2C2 approach. By incorporating R2C2, we can generate error-corrected full-length cDNA reads using long-read ONT sequencers. We used five partially multiplexed flowcells to generate ~3.8 million R2C2 consensus reads of five depleted cDNA pools-two Long-DASH replicates (R1 and R2) for PB3 and PB19 as well as a single Long-DASH run for PB21. The R2C2 reads we generated had a median accuracy of 94%, which is between 8 and 10% more accurate than standard ONT cDNA sequencing protocols (Table S1).
We also generated ~5,000 R2C2 consensus reads of undepleted cDNA from one polar bear, which allowed us to compare hemoglobin content and consensus read length distributions between depleted and undepleted samples (Figure 4). In the undepleted sample, the majority of R2C2 reads were of two distinct lengths, both around 700 bp, likely representing the 79.3% of hemoglobin transcripts present in that sample. The five depleted samples showed a much more evenly distributed read length with a median hemoglobin content of 1.2% (0.6-8.3%) ( Figure 4). Higher hemoglobin levels for R2C2 compared to Smart-seq2 based library preps (1-4%) using the same cDNA might be explained with R2C2 being somewhat biased toward transcripts between 500 and 1,000 bp.
The median read length of the depleted samples was slightly below 1 kb, which is in line with cDNA read length distributions published to date (Workman et al., 2018). This means that despite the less than ideal conditions for RNA integrity given difficult field conditions and the lag time between sample collection and processing, the analyzed RNA molecules were largely intact.

R2C2 Reads of Depleted Full-Length cDNA Can Refine Transcriptome Annotations
Next, we generated high confidence isoform-level information from our full-length cDNA to refine the currently available polar bear transcriptome annotation. To this end, we analyzed our 3.8 million R2C2 consensus reads using the Mandalorion pipeline we previously developed (Byrne et al., 2017). We aligned the R2C2 reads to the polar bear genome sequence (Liu et al., 2014) using minimap2. These alignments, together with previously known individual splice sites (Genomic Resources Development Consortium et al., 2014;Liu et al., 2014), then serve as input into our Mandalorion pipeline which processes read alignments into high-confidence isoforms.
The Mandalorion pipeline first complements known splice sites with new splice sites; it identifies de novo from R2C2 read alignments. It then groups R2C2 reads based on the splice sites they use. Pairs of TSSs and polyA sites are then determined for each group to identify full-length isoforms. Two additional processing steps were performed whereby isoforms were excluded if they were fully contained within longer isoforms or unspliced. This was to ensure removal of any non-full-length isoforms that may result from RNA degradation, as well as isoforms potentially caused by DNA contamination, respectively. In total, this analysis produced 5,831 high-confidence isoforms with a median accuracy of 99.1%.
We then classified these 5,831 high-confidence-spliced isoforms using the SQANTI algorithm (Tardaguila et al., 2018) that determines what relationship an experimentally determined isoform has to genes and isoforms in a reference annotation ( Figure 5). As a reference, we downloaded 28,880 known and predicted mRNA sequences from NCBI by selecting "RefSeq" and "mRNA" filters in the NCBI Nucleotide database most of which are based on the NCBI Ursus maritimus Annotation Release 100 catalog of polar bear mRNA sequences (Pruitt et al., 2014).
Out of the 5,831 Mandalorion isoforms, 1,239 were classified as "novel_not_in_catalog" (NNC), which means that they overlapped a known gene but contained at least one unannotated splice site. In-depth analysis of this NNC group found that they contained a total of 521 new exons. In addition to R2C2 read coverage, Smart-seq2 read coverage was elevated in these new exons providing additional evidence for their inclusion in transcripts. Further, 1,301 isoforms were classified as "novel_in_catalog" (NIC), which means that they overlapped a known gene and used only annotated splice sites but at least once as part of a previously unannotated splice junction. In total, we observed 2,540 (1,239 NNC and 1,301 NIC) new isoforms with unannotated exon configurations. An additional 1,893 isoforms were classified and "full_splice_ match" (FSM), which means that their splice junctions matched an annotated isoform completely, but it doesn't mean that TSS and polyA sites also matched this isoform. In-depth analysis of the putative full-length NNC, NIC, and FSM isoform groups identified 2,885 new TSSs and 1,817 new polyA sites. R2C2 read coverage declined rapidly at TSSs and polyA sites providing clear evidence for their validity. Smart-seq2 read coverage was elevated inside TSS and polyA sites but declined slowly toward the respective features, which is characteristic for standard short-read Illumina data (Figure 5). This is not surprising as short-read-based protocols have to be specifically designed to capture those features (Ruan and Ruan, 2011;Salimullah et al., 2011;Cole et al., 2018). So, while these data validate the existence of these features, they cannot be used for confirming their exact location.
Finally, 769 isoforms were classified as "incomplete_splice_ match" (ISM), which means that they contain a subset of splice junctions of an annotated isoform. While these isoforms could represent real, shorter transcripts, they might also represent experimental artifacts so we excluded them from TSS and polyA analysis.
Considering RefSeq mRNA sets are in part based on deep short-read data and computational annotation, we did not expect to discover many entirely new gene loci. However, 509 of the 5,831 isoforms were classified as "intergenic" (IG), which means that they did not overlap with any annotated gene locus. By determining which of these isoforms overlapped with each other, we identified 176 new gene loci.
Overall, this analysis dramatically refined our isoformlevel knowledge of the whole blood polar bear transcriptome (Figure 5). To make this knowledge straightforward to use for future analysis, we have generated a GTF annotation file containing RefSeq mRNA entries merged with our R2C2/ Mandalorion isoforms.
How these new isoforms and isoform features have improved the current annotation can be seen clearly in these three following examples. In the RBX1 gene, we discovered 10 new isoforms containing new TSSs and polyA sites, several of which were associated with new terminal first or last exons ( Figure  6A). In the GMFG gene, we similarly identified new isoforms containing unannotated internal and terminal exons, intron retention events, TSSs, and polyA sites ( Figure 6B). Finally, we discovered a new gene locus that contains two isoforms and is entirely absent in the polar bear RefSeq mRNA set. However, aligning the two isoforms to the Panda genome (Li et al., 2010) resulted in unique matches to the CCDC72 gene ( Figure 6C).

DISCUSSION
To better understand how humans and environmental perturbations impact threatened or endangered species, it is critical to understand changes in transcriptome dynamics. Fluctuations at the molecular and cellular level are sensitive indicators of environmental change (Kim et al., 2011;Brown et al., 2017); they are analogous to veterinary medicine where blood transcriptomes serve as proxies for identifying health status, disease, and exposures to environmental toxicants (Burgess et al., 2012;McLoughlin et al., 2014;Watson et al., 2017;Lv et al., 2018). Changes at the transcriptome level may also be useful indicators of ecological specialization, and therefore useful to design strategies for species management and conservation (Supple and Shapiro, 2018). However, existing approaches to generate transcriptome data from whole blood RNA are either specifically designed for short-read sequencing (DASH) or human samples (commercial hemoglobin depletion kit like GLOBINclear) and therefore lack a cost-effective approach for analyzing isoformlevel transcriptomes of non-model organisms.
Any study investigating whole blood transcriptomes using short-or long-read sequencing will greatly benefit from the Long-DASH method. Long-DASH effectively and economically depletes hemoglobin transcripts from whole blood full-length cDNA, which can then be sequenced with short-or long-read sequencing. We validated Long-DASH by depleting hemoglobin transcripts from polar bear whole blood cDNA pools and generated deep short-read Smart-seq2 RNA-seq data as well as 3.8 million R2C2 full-length cDNA consensus reads. We processed the 3.8 million full-length R2C2 reads to identify close to 6,000 high confidence isoforms which we then used to refine and improve the polar bear whole blood transcriptome annotation.
In addition to polar bear hemoglobin transcripts, the sgRNAs designed for this study will also target human hemoglobin transcripts making them useful for basic research as well as clinical applications in cancer biology and disease diagnosis ( Figure S1) (Valk et al., 2004;Borovecki et al., 2005;Gervasoni et al., 2008;Morey et al., 2016). Further, the sgRNA sequences used in Long-DASH can be easily adapted to any organisms or gene. The ease and adaptability place Long-DASH at an advantage over "as-is" commercial kits like GLOBINclear (Ambion), which promises >95% of depletion of human and mouse hemoglobin transcripts, but fails to efficiently deplete hemoglobin transcripts from pig whole blood RNA samples (Choi et al., 2014).
Since cDNA can be generated from femtogram levels of polyA-RNA, Long-DASH requires very little RNA input compared to RNA pull-down methods. This allows the investigator to gather small samples, or only process small aliquots of existing samples, thereby maximizing the usefulness of each sample collection and minimizing harm to animals. In its current state, depletion by Long-DASH is still somewhat variable, resulting in hemoglobin levels from 0.6% to 8.3%. While still a large improvement compared to the undepleted samples, future work on the method will center on removing this variability through either longer incubation times or higher number or concentration of sgRNA probes and the Cas9 protein. It may also be beneficial to measure depletion success by qPCR before sequencing a depleted cDNA pool.
Going forward, the Long-DASH depletion method and the R2C2 long-read sequencing method will form a very powerful combination for transcriptome analysis and annotation from whole blood samples and beyond. The transcriptomes of many tissues contain several highly abundant transcripts that represent >50% of all transcript molecules (Mure et al., 2018). A set of sgRNAs targeting any abundant transcripts can be easily generated, making Long-DASH conducive for surveying other tissues as well. Specifically, depleted full-length cDNA libraries can be sequenced using our R2C2 method, which currently represents the most powerful combination of throughput and accuracy in the long-read sequencing field. Our most recent R2C2 run emphasizes this by generating ~1,000,000 R2C2 reads at a median accuracy of 97.5% on a single ONT MinION flowcell at a cost of ~$650 (Table S1). This represents an increase in accuracy of >10% over standard ONT cDNA sequencing and 10 times more complete reads than the PacBio Sequel at the same cost. Combining our Long-DASH and R2C2 methods therefore brings the exhaustive annotation of non-model organisms within reach.

DATA AVAILABILITY
All Illumina and ONT raw read data are available at SRA under Bioproject accession PRJNA514749.

ETHICS STATEMENT
Permits for field operations and animal care were provided by the Government of Greenland (permit numbers 2015-110281 and 2017-5446).