Using pan RNA-seq analysis to reveal the ubiquitous existence of 5’ end and 3’ end small RNAs

In this study, we used pan RNA-seq analysis to reveal the ubiquitous existence of 5’ end and 3’ end small RNAs. 5’ and 3’ sRNAs alone can be used to annotate mitochondrial with 1-bp resolution and nuclear non-coding genes and identify new steady-state RNAs, which are usually from functional genes. Using 5’, 3’ and intronic sRNAs, we revealed that the enzymatic dsRNA cleavage and RNAi could involve in the RNA degradation and gene expression regulation of U1 snRNA in human. The further study of 5’, 3’ and intronic sRNAs help rediscover double-stranded RNA (dsRNA) cleavage, RNA interference (RNAi) and the regulation of gene expression, which challenges the classical theories. In this study, we provided a simple and cost effective way for the annotation of mitochondrial and nuclear non-coding genes and the identification of new steady-state RNAs, particularly long non-coding RNAs (lncRNAs). We also provided a different point of view for cancer and virus, based on the new discoveries of dsRNA cleavage, RNAi and the regulation of gene expression.


53
RNA sequencing (RNA-seq) , usually based on the Next Generation Sequencing (NGS) technologies 54 is widely used to simultaneously measure the expression levels of genes with higher accuracy than Serial 55 Analysis of Gene Expression (SAGE) and microarray [1]. RNA-seq is also used to annotate genes in 56 sequenced genomes, which is the only basis to study gene transcription, RNA processing and biological 57 functions of these genes, etc. Particularly, RNA-seq or small RNA sequencing (sRNA-seq) is indispensable 58 and cost effective way for the annotation of mitochondrial and nuclear non-coding genes and the 88 identification of new steady-state RNAs, which are usually from functional genes. The further study of 5', 3' 89 and intronic sRNAs help rediscover double-stranded RNA (dsRNA) cleavage, RNA interference (RNAi) 90 and the regulation of gene expression, which challenges the classical theories. In a genome-alignment map of sRNA data, there usually are some peaks or hotspots [8], where the 103 depths of the positions are much higher than those of other positions in the genome. In our previous study of 104 human rRNA genes [6], we found that some of peaks comprised 5' and 3' sRNAs and they were ubiquitously 105 existed in mitochondrial and nuclear non-coding genes. As the current sRNA-seq technologies usually 106 provide sequences with short lengths, 5' and 3' sRNAs are defined as sRNA-seq reads with lengths of 15~50 107 bp, which are precisely aligned to the 5' and 3' ends of mature RNAs respectively ( Figure 1A) and they have 108 such features: 1) 5' and 3' sRNAs are degraded fragments from mature RNAs and the lengths of them vary 109 progressively with 1-bp differences. 2) The cleavage sites between 3' sRNAs and their downstream 5' 110 sRNAs are not limited to one (usually three) due to inexact cleavage by enzymes ( Figure 1B). 3) 5' and 3' 111 sRNAs of steady-state RNAs (e.g. 18S, 5.8S and 28S rRNA) are significantly more abundant than the 112 intronic sRNAs of them, while 5' and 3' sRNAs of transicent RNAs (e.g. Internal Transcribed Spacers of 113 rRNA, ITS1 and ITS2) are not. This criterion can be used to identify new steady-state RNAs, which are 114 usually from functional genes. One example of a new steady-state RNA downstream tRNAs and another 115 example of two novel mitochondrial lncRNAs were introduced in the following paragraphs. Particularly, it 116 was proved that MDL1 and MDL1AS were two steady-state lncRNAs in the human mitochondrial DNA and 117 predicted to have biological functions [7]. 118 We used 5' and 3' sRNAs from one sRNA-seq dataset to annotate genes and used one CAGE-seq 119 dataset, one GRO-seq dataset and one PacBio cDNA-seq dataset (Materials and Methods) to validate the 120 annotations. Later, we developed a simplified gene-annotation procedure. Using only 5' sRNAs, gene 121 annotation can be reduced to the identification of 5' ends of mature RNAs. By doing so, the 3' ends of their 122 upstream mature RNAs and their cleavage sites can be derived ( Figure 1A). We have defined a new file 123 format, named 5-end format, to easily identify 5' ends of mature RNAs. The new format is derived from the 124 stranded reads. Using the 5-end format, the 5' end of one mature RNA can be easily identified from two to 132 three candidates ( Figure 1B). Their ratio1s or ratio2s must be above a threshold (e.g. 75%) and significantly 133 higher than the ratio1s or ratio2s of positions surrounding these two to three positions. 134 135 5' and 3' sRNAs in non-coding genes 136 Using 5' and 3' sRNAs, we corrected the annotation of human rRNA genes. For the 5' end of each 137 mature RNA, we obtained two or three candidates and selected the position with the highest ratio1 or ratio2 138 as the result. For example (Figure 1B), we obtained three positions 7,923, 7,924 and 7,925 to identify the 5' 139 end of 28S rRNA and selected 7,925 as the result. In the same way, 5' ends of 18S and 5.8S rRNA were also 140 identified using 5' sRNAs. Then, 3' ends of 18S, 5.8S and 28S rRNA were identified using 3' sRNAs. 141 Finally, the annotations of ITS1 and ITS2 were derived using the annotations of 18S, 5.8S and 28S rRNA. 142 These corrected annotations ( Table 2) were validated using the CAGE-seq dataset and the GRO-seq dataset 143 (Materials and Methods). Although the depth 1,471,247 at the position 6,601 was much higher than the 144 depth 647,406 at the position 6,596 in the sRNA-seq dataset, the 5' end of 5.8S rRNA annotated as the 145 position 6,601 with the ratio1 of 35.42% (520,006/1,468,024) was still corrected as the position 6,596 with 146 the ratio1 of 88.11% (569,882/646,805). In addition, the genome-alignment map using the sRNA-seq dataset 147 showed that human rRNA genes had peaks at the position 6,596, 7,925 and 6,756 corresponding to the 5' 148 All rights reserved. No reuse allowed without permission.
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/444117 doi: bioRxiv preprint ends of 5.8S and 28S rRNA and the 3' end of 5.8S rRNA, respectively (Figure 2A). The genome-alignment 149 map using the CAGE-seq dataset showed that human rRNA genes had peaks at the position 3,675 and 7,926 150 corresponding to the 5' ends of 18S and 28S rRNA, respectively ( Figure 2B). This suggested that 5' m 7 G or 151 other caps of 18S and 28S rRNA could exist. By the analysis of 3' sRNAs, we confirmed that rRNA genes 152 did not contain 3' polyAs. 153

160
All rights reserved. No reuse allowed without permission.
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/444117 doi: bioRxiv preprint In 2009, a novel class of sRNAs named tRNA-derived RNA fragments (tRFs) was introduced and three 161 series of tRFs (tRF-5, tRF-3 and tRF-1) were identified using the sRNA-seq data of the human prostate 162 cancer cell line by 454 deep sequencing [9]. However, these authors did not correctly understand the tRFs 163 due to the technology limit and the small dataset size. Using the pan RNA-seq analysis, we proved that the 164 tRF-5 and tRF-3 series were 5' and 3' sRNAs from mature tRNAs and the tRF-1 series were 5' sRNAs from 165 mature RNAs of their downstream genes (Figure 2A). In this study, 13 mature tRNAs and their 43 166 precursors (Supplementary file 1) were analyzed, and the 5/3-sRNA ratios of them ranged from 0.1 to 167 595.8. However, we did not found the characteristics in the size distribution of 5' and 3' sRNAs from these 168 tRNAs. As 3' sRNAs contain detailed 3'-end information of mature RNAs, we acquired more discoveries 169 related to tRNA processing, maturation and degradation. One example was 3' sRNAs of tRNAs had four 170 types, which were non-tail, C-, CC-, and CCA-tailed. The proportions of these four types were 5.26% 171 (22,906/4,355,95), 12.36% (53,845/4,355,95), 13.81% (60,176/4,355,95) and 68.57% (298,668/4,355,95). In 172 addition, we obtained sequences of full-length mature tRNAs with non-tail, C-, CC-, and CCA-tailed. 173 Among these full-length mature tRNAs, 8,539 TRD-GTC2-1 tRNAs (for Asp) and 16,900 TRE-CTC1-1 174 tRNAs (for Glu) were obtained. These results suggested that 3' sRNAs were produced by tRNA degradation 175 during its synthesis, when CCAs were post-transcriptionally added to 3' ends of tRNAs one nucleotide by 176 one nucleotide. Another example was the correction of TRL-TAG3-1's annotation. The mature TRL-TAG3-177 1 was annotated as a 82-nt sequence from the human genome with its 3' cleavage site 178 ACCGCTGCCA|cacctcagaa. Using 5' and 3' sRNAs, the 3' cleavage site of TRL-TAG3-1 was determined as 179 ACCGCTGCCAC|acctcagaa. Instead of CAA, it was CA that was post-transcriptionally added to the 3' end 180 of TRL-TAG3-1 ACCGCTGCCAC. The genome-alignment results using the CAGE-seq dataset showed 181 that 5' m 7 G or other caps of tRNA did not exist. By the analysis of 3' sRNAs, we confirmed that tRNA genes 182 did not contain 3' polyAs. 5' and 3' sRNAs from all the 13 mature tRNAs resided in peaks in the genome-183 alignment maps, while a few 3' sRNAs of their upstream genes or 5' sRNAs of their downstream genes 184 resided in peaks. Among the peaks from these upstream or downstream genes, the highest one was on the 185 downstream of TRS-TGA1-1 (chr10:67764503-67764584). It suggested that this peak was the 5' end of a 186 new steady-state RNA, which could be from a functional gene and had not been annotated in the current 187 genome version. 188 Small nuclear RNAs (snRNAs) include a class of small RNA molecules that are found within the 189 splicing speckles and Cajal bodies of the cell nucleus in eukaryotic cells [10]. snRNAs are always associated 190 with a set of specific proteins and the complexes are referred to as small nuclear ribonucleoproteins 191 (snRNPs). SnRNAs are also commonly referred to as U-RNAs and one well-known member is U1 snRNA 192 [11]. Using 5' sRNAs, we confirmed annotations of U1, U2, U3, U4, U5, U6 and U7 (Supplementary file 193 confirmed that all the snRNAs did not contain 3' polyAs. In addition, we did not found any new steady-state 196 RNA on the upstream or downstream of seven snRNA genes. 197 198
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. . https://doi.org/10.1101/444117 doi: bioRxiv preprint Another example was that tRNA Ala AS-tRNA Tyr AS (NC_012920: 1318-1638) could not be further cleaved, 227 which was against a hypothesis from one previous study [13]. 228 Among 29 mature transcripts on the H-strand, tRNA transcripts were tailed by 3' CCAs, while other 229 mature transcripts were tailed by 3' polyAs. The maximum lengths of polyAs in tRNA Gln AS, ND2, 230 tRNA Ala AS-tRNA Tyr AS, COI/tRNA Ser AS, COII, ATP8/6, COIII, ND3, ND4L/4, ND5/ND6AS/tRNA Glu AS, 231 Cytb, MDL1, 12S rRNA, 16S rRNA,and ND1 are 14,1,12,11,20,35,10,1,22,1,8,28,32,21 and 16,232 respectively. 3' sRNAs containing polyAs or CCAs of different lengths were captured to proved that 3' 233 sRNAs were produced by RNA degradation during its synthesis, when polyAs or CCAs were post-234 transcriptionally added to 3' ends of RNAs one nucleotide by one nucleotide. In this study, we also 235 confirmed that both of mRNA and rRNA transcripts were capped by 5' m 7 G [3]. Our data supported that 236 MDL1AS, ND5/ND6AS/tRNA Glu AS and tRNA Ala AS/tRNA Asn AS/tRNA Cys AS/tRNA Tyr AS could be capped 237 by 5' m 7 G, but tRNA Gln AS and MDL1 could not be capped. 5' and 3' sRNAs of MDL1 and MDL1AS were 238 significantly more abundant than the intronic sRNAs of them, which was one criterion to identify steady-239 state RNAs. Although MDL1 was not capped by 5' m 7 G as MDL1AS, we still proposed that they were 240 steady-state RNAs and could have biological functions. The further study showed that qPCR of MDL1 241 provided higher sensitivities than that of BAX/BCL2 and CASP3 in the detection of cell apoptosis [14]. 242 In our previous study, the first Transcription Initiation Site (TIS) of H-strand (IT H1 ) and the TIS of L-243 strand (IT L ) were determined at the position 561 and 407 on the human mitochondrial genome (RefSeq: 244 NC_012920.1), but the second TIS of H-strand (IT H2 ) was not determined [7]. In this study, IT H2 was 245 determined at the position 648, which was also the 5' end of 12S rRNA. This finding was against the long-246 standing knowledge that IT H2 was at the position 638 [15]. Using pan RNA-seq analysis, we found that all 247 the TISs (IT H1, IT H2 and IT L ) could be capped by 5' m 7 G. We also found polyAs before TISs, which 248 suggested that the transcription of mitochondrial genes could be initated by primers containing polyTs. This 249 finding explained why all the TISs were resided in A-enriched regions. However, these explanations need be 250 proved in the future studies. 251 252

Analysis of RNA degradation using 5', 3' and intronic sRNAs 253
As 5', 3' and intronic sRNAs are accumulated RNA degradation intermediates, they can be used to 254 investigate the RNA degradation [16], particularly for steady-state RNAs. The result using our data showed 255 that in general, 5' and 3' sRNAs were more abundant than intronic sRNAs and short 5' and 3' sRNAs were 256 more abundant than longer ones for tRNAs, rRNAs, snRNAs and mitochondrial RNAs. This suggested that 257 these mature RNAs, particularly short RNAs (e.g. tRNAs), were mainly degraded by 3' and 5' exonucleases 258 to accumulate 5' and 3' sRNAs. As for rRNAs and snRNAs, we found many peaks in the body of genes, 259 which were even much higher than the peaks comprising 5' or 3' sRNAs in the genome-alignment map. In 260 addition, the peaks comprising intronic sRNAs in rRNAs showed tissue specificities. The liver tissue (SRA: 261 SRP002272) had specific peaks at the positions 12,891. The Plasma (SRA: SRP034590) had specific peaks 262 and 10,627. Further study of these tissue specificities was beyond the scope of this study. Then, we focused 274 on the study of the secondary structures around these peaks in rRNAs and snRNAs and found that some of 275 them involved in dsRNA regions. Particularly, we found a featured peak spanning a 43-bp region from 49 276 bp to 92 bp of U1 snRNA (Figure 3). In this region, 5' ends of most intronic sRNAs were precisely aligned 277 to 49 bp or 78 bp (Figure 3A). We also found that this region formed a hairpin in the secondary structure of 278 U1 snRNA and produced a series of siRNA duplexes [17] with lengths from 15 bp to at least 25 bp (Figure  279 3C). The most abundant reads AGGGCGAGGCTTATC and TGTGCTGACCCCTGC formed a 15-bp 280 duplex structure. The duplex ratio of AGGGCGAGGCTTATC against TGTGCTGACCCCTGC was 2.15 281 (34,078/15,825) and 99.97% (49,889/49,903) of these duplexes were sequenced from 14 samples of plasma 282 (SRA: SRP034590). It suggested that this dsRNA region was cleaved by the ribonuclease III (RNase III) 283 family [18] to produce these siRNA duplexes and could induced RNAi. Based on the findings in this study, 284 our hypothesis is: 5' and 3' exonucleases are prevalent than endonuceases for the degradation of mature non-285 coding RNAs. So abundant 5' and 3' sRNAs were observed using sRNA-seq data. The longer mature RNAs 286 have more and longer dsRNA regions (e.g. 15-bp long for U1) than short ones (e.g. 7-bp longest for tRNAs) 287 to induce dsRNA cleavage to produce siRNA duplexes. Although the lengths of siRNA duplexes discovered 288 in this study were only 15 bp, we still hypothesized that they could induce RNAi due to the unbalanced 289 duplex ratio 2.15. As RNAi regulate the expression of these genes through a negative feed-back mechanism, 290 we designed preliminary experiments to over-express U1 snRNA in the HEK293 (human), SY5Y (human) 291 and PC-12 (rat) cell lines to prove our hypothesis. The basic idea was that if the negative feed-back 292 mechanism existed, the expression level of U1 snRNA could decrease rather than be stable when the over-293 expression of it beyond a threshold. The experimental results showed that the expression level of U1 snRNA 294 decreased after 4X, 9X and 6X dosage (Materials and Methods) in the HEK293 (human), SY5Y (human) 295 and PC-12 (rat) cell lines, respectively ( Figure 3D). Particularly, the results in the HEK293 cell line showed 296 a significant effect caused by the negative feed-back mechanism. Therefore, RNAi could involve in the 297 RNA degradation and gene expression regulation of U1 snRNA. 298 299

300
In this study, we used the pan RNA-seq analysis to reveal the ubiquitous existence of 5' end and 3' end 301 small RNAs. 5' and 3' sRNAs alone can be used to annotate mitochondrial and nuclear non-coding genes 302 with 1-bp resolution and identify new steady-state RNAs. Using 5', 3' and intronic sRNAs, we revealed that 303 the enzymatic dsRNA cleavage and RNAi could involve in the RNA degradation and gene expression 304 regulation of U1 snRNA in human. The RNAi's function in the RNA degradation was reported as a 305 inappropriate event in yeast rRNA and tRNA degradation and only happened when 5' and 3' degradation 306 were absent [19]. However, our finding suggested that RNAi's function in the RNA degradation could be a 307 general mechanism. 308 Based on a previous study, the Rnt1p  The ancestral function of RNAi is generally agreed to have been immune defense against exogenous 317 genetic elements such as transposons and viral genomes [20]. However, our findings help rediscover dsRNA 318 cleavage, RNAi and the regulation of gene expression. That is, both of dsRNA cleavage and RNAi are 319 innate mechanisms rather than immune defense mechanisms. Basically, the enzymatic dsRNA cleavage is 320 responsible for RNA processing, maturation and degradation, while RNAi is responsible for RNA 321 degradation. By the degradation of mature RNAs, RNAi of one gene produces siRNA duplexes to regulate 322 expression levels of itself or other genes. Mature RNAs containing more hairpin structures have more 323 chances to induce RNAi, which is more important for highly expressed genes (e.g. U1 snRNA) or virus 324 genes. These genes need RNAi to regulate gene expression though a negative feed-back mechanism. In one 325 of our previous studies, we reported for the first time the existence of complemented palindromic small 326 RNAs (cpsRNAs) and proposed that one cpsRNA from severe acute respiratory syndrome coronavirus 327 (SARS-CoV) could induced RNAi [14]. This cpsRNA was detected from a 22-bp DNA complemented 328 palindrome in the SARS-CoV genome. As DNA complemented palindromes are prone to produce dsRNA 329 regions, viruses containing more DNA complemented palindromes in their genomes have more chances to 330 induce RNAi and have more abilities for the regulation of gene expression, which is important for their 331 infection or pathogenesis. 332 We also provided a different point of view for the gene expression regulation of U1 snRNA, The 333 primary function of U1 snRNA is its involvement in the splicing of pre-mRNAs in nuclei. In the past 20 334 years, research of U1 snRNA has focused on its primary function, particularly related to neurodegenerative 335 diseases caused by abnormalities of U1 snRNA [11]. In one of our previous studies, we reported that over-336 expression of U1 snRNA induces decrease of U1 spliceosome function associated with Alzheimer's disease. 337 However, the relationship between U1 snRNA over-expression and U1 snRNP loss of function remain 338 unknown [21]. In another previous study, we reported that U1 snRNA over-expression induced cell 339 apoptosis in SY5Y cells, but not in PC-12 cells [11]. These contradictory phenomena can be explained using 340 the RNAi's function in the RNA degradation and the negative feedback mechanism. 341 We provided a different point of view for cancer and virus. In one of our previous study, we reported 342 how U1 snRNA over-expression affected the expression of mammal genes on a genome-wide scale and that 343  (Supplementary file 1). The cleaning and quality control of sRNA-seq data were performed 359 using the pipeline Fastq_clean [24] that was optimized to clean the raw reads from Illumina platforms. To 360 simply annotate genes from a sequenced genome, we aligned all the cleaned reads from sRNA-seq, CAGE-361 seq and GRO-seq data to the reference sequences using the software bowtie v0.12.7 allowing one mismatch. 362 Then, we obtained SAM, BAM, sorted BAM, Pileup and 5-end files using the software samtools. Statistical 363 computation and plotting were performed using the software R v2.15.3 with the Bioconductor packages [1]. 364
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.