Distinct and Modular Organization of Protein Interacting Sites in Long Non-coding RNAs

Background: Long non-coding RNAs (lncRNAs) are reported to be extensively involved in diverse regulatory roles and have exhibited numerous disease associations. LncRNAs modulate their function through interactions with other biomolecules in the cell, including DNA, RNA, and proteins. The availability of genome-scale experimental datasets of RNA binding proteins (RBPs) motivated us to understand the role of lncRNAs in terms of their interactions with these proteins. In the current report, we present a comprehensive study of interactions between RBPs and lncRNAs at a transcriptome scale through extensive analysis of the crosslinking and immunoprecipitation (CLIP) experimental datasets available for 70 RNA binding proteins. Results: Our analysis suggests that the density of interaction sites for these proteins was significantly higher for specific sub-classes of lncRNAs when compared to protein-coding transcripts. We also observe a positional preference of these RBPs across lncRNA and protein-coding transcripts, in addition to a significant co-occurrence of RBPs having similar functions, suggesting a modular organization of these elements across lncRNAs. Conclusion: The significant enrichment of RBP sites across some lncRNA classes suggests that these interactions might be important in understanding the functional role of lncRNAs. We observed a significant enrichment of RBPs that are involved in functional roles such as silencing, splicing, mRNA processing, and transport, indicating the potential participation of lncRNAs in such processes.

and regulation of lncRNAs have not been studied in great detail, though it is believed that they are transcribed mainly by Polymerase II and are capped and polyadenylated (Goodrich and Kugel, 2006; Gibb et al., 2011). One particular class of lncRNAs, the large intergenic non-coding RNAs, has been discovered primarily through their association with epigenetic marks in the genome (Cabili et al., 2011; Cao, 2014). We have recently shown extensive similarities and specific dissimilarities in epigenetic regulation of lncRNAs in comparison to protein-coding genes. The precise biological functions of many lncRNAs are not known, though a handful of candidates have recently been shown to be mechanistically involved in gene regulation and associated with diseases (Wapinski and Chang, 2011). Recent reports from our group also suggest processing of a subset of lncRNAs to smaller RNAs, and that a subset of lncRNAs could potentially be targeted by microRNAs (Jalali et al., 2013), thus constituting an intricate and yet poorly understood network of non-coding RNA mediated regulation.
Mechanistically, the characterization of a lncRNA could be generalized as a function of its interactions with other biomolecules in the cell: DNA, RNA, proteins, and small molecules (Bhartiya et al., 2012). Current studies have shown that molecular and computational biology techniques can act as catalysts in discovering lncRNA-mediated regulation by elucidating their interactions with different biomolecules (Jalali et al., 2015). Recent reports have also suggested the possibility of protein-lncRNA interactions and regulatory interactions mediated through them (Kung et al., 2013). The present understanding of protein-lncRNA interactions is limited to a handful of candidates associated with proteins involved in epigenetic modifications, as in the cases of HOTAIR (Gupta et al., 2010), Anril (Kogo et al., 2011), and Xist (Arthold et al., 2011); splicing, as in the case of MALAT1 (or NEAT2), a conserved nuclear ncRNA (Tripathi et al., 2010); transcriptional regulation through interaction with transcription factors, as in the case of Gas5 (Kino et al., 2010); and a few other candidates such as Meg3 (Zhao et al., 2006), DHFR (Blume et al., 2003), and Gomafu (Sheik Mohamed et al., 2010). It has been recommended that computational methods for predicting protein-RNA interactions, though less accurate, could potentially be used to guide experiments (Puton et al., 2012).
Recently, experimental methodologies to understand protein-RNA interactions on a genomic scale, including CLIP-seq (Darnell, 2012) and variants thereof (Hafner et al., 2010a; Jain et al., 2011; Konig et al., 2011), have provided insights into the target sites of a number of RNA binding proteins with much higher resolution (Popov and Gil, 2010). The availability of genome-scale maps of RNA binding proteins provides a novel opportunity toward understanding patterns of RNA binding protein interaction sites in different transcript classes and deriving clues on the interaction networks, regulation, and functional consequences of these interactions.
Abbreviations: lncRNA, long non-coding RNA; DNA, Deoxyribonucleic acid; CLIP, cross-linking immunoprecipitation; HITS-CLIP, UV crosslinking and immunoprecipitation with high-throughput sequencing; PAR-CLIP, Photoactivatable Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation; iCLIP, individual-nucleotide resolution Cross-Linking and Immuno Precipitation; ncRNA, Non-coding RNA; lincRNA, long intergenic RNA; TEC, To be Experimentally Confirmed; RIP-seq, RNA Immunoprecipitation sequencing; CLASH, cross-linking ligation and sequencing of hybrids; CIMS, Crosslinking induced mutation site; CITS, crosslinking induced truncation analysis; RISC, RNA-induced silencing complex; RNAi, RNA interference; miRNA, microRNA; RBP, RNA binding protein; UTR, untranslated region; CDS, coding sequence; UCSC, University of California, Santa Cruz.
Recently, Li and coworkers showed interactions between proteins and lncRNAs, in addition to their association with disease-causing SNPs. They deposited all the interaction data in the form of BED files in the starBase 2.0 database; the same datasets are also included in our current study. Tartaglia and coworkers have also employed a novel algorithm, catRAPID, to evaluate the binding tendency of proteins with RNAs (Livi et al., 2015). A similar study by Park et al. has also attempted to explore the possible functions of lncRNAs by focusing on RBP-lncRNA interactions. LncRNAtor functionally annotates lncRNA molecules based on their expression profiles and co-expression with mRNAs. It also encompasses lncRNA interaction data with 57 RBPs for 5 organisms (Park et al., 2014).
The functional interactions of lncRNAs could potentially be summarized as the sum total of their interactions with other biomolecules, independently or in the context of one another. The interactions of lncRNAs with genomic DNA and their involvement in chromatin organization (Lee and Bartolomei, 2013), and with other RNA species (Salmena et al., 2011; Bhartiya et al., 2012; Jalali et al., 2015) including microRNAs (Jalali et al., 2013), have been explored at length. Though there have been a number of reports characterizing functional roles of lncRNAs through their association with proteins (Wilusz et al., 2009), no systematic analysis has been published on mapping or characterizing the functional domains of lncRNAs for protein-binding sites. Our study focuses on providing a platform to explore these interactions at a larger scale using computational approaches to functionally characterize lncRNA molecules.
In the present report, we have performed a comprehensive analysis of 70 experimental RNA binding protein datasets available in the public domain. We derived the peak information (or the most probable site of interaction between protein and RNA) for these RNA binding protein sites at a genome scale from doRiNA (Blin et al., 2015), starBase (Yang et al., 2011; Li et al., 2014), and CLIPdb (Yang et al., 2015) and analyzed their binding sites in lncRNAs and protein coding transcripts. Our analysis suggests that 6 lncRNA subtypes (viz., antisense, lincRNA, miscRNA, processed transcripts, retained intron, and sense intronic) are largely enriched for protein-binding sites compared to other subclasses, and hence potentially contribute to a novel layer of regulatory interactions mediated through protein-RNA interactions in ncRNA transcripts. Our analysis shows the distribution of RBP binding sites on lncRNA loci, as opposed to only protein coding transcripts. We also reveal an interesting pattern of positional clustering of RBP target sites in lncRNAs, suggesting a modular organization of regulatory sites in lncRNAs, and we describe how functionally similar proteins co-occur in both protein coding and lncRNA transcripts. To our knowledge, this is the most comprehensive study comparing lncRNA-RBP interactions with those at protein coding loci.
These datasets comprise positions of interaction between RNA binding proteins and RNA target sites derived from PAR-CLIP (Photoactivatable Ribonucleoside Enhanced Crosslinking and Immunoprecipitation), HITS-CLIP-seq (High Throughput sequencing of RNA isolated by crosslinking immunoprecipitation), RIP-seq (RNA immunoprecipitation), iCLIP (individual nucleotide resolution crosslinking and immunoprecipitation), PAR-iCLIP (Photoactivatable Ribonucleoside Enhanced individual nucleotide resolution crosslinking and immunoprecipitation), and CLASH (crosslinking ligation and sequencing of hybrids), followed by sequencing of the pull-down fraction of RNA. The sequenced RNA is further used to identify exact or probable binding sites using various bioinformatic approaches. In the case of CLIPdb, peak calling and identification were done using the PARalyzer, CIMS, Piranha, and CITS software tools. We stored each of the files derived from all databases in the form of peaks as separate files for downstream analysis. Details of all the techniques and methodologies used to process the data used in our analysis are given in Table 1.

Mapping of RNA Binding Protein Interaction Sites
The peaks of the RNA binding protein interaction sites were mapped to the lncRNA exons using a bespoke Perl script and BEDtools (v2.17.0) (Quinlan and Hall, 2010). The most probable sites of interaction (or the peaks) between protein and RNA were derived from datasets taken from the doRiNA, starBase, and CLIPdb databases, which were processed through standard computational pipelines (as listed in Table 2), offering easy comparability from the analysis point of view. Further, we analyzed the binding sites in each of the individual lncRNA subclasses as defined by GENCODE annotations (i.e., 3 prime overlapping ncRNA, antisense, bidirectional promoter lncRNA, lincRNA, macro lncRNA, miscRNA, non-coding, processed transcripts, pseudogene, retained intron, sense intronic, sense overlapping, and TEC). Similarly, we also plotted the distribution of the binding sites across the protein-coding exons derived from the GENCODE v24 annotation file.
We further tested the significance of binding frequency for each of the lncRNA biotypes when compared to the protein coding transcripts. The normalized frequency of binding was calculated by dividing the unique number of RBP peaks mapped from each dataset by the unique number of bases (in kb) of the lncRNA, protein coding, or random transcripts. A statistical unpaired t-test was applied using an R (version 3.1.3) script (R Core Team, 2015) to test whether any of the lncRNA biotypes had a significantly higher RBP binding frequency than the protein coding transcripts.
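The normalization described above can be illustrated with a minimal sketch. It is not the Perl/BEDtools pipeline used in the study; it assumes half-open (start, end) intervals on a single chromosome and strand, and it reads "per kb" as peaks divided by (unique exonic bases / 1000). The function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def unique_bases(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def peaks_per_kb(peaks, exons):
    """Unique peaks overlapping any exon, per kb of unique exonic bases.

    Assumption: 'per kb' means peaks / (unique bases / 1000).
    """
    merged_exons = merge_intervals(exons)
    hits = {
        peak for peak in set(peaks)
        if any(peak[0] < e and s < peak[1] for s, e in merged_exons)
    }
    total = unique_bases(exons)
    return len(hits) / (total / 1000) if total else 0.0


if __name__ == "__main__":
    # Toy data: two overlapping exons plus one separate exon.
    exons = [(0, 500), (400, 1000), (2000, 3000)]
    # One duplicated peak and one peak that falls outside all exons.
    peaks = [(10, 30), (10, 30), (990, 1010), (1500, 1520), (2500, 2530)]
    print(peaks_per_kb(peaks, exons))
```

The per-dataset frequencies obtained this way for a lncRNA biotype and for protein coding transcripts would then be the two samples passed to the t-test.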

Combinatorial Patterns for RNA Binding Protein Interaction Sites in lncRNAs
We explored the possibility of positional clustering of RNA binding protein interaction sites across the lncRNA and protein coding transcripts. For this, we calculated the co-occurrence binding frequencies for each of the 70 RBPs from the six datasets for each of the lncRNAs and protein coding transcripts in the annotation list. For this analysis we did not consider the CLIPdb-Piranha-non-stranded dataset due to lack of strand orientation information.
Bespoke shell scripts were used to identify RBP sites which co-occurred with each other and were therefore grouped together. The coordinates for each RBP peak dataset were intersected separately with both the lncRNA and protein coding exons using BEDtools. These intersecting coordinates were then used to calculate the number of bases shared between each pair of protein datasets to examine their co-occurrence. The values were further normalized by dividing them by the total number of unique bases of the individual RBP datasets that intersected with lncRNA and protein coding exons. The mapping percentage in protein coding transcripts provided the baseline for the co-occurrence frequency of the binding sites. These co-occurrence frequencies were calculated independently for all the RBPs across the six datasets.
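As an illustration of the co-occurrence measure described above, the following sketch computes the number of bases shared by two RBP interval sets and normalizes it by the unique bases of one of them. It is an assumption-laden toy (single chromosome and strand, intervals already restricted to exons, hypothetical function names), not the shell scripts used in the study.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def shared_bases(sites_a, sites_b):
    """Count bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge_intervals(sites_a), merge_intervals(sites_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def cooccurrence(sites_a, sites_b):
    """Fraction of the unique bases of sites_a that sites_b also covers."""
    denom = sum(end - start for start, end in merge_intervals(sites_a))
    return shared_bases(sites_a, sites_b) / denom if denom else 0.0


if __name__ == "__main__":
    rbp_a = [(0, 100), (50, 150), (300, 400)]  # 250 unique bases
    rbp_b = [(100, 200)]                       # 100 unique bases
    print(cooccurrence(rbp_a, rbp_b), cooccurrence(rbp_b, rbp_a))
```

Because the denominator is the unique coverage of the first dataset, the measure is asymmetric; a pairwise heatmap built from it would in general not be symmetric about the diagonal.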

Positional Preference of RNA Binding Protein Interaction Sites in lncRNAs
We also examined the positional preference of the RNA binding protein interaction sites across the length of lncRNA transcripts. As the length of the transcripts varied considerably in our analysis, the length of each long non-coding transcript was normalized to 100 nucleotides and arbitrarily divided into three equal parts, viz., the 5 prime end, the middle region, and the 3 prime end, for comparisons. The notations 5 prime, middle region, and 3 prime denote the positions of the three equal fragments and have no bearing on 5 prime and 3 prime UTRs. Except for datasets analyzed using Piranha that did not have strand information for the called RBP peaks, all datasets were used to check for positional preference. The unique number of bases intersecting with each of the three lncRNA segments was calculated for each dataset. These values were further normalized by dividing them by the unique number of bases in the respective lncRNA segment. Percentage preference was calculated for each segment, and the positional locations of RNA protein-binding sites were enumerated and plotted as heatmaps.
Additionally, we also plotted the counts of the RNA binding protein interaction sites in protein coding transcripts derived from the GENCODE annotation file, and the mappings were divided into 3 regions: 5 prime UTR, coding exons, and 3 prime UTR of the coding genes. The CLIPdb-Piranha-non-stranded dataset was not used for the analysis due to the lack of strand information for the peaks.
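The segment-wise calculation described above can be sketched for a single transcript as follows. This is only an illustration under stated assumptions: binding sites are given in transcript coordinates (0-based, half-open, 5 prime to 3 prime), the thirds are taken by integer division of the transcript length, and the function names are hypothetical. The study aggregated these values per dataset rather than per transcript.

```python
def third_boundaries(length):
    """Split [0, length) into 5 prime, middle, and 3 prime segments."""
    first = length // 3
    second = (2 * length) // 3
    return [(0, first), (first, second), (second, length)]


def positional_preference(length, sites):
    """Percentage preference of binding for each third of a transcript.

    Covered bases per segment are divided by the segment length, and the
    three densities are then rescaled so that they sum to 100.
    """
    covered = set()
    for start, end in sites:
        covered.update(range(max(0, start), min(length, end)))
    densities = []
    for lo, hi in third_boundaries(length):
        n_covered = sum(1 for pos in covered if lo <= pos < hi)
        densities.append(n_covered / (hi - lo) if hi > lo else 0.0)
    total = sum(densities)
    return [100 * d / total if total else 0.0 for d in densities]


if __name__ == "__main__":
    # A 300-nt transcript with sites in the middle and 3 prime thirds.
    print(positional_preference(300, [(100, 150), (250, 300)]))
```

Tracking covered positions in a set keeps the sketch short; for genome-scale data an interval-based count would be the more practical choice.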

Analysis of Mapping of RNA Binding Proteins Datasets
We analyzed publicly available datasets for 70 RNA binding proteins derived from seven datasets encompassing five technologies, viz. PAR-CLIP, HITS-CLIP, iCLIP, RIP-seq, and CLASH. The experimental datasets for RNA binding proteins were downloaded from three databases (details in Table 1). Briefly, the experiments included high-throughput genome-scale analysis of RNA-protein interactions through pull-down and sequencing. The derived data, in the form of interaction sites (or peaks), had been pre-processed using different computational pipelines including PARalyzer, CIMS, Piranha, and CITS for each of the proteins and were mapped onto the hg38 build of the human reference genome. The total number of peaks mapping to the genome for the respective datasets corresponding to each RNA binding protein is detailed in Supplementary Tables 1A,B. Each dataset was kept as a separate file even if the name of the RNA binding protein was the same. This was done to maintain the identity of each dataset, as the number and position of peaks for the same protein differed across databases; these differences could be attributed to the different experimental protocols, including differences in cell lines, conditions, or end points, and to differences in the peak-calling software and downstream computational pipelines. Nevertheless, the differences in the global frequencies have not been influenced by these.

Comparison of RNA Binding Protein Interaction Sites Within lncRNAs and Protein Coding Genes
We compared the interaction sites for each of the RNA binding proteins in lncRNAs as well as protein-coding transcripts. Toward this end, we used the transcript annotations provided by GENCODE v24 (Harrow et al., 2012) for protein-coding transcripts and lncRNAs. In total, the dataset comprised 79,930 protein-coding transcripts from 19,655 genes and 83,215 lncRNA transcripts arising from 32,446 genic loci. We analyzed the distribution of RNA binding protein interaction sites across lncRNAs and protein coding transcripts.
All proteins showed distinct frequency distributions across both protein-coding and long non-coding transcripts. In general, RBP binding was higher in protein coding transcripts than in long non-coding transcripts. On closer inspection, however, a few RBPs showed higher enrichment for particular lncRNA subclasses than for protein coding transcripts. We tested the significance of the enrichment of RBP sites across lncRNA subtypes as opposed to protein coding transcripts using a paired t-test. We observed that six of the biotypes, namely antisense, lincRNA, miscRNA, processed transcripts, retained intron, and sense intronic, were more enriched (p-value ≤ 0.05) for RBP sites than protein coding transcripts in one or another RBP dataset.
We plotted the binding frequencies of RBPs in lncRNAs and protein coding transcripts for each of the seven datasets as separate graphs. Those datasets and biotypes which had significantly higher binding for RBPs have been plotted (Figure 1, Supplementary Figures 1, 2). The RBP binding frequency for the CLIPdb-CIMS dataset was significantly higher in the lincRNA class when compared to protein coding transcripts for all proteins, while the HNRNP (F, H, and U) proteins had consistent enrichment for the miscRNA class (Figure 1). HNRNP complexes help in processing of pre-mRNAs into functional, translatable mRNAs in the cytoplasm. The AGO group from the CLIPdb-Piranha-non-stranded dataset was mostly enriched for the miscRNA, sense intronic, and lincRNA classes compared to protein coding transcripts, while most of the proteins showed enrichment for the miscRNA and lincRNA classes (Supplementary Figure 1). In Supplementary Figure 1B, we observed the miscRNA and lincRNA classes to be enriched for most of the proteins, including AGO proteins, with CSTF2 enriched in the sense intronic class and DGCR8 in the retained intron class. The AGO2 protein is an important part of the RNA-induced silencing complex (RISC) and is required for RNA-mediated gene silencing (RNAi). CSTF2 plays a role in polyadenylation and 3'-end cleavage of mammalian pre-mRNAs. DGCR8 is a component of the microprocessor complex that acts as an RNA- and heme-binding protein and is involved in the initial step of microRNA (miRNA) biogenesis. For the starBase, CLIPdb-CITS, doRiNA, and CLIPdb-PARalyzer datasets, RBPs showed a higher frequency distribution for lncRNAs (miscRNA, retained intron, processed transcript) compared to protein coding transcripts (Supplementary Figures 2A-D). The ATXN2 protein in Supplementary Figure 2D had a binding frequency in the miscRNA class comparable to that in protein coding transcripts. This protein is involved in EGFR trafficking, acting as a negative regulator of endocytic EGFR internalization at the plasma membrane. Proteins from CLIPdb-Piranha-stranded had enrichment for the miscRNA class when compared to protein coding transcripts (Supplementary Figure 2E).
FIGURE 1 | Distribution of RNA binding proteins from CLIPdb-CIMS across 6 biotypes of lncRNA genes and protein-coding genes. The X-axis of the graph shows the distribution of RNA binding protein interaction sites in subclasses of lncRNAs and protein coding genes. The Y-axis represents the normalized frequency of RBP binding, which was calculated as Unique No. of RBP peaks mapped/Unique No. of Exonic bases/1000. Different ranges of frequency are plotted in A (0-0.008) and B (0-0.12).
We additionally chose a random set of 1,000,000 (1 million) genomic loci with an average length of 240 bases as a control set and mapped the RBP sites across this control set. The frequencies of protein binding sites across these random genomic loci, lncRNAs, and protein coding transcripts for randomly chosen RBPs from each of the six datasets are depicted in Supplementary Figure 3, to illustrate that the frequency of protein binding sites in lncRNAs is not an arbitrary event. The observed RBP frequency was significantly lower for these random positions when compared to protein coding transcripts and lncRNAs. This supports the view that the observed RBP distribution frequencies are not merely due to chance but reflect the class of RNA the proteins bind.
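One way such a random control set could be generated is sketched below. The text does not state how the loci or their lengths were sampled, so the length distribution (uniform around a mean of 240), the chromosome-size weighting, the toy chromosome names and sizes, and the function name are all assumptions made for illustration.

```python
import random


def random_loci(chrom_sizes, n, mean_len=240, spread=120, seed=0):
    """Sample n random loci as (chrom, start, end) half-open intervals.

    Assumptions: chromosomes are chosen in proportion to their size and
    lengths are uniform in [mean_len - spread, mean_len + spread].
    Every chromosome must be at least mean_len + spread bases long.
    """
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]
    loci = []
    for _ in range(n):
        chrom = rng.choices(chroms, weights=weights, k=1)[0]
        length = rng.randint(mean_len - spread, mean_len + spread)
        start = rng.randint(0, chrom_sizes[chrom] - length)
        loci.append((chrom, start, start + length))
    return loci


if __name__ == "__main__":
    # Hypothetical toy genome; a real run would use hg38 chromosome sizes.
    toy_genome = {"chrA": 10_000, "chrB": 5_000}
    for locus in random_loci(toy_genome, n=5):
        print(*locus, sep="\t")
```

The resulting intervals can be written out as BED lines and intersected with the RBP peaks in the same way as the exon sets, so that the same normalized frequency is computed for the control.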

Combinatorial Patterns for Protein-Binding Sites in lncRNAs Show Similar Proteins Have Overlapping Binding Sites
The seven datasets considered in this study were observed to map onto lncRNA transcripts as well as protein-coding transcripts. To understand whether they map to a common subset of loci in the respective transcripts, we evaluated the positional overlaps of the binding sites for each protein from these seven datasets individually. The counts of overlaps were measured as a proportion of the total number of independent occurrences of binding sites for each protein. The overlaps were counted separately for all positions in the protein coding transcripts and in lncRNAs. The mapping in protein coding transcripts served as the control set, which provided a fair idea of the general overlap at the genomic scale.
Four proteins from the CLIPdb-CITS dataset, CSTF2, HNRNPC, TARDB, and TIA1, showed maximum co-occurrence with their respective sets of proteins in both protein coding and lncRNA transcripts, while CSTF2, HNRNPC, and TIAL1 co-occurred with each other as well. Our analysis revealed that proteins with similar functions have significantly higher overlap of binding sites with each other, as expected, while EZH2 was an exception in this dataset (Figure 2). Similarly, RBPs from the other five datasets showed the same pattern of co-occurrence between the same sets of proteins, as shown in the heatmaps in Supplementary Figures 4-8. ELAVL1 co-occurred with HUR from the doRiNA dataset with high co-occurrence binding frequency, as both are alternate names of the same protein. HNRNPF co-occurred with HNRNPU; both are part of the same HNRNP complex, and in fact all the HNRNP proteins are related to each other.
Proteins having similar functions, such as the AGO and DGCR8 proteins, co-occurred in both the doRiNA and CLIPdb-CIMS datasets. Similarly, TNRC6 (A-C) proteins co-occurred with AGO proteins in the CLIPdb-Piranha-stranded, CLIPdb-PARalyzer, and starBase datasets. Previous observations have shown that functionally related proteins co-occur, as in the case of TNRC6 with Argonautes, which have been shown to play important roles in microRNA-mediated regulation of transcripts (Baillat and Shiekhattar, 2009; Chen et al., 2009). ATXN2 and TARDB from CLIPdb-PARalyzer are known to associate in one complex depending on the RNA to which they bind, and we observed them to co-occur in our analysis (Elden et al., 2010). In the CLIPdb-PARalyzer dataset, CSTF2 co-occurred with CPSF proteins. Argonaute protein was observed to co-occur with FUS, HNRNP, PTBP1, and PTBP2 in the CLIPdb-CIMS datasets; the literature reports that all these proteins, except AGO, interact with each other, and hence we believe that, since the other proteins co-occur, AGO may also be functionally correlated with these proteins. In the starBase dataset, we also observed that TAF15 and FUS co-occurred. In addition, we observed that FUS and TARDB proteins co-occurred in the CLIPdb-PARalyzer dataset and that the AGO group of proteins from the CLIPdb-CIMS dataset co-occurred with HNRNP2B1, HNRNPF, HNRNPM, and HNRNPU proteins. Other proteins also co-occurred, but with low co-occurrence binding frequency. There was no stark difference in the overlaps of the binding sites between protein coding transcripts and lncRNAs for each of the proteins considered in our analysis.

Positional Clustering of the Protein-Binding Sites
Positional preferences of the RNA binding protein interaction sites were examined across the entire length of lncRNAs. The entire length of a transcript was calculated by summing the lengths of the individual exons in the transcript, and the position of each mapped RNA binding protein interaction site was then calculated across this length. As the length of the transcripts varied, the entire length was arbitrarily divided into three equal parts, viz. 5 prime end, middle region, and 3 prime end. Our analysis revealed that the RNA binding protein interaction sites for most of the proteins mapped mainly to the 3 prime end and the middle segment of the transcripts, as shown in Figure 3 and Supplementary Figure 9. To observe the frequencies of binding sites in protein coding transcripts, we mapped and analyzed the RNA binding protein interaction sites in protein coding transcripts, which were divided into 5 prime UTR, CDS, and 3 prime UTR. The data for these regions were derived from the GENCODE annotation file in the form of BED files. We observed that RNA binding protein interaction sites were distributed across the 3 prime UTR, 5 prime UTR, and coding exons, and that frequencies varied for each protein. The HUR/ELAVL1 protein showed a positional preference toward the 3 prime end across lncRNA transcripts, as also reported recently by Wang and group (Wang et al., 2015) (Figure 4, Supplementary Figures 10, 11).
We further observed that AGO proteins across three datasets, namely CLIPdb-PARalyzer, starBase, and doRiNA, showed a positional preference in protein coding and lncRNA transcripts (Figures 3, 4). When we examined the mapping for the three datasets in protein coding transcripts, we observed that AGO proteins showed a preference toward the 3 prime UTR. Previous reports have shown that AGO proteins bound to miRNAs target the 3 prime end of mRNAs, thereby affecting their translation (Pillai et al., 2004). Such positional preference of AGO proteins for the 3′ end of mRNAs, leading to post-transcriptional silencing, is well established. We observed a similar positional preference for AGO proteins in lncRNAs, suggesting possible regulatory roles.
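The transcript-length calculation used in this section (summing exon lengths and locating a binding site along the spliced transcript) can be sketched as below. The sketch assumes non-overlapping exons of a single transcript given as 0-based, half-open genomic intervals, and the function names are hypothetical; it is not the code used in the study.

```python
def transcript_length(exons):
    """Spliced transcript length: the sum of the exon lengths."""
    return sum(end - start for start, end in exons)


def transcript_position(exons, strand, genomic_pos):
    """Offset of a genomic position from the 5 prime end of the transcript.

    Returns None when the position is intronic or outside the transcript.
    On the minus strand the exons are walked from the highest coordinate
    down, so that offset 0 is always the 5 prime end.
    """
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= genomic_pos < end:
            if strand == "+":
                return offset + (genomic_pos - start)
            return offset + (end - 1 - genomic_pos)
        offset += end - start
    return None


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (500, 600)]
    print(transcript_length(exons))
    print(transcript_position(exons, "+", 350))
    print(transcript_position(exons, "-", 350))
```

Dividing the returned offset by the transcript length gives a relative position between 0 and 1, which can then be assigned to the 5 prime, middle, or 3 prime third.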

High Frequencies of RNA Binding Protein Interaction Sites in a Subset of Transcripts
We also observed that many well-known lncRNAs, including XIST, NEAT1, OIP5-AS1, and MALAT1, had a large number of RNA binding protein sites across their length. A subset of well-annotated lncRNA genes had a consistently large number of binding sites for the majority of the proteins considered. MALAT1 (metastasis associated lung adenocarcinoma transcript 1), a well-studied lncRNA with intricate roles in the pathophysiology of cancer metastasis, is one such candidate (Gutschner et al., 2013). MALAT1 is highly conserved among mammals and is known to be localized in the nucleus. We plotted the binding sites for all RBPs along the full length of the MALAT1 transcript, as shown in Figure 5 for the CLIPdb-CIMS, CLIPdb-CITS, and CLIPdb-Piranha-stranded datasets. We combined all the datasets for each protein within a database and divided them into three classes (Cytoplasmic, Nuclear, or Both) based on their cellular localization. The distribution profiles for all the RBPs across the MALAT1 gene were derived using the UCSC Genome Browser (Meyer et al., 2013).
We observed that the RBPs known to localize in the nucleus had more binding sites across MALAT1 when compared to other RBPs. The functional interaction of MALAT1 with a number of RNA binding proteins has been previously studied (Tripathi et al., 2010), suggesting an extensive functional link to the interactions and thereby providing interesting insights into lncRNA functions and the biological regulatory networks they take part in. The mapping for all other datasets across the MALAT1 lncRNA is shown in Supplementary Figures 12, 13.

DISCUSSION
LncRNAs have lately emerged as one of the major transcript forms encoded by the human genome, with their number growing over the years to match that of protein-coding transcripts. GENCODE v24 has 83,215 lncRNA transcripts compared to 79,930 protein-coding transcripts. The functional roles of many candidate lncRNAs have been extensively studied in the recent past; nevertheless, the general lack of conservation of lncRNAs, even between closely related organisms, barring a handful of candidate lncRNAs, has restricted the possibility of modeling the functions of lncRNAs in model systems.
The availability of genome-scale assays for evaluating protein-binding sites in RNA (Kishore et al., 2011) has offered new opportunities to address this issue at much higher confidence and resolution than were provided by computational approaches (Bellucci et al., 2011; Puton et al., 2012). To date, seven datasets for genome-scale protein-RNA interactions are available in the public domain (i.e., doRiNA, CLIPdb, starBase), and the present analysis makes use of all these available datasets. We show that such approaches involving repurposing of datasets could provide immense insights into the biological functions and potential regulation of lncRNAs.
In the present study, we have used the peak information (or the most probable site of interaction between protein and RNA) from seven datasets processed through standardized computational pipelines for accurate assessment of protein-RNA interaction sites (doRiNA, CLIPdb, starBase). This allowed us to compare the frequencies of the protein binding sites in a systematic fashion. It has not escaped our attention that the datasets encompass a diverse set of experiments, cell lines, and experimental protocols; nevertheless, our findings hold true despite these differences across the publicly available data used in this analysis, encompassing six experimental databases of RNA binding proteins. For instance, for one of the most studied RBPs, Argonaute, the datasets showed similar trends regardless of the diverse experimental protocols (HITS-CLIP, iCLIP, PAR-CLIP) and analysis methodologies employed.
FIGURE 5 | Depiction of the mapping of RNA binding protein interaction sites from CLIPdb-CIMS, CLIPdb-CITS, and CLIPdb-Piranha-stranded datasets across the length of MALAT1 lncRNA. The RBPs highlighted in gray boxes are the ones generally localized to the cytoplasm (C). The RBPs generally localized to the nucleus (N) are marked with yellow boxes. C/N-labeled RBPs are the ones which are present in both the nucleus and the cytoplasm.
The RBPs considered in our study are known to be involved in varied functional roles including silencing, splicing, stability, mRNA processing, and transport. In the current study, we observed that RBPs enriched for specific lncRNA biotypes are involved in diverse functions, suggesting their probable functional mechanism of action. RBPs such as AGO, DGCR8, EWSR1, TNRC6A/B/C, and FUS, involved in maintenance of the stability of RNA, showed significant enrichment for the lincRNA, miscRNA, and retained intron subclasses, suggesting that these lncRNAs might be acting as either transporters or sponges for these RBPs. Another set of RBPs, such as the CPSF complex, FBL, TAF15, and the HNRNP family, which play a role in mRNA processing, were enriched in lncRNA subclasses, suggesting that lncRNAs in turn might be acting as guides. These proteins might also be involved in the mechanism of lncRNA biogenesis. Enrichment was also observed for proteins such as ATXN2, C17ORF85, and HNRNPs, which are predominantly involved in the export and transport of RNA moieties, in addition to proteins such as EIF4A3, FOX2, PTBP1, QKI, SFRS1, and SRRM4, among others, which are predominantly involved in splicing. Hence our analysis suggests that the interaction of lncRNAs with such types of RBPs provides hints about the possible functional roles lncRNAs might be playing, which can be validated by experimental approaches.
We also highlight the localization of lncRNAs and RBPs within a cell. We classified the RBPs based on their known localization within cells and overlapped their binding sites with MALAT1, which is an established nuclear-enriched lncRNA. The results indicated that the intensity of binding by nuclear-localized RBPs was higher for MALAT1 across all seven datasets. This further supports the view that these binding events are not arbitrary and that the RBPs indeed interact with co-localized lncRNAs.
The present analysis reveals a set of interesting characteristics of protein-RNA interaction in the context of lncRNAs: (1) high frequency of RNA-protein interaction sites in lncRNA subclasses; (2) co-occurrence of RNA binding protein interaction sites; and (3) positional preference for the binding sites across the transcript length. This analysis, to the best of our knowledge, is the most comprehensive analysis of RNA binding protein interaction sites in lncRNAs, and provides the basis for further analysis of the functional consequences of these patterns. It has also not escaped our attention that targeting protein-interaction sites, and thus the associated functions, could be explored therapeutically in the future. Recent reports from other laboratories have explored the possibility of targeting RNA structures using small molecules (Jamal et al., 2012; Bose et al., 2013). Further availability of genome-scale protein-RNA interaction datasets and of tools to query RNA secondary structures at genome scale (Hofacker, 2003) would provide immense opportunities toward understanding the entire repertoire of functional RNA interactions and phenotypic correlates at a genome-scale level. This would also form a much-needed knowledge resource to query and understand the consequences of genomic variations at these loci.

CONCLUSION
The interactions between proteins and RNA molecules can provide essential insights into the functioning of lncRNAs. In this study, we highlight the enrichment of RBP sites across some of the lncRNA transcript classes in comparison with protein coding transcripts. We have systematically demonstrated that proteins having similar functional roles showed a higher co-occurrence across both lncRNA and protein coding transcripts. Also, the positional preference of most of the RBPs agreed with their possible functional roles. Our study provides a compendium of lncRNA and RBP interactions, suggesting a large number of functional roles that they can play, including silencing, splicing, mRNA processing, export, and transport.

AUTHOR CONTRIBUTIONS
VS conceptualized the analysis. Data analysis was performed by SJ and SG. SJ prepared the data summaries and visualization. SJ and SG wrote the manuscript. All authors reviewed the manuscript.

ACKNOWLEDGMENTS
Authors would like to acknowledge Dr. S. Ramachandran and Dr. Sheetal Gandotra for their valuable discussions which helped in compiling the analysis of this study and writing of the manuscript.