Edited by: Yungang Xu, The University of Texas Health Science Center at Houston, United States
Reviewed by: Xue Xiao, The University of Texas Southwestern Medical Center, United States; Renzhi Cao, Pacific Lutheran University, United States
This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
It is increasingly appreciated that long non-coding RNAs (lncRNAs) associated with alternative splicing (AS) could be involved in aggressive hepatocellular carcinoma. Although many recent studies show the alteration of RNA alternative splicing by deregulated lncRNAs in cancer, the extent to which and how lncRNAs impact alternative splicing at the genome scale remains largely elusive. We analyzed RNA-seq data obtained from 369 hepatocellular carcinomas (HCCs) and 160 normal liver tissues, quantified 198,619 isoform transcripts, and identified a total of 1,375 significant AS events in liver cancer. In order to predict novel AS-associated lncRNAs, we performed an integration of co-expression, protein-protein interaction (PPI) and epigenetic interaction networks that links lncRNA modulators (such as splicing factors, transcript factors, and miRNAs) along with their targeted AS genes in HCC. We developed a random walk-based multi-graphic (RWMG) model algorithm that prioritizes functional lncRNAs with their associated AS targets to computationally model the heterogeneous networks in HCC. RWMG shows a good performance evaluated by the ROC curve based on cross-validation and bootstrapping strategies. As a conclusion, our robust network-based framework has derived 31 AS-related lncRNAs that not only validates known cancer-associated cases MALAT1 and HOXA11-AS, but also reveals new players such as DNM1P35 and DLX6-AS1with potential functional implications. Survival analysis further provides insights into the clinical significance of identified lncRNAs.
Alternative splicing (AS) events are frequently observed in tumorigenesis and serve as cancer-driving genes. AS can originate from somatic mutations that disrupt splicing regulatory mechanisms or influence the expression levels of splicing factors or transcription factors (
Studies from
Proteins that have multiple splicing regulators and that promote the transformation of target genes generally get triggered by transcriptional factors (TFs). For example, the transcription regulator MYC, induces upregulation of hnRNP A1/2, that, in turn, regulates alternative splicing events in expressing the cancer-associated pyruvate kinase M2 (PKM2) isoform (
Although studies have identified the correlation of lncRNAs and AS to be important in cancer prognosis, there still remains gaps within current studies as only a few cancer-related AS events are known to be regulated by lncRNAs. In addition, it was not clear how the lncRNAs were linked to specific AS sites, hence, providing no evidence to correlate clinical outcomes. Next-generation sequencing technologies have helped identify ∼40K novel lncRNAs cancer, whose regulatory functions in AS remain unknown in tumorigenesis. Hence, computationally predicting novel lncRNAs and associated alternative splicing events may help in the comprehensive understanding of the HCC disease at a systems level.
In this study, we established an innovative technology for propagating molecular networks called the random walk-based multi-graphic (RWMG) model. The RWMG model simultaneously integrates sophisticated biological connections among lncRNA targets [such as transcription factors (TF), alternative splice factors (ASF), and microRNAs] based on both biophysical interaction networks and their co-expression profiles within a single analytical framework. When comparing conventional random walk algorithms that considers equal proportion of all input genes, our flexible and scalable method can be formulated to rank a subset of lncRNAs based on literature survey. In addition, the method we propose has better accuracy than other previously defined “shortest path” network-based algorithms, with advantages of overcoming “noise” and “incomplete” dimensional heterogenicity from the data.
In addition, previous published reports on comparing tumor and normal tissues are generally limited to normal adjacent tissues (NAT). However, these tissues are not truly “normal” as they are usually surrounded by tumor contaminations. Therefore, many potential cancer biomarkers involved in AS may be missed. Hence, to increase the performance of such analysis, we combined healthy liver tissue samples that were downloaded from GTEx along with expression data from TCGA.
The framework of the underlying biological hypothesis and model assumption for this project is described in
We performed a method of trimmed mean of
In order to analyze alternative splice isoforms, we first discarded those isoforms that contained nil values on abundance levels across all the samples. We used R package “IsoformSwitchAnalyzeR” to analyze individual isoform switches from T/N comparison and their biological processes (
We collected physical interaction information of lncRNAs and associated targeted genes through database searching and text mining. These interactions were evidenced from experimental validations, neighboring gene pairs, gene fusions, and co-occurrence of lncRNAs that connect with miRNA-, TF-, ASF-, and switched genes. Furthermore, HCC lncRNA-target networks were compiled from the following resources:
Features that were enriched in AS regulatory pathways were collected from pathCards (
Finally, identified HCC-DE lncRNAs, pseudogenes, and miRNAs were mapped to the global regulatory networks to construct HCC-specific sub-networks that contain switched genes as the targets or TF/ASF as the co-effectors of non-coding RNA regulators.
Pearson correlation was used to estimate the lncRNA co-expression relationships at isoform level. We only included connections for the pairs of lncRNA and protein-coding genes, with absolute correlation coefficient greater than 0.75 and FDR
Random walk multi-graphic (RWMG) model is an integrative application of random walk with restart (RWR) algorithm on multiple layers of heterogeneous network. Our framework is encoded with data sets for the same cohort of patients including:
Co-expression network, which is a bipartite graph containing the association between
Epigenetic regulatory network, which is also a bipartite graph containing the association between
Splicing pathway PPI networks, which is an
We first create an extended graph
is the average of all included edge scores connecting nodes
Each entry
Therefore, we can write the RWMG model on a multi- and heterogeneous- graph
where vectors
To evaluate our approach’s sensitivity, we simulated different random walk strategies for optimization. We created a list of experimentally validated AS-associated genes as “gold-standard” true positive genes (TPG) curated from the careful literature review and randomly selected genes as the “gold-standard” true negative (TNG). We chose the “best” model that has the most candidates significantly enriched in the “gold-standard” gene list. In reality, the number of TPG is much smaller compared to TNG. To avoid bias from highly imbalanced data between these two sets, we performed a bootstrap resampling technique by selecting an equal number of data as TNG. This process was repeated 10 times, and the overall performances were calculated by the mean value of these performances.
To confirm the pathogenic characteristics of identified lncRNAs and pseudogenes, univariate Cox proportional model was used to evaluate the association of selected genes with overall survival outcomes. Kaplan–Meier plots and log-rank test statistics were used to visualize the high- and low-risk groups. The cutoff of the high- and low-risk group was determined by the median value of the normalized count of selected genes.
We identified 369 DE lncRNA genes and 171 DE pseudogenes from T/N comparison (
Hepatocellular carcinoma (HCC)-specific long non-coding RNAs (lncRNAs)
From the expression levels of isoform when comparing tumor and normal samples, we identified 1,375 isoforms that had switching properties and that mapped to 1,078 unique genes. Among these switched isoforms, 1,251 were protein-coding isoforms, and 124 were non-coding isoforms that included antisense, lncRNA and pseudogenes (
Genome-wide transcript analysis for switched isoforms between tumor vs. normal comparison HCC.
The overview of switched isoforms impacting the biological function alterations in HCC is shown in
Importantly, we confirmed 20 genes with switched isoforms that were involved in AS regulatory functions (
Statistic summary of splicing factor genes with alternative switched isoforms.
ENST00000555295.1 | ENSG00000100836.10 | PABPN1 | 0.182 | 1.10E–32 |
ENST00000459687.5 | ENSG00000100410.7 | PHF5A | 0.172 | 6.07E–18 |
ENST00000411938.1 | ENSG00000128534.7 | LSM8 | 0.169 | 2.49E–19 |
ENST00000553192.5 | ENSG00000139343.10 | SNRPF | 0.152 | 6.22E–21 |
ENST00000297157.7 | ENSG00000164610.8 | RP9 | 0.145 | 2.68E–19 |
ENST00000491106.1 | ENSG00000060688.12 | SNRNP40 | 0.128 | 4.28E–19 |
ENST00000560313.2 | ENSG00000090470.14 | PDCD7 | 0.124 | 2.17E–06 |
ENST00000301785.5 | ENSG00000214753.2 | HNRNPUL2 | 0.116 | 1.19E–28 |
ENST00000402849.5 | ENSG00000100028.11 | SNRPD3 | 0.113 | 1.67E–11 |
ENST00000535326.1 | ENSG00000110107.8 | PRPF19 | 0.103 | 1.79E–07 |
ENST00000597776.1 | ENSG00000130520.10 | LSM4 | 0.102 | 2.31E–34 |
ENST00000472237.5 | ENSG00000132792.18 | CTNNBL1 | 0.102 | 5.80E–13 |
ENST00000548994.1 | ENSG00000075188.8 | NUP37 | 0.101 | 1.68E–11 |
ENST00000564651.5 | ENSG00000102978.12 | POLR2C | 0.1 | 3.01E–14 |
ENST00000505885.1 | ENSG00000096063.14 | SRPK1 | –0.108 | 1.94E–11 |
ENST00000404603.5 | ENSG00000100028.11 | SNRPD3 | –0.109 | 2.30E–15 |
ENST00000540127.1 | ENSG00000214753.2 | HNRNPUL2 | –0.116 | 2.99E–48 |
ENST00000367208.1 | ENSG00000182004.12 | SNRPE | –0.13 | 1.79E–31 |
ENST00000527554.2 | ENSG00000100697.14 | DICER1 | –0.139 | 2.98E–21 |
ENST00000595761.1 | ENSG00000213024.10 | NUP62 | –0.157 | 3.62E–31 |
ENST00000488937.1 | ENSG00000136875.12 | PRPF4 | –0.159 | 6.94E–12 |
ENST00000559051.1 | ENSG00000090470.14 | PDCD7 | –0.163 | 9.60E–15 |
ENST00000216252.3 | ENSG00000100410.7 | PHF5A | –0.216 | 4.30E–21 |
In order to identify which lncRNAs were associated to switched isoforms at the transcript level, we constructed a co-expression network that comprised lncRNA and genes with switched isoforms. Different from traditional gene level co-expression network, the connections between lncRNA and genes with multiple splicing isoforms could be singular or multiple when interacting between molecules. The lncRNAs-switched isoform connections are summarized in
However, since the lncRNA regulation mechanism involved in AS events was comprehensive, AS regulation may not directly be reflected from expression abundance, but through physical interaction or DNA/RNA binding sites. LncRNAs could influence gene-splicing patterns by inhibiting and activating the expression of ASFs, or through transcription factors that indirectly interact with splicing factors and ultimately cause changes in AS factor-targeted gene expression. Hence, constructing a comprehensive gene regulatory network that includes TF, AS regulators, and lncRNAs could allow better understanding of the mechanism of AS in cancers.
The ROC curve shown in
Statistic summary of predicted top-ranked non-coding RNAs associated with alternative splicing (AS) ranking by random walk-based multi-graphic (RWMG) score.
LINC00675 | 1 | 0.00368174 | LincRNA |
CTD-2171N6.1 | 2 | 0.002824633 | LincRNA |
HOTTIP | 3 | 0.002677841 | Antisense |
DNM1P35 | 4 | 0.002483954 | Antisense |
LEF1-AS1 | 5 | 0.002397948 | Antisense |
AP006285.7 | 6 | 0.002123275 | LincRNA |
WARS2-IT1 | 7 | 0.002081091 | Antisense |
LINC00355 | 8 | 0.002063983 | LincRNA |
RP11-81H3.2 | 9 | 0.002032482 | LincRNA |
HOXA11-AS | 10 | 0.002028198 | Antisense |
RP11-261N11.8 | 11 | 0.002005615 | Antisense |
RP3-355L5.4 | 12 | 0.001938399 | Antisense |
RP11-138J23.1 | 13 | 0.001929433 | LincRNA |
RP11-525K10.3 | 14 | 0.001923733 | Antisense |
RP11-495P10.7 | 15 | 0.001917542 | LincRNA |
DLX6-AS1 | 16 | 0.00188837 | Antisense |
RP11-356C4.5 | 17 | 0.001861006 | LincRNA |
CDKN2B-AS1 | 18 | 0.001856714 | Antisense |
RP11-495P10.5 | 19 | 0.001834267 | LincRNA |
SFTA1P | 20 | 0.001751498 | LincRNA |
PRSS51 | 21 | 0.001750058 | Antisense |
MALAT1 | 22 | 0.001672339 | LincRNA |
FEZF1-AS1 | 23 | 0.001669135 | Antisense |
RP4-530I15.9 | 24 | 0.001619806 | Antisense |
RP11-158M2.5 | 25 | 0.001618054 | Antisense |
CTD-2374C24.1 | 26 | 0.001617345 | LincRNA |
PWRN1 | 27 | 0.001605646 | LincRNA |
CTC-573N18.1 | 28 | 0.001534221 | LincRNA |
RP11-284F21.9 | 29 | 0.001527715 | LincRNA |
RP11-3J1.1 | 30 | 0.001523171 | LincRNA |
FENDRR | 31 | 0.001509286 | LincRNA |
Among the top predicted lncRNAs that were involved in AS, we further confirmed their clinical significance. As a result of univariate survival analysis screening, a total of 51 lncRNAs and 24 pseudogenes were found to be associated with HCC overall 5-year survival, respectively (
Survival analysis for the identified lncRNA and pseudogenes involved in AS mechanisms. Hazard ratio plots from Cox regression analysis for top 10 lncRNAs
In the last decade, studies have investigated the association of splicing isoforms and lncRNA profiles from deep sequencing technologies. For instance, it has been known that some small nuclear uridine (U)-rich RNAs (snRNPs) are core components of the pre-mRNA processing spliceosome and can collaborate with some splicing factors that are encoded by heterogeneous nuclear ribonucleoprotein complex subunits (hnRNPs) in order to fine tune complex splicing regulations (
The primary mechanisms involving lncRNAs in AS modulation can be classified into three ways that include: (i) lncRNAs that directly influence isoform expression through activation or inhibition mechanism; (ii) lncRNAs that form RNA–RNA duplexes with pre-mRNA molecules, and (iii) lncRNAs that affect target AS genes through indirectly inhibiting or promoting the expression of splicing factors or through transcript factors. However, most previous studies only focus on individual genes and/or isoform switches regulated by lncRNAs. More comprehensive interactions can be detected at the isoform level besides the gene level. Our predictions identified several candidates that were either oncogenes or tumor suppressors and lncRNAs whose somatic alterations were associated with AS at both isoform and gene level in addition to showing clinical significance in HCC patients.
At the transcriptional level correlation network, we found that majority of lncRNA isoforms were correlated with more than one AS event, among which some were showing opposite roles in the AS regulations. In addition, we can see that many lncRNAs may partially compete with the same AS event. For example, the pseudogenes of UBE2S, which are UBE2SP1, UBE2SP2, and UBE2MP1, are significantly correlated with FEN1’s intron retention and Alternative 5′ donor site mechanisms (
AS events are mainly regulated by splicing factors, which bind to pre-mRNAs and influence exon selection and splicing site choice. Moreover, TFs activate or suppress the expression of ASF. Importantly, we found ASF that may have switched isoforms. A switched ASF RP9, which can be bound by the proto-oncogene PIM1 product, a serine/threonine protein kinase, also can cause its target PIM1 to get switched. Although TFs were usually thought for a long time to encode a single protein that changes the expression of their target genes, more and more TFs are now found to be alternatively spliced (
The association of CDKN2B-AS1, also known as ANRIL, with HCC has been reported in several studies (
In this project, we extended a previous existing algorithm into multiplex and heterogeneous networks. The research community can explore different layers of the epigenetic regulatory network, expression correlation network, and protein interaction network. A recent Nature Review paper by
To our best knowledge, this is the first attempt to predict lncRNA regulations on AS using a rigorous, multi-graphic approach by the integration of large-scale and complex networks. Of interest for potentially limiting the accuracy of random walk and network propagation methods are an incomplete collection of known lncRNAs, especially pseudogenes, used to supervise prediction of new candidates. As such, we addressed several unique challenges associated with these dataset complexities in each step. For example, in the data preprocessing steps, we carefully address the challenges by collecting as many as experimentally verified and predicted lncRNAs that were taking account of AS. In our statistical modeling steps, we specifically addressed the robustness of complex data integration, especially for non-informative or noisy datasets. Also, we investigated several random walk strategies by trying different groups of vertices such as lncRNAs, ASFs, and TFs as a starting point to optimize our models.
However, the lncRNA regulatory mechanism is complicated, as its mechanism differs with different stages, such as the pre-mRNA or post-mRNA stage. Therefore, the major limitation of this article is we were not able to consider other comprehensive mechanisms at different stages, such as recognition of the splicing site can be modulated by
Publicly available datasets were analyzed in this study. TCGA and GTEx data can be found in data hubs in
JW contributed to the analysis and interpretation of data. XW contributed to mathematics model interpretation. AB reviewed and edited the manuscript. AB and SY contributed to data analysis, provided feedback, and edited the manuscript. YC and KX contributed to the machine learning predictive model design. YM contributed to the project design and interpretation of biological meanings. YZ led the project, provided the guidance, and prepared the manuscript. All the authors read and approved the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This manuscript has been released as a pre-print at SSRN:
The Supplementary Material for this article can be found online at:
Statistic summaries for significantly differential expression genes between tumor and normal comparison for lncRNAs and pseudogenes.
Statistic summaries and functional analysis for significantly switched isoforms between tumor and normal comparison for lncRNA and protein-coding genes, as well as prediction of splicing event patterns for switched genes. Gene ontology and KEGG pathway enrichment analysis are performed for upregulated isoforms (gain) and downregulated isoforms (loss), respectively.
Predicted lncRNA interaction pairs at the transcriptional co-expression level.
Statistic summary of AS-associated lncRNAs and PCGs ranking by RWMG predictive score.
Statistic summary of significant lncRNAs and pseudogenes associated with overall survival.
Total number of node and edge distribution for the three networks.