Comprehensive network analysis reveals alternative splicing-related lncRNAs in hepatocellular carcinoma

A number of recent studies have highlighted the findings that certain lncRNAs are associated with alternative splicing (AS) in tumorigenesis and progression. Although existing work showed the importance of linking certain misregulations of RNA splicing with lncRNAs, a primary concern is the lack of genome-wide comprehensive analysis for their associations. We analyzed an extensive collection of RNA-seq data, quantified 198,619 isoform expressions, and found systematic isoform usage changes between hepatocellular carcinoma (HCC) and normal liver tissue. We identified a total of 1375 splicing switched isoforms and further analyzed their biological functions. To predict which lncRNAs are associated with these AS genes, we integrated the co-expression networks and epigenetic interaction networks collected from text mining and database searching, linking lncRNA modulators such as splicing factors, transcript factors, and miRNAs with their targeted AS genes in HCC. To model the heterogeneous networks in a single framework, we developed a multi-graphic random walk (RWMG) network method to prioritize the lncRNAs associated with AS in HCC. RWMG showed a good performace evaluated by ROC curve based on cross-validation and bootstrapping strategy. As a summary, we identified 31 AS-related lncRNAs including MALAT1 and HOXA11-AS, which have been reported before, as well as some novel lncRNAs such as DNM1P35, HAND2-AS1, and DLX6-AS1. Survival analysis further confirmed the clinical significance of identified lncRNAs.


Introduction
Alternative splicing (AS) changes are frequently observed in cancer and may serve as the cancer driver genes. They could originate from somatic mutations that dysfunctions the splicing regulatory mechanisms or influences the expression changes of splicing factors or transcript factors 1 . Therefore, AS genes recognized as important signatures for tumorigenesis are of significant values in developing therapeutic targets in cancer clinical trials. For example, SF3B1targeting compounds spliceosome inhibitor E7107 which have been implemented in advanced Numerous studies showed that the long non-coding RNAs "lncRNAs" (>200nt in length) are associated with a number of AS mechanisms 3,4 . LncRNAs may interact with specific alternative splicing factors (ASF) or through other intermediate molecules affecting chromatin remodeling to fine-tune the splicing of target genes 4 . For instance, our previous experimental study showed the MALAT1 regulated a ASF, SRSF1 (SF2), in gastric cancer cells 5,6 . Ji et al. reported MALAT1 promotes tumor growth and metastasis in colorectal cancer through binding to SFPQ and releasing oncogene PTBP2 7 . LINC01133 interacts with splicing factor SRSF6 in colorectal cancer 8 , acting as a target mimic for SRSF6 interaction with EZH2 and LSD1 in non-small cell lung cancer (NSCLC) 9 .
A number of splicing regulatory proteins that promote the transformation of their target genes can be triggered by transcriptional factors (TFs). For example, a TF, MYC, induced upregulation of hnRNP A1 and hnRNP A2, which in turn, regulate alternative splicing of pyruvate kinase to promote expression of the cancer-associated pyruvate kinase M2 (PKM2) isoform 10,11 .
There exist more comprehensive lncRNA regulatory mechanism in AS, since lncRNAs are either pre-transcriptional or post-transcriptional specialists, acting as decoys to draw effectors (miRNAs, TFs, or ASFs) away from their targets, as cofactors or guides to alter TF-promoter interactions, and as molecular switches to alter TF or ASF activity across multiple targets.
Although previous efforts identified both lncRNAs and AS that may be important in cancer, gaps exist in current studies that only a few cancer-related AS that are regulated by lncRNAs have been extensively investigated. However, their methodology either didn't genomewidely link lncRNAs to comprehensive AS mechanisms or correlate with clinical outcomes.
Furthermore, to date, next-generation sequencing technologies have facilitated the identification of an accumulation of ~40K novel lncRNAs, whose regulatory functions in AS remain unknown in tumorigeneses. Therefore, a key open question is: how many novel lncRNAs are associated with AS modulations in tumorigenesis, genome-wisely.
Here we utilized a novel network propagation technology, random walk-based multigraphic model (RWMG), to simultaneously integrate complicated biological connections among lncRNA -effectors (TF, ASF, and miRNA) -AS interaction networks and co-expression networks in a single analysis framework. This method is an extended application inspired by Random walk with restart algorithm to prioritize important lncRNAs that are involved in AS based on the hypothesis that more important genes are likely to receive more links from other networks. In comparison with traditional random walk algorithms, which treat all genes equally, our flexible, scalable method can be formulated to rank a subset of vertices (e.g., PCGs, lncRNAs), based on pre-knowledge as the starting walking vertices. This method is more accurate than other traditional "shortest path" network-based integrative methods, as it can overcome the "noisy" and "incomplete" highly dimensional heterogeneous data.
In addition, previous tumor and normal comparison studies are limited to normal adjacent to tumor (NAT) tissues. However, these tissues are not truly 'normal' as they usually surrounded by tumor contaminations. Therefore, many potential cancer biomarkers involved in AS may be missed. By combining the 'pure' healthy liver organs from GTEx with TCGA expression data, we broaden the scope of suspected candidates with a variety of AS patterns.

Data description and project design
The framework of the underlying biological hypothesis and model assumption for this project can be found in Fig S1 A was re-processed with UCSC's Xena Toil 12 to quantify gene level and transcript isoform level expression for both coding and non-coding genes. Given that the non-coding transcript expression of RNAseq data contains many small and uncertain transcripts, we filtered out the small and uncertain transcripts but kept transcripts for intergenic lncRNAs, antisense, sense_intronic, sense_overlapping, processed_transcripted, and processed_pseudogene categories based on GENCODE v23 annotation 13 .
We performed a trimmed mean of M-values (TMM) normalization method for RNAseq count data 14 so that the expression level for lncRNAs and pseudogenes are comparable. The TMM normalized data were further transformed to log2-counts per million for linear modeling. HCC differentially expressed (DE) lncRNAs and pseudogenes between tumor and normal samples (T/N) were analyzed by R package limma 15 with cutoff settings (P<1.0E-04 and Fold Change > 2). The method to identify DE miRNAs has been reported in our previous work 16 . These identified HCC specific non-coding DE genes (lncRNAs, pseudogenes, and miRNAs) are expected to represent potential key functions in liver tumorigenesis.

Analysis of alternative splicing isoforms and functional consequence
We first removed isoforms which are all zero counts across all the samples. We used R package "IsoformSwitchAnalyzeR" to analyze individual isoform switches from T/N comparison and their biological consequence changes 17,18 . Differentially switched isoforms between T/N were determined by the following criteria: difference in isoform fraction (dIF) > 0.1 and FDR corrected q-value < 0.05. The functional consequences of switched isoforms were further analyzed for protein-coding potential(CPAT) 19 , Non-sense mediated decay (NMD) status, protein domains (Pfam) 20,21 , and the amino acid sequence of open reading frames (ORF). We used 0.364 as suggested to distinguish coding and non-coding isoforms in CPAT analysis. NMD is a process that recognizes mRNAs carrying a premature termination codon (PTC) and triggers their degradation to prevent the synthesis of dysfunctional or even harmful proteins. AS that controls gene expression is an important process facilitating mRNA degradation in specific isoforms that would lead to NMD 22 . Since we know the exon structure of all isoforms in a given gene (with isoform switching), we obtained their corresponding spliced nucleotide sequence and corresponding  Genes related to AS regulatory pathway were collected from pathCards 34 , KEGG spliceosome 35 , NCBI Biosystems mRNA processing 36 , REATOME mRNA splicing pathway and processing of capped intron-containing pre-mRNA pathway 37 . These genes were involved in an essential component of splicing factors or non-snRNA spliceosome required for the second catalytic step of pre-mRNA splicing. Among these collected 335 splicing regulator genes, 86 are experimental validated as alternative splicing factors (ASF). ASF and target genes interactions were manually confirmed from SpliceAid 2 38 , ASF motif analysis from SFmap 39 , a subset of RNAbinding protein network by Chiu et. al 25, Table S6 , and STRING database 40 .
Finally, identified HCC DE lncRNAs, pseudogenes, and miRNAs were mapped to the global regulatory networks to construct HCC-specific sub-networks that contain switched genes as the targets or TF/ASF as the co-effectors of non-coding RNA regulators.

Construction of HCC lncRNA-AS co-expression interaction networks at isoform level
Pearson correlation was used to estimate the lncRNA co-expression relationships at isoform level. We only kept the connections for the pairs of lncRNA and protein-coding genes (PCGs), if their absolute correlation coefficient > 0.75, FDR p <0.05. The types of PCGs included the TFs, ASFs, and genes with isoform switches. Directions of lncRNAs that are negatively correlated with their targeted PCGs were predicted to inhibit their expression, while positive correlation indicates activation.

Random walk multi-graphic (RWMG) model for the integration of heterogeneous interaction networks
Random walk multi-graphic (RWMG) model is an integrative application of page rank with restart algorithm (RWR) on multiple layers of networks. Detailed method description can be found in our previously published report 41 . Briefly, given a graph G(V,E), ( , )∈V are the vertices lncRNA i and AS genes j , and edge ( , )∈ is weighted by the connectivity score between these vertices. Multiple edges are allowed to connect between any two vertices based on the relationship defined from the co-expression network, epigenetic regulatory network and splicing pathway PPI networks. Assuming the total number of lncRNAs to be m and of AS genes to be n, the probability for which lncRNA i will traverse to AS genes j is defined by the adjacency matrix: . A "teleportation term" is added to for the sake of numerical stability. Thus the transition matrix for networks is defined as , where an m by n matrix with all entries is equal to 1, α is the probability of a lncRNA jumping to one of its neighbors with the probability governed by matrix , and 1− defines the probability of this lncRNA jumping randomly to any other vertex and relinquishing the matrix for traversal.
The final multi-dimensional heterogeneous network will be merged for the overlapped lncRNA node features and taken the union of distinct nodes to augment each individual network with missing connections. We implemented the RWR algorithm on the final multi-graphic network by R packages dnet and igraph 42 . Network visualization was performed by R package visNetwork 43 . Those genes with known roles in regulating AS network will be set as the "seed" nodes in advance to predict the "new" lncRNAs, based on move probabilities from the current node to any of their randomly selected neighbors.
To evaluate our approach's sensitivity, we simulated different random walk strategies for optimization. We created a list of experimentally validated AS associated genes as "gold-standard" true positive genes (TPG) curated from the careful literature review and randomly selected genes as the "gold-standard" true negative (TNG). We chose the "best" model that has the most candidate significantly enriched in the "gold-standard" gene list. In reality, the number of TPG is much smaller compared to TNG. To avoid bias from highly imbalanced data between these two sets, we performed a bootstrap resampling technique by selecting an equal number of data as TNG. This process was repeated 10 times, and the overall performances were calculated by the mean value of these performances.

Survival analysis for prognostic confirmation of identified pathogenic lncRNAs and Pseudogenes.
To confirm the pathogenic characters of identified lncRNAs and pseudogenes, univariate Cox proportional model was used to evaluate the association of selected genes with overall survival outcomes. Kaplan-Meier plots and log-rank test statistics were used to visualize the highand low-risk groups. The cutoff of the high-and low-risk group was determined by the median value of the normalized count of selected genes.

Significantly expressed lncRNAs and pseudogenes in HCC
We identified 369 differentially expressed (DE) lncRNA genes and 171 DE pseudogenes from T/N comparison ( Table S1). The visualizations of DE lncRNAs and pseudogenes were shown in volcano plots ( Fig. 1). Compared to other studies, many DE LncRNAs, such as MALAT1, CDKN2B-AS1, and HOTTIP, have been reported to be associated with liver cancers before [44][45][46] .
We also highlighted several important pseudogenes, such as HNRNPA1P4, HNRNPA1P21, which are the pseudogenes of heterogeneous nuclear ribonucleoproteins A1 (hnRNPs) who play key roles in the regulation of alternative splicing. We performed DE analysis as the initial screen step to narrow the focus of the HCC specific non-coding genes associated with AS for the downstream network analysis.

Identification of significantly switched isoforms and prediction of alternative splicing patterns
From the isoform level expression T/N comparison, we identified 1375 switched isoforms mapping to 1078 unique genes. Among these switched isoforms, 1251 are protein-coding isoforms, and 124 are non-coding isoforms including antisense, lincRNA, pseudogenes, and others (Table. S2). We found that the proportion of switching rate for coding genes is much higher than noncoding genes (Fisher's exact test, p.value= 8.4e-08, Fig. 2 A-B). In order to intuitively visualize the splicing composition of these switched isoforms, we breakdown the dIF distribution according to isoform types such as lincRNA, antisense, and pseudogenes with the most significant switched isoforms (dIF > 0.2 or dIF < -0.2) highlighted ( Fig. 2C).
Fig . 2D shows the eight splicing patterns for switched isoforms stratified by isoforms usage gain or loss in the tumor. Some of the switched isoforms are predicted to have multiple AS events in HCC (Table S3). Interestingly, we observed a global phenomenon that the AS events are not equally used -most prominently illustrated by the use of ATSS in HCC, where there is (a lot) more losses than the gain of amino acid coding exons. It is noteworthy that IR and ATSS are enriched in significant loss of isoform usage in tumor, but A5 and A3 are significantly enriched in a gain ( Fig 2E). Here IR events are of particular functional interest since they represent the largest changes in isoforms. As we displayed in the violin plots, the enriched IR and A3 splicing groups reported the significantly opposite direction of isoform usages between T/N samples (Fig. 2F).

Analysis of functional consequences for switched isoforms
The overview of switched isoforms impacting the biological function alterations in HCC is shown in Fig 3A. We can see that the number of protein domain gain is comparable to domain loss, but is significantly more than domain "switch". Here, the "switch" term indicates both a gain and a loss occurred. Also, switching resulting in ORF gain is significantly more than ORF loss.
For the Gene Ontology analysis, both gain and loss switched isoforms are associated with different types of metabolic process. KEGG analysis shown the isoform loss in tumor tissue are associated with virus infection, hepatitis C, etc, while isoform gain in the tumor is associated with Base excision repair, apoptosis, etc (Table S2).
Importantly, we confirmed 20 genes with switched isoforms which are involved in AS regulatory functions (Table 1) with other isoforms. All the above evidence showed that genes with switched isoforms are often functional important in tumorigenesis, but may be ignored from previous studies due to their genes expression may be not significantly differentiated.

Prediction of AS correlated non-coding RNAs at both transcript and gene level
In order to identify which lncRNAs are associated switched isoforms at the transcript level, we constructed a lncRNA and genes with switched isoforms co-expression network. Different from traditional gene level co-expression network, the connections between lncRNA and genes with multiple splicing isoforms could be 1 vs. 1 or 1 vs. many. The lncRNAs -switched isoforms connections have been summarized in Table S3. Due to space limitation, we only illustrated the relationships between lncRNAs and genes with enriched AS patterns in Fig 4A. However, since the lncRNA regulation mechanism involved in AS events is very comprehensive, as the regulation may not directly be reflected from expression abundance, but  Table S4 provided the prediction of all AS-related genes ranking by RWMG important score.

Computational, Clinical and experimental evaluation for predicted pathogenic lncRNAs involved in AS regulation
As the ROC curve shown in Fig 5A, the averaged area under curve (AUC) value after optimization has been improved from 0.751 to 0.923 based on bootstrapping value. In order to select the best number of top n ranked genes that correspond to a good tradeoff between the sensitivity and specificity, we selected the cutoff based on the trend of the changes at which n where Δ /Δ exhibiting a sudden drop (Fig 5B). We can see from the figure, n=150 is the best number for gene selection. The top ranked lncRNAs associated with AS functions can be found in Table 2.
Among the top predicted lncRNAs that are involved in AS, we further confirmed their clinical significance. As a result of univariate survival analysis screen, a total of 51 lncRNAs and 24 pseudogenes were found to be associated with HCC overall survival respectively (Table S5).

Discussion
In the last decade, great studies have investigated the association of splicing isoforms and lncRNAs profiles from deep sequencing. For example, it has been known for a long time that some small nuclear uridine (U)-rich RNAs (snRNPs) who are the core components of the pre-mRNA processing spliceosome can collaborate with some splicing factors which are encoded by heterogeneous nuclear ribonucleoprotein complex subunits (hnRNPs) to fine-tune complex splicing regulations 4 . Impressively, we found a number of core snRNPs isoforms including SNRPE, SNRPD3, SNRPD3, SNRPF, and SNRNP40 are switched, although their expression abundance is not necessary differentially expressed from T/N comparison in HCC development. In the transcriptional level correlation network, we found the majority of lncRNA isoforms were correlated with more than one AS event, among which some were playing opposite roles in the AS regulations. In addition, we can see that many lncRNAs may partially in competition with the same AS event. For example, the pseudogenes of UBE2S, which are UBE2SP1, UBE2SP2, and UBE2MP1, are significantly correlated with FEN1's Intron Retention and Alternative 5' donor sites mechanisms (Fig 4A). The FEN1 gene plays an important role in removing 5' overhanging flaps and the 5-3 exonuclease activities involved in DNA replication and repair 47 . While the UBE2S is involved in ubiquitination and subsequent degradation of VHL, resulting in an accumulation of HIF1A 48 , however, the reason why its pseudogenes are associated with FEN1 is not clear. Maybe the next step for experimental validation for a deep understanding of the mechanisms. Taken together, these results confirmed that these identified lncRNAs need to be better investigated in the future. Our results provided a better resolution of AS correlated lncRNAs at the isoform level.
AS events are mainly regulated by splicing factors, which bind to pre-mRNAs and influence exon selection and splicing site choice. Moreover, transcriptional factors will activate or suppress the expression of ASF. Importantly, we found ASF may have switched isoforms. A switched ASF RP9, which can be bound by the proto-oncogene PIM1 product, a serine/threonine protein kinase, also can cause its target PIM1 switched. Although TFs were usually thought for a long time to encode a single protein that changes the expression of their target genes, more and more TFs, are now found to be alternatively spliced 49 . Here, we also found a group of TFs in the ETS family (E26 transformation-specific), which are ETS1, ETS2, ETV3, ELF4, are switched together. These ETS genes have been confirmed to be associated with cancer through gene fusion 50 and are involved in a wide variety of regulatory functions such as cell migration, proliferation, and cancer progression 51 52 . Interestingly, the ETS1 target splicing factor QKI, and ETV3 target splicing factor CELF1. More interestingly, lncRNA FAM99B is predicted to be associated with these ETS family genes and their low expression are associated with HCC patients' poor prognosis.
CDKN2B-AS1, also known as ANRIL, its association with HCC has been reported in several studies 53-55 . CDKN2B-AS1 has both linear and circular isoforms and their functions are different. For example, its linear isoform can regulate the c-myc-enhancer binding factor RBMS1 56 , while its circular isoform is confirmed to be an important AS regulator that causing skipped exons 57 mainly found in cardiovascular disease 58 59 . However, this is the first time we found it can activate alternative splicing genes in liver cancer. A potential explanation could be it is functionally related to lipid metabolism and a majority of liver cancer due to lipid disorder. In addition, the prognostic value of CDKN2B-AS1 has been revealed in our project. However, how exactly CDKN2B-AS1 controls these genes' splicing is not clear. Further experimental validation can be planed. We identified HAND2-AS1 gene showed consistent alternative splicing pattern at the start sites and termination site for METTL7B at isoforms level. METTL7B is a membraneassociated protein that resides on hepatic lipid Droplets. An explanation is that HAND2-AS1 activate the METTL7B spliced isoform lipid disordered and is associated with HCC, which didn't report before. Gene-level RWMG network analysis further reveals that both CDKN2B-AS1 and HAND2-AS1 can influence AS either through TFs and ASFs. For example, HAND2-AS1 TFs (i.e. ETS1, SP1, E2F7), or ASFs (i.e. SRSF7, SFRP1, HNRNPK); CDKN2B-AS1 associated TFs (SP4, E2F7) and ASF (SRSF1, SRSF2).
In this project, we extended this algorithm to multiplex and heterogeneous networks. The walk can explore different layers of the epigenetic regulatory network, expression correlation network, and protein interaction network. A recent Nature Review paper by Sharan et al., also suggested that the "network-propagation" method is a "powerful" and "accurate" refined approach in the network, since it is capable of dealing with "noisy" and "incomplete" observations by simultaneously considering all possible paths among vertices. Analyzing these heterogeneous data together will significantly improve the prediction accuracy. By using this gene-ranking strategy, potentially spurious predictions (false positives) that are supported by a single (shortest) path are down-weighted, and true high ranked genes that are potentially missed, even though they are well connected to the prior list (false negatives), are promoted.
To our best knowledge, this is the first attempt to predict lncRNAs regulations on AS using a rigorous, multi-graphic approach by integrating such large scale and complex networks. Of interest for potentially limiting the accuracy of random walk and network propagation methods are an incomplete collection of known lncRNAs, especially pseudogenes, used to supervise prediction of new candidates. As such, we addressed several unique challenges associated with these datasets' complexity in each step. For example, in the data preprocessing steps, we have carefully addressed the challenges by collecting as many as experimental verified and predicted lncRNAs that are taking account of AS; In our statistical modeling steps, we carefully addressed the robustness of complex data integration, especially for non-informative or noisy datasets. Also, we investigated several random walk strategies by trying different groups of vertices such as lncRNAs, ASFs and TFs as a start point to optimize models.
However, the lncRNA regulatory mechanism is very complicated, as its mechanism differs with different stages, such as the pre-mRNA or post-mRNA stage. Therefore, the main limitation of this project is we are not able to consider several other comprehensive mechanisms at different stages, such as recognition of the splicing site can be modulated by cis-regulatory sequences, known as splicing enhancers or silencers, which contribute to the generation of two or more alternatively-spliced mRNAs from the same pre-mRNA. Also, lncRNA determine AS patterns through chromatin remodeling mechanism and shape the 3D genome organization. Our next step -13 -will focus on interpreting these mechanisms at different stages.

Conclusion
We performed a large-scale RNAseq analysis to identify alternative splicing isoforms and their biological consequences in liver cancer. To predict which lncRNAs are associated with splicing mechanisms in HCC, we developed a RWMG model to integrate multi-layers heterogeneous networks including epigenetic regulatory network, transcriptional level coexpression network, and PPI network with the evidence that lncRNAs in the correlation between effectors (miRNAs, TFs, or ASFs) and their associated splicing genes. Our project is the first time using the network-based computational method to genome-wisely predict AS-related lncRNAs in HCC, which showed a good prediction performance (AUC=0.923) and clinical significance.         Table S1. Statistic summaries for significantly differential expression genes between tumor and normal comparison for lncRNAs and pseudogenes. Table S2. Statistic summaries and functional analysis for significantly switched isoforms between tumor and normal comparison for lncRNA and protein-coding genes, as well as prediction of splicing event patterns for switched genes. Gene ontology and KEGG pathway enrichment analysis are performed for upregulated isoforms (gain) and downregulated isoforms (loss) respectively. Table S3. Predicted lncRNA interaction pairs at the transcriptional co-expression level. Table S4. Statistic summary of AS-associated lncRNAs and PCGs ranking by RWMG predictive score.    l  e  1   i  s  o  f  o  r  m  _  i  d  g  e  n  e  _  i  d  g  e  n  e  _  n  a  m  e  d  I  F  q  _  v  a  l  u  e   E  N  S  T  0  0  0  0  0  5  5  5  2  9  5  .  1   E  N  S  G  0  0  0  0  0  1  0  0  8  3  6  .  1  0  P  A  B  P  N  1  0  .  1  8  2  1  .  1  0  E  -3  2   E  N  S  T  0  0  0  0  0  4  5  9  6  8  7 .