Identification of a Somatic Mutation-Derived Long Non-Coding RNA Signatures of Genomic Instability in Renal Cell Carcinoma

Background Renal cell carcinoma (RCC) is a malignant tumor with high morbidity and mortality. It is characterized by a large number of somatic mutations and genomic instability. Long non-coding RNAs (lncRNAs) are widely involved in the expression of genomic instability in renal cell carcinoma. But no studies have identified the genome instability-related lncRNAs (GInLncRNAs) and their clinical significances in RCC. Methods Clinical data, gene expression data and mutation data of 943 RCC patients were downloaded from The Cancer Genome Atlas (TCGA) database. Based on the mutation data and lncRNA expression data, GInLncRNAs were screened out. Co-expression analysis, Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis were conducted to explore their potential functions and related signaling pathways. A prognosis model was further constructed based on genome instability-related lncRNAs signature (GInLncSig). And the efficiency of the model was verified by receiver operating characteristic (ROC) curve. The relationships between the model and clinical information, prognosis, mutation number and gene expression were analyzed using correlation prognostic analysis. Finally, the prognostic model was verified in clinical stratification according to TCGA dataset. Results A total of 45 GInLncRNAs were screened out. Functional analysis showed that the functional genes of these GInLncRNAs were mainly enriched in chromosome and nucleoplasmic components, DNA binding in molecular function, transcription and complex anabolism in biological processes. Univariate and Multivariate Cox analyses further screened out 11 GInLncSig to construct a prognostic model (AL031123.1, AC114803.1, AC103563.7, AL031710.1, LINC00460, AC156455.1, AC015977.2, ‘PRDM16-dt’, AL139351.1, AL035661.1 and LINC01606), and the coefficient of each GInLncSig in the model was calculated. The area under the curve (AUC) value of the ROC curve was 0.770. Independent analysis of the model showed that the GInLncSig model was significantly correlated with the RCC patients’ overall survival. Furthermore, the GInLncSig model still had prognostic value in different subgroups of RCC patients. Conclusion Our study preliminarily explored the relationship between genomic instability, lncRNA and clinical characteristics of RCC patients, and constructed a GInLncSig model consisted of 11 GInLncSig to predict the prognosis of patients with RCC. At the same time, our study provided theoretical support for the exploration of the formation and development of RCC.


INTRODUCTION
As the most common urinary malignancy in the United States, there are about 73,820 new cases of RCC each year, and 14,770 deaths from the disease (1). At present, surgical resection is the main treatment for patients with RCC (2,3). However, about 1/3 of patients have metastases at the time of diagnosis, and 1/5 patients have metastases or recurrences after radical treatment (4). RCC has a poor sensitivity to radiotherapy and chemotherapy. Targeted therapy and immunotherapy are the other treatment options (5)(6)(7). Currently, there is no biomarker in clinical practice that can well predict the prognosis of RCC patients.
In recent years, scientists have found that an important part of the non-coding genome is transcribed to produce non-coding RNAs, among which a subset longer than 200 nt at length, capped polyglandular transcripts transcribed by RNA polymerase II, which are known as long non-coding RNAs (lncRNA). It has been confirmed that lncRNA plays an important role in genomic function and gene expression, as well as in the etiology and treatment of malignant tumors (8). Some researchers have found that several lncRNAs are related to the prognosis of RCC patients. LncRNA B7H4 was expressed in the endothelium of tumor cells or tumor blood vessels but not in normal tissues. Therefore, low expression of B7H4 can be used as a predictor of survival in patients with RCC (9)(10)(11). TRAF3IP2-AS1 functioned as a tumor suppressor in RCC progression. Overexpression of TRAF3IP2-AS1 inhibited the proliferation, migration and invasion of UOK109 cells (12). Generally, single biomarkers have their limitations in terms of sensitivity and specificity. Therefore, some researchers have tried to construct prognostic models to predict the prognosis of RCC patients in a broader sense by screening a panel of biomarkers (13,14).
Genomic instability plays a key role in the development of cancer. For example, structural changes in proto-oncogenes and tumor suppressor genes may lead to abnormalities in cell functions, including cell growth, cell cycle, cell senescence and apoptosis, cell invasion and metastasis (15)(16)(17)(18). The genomic instabilities of cancer mainly include gene mutation, chromosome rearrangement and aneuploidy (19). Genomic instability has been identified as a key prognostic factor, and it is of great importance to explore its clinical significance.

Clinical Data Download and Data Consolidation
Gene transcriptome profiling, gene mutation data and clinical information of RCC patients were downloaded from the TCGA database as of March 2021. Perl software was used to extract all the mRNAs and lncRNAs and then annotated them with the HUGO Gene Nomenclature Committee (HGNC) database. This study complied with the publication guidelines of TCGA. And no additional ethical consent was required.

Screening and Identification of lncRNAs Related to Genomic Instability
We extracted the lncRNA expression profile and the cumulative mutation frequency of each sample from the entire annotated transcriptome data. Wilcoxon rank sum test was used to screen GInLncRNAs differentially expressed between the top 25% of samples and the bottom 25% of samples with cumulative mutation frequency by using "limma" package in R software. Heat maps of differentially expressed GInLncRNAs in the two groups were constructed using the "pheatmap" package in the R software. Subsequently, the expression level of each sample was calculated according to the above GInLncRNAs and were divided into two groups. Median mutation frequency values of the two groups were calculated respectively. The two groups with higher and lower median values were set as Genomic Unstable type (GU-like) and Genomic Stable type (GS-like), respectively. Heatmaps of the two groups of differentially expressed lncRNAs were constructed using the "pheatmap" package in the R software. Next, the mutation frequency of all samples in the GU-like group and the GS-like group as well as the expression level of the genomic instability driver gene "UBQLN4" (20) were compared. Then the "ggpubr" package was used to construct a boxplot of the difference in mutation frequency and expression of "UBQLN4" gene between the two groups in R software.

Gene Co-Expression Network
The expression levels of lncRNA and mRNA were retrieved using the "limma" package in R software. Pearson correlation analysis was performed to determine the target mRNA of GInLncRNA. The top ten genes with the highest correlation were selected as the target genes of this GInLncRNA according to Pearson correlation coefficient, and the co-expression network was visualized by Cytoscape software.

Functional Enrichment Analysis
In order to predict the potential function and pathway of the GInLncRNAs, we performed functional enrichment analysis on related mRNAs co-expressed with GInLncRNAs to determine significantly enriched GO terms and KEGG pathways. The packages of "clusterProfiler", "org.Hs.eg.db", "enrichplot" and "ggplot2" in R software were used with P value and adjust P value < 0.05.

Construction of a Prognostic Risk Model
After matching the expression of lncRNA transcriptional profile, somatic mutation data and clinicopathological features of the RCC patients, we randomly divided 857 samples into two groups in a ratio of 1:1 and named Train set and Test set, using the "caret" package in R software. The 429 samples of the Train set were designed to identify GInLncSig and to construct prognostic model. Besides, the 428 samples of the Test set were used to investigate the performance of the prognostic model. Univariate Cox analysis was conducted to find out the GInLncSig in the Train set with the "Survival" package in R software. Subsequently these GInLncSig were used to construct a prognostic model.

Assessment of the Performance of the Prognostic Risk Model
Each patient's risk value in the Train set was calculated by the coef of GInLncSig in the prognostic model. According to the median risk value of the Train set, patients in the Test set and entire TCGA cohort were divided into high-risk group and lowrisk group. The packages "survival" and "survminer" were used to perform log-rank test (P <0.05) for patients in the high-risk and low-risk groups in R software. Furthermore, the Kaplan-Meier method was used to draw the survival curves of the Test set and entire TCGA cohort. The Receiver Operating Characteristic (ROC) curve with Areas Under Curve (AUC) values of the prognostic model was assessed by the "survival ROC" package in R software.

Independent Analysis of the Model Constructed by GInLncSig in Clinical Stratification
In order to test whether the prognostic model constructed by GInLncSig can be used as an independent prognostic factor, we extracted the clinical information of RCC patients in TCGA database and deleted the cases with clinical information gaps. Univariate Cox regression analysis was performed to explore the prognostic values of age, gender, tumor grade, tumor clinical stage and prognostic model using the "survival" package of R software. And Multivariate prognostic analysis was performed for factors with P < 0.05 in the extraction results. Next, we divided the RCC patients of the TCGA database into different subgroups according to clinical parameters, including age (≦ 65 years and > 65), gender (female and male), tumor grade (G1-2 and G3-4), and tumor stage (I-II and III-IV). Patients in each clinical subgroup were further divided into high-risk and lowrisk groups based on the median risk value. The Kaplan-Meier analysis and log-rank test were used to compare the differences in survival between high-risk and low-risk groups in each clinical subgroup.

Statistical Analysis
Gene expression, mutation data and clinical information from the TCGA dataset were retrieved in R software (R3.6.2) or Perl software (Strawberry Perl(64-bit)). All the statistical analyses were evaluated by R software. AP-value < 0.05 was statistically significant.

Screening of Genomic Instability-Related lncRNAs of RCC Patients in TCGA Database
The somatic mutation information of 336 patients with RCC, lncRNA and mRNA transcriptional profiles of 895 patients with RCC, and clinicopathological features of 943 patients with RCC were downloaded from the TCGA database. Each sample was ranked according to the number of mutations. There were 45 GInLncRNAs differentially expressed between the GU-like group and GS-like group. All the differentially expressed GInLncRNAs were demonstrated by heat maps (Supplementary Figure 1). Subsequently, boxplot showed that there were significant differences in the somatic mutation count ( Figure 1A) and the expression of genomic instability driver gene "UBQLN4" ( Figure 1B) between the two groups.

Analysis of Potential Functions and Pathways of the GInLncRNAs
In order to determine whether the functions and pathways of these 45 differential GInLncRNAs were related to genomic instability, we explored the potential functions of GInLncRNAs by co-expression analysis, GO function enrichment analysis and KEGG pathway analysis. Figure 2A showed the co-expression network of IncRNA-mRNA which reflecting the relationship between them. The names of the top 10 mRNAs co-expressed with each GInLncRNA were labeled according to Pearson correlation coefficient analysis. GO function enrichment analysis of GInLncRNA-related genes showed that their functions were mainly enriched in chromosome and nucleoplasmic components (CC), DNA binding in molecular function (MF), and transcription and complex anabolism in biological processes (BP) (P<0.05) ( Figure 2B). By analyzing the KEGG pathways of LncRNA-associated protein-related genes, 18 pathways were found to be significantly enriched. And most of the enriched pathways were related to genomic instability, including HIF-1 signaling pathway, AMPK signaling pathway and Oxidative phosphorylation ( Figure 2C).

Construction of a Prognostic Model
A total of 857 RCC patients in the TCGA database had complete lncRNA expression profiles, somatic mutation data and clinicopathological characteristics. Table 1 showed that there were no significant differences in clinicopathological characteristics between Train set (429 cases) and Test set (428 cases) (including age, gender, tumor grade, clinical stage, T stage, N stage, and M stage).
Univariate Cox proportional hazard regression analysis was used for screening. And 25 out of 45 GInLncRNAs in the Train set were significantly associated with overall survival (P<0.05, Figure 3 and Supplementary Table 1). Subsequently, a survival prediction model with 11 GInLncRNAs was constructed by Multivariate Cox proportional hazard regression analysis (Supplementary Table 2). Among them, the coefficient of four GInLncRNAs (AL031123.1, AC114803.1, AC103563.7 and AL031710.1) were negative, indicating that they might act as protective factors. Up-regulated expression of them were associated with better survival. Otherwise, the coefficient of

Assessment of the Prognostic Model
In order to evaluate the predictive effectiveness of the prognostic model, patients in the Test set and entire TCGA cohort were divided into high-risk group and low-risk group according to the median risk value of patients in the Train set. It was obvious that the overall survival of patients in the low-risk group was significantly better than the high-risk group(P<0.001, log-rank test; Figures 4A, B).
The AUC of the ROC curves in the Test set and entire TCGA cohort were 0.743 and 0.770, respectively, which suggesting that the risk model had a good predictive effectiveness ( Figures 4C, D).

Independent Analysis of the Prognostic Model in Clinical Stratification
To explore whether the prognostic model constructed by GInLncSig can be used as an independent prognostic factor, Univariate and Multivariate Cox regression analyses were performed to analyze the prognostic values of age, gender, tumor grade, tumor clinical stage and the prognostic model. As shown in In order to verify whether the prognostic model still had prognostic value in different subgroups of RCC patients, we divided the RCC patients into different subgroups according to various clinical parameters, including age (≦ 65 years and > 65 years), gender (female and male), tumor grade (G1-2 and G3-4), and tumor stage (I-II and III-IV). Kaplan-Meier analysis showed that patients with a low-risk value had significantly better survival outcomes than those with a high-risk value in all the subgroups except for the G1-2 subgroup ( Figure 5). These results suggested that the model constructed by GInLncSig can be used as an independent prognostic factor for different subgroups of RCC patients.

DISCUSSION
Genomic instability was a characteristic feature of most human cancers (21). Generally, genomic instability included single-base, double-base, or cluster-base substitutions, copy number changes, small insertions and deletions, and genome rearrangements. The accumulation of genetic changes could turn normal cells into malignant cells (22)(23)(24). In addition, persistent genomic instability allowed tumor cells to adapt to their microenvironment under selective pressure and thus become resistant to antitumor therapies (25,26). Processes of genomic instability drove tumors genetics. Studies had found that proto-oncogenes and tumor suppressor genes in nearly 20 common malignant tumors, in which Kidney chromophobe (KICH) was the fewest with 2 genes and Uterine Corpus Endometrial Carcinoma (UCEC) were the most with 55 genes (22). Furthermore, genomic instability was widely involved in the diagnosis and prognosis of various malignant tumors, especially esophageal, bladder, breast, and lung cancers (15,17,27,28). Long non-coding RNAs (lncRNAs) were major components of the mammalian transcriptome and played central roles in a variety of cellular mechanisms. At present, there was evidence indicating that lncRNA was closely related to genomic instability (29). It could promote the occurrence, development and metastasis of RCC (30)(31)(32). However, there were still few studies concerning the relationship between lncRNA associated with genomic instability in RCC, which was the main purpose of this research. In our study, 45 lncRNAs were identified by comparing the expression levels of lncRNAs between genetic-unstable and genetic-stable RCC tumors from the selected patients based on TCGA database. Co-expression analysis, GO enrichment analysis and KEGG pathway analysis showed that the genes co-expressed with these 45 lncRNAs mainly enriched in pathways including HIF-1 signaling pathway, AMPK signaling pathway and Oxidative phosphorylation signaling pathway, which had also been proven to be related to genomic instability of tumors (33)(34)(35). These results suggested that the differentially expressed GInLncRNAs may have direct or indirect causal relationships with genomic instability through the above mentioned pathways. Next, Univariate Cox regression analysis was used to screen 25 GInLncRNAs that were significantly related to the survival of RCC patients from the 45 differentials expressed GInLncRNAs in the Train set. Moreover, Multivariate Cox regression analysis was used to screen 11 GInLncRNAs from the 25 prognostic GInLncRNAs to build a prognostic model. The correlation coefficient calculation formula of the prognostic model for each GInLncSig was obtained, and the risk value of each patient in the Test set was calculated according to the prognostic model. The results showed that patients with high risk value tended to have poor overall survival. The AUC of the model was > 0.7 in both the Train set and the Test set, which indicating that the model had good prediction performance. According to Multivariate Cox regression analysis, we found that the prognostic model had independent prognostic value as well as age, tumor grade and tumor clinical stage. Furthermore, the prognostic model had prognostic value in patients with different ages (≦ 65 years and > 65 years), gender (female and male), tumor grade (G3-4) and clinical stage (I-II and III-IV). These results enabled the GInLncSig model to have a wider range of application, and which could well predict the prognosis of RCC patients in different stratification.
Scholars had explored some prognostic models in RCC. May et al. predicted the disease-free and tumor-related survival rate of patients with RCC undergoing reoperation by using three pathological factors of tumor microvascular invasion, size and grade (36). Later studies gradually moved to the detection of molecules in the blood, including tumor markers (37), protein expression (38), and RNA-binding proteins (39). Some research also explored corresponding prognostic models according to metastatic (40)(41)(42) and non-metastatic (43,44) RCC. With the development of next-generation sequencing (NGS) technologies, the detection of DNA, microRNA and lncRNA in the blood or tissues of patients had become more and more convenient and accessible. There were also studies that systematically evaluated the expression of DNA methylation markers (45), microRNA (46), and lncRNA (47,48) to predict the prognosis of patients with RCC. However, these models were not relate to genomic instability, and some models could only predict the prognosis of specific patients, such as metastatic or non-metastatic RCC. In contrast, our prognostic model constructed by using genomic instability could not only effectively predict the prognosis of all  patients with RCC, but also had predictive value for patients with different stratification. Some of the GInLncSig in our model had also been verified to be abnormally expressed in a variety of malignant tumors and had prognostic values. LINC00460 was located on chromosome 13q33.2 and was transcribed as a 935nt transcript (49). Currently, abnormal expression of LINC00460 had been found in many cancers, including nasopharyngeal carcinoma (50), esophageal squamous cell carcinoma (51), colorectal carcinoma (52)(53)(54), lung adenocarcinoma (49), non-small cell lung cancer (55)(56)(57), etc. Upregulated expression of LINC00460 predicted poorer differentiation, advanced stage, and worse overall survival (58). In addition, mechanism studies had shown that overexpression of LINC00460 may promote invasion and metastasis of NSCLC cells through epithelial-mesenchymal transformation pathway (57). Data had shown that LINC01606 was significantly overexpressed in gastric cancer cells compared with normal cells, and it promoted the migration and invasion of gastric cancer cells by acting as a ceRNA of miR-423-5p in Wnt/b-catenin-dependent pathway (59). This suggested that LINC01606 may act as an oncogene in gastric cancer. Zimta et al. found that the expression of PRDM16-DT was down-regulated in acute myeloid leukemia, and patients with lower expression of PRDM16-DT had better prognosis than those with higher expression (60).  A, B) showed that there was significant difference in the survival rate between the high-risk group and low-risk group in the Test set and TCGA set, respectively (log-rank test, p < 0.05). ROC curves for 1-year survival prediction of the GInLncSig in the Test set (C) and the TCGA set (D).  FIGURE 5 | Clinical stratification analysis of the differences in overall survival between the high-and low-risk RCC patients by age, gender, tumor grade and tumor clinical stage. Kaplan-Meier survival curves of patients in high-and low-risk groups within six clinically stratified subgroups, including patients with age>65 years (A), age<=65 years (B), the gender of female (C), the gender of male (D), the tumor grade of G1-2 (E), grade of G3-4 (F), the tumor stage of I-II (G), and the tumor stage of III-IV (H), respectively. Log-rank test (p<0.05) showed that patients with a low-risk score had significantly better survival outcomes than those with a high-risk score in all the subgroups except for the tumor grade G1-2 subgroup. This is a preliminary study that explores the relationship between genomic instability, lncRNA and clinical characteristics of RCC patients. We constructed a GInLncSig model, which could provide prognostic information in different clinical stratification. However, there were still several limitations of our study. First of all, the data sources of this study were only limited to TCGA database, and had not been further verified in multiple databases or clinical data of our hospital. Secondly, it was necessary to further explore the respective functions of each lncRNA in the prognostic model. Finally, given the NGS is becoming more available, consequently changes in gene expression are expected to be monitored during the treatment in the future.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

AUTHOR CONTRIBUTIONS
GL and LL designed the study. XF performed the data analysis, graphing, and writing. XL was responsible for the critical reading of the manuscript. All authors contributed to the article and approved the submitted version.