Prognostic Nomogram of Prognosis-Related Genes and Clinicopathological Characteristics to Predict the 5-Year Survival Rate of Colon Cancer Patients

Background: The Cancer Genome Atlas (TCGA) has established a genome-wide gene expression profile, increasing our understanding of the impact of tumor heredity on clinical outcomes. The aim of this study was to construct a nomogram using TCGA data on prognosis-related genes and clinicopathological characteristics to predict the 5-year survival rate of colon cancer (CC) patients. Methods: Kaplan–Meier and Cox regression analyses were used to identify genes associated with the 5-year survival rate of CC patients. Cox regression was used to analyze the relationship of clinicopathological features and prognostic genes with overall survival in patients with CC and to identify independent risk factors for the prognosis of CC patients. A nomogram for predicting the 5-year survival rate of CC patients was constructed using R software. Results: A total of eight genes (KCNJ14, CILP2, ATP6V1G2, GABRD, RIMKLB, SIX2, PLEKHA8P1, and MPP2) related to the 5-year survival rate of CC patients were identified. Age, stage, and PLEKHA8P1 were independent risk factors for the 5-year survival rate in patients with CC. The accuracy, sensitivity, and specificity of the nomogram model constructed from age, TNM staging, and PLEKHA8P1 for predicting the 5-year survival rate of CC patients were 83.3, 83.97, and 85.79%, respectively. Conclusion: The nomogram can correctly predict the 5-year survival rate of patients with CC, thus aiding the individualized decision-making process for patients with CC.


INTRODUCTION
Colon cancer (CC) ranks third in incidence and second in mortality among cancers (1). Surgery is the main treatment for CC and prolongs survival time (2,3). Adjuvant chemoradiotherapy can also significantly improve the prognosis of CC (4,5). The 5-year survival rates of patients with stage I, II, and III CC are ∼93, 80, and 60%, respectively (6). The American Joint Committee on Cancer (AJCC) TNM staging system is widely used to assess the prognosis of patients with CC (5). However, the prognosis of patients with CC at the same stage varies widely, and the accuracy of TNM staging as a predictive approach has certain limitations (7,8). Therefore, another approach is needed to identify patients with poor prognosis so that individualized treatment and monitoring approaches can be developed. Nomograms can provide an overall probability of a specific outcome for an individual patient and give more accurate predictions than traditional staging systems, thereby improving personalized treatment decisions (9,10). Microarray techniques have been used to predict the prognosis of many types of cancer (11–13). Previous studies have shown that gene expression profiles hold promise for predicting patients' long-term prognosis (14). Meanwhile, prognostic gene expression profiles of colorectal cancer patients from tumor samples and adjacent normal mucosa have been described (15–17). Relevant studies have indicated that gene expression characteristics can improve the accuracy of prognosis prediction for stage II and III colorectal cancer (18,19). However, few studies have combined prognostic genes with clinicopathological features to predict the long-term survival of patients with CC. In addition, The Cancer Genome Atlas (TCGA) has established a genome-wide gene expression profile, increasing our understanding of the impact of tumor heredity on clinical outcomes (20).
Therefore, the aim of this study was to construct a nomogram using TCGA data on prognosis-related genes and clinicopathological characteristics to predict the 5-year survival rate of CC patients, thus providing an important basis for individualized decision-making for patients with CC.

Data Download and Processing
RNA sequencing results from 437 tissues and 382 human colon adenoma and adenocarcinoma samples were obtained from the TCGA database (https://portal.gdc.cancer.gov). RNA sequencing results from 39 normal samples and 398 cancer samples were combined into a single matrix file using Perl scripts (http://www.perl.org/). The Ensembl database (http://www.ensembl.org/index.html) was then used to convert the Ensembl IDs in the matrix file to gene names. In addition, the clinical data of 385 cases were downloaded, and the relevant clinical variables were extracted.
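The merging and ID-conversion step can be illustrated with a minimal Python sketch. This is not the authors' Perl script: the function name build_matrix, the toy identifiers, and the rule of averaging values when several IDs map to the same gene name are all assumptions made for illustration.

```python
from collections import defaultdict


def build_matrix(per_sample, id_to_symbol):
    """Merge per-sample {ensembl_id: value} dicts into {symbol: {sample: value}}.

    IDs without a known gene name are skipped. If several IDs map to the
    same gene name, their values are averaged (an assumption for this
    sketch, not a rule stated in the paper).
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for sample, values in per_sample.items():
        for ens_id, value in values.items():
            base_id = ens_id.split(".")[0]  # drop a trailing version suffix
            symbol = id_to_symbol.get(base_id)
            if symbol is None:
                continue
            sums[symbol][sample] += value
            counts[symbol][sample] += 1
    return {
        symbol: {s: sums[symbol][s] / counts[symbol][s] for s in sums[symbol]}
        for symbol in sums
    }


# Toy input: identifiers and values are invented placeholders.
toy_samples = {
    "sample_1": {"ENSG_TOY1.3": 2.0, "ENSG_TOY2.1": 4.0, "ENSG_TOY9.1": 1.0},
    "sample_2": {"ENSG_TOY1.3": 6.0},
}
toy_mapping = {"ENSG_TOY1": "GENE_A", "ENSG_TOY2": "GENE_A"}
toy_matrix = build_matrix(toy_samples, toy_mapping)
```

Keeping the result keyed by gene name makes the later per-gene survival screening a simple loop over the matrix rows.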

Identification of Prognostic-Related Genes
First, Kaplan–Meier and Cox regression analyses were used to screen for genes associated with the 5-year survival rate of CC patients, with P < 0.05 defining statistical significance. Next, the survivalROC package in R was used to identify genes that were associated with 5-year survival and had an area under the curve (AUC) >0.6.
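The two-stage screening rule can be expressed as a simple filter. The sketch below is written in Python rather than R, assumes that the per-gene P-values and 5-year AUC have already been computed elsewhere, and assumes that a gene must pass both the Kaplan–Meier and the Cox test; the names GeneStats and screen_genes and the toy values are hypothetical.

```python
from typing import List, NamedTuple


class GeneStats(NamedTuple):
    gene: str
    km_p: float    # P-value from the Kaplan-Meier (log-rank) comparison
    cox_p: float   # P-value from univariate Cox regression
    auc_5y: float  # time-dependent AUC at 5 years


def screen_genes(stats: List[GeneStats], alpha: float = 0.05,
                 min_auc: float = 0.6) -> List[str]:
    """Keep genes significant in both tests and with AUC above min_auc."""
    return [
        g.gene
        for g in stats
        if g.km_p < alpha and g.cox_p < alpha and g.auc_5y > min_auc
    ]


# Toy values, invented for illustration only.
toy_stats = [
    GeneStats("GENE_A", 0.01, 0.02, 0.65),  # passes all three criteria
    GeneStats("GENE_B", 0.01, 0.20, 0.70),  # fails the Cox criterion
    GeneStats("GENE_C", 0.03, 0.04, 0.55),  # fails the AUC criterion
]
selected = screen_genes(toy_stats)
```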

Survival Analysis
To determine the relationship between prognostic genes and CC survival, we used the survival package in R to perform survival analysis of the prognostic genes. The relationship of the clinicopathological characteristics and prognosis-related genes with the overall survival of patients with CC was first assessed by univariate analysis. The factors that affected CC survival in the univariate analysis were then entered into a multivariate Cox regression to identify independent risk factors for CC prognosis.
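For readers unfamiliar with the estimator underlying the survival curves, a minimal Kaplan–Meier sketch in plain Python is given here. It only illustrates the product-limit formula and is not the survival package's implementation; the function name kaplan_meier and the toy follow-up data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  : follow-up time of each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Toy data: four patients, one censored at time 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients leave the risk set without lowering the curve, which is why the estimate at time 3 uses two patients at risk rather than three.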

ROC Curve Analysis
To determine the accuracy of the combined factors for predicting the 5-year survival rate of CC patients and the cutoff values of the prognostic genes, we used the survivalROC package in R. The sensitivity and specificity of the combined factors were also calculated using the survivalROC package.
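The quantities reported by the ROC analysis can be illustrated with a simplified Python sketch. Unlike survivalROC, which handles censored follow-up, this sketch assumes that each patient's 5-year status is known exactly; the function names, the toy scores, and the cutoff of 0.5 are hypothetical.

```python
def sens_spec(scores, died, cutoff):
    """Sensitivity and specificity when score > cutoff predicts death."""
    tp = sum(1 for s, d in zip(scores, died) if d and s > cutoff)
    fn = sum(1 for s, d in zip(scores, died) if d and s <= cutoff)
    tn = sum(1 for s, d in zip(scores, died) if not d and s <= cutoff)
    fp = sum(1 for s, d in zip(scores, died) if not d and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, died):
    """AUC as the probability that a patient who died scores higher than
    one who survived (ties count as one half)."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: risk scores and 5-year death indicators (invented values).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_died = [1, 1, 0, 1, 0]
toy_sens, toy_spec = sens_spec(toy_scores, toy_died, cutoff=0.5)
toy_auc = auc(toy_scores, toy_died)
```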

Construction of Nomogram
The combination of factors that predicted the prognosis of CC most accurately was used to construct a nomogram model for predicting the 5-year survival rate of CC patients using the rms package in R, and scores for the individual indicators were obtained. The scores corresponding to the indicators were added to obtain a total score; the higher the total score, the lower the 5-year survival rate of CC patients. Meanwhile, the survivalROC package was used to calculate the sensitivity and specificity of the model to evaluate its clinical value. Moreover, the concordance index (C-index) was calculated to evaluate the performance of the model predictions, and a calibration curve was plotted to examine the relationship between the predicted probability and the actual incidence (21,22).
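As an illustration of what the C-index measures, a minimal Python sketch of Harrell's concordance index follows. It is a plain pairwise implementation for clarity, not the routine used by the rms package, and the function name c_index and the toy data are assumptions.

```python
def c_index(times, events, risk):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when patient i is known to have died
    before patient j's last follow-up. It is concordant when the patient
    who died earlier has the higher predicted risk; ties in risk count
    as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data: higher risk scores for earlier deaths give perfect concordance.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 0, 1]
c_perfect = c_index(toy_times, toy_events, [4, 3, 2, 1])
c_reversed = c_index(toy_times, toy_events, [1, 2, 3, 4])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.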

Clinical Characteristics of Patients
From the clinical data of 385 patients, the patients' age, sex, stage, TNM staging, survival time, and survival status were extracted. After deleting samples with incomplete clinical data, a total of 364 cases were retained for further analysis (Table 1).

ROC Curve Analysis
The AUC, sensitivity, and specificity of age combined with TNM staging and PLEKHA8P1 for assessing the 5-year survival rate of CC patients were 0.833, 83.97, and 85.79%, respectively. The AUC, sensitivity, and specificity of age combined with TNM staging for assessing the 5-year survival rate of CC patients were 0.735, 70.21, and 73.06%, respectively (Figure 4). These results indicated that the combination of age, TNM staging, and PLEKHA8P1 was the most accurate for evaluating the 5-year survival rate of CC patients.

Construction of Nomogram
The rms package in R was used to construct a logistic regression model based on age, TNM staging, and PLEKHA8P1. The C-index of the model was 0.74, indicating that the prediction model was accurate. The plotting function was then built, and the nomogram was plotted (Figure 5). Age ≤65 years scored 0 points and age >65 years scored 60 points; T1, T2, T3, and T4 scored 0, 33, 67, and 100 points, respectively; N0, N1, and N2 scored 0, 36, and 72 points, respectively; M0, M1, and Mx scored 0, 39, and 78 points, respectively; and PLEKHA8P1 ≤ 1.545 U/ml scored 0 points, while PLEKHA8P1 > 1.545 U/ml scored 48 points. The maximum total score was 358 points, which corresponded to a 5-year survival probability of <10% for patients with CC. The probability of 5-year survival of CC can be predicted from the total points (Table 4). The accuracy, sensitivity, and specificity of this prediction model were 83.30, 83.97, and 85.79%, respectively, indicating the validity of the model. The calibration curve was close to the ideal curve, indicating that the predictions were in good agreement with the actual results (Figure 6).
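The scoring rule of the nomogram can be written out as a short Python sketch using the point values reported above. The function name total_points and the dictionary layout are illustrative choices, and the conversion from total points to survival probability (Table 4) is deliberately not reproduced here.

```python
# Point values as reported for the nomogram in the text.
POINTS = {
    "age": {"<=65": 0, ">65": 60},
    "T": {"T1": 0, "T2": 33, "T3": 67, "T4": 100},
    "N": {"N0": 0, "N1": 36, "N2": 72},
    "M": {"M0": 0, "M1": 39, "Mx": 78},
    "PLEKHA8P1": {"low": 0, "high": 48},
}
PLEKHA8P1_CUTOFF = 1.545  # expression cutoff reported in the text


def total_points(age, t_stage, n_stage, m_stage, plekha8p1):
    """Sum the nomogram points for one patient."""
    age_key = ">65" if age > 65 else "<=65"
    gene_key = "high" if plekha8p1 > PLEKHA8P1_CUTOFF else "low"
    return (
        POINTS["age"][age_key]
        + POINTS["T"][t_stage]
        + POINTS["N"][n_stage]
        + POINTS["M"][m_stage]
        + POINTS["PLEKHA8P1"][gene_key]
    )


# Hypothetical patient: 70 years old, T3 N1 M0, PLEKHA8P1 below the cutoff.
example_score = total_points(70, "T3", "N1", "M0", 1.0)
```

Summing the highest category of every factor (60 + 100 + 72 + 78 + 48) reproduces the maximum of 358 points stated in the text.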

DISCUSSION
This study is the first to combine prognostic genes and clinicopathological characteristics of CC to predict the 5-year survival rate of CC patients. We first performed a series of analyses to identify genes that significantly affected 5-year survival in CC. Then, the relationship of these genes and the clinicopathological characteristics with the overall survival rate of CC was analyzed, and independent risk factors for CC survival were identified. Finally, a logistic regression model was constructed based on the AUC of the combined factors, and a nomogram was drawn to predict the 5-year overall survival rate of CC patients.
This study found that age, stage, and PLEKHA8P1 were independent risk factors for the 5-year survival rate in patients with CC. PLEKHA8P1 belongs to the pseudogene family. Only ∼2% of the genes in the human genome encode proteins. Non-coding RNAs include microRNAs, long non-coding RNAs (lncRNAs), and pseudogenes (23–25). Currently, the functions and mechanisms of lncRNAs and pseudogenes have not been fully elucidated (24–26). However, an increasing number of studies have shown that pseudogenes have important biological functions (27,28). In the process of homologous recombination, pseudogenes may cause the loss of some bases, thus affecting the transcription level of genes (29). Pseudogenes can also induce endogenous small interfering RNAs to inhibit the expression of functional genes (25). Pseudogene RNAs can play a regulatory role as competing endogenous RNAs (26,30).
In addition, an increasing number of studies have indicated that pseudogenes play a crucial role in cancer. Chen et al. (31) found that the pseudogene CTNNAP1 promotes the growth of human tumors by regulating the expression of its homologous gene, CTNNA1. Lin et al. (32) showed that the pseudogene OCT4-pg could inhibit the growth and differentiation of mesenchymal stem cells. Rutnam et al. (33) indicated that the pseudogene TUSC2p1 protects the expression of the tumor suppressor gene TUSC2 by competitively binding miRNA, thereby inhibiting the proliferation of breast cancer cells. Poliseno et al. (25) demonstrated that the pseudogene PTENP1 can produce the corresponding mRNA and interact with the transcription products of the parent gene PTEN, thus playing a role in inhibiting cell growth. Poliseno et al. (34) also found deletion of the pseudogene PTENP1 in some cases of CC, gastric cancer, and malignant melanoma. Moreover, the expression of some pseudogenes is related to the staging and grading of cancer and can serve as a molecular marker for cancer prognosis. Increased expression of the pseudogene OCT4-pgq1 was closely associated with poor prognosis in gastric cancer and with worse overall patient survival rates (35). PLEKHA8P1 expression was significantly correlated with the monthly survival rate and monthly disease-free survival rate of renal cell carcinoma patients, suggesting that changes in its expression play a key role in predicting the prognosis of renal cell carcinoma (36). This study also showed that PLEKHA8P1 was significantly associated with the 5-year survival rate of CC patients and was highly expressed in colon tumors.
The AJCC TNM staging system is widely used for the prognostic evaluation of CC patients (5). However, Liu et al. (8) indicated that the MSKCC nomogram was better than the AJCC staging system for predicting the 5-year survival rate, and the C-index of the MSKCC nomogram in the studied Chinese cohort was 0.71. Weiser et al. (37) demonstrated that a prognostic model including prognostic factors was superior to the current AJCC system, with the C-index increasing from 0.60 to 0.68. The applicability of gene expression profiles for predicting the prognosis of colorectal cancer patients has been demonstrated in several studies (19,38,39). Barrier et al. (38) showed that microarray gene expression profile analysis can predict the prognosis of patients with stage II CC. Lee et al. (40) found that a nomogram model including TNM staging and a genetic risk score obtained from the TCGA database could successfully predict the overall survival rate of colorectal cancer patients, and its C-index was higher than that of TNM staging alone (0.75 vs. 0.69). A prognostic prediction model combining pathologic M and pathologic T had a 5-year AUC of 0.712 and a C-index of 0.680 for patients with colon adenocarcinoma (41). Another prognostic model composed of six significant prognostic factors (age, first-degree relative cancer history, differentiation grade, vessel/nerve invasion, TNM stage, and HALP) had a 5-year AUC of 0.73 for patients with locally advanced colorectal cancer (42). A prognostic nomogram constructed from age, sex, histological grade, T stage, number of lymph nodes retrieved, tumor size, and N stage had a 5-year AUC of 0.729 for patients with non-metastatic CC (43). In this study, the accuracy, sensitivity, and specificity of age combined with TNM staging and PLEKHA8P1 for predicting the 5-year survival rate of CC patients were higher than those of the TNM staging system. In addition, the C-index of the model constructed from age, TNM staging, and PLEKHA8P1 for predicting the 5-year survival rate was 0.74, and its accuracy, sensitivity, and specificity were 83.3, 83.97, and 85.79%, respectively, indicating that the model has high validity.
There are some limitations in this study. First, mRNA expression values are difficult to obtain in clinical practice because of the high cost. However, once the cost is reduced, this approach could be widely used in clinical practice. Second, other prognostic factors, such as tumor markers and inflammatory markers, were not included.
In conclusion, age, PLEKHA8P1, and stage were risk factors for poor prognosis in patients with CC. The nomogram model constructed from age, TNM staging, and PLEKHA8P1 can correctly predict the 5-year survival rate of patients with CC, thus aiding individualized decision-making for patients with CC. Moreover, the results of this study provide some direction for future fundamental research. However, the biological function and molecular mechanism of PLEKHA8P1 need further study.

DATA AVAILABILITY STATEMENT
The datasets generated for this study can be found at https://portal.gdc.cancer.gov.