A novel prognostic model for patients with colon adenocarcinoma

Background Colon adenocarcinoma (COAD) is a highly heterogeneous disease, which makes prognostic prediction challenging. The purpose of this study was to investigate the clinical epidemiological characteristics, prognostic factors, and survival outcomes of patients with COAD in order to establish and validate a predictive clinical model (nomogram) for these patients.
Methods Using the SEER (Surveillance, Epidemiology, and End Results) database, we identified patients diagnosed with COAD between 2004 and 2015. Disease-specific survival (DSS) and overall survival (OS) were estimated using the Kaplan–Meier approach and compared using the log-rank test. Univariate and multivariate Cox regression analyses were performed to identify the independent prognostic factors for OS and DSS. Nomograms to predict OS and DSS were constructed from these independent prognostic factors. The predictive ability of the nomograms was assessed using receiver operating characteristic (ROC) curves and calibration plots, while accuracy was assessed using decision curve analysis (DCA). Clinical utility was evaluated with a clinical impact curve (CIC).
Results A total of 104,933 patients with COAD were identified, including 51,360 women and 53,573 men. The follow-up duration ranged from 22 to 88 months, with an average of 46 months. Multivariate Cox regression analysis revealed that age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were independent prognostic factors. Nomograms were constructed to predict the probability of 1-, 3-, and 5-year OS and DSS. The concordance index (C-index) and calibration plots showed that the established nomograms had robust predictive ability. The decision curves (from the DCA) and the clinical impact curves (from the CIC) showed good predictive accuracy and clinical utility.
Conclusion In this study, a nomogram model for predicting the individualized survival probability of patients with COAD was constructed and validated. The nomograms were accurate for predicting the 1-, 3-, and 5-year DSS of patients with COAD. These findings may help guide clinical treatment and provide a basis for further prospective follow-up studies.


Introduction
Colon adenocarcinoma (COAD) is an aggressive primary intestinal malignancy (1) that ranks fourth (6.1%) and fifth (5.8%) in morbidity and mortality, respectively (2). Furthermore, this disease is genetically heterogeneous. China ranks first in the world in terms of new cancer cases and deaths (3,4). In China, more than 380,000 new cases of cancer of the colon and rectum were projected annually (5). The global burden of cancer, including COAD, is rising, and cancer is on the verge of becoming the leading cause of death in the 21st century (6). Therefore, discovering new therapeutic strategies for COAD is of great significance.
Current knowledge of COAD comes mostly from small series, retrospective studies, or individual case reports. Studies focusing on the survival and treatment of large populations of patients with COAD have not yet been reported. The SEER (Surveillance, Epidemiology, and End Results) database is a valuable resource for studying malignancies such as COAD when clinical trial or prospective data are limited (7). The cohort analyzed retrospectively from this database represents the latest and largest COAD cohort in the literature.
A nomogram is used to calculate the probability of clinical events from complex computational formulas. It is displayed graphically, with each clinical or laboratory indicator listed separately and scored independently. The probability of a clinical event can then be determined from the cumulative score of all variables (8)(9)(10). With the help of a nomogram, clinicians can assess survival risk, personalize treatment plans, optimize treatment strategies, and actively conduct follow-ups (11,12).
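The scoring idea behind a Cox-based nomogram can be sketched as follows. This is only an illustration under stated assumptions: the coefficients, the two variables shown, and the baseline survival value are hypothetical placeholders and are not taken from this study. Each coefficient is rescaled to points so that the variable with the largest effect spans 100 points, and the total is mapped back to a survival probability through the Cox relation S(t) = S0(t)^exp(linear predictor).

```python
import math

# Hypothetical log hazard ratios per category (illustrative only, NOT study results).
HYPOTHETICAL_COEFFS = {
    "age": {"<45": 0.0, "45-59": 0.2, "60-74": 0.5, ">=75": 1.0},
    "metastasis": {"M0": 0.0, "M1": 2.0},
}


def points_table(coeffs, max_points=100.0):
    """Convert coefficients to nomogram points.

    The variable with the widest coefficient range is given max_points;
    all other levels are scaled by the same factor. Returns (table, scale),
    where scale is the number of points per unit of linear predictor.
    """
    widest = max(max(lv.values()) - min(lv.values()) for lv in coeffs.values())
    scale = max_points / widest
    table = {}
    for variable, levels in coeffs.items():
        lowest = min(levels.values())
        table[variable] = {lvl: (b - lowest) * scale for lvl, b in levels.items()}
    return table, scale


def total_points(patient, table):
    """Sum the points of every variable for one patient."""
    return sum(table[variable][patient[variable]] for variable in table)


def survival_probability(points, scale, baseline_survival):
    """Map total points to survival at a fixed time point.

    baseline_survival is S0(t) for a patient with zero points
    (an assumed input; in practice it comes from the fitted model).
    """
    linear_predictor = points / scale
    return baseline_survival ** math.exp(linear_predictor)


table, scale = points_table(HYPOTHETICAL_COEFFS)
patient = {"age": "60-74", "metastasis": "M1"}
score = total_points(patient, table)
print(score, survival_probability(score, scale, baseline_survival=0.9))
```

In a published nomogram the same mapping is read off graphically: each variable has its own axis of points, and the total-points axis is aligned with the survival-probability axes.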
In this study, the SEER database was used to depict the survival tendencies and the prognostic risk factors for COAD. We characterized the independent prognostic factors that were related to COAD and constructed a prognostic nomogram that could help oncologists accurately estimate prognosis and guide individualized treatments.

Patients
The data of patients diagnosed with COAD between January 1, 2004 and December 31, 2015, were extracted from the SEER database through the SEER*Stat tool (7,13). A total of 347,418 patients with COAD were initially identified. Patients were excluded if their demographic, clinicopathological, or follow-up data were incomplete. The following demographic variables and clinicopathological characteristics were included: age, gender, race, site_recode_ICD (International Classification of Diseases), grade, CS (Collaborative Stage)_tumor_size, CS_extension, and metastasis. To examine survival in COAD, we categorized patients by age: <45, 45-59, 60-74, and ≥75 years. Site_recode_ICD is a recode based on primary site and ICD-O-3 histology, which included the large intestine, colon, appendix, cecum, and rectum. Grade consists of four categories: well differentiated, moderately differentiated, poorly differentiated, and anaplastic. CS_tumor_size records the tumor size, while CS_extension records the extension of the tumor. Metastasis records distant metastasis. Overall survival (OS) is defined as the time interval from diagnosis to death from any cause, while disease-specific survival (DSS) is the time interval from diagnosis to death from COAD. The patients were divided into a training group and a validation group at a ratio of 7:3.
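The age grouping and the 7:3 division described above can be sketched as below. This is a minimal illustration, not the authors' code: the function names, the fixed random seed, and the use of rounding to set the training size are assumptions.

```python
import random


def age_group(age):
    """Assign an age in years to one of the four study categories."""
    if age < 45:
        return "<45"
    if age < 60:
        return "45-59"
    if age < 75:
        return "60-74"
    return ">=75"


def split_train_validation(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient identifiers into training and validation groups.

    The seed is an assumption added for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


train_ids, validation_ids = split_train_validation(range(100))
print(len(train_ids), len(validation_ids))  # 70 30
```

Splitting on patient identifiers rather than on rows keeps every patient in exactly one group.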

Univariate and multivariate Cox analyses
The incidence rates of COAD were estimated per 100,000 individuals and age-adjusted to the 2000 US Standard Population using SEER*Stat (version 8.3.2). The annual percentage changes (APCs) were calculated using the National Cancer Institute Joinpoint regression analysis scheme (version 4.5.0.1).
Univariate and multivariate analyses were performed to identify the related risk factors. Univariate Cox analysis was used to examine the association of survival with age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis. Using the results from the univariate analysis, multivariate analysis was conducted to confirm the independent risk factors. DSS and OS were estimated using Kaplan-Meier analysis and compared using the log-rank test. Both univariate and multivariate analyses used a Cox regression model.
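For readers unfamiliar with the Kaplan-Meier estimator, the product-limit calculation can be sketched in a few lines. This is a plain-Python illustration on toy data, not the R workflow used in the study; the function name and the input layout (one follow-up time and one event flag per patient, with 1 for death and 0 for censoring) are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one event.

    times  : follow-up time for each patient (e.g., months)
    events : 1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        n_at_risk -= leaving
    return curve


print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Censored patients contribute to the risk set up to their last follow-up time but never cause a drop in the curve, which is why the step at month 3 is computed over two patients rather than three.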

Statistical analyses
DSS was analyzed using the "forestplot" R package to present the p-value, hazard ratio (HR), and 95% confidence interval (95% CI) of each variable. Based on the results of the Cox regression analysis of patients with COAD, the final multivariate Cox regression model was visualized as nomograms to predict the 1-, 3-, and 5-year DSS and OS. The nomogram was constructed in the training set. Harrell's concordance index (C-index) was calculated to assess the performance of the nomogram. This index quantifies discrimination, that is, the agreement between a patient's predicted and actual survival (14). Both calibration plots of the clinical prediction model and receiver operating characteristic (ROC) curves were plotted, with the ROC curves being used to estimate the prediction performance and the validation set being used for validation (15,16). The higher the area under the ROC curve (AUC), the better the prognostic accuracy. In addition, decision curve analysis (DCA) was used to plot the net benefit (NB) and thereby assess clinical utility (17,18). Moreover, clinical impact curves were drawn to estimate the number of high-risk patients at each risk threshold (18). Calibration curves were also constructed for quantification. All statistical analyses were carried out using R software. The R packages mainly used in the analyses included ggplot2, survival, survminer, rms, and rmda. A t-test was performed to analyze the quantitative variables, while the chi-square test was used for qualitative data. A p-value < 0.05 was considered indicative of statistical significance.
FIGURE 1 Correlations between clinical indicators in DSS.
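As a rough illustration of Harrell's C-index, the pairwise definition can be written directly. This sketch is a simplified plain-Python version for toy data, not the implementation in the R packages named above; it assumes that a higher risk score means a worse predicted prognosis, counts tied risk scores as half concordant, and skips pairs with tied follow-up times.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by the risk score.

    A pair (i, j) is comparable when patient i had an observed event and a
    shorter follow-up time than patient j. The pair is concordant when the
    patient who died earlier also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: earlier deaths carry higher risk scores, so ranking is perfect.
print(harrell_c_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The double loop is quadratic in the number of patients, so it is only suitable for small examples.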

Patient baseline characteristics
After applying the inclusion and exclusion criteria and removing missing values, the study finally identified 104,933 patients with COAD diagnosed from 2004 to 2015. The patients were divided at a ratio of 7:3 into a training group (n = 73,454) and a validation group (n = 31,479). The baseline characteristics of the training and validation groups showed no statistically significant difference (p > 0.05). The detailed results are shown in Table 1. The total study population included 51,360 women and 53,573 men. The follow-up duration ranged from 22 to 88 months, with an average of 46 months.

Univariate Cox and risk factors for COAD patients
A correlation analysis between the clinical indicators was conducted. Survival_months and status showed the most significant correlation for DSS (Figure 1), and metastasis and Survival_months had the most significant correlation for OS (Supplementary Figure S1). Univariate Cox analysis was performed to identify the related risk factors. The extracted variables in the training set showed that age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were prognostic factors (p < 0.05) (Table 2). Figure 2A presents the survival status of all included patients with COAD. The Kaplan-Meier survival analysis showed that those aged ≥ 75 years had shorter DSS compared to younger participants (Figure 2B). Male gender was significantly associated with shorter DSS compared to female gender (Figure 2C). Black race was significantly associated with the shortest DSS compared to other races (Figure 2D). In terms of site_recode_ICD, the large intestine was significantly associated with the shortest DSS, while the appendix was significantly associated with a higher DSS compared to the other sites (Figure 2E). Early stage (stages I and II) was significantly associated with higher DSS compared to other stages in site_recode_ICD (Figure 2F). CS_tumor_size was significantly associated with DSS (Figure 2G), and M1 metastasis was significantly associated with shorter DSS compared to M0 metastasis (Figure 2H). These results were consistent with the results for DSS in the validation cohort (Supplementary Figure S2). We also performed the Kaplan-Meier survival analysis for OS, which showed the same trends of the prognostic factors (i.e., age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis) (Supplementary Figures S3, S4).
FIGURE 3 Forest plots of DSS in training data set.

Multivariable Cox regression and forest plot
Applying multivariable Cox regression on the results of the variables from the univariate analysis, eight independent prognostic factors were screened out, namely, age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis. All variables showed statistical significance for both DSS and OS (Table 3, Supplementary Table S1). The HRs of age and race were lower than those predicted for DSS (Table 3), which was consistent with the results of OS (Supplementary Table S1). On the other hand, the HRs of gender, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were higher than 1 as risk factors for both DSS and OS (Table 3, Supplementary Table S1). Furthermore, forest plots were drawn using these eight independent prognostic factors, as shown in Figure 3 for DSS and Supplementary Figure S5 for OS. The forest plots showed that age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were independent risk factors.
FIGURE 4 Nomogram for COAD patients. (B-D) and (E-F) are the calibration diagrams for the training and validation data sets, respectively, which showed good consistency.
FIGURE 5 ROC curves for the training and validation data sets (A-C, training data set; D-F, validation data set).

Nomogram construction and model validation
Based on the univariate and multivariate Cox regression analyses, a nomogram was constructed including all predictors (age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis) (Figure 4A). The calibration plot showed good agreement both in the training and validation datasets (Figures 4B-F). The AUC values of the 1-, 3-, and 5-year survival in the nomograms were 0.818, 0.829, and 0.824, respectively, in the training group (Figures 5A-C), while these values were 0.825, 0.836, and 0.828, respectively, in the validation group (Figures 5D-F). Furthermore, we calculated the C-index to assess the performance of the constructed nomograms. The predicted C-index values for the DSS nomogram were 0.787 and 0.782 in the training and validation datasets, respectively.

Clinical applicability of the nomogram
The survival curves of the DSS of the 31,479 patients were plotted using the Kaplan-Meier method (Figure 2A). Our results illustrated that survival in patients with COAD decreased significantly with follow-up time (p < 0.001). The DCA plots showed that the model achieved its maximum benefit over threshold probabilities ranging from 0.1 to 0.9 in the training data (Figures 6A-C). The validation data, together with the Kaplan-Meier survival curves (p < 0.001), presented the same trend (Figures 6D-F), which was consistent with the results for OS (Supplementary Figure S6).
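The net benefit plotted in a decision curve can be illustrated with a simplified calculation. The sketch below treats the outcome as binary at a fixed time point and ignores censoring, which is a simplification relative to survival-based DCA; the function names and the toy data are assumptions. At a threshold probability pt, net benefit is TP/n - (FP/n) * pt / (1 - pt), and the "treat all" strategy serves as a reference.

```python
def net_benefit(outcomes, predicted_risks, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    outcomes        : 1 if the event occurred by the time point, else 0
    predicted_risks : model-predicted event probabilities
    threshold       : threshold probability, strictly between 0 and 1
    """
    n = len(outcomes)
    flagged = [(y, p) for y, p in zip(outcomes, predicted_risks) if p >= threshold]
    true_pos = sum(1 for y, _ in flagged if y == 1)
    false_pos = sum(1 for y, _ in flagged if y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - false_pos / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Reference strategy: every patient is treated as high risk."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def high_risk_count(predicted_risks, threshold):
    """Number of patients flagged as high risk, as in a clinical impact curve."""
    return sum(1 for p in predicted_risks if p >= threshold)


outcomes = [1, 1, 0, 0]
risks = [0.9, 0.6, 0.4, 0.1]
for pt in (0.2, 0.5):
    print(pt, net_benefit(outcomes, risks, pt), net_benefit_treat_all(outcomes, pt))
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and zero, the net benefit of treating no one.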

Discussion
A nomogram condenses a multivariate regression model containing many prognostic factors into a simplified estimation model for predicting the probability of events (19,20). The nomogram allows clinicians to evaluate the individual health of patients more visually and to offer personalized treatments (21,22). At present, nomograms are commonly applied for prognosis (e.g., OS and DSS of patients with cancer) (23)(24)(25). A study found that HOXC8, IRF7, and CXCL13 could be used as potential prognostic signatures for COAD based on the nomogram algorithm (26). Based on patients with COAD, we constructed a new prognosis prediction model.
The correlations between the clinical indicators were calculated. Survival_months and status had the most significant correlation for DSS (Figure 1), while metastasis and Survival_months showed the most significant correlation for OS (Supplementary Figure S1). This showed that metastasis was associated with prognosis. The independent prognostic factors for DSS and OS were confirmed via univariate and multivariate Cox regression analyses. Univariate analysis showed that age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were associated with DSS (Table 2). These factors were then applied in the multivariate Cox regression. The results of the multivariate analysis showed that age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis were independent prognostic factors for both DSS and OS (Table 3, Supplementary Table S1). The HRs of gender, site_recode_ICD, grade, CS_tumor_size, and metastasis were higher than 1 for both DSS and OS (Table 3, Supplementary Table S1). This indicated that gender, site_recode_ICD, grade, CS_tumor_size, and metastasis were risk factors for COAD. Kaplan-Meier survival analysis revealed that black race was significantly associated with the shortest DSS compared to other races (Figure 2D), implying that black patients need priority monitoring. The large intestine was significantly associated with the shortest DSS compared to others in site_recode_ICD (Figure 2E), indicating that more attention should be paid to this site. Early stage (stages I and II) was significantly associated with a higher DSS in site_recode_ICD (Figure 2F), and the risk increased with grade, which is consistent with clinical experience. CS_tumor_size was significantly associated with DSS (Figure 2G), and M1 metastasis showed a greater risk compared to M0 metastasis. Taken together, clinical measures should be taken for patients with larger tumors and for those with tumor metastasis.
The same results for DSS were found in the validation cohort (Supplementary Figure S2). Similarly, the same trends of the prognostic factors (age, gender, race, site_recode_ICD, grade, CS_tumor_size, CS_extension, and metastasis) were also found for OS (Supplementary Figures S3, S4).
All independent prognostic factors in the Cox regression model analysis were used to build the prognostic prediction nomogram. By summing the points assigned to each variable and projecting the total points downward onto the bottom scales, the probabilities of OS and DSS at 1, 3, and 5 years were estimated for each patient. The C-index values indicated that our newly built nomogram had great potential to accurately predict the prognosis of patients. The DCA plots demonstrated good clinical utility in the training dataset for prediction of the 1-, 3-, and 5-year survival (Figures 6A-C). The validation set also showed similar trends (Figures 6D-F), which were consistent with the results of OS (Supplementary Figure S6). The DCA results revealed good predictive accuracy and clinical utility. However, the following study limitations remain. Firstly, this study had a retrospective design, so potential bias cannot be fully excluded. Secondly, although we randomly split the data into training and validation datasets, more external validation, such as validation of the model in other institutions or other countries, is still necessary in the future.
In conclusion, we constructed and validated a nomogram model for predicting individualized survival probability in patients with COAD. This convenient visual nomogram showed not only excellent clinical utility but also the ability to adequately differentiate patients with COAD, suggesting that it may be a potentially simple and maneuverable tool for clinicians to personalize prognostic assessment and determine treatment strategies.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Ethics statement
This study is based on the SEER database and does not require ethical approval.