Development and Validation of Nomogram for Predicting Survival of Primary Liver Cancers Using Machine Learning

Background and Aims: Primary liver cancer (PLC) is a common malignancy with poor survival that requires long-term follow-up. Nomograms are therefore needed to predict overall survival (OS) and cancer-specific survival (CSS) for patients with PLC using data from different databases. Methods: Data on PLC patients were downloaded from the Surveillance, Epidemiology, and End Results (SEER) and The Cancer Genome Atlas (TCGA) databases. The Kaplan-Meier method and log-rank test were used to compare differences in OS and CSS. Independent prognostic factors for patients with PLC were determined by univariate and multivariate Cox regression analyses. Two nomograms were developed based on the results of the multivariate analysis and evaluated with calibration curves and receiver operating characteristic curves. Results: The OS and CSS nomograms were based on age, race, TNM stage, primary diagnosis, and pathologic stage. The area under the curve (AUC) was 0.777, 0.769, and 0.772 for 1-, 3-, and 5-year OS, and 0.739, 0.729, and 0.780 for 1-, 3-, and 5-year CSS. The performance of the two models was then evaluated using calibration curves. Conclusions: We systematically reviewed the prognosis of PLC and developed two nomograms. Both nomograms facilitate clinical application and may benefit clinical decision-making.


INTRODUCTION
Primary liver cancer (PLC) is one of the most common malignancies of the digestive system, and its mortality rate in men and women has increased so that it now ranks fourth and seventh, respectively, in cancer-related deaths among global malignancies (1). Traditionally, PLC is subdivided at the pathological level into 3 groups: hepatocellular carcinoma (HCC, comprising 75%-85% of cases), cholangiocarcinoma (CC, 10%-15%), and combined hepatocellular-cholangiocarcinoma (CHC), a rare primary liver cancer (2). Although the trend of PLC largely reflects the trend of HCC, there are notable exceptions (3). The main risk factors for liver cancer are chronic infection with hepatitis B virus (HBV) or hepatitis C virus (HCV), consumption of aflatoxin-contaminated food, and heavy drinking. However, the main risk factors differ between regions. Because of the low early diagnosis rate and the high rates of recurrence and metastasis after resection, the 5-year survival rate of PLC has remained between 15% and 40% (2). Given the poor prognosis and low survival rates, HCC patients require long-term follow-up.
In the era of big data, various intelligent techniques can be used to optimize medical management plans, provide better patient care and treatment, improve population health, and reduce costs (4). The Surveillance, Epidemiology, and End Results (SEER) database, supported by the Surveillance Research Program of the National Cancer Institute (NCI) Division of Cancer Control and Population Sciences, is one of the most representative large-scale tumor registration databases. It collects a large volume of evidence-based medical data and provides systematic evidence and valuable first-hand information for clinicians' evidence-based practice and clinical medical research (5). The Cancer Genome Atlas (TCGA) project was jointly launched by the NCI and the National Human Genome Research Institute. At present, it holds clinical and genetic information on more than 11,000 tumor patients with 33 cancers of more than 20 tissue types. In addition, the fields of big data and machine learning integrate genomics and other omics with electronic health records (EHRs) and other clinical data, which in turn has the potential to transform medicine. Machine learning algorithms can predict the risk of individual patients and more accurately determine which patients will benefit the most from a specific treatment (6,7).
The nomogram is a common prediction model used to predict and quantify the probability of clinical events. It is of great value for clinical decision-making and risk stratification, especially for cancer patients (8). Nomograms for breast cancer, lung cancer, liver cancer (9-11), and other malignancies can help patients to weigh the risks and benefits of treatment (5, 12). In recent years, relatively few studies of liver cancer have combined two separate databases. Therefore, we decided to combine the SEER and TCGA databases to construct nomograms that predict the prognosis of PLC and help provide new horizons for treatment.

Data Collection
Clinical data were downloaded from the SEER data portal (www.seer.cancer.gov) and the TCGA data portal (https://portal.gdc.cancer.gov). Inclusion criteria were: a) complete clinical information; b) only one malignant primary tumor; c) an International Classification of Diseases for Oncology-3 (ICD-O-3) histology code of 8170/3 (HCC), 8160/3 (CC), or 8180/3 (CHC). Follow-up ended when patients with liver cancer died or were lost to follow-up. As SEER and TCGA data are open to the public, approval from a local ethics committee was not necessary.
The patient variables we extracted and analyzed included baseline demographics and tumor characteristics. Baseline demographics included age (≤50y, 50-59y, 60-69y, 70-79y, ≥80y), race (White, Black, Other), gender (Female, Male), time of diagnosis, survival time (months), follow-up, and vital status. The main clinical variables were the pathological type of liver cancer (HCC, CC, CHC), American Joint Committee on Cancer (AJCC) stage, and TNM stage, all determined according to the AJCC Cancer Staging Manual.
Overall survival (OS) and cancer-specific survival (CSS) were used as the endpoints of our study. OS represents the time from diagnosis to the date of death or last contact. CSS represents the time from diagnosis to the date of cancer death.
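The two endpoints differ only in which deaths count as events. The sketch below illustrates one common way to encode them; the `Patient` class and the status labels are hypothetical and are not SEER or TCGA field names. It also assumes that, for CSS, deaths from other causes are treated as censored at the time of death, a convention the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    months: float  # time from diagnosis to death or last contact
    status: str    # hypothetical labels: "alive", "dead_cancer", "dead_other"


def os_event(p: Patient) -> int:
    """OS event indicator: 1 for death from any cause, 0 if censored."""
    return 1 if p.status in ("dead_cancer", "dead_other") else 0


def css_event(p: Patient) -> int:
    """CSS event indicator: 1 only for cancer death.

    Assumption: deaths from other causes are censored at the time of death.
    """
    return 1 if p.status == "dead_cancer" else 0


def endpoints(p: Patient) -> dict:
    """Return (time, event) pairs for both endpoints of one patient."""
    return {"OS": (p.months, os_event(p)), "CSS": (p.months, css_event(p))}
```

For example, a patient who died of another cause at 12 months contributes an event to OS but is censored at 12 months for CSS.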

Statistical Analyses
To make full use of our data to build predictive models, we used Python (version 3.8) to randomly shuffle the SEER data, taking the first 9161 patients as the training group and the remaining 2290 as the internal validation group, while 172 patients from TCGA served as the external validation group. We used the training group to build the prediction model and draw the nomogram. The validation groups were used to validate the model.
For survival analyses, univariate Cox analysis was used to identify significant variables from the clinical data. In all statistical analyses, P < 0.05 was considered significant. Univariate and multivariate Cox proportional hazards regression models were used to estimate hazard ratios (HR) and corresponding 95% confidence intervals (CI) for each potential prognostic variable. SPSS 25.0 (SPSS, Chicago, IL) was used for the above analyses. Based on the results of the multivariate analysis, nomograms were developed using R software to provide visual risk prediction. The performance of the prognostic model was evaluated by calculating the concordance index (c-index). Nomograms were calibrated for 1-, 3-, and 5-year survival by comparing observed survival with predicted survival probabilities.
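A shuffle-then-slice split of this kind can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the seed, and the toy cohort of 10 identifiers are assumptions, and the authors' actual script is not given in the text.

```python
import random


def split_cohort(patient_ids, n_train, seed=0):
    """Shuffle the identifiers reproducibly and take the first n_train for training.

    Returns (training_ids, internal_validation_ids).
    The seed is an assumption; the text does not report one.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Toy example: 10 hypothetical patient identifiers, 8 used for training.
train_ids, internal_ids = split_cohort(range(10), n_train=8)
```

Fixing the seed makes the split reproducible, which matters when the validation results are later compared across model revisions.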
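For reference, the hazard ratio and its 95% confidence interval follow from a fitted Cox coefficient and its standard error as HR = exp(beta) and CI = exp(beta ± 1.96 × SE). The helper below is a small sketch of that Wald-type calculation; the function name and the example inputs are illustrative and do not come from the study's SPSS output.

```python
import math


def hazard_ratio_ci(beta, se, z=1.96):
    """Convert a Cox coefficient and its standard error into (HR, lower, upper).

    Uses the Wald interval exp(beta +/- z * se); z = 1.96 gives a 95% interval.
    """
    hr = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return hr, lower, upper


# Illustrative values only: a coefficient of ln(2) corresponds to a doubling of hazard.
hr, lower, upper = hazard_ratio_ci(math.log(2), 0.1)
```

A variable is significant at the 0.05 level under this approximation when the interval excludes 1.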

Patient Characteristics
According to the screening criteria, the data of 66039 patients were extracted from the SEER database. Subsequently, 54588 patients were excluded because their data were incomplete. The final sample included 11451 patients in the entire cohort. Among them, 9161 (80%) patients were used as a training set to establish a predictive nomogram. The remaining 2290 (20%) patients were used to validate the nomogram. The external validation cohort included 172 patients from TCGA (Supplementary Table 1). The clinicopathological features of the training and validation cohorts are shown in Table 1. All patients had complete information on survival time and cause of death. The median survival time of patients with liver cancer in this sample was 13.0 months. The 1-, 3-, and 5-year OS rates in the SEER population were 36.1%, 9.4%, and 2.2%, respectively, while the 1-, 3-, and 5-year CSS rates were 51.5%, 29.7%, and 21.5%, respectively.
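Survival rates at fixed time points such as these are typically read off a Kaplan-Meier curve. The following is a minimal pure-Python sketch of the product-limit estimator; the function names and the toy data are assumptions for illustration, not the study data or the software used by the authors.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times:  follow-up times (e.g. months)
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival) steps at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving
    return curve


def survival_at(curve, horizon):
    """Read the estimated survival probability at a given time horizon."""
    s = 1.0
    for t, surv in curve:
        if t > horizon:
            break
        s = surv
    return s


# Toy data (hypothetical): five patients, follow-up in months.
curve = kaplan_meier([6, 12, 12, 24, 36], [1, 1, 0, 1, 0])
one_year = survival_at(curve, 12)
```

With real data, the 1-, 3-, and 5-year rates would be `survival_at(curve, 12)`, `survival_at(curve, 36)`, and `survival_at(curve, 60)`.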

Univariate and Multivariate Cox Proportional Hazard Analysis
We performed univariate and multivariate analyses to identify prognostic factors associated with the survival of PLC patients in the training cohort. In the univariate analysis, older age, higher TNM stage, higher pathologic stage, CHC, and American Indian/Alaska Native race predicted worse OS and CSS. However, ethnicity and gender had no significant effect on OS or CSS (Figures 1, 2).
Univariate Cox regression analysis of the training cohort revealed the role of the following parameters in predicting patient survival. Factors such as age, pathological type-CC, pathologic stage, stage T0, stage T1, and stage N were associated with patients' prognoses. All the above variables were statistically significant (all P<0.05) and were included in multivariate analysis. Among these factors, the pathologic stage (c-index=0.669) and T stage (c-index=0.643) had higher discriminatory power in predicting PLC survival compared with other factors. In the Cox analysis, the maximum number of iterations was 20.
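The c-index quoted for individual factors is the proportion of comparable patient pairs in which the patient with the higher predicted risk has the shorter observed survival. Below is a simplified sketch of Harrell's c-index in pure Python; the function name is an assumption, pairs with tied survival times are skipped for brevity, and the quadratic loop is meant for illustration rather than for a cohort of several thousand patients.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's c-index.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. The pair is concordant
    when patient i also has the higher risk score; tied scores count 0.5.
    Pairs with tied times are ignored in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

For a single categorical factor such as pathologic stage, the risk score can simply be the ordinal stage, which yields many tied scores and therefore a c-index pulled toward 0.5.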
Variables involved in the multivariate analysis of OS include pathological types, pathologic stage, TNM stage, and age. According to multivariate analysis, patients with younger age, disease type of CC, lower TNM stage and adequate treatment had improved outcomes. These factors were then incorporated into the prediction model (Table 2).

Development and Validation of a Prognostic Nomogram
Factors from the multivariate analysis were used to develop nomograms to calculate 1-, 3-, and 5-year OS or CSS probabilities (Figure 3). Each prognostic parameter was scored according to its prognostic value. The total score was used to predict 1-, 3-, and 5-year OS and CSS. Furthermore, the total score for all variables was converted into an estimate of the probability of death. Discrimination between predicted survival probabilities and actual observations was assessed using the c-index. The c-index ranges from 0.5 to 1.0, where 0.5 represents random chance and 1.0 represents perfect discrimination (13). The c-index of the prognostic nomogram for OS prediction was 0.702 (95% CI, 0.696-0.708) in the training cohort and 0.702 (95% CI, 0.689-0.714) in the internal validation cohort. We tested the nomogram using an internal receiver operating characteristic (ROC) curve in the training cohort. The area under the curve (AUC) was 0.777, 0.769, and 0.772 for 1-, 3-, and 5-year OS, respectively, and 0.739, 0.729, and 0.780 for 1-, 3-, and 5-year CSS (Figure 4). The calibration plots show good agreement between predicted and observed survival in the internal and external validation cohorts (Figure 5) (Supplementary Figures 1, 2).
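The scoring step of a nomogram can be illustrated with a short sketch. A common convention, assumed here because the text does not give the scaling, assigns 100 points to the largest effect of the most influential variable and scales every other level linearly by its Cox coefficient. The function names and the coefficients below are hypothetical and are not the fitted values from Table 2.

```python
def nomogram_points(coefs):
    """Map Cox coefficients to nomogram points on a 0-100 scale.

    coefs: {variable: {level: beta}}, with beta = 0.0 for the reference level.
    The variable with the widest coefficient range spans the full 100 points.
    """
    ranges = {v: max(levels.values()) - min(levels.values())
              for v, levels in coefs.items()}
    widest = max(ranges.values())
    if widest == 0:
        raise ValueError("all coefficients are identical")
    return {
        v: {lvl: 100.0 * (beta - min(levels.values())) / widest
            for lvl, beta in levels.items()}
        for v, levels in coefs.items()
    }


def total_points(points, patient):
    """Sum the points of one patient given {variable: level}."""
    return sum(points[v][level] for v, level in patient.items())


# Hypothetical coefficients for illustration only.
example_coefs = {
    "stage": {"I": 0.0, "II": 0.5, "III": 1.0, "IV": 2.0},
    "age": {"<60": 0.0, ">=60": 0.5},
}
points = nomogram_points(example_coefs)
score = total_points(points, {"stage": "IV", "age": ">=60"})
```

The total score is then mapped to a survival probability through the baseline survival function of the fitted model, a step that depends on the fitted baseline and is not reproduced here.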

DISCUSSION
Worldwide, PLC is a common cause of cancer-related death. PLC death rates are increasing faster than those of any other cancer (14). In addition, PLC is the second most lethal tumor after pancreatic cancer. HCC accounts for the majority of PLC (15). The increasing number of deaths due to HCC is a growing concern (16). Disease and tumor-related factors have a great impact on the treatment of PLC (17). CC is an epithelial cell malignancy, and most CCs are well, moderately, or poorly differentiated adenocarcinomas, with other histological subtypes rarely occurring. Most CCs arise de novo, with no risk factors identified (18). Moreover, CHC is a rare and aggressive variant with features of both HCC and CC, and it is unclear whether treatments commonly used for PLC are effective. The prognosis of CHC is particularly poor due to its aggressive nature. The estimated incidence of CHC ranges from 1% to 14.2% (19, 20).
In this research, patients diagnosed with PLC were included in the analysis. From more than 60000 patients, we included 9161 patients with complete clinical information from the SEER database in the training set and 172 patients from TCGA. In the univariate analysis, race, age, pathologic stage, primary diagnosis, and T and N stage were all related to liver cancer progression. We then conducted the multivariate analysis using the variables that were significant in the univariate analysis. In the multivariate analysis, we demonstrated that older age, higher pathological stage, and more advanced T and N stages were independently associated with poor overall survival in PLC. PLC incidence rates vary by race/ethnicity and state, largely because of differences in the prevalence of major risk factors and, to some extent, because of different access to high-quality care (21, 22). Higher social status is also associated with better survival. In this research, we analyzed the association of ethnicity and race with tumor survival and found that survival was slightly lower in American Indian/Alaska Native and Black patients.
Many studies have shown that the TNM stage may be an important prognostic factor in HCC (23) (24). In the present study, we analyzed the relationship between the TNM stage and tumor survival and found that the higher the TNM stage, the worse the survival.
Hence, we plotted the nomograms according to the independent prognostic factors in the multivariate analysis. The data used were derived from the SEER database, which supports the validity and reliability of our conclusions, as well as the internal and external validity of the nomograms. A new nomogram must be validated to confirm its value and to guard against overfitting; we therefore validated the predictive value of the model using both internal and external validation cohorts. In addition, we measured the accuracy of this model with ROC curves and calibration plots; the larger the AUC, the higher the accuracy of the model. The training cohort AUC was 0.777, 0.769, and 0.772 for 1-, 3-, and 5-year OS and 0.739, 0.729, and 0.780 for 1-, 3-, and 5-year CSS. All these results indicated that the model had good accuracy for the prediction of liver cancer survival. Meanwhile, the calibration curve also validated the model's prediction ability on the overall sample. It has been reported that individualized prediction is considered a critical condition of predictive models (25). However, most current studies are based on a single database (26, 27). In this research, we mainly performed long-term follow-up of patients with PLC. The main objective of this study was to use two databases to predict overall and cancer-specific mortality in patients with liver cancer, which differs from currently published studies of predictive nomograms. The huge number of patients with PLC recorded in the SEER database helped us to build a more accurate model. In addition, the items included in the nomogram are common, easily accessible, and comprehensible for physicians and patients in the clinic.
Related studies have also developed nomograms for specific cancers. Ni et al. (5) developed a hepatocellular carcinoma nomogram to predict cancer-specific mortality and overall mortality using the SEER database, which will help clinicians to obtain individualized prediction information to determine whether patients are at high risk of death. Song et al. (28) created a pancreatic cancer survival nomogram to effectively predict patients' survival and use it in clinical practice. Similarly, Wang et al. (29) developed and validated a new nomogram for pulmonary invasive mucinous adenocarcinoma based on the SEER database, which is expected to provide new ideas for treatment. All of these studies are based on a single bioinformatics database, such as the SEER database, to develop nomograms for multiple cancers that predict CSS and help clinicians make clinical decisions. In this research, we used two bioinformatics databases (SEER and TCGA) and developed two nomograms simultaneously, making clinical decisions more convenient and effective.
Although we have developed powerful nomograms, there are still several limitations that must be acknowledged. Potential prognostic factors available in public databases are limited. Further analysis with a more complete data set may enhance the predictive power of this tool. SEER and TCGA do not report underlying chronic liver disease, laboratory studies to assess liver function, the Child-Pugh score, or details of tumor characteristics, all of which are important for further treatment and thus affect survival.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.