Analysis of risk factors of hepatocellular carcinoma and establishment of a clinical prognosis model

Liver cancer is a common malignancy of the digestive system. Hepatocellular carcinoma (HCC) accounts for the vast majority of these tumors and has imposed a heavy medical burden on underdeveloped countries and regions. Many factors affect the prognosis of HCC patients; however, there is no specific statistical model to predict the survival time of clinical patients. This study derived a risk factor signature for HCC and a reliable clinical prediction model by statistically analyzing patient information from the Surveillance, Epidemiology, and End Results (SEER) database using open-source packages in the Python environment.


Background
Liver cancer is a common malignancy of the digestive system (1,2). Primary liver cancer mainly includes hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) (3). HCC accounts for most of these tumors and is the fifth most common cancer and the fourth leading cause of cancer-related deaths worldwide (4,5). Men have a higher risk of HCC than women, and HCC is the second leading cause of cancer death in men. Moreover, HCC morbidity and mortality are still rising (6,7). The main risk factors for HCC development are cirrhosis and chronic liver disease (8). Cirrhosis is an important step in virus-driven HCC carcinogenesis (9). Additionally, chronic hepatitis caused by hepatitis B virus (HBV) and hepatitis C virus (HCV) infections is an important risk factor for liver cancer (10). Most new liver cancer cases occur in developing countries with a high rate of HBV infection. Meanwhile, non-alcoholic fatty liver disease (NAFLD) is the leading cause of HCC in developed countries (11,12).
Liver Doppler ultrasound and alpha-fetoprotein (AFP) testing are simple and convenient methods to screen for liver cancer (13). Elevated AFP and des-gamma-carboxy prothrombin (DCP) levels are typical features of liver cancer (14). Additionally, CT, enhanced CT, MRI, enhanced MRI, and other imaging methods are helpful for precise HCC diagnosis (15). Since liver biopsy carries risks of tumor implantation and bleeding, and false negative results might occur, it is generally not recommended for HCC (16).
At present, the most commonly used staging systems for liver cancer include the TNM (tumor node metastasis), China Liver Cancer (CNLC), and Barcelona Clinic Liver Cancer (BCLC) staging systems (17). The TNM staging was jointly proposed by the American Joint Committee on Cancer (AJCC) and the Union for International Cancer Control (UICC) and has been widely used in clinical practice. TNM is a tumor staging system based on tumor morphology (T), regional lymph node metastasis (N), and distant metastasis (M). The TNM staging of liver cancer is very detailed, especially the T staging, which accounts for invasion of the microvessels around the tumor and can therefore better help evaluate the prognosis.
Radical surgical resection is the primary treatment for early HCC. However, whether advanced HCC patients can benefit from surgery is controversial. Recently, breakthroughs have been made in non-surgical treatments. For example, drug therapy, immunotherapy, and targeted therapy have been successfully applied to treat advanced liver cancer (18). Transcatheter arterial chemoembolization (TACE), hepatic arterial infusion chemotherapy (HAIC), and radiotherapy can improve patient prognosis (19). Some experts believe that conventional chemotherapy can also benefit HCC patients (20). Nevertheless, most experts believe that conventional chemotherapy has little effect on liver cancer (21-23).
The SEER database is a publicly available cancer reporting system funded by the US federal government (24). These representative and reliable data come from 18 US states. Users can retrieve each patient's sex, age, surgical method, chemotherapy, radiotherapy, other clinical information, survival time, and survival status. This study obtained permission to use the SEER PLUS database. Thus, to further explore HCC risk factors and treatment plans and to establish a machine learning model to guide clinical treatment, we retrieved HCC patient data from the SEER database and analyzed them after screening.

Data acquisition
Herein, we retrieved data from 107148 HCC patients from the SEER database. Clinical information included gender, age, race, histological type, histological grading, surgical method, regional lymph node dissection, radiotherapy, chemotherapy, diagnosis to treatment time, AFP, TNM staging, survival time, and survival status.

Excluding factors
To ensure the accuracy of the machine learning model, we did not impute missing information automatically. Data were filtered according to the clinical characteristics of each group, and a total of 102680 patients with missing or unknown information were excluded. Finally, 4468 patients were selected for subsequent analysis.
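The complete-case screening described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used in this study: the field names, the toy records, and the set of missing-value markers are all hypothetical, and a real SEER export may encode unknown values differently.

```python
# Hypothetical markers for missing values; adjust to the actual export.
MISSING_MARKERS = {"", "unknown", "blank(s)", "na", "n/a"}


def is_complete(record, required_fields):
    """Return True when every required field holds a usable value."""
    for field in required_fields:
        value = record.get(field)
        if value is None:
            return False
        if str(value).strip().lower() in MISSING_MARKERS:
            return False
    return True


def filter_complete_cases(records, required_fields):
    """Split records into (kept, excluded) without imputing anything."""
    kept, excluded = [], []
    for record in records:
        if is_complete(record, required_fields):
            kept.append(record)
        else:
            excluded.append(record)
    return kept, excluded


# Toy records with made-up values, for illustration only.
toy_records = [
    {"sex": "Male", "afp": "Positive", "survival_months": 14},
    {"sex": "Female", "afp": "Unknown", "survival_months": 30},
    {"sex": "Male", "afp": "Negative", "survival_months": None},
]
kept, excluded = filter_complete_cases(
    toy_records, ["sex", "afp", "survival_months"]
)
print(len(kept), len(excluded))  # 1 2
```

Returning the excluded records alongside the kept ones makes it easy to report how many patients were removed at this step.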

Statistical methods
All analyses were implemented in Python 3.10.6 (Python Software Foundation, https://www.python.org/). Clinical feature analysis was conducted with TableOne. The Cox regression analysis was performed using Lifelines. The random survival forest (RSF) analysis was carried out using Scikit-Survival. The survival curves of clinical patients were predicted using the RSF model. The accuracy of each model was evaluated using the concordance index (C-index).
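For readers unfamiliar with the C-index, the following dependency-free sketch shows how Harrell's concordance index can be computed. It is an illustration only and is not the implementation provided by Lifelines or Scikit-Survival; ties in survival time are ignored for simplicity, and the toy values are made up.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked correctly.

    times: observed follow-up times
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: four patients, the third one censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(toy_times, toy_events, toy_risk))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the model C-indices in this study should be read.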

Clinical characteristics
After screening, 4468 patients were selected for further analysis, and their clinical characteristics are summarized in Table 1. A total of 2324 patients received chemotherapy, and 2144 patients did not. Most clinical features significantly differed between the two groups, including gender, race, histological type, surgery, regional lymph node dissection, diagnosis to treatment time, survival time, AFP, survival status, and T, N, M stages (χ² test, p < 0.05).
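The group comparison can be illustrated with a Pearson χ² test on a 2×2 contingency table. The sketch below uses only the standard library and omits the continuity correction; the counts are made up and do not come from Table 1, and this is not necessarily the exact procedure applied by TableOne.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table: [[a, b], [c, d]] of observed counts.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up counts: rows = chemotherapy yes/no, columns = surgery yes/no.
toy_table = [[30, 70], [60, 40]]
stat, p = chi_square_2x2(toy_table)
print(round(stat, 2), p < 0.05)  # 18.18 True
```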

Overall risk factors
Furthermore, we used Cox regression analysis to evaluate the impact of various clinical features on the survival of HCC patients (Table 2). Distant organ metastasis, lymph node metastasis, chemotherapy, positive AFP, histological grade, sex, race, tumor size, and age were risk factors for HCC. On the other hand, surgical treatment and early diagnosis and treatment were protective factors for HCC (p < 0.05). Radiotherapy and regional lymph node dissection were not significantly associated with survival (p > 0.05). The C-index of the Cox regression model was 0.76 (Figure 1).
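As a reminder of how these results are read, the standard Cox proportional hazards model and the hazard ratio (HR) of the k-th covariate can be written as follows. The symbols h_0, x, and β are generic notation and are not taken from the tables of this study.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_k = \exp(\beta_k)
```

An HR above 1 indicates that the covariate is associated with a higher hazard of death, and an HR below 1 indicates a lower hazard, with all other covariates held fixed.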

Risk factors at different stages
To explore the differences in treatment plans for HCC patients at different TNM stages, we divided patients into I, II, IIIa, IIIb, IVa, and IVb groups according to the 7th edition of the AJCC staging system. Then, we applied Cox regression analysis to evaluate the risk for each group (Table 3). We found that early diagnosis and treatment, and timely surgery were protective factors for HCC patients at stages I, II, and IIIa. In contrast, chemotherapy, radiotherapy, and positive AFP were risk factors for HCC patients and were associated with an unfavorable prognosis. Surgical treatment and early

Clinical feature importance and survival prediction
We randomly assigned 25% of the included patients to the test group and used the remaining 75% as the training group. To obtain the best model, survival analysis of the post-screening data was performed using the RSF model with manually tuned hyperparameters, yielding a C-index of 0.80 for the training set and 0.77 for the test set. Thus, the RSF model was slightly more reliable than the Cox regression model (C-index 0.76).
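A reproducible 75%/25% split of this kind can be sketched with the standard library alone. The function name and the seed below are hypothetical; the seed used in this study is not reported, so the exact patients in each group will differ.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly and return (train, test) index lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 4468 is the number of patients retained after screening.
train_idx, test_idx = train_test_split_indices(4468, test_fraction=0.25, seed=0)
print(len(train_idx), len(test_idx))  # 3351 1117
```

Fixing the seed matters here because the reported training and test C-indices depend on which patients fall into each group.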
The clinical feature importance ranking indicated that surgical treatment was the most important clinical factor in the RSF model (Table 4). Then, three patients from each of the surgery and non-surgery groups were drawn from the test group to plot predicted survival curves. Patients in the surgery group had a significantly better prognosis than those in the non-surgery group (Figure 2).
Subsequently, we used Streamlit to establish a clinical patient survival prediction platform based on the RSF model. In this framework, clinicians can enter the corresponding clinical information to generate predicted survival and cumulative risk curves for a patient, and can observe in real time how the survival curve changes as treatment parameters are adjusted. Therefore, this platform can be used to guide clinical treatment selection (Video 1).
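To illustrate how two predicted curves for the same patient might be compared, the sketch below summarizes a step-shaped survival curve by its value at a given time and by its median survival time. It is not the platform's code: the helper names are hypothetical, and the two curves are made-up numbers rather than RSF output.

```python
def survival_at(times, probs, t):
    """Evaluate a right-continuous step survival curve at time t."""
    value = 1.0  # survival is 1 before the first recorded time point
    for time_point, prob in zip(times, probs):
        if time_point <= t:
            value = prob
        else:
            break
    return value


def median_survival(times, probs):
    """First time at which survival drops to 0.5 or below, else None."""
    for time_point, prob in zip(times, probs):
        if prob <= 0.5:
            return time_point
    return None


# Made-up predicted curves (months, survival probability) for one
# hypothetical patient under two treatment scenarios.
months = [6, 12, 24, 36, 60]
with_surgery = [0.95, 0.88, 0.74, 0.61, 0.45]
without_surgery = [0.80, 0.62, 0.41, 0.28, 0.12]

print(median_survival(months, with_surgery))     # 60
print(median_survival(months, without_surgery))  # 24
print(survival_at(months, with_surgery, 30))     # 0.74
```

Summaries such as these give a clinician a single number per scenario, which can be easier to compare than two full curves.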

Discussion
The incidence and mortality of liver cancer continue to rise, and its treatment remains a global challenge (25). Surgery is the primary treatment of liver cancer (26). Nevertheless, liver cancer treatment has entered a new era with the development of immunotherapy and targeted therapeutic drugs. Since early liver cancer has no specific manifestation, few patients are diagnosed at early stages during Moreover, HCC has brought a heavy medical burden to underdeveloped countries and regions (18). Chronic HBV infection, chronic HCV infection, NAFLD, aflatoxin, and alcohol intake are important causes of HCC. For example, hepatitis B virus vaccination can reduce HCC incidence. Herein, the Cox regression analysis showed that the time from diagnosis to treatment was a protective factor for HCC patients. Thus, early detection and timely treatment might improve the prognosis of HCC patients (HR: 0.92, p < 0.005). Accordingly, government departments and relevant medical security institutions should strengthen health screening of groups at high risk of HCC to achieve early detection and treatment, which can prolong the survival time of patients and reduce the economic burden on families and medical security institutions.
We found that positive AFP was also a risk factor for HCC patients at stages I, II, and IIIa. Hence, AFP can be used as an indicator of the prognosis of HCC patients, and similar conclusions have been reached in other studies (27). The Cox regression and RSF models indicated that surgery could reduce HCC risk and improve patient outcomes. Surgical treatment was the most important clinical feature affecting the survival of HCC patients in the RSF model, making it a key factor in HCC management. For patients who can tolerate surgery, appropriate surgical treatment should be implemented as early as possible to avoid missing the optimal timing of treatment. Meanwhile, for patients not temporarily We found that chemotherapy and radiotherapy were unsuitable for early liver cancer patients. Unnecessary radiotherapy and chemotherapy can increase the risk of these patients. However, chemotherapy can be used for advanced liver cancer patients, who might benefit from systemic chemotherapy (HR: 0.67, p < 0.005). Sun et al. showed that chemotherapy was a common treatment for advanced HCC, but the effects were not ideal. Adding all-trans retinoic acid (ATRA) to fluorouracil, leucovorin, and oxaliplatin (FOLFOX4) to treat advanced HCC can improve the overall survival and time to progression of patients.
However, our study has some limitations. First, the SEER database does not contain specific information on targeted therapy and immunotherapy regimens, which can extend the survival time of patients with recurrent or advanced liver malignancies. Second, we did not evaluate several external factors that affect the survival time of tumor patients, such as economic conditions, medical insurance systems, and the level of medical development in the region. Finally, different machine learning models vary in their prognostic performance. Therefore, this study should only be considered a machine learning reference for treating tumor patients. With the continuous refinement of local databases and the optimization of artificial intelligence algorithms, machine learning models will come increasingly close to the reality of clinical practice.
Herein, we obtained a relatively reliable machine learning model using RSF. Then, we used this model to establish a survival prediction platform for HCC patients. This platform generates a predicted survival curve from the clinical information entered for a patient. Survival curves can also be compared to help select the best clinical treatment plan. Since the SEER database does not contain information on immunotherapy, targeted therapy, TACE, and other treatments, this platform only tests the feasibility of the method on existing data to guide further research.

Conclusion
In the present study, we found that distant organ metastasis, lymph node metastasis, histological grade, sex, race, tumor size, and age were risk factors for HCC patients. Additionally, early detection and timely treatment might improve the prognosis of HCC patients, and positive AFP might be used as a risk indicator. Moreover, surgical treatment is crucial for HCC patient survival. Chemotherapy and radiotherapy are inappropriate for early liver cancer patients since these treatments can increase their risk. Nevertheless, advanced liver cancer patients might benefit from systemic chemotherapy. Finally, the RSF model can be used for clinical survival prediction.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions
X-YG is responsible for writing manuscripts and program codes; M-CS and T-YW are responsible for literature retrieval; X-MW, GL, Y-ML and TY are responsible for the program code; WW is responsible for proofreading and reviewing. All authors contributed to the article and approved the submitted version.

Survival curves for the surgery and non-surgery groups.