A comparison of machine learning models and Cox proportional hazards models regarding their ability to predict the risk of gastrointestinal cancer based on metabolic syndrome and its components

Background Little is known about applying machine learning (ML) techniques to identify the important variables contributing to the occurrence of gastrointestinal (GI) cancer in epidemiological studies. We aimed to compare different ML models to a Cox proportional hazards (CPH) model regarding their ability to predict the risk of GI cancer based on metabolic syndrome (MetS) and its components. Methods A total of 41,837 participants were included in a prospective cohort study. Incident cancer cases were identified by following up with participants until December 2019. We used CPH, random survival forest (RSF), survival trees (ST), gradient boosting (GB), survival support vector machine (SSVM), and extra survival trees (EST) models to explore the impact of MetS on GI cancer prediction. We used the C-index and integrated Brier score (IBS) to compare the models. Results In all, 540 incident GI cancer cases were identified. The GB and SSVM models exhibited comparable performance to the CPH model concerning the C-index (0.725). We also recorded a similar IBS for all models (0.017). Fasting glucose and waist circumference were considered important predictors. Conclusions Our study found comparably good performance concerning the C-index for the ML models and CPH model. This finding suggests that ML models may be considered another method for survival analysis when the CPH model’s conditions are not satisfied.


Introduction
Gastrointestinal (GI) cancer refers to cancers affecting the digestive system. Gastric cancer, colorectal cancer, liver cancer, esophageal cancer, and pancreatic cancer are recognized as common GI cancers (1). According to an estimate in 2018, GI cancer accounted for 26% of new cancer cases and 35% of cancer-related deaths worldwide (1). The trend of GI cancers varies geographically by specific type; for instance, the highest rates of liver cancer, esophageal cancer, and gastric cancer are found in Asia (2). In Korea, although the trend has been upward for colorectal and pancreatic cancer and downward for the remaining GI cancers since 1999, GI cancer remains one of the most common cancers (3).
Although risk factors for specific types of GI cancer vary, lifestyle-related factors contribute significantly to the development of GI cancer (2). Specifically, a Western lifestyle has been documented to be associated with a higher prevalence of GI cancer (2). Notably, metabolic syndrome (MetS), which has been reported to be a global epidemic, might be considered an important mediator of the effect of a Western lifestyle on GI cancer development (2). MetS has been described as a group of conditions including obesity, hypertension, high blood sugar, and dyslipidemia (4). Currently, three definitions are used for MetS diagnosis, namely, the WHO 1999, the National Cholesterol Education Program (NCEP) Adult Treatment Panel III (ATP III) 2005, and the International Diabetes Federation 2006 definitions. Although MetS definitions are modified by health care organizations for different regions, MetS remains a significant and alarming global public health problem (5).
Existing evidence from epidemiological studies has revealed that MetS may be an etiologic factor for GI cancer development. For example, a previous study of large-scale molecular data for 366,016 participants reported a potential link between MetS and an elevated risk of GI cancer regardless of the MetS definition used (6). Similarly, an increased risk of colorectal cancer, liver cancer, and gastric cancer was found in participants with MetS in other cohort studies (7-11). Notably, a Cox proportional hazards (CPH) model was used in these studies; a CPH model is a regression model with a linear predictor and good interpretability (12). The CPH model is a semiparametric method that assumes a particular form for the relationship between survival times and predictor variables and assumes proportional hazards (13).
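For reference, the standard textbook form of the CPH model can be written as below. The symbols h_0 (baseline hazard), beta (coefficient vector), and x (covariate vector) are notation introduced here for illustration and are not taken from the original text.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\!\left(\beta^{\top}(x_1 - x_2)\right)
```

The baseline hazard h_0(t) is left unspecified, which is the nonparametric part, whereas the exponential term is parametric. The ratio on the right does not depend on t, which is the proportional hazards assumption, and the log hazard is linear in the covariates.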
To date, attention has been drawn to the application of machine learning (ML) to cancer prediction. In particular, the application of ML in survival analysis has been explored in recent years (14,15). However, the value of ML models compared to the CPH model is ambiguous due to inconsistent results in previous studies (15-18). For example, an ML model outperformed a CPH model in predicting breast cancer survival in a previous study (16). In contrast, comparably good performance was observed for ML and CPH models in predicting survival in patients with oral and pharyngeal cancer in another study (17). Thus, the value of ML approaches compared to the CPH model in survival analysis is still debatable.
To our knowledge, little is known about the application of ML techniques in epidemiological studies to identify important predictors affecting GI cancer development. In addition, no study has compared the accuracy of ML and CPH models for predicting GI cancer. Therefore, our study aimed to examine whether MetS and its components are related to GI cancer prediction and whether ML models outperform a CPH model in predicting GI cancer based on MetS and other confounders.

Study population
The Cancer Screenee Cohort Study was established in 2002 to explore the association of risk factors with cancer development in South Korea. The details of this cohort have been described previously (19). In brief, we recruited 41,837 participants aged 16 and older who visited the Center for Cancer Prevention and Detection at the National Cancer Center in South Korea for health examinations between August 2002 and December 2014. We required participants to complete baseline questionnaires and identified incident cancer cases by following up with participants until December 2019. Of the recruited participants, 41,121 were linked to the Korea Central Cancer Registry. Our final analysis included 24,139 participants after exclusion of 1,754 participants with incomplete questionnaires, 2,100 participants with a diagnosis of any cancer before recruitment, 6 participants aged <20 years, and 13,122 participants who lacked information on individual characteristics (Figure 1). We obtained written informed consent from all participants and approval for the study protocol from the institutional review board of the National Cancer Center (No. NCCNCS-07-077).

Outcome and predictor measurement
Potential incident GI cancer cases were obtained by linking to the 2019 Korea National Cancer Incidence Database of the Korea Central Cancer Registry. We used the following International Classification of Diseases, 10th Revision codes to identify common incident GI cancers: gastric cancer (C16), colorectal cancer (C18-C20), liver cancer (C22), esophageal cancer (C15), gallbladder cancer (C23), and pancreatic cancer (C25). We collected venous blood samples from participants at baseline after they had fasted for 8 hours to determine the blood-related components of MetS. Height (m) and weight (kg) were measured with InBody 3.0 (Biospace, Seoul, Korea) or an automatic height and weight measuring instrument (DS-102, Dong Shin Jenix Co., Ltd., Seoul, Korea). Waist circumference was measured with a tape measure 1 cm above the umbilicus with minimal respiration. A chemistry analyzer (TBA-200FR, Toshiba, Tokyo, Japan) was used to measure fasting glucose, triglyceride, and HDL cholesterol levels. Blood pressure was measured by trained personnel with an automatic blood pressure monitor (FT-200S, Jawon Medical, Kyungsan, Korea) after the participants had rested for 15 minutes (21). In addition, participants completed a self-administered questionnaire on baseline characteristics.
Predictors of GI cancer incidence in our study included MetS and its individual components (waist circumference, HDL, triglycerides, blood pressure, and fasting glucose). In addition, sociodemographic variables included age, sex, educational level (high school graduate or less and college or higher), marital status (married or cohabitating and others), monthly income (10,000 Korean won/month) (<200, 200-400, and >400), and first-degree family history of cancer (yes, no). Lifestyle factors included smoking status (nonsmoker, ex-smoker and current-smoker), alcohol consumption (nondrinker, former drinker and current-drinker), and physical activity (yes, no). These sociodemographic and lifestyle characteristics may be confounders for the association between MetS and GI cancer (6).

Models and evaluation
We used a CPH model and ML survival models, including random survival forest (RSF), survival trees (ST), gradient boosting (GB), survival support vector machine (SSVM), and extra survival trees (EST), to predict GI cancer. The ability of these ML models to predict an outcome in right-censored time-to-event data has been documented in the literature as follows: Random forest is an ML model that is most frequently used to solve problems in relation to classification and regression by constructing ensembles from decision trees and combining results to give a final decision. RSF extends random forest to censored lifetime data (22).
ST is another tree-based approach that has been widely used to handle time-to-event data. ST is implemented as follows: the data are partitioned based on a splitting criterion, and objects with similar events are grouped into the same node (23).
Similar to RSF, the GB model is an ensemble model that combines the predictions of multiple base learners to improve the prediction of the overall model. However, RSF averages predictions from independent trees to obtain the overall prediction, while the GB model is additive (24). Support vector machine aims to find a hyperplane to maximize the margin between classes. SSVM is an extension of the support vector machine to handle time-to-event data (16).
The EST model is a slightly different version of RSF (25). In comparison with RSF, the splitting criteria of EST are more random (26).
Figure 1. Flow chart of the study participants. Among 41,837 participants recruited, 41,121 participants were linked to the Korea Central Cancer Registry. Our final analysis included 24,139 participants after exclusion of 1,754 participants with incomplete questionnaires, 2,100 participants with a diagnosis of any cancer before recruitment, 6 participants aged <20 years, and 13,122 participants who lacked information on individual characteristics.
We used the following two evaluation metrics, which have been widely used in survival analyses in the literature, to compare the performance of the regression models (15,22,23,26). The concordance index (C-index) is a rank order statistic for predictions against true outcomes and is defined as the ratio of the concordant pairs to the total comparable pairs (23); the closer the C-index is to 1, the better the model performs (15,26). The integrated Brier score (IBS) reflects calibration over all time points, with a smaller value indicating greater accuracy (22). Furthermore, we evaluated the models based on the time-dependent area under the curve (AUC).
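The definition of the C-index as the ratio of concordant pairs to comparable pairs can be illustrated with a minimal Python sketch. This is not the scikit-survival implementation used in the study: the function name, the toy data, and the simplified handling of ties (pairs with tied follow-up times are skipped, tied risk scores count as half) are assumptions made here for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell-type C-index for right-censored data.

    A pair is comparable when the participant with the shorter follow-up
    time experienced the event. It is concordant when that participant
    also has the higher predicted risk. Tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # simplification: skip tied times and censored-first pairs
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: follow-up years, event indicators, predicted risks.
toy_times = [2, 4, 6, 8, 10]
toy_events = [True, False, True, True, False]
toy_risks = [0.9, 0.3, 0.2, 0.7, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)  # 6 of 7 pairs concordant
```

In the toy data, seven pairs are comparable and one of them (the participants with follow-up times 6 and 8) is ranked in the wrong order, so the C-index is 6/7.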

Statistical analysis
We calculated person-years from baseline to the date of cancer diagnosis, death, or end of follow-up (December 31, 2019), whichever came first. We used chi-square tests and Wilcoxon rank-sum tests to compare the baseline characteristics between participants with and without incident GI cancer.
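The person-years rule described above (follow-up ends at diagnosis, death, or December 31, 2019, whichever comes first) can be sketched as follows. The function name, the example dates, and the 365.25-day year are assumptions for illustration; the original analysis may have used a different convention.

```python
from datetime import date

END_OF_FOLLOW_UP = date(2019, 12, 31)  # end of follow-up stated in the text


def person_years(baseline, diagnosis=None, death=None, end=END_OF_FOLLOW_UP):
    """Return (follow-up in years, event indicator) for one participant.

    Follow-up stops at the earliest of diagnosis, death, or end of
    follow-up. The event indicator is True only when the diagnosis date
    is that earliest date; otherwise the participant is censored.
    """
    candidates = [d for d in (diagnosis, death, end) if d is not None]
    exit_date = min(candidates)
    years = (exit_date - baseline).days / 365.25  # assumed year length
    event = diagnosis is not None and diagnosis == exit_date
    return years, event


# Hypothetical participants (dates are invented for illustration).
case_years, case_event = person_years(date(2010, 1, 1), diagnosis=date(2015, 1, 1))
cens_years, cens_event = person_years(date(2014, 12, 31))
died_years, died_event = person_years(date(2010, 1, 1), death=date(2012, 1, 1))
```

The first participant contributes follow-up until diagnosis and counts as an event; the other two are censored at the end of follow-up and at death, respectively.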
Model development involved several steps. First, we randomly split the data into training and testing datasets at an 80:20 ratio. The training dataset was used to fit the models, and the testing dataset was used to evaluate the final models. Second, a grid search with 10-fold cross validation was used to find the hyperparameters that maximized the C-index. We found the following optimal hyperparameters: n_estimators=400, max_depth=4 for RSF, max_depth=4 for ST, learning_rate=1 and max_depth=1 for GB, alpha=0.0002 for SSVM, and n_estimators=500, max_depth=4 for EST. Third, we fit the models using the training dataset based on selected input variables, the optimal hyperparameters, and default values of other hyperparameters. Fourth, the testing dataset was used to evaluate and compare model performance. Then, the ELI5 package, which ranks variables by the weights obtained with the permutation importance method, was used to explore the contribution of predictors to the models (26). We used bootstrapping and 10-fold cross validation to assess the robustness of the models.
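The idea behind permutation importance can be sketched in plain Python: shuffle one predictor at a time and record how much the C-index drops. This is an illustrative sketch, not the ELI5 API; the function names, the toy predictor values, and the stand-in risk function are all hypothetical.

```python
import random
from itertools import combinations


def c_index(times, events, scores):
    """Simplified C-index: concordant / comparable pairs (tied times skipped)."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5
    return concordant / comparable


def permutation_importance(predict, X, times, events, n_repeats=10, seed=0):
    """Mean drop in C-index after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = c_index(times, events, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            permuted = c_index(times, events, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical data: column 0 drives risk, column 1 is ignored by the model.
X_toy = [[130.0, 80.0], [110.0, 95.0], [100.0, 70.0], [95.0, 88.0], [90.0, 85.0]]
times_toy = [1, 2, 3, 4, 5]
events_toy = [True, True, True, True, True]
base_c, importances_toy = permutation_importance(
    lambda row: row[0], X_toy, times_toy, events_toy
)
```

Because the stand-in risk function ignores the second column, shuffling it leaves the C-index unchanged and its importance is exactly zero. As with the importances reported in the study, the result conveys how much a predictor matters but not the direction of its effect.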
Furthermore, we used the CPH model to explore the specific associations of MetS and its components with incident GI cancer after adjusting for the aforementioned confounding factors. We performed statistical analyses by using Python software (version 3.7.9) with the scikit-survival library (26) and SAS software (version 9.4, SAS Institute, Cary, NC, USA). A two-sided P value less than 0.05 was considered statistically significant.

Model performance
The C-index and IBS were used to compare the performance of the models. Table 2 presents these metrics for the CPH and ML models based on the testing dataset. Comparably good performance was recorded for the GB, SSVM, and CPH models, with a C-index value of 0.725. The RSF and EST models exhibited lower performance than the CPH model (0.699 vs. 0.725 and 0.671 vs. 0.725, respectively). The IBS was not estimated for SSVM because it is applicable only to models that can estimate a survival function. A similar IBS value was found for the remaining models (0.017). Furthermore, comparably good discrimination was found for the ML and CPH models concerning the time-dependent AUC (Figure 2). Table 3 presents the top five most important predictors of incident GI cancer based on the ML models. Notably, among the predictors related to MetS, fasting glucose was indicated as an important predictor for the occurrence of GI cancer across models. Furthermore, according to the GB model, waist circumference was one of the five important predictors contributing to incident GI cancer.

Importance of predictors of GI cancer
We then determined the specific relationships between MetS and its components and GI cancer development by using the CPH model. Notably, the important predictors identified by the CPH model were similar to those identified by the ML models. In detail, a higher risk of GI cancer was found for participants with high waist circumference in both the crude model and adjusted model; the HRs (95% CIs) were 1 (Table 4).

Discussion
In this study, we constructed five ML models and compared them with a conventional CPH model to predict incident GI cancer and identify whether MetS and its components are potential predictors of GI cancer development. Our findings identified comparably good performance concerning the C-index for the GB, SSVM, and CPH models. High fasting glucose was found to be a predictor for GI cancer development across all six models. However, the important predictors were not restricted to high fasting glucose, and the importance of high waist circumference emerged in the GB and CPH models.
To date, attention has been drawn to the application of ML models to time-to-event data. As a result, many studies have been conducted to compare the predictive performance of ML models against CPH models. For example, in a previous study, an extreme gradient boosting model outperformed a CPH model in predicting breast cancer survival based on C-index values (0.73 vs. 0.63) (16). Similar results were obtained in other studies comparing RSF models and CPH models for survival prediction in patients with liver transplantation or oral squamous cell carcinoma (18,27). The C-index values obtained in these studies were 0.622/0.620 and 0.764/0.694 for the RSF and CPH models, respectively (18,27). However, the value of ML models against the CPH model is still open to discussion because inconsistent results have been obtained in other studies. Specifically, comparably good performance was recorded for RSF, conditional inference forest, and CPH models in predicting the survival of patients with oral and pharyngeal cancer (17). Moreover, a CPH model showed better performance in another study conducted in China, where CPH and RSF models were used to predict the progression of high-grade glioma after proton and carbon ion radiotherapy (15).
To our knowledge, our study is the first attempt to use ML and CPH models for the prediction of GI cancer development based on MetS and other confounders. The CPH model has been widely applied to investigate the impact of risk factors on incident cancers due to its simple, fast computation and meaningful outputs; however, its limitations need to be clarified (16). First, the proportional hazards assumption must be satisfied in the model. Survival curves for different strata need to have hazard functions that are proportional over time. Second, a linear relationship between log hazards and covariates is assumed (28). Thus, the CPH model may not be appropriate for a dataset with nonlinearity due to a decreased accuracy in prediction (12,17). To date, the development of ML has been documented to address the limitations of conventional statistical analysis in cancer prediction (16). The self-learning, classification, prediction, and feature selection abilities of ML have been well recognized. ML methods can be adapted to deal with data with nonlinearity and high-dimensional covariates (22,29). However, we found comparable performance for the CPH and ML models. This finding is consistent with some previous studies (17,30) and may be explained as follows. First, a CPH model tries to fit the data to a specific model and tests the proportional hazards assumption to examine the influence of predictors on an outcome (17). The proportional hazards assumption of the CPH model was satisfied in our data, which may be a potential explanation. Second, complex associations and interactions seem to be unimportant in our data. Third, a small number of predictors were used in our study.
Table 2 note. The C-index was estimated based on 100 bootstrapped data samples. The integrated Brier score (IBS) applies to models that can estimate a survival function. Thus, it is impossible to estimate the IBS for the survival support vector machine.
Overall, it is important to realize that the superiority of ML models is found only when a CPH model meets its limitations (17). Notably, we evaluated the models based on the C-index, which can account for censoring and does not depend on a single fixed evaluation time (27). Taken together, a CPH model should be considered a method in epidemiological studies when its conditions are satisfied. Additionally, a combination of ML and CPH models could be used to provide further insight into predictors; specifically, nonlinear interactions may be obtained using ML models, whereas a CPH model is used to summarize the risk in a dataset that is not suitable for CPH model analysis (14,17).
Among the variables related to MetS, the importance of high fasting glucose as a predictor for GI cancer development was found across all models. This finding is consistent with the results of a previous study, where diabetes mellitus was documented to be associated with an elevated risk of GI cancer (31). Notably, a consistent association was also found for specific types of GI cancer. For instance, high fasting glucose was demonstrated to play an important role in gastric cancer development in our previous study (32). Similarly, we identified a positive association for colorectal cancer (11). Our finding was reinforced by the conclusions of other studies (33,34). For example, a higher risk of colorectal cancer was observed in participants with type II diabetes in two large prospective cohorts in the U.S. (33). Additionally, a higher risk of primary liver cancer and pancreatic cancer may be attributed to increased fasting glucose (35-37). Our study suggests that greater emphasis must be placed on participants with high fasting glucose, including those with prediabetes (100 mg/dL-125 mg/dL) and diabetes (≥126 mg/dL or a history of diabetes). Prediabetes may have a potential link to GI cancer development (32).
Several biological mechanisms may be involved in the effect of high fasting glucose on GI cancer development. First, hyperglycemia could provide nutrients for tumor cells, which has certain effects on the proliferation of these cells. For example, epidermal growth factor expression and epidermal growth factor receptor transactivation may be induced by high glucose, which can contribute to promoting cell proliferation in pancreatic cancer (38). Second, there is a positive association between hyperglycemia and proinflammatory factor production; proinflammatory factors are known to stimulate the expression of oncogenes, regulate the cell cycle, promote the proliferation of tumor cells, inhibit apoptosis, and even induce the epithelial-to-mesenchymal transition (38). Third, insulin-like growth factor 1 (IGF-1) bioavailability could be promoted by insulin, which could inhibit apoptosis, stimulate cellular proliferation, and induce carcinogenesis (39).
Figure 2. Time-dependent AUC. We presented the time-dependent AUC of the CPH model and five ML models, namely, the random survival forest, survival trees, gradient boosting, extra survival trees, and survival support vector machine models. The vertical axis is the time-dependent AUC. The horizontal axis is follow-up (year).
Furthermore, high waist circumference was considered an important predictor of the occurrence of GI cancer in the CPH model and GB model in our study. Overall obesity has received more attention in previous studies than central obesity. However, central obesity may have more influence on cancer risk than overall obesity because metabolic derangement is reflected by insulin and IGF levels (40). This hypothesis was supported by a conclusion drawn from a previous study, which emphasized the stronger influence of waist circumference on colon cancer risk than BMI (41).
Furthermore, the adverse effect of central obesity on GI cancer development was reinforced by evidence from two meta-analyses of prospective studies (42,43). The pathophysiological mechanisms can be explained as follows. First, insulin resistance is an important mediator of the link between central obesity and cancer. In detail, high insulin levels lead to IGF activation, promote cellular proliferation, and inhibit apoptosis (44). Second, sex hormones may be a possible mechanism because they are related to body size and shape. The pathogenesis of the link between body size and shape and cancer may include obesity-induced hypoxia, genetic susceptibility, and adipose stromal cell migration (44). Notably, MetS was demonstrated to be associated with GI cancer development in the CPH model. However, it was not among the top five important variables in the ML models. A possible reason may be the stronger effects of high fasting glucose and waist circumference. These components are implied to be central factors for the causal link between MetS and GI cancer.

Table 2. Column headings: Models; C-index on testing dataset; IBS.
Notably, with regard to variable importance, the CPH model offers a straightforward interpretation as an HR. In ML models, by contrast, a variable with greater importance has more influence on the predicted outcome, but the models do not provide the sign of the effect (negative or positive) (45). To date, the direct comparison of CPH with ML models regarding interpretation is limited due to the lack of a common metric. Thus, it is necessary to address this limitation in further studies (18).
There are several strengths in our study. First, this is the first attempt to use an ML approach to predict and identify the adverse effects of high fasting glucose and central obesity on GI cancer development with time-to-event data. Second, our study has a relatively large sample size with a long follow-up time, and incident cases were accurately identified by linking to the national cancer registry, a high-quality database. Third, we used standardized operating procedures to perform the laboratory tests with standardized equipment and trained personnel. However, there are some limitations in our study. First, we used a small number of predictors. Second, the predictive power may be affected by a low proportion of incident cases in our study. Thus, further studies with a larger number of incident cases and predictors may be warranted to clarify the value of ML models against the CPH model. Third, information on medication and dietary factors was not available to consider in our models.
In conclusion, our study found comparably good performance according to the C-index for the ML and CPH models. This finding suggested that ML models may be considered another method for survival analysis when a CPH model has limitations. However, further studies with a larger number of predictors are necessary to clarify the value of ML models. Furthermore, our study indicated that preventing high fasting glucose and central obesity could be expected to reduce GI cancer development.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by the institutional review board of the National Cancer Center (No. NCCNCS-07-077). Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.