A machine learning-based model for predicting distant metastasis in patients with rectal cancer

Background: Distant metastasis from rectal cancer usually results in poorer survival and quality of life, so early identification of patients at high risk of distant metastasis is essential.
Methods: We used eight machine-learning algorithms to construct models for the risk of distant metastasis from rectal cancer. The models were developed using 23,867 patients with rectal cancer from the Surveillance, Epidemiology, and End Results (SEER) database diagnosed between 2010 and 2017. In addition, 1,178 rectal cancer patients from a Chinese hospital were used to validate model performance and extrapolation. We tuned the hyperparameters by random search with ten-fold cross-validation. We evaluated the models using the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), decision curve analysis, calibration curves, and the precision and accuracy in the internal test set and the external validation cohort. In addition, SHapley Additive exPlanations (SHAP) were used to interpret the machine-learning models. Finally, the best model was used to develop a web calculator for predicting the risk of distant metastasis in rectal cancer.
Results: The study included 23,867 rectal cancer patients, of whom 2,840 had distant metastasis. Multiple logistic regression analysis showed that age, differentiation grade, T-stage, N-stage, preoperative carcinoembryonic antigen (CEA), tumor deposits, perineural invasion, tumor size, radiation, and chemotherapy were independent risk factors for distant metastasis in rectal cancer. The mean AUC of the extreme gradient boosting (XGB) model in ten-fold cross-validation on the training set was 0.859. The XGB model performed best in both the internal test set and the external validation set. In the internal test set, the XGB model had an AUC of 0.855, an AUPRC of 0.510, an accuracy of 0.900, and a precision of 0.880. In the external validation set, the XGB model had an AUC of 0.814, an AUPRC of 0.609, an accuracy of 0.800, and a precision of 0.810. Finally, we constructed a web calculator using the XGB model for distant metastasis of rectal cancer.
Conclusion: The study developed and validated an XGB model based on clinicopathological information for predicting the risk of distant metastasis in patients with rectal cancer, which may help physicians make clinical decisions.
Keywords: rectal cancer, distant metastasis, web calculator, machine learning algorithm, external validation


Introduction
Colorectal cancer is the third most common cancer worldwide and the second leading cause of cancer-related deaths (1, 2). The World Health Organization (WHO) estimates that it kills more than 930,000 people yearly (3). People in Western and East Asian countries are estimated to have a 5% and 1% lifetime risk of developing colorectal cancer, respectively (4). With increased health awareness and improved medical care, the prognosis for colorectal cancer has improved over the years. However, the prognosis of patients with early and advanced colorectal cancer differs markedly. The five-year survival rate for patients with stage I-II colorectal cancer is 88-95%, while patients with metastatic colorectal cancer survive for 3 months to 5 years, and approximately 60% of them die within 1-2 years (5). Rectal cancer is a major subtype of colorectal cancer, accounting for over 40% of colorectal cancer patients in the United States (US) (6). Early assessment and screening of patients at high risk for distant metastasis from rectal cancer helps improve their prognosis and reduces the potential risks associated with aggressive multimodal therapy (7). The most common sites of metastasis in rectal cancer are the liver (45.2%), lung (15%), bone (10%), and brain (8%) (8-11). This study focuses on distant metastasis from rectal cancer rather than on primary tumors, as metastases account for 90% of all cancer deaths (12).
Artificial intelligence (AI) is the field of computer science dedicated to building machines that can perform tasks that normally require human-level intelligence (13). Machine learning is an essential branch of AI, of which deep learning is a subfield, and can usually be classified as supervised, unsupervised, and reinforcement learning (14). Machine learning has been applied in the medical field with great success, for example in the development of pathomics and radiomics. While traditional regression approaches are restricted to a narrow set of variables, machine learning allows more detail to be mined from the data, allowing for the development of better diagnostic and prognostic tools than traditional approaches (15). Classical statistical methods focus primarily on inference, including model parameter estimation and hypothesis testing. Such techniques produce relatively simple models, emphasize interpretability over predictive accuracy, and are less suited to data with many relevant interacting factors (16). Machine learning shows promise in addressing many of the problems inherent in these earlier approaches. It is well suited to take advantage of emerging big data and increasing computer processing power, making large-scale analyses feasible and easier to run (17).
In this study, we constructed eight machine-learning prediction models using common clinicopathological factors while exploring the factors influencing distant metastasis in rectal cancer. We evaluated model performance using multiple metrics and analyzed how the different factors contributed to the models. The best-performing model was then applied to clinical assessment to facilitate the screening of patients at high risk of distant metastasis from rectal cancer, which should support a more accurate diagnosis of distant metastasis and may help in developing treatment guidelines and standards of care.

Patient cohort
The Surveillance, Epidemiology, and End Results (SEER) database is a US population-based cancer database created by the National Cancer Institute in 1973, representing approximately 28% of the US population and providing a wealth of data for cancer-related research (18). With access to the SEER database, we constructed an open-access cohort from its rectal cancer patient data. Details of the SEER database are available at the following website (http://seer.cancer.gov/about/). The SEER database has collected information on patients' distant metastasis since 2010. Therefore, the rectal cancer patients included in this study were diagnosed in 2010-2017. For the cohort of rectal cancer patients obtained from SEER, the following inclusion [...] 1. The study flow for this paper is shown in Figure 1.

Data collection and processing
We used SEER*Stat software (version 8.4.0) to extract the rectal cancer patient data from SEER Research Plus Data, 18 Registries + Hurricane Katrina Impacted Louisiana Cases + Hispanic Ethnicity, Nov 2020 Sub (2000-2018). Baseline clinicopathological data from the rectal cancer patients in the external validation set were processed using the SEER classification criteria (Supplement Table 1). All pathological indicators in this study were processed using the 7th edition AJCC TNM staging and SEER-related guidelines (Supplement Table 1). We coded the categorical variables to facilitate data analysis and further application in model building (Supplement Table 2). The machine learning code used in this paper is provided in Supplementary Table 3.

Model construction and evaluation
In this study, we constructed models using eight machine learning algorithms: extreme gradient boosting (XGB) (19), random forest (RF) (20), decision tree (DT) (21), logistic regression (LR) (22), K-nearest neighbor (KNN) (23), support vector machine (SVM) (24), naive Bayes (NBC) (25), and multilayer perceptron (MLP) (26). Machine learning models can capture complex relationships when trained on large data sets, so we chose the SEER database, which has a large sample size, to develop the models. We randomly divided the SEER data into a training set and an internal test set in a ratio of 7:3 and trained the eight models on the training set. We used random search over the hyperparameters to find the optimal model parameters while calculating the average AUC for each algorithm under 10-fold cross-validation. The AUC is the area under the receiver operating characteristic (ROC) curve, with values close to 1 indicating reliable predictive power and values close to 0.5 implying poor predictive power. For an unbalanced data set, the AUC is less effective for assessing a model than the area under the precision-recall curve (AUPRC), so we plotted the precision-recall curve and calculated the AUPRC to validate and complement the AUC values (27). We plotted decision curves to assess the models' clinical decision-making ability. To compare the agreement between predicted and observed risk, we plotted calibration curves; a model was considered well calibrated if its calibration curve was close to the diagonal. To assess the generalization and extrapolation performance of the models, we applied the eight trained models to the internal test set and the external validation set and plotted the ROC curves, precision-recall curves, and calibration curves. We identified the best model by combining the performance of the machine learning models on the training set, the internal test set, and the external validation set. SHapley Additive exPlanations (SHAP) is a model-agnostic technique based on cooperative game theory that is used to explain the predictions of a machine learning model (28). We used SHAP to calculate the importance of each variable in the optimal model. Finally, we created a web calculator to facilitate the clinical dissemination and use of the model.
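The data split and tuning procedure described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code from Supplementary Table 3: the function names, the parameter grid, and the `score_fn` callable (which would fit a model on the training fold and return a validation score such as the AUC) are all hypothetical.

```python
import random
from statistics import mean


def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def random_search(param_space, score_fn, train_idx, n_iter=20, k=10, seed=0):
    """Sample n_iter parameter sets and keep the best mean cross-validated score.

    score_fn(params, train, val) is a user-supplied (hypothetical) callable
    that fits a model on `train` and returns a validation score.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one random value per hyperparameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        cv_score = mean(score_fn(params, tr, va) for tr, va in k_fold_indices(train_idx, k))
        if cv_score > best_score:
            best_params, best_score = params, cv_score
    return best_params, best_score
```

In practice the same steps are usually delegated to a library routine; the sketch only makes explicit that the internal test set is held out before any cross-validation takes place.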

Statistical analysis
We performed the statistical analysis and model building on the clinicopathological information using R (version 4.2.3, http://www.r-project.org) and Python (version 3.8, Python Software Foundation, http://www.python.org). Categorical variables were expressed as frequency (percentage, %) and compared using the chi-square or Fisher's exact test. We used univariate logistic regression analysis to determine the factors associated with distant metastasis in rectal cancer. Factors with P<0.05 in the univariate analysis were entered into the multiple logistic regression analysis. We identified the factors with P<0.05 in the multiple logistic regression as independent risk factors for distant metastasis of rectal cancer and calculated each factor's odds ratio (OR) and confidence interval (CI). The independent risk factors identified by multiple logistic regression were used to construct the subsequent machine-learning models. A two-sided P<0.05 was considered statistically significant.

Results

Baseline population characteristics
The study included 23,867 rectal cancer patients from the SEER database. Among them, 2,840 (11.90%) developed distant metastasis, and 21,027 (88.10%) did not. The demographic and clinicopathological characteristics of these patients are shown in Table 2. The SEER patients were randomly divided into the training set (n = 16,706) and the internal test set (n = 7,161) in a ratio of 7:3. External validation was performed using data from 1,178 rectal cancer patients from the First Hospital of Jilin University (Table 3). Details of the training, testing, and validation sets are shown in Table 1. Thirteen clinicopathological factors were incorporated into our study: age, sex, marital status, race, tumor size, differentiation grade, T-stage, N-stage, preoperative CEA level, tumor deposits, perineural invasion (PI), radiation, and chemotherapy. We compared the SEER patients without distant metastasis (DM (-), n = 21,027, 88.10%) and with distant metastasis (DM (+), n = 2,840, 11.90%). The DM (+) subgroup had a higher proportion of younger patients than the DM (-) subgroup (P<0.001). Distant metastasis was significantly more frequent in men than in women (P = 0.002). The two subgroups did not differ significantly in race (P = 0.138). Consistent with our expectations, the incidence of distant metastasis was higher in single patients (591/4,103, 14.40%) than in married patients (1,576/14,059, 11.21%; P<0.001). In terms of tumor progression, the proportion of patients with tumor size greater than 5 cm was higher in the DM (+) subgroup (45.9%) than in the DM (-) subgroup (25.1%; P<0.001). The DM (+) subgroup had a significantly higher proportion of T-stage II-IV (P<0.001) and a more advanced N-stage (P<0.001). In addition, tumor deposits, PI, and preoperative CEA positivity were more frequent in the DM (+) subgroup than in the DM (-) subgroup (P<0.001). The two subgroups also differed significantly in the treatment received (P<0.001).
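As a worked example of the odds ratio and confidence interval used throughout the analysis, the marital-status counts reported above can be turned into an unadjusted OR with a Wald 95% CI. This is only an illustrative sketch: it is a crude two-by-two calculation, not the logistic regression reported in Table 4, and the function name is hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval for a 2x2 table.

    a, b: exposed group with / without the event
    c, d: reference group with / without the event
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts reported in the text: single 591 of 4,103; married 1,576 of 14,059.
or_single, lower, upper = odds_ratio_wald_ci(591, 4103 - 591, 1576, 14059 - 1576)
print(f"OR = {or_single:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because this estimate ignores all other covariates, it should not be expected to match the adjusted OR from the multiple logistic regression.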

Univariate and multiple logistic regression analysis
Univariate and multiple logistic regression analyses were conducted on the training set data to identify the variables to be included in the machine learning models. Based on univariate logistic regression, age, sex, marital status, T-stage, N-stage, tumor size, tumor deposits, PI, CEA level, pathological grade, radiation, chemotherapy, and race were risk factors for distant metastasis in rectal cancer (P<0.05, Table 4). When these factors were entered into the multiple logistic regression analysis, age, T-stage, N-stage, tumor size, tumor deposits, PI, preoperative CEA level, pathological grade, radiation, and chemotherapy were independent risk factors for distant metastasis of rectal cancer (P<0.05, Table 4). We included the variables with P<0.05 in the multiple logistic regression analysis in the machine learning analysis.

Model performance
To compare the predictive performance of the eight models, we performed ten-fold cross-validation on the training set data (Figure 2A). The average AUC values of the eight machine learning models were between 0.793 and 0.859, demonstrating good predictive power. The XGB algorithm had the highest average AUC (AUC = 0.859, SD = 0.013). Figure 2B shows the PR curves of the models in the training set, with the XGB model having a larger AUPRC than the other seven models (AUPRC = 0.656). The XGB model also outperformed the other models in the clinical decision curve analysis (Figure 2C). Figure 2D shows the calibration curve of the XGB model in the training set, indicating that its predicted risks agree closely with the observed risks. In summary, the XGB model has a high degree of reliability. Figure 3 shows the ROC curves, PR curves, clinical decision curves, and calibration curves of the eight models for the internal test set and the external validation set. The XGB model performed well in both datasets, demonstrating discriminative power beyond that of the other models. A heat map summarizes the performance of the models across multiple metrics in a single, easy-to-read view (Figure 4). After a comprehensive review of the performance of the models in the three datasets, we concluded that the XGB model performed best in predicting distant metastasis in patients with rectal cancer and designated it as the optimal model.
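To make the two discrimination metrics reported above concrete, the following sketch computes the AUC as a pairwise ranking probability and the AUPRC as average precision. It is an assumption that average precision is the estimator of interest, since the paper does not state how the AUPRC was computed, and the function names and toy data are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count one half. This pairwise form is quadratic in the sample size
    and is meant for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (ties are not grouped)."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Toy data: four patients, two with the event.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

For a classifier that ranks patients at random, the expected AUC is 0.5, whereas the expected AUPRC is roughly the prevalence of the positive class (about 0.119 in the SEER cohort), which is why the AUPRC is the more demanding metric on unbalanced data.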

The relative importance of variables in machine learning algorithms
We used SHAP to interpret the XGB model. Generally, the higher the SHAP value of a feature, the higher the probability that the target event will occur. In the SHAP analysis, red indicates feature values that have a positive impact on the model, and blue indicates feature values that have a negative impact on the model (29). The results showed that tumor deposits were the most crucial variable, followed by CEA, N-stage, radiation, chemotherapy, T-stage, PI, tumor size, age, and differentiation grade (Figure 5).

Workflow diagram for study design and patient screening. SEER, The Surveillance, Epidemiology, and End Results; LR, logistic regression; DT, decision tree; RF, random forest; XGB, extreme gradient boosting; NBC, naive Bayesian classification; MLP, multilayer perceptron; SVM, support vector machine; KNN, k-nearest neighbor; SHAP, SHapley Additive exPlanations.
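The idea behind the SHAP values used in this section can be illustrated by computing exact Shapley values for a small toy model. This is a sketch under stated assumptions: the risk function, its coefficients, and the feature names are invented for illustration and do not come from the fitted XGB model, and practical SHAP implementations rely on faster algorithms rather than full enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features.

    value_fn(frozenset_of_features) returns the model output when only those
    features are present; the cost grows as 2 ** len(features).
    """
    n = len(features)
    phi = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value_fn(coalition | {feature}) - value_fn(coalition))
        phi[feature] = total
    return phi


def toy_risk(present):
    """Hypothetical risk score: two additive effects plus one interaction."""
    risk = 0.05
    if "tumor_deposits" in present:
        risk += 0.20
    if "cea_positive" in present:
        risk += 0.10
    if {"tumor_deposits", "cea_positive"} <= present:
        risk += 0.06
    return risk


features = ["tumor_deposits", "cea_positive", "age"]
phi = shapley_values(toy_risk, features)
```

In this toy case the interaction term is shared equally between the two interacting features, a feature with no effect receives a value of zero, and the values sum to the difference between the full prediction and the baseline.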

Web calculator
Although the XGB model is the best performing of the eight machine learning models, it is complex, challenging to understand, and unsuitable for direct clinical use. We have therefore built a web calculator based on the XGB model, in which the patient's clinicopathological information is entered on the left-hand side to obtain the probability of distant metastasis. An image of the web calculator is shown in Figure 6. The link to the web calculator is https://share.streamlit.io/woshiwz/rectal_cancer/main/distant.py.

Discussion
Rectal cancer is a common invasive tumor of the digestive system that is prone to distant metastasis. Metastasis is a significant driver of rectal cancer-related mortality, with the liver and lungs being the most commonly affected organs (30). Approximately 22% of patients with colorectal cancer have distant metastasis at first presentation, and the 5-year survival rate for these patients is less than 20% (31). The NCCN guidelines recommend routine CT of the chest and abdomen for patients with rectal cancer. Both tests can detect metastasis in the liver and lungs, the two most common sites of metastasis in rectal cancer. However, chest CT detects many nodules with low diagnostic accuracy, so patients are often exposed to unnecessary radiation (32, 33). Positron emission tomography/computed tomography (PET/CT) is a standard diagnostic method for distant metastasis. However, it is not routinely used to screen for distant metastasis because of its high cost and the potential for radiation damage (34). It is, therefore, crucial to develop a clinical prediction model that can screen patients at high risk of distant metastasis from rectal cancer.
To date, many researchers have constructed different models to predict the distant metastasis of rectal cancer. However, all the data used for model development and validation came from public databases, so external data to validate the extrapolation of the models were lacking (35). Secondly, the models were constructed using logistic regression, which has specific requirements for data distribution and is sensitive to multicollinearity, and therefore has some limitations in its application (15). Chang et al. developed a model from a small sample, making the model potentially biased (36). In contrast, this paper uses big data from SEER to create the model, uses external data to validate it, and finally develops a clickable web calculator to aid the clinical dissemination of the model.
As far as we know, this paper is the first to use machine learning algorithms to predict distant metastasis from rectal cancer and to construct a web calculator using the best model. This study found that the XGB algorithm best predicted distant metastasis from rectal cancer. XGB is an efficient, flexible, and scalable machine learning classifier widely used in medical applications such as COVID-19, chronic kidney disease diagnosis, and bone metastasis in prostate cancer (37-39). It has the advantage of using a large number of decision trees with low correlation, and the number of included decision trees is optimized to achieve the lowest possible error rate, which helps limit over-fitting of the trained model (40).
We used descriptive statistics and logistic regression to analyze the variables associated with distant metastasis in rectal cancer. We utilized SHAP values to assess the impact of each factor. Regarding SHAP visualization of variable importance, we found that each variable contributed to the model (Figure 5). In this study, tumor deposits were the most crucial variable in predicting distant metastasis in rectal cancer. Tumor deposits are isolated tumor nodules present within the lymphatic drainage area of the primary tumor and without identifiable lymph nodes, blood vessels, or perineural structures within them (41). A meta-analysis of 17 retrospective studies found that tumor deposits were a stronger predictor of distant metastasis from rectal cancer than lymph node metastasis or vascular infiltration (42). In the importance ranking, CEA was the second most crucial variable after tumor deposits.
Several reports have pointed to preoperative CEA as an essential indicator of distant metastasis in rectal cancer, and our study confirms this (43-45). Although CEA is a broad-spectrum tumor marker and cannot be used as a specific indicator for diagnosing a particular malignancy, it still has significant clinical value in the differential diagnosis of malignancies, disease monitoring, and evaluation of the efficacy of treatment (46). Therefore, patients with rectal cancer with high preoperative CEA levels need enhanced postoperative screening. Logistic regression results showed that patients with regional lymph node involvement had a significantly higher risk of distant metastasis, 2-3 times higher than those with rectal cancer without lymph node metastasis (Figure 5). This may be because invaded regional lymph nodes can act as metastatic stations for tumor cell proliferation (47).
Tumor size is another high-risk factor for developing distant metastasis from malignant tumors. Li et al. found that the risk of distant metastasis increased by 15% for each standard increase in rectal cancer tumor size, and our findings are broadly consistent with theirs (48). Larger tumors may have invaded the surrounding soft tissues, which may explain the relationship between tumor size and distant metastasis. Tayyab et al. found that some lymphatic reflux was not present in some lymphatic tissues but could be found in larger tumor tissues (49). PI is a risk factor for distant metastasis in rectal cancer, but how PI leads to distant metastasis remains poorly understood. Experts have emphasized the correlation between T-stage and distant metastasis. Our present study also found that T4 staging is an independent risk factor for distant metastasis of rectal cancer. We believe the reason is that the T4 stage implies that the tumor has grown through the serosa, so that tumor cells can implant directly in the peritoneal tissue, increasing the risk of distant metastasis of rectal cancer. Interestingly, the results of this study indicate that younger rectal cancer patients are more likely to develop distant metastasis, which is different from what we would expect (Figure 5). We believe that this may be because younger rectal cancer patients tend to have less differentiated tumors and a greater tendency to harbor tumor mutations, making them more likely to develop distant metastasis (50).
According to George and Keklikoglou et al., chemotherapy may increase metastasis in malignant tumors, possibly because it promotes the expression of metastatic genes and increases the secretion of exosomes that promote metastasis (51, 52). This suggests that although chemotherapy may result in tumor shrinkage, it may also increase the chances of metastasis. Our study also shows that the administration of radiotherapy reduces distant metastasis in patients with rectal cancer. Several theories have been proposed to explain the protective effect of radiotherapy on distant metastasis from rectal cancer, including killing and reducing tumor cells at the primary site, eliminating micrometastasis from rectal cancer, and immunomodulatory effects. The impact of radiotherapy on controlling distant metastasis from rectal cancer depends on the mode of administration and dose (53, 54). Our model adequately incorporated various risk factors that may affect distant metastasis in patients with rectal cancer and achieved excellent predictive performance.
Despite its strengths, this study has some limitations. Firstly, this is a retrospective study with the data bias inherent to retrospective studies. Secondly, although the model demonstrated excellent performance in the external validation cohort, the data were sourced from only one medical center, which may limit the model's generalization. Further independent validation sets are required to confirm our findings, and we will conduct a multi-center study in the future. Thirdly, because some variables in the SEER dataset had too many missing values for multiple imputation, we excluded the records with missing data, which may have biased the results. Finally, because of the limited set of variables in the SEER database, some essential variables, such as blood biochemistry indicators, were not available, which limited further optimization of our model; we will investigate this issue further in the future. We hope to continue to improve the model by incorporating a variety of other clinical factors to better support clinicians.

Conclusion
In conclusion, we constructed eight prediction models for the risk of distant metastasis in patients with rectal cancer using machine learning algorithms. Among them, we found that the XGB model had the best predictive power, demonstrating strong discriminative power with high sensitivity, specificity, and accuracy in both the internal test set and the external validation set. We hope the XGB algorithm-based web calculator can help clinicians screen patients at high risk of distant metastasis from rectal cancer, intervene early, and prevent distant metastasis from rectal cancer.
A web calculator for predicting distant metastasis from rectal cancer.

FIGURE 2 (A) Ten-fold cross-validation results of eight machine learning models in the training set. (B) PR curves of eight machine learning models in the training set. (C) DCA curves of eight machine learning models in the training set. (D) Calibration curve of the best model in the training set. LR, logistic regression; DT, decision tree; RF, random forest; XGB, extreme gradient boosting; NBC, naive Bayesian; MLP, multilayer perceptron; SVM, support vector machine; KNN, k-nearest neighbor; DCA, decision curve analysis; PR, precision-recall; SD, standard deviation.
The study was approved by the Ethical Review Committee of the First Hospital of Jilin University and was conducted in accordance with the guidelines of the Declaration of Helsinki. Specific information on SEER and the external validation rectal cancer cohort are shown in Table

TABLE 1
Clinical and pathological characteristics of the training, testing, and validation sets.

SEER, The Surveillance, Epidemiology, and End Results; CEA, Carcinoembryonic antigen.

TABLE 2
Clinical and pathological characteristics of the study population for SEER database.

TABLE 3
Clinical and pathological characteristics of the study population for Chinese Cohort.

TABLE 4
Univariate and multiple logistic regression analysis of variables in the training set.