Machine learning-based classifiers to predict metastasis in colorectal cancer patients

Background The increasing prevalence of colorectal cancer (CRC) in Iran over the past three decades has made it a key public health burden. This study aimed to predict metastasis in CRC patients using machine learning (ML) approaches based on demographic and clinical factors.
Methods This study focuses on 1,127 CRC patients who underwent appropriate treatments at Taleghani Hospital, a tertiary care facility. The patients were divided into training and test datasets in an 80:20 ratio. Various ML methods, including Naive Bayes (NB), random forest (RF), support vector machine (SVM), neural network (NN), decision tree (DT), and logistic regression (LR), were used for predicting metastasis in CRC patients. Model performance was evaluated using 5-fold cross-validation, reporting sensitivity, specificity, the area under the curve (AUC), and other indices.
Results Among the 1,127 patients, 183 (16%) had experienced metastasis. In the prediction of metastasis, both the NN and RF algorithms had the highest AUC, while SVM ranked third in both the original and balanced datasets. The NN and RF algorithms achieved the highest AUC (100%), sensitivity (100% and 100%, respectively), and accuracy (99.2% and 99.3%, respectively) on the balanced dataset, followed by the SVM with an AUC of 98.8%, a sensitivity of 97.5%, and an accuracy of 97%. Moreover, these two methods showed a lower false negative rate (FNR) and false positive rate (FPR) and a higher negative predictive value (NPV). The results also showed that all methods exhibited good performance in the test datasets, and the balanced dataset improved the performance of most ML methods. The most important variables for predicting metastasis were the tumor stage, the number of involved lymph nodes, and the treatment type. In a separate analysis of patients with tumor stages I–III, tumor grade, tumor size, and tumor stage were identified as the most important features.
Conclusion This study indicated that NN and RF were the best among ML-based approaches for predicting metastasis in CRC patients. Both the tumor stage and the number of involved lymph nodes were considered the most important features.


Background
Colorectal cancer (CRC) has been regarded as one of the four most common types of cancer as well as the second leading cause of cancer deaths (Siegel et al., 2016). While there have been promising advancements in reducing the incidence of CRC, both morbidity and mortality rates remain high (Wieszczy et al., 2020). CRC has been steadily increasing worldwide since the 1960s, with mortality rates varying significantly according to geographical location (Ferlay et al., 2019).
Machine learning (ML) models have become essential tools for identifying individuals at an increased risk of developing colorectal cancer and uncovering risk factors associated with the disease (Kourou et al., 2015). ML algorithms have shown strong performance in predicting cancer survival and metastasis. ML-based approaches represent innovative and practical models for effectively predicting overall survival (OS) among CRC patients (Manilich et al., 2011; Wen et al., 2021). ML methods overcome challenges in estimating coefficients and accurately modeling data. Additionally, ML models can handle noise in datasets, non-linearity, complex interactions, large sample sizes, and numerous features. Overall, ML approaches have shown promise in improving treatment outcomes in cancer research (Zhou et al., 2021; Greener et al., 2022; Talebi et al., 2023). ML-based approaches have also been employed to predict metastatic relapse in breast cancer in several studies (Tapak et al., 2019; Nicol et al., 2020). Furthermore, numerous epidemiological studies have investigated specific hypotheses related to CRC risk factors (Talebi et al., 2019, 2020; Borumandnia et al., 2021). These studies encompass survival analysis techniques such as Cox proportional hazards, time-dependent Cox, cure models, and other types of survival analysis using clinical datasets. Moreover, other investigations have utilized ML models such as decision tree (DT), support vector machine (SVM), neural network (NN), and Naive Bayes (NB) methods (Talebi et al., 2023).
This historical cohort study aims to predict metastasis in CRC patients using supervised machine learning methods, including NB, random forest (RF), SVM, NN, DT, and logistic regression (LR).

Methods
Clinical data from 1,127 patients who underwent medical treatment for rectal cancer at Taleghani Hospital, a tertiary care facility, from 2013 to 2019, were scrutinized to develop prediction models using ML classifiers. Metastasis served as the dependent variable, while demographic and clinical factors were considered as independent variables. Demographic characteristics included sex, age, education level, smoking, marital status, and BMI, while clinical factors encompassed the number of involved lymph nodes, tumor grade, tumor stage, treatment type, and diabetes mellitus.

Inclusion and exclusion criteria
The inclusion criteria for the CRC screening program, which spanned from 2013 to 2019, included various diagnostic methods, such as endoscopy, imaging, and stool or blood tests, as well as colonoscopy reports for patients diagnosed with CRC during that period. Additionally, the clinical data were utilized for assessing the overall or relative survival of CRC patients who had experienced metastasis. The exclusion criteria were the absence of a colonoscopy report or incomplete data.

Preliminary processing of data
To address missing data, we utilized model-based imputation techniques. The dataset used in the study consisted of 1,127 samples and 13 factors, including patients' demographic and clinical characteristics, as well as their metastasis status as a binary outcome variable, obtained from archived records. To facilitate the analysis, categorical features were transformed into discrete values using one-hot encoding. The continuous variables were standardized by centering them around the mean and scaling them to a standard deviation of 1. For the NB model, numeric values were discretized by binning them into four equally frequent categories. The dataset exhibited a markedly imbalanced distribution of the outcome, with a ratio of 16 metastasis to 84 non-metastasis cases, which could lead to biased models that favor the majority class and ignore the minority class, resulting in poor sensitivity and precision. To overcome this issue, the synthetic minority oversampling technique (SMOTE) was employed to balance the data. SMOTE creates synthetic samples from the minority class by finding the k-nearest neighbors of each minority sample, randomly choosing one of them, and generating a new sample along the line connecting the two, which led to a new balanced dataset. Both the original and balanced datasets were then used to implement various ML algorithms, and their results were compared in terms of model performance. Figure 1 illustrates the step-by-step data processing and model selection process.
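The interpolation step that SMOTE relies on can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `smote`, the toy feature values, and the choice of drawing the base sample at random (rather than looping over every minority sample) are all assumptions made for brevity.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified sketch: a base sample is drawn at random, one of its k nearest
    minority neighbours is chosen, and a new point is placed on the segment
    between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic

# Toy standardized features (hypothetical values, not study data)
minority = [[0.9, 1.2], [1.1, 0.8], [1.4, 1.0], [0.7, 1.5]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already occupied by the minority class, which is why standardizing the continuous features beforehand matters for the distance computation.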

Model development
ML models were developed using the five-fold cross-validation method to assess the impact of the selected variables. The dataset was divided into five randomly selected folds, with four folds used for model training and the remaining fold used for testing. This process ensured that 80% of the data was used for training and 20% for testing in each iteration. Various algorithms, including NB, RF, SVM, NN, DT, and LR, were employed to create models for predicting metastasis in CRC patients. The hyperparameters were tuned using a grid search technique, which involves testing different combinations of hyperparameter values and selecting the best one based on the cross-validation performance and prior experience. The performance of the models was evaluated using the validation data, and this iterative process was repeated until satisfactory results were obtained. In the case of the NN approach, algorithm selection was done through trial and error. A multi-layer perceptron network with the rectified linear unit activation function and a stochastic gradient-based optimizer for weight optimization was utilized for NN modeling. For DT, a forward pruning technique was employed to split the data based on class purity. The RF algorithm constructed a set of decision trees using bootstrap sampling from the training data. The SVM algorithm employed a radial basis function kernel in this study. The modeling process for SVM included setting the cost to 1.00, regression loss to 0.1, and numerical tolerance to 0.001. Additionally, an NB classifier based on Bayes' theorem was fitted. This method is known for its speed and robust performance.

FIGURE 1 The model selection process.
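The five-fold procedure and grid search described above can be illustrated with a small, self-contained sketch. Everything here is an assumption for illustration: the helper names, the toy one-dimensional data, and the use of a simple nearest-neighbour classifier as a stand-in for the NB, RF, SVM, NN, DT, and LR models fitted in the study.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Return k (train, test) index pairs; each test fold holds about n/k items."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]

def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points (stand-in model)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > n_neighbors else 0

def cv_accuracy(xs, ys, n_neighbors, k=5):
    """Mean accuracy over the k held-out folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Toy data: two overlapping classes (hypothetical, not study data)
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(2, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

# Grid search: score every candidate value and keep the best one
grid = {"n_neighbors": [1, 3, 5, 7]}
results = {value: cv_accuracy(xs, ys, value) for value in grid["n_neighbors"]}
best_value = max(results, key=results.get)
```

With 100 samples and five folds, each iteration trains on 80 samples and tests on 20, matching the 80%/20% proportion stated in the text. The same folds are reused for every grid point so that candidate settings are compared on identical splits.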

Statistical analysis
The characteristics of participants were presented by reporting the mean ± SD for continuous variables and frequency with percentage for categorical ones. Missing data were imputed using model-based imputation methods. The ML-based approaches, including NB, RF, SVM, NN, DT, and LR, were used for predicting metastasis in CRC patients. The data were divided into training and test datasets in an 80:20 ratio. Then, the performance of various ML methods on both the original and balanced datasets was compared using 5-fold cross-validation and ROC curves. The area under the ROC curve (AUC), the area under the precision-recall curve (PR-AUC), sensitivity, specificity, false negative rate (FNR), false positive rate (FPR), negative predictive value (NPV), F1 score, accuracy, and precision were reported for both the training and test sets in both the original and balanced datasets. The training accuracy is the overall accuracy of the model obtained by averaging the accuracies from the individual cross-validation runs. Additionally, the performance analysis and the identification of the most important factors were carried out separately for stages I, II, and III. The ML modeling was implemented using Orange3 software version 3.36.1 and RStudio version 4.2.0.
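The threshold-based indices listed above all derive from the four cells of the confusion matrix. The following sketch shows the standard definitions; the function name and the toy labels are assumptions for illustration, with 1 denoting metastasis and 0 denoting no metastasis.

```python
def classification_metrics(y_true, y_pred):
    """Compute the standard indices from true and predicted binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "FNR": ratio(fn, tp + fn),           # 1 - sensitivity
        "FPR": ratio(fp, tn + fp),           # 1 - specificity
        "NPV": ratio(tn, tn + fn),
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Toy example (hypothetical labels): 3 metastasis cases, 7 non-metastasis cases
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In an imbalanced dataset such as this one, accuracy can remain high even when sensitivity is poor, which is why the FNR and NPV are reported alongside it.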

Results
A total of 1,127 registered CRC patients were included in this historical cohort study, of whom 183 (16.2%) had experienced metastasis. The mean ± SD age of patients was 53.59 ± 14.35 years, ranging from 14 to 94 years. Out of the total number of patients, 437 (38.8%) were women. Demographic and clinical information is provided in Table 1.
Subsequently, various ML algorithms were applied to predict metastasis in CRC patients. Table 2 presents the performance of the different ML algorithms, which were evaluated using 5-fold CV. The performance of all methods was acceptable for clinical research. The highest sensitivity and specificity (97.9% and 98.7%, respectively) were obtained with the NN algorithm. Both NN and RF algorithms had the highest AUC, while the SVM ranked third in both the original and balanced datasets. In addition, all methods exhibited good performance in the test datasets. The results also showed that the balanced dataset improved the performance of most ML methods, especially DT and NN, in predicting metastasis in CRC patients. As shown in Table 2, the balanced dataset increased the sensitivity and AUC of most ML methods, indicating that the models were better able to distinguish between metastasis and non-metastasis cases. The NN and RF achieved the highest AUC (100%), sensitivity (100% and 100%, respectively), and accuracy (99.2% and 99.3%, respectively) on the balanced dataset, followed by the SVM with an AUC of 98.8%, a sensitivity of 97.5%, and an accuracy of 97%. Moreover, these two methods showed lower FNR and FPR and higher NPV. These results suggest that the NN and RF models are the most suitable ML methods for predicting metastasis in CRC patients, as they can capture the complex and non-linear relationships between the features and the outcome. The SVM, LR, and NB models also showed improved performance on the balanced dataset, although they remained inferior to the NN and RF models.
The ROC curves are plotted to determine the diagnostic ability of the ML algorithms in Figure 2, separately for the test and training datasets.
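For readers unfamiliar with how an ROC curve and its AUC are obtained from predicted probabilities, the following sketch shows one standard construction. The function names and the toy scores are assumptions for illustration; the AUC is computed here as the probability that a randomly chosen metastasis case receives a higher score than a randomly chosen non-metastasis case, with ties counted as one half.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical scores): 2 metastasis cases, 3 non-metastasis cases
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
curve = roc_points(y_true, scores)
auc = roc_auc(y_true, scores)
```

In this toy example, five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6. An AUC of 100%, as reported for NN and RF on the balanced dataset, means every metastasis case was scored above every non-metastasis case.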
Figure 3 demonstrates how the probability of metastasis depends on various features of the patients and their tumors. Some of the graphs show a clear relationship between the feature and the probability of metastasis, such as age, tumor size, and tumor stage. For example, the graph for the tumor stage shows that the probability of metastasis increases as the tumor stage increases, which means that more advanced tumors are more likely to spread than less advanced tumors. Other graphs show a weak or unclear relationship between the feature and the probability of metastasis, such as diabetes, education, family history, and marital status. For example, the graph for diabetes shows that the probability of metastasis is slightly higher for patients who have diabetes than for patients who do not; however, the difference is not very large.
Considering that tumor stage IV is a strong predictor and that modeling tumor cases at earlier stages can have a significant impact on early detection, a separate analysis was conducted on the subset of patients with tumor stages I-III (935 cases, of which 12 had metastasis). The performance of the models was compared with the baseline model based on all stages. Since the performance analysis based on dropping stage IV in the original data was unreliable due to the small number of metastasis cases (12 cases, approximately 1%), only the results of the balanced datasets were reported (Table 3). The two approaches, NN and RF, had good performance. The rest of the models did not perform well. Figure 4 shows the variable importance as determined by the Gini index. Figure 4A shows that the tumor stage was the most significant factor for predicting metastasis. Subsequently, the number of involved lymph
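Curves of this kind are commonly produced by a partial dependence computation: one feature is set to each value of interest for every patient, and the model's predicted probabilities are averaged. The text does not state which procedure was used, so the sketch below is only one plausible reading. The logistic model, its coefficients, and the patient rows are entirely hypothetical.

```python
import math

def predict_metastasis(row):
    """Hypothetical fitted model; the coefficients are made up for illustration."""
    z = -5.0 + 1.3 * row["stage"] + 0.01 * row["age"]
    return 1.0 / (1.0 + math.exp(-z))

def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

# Hypothetical patients (not study data)
rows = [
    {"stage": 1, "age": 45},
    {"stage": 2, "age": 60},
    {"stage": 3, "age": 72},
    {"stage": 4, "age": 51},
]
stage_curve = partial_dependence(predict_metastasis, rows, "stage", [1, 2, 3, 4])
```

A steep curve, as assumed here for tumor stage, indicates a feature with a strong marginal effect on the predicted probability; a nearly flat curve corresponds to the weak relationships described for features such as diabetes or marital status.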

Discussion
In the present study, ML models were applied to predict metastasis in CRC patients. The clinical efficacy of our models was determined through ROC curve analysis and other indices, including sensitivity, specificity, and precision. Classifier performance was assessed using the six ML-based approaches. In addition, to focus on modeling tumor cases at earlier stages, which is important for early detection, a separate analysis was conducted for the subset of patients with tumor stages I-III. However, the results for these stages may be unreliable because the number of metastasis samples was very small in these patients.
In our study, while all models exhibited acceptable performance, the NN and RF models demonstrated greater predictive efficiency than the others. A number of ML-based modeling techniques have been suggested for CRC datasets. Other studies have applied DT, SVM, NN, RF, and LR (Cueto-López et al., 2019; Boyne et al., 2020; Nartowt et al., 2020; Achilonu et al., 2021). One study investigated the prediction of tumor staging in colon cancer patients using TNM staging (tumor, node, and metastasis) (Gupta et al., 2019). In that study, ML techniques, such as RF, LR, SVM, NN, k-nearest neighbor (KNN), and adaptive boosting, were applied based on grouping the tumor aggression score (TAS) into two categories (>9.8 and <9.8). They concluded that, when tumor size alone was regarded as a prognostic factor, the RF model outperformed other approaches with an accuracy of 84% and 74% in the training and test sets, respectively. In our study, we applied six ML-based approaches to the CRC data, and we found that NN and RF outperformed other models. The RF model was particularly compatible with our study. Moreover, NN and RF presented the highest sensitivity; furthermore, NN and DT showed the highest values in specificity. In their study, both the tumor stage and the number of involved lymph nodes are regarded as the most significant factors. However, the tumor stage was an essential variable, which is consistent with our study.
In 2020, Boyne et al. predicted early discontinuation of adjuvant chemotherapy among patients aged >17 years with high-stage colon cancer using the LR and RF models (Boyne et al., 2020). Their results revealed that the time from surgery to chemotherapy initiation and the distance from the treatment facility seemed to be the most important predictors. They concluded that the RF algorithm may help predict early discontinuation of chemotherapy among stage III colon cancer patients. In our study, the NN and RF models ranked first and second, respectively. The primary outcome of their study was chemotherapy discontinuation, categorized as receipt of <5 months versus more than 5 months, while metastasis was the dependent variable in our study. In their study, RF was considered a better model than the LR method, while all ML-based approaches performed well in our study.
An investigation was conducted in South Africa using LR, NB, C5.0, RF, SVM, and ANN algorithms for predictive analytics of recurrence and survival outcomes in CRC patients (Achilonu et al., 2021). The analysis considered three datasets, including simulated, recurrent, and survival data. Significant variables in all models were examined and compared using the AUC, which evaluated the discriminatory power of predictive models. This assessment was supported by a threshold (accuracy) metric. Their results demonstrated that all models had AUC values >80%; however, the ANN model was considered a better method with an AUC of approximately 100%, which was compatible with our study. Nevertheless, the findings differed in one respect: in the South African study, histology and CRC complications were prioritized in six methods, while in our study, the tumor stage emerged as the primary candidate.

FIGURE 3 The partial relationship between the probability of metastasis and features.

A survey was carried out on the Indonesian population suffering from CRC in four hospitals from 2012 to 2015 (Anuraga and Fernanda, 2019). The predictor factors included the comorbidity, the tumor stage, age, treatment type, cancer location, gender, and metastasis in CRC patients. In this survey, the RF algorithm was employed in data classification, utilizing tree merging through training on sample data. Furthermore, the accuracy of these models was assessed based on the classification value using the AUC. In addition, the most essential variables for the survival of CRC patients were the metastasis history, cancer location, and gender. In our investigation, the outcome variable was metastatic history, whereas the survival of CRC patients served as the dependent variable in the Indonesian study. Moreover, in their study, both the tumor stage and age were of less importance, with the tumor stage being consistent with our survey.
There are some limitations to our study: This study was based on a single tertiary care facility in Iran, which may limit the generalizability of the results to other populations and settings.
In addition, other potential predictors of metastasis, such as molecular markers, tumor microenvironment, and treatment response, were not considered due to the lack of data availability. The study used a binary outcome of metastasis, which may not capture the complexity of the metastatic process and its clinical implications.

Conclusion
The results of this study indicated that the NN and RF methods could be the best among ML-based approaches. In addition, in the RF method, the most important variables were the tumor stage, the number of involved lymph nodes, treatment type, and BMI. In a separate analysis of patients with tumor stages I-III, the performance of the NN and RF models was acceptable, with tumor grade, tumor size, and tumor stage identified as the most important factors. However, the results of modeling stages I-III should be used with caution because the number of metastasis samples was very small.

FIGURE 2 ROC curves for different ML algorithms on training (left) and test (right) datasets.

FIGURE 4 Variable importance for predicting metastasis in patients with CRC in all stages (A) and in patients without stage IV CRC (B).
TABLE 1 Demographic and clinical attributes of patients with colorectal cancer.
TABLE 2 Performance criteria for the ML methods for predicting metastasis in colorectal cancer patients.
TABLE 3 Performance criteria for ML methods in predicting metastasis in the balanced dataset of patients with tumor stages I-III.