Prediction and the influencing factor study of colorectal cancer hospitalization costs in China based on machine learning-random forest and support vector regression: a retrospective study

Aims As living standards improve, the incidence of colorectal cancer is increasing, and colorectal cancer hospitalization costs are relatively high. Predicting the cost of hospitalization for colorectal cancer patients can therefore provide guidance for controlling healthcare costs and for developing related policies. Methods This study used data from the first page of the medical records of colorectal cancer inpatient cases at a tertiary first-class hospital in Shenzhen from 2018 to 2022. The factors influencing hospitalization costs for colorectal cancer were analyzed. Random forest and support vector regression models were used to build predictive models of the cost of hospitalization for colorectal cancer patients, and the two models were compared and evaluated. Results Among colorectal cancer inpatients, major procedure, length of stay, level of procedure, Charlson comorbidity index, age, and medical payment method were the important influencing factors. On the test set, the R2 of the random forest model was 0.833 and the R2 of the support vector regression model was 0.824; the root mean square error (RMSE) of the random forest model was 0.029, and the RMSE of the support vector regression model was 0.032. In the random forest model, the weight of the major procedure was the highest (0.286). Conclusion Major procedure and length of stay have the greatest impact on hospitalization costs for colorectal cancer patients. The random forest model predicts hospitalization costs for colorectal cancer patients better than the support vector regression model.


Introduction
Colorectal cancer (CRC) is one of the most common malignant tumors of the gastrointestinal tract, accounting for approximately 10% of all cancer deaths worldwide each year (1). Global colorectal cancer incidence is expected to increase to 2.5 million new cases per year by 2035 (2). CRC has become a major public health problem around the world. In China, the prevalence of colorectal cancer continues to rise each year, and most patients are already at a middle or late stage by the time they are diagnosed (3). The total medical expenses and total hospitalization costs of colorectal cancer patients rank second among malignant tumors in China, behind only lung cancer (4). Surgical removal of the lesions remains an important tool in the treatment of colorectal cancer, yet comparatively few studies have addressed the cost of hospitalization for operative patients. Precise cost prediction models can therefore provide a reference for controlling the hospitalization costs of colorectal cancer. According to the American Cancer Society (2020), CRC is the third most prevalent of all cancers and the second-leading cause of cancer death in the US (5). As the incidence of CRC rises, the overall cost of treatment increases accordingly, and the growing number of people with CRC has dramatically increased the pressure on national healthcare budgets (6). The cost of treatment for CRC is not only a financial strain on patients but also a heavy financial burden on society (7). It is therefore important to control the growth of CRC hospitalization costs reasonably and effectively, to reduce the financial pressure on CRC patients and the economic burden of the disease on society.
With the development of computer software and computational power, artificial intelligence has advanced by leaps and bounds and provides a new direction for medical diagnosis, hospital management, medical data analysis, and related fields (8). Data mining is the in-depth analysis of big data to reveal significant new relationships, trends, and changes. The field incorporates theories and methods from a number of subjects, including machine learning, big data, and statistics (9). Data mining is an emerging field in data research with significant value. It can extract hidden information and knowledge using a variety of techniques, which is very useful in decision making (10). Compared with traditional statistical methods, data mining methods have fewer constraints and place fewer requirements on the form of the data. Data mining therefore provides a new, effective method for accurate prediction and rational control of hospitalization costs (11). Machine learning is a branch of artificial intelligence that also supplies algorithms for data mining, and different machine learning algorithms have different advantages (12). More and more researchers are beginning to apply machine learning algorithms to the study of hospitalization costs. Zhang and Sun (13) used a neural network and a support vector machine to predict the medical costs of malignant breast tumors. Another study (14) used random forest and least absolute shrinkage and selection operator (LASSO) regression to predict medical expenditure. However, to the authors' knowledge, no study has used machine learning to predict the cost of hospitalization for CRC patients.
In recent years, China has gradually introduced medical insurance payment methods based on disease diagnosis-related groups (DRGs) and diagnostic intervention packages (DIPs) (15). Both DRGs and DIPs help control the cost of medicine and prevent excessive growth in medical expenditure (16). These two payment methods are mainly applicable to acute hospitalized cases and are not suitable for patients with CRC. Precise cost prediction models can therefore serve as a guide for reimbursement criteria for the hospitalization of patients with CRC, and can also provide a reference for reimbursement and prediction for other chronic diseases. Applying machine learning algorithms to hospitalization costs can also help control cost growth and prevent overtreatment; it can further refine disease groupings and help explore hospitalization cost policies better suited to China. This research therefore selected CRC patients from a tertiary hospital in Shenzhen as its subjects. Our study aims to apply random forest and support vector regression to predict the cost of hospitalization and to assess the associated factors. Our research can provide a reference for controlling the growth of hospitalization costs and for refining disease groupings.

Data source
This study retrospectively collected data from the first page of the electronic medical records of CRC patients treated at a tertiary hospital in Shenzhen from January 2018 to December 2022. Inclusion criteria: cases whose primary diagnosis code on the first page of the medical record was C18-C20 according to the International Classification of Diseases, 10th Revision (ICD-10). Procedure names were taken as recorded on the first page of the record. Exclusion criteria: to ensure the reliability of the results, procedures recorded fewer than 20 times in the major procedure variable were excluded, as were cases with missing, repeated, or obviously erroneous data on the first page of the record. A total of 1,590 cases were eventually included. After reviewing relevant reports (17-19), we included gender, age, level of procedure, number of hospitalizations, length of stay, occupation, marital status, major procedure, and Charlson Comorbidity Index (CCI) score as input variables, with hospitalization costs as the output variable (Table 1). The Charlson comorbidities were identified from the ICD-10 codes of the other diagnoses on the first page of the record, and the CCI score was calculated directly from them (20).
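The inclusion and exclusion rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the record layout and the field names `case_id`, `primary_dx`, and `major_procedure` are hypothetical, and patients without a procedure are assumed to carry an explicit category label rather than an empty field.

```python
from collections import Counter

REQUIRED_FIELDS = ("case_id", "primary_dx", "major_procedure")  # hypothetical names


def select_cohort(records, min_procedure_count=20):
    """Apply the inclusion and exclusion rules to a list of record dicts."""
    seen_ids = set()
    included = []
    for rec in records:
        # Exclude cases with missing data in the fields used here.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Exclude repeated cases (only the first occurrence is kept).
        if rec["case_id"] in seen_ids:
            continue
        # Include only primary diagnoses C18-C20; prefix match, so C18.7 counts.
        if rec["primary_dx"][:3].upper() not in {"C18", "C19", "C20"}:
            continue
        seen_ids.add(rec["case_id"])
        included.append(rec)

    # Exclude procedure names recorded fewer than min_procedure_count times.
    counts = Counter(rec["major_procedure"] for rec in included)
    return [r for r in included if counts[r["major_procedure"]] >= min_procedure_count]
```

Checks for "obvious errors" are omitted because the paper does not say how such errors were identified.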

Statistical analysis
Hospitalization costs, length of stay, age, and hospitalization frequency are continuous variables. The Shapiro-Wilk test showed that none of the continuous variables in the sample followed a normal distribution. Related studies (21) have concluded that hospital costs have special characteristics, such as large numbers of zero-cost observations and a skewed, heavy-tailed distribution, so they satisfy neither normality nor homogeneity of variance. These variables were therefore summarized as component ratios and as medians (lower quartile, upper quartile). Categorical variables were expressed similarly. Nonparametric tests use the distribution of the sample to make inferences about the distribution of the population, and they are applicable when the sample data do not follow a normal distribution and the population variance is unknown.
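As a small illustration of the median (lower quartile, upper quartile) summary used for skewed variables, the following Python sketch relies only on the standard library. The function name is ours, and the "inclusive" quartile method is an assumption; statistical packages differ in how they interpolate quartiles, so values may not match SPSS output exactly.

```python
import statistics


def median_with_quartiles(values):
    """Return (median, lower quartile, upper quartile) of a numeric sample."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q1, q3
```

For a toy sample such as [1, 2, 3, 4, 5], this returns a median of 3.0 with quartiles 2.0 and 4.0.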

Support vector regression
Traditional linear regression methods are widely used to study the influence of various factors. However, linear regression requires the data to satisfy normality, homogeneity of variance, linearity, and independence, and most reports that use linear regression to predict hospitalization costs do not report whether all of these conditions are met (13). Because hospitalization costs for colorectal cancer patients follow a skewed distribution, traditional linear regression models are of limited use for studying the factors that influence them. Previous studies (14,22) have shown that the RF and SVR algorithms are better suited to predicting hospital costs than other algorithms, but no study has compared the two in predicting the cost of hospitalization. The RF and SVR algorithms were therefore selected in this study for modeling and comparison. Some research shows that machine learning approaches are well suited to the study of big data in healthcare, and other studies have shown that SVR models generalize well in hospitalization cost prediction compared with other machine learning algorithms (22). This study therefore uses support vector regression (SVR) models to explore the factors influencing hospitalization costs. SVR is a supervised machine learning model for regression that performs well on non-linear, high-dimensional problems (23). The main idea behind SVR is to map the data into a higher-dimensional space and to perform the regression there. For SVR, the radial basis function (RBF) kernel is a good choice because it is a non-linear mapping that handles non-linear data well (24). Once the RBF kernel has been selected, the optimal SVR model can be constructed by tuning its parameters.
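To make the role of the RBF kernel concrete, the sketch below computes the kernel value for two feature vectors in plain Python. It uses the standard form exp(-gamma * ||x - z||^2); the function name is ours, and the example is meant only to show how the gamma parameter controls how quickly similarity decays with distance.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value falls toward 0 as the vectors move apart; a larger gamma makes that decay faster, so each training case influences a narrower neighborhood.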

Random forest
Random forest (RF) is also a suitable model for our study (25). One study has shown that RF has better accuracy, sensitivity, and specificity than other machine learning algorithms such as decision trees (26). Random forest is a supervised machine learning model that imposes few constraints on the data, requires only a few parameters to be tuned to reach accurate predictions, and can handle a wide range of quantitative and qualitative data (27). A random forest regression model randomly samples the cases with replacement and randomly samples the features to grow a number of least squares regression trees, and the final prediction is obtained by combining the outputs of all of the trees (27).
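The two sources of randomness and the aggregation step described above can be sketched as follows. This is a simplified illustration, not a full random forest: tree growing is omitted, the function names are ours, and in standard implementations the feature subset is redrawn at every split rather than once per tree.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def feature_subset(n_features, n_selected, rng):
    """Pick n_selected distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), n_selected))


def forest_prediction(tree_predictions):
    """Combine the per-tree predictions for one case by averaging them."""
    return sum(tree_predictions) / len(tree_predictions)
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the sampling reproducible.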

Prediction performance evaluation
For regression problems, the root mean square error (RMSE), the mean absolute percentage error (MAPE), and the mean absolute error (MAE) are frequently used to evaluate the predictive performance of a model, and the coefficient of determination (R2) is used to evaluate its fit. R2 is a statistical index that measures how well a regression model fits the observed data. The threshold for R2 is generally set at 0.75, and an R2 greater than 0.75 indicates that the model is well fitted. The MAE is the mean of the absolute errors between the predicted and actual values. The MAPE expresses the absolute error as a proportion of the actual value, averaged over all cases. The RMSE is the square root of the mean squared error between the actual and predicted values, and it also reflects the dispersion of the errors. The closer these three indicators are to 0, the better the model. All three measure the deviation of the predicted values from the observed values and depend on the magnitude of those values, so there is no established threshold for them. Because the RMSE has a wider scope of evaluation, it was selected as one of the assessment indicators in this study. In summary, R2 and RMSE were selected as the evaluation indicators in this study.
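The four indicators discussed above can be computed as in the following minimal Python sketch. It assumes the common definition R2 = 1 - SS_res / SS_tot; the paper does not state which definition its software used (some packages report the squared correlation instead), so the function is illustrative rather than a reproduction of the authors' calculation.

```python
import math


def regression_metrics(actual, predicted):
    """Return R2, RMSE, MAE and MAPE for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when any actual value is zero.
        "MAPE": sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }
```

Because RMSE and MAE are expressed in the units of the outcome, values computed on min-max normalized costs are not directly comparable with values computed on raw costs.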

Software realization
Nonparametric tests were performed using IBM SPSS Statistics 25. The RF and SVR models were implemented using packages such as "e1071", "caret", and "randomForest" in R 4.0.2.

Basic information
The study included 1,590 patients with CRC; males and females made up 60.5% and 39.5% of the sample, respectively. Inpatients with a length of stay of 11-15 days made up 36% of the study population, and patients with a length of stay of 20 days or longer represented only 16.7%. The major procedures can be divided into laparoscopic and non-laparoscopic treatment; laparoscopic right hemicolectomy was the most common, accounting for 262 cases (16.5%). Unmarried people made up 11.9% of the sample and married people 88.1%. Among payment methods, urban employees' basic medical insurance accounted for the largest share (27.9%) and urban residents' basic medical insurance for the smallest (15.3%). A total of 124 patients (7.8%) were readmitted within 1 year. Most patients were admitted through the outpatient department (95.0%; 1,461 patients). Among the procedure levels, grade 4 accounted for the largest proportion (68%) and grade 1 for the smallest (2.6%). In terms of age, 51.3% of inpatients were aged 36-65 years, 37.8% were aged 66 years or older, and 10.9% were aged 19-35 years. First-time admissions accounted for 70.4% of patients. CCI scores of 0-2 were the most common (31.0%) and scores of 9 or higher the least common (21.2%). Results are shown in Table 2.

Analysis of differences of hospitalization costs for CRC patients
The Mann-Whitney U test and the Kruskal-Wallis H test were performed with hospitalization cost as the outcome variable (Table 2). Hospitalization costs did not differ significantly by gender (p = 0.375), marital status (p = 0.18), readmission within 12 months (p = 0.762), route of admission (p = 0.247), or number of hospitalizations (p = 0.246). Hospitalization costs did differ significantly by length of stay, major procedure, payment method, level of procedure, age, and CCI score (p < 0.05).

Comparison within groups
Hospitalization costs were compared between the categories within each variable after adjusting the alpha level using the Bonferroni method.
As shown in Table 3, among the major procedure categories, hospitalization costs differed significantly between patients who did not undergo a procedure and those who underwent a therapeutic procedure, and between patients who underwent an investigative procedure and those who underwent a therapeutic procedure. Among the therapeutic procedures, no statistically significant differences in hospitalization costs were found, except for endoscopic rectal mucosal dissection. Investigative procedures include colon biopsy and rectal biopsy. Therapeutic procedures include laparoscopic partial sigmoidectomy, laparoscopic anterior rectal resection, laparoscopic sigmoidectomy, laparoscopic right hemicolectomy, laparoscopic left hemicolectomy, sigmoidectomy with colonic anastomosis, laparoscopic sigmoidectomy with colonic anastomosis, radical right hemicolectomy, and partial sigmoidectomy. For length of stay, the differences in hospitalization costs between all categories were statistically significant: the longer the hospital stay, the higher the hospitalization costs. For the CCI score, the differences in hospitalization costs between all categories were likewise statistically significant: the higher the CCI score, the higher the hospitalization costs. Hospitalization costs for level 3 procedures differed significantly from those for the other procedure levels. Hospitalization costs differed significantly between patients aged 19-35 years and those aged 66 years or older. Hospitalization costs also differed significantly between patients paying out of pocket and patients using other payment methods.
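The Bonferroni adjustment used for these pairwise comparisons can be sketched as follows. The function names are ours, and any p values passed in are expected to be unadjusted; the sketch divides the alpha level by the number of comparisons, which is equivalent to the alternative of multiplying each p value by that number.

```python
def n_pairwise_comparisons(n_groups):
    """Number of distinct pairs among n_groups categories."""
    return n_groups * (n_groups - 1) // 2


def bonferroni_significant(p_values, alpha=0.05):
    """Map each comparison to True when its unadjusted p value is below alpha / m.

    p_values is a dict keyed by comparison label, and m is the number of comparisons.
    """
    threshold = alpha / len(p_values)
    return {pair: p < threshold for pair, p in p_values.items()}
```

With three age groups, for example, there are three pairwise comparisons, so each is tested against 0.05 / 3 (about 0.0167) rather than 0.05.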

Model construction and parameter tuning
We first entered all variables into the RF and SVR models. To reduce the impact of unit differences between variables, we normalized the variables using min-max (linear) scaling: y = (x - min(x)) / (max(x) - min(x)). With all variables included, the R2 of the RF model was 0.65 and the R2 of the SVR model was 0.54, indicating that both models fitted poorly. Judging by the effect of the input variables on the output variable, only the top six variables had a substantial effect on hospitalization costs. The models were therefore adjusted further by removing one variable at a time, and performance was best when six variables remained. The final input variables were age, length of stay, major procedure, medical payment method, CCI score, and level of procedure, with hospitalization costs as the output variable. A grid search was used to tune the parameters of the RF and SVR models. The important parameters in the RF model are mtry, the number of variables randomly sampled as candidates at each split, and ntree, the number of decision trees to be constructed. An exhaustive grid search finds the best-performing parameter combination within the specified ranges. The ntree range was set to 10-500 with a step size of 1. A total of 2,450 models were built, and R2 was greatest when ntree = 142 and mtry = 4. For the SVR model, the RBF kernel was selected, and the kernel coefficient gamma and the penalty parameter cost were tuned by grid search. The gamma range was set to 0.01-0.1 with a step size of 0.02, and the cost range was set to 11-20 with a step size of 2. Model accuracy was highest when gamma = 0.01 and cost = 19, as shown in Figure 1.
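The normalization formula and the grid search described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' R code: the function names are ours, and `score` is a hypothetical placeholder for a routine that fits a model with the given parameters and returns its R2.

```python
from itertools import product


def min_max_normalize(values):
    """Rescale values to [0, 1] using y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def grid_values(start, stop, step):
    """Values from start up to at most stop, in increments of step."""
    count = int((stop - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(count)]


def grid_search(score, **param_grids):
    """Evaluate score(**params) on every combination and return the best one."""
    names = list(param_grids)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grids[name] for name in names)):
        params = dict(zip(names, combo))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

With the SVR ranges given in the text, `grid_values(0.01, 0.1, 0.02)` yields 0.01, 0.03, 0.05, 0.07, and 0.09, and `grid_values(11, 20, 2)` yields 11, 13, 15, 17, and 19, so the search covers 25 combinations. The result is only the best combination on that grid; finer steps or wider ranges could find a different optimum.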

Comparison of random forest and support vector regression
Both the random forest and support vector regression models were trained on 70% of the colorectal cancer patients (the training set) and evaluated on the remaining 30% (the test set). On the training set, the R2 of the RF model was 0.912 and the R2 of the SVR model was 0.777; the RMSE of the RF model was 0.025 and the RMSE of the SVR model was 0.041, so the prediction accuracy and fit of the RF model were better than those of the SVR model. On the test set, the R2 of the RF model was 0.833 and the R2 of the SVR model was 0.824; the RMSE of the RF model was 0.029 and the RMSE of the SVR model was 0.032. In conclusion, both the RF and SVR models showed good predictive ability for this regression problem. On the training set, the R2 of the RF model was markedly higher than that of the SVR model, and its RMSE was lower. On the test set, the R2 of the RF model was slightly higher than that of the SVR model, and the RMSE values of the two models differed little. Taking the training-set and test-set results together, the prediction accuracy and fit of the RF model were slightly better than those of the SVR model, as shown in Table 4.

Ranking the importance of variables
In the RF model, the major procedure had the highest weight (0.283), followed by the length of stay (0.260), and the variable with the smallest weight was the medical payment method. In the SVR model, only five variables had non-zero weights. The major procedure had the highest weight (0.702), followed by the length of stay (0.148), and the medical payment method had a weight of zero. Results are shown in Table 5.

Discussion
Our findings show that major procedure, length of stay, level of procedure, CCI score, age, and medical payment method have statistically significant effects on hospitalization costs for inpatients with CRC. Gender, admission route, and hospitalization frequency had no statistically significant effect, although Jacobs et al. (28) found a correlation between the admission route of inpatients and the costs of hospitalization. This discrepancy could be due to the sample studied: this research is a single-center hospitalization cost forecasting study, and future multi-center studies could further investigate the effect of admission route on hospitalization costs for inpatients with CRC. A related study (29) shows that multiple linear regression can fit nonlinear relationships across variables by including dummy variables, and that a two-part model can also be used to improve the fit. However, as the sample size increases, these methods provide limited benefit. Springer et al. (30) used multivariate regression and found that complications were the most important influencing factor, whereas complications ranked third in importance in our study. This difference may be related to the study methods and sample size. According to the variable importance values of the RF prediction model, the main factors influencing hospitalization expenses for patients with CRC are the major procedure and the length of stay, which is consistent with the results of Wu et al. (31). The study by Gao et al. (4) showed that length of stay, primary treatment, and medical insurance payment method were important factors influencing the cost of hospitalization for colorectal cancer patients, which is also consistent with our findings.
Combining the variable importance values of the RF and SVR models with the results of the related analyses, the major procedure has the greatest impact on the cost of hospitalization for CRC patients, with length of stay ranking second. The major procedures for colorectal cancer patients were grouped into 15 categories. Hospitalization costs differed significantly between rectal surgery performed endoscopically and laparoscopically, and between investigative procedures such as rectal biopsy and colon biopsy and therapeutic procedures such as colorectal resection. Surgery is the most important treatment for colorectal cancer. Although traditional open surgery can be effective, it is highly traumatic for the patient and not conducive to recovery (32). As minimally invasive techniques have developed, laparoscopic surgery has started to become the main procedure for the treatment of colorectal cancer. Laparoscopic procedures have the advantages of less trauma, less damage to surrounding tissues, and faster postoperative recovery. Laparoscopic surgery is effective not only in reducing the overall cost of medical care during hospitalization but also in reducing the length of stay, thus increasing the efficiency of hospital operations (33). However, disposable consumables and the procedure itself are more expensive for laparoscopic surgery than for open surgery (34). Comparisons within the length of stay groups found statistically significant differences in hospitalization costs between all groups: the longer the hospital stay, the higher the cost of hospitalization. The study found that hospitalization costs were not significantly different between patients who had laparoscopic surgery and those who had open surgery, except for endoscopic rectal mucosal dissection. We speculate that, although laparoscopic surgery is expensive in terms of consumables, the smaller incision and shorter recovery period reduce the number of hospital days, whereas patients who have open surgery need a longer recovery period, so the difference in hospitalization costs between the two is not significant. Therefore, apart from rectal cancer surgery, clinicians performing a procedure for CRC can choose the procedure with the better treatment results. When performing a procedure for rectal cancer, the surgeon should consider not only the severity of the patient's disease but also the patient's financial situation in choosing the best treatment option.
In our study, the CCI score was positively correlated with hospitalization costs. Comparison within the CCI groups found significant differences in hospitalization costs between all levels of the CCI score. The higher the CCI score, the more comorbidities the patient has and the more serious the patient's condition; the cost of diagnosis and treatment is therefore higher and the hospital stay longer, which raises the total cost of hospitalization. This agrees with the findings of Zhang (35). The CCI score is a notable influencing factor. In the future, the combined relationship between comorbidities, major procedures, and length of stay can be studied to determine a reasonable length of stay, one that prevents treatment from being compromised by too short a stay while also preventing excessive increases in hospitalization costs and unreasonable resource allocation due to an excessive length of stay.
Age is an influential factor in the cost of hospitalization for people with CRC. The results show a statistically significant difference in hospitalization costs between patients aged 19-35 years and patients aged ≥66 years, which may be because older people are physically weaker than younger people, recover more slowly, consume more healthcare resources, and stay in hospital for longer periods (36). Based on the results of the pairwise comparisons, medical payment method and level of procedure had an effect on hospitalization costs for CRC patients, but the impact was not clear-cut, which is consistent with the variable importance values and correlation analysis of the RF and SVR models.

Clinical implications
Doctors can plan treatment rationally once they understand a patient's various conditions, taking the patient's financial situation into account. Hospitals or policymakers can use the model to predict colorectal cancer hospitalization costs and create individualized, precise hospital reimbursement plans that provide a reference for value-based care.

Conclusion
The study shows that hospitalization costs for patients with colorectal cancer are influenced by a number of variables, including major procedure, length of stay, CCI score, level of procedure, age, and medical payment method, with major procedure and length of stay being the most consequential. Hospitalization costs for investigative procedures are lower than those for therapeutic procedures. There is no significant difference in the cost of hospitalization between therapeutic procedures, with the exception of endoscopic rectal procedures. We speculate that, although laparoscopic surgery is expensive in terms of consumables, the smaller incision and shorter recovery period reduce the number of hospital days, whereas patients who have open surgery need a longer recovery period, so the difference in hospitalization costs between the two is not significant. Further research is required to substantiate this. The CCI score is an important factor in hospitalization costs: as the number of comorbidities increases, the cost of hospitalization for CRC patients increases. The hospitalization cost prediction model built with the RF algorithm performs better than the one built with the SVR algorithm. The RF model can predict hospitalization costs for CRC patients and can inform medical insurance policy on personalized, precise hospitalization reimbursement schemes in the future. Our study has some limitations. Owing to practical constraints, the study used patient data from only one hospital; future studies can expand the sample size and range and conduct more in-depth analyses. This paper studies hospitalization costs based on the first page of the medical record, so the independent variables included are limited. Future studies could incorporate more independent variables.

TABLE 1
Variables of research.

TABLE 2
Basic characteristics and univariate analysis in hospitalization costs of colorectal cancer.

TABLE 3
Significance results of pairwise comparison.

TABLE 4
Comparison of prediction capacity of random forest model and support vector regression model.

TABLE 5
Importance ranking of variables.