Retrospective Study on the Influencing Factors and Prediction of Hospitalization Expenses for Chronic Renal Failure in China Based on Random Forest and LASSO Regression

Aim: With the improvement in people's living standards, the incidence of chronic renal failure (CRF) is increasing annually. The increase in the number of patients with CRF has significantly increased pressure on China's medical budget. Predicting hospitalization expenses for CRF can provide guidance for effective allocation and control of medical costs. The purpose of this study was to use the random forest (RF) method and least absolute shrinkage and selection operator (LASSO) regression to predict personal hospitalization expenses of hospitalized patients with CRF and to evaluate related influencing factors. Methods: The data set was collected from the first page of the medical records of three tertiary first-class hospitals for the whole year of 2016. Factors influencing hospitalization expenses for CRF were analyzed. RF and LASSO regression models were used to establish prediction models for the hospitalization expenses of patients with CRF, and the models were compared and evaluated. Results: For CRF inpatients, statistically significant differences in hospitalization expenses were found for major procedures, medical payment method, hospitalization frequency, length of stay, number of other diagnoses, and number of procedures. The R² values of the LASSO regression model and the RF regression model were 0.6992 and 0.7946, respectively. The mean absolute error (MAE) and root mean square error (RMSE) of the LASSO regression model were 0.0268 and 0.043, respectively, and the MAE and RMSE of the RF prediction model were 0.0171 and 0.0355, respectively. In the RF model, the weight of length of stay was the highest (0.730). Conclusions: The hospitalization expenses of patients with CRF are most affected by length of stay. The RF prediction model is superior to the LASSO regression model and can be used to predict the hospitalization expenses of patients with CRF.
Health administration departments may consider formulating accurate individualized hospitalization expense reimbursement mechanisms accordingly.


INTRODUCTION
Chronic renal failure (CRF) refers to chronic progressive renal parenchyma damage caused by various factors, leading to obvious kidney atrophy and the inability to maintain basic function. Chronic renal failure is a clinical syndrome characterized by retention of metabolites and water, electrolyte and acid-base disorders, and major clinical manifestations of other organ system involvement. Chronic renal failure has become a major public health problem worldwide. Chronic renal failure can occur at all ages, and there are many differences in the affected population. With the improvement in people's living standards, the CRF incidence is increasing annually, and CRF has become one of the major chronic diseases affecting the health of the Chinese people. A national epidemiological survey conducted by Zhang Lixin et al. (1) in 2012 showed that the overall prevalence of chronic kidney disease in China was 10.8%. According to the annual report of the United States Renal Disease Data System (USRDS) 2016, the global average prevalence of adult CKD is 14.8% (2). A study by the Korean Society of Nephrology (3) also showed that the incidence of end-stage renal disease in South Korea is 70% of that in the United States. The number of patients has increased year by year, and the treatment costs have also increased accordingly. Therefore, the increase in the number of CRF patients has significantly increased pressure on national medical budgets (4).
Abbreviations: CRF, chronic renal failure; RF, random forest; LASSO, least absolute shrinkage and selection operator; CHD, in-center hemodialysis; RRT, renal replacement therapy; HD, hemodialysis; PD, peritoneal dialysis; DRGs, diagnosis-related groupings; DIP, diagnosis intervention packet; ICD, international classification of diseases; RMSE, root-mean-square error; MAE, mean absolute error; QR, quartile range.
Mohnen et al. analyzed the medical expenses of patients using different kidney replacement methods based on Dutch health insurance claims data. The results showed that the costs of all dialysis methods are very high, with annual expenditures of 77,566 euros and 92,616 euros for continuous ambulatory peritoneal dialysis (CAPD) and in-center hemodialysis (CHD), respectively, and 105,833 euros for patients in the mixed dialysis group. Most of the overall health care costs are related to renal replacement therapy (RRT) (5). Research by Makhele et al. (6) showed that in South Africa, from the perspective of healthcare providers, the annual cost of hemodialysis (HD) per patient (31,993.12 US dollars) is higher than that of peritoneal dialysis (PD) (25,282.00 US dollars). The treatment for CRF includes conservative medical treatment and surgical treatment, such as continuous PD, HD, and kidney transplantation. The cost of treatment differs substantially between HD and PD. The hospitalization costs of CRF patients with different comorbidities and accompanying diseases are also different (7)(8)(9)(10). A study by Khan of Tufts University School of Medicine in Boston (11) found that secondary hyperparathyroidism is associated with the high cost of treatment for CRF patients with cardiovascular complications. The research of Zhao et al. (12) indicates that the scope of medical insurance payments will affect hospitalization expenses.
Different medical insurance reimbursement payment systems may affect the choice of end-stage renal disease treatment, thereby affecting the allocation of related resources and ultimately affecting the national medical budget (13,14). Therefore, many factors affect the hospitalization costs of CRF patients, and patient grouping and medical insurance payment standards are more complicated, thus necessitating further research (15). Incorporating these potential factors to predict the medical expenditures of hospitalized CRF patients will be beneficial to the development of policies regarding the allocation of health resources.
Data mining is a process involving careful analysis of large amounts of data to reveal meaningful new relationships, trends, and patterns. Data mining emerged in the late 1980s and represents a new field with important application value in database research, which is an intersecting field. The discipline integrates theories and technologies in many fields such as artificial intelligence, database technology, pattern recognition, machine learning, statistics, and data visualization (16,17). With the development of data mining research and applications, people have reached a consensus on the understanding of data mining; that is, data mining is a method that uses various strategies to extract hidden and potential information and knowledge from a large amount of data, which is very useful for the decision-making process (18). Therefore, data mining provides a new and promising method for reasonable allocation and control of hospitalization expenses, especially in the era of big data (19). Scholars have applied data mining algorithms to predict and analyze medical expenses; Yang et al. (20) used four machine learning models for patients with high-cost and high-demand chronic diseases, including ordinary least squares linear regression (LR), regularized regression (LASSO), gradient boosting machine (GBM), and recurrent neural networks (RNN, a deep learning approach), and constructed a medical expenditure prediction model. Cao et al. (21) proposed the alpha(tj) algorithm and the truncated Newton algorithm to build a dynamic medical path Net system to predict the medical expenses of gastric cancer patients. Wang et al. (22) used the random forest (RF) model to predict the medical expenses of individual diabetic patients and evaluated related influencing factors, but no studies on the prediction of hospitalization expenses and influencing factors for patients with CRF have been performed.
China is currently enacting new medical reform policies, and the medical insurance payment methods advocated by the National Medical Security Administration mainly include diagnosis-related groups (DRGs) and Diagnosis Intervention Packets (DIPs). As a payment tool that can effectively control increases in medical expenses (23), DRGs were conceived in the United States and rapidly developed worldwide. Diagnosis-related groups classify patients into 500-600 groups according to factors such as age, sex, length of hospital stay, clinical diagnosis, disease, surgery, disease severity, comorbidities, complications, and outcomes, and the amount of compensation that should be given to a hospital is then determined for each group. Diagnosis intervention packets are based on the three core elements of disease screening, measuring the score for each disease, and determining the coefficients of medical institutions to establish a disease score database reflecting differences between different diseases. The relative weight of a medical institution establishes the relationship between the cost of diagnosis and treatment of a disease and the payment price, which is the payment method used in China (24). In addition to these two mainstream payment methods, traditional payment methods such as project-based payment are available. These medical expense payment policies are subject to difficulties and deficiencies in the actual implementation of human resources, information technology, and economic development. Thus, data mining and machine learning algorithms are required for innovative integration of various characteristics of diseases according to the currently implemented medical insurance payment methods to explore medical expense payment methods that are more suitable for China's national conditions. Therefore, this study selected CRF inpatients from three tertiary first-class hospitals in Beijing as the research objects.
Our purpose is to use the RF method and the LASSO method to predict individual hospitalization expenses and evaluate related factors based on data from the first page of CRF patients' hospital records. From a personal perspective, predicting the cost of CRF hospitalization will render resource allocation more accurate and reasonable. Our research can provide new ideas for health policy and management research.

Source of Data
In China, national regulations require hospitals at or above the county level to use the International Classification of Diseases (ICD) on the front pages of medical records to classify and code disease diagnoses. In the ICD-10, the three-digit code N18 represents the category of CRF. Therefore, the N18 category of the ICD was used to extract hospitalized cases of CRF. Our research data were collected from the first pages of medical records at three tertiary first-class hospitals in Beijing. Hospitalized patients with a main diagnosis code of N18 (ICD-10 CRF code) in 2016 were identified, and a total of 1,819 hospitalized cases were included.
Under the guidance of relevant reports (25)(26)(27)(28)(29), we included the following variables from the medical records: the input variables were sex, age, marital status, medical payment method, length of stay, the number of other diagnoses, major procedures, and the number of procedures, and the target variable was hospitalization expenses (Table 1). The main procedure classification was adopted from the third volume of the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM-3; 2011).

Data Preprocessing
According to the research purpose and the meaning of each variable value, each variable was adjusted, and variables with a small sample size (<10) were deleted. Multiple variables in the major procedures category with fewer than 10 cases were deleted, including procedures that are not frequently performed during CRF patients' hospitalization. Finally, 1,635 valid hospitalized cases with no missing values constituted the data set for analysis.
Hospitalization frequency, age, length of stay, the number of other diagnoses, the number of procedures, and hospitalization expenses are continuous variables. Skewness and kurtosis calculations showed that all of these variables followed a skewed distribution. The continuous variables were therefore grouped; sample frequencies and composition ratios were used for descriptive statistics, and the median and quartiles of the total medical expenses were calculated for each group. The same methods were used for sex, marital status, major procedures, and medical payment method.
IBM SPSS Statistics 23, obtained from the official IBM website, was used for statistical analysis of the above data set.
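As an illustration of the skewness check and the median and quartile summaries described above, a minimal Python sketch is given here. It is not the SPSS procedure used in the study: the moment-based formulas, the function names, and the expense values are all assumptions chosen for illustration, and SPSS applies sample-size corrections that this sketch omits.

```python
import statistics


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no sample-size correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def median_and_quartiles(values):
    """Return (median, (lower quartile, upper quartile))."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical expenses for one group; these are NOT values from the study.
expenses = [2800, 3900, 5300, 6000, 7400, 9000, 11500, 18500, 71000]

print("skewness:", skewness(expenses))
print("excess kurtosis:", excess_kurtosis(expenses))
print("median and quartiles:", median_and_quartiles(expenses))
```

A positive skewness, as produced by the single very large value in this made-up group, is the pattern that motivates reporting medians and quartiles rather than means.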

Random Forest Analysis
Medical expenditures typically follow a positively skewed distribution. In this study, the candidate factors included both nominal and continuous variables, and the continuous data were also skewed. Related studies have used RF models for predictions (22), and other studies have shown that the RF method is a suitable ensemble learning algorithm that imposes no restrictions on variable conditions (30) and achieves higher accuracy, sensitivity, and specificity than decision trees (31). In addition, RF can be used to predict continuous variables and obtain predictions without obvious deviations (32). Therefore, RF is a suitable prediction method for the data in this study.

Least Absolute Shrinkage and Selection Operator
Least absolute shrinkage and selection operator penalty regression is another predictive model suitable for our research data. By constructing a penalty function, the coefficients of variables can be compressed to solve the problem of regression model overfitting. Least absolute shrinkage and selection operator is a regression technique for variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model that it produces. In LASSO, data values are shrunk toward a central point, and this algorithm aids in variable selection and parameter elimination. This type of regression is well-suited for models with high multicollinearity. Least absolute shrinkage and selection operator regression adds a penalty equal to the absolute value of the magnitude of coefficients, and some coefficients can become zero and are eventually eliminated from the model, resulting in variable elimination, and thus models with fewer coefficients (20,33).
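To make the variable-elimination effect concrete, the following Python sketch shows the soft-thresholding operator. Under the simplifying assumption of an orthonormal design matrix and the objective (1/2)||y - Xb||² + lambda * ||b||₁, the LASSO coefficient of each variable is the soft-thresholded least-squares coefficient. The coefficient values, variable names, and lambda below are hypothetical and are not taken from the study.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients (illustrative only).
ols_coefs = {
    "length_of_stay": 0.62,
    "payment_method": 0.03,
    "marital_status": 0.004,
}
lam = 0.01  # hypothetical penalty strength

lasso_coefs = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
for name, b in lasso_coefs.items():
    print(name, b)
```

Coefficients whose magnitude is below the penalty are set exactly to zero, which is how LASSO removes variables, while larger coefficients are shrunk but retained.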

Prediction Performance Evaluation
In this study, the mean absolute error (MAE) and root mean square error (RMSE) between the predicted value and the actual value were used to evaluate prediction performance. The coefficient of determination R² was used to reflect the regression fitting effect of the prediction model. The mean accuracy was used to assess the relative importance of variables (34). The above algorithms were implemented using the LassoCV and RandomForestRegressor classes of scikit-learn (sklearn) in Python.
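The three evaluation metrics can be written out directly from their standard definitions. The sketch below is a plain-Python illustration; the function names and the actual and predicted values are hypothetical and do not come from the study data.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """RMSE: square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """R^2: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical normalized expenses and predictions (illustrative only).
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("RMSE:", root_mean_square_error(y_true, y_pred))
print("R2:", r_squared(y_true, y_pred))
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and penalizes large individual errors more heavily.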

Sample Characteristics
Among the 1,635 hospitalized cases (see Table 2), males and females accounted for 58.6% and 41.4% of the sample, respectively; unmarried people accounted for 7.9%, married people accounted for 89.5%, and others accounted for 2.6% of the sample. Arteriovenostomy for renal dialysis (ICD-9-CM-3 procedure code: 39.27) accounted for the largest proportion of major procedures at 24.3%, and other oxygen enrichment procedures (ICD-9-CM-3 procedure code: 93.96) accounted for the smallest proportion at only 1.3%. In terms of medical payment methods, medical insurance accounted for the highest proportion at 71.3%, and fully public expenses accounted for the smallest proportion at 2.9%. A total of 41.9% of hospitalized patients were hospitalized for the first time, and the remaining 58.1% were hospitalized for the second time or more. Patients with a hospital stay shorter than or equal to 10 days accounted for 61.4% of the sample, and patients with a hospital stay > 21 days accounted for only 8.1% of the sample. Patients with no other diagnoses or one other diagnosis accounted for the smallest proportion of the sample at only 2.5%, while patients with five other diagnoses accounted for the highest proportion at 18.8%. Patients who did not undergo procedures accounted for only 14.4% of the sample, and patients who underwent two or more procedures accounted for the highest proportion at 65.4%.

Analysis of Differences in Hospitalization Expenses for Chronic Renal Failure
With hospitalization expenses as the target variable and sex, marital status, major procedures, and medical payment method as characteristic variables, the Mann-Whitney U-test and Kruskal-Wallis H-test were performed. The results showed no statistically significant differences in hospitalization expenses with respect to sex (p > 0.05), marital status (p > 0.05), or age (p > 0.05), but major procedures (p < 0.001), medical payment method (p < 0.05), the number of hospitalizations (p < 0.05), the length of stay (p < 0.001), the number of other diagnoses (p < 0.001), and the number of procedures (p < 0.001) were associated with statistically significant differences in hospitalization expenses (Table 2). Further post-hoc pairwise comparisons were performed. Using the Bonferroni method, the α level was adjusted to test whether hospitalization expenses differed significantly between the groups of each variable that showed a significant overall difference. Table 3 shows the statistically significant results of the pairwise comparisons. According to the results of the pairwise comparison of major procedures, the hospitalization expenses of patients receiving other kidney transplantation procedures (median: 71,483, QR: 57,862-86,866) differed significantly from those of all other groups of patients.
Statistical differences were found between the hospitalization expenses of patients in the no procedure group (median: 5,362, QR: 2,860-8,960) and those of patients in the venous catheterization (median: 11,808, QR: 6,137-19,764), venous catheterization for renal dialysis (median: 11,537, QR: 7,163-18,503), arteriovenostomy for renal dialysis (median: 7,444, QR: 5,007-11,528), Creation
In terms of the number of procedures, statistically significant differences in hospitalization expenses were found between patients undergoing two procedures (median: 10,028, QR: 6,471-17,649) and patients undergoing no procedures (median: 5,336, QR: 2,870-8,959) or one procedure (median: 6,045, QR: 3,921-9,310), showing that hospitalization expenses are higher for patients undergoing two or more procedures.
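The Bonferroni adjustment used for the pairwise comparisons divides the overall significance level by the number of comparisons. The sketch below illustrates this for the three procedure-count groups; the overall α of 0.05 and the function name are assumptions for illustration, since the exact settings of the SPSS analysis are not stated here.

```python
from itertools import combinations


def bonferroni_pairwise(group_names, alpha=0.05):
    """Return all pairwise comparisons and the Bonferroni-adjusted alpha.

    With k groups there are k * (k - 1) / 2 comparisons, and each one is
    tested at alpha divided by that number.
    """
    pairs = list(combinations(group_names, 2))
    return pairs, alpha / len(pairs)


# Group labels follow the text; the overall alpha of 0.05 is an assumption.
groups = ["no procedure", "one procedure", "two or more procedures"]
pairs, adjusted_alpha = bonferroni_pairwise(groups)

for first, second in pairs:
    print(f"{first} vs {second}: significant if p < {adjusted_alpha:.4f}")
```

With three groups there are three comparisons, so each pairwise p-value would be compared against 0.05 / 3 rather than 0.05.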

Model Construction and Parameter Tuning
Because the number of feature variables selected in this study was small and, based on clinical experience, each variable had analytical value, we selected sex, age, marital status, medical payment method, hospitalization frequency, length of stay, the number of other diagnoses, major procedures, and the number of procedures as input variables and hospitalization expenses as the output variable. To reduce the influence of the unit difference between different variables, the linear conversion function y = (x - MinValue)/(MaxValue - MinValue) was used to normalize the variables. The study used 10-fold cross-validation to divide the entire sample into 10 equally sized subsamples. Among the 10 subsamples, one was retained as the verification data set of the test model, and the remaining nine were used as the training data set. After the cross-validation process was repeated 10 times, 10 results were generated (35), and the average value was taken as the performance metric. In this paper, the mean absolute error (MAE) was selected as the evaluation index, and the best Lambda value was obtained through cross-validation. The relationship between the model MSE and Lambda is shown in Figure 1. As shown in Figure 1, when the best penalty factor Lambda = 10⁻³, the MSE is the smallest, and the LASSO regression model has the highest accuracy. The multicollinearity problem can be solved by reducing the parameter Lambda. For the RF model, the parameter n_estimators was tuned over the range 1 to 100 using 10-fold cross-validation, and the R² value of the model was the largest at n_estimators = 75.
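The normalization and the 10-fold partition described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study code: the function names, the random seed, and the length-of-stay values are hypothetical, and only the sample size of 1,635 comes from the text.

```python
import random


def min_max_normalize(values):
    """Apply y = (x - MinValue) / (MaxValue - MinValue) to a list of numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsamples
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Hypothetical lengths of stay in days (illustrative only).
length_of_stay = [3, 7, 10, 14, 21, 45]
normalized = min_max_normalize(length_of_stay)
print(normalized)

# 1,635 is the number of valid cases reported in the text.
splits = list(k_fold_indices(1635, k=10))
for training, validation in splits[:2]:
    print(len(training), len(validation))
```

Because 1,635 is not divisible by 10, the folds differ in size by at most one case; each case appears in exactly one validation fold.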

Performance Comparison
In a comparison of the prediction performance of the RF prediction model and the LASSO regression model, the coefficient of determination R² of the LASSO regression model was 0.6992, and the R² of the RF regression model was 0.7946. The fitting effect of the RF prediction model was better than that of the LASSO regression model. The MAE and RMSE of the LASSO regression model were 0.0268 and 0.043, respectively, and the MAE and RMSE of the RF prediction model were 0.0171 and 0.0355, respectively. The prediction accuracy of the RF prediction model was better than that of the LASSO regression model (Table 4). The results are also shown in Figure 2.

Variable Selection Comparison
In the RF model, all input variables had a certain weight. The length of stay had the highest weight (0.730), followed by major procedures (0.089), and the variable with the lowest weight was marital status (0.004). In the LASSO model, only five variables had weights. The length of stay had the highest weight (0.604), followed by the number of other diagnoses (0.018) and the medical payment method (0.018). Four variables had a weight of zero, namely, major procedures, hospitalization frequency, the number of procedures, and marital status (Table 5).
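For readers who wish to reproduce this kind of comparison, the sketch below shows how variable weights can be read from the two scikit-learn estimators named in the methods. It runs on synthetic data, not the study data; the feature names, the data-generating formula, and the random seeds are assumptions, and n_estimators = 75 is taken from the tuning result reported above. Note that `feature_importances_` in scikit-learn is an impurity-based measure, and whether the study used exactly this attribute is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

# Synthetic, already normalized inputs (illustrative only, not study data).
rng = np.random.default_rng(0)
n_samples = 300
feature_names = ["length_of_stay", "n_other_diagnoses", "marital_status"]
X = rng.random((n_samples, len(feature_names)))

# Hypothetical target dominated by the first feature; the third is pure noise.
y = 0.7 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.02, n_samples)

# LASSO with the penalty chosen by 10-fold cross-validation.
lasso = LassoCV(cv=10, random_state=0).fit(X, y)

# Random forest with the number of trees reported in the text.
rf = RandomForestRegressor(n_estimators=75, random_state=0).fit(X, y)

lasso_weights = dict(zip(feature_names, lasso.coef_))
rf_weights = dict(zip(feature_names, rf.feature_importances_))

print("LASSO coefficients:", lasso_weights)
print("RF importances:", rf_weights)
```

The RF importances sum to one across variables, whereas LASSO coefficients are on the scale of the normalized target, so the two sets of weights are comparable in ranking but not in magnitude.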

DISCUSSION
Our research results show that sex, age, and marital status produced no statistically significant differences in the hospitalization expenses of patients with CRF, but the results of Muñoz et al. (25) and Life et al. (36) both show a correlation between the age of patients with kidney disease and the cost of hospitalization, which may be related to sample selection in the study. The age range of the patients in the study sample is narrow at 44-73 years. With intensification of population aging, the medical and financial pressure caused by patients with CRF in various countries will inevitably increase, which illustrates the necessity of this study. In addition, future studies can expand the sample size, increase the age range of patients, and further explore the impact of age on the hospitalization expenses of patients with CRF.
Our research results also show that major procedures, medical payment methods, hospitalization frequency, the length of stay, the number of other diagnoses, and the number of procedures have a statistically significant impact on the hospitalization expenses of patients with CRF. The research of Zhao et al. (12) shows that medical insurance has no significant effect on medical expenses in China, in contrast to the results of this study. However, the studies of Turenne et al. (14) and Hornberger et al. (15) both show that the medical payment method has an impact on end-stage renal dialysis methods and economic consumption. The study by Xiong et al. (37) also shows that the setting of different medical insurance policies has a certain impact on patients' medical expenses, which is consistent with our research results. Combining the importance values of the predictive variables in the RF prediction model and the LASSO regression model with the results of related analyses, the main influencing factor for the hospitalization expenses of patients with CRF is the length of stay, which is consistent with the research results of Life et al. (36) and Arquivos de Neuro-Psiquiatria (38). The study by Wang et al. (22) showed that the length of stay and the main treatment methods are important factors affecting the hospitalization expenses of lung cancer patients, which is also consistent with our research results. Our research results also show that with major procedures as the grouping variable, hospitalization expenses between the groups of CRF patients are not completely different. The median hospitalization expenses of the patients with kidney transplantation are the highest, and the median hospitalization expenses for patients without procedures are the lowest.
No statistically significant difference in hospitalization expenses was found between HD and PD patients, showing that during a single hospitalization, these two treatment options are not the factors causing the difference in hospitalization expenses.
The length of stay was divided into groups and compared between groups. Significant differences in the median hospitalization expenses were found between the groups, and the hospitalization expenses increased with increasing hospitalization time. The cost of hospitalization for fully self-pay patients was higher than that of patients paying through medical insurance. In terms of the number of procedures, the hospitalization expenses of CRF patients undergoing two or more procedures were higher than those of patients undergoing one or no procedures. According to the pairwise comparison between groups, hospitalization frequency and the number of other diagnoses had an impact on the hospitalization expenses of CRF patients, but the effect was not obvious. Our research results also show that in the predictive model, major procedures had a relatively small impact on the hospitalization expenses of CRF patients. On the one hand, this finding may be related to the multicollinear relationship between major procedures and length of stay. With length of stay as the main influencing factor, the RF prediction model and LASSO regression model showed a smaller impact of major procedures on hospitalization expenses; on the other hand, according to major procedures performed, most cases had complications and accompanying symptoms.
Therefore, in addition to examinations and treatments for CRF such as kidney biopsy, HD, PD, and related medical procedures, many patients were examined and treated for other diseases, such as ultrasound examinations and oxygen therapy, indicating that patients with CRF undergo many examinations and treatments during hospitalization, which increases treatment expenses. Moreover, except for the kidney transplant group, the hospitalization expenses of CRF patients in the medical group and the non-operating room surgery group did not significantly differ. At the same time, patients with CRF suffer from a variety of diseases, and their physical condition is poor. Therefore, the hospital stay will be relatively long, and medical resource consumption will increase accordingly, which may also be the reason why the length of stay had a greater impact on the hospitalization expenses of patients with CRF.
The evaluation and comparison results of the RF prediction model and the LASSO regression model show that the regression fitting and accuracy of the RF prediction model are superior to those of the LASSO regression model, and the LASSO regression model is more suitable for feature screening (33). In the RF prediction model, all input variables had a certain contribution to the model, but in the LASSO regression model, only five variables had a certain contribution, while the other four variables were not important to the model. The length of stay contributed the most to the two prediction models. Future research can also explore objective factors affecting the length of stay of CRF patients, such as age, complications, and comorbidities, to determine the appropriate length of hospitalization for individual patients and prevent inadequate hospitalization, which can affect clinical efficacy and prognosis. At the same time, hospitalization time can be effectively controlled, medical efficiency can be improved, and medical resources can be effectively allocated.
Based on the analysis results of the influencing factors of hospitalization expenses, we also believe that unlike patients submitted to short-term hospitalization for surgical procedures, most patients with CRF suffer from complications and comorbidities, resulting in diverse conditions, long hospital stays, and different hospitalization measures. The clinical process is highly heterogeneous, and the length of hospitalization also significantly differs due to individual differences. The two payment methods currently implemented in China, DRGs and DIPs, are mainly applicable to acute hospitalized cases (39) and are not suitable for patients with CRF. The CRF hospitalization expense prediction model based on the RF algorithm constructed in this study can be used to guide determination of the hospitalization expense reimbursement standard for individual patients with CRF, which can also be applied to the prediction and reimbursement of hospitalization expenses for other chronic and complex diseases.
Our research has some limitations. First, our study included only the first page of data from the medical records of three tertiary first-class hospitals in 2016, while other studies have a longer time span. Second, due to limited conditions, the sample data used in our research are not sufficiently comprehensive; future studies can further expand the sample size and sample scope and conduct more in-depth research. Finally, the independent (input) variables included in this study are limited, and future studies may consider including more input variables to explore the construction of predictive models with better performance.

CONCLUSIONS
Our research shows that for inpatients with CRF in general hospitals, hospitalization expenses are affected by many factors such as length of stay, other diagnoses, medical payment methods, procedures, and the number of hospitalizations and that the degree of influence of each factor is also different, with length of stay being the most influential factor. The performance of the hospitalization expense prediction model constructed by the RF algorithm is better than that of the LASSO regression model. Using the RF prediction model to predict the hospitalization expenses of individual CRF patients is reasonable and convenient. In addition, the model represents an individualized and precise hospitalization cost compensation and control plan that health administration and medical security departments can consider implementing in the future.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the institutional ethics committees of the Third Xiangya Hospital Central South University (No:2020-s343). Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
AL and PD conceived and designed the work. PD, WC, and ZX made substantial contributions to the acquisition and analysis of data for the work. PD, HC, and WO interpreted the data for the work. All authors participated in drafting the work or revising it critically, gave final approval of the version to be published, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

FUNDING
This study was supported by the State Key Program of National Social Science of China (grant no. 17AZD037).