Demand prediction of medical services in home and community-based services for older adults in China using machine learning

Background Home and community-based services (HCBS) are considered an appropriate and crucial care approach for older adults in China. However, research examining demand for medical services in HCBS through machine learning techniques and nationally representative data has not yet been carried out. This study aimed to address the absence of a complete and unified demand assessment system for home and community-based services.
Methods This was a cross-sectional study of 15,312 older adults based on the Chinese Longitudinal Healthy Longevity Survey 2018. Models predicting demand were constructed using five machine learning methods, namely Logistic regression, Logistic regression with LASSO regularization, Support Vector Machine, Random Forest, and Extreme Gradient Boosting (XGboost), and were based on Andersen's behavioral model of health services use. We used 60% of the older adults to develop the models, 20% of the samples to examine the performance of the models, and the remaining 20% of cases to evaluate the robustness of the models. To investigate demand for medical services in HCBS, individual characteristics (predisposing, enabling, need, and behavior factors) were grouped into four combinations to determine the best model.
Results Random Forest and XGboost models produced the best results: both models exceeded 80% specificity and produced robust results in the validation set. Andersen's behavioral model allowed odds ratios to be combined with estimates of the contribution of each variable in the Random Forest and XGboost models. The three most critical features affecting whether older adults required medical services in HCBS were self-rated health, exercise, and education.
Conclusion Andersen's behavioral model combined with machine learning techniques successfully constructed a model with reasonable predictors to identify older adults who may have a higher demand for medical services in HCBS. Furthermore, the model captured their critical characteristics. This demand prediction method could be valuable for communities and managers in arranging limited primary medical resources to promote healthy aging.


. Introduction
In recent decades, the aging population in China has emerged as a prominent social problem (1). According to the seventh population census, in 2020, 13.50% of the total population, i.e., 190.64 million people living in China, were 65 years or older (2). It is estimated that at this rate China will become a moderately aged society by 2030 (3), leading to considerable health problems, with 75.8% of the aging population suffering from at least one chronic disease (4). The World Health Organization (WHO) proposes healthy aging as a strategy to deal with aged societies (5); it thus advises providing older adults with integrated healthcare services. It emphasizes the concept of bio-psycho-social health, i.e., maintaining good physiological, psychological, and social health conditions in older adults (5).
Of the globally available aging care services (6-8), the three mainstream care services are family-based, home and community-based, and elder care institutions. Due to differing national and cultural conditions, the advantages and limitations of the care services vary. Home and community-based services (HCBS) refer to individual-centered care provided by the community at home. HCBS not only retains the traditional form of caring but also reduces daily care and financial burdens for children (9), along with addressing the psychological (10, 11) and physical needs (11) of older adults.
HCBS evolved in Western countries in the 1980s and became widely popular in Europe (12), the USA (13), and Australia (14). HCBS takes care of people with different needs, such as patients with disability (15), depression (16), and dementia (17). In China, HCBS gained importance and support from the government in 2008 (18, 19). Moreover, the supply intensity of HCBS across the whole nation gradually increased from 2008 to 2018, during which the supply rates of all services doubled (20). Over time, HCBS became the most appropriate care service for older adults in China (21). The 2018 Chinese Longitudinal Healthy Longevity Survey (CLHLS) classified services into the following four types, with each type having two sub-categories: (a) medical service, including home visits and healthcare education; (b) daily life care service, including personal care and daily shopping; (c) spiritual and cultural service, including social and recreational activities and psychological consulting; and (d) mediation service, including legal aid and neighborhood relations. Among all four services, medical services were in the highest demand from 2008 to 2018 (22, 23), and the provision of home visits and healthcare education was limited due to strained primary medical resources. Predicting demand for medical services could help managers better manage the service and target its delivery. Based on a 2014 national survey of older adults, a study using a logit model explored the factors that influenced the demand for HCBS (24). Global research on unmet HCBS demand is scarce, and research predicting HCBS demand is lacking (25, 26). Former research has adopted classification trees to predict whether older adults would use HCBS (27), even though there were deficiencies between demand and supply. Recently, HCBS was in high demand, but the lack of a complete and unified demand assessment system created an inability to convert potential demand into effective demand (28).
Moreover, community managers lacked comprehensive and accurate supply planning, contributing to a severe mismatch between demand and supply. This suggests the necessity of exploring methods to assess service demand and provide efficient and cost-effective HCBS (29). Predicting the demand for HCBS among older adults could help managers provide targeted services and formulate short- and long-term plans to address deficiencies. Traditional regression methods utilized in previous studies require independence of each variable and cannot resolve collinearity between the variables. Extant studies have concentrated on specific populations or certain factors, consequently failing to comprehensively grasp the demands of the whole population and its critical characteristics. Machine learning can incorporate variables, produce accurate results with fewer constraints, and explore crucial characteristics. Thus, machine learning has been widely adopted to predict demand for healthcare services. For instance, Light Gradient Boosting Machine was used for ambulance demand prediction in Singapore; Long Short-Term Memory, a method based on Recurrent Neural Networks, was utilized to predict home hospitalization demand of cancer palliative patients; and Extreme Gradient Boosting (XGboost) was applied in outpatient appointment demand prediction (30-32). During the Covid-19 pandemic, machine learning helped predict demand for ICUs and ventilators and the length of hospital stays (33).
Hence, to understand the demand for medical services in HCBS more comprehensively, Andersen's behavioral model of health services use could be employed to bridge initial feature selection and machine learning model fitting. Andersen's behavioral model of health services use was proposed in 1968 and subsequently modified several times. It is widely acknowledged and applied in research on health-related services, such as medical costs, healthcare utilization, and drug use. It is used to determine the factors that influence health service use at different levels, as well as variables that are more logical, diverse, and specific (26, 34-38). Andersen's behavioral model contains multiple domains of an individual: predisposing, enabling, need, and behavior. Each domain is associated with the outcome of demand for healthcare. Predisposing factors generally describe socio-demographic characteristics; enabling factors represent personal healthcare acquirement; need factors manifest self-cognition of a health condition; and behavior factors reflect lifestyle related to physical, mental, and social health (39).
As medical services in HCBS had the highest demand (21) and a significant positive influence on health and chronic diseases (40, 41), this study aimed to identify the best model to predict demand for medical services in HCBS among older adults in China in 2018 and explore the most critical characteristics of older adults requiring the services. We hope that the findings of this study will help increase efficiency in matching the demand and supply of medical services in HCBS, considering the characteristics of older adults, and thus contribute to healthy aging.
Respondents in CLHLS were sampled randomly from households in half of the counties and cities across 23 provinces in mainland China. Instruments used for data collection were international questionnaires, interviews, basic physical capacity tests, and physical examinations. Former researchers demonstrated that the details of sample design and data quality were excellent (42). After excluding 3,933 participants who were younger than 65 years and/or lacked information about the home and community-based medical services, 15,312 participants were included in the final data analysis.
. . Outcome variable: Demand for medical services in HCBS
Demand for medical services of HCBS was evaluated using two questions: "Do you expect your community to provide home visit services?" and "Do you expect your community to provide healthcare education services?" The expectation of one or more medical services was considered as a demand for HCBS. In case of no services expected, it was considered as no demand for medical services in HCBS.
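The outcome rule above can be sketched in Python as follows. This is a minimal illustration only: the function name `has_demand` and the boolean encoding of the two survey answers are assumptions made here, not the CLHLS variable coding.

```python
def has_demand(expects_home_visit: bool, expects_health_education: bool) -> int:
    """Return 1 if at least one medical service is expected, else 0."""
    return int(expects_home_visit or expects_health_education)


# Hypothetical respondents: (home visit, healthcare education)
answers = [(True, False), (False, False), (True, True), (False, True)]
labels = [has_demand(visit, education) for visit, education in answers]
print(labels)  # [1, 0, 1, 1]
```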

. . Predictors and feature selection
We included a broad range of candidate predictors. Based on Andersen's behavioral model, the predictors were divided into predisposing, enabling, need, and behavior factors (34,35). This model was proposed in 1968 and subsequently modified several times. The model is widely acknowledged and applied in the field of health-related services, such as medical costs, self-medication, and drug use, to determine influencing factors of health service use (36, 43).
Predisposing factors included demographic characteristics that may affect requirements for medical services. These factors included gender (male or female), age (65-79 years or ≥80 years), education level (literate or illiterate), marital status (married or unmarried), and residence (rural, town, or urban). Enabling factors included individual characteristics that may affect requirements for medical services in HCBS, such as self-rated income level (low or high), pension (yes or no), social insurance (yes or no), and living conditions (living with family, living alone, or living in a care institution). Need factors included individual health status, such as chronic diseases (yes or no), activities of daily living (ADL) (good or bad), cognitive function (good or bad), and self-rated health (SRH) (good, fair, or poor). Behavioral factors included daily actions and habits that could affect an individual's physiological, mental, and social health, such as smoking (yes or no), alcohol consumption (yes or no), exercising (yes or no), and socializing (yes or no).
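One way to organize the four factor combinations is as cumulative blocks, where each model adds one domain of Andersen's behavioral model to the previous one. The cumulative ordering, the assignment of the labels Model I to Model IV to these sets, and the short variable identifiers are assumptions made for illustration; the text states only that four combinations were compared and that Model IV contains all four domains.

```python
from itertools import accumulate

# Predictor names follow the text; the short identifiers are hypothetical.
DOMAINS = {
    "predisposing": ["gender", "age", "education", "marital_status", "residence"],
    "enabling": ["income", "pension", "social_insurance", "living_conditions"],
    "need": ["chronic_disease", "adl", "cognitive_function", "self_rated_health"],
    "behavior": ["smoking", "alcohol", "exercise", "socializing"],
}

# Cumulative feature sets: each model adds one domain to the previous model.
feature_sets = list(accumulate(DOMAINS.values(), lambda acc, block: acc + block))
models = dict(zip(["Model I", "Model II", "Model III", "Model IV"], feature_sets))

for name, features in models.items():
    print(name, len(features))
```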

. . Statistical analysis
Statistical analyses were performed using the Scikit-Learn package (version 1.1.2) in Python (version 3.9) (44). Scikit-Learn is an open-source machine learning library; it was used to fit the models to the data, which were randomly split into independent training, test, and validation sets at a ratio of 6:2:2.
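The 6:2:2 split can be illustrated with a short, library-free sketch. The function name `split_60_20_20` and the fixed seed are assumptions; the text does not say whether the split was stratified by outcome, so this sketch uses a plain random shuffle. In Scikit-Learn, two successive calls to `train_test_split` would typically serve the same purpose.

```python
import random


def split_60_20_20(n_samples: int, seed: int = 0):
    """Shuffle sample indices and split them into 60% / 20% / 20%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 6 // 10
    n_test = n_samples * 2 // 10
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remaining ~20%
    return train, test, validation


train, test, validation = split_60_20_20(15312)
print(len(train), len(test), len(validation))  # 9187 3062 3063
```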

. . . Processing of missing values
To minimize the chance of bias owing to imputation, variables with more than 20% of information missing were abandoned to acquire reasonable performance. The remaining variables were imputed using the "MICE" package in R studio 4.1.2, applying the "missForest" multivariate iterative random forest ("RF" method) imputation algorithm with five iterations and 100 estimators to obtain the datasets with the least variation compared to the original one.
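The variable-screening step that precedes imputation can be sketched as below. This covers only the 20% missingness threshold, not the random forest imputation itself; the helper names, the use of `None` as the missing-value marker, and the toy data are assumptions for illustration.

```python
def missing_fraction(values):
    """Fraction of entries that are None (treated here as missing)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_variables(table, threshold=0.20):
    """Keep only variables whose missing fraction does not exceed the threshold."""
    return {name: col for name, col in table.items()
            if threshold >= missing_fraction(col)}


# Hypothetical toy data: 'pension' is 40% missing and is therefore dropped.
table = {
    "exercise": [1, 0, None, 1, 1],
    "pension": [None, 1, None, 0, 1],
    "gender": [0, 1, 1, 0, 1],
}
kept = drop_sparse_variables(table)
print(sorted(kept))  # ['exercise', 'gender']
```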

. . . Synthetic minority oversampling technique
A lack of demand for HCBS medical services was a low-probability response, resulting in an imbalanced dataset, i.e., older adults not requiring medical services in HCBS were less prevalent than the others. The imbalanced data were a challenge for machine learning, as the sample size of older adults without demand was small. Furthermore, a strong bias toward the majority class is evident when evaluating a classification model trained on such data, leading to sub-optimal performance. To resolve the issue, we applied the Synthetic Minority Oversampling Technique (SMOTE), a statistical technique proposed by Chawla et al. (45). SMOTE generates synthetic samples from the existing minority class, thus expanding the number of minority samples in the datasets (45). The SMOTE algorithm has been widely applied to process imbalanced data in medical research and generally yields reasonable results with machine learning (46-48).
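The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched in plain Python. This is a simplified illustration rather than the implementation used in the study: the function name, the default `k`, the seed, and the continuous two-dimensional toy data are all assumptions. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally be used.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D minority class with four points
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by the minority class.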

. . . Machine learning methods
We applied five machine learning methods, including single models and ensemble models: logistic regression (LR), LR with LASSO regularization, support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGboost). The outcome variable in this study was binary, that is, whether or not older adults in China needed medical services in HCBS, and all five selected models are widely applied in binary outcome prediction with good performance (46, 49, 50). We compared their ability to predict demand for medical services in HCBS.

. . . . Logistic regression
Logistic regression (LR) is a kind of generalized linear model. The model assumes that the outputs conform to the Bernoulli distribution with parameter p, where p is the probability of a positive result (in our case, the probability of demand for medical services in HCBS among older Chinese adults). Moreover, Logistic regression does not impose rigorous requirements on the number of features and samples, and it can be applied in different populations (51). The parameters for Logistic regression used in this study are the defaults in the Scikit-Learn package.
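In standard notation, the model described above can be written as follows. The symbols $x$ (predictor vector), $\beta_0$ (intercept), and $\beta$ (coefficient vector) are introduced here for illustration and do not appear in the original text.

```latex
y \mid x \sim \mathrm{Bernoulli}(p), \qquad
p = P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta^{\top} x)\right)}, \qquad
\log\frac{p}{1 - p} = \beta_0 + \beta^{\top} x
```

Under this form, $\exp(\beta_j)$ is the odds ratio associated with a one-unit change in the $j$-th predictor, holding the others fixed.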
. . . . LR with LASSO regularization
LASSO regression is a member of the generalized linear model family. It is an approach that conducts variable selection and regularization while fitting the regression model. By setting the parameter α to penalize the original linear model, LASSO regularization deals with highly correlated variables to minimize the possibility of over-fitting; this automatically drops unnecessary covariates and preserves the most critical variables. Several studies have demonstrated that LASSO regression has many desirable properties that can enhance the LR model's performance when more covariates are included, as well as its ability to predict outcomes in other populations. In this research, we selected the parameter (α = 0.01) to penalize large coefficients, which resulted in the maximum correct classification rate and the best model performance (52, 53).
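A common way to write the L1-penalized logistic regression objective is shown below. The scaling of the penalty relative to the log-likelihood differs between software packages, so this form is an assumption about the convention rather than the authors' exact parameterization; $n$, $d$, $x_i$, $y_i$, and $p_i$ are notation introduced for illustration.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr]
+ \alpha \sum_{j=1}^{d} \lvert \beta_j \rvert,
\qquad p_i = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta^{\top} x_i)\right)}
```

The absolute-value penalty is what allows some coefficients to be shrunk exactly to zero, which is how unnecessary covariates are dropped.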
. . . . Support vector machine
Support Vector Machine (SVM) is a supervised classification algorithm based on statistical learning theory. The working principle of SVM is to create a decision boundary, defined by a hyperplane, that separates the two categories from each other as accurately as possible. There are four widely adopted kernel functions in SVM: linear, sigmoid, radial basis function (RBF), and polynomial. The RBF kernel was applied in this study to construct the hyperplane, given the number of features and total samples (54-56).
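For reference, the RBF kernel and the resulting decision function take the following standard form. The symbols $\gamma$ (kernel width), $c_i$ (learned coefficients of the support vectors), and $b$ (offset) are introduced here for illustration; the text does not report the values used.

```latex
K(x, x') = \exp\!\left(-\gamma \,\lVert x - x' \rVert^{2}\right), \quad \gamma > 0,
\qquad
f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n} c_i \, y_i \, K(x_i, x) + b\right)
```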

. . . . Random forest
Random Forest (RF) is a typical ensemble algorithm that uses a series of decision trees as its basic units, combined through the Bagging method. Each tree is trained on samples drawn randomly with replacement from the original dataset, together with a random selection of features, so that the number of training samples is the same for each tree. Due to these two features, the set of constructed decision trees contains abundant information for classification. To obtain the final result, every decision tree contributes to the final decision. Based on majority voting across all decision trees, each sample is classified into one of the two classes. We adopted 1,000 estimators, with defaults for the other parameters, to assess the model and explore the features of older adults with and without demand for medical services in HCBS (57, 58).
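The two mechanisms described above, bootstrap sampling and majority voting, can be shown in isolation with a small sketch. It does not train any trees; the row indices and the per-tree votes are hypothetical values chosen for illustration, and the helper names are assumptions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as in Bagging."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                 # hypothetical training row indices
sample = bootstrap_sample(rows, rng)   # same size as the original data
tree_votes = [1, 0, 1, 1, 0]           # hypothetical predictions of five trees
print(len(sample), majority_vote(tree_votes))  # 10 1
```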

. . . . Extreme gradient boosting (XGboost)
The XGboost classification algorithm is an ensemble algorithm of decision trees that adopts the boosting method. It is an enhanced Gradient Boosting algorithm that reduces the probability of over-fitting by regularizing the loss function and improves accuracy by approximating the real loss more closely at each gradient step. In addition, XGboost can directly handle encoded categorical variables. We set 1,000 decision trees, with other parameters as defaults, to predict demand for HCBS medical services and explore the importance of individual features (59, 60).
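The regularized objective referred to above is usually written as follows. The notation is introduced here for illustration: $l$ is the loss, $f_k$ is the $k$-th tree, $K$ is the number of trees (1,000 in this study), $T$ is the number of leaves in a tree, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are regularization parameters whose values the text leaves at their defaults.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \,\lVert w \rVert^{2}
```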

. . Model assessment
To assess the outcomes of each machine learning model, we used the confusion matrix and the performance metrics derived from it. True positives (TP) and true negatives (TN) indicated older adults who were correctly identified as with and without the demand for HCBS healthcare, respectively; false positives (FP) and false negatives (FN) indicated older adults who were inaccurately identified as with and without the demand for HCBS healthcare, respectively.
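The metrics named later in the Discussion (sensitivity, specificity, and accuracy) follow from these four counts by their standard definitions, sketched below. The counts are hypothetical and are not taken from the study's tables; which metrics the authors tabulated is inferred from the Discussion.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of true demand correctly found
        "specificity": tn / (tn + fp),   # share of true non-demand correctly found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts, not taken from the study's tables
metrics = classification_metrics(tp=80, tn=40, fp=10, fn=20)
print(metrics)
```

With a majority class near 86%, accuracy and sensitivity can look high even when specificity is poor, which is why specificity is reported separately.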

. Results
As shown in Table 1, 15,312 participants were included in this study, of whom 13,244 older adults demanded medical services in HCBS; thus, the demand rate was 86.48%. We also analyzed crude and adjusted odds ratios (ORs) for older adults who demanded medical services in HCBS using univariable and multivariable binary logistic regression. The analysis demonstrates that illiterate older adults had a higher possibility (adjusted OR = 1.21; 95% CI: 1.07-1.36) of requiring medical services in HCBS. Compared to urban older adults, older adults living in town (adjusted OR = 1.95; 95% CI: 1.70-2.20) and rural (adjusted OR = 1.92; 95% CI: 1.68-2.16) areas had higher demand for the service. Among enabling factors, older adults not having social insurance (adjusted OR = 1.20; 95% CI: 1.09-1.32) needed more medical services provided by HCBS. Moreover, fair self-rated health status (adjusted OR = 1.18; 95% CI: 1.06-1.31) increased the possibility of demand for services among older adults. The results also indicate that the regular exercising group (adjusted OR = 1.26; 95% CI: 1.13-1.40) and older adults who disliked socializing (adjusted OR = 0.85; 95% CI: 0.73-0.99) had lower demand for medical services in HCBS.
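For readers unfamiliar with how a crude odds ratio and its confidence interval are obtained, the sketch below computes them from a 2x2 table using the standard Wald interval on the log scale. The counts are hypothetical and not taken from Table 1, and the function name is an assumption. The adjusted ORs reported above come from a multivariable logistic regression, which this sketch does not reproduce.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald 95% CI from a 2x2 table.

    a: exposed with demand      b: exposed without demand
    c: unexposed with demand    d: unexposed without demand
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from Table 1
or_, lo, hi = odds_ratio_ci(a=900, b=100, c=800, d=200)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```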
The confusion matrices and the performance metrics shown in Table 2 [...] were over-fitted in the RF and XGboost. Figure 1A displays the ROCs of Model IV fitted by RF, whose AUC did not show a significant difference between the test set and the validation set. Figure 1B displays the ROCs produced by XGboost, which also produced robust results in the validation set. Both models, fitted with all four factors of Andersen's behavioral model as presented in Table 3, produced steady results in predicting the demand for medical services in HCBS compared to the test set of Model IV in Table 2. Figure 2 shows the importance of the predictors in the RF and XGboost. In the RF method, SRH, exercise, ADL, age, education, and gender were the most important predictors of the demand for medical services in HCBS. Variable importance produced by XGboost demonstrated that SRH, social insurance, education, pension, gender, and exercise were the most critical features.
[Figure: The most important features of the older adults who demanded medical services provided by HCBS in CLHLS.]
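The AUC values summarizing the ROC curves above can be understood as the probability that a randomly chosen older adult with demand receives a higher predicted score than a randomly chosen one without demand. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical, and this is not the routine the authors used.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for six respondents
labels = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.7]
print(auc(labels, scores))  # 0.875
```

Comparing this value between the test set and the validation set is one simple way to check the robustness described above.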

. Discussion
To the best of our knowledge, this is the first research to predict the demand for medical services in HCBS among older adults in China using nationally representative data, CLHLS 2018, and including demographic, social, economic, health, and other parameters.
Although the demand proportion for healthcare services was relatively high worldwide (61,62), our study revealed that it was higher in China. Along with the growing life expectancy, the average age continues to increase in China (18). As people age, their need for medical services increases (24,63). Consequently, the demand for medical services provided by HCBS was high from 2008 to 2018, above 80%, with an upward trend. Moreover, with the change in the current family structure and fast-paced social life, the traditional family-based care modes have lost significance in promoting life satisfaction among older adults (64,65). Hence, more empty-nest older adults who lived alone failed to get timely treatment (64). Additionally, a large number of older adults suffered from chronic diseases, such as hypertension, diabetes, and respiratory diseases that required daily medical monitoring to ensure older adults remain in normal living conditions (66).
Some studies successfully adopted traditional regression methods (24); however, deficiencies in traditional methods, which require absolute independence among the variables, could lead to information loss during variable selection. Moreover, demand for medical services provided by HCBS was highly imbalanced, resulting in higher sensitivity and accuracy but lower specificity. Such a model would be impractical to use, as only ∼15% of the older adults did not need medical services in HCBS. As higher specificity was necessary to predict the group without need, utilizing SMOTE solved this issue; specificity was higher (83.15% in RF and 82.84% in XGboost for Model IV). Applying SMOTE resulted in better-fitting results and produced robust data without missing samples, thus creating a more practical model to predict older adults with and without need.
Machine learning models can include variables with fewer constraints, enabling the models to handle high dimensionality and correlated predictors. Thus, they are a widely acknowledged and adopted method for exploring influencing factors of health-related services. HCBS is an integrated care service, covering the multilevel and diversified demands of older adults; therefore, by using the four factors in Andersen's behavioral model, it was possible to explore the critical features on a reasonable theoretical basis. The AUC and accuracy of RF and XGboost increased sharply after including need factors. When all four factors were included in the machine learning models, the AUC of all five models was above 0.60, and RF and XGboost showed good model fit. The AUC of RF was beyond 0.75, demonstrating the feasibility of predicting the demands of older adults for medical services in HCBS based on Andersen's behavioral model and machine learning methods. With high specificity, the model could filter out the people who were more likely to have no demand for medical services in HCBS temporarily. This would help decision-makers provide older adults in urgent demand with targeted care in situations with limited resources. Regarding robustness, the performance on the validation set showed that these two models were not over-fitted.
Using Andersen's behavioral model, combined with Logistic regression and estimating the contribution of each variable in machine learning models, we further confirmed that self-rated health was the most significant feature to predict if older adults needed medical services in HCBS. The present research illustrated that health conditions had a direct influence on medical services in HCBS, which confirmed the results that SRH had the highest importance in predicting if older adults had demand (24). Moreover, the aged population with good health had a stronger demand for medical services provided by HCBS (67,68). Previous research demonstrated that older adults in bad health went to the hospital and looked for more exhaustive medical services (69) whereas older adults with good health might not have urgent demand. Furthermore, there was strong evidence that confirmed chronic disease was a significant risk factor for poor SRH rate. These results could enable the community to provide medical services preferentially (70,71).
Furthermore, exercise and education played important roles in demand. Illiterate people aged >65 years had lower health literacy levels (72, 73). Therefore, they may require healthcare education services more urgently (74). Participants who rarely exercised were more likely to gain weight and have worse health status. Appropriate exercise could meet the requirement of the bio-psycho-social medical model by facilitating metabolism in older adults, providing a sense of happiness, and offering the chance to meet friends who share the same hobby (75, 76). Therefore, older adults who do not exercise may need medical services in HCBS more than those who exercise regularly (77).
These findings indicate that the characteristics of older adults should be considered to narrow the gap between supply and demand. Communities could (a) make efforts to focus on older adults with good health, (b) provide health education on conditions like hypertension, diabetes, and stroke, to promote health literacy in the neighborhood, and (c) propose targeted measures to encourage older adults to exercise, based on their abilities, and offer periodical home medical visits to monitor their health condition.
Andersen's behavioral model and machine learning could help managers and governments construct a complete and unified demand assessment system, which could also be extrapolated to other types of demands. This would enable HCBS to narrow the supply-demand gap and improve management efficiency and cost-effectiveness. Ultimately, this would promote healthy aging by providing more effective services.

. Limitation
This study has some limitations. First, we only adopted data from the 2018 CLHLS to predict demand for medical services provided by HCBS; thus, this cross-sectional data could not be used to explore causality between demand and predictors. Second, the CLHLS provided nationally representative data, but previous research indicated that the supply situation and intensity of HCBS in China vary significantly temporally and spatially. This regional variance may increase the supply and demand mismatch and affect the information for the use of HCBS among older adults. In addition, including all predictors as factor variables could lead to information loss in estimating the contribution of individual variables. Furthermore, this study included home medical visits and healthcare education as medical services. As interactions between these two services are possible, only extensive characteristic ranges could be determined to identify demand. Finally, as HCBS includes four types of services and this study addressed only medical services, further research on demand prediction for the other services is required to construct an assessment system.

. Conclusion
This study adopted machine learning to predict the demand for medical services in HCBS using the 2018 CLHLS data based on Andersen's behavioral model. Andersen's behavioral model combined with machine learning successfully constructed a model with reasonable predictors and captured critical characteristics of older adults who may have higher demand. This demand prediction method could be valuable for communities and decision-makers in arranging limited primary medical resources to promote healthy aging. Future empirical research should examine the models and conduct a longitudinal study to explore the causation between demand and individual characteristics.

Ethics statement
The studies involving human participants were reviewed and approved by the Research Ethics Committees of Duke University and Peking University (IRB00001052-13074). The patients/participants provided their written informed consent to participate in this study.

Author contributions
CC, YH, and TX conceived and designed the study. YH and TX participated in acquisition of the data and wrote the original draft. YH and CC contributed to data analysis. YH took charge of the submission. CC, XZ, YH, TX, QY, CP, LZ, and HC substantively revised the manuscript. All authors have read and approved the final manuscript.