CORRECTION article

Front. Artif. Intell.

Sec. Medicine and Public Health

Volume 8 - 2025 | doi: 10.3389/frai.2025.1659362

Correction: Predicting the Risk of Depression in Older Adults with Disability Using Machine Learning: An Analysis Based on CHARLS Data

Provisionally accepted
  • 1Shanxi University of Finance and Economics, Taiyuan, China
  • 2College of Public Management(Law), Xinjiang Agricultural University, Urumqi, Xinjiang, China

The final, formatted version of the article will be published soon.

1Introduction The comorbidity of disability and depression among older adults is a growing concern. Disability refers to a state of limited activities of daily living due to physical or mental impairments. At the same time, depression is a neuropsychiatric condition principally manifested through sustained affective dysregulation, significantly compromising physiological functioning and social adaptability (1). A well-established bidirectional association exists between disability and depression (2,3). Disability contributes to depression through loss of social roles and restricted mobility, whereas depression exacerbates functional decline by reducing rehabilitation adherence and impairing immune function. Epidemiological studies indicate that the global prevalence of major depressive disorder in older adults is approximately 13.3% (4), with disabled elders exhibiting significantly higher risks than their unimpaired counterparts (5). In China, the prevalence of geriatric depression reaches 34.1% (6), with rural areas demonstrating elevated vulnerability due to limited healthcare access and weaker familial support systems. This vicious cycle between disability and depression not only accelerates individual functional deterioration but also imposes substantial healthcare burdens and societal costs (7,8). Previous studies have predominantly employed cross-sectional designs and conventional statistical approaches (e.g., logistic regression, fixed-effects models) to identify risk factors. Regarding risk association validation, Mu et al. (9) demonstrated through binary logistic regression that individuals with disability exhibit significantly elevated risks of depressive symptoms. Using multivariate logistic regression, Yan et al. (10) further revealed urban-rural differential effects in the disability-depression association. In terms of disease trajectory research, Tian et al. 
(11)found that individuals with disability were more likely to follow trajectories of worsening depressive symptoms. Musliner et al. (12) associated prevalence rates, Çağan and Ünsal (13)reported a 57.8% depression rate among disabled individuals, while McGillivray and McCabe (14) documented a 39.1% depression prevalence among those with mild-to-moderate intellectual disabilities. Individual emotional states, life satisfaction, self-rated health, and social support systems have been systematically validated as critical predictors (15,16). However, traditional linear models demonstrate limited capacity in analyzing high-dimensional nonlinear relationships, and their static data frameworks fail to capture the temporal cumulative effects of risk factors. In summary, as shown in Table 1, existing studies exhibit three major limitations: (1) Design dimension: Prior studies predominantly rely on cross-sectional data, failing to capture the temporal cumulative effects of risk factors (e.g., the progressive impact of disability deterioration on depression). (2) Methodological dimension: Although conventional linear models (e.g., logistic regression) can validate risk associations, they struggle to handle high-dimensional nonlinear relationships. In contrast, ML algorithms significantly enhance predictive performance by extracting feature interactions and identifying temporal patterns. (3) Feature dimension: Existing research excessively focuses on physiological indicators (e.g., disease burden, functional impairment) while neglecting the contributions of subjective cognition (e.g. life satisfaction) and health behaviors (e.g., sleep). This study advances beyond conventional paradigms by integrating a longitudinal design, ML approaches, and a multidimensional feature structure to address these limitations. 
Specifically, we utilize multi-wave longitudinal data from CHARLS (2011–2020) and incorporate a temporal external validation framework (using an independent 2018–2020 cohort) to track the evolving trajectories of disability and depression dynamically. We systematically compare ten ML algorithms and introduce the SHapley Additive exPlanations (SHAP) interpretability framework to balance predictive accuracy with mechanistic insights. Furthermore, we construct a multidimensional feature matrix and employ a three-stage serial consensus feature selection (LASSO, Elastic Net, and Boruta), demonstrating that subjective perceptions (SHAP value: life satisfaction = 0.339) and health behaviors (sleep time = 0.344) exhibit stronger predictive power than conventional biomedical indicators. This integrative approach not only overcomes prior methodological constraints but also provides a robust, interpretable, and clinically actionable framework for depression risk stratification in older adults with disabilities.

Table 1. Paradigm comparison between this study and previous depression prediction studies.

Dimension | Previous mainstream research | The innovation point of this study
Design | Cross-sectional data | Multi-wave longitudinal data + temporal external verification
Method | Traditional statistical models | Comparison of 10 ML algorithms + SHAP interpretation
Feature | Physiological indicators | Integration of subjective perception/health behavior/physiological multidimensional characteristics

Although machine learning (ML) offers innovative solutions to address these limitations (17,18), recent studies based on CHARLS data still exhibit notable shortcomings: (1) Overreliance on a single algorithm for feature selection. 
For instance, Huang (2025) employed LASSO exclusively for feature selection in cardiovascular disease risk prediction among middle-aged and older adults (19), which may inadequately address the challenges of high-dimensional feature collinearity and stability; (2) Insufficient temporal external validation in the evaluation framework. As demonstrated by Chu et al. (2025), the disability prediction model for older adults lacked external validation (20), potentially compromising the generalizability of the findings. ML offers innovative solutions to overcome these methodological limitations (21,22). Compared to conventional approaches, ML demonstrates superior predictive performance through its capacity for feature interaction mining and temporal pattern recognition (23). Xin and Ren (24) developed random forest models to predict disability risk in urban and rural populations, achieving AUC values of 0.71 and 0.78, respectively. In a systematic comparison, Hong et al. (25) demonstrated that XGBoost models exhibited excellent performance in training sets (AUC = 0.76), while logistic regression models performed well in validation sets (AUC = 0.73). Handing et al. (26) employed random forest analysis and identified social isolation and self-rated health as significant determinants of depression. Despite these advancements, few studies in China have utilized longitudinal data and multiple ML algorithms to construct risk prediction models specifically for depressive disorders within geriatric populations with functional limitations (25). Furthermore, methodological weaknesses in validation frameworks among existing studies may compromise the reliability of findings (19,27). This study utilized multi-wave data (2011-2020) from the China Health and Retirement Longitudinal Study (CHARLS) to construct a predictive computational framework for geriatric populations with functional limitations. 
We integrated three waves of panel data (2011-2015) to construct a comprehensive feature matrix encompassing baseline characteristics, disease profiles, and disability progression patterns. A three-stage serial consensus approach was utilized to identify robust predictors combining elastic net regularization, least absolute shrinkage and selection operator (LASSO), and Boruta algorithms. We identified 21 robust predictors from 74 candidate variables. A temporal external validation strategy was implemented using an independent 2018-2020 cohort to systematically evaluate the cross-temporal stability of ten ML models, including HistGBM. The study aims to provide a high-accuracy tool for early identification of depression risk in disabled older populations and establish evidence-based priorities for psychosocial interventions. 2Methods 2.1 Data sources and research design This study utilized data from the CHARLS, which implements multistage stratified sampling with probability-proportional-to-size weighting based on demographic stratification. The survey encompasses 150 county-level units across 28 provincial administrative regions in China. The baseline survey was conducted in 2011, with follow-up waves completed in 2013, 2015, 2018, and 2020, collecting comprehensive data on demographic characteristics, socioeconomic status, health behaviors, and medical history. The study protocol obtained ethical certification from Peking University's Biomedical Ethics Committee (Approval ID: IRB00001052-11015). Sample selection followed three inclusion criteria: (1) age ≥60 years at baseline; (2) exclusion of individuals with pre-existing depression diagnosis or without basic or instrumental activities of daily living (BADL/IADL) disability at baseline; (3) completion of at least two consecutive follow-up assessments. 
Through integration of baseline (2011-2013, N=2440) and follow-up (2013-2015, N=2943) data, we constructed a longitudinal panel dataset containing 5383 observations (2011-2015). The dataset was partitioned using stratified random sampling, allocating samples in a 7:3 ratio to training (N=3768) and testing (N=1615) sets. An independent 2018-2020 follow-up cohort (N=3254) served as the external validation set. The study flowchart is presented in Figure 1. Figure 1. Research flowchart. 2.2Variable definitions and measurements Depression was assessed using the 10-item Center for Epidemiological Studies Depression Scale (CESD-10), a widely validated screening tool for depressive symptoms in older adults (28,29). The scale demonstrates good reliability and validity (30,31). In CHARLS, the CESD-10 evaluates the frequency of 10 symptoms experienced in the past week: feeling bothered, trouble concentrating, feeling depressed, difficulty in doing things, feeling hopeful about the future (reverse-coded), feeling fearful, restless sleep, feeling happy (reverse-coded), feeling lonely, feeling unable to carry on. Each item is scored from 0 to 3, yielding a total score ranging from 0 to 30. Following established international criteria, a score ≥10 was used to define clinically significant depressive symptoms. Disability in this study specifically refers to functional limitations in BADL or IADL, consistent with geriatric assessment standards (32,33). BADL evaluates six fundamental self-care functions (dressing, bathing, eating, bed transferring, toileting, and continence control). IADL assesses five complex daily living skills (housekeeping, cooking, shopping, medication management, and financial management), where each item was scored 1 for inability to perform independently or 0 otherwise, resulting in total score ranges of 0-6 for BADL and 0-5 for IADL. The BADL/IADL-based criteria were applied during data screening to select eligible participants with BADL≥2 or IADL≥2 scores. 
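The scoring rules above can be illustrated with a minimal Python sketch. The item keys are hypothetical labels chosen for readability (CHARLS uses its own variable names), and the functions simply restate the rules given in the text: two reverse-coded items, a total score of 0 to 30, a cutoff of 10, and eligibility at BADL≥2 or IADL≥2.

```python
# Hypothetical item labels for the ten CESD-10 symptoms (not CHARLS variable names).
CESD10_ITEMS = [
    "bothered", "trouble_concentrating", "depressed", "effortful",
    "hopeful", "fearful", "restless_sleep", "happy", "lonely", "cannot_carry_on",
]
# The two positively worded items are reverse-coded, as stated in the text.
REVERSE_CODED = {"hopeful", "happy"}


def cesd10_total(responses):
    """Sum the ten items (each scored 0-3), reverse-coding the positive items."""
    total = 0
    for item in CESD10_ITEMS:
        score = responses[item]
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{item}: score must be 0-3, got {score!r}")
        total += (3 - score) if item in REVERSE_CODED else score
    return total


def has_depressive_symptoms(responses, cutoff=10):
    """A total score at or above the cutoff defines depressive symptoms."""
    return cesd10_total(responses) >= cutoff


def is_eligible_disabled(badl_score, iadl_score):
    """Screening rule from the text: BADL >= 2 or IADL >= 2."""
    return badl_score >= 2 or iadl_score >= 2


# A respondent answering 1 on every item scores 8*1 + 2*(3-1) = 12.
all_ones = {item: 1 for item in CESD10_ITEMS}
print(cesd10_total(all_ones), has_depressive_symptoms(all_ones))
```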
This operational definition excludes sensory or cognitive disabilities alone, ensuring a homogeneous cohort with physical functional impairments. Covariates encompassed four domains: (1) demographic characteristics (gender, age, registered residence, educational level, marital status, number of children, and region); (2) health behaviors, including chronic disease history (14 conditions such as hypertension and diabetes), sensory functions (visual, auditory, and oral health assessed through assistive device use, functional scores, and tooth loss status), bodily pain, sleep time, physical activity intensity, social engagement, and lifestyle factors; (3) subjective perceptions comprising episodic memory, cognitive ability, life satisfaction, and self-rated health; (4) health care and insurance, incorporating health insurance type, healthcare utilization (frequency, duration, and costs of inpatient and outpatient services), as well as pension status. Detailed variable specifications and coding schemes are provided in Supplementary Table 1.

2.3 Data preprocessing and feature selection

Data preprocessing included five key steps: (1) Outlier handling: We applied the interquartile range (IQR) method to detect and truncate outliers for all continuous variables. Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR (where Q1 and Q3 are the 25th and 75th percentiles) were clipped to these lower/upper bounds. This mitigated the impact of extreme values on tree-based models while preserving data distribution integrity. (2) One-hot encoding: Categorical variables (e.g., gender, region) were converted into binary dummy variables to avoid imposing a spurious ordinal relationship on unordered categories. (3) Normalization: Continuous features were standardized using z-score normalization (mean=0, variance=1) to enhance convergence speed for linear models. 
(4) Missing value handling: Among 78 candidate variables, 4 variables (5.13%) with missing rates >30% were excluded. The remaining 74 variables had an average missing rate of 8.27% (range: 0.09%-25.18%). A total of 15,918 missing records (8.82% of total training observations) were iteratively imputed using the MissForest algorithm. (5) Class imbalance adjustment: We implemented the SMOTE-Tomek hybrid sampling technique, combining synthetic minority oversampling (SMOTE) with Tomek links undersampling. This approach enhanced the model's sensitivity in detecting depression risk and improved clinical utility by generating synthetic samples. Feature selection followed existing studies (19,34,35). A three-stage serial consensus approach was utilized to identify robust predictors, integrating the complementary strengths of distinct selection paradigms. (1) LASSO (L1 regularization): Efficiently screens out zero-importance features (73 variables) by imposing sparsity constraints. Although its linearity assumption may oversimplify relationships, it serves as a high-recall initial filter. (2) Elastic Net (L1+L2 regularization): Reduces multicollinearity-induced instability by retaining correlated but biologically plausible features. The α=0.5 setting balances sparsity and grouping effects, mitigating LASSO's limitation in correlated feature selection, and retained 42 stable features. (3) Boruta: The Elastic Net output variables were fed into the Boruta algorithm, which identified 28 significant predictors by comparing random forest importance scores with shadow variables (p<0.01). To ensure reproducibility, a random seed (random_state=42) was set for both the MissForest imputation and Boruta's shadow variable generation. The three feature selection outcomes were then consolidated through a strict intersection strategy. 
Specifically, we quantified the selection frequency of each variable across LASSO, elastic net, and Boruta algorithms, retaining only variables unanimously selected by all three methods (i.e., frequency ≥3). This approach yielded 21 high-confidence predictors, including age, self-rated health, arthritis, renal disease, stomach, asthma, memory-related disorders, observe the situation up close, hearing ability, self-reported pain in head, wrist, leg, toes, neck, sleep time, social activities, episodic memory, life satisfaction, medical insurance types, hospitalization expenses (total expenses), outpatient expenses (out of pocket expenses) (as shown in Figure 2). Compared to individual methods, this strategy significantly enhanced feature stability. Figure 2. Feature selection results using three methods (LASSO, Elastic Net, Boruta). 2.4Model construction and performance evaluation Ten ML algorithms were implemented, including logistic regression (LR), support vector machine (SVM), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), random forest (RF), bootstrap aggregating (Bagging), histogram-based gradient boosting machine (HistGBM), multilayer perceptron (MLP), and decision tree (DT). To optimize model generalizability, hyperparameter tuning was performed using grid search with 3-fold stratified cross-validation (specific hyperparameter configurations are provided in Supplementary Table 2). 
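The strict-intersection rule from Section 2.3 (retain a variable only if all three selectors choose it, i.e., frequency ≥3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name and the toy feature lists are hypothetical, and the real inputs would be the variable sets returned by LASSO, Elastic Net, and Boruta.

```python
from collections import Counter


def consensus_features(selections, min_votes=None):
    """Return features selected by at least `min_votes` methods.

    `selections` maps a method name to the features it retained.
    By default every method must agree (strict intersection).
    """
    if min_votes is None:
        min_votes = len(selections)
    # Count each feature once per method, even if a list repeats it.
    votes = Counter(
        feature for chosen in selections.values() for feature in set(chosen)
    )
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical toy output of the three selectors (not the study's results).
toy_selections = {
    "lasso": ["age", "sleep_time", "life_satisfaction", "pension"],
    "elastic_net": ["age", "sleep_time", "life_satisfaction"],
    "boruta": ["sleep_time", "life_satisfaction", "arthritis"],
}

print(consensus_features(toy_selections))
```

Lowering `min_votes` to 2 would give a majority-vote variant; the strict setting trades some recall for stability, which matches the rationale given in the text.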
Model performance was comprehensively evaluated through five metrics: (1) the area under the receiver operating characteristic curve (AUC), measuring the model's ability to discriminate between positive and negative cases (Formulas 1-3); (2) accuracy, representing the proportion of correctly classified samples (Formula 4); (3) precision, indicating the ratio of true positives among all predicted positives (Formula 5); (4) recall, reflecting the model's capacity to identify actual positive cases (Formula 6); and (5) the F1-score, the harmonic mean of precision and recall, which provides a balanced assessment of the model's performance on the positive class (Formula 7). The mathematical formulations were derived from established methodologies (36,37).

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{1} \]
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{2} \]
\[ \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \, d(\mathrm{FPR}) \tag{3} \]
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{5} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{6} \]
\[ \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7} \]

In these formulations, TPR denotes the true positive rate, FPR represents the false positive rate, TP indicates true positives, FP signifies false positives, TN refers to true negatives, and FN stands for false negatives.

2.5 Statistical analysis

Statistical analyses were performed using Stata 18.0 for data description and Python 3.13 for subsequent modeling. Continuous variables were summarized according to their distribution: normally distributed variables were presented as mean ± standard deviation, while non-normally distributed variables were summarized using median and interquartile range, with normality assessed via the Shapiro-Wilk test. Categorical data were expressed as frequencies and percentages. The statistical significance threshold was set at P<0.05 for all analyses.

3 Results

3.1 Baseline characteristics

The analysis included samples from the training sets (N=3768), testing sets (N=1615), and external validation sets (N=3254). Table 2 summarizes the baseline demographic characteristics, disease status, and depression status across the three cohorts. 
The median ages were 71, 71, and 72 years in the training, testing, and validation sets, respectively, with statistically significant inter-group differences (all P<0.05). In terms of gender, the proportion of females is similar in the training sets (57.94%), testing sets (57.28%), and external validation sets (60.97%). There was no statistically significant difference in gender distribution between groups (all P>0.05), indicating that the gender ratio remained balanced in the data partitioning. In terms of marital status, the married group accounts for 70.25%, 70.59%, and 69.61% of the three groups, respectively, which is much higher than the unmarried group. In terms of registered residence, the proportion of rural registered residence registration slightly decreased in training sets (79.91%), testing sets (79.57%), and external verification sets (78.83%), but all exceeded 78%. There was a significant difference in the distribution of registered residence among groups (P<0.05). The proportion of "3 or more children" in the three groups was 72.24%, 73.32%, and 66.04%, respectively, with a significant decrease in the proportion of external validation sets. There were significant differences in distribution between groups (all P<0.05). There was no significant difference in regional distribution between groups (all P>0.05). In terms of education level, the proportion of people who have not received formal education gradually decreased in the training sets (68.21%), testing sets (69.05%), and external validation sets (65.43%), while the proportion of high school and above education increased from 4.03% to 5.53%. There was a significant difference between the groups (all P<0.05). In terms of disease characteristics, there was no significant difference (P>0.05) in the prevalence of memory-related diseases and stroke diseases among the training sets, testing sets, and external validation sets. 
The prevalence of heart disease differed significantly in the training and testing sets (P<0.05), but not in the validation sets (P>0.05). The prevalence of arthritis remained stable among the three groups (56.22%-56.61%), with no significant difference between the groups (P>0.05). The proportion of participants with depression was higher in the validation sets (64.20%) than in the training sets (56.32%) and testing sets (56.35%), although the difference was not statistically significant (P>0.05). Figure 3 reveals the demographic differences in the prevalence of depression among disabled individuals. The gender distribution shows that the prevalence of depression in the female population (80.80%) is significantly higher than that in the male population (37.15%). Analysis of marital status shows that unmarried individuals have a higher risk of depression (61.18%) compared to married individuals (53.35%). In age stratification, the prevalence of depression in the group aged 80 and above reached 60.06%, which was higher than that in the 70-80 age group (52.49%) and the 60-70 age group (56.41%). The regional distribution shows that the prevalence in the western region (59.24%) and rural areas (57.29%) is significantly higher than that in the eastern region (52.50%) and urban areas (49.07%). Education level analysis shows that the illiterate population has the highest prevalence of depression (57.86%), and the relationship between educational attainment and depression prevalence is non-linear. The dimension of family support shows that the risk of depression in the childless group (75.18%) is significantly higher than that in the groups with children (53.07%-56.50%). Table 2. Baseline features of training, testing, and validation sets. 
Characteristics | Training sets (N=3768) | P-value | Testing sets (N=1615) | P-value | External validation sets (N=3254) | P-value

Demographic characteristics
Age (years) | 71 [65-77] | P < 0.05 | 71 [65-77] | P < 0.05 | 72 [66-79] | P < 0.05
Gender, n (%)
  Female | 2183 (57.94) | P > 0.05 | 925 (57.28) | P > 0.05 | 1984 (60.97) | P > 0.05
  Male | 1585 (42.06) | | 690 (42.72) | | 1270 (39.03) |
Marital status, n (%)
  Married | 2647 (70.25) | P > 0.05 | 1140 (70.59) | P < 0.05 | 2265 (69.61) | P > 0.05
  Unmarried | 1121 (29.75) | | 475 (29.41) | | 989 (30.39) |
Registered residence, n (%)
  Urban | 757 (20.09) | P < 0.05 | 330 (20.43) | P < 0.05 | 689 (21.17) | P < 0.05
  Rural | 3011 (79.91) | | 1285 (79.57) | | 2565 (78.83) |
Number of children, n (%)
  0 children | 111 (2.95) | P < 0.05 | 38 (2.35) | P < 0.05 | 51 (1.57) | P < 0.05
  1 child | 224 (5.94) | | 99 (6.13) | | 226 (6.94) |
  2 children | 711 (18.87) | | 294 (18.20) | | 828 (25.45) |
  3 children and above | 2722 (72.24) | | 1184 (73.32) | | 2149 (66.04) |
Region, n (%)
  Eastern | 1066 (28.29) | P > 0.05 | 450 (27.86) | P > 0.05 | 1007 (30.95) | P > 0.05
  Central | 1326 (35.19) | | 571 (35.36) | | 1131 (34.76) |
  Western | 1376 (36.52) | | 594 (36.78) | | 1116 (34.29) |
Educational level, n (%)
  Illiteracy | 2570 (68.21) | P < 0.05 | 1115 (69.05) | P < 0.05 | 2129 (65.43) | P < 0.05
  Elementary Schools | 740 (19.64) | | 308 (19.07) | | 603 (18.53) |
  Junior High Schools | 306 (8.12) | | 127 (7.86) | | 342 (10.51) |
  High school and above | 152 (4.03) | | 65 (4.02) | | 180 (5.53) |

Disease History
History of memory-related diseases, n (%)
  No | 3175 (84.26) | P < 0.05 | 1377 (85.26) | P < 0.05 | 2760 (84.82) | P < 0.05
  Yes | 593 (15.74) | | 238 (14.74) | | 494 (15.18) |
History of heart disease, n (%)
  No | 2667 (70.78) | P < 0.05 | 1159 (71.76) | P < 0.05 | 2102 (64.60) | P > 0.05
  Yes | 1101 (29.22) | | 456 (28.24) | | 1152 (35.40) |
History of stroke disease, n (%)
  No | 3245 (86.12) | P < 0.05 | 1370 (84.83) | P < 0.05 | 2591 (79.63) | P < 0.05
  Yes | 523 (13.88) | | 245 (15.17) | | 663 (20.37) |
History of arthritis disease, n (%)
  No | 1648 (43.74) | P > 0.05 | 707 (43.78) | P > 0.05 | 1412 (43.39) | P > 0.05
  Yes | 2120 (56.26) | | 908 (56.22) | | 1842 (56.61) |

Outcome measurements
Depressed, n (%)
  No | 1646 (43.68) | P > 0.05 | 705 (43.65) | P > 0.05 | 1165 (35.80) | P > 0.05
  Yes | 2122 (56.32) | | 910 (56.35) | | 2089 (64.20) |

Figure 3. The demographic differences. 
3.2 Model performance

This study systematically evaluated the performance of ten ML algorithms in predicting depression risk among older adults with disability across training, testing, and external validation sets (as shown in Table 3). In terms of accuracy, RF (0.741), LightGBM (0.728), and HistGBM (0.713) were among the best-performing models in the testing sets. While LR and DT exhibited relatively stable performance between training and testing sets, their overall accuracy was the lowest (LR: 0.667; DT: 0.633). For AUC metrics, RF (0.797) achieved the strongest discriminative capacity, followed closely by LightGBM (0.785), XGBoost (0.781), and HistGBM (0.779). XGBoost, LightGBM, and HistGBM showed superior generalizability, whereas DT (0.636) performed the poorest. Regarding the F1-score, RF (0.762) exhibited the optimal balance between precision and recall, with LightGBM (0.749), CatBoost (0.741), and HistGBM (0.735) maintaining stable performance in the testing sets. For precision, RF (0.791), LightGBM (0.778), XGBoost (0.767), and HistGBM (0.766) achieved the highest positive predictive values, outperforming DT (0.691). In recall analysis, RF (0.735), LightGBM (0.723), and HistGBM (0.707) were among the models best able to identify true positive cases, while DT (0.633) exhibited markedly higher missed-detection risks compared to ensemble methods. Through comprehensive evaluation of ten ML models, HistGBM was selected as the optimal model based on three key criteria: (1) superior performance on testing sets metrics (AUC=0.779, F1-score=0.735, accuracy=0.713), (2) excellent generalizability demonstrated by a minimal training-testing AUC gap (8.5%), and (3) consistent performance across validation sets. 
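As a quick consistency check on the reported metrics, the F1-score can be recomputed from the testing-set precision and recall listed in Table 3. The sketch below is illustrative only; the dictionary holds values copied from Table 3, and the function name is our own.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Testing-set (precision, recall, reported F1) copied from Table 3.
TABLE3_TEST = {
    "HistGBM": (0.766, 0.707, 0.735),
    "XGBoost": (0.767, 0.705, 0.735),
    "LightGBM": (0.778, 0.723, 0.749),
    "RF": (0.791, 0.735, 0.762),
}

for model, (precision, recall, reported_f1) in TABLE3_TEST.items():
    recomputed = f1_from_precision_recall(precision, recall)
    # Allow for rounding of the published three-decimal values.
    status = "consistent" if abs(recomputed - reported_f1) < 1e-3 else "check"
    print(f"{model}: recomputed F1 = {recomputed:.4f}, reported = {reported_f1} ({status})")
```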
HistGBM exhibited well-balanced predictive capabilities, showing above-average performance across all evaluation metrics without significant weaknesses in either precision (0.766) or recall (0.707), indicating robust discriminative power between positive and negative cases along with stable predictive performance. The model's exceptional generalizability was particularly noteworthy, with only a 10% difference in AUC between the testing sets and validation sets, as shown in Figure 4, significantly outperforming other models and demonstrating strong robustness across different data distributions. Although RF achieved the highest individual metrics on the specific testing set used in Table 3, the substantial performance degradation observed on the external validation set raised concerns about its real-world applicability and stability. HistGBM, while having marginally lower peak testing set scores than RF, offered the best overall package of strong predictive performance, minimal overfitting (small train-test gap), and exceptional stability across the independent validation set. This superior generalizability was the primary reason for selecting HistGBM as the optimal model for potential clinical application, where reliability across diverse data sources is paramount. XGBoost demonstrated strong performance on testing sets metrics (AUC=0.781, F1-score=0.735, accuracy=0.713), though it exhibited a relatively large training-testing AUC gap (10.1%) compared to LR, HistGBM, MLP, and CatBoost. However, its validation set's AUC showed a 10.7% difference from testing sets' performance (Figures 5-7), suggesting reasonable stability across data partitions and potential suitability for resource-constrained scenarios. 
RF achieved excellent testing set results (AUC=0.797, F1-score=0.762, accuracy=0.741), but displayed concerning generalization issues with substantial training-testing and testing-validation AUC differences, indicating potential overfitting to training data noise or specific patterns. LightGBM ranked second in testing set AUC (0.785), but its validation AUC dropped to 0.654, a testing-validation difference (13.1%) that is too large for the model to be considered stable. CatBoost performed comparably to the top models in testing set metrics (AUC=0.774, F1-score=0.741) with excellent categorical feature handling, though its relatively lower F1-score (0.741) and recall (0.720) suggested weaker minority class identification, limiting its utility for imbalanced datasets. Among the remaining models, SVM exhibited severe overfitting, while MLP and DT significantly underperformed ensemble methods in both AUC and F1-score.

Table 3. Performance of ten ML algorithms on training and testing sets.

Model | Accuracy (Train) | Accuracy (Test) | AUC (Train) | AUC (Test) | F1-score (Train) | F1-score (Test) | Precision (Train) | Precision (Test) | Recall (Train) | Recall (Test)
LR | 0.702 | 0.667 | 0.781 | 0.723 | 0.689 | 0.685 | 0.721 | 0.734 | 0.661 | 0.642
HistGBM | 0.782 | 0.713 | 0.864 | 0.779 | 0.778 | 0.735 | 0.793 | 0.766 | 0.763 | 0.707
MLP | 0.805 | 0.698 | 0.882 | 0.761 | 0.797 | 0.718 | 0.830 | 0.758 | 0.767 | 0.682
XGBoost | 0.801 | 0.713 | 0.882 | 0.781 | 0.797 | 0.735 | 0.813 | 0.767 | 0.782 | 0.705
Bagging | 0.772 | 0.709 | 0.852 | 0.770 | 0.765 | 0.731 | 0.788 | 0.764 | 0.744 | 0.700
DT | 0.651 | 0.633 | 0.653 | 0.636 | 0.648 | 0.661 | 0.654 | 0.691 | 0.642 | 0.633
LightGBM | 0.803 | 0.728 | 0.883 | 0.785 | 0.799 | 0.749 | 0.816 | 0.778 | 0.782 | 0.723
RF | 0.822 | 0.741 | 0.901 | 0.797 | 0.817 | 0.762 | 0.840 | 0.791 | 0.796 | 0.735
SVM | 0.818 | 0.708 | 0.899 | 0.768 | 0.812 | 0.727 | 0.838 | 0.766 | 0.787 | 0.692
CatBoost | 0.776 | 0.716 | 0.857 | 0.774 | 0.772 | 0.741 | 0.787 | 0.763 | 0.757 | 0.720

Note: The highest value in each column is highlighted in bold. Final model selection prioritized generalizability across training, testing, and external validation sets, as detailed in the text.

Figure 4. AUC comparison of training sets, testing sets, and validation sets. Figure 5. Training sets ROC curves. Figure 6. Testing sets ROC curves. Figure 7. External validation sets ROC curves.

3.3 Model explanation

The SHAP values quantify the absolute average impact of each feature on model predictions across all possible feature combinations, revealing their global importance. As shown in Figure 8, the SHAP analysis of the HistGBM model demonstrated significant variability in feature contributions. Sleep time (mean SHAP=0.344), life satisfaction (0.339), episodic memory (0.220), and self-rated health (0.197) emerged as the top four predictive features, indicating that health behaviors, subjective perceptions, and cognitive function were the core drivers of model predictions. The high contribution of sleep time likely reflects its well-established associations with chronic diseases, metabolic disorders, and cognitive decline. Life satisfaction and self-rated health, as subjective health indicators, capture the interplay between psychosocial factors and physiological health. Episodic memory directly influences prediction through cognitive and sensory pathways. Moderate contributions were observed for features such as stomach diseases, observing the situation up close, and memory-related disorders, whereas self-reported pain in the head, wrist, leg, toes, and neck and mental health conditions showed limited predictive importance, suggesting either weak signals or sparse data distributions. These results validate the model's multidimensional feature selection approach and provide actionable insights for intervention prioritization. Health management strategies targeting high-contribution features could enhance the model's real-world utility. Additionally, domain knowledge should guide the evaluation of low-contribution features to optimize the balance between model complexity and interpretability. 
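To make the notion of a feature's average contribution "across all possible feature combinations" concrete, the following sketch computes exact Shapley values for a small toy function by enumerating every feature ordering, and then averages absolute values to obtain a global importance score. It is a brute-force illustration under stated assumptions, not the SHAP library and not the explainer used in this study: the toy model, the zero baseline used to represent an "absent" feature, and all names are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature orderings.

    A feature that is "absent" takes its value from `baseline`.
    Cost grows as n!, so this is only practical for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]            # switch feature i on
            updated = predict(current)
            phi[i] += updated - previous  # marginal contribution in this ordering
            previous = updated
    return [value / factorial(n) for value in phi]


def mean_abs_shap(predict, samples, baseline):
    """Global importance: mean absolute Shapley value of each feature."""
    rows = [shapley_values(predict, x, baseline) for x in samples]
    return [sum(abs(r[j]) for r in rows) / len(rows) for j in range(len(baseline))]


def toy_model(z):
    """Hypothetical score with one interaction term (z[0] * z[2])."""
    return 2 * z[0] - z[1] + z[0] * z[2]


phi = shapley_values(toy_model, [1, 1, 1], [0, 0, 0])
print(phi)  # the contributions sum to toy_model(x) - toy_model(baseline)
```

The interaction term is split evenly between the two features involved, which is why feature 0 receives more than its main effect of 2. Library explainers for tree ensembles reach equivalent attributions with far more efficient algorithms.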
Figure 9 presents the SHAP value distributions, revealing the heterogeneous directional effects and magnitudes of various features on depression probability predictions among older adults with disability. The x-axis (SHAP value) indicates each feature's influence on model output, where positive values increase and negative values decrease predicted risk. The color gradient (red means high feature value, blue means low feature value) demonstrates that: (1) higher values of sleep time, life satisfaction, and self-rated health (red clusters with negative SHAP) were strongly protective against depression, consistent with established epidemiological mechanisms; (2) better episodic memory performance (blue with positive SHAP) correlated with reduced depression risk, potentially through preserved cognitive resilience; (3) stomach diseases (red with positive SHAP) elevated risk through chronic somatic burden and psychological stress pathways; and (4) bodily pain (head, wrist, leg, toes, neck; red with positive SHAP) increased depression vulnerability in this population. These findings highlight the central role of health behaviors and psychosocial factors in depression comorbidity risk while identifying specific physiological pain features as contributory predictors. Figure 8. SHAP feature importance. Figure 9. SHAP value distribution. 4Discussion This study identified significant associations between depressive risk among disabled older adults and demographic characteristics, health status, and social support factors. The external validation cohort's higher median age than training sets and elevated depression risk in advanced age align with existing literature (38,39), potentially mediated by cognitive decline and reduced social roles. Consistent with Girgus et al. (40), females demonstrated significantly higher risk than males, possibly due to gender-specific social expectations, somatic symptom expression patterns, and help-seeking behaviors (26). 
The elevated risk among unmarried individuals supports the marital support hypothesis, in which spousal emotional and economic support may serve as protective factors (41), corroborating Zhai et al. (42). Notably, the higher prevalence in rural western regions reflects China's geographic disparities in healthcare resource allocation, echoing Fan et al. (43) on primary mental health service accessibility. Higher education levels were protective, consistent with prior studies (44,45), likely through multiple pathways: enhanced cognitive capacity, improved socioeconomic resources, greater mental health awareness, and healthier behaviors. The elevated risk among childless individuals suggests that deficiencies in family support networks may exacerbate disability-related stress, which is particularly relevant in East Asian familial care traditions (46), although some studies report no direct mental health impact of childlessness (47,48).
This study systematically evaluated ten ML models (LR, SVM, XGBoost, LightGBM, CatBoost, RF, Bagging, HistGBM, MLP, DT) for predicting depressive risk among disabled older adults, demonstrating the superior performance of ensemble methods over traditional approaches. The HistGBM algorithm achieved optimal predictive accuracy, with an AUC of 0.779 on the testing sets, aligning with current trends in medical prediction research. While Busi and Stephen (49) similarly compared extreme gradient boosting methods for early kidney disease diagnosis, their study did not examine the generalization enhancement effects of histogram optimization. Lee et al. (50) likewise identified extreme gradient boosting as the top performer for chronic disease prediction (AUC≥0.80). HistGBM's minimal AUC divergence between validation and testing sets (10%) suggests that histogram binning effectively mitigates the overfitting caused by the high-dimensional sparse features characteristic of healthcare data.
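The AUC divergence between data splits discussed above can be illustrated with a small, self-contained sketch. It computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, then reports the gap between two splits. The text does not state whether the reported gaps are absolute or relative differences, so the helper returns both; the function names and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_gap(auc_reference, auc_other):
    """Return (absolute gap, gap relative to the reference AUC)."""
    absolute = auc_reference - auc_other
    return absolute, absolute / auc_reference


# Hypothetical predicted probabilities on a training split and a testing split.
train_labels = [1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
test_labels = [1, 1, 1, 0, 0, 0]
test_scores = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]

train_auc = roc_auc(train_labels, train_scores)
test_auc = roc_auc(test_labels, test_scores)
abs_gap, rel_gap = auc_gap(train_auc, test_auc)
print(f"train AUC={train_auc:.3f}, test AUC={test_auc:.3f}")
print(f"absolute gap={abs_gap:.3f}, relative gap={rel_gap:.1%}")
```

A smaller gap indicates that performance on unseen data stays closer to performance on the data used for fitting, which is the sense in which a model is described as generalizing well.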
Notably, while RF achieved the highest testing sets AUC (0.797), its validation performance showed significant degradation (ΔAUC=12.7%), contrasting sharply with its training sets performance (AUC=10.4%). This suggests that RF's majority voting mechanism may amplify localized features in training data when strong collinearity or noise exists in the feature space (51). In comparison, HistGBM maintained tighter training-testing consistency (8.5% AUC difference), outperforming both XGBoost (10.1%) and LightGBM (9.8%), indicating its superior suitability for handling elderly health data with measurement errors. Regarding class imbalance handling, XGBoost demonstrated marginally lower recall (0.705) than HistGBM (0.707), consistent with findings by Baba and Bunji (52). HistGBM's adaptive histogram partitioning mechanism balanced class weights while maintaining high precision (0.766), yielding a 5% F1-score improvement over LR (0.685). This enhancement likely stems from our feature engineering strategy, which explored disability-related psychosocial variables in depth. These findings provide new empirical evidence for model selection in medical ML applications.
The SHAP interpretability framework revealed multidimensional drivers of depressive risk prediction among disabled older adults. Sleep time emerged as the primary predictor (SHAP=0.344), with a considerably greater contribution than that reported by Song et al. (53) (SHAP=0.133). This enhanced predictive importance may reflect disability-associated sleep fragmentation, potentially activating inflammatory pathways and amplifying pathological effects. Notably, life satisfaction (SHAP=0.339) and self-rated health (SHAP=0.197) demonstrated greater predictive influence than conventional biomedical indicators, establishing psychosocial factors as pivotal determinants in depression comorbidity mechanisms among individuals with disability (54).
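The precision, recall, and F1 values quoted in the model comparison above can be cross-checked with a few lines of arithmetic. The sketch assumes that the HistGBM precision (0.766) and recall (0.707) were measured on the same evaluation set, that F1 is their harmonic mean, and that 0.685 is the F1-score of LR; none of these assumptions is confirmed beyond the figures given in the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as quoted in the text; pairing them on one evaluation set is an assumption.
histgbm_f1 = f1_score(precision=0.766, recall=0.707)
lr_f1 = 0.685

absolute_gain = histgbm_f1 - lr_f1      # difference in F1 points
relative_gain = absolute_gain / lr_f1   # gain relative to LR

print(f"HistGBM F1 = {histgbm_f1:.3f}")
print(f"absolute gain over LR = {absolute_gain:.3f}")
print(f"relative gain over LR = {relative_gain:.1%}")
```

Under these assumptions the HistGBM F1-score is about 0.735, roughly 0.05 above LR in absolute terms, which matches the reported 5% improvement if that figure is read as a difference in percentage points.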
Within cognitive domains, episodic memory (SHAP=0.220) showed a higher predictive contribution than observing the situation up close (SHAP=0.192). Conversely, the relatively low importance of somatic pain features suggests the need to reevaluate clinical assessment priorities for pain screening in this population.
This study advances prior work by integrating longitudinal design, ML techniques, and multidimensional feature engineering. First, compared to cross-sectional designs, our multi-wave feature construction quantitatively captures the cumulative effects of factors such as sleep disturbance (SHAP value = 0.344). Second, in contrast to conventional statistical models, the HistGBM algorithm enhances generalizability through histogram optimization (training-testing AUC gap: 8.5%). Third, the predictive contribution of subjective perception indicators (life satisfaction SHAP value = 0.339) surpasses that of traditional physiological measures, supporting our finding that psychosocial features dominate depression risk prediction. These methodological innovations collectively advance the field by providing a more robust, dynamic, and interpretable framework for risk stratification.
While this study provides valuable insights into depressive disorder risk stratification in functionally impaired geriatric populations, several limitations should be acknowledged. First, the reliance on CHARLS self-reported measures may lead to an underestimation of both disability severity and depressive symptoms. Second, the exclusion of biomarkers limits the model's ability to differentiate depression subtypes. Third, the absence of real-time dynamic health monitoring data potentially reduces the predictive value of temporal features. Future research should incorporate wearable device data and multi-omics approaches to develop dynamic prediction systems, complemented by cross-cohort validation to enhance generalizability.
Fourth, our disability definition focused exclusively on BADL/IADL limitations. While this is consistent with geriatric assessment standards, it may not capture populations with purely cognitive or sensory disabilities. However, this standardized approach minimized cohort heterogeneity, facilitating model training on uniformly defined functional impairments. Future studies should validate these findings in other disability subtypes.
5 Conclusion
This study constructed a clinically generalizable prediction model for depressive risk among disabled older adults by integrating longitudinal data from multiple CHARLS waves. Our three-stage serial consensus feature selection approach identified 21 robust predictors spanning physiological function, social support, and health behaviors, overcoming limitations of traditional linear modeling approaches. The HistGBM algorithm demonstrated optimal predictive stability through its histogram binning technique and adaptive learning mechanism. SHAP interpretability analysis revealed that health behavior (sleep time) and subjective perception indicators (life satisfaction, self-rated health) contributed significantly more to predictions than biomedical features, underscoring the central importance of psychosocial interventions in depression prevention for this population. The study identified significantly elevated depression risks among specific demographic subgroups with disability, including individuals residing in western rural regions, elderly females, those with limited educational attainment, and childless older adults. These findings highlight the urgent need for community-based mental health service networks and family support policies, and they provide an evidence base for preventing psychological disorders and implementing mental health interventions among the aging population with disability.

Keywords: disabled older adults, depression, risk prediction, machine learning, CHARLS, mental health, HistGBM, MLP

Received: 04 Jul 2025; Accepted: 07 Jul 2025.

Copyright: © 2025 Jin and Halili. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Ayitijiang Halili, College of Public Management(Law), Xinjiang Agricultural University, Urumqi, Xinjiang, China

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.