
ORIGINAL RESEARCH article

Front. Public Health, 22 August 2025

Sec. Public Health and Nutrition

Volume 13 - 2025 | https://doi.org/10.3389/fpubh.2025.1606751

A machine learning model for predicting obesity risk in patients with diabetes mellitus: analysis of NHANES 2007–2018

Wenqiang Wang1, Ruiqing Mo2, Xingyu Chen2, Sijie Yang1*
  • 1Department of Plastic and Reconstructive Surgery, The People’s Hospital of Guangxi Zhuang Autonomous Region & Research Center of Medical Sciences, Guangxi Academy of Medical Sciences, Nanning, Guangxi, China
  • 2Department of Bone and Joint Surgery, Guangxi Diabetic Foot Salvage Engineering Research Center, The First Affiliated Hospital of Guangxi Medical University, Nanning, China

Background: Obesity is a prevalent and clinically significant complication among individuals with diabetes mellitus (DM), contributing to increased cardiovascular risk, metabolic burden, and reduced quality of life. Despite its high prevalence, the risk factors for obesity within this population remain incompletely understood. With the growing availability of large-scale health datasets and advancements in machine learning, there is an opportunity to improve risk stratification. This study aimed to identify key predictors of obesity and develop a machine learning-based predictive model for patients with type 2 diabetes mellitus (T2DM) using data from the National Health and Nutrition Examination Survey (NHANES).

Methods: Data from adults with diabetes were extracted from the NHANES 2007–2018 cycles. Participants were categorized into obese and non-obese groups based on BMI. Least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation was used to select relevant features. Subsequently, nine machine learning algorithms—including logistic regression, random forest (RF), radial support vector machine (RSVM), k-nearest neighbors (KNN), XGBoost, LightGBM, decision tree (DT), elastic net regression (ENet), and multilayer perceptron (MLP)—were employed to construct predictive models. Model performance was evaluated based on area under the ROC curve (AUC), calibration curves, Brier score, and decision curve analysis (DCA). The best-performing model was visualized using a nomogram to enhance clinical applicability.

Results: A total of 3,794 participants with type 2 diabetes were included in the analysis, of whom 57.0% were classified as obese. LASSO regression identified 19 key variables associated with obesity. Among the nine machine learning models evaluated, the logistic regression model demonstrated the best overall performance, with the lowest Brier score. It also showed good discrimination (AUC = 0.751 in the training set and 0.781 in the test set), favorable calibration, and consistent clinical utility based on decision curve analysis (DCA). A nomogram was constructed based on the logistic regression model to facilitate individualized risk prediction, with total points corresponding to predicted probabilities of obesity.

Conclusion: Obesity remains highly prevalent among individuals with type 2 diabetes. Our findings highlight key clinical features associated with obesity risk and provide a practical tool to aid in early identification and individualized management of high-risk patients.

Background

Diabetes mellitus (DM) has become one of the most prevalent chronic metabolic diseases worldwide, with a significant increase in prevalence over the past few decades, particularly in low- and middle-income countries (1). According to the International Diabetes Federation (IDF), approximately 537 million adults globally were living with diabetes in 2021, and this number is projected to rise to 783 million by 2045 (2). Diabetes not only severely impacts patients’ quality of life but also imposes a substantial socioeconomic burden (3). Global healthcare costs related to diabetes and its associated complications amount to several hundred billion dollars annually, and this trend is expected to continue growing (4).

The occurrence of obesity in the DM patient group is significantly higher than in the non-diabetic population, which not only increases the risk of cardiovascular diseases, metabolic syndrome, and kidney diseases but also exacerbates the healthcare and social burden (5). However, even within the diabetic patient population, the incidence and severity of obesity show considerable variation (6). Research has shown that factors such as age, sex, race, lifestyle, dietary habits, and genetic predispositions can all influence the occurrence of obesity, and these complex factors make predicting the risk of obesity in diabetic patients more challenging (5).

In recent years, the rapid advancement of machine learning (ML) technology has provided new opportunities to address this issue, as sophisticated predictive models can effectively identify DM patients at high risk of obesity (7–9). The National Health and Nutrition Examination Survey (NHANES) offers high-quality, extensive clinical data, making it particularly well-suited for developing and validating predictive models. While previous studies have employed ML techniques to investigate the prediction of diabetes onset risk, treatment responses, and complications (such as cardiovascular diseases), there is relatively little research specifically focused on predicting the risk of obesity in diagnosed DM patients (10, 11).

The goal of this study is to utilize the NHANES database to develop a robust machine learning predictive model capable of distinguishing between DM patients at risk for obesity and those not at risk. Through a systematic analysis of variables, including demographic characteristics, clinical indicators, nutritional status, behavioral traits, and biochemical markers, this study aims to identify the key predictive factors for obesity in DM patients. Successful implementation of this predictive analysis will not only improve the effectiveness of personalized clinical interventions and patient outcomes but also significantly reduce the healthcare costs associated with obesity and diabetes. Ultimately, the results of this study will contribute to a deeper understanding of the mechanisms behind the occurrence of obesity in diabetes, providing practical and effective insights for clinical practice and healthcare policy development.

Methods

Study design and population

This study utilized cross-sectional data from the NHANES survey, collected in the United States between 2007 and 2018. The survey, conducted by the National Center for Health Statistics (NCHS) under the Centers for Disease Control and Prevention (CDC), aimed to provide a nationally representative assessment of the health and nutritional status of non-institutionalized civilians in the United States. Since NHANES is a publicly available database and has been approved by the Institutional Review Board (IRB) of NCHS, our institution confirmed that no additional ethical approval was required. Furthermore, the IRB acknowledges that NCHS adheres to strict ethical standards in its data collection and processing procedures, including obtaining informed consent from all participants and ensuring data anonymization. These measures guarantee full compliance with ethical guidelines for secondary data analysis.

Assessment criteria for diabetes and obesity

Individuals were included in this study as having diabetes if they self-reported a diabetes diagnosis, had a fasting blood glucose level ≥126 mg/dL, had a hemoglobin A1c level ≥6.5%, or reported using anti-diabetic medications. Obesity was assessed based on participants’ body mass index (BMI), with a BMI ≥ 30.0 kg/m² defining obesity, and those with a lower BMI classified as non-obese. To maintain data quality and consistency, records with more than 10% missing data were excluded, while multiple imputation was applied to records with minor missing values. Following these inclusion and exclusion criteria, a total of 3,794 participants with diabetes were included in the final analysis, consisting of 2,163 participants in the obesity group and 1,631 participants in the non-obesity group (Figure 1).
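The inclusion rule and outcome definition above can be written as a short sketch. The function and argument names below are hypothetical (they are not NHANES variable names), and a missing laboratory value is represented as None. This illustrates the stated criteria and is not the authors' code.

```python
from typing import Optional

FPG_CUTOFF_MG_DL = 126.0    # fasting blood glucose threshold stated in the text
HBA1C_CUTOFF_PCT = 6.5      # hemoglobin A1c threshold stated in the text
OBESITY_BMI_CUTOFF = 30.0   # kg/m^2, obesity threshold stated in the text


def has_diabetes(self_reported: bool,
                 fasting_glucose_mg_dl: Optional[float],
                 hba1c_pct: Optional[float],
                 uses_antidiabetic_medication: bool) -> bool:
    """Return True if any one of the four stated criteria is met."""
    if self_reported or uses_antidiabetic_medication:
        return True
    if fasting_glucose_mg_dl is not None and fasting_glucose_mg_dl >= FPG_CUTOFF_MG_DL:
        return True
    if hba1c_pct is not None and hba1c_pct >= HBA1C_CUTOFF_PCT:
        return True
    return False


def is_obese(bmi: float) -> bool:
    """Obese if BMI is at or above 30.0 kg/m^2, otherwise non-obese."""
    return bmi >= OBESITY_BMI_CUTOFF
```

Because the criteria are joined by "or", a participant with no self-reported diagnosis and no medication use still qualifies on a single laboratory value at or above its threshold.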

Figure 1
Flowchart showing participant selection from NHANES 2007-2018. Total population is 59,842. Exclusions include age under 18 years (23,262) and no diabetes diagnosis (30,408), leaving 6,172 eligible participants. Further exclusion for missing data over 10% (2,378) results in 3,794 eligible for analysis. Participants are categorized by BMI: obesity (over 30 kg/m², n=2,163) and non-obesity (under 30 kg/m², n=1,631).

Figure 1. Flowchart of participant enrollment and exclusion.

This study selected a set of variables to investigate the factors associated with obesity in DM patients, covering demographic, socioeconomic, lifestyle, clinical, and biochemical factors. The demographic variables consist of gender, age, race, education level, and marital status. Education level is categorized as below high school, high school, and above high school. Marital status is classified as married, widowed, divorced, separated, never married, or living with a partner. Race is divided into Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, and Other Race (including Multi-Racial). The poverty income ratio (PIR) is calculated by dividing household income by the corresponding poverty line for the survey year and location (12).

Lifestyle variables include smoking status (smoker or non-smoker) and alcohol consumption behavior (drinker or non-drinker). Physical activity levels are categorized as light, moderate, and vigorous exercise based on MET values.

Anthropometric measurements collected during the NHANES survey cycles include BMI. Biochemical markers include albumin (Alb), alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), total cholesterol (TC), serum creatinine (Scr), triglycerides (TG), uric acid (UA), hemoglobin (Hb), and glycated hemoglobin (HbA1c). Data on essential hypertension (EH), coronary heart disease (CHD), and depression were obtained through the questionnaire modules. EH and CHD were defined based on self-reported information from the Blood Pressure and Cholesterol Questionnaire (BPQ) and the Medical Conditions Questionnaire (MCQ) modules. Depressive symptoms were assessed using the PHQ-9 score, where a score of ≥10 indicates depression and a score of <10 indicates no depression (13). Intake of total energy, protein, carbohydrate, total sugar, dietary fiber, total fat, vitamin B12, vitamin C, vitamin D, and vitamin K was obtained from the Dietary Data module.

Statistical analysis

NHANES employs a complex multi-stage survey design to generate nationally representative data. For the weighted descriptive analysis, we calculated weighted means with 95% confidence intervals and weighted percentages to account for the complex sampling design; following NHANES analysis guidelines, weighted groups were compared using chi-square tests for categorical variables and weighted linear regression for continuous variables. Unweighted data were used for model development and the remaining statistical analyses. In these unweighted analyses, continuous variables are reported as means with standard deviations (SD) and categorical variables as frequencies and percentages, and intergroup differences in clinical characteristics were evaluated with chi-square tests for categorical variables and independent t-tests for continuous variables. A p-value of less than 0.05 was considered statistically significant, and all tests were two-tailed.

All statistical analyses in this study were performed using R software (version 4.2.2). Initially, LASSO regression with 10-fold cross-validation was conducted using the cv.glmnet() function from the glmnet package for variable selection. To enhance the robustness and comparability of model performance, we subsequently constructed and evaluated nine machine learning models: logistic regression (glm() with family = binomial), random forest (RF; randomForest), radial support vector machine (RSVM; e1071), k-nearest neighbors (KNN; class), extreme gradient boosting (XGBoost; xgboost), light gradient boosting machine (LightGBM; lightgbm), decision tree (DT; rpart), elastic net (ENet; glmnet, α = 0.5), and multilayer perceptron (MLP; nnet). Model performance was assessed across four dimensions: (1) discrimination, measured by the area under the curve (AUC) using the pROC package; (2) calibration, evaluated via calibration plots using the rms and caret packages; (3) overall predictive accuracy, quantified using Brier scores calculated by the DescTools and ModelMetrics packages; and (4) clinical utility, determined by decision curve analysis (DCA) using the rmda package to estimate net clinical benefit at various threshold probabilities. All visualizations were generated using the ggplot2 package.
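To make two of these metrics concrete, the following minimal sketch computes the AUC and the Brier score from their definitions in plain Python. It is not the authors' R code (which relied on pROC, DescTools, and ModelMetrics); the function names and the toy data are illustrative assumptions.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as one half. The pairwise loop is O(n_pos * n_neg), which is
    fine for illustration but slow for large samples.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not study data).
y_toy = [1, 0, 1, 1, 0]
p_toy = [0.9, 0.2, 0.6, 0.4, 0.5]
toy_auc = auc(y_toy, p_toy)
toy_brier = brier_score(y_toy, p_toy)
```

As a point of reference for interpreting Brier scores, a model that assigns every participant the same probability q has a Brier score of q(1 − q) when q equals the observed prevalence. With an obesity prevalence of about 0.57, that uninformative baseline is roughly 0.245.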

Model building process

We randomly allocated 70% of the patient population to the training set and 30% to the testing set, using stratified sampling to ensure a balanced distribution of the target variable between the two sets. LASSO regression with cross-validation was then used for feature selection in the training set, and features with non-zero coefficients were retained for subsequent analysis. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was used to generate a balanced dataset in the training set (14).
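A stratified 70/30 split of the kind described here can be sketched as follows. The function name, the seed, and the rounding rule are assumptions made for illustration; the article does not report how the split was implemented in R.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                          # random order within each class
        cut = round(len(idx) * train_fraction)    # same fraction taken from each class
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Toy labels that mirror the reported group sizes (2,163 obese, 1,631 non-obese).
labels = [1] * 2163 + [0] * 1631
train_idx, test_idx = stratified_split(labels)
# Any oversampling (such as SMOTE) would be applied to train_idx only,
# so that the test set keeps the original class distribution.
```

Splitting within each class keeps the obesity prevalence nearly identical in both sets, which is what makes the training and testing AUCs comparable.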

In this study, we employed nine machine learning algorithms to model and predict the risk of obesity among patients with diabetes: logistic regression, random forest (RF), radial support vector machine (RSVM), k-nearest neighbors (KNN), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), decision tree (DT), elastic net regression (ENet), and multilayer perceptron (MLP). Model construction was performed on the training set, and only after repeated training yielded stable results were the models evaluated on the independent testing set. Model performance was assessed across four dimensions: discrimination (measured by ROC curves and AUC), calibration (evaluated using calibration plots), overall predictive accuracy (assessed using the Brier score), and clinical utility (measured by DCA to estimate net clinical benefit across a range of threshold probabilities). The model with the best overall performance was ultimately selected for clinical visualization and interpretation.
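The net benefit that DCA plots at each threshold probability follows a standard definition: true positives per participant minus false positives per participant, with false positives weighted by the odds of the threshold. The sketch below implements that definition in plain Python. The function names and toy data are illustrative assumptions, and the authors' own analysis used the rmda package in R.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)   # odds of the threshold probability
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: intervene on every participant regardless of risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data for illustration only (not study data).
y_toy = [1, 0, 1, 1, 0]
p_toy = [0.9, 0.2, 0.6, 0.4, 0.5]
nb_model = net_benefit(y_toy, p_toy, threshold=0.55)
nb_all = net_benefit_treat_all(y_toy, threshold=0.55)
```

A model is clinically useful at a given threshold when its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero by definition.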

The model’s performance was evaluated based on classification accuracy and robustness. After training, the model was validated using the test set. The model’s classification performance was assessed using the ROC curve and the AUC metric (15). ROC curve analysis in the training set provided insights into the model’s predictive capability, while the AUC value was used to evaluate its discriminative ability.

Furthermore, a nomogram was developed based on the logistic regression model, providing an intuitive and user-friendly tool for clinical use (16). The nomogram visually illustrates the relative contribution of each predictor to the obesity risk score. Each predictor is assigned a score proportional to its regression coefficient, and the total score corresponds to the predicted probability of obesity. A calibration curve was created to assess the predictive performance of the nomogram by comparing the predicted and observed probabilities of obesity. The discriminatory ability of the nomogram was evaluated using the AUC of the ROC curve. DCA was performed to evaluate the clinical utility of the nomogram by quantifying the net benefit at different threshold probabilities.
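The point assignment described above can be illustrated with a small sketch that follows the usual nomogram convention: the predictor with the widest effect range spans 0 to 100 points, and every other predictor is scaled against it. All coefficients, ranges, and the intercept below are hypothetical values chosen for illustration and are not the fitted values from this study.

```python
import math
from typing import Dict, Tuple

# Hypothetical intercept, coefficients, and value ranges (illustration only).
INTERCEPT = -0.5
PREDICTORS: Dict[str, Tuple[float, float, float]] = {
    # name: (coefficient, minimum value, maximum value)
    "depression": (0.60, 0.0, 1.0),
    "hypertension": (0.50, 0.0, 1.0),
    "age_years": (-0.03, 20.0, 80.0),
}

# The predictor with the widest effect range defines the 0-100 point scale.
MAX_EFFECT = max(abs(b) * (hi - lo) for b, lo, hi in PREDICTORS.values())


def reference_value(beta: float, lo: float, hi: float) -> float:
    """Value that scores 0 points: the lowest-risk end of the predictor's range."""
    return lo if beta >= 0 else hi


def points(name: str, value: float) -> float:
    """Points for one predictor, proportional to its regression coefficient."""
    beta, lo, hi = PREDICTORS[name]
    return 100.0 * beta * (value - reference_value(beta, lo, hi)) / MAX_EFFECT


def predicted_probability(patient: Dict[str, float]) -> float:
    """Map total points back to a probability (patient must list every predictor)."""
    total_points = sum(points(name, value) for name, value in patient.items())
    baseline = INTERCEPT + sum(
        b * reference_value(b, lo, hi) for b, lo, hi in PREDICTORS.values()
    )
    linear_predictor = baseline + total_points * MAX_EFFECT / 100.0
    return 1.0 / (1.0 + math.exp(-linear_predictor))
```

Because the points are a linear rescaling of the model's linear predictor, reading a probability from the total-points axis gives the same result as evaluating the logistic regression equation directly.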

Results

Baseline characteristics

A total of 3,794 participants were included in this study, with an obesity prevalence rate of 57.01% (2,163/3,794). Among the participants, 1,990 were male and 1,804 were female, with average ages of 60.49 ± 12.66 years and 60.22 ± 12.96 years, respectively.

Compared to non-obese diabetic patients, those in the obese group were younger (57.85 ± 12.58 years vs. 63.69 ± 12.32 years), and the proportion of females was higher in the obese group (1,179/2,163 vs. 625/1,631). Additionally, significant differences were observed in education level, marital status, and race between the two groups. Participants in the obese group slept significantly less than those in the non-obese group (6.98 ± 1.67 vs. 7.26 ± 1.66, p < 0.05). Contrary to expectations, there were no significant differences in PIR, smoking, or drinking behaviors between the two groups (p > 0.05). Regarding past diseases, the obesity group had a higher prevalence of essential hypertension and depression (p < 0.05), while there were no significant differences in the prevalence of CHD and stroke (p > 0.05).

Regarding blood markers, significant differences were observed in ALB, ALP, ALT, Scr, TG, UA, and HbA1c between the two groups (p < 0.05). No differences were found in the levels of AST, TC, and Hb between the two groups (p > 0.05). In terms of energy intake, participants in the obesity group consumed more energy (1,964.70 ± 956.79 vs. 1,863.92 ± 854.06, p < 0.05). Significant differences were observed in the intake of carbohydrates, total sugar, dietary fiber, and total fat between the two groups (p < 0.05), while no differences were found in the intake of vitamin B12, C, and D (p > 0.05). Table 1 presents all the significant variables.

Table 1

Table 1. Baseline data of the obese and non-obese groups [mean ± SD or n (%)].

Feature importance analysis

Among DM participants, obesity was associated with several variables (Figure 2A). Age was the strongest single predictor, followed by gender and UA, suggesting that these three variables may play important roles in the obesity risk of diabetic patients. ALP, essential hypertension, TG, and depression also showed notable predictive value, and other contributing factors included total fat, HbA1c, and education level. In comparison, smoking, alcohol consumption, vitamin intake, and sleep duration showed relatively low predictive performance, indicating a weaker association with obesity risk (Figure 2B).

Figure 2
Panel A is a heatmap showing correlations among various health and lifestyle variables, with darker shades indicating higher correlations. Panel B is a horizontal bar chart displaying the AUC values of these variables, highlighting their predictive power, with variables like Age, Sex, and UA showing high values.

Figure 2. Heatmap of correlations between various variables and their predictive capabilities. (A) Heatmap of correlations. (B) Feature importance ranking of variables in the model based on AUC values.

Machine learning model comparison after LASSO selection

To enhance the model’s robustness and minimize potential overfitting, LASSO regression was applied for the initial selection of predictor variables in the training set, with the optimal regularization parameter determined using 10-fold cross-validation (17). The LASSO model initially identified 19 variables with non-zero coefficients, including sex, age, race, marital status, PIR, ALB, ALP, AST, Scr, TG, UA, Hb, and sleep duration (Figure 3 and Supplementary Table S1).
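For reference, the L1-penalized logistic regression that cv.glmnet() fits for a binary outcome can be written as below. The notation is introduced here for illustration and does not appear in the original article: y_i is the obesity indicator, x_i the predictor vector of participant i, n the number of training participants, p the number of candidate predictors, and λ the regularization parameter chosen by 10-fold cross-validation.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \left( \beta_0 + x_i^{\top}\beta \right)
         - \log\left( 1 + e^{\beta_0 + x_i^{\top}\beta} \right) \right]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

The absolute-value penalty shrinks some coefficients exactly to zero as λ grows, which is why the predictors with non-zero coefficients at the selected λ can be treated as the retained feature set. The article does not state whether the λ minimizing the cross-validated deviance or the more conservative one-standard-error λ was used.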

Figure 3
Panel A shows a line graph with various colored lines representing coefficients plotted against log lambda values, displaying convergence towards zero. Panel B exhibits a plot of binomial deviance against log lambda, with red dots and error bars illustrating deviance values.

Figure 3. Coefficient trajectories and optimal lambda selection in LASSO regression. (A) LASSO coefficient path plot. (B) Cross-validation curve of binomial deviance against log(lambda). LASSO regression identified 19 variables as key factors for predicting obesity.

Based on the selected variables, nine machine learning models were developed: logistic regression, RF, RSVM, KNN, XGBoost, LightGBM, DT, ENet, and MLP. All models were trained on the training set and subsequently evaluated on the testing set. Among these, the logistic regression model demonstrated the best overall performance across multiple evaluation dimensions. It showed good discrimination between obese and non-obese patients, with an AUC of 0.751 in the training set and 0.781 in the testing set (Figures 4A,B). In terms of clinical applicability, DCA showed that the logistic model provided the greatest net benefit across a range of threshold probabilities (Figures 4C,D). The model also achieved favorable calibration, with predicted probabilities closely aligning with observed outcomes, and yielded the lowest Brier score of the nine models (0.189), reflecting the best overall predictive accuracy (Figures 4E,F). Accordingly, logistic regression was selected as the optimal model for obesity risk prediction in patients with diabetes.

Figure 4
Six-panel image featuring various model evaluation graphs. Panel A and B display ROC curves for different models. Panel C and D show net benefit curves plotted against threshold probability. Panel E contains calibration plots for nine models: logistic, enet, dt, rf, xgboost, rsvm, mlp, lightgbm, and knn. Panel F is a heatmap of Brier scores for these models, with scores ranging from 0.189 to 0.225.

Figure 4. Performance evaluation of nine machine learning models for obesity prediction in patients with diabetes. (A–B) Receiver operating characteristic (ROC) curves of nine models in the training set (A) and test set (B), respectively. The area under the curve (AUC) values are shown in the legend. (C–D) Decision curve analysis (DCA) of the models in the training set (C) and test set (D), indicating net clinical benefit across a range of threshold probabilities. (E) Calibration plots of each model in the test set. The red line represents the observed probability, and the diagonal black line indicates perfect calibration. (F) Heatmap of Brier scores for each model in the test set. Lower Brier scores indicate better overall predictive accuracy.

Construction and visualization of the logistic regression model

Following the identification of logistic regression as the optimal predictive model among the nine machine learning algorithms, we further visualized the model coefficients and constructed a clinically applicable prediction tool. As shown in Figure 5, among the variables included in the logistic model, depression and EH emerged as the strongest positive predictors, suggesting that patients with these conditions are at increased risk of obesity. ALP and total fat intake also contributed positively. In contrast, male sex, Alb, and age were negatively associated with obesity risk, indicating a potential protective effect.

Figure 5
Bar chart showing logistic regression coefficients for various factors. Depression and EH=1 have the highest positive coefficients in blue, while Sex=1 has the largest negative coefficient in orange. Other factors like ALP, Total Fat, and Age have smaller coefficients near zero.

Figure 5. The regression coefficients of the 10 significant variables after logistic regression analysis. These regression coefficients are used to construct the nomogram for diagnosing diabetic obesity.

Based on the estimated regression coefficients, a nomogram was constructed to facilitate individualized risk assessment (Figure 6). This nomogram incorporates all significant predictors from the model. By locating the value of each patient’s characteristic on the corresponding axis, assigning points, and summing the total score, clinicians can estimate the predicted probability of obesity. The tool is intuitive, easy to interpret, and clinically applicable, offering a practical method to support early identification and personalized management of obesity risk in patients with diabetes.

Figure 6
Graphical representation of a scoring system linking various factors to obesity probability. Factors include Alb, ALP, AST, Scr, UA, EH, Sex, Age, Depression, Total Fat, and Total Points. Each factor is associated with a numerical scale that contributes to the overall points, which then correlate with a probability of obesity, ranging from 0.1 to 0.9.

Figure 6. Nomogram to estimate the risk of obesity in DM patients. The points assigned to each predictor are summed to obtain the total score. A vertical line drawn from the total score corresponds to the predicted probability of obesity.

To further enhance the clinical interpretability of the nomogram, we determined the optimal cutoff point based on the maximum Youden index. As shown in Figure 7, the Youden index peaked when the total nomogram points reached 131, indicating the most balanced trade-off between sensitivity and specificity at this threshold. This cutoff value can serve as a reference point in clinical decision-making, enabling physicians to identify high-risk patients who may benefit from early lifestyle intervention or closer metabolic monitoring (Table 2).
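The Youden index at a given cutoff is sensitivity plus specificity minus one, and the optimal cutoff is the score that maximizes it. The sketch below shows that search in plain Python. The function name and the toy scores are illustrative assumptions, and the toy data were constructed for the example rather than taken from the study.

```python
from typing import Optional, Sequence, Tuple


def youden_cutoff(y_true: Sequence[int],
                  scores: Sequence[float]) -> Tuple[Optional[float], float]:
    """Return (best_cutoff, best_J), classifying score >= cutoff as high risk."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_cutoff: Optional[float] = None
    best_j = -1.0
    for cutoff in sorted(set(scores)):            # try each observed score as a cutoff
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        tn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0         # sensitivity + specificity - 1
        if j > best_j:                            # keep the first maximum on ties
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy nomogram totals and outcomes, constructed for illustration (not study data).
toy_points = [90, 100, 110, 120, 125, 131, 140, 150]
toy_obese = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, j_value = youden_cutoff(toy_obese, toy_points)
```

The Youden index weights sensitivity and specificity equally, so a cutoff chosen this way may need adjusting when missing a high-risk patient and over-referring a low-risk patient carry different clinical costs.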

Figure 7
A graph depicts the Youden Index (%) versus points, with a red curve forming a bell shape. The peak occurs at point 131, marked by a vertical dashed line, indicating the best Youden index.

Figure 7. Youden index analysis for determining the optimal cutoff point of the nomogram. The highest Youden index corresponds to a total point score of 131, as indicated by the dashed vertical line.

Table 2

Table 2. Odds ratios and 95% confidence intervals from multivariate logistic regression analysis.
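For readers relating the regression coefficients in Figure 5 to the odds ratios in Table 2, an odds ratio is the exponential of the logistic regression coefficient. The sketch below also computes a Wald-type 95% confidence interval, which is an assumption here: the article does not state how its intervals were calculated, and the function name is hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, std_error: float,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for one coefficient.

    Uses a Wald interval on the log-odds scale, then exponentiates.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * std_error),
        math.exp(beta + z * std_error),
    )
```

A coefficient of zero corresponds to an odds ratio of 1, and a predictor is conventionally considered statistically significant at the 0.05 level when its 95% interval excludes 1.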

Discussion

This study is based on data from the NHANES 2007–2018, involving a retrospective analysis of nearly 3,800 diabetic patients. The aim was to identify which indicators are particularly significant in the context of obesity among various clinical and lifestyle factors. Using machine learning algorithms, we ultimately identified 10 variables, including traditional physiological indicators such as ALP, Scr, and AST, as well as less frequently discussed variables, such as depression. This mixed result highlights an important issue: diabetes combined with obesity is not solely due to excess calorie intake, but rather a complex process involving metabolic, emotional, behavioral, and organ function imbalances.

This study demonstrates a certain degree of innovation in terms of research perspective, methodological design, and the significance of the findings. It focuses on predicting the risk of obesity specifically within the population of individuals with diabetes. Although this topic has been discussed in the existing literature, systematic predictive modeling targeted at this specific subgroup remains relatively limited. Our study seeks to contribute additional insight in this underexplored area. Methodologically, we applied LASSO regression for variable selection and developed predictive models using nine commonly used machine learning algorithms. Model performance was comprehensively evaluated across multiple dimensions, including AUC, calibration (calibration plots), overall predictive accuracy (Brier score), and clinical utility (DCA). Furthermore, we employed nomogram construction to enhance model interpretability and clinical applicability. In terms of results, the study identified several key variables strongly associated with obesity and established a predictive model with promising performance and generalizability. These findings may serve as a theoretical basis and practical reference for the early identification and individualized intervention of obesity risk in patients with diabetes.

First, it is important to highlight the prominent role of psychological factors, particularly depression, in the model. In the past, we tended to attribute obesity and diabetes to “eating too much and moving too little,” but the inclusion of depression as a variable in the model reminds us that we cannot overlook the long-term effects of psychological states on energy metabolism and behavior patterns. Existing studies suggest that chronic depression may disrupt appetite regulation mechanisms through abnormal activation of the HPA axis, which in turn influences eating preferences, sleep patterns, and even the willingness to exercise (18, 19). This impact may be more pronounced in diabetic patients. At the same time, the inclusion of traditional factors such as hypertension, age, and gender is not unexpected. On one hand, these serve as “background variables” for disease progression, and on the other hand, they provide the empirical basis for clinical judgment. Therefore, we should pay more attention to whether there are factors, beyond the “familiar” variables, that we have previously overlooked, which may be subtly altering the course of the disease.

Some of the findings related to biochemical markers are quite thought-provoking. The appearance of ALP, AST, and Scr suggests that the mechanisms underlying diabetes combined with obesity may have far exceeded our conventional understanding of “glucose metabolism disorders.” ALP and AST are typically regarded as indicators of liver function, and their changes may already signal the presence of visceral fat accumulation, fatty liver disease, or even inflammation in the asymptomatic stage (20). This could explain why, in recent years, NAFLD has been recognized as a precursor to “novel liver-derived diabetes” (21). Scr, being a relatively stable marker of kidney function, is unsurprising in this population. Diabetes is a major contributor to kidney damage, and the symptoms of obesity further exacerbate the high-pressure burden on the glomerulus (22). More importantly, these markers are less likely to fluctuate compared to lifestyle variables, and as indicators of end-organ damage, they often provide a more accurate reflection of disease progression.

In terms of diet, the inclusion of total fat intake signals that “what” we eat may be more important than “how much” we eat. Although total sugar intake was not included in the final model, this contrasts with some findings in the literature. Possible explanations include: first, diabetic patients generally have an awareness of blood sugar control, which narrows the differences between groups; second, compared to carbohydrates, the metabolic effects of fat are more long-lasting and have a more direct impact on insulin resistance. Additionally, the presence of uric acid and albumin further supports the potential involvement of inflammation and oxidative stress in the pathology of obesity, a direction that has gained widespread attention in recent years (23). In general, the significance of these biochemical markers lies not only in their role as static diagnostic tools but also in their potential as dynamic warning signals, providing important information before clear metabolic imbalance is observed.

Naturally, it is somewhat surprising that some factors traditionally thought to be closely related to obesity, such as smoking, drinking, stroke history, physical activity, and sleep duration, did not enter our model. This does not necessarily imply that they are unimportant, but rather suggests that their explanatory power in the context of diabetes has been overshadowed by other factors. The metabolic effects of smoking and drinking may primarily manifest through cardiovascular and inflammatory pathways, rather than directly contributing to obesity (24). Physical activity and sleep data in NHANES are primarily based on self-reported questionnaires, which are susceptible to cognitive bias. Furthermore, these factors may exhibit collinearity with the selected variables, and weaker variables are more easily “pushed out of the model” in LASSO regression. In other words, the LASSO results are more about selecting variables that maintain independent explanatory power in high-risk populations such as “diabetes + obesity,” rather than simply listing all potential influencing factors.

Overall, our goal is not to create a comprehensive obesity prediction system, but rather to use a feature selection method from the machine learning field, LASSO-logistic regression, to identify a set of variables that retain independent explanatory power and predictive value amid numerous variables and highly redundant information. Compared with traditional unpenalized regression, this method offers stronger dimensionality reduction and better handling of collinear variables, making the model more concise and stable, and thus suitable for clinical decision-making scenarios based on real-world data (25). Notably, we chose not to rely on complex “black-box” models, but instead employed a regularized linear algorithm with good interpretability, striking a balance between statistical significance and clinical applicability. The final results show that the selected variables span multiple dimensions, including psychology, biochemistry, nutrition, and organ function, which, to some extent, outline the biological feature spectrum of diabetes combined with obesity. This also suggests that in future chronic disease research and management, machine learning is not only a tool but also a perspective: it can help us unravel complex data and identify the key aspects that truly warrant attention and intervention.
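
For reference, the standard textbook form of the LASSO-penalized logistic regression objective is given below. The notation is an assumption introduced here rather than taken from the article: y_i is the obesity indicator for participant i, x_i the vector of candidate predictors, and λ the penalty strength, which is typically chosen by cross-validation.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\, \beta}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n}
        \left[
          y_i \left( \beta_0 + x_i^{\top} \beta \right)
          - \log\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right)
        \right]
      + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
    \right\}
```

The first term is the average negative log-likelihood of the logistic model; the second is the L1 penalty, which sets some coefficients exactly to zero as λ increases and is therefore what performs the variable selection.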

Limitations

This study has several limitations that should be acknowledged. First, the data were derived from the NHANES database between 2007 and 2018, and therefore may not fully capture more recent trends or behavioral and physiological changes in the post-COVID-19 era. Second, due to the cross-sectional nature of the NHANES data, causal inferences cannot be drawn from the observed associations. Third, although multiple machine learning algorithms were used and their performance compared, external validation using independent datasets was not conducted, which may limit the generalizability of the models. Fourth, some relevant variables such as genetic, environmental, or medication-related factors were not available in the dataset, potentially affecting the model’s completeness. Finally, although the study applied nomogram visualization, the clinical applicability of the prediction model should be further tested in prospective studies and real-world settings.

Conclusion

In summary, this study identifies key risk factors associated with obesity among individuals with type 2 diabetes using large-scale population-based data and a comparative machine learning framework. Our findings underscore the multifactorial nature of metabolic dysregulation in this population, involving psychological, nutritional, biochemical, and organ function indicators. Among the nine models evaluated, logistic regression demonstrated the most balanced predictive performance and was used to construct a clinically interpretable nomogram. This tool may support early risk stratification and personalized intervention in diabetic patients. Future studies should focus on external validation and longitudinal tracking to enhance model generalizability and translational potential.
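
How a nomogram turns logistic regression coefficients into points can be sketched as follows. Everything numeric here is hypothetical: the predictor names, coefficients, value ranges, and intercept are placeholders chosen for illustration and are not the fitted values from this study.

```python
import math

# Hypothetical predictors: (coefficient per unit, minimum value, maximum value).
predictors = {
    "age_years": (-0.03, 20.0, 80.0),
    "uric_acid_mg_dl": (0.25, 2.0, 10.0),
    "total_fat_g": (0.004, 0.0, 200.0),
}
intercept = -0.5  # hypothetical

# The predictor with the largest effect across its range defines the 0-100 scale.
max_span = max(abs(beta) * (hi - lo) for beta, lo, hi in predictors.values())


def points(name, value):
    """Points for one predictor; 0 points at the lowest-risk end of its range."""
    beta, lo, hi = predictors[name]
    ref = lo if beta >= 0 else hi
    return 100.0 * beta * (value - ref) / max_span


def predicted_risk(patient):
    """Return (total points, predicted probability) for a dict of predictor values."""
    total = sum(points(name, value) for name, value in patient.items())
    # Linear predictor of a patient who scores zero points on every axis.
    base = intercept + sum(
        beta * (lo if beta >= 0 else hi) for beta, lo, hi in predictors.values()
    )
    linear_predictor = base + total * max_span / 100.0
    return total, 1.0 / (1.0 + math.exp(-linear_predictor))


patient = {"age_years": 50.0, "uric_acid_mg_dl": 6.0, "total_fat_g": 100.0}
total, risk = predicted_risk(patient)
print(f"total points = {total:.1f}, predicted risk = {risk:.3f}")
```

Because the point scale is only a linear rescaling of the model's linear predictor, the probability read from the total-points axis is the same as the one obtained by applying the logistic function to the coefficients directly.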

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding authors.

Ethics statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements.

Author contributions

WW: Conceptualization, Methodology, Writing – review & editing. RM: Investigation, Visualization, Writing – review & editing. XC: Conceptualization, Writing – review & editing. SY: Conceptualization, Investigation, Methodology, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that Gen AI was used in the creation of this manuscript. In preparing this manuscript, we used GPT-4 to generate the abstract, perform language and grammar checks, and polish the text. The AI tool assisted in improving the clarity, coherence, and overall quality of the manuscript. However, the authors assume full responsibility for the content, interpretation, and final approval of the manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpubh.2025.1606751/full#supplementary-material

References

1. Bellary, S. The changing character of diabetes complications. Lancet Diabetes Endocrinol. (2022) 10:5–6. doi: 10.1016/S2213-8587(21)00313-2

2. Diabetes is “a pandemic of unprecedented magnitude” now affecting one in 10 adults worldwide. Diabetes Res Clin Pract. (2021) 181:109133. doi: 10.1016/j.diabres.2021.109133

3. Ong, KL, Stafford, LK, McLaughlin, SA, Boyko, EJ, Vollset, SE, Smith, AE, et al. Global, regional, and national burden of diabetes from 1990 to 2021, with projections of prevalence to 2050: a systematic analysis for the global burden of disease study 2021. Lancet. (2023) 402:203–34. doi: 10.1016/S0140-6736(23)01301-6

4. Parker, ED, Lin, J, Mahoney, T, Ume, N, Yang, G, Gabbay, RA, et al. Economic costs of diabetes in the U.S. in 2022. Diabetes Care. (2024) 47:26–43. doi: 10.2337/dci23-0085

5. Ruze, R, Liu, T, Zou, X, Song, J, Chen, Y, Xu, R, et al. Obesity and type 2 diabetes mellitus: connections in epidemiology, pathogenesis, and treatments. Front Endocrinol. (2023) 14:1161521. doi: 10.3389/fendo.2023.1161521

6. Chandrasekaran, P, and Weiskirchen, R. The role of obesity in type 2 diabetes mellitus—an overview. Int J Mol Sci. (2024) 25:1882. doi: 10.3390/ijms25031882

7. Deo, RC. Machine learning in medicine. Circulation. (2015) 132:1920–30. doi: 10.1161/CIRCULATIONAHA.115.001593

8. Krishnan, R, Rajpurkar, P, and Topol, EJ. Self-supervised learning in medicine and healthcare. Nat Biomed Eng. (2022) 6:1346–52. doi: 10.1038/s41551-022-00914-1

9. Theodosiou, AA, and Read, RC. Artificial intelligence, machine learning and deep learning: potential resources for the infection clinician. J Infect. (2023) 87:287–94. doi: 10.1016/j.jinf.2023.07.006

10. Li, X, Ding, F, Zhang, L, Zhao, S, Hu, Z, Ma, Z, et al. Interpretable machine learning method to predict the risk of pre-diabetes using a national-wide cross-sectional data: evidence from CHNS. BMC Public Health. (2025) 25:1145. doi: 10.1186/s12889-025-22419-7

11. Garg, S, Kitchen, R, Gupta, R, and Pearson, E. Applications of AI in predicting drug responses for type 2 diabetes. JMIR Diabetes. (2025) 10:e66831. doi: 10.2196/66831

12. Yogeswaran, V, Kim, Y, Franco, RL, Lucas, AR, Sutton, AL, LaRose, JG, et al. Association of poverty-income ratio with cardiovascular disease and mortality in cancer survivors in the United States. PLoS One. (2024) 19:e0300154. doi: 10.1371/journal.pone.0300154

13. Kroenke, K, Spitzer, RL, and Williams, JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. (2001) 16:606–13. doi: 10.1046/j.1525-1497.2001.016009606.x

14. Bunkhumpornpat, C, Boonchieng, E, Chouvatut, V, and Lipsky, D. FLEX-SMOTE: synthetic over-sampling technique that flexibly adjusts to different minority class distributions. Patterns. (2024) 5:101073. doi: 10.1016/j.patter.2024.101073

15. Martinović, M, Dokic, K, and Pudić, D. Comparative analysis of machine learning models for predicting innovation outcomes: an applied AI approach. Appl Sci. (2025) 15:3636. doi: 10.3390/app15073636

16. Garwe, T, and Choi, J. An introduction to clinical prediction models using logistic regression in acute care surgery research: methodologic considerations and common pitfalls. J Trauma Acute Care Surg. (2025) 98:699–703. doi: 10.1097/TA.0000000000004584

17. Zhu, Y, Liang, R, Wang, Y, Yang, J-J, Zhou, N, and Zhou, C-M. Development of a LASSO machine learning algorithm-based model for postoperative delirium prediction in hepatectomy patients. BMC Surg. (2025) 25:26. doi: 10.1186/s12893-025-02759-2

18. Mikulska, J, Juszczyk, G, Gawrońska-Grzywacz, M, and Herbet, M. HPA axis in the pathomechanism of depression and schizophrenia: new therapeutic strategies based on its participation. Brain Sci. (2021) 11:1298. doi: 10.3390/brainsci11101298

19. Balbo, M, Leproult, R, and Van Cauter, E. Impact of sleep and its disturbances on hypothalamo-pituitary-adrenal axis activity. Int J Endocrinol. (2010) 2010:759234. doi: 10.1155/2010/759234

20. Ling, S, Diao, H, Lu, G, and Shi, L. Associations between serum levels of liver function biomarkers and all-cause and cause-specific mortality: a prospective cohort study. BMC Public Health. (2024) 24:3302. doi: 10.1186/s12889-024-20773-6

21. Pouwels, S, Sakran, N, Graham, Y, Leal, A, Pintar, T, Yang, W, et al. Non-alcoholic fatty liver disease (NAFLD): a review of pathophysiology, clinical management and effects of weight loss. BMC Endocr Disord. (2022) 22:63. doi: 10.1186/s12902-022-00980-1

22. Ahmed, N, Dalmasso, C, Turner, MB, Arthur, G, Cincinelli, C, and Loria, AS. From fat to filter: the effect of adipose tissue-derived signals on kidney function. Nat Rev Nephrol. (2025) 21:417–34. doi: 10.1038/s41581-025-00950-5

23. Li, F, Chen, S, Qiu, X, Wu, J, Tan, M, and Wang, M. Serum uric acid levels and metabolic indices in an obese population: a cross-sectional study. Diabetes Metab Syndr Obes. (2021) 14:627–35. doi: 10.2147/DMSO.S286299

24. Ohlrogge, AH, Frost, L, and Schnabel, RB. Harmful impact of tobacco smoking and alcohol consumption on the atrial myocardium. Cells. (2022) 11:2576. doi: 10.3390/cells11162576

25. Sullivan, B, Barker, E, MacGregor, L, Gorman, L, Williams, P, Bhamber, R, et al. Comparing conventional and Bayesian workflows for clinical outcome prediction modelling with an exemplar cohort study of severe COVID-19 infection incorporating clinical biomarker test results. BMC Med Inform Decis Mak. (2025) 25:123. doi: 10.1186/s12911-025-02955-3

Keywords: diabetes mellitus, obesity, machine learning, NHANES, LASSO regression, predictive modeling, nomogram

Citation: Wang W, Mo R, Chen X and Yang S (2025) A machine learning model for predicting obesity risk in patients with diabetes mellitus: analysis of NHANES 2007–2018. Front. Public Health. 13:1606751. doi: 10.3389/fpubh.2025.1606751

Received: 06 April 2025; Accepted: 05 August 2025;
Published: 22 August 2025.

Edited by:

Hui-Xin Liu, China Medical University, China

Reviewed by:

Bahadır Yüzbaşı, İnönü University, Türkiye
Bhumandeep Kour, Lovely Professional University, India

Copyright © 2025 Wang, Mo, Chen and Yang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Sijie Yang, yangsijiemdj@sina.com

These authors have contributed equally to this work