Development and validation of a diabetic retinopathy risk prediction model for middle-aged patients with type 2 diabetes mellitus

Objectives The study aims to establish a predictive nomogram of diabetic retinopathy (DR) for the middle-aged population with type 2 diabetes mellitus (T2DM). Methods This retrospective study screened 931 patients with T2DM between 30 and 59 years of age from the 2011-2018 National Health and Nutrition Examination Survey database. The development group comprised 704 participants from the 2011-2016 survey, and the validation group included 227 participants from the 2017-2018 survey. The least absolute shrinkage and selection operator regression model was used to determine the best predictive variables. Logistic regression analysis was used to build three models: the full model, the multiple fractional polynomial (MFP) model, and the stepwise (stepAIC) selected model. The optimal model was then chosen based on the receiver operating characteristic (ROC) curve. The ROC curve, calibration curve, Hosmer-Lemeshow test, and decision curve analysis (DCA) were used to validate and assess the model. An online dynamic nomogram prediction tool was also constructed. Results The MFP model was selected as the final model, including gender, the use of insulin, duration of diabetes, urinary albumin-to-creatinine ratio, and serum phosphorus. The area under the curve (AUC) was 0.709 in the development set and 0.704 in the validation set. According to the ROC curve, calibration curves, and Hosmer-Lemeshow test, the nomogram demonstrated good consistency. According to DCA, the nomogram was clinically useful. Conclusion This study established and validated a predictive model for DR in the midlife T2DM population, which can assist clinicians in quickly determining who is prone to develop DR.


Introduction
Diabetic retinopathy (DR) is a common microvascular complication of type 2 diabetes mellitus (T2DM) and one of the major causes of blindness (1). According to the Global Burden of Disease Study, DR was the only cause of age-standardized vision loss to increase over the past three decades (2). Over 103.12 million adults worldwide were diagnosed with DR in 2020, and with the prevalence of diabetes increasing at an alarming rate, it is estimated that the world DR population will grow by 55.6% (57.4 million) between 2020 and 2045 (3). A prevalence-based cost-of-illness model estimates that Indonesia will spend $8.9 billion on the healthcare of DR in 2025 (4). As DR is often asymptomatic until the later, more severe stages, early diagnosis and intervention are essential and more cost-effective for public health (5)(6)(7).
Some studies have examined DR prevalence in different age groups of patients with T2DM. The ADVANCE Collaborative group (8) reported that the duration of diabetes is independently related to the risk of microvascular complications, and that diabetes duration has a more significant impact on younger people than on older people. Middleton et al. (9) found that people diagnosed with T2DM in middle age (or with a younger present age) seem more susceptible to DR, and that the odds of DR decreased with increasing age at diagnosis. They attributed this difference to the decline in insulin-like growth factor 1 and growth hormone with increasing age. DR is more likely to occur in the middle-aged population after diagnosis of T2DM than in the elderly (10), so a more targeted prediction model and intervention strategy are needed.
Several prediction models have been applied to the identification and diagnosis of DR (11)(12)(13). However, these prediction models were constructed for almost all age groups, and they have shortcomings in predicting the development of DR in specific age groups. This limits their ability to stratify individual patients according to risk level and to select the optimal treatment. To our knowledge, there is a lack of predictive models developed separately for the middle-aged population. We propose that developing a separate DR prediction model for the middle-aged group, and thereby narrowing the model's target population, may make the model more useful for early identification and prevention of DR. We developed a model predicting the development of DR in middle-aged people with T2DM based on data from the National Health and Nutrition Examination Survey (NHANES), which may provide more personalized screening and treatment options for middle-aged T2DM patients.

Study design and participants
NHANES is a research program that evaluates the health and nutritional status of adults and children in the US. Each year it samples about 5,000 nationally representative persons using a multistage, stratified, clustered sampling approach (14).
We started from the 39,156 participants in NHANES 2011 to 2018. Following the guideline of the American Diabetes Association (15), patients with T2DM were defined as follows: (1) participants who had been told by a doctor that they had diabetes, with an age at diagnosis ≥30 years; (2) participants who did not self-report a diabetes diagnosis but had HbA1c ≥6.5%. We excluded participants aged <30 years (n=20,291) and >59 years (n=7,683), leaving 11,182 participants aged 30-59 years. These participants were then separated into two groups according to whether they had data on the age at which they were first told by a professional that they had diabetes: the first group had these data (n=1,083), and the second group did not (n=10,099). From the first group, we excluded participants who were younger than 30 years when first told they had diabetes and those with no DR data, leaving 604 participants. From the second group, we excluded participants with missing glycohemoglobin data or glycohemoglobin <6.5%, leaving 327 participants. The two groups were combined to form the final study population. Participants from 2011 to 2016 formed the development cohort, and participants from 2017 to 2018 served as an external validation cohort. Figure 1 illustrates the detailed selection process.
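The selection rules above can be written down as a small filter. The following is only a minimal sketch of the logic as described in the text, not the code used in the study; the function and argument names (`include_participant`, `age_at_diagnosis`, `has_dr_data`, `cohort_for_cycle`) are hypothetical and do not correspond to NHANES variable names.

```python
from typing import Optional


def include_participant(age: int,
                        age_at_diagnosis: Optional[int],
                        hba1c: Optional[float],
                        has_dr_data: bool) -> bool:
    """Return True if a participant meets the inclusion rules described in the text.

    age_at_diagnosis is None when the participant has no data on the age at
    which they were first told they had diabetes (hypothetical encoding).
    """
    # Only the 30-59 age group is retained.
    if not 30 <= age <= 59:
        return False
    if age_at_diagnosis is not None:
        # Group 1: diagnosed at >= 30 years and DR data available.
        return age_at_diagnosis >= 30 and has_dr_data
    # Group 2: no reported diagnosis age; requires measured HbA1c >= 6.5%.
    return hba1c is not None and hba1c >= 6.5


def cohort_for_cycle(cycle_start_year: int) -> str:
    """Map a two-year survey cycle (given by its first year) to a cohort."""
    # 2011-2016 cycles form the development cohort, 2017-2018 the validation cohort.
    return "development" if cycle_start_year <= 2015 else "validation"


# Consistency check of the counts reported in the text.
assert 39_156 - 20_291 - 7_683 == 11_182
assert 1_083 + 10_099 == 11_182
assert 604 + 327 == 931
```

The closing assertions only confirm that the reported counts are arithmetically consistent with one another.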

Ethics statement
Each participant provided written informed consent before inclusion in the NHANES database, which was reviewed and approved by the National Center for Health Statistics Ethics Review Board. The data are anonymized and made publicly available, and researchers can convert them into a form suitable for analysis while preserving privacy. In line with the study's data usage guidelines, all data are analyzed statistically, and all analyses comply with relevant laws and standards.

Potential predictors
We selected potential predictors that might affect DR progression based on current relevant research and clinical experience (16)(17)(18), including age, gender, diabetes duration, HbA1c, use of insulin, use of hypoglycemic pills, hypertension, renal failure (self-reported weak or failing kidneys), body mass index (BMI), waist circumference, alkaline phosphatase, alanine aminotransferase (ALT), aspartate aminotransferase (AST), serum calcium, serum phosphorus, serum potassium, serum uric acid, total cholesterol, triglyceride, serum iron, blood urea nitrogen, serum albumin, serum creatinine, and urinary albumin-to-creatinine ratio (UACR). The information on hypertension and renal failure came from the questionnaires.

Statistical analysis
R statistical software version 3.6.3 and EmpowerStats version 2.0 were used to conduct the statistical analysis for this study. Normally distributed data were presented as the mean ± standard deviation, and the two independent samples t-test was used to analyze differences between groups. Categorical variables were described as proportions and compared using the chi-square test.
Least absolute shrinkage and selection operator (LASSO) regression is a penalized regression method used for shrinkage and variable selection. First, we applied LASSO regression to the development set to determine the appropriate and effective risk predictors of DR in T2DM patients, and 7 independent variables were selected according to lambda.min. Then, we built three models based on logistic regression analysis: the full model, the multiple fractional polynomial (MFP) model, and the stepwise (stepAIC) selected model. We used the odds ratio with 95% confidence interval (CI) and the P-value to describe the features. The areas under the receiver operating characteristic (ROC) curve of the models were compared in the development and validation sets, and the model with the largest area under the curve (AUC) was selected. The model's consistency was evaluated with the calibration curve and the Hosmer-Lemeshow test. The clinical effectiveness of the model was assessed using decision curve analysis (DCA). All statistical tests were two-sided, with a significance level (alpha) of 0.05. Finally, based on the selected model, we established the nomogram and the online dynamic nomogram prediction tool.
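To illustrate how a LASSO penalty drives the coefficients of uninformative predictors to exactly zero, the following minimal sketch fits an L1-penalized logistic regression by proximal gradient descent. It is an illustration under stated assumptions, not the procedure used in this study: the analysis here was run in R, the choice of the penalty by cross-validation (lambda.min) is not shown, and the function names, step size, and iteration count are hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple


def soft_threshold(z: float, t: float) -> float:
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lasso_logistic(X: Sequence[Sequence[float]], y: Sequence[int], lam: float,
                   step: float = 0.5, n_iter: int = 500) -> Tuple[float, List[float]]:
    """Minimize mean logistic loss + lam * sum(|beta_j|); the intercept is unpenalized."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals (predicted probability minus outcome) at the current estimate.
        resid = [sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        # Plain gradient step for the intercept, gradient step plus
        # soft-thresholding for the penalized coefficients.
        beta0 -= step * sum(resid) / n
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta
```

Predictors whose coefficients remain non-zero at the chosen penalty are the ones carried forward; in practice the predictors would be standardized first so that the penalty treats them on a common scale.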

Baseline characteristics
According to the prespecified exclusion and inclusion criteria, 931 participants were enrolled in our research, including 704 in the development group and 227 in the validation group. Baseline characteristics, including demographics, biochemical indexes, physical examination findings, duration of diabetes, and the use of medications, are shown in Table 1.

Risk factors in the development group
We included 24 associated characteristic variables in LASSO regression analysis (Figures 2A, B) and selected 7 non-zero potential predictors from the LASSO regression analysis results based on the data of the development group. These predictors included gender, taking insulin now, weak failing kidneys, duration of diabetes, UACR, blood urea nitrogen, and serum phosphorus.

Prediction model development
To construct the prediction model, we performed the following steps. First, we combined all 7 potential predictors selected by LASSO regression analysis into a multivariable model using multivariate logistic regression, which built the full model. Then, the stepwise backward regression selection method was used to fit the stepwise model based on Akaike's Information Criterion (AIC).
Because of multicollinearity among the predictors, we constructed the multiple fractional polynomial (MFP) model. The details of the three models are shown in Table 2. Finally, ROC curves were plotted for all three models, and the AUC of these models was compared (Figure 3). We chose the MFP model to construct the nomogram according to the results.
FIGURE 1 Flow chart of the development and validation groups.
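For readers unfamiliar with fractional polynomials, the sketch below shows the candidate transformations that a first-degree fractional polynomial considers for a positive continuous predictor, using the conventional power set in which a power of 0 denotes the natural logarithm. It is illustrative only: the selection among these candidates and the transformations actually retained in the MFP model are not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Dict

# Conventional fractional polynomial power set; 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(x: float, power: float) -> float:
    """Apply one fractional polynomial transformation to a positive value."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return math.log(x) if power == 0 else x ** power


def fp1_candidates(x: float) -> Dict[float, float]:
    """Return every first-degree candidate transformation of x, keyed by power."""
    return {p: fp_transform(x, p) for p in FP_POWERS}
```

A model-building procedure would fit the outcome model once per candidate power and keep a non-linear transformation only when it improves the fit enough over the linear term (power 1).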

Development of nomogram
According to the MFP model, 5 independent predictors were used to establish a DR risk nomogram (Figure 4A). To make it more convenient to predict the risk of DR in T2DM patients, we created an online dynamic nomogram tool (http://www.empowerstats.net/pmodel/?m=22793_GaoXiangWangredictionmodelofretinopathyinmiddleagedpatientswithtype2diabetesmellitus). In the online tool, doctors can calculate the risk of DR in middle-aged T2DM patients based on the specific values of each indicator (Figure 4B).

Assessment of predictive nomogram
We applied the ROC curve to test the discrimination of the model (Figures 3A, B). The AUC was 0.709 in the development group and 0.704 in the validation group.
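The AUC can be read as the probability that a randomly chosen patient with DR receives a higher predicted risk than a randomly chosen patient without DR. The following minimal sketch computes it directly from that definition, counting ties as one half; it is an illustration rather than the routine used in this study, and the function name `auc` is hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # case ranked above non-case
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.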
To check the consistency of this model, calibration curves and the Hosmer-Lemeshow test were used. Figure 5 shows the calibration curves of the model in both the development and validation sets. The horizontal axis stands for the predicted DR risk, the vertical axis represents the actual diagnosed DR risk, and the gray diagonal line stands for the perfect prediction of an ideal nomogram. Nomogram performance is shown by the solid line; the closer it lies to the gray diagonal line, the better the predictive performance. According to the calibration curves, the nomogram displayed good consistency. In addition, the Hosmer-Lemeshow test showed no significant lack of fit in either group (P=0.42 in the development group, P=0.52 in the validation group). Figure 6 shows the DCA curves for the development and validation groups. The dashed line stands for the model, the gray line shows the net benefit when all patients are assumed to have DR, and the black line represents the net benefit when no patients are assumed to have DR. The region where the model curve lies above both the black line and the gray line represents the model's clinical applicability: over the range of threshold probabilities where the dashed line is above both reference lines, using the model provides a net benefit.
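The net benefit plotted in a decision curve follows the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The sketch below computes it for the model and for the treat-all reference strategy (the treat-none strategy has a net benefit of 0 at every threshold). It is a minimal illustration with hypothetical function names, not the code used to produce Figure 6.

```python
from typing import Sequence


def net_benefit(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Net benefit of classifying patients with predicted risk >= threshold as high risk."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels: Sequence[int], threshold: float) -> float:
    """Net benefit when every patient is assumed to have the outcome."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Evaluating both functions over a grid of thresholds and plotting the results against the horizontal zero line gives the three curves of a decision curve analysis.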

Discussion
A nomogram is a useful and reliable forecasting tool that can produce individual probabilities of endpoint events by combining different variables and quantifying risk individually (19). In the risk predictor analysis of this study, gender, use of insulin, renal failure, duration of diabetes, UACR, blood urea nitrogen, and serum phosphorus were related to the risk of DR in midlife patients with T2DM. Based on this, we used statistical analysis to screen five of these variables to construct and validate a novel DR risk predictive tool for middle-aged patients with T2DM. The model showed that being male, currently taking insulin, longer duration of diabetes, higher UACR, and lower serum phosphorus were critical factors in determining the risk of DR in patients with T2DM, which partly overlaps with the risk factors reported in previous studies (10,20). To make it more convenient for physicians to provide early individualized intervention for middle-aged T2DM patients, we have built a free online prediction tool. According to Anne et al. discrimination and calibration ability, offering a personalized prediction of DR incidence.
For patients with T2DM, disease duration is an unmodifiable risk factor. Compared with T1DM, disease duration has a more significant impact on patients with T2DM combined with DR (22). Compared to older T2DM patients, midlife patients have a longer survival time and will be exposed to the increased risk of complications associated with a longer disease course. Elevated blood glucose and lipid metabolism disorders in T2DM patients cause pathological reactions, such as oxidative stress and inflammatory response (23,24), which are considered important in the pathogenesis of DR (25)(26)(27). Longer disease duration means a sustained state of inflammation for a longer period, which raises the risk of DR. As reported by Singh et al. (28), DR prevalence was five times higher in patients with a disease duration of >15 years than in patients with a disease duration of <5 years. According to Sun et al. (29), the duration of diabetes is strongly correlated with the risk of DR. In middle-aged patients with T2DM, it is critical to diagnose and intervene in the early stages of DR progression to minimize the risks that come with a longer disease course.
Gender has been found in some studies to be relevant to the risk of developing diabetes-related complications (30,31). Middle-aged men were significantly more likely to have T2DM, indicating that gender factors are involved to some extent in the pathogenesis of T2DM and its complications in the middle-aged population (32,33). Several studies have revealed sexual dimorphism in fat distribution, inflammatory signaling pathway activation, and T2DM risk (34-37). Middle-aged T2DM patients have demonstrated gender differences in disease progression and pathogenesis, and our model suggested that being male is significantly associated with DR in the middle-aged population. Studies based on national databases from the UK and Finland found that the male sex is an independent risk factor for advanced DR in T2DM and a risk predictor for disease progression (38, 39). Maric-Bilkan et al. (40) concluded that age-related differences in hormone levels, glycemic control, duration of diabetes, and ethnic background could explain the reported gender differences in DR risk. Although the pathological mechanisms by which gender influences DR progression are still unclear, the significantly different prevalence between genders implies that individualized care measures are needed.
Prevention strategies targeting modifiable risk factors are critical for the middle-aged T2DM population. Since its first clinical use in 1922, exogenous insulin has become a widely used hypoglycemic drug for patients with many forms of diabetes worldwide (41). Our results showed that middle-aged T2DM patients on insulin therapy were at greater risk of developing DR. A meta-analysis based on seven cohort studies has shown a significant association between the use of insulin and the risk of DR (42). A systematic review conducted by Song et al. (43) found that insulin therapy was markedly correlated with an increased prevalence of any DR. The correlation between insulin therapy and DR demonstrated in various studies indicates that clinicians need to be more cautious when applying insulin therapy to patients at high risk of developing DR. In addition, patients on long-term insulin therapy should be screened for DR more carefully.
UACR is a clinically used indicator of renal function and a marker of endothelial dysfunction, and it may reflect damage to the microvasculature of the kidney and retina. Wang et al. (44) found that UACR, in addition to being an important marker of chronic kidney disease, was also closely related to the progression of DR. A 10-year prospective follow-up study confirmed that both UACR and estimated glomerular filtration rate (eGFR) were significant risk factors for DR, but UACR had a stronger association than eGFR (45), which is consistent with our result. Current studies have found that high UACR is linked to changes in retinal vascular geometry, that patients with high UACR appear to be predisposed to systemic vascular endothelial cell disease, and that glycemic control may not affect the inherent biological risk of developing microvascular complications (46,47). For this reason, UACR may be a favorable and easily accessible biochemical indicator for predicting DR.
Although the prediction model developed in this research is meaningful for the early prevention and treatment of the middle-aged T2DM population, there are still some limitations. First, the population included in this study was the general middle-aged US population. Because of differences in lifestyle and eating habits, the DR prediction nomogram may have limited generalizability to populations in other countries. Second, all patient data in this study were obtained from the NHANES database. Although we used data from a different period for validation, multicenter clinical validation is needed to assess the efficacy of the nomogram. Third, we could not refine our analysis according to whether DR was proliferative, because the NHANES database lacks DR staging data.
FIGURE 3 The ROC curves of prediction models.

Conclusion
The study developed a new web-based nomogram for predicting DR prevalence in middle-aged T2DM patients. After internal and external validation, the nomogram demonstrated good predictive performance. The nomogram includes 5 common clinical characteristics: gender, serum phosphorus, UACR, duration of diabetes, and use of insulin. It enables early identification of groups at high risk of DR among middle-aged T2DM patients and can help to develop an aggressive individualized prevention and treatment strategy to reduce the prevalence and slow down the progression of DR. More prospective, multicenter clinical trials are needed to confirm our nomogram.

Data availability statement
This study uses data from a free and open public database, which can be found here: www.cdc.gov/nchs/nhanes/.

Ethics statement
Ethical approval was not provided for this study on human participants because each participant provided written informed consent before inclusion in the NHANES database, which was reviewed and approved by the National Center for Health Statistics Ethics Review Board. The data are anonymized and made publicly available, and researchers can convert them into a form suitable for analysis while preserving privacy. In line with the study's data usage guidelines, all data are analyzed statistically, and all analyses comply with relevant laws and standards. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.
FIGURE 5 Calibration curve of the risk nomogram.