Clinical Characteristics and Risk of Diabetic Complications in Data-Driven Clusters Among Type 2 Diabetes

Background This study aimed to cluster newly diagnosed patients and patients with long-term diabetes and to explore the clinical characteristics, risk of diabetic complications, and medication treatment associated with each cluster. Research Design and Methods K-means clustering analysis was performed on 1,060 Chinese patients with type 2 diabetes based on five variables (HbA1c, age at diagnosis, BMI, HOMA2-IR, and HOMA2-B). The clinical features, risk of diabetic complications, and use of eleven types of medication in each cluster were evaluated with the chi-square test and the Tukey–Kramer method. Results Four replicable clusters were identified: severe insulin-resistant diabetes (SIRD), severe insulin-deficient diabetes (SIDD), mild obesity-related diabetes (MOD), and mild age-related diabetes (MARD). Blood pressure, renal function, and lipids differed significantly among clusters. Individuals in SIRD had the highest prevalence of stages 2 and 3 chronic kidney disease (CKD) (57%) and diabetic peripheral neuropathy (DPN) (67%), while individuals in SIDD had the highest risk of diabetic retinopathy (32%), albuminuria (31%), and lower extremity arterial disease (LEAD) (13%). Medication treatment differed among clusters for metformin (p = 0.012), α-glucosidase inhibitor (AGI) (p = 0.006), dipeptidyl peptidase 4 inhibitor (DPP-4) (p = 0.017), glucagon-like peptide-1 (GLP-1) (p <0.001), insulin (p <0.001), and statins (p = 0.006). Conclusions Newly diagnosed patients and patients with long-term diabetes can be consistently grouped into characteristic clusters. Each cluster had significantly different patient characteristics, risk of diabetic complications, and medication treatment.


INTRODUCTION
Diabetes is a chronic disease that not only imposes heavy social and economic burdens but also tends to lead to multiple complications, which profoundly affect patients' quality of life and can cause death in severe cases. The prevalence of diabetes is rising rapidly worldwide, so the effective prevention and management of diabetes has become an important topic (1,2).
Diabetes is characterized by hyperglycemia, the causes of which are highly heterogeneous (3). Under current classification criteria, diabetes is divided into two major subtypes, type 1 diabetes (T1D) and type 2 diabetes (T2D), the latter accounting for approximately 85% of cases (2,4,5). This classification relies on the age at diagnosis; however, it may not be sufficient to characterize the complications and outcomes of each subtype. Individuals with diabetes differ in the natural course of hyperglycemia and therefore, in theory, should be treated with clinical strategies better fitted to their metabolic characteristics (6,7). A novel approach that characterizes the diabetes population in detail and explores its clinical features could substantially aid the treatment of patients with diabetes.
In recent years, novel stratifications of diabetes have been attempted worldwide. Three subgroups of T2D were identified using a topological analysis based on patient-patient networks (8). This was a valuable attempt to classify patients; however, because the approach requires genotype data, it can be difficult to implement in clinical settings. Ahlqvist and colleagues identified five replicable clusters of diabetes based on six common clinical variables: glutamic acid decarboxylase autoantibody (GADA), HbA1c, BMI, age at onset of diabetes, and homeostasis model estimates of β-cell function (HOMA2B) and insulin resistance (HOMA2IR) (9). The five diabetes clusters were cluster 1, severe autoimmune diabetes (SAID); cluster 2, severe insulin-deficient diabetes (SIDD); cluster 3, severe insulin-resistant diabetes (SIRD); cluster 4, mild obesity-related diabetes (MOD); and cluster 5, mild age-related diabetes (MARD). Cluster 1 was characterized by the presence of GADA and resembled T1D, whereas the other four clusters were GADA-negative T2D clusters distinguished by the other five variables. Safai et al. (10) used similar routine clinical markers to group patients into five clusters and reported differences in the probability of diabetes complications such as cardiovascular disease, nephropathy, and neuropathy. Zaharia et al. (11) grouped patients in Germany with newly diagnosed type 1 or type 2 diabetes into the same five clusters, validating the technique in a different population. Their five-year follow-up also showed that the prevalence of non-alcoholic fatty liver disease and diabetic neuropathy differed among subgroups. Dennis et al. (12) replicated the five clusters and identified differences in glycemic progression.
Ahlqvist and colleagues in 2020 revisited the sub-grouping technique in several populations and reported differences among groups in diabetic complications such as retinopathy, neuropathy, kidney disease, and fatty liver (13). To reduce the number of clinical markers required, Kahkoska et al. (14) tested the clustering technique in three global trials (DEVOTE, LEADER, and SUSTAIN-6) with recent-onset diabetes, based on three variables: age at diabetes diagnosis, baseline glycated hemoglobin (HbA1c), and body mass index (BMI). The four T2D clusters could be fully replicated, and the risk of major adverse cardiovascular events and death differed significantly among clusters during follow-up. The risk of nephropathy also differed among clusters. Using the clinical variables HbA1c, BMI, age at onset of diabetes, HOMA2B, and HOMA2IR, newly diagnosed diabetes patients in the United States and China have been clustered into four subgroups, confirming that different ethnic groups can be clustered by the same variables (15). Although many previous studies have shown the promise of such sub-grouping techniques for precision medicine, the clustering targets were usually newly diagnosed patients. Because the sub-grouping technique has proven quite robust across populations, we explored its use in both newly diagnosed and long-term diabetes patients, aiming to broaden its application domain, since admitted patients are at various stages of the disease.
Current diabetes treatments mainly focus on controlling blood glucose levels, but as many studies have elucidated the specific characteristics of T2D subgroups, more informed treatment of diabetes-related conditions, such as kidney, cardiovascular, and cerebrovascular diseases, has become promising. We believe that a precise characterization of T2D patient populations in clinical settings would benefit the understanding of T2D pathophysiology and improve clinical management. Therefore, the objectives of this study were to cluster newly diagnosed and previously known Chinese T2D patients, and to explore the clinical characteristics, the risks of diabetic complications, and the medication treatment in each cluster. We aimed to identify different subgroups of T2D patients through the K-means clustering method with five commonly used clinical variables (HbA1c, BMI, age at diagnosis, HOMA2-B, and HOMA2-IR), and then to compare clinical characteristics, identify individuals with increased risk of complications, and distinguish the medication treatment in each subgroup.

Study Population
This was a cross-sectional study conducted from January 2018 to November 2019 at the No.1 Shenzhen People's Hospital. Medical records of 1,240 participants were collected on a first-come, first-chosen basis within a one-year time window, and included anthropometric measurements, laboratory tests, diagnostic information on complications, and medication regimens. All participants involved in this study were diagnosed with type 2 diabetes and were aged 18 years or above. This study was approved by the medical research ethics committee at Shenzhen People's Hospital. Informed consent was obtained from the participants for the anonymous use of their information in medical research.

Measurements
The height and weight of participants were measured using an automatic anthropometer, and the body mass index (BMI) was calculated as body weight divided by height squared (kg/m²). Blood pressure was measured by trained nurses with a blood pressure monitor. Laboratory measurements were taken in a fasting state following the standardized procedures during the health examination. Biochemical indices, such as fasting plasma glucose (FPG), uric acid (UA), total cholesterol (TC), triglycerides (TG), high-density lipoprotein (HDL) cholesterol, and low-density lipoprotein (LDL) cholesterol, were measured by the hexokinase method, and C-peptide (CP) concentrations were measured by radioimmunoassay. FPG and CP were used to calculate the homeostasis model assessment 2 estimates of insulin resistance (HOMA2IR) and of β-cell function (HOMA2B) with the HOMA2 calculator v2.2.3 at www.dtu.ox.ac.uk (16).
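As a minimal illustration of the BMI calculation described above (weight in kilograms divided by the square of height in metres), the formula can be sketched as follows. The function name and the example values are illustrative assumptions and are not taken from the study.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI in kg/m^2: body weight divided by height squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical participant (not study data): 70 kg, 1.75 m
print(round(body_mass_index(70.0, 1.75), 1))  # 22.9
```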

Definitions of Diabetes and Diabetic Complications
The diagnosis of diabetes followed the internationally adopted standards set by the World Health Organization (WHO) Diabetes Expert Committee in 1999 (17). Estimated glomerular filtration rate (eGFR) was calculated using the chronic kidney disease epidemiology collaboration (CKD-EPI) equation and was used to classify kidney function as normal (stage 1, eGFR >90 ml/min per 1.73 m²) or abnormal (stage 2, eGFR 60-90 ml/min per 1.73 m², or stage 3, eGFR <60 ml/min per 1.73 m²) (18,19). The urinary albumin to creatinine ratio (UACR) was used to describe the progression of albuminuria. UACR values less than 30, between 30 and 300, and above 300 were defined as normal, microalbuminuria, and macroalbuminuria, respectively. Ankle-brachial index (ABI) measurements were taken in the supine position after a five-minute rest and were used to identify lower extremity arterial disease (LEAD), including hardened vessels and arterial occlusion (20). ABI values between 0.9 and 1.3 were considered normal, and values larger than 1.3 or less than 0.9 were considered abnormal (21). Both diabetic retinopathy and diabetic peripheral neuropathy (DPN) were defined following the American Diabetes Association's criteria (22,23). The diagnosis of DPN was based on multiple symptoms and diabetes history. When small fibers were affected, the symptoms usually involved pain and dysesthesia, which can be assessed with the pinprick and temperature sensation tests. When large fibers were affected, the symptoms usually involved numbness and loss of protective sensation, which can be confirmed by the vibration perception and 10-g monofilament tests. For diabetic retinopathy, the diagnosis was based on an initial dilated and comprehensive eye examination performed by an ophthalmologist or optometrist.
Patients were scheduled for diagnostic examinations during the first visit, on the basis of which the initial status of diabetic retinopathy was determined (22,23).
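The eGFR, UACR, and ABI thresholds given in this section can be sketched as simple classification rules. This is only an illustrative reading of the stated cut-offs: the function names are hypothetical, the treatment of values lying exactly on a boundary (for example a UACR of exactly 30 or 300) is an assumption, and UACR is taken in whatever unit the text intends, since no unit is stated.

```python
def kidney_function_stage(egfr: float) -> str:
    """Classify kidney function from eGFR (ml/min per 1.73 m^2)."""
    if egfr > 90:
        return "stage 1 (normal)"
    if egfr >= 60:  # 60-90, boundaries assumed inclusive
        return "stage 2 (abnormal)"
    return "stage 3 (abnormal)"


def albuminuria_category(uacr: float) -> str:
    """Classify albuminuria from UACR (unit not stated in the text)."""
    if uacr < 30:
        return "normal"
    if uacr <= 300:  # assumption: exactly 300 counts as microalbuminuria
        return "microalbuminuria"
    return "macroalbuminuria"


def abi_is_normal(abi: float) -> bool:
    """ABI between 0.9 and 1.3 is normal; outside that range is abnormal."""
    return 0.9 <= abi <= 1.3


# Hypothetical values, not study data
print(kidney_function_stage(75), albuminuria_category(120), abi_is_normal(1.4))
```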

Cluster Analysis
The data cleaning process followed four steps.
Step-1, twenty-two participants without type 2 diabetes were removed from the dataset as they were not part of the target group.
Step-2, twenty-three individuals with missing information in variables such as BMI and HbA1c were removed.
Step-3, to improve the clustering quality, one hundred thirty-four participants with values beyond the defined range of the HOMA2 calculator were removed before the feature engineering.
Step-4, by checking the variables individually, one extreme outlier was identified and removed. After the data cleaning procedures, 1,060 participants were included in the cluster analysis, the result of which was used for two additional analyses focused on medication usage and diabetes complications. For the comparison of diabetes complications, 1,060 participants were included in the analysis. For the medication usage comparison, after the removal of participants who did not have the required information, 486 participants were included in the medication analysis (Figure 1).
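The four cleaning steps above can be checked with a short tally. The sketch below starts from the 1,240 records reported under Study Population and subtracts the stated exclusions; the variable names and the short reason labels are illustrative.

```python
initial_records = 1240  # from the Study Population section

# (reason, number removed) for Step-1 to Step-4
exclusions = [
    ("without type 2 diabetes", 22),
    ("missing variables such as BMI or HbA1c", 23),
    ("beyond the HOMA2 calculator range", 134),
    ("extreme outlier", 1),
]

remaining = initial_records
for reason, removed in exclusions:
    remaining -= removed
    print(f"{reason}: removed {removed}, remaining {remaining}")

# The tally matches the 1,060 participants used for clustering
assert remaining == 1060
```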

Statistical Analysis
The k-means method was used to cluster the data according to five variables: age at diagnosis, BMI, HbA1c, HOMA2B, and HOMA2IR. All data were scaled to mean zero and unit variance before clustering. K-means clustering was performed using the Hartigan and Wong algorithm implemented in the 'k-means' package, and the optimal number of clusters was determined by the elbow method from the 'NbClust' package in R. To apply the elbow method to the data set, the number of clusters was plotted against the total explained variance. Along the resulting curve, the point at which the gain in explained variance slowed was taken as the target, and the number of clusters corresponding to that point was considered the reasonable number of clusters. Based on the characteristics of the clusters described by Ahlqvist et al. (9), patients were assigned to explainable clusters: severe insulin-deficient diabetes (SIDD), severe insulin-resistant diabetes (SIRD), mild obesity-related diabetes (MOD), or mild age-related diabetes (MARD). For descriptive statistics between subgroups, the chi-square test was used for categorical data. Skewed data were log-transformed before analysis. To account for the impact of age, the linear model incorporated age as a correction factor when comparing the clinical features of the subgroups. To understand the impact of gender on the diabetes-related complications of the subgroups, odds ratios (ORs) were modeled using logistic regression. P-values less than 0.05 were considered statistically significant. Statistical analyses were done with R version 3.5.3.
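The scaling, clustering, and elbow steps described above can be illustrated with a small self-contained sketch. It is not the R code used in the study: it uses Lloyd's algorithm rather than the Hartigan and Wong algorithm, a farthest-first initialisation chosen here for reproducibility, and synthetic five-variable data with four artificial groups. Explained variance is taken as 1 minus the within-cluster sum of squares divided by the total sum of squares, which is an assumed reading of "total explained variance".

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Scale every column to mean 0 and unit variance (population SD)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(rows, k, rng):
    """Start from one random row, then repeatedly add the row that is
    farthest from its nearest already-chosen centroid."""
    centroids = [list(rng.choice(rows))]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, within-cluster SS)."""
    centroids = farthest_first(rows, k, random.Random(seed))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = [mean(col) for col in zip(*members)]
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, wss


# Synthetic data (NOT study data): four artificial groups, five variables
rng = random.Random(42)
centers = [[0, 0, 0, 0, 0], [8, 8, 0, 0, 0], [0, 8, 8, 0, 0], [0, 0, 8, 8, 8]]
data = [[c + rng.gauss(0, 1) for c in ctr] for ctr in centers for _ in range(50)]
scaled = standardize(data)

# Elbow method: explained variance for k = 1..6
total_ss = kmeans(scaled, 1)[2]
explained = {k: 1 - kmeans(scaled, k)[2] / total_ss for k in range(1, 7)}
for k, ev in explained.items():
    print(k, round(ev, 3))
```

On data like these, the gain in explained variance should flatten after k = 4, which is the kind of elbow the text describes. With real clinical data the elbow is usually less sharp, so the choice of k also depends on whether the clusters are interpretable.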

Cluster Analysis
General characteristics of the population and clinical data for these patients are shown in Table 1. In this study, the age of participants ranged from 24 to 99, with most individuals being male (61%). According to the time of diabetes onset, around 16% of the patients were newly diagnosed and 84% had long-term diabetes (Figure S1).
One thousand sixty patients were classified into four clusters, each of which had distinctive clinical features (Figures 2 and S2). Cluster 1, comprising 21% of the patients and characterized by high age, low BMI, low insulin secretion (low HOMA2-B index), and poor metabolic control (the highest HbA1c and fasting blood glucose levels), was identified as severe insulin-deficient diabetes (SIDD). Cluster 2 comprised 21% of patients and was labeled as severe insulin-resistant diabetes (SIRD), characterized by insulin resistance (high levels of HOMA2-B and HOMA2-IR), high BMI, and good metabolic control. Cluster 3, comprising 25% of patients, was labeled as mild obesity-related diabetes (MOD), characterized by obesity, low age, average β-cell function, and insulin resistance. Cluster 4, the largest subgroup (33%), was labeled as mild age-related diabetes (MARD) and included older participants with modest metabolic control, insulin resistance, and β-cell function, and the lowest BMI.

Comparison of Clinical Features in Subgroups
The clinical characteristics of subgroups can be found in Table 1.
This section focuses on the characteristics of the subgroups in terms of blood pressure, renal function, and lipids. Systolic blood pressure (SBP) and diastolic blood pressure (DBP) followed a similar trend, both being highest in the SIDD subgroup. In terms of DBP levels, there was no significant difference between SIRD and MARD (p = 0.104) (Tables S1, S2).
Regarding renal function, patients in SIRD had the lowest eGFR and the highest uric acid (UA) compared with those in other subgroups. The urinary albumin to creatinine ratio (UACR) was highest in the MOD cluster, whereas MARD did not differ significantly from SIDD (p = 0.174) or SIRD (p = 0.334) (Table S5).
Concerning lipids, total cholesterol (TC) was highest in patients assigned to the SIDD cluster and was significantly higher than in MARD (p <0.001), whereas there was no significant difference between MOD and MARD (Table S6). Patients in the MARD subgroup had the lowest triglycerides (TG) and the highest HDL-cholesterol (HDL) levels compared with those in other subgroups (Tables S7-S8). LDL-cholesterol (LDL) was lowest in patients assigned to the SIRD cluster compared with all other clusters, and there was no significant difference between the MOD and MARD clusters (p = 0.373) (Table S9).
Separate analyses stratified by gender were performed to illustrate the risk trends of diabetes-related complications in the subgroups. For male participants, there was no significant difference (p >0.05) in albuminuria or retinopathy among subgroups (Table 2). For the other complications, MARD consistently had a higher risk of disease onset than the other subgroups (ORs >1; p <0.05). For female participants, the risk of retinopathy was similar among subgroups (p >0.05), whereas for the other complications the risk trends varied (Table 2).
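To make the reading of the odds ratios concrete, the sketch below computes an unadjusted OR with a Wald 95% confidence interval from a 2×2 table. For a single binary predictor this equals the OR that a logistic regression would give, but it is a simplification of the gender-stratified models used in the study. The counts are hypothetical and the function name is illustrative.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: subgroup of interest, complication present
    b: subgroup of interest, complication absent
    c: reference subgroup, complication present
    d: reference subgroup, complication absent
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical counts (NOT study data)
estimate, lower, upper = odds_ratio(30, 70, 15, 85)
print(round(estimate, 2), lower < 1 < upper)
```

An OR above 1 whose interval excludes 1 corresponds to the "ORs >1; p <0.05" pattern reported in the text.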

Comparison of Medication Application in Subgroups
The detailed comparison of nine types of glucose-lowering drugs, statins, and antihypertensive drugs across the subgroups is shown in Table 3.
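Because the use of each drug across the four clusters forms a 2×4 table of counts (used or not used, by cluster), a comparison of this kind can be sketched with the Pearson chi-square statistic. The counts below are hypothetical and do not come from Table 3; the function names are illustrative, and the p-value helper is valid only for 3 degrees of freedom, which is what a 2×4 table gives.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of the chi-square distribution with 3 df."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


# Hypothetical counts (NOT study data): rows = drug used / not used,
# columns = four clusters, so df = (2 - 1) * (4 - 1) = 3
table = [[40, 55, 70, 60],
         [60, 45, 30, 40]]
stat = chi_square_statistic(table)
print(round(stat, 2), chi2_sf_df3(stat) < 0.05)
```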

DISCUSSION
In this study, the data-driven approach to distinguishing the status of diabetes was reproducible, and the distribution of patients was similar to that of the Swedish cohort (9). To broaden the application of this clustering approach in clinical settings, this study tested a population that included both newly diagnosed patients and patients with long-term diabetes, unlike previous studies that covered only newly diagnosed individuals (9,11,12,15). By exploring a mixed population, we hope to bring the approach closer to clinical practice, as many admitted patients are not newly diagnosed. To evaluate the performance of the clustering technique on the data set, we checked the mathematical stability of the clustering and the interpretation of the resulting clusters. Overall, the data-driven approach could be used on the mixed population and yielded decent stability. Notably, the clinical characteristics of the clusters in this study were similar to those in previous studies, with a slightly higher proportion of patients in the severe insulin-deficient diabetes (SIDD) cluster (21%) and the severe insulin-resistant diabetes (SIRD) cluster (21%). This discrepancy may result from the higher proportion of hospitalized and severe participants in our study than in other study populations. The highest prevalence of stages 2 and 3 chronic kidney disease (CKD) was in the SIRD subgroup, which had the lowest eGFR and the highest uric acid. A possible explanation is that the defining feature of the SIRD group is insulin resistance, which could lead to water-sodium retention, glomerular hypertension, hyperfiltration, and hyperuricemia, thus accelerating the progression of CKD. The trend of risk levels associated with CKD differed between males and females (Table 2). More subgroups were associated with an increased risk of CKD in males than in females, which suggests that the prognosis of CKD may depend on gender.
This is supported by research showing that female gender was associated with a slower decline in GFR and better patient and renal survival in a 10-year follow-up study (24). In 2016, summarizing multiple publications on CKD research, Goldberg and Krause reported that the mortality risk of CKD was higher in males than in females (25).
hyperglycemia in diabetes, such as the polyol and the hexosamine pathways, the de novo synthesis of diacyl-glycerol, and advanced glycosylation end products (AGEs), can promote the development of retinopathy (27). Another possible reason is that hypertension increases the expression of pro-inflammatory molecules in the retina (28). Albuminuria, which is a powerful predictor of renal and cardiovascular risk (29), and especially microalbuminuria, was highest in the SIDD subgroup. Since hypertension and hypercholesterolemia are also causal risk factors for cardiovascular diseases (30), it is no surprise that the SIDD group had a high rate of cardiovascular diseases. According to the literature, retinopathy can often precede diabetic nephropathy in patients with T2D, which is consistent with the clustering result that the SIDD subgroup had the highest prevalence of both albuminuria and retinopathy. Regarding the effect of gender, one study indicated that females showed a significantly higher prevalence and that female gender was an independent factor in disease development (31). The prevalence of lower extremity arterial disease (LEAD) was highest in the SIDD subgroup (13%). Retinopathy and albuminuria also had the highest prevalence in SIDD, and the literature indicates that they may be independent risk factors for LEAD (32-34). Similar to other studies, this study found that blood pressure and blood lipids, both highest in SIDD, were key risk factors for LEAD (34,35).
Regarding medication strategy, the subgroups received different treatments, consistent with the physiological characteristics of the patients. The proportions of patients using insulin, AGI, and statins were significantly higher in SIDD than in the other subgroups. Individuals in SIDD had very low β-cell reserves (the lowest HOMA2-B level) and were therefore treated preferentially with insulin. AGI drugs can reduce the amount of insulin needed to control postprandial hyperglycemia by slowing the digestion of complex carbohydrates and sucrose, and are therefore suitable for insulin-deficient patients (36). The high use of statins in SIDD may reflect the treatment of the prevalent hyperlipidemia in this subgroup.
In MOD, which had the highest BMI and mild diabetic symptoms, metformin and GLP-1 were the medications used most frequently. Studies have shown that weight loss can effectively control the course of diabetes (37). As metformin and GLP-1 have significant effects on weight loss, these medications are suitable for MOD. DPP-4 is characterized by a low risk of hypoglycemia and a high compliance rate, and is therefore suitable for age-related diabetes (MARD) (38). The SIRD subgroup, characterized by the most severe insulin resistance and the best β-cell function, was treated most frequently with metformin. As metformin can increase insulin sensitivity, the use of insulin in this subgroup was the lowest among all subgroups (39). Overall, metformin, the first-line treatment for type 2 diabetes, had the highest proportion of use in all four subgroups, all above 69.4%.
To extend our study and comprehensively aid the clinical treatment of diabetes, long-term follow-up studies are necessary to explore disease progression and treatment response. Additionally, newly discovered biomarkers from advanced techniques, including genomics, transcriptomics, and gut microbiota analysis, could be considered to refine the current stratification strategy. To make the approach feasible in clinical conditions, a decision-making support system is necessary.
In conclusion, this study demonstrated that the data-driven approach to clustering both newly diagnosed patients and patients with long-term diabetes can yield consistent results. Each cluster had significantly different patient characteristics, risk of diabetic complications, and medication treatment. These findings have potential value for clinical trial enrolment and early treatment stratification.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Medical Research Ethics Committee at Shenzhen People's Hospital. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
LX and FP performed data analysis, interpretation and manuscript writing. QL supervised the data collection and research collaboration. XD, JR, HW and SY participated in data collection and literature search. LJ and SZ designed the experiment and supervised the overall progress. All authors contributed to the article and approved the submitted version.