An early prediction model for type 2 diabetes mellitus based on genetic variants and nongenetic risk factors in a Han Chinese cohort

Aims We aimed to construct a prediction model for type 2 diabetes mellitus (T2DM) in a Han Chinese cohort using a genetic risk score (GRS) and a nongenetic risk score (NGRS). Methods A total of 297 Han Chinese subjects who were free from T2DM were selected from the Tianjin Medical University Chronic Disease Cohort for a prospective cohort study. Clinical characteristics were collected at baseline, and subjects were followed for 9 years. Genome-wide association studies (GWASs) were performed for T2DM-related phenotypes. The GRS was constructed using 13 T2DM-related quantitative trait single nucleotide polymorphism (SNP) loci derived from the GWASs, and the NGRS was calculated from 4 biochemical indicators that were identified as independent risk factors by multifactorial Cox regression. Results HOMA-IR, uric acid, and low HDL were independent risk factors for T2DM (HR > 1; P < 0.05), and the NGRS model built from these three nongenetic risk factors had an area under the ROC curve (AUC) of 0.678. High fasting glucose (FPG > 5 mmol/L) was a key risk factor for T2DM (HR = 7.174, P < 0.001), and adding it to the NGRS model significantly improved the AUC (from 0.678 to 0.764). By adding 13 SNPs associated with T2DM to the GRS prediction model, the AUC increased to 0.892. The final combined prediction model, obtained as the arithmetic sum of the two models, had an AUC of 0.908, a sensitivity of 0.845, and a specificity of 0.839. Conclusions We constructed a comprehensive prediction model for T2DM from a Han Chinese cohort. Along with independent risk factors, the GRS is a crucial element in predicting the risk of T2DM.


Introduction
Diabetes is a group of clinically and genetically heterogeneous diseases that are diagnosed by abnormally high blood glucose levels. It is a prevalent and rapidly growing noncommunicable chronic disease worldwide: the number of affected adults is expected to increase by 50% between 2017 and 2045, reaching a total of 693 million (1). In China, approximately 92.4 million adults are already affected by diabetes (2), and approximately 90% of them have T2DM. It is well accepted that genetic and lifestyle factors contribute to T2DM (3). Numerous genetic studies have shown that there is a clear genetic predisposition to diabetes and its complications (4). In recent years, researchers have identified more than 100 susceptibility genes and 200 susceptibility loci associated with the occurrence, development, and prognosis of T2DM by linkage analysis and large-scale GWASs (5), and the polygenic risk score calculated from these genes can predict the likelihood of developing T2DM (6). Sixty percent of the genes associated with T2DM found in Asian populations could be validated in Chinese populations (7). Hu et al. (8) confirmed the association of eight genes, namely PPARG, KCNJ11, CDKAL1, CDKN2A-CDKN2B, IDE-KIF11-HHEX, IGF2BP2, and SLC30A8, with the prevalence of T2DM in a Chinese population study. Xu et al. (9) found that CDKAL1 (rs7756992) and SLC30A8 (rs13266634, rs2466293) were significantly associated with T2DM. In addition to genetic susceptibility, factors highly associated with the development of T2DM include age (10), obesity (11), lipid metabolism disorders (12), waist circumference (13), clinical biochemical indicators such as uric acid (14), and environmental factors such as lifestyle (15) and dietary habits (16).
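Prediction of the risk of developing diabetes is important because of the large individual differences and the high number of complications. Diabetes models have been successfully established in some countries, such as the Framingham risk score diabetes model in the United States (17); the prediction model of diabetes onset in Mexican-descended Americans and non-Hispanic Caucasians by Stern (18); and the prediction model of diabetes onset risk in Japanese Americans by McNeely (19). There are two main T2DM models in China. Wu et al. counted the risk factors for diabetes onset in China over the past 20 years to establish the first T2DM risk assessment model for the Chinese population. In 2009, based on the Framingham cardiovascular prediction model, Chien et al. (20) established a T2DM risk prediction model for the Taiwanese population. These earlier models incorporated only demographic indicators and laboratory measures of risk factors. With the development of GWAS, later models were built to include genetic factors as well, such as the Framingham cohort model of Meigs et al. (21), which added 18 SNPs as predictors. Therefore, we need an early prediction model with high predictive value. Previous studies showed only a mild increase in AUC when SNPs were added to the prediction model. Although GWASs have been performed in Han Chinese, many genes did not show a high genotype relative risk (GRR) because of their low minor allele frequency (MAF) in Han Chinese. We conducted a prospective cohort study in a Han Chinese cohort, adding insulin resistance phenotypes and Chinese-specific SNPs to the prediction model.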

Study design and population
The research was a prospective cohort study involving 297 participants from the Tianjin Medical University Chronic Disease Cohort. A total of 7,032 participants were recruited between 2006 and 2010. We selected subjects who did not have diabetes in 2010 and had complete follow-up information up to 2015, coded and sorted them by computer-generated random numbers, and chose the top 305 for genotyping. Follow-up continued for a further 4 years until 2019; 8 people were lost to follow-up, leaving 297 in the final analysis. During follow-up, 98 incident T2DM cases were identified, corresponding to a 9-year cumulative incidence of 32.9%.
This study received approval from the Ethics Committee of Tianjin Medical University, and all participants signed informed consent forms.

Diagnostic criteria
We defined diabetes as a fasting glucose level of 7 mmol/L or higher, or a two-hour glucose level of 11.1 mmol/L or higher, and defined impaired fasting glucose as a fasting glucose level of 6.1 to 6.9 mmol/L (22). In accordance with the Chinese Hypertension Prevention Guide, hypertension was diagnosed based on a systolic blood pressure (SBP) ≥ 140 mmHg and/or diastolic blood pressure (DBP) ≥ 90 mmHg, or a history of hypertension (23). The diagnostic criteria for hyperuricemia were gender-specific, with a threshold of ≥ 420 μmol/L for males and ≥ 360 μmol/L for females, in the absence of any drugs affecting uric acid metabolism (24).
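As an illustration only, the thresholds above can be written as small classification helpers. This is a minimal sketch; the function names, argument names, and return labels are hypothetical and do not come from the study's own software.

```python
from typing import Optional


def classify_glucose(fpg: float, two_hour: Optional[float] = None) -> str:
    """Classify glycemic status from glucose values given in mmol/L."""
    if fpg >= 7.0 or (two_hour is not None and two_hour >= 11.1):
        return "diabetes"
    # The text gives 6.1 to 6.9 mmol/L for impaired fasting glucose; the upper
    # bound is written as < 7.0 here (an assumption) so that unrounded values
    # such as 6.95 are not left unclassified.
    if 6.1 <= fpg < 7.0:
        return "impaired fasting glucose"
    return "normal"


def has_hypertension(sbp: float, dbp: float, history: bool = False) -> bool:
    """SBP >= 140 mmHg and/or DBP >= 90 mmHg, or a history of hypertension."""
    return sbp >= 140 or dbp >= 90 or history


def has_hyperuricemia(sua_umol_l: float, sex: str) -> bool:
    """Sex-specific serum uric acid thresholds (umol/L) as stated in the text."""
    thresholds = {"male": 420.0, "female": 360.0}
    if sex not in thresholds:
        raise ValueError("sex must be 'male' or 'female'")
    return sua_umol_l >= thresholds[sex]
```

The exclusion of subjects taking drugs that affect uric acid metabolism is not modeled here; it would have to be applied before calling `has_hyperuricemia`.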

Genotyping and SNP selection
Blood samples were collected from all subjects, and genomic DNA was extracted using the high-salt method and subsequently genotyped using the Infinium Asian Screening Array-24 v1.0 BeadChip. After genotyping, systematic quality control analyses were carried out using PLINK 1.90 software (25): (i) Quality control procedures for genotypes: identifying SNPs with a high missingness rate (>10%) and individuals with high missing rates (>5%); checking for discrepancies between the sex recorded in the data and the sex inferred from X chromosome heterozygosity/homozygosity rates (the values for males and females should be >0.8 and <0.2, respectively); identifying autosomal SNPs with a MAF < 0.05 and significant deviation from Hardy-Weinberg equilibrium (HWE) (P < 1.0 × 10⁻⁴); identifying individuals who deviated ±3SD from the samples' heterozygosity rate mean; and calculating the identity by descent (IBD) of all sample pairs, setting a pi-hat threshold of 0.2. (ii) Quality control for phenotypes: phenotypes included threshold traits (T2DM or not) and continuous diabetes-related traits (FPG, HbA1c, insulin, HOMA-IR, QUICKI). Extreme values (values beyond the mean ±3SD) in the samples were excluded during quantitative trait correlation analysis. Following the quality control procedures, 306,659 SNPs and 273 samples were retained out of the initial 658,849 SNPs and 297 samples for further association analyses.
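The phenotype quality control step (excluding values beyond the mean ±3SD) can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import List, Sequence


def within_three_sd(values: Sequence[float]) -> List[float]:
    """Keep only the values lying within mean +/- 3 sample standard deviations."""
    if len(values) < 2:
        # stdev() needs at least two observations; nothing can be filtered.
        return list(values)
    center = mean(values)
    spread = stdev(values)  # sample SD (n - 1 denominator), an assumption
    return [v for v in values if abs(v - center) <= 3 * spread]


# Usage: twenty identical fasting glucose values plus one extreme value.
fpg_values = [5.0] * 20 + [50.0]
kept = within_three_sd(fpg_values)
```

In this toy input the extreme value lies more than 3 SD from the mean, so it is dropped and the other twenty values are kept. With very small samples a single outlier can never exceed 3 SD, so the filter is only meaningful for reasonably large groups.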

Weighting approaches for constructing the wGRS and wNGRS
We developed the GRS from SNPs that were highly correlated with T2DM-related phenotypes in the genome-wide association analysis. We excluded SNPs that showed linkage disequilibrium (LD) with each other and estimated the effect of each SNP by logistic regression of T2DM on the number of risk alleles. The weighted genetic risk score (wGRS) was calculated by multiplying the number of risk alleles (0, 1, or 2) for each SNP by the natural logarithm of the OR for that allele and summing across all SNPs, as described in formula (1). The weighted nongenetic risk score (wNGRS) was calculated using the same principle as the wGRS. For each individual, the wNGRS was calculated as the sum of risk factors weighted by the HR-derived coefficient (β) of each nongenetic risk factor in Cox regression, as described in formula (2). Assuming that genetic and nongenetic factors are independent, we added the weighted genetic score to each risk algorithm to obtain a combined nongenetic and genetic score. The comprehensive risk scoring model is the sum of the GRS and NGRS models, as described in formula (3).

$$\mathrm{wGRS} = \sum_{i=1}^{n} \beta_i G_i \tag{1}$$

where $\beta_i$ is the weight of the ith SNP (the natural logarithm of its OR) and $G_i$ is the number of risk alleles at the ith SNP, taking values of 0, 1, or 2.

$$\mathrm{wNGRS} = \sum_{i=1}^{m} \beta_i S_i \tag{2}$$

where $\beta_i$ is the weight of the ith nongenetic risk factor and $S_i$ is the status of the ith nongenetic risk factor: if the individual has the risk factor, the value is 1; if not, the value is 0.

$$\mathrm{Combined\ score} = \mathrm{wGRS} + \mathrm{wNGRS} \tag{3}$$
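The weighted scores described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the example weights in the usage lines are made-up numbers, not values from Table 2 or Table 3.

```python
import math
from typing import Sequence


def weighted_grs(risk_allele_counts: Sequence[int], odds_ratios: Sequence[float]) -> float:
    """Sum over SNPs of ln(OR) times the number of risk alleles (0, 1, or 2)."""
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is required per SNP")
    if any(g not in (0, 1, 2) for g in risk_allele_counts):
        raise ValueError("risk allele counts must be 0, 1, or 2")
    return sum(math.log(oratio) * g for g, oratio in zip(risk_allele_counts, odds_ratios))


def weighted_ngrs(statuses: Sequence[int], betas: Sequence[float]) -> float:
    """Sum over nongenetic factors of the Cox weight times a 0/1 status flag."""
    if len(statuses) != len(betas):
        raise ValueError("one weight is required per risk factor")
    if any(s not in (0, 1) for s in statuses):
        raise ValueError("risk factor status must be 0 or 1")
    return sum(b * s for s, b in zip(statuses, betas))


def combined_score(grs: float, ngrs: float) -> float:
    """Arithmetic sum of the genetic and nongenetic scores."""
    return grs + ngrs


# Usage with illustrative (made-up) weights for three SNPs and three factors.
grs = weighted_grs([2, 0, 1], [2.0, 1.5, 1.0])
ngrs = weighted_ngrs([1, 0, 1], [0.5, 0.7, 1.2])
total = combined_score(grs, ngrs)
```

Summing the two scores directly relies on the independence assumption stated in the text; correlated genetic and nongenetic factors would call for joint estimation of the weights.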

Power calculation
We performed power calculations using PASS 2021 (NCSS, LLC, Kaysville, Utah; http://www.ncss.com/software/pass/procedures/), using a two-sided test with α = 0.05. Of the 297 participants in our study, 98 were diagnosed with T2DM during the research period. Taking into account the prevalence of T2DM of 12.4% in China reported by Wang et al. (p0 = 0.124) (26), our sample size provided a power of 0.83 (e.g., for OR = 2.2 or less, depending on the distribution of the risk factor).

Statistical analysis
The SPSS 26.0 statistical software package was employed for data analysis. Missing data were imputed using the expectation-maximization algorithm (27). Continuous variables were described as mean (standard deviation) or median (quartiles) and compared using an independent samples t-test or a rank sum test, respectively. Categorical variables were compared using the chi-square test. Independent risk factors were determined by stepwise Cox regression, and differences with P < 0.05 were considered statistically significant. Additionally, genome-wide associations between diabetes-associated phenotypes and genetic variation were examined using PLINK 1.9, and the corresponding Manhattan and quantile-quantile plots were generated using the "manhattan" and "qqman" libraries in R (v.4.1.3) (28). The prediction model was constructed by logistic regression analysis, and AUC values were used to evaluate the predictive power of the model.
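For readers unfamiliar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as one half. It also computes sensitivity and specificity at a given cut-off. This is an illustrative implementation with hypothetical function names and toy data, not the SPSS procedure used in the study.

```python
from typing import Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], cutoff: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both outcome classes must be present")
    return tp / (tp + fn), tn / (tn + fp)


# Usage on toy data: two non-cases followed by two cases.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_sens, toy_spec = sensitivity_specificity(toy_scores, toy_labels, cutoff=0.3)
```

The pairwise loop is quadratic in sample size, which is acceptable for a cohort of a few hundred subjects; a rank-based formulation would scale better for large datasets.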

Baseline clinical characteristics
The prospective research was conducted on 297 subjects (FPG < 7 mmol/L at baseline, age range 37-91) to establish a 9-year risk prediction model for T2DM (Figure 1). A total of 98 incident cases of T2DM, representing 32.9% of the study population, were identified. In our study, the mean age was 65.61 ± 13.55 years, and 59 subjects (19.9%) already had impaired fasting glucose (fasting glucose 6.1-6.9 mmol/L) at baseline, which may explain the high incidence of diabetes. Compared to the controls, the T2DM group had significantly higher levels of BMI, FPG, IFG, SUA, BUN, ALT, FINS and HOMA-IR. The T2DM group also had significantly lower levels of HDL and QUICKI. Although age, DBP, SCr, CRP, TG, TC, TBIL, and TP levels were higher in the T2DM group than in the non-T2DM group, the differences were not statistically significant. Table 1 shows the baseline characteristics of the study population.

Nongenetic risk factors for T2DM
All variables (excluding collinear variables) with P < 0.05 in the univariate model were entered into the multivariate model by backward stepwise regression, and variables with P < 0.05 were retained in the final model. A Cox regression model revealed that SUA, HDL, and HOMA-IR were independent risk factors for T2DM. The regression coefficients of the factors retained in the final model are presented in Table 2. Additionally, Kaplan-Meier survival analyses showed that higher quartile values of HOMA-IR, SUA, and HDL (defined as their normal high values) significantly affected T2DM onset in our study (Figure 2).

Results from genome-wide association studies
An analysis was performed on 306,659 autosomal SNPs that passed quality control to determine their association with six traits. Manhattan and QQ plots of the GWAS results are shown in Figure 3 and Supplementary Figure S1. No SNP reached the genome-wide significance threshold (P < 5 × 10⁻⁸). Finally, we selected 26 SNPs from the T2DM-related phenotypes (T2DM, FPG, HbA1c, FINS, HOMA, QUICKI) based on P-values, SNP repeatability and the biological significance of known mutations. Among them, rs10164462, rs_17_9691529, and rs76616810 were associated with T2DM, FPG, and HbA1c, and rs8142739 was associated with insulin, HOMA-IR, and QUICKI. In addition, rs_3_192523400, rs11931598, rs17087830, rs16925187, rs1427793, rs_kgp4372010, and rs6066110 all had P < 1 × 10⁻⁴ and were associated with at least two T2DM-related traits (Supplementary Table S1). Information on these SNPs, such as their genome locations, the closest reported genes, MAF and OR values, is shown in Table 3.

Internal validation
Internal validation of the different prediction models was carried out using a bootstrap ten-fold cross-validation method. After 50 repetitions of 10-fold cross-validation, the validated AUC values of the genetic (a), nongenetic (b; c) and comprehensive (d; e) prediction models were 0.872, 0.670, 0.734, 0.873, and 0.887, respectively. The results show that the prediction models have good stability.
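The resampling scheme (50 repetitions of 10-fold cross-validation) can be sketched as an index generator. This is an illustration under stated assumptions: the text does not say how folds were drawn, so plain shuffled, non-stratified folds are assumed, and the function name and seed are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_samples: int, k: int = 10, repeats: int = 50, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each fold of each repetition."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repetition
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


# Usage: 297 subjects, 10 folds, 50 repetitions gives 500 train/test splits.
splits = list(repeated_kfold(297, k=10, repeats=50, seed=1))
```

In practice, each split would be used to refit the model on the training indices and compute the AUC on the test indices, and the reported value would be the average over all splits. With 98 cases among 297 subjects, stratifying the folds by outcome would keep the case proportion stable across folds.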

External validation
We used the Framingham Diabetes Risk Score to assess the risk of developing T2DM in the Han Chinese population in this study. The simple clinical model of the Framingham Diabetes Risk Score includes 9 indicators: age, sex, BMI, family history of diabetes, SBP/DBP, HDL, TG, FPG, and waist circumference (17). Our dataset lacked information on family history of diabetes and waist circumference. When the Framingham diabetes risk prediction model was applied to our study population, the AUC was 0.889 (95% CI: 0.847-0.931). However, the Framingham diabetes risk score uses a cut-off of FPG > 5.5 mmol/L; when we used the same cut-off as our model (FPG > 5 mmol/L), the AUC dropped to 0.761 (95% CI: 0.707-0.815) (Supplementary Figure S3).

Discussion
Most of the existing studies in China have used only traditional laboratory indicators to construct diabetes prediction models, and few studies have used genetic risk factors as predictors. The combined use of SNPs to predict the risk of T2DM has been reported in other countries (29-31); in those studies, genetic factors alone yielded an AUC of 0.55 to 0.6, traditional risk factors an AUC of approximately 0.65 to 0.78, and the combination of both an AUC of approximately 0.68 to 0.8. Therefore, there is a need to develop T2DM prediction models that include genetic risk factors in China. The AUCs of our genetic, nongenetic and combined risk prediction models were 0.892, 0.764 and 0.908, respectively. All three results were higher than those of other studies, indicating better predictive validity. Compared to other models, our model is unique in that it contains SNPs that are not common in European populations, and the model has Han-specific markers, which may be one of the reasons for its better performance. Adding our genotyping data significantly improved the AUC of the prediction model (from 0.764 to 0.908).
This study included new phenotypic measurements, such as FINS, HOMA-IR, and QUICKI, with HOMA-IR being an independent predictor of T2DM. In addition, some new genetic loci were identified as follows: rs4755984 in the SYT13 gene, rs1547287 in the PTPRD gene, rs76616810 in the RSPO1 gene, rs16925187 in the KDM4C gene, rs_kgp9798346 in the ERBB4 gene, rs79535454 in the GRB10 gene, rs1427793 in the NUAK1 gene, rs62375492 in the YIPF5 gene, and rs10164462 in the XDH gene. We found that the nearest genes to the above SNP loci were associated with metabolism or diabetes. The SYT13 gene, located on chromosome 11, is a member of a large family of synaptic binding proteins. Compared to healthy adults, SYT13 gene expression is downregulated in T2DM patients, and downregulation of this gene decreases islet secretory function and is negatively associated with HbA1c levels in vivo (32). SNP rs154738, located in the intron of PTPRD, had a less significant association with T2DM (P = 9.91 × 10⁻⁶; OR = 2.109, 95% CI = 1.406-3.164). A previous GWAS of T2DM in a Han Chinese population identified PTPRD as a susceptibility gene for T2DM (33). Overexpression of PTPRD in preadipocytes (3T3-L1) inhibits adipogenesis, but this may lead to ectopic fat accumulation and insulin resistance, favoring the development of T2DM. Additionally, in human subjects, a positive correlation was observed between serum RSPO1 levels and fasting C-peptide levels, which is a marker of insulin secretion. RSPO1 levels were also positively correlated with both obesity and insulin resistance (34). Also associated with obesity is KDM4C, located at 9p24.1, a member of the JMJD2 family that promotes preadipocyte differentiation by repressing PPARγ transcriptional activation (35). Latorre et al. (36) found that ERBB4, located at 2q34, had significantly increased expression in the organs of obese people. Although these genes are not directly related to the development of T2DM, approximately 90% of T2DM patients are overweight or obese, and obesity caused by disorders of lipid metabolism is also considered an important risk factor for T2DM development. Variants of GRB10, which is a negative regulator of insulin signaling, have been shown to have a significant association with impaired β-cell function (37). In 2020, Franco et al. (38) identified YIPF5 mutations as a major cause of monogenic diabetes. XDH (XOR), the rate-limiting enzyme in SUA production, is not only highly expressed in hyperuricemia and gout; XOR activity has also been shown to be significantly higher in diabetic patients than in normal adults (39).
In addition, when including the glucose factor FPG > 5 mmol/L, the AUC value of our prediction model in this study was 0.764; the Framingham diabetes risk prediction model had an AUC value of 0.761 for FPG > 5 mmol/L and 0.889 for FPG > 5.5 mmol/L. Although FPG > 5.5 mmol/L may be a better T2DM "predictor", it cannot achieve early prediction, and we should not use it in early prediction models.
The present study also has some limitations. First, all study participants were monitored for at least 4 years, but it is unclear whether they developed diabetes in the first 3 years because data for the period from 2011 to 2013 are missing. Second, we did not perform OGTT screening for the 297 subjects. Since OGTT requires multiple blood draws and patients have a low degree of cooperation at annual physical examinations, the diagnostic criterion for T2DM in our article was based on fasting blood glucose ≥ 7.0 mmol/L. The use of a single glucose measure as an outcome diagnostic criterion may overestimate the prevalence of T2DM, which is one of the limitations of most epidemiological studies. Third, genetic risk factors were selected from a relatively small sample, and some potential bias exists in the study results. Lastly, to enhance the applicability of our model to other populations, further external validation in larger and younger cohorts is needed. We plan to conduct such studies in the future to refine and validate our T2DM prediction model.

Conclusions
Our study provides a comprehensive and accurate prediction model for T2DM risk, highlighting the importance of considering both traditional risk factors and genetic factors in disease prediction. The identification of novel genetic loci associated with T2DM risk also adds to our understanding of the underlying biology of this disease, potentially opening up new avenues for therapeutic intervention and disease prevention.

FIGURE 1 Flow chart of subjects in the prospective study.

TABLE 1
Baseline characteristics of participants with and without incident diabetes.

TABLE 3
Single-SNP association analysis of T2DM. SNP, single nucleotide polymorphism; OR, odds ratio; EAF, effect allele frequency; GRS, genetic risk score; bold indicates effect allele.