Multi-Trait Genomic Risk Stratification for Type 2 Diabetes

Type 2 diabetes mellitus (T2DM) is continuously rising with more disease cases every year. T2DM is a chronic disease with many severe comorbidities and therefore remains a burden for the patient and the society. Disease prevention, early diagnosis, and stratified treatment are important elements in slowing down the increase in diabetes prevalence. T2DM has a substantial genetic component with an estimated heritability of 40–70%, and more than 500 genetic loci have been associated with T2DM. Because of the intrinsic genetic basis of T2DM, one tool for risk assessment is genome-wide genetic risk scores (GRS). Current GRS only account for a small proportion of the T2DM risk; thus, better methods are warranted for more accurate risk assessment. T2DM is correlated with several other diseases and complex traits, and incorporating this information by adjusting effect size of the included markers could improve risk prediction. The aim of this study was to develop multi-trait (MT)-GRS leveraging correlated information. We used phenotype and genotype information from the UK Biobank, and summary statistics from two independent T2DM studies. Marker effects for T2DM and seven correlated traits, namely, height, body mass index, pulse rate, diastolic and systolic blood pressure, smoking status, and information on current medication use, were estimated (i.e., by logistic and linear regression) within the UK Biobank. These summary statistics, together with the two independent training summary statistics, were incorporated into the MT-GRS prediction in different combinations. The prediction accuracy of the MT-GRS was improved by 12.5% compared to the single-trait GRS. Testing the MT-GRS strategy in two independent T2DM studies resulted in an elevated accuracy by 50–94%. Finally, combining the seven information traits with the two independent T2DM studies further increased the prediction accuracy by 34%. Across comparisons, body mass index and current medication use were the two traits that displayed the largest weights in construction of the MT-GRS. These results explicitly demonstrate the added benefit of leveraging correlated information when constructing genetic scores. In conclusion, constructing GRS not only based on the disease itself but incorporating genomic information from other correlated traits as well is strongly advisable for obtaining improved individual risk stratification.

Type 2 diabetes mellitus (T2DM) is continuously rising with more disease cases every year. T2DM is a chronic disease with many severe comorbidities and therefore remains a burden for the patient and the society. Disease prevention, early diagnosis, and stratified treatment are important elements in slowing down the increase in diabetes prevalence. T2DM has a substantial genetic component with an estimated heritability of 40-70%, and more than 500 genetic loci have been associated with T2DM. Because of the intrinsic genetic basis of T2DM, one tool for risk assessment is genome-wide genetic risk scores (GRS). Current GRS only account for a small proportion of the T2DM risk; thus, better methods are warranted for more accurate risk assessment. T2DM is correlated with several other diseases and complex traits, and incorporating this information by adjusting effect size of the included markers could improve risk prediction. The aim of this study was to develop multi-trait (MT)-GRS leveraging correlated information. We used phenotype and genotype information from the UK Biobank, and summary statistics from two independent T2DM studies. Marker effects for T2DM and seven correlated traits, namely, height, body mass index, pulse rate, diastolic and systolic blood pressure, smoking status, and information on current medication use, were estimated (i.e., by logistic and linear regression) within the UK Biobank. These summary statistics, together with the two independent training summary statistics, were incorporated into the MT-GRS prediction in different combinations. The prediction accuracy of the MT-GRS was improved by 12.5% compared to the single-trait GRS. Testing the MT-GRS strategy in two independent T2DM studies resulted in an elevated accuracy by 50-94%. Finally, combining the seven information traits with the two independent T2DM studies further increased the prediction accuracy by 34%. Across comparisons, body mass index and current medication use were the two traits that displayed the largest weights in construction of the MT-GRS. These results explicitly demonstrate the added benefit of leveraging correlated information when constructing genetic scores. In conclusion, constructing GRS not only based on the disease itself but incorporating genomic information from other correlated traits as well is strongly advisable for obtaining improved individual risk stratification.

INTRODUCTION
Type 2 diabetes mellitus (T2DM) is a chronic disease with severe comorbidities, such as myocardial infarction, loss of kidney function, blindness, and risk of amputations (1). Globally, the prevalence of T2DM is expected to increase exponentially in developing countries (2,3), and it is a disease that places a severe economic burden on health systems. Accurate disease risk assessment is important for early disease diagnosis for initiating lifestyle changes early in the disease progression or prompt the clinician to treat high-risk patients more aggressively, which is expected to slow down disease progression, reduce disease symptoms, and prevent severe morbidity and mortality. Thus, methods for accurate disease risk assessment are absolutely critical for reducing morbidity and mortality.
Studies have unambiguously shown that T2DM is a complex, multifactorial disease, where an individual's risk of developing the disease is influenced by a combination of genetic variation at multiple sites across the genome acting in concert with environmental factors (4)(5)(6). The heritability of T2DM has been estimated to be 40-70% (7,8), and more than 500 distinct genetic loci have been implicated with T2DM risk (6,(9)(10)(11)(12). As T2DM is greatly impacted by genetics, genomic information has the potential to not only aid with early disease diagnosis but importantly also to stratify patients across disease subtypes (13) to initiate treatment intervention and lifestyle changes early in the disease progression.
During the last decade, an enormous effort has been in method development and construction of disease risk scores based on genomic information (14)(15)(16)(17). However, until recently, these genome-wide genetic risk scores (GRS) have mainly been constructed using a single-trait approach. Because much of the variation within the human genome contributes to a large number of different complex traits and diseases (18), the accuracy of risk stratification can be improved by developing multi-trait (MT)-GRS accounting for the genetic correlation among traits. Using correlated information to construct GRS has theoretically-and to a minor extend empirical-been shown to increase the accuracy of disease risk prediction (6)(7)(8). T2DM is strongly correlated with a range of complex diseases and traits, such as overweight (19), cardiovascular disease (1,(19)(20)(21), hypertension (19,22), and chronic kidney disease (19,23); hence, T2DM is an excellent case for developing accurate GRS by leveraging correlated information.
The objective of the current study was to investigate the predictive performance of a MT-GRS model that combines marker effects from genome-wide association studies (GWAS) of T2DM and a number of correlated traits. The types of information included in this study were body mass index (BMI), height, smoking status, pulse rate, diastolic and systolic blood pressure, and a quantity of current medication use, as the total count of different prescription and over-the-counter medications is a proxy for general health and disease status. The aim of the present study was to investigate whether a MT-GRS model based on loci for multiple correlated traits had increased predictive discriminative power compared with a traditional single-trait (ST)-GRS model. This strategy was first applied within the UK Biobank (UKB) (24), and then extended to include information on two UKB-independent GWAS summary statistics and, finally, a combined model incorporating information from the UKB and the two independent T2DM GWAS data sets.

Phenotype and Genotype Data
Only unrelated British Caucasian individuals from the UKB (24) (n = 335,652 subjects) were used in the current study (excluding individuals with more than 5,000 missing genotype values or if having chromosomal aneuploidy). T2DM status was determined based on in-hospital records (by ICD-10 E.11, UKB data field 41270, which contains both main and secondary diagnoses) and self-reported disease state (UKB data field 20002) counting a total of 18,809 individuals. Seven additional phenotypes were also included: standing height, BMI, diastolic and systolic blood pressure, pulse rate, smoking status, and current medication use (measured as the number of different prescription and over-thecounter medications taken). These phenotypes were all adjusted for sex, age, UKB assessment center, and the first 10 genetic principal components (to account for any cryptic relatedness that were not accounted for by restricting to unrelated Caucasian British individuals), following inverse rank normalization to approximate normality.
Genotyped variants with minor allele frequency <0.01, genotype missingness >5%, or variants within the major histocompatibility complex were excluded from the analyses, resulting in a total of 599,297 genetic variants.

Prediction of Diabetes Risk
T2DM risk was determined using GRS based on either summary statistics obtained within the UKB cohort and other T2DMrelated GWAS studies ( Table 1). The overall workflow is depicted in Figure 1 and is described in detail below.

UKB Summary Statistics
The White-British UKB cohort of unrelated individuals (335,652 subjects) was split into 10 folds with no overlap of samples within each fold, and for each fold, the marker effects for T2DM, standing height, BMI, diastolic and systolic blood pressure, pulse rate, smoking status, and current medication use, were estimated using logistic or linear regression as implemented in PLINK2 (26). In all analyses, the same set of covariates were included as those used during phenotypic adjustment as this has been shown to increase statistical power (27).

Publicly Available Type 2 Diabetes Summary Statistics
Two recently published GWAS for T2DM were identified ( Table 1). Common for the studies were that they did not include UKB data, and therefore provide an independent training set. The regression coefficients were flipped such that the marker effect of the effect allele matched the effect allele within the UKB data. FIGURE 1 | Schematic overview of the research design of the current study. Summary statistics (β) for T2DM and seven information traits were estimated from individual-level genotypic information (X) within the UKB using a 10-fold cross validation scheme. Two external GWAS summary statistics were identified. ST-GRS for T2DM was computed based on either the summary statistics obtained within the UKB or from the two external data sets. Estimates of the heritability (h 2 ) and genetic correlations (r g ) were estimated for T2DM, the seven information traits, and the two external T2DM studies. MT-GRS were computed based on four scenarios (S1-S4), depending on which types of information the predictor variable was adjusted for.

Estimation of Genetic Parameters
Linkage disequilibrium (LD) between the genotyped variants was estimated as the squared Pearson's correlation coefficient (r 2 ) between two genetic variants adjusted for sample size (N) as the standard estimator of the Pearson's correlation coefficient has an upward bias (28). The adjusted squared Pearson's correlation coefficient (r 2 ) is obtained as (28): which was computed with the R package qgg (29). LD scores (l) for all variants within a window size of 5,000 markers (2,500 markers around the i-th variant) were computed as The MT-GRS model relies on selection index theory to obtain marker weights that require estimates of genetic parameters (30). The heritability (h 2 ) and the genetic correlation (r g ) between traits can be computed based on GWAS summary statistics using LD score regression (28). The heritability was estimated as the regression of the summary statistics on the LD score: where Z = n eff × l/m, with l being the LD score (see Equation 2), m is the number of genetic variants, and n eff is the effective number of individuals and is n eff = where af is the allele frequency, and SE b is the estimated standard error of the marker regression estimate. The response variable is y = , whereb is the estimated regression coefficient for the genetic variants [for binary traits, the odds ratios (ORs) were converted tob = log(OR), and SE b = b /P(X < (1 − p)/2) , where P(X < (1 − p)/2) is the normal cumulative distribution given the marker P-value, p (31)]. Similarly, the genetic correlation between traits 1 and 2 can be estimated as: where Z = √ n 1 √ n 2 × l m , and y =b 1 . LD score regression was implemented in the R package qgg (29) and was computed for each of the 10-folds of random data subdivisions for T2DM and the seven information traits ( Table 2), and among the information traits and the publicly available T2DM summary statistics ( Table 1).

ST-GRS
The ST-GRS was computed as, where X i is the i-th column of the genotype matrix containing allelic counts,b i is the estimated marker effect for the i-th marker, and m is the number of variants left after LD pruning (r 2 < 0.1, <0.5, or <0.9) and P-value thresholding (P < 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9, and 0.99). The genetic scoring was performed with the R package qgg (29).

MT-GRS
The accuracy of GRS can be improved by leveraging information from correlated traits by adjusting the marker effects (b) (30). The adjustment of the marker effects for the focal trait (f , i.e., T2DM) is obtained by computing index weights for each marker ( From quantitative genetic theory, selection indices have been developed for MT selection, in which many ST individual genetic effects (i.e., breeding values) are combined with an index weight allowing selection of the individuals with the best MT phenotype (32,33). The optimal weights can be derived as w = V −1 C, where C is a k × 1 column vector of covariances between theb values of the k traits and the true marker effects of the focal trait (b f ), and V is a k × k variance-covariance matrix of theb values: The diagonal elements of variance-covariance matrix, V, are where M is the effective number of chromosomal segments [here M = 60, 000 (30,34)] and N k is the number of observations for trait k. The off-diagonal elements of V for trait k and l are which is the same for the elements of C. Combining Equations (8) and (9), Equation (7) becomes The MT-GRS is then obtained as the sum of adjusted marker effects, MT-GRS was computed by applying LD pruning (r 2 < 0.1, <0.5, or <0.9) and P-value thresholding (P < 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 0.75, and 0.99) based on UKB genotypes and T2DM summary statistics; thus, the same LD pruning and P-value thresholding were applied across traits. Four MT scenarios were applied, resulting in four different predictors (Figure 1): (1) UKB T2DM summary statistics combined with the seven UKB information traits; (2) external T2DM summary statistics [i.e., results from Scott et al. (10) and Zhao et al. (25)] combined with the seven UKB information traits; (3) external T2DM summary statistics combined with the seven UKB information traits and UKB T2DM summary statistics; and (4) UKB T2DM summary statistics combined with the seven UKB information traits and the two external T2DM summary statistics.

GRS Accuracy
The accuracy of ST-GRS and MT-GRS was determined using Nagelkerke's variance explained (R 2 ), where LR is the likelihood ratio comparing two nested logistic regression models, L 0 is the log-likelihood of a model neglecting the GRS, and n is the number of observations. The full model included sex, age, UKB assessment center, the first 10 genetic principal components, and the GRS, whereas the reduced model did not contain the GRS effect. For visualization, the GRS were divided into percentiles, and the disease prevalence within each bin was computed; the OR for each percentile was computed adjusting for sex, age, UKB assessment center, and the first 10 genetic principal components, and the OR was expressed relative to the 50-th percentile.

ST Prediction and Genetic Parameters
The analysis of T2DM was performed using 335,662 unrelated individuals from UKB with more than 18,000 T2DM cases ( Table 2). A larger proportion of T2DM cases were males and smokers; on average, T2DM cases were older than individuals without T2DM, had higher BMI, and on average used more medications than non-diabetic individuals ( Table 2). The UKB cohort was split into 10 training and validation sets, and within-cohort marginal marker effects of common genotyped variants were estimated for each training set. After LD pruning and P-value thresholding, ST-GRS were computed for individuals within the validation sets. The maximum prediction accuracy for ST-GRS was R 2 = 0.032 when using variants with LD r 2 < 0.9 and P < 0.05 (Figure 2; Supplementary Table 2).
Across the 10 training sets, the average heritability for T2DM on the observed scale was 0.07 (0.31 on the liability scale). Seven information traits were included and used in the MT genetic risk scoring ( Table 2). All seven traits showed nonzero heritability estimates (Figure 3A), and the strongest genetic correlation was observed between diastolic and systolic blood pressure ( Figure 3B). Current medication use was the trait that showed the highest genetic correlation to most of the other traits, and only standing height showed negative genetic correlation to the other traits ( Figure 3B).

Leveraging Correlated Information for MT Prediction
The T2DM marginal effects were adjusted using the estimated genetic parameters to compute MT-GRS (Scenario 1; Figure 1). Across the three levels of LD pruning, the predictive ability was generally improved when the marginal SNP effects were adjusted by the seven information traits (Supplementary Figure 1;  Supplementary Table 2). The highest prediction accuracy (R 2 = 0.036) was obtained at LD r 2 < 0.9 and P < 0.999 (Figure 2; Supplementary Table 2), which corresponds to an improved prediction accuracy by 12.5% Next, we estimated the T2DM risk within the UKB using summary statistics from two independent external sets of summary statistics (Figure 1). Both external data sets [Scott et al.  Table 4]; however, comparing the accuracy within the P-value threshold, the accuracy of the MT-GRS model was superior over the ST (Supplementary Table 4). Extending the MT model to also include UKB T2DM summary statistics (Scenario 3, Figure 1), the accuracy was further increased by 50% (from 0.028 to 0.042; Figure 4) and 94% (from 0.016 to 0.031; Figure 4) using the summary statistics of Scott et al. (10) and Zhao et al. (25), respectively.
The MT model trained within the UKB was further extended to also include summary statistics from the two independent T2DM GWAS data sets (Scenario 4; Figure 1). Adjusting the UKB T2DM summary statistics by the seven information traits and the two independent T2DM GWAS data sets resulted in an increase in prediction accuracy from 0.032 to 0.043 (Figure 5;  Supplementary Table 2), which is an increase of 34%.

T2DM Risk Stratification
Stratifying UKB participants based on their T2DM genetic risk showed that a larger proportion of individuals with a T2DM diagnosis were among the top 10% of individuals with highest genetic score when applying the MT strategy (Figure 6). The MT-GRS that in addition to the seven information traits also included information from the independent testing data gave a better stratification of cases by distributing a larger proportion of T2DM cases within the top risk (Figure 6), which also was apparent FIGURE 2 | Variance explained (R 2 ) for type 2 diabetes by ST-GRS and MT-GRS (LD pruning r 2 < 0.9) using P-value thresholding (X-axis). Points indicate mean R 2 for a given threshold, and the surrounding shading indicates the standard error of the mean. ns, non-significant difference between ST and MT, *significant difference between ST and MT.  with a large OR of the top 10% compared to the remaining (Supplementary Figure 4).

DISCUSSION
Precision medicine is predicted to change the way we prevent, diagnose, risk stratify individuals, and treat medical conditions (35,36) through development of targeted preventive or treatment approaches based on the genetic background, biomarkers, environmental exposures, and lifestyle of the individual. Diagnosis and treatment plans based on genetic testing has been effectively applied to several monogenic disorders (37); however, for common complex diseases, genomic information has been far less incorporated. One reason for the lack of incorporating genomic information in disease prevention and diagnosis for complex diseases is because a large proportion of the underlying genetic variation remains unexplained (38,39). In the current study, we investigated whether an MT-GRS approach provided more accurate risk stratification than traditional ST genetic scoring approaches. Adjusting the UKB T2DM marker effects by the genomic correlation of the seven information traits increased the prediction accuracy from R 2 = 0.032 to 0.036, and further adjusted by the two UKB-independent T2DM studies increased the accuracy to R 2 = 0.042. The great improvement in prediction accuracy (31%) is achieved as a consequence of abundant genomic pleiotropy (18,30) and the apparent genomic correlation with the selected traits. In comparison, Khera et al. (14) reported a prediction accuracy of ST-GRS of R 2 = 0.028 (14), and Maier et al. (30) obtained an accuracy of R 2 < 0.01 for both ST-GRS and MT-GRS (30). Although Maier et al. (30) showed increased prediction accuracy by combining the marker effects of selected traits (30), our reported prediction accuracies were greatly elevated compared with Maier et al. (30), most likely driven by differences in the included traits, and thereby in the optimal weights caused by differences in genomic correlation among the traits.
One of the information traits we included in the MT-GRS was the genetic liability to current medication use, which is the number of different medications the UKB participants have taken at the time of the verbal interview. Because most individuals that suffers from temporary or chronic diseases will undergo medical intervention and because of comorbidity many individuals will have multiple medical conditions, those individuals will be treated with a range of different medicines. Consequently, the total set of prescription and over-the-counter drugs is potentially an informative index of the current medical and health status of an individual. Wu et al. (40) performed genetic analysis of self-reported medication use within the UKB and found that categories of different types of medication were strongly genetically associated with a range of different diseases and traits (40). We found that the genetic correlation between T2DM and medication use was r g = 0.55 (only the correlation between T2DM and BMI had higher estimate, r g = 0.58). This is also evident by investigating the optimal weights (Equation 7), where BMI and medication use were the two information traits with the largest weights (Supplementary Figure 5A), besides T2DM itself. Including summary statistics from the two published T2DM association studies only marginally affected the optimal weights (Supplementary Figure 5B).
Although the exact level of prediction accuracy of T2DM was considerably lower when using external data from Zhao et al. (25) compared to data from Scott et al. (10) (Figure 4) Table 1). The discrepancy in prediction accuracy is most likely a consequence of different ancestries of the two external T2DM studies (10,25), where the ancestry of the individuals in the study by Scott et al. (10) is more similar to the ancestry of the UKB (European) than the study by Zhao et al. (25) (mixed ancestry). It is well-established that across ancestry, risk prediction is very difficult because the LD between populations is very diverse (41)(42)(43).
The last decade has shown us that the sample size of human genetic association studies keeps increasing (44,45), not only entailing more association signals but also providing more accurate effect estimates. This in conjunction with the increasingly accessibility of publicly available GWAS summary statistics (46,47) implies that genomic prediction of complex diseases will continually improve, in particular if multivariate predictors are created by integrating information across studies. Although we have demonstrated increased prediction accuracy by constructing MT-GRS, our work has several limitations. Firstly, as our training data were the UKB and with a 10fold cross-validation scheme, the number of cases became limited, meaning less accurate marker effect estimation and thereby less accurate risk stratification. Secondly, although we in addition to the UKB summary statistics from the 10-fold cross-validation obtained T2DM summary statistics from two independent studies (Table 1), we only had access to genotype information from the UKB and no other T2DM cohorts. Thirdly, we restricted the number of information traits to seven ( Table 2), based on the criterion that it should be a type of information that is easy and accurate to measure and obtain; height, BMI, pulse rate, and diastolic and systolic blood pressure are things that we easily and accurately can measure, and smoking status and current medication use can easily be obtained by asking the participants. Accurate observations lead to more accurate estimation of marker effects and thereby better prediction accuracies. It is compelling to speculate whether other types of information traits would improve prediction accuracy even more, and additional studies are warranted for developing methods for identifying the set of information traits most important for a particular disease.
Genomic information has the potential to change the way we diagnose and treat individuals today and will be central for implementing preventive healthcare in the clinics. An important aspect of precision medicine is accurate prediction of genetic risk toward common diseases, as it may guide the general practitioners to better and earlier identify those individuals who have an inherent genetically lifetime high disease risk, and then to initiate lifestyle changes potentially before disease outcome. Moreover, precise stratification of T2DM patients not only based on their pathophysiological symptoms (13) but also on their genetic makeup may help the general practitioners to treat highrisk patients more aggressively, which has the potential to slow down disease progression, reduce symptoms, and prevent severe morbidity and mortality.
In conclusion, by incorporating information traits and two previously published T2DM GWAS results, the prediction accuracy for T2DM was increased by 31% (from R 2 = 0.032 to R 2 = 0.042), clearly demonstrating the added benefit of incorporating correlated information in the construction of GRS. Thus, incorporating genomic information on correlated traits and disease is advisable for obtaining improved individual genetic risk stratification.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found at: The genetic and phenotypic data were obtained from the UK Biobank Resource (ID 31269). Researchers can apply for access through: https://www.ukbiobank.ac.uk/ registerapply/. Summary statistics for T2DM were obtained from published studies.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by The Ethics and Governance Framework (EGF) sets standards for the UK Biobank project so that all necessary safeguards are in place to ensure that the data and samples are only used for scientifically and ethically approved research. Participants of the UK Biobank have given their consent to participate which will apply throughout the lifetime of the UK Biobank unless the participants withdraw. Their consent involves the collection and storage of biological material (blood, saliva, urine samples) as well as collection of electronic health records (GP, hospitals, dental and prescription records). Information on the individual data level is anonymised for the researchers, and every research project has its own anonymised data. The ethics committee waived the requirement of written informed consent for participation.

AUTHOR CONTRIBUTIONS
PDR and PS conceived and designed the research project and performed the genetic analyses. PDR, PS, MN, and MK interpreted the results. All authors contributed to the preparation of the manuscript, read, edited, and approved the manuscript.

ACKNOWLEDGMENTS
The data used in the presented study were obtained from the UKB Resource (project ID 31269). All of the computing for this project was performed on the GenomeDK HPC cluster. We would like to thank GenomeDK and Aarhus University for providing computational resources and support that contributed to these research results.