Jointly Modelling Single Nucleotide Polymorphisms With Longitudinal and Time-to-Event Trait: An Application to Type 2 Diabetes and Fasting Plasma Glucose

In observational cohorts, longitudinal data are collected with repeated measurements at predetermined time points for many biomarkers, along with other variables measured at baseline. In these cohorts, time until a certain event of interest occurs is reported and very often, a relationship will be observed between some biomarker repeatedly measured over time and that event. Joint models were designed to efficiently estimate statistical parameters describing this relationship by combining a mixed model for the longitudinal biomarker trajectory and a survival model for the time until occurrence of the event, using a set of random effects to account for the relationship between the two types of data. In this paper, we discuss the implementation of joint models in genetic association studies. First, we check model consistency based on different simulation scenarios, by varying sample sizes, minor allele frequencies and number of repeated measurements. Second, using genotypes assayed with the Metabochip DNA arrays (Illumina) from about 4,500 individuals recruited in the French cohort D.E.S.I.R. (Data from an Epidemiological Study on the Insulin Resistance syndrome), we assess the feasibility of implementing the joint modelling approach in a real high-throughput genomic dataset. An alternative model approximating the joint model, called the Two-Step approach (TS), is also presented. Although the joint model shows more precise and less biased estimators than its alternative counterpart, the TS approach results in much reduced computational times, and could thus be used for testing millions of SNPs at the genome-wide scale.


INTRODUCTION
With the increased availability of longitudinal and survival data in large cohorts, joint models have emerged as an appropriate approach to account for both types of data, especially when dealing with informative/non-informative dropouts which commonly occur in such cohorts. Joint models have been studied and overviewed in the literature (Wulfsohn and Tsiatis, 1997;Tsiatis and Davidian, 2004;Chen et al., 2011;Elashoff et al., 2016), and their implementation has been proposed in different softwares and platforms (Diggle and Kenward, 1994;Sun et al., 2007;Elashoff et al., 2008;Proust-Lima et al., 2009;Rizopoulos, 2010;Rizopoulos and Ghosh, 2011). Main applications of the joint model approach are: (i) to efficiently model the survival process with a time-varying covariate, accounting for missing data and measurement error; and (ii) to account for informative dropouts in the longitudinal data. To model the two processes of a joint model, a linear mixed effects (LME) model and a Cox proportional hazards model (CoxPH) are classically used to, respectively, fit the longitudinal component and the survival component. Unlike the CoxPH model, in which the time-varying covariate is assumed to be exogenous, i.e., not modified by the occurrence of an event (Kalbfleisch and Prentice, 2002), the joint modelling framework allows to account for an endogenous timevarying covariate. An example of an endogenous covariate is the fasting blood plasma glucose which is irremediably modified due to glucose lowering medication, once T2D is diagnosed.
Two approaches can be used for estimation and inference of the model parameters: a "naive" two-step (TS) method or a joint likelihood method (JM). In the first method, the random effects of the trajectory are estimated by an LME model, then included as a time-varying covariate in a CoxPH model and estimated using the partial likelihood approach (Therneau and Grambsch, 2000). The second method is based on a joint likelihood of the two stochastic components (longitudinal and survival) estimated at the same time. Comparison of these two approaches showed that the latter offers more consistent and efficient estimators than the former (Albert and Shih, 2010a,b). But JM can be challenging to compute, especially when it comes to achieving convergence during the Expectation-Maximisation (EM) step. Moreover, depending on the number of time points and/or the sample size, the overall computational time can substantially increase.
In this paper, we conducted a comprehensive simulation study to compare these two approaches, JM and TS, when jointly modelling the longitudinal and the survival components, under the case of univariate variable for the longitudinal trait. Our main goal is to show that the JM approach, when compared to TS, increases statistical power to detect an effect on either or both the longitudinal and the survival processes, while resulting in bias reduction in parameter estimation. We also showed that, while highly demanding computation and convergence issues might arise during JM computation, TS offers a good alternative to JM in greatly reducing computational time, especially when applied at the genome-scale level.
Finally, we applied both approaches to a real dataset, the D.E.S.I.R. cohort (Data from the Epidemiological Study on the Insulin Resistance syndrome), which included 5,212 individuals with extensive phenotypic measurements recorded at four 3-yearly intervals, spanning a 9-year follow up. Individuals were genotyped using the Illumina Metabochip DNA array of nearly 200,000 SNPs . Relying on the conventional cross-sectional genome-wide association study design, the D.E.S.I.R. cohort was instrumental in identifying novel loci associated with prevalent type 2 diabetes (T2D) and fasting plasma glucose (FPG) level in normoglycaemic individuals (Sladek et al., 2007;Bouatia-Naji et al., 2008;Rung et al., 2009). We specifically focus on time-to-onset of T2D, in order to identify novel loci or to confirm published ones, which could simultaneously be associated with higher risk of developing T2D and/or increased FPG, consequently SNPs are rather analysed one at a time than as clusters. Our results were compared to the genetic variants reported in the literature (Vaxillaire et al., 2014;Welter et al., 2014) and to the metaanalyses published by large consortia, such as DIAGRAM (Morris et al., 2012) and MAGIC (Dupuis et al., 2010) consortia.

Joint Likelihood Model (JM)
The standard formulation of the joint model involves two components: a longitudinal component and a time-to-event component. Let n denote the sample size, and Y ij the longitudinal measurements collected for each individual i at time points t ij , i = 1, · · · , n, j = 1, · · · , m i , where m i is the number of measurements on individual i. The longitudinal component (i.e., measurements) typically consists of a (generalised) linear mixed effect (LME) model, whose within-subject correlation matrix is modelled using random-effect parameter vector b i = θ 0i θ 1i .
Under the joint likelihood framework implemented in "JM" (Rizopoulos, 2010(Rizopoulos, , 2017, within the class of "shared parameter models" (Rizopoulos, 2012;Elashoff et al., 2016), we define where Y ij is the observed value and X ij is the true (unobserved) value of the longitudinal measurement at time t ij for individual i. The quantity ǫ ij is a random error term usually assumed to be normally distributed: The quantity X ij is typically called the trajectory function, and is usually specified as a linear or quadratic function of time t ij ; for simplicity here, we assume linearity over time. We also define Z i , a vector denoting the genotype of individual i, and W i , a set of adjusting covariates: Again, without any loss of generality, we omit the term δW i in the following. Random effects θ 0i (intercept) and θ 1i (slope) are assumed bivariate Normal: θ ∼ N 2 (µ, ), and independently distributed from ǫ ij . The coefficient γ assesses the genotypic (additive) effect of variable Z i on the trajectory function. To account for varying slopes, an interaction term between Z i and time t ij could be added into the trajectory function; for simplicity, this term was not considered in this study.
The time-to-event (survival) component usually consists of a parametric (e.g., exponential or Weibull distribution) or semiparametric (e.g., Cox proportional hazards) model. T i denotes the event time for individual i, and C i the right censoring time (end of the follow-up). Let i be the event indicator: i = 0, if T i > C i , and i = 1, if T i ≤ C i . Under the Cox proportional hazards model, variable T i is specified using the following equation: where λ i (t) is the hazard function at time t for individual i and λ 0 (t) is the unspecified baseline hazard function, which we assume piecewise constant with two knots placed at intermediate time points in the follow-up. Coefficient α measures the effect of Z i on the hazard function, while β measures the association between the trajectory function and the hazard function. In this formulation, we suppose that the subject-specific random effect modify the hazard function, which implies that β is the parameter linking the longitudinal and survival components.

Two-Step Model (TS)
As an alternative to JM, and based on the work of Tsiatis et al. (1995), the two-step model estimates parameters of the joint model by first, estimating parameters of the trajectory function X i (t) in Equation (3), and second, by substituting this estimated trajectory, say X * i (t), into Equation (4) before fitting the Cox survival model.
Longitudinal data were simulated according to Equation (3), while event times were generated from an exponential distribution for the CoxPH model (Austin, 2012), with X i (t) as a linear function.
where λ was set to achieve the targeted incidence rate in the simulated dataset. Datasets were simulated by varying the number of longitudinal measurements m ∈ {2; 3; 4; 5}, the number of individuals n ∈ {500; 1, 000; 2, 500; 5, 000; 10, 000}, the allele frequency f ∈ {0.05; 0.1; 0.25; 0.5} and the incidence rate d ∈ {0.025; 0.05; 0.1}, thereby leading to 240 different scenarios. Each scenario was simulated 500 times. The Root-Mean Square Error (RMSE) was used to assess precision of estimators φ = (β, γ , α), when testing the association between Y ij , and T i , the effect of Z i on Y ij and the effect of Z i on T i , respectively. In addition, statistical power and type I error were also estimated. The computational burden of each approach (JM and TS) was also investigated as our goal is to implement these approaches at a genome-wide scale.

Computational Times
Based on our simulations, we calculated approximate computational times for four sample sizes with parameters as listed in Table 1, using a UNIX system with Intel R Xeon R CPU E7-4870 @ 2.40 GHz (80 such CPUs available computing in parallel). Table 2 shows computational times for one SNP, and for extrapolating the total computational time for 100,000 SNPs, which is the approximate number of SNPs on the Metabochip, after data cleaning and quality-control for common SNPs (minor allele frequency > 0.05).
To investigate further computational time issues, we profiled the execution of the main function "jointmodel" from the R package "JM, " which implements the joint likelihood modelling approach as described in this paper. In the "JM" package, the linear mixed effect sub-model is handled by the function "lme" from the "nlme" package. One may argue that using a faster approach, e.g., as implemented in the R package "lme4", might decrease the computational time.

Real Data
SNP genotyping was performed with Metabochip DNA arrays  using Illumina HiScan technology and GenomeStudio software (Illumina, San Diego, USA) in 5,212 individuals from the French cohort D.E.S.I.R. (Balkau, 1996). These participants have been followed for 9 years, and extensive phenotypic data has been recorded at four different 3-yearly time interval during that follow-up. All participants signed informed consent, and the protocol was approved by the ethics committee of Kremlin Bicetre Hospital, Paris. Quality control was performed using PLINK 1.90 beta version Purcell and Chang, 2015). SNPs with call rate of at least 95%, with no significant deviation from Hardy-Weinberg equilibrium at p > 1 × 10 −5 , and with minor allele frequency (MAF) over Association between Y ij and T i (β) 3.17 Error term (ǫ) ∼ N (0, 0.305 2 ) System time was computed ten times per sample size (number of individuals). Extrapolation is displayed for 100,000 SNPs.
5% were kept for analysis, resulting in 101,305 SNPs. Due to missing phenotypes which did not allow to confirm T2D status, 232 individuals were removed. An additional 554 individuals were excluded due to individual call rate lower than 95%, leaving 4,426 individuals for analysis after these quality control steps (Figure 1). Principal component analysis was performed using a combined dataset comprised of the 4,426 D.E.S.I.R. participants, along with participants from the publicly available 1,000 Genomes database (The 1000 Genomes Project Consortium, 2015). SNPs retained for analysis were restricted to those common to both sample sets. The first two components were sufficient to discriminate ethnic origin, which led to exclusion of 62 non Caucasians. A further 12 prevalent cases of T2D at baseline were also removed. As a result, the final dataset included 4,352 individuals, of whom 167 were diagnosed as T2D incident cases. Type 2 diabetes was defined using one of the following criteria: use of glucose lowering medication, and/or fasting plasma glucose [FPG] ≥ 7 mmol/L, and/or glycaeted haemoglobin A1c [HbA1c] ≥ 6.5% (48 mmol/mol).
Using the joint modelling approach implemented in the package "JM" (Rizopoulos, 2010(Rizopoulos, , 2017 within the R software version 3.4.2 (R Core Team, 2017), all 101,305 SNPs were tested for joint association with FPG and T2D. Following the above joint modelling formulation, Y ij denotes the observed values of FPG, Z i represents the genotype of individual i at each SNP, with W i being covariates such as age, sex and BMI (Figure 2). Finally, T i gives the time at which an individual is diagnosed with T2D.
In the joint modelling framework, the trajectory of FPG could be viewed as a dropout process, because all FPG values are flagged as missing after T2D diagnosis. In effect, individuals receiving a diabetes diagnostic are immediately placed under treatment to lower and regulate their blood glucose level. Therefore, FPG must be considered as an endogenous covariate, because the dropout process is not independent from the measured glucose values prior to T2D diagnosis.

Comparison of Estimation Accuracy
Due to the complexity of the estimating algorithm within JM, convergence could not be obtained (4.53% of convergence issues on average per scenario, with a standard deviation of 5.81%) for the whole set of 500 simulations (i.e., algorithm "piecewise-PH-aGH" for a time-dependent relative risk model with a piecewise constant baseline risk function, using the adaptive Gauss-Hermite quadrature rule to approximate integrals within the Expectation-Maximisation (EM) step; Rizopoulos, 2010Rizopoulos, , 2017.
RMSE for parameter γ (Figure 3) showed similar performance for JM and TS. RMSE for parameter β (Figure 4) and for parameter α (Figure 5) were smaller within the joint modelling framework (either JM or TS) than in the more classical CoxPH model with time-varying fasting plasma glucose. While the RMSE for β remains the same in the CoxPH model across all scenarios, under JM or TS it decreased whenever the sample size, the incidence rate or the allele frequency increases. Differences in RMSE for parameter α were smaller than for parameter β. Both TS and CoxPH with time-dependent covariate performed similarly, probably because partial likelihood inferences were used in these two approaches. JM estimations, for β and γ , were less biased in almost all scenarios when the sample size was >2,500. The larger bias observed within the extended CoxPH model, especially for β, could be explained by the mischaracterisation of the measurement error in the longitudinal trait.
Overall, our simulations showed that JM is less biased than when separate approaches are used to model the effect of Z i on the longitudinal process Y i , and on the time-to-event T i . While separate approaches performed well for parameters γ and α, the bias for parameter β was the highest across all scenarios.

Computational Time
Computational times are reported in Table 2. The time required to complete JM or TS algorithms increased linearly with sample size in our simulations. However, these times are very optimistic since our simulations did not include any covariate or more complex random parameters. The main issue appears within the "jointmodel" function which took over 95% of the global computation time. After examination of the call tree diagram, we observe that the more time-consuming task within the "jointmodel" function happens during the optimisation of the EM algorithm (described in Rizopoulos, 2012, Appendix B), despite the use of a calculation trick (i.e., adaptive Gauss-Hermite quadrature for numerical integration). FIGURE 2 | Causal diagram for joint modelling applied to fasting plasma glucose (FPG) and type 2 diabetes (T2D) (adapted from Ibrahim et al., 2010). SNP: Single Nucleotide Polymorphism. X ij is the true (unobserved) FPG trajectory. β measures the association between FPG trajectory and T2D incidence. exp(α) measures the hazard ratio for T2D. γ measures the association between a SNP i and FPG trajectory.

Application in Real Data
Applying the R package "JM" to our D.E.S.I.R. cleaned dataset, 265 SNPs (Figure 6) were associated (with p < 0.05) with FPG and T2D events through their respective parameters γ and α. Amongst these 265 SNPs (163 unique genes), we identified 17 genes ( Table 3) which had already been reported to be associated with FPG and/or T2D risk. Parameter β was highly significant (below the genome-wide threshold of 5×10 −8 ) for all these SNPs,  (1) Displays results for n = 500 to n = 2, 500.
Frontiers in Genetics | www.frontiersin.org FIGURE 6 | Results from statistical analysis using "JM" (Rizopoulos, 2010(Rizopoulos, , 2017, using fasting plasma glucose increasing allele as effect allele. Estimated effects of γ are displayed on the x-axis, with corresponding estimated hazard ratio exp(α) on the y-axis. Statistical power reported is the theoretical (retrospective) power to detect a genetic joint effect βγ + α based on estimated model parameters (Chen et al., 2011).
TABLE 3 | List of loci found to be associated within the joint modelling framework with both FPG and T2D risk, previously shown as associated with FPG and/or T2D risk in the NHGRI GWAS Catalogue (Welter et al., 2014). which was expected considering that β estimates the association between FPG trajectory and T2D risk. In Figure 7, we specifically focused on parameters γ and α. After Bonferroni correction (nominal p-value ≃ 5 × 10 −7 ), no genetic variants showed a highly significant association with both parameters γ and α simultaneously; only SNPs in the following genes (or within a 100 kb window) remained significant when testing for γ : G6PC2/ABCB11, GCK/YKT6, GCKR, and MTNR1B, with per-allele increasing effect varying on FPG from 0.100 to 0.047 mmol/L (data not shown). Zooming in on simultaneous associations with the longitudinal and survival processes revealed well known genes, such as TCF7L2, which has been shown in many meta-analyses to be associated with elevated FPG and an increased risk of T2D (Table 4). MTNR1B was also found to be associated (34 SNPs within 30 kb) with estimated α = −0.44 (p = 9.37 × 10 −04 ) andγ = 0.099 (p = 1.33 × 10 −23 ) for SNP rs10830963, the SNP usually reported in the literature.
To better compare JM and TS, we repeated the analysis on the whole dataset using TS. As shown in Figure 8, p-values differ, especially when testing parameter α; however for tests on parameter γ , approximations were quite close to the p-values provided via the joint likelihood framework. FIGURE 7 | Manhattan plot for estimated effects of γ and α using "JM" (Rizopoulos, 2010(Rizopoulos, , 2017. Results are presented for the cleaned set of 101,305 SNPs.  Comparison is shown with effect sizes as reported by consortia meta-analyses in genes MTNR1B and TCF7L2.

DISCUSSION
With the ever-increasing availability of genomic data generated by genotyping arrays and next generation sequencing, the need to develop and implement efficient models is important to ensure that statistical analysis will be achieved in a reasonable timeframe. In this paper, we proposed a comparison of two approaches, namely the joint model (JM) and the two-step model (TS), to estimate parameters accounting simultaneously for a genetic effect on both longitudinal and survival processes without discarding missing values or dropouts commonly generated by the longitudinal measurement process. In our real data application, FPG serves as the longitudinal process, whereas T2D diagnosis generates survival times of interest, both stochastic phenomenon being linked by the fact that a fixed threshold for FPG defines T2D onset (currently, [FPG] ≥ 7 mmol/L), along with glucose lowering medication use. Through simulations over different scenarios, we showed that joint models are less biased than classical separate approaches. Hopefully, joint models could provide more insight regarding the event of interest, and could assess the potential impact of a genetic marker on incident T2D better than simpler models. By looking at statistical measures of accuracy such as RMSE for our model estimators, and by estimating the computational time required by the available R implementations of joint models, our study showed that the use of an approximate method at a genome-wide scale, such as TS, might represent a good compromise between accuracy and computational time. TS could be used to overcome the computational burden of current joint likelihood methods by exploiting available R packages performing the two steps, LME and CoxPH, and could help filter out SNPs with low or undetectable associations during a first preliminary scan. However, depending on the parameters of the dataset (sample size, incidence rate, number of measures), a joint likelihood method always has to be preferred over TS when one wants to obtain accurate estimation of parameters γ and α, which describe the SNP effect on the trajectory of FPG and on the timeto-onset of T2D, resp. Although we computed the theoretical statistical power to detect a joint genetic effect βγ +α (Chen et al., 2011), we did not test this effect at the genome-wide scale due to its computational burden. In this paper, we used the closed-form expression from Chen et al. (2011) to evaluate retrospectively the probability to obtain the same significant results in a similarly designed study to D.E.S.I.R. This closed-form expression could be use to design a new study (e.g., compute the number of samples) which aims at identifying a joint effect. The joint SNP effect can be tested using a likelihood ratio test comparing the full joint model, i.e., with a SNP effect included in both sub-models, to the joint model without a SNP effect in the survival sub-model, as implemented in the package "JM" (Rizopoulos, 2010(Rizopoulos, , 2017. To fully characterise JM approach, further study needs to be performed, such as missing values distribution according to the usual hypotheses (i.e., Missing At Random, Missing Completely At Random and Missing Not At Random). In this paper, we did not study in our simulations the rates of change effect of the SNPs (i.e., interaction term SNP × TIME) which might also be of interest in the study of a disease such as T2D. Finally, we would like to reemphasize that using parallel and grid computing approaches will help reduce the global computational time when applied at a genome-wide scale (i.e., with millions of SNPs).
In our real data application, rs17747324 showed consistent results with the DIAGRAM and MAGIC (for FPG) consortia for both α and γ ( Table 4), but rs10830963 showed an opposite effect on T2D risk compared to the effect reported in the DIAGRAM consortium (α = 0.104, p = 7.3 × 10 −07 ). Results observed for MTNR1B (rs10830963) in the French cohort D.E.S.I.R., albeit inconsistent with previous studies, may uncover some interesting peculiarities pertaining to T2D incident cases in this population. In the literature, SNPs in MTNR1B were reported as being associated with higher FPG and T2D risk, but meta-analyses were performed on populations with different genetic backgrounds, and the two traits have never been jointly co-analysed. However, we realize that MTNR1B associations identified in our study need to be confirmed and replicated in other cohorts, as they might be cohort-specific. Finally, a major limitation in our study comes from the low number of incident T2D cases in the D.E.S.I.R. cohort (167 incident T2D cases in 4,352 individuals followed over 9 years), resulting in (retrospective) power, no higher than 70%, as shown in Figure 6 and Table 3.

DATA AVAILABILITY
The datasets for this manuscript are not publicly available due to consideration of intellectual property, ongoing active collaborations and to continuing analyses by the study investigators.
Requests to access the datasets should be directed to PF (p.froguel@imperial.ac.uk).

AUTHOR CONTRIBUTIONS
GR and MC: contributed to the study conception and design; BB, PF, and RR: made the genomic and phenotypic data available; MC: conducted the data and statistical analyses; GR and MC: interpreted the data and results; MC: drafted the first version of the manuscript; BB, GR, PF, and RR: edited and provided critical revisions to the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.