Analysis of Case-Parent Trios Using a Loglinear Model with Adjustment for Transmission Ratio Distortion

Transmission of the two parental alleles to offspring deviating from the Mendelian ratio is termed Transmission Ratio Distortion (TRD), occurs throughout gametic and embryonic development. TRD has been well-studied in animals, but remains largely unknown in humans. The Transmission Disequilibrium Test (TDT) was first proposed to test for association and linkage in case-trios (affected offspring and parents); adjusting for TRD using control-trios was recommended. However, the TDT does not provide risk parameter estimates for different genetic models. A loglinear model was later proposed to provide child and maternal relative risk (RR) estimates of disease, assuming Mendelian transmission. Results from our simulation study showed that case-trios RR estimates using this model are biased in the presence of TRD; power and Type 1 error are compromised. We propose an extended loglinear model adjusting for TRD. Under this extended model, RR estimates, power and Type 1 error are correctly restored. We applied this model to an intrauterine growth restriction dataset, and showed consistent results with a previous approach that adjusted for TRD using control-trios. Our findings suggested the need to adjust for TRD in avoiding spurious results. Documenting TRD in the population is therefore essential for the correct interpretation of genetic association studies.


INTRODUCTION
Transmission Ratio Distortion (TRD) occurs when the transmission of alleles from a heterozygous parent to the offspring statistically deviates from the Mendelian Law of Inheritance. TRD results from disruptive mechanisms occurring during gametic and embryonic development , including germline selection (Hastings, 1991), meiotic drive (Pardo-Manuel de Villena and Sapienza, 2001), gametic competition (Zöllner et al., 2004), embryo lethality (Zöllner et al., 2004), and imprint resetting error (Naumova et al., 2001;Yang et al., 2008). The presence of TRD leads to spurious conclusions in association studies.
A recent study uses a Bayesian framework to model TRD in boars and piglets and was shown to achieve appealing statistical performance (Casellas et al., 2014). In humans, individuals unselected for phenotype have been studied to detect TRD in the general population, such as in the Framingham Heart study (Paterson et al., 2009;Meyer et al., 2012), the Centre d'Etude du Polymorphisme Humain (Naumova et al., 2001;Yang et al., 2008), the HapMap project (The International HapMap Consortium, 2005), and the 1000 Genomes Project (Auton et al., 2005).
In family-based study design, Transmission Disequilibrium test (TDT; Spielman et al., 1993) is among the most wellknown linkage disequilibrium tests. It is a McNemar test of transmitted vs. untransmitted alleles from parents to an affected child. It was originally developed to test both linkage and association at a marker locus by studying case-parent trios. The usage of TDT became wide-spread since its inception because of its simplicity and robustness to population stratification. There have been multiple extensions of TDT to address multiallelic loci (Sham and Curtis, 1995;Wilson, 1997;Lazzeroni and Lange, 1998), multiple marker loci (Lazzeroni and Lange, 1998), quantitative traits (Allison, 1997;Rabinowitz, 1997;Xiong et al., 1998), nuclear family with multiple affected children (Martin et al., 1997) and unaffected siblings (Lazzeroni and Lange, 1998), pedigrees (Sham and Curtis, 1995), late-onset diseases (Spielman and Ewens, 1998), and imprinting effect (Hu et al., 2007).
In some studies, case and control populations were analyzed separately to detect a difference in transmission (Friedrichs et al., 2006;Shoubridge et al., 2012). To address the possible presence of TRD in the studied population, Spielman et al. (1993) analyzed both case/control-trios separately using the TDT. True association was then assessed using a Pearson's Chi-square test. Deng and Chen (2001) proposed a TDT statistic that is the sum of TDT statistics for case/control-trios for similar purpose. Previously, we also suggested a modified TDT statistics where the two diagonal counts in McNemar test are multiplied by t and (1−t), respectively, where t is the transmission ratio of the minor allele in control-trios .
Other statistical measures have also been proposed to study affected offspring, such as Binomial exact test (Dean et al., 2006;Yang et al., 2008), Pearson's Chi-square test (Imboden et al., 2006;Bettencourt et al., 2008), multipoint non-parametric linkage (NPL) test (Paterson and Petronis, 1999;Paterson et al., 2003), Mann-Whitney U-test (De Rango et al., 2007), and multivariate logistic model (Yang et al., 2008). These methods only give statistical significance of linkage and association, but do not estimate the disease relative risk (RR). Relative risk is considered as an important information because it measures the difference in risk between individuals of different genotypes.
The family-based association test (FBAT; Lazzeroni and Lange, 1998;Rabinowitz and Laird, 2000) and likelihood methods that use case-trios to construct conditional logistic (Cordell et al., 2004), unconditional logistic (Weinberg, 1999), and loglinear models (Weinberg et al., 1998;Sinsheimer et al., 2003;Gjessing and Lie, 2006;Kistner et al., 2006Kistner et al., , 2009 have also been used in family-based studies. In particular, Weinberg et al. proposed a loglinear model to detect an association between a marker and disease (Weinberg et al., 1998). This model estimates a RR of disease for the offspring, assuming Mendelian transmission. Unlike the other tests and models, it has a probability component that can be easily extended to adjust for TRD. Our proposed method uses the transmission ratio of a minor allele in control-trios, obtained from an external dataset such as HapMap (The International HapMap Consortium, 2005), 1000 Genomes Project phase 3 data (Auton et al., 2005), and family units in Framingham Heart Study (2008). These datasets are publically available and include healthy trios, which provide transmission ratio of alleles from parents to child, can be used to account for TRD through an offset in the model. There are others consortia with genome-wide data, but they are based mostly on unrelated individuals (Cavalli-Sforza, 2005;Prüfer et al., 2014), a few trios (Drmanac et al., 2010), large pedigrees (Drmanac et al., 2010;T2D-GENES Consortium TD-G, 2016) or diseased individuals (The Cancer Genome Atlas, 2016; T2D-GENES Consortium TD-G, 2016), which are neither adequate nor appropriate for our study on TRD. This extended loglinear model was validated through extensive simulation studies and applied to an intrauterine growth restriction (IUGR) case-control study augmented with a case/control-trio study (Infante-Rivard et al., 2002;Infante-Rivard and Weinberg, 2005), investigating the role of thrombophilic genes in IUGR. The current literature in support of the association between thrombophilia and IUGR is inconsistent. We explored the possible role of TRD in these inconsistencies.

MATERIALS AND METHODS
We investigated the association between a bi-allelic codominant disease susceptibility locus (DSL) and a disease, of which individuals express distinct disease risk associated with each of the three possible genotypes at the DSL. We defined genotype by the number of copies of the minor allele.
Loglinear Model by Weinberg et al. (1998) The loglinear model proposed by Weinberg et al. (1998) assumes Mendelian transmission and mating symmetry, but not Hardy-Weinberg Equilibrium (HWE). We considered the simpler form of this model with only child genotype parameters.
In this model, the response variable is the number of trios for the 15 mother-father-child (MFC) genotype categories ( Table 1). These 15 categories can be subdivided into six parental mating types. Covariates entering the model include two indicator variables for child genotypes 1 and 2, and five for mating types. The model which includes an intercept and an offset, is described as: (1) n MFC is the number of trios with genotypes MFC, and D is the disease status of the child. The ρ j + ρ 6 terms are the regression coefficients for the first five parental mating types; ρ 6 is the intercept for the 6th mating type MF = 00; β 1 and β 2 are the regression coefficients for child genotypes 1 and 2, where β 1 = log (R 1 ) and β 2 = log (R 2 ). R 1 and R 2 are the RR with respect to genotype 0. This model 1, operates under the assumption of Mendelian transmission [derived in Appendix Derivation of Model 1 (Without TRD Offset) and 2 (With TRD Offset) and Table 6 in Supplementary Materials].

Loglinear Model with Adjustment for TRD
Without the assumption of Mendelian transmission, model 1 can be generalized into: where τ MFC is the transmission offset P[C|MF], ξ j + ξ 6 terms (j = 1-5) are the regression coefficients for the first five mating types, and ξ 6 is the intercept corresponding to the 6th mating type. The coefficients β 1 and β 2 are as defined in model 1. This model 2 accounts for TRD [derived in Appendix and Table 6: Derivation of Model 1 (Without TRD Offset) and 2 (With TRD Offset) and Table 6 in Supplementary Materials].
The offset τ MFC depends on the TRD ratio t, defined as the transmission probability of a minor allele from a heterozygous parent to the child. This leads to a different offset in each MFC genotype category. The parameter t can take on values different from 0.5, and t = 0.5 corresponds to Mendelian transmission, in which case models 1 and 2 are equivalent [see Appendix and Table 6: Derivation of Model 1 (Without TRD Offset) and 2 (With TRD Offset) and Table 6 in Supplementary Materials].
We fitted both loglinear models (1) and (2) to obtain estimates R 1 and R 2 , and their corresponding Z-test p-values. To assess significance of the association between the disease and the DSL, a Likelihood Ratio Test (LRT) was used [see Appendix Non-Central Chi-Square Likelihood for Model 1 (Without TRD Offset) and Model 2 (With TRD Offset) for the distribution of the LRT under the null and alternative hypotheses].

Simulation Study
A simulation study was set up for different TRD scenarios, where RR parameters, p-values, LRT p-values, Type 1 error, and power were compared between the 2 models, and the true t was used in model 2. A sensitivity analysis was also carried out to test the impact on RR estimates and power when an incorrect t is used.

Simulation Setup
We considered a causal locus with no recombination. Disease prevalence is 0.1 for low penetrant common disease, and 0.01 for high penetrant rare disease. 100,000 trios were generated where 500 case-trios were sampled. Parental genotypes at the DSL were generated under HWE assuming a minor allele frequency (MAF) 0.1. The parameter t was specified between 0.1 and 0.9. Offspring were assigned to diseased or non-diseased phenotypes using risk associated with genotypes 0, 1, and 2, as f 0 , f 1 , and f 2 , respectively. The simulation was repeated 100 times and averaged RR estimates, p-value of the averaged Z statistics for RR and p-value of the averaged LRT statistics are reported.

Measuring Impact of TRD on Association Statistics
We compared the RR, 95% CI, p-value and LRT p-value of both models under two scenarios: (1) a common disease associated of low penetrance at f 0 = 0.1, f 1 = 0.11, f 2 = 0.15, and (2) a rare disease of high penetrance at f 0 = 0.1, f 1 = 0.5, f 2 = 0.5. In scenario (2), a dominant model was assumed. To measure the inflation in RR and LRT p-values in model 1, we computed the log ratio of RR and LRT p-values in model 1 vs. 2. We also varied f 1 fixing f 2 = 0.15 to describe the corresponding inflation of LRT p-values. To assess the inflation of Type 1 error, we set the penetrance factors to f 0 = f 1 = f 2 = 0.1 assuming no association while varying t from 0.1 to 0.9, using sample sizes of 100, 300, 2 | Relative risk with 95% CI, P-values, and likelihood ratio test P-values of models 1 (Unadjusted) and 2 (Adjusted) for a low penetrance common disease. and 500. Finally, we evaluated the power of both models to detect a true association signal in the presence of TRD, by setting f 0 = 0.1, f 1 = 0.2, f 2 = 0.3, varying t from 0.1 to 0.9 in the simulation. Critical value for declaring significance was α = 0.05.

Sensitivity Analysis
The assumption in the simulation study was that true t is known. We examined the consequences of a misspecification of t on the RR estimates and the power, simulating three scenarios with true association signal, f 0 = 0.1, f 1 = 0.2, f 2 = 0.3, and true t = 0.3, 0.5, or 0.7. For each scenario, model 2 was fitted with the offset τ MFC calculated using a selected t varying between 0.1 and 0.9. We then evaluated the log ratio of RR and power obtained from model 2 using selected t-values vs. true t that adjust for TRD.

Application of Models 1 and 2 to a Real Dataset
We applied our model to the IUGR study described previously (Sapru et al., 2009;Kvasnicka et al., 2012). Cases were below 10th percentile according to weight whereas controls were selected at the same hospital and measured at or above the 10th percentile. DNA was obtained from parents of both cases and controls. The investigation pertained to the role of thrombophilic genes in IUGR. We examined six thrombophilic genes: Coagulation Factor XIII, A1 polypeptide (F13A1), Plasminogen activator inhibitor type 1 (PAI-1), Methylenetetrahydrofolate reductase variant A1298C (MTHFR A1298C), Methylenetetrahydrofolate reductase variant C677T (MTHFR C677T), Coagulation Factor V (F5), and Coagulation Factor II (F2). We computed the MAF using all complete trios and t using control-trios. We compared our extended model 2 with another method proposed by Infante-Rivard and Weinberg (2005) to quantify the extent of TRD in the same IUGR population, specifically for F5. The difference between our model 2 and the model used in Infante-Rivard and Weinberg (2005) is that the former inserts t as an offset in the loglinear model fitted with case-trios only, while the latter uses both case-and control-triosadding an interaction term between child genotype and case status. This study was carried out in accordance with the recommendations of Le Comité d'éthique de la recherche,

Inflation of RR Estimates and LRT P-values
When the transmission ratio was Mendelian, models 1 and 2 yielded the same RR and 95%CI (Tables 2, 3). When testing t = 0.3 where the disease allele is under-transmitted, the RR for model 1 was attenuated excluding 1 in the 95% CI, whereas RR estimates, p-values and LRT p-values were restored in model 2. Similarly, for t = 0.7, the RR for model 1 were inflated and this inflation was removed under model 2.
The RR inflation ratio changes exponentially with respect to t, implying that even small deviation from t = 0.5 can lead to a substantial inflation ( Figure 1A). The slope of RR ratio for R 2 was double that of R 1 , showing that TRD affected R 2 more severely than R 1 . In Figure 1B, when TRD is not adjusted for, the significance of the LRT p-values was inflated when t deviates from 0.5.   N, sample size (100, 300, and 500); f 0 , penetrance for genotype 0 individuals; f 1 , penetrance for genotype 1 individuals; f 2 , penetrance for genotype 2 individuals. Figure 2A shows the empirical Type 1 Error we observed by fitting the loglinear model which is similar to our theoretical results in Figure 3A. Type 1 Error of the TRDadjusted model 2 remained the same across all t-values, and were exactly the same for all sample sizes. Type 1 Error for model 2 does not depend on sample size or t, meaning that this model is robust to the effect of TRD when the null hypothesis is true. In Figure 2A, Type 1 Error for the unadjusted model 1 increased as t deviated from 0.5 which led to a false inflation of the association signals.

Power Loss
Power for sample size n = 100 was poor in Figure 2B, with or without TRD. We also noticed that model 2 gave relatively stable power in the range of t, while model 1 power suffered from the effect of TRD. However, when t was lower than 0.2 or >0.5, model 1 power was greater than that of model 2. This is because a strong TRD actually inflates the power of detecting an association signal in either direction. Power for model 2 decreased slightly when t > 0.7, which suggested that the TRD offset overcompensates the inflation in power. However, a TRD ratio as large as 0.9 is rare, but even when t = 0.8, the power was still maintained around 0.8 for sample sizes of 300 and 500. Therefore, the power for model 2 was still adequate for a t between 0.2 and 0.8. Relatively consistent results were obtained between theoretical power ( Figure 3B) and empirical power ( Figure 2B).

Sensitivity Analysis: Inflation in RR Estimates
We observed that using an under-estimated t-value in model 2 led to inflation, while an over-estimated t led to attenuation for R 1 (Figure 4). We also noted that the inflation and attenuation of the log RR ratio was linear, which means exponential in arithmetic scale. When the difference between the true and selected t was ±0.1, the inflation ratio lied between 10 0.25 = 1.78 and 10 −0.25 = 0.56 for R 1 . When the difference was greater than ± 0.1, the inflation ratio became more pronounced. The slope of the log RR ratio curve for R 2 was twice (not shown) that of R 1 in Figure 4. Therefore, the inflation or attenuation in R 2 was more severe than in R 1 . Results from our model 2 were highly sensitive to an incorrect input of t-value.

Sensitivity Analysis: Attenuation and Inflation in Power
In Figures 5A,B, for t = 0.3 and 0.5, the power to detect true association was completely restored when the selected t was equal to the true t. However, setting the selected and true at t = 0.7 (Figure 5C), the power for detecting true association was not completely restored, consistent with what we observed previously in power analysis. There was a decrease in power when true signal is partially canceled by the selected t. We see that power was also highly sensitive to incorrect t.
Application to a Case-Control, Case-, and Control-Parent Trio Study of IUGR  (Tables 3, 4). Except for MTHFR A1298C, all MAF were close to the expected range from the literature (Kawamura et al., 1989;Ulvik et al., 1998;Ariens et al., 2002;Sapru et al., 2009;Alfirevic et al., 2010;Kvasnicka et al., 2012). Discrepancies were likely due FIGURE 4 | Log ratio of RR in model 2 (Adjusted) for selected t (from 0.1 to 0.9) vs. True t.  to the fact that the samples were genetically heterogeneous with ∼25% being black.

Application to 6 IUGR Genes
We see in Table 4 that F13A1, PAI-1, and MTHFR C677T all had transmission ratios around 0.5. MTHFR A1298C had slightly lower transmission of the disease allele with t = 0.45. However, F5 and F2 had transmission deviate significantly from the Mendelian ratio with t = 0.36 and 0.11 (Table 5). RR from the loglinear model showed noassociation for F13A1, PAI-1, MTHFR A1298C, and MTHFR C677T variants (Table 4), similar to previous reports (Infante-Rivard et al., 2002, 2005. Due to the small number of genotype 2 cases for F5 and F2, these two genes were analyzed under a dominant model. We see that for F5, conclusion on RR, p-values and LRT p-values are reversed from model 1 to model 2, suggesting a deleterious effect of the minor allele. For F2, we observed the opposite trend. The change in risk after adjustment for TRD was coherent with the expected effects from these variants given that they are known to affect placental circulation and thus potentially fetal growth.

Comparison with TRD Analysis in Infante-Rivard and Weinberg (2005) on FV Gene
Infante-Rivard and Weinberg (2005) found in their study that both F5 and F2 exhibited evidence of TRD, as well as MTHFR A1298C but to a lesser extent, which is consistent with our estimation from control-trios (Tables 4, 5). The authors used six more strata from control-trios together with an interaction term between child genotype and case status. A gene-dosage model (R 2 = R 1 2 ) was used implicitly to adjust for TRD; the RR for cases was estimated to be 3.59. We fitted model 2 using a genedosage model, and obtained a RR estimate of 2.88 with 95% CI: 1.31, 6.35. This result is in the range of the estimate from Infante-Rivard and Weinberg (2005). The number of trios included in these two analyses was different as Infante-Rivard and Weinberg (2005) used the LEM software with built-in EM algorithm for missing data whereas we only used complete trios. This shows that results from our extended loglinear model 2, which adjusts for TRD were comparable to those from the augmented model proposed in Infante-Rivard and Weinberg (2005). The method proposed by Infante-Rivard and Weinberg (2005) requires fitting the loglinear model with actual control-trios, which is not required in our method where the transmission ratio of the minor allele is obtained through publicly available datasets. Therefore, less recruitment effort is needed leading to lower study cost. This difference is more significant for genome-wide studies where large samples are required.
Both models can include the same covariates. However, since control-trios are directly fitted in the model proposed by Infante-Rivard and Weinberg (2005), each covariate included in the model will lead to 2 • of freedom loss because an interaction between case status (0, control; 1, case) and the covariate itself also has to be added. This leads to a faster decline in degrees of freedom than our method. The difference will further be magnified when other more complicated covariates, such as the mother-fetal interaction effect, are included in the model. Each of the four mother-fetal interaction covariates requires an additional interaction term with the case status.
The loglinear model proposed by Infante-Rivard and Weinberg (2005) allows missing data while our method requires complete trios only. The former has the advantage of using trios with missing parental genotypes, and hence does not need to discard trios with incomplete information. Currently, there is no immediate plan to augment our R-package for missing data, but it is possible in the future to address this issue using EM algorithm and include it as an option in our R-package. The loglinear model with control-trios has the advantage of adjusting for TRD without knowing the extent of distortion, and hence, remains a gold standard when the transmission ratio of the minor allele is not available.

DISCUSSION
Studies using animal models can potentially provide new insights in handling the phenomenon of TRD. TRD is much less studied in humans. In most genetic association studies in the current literature TRD remains largely unaccounted for. We previously reviewed a number of human studies on TRD (Naumova et al., 2001;Pardo-Manuel de Villena and Sapienza, 2001;Zöllner et al., 2004;Hanchard et al., 2005;The International HapMap Consortium, 2005;Paterson et al., 2009) and discussed the various methods and study designs in detecting TRD .
Here, we extend a model used for family-based association studies, accounting for TRD. Our simulation study showed that when TRD is unaccounted for as in model 1, the RR is inflated or attenuated exponentially. Power and Type 1 error also suffered greatly. Using a real dataset where the F5 gene was studied as a determinant of IUGR, we validated our model in comparison with an approach using control trios (Infante-Rivard and Weinberg, 2005). However, we noted that the accuracy of our results depended on the correct TRD offset used in model 2. If we conduct a study with less well-known DSL and diseases, it is unlikely that we will have information on the TRD factor. Nevertheless, by leveraging on studies such as the HapMap project (The International HapMap Consortium, 2005), the 1000 Genomes Project (Auton et al., 2005), or the Framingham Heart Study (Framingham Heart Study, 2008), it may be possible to obtain such information.
The LEM software developed by van Den Oord and Vermunt (2000) that was used by Infante-Rivard and Weinberg (2005) to fit a loglinear model that takes into account of missing data. We compared RR estimates obtained from LEM and our models in the absence of TRD, and they were similar in values. HAPLIN, a software developed by Gjessing and Lie also studies case-parent-trios, which estimates the effect of multi-allelic markers or haplotype for single-and double-dose maternal and fetal haplotype (Gjessing and Lie, 2006). There are other software developed for studying caseparent trios such as TRANSMIT (Clayton and Jones, 1999), which can handle multi-locus haplotypes and missing parental information, and GASSOC (Schaid, 1996), which accommodate multi-allelic markers. These software do not readily have a component to adjust for TRD. However, we implemented the model 2 with TRD offset in an R package (named TRD) available on the Comprehensive R Archive Network (CRAN).
Currently, there is no comprehensive knowledge on TRD in the human genome. As TRD can inflate or attenuate an association signal, with large sets of SNPs being tested, results can be severely biased leading to spurious conclusions. Since TRD over generations leads to reduced mutational diversity in the genome, many of these TRD loci contain rare variants which are currently intensively researched. When transmission counts are small, even a slight distortion could lead to major impact on the outcome of the studies. Given what we observed in our simulation study, sequencing a control population to identify and quantify the extent of TRD in the human genome would seem necessary. Incorporating this information in the analysis of genetic association studies provides more accurate and valid estimates. Therefore, we suggest that knowledge of TRD in genomic databases is essential to determine the relevance of genes in various diseases.

AUTHOR CONTRIBUTIONS
The research question for this manuscript was conceived by LH. AL reviewed and approved the conceived research question. CI acquired and provided the data used in this manuscript. LH developed, implemented and applied the method for simulation studies and real data analysis, and wrote the R software package "TRD." AL contributed to a revision of the statistical model. LH drafted the manuscript. AL and CI reviewed it critically for important intellectual content. LH, AL, and CI all approved the final version to be published. LH, AL, and CI all agreed to be accountable for all aspects of work in ensuring the questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.