Inference of gene-environment interaction from heterogeneous case-parent trios

Introduction: In genetic epidemiology, log-linear models of population risk may be used to study the effect of genotypes and exposures on the relative risk of a disease. Such models may also include gene-environment interaction terms that allow the genotypes to modify the effect of the exposure, or equivalently, the exposure to modify the effect of genotypes on the relative risk. When a measured test locus is in linkage disequilibrium with an unmeasured causal locus, exposure-related genetic structure in the population can lead to spurious gene-environment interaction; that is, to apparent gene-environment interaction at the test locus in the absence of true gene-environment interaction at the causal locus. Exposure-related genetic structure occurs when the distributions of exposures and of haplotypes at the test and causal loci both differ across population strata. A case-parent trio design can protect inference of genetic main effects from confounding bias due to genetic structure in the population. Unfortunately, when the genetic structure is exposure-related, the protection against confounding bias for the genetic main effect does not extend to the gene-environment interaction term.

Methods: We show that current methods to reduce the bias in estimated gene-environment interactions from case-parent trio data can only account for simple population structure involving two strata. To fill this gap, we propose to directly accommodate multiple population strata by adjusting for genetic principal components (PCs).

Results and Discussion: Through simulations, we show that our PC adjustment maintains the nominal type-1 error rate and has nearly the same power to detect gene-environment interaction as an oracle approach based directly on population strata. We also apply the PC-adjustment approach to data from a study of genetic modifiers of cleft palate composed primarily of case-parent trios of European and East Asian ancestry. Consistent with earlier analyses, our results suggest that the gene-environment interaction signal in these data is due to the self-reported European trios.

A complete list of conditional genotype probabilities for the affected child is given in Table 1 of Shin et al. (2014). For an additive model, in which $\beta_1 = \beta_2 \equiv \beta$ and $f_1(e) = f_2(e) \equiv f(e)$, the model simplifies considerably; e.g., where $O_g$ is an "offset" term that equals $\log 2$ for $g = 1$ and 0 otherwise.
The likelihood is a product of conditional probabilities over all trios in the study, viewed as a function of the parameters $\beta_1$, $\beta_2$, $f_1(e)$ and $f_2(e)$. Each trio's contribution to the likelihood can be viewed as the contribution of a matched set to a likelihood for a conditional logistic regression, in which the matched set comprises the affected child and other possible offspring of the parents, referred to here as the affected child's pseudo-siblings. After constructing appropriate matched sets, software for conditional logistic regression may be used to maximize the likelihood from a case-parent trio study.
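To make the matched-set view concrete, the following minimal sketch computes one trio's contribution to the conditional likelihood under the additive model. It is an illustration under stated assumptions, not the authors' code: the function name `trio_conditional_probability`, the parameter names `beta_g` and `beta_ge`, and the linear form f(e) = beta_ge * e are all hypothetical choices made here, and PC terms are omitted for brevity.

```python
import math


def trio_conditional_probability(child_g, matched_set, e, beta_g, beta_ge):
    """Conditional probability of the affected child's genotype within its matched set.

    matched_set: list of (genotype, offset) pairs, one per distinct genotype the
    parents can produce (affected child plus pseudo-siblings).
    Assumption: additive model with linear predictor g * (beta_g + beta_ge * e) + offset.
    """
    def eta(g, offset):
        # Linear predictor for an offspring with g copies of the variant allele.
        return g * (beta_g + beta_ge * e) + offset

    offsets = dict(matched_set)
    # Denominator sums over every member of the matched set, child included.
    denom = sum(math.exp(eta(g, o)) for g, o in matched_set)
    return math.exp(eta(child_g, offsets[child_g])) / denom


# Doubly-heterozygous parents: genotypes 0, 1, 2 are possible, with offset log 2 for g = 1.
both_het_set = [(0, 0.0), (1, math.log(2)), (2, 0.0)]
p_null = trio_conditional_probability(1, both_het_set, e=1, beta_g=0.0, beta_ge=0.0)
```

With all coefficients set to zero the offset alone reproduces the Mendelian transmission probabilities (1/4, 1/2, 1/4) for doubly-heterozygous parents, which is the role the offset plays in the likelihood.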
Code in the R environment for statistical computing (R Core Team, 2022) is available to perform such analyses and may be obtained from the first author upon request. The code sets up a data frame with one row for each affected child and each pseudo-sibling, and columns specifying the ID for each trio (ID), affection status coded as 1 for the affected child and 0 for pseudo-siblings, an offset variable (O) coded as log 2 for a heterozygous offspring of doubly-heterozygous parents and 0 otherwise, and the G, E and PC variables. We then call clogit() from the survival package (Therneau, 2021) to perform the conditional logistic regression. The argument to clogit() is a formula that specifies affection status as the response, trio IDs as strata(ID), offsets as offset(O) and the other model terms. For an additive model, the other model terms are a main effect for G, two-way interactions between G and E and between G and the PCs, and, finally, a three-way interaction between G, E and the PCs.
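The data layout described above can be sketched as follows. This is a hypothetical Python illustration of the row structure only, not the authors' R code: the function names `possible_offspring` and `matched_set_rows`, the column name `case` for affection status, and the `PC1`, `PC2`, ... naming are assumptions made here. The sketch assumes a biallelic locus with genotypes coded as 0, 1 or 2 copies of the variant allele, and one row per distinct genotype the parents can produce.

```python
import math


def possible_offspring(gm, gf):
    """Distinct offspring genotypes that parents with genotypes gm and gf can produce."""
    gametes = {0: [0], 1: [0, 1], 2: [1]}  # alleles each parental genotype can transmit
    return sorted({a + b for a in gametes[gm] for b in gametes[gf]})


def matched_set_rows(trio_id, child_g, gm, gf, e, pcs):
    """Rows for one trio: the affected child (case = 1) and its pseudo-siblings (case = 0)."""
    genotypes = possible_offspring(gm, gf)
    if child_g not in genotypes:
        raise ValueError("child genotype is not compatible with the parental genotypes")
    both_het = gm == 1 and gf == 1
    rows = []
    for g in genotypes:
        row = {
            "ID": trio_id,
            "case": 1 if g == child_g else 0,
            # Offset is log 2 only for a heterozygous offspring of doubly-heterozygous parents.
            "O": math.log(2) if (both_het and g == 1) else 0.0,
            "G": g,
            "E": e,
        }
        # Exposure and PCs are shared by the child and all of its pseudo-siblings.
        row.update({f"PC{k + 1}": pc for k, pc in enumerate(pcs)})
        rows.append(row)
    return rows


rows = matched_set_rows("trio1", child_g=1, gm=1, gf=1, e=1, pcs=[0.12, -0.30])
```

Trios whose parents can produce only one genotype (for example, two homozygous parents) yield a single-row matched set and contribute nothing to the conditional likelihood.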

A.2 Dependence of latent-class probabilities on E
Write the probabilities in terms of the conditional distribution of GG ′ given E as .
Supposing that the numerator and denominator both depend on E, so may their ratio. However, if we condition on the blocking variable X Thus, latent-class probabilities in the model adjusted for X do not depend on E.
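One way this argument might be written out is sketched below. It is an illustration rather than the authors' own display, and it rests on two assumptions that are labeled here: that the latent class is the causal-locus genotype $G'$ given the test-locus genotype $G$, and that within each level of $X$ the joint distribution of $(G, G')$ is independent of $E$. The lowercase symbols $g$, $g'$, $e$ and $x$ for realized values are introduced only for this sketch.

```latex
\begin{align*}
% Unadjusted: numerator and denominator may both depend on e.
P(G' = g' \mid G = g, E = e)
  &= \frac{P(G = g, G' = g' \mid E = e)}{P(G = g \mid E = e)}, \\
% Adjusted for X, assuming (G, G') is independent of E given X: e drops out.
P(G' = g' \mid G = g, E = e, X = x)
  &= \frac{P(G = g, G' = g' \mid X = x)}{P(G = g \mid X = x)}.
\end{align*}
```

Under these assumptions the right-hand side of the second line is free of $e$, which is the sense in which the latent-class probabilities in the adjusted model do not depend on the exposure.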

A.3 LDheatmaps of SNPs in MLLT3
LDheatmaps of pairwise $R^2$ values in and around the six SNPs in the MLLT3 gene that showed significant G × E with maternal alcohol consumption in Beaty et al. (2011) are shown in Figure S1 for self-reported Europeans and self-reported East Asians. There is generally stronger pairwise LD between SNPs that showed significant G × E in the self-reported Europeans than in the self-reported East Asians. The $-\log_{10}$ p-values from the PC-adjusted analysis are shown above the heatmap for the self-reported Europeans, who appear to be the drivers of the G × E signal.

Figure S1. LDheatmap of pairwise $R^2$ values in and around the six SNPs in the MLLT3 gene that showed significant G × E with maternal alcohol consumption in Beaty et al. (2011). Left panel: self-reported Europeans, with p-values from the PC-adjusted analysis shown above. Right panel: self-reported East Asians, with the names of the six SNPs shown above.