A Simple Approximation to Bias in Gene–Environment Interaction Estimates When a Case Might Not Be the Case

Case–control genetic association studies are often used to examine the role of the genetic basis in complex diseases, such as cancer and neurodegenerative diseases. The role of the genetic basis might vary by nongenetic (environmental) measures, what is traditionally defined as gene–environment interactions (G×E). A commonly overlooked complication is that the set of clinically diagnosed cases might be contaminated by a subset with a nuisance pathologic state that presents with the same symptoms as the pathologic state of interest. The genetic basis of the pathologic state of interest might differ from that of the nuisance pathologic state. Often, frequencies of the pathologically defined states within the clinically diagnosed set of cases vary by the environment. We derive a simple and general approximation to bias in G×E parameter estimates when the presence of the nuisance pathologic state is ignored. We then perform extensive simulation studies to show that ignoring the presence of the nuisance pathologic state can result in substantial bias in G×E estimates and that the approximation we derived is reasonably accurate in finite samples. We demonstrate the applicability of the proposed approximation in a study of Alzheimer’s disease.


INTRODUCTION
Genetic association studies that estimate the relationship between gene-environment interactions (G×E) and a complex disease have the potential to provide valuable clues to the underlying etiology of complex diseases, such as cancer and neurodegenerative disease. Casecontrol studies sample a set of cases and a set of healthy controls conditionally on the disease status that is often defined based on the observed clinical diagnosis. A commonly overlooked complication is that multiple pathologic mechanisms might share symptoms and hence result in the same clinical diagnosis; i.e., the set of cases might be contaminated by a subset with a nuisance pathologic diagnosis. Frequencies of the pathologic diagnosis of interest within the set of clinically diagnosed cases might vary by the environment.
Our motivating study is a genome-wide association study (GWAS) of late-onset Alzheimer's disease (AD), a neurodegenerative disorder that is clinically characterized by progressive mental decline but histopathologically defined by highly abundant amyloid deposits and neurofibrillary tangles in the brain (Potter and Wisniewski, 2012). Recent biomarker studies of AD (Salloway et al., 2014;Salloway and Sperling, 2015) reported that 36% of ApoE ε4 noncarriers and 6% of ApoE ε4 carriers clinically diagnosed with AD do not have evidence of amyloid deposition and hence do not qualify for the pathologic (true) diagnosis of AD.
We are interested in estimating the association between the genetic variants serving adaptive immune systems and the true, pathologically confirmed, AD status. The effect of the genetic variants might vary by the ApoE ε4 status, and in the context of this study, we define ApoE ε4 to be the environment. It is possible that the symptoms resulting in the clinical diagnosis are manifestations of an underlying polygenic mechanism, and hence, the clinical diagnosis is a surrogate of the true diagnosis. It is also possible that the symptoms with and without the amyloid deposition evidence represent diverse mechanisms each with a distinct genetic basis. In the latter case, the usual logistic model with the clinical diagnosis as an outcome is based on a misclassified disease status and hence misspecifies the link between G×E and the AD status.
Traditionally, case-control GWASs are analyzed in a logistic regression model as if the data are collected prospectively based on a justification provided by Prentice and Pyke (1979). In the situation when the disease status is misclassified with frequencies varying by the environment, the result of Prentice and Pyke (1979) does not naturally extend.
Extensive literature (Carroll et al., 2006;Buonaccorsi, 2010;Demidenko, 2013) reports on how the estimates of the main genetic effects can be biased in situations when the disease status is misclassified. We have recently examined bias in G×E when misclassification probabilities vary by the environment and proposed a solution that alleviates the bias (Lobach et al., 2018). This solution requires optimization of a complex nonlinear function. Interestingly, Neuhaus (1999) derived a general approximation to the bias in a univariate setting when the data are collected prospectively and are analyzed in the logistic regression model. We extend the literature by deriving a general theoretical bias and a convenient approximation to the bias for G×E when the data are collected retrospectively and hence in a logistic regression model where both the design of data collection and presence of the nuisance pathologic diagnosis in the set of cases are ignored. The approximation that we have derived does not require optimization of a complex nonlinear function and provides a convenient understanding about the magnitude and directionality of the bias.
Our paper proceeds as follows. We first describe the setting and the approximation that we have derived in the Results section. We next describe a series of simulation experiments showing that the approximation that we have derived is accurate relative to the empirical estimates and apply the approximation to the study of AD. We conclude the paper by a brief discussion.

RESULTS
We define G as the genotype, e.g. single-nucleotide polymorphisms (SNPs) measured at multiple locations. Let X be the environmental variable that interacts with G. For clarity of presentation, we base the development on binary variables, G and X. We assume that the genotype is independent of all environmental variables and the genotypes follow the Hardy-Weinberg equilibrium: G ~ Q(g, θ).
If θ is the frequency of minor allele a when the major allele is A, then the Hardy-Weinberg equilibrium model (Hardy, 1908) states We define D CL = {0, 1} as the observed clinical disease status defined based on a set of symptoms. Suppose that the same set of symptoms can be caused by two distinct pathophysiologic mechanisms. Let D be the true disease status defined based on the underlying pathology, where D = 1 indicates the disease of interest, while D = 1 * is the nuisance disease. Thus, pr(D = 1) + pr(D = 0) + pr(D = 1 * ) = 1. For ethical and/or budgetary reasons, it might not be possible to measure the underlying pathology on all patients; hence, D is latent.
In the setting of AD, D = 1 indicates the true AD state, i.e. the clinical diagnosis and evidence of amyloid deposition, and D = 1 * indicates the nuisance disease state, e.g. vascular dementia, Lewy body disease, and hippocampal sclerosis (Salloway and Sperling, 2015), clinically diagnosed as AD. In the setting of breast cancer, for example, the disease state of interest might be an estrogen receptor (ER)+ breast cancer, while the nuisance disease state is an ER− cancer. A hypothesis then could be that a set of the genetic variants is associated with the ER+ disease state, but not the ER− disease state.
Instead, an evaluation might be performed on a subset of patients or in an external reliability study. We define τ(X) = pr(D = 1|D CL = 1, X) to be the frequency of the true diagnosis of interest within the clinically diagnosed set that varies by the environmental variable X. In our setting, the clinical diagnosis of healthy controls corresponds to D = 0; hence, pr(D = 0|D CL = 0, X) = 1, and pr(D = 1|D CL = 0, X) = 0. We let probabilities of the clinical and true diagnoses in the population be π d D d cl pr CL cl = = ( ) and π d = pr(D = d), respectively. Let (X) = pr(D = 0|D CL = 1, X). For clarity of presentation, we suppose that all variables are binary. The setting can be easily extended to accommodate multilevel categorical variables. We first consider a binary setting where the risk parameters are defined in terms of D = 1 vs. D = 1 * and D = 0 combined. Then, the true risk model in terms of coefficients B = (β 0 , β G , β X , β G×X ) is In the second setting, we consider that the risk model is defined separately for D = 1 vs. D = 0 in terms of B = (β 0 , β G , β X , β G×X ) and for D = 1 * vs. D = 0 in terms of Β * * * * × * = ( ) β β β β 0 , , , In equation (2), B and B * might share coefficients, e.g. if β β X X = * . While we define the parameters of interest to be B = (β 0 , β G , β X , β G×X ), these parameters are independent and are different in models (1) and (2). Suppose that the clinical-pathological diagnosis relationship is ignored and the clinical diagnosis is used as the outcome variable. Then, the risk model is Derivations provided in the Online (Kullback, 1959) Methods show that Remarks on Formulas (4)- (7) 1. The Online Methods section provides formulas (A11)-(A15) for the setting with environmental variable Z that does not interact with the genetics.
2. When the clinical diagnosis and pathologic disease status correspond, i.e. (X) = 0 for X = 1 and X = 0, then all parameter estimates are unbiased. 3. If β G = 0, then γ G = 0. Hence, the usual logistic regression yields the correct estimate of the null β G . 4. β G = 0 does not necessary result in γ 0 = 0. Similarly, if β X = 0, then γ X might not be zero, and if β G×X = 0, then γ G×X might not be zero. Hence, the usual logistic regression does not yield the correct estimate of the null effect β 0 , β X , β G , and β G×X . 5. If β G = 0 and β G×X = 0, then γ G = 0 and γ G×X = 0. Hence, the usual logistic regression yields the correct estimate of the null β G and β G×X . 6. If the misclassification is nondifferential, i.e. (0) = (1), then if β X = 0, then γ X = 0. That is, the usual logistic regression model yields the correct estimate of the null effect β X . 7. If the misclassification is nondifferential, i.e. (0) = (1), then if β 0 = 0, β X = 0, β G = 0, and β G×X = 0, then γ G×X = 0. That is, the usual logistic regression model yields the correct estimate of the null effect of β G×X .
We next suppose that the true model is (2), while the parameters are estimated based on a misspecified model (3).
Derivations provided in the Appendix show that (10) Remarks on Formulas (8)-(11): 1. The Online Methods section provides formulae (A19)-(A23) for the setting with environmental variable Z that does not interact with the genetics. 2. If β 0 = 0 and β 0 0 * = , then γ 0 = 0. That is, the usual logistic regression yields the correct estimate of the null β 0 .

Simulation Experiments
We perform a series of simulation experiments to evaluate the accuracy of the proposed approximation. We set our other parameters to be similar to the values observed in our GWAS of AD. We let the genotype (G) be a Bernoulli random variable with a frequency of 0.10 to mimic a SNP and allow its effect to follow a recessive or dominant model. We simulated age (Z 1 ) to be Bernoulli with a frequency of 0.50, e.g. corresponding to a median split. Sex (Z 2 ) is Bernoulli with pr(Z 2 = 1) = 0.52. The binary variable X = {ε4+, ε4−}, which represents the ApoE ε4 status according to the presence or absence of the ε4 allele that occurs in approximately 14% of the population. Similarly to Salloway et al. (2014), we define the proportion of the nuisance disease within the clinical diagnosis as pr(D = 1 * |D CL = 1, ε4−) = 0.36 and pr(D = 1 * |D CL = 1, ε4+) = 0.06.

Setting A
We first simulate data according to model (1) while we estimate parameters in model (3) where the clinically diagnosed status is the outcome variable. The disease status is simulated according to model (1) with β G×ε4 varying from log(1) to log(8), and we set β Z 1 to be 0, 0.5, 1, or 1.5. The other risk coefficients are β 0 = −1, β G = −0.41, β Z 2 0 08 = − . , and β ε4 = log(8) = 2.1. We simulated 500 datasets with 3,000 cases and 3,000 controls. Shown in Table S1 are parameter estimates when β G×ε4 = 0, obtained as an average (empirical estimate) and standard deviation (SD) across the 500 simulated datasets, theoretical values, and the simple approximation derived in (A11)-(A15). For all parameters, the empirical values are close to the approximation, with the average difference across all parameters being 0.11. The approximation was furthest away from the empirical estimate of the main effect of ApoE ε4 status, with the empirical estimate being 1.80, while the approximation is 1.30.
Shown in Figures S1A-S1D are the estimates [empirical estimate that is the average across 500 simulated datasets (AVE), theoretical estimate (TH) (A11)-(A15), and approximation (APX) (A11.1)-(A15.1)] across the levels of β Z 2 that are indicated by color and across values of β G×ε4 across the panels of each graph. In all of these settings, the theoretical values and the approximations are accurate relative to the empirical estimates. The furthest from the empirical estimates are approximation for the main effect of the ApoE ε4 status with the difference mainly driven by the approximation.
We first note that when the presence of the nuisance disease is ignored, estimates of β 0 , β ε4 , β G×ε4 , and β G are substantially biased. The approximation that we derived is accurate relative to the empirical averages of the parameter estimates. For example, the empirical estimate of γ ε4 is 1.08, while the approximation is 1.08. The empirical estimate of γ G×ε4 is −0.10, while the approximation is −0.05.
Shown in Figures S4A, S4B are the estimates [empirical estimate that is the average across 500 simulated datasets (AVE) and approximation (A20)-(A23) across values of β G X × * along the x-axis when β G×ε4 = 0 ( Figure S4A) and when β G×ε4 = −0.9 (Figures S4B). In all of these settings, the theoretical values and the approximations are accurate relative to the empirical estimates.

Setting C
We next simulate the data with a smaller sample size and unequal number of cases and controls, i.e. 1,000 controls and 2,000 cases, with the rest of the setting being the same as Setting B. Table S3 presents empirical estimates, theoretical values (A16-A22), and approximation (A16.1-A22.1). We note here that when the nuisance disease is ignored, the estimates of the coefficients are substantially biased. The approximation that we derived is reasonably accurate relative to the empirical averages of the estimates.

Setting D
We next performed a study where we misspecified rates of the nuisance disease state within the clinical diagnosis by 5%, with the rest of the setting being the same as Setting C. We considered three settings: 1) when the rate is overestimated by 5% for both ε4 carriers and ε4 noncarriers; 2) when the rate is underestimated by 5% for both ε4 carriers and ε4 noncarriers; and 3) when the rate is overestimated by 5% in ε4 carriers and overestimated by 5% in ε4 noncarriers. In settings 1 and 2, the rate is equally misspecified between cases and controls; hence, the degree to which the rate is differential, i.e. varies by ε4 status, is preserved. In setting 3, however, we also increased the degree to which the rate is differential, which would likely have more impact. In settings 1 and 2, we observed no impact on the estimates, as shown in Table S4. In setting 3, however, the estimates changed slightly. Hence, the estimates are more robust to misspecification that does not increase the degree to which the rate is differential.

Analyses of Genetic Variants Serving Adaptive Immune System in AD
We applied the proposed analyses to a dataset collected as part of the Alzheimer's Disease Genetics Consortium. The data consist of 1,245 controls and 2,785 cases. The average age (SD) of cases and controls are 72.1 (9.1) and 70.9 (8.8) years, respectively. Among cases, 1,458 (52.4%) are men; among controls, 678 (63.9%) are men. At least one ApoE ε4 allele is present in (64.5%) of cases and 365 (29.1%) of controls.
Illumina Human 660K markers have been mapped onto human chromosomes using the National Center for Biotechnology Information (NCBI) dbSNP database (https:// www.ncbi.nlm.nih.gov/projects/SNP/). Chromosome location, proximal gene or genes, and gene structure location [e.g. intron, exon, intergenic, and untranslated region (UTR)] have been recorded for all SNPs. We inferred the adaptive immune system pathways based on the information from the Kyoto Encyclopedia of Genes and Genomes (www.genome.jp/kegg) (KEGG), Gene Ontology (GO) Consortium (www.geneontology.org), and Ariadne Genomics (www.ariadnegenomics.com). From these data with quality control measures (observed frequency of minor allele >5%), we inferred 133 SNPs to reside in genes serving the adaptive immune system.
It is of interest to examine a relationship between the pathologic diagnosis and each of the 133 SNPs (G), ApoE ε4 status (X), age (Z 1 ), and sex (Z 2 ). The effect of SNPs might vary by ApoE ε4; hence, we included the interaction between the genotype and ApoE ε4 status. The genetic variables are modeled using a binary indicator of the presence or absence of a minor allele.
We estimate parameters using the standard logistic model (3) that uses the clinical diagnosis as a surrogate of the pathophysiology. We assume that the proportion of nuisance disease is as estimated by the Salloway, S., and Sperling, R. (2015) study; i.e. the proportion of the nuisance disease within the clinically diagnosed set of cases is 36% in ApoE ε4 noncarriers and 6% in ApoE ε4 carriers. We first assume that the true model is (1); i.e. the susceptibility model is defined for amyloid-related AD symptoms vs. healthy controls and nonamyloid AD symptoms combined. In this setting, we estimate the magnitude of bias using approximation (A11.1-A15.1). We next assume that the true model is (2); i.e. the susceptibility model is defined for amyloid-related AD symptoms vs. controls and for non-amyloid-related symptoms vs. controls. We then estimate the magnitude of bias using (A19.1)-(A25.1).
Shown on Figures 1 and 2 are the estimated biases in the main effect of each SNP, ApoE ε4, and interaction between the SNPs and ApoE ε4 status. We first note that the bias in the estimate can be substantial. For example, in the usual logistic regression with the clinical diagnosis as an outcome variable, the main effect of  rs597587 is statistically significant with Bonferroni correction (p < 0.05/133). The estimate is γ  G = −1 1 . , while if model (1) is the true model, the bias is approximated to be 2.9 and the interactive effect is estimated to be γ  G× = ε4 2 0 . , p < 0.008, which means if the model was incorrectly specified, both main and interactive effects would be mistakenly interpreted. For another SNP, rs12111032, the main effect estimate is γ  G = 0 78 . with p < 0.05/133, while the bias is −0.04 if model (1) is the true model and −0.17 if model (2) is the true model. Table 1 presents the estimates of main effects of SNPs and p-values obtained using the usual logistic regression model with the clinically diagnosed disease status as an outcome variable for SNPs with a p-value <0.05 for γ  G or γ 

G×ε4
. We also mapped the genes from Table 1 to the amyloid pathway using KEGG and GO. A SNP, rs10059242, residing in an intergenic region between HTR4 and ADRB2 (ADRB2 is found to be in the amyloid pathway) has the main effect of γ  G = 1 5 . (p = 0.04) with bias almost equal to −0.9 (1) and 1.8 (2) and the G×E interaction of γ ε  G× = − 4 2 0 . , (p = 0.04) with bias almost equal to −0.9 (1) and −2 (2).
For three SNPs, rs401904, rs1748383, and rs12692222, the estimates in the usual logistic model (3) are nearly zero, while biases estimated assuming model (1) is true are nearly zero as well. This is correct with the theoretical observations that the null effect estimate based on the logistic regression can be unbiased.

DISCUSSION
We derived a simple and general approximation to bias in G×E parameter estimates when multiple pathologic mechanisms present with the same set of symptoms. The approximation to the bias relies on the estimates of the frequencies of the pathologic disease state within the clinical diagnosis. The approximation that we derived complements a recent study (Lobach et al., 2018) where we developed a pseudolikelihood method to incorporate uncertainty about the clinicalpathological diagnosis relationship, where the solution requires optimization of a complex nonlinear function. The approximation that we derived provides a simple formula that is intuitive and easy to apply.
We observed that parameter estimates could be substantially biased when the presence of the nuisance pathology is ignored. This observation has been also made by Manchia et al. (2013), where the simulation studies and data analyses on the main effect of genotype showed that the risk parameter estimates attributable to the genetics could be largely underestimated.
We define the bias to be a difference between the estimates obtained with the clinically diagnosed disease status as an outcome variable and the estimates with respect to the disease states. Within the notation of this study, bias is the difference between Γ and B. For example, from approximations (4)-(7), we see that the bias of a coefficient is a function of the true value of the coefficient, frequencies of the disease states within the clinical diagnosis, and also values of the other coefficients. Often, but not always, the null effect is correctly estimated, as we describe in the Remarks. When the proportions of the disease state of interest within the clinical diagnosis are the same in subpopulations by the environmental variable, in large samples, the estimates with the clinical diagnosis as an outcome variable are nearly unbiased. Even then, in practical settings, however, the estimates that ignore the presence of the nuisance disease might still have notable bias.
In our analysis of AD, the reliability study is based on 1,121 carriers of the ApoE ε4 allele and 1,331 noncarriers of the ApoE ε4 allele. Hence, we suppose that the clinicalpathological diagnosis relationship is estimated reasonably reliably well. Moreover, the reliability study is performed on the same patient population, i.e. patients followed up by the AD centers. There is a substantial, although not exactly known, overlap between the set with genotypes and the reliability set. In general, we advocate sensitivity analyses that examine potential differences in the parameter estimates due to misspecifications of frequencies of the pathologic diagnosis within the set of clinically defined cases.
Our derivations are based on a logistic model with a linear disease risk function. The same line of arguments might be extended to more complex relationships when, e.g., the link is not logit and when the interaction is nonlinear (Wu and Cui, 2013;Wu et al., 2018). These extensions might be the natural extensions of the current study. Another possible extension is to consider the setting when the disease states are the stages of the disease. Then the disease stage can be an outcome variable in a multilevel proportional logistic regression model.
In the context of gene-environment interaction analyses, model selection is often needed. While the current study does not provide a mechanism for model selection, in our previous work, we devised a pseudolikelihood model to correct for bias (Lobach et al., 2018) that can be used as a basis for model selection.
Though in the data analyses, we extracted the 113 SNPs for a better illustration of the application of proposed methods, the current method can be easily extended to all SNPs measured in a GWAS. Many of the SNPs would probably be not associated with the disease status, however, since as we see in the data analyses and theoretical derivations, null effects are often correctly estimated.
While our study is motivated by a specific example of AD, the application of the approximation that we derived is readily applicable to other diseases. For example, recent studies report that the underlying biologic mechanisms of breast cancer vary by expression of progesterone and estrogen measured in the ER/PR/HER2 status. Frequencies of subtypes can be estimated based on the SEER database (https://seer.cancer. gov/). Another possible application is when disease states of cancer are estimated by the mutation patterns available in the Cancer Genome Atlas.

ETHICS STATEMENT
The study is based on dataset downloaded from dbGaP database. The data are completely de-identified and identifiers are not available to the study researchers.

AUTHOR CONTRIBUTIONS
IL conceived the study, wrote the manuscript, derived the approximation, performed simulation studies and data analyses.IK supported the derivations of the approximations, simulation studies and data analyses; and reviewed the manuscript. AA mapped the genetic variants into subsets, supervised the analyses and interpretation of the results. SL contributed to the derivations, performed simulation studies and reviewed the manuscript. LZ provided overall supervision of the study.

FUNDING
IL is supported by grant # 5R21AG043710-02 from the National Institutes of Health.