Conditioning on parental mating types can reduce necessary assumptions for Mendelian randomization

Mendelian randomization (MR) has become a common tool used in epidemiological studies. However, when confounding variables are correlated with the instrumental variable (in this case, a genetic/variant/marker), the estimation can remain biased even with MR. We propose conditioning on parental mating types (a function of parental genotypes) in MR to eliminate the need for one set of assumptions, thereby plausibly reducing such bias. We illustrate a situation in which the instrumental variable and confounding variables are correlated using two unlinked diallelic genetic loci: one, an instrumental variable and the other, a confounding variable. Assortative mating or population admixture can create an association between the two unlinked loci, which can violate one of the necessary assumptions for MR. We simulated datasets involving assortative mating and population admixture and analyzed them using three different methods: 1) conventional MR, 2) MR conditioning on parental genotypes, and 3) MR conditioning on parental mating types. We demonstrated that conventional MR leads to type I error rate inflation and biased estimates for cases with assortative mating or population admixtures. In the presence of non-additive effects, MR with an adjustment for parental genotypes only partially reduced the type I error rate inflation and bias. In contrast, conditioning on parental mating types in MR eliminated the type I error inflation and bias under these circumstances. Conditioning on parental mating types is a useful strategy to reduce the burden of assumptions and the potential bias in MR when the correlation between the instrument variable and confounders is due to assortative mating or population stratification but not linkage.

Mendelian randomization (MR) has become a common tool used in epidemiological studies. However, when confounding variables are correlated with the instrumental variable (in this case, a genetic/variant/marker), the estimation can remain biased even with MR. We propose conditioning on parental mating types (a function of parental genotypes) in MR to eliminate the need for one set of assumptions, thereby plausibly reducing such bias. We illustrate a situation in which the instrumental variable and confounding variables are correlated using two unlinked diallelic genetic loci: one, an instrumental variable and the other, a confounding variable. Assortative mating or population admixture can create an association between the two unlinked loci, which can violate one of the necessary assumptions for MR. We simulated datasets involving assortative mating and population admixture and analyzed them using three different methods: 1) conventional MR, 2) MR conditioning on parental genotypes, and 3) MR conditioning on parental mating types. We demonstrated that conventional MR leads to type I error rate inflation and biased estimates for cases with assortative mating or population admixtures. In the presence of nonadditive effects, MR with an adjustment for parental genotypes only partially reduced the type I error rate inflation and bias. In contrast, conditioning on parental mating types in MR eliminated the type I error inflation and bias under these circumstances. Conditioning on parental mating types is a useful strategy to reduce the burden of assumptions and the potential bias in MR when the correlation between the instrument variable and confounders is due to assortative mating or population stratification but not linkage.

Introduction
Randomized experiments, often called randomized controlled trials, are the gold standard for drawing causal inferences. In randomized experiments, observational units (e.g., subjects) are randomly assigned to different levels of the variable being used to assess the causal effect, e.g., the treatment. The randomization process eliminates the influence of potential confounding variables on the exposure variable (e.g., treatment or control). Therefore, we can conclude that the observed difference in outcomes between groups in randomized controlled trials is purely caused by the treatment (barring stochastic variations). However, randomized experiments are not always ethical, feasible, or practical (Sanson-Fisher et al., 2007).
Observational studies do not always yield unbiased estimates of effects because of their lack of random assignment. Of the multiple limitations that these studies have, herein, we will only consider the bias due to confounding.
To mitigate confounding, researchers often include potential confounders in analyses as covariates in regression models or stratify analyses by confounders. Figure 1A depicts a general causal model with an exposure variable (X), an outcome (Y), a confounder (U), and a genetic marker (G), where U is associated with both X and Y and G determines X. The variables X, Y, and U are assumed to be continuous. Causal effects and associations are represented by directional and bidirectional arrows, respectively. If U is observable and is included in the model, the estimate of the effect of X on Y will be unbiased, provided the estimation method does not induce a bias. However, the confounder (U) is not always measurable or known. If U is a set of confounders of the relationship between X and Y and is not appropriately accounted for in the analysis, the estimator of the regression coefficient of Y on X will be biased.
Mendelian randomization (MR) was proposed to address the issue of unmeasured confounders in observational studies (Smith and Ebrahim, 2003;Boutwell and Adams, 2020;Sanderson et al., 2022). MR uses genotypic (G) data from loci that affect the exposure variable (X), do not have a direct effect on the outcome, and are uncorrelated with potential confounders. The most commonly used process of estimation is as follows: 1) X is regressed on G to obtain the predicted value of X,X; 2) Y is regressed onX, and then, the estimated coefficient is an unbiased estimator of the effect of X on Y under some assumptions. As a simple and robust approach for causal inference, MR has become common in epidemiological studies during the last few decades.
However, MR rests on three assumptions (Emdin et al., 2017): "1) the genetic variant is associated with the risk factor; 2) the genetic variant is not associated with confounders; and 3) the genetic variant influences the outcome only through the risk factor." In Figure 1A, these assumptions correspond to the following: 1) G and X are associated, 2) there is no association between G and U, and 3) there is no direct effect of G on Y, not through X. If any of the aforementioned three assumptions are violated, the estimated effect is not guaranteed to be unbiased.
Unfortunately, the violation of assumptions, especially the violation of assumption (2), is quite plausible: the genotype (G) can be associated with confounders (U). Even without a direct effect of G on U (or vice versa), assortative mating and population stratification can yield associations between them, which violate assumption (2). Furthermore, it is hard to verify this assumption because U includes unmeasurable variables: "The second and third assumptions, however, cannot be empirically proven and require both judgment by the investigators and the performance of various sensitivity analyses" (Emdin et al., 2017). This paper proposes conditioning on parental mating types (defined as a combination of genotypes of parents at a locus used as an instrumental variable (Allison, 1997)) in MR to eliminate the bias in conventional MR, when there is correlation between the instrumental variable and confounding variables. This means that our approach obviates the need for one of the three necessary assumptions in MR.
This paper consists of two parts. First, we demonstrate that the estimation using conventional MR (without conditioning on parental mating types) could lead to biased estimates, when there is a correlation between the instrumental variable and confounding variables due to assortative mating or population stratification. Second, we propose the use of parental mating types in conventional MR and to assess the utility of this approach.

Materials and methods
Mechanisms violating assumptions for MR: Assortative mating and population stratification First, we define the variables, parameters, and error terms used in the simulation and analyses (summarized in Table 1), with explicit mathematical expressions and causal mechanisms. There are six variables: X, Y, U, G, H, and P. X, Y, and U are an exposure variable, an outcome variable, and a confounder, respectively, and all are quantitative traits (thus, continuous variables), such as weight and height. G and H are genotypes defined by SNPs; thus, they are one of the three statuses: AA, Aa, aa { }for G and BB, Bb, bb { }for H, respectively; f A (*) and f D (*) are functions to calculate the additive and dominance effect of a genotype, respectively (f A counts the number of A [or B] alleles for the genotype; f D is 1 for a heterozygote and 0 for a homozygote). P is the parental mating type, a combination of genotypes of parents at a locus used as an instrumental variable, and one of the six statuses: AA/AA, AA/Aa, AA/aa, Aa/Aa, Aa/aa, aa/aa { } . We introduce five indicator functions to compute the genetic effect of parental mating types, I AA/AA (P), I AA/Aa (P), I AA/aa (P), I Aa/Aa (P), I Aa/aa (P), where the function is 1 if P is the same as the subscript of the function and otherwise, 0. The effect of a variable M on a variable N (i.e., the difference in N due to a single unit increase in M) is represented by β MN . It should be noted that the additive effect and the dominance effect of a genotype M on a variable N are represented as β MAN (i.e., difference in N by substituting allele A [or B] for allele a [or b]) and β MDN (i.e., deviance from the average of genotypic values of the two homozygotes), respectively. Furthermore, there are five coefficients to represent the effect of parental mating type P on a variable M using the parental mating type aa/aa as a reference group. Thus, for example, β AA/AAM is the unit increase in M for the parental mating type AA/AA compared to the increase in the parental mating type aa/aa. Estimated regression coefficients are distinguished from the causal effect using the Frontiers in Genetics frontiersin.org following:β MN is the regression coefficient estimated by regressing N on M. Figure 1B is a causal model in which the confounding variable set, U, is separated into two sets of variables: one set includes a confounding variable consisting of a genotype on a single biallelic locus (H) with two alleles B and b, and the second set U consists of all other confounders. We note that in our scenario, H and the genotype on the biallelic locus, used as an instrumental variable (G), are unlinked. However, G and H could be correlated (i.e., non-linkage disequilibrium), which would violate assumption (2). The following describes two situations, assortative mating and population stratification, which can cause such a non-linkage disequilibrium.

Situation 1: Assortative mating
In human and other animal populations, the choice of a mate does not plausibly occur at random. One may be more likely to mate with another who has specific phenotypes, resulting in non-random or assortative mating (Anonymous, 1903). For example, assortative mating for body mass index (BMI) or body fatness (i.e., individuals with a high BMI or body fatness are more likely to mate with one another, as are individuals with low BMI or body fatness) is widely observed (Allison et al., 1996;Silventoinen et al., 2003;Jackson et al., 2007). We modeled assortative mating as being dependent on the exposure variable X. Mothers and fathers are separately sorted by X, and they are paired according to the order. For this purpose, each parent's genotype, exposure, outcome, and confounders are explicitly modeled. Variables are given with one of the two subscripts, m or f, for either the mother or the father (variables without these subscripts are for an offspring). The model is summarized in Figure 1C.
Briefly, the correlation between G and H is explained as follows: Assortative mating on X (i.e., X m and X f ) induces associations between G m and H f and G f and H m , which result in an association between G and H, thus violating the MR assumption (2).

Situation 2: Population stratification
Population stratification occurs and can create genotype-phenotype associations in the absence of linkage or a causal effect of the specific genotype on the specific phenotype, when a population consists of multiple subpopulations (Freedman et al., 2004) and some subpopulations have different allele frequencies and phenotypic distributions. By using the framework given in Figure 1C without assortative mating, we assume two different subpopulations. Therefore, within each subpopulation, three assumptions are held for conventional MR. The difference between the two populations is that they have different allele frequencies. If data from the two subpopulations were analyzed as a single population without accounting for the population substructure, they would yield a spurious association between G and H (because all parental loci [G m , H m , G f , and H f ] are associated), which violates the MR assumption (2).

Correcting the bias in MR: Conditioning on parental mating types
Assuming the aforementioned two situations, the conventional MR estimation procedure can lead to biased estimates because MR assumption (2) is violated. To eliminate the bias, we propose conditioning on the parental mating type P, which is a combination of parental genotypes used for the instrumental variable in MR. The rationale for using P is that both G m and G f are located on open (i.e., d connected) paths between genotypes G and H in both situations 1 and 2, and conditioning on P blocks the path. We also follow the approach of using parental genotypes instead of mating types, as proposed by Hartwig et al. (2018), which is another reference method.
In the following, we show details of three methods: conventional MR, MR conditioning on parental genotypes [a method proposed by Hartwig et al. (2018)], and MR conditioning on parental mating types (which we propose in this study). It should be noted that unmeasurable variables (variables in ovals, given in Figure 1C) do not appear in any of the analyses.

1) Conventional
MR uses the following model: where ε X is an error U. (C) Explicit causal model for the father-mother-offspring trio. The parental mating type, P, is the combination of parental genotypes (G m and G f ), which takes one of the six possible values. The dotted line connecting X m and X f implies assortative mating of these variables.
Frontiers in Genetics frontiersin.org term. Therefore, the auxiliary regression of X on f A (G) and f D (G) is performed to obtain the estimated value of X (=X): X β 1 +β GAX f A (G) +β GDX f D (G). 2) Then, the regression of Y onX is conducted by assuming the following model with an error term ε Y : Y β 2 + β XYX + ε Y .

MR conditioning on parental genotypes
To correct for the bias in MR, Hartwig et al. (2018) proposed conditioning on parental genotypes. The analysis proceeds as follows: 1) MR conditioning on parental genotypes uses the following model: 2) The regression of Y onX is conducted assuming the following model: MR conditioning on parental mating types Hartwig et al. (2018) assumed an additive model and, thus, used a parental genotype as an instrumental variable. However, if the effect of the parental genotype on an offspring's phenotype is non-additive, using a parental mating type, i.e., a combination of parental genotypes taking one of the six possible values ( Figure 1C), is more appropriate. The corresponding analysis proceeds as follows: 1) MR conditioning on parental mating types uses the following model: X β 1 + β GAX f A (G) + β GDX f D (G)+ β AA/AAX I AA/AA (P) + β AA/AaX I AA/Aa (P) + β AA/aaX I AA/aa (P), + β Aa/AaX I Aa/Aa (P) + β Aa/aaX I Aa/aa (P) + ε X . Therefore, the auxiliary regression of X on G 1 conditioning on the parental mating type P is performed to obtain the estimated value of X (=X):X β 1 +β GAX f A (G) +β GDX f D (G) + β AA/AAX I AA/AA (P) +β AA/AaX I AA/Aa (P) +β AA/aaX I AA/aa (P), + β Aa/AaX I Aa/Aa (P)+β Aa/aaX I Aa/aa (P).

Simulations
To demonstrate the potential bias when conventional MR is used due to the violation of the MR assumption (2) and the utility of using parental mating types to eliminate the bias, we performed simulations considering assortative mating and population stratification.  For the simulation of each situation, we created data for 1,000 trio (father-mother-offspring) families (500 trios each for the population for situation 2) for a single simulation and performed three different analyses on each dataset. We repeated the process 1,000 times for each parameter setting. The type I error rate (when β XY 0) is defined as the proportion of simulations in which the estimated association between X and Y is statistically significant (false-positive finding). The bias in the estimated coefficient E[β XY − β XY ] is also assessed when β XY > 0. The coefficient β XY was set as 1.0 for bias assessment. The sensitivity of the type I error rate and the bias on the magnitude of the violation of MR assumption (2) were assessed by varying the parameters. The significance level was set as 0.05. The process for generating data and analyses are described in the next section.

Simulation 1: Assortative mating
The following is a step-by-step protocol and parameter setting for the simulation: 1) Allele frequencies of A and B are 10% for each: Prob(A) Prob(B) 0.1, Prob(a) Prob(b) 0.9. Each parent's genotypes (G m , H m , G f , and H f ) are determined assuming the Hardy-Weinberg equilibrium (Hardy, 1908). It should be noted that G and H are independent.
2) The confounding variables for parents, U m and U f , are determined, which follow a bivariate normal distribution: N(0, 0.1).
3) The exposure variables of parents, X m and X f , are determined by their genotype and confounding variable: where ε X~N (0, 0.1). β 1 is interpreted as the genotypic effect of the genotype aa on X. X f is determined in the same way as X m . 4) The outcome of parents, Y m and Y f , are determined by their exposure, genotype, and confounding variable: . β 2 is interpreted as the genotypic effect of the genotype bb on Y, when both X and U are zero. Y f is determined in the same way as Y m . 5) Proportion p is selected from paternal and maternal populations.
In the selected population, both parents are sorted separately by the exposure X m or X f and are paired according to the order of X m and X f . Unselected parents (1-p) are randomly coupled regardless of the values of X and Y. 6) The genotype of the offspring, G and H, are determined by randomly selecting an allele from each parent. 7) The exposure, X, and the outcome, Y, of the offspring are determined by following the same process as for the parents (see 3 and 4). Frontiers in Genetics frontiersin.org The sensitivity of the type I error rate and bias was assessed by changing p from 0.0 to 0.8. All effects from U to X and Y are assumed to be 1. For genetic effects, we assumed that there is no additive effect (β GAX β HAX β HAY 0), but there is a strong dominance effect (β GDX β HDX β HDY 1) of G and H on any associated variables.

Simulation 2: Population stratification
The simulation setting for simulation 2 is similar to simulation 1 save for a couple of differences: 1) no assortative mating and 2) we assume two populations (i.e., subpopulation 1 and subpopulation 2) with different allele frequencies. Allele frequencies of A and B for subpopulation 1 are 10% each. Otherwise, all simulation settings, including parameter settings, are the same as those in simulation 1. The source of the violation of MR assumption (2) is different allele frequencies. To demonstrate the sensitivity of the type I error rate and bias on the magnitude of the violation of MR assumption (2), allele frequencies of A and B for subpopulation 2 were varied from 10% to 90%. All simulations and analyses were performed using statistical computing software R (version 3.6.1).

Results
The type I error rate for simulation 1 is shown in Figure 2A. Type I error inflation was observed for conventional MR and MR conditioning on parental genotypes, and it increased as the proportion involved in assortative mating increased. Type I error inflation was not observed for MR conditioning on parental mating types. Type I error inflation was mitigated by conditioning on parental genotypes to some extent, which still remained. The type I error rate for simulation 2 is shown in Figure 2B. Type I error inflation was observed for both conventional MR and MR conditioning on parental genotypes but not for MR conditioning on parental mating types. As shown in simulation 1, conditioning on parental genotypes reduced but did not eliminate type I error rate inflation. Interestingly, we observed a large type I error inflation when allele frequencies for the subpopulation were intermediate (0.5). This is because we assumed that homozygous genotypes (i.e., AA, aa and BB, bb) have the same effect on phenotypes (X and Y).
The bias of the estimated coefficient is shown in Figures 2C, D. We observed similar results for the bias in estimation as in type I error rates. When type I error rate inflation was observed, a statistically significant bias was also observed, and magnitudes of type I error rate inflation and absolute bias were positively associated.

Discussion
MR has become a common approach for causal inference in epidemiology, as genetic data become more accessible owing to fast and efficient DNA sequencing technology and as journals and funding bodies encourage data sharing (Levey et al., 2009;Bloom et al., 2014;Loder and Groves, 2015). However, as for most epidemiological approaches, MR has essential assumptions we need to check before performing analysis. Among them, the assumption of no association between genetic variants used in MR and confounders [MR assumption (2)] could be violated or is difficult to check in practice. First, we demonstrated that MR produces inflation in type I error rates and a biased estimation in realistic settings where the assumption is violated. We introduced two plausible situations: assortative mating and population stratification. The sensitivity of type I error rates and estimation bias was assessed by changing parameters relevant to the violation of the MR assumption. As expected, we observed type I error inflation and estimation bias in these realistic settings when conventional MR was used, and such inflations and biases worsened as violations became more severe. They were mitigated by conditioning on parental genotypes to some extent; however, type I error inflation remained. Second, we proposed the use of parental mating types for a valid association inference for these two situations. We successfully confirmed that conditioning on parental mating types solves the problem in both situations.
We noted that we are not the first to propose the idea of considering parental genetic information in an epidemiological study. The idea was originally proposed in testing for linkages in the presence of associations (Allison, 1997). Redden et al. suggested using parental mating types in the inference of genotype-phenotype associations (Redden and Allison, 2006). Later, Liu et al. (2015) extended the idea to testing causal effects of a fetal drive. In this work, they showed the relationship between this idea and MR. In MR, the genetic variant needs to be a causal variant. However, it may be difficult to verify this assumption in practice, if not impossible. Conditioning on parental mating types is one way to identify causal genetic variants, thus relaxing assumptions, specifically assumption (2) of MR (resulting in the strengthening of MR). In the context of MR, Hartwig et al. proposed using parental genotypes in the case of assortative mating, which violates MR assumption (3) (Hartwig et al., 2018). They proposed two methods to integrate parental genotypes in MR analyses. The first method is to adjust conventional MR by parental allele scores, which we used in this study. The second method is to use parental nontransmitted allele scores and the offspring allele score as instrumental variables of parental and offspring exposure variables. They demonstrated that both methods provide unbiased estimates of the exposure-outcome association and avoid type I error inflation even under strong assortative mating conditions. The difference between the study by Hartwig et al. and ours is that we assumed that the locus influencing the outcome (H) also influences the exposure (X). Therefore, their model is considered a special case of ours. Although Hartwig et al. (2018) concluded that only cross-trait assortative mating (between X and Y) yields a bias, we found that same-trait assortative mating (between Xs or between Ys) can also yield a bias due to the heritable confounding variable (H). Furthermore, we found that conditioning of parental genotypes is not enough to control the bias if effects of alleles on phenotypes are non-additive. In our previous work (Liu et al., 2015), we indicated that random mating is not assumed with conditioning on parental mating types. We also explained that it is necessary to condition on parental mating types to achieve randomization, which is the basis for causal inference. Further insights into the rationale for this or other ways of expressing fundamental ideas can be found in the study by Pearl et al. (2016).
Frontiers in Genetics frontiersin.org