Mendelian Randomization: A Review of Methods for the Prevention, Assessment, and Discussion of Pleiotropy in Studies Using the Fat Mass and Obesity-Associated Gene as an Instrument for Adiposity

Pleiotropy assessment is critical for the validity of Mendelian randomization (MR) analyses, and its management remains a challenging task for researchers. This review examines how the authors of MR studies address bias due to pleiotropy in practice. We searched PubMed, Medline, Embase and Web of Science for MR studies published before 21 May 2020 that used at least one single-nucleotide polymorphism (SNP) in the fat mass and obesity-associated (FTO) gene as an instrumental variable (IV) for body mass index, irrespective of the outcome. We reviewed: 1) the approaches used to prevent pleiotropy, 2) the methods cited to detect or control violations of the independence or the exclusion restriction assumption, highlighting whether pleiotropy assessment was explicitly stated to justify the use of these methods, and 3) the discussion of findings related to pleiotropy. We included 128 studies, of which 33 reported an approach to prevent pleiotropy, such as the use of multiple (independent) SNPs combined in a genetic risk score as IVs. One hundred and twenty studies cited at least one method to detect or account for pleiotropy, including robust and other IV estimation methods (n = 70), methods for detection of heterogeneity between estimated causal effects across IVs (n = 72), methods to detect or account for associations between the IV and the outcome that do not run through the exposure (n = 85), and other methods (n = 5). Twenty-one studies suspected IV invalidity, of which 16 explicitly referred to pleiotropy and six incriminated FTO SNPs. Most reviewed MR studies cited methods to prevent, detect, or control bias due to pleiotropy. These methods are heterogeneous, and their triangulation should increase the reliability of causal inference.


INTRODUCTION
Mendelian randomization (MR) is an instrumental variable (IV) approach that exploits genetic variants, mostly single-nucleotide polymorphisms (SNPs), as IVs for a non-genetic exposure to infer a causal relationship between this exposure and an outcome in observational studies (Lawlor et al., 2008). The validity of MR is based on three key assumptions: 1) the IV is associated with the exposure, 2) the association between the IV and the outcome is unconfounded, and 3) the IV only affects the outcome via the exposure, known as the exclusion restriction criterion (Labrecque and Swanson, 2018; Brumpton et al., 2020). Horizontal pleiotropy, the phenomenon whereby a genetic variant affects the exposure and the outcome through independent pathways, without one effect being mediated by the other (Davey Smith and Hemani, 2014; Hemani et al., 2018a; Jordan et al., 2019), is a primary cause of violation of the exclusion restriction criterion (Dixon et al., 2020). It may lead to biased causal effect estimates, reduced statistical power, and/or increased type I error (Verbanck et al., 2018). We hereafter refer to "horizontal pleiotropy" as "pleiotropy" for the sake of brevity.
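To make the role of the exclusion restriction concrete, the following minimal Python sketch computes the ratio (Wald) estimate for a single SNP and shows how a direct SNP-outcome effect shifts it. The function names and all numerical values are illustrative assumptions and are not taken from any reviewed study; the standard error is a first-order approximation that ignores uncertainty in the SNP-exposure association.

```python
def wald_ratio(beta_gy, beta_gx, se_gy):
    """Ratio (Wald) estimate of the effect of exposure X on outcome Y from one SNP.

    beta_gx: SNP-exposure association; beta_gy: SNP-outcome association;
    se_gy: standard error of beta_gy. All inputs are hypothetical summary statistics.
    """
    if beta_gx == 0:
        raise ValueError("the SNP must be associated with the exposure (assumption 1)")
    estimate = beta_gy / beta_gx
    # First-order approximation: treats beta_gx as if it were known without error.
    se = abs(se_gy / beta_gx)
    return estimate, se


def biased_ratio(causal_effect, beta_gx, direct_effect):
    """Ratio estimate implied when the SNP also affects Y directly (horizontal pleiotropy).

    The SNP-outcome association becomes causal_effect * beta_gx + direct_effect,
    so the ratio equals causal_effect + direct_effect / beta_gx.
    """
    beta_gy = causal_effect * beta_gx + direct_effect
    return beta_gy / beta_gx


# Illustrative values only: true effect 0.10, SNP-exposure association 0.40.
valid = biased_ratio(0.10, 0.40, 0.0)     # exclusion restriction holds
invalid = biased_ratio(0.10, 0.40, 0.02)  # direct effect of 0.02 on the outcome
print(round(valid, 3), round(invalid, 3))
```

In this toy setting the bias equals the direct effect divided by the SNP-exposure association, which is why even a small pleiotropic effect can matter when the instrument is weak.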
The increasing use of MR (Sekula et al., 2016) has prompted both subject-specific (Pingault et al., 2016; Frayling and Stoneman, 2018; Goodarzi, 2018; Lor et al., 2019; Meng et al., 2019; Guo et al., 2021) and general reviews (Bochud and Rousson, 2010; Davies et al., 2013; Boef et al., 2015) of MR studies summarizing the state of practice of MR in the last decade. These suggest that the exclusion restriction criterion is not systematically assessed or discussed. For example, in their meta-epidemiological overview of the approaches used in MR, Boef et al. (2015) noted that only 111 of 178 studies (62.4%) reported on the plausibility of the exclusion restriction criterion. However, no review thus far has focused on how authors prevent or minimize bias related to pleiotropy in MR studies. Such an examination is important because, while evidence suggests that pleiotropy is ubiquitous in the human genome (Boyle et al., 2017; Chesmore et al., 2018), the absence of pleiotropy in an MR study cannot be empirically proven (Glymour et al., 2012). While several approaches to detect pleiotropy and/or provide robust MR estimates have been recently proposed and compared (Bowden et al., 2015; Hartwig et al., 2017; Thompson et al., 2017; Hemani et al., 2018a; Verbanck et al., 2018; Rees et al., 2019; Burgess et al., 2020b; Zhao Q. et al., 2020; Minelli et al., 2021), their use in practice, including in sensitivity analyses and triangulation, has not been documented across studies. In light of the recent MR guidelines (Davey Smith et al., 2019; Burgess et al., 2020a) that recommend assessing the robustness of MR results, we examine how potential bias due to pleiotropy is considered in the literature.
More specifically we summarize 1) the approaches used to avoid selecting pleiotropic genetic variants, 2) the methods used to detect and account for pleiotropy in the estimation of the causal effect, and 3) how researchers discuss the exclusion restriction criterion considering their assessment of pleiotropy, including the impact on the results when pleiotropy is suspected.
Reviewing the entire body of published MR studies would not be practical. Instead, we limited our investigation to MR studies that use SNPs in the fat mass and obesity-associated (FTO) gene as IVs to investigate the causal effect of adiposity on diverse outcomes. We expected studies that used FTO as an IV to provide a rich discussion of pleiotropy for several reasons. First, several SNPs in FTO have large and robust associations with body mass index (BMI) (Gill et al., 2019) and thus are considered strong IVs for adiposity and are commonly used in MR studies. Second, unlike IVs such as variants in the C-reactive protein gene, which encodes C-reactive protein (C Reactive Protein Coronary Heart Disease Genetics Collaboration et al., 2011), the exact biological pathways through which FTO affects adiposity are not fully understood, which complicates the assessment of pleiotropy. Third, some FTO SNPs are suspected of pleiotropy, with reported effects on a wide array of health issues ranging from cardiometabolic outcomes to cancer or mental health (Pausova et al., 2009; Delahanty et al., 2011; Hertel et al., 2011; Kivimäki et al., 2011; Li et al., 2012; Iles et al., 2013; Liu et al., 2013; Cronin et al., 2014; Aijala et al., 2015).

Search Strategy and Inclusion/Exclusion Criteria
We searched Pubmed, Medline, Embase and Web of Science to identify articles published before 21 May 2020 that met the following three criteria: 1) the primary analysis was MR, 2) the primary exposure was adiposity assessed by BMI, and 3) the IV(s) included at least one SNP in the FTO gene. We excluded studies in which MR was a secondary analysis to ensure that the study provided a detailed assessment of MR assumptions. We placed no restrictions on the outcome of interest. The search strategy and the specific exclusion criteria are provided in Supplementary Methods S1, S2, respectively.

Data Extraction and Analysis
For each study we recorded the use of a one-sample or two-sample MR design. We considered as one-sample both 1) studies that performed MR using individual-level data on the SNPs, exposure and outcome and 2) studies that used summary statistics on SNP-exposure and SNP-outcome associations from the same sample (Hemani et al., 2018a). Two-sample designs were defined as the use of summary statistics on SNP-exposure and SNP-outcome associations from two distinct samples (Hemani et al., 2018a). We documented three types of IVs: 1) a single IV, 2) genetic risk scores (GRS), which aggregate several SNPs into a single variable that corresponds to a weighted or unweighted sum of risk alleles (Burgess and Thompson, 2013; Burgess et al., 2016), and 3) multiple IVs. We defined multiple IVs as the use of ≥2 SNPs as separate IVs in a single model or the combination of estimated effects from ≥2 single SNPs into one summary causal effect using meta-analytic techniques (Burgess et al., 2016). We recorded the specific FTO SNP(s) used in each of the types of IVs described above.
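As an illustration of the GRS definition above (a weighted or unweighted sum of risk alleles), a minimal Python sketch follows. The function name, the genotypes, and the weights are hypothetical; real analyses may use imputed fractional dosages, which this simplified version deliberately rejects.

```python
def genetic_risk_score(risk_allele_counts, weights=None):
    """Return the genetic risk score (GRS) for one individual.

    risk_allele_counts: number of risk alleles (0, 1 or 2) carried at each SNP.
    weights: optional per-allele effects on the exposure, ideally estimated in an
             independent sample; if omitted, the score is the unweighted allele count.
    """
    if any(count not in (0, 1, 2) for count in risk_allele_counts):
        raise ValueError("allele counts must be 0, 1 or 2")
    if weights is None:
        weights = [1.0] * len(risk_allele_counts)
    if len(weights) != len(risk_allele_counts):
        raise ValueError("one weight per SNP is required")
    return sum(w * g for w, g in zip(weights, risk_allele_counts))


# Hypothetical genotypes at three BMI-associated SNPs and made-up per-allele weights.
genotype = [2, 1, 0]
weights = [0.08, 0.03, 0.05]
unweighted = genetic_risk_score(genotype)        # simple count of risk alleles
weighted = genetic_risk_score(genotype, weights) # sum of weight * allele count
print(unweighted, round(weighted, 2))
```

Because the score collapses all SNPs into one variable, a pleiotropic SNP inside it cannot be examined separately unless the per-SNP associations are also retained.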
We organized the data analysis around three themes that described how pleiotropy was handled in the analytical process including 1) the selection and combinations of IVs; 2) the methods used to detect and account for pleiotropy in the estimation of causal effects, and 3) the discussion of findings considering the assessment of pleiotropy. We documented the methods as they were explicitly stated in each article irrespective of their applicability or relevance. A critical appraisal of the use of some of the methods is provided in the Discussion.

Approaches to Prevent Pleiotropy (Selection and Combination of IVs)
According to MR guidelines (Burgess et al., 2020a), SNPs can be selected either from gene regions that specifically encode the exposure (biological approach) or on the basis of their statistical association with the exposure of interest (statistical approach). SNPs known to be, or suspected of being, pleiotropic may be excluded from the initial selection before performing the main analysis (Burgess et al., 2020a). We recorded the use of the biological and/or statistical approach to SNP selection. We also documented the use of multiple independent SNPs as IVs as a method to attenuate the effect of pleiotropy under the assumption that the pleiotropic effects of SNPs would be balanced and thus cancel each other out (Davey Smith, 2011). Finally, we recorded any other strategy explicitly presented as pertaining to the selection of IVs to minimize the presence of pleiotropy.

Approaches to Detect and/or Account for Pleiotropy in the Estimation of the Causal Effect
MR guidelines require authors to report on the methods used to evaluate MR assumptions, which includes investigating bias due to pleiotropy (Davey Smith et al., 2019; Burgess et al., 2020a). We recorded the methods used to evaluate the independence and the exclusion restriction assumptions, with the exception of those used for population stratification, highlighting whether pleiotropy assessment was explicitly stated to justify the use of these methods. We organized methods into four categories: 1) robust (e.g., MR-Egger) and other IV estimation methods (e.g., multivariable MR), 2) methods to detect heterogeneity of estimated causal effects across IVs (e.g., statistical tests of heterogeneity between the estimated SNP-specific causal effects), 3) methods to detect or account for associations between the IVs and the outcome that arise through pathways outside of the exposure (e.g., mediation analysis), and 4) other methods. Robust methods provide causal effect estimates under a weaker set of assumptions than conventional methods (Burgess et al., 2020a). A summary description of the methods, including their main assumptions and limitations, is presented in Supplementary Table S6.
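The contrast between a conventional and a robust estimator can be sketched with summary statistics. The Python example below is a simplified illustration, assuming inverse-variance weights based on the SNP-outcome standard errors only; the function names and the data are hypothetical, and inference (standard errors, the intercept p-value) is omitted. The conventional inverse-variance weighted (IVW) estimate is a weighted regression through the origin, whereas MR-Egger adds an intercept that absorbs an average direct effect.

```python
def ivw_estimate(beta_exp, beta_out, se_out):
    """Inverse-variance weighted estimate: weighted regression through the origin."""
    weights = [1.0 / s ** 2 for s in se_out]
    num = sum(w * x * y for w, x, y in zip(weights, beta_exp, beta_out))
    den = sum(w * x * x for w, x in zip(weights, beta_exp))
    return num / den


def mr_egger(beta_exp, beta_out, se_out):
    """MR-Egger point estimates: the same weighted regression with an intercept.

    Returns (intercept, slope). Requiring three SNPs is a choice made for this sketch.
    """
    if len(beta_exp) < 3:
        raise ValueError("this sketch requires at least three SNPs")
    # Orient every SNP so that its association with the exposure is positive.
    x = [abs(bx) for bx in beta_exp]
    y = [by if bx >= 0 else -by for bx, by in zip(beta_exp, beta_out)]
    w = [1.0 / s ** 2 for s in se_out]
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: true causal effect 0.5, identical direct effect of 0.02 for every SNP.
beta_exp = [0.10, 0.20, 0.30, 0.40]
beta_out = [0.5 * b + 0.02 for b in beta_exp]
se_out = [0.01] * 4

ivw = ivw_estimate(beta_exp, beta_out, se_out)
egger_intercept, egger_slope = mr_egger(beta_exp, beta_out, se_out)
print(round(ivw, 3), round(egger_intercept, 3), round(egger_slope, 3))
```

With these noise-free toy data the IVW estimate is pulled away from 0.5 by the direct effects, while the MR-Egger slope recovers 0.5 and the intercept recovers 0.02. Real data add sampling error, so neither recovery should be expected to be exact.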

Discussion of Findings Considering the Assessment of Pleiotropy
We verified whether the authors discussed the independence and the exclusion restriction assumptions, distinguishing studies that explicitly referred to the term "pleiotropy" from those that did not. When studies suspected IV invalidity, we further verified whether the authors reported the impact of IV invalidity on MR results, and whether any FTO SNPs were incriminated.
The data were tabulated using Microsoft Excel® 2016 and described using Stata/IC version 14.2 software (StataCorp, College Station, Texas, United States).

Study Selection and Characteristics of Studies
Our search identified 2,985 publications, of which 128 articles were included upon completion of the screening process (Figure 1). The 128 articles are listed in Supplementary Table S1. Included articles were published between June 2008 and May 2020, and mostly comprised one-sample MR analyses (n = 98, 76.6%; Table 1 and Supplementary Table S3). A total of 31 FTO SNPs were selected as IVs, with rs1558902 being the most frequently used (n = 64, 50%; Supplementary Table S2). While 74 studies (57.8%) used a GRS to represent the IVs (Table 1 and Supplementary Table S3), Figure 2 suggests a recent decline in the use of GRS in favor of multiple IVs.

Approaches to Prevent Pleiotropy (Selection and Combination of IVs)
While all of the 128 studies selected IVs on the basis of statistical association between SNPs and BMI in the literature, 33 studies (25.8%) proceeded further in their attempt to prevent the effect of pleiotropy in the selection of the IVs (Table 2 and Supplementary Table S4). Of the 33 studies, 12 excluded previously selected SNPs known or suspected to be pleiotropic. Ten articles cited the use of a GRS and another ten cited the use of multiple (independent) IVs to prevent the effect of pleiotropy (although six of the latter ten in fact used a GRS as IV), and a single study justified the use of a single SNP as IV as a method to prevent pleiotropy. These strategies, however, do not guarantee that bias due to pleiotropy is prevented or reduced (Burgess and Thompson, 2013).

Approaches to Detect and Account for Pleiotropy in the Estimation of the Causal Effect
Overall, 120 of 128 included studies assessed the plausibility of the independence and/or exclusion restriction assumptions (Table 3 and Supplementary Table S5). Of the 120 studies, 78 reported using more than one category of methods within our classification: robust and other IV estimation methods (n = 70), heterogeneity (n = 72), alternative pathways (n = 85) and others (n = 5, including the use of positive or negative control outcomes, colocalization, and verifying the concordance of MR results with those from other studies). A total of 95 studies explicitly cited pleiotropy to justify the use of such methods. MR-Egger was the most frequently reported method to assess pleiotropy (n = 68). Of the 68 studies, 66 used the intercept p-value as a test of the validity of IVs and 29 studies (Gao et al., 2016; ...) [...] specify how the MR-Egger results were used. Forty-six studies assessed pleiotropy by evaluating the heterogeneity of estimated causal effects across IVs, whether by graphical assessment (n = 20) or statistical testing (n = 7), or by comparing estimated causal effects from GRS or multiple IVs before vs. after exclusion of suspected pleiotropic SNP(s) (n = 20). A total of 30 studies attempted to detect pleiotropy by estimating pathways through which the IVs were associated with the outcomes other than the pathway through the exposure. Such studies mostly reported the estimation of associations between the IVs and measured risk factors of the outcome (n = 19), but also adjusted the IV-outcome or IV-confounder associations for the exposure (n = 7) or documented the associations between the IV and risk factors for the outcome in the literature (n = 5).
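The statistical testing of heterogeneity mentioned above is commonly based on Cochran's Q computed over the SNP-specific ratio estimates. The Python sketch below illustrates the calculation under simplifying assumptions (inverse-variance weights, independent SNPs); the function name and the three estimates are made up for illustration and do not come from any reviewed study.

```python
def cochran_q(estimates, standard_errors):
    """Cochran's Q and I-squared for SNP-specific causal estimates.

    Returns (pooled_estimate, q, degrees_of_freedom, i_squared).
    Under homogeneity, Q is approximately chi-square distributed with
    len(estimates) - 1 degrees of freedom.
    """
    if len(estimates) != len(standard_errors) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching standard errors")
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of total variation attributed to heterogeneity, floored at 0.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, df, i_squared


# Made-up ratio estimates: two SNPs agree, the third is an outlier.
estimates = [0.10, 0.10, 0.50]
standard_errors = [0.05, 0.05, 0.05]
pooled, q, df, i_squared = cochran_q(estimates, standard_errors)
print(round(pooled, 3), round(q, 2), df, round(i_squared, 3))
```

Here Q far exceeds 5.99, the 5% critical value of a chi-square distribution with 2 degrees of freedom, so homogeneity would be rejected. As noted in the Discussion, such heterogeneity is compatible with pleiotropy but is not specific to it.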

Discussion of Findings Considering the Assessment of Pleiotropy
Of 128 included articles, 108 discussed the plausibility of the independence and/or exclusion restriction assumptions, of which 89 studies made an explicit reference to pleiotropy (Table 4 and Supplementary Table S7). Invalid IVs were suspected in 21 studies (Supplementary Table S8), 16 of which cited pleiotropy as a potential source of invalidity. Eight of the 21 studies concluded that the MR results were possibly invalid, while nine studies reported that the results were robust. The remaining four studies did not discuss the impact of IV invalidity on the results. Six studies suggested that at least one FTO SNP (rs1558902, rs1421085, and rs17817449) was suspected of pleiotropy on the basis of sensitivity analyses (Table 4, Supplementary Tables S7, S8).

DISCUSSION
Pleiotropy is considered widespread in humans (Boyle et al., 2017) and thus presents a major challenge to the validity of MR studies, especially considering the limited knowledge of the biological function of many of the SNPs used as IVs (Danchin and Fang, 2016; Swerdlow et al., 2016). We reviewed studies that used SNP(s) in the FTO gene as IV(s) in MR to examine the strategies employed by authors to prevent, detect or control, and discuss biases due to the use of pleiotropic IVs. Our review extends the overview of statistical approaches used in MR published by Boef et al. (2015) by focusing on pleiotropy and by including recent developments such as two-sample MR and the use of MR-Egger. We observed that the vast majority of studies addressed pleiotropy by using several methods that operate under different assumptions (Lawlor et al., 2016; Hemani et al., 2018b). While most authors invoked pleiotropy at the analytical stage to justify the use of detection tools and robust methods, explicit attention was also given to the prevention of pleiotropy in the selection of IVs in a quarter of the articles reviewed, and pleiotropy was mentioned in the discussion in 70% of articles.

Notes to Table 2. Abbreviations: GRS, genetic risk score; SNP, single-nucleotide polymorphism; IV, instrumental variable. a These 10 articles discussed the use of multiple independent SNPs without explicitly mentioning GRS, even though six of the 10 in fact used a GRS as IV in the main analysis.

Notes to Table 3. a Counts in the table exceed the total number of studies (n = 128). b See Burgess et al. (2020a) for a summary of the listed methods. c Of the 69 studies that reported using MR-Egger, 66 used the intercept test p-value to infer whether or not pleiotropy was present, two studies (Tyrrell et al., 2016; Fan et al., 2018) compared the MR-Egger slope and the conventional MR causal effect estimate, while the last study did not specify how the MR-Egger results were used. d One study (Censin et al., 2017) did not specify the heterogeneity test used. e Three articles (Guo et al., 2016; Censin et al., 2019; Sun et al., 2020) mentioned adjustment of MR analyses for covariates potentially involved in pleiotropic pathways without describing it as multivariable MR.

Frontiers in Genetics | www.frontiersin.org February 2022 | Volume 13 | Article 803238

Our review highlighted three observations that merit attention for future MR studies. First, we documented an increasing use of multiple IVs over time, in addition to the exclusive use of statistical criteria to select IVs from the literature. This is largely explained by our focus on BMI as the exposure, since BMI is a polygenic trait without a specific proximal coding gene from which to select SNPs, as is common with protein-like exposures (Swerdlow et al., 2016). The increasing use of multiple SNPs is also motivated by attempts to increase the strength of the instruments, by the availability of large-scale genome-wide association studies from which to select instruments, and by the development of MR methods that require multiple IVs to provide robust MR estimates under a less stringent set of assumptions (Burgess et al., 2020a). However, selecting SNPs exclusively by a statistical approach increases the likelihood of including pleiotropic SNPs, which may lead to biased MR results (Hartwig et al., 2017; Bowden and Holmes, 2019). An advantage of using multiple IVs separately in an MR study is that it permits the use of robust methods, such as MR-Egger or median- or mode-based methods (Slob and Burgess, 2020). On the other hand, the use of a GRS, which was the most frequent method to combine SNPs in the studies that we reviewed, is convenient because it leads to a single IV (Burgess et al., 2016). Further, weighted GRS with independently derived weights lead to MR studies with statistical power similar to that of studies using multiple independent IVs (Palmer et al., 2012). While several studies justified the use of a GRS as IV as a method to prevent bias due to pleiotropy, they have to rely on the restrictive assumption that the pleiotropic effects of SNPs cancel each other out (Davey Smith, 2011), which is difficult to verify in practice. Using a GRS as IV further requires ensuring that each SNP in the GRS is itself a valid IV (Burgess and Thompson, 2013; Skaaby et al., 2018), a check that is limited by the low statistical power available for each SNP. Simulation studies have demonstrated that including even a small number of pleiotropic SNPs in a GRS can lead to biased MR estimates (Burgess and Thompson, 2013). We thus recommend that robust methods be used on the SNPs that form the GRS. Second, our review suggests that the MR-Egger intercept test is the most frequently reported method for pleiotropy assessment.
The validity of MR-Egger estimates and the interpretation of the intercept as the average pleiotropic effect of IVs require the InSIDE (Instrument Strength Independent of Direct Effect) assumption to be satisfied. InSIDE is not required for the use of the p-value associated with the intercept test of the validity of IVs. InSIDE states that the effects of the IVs on the exposure must be uncorrelated with the direct effects of the IVs on the outcome (Bowden et al., 2015), a condition that is likely to be violated in a one-sample setting (Slob et al., 2017; Minelli et al., 2021) because parameters are estimated in the same subjects. Violations of the InSIDE assumption result in increased type I error rates (Hartwig and Davies, 2016) and biased estimates in the direction of the observational associations (Minelli et al., 2021). Because testing the plausibility of the InSIDE assumption remains a challenge (Bowden, 2017), researchers should restrict the interpretation of MR-Egger estimates to two-sample settings, where the lack of correlation between SNP-exposure and SNP-outcome associations is more plausible. Thus, when using summarized data in a one-sample setting, the MR-Egger intercept test can be used to assess the invalidity of the IVs, but other robust methods such as the median-based (the second most widely reported robust method in this review) and mode-based methods should be preferred for robust causal effect estimation because they do not depend on the InSIDE assumption (Bowden, 2017). Further, their causal estimates are consistent in a one-sample context, unlike MR-Egger estimates, which are biased in the direction of the observational association, as shown in simulations (Minelli et al., 2021). Burgess and Thompson offer a careful discussion of the use of MR-Egger, while Burgess et al. (2018) show that the statistical power for the intercept test is low in most settings.
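To illustrate why a median-based estimator does not rely on InSIDE, the Python sketch below implements one common formulation of the weighted median: SNP-specific ratio estimates are sorted, each is assigned a cumulative weight position, and the estimate at the 50% position is obtained by linear interpolation. The function name, the equal weights, and the five ratio estimates are assumptions made for illustration; in practice the weights are usually the inverse variances of the ratio estimates.

```python
def weighted_median(ratio_estimates, weights):
    """Weighted median of SNP-specific ratio estimates (interpolated at the 50% position)."""
    if len(ratio_estimates) != len(weights) or len(ratio_estimates) < 2:
        raise ValueError("need at least two estimates with matching weights")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    pairs = sorted(zip(ratio_estimates, weights))
    betas = [b for b, _ in pairs]
    ws = [w for _, w in pairs]
    total = sum(ws)
    # Position of each sorted estimate: cumulative weight minus half its own weight.
    positions = []
    cumulative = 0.0
    for w in ws:
        cumulative += w
        positions.append((cumulative - 0.5 * w) / total)
    # Last estimate whose position is below 50%, then interpolate to the next one.
    below = max(i for i, p in enumerate(positions) if p < 0.5)
    fraction = (0.5 - positions[below]) / (positions[below + 1] - positions[below])
    return betas[below] + (betas[below + 1] - betas[below]) * fraction


# Made-up ratio estimates: four SNPs cluster near 0.3, one pleiotropic SNP gives 0.90.
ratios = [0.28, 0.30, 0.32, 0.31, 0.90]
equal_weights = [1.0] * 5
median_estimate = weighted_median(ratios, equal_weights)
mean_estimate = sum(ratios) / len(ratios)  # stands in for an equally weighted pooled mean
print(round(median_estimate, 3), round(mean_estimate, 3))
```

In this toy example the single outlying SNP shifts the pooled mean well above 0.3 but leaves the median within the cluster of concordant estimates, which reflects the general idea that median-based estimates tolerate a minority of invalid IVs.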
Table 4. Discussion of the independence and/or exclusion restriction assumptions

                                                                              n      %
Plausibility of the independence and/or exclusion restriction assumptions (n = 128)
  Discussed with specific reference to pleiotropy                            89   69.5
  Discussed without specific reference to pleiotropy                         19   14.9
IV invalidity a (n = 128)
  Suspected with specific reference to pleiotropy                            16   12.5
  Suspected without specific reference to pleiotropy                          5    3.9
Impact of IV invalidity on the validity of MR results (n = 21)
  May have affected validity of results                                       8   38.1
  No or low impact on validity of results                                     9   42.9
  Impact not (clearly) reported                                               4   19.0
Suspicion of invalidity/pleiotropy of FTO SNP(s) (n = 128)
  Yes                                                                         6    4.7
FTO SNP(s) suspected to be invalid/pleiotropic (n = 6)
  rs1558902 b                                                                 3   50.0
  rs1421085 c                                                                 2   33.3
  rs17817449 d                                                                1   16.7

Abbreviations: IV, instrumental variable; MR, Mendelian randomization; SNP, single-nucleotide polymorphism; FTO, fat mass and obesity-associated. a Refers to the suspicion of invalidity of one or more body mass index IV(s) for any outcome of interest. b The outcomes of interest involved in the suspected invalidity of rs1558902 were multiple sclerosis susceptibility (Gianfrancesco et al., 2017), phobic anxiety symptoms (Walter et al., 2015a), and depression (Walter et al., 2015b). c The outcomes of interest involved in the suspected pleiotropy of rs1421085 were common mental disorders (Kivimäki et al., 2011), and subjective well-being (van den Broek et al., 2018). d The outcome of interest involved in the suspected pleiotropy of rs17817449 was lipid profiles (Wang N. et al., 2018).

Third, using statistical approaches to detect pleiotropic IVs is challenging because apparent manifestations of pleiotropy may be confused with other phenomena, some of which do not invalidate MR or require different approaches than pleiotropy does. For example, the assessment of heterogeneity in the estimated causal effects across different IVs is based on the principle that if IVs are valid, the variation in their corresponding MR estimates should be due to chance (Greco et al., 2015). Large variations in IV-specific MR estimates are often considered indicative of pleiotropy, but can be due to other causes such as the non-collapsibility of odds ratios in MR analyses with a binary outcome (Vansteelandt et al., 2011; Hemani et al., 2018a), heterogeneity in the distribution of confounders of IV-exposure or IV-outcome associations in two-sample settings (Hemani et al., 2018a; Zhao Q. et al., 2020), or differential complier causal effects, i.e., associations between the IVs and the exposure that vary substantially across individuals (Baiocchia et al., 2014; Sainani, 2018). Similarly, reasons other than pleiotropy may explain non-null associations between the IVs and the outcome that may be considered indicative of pleiotropic IVs. One example is population stratification, which can be addressed by restricting the sample to a homogeneous ancestry or by applying correction methods (e.g., adjustment of MR models for principal components) (Davey Smith and Hemani, 2014). Additional causes of violation of the exclusion restriction that can be confused with pleiotropy have been proposed, including an exposure that varies over time, the presence of gene-environment interactions implicating IVs, and linkage disequilibrium between at least one of the IVs and a SNP that also affects the outcome (VanderWeele et al., 2014). Our review also allows a few observations pertaining to the use of SNPs in the FTO gene as IVs for BMI.
Four of the six studies that suspected that FTO SNPs used as IVs might be pleiotropic involved mental health phenotypes [e.g., subjective well-being (van den Broek et al., 2018), phobic anxiety symptoms (Walter et al., 2015a), or common mental disorders (Kivimäki et al., 2011), including depression (Walter et al., 2015b)]. This suggests that FTO may be associated with mental health through pathways that do not involve BMI, a hypothesis that is supported by animal studies (Hess et al., 2013; Sun et al., 2019). For example, FTO regulates the activity of the dopaminergic signaling pathways related to the regulation of learning, reward behavior, motor functions, and feeding in mice (Hess et al., 2013). Furthermore, other work on FTO-deficient mice suggested that FTO could influence anxiety- and depression-like behaviours via alterations in gut microbiota (Sun et al., 2019). Caution is therefore required when using FTO as an IV for BMI in studies of mental health phenotypes.
Two limitations of the current review should be noted. First, we do not present an exhaustive list of the methods to prevent, detect or control, and discuss biases due to the use of pleiotropic IVs. Rather, we focus on the methods reported in the 128 studies that we reviewed. While we captured most of the methods that are currently available, newer methods such as the Causal Analysis Using Summary Effect estimates (CAUSE) (Morrison et al., 2020) and MR analysis using mixture-model (MRMix) (Qi and Chatterjee, 2019) are not reported. Second, the methods reported in this review include common and validated methods, as well as methods or strategies that may be less efficient or optimal for detecting or accounting for bias due to pleiotropy in MR studies. Users must exercise caution in selecting the best method(s) for the data at hand.
Pleiotropy is a ubiquitous phenomenon that poses a threat to the validity of MR results and that is difficult to assess. MR-related methodological development is thriving, and users are encouraged to use more than one method to assess pleiotropy, heeding the assumptions required for each.