Correcting for Population Stratification Reduces False Positive and False Negative Results in Joint Analyses of Host and Pathogen Genomes

Naret, Olivier; Chaturvedi, Nimisha; Bartha, Istvan; Hammer, Christian; Fellay, Jacques; , The Swiss HIV Cohort Study (SHCS)

doi:10.3389/fgene.2018.00266

ORIGINAL RESEARCH article

Front. Genet., 30 July 2018

Sec. Computational Genomics

Volume 9 - 2018 | https://doi.org/10.3389/fgene.2018.00266

Correcting for Population Stratification Reduces False Positive and False Negative Results in Joint Analyses of Host and Pathogen Genomes

Olivier Naret^1,2^†

Nimisha Chaturvedi^1,2^*^†

Istvan Bartha^1,2

Christian Hammer^1,2

Jacques Fellay^1,2,3^* and the Swiss HIV Cohort Study (SHCS)^‡

¹School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne, Switzerland
²Swiss Institute of Bioinformatics, Switzerland
³Precision Medicine Unit, Lausanne University Hospital, Switzerland

Studies of host genetic determinants of pathogen sequence variations can identify sites of genomic conflicts, by highlighting variants that are implicated in immune response on the host side and adaptive escape on the pathogen side. However, systematic genetic differences in host and pathogen populations can lead to inflated type I (false positive) and type II (false negative) error rates in genome-wide association analyses. Here, we demonstrate through a simulation that correcting for both host and pathogen stratification reduces spurious signals and increases power to detect real associations in a variety of tested scenarios. We confirm the validity of the simulations by showing comparable results in an analysis of paired human and HIV genomes.

Introduction

Important inter-individual differences can be observed in human responses to infections, and in recent years researchers have started to explore the genetic underpinning of this phenotypic diversity (Prugnolle et al., 2005; Vannberg et al., 2011; Chapman and Hill, 2012; Rausell and Telenti, 2014; McLaren and Carrington, 2015). A better understanding of host-pathogen interactions at the genomic level could help explain pathogenesis, predict disease outcome or develop new therapeutics or vaccines.

Multiple genome-wide association studies (GWAS) of clinical outcomes have identified human genetic variants that play a modulating role in infectious diseases (Fumagalli et al., 2009; Thomas D. L. et al., 2009; Rauch et al., 2010; Rockett et al., 2014). One of the most prominent examples is the strong association between human leukocyte antigen (HLA) variation and HIV-1 control (Fellay et al., 2007; Thomas R. et al., 2009; Apps et al., 2013). To further explore the potential impact of human genetic diversity on infection, we recently proposed to integrate host and pathogen genomic data in a single analytic framework (which we called genome-to-genome analysis, or G2G; Bartha et al., 2013). Through a pilot study in an HIV-infected population, we demonstrated that it is possible to detect the sites of genomic conflict between the host and the pathogen. Host restriction factors can thus be uncovered by identifying the escape mutations that accumulate in the pathogen genome in response to selection pressure exerted by host genetic variants.

Systematic genetic differences in a study population, or population stratification, can lead to inflated type I (false positive) and type II (false negative) error rates (Wright, 1951; Serre et al., 2008; Hinrichs et al., 2009). In standard GWAS, the association analyses are corrected for stratification by adding host covariates, such as principal components calculated from the host genomic data, to the model (Price et al., 2006; Bouaziz et al., 2011; Liu et al., 2013; Abraham and Inouye, 2014; Tucker et al., 2014). However, accounting for population structure becomes more challenging in a G2G analysis, when these systematic differences can be present in both the host and the pathogen populations.

In this paper, we explore the effects of population stratification in G2G analyses. We first present a basic introduction to the G2G framework and to the methods used for population stratification correction. We then simulate host and pathogen genetic variations using a broad array of parameters including stratification on both sides, which allows us to pinpoint the various effects of population stratification on different statistical models. We finally test for associations between genome-wide human genotypes and HIV-1 sequence diversity in a real-life dataset. The simulation models, as well as all the steps for analyzing the simulated dataset are implemented in R, and the framework is available on GitHub.

Materials and Methods

Genome-to-Genome Framework

In our G2G analysis, we use a regression model to search for associations between host and pathogen genetic variation. Host genetic variation is represented by a genotype data matrix with n samples and p SNPs. Pathogen variation is represented by a matrix with n samples and m binary variants. Then, if the variation data corresponding to a single position of the pathogen genome y with n observations is $y = {(y_{1}, y_{2}, \dots, y_{n})}^{⊤}$ , the regression model can be written as:

\begin{matrix} E (y_{j}) = α + \sum_{k = 1}^{p} β_{k} X_{j k}, j = 1, \dots, n, & (1) \end{matrix}

where X_jk is the genotype for the j^th sample and the k^th SNP, and α is the intercept. β_k is the coefficient value for the k^th SNP; it explains the association patterns between the pathogen variations y and genotype data {X_jk, j = 1, …, n; k = 1, …, p}. Once all the β coefficients are known, the association p-values are obtained by testing the null hypothesis H₀:β = 0 against the alternative hypothesis H₁:β≠0.

Simulation Study

Generation of Host Genetic Data (SNPs)

Our simulation design is based on the Balding Nicholas Model (Balding and Nichols, 1995), which provides a framework for estimating the probabilities of observed genotypes, taking into account population structure and variance in allele frequency estimates. For simulating stratification between host subpopulations, we start with drawing an ancestral allele frequency R_f for each SNP from the uniform distribution on [0.1, 0.9]. To generate a SNP stratified between two subpopulations we use a specified F_ST (Wright, 1951; Holsinger and Bruce, 2009) for drawing two alternate allele frequencies $A_{f}^{P 1}$ and $A_{f}^{P 2}$ from a β distribution with parameters:

\begin{matrix} B (\frac{R_{f} (1 - F_{s t})}{F_{s t}}, \frac{(1 - R_{f}) (1 - F_{s t})}{F_{s t}}) & (2) \end{matrix}

This distribution has mean R_f and variance F_stR_f(1−R_f). Then, for a population P1 with an alternate allele frequency of $A_{f}^{P 1}$ , the genotypes 0 (homozygous for reference allele), 1 (heterozygous) or 2 (homozygous for alternate allele) are assigned to each sample with probabilities ${(1 - A_{f}^{P 1})}^{2}$ , $2 A_{f}^{P 1} (1 - A_{f}^{P 1})$ and ${(A_{f}^{P 1})}^{2}$ , respectively. For an unstratified SNP, the genotypes 0, 1, and 2 are assigned with probabilities ${(1 - R_{f})}^{2}, 2 R_{f} (1 - R_{f})$ , and $R_{f}^{2}$ , respectively.

Generation of Pathogen Variables

To create a binary variable of value 0 (absence) or 1 (presence) for each pathogen variant, we start by generating the background random variations represented by a binary vector, Y_bg. We draw the parameter R_f from a uniform distribution, which defines the ancestral mutation rate for a given position on the pathogen genome. For variables that are not stratified between groups A and B, Y_bg is generated from a binomial distribution with R_f being the probability of mutation. For variables that are stratified between A and B, we use a specified F_ST to obtain two alternate mutation frequencies $A_{f}^{A}$ and $A_{f}^{B}$ . These alternate frequencies are then used as probabilities for generating Y_bg, in groups A or B, from a binomial distribution.

In a second step, we add associations between a selected set of binary pathogen variants and host SNPs. We start by generating a binary vector, Y_ca, using the logistic function given as:

\begin{matrix} p_{c a} = \frac{1}{1 + e^{- θ (x)}} - 0.5 & (3) \end{matrix}

Here, θ(x) = x β, with x representing a vector of genotypes for all samples and β representing the association coefficient. The vector of probabilities p_ca is then used to generate Y_ca from a binomial distribution. Using Y_bg and Y_ca, the final pathogen variations are generated as:

\begin{matrix} Y = {\begin{array}{l} 0 & Y_{b g} = Y_{c a} = 0, \\ 1 & o t h e r w i s e \end{array} & (4) \end{matrix}

Therefore, a sample can have a value of 1 for a given pathogen variant in the presence of background variation (Y_bg = 1), causal variation (Y_ca = 1), or both (Y_ca = Y_bg = 1). This strategy allows for the generation of a matrix of binary pathogen variants that includes background variation as well as variants associated with host SNPs.

Joint Analysis of Human and HIV Genetic Variation

Study Participants

Human and viral genomic variation data from 1,668 participants in the Swiss HIV Cohort Study (Ledergerber et al., 1994) were included in the analysis.

Human Genotype Data

Human genome-wide genotyping data were generated in the context of previous studies. SNPs were excluded on the basis of per-individual missingness (>3%), genotype missingness (>1%), and marked deviation from Hardy-Weinberg equilibrium (p < 1 × 10⁻⁷). Genotype imputation was performed on the Sanger imputation server, using EAGLE2 (Loh et al., 2016) for pre-phasing and PBWT (Durbin, 2014) with the 1000 Genomes Phase 3 reference panel (The 1000 Genomes Project Consortium, 2015). Imputed variants were filtered based on imputation INFO score (< 0.8).

HIV Sequence Data

Partial retroviral sequence data were obtained as part of routine clinical testing for resistance against antiretroviral drugs. The complete PR and 50% of the RT region were sequenced from pretreatment plasma samples. We defined an amino acid residue as variable if at least 10 study samples carried an alternative allele. Per position, separate binary variables were generated for each alternate amino acid, indicating the presence or absence of that allele in a given sample.

Association Analyses

We used logistic regression to test for associations between host SNPs and HIV amino acid variants. All models were run with plink (Purcell et al., 2007), assuming an additive genetic model. To assess the effect of stratification correction, we used four different approaches. First, we tested for associations without any correction for stratification. Second, we adjusted the model for covariates capturing host stratification. Third, we adjusted the model for covariates capturing pathogen stratification. Fourth, we used both host and pathogen covariates in the model. This resulted in four distinct sets of association results. To account for multiple testing, we used a Bonferroni adjusted alpha threshold.

Results

Simulation Analysis

Our simulation study includes paired host and pathogen data from 5000 samples. The hosts are stratified between two subpopulations P1 and P2 of 2500 samples each and the pathogens are stratified between two groups, A and B of 2500 samples each. To generate spurious signal, we created unequal groups of paired host and pathogen. Within P1, 1500 samples are infected by strain A and 1000 by strain B. Within P2, 1000 samples are infected by strain A and 1500 are infected by strain B. Genomic variation data include 50,300 host single nucleotide polymorphisms (SNPs) and 400 pathogen variants. The parameters used to generate host SNPs and pathogen variants are presented in Tables 1, 2, respectively. The first column of the tables (“Quantity”) gives the number of host SNPs or pathogen variant. The second column (“Causal association”) specifies the presence or absence of association between pathogen variants and host SNPs. For host SNPs, it also specifies the strength of the causal relationship (value of the β coefficient), as well as the corresponding associated pathogen variant (line reference for Table 2). The columns 3 (“Major stratification”) and 4 (“Minor stratification”) represent the two possible levels of stratification and specify the strength of stratification through Wright's F-statistics values (F_ST). These two columns also show which subpopulation has the highest/lowest variation frequency. More details on the simulation setup are provided in the Materials and Methods section.

TABLE 1

Table 1. Host SNPs parameters.

TABLE 2

Table 2. Pathogen variants parameters.

We used logistic regression to test for association between each host SNP and pathogen variant (20,120,000 tests in total). To correct for host stratification, we used the top 5 principal components of the human genotyping data, and to correct for pathogen stratification, we used the group marker as covariates. The significance was assessed using the Bonferroni threshold of 2.49 × 10⁻⁹. The results are presented in the next three subsections: [1] “No spurious signals;” [2] “False positives,” where the only observed signals are due to stratification; [3] “Power gain,” where real association signals are weakened by stratification. The input parameters to reproduce similar datasets using the G2G-Simulator as well as the exact simulated dataset detailed in this section is available on GitHub and zenodo (details under the section “Availability of Data and Material”).

No Spurious Signal

A total of 17,060,100 association tests fall into this category. For host-pathogen variant pairs with no association (N = 17,060,000), the median p-value is 0.5 regardless of the correction applied. For the 100 pairs of host SNPs (line 4, Table 1) and pathogen variants (line 3, Table 2) that are associated, we observe similar p-values irrespectively of the type of stratification correction, with median p-values ranging between 5.3 × 10⁻⁷and 6.7 × 10⁻⁷ for the four different approaches of correction. So, in the absence of spurious signals, correcting for any population structure has negligible effect on the association results. The Manhattan plots of the corresponding tests are available in the online supplementary materials (http://www.oliviernaret.com/g2g-sim-visualizer) by selecting “No spurious signal.”

False Positives

Here, we study the effects of stratification correction on false positive signals, through two scenarios, with different levels of stratification complexity.

Scenario 1, optimal correction: This scenario presents association tests between 10,000 host SNPs (l.2 Table 1) and 100 pathogen variants (l.1 Table 2) resulting in 1,000,000 tests (Figure 1A). The top half of the figure represents host SNPs frequencies in the two populations. Host SNPs are stratified between P1 and P2 with higher SNP frequency in P1, represented by more “+” signs under P1 than in P2. The bottom half of the figure represents pathogen variants frequencies in the two strains. Pathogen variants are stratified between A and B with higher variant frequency in A than B (Figure 1B, more “+” signs under A than B). Since there are more samples in P1 infected by pathogen group A and more samples in P2 infected by pathogen group B, it creates a correlation structure between host and pathogen populations, producing false positive signals. Figure 1B shows that, in the absence of correction for stratification, 0.5% of association results (N = 5,944) are false positive. However, both host or pathogen covariates are able to fully capture host or pathogen stratification. Therefore, using host covariate, pathogen covariate, or both, allows for a proper control of false positive signals, restoring a median p-value of 0.5. Details of this scenario are available in the online supplementary materials by selecting “Optimal host and pathogen correction.”

FIGURE 1

Figure 1. False positive signal. (A) Simulated host and pathogen variant frequency distribution for case 1. (B) P-value boxplot for case 1. (C) Simulated host and pathogen variant frequency distribution for case 2. (D) P-value boxplot for case 2.

Scenario 2, suboptimal correction: This scenario is represented in Figure 1C, with 39,900 tests between 200 host SNPs (line 3 and 5 of Table 1) and 200 pathogen variants (line 2 and 4 of Table 2). 100 tests are on pairs in a causal relation and will be discussed in the next section. Host SNPs are stratified at two levels: major stratification between P1 and P2 and minor stratification between hosts infected by the two pathogen groups (resulting in higher allele frequencies for hosts infected by pathogen A, regardless of population, (Figure 1D, additional “.” for host populations infected by A). Pathogen variants are also stratified at two levels: major stratification between A and B and minor stratification between pathogens present in the two host populations (resulting in higher pathogen allele frequencies in P1, regardless of the pathogen group, Figure 1D, additional “.” for pathogen groups within P1). False positive associations between host SNPs and pathogen variants can here result from: [A] major stratification on both sides (similar to scenario 1, above); [B] the correlation structure between major and minor stratification; and [C] the correlation structure between minor stratification on both sides. Figure 1D shows that not adjusting for stratification produces 9.8% (N = 2,954) of false positive associations with a median p-value of 5.7 × 10⁻³. Because the minor stratification is too weak to be entirely captured by principal components, using host covariate cannot entirely correct for stratification and leaves 1.4% (N = 409) of false positive associations with a median p-value of 1.9 × 10⁻¹. Similarly, using only the pathogen covariate does not capture the minor stratification and leaves 0.07% (N = 23) of false positive associations with a median p-value of 3.4 × 10⁻¹. Finally, including both covariates allows for a proper control and restores a median p-value of 0.5. Therefore, correcting for both host and pathogen stratification prevents false positive associations. Details of this scenario are available in the online supplementary materials by selecting “Suboptimal host and pathogen correction.”

Power Gain

This scenario presents association tests between 100 pairs of associated host SNPs and pathogen variants. The causal association is represented by the “arrow” sign from host to pathogen (Figure 2A).

FIGURE 2

Figure 2. Power gain. (A) Simulated data structure for stratified host and pathogen data with true associations. (B) P-value boxplot for stratified host and pathogen data with true associations.

For this scenario, the host SNPs (l.5 Table 1) are stratified at two levels: major stratification between P1 and P2 (with higher allele frequencies in P2 than P1), and minor stratification between hosts infected by the two pathogen groups, resulting in higher allele frequencies for hosts infected by pathogen B.

To generate the pathogen variants (l.4 Table 2) for this scenario we used equation 3 and 4, described in section “Materials and Methods.” This requires the host SNPs, binary background variations for the pathogen (Y_bg) and another set of binary pathogen variations, causally associated with the SNPs (Y_ca). The pathogen background variations or Y_bg are generated randomly, with stratification at two levels: major stratification between A and B (with higher pathogen allele frequencies in A than in B), and minor stratification between pathogens present in the two host populations, resulting in higher pathogen allele frequencies in P1. This distribution of pathogen background variation, along with the distribution of host variations, results into a negative correlation between host and pathogens because of two reasons. First, because the host population with a lower alternate allele frequency has pathogen genomes with a higher mutation rate, and vice versa. Second, because P1 and P2 infected by pathogen group B (lower mutation frequency) have a slightly higher alternate allele frequency than P1 and P2 infected by A (higher mutation frequency).

To generate the causal associations or Y_ca, we used equation 3 in section Generation Pathogen Variables. This makes Y_ca positively correlated with the host SNPs as the samples with fewer SNPs also have lower mutation rates in Y_ca. Using these positively correlated Y_ca and negatively correlated Y_bg for generating the final pathogen variations (equation 4, section Generation Pathogen Variables), nullifies the correlation structure due to stratification. This results in host and pathogen datasets in which causal associations are hidden under random noise due to stratification.

In Figure 2B, without correction the median p-value is 0.11 with 16% of significant associations. Correcting the model using host covariates decreased the median p-value to 3.6 × 10⁻⁴ (20% of significant associations), while adjusting the model using pathogen covariate decreased the median p-value to 4.1 × 10⁻⁵ (21% of significant associations). Finally, adjusting the model using both host and pathogen covariates decreased the mean p-value to 8.3 × 10⁻⁷, with 29% of significant associations. These results show that correcting for stratification on both host and pathogen sides reduces the number of false negative signals. The mean p-value from the model corrected for both host and pathogen stratification is comparable to what we observed in the case with association and no stratification (mean p-value of 6.7 × 10⁻⁷), with an expected minimal power loss due to stratification of associated SNPs. Details of this scenario are available in the online supplementary materials by selecting “Power gain.”

Joint Analysis of Human and HIV Genetic Variation

To compare the results of our simulations to real-life data, we accessed human and viral genomic data collected from 1,668 HIV-1 infected individuals participating in the Swiss HIV Cohort Study. We purposely selected a heterogeneous sample to ensure human and viral stratification. The principal component analyses showed diversity in terms of ethnicity, with 88% Caucasians, 5% Asians, and 4% Africans (Figure 3A); and of HIV-1 subtypes, with 81% of subtype B and 15% of subtype C infections (Figure 3B).

FIGURE 3

Figure 3. Population structures in HIV data. (A) Principal component plot for host data (first and second axis) (B). Phylogenetic principal component plot for the HIV virus data (first and second axis) (C). Pearson correlation between first five host principal components and first three HIV virus phylogenetic principal components.

The human genome-wide data consisted of 5,600,166 SNPs, obtained after quality control, imputation, and filtering. Binary HIV amino acid variants (N = 403) were obtained from sequence data in the protease (N = 155, 40 amino acid positions) and reverse transcriptase (N = 248, 82 amino acid positions).

We searched for associations between individual SNPs and amino acid variants using logistic regression. To correct for population stratification on the human side, we included the top five principal components of the genotyping data in the models. The principal component plot in Figure 4A shows the clustering of study samples with European, Asian and African reference samples from the HapMap dataset.

FIGURE 4

Figure 4. Allelic distribution of host SNP rs4913471 and HIV amino acid variant at position 67 in the protease region. (A) Genotypes of rs4913471 plotted on first two host principal components. (B) Presence or absence of amino acid variant at position 67 in the protease region plotted on first two host principal components.

To correct for pathogen stratification, we included the top three phylogenetic principal components (pPCs) (Liam, 2009) calculated using a phylogenetic tree built from the assembled HIV sequence data. Phylogenetic principal components control for phylogenetic covariance while producing PCA-like ordination. Figure 3B shows that the HIV genomes clustered mainly into two large groups (the first pPC explained 30% of the variation), on the first two pPC axes. We also computed Pearson correlation between the top five host principal components and the top three viral pPCs. Figure 3C shows only a weak correlation between host PCs and HIV pPCs, suggesting a need to correct for both to efficiently handle human and viral stratification.

Similar to what we did in the simulation study, we compared the association p-values obtained with four correction approaches. Upon correction for both host and pathogen stratification, we observed multiple highly significant associations between SNPs in the major histocompatibility complex (MHC) region of chromosome 6 and HIV amino acid variants. After correction for multiple testing (p < 2.2 × 10⁻¹¹), significant associations were observed with 9 positions in the HIV proteome (5 in the protease (PR) region and 4 in the reverse transcriptase (RT) region). The strongest association was between rs2844527 and RT position 135 (p = 3.4 × 10⁻³⁵). The top associated SNP for each HIV position is listed in Table 3. We replicated the associations, detected by Bartha et al. (2013), for the protease position 35 (associated SNPs in complete LD with rs2523577), 93 (associated SNPs in complete LD with rs2263323) and the reverse transcriptase position 135 (associated SNPs in high LD with rs1050502 with r² = 0.7). In addition, we found new associations between SNPs in the MHC region and multiple other positions in the protease and the reverse transcriptase regions.

TABLE 3

Table 3. HIV G2G associations.

As expected, correcting for population stratification at the human and/or viral levels led to an improvement in association p-values for known associated variants (“true positives”). For example, the p-value for the association between rs9266628 in the MHC region and HIV position 135 (presence or absence of amino acid residue I) in the reverse transcriptase region was 5.2 × 10⁻¹⁸ with no correction for stratification, 1.1 × 10⁻¹⁸ with correcting for host stratification only, 5.9 × 10⁻¹⁹ with correcting for pathogen stratification only, and 3.7 × 10⁻²⁰ with correction for both host and pathogen stratification.

Conversely, several false positive associations (type I errors) could be identified. For example, the p-value for the association between rs4913471 on chromosome 12 and HIV position 67 (presence or absence of amino acid residue T) in the protease region was 1.5 × 10⁻¹⁶ with no correction for stratification, 3.2 × 10⁻⁷ with correcting for host stratification only, 4.5 × 10⁻⁵ with correcting for pathogen stratification only, and 3 × 10⁻⁴ with correcting for both host and pathogen stratification. To show the absence of true association, we plotted the distribution of genotypes for rs4913471 as well as the distribution of 1 s and 0 s for the HIV amino acid, over the host principal components (Figure 4). Ethnicity correlates with both the SNP allele frequency (4A) and the presence or absence of the amino acid residue (4B), making this a classical case of false positive association due to stratification.

The summary statistics of this G2G analysis for all positions presented on Table 3 are available under the section “Availability of Data and Material.”

Discussion

Genome-to-genome (G2G) association analyses can help dissect complex interactions between host and pathogen by integrating their respective genomic variation in a single model. The allele frequency distribution of genetic variants is not uniform in populations, resulting in systematic differences (“stratification”) upon sampling of host and pathogen populations. Such population stratification, if not accounted for properly, can lead to false positive association signals as well as false negative results. We here performed a comprehensive analysis to understand the various aspects of population stratification correction in a G2G framework.

Our simulation study demonstrated three main points that characterize the stratification correction effects. First, in the absence of stratification, the inclusion of covariates to correct for potential stratification only has a negligible impact on the results. Second, the existence of stratification on both sides produces false positive signals, which can be best minimized by including covariates that summarize both host and pathogen stratification. Third, population stratification can weaken true association signals, resulting in false negative results. In this case, correcting for stratification increases power and facilitates the identification of true associations.

We also presented a framework to properly correct for pathogen stratification by using pPCs, derived from the pathogen phylogenetic tree, as covariates. In our joint analysis of human and HIV genetic variation, we corrected for stratification using pPCs for HIV and principal components from the human genotyping data. We identified strong association signals between several SNPs in the MHC region and amino acid variants in the HIV proteome. Several of them were reported in a previous HIV G2G analysis, which was performed in a very homogeneous human population and only corrected for population stratification on the viral side. Our G2G analysis of HIV did not identify new association, but confirmed the known HLA associations (Bartha et al., 2013; McLaren and Carrington, 2015). However, because of the extensive LD in the MHC region, determining a causal effect of the indexed SNPs will require elaborate population genetics, molecular genetics and functional assays.

We here propose to include principal components and phylogenetic principal components as covariates in the regression models to correct for host and pathogen stratification, respectively. An alternative approach would be to use mixed models, in which genetic relatedness matrices are obtained for both host and pathogen and used to correct for population stratification. Building on such an approach, Wang et al (Wang et al., 2018) recently proposed a two-way mixed effects model designed to simultaneously detect genetic variants on pairs of host and pathogen genomes that are associated with a phenotypic outcome. In addition to modeling the marginal effects, they also modeled the joint effects of host and pathogen variants on the phenotype. However, in contrast to our approach, where we search for associations between pathogen variant (dependent variable) and host variant (independent variable) independently of any clinical trait, the strategy proposed by Wang et al. is not designed to detect host-pathogen associations when neither host nor pathogen variants associate strongly with a phenotype.

Our proposed strategy involves the use of all pathogen variants at every step of the G2G analysis pipeline. However, this approach is suboptimal in a number of cases: [A] it can lead to overcorrection and loss of power if a single host SNP is associated with multiple correlated pathogen variants that are captured by one of the top principal components; [B] it increases the computational overhead and multiple testing burden if strain-defining pathogen variants, which are perfectly collinear with phylogeny, are included in the regression analyses. Additional steps could be envisioned to overcome these problems, including a correlation-based pruning of pathogen variants in the first case and the targeted removal of variants that are strongly correlated with phylogenetic PCs or strain types in the second case.

Conclusions

In summary, we show that correcting for both host and pathogen stratification is necessary for unbiased G2G analysis. We also provide a framework that can adjust for stratification in the absence of reliable categorical labels, by exploiting host and pathogen genetic information. Our simulation design is implemented in R and available via GitHub.

Ethics Approval and Consent to Participate

The Swiss HIV Cohort Study was approved by the local Ethics Committees of all participating centers, and written informed consent was obtained from the participants.

Availability of Data and Material

The G2G-simulator is available on GitHub, https://github.com/onaret/G2G-Simulator. The dataset based on the simulations of this article is available in the zenodo repository, https://zenodo.org/record/1154600. The dataset containing the summary statistics of genome-to-genome analysis on HIV is available in the zenodo repository, https://zenodo.org/record/1118308.

Author Contributions

NC and ON conceived and designed the experiments. ON implemented the G2G framework and conducted the simulation study. CH conducted the imputation and quality control of the HIV dataset. NC performed the G2G analysis on the HIV dataset. This work uses an original idea by JF, further developed by NC, ON, and IB. NC and ON wrote the manuscript. All authors read and approved the final manuscript.

Funding

This study has been partly financed within the framework of the Swiss HIV Cohort Study, supported by the Swiss National Science Foundation (grant #148522), by SHCS project #682 and by the SHCS research foundation. The data are gathered by the Five Swiss University Hospitals, two Cantonal Hospitals, 15 affiliated hospitals and 36 private physicians (listed in http://www.shcs.ch/180-health-care-providers).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

A preprint version of this article is available on biorxiv (Naret et al., 2018). We would like to acknowledge the members of the Swiss HIV Cohort Study: Anagnostopoulos A, Battegay M, Bernasconi E, Böni J, Braun DL, Bucher HC, Calmy A, Cavassini M, Ciuffi A, Dollenmaier G, Egger M, Elzi L, Fehr J, JF, Furrer H (Chairman of the Clinical and Laboratory Committee), Fux CA, Günthard HF (President of the SHCS), Haerry D (deputy of Positive Council), Hasse B, Hirsch HH, Hoffmann M, Hösli I, Huber M, Kahlert C, Kaiser L, Keiser O, Klimkait T, Kouyos RD, Kovari H, Ledergerber B, Martinetti G, Martinez de Tejada B, Marzolini C, Metzner KJ, Müller N, Nicca D, Paioni P, Pantaleo G, Perreau M, Rauch A (Chairman of the Scientific Board), Rudin C (Chairman of the Mother & Child Substudy), Scherrer AU (Head of Data Centre), Schmid P, Speck R, Stöckle M, Tarr P, Trkola A, Vernazza P, Wandeler G, Weber R, Yerly S.

References

Abraham, G., and Inouye, M. (2014). Fast principal component analysis of large-scale genome-wide data. PloS ONE 9:e93766. doi: 10.1371/journal.pone.0093766

PubMed Abstract | CrossRef Full Text | Google Scholar

Apps, R., Qi, Y., Carlson, J. M., Chen, H., Gao, X., Thomas, R., et al. (2013). Influence of HLA-C expression level on HIV control. Science 340, 87–91. doi: 10.1126/science.1232685

PubMed Abstract | CrossRef Full Text | Google Scholar

Balding, D. J., and Nichols, R. A. (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12.

PubMed Abstract | Google Scholar

Bartha, I., Carlson, J. M., Brumme, C. J., McLaren, P. J., Brumme, Z. L., John, M., et al. (2013). A genome-to-genome analysis of associations between human genetic variation, HIV-1 sequence diversity, and viral control. eLife 2:e01123. doi: 10.7554/eLife.01123

PubMed Abstract | CrossRef Full Text | Google Scholar

Bouaziz, M., Ambroise, C., and Guedj, M. (2011). Accounting for population stratification in practice: a comparison of the main strategies dedicated to genome-wide association studies. PloS ONE 6:e28845. doi: 10.1371/journal.pone.0028845

PubMed Abstract | CrossRef Full Text | Google Scholar

Chapman, S. J., and Hill, A. V. (2012). Human genetic susceptibility to infectious disease. Nat. Rev. Genet. 13, 175– 188. doi: 10.1038/nrg3114

PubMed Abstract | CrossRef Full Text | Google Scholar

Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272. doi: 10.1093/bioinformatics/btu014

PubMed Abstract | CrossRef Full Text | Google Scholar

Fellay, J., Shianna, K. V., Ge, D., Colombo, S., Ledergerber, B., Weale, M., et al. (2007). A whole-genome association study of major determinants for host control of HIV-1. Science 317, 944–947. doi: 10.1126/science.1143767

PubMed Abstract | CrossRef Full Text | Google Scholar

Fumagalli, M., Cagliani, R., Pozzoli, U., Riva, S., Comi, G. P., Menozzi, G., et al. (2009). Widespread balancing selection and pathogen-driven selection at blood group antigen genes. Genome Res. 19, 199–212. doi: 10.1101/gr.082768.108

PubMed Abstract | CrossRef Full Text | Google Scholar

Hinrichs, A. L., Larkin, E. K., and Suarez, B. K. (2009). Population stratification and patterns of linkage disequilibrium. Genetic Epidemiology 33 (Suppl. 1), S88–92. doi: 10.1002/gepi.20478

PubMed Abstract | CrossRef Full Text | Google Scholar

Holsinger, K. E., and Bruce, S. (2009). Weir. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nat. Rev. Genet. 10, 639–650. doi: 10.1038/nrg2611

CrossRef Full Text | Google Scholar

Ledergerber, B., von Overbeck, J., Egger, M., and Lüthy, R. (1994). The Swiss HIV Cohort Study: Rationale, organization and selected baseline characteristics. Sozial- und Präventivmed. 39, 387–394.

PubMed Abstract | Google Scholar

Liam, J. (2009). Revell. Size-correction and principal components for interspecific comparative studies. Evol. Int, J. Organ. Evol. 63, 3258–3268. doi: 10.1111/j.1558-5646.2009.00804.x

CrossRef Full Text | Google Scholar

Liu, L., Zhang, D., Liu, H., and Arendt, C. (2013). Robust methods for population stratification in genome wide association studies. BMC bioinformatics 14:132. doi: 10.1186/1471-2105-14-132

CrossRef Full Text | Google Scholar

Loh, P. R., Danecek, P., Palamara, P. F., Fuchsberger, C., Reshef, Y., Finucane, H., et al. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448. doi: 10.1038/ng.3679

PubMed Abstract | CrossRef Full Text | Google Scholar

McLaren, P. J., and Carrington, M. (2015). The impact of host genetic variation on infection with HIV-1. Nat. Immunol. 16, 577–583. doi: 10.1038/ni.3147

PubMed Abstract | CrossRef Full Text | Google Scholar

Naret, O., Chaturvedi, N., Bartha, I., and Fellay, J. (2018). Correcting for population stratification reduces false positive and false negative results in joint analyses of host and pathogen genomes. bioRxiv doi: 10.1101/232900

CrossRef Full Text | Google Scholar

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909. doi: 10.1038/ng1847

PubMed Abstract | CrossRef Full Text | Google Scholar

Prugnolle, F., Manica, A., Charpentier, M., Guégan, J. F., Guernier, V., and Balloux, F. (2005). Pathogen-driven selection and worldwide HLA class I diversity. Curr. Biol. 15, 1022–1027. doi: 10.1016/j.cub.2005.04.050

PubMed Abstract | CrossRef Full Text | Google Scholar

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795

PubMed Abstract | CrossRef Full Text | Google Scholar

Rauch, A., Kutalik, Z., Descombes, P., Cai, T., Di Iulio, J., Mueller, T., et al. (2010). Genetic variation in IL28B is associated with chronic hepatitis C and treatment failure: a genome-wide association study. Gastroenterology 138, 1338–1345, 1345.e1–7. doi: 10.1053/j.gastro.2009.12.056

PubMed Abstract | CrossRef Full Text | Google Scholar

Rausell, A., and Telenti, A. (2014). Genomics of host–pathogen interactions. Curr. Opin. Immunol. 30, 32–38. doi: 10.1016/j.coi.2014.06.001

PubMed Abstract | CrossRef Full Text | Google Scholar

Rockett, K. A., Clarke, G. M., Fitzpatrick, K., Hubbart, C., Jeffreys, A. E., Rowlands, K., et al. (2014). Reappraisal of known malaria resistance loci in a large multi-centre study. In: Nat. Genet. 46, 1197–1204. doi: 10.1038/ng.3107

CrossRef Full Text | Google Scholar

Serre, D., Montpetit, A., Par,é, G., Engert, J. C., Yusuf, S., Keavney, B., et al. (2008). Correction of population stratification in large multi-ethnic association studies. PloS ONE 3:e1382. doi: 10.1371/journal.pone.0001382

PubMed Abstract | CrossRef Full Text | Google Scholar

The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526, 68–74. doi: 10.1038/nature15393

CrossRef Full Text

Thomas, D. L., Thio, C. L., Martin, M. P., Qi, Y., Ge, D., O'Huigin, C., et al. (2009). Genetic variation in IL28B and spontaneous clearance of hepatitis C virus. Nature 461, 798–801. doi: 10.1038/nature08463

PubMed Abstract | CrossRef Full Text | Google Scholar

Thomas, R., Apps, R., Qi, Y., Gao, X., Male, V., O'hUigin, C., et al. (2009). HLA-C cell surface expression and control of HIV/AIDS correlate with a variant upstream of HLA-C. Nat. Genet. 41, 1290–1294. doi: 10.1038/ng.486

PubMed Abstract | CrossRef Full Text | Google Scholar

Tucker, G., Price, A. L., and Berger, B. (2014). Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics 197:1045–1049. doi: 10.1534/genetics.114.164285

PubMed Abstract | CrossRef Full Text | Google Scholar

Vannberg, F. O., Chapman, S. J., and Hill, A. V. (2011). Hill. Human genetic susceptibility to intracellular pathogens. Immunol. Rev. 240, 105–116. doi: 10.1111/j.1600-065X.2010.00996.x

CrossRef Full Text | Google Scholar

Wang, M., Roux, F., Bartoli, C., Huard-Chauveau, C., Meyer, C., Lee, H., et al. (2018). Two-way mixed-effects methods for joint association analysis using both host and pathogen genomes. Proc. Natl. Acad. Sci. U. S. A. 115, E5440–E5449. doi: 10.1073/pnas.1710980115

PubMed Abstract | CrossRef Full Text | Google Scholar

Wright, S. (1951). The genetical structure of populations. Ann. Eugen. 15, 323–354.

PubMed Abstract | Google Scholar

Keywords: host-pathogen genomics, genome-wide association study, escape variants, population stratification, simulation study

Citation: Naret O, Chaturvedi N, Bartha I, Hammer C, Fellay J and the Swiss HIV Cohort Study (SHCS) (2018) Correcting for Population Stratification Reduces False Positive and False Negative Results in Joint Analyses of Host and Pathogen Genomes. Front. Genet. 9:266. doi: 10.3389/fgene.2018.00266

Received: 15 May 2018; Accepted: 02 July 2018;
Published: 30 July 2018.

Edited by:

Gustavo Glusman, Institute for Systems Biology, United States

Reviewed by:

Daniel Shriner, National Human Genome Research Institute (NHGRI), United States
Sandeep Kumar Dhanda, La Jolla Institute for Allergy and Immunology (LJI), United States

Copyright © 2018 Naret, Chaturvedi, Bartha, Hammer, Fellay and the Swiss HIV Cohort Study (SHCS). This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Nimisha Chaturvedi, bmltaXNoYS5jaGF0dXJ2ZWRpQGVwZmwuY2g=
Jacques Fellay, SmFjcXVlcy5mZWxsYXlAZXBmbC5jaA==

^†These authors have contributed equally to this work.

^‡Membership of the Swiss HIV Cohort Study is provided in the Acknowledgments.

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.