ORIGINAL RESEARCH article

Front. Genet., 08 December 2021

Sec. Computational Genomics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.801261

Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies

  • 1. School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China

  • 2. Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China

  • 3. School of Information, Shanxi University of Finance and Economics, Taiyuan, China

  • 4. Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China

  • 5. School of Life Science, Shanxi University, Taiyuan, China

  • 6. Beidahuang Industry Group General Hospital, Harbin, China

Article metrics

View details

6

Citations

2,4k

Views

907

Downloads

Abstract

Unexplained genetic variation that causes complex diseases is often induced by gene-gene interactions (GGIs). Gene-based methods are one of the current statistical methodologies for discovering GGIs in case-control genome-wide association studies that are not only powerful statistically, but also interpretable biologically. However, most approaches include assumptions about the form of GGIs, which results in poor statistical performance. As a result, we propose gene-based testing based on the maximal neighborhood coefficient (MNC) called gene-based gene-gene interaction through a maximal neighborhood coefficient (GBMNC). MNC is a metric for capturing a wide range of relationships between two random vectors with arbitrary, but not necessarily equal, dimensions. We established a statistic that leverages the difference in MNC in case and in control samples as an indication of the existence of GGIs, based on the assumption that the joint distribution of two genes in cases and controls should not be substantially different if there is no interaction between them. We then used a permutation-based statistical test to evaluate this statistic and calculate a statistical p-value to represent the significance of the interaction. Experimental results using both simulation and real data showed that our approach outperformed earlier methods for detecting GGIs.

1 Introduction

Genome-wide association studies (GWAS) has been used to investigate the associations between genetic variants and complex disorders with great success. Researchers have discovered more than 71,000 unique single nucleotide polymorphisms (SNPs) associated to diseases throughout the last decade (Hindorff et al., 2009; Zhang et al., 2016; Zeng et al., 2017; Guo et al., 2018; Buniello et al., 2019; Loos, 2020; Li et al., 2021). Traditional GWAS, on the other hand, concentrated on the independent, additive, and cumulative effects of individual SNPs on specific diseases. The majority of associated SNPs are common genetic variants with small effects that only explain a portion of complex disease heritability. Many genes, environmental variables, and interactions play a crucial role in the underlying genetic architecture of complex diseases (Cordell, 2009; Moore et al., 2010; Jiang et al., 2018; Liu et al., 2018; Liu et al., 2019a; Zhang et al., 2019; Chen et al., 2020; Luo et al., 2020; Liu et al., 2021; Shao et al., 2021; Su et al., 2021; Wang et al., 2021). As a result, genetic interactions are thought to enlighten studies into “missing heritability” (Manolio et al., 2009; Fang et al., 2019; Young, 2019; Tang et al., 2020; Song et al., 2021) and give important knowledge for constructing topologies for complex disease-related pathway.

Genetic interaction was originally explored at the SNP level, named epistasis. Methods (Li et al., 2015a; Ritchie and Van Steen, 2018; Lyu et al., 2020) can be classified into three categories based on their search strategy: exhaustive methods, searching methods, and machine learning-based methods, such as statistics based on entropy (Dong et al., 2008) and odds-ratios (Emily, 2012); MDR (Ritchie et al., 2003), BEAM (Zhang and Liu, 2007), BOOST (Wan et al., 2010), Epi-GTBN (Guo et al., 2019), GenEpi (Chang et al., 2020), and some accelerate methods (Nobre et al., 2021). For example, a logistic regression analysis revealed a significant interaction between the genes ERAP1 (rs27524) and HLA-C (rs10484554) in psoriasis (), indicating that ERAP1 SNP was effective only in individuals who had at least one copy of the HLA-C SNP risk allele (Képíró et al., 2021). The statistical weakness of high-order or pairwise tests, which come from enormous multiple testing corrections over all pairs of SNPs, is one of the general problems of these marker-based approaches. Instead, we explored the interaction of two genes in a single gene-based interaction detection by treating SNPs inside a gene as a group.

The effectiveness of gene-based methods in GWAS marginal association studies should be extended to the study of gene-gene interaction (GGIs) (Emily, 2018; Emily et al., 2020). This strategy offers a number of possible benefits. For starters, it often has substantially fewer genes than SNPs, which dramatically decreases the number of pairwise testing. To discover GGIs in pair of 20,000 genes, for example, tests are necessary. However, for three million SNPs in a marker-based interaction, more than tests are required. Second, gene-based methods are more powerful statistically because a gene carries more information than individual SNP and genes interact in a variety of ways (Liu et al., 2010; Li et al., 2011; Jiang et al., 2017; Su et al., 2019; Hu et al., 2020; Hu et al., 2021a; Hu et al., 2021b; Guo et al., 2021). Furthermore, these methods can include biological prior knowledge (e.g., information about known gene association within protein-protein interactions (PPIs) or pathways) (Wei et al., 2017a; Wei et al., 2017b; Wei et al., 2018; Liu et al., 2019b; Wei et al., 2019; Zeng et al., 2019; Cai et al., 2020; Zhai et al., 2020; Zhu et al., 2020). Finally, gene-based outcomes stand out for their better interpretability and crucial biological consequences.

Many statistical and computational approaches for detecting gene-based GGIs have been established. Peng et al.(Peng et al., 2010) introduced the canonical correlation-based U statistic (CCU). They calculated canonical correlation of two genes in both cases and controls. They next used CCU to calculate the difference in correlation, which revealed the presence of GGIs between the two genes. However, this strategy only considered linear correlation in the study. CCU was then expanded to Kernelized CCU (KCCU) (Yuan et al., 2012; Larson et al., 2013), where the kernel discovered a nonlinear relationship. Emily (Emily, 2016) recently introduced AGGrGATOr, a method that combines p-values of interaction tests at the marker-level to assess how a pair of genes interacted, which was a strategy that Ma et al. (Ma et al., 2013) previously utilized to discover interactions under quantitative traits. GBIGM is a non-parametric entropy-based approach suggested by Li et al. (Li et al., 2015b).

In this paper, we propose a new approach called gene-based, gene-gene interaction through a maximal neighborhood coefficient (GBMNC), which uses the maximal neighborhood coefficient (MNC) (Cheng et al., 2020) to identify gene-gene interaction of complex diseases at the gene-level in case-control studies. MNC measures a wide variety of dependence with no bias toward relationship types between two random vectors of arbitrary, but not necessarily equal, dimensions; this is superior to Pearson’s correlation, which only consider linear correlations. We introduced a statistic that uses the difference of MIC in cases and controls as an indicator of occurrence of GGIs, bases on the assumption that the joint distribution of two genes should not be significantly different in case and in control samples if there is no interaction between them (i.e. independent) under complex diseases. In simulation studies, our method exhibited an outstanding performance in recognizing the underlying GGIs at the gene level under a variety of conditions. Its application using real data sets showed accurate identification of GGIs.

2 Materials and Methods

The statistical procedure for GBMNC is described in depth in this section. We give different parameter settings for simulation studies to evaluate the power to identify GGIs and the ability to control type-I error. Then, we adopted a real-world Rheumatoid Arthritis data set from the WTCCC (Wellcome Trust case Control Consortium) database to evaluate out method’s effectiveness in a real situation.

2.1 GBMNC

2.1.1 Preliminaries and Notation

Here, we take genes, a couple of SNPs, as the basic unit. Suppose that we have random samples:whereand and represent two genes each with and SNPs, independently. In the case-control studies, is a categorical label where 0 is a control subject and one is a case subject. represents the copy number of the minor alleles of SNP in gene for sample .

In this work, to investigate whether there is a statistical interaction between two genes in a qualitative phenotype, we designed a statistic based on the maximal neighborhood coefficient to characterize the GGI intensity. We applied a permutation strategy to estimate the distribution of the statistic. Our approach was based on the intuition that, if there was no interaction between two genes, then, if they were independent of the case set, they should be independent of the control set; if they were dependent on the case set, they should be dependent on the control set as well, and the “strength” of such dependence should be the same for the case and control sets. Pearson’s correlation coefficient measures the degree of dependence between two random variables. However, it can only measure linear dependency and not nonlinear dependency, and it is not very convenient for random variables that take a value in . Therefore, we proposed to measure dependency between random variables by the maximal neighborhood coefficient (MNC) instead.

2.1.2 Maximal Neighborhood Coefficient

MNC is an association measure that decipher the potential complex associations from neighborhood insight. It assumes that if a relationship exists between two variables, the samples of each variable will appear to have a similar neighborhood tendency to approximate that relationship, and MNC can find those common neighborhood structures by exploring the possible neighborhoods of each variable. By introducing a -NN granule to reconstruct samples, and a novel neighborhood mutual information (NMI) to measure the certainty information of one variable from another under a fixed neighborhood combination, MNC enables us to detect more complex associations.

Let be a finite set that is sampled from a joint distribution , and and represents samples from marginal variables and , respectively. Given a designated neighborhood combination (a pairwise positive integer), designed as the -NN granule of , where the subscript sequence is obtained by . All samples of -NN granules form a cover of , that is . At the same time, there exists a cover for , . The cover of samples under is recorded as . Let represents the distribution of on the cover , and different neighborhood combinations produce different distributions.

MNC is defined based on the neighborhood characteristic matrix (NM) of a sample set . Given a finite data set and a neighborhood combination , the element of NM of is: denotes the neighborhood mutual information of distribution . The neighborhood mutual information of () is defined as follow:

Based on the equation above, the neighborhood mutual information of is defined as:With the definition of in Eq. 2, NMC is defined as:where is the search range, and for some . It also naturally extends to the case of two random vectors with arbitrary, but not necessarily equal, dimensions.

MNC Satisfies the Following Properties

  • 1) Symmertry: ;

  • 2) Comparability: , denotes that two variables are statistically independent; implies a strong association between two variables.

  • 3) Generality: captures comprehensive range relationships.

  • 4) Equitability: is robust to noisy relationships. It provides similar scores to the equally noisy relationships of different types.

2.1.3 Illustration of the GBMNC Workflow

Assume there are control samples and case samples in a case-control study for a pair of genes such that has SNPs and has SNPs. Let be the sample association score between and . First, we calculate the for control samples and for case samples. Second, we design a statistic to measure the difference in between cases and controls. represents how different the two joint distributions and are. The larger the , the higher the probability that and interact.

To get a p-value, we needed to estimate the distribution of under the null hypothesis. Here, we used a non-parametric strategy based on permutation: we shuffled the label y randomly times, calculated using the same procedure above, and used the resulting empirical distribution as an estimate for the distribution of under the null hypothesis. Let the result of these permutations be , then an estimated p-value for the null hypothesis is

We summarized the process of GBMNC in the algorithm below (Algorithm 1) and presented the overall workflow (Figure 1).

FIGURE 1

FIGURE 1

Illustration of the Gene-Based gene-gene interaction through a Maximal Neighborhood Coefficient (GBMNC) workflow for detection of gene-based, gene-gene interaction.

Algorithm 1

  • Data: Genotype , Phenotype , permutation times

  • Result: significant p-value for interaction between

  • 1  Calculate and for both and by Eq. 5;

  • 2  Calculate the difference between and ;

  •  3 for to do

  •   4 Randomly permute label , and generate the new data set;

  •   5 Repeat Steps 1 and 2;

  •   6 end

  •   7 Estimated p-value of is the number of , , which are larger than , divided by .

2.2 Simulation Study

To assess the performance of GBMNC to control type I error and the power to detect GGIs, we compared GBMNC with KCCA (Larson et al., 2013), GBIGM (Li et al., 2015b), and AGGrEGATOr (Emily, 2016).

2.2.1 Simulation With GAMETES

The goal of this simulation study was to evaluate the performance of the GBMNC procedure to detect gene-gene interaction. We set all simulated datasets to have 50 SNPs. Among them, two SNPs were functional, and the remaining 48 SNPs were non-functional. The 50 SNPs formed five genes, and each had 10 SNPs. The two functional SNPs were put into the first and second genes. We chose the publicly available tool GAMETES (Urbanowicz et al., 2012) to generate the simulated genotype data. This tool was designed to generate pure and strict epistasis models. Pure and strict epistasis models are the most difficult disease-related patterns to identify. Such associations can only be observed if all n-loci are included in the disease model. This requirement makes these types of models an attractive gold standard for simulation studies of complex multi-locus effects.

Evaluation of Type-I error: The type-I error indicates the ability of a method to reject the null hypothesis when it is true (i.e., the false positive rate). We used GAMETES to generate the custom disease model (Table 1) with one causal SNP pair. characterizes the baseline odds (i.e., the odds conditional on genotype pair ). We ran the simulation 100 times with each sample size and . The significance level was set to be 0.05.

TABLE 1

AAAaAa
BB
Bb
bb

Table of odds for the no effect model without interaction between a pair of SNPs.

Evaluation of power of the test

: The power of a test indicates the probability that the method rejects the null hypothesis correctly when the alternative hypothesis is true. In this simulation study, we generated 100 data sets for each parameter settings. The power under each parameter setting was expressed by the frequency, and the null hypothesis of the data set was rejected correctly at the significance level of

.

  • 1) To assess the impact of heritability , which measured the intensity of correlation between genotype and phenotype, we chose and two different minor allele frequencies MAF with population prevalence set to 0.2 and sample size set at 4,000. Under each parameter combination, five models were generated so that we had a total of 100 models that followed Hardy-Weinberg proportions. For a specified genetic constrain combination, the 10 models were sorted roughly by the ascending customized odds ratio (COR) using GAMETES and labeled M1 to M5. COR is a metric of detectability that was calculated directly from the genetic model. The higher it is, the easier it is to detect GGIs. GAMETES generated the penetrance tables for these 100 models in the absence of the main effect. One hundred replicated data sets were generated from each model with balanced cases and controls, which resulted in 5,000 data sets in total in this scenario.

  • 2) To evaluate the influence of sample size, we set heritability to be 0.025, MAF {0.2,0.4} and prevalence to be 0.2 with a sample size of 10,000. Then, 100 data sets were generated by random sampling from this large dataset for each of the sample sizes . In this scenario, we had 1,000 datasets in total.

For GBMNC, KCCU, AGGrEGATOr, and GBIGM, if the number of data sets with a significance level less than is , then the power can be calculated by the following formula:

GBIGM and AGGrEGATOr methods are nonparametric methods, so no parameters need to be specific. We only set the ratio of the trimmed jackknife to 0.05 () for KCCU.

2.3 Experiments Using Rheumatoid Arthritis Data

To evaluate GBMNC’s ability to process real GGIs in a qualitative data set, we analyzed the susceptibility of a series of pairs of genes in Rheumatoid Arthritis (RA). RA is a chronic autoimmune disease that causes pannus development and cartilage and bone loss in synovial joints. It leads to progressive bone deterioration and interferes with bone repair. In this work, we used the WTCCC (2007) data set, which includes genotype data from the British population obtained by the Affymetrix GeneGhip 500 k. Our dataset was pre-processed in the following ways:

  • 1) We used pathway hsa05323 from the KEGG pathway database to validate the GGIs in the RA. The WTCCC data set’s genotyping coordinates can be found in UCSC hg18/NCBI Build36. This pathway contained 90genes. Many of the genes belonged to the protein combinations MHCII and V-ATPase. Because numerous GGIs happened on their own, we only chose representative genes from each protein combination and then remove the others. Finally, 48genes remained, resulting in a total of pairs of genes to be analyzed.

  • 2) We collected the detailed gene information from the NCBI Build36 annotation file, and for each gene, we inserted a 10 kb buffer region both downstream and upstream of the originally defined gene location. For each gene, all SNPs within the area were chosen.

  • 3) According to the quality control of GWAS, samples that included gender that did not match the chromosome X heterozygote rates were removed. SNPs were also removed if any of the following requirements were met: the missing rate in the sample was , MAF was , or the frequency of control violated Hardy-Weinberg equilibrium (). Finally, 385 SNPs remained in 4,966 samples, which included 2,993 control subjects and 1973 case subjects.

3 Results and Discussion

The experimental environment for all the following results was a workstation with an Intel Xeon CPU E5-2,620 v2 at 2.10GHz, 96 GB of DDR3, and python3.6.

3.1 Simulation Study

3.1.1 Evaluation of Type-I Error

For type-I error, we varied the sample size from 1,000 to 5,000. Except for GBIGM with , all methods tested had a type-I error comparable to a significance level (Table 2), which implied that these methods controlled for type-I error for various sample sizes quite well.

TABLE 2

MethodsSample size
,000
KCCU0.020.020.010.050.07
GBIGM0.130.060.070.070.07
AGGrEGATOr0.050.060.070.040.02
GBMNC0.020.050.070.050.05

Type-I error for KCCU, GBIGM, AGGrEGATOr, and GBMNC when varying the sample size from 1,000 to 5,000.

3.1.2 Evaluation of the Power of GBMNC

Impact of heritability: To evaluate the statistical power of our GBMNC and the other three methods, we used 10 heritability-MAF combinations, with a population prevalence of 0.2, a sample size of 4,000, and heritability that varied from 0.01 to 0.2 (Table 3). The bold in Table 3 shows the best-performed method in each model under a given heritability-MAF combination. Notice that a larger value indicates better performance. On average, GBMNC was the best performing algorithm in this comparison. It largely outperformed the other methods, but not for all the data sets; it was inferior to AGGrEGATOr for some data sets. However, its performance was remarkably consistent, and it was the top performer for most data sets. AGGrEGATOr achieved the same performance when MAF was 0.2 and heritability was >0.05.

TABLE 3

MAFHeritabilityModelM1M2M3M4M5
Method
0.20.01GBMNC0.130.400.680.720.89
AGGrEGATOr0.120.120.890.891
KCCU0.150.090.290.430.62
GBIGM0.090.110.130.110.08
0.025GBMNC0.950.7510.961
AGGrEGATOr10.2710.371
KCCU0.580.090.740.240.8
GBIGM0.080.070.110.130.2
0.05GBMNC0.680.830.9411
AGGrEGATOr0.090.590.8911
KCCU0.130.570.650.840.85
GBIGM0.180.080.220.170.19
0.1GBMNC11111
AGGrEGATOr11111
KCCU0.810.930.90.860.91
GBIGM0.150.140.230.160.16
0.2GBMNC11111
AGGrEGATOr11111
KCCU0.890.970.940.890.97
GBIGM0.190.310.180.220.21
0.40.01GBMNC0.750.660.820.900.96
AGGrEGATOr0.710.090.10.940.96
KCCU0.340.050.080.770.29
GBIGM0.090.080.10.110.07
0.025GBMNC10.730.850.930.80
AGGrEGATOr0.990.560.120.910.26
KCCU0.580.240.080.240.11
GBIGM0.150.120.140.110.09
0.05GBMNC1110.680.86
AGGrEGATOr10.970.910.350.42
KCCU0.860.90.950.410.37
GBIGM0.110.120.090.080.10
0.1GBMNC1110.631
AGGrEGATOr0.9810.960.271
KCCU0.6210.950.411
GBIGM0.120.190.180.260.20
0.2GBMNC11111
AGGrEGATOr0.9310.9910.80
KCCU0.2810.8310.76
GBIGM0.190.250.310.130.26

The statistical power of simulation studies for GBMNC, AGGrEGATOr, KCCU and GBIGM under 10 heritability-MAF combinations, with and MAF . Each heritability-MAF combination has five models. Bold font indicates the method that performed best under each model.

The power of all the methods was significantly affected by heritability (i.e., the effect size of interaction) (Table 4). A larger heritability led to better performance for all methods under a specific MAF. When heritability varied from 0.01 to 0.025, GBMNC almost doubled its power for a given sample size of 4,000 with MAF . Other methods also show a steady upward trend (Table 4). The power also depended on the MAF of the interacting SNPs (e.g., for the cases of , the power of GBMNC under model M1-M5 ranged between 0.13–0.89 for MAF , but it ranged between 0.66–0.96 for MAF (Table 3)). The average power was 0.564 for MAF , which was much lower than 0.818 for MAF (Table 4).

TABLE 4

MAFMethodGBMNCAGGrE-GATOrKCCUGBIGM
Heritability
0.20.010.5640.6040.3160.104
0.0250.9320.7280.4900.118
0.050.8900.7140.6080.168
0.1110.8820.168
0.2110.9320.222
0.40.010.8180.5600.3060.090
0.0250.8620.5680.2500.122
0.050.9080.7300.6980.100
0.10.9260.8420.7960.190
0.210.9440.7740.228

Average power for GBMNC, AGGrEGATOr, KCCU, and GBIGM under 10 heritability-MAF combinations, with heritability and MAF.

It is worth noting that under the same combination of habitability and MAF, GBMNC was more stable under models with different COR compared with AGGrEGATOr (Figure 2). KCCU detected the interaction of some simulated disease models in our study, and it had a similar performance pattern with AGGrEGATOr. However, AGGrEGATOr was much more powerful in most of the simulated scenarios. GBIGM had little power to detecting pure gene-gene interaction,. This result replicated Emily's (Emily, 2016) result of the simulation.

FIGURE 2

FIGURE 2

Illustration of the distribution of power of each method in each heritability-MAF combination with and MAF .

Impact of sample size: The sample size of the data set had a considerable effect on power. Let the sample size be , , and MAF (Table 5). As the sample size increased, the power of all methods increased almost monotonically under different MAF settings. With all methods, a larger sample size corresponded to better performance.

TABLE 5

MAFMethodGBMNCAGGrEGATOrKCCUGBIGM
Sample size
0.21,0000.670.150.110.2
20000.830.180.380.16
3,00010.200.550.23
4,00010.310.760.21
5,00010.290.870.12
0.41,0000.680.160.130
20000.970.200.110.04
3,00010.350.20.11
4,00010.540.370.11
5,00010.650.580.05

The statistical power of simulation studies for GBMNC, AGGrEGATOr, KCCU, and GBIGM under models with , MAF , and sample sizes that varied from to .

In conclusion, in simulated studies, our results showed that GBMNC detected gene-gene interaction effectively, in which a pair of SNPs was a causal factor by the purely and strictly epistasis model without main effect, which can only be observed if all 2-loci are included in the disease model. Compared with other methods, GBMNC identified a broad range of epistatic signals accurately.

3.2 Experiments Using Rheumatoid Arthritis Data

RA is a chronic autoimmune disease where HLA genes, TNF family, and TRAF1 are important genetic risk factors in the development. Each unique gene pair of the hsa05323 pathway was evaluated in the RA study, which resulted in total pairs for 48 genes. With a significance level and multiple testing adjustment, for KCCU and GIGBM, we obtained 159 and 134 significant GGIs, respectively. Among them, 30 and 65 had p-values equal to 0; hence we were unable to rank them in the order of significance. AGGrGETOr did not show any significant results. Following Emily (Emily, 2016), and after removing the multiple testing correction, AGGrGETOr exhibited 17 significant GGIs, which we ranked by their p-values. We chose the top 10 gene pairs obtained by GBMNC and by AGGrGETOr to analyze, which comprised approximately 1% of the total interactions (Table 6).

TABLE 6

Gene1ChrGene2Chrp-value
GBMNCAGGrEGATOr
TGF- 21CXCL840.01
CTLA42GM-CSF50.00.327
CD803HLA-classII60.00.37
GM-CSF5TRAP190.00.01
TLR-49FLT-1130.00.069
IL-176TNFSF13B130.00.185
CXCL64ICAM1190.01
CD282CXCL640.00.512
CTLA42CXCL640.00.849
MMP-311FLT-1130.00.089
CD803April170.990.0007
CTSK1TNFSF13B130.6150.0008
JUN1IL-670.4450.0019
CD803CTSL250.00.002
CXCL64FLT-1130.2970.0021
CTLA42FOS370.7270.0022
FLT-113LFA-1390.8150.0033
CCL317TRAP190.5640.0034
IL-1811TGF- 3140.6930.004
IL-12CXCL12100.0810.004

The calculated p-value for the 20 gene pairs using GBMNC and AGGrEGATOr. p-values in bold font indicate that they are significant. The “Chr” column indicates the chromosome number of the human genome where the gene is located.

We found that some of our findings were supported by prior research (Xiao et al., 2008; Klocke et al., 2016; Cen et al., 2019). For instance, our method detected a significant interaction between IL17 and TNFSF13B. Studies (Xiao et al., 2008) show that both B cells and T cells formed aggregates in the synovium of inflamed joints and mediated the pathogenesis of RA, and B-cell-activating factor (BAFF, also named TNFSF13B, BLys) played a vital role in B-cell survival and maturation. After activation and expansion, CD4+ T cells developed into different T helper cell subsets with different cytokine profiles and distinct effector functions. In addition to Th1 and Th2 cells, Th17 cells were a third T helper cell and produce IL-17. Th17 cells can recruit and activate inflammatory cells and they have been recognized as a primary cause of bone destruction and inflammation in autoimmune diseases. BAFF promoted Th17 cell proliferation and expansion preferentially (Lai Kwan Lam et al., 2008). IL-17 was a key cytokine for BAFF-mediated proinflammatory effects during collagen-induced arthritis pathogenesis. Only one pair of potential interactions between CD80 and CTSL was captured by both methods within the top 10 GGIs. However, there is not yet direct evidence to show the interaction between CD80 and CTSL.

4 Conclusion

The study of detecting GGIs is of great importance in understanding the pathogenesis of complex human diseases. In this paper, we proposed a gene-based GGI detection method called GBMNC based on a maximal neighborhood coefficient and a permutation strategy for case-control studies in GWAS. The method not only benefited from the ability of a maximal neighborhood coefficient, which considered the neighborhood structure of each sample and captured a wide range of associations, but also from the robustness of our permutation-based hypothesis testing scheme.

We designed a statistic to capture the different intensities of interaction between two genes in both cases and controls, then transformed the problem of GGI detection into a form of hypothesis testing; our null hypothesis was there was no significant difference in the relationship between the two genes in the disease data and the control data. This hypothesis did not limit the form of interaction between genes, and it enhanced the method’s ability to detect different types of interactions. We demonstrated the effectiveness of our method through a simulation study and retrospective analysis of rheumatoid arthritis. Under a large range of settings, GBMNC outperformed previous methods in the power to detect GGIs. The statistical power of our method increased monotonically with the increase in the heritability and the MAF. The method was also stable to sample size based on a test of false positive rates. MNC did not restrict the dimension of two random vectors. Therefore, it is possible to generalize the method for marker-based detection of gene pairs that are identified as interactive. Investigating the mechanism of gene-based interaction at the marker level might point the way for further research. In summary, GBMNC is a helpful addition to the current toolbox of statistical models to elucidate GGIs in case-control studies.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.wtccc.org.uk/info/access_to_data_samples.html.

Author contributions

YG: Conceptualization, Methodology, Investigation, Funding acquisition, Writing-Original Draft. HC: Methodology, Formal analysis, Writing-Original Draft. ZY: Software, Formal analysis. ZL: Resources, Data Curation. YW: Formal analysis, Writing-Review and Editing. DD: Conceptualization, Project administration.

Funding

The work was supported by the National Natural Science Foundation of China (No. 62002243, No. 31900306), the Post-doctoral Foundation Project of Shenzhen Polytechnic China (6020330004K), the Post-doctoral Foundation Project of Shenzhen Polytechnic China (6020330004K). Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (2021L286).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

    BunielloA.MacArthurJ. A. L.CerezoM.HarrisL. W.HayhurstJ.MalangoneC.et al (2019). The NHGRI-EBI GWAS Catalog of Published Genome-wide Association Studies, Targeted Arrays and Summary Statistics 2019. Nucleic Acids Res.47 (D1), D1005D1012. 10.1093/nar/gky1120

  • 2

    CaiL.WangL.FuX.XiaC.ZengX.ZouQ. (2020). ITP-pred: an Interpretable Method for Predicting, Therapeutic Peptides with Fused Features Low-Dimension Representation. Brief. Bioinform.22 (4), bbaa367. 10.1093/bib/bbaa367

  • 3

    CenS.WangP.XieZ.YangR.LiJ.LiuZ.et al (2019). Autophagy Enhances Mesenchymal Stem Cell-Mediated CD4+ T Cell Migration and Differentiation through CXCL8 and TGF-Β1. Stem Cel Res Ther.10 (1), 265. 10.1186/s13287-019-1380-0

  • 4

    ChangY.-C.WuJ.-T.HongM.-Y.TungY.-A.HsiehP.-H.YeeS. W.et al (2020). GenEpi: Gene-Based Epistasis Discovery Using Machine Learning. BMC Bioinformatics21 (1), 68. 10.1186/s12859-020-3368-2

  • 5

    ChenL.LiJ.ChangM. (2020). Cancer Diagnosis and Disease Gene Identification via Statistical Machine Learning. Curr. Bioinformatics15 (9), 956962. 10.2174/1574893615666200207094947

  • 6

    ChengH.LiangJ.QianY.HuZ. (2020). Association Mining Method Based on Neighborhood Perspective. Sci. Sin.-Inf.50 (6), 824844. 10.1360/ssi-2020-0009

  • 7

    CordellH. J. (2009). Detecting Gene-Gene Interactions that Underlie Human Diseases. Nat. Rev. Genet.10 (6), 392404. 10.1038/nrg2579

  • 8

    DongC.ChuX.WangY.WangY.JinL.ShiT.et al (2008). Exploration of Gene-Gene Interaction Effects Using Entropy-Based Methods. Eur. J. Hum. Genet.16 (2), 229235. 10.1038/sj.ejhg.5201921

  • 9

    EmilyM. (2016). AGGrEGATOr: A Gene-Based GEne-Gene interActTiOn Test for Case-Control Association Studies. Stat. Appl. Genet. Mol. Biol.15 (2), 151171. 10.1515/sagmb-2015-0074

  • 10

    EmilyM. (2018). A Survey of Statistical Methods for Gene-Gene Interaction in Case-Control Genome-wide Association Studies. Journal de la société française de statistique. 159 (1), 2767.

  • 11

    EmilyM. (2012). IndOR: a New Statistical Procedure to Test for SNP-SNP Epistasis in Genome-wide Association Studies. Statist. Med.31 (21), 23592373. 10.1002/sim.5364

  • 12

    EmilyM.SounacN.KroellF.Houée-BigotM. (2020). Gene-Based Methods to Detect Gene-Gene Interaction in R: The GeneGeneInteR Package. J. Stat. Softw.95 (12), 132. 10.18637/jss.v095.i12

  • 13

    FangG.WangW.PaunicV.HeydariH.CostanzoM.LiuX.et al (2019). Discovering Genetic Interactions Bridging Pathways in Genome-wide Association Studies. Nat. Commun.10 (1), 4274. 10.1038/s41467-019-12131-7

  • 14

    GuoF.WangD.WangL. (2018). Progressive Approach for SNP Calling and Haplotype Assembly Using Single Molecular Sequencing Data. Bioinformatics34 (12), 20122018. 10.1093/bioinformatics/bty059

  • 15

    GuoY.YanK.LvH.LiuB. (2021). PreTP-EL: Prediction of Therapeutic Peptides Based on Ensemble Learning. Brief. Bioinform.22 (6), bbab358. 10.1093/bib/bbab358

  • 16

    GuoY.ZhongZ.YangC.HuJ.JiangY.LiangZ.et al (2019). Epi-GTBN: an Approach of Epistasis Mining Based on Genetic Tabu Algorithm and Bayesian Network. BMC Bioinformatics20 (1), 444. 10.1186/s12859-019-3022-z

  • 17

    HindorffL. A.SethupathyP.JunkinsH. A.RamosE. M.MehtaJ. P.CollinsF. S.et al (2009). Potential Etiologic and Functional Implications of Genome-wide Association Loci for Human Diseases and Traits. Proc. Natl. Acad. Sci.106 (23), 93629367. 10.1073/pnas.0903103106

  • 18

    HuY.QiuS.ChengL. (2021). Integration of Multiple-Omics Data to Analyze the Population-specific Differences for Coronary Artery Disease. Comput. Math. Methods Med.2021, 7036592. 10.1155/2021/7036592

  • 19

    HuY.SunJ. Y.ZhangY.ZhangH.GaoS.WangT.et al (2021). rs1990622 Variant Associates with Alzheimer's Disease and Regulates TMEM106B Expression in Human Brain Tissues. BMC Med.19 (1), 11. 10.1186/s12916-020-01883-5

  • 20

    HuY.ZhangH.LiuB.GaoS.WangT.HanZ.et al (2020). rs34331204 Regulates TSPAN13 Expression and Contributes to Alzheimer's Disease with Sex Differences. Brain143 (11), e95. 10.1093/brain/awaa302

  • 21

    JiangL.XiaoY.DingY.TangJ.GuoF. (2018). FKL-Spa-LapRLS: an Accurate Method for Identifying Human microRNA-Disease Association. Bmc Genomics19, 911. 10.1186/s12864-018-5273-x

  • 22

    JiangQ.JinS.JiangY.LiaoM.FengR.ZhangL.et al (2017). Alzheimer's Disease Variants with the Genome-wide Significance Are Significantly Enriched in Immune Pathways and Active in Immune Cells. Mol. Neurobiol.54 (1), 594600. 10.1007/s12035-015-9670-8

  • 23

    KépíróL.SzéllM.KovácsL.KeszthelyiP.KeményL.GyulaiR. (2021). The Association of HLA-C and ERAP1 Polymorphisms in Early and Late Onset Psoriasis and Psoriatic Arthritis Patients of Hungary. Postepy Dermatol. Alergol.38 (2), 4351. 10.5114/ada.2021.104277

  • 24

    KlockeK.SakaguchiS.HolmdahlR.WingK. (2016). Induction of Autoimmune Disease by Deletion of CTLA-4 in Mice in Adulthood. Proc. Natl. Acad. Sci. USA113 (17), E2383E2392. 10.1073/pnas.1603892113

  • 25

    Lai Kwan LamQ.King Hung KoO.ZhengB.-J.LuL. (2008). Local BAFF Gene Silencing Suppresses Th17-Cell Generation and Ameliorates Autoimmune Arthritis. Proc. Natl. Acad. Sci.105 (39), 1499314998. 10.1073/pnas.0806044105

  • 26

    LarsonN. B.JenkinsG. D.LarsonM. C.VierkantR. A.SellersT. A.PhelanC. M.et al (2013). Kernel Canonical Correlation Analysis for Assessing Gene-Gene Interactions and Application to Ovarian Cancer. Eur. J. Hum. Genet.22 (1), 126131. 10.1038/ejhg.2013.69

  • 27

    LiH.-L.PangY.-H.LiuB. (2021). BioSeq-BLM: a Platform for Analyzing DNA, RNA and Protein Sequences Based on Biological Language Models. Nucleic Acids Res.gkab829. 10.1093/nar/gkab829

  • 28

    LiJ.HuangD.GuoM.LiuX.WangC.TengZ.et al (2015). A Gene-Based Information Gain Method for Detecting Gene-Gene Interactions in Case-Control Studies. Eur. J. Hum. Genet.23 (11), 15661572. 10.1038/ejhg.2015.16

  • 29

    LiM.-X.GuiH.-S.KwanJ. S. H.ShamP. C. (2011). GATES: a Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure. Am. J. Hum. Genet.88 (3), 283293. 10.1016/j.ajhg.2011.01.019

  • 30

    LiP.GuoM.WangC.LiuX.ZouQ. (2015). An Overview of SNP Interactions in Genome-wide Association Studies. Brief. Funct. Genomics14 (2), 143155. 10.1093/bfgp/elu036

  • 31

    LiuB.GaoX.ZhangH. (2019). BioSeq-Analysis2.0: an Updated Platform for Analyzing DNA, RNA and Protein Sequences at Sequence Level and Residue Level Based on Machine Learning Approaches. Nucleic Acids Res.47 (20), e127. 10.1093/nar/gkz740

  • 32

    LiuG.HuY.HanZ.JinS.JiangQ. (2019). Genetic Variant Rs17185536 Regulates SIM1 Gene Expression in Human Brain Hypothalamus. Proc. Natl. Acad. Sci. USA116 (9), 33473348. 10.1073/pnas.1821550116

  • 33

    LiuG.JinS.HuY.JiangQ. (2018). Disease Status Affects the Association between Rs4813620 and the Expression of Alzheimer's Disease Susceptibility geneTRIB3. Proc. Natl. Acad. Sci. USA115 (45), E10519E10520. 10.1073/pnas.1812975115

  • 34

    LiuJ.SuR.ZhangJ.WeiL. (2021). Classification and Gene Selection of Triple-Negative Breast Cancer Subtype Embedding Gene Connectivity Matrix in Deep Neural Network. Brief. Bioinform.22, 14774054. (Electronic). 10.1093/bib/bbaa395

  • 35

    LiuJ. Z.McraeA. F.NyholtD. R.MedlandS. E.WrayN. R.BrownK. M.et al (2010). A Versatile Gene-Based Test for Genome-wide Association Studies. Am. J. Hum. Genet.87 (1), 139145. 10.1016/j.ajhg.2010.06.009

  • 36

    LoosR. J. F. (2020). 15 Years of Genome-wide Association Studies and No Signs of Slowing Down. Nat. Commun.11 (1), 5900. 10.1038/s41467-020-19653-5

  • 37

    LuoJ.MengY.ZhaiJ.ZhuY.LiY.WuY. (2020). Screening of SLE-Susceptible SNPs in One Chinese Family with Systemic Lupus Erythematosus. Cbio15 (7), 778787. 10.2174/1574893615666200120105153

  • 38

    LyuP.HouJ.YuH.ShiH. (2020). High-density Genetic Linkage Map Construction in Sunflower (Helianthus Annuus L.) Using SNP and SSR Markers. Curr. Bioinformatics15 (8), 889897. 10.2174/1574893615666200324134725

  • 39

    MaL.ClarkA. G.KeinanA. (2013). Gene-based Testing of Interactions in Association Studies of Quantitative Traits. Plos Genet.9 (2), e1003321. 10.1371/journal.pgen.1003321

  • 40

    ManolioT. A.CollinsF. S.CoxN. J.GoldsteinD. B.HindorffL. A.HunterD. J.et al (2009). Finding the Missing Heritability of Complex Diseases. Nature461 (7265), 747753. 10.1038/nature08494

  • 41

    MooreJ. H.AsselbergsF. W.WilliamsS. M. (2010). Bioinformatics Challenges for Genome-wide Association Studies. Bioinformatics26 (4), 445455. 10.1093/bioinformatics/btp713

  • 42

    NobreR.IlicA.Santander-JimenezS.SousaL. (2021). Retargeting Tensor Accelerators for Epistasis Detection. IEEE Trans. Parallel Distrib. Syst.32 (9), 21602174. 10.1109/tpds.2021.3060322

  • 43

    PengQ.ZhaoJ.XueF. (2010). A Gene-Based Method for Detecting Gene-Gene Co-association in a Case-Control Association Study. Eur. J. Hum. Genet.18 (5), 582587. 10.1038/ejhg.2009.223

  • 44

    RitchieM. D.HahnL. W.MooreJ. H. (2003). Power of Multifactor Dimensionality Reduction for Detecting Gene-Gene Interactions in the Presence of Genotyping Error, Missing Data, Phenocopy, and Genetic Heterogeneity. Genet. Epidemiol.24 (2), 150157. 10.1002/gepi.10218

  • 45

    RitchieM. D.Van SteenK. (2018). The Search for Gene-Gene Interactions in Genome-wide Association Studies: Challenges in Abundance of Methods, Practical Considerations, and Biological Interpretation. Ann. Transl. Med.6 (8), 157. 10.21037/atm.2018.04.05

  • 46

    ShaoJ.YanK.LiuB. (2021). FoldRec-C2C: Protein Fold Recognition by Combining Cluster-To-Cluster Model and Protein Similarity Network. Brief Bioinform22 (3), bbaa144. 10.1093/bib/bbaa144

  • 47

    SongB.LiF.LiuY.ZengX. (2021). Deep Learning Methods for Biomedical Named Entity Recognition: a Survey and Qualitative Comparison. Brief. Bioinform.22 (6), bbab282. 10.1093/bib/bbab282

  • 48

    SuR.LiuX.JinQ.LiuX.WeiL. (2021). Identification of Glioblastoma Molecular Subtype and Prognosis Based on Deep MRI Features. Knowledge-Based Syst.232, 107490. 10.1016/j.knosys.2021.107490

  • 49

    SuR.LiuX.WeiL.ZouQ. (2019). Deep-Resp-Forest: A Deep forest Model to Predict Anti-cancer Drug Response. Methods166, 91102. 10.1016/j.ymeth.2019.02.009

  • 50

    TangY.-J.PangY.-H.LiuB. (2020). IDP-Seq2Seq: Identification of Intrinsically Disordered Regions Based on Sequence to Sequence Learning. Bioinformaitcs36 (21), 51775186. 10.1093/bioinformatics/btaa667

  • 51

    UrbanowiczR. J.KiralisJ.Sinnott-ArmstrongN. A.HeberlingT.FisherJ. M.MooreJ. H. (2012). GAMETES: a Fast, Direct Algorithm for Generating Pure, Strict, Epistatic Models with Random Architectures. BioData Mining5 (1), 16. 10.1186/1756-0381-5-16

  • 52

    WanX.YangC.YangQ.XueH.FanX.TangN. L. S.et al (2010). BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies. Am. J. Hum. Genet.87 (3), 325340. 10.1016/j.ajhg.2010.07.021

  • 53

    WangH. T.TangJ.DingY.GuoF. (2021). Exploring Associations of Non-Coding RNAs in Human Diseases via Three-Matrix Factorization with Hypergraph-Regular Terms on Center Kernel Alignment. Brief. Bioinform.22 (5), bbaa409. 10.1093/bib/bbaa409

  • 54

    WeiL.ChenH.SuR. (2018). M6APred-EL: A Sequence-Based Predictor for Identifying N6-Methyladenosine Sites Using Ensemble Learning. Mol. Ther. - Nucleic Acids12, 635644. 10.1016/j.omtn.2018.07.004

  • 55

    WeiL.LuanS.NagaiL. A. E.SuR.ZouQ. (2019). Exploring Sequence-Based Features for the Improved Prediction of DNA N4-Methylcytosine Sites in Multiple Species. Bioinformatics35 (8), 13261333. 10.1093/bioinformatics/bty824

  • 56

    WeiL.WanS.GuoJ.WongK. K. (2017). A Novel Hierarchical Selective Ensemble Classifier with Bioinformatics Application. Artif. Intelligence Med.83, 8290. 10.1016/j.artmed.2017.02.005

  • 57

    WeiL.XingP.ZengJ.ChenJ.SuR.GuoF. (2017). Improved Prediction of Protein-Protein Interactions Using Novel Negative Samples, Features, and an Ensemble Classifier. Artif. Intelligence Med.83, 6774. 10.1016/j.artmed.2017.03.001

  • 58

    XiaoY.MotomuraS.PodackE. R. (2008). APRIL (TNFSF13) Regulates Collagen-Induced Arthritis, IL-17 Production and Th2 Response. Eur. J. Immunol.38 (12), 34503458. 10.1002/eji.200838640

  • 59

    YoungA. I. (2019). Solving the Missing Heritability Problem. Plos Genet.15 (6), e1008222. 10.1371/journal.pgen.1008222

  • 60

    YuanZ.GaoQ.HeY.ZhangX.LiF.ZhaoJ.et al (2012). Detection for Gene-Gene Co-association via Kernel Canonical Correlation Analysis. BMC Genet.13, 83. 10.1186/1471-2156-13-83

  • 61

    ZengX.LiaoY.LiuY.ZouQ. (2017). Prediction and Validation of Disease Genes Using HeteSim Scores. Ieee/acm Trans. Comput. Biol. Bioinf.14 (3), 687695. 10.1109/tcbb.2016.2520947

  • 62

    ZengX.ZhuS.LiuX.ZhouY.NussinovR.ChengF. (2019). deepDR: a Network-Based Deep Learning Approach to In Silico Drug Repositioning. Bioinformatics35 (24), 51915198. 10.1093/bioinformatics/btz418

  • 63

    ZhaiY.ChenY.TengZ.ZhaoY. (2020). Identifying Antioxidant Proteins by Using Amino Acid Composition and Protein-Protein Interactions. Front. Cel Dev. Biol.8, 591487. 10.3389/fcell.2020.591487

  • 64

    ZhangT.HuY.WuX.MaR.JiangQ.WangY. (2016). Identifying Liver Cancer-Related Enhancer SNPs by Integrating GWAS and Histone Modification ChIP-Seq Data. Biomed. Res. Int.2016, 2395341. 10.1155/2016/2395341

  • 65

    ZhangX.ZouQ.Rodriguez-PatonA.ZengX. (2019). Meta-path Methods for Prioritizing Candidate Disease miRNAs. Ieee/acm Trans. Comput. Biol. Bioinf.16 (1), 283291. 10.1109/tcbb.2017.2776280

  • 66

    ZhangY.LiuJ. S. (2007). Bayesian Inference of Epistatic Interactions in Case-Control Studies. Nat. Genet.39 (9), 11671173. 10.1038/ng2110

  • 67

    ZhuH.DuX.YaoY. (2020). ConvsPPIS: Identifying Protein-Protein Interaction Sites by an Ensemble Convolutional Neural Network with Feature Graph. Cbio15 (4), 368378. 10.2174/1574893614666191105155713

Summary

Keywords

genome-wide association studies, qualitative traits, gene-gene interactions, maximal neighborhood coefficient, gene-based testing

Citation

Guo Y, Cheng H, Yuan Z, Liang Z, Wang Y and Du D (2021) Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies. Front. Genet. 12:801261. doi: 10.3389/fgene.2021.801261

Received

25 October 2021

Accepted

15 November 2021

Published

08 December 2021

Volume

12 - 2021

Edited by

Dariusz Mrozek, Silesian University of Technology, Poland

Reviewed by

Yang Hu, Harbin Institute of Technology, China

Lumeng Chao, Inner Mongolia Agricultural University, China

Updates

Copyright

*Correspondence: Yingjie Guo, ; Debing Du,

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics