Testing Gene-Gene Interactions Based on a Neighborhood Perspective in Genome-wide Association Studies

Unexplained genetic variation that causes complex diseases is often induced by gene-gene interactions (GGIs). Gene-based methods are one of the current statistical methodologies for discovering GGIs in case-control genome-wide association studies that are not only powerful statistically, but also interpretable biologically. However, most approaches include assumptions about the form of GGIs, which results in poor statistical performance. As a result, we propose gene-based testing based on the maximal neighborhood coefficient (MNC) called gene-based gene-gene interaction through a maximal neighborhood coefficient (GBMNC). MNC is a metric for capturing a wide range of relationships between two random vectors with arbitrary, but not necessarily equal, dimensions. We established a statistic that leverages the difference in MNC in case and in control samples as an indication of the existence of GGIs, based on the assumption that the joint distribution of two genes in cases and controls should not be substantially different if there is no interaction between them. We then used a permutation-based statistical test to evaluate this statistic and calculate a statistical p-value to represent the significance of the interaction. Experimental results using both simulation and real data showed that our approach outperformed earlier methods for detecting GGIs.


INTRODUCTION
Genome-wide association studies (GWAS) has been used to investigate the associations between genetic variants and complex disorders with great success. Researchers have discovered more than 71,000 unique single nucleotide polymorphisms (SNPs) associated to diseases throughout the last decade Zhang et al., 2016;Zeng et al., 2017;Guo et al., 2018;Buniello et al., 2019;Loos, 2020;Li et al., 2021). Traditional GWAS, on the other hand, concentrated on the independent, additive, and cumulative effects of individual SNPs on specific diseases. The majority of associated SNPs are common genetic variants with small effects that only explain a portion of complex disease heritability. Many genes, environmental variables, and interactions play a crucial role in the underlying genetic architecture of complex diseases (Cordell, 2009;Moore et al., 2010;Jiang et al., 2018;Liu et al., 2018;Liu et al., 2019a;Zhang et al., 2019;Chen et al., 2020;Luo et al., 2020;Liu et al., 2021;Shao et al., 2021;Su et al., 2021;Wang et al., 2021). As a result, genetic interactions are thought to enlighten studies into "missing heritability" (Manolio et al., 2009;Fang et al., 2019;Young, 2019;Tang et al., 2020;Song et al., 2021) and give important knowledge for constructing topologies for complex disease-related pathway.
Genetic interaction was originally explored at the SNP level, named epistasis. Methods (Li et al., 2015a;Ritchie and Van Steen, 2018;Lyu et al., 2020) can be classified into three categories based on their search strategy: exhaustive methods, searching methods, and machine learning-based methods, such as statistics based on entropy (Dong et al., 2008) and odds-ratios (Emily, 2012); MDR (Ritchie et al., 2003), BEAM (Zhang and Liu, 2007), BOOST (Wan et al., 2010), Epi-GTBN (Guo et al., 2019), GenEpi , and some accelerate methods (Nobre et al., 2021). For example, a logistic regression analysis revealed a significant interaction between the genes ERAP1 (rs27524) and HLA-C (rs10484554) in psoriasis (p 6.95 × 10 −6 ), indicating that ERAP1 SNP was effective only in individuals who had at least one copy of the HLA-C SNP risk allele (Képíró et al., 2021). The statistical weakness of high-order or pairwise tests, which come from enormous multiple testing corrections over all pairs of SNPs, is one of the general problems of these marker-based approaches. Instead, we explored the interaction of two genes in a single gene-based interaction detection by treating SNPs inside a gene as a group.
The effectiveness of gene-based methods in GWAS marginal association studies should be extended to the study of gene-gene interaction (GGIs) (Emily, 2018;Emily et al., 2020). This strategy offers a number of possible benefits. For starters, it often has substantially fewer genes than SNPs, which dramatically decreases the number of pairwise testing. To discover GGIs in pair of 20,000 genes, for example, ∼ 2 × 10 8 tests are necessary. However, for three million SNPs in a marker-based interaction, more than 5 × 10 12 tests are required. Second, gene-based methods are more powerful statistically because a gene carries more information than individual SNP and genes interact in a variety of ways (Liu et al., 2010;Li et al., 2011;Jiang et al., 2017;Su et al., 2019;Hu et al., 2020;Hu et al., 2021a;Hu et al., 2021b;Guo et al., 2021). Furthermore, these methods can include biological prior knowledge (e.g., information about known gene association within protein-protein interactions (PPIs) or pathways) (Wei et al., 2017a;Wei et al., 2017b;Wei et al., 2018;Liu et al., 2019b;Wei et al., 2019;Zeng et al., 2019;Cai et al., 2020;Zhai et al., 2020;Zhu et al., 2020). Finally, gene-based outcomes stand out for their better interpretability and crucial biological consequences.
Many statistical and computational approaches for detecting gene-based GGIs have been established. Peng et al.(Peng et al., 2010) introduced the canonical correlation-based U statistic (CCU). They calculated canonical correlation of two genes in both cases and controls. They next used CCU to calculate the difference in correlation, which revealed the presence of GGIs between the two genes. However, this strategy only considered linear correlation in the study. CCU was then expanded to Kernelized CCU (KCCU) (Yuan et al., 2012;Larson et al., 2013), where the kernel discovered a nonlinear relationship. Emily (Emily, 2016) recently introduced AGGrGATOr, a method that combines p-values of interaction tests at the marker-level to assess how a pair of genes interacted, which was a strategy that Ma et al. (Ma et al., 2013) previously utilized to discover interactions under quantitative traits. GBIGM is a non-parametric entropy-based approach suggested by Li et al. (Li et al., 2015b).
In this paper, we propose a new approach called gene-based, gene-gene interaction through a maximal neighborhood coefficient (GBMNC), which uses the maximal neighborhood coefficient (MNC) (Cheng et al., 2020) to identify gene-gene interaction of complex diseases at the gene-level in case-control studies. MNC measures a wide variety of dependence with no bias toward relationship types between two random vectors of arbitrary, but not necessarily equal, dimensions; this is superior to Pearson's correlation, which only consider linear correlations. We introduced a statistic that uses the difference of MIC in cases and controls as an indicator of occurrence of GGIs, bases on the assumption that the joint distribution of two genes should not be significantly different in case and in control samples if there is no interaction between them (i.e. independent) under complex diseases. In simulation studies, our method exhibited an outstanding performance in recognizing the underlying GGIs at the gene level under a variety of conditions. Its application using real data sets showed accurate identification of GGIs.

MATERIALS AND METHODS
The statistical procedure for GBMNC is described in depth in this section. We give different parameter settings for simulation studies to evaluate the power to identify GGIs and the ability to control type-I error. Then, we adopted a real-world Rheumatoid Arthritis data set from the WTCCC (Wellcome Trust case Control Consortium) database to evaluate out method's effectiveness in a real situation.

Preliminaries and Notation
Here, we take genes, a couple of SNPs, as the basic unit. Suppose that we have n random samples: where G 1,i g 1,i,1 , g 1,i,2 , . . . , g 1,i,p , G 2,i g 2,i,1 , g 2,i,2 , . . . , g 2,i,q , i 1, 2, . . . , n and G 1 and G 2 represent two genes each with p and q SNPs, independently. In the case-control studies, y i ∈ {0, 1} is a categorical label where 0 is a control subject and one is a case subject. g k,i,j ∈ {0, 1, 2} represents the copy number of the minor alleles of SNP j in gene k for sample i. In this work, to investigate whether there is a statistical interaction between two genes in a qualitative phenotype, we designed a statistic based on the maximal neighborhood coefficient to characterize the GGI intensity. We applied a permutation strategy to estimate the distribution of the statistic. Our approach was based on the intuition that, if there was no interaction between two genes, then, if they were independent of the case set, they should be independent of the control set; if they were dependent on the case set, they should be dependent on the control set as well, and the "strength" of such dependence should be the same for the case and control sets. Pearson's correlation Frontiers in Genetics | www.frontiersin.org December 2021 | Volume 12 | Article 801261 coefficient measures the degree of dependence between two random variables. However, it can only measure linear dependency and not nonlinear dependency, and it is not very convenient for random variables that take a value in R n . Therefore, we proposed to measure dependency between random variables by the maximal neighborhood coefficient (MNC) instead.

Maximal Neighborhood Coefficient
MNC is an association measure that decipher the potential complex associations from neighborhood insight. It assumes that if a relationship exists between two variables, the samples of each variable will appear to have a similar neighborhood tendency to approximate that relationship, and MNC can find those common neighborhood structures by exploring the possible neighborhoods of each variable. By introducing a k-NN granule to reconstruct samples, and a novel neighborhood mutual information (NMI) to measure the certainty information of one variable from another under a fixed (k x , k y ) neighborhood combination, MNC enables us to detect more complex associations. Let S {(x 1 , y 1 ), . . . , (x n , y n )} ∈ R 2 be a finite set that is sampled from a joint distribution (X, Y), and S X {x 1 , . . . , x n } and S Y {y 1 , . . . , y n } represents samples from marginal variables X and Y, respectively. Given a designated neighborhood combination (k x , k y ) (a pairwise positive integer), The cover of samples S under (k x , k y ) is recorded as C kx,ky . Let S| C kx,ky represents the distribution of S on the cover C kx,ky , and different neighborhood combinations produce different distributions.
MNC is defined based on the neighborhood characteristic matrix (NM) of a sample set S. Given a finite data set S and a neighborhood combination (k x , k y ), the element of NM of S is: NMI(S| C kx ,ky ) denotes the neighborhood mutual information of distribution S| C kx ,ky . The neighborhood mutual information of (x i , y i ) is defined as follow: Based on the equation above, the neighborhood mutual information of (X, Y) is defined as: With the definition of NM(S) in Eq. 2, NMC is defined as: where NB(n) is the search range, and 1 ≤ k x k y ≤ O(n α ) for some 0 < α < 1. It also naturally extends to the case of two random vectors with arbitrary, but not necessarily equal, dimensions.

Illustration of the GBMNC Workflow
Assume there are n 1 control samples and n 2 case samples in a case-control study for a pair of genes such that G 1 has p SNPs and G 2 has q SNPs. Let MNC n (G 1 , G 2 ) be the sample association score between G 1 and G 2 . First, we calculate the MNC C n 1 (G 1 , G 2 ) for control samples and MNC D n2 (G 1 , G 2 ) for case samples. Second, we design a statistic ΔMNC to measure the difference in MNC between cases and controls. ΔMNC represents how different the two joint distributions The larger the ΔMNC, the higher the probability that G 1 and G 2 interact.
To get a p-value, we needed to estimate the distribution of ΔMNC 0 under the null hypothesis. Here, we used a nonparametric strategy based on permutation: we shuffled the label y randomly m times, calculated ΔMNC using the same procedure above, and used the resulting empirical distribution as an estimate for the distribution of ΔMNC under the null hypothesis. Let the result of these m permutations be ΔMNC 1 , . . . , ΔMNC m , then an estimated p-value for the null hypothesis is We summarized the process of GBMNC in the algorithm below (Algorithm 1) and presented the overall workflow ( Figure 1). Algorithm 1. GBMNC Data: Genotype G 1 , G 2 , Phenotype y, permutation times m Result: significant p-value for interaction between G 1 , G 2 1 Calculate MNC C n 1 (G 1 , G 2 ) and MNC D n 2 (G 1 , G 2 ) for both ; 3 for i 1 to m do 4 Randomly permute label y, and generate the new data set; 5 Repeat Steps 1 and 2; 6 end 7 Estimated p-value of ΔMNC 0 is the number of ΔMNC i , i 1, . . . , m, which are larger than ΔMNC 0 , divided by m.

Simulation Study
To assess the performance of GBMNC to control type I error and the power to detect GGIs, we compared GBMNC with KCCA (Larson et al., 2013), GBIGM (Li et al., 2015b), and AGGrEGATOr (Emily, 2016).

Simulation With GAMETES
The goal of this simulation study was to evaluate the performance of the GBMNC procedure to detect gene-gene interaction. We set all simulated datasets to have 50 SNPs. Among them, two SNPs were functional, and the remaining 48 SNPs were non-functional. The 50 SNPs formed five genes, and each had 10 SNPs. The two functional SNPs were put into the first and second genes. We chose the publicly available tool GAMETES (Urbanowicz et al., 2012) to generate the simulated genotype data. This tool was designed to generate pure and strict epistasis models. Pure and strict epistasis models are the most difficult disease-related patterns to identify. Such associations can only be observed if all n-loci are included in the disease model. This requirement makes these types of models an attractive gold standard for simulation studies of complex multi-locus effects.
Evaluation of Type-I error: The type-I error indicates the ability of a method to reject the null hypothesis when it is true (i.e., the false positive rate). We used GAMETES to generate the custom disease model ( Table 1) with one causal SNP pair. c characterizes the baseline odds (i.e., the odds conditional on genotype pair AABB). We ran the simulation 100 times with each sample size n ∈ {1k, 2k, 3k, 4k, 5k} and c 1. The significance level α was set to be 0.05.
Evaluation of power of the test: The power of a test indicates the probability that the method rejects the null hypothesis correctly when the alternative hypothesis is true. In this simulation study, we generated 100 data sets for each parameter settings. The power under each parameter setting was expressed by the frequency, and the null hypothesis of the data set was rejected correctly at the significance level of α 0.05. 1) To assess the impact of heritability h, which measured the intensity of correlation between genotype and phenotype, we chose h ∈ {0.01, 0.025, 0.05, 0.1, 0.2} and two different minor allele frequencies MAF ∈ {0.2, 0.4} with population prevalence set to 0.2 and sample size set at 4,000. Under each parameter combination, five models were generated so that we had a total of 100 models that followed Hardy-Weinberg proportions. For a specified genetic constrain combination, the 10 models were sorted roughly by the ascending customized odds ratio (COR) using GAMETES and labeled M1 to M5. COR is a metric of detectability that was calculated directly from the genetic model. The higher it is, the easier it is to detect GGIs. GAMETES generated the penetrance tables for these 100 models in the absence of the main effect. One hundred replicated data sets were generated from each model with balanced cases and controls, which resulted in 5,000 data sets in total in this scenario. 2) To evaluate the influence of sample size, we set heritability to be 0.025, MAF ∈ {0.2,0.4} and prevalence to be 0.2 with a sample size of 10,000. Then, 100 data sets were generated by random sampling from this large dataset for each of the sample sizes n ∈ {1k, 2k, 3k, 4k, 5k}. In this scenario, we had 1,000 datasets in total.
For GBMNC, KCCU, AGGrEGATOr, and GBIGM, if the number of data sets with a significance level less than α is m 1 , then the power can be calculated by the following formula: power m 1 100 GBIGM and AGGrEGATOr methods are nonparametric methods, so no parameters need to be specific. We only set the ratio of the trimmed jackknife to 0.05 (ω 0.05) for KCCU.

EXPERIMENTS USING RHEUMATOID ARTHRITIS DATA
To evaluate GBMNC's ability to process real GGIs in a qualitative data set, we analyzed the susceptibility of a series of pairs of genes in Rheumatoid Arthritis (RA). RA is a chronic autoimmune disease that causes pannus development and cartilage and bone loss in synovial joints. It leads to progressive bone deterioration and interferes with bone repair. In this work, we used the WTCCC (2007) data set, which includes genotype data from the British population obtained by the Affymetrix GeneGhip 500 k. Our dataset was pre-processed in the following ways: 1) We used pathway hsa05323 from the KEGG pathway database to validate the GGIs in the RA. The WTCCC data set's genotyping coordinates can be found in UCSC hg18/NCBI Build36. This pathway contained 90genes. Many of the genes belonged to the protein combinations MHCII and V-ATPase. Because numerous GGIs happened on their own, we only chose representative genes from each protein combination and then remove the others. Finally, 48genes remained, resulting in a total of C 2 48 1128 pairs of genes to be analyzed. 2) We collected the detailed gene information from the NCBI Build36 annotation file, and for each gene, we inserted a 10 kb buffer region both downstream and upstream of the originally defined gene location. For each gene, all SNPs within the area were chosen. 3) According to the quality control of GWAS, samples that included gender that did not match the chromosome X heterozygote rates were removed. SNPs were also removed if any of the following requirements were met: the missing rate in the sample was ≥ 10%, MAF was ≤ 0.05, or the frequency of control violated Hardy-Weinberg equilibrium (p < 0.0001). Finally, 385 SNPs remained in 4,966 samples, which included 2,993 control subjects and 1973 case subjects.

RESULTS AND DISCUSSION
The experimental environment for all the following results was a workstation with an Intel Xeon CPU E5-2,620 v2 at 2.10GHz, 96 GB of DDR3, and python3.6.

Evaluation of Type-I Error
For type-I error, we varied the sample size from 1,000 to 5,000. Except for GBIGM with n 1, 000, all methods tested had a type-I error comparable to a significance level α 0.05 (Table 2), which implied that these methods controlled for type-I error for various sample sizes quite well.

Evaluation of the Power of GBMNC
Impact of heritability: To evaluate the statistical power of our GBMNC and the other three methods, we used 10 heritability-MAF combinations, with a population prevalence of 0.2, a sample size of 4,000, and heritability that varied from 0.01 to 0.2 ( Table 3). The bold in Table 3 shows the best-performed method in each model under a given heritability-MAF combination. Notice that a larger value indicates better performance. On average, GBMNC was the best performing algorithm in this comparison. It largely outperformed the other methods, but not for all the data sets; it was inferior to AGGrEGATOr for some data sets. However, its performance was remarkably consistent, and it was the top performer for most data sets. AGGrEGATOr achieved the same performance when MAF was 0.2 and heritability was >0.05.
The power of all the methods was significantly affected by heritability (i.e., the effect size of interaction) ( Table 4). A larger heritability led to better performance for all methods under a specific MAF. When heritability varied from 0.01 to 0.025, GBMNC almost doubled its power for a given sample size of 4,000 with MAF 0.2. Other methods also show a steady upward trend ( Table 4). The power also depended on the MAF of the interacting SNPs (e.g., for the cases of h 0.01, the power of GBMNC under model M1-M5 ranged between 0.13-0.89 for MAF 0.2, but it ranged between 0.66-0.96 for MAF 0.4 (Table 3)). The average power was 0.564 for MAF 0.2, which was much lower than 0.818 for MAF 0.4 ( Table 4).
It is worth noting that under the same combination of habitability and MAF, GBMNC was more stable under models with different COR compared with AGGrEGATOr ( Figure 2). KCCU detected the interaction of some simulated disease models in our study, and it had a similar performance pattern with AGGrEGATOr. However, AGGrEGATOr was much more powerful in most of the simulated scenarios. GBIGM had little power to detecting pure gene-gene interaction,. This result replicated Emily's (Emily, 2016) result of the simulation.
Impact of sample size: The sample size of the data set had a considerable effect on power. Let the sample size be n ∈ {1k, 2k, 3k, 4k, 5k}, h 0.025, and MAF ∈ {0.2, 0.4} ( Table 5). As the sample size increased, the power of all methods increased almost monotonically under different MAF settings. With all methods, a larger sample size corresponded to better performance.
In conclusion, in simulated studies, our results showed that GBMNC detected gene-gene interaction effectively, in which a

EXPERIMENTS USING RHEUMATOID ARTHRITIS DATA
RA is a chronic autoimmune disease where HLA genes, TNF family, and TRAF1 are important genetic risk factors in the development. Each unique gene pair of the hsa05323 pathway was evaluated in the RA study, which resulted in C 2 48 1128 total pairs for 48 genes. With a significance level α 0.01 and multiple testing adjustment, for KCCU and GIGBM, we obtained 159 and 134 significant GGIs, respectively. Among them, 30 and 65 had p-values equal to 0; hence we were unable to rank them in the order of significance. AGGrGETOr did not show any significant results. Following Emily (Emily, 2016), and after removing the multiple testing correction, AGGrGETOr exhibited 17 significant GGIs, which we ranked by their  p-values. We chose the top 10 gene pairs obtained by GBMNC and by AGGrGETOr to analyze, which comprised approximately 1% of the total interactions ( Table 6).
We found that some of our findings were supported by prior research (Xiao et al., 2008;Klocke et al., 2016;Cen et al., 2019). For instance, our method detected a significant interaction between IL17 and TNFSF13B. Studies (Xiao et al., 2008) show that both B cells and T cells formed aggregates in the synovium of inflamed joints and mediated the pathogenesis of RA, and B-cell-activating factor (BAFF, also named TNFSF13B, BLys) played a vital role in B-cell survival and maturation. After activation and expansion, CD4 + T cells developed into different T helper cell subsets with different cytokine profiles and distinct effector functions. In addition to Th1 and Th2 cells, Th17 cells were a third T helper cell and produce IL-17. Th17 cells can recruit and activate inflammatory cells and they have been recognized as a primary cause of bone destruction and inflammation in autoimmune diseases. BAFF promoted Th17 cell proliferation and expansion preferentially (Lai Kwan Lam et al., 2008). IL-17 was a key cytokine for BAFF-mediated proinflammatory effects during collagen-induced arthritis pathogenesis. Only one pair of potential interactions between CD80 and CTSL was captured by both methods within the top 10 GGIs. However, there is not yet direct evidence to show the interaction between CD80 and CTSL.

CONCLUSION
The study of detecting GGIs is of great importance in understanding the pathogenesis of complex human diseases. In this paper, we proposed a gene-based GGI detection method called GBMNC based on a maximal neighborhood coefficient and a permutation strategy for case-control studies in GWAS. The method not only benefited from the ability of a maximal neighborhood coefficient, which considered the neighborhood structure of each sample and captured a wide range of associations, but also from the robustness of our permutationbased hypothesis testing scheme.
We designed a statistic to capture the different intensities of interaction between two genes in both cases and controls, then transformed the problem of GGI detection into a form of hypothesis testing; our null hypothesis was there was no significant difference in the relationship between the two genes in the disease data and the control data. This hypothesis did not limit the form of interaction between genes, and it enhanced the method's ability to detect different types of interactions. We demonstrated the effectiveness of our method through a simulation study and retrospective analysis of rheumatoid arthritis. Under a large range of settings, GBMNC outperformed previous methods in the power to detect GGIs. The statistical power of our method increased monotonically with the increase in the heritability and the MAF. The method was also stable to sample size based on a test of false positive rates. MNC did not restrict the dimension of two random vectors. Therefore, it is possible to generalize the method for marker-based detection of gene pairs that are identified as interactive. Investigating the mechanism of gene-based interaction at the marker level might point the way for further research. In summary, GBMNC is a helpful addition to the current toolbox of statistical models to elucidate GGIs in case-control studies.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: https://www.wtccc.org.uk/info/access_to_ data_samples.html.