Nearest-Neighbor Projected Distance Regression for Epistasis Detection in GWAS With Population Structure Correction

Nearest-neighbor Projected-Distance Regression (NPDR) is a feature selection technique that uses nearest-neighbors in high dimensional data to detect complex multivariate effects including epistasis. NPDR uses a regression formalism that allows statistical significance testing and efficient control for multiple testing. In addition, the regression formalism provides a mechanism for NPDR to adjust for population structure, which we apply to a GWAS of systemic lupus erythematosus (SLE). We also test NPDR on benchmark simulated genetic variant data with epistatic effects, main effects, imbalanced data for case-control design and continuous outcomes. NPDR identifies potential interactions in an epistasis network that influences the SLE disorder.


INTRODUCTION
An important challenge for machine learning in GWAS is to perform computationally efficient screening for variants involved in complex genetic models, including epistatic effects. The identification of interactions in GWAS may lead to an increased understanding of pathogenic mechanisms and potential therapeutic targets, but low minor allele frequencies and the curse of dimensionality make interaction detection difficult. Machine learning methods also face the challenge of identifying statistical thresholds that limit false discoveries and handling the intricacies of biomedical studies such as covariates and population structure.
Recently we developed a flexible nearest-neighbor-based machine learning feature selection method called Nearest-neighbor Projected Distance Regression (NPDR) to address these challenges (Le et al., 2020). NPDR integrates a regression formalism to allow statistical significance testing with projected nearest-neighbor machine learning to enable detection of complex multivariate models in high dimensional data. The projection of nearest neighbors from high dimensions onto single feature dimensions allows NPDR to detect features involved in complex patterns with other features in high-dimensional data that influence phenotypic variance. The regression formalism of NPDR maintains the ability to detect interactions while providing a statistical basis for feature selection thresholding and control of false discoveries due to multiple hypothesis testing.
In the current study, we demonstrate the capabilities of the NPDR framework to detect variants involved in complex genetic models and to adjust for population structure. We compare the performance of NPDR with random forest and univariate analysis on a panel of benchmark simulated genetic variant data described by Urbanowicz et al. (2018). We analyze data with multivariate main effects and multiple epistatic effects and outcomes with balanced and imbalanced cases-control ratios as well as continuous variation. Consistent with our previous studies (McKinney et al., 2009;Le et al., 2020), we show that random forest is able to detect interactions when the number of predictors is small but its power diminishes with the dimensionality of the data. NPDR is less susceptible to the curse of dimensionality as we show it is able to detect interactions with statistical significance in both low and high dimensional contexts.
In addition to adjustment for multiple testing, NPDR enables the adjustment for covariates such as sex, age, or population structure -due to population stratification or cryptic relatedness. Population structure leads to linkage disequilibrium (LD) and this deviation from independence may increase false associations (McCarthy et al., 2008;Chen et al., 2016). The confounding effect of population structure may be exacerbated for complex models involving interactions between variants. Covariate adjustment is challenging for many machine learning methods that have the flexibility of being model free (Le et al., 2020). NPDR is model free in its use of nearest neighbors for detecting interactions, but it includes a statistical model for the projected distance for each feature. This generalized linear model (GLM) of projected distances then allows for the inclusion of projected distance covariates such as principal components (PCs).
Systemic lupus erythematosus (SLE) is an autoimmune inflammatory disease characterized by antinuclear autoanti bodies, complement and interferon activation, and tissue destruction. It predominantly affects women. Numerous immune-related genes and genes with other functions have been shown to predispose to SLE (Harley et al., 2008;Gregersen and Olsson, 2009), but there is a need to identify other genomic factors that may be interacting with each other as pairs or in a higher-order network to influence the development of this complex disease (Davis et al., 2013;Tyler et al., 2019). We use NPDR to enrich for interactions in the systemic lupus erythematosus genetics (SLEGEN) GWAS, which consists of females of European ancestry (720 SLE and the 2,337 controls) (Harley et al., 2008). Although the SLEGEN data is a homogeneous sample, we demonstrate the ability of NPDR to adjust for possible cryptic relatedness by including PCs as covariates. Identifying additional interacting variants may lead to a better understanding of the pathways affecting SLE.

Nearest-Neighbor Projected-Distance Regression
Relief-based methods are known for their ability to identify interactions with computational efficiency but generally do not account for statistical significance of the attributes that may lead to high misclassification rate. In order to control false discoveries and adjust for covariates, we developed NPDR to use the GLM to perform regression between nearest-neighbor pair distances projected onto each predictor dimension (Le et al., 2020). We define the NPDR neighborhood set N of ordered pair indices of subjects as follows.
In NPDR, instance (e.g., subject) i is a point in p attribute (e.g., variant) dimensions, and the topological neighborhood of i is labeled by N i . This neighborhood is a set of other instances trained on the dataset X m×p of m instances and p attributes and depends on the type of Relief neighborhood method (e.g., fixedk or adaptive radius) and the type of metric (e.g., Manhattan or Euclidean). If instance j is in the neighborhood of i (j ∈ N i ), then the ordered pair is in the overall neighborhood ((i, j) ∈ N ) for the projected-distance regression analysis. The ordered pairs constituting the overall neighborhood can then be represented as nested sets: The cardinality of the set {j = i : j ∈ N i } is k i , the number of nearest neighbors for subject i. In the analyses in the current study, we use an adaptive k for hits and misses, k = 0.154 (m-1), that has shown good balance between detecting main effects and interaction effects (Le et al., 2019(Le et al., , 2020.
We compute the distance between two instances i and j in the space of the set A of all attributes with an L q metric where |A| = p is the number of attributes in the dataset. We use q = 1 (Manhattan) in this study. The projected difference or diff function [d ij (a)] between two instances i and j onto a SNP is of critical importance to the NDPR algorithm and can be computed by various difference functions. The standard difference used by Relief-based algorithms for categorical variables is a binary mismatch. For SNPs, this genotype mismatch (GM) is a 0 or 1 difference between two individuals (R i , R j ) for a SNP, a, based on the individuals' genotypes for this SNP. Specifically, the diff function is In other words, two individuals have zero diff if they have identical genotypes and they have unit diff if they have different genotypes. A potential drawback of GM is that it is not sensitive to heterozygous genotype differences when computing the diff. The following allele mismatch (AM) diff accounts for the difference in the number of alleles for a SNP when computing the distance between two individuals (Arabnejad et al., 2018). The AM difference of two individuals can be calculated by the following formula: where genotype (a, R i ) is the genotype encoding for number of copies of the reference allele for SNP a for individual i. In other words, the value of genotype (a, R i ) is the number of minor alleles in the genotype: 0, 1, or 2. Then the return value of d AM ij (a) is 0, 0.5, or 1 when the two individual have 2, 1, or 0 alleles in common, respectively. Other projected differences and metrics are described by Arabnejad et al. (2018).
Nearest-neighbor Projected-Distance Regression uses the GLM to perform regression between nearest neighbors. For each attribute, the NPDR model fits a GLM to the attribute's FIGURE 2 | Performance of three feature selection algorithms (from left to right): NPDR, random forest (RF) using permutation P-value, and univariate regression for three simulation models from https://github.com/EpistasisLab/rebate-benchmark/. Simulated models include (from top to bottom) an interacting pair, an interacting pair with a continuous outcome, and imbalanced case-control (60%). The -log(adjusted P-value) is plotted for the 20 variants in each dataset for 30 replicate simulations. The functional variable names (blue) begin with letter M, and the background variable names (red) start with letter N. Datasets have 1,600 samples (800 cases and 800 controls). Additional dataset details are given in Table 1. The dashed line represents Bonferroni adjusted P-value of 0.05.
projected distances between all pairs of nearest neighbors. The regression coefficients are calculated to minimize the least-squares error. For case-control phenotypes, p miss ij is the probability that subjects i and j are in the opposite phenotype class (misses) versus the same class (hits). We model this probability from the projected differences for SNP a with logit function: logit p miss ij = β 0 + β a d ij (a) + ε ij . The NPDR test statistic for attribute a is the β a estimate with one-sided hypotheses: H 0 : β a < 0. H 1 : β a ≥ 0. The quantity e β a is the predicted change in odds of neighbors being in opposite classes when the difference of the attribute a changes by one unit. For a continuous outcome (quantitative trait), NPDR uses linear regression of the numerical difference d num ij (y) of the outcome y between neighbors: and feature importance and significance are again determined from the coefficient β a . False-positive associations can arise in GWAS due to population stratification or cryptic relatedness. A standard approach to correct for population structure is to include PCs in the regression model to account for the genetic background. Many machine learning feature selection algorithms have limited ability to adjust for population structure or other potentially confounding covariates. However, the NPDR formalism can adjust for multiple covariates by including projected differences d num ij ( y covs ) for each covariate attribute in the regression model. The covariate adjusted model then becomes, where the covariate coefficient vector for 10 PCs is, and again p miss ij is the probability of neighbors having a different phenotype. Neighbors are still determined in the attribute (variant) space, but we add additional covariate diffs to the NDPR regression model.

Simulated Data
We compare methods using existing simulated data from the epistasis benchmark described by Urbanowicz et al. (2018) and available at https://github.com/EpistasisLab/rebate-benchmark/. For simplicity, many of the benchmark simulations use 20 total variants, but we also compare performance for multiple replicate simulations with 10,000 total variants to demonstrate computational feasibility and the effect of higher dimensionality (Table 1 summary of datasets). For case-control data, we use data with 1,600 balanced instances (800 cases and 800 controls) and one imbalanced scenario with (60% cases). Datasets have a heritability effect size of 0.4, minor allele frequency of 0.2 and include models with 2-4 functional variants and models with additive main effects and epistatic effects. We also use a dataset with a pair of interacting variants that influence a continuous endpoint.

Real GWAS
We apply NPDR to a study of females with European ancestry with genotyping data for 317,503 SNPs for 720 SLE subjects and 2,337 controls from the SLEGEN consortium (Harley et al., 2008). All women with SLE satisfied the revised criteria for classification of SLE from the American College of Rheumatology. The study sample consisted of 730 unrelated women with SLE and 475 controls from SLEGEN. Additional "out-of-study" European ancestry female controls were added from Illumina's iControlDB. The majority of iControlDB are from the Robert S. Boas Center for Genomics and Human Genetics at the FIGURE 4 | Epistasis Network for Lupus GWAS. Edges are significant pairwise interactions with adjusted P-value < 10 −6 between variants in genes computed after filtering. The edge weights are the magnitude of the statistical interaction between SNPs calculated by -log 10 (adjusted P-value). The espin-like (ESPNL) gene is an epistasis hub with 15 edges. The interaction on the left (HLA-DOB and PBX2) is between genes in the MHC II region.
Feinstein Institute for Medical Research. We reduced the data dimensionality using LD pruning with a correlation threshold of 0.5, minor allele frequency threshold of 0.01, Hardy-Weinberg Expectations (HWE) in controls P > 0.01, and HWE in cases P > 0.0001. LD pruning helps remove redundant features, where a SNP from a pair in high LD is removed from the data (Calus and Vandenplas, 2018). Initial filtering reduced the number of SNPs to 184,170. Due to computer memory constraints of the current implementation of NPDR, we further filtered to 10,000 SNPs based on univariate association. This filtering risks removing some interaction effects but should capture a considerable amount of variation in the data. Future implementations of NPDR will improve memory efficiency and incorporate additional variants. An advantage of NPDR is the ability to incorporate covariates into feature selection. We used the top 10 PCs from the variance-standardized relationship matrix. We mapped the top SNPs found by NPDR to Ensembl gene IDs based on proximity. If the SNP is not in an intron or exon of a gene, the algorithm computes the distance of the variant to the nearest two genes and the SNP is mapped to the gene symbol of the closest gene.

Simulation Analysis
For low-dimensional simulations with only 20 total attributes (Figures 1, 2), both NPDR with AM metric and random forest rank the functionally interacting attributes at the top for all simulation scenarios. All methods detect the main effects (Figure 1), but as expected the univariate analysis cannot detect interaction effects. For low dimensional datasets (20 attributes), random forest is able to exhaustively sample all attributes and find a tree with the interacting attributes. When the total number of attributes increases from 20 to 10,000 (Figure 3), random forest is unable to detect the functional interactions with average ranks near random (5,000). In this case of 10,000 attributes, the random forest ranking is very similar to a univariate ranking, while NPDR has good rankings for interacting attributes using either the GM or AM projected difference metrics (Figure 3). This is consistent with our previous results that random forest is unable to detect interactions beyond random chance in high dimensional data, whereas Relief-based methods are less affected by the curse of dimensionality (McKinney et al., 2009). NPDR also performs TABLE 2 | Logistic regression interaction statistics for pairs of variants in genes in the epistasis network (Figure 4) for SLE GWAS with adjusted P-values < 10 −6 . well for multiple pairwise interactions (Figure 1), interactions for a continuous outcome (Figure 2, middle row) and imbalanced case-control data (Figure 2, bottom row).

SLE GWAS
We apply NPDR to the SLEGEN data, which is a real GWAS composed of females with SLE and healthy controls. Although the study is composed of European ancestry individuals, we include 10 PCs as covariates in NPDR to account for possible cryptic relatedness. Following filtering we create a network from significant (adjusted P-value < 1e-6) pairwise interactions (Figure 4). This edge significance threshold results in 35 edges ( Table 2) in the epistasis network. The espin-like (ESPNL) gene is a hub of the network, involved in 15 of the 35 significant interactions. The particular interacting SNP (rs10210979) in ESPNL is an intron variant on chromosome 2. Although ESPNL is involved in hearing, it is ubiquitously expressed and some of its interactions may indicate novel function for the immune system. In addition to this hub gene, there is an interesting interaction between HLA-DOB and PBX2 (Pre-B-cell leukemia homeobox 2). Both of these genes are located within the major histocompatibility complex (MHC) class II region on chromosome 6, and HLA-DOB is a beta chain of the MHC class II molecule.

CONCLUSION
Machine learning feature selection methods are needed to enrich for attributes involved in complex interaction network effects in high dimensional data, such as GWAS and gene expression, for case-control and quantitative trait studies. In addition to interactions, machine learning methods need to handle complicated modeling scenarios, such as controlling for potential confounders from demographic data or population structure, which is a perennial challenge for GWAS data.
In the current study of GWAS data, we applied a new feature selection technique called NPDR that uses the GLM to perform regression between nearest-neighbor pair distances projected onto predictor dimensions. NPDR detects interaction structure using local nearest-neighbor information in the full space of predictors, which may be SNPs or expression levels. Using simulated GWAS, we showed that NPDR has good power to detect functional variants in a variety of simulation scenarios including case-control data with and without imbalance, quantitative trait outcomes, main effects, and multiple pairwise epistatic effects. Similar to our previous findings (McKinney et al., 2009;Le et al., 2020), NPDR is less susceptible to the curse of dimensionality than random forest because when the total number of variants increases to 10,000, the ranking of interacting variants by random forest is consistent with a random ranking, while NPDR consistently ranks the functional variants near the top.
We demonstrated NPDR's ability to handle covariates by including the first 10 PCs in the NPDR models for a GWAS of SLE. Previously we showed that using the covariate term in NPDR can remove genes from nearest-neighbor feature selection that are confounded by sex (Le et al., 2020). In the current study, we constructed a candidate epistasis network for SLE from the filtered data, and found the ESPNL gene is a hub in the network with 15 of the 35 statistically significant interactions. There is no prior evidence for the role of ESPNL in autoimmunity, but it is ubiquitously expressed. Replication and functional investigation of these interactions are needed to identify mechanisms in the pathogenesis of autoimmunity. The lupus epistasis network also contains a significant interaction between HLA-DOB and PBX2, which are both located within the MHC class II region on chromosome 6. A limitation of this discovery analysis was the lack of a replication dataset. Another limitation is the need for SNP filtering in the current NPDR implementation in R for GWAS. Future implementations will take advantage of binary GWAS data formats for improved memory management.

AUTHOR CONTRIBUTIONS
MA and BM conceived of the project and wrote the initial draft. MA performed the analyses. CM and PG provided the statistical and biological expertise. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported in part by the National Institute of Health GM121312 and GM103456, and the William K. Warren Jr. Foundation.