Your new experience awaits. Try the new design now and help us make it even better

ORIGINAL RESEARCH article

Front. Genet., 19 September 2025

Sec. Computational Genomics

Volume 16 - 2025 | https://doi.org/10.3389/fgene.2025.1668200

A hypergraph neural network for prioritizing Alzheimer’s disease risk genes

Meng MaMeng Ma1Chao DengChao Deng2Yan LiuYan Liu1Qingqing CaoQingqing Cao1Fang LiuFang Liu1Yan Zhang,
Yan Zhang2,3*
  • 1College of Information Engineering, Hunan Open University, Changsha, China
  • 2School of Computer Science and Engineering, Central South University, Changsha, China
  • 3School of Computer Science and Engineering, Hunan University of Information Technology, Changsha, China

Identifying the complex genetic architecture of Alzheimer’s disease (AD) is critical for understanding its pathophysiology. While network-based computational methods assist in this task, they primarily model simple pairwise gene interactions and fail to capture the higher-order associations of genes that drive complex diseases. To address this limitation, we introduce HyperAD, a novel hypergraph neural network framework designed to predict AD risk genes by explicitly modeling these higher-order associations of genes. HyperAD constructs a hypergraph in which functional gene sets from databases such as MSigDB form hyperedges, and uses a two-stage hypergraph message passing neural network to extract high-order association information from the hypergraph. Comprehensive evaluations demonstrate that HyperAD significantly outperforms state-of-the-art methods. We validate the prediction results of HyperAD through multiple lines of evidence. HyperAD-predicted genes are enriched in AD-related biological processes and have significant associations with known related genes in terms of sequence similarity, protein interaction, and miRNA. In addition, their protein expression levels are significantly altered in the brains of AD patients, and they contain both known risk sites and new, high-confidence candidate genes. HyperAD provides a more accurate and biologically insightful tool for prioritizing genes and unraveling the complex genetic landscape of AD.

1 Introduction

Alzheimer’s disease (AD) is a progressive and devastating neurodegenerative disorder that represents the most common cause of dementia worldwide. As global populations age, the societal and economic burden of AD is escalating, making the search for effective therapeutic strategies a global health priority. The disease has a strong genetic component, with heritability estimates ranging from 60% to 80% (Bellenguez et al., 2022). While the discovery of rare, highly penetrant mutations in genes such as APP, PSEN1, and PSEN2 confirmed the role of genetics in early-onset familial AD, the genetic architecture of the more common late-onset AD (LOAD) is far more complex and remains incompletely understood (Spina et al., 2021). Therefore, the systematic identification and prioritization of novel AD risk genes are fundamental to unraveling the intricate pathophysiology of the disease, identifying new biological pathways for investigation, and ultimately developing targeted therapies.

Genome-wide association studies (GWAS) have been the primary engine for discovering common genetic variants associated with AD (Bellenguez et al., 2022; Dalmasso et al., 2024). To date, these efforts have successfully identified dozens of risk loci. For instance, a landmark GWAS meta-analysis by (Jansen et al., 2019) identified 29 risk loci, implicating potential pathogenic genes such as ADAMTS4 and KAT8. These findings highlighted the enrichment of risk genes in immune-related cell types, particularly microglia, and their involvement in pathways such as lipid processing and amyloid-beta precursor protein degradation. More recently, a large two-stage GWAS by (Bellenguez et al., 2022) further expanded the genetic map of AD, identifying 75 risk loci, 42 of which were novel, including those near SORT1 and ANK3. Despite these successes, GWAS have inherent limitations. The stringent p-value thresholds required to correct for multiple testing can lead to high false-negative rates, while the analysis often identifies large genomic loci containing numerous genes, making it challenging to pinpoint the true causal gene (McCarthy et al., 2008; Lee et al., 2011). Furthermore, these studies demand massive, well-phenotyped patient cohorts, which are expensive and time-consuming to assemble. Consequently, computational approaches are essential to complement GWAS findings, helping to prioritize the most promising candidates from thousands of potential genes for functional validation.

In recent years, a variety of computational methods have been developed to predict disease risk genes. The majority of these methods operate on the “guilt-by-association” principle, which posits that genes involved in the same disease are likely to be functionally related and thus exhibit proximity or strong connections within biological networks (Cowen et al., 2017; Yao et al., 2018; Valdeolivas et al., 2019; Ata et al., 2021). Methods based on protein-protein interaction (PPI) networks or functional gene networks have been applied to this problem. In the study of disease gene identification, network-based machine learning approaches have proven effective. For instance (Yao et al., 2018), developed a Bayesian framework to construct tissue-specific functional gene networks by integrating multiple types of gene association data. These networks were then used as features to train machine learning models for identifying disease-associated genes. Similarly (Krishnan et al., 2016), proposed a machine learning approach that leverages a brain-specific FGN to prioritize candidate genes for autism spectrum disorder. In another approach (Tran et al., 2020), introduced DiGI, a node kernel-based method that generates gene features by integrating information from multiple heterogeneous networks, which are subsequently used by a Support Vector Machine (SVM) for prediction. More recently (Mancuso et al., 2022), developed GenePlexus, a framework that utilizes biological networks and a semi-supervised learning approach to identify novel disease-gene associations. These models typically leverage pairwise, or binary, associations between genes to propagate risk information across the network, identifying novel candidates based on their connectivity to known disease genes.

However, a significant limitation of these biological network-based approaches is their reliance on pairwise interactions. The pathophysiology of a complex polygenic disease like AD is not driven by a series of independent gene-gene interactions, but rather by the intricate and collective interplay of multiple genes within complex biological pathways and functional modules (Sims et al., 2020). By modeling only binary relationships, existing methods may fail to capture the higher-order associative patterns that characterize these multi-gene functional units (Li et al., 2022; Deng et al., 2024). This oversight means that crucial information embedded within functionally cohesive gene sets—where the coordinated action of several genes is necessary for a biological outcome—is largely ignored.

To address this critical gap, we introduce HyperAD, a novel hypergraph neural network framework for predicting AD risk genes by explicitly modeling higher-order gene relationships. We leverage the comprehensive gene set collections from the Molecular Signatures Database (MSigDB) (Liberzon et al., 2015), which curates functionally related gene groups based on diverse biological evidence, including chromosomal position, Gene Ontology (GO) annotations (Consortium, 2004), KEGG pathways (Kanehisa and Goto, 2000), and Human Phenotype Ontology (HPO) associations (Köhler et al., 2021). We structure this information using a hypergraph, a powerful data structure where a single hyperedge can connect any number of nodes (Agarwal et al., 2006; Zhou et al., 2006). This representation provides a natural and superior way to model the complex, many-to-many relationships inherent in biological systems compared to traditional graphs. To exploit this rich representation, we designed a two-stage message-passing hypergraph neural network that effectively learns and aggregates features from these complex, higher-order structures. By capturing the collective context of gene function, HyperAD is designed to more accurately prioritize novel AD risk genes and provide deeper insights into the multi-gene mechanisms underlying the disease.

To assess the effectiveness and robustness of our proposed framework, we conduct a series of comprehensive experiments. HyperAD is compared with a series of state-of-the-art AD risk gene prediction methods, and the results demonstrate its superior predictive performance. In addition to these comparative analyses, we also conducted several validation studies to confirm the biological relevance and accuracy of the HyperAD-predicted AD risk gene. Functional enrichment analyses revealed that the HyperAD-predicted AD risk genes are significantly enriched in AD-associated biological processes and share a substantial number of network connections with known AD genes, underscoring their strong functional relevance to the disease. Furthermore, we integrate proteomic expression data into our validation pipeline, which shows that the HyperAD-predicted AD risk genes correspond to proteins that are significantly differentially expressed in AD brain tissues.

2 Materials and methods

2.1 Datasets

To construct the hypergraph, we obtain gene sets from the Molecular Signatures Database (v7.4), one of the most comprehensive and widely used collections of annotated gene sets (Liberzon et al., 2015). MSigDB aggregates data from numerous sources to cover diverse biological contexts, including chromosomal location, biological function, metabolic pathways, and regulatory targets (Subramanian et al., 2005; Liberzon et al., 2011). The database organizes gene sets into nine major collections (H and C1-C8) according to their origin and type. For this study, we use several of these key collections. These included Hallmark gene sets (H), which represent well-defined biological processes, and positional gene sets (C1), which group genes by chromosomal location. We also incorporate curated gene sets (C2) from pathway databases and the literature, alongside regulatory target gene sets (C3) that comprise genes sharing predicted targets for microRNAs or transcription factors. Additional information is derived from ontology gene sets (C5), which group genes based on GO and HPO terms, and computational gene sets (C4) derived from mining cancer-oriented microarray data. Finally, we include several signature-based collections: oncogenic gene sets (C6) from cancer gene perturbation studies, immunologic gene sets (C7) from immune system perturbations, and cell type gene sets (C8) derived from cluster markers in single-cell sequencing studies. The number and description of gene sets in each category are in Table 1.

Table 1
www.frontiersin.org

Table 1. Brief introduction of nine major collections in MSigDB.

To train our model and evaluate its performance, we establish a ground-truth dataset of genes associated with AD. The positive set is composed of 147 high-confidence AD risk genes, manually curated from several authoritative databases such as OMIM (Amberger et al., 2019), GWAS Catalog (Sollis et al., 2023), DisGeNet (Piñero et al., 2020), and AlzGene (Bertram et al., 2007). Constructing a reliable negative set—genes with no association to AD is inherently challenging. To address this, we adopt a stringent filtering strategy. First, we compile a comprehensive list of potential AD-associated genes from the databases mentioned above. We then create a negative (non-AD) gene pool by removing all these potential AD genes from the complete human genome. To mitigate class imbalance during model training, we randomly sampled 1,000 genes from this pool to form our final negative set.

2.2 Method

2.2.1 Brief introduction of hypergraph

A weighted hypergraph is defined as G(V,E,W), where V={v1,v2,v3,,vN} represents the set of N nodes in the hypergraph, E={e1,e2,e3,,eM} represents the set of M hyperedges in the hypergraph, and diag(W)=[w(e1),w(e2),,w(em)] is the diagonal matrix representing the weight of the hyperedge. W(e) is used to express the importance of the hyperedge e in the hypergraph. For an unweighted hypergraph, W can be regarded as the identity matrix. An edge in the graph can only represent the relationship between two nodes, but the hyperedges in the hypergraph can connect more than two nodes and can represent the relationship among multiple nodes. In order to describe the relationship between the nodes and the hyperedges, the hypergraph is usually represented by an incidence matrix H of N×M, and H(va,eb){0,1} indicates that whether the a-th node va is belong to the b-th hyperedge eb:

Hva,eb=1if vaeb0if vaeb(1)

For a node vaV, the degree of the node is defined as:

dva=eEweHva,e(2)

For an hyperedge ebE, the degree of the hyperedge is defined as:

δeb=vVHv,eb(3)

And, we can define the DvRN×N and DeRM×M denote the diagonal matrices of the node degrees and the hyperedge degrees, respectively.

Kipf et al. (Kipf and Welling, 2017) propose the graph convolution network (GCN) through the local first-order approximation of the spectral graph convolution. A GCN layer is define as:

Xl+1=σD̃12ÃD̃12XlΘl(4)

where à denotes the adjacency matrix with self connections and D̃ denotes the degree matrix of the graph. Θ(l) is a learnable weight matrix and σ denotes the activation function.

The Laplace matrix is crucial in the spectral method of graphs. For an graph, the Laplace matrix is defined as Δ=DA. Zhou et al. (2006) use the Laplacian matrix of graphs for reference, introduce the concept of the Laplacian matrix in the hypergraph, and define it as:

Δ=IDv12HWDe1HTDv12(5)

Feng et al. (2019) bring the hypergraph Laplacian matrix Δ into the traditional spectrogram convolution, and define the spectrogram convolution operation of the hypergraph gx=Φg(Λ)ΦTx. At the same time, Feng et al. (2019) use the second-order Chebyshev inequality to approximate gx, and the HGNN is proposed. Finally, a hypergraph convolution layer is defined as:

Xl+1=σDv12HWDe1HTDv12XlΘl(6)

where Dv denotes the node degree matrix and De denotes the hyperedge degree matrix. Θ(l) is a learnable weight matrix and σ is the activation function.

2.2.2 Framework of HyperAD

HyperAD is a novel, end-to-end deep learning framework designed to identify genes associated with AD by modeling higher-order assocation among genes (Figure 1). It uniquely frames gene prioritization as a learning task on a hypergraph, where genes are represented as nodes and diverse gene sets from the MSigDB database as hyperedges. The core of HyperAD is a novel two-stage hypergraph message-passing module that learns gene representations by explicitly modeling the flow of information from genes to pathways and back. Moreover, we propose an AD-specific weighting mechanism that dynamically focuses the model on the most relevant biological contexts during learning, leading to highly specific and robust gene embeddings for AD-associated gene prediction. Finally, the learned gene representations are used for AD gene prioritization.

Figure 1
Diagram showing a model workflow starting with the MSigDB dataset, including hallmark, positional, and cell type signature gene sets. The dataset is used to create a functional gene hypergraph, represented by nodes and edges, and an incidence matrix. This is followed by a hypergraph embedding model, including hypergraph message-passing blocks and fully connected layers, concluding with a softmax operation to produce output Z.

Figure 1. The HyperAD framework for AD gene prioritization. The framework begins by constructing a hypergraph from diverse gene sets collected from the MSigDB, where each gene set forms a hyperedge connecting its member genes. This hypergraph is then fed into a deep learning model composed of multiple hypergraph message-passing modules. These modules iteratively aggregate information across the hyperedges, allowing the model to learn complex associative patterns that are often missed by traditional pairwise network models. Finally, the model predicts a genome-wide AD risk score for each gene, enabling the prioritization of novel candidates.

First, we use a hypergraph to structurally represent the gene sets from the MSigDB database. This database contains gene sets from various sources, including GO terms, KEGG pathways, and HPO. Each gene set consists of genes that are associated in some way. For example, a KEGG pathway indicates that multiple genes are involved in a common signaling pathway. In bioinformatics and computational biology, graphs or network structures are commonly used to describe the relationships between proteins, genes, or other biological entities. However, for annotated gene sets, standard graph structures are not suitable because they cannot represent the higher-order relationships among multiple genes within a single gene set. To accurately describe the complex associations of genes in these sets and to fully utilize their higher-order relationships, our method, HyperAD, uses a hypergraph structure to represent and integrate different types of gene sets. We define the hypergraph constructed using gene sets as G(V,E,W), the node V={v1,v2,v3,,vN} in the hypergraph represents N genes, and the hyperedge E={e1,e2,e3,,eM} represents M annotated gene sets. For the incidence matrix HRN×M of the hypergraph, we define that if the a-th gene va belongs to the b-th annotated gene set eb, then H(va,eb)=1, otherwise it is 0. The incidence matrix H of the hypergraph G is constructed from the inclusion relationship between genes and annotated gene sets.

Next, we detail the architecture of our model, HyperAD. The model first projects the initial one-hot encoded gene features xRN×N into a dense, low-dimensional embedding space xv(0)RN×d using a non-linear transformation layer:

xv0=ReLUxΘ0+b0(7)

where Θ0 represents a learnable parameter matrix, and b0 represents the bias value of the layer. The x is the initial feature vector for gene v, and xv(0) is its embedding after the initial transformation.

The resulting gene embeddings are then processed by a series of our core two-stage hypergraph message-passing blocks. Within each block, a node-centric perspective reveals how higher-order associations are learned by explicitly modeling the information flow from genes to gene sets and back. The update rules for a single node v and hyperedge e (from layer l to l+1) are defined as follows:

Node-to-Hyperedge Aggregation. In this stage, each hyperedge e updates its feature vector ye by aggregating messages from all its member nodes v. This message is the normalized feature vector of the node:

yel=σvNexvldvΘvel(8)

Where xv(l) is the feature vector of node v at layer l and ye(l) is the feature vector of hyperedge e. The N(e) is the set of all nodes contained within the hyperedge e. The d(v) is the degree of node v, used for normalization. The Θve(l) is the learnable parameter matrix. The σ denotes a non-linear activation function, such as ReLU.

Hyperedge-to-Node Aggregation. Subsequently, each node v updates its feature vector xv by aggregating messages from all the hyperedges e it belongs to. In a hypergraph, the weight of a hyperedge signifies its relevance to the classification task. An unweighted hypergraph, such as the one we initially construct, implicitly assumes that all hyperedges—representing diverse biological pathways—contribute equally to predicting AD-associated genes. This assumption is biologically implausible. The etiology of complex diseases like Alzheimer’s is known to be driven by a specific subset of biological processes and molecular functions. For instance, pathways related to synaptic function or neuro-inflammation are critically relevant to AD, whereas some others may not be involved. Treating these distinct pathways with uniform importance would introduce significant noise and overlook crucial disease-specific signals. Therefore, to align our model with this biological reality and enhance its predictive power, we propose a weighted aggregation of hyperedge information. This approach assigns greater importance to pathways more significantly associated with AD. The mechanism is formalized as follows.

we=vVHv,efv,VdvVHv,e(9)
fv,Vd=1if vVd0if vVd(10)

where w(e) denotes the weight of the hyperedge e, and Vd denotes the known disease-associated gene. The f(v,Vd) is used to indicate whether gene v belongs to Vd. This step is followed by a residual connection to preserve the node’s original information (He et al., 2016). Therefore, the specific formula of the message passing mechanism from hyperedge to node in HyperAD is as follows:

xvl+1=σeEvweyelΘevl+xvl(11)

Where xv(l+1) is the feature vector of node v at layer l+1. The E(v) is the set of all hyperedges that contain node v. The w(e) is the pre-computed weight of hyperedge e, modulating the message importance. The term +xv(l) represents the residual connection. The σ denotes a non-linear activation function.

Finally, HyperAD inputs the gene feature matrix xv(l)RN×d to the fully connected layer and uses the softmax layer to get the prediction result of the model, the formula is as follows:

Z=SoftmaxxvlΘ1+b1(12)

where ZRN×2 represents the predicted risk score of each gene, and we can use it to prioritize the degree of association between AD and genes. We use cross-entropy as the loss function of the model and use the Adam algorithm to optimize all parameters during the training.

3 Results

3.1 Benchmarking of HyperAD

To demonstrate the superiority of HyperAD, we conduct a comprehensive benchmark against six state-of-the-art methods for disease gene prediction: FGN, RWRM, DiGI, GCN-GENE, GSI, and GenePlexus. FGN (Yao et al., 2018) first constructs a tissue-specific functional gene network by integrating various evidence sources, such as coexpression and protein-protein interactions, within a Bayesian framework; this network is then used to train an SVM classifier. c DiGI, a node kernel-based approach, generates powerful gene features by applying a graph kernel to multiple heterogeneous networks and then integrates these features to train an SVM for prediction (Tran et al., 2020). GCN-GENE (Zhang et al., 2022) represents a deep learning approach, utilizing a Graph Convolutional Network to learn node embeddings by convolving over a biological network structure while using gene expression data as initial node features. GSI (Li et al., 2022) first constructs a gene signal matrix where each row is a gene and each column represents a biological signal derived from gene-set membership, which is then fed into a machine learning model; for our experiments, we use its high-performing Random Forest-based implementation. Lastly, GenePlexus (Mancuso et al., 2022) employs a semi-supervised learning strategy on a large-scale human gene interaction network, propagating labels from known disease genes to predict novel associations. For a fair and robust evaluation, all methods are trained and tested on identical positive and negative gene sets under a rigorous ten-repetition, five-fold cross-validation scheme. We evaluate performance using the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC), reporting the mean and standard deviation across all 50 folds to ensure a reliable comparison.

As illustrated in Figure 2, HyperAD demonstrates superior performance, significantly outperforming all state-of-the-art methods for AD-associated gene prediction. Compared to approaches that rely on traditional pairwise biological networks such as FGN, DiGI, RWRM, GCN-GENE, and GenePlexus, HyperAD achieves substantial performance gains, with an increase of 7.9%–21.7% in AUROC and a remarkable 23.2%–44.7% in AUPRC. This result strongly suggests that higher-order associations captured within annotated gene sets provide a more accurate and informative representation of complex gene functions than pairwise interactions in conventional networks. Furthermore, HyperAD also significantly outperforms GSI, the most comparable baseline, improving AUROC by 2.3% and AUPRC by 9.0%. This comparison is particularly insightful because while both methods use the same raw gene-set membership information, their modeling strategies differ fundamentally. GSI treats the gene-set membership (the incidence matrix H) as a static feature matrix for a standard classifier, whereas HyperAD employs its two-stage message-passing architecture to model the hypergraph structure. This allows HyperAD to deeply explore higher-order neighborhood information and the local topology around each gene while simultaneously integrating disease-specific functional relevance through its weighted aggregation mechanism. The superior performance of HyperAD over GSI unequivocally demonstrates the power of this dynamic modeling approach. It underscores the critical importance of explicitly learning from the neighborhood context within the hypergraph. We also analyze the hyperparameter sensitivity of HyperAD (Supplementary Figure). Collectively, these results establish HyperAD as a highly effective framework and suggest that leveraging hypergraph neural networks to model higher-order biological relationships is a promising new direction for disease gene prediction.

Figure 2
Bar charts comparing the performance of various models: FGN, RWRM, DIGI, GCN-GENE, GenePlexus, GSI, and HyperAD. The left chart shows AUROC values, where HyperAD has the highest score near 0.95 and DIGI the lowest around 0.68. The right chart shows AUPRC values, with HyperAD again having the highest score near 0.78 and DIGI the lowest around 0.32. Error bars indicate variability across measures.

Figure 2. Benchmarking experimental results of HyperAD and six advanced AD gene prediction methods.

3.2 Ablation studies of HyperAD

To understand the relative contribution of different types of biological information to HyperAD’s predictive performance, we conduct a comprehensive ablation study. In this analysis, we train and evaluate our model separately on each of the nine individual MSigDB collections (H and C1-C8) and compare their performance against the final integrated model that utilizes all collections simultaneously. As shown in Figure 3, the results demonstrate that the integrated model achieves the highest predictive accuracy, significantly outperforming any model trained on a single data source. This finding suggests that our model’s superior performance stems from its ability to synthesize diverse and complementary biological knowledge, from curated pathways to cell-type-specific signatures.

Figure 3
Two box plots compare AUROC and AUPRC values across different categories: H, C1, C2, C3, C4, C5, C6, C7, C8, and ALL. The left plot shows AUROC scores ranging from 0.4 to 0.9, with noticeable variability among categories. The right plot presents AUPRC scores, also displaying significant variability. Both plots show outliers with markers beyond the whiskers.

Figure 3. Ablation analysis results on the MSigDB database. We analyze and show the performance of HyperAD on different types of gene sets in the MSigDB database.

Upon closer checking of the models trained on individual collections, we find that the prediction performance is relatively better for curated (C2) and ontology (C5) gene sets. This indicates that the higher-order functional relationships derived from canonical pathway databases (C2) and formal biological ontologies like GO and HPO (C5) form the primary informational backbone for HyperAD’s ability to identify AD-associated genes. While other collections perform more modestly in isolation, their inclusion is not redundant. Instead, they act as a crucial source of Supplementary Material. For example, gene sets like positional (C1) and regulatory target (C3) likely provide essential context on genomic proximity and co-regulation that refines the core functional predictions. Therefore, we conclude that HyperAD’s strength lies in its hierarchical use of information: it leverages the rich, explicit functional knowledge in C2 and C5 as its foundation, while integrating contextual cues from the other collections to enhance its final predictions.

Furthermore, we conduct a second ablation study to specifically evaluate the contribution of our core AD-specific weighting module. In this experiment, we remove this module from the architecture, thereby treating all biological information with equal importance without disease-specific prioritization. As detailed in Table 2, the full HyperAD model demonstrates a significant performance gain compared to the ablated version, with an increase of 3.3% in AUROC and 12.3% in AUPRC. This result strongly highlights the critical role of the AD-specific weighting module in focusing the model on the most disease-relevant biological contexts, which substantially enhances its predictive accuracy.

Table 2
www.frontiersin.org

Table 2. Ablation study result of AD-specific weighting module.

3.3 HyperAD-predicted genes significantly enriched in AD-associated biological process

A key criterion for a successful disease gene prioritization model is its ability to rank genes involved in the core biological processes of the disease highly (Krishnan et al., 2016). To evaluate our model against this standard, we conduct a functional enrichment analysis. We use a set of 41 highly relevant GO biological processes (Di Paolo and Kim, 2011), which represent key AD pathways derived from enrichment analysis of known AD genes and filtered by AD risk factors and phenotypes. We then employ a decile enrichment test to systematically assess whether our ranked list of genes (after excluding known positives) is enriched for these AD-associated functions at its top tier.

The results from this analysis provide compelling evidence for our model’s biological validity. As shown in Figure 4A and detailed in Supplementary Table, we observed an overwhelming enrichment of these 41 core AD biological processes specifically within the first decile of our ranked list (i.e., the top 10% of scored genes). The significance of this finding lies not only in its statistical power but also in the nature of the enriched functions. We note that processes directly corresponding to hallmark AD pathologies were among the most significant, with exceptionally low FDR values indicating high confidence. These can be broadly categorized into two groups: those reflecting neuronal dysfunction, such as “cognition” (GO:0050890, FDR =7.7×1049), “learning or memory” (GO:0007611, FDR =3.4×1047), and “regulation of synaptic plasticity” (GO:0048167, FDR =8.2×1037); and those reflecting core molecular pathology, such as “amyloid-beta clearance” (GO:0097242, FDR =3.9×1020). This ability to recapitulate functions related to both macro-level phenotypes and micro-level molecular mechanisms of AD among the top-ranked genes clearly demonstrates that our model does not just find AD-associated genes, but precisely identifies those central to the disease’s functional network. This result substantially increases our confidence in prioritizing these top-ranked candidates for future experimental validation and drug target discovery.

Figure 4
Panel A shows bar graphs depicting fractions of genes across deciles for cognition, learning or memory, neuroinflammatory response, regulation of synaptic plasticity, and amyloid-beta clearance, with significant FDR values indicated. Panel B displays three density plots for the number of Alzheimer's disease-associated miRNAs, number of protein-protein interactions, and standardized sequence similarity.

Figure 4. Evaluation of HyperAD prediction. (A) Evaluation of the overlap of AD-associated biological processes within the first decile of our predictions. Fractions of genes (y-axis) indicate the distribution of genes in the gene set in each decile of the predictions. (B) Statistical validation of our predictions using multiple biological networks.

3.4 HyperAD-predicted genes are significantly linked to known AD-associated genes

We further assess the AD-relevance of our top-ranked candidate genes using a network biology approach. We select the top 50 novel genes (after excluding known positive genes) and test whether they are significantly associated with the set of established AD genes in three distinct biological networks. These networks represent different biological dimensions: a protein sequence similarity network, the protein-protein interaction (PPI) network from the STRING database (Szklarczyk et al., 2015), and the miRNA-target interaction network from mirTarBase (Hsu et al., 2011). As shown in Figure 4B, the statistical analysis yields a highly significant result, indicating that the top 50 genes are strongly associated with known AD genes in all three networks (P-value <0.0001).

This highly consistent result carries significant biological implications. It suggests that our identified candidate genes have multiple functional links to known AD genes: they not only exhibit high sequence similarity, implying potential functional homology, but their protein products also tend to participate in common cellular interaction networks and pathways. Furthermore, they are likely subject to shared miRNA co-regulation at the post-transcriptional level. Therefore, this convergent evidence from diverse molecular layers strongly supports the conclusion that our predicted genes are functionally coupled with the pathobiology of AD, marking them as promising targets for future investigation.

3.5 Protein expression of HyperAD-predicted genes is associated with cognitive function

To bridge our computational predictions with real-world clinical pathology, we conduct a critical validation to assess whether the protein expression levels of our top-ranked candidate genes are directly associated with the progression of cognitive decline in AD. For this purpose, we leverage longitudinal proteomics data from the Religious Orders Study and Memory and Aging Project (ROSMAP) (Canchi et al., 2019), a deeply phenotyped and highly valuable cohort study (Synapse accession doi:10.7303/syn3219045). Based on the clinical diagnosis variable ‘dcfdx_lv’, we stratified the samples into three distinct cognitive groups: No Cognitive Impairment (NCI, n = 174), Mild Cognitive Impairment (MCI, n = 100), and AD (n = 104). Our central hypothesis is that if a gene is a key driver of AD, its encoded protein expression should exhibit a monotonic increase or decrease across the disease progression continuum from NCI to MCI to AD. To test the significance of such trends, we employ the Kendall’s Tau-b test, a non-parametric statistical method ideally suited for assessing monotonic relationships between ordinal variables.

Among our top 200 candidate genes (after excluding known positives), protein expression data are available for 67 in the ROSMAP dataset. The results are highly encouraging: of these 67 genes, we identify that a substantial proportion—27 genes (40.3%)—exhibit protein expression levels that are significantly and monotonically correlated with the severity of cognitive decline (Kendall’s Tau-b test, FDR <0.05). Of these, 12 show a positive correlation (expression increasing as cognitive function declines), while the remaining 15 show a negative correlation (Figure 5). This high percentage of significantly associated genes strongly suggests that the candidates identified by our model are not random but are coupled at the molecular level to the clinical progression of AD.

Figure 5
Line graph comparing protein expression levels of various genes across three conditions: NCI, MCI, and AD. The left panel shows genes such as SRC and NPC2, with diverse expression trends. The right panel features genes like GSK3B and MTOR, also displaying varying patterns. Each gene is colored differently, with a legend on the right indicating the corresponding color for each gene.

Figure 5. The evaluation results of ROSMAP data. Protein expression of 27 genes is monotonically increasing (n = 12) or decreasing (n = 15) in three stages.

Deeper case studies further illuminate the biological significance of our model’s predictions. For instance, we find that the protein expression of CDK5 and GSK3B, two kinases known to play pivotal roles in AD, is significantly and negatively correlated with cognitive function. This aligns perfectly with their established biological roles, as the aberrant activation of these kinases is a central event driving the hyperphosphorylation of the microtubule-associated protein Tau (MAPT), leading to neurofibrillary tangles (Jayapalan and Natarajan, 2013). The ability of our model to independently identify these key regulators and confirm that their expression trend matches disease progression significantly bolsters its credibility. Furthermore, a particularly compelling finding involves APOA4. We not only find its expression to be negatively correlated with cognitive function but also note its colocalization on chromosome 11 with APOA1 and APOC3, which also appeared in our top-200 list. The decreased expression of this APOA1-APOC3-APOA4 gene cluster is known to be associated with AD risk (Lin et al., 2015). This result powerfully suggests that our HyperAD model can effectively utilize higher-order association information among multiple genes to identify functionally synergistic modules, a feat that is often beyond the reach of traditional single-gene or pairwise association methods.

4 Discussion

In this study, we introduced HyperAD, a hypergraph neural network framework designed to overcome a key limitation of existing methods for AD gene prioritization: their reliance on simple pairwise gene interactions. By explicitly modeling higher-order relationships derived from functional gene sets, HyperAD more accurately reflects the complex, multi-gene nature of AD pathophysiology. Our results demonstrate that this approach not only significantly outperforms state-of-the-art models but also derives its strength from synergistically integrating diverse biological information, with curated pathways (C2) and ontology gene sets (C5) forming its predictive core.

The biological relevance of HyperAD’s predictions is strongly supported by multiple, independent lines of validation. Our top-ranked candidate genes are significantly enriched in hallmark AD biological processes, such as synaptic regulation and amyloid-beta clearance, and are densely connected to known AD genes in biological networks. Most compellingly, by leveraging proteomics data from the ROSMAP cohort, we show a direct link between the expression levels of our predicted genes and the clinical progression of cognitive decline. This convergence of computational and clinical evidence provides high confidence in our prioritized candidates as genuine players in AD.

While HyperAD marks a significant advance, we acknowledge its limitations. The model’s performance is dependent on the quality of existing gene set databases, and like all computational models, it predicts association, not causation. Therefore, the most critical next step is the experimental validation of our novel high-confidence candidates to confirm their functional roles. Furthermore, the HyperAD framework is inherently flexible and can be readily applied to other complex diseases, offering a powerful new tool for computational genomics.

In conclusion, HyperAD provides a more powerful and biologically intuitive approach to disease gene discovery. By shifting the paradigm from pairwise relationships to higher-order functional contexts, it delivers a robust and validated list of novel AD risk genes. These findings offer promising new avenues for research, with the potential to accelerate the development of future diagnostics and targeted therapies for Alzheimer’s disease.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Ethics statement

The manuscript presents research on animals that do not require ethical approval for their study.

Author contributions

MM: Methodology, Writing – original draft. CD: Conceptualization, Methodology, Writing – original draft, Writing – review and editing. YL: Writing – review and editing. QC: Writing – review and editing. FL: Writing – review and editing. YZ: Conceptualization, Supervision, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2025.1668200/full#supplementary-material

References

Agarwal, S., Branson, K., and Belongie, S. (2006). “Higher order learning with graphs,” in Proceedings of the 23rd international conference on machine learning, 17–24.

Google Scholar

Amberger, J. S., Bocchini, C. A., Scott, A. F., and Hamosh, A. (2019). Omim. Org: leveraging knowledge across phenotype–gene relationships. Nucleic Acids Res. 47, D1038-D1043–D1043. doi:10.1093/nar/gky1151

PubMed Abstract | CrossRef Full Text | Google Scholar

Ata, S. K., Wu, M., Fang, Y., Ou-Yang, L., Kwoh, C. K., and Li, X.-L. (2021). Recent advances in network-based methods for disease gene prediction. Briefings Bioinforma. 22, bbaa303. doi:10.1093/bib/bbaa303

PubMed Abstract | CrossRef Full Text | Google Scholar

Bellenguez, C., Küçükali, F., Jansen, I. E., Kleineidam, L., Moreno-Grau, S., Amin, N., et al. (2022). New insights into the genetic etiology of alzheimer’s disease and related dementias. Nat. Genet. 54, 412–436. doi:10.1038/s41588-022-01024-z

PubMed Abstract | CrossRef Full Text | Google Scholar

Bertram, L., McQueen, M. B., Mullin, K., Blacker, D., and Tanzi, R. E. (2007). Systematic meta-analyses of alzheimer disease genetic association studies: the alzgene database. Nat. Genet. 39, 17–23. doi:10.1038/ng1934

PubMed Abstract | CrossRef Full Text | Google Scholar

Canchi, S., Raao, B., Masliah, D., Rosenthal, S. B., Sasik, R., Fisch, K. M., et al. (2019). Integrating gene and protein expression reveals perturbed functional networks in alzheimer’s disease. Cell Rep. 28, 1103–1116. doi:10.1016/j.celrep.2019.06.073

PubMed Abstract | CrossRef Full Text | Google Scholar

Consortium, G. O., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., et al. (2004). The gene ontology (go) database and informatics resource. Nucleic Acids Res. 32, D258–D261. doi:10.1093/nar/gkh036

PubMed Abstract | CrossRef Full Text | Google Scholar

Cowen, L., Ideker, T., Raphael, B. J., and Sharan, R. (2017). Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562. doi:10.1038/nrg.2017.38

PubMed Abstract | CrossRef Full Text | Google Scholar

Dalmasso, M. C., De Rojas, I., Olivar, N., Muchnik, C., Angel, B., Gloger, S., et al. (2024). The first genome-wide association study in the Argentinian and Chilean populations identifies shared genetics with Europeans in alzheimer’s disease. Alzheimer’s and Dementia 20, 1298–1308. doi:10.1002/alz.13522

PubMed Abstract | CrossRef Full Text | Google Scholar

Deng, C., Li, H.-D., Zhang, L.-S., Liu, Y., Li, Y., and Wang, J. (2024). Identifying new cancer genes based on the integration of annotated gene sets via hypergraph neural networks. Bioinformatics 40, i511–i520. doi:10.1093/bioinformatics/btae257

PubMed Abstract | CrossRef Full Text | Google Scholar

Di Paolo, G., and Kim, T.-W. (2011). Linking lipids to alzheimer’s disease: cholesterol and beyond. Nat. Rev. Neurosci. 12, 284–296. doi:10.1038/nrn3012

PubMed Abstract | CrossRef Full Text | Google Scholar

Feng, Y., You, H., Zhang, Z., Ji, R., and Gao, Y. (2019). Hypergraph neural networks. Proc. AAAI Conf. Artif. Intell. 33, 3558–3565. doi:10.1609/aaai.v33i01.33013558

CrossRef Full Text | Google Scholar

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Google Scholar

Hsu, S.-D., Lin, F.-M., Wu, W.-Y., Liang, C., Huang, W.-C., Chan, W.-L., et al. (2011). Mirtarbase: a database curates experimentally validated microrna–target interactions. Nucleic Acids Res. 39, D163–D169. doi:10.1093/nar/gkq1107

PubMed Abstract | CrossRef Full Text | Google Scholar

Jansen, I. E., Savage, J. E., Watanabe, K., Bryois, J., Williams, D. M., Steinberg, S., et al. (2019). Genome-wide meta-analysis identifies new loci and functional pathways influencing alzheimer’s disease risk. Nat. Genet. 51, 404–413. doi:10.1038/s41588-018-0311-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Jayapalan, S., and Natarajan, J. (2013). The role of cdk5 and gsk3b kinases in hyperphosphorylation of microtubule associated protein tau (mapt) in alzheimer’s disease. Bioinformation 9, 1023–1030. doi:10.6026/97320630091023

PubMed Abstract | CrossRef Full Text | Google Scholar

Kanehisa, M., and Goto, S. (2000). Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. doi:10.1093/nar/28.1.27

PubMed Abstract | CrossRef Full Text | Google Scholar

Kipf, T. N., and Welling, M. (2017). “Semi-supervised classification with graph convolutional networks,” in International conference on learning representations.

Google Scholar

Köhler, S., Gargano, M., Matentzoglu, N., Carmody, L. C., Lewis-Smith, D., Vasilevsky, N. A., et al. (2021). The human phenotype ontology in 2021. Nucleic Acids Res. 49, D1207–D1217. doi:10.1093/nar/gkaa1043

PubMed Abstract | CrossRef Full Text | Google Scholar

Krishnan, A., Zhang, R., Yao, V., Theesfeld, C. L., Wong, A. K., Tadych, A., et al. (2016). Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat. Neurosci. 19, 1454–1462. doi:10.1038/nn.4353

PubMed Abstract | CrossRef Full Text | Google Scholar

Lee, I., Blom, U. M., Wang, P. I., Shim, J. E., and Marcotte, E. M. (2011). Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–1121. doi:10.1101/gr.118992.110

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, H.-D., Deng, C., Zhang, X.-Q., and Lin, C.-X. (2022). A gene set-integrated approach for predicting disease-associated genes. IEEE/ACM Trans. Comput. Biol. Bioinforma. 20, 3440–3450. doi:10.1109/TCBB.2022.3214517

PubMed Abstract | CrossRef Full Text | Google Scholar

Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., and Mesirov, J. P. (2011). Molecular signatures database (Msigdb) 3.0. Bioinformatics 27, 1739–1740. doi:10.1093/bioinformatics/btr260

PubMed Abstract | CrossRef Full Text | Google Scholar

Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., and Tamayo, P. (2015). The molecular signatures database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425. doi:10.1016/j.cels.2015.12.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, Q., Cao, Y., and Gao, J. (2015). Decreased expression of the apoa1–apoc3–apoa4 gene cluster is associated with risk of alzheimer’s disease. Drug Des. Dev. Ther. 9, 5421–5431. doi:10.2147/DDDT.S89279

PubMed Abstract | CrossRef Full Text | Google Scholar

Mancuso, C. A., Bills, P. S., Krum, D., Newsted, J., Liu, R., and Krishnan, A. (2022). Geneplexus: a web-server for gene discovery using network-based machine learning. Nucleic Acids Res. 50, W358–W366. doi:10.1093/nar/gkac335

PubMed Abstract | CrossRef Full Text | Google Scholar

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P., et al. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356–369. doi:10.1038/nrg2344

PubMed Abstract | CrossRef Full Text | Google Scholar

Piñero, J., Ramírez-Anguita, J. M., Saüch-Pitarch, J., Ronzano, F., Centeno, E., Sanz, F., et al. (2020). The disgenet knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 48, D845–D855. doi:10.1093/nar/gkz1021

PubMed Abstract | CrossRef Full Text | Google Scholar

Sims, R., Hill, M., and Williams, J. (2020). The multiplex model of the genetics of alzheimer’s disease. Nat. Neurosci. 23, 311–322. doi:10.1038/s41593-020-0599-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Sollis, E., Mosaku, A., Abid, A., Buniello, A., Cerezo, M., Gil, L., et al. (2023). The nhgri-ebi gwas catalog: knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985. doi:10.1093/nar/gkac1010

PubMed Abstract | CrossRef Full Text | Google Scholar

Spina, S., La Joie, R., Petersen, C., Nolan, A. L., Cuevas, D., Cosme, C., et al. (2021). Comorbid neuropathological diagnoses in early versus late-onset alzheimer’s disease. Brain 144, 2186–2198. doi:10.1093/brain/awab099

PubMed Abstract | CrossRef Full Text | Google Scholar

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles,”, 102. Washington, DC: Proceedings of the National Academy of Sciences, 15545–15550. doi:10.1073/pnas.0506580102

CrossRef Full Text | Google Scholar

Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., et al. (2015). String v10: Protein–Protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–D452. doi:10.1093/nar/gku1003

PubMed Abstract | CrossRef Full Text | Google Scholar

Tran, V. D., Sperduti, A., Backofen, R., and Costa, F. (2020). Heterogeneous networks integration for disease–gene prioritization with node kernels. Bioinformatics 36, 2649–2656. doi:10.1093/bioinformatics/btaa008

PubMed Abstract | CrossRef Full Text | Google Scholar

Valdeolivas, A., Tichit, L., Navarro, C., Perrin, S., Odelin, G., Levy, N., et al. (2019). Random walk with restart on multiplex and heterogeneous biological networks. Bioinformatics 35, 497–505. doi:10.1093/bioinformatics/bty637

PubMed Abstract | CrossRef Full Text | Google Scholar

Yao, V., Kaletsky, R., Keyes, W., Mor, D. E., Wong, A. K., Sohrabi, S., et al. (2018). An integrative tissue-network approach to identify and test human disease genes. Nat. Biotechnol. 36, 1091–1099. doi:10.1038/nbt.4246

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, T., Lin, Y., He, W., Yuan, F., Zeng, Y., and Zhang, S. (2022). Gcn-gene: a novel method for prediction of coronary heart disease-related genes. Comput. Biol. Med. 150, 105918. doi:10.1016/j.compbiomed.2022.105918

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhou, D., Huang, J., and Schölkopf, B. (2006). Learning with hypergraphs: clustering, classification, and embedding. Adv. Neural Inf. Process. Syst. 19, 1601–1608. doi:10.7551/mitpress/7503.003.0205

CrossRef Full Text | Google Scholar

Keywords: Alzheimer’s disease, higher-order associations, hypergraph neural network, hypergraph, disease gene

Citation: Ma M, Deng C, Liu Y, Cao Q, Liu F and Zhang Y (2025) A hypergraph neural network for prioritizing Alzheimer’s disease risk genes. Front. Genet. 16:1668200. doi: 10.3389/fgene.2025.1668200

Received: 18 July 2025; Accepted: 08 September 2025;
Published: 19 September 2025.

Edited by:

Chunhou Zheng, Anhui University, China

Reviewed by:

Libing Bai, Qinghai Normal University, China
Massimo Orazio Spata, University of Catania, Italy

Copyright © 2025 Ma, Deng, Liu, Cao, Liu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yan Zhang, emhhbmd5YW5sam1AY3N1LmVkdS5jbg==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.