ORIGINAL RESEARCH article

Front. Genet., 20 November 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.571609

Module Analysis Using Single-Patient Differential Expression Signatures Improves the Power of Association Studies for Alzheimer's Disease

  • Institute of Interdisciplinary Integrative Medicine Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China

Abstract

The causal mechanism of Alzheimer's disease is extremely complex. Achieving great statistical power in association studies usually requires a large number of samples. In this work, we illustrated a different strategy to identify AD risk genes by clustering AD patients into modules based on their single-patient differential expression signatures. The evaluation suggested that our method could enrich AD patients with similar clinical manifestations. Applying this to a cohort of only 310 AD patients, we identified 174 AD risk loci at a strict threshold of empirical p < 0.05, while only two loci were identified using all the AD patients. As an evaluation, we collected 23 AD risk genes reported in a recent large-scale meta-analysis and found that 18 of them were rediscovered by association studies using clustered AD patients, while only three of them were rediscovered using all AD patients. Functional annotation suggested that AD-associated genetic variants mainly disturbed neuronal/synaptic function. Our results suggested module analysis helped to enrich AD patients affected by the common risk variants.

1. Introduction

Alzheimer's disease is a prevalent neurological disease among the aging population. Even with decades of intensive studies, its causal mechanisms remain elusive. Studies of the familial early-onset cases revealed three mutated genes, including APP, PSEN1, and PSEN2 (Lanoiselée et al., 2017). They provided valuable insights into the contribution of the amyloidogenic pathway for AD genesis. Genome-wide association studies (GWAS) of late-onset AD patients discovered more risk genes. Among them, APOE ε4, an apolipoprotein, is a major genetic risk of late-onset AD. It accounts for 3- (heterozygous) to 15-fold (homozygous) increase in AD risk (De Jager et al., 2012). The AlzGene database (http://www.alzgene.org) stores these AD risk genes. Even though many AD risk genes have been discovered, known genes only explain a small proportion of heritability. There is still a great challenge on how to illustrate the AD causal mechanism in an integrated way, limiting their application in drug discovery.

Power is a critical consideration in association studies (Ball, 2013). AD often requires a large sample size to achieve a good power (Belloy et al., 2019; Kunkle et al., 2019). For example, a recent meta-analysis included 71,880 cases and 383,378 controls and identified 25 risk loci, implicating 215 potential causative genes (Jansen et al., 2019). However, such studies are limited by sample collection and cost, which hinders the discovery of more risk variants. To overcome such a problem, a viable strategy is to stratify patients based on some disease-relevant features (Dahl et al., 2019). For AD, the carrier's status of APOE-ε4 had been used to cluster the AD patients in association studies and successfully reported novel features (Marioni et al., 2017). Other factors, e.g., sex (Deming et al., 2018) and age (Lo et al., 2019), have also been used for AD patient stratification and the improved performance supports the values of patient stratification in association studies.

With the increase in multi-omics data, system biology methods show a good potential to unveil the complex mechanism of AD genesis (Wang et al., 2016; Mostafavi et al., 2018; Meng and Mei, 2019). One example is the Accelerating Medicines Partnership–Alzheimer's Disease (AMP-AD) projects, which includes transcriptomics, epigenomics, genetics, and proteomics data from over 2,000 human brain samples. These data allow for AD patient stratification so that the patients affected by the common causal mechanism can be clustered together. Currently, there are some single-patient analysis tools (Vitali et al., 2017). These algorithms integrate gene expression with network or pathway annotation to identify the disease-related changes at a single-patient level (Gardeux et al., 2014; Liu et al., 2014; Schissler et al., 2017). These tools can report disease-relevant genes or pathways at a single-patient level. However, these tools still have some limitations. For example, it is hard to evaluate the generality of reported genes and pathways among different patients, making it less applicable for precise drug discovery.

In this work, we have proposed a new strategy to stratify AD patients by clustering the AD patients with similar differential expression patterns at a single-patient level. Our evaluation suggested that this method could enrich AD patients with common clinical manifestations. We applied it to 310 AD patients for both patient module analysis and genetic association studies. We identified 174 AD risk loci in 143 modules at a strict cutoff of empirical p < 0.05, while there were only two risk loci identified using all the AD patients. Function annotation suggested that identified risk genes were mainly related to neuronal/synaptic functions. We also evaluated 23 known AD risk genes and re-discovered 18 of them in at least one module. Allele frequency studies indicated that module analysis using single-patient DEGs enriched AD patients affected by common risk variants.

2. Materials and Methods

2.1. The Samples and Subjects

The AD and control sample data were collected from the “ROS/MAP” study (De Jager et al., 2012) and “HBTRC” study (Zhang et al., 2013). “ROS/MAP” data included the genotype, expression, and clinical data of 1,788 subjects. The AD-related clinical annotation was provided by the data suppliers. The important one included age, the cognitive score (cts), years of education, ApoE genotype, braak stage (braaksc), and assessment of neuritic plaques (ceradsc). We use the clinical annotation for “cogdx,” a physician's overall cognitive diagnostic category, to select the AD patient (cogdx = 4 or 5) and control subjects (cogdx = 1). After filtering the ones with missing or unclear information for either clinical records or RNA-seq, we found 219 AD patients and 187 control subjects that would be used for module analysis and clinical enrichment studies. “HBTRC” study had both RNA-seq and genotype data for 573 samples, including 311 AD samples. We filtered the one with missed clinical information, RNA-seq, or genotype data. Finally, 310 AD patients and 153 control subjects were used. Detailed information for selected data are publicly available at https://www.synapse.org/#!Synapse:syn5550382.

2.2. Clustering AD Patients Using Single-Patient DEGs

We developed a computational algorithm to cluster AD patients (see Supplementary Figure 1). The main idea behind this tool is that AD patients are highly diverse and can be affected by divergent mechanisms; it is possible to cluster AD patients if they shared a subset of differentially expressed genes (DEGs). This algorithm is implemented in the R package DEComplexDisease. It mainly includes four steps:

  • Utilize RNA-seq data of normal subjects to construct reference expression profiles. In this step, the parameters of a negative binomial distribution or Gaussian distribution are estimated to describe the distribution profile of non-disease samples;

  • The gene expression of AD patients is transformed into binary differential expression status. In this step, the expression values of genes are fitted against reference expression profiles. Binary differential expression status is assigned as 1, –1, or 0 to indicate upregulation, downregulation, or no difference, respectively;

  • Apply a bi-clustering analysis to identify DEGs that are repeatedly observed in multiple AD patients, e.g., n = 5;

  • Using the spDEG of each AD patient as the signature, we compute the co-expression correlation and identify the patients with the most similar expression profiles to construct modules.

The R codes are publicly available at https://github.com/menggf/DEComplexDisease.

2.3. Clinical Manifestation Association Analysis

“ROS/MAP” data mainly includes three AD-related clinical features, including cognitive score (cts), CERAD score, and braaksc. “HBTRC” has clinical information for braak and atrophy. Such clinical features can be used to evaluate the disease relevance of modules. Therefore, we applied our tool to generate modules of different sizes, e.g., 40, 60, and 120. For each module, AD patients can be grouped as module patients and non-module patients. We did the Kolmogorov-Smirnov (KS) test to evaluate the clinical manifestation differences between two groups of AD patients.

2.4. Processing Genotype Data

We applied stringent quality control (QC) filters to the genotype data. First, we removed the individuals with missing genotype rates >0.05 and SNPs with missing call rate > 0.05. In the next step, the SNPs with minor allele frequency MAF < 0.1 or Hardy-Weinberg equilibrium p-value < 1.0 × 10−5 were excluded. The individuals with autosomal heterozygosity above empirically determined thresholds were filtered. We calculated the identity-by-descent (IBD) of all possible gene pairs and removed the ones with potential genetic relatedness. These QC filters were performed for multiple rounds to make sure that no individual or SNP could be filtered anyway. We then performed pre-phasing with SHAPEIT2 (Delaneau et al., 2011), using the 1000 Genome Project data as a reference. Afterwards, we conducted whole-genome imputation using IMPUTE2 (Howie et al., 2012) in 5-Mb segments with filtering of the SNP with MAF less than 0.1 in the EUR population. The imputed data were evaluated for quality control using the thresholds mentioned before. We performed principal component analysis (PCA) on autosomal genotype data and adjustment for stratification.

2.5. Association Study

Association studies were performed for both AD patients and module patients. To simplify it, we only include the definite AD patients and control individuals in the association analysis so that binary disease status could be assigned for each patient. We performed population stratification by use of the principal components of chromosomal genetic variations. Association analysis was performed using a fast score test implemented in the GenABEL package. In this step, the first 10 principal components were used as covariates to remove the effects of population structure to make sure there was no clear evidence of inflation in the association results. To control the false positive discovery, we also estimated the empirical p-values using performing permutation analysis by generating the distribution under the null hypothesis for 1,000 times. In each round of call, the minimal p-value was compared with the original p-values. For an SNP, its empirical p-value is defined as the proportion of times where the minimal p-values of 1,000 resamples was less than the original p-value. We set empirical p-values < 0.05 as the cutoff to selecting the module-associated SNPs. The codes for association studies are available at https://github.com/menggf/spDEG_and_Association.

2.6. Enrichment Analysis

We performed enrichment analysis to find the clinical association of patient modules. Among a total of n patients, k patients were predicted as a module. The number of patients with a clinical manifestation is p, which also includes x module patients. We used Fisher's exact test to evaluate if the observed x patients resulted from random occurrences. We used the following R codes to calculate the p-value:

3. Results

3.1. A New Pipeline to Cluster AD Patients Utilizing Single-Patient DEGs

Considering the diversity of AD patients, we propose a new analysis strategy to cluster the AD patients affected by the common mechanisms. This method is based on differential expression analysis at single-patient levels. Figure 1A and Supplementary Figure 1 describe the schema of the whole analysis pipeline. In our analysis, the reference expression profile was firstly built using the RNA-seq counts or normalized data of the normal individuals, which defined the ranges of gene expression values at a non-disease status. Next, gene expression values of patients were transformed into binary status by fitting to the reference expression profiles. In detail, if the gene expression values of patients exceeded the range of reference expression profiles, 1 or –1 is assigned to indicate up- or downregulation. To improve confidence, a bi-clustering analysis algorithm is applied to perform filtering and cross-validation so that the whole set of single-patient differentially expressed genes (spDEGs) can be repeatedly observed in multiple patients, e.g., n = 5. Finally, using each patient as a seed, we cluster the patients into modules if they carry the same set of spDEGs.

Figure 1

As an evaluation, we applied this pipeline to the dataset collected from the ROS/MAP study (De Jager et al., 2012), which includes 251 AD samples with both RNA-seq data and clinical annotation. We identified cross-validated spDEGs for 171 patients. Among 15,582 brain expressed genes, 3,878 spDEGs were predicted to be differentially expressed in at least one AD patient at a cutoff of p < 0.05. Compared to traditional differential expression analysis using all the AD samples, they covered 93.8% of AD DEGs. We then investigated their differential expression status among all the AD patients. Figure 1B showed the results of the top 20 most observed spDEGs. We did not observe any shared differential expressed genes across all the AD patients. On the contrary, all spDEGs were only differentially expressed in a small proportion of 251 AD patients. Additionally, we also observed inconsistent differential expression directions. Taking the QDPR gene as an example, it was upregulated in 22% of AD patients while also downregulated in 3% AD patients. Similar results were observed with other spDEGs (see Figure 1B). We also performed module analysis using the most observed differential expressed genes and observed distinct differential expression patterns (see Supplementary Figures 2, 3). All these results suggested that AD patients were greatly diverse and that it would be a risk to treat AD patients as a homogeneous whole in any analysis.

Next, we investigated if AD patient clustering could enrich AD patients with common clinical manifestations. We generated patient modules based on spDEG expression profile similarity. The modules were arbitrarily set to have different sizes of AD patients, e.g., 40, 60, and 100, which could be denoted as pdeg40, pdeg60, and pdeg100, respectively. The patients within the same module were supposed to be affected by the common mechanisms. As a control, we also generated modules using randomly selecting genes and DEGs identified by traditional differential expression analysis. Figure 1C showed the evaluation results using cognitive scores (cts). At a cutoff of p < 0.01, 37 “pdeg60” modules were enriched with detrimental cts scores, while only five modules identified by common DEGs or random genes were enriched. The most significant p-value was up to p = 2.51 × 10−5 in the “pdeg60” module. On the contrary, no module in “common DEG” and "random gene" exceeded the significance of p = 0.001. This result suggested that modules analysis using spDEG better-enriched AD patients with common clinical manifestations.

3.2. More Risk Variants Were Identified in AD Patient Modules

We collected genotyping data from the “hbtrc” study (Zhang et al., 2013), including 310 LOAD patients and 153 non-demented healthy controls. We performed genome-wide association studies (GWAS) using all the AD patients. In this process, we performed a permutation procedure for 1,000 times to estimate empirical p values. We found only two loci to have a significant association with AD at a cutoff of empirical p < 0.05. The significant SNPs included rs2405283 (p = 1.15 × 10−7) and rs769450 (p = 1.65 × 10−6) (see Figure 2A). rs769450 was mapped to the second intron of the APOE gene, which is consistent with published reports about the critical roles of APOE.

Figure 2

Applying module analysis, we predicted 143 modules of AD patients. Three association tests were performed for each module: (1) module patients against normal control; (2) module patients against non-module patients; and (3) non-module patients against normal control. The p-values were denoted as p1, p2, and p3, respectively. At a strict cutoff of empirical p1 < 0.05, we found 174 loci to have a significant association in at least one of 143 modules (see Figure 2A and Supplementary Table 1). Compared to the results of the association study using all the AD patients, more AD risk loci were observed within module patients. The APOE SNP rs769450 was observed in 41 modules and its association significance was also greatly improved. For example, the significance of rs769450 was up to p1 = 2.08 × 10−8 in a module of 80 AD patients while the significance for all 310 patients was p1 = 1.65 × 10−6. Tests between module patients and non-module patients supported allele frequency differences in 165 out of 174 loci at a cutoff of p2 < 0.01. Figure 2B showed the allele frequency for some exemplary SNPs. We observed that allele frequencies of identified risk SNPs were different from the non-module patients and normal individuals. In most cases, non-module patients usually had similar allele frequencies with normal subjects. We checked if module patients were more associated with risk SNPs than non-module patients by comparing p1 and p3 value distribution (see Supplementary Figure 4). We found module patients tended to report more significant associations than non-module patients. It suggested that module analysis enriched the AD patient affected by the common risk SNPs.

We mapped 174 AD risk loci to 107 genes based on genomic proxy and GTEx eQTL annotation (see Supplementary Table 1). Among them, 86 genes were observed in more than one module at a cutoff of empirical p1 < 0.05. APOE is the most observed risk gene, which is significantly associated with AD patients in 41 modules. We searched the published GWAS results and found that 46 genes had been reported for AD or brain-related function (see Table 1). Some of them had been reported in association studies of AD, such as PDE1A, JAM3, DLGAP1, CYYR1, SERPINB11, and MCPH1. To understand their function involvement, we performed Gene Ontology enrichment analysis to 107 AD risk genes (see Figure 2C). We found that the most enriched terms were also related to synaptic and neuronal function, e.g., “synapse organization” (p = 7.65 × 10−6). It suggested that the identified AD risk genes were related to normal brain function and had potential roles in AD genesis.

Table 1

idchrposp1p2p3p(emp)ORTypeGeneRegionPubMed ID
rs38675931774640461.59E-081.20E-043.73E-030.0012.63pdeg60ZBTB4Intron29045054
rs76945019449071872.08E-081.94E-021.44E-040.0012.91pdeg80APOEIntron24821312
rs14662425221824120802.40E-083.14E-072.60E-010.0022.01pdeg60PDE1AIntron29363967
rs99128641791052332.31E-079.05E-042.53E-020.0052.51pdeg120NTN1Intron27060954
rs8016720831132249662.46E-071.69E-051.11E-010.0091.99pdeg60BOCIntron22445332
rs3423352661509476952.91E-076.86E-065.81E-020.0132.31pdeg60MTHFD1LIntron22330827
rs7212987011076453223.07E-071.25E-053.32E-010.0082.03pdeg100VAV3Intron28927664
rs478857916719179423.77E-073.00E-041.45E-020.0113.24pdeg60IST1Intron31223056
rs1133374846877109804.05E-075.46E-042.03E-020.0092.87pdeg60AKIRIN2Intergenic27871306
rs11253483108720714.83E-077.06E-045.34E-030.0292.26pdeg40LARP4BIntron20435134
rs17077094864800055.50E-073.75E-031.73E-020.0153.01pdeg60MCPH1Intron21297427
rs1133907211850613325.87E-074.92E-042.66E-020.0192.86pdeg60DLG2intron29290481
rs3395474521692591626.69E-072.54E-048.27E-020.0390.52pdeg60LRP2Exon20971101
rs114124263654930796.77E-078.09E-041.07E-020.0120.44pdeg80MAGI1Intron22166940
rs22296021265518987.11E-072.34E-044.81E-020.0062.19pdeg80CYYR1Intron30820047
rs80888351837280557.17E-076.90E-051.76E-010.0063.23pdeg120DLGAP1Intron30448613
rs118592921664918198.68E-077.55E-033.35E-030.022.09pdeg80RBFOX1NMD30596066
rs1013855514300207598.74E-072.14E-042.10E-020.0352.15pdeg60PRKD1Nocoding21696630
rs250121513700698959.29E-071.92E-031.76E-020.0112.47pdeg100KLHL1Intron15715669
rs178374911850496839.82E-074.92E-043.71E-020.034.01pdeg60DLG2Intron29290481
rs34865812620635791.04E-062.38E-032.33E-020.0283.06pdeg80TAFA2Intron30137205
rs695864471397964161.06E-061.91E-031.92E-020.0352.03pdeg80TBXAS1Nocoding24608097
rs58922068695834071.11E-062.10E-021.89E-030.042.67pdeg60SULF1Intron30035253
rs1186258716836281621.27E-062.69E-042.01E-020.0212.11pdeg60CDH13Intron29771432
26460479
rs2876418617793064431.30E-066.08E-038.46E-030.0210.44pdeg100RBFOX3Intron30475774
rs1228124311401335621.46E-066.98E-059.56E-020.0392.58pdeg60LRRC4CIntron29751835
rs1270574171108736881.48E-062.33E-041.18E-010.0462.14pdeg80IMMP2LIntron22486522
rs237396171509841221.50E-061.45E-078.23E-010.0440.42pdeg80KCNH2Intergenic19412172
rs11523170311823487041.51E-065.15E-053.88E-010.0460.47pdeg120GLULIntergenic29441491
rs54808474317479190051.64E-064.09E-041.29E-010.0282.26pdeg60SP2
SP2-AS1
Intron23293287
rs77144903131021446571.82E-061.41E-033.92E-020.0390.2pdeg100FGF14Intron28522250
28469558
rs146092846151002179741.87E-061.22E-034.53E-020.030.43pdeg120ADAMTS17Intron22710270
rs714782814719946651.88E-067.58E-048.16E-020.0392.18pdeg80RGS6Intron27002730
rs755387198367942701.90E-063.36E-038.00E-030.0462.47pdeg100KCNU1Intron26858991
rs297754881332248491.92E-067.50E-041.05E-010.0442.53pdeg60CCN4NMD22475393
rs7881892214546388702.03E-065.89E-041.87E-010.0382.06pdeg100SAMD4AIntron29432188
rs6222337221313779662.04E-061.09E-038.62E-030.0090.42pdeg80TIAM1Intron23109420
rs1288184414516399302.06E-061.79E-021.77E-030.0230.39pdeg120FRMD6Nocoding22190428
rs609214131021749322.20E-063.00E-037.04E-020.0370.23pdeg120FGF14Intron28522250
28469558
rs490356614772740802.30E-069.77E-059.71E-020.0450.46pdeg60POMT2Intergenic22984654
rs6011957718571553562.56E-061.22E-031.59E-010.0290.41pdeg100BOD1L2Intergenic27166630
rs14662307415321078012.76E-062.36E-036.07E-030.0350.43pdeg80CHRNA7Intron24951635
rs14188784018794822782.79E-061.18E-033.07E-020.0362pdeg60NFATC1Intron20401186
rs1290271015553189283.03E-067.21E-046.12E-020.0480.47pdeg100PIGBOS1
RAB27A
5'UTR26985808
rs1044485515333936293.25E-061.97E-075.52E-010.0471.89pdeg60RYR3Intron29590321
rs610337920435477673.94E-062.28E-041.99E-010.0410.49pdeg100L3MBTL1NMD29898393
31061493

The results of association studies using module patients, non-module patients, and control.

In a recent large-scale meta-analysis, 23 AD risk loci were reported (Kunkle et al., 2019). We checked their association using either all patients or module patients. We loosed the cutoffs of significant association by replacing empirical p < 0.05 with p1 < 10−4. Association study using all AD patients failed to identify any extra known AD risk gene to satisfy a threshold of p1 < 10−4. Unlike the results using all AD patients, we observed that 18 out of 23 AD genes have a significant association with AD in at least one module. Table 2 summarized the analysis results using module patients. By checking p2 and p3 values, we found significant allele frequency differences between module patients and no-module patients, supporting the conclusion that module analysis enriched AD patients affected by commonly known risk variants.

Table 2

Association of module patientsAssociation of all AD patients
SNPgenep1p2p3p(emp)ORModule typeRegionSNPp1p(emp)Region
rs769450APOE2.08E-081.94E-021.44E-040.0013.68pdeg80Intronrs7694501.65E-060.015Intron
rs71454394MS4A29.25E-063.73E-034.32E-020.2572.48pdeg40Intergenic
rs9462659TREM21.08E-058.99E-034.85E-020.352.02pdeg40Intergenic
rs7152488SLC24A41.21E-051.85E-041.71E-010.1750.3pdeg100Intron
rs5021727HLA-DRB11.59E-051.80E-043.88E-010.3890.45pdeg120Intergenic
rs144409358CR12.09E-051.44E-031.79E-010.5520.3pdeg120Intron
rs12416009ECHDC32.10E-052.66E-042.19E-010.5141.86pdeg40Intergenic
rs9897336ACE2.41E-052.03E-044.91E-010.3060.48pdeg100Intergenic
rs55662472EPHA12.61E-055.33E-037.65E-020.5193.15pdeg80Intergenic
rs34708229MEF2C2.81E-054.09E-032.17E-020.6752.45pdeg40Intronrs798201741.40E-041Intron
rs6099038CASS42.86E-051.55E-043.16E-010.3052.30pdeg100Intergenic
rs13422890BIN13.35E-054.42E-068.04E-010.7531.96pdeg60Intron
rs36057699PTK2B3.39E-058.08E-034.63E-020.5760.41pdeg120Intronrs360576998.70E-041intron
rs659023PICALM6.53E-058.73E-064.35E-010.7970.54pdeg120Intergenic
rs77792633FERMT28.95E-055.18E-045.44E-010.80.62pdeg60Intergenic
rs57816367CD2AP9.17E-059.36E-054.60E-010.9572.13pdeg40Intron
rs10539341INPP5D9.42E-057.99E-039.36E-020.9830.42pdeg100Intron
rs2285898ABCA79.09E-051.00E-021.48E-010.6320.53pdeg120Intergenic

The association results for known AD risk genes.

3.3. Biological Relevance of AD Risk Genes

Module-based clustering analysis allows us to bridge AD risk genes to clinical features and affected biological processes. The clinical association of modules is determined by enrichment analysis. In HBRTC's dataset, we identified nine and eight modules to be associated with braak and brain generalized atrophy at a cutoff of p < 0.01, respectively. Among them, three modules were associated with both braak and brain atrophy. Association study to these modules identified eight and 20 loci respectively. In Table 3, we summarized the analysis results. These results supported that some AD risk genes might be more associated with some AD clinical outcomes. For example, the NTN1 gene is a microtubule-associated force-producing protein and it is predicted to be related to the braak stage.

Table 3

SNPchrposp1p1(emp)Seed patienttpGeneRegionBraakAtrophy
SNPs ASSOCIATED WITH ATROPHY
rs14721662711574676097.04E-070.022X15888pdeg1007.4E-02
rs7881892214546388702.03E-060.038X15888pdeg100SAMD4AIntron7.4E-02
rs123170211295258144.97E-070.012X15914pdeg60AC110058.1,AC090124.1Intergenic0.95
rs386726318636643769.40E-070.031X15914pdeg40SERPINB11Intron0.24
rs2361112059528897.13E-070.011X15914pdeg60MCM8Intron0.95
rs711316111169690389.68E-070.024X15941pdeg120PLEKHA7Intron1.34E-02
rs1048929311722176471.12E-070.005X16020pdeg40DNM3Intron6.52E-02
rs12819631121040133932.84E-070.01X16020pdeg40GLT8D2Intron6.52E-02
rs99128641791052332.89E-060.037X16020pdeg100NTN1Intron0.97
rs687556151215375321.47E-060.049X16037pdeg801.27E-02
rs79306381155677221.85E-060.043X16179pdeg120AC104389.4NMD3.56E-02
rs54808474317479190059.62E-070.021X16179pdeg40SP2,SP2-AS1Intron9.51E-02
rs76462414719938572.32E-060.049X16183pdeg60RGS6Intron0.11
rs7864185010534213832.17E-070.001X21821pdeg1002.02E-02
rs1711251814219487032.30E-060.027X21901pdeg1200.10
rs1288184414516399302.06E-060.023X21901pdeg120FRMD6Intergenic0.10
rs124803782031107112.29E-060.025X21901pdeg120UBOX5-AS1,UBOX5Intergenic0.10
SNPs ASSOCIATED WITH BRAAK
rs610337920435477673.94E-060.041X15917pdeg100Z98752.3,L3MBTL1NMD1.01E-01
rs1185089414223122432.04E-060.033X15989pdeg80TRAV40Intergenic7.33E-02
rs736997627573416241.01E-060.028X15989pdeg1206.59E-02
rs22296021265518984.39E-060.033X16038pdeg60CYYR1,CYYR1-AS1Intron5.03E-01
rs688040451639904939.32E-070.031X16105pdeg1204.77E-02
rs5380608781791423091.10E-070.004X21810pdeg40NTN1Intron6.56E-01
rs1016268121295172651.88E-060.048X21810pdeg80TMEM132DIntron1.92E-01
rs67699673442173121.77E-070.012X21810pdeg406.56E-01
rs168859316222659408.04E-070.021X21810pdeg120CASC15Intergenic2.32E-02
SNPs ASSOCIATED WITH BOTH ATROPHY AND BRAAK
rs76945019449071877.12E-070.008X16149pdeg120APOEIntron
rs7841580812694061158.07E-070.023X16183pdeg80
rs82056231127453661.46E-060.042X16037pdeg120LINC02042Intergenic

The association results for known AD risk genes.

AD patient modules are always associated with a list of spDEG signature genes, which could be used to investigate the biological relevance of AD risk genes. Figure 3 showed the analysis results of functional annotation to module spDEG signature genes. Among the significant terms, "extracellular matrix assembly," "synaptic signaling," "learning and memory," and "protein folding" were more observed or more significant. By text mining studies, we found much published evidence for their close association with AD, supporting that predicted AD risk genes contributed to AD development. For example, the extracellular matrix was observed to have significant changes during the early stages of AD (Lepelletier et al., 2017) and extracellular matrix could induce β-Amyloid Levels (Ma et al., 2019). Among predicted risk genes, APOE, POMT2, FGF14, CDH13, and RBFOX3 display more functional involvements.

Figure 3

3.4. Evaluation Using Randomly Modules of AD Patients

In the above analysis, we attempted to cluster AD patients with a common set of spDEGs so that the clustered patients were more affected by common AD variants. As an evaluation, we performed a simulated study by randomly splitting AD patients into simulated modules at corresponding sizes. We then predicted AD risk SNPs using the same setting. In each round of the simulation, we identified about 105 AD risk SNPs on average at a cutoff of empirical p < 0.05. We compared their analysis results to those of true modules and found that about 63% of risked SNPs (out of total 174 loci) could be overlapped with the SNPs predicted using true modules. This evaluation seemed to support the conclusion that subsetting AD patients had benefits to improve the power of association studies, even when the criteria to stratify AD patients was to randomly pick up. Compared to random modules, modules of spDEG signature could recover more AD risk SNPs.

4. Discussion

In this work, we took more consideration to AD patient diversity and attempted to stratify patients into modules affected by different genetic background. We therefore came up with an analysis pipeline to cluster AD patients based on some assumptions. These included that (1) AD patients are very diverse and differential expression patterns differ among AD patients, and that (2) we can use single-patient DEGs as biomarkers to indicate the dysregulation status of AD patients and to cluster the AD patients affected by common mechanisms. In our previous work, we have applied similar strategies to discover enriched transcription factor binding sites (Meng and Vingron, 2014) and cancer driver mutations (Meng, 2018), and we achieved a good performance. Evaluation using real patient data suggested that this method could group AD patients with similar clinical outcomes and common risk variants, validating our assumptions. Compared to existent methods, our pipeline not only can discover patient-specific DEGs but also considers the reliability of spDEGs by evaluating their occurrence in multiple patients.

We applied it to find the differentially expressed genes for each AD patient and module patients based on the spDEG signatures. In this process, we made some assumptions. For example, we defined the reference expression profiles for normal individuals by fitting to a Gaussian or negative binomial distribution. The robustness of this step was dependent on the number and homogeneity of control individuals. To identify the differentially expressed genes, we need to set some thresholds to determine if the gene expression level of one AD patient was beyond the normal ranges. In our work, we tested different cutoffs and selected p = 0.1.

We did an association study in each module of size 40 to 120. Compared to the study using all AD patients, the statistical power decreased with a decreased sample size in each association study. However, more AD risk loci were identified for the increased number of AD patient modules. A total of 174 loci were predicted to be associated with AD at a strict threshold of empirical p < 0.05, while only two loci exceed such a threshold using all AD patients. The genotype frequency was found to be different between module and non-module patients. All these results suggested that AD risk variants might contribute only a limited subset of the AD population.

During simulation analysis, we also predicted many AD risk SNPs. The reason could be that random sampling also enriched AD patients affected by some AD risk SNPs. For example, we found APOE SNPs were not associated with AD patients in nearly half of simulated modules. It suggested that randomly sampling enriched the AD patient less affected by APOE. Similarly, it was possible to enrich the AD patients affected by other AD risk SNPs, especially when the AD sample size was limited.

As an evaluation, we also collected ROSMAP data and performed a similar study. We found that our method helped to identify more AD risk genes, validating our conclusion that module analysis improved the power of association study. However, we observed only limited overlaps for identified AD risk SNPs between ROSMAP and HBTRC dataset (see Table 4). The reasons could be that (1) there were only 251 AD patients in ROSMAP data, which were too limited to recover full AD risk SNPs, and (2) the cutoff of the association study was too strict to identify all the AD risk SNPs.

Table 4

idchrposp1p2p3p(emp)TypeGene
rs268874810707445659.85E-081.51E-060.04pdeg60ADAMTS14
rs25320191640742993.60E-085.80E-050.03pdeg40ADCY9
rs117051234865382982.56E-070.025pdeg120ANGPT2,MCPH1
rs11815073812994290007.79E-092.32E-070.025pdeg40ANKS1B
rs106472519449193041.20E-090.005pdeg40APOC1
rs755244759330215861.01E-070.03pdeg60APTX
rs5579151614582991828.04E-081.81E-050.005pdeg60ARID4A
rs7903873716808414282.24E-090.005pdeg60ARLNC1
rs803301415501448666.64E-072.48E-060.04pdeg60ATP8B4
rs215449821293117276.36E-070.04pdeg40BACH1
rs23010791191637871.91E-081.22E-060.005pdeg80BRINP1
rs1164144216833204095.49E-080.005pdeg80CDH13
rs14242491610607808618.23E-099.85E-060.005pdeg40CDK1
rs11766123314952694266.51E-070.03pdeg120CLMN
rs77232965103069065.66E-080.01pdeg60CMBL
rs14251315918523701086.14E-089.24E-090.03pdeg40DCC
rs109713469330310856.41E-080.02pdeg60DNAJA1
rs1699079222435936731.44E-080.005pdeg100EFCAB6
rs14852612714997567164.68E-090.004975124pdeg60EML1
rs727956221320363842.85E-060.04pdeg120HUNK
rs7933467931245484491.99E-081.50E-080.01pdeg60KALRN
rs7342377671206894011.77E-071.19E-060.025pdeg60KCND2
rs15005674110773905254.25E-097.91E-060.004975124pdeg60KCNMA1
rs1165493417409786302.05E-085.95E-080.01pdeg60KRT40
rs14073042717410413385.90E-082.91E-050.015pdeg60KRTAP1-1
rs117929409330197941.52E-070.04pdeg60APTX
rs373085019481654522.85E-071.26E-050.025pdeg40LIG1
rs98611718465485405.90E-070.035pdeg100LOXHD1
rs11270481421413248881.33E-083.94E-050.005pdeg120LRP1B
rs7319382021291723481.39E-062.09E-060.01pdeg80MAP3K7CL
rs7353990661103028901.87E-101.31E-050.005pdeg40METTL24
rs78710139751462201.27E-083.16E-060.025pdeg40OSTF1
rs1015127614570518173.32E-070.02pdeg80OTX2-AS1
rs1156450219481104445.88E-087.32E-050.005pdeg80PLA2G4C
rs4127886522438809271.92E-085.81E-050.005pdeg40PNPLA5
rs444872412275782272.11E-071.37E-060.04pdeg60PPFIBP1
rs3430451712627390818.98E-083.88E-060.025pdeg40PPM1H
rs23015920424871562.85E-074.12E-050.015pdeg100PTPRT
rs8008706514685203662.10E-082.30E-080.005pdeg40RAD51B
rs1724278314206955275.76E-090.005pdeg40RNASE4
rs1092550112377383072.84E-070.015pdeg120RYR2
rs11167181814715439507.48E-070.005pdeg120SIPA1L1
rs22747669330558128.97E-080.03pdeg60SMU1
rs3460100422438616942.15E-082.18E-050.005pdeg40SULT4A1
rs5617144061587156624.29E-156.66E-060.005pdeg40SYTL3
rs14141848841825445453.03E-082.20E-050.005pdeg60TENM3
rs7357169361550575401.31E-072.25E-060.045pdeg60TIAM2
rs498572017169589161.74E-060.04pdeg120TNFRSF13B
rs423595761582504541.91E-085.56E-070.01pdeg60TULP4
rs1694959216791238571.95E-070.035pdeg60WWOX
rs1105390912107030842.92E-084.14E-060.03pdeg40YBX3
rs1110090141458258082.91E-080.04pdeg40ZNF827

The association results for ROSMAP data.

In this work, we proved the benefits of the patient module in association studies to AD. In our application, we reported more AD risk genes even when only 310 AD patients were used. In the large-scale meta-analysis, there were about 20–30 genes identified as AD risk genes (Cuyvers and Sleegers, 2016; Jansen et al., 2019). However, by searching public literature and databases, e.g., the GWAS catalog, we found more than 100 studies and 300 genes that had been reported in associated studies to AD patients. These studies could be treated as a subset of large-scale AD meta-analysis. This result suggested that there might be more AD risk genes, and AD patient subsetting helped to identify them.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found at: https://www.synapse.org/#!Synapse:syn2580853.

Author contributions

GM designed the project. JH and DL contributed to collected samples and analyzed the data. JH and DL contributed to analyzed results and wrote the manuscript. JH, DL, and GM revised the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by National Natural Science Foundation of China, NSCF(81973706).

Acknowledgments

The results published here are in part based on data obtained from the AMP-AD Knowledge Portal (doi: 10.7303/syn2580853). We appreciate their generous contribution to the studies of Alzhiemer's disease, including this study. We appreciated the valuable comments from Dr. Jijia Sun. This manuscript has been released as a pre-print at bioRxiv (Huang et al., 2020).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.571609/full#supplementary-material

References

  • 1

    BallR. D. (2013). Designing a gwas: power, sample size, and data structure. Methods Mol. Biol.1019, 3798. 10.1007/978-1-62703-447-0_3

  • 2

    BelloyM. E.NapolioniV.GreiciusM. D. (2019). A quarter century of apoe and Alzheimer's disease: progress to date and the path forward. Neuron101, 820838. 10.1016/j.neuron.2019.01.056

  • 3

    CuyversE.SleegersK. (2016). Genetic variations underlying Alzheimer's disease: evidence from genome-wide association studies and beyond. Lancet Neurol.15, 857868. 10.1016/S1474-4422(16)00127-7

  • 4

    DahlA.CaiN.KoA.LaaksoM.PajukantaP.FlintJ.et al. (2019). Reverse gwas: using genetics to identify and model phenotypic subtypes. PLoS Genet.15:e1008009. 10.1371/journal.pgen.1008009

  • 5

    De JagerP. L.ShulmanJ. M.ChibnikL. B.KeenanB. T.RajT.WilsonR. S.et al. (2012). A genome-wide scan for common variants affecting the rate of age-related cognitive decline. Neurobiol. Aging33, 1017.e11017.15. 10.1016/j.neurobiolaging.2011.09.033

  • 6

    DelaneauO.MarchiniJ.ZaguryJ.-F. (2011). A linear complexity phasing method for thousands of genomes. Nat. Methods9, 179181. 10.1038/nmeth.1785

  • 7

    DemingY.DumitrescuL.BarnesL. L.ThambisettyM.KunkleB.GiffordK. A.et al. (2018). Sex-specific genetic predictors of Alzheimer's disease biomarkers. Acta Neuropathol.136, 857872. 10.1007/s00401-018-1881-4

  • 8

    GardeuxV.AchourI.LiJ.Maienschein-ClineM.LiH.PesceL.et al. (2014). ‘n-of-1-pathways' unveils personal deregulated mechanisms from a single pair of RNA-seq samples: towards precision medicine. J. Am. Med. Inform. Assoc.21, 10151025. 10.1136/amiajnl-2013-002519

  • 9

    HowieB.FuchsbergerC.StephensM.MarchiniJ.AbecasisG. R. (2012). Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat. Genet.44, 955959. 10.1038/ng.2354

  • 10

    HuangJ.LuD.MengG. (2020). Module analysis using single-patient differential expression signatures improve the power of association study for alzheimer's disease. bioRxiv [Preprint].10.1101/2020.01.05.894931

  • 11

    JansenI. E.SavageJ. E.WatanabeK.BryoisJ.WilliamsD. M.SteinbergS.et al. (2019). Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nat. Genet.51, 404413. 10.1038/s41588-018-0311-9

  • 12

    KunkleB. W.Grenier-BoleyB.SimsR.BisJ. C.DamotteV.NajA. C.et al. (2019). Genetic meta-analysis of diagnosed Alzheimer's disease identifies new risk loci and implicates aβ, tau, immunity and lipid processing. Nat. Genet.51, 414430. 10.1038/s41588-019-0358-2

  • 13

    LanoiseléeH.-M.NicolasG.WallonD.Rovelet-LecruxA.LacourM.RousseauS.et al. (2017). App, psen1, and psen2 mutations in early-onset Alzheimer disease: a genetic screening study of familial and sporadic cases. PLoS Med.14:e1002270. 10.1371/journal.pmed.1002270

  • 14

    LepelletierF.-X.MannD.RobinsonA.PinteauxE.BoutinH. (2017). Early changes in extracellular matrix in Alzheimer's disease. Neuropathol. Appl. Neurobiol.43, 167182. 10.1111/nan.12295

  • 15

    LiuR.YuX.LiuX.XuD.AiharaK.ChenL. (2014). Identifying critical transitions of complex diseases based on a single sample. Bioinformatics30, 15791586. 10.1093/bioinformatics/btu084

  • 16

    LoM.-T.KauppiK.FanC.-C.SanyalN.ReasE. T.SundarV. S.et al. (2019). Identification of genetic heterogeneity of Alzheimer's disease across age. Neurobiol. Aging. 84, 243.e1-243.e9. 10.1016/j.neurobiolaging.2019.02.022

  • 17

    MaC.SuJ.SunY.FengY.ShenN.LiB.et al. (2019). Significant upregulation of Alzheimer's β-amyloid levels in living system induced by extracellular elastin polypeptides. Angew. Chem. Int. Ed. 58, 1870318709. 10.1002/anie.201912399

  • 18

    MarioniR. E.CampbellA.HagenaarsS. P.NagyR.AmadorC.HaywardC.et al. (2017). Genetic stratification to identify risk groups for Alzheimer's disease. J. Alzheimers Dis.57, 275283. 10.3233/JAD-161070

  • 19

    MengG. (2018). Applying expression profile similarity for discovery of patient-specific functional mutations. High Throughput7:6. 10.3390/ht7010006

  • 20

    MengG.MeiH. (2019). Transcriptional dysregulation study reveals a core network involving the progression of Alzheimer's disease. Front. Aging Neurosci.11:101. 10.3389/fnagi.2019.00101

  • 21

    MengG.VingronM. (2014). Condition-specific target prediction from motifs and expression. Bioinformatics30, 16431650. 10.1093/bioinformatics/btu066

  • 22

    MostafaviS.GaiteriC.SullivanS. E.WhiteC. C.TasakiS.XuJ.et al. (2018). A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer's disease. Nat. Neurosci.21, 811819. 10.1038/s41593-018-0154-9

  • 23

    SchisslerA. G.PiegorschW. W.LussierY. A. (2017). Testing for differentially expressed genetic pathways with single-subject n-of-1 data in the presence of inter-gene correlation. Stat. Methods Med. Res.27, 37973813. 10.1177/0962280217712271

  • 24

    VitaliF.LiQ.SchisslerA. G.BerghoutJ.KenostC.LussierY. A. (2017). Developing a ‘personalome' for precision medicine: emerging methods that compute interpretable effect sizes from single-subject transcriptomes. Brief. Bioinform.20, 789805. 10.1093/bib/bbx149

  • 25

    WangM.RoussosP.McKenzieA.ZhouX.KajiwaraY.BrennandK. J.et al. (2016). Integrative network analysis of nineteen brain regions identifies molecular signatures and networks underlying selective regional vulnerability to alzheimer's disease. Genome Med.8:104. 10.1186/s13073-016-0355-3

  • 26

    ZhangB.GaiteriC.BodeaL.WangZ.McelweeJ.PodtelezhnikovA. A.et al. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell153, 707720. 10.1016/j.cell.2013.03.030

Summary

Keywords

Alzheimer's disease, single-patient differentially expressed genes, module analysis, SNP, WGCNA

Citation

Huang J, Lu D and Meng G (2020) Module Analysis Using Single-Patient Differential Expression Signatures Improves the Power of Association Studies for Alzheimer's Disease. Front. Genet. 11:571609. doi: 10.3389/fgene.2020.571609

Received

11 June 2020

Accepted

05 October 2020

Published

20 November 2020

Volume

11 - 2020

Edited by

Yunyan Gu, Harbin Medical University, China

Reviewed by

Suman Ghosal, National Institutes of Health (NIH), United States; Alfred Grant Schissler, University of Nevada, Reno, United States

Updates

Copyright

*Correspondence: Guofeng Meng

This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Outline

Figures

Cite article

Copy to clipboard


Export citation file


Share article

Article metrics