Edited by: Galina Glazko, University of Arkansas for Medical Sciences, United States
Reviewed by: Frank Emmert-Streib, Tampere University, Finland; Shandar Ahmad, Jawaharlal Nehru University, India
This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics
†These authors have contributed equally to this work
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Gene set analysis methods are widely used to provide insight into high-throughput gene expression data. There are many gene set analysis methods available. These methods rely on various assumptions and have different requirements, strengths and weaknesses. In this paper, we classify gene set analysis methods based on their components, describe the underlying requirements and assumptions for each class, and provide directions for future research in developing and evaluating gene set analysis methods.
High-throughput technologies such as DNA microarrays and RNA-Seq are widely used to monitor the activity of thousands of genes in a single experiment. The primary challenge to realizing the potential of these technologies is gaining biological insight from the generated data.
The early approach for analysing gene expression data was single-gene analysis, where expression measures of each gene for case and control samples are compared using a statistical test such as
In a high-throughput gene expression study, many single-gene tests are typically performed. Consequently, adjustment for multiple comparisons is performed for a large number of genes. Such adjustments may lead to many false negatives by detecting very few or even no gene as being differentially expressed (Sreekumar et al.,
In the single-gene approach often researchers use arbitrary cutoff values to choose a reasonable number of genes for further study and interpretation. Different choices of threshold value may lead to different biological interpretations (Pan et al.,
Cellular processes are often associated with changes in the expression patterns of groups of genes that share common biological functions or attributes. A meaningful change in a group of these genes is more biologically reliable and interpretable than a change in a single gene.
Although high-throughput technologies make the monitoring of expression of thousands of genes in a single experiment possible, they introduce a challenge of dealing with high dimensional data, often referred to as the “curse of dimensionality” (Berrar et al.,
When differences in measured values for a single-gene across treatments are subtle, the single-gene approach makes it difficult to differentiate the true difference in gene expression from the difference due to biological variability of samples (Mootha et al.,
Multi-functional genes, i.e., genes that are involved in multiple biological activities, are commonplace. For example, Pritykin et al. (
Single-gene approach may report several hundred to a few thousand genes as being differentially expressed. Interpreting a long list of differentially expressed genes is a cumbersome task prone to investigator bias toward a hypothesis of interest.
Gene set analysis, also know as enrichment analysis, is an attempt to resolve these shortcomings and to gain insight from gene expression data. The primary aim of gene set analysis is to identify enrichment or depletion of expression levels of a given set of genes of interest, referred to as a gene set. In this paper, we use the phrase “differentially enriched” to describe gene sets that either are enriched (more expression activity) or depleted (less expression activity).
Gene sets are defined based on various criteria such as membership in certain biological pathways or being co-expressed together under a certain condition. These gene sets are gathered into collections known as gene set databases. MSigDB (Subramanian et al.,
Although we provide a list of more than 100 gene set analysis methods/tools (see
The rest of the paper is organized as follows. In section 2, we survey most widely used over-representation analysis (ORA) and functional class scoring (FCS) methods. Different significance assessment approaches and null hypotheses are covered in sections 3 and 4, respectively. Pathway topology-based methods are briefly surveyed in section 5. Section 6 describes the challenges facing gene set analysis methods. In section 7, we provide directions for future research in developing and evaluating gene set analysis methods. Finally, section 8 concludes the paper with a short summary.
Data from a high-throughput case-control experiment can be organized in an expression matrix. This matrix is generated by joining the corresponding expression values for all samples in the experiment. Each column of the matrix corresponds to the expression measures for one sample and each row corresponds to the expression measures for one gene across all samples. This expression matrix is the input for expression analyses including single-gene and gene set analysis.
Expression matrix for a pairwise comparison where
There are many gene set analysis methods available. Over-representation analysis, functional class scoring, and pathway topology-based methods are three main categories of gene set analysis methods (Khatri et al.,
A schematic view of over-representation analysis (ORA) and univariate and multivariate FCS methods.
ORA is the natural extension of single-gene analysis and one of the most widely used classes of gene set analysis methods. Due to its simplicity, well-established underlying statistical model, and ease of implementation, ORA is available through many tools. Huang et al. (
Given
Representation of ORA as a contingency table.
Genes in |
|| |
||
Genes in |
|||
Total | || |
Under the null hypothesis that there is no association between differential expression and membership in
The significance of the association between genes in
Although Fisher's exact test gives the exact
For large values of
Therefore, Equation (2) can be estimated as:
where
Another alternative to estimate the
The main assumptions of ORA are that genes are independent and equally effective in biological processes. Although these assumptions simplify problem modeling, they are not biologically valid. It is well-established that genes, proteins, and other biomolecules often act in concert (Tilford and Siemers,
In contrast to ORA, the main goal of FCS methods is to use all information from an expression matrix to address the enrichment problem without relying on the aforementioned biologically invalid assumptions. Therefore, FCS methods—instead of working with a list of differentially expressed genes—take advantage of an expression matrix of gene expression measures for all genes to discern differential enrichment of gene sets.
There are many FCS methods available (see
An FCS method often consists of a set of common components such as a gene score that is a statistic summarizing the expression level of a gene across control and case samples, a gene set score that summarizes the expression level of genes within a gene set as a single statistic, a procedure for significance assessment, and an adjustment for multiple comparisons.
GSEA (Mootha et al.,
where
GSEA ranks all genes according to their scores. Then to measure the association between members of a given gene set
After calculation of the actual
It should be mentioned that the significance of the
Since the enrichment score is defined as the “maximum observed positive deviation of the running sum” (Mootha et al.,
Damian and Gorfine (
Considering these shortcomings, Tian et al. (
PAGE, a parametric method for gene set enrichment analysis, was proposed as a statistically more sensitive and computationally less demanding alternative for GSEA (Kim and Volsky,
Finally, the significance of
In another attempt to address the aforementioned shortcomings of GSEA, Subramanian et al. (
where
It should be mentioned that the enrichment score in the adjusted GSEA is similar to, but not the same as, the enrichment score in GSEA. To calculate the enrichment scores, both methods calculate a running sum by traversing the list of all genes ranked according to their gene scores. For each gene in the list, the original GSEA method updates the running sum by a constant value, while the adjusted GSEA increases the running sum with a value of
Irizarry et al. (
By accepting the assumption that the
Irizarry et al. (
They approximated the distribution of the χ2-score for a gene set of size 20 or higher using the standard normal distribution to calculate the significance of the gene set score.
Tamayo et al. (
In addition, Tamayo et al. (
Jiang and Gentleman (
where
In addition, they suggested using median and the sign test, which is a non-parametric test to assess consistent differences in paired samples, as alternatives to the gene set statistic. The sign test was used to assess the prevalence of up- or down-regulation of genes within a gene set, regardless of the magnitude of this regulation. They found a lack of sensitivity when using the sign test as gene set score. Also, they suggested that median is less susceptible to outlier effects in comparison to using mean as a gene set score.
Multivariate FCS methods, unlike single variate FCS methods, directly calculate gene set scores from expression data and skip the intermediate step of calculating gene scores (see
where β
Considering the fact that the number of samples is usually less than number of variables, i.e., gene set size ||
An implementation of the Globaltest method is available as an R-package from Bioconductor (Gentleman et al.,
Kong et al. (
where
Since
Successful application of multivariate statistical tests depends on meeting their stringent underlying requirements such as normality of data, adequacy of sample size, and equality of variance (Venter and Maxwell,
Based on the approach used for significance assessment, gene set analysis methods can be classified as parametric and non-parametric methods. In parametric methods, after calculating a gene set score for each gene set, a parametric distribution is used to assess the significance of this score. Non-parametric approaches, on the other hand, rely on an empirical distribution to assess the significance of the gene set scores. These methods often do not make any strong assumptions about the underlying distribution of the gene set scores. Phenotype permutation and gene sampling are the main non-parametric approaches used in gene set analysis. For example, methods such as GSEA offers both phenotype permutation and gene sampling for significance assessment.
The parametric approach is another way to assess the significance of gene set scores (Kim and Volsky,
Parametric methods are built based on some knowledge or assumptions about the underlying distribution of the gene set scores. For example, PAGE assumes that the average fold-change value of genes within a gene set follows a normal distribution. SEA, another parametric approach, assumes that its gene set score—which is a weighted average of the
In gene sampling the significance of a gene set score
Since gene sampling does not depend on the number of samples, it has been widely used for gene set analysis of datasets with small sample sizes (Subramanian et al.,
Phenotype permutation, also known as sample permutation, assesses the significance of a gene set score of a given gene set
First, the gene set score of
Phenotype permutation, unlike gene sampling, does not rely on the unrealistic assumption of gene independence, but it requires a large number of samples for each phenotype. This condition most often is not satisfied. Instead, due to ethical conduct in animal and human research and limited budgets, having a large number of samples is not a choice for many researchers. In some cases, like for rare diseases, having a large sample size is not possible at all. Therefore, phenotype permutation is generally not applicable, and some gene set analysis tools provide gene sampling as an alternative to phenotype permutation (Subramanian et al.,
Keller et al. (
In order to calculate the number of enrichment scores that are less than
Finally,
Defining a null hypothesis is an essential step in conducting any statistical inference. Different null hypotheses have been used in gene set enrichment analysis: competitive null hypothesis (Goeman and Bühlmann,
Visualization of gene sampling under the competitive null hypothesis. In this figure,
Visualization of phenotype permutation under the self-contained null hypothesis. In this figure,
Visualization of phenotype permutation under the self-contained hybrid null hypothesis. This type of hypothesis states that the relative expression pattern of genes within a gene set is not differentially associated with phenotypes. For example, given a gene set
For a given gene set
After calculation of a gene set score
For a given gene set
To test the self-contained null hypothesis, a phenotype permutation approach is used (see section 3.2.2). Consequently, testing a self-contained null hypothesis leads to preserving the complex correlation of genes within a gene set. However, it requires a large number of samples for each phenotype. This condition may not be met by many biological experiments.
Hybrid null hypotheses concern changes in the relative expression patterns of genes. These null hypotheses can be classified as the competitive hybrid null hypothesis or self-contained hybrid null hypothesis. Methods based on hybrid null hypotheses calculate a gene set score for a given gene set
It should be mentioned that some authors have classified hybrid methods under competitive or self-contained methods based on whether they use a sample permutation or a gene sampling for significance assessment (Das et al.,
Not all genes in a pathway play an equally important role in its activity. The knowledge of pathway topology, such as gene product interactions, can help in quantifying the importance of a gene to the pathway activity. Topology information could potentially improve the accuracy of enrichment analysis. Topology-based pathway analysis methods incorporate such information about pathways (Draghici et al.,
GSNCA was designed to account for all cross-correlations of each gene and to assign an importance value to each gene in a pathway. They compared the results of GSNCA with that of GSCA (Choi and Kendziorski,
Bayerlová et al. (
In a another study, Ihnatova et al. (
There are many gene set analysis methods available with no consensus about the best practices. One contributing factor to this lack of consensus is the lack of gold standard expression datasets. A gold standard dataset for evaluation of gene set analysis methods requires the enrichment status of given gene sets to be known
Despite having a well-established underlying statistical model, ORA suffers from several shortcomings. ORA relies on the gene-gene independence assumption that is known to be biologically invalid (Gatti et al.,
FCS methods aim at solving some of these problems. There are many FCS methods available, but there is no consensus among researchers about the method of choice for a given experiment (Goeman and Bühlmann,
Lack of specificity of gene set analysis methods is the main hindrance to gaining insight from the results of gene set analysis. For example, assume that the null hypothesis for a self-contained method is that there is no difference in the average expression of genes in a gene set between case and control samples. Then a significant change in the expression of a single gene can cause any gene set containing that gene to be reported as being differentially enriched. The problem arises in the presence of gene set overlap, where some genes may occur in several gene sets. Due to the presence of multifunctional genes (i.e., genes that play a role in several biological functions or molecular processes), and also the parent-child structure of some gene sets (e.g., gene sets extracted from GO), gene set overlap is an integral part of gene set databases (Maleki and Kusalik,
A limitation of self-contained methods is that they require a large number of samples per group, as they use phenotype permutation for significance assessment. This means that many of the high-throughput datasets available are not appropriate for use with these types of methods. Alternatively, competitive gene set analysis methods are used for datasets with small sample sizes. Competitive gene set analysis methods rely on gene sampling for the significance assessment. Gene sampling is based on the assumption that genes are independent. This assumption is known to be biologically invalid and may cause some gene sets to be predicted as being differentially enriched solely due to the correlations between its genes. This issue introduces false positives and decreases the specificity. Therefore, gene-gene correlations should be considered in the design and evaluation of gene set analysis methods.
It has been shown that for many gene set analysis methods, whether competitive or self-contained, the results of the analysis are not reproducible for small sample sizes (Maleki et al.,
Evaluation of gene set analysis methods has become an important area of research (Rahmatallah et al.,
Real datasets with presumed enrichment status for gene sets are commonly used for the evaluation of gene set analysis methods (Tarca et al.,
Due to the lack of gold standard datasets for the evaluation of gene set analysis methods, simulated expression datasets have been used (Efron and Tibshirani,
Gene set collections have also been simulated to be a small number of non-overlapping sets of equal size, a situation that is substantially different from the real gene set databases. Due to oversimplifying assumptions, evaluation of gene set analysis using these datasets has led to inconsistent and contradictory results (Maciejewski,
This is because adding unrelated genes changes the distribution of background genes. In competitive methods, the significance of a gene set score
Due to the lack of gold standard datasets, simulated datasets using normally distributed values with zero or constant gene-gene correlations have been widely used to evaluate gene set analysis methods (Efron and Tibshirani,
One important factor that should be considered in developing gene set analysis methods is their capability in dealing with gene set overlap, which has contributed to the lack of specificity of some methods (Simillion et al.,
Also, in the evaluation of gene set analysis methods, simulated gene set databases consisting of non-overlapping gene sets of equal sizes have been used. Such a setting disregards the true nature of gene set databases that have different degrees of gene set overlap and different gene set sizes, which have been reported to affect the results of gene set analysis methods (Damian and Gorfine,
Tripathi et al. (
Moreover, different distributions of up- and down-regulated genes in gene sets, various gene set sizes, different levels of differential expression, different sample sizes, and an imbalanced number of samples per group might affect the result of a gene set analysis method (Irizarry et al.,
The quantitative study of several well-established gene set databases, which are used as input to gene set analysis methods, has shown that the choice of gene set database might have a profound impact on the results of gene set analysis (Maleki et al.,
In this paper, we reviewed a set of well-established gene set analysis methods. We discussed the shortcomings and strengths of these methods based on their various components such as their gene set score, null hypothesis, and methods of significance assessment. We also provided direction for conducting further research in gene set analysis.
To resolve the lack of consensus about the method of choice for a given experiment, a systematic methodology for evaluating gene set analysis methods should be utilized. Developing benchmark datasets for facilitating such a method comparison would highly benefit the research community. The benchmark expression datasets should represent the characteristics of real expression data and avoid using oversimplifying assumptions such as normally distributed data with zero or constant gene-gene correlation. Also, non-overlapping genes sets of equal size must be avoided as well.
Despite the numerous gene set analysis methods and tools available, due to the complex nature of the problem, developing methods with high specificity and high sensitivity remains a challenge and an active research area.
FM wrote the first draft of the manuscript and contributed to writing revisions. KO and DH helped with writing later drafts of the manuscript, incorporating some of the recent studies in gene set analysis as well as helping with revisions. AK supervised the work and assisted with the revision of the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).
The Supplementary Material for this article can be found online at: