Integrative analysis of multiple diverse omics datasets by sparse group multitask regression

Lin, Dongdong; Zhang, Jigang; Li, Jingyao; He, Hao; Deng, Hong-Wen; Wang, Yu-Ping

doi:10.3389/fcell.2014.00062

ORIGINAL RESEARCH article

Front. Cell Dev. Biol., 27 October 2014

Sec. Systems Biology Archive

Volume 2 - 2014 | https://doi.org/10.3389/fcell.2014.00062

Integrative analysis of multiple diverse omics datasets by sparse group multitask regression

1. Biomedical Engineering Department, Tulane University New Orleans, LA, USA
2. Center for Bioinformatics and Genomics, Tulane University New Orleans, LA, USA
3. Department of Biostatistics and Bioinformatics, Tulane University New Orleans, LA, USA

Abstract

A variety of high throughput genome-wide assays enable the exploration of genetic risk factors underlying complex traits. Although these studies have remarkable impact on identifying susceptible biomarkers, they suffer from issues such as limited sample size and low reproducibility. Combining individual studies of different genetic levels/platforms has the promise to improve the power and consistency of biomarker identification. In this paper, we propose a novel integrative method, namely sparse group multitask regression, for integrating diverse omics datasets, platforms, and populations to identify risk genes/factors of complex diseases. This method combines multitask learning with sparse group regularization, which will: (1) treat the biomarker identification in each single study as a task and then combine them by multitask learning; (2) group variables from all studies for identifying significant genes; (3) enforce sparse constraint on groups of variables to overcome the “small sample, but large variables” problem. We introduce two sparse group penalties: sparse group lasso and sparse group ridge in our multitask model, and provide an effective algorithm for each model. In addition, we propose a significance test for the identification of potential risk genes. Two simulation studies are performed to evaluate the performance of our integrative method by comparing it with conventional meta-analysis method. The results show that our sparse group multitask method outperforms meta-analysis method significantly. In an application to our osteoporosis studies, 7 genes are identified as significant genes by our method and are found to have significant effects in other three independent studies for validation. The most significant gene SOD2 has been identified in our previous osteoporosis study involving the same expression dataset. Several other genes such as TREML2, HTR1E, and GLO1 are shown to be novel susceptible genes for osteoporosis, as confirmed from other studies.

Introduction

Increasing amounts of high-throughput biological data have been collected to investigate the genetic mechanism underlying complex traits at different levels, e.g., genomics, transcriptomics, proteomics, and metabolomics. However, there are usually two bottlenecks for these genetic studies. One is availability of limited sample size due to the experimental cost. Small sample size can lead to the loss of detection power and the reduction of confidence on identified biomarkers. To analyze data with small sample size but large variables is still a challenging statistical problem (Hamid et al., 2009). The other is that biomarkers identified from these different studies often suffer from poor reproducibility. This issue could be caused by many factors such as differences on profiling techniques, demographic, and ancestral information of subjects, sample sizes, and quality control in these datasets (Phan et al., 2012; Song et al., 2012). To increase the power and consistency of biomarker identification, integrating the information of diverse biological datasets from different levels and platforms shows great promise and is highly demanded.

Methods for integration of diverse biological datasets include conventional meta-analysis and a variety of integrative approaches recently developed (Huttenhower et al., 2006; Liu et al., 2013). Meta-analysis is a statistical method to summarize the p-values or statistics (e.g., z-score) from each individual dataset (Evangelou and Ioannidis, 2013). There are a dozen of approaches for combing multiple p-values or statistics such as Fisher method. Meta-analysis is usually used to find common features across multiple datasets with different sample sizes and platforms but under the same hypothesis (Rhodes and Chinnaiyan, 2005). Recently, a number of integrative approaches have been developed, which are based on machine learning and statistical methods (Zhang et al., 2010; Kirk et al., 2012; Xiong et al., 2012). They can analyze multiple datasets from: (1) different platforms and levels but for the same subjects; (2) same platforms but different levels and subjects; (3) different platforms but for the same levels and subjects. They have been successfully used for various applications such as a single or a set of biomarker identification (Chen et al., 2013), gene-gene interaction prediction (Troyanskaya et al., 2003), and genetic network construction (Balbin et al., 2013). The results in these studies demonstrate the advantage of integrating multiple diverse datasets over analyzing them individually.

In this work, we propose a novel method for integrating multiple datasets from different platforms, levels, and samples to identify common biomarkers (e.g., genes). The method was based on multitask regression model enforced with sparse group regularization, which can overcome the “small sample size, but large number of variables” problem. Multitask learning method has been successfully applied to medical imaging data fusion, where multiple types of images (e.g., CT, MRI) were combined for identifying susceptible brain regions and improving disease classification (Zhang and Shen, 2012). Among various sparse regularization terms, the use of sparse group penalty has been shown to outperform other penalties such as lasso in our previous study of pair-wise genomic data integration (Lin et al., 2013). In this study, we enforce two sparse group penalties [i.e., sparse group lasso (Friedman et al., 2010) and sparse group ridge (Chen et al., 2010a)] into the multitask regression model for data integration. We assume a regression model for each dataset as a task, and then multiple regression models will be considered as multiple tasks. Variables from all datasets will be grouped by specific units (e.g., genes). A sparse group penalty is introduced with the aims to (1) reduce dimensionality, i.e., removing a number of irrelevant genes; (2) perform group-wise feature selection, i.e., removing SNPs or expression measurements from the same gene. An effective algorithm based on alternative direction method (ADM) is proposed to solve the model. Based on the estimation of the model, a statistical test is constructed for the identification of potentially causal genes. We perform two simulation studies with both fixed and dynamic genetic effects to evaluate our sparse regression methods, which shows that our sparse group multitask regression model can increase the power of detecting risk genes by integrating multiple diverse datasets effectively. Real data analysis on four osteoporosis studies identifies some significant genes with highly susceptible to bone mineral density and osteoporosis.

Materials and methods

In this section, we will first introduce the sparse group multitask regression model and then propose an effective algorithm based on ADM to solve the model. Finally, a gene based statistical test is constructed to give the level of significance for each selected gene.

Sparse group multitask regression model

We assume T independent datasets collected from K levels of genomic data (e.g., SNP, mRNA) with P_k(k = 1, …, K) platforms (e.g., Affymetrix, Illumina) for each level, and thus . The number of observations in each dataset is denoted by n_i, i = 1, … T. Sample sizes could also be different due to the diversity of protocols in each experiment. The measurement matrix of each experiment is denoted by X⁽ⁱ⁾ ∈ R^{n_i × d_i}, i = 1, …, T, where d_i is the dimension of features in the i-th dataset, and usually d_i >> n_i. These features (e.g., SNPs and mRNA expression probes) are annotated to the genes and we assume that the genes in different datasets are the same, denoted by G = {G_i|i = 1, … Q}. For example, all SNPs and mRNA expressions are tested for the same set of genes G. To reduce scale differences among different levels and platforms, the features in X⁽ⁱ⁾s will be normalized to have zero mean and unit standard deviation. The phenotypic response in each dataset is Y⁽ⁱ⁾ ∈ R^n_i, i = 1, … T, which can be binary or quantitative trait. The study is to identify biomarkers shared by different experiments for the same phenotype. The coefficient matrix for the regression model is denoted by , where C⁽ⁱ⁾ ∈ R^d_i is the coefficient vector of the i-th model Link (Y⁽ⁱ⁾) = X⁽ⁱ⁾C⁽ⁱ⁾, and Link(.) is the known link function.

Multitask learning is adopted in this study for identifying the shared biomarkers across a set of distinct but correlated tasks for better accuracy. In this context, each regression model for an experiment under different level and/or platform is considered as a task. For the sake of simplicity, we assume a linear regression model for each experiment with quantitative trait (i.e., link function will be the identity matrix). The loss function for each model L⁽ⁱ⁾ (X⁽ⁱ⁾, C⁽ⁱ⁾) can be derived from the negative log likelihood function and thus the total loss function for the multitask regression model is .

Many conventional regression methods become ineffective for processing the large scale biological data, which usually have small sample sizes and large number of features. This issue can be addressed by introducing sparse penalty in the model. We propose a sparse multitask regression model as follows:

where Φ(C) is the sparse penalty function. Two popular penalties are used: sparse group lasso and sparse group ridge, and the corresponding models are denoted by multitask-sglasso and multitask-sgridge, respectively. For multitask-sglasso, , where C_{{k ∈ G_q}} indicates a subset of vector C corresponding to the set of features annotated to gene G_q from Ttypes of datasets and is the l-1 norm on C. This sparse group lasso penalty groups features from all datasets based on genes to perform gene level selection. The l-1 norm penalty on C can further remove those irrelevant features from each gene. This bi-level feature selection penalty has been proven to outperform several other single level sparse penalties such as lasso, group lasso, and elastic net for feature identification. For multitask-sgridge, a composite sparse penalty, i.e., group ridge penalty , is imposed on C to perform bi-level feature selection, where the features are also grouped by genes. The penalty uses the inner l-1 norm penalty on C_{{k ∈ G_q}} to achieve the sparsity within each gene while the outer ridge penalty to perform ridge regression at the gene level. This group ridge penalty has also been found to give higher power in identifying causal genes in high dimensional genomic dataset than other single level sparse penalties (Chen et al., 2010a).

In this study, we adopt these two bi-level penalties in our multitask regression models to integrate multiple diverse genomic datasets for gene-based test. Specifically, these two sparse group multitask regression models are formulated as follows:

where ω_is are the weights for the loss function of different levels of datasets, and δ_js are the weights accounting for the sample size differences among the experiments of the same type of datasets. To be more specific, ω_is reflect the prior knowledge on the importance of different levels of measurements, e.g., SNP, gene expression, and proteomics. We choose ω_i = 1, i = 1, 2, l … K in this work, assuming that all levels of measurements contain the same important genetic information. Larger sample size is expected to provide more reliable significance test on biomarkers; therefore, the weight for the experiment under the j-th platform to measure the i-th level of genomic data is given by , where λ₁, λ₂, and λ are the tuning parameters to control the sparsity of genes and the number of features in the models.

It could be noted that our sparse multitask regression model can be taken as the generalization of those existing sparse regression models to the representation of multiple datasets from different levels and/or platforms. For example, when K = 1, P = 1, it is sparse regression model for single dataset as used in Chen et al. (2010a) and Simon et al. (2013); when K = 1, P > 1, it can be reduced to sparse model on multiple datasets at the same level but from different platforms, similar to the work in Ma et al. (2011); when K > 1, P = 1, it can work for multiple datasets at different levels.

Solution algorithm by alternative direct method (ADM)

Although both (2) and (3) are convex optimization problem with global solutions, the non-smoothness and the composite norms still cause difficulties in solving the optimization. Several algorithms have been studied to address such an issue for single task regression models, e.g., second-order cone programming (SOCP) algorithm (Candes and Romberg, 2005), spectral projected gradient method (SPGL1) (van den Berg et al., 2008), accelerated gradient method (SLEP) (Liu et al., 2009), block-coordinate descent algorithm and SpaRSA (Wright et al., 2009). In sparse multitask regression model, since the loss function is separable, these algorithms are still applicable but expensive in computations. In this study, we apply ADM to solve sparse multitask regression model. ADM uses the splitting strategy to decompose the optimization problem into several easily solvable ones and updates the variable in each subproblem iteratively until the convergence is achieved. It has been successfully applied to solve many convex or non-convex optimization problems, such as lasso (Yang and Zhang, 2011), total variation regularization (Esser, 2009), matrix decomposition and our recent work on sparse low rank decomposition (Dongdong et al., 2013). Deng et al. compared ADM with several other algorithms and found that ADM outperformed others with more robustness and faster computation (Deng et al., 2013).

Taking the model in (2) for example, we use ADM to split the penalties and transform (2) into the following optimization:

where V₁, V₂ are two variables making the loss function separable. The augmented Lagrange function can be derived as

where ρ is augmentedLagrangian parameter which can be updated iteratively; D₁, D₂ are the Lagrange multipliers to approximate the residuals between C and V₁, V₂, respectively. Since the objective function and constraints are both separable and convex, ADM method is effective to solve {C, V₁, V₂, D₁, D₂} sequentially. We present the algorithm for solving multitask-sglasso by ADM in Table 1.

Table 1

1	Initialization: k = 0, choose λ₁, λ₂, μ, ρ, > 0, V⁰₁, V⁰₂, D⁰₁, D⁰₂
2	Repeat:
3	C^{k + 1} ← argmin_AL (C, V^k₁, V^k₂, D^k₁, D^k₂)
4	V^{k + 1}₁ ← argmin_V₁L (C^{k + 1}, V₁, V^k₂, D^k₁, D^k₂)
	= argmin_V₁ ǁC^{k + 1} − V₁ − D^k₁ǁ²₂ + λ₁ ǁV_{1{k ∈ G_q}}ǁ₂
5	V^{k + 1}₂ ← argmin_V₂L (C^{k + 1}, V^{k + 1}₁, V₂, D^k₁, D^k₂)
	= argmin_V₂ ǁC^{k + 1} − V₂ − D^k₂ǁ²₂ + λ₂ ǁV₂ǁ_{1, 1}
6	Update Lagrange multipliers
	D^{k + 1}₁ ← D^k₁ − C^{k + 1} + V^{k + 1}₁
	D^{k + 1}₂ ← D^k₂ − C^{k + 1} + V^{k + 1}₂
7	Update iteration k ← k + 1
8	Until some stopping criterion is satisfied

Algorithm of solving multitask-sglasso by ADM.

Remark 1. We decouple (2) into several small convex optimization problems. Step 3 is a regular least square estimation on matrix C, where an analytical solution can be derived. Step 4 is a classical sparse group lasso minimization, which can be solved efficiently by block coordinate decent in Sprechmann et al. (2011). Step 5 is a simple lasso problem, which can also be solved by soft-thresholding. The division of complex optimization into several simple sub-optimizations will improve the efficiency of computation.

Remark 2. We adopt the stopping criterion as suggested by Boyd et al. (2010) that both primal res_pri and dual res_dual residuals must be small, i.e., res_pri ≤ ε_pri, res_dual ≤ ε_dual, where primal residual indicates the difference between C and V₁ (V₂) while dual residual measures the difference between V₁ (V₂) and the values at the last iteration.

Remark 3. The convergence rate depends on the choice of Lagrangian parameter ρ. Some studies adjust ρ based on primal and dual variables iteratively to speed up the convergence. In this work, we update ρ by keeping the ratio between primal and dual residual norms within a given interval, say [0.1, 10] until they both converge to zeros.

For optimization (3), it can similarly be transformed into ADM formulation where only one splitting variable (i.e., V₁) is needed to separate (3) into two subproblems. The estimation of V₁ at Step 4 can be replaced by:

where soft-threshold can be used to get the solution.

Statistical test

λ₁, λ₂, and λ are tuning parameters used to control the number of genes and features within a gene. The K-fold cross validation is widely used to select optimal values of these parameters. Briefly, the subjects are divided into k groups, where k−1 groups of subjects are used for estimating the coefficient matrix C and the rest group of subjects is used to calculate the prediction error by the estimated C. We set λ₁, λ₂, and λ to [10^0.1, 10^0.2, …, 10³] with 30 values. We search the 30 × 30 grid to find an optimal combination of (λ^*₁, λ^*₂) for multitask-sglasso and similarly optimal value of λ^* for multitask-sgridge by 5-fold cross validation. Finally, the estimate of C can be calculated by the derived optimal parameters.

To test the significance of identified biomarkers with non-zeros coefficients at C, we construct a gene based statistical test to measure the strength and significance of the association between genes and phenotype across experiments from different platforms and levels. For the i-th gene G_i, indicates the corresponding coefficient vector estimated from the j-th experiment, denoted by , where m_i is the number of features annotated to gene G_i in the j-th experimental dataset. The null hypothesis is there is no association between the i-th gene and phenotype in all T experiments, denoted by , vs. the alternative hypothesis H_A: ≠ 0, k = 1, 2, …, T for some k. To test the hypothesis, we summarize the coefficients of the i-th gene on all datasets as follows.

where _i, i = 1, 2, …, Q is the statistical value on all Q genes. Due to different number of features included in different genes, an adjustment for gene size is necessary. A permutation based approach is used to reduce the potential bias due to varying gene size. The standardized gene level statistic is given by

where ⁰_i and _i are the mean and standard deviation of the i-th gene under the null hypothesis. Samples are permuted B times to construct null distribution of _i, denoted by ⁰_i = {⁰_i,j|j = 1, 2, …, B}. ⁰_i and _i are then estimated based on permutation data. Since all ⁰_i,j have been normalized, we could pool all ⁰_i into a set Γ⁰ = {⁰_i|i = 1, 2, …, Q} as the estimated null distribution. Therefore, the gene-level p-value of the i-th gene can be calculated by

Simulation

To evaluate the performance of our proposed integrative method for identifying biomarkers, we simulated two levels of measurements: SNP and gene expression, and assigned different sample size for each dataset.

For each simulation, we generated 3 SNP datasets and 3 gene expression datasets. The sample sizes were 600, 400, and 200 for SNP data and 70, 50, and 30 for gene expression, respectively. 200 genes were simulated in each dataset. To mimic the linkage disequilibrium (LD) structure among SNPs, we chose a chromosome, chromosome 22, from HapMap CEU panel with phase III data and sample subjects by software HAPGEN2 (Su et al., 2011). Those SNPs were kept after the following filters were applied: (1) Minor allele frequency (MAF) at least 5%; and (2) Hardy-Weinberg Equilibrium (HWE) with significant level less than 0.001. We generated a dataset consisting of 15,235 SNPs which were assigned to 576 genes as the gene pool. Assuming an additive genetic model, each SNP was recorded as the count of minor allele (denoted as A) at that locus and thereby was valued by 0 (homozygote of major allele, aa), 1 (heterozygote, Aa) and 2 (homozygote of minor allele, AA). 200 genes including more than 10 SNPs were randomly selected from the pool, of which 20 genes were chosen as causal genes and 2 SNPs with MAF from uniform distribution (Unif) (0.15, 0.25) from each causal gene were further used to induce causal genetic effects on gene expression. The number of SNPs from 200 selected genes was randomly set from Unif(10,100) and those non-causal SNPs in each gene were selected from pooled SNPs.

We used SNP data to generate gene expression and phenotype data, referring to the similar method in Huang et al. (2014). Three SNP datasets with 70, 50, and 30 subjects were first simulated, as described in the method section. For each causal gene, e.g., gene i, the expression value G_i was derived from the causal SNPs in this gene by

where n was the number of causal SNPs included in G_i; and β_j indicated the effect of the j-th causal SNP(SNP^j_causal) on G_i. We set β value from Unif(1, 1.2) and noise ε from normal distribution N(0, 1). The other non-causal gene expression values were generated by multivariate normal distribution N(0, Σ), where Σ was the covariance matrix of gene expressions, and the expressions of gene i and j have correlation coefficient 0.3^{|i − j|}. Based on the simulated gene expression, the phenotype was generated by the following formula:

where m was the number of causal genes, i.e., m = 20 in this study; G^j_causal was gene expression for the j-th causal gene and τ_j was the corresponding effects on the outcome. The logit function was used to generate binary outcome. The identity function can be used if the quantitative phenotype was used. ε′ was non-genetic variable, which was assumed to follow normal distribution N(0, 1).

Results

Synthetic data

We assessed the performance of the two proposed sparse multitask models- multitask sglasso and multitask sgridge-on each single dataset and all datasets, respectively, and also compared them with widely used meta-analysis on three SNP datasets (meta-SNP) and three gene expression datasets (meta-EXP). Meta-analysis was implemented by the software MetaL (Willer et al., 2010).

Simulation 1: Fixed effect of causal genes in diverse dataset

In this simulation, we studied the scenario that the effects of causal genes across diverse datasets were fixed, i.e., τ¹_j = τ²_j = … = τ⁶_j, i = 1, 2, …, m, which indicated a causal gene had the same effect on all datasets. For m casual genes, first, we set a baseline vector η ∈ R^m from Unif(0.2, 2) and Unif(−2, −0.2). Next, to evaluate the performance of different methods on identifying casual genes under different levels of effects, a factor δ = 0, 0.2, 0.4, 0.6, 0.8, 1.0 was multiplied by η to have the final value of gene effects τ = η × δ. 50 replicates were performed and B = 500 permutations in each replicate were implemented to calculate empirical p-value of sparse multitask models. Finally, we compared the results of the following eight cases: multitask-sglasso on three expression datasets, three SNP datasets, and all six datasets; multitask-sgridge on three expression datasets, three SNP datasets and all six datasets; meta-analysis on three SNP datasets and three expression datasets.

Figure 1 shows the comparison result of a set of methods under different values of δ, i.e., [0, 0.2, 0.4, 0.6, 0.8, 1.0]. The ROC curves were plotted using the false positive rate against true positive rate by varying the p-value threshold from 10⁻⁴ to 1. It could be seen that all methods had similar performance when there were no effective causal genes in all datasets (i.e., δ = 0). When the effects of causal genes (i.e., δ) increase, i.e., more variability of phenotypes could be explained by genetic variants, multitask-sglasso method shows better performance by removing the irrelevant genes with improved signal to noise ratio. When δ was greater than 0.2, multitask-sglasso methods on SNP, expression and both datasets significantly outperformed the other methods. This indicates that Multitask-sglasso method showed better performance by integrating all datasets than that of using only one level of data. In addition, when δ was greater than 0.4, multitask-sglasso method using only SNP or expression datasets still gave higher power than meta-analysis method. Multitask-sgridge method had less power than multitask-sglasso method and only showed better performance than meta-analysis method when causal genes have high effect sizes.

Figure 1

Simulation 2: Dynamic effects of causal genes in diverse datasets

In this simulation, we consider the situation that a causal gene has different effects at different levels and platforms. This is more likely to happen for real datasets since multiple datasets are usually generated from different studies with different study protocols, profiling techniques, and experimental platforms, leading to dynamic effect sizes of casual genes. We aimed to compare the performance of our sparse multitask methods with meta-analysis for biomarker identification in this dynamic case. Six datasets were generated with the same sample size and causal genes as those in the first simulation study. We simulated the dynamic effects of causal genes at different datasets by setting τ_j ~ N (η, σ²), i = 1, 2, …, 6, where η was fixed effect as described above, and σ was standard deviation indicating the dynamic effect of genes across datasets. We changed the value of σ from 0 to 1 with the interval of 0.2 to show different extent of heterogeneity of causal genes across diverse datasets. 50 replicates were averaged to draw the ROC curve for comparison.

Figure 2 showed the comparison result of eight cases under dynamic effect models with variance of causal genes varying from 0 to 1. When σ = 0, the models reduced to the ones with fixed effects. When σ was greater than 0.4, sparse multitask-sglasso method on SNP, expression and both datasets significantly outperformed other methods in identifying casual genes. Except for sparse multitask-sglasso method, we can also see that the performance of sparse multitask-sgridge on all datasets was better than meta-analysis methods, which indicated the advantage of multitask method for integrating diverse datasets.

Figure 2

Real data analysis

In this study, we took advantage of 3 gene expression datasets and 1 GWAS dataset with bone mineral density (BMD) measurements from our previous studies. The cohort I of gene expression data contained 80 Caucasian females, including 40 high and 40 low hip subjects (Chen et al., 2010b). The cohort II of gene expression data contained 19 Caucasian females, including 10 high and 9 low hip BMD subjects (Liu et al., 2005). The cohort III of gene expression data contained 26 Chinese females, all premenopausal and including 14 high and 12 low hip BMD subjects (Lei et al., 2009). For the GWAS dataset, SNP data were obtained using Affymetrix 500K arrays on 1,000 unrelated homogeneous Caucasians. After a suite of quality control procedures were performed, the SNP set for subsequent analysis contained 379,319 SNPs, yielding an average marker spacing of ~7.9 kb throughout the human genome (Xiong et al., 2009).

We combined gene expression and SNP datasets to identify those risk genes of BMD by our sparse multitask-sglasso integrative method. We chose one chromosome 6 containing the largest number of genes to perform gene-based analysis. 504 genes were included in the chromosome. More details in each dataset were given in Table 2.

Table 2

Data type	Platform	Gene	Genetic variants	Sample
SNP	Affymetrix 500K	504	10685	1000
Gene expression	HG-U133A	504	874	19
Gene expression	HG-U133A	504	1225	26
Gene expression	HG-U133A-Plus_2.0	504	874	80

A summary of four datasets from different levels and platforms used in this analysis.

We applied sparse multitask-sglasso method to SNP, gene expression and both datasets, respectively. To compare with meta-analysis, two gene expression datasets with the same level and experimental platforms, EXP-19 and EXP-80, were used for meta-analysis, denoted by meta-Exp. The most significant expression measurement in each gene was chosen to represent significance level of the gene. Figure 3 shows the Venn diagram of gene list by three methods: multitask-sglasso on all gene expression datasets, multitask-sglasso on all gene expression and SNP datasets, and meta-analysis on two expression datasets under the significant threshold 0.05. We could see that there were 45 genes shared by meta-Exp and multitask-sglasso on three expression datasets; 10 genes overlapped by meta-Exp and multitask-sglasso on both SNP and expression datasets; and three genes (“GPR116,” “HLA-DMB,” “PHACTR1”) identified by all methods. The small overlapping between multitask-sglasso Exp and multitask sglasso SNP + Exp is due to the use of additional information from large sample size of SNP dataset.

Figure 3

Table 3 lists 7 top significant genes identified and sorted by their p-values from sparse multitask-sglasso method on all datasets and the corresponding p-values by meta-analysis. Note that the p-values of the same gene usually were different in different studies. For example, SOD2 had much lower p-values in SNP and EXP-26 datasets than those in other datasets. This difference showed the dynamic effects of genes across diverse datasets with different levels and platforms. There are three genes (“TREML2,” “ HTR1E,” and “GLO1”) shared by sparse multitask-sglasso method on all of datasets and meta-Exp. Except for gene TREML2, the p-values of genes derived from all datasets were lower than those from the other methods, indicating higher level of significance given by our multitask method. The relatively smaller p-values of these genes in SNP data were due to the large sample size of SNP dataset, which will give more confidence on the findings.

Table 3

The top 7 identified genes and their p-values by sparse multitask-sglasso method in bone mineral density studies.

^* Genes identified by both meta-Exp and sparse multitask-sglasso on all datasets.

To further evaluate the significance of identified genes by multitask-sglasso, we performed gene level meta-analysis on three independent BMD studies for validation, more details were shown in supplementary data. The result (Table S1) listed the p-values of 24 identified genes based on single studies and meta-analysis. Most of these genes showed significant effects on BMD (p < 0.01), indicating the effectiveness of our sparse multitask regression method in identifying genetic risk factors.

Three shared genes (“TREML2,” “HTR1E,” and “GLO1”) may have important biological functions related to BMD associated with osteoporosis. TREML2 (also known as TLT-2) was located in a gene cluster on chromosome 6 with the single Ig variable (IgV) domain activating receptors TREM1 and TREM2, while these TREM receptor families were found to participate in the process of bone homeostasis by controlling the rate of osteoclastogenesis and regulating the differentiation of osteoclasts (Klesney-Tait et al., 2006; Otero et al., 2012). HTR1E was recently identified to contain SNPs significantly associated with a linear combination of multiple osteoporosis-related phenotypes including BMD (Karasik et al., 2012). GLO1, as a binding protein of methyl-gerfelin (M-GFN), was found to be able to result in the inhibition of osteoclastogenesis (Kawatani et al., 2008). Besides these three common genes, our method was also able to identify other osteoporosis-susceptible genes but was undetectable by meta-analysis. For instance, SOD2 has been identified as the gene susceptible to osteoporosis in our previous integrative analysis of mRNA, SNP, and protein data (Deng et al., 2011). It may play a significant role in BMD variation and pathogenesis of osteoporosis. HDAC2, as a member of histone deacetylases (HDACs), was found to play a critical role in bone development and biology (McGee-Lawrence and Westendorf, 2011). These genes were missed out with meta-analysis but can be detected with our proposed method, showing improved sensitivity.

Conclusion and discussion

In this work, we proposed a multi-omics integration method, i.e., sparse group multitask regression model, which can integrate multiple genomic datasets from different levels, platforms, and subjects for gene based analysis. An efficient computational algorithm based on ADM was provided for its solution. The performance of the model was compared with meta-analysis in simulation datasets. The simulation results showed that our sparse group multitask regression model can increase the power of detecting risk genes by integrating multiple diverse datasets effectively. In particular, multitask-sglasso model outperformed meta-analysis method in simulations on genes with both fixed and dynamic effects. Our real data analysis on osteoporosis studies identified significant genes but missed by meta-analysis, and these genes were reported to be highly susceptible to BMD and osteoporosis. Overall, the advantages of our sparse group multitask regression method for biomarker identification from multiple omics datasets include: (1) it can combine diverse and complementary omic datasets without; (2) group the features by gene or gene set to account for the group structures in data (e.g., LD structure, co-expression, and genetic regulatory network); (3) remove irrelevant genes and/or features within a gene simultaneously.

Our proposed sparse multitask regression model provided a general framework for integrative analysis of diverse datasets. To fuse multiple diverse datasets, we considered the regression on each single dataset as a single task and then combined all single tasks into the model. Two sets of parameters were used in the model. ω_is were used to weight object functions (i.e., data fitting term at each level) different levels, while δ_j were used for different platforms. Similar to other works, we set ω to be equal by assuming each level of genetic data contains the same information (Ma et al., 2011). We assign δ_j to the data from different platforms by their sample sizes (Wilson and Lipsey, 2001). Other methods can also be applied to estimating weights such as Kaplan–Meier estimate (Liu et al., 2013) and inverse variance (Wilson and Lipsey, 2001). In order to account for the group effects and reduce the large number of features, we used two group sparse penalties in our multitask regression models, i.e., sparse group lasso and sparse group ridge, respectively. These penalties can perform feature selection at both group level and individual for multiple dataset levels, showing better performance than those of using lasso and group lasso penalties for single dataset analysis. Similar regression models were also recently proposed for using two-level sparse group penalties such as group bridge and group MCP (Huang et al., 2012). Ma et al. has recently applied these penalties in regression model for cancer studies to identify those risk oncology genes by integrating multiple expression level datasets from different cancer studies (Liu et al., 2013). Chen et al. has also compared and found that sparse group ridge outperformed group bridge penalty in single dataset regression model (Chen et al., 2010c). However, no study has been performed to compare them for multiple dataset integration and further work is needed in this direction.

Web sources

The gene expression datasets from three cohorts can be accessed in GEO database (http://www.ncbi.nlm.nih.gov/geo/) with the following accession numbers: 19 Caucasians BMD study (GSE2208), 26 Chinese study (GSE7158), and 80 Caucasians study (GSE56815).

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Statements

Acknowledgments

Yu-Ping Wang was partially supported by NIH grants (R01MH104680, R01GM109068). Hong-Wen Deng was partially supported by NIH grants (P50AR055081, R01AG026564, R01AR059781, R01AR050496, and R01AR057049) and Edward G. Schlieder Endowment. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. The FHS datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap through dbGaP accession phs000007.v14.p6.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fcell.2014.00062/abstract

Funding source

The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute NHLBI in collaboration with Boston University (Contract No. N01-HC-25195). Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University. Funding support for the Framingham Bone Mineral Density datasets was provided by NIH grants R01AR/AG 41398.

References

1
BalbinO. A.PrensnerJ. R.SahuA.YocumA.ShankarS.MalikR.et al. (2013). Reconstructing targetable pathways in lung cancer by integrating diverse omics data. Nat. Commun. 4, 2617. 10.1038/ncomms3617
2
BoydS.ParikhN.ChuE.PeleatoB.EcksteinJ. (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122. 10.1561/2200000016
- CrossRef
- Google Scholar
3
CandesE.RombergJ. (2005). l1-Magic: Recovery of Sparse Signals via Convex Programming. Available online at: http://users.ece.gatech.edu/justin/l1magic/downloads/l1magic.pdf.
- Google Scholar
4
ChenL.HutterC.PotterJ. D.LiuY.PrenticeR. L.PetersU. L. H. (2010c). Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data. Am. J. Hum. Genet. 86, 860–871. 10.1016/j.ajhg.2010.04.014
5
ChenL. S.HutterC. M.PotterJ. D.LiuY.PrenticeR. L.PetersU.et al. (2010a). Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data. Am. J. Hum. Genet. 86, 860–871. 10.1016/j.ajhg.2010.04.014
6
ChenX. D.XiaoP.LeiS. F.LiuY. Z.GuoY. F.DengF. Y.et al. (2010b). Gene expression profiling in monocytes and SNP association suggest the importance of the STAT1 gene for osteoporosis in both Chinese and Caucasians. J. Bone Miner. Res. 25, 339–355. 10.1359/jbmr.090724
7
ChenY.WuX.JiangR. (2013). Integrating human omics data to prioritize candidate genes. BMC Med. Genomics6:57. 10.1186/1755-8794-6-57
8
DengF. Y.LeiS. F.ChenX. D.TanL. J.ZhuX. Z.DengH. W. (2011). An integrative study ascertained SOD2 as a susceptibility gene for osteoporosis in Chinese. J. Bone Miner. Res. 26, 2695–2701. 10.1002/jbmr.471
9
DengW.YinW.ZhangY. (2013). Group sparse optimization by alternating direction method, in SPIE Optical Engineering+ Applications: 2013: International Society for Optics and Photonics; 88580R-88580R-88515 (San Diego, CA).
- Google Scholar
10
DongdongL.HaoH.JingyaoL.Hong-WenD.CalhounV. D.Yu-PingW. (2013). Network-based investigation of genetic modules associated with functional brain networks in schizophrenia, in 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (Beijing), 9–16.
- Google Scholar
11
EsserE. (2009). Applications of Lagrangian-Based Alternating Direction Methods and Connections to Split Bregman. Technical Report 09-31. Berkeley, CA: University of California.
- Google Scholar
12
EvangelouE.IoannidisJ. P. (2013). Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379–389. 10.1038/nrg3472
13
FriedmanJ.HastieT.TibshiraniR. (2010). A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
- Google Scholar
14
HamidJ. S.HuP.RoslinN. M.LingV.GreenwoodC. M.BeyeneJ. (2009). Data integration in genetics and genomics: methods and challenges. Hum. Genomics Proteomics2009:869093. 10.4061/2009/869093
15
HuangJ.BrehenyP.MaS. (2012). A selective review of group selection in high-dimensional models. Stat. Sci. 27, 481–499. 10.1214/12-STS392
16
HuangY. T.VanderweeleT. J.LinX. (2014). Joint analysis of Snp and Gene expression data in genetic association studies of complex diseases. Ann. Appl. Stat. 8, 352–376. 10.1214/13-AOAS690
17
HuttenhowerC.HibbsM.MyersC.TroyanskayaO. G. (2006). A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics22, 2890–2897. 10.1093/bioinformatics/btl492
18
KarasikD.CheungC. L.ZhouY.CupplesL. A.KielD. P.DemissieS. (2012). Genome−wide association of an integrated osteoporosis−related phenotype: is there evidence for pleiotropic genes?J. Bone Miner. Res. 27, 319–330. 10.1002/jbmr.563
19
KawataniM.OkumuraH.HondaK.KanohN.MuroiM.DohmaeN.et al. (2008). The identification of an osteoclastogenesis inhibitor through the inhibition of glyoxalase I. Proc. Natl. Acad. Sci. U.S.A. 105, 11691–11696. 10.1073/pnas.0712239105
20
KirkP.GriffinJ. E.SavageR. S.GhahramaniZ.WildD. L. (2012). Bayesian correlated clustering to integrate multiple datasets. Bioinformatics28, 3290–3297. 10.1093/bioinformatics/bts595
21
Klesney-TaitJ.TurnbullI. R.ColonnaM. (2006). The TREM receptor family and signal integration. Nat. Immunol. 7, 1266–1273. 10.1038/ni1411
22
LeiS. F.WuS.LiL. M.DengF. Y.XiaoS. M.JiangC.et al. (2009). An in vivo genome wide gene expression study of circulating monocytes suggested GBP1, STAT1 and CXCL10 as novel risk genes for the differentiation of peak bone mass. Bone44, 1010–1014. 10.1016/j.bone.2008.05.016
23
LinD.ZhangJ.LiJ.CalhounV. D.DengH. W.WangY. P. (2013). Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics14:245. 10.1186/1471-2105-14-245
24
LiuJ.HuangJ.XieY.MaS. (2013). Sparse group penalized integrative analysis of multiple cancer prognosis datasets. Genet. Res. (Camb). 95, 68–77. 10.1017/S0016672313000086
25
LiuJ.JiS.YeJ. (2009). SLEP: Sparse learning with efficient projections. Arizona State University 6. Available online at: http://www.public.asu.edu/~jye02/Software/SLEP
- Google Scholar
26
LiuY. Z.DvornykV.LuY.ShenH.LappeJ. M.ReckerR. R.et al. (2005). A novel pathophysiological mechanism for osteoporosis suggested by an in vivo gene expression study of circulating monocytes. J. Biol. Chem. 280, 29011–29016. 10.1074/jbc.M501164200
27
MaS.HuangJ.SongX. (2011). Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics12, 763–775. 10.1093/biostatistics/kxr004
28
McGee-LawrenceM. E.WestendorfJ. J. (2011). Histone deacetylases in skeletal development and bone mass maintenance. Gene474, 1–11. 10.1016/j.gene.2010.12.003
29
OteroK.ShinoharaM.ZhaoH.CellaM.GilfillanS.ColucciA.et al. (2012). TREM2 and beta-catenin regulate bone homeostasis by controlling the rate of osteoclastogenesis. J. Immunol. 188, 2612–2621. 10.4049/jimmunol.1102836
30
PhanJ. H.QuoC. F.ChengC.WangM. D. (2012). Multiscale integration of -omic, imaging, and clinical data in biomedical informatics. IEEE Rev. Biomed. Eng. 5, 74–87. 10.1109/RBME.2012.2212427
31
RhodesD. R.ChinnaiyanA. M. (2005). Integrative analysis of the cancer transcriptome. Nat. Genet. 37(Suppl.), S31–S37. 10.1038/ng1570
32
SimonN.FriedmanJ.HastieT.TibshiraniR. (2013). A sparse-group lasso. J. Comput. Graph. Stat. 22, 231–245. 10.1080/10618600.2012.681250
- CrossRef
- Google Scholar
33
SongR.HuangJ.MaS. (2012). Integrative prescreening in analysis of multiple cancer genomic studies. BMC Bioinformatics13:168. 10.1186/1471-2105-13-168
34
SprechmannP.RamirezI.SapiroG.EldarY. C. (2011). C-HiLasso: a collaborative hierarchical sparse modeling framework. IEEE Trans. Signal Process. 59, 4183–4198. 10.1109/TSP.2011.2157912
- CrossRef
- Google Scholar
35
SuZ.MarchiniJ.DonnellyP. (2011). HAPGEN2: simulation of multiple disease SNPs. Bioinformatics27, 2304–2305. 10.1093/bioinformatics/btr341
36
TroyanskayaO. G.DolinskiK.OwenA. B.AltmanR. B.BotsteinD. (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. U.S.A. 100, 8348–8353. 10.1073/pnas.0832373100
37
van den BergE.SchmidtM.FriedlanderM. P.MurphyK. (2008). Group Sparsity via Linear-Time Projection. Vancouver, BC: Dept Comput Sci, Univ British Columbia.
- Google Scholar
38
WillerC. J.LiY.AbecasisG. R. (2010). METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics26, 2190–2191. 10.1093/bioinformatics/btq340
39
WilsonD. B.LipseyM. (2001). Practical Meta-Analysis. Available online at: http://www.mason.gmu.edu.
- Google Scholar
40
WrightS. J.NowakR. D.FigueiredoM. A. (2009). Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57, 2479–2493. 10.1109/TSP.2009.2016892
41
XiongD. H.LiuX. G.GuoY. F.TanL. J.WangL.ShaB. Y.et al. (2009). Genome-wide association and follow-up replication studies identified ADAMTS18 and TGFBR3 as bone mass candidate genes in different ethnic groups. Am. J. Hum. Genet. 84, 388–398. 10.1016/j.ajhg.2009.01.025
42
XiongQ.AnconaN.HauserE. R.MukherjeeS.FureyT. S. (2012). Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets. Genome Res. 22, 386–397. 10.1101/gr.124370.111
43
YangJ.ZhangY. (2011). Alternating direction algorithms for \ ell_1-problems in compressive sensing. SIAM J. Sci. Comput. 33, 250–278. 10.1137/090777761
44
ZhangD.ShenD. (2012). Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease. Neuroimage59, 895–907. 10.1016/j.neuroimage.2011.09.069
45
ZhangK.GrayJ. W.ParvinB. (2010). Sparse multitask regression for identifying common mechanism of response to therapeutic targets. Bioinformatics26, i97–i105. 10.1093/bioinformatics/btq181

Summary

Keywords

sparse regression, multitask learning, group lasso, significant test, osteoporosis

Citation

Lin D, Zhang J, Li J, He H, Deng H-W and Wang Y-P (2014) Integrative analysis of multiple diverse omics datasets by sparse group multitask regression. Front. Cell Dev. Biol. 2:62. doi: 10.3389/fcell.2014.00062

Received

21 June 2014

Accepted

01 October 2014

Published

27 October 2014

Volume

2 - 2014

Edited by

Jennifer Elizabeth Dent, Quintiles, UK

Reviewed by

Dougu Nam, Ulsan National Institute of Science and Technology, South Korea; Ayse Meric Ovacik, Merck, USA

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yu-Ping Wang, Department of Biomedical Engineering, Biostatistics and Bioinformatics, Tulane University, New Orleans, LA 70118, USA e-mail: wyp@tulane.edu

This article was submitted to Systems Biology, a section of the journal Frontiers in Cell and Developmental Biology.

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Systems Biology Archive

ORIGINAL RESEARCH article

Integrative analysis of multiple diverse omics datasets by sparse group multitask regression

Abstract

Introduction