ORIGINAL RESEARCH article

Front. Genet., 10 December 2024

Sec. Computational Genomics

Volume 15 - 2024 | https://doi.org/10.3389/fgene.2024.1488683

DMOIT: denoised multi-omics integration approach based on transformer multi-head self-attention mechanism

  • 1. Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea

  • 2. Department of Statistics, Seoul National University, Seoul, Republic of Korea

Abstract

Multi-omics data integration has become increasingly crucial for a deeper understanding of the complexity of biological systems. However, effectively integrating and analyzing multi-omics data remains challenging due to their heterogeneity and high dimensionality. Existing methods often struggle with noise, redundant features, and the complex interactions between different omics layers, leading to suboptimal performance. Additionally, they face difficulties in adequately capturing intra-omics interactions due to simplistic concatenation techniques, and they risk losing critical inter-omics interaction information when using hierarchical attention layers. To address these challenges, we propose a novel Denoised Multi-Omics Integration approach that leverages the Transformer multi-head self-attention mechanism (DMOIT). DMOIT consists of three key modules: a generative adversarial imputation network for handling missing values, a sampling-based robust feature selection module to reduce noise and redundant features, and a multi-head self-attention (MHSA) based feature extractor with a novel architecture that enhances the capture of intra-omics interactions. We validated model performance using cancer datasets from The Cancer Genome Atlas (TCGA), conducting two tasks: survival time classification across different cancer types and estrogen receptor status classification for breast cancer. Our results show that DMOIT outperforms traditional machine learning methods and the state-of-the-art integration method MoGCN in terms of accuracy and weighted F1 score. Furthermore, we compared DMOIT with various alternative MHSA-based architectures to further validate our approach. Our results show that DMOIT outperforms these models in most settings across various cancer types and different omics combinations. The strong performance and robustness of DMOIT demonstrate its potential as a valuable tool for integrating multi-omics data across various applications.

1 Introduction

With the advent of high-throughput sequencing technologies, various types of omics data, including genomics, transcriptomics, and proteomics data, have become increasingly accessible. The pathogenesis of diseases often involves complex interactions across multiple biological levels and factors. Consequently, single-omics data provide only partial insights into biological processes, often failing to capture other critical factors and leading to an incomplete understanding of disease mechanisms. In contrast, multi-omics approaches offer the potential to reveal new biological insights that are not apparent when single-omics data are used alone (Yan et al., 2018). Therefore, the integration of multiple omics data is essential for achieving a comprehensive and complementary understanding of complex disease occurrence and progression, thereby further advancing personalized medicine (Hasin et al., 2017). However, integrating multi-omics data presents several challenges. Firstly, there is significant heterogeneity among different omics data types (genomics, transcriptomics, and proteomics), making integration complex due to varying data formats and scales (López de Maturana et al., 2019). Additionally, missing values and noise in the data can impact accuracy (Flores et al., 2023), while the large scale of multi-omics datasets demands substantial computational resources and efficient algorithms (Fondi and Liò, 2015). Lastly, combining information across different biological levels adds another layer of complexity, and interpreting and visualizing the integrated results can be challenging (Krassowski et al., 2020).

In the past decade, researchers have made significant progress in developing tools for multi-omics data integration. Dimension reduction-based methods have been foundational in multi-omics data integration. For example, canonical correlation analysis (Qi et al., 2021) is commonly used to evaluate the correlation between feature sets in different omics data. Principal component analysis transforms correlated variables into linearly uncorrelated principal components through orthogonal transformation. However, these approaches typically assume linear relationships between features and fail to capture nonlinear relationships. Deep learning-based models can better capture complex nonlinear relationships due to their multi-level neural network architecture, making them crucial tools in multi-omics data integration (He et al., 2023). Convolutional neural networks and recurrent neural networks are utilized to handle high-dimensional and nonlinear multi-omics data, extracting complex features from them (Kang et al., 2022). In addition, encoder-decoder models, such as the variational autoencoder (Hira et al., 2021) and the generative adversarial network (Ahmed et al., 2022), are widely used for the integration and generation of multi-omics data, achieving dimensionality reduction by obtaining intermediate latent feature representations. Graph convolutional networks (Li et al., 2022) facilitate efficient information dissemination and aggregation by modeling the complex relationships and graph structure characteristics between data and are also applied to multi-omics data integration.

In recent years, the attention mechanism has become a hot topic in deep learning-based integration methods. Researchers (Gong et al., 2023) have applied attention mechanisms to reduce dimensionality and learn feature representations for each omics data type. Another study (Pang et al., 2023) introduced a hierarchical attention layer based on the biological central dogma to enhance data integration effectiveness. Additionally, some researchers (Zhang et al., 2022) constructed a graph attention network and a group-level attention mechanism to learn embedding representations. These methods demonstrate the enormous potential of attention mechanisms in multi-omics data integration. However, existing attention-based methods often face limitations. Some approaches (Gong et al., 2023) simply input each omics dataset into a separate attention layer, focusing only on the intra-omics interactions and ignoring the inter-omics interactions. Other approaches concatenate multiple omics datasets before applying a single attention layer (Wang et al., 2024; Pan et al., 2023). Given the heterogeneity of omics data, this approach may not effectively attend to each data type, making it challenging to capture intra-omics interactions. Additionally, due to the high dimensionality of omics data, single-head self-attention mechanisms might be inadequate in capturing subtle interactions. Overall, these methods may not fully exploit the potential value in multi-omics data.

Effective data preprocessing is also critical for optimizing performance in multi-omics integration frameworks. The characteristics of high-dimensional omics data, such as missing values and significant noise, make multi-omics data integration challenging. It has been shown that missing values in high-dimensional omics data can adversely affect downstream analyses (Flores et al., 2023). Therefore, addressing missing values is essential for maintaining data quality. However, existing attention-based multi-omics integration methods often rely on simplistic imputation strategies such as zero, mean, or median imputation. These methods often fail to account for the complex correlations within omics data, potentially introducing unnecessary noise from imputed values. Moreover, discarding features with missing values might result in losing important information. Additionally, high-dimensional data often contain numerous redundant features that may be selected by chance and degrade performance. Therefore, feature selection is a vital preprocessing step aimed at reducing noisy features and effectively decreasing dimensionality. However, many integration frameworks select features based solely on the highest variance, which can lead to the inclusion of unstable features whose variance is inflated by noise, outliers, and data disturbances.

To improve feature relevance, reduce noise, and capture both intra- and inter-omics interactions, we propose DMOIT, a novel denoised multi-omics integration approach. As shown in Figure 1, DMOIT includes three main modules. First, the Generative Adversarial Imputation Network (GAIN) (Yoon et al., 2018) module is introduced to learn feature distributions and impute missing values. Second, the Robust Feature Selection (RFS) module, based on bootstrap sampling, is employed to identify a denoised and stable feature set. Finally, a feature extractor leveraging the transformer multi-head self-attention (MHSA) mechanism is constructed to integrate multi-omics data, capturing both intra- and inter-omics interactions. Empirical studies across various cancer types and omics combinations demonstrate that DMOIT significantly enhances performance in multi-omics data integration and analysis.

FIGURE 1

2 Materials and methods

2.1 Data acquisition and preprocessing

We evaluated our proposed framework using three types of omics data: mRNA expression profiles, DNA methylation (Met), and copy number variation (CNV). The datasets were obtained from the UCSC Xena web browser, a resource that includes multi-omics and clinical data of cancer patients from The Cancer Genome Atlas (TCGA) project. We retained only samples with complete data for all three types of omics and clinical information. To enhance the reliability of our results, we focused on the four TCGA cancer types with the most samples after filtering: breast invasive carcinoma (BRCA), head and neck squamous cell carcinoma (HNSC), liver hepatocellular carcinoma (LIHC), and stomach adenocarcinoma (STAD). To validate the effectiveness and robustness of our proposed model, we applied DMOIT to survival time classification tasks across these four cancer types and the estrogen receptor (ER) status classification task for BRCA. In our survival time classification task, rather than using the median survival time directly, we set the threshold to the integer nearest to the median to better align with practical clinical applications. Patients with survival times greater than the threshold are labeled as long-term survivors (LTS), while those with shorter survival times are labeled as non-long-term survivors (non-LTS). The distribution of survival time labels for each cancer type is shown in Table 1. For the ER status classification task, we obtained the clinical information from cBioPortal (Gao et al., 2013), categorizing 199 patients as ER positive (ER+) and 55 as ER negative (ER-). During data preprocessing, we first removed features with high missing value rates. CNV data were not included in the imputation process because they did not contain any missing values.
For the mRNA and Met data, we removed features with a 100% missing value rate and then applied min-max scaling to mitigate the impact of magnitude differences between features, to ensure that features contribute equally during variance filtering, and to enhance the performance of subsequent machine learning models. For CNV data, we coded the variations into three types: no change (0), decreased copy number (−1), and increased copy number (1). A bootstrap sampling-based feature selection module was applied to obtain the denoised feature set; a detailed explanation follows in the subsequent sections. We then imputed the mRNA and Met data using the GAIN model. Details of data preprocessing are provided in Supplementary Figures S1, S2.
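The labeling and scaling steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example survival times and their units are hypothetical, rounding half up at the median and assigning patients exactly at the threshold to non-LTS are assumptions (the text does not state either rule), and constant features are mapped to 0 by convention.

```python
import numpy as np

def label_long_term_survivors(survival_time):
    """Binarize survival times at the integer nearest to the median."""
    survival_time = np.asarray(survival_time, dtype=float)
    # Round half up; the tie-breaking rule is an assumption.
    threshold = int(np.floor(np.median(survival_time) + 0.5))
    # 1 = long-term survivor (strictly above the threshold), 0 = non-LTS.
    labels = (survival_time > threshold).astype(int)
    return labels, threshold

def min_max_scale(X):
    """Scale each feature (column) to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero
    return (X - col_min) / col_range

times = [0.5, 1.2, 2.6, 3.4, 7.9]  # hypothetical survival times
labels, threshold = label_long_term_survivors(times)
scaled = min_max_scale([[1, 10], [3, 10], [5, 10]])
```

Using an integer threshold keeps the class boundary easy to communicate, at the cost of a split that may be slightly less balanced than an exact median split.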

TABLE 1

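| Dataset | Total samples | LTS | Non-LTS |
|---------|---------------|-----|---------|
| BRCA    | 783           | 416 | 367     |
| HNSC    | 514           | 370 | 144     |
| LIHC    | 368           | 251 | 117     |
| STAD    | 366           | 223 | 143     |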

Sample distributions of long-term survivors (LTS) and non-long-term survivors (non-LTS) across the four cancer types.

2.2 Robust feature selection module

High-dimensional data often exhibit characteristics such as noise and redundant features. Noise can introduce random variations that obscure the true signal in the data, while redundant features can lead to overfitting and increased computational complexity. These issues may further result in unstable feature selection outcomes and degrade model performance. Traditionally, researchers integrating multi-omics data have relied on a single variance filter for feature pre-selection. However, including noisy samples, especially those with extreme values due to errors in sequencing technology or data entry, can skew the results during the single variance filtering step. This can lead to an overemphasis on certain features, causing the selection of unreliable or misleading features. To address this, we incorporated a robust feature selection (RFS) module that assesses feature stability using the bootstrap resampling technique, a widely recognized method for evaluating feature reliability. Our approach was inspired by the work of Cho et al. (2010), who employed a bootstrap method to select stable feature sets and proposed a new measure called Bootstrap Selection Stability. Furthermore, previous studies have demonstrated that feature selection results can be significantly influenced by data disturbances; even minor alterations in the sample data can lead to substantial changes in outcomes. To address this issue, Pes (2020) also proposed employing bootstrap sampling to conduct multiple feature selections, subsequently identifying a stable feature set based on the frequency of selection. This highlights the necessity for robust feature selection methods, especially in multi-omics analyses. In our RFS module, we generated ten bootstrap samples for each type of omics data. Bootstrap sampling involves creating multiple subsets of the original data via random sampling with replacement.
Bootstrap sampling simulates data variability and helps to assess the consistency of feature importance under different perturbations. For each bootstrap sample, we filtered the features with the highest variance, as these are typically more informative. We then selected the top features based on their frequency of selection across all bootstrap samples. The RFS module ensures that only consistently selected features are chosen, thus enhancing the relevance and robustness of the final feature set.
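The procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the `n_select` parameter, the tie-breaking by feature index, and the synthetic data are all hypothetical; only the ten bootstrap samples, the variance filter, and the ranking by selection frequency come from the text.

```python
import numpy as np

def robust_feature_selection(X, n_select, n_bootstrap=10, seed=0):
    """Return indices of the features most often ranked top-variance."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features, dtype=int)
    for _ in range(n_bootstrap):
        # Bootstrap sample: draw n_samples rows with replacement.
        rows = rng.integers(0, n_samples, size=n_samples)
        variances = X[rows].var(axis=0)
        # Features with the highest variance in this resample.
        top = np.argsort(variances)[::-1][:n_select]
        counts[top] += 1
    # Rank by selection frequency; the stable sort breaks ties by index.
    order = np.argsort(-counts, kind="stable")
    return order[:n_select], counts

# Synthetic data: the first three features have much larger variance.
rng = np.random.default_rng(42)
scales = np.ones(20)
scales[:3] = 5.0
X = rng.normal(size=(100, 20)) * scales
selected, counts = robust_feature_selection(X, n_select=3)
```

A feature whose variance is driven by one or two extreme samples will drop out of the top set in resamples that happen to exclude those samples, so its selection count falls below that of genuinely variable features.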

2.3 Generative adversarial imputation network in multi-omics integration

Previous studies (Gunady et al., 2019; Xu et al., 2020; Wang et al., 2023) have shown that generative adversarial network (GAN)-based methods achieve promising results in imputing mRNA expression data. However, the stability of GANs in multi-omics data integration has not been extensively investigated. Given GANs’ ability to learn and mimic a wide range of data distributions (Yoon et al., 2018), we hypothesized that they could handle multi-omics data imputation effectively, mitigating noise from missing values. GANs work by training two networks simultaneously: a generator that produces realistic data, and a discriminator that distinguishes between real and synthetic samples. This adversarial training process enables GANs to learn the underlying data distribution and generate samples that closely align with the true data, thereby reducing noise and improving imputation quality. To avoid introducing additional noise by overwriting true observations, we replaced only the missing values in the original data after completing the data generation process. In our study, we imputed mRNA and Met data based on the learned distribution. CNV data were not included in the imputation process because they did not contain any missing values.
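The final step described above, keeping observed entries and filling only the missing ones with generator output, can be sketched as below. This is an illustration only: `generated` is a hypothetical stand-in for the output of a trained GAIN generator (training is not shown), and encoding missing entries as NaN is an assumption.

```python
import numpy as np

def combine_observed_and_generated(x, generated):
    """Keep observed entries of x; fill missing (NaN) entries from generated."""
    x = np.asarray(x, dtype=float)
    generated = np.asarray(generated, dtype=float)
    mask = ~np.isnan(x)  # True where a value was actually observed
    # Equivalent to mask * x + (1 - mask) * generated, but safe with NaN.
    imputed = np.where(mask, x, generated)
    return imputed, mask

x = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
generated = np.array([[9.0, 2.0],   # hypothetical generator output
                      [3.0, 9.0]])
imputed, mask = combine_observed_and_generated(x, generated)
```

Here the generator's values at observed positions (the 9.0 entries) are discarded, so imputation cannot perturb measured data.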

2.4 Multi-head self-attention based multi-omics data integration

Self-attention allows the model to weigh the importance of different features within the same omics dataset. This capability is crucial for capturing the intricate dependencies and interactions between features both within individual layers and across different omics layers. The self-attention mechanism is mathematically described as follows (Vaswani et al., 2017):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$ (query), $K$ (key), and $V$ (value) are matrices derived from the same set of omics features, and where $d_k$ represents the dimensionality of the keys. This mechanism enables the model to focus on different features and determine which features are most relevant to each other. Expanding on self-attention, the multi-head self-attention (MHSA) mechanism incorporates multiple attention heads to capture a variety of relationships among features. The MHSA mechanism is described as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$$

where each head is calculated as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

Here, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learned weight projection matrices for the $i$-th head, and $W^{O}$ is a weight matrix applied to the concatenated outputs of all $h$ heads. In the DMOIT framework, we design an MHSA mechanism-based feature extractor with a novel architecture to effectively integrate multi-omics data. Compared with the single-head self-attention mechanism, the MHSA can effectively capture features from various perspectives and subspaces by processing the input data through multiple attention heads, thereby enhancing the model’s ability to detect complex interactions and improving its robustness and stability. Additionally, the MHSA enables parallel computation, significantly increasing efficiency and providing notable advantages, particularly for large-scale multi-omics datasets. Our architecture leverages the MHSA mechanism to capture both intra- and inter-omics interactions effectively. Specifically, each omics dataset is input into a separate encoder to fully learn the intra-omics interactions, reducing noise and improving the signal quality. Simultaneously, the concatenated omics data are fed into a shared MHSA-based encoder to capture the inter-omics interactions. Because the shared encoder operates on the original concatenated input rather than on already-encoded representations, interactions between different omics types are preserved and learned without information being discarded beforehand. The outputs from the individual and shared encoders are then combined and passed into a multilayer perceptron (MLP) for final prediction. This dual architecture ensures comprehensive learning of both intra- and inter-omics interactions, providing a thorough analysis of individual omics data while maintaining the integrity of inter-omics interactions.
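The scaled dot-product attention and multi-head formulation of Vaswani et al. (2017) described above can be written out directly in NumPy. This is a minimal sketch for illustration, not the DMOIT implementation (which the paper states was built with PyTorch): the weights are random rather than learned, and the token count, model width, and number of heads are arbitrary assumptions.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (n_tokens, d_model); W_q, W_k, W_v: (h, d_model, d_k); W_o: (h * d_k, d_model)."""
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        # Self-attention: queries, keys, and values all come from X.
        head, _ = attention(X @ W_qi, X @ W_ki, X @ W_vi)
        heads.append(head)
    # Concatenate the heads and apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n_tokens, d_model, h = 6, 8, 2
d_k = d_model // h
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
_, weights = attention(X @ W_q[0], X @ W_k[0], X @ W_v[0])
```

Each row of `weights` is a probability distribution over the input tokens, which is what lets a head be read as "how much each feature attends to every other feature".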

2.5 Comparative multi-omics data integration methods

We employed four traditional machine learning models as baselines: logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). These models handle multiple omics datasets by concatenating them. Additionally, we included the state-of-the-art method MoGCN. To validate our proposed architecture, we compared it with three alternative MHSA-based feature extractors with different architectures (Figure 2). In these frameworks, omics data are first input into a linear layer before entering the multi-layer attention layers to enhance their representation.

  • Model 1 (M1): This model inputs each omics dataset separately into an MHSA layer to capture intra-omics interactions. The outputs from each encoder are then concatenated and passed through an MLP for classification. While M1 excels at capturing intra-omics interactions, it does not address inter-omics interactions.

  • Model 2 (M2): Here, multiple omics datasets are concatenated before being input into a single MHSA layer. M2 effectively learns inter-omics interactions and maintains information completeness. However, due to varying noise levels and heterogeneity among omics datasets, it may struggle to adequately attend to each type of omics data, making it challenging to capture intra-omics interactions.

  • Model 3 (M3): This model processes each omics dataset with separate MHSA layers to learn intra-omics interactions. The outputs from these layers are then concatenated and fed into an additional MHSA layer to further capture inter-omics interactions. While M3 aims to capture both intra- and inter-omics interactions, there is a risk of losing some inter-omics interaction information during the initial independent processing.
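The difference between M1, M2, M3, and the DMOIT dual architecture is mainly one of wiring, which the toy sketch below tries to make concrete. Everything in it is an assumption made for illustration: the encoder is a single-head self-attention layer with random, untrained weights standing in for an MHSA encoder, each omics dataset is represented as a small matrix of tokens, concatenation is along the token axis, mean pooling produces the vector that would be passed to the MLP, and the initial linear layer and the MLP itself are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(d):
    """Toy self-attention layer with random weights (stand-in for an MHSA encoder)."""
    W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

    def encode(tokens):  # tokens: (n_tokens, d)
        scores = (tokens @ W_q) @ (tokens @ W_k).T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ (tokens @ W_v)

    return encode

def pooled(tokens):
    return tokens.mean(axis=0)  # (d,) summary vector for the classifier

d = 8
omics = {"mRNA": rng.normal(size=(5, d)),  # hypothetical token matrices
         "MET": rng.normal(size=(4, d)),
         "CNV": rng.normal(size=(3, d))}
separate = {name: make_encoder(d) for name in omics}
shared, second_stage = make_encoder(d), make_encoder(d)
concat_in = np.concatenate(list(omics.values()), axis=0)

# M1: one encoder per omics type, outputs concatenated (intra-omics only).
m1 = np.concatenate([pooled(separate[n](x)) for n, x in omics.items()])
# M2: a single encoder over the concatenated input (inter-omics only).
m2 = pooled(shared(concat_in))
# M3: per-omics encoders, then a second encoder over their concatenated outputs.
encoded = np.concatenate([separate[n](x) for n, x in omics.items()], axis=0)
m3 = pooled(second_stage(encoded))
# DMOIT-style: per-omics encoders and a shared encoder on the raw
# concatenation run in parallel, and their outputs are combined.
dmoit = np.concatenate([m1, pooled(shared(concat_in))])
```

The key contrast is between M3 and the dual design: in M3 the inter-omics encoder only sees already-encoded tokens, whereas in the dual design the shared encoder sees the original concatenated input.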

FIGURE 2

2.6 Evaluation methods

We designed two classification tasks to validate each module in our proposed framework: a survival time classification across four cancer types and an ER status classification for breast cancer. Both tasks used 5-fold cross-validation to ensure the robustness of our results. Model performance was evaluated using the mean accuracy and weighted F1-score metrics from the cross-validation.
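For reference, the weighted F1-score is the support-weighted mean of the per-class F1 scores. The sketch below computes it by hand and generates 5-fold splits. It is illustrative only: the function names and toy labels are hypothetical, and the plain shuffled (non-stratified) split is an assumption, since the text does not say whether folds were stratified.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * np.mean(y_true == c)  # weight by class support
    return float(total)

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index arrays for shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = weighted_f1(y_true, y_pred)       # 0.25 * (2/3) + 0.75 * 0.8
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
folds = list(k_fold_indices(10, k=5))
```

Because the weights follow class frequencies, the weighted F1-score remains informative on imbalanced tasks such as ER status, where accuracy alone can be dominated by the majority class.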

2.7 Training of the DMOIT and other MHSA-based comparison models

The MHSA-based models were developed using PyTorch (version 2.1.2) and scikit-learn. We trained the model for 50 epochs and used a grid search to identify the optimal parameters, utilizing the Adam optimizer for training. To balance performance and computational efficiency, the encoders in each model share identical parameters. The grid parameter combinations are detailed in Table 2.

TABLE 2

| Hyperparameter | Possible values |
|----------------|-----------------|
| learn_rate     | [0.001, 0.01]   |
| batch_size     | [32, 64, 128]   |
| num_heads      | [2, 4, 8]       |
| num_blocks     | [1, 2, 3]       |
| dropout_rate   | [0.01, 0.1]     |
| dense_dim      | [32, 64, 128]   |

Hyperparameter settings for grid search.
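To give a sense of the search space, the grid in Table 2 can be enumerated with the standard library. This is a sketch of the enumeration only; how each configuration was trained and scored is not shown, and the variable names are illustrative.

```python
from itertools import product

# Values taken from Table 2.
param_grid = {
    "learn_rate": [0.001, 0.01],
    "batch_size": [32, 64, 128],
    "num_heads": [2, 4, 8],
    "num_blocks": [1, 2, 3],
    "dropout_rate": [0.01, 0.1],
    "dense_dim": [32, 64, 128],
}

# One dictionary per hyperparameter combination.
configs = [dict(zip(param_grid, values))
           for values in product(*param_grid.values())]
n_configs = len(configs)  # 2 * 3 * 3 * 3 * 2 * 3 = 324
```

With 324 combinations, each evaluated under 5-fold cross-validation, an exhaustive search implies on the order of 1,620 training runs per model and dataset.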

3 Results

3.1 Evaluation of the GAIN imputation method

Omics data are typically high-dimensional and often contain a large proportion of missing values. In the dataset we downloaded from UCSC Xena, these missing values were treated as zeros. However, it remains uncertain whether these zeros are biologically meaningful or the result of technical issues during sequencing. In this study, we compared the impact of non-imputed data (where missing values were retained as zeros) versus data imputed using various imputation methods on survival time classification tasks. We applied these methods to impute the mRNA and Met omics data from four cancer types and evaluated the results with the XGBoost classifier, chosen for its superior performance among the traditional machine learning models we tested and for its lower computational cost compared with deep learning approaches. Traditional imputation methods commonly used in previous studies, such as mean, median, and K-nearest neighbor (KNN) imputation (Pujianto et al., 2019), fail to distinguish between biological zeros and technical zeros, treating them uniformly. In contrast, GAN-based imputation is more likely to preserve biologically meaningful zeros by learning the underlying feature distribution, which in turn enhances the performance of downstream analyses. Table 3 reports the mean accuracy and weighted F1 score from 5-fold cross-validation on the test set for each cancer type, together with the averages across the four cancer types. Among the methods tested, the GAN-based GAIN module achieved the highest average testing accuracy of 0.655 and average weighted F1 score of 0.614, outperforming KNN imputation (accuracy: 0.640, F1: 0.600), mean imputation (accuracy: 0.641, F1: 0.596), median imputation (accuracy: 0.648, F1: 0.606), and zero imputation (accuracy: 0.641, F1: 0.596).
These findings also suggest that GANs have potential for generalizing to other omics types and highlight their promise for robust and reliable imputation across diverse omics datasets.

TABLE 3

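| Dataset                   | Method | Accuracy | Weighted F1 |
|---------------------------|--------|----------|-------------|
| BRCA                      | zero   | 0.622    | 0.621       |
|                           | median | 0.623    | 0.622       |
|                           | mean   | 0.622    | 0.621       |
|                           | KNN    | 0.617    | 0.614       |
|                           | GAIN   | 0.633    | 0.633       |
| HNSC                      | zero   | 0.704    | 0.612       |
|                           | median | 0.708    | 0.610       |
|                           | mean   | 0.704    | 0.612       |
|                           | KNN    | 0.696    | 0.609       |
|                           | GAIN   | 0.708    | 0.623       |
| LIHC                      | zero   | 0.641    | 0.596       |
|                           | median | 0.666    | 0.627       |
|                           | mean   | 0.641    | 0.596       |
|                           | KNN    | 0.671    | 0.632       |
|                           | GAIN   | 0.671    | 0.632       |
| STAD                      | zero   | 0.596    | 0.555       |
|                           | median | 0.596    | 0.566       |
|                           | mean   | 0.596    | 0.555       |
|                           | KNN    | 0.577    | 0.546       |
|                           | GAIN   | 0.607    | 0.567       |
| Mean across 4 cancer types | zero   | 0.641    | 0.596       |
|                           | median | 0.648    | 0.606       |
|                           | mean   | 0.641    | 0.596       |
|                           | KNN    | 0.640    | 0.600       |
|                           | GAIN   | 0.655    | 0.614       |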

Performance of different imputation methods.

Note: The accuracy and weighted F1 score are the averages from 5-fold cross-validation of each cancer type and across four different cancer types. The bold values represent the highest accuracy/F1 score for downstream classification tasks achieved across all datasets imputed by different imputation methods for the current cancer type.

3.2 Evaluation of the bootstrap-based robust feature selection module

To evaluate the effectiveness of the robust feature selection (RFS) module within the DMOIT framework, we compared it with a direct feature selection method that identifies the top features without bootstrap sampling. Both approaches were assessed through survival time classification tasks with the XGBoost classifier across four cancer types. As shown in Table 4, the RFS module consistently outperformed the direct selection method. For instance, in STAD, accuracy increased from 0.569 to 0.601 and the weighted F1 score rose from 0.534 to 0.559. In HNSC, accuracy improved from 0.702 to 0.714, with the weighted F1 score rising from 0.607 to 0.631. Similar improvements in both accuracy and weighted F1 score were observed in LIHC and BRCA. These results suggest that the RFS module enhances feature selection by handling noise and selecting a stable feature set. It mitigates the impact of data distribution imbalance and dataset-specific sensitivities in high-dimensional data, leading to improved feature relevance and overall performance.

TABLE 4

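| Dataset | Accuracy (RFS) | Accuracy (Direct) | Weighted F1 (RFS) | Weighted F1 (Direct) |
|---------|----------------|-------------------|-------------------|----------------------|
| STAD    | 0.601          | 0.569             | 0.559             | 0.534                |
| LIHC    | 0.666          | 0.649             | 0.622             | 0.610                |
| HNSC    | 0.714          | 0.702             | 0.631             | 0.607                |
| BRCA    | 0.627          | 0.609             | 0.626             | 0.608                |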

Performance comparison of different feature selection methods.

Note: “Direct” denotes direct filtering of the features based on variance, and “RFS” denotes the bootstrap-based robust feature selection module. The accuracy and weighted F1 score are the averages of 5-fold cross validation in the survival time classification task using XGBoost. The bold values represent the highest accuracy/F1 score between the two feature selection methods for the current cancer type.

3.3 Performance of DMOIT under different omics combinations

To evaluate the effectiveness and stability of DMOIT in learning intra- and inter-omics interactions, we compared it against four traditional machine learning baseline models, the state-of-the-art MoGCN method, and three alternative MHSA-based architectures, as detailed in the methods section. This evaluation was conducted on both survival time classification and ER status classification tasks. We tested the generalization ability of these models by comparing their performance across different omics combinations in various cancer types, including single-omics data, paired combinations, and an integrated dataset comprising all three omics types.

To validate how direct concatenation of multiple omics datasets increases data heterogeneity, we first investigated the changes in data properties before and after concatenation. Different features may exhibit distinct data types; for instance, in our study, mRNA and MET are continuous variables, while CNV is a discrete variable. Additionally, distributional differences among omics of the same data type may persist, indicating that heterogeneity is likely to increase after concatenation, as illustrated in the histograms of mean expression levels, coefficient of variation, and Shannon entropy shown in Supplementary Figures S4–S6. This increased heterogeneity may hinder the attention mechanism’s ability to fully capture intra-omics interactions within a specific omics dataset. We further demonstrated the complexity of omics data, particularly the hierarchical clustering and nonlinear relationships among variables, which highlights the need for multi-head attention mechanisms to learn the complex relationships among omics features compared to statistical models and traditional machine learning models. This is illustrated by the correlation heatmap of BRCA mRNA omics in Supplementary Figure S7, as well as the LOESS curve and polynomial fitting for the top 5 mRNA biomarkers in Supplementary Figures S8, S9. These factors present significant challenges for multi-omics integration. Unlike simple concatenation, DMOIT effectively addresses these issues using a multi-head attention mechanism. By learning from each omics dataset individually, we reduce the impact of heterogeneity on intra-omics interactions. Meanwhile, the learning from concatenated data ensures the completeness of inter-omics interactions.

In the survival time classification task (Table 5), DMOIT achieved the highest accuracy and the weighted F1 score across all the omics data combinations in HNSC and LIHC and performed better in at least two out of four omics combinations in BRCA and STAD. For the ER status classification task using all three omics types (Table 6), DMOIT achieved the highest weighted F1 score of 0.937. Although its accuracy was slightly lower than M3, it remains a more efficient choice due to lower computational complexity. Additionally, the original ER dataset exhibited class imbalance; therefore, we conducted a simple experiment to test the impact of varying degrees of data imbalance on DMOIT. We created scenarios with ER positive-to-negative ratios of approximately 1:1, 2:1, and 3:1 by randomly sampling from the larger set of ER-positive cases. As shown in Table 7, DMOIT performed similarly across different levels of imbalance, with accuracy and weighted F1 scores as follows: for the 1:1 ratio, accuracy was 0.927 and the weighted F1 score was 0.927; for the 2:1 ratio, accuracy was 0.933 and the weighted F1 score was 0.932; and for the 3:1 ratio, accuracy was 0.914 and the weighted F1 score was 0.910. Our findings reveal that MHSA-based models, particularly DMOIT, consistently outperform both traditional models and MoGCN. Among the four MHSA-based models, DMOIT consistently exhibits superior performance across various cancer datasets and omics combinations in most scenarios on different tasks, highlighting its effectiveness and robustness in managing complex intra- and inter-omics interactions. Furthermore, our results indicate that mRNA provides the most informative data among single-omics datasets and incorporating a third omics type does not necessarily enhance performance.
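The imbalance experiment described above amounts to keeping all ER-negative cases and randomly subsampling the ER-positive cases to a target ratio. A minimal sketch follows; the function name, the random seed, and the exact subsample sizes are assumptions (the text only says the ratios were approximate), while the 199 ER+ and 55 ER- counts come from Section 2.1.

```python
import numpy as np

def subsample_to_ratio(y, ratio, seed=0):
    """Keep all negatives and draw ratio * n_negative positives without replacement."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_pos = min(len(pos), int(round(ratio * len(neg))))
    keep_pos = rng.choice(pos, size=n_pos, replace=False)
    return np.sort(np.concatenate([keep_pos, neg]))

y = np.array([1] * 199 + [0] * 55)  # ER+ / ER- labels
n_positive = {r: int(y[subsample_to_ratio(y, r)].sum()) for r in (1, 2, 3)}
n_total = {r: len(subsample_to_ratio(y, r)) for r in (1, 2, 3)}
```

Under these assumptions the three scenarios contain 55, 110, and 165 ER-positive samples against the same 55 ER-negative samples, so the minority class is never reduced.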

TABLE 5

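Accuracy

| Cancer type | Omics data       | LR    | RF    | SVM   | XGB   | M1    | M2    | M3    | DMOIT | MoGCN |
|-------------|------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| BRCA        | mRNA             | 0.586 | 0.595 | 0.613 | 0.618 | 0.670 |       |       |       |       |
|             | MET              | 0.544 | 0.561 | 0.531 | 0.598 | 0.618 |       |       |       |       |
|             | CNV              | 0.498 | 0.503 | 0.494 | 0.516 | 0.595 |       |       |       |       |
|             | mRNA + MET       | 0.590 | 0.586 | 0.582 | 0.633 | 0.667 | 0.667 | 0.669 | 0.669 | 0.530 |
|             | mRNA + CNV       | 0.561 | 0.602 | 0.527 | 0.614 | 0.650 | 0.641 | 0.651 | 0.655 | 0.512 |
|             | MET + CNV        | 0.535 | 0.542 | 0.493 | 0.593 | 0.605 | 0.613 | 0.612 | 0.607 | 0.506 |
|             | mRNA + MET + CNV | 0.563 | 0.604 | 0.516 | 0.627 | 0.662 | 0.649 | 0.669 | 0.653 | 0.525 |
| HNSC        | mRNA             | 0.652 | 0.704 | 0.718 | 0.691 | 0.728 |       |       |       |       |
|             | MET              | 0.665 | 0.712 | 0.720 | 0.702 | 0.730 |       |       |       |       |
|             | CNV              | 0.615 | 0.691 | 0.712 | 0.687 | 0.730 |       |       |       |       |
|             | mRNA + MET       | 0.667 | 0.720 | 0.720 | 0.708 | 0.728 | 0.730 | 0.732 | 0.732 | 0.510 |
|             | mRNA + CNV       | 0.628 | 0.708 | 0.716 | 0.696 | 0.734 | 0.734 | 0.739 | 0.741 | 0.553 |
|             | MET + CNV        | 0.650 | 0.710 | 0.720 | 0.706 | 0.737 | 0.737 | 0.737 | 0.739 | 0.531 |
|             | mRNA + MET + CNV | 0.636 | 0.716 | 0.720 | 0.714 | 0.737 | 0.736 | 0.739 | 0.740 | 0.533 |
| LIHC        | mRNA             | 0.650 | 0.682 | 0.685 | 0.669 | 0.755 |       |       |       |       |
|             | MET              | 0.628 | 0.671 | 0.679 | 0.660 | 0.739 |       |       |       |       |
|             | CNV              | 0.611 | 0.636 | 0.682 | 0.639 | 0.717 |       |       |       |       |
|             | mRNA + MET       | 0.633 | 0.682 | 0.679 | 0.671 | 0.755 | 0.745 | 0.750 | 0.763 | 0.658 |
|             | mRNA + CNV       | 0.622 | 0.685 | 0.679 | 0.674 | 0.734 | 0.728 | 0.734 | 0.742 | 0.669 |
|             | MET + CNV        | 0.650 | 0.668 | 0.674 | 0.655 | 0.731 | 0.712 | 0.725 | 0.739 | 0.663 |
|             | mRNA + MET + CNV | 0.644 | 0.685 | 0.677 | 0.666 | 0.747 | 0.731 | 0.736 | 0.748 | 0.649 |
| STAD        | mRNA             | 0.555 | 0.623 | 0.601 | 0.590 | 0.670 |       |       |       |       |
|             | MET              | 0.481 | 0.582 | 0.609 | 0.558 | 0.642 |       |       |       |       |
|             | CNV              | 0.530 | 0.538 | 0.596 | 0.538 | 0.650 |       |       |       |       |
|             | mRNA + MET       | 0.506 | 0.590 | 0.604 | 0.607 | 0.677 | 0.675 | 0.672 | 0.678 | 0.577 |
|             | mRNA + CNV       | 0.511 | 0.607 | 0.598 | 0.577 | 0.642 | 0.645 | 0.647 | 0.648 | 0.585 |
|             | MET + CNV        | 0.462 | 0.555 | 0.607 | 0.536 | 0.642 | 0.656 | 0.645 | 0.642 | 0.609 |
|             | mRNA + MET + CNV | 0.503 | 0.612 | 0.601 | 0.601 | 0.650 | 0.649 | 0.650 | 0.651 | 0.563 |

Weighted F1-Score

| Cancer type | Omics data       | LR    | RF    | SVM   | XGB   | M1    | M2    | M3    | DMOIT | MoGCN |
|-------------|------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| BRCA        | mRNA             | 0.586 | 0.594 | 0.612 | 0.617 | 0.658 |       |       |       |       |
|             | MET              | 0.543 | 0.556 | 0.521 | 0.594 | 0.598 |       |       |       |       |
|             | CNV              | 0.493 | 0.474 | 0.452 | 0.503 | 0.557 |       |       |       |       |
|             | mRNA + MET       | 0.590 | 0.584 | 0.579 | 0.633 | 0.634 | 0.655 | 0.655 | 0.655 | 0.541 |
|             | mRNA + CNV       | 0.560 | 0.600 | 0.505 | 0.613 | 0.648 | 0.616 | 0.650 | 0.651 | 0.534 |
|             | MET + CNV        | 0.535 | 0.535 | 0.468 | 0.590 | 0.577 | 0.572 | 0.578 | 0.574 | 0.544 |
|             | mRNA + MET + CNV | 0.562 | 0.602 | 0.500 | 0.626 | 0.649 | 0.618 | 0.660 | 0.619 | 0.548 |
| HNSC        | mRNA             | 0.619 | 0.595 | 0.602 | 0.597 | 0.620 |       |       |       |       |
|             | MET              | 0.625 | 0.615 | 0.603 | 0.615 | 0.631 |       |       |       |       |
|             | CNV              | 0.593 | 0.597 | 0.599 | 0.616 | 0.690 |       |       |       |       |
|             | mRNA + MET       | 0.634 | 0.609 | 0.603 | 0.623 | 0.623 | 0.647 | 0.630 | 0.648 | 0.523 |
|             | mRNA + CNV       | 0.616 | 0.597 | 0.601 | 0.612 | 0.656 | 0.652 | 0.667 | 0.678 | 0.558 |
|             | MET + CNV        | 0.624 | 0.601 | 0.603 | 0.617 | 0.651 | 0.660 | 0.661 | 0.672 | 0.535 |
|             | mRNA + MET + CNV | 0.612 | 0.604 | 0.603 | 0.631 | 0.669 | 0.666 | 0.662 | 0.669 | 0.535 |
| LIHC        | mRNA             | 0.629 | 0.627 | 0.595 | 0.633 | 0.722 |       |       |       |       |
|             | MET              | 0.606 | 0.587 | 0.552 | 0.625 | 0.679 |       |       |       |       |
|             | CNV              | 0.588 | 0.567 | 0.553 | 0.578 | 0.656 |       |       |       |       |
|             | mRNA + MET       | 0.609 | 0.622 | 0.566 | 0.632 | 0.724 | 0.710 | 0.721 | 0.733 | 0.716 |
|             | mRNA + CNV       | 0.605 | 0.632 | 0.552 | 0.631 | 0.682 | 0.660 | 0.697 | 0.701 | 0.664 |
|             | MET + CNV        | 0.629 | 0.579 | 0.549 | 0.610 | 0.693 | 0.643 | 0.676 | 0.694 | 0.678 |
|             | mRNA + MET + CNV | 0.622 | 0.629 | 0.550 | 0.622 | 0.712 | 0.682 | 0.693 | 0.727 | 0.697 |
| STAD        | mRNA             | 0.542 | 0.573 | 0.526 | 0.558 | 0.647 |       |       |       |       |
|             | MET              | 0.473 | 0.519 | 0.461 | 0.521 | 0.572 |       |       |       |       |
|             | CNV              | 0.523 | 0.473 | 0.459 | 0.511 | 0.616 |       |       |       |       |
|             | mRNA + MET       | 0.491 | 0.541 | 0.468 | 0.566 | 0.630 | 0.623 | 0.637 | 0.638 | 0.607 |
|             | mRNA + CNV       | 0.506 | 0.560 | 0.456 | 0.544 | 0.566 | 0.600 | 0.601 | 0.602 | 0.594 |
|             | MET + CNV        | 0.458 | 0.492 | 0.460 | 0.501 | 0.592 | 0.569 | 0.578 | 0.574 | 0.557 |
|             | mRNA + MET + CNV | 0.496 | 0.555 | 0.457 | 0.559 | 0.604 | 0.594 | 0.605 | 0.592 | 0.576 |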

Performance of various machine learning models on multi-omics data for different cancer types in survival time classification task.

Note: The omics data combinations used include mRNA, MET, CNV, and their integrations. The performance metrics are accuracy and weighted F1 score, averaged over 5-fold cross-validation. The highest values for each metric in each cancer type and omics combination are highlighted in bold.
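
As an illustration of how the two metrics reported in Table 5 can be computed, the following minimal Python sketch implements accuracy and the support-weighted F1 score and averages them over cross-validation folds. The function names (`accuracy`, `weighted_f1`, `mean_over_folds`) and the toy labels are hypothetical, and the sketch assumes the usual definition of the weighted F1 score (per-class F1 weighted by the number of true instances of each class). It is not the evaluation code used in the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


def mean_over_folds(fold_results, metric):
    """Average a metric over folds; fold_results is a list of (y_true, y_pred)."""
    return sum(metric(t, p) for t, p in fold_results) / len(fold_results)


# Toy example with two folds (illustrative labels only).
fold_results = [
    ([1, 1, 1, 0], [1, 1, 0, 0]),
    ([1, 0, 1, 0], [1, 0, 1, 0]),
]
print(mean_over_folds(fold_results, accuracy))
print(mean_over_folds(fold_results, weighted_f1))
```

Because the weights follow the true class frequencies, the weighted F1 score can diverge from accuracy when a model favours the majority class, which is why both are reported for the imbalanced ER task.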

TABLE 6

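| Models | Accuracy | Weighted F1 |
|---|---|---|
| LR | 0.890 | 0.884 |
| RF | 0.917 | 0.914 |
| SVM | 0.878 | 0.866 |
| Xgboost | 0.925 | 0.923 |
| M1 | 0.933 | 0.930 |
| M2 | 0.933 | 0.933 |
| M3 | 0.945 | 0.928 |
| DMOIT | 0.937 | 0.937 |
| MoGCN | 0.909 | 0.916 |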

Performance comparison of various models in the ER classification task using all three types of omics data.

Note: The performance metrics are accuracy and the weighted F1 score, which are averaged over 5-fold cross-validation. The highest values for each metric are highlighted in bold.

TABLE 7

| ER (+): ER (−) | Accuracy | Weighted F1 |
|---|---|---|
| 1:1 | 0.927 | 0.927 |
| 2:1 | 0.933 | 0.932 |
| 3:1 | 0.914 | 0.910 |
| 199:55 | 0.937 | 0.937 |

Performance of DMOIT across different class ratios in the ER classification task.
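
The subsampling behind Table 7 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function `subsample_to_ratio`, the fixed seed, and the reading of the 199:55 row as raw class counts are assumptions made for the example.

```python
import random


def subsample_to_ratio(labels, ratio, seed=0):
    """Keep every negative (0) and a random subset of positives (1) so that
    positives:negatives is approximately ratio:1. Returns sorted indices."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = min(len(pos), round(ratio * len(neg)))
    return sorted(rng.sample(pos, n_pos) + neg)


# Assumed class counts: 199 ER-positive and 55 ER-negative samples.
labels = [1] * 199 + [0] * 55
for ratio in (1, 2, 3):
    idx = subsample_to_ratio(labels, ratio)
    n_pos = sum(labels[i] for i in idx)
    print(f"{ratio}:1 -> {n_pos} ER+ / {len(idx) - n_pos} ER-")
```

Only the majority class is subsampled, so every scenario is evaluated on the same ER-negative cases and differs only in how many ER-positive cases are retained.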

3.4 Biological findings from DMOIT in estrogen receptor status

To identify potential biomarkers, we employed a permutation importance approach to rank the most important features (Fisher et al., 2019). Specifically, we evaluated the contribution of each feature by shuffling its values, one feature at a time, during model training while keeping the optimal parameters obtained with the full feature set. We then measured the drop in the weighted F1 score to assess how much disrupting each feature reduced the model’s predictive ability (Supplementary Figure S3). We selected the features with a relatively large decline compared with the others, specifically the top 10 from the mRNA dataset, the top 4 from the MET dataset, and the top 14 from the CNV dataset, as shown in Table 8.
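
A permutation importance ranking of this kind can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the procedure used in the study: it permutes one column at a time at evaluation time for an already fitted model (the common form of permutation importance), whereas the text describes shuffling during training, which would additionally require refitting. The names `permutation_importance`, `predict`, and `accuracy` are hypothetical, and accuracy stands in for the weighted F1 score to keep the example short.

```python
import random


def permutation_importance(predict, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = metric(y, predict(X))
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - metric(y, predict(X_perm))
        drops.append(total / n_repeats)
    return drops


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(X):
    """Toy 'fitted model': uses feature 0 only and ignores feature 1."""
    return [1 if row[0] > 0.5 else 0 for row in X]


data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = predict(X)

drops = permutation_importance(predict, X, y, accuracy)
ranking = sorted(range(len(drops)), key=lambda j: drops[j], reverse=True)
print(drops, ranking)
```

In this toy setting the unused feature has zero importance by construction, while shuffling the informative feature produces a clear performance drop, so it is ranked first.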

TABLE 8

| Omics type | Potential biomarkers |
|---|---|
| mRNA | PLA2G6, SLC25A38, SLC25A26, FARS2, C2orf15, RBX1, MRFAP1L1, DYNC2LI1, NDUFC1, TRUB2 |
| MET | cg18021992, cg17387069, cg24500294, cg02776659 |
| CNV | MRC2, ATG2A, FRMD8, ARSG, GRB7, HGSNAT, SAC3D1, BATF2, SNX32, OVOL1, RHOD, LRFN4, RCE1, KPNA2 |

Potential biomarkers discovered through the DMOIT in the ER classification task.

We reviewed previous studies on the top-ranked features from the mRNA and CNV datasets. PLA2G6 and SLC25A26 have been implicated in the development of various tumors (Li et al., 2017; Gao et al., 2024), though their link to breast cancer is still underexplored. Elevated expression of FARS2 and TRUB2 has been noted in breast cancer tissues, and C2orf15 expression correlates significantly with breast cancer prognosis, although their link to ER status needs further investigation (Sung et al., 2022; Tau et al., 2024; Mi et al., 2024). SLC25A38 is known to upregulate ER expression, while RBX1 is essential for the degradation of ERα protein, playing a critical role in estrogen signaling (Lu et al., 2023; Marconett et al., 2010). NDUFC1, important in mitochondrial metabolism, has been found to be more critical in ER+ breast cancer cells, suggesting a metabolic vulnerability in this subtype (Tau et al., 2024). MRFAP1L1 and DYNC2LI1 show promise as potential biomarkers, although no studies have yet indicated an association with breast cancer. For the CNV biomarkers, MRC2 amplification and copy number gain in basal-like breast cancer may be linked to tumorigenesis and progression. Since basal-like tumors are typically ER−, this may suggest a potential connection to ER status (Wienke et al., 2007). ATG2A exhibits mutations in breast cancer, FRMD8 plays a tumor-suppressive role in breast cancer progression, ARSG is negatively correlated with favorable prognosis, and differentially expressed genes upregulated by SAC3D1 are involved in regulating cell cycle pathways in breast cancer cells. However, no research has yet examined the impact of copy number variations of these genes on ER status (Wu et al., 2021; Wu et al., 2024; Alkhateeb et al., 2020; Liu et al., 2020). GRB7 is overexpressed in breast cancer cell lines, with a strong correlation between mRNA levels and copy number status, and it is essential for the invasion and survival of triple-negative breast cancer cells (Staaf et al., 2010; Giricz et al., 2012). OVOL1 is highly expressed in ER+ breast cancer (Fan et al., 2022), while RHOD has a causal role specifically in ER+ breast cancer (Kazmi et al., 2022). SNX32 shows frequent loss-of-function mutations in breast cancer patients (Li et al., 2018). RCE1, a membrane-bound regulator of Ras, suggests a promising strategy for targeting Ras in breast cancer (Hanker and Der, 2010), and KPNA2 overexpression significantly enhances the invasion and migration capabilities of breast cancer cells (Han and Wang, 2020). The roles of LRFN4, BATF2, and HGSNAT in breast cancer remain unexplored. These findings suggest that DMOIT successfully identifies potential biomarkers, enhancing its value in breast cancer studies.

Furthermore, we assessed the joint effects of multiple omics biomarkers using multiple linear regression. Specifically, we analyzed the direction of the coefficient for the biomarker $X_1$ in the model when only a single omics biomarker was present, and compared it to the direction of the coefficient for $X_1$ in the model $Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, which included three omics biomarkers. We explored all 560 possible combinations, from 10 mRNA, 14 CNV, and 4 MET biomarkers. When $X_1$ was an mRNA biomarker, the inclusion of $X_2$ and $X_3$ did not change the direction of $X_1$’s coefficient. However, when $X_1$ was an MET biomarker, 78 out of 560 combinations resulted in a change. Similarly, when CNV served as $X_1$, 63 out of 560 combinations caused a shift in the direction of $\beta_1$. These findings suggest that the joint effects of CNV and MET on mRNA may be relatively weak. In contrast, the joint effects of MET and mRNA on CNV are stronger, while the strongest joint effects are observed between CNV and mRNA on MET.
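
The coefficient-direction comparison can be sketched as below. This is an illustrative Python example under stated assumptions, not the analysis code of the study: the helper names `ols` and `count_sign_changes` are hypothetical, an intercept is included in both fits (the text does not say whether one was used), and the toy data are constructed so that one biomarker's slope changes sign once a correlated biomarker is added.

```python
from itertools import product


def ols(columns, y):
    """Least-squares slopes for y ~ intercept + columns (normal equations)."""
    rows = [[1.0, *vals] for vals in zip(*columns)]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(1, p)]  # drop the intercept


def count_sign_changes(targets, others_a, others_b, y):
    """Count (X1, X2, X3) triples where the slope of X1 changes sign
    between the single-biomarker fit and the three-biomarker fit."""
    changes = 0
    for x1, x2, x3 in product(targets, others_a, others_b):
        alone = ols([x1], y)[0]
        joint = ols([x1, x2, x3], y)[0]
        changes += (alone > 0) != (joint > 0)
    return changes


# Toy data: x2 is nearly collinear with x1, and y = -x1 + 3 * x2.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
x3 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [-a + 3 * b for a, b in zip(x1, x2)]

print(count_sign_changes([x1], [x2], [x3], y))  # x1 as the target biomarker
print(count_sign_changes([x2], [x1], [x3], y))  # x2 as the target biomarker
```

With 10, 14, and 4 candidate biomarkers per omics type, the same loop would visit 10 × 14 × 4 = 560 triples for each choice of target omics type, matching the number of combinations reported above.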

4 Discussion

In this study, we propose DMOIT, a denoised multi-head self-attention-based multi-omics integration framework that considers both intra- and inter-omics interactions. DMOIT introduces the GAIN module for imputation, the RFS module for feature selection, and multi-head self-attention layers for feature extraction. We investigated the effectiveness of each component in DMOIT and found that the GAIN module generalizes well across different omics types, effectively reducing the noise introduced by inappropriate imputation methods. Additionally, the RFS module successfully identifies stable and denoised features, reducing redundancy and noise, which enhances data quality and improves the performance of downstream analyses. Furthermore, our MHSA-based integration model outperforms traditional machine learning models and other MHSA-based methods across diverse cancer types and varying omics combinations.

However, our study has several limitations that warrant consideration. First, deep learning-based methods operate as black boxes and lack interpretability (Toussaint et al., 2024). This makes it challenging to understand the underlying decision-making processes and limits the insights that can be drawn from the model. As a result, the practical value of such deep learning models in clinical settings may be restricted. To address this issue, we suggest that researchers in related fields draw more on explainable AI to enhance model interpretability and provide clinicians with visualization tools that aid in understanding model predictions, thereby increasing the feasibility of clinical applications. Second, this study focuses on optimizing integration procedures based on high-dimensional data characteristics without incorporating biological knowledge. By exclusively prioritizing data-driven optimization, the framework risks missing valuable biological insights that could enhance both its predictive power and interpretability. Future studies should integrate biological knowledge into the feature selection and extraction processes, for example by incorporating pathway information into the attention mechanisms, to enhance model interpretability and provide more meaningful insights into the biological mechanisms underlying the data (Crawford and Greene, 2020). Third, our experimental results showed that DMOIT achieved optimal performance across all omics combinations in the LIHC and HNSC datasets, but not in the BRCA and STAD datasets. This indicates that the model’s effectiveness may vary with specific omics combinations and cancer types. Future studies should explore different architecture configurations and assess how various omics combinations influence interaction strength. Investigating these aspects will help refine the model to better accommodate diverse omics data and improve its overall performance. Additionally, exploring the model’s adaptability to other omics types and evaluating its performance in different clinical settings could provide further validation and improvements.

In conclusion, our proposed approach effectively integrates multi-omics data by addressing noise reduction and feature stability while considering both intra- and inter-omics interactions. It demonstrates superior performance and stability, making it a promising tool for multi-omics research.

Statements

Data availability statement

The data presented in the study are deposited in the UCSC Xena repository, accession number GDC TCGA Breast Cancer (BRCA), GDC TCGA Head and Neck Cancer (HNSC), GDC TCGA Liver Cancer (LIHC) and GDC TCGA Stomach Cancer (STAD).

Ethics statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements. Written informed consent was obtained from the individual(s), and minor(s)’ legal guardian/next of kin, for the publication of any potentially identifiable images or data included in this article.

Author contributions

ZL: Methodology, Visualization, Writing–original draft, Writing–review and editing. TP: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2022R1A2C1092497).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Ahmed, K. T., Sun, J., Cheng, S., Yong, J., Zhang, W. (2022). Multi-omics data integration by generative adversarial network. Bioinformatics 38 (1), 179–186. doi: 10.1093/bioinformatics/btab608
2. Alkhateeb, A., Zhou, L., Tabl, A. A., Rueda, L. (2020). “Deep learning approach for breast cancer InClust 5 prediction based on multiomics data integration,” in Proceedings of the 11th ACM international conference on bioinformatics, computational biology and health informatics, 1–6.
3. Cho, S., Kim, K., Kim, Y. J., Lee, J. K., Cho, Y. S., Lee, J. Y., et al. (2010). Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Ann. Hum. Genet. 74 (5), 416–428. doi: 10.1111/j.1469-1809.2010.00597.x
4. Crawford, J., Greene, C. S. (2020). Incorporating biological structure into machine learning models in biomedicine. Curr. Opin. Biotechnol. 63, 126–134. doi: 10.1016/j.copbio.2019.12.021
5. Fan, C., Wang, Q., van der Zon, G., Ren, J., Agaser, C., Slieker, R. C., et al. (2022). OVOL1 inhibits breast cancer cell invasion by enhancing the degradation of TGF-β type I receptor. Signal Transduct. Target. Ther. 7 (1), 126. doi: 10.1038/s41392-022-00944-w
6. Fisher, A., Rudin, C., Dominici, F. (2019). All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 20 (177), 177–181. doi: 10.48550/arXiv.1801.01489
7. Flores, J. E., Claborne, D. M., Weller, Z. D., Webb-Robertson, B. J. M., Waters, K. M., Bramer, L. M. (2023). Missing data in multi-omics integration: recent advances through artificial intelligence. Front. Artif. Intell. 6, 1098308. doi: 10.3389/frai.2023.1098308
8. Fondi, M., Liò, P. (2015). Multi-omics and metabolic modelling pipelines: challenges and tools for systems microbiology. Microbiol. Res. 171, 52–64. doi: 10.1016/j.micres.2015.01.003
9. Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6 (269), pl1. doi: 10.1126/scisignal.2004088
10. Gao, R., Zhou, D., Qiu, X., Zhang, J., Luo, D., Yang, X., et al. (2024). Cancer therapeutic potential and prognostic value of the SLC25 mitochondrial carrier family: a review. Cancer Control 31, 10732748241287905. doi: 10.1177/10732748241287905
11. Giricz, O., Calvo, V., Pero, S. C., Krag, D. N., Sparano, J. A., Kenny, P. A. (2012). GRB7 is required for triple-negative breast cancer cell invasion and survival. Breast Cancer Res. Treat. 133, 607–615. doi: 10.1007/s10549-011-1822-6
12. Gong, P., Cheng, L., Zhang, Z., Meng, A., Li, E., Chen, J., et al. (2023). Multi-omics integration method based on attention deep learning network for biomedical data classification. Comput. Methods Programs Biomed. 231, 107377. doi: 10.1016/j.cmpb.2023.107377
13. Gunady, M. K., Kancherla, J., Bravo, H. C., Feizi, S. (2019). scGAIN: single cell RNA-seq data imputation using generative adversarial networks. bioRxiv, 837302. doi: 10.1101/837302
14. Han, Y., Wang, X. (2020). The emerging roles of KPNA2 in cancer. Life Sci. 241, 117140. doi: 10.1016/j.lfs.2019.117140
15. Hanker, A. B., Der, C. J. (2010). “The roles of Ras family small GTPases in breast cancer,” in Handbook of cell signaling (Academic Press), 2763–2772.
16. Hasin, Y., Seldin, M., Lusis, A. (2017). Multi-omics approaches to disease. Genome Biol. 18, 83–15. doi: 10.1186/s13059-017-1215-1
17. He, X., Liu, X., Zuo, F., Shi, H., Jing, J. (2023). Artificial intelligence-based multi-omics analysis fuels cancer precision medicine. Semin. Cancer Biol. 88, 187–200. doi: 10.1016/j.semcancer.2022.12.009
18. Hira, M. T., Razzaque, M. A., Angione, C., Scrivens, J., Sawan, S., Sarker, M. (2021). Integrated multi-omics analysis of ovarian cancer using variational autoencoders. Sci. Rep. 11 (1), 6265. doi: 10.1038/s41598-021-85285-4
19. Kang, M., Ko, E., Mersha, T. B. (2022). A roadmap for multi-omics data integration using deep learning. Briefings Bioinforma. 23 (1), bbab454. doi: 10.1093/bib/bbab454
20. Kazmi, N., Robinson, T., Zheng, J., Kar, S., Martin, R. M., Ridley, A. J. (2022). Rho GTPase gene expression and breast cancer risk: a Mendelian randomization analysis. Sci. Rep. 12 (1), 1463. doi: 10.1038/s41598-022-05549-5
21. Krassowski, M., Das, V., Sahu, S. K., Misra, B. B. (2020). State of the field in multi-omics research: from computational needs to data mining and sharing. Front. Genet. 11, 610798. doi: 10.3389/fgene.2020.610798
22. Li, M., Li, C., Liu, W. X., Liu, C., Cui, J., Li, Q., et al. (2017). Dysfunction of PLA2G6 and CYP2C44-associated network signals imminent carcinogenesis from chronic inflammation to hepatocellular carcinoma. J. Mol. Cell Biol. 9 (6), 489–503. doi: 10.1093/jmcb/mjx021
23. Li, N., Rowley, S. M., Thompson, E. R., McInerny, S., Devereux, L., Amarasinghe, K. C., et al. (2018). Evaluating the breast cancer predisposition role of rare variants in genes associated with low-penetrance breast cancer risk SNPs. Breast Cancer Res. 20, 3–11. doi: 10.1186/s13058-017-0929-z
24. Li, X., Ma, J., Leng, L., Han, M., Li, M., He, F., et al. (2022). MoGCN: a multi-omics integration method based on graph convolutional network for cancer subtype analysis. Front. Genet. 13, 806842. doi: 10.3389/fgene.2022.806842
25. Liu, A. G., Zhong, J. C., Chen, G., He, R. Q., He, Y. Q., Ma, J., et al. (2020). Upregulated expression of SAC3D1 is associated with progression in gastric cancer. Int. J. Oncol. 57 (1), 122–138. doi: 10.3892/ijo.2020.5048
26. López de Maturana, E., Alonso, L., Alarcón, P., Martín-Antoniano, I. A., Pineda, S., Piorno, L., et al. (2019). Challenges in the integration of omics and non-omics data. Genes 10 (3), 238. doi: 10.3390/genes10030238
27. Lu, J. J., Zhang, X., Abudukeyoumu, A., Lai, Z. Z., Hou, D. Y., Wu, J. N., et al. (2023). Active estrogen–succinate metabolism promotes heme accumulation and increases the proliferative and invasive potential of endometrial cancer cells. Biomolecules 13 (7), 1097. doi: 10.3390/biom13071097
28. Marconett, C. N., Sundar, S. N., Poindexter, K. M., Stueve, T. R., Bjeldanes, L. F., Firestone, G. L. (2010). Indole-3-carbinol triggers aryl hydrocarbon receptor-dependent estrogen receptor (ER)alpha protein degradation in breast cancer cells disrupting an ERalpha-GATA3 transcriptional cross-regulatory loop. Mol. Biol. Cell 21 (7), 1166–1177. doi: 10.1091/mbc.e09-08-0689
29. Mi, Y., Dong, M., Zuo, X., Cao, Q., Gu, X., Mi, H., et al. (2024). Genome-wide identification and analysis of epithelial-mesenchymal transition-related RNA-binding proteins and alternative splicing in a human breast cancer cell line. Sci. Rep. 14 (1), 11753. doi: 10.1038/s41598-024-62681-0
30. Pan, L., Liu, D., Dou, Y., Wang, L., Feng, Z., Rong, P., et al. (2023). Multi-head attention mechanism learning for cancer new subtypes and treatment based on cancer multi-omics data. arXiv Prepr. arXiv:2307.04075. doi: 10.48550/arXiv.2307.04075
31. Pang, J., Liang, B., Ding, R., Yan, Q., Chen, R., Xu, J. (2023). A denoised multi-omics integration framework for cancer subtype classification and survival prediction. Briefings Bioinforma. 24 (5), bbad304. doi: 10.1093/bib/bbad304
32. Pes, B. (2020). Ensemble feature selection for high-dimensional data: a stability analysis across multiple domains. Neural Comput. Appl. 32 (10), 5951–5973. doi: 10.1007/s00521-019-04082-3
33. Pujianto, U., Wibawa, A. P., Akbar, M. I. (2019). “K-nearest neighbor (k-NN) based missing data imputation,” in 2019 5th international conference on science in information technology (ICSITech) (IEEE), 83–88.
34. Qi, L., Wang, W., Wu, T., Zhu, L., He, L., Wang, X. (2021). Multi-omics data fusion for cancer molecular subtyping using sparse canonical correlation analysis. Front. Genet. 12, 607817. doi: 10.3389/fgene.2021.607817
35. Staaf, J., Jönsson, G., Ringnér, M., Vallon-Christersson, J., Grabau, D., Arason, A., et al. (2010). High-resolution genomic and expression analyses of copy number alterations in HER2-amplified breast cancer. Breast Cancer Res. 12, R25–R18. doi: 10.1186/bcr2568
36. Sung, Y., Yoon, I., Han, J. M., Kim, S. (2022). Functional and pathologic association of aminoacyl-tRNA synthetases with cancer. Exp. & Mol. Med. 54 (5), 553–566. doi: 10.1038/s12276-022-00765-5
37. Tau, S., Chamberlin, M. D., Yang, H., Marotti, J. D., Roberts, A. M., Carmichael, M. M., et al. (2024). Endocrine persistence in ER+ breast cancer is accompanied by metabolic vulnerability in oxidative phosphorylation. bioRxiv. doi: 10.1101/2024.09.26.615177
38. Toussaint, P. A., Leiser, F., Thiebes, S., Schlesner, M., Brors, B., Sunyaev, A. (2024). Explainable artificial intelligence for omics data: a systematic mapping study. Briefings Bioinforma. 25 (1), bbad453. doi: 10.1093/bib/bbad453
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 30. doi: 10.48550/arXiv.1706.03762
40. Wang, J., Liao, N., Du, X., Chen, Q., Wei, B. (2024). A semi-supervised approach for the integration of multi-omics data based on transformer multi-head self-attention mechanism and graph convolutional networks. BMC Genomics 25 (1), 86. doi: 10.1186/s12864-024-09985-7
41. Wang, T., Zhao, H., Xu, Y., Wang, Y., Shang, X., Peng, J., et al. (2023). scMultiGAN: cell-specific imputation for single-cell transcriptomes with multiple deep generative adversarial networks. Briefings Bioinforma. 24 (6), bbad384. doi: 10.1093/bib/bbad384
42. Wienke, D., Davies, G. C., Johnson, D. A., Sturge, J., Lambros, M. B. K., Savage, K., et al. (2007). The collagen receptor Endo180 (CD280) is expressed on basal-like breast tumor cells and promotes tumor growth in vivo. Cancer Res. 67 (21), 10230–10240. doi: 10.1158/0008-5472.CAN-06-3496
43. Wu, G., Xu, Y., Zhang, H., Ruan, Z., Zhang, P., Wang, Z., et al. (2021). A new prognostic risk model based on autophagy-related genes in kidney renal clear cell carcinoma. Bioengineered 12 (1), 7805–7819. doi: 10.1080/21655979.2021.1976050
44. Wu, W., Yu, M., Li, Q., Zhao, Y., Zhang, L., Sun, Y., et al. (2024). Loss function of tumor suppressor FRMD8 confers resistance to tamoxifen therapy via a dual mechanism. bioRxiv. doi: 10.7554/eLife.101888.1
45. Xu, Y., Zhang, Z., You, L., Liu, J., Fan, Z., Zhou, X. (2020). scIGANs: single-cell RNA-seq imputation using generative adversarial networks. Nucleic Acids Res. 48 (15), e85. doi: 10.1093/nar/gkaa506
46. Yan, J., Risacher, S. L., Shen, L., Saykin, A. J. (2018). Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Briefings Bioinforma. 19 (6), 1370–1381. doi: 10.1093/bib/bbx066
47. Yoon, J., Jordon, J., Schaar, M. (2018). “GAIN: missing data imputation using generative adversarial nets,” in International conference on machine learning (Stockholm, Sweden: PMLR), 5689–5698.
48. Zhang, G., Peng, Z., Yan, C., Wang, J., Luo, J., Luo, H. (2022). MultiGATAE: a novel cancer subtype identification method based on multi-omics and attention mechanism. Front. Genet. 13, 855629. doi: 10.3389/fgene.2022.855629

Summary

Keywords

multi-omics integration, survival time prediction, deep learning, machine learning, multi-head self-attention

Citation

Liu Z and Park T (2024) DMOIT: denoised multi-omics integration approach based on transformer multi-head self-attention mechanism. Front. Genet. 15:1488683. doi: 10.3389/fgene.2024.1488683

Received

30 August 2024

Accepted

25 November 2024

Published

10 December 2024

Volume

15 - 2024

Edited by

Juexin Wang, Indiana University, Purdue University Indianapolis, United States

Reviewed by

Yang Liu, University of Texas Southwestern Medical Center, United States

Dongpeng Liu, Automat Solutions Inc., United States

*Correspondence: Zhe Liu; Taesung Park
