SADLN: Self-attention based deep learning network of integrating multi-omics data for cancer subtype recognition

Integrating multi-omics data for cancer subtype recognition is an important task in bioinformatics. Recently, deep learning has been applied to recognize cancer subtypes. However, most existing studies integrate multi-omics data by simply concatenating them into a single dataset and then learning a latent low-dimensional representation through a deep learning model, which does not consider the different distributions of the omics data. Moreover, these methods ignore the relationships between samples. To tackle these problems, we proposed SADLN: a self-attention based deep learning network for integrating multi-omics data for cancer subtype recognition. SADLN combines an encoder, self-attention, a decoder, and a discriminator into a unified framework, which can not only integrate multi-omics data but also adaptively model the relationships between samples to learn an accurate latent low-dimensional representation. With the integrated representation learned by the network, SADLN uses a Gaussian Mixture Model to identify cancer subtypes. Experiments on ten TCGA cancer datasets demonstrated the advantages of SADLN compared to ten other methods. The Self-Attention Based Deep Learning Network (SADLN) is an effective method for integrating multi-omics data for cancer subtype recognition.


Introduction
Cancer is one of the most common and fatal diseases and is highly heterogeneous: the same cancer can produce subtypes with different phenotypes, which affects clinical treatment and prognosis (Bray et al., 2018;Siegel et al., 2020). Therefore, recognizing the cancer subtype is of great significance for the choice of treatment and the prognosis of cancer patients (Hong Zhao et al., 2014). With the development of high-throughput sequencing technology, large amounts of multi-omics data have been generated, such as miRNA expression, mRNA expression, DNA methylation, and copy number variation data (Song et al., 2020). These multi-omics data can be obtained from publicly available projects. For example, The Cancer Genome Atlas (TCGA) (Sayáns et al., 2019) stores data from more than 11,000 patients across more than 30 cancers and provides valuable opportunities for cancer subtype recognition. Existing studies have demonstrated that incorporating multi-omics data yields better performance and improves the understanding of cancer progression compared to using single-omics data (Hawkins et al., 2010;Kristensen et al., 2014;Hasin et al., 2017). Therefore, there is a strong need for integrated analysis of multi-omics data in cancer subtype recognition (Simidjievski et al., 2019;Xu et al., 2019;Picard et al., 2021).
The clustering algorithm is often used to recognize cancer subtypes. Researchers have proposed many clustering methods for multi-omics data integration. These methods can be divided into three categories: early integration, late integration, and intermediate integration (Rappoport and Shamir, 2018).
Early integration methods simply concatenate the feature matrices of the different omics into a single matrix and apply a single-omics clustering algorithm to it (Rappoport and Shamir, 2018). For example, K-means, LRAcluster, and Spectral clustering all belong to this category. Early integration methods do not consider the differences in the distribution and information contribution of each omics data type; they also increase the dimension of the input data and exacerbate the dimensionality problem. In late integration, each omics data type is clustered separately and the clustering solutions are then integrated into a single clustering solution. For example, COCA (Le et al., 2016) and PINS (Nguyen et al., 2017) belong to this category. Late integration methods are robust against noise and bias, but their performance may be greatly affected when the omics data types contribute different amounts of information.
On the other hand, intermediate integration attempts to build a model that integrates all omics. It includes methods that integrate sample similarities, methods that use joint dimensionality reduction, and methods that use statistical modeling of the data. Similarity-based integration methods construct and fuse the sample similarities at each omics level to obtain consistent sample-sample relationships, and then perform cluster analysis. Typical methods include SNF (Wang et al., 2014) and NEMO (Rappoport and Shamir, 2019). These methods are very sensitive to data noise or network parameters due to the instability of the kernel distance function. Dimensionality-reduction-based integration methods project each omics data type into a common low-dimensional space; typical methods are CCA and MCCA (Witten and Tibshirani, 2009). However, these methods are susceptible to data noise and feature heterogeneity. Statistics-based integration methods build a statistical model to tackle the integration challenge, including iCluster (Shen et al., 2009), iClusterPlus and iClusterBayes.
With the development of machine learning, deep learning has been widely used in healthcare, for example in imaging-based computer-aided diagnosis (Yu et al., 2021), digital pathology (Parodi et al., 2015), drug design (Peng et al., 2020), prediction of hospital admission, and classification of cancer (Zeng et al., 2021). Owing to the high learning capability and flexibility of deep neural networks, more and more deep learning based multi-omics integration methods have been proposed for cancer subtype recognition (Poirion et al., 2018;Guo et al., 2019). Most of them adopt an autoencoder (AE) architecture, such as multi-omics autoencoder integration (MAUI) (Song et al., 2021), stacked sparse autoencoder (SSAE) (Xu et al., 2016), and denoising autoencoder for accurate cancer prognosis prediction (DCAP) (Chai et al., 2021), which can efficiently leverage multi-omics datasets to learn latent factors of the observed data in lower dimensions. However, most of these methods are based on early integration and ignore the distributions of the different omics, which underestimates the heterogeneity of omics data. To solve these problems, some researchers have proposed deep learning based middle integration methods (Sharifi-Noghabi et al., 2019;Adossa et al., 2021;Picard et al., 2021). These methods learn each omics data type separately through a sub-network and then integrate the outputs of all sub-networks into a unified representation. For example, Tong et al. (2020) proposed ConcatAE, a method that concatenates the features learned from each omics using an autoencoder. Yang et al. (2021a) proposed Subtype-GAN, an approach that uses multi-input multi-output neural networks to model multi-omics data separately. Although these methods have demonstrated good performance in cancer subtype recognition, they ignore the relationships between samples when learning a feature representation. Different omics data types can provide unique characteristics to the patients' space. Therefore, it is crucial to utilize the relationships between patients to further boost learning performance.
More recently, the attention mechanism has become a new technology in the field of deep learning. Its main idea is to measure the similarity between the Key and the Query (Mercer and Neufeld, 2021). The attention mechanism has been applied in speech, NLP, image and other fields (Luo et al., 2018;Yuan et al., 2018;Li et al., 2020a;Liu et al., 2020), since it can select the most informative features of an input, adaptively consider the importance of a single feature and allow the model to make a more accurate judgment. As a special case, self-attention (Shaw et al., 2018;Hou et al., 2019), which calculates the response at a position in a sequence by attending to all positions within the same sequence, has achieved notable success in modeling complicated relations (Gao et al., 2019). For instance, it has shown its superiority in modeling arbitrary word dependencies in machine translation and sentence embedding (Li et al., 2020b), and has been successfully applied to capture node similarities in graph embedding (Mustafa Abualsaud, 2019). Research shows that an attention-based encoder is better suited to learning high-level features (Chen et al., 2021).
To this end, we proposed SADLN: a self-attention based deep learning network integrating multi-omics data for cancer subtype recognition. SADLN is a middle integration method that combines a generative adversarial network with the self-attention mechanism to describe the different distributions of multi-omics data and to fuse the relationships between samples. It uses independent sub-networks to learn omics-specific features and concatenates these features into an integrated representation. It then applies self-attention to this integrated representation to learn the relationships between samples and obtains a feature representation that incorporates those relationships. Finally, it uses the Gaussian Mixture Model (GMM) to obtain the subtype label of each sample.
The main contributions are summarized as follows:
1) We proposed a novel deep learning method, SADLN, which combines an encoder, self-attention, a decoder, and a discriminator into a unified framework. It can simultaneously integrate multi-omics representations and sample relations.
2) We introduced self-attention, for the first time, into a deep learning based method for the cancer subtype recognition task, which allows the model to automatically learn the similarity of samples for a better representation.
3) We conducted experiments on ten cancer datasets of TCGA, and SADLN achieved outstanding performance compared with ten integration methods. It provides a theoretical basis and a new method for clinical diagnosis and precise treatment of cancer, which has theoretical significance and clinical application value.

Methodology
Our proposed method consists of two steps. Firstly, we used the SADLN model to learn an integrated feature representation from multi-omics data. Secondly, with the learned feature representation, we used the GMM to identify sample's subtypes. In the SADLN model, the input is the sample's multi-omics data and the output is the sample's integrated low-dimensional feature representation. The model consists of three main blocks: self-attention based encoder, decoder and discriminator. Figure 1 gives the overview architecture of our proposed method. In the following, we describe each block in more detail.

FIGURE 1
The overview architecture of SADLN.

Self-attention based encoder
To generate a higher-quality data distribution, we design a self-attention based encoder in our SADLN model, as shown in Figure 1. The self-attention based encoder transforms the multi-omics data into a low-dimensional latent representation z with distribution $N(\mu, \sigma^2)$ using multiple independent network layers, a fully connected layer and a self-attention layer. We used four independent dense sub-networks to extract features from each original omics data type. For each sub-network, let $x^m = \{x^m_1, \ldots, x^m_N\} \in \mathbb{R}^{N \times D}$ denote the input of the network for the m-th omics data and $y^m = \{y^m_1, \ldots, y^m_N\} \in \mathbb{R}^{N \times d}$ denote the output of the m-th omics through the sub-network, where N is the number of samples, and D and d are the feature dimensions of the input and output data, respectively. $y^m$ can be expressed as:

$$y^m = x^m w^m + b^m$$

where $w^m$ is the weight matrix and $b^m$ is the bias. To fuse the features from different omics data, we concatenate the four feature matrices into one feature representation matrix. The integrated feature matrix Y can be expressed as:

$$Y = [y^1, y^2, y^3, y^4]$$

For example, if the output of each sub-network is an N × d feature matrix, the output after concatenation is one N × 4d feature representation matrix. To prevent the model from overfitting, we appended batch normalization (BN) layers and used the Gaussian Error Linear Unit (GELU) function as the non-linear activation function. That is:

$$Y' = \mathrm{GELU}(\mathrm{BN}(Y))$$

Although the concatenation operation can integrate multi-omics data, the relationship between samples is not considered. In this study, we introduced a self-attention mechanism to model the relationship between samples. Self-attention is typically used to model the relationship between words in a sentence; here we treat each sample's feature vector as a word and learn the weight matrix between samples from the samples' feature vectors. Let $Q = \{q_1, \ldots, q_N\}$, $K = \{k_1, \ldots, k_N\}$ and $V = \{v_1, \ldots, v_N\}$ denote the query, key and value matrices obtained from the samples' feature vectors through linear projection layers with parameters $W^Q$, $W^K$ and $W^V$. $Z = \{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{N \times d_k}$ denotes the final integrated representation. The i-th feature vector $z_i$ is computed in the following steps (Yang et al., 2021b). Firstly, we use the dot product between $q_i$ and $k_j$ to compute the similarity of samples i and j. To ensure the result does not get excessively large, we scale it by $\sqrt{d_k}$. That is:

$$s_{ij} = \frac{q_i k_j^{T}}{\sqrt{d_k}}$$

Secondly, the softmax function is used to obtain the similarity weights. That is:

$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{l=1}^{N} \exp(s_{il})}$$

Thirdly, the integrated feature vector $z_i$ of sample i is obtained as a weighted sum of the values. That is:

$$z_i = \sum_{j=1}^{N} \alpha_{ij} v_j$$

Finally, the integrated feature representation can be expressed as:

$$Z = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

To keep the data distribution unchanged, we added batch normalization layers after the self-attention module.
Suppose Z obeys a Gaussian distribution, $Z \sim N(\mu, \sigma^2)$, where μ is the mean and σ² is the variance. In this paper, we obtained μ and σ² through two fully-connected layers.
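The attention steps described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' Keras implementation: the toy sizes, the random projection matrices `W_q`, `W_k`, `W_v` and the function name `sample_self_attention` are assumptions made for illustration, and batch normalization and GELU are omitted. Each row of `Y` plays the role of one sample's concatenated omics features.

```python
import numpy as np

def sample_self_attention(Y, W_q, W_k, W_v):
    """Scaled dot-product self-attention in which every row of Y is one sample."""
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v            # each (N, d_k)
    d_k = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                # (N, N) sample-sample similarity
    scores -= scores.max(axis=1, keepdims=True)    # shift rows for a stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                    # weighted sum of the values

rng = np.random.default_rng(0)
N, d, d_k = 6, 4, 8                                # hypothetical toy sizes
omics_features = [rng.normal(size=(N, d)) for _ in range(4)]  # stand-ins for y^1..y^4
Y = np.concatenate(omics_features, axis=1)         # (N, 4d) concatenated features
W_q, W_k, W_v = (rng.normal(size=(4 * d, d_k)) for _ in range(3))
Z, A = sample_self_attention(Y, W_q, W_k, W_v)
print(Z.shape, A.shape)
```

Row i of `A` holds the weights with which sample i attends to every sample, so `Z[i]` mixes information from samples that are similar to sample i.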

Decoder
The decoder in our SADLN model attempts to reconstruct the original multi-omics data from the integrated representation Z. As shown in the upper right half of Figure 1, it contains fully connected layers and an output layer. Let $X_I$ denote the input multi-omics data and $X_O$ denote the output of the decoder. To minimize the error between the input $X_I$ and the output $X_O$ (Badrinarayanan et al., 2017), the squared Euclidean distance was applied to calculate the loss $L_{Decoder}$, which can be expressed as:

$$L_{Decoder} = \| X_I - X_O \|_2^2$$

Discriminator
To force the distribution of the integrated feature representation to match the prior Gaussian distribution, we added a discriminator D to the model, which is part of a GAN. A typical GAN is composed of a generator G and a discriminator D. In this work, we regard the self-attention based encoder as the generator G; the inputs of the discriminator D are the output of the encoder and randomly sampled data with a Gaussian distribution. Let G(z) denote the function of the generator, and P(z) denote the prior Gaussian distribution. The discriminator D is used to distinguish whether a sample comes from P(z) or from G(z) (Yang et al., 2021a). Through adversarial learning, G(z) becomes as close to P(z) as possible.
Frontiers in Genetics frontiersin.org
The objective function of the discriminator D is optimized in a min-max manner. It can be expressed as:

$$\min_G \max_D \; E_{z \sim P(z)}[\log D(z)] + E_{z \sim G(z)}[\log(1 - D(z))]$$

where E represents the expected value over the corresponding distribution. We use the binary_crossentropy function to train the discriminator. The loss of the discriminator is:

$$L_{Discr} = -E_{z \sim P(z)}[\log D(z)] - E_{z \sim G(z)}[\log(1 - D(z))]$$

The parameters of the whole network are jointly trained by minimizing the following total loss:

$$L = \lambda_1 L_{Decoder} + \lambda_2 L_{Discr}$$

where $L_{Decoder}$ and $L_{Discr}$ are the decoder and discriminator losses defined in Eq. 8 and Eq. 11, respectively, and $\lambda_1, \lambda_2 \in [0, 1]$ are trade-off parameters.
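As a rough illustration of how the reconstruction loss, the binary cross-entropy discriminator loss and the weighted total loss fit together, the following NumPy sketch may help. It rests on assumptions: the function names are hypothetical, the losses are averaged over samples, prior samples are labelled 1 and encoder outputs 0, and the values of `lam1` and `lam2` are placeholders rather than the settings used in the paper.

```python
import numpy as np

def reconstruction_loss(x_in, x_out):
    """Squared Euclidean distance between input and reconstruction, averaged over samples."""
    return float(np.mean(np.sum((x_in - x_out) ** 2, axis=1)))

def discriminator_loss(d_prior, d_encoded, eps=1e-7):
    """Binary cross-entropy with prior samples labelled 1 and encoder outputs labelled 0.

    d_prior and d_encoded are the discriminator's output probabilities.
    """
    d_prior = np.clip(d_prior, eps, 1 - eps)       # avoid log(0)
    d_encoded = np.clip(d_encoded, eps, 1 - eps)
    return float(-np.mean(np.log(d_prior)) - np.mean(np.log(1 - d_encoded)))

def total_loss(l_decoder, l_discr, lam1=0.5, lam2=0.5):
    """Weighted sum of the two losses; lam1 and lam2 are placeholder trade-off weights."""
    return lam1 * l_decoder + lam2 * l_discr

x = np.array([[1.0, 2.0], [3.0, 4.0]])
l_dec = reconstruction_loss(x, x + 1.0)            # every entry is off by 1
l_dis = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(l_dec, l_dis, total_loss(l_dec, l_dis))
```

A discriminator that outputs 0.5 everywhere cannot tell the two sources apart, which is the situation adversarial training pushes the encoder towards.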

The GMM clustering of SADLN
For the generated feature representation $Z = \{z_n\}_{n=1}^{N}$, we use GMM to identify the samples' subtypes. GMM is a probabilistic clustering method that also belongs to the family of generative models. It assumes that all the data points are generated from a mixture of a finite number of Gaussian distributions (Gu et al., 2020). GMM has good clustering performance, and in this paper we use it as the clustering module. Let K denote the number of clusters, $\pi = (\pi_1, \pi_2, \ldots, \pi_K)$ the weights of the clusters, $\mu = (\mu_1, \mu_2, \ldots, \mu_K)$ the means, $\Sigma = (\Sigma_1, \Sigma_2, \ldots, \Sigma_K)$ the covariances, $Z = \{z_n\}_{n=1}^{N}$ the final integrated feature representation, and $p(z_n)$ the probability density function expressed as a mixture of K Gaussian distributions. That is:

$$p(z_n) = \sum_{k=1}^{K} \pi_k \, N(z_n \mid \mu_k, \Sigma_k)$$

GMM uses the EM algorithm to update the parameters π, μ and Σ. Each sample is assigned the subtype label of the cluster in which it has the maximum probability.
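The clustering step can be sketched with scikit-learn's `GaussianMixture`, which fits the mixture by EM. This is an illustrative example on synthetic data, not the paper's pipeline: the two-dimensional toy `Z`, the three cluster centers and `random_state=0` are assumptions, and in SADLN the input would be the learned latent representation with the cancer-specific number of clusters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for the latent representation Z: three well-separated groups.
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
Z = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(Z)          # EM fit, then most probable component per sample
posteriors = gmm.predict_proba(Z)    # (N, K) posterior probability of each component

print(np.bincount(labels))           # cluster sizes
print(gmm.weights_.round(2))         # estimated mixture weights (pi)
```

The label of each sample is the component with the largest posterior probability, which mirrors the assignment rule described above.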

Experiments and analysis
Network structure and hyperparameter setting
The SADLN model has 19 layers, including 10 layers in the encoder, five layers in the decoder, and four layers in the discriminator. The specific network structure of SADLN is shown in Table 1. The model is built on Python 3.6.12, Keras 2.2.4, and TensorFlow 1.14.0 (the CPU version). The operating system is Windows 10. In terms of hardware, the CPU is an Intel(R) Core(TM) i7-10510U.
Optimizing hyperparameters is key to training neural network models. Choosing appropriate hyperparameters can significantly improve the performance of the model. In this paper, the hyperparameters of the SADLN model mainly include the feature dimension of the independent sub-networks (d), the initial epoch, batch size, random seed, optimizer, activation function, learning rate and loss. Table 2 shows the hyperparameter settings of the SADLN model.

Datasets and evaluation metrics
To evaluate the performance of our proposed method SADLN, we used ten TCGA cancer datasets provided by Yang et al. (2021a) at https://github.com/haiyang1986/Subtype-GAN. The datasets include BRCA, LUAD, BLCA, PAAD, KIRC, STAD, UVM, GBM, SKCM, and UCEC. These ten datasets contain sufficient samples and have reasonable numbers of subtypes. There are four types of omics data for each cancer: copy number, DNA methylation, mRNA and miRNA. The datasets have been preprocessed and feature selection was performed. The preprocessing steps for the four data types are as follows (Hoadley et al., 2018). The DNA methylation data were combined from two generations of Infinium arrays, HumanMethylation27 (HM27) and HumanMethylation450 (HM450). Firstly, the HM27 data were normalized against the HM450 data on the 0-1 β-value scale using a probe-by-probe proportional rescaling method. Then, 3,139 CpG sites that were methylated at a β-value of ≥ 0.3 were selected. For mRNA and miRNA data, the log transformation was first performed separately, then poorly expressed genes were excluded based on median-normalized counts, and finally variance filtering was used to reduce the number of features. Pre-processing led to 3,217 mRNA and 382 miRNA features. For copy number data, genomic regions along a chromosome were first formed from consecutive positions for which the maximum Euclidean distance (based on copy number log-ratio segmented values) between any two adjacent probes was smaller than 0.01; this resulted in a total of 3,105 copy number regions. Then each region was represented by its medoid signature, leading to 3,105 copy number features. Finally, 3,105 copy number features, 3,217 mRNA features, 383 miRNA features and 3,139 DNA methylation features were extracted from the original data source.
We used two evaluation metrics to evaluate the effect of cancer subtype recognition: survival analysis and clinical enrichment analysis. Survival analysis used the Cox log-rank test (Rainer and Muche and hosmer, 2001) to measure differential survival between subtypes. A smaller p-value indicates more significant differences in the survival profiles of different subtypes. In the clinical enrichment analysis, the differences in clinical indicators between subtypes were measured by the p-value obtained by the Kruskal-Wallis test for numerical clinical labels and by the Chi-square test for discrete clinical labels. A smaller p-value indicates more significant differences between subtypes on the clinical label. Six clinical labels (Rappoport and Shamir, 2018), including age at diagnosis, gender, pathologic T, pathologic N, pathologic M, and pathologic stage, were used for testing. The four latter parameters are discrete pathological parameters, measuring the size and extent of the primary tumor (T), the number of nearby lymph nodes that have cancer (N), whether the cancer has metastasized (M), and the overall stage.
To avoid the influence of small cluster sizes on the accuracy of the evaluation metrics, the permutation test (Rappoport and Shamir, 2018) was applied to calculate the p-value of the Cox log-rank test in survival analysis and of the Chi-square test in clinical enrichment analysis. A permutation test obtains an empirical p-value for the test statistic by permuting the cluster labels between samples. To perform permutation tests, we randomly permuted the clustering assignments of the different samples. For the log-rank test, the number of permutations we performed for each clustering solution was first $\min(\max(10 / p_{\text{original}}, 10^4), 10^6)$, where $p_{\text{original}}$ is the original p-value, followed by further rounds of 1e5 permutations until the stopping condition was met. The stopping condition was that both the lower and upper ends of the 95% confidence interval for the p-value were within 10% of its estimate and that the interval did not cross 0.05. For the clinical enrichment test, we kept performing rounds of 1e3 permutations until the 95% confidence interval did not cross 0.05, up to a maximum of 1e5 iterations. This maximum number of iterations was only needed when the p-value was extremely close to 0.05.
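The label-permutation idea can be sketched as follows for a discrete clinical label. This is a simplified illustration rather than the procedure of Rappoport and Shamir (2018): it uses a fixed number of permutations instead of the confidence-interval stopping rule, the `(hits + 1) / (n_perm + 1)` estimate is a common convention assumed here, and all function names are hypothetical. The helper `initial_logrank_permutations` only encodes the first-round permutation count quoted above.

```python
import numpy as np

def chi_square_stat(labels, clinical):
    """Pearson chi-square statistic of the cluster-by-clinical-label contingency table."""
    _, li = np.unique(labels, return_inverse=True)
    _, ci = np.unique(clinical, return_inverse=True)
    table = np.zeros((li.max() + 1, ci.max() + 1))
    np.add.at(table, (li, ci), 1)                       # count co-occurrences
    expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def permutation_p_value(labels, clinical, n_perm=1000, seed=0):
    """Empirical p-value obtained by shuffling cluster labels between samples."""
    rng = np.random.default_rng(seed)
    observed = chi_square_stat(labels, clinical)
    hits = sum(chi_square_stat(rng.permutation(labels), clinical) >= observed
               for _ in range(n_perm))
    return float((hits + 1) / (n_perm + 1))

def initial_logrank_permutations(original_p):
    """First-round permutation count: min(max(10 / p, 1e4), 1e6)."""
    return int(min(max(10 / original_p, 1e4), 1e6))

clusters = np.repeat([0, 1], 20)        # toy clustering of 40 samples
stage = np.repeat(["I", "II"], 20)      # toy clinical label, perfectly associated
p_emp = permutation_p_value(clusters, stage)
print(p_emp, initial_logrank_permutations(0.03))
```

Because the empirical p-value does not rely on the asymptotic chi-square distribution, it remains usable when some clusters contain only a few samples.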

Ablation studies
To evaluate the contributions of the key components of our model, we performed ablation studies in this section. There are three key modules in SADLN: self-attention, decoder and discriminator. We removed each of these modules from SADLN in turn; Table 3 gives the results of the ablation studies on the ten TCGA cancer datasets.
From Table 3,

Comparison with other state-of-the-art algorithms
To verify the performance of SADLN, we compared it with ten state-of-the-art methods: three deep learning based methods (AE, VAE and Subtype-GAN) and seven non-deep learning based methods (K-means, LRAcluster, iCluster, Spectral, NEMO (Rappoport and Shamir, 2019), MCCA (Witten and Tibshirani, 2009) and SNF (Wang et al., 2014)). These ten methods represent different types of approaches for integrating multi-omics data. AE and VAE are early integration methods, in which both the input and the output are integrated multi-omics data. Subtype-GAN is a middle integration method, in which the input and output are multi-omics features. For the ten comparison algorithms, Yang et al. (2021a) detailed the network structure, parameter selection and execution details in Supplementary Materials Note 1 and Note 2. In this study, we rigorously implemented these algorithms following the guidelines of Yang et al. (2021a).
To reduce the influence of different cluster numbers on the subtyping results, following Yang et al. (2021a), we set the cluster numbers of BRCA, LUAD, BLCA, PAAD, KIRC, STAD, UVM, GBM, SKCM and UCEC to 5, 3, 5, 2, 4, 3, 4, 4, 4, and 4, respectively. These cluster numbers have been shown to be clinically informed (Berger et al., 2018;The Cancer Genome Atlas Research Network, 2013;Robertson et al., 2017a;The Cancer Genome Atlas Research Network, 2014;Levine, 2013;Akbani et al., 2015;Li and Wang, 2021;Verhaak et al., 2010;Raphael et al., 2017;Robertson et al., 2017b). Table 4 gives the cluster number and subtypes of the ten cancers. For example, in a previous study, GBM was classified into Classical, Mesenchymal, Neural, and Proneural subtypes based on mRNA expression data (Verhaak et al., 2010). Table 5 gives the −log10 p-values of survival analysis for the eleven methods on the ten TCGA cancer datasets. The clustering results of the other ten compared methods come from Yang et al. (2021a). Bold indicates that the method performs best on the corresponding cancer dataset. As shown in Table 5, SADLN achieved the most significant results on the PAAD, STAD, LUAD and UVM cancer datasets. Compared with Subtype-GAN, SADLN obtained better values on seven cancer datasets (BRCA, GBM, KIRC, LUAD, PAAD, STAD, and UVM). Compared with AE, SADLN obtained a better −log10 p-value on all ten cancer datasets. Compared with the non-deep learning based methods, although some methods had the best results on specific cancer datasets, the −log10 p-value of SADLN was the highest on most cancer datasets. Table 6 gives the clinical parameter enrichment analysis results of SADLN and the other compared methods on the ten cancer datasets.

From Table 6, we can see that SADLN obtained the best results on four datasets (KIRC, GBM, STAD, UCEC). Friedman (1937) analysis was also used to evaluate the performance (Figure 2). From Figure 2, we can see that the performance of SADLN is better than that of three methods, iCluster, LRAcluster and AE (p < 0.05), but not better than that of the other methods. We found that the performance of the methods is not exactly consistent under the two evaluation strategies.
Comparison of multiple omics data and single omics data
SADLN integrates four types of omics data. To demonstrate the necessity of integrating multiple omics data for subtype recognition, we compared the subtyping results of SADLN on multiple omics data with those on single omics data (denoted as SADLN-single). We used the random forest (RF) method to analyze the contribution of the different omics data to the subtyping results of SADLN. The input of RF is the four original omics feature sets and the subtype labels of SADLN. The output of RF is the Gini importance scores of the features. We ran RF using the scikit-learn (1.0.1) package of Python, where the key parameter max_depth was set to six and the other parameters were left at their default values. We summed all the Gini importance scores belonging to each type of omics data to quantify the contribution of the different omics data to the final subtyping results. The results are shown in Figure 3.
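The per-omics contribution analysis can be sketched with scikit-learn as follows. The data here are synthetic: the block sizes, the random `subtype` labels standing in for SADLN's output, and the choice to make only the toy mRNA block informative are all assumptions for illustration. Only `max_depth=6` follows the setting stated above; `random_state=0` is added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 120
subtype = rng.integers(0, 3, size=n)                    # stand-in for SADLN subtype labels
blocks = {                                              # toy omics blocks, hypothetical sizes
    "mRNA": rng.normal(size=(n, 30)) + subtype[:, None],  # shifted by label, so informative
    "miRNA": rng.normal(size=(n, 10)),
    "CNV": rng.normal(size=(n, 20)),
    "methylation": rng.normal(size=(n, 20)),
}
X = np.hstack(list(blocks.values()))                    # features of all omics side by side

rf = RandomForestClassifier(max_depth=6, random_state=0).fit(X, subtype)

# Sum the Gini importances of the columns that belong to each omics block.
contribution, start = {}, 0
for name, block in blocks.items():
    stop = start + block.shape[1]
    contribution[name] = float(rf.feature_importances_[start:stop].sum())
    start = stop

print(contribution)
```

Since `feature_importances_` is normalized to sum to one, each summed value can be read as the share of the total importance attributed to that omics type.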
From Figure 3 we can see that mRNA data made the greatest contribution for the BRCA, BLCA, LUAD, SKCM, UCEC, and UVM datasets, CNV data for GBM, and DNA methylation data for KIRC, PAAD, and STAD. For each cancer, we chose the omics data with the greatest contribution as the input of SADLN-single. The parameter settings remained the same as for SADLN. We also used the p-value of the Cox log-rank test in survival analysis to compare the performance of SADLN and SADLN-single (Table 7).
From Table 7, we can see that the p-values of SADLN are all smaller than the values of SADLN-single on ten cancer datasets. These results demonstrated that the integration of multiple omics data can help improve the performance of subtyping.

Survival analysis and visualization of clustering results
Survival curves can also be used to express the heterogeneity of different subtypes. Figures 4A-J show the Kaplan-Meier survival curves of the ten cancers. From Figure 4, we can see that different clusters have significantly different survival curves (p-value < 0.05). Taking BRCA as an example (Figure 4A), C1 has the longest average survival time, followed by C5, C2 and C3, while C4 has a poor survival time.
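For readers unfamiliar with how such curves are built, the Kaplan-Meier estimate for one cluster can be computed with a few lines of plain Python. This is a generic textbook sketch on made-up follow-up times, not the code used to draw Figure 4; the function name and the toy data are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    events[i] is 1 for an observed death and 0 for a censored observation.
    """
    at_risk = len(times)
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)     # deaths plus censored at time t
        if deaths:
            survival *= 1 - deaths / at_risk            # product-limit update
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Toy follow-up data for one cluster (e.g. months); 0 marks a censored patient.
times = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Computing one such curve per cluster and comparing them with the log-rank test gives the kind of comparison summarized in Figure 4.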
To visualize the clustering results, we used the t-SNE embedding method to display the final integrated feature representation of SADLN (Figure 5). From Figure 5, we can see that samples of the same cluster are mostly grouped together, and samples of different clusters are mostly separated.

Case study
In this section, BRCA data is used to analyze the cancer subtypes obtained by the proposed method SADLN. Firstly, we analyzed the overlaps of the identified subtype clusters with the PAM50 cancer subtypes (Parker et al., 2009).

FIGURE 2
The p-values of the Friedman test on ten cancer datasets.

FIGURE 3
Contribution of mRNA, miRNA, CNV, and DNA methylation to the subtyping results of SADLN on ten cancer datasets.

To illustrate the differences between the subtype clusters identified by SADLN, we also analyzed the mutation profiles of BRCA using mutation data (the mutation data can be found at https://portal.gdc.cancer.gov). Among the 1,031 samples in the BRCA dataset, 820 samples have mutation data. Figure 6 gives the 20 significantly mutated genes of the identified subtype clusters. From Figure 6, we can see that clusters C2 and C3 differ significantly in the mutation frequency of the PIK3CA and CDH1 genes, although both clusters are dominated by the LumA subtype. Clusters C1 and C5 have a high frequency of TP53 gene mutations; this also explains why clusters C1 and C5 are dominated by the Basal and Her2 subtypes.
To illustrate the difference between clusters C1 and C5, we used the RF method to analyze the differentially expressed genes using mRNA expression data. Figure 7 gives the result.
Among these differentially expressed genes, a previous study has shown that the expression of ALDH3B2 was higher in SK-BR-3 cells than in other subtypes of breast cell lines, as determined by reverse transcription-polymerase chain reaction and western blot analysis. In addition, the expression levels of ALDH3B2 were higher in Her2-positive breast cancer than in other subtypes of breast cancer, as determined by immunohistochemistry, so ALDH3B2 may be used as a prognostic indicator for breast cancer (Feng et al., 2019). The expression level of CLEC10A was found to be positively associated with the level of different tumor-infiltrating immune cells in BRCA, including CD8 T cells, B cells, macrophages, and NK cells. These results suggest that the relationship between lower CLEC10A expression level and poor prognosis in BRCA may be due to the role of

FIGURE 6
The mutation profiles of BRCA datasets with 20 significantly mutated genes using mRNA expression data.

Identify the key biomarkers in each cancer
To identify the key biomarkers that determine the subtyping results in each cancer, we ranked the importance of the mRNA features of each cancer dataset using the clustering labels of SADLN and the RF method to obtain the five most important biomarkers. Table 9 gives the five most relevant biomarkers for each of the ten cancers.
Taking BRCA as an example, the five key biomarkers are AGR3, GDF10, EEF1A2, ATP6V0A4, and GIPC2. From a literature review, we found that the AGR3 gene (de Moraes et al., 2022) affects the prognosis of luminal breast cancer patients. The EEF1A2 gene (Hassan et al., 2020) and the GDF10 gene (Zhou et al., 2019) influence the prognosis of triple-negative breast cancer patients. A study has shown that the expression of the ATP6V0A4 gene (Savci-Heijink et al., 2019) is a signature of visceral organ metastasis in breast cancer. Although the GIPC2 gene (Dong et al., 2021) has not been reported in BRCA, it has been shown to act on the pathogenesis and development of pheochromocytoma. These literature findings support the reliability of the results of SADLN on the BRCA dataset.

Discussion
Integrating multi-omics data for cancer subtyping is an important task in bioinformatics. In this paper, we proposed SADLN, a novel deep learning based integration method for cancer subtyping. The method introduces self-attention into an encoder-decoder based network architecture for the first time. It attempts to describe complex and diverse multi-omics data accurately and to adaptively build the relationships between samples when learning a shared low-dimensional representation for molecular subtyping. Compared with three deep learning and seven non-deep learning based integration algorithms, SADLN has two characteristics:
1) Unlike early integration methods such as AE and VAE, SADLN models each omics data type separately, which enables the model to effectively describe different omics data with distinct distributions; meanwhile, the output integrated representation fits the prior distribution.
2) The self-attention module in SADLN makes full use of the samples' multi-omics information, and can automatically learn the weight matrix between samples and make the results of feature integration more convincing.
We demonstrated the power of SADLN using ten datasets of TCGA. The survival analysis and Friedman analysis experiments show that SADLN has good clustering performance. Meanwhile, the comparison of SADLN and SADLN-single shows that integrating multiple omics data is both necessary and useful. The BRCA results indicated that SADLN can efficiently distinguish cancer subtypes.
SADLN found 50 biomarkers across all cancers. Some of these biomarkers have been verified in previous studies. In clinical research, researchers can conduct further subtype analysis studies on the related cancers based on the biomarkers obtained by SADLN. For example, SADLN identifies MEOX2 as an important biomarker of STAD. A previous study has shown that MEOX2 is a novel biomarker associated with macrophage infiltration in digestive system cancer.
Although SADLN has enhanced the performance of cancer subtype recognition, it also has limitations. Firstly, it is unsuited to integrating binary data. Secondly, it cannot find the gene modules that affect each subtype. Thirdly, the relationship between omics data types is not considered. In future research, we will continue our efforts to develop an attention based method that simultaneously learns the relationships between omics and between samples to explore cancer heterogeneity.

Conclusion
In this paper, we proposed Self-Attention Based Deep Learning Network (SADLN) for integrating multi-omics data for cancer subtype recognition. The novel method is based on recent advances in deep learning and self-attention. It can jointly learn different multi-omic data representations and relations between samples. In comparison to the state-of-the-art methods, experiments on ten datasets of TCGA have demonstrated the effectiveness of SADLN.

Data availability statement
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Ethics statement
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

Author contributions
PG conceived the project. QS and LC designed and implemented the algorithms and models. QS, LC, SG, and AM analyzed and interpreted the data. PG, QS, and LC drafted the manuscript. JC and LZ participated in study analysis. All authors approved the final article.