A Multimodal Affinity Fusion Network for Predicting the Survival of Breast Cancer Patients

Accurate survival prediction of breast cancer holds significant meaning for improving patient care. Approaches using multiple heterogeneous modalities, such as gene expression, copy number alteration, and clinical data, have shown significant advantages over single-modality approaches for patient survival prediction. However, existing survival prediction methods tend to ignore the structured information between patients and multimodal data. We propose a multimodal data fusion model based on a novel multimodal affinity fusion network (MAFN) for survival prediction of breast cancer by integrating gene expression, copy number alteration, and clinical data. First, a stack-based shallow self-attention network is utilized to guide the amplification of tiny lesion regions in the original data, which locates and enhances the survival-related features. Then, an affinity fusion module is proposed to map the structured information between patients and multimodal data. The module endows the network with stronger fusion feature representation and discrimination capability. Finally, the fusion feature embedding and the modality-specific feature embeddings from a triple-modal network are combined to classify each patient as long-term or short-term survival. Comprehensive evaluation results indicate that MAFN achieves better predictive performance than existing methods. Additionally, our method can be extended to the survival prediction of other cancers, providing a new strategy for the prognosis of other diseases.


INTRODUCTION
Breast cancer is the second leading cause of cancer death in women (Bray et al., 2018; McKinney et al., 2020). According to estimates by the American Cancer Society, there were more than 2.3 million new cases of invasive breast cancer diagnosed among females and approximately 685,000 breast cancer deaths in 2020 (Sung et al., 2021). Accurate survival prediction is an important goal in the prognosis of breast cancer patients, because it can help physicians make informed decisions and guide appropriate therapies (Sun et al., 2007). However, the high-dimensional nature of multimodal data makes it hard for physicians to interpret these data manually (Cheerla and Gevaert, 2019). It is therefore urgent to develop computational methods that provide efficient and accurate survival prediction (Cardoso et al., 2019; Zhu et al., 2020).
The goal of cancer survival prediction is to predict whether and when an event (i.e., patient death) will occur within a given time period (Gao et al., 2020). In recent years, a considerable amount of work has been done to predict the survival of breast cancer patients by applying statistical or machine learning methods to single-modality data, especially gene expression data (Wang et al., 2005; Nguyen et al., 2013). For example, Van De Vijver et al. (2002) used multivariate analysis of gene expression data to identify a 70-gene prognostic signature. Xu et al. (2012) utilized a support vector machine (SVM) to select key features from gene expression data for breast cancer survival prediction. However, despite their relatively high performance, methods based solely on gene expression data still leave room for improvement (Alizadeh et al., 2015; Lovly et al., 2016). In particular, with the advancement of next-generation sequencing technologies, a tremendous amount of multimodal data is being generated, such as gene expression data, clinical data, and copy number alteration (CNA) data (Peng et al., 2005). These data provide rich information for cancer diagnosis and prognosis.
Recently, researchers have begun to integrate multimodal data to predict the survival of cancer patients. For example, Sun et al. (2018), for the first time, developed a multimodal deep neural network that uses decision fusion to integrate multimodal data. Cheerla and Gevaert (2019) proposed an unsupervised encoder to compress clinical data, mRNA expression data, microRNA expression data, and histopathology whole slide images (WSI) into a single feature vector for each patient; these feature vectors were then aggregated to predict patient survival. Arya and Saha (2020) introduced STACKED_RF, a method based on a stacked ensemble framework combined with random forests on multimodal data. These results show that better performance can be achieved with multimodal data. Although many efforts have been dedicated to integrating multimodal data for cancer survival prediction, it remains a challenging task. First, features associated with survival only exist in tiny lesion regions, so the feature embedding extracted from multimodal data might be dominated by excessive irrelevant features from normal areas and yield restrained classification performance. Second, there is abundant structured information between patients and multimodal data, which existing methods largely fail to exploit.
In this paper, we address the above two challenges by proposing a novel MAFN that integrates gene expression, CNA, and clinical data to predict the survival of breast cancer patients. Our MAFN framework includes an attention module, an affinity fusion module, and a deep neural network (DNN) module. In order to capture critical features in the tiny lesion regions, we utilized an attention mechanism to adaptively localize and enhance the features associated with the supervised target while suppressing background noise. However, the traditional attention mechanism (Gao et al., 2018; Chen et al., 2019; Uddin et al., 2020) is not compatible with the needs of multimodal data, because it ignores the heterogeneity of multimodal data and tends to assign large weights to only a few features (Gui et al., 2019). Therefore, we applied a shallow attention net to each feature, which can effectively extract key information from multimodal data, fully taking the distinction and uniformity of heterogeneous data into account. Additionally, we utilized an affinity fusion module to compute fusion feature representations and to model complex intra-modality and inter-modality relations with the knowledge of structured information between patients and multimodal data. Meanwhile, the DNN module was used to compensate for the lack of single-modality specific information in the fusion features. The main contributions of this paper can be summarized as follows: (1) An attention module is proposed to adaptively localize and enhance the features associated with survival. By providing a shallow attention network for each feature, this mechanism alleviates the problem of a few features receiving excessive weight due to data heterogeneity.
(2) A novel feature fusion method is proposed, which constructs an affinity network to fuse multimodal data more effectively.
(3) A multimodal data fusion method based on an affinity network (MAFN) is proposed, integrating gene expression data, CNA data, and clinical data. We validate the effectiveness of MAFN and its building blocks on four public datasets.
The experimental results show that, to the best of our knowledge, MAFN outperforms existing methods (Jefferson et al., 1997; Xu et al., 2012; Nguyen et al., 2013; Sun et al., 2018; Chen et al., 2019; Arya and Saha, 2020).
The rest of this paper is organized as follows: section 2 presents the details of our proposed method and datasets. Furthermore, the experimental results are discussed in section 3 and some conclusions are drawn in section 4.

Materials
In this study, we used four independent breast cancer datasets containing 3,380 samples in total (Table 1). We downloaded the METABRIC dataset from cBioPortal (Curtis et al., 2012) and the other datasets from the University of California Santa Cruz (UCSC) cancer browser website (Goldman et al., 2018). Each downloaded dataset consists of three sub-datasets: gene expression data, CNA data, and clinical data. We used these datasets in the following two steps. The first step was to obtain the survival-risk class labels from the clinical data of each dataset. Similar to the previous work by Khademi and Nedialkov (2015), each sample was labeled as a good sample if the patient survived for more than 5 years, and as a poor sample otherwise. The second step was to randomly divide each dataset into three groups: 80% of the samples as the training set, 10% as the test set, and the remaining 10% as the validation set.
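The labeling rule and the random 80/10/10 split described above can be sketched as follows (an illustrative NumPy sketch; the function names are our own, and survival time is assumed to be given in months):

```python
import numpy as np

def label_survival(surv_months, threshold_months=60):
    """Label each patient: 1 = good sample (survived > 5 years), 0 = poor."""
    return (np.asarray(surv_months) > threshold_months).astype(int)

def split_indices(n, seed=0):
    """Randomly split n samples into 80% train, 10% test, 10% validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]
```
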

Data Preprocessing
The preprocessing strategies for the three sub-datasets (i.e., gene expression data, CNA data, and clinical data) were implemented as follows. First, we matched the sample labels shared among the three sub-datasets. Second, for each sub-dataset we filtered out samples with more than 20% missing feature values (NA) and features with missing values in more than 10% of the samples. We then estimated the remaining missing values using the k-nearest neighbor algorithm (Troyanskaya et al., 2001; Ding et al., 2016). Third, the gene expression features were standardized and further discretized into three categories according to two thresholds (Sun et al., 2018): under-expression (-1), normal expression (0), and over-expression (1). These thresholds depend on the variance of the gene: a gene with high variance receives a higher threshold than a gene with low variance. For CNA data, we directly utilized the original data with five discrete values: homozygous deletion (-2); hemizygous deletion (-1); neutral/no change (0); gain (1); high-level amplification (2). For clinical data, non-numerical attributes were digitized by one-hot encoding. Feature Selection: The "curse of dimensionality" is a typical problem when using multimodal data (Tan et al., 2015). For example, the gene expression data and CNA data in the METABRIC dataset contain 24,369 and 22,545 genes, respectively. Minimum redundancy maximum relevance (mRMR) (Peng et al., 2005) is a widely used dimensionality reduction algorithm. Hence, we applied a modified mRMR method (fast-mRMR) (Ramírez-Gallego et al., 2017) to select features from the original dataset without significant loss of information. Similar to previous work (Zhang et al., 2016; Sun et al., 2018), we used the area under the curve (AUC) value as the criterion to evaluate the performance of the selected features. In detail, we roughly searched for the best N features from 200 to 600 with a step size of 100 (Table 2).
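As a concrete illustration of the variance-dependent discretization described above, the following NumPy sketch maps each gene to {-1, 0, 1}. The exact threshold rule (here, a multiple of each gene's standard deviation around its mean) is an assumption, since the text does not give the formula:

```python
import numpy as np

def discretize_expression(X, k=1.0):
    """Map expression values to {-1, 0, 1} using per-gene thresholds
    proportional to each gene's standard deviation (assumption: higher
    variance -> higher threshold, as described in the text)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    thr = k * X.std(axis=0)          # gene-specific threshold
    out = np.zeros_like(X, dtype=int)
    out[X > mu + thr] = 1            # over-expression
    out[X < mu - thr] = -1           # under-expression
    return out
```
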

Methods
In this section, we introduce the detailed design of MAFN for predicting the survival of breast cancer patients. The goal of MAFN is to distinguish between poor samples and good samples. The multimodal input consists of gene expression data, CNA data, and clinical data, and is expressed as X ∈ R^(N×d) with d = m + n + c, where m, n, and c represent the dimensions of the gene expression data, the CNA data, and the clinical data, respectively, and N is the number of patients.

Attention Module
In order to adaptively localize and enhance the features associated with survival, we used an attention mechanism to guide our method. Previous attention-based studies for cancer survival prediction generate feature weights uniformly over all feature dimensions (Gao et al., 2018; Chen et al., 2019), which may not be a good choice for heterogeneous data sources, because data heterogeneity causes a few features of a single modality to receive relatively large weights, losing detail across the feature set. We argue that the different modalities of one patient together reflect the patient's survival risk. To address this issue, in MAFN we propose an attention module inspired by recent achievements in self-attention (Gui et al., 2019). We used a dedicated shallow Attention Net for each feature in X, alleviating the problem of data heterogeneity. The module consists of three main parts: (1) embedding layer; (2) Attention Net; and (3) sigmoid normalization. First, an embedding layer network was used to extract the intrinsic information (denoted as E) from the raw input X ∈ R^(N×d) and to eliminate noise. At the same time, the gene expression data and CNA data from large sparse domains were mapped to a dense matrix. The embedding layer is calculated as follows:

E = σ(X W_E + b_E),

where W_E and b_E are trainable parameters, and σ(·) denotes the ReLU activation function. The size of the embedding layer is E_N, which is generally smaller than the size of the original input features. In this process, the major part of the information was retained, while redundant information was discarded. Second, a stack-based shallow self-attention network was used to seek the probability distribution for each feature (Figure 1, attention module).
Using E ∈ R^(N×E_N) extracted by the embedding layer as the input of each Attention Net, the kth feature's Attention Net (with L layers) outputs the weight p_k:

p_k = f_L(f_(L-1)(. . . f_1(E))),

where, for layer i of the given Attention Net,

f_i(E_(i-1)^k) = σ(W_i^k E_(i-1)^k + b_i^k),

where E_(i-1)^k is the output of the (i-1)th hidden layer of the kth Attention Net, W_i^k and b_i^k are the trainable parameters of this layer, and σ(·) denotes the tanh() activation function.
The outputs of all shallow Attention Nets were integrated into an attention matrix A = [p_1, p_2, . . . , p_d] ∈ R^(N×d). In order to prevent saturation of the neuron outputs caused by excessively large absolute weight values, the sigmoid function was used for normalization:

A' = sigmoid(A).

Finally, the weighted feature T of the multimodal data is the element-wise product ⊙ of the original data X and the attention matrix A':

T = X ⊙ A'.
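The three parts of the attention module (embedding layer, per-feature shallow Attention Nets, and sigmoid normalization) can be sketched as follows. This NumPy sketch uses random weights as stand-ins for trained parameters and omits the bias terms for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def attention_module(X, E_N=8, hidden=4):
    """Per-feature shallow attention (shapes follow the text; random
    weights stand in for trained parameters, biases omitted)."""
    N, d = X.shape
    # (1) embedding layer: E = relu(X W_E), E in R^{N x E_N}
    W_E = rng.standard_normal((d, E_N)) * 0.1
    E = relu(X @ W_E)
    # (2) one shallow Attention Net (L = 2 tanh layers) per feature k
    A = np.empty((N, d))
    for k in range(d):
        W1 = rng.standard_normal((E_N, hidden)) * 0.1
        W2 = rng.standard_normal((hidden, 1)) * 0.1
        A[:, k] = np.tanh(np.tanh(E @ W1) @ W2).ravel()
    # (3) sigmoid normalization, then reweight the original input
    return X * sigmoid(A)
```

Because the sigmoid maps every weight into (0, 1), the weighted feature T never exceeds the original X in magnitude; it only rescales each feature per sample.
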

Affinity Fusion Module
We propose a fusion method for multimodal data based on an affinity network. It consists of three main parts: (1) construction of a bipartite graph; (2) one-mode projection of the bipartite graph; and (3) extraction of fusion features. Bipartite Graph: In order to capture the structured information between patients and multimodal data, we utilized the gene expression data and CNA data to construct a bipartite graph. Following previous methods (Sun et al., 2018; Chen et al., 2019), the features from the gene expression data and CNA data were standardized and discretized into three and five categories, respectively, during data preprocessing. A feature value of 0 is regarded as normal expression; any other value is regarded as abnormal expression.
First, let G_B = (V, E) be an undirected bipartite graph. The vertex set V consists of two mutually disjoint subsets {p-nodes, g-nodes}: the patient node set and the gene node set. A node in g-nodes represents a feature from the gene expression data or the CNA data, as shown in the bipartite graph in Figure 2. For each patient, an edge is built between the patient node and a gene node only if the gene node's value indicates abnormal expression (non-zero). In this way, we constructed a patient-gene bipartite graph, from which one can intuitively see how the gene expression and CNA features relate to each patient.
In the bipartite graph, the number of patient nodes is N and the number of gene nodes is (m+n). Let B ∈ {0, 1}^(N×(m+n)) be the bipartite graph relationship matrix:

b_ij = 1 if (p_i, g_j) ∈ E; b_ij = 0 otherwise,

where E is the set of edges between p-nodes and g-nodes, and b_ij indicates the relation between patient i and gene j. Each row of B represents the link pattern of a node in p-nodes, and each column represents the link pattern of a node in g-nodes.
One-mode projection: In order to compute the affinity network from the multimodal data (i.e., to establish connections between different modalities), the bipartite graph relationship matrix B was projected onto the g-nodes set through one-mode projection (Le and Pham, 2018). For each patient node p_i, we defined a sparse matrix G_i on the vertex set g-nodes: if two gene nodes both have edges with p_i, an edge is built between the two gene nodes. The matrix G_i ∈ {0, 1}^((m+n)×(m+n)) was computed as follows:

g_jk^i = b_ij · b_ik,

where g_jk^i indicates the relation between genes j and k for patient i. Then the affinity network G was computed by aggregating over all patients:

G(j, k) = Σ_(i=1)^N g_jk^i,

where N is the number of patient nodes and G(j, k) indicates the weight between genes j and k.

FIGURE 2 | The overview of the bipartite graph and one-mode projection. With the input gene expression and copy number alteration (CNA) data, (1) the bipartite graph expresses the structured relation between patients and multimodal data, e.g., the edges between patient p_i and g_j (in gene expression data) or c_j (in CNA data); (2) the projection of genes, which establishes the connection between different modalities by one-mode projection.

Further Prune "Weak" Edges: In G, edges with small weights are more likely to be noise. Hence, we pruned "weak" edges by constructing a KNN graph. We defined the affinity matrix G' as follows:

G' = ψ(G, k),

where ψ(·, k) is the near-neighbor choosing function, which keeps the top-k values in each row of a matrix and sets the others to zero. Normalization: A straightforward way to obtain a normalized affinity matrix is G' ← D^(-1) G', where D is the diagonal degree matrix with entries D(i, i) = Σ_j G'_ij, so that Σ_j G'_ij = 1. However, this normalization involves self-similarities on the diagonal of G', which may lead to numerical instability. A better normalization (Peng et al., 2005) is:

G'(i, j) = G(i, j) / Σ_(k∈N_k(i)) G(i, k) if j ∈ N_k(i), and 0 otherwise,

where N_k(i) is the index set of the k nearest neighbors of gene i. This normalization removes the diagonal self-similarity while Σ_j G'_ij = 1 still holds. Extract Fusion Features: We utilized the affinity matrix to propagate features. Before this, the weighted features of the gene expression and CNA modalities were concatenated in the row dimension, each row storing the features of one sample:

Z = [T_gene, T_cna] ∈ R^(N×(m+n)).

Inspired by graph convolutional neural networks, we extracted the fusion features by the following formula:

Z^(l+1) = σ(Z^(l) G' W_f^(l)),

where Z^(0) = Z, W_f^(l) are the trainable parameters of layer l, and σ(·) denotes the tanh() activation function.
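The one-mode projection, kNN pruning, and neighbor-restricted normalization can be sketched as follows (a NumPy sketch; note that summing the per-patient projections G_i is equivalent to computing B^T B on the binary relationship matrix):

```python
import numpy as np

def affinity_network(B, k=2):
    """One-mode projection of a patient-gene bipartite matrix B (N x g),
    followed by kNN pruning and neighbor-restricted row normalization."""
    B = np.asarray(B, dtype=float)
    G = B.T @ B                    # G[j,k] = #patients linked to both genes
    np.fill_diagonal(G, 0.0)       # drop diagonal self-similarity
    # prune "weak" edges: keep the top-k entries per row
    Gp = np.zeros_like(G)
    for i in range(G.shape[0]):
        top = np.argsort(G[i])[-k:]
        Gp[i, top] = G[i, top]
    # normalize each row over its kept neighbors so rows sum to 1
    rows = Gp.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0
    return Gp / rows
```
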

DNN Module
In order to compensate for the lack of single-modality specific information in the fusion features, we utilized a DNN module to extract effective features from each modality. The module consists of three deep neural networks. The specific features F_i of each modality were extracted as follows:

T_i^(l+1) = σ(W^(l) T_i^(l) + b^(l)),

where T_i^(0) = T_i, W^(l) is the trainable parameter matrix of layer l, b^(l) is the bias vector, σ(·) denotes the tanh() activation function, and F_i is the output of the last layer.
Then, the specific features from the DNN module and the fusion features from the affinity fusion module were concatenated in the row dimension:

F = [F_gene, F_cna, F_clin, Z^(l)].

Finally, F was passed through multiple fully connected layers to predict the survival of breast cancer patients:

F^(l+1) = σ(W^(l) F^(l) + b^(l)),

where W^(l) and b^(l) are the trainable parameters of layer l, σ(·) denotes the tanh() activation function, and F^(l) denotes the multimodal representation at layer l. Finally, we obtained the final prediction score ŷ with the sigmoid function.
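The fusion of modality-specific features with the affinity fusion features and the final sigmoid scoring can be sketched as follows (a single-layer NumPy sketch with random stand-in weights; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
dense = lambda x, w: np.tanh(x @ w)   # one tanh layer (bias omitted)

def predict(T_gene, T_cna, T_clin, Z_fused, hidden=16):
    """Concatenate modality-specific DNN features with fusion features
    and score survival with a sigmoid head (random stand-in weights)."""
    feats = []
    for T in (T_gene, T_cna, T_clin, Z_fused):
        W = rng.standard_normal((T.shape[1], hidden)) * 0.1
        feats.append(dense(T, W))
    F = np.concatenate(feats, axis=1)          # row-wise concatenation
    w_out = rng.standard_normal((F.shape[1], 1)) * 0.1
    return 1.0 / (1.0 + np.exp(-(F @ w_out)))  # final prediction score
```
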

Optimization
For model optimization, MAFN can be trained in a supervised setting. We defined the cross-entropy loss as the objective function and used L2 regularization to prevent overfitting. The objective function can be defined as follows:

loss(X, y, ŷ) = −(1/n) Σ_(i=1)^n [α y_i log(ŷ_i) + (1 − α)(1 − y_i) log(1 − ŷ_i)] + λ ||W||_2^2,

where y is the real label, ŷ is the prediction score, and n is the batch size. λ and α are hyperparameters.
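A sketch of this objective, assuming α weights the positive class and (1 − α) the negative class (the exact weighting scheme is not spelled out in the text):

```python
import numpy as np

def weighted_bce_l2(y_true, y_pred, weights, alpha=0.5, lam=1e-4, eps=1e-12):
    """Class-weighted cross entropy with an L2 penalty on the weight
    matrices (a reconstruction of the objective; the weighting scheme
    with alpha on the positive class is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    ce = -(alpha * y_true * np.log(y_pred)
           + (1 - alpha) * (1 - y_true) * np.log(1 - y_pred)).mean()
    l2 = lam * sum((w ** 2).sum() for w in weights)
    return ce + l2
```
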

Experimental Settings
We implemented our model using PyTorch on an Nvidia GTX 1080 GPU server. The model was trained with the Adam optimizer. The learning rate was initialized as 1e-3 and decayed to 1e-4 at the 6th epoch. The parameters in section 2.3.1 were set as E_N = 128, L = 2, k = 10, and l = 3. The weights between layers were initialized using the normalized initialization proposed by Glorot and Bengio (2010), drawn from a truncated normal distribution with zero mean and standard deviation

σ = sqrt(2 / (n_i + n_o)),

where n_i and n_o denote the number of input and output units, respectively.
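The initialization can be sketched as follows, assuming a truncated normal with standard deviation sqrt(2/(n_i + n_o)) clipped at two standard deviations (the truncation range is our assumption):

```python
import numpy as np

def glorot_truncated_normal(n_in, n_out, rng=None):
    """Glorot-style truncated normal init: std = sqrt(2/(n_in + n_out)),
    resampling any entry beyond two standard deviations (assumed range)."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    W = rng.standard_normal((n_in, n_out)) * std
    mask = np.abs(W) > 2 * std
    while mask.any():                       # resample out-of-range entries
        W[mask] = rng.standard_normal(mask.sum()) * std
        mask = np.abs(W) > 2 * std
    return W
```
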

Evaluation Metrics
Following Sun et al. (2007) and Arya and Saha (2020), we adopted AUC as the primary evaluation metric, which is widely used in survival prediction tasks. We plotted the receiver operating characteristic (ROC) curve to show the trade-off between the true positive rate and the false positive rate. AUC, Accuracy (Acc), Precision (Pre), F1-score, and Recall were used for performance evaluation. The metrics are evaluated as follows:

Acc = (TP + TN) / (TP + TN + FP + FN),
Pre = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-score = 2 × Pre × Recall / (Pre + Recall),

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
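These metrics can be computed directly from the confusion-matrix counts, for example:

```python
def classification_metrics(y_true, y_pred):
    """Compute Acc, Pre, Recall, and F1-score from binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return acc, pre, rec, f1
```
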

Ablation Study
We conducted ablation studies to validate the effectiveness of two crucial components in our proposed MAFN: attention module and affinity fusion module. We employed the DNN module as our basic network, namely DNN model. Experimental results are shown in Table 3.

Validation of the Effectiveness of the Attention and Affinity Fusion Module
(1) Evaluation of attention module: To validate the effectiveness of the attention module, we compared the performance of the DNN model and the DNN_Attention model. Figure 3 shows the AUC values of the different models. DNN_Attention achieves consistently better performance than the base network on all four datasets. For example, the AUC value of the DNN_Attention model is improved by 3.3% compared with the DNN model on the METABRIC dataset, and by 1.6% on the TCGA-BRCA dataset. Furthermore, we calculated the corresponding Acc, Pre, F1-score, and Recall of all compared models. In particular, as shown in Table 3, we observed remarkable improvements of 3.5, 2.6, 4.1, and 1.1% in the Acc, Pre, F1-score, and Recall on the METABRIC dataset, respectively. These results verify the advantage of using the attention module, which adaptively learns the weight of each feature within the multimodal data, for the survival prediction of breast cancer in our proposed MAFN framework.
(2) Evaluation of affinity fusion module: To validate the effectiveness of our affinity fusion module, we compared the performance of the DNN model and the DNN_Affinity model. As shown in Figure 3, the AUC value of the DNN_Affinity model is improved by 8.1% compared with the DNN model on the METABRIC dataset, and by 7.0% on the TCGA-BRCA dataset. In addition, in terms of the other metrics, the DNN_Affinity model also achieves corresponding improvements (as shown in Table 3). These results demonstrate that the affinity fusion module plays a significant role in compensating for the loss of information in the specific features and in improving the performance of breast cancer prediction.
Furthermore, we compared the results of the MAFN (DNN_Affinity_Attention) model with the models based on a single improved module (DNN_Affinity and DNN_Attention) on the different datasets. As shown in Table 3, the results demonstrate the complementarity of the affinity fusion module and the attention module.

Validation of the Effectiveness of Multimodal Data
To demonstrate the significance of fusing multimodal data and the effectiveness of the affinity fusion module for the prediction of breast cancer survival, we applied the MAFN model to single modalities (gene expression data, CNA data, or clinical data). This also allows us to explore the influence of gene expression, CNA, and clinical data on breast cancer survival prediction. We designed the following four comparative experiments: (i) MAFN with only clinical data.
In this experiment, we chose only clinical data as input for the MAFN model, namely Only_Clin. Because it is hard to define abnormal values in one-hot coded clinical data, the affinity fusion module cannot construct a bipartite graph from clinical data; we therefore extracted the features directly with the attention module and the DNN module. (ii) MAFN with only gene expression data.
In this experiment, we chose only gene expression data as input for the MAFN model, namely Only_Gene. The affinity fusion module only propagates information within the gene expression modality. (iii) MAFN with only CNA data.
In this experiment, we chose only CNA data as input for the MAFN model, namely Only_CNA. The affinity fusion module only propagates information within the CNA modality. (iv) MAFN with multimodal data.
In this experiment, we utilized multimodal data as input for MAFN model, namely MAFN (Gene_CNA_Clin).
The ROC curves of models using different inputs on the METABRIC dataset are shown in Figure 4. From Figure 4, we observe that, compared with using single-modality data alone, the use of multimodal data enhances the performance of MAFN. For example, the AUC value of MAFN reaches 93.8%, which is higher than those of the Only_Gene, Only_CNA, and Only_Clin models by 5.9, 29.8, and 8.9%, respectively. In addition, as shown in Figure 5, the Pre values of the Only_CNA and Only_Clin models are 75.0 and 84.7%, respectively, both lower than that of the Only_Gene model. These results demonstrate that gene expression data yields the best single-modality classification performance, while CNA data and clinical data provide valuable predictive information complementary to gene expression data. All comparison results confirm the substantial benefit of integrating multimodal data and fusing features with the affinity fusion module for survival prediction. Moreover, we conducted MAFN on the METABRIC dataset with the gene expression data divided into different expression levels (Jin et al., 2019; Wei et al., 2020); detailed information is provided in Supplementary File 1.

Comparison With Other Methods
In order to verify the effectiveness of MAFN, we compared our method with three existing deep learning-based methods: STACKED_RF (Arya and Saha, 2020), AMND (Chen et al., 2019), and MDNNMD (Sun et al., 2018). Experiments were conducted on the METABRIC, TCGA-BRCA, GSE8757, and GSE69035 datasets, and the ROC curves of the different methods are plotted in Figure 6. As expected, MAFN achieves the best performance among all investigated deep learning-based methods, with AUC improvements of 0.8, 6.8, and 7.4% over STACKED_RF, AMND, and MDNNMD, respectively. From the comparative study presented in Figure 6, the results on the other three datasets are consistent with those on the METABRIC dataset. These results show that, compared with the other methods, MAFN achieves remarkable improvements in breast cancer survival prediction from multimodal fusion data. Additionally, we analyzed the Acc, Pre, F1-score, and Recall of the different methods. The corresponding results are shown in Table 4. The Acc value of MAFN on the METABRIC dataset is 89.0%, which is 4.2, 5.8, and 8.7% higher than those obtained by STACKED_RF, AMND, and MDNNMD, respectively. The results on the other three datasets are consistent with those on the METABRIC dataset. These results further confirm the effectiveness of MAFN in breast cancer survival prediction.
To further evaluate the performance of MAFN, we also compared it with three widely used traditional classification methods: LR (Jefferson et al., 1997), RF (Nguyen et al., 2013), and SVM (Xu et al., 2012). Experiments were conducted on the METABRIC and TCGA-BRCA datasets. As shown in Table 5, MAFN outperformed the traditional classification methods. For example, the AUC value of MAFN on the TCGA-BRCA dataset is higher than those of LR, RF, and SVM by 12.1, 19.6, and 21.1%, respectively. At the same time, Tables 4, 5 show that the deep learning-based methods predict better than the non-deep-learning-based methods. Moreover, some researchers (Poirion et al., 2019; Tran et al., 2020) directly used the survival time as the training label and also achieved satisfactory results. Since the training target of these methods is inconsistent with that of MAFN, such comparison experiments could not be performed.
In conclusion, MAFN is superior to other existing deep learning methods and non-deep learning-based methods on different datasets, indicating that MAFN method has remarkable improvements in breast cancer survival prediction. At the same time, the feasibility of deep neural network with multimodal data fusion and the practicability of multimodal data in the prediction of breast cancer prognosis are further proved.

CONCLUSION
In this study, we propose a deep neural network model based on affinity fusion (MAFN) to effectively integrate multimodal data for more accurate breast cancer survival prediction. Our findings suggest that survival prediction methods based on fused feature representations from different modalities outperform those using single-modality data. Moreover, our proposed attention module and affinity fusion module can efficiently extract the critical information within multimodal data and capture the structured information within and between the modalities. Meanwhile, the DNN module can compensate for the lack of single-modality specific information in the fusion features. The comprehensive experimental results show that, by using both fusion features and modality-specific features, MAFN achieves better predictive performance than existing methods, and it can be extended to the survival prediction of other cancers.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
WG conceived and designed the algorithm and analysis, conducted the experiments, and wrote the manuscript. WL and QD performed the biological analysis and wrote the manuscript. QD and XZ provided the research guide. XZ supervised this project. All authors contributed to the article and approved the submitted version.