Multi-Modal Image Fusion Based on Matrix Product State of Tensor

Multi-modal image fusion integrates different images of the same scene collected by different sensors into one image, making the fused image easier for computers to recognize and for human vision to perceive. Traditional tensor decomposition is an approximate decomposition method and has been applied to image fusion; as a consequence, image details may be lost in the reconstruction of the fused image. To preserve the fine information of the images, an image fusion method based on the matrix product decomposition of tensors is proposed in this article to fuse multi-modal images. First, each source image is initialized into a separate third-order tensor. Then, the tensor is decomposed into matrix product form by using singular value decomposition (SVD), and the sigmoid function is used to fuse the features extracted in the decomposition process. Finally, the fused image is reconstructed by multiplying all the fused tensor components. Since the algorithm is based on a series of singular value decompositions, a stable closed-form solution can be obtained and the calculation is simple. The experimental results show that the quality of the fused images obtained by this algorithm is superior to that of other algorithms in both objective evaluation metrics and subjective evaluation.


INTRODUCTION
The purpose of image fusion is to synthesize multiple images of the same scene into a fused image containing part or all of the information of each source image (Zhang, 2004). The fused image contains more information than each source image and is thus more suitable for machine processing and human visual perception. Image fusion has a wide range of applications in many fields, such as computer vision, remote sensing, medical imaging, and video surveillance (Goshtasby and Nikolov, 2007). Sensors of the same type acquire information in a similar way, so single-modal image fusion cannot provide information about the same scene from different aspects. On the contrary, multi-modal image fusion (Ma et al., 2019) realizes the complementarity of different features of the same scene by fusing images collected by different types of sensors and generates an informative image for subsequent processing. As typical multi-modal images, infrared and visible images, and CT and MRI images, can provide distinctive features and complementary information: infrared images capture thermal radiation and visible images capture reflected light; CT is mainly used for imaging hard tissue (e.g., bones), and MRI is mainly used for imaging soft tissue. Therefore, multi-modal image fusion has a wide range of applications in engineering practice.
To realize image fusion, many scholars have proposed a large number of fusion algorithms in recent years. In general, fusion methods can be divided into two categories: spatial-domain methods and transform-domain methods. Typical methods in the first category include the weighted average method and the principal component analysis (PCA) method (Yu et al., 2011); they fuse the gray values of image pixels directly. Although direct operation on the pixels has low complexity, the fusion process is less robust to noise, and the results cannot meet the needs of the application in most cases. To overcome this shortcoming, transform-based fusion methods have been proposed (Burt and Adelson, 1983; Haribabu and Bindu, 2017; Li et al., 2019). In general, the transform-based methods obtain the transformed coefficients of an image using a certain set of basis functions, fuse these coefficients through certain fusion rules, and finally obtain the fused image through the corresponding inverse transform. For example, Burt and Adelson (1983) formed a Laplacian pyramid (LP) by downsampling and filtering the source images, and then designed different fusion strategies at each layer; the fused image is obtained by applying the inverse transform to the fused coefficients. Haribabu and Bindu (2017) first decomposed the source images by using the discrete wavelet transform (DWT) and fused the coefficients with predefined fusion rules, and then obtained the final image by applying the inverse discrete wavelet transform to the fused coefficients. Because the transform-based methods employ average-weighted fusion rules for the low-frequency components, which carry most of the energy of the image, the final fused image tends to suffer from contrast loss.
In addition to the traditional spatial-domain and transform-domain methods, sparse representation (SR) has been extensively used in image fusion in recent years (Yang and Li, 2010; Jiang and Wang, 2014; Liu et al., 2016; Zhang and Levine, 2016). The SR method assumes that the signal to be processed, y ∈ R^n, can be written as y = Dx, where D ∈ R^(n×m) (n ≪ m) is an overcomplete dictionary formed by a set of image sub-blocks, n is the dimension of the signal, m is the number of atoms in the dictionary D, and x is the sparse coefficient vector. The fused image is reconstructed by fusing the sparse coefficients. Although the SR-based method has achieved many results in the field of image fusion, some detailed information is lost in the reconstructed image (e.g., edges and textures tend to be smoothed), which limits the ability of SR to express images (Yang and Li, 2010). To solve this problem, some scholars have proposed improved algorithms (Jiang and Wang, 2014; Liu et al., 2016). For instance, Jiang and Wang (2014) used morphological component analysis (MCA) to represent the source images more effectively. The MCA method first applies SR to separate the source images into two parts, cartoon and texture, and then designs different fusion rules to fuse these two parts, respectively. Finally, a fused image with rich information is obtained, and more characteristic features of the source images are preserved.
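As a toy illustration of the y = Dx model described above (our example with made-up dimensions and a random dictionary, not the fusion method itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 64                        # signal dimension n << number of atoms m
D = rng.standard_normal((n, m))      # overcomplete dictionary; in image fusion
                                     # its columns would come from image sub-blocks
x = np.zeros(m)
x[[3, 17, 42]] = [1.0, -0.5, 2.0]    # sparse coefficient vector: 3 active atoms
y = D @ x                            # the signal is a combination of only 3 atoms
```

SR-based fusion then operates on the sparse vectors x of the two source images rather than directly on the pixels.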
As an extension of the vector and matrix, the tensor (Kolda and Bader, 2009) plays an important role in high-dimensional data processing. In computer science, a tensor is a multi-dimensional array that generalizes common data types: a tensor of order 0 is a constant, a tensor of order 1 is a vector, a tensor of order 2 is a matrix, and a tensor of order N (N ≥ 3) is called a high-order tensor. In essence, tensor decomposition is a high-order generalization of matrix decomposition, mainly applied to dimensionality reduction, sparse data filling, and implicit relationship mining. Information processing based on tensors is more suitable for processing high-dimensional data and extracting feature information than vectors and matrices, and relevant applications have emerged in recent years (Bengua et al., 2015, 2017a). In view of the excellent performance of tensors in representing high-dimensional data and extracting features, a tensor-based high-order singular value decomposition (HOSVD) method (Liang et al., 2012) was applied to image fusion and achieved good results. In this method, the source image is initialized into a tensor which is subsequently decomposed into several sub-tensors by using a sliding window technique. Then, HOSVD is applied to each sub-tensor to extract the corresponding features, which are fused by employing certain fusion rules.
Since HOSVD is an approximate decomposition method, it leads to a loss of information in the process of image fusion. At the same time, its computational cost is large and a stable closed-form solution cannot be obtained. To avoid the loss of detailed information, a novel method based on the matrix product state (MPS) is proposed to fuse multi-modal images. Compared with HOSVD, MPS captures the image information accurately. Moreover, differently from SR, which linearly represents images using atoms in an overcomplete dictionary, MPS decomposes the image tensor into matrix product form. The main difference is that SR is an approximate decomposition, while MPS is an exact decomposition; therefore, in terms of signal reconstruction, MPS expresses the signal better. The main contributions of the article are outlined as follows: (i) Considering that image fusion depends more on the local information of the source images and that dividing the image into blocks captures more details of each pixel, the two source images are first divided into several sub-image blocks, and the corresponding sub-image blocks are initialized into sub-tensors; (ii) We perform MPS on each sub-tensor separately to obtain the corresponding core matrices. The core matrices are fused using a fusion rule based on the sigmoid function which incorporates the conventional choose-max strategy and the weighted-average strategy. This fusion strategy preserves the features of the multi-modal source images and reduces the loss of contrast to the greatest extent; (iii) Due to the application of MPS, the computational complexity of tensor-based image fusion is reduced dramatically, since the MPS decomposition is realized by computing a series of sub-tensors with maximum order 3. Moreover, a stable closed-form solution can be obtained in the proposed algorithm.
The rest of the article is organized as follows. Section 2 introduces the theory of matrix product decomposition. In section 3, the algorithm principle and the fusion steps are discussed in detail. Subsequently, the results of the experiments are presented in section 4. Finally, some conclusions are drawn in section 5.

Tensor
A tensor is a generalization of the vector; a vector is a tensor of order 1. For the simplicity and accuracy of the following expressions, we first introduce some notation for tensors. A tensor of order 0 is a constant, represented by a lowercase letter x; a tensor of order 1 is a vector, represented by a bold lowercase letter x; a tensor of order 2 is a matrix, represented by a bold capital letter X; a tensor of order 3 is represented by a bold italic capital letter X. In this way, a tensor of order N whose dimensions are I_1 × I_2 × ⋯ × I_N can be expressed as X ∈ R^(I_1×I_2×⋯×I_N), where I_i is the length of the i-th dimension. In general, we use x_{i_1 ⋯ i_N} to represent the (i_1, ⋯, i_N)-th element of X.
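In array-programming terms, the hierarchy above is simply the number of axes; a quick illustrative sketch:

```python
import numpy as np

x = 3.14                              # order 0: a constant
v = np.array([1.0, 2.0, 3.0])         # order 1: a vector
M = np.zeros((4, 5))                  # order 2: a matrix
X = np.zeros((256, 256, 2))           # order 3: e.g., two stacked 256 x 256 images
X[10, 20, 1] = 7.0                    # the (i1, i2, i3)-th element x_{i1 i2 i3}
```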

MPS for Tensor
The MPS decomposition (Perez-Garcia et al., 2006; Schollwock, 2011; Schuch et al., 2011; Sanz et al., 2016) aims to decompose an N-dimensional tensor X into left-right orthogonal factor matrices and a core matrix. First, all the dimensions of the N-dimensional tensor X are rearranged so that one mode of size K corresponds to the number of images to be fused; for example, if the number of source images is 2, then K = 2. The rearranged tensor satisfies X ∈ R^(I_1×⋯×I_{n−1}×K×I_n×⋯×I_N), with I_1 ≥ ⋯ ≥ I_{n−1} and I_n ≤ ⋯ ≤ I_N. The elements of X can then be expressed in MPS form (the schematic diagram of the MPS form of X is shown in Figure 1):

x_{i_1 ⋯ i_{n−1} k i_n ⋯ i_N} = A^(1)_{i_1} ⋯ A^(n−1)_{i_{n−1}} C^n_k B^(n)_{i_n} ⋯ B^(N)_{i_N}    (1)

The factor matrices A^(j)_{i_j} and B^(j)_{i_j} in the above formula are called left-right orthogonal factor matrices of size δ_{j−1} × δ_j, where δ_0 = δ_{N+1} = 1, and they are all orthogonal:

Σ_{i_j} (A^(j)_{i_j})^T A^(j)_{i_j} = I  and  Σ_{i_j} B^(j)_{i_j} (B^(j)_{i_j})^T = I    (2)

where I is an identity matrix and C^n_k is called the core matrix. A tensor X can be decomposed into the form of (1) through two series of SVDs. The process includes a left-to-right sweep and a right-to-left sweep. We summarize it in Algorithm 1.
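To make the two sweeps concrete for the third-order case used later (two M × N source images stacked along a middle mode of size K = 2), the sweeps reduce to two SVDs. The following NumPy sketch is ours, not the paper's code, and the function and variable names are hypothetical:

```python
import numpy as np

def mps_decompose(X):
    """Exact MPS (tensor-train) form of a 3rd-order tensor X:
    X[i, k, j] = sum_{a,b} L[i, a] * C[a, k, b] * R[b, j].
    L and R are the left/right orthogonal factors; C holds the
    core matrices C[:, k, :], one per source image k."""
    I1, K, I2 = X.shape
    # left-to-right sweep: SVD of the mode-1 unfolding
    U, S, Vt = np.linalg.svd(X.reshape(I1, K * I2), full_matrices=False)
    L = U                                           # left orthogonal factor
    rest = (np.diag(S) @ Vt).reshape(-1, K, I2)
    r1 = rest.shape[0]
    # right-to-left sweep: SVD of the remaining unfolding
    U2, S2, Vt2 = np.linalg.svd(rest.reshape(r1 * K, I2), full_matrices=False)
    C = (U2 @ np.diag(S2)).reshape(r1, K, -1)       # core tensor
    R = Vt2                                         # right orthogonal factor
    return L, C, R

def mps_reconstruct(L, C, R):
    """Inverse of mps_decompose: contract the train back into a tensor."""
    return np.einsum('ia,akb,bj->ikj', L, C, R)
```

Because no singular values are truncated, the reconstruction is exact up to floating-point error, which is the lossless, closed-form property the article relies on.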

IMAGE FUSION BASED ON MPS
In this section, the whole process of image fusion is described. The source images, which have been reconstructed into tensors, are decomposed into a series of sub-tensors by using the sliding window technique. The graphical representation of the sliding window technique is shown in Figure 2. Then MPS is applied to the decomposed sub-tensors to obtain the core matrices, and the sigmoid function is used to fuse each pair of core matrices into a fused core matrix.
The specific theoretical concepts of decomposition and fusion are described in sections 3.1, 3.2, respectively, and the overall process of image fusion proposed in this article is described in section 3.3.

Tensor Decomposition by MPS
For the two source images A and B, each of size M × N, we construct a tensor X of dimension M × N × 2. Taking into account the importance of the local information of the source images, a sliding window technique is used to decompose X into several sub-tensors F, each of dimension d × d × 2, where d is the patch size, and the sliding step p should satisfy p ≤ min{M, N}; the sub-tensors F are obtained by Algorithm 2, as follows. In Algorithm 2, fix((M − patch size)/step size) and fix((N − patch size)/step size) round (M − patch size)/step size and (N − patch size)/step size toward zero, respectively (i.e., truncate to an integer). Then, MPS is applied to each of the sub-tensors.
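A sketch of the sub-tensor extraction described above (our illustrative code, not the paper's Algorithm 2; names are hypothetical):

```python
import numpy as np

def extract_subtensors(A, B, patch_size=16, step_size=1):
    """Slide a window over two M x N source images and stack each
    pair of co-located patches into a patch_size x patch_size x 2
    sub-tensor, recording the top-left corner of each window."""
    M, N = A.shape
    subtensors = []
    for r in range(0, M - patch_size + 1, step_size):
        for c in range(0, N - patch_size + 1, step_size):
            F = np.stack([A[r:r + patch_size, c:c + patch_size],
                          B[r:r + patch_size, c:c + patch_size]], axis=-1)
            subtensors.append(((r, c), F))
    return subtensors
```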

Design of Fusion Rule
We introduce the sigmoid function as the fusion rule for the characteristic coefficients. A fusion coefficient e_i(l) is defined for each core matrix, where the subscript i indicates the number of each sub-image and l is the label of the corresponding source image. For the e_i(l) obtained in the previous section, the fusion rule is selected by comparing the values of e_i(1) and e_i(2). When e_i(1) is much less or much greater than e_i(2), we use the choose-max rule; otherwise, we use a weighted average to fuse the corresponding coefficient matrices and obtain the final fused coefficient matrix. Both cases are captured by the following function:

D_i = 1 / (1 + exp(−k ln(e_i(1)/e_i(2)))) × C_i(:, :, 1) + exp(−k ln(e_i(1)/e_i(2))) / (1 + exp(−k ln(e_i(1)/e_i(2)))) × C_i(:, :, 2)    (5)

where k is the shrinkage factor of the mentioned sigmoid function. The two weights sum to 1, and for large k the weight of the core matrix with the larger e_i(l) approaches 1. After D_i is obtained, each fused sub-image block F_i can be reconstructed by the inverse operations of MPS. The sub-image blocks F_i are then used to obtain the final fused image G.
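A sketch of this fusion rule follows. Here e(l) is taken as the Frobenius norm of each core matrix; this is an assumption for illustration, since the defining equation of e_i(l) is not reproduced in this excerpt:

```python
import numpy as np

def fuse_cores(C1, C2, k=200.0):
    """Sigmoid fusion of a pair of core matrices in the spirit of Eq. (5).
    The energies e(l) are assumed to be Frobenius norms."""
    e1, e2 = np.linalg.norm(C1), np.linalg.norm(C2)
    t = np.exp(-k * np.log(e1 / e2))   # equals (e1 / e2)**(-k)
    w1 = 1.0 / (1.0 + t)               # -> 1 when e1 >> e2 (choose-max)
    w2 = 1.0 - w1                      # -> 1 when e2 >> e1; w1 + w2 = 1
    return w1 * C1 + w2 * C2
```

With a large shrinkage factor (k = 200 in the experiments) the weights saturate quickly, so the rule behaves like choose-max unless e_i(1) ≈ e_i(2), where it degrades gracefully into a weighted average.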
To make the process of decomposition and fusion more concrete, the first group of the experiment images is used as an example to make a flowchart as shown in Figure 3:

The Process of Image Fusion Based on MPS
The process of image fusion based on MPS can be divided into the following seven steps:
1. Input two source images;
2. Reconstruct the two source images into a third-order tensor, and extract the sub-tensors by the sliding window technique;
3. Apply matrix product state decomposition to the sub-tensors to obtain the left and right factor matrices and the core matrices;
4. Compare the vectors representing source image 1 and source image 2 in the core matrices obtained in step 3, obtain the fused matrices by mapping their quantitative relations to the corresponding cases of the sigmoid function, and then assemble them into sub-tensors;
5. Multiply the fused sub-tensors by the left and right factor matrices to obtain sub-images;
6. Add the sub-images;
7. Output the fused image.
The specific flowchart is shown in Figure 4.

Standard deviation (SD)
SD is defined as follows:

SD = sqrt( (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (F(i, j) − µ)² )

where µ is the average value of the fused image F, and H and W are the length and width of the image, respectively. SD is mainly used to measure the contrast of the fused image.
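The SD metric is straightforward to compute; a minimal sketch:

```python
import numpy as np

def sd_metric(F):
    """Standard deviation of a fused image F (H x W): the root mean
    squared deviation of the pixels from the image mean."""
    mu = F.mean()
    return np.sqrt(np.mean((F - mu) ** 2))
```

A larger SD indicates a higher-contrast fused image; a constant image has SD 0.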

Mutual information (MI)
Mutual information is defined as follows:

MI_{R,F} = Σ_{i=1}^{L} Σ_{j=1}^{L} h_{R,F}(i, j) log2( h_{R,F}(i, j) / (h_R(i) h_F(j)) )

where h_{R,F}(i, j) is the normalized joint gray-level histogram of the source image R and the fused image F, h_R(i) and h_F(j) are the normalized marginal histograms of the two images, respectively, and L is the number of gray levels.
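A histogram-based sketch of this metric (our illustrative code, assuming integer gray levels in [0, L); for the fusion metric, the MI of each source image with F would then be summed):

```python
import numpy as np

def mutual_information(R, F, L=256):
    """MI between image R and image F from the normalized joint
    gray-level histogram, in bits (log base 2)."""
    joint, _, _ = np.histogram2d(R.ravel(), F.ravel(),
                                 bins=L, range=[[0, L], [0, L]])
    pxy = joint / joint.sum()          # normalized joint histogram h_{R,F}
    px = pxy.sum(axis=1)               # marginal h_R
    py = pxy.sum(axis=0)               # marginal h_F
    nz = pxy > 0                       # skip empty bins (0 * log 0 = 0)
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz]))
```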

Structural similarity (SSIM)
Structural similarity is defined as follows:

SSIM(x, y) = [ (2µ_x µ_y + c_1) / (µ_x² + µ_y² + c_1) ]^α × [ (2σ_x σ_y + c_2) / (σ_x² + σ_y² + c_2) ]^β × [ (σ_{xy} + c_3) / (σ_x σ_y + c_3) ]^γ

where µ_x and µ_y are the average values of x and y. The middle term represents the similarity of contrast, and σ_x and σ_y are the SDs of x and y. The right term characterizes the structural similarity, and σ_{xy} is the covariance of x and y. c_1, c_2, and c_3 are three constants, and the parameters α, β, and γ, respectively, adjust the contributions of the three terms. SSIM measures the similarity between the fused image and a source image; its value lies between 0 and 1, and the closer it is to 1, the more similar the two images are. The average of the SSIM values between the fused image F and the two source images A and B is taken as the final evaluation metric, namely

SSIM = ( SSIM(A, F) + SSIM(B, F) ) / 2

Gradient-based fusion metric (Q_G)
Q_G is defined as follows:

Q_G = Σ_x Σ_y [ Q^{AF}(x, y) w^A(x, y) + Q^{BF}(x, y) w^B(x, y) ] / Σ_x Σ_y [ w^A(x, y) + w^B(x, y) ]

where Q^{AF}(x, y) = Q^{AF}_g(x, y) Q^{AF}_α(x, y); at each pixel (x, y), Q^{AF}_g(x, y) and Q^{AF}_α(x, y) denote the edge strength and orientation preservation values. Q^{BF}(x, y) is defined in the same way as Q^{AF}(x, y). The weighting factors w^A(x, y) and w^B(x, y) indicate the significance of Q^{AF}(x, y) and Q^{BF}(x, y). Q_G is an important fused-image quality metric computing the amount of gradient information that is injected into the fused image from the source images.

Phase congruency-based fusion metric (Q_P)
Q_P is defined as follows:

Q_P = (P_p)^α (P_M)^β (P_m)^γ

where p, M, and m refer to the phase congruency and the maximum and minimum moments. The parameters α, β, and γ are set to 1 in this article. For more detailed information on the parameters, please refer to Hong (2000). Q_P measures the extent to which the salient features in the source images are preserved.
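As one concrete example of these metrics, a single-window SSIM with α = β = γ = 1 can be sketched as follows (the contrast and structure terms combine into the familiar two-term form under the conventional assumption c_3 = c_2 / 2; practical implementations compute SSIM over local windows and average):

```python
import numpy as np

def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """Global (single-window) SSIM for 8-bit images; c1 and c2 are the
    conventional (0.01 * 255)**2 and (0.03 * 255)**2 stabilizers."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    sxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * sxy + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def ssim_fusion_metric(A, B, F):
    """Average SSIM between the fused image F and the two sources."""
    return 0.5 * (ssim_global(A, F) + ssim_global(B, F))
```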

Study of Patch Size and Step Size
Considering that the sliding window technique is used, we first study the respective influence of the size of the sub-image block and the step size of the sliding window on the performance of the fused image experimentally. In the following, we briefly call these two factors the patch size and the step size. To obtain the optimal patch size and step size, we use a pair of infrared and visible images as source images, as shown in Figures 5A,B. In the experiment on patch size, the patch size is set to 2×2, 4×4, 6×6, 8×8, 10×10, 12×12, 14×14, 16×16, 18×18, and 20×20 with the step size fixed to 1 and the shrinkage factor fixed to 200. In the experiment on step size, the step size is set to 1, 2, 4, 6, 8, and 10 with the patch size fixed to 16×16 and the shrinkage factor fixed to 200. The experimental results based on the objective evaluation metrics are shown in Tables 1, 2. The output fused images are shown in Figures 5, 6. It can be seen clearly from Table 1 that, in most cases, the best results are obtained when the size of the sub-image block is 16 × 16. A simple analysis suggests that when the sub-image block is too small, the image characteristics cannot be effectively represented. Additionally, it can be seen from Table 2 that the best results are obtained when the step size is 1. A simple analysis suggests that when the step size is too large, local information of the image may be lost or poorly displayed. Therefore, in the following experiments, the patch size was set to 16 × 16 and the step size was set to 1.

Computation Complexity
The computation time of each group of experimental images is recorded when the different fusion algorithms are used. The experimental results show that the complexity of the proposed algorithm is lower than that of the other algorithms. The results are shown in Table 3. All the codes are run under MATLAB R2014a on a computer with an Intel i7-7700K CPU (4.2 GHz) and 16 GB of RAM. As can be seen from the table, the proposed algorithm runs faster than SR and the dual-tree complex wavelet transform combined with sparse representation (DTCWT-SR). In general, the computational complexity of the proposed algorithm is reduced.

Experimental Results and Discussion
In this section, the effectiveness of the proposed method is further verified by comparing its experimental results with those of other fusion methods. The comparison methods are DWT (Haribabu and Bindu, 2017), LP (Burt and Adelson, 1983), an SR-based method (Liu et al., 2016), VGG-Net (Hui et al., 2018), and DTCWT-SR (Singh et al., 2012). In addition to the infrared and visible images used in the previous section, CT and MRI medical images are also used for the comparison experiments. The performance of each algorithm is evaluated by calculating the evaluation metrics on the fusion results. In the experiments, all source images are of size 256×256, the patch size is fixed to 16 × 16, the step size is 1, and the shrinkage factor k is 200. The proposed method and the comparison algorithms are applied to nine pairs of source images. The experimental results are shown in Figures 7-15, respectively. The objective evaluation metric values of the nine pairs of images are shown in Tables 4, 5.
It can be seen from the tables that, in most cases, the algorithm proposed in this article achieves optimal results; especially for CT and MRI images, the metric values of MPS are much higher than those of the other methods. For infrared and visible images, the proposed method also achieves optimal results under more than half of the evaluation metrics. These results show that the proposed method outperforms the other methods for multi-modal image fusion. This advantage mainly benefits from two aspects: (i) the sliding window method divides the image into several sub-images, so the local information of the image is captured well; (ii) MPS is an exact decomposition and reconstruction method, so no information is lost in the decomposition during image fusion.
Further analysis of the experimental results shows that: (i) on the whole, VGG-Net has the worst performance in all cases, with a large gap in the various evaluation metrics compared with the other methods. This is because insufficient information is captured in the layer-by-layer feature extraction of the source images, and when the details of the fused image are weighted by the final weight map, the contrast of the initial detail part of the fused image is reduced; (ii) among the two multi-scale methods, the DWT fusion method performs poorly. This is because the DWT method is based on the Haar wavelet, which can only capture image features in the horizontal and vertical directions but cannot capture more basic features of the image; the LP method is better than the DWT method because the Laplacian pyramid generates only one band-pass component at each scale, which reduces the possibility of being affected by noise; (iii) the results obtained by the SR method are better than those of the other multi-scale methods in most cases but not as good as the proposed method. This is because the signal representation ability of SR is better than that of multi-scale transforms, but errors occur in the process of signal reconstruction, which is unavoidable for the SR method. The method proposed in this article effectively avoids this problem through lossless tensor reconstruction. In addition, the "max-L1" rule of direct fusion in the spatial domain leads to spatial inconsistency, which affects the performance of the SR method; (iv) DTCWT-SR is a method that combines a multi-scale transform with SR. Judging by the objective evaluation metrics, its fusion performance is better than SR in some aspects, but it is still poor compared with MPS.
In addition to the objective evaluation, the performance of the proposed algorithm is also discussed through visual comparisons of the fused images. In general, the proposed method achieves the best visual effect among all the fused images. The fusion results of the infrared-visible images are shown in Figures 7-11. It can be seen from the figures that the method proposed in this article has good adaptability, and the fused images retain the information of the infrared and visible images, respectively. In Figure 7, both the multi-scale fusion methods and SR show varying degrees of artificial traces at the junction between the trees and the sky in the upper left corner, while DTCWT-SR and VGG-Net result in severe contrast loss. In Figure 8, the white squares in the infrared image are dimmed to varying degrees by the DWT, LP, SR, DTCWT-SR, and VGG-Net methods, and the leaf luster in the visible image is not well displayed by the VGG-Net method. In Figure 9, DWT and SR show information loss. LP, DTCWT-SR, and VGG-Net obtain relatively complete fused images, but their brightness is weaker than that of MPS. The clarity of the billboard in the upper left corner of the fused image is better with the MPS method. In Figure 10, the fused images obtained by DWT and SR show some small black blocks, that is, information loss, while the brightness of the human shape on the right side of the images obtained by the LP, DTCWT-SR, and VGG-Net methods is low. The reason for these shortcomings is that the fusion rules used by these methods all apply a certain degree of weighting to the source images. Our fusion rule based on the sigmoid function avoids these shortcomings: in an image whose colors are only black and white, the weight of the white part will be much larger than that of the black part, so the rule evolves into the choose-max rule.
In Figure 11, compared with the other five comparison methods, the human figure on the right and the branch in the lower right corner of the fused image obtained by MPS have the highest resolution. Figures 12-15 show the fusion results of the CT and MRI medical images. It can be seen from the experimental results that the DWT method cannot be applied to the fusion of medical images, while the other four methods obtain a complete image. In Figure 12, the LP, DTCWT-SR, and VGG-Net methods have no loss in details, but the sharpness of the light-dark junction is insufficient, the edges are blurred, and contrast is lost. Moreover, the bottom of the fused image obtained by the SR method is fractured, indicating information loss. In Figure 13, the spine in the lower right corner and the jaw in the lower left corner of the image obtained by MPS are clearer than with the other five methods, the brain veins are also clearer, and the contrast is higher. In Figure 14, the fused images obtained by the LP and SR methods are fractured at the lower right corner. Although the DTCWT-SR and VGG-Net methods obtain relatively complete fused images, there is a certain degree of contrast loss. In Figure 15, the LP, DTCWT-SR, and VGG-Net methods show some contrast loss, especially in the middle part; at the same time, the image obtained by the SR method presents spatial dislocation on both sides of the eyeball, and a certain degree of distortion appears at the white connection of the two images. The SR method also has similar shortcomings elsewhere; see the lower right corner of the image.

CONCLUSION
In this article, we propose a method based on MPS for multi-modal image fusion. First, the source images are initialized into a third-order tensor, which is then decomposed into several sub-tensors by using a sliding window to obtain the corresponding features. The core matrices are fused by the fusion rule based on the sigmoid function, and the fused image is obtained by multiplying by the left-right factor matrices.
In this article, we use a sliding window to avoid blocking effects, and fully consider the local information of the source images by dividing the source image into a set of sub-images.
The experimental results show that the proposed algorithm is feasible and effective for image fusion. Unlike the average fusion rule of the multi-scale methods and the "max-L1" fusion rule of the SR method, the fusion rule based on the sigmoid function used in this article is more effective, but it also makes the fusion process of the proposed method more complicated. Future study will focus on exploring a more effective fusion rule to further improve the fusion results.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.