ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 06 August 2021
Sec. Statistical and Computational Physics
Volume 7 - 2021 | https://doi.org/10.3389/fams.2021.716044

Entanglement-Based Feature Extraction by Tensor Network Machine Learning

Yuhan Liu1, Wen-Jun Li2, Xiao Zhang3, Maciej Lewenstein4,5, Gang Su2,6* and Shi-Ju Ran7*
  • 1Department of Physics, University of Chicago, Chicago, IL, United States
  • 2School of Physical Sciences, University of Chinese Academy of Sciences, Beijing, China
  • 3Department of Physics, Sun Yat-sen University, Guangzhou, China
  • 4ICFO - Institut de Ciencies Fotoniques, The Barcelona Institute of Science and Technology, Barcelona, Spain
  • 5ICREA, Pg. Lluís Companys 23, Barcelona, Spain
  • 6Kavli Institute for Theoretical Sciences, and CAS Center for Excellence in Topological Quantum Computation, University of Chinese Academy of Sciences, Beijing, China
  • 7Department of Physics, Capital Normal University, Beijing, China

How entanglement, a quantity from quantum information theory, can assist machine learning is a topic of active interest. In this work, we implement numerical experiments to classify patterns/images by representing the classifiers as matrix product states (MPS). We show how entanglement can make machine learning interpretable by characterizing the importance of data, and propose a feature extraction algorithm based on it. We show on the MNIST dataset that when the number of retained pixels is reduced to 1/10 of the original number, the decrease of the ten-class testing accuracy is only $O(10^{-3})$, which significantly improves the efficiency of MPS machine learning. Our work improves the interpretability and efficiency of machine learning under the MPS representation by exploiting how MPS encodes entanglement.

1 Introduction

Pattern recognition and classification are important tasks in classical information processing. The classical patterns in question may correspond to images, temporal sound sequences, financial data, etc. During the last 30 years of developments in quantum information science, there have been many attempts to generalize classical information-processing schemes to their quantum analogs. Examples include quantum perceptrons and quantum neural networks (e.g., see some early works [1–3] and a review [4]), quantum finance (e.g., [5]), and quantum game theories [6–8], to name but a few. More recently, there have been successful proposals to use quantum mechanics to enhance learning processes by introducing quantum gates, circuits, or quantum computers [9–14].

Conversely, various efforts have been made to apply the methods of quantum information theory to classical information processing, for instance, by mapping classical images to quantum mechanical states. In 2000, Hao et al. [15] developed a representation technique for long DNA sequences and obtained mathematical objects similar to many-body wavefunctions. In 2005, Latorre [16] independently developed a mapping between bitmap images and many-body wavefunctions, and applied quantum information techniques to build an image compression algorithm. Although the compression rate was not competitive with standard algorithms like JPEG, this work provided the valuable insight [17] that Latorre's mapping might be inverted to obtain bitmap images out of many-body wavefunctions, which was later developed in Ref. [18].

This interdisciplinary field has become active recently, owing to the exciting breakthroughs in quantum technologies (see some general introductions in, e.g., [19–22]). Among the ongoing research, an interesting topic is how to design interpretable and efficient machine learning algorithms that are executable on quantum computers [23–26]. In particular, remarkable progress has been made in the field merging quantum many-body physics and quantum machine learning [27] based on tensor networks (TN) [28–42]. TN provides a powerful mathematical structure that can efficiently represent the subset of many-body states [43–47] that satisfy the area-law scaling of the entanglement entropy, for example, the nearest-neighbor resonating valence bond (RVB) state [48, 49] and the ground states of one-dimensional gapped local Hamiltonians [50, 51]. Paradigmatic examples of TN include matrix product states (MPS) [28, 29, 43, 52–56], projected entangled pair states [43, 57], tree TN states [31–33, 58, 59], and the multi-scale entanglement renormalization ansatz [60–62]. It is worth noticing that the variational training of tensor networks can be realized on actual quantum platforms [63–65], which is powerful for solving ground states of many-body systems.

TN also exhibits great potential in machine learning, as it provides a natural way to build mathematical connections between quantum physics and classical information. Among others, MPS has been utilized for supervised image recognition [28, 65] and for generative modeling to learn joint probability distributions [29]. It was argued in [30] that long-range correlation is not essential in image classification, which makes the use of MPS feasible. Tree TNs with a hierarchical structure have also been applied to natural language modeling [31] and image recognition [32, 33].

Despite these inspiring achievements, several pressing challenges remain. One concerns how to improve the interpretability of machine learning [66–73] by incorporating quantum information theories. Classical machine learning models are sometimes called “black boxes”, in the sense that while we can obtain accurate predictions, we cannot clearly explain or identify the logic behind them. For TN, one challenge is how to improve the algorithms by utilizing the underlying relations between the quantum states' properties (e.g., entanglement) and classical data.

In this work, we implement simple numerical experiments using the MPS representation of the classifier (Figure 1) [28] and propose a quantum-inspired technique for machine learning. We show how entanglement can emerge from images, characterizing the importance of features, and use it to improve the interpretability and efficiency of machine learning under the MPS representation. The efficiency of a feature extraction scheme is usually characterized by the number of extracted features needed to reach a preset accuracy. Our feature extraction algorithm significantly improves the efficiency while causing little harm to the accuracy. Specifically, for the ten-class classifiers of the MNIST dataset [74] (see more details of the MNIST dataset in Supplementary Appendix A), the number of features can be safely lowered to less than 1/10 of the original number with only an $O(10^{-3})$ decrease of accuracy.

FIGURE 1

FIGURE 1. Illustration of the MPS Ψ̂ for image classification. The pixels in an image are vectorized to a many-qubit state v by the feature map (Eq. 1), and then contracted with the MPS Ψ̂ to obtain the prediction of the classification (Eq. 3). The MPS illustrated here covers the 2D image in a zig-zag path. Each red or orange ball represents a tensor $A^{[l]}$, and the number $l$ on a tensor indicates its position in the MPS. The tensor which is being optimized and carries the label bond is highlighted in orange. Ψ̂ satisfies the orthogonality conditions indicated by the arrows.

2 Matrix Product State and Training Algorithm

2.1 Mapping Image to Quantum Space

TN machine learning contains two key ingredients. The first is the feature map [33], which encodes each sample (image) into a multi-qubit product state. Following conventional TN machine learning [28, 30], each feature (say $p_{n,l}$, the $l$th pixel of the $n$th image) is transformed to a single-site state given by a $d$-dimensional normalized vector $w^{[n,l]}$ as

$$w^{[n,l]}_{m}=\sqrt{\binom{d-1}{m-1}}\,\cos^{d-m}\!\left(\frac{\theta\pi}{2}p_{n,l}\right)\sin^{m-1}\!\left(\frac{\theta\pi}{2}p_{n,l}\right), \tag{1}$$

where $m$ runs from 1 to $d$, and θ is a hyper-parameter that controls the maximum angle. We take $d = 2$ in this work, so each qubit state is $w^{[n,l]} = w^{[n,l]}_{1}|0\rangle + w^{[n,l]}_{2}|1\rangle$. By adjusting θ in Eq. 1 as a hyper-parameter, we find that the best accuracy is achieved for θ = 0.5 without the discrete cosine transformation (DCT) [75–77] and for θ = 7.8 with DCT. These values are used in the following numerical experiments (see Supplementary Appendix B for a discussion on θ). Note that here we use bold symbols to represent tensors without writing the components explicitly. In this way, the $n$th image that contains $L$ pixels is mapped to an $L$-qubit product state

$$v^{[n]}=\bigotimes_{l=1}^{L} w^{[n,l]}. \tag{2}$$

Besides mapping an image to a product state, there are other types of quantum image representations [78–81], for example, the FRQI [78] and 2-D QSNA [79], where the image is encoded in an entangled state to minimize the needed number of qubits. Here we choose the product-state representation to suit the MPS structure.
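For concreteness, the following is a minimal sketch of the feature map of Eqs. 1–2; the function name and the random test image are our own illustration, not the authors' code:

```python
import numpy as np
from math import comb

def feature_map(pixels, theta=0.5, d=2):
    """Map pixels scaled to [0, 1] to L normalized d-dimensional vectors (Eq. 1)."""
    p = np.asarray(pixels, dtype=float).ravel()        # flatten along the chosen 1D path
    ang = 0.5 * theta * np.pi * p                      # theta*pi/2 * p_{n,l}
    m = np.arange(1, d + 1)
    binom = np.sqrt([comb(d - 1, k - 1) for k in m])   # square root of the binomial factor
    w = binom * np.cos(ang)[:, None] ** (d - m) * np.sin(ang)[:, None] ** (m - 1)
    return w                                           # shape (L, d); v is the tensor product of the rows (Eq. 2)

img = np.random.rand(28, 28)                           # stand-in for a rescaled MNIST image
vecs = feature_map(img)
assert np.allclose((vecs ** 2).sum(axis=1), 1.0)       # each single-site vector is normalized
```

For $d = 2$ each row reduces to $(\cos(\theta\pi p/2), \sin(\theta\pi p/2))$, so the normalization check is just $\cos^2 + \sin^2 = 1$.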

2.2 Matrix Product State Representation and Training Algorithm

The second key ingredient concerns the MPS representation and the training algorithm. TN provides a powerful mathematical tool to represent a wavefunction as a network of connected tensors. In the context of quantum many-body physics, TN has been widely used, for example, to simulate the ground states, dynamic properties, and finite-temperature properties of quantum lattice models [43–45, 55, 56, 82–90]. For the image classification of ten digits, we expect short-range correlations between pixels, which makes using MPS feasible [30]. Evidence for the short-range correlations among pixels is the superior performance of convolutional neural networks with small convolution kernels. One may expect the MPS representation to be less efficient when working with images with long-range correlations (one example may be images with fractal patterns).

In our TN machine learning setup, a $D$-class classifier is a linear map Ψ̂ which maps a $d^L$-dimensional vector $v^{[n]}$ to a $D$-dimensional vector $u^{[n]}$, with $D$ being the number of image classes in the classification. Its components are $\hat{\Psi}_{b,s_1\cdots s_L}$, where the index $b$ is the $D$-dimensional label bond, and the indices $\{s_l\}$ are the physical bonds (which will be contracted with the vectorized images $\{v^{[n]}\}$). The prediction of the classification for the $n$th image is obtained by contracting the corresponding vector $v^{[n]}$ (in Eq. 2) with Ψ̂ as

$$u^{[n]}_{b}=\sum_{s_1\cdots s_L}\hat{\Psi}_{b,s_1\cdots s_L}\,v^{[n]}_{s_1\cdots s_L}. \tag{3}$$

The $b$th component $u^{[n]}_{b}$ of $u^{[n]}$ gives the predicted probability for the $n$th image to belong to the $b$th class.

The number of parameters of Ψ̂ scales exponentially with $L$. To represent Ψ̂ efficiently, we take the MPS ansatz (Figure 1). MPS is one of the simplest 1D TN structures, and is convenient and efficient [28, 29]. The specific form of MPS we use is

$$\hat{\Psi}_{b,s_1\cdots s_L}=\sum_{a_1\cdots a_{L-1}}A^{[1]}_{s_1,a_1}A^{[2]}_{s_2,a_1a_2}\cdots A^{[l]}_{s_l,b,a_{l-1}a_l}\cdots A^{[L]}_{s_L,a_{L-1}}. \tag{4}$$

The elements of each tensor $\{A^{[l]}\}$ are initialized as random numbers drawn from the normal distribution $N(0, 1)$. The indices $\{a\}$, called the virtual bonds, are summed over. The dimension χ of the virtual bonds determines the maximal entanglement that can be carried by the MPS. With the MPS ansatz, the total number of parameters in Ψ̂ increases only linearly with $L$ for fixed χ, i.e., as $O(2L)$. We take χ = 32 for the numerical simulations in this work.
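To make the contraction of Eqs. 3–4 concrete, here is a minimal sketch that evaluates $u^{[n]}$ by sweeping a boundary vector through a randomly initialized MPS; the tensor shapes and names are our own convention:

```python
import numpy as np

def predict(mps, label_pos, vecs):
    """Contract the MPS classifier with a product state (Eq. 3).
    mps[l] has shape (chi_left, d, chi_right); the tensor at label_pos
    has shape (chi_left, d, D, chi_right), carrying the label bond."""
    env = np.ones(1)                                           # dummy left boundary bond
    for l, A in enumerate(mps):
        if l < label_pos:
            env = np.einsum('a,asb,s->b', env, A, vecs[l])
        elif l == label_pos:
            env = np.einsum('a,asDb,s->Db', env, A, vecs[l])   # label bond D stays open
        else:
            env = np.einsum('Da,asb,s->Db', env, A, vecs[l])
    return env[:, 0]                                           # shape (D,): one score per class

L, d, D, chi = 16, 2, 10, 32
label_pos = L // 2
mps = [np.random.randn(*((1 if l == 0 else chi, d, D, 1 if l == L - 1 else chi)
                         if l == label_pos else
                         (1 if l == 0 else chi, d, 1 if l == L - 1 else chi)))
       for l in range(L)]
vecs = np.random.rand(L, d)              # stand-in for feature-mapped pixels
u = predict(mps, label_pos, vecs)
print(np.argmax(u ** 2))                 # predicted class
```

The cost of this sweep grows linearly with the number of sites, which is what makes the MPS ansatz practical for images with hundreds of pixels.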

To train the MPS Ψ̂, we optimize the tensors $\{A^{[l]}\}$ one by one to minimize the cost function, the negative logarithmic likelihood (NLL) [29, 91]:

$$C=-\frac{1}{|\Gamma|}\sum_{n\in\Gamma}\ln\frac{\left|u^{[n]}_{b_n}\right|^{2}}{Z}. \tag{5}$$

For the supervised learning task, $b_n$ in $u^{[n]}_{b_n}$ denotes the known correct class of the $n$th image in the training set. The predicted probability of the $n$th image being correctly classified into the $b_n$th class is given by $|u^{[n]}_{b_n}|^2/Z$, with $Z$ the norm of the MPS. In the practical simulation, we utilize the canonical form of MPS [29] and keep $Z = 1$ throughout the updating process. Γ denotes the training set (with $|\Gamma|$ the number of training samples), and the summation runs over it. The cost function has no upper bound, and its lower bound is 0, reached when the predictions of all training samples perfectly satisfy $u^{[n]}_{b} = 1$ for $b = b_n$ and $u^{[n]}_{b} = 0$ for $b \neq b_n$.
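As a small illustration, the NLL of Eq. 5 for a batch of predictions could be evaluated as follows, a sketch assuming the MPS is kept normalized ($Z = 1$):

```python
import numpy as np

def nll(scores, labels):
    """Negative log-likelihood of Eq. 5. scores: (N, D) array of outputs
    u^[n]; labels: (N,) array of correct classes b_n. Assumes Z = 1,
    maintained via the canonical form of the MPS."""
    u_correct = scores[np.arange(len(labels)), labels]
    return -np.mean(np.log(u_correct ** 2))
```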

We use the gradient descent algorithm [28] to optimize $\{A^{[l]}\}$ one by one, sweeping back and forth along the MPS until the cost function converges. The label $b$ is always carried by the tensor that is currently being optimized. Taking the forward update as an example, to update $A^{[l]}_{s_l,b,a_{l-1}a_l}$, we first decompose it using the singular value decomposition (SVD):

$$A^{[l]}_{s_l,b,a_{l-1}a_l}=\sum_{a'_l a''_l}U_{a_{l-1}s_l,\,a'_l}\,S_{a'_l a''_l}\,V_{a''_l,\,a_l b}. \tag{6}$$

We then replace $A^{[l]}$ with $U$ and $A^{[l+1]}$ with $SVA^{[l+1]}$. Notice that the label $b$ is passed to $A^{[l+1]}$ in this step. After that, we perform the gradient descent step, updating $A^{[l]} \leftarrow A^{[l]} - \alpha\,\partial C/\partial A^{[l]}$ with the other tensors fixed. The step size of the gradient descent (learning rate) is controlled by α, a small empirical parameter. Having updated $A^{[l]}$, we proceed forward to update $A^{[l+1]}$ using the same method and pass the label $b$ to $A^{[l+2]}$, and so on. After a round of optimization, all the tensors have been updated once, and the label $b$ is back on the tensor we started at.
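A schematic sketch of one forward step of this sweep is given below; `grad_fn` is a hypothetical placeholder for the routine that evaluates $\partial C/\partial A^{[l]}$, and truncation of the singular values to χ is omitted:

```python
import numpy as np

def forward_step(mps, l, grad_fn, alpha=1e-3):
    """Split the label-carrying tensor A^[l] by SVD (Eq. 6), pass the
    label bond to A^[l+1], then take a gradient step on A^[l].
    Shape convention: (chi_l, d, chi_r); label tensor (chi_l, d, D, chi_r)."""
    cl, d, D, cr = mps[l].shape
    M = mps[l].transpose(0, 1, 3, 2).reshape(cl * d, cr * D)   # group (a_{l-1} s_l) x (a_l b)
    U, S, Vh = np.linalg.svd(M, full_matrices=False)           # in practice, truncate to chi
    k = len(S)
    mps[l] = U.reshape(cl, d, k)                               # A^[l] <- U (no label bond)
    SV = (np.diag(S) @ Vh).reshape(k, cr, D)                   # carries the label bond b
    mps[l + 1] = np.einsum('kbD,bdc->kdDc', SV, mps[l + 1])    # label passed to A^[l+1]
    mps[l] -= alpha * grad_fn(mps, l)                          # gradient descent on A^[l]
```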

2.3 Entanglement of Matrix Product State

In quantum information science, entanglement measures the quantum version of correlation, characterizing how two subsystems are correlated. The MPS classifier can be regarded as a many-qubit quantum state. Given a trained classifier Ψ̂, to capture its entanglement we introduce two entanglement measures: the single-site entanglement entropy (SEE) and the bipartite entanglement entropy (BEE). The reduced density matrix $\hat{\rho}^{[l]}$ of the $l$th site is obtained by contracting all indices except $s_l$ and $s'_l$,

$$\hat{\rho}^{[l]}_{s_l s'_l}=\sum_{b,\,s_1\cdots s_{l-1}s_{l+1}\cdots s_L}\hat{\Psi}_{b,s_1\cdots s_l\cdots s_L}\,\hat{\Psi}_{b,s_1\cdots s'_l\cdots s_L}. \tag{7}$$

Note that $\hat{\rho}^{[l]}$ is non-negative. When calculating $\hat{\rho}^{[l]}$ with an MPS, the leading computational complexity scales linearly with the length of the MPS, i.e., $O(L)$. After normalizing $\hat{\rho}^{[l]}$ by $\hat{\rho}^{[l]} \leftarrow \hat{\rho}^{[l]}/\mathrm{Tr}\,\hat{\rho}^{[l]}$, the SEE is defined as

$$S^{[l]}_{\mathrm{SEE}}=-\mathrm{Tr}\,\hat{\rho}^{[l]}\ln\hat{\rho}^{[l]}. \tag{8}$$

SEE captures the entanglement entropy between one qubit located at site l and the rest of the system.

The BEE is defined as the entanglement entropy between the first $l$ qubits of the MPS and the rest, which can be obtained from the reduced density matrix after tracing over the first $l$ sites of the MPS. Another way to compute the BEE is via SVD, where it is calculated from the singular values (Schmidt numbers). By grouping the indices $(b, s_1\cdots s_l)$ and $(s_{l+1}\cdots s_L)$, we obtain the following SVD on the virtual bond:

$$\hat{\Psi}_{b,s_1\cdots s_l s_{l+1}\cdots s_L}=\sum_{\alpha}X_{b,s_1\cdots s_l,\alpha}\,\lambda^{[l]}_{\alpha\alpha}\,Y_{\alpha,s_{l+1}\cdots s_L}, \tag{9}$$

where the singular values are given by the positive-definite diagonal matrix $\lambda^{[l]}$, and $X$ and $Y$ satisfy the orthogonality conditions $X^{T}X = YY^{T} = I$. Normalizing $\lambda^{[l]}$ by $\lambda^{[l]} \leftarrow \lambda^{[l]}/\mathrm{Tr}\,\lambda^{[l]}$, the BEE can be expressed as

$$S^{[l]}_{\mathrm{BEE}}=-\sum_{\alpha}\left(\lambda^{[l]}_{\alpha\alpha}\right)^{2}\ln\left(\lambda^{[l]}_{\alpha\alpha}\right)^{2}. \tag{10}$$

In the context of MPS, one can implement gauge transformations to bring the MPS into the center-orthogonal form [47]. The details of the center-orthogonal form of the MPS classifier for supervised machine learning can be found in Ref. [28]. In the center-orthogonal form, $\lambda^{[l]}$ can be obtained from the SVD of the tensor at the center $A^{[l]}$ as $A^{[l]}_{s_l,a_{l-1}a_l}=\sum_{\alpha}X_{s_l a_{l-1},\alpha}\,\lambda^{[l]}_{\alpha\alpha}\,Y_{\alpha,a_l}$. The leading computational cost (of the center-orthogonalization and the SVD) scales linearly with the MPS size, i.e., $O(L)$.
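A minimal sketch of both entropies at the orthogonality center is given below, assuming the label bond has already been split off so the center tensor has shape $(\chi_{l-1}, d, \chi_l)$; function names are ours, and here we normalize the Schmidt vector so the squared singular values sum to one:

```python
import numpy as np

def see_at_center(A):
    """SEE (Eq. 8) of the center tensor A, assuming center-orthogonal form,
    so that Eq. 7 reduces to a contraction over the two virtual bonds."""
    rho = np.einsum('asb,atb->st', A, A)          # reduced density matrix (Eq. 7)
    rho /= np.trace(rho)
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]                              # drop numerical zeros
    return -np.sum(p * np.log(p))

def bee_at_center(A):
    """BEE (Eq. 10) from the singular values on the right virtual bond (Eq. 9)."""
    cl, d, cr = A.shape
    s = np.linalg.svd(A.reshape(cl * d, cr), compute_uv=False)
    lam2 = (s / np.linalg.norm(s)) ** 2           # normalized Schmidt weights
    lam2 = lam2[lam2 > 1e-12]
    return -np.sum(lam2 * np.log(lam2))
```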

It is worth noting that SVD is also widely used in classical image processing algorithms [92–94], where the singular values are global properties of the images. In our algorithm, the SVD is performed on the virtual bond connecting different sites. In this way, the singular values for a given cut reflect local properties (the importance of features), which makes it different from the classical case.

3 Feature Extraction Based on Entanglement

3.1 Entanglement in Image Classification

Taking the “1–7” binary classifier as an example, the entanglement properties of the MPS classifier are shown in Figure 2 (we include more data on binary classifiers in Supplementary Appendix D). The size and darkness of the nodes illustrate the strength of the SEE of each site, and the thickness of the bonds shows the strength of the BEE obtained by cutting the MPS into two pieces at the bonds. An important part of our proposal is the way of arranging the features along a 1D path to contract with the MPS. For the images, the features (pixels) are originally placed in a 2D array, while the tensors in an MPS are connected as a 1D network. Therefore, a 1D path should be chosen so that each feature (after implementing the feature map) is contracted with one of the tensors in the MPS. Figure 2 illustrates the three different paths we use in this work.

FIGURE 2

FIGURE 2. The entanglement properties of the “1–7” binary MPS classifier using different paths (which show how the MPS covers the 2D images). The size and darkness of nodes represent the SEE’s strength on each site (sites with vanishing SEE are represented by small black dots). The thickness of the bonds represents the strength of the BEE by cutting the MPS at the bonds. (A) demonstrates the SEE and BEE of the MPS trained by images (without DCT). In this case, we use the “line-by-line” path, and the label is at the middle site. (B) shows the SEE and BEE of the MPS trained by the frequency components after the DCT, where we use the “zig-zag” path. In this case, we put the label on the first tensor of the MPS. (C) gives the results trained by images (without DCT) using an optimized path where the sites with larger SEE are arranged closer to the center of the MPS. We only keep the 100 pixels with the largest SEE, and the label is put on the middle site of the MPS.

Figure 2A shows a 1D path that connects the neighboring rows of the 2D image in a head-to-tail manner, which we dub the “line-by-line” path. Note that the label is put in the middle of the MPS. By eye, one can see that the SEE forms the shape of an overlapped “1” and “7”. This implies that the sites with larger SEE generally form the shape of the digits to be classified. Meanwhile, the sites close to the edge of the image exhibit almost vanishing SEE. Moreover, following the 1D path, the BEE increases when going from either end of the MPS to the middle. These observations are consistent with the fact that the pixels near the edges of the images contain almost no information. Our results suggest that the importance of features can be characterized by the entanglement properties.

In Figure 2B, we first transform the images to frequency components using the discrete cosine transformation (DCT) [75–77], which is a “real-number” version of the Fourier transformation (see more details of the DCT in Supplementary Appendix C). The frequency components are the coefficients of the image's frequency modes in the horizontal and vertical directions. The frequency increases when going from the top-left to the bottom-right corner of the 2D array. This allows a natural choice of the 1D path from the top-left to the bottom-right corner (Figure 2B), which we call the “zig-zag” path. In this case, the label bond is put on the first tensor of the MPS. The frequency components are then mapped to vectors by the feature map and subsequently contracted with the MPS to predict the classification.
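A small sketch of this preprocessing, using SciPy's DCT and a simple anti-diagonal ordering as the zig-zag path (the exact ordering convention here is ours, for illustration):

```python
import numpy as np
from scipy.fftpack import dct

def dct2(img):
    """2D type-II DCT with orthonormal scaling."""
    return dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')

def zigzag_path(n):
    """Indices of an n x n array ordered along anti-diagonals, so the
    frequency grows from the top-left to the bottom-right corner."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda ij: (ij[0] + ij[1], ij[0]))

img = np.random.rand(28, 28)                    # stand-in for an MNIST image
coeffs = dct2(img)
sequence = np.array([coeffs[i, j] for i, j in zigzag_path(28)])  # 1D input for the MPS
```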

We again take the “1–7” binary classifier as an example to investigate the SEE and BEE of the MPS. The sites with larger SEE appear mainly in the top-left corner, which shows that the relatively highly entangled qubits correspond to the low-frequency components. This agrees with the well-known fact in computer vision that the main information of images is usually encoded in the low-frequency data. It further verifies our claim that the importance of features can be characterized by the entanglement properties. Compared with the “line-by-line” path, the number of highly entangled sites is much smaller than in Figure 2A, which shows that images become much sparser after the DCT.

Figure 2C shows the SEE and BEE of the MPS trained with an optimized path, where the real-space pixels are arranged in such a way that those with larger SEE are closer to the center of the MPS (the label is at the center). Only the 100 features that possess the largest SEE are retained, and the rest are discarded. In the following, we propose a feature extraction scheme based on such an optimized path.

3.2 Feature Extraction Based on Entanglement

If the $l$th qubit in the MPS gives zero SEE, there is no entanglement between this qubit and the other qubits (including the label). In this case, the classification will not be affected no matter how the value of the $l$th feature changes. In other words, the features corresponding to qubits with vanishing SEE are irrelevant to the prediction of the classification.

Based on this observation, we propose the following feature extraction algorithm, which extracts the important features from the samples according to the entanglement properties of the MPS. With an MPS trained by minimizing the cost function (Eq. 5), we first optimize the path of the MPS according to the SEE, so that the important features (with larger SEE) are arranged in the middle, closer to the label index (Figure 2C). Using the optimized path, we then re-initialize and train a new MPS, whose SEE becomes more concentrated toward the middle. We use the new SEE for the next path optimization. This process is repeated until the SEE is sufficiently concentrated. We then calculate the BEE and keep only the $\tilde{L}$ features in the middle of the MPS with the largest BEE in the optimized path (Figure 3). Finally, we train the MPS classifier using only the retained features. The main steps of our algorithm are listed in Figure 4.
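The sketch below summarizes this loop; `train_mps` and `see_of` are hypothetical stand-ins for the training routine and the SEE calculation of Eq. 8:

```python
import numpy as np

def optimize_path(train_mps, see_of, L, n_rounds=3):
    """Iteratively reorder the features so that sites with larger SEE
    sit closer to the center of the MPS (where the label resides)."""
    path = np.arange(L)                                    # initial 1D feature ordering
    for _ in range(n_rounds):                              # O(1) rounds suffice in practice
        see = see_of(train_mps(path))                      # per-site SEE, shape (L,)
        center_out = np.argsort(np.abs(np.arange(L) - (L - 1) / 2))  # positions, center first
        new_path = np.empty(L, dtype=int)
        new_path[center_out] = path[np.argsort(-see)]      # largest SEE -> central sites
        path = new_path
    return path

def cut_tails(path, L_keep):
    """Retain the L_keep central features; the low-BEE tails are discarded."""
    start = (len(path) - L_keep) // 2
    return path[start:start + L_keep]
```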

FIGURE 3

FIGURE 3. Illustration of the feature extraction by cutting the “tails” of the MPS. The size and darkness of nodes represent the SEE's strength at each site, and the thickness of bonds represents the BEE's strength obtained by cutting the MPS at the bonds. After optimizing the path based on the SEE, the MPS is cut and only the $\tilde{L}$ features with the largest BEE are retained. The qubits on the cut tails possess little entanglement with the bulk of the MPS and can be safely discarded.

FIGURE 4

FIGURE 4. The flowchart of the feature extraction algorithm.

To explain how this algorithm works, let us give a simple example with a three-qubit quantum state $|\psi\rangle = |\!\uparrow\uparrow\downarrow\rangle + |\!\downarrow\uparrow\uparrow\rangle$, where $|\!\uparrow\rangle$ and $|\!\downarrow\rangle$ stand for the spin-up and spin-down states, respectively. We assume that the first spin carries the label information of a binary classification. Since the second spin of $|\psi\rangle$ can be factored out as a direct product, $\hat{\rho}^{[2]}$ is a pure state, and the SEE of the second spin is zero. Therefore, discarding the feature corresponding to the second spin will not affect the classification.

Meanwhile, path optimization leads to a more efficient MPS representation of the classifier. By writing the wavefunction as a three-site MPS, one can check that the two virtual bonds are both two-dimensional. The total number of parameters of this MPS is $2^2 + 2^3 + 2^2 = 16$. However, if we redefine the MPS after moving the second qubit to one end of the chain (say, swapping it with the third qubit), the wavefunction becomes $|\psi\rangle = |\!\uparrow\downarrow\uparrow\rangle + |\!\downarrow\uparrow\uparrow\rangle = (|\!\uparrow\downarrow\rangle + |\!\downarrow\uparrow\rangle) \otimes |\!\uparrow\rangle$. Now the virtual bonds of the MPS are two- and one-dimensional, respectively, and the total number of parameters is reduced to $2^2 + 2^2 + 2 = 10$.
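This can be checked numerically; the following sketch computes the SEE of each spin of $|\psi\rangle$ from dense state vectors, with $|\!\uparrow\rangle = (1,0)$ and $|\!\downarrow\rangle = (0,1)$:

```python
import numpy as np

up, dn = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def kron3(a, b, c):
    return np.kron(np.kron(a, b), c)

psi = kron3(up, up, dn) + kron3(dn, up, up)     # |up up dn> + |dn up up>
psi /= np.linalg.norm(psi)

def see(psi, site):
    """Single-site entanglement entropy of a 3-qubit pure state."""
    T = np.moveaxis(psi.reshape(2, 2, 2), site, 0).reshape(2, 4)  # split site vs rest
    p = np.linalg.svd(T, compute_uv=False) ** 2                   # Schmidt weights
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

print([see(psi, l) for l in range(3)])   # gives [ln2, 0, ln2]: the second spin is unentangled
```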

We test the feature extraction algorithm on the ten-class MPS classifier of the images in the MNIST dataset. We randomly selected one thousand images from each class as the training set. Figure 5A shows the SEE of the trained MPS (trained on images without DCT) with and without optimizing the path according to the SEE. Without path optimization, the features with large SEE are distributed over almost the whole MPS; with path optimization, the larger values of the SEE are much more concentrated toward the middle. Note that we optimize the path according to the SEE of the previously trained MPS, and then train a new MPS with the updated path, whose SEE generally decreases, with fluctuations, when going from the middle to the ends. The magnitude of the fluctuations converges after optimizing the path $O(1)$ times, and the features with large SEE become well concentrated toward the middle. Figure 5B shows the SEE of the MPS trained on the frequency components of the images (after DCT) with and without path optimization. The number of features with large SEE becomes much smaller than for the MPS trained without DCT. This indicates that we can achieve a similar accuracy with far fewer features when using DCT.

FIGURE 5

FIGURE 5. (A) and (B) show the SEE of the ten-class MPS classifier (i.e., D = 10) with or without DCT/path optimization. (C) shows the BEE of these four cases. (D) shows the classification accuracy on the test dataset versus the number of retained features $\tilde{L}$.

Figure 5C shows the BEE of the MPSs. The tails with small BEE become longer when either implementing DCT or optimizing the path, which implies that more features can be discarded safely. Figure 5D shows the ten-class classification accuracy on the test set, where $\tilde{L}$ is the number of retained features. We observe that the testing accuracies in the four cases are almost identical as long as sufficiently many features are kept ($\tilde{L} > 400$ approximately). With relatively small numbers of extracted features, higher accuracy is achieved using DCT or path optimization. For instance, with only 20 features, the accuracy is 82% using DCT together with the optimized path, while it is 56% and 76% with only the path optimization or only the DCT, respectively (Figure 5D). When keeping $\tilde{L} = L/10$, the decrease of accuracy is only $O(10^{-3})$ using DCT and path optimization.

We also apply our feature extraction algorithm to the binary MPS classifiers of the images in the Fashion-MNIST dataset [95] (see Supplementary Appendix E for details), where we arrive at similar observations. This shows that the algorithm is general and capable of handling more complicated datasets.

4 Summary

In this work, we implement numerical experiments with MPS for image classification and explicitly show how the entanglement properties of MPS can be used to characterize the importance of features. A novel entanglement-based feature extraction algorithm is proposed, which discards the features corresponding to the less entangled qubits of the MPS. We test our proposal on the MNIST dataset of handwritten digits and show that high accuracy can be achieved with a small number of retained features using DCT and path optimization. Our results show that for the ten-class classifiers of the MNIST dataset, the number of features can be safely lowered to less than 1/10 of the original number.

In the literature, feature extraction for images is typically achieved by image segmentation and matrix transformations (applying various filters) [96–99]. The spatial or transformed features are directly used as references to decide which features are more important. Our algorithm does not rely on segmentation and focuses on the correlations between features.

Our work gives a convincing starting point for building connections between the entanglement properties (SEE/BEE of the MPS) and machine learning tasks. Interpretability is a challenging issue in machine learning [66–73], which concerns explaining how machine learning models work, how to design the models, how information flows during the processing, and so on. One important aspect of interpretability is to characterize the importance of features, which assists in explaining the main factors that affect the results and in implementing feature extraction, to name but a few. In the literature, several methods have been proposed to improve the interpretability of machine learning [100–105], although various limitations still exist. Our work explicitly shows how the entanglement properties of MPS can be used to characterize the importance of features. Therefore, our work can be regarded as a tensor network version of sensitivity analysis [106], which may provide an alternative to other interpretability methods, such as influence functions and kernel methods. Our proposal can also be applied to TNs with more sophisticated architectures, such as projected entangled pair states [43, 57, 107], for efficient machine learning.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

Author Contributions

All of the authors contributed to designing the work, analyzing the results, and writing and revising the manuscript. The numerical simulations were done by YL and W-JL with equal contribution.

Funding

This work was supported by ERC AdG OSYRIS (ERC-2013-AdG Grant No. 339106), Spanish Ministry MINECO (National Plan 15 Grant: FISICATEAMO No. FIS2016-79508-P, SEVERO OCHOA No. SEV-2015-0522), Generalitat de Catalunya (AGAUR Grant No. 2017 SGR 1341 and CERCA/Program), Fundació Privada Cellex, EU FETPRO QUIC (H2020-FETPROACT-2014 No. 641122), the National Science Centre, and Poland-Symfonia Grant No. 2016/20/W/ST4/00314. S-JR was supported by the Fundació Catalunya-La Pedrera Ignacio Cirac Program Chair, and is supported by the Beijing Natural Science Foundation (Grants No. 1192005 and No. Z180013), the Foundation of Beijing Education Committees (Grant No. KM202010028013), and the Academy for Multidisciplinary Studies, Capital Normal University. XZ is supported by the National Natural Science Foundation of China (No. 11404413), the Natural Science Foundation of Guangdong Province (No. 2015A030313188), and the Guangdong Science and Technology Innovation Youth Talent Program (Grant No. 2016TQ03X688). ML acknowledges the support of Capital Normal University for an academic visit. W-JL and GS are supported in part by the NSFC (Grant No. 11834014), the National Key R&D Program of China (Grant No. 2018FYA0305804), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB28000000), and the Beijing Municipal Science and Technology Commission (Grant No. Z190011).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

S-JR thanks Anna Dawid Lekowska, Lei Wang, Ding Liu, Cheng Peng, Zheng-Zhi Sun, Ivan Glasser, and Peter Wittek for stimulating discussions. YL thanks Naichao Hu for helpful suggestions of writing the manuscript.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2021.716044/full#supplementary-material

References

1. Lewenstein, M. Quantum Perceptrons. J Mod Opt (1994) 41:2491–501. doi:10.1080/09500349414552331

2. Chrisley, R. New Directions in Cognitive Science. In Proceedings of the International Symposium, Saariselka 4-9 (1995).

3. Kak, SC. Quantum Neural Computing. Adv Imaging Electron Phys (1995) 94:259–313. doi:10.1016/s1076-5670(08)70147-2

4. Schuld, M, Sinayskiy, I, and Petruccione, F. The Quest for a Quantum Neural Network. Quan Inf Process (2014) 13:2567–86. doi:10.1007/s11128-014-0809-8

5. Baaquie, BE. Quantum Finance: Path Integrals and Hamiltonians for Options and Interest Rates. Cambridge University Press (2007).

6. Eisert, J, Wilkens, M, and Lewenstein, M. Quantum Games and Quantum Strategies. Phys Rev Lett (1999) 83:3077–80. doi:10.1103/physrevlett.83.3077

7. Johnson, NF. Playing a Quantum Game with a Corrupted Source. Phys Rev A (2001) 63:020302(R). doi:10.1103/physreva.63.020302

8. Du, J, Li, H, Xu, X, Shi, M, Wu, J, Zhou, X, et al. Experimental Realization of Quantum Games on a Quantum Computer. Phys Rev Lett (2002) 88:137902. doi:10.1103/physrevlett.88.137902

9. Dunjko, V, Taylor, JM, and Briegel, HJ. Quantum-Enhanced Machine Learning. Phys Rev Lett (2016) 117:130501. doi:10.1103/physrevlett.117.130501

10. Dunjko, V, Liu, YK, Wu, X, and Taylor, JM. Exponential improvements for quantum-accessible reinforcement learning. arXiv:1710.11160.

11. Lamata, L. Basic Protocols in Quantum Reinforcement Learning with Superconducting Circuits. Sci Rep (2017) 7:1609.

12. Monràs, A, Sentís, G, and Wittek, P. Inductive Supervised Quantum Learning. Phys Rev Lett (2017) 118:190503. doi:10.1103/physrevlett.118.190503

13. Hallam, A, Grant, E, Stojevic, V, Severini, S, and Green, AG. Compact Neural Networks based on the Multiscale Entanglement Renormalization Ansatz. British Machine Vision Conference 2018, {BMVC} 2018, Newcastle, UK, September 3-6, 2018 (2018).

14. Liu, J-G, and Wang, L. Differentiable Learning of Quantum Circuit Born Machines. Phys Rev A (2018) 98:062324. doi:10.1103/physreva.98.062324

15. Hao, B-L, Lee, HC, and Zhang, S-Y. Fractals Related to Long DNA Sequences and Complete Genomes. Chaos, Solitons & Fractals (2000) 11:825–36. doi:10.1016/s0960-0779(98)00182-9

16. Latorre, JI. Image compression and entanglement. arXiv: quant-ph/0510031.

17. Le, PQ, Dong, F, and Hirota, K. A Flexible Representation of Quantum Images for Polynomial Preparation, Image Compression, and Processing Operations. Quan Inf. Process (2011) 10:63–84. doi:10.1007/s11128-010-0177-y

18. Rodríguez-Laguna, J, Migdał, P, Berganza, MI, Lewenstein, M, and Sierra, G. Qubism: Self-Similar Visualization of Many-Body Wavefunctions. New J Phys (2012) 14:053028. doi:10.1088/1367-2630/14/5/053028

19. O’brien, JL, Furusawa, A, and Vučković, J. Photonic quantum technologies. Nat Photon (2009) 3:687.

20. Mohseni, M, Read, P, Neven, H, Boixo, S, Denchev, V, Babbush, R, et al. Commercialize Quantum Technologies in Five Years. Nature (2017) 543:171–4. doi:10.1038/543171a

21. Dowling, JP, and Milburn, GJ. Quantum Technology: the Second Quantum Revolution. Philos Trans R Soc Lond Ser A: Math Phys Eng Sci (2003) 361:1655–74. doi:10.1098/rsta.2003.1227

22. Deutsch, D. Physics, philosophy and quantum technology. Proceedings of the Sixth International Conference on Quantum Communication, Measurement and Computing. Princeton, NJ: Rinton Press (2003). p. 419–26.

23. Huggins, W, Patil, P, Mitchell, B, Whaley, KB, and Stoudenmire, EM. Towards Quantum Machine Learning with Tensor Networks. Quan Sci. Technol. (2019) 4:024001. doi:10.1088/2058-9565/aaea94

24. Cai, X-D, Wu, D, Su, Z-E, Chen, M-C, Wang, X-L, Li, L, et al. Entanglement-Based Machine Learning on a Quantum Computer. Phys Rev Lett (2015) 114:110504. doi:10.1103/physrevlett.114.110504

25. Lloyd, S, Mohseni, M, and Rebentrost, P. Quantum algorithms for supervised and unsupervised machine learning (2013). arXiv:1307.0411.

26. Lamata, L. Basic Protocols in Quantum Reinforcement Learning with Superconducting Circuits. Sci Rep (2017) 7:1609. doi:10.1038/s41598-017-01711-6

27. Biamonte, J, Wittek, P, Pancotti, N, Rebentrost, P, Wiebe, N, and Lloyd, S. Quantum Machine Learning. Nature (2017) 549:195–202. doi:10.1038/nature23474

28. Stoudenmire, E, and Schwab, DJ. Supervised Learning with Tensor Networks. Advances in Neural Information Processing Systems (2016). Curran Associates, Inc. 4799–807.

29. Han, Z-Y, Wang, J, Fan, H, Wang, L, and Zhang, P. Unsupervised Generative Modeling Using Matrix Product States. Phys Rev X (2018) 8:031012. doi:10.1103/physrevx.8.031012

30. Martyn, J, Vidal, G, Roberts, C, and Leichenauer, S. Entanglement and tensor networks for supervised image classification (2020). arXiv:2007.06082.

31. Pestun, V, and Vlassopoulos, Y. Tensor network language model (2017). arXiv:1710.10248.

32. Liu, D, Ran, S-J, Wittek, P, Peng, C, García, RB, Su, G, et al. Machine Learning by Unitary Tensor Network of Hierarchical Tree Structure. New J Phys (2019) 21:073059. doi:10.1088/1367-2630/ab31ef

33. Stoudenmire, EM. Learning Relevant Features of Data with Multi-Scale Tensor Networks. Quan Sci. Technol. (2018) 3:034003. doi:10.1088/2058-9565/aaba1a

34. Levine, Y, Yakira, D, Cohen, N, and Shashua, A. Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design. International Conference on Learning Representations (2017) 01552.

35. Glasser, I, Pancotti, N, and Cirac, JI. From Probabilistic Graphical Models to Generalized Tensor Networks for Supervised Learning. IEEE Access (2020) 8:68169–82. doi:10.1109/access.2020.2986279

36. Selvan, R, and Dam, EB. Tensor Networks for Medical Image Classification. In: Medical Imaging with Deep Learning (2020). p. 721–32.

37. Trenti, M, Sestini, L, Gianelle, A, Zuliani, D, Felser, T, and Lucchesi, D. Quantum-inspired Machine Learning on high-energy physics data (2020). arXiv:2004.13747.

38. Efthymiou, S, Hidary, J, and Leichenauer, S. Tensornetwork for machine learning (2019). arXiv:1906.06329.

39. Wang, J, Roberts, C, Vidal, G, and Leichenauer, S. Anomaly detection with tensor networks (2020). arXiv:2006.02516.

40. Cheng, S, Wang, L, Xiang, T, and Zhang, P. Tree Tensor Networks for Generative Modeling. Phys Rev B (2019) 99:155131. doi:10.1103/physrevb.99.155131

41. Reyes, J, and Stoudenmire, M. A multi-scale tensor network architecture for classification and regression (2020). arXiv:2001.08286.

42. Sun, Z-Z, Peng, C, Liu, D, Ran, SJ, and Su, G. Generative Tensor Network Classification Model for Supervised Machine Learning Phys Rev B (2020) 101:075135. doi:10.1103/physrevb.101.075135

43. Orús, R. A Practical Introduction to Tensor Networks: Matrix Product States and Projected Entangled Pair States. Ann Phys (2014) 349:117–58. doi:10.1016/j.aop.2014.06.013

44. Cirac, JI, and Verstraete, F. Renormalization and Tensor Product States in Spin Chains and Lattices. J Phys A: Math Theor (2009) 42:504004. doi:10.1088/1751-8113/42/50/504004

45. Bridgeman, JC, and Chubb, CT. Hand-waving and Interpretive Dance: an Introductory Course on Tensor Networks. J Phys A: Math Theor (2017) 50:223001. doi:10.1088/1751-8121/aa6dc3

46. Orús, R. Advances on Tensor Network Theory: Symmetries, Fermions, Entanglement, and Holography. Eur Phys J B (2014) 87:280. doi:10.1140/epjb/e2014-50502-9

47. Ran, SJ, Tirrito, E, Peng, C, Chen, X, Tagliacozzo, L, Su, G, et al. Tensor Network Contractions: Methods And Applications To Quantum Many-Body Systems, Lecture Notes in Physics. Cham: Springer (2020).

48. Verstraete, F, Wolf, MM, Perez-Garcia, D, and Cirac, JI. Criticality, the Area Law, and the Computational Power of Projected Entangled Pair States. Phys Rev Lett (2006) 96:220601. doi:10.1103/physrevlett.96.220601

49. Schuch, N, Poilblanc, D, Cirac, JI, and Pérez-García, D. Resonating Valence Bond States in the PEPS Formalism. Phys Rev B (2012) 86:115108. doi:10.1103/physrevb.86.115108

50. Verstraete, F, and Cirac, JI. Matrix Product States Represent Ground States Faithfully. Phys Rev B (2006) 73:094423. doi:10.1103/physrevb.73.094423

51. Pérez-García, D, Verstraete, F, Wolf, MM, and Cirac, JI. Matrix Product State Representations. Qic (2007) 7:401–30. doi:10.26421/qic7.5-6-1

52. White, SR. Density Matrix Formulation for Quantum Renormalization Groups. Phys Rev Lett (1992) 69:2863–6. doi:10.1103/physrevlett.69.2863

53. Fannes, M, Nachtergaele, B, and Werner, RF. Finitely Correlated States on Quantum Spin Chains. Commun.Math Phys (1992) 144:443–90. doi:10.1007/bf02099178

54. Rommer, S, and Östlund, S. Class of Ansatz Wave Functions for One-Dimensional Spin Systems and Their Relation to the Density Matrix Renormalization Group. Phys Rev B (1997) 55:2164–81. doi:10.1103/physrevb.55.2164

55. Vidal, G. Efficient Classical Simulation of Slightly Entangled Quantum Computations. Phys Rev Lett (2003) 91:147902. doi:10.1103/physrevlett.91.147902

56. Vidal, G. Efficient Simulation of One-Dimensional Quantum Many-Body Systems. Phys Rev Lett (2004) 93:040502. doi:10.1103/physrevlett.93.040502

57. Verstraete, F, and Cirac, JI. Renormalization algorithms for quantum-many body systems in two and higher dimensions (2004). arXiv:cond-mat/0407066.

58. Shi, YY, Duan, LM, and Vidal, G. Classical Simulation Of Quantum Many-Body Systems With a Tree Tensor Network. Phys Rev A (2006) 74:022320. doi:10.1103/physreva.74.022320

59. Murg, V, Verstraete, F, Legeza, Ö, and Noack, RM. Simulating Strongly Correlated Quantum Systems with Tree Tensor Networks. Phys Rev B (2010) 82:205105. doi:10.1103/physrevb.82.205105

60. Vidal, G. Entanglement Renormalization. Phys Rev Lett (2007) 99:220405. doi:10.1103/physrevlett.99.220405

61. Vidal, G. Class of Quantum Many-Body States that Can Be Efficiently Simulated. Phys Rev Lett (2008) 101:110501. doi:10.1103/physrevlett.101.110501

62. Evenbly, G, and Vidal, G. Algorithms for Entanglement Renormalization. Phys Rev B (2009) 79:144108. doi:10.1103/physrevb.79.144108

63. Liu, J-G, Zhang, Y-H, Yuan, W, and Lei, W. Variational Quantum Eigensolver With Fewer Qubits. Phys Rev Res (2019) 1:2. doi:10.1103/physrevresearch.1.023025

64. Eichler, C, Mlynek, J, Butscher, J, Kurpiers, P, Hammerer, K, Osborne, TJ, et al. Exploring Interacting Quantum Many-Body Systems By Experimentally Creating Continuous Matrix Product States in Superconducting Circuits. Phys Rev X (2015) 5:4. doi:10.1103/physrevx.5.041044

65. Grant, E, Benedetti, M, Cao, S, Hallam, A, Lockhart, J, Stojevic, V, et al. Hierarchical Quantum Classifiers. Npj Quan Inf (2018) 4:65. doi:10.1038/s41534-018-0116-9

66. Baehrens, D, Schroeter, T, and Harmeling, S. How to Explain Individual Classification Decisions. J Mach Learn Res (2010) 11:1803.

67. Selvaraju, RR, Cogswell, M, Das, A, Vedantam, R, Parikh, D, and Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int J Comput Vis (2020) 128:336–59. doi:10.1007/s11263-019-01228-7

68. Chattopadhyay, A, Sarkar, A, Howlader, P, and Balasubramanian, VN. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (2018) 839–847.

69. Guidotti, R, Monreale, A, Ruggieri, S, Turini, F, Giannotti, F, and Pedreschi, D. A Survey of Methods for Explaining Black Box Models. ACM Comput Surv (2019) 51(5):1–42. doi:10.1145/3236009

70. Zhang, Q, and Zhu, S-C. Visual interpretability for deep learning: a survey. Front Inform Technol Electron Eng (2018) 19:27–39. doi:10.1631/FITEE.1700808

71. Doshi-Velez, F, and Kim, B. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.

72. Cook, RD. Detection of Influential Observation in Linear Regression. Technometrics (1977) 19:15. doi:10.2307/1268249

73. Dawid, A, Huembeli, P, Tomza, M, Lewenstein, M, and Dauphin, A. Phase Detection with Neural Networks: Interpreting the Black Box. New J Phys (2020) 22:115001. doi:10.1088/1367-2630/abc463

74. Website of MNIST Dataset: http://yann.lecun.com/exdb/mnist/ (Accessed July 28, 2021).

75. Ahmed, N, Natarajan, T, and Rao, KR. Discrete Cosine Transform. IEEE Trans Comput (1974) C-23:90–3. doi:10.1109/t-c.1974.223784

76. Holub, V, and Fridrich, J. Low-Complexity Features for JPEG Steganalysis Using Undecimated DCT. IEEE Trans.Inform.Forensic Secur. (2015) 10(2):219–28. doi:10.1109/tifs.2014.2364918

77. Saad, MA, Bovik, AC, and Charrier, C. Blind Image Quality Assessment: A Natural Scene Statistics Approach in the DCT Domain. IEEE Trans Image Process (2012) 21:3339–52. doi:10.1109/tip.2012.2191563

78. Le, PQ, Iliyasu, AM, Dong, F, and Hirota, K. Berlin, Heidelberg: Springer (2011).

79. Srivastava, M, Roy-Moulick, S, and Panigrahi, PK. arXiv:1305.2251 [cs.MM].

80. Yan, F, Iliyasu, AM, and Venegas-Andraca, SE. A Survey of Quantum Image Representations. Quan Inf Process (2016) 15(1):1–35. doi:10.1007/s11128-015-1195-6

81. Su, J, Guo, X, Liu, C, and Li, L. A New Trend of Quantum Image Representations. IEEE Access (2020) 8:214520–37. doi:10.1109/access.2020.3039996

82. Czarnik, P, Cincio, L, and Dziarmaga, J. Projected Entangled Pair States at Finite Temperature: Imaginary Time Evolution with Ancillas. Phys Rev B (2012) 86:245101. doi:10.1103/physrevb.86.245101

83. Ran, SJ, Xi, B, Liu, T, and Su, G. Theory of Network Contractor Dynamics for Exploring Thermodynamic Properties of Two-Dimensional Quantum Lattice Models. Phys Rev B (2013) 88:064407. doi:10.1103/physrevb.88.064407

84. Ran, S-J, Li, W, Xi, B, Zhang, Z, and Su, G. Optimized Decimation of Tensor Networks with Super-orthogonalization for Two-Dimensional Quantum Lattice Models. Phys Rev B (2012) 86:134429. doi:10.1103/physrevb.86.134429

85. Pérez-García, D, Verstraete, F, Cirac, JI, and Wolf, MM. PEPS as Unique Ground States of Local Hamiltonians. Quan Inf Comput (2008) 8:0650. doi:10.26421/qic8.6-7-6

86. Schuch, N, Cirac, I, and Pérez-García, D. PEPS as Ground States: Degeneracy and Topology. Ann Phys (2010) 325:2153–92. doi:10.1016/j.aop.2010.05.008

87. Orús, R, and Vidal, G. Infinite Time-Evolving Block Decimation Algorithm beyond Unitary Evolution. Phys Rev B (2008) 78:155117. doi:10.1103/physrevb.78.155117

88. Evenbly, G, and Vidal, G. Tensor Network States and Geometry. J Stat Phys (2011) 145:891–918. doi:10.1007/s10955-011-0237-4

89. Verstraete, F, Murg, V, and Cirac, JI. Matrix Product States, Projected Entangled Pair States, and Variational Renormalization Group Methods for Quantum Spin Systems. Adv Phys (2008) 57:143–224. doi:10.1080/14789940801912366

90. Schollwöck, U. The Density-Matrix Renormalization Group in the Age of Matrix Product States. Ann Phys (2011) 326:96–192. doi:10.1016/j.aop.2010.09.012

91. Kullback, S, and Leibler, RA. On Information and Sufficiency. Ann Math Statist (1951) 22:79–86. doi:10.1214/aoms/1177729694

92. Jain, C, Arora, S, and Prasanta, KP. A reliable SVD based watermarking scheme. arXiv:0808.0309.

93. Cao, L. Singular Value Decomposition Applied to Digital Image Processing. Division of Computing Studies, Arizona State University Polytechnic Campus, Mesa (2006). p. 1–15.

94. Ruizhen Liu, R, and Tieniu Tan, T. An SVD-Based Watermarking Scheme for Protecting Rightful Ownership. IEEE Trans Multimedia (2002) 4(1):121–8. doi:10.1109/6046.985560

95. Website of Fashion-MNIST Dataset: https://github.com/zalandoresearch/fashion-mnist/ (Accessed July 28, 2021).

96. Bay, H, Tuytelaars, T, and Gool, LV. In: Proceedings of the 9th European Conference on Computer Vision - Volume Part I. Berlin, Heidelberg: Springer-Verlag (2006).

97. Ronneberger, O, Fischer, P, and Brox, T. U-net: Convolutional Networks for Biomedical Image Segmentation. PT III (2015) 9351:234–41. doi:10.1007/978-3-319-24574-4_28

98. Krishnaraj, N, Elhoseny, M, Thenmozhi, M, Selim, MM, and Shankar, K. Deep Learning Model for Real-Time Image Compression in Internet of Underwater Things (IoUT). J Real-time Image Proc (2020) 17(6):2097–111. doi:10.1007/s11554-019-00879-6

99. Amirjanov, A, and Dimililer, K. Image Compression System with an Optimisation of Compression Ratio. Iet Image Process (2019) 13(11):1960–9. doi:10.1049/iet-ipr.2019.0114

100. Ponte, P, and Melko, RG. Kernel Methods for Interpretable Machine Learning of Order Parameters. Phys Rev B (2017) 96:205146. doi:10.1103/physrevb.96.205146

101. Zhang, W, Wang, L, and Wang, Z. Spin-Qubit Noise Spectroscopy from Randomized Benchmarking by Supervised Learning. Phys Rev A (2019) 99:042316. doi:10.1103/physreva.99.042316

102. Greitemann, J, Liu, K, Jaubert, LDC, Yan, H, Shannon, N, and Pollet, L. Identification of Emergent Constraints and Hidden Order in Frustrated Magnets Using Tensorial Kernel Methods of Machine Learning. Phys Rev B (2019) 100:174408. doi:10.1103/PhysRevB.100.174408

103. Greitemann, J, Liu, K, and Pollet, L. The View of TK-SVM on the Phase Hierarchy in the Classical Kagome Heisenberg Antiferromagnet. J Phys Condens Matter (2021) 33:054002. doi:10.1088/1361-648x/abbe7b

104. Wetzel, SJ, and Scherzer, M. Machine Learning of Explicit Order Parameters: From the Ising Model to SU(2) Lattice Gauge Theory. Phys Rev B (2017) 96:184410. doi:10.1103/physrevb.96.184410

105. Wetzel, SJ, Melko, RG, Scott, J, Panju, M, and Ganesh, V. Discovering Symmetry Invariants and Conserved Quantities By Interpreting Siamese Neural Networks. Phys Rev Res (2020) 2:033499. doi:10.1103/physrevresearch.2.033499

106. Iooss, B, and Lemaître, P. A Review on Global Sensitivity Analysis Methods. In: Uncertainty Management in Simulation-Optimization of Complex Systems. Springer (2014).

107. Cheng, S, Wang, L, and Zhang, P. Supervised learning with projected entangled pair states. Phys Rev B (2021) 103:125117.

Keywords: quantum entanglement, machine learning, tensor network, image classification, feature extraction

Citation: Liu Y, Li W-J, Zhang X, Lewenstein M, Su G and Ran S-J (2021) Entanglement-Based Feature Extraction by Tensor Network Machine Learning. Front. Appl. Math. Stat. 7:716044. doi: 10.3389/fams.2021.716044

Received: 28 May 2021; Accepted: 15 July 2021;
Published: 06 August 2021.

Edited by:

Jinjin Li, Shanghai Jiao Tong University, China

Reviewed by:

Jacob D. Biamonte, Skolkovo Institute of Science and Technology, Russia
Guyan Ni, National University of Defense Technology, China
Prasanta Panigrahi, Indian Institute of Science Education and Research Kolkata, India

Copyright © 2021 Liu, Li, Zhang, Lewenstein, Su and Ran. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Gang Su, gsu@ucas.ac.cn; Shi-Ju Ran, sjran@cnu.edu.cn

These authors have contributed equally to this work