Entanglement-guided architectures of machine learning by quantum tensor network

It is a fundamental but still elusive question whether schemes based on quantum mechanics, in particular on quantum entanglement, can be used for classical information processing and machine learning. Even a partial answer to this question would bring important insights to both machine learning and quantum mechanics. In this work, we implement simple numerical experiments, related to pattern/image classification, in which we represent the classifiers by many-qubit quantum states written as matrix product states (MPS). A classical machine learning algorithm is applied to these quantum states to learn the classical data. We explicitly show how quantum entanglement (i.e., single-site and bipartite entanglement) can emerge in images represented this way. Entanglement characterizes here the importance of the data, and this information is used in practice to guide the architecture of the MPS and improve the efficiency. The number of needed qubits can be reduced to less than 1/10 of the original number, which is within the reach of state-of-the-art quantum computers. We expect such numerical experiments could open new paths in characterizing classical machine learning algorithms, and at the same time shed light on generic quantum simulations/computations of machine learning tasks.


I. INTRODUCTION
Classical information processing mainly deals with pattern recognition and classification. The classical patterns in question may correspond to images, temporal sound sequences, financial data, and so on. During the last thirty years of development of quantum information science, there have been many attempts to generalize classical information processing to the quantum world, for instance by proposing quantum perceptrons and quantum neural networks (e.g., see some early works [1][2][3] and a review [4]), quantum finance (e.g., [5]), and quantum game theory [6][7][8], to name a few. More recently, there have been successful proposals to use quantum mechanics to enhance learning processes by introducing quantum gates/circuits, or quantum computers [9][10][11][12][13][14].
Conversely, there have been various attempts to apply methods of quantum information theory to classical information processing tasks, for instance by mapping classical images to quantum mechanical states. In 2000, Hao et al. [15] developed a different representation technique for long DNA sequences, obtaining mathematical objects similar to many-body wave-functions. In 2005, Latorre [16] independently developed a mapping between bitmap images and many-body wave-functions with a similar philosophy, and applied quantum information techniques to develop an image compression algorithm. Although the compression rate was not competitive with standard JPEG, the insight provided by the mapping was of high value [17]. A crucial insight for this work was the idea that Latorre's mapping might be inverted. Such an interdisciplinary field was recently strongly motivated by the exciting achievements in the so-called "quantum technologies" (see some general introductions in, e.g., [19][20][21][22]). Thanks to the successes in quantum simulations/computations, including D-Wave [23] and the quantum computers by Google and others ("quantum supremacy") [24,25], it has become unprecedentedly urgent and important to explore the use of quantum computations to solve machine learning tasks.
Particularly, considerable progress has been made in the field merging quantum many-body physics and quantum machine learning [26] based on tensor networks (TN) [46][47][48][49][50][51]. TN provides a powerful mathematical structure that can efficiently represent many-body states, operators, and quantum circuits, even though the dimension of the Hilbert (vector) space grows exponentially with the size of the system [42][43][44][45]. Paradigm examples include matrix product states (MPS) [43], projected entangled pair states [43,52], tree TN states [53], and the multi-scale entanglement renormalization ansatz [54]. Recently, TN has proven its great potential in the field of machine learning, providing a natural way to build mathematical connections between quantum physics and classical information. Among others, MPS has been utilized for supervised image recognition [46] and for generative modeling to learn joint probability distributions [47]. Tree TN, which has a hierarchical structure, has also been used for natural language modeling [50] and image recognition [48,49], and has proven highly efficient. The relations between the mathematical models of machine learning and quantum states, e.g., between Boltzmann machines and TN states, between MPS and string-bond states, and between deep convolutional arithmetic circuits and quantum many-body wave functions, have been investigated [51,[55][56][57].
Despite these inspiring achievements, there are several pressing challenges. One of those concerns how to practically utilize quantum features, or even quantum simulations/computations, to process classical data [58][59][60]. With the existing methods (e.g., [46,48,49]), the number of qubits equals the number of pixels in an image, which is too large to be realized with the current techniques of quantum computation. Another challenge relates to the underlying relations between the properties of classical data and those of quantum states (e.g., quantum entanglement), which are still elusive.
In this work, we implement simple numerical experiments with MPS, and show how quantum entanglement can emerge from images and be used to design the learning architecture. We encode sets of grey-scale images onto many-qubit states in a Hilbert space [46]. The classifiers of the encoded images are represented as MPS's. A training algorithm inspired by the multi-scale entanglement renormalization ansatz (MERA) [48,61] is then used to optimize the MPS. By considering the images before and after the discrete cosine transformation (DCT), we show that the efficiency of such classical computation is characterized by the bipartite entanglement entropy (BEE). The MPS for classifying the images after DCT possesses much smaller BEE, meaning higher efficiency, than the MPS for the images before DCT. The single-site entanglement entropy (SEE) of the trained MPS's characterizes the importance of the local data (e.g., different pixels). This permits discarding the less important data, so that the number of needed qubits can be largely reduced. Our simulations show that, to reach the same accuracy, the number of qubits (originally 28 × 28 = 784) for classifying the images after DCT can be lowered about ten times compared with that for classifying before DCT. Furthermore, we propose to optimize the MPS architecture according to the SEE, achieving in this way higher computational efficiency and a smaller number of qubits without harming the accuracy. The reduced number of qubits (about 50 ∼ 100) is accessible to the current techniques of quantum computation.

Fig. 1. The original images (either before or after DCT) are vectorized into many-qubit states by the feature map [Eq. (1)]. Ψ̂ satisfies the orthogonality condition, indicated by the arrows. E^{[n,l]} is defined by contracting everything after taking out the tensor (blue) that is to be updated.

II. REVIEW OF MATRIX PRODUCT STATE AND TRAINING ALGORITHM
The basic idea is that, after mapping the classical data into a vector (quantum Hilbert) space, quantum states (or the quantum operator formed by these states) are trained to capture different classes of the images, in order to solve specific tasks such as classification. Since the dimension of the Hilbert space grows exponentially with the size of the images, TN (MPS in this work) is used to implement the calculations efficiently on classical computers.
A. Feature map from data to quantum space

Such a TN machine learning contains two key ingredients. One is the feature map [49] that encodes each input image into a product state of many qubits. Each pixel (say, the l-th pixel θ_{n,l} of the n-th image) is transformed into a qubit given by a d-dimensional normalized vector v^{[n,l]}, whose components v_s^{[n,l]} are indexed by s running from 1 to d. We take d = 2 in this work, with each qubit state

|v^{[n,l]}⟩ = cos(θ_{n,l} π/2) |↑⟩ + sin(θ_{n,l} π/2) |↓⟩.   (1)

Then, the n-th image is mapped to an L-qubit state, the d^L-dimensional tensor product state ⊗_{l=1}^{L} |v^{[n,l]}⟩ (L is the number of pixels of the image). One can see that the number of qubits equals the number of pixels in one image. Note that in the paper, we use bold symbols to represent tensors without explicitly writing the indices.
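As an illustration, the feature map can be sketched in a few lines of NumPy. This is a minimal sketch assuming the standard cosine/sine map of [49]; the function name is ours.

```python
import numpy as np

def feature_map(image):
    """Map each pixel theta in [0, 1] to a d = 2 qubit vector
    [cos(pi*theta/2), sin(pi*theta/2)]: theta = 0 gives spin up [1, 0],
    theta = 1 gives spin down [0, 1]."""
    theta = np.asarray(image, dtype=float).ravel()        # L pixels
    return np.stack([np.cos(np.pi * theta / 2),
                     np.sin(np.pi * theta / 2)], axis=1)  # shape (L, d)

vecs = feature_map(np.random.rand(28, 28))                # 784 qubit vectors
# each row is normalized, since cos^2 + sin^2 = 1
```

The product state of the image is then the tensor product of the rows of `vecs`, which is never formed explicitly in practice.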

B. Tensor network representation and training algorithm
The second key ingredient is the TN. The output for the n-th image is obtained by contracting the corresponding vectors with a linear projector denoted by Ψ̂ as

|u^{[n]}⟩ = Ψ̂ ⊗_{l=1}^{L} |v^{[n,l]}⟩.   (2)

Ψ̂ is actually a map from a d^L-dimensional to a D-dimensional vector space, so that |u^{[n]}⟩ is a D-dimensional vector (D is the number of classes). Here, we take Ψ̂ as a unitary MPS (Fig. 1) whose coefficients satisfy

Ψ_{c, s_1 s_2 ··· s_L} = Σ_{a_1 ··· a_{L-1}} A^{[1]}_{c s_1 a_1} A^{[2]}_{a_1 s_2 a_2} ··· A^{[L]}_{a_{L-1} s_L},   (3)

with c the label index carried by the left end of the chain. To train the MPS, we optimize the tensors {A^{[l]}} in the MPS one by one to minimize the error of the classification. To this end, the cost function to be minimized is chosen to be a simplified negative log-likelihood (see Appendix A).

We use the MERA-inspired algorithm to optimize the MPS [48], where all tensors are taken as isometries that satisfy the right orthogonality condition Σ_{s_l, a_l} A^{[l]}_{a_{l-1} s_l a_l} (A^{[l]}_{a'_{l-1} s_l a_l})* = I_{a_{l-1} a'_{l-1}} (the rightmost tensor still satisfies this condition when considered as a χ × d × 1 tensor). Then the MPS in Eq. (3) gives a unitary projection from the d^L-dimensional to the D-dimensional vector space. The tensors in the MPS can be initialized randomly, and are then optimized one by one (from right to left, for example). The key step is to calculate the (unnormalized) environment tensor E^{[n,l]}, which is defined by contracting everything after taking out the target tensor A^{[l]} [see Fig. 1(b) and the supplementary material for details]. Then, with the SVD E^{[l]} = UΛV^T, the tensor is updated as A^{[l]} ← VU^T. One can see that the new tensor still satisfies the orthogonality condition. All tensors are updated in this way, one by one, until they converge. The code can be found on GitHub [62].
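The SVD update step can be illustrated with a toy environment matrix. This is a minimal sketch (the reshaping of the three-index environment tensor into a matrix is omitted), not the full training code of [62]; the function name is ours.

```python
import numpy as np

def isometric_update(E):
    """Given the environment reshaped into a matrix E, the update
    A <- V U^T (with E = U Lam V^T) maximizes Tr(A E) over all
    isometries A, so the new tensor automatically satisfies the
    orthogonality condition."""
    U, _, Vt = np.linalg.svd(E, full_matrices=False)
    return Vt.T @ U.T

E = np.random.randn(6, 4)   # toy environment: 6 = d*chi rows, 4 = chi columns
A = isometric_update(E)     # shape (4, 6); A @ A.T is the 4 x 4 identity
```

Note that Tr(A E) then equals the sum of the singular values of E, the maximum attainable over isometries, which is why this single SVD replaces a gradient step.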

C. Discrete cosine transform and motivation
In addition, we apply the standard discrete cosine transformation (DCT) to transform the images into frequency space before feeding them to the MPS. The DCT is defined as

F(u, v) = (2/M) α(u) α(v) Σ_{x=0}^{M−1} Σ_{y=0}^{M−1} θ(x, y) cos[(2x+1)uπ/(2M)] cos[(2y+1)vπ/(2M)],

with M the width/height of the images, (x, y) the position of a pixel, and α(u) = 1/√2 if u = 0, or α(u) = 1 otherwise. In our case, we have M = 28 for the images in the MNIST dataset. Note L = M².
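For concreteness, the transform above coincides with the orthonormal 2D DCT-II available in SciPy; the explicit matrix construction below is a sketch (variable names ours) that checks the formula against the library routine.

```python
import numpy as np
from scipy.fft import dctn

image = np.random.rand(28, 28)        # stand-in for an MNIST image, M = 28
freq = dctn(image, norm='ortho')      # orthonormal 2D DCT-II

# the same transform built from the definition, with alpha(0) = 1/sqrt(2):
M = image.shape[0]
n = np.arange(M)
C = np.sqrt(2.0 / M) * np.cos((2 * n[:, None] + 1) * n[None, :] * np.pi / (2 * M))
C[:, 0] /= np.sqrt(2)                 # alpha(u) factor at u = 0
freq_manual = C.T @ image @ C         # transform rows, then columns
```

Since C is orthogonal, the DCT is a unitary transformation of the data, which is why the accuracies with and without DCT are expected to be similar (see Appendix B).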
We propose that the DCT is very helpful for choosing the path of the MPS when dealing with 2D images. In frequency space, there exists a natural 1D path for this: the zig-zag path shown in Fig. 2(a), which is used in many standard image algorithms (e.g., JPEG). The frequency is non-decreasing along the path. Note that in previous works using MPS, the 2D images were directly reshaped into 1D (i.e., 1 × M²) vectors.
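One common construction of the zig-zag ordering (JPEG convention, traversing anti-diagonals of constant u + v with alternating direction) is sketched below; the function name is ours.

```python
def zigzag_path(M):
    """Return the (row, col) positions of an M x M image along the
    JPEG-style zig-zag path, starting from the lowest frequency (0, 0)."""
    path = []
    for s in range(2 * M - 1):                 # anti-diagonal u + v = s
        diag = [(u, s - u) for u in range(M) if 0 <= s - u < M]
        if s % 2 == 0:                         # alternate traversal direction
            diag.reverse()
        path.extend(diag)
    return path

# for M = 4 the path starts (0,0), (0,1), (1,0), (2,0), (1,1), (0,2), ...
```

Feeding the DCT coefficients to the MPS in this order places the most important (low-frequency) data near the label bond.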
Moreover, it is known from the existing image algorithms that the most important information is normally stored in the low-frequency data. It is interesting to see if the entanglement of the trained MPS reveals the same property. In this way, the number of qubits can be further reduced when defining the MPS on the zig-zag path and training after DCT transformation.

III. LEARNING ARCHITECTURE BASED ON QUANTUM ENTANGLEMENT
We will show below that, by learning the images from frequency space (reached by DCT), the computational cost can be largely reduced without lowering the accuracy. This is revealed by a lower BEE of the MPS, which means that smaller virtual bond dimensions are needed. More interestingly, we propose a learning architecture based on quantum entanglement to further improve the efficiency. The architecture contains two aspects: optimizing the MPS path according to the SEE, and discarding less important data according to the BEE. Our work practically utilizes (bipartite and single-site) quantum entanglement to design machine learning algorithms for classical data. It exhibits an explicit example of a "quantum learning architecture". We test our proposal with the MNIST dataset of handwritten digits [63].
A. Bipartite and single-site entanglement entropy

The SEE of the l-th site is defined as S^{[l]} = −Tr(ρ^{[l]} ln ρ^{[l]}), with ρ^{[l]} the single-site reduced density matrix obtained by tracing out all other sites. Note that ρ^{[l]} is non-negative. The computation of ρ^{[l]} with MPS is shown in Fig. 2. The BEE measured between, for example, the l-th and (l + 1)-th sites is similarly defined by the reduced density matrix obtained after tracing over either half of the MPS. Another way to obtain the BEE is by singular value decomposition (SVD), where the BEE is given by the singular values (also called Schmidt numbers). The SVD is formally written as

Ψ̂ = Σ_a X_{s_1···s_l, a} λ^{[l]}_a Y_{a, s_{l+1}···s_L},   (8)

where the singular values are given by the non-negative diagonal matrix λ^{[l]}, and X and Y satisfy the orthogonality conditions. The computation of the BEE in our context is illustrated in Fig. 2(c). One only needs to transform the first (l − 1) tensors to the left orthogonal form (indicated by the arrows); then λ^{[l]} is obtained by the SVD on the bond between the l-th and (l + 1)-th tensors. The leading computational cost is O(ldχ³).
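Numerically, both entropies reduce to the von Neumann entropy of a probability spectrum: the SEE from the eigenvalues of ρ^{[l]}, and the BEE from the squared Schmidt numbers of Eq. (8). A toy sketch (names and toy matrices are ours):

```python
import numpy as np

def von_neumann(p, eps=1e-12):
    """Entropy -sum_a p_a ln p_a of a non-negative spectrum p."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                       # drop vanishing weights
    return float(-np.sum(p * np.log(p)))

# SEE: eigenvalues of a single-site reduced density matrix (trace 1)
rho = np.array([[0.7, 0.1],
                [0.1, 0.3]])
see = von_neumann(np.linalg.eigvalsh(rho))

# BEE: Schmidt spectrum of a normalized toy bipartite state
psi = np.random.randn(8, 8)
psi /= np.linalg.norm(psi)
lam = np.linalg.svd(psi, compute_uv=False)
bee = von_neumann(lam**2)                # BEE = -sum_a lam_a^2 ln lam_a^2
```

The maximal BEE across a bond of dimension χ is ln χ, which is the saturation value referred to below.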
In Fig. 3 (and most of the paper), we take the MPS trained for classifying the images "0" and "2" as an example. With DCT, the important information for the classification problem is mainly carried by the low frequencies. This is consistent with what is known from the well-established image algorithms, namely that the low-frequency data are more important. In our work, this phenomenon is naturally justified by the values of the SEE of the trained MPS.
Meanwhile, the BEE with DCT increases much more slowly than that without DCT. Due to the orthogonality conditions of the MPS, the information flows from the right end of the MPS to the left (label bond). Each time non-trivial information (indicated by a relatively large SEE) is passed through, the BEE increases, finally saturating to a finite value around ln χ. While approaching the label bond on the left end, the BEE decreases to ln D, giving a triangular plateau of the BEE [see Fig. 3(b)]. This can be understood as a "refining" process: while the information flows to the label bond (output), only the information that is important to the classification is kept. The value ln D of the BEE also indicates that the state of each virtual bond in the plateau is actually described by a two-qubit maximally entangled state.
In MPS schemes, it is well known that the BEE determines the needed dimension of the corresponding virtual bond. In particular, when the entanglement entropy vanishes, the corresponding data are uncorrelated with the others and need not be fed to the MPS. In the following, we will show that, to reach the same accuracy, a smaller MPS length, meaning fewer qubits, is needed with DCT than without DCT. This provides an efficient scheme to discard the sites with small SEE.

B. Learning architecture based on single-site entanglement entropy
To minimize the BEE, we propose to rearrange the path of the MPS, so that the SEE is in a non-ascending order. The steps are listed in Table I. After path optimization, the BEE will be lowered, meaning the computational cost will be lowered, while the accuracy remains unchanged.
To explain how this architecture works, let us give a simple example with a three-qubit quantum state. The wave function reads |ψ⟩ = |↑↑↓⟩ + |↓↑↑⟩, where |↑⟩ and |↓⟩ stand for the spin-up and spin-down states, respectively. By writing the wave function as a three-site MPS, one can easily check that the two virtual bonds are both two-dimensional. The total number of parameters of this MPS is 2² + 2³ + 2² = 16. However, if we define the MPS after swapping the second qubit to either end of the chain, say swapping it with the third qubit, the wave function becomes |ψ⟩ = |↑↓↑⟩ + |↓↑↑⟩ = (|↑↓⟩ + |↓↑⟩) ⊗ |↑⟩. Obviously, the virtual bonds of this MPS are two- and one-dimensional, respectively, and the total number of parameters is reduced to 2² + 2² + 2 = 10.

In our algorithm, the SEE will normally be in a good descending order after optimizing the path only once. Fig. 4(a) shows the SEE in frequency space with and without path optimization. Without path optimization, the important data, where the values of the SEE are relatively large, are distributed on the first 200 sites [see the inset of Fig. 4(a)]. By zooming into this range, one can see that the SEE is in a good descending order after optimizing the path. For comparison, we show in Fig. 4(b) the SEE of the MPS trained on the real-space data with and without optimizing the path. Fig. 4(c) shows the BEE, which indicates the computational cost of using the MPS to solve the classification task. It is obvious that the BEE of the MPS trained on the frequency data is much smaller than that of the MPS trained on the real-space data. By path optimization, the BEE is further reduced, indicating that smaller bond dimensions are needed. Fig. 4(d) shows the accuracy when certain less important data are discarded. We only use the first L̃ data of each image to train the L̃-site MPS. We observe that as L̃ increases, the accuracy trained with the frequency data rises quickly and reaches a value above 0.98 with L̃ as small as 40.
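The bond dimensions in the three-qubit example can be checked directly via the Schmidt rank across each cut; the helper below is a sketch with our own names.

```python
import numpy as np

# amplitudes of |psi> = |up up down> + |down up up>, index 0 = up, 1 = down
psi = np.zeros((2, 2, 2))
psi[0, 0, 1] = 1.0   # |up up down>
psi[1, 0, 0] = 1.0   # |down up up>

def bond_dim(state, cut):
    """Schmidt rank (minimal virtual bond dimension) across the cut
    between site `cut` and site `cut + 1`."""
    m = state.reshape(2**cut, -1)
    s = np.linalg.svd(m, compute_uv=False)
    return int(np.sum(s > 1e-12))

dims = (bond_dim(psi, 1), bond_dim(psi, 2))          # both bonds 2-dimensional

psi_swapped = psi.transpose(0, 2, 1)                 # swap qubits 2 and 3
dims_swapped = (bond_dim(psi_swapped, 1), bond_dim(psi_swapped, 2))
# second bond becomes trivial: (|ud> + |du>) tensor |u>
```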
For comparison, training by the real-space data obviously requires a larger number of qubits, which can be reduced significantly by optimizing the path. The reduced number is almost comparable to that with DCT. For the training after DCT, the difference between the accuracies with and without path optimization is relatively small. This is because we take χ = 16, where the maximal capacity of the entanglement entropy (ln χ) is much larger than the reduction of the BEE by the path optimization.
To characterize the improvement of efficiency that can be gained by discarding the less important data, we define the complexity ratio

ξ = L̃ / L,   (10)

where L̃ is defined by a threshold, so that the BEE is smaller than c ln D when measured after the L̃-th site; c is a number determined by the required accuracy. We take c = 0.75. When ξ ≪ 1, the data on the last (1 − ξ)L sites can be ignored without harming the accuracy too much. Our results show that ξ = 0.82 when trained with the real-space data without path optimization, and ξ = 0.11 and 0.10 using the frequency data without and with path optimization, respectively. More results are given in Table II. We show that the trainings with and without DCT lead to similar accuracies, but the efficiencies (characterized by the complexity ratios) are largely different.
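Given the BEE profile along the chain, L̃ and ξ can be computed as below. This is a sketch under our reading of Eq. (10): L̃ is taken as the earliest length after which the BEE stays below the threshold; names and toy numbers are ours.

```python
import numpy as np

def complexity_ratio(bee, D, c=0.75):
    """xi = L_tilde / L of Eq. (10): L_tilde is the earliest length
    after which the BEE stays below the threshold c * ln(D)."""
    bee = np.asarray(bee, dtype=float)
    thresh = c * np.log(D)
    above = np.nonzero(bee >= thresh)[0]         # sites still above threshold
    L_tilde = int(above[-1]) + 1 if above.size else 0
    return L_tilde / len(bee)

# toy BEE profile for D = 2 classes; threshold 0.75 ln 2 ~ 0.52
xi = complexity_ratio([0.2, 0.9, 1.1, 0.8, 0.3, 0.1, 0.05, 0.0], D=2)
# xi = 4 / 8 = 0.5: the last half of the chain could be discarded
```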
Step 1 Randomly initialize the MPS, choose a path (say, zigzag), and train it by the standard algorithm; calculate the SEE of the MPS.
Step 2 Redefine the path according to the values of SEE at different sites.
Step 3 Define the MPS on the new path, randomly initialize it, and train it.
Step 4 Calculate the SEE: if the SEE is in an acceptable descending order, end the training; if not, go back to Step 2.
Step 5 Calculate the BEE and find the L̃-th site where the BEE equals 0.75 ln D. Discard the data after this site (l > L̃) and train the new MPS of length L̃.

Table I. Steps of the training algorithm, where the architecture of the MPS is guided by the entanglement.

Table II. Complexity ratios ξ [Eq. (10)] of classifiers trained by the frequency data and by the real-space data (shown in brackets) without path optimization.
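Steps 1–4 of Table I can be sketched as a loop. Here `train_mps` and `see_of` are hypothetical stand-ins for the standard training algorithm and the SEE computation; the structure of the iteration is what matters.

```python
import numpy as np

def optimize_path(train_mps, see_of, data, path, max_rounds=5):
    """Entanglement-guided path optimization (Table I, Steps 1-4):
    train on the current path, compute the per-site SEE, stop if it is
    already in descending order, otherwise reorder the path by
    descending SEE and retrain."""
    mps = train_mps(data, path)                    # Step 1
    for _ in range(max_rounds):
        see = see_of(mps)                          # SEE at each site
        if all(a >= b - 1e-8 for a, b in zip(see, see[1:])):
            return mps, path                       # Step 4: accept
        order = np.argsort(-np.asarray(see))       # Step 2: sort by SEE
        path = [path[i] for i in order]
        mps = train_mps(data, path)                # Step 3: retrain
    return mps, path
```

Step 5 (truncating the chain at the site where the BEE drops below 0.75 ln D) would then be applied to the returned MPS.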

IV. SUMMARY AND PROSPECTS
In this work, we explicitly show that quantum entanglement can be used for guiding the learning of data for image recognition. By training the unitary MPS, our numerical experiments demonstrate that the bipartite entanglement entropy indicates the complexity of the tasks using classical computations. The single-site entanglement entropy characterizes the importance of the data to the classification problems, with which an optimization technique of the MPS architecture is proposed to largely improve the efficiency.
Our proposal can be readily applied to feature extraction, and to improving the efficiency of other learning schemes, such as those based on hierarchical TN's. The exploitation of DCT implies that quantum techniques such as TN can be combined with classical computational techniques, such as neural networks, to develop novel efficient learning algorithms. Revealing the relations to theoretical physics (e.g., quantum information) would provide a solid ground for TN machine learning, preventing it from becoming a "trial-and-error alchemy".
From the viewpoint of quantum computation for machine learning [58][59][60], there are two advantages of our proposal. Firstly, the MPS we train is formed by unitaries, which gives good accuracy with relatively small bond dimensions. Note that, in principle, any local unitary maps or gates can be realized in quantum simulators or computers. Secondly, our proposal permits largely reducing the size of the MPS (meaning the numbers of both qubits and quantum gates) without harming the accuracy. This significantly lowers the complexity of the quantum computation, which strongly depends on the numbers of qubits and gates. The reduced number of qubits is only around 50 ∼ 100, which is within the reach of state-of-the-art quantum computers. The low demands on the bond dimensions and, particularly, on the size, permit simulating machine learning tasks by quantum simulations or quantum computations in the near future.
APPENDIX A: SOME DETAILS OF THE TRAINING ALGORITHM

We introduce several tricks to speed up the training procedure. Firstly, we evolve the environment tensors E^{[l]} to avoid putting too many training samples into a single iteration. Specifically, we randomly select only a small number of samples (say, 1000) and compute the corresponding environment tensor Ẽ^{[l]}. Then we update E^{[l]} ← E^{[l]} + δẼ^{[l]}, with δ a small constant. E^{[l]} is the total environment tensor and can be initialized as the Ẽ^{[l]} obtained in the first iteration. Then we use the SVD of the total environment tensor, E^{[l]} = UΛV^T, to update the tensor as A^{[l]} ← VU^T. We find this hardly harms the accuracy but largely saves computational time and memory. Our simulations also show high accuracy and fast convergence with δ = 1. The difference between large and small δ is the stability under certain extreme conditions, such as training with very small bond dimensions.
Secondly, we store all the intermediate vectors during the contraction process to avoid repetitive computations. This trades memory for computational time, and does no harm to the accuracy.
Thirdly, we take advantage of the unitary property of the MPS. The original cost function would be the negative log-likelihood (NLL) over all N training images. Considering Tr(Ψ̂Ψ̂†) to be a constant according to the orthogonality condition, one obtains the simplified cost function used in the main text, with E^{[n,l]} the environment tensor for the n-th sample without normalization. More investigations are to be done to further understand the techniques explained above [Zheng-Zhi Sun et al., in preparation].

For the feature map, the standard one maps a pixel θ satisfying 0 ≤ θ ≤ 1 to a normalized vector v which ranges from [1, 0] (spin up) to [0, 1] (spin down). When the feature map is fixed, the range of θ (with 0 ≤ θ ≤ θ̃) changes with the range of v (from [1, 0] to a canted spin state [cos α, sin α]), and vice versa; obviously, sin α = θ̃. Meanwhile, we find that the accuracy can change by controlling θ̃. Without DCT, we take 0 ≤ θ ≤ 1 and α = π/4, which gives relatively high precision and stability. With DCT, the signs and the maximum/minimum of the "pixels" (also denoted by θ) of each image are not fixed. The accuracy and stability are highest with −1 ≤ θ ≤ 1 and α = 2π. This is because, with DCT, most values are quite small, which requires a relatively large α.
We shall stress that our proposal of an entanglement-based architecture is independent of the algorithms or tricks for optimizing the MPS (or other TN's). Once the algorithm is chosen, our proposal can be utilized to reveal the "quantum" features of the machine learning tasks and improve the efficiency of the training.

APPENDIX B: PRECISION OF THE TWO-CLASS CLASSIFIERS ON THE TEST DATASET
In Table A1, we show the accuracy on the test dataset for all the two-class classifiers trained by the frequency data. We take physical bond dimension d = 2 and the virtual bond dimension χ = 16. In each iteration, we feed 1000 samples randomly picked from the two classes.
For comparison, the accuracy obtained from the real-space data is shown in Table A2. In general, the accuracy from the frequency data is at the same level as that from the real-space data. This is expected, since the DCT is a unitary transformation of the data.

APPENDIX C: SEE FOR REAL-SPACE MPS CLASSIFIERS
For the real-space MPS classifiers, the SEE can characterize the importance of the different sites of the images. This can be seen clearly from the SEE distribution viewed in the 2D plane (Fig. A1). We see that the SEE distribution captures the main features of the two classes of images that are to be classified. With path optimization, the extracted features of the images are stored not only in the SEE, but also in the path of the MPS (i.e., how the MPS covers the 2D image). Besides, we notice that with the real-space data, the SEE is zero along the edges of about 4-pixel width, corresponding to the blank edges of most images in the MNIST dataset. This serves as another proof that the SEE characterizes the importance of the data provided on different sites.

Table A1. Precision of the two-class classifiers trained by the frequency data. The virtual bond dimension is χ = 16, with D = 2 and d = 2.

Fig. A1. SEE distributions of the … [2,7], and [2,8] classifiers. One can see that the SEE can capture the features of the images.