Stack operation of tensor networks

The tensor network, as a factorization of tensors, aims at supporting the operations that are common for normal tensors, such as addition, contraction and stacking. However, due to its non-unique network structure, only the tensor network contraction is so far well defined. In this paper, we propose a mathematically rigorous definition for the tensor network stack operation, which compresses a large number of tensor networks into a single one without changing their structures and configurations. We illustrate the main ideas with matrix product states based machine learning as an example. Our results are compared with the for-loop and the efficient coding method on both CPU and GPU.


I. INTRODUCTION
Tensors are multi-dimensional arrays describing multilinear relationships in a vector space. The linear nature of tensors implies that two tensors may be added, multiplied, or stacked. The stack operation places tensors with the same shape row by row, resulting in a higher-dimensional tensor with an extra stack dimension. For example, stacking scalars leads to a vector and stacking vectors gives a matrix. Tensor operations are the fundamental calculations that modern chips such as the graphics processing unit (GPU) and the tensor processing unit (TPU) [1] are designed for. Thus, computations expressed as a combination of tensor operations are more efficient than the conventional ways. For example, there are two ways to calculate the dot products between a hundred vectors v_i of shape (n, 1) (i = 1, 2, ..., 100) and a core vector c of shape (n, 1): either apply the dot operation 100 times via a for loop, or align all the vectors into a big matrix M of shape (100, n) and apply a single matrix-vector product. The latter method is much faster on a GPU or TPU due to their intrinsic design.
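This speed difference can be illustrated with a short NumPy sketch; the shapes are illustrative, and the same pattern holds in PyTorch or TensorFlow:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
vectors = [rng.standard_normal(n) for _ in range(100)]  # the v_i
core = rng.standard_normal(n)                           # the core vector c

# Method 1: 100 separate dot products via a for loop.
loop_result = np.array([v @ core for v in vectors])

# Method 2: stack the vectors into a 100 x n matrix and use one
# matrix-vector product -- the form accelerators are optimized for.
M = np.stack(vectors)           # shape (100, n)
batched_result = M @ core       # shape (100,)

assert np.allclose(loop_result, batched_result)
```

Both methods produce identical numbers; only the second maps onto a single large kernel launch on a GPU or TPU.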
The tensor network [2][3][4][5] is a decomposition of a very large tensor into a network structure of smaller tensors, which has seen recent applications in machine learning [6][7][8][9][10][11][12][13][14][15]. As it is merely a decomposition of a tensor, the tensor network is expected to maintain the same operations mentioned previously. However, due to its arbitrary decomposition into a network structure, only the contraction operation is well-defined. Other operations such as the 'stack' and the batched tensor network have not been studied thoroughly so far. In fact, the stack operation is quite an important part of modern machine learning tasks, which treat hundreds of input tensors as a single higher-dimensional batched tensor at the training and inference steps. Recent tensor network machine learning approaches include the matrix product states (MPS) [6,11], the string bond states (SBS) [12], the projected entangled pair states (PEPS) [16,17], the tree tensor network (TTN) [9,13,18] and others [19,20] that either use a naïve iteration loop, or skip the batched tensor network via a programming trick (we denote it as efficient coding). Several studies have explored the concept of the stack operation of tensor networks to a certain extent. For instance, Ref. [21] puts forward the 'A-tensor' to represent the new local unit of the MPS after adding the extra local state space. It represents the local tensor of the symmetric tensor network, which is regarded as the stack of different 'multiplets'. This approach is developed for the scenario where the system has certain symmetries so that the computational cost can be well controlled; it does not explicitly reveal the concept of stacking hundreds of tensor networks into one network. A more recent work [22] has put forward the 'add' operation of MPS for solving the many-body Schrödinger equation.
It represents the coefficients of the many-body wave function as tensor networks and 'adds' them into one compact form. The 'add' operation is similar to our definition of the stacked tensor network up to an additional reshape. Therefore, it is useful to discuss what the stack of tensor networks is and how it performs in machine learning tasks. The batched tensor network method may be an alternative option for a faster machine learning program when efficient coding is unavailable. It can also provide the theoretical background for future chip designs based on tensor network contractions.
In this paper, a regular tensor network stack operation is put forward to compress two or more tensor networks with the same structure into one tensor network with one extra stack bond dimension. The remaining structure of the stacked tensor network is the same as that of the original ones. We illustrate the main procedures with the realization of the stack operation on a one-dimensional tensor network system, i.e., the MPS; however, our result can be extended to arbitrarily shaped tensor networks such as the PEPS. The rest of the paper is organized as follows. In Sec. II, we present the concept of the stack operation for tensor networks, as well as a proof using the MPS as an example. In Sec. III, we benchmark the speed and compare the machine learning performance of three batch computing methods for the MPS machine learning task: the naive loop method in Sec. III A, the batched tensor network in Sec. III B, and the efficient coding method in Sec. III C. In Sec. III D, we further discuss the relationship between the batched tensor network and the efficient coding method. We draw our conclusion and provide future perspectives in Sec. IV.

II. BATCH CONTRACTION OF TENSOR NETWORKS
In this section, we provide the definition of the regular stack operation of tensor networks used in this paper. We first discuss some basic concepts of tensors in Sec. II A and tensor networks in Sec. II B. We then present the details of the stack operation for the MPS in Sec. II C and its generalization to higher-dimensional tensor networks in Sec. II D.

A. The stack operation
A tensor T is a specifically organized high-dimensional collection of real or complex numbers. For example, a rank-k tensor has k indices: a rank-1 tensor has only one index and is just a vector, while a rank-3 tensor is a higher-order tensor with three independent indices. In this paper, we use a tuple with parentheses (L_1, L_2, . . . , L_K) to represent a rank-K tensor, where the i-th index takes values from 1 to L_i, i.e., for each i the local dimension is L_i. Therefore, an N-dimensional vector can be represented by the tuple (N,) and an M × N matrix by (M, N). 'Stack' is a pervasive operation in tensor analysis. It merges a series of tensors having the same shape into a higher-rank tensor. In computer science, the stack operation generally converts a list of tensors of the same shape into a very compact form that is memory efficient. For example, in Matlab, for a problem that requires iterating over all elements, transforming it into a matrix product is much more efficient than multiplying row vectors in a for loop.
For modern machine learning or deep learning, the input data are always encoded as tensors. In order to deal with thousands of inputs, those tensors are stacked together and reformulated into a higher-rank tensor called a batch before being passed to the model. This conversion is also referred to as the stack operation in machine learning.
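As a minimal code illustration of the stack operation (using NumPy's `stack`, with arbitrary example shapes):

```python
import numpy as np

# Stacking scalars gives a vector, stacking vectors gives a matrix,
# and stacking B tensors of shape (L1, ..., LK) gives a rank-(K+1)
# tensor of shape (B, L1, ..., LK).
scalars = [np.array(1.0), np.array(2.0), np.array(3.0)]
vec = np.stack(scalars)
assert vec.shape == (3,)

vectors = [np.ones(4), np.zeros(4)]
mat = np.stack(vectors)
assert mat.shape == (2, 4)

# A typical machine learning batch: 32 samples of shape (5, 6, 7).
batch = np.stack([np.zeros((5, 6, 7)) for _ in range(32)])
assert batch.shape == (32, 5, 6, 7)
```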

B. Basics of tensor networks
A tensor network {T_i} is a graph factorization of a very large tensor obtained by decomposition processes such as the SVD or QR, where each T_i is a small local tensor in the network. The corresponding tensor results from contracting all the auxiliary bonds of the tensor network.
A graphical representation of tensors is shown in Fig. 2(a). In general, a rank-k tensor has k indices; therefore, a vector, as a rank-1 tensor, has one dangling leg, and a rank-3 tensor has three dangling legs, as shown in Fig. 2(a). The contraction between two tensors can thus be defined by contracting a shared index. In Fig. 2(a), a rank-1 tensor M_k is contracted with a rank-3 tensor N_{i,j,k} as M_k · N_{i,j,k}, which ends up as a rank-2 tensor, i.e., a matrix.
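The contraction M_k · N_{i,j,k} can be written directly with einsum; the shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal(4)          # rank-1 tensor M_k
N = rng.standard_normal((2, 3, 4))  # rank-3 tensor N_{i,j,k}

# Contract the shared index k: the result is a rank-2 tensor (a matrix).
R = np.einsum('k,ijk->ij', M, N)
assert R.shape == (2, 3)
assert np.allclose(R, N @ M)  # equivalent matrix-vector view
```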
In the context of quantum physics, a many-body wave function can be represented as a one-dimensional rank-N tensor network where each leg represents the physical index of each site, as shown graphically in Fig. 2(b). This is usually referred to as the matrix product state (MPS). The MPS can therefore be seen as a connected array of local rank-3 tensors. Each tensor M_j has three indices: the physical index σ_j, and two auxiliary indices of dimensions D_j and D_{j+1} that are contracted and therefore implicit [23]. An MPS with open boundary condition can be written as

|ψ⟩ = Σ_{σ_1, ..., σ_N} M^{σ_1} M^{σ_2} ⋯ M^{σ_N} |σ_1 σ_2 ⋯ σ_N⟩,

where each M^{σ_j} is a D_j × D_{j+1} matrix.
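As a sketch, the coefficient of a given configuration |σ_1 ... σ_L⟩ is a product of the local matrices. The minimal NumPy implementation below (with illustrative dimensions, boundary bonds of size 1, and the hypothetical helper name `amplitude`) computes one such coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)
L, k, D = 5, 2, 3
# Open-boundary MPS: first/last tensors carry a trivial boundary bond.
mps = [rng.standard_normal((k, 1, D))]
mps += [rng.standard_normal((k, D, D)) for _ in range(L - 2)]
mps += [rng.standard_normal((k, D, 1))]

def amplitude(mps, config):
    """Coefficient of |sigma_1 ... sigma_L>: the matrix product
    M^{sigma_1} M^{sigma_2} ... M^{sigma_L}."""
    mat = mps[0][config[0]]                 # (1, D) boundary row
    for tensor, s in zip(mps[1:], config[1:]):
        mat = mat @ tensor[s]               # chain the D x D matrices
    return mat[0, 0]                        # (1, 1) result

a = amplitude(mps, [0, 1, 0, 1, 1])
assert np.isfinite(a)
```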

C. Stack operation for MPS
A tensor T can be converted into a tensor network {T_i}, which consists of contracted local tensors T_i, via decompositions such as the tensor train method [24]. Mathematically, one can perform operations such as product, trace, contraction, splitting, grouping, etc. on both of them.
However, the stack operation can only be directly applied to one single tensor network rather than to multiple ones separately.

FIG. 2. Diagrammatic representations for tensor networks: (a) diagrammatic representations for tensors M_i and N_{i,j,k}, and the contraction between them M_i · N_{i,j,k}; (b) diagrammatic representation for the one-dimensional tensor network MPS: M_j is the local rank-3 tensor and each is contracted via auxiliary bond dimensions (vertical solid lines); σ_j is the index for the physical dimension.

More precisely, we want to deal with such a problem as shown in Fig. 3: given a series of tensor networks {T_i}_α with the same configuration, we want to get their stacked tensor network representation with the same physical structure and shape. One direct method is to first convert those tensor networks {T_i} to the corresponding tensors (k, k, . . . , k), stack them, and then decompose the stacked tensor (n, k, k, . . . , k). However, this deviates from the original intention of representing a tensor as a tensor network: we use a tensor network precisely to avoid storing or expressing the full tensor in computing, which is memory-consuming and may be prohibitive in large systems. Thus we would like to explore another approach that can efficiently establish the representation of the stacked tensor network without performing any contraction.
We show the construction in Fig. 4: each rank-3 tensor in the stacked MPS has shape (k, nD, nD). Each matrix (nD, nD) along the physical dimension k is a block-diagonal matrix, whose i-th (D, D) block is exactly the corresponding slice of the i-th rank-3 tensor in the stack list at the same physical index. There are minor differences at the start and the end of the MPS. The start unit is a rank-2 matrix formed by a row stack of the first units in the stack list. The end unit is a diagonal rank-3 tensor (k, nD, n) with each diagonal D × 1 block filled by the corresponding end unit in the stack list. A full mathematical proof is given in Appendix A.
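The construction above can be sketched in NumPy as follows. The shapes are illustrative, and `stack_mps` / `to_tensor` are hypothetical helper names, not from the paper's released code; the final check verifies that the stacked MPS reproduces every original tensor along the new stack leg:

```python
import numpy as np

def stack_mps(mps_list):
    """Stack n MPS of identical shape into one batched MPS.
    Middle units become block-diagonal (k, n*D, n*D) tensors; the first
    unit is a row concatenation along its right bond; the last unit is
    (k, n*D, n), its diagonal D x 1 blocks creating the new stack leg."""
    n, L = len(mps_list), len(mps_list[0])
    stacked = [np.concatenate([m[0] for m in mps_list], axis=2)]
    for site in range(1, L - 1):
        units = [m[site] for m in mps_list]
        k, Dl, Dr = units[0].shape
        big = np.zeros((k, n * Dl, n * Dr))
        for i, u in enumerate(units):
            big[:, i * Dl:(i + 1) * Dl, i * Dr:(i + 1) * Dr] = u
        stacked.append(big)
    units = [m[-1] for m in mps_list]
    k, Dl, _ = units[0].shape
    last = np.zeros((k, n * Dl, n))
    for i, u in enumerate(units):
        last[:, i * Dl:(i + 1) * Dl, i] = u[:, :, 0]
    stacked.append(last)
    return stacked

def to_tensor(mps):
    """Contract all auxiliary bonds; any open right leg is kept."""
    t = mps[0][:, 0, :]
    for unit in mps[1:]:
        t = np.einsum('...a,kab->...kb', t, unit)
    return t

rng = np.random.default_rng(0)
n, L, k, D = 3, 4, 2, 2
mps_list = [[rng.standard_normal((k, 1, D))]
            + [rng.standard_normal((k, D, D)) for _ in range(L - 2)]
            + [rng.standard_normal((k, D, 1))] for _ in range(n)]

batched = stack_mps(mps_list)
# The new stack leg of dimension n indexes the original tensors.
expected = np.stack([to_tensor(m)[..., 0] for m in mps_list], axis=-1)
assert np.allclose(to_tensor(batched), expected)
```

Because the middle units are block diagonal, the matrix products never mix different members of the stack, which is exactly why the construction works.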

D. Batch operation for general tensor networks
For other types of tensor networks like the PEPS, the unit tensor is a rank-5 tensor (four auxiliary indices plus one physical index), and the formulas shown in Sec. II C would be more complicated. We state the result directly: for any tensor network consisting of M tensor units, its n-batched version is another tensor network of the same structure in which each local tensor is block-diagonal. One extra index representing the stack dimension (corresponding to the tensor stack) is free to add since each local tensor is block-diagonal: we only need to unsqueeze/reshape the diagonal block from (k, L_1, L_2, . . . ) to (k, 1, L_1, L_2, . . . ). For the MPS case, we could put it at the right end, as shown in Fig. 4.
The sub-tensor (nL_1, nL_2, . . . , nL_m) along the physical indices of each batched unit is a diagonal block tensor; the j-th diagonal block is the corresponding unit of the j-th tensor network in the stack list. Meanwhile, one unit (for example, the N-th local tensor) should be 'unsqueezed' to generate the dangling leg for the stack index. For the MPS case in Sec. II C, the first unit becomes a row vector, which is just the 'diagonal block' form of a rank-1 tensor, and the last MPS unit gets an external leg with dimension n, as required by the batch.

III. APPLICATION
In this section, we study the potential applications of the stack of tensor networks. We apply this stack operation method to tensor network machine learning. For a tensor network based machine learning task, both the model and the input data can be collectively represented as a tensor or a tensor network. In fact, the model here actually refers to the trainable parameters, and we will refer to it as the machine learning core in the following. The forward processing that calculates a response signal y corresponding to the input x is an inner product in a linear or nonlinear function space, i.e., the interaction between the machine learning core and the input data. The loss function is designed based on the type of task. For example, for supervised learning we would measure the distance between the target signal and the calculated signal y; for unsupervised learning, we might design a structure loss over hundreds of signals y to maximize the distance matrix. In this work, we take supervised machine learning as an example; however, the method is flexible for any machine learning framework since it is a modeling technology. Here, using the language of quantum physics, we denote the machine learning core as |C⟩ and one of the input data as |I_j⟩ = |I_j(ψ)⟩. As both the core |C⟩ and the inputs |I_j⟩ are represented as tensors, this processing can be batch executed in parallel. The machine learning task can then be efficiently represented as

α_j = ⟨C|I_j⟩,  or in batched form,  (α_1, . . . , α_B) = ⟨C|(B, I_j)⟩,

where α_j is the resulting output data and |(B, I_j)⟩ is the batch representation of B stacked tensors or tensor networks; |(B, I_j)⟩ has an explicit format.
In the following, we consider the one-dimensional case of the MPS machine learning task as an example. As shown in Fig. 5, both the core |C⟩ and the inputs |I_j⟩ are MPS, and the output response signal α_j is obtained by contracting the physical bonds of the two MPS.
Below we show that there are three ways to accomplish this machine learning task: the loop (LP) method, the batched tensor network (BTN) method and the efficient coding (EC) method. The core MPS |C⟩ consists of L local tensor units with k being the dimension of all physical bonds and V the dimension of all auxiliary bonds. Each input MPS |I_j⟩ consists of L local tensor units with k being the dimension of all physical bonds and D the dimension of all auxiliary bonds.

A. Naïve loop method
As shown in Fig. 5, for B input samples, one needs to perform B contractions to obtain B outcomes. The most obvious way to accomplish this is to perform the contractions one by one. When we utilize the automatic differentiation feature of libraries such as PyTorch [25] and TensorFlow [26], the gradient information is accumulated at every step. At the end of the loop, we acquire the batch gradient by averaging over the sum and apply it to the gradient descent method for weight updating.
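A minimal NumPy sketch of the loop method follows (no automatic differentiation; a ring MPS is assumed so the contraction ends with a trace, and `mps_inner` is a hypothetical helper name):

```python
import numpy as np

def mps_inner(core, inp):
    """<C|I_j>: contract the physical bond at every site, then sweep
    the transfer matrices from left to right and trace (ring MPS)."""
    env = None
    for c, x in zip(core, inp):
        t = np.einsum('kab,kcd->acbd', c, x)   # (V, D, V, D)
        V, D = t.shape[0], t.shape[1]
        t = t.reshape(V * D, V * D)            # merged transfer matrix
        env = t if env is None else env @ t
    return np.trace(env)

rng = np.random.default_rng(0)
L, k, V, D, B = 6, 3, 4, 1, 5
core = [rng.standard_normal((k, V, V)) for _ in range(L)]
inputs = [[rng.standard_normal((k, D, D)) for _ in range(L)]
          for _ in range(B)]

# Naive loop (LP): one full contraction per input sample.
outputs = np.array([mps_inner(core, x) for x in inputs])
assert outputs.shape == (B,)
```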

B. Batched Tensor Network Method
As discussed in Sec. II, we can stack all the input MPS into a batched tensor network which behaves like another MPS, |(B, I_j)⟩ = |I(ψ_stack)⟩. Assume the number of inputs is B. Although the auxiliary bond dimension of the batched MPS |(B, I_j)⟩ gets B times larger [see Fig. 4], the action of the inner product can now be written as one single tensor contraction, as shown in Fig. 6. With the help of the batched tensor network format, we can compress B contractions into one single contraction. The advantage is obvious: software libraries such as PyTorch and computational units such as the GPU are intrinsically optimized for tensor operations, so contracting one large tensor is usually much faster than many matrix multiplications in a for loop. As the benchmark in Fig. 8 shows, in the GPU environment the BTN method achieves constant time complexity while the LP method is a linear function of the batch size. Notice that we need to calculate an optimized contraction path to efficiently contract the tensor network shown in Fig. 6, while the LP method can simply contract the physical bonds first and then sweep from left to right. When B gets larger, contracting the physical bonds first in order requires storing an MPS with auxiliary bond equal to nVB, which is quite memory-consuming. Such a method also costs extra memory for the block-diagonal tensors. The memory increment is O(B^m), where m is the number of auxiliary bonds. For example, the rank-3 tensor of the batched MPS allocates k × nB × nB memory to store only B × k × n × n valid parameters. If automatic differentiation is needed in the following process, the intermediate tensors required for gradient calculation also become B times larger. Thus, although the speed is faster as the batch gets larger, the memory will gradually blow up. The BTN method is a typical 'trade memory for speed' method.
One possible solution is to treat the diagonal block tensor as a sparse tensor, so we only need to store the valid indices and values. Meanwhile, if we contract the physical bonds first, we obtain a matrix chain whose units are also sparse diagonal matrices of shape (nBV, nBV). This implies that we can compute the matrix chain block-wise, so that each sparse unit behaves like a compact tensor (B, nV, nV). At this juncture, it turns out that we can use the popular coding method called efficient coding, which is widely used in modern tensor network machine learning tasks [6,11,14,16,18].

C. Efficient coding
The efficient coding (EC) method performs a 'local batch contraction' for every tensor unit calculation in the tensor network contraction.
We first batch contract each physical index and get a tensor train consisting of batched units as shown in Fig. 7; then we perform the batch contraction on all the auxiliary bonds of this batched tensor train. The batch contraction is realized by the built-in function einsum provided in many modern scientific software libraries such as NumPy [27], PyTorch [25] and TensorFlow [26]:

einsum('kij,Bknm → Binjm', core, batch),
einsum('Bij,Bjk → Bik', core, batch).    (4)

For libraries without einsum, one alternative solution is to implement a highly efficient 'batch matrix multiply' function; the underlying realization of torch.einsum is in fact the batch matrix multiply torch.bmm. Notice that such a method is not new for tensor network machine learning: Refs. [11,14,16] have already realized successful learning algorithms based on it. The memory allocated for EC is much less than that for the BTN, which is the major reason why the BTN method is slower than EC.
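The two einsum patterns of Eq. (4) can be run directly; the NumPy sketch below uses illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
B, k, V, n = 7, 3, 4, 2   # batch, physical dim, core bond, input bond

core_unit = rng.standard_normal((k, V, V))       # indices (k, i, j)
batch_unit = rng.standard_normal((B, k, n, n))   # indices (B, k, n, m)

# Pattern 1: contract the physical bond locally; each sample keeps
# its own pair of merged auxiliary bonds.
local = np.einsum('kij,Bknm->Binjm', core_unit, batch_unit)
local = local.reshape(B, V * n, V * n)           # batched (nV, nV) matrix

# Pattern 2: batched matrix multiply along the auxiliary bonds,
# the operation torch.bmm performs under the hood.
m2 = rng.standard_normal((B, V * n, V * n))
chain = np.einsum('Bij,Bjk->Bik', local, m2)
assert chain.shape == (B, V * n, V * n)
assert np.allclose(chain[0], local[0] @ m2[0])
```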
Also, the dense BTN may require an optimized contraction path for large batch cases, while EC for MPS only allows contracting the physical bonds first, followed by the contraction of the auxiliary bonds as a tensor train contraction. The standard tensor train contraction method sweeps from the left-hand side of the MPS to the right. In particular, when all MPS have a uniform shape, i.e., all tensor units share the same auxiliary bond dimension, we can stack those units together ((B, L, V, V)) and split them by odd indices ((B, L/2, V, V)) and even indices ((B, L/2, V, V)). In doing so, we get half-length tensor units ((B, L/2, V, V)) for the next iteration by contracting the odd part with the even part.
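The odd/even halving scheme can be sketched as follows; `parallel_contract` is a hypothetical helper name, and we check it against a plain left-to-right sweep:

```python
import numpy as np

def parallel_contract(units):
    """Contract a uniform batched tensor train (B, L, V, V) by pairing
    odd and even sites, halving the chain length each iteration."""
    B, L, V, _ = units.shape
    while L > 1:
        if L % 2 == 1:                      # carry the last unit over
            rest, last = units[:, :-1], units[:, -1:]
        else:
            rest, last = units, None
        odd, even = rest[:, 0::2], rest[:, 1::2]
        units = np.einsum('BLij,BLjk->BLik', odd, even)
        if last is not None:
            units = np.concatenate([units, last], axis=1)
        L = units.shape[1]
    return units[:, 0]

rng = np.random.default_rng(5)
B, L, V = 4, 8, 3
units = rng.standard_normal((B, L, V, V))

# Sequential left-to-right sweep for reference.
ref = units[:, 0]
for s in range(1, L):
    ref = np.einsum('Bij,Bjk->Bik', ref, units[:, s])

assert np.allclose(parallel_contract(units), ref)
```

Pairwise contraction preserves the order of the matrix product (it only changes the associativity), so the result matches the sweep while needing only O(log L) batched multiplication rounds.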

D. Comparison and discussion
In Fig. 8, we show the speed-up benchmark results for the three methods introduced in Sec. III A to Sec. III C with respect to different batch sizes in a log-log plot. We compare the performance using different libraries in Python. We mainly test three batch contraction methods, the loop method (LP), the batched tensor network method (BTN), and the efficient coding method (EC), on the MPS batch contraction task in Fig. 5. The system is an L = 20 unit MPS with each unit assigned V = 6, k = 3 and D = 1 [see Fig. 7]. The batch size ranges from 1 to 1000. Overall, EC has the best speed-up performance over the other methods for almost all libraries, especially at large batch sizes. Due to the pre-processing requirement, the BTN can be slower than the others in small-batch cases. The CPU is not efficient in storing and processing larger matrices or tensors, while the GPU is deeply optimized to deal with the tensor data structure. Thus, the BTN gets a similarly linear time complexity as LP on the CPU but maintains a constant time complexity on the GPU. Note that the BTN will allocate tensors of size (nB, nB); thus the larger the batch size, the larger the intermediate tensors. We can see that PyTorch (CPU) gets better optimization with large tensors than NumPy. In Fig. 8(c) and (d), both BTN and EC have constant time complexity on the GPU since the GPU compresses the time complexity of matrix contraction to a constant level. This also implies that there is a close relationship between EC and BTN: as we discussed at the end of Sec. III B, EC is another realization of the sparse-matrix BTN. The redundant memory requirement for the diagonal block tensor slows down the BTN method, but such redundancy is eliminated in EC.
For the EC and LP methods, we contract the physical bonds first and obtain a tensor-chain-like structure, where B = 1 for the LP method and B > 1 for the EC method. Notice that we assume D = 1 here for the traditional MPS machine learning task. Since the output bond is at the right-hand side, the optimal path is contracting from the left end to the right end. Here, we use the automatic differentiation engine to calculate the gradient for each unit, so the computer records all the intermediates during the forward calculation. We therefore obtain the (B, V) batched vector calculated by a batched vector-matrix dot for the first two units, whereas a naive estimate for the BTN intermediates is O(B^2). Surprisingly, the intermediate cost is the same: (L − 2) × BV + BO. However, the left-right sweep path is not always the optimal path for the BTN method, as we discussed in Sec. III B. In our experiments, the memory complexity for the optimal path is much less than O(B^2). In Fig. 9, we give an example of the memory cost for increasing batch size, which takes L = 21 MPS units with V = 50, k = 3, D = 1 as the system parameters. The output leg with O = 10 classes is set at the right end. The batch size increases from B = 10 to B = 1500. We also plot the B-times-larger curve of the EC, which we simply label as EC * B. We can see that although the BTN method takes more memory than the EC method, the memory complexity is much less than O(B^2), which makes it a viable option for machine learning. Also, for other problems which require small batches but larger auxiliary bond dimensions, the BTN performs better than the EC method, as shown in Fig. 10 for a B = 10, k = 3, D = 1, O = 10 and V = 10 → 3000 MPS learning system. Contracting the auxiliary bonds first and keeping comparably smaller intermediates is a better choice when B << V. This is why the BTN method can allocate much less memory than the EC method for large auxiliary dimensions in Fig. 10.

We also demonstrate a practical MPS machine learning task in Fig. 11 and Fig. 12. The dataset used here is the MNIST handwritten digit database [28]. The task requires precisely predicting the number from 0 to 9 for each sample in the dataset. The core tensor network is a ring MPS consisting of 24 × 24 units, each of shape (V, k, V) except the last one, which is (C, V, k, V). Here, V = 20 is the bond dimension, k = 2 is the physical bond dimension, and C = 10 is the number of classes. The bond dimension of the input tensor network is D = 1. We use the ring MPS as it is easier to train than the open boundary MPS. Fig. 12 shows the validation accuracy vs. time. The validation accuracy is measured after each epoch, and we only plot 5 epochs since the model overfits later. Each epoch sweeps the whole dataset once. The best validation accuracy for this setup is around 92%, while the state-of-the-art MPS machine learning result [16] can reach 99% with a large bond dimension [29].
The result is tested on an Intel(R) Core(TM) i7-8700K CPU and a 1080Ti GPU and is shown in Fig. 11. The optimizer used here is Adadelta [30] with learning rate lr = 0.001. The batch size is 100. The random seed is 1. The test environment is PyTorch. The x-axis in Fig. 11 is the time cost. Notice that these three methods are only different approaches to realizing the batch contraction, so they do not influence the contraction result; that is why the three curves in Fig. 11 are exactly the same. The EC method is the best choice for tensor network machine learning. The LP method is the most inefficient: it costs one hour to traverse the dataset, while the EC only needs 60 seconds and the BTN takes 300 seconds.

IV. CONCLUSIONS
We have investigated the stack realization of tensor networks as an extension of the stack operation for tensors. The resulting stacked/batched tensor network consists of block-diagonal local tensors with larger bonds. We connected this method to the efficient coding batch contraction technique, which is widely used in tensor network machine learning. An MPS machine learning task has been used as an example to validate its function and benchmark the performance for different batch sizes, numerical libraries, and chips. Our algorithm provides an alternative way of realizing fast tensor network machine learning on the GPU and other advanced chips. All the code used in this work is available at [31].
Several possible future directions may be of significant interest. Our work reveals an intrinsic connection between block-diagonal tensor network units and the batched contraction scheme, which is potentially useful for helping researchers realize faster tensor network implementations with block-diagonal designs like [17]. Secondly, our algorithm comes without performing SVD on the stacked tensor network; it would be great to define a batch SVD and check its performance and error when truncating the singular values for larger data sizes. Thirdly, the efficient coding method requires contracting the physical dimensions first. However, the batch method provides the possibility to start the contraction along other dimensions first, which may be useful for the contraction of a two-dimensional PEPS or for the tensor renormalization group method [32][33][34]. Finally, the BTN method takes advantage of compressing thousands of contractions into one contraction graph. From the hardware design perspective, with a specially designed platform optimized for tensor network contraction, the BTN method may intrinsically accelerate tensor network machine learning or any task involving a stacked tensor network. Meanwhile, the tensor network states in many-body quantum physics have a non-trivial relationship with quantum circuits, so the idea of representing several tensor contractions as one single contraction also provides the possibility of efficient computation on NISQ-era quantum computers.

APPENDIX A

Note that for the first site σ_0, the resulting matrix M^{σ_0} is stacked in row geometry [see the first site of the resulting stacked MPS in Fig. 4], while all the other matrices M^{σ_j} (j = 1, . . . , n − 1) are block diagonal [Fig. 4], as the stack operation is essentially equivalent to a direct sum of local matrices. In the final step of the above equation, to maintain the form of the stacked MPS as shown in Eq. (A3), we have absorbed ∆^{σ_n} into M^{σ_{n−1}} and reformulated it as a tensor that is still block diagonal, but with each block of size D × 1.
The above process is the stack operation for two MPS, and it is easy to extend it to the stack operation of n MPS. For simplicity, we assign each MPS the same auxiliary dimension D for every bond. This is not necessary, as they can be different; one only needs to change the diagonal block sizes from (D, D) to (D_i, D_j).