Number-State Preserving Tensor Networks as Classifiers for Supervised Learning

We propose a restricted class of tensor network state, built from number-state preserving tensors, for supervised learning tasks. This class of tensor network is argued to be a natural choice for classifiers as (i) they map classical data to classical data, and thus preserve the interpretability of data under tensor transformations, (ii) they can be efficiently trained to maximize their scalar product against classical data sets, and (iii) they seem to be as powerful as generic (unrestricted) tensor networks in this task. Our proposal is demonstrated using a variety of benchmark classification problems, where number-state preserving versions of commonly used networks (including MPS, TTN and MERA) are trained as effective classifiers. This work opens the path for powerful tensor network methods such as MERA, which were previously computationally intractable as classifiers, to be employed for difficult tasks such as image recognition.


I. INTRODUCTION
Ideas and methods from the field of machine learning are currently having a significant impact in many areas of physics research [1]. Machine learning offers powerful new tools for classifying phases of matter [2-7], for processing experimental results [8,9], and for modeling quantum many-body systems [10-12], to name but a few of the plethora of applications. With this crossing of fields has come the intriguing realization that the neural networks [13,14] used in machine learning share extensive similarities with the tensor networks [15] used in modeling quantum many-body systems [16]. These connections are perhaps not so surprising since both types of network have the primary function of encoding large sets of correlated data: neural networks encode ensembles of training data, while tensor networks encode superpositions of quantum states. Currently there is great interest in exploring the potential applications of this relation, both from the directions of (i) using ideas from neural networks and machine learning to improve methods for modeling quantum wavefunctions [17-20] and (ii) examining tensor networks as a new approach for tasks in machine learning [21-31].
In this manuscript we focus on the second direction (ii), and explore the use of tensor networks as classifiers for supervised learning problems. Research in this area has already produced encouraging early results, with examples where tensor networks have been trained to produce relatively competitive classifiers in both supervised and unsupervised learning tasks [21,25-27,30,31]. However, there are some significant issues with respect to the use of tensor networks as classifiers. One such issue is that of interpretability. Usually, when applying a tensor network as a classifier, each sample from the (classical) dataset is associated to a product state. However, under generic tensor transformations, product states can be mapped to entangled quantum states, which can no longer be re-interpreted classically. One can understand this as a problem of generic tensor networks being overly broad when used as classifiers: they are designed to carry information about phases and/or signs between superposition states, which are necessary for describing wavefunctions but seem to be extraneous from the perspective of characterizing classical datasets. A second issue is that of computational efficiency. Most previous studies have utilized only relatively simple classes of tensor networks, such as matrix product states [32,33] (MPS) and tree tensor networks [34,35] (TTN), as classifiers. The more formidable weapons in the arsenal of tensor networks, such as the multi-scale entanglement renormalization ansatz [36-39] (MERA), which are seen as the direct analogues to the highly successful convolutional neural networks [40-42], have yet to be deployed in earnest for challenging problems. The primary reason is that, in order for a tensor network to be of use as a classifier, one needs to be able to compute scalar products between the network and product states (representing the training data); this can be done efficiently for simple networks such as MPS and TTN, but is generally computationally intractable for more sophisticated networks like MERA.
The main motivation for this manuscript is to help resolve the two issues discussed above. In particular, we propose to use networks built from a restricted class of tensor, those which act to preserve number-states, as classifiers for supervised learning tasks. Such number-state preserving networks automatically resolve the issue of interpretability, provided that each sample of the training data is encoded as a number state. Moreover, the restriction to number-state preserving tensors endows networks with a causal cone structure when contracted against number states, similar to the causal cone structure present in isometric networks when contracted against themselves. This property allows for a broad class of number-state preserving networks, including versions of MERA, to be efficiently trained as classifiers for supervised learning problems. Furthermore, we demonstrate numerically that networks built from this restricted class of number-state preserving tensor perform well for several example classification problems. The above considerations indicate that number-state preserving tensors are a natural restriction to impose when applying tensor methods to learn from sets of classical data.
This manuscript is organized as follows. Firstly in Sect. II, we characterize number-state preserving tensors and some of their properties, then in Sect. III we formulate how problems in supervised learning can be approached using tensor networks. In Sect. IV we propose an algorithm for training number-state preserving tensor networks to correctly classify a labeled dataset, while in Sect. V we describe how single tensor environments can be efficiently evaluated, a key ingredient in the proposed training algorithm. Benchmark numerical results for number-state preserving versions of MPS, TTN and MERA applied to example classification problems are presented in Sect. VI, and conclusions are presented in Sect. VII.

FIG. 1. (a) An example of a number-state preserving tensor w that maps a number state ⟨z_0|⟨z_1| on its input indices to a number state ⟨z| on its output index. The tensor w can be equivalently represented as (ii) an explicit mapping between number states or (iii) as a matrix (after forming the product of input indices). (b) An example of a number-state preserving tensor u between two input and two output indices. (c) An example of a number-state preserving tensor v between one input and two output indices. Note that the three examples of number-state preserving tensors from (a-c) are also unital, in that all of their non-zero entries are the unit element.

II. NUMBER-STATE PRESERVING NETWORKS
Let L be a lattice of sites, with each site described by a local Hilbert space of some dimension d. We label the basis states for each site by integers, |z⟩ ∈ {|0⟩, |1⟩, . . ., |d − 1⟩}, which are interpreted as particle number and are represented as unit vectors,

$$|0\rangle = (1, 0, \ldots, 0)^T, \quad |1\rangle = (0, 1, \ldots, 0)^T, \quad \ldots, \quad |d-1\rangle = (0, 0, \ldots, 1)^T. \qquad (1)$$

A number state |Z_L⟩ (or, equivalently, a Fock state) on lattice L is a product state with well-defined particle number,

$$|Z_L\rangle = |z^1\rangle \otimes |z^2\rangle \otimes |z^3\rangle \otimes \cdots,$$

where superscripts are here used to denote lattice position. Alternatively, if one is thinking in terms of spin degrees of freedom, a number state can be defined as a product state with a well-defined z-component of spin.
We now turn our considerations to transformations of number-states implemented by certain types of oriented tensor: these are tensors where each index has been fixed as either incoming or outgoing. Any oriented tensor can be interpreted as a mapping between states defined on an input lattice L, whose sites match the incoming tensor indices, to states on an output lattice L′, whose sites match the outgoing tensor indices. We define an oriented tensor as number-state preserving if it maps any number state defined on L to another number state on L′. Several examples of number-state preserving tensors are given in Fig. 1. Let u^{kl}_{ij} be a four index tensor, with subscripts denoting incoming indices and superscripts denoting outgoing indices, as depicted in Fig. 1(b). Consider the reshape of u into an input-output matrix, i.e. $u^{kl}_{ij} \to u_{(i \otimes j),(k \otimes l)}$, where the rows of the matrix enumerate over the tensor product (i ⊗ j) of incoming indices and columns enumerate over the tensor product of the outgoing indices (k ⊗ l). It is easily understood that the property of u being number-state preserving is equivalent to the property that each row of the corresponding input-output matrix must have at most a single non-zero entry. Note that we also include in the definition of number-state preserving tensors those where the input-output matrix has rows with only zero entries; equivalently these are tensors which can map some number states to the null (or norm-zero) state. An important property of number-state preserving tensors is that networks formed from their composition, where outputs from one tensor are properly matched with inputs to other tensors, are also number-state preserving, as depicted in Fig. 2(a). This allows us to form number-state preserving versions of commonly used tensor networks, such as MERA, as shown in Fig. 2(b). However, it is vital to realize that number-state preserving tensors do not necessarily remain number-state preserving if the orientation of their indices is reversed (i.e. the incoming and outgoing indices are switched); thus number-state preserving networks can still generate interesting superpositions and entangled states when 'run' in reverse.
For the main text of this paper we shall further restrict our consideration to unital number-state preserving tensors, where each tensor entry must be either a zero or a one, and each row of the corresponding input-output matrix is required to have a single non-zero entry. Note that this class of tensor maps incoming number-states to outgoing number-states of the same normalization and phase. The restriction to unital tensors will be useful in simplifying their application to supervised learning problems, although the formalism and optimization algorithms that we present are still general for all number-state preserving networks. There are many reasons why one may also wish to consider networks comprised of non-unital number-state preserving tensors, where entries can take any real or complex value, and thus change the normalization of states and introduce phases; the interested reader is directed to Sect. A of the Appendix for further discussion.
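The row-wise characterization above can be checked mechanically. The following is a minimal sketch, assuming tensors are stored as NumPy input-output matrices (rows enumerate input number states, columns enumerate output number states); the helper names and the example tensor `w`, which outputs the XOR of two binary inputs, are illustrative choices and not taken from the paper.

```python
import numpy as np

def is_number_state_preserving(m):
    """True if every row of the input-output matrix has at most one non-zero entry."""
    m = np.asarray(m)
    return bool(np.all(np.count_nonzero(m, axis=1) <= 1))

def is_unital(m):
    """True if all entries are 0 or 1 and every row has exactly one non-zero entry."""
    m = np.asarray(m)
    entries_ok = np.all((m == 0) | (m == 1))
    return bool(entries_ok and np.all(np.count_nonzero(m, axis=1) == 1))

# Hypothetical tensor w with two binary inputs and one binary output.
# Rows are the input number states 00, 01, 10, 11; the output is the XOR.
w = np.array([[1, 0],
              [0, 1],
              [0, 1],
              [1, 0]])

forward_ok = is_unital(w)                        # preserving in the forward direction
reversed_ok = is_number_state_preserving(w.T)    # orientation reversed

# Composition: two copies of w acting on four sites, followed by a third copy.
composed = np.kron(w, w) @ w                     # 16 x 2 input-output matrix
composed_ok = is_unital(composed)

print(forward_ok, reversed_ok, composed_ok)
```

In this example the reversed matrix `w.T` has two non-zero entries per row, so it maps a number state to a superposition, which illustrates why orientation matters.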
Given that number-state preserving networks represent a severely restricted class of tensor network states it may be interesting to consider how much of their power has been lost, for instance, in describing ground states of quantum many-body systems. Although this remains to be explored, it seems likely that the majority of many-body systems will not have ground states that can be well-approximated by number-state preserving tensor networks. However, there do exist several examples of non-trivial quantum many-body systems related to Motzkin paths [43], whose ground states possess interesting entanglement and yet can be exactly represented by number-state preserving networks [44,45]. Investigation of the ability of number-state preserving networks to describe general quantum ground states remains an intriguing direction for future research.

III. SUPERVISED LEARNING IN A TENSOR PRODUCT SPACE
In this section we discuss how the task of supervised learning can be formulated in terms of tensor networks. We consider problems where each training sample Z is represented as a length N vector, with the i-th component z_i an element of Z_d (the set of integers modulo d), i.e. such that

$$Z_k = [z_1, z_2, \ldots, z_N]_k, \qquad z_i \in \mathbb{Z}_d,$$

where k is a label over the set of training samples. Every training sample is assumed to be paired with a corresponding label y ∈ Z_c, where c represents the number of distinct categories for the classification problem. The goal of the supervised learning problem is to construct a function f that maps each sample of the training set to its correct label,

$$f(Z_k) = y_k. \qquad (4)$$

Although classifiers based on linear functions f have some considerable utility [46], many non-trivial classification problems require non-linear functions f in order to achieve good accuracy. We now describe how a tensor network can be implemented as the classifying function in Eq. 4. At this point, one could be tempted to believe that tensor networks would have limited utility as classifiers as, given that tensors are simply extensions of matrices to higher dimensions, they are inherently linear constructs. However, in order to recast the supervised learning problem into a problem amenable to tensor networks, we first (non-linearly) embed the training data into a higher dimensional space, similar to a kernel method [47]. By using an appropriate non-linear embedding, a linear classifier acting in the higher dimensional space can reproduce the classifying power of non-linear functions in the original space; thus it remains possible that tensor network approaches could be competitive with classifiers based on (non-linear) neural networks. Indeed, as will be argued later in this manuscript, it can be understood that a tensor network of sufficiently large bond dimension χ can, in principle, obtain perfect accuracy for any training set of a supervised learning problem as formulated above.
Let us recast each training sample Z_k as a number state, denoted |Z_k⟩, defined in a vector space of total dimension d^N. Specifically, we associate each integer z ∈ Z_d with a number state |z⟩ in a d-dimensional Hilbert space, represented as per Eq. 1, such that the full state vector |Z_k⟩ is given as the tensor product of the single site states,

$$|Z_k\rangle = |z_1\rangle \otimes |z_2\rangle \otimes \cdots \otimes |z_N\rangle.$$

Similarly the data labels y_k are recast as number states |y_k⟩ in a c-dimensional space. The diagrammatic tensor notation for these states is presented in Fig. 3. Given this embedding of our training data, a classifier can be represented as a tensor network T that maps states

$$\langle Z^{\rm in}_k | \, T = \langle Z^{\rm out}_k |, \qquad (6)$$

see also Fig. 2(b) for an explicit example.
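As an illustration of this embedding, the sketch below builds the d^N-dimensional vector for a short sample by taking Kronecker products of unit vectors. The function name `number_state` is a hypothetical helper, and the explicit vector is only practical for small N; it is shown here to make the embedding concrete rather than as something one would store in practice.

```python
from functools import reduce

import numpy as np

def number_state(sample, d):
    """Embed a length-N integer sample (entries in 0..d-1) as a vector of dimension d**N."""
    unit_vectors = np.eye(d, dtype=int)          # row z is the unit vector |z>
    return reduce(np.kron, (unit_vectors[z] for z in sample))

Z = [1, 0, 2]                                    # a sample with N = 3 and d = 3
vec = number_state(Z, d=3)

# The single non-zero entry sits at the base-d integer formed by the sample.
index = int(np.argmax(vec))
print(vec.shape, index)
```

Because the embedded vector has exactly one non-zero entry, a number state can equally be stored as the integer tuple itself, which is the representation used when lifting states through number-state preserving tensors.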
FIG. 4 (partial caption). (c) ... the network for N_correct can be factorized into a product of the tensor with its environment Γ_u, formed from contracting the entirety of the network sans u. The environment Γ_u allows the optimal tensor u that maximizes N_correct (with the other tensors in T held fixed) to be identified.

In general, the accuracy of T as a classifier could be quantified by evaluating the scalar products of the output states with the label states, ⟨Z^out_k|y_k⟩, where a large scalar product would indicate good classification. However, in the particular case of unital number-state preserving networks T, the norm of states is preserved such that all scalar products ⟨Z^out_k|y_k⟩ either evaluate to unity (indicating correct classification of the data sample with label y_k) or to zero (indicating incorrect classification of the data sample). Thus, the number of correctly classified samples N_correct simply evaluates as the sum over all the scalar products,

$$N_{\rm correct} = \sum_k \langle Z^{\rm out}_k | y_k \rangle = \sum_k \langle Z^{\rm in}_k | \, T \, | y_k \rangle. \qquad (7)$$

The diagrammatic tensor notation for Eq. 7, in the particular case that T is a binary MERA, is presented in Fig. 4(b). It follows that we should use Eq. 7 as the cost function for training the tensor network T for the supervised learning problem: the tensors contained within T should be optimized so as to maximize N_correct. Methods for achieving this are discussed in the following section of this manuscript. Before moving on, we remark that the formalism we described (or similar formalisms considered previously [21,23,26-28,31]) for addressing supervised learning problems using tensor networks could, in principle, employ arbitrary tensor networks T as classifiers (not only those built from number-state preserving tensors). However, it is only for certain types of network, such as MPS and TTN, that scalar products of the form ⟨Z^in_k|T|y_k⟩ can be efficiently evaluated. The cost of (exactly) evaluating the overlap of a product state with a more sophisticated tensor network state, such as a MERA, typically does not scale efficiently with system size. Thus, one would expect that a general MERA network would only be computationally feasible as a classifier for problems with a small number of sites (or variables). In contrast, the output state ⟨Z^out_k| of Eq. 6 can be efficiently evaluated for any number-state preserving tensor network, with cost that scales only linearly in the number of tensors in T. Nonetheless, the result that a scalar product ⟨Z^in_k|T|y_k⟩ is efficient to evaluate does not in itself imply that the network T can be efficiently trained. In Sect. V we formulate additional requirements for network T that are sufficient to allow for efficient training.
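The linear cost of evaluating the output state can be seen in a small sketch. It assumes a binary tree of unital tensors, each stored as an integer lookup table `table[a, b]` giving the output number state for inputs a and b; the tree geometry, the XOR tables, the samples and the helper names are all illustrative assumptions rather than the networks used in the paper.

```python
import numpy as np

def lift_tree(sample, tables):
    """Lift a number state through a binary tree of unital tensors.

    tables[layer][a, b] is the output number state of that layer's tensor
    for input number states a and b. One table lookup is done per tensor.
    """
    state = list(sample)
    for table in tables:
        state = [int(table[state[i], state[i + 1]])
                 for i in range(0, len(state), 2)]
    return state[0]

def n_correct(samples, labels, tables):
    """Count samples whose lifted output state equals the label."""
    return sum(lift_tree(s, tables) == y for s, y in zip(samples, labels))

xor = np.array([[0, 1],
                [1, 0]])
tables = [xor, xor]                      # 4 sites -> 2 sites -> 1 output index

samples = [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 1), (1, 1, 1, 1)]
labels = [0, 1, 1, 1]                    # the last label is deliberately wrong

score = n_correct(samples, labels, tables)
print(score)
```

Each sample costs one lookup per tensor, so no vector of dimension d^N is ever formed.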

IV. SINGLE TENSOR UPDATES
In this section we propose a method to optimize the tensors of a network T to maximize the number N_correct of correctly identified training samples in a supervised learning problem, as formulated in Eq. 7. We follow the same strategy of single tensor updates developed in the context of optimizing MERA [48], where only a single tensor in the network is changed at any time while all other tensors in the network are held fixed. These single tensor updates can then be organized into 'sweeps', in which all tensors in the network are optimized in turn, and the sweeps iterated until the entire network is sufficiently converged.
Key to this optimization strategy is the notion of a tensor environment, which can be understood as the derivative of the network with respect to a single tensor. Specifically, given a network that evaluates to a scalar such as that from Fig. 4(b), the environment Γ_u of a tensor u results from contracting the entire network sans the particular tensor u under consideration. It follows that the number of correctly classified samples N_correct from Eq. 7 can always be expressed as the scalar product of a tensor u ∈ T with its environment Γ_u,

$$N_{\rm correct} = \sum_{ij} (\Gamma_u)_{ij} \, u_{ij}, \qquad (8)$$

where, for notational simplicity, we have recast u and Γ_u into input-output matrices, see Fig. 4(c). We relegate a description of the general method for computing environments Γ_u to Sect. V of the manuscript, and proceed here assuming Γ_u is already known.
Let us now turn to the problem of finding the optimal number-state preserving tensor u_opt, which maximizes the number of correctly identified samples N_correct of Eq. 8, given a known environment Γ_u. Here it is easy to see that u_opt can be built by simply identifying the location of the maximal element in each row of Γ_u and then placing the unit element at the corresponding location in each row of u_opt, with all other entries zero. Note that if the maximal element in a row of Γ_u is degenerate then u_opt is not uniquely defined; one can still obtain an optimal solution by simply selecting one of the maximal elements in that row of Γ_u. Let us consider a concrete example: imagine we are updating a tensor u with a 4 × 4 input-output matrix of the form given in Fig. 1(b-iii), and assume that the environment has been evaluated as in Eq. 10. Then the (unital and number-state preserving) 4 × 4 matrix u_opt that maximizes Eq. 8 has a unit entry at the maximal element of each row of Γ_u, and the number of correctly classified training samples after this optimal update is given by the sum of the row maxima, N_correct = (12 + 9 + 22 + 15) = 58.

Some remarks are in order regarding this optimization strategy. Firstly, we notice that unlike many commonly used algorithms for training neural networks, our approach is not based upon gradient descent. Instead we can directly 'hop' to the true maximum for any single tensor (given that the other tensors in the network are held fixed), provided the environment is exactly known. While this strategy has some advantages over gradient based methods with respect to avoiding local maxima, getting stuck in a solution that is not globally optimal can still remain a possibility depending on the problem under consideration. We now discuss methods to introduce some randomness into the optimization, in order to reduce the possibility of getting trapped in a local maximum. One approach could be to employ a similar strategy as used in stochastic gradient descent methods [49], where randomness is introduced by using only a select 'batch' of training samples for each update. Instead, here we advocate a different strategy inspired by the Monte Carlo methods [50] used in sampling many-body systems. Rather than updating to the optimal tensor u_opt at each step, we propose to allow updates to sub-optimal solutions of Eq. 8, with a probability that diminishes exponentially in relation to how far the solution is from the optimal solution. For this purpose we first introduce the difference matrix Ω, given by subtracting from each row of Γ the maximal element within the row,

$$\Omega_{ij} = \Gamma_{ij} - \max_{j'} \Gamma_{ij'}.$$

For the example environment Γ_u given in Eq. 10 the corresponding difference matrix is given in Eq. 13. We then use the difference matrix to generate a matrix p_trans of transition probabilities, defined element-wise as

$$(p_{\rm trans})_{ij} = \frac{\exp(\Omega_{ij}/\alpha)}{\sum_{j'} \exp(\Omega_{ij'}/\alpha)}, \qquad (14)$$

where α is a tunable parameter that sets the amount of randomness. For the example difference matrix Ω of Eq. 13 and setting α = 2 we get the transition matrix

$$p_{\rm trans} = \begin{pmatrix} 0.21 & 0.58 & 0.13 & 0.08 \\ 0.10 & 0.16 & 0.72 & 0.02 \\ 0.35 & 0.08 & 0.00 & 0.57 \\ 0.10 & 0.45 & 0.17 & 0.28 \end{pmatrix}.$$

The transition matrix is then used to perform a stochastic update of the tensor u under consideration: values in each row of p_trans set the probability for the unit element in the equivalent row of the updated u to be placed at that particular location (note that Eq. 14 has been defined such that each row of p_trans sums to unit probability). Notice that in the limit α → 0 the matrix p_trans tends to u_opt (provided Γ had no degeneracies in its maximal row values), since all non-optimal transitions are fully suppressed. Conversely, in the limit α → ∞ all probabilities in p_trans tend to the same value, representing completely random transition probabilities.
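The deterministic and stochastic updates described above can be sketched in a few lines. The environment `gamma` below is hypothetical: its entries were chosen so that the row maxima are 12, 9, 22 and 15, as in the worked example, and so that α = 2 gives probabilities close to the quoted transition matrix; it should not be read as the paper's Eq. 10. The exponential form used for the transition probabilities is likewise an assumption, chosen to be consistent with the stated α → 0 and α → ∞ limits and with unit row sums.

```python
import numpy as np

def optimal_update(gamma):
    """Place a unit entry at the (first) maximal element of each row of gamma."""
    gamma = np.asarray(gamma)
    u = np.zeros(gamma.shape, dtype=int)
    u[np.arange(gamma.shape[0]), gamma.argmax(axis=1)] = 1
    return u

def transition_probabilities(gamma, alpha):
    """Row-wise probabilities exp(omega / alpha), normalized per row (alpha > 0)."""
    gamma = np.asarray(gamma, dtype=float)
    omega = gamma - gamma.max(axis=1, keepdims=True)   # difference matrix
    weights = np.exp(omega / alpha)
    return weights / weights.sum(axis=1, keepdims=True)

def stochastic_update(gamma, alpha, rng):
    """Sample the position of the unit entry in each row from the transition matrix."""
    p = transition_probabilities(gamma, alpha)
    u = np.zeros(p.shape, dtype=int)
    for row, probs in enumerate(p):
        u[row, rng.choice(len(probs), p=probs)] = 1
    return u

# Hypothetical environment (see the caveats above).
gamma = np.array([[10, 12,  9,  8],
                  [ 5,  6,  9,  2],
                  [21, 18,  0, 22],
                  [12, 15, 13, 14]])

u_opt = optimal_update(gamma)
n_correct_opt = int((u_opt * gamma).sum())      # sum of the row maxima
p_trans = transition_probabilities(gamma, alpha=2.0)
u_random = stochastic_update(gamma, alpha=2.0, rng=np.random.default_rng(0))

print(n_correct_opt)
print(np.round(p_trans, 2))
```

The deterministic limit α → 0 is handled by calling `optimal_update` directly, since dividing by α = 0 is not defined in `transition_probabilities`.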

V. EVALUATION OF TENSOR ENVIRONMENTS
Here we describe the evaluation of tensor environments, crucial to the optimization algorithm discussed in the previous section. For simplicity, we describe this evaluation assuming the tensor network T under consideration is a binary MERA, although the same methodology can be employed for arbitrary (number-state preserving) tensor networks.
Rather than tackling the problem of computing tensor environments Γ directly, we first introduce the concept of configuration spaces |φ⟩. Proper use of configuration spaces |φ⟩, which play an analogous role to the local reduced density matrices ρ used to optimize tensor networks in the context of quantum many-body systems, will greatly simplify the subsequent evaluation of environments. Let us assume that the output index of the tensor network T under consideration has been fixed in some specified label state |y⟩, and that the lattice on which it is defined has been partitioned into a region A and its complement B. Then, given a number state |Z_A⟩ on region A, we define the configuration space |φ_B⟩ as

$$|\phi_B\rangle = \sum_\sigma |Z_B^\sigma\rangle,$$

where the sum runs over all valid configurations σ of number states |Z_B^σ⟩ defined on region B such that the combined number state ⟨Z_A Z_B^σ| is classified by T into the correct category |y⟩, i.e. such that

$$\langle Z_A Z_B^\sigma | \, T \, | y \rangle = 1.$$

An example of a network that could be contracted to evaluate a configuration space |φ_B⟩ is depicted in Fig. 5(a). It is seen that this network can be simplified, as shown in Fig. 5(b), by lifting the input number state |Z_A⟩ through tensors in T wherever possible (i.e. wherever a tensor has a number state available on all of its incoming indices), using the number-state preserving tensor properties as outlined in Fig. 1. It is convenient to define the configuration causal cone C(B) associated to region B as the set of tensors remaining in the network T after this simplification; equivalently C(B) can be defined as the set of tensors C ∈ T whose output state can be affected by the choice of input state on region B.
Notice that this configuration causal cone C(B) is precisely equivalent to the (standard) causal cone [36,51] that would emerge from an isometric MERA for the same region B, defined as the set of tensors that can affect the local reduced density matrix ρ_B. However, the origins of these causal cones are drastically different: the causal cones in isometric MERA arise due to the isometric constraints imposed on tensors, whereas the number-state preserving tensors proposed in this manuscript are not required to be isometric. Similarly, configuration causal cones arise only in networks that preserve number states, and are thus ill-defined for generic MERA. [Note that it is, however, possible to have networks with tensors that are both simultaneously isometric and number-state preserving, see Sect. A of the Appendix for further discussion]. Despite the difference in the origins of these two forms of causal cone, it is not a fluke that they were exactly equivalent in the previous example. It can be understood that the configuration causal cones in any number-state preserving tensor network are always equivalent to the causal cones found in an isometric tensor network of the same geometry, provided that the index orientations (specifying incoming and outgoing indices) match between the networks. Given this equivalence, we will henceforth drop the distinction between the two definitions, such that the term 'causal cone' can refer to either definition.

FIG. 6 (partial caption). ... can be evaluated using a composition of the left/right lowering operators.
The process of evaluating the configuration space for a region B of three sites from a binary MERA is depicted in Fig. 6(a). This evaluation can be formulated as a sequence of contractions that each 'lower' the configuration space through the causal cone,

$$|\phi_{[\tau]}\rangle \to |\phi_{[\tau-1]}\rangle \to \cdots \to |\phi_{[0]}\rangle,$$

where bracketed subscripts denote configuration spaces at different depths within the network. Each of the lowering contractions is implemented by one of two geometrically different lowering operators, depicted in Fig. 6(b), which are the direct analogues to the descending superoperators [48] used in the evaluation of density matrices from isometric MERA.
In our example using a binary MERA, the cost of evaluating |φ_B⟩ for a region B of three contiguous sites scales at most linearly with the network depth, since the form of the lowering operators is self-similar at all depths. In a general number-state preserving network the computational cost of evaluating configuration spaces will be related to the causal structure of the network: the leading order cost will scale exponentially with the maximum width of the causal cones. Thus it is apparent that not all number-state preserving tensor networks can be efficiently evaluated for local information (characterized by the configuration space |φ_B⟩); only those for which the maximum causal width is not too large. However, since MERA are precisely designed to have bounded causal width (i.e. the causal width never spreads beyond some small number of sites), it follows that number-state preserving versions of MERA networks fall within the class of networks that can be efficiently evaluated.
Given that the evaluation of configuration spaces has been understood, we now turn to the task of building the environment Γ_u associated to tensor u, as depicted in Fig. 7, which is accomplished as follows. First we lift the initial number state |Z^in_k⟩ to a new number state |Z̃_k⟩ that lives on the boundary of the causal cone C(u) associated to tensor u, as depicted in Fig. 7(a). Then we compute the configuration space ⟨φ^u_k| defined on the output indices of tensor u, as depicted in Fig. 7(b). Then the environment Γ_u is given by taking the outer product of the configuration space ⟨φ^u_k| with the piece of the state |Z̃_k⟩ supported on the input indices of u, denoted |Z̃^u_k⟩, while summing over all training samples k,

$$\Gamma_u = \sum_k |\tilde{Z}^u_k\rangle \langle \phi^u_k|,$$

see also Fig. 7(c).
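A minimal sketch of this construction is given below for the simplest possible case: the top tensor of a two-layer binary tree on four sites. In that special case the configuration space on the output index of the top tensor is just the label state, so the environment reduces to a table of counts indexed by (lifted input state, label). The function name, the tree geometry and the XOR lower-layer tensors are illustrative assumptions; a general tensor inside a MERA would additionally require the lowering contractions of Fig. 6.

```python
import itertools

import numpy as np

def top_environment(samples, labels, lower, chi, c):
    """Environment of the top tensor of a two-layer binary tree on 4 sites.

    lower[a, b] is the lookup table of the (shared) lower-layer unital tensor.
    Rows of the result enumerate the lifted input state (a, b) of the top
    tensor, columns enumerate the label y.
    """
    gamma = np.zeros((chi * chi, c), dtype=int)
    for s, y in zip(samples, labels):
        a = int(lower[s[0], s[1]])      # lift the left pair of sites
        b = int(lower[s[2], s[3]])      # lift the right pair of sites
        gamma[a * chi + b, y] += 1      # outer product of lifted state and label
    return gamma

xor = np.array([[0, 1],
                [1, 0]])
samples = list(itertools.product((0, 1), repeat=4))
labels = [sum(s) % 2 for s in samples]  # parity labels for all 16 strings

gamma_top = top_environment(samples, labels, xor, chi=2, c=2)
best_n_correct = int(gamma_top.max(axis=1).sum())

print(gamma_top)
print(best_n_correct)
```

Selecting the maximal entry in each row of `gamma_top` recovers an XOR top tensor, so a single update of the top tensor classifies all 16 strings correctly in this toy setting.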

VI. BENCHMARK RESULTS
In this section we present benchmark results for how number-state preserving tensor networks perform as classifiers in some simple problems. The goal here is to establish the feasibility of our proposal, rather than to establish performance for challenging real-world tasks, which will be considered in future work. In particular we demonstrate (i) that the proposed optimization algorithms can efficiently and reliably train the networks under consideration, and (ii) that number-state preserving networks perform comparably well to unrestricted networks for classification tasks.

A. Parity classification
For this first test, we benchmark the performance of a number-state preserving MPS for classifying the parity of binary strings. Here each sample is a length-N binary vector Z_k = [0, 0, 1, 0, 1, . . .], which is labeled y_k ∈ {0, 1} according to its parity. The MPS that we use is depicted in Fig. 8, and is built from tensors that are number-state preserving only when acting from left to right. In this problem, we are free to choose the length N of the binary strings as well as the number n_samp of training samples to use (as these can be randomly generated). We also have two hyper-parameters associated to our method: the maximal bond dimension χ_max of the MPS and the parameter α from Eq. 14 that controls the amount of randomness in the optimization. For each set of parameters investigated we performed 100 trial runs, each run starting with a randomly generated training set and a randomly initialized MPS, and then performed no more than 100 optimization sweeps in each trial. The most computationally demanding trials (which consisted of: a length N = 20 chain, n_samp = 20000 training samples, a bond dimension of χ_max = 10, and 100 optimization sweeps) each took about 5 seconds to run on a single 3 GHz desktop CPU. At the end of each trial we also test the generalization error of the MPS classifier by evaluating its accuracy in classifying the parity of all possible 2^N binary strings.
A summary of the results from a large number of trials is presented in Tab. I. For binary strings of length N = 16 and N = 20 we used 1300 and 20000 training samples respectively; these numbers were chosen as they represent about 2% of all possible binary strings in each case (of which there are 2^N in total). The randomness parameter was fixed at α = 1 for N = 16 and α = 5 for N = 20 length chains; these values were determined as adequate through a small amount of experimentation (and are probably not those which would give optimal performance). Somewhat surprisingly, we found that each trial would produce only one of two outcomes: (i) the optimization would fail completely, achieving only slightly over 50% classification accuracy on the set of all binary strings, or (ii) it would converge to a perfect parity classifier, with 100% classification accuracy for all length-N binary strings. From Tab. I we see that the proportion n_perfect of perfect classifiers obtained increases dramatically as the bond dimension χ_max is increased, reaching 96/100 for N = 20 and χ_max = 10. This is expected, as networks with more degrees of freedom are less likely to be trapped in local optima. We found that the likelihood of obtaining a perfect classifier was also greatly improved when using a larger number of training samples, although we do not provide this data here. In a recent work by Stokes and Terilla [52] standard (unrestricted) MPS were also trained to classify the parity of binary strings, and produced comparable results for similar string lengths and training set sizes. This is a good indication that, for this classification problem, number-state preserving MPS are as powerful as unrestricted MPS.
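To make the notion of a perfect parity classifier concrete, the sketch below writes one down by hand rather than training it. It assumes a left-to-right number-state preserving MPS in which every tensor is the same lookup table `table[bond, z]`, mapping the incoming bond state and site state to the outgoing bond state; with bond dimension 2 and an XOR table the final bond state equals the parity. The helper name `run_mps` and the choice N = 10 are illustrative, and this construction is not the trained MPS reported in Tab. I.

```python
import itertools

import numpy as np

def run_mps(sample, table, start=0):
    """Lift a number state through a left-to-right MPS of identical unital tensors."""
    bond = start
    for z in sample:
        bond = int(table[bond, z])      # (bond state, site state) -> new bond state
    return bond

parity_table = np.array([[0, 1],
                         [1, 0]])       # new bond = bond XOR z

N = 10
accuracy = np.mean([run_mps(s, parity_table) == sum(s) % 2
                    for s in itertools.product((0, 1), repeat=N)])
print(accuracy)
```

The check runs over all 2^N strings, mirroring the generalization test described in the text, and shows that a perfect classifier exists within the restricted class already at bond dimension 2.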

B. Division-by-7 classification
For the second test we classify binary strings, interpreted as a base-2 representation of an integer, by their remainder under division by 7. We again use a number-state preserving MPS, employing the same set-up as used for the parity classification considered previously. A key difference here is that the samples now take one of seven different labels, y_k ∈ {0, 1, 2, 3, 4, 5, 6}.
A summary of the results from these trials is presented in Tab. I. For binary strings of length N = 16 and N = 20 we used 3000 and 30000 training samples respectively; although this was more than was used for the parity classification, it is still less than 5% of the possible binary strings. Similar to the parity benchmark, we here found that each trial would either fail completely, producing results no better than random guessing, or would converge to a perfect division classifier, with 100% classification accuracy for all length-N binary strings. As with the parity benchmark, it is seen that the proportion of perfect classifiers obtained increases steadily with the bond dimension χ_max. However, this problem required larger dimensions χ_max than were used for the parity benchmark, which is expected since here we have many more classification categories.
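To illustrate why a perfect classifier can exist for this task at modest bond dimension, the following sketch hand-builds a number-state preserving MPS with bond dimension 7. It is an illustrative construction under stated assumptions, not the trained network from the trials above: the names `make_mod_tensor` and `classify` are hypothetical, and the string is assumed to be read most-significant bit first.

```python
from itertools import product

def make_mod_tensor(modulus=7, base=2):
    """Hypothetical MPS tensor v, stored as a lookup table.

    Each pair (left bond state, site value) is sent to exactly one right
    bond state, so number states are mapped to number states.
    """
    return {(left, z): (base * left + z) % modulus
            for left in range(modulus) for z in range(base)}

def classify(bits, v):
    """Sweep left to right; the final bond state is the predicted label."""
    state = 0
    for z in bits:
        state = v[(state, z)]
    return state

v = make_mod_tensor()
N = 10  # short strings so that every input can be checked exhaustively
all_correct = all(
    classify(bits, v) == int("".join(map(str, bits)), 2) % 7
    for bits in product((0, 1), repeat=N)
)
print(all_correct)
```

The bond index simply carries the running remainder, which is one possible perfect solution; a variational optimization need not find this particular one.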

C. Height classification
The final test problem that we consider, which we refer to as height classification, takes length-N strings of integers from the set z ∈ {−1, 0, 1} and classifies them with labels y_k ∈ {0, 1, 2} depending on whether the sum (under regular addition) of the integers is positive, zero or negative, respectively. We test the effectiveness of both number-state preserving binary TTN and binary MERA as classifiers for this problem, working with strings of length N = 24. A binary MERA of the form depicted in Fig. 2(b) is used, and is compared with the binary TTN that would result from restricting to trivial disentanglers u throughout the MERA network. Given that the problem is translation-invariant, we imposed that all tensors within a network layer are identical. In terms of the optimization, this is achieved by updating using the average single-tensor environment from all equivalent tensors within a network layer. We found that the injection of randomness into the optimization was unnecessary, possibly due to the imposition of translational invariance, such that the randomness parameter α from Eq. 14 could be set at α = 0. This left the bond dimension of the networks as the only hyper-parameter in the calculation, which was fixed at maximum dimension χ_max = 9.
The benchmark results are displayed in Fig. 9, and consisted of 100 trials, each trial starting from 12000 randomly generated training samples (with 4000 samples from each label category) and a randomly initialized network. Rather than running separate TTN and MERA trials, they were instead combined: the first 20 sweeps were performed with trivial disentanglers u, such that the underlying network was a TTN; the u were then 'switched on' for the remaining 40 sweeps such that the network became a MERA. At the conclusion of each trial, the generalization error was estimated by applying the trained classifiers to a randomly generated test set of the same size as the training set. Most of the trials converged smoothly, with the proportion of wrongly identified testing samples decreasing monotonically with optimization, although about 5 trials failed to properly converge (yielding classifiers with greater than 30% error). Discarding the worst 10 trials from consideration, of the 90 remaining trials the TTN gave average training/test errors of 14.15% and 14.91%, while MERA gave substantially reduced average training/test errors of 1.13% and 1.86%. These results clearly demonstrate the extra representation power endowed through use of the disentanglers u in MERA. Notably, both networks generalized well, with only relatively small differences between test and training accuracies, despite being trained on less than 5×10^−6 percent of the 3^24 possible samples.
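As a rough illustration of the data set for this benchmark, the sketch below generates a label-balanced set of height-classification samples and evaluates the fraction of all strings covered by 12000 samples. The rejection-sampling procedure and the names `height_label` and `balanced_samples` are assumptions made for illustration; the text only specifies that each label category contributes the same number of samples.

```python
import random

def height_label(z):
    """Label 0, 1 or 2 for a positive, zero or negative sum, as in the text."""
    s = sum(z)
    return 0 if s > 0 else (1 if s == 0 else 2)

def balanced_samples(n_per_label, length=24, seed=0):
    """Rejection sampling (an assumed procedure): draw uniform strings until
    every label category holds n_per_label samples."""
    rng = random.Random(seed)
    buckets = {0: [], 1: [], 2: []}
    while any(len(b) < n_per_label for b in buckets.values()):
        z = tuple(rng.choice((-1, 0, 1)) for _ in range(length))
        y = height_label(z)
        if len(buckets[y]) < n_per_label:
            buckets[y].append(z)
    return [(z, y) for y, b in buckets.items() for z in b]

data = balanced_samples(n_per_label=40)    # the benchmark uses 4000 per label
fraction_percent = 100 * 12000 / 3**24     # share of all strings seen in training
print(len(data), f"{fraction_percent:.2e}")
```

Zero-sum strings are the rarest category under uniform sampling, so they determine how many draws the rejection step needs.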

VII. CONCLUSIONS
We have proposed the class of number-state preserving tensor networks for use as classifiers in supervised learning tasks and have shown that a large class of these networks, specifically those with bounded causal structure, are efficiently trainable for large problems. In particular we have described a training algorithm that, for any chosen tensor in the network under consideration, exactly identifies the optimal tensor for that location (i.e. that which maximizes the number of correctly classified training samples), all with a cost that scales only linearly in the number of training samples. Importantly, the class of efficiently trainable number-state preserving networks includes realizations of sophisticated networks such as MERA, which would otherwise be computationally intractable. As such, we believe this could be the first computationally viable proposal which would allow MERA, close tensor network analogues to convolutional neural networks, to be applied as classifiers for challenging tasks such as image recognition. This remains an interesting direction for future research.
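One way to picture the single-tensor update is sketched below, under the assumption that the environment of the chosen tensor has already been accumulated as a table of counts `env[i][j]`: the number of training samples that would be classified correctly if input number state i were mapped to output number state j. With that assumption, the best unital tensor follows from a row-wise maximum. The function name and the toy numbers are hypothetical, and the sketch omits the randomness parameter α.

```python
def optimal_unital_tensor(env):
    """For each input number state pick the output state with the largest count.

    env[i][j] is assumed to hold the number of training samples classified
    correctly when input state i is mapped to output state j.
    Returns the chosen map and the resulting number of correct samples.
    """
    mapping = [max(range(len(row)), key=row.__getitem__) for row in env]
    n_correct = sum(row[j] for row, j in zip(env, mapping))
    return mapping, n_correct

# Toy environment with 4 input states and 3 output states.
env = [
    [5, 0, 1],
    [0, 2, 2],
    [0, 0, 7],
    [3, 4, 0],
]
mapping, n_correct = optimal_unital_tensor(env)
print(mapping, n_correct)
```

Because each input state chooses its output independently, the row-wise maximum is the global optimum for that tensor when all other tensors are held fixed.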
Although number-state preserving tensors represent a highly restricted class of tensor, the preliminary results of Sect. VI are encouraging that this class is sufficient when applying tensor networks as classifiers for learning problems as outlined in Sect. III. It still remains to be seen whether number-state preserving tensor networks are as powerful as generic tensor networks for these tasks; this question requires further theoretical and numerical investigation. However, it is relatively easy to understand that, in the limit of large bond dimension, a number-state preserving tensor network could in principle achieve 100% accuracy on any training problem outlined in Sect. III. The reasoning follows similarly to the argument that a generic tensor network can represent an arbitrary quantum state in the limit of large bond dimension. Consider, for instance, the MERA depicted in Fig. 2(b). One could increase the bond dimension of indices within the network until the output index of each w tensor matches the product of its input dimensions, in which case each w could be fixed as a trivial identity tensor when viewed as an input-output matrix. In this scenario, the top tensor w_top could implement an arbitrary classifier that would perfectly map every training sample to its designated label, regardless of the training data given.
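The limiting argument can be made concrete with a toy sketch in which every w is an injective re-indexing of its two inputs and the top tensor is a lookup table fitted to the training labels. The helper names `identity_w`, `lift` and `fit_top_tensor` are hypothetical, disentanglers are taken to be trivial, and the sample length is assumed to be a power of two.

```python
def identity_w(left, right):
    """Trivial w: fuse two (value, dimension) indices into a single index
    whose dimension is the product of the input dimensions."""
    (zl, dl), (zr, dr) = left, right
    return (zl * dr + zr, dl * dr)

def lift(sample, d):
    """Apply layers of identity_w in a binary tree until one index remains."""
    layer = [(z, d) for z in sample]
    while len(layer) > 1:
        layer = [identity_w(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def fit_top_tensor(samples, labels, d):
    """Top tensor as a lookup table from the fused index to the label."""
    return {lift(z, d)[0]: y for z, y in zip(samples, labels)}

# Arbitrary labels on 8-site binary strings.
samples = [(0, 1, 1, 0, 1, 0, 0, 1), (1, 1, 1, 1, 0, 0, 0, 0),
           (0, 0, 0, 0, 0, 0, 0, 1), (1, 0, 1, 0, 1, 0, 1, 0)]
labels = [2, 0, 1, 2]
w_top = fit_top_tensor(samples, labels, d=2)
accuracy = sum(w_top[lift(z, 2)[0]] == y for z, y in zip(samples, labels)) / len(samples)
print(accuracy)
```

The fused index at the top has dimension d^N, which is why this construction only serves as an in-principle argument rather than a practical classifier.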
A major difficulty with the use of MERA in D = 2 or higher spatial dimensions 37,38 is their high scaling of computational cost with bond dimension χ. However, there is reason to be more optimistic for their application as classifiers. The cost of contracting an isometric MERA for a density matrix, necessary for its optimization towards the ground state of a local Hamiltonian, is related to the size of the maximum causal width of the network. For instance, the most efficient known 2D isometric MERA 38 has a causal width of 2 × 2 sites, such that the density matrices within the causal cone have 8 indices. The cost of computing these density matrices can be shown to scale at most as O(χ^16). However, while a number-state preserving version of this 2D MERA would also have a causal width of 2 × 2 sites, the relevant configuration space |ψ⟩ within the causal cone would only have 4 indices (which follows as the density matrix involves both the bra and the ket state, whereas the configuration space only involves the ket). Thus the cost of optimizing a number-state preserving version of this 2D MERA, where the key step is the evaluation of configuration spaces, will scale roughly as O(χ^8) (i.e. the square root of the cost of optimizing an isometric MERA for a quantum ground state). This square-root reduction in cost scaling as a function of bond dimension χ from isometric to number-state preserving networks will hold in general, such that number-state preserving networks could realize much larger bond dimensions given a fixed computational budget. This advantage is somewhat mitigated by the fact that the cost of optimizing a number-state preserving network comes with a factor n_samp related to the size of the training set, which could be very large. However, it would also be straightforward to parallelize the evaluation of environments over the samples.
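The trade-off between the two scalings can be illustrated with a short calculation. The sketch below takes the cost models literally as χ^16 and n_samp χ^8, ignoring all prefactors, and asks which bond dimension fits within a fixed budget; the budget and training-set size are hypothetical numbers chosen only for illustration.

```python
def max_bond_dimension(budget, exponent, n_samp=1):
    """Largest integer chi with n_samp * chi**exponent <= budget (prefactors ignored)."""
    chi = 1
    while n_samp * (chi + 1) ** exponent <= budget:
        chi += 1
    return chi

budget = 10**16      # hypothetical number of elementary operations
n_samp = 10_000      # hypothetical training-set size

chi_isometric = max_bond_dimension(budget, exponent=16)                   # O(chi^16)
chi_number_state = max_bond_dimension(budget, exponent=8, n_samp=n_samp)  # O(n_samp chi^8)
print(chi_isometric, chi_number_state)
```

Even with the factor n_samp included, the lower exponent allows a noticeably larger bond dimension in this toy setting; the actual advantage depends on prefactors not modeled here.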
Although the main text of this manuscript focused on number-state preserving versions of MERA, many other forms of hierarchical network could also be useful as classifiers, as discussed further in Sect. B of the Appendix. In particular the network of Fig. 11, which does not have an isometric counterpart, seems to be the closest tensor network analogue to a convolutional neural network. Rather than disentanglers, this network uses δ-function tensors to effectively allow neighboring w tensors to 'read' from the same boundary sites, mirroring the overlap of feature maps arising in a convolution (and similar to the generalized networks recently proposed in Ref. 31). It would be interesting to compare the effectiveness of this structure versus a traditional MERA, which will be considered in future work.
The author thanks Miles Stoudenmire and John Terilla for useful discussions and comments. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958.

The numerical examples considered in the main text trained classifiers using tensor networks built from unital number-state preserving tensors, where all tensor elements are either zero or the unit element. Using unital tensors has the advantage that they preserve the norm of number states under transformation, see Fig. 10(b), simplifying the cost function for identifying the number of correctly classified training samples. However, there are good reasons why one might also want to consider non-unital tensors. A classifier built with unital tensors only gives a binary result {0, 1} for whether a test state belongs to a specified category. In practice it may be preferable to obtain a continuous parameter in the range p ∈ [0, 1] that indicates the likelihood of a test state belonging to the specified category, which could be achieved using number-state preserving tensors with arbitrary real entries. We now consider two potentially useful forms of non-unital tensors that are still number-state preserving.

A useful class of number-state preserving tensor to consider are those with unit 1-norm, as per the example of Fig. 10(c). These are tensors that, when expressed as an input-output matrix, have columns that sum to unity. This property implies that these tensors transform an equal superposition vector |I⟩ = [1, 1, 1, 1, ...]† on their input into an equal superposition vector on their output. It follows that a tensor network built from these will have a 1-norm of unity. The restriction to tensors of this normalization also has the advantage that it allows marginal probability distributions to be evaluated from number-state preserving networks, which may otherwise not be feasible. Assume that we wish to evaluate from a tensor network classifier the weighted set of permissible configurations for some region B while knowing nothing of the state on the complementary region A of the problem space, which we call the marginal distribution for region B. We can compute the marginal distribution by repeating the calculation from Sect. V for the configuration space for B, but instead setting the state |Z_A⟩ on the complement as the equal superposition |I⟩. This evaluation can be performed efficiently, since tensors with unit 1-norm map the superposition vector |I⟩ trivially to itself.
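The defining property can be checked on a small example. The tensor below is a hypothetical one, written as an input-output matrix `w[in][out]` with a single non-zero entry per row (so that number states map to weighted number states) and columns that sum to one; it is not the tensor of Fig. 10(c), and the helper `apply_row_vector` is introduced only for this sketch.

```python
def apply_row_vector(vec, mat):
    """Contract a vector with the input index of an input-output matrix mat[in][out]."""
    n_out = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(mat))) for j in range(n_out)]

# Hypothetical tensor with 4 input states and 2 output states.
w = [
    [0.25, 0.0],
    [0.75, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]

# The equal superposition on the input is sent to the equal superposition on the output.
ones_in = [1.0] * 4
ones_out = apply_row_vector(ones_in, w)

# A number state on the input is sent to a weighted number state on the output.
number_state = [0.0, 1.0, 0.0, 0.0]
image = apply_row_vector(number_state, w)
print(ones_out, image)
```

Because the equal superposition passes through such a tensor unchanged, any part of the network that sees only the superposition on region A needs no explicit contraction.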
In certain cases it is also possible to restrict tensors to be both simultaneously number-state preserving and isometric, see Fig. 10(d) for an example. This is only possible if the product of the incoming dimensions is greater than or equal to the product of the outgoing dimensions, which is necessary for the isometric character. A network built from these tensors would inherit both the efficient evaluation of reduced density matrices, characteristic to isometric networks, and the efficient evaluation of configuration spaces, characteristic to number-state preserving networks. In addition, restricting to isometric tensors ensures that the 2-norm of a network is unity.

number-state preserving networks may also be useful as classifiers, some of which fall outside of what is permissible with isometric networks. In this appendix we give a few examples of more general networks and discuss where they may be useful.
Consider the example MERA-like network depicted in Fig. 11. Unlike a traditional MERA this network does not use disentanglers, instead using δ-function tensors to effectively allow neighboring w tensors to 'read' from the same boundary sites, similar to the generalized networks recently proposed in Ref. 31. Notice that this construction is not compatible with imposing an isometric character on the tensors. This network seems to be a close analogue to a convolutional neural network, in that the δ-function tensors mimic the overlapping feature maps arising in a convolution. The cost of optimizing this network for a supervised learning problem is seen to be cheaper than that of the binary MERA considered in the main text, since the causal cones here only have a maximal width of two sites. Given this consideration, it will be interesting to see how the accuracy compares with binary MERA, which we leave for future work.
Another type of MERA-like network is depicted in Fig. 12(a), this time accomplishing disentangling using matrix product operators (MPOs) rather than a product of local tensors. In order for this network to have bounded causal width, and thus be compatible with efficient optimization, it is necessary that the u tensors are simultaneously number-state preserving with respect to two different orientations, as depicted in Fig. 12(b). If this criterion is satisfied, then the network will possess a causal width of only one site and thus will be extremely efficient to optimize. There is some evidence to suggest that disentangling using MPOs could be much more effective than disentangling using local operators as used in a standard MERA. In a recent work 44 , this form of number-state preserving tensor network with bond dimension χ = 4 was shown to exactly describe the ground state of the Motzkin spin chain 43 , which possesses a logarithmic scaling of entanglement entropy. In contrast it is known that a regular MERA network, with arbitrarily large but finite bond dimension, cannot provide an exact representation of this ground state.

FIG. 3. (a) The k-th training sample Z_k is given as a length-N vector of integers z_k (modulo some specified base d), and is accompanied by a label y_k. (b) The training sample Z_k can alternatively be expressed as a unit vector |Z_k⟩ in the tensor product space of dimension d^N formed from mapping each base-d integer to a number state |z_k⟩, see Eq. 1. (c) Diagrammatic tensor representation of the training sample |Z_k⟩.

FIG. 4. (a) The total number of correctly classified samples N_correct is given as the inner product of the labels |y_k⟩ against the network T applied to the training data ⟨Z^in_k|, summing over all training samples k. (b) Diagrammatic representation of the equation from (a), which evaluates to N_correct. (c) For any chosen tensor, such as the shaded tensor u in (b), the network for N_correct can be factorized into a product of the tensor with its environment Γ_u, formed from contracting the entirety of the network sans u. The environment Γ_u allows the optimal tensor u that maximizes N_correct (with the other tensors in T held fixed) to be identified.

FIG. 5. (a) The network T with fixed output label |y⟩ is applied to a number state ⟨Z_A| defined only on a sub-region A of the initial lattice, with the state on the complementary region B left open. (b) The input number state ⟨Z_A| is lifted through T as much as is possible by using the number-state mapping properties depicted in Fig. 1. The (configuration) causal cone C(B) associated to region B describes the remaining set of tensors C ∈ T after this lifting; this is equivalently the set of tensors whose output states can be affected by the choice of input state on region B.

FIG. 6. (a) Sequence of contractions used to evaluate the configuration space |φ_B[0]⟩ associated to region B, starting from the causal cone C(B) as depicted in Fig. 5(b). At each step in the evaluation the tensors in the shaded region are contracted into a single tensor. (b) For any region B of three contiguous sites on the initial lattice, the configuration space |φ_B[0]⟩ can be evaluated using a composition of the left/right lowering operators.

FIG. 7. The sequence of steps used to evaluate the environment Γ_u of the shaded tensor u. (a) The initial state ⟨Z^in_k| is transformed through the network to form a new number state on the boundary of the causal cone C(u) associated to u. (b) The configuration space |φ^u_k⟩, defined on the output indices of u, is computed through use of the left/right lowering operators, as in Fig. 6. (c) The environment Γ is formed by taking the outer product of the configuration space |φ_k⟩ with the state ⟨Z^u_k| defined on the input of u, summing over all training samples k, see also Eq. 19.

FIG. 8. (a) Tensor v is a number-state preserving tensor mapping from two indices to a single index. (b) An MPS network T is built from tensors v that preserve number states when mapping from left to right. The MPS is trained as a classifier by maximizing the scalar product Σ_k ⟨Z^in_k| T |y_k⟩.

FIG. 9. (left) Results of training TTN and MERA for the height classification problem, displaying how much of the training set is wrongly classified as a function of the number of optimization sweeps performed. The first 20 sweeps are performed while keeping trivial disentanglers u, such that the underlying network is a TTN, while the u are then 'switched on' for the remaining sweeps such that the network becomes a MERA. The figure displays results from 10 different trials, where each trial starts with a randomly generated training set and a randomly initialized network. (right) Average results of the training data from 100 trial runs (after discarding the 10 worst trials). Dashed lines show the average generalization error computed from applying the trained TTN and MERA to a randomly generated test set. For TTN we get average training/test errors of 14.15% and 14.91%, while for MERA we get average training/test errors of 1.13% and 1.86%.

FIG. 10. (a) Tensor w is assumed to be a number-state preserving tensor with a single input and a single output index. (b) An example of a unital tensor w, which preserves the norm of number states under transformations, i.e. such that ⟨z|ww†|z⟩ = 1 for any normalized number state |z⟩. (c) An example of a tensor w with unit 1-norm, which transforms an equal superposition vector |I⟩ = [1, 1, 1, 1, ...]† into another equal superposition vector. (d) An example of an isometric tensor w, which annihilates to the identity I under contraction with its conjugate tensor, w†w = I.

FIG. 11. (a) A hierarchical tensor network constructed to mimic a convolutional neural network (CNN). (b) The black circles represent the δ-function, which maps a number state into two copies of itself; thus two adjacent w tensors are able to effectively 'read' from the same lattice site. The causal cone C(B) of region B is shaded, which has a bounded width of two sites.

FIG. 12. (a) A hierarchical tensor network where the disentangling is accomplished via matrix product operators (MPOs) of four-index tensors u. (b) In order for the network to have a bounded causal width, each u must be simultaneously number-state preserving with respect to the two orientations pictured.

TABLE I. Summary of results for MPS applied to the parity classification (above) and division-by-7 classification (below). Parameters are as follows: N is the length of binary strings classified, n_samp is the number of samples in the training set, χ_max is the maximal MPS bond dimension, parameter α controls the randomness in the optimization as per Eq. 14, n_perfect is the proportion of trial runs that yielded perfect (100% accuracy) classifiers, n_sweeps is the average number of variational sweeps required to reach convergence.